Web document clustering using hyperlink structures

Authors

  • Xiaofeng He
  • Hongyuan Zha
  • Chris H. Q. Ding
  • Horst D. Simon
Abstract

With the exponential growth of information on the World Wide Web, there is great demand for developing efficient methods for effectively organizing the large amount of retrieved information. Document clustering plays an important role in information retrieval and taxonomy management for the Web. In this paper we examine three clustering methods: K-means, multi-level METIS, and the recently developed normalized-cut method, using a new approach of combining textual information, hyperlink structure and co-citation relations into a single similarity metric. We found the normalized-cut method with the new similarity metric is particularly effective, as demonstrated on three datasets of web query results. We also explore some theoretical connections between the normalized-cut method and the K-means method. © 2002 Elsevier Science B.V. All rights reserved.
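
The abstract does not state the formula for the combined similarity metric, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a simple weighted sum of text cosine similarity, hyperlink presence and normalized co-citation counts, and it approximates a two-way normalized cut with the usual spectral relaxation (the sign of the second eigenvector). The function names, the weights `w_text`, `w_link`, `w_cocite`, and the toy data are all hypothetical.

```python
import numpy as np

def combined_similarity(text_vecs, adjacency, w_text=0.5, w_link=0.3, w_cocite=0.2):
    """Blend text, hyperlink and co-citation evidence into one symmetric matrix.

    The weighted sum and its weights are assumptions made for illustration.
    """
    text_vecs = np.asarray(text_vecs, dtype=float)
    adjacency = np.asarray(adjacency, dtype=float)

    # Cosine similarity between term vectors (rows are documents).
    norms = np.linalg.norm(text_vecs, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    unit = text_vecs / norms
    text_sim = unit @ unit.T

    # Hyperlink structure: 1 if either page links to the other.
    link = ((adjacency + adjacency.T) > 0).astype(float)

    # Co-citation: number of pages that link to both i and j, scaled to [0, 1].
    cocite = adjacency.T @ adjacency
    np.fill_diagonal(cocite, 0.0)
    if cocite.max() > 0:
        cocite = cocite / cocite.max()

    weights = w_text * text_sim + w_link * link + w_cocite * cocite
    np.fill_diagonal(weights, 0.0)  # no self-similarity edges
    return weights

def normalized_cut_bipartition(weights):
    """Two-way split via the spectral relaxation of the normalized cut.

    Assumes every document has positive total similarity (no isolated nodes).
    """
    degree = weights.sum(axis=1)
    inv_sqrt = 1.0 / np.sqrt(degree)
    scaled = weights * inv_sqrt[:, None] * inv_sqrt[None, :]  # D^-1/2 W D^-1/2
    _, eigvecs = np.linalg.eigh(scaled)  # eigenvalues in ascending order
    # The second-largest eigenvector, rescaled by D^-1/2, is the relaxed indicator.
    indicator = inv_sqrt * eigvecs[:, -2]
    return (indicator > 0).astype(int)  # thresholding at 0 is a common heuristic

# Toy data: two groups of three pages with one cross-group link (2 -> 3).
text_vecs = [
    [3, 1, 0, 0, 1], [2, 2, 0, 0, 1], [1, 3, 0, 0, 1],
    [0, 0, 3, 1, 1], [0, 0, 2, 2, 1], [0, 0, 1, 3, 1],
]
adjacency = np.zeros((6, 6))
for src, dst in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    adjacency[src, dst] = 1.0  # adjacency[i, j] = 1 means page i links to page j

W = combined_similarity(text_vecs, adjacency)
labels = normalized_cut_bipartition(W)
print(labels)
```

For more than two clusters, one common option is to apply the bipartition recursively or to run K-means on several leading eigenvectors; which variant the paper uses is not stated in the abstract.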

Similar resources

Incorporating Hyperlink Analysis in Web Page Clustering

The size of the World Wide Web is growing rapidly and it has become a very important source of information that can be useful to various academic and commercial applications. However, because of the large number of documents online, it is becoming increasingly difficult to search for useful information on the Web. General-purpose Web search engines, such as Google and AltaVista, present search ...

Vision-Based Deep Web Data Extraction for Web Document Clustering

The design of web information extraction systems is becoming more complex and time-consuming. Detecting the data region is a significant problem for information extraction from web pages. In this paper, an approach to vision-based deep web data extraction is proposed for web document clustering. The proposed approach comprises two phases: 1) Vision-based web data extraction, and 2) web documen...

Using Fuzzy Logic Clustering Discover Semantic Similarity in Web Document

The complex and frequent interactions between terms in documents produce vague and ambiguous meanings. There are complicated associations within a single web document and in its links to others. Most of these approaches perform similarity and feature section methods. There is a need for clustering complex documents and producing meaningful documents. This paper's proposed methodology is capable of handles...

Performance Analysis of Vision-based Deep Web Data Extraction for Web Document Clustering

Web data extraction is a critical task that applies various scientific tools across a broad range of application domains. Extracting data from multiple web sites is becoming more obscure, and the design of web information extraction systems is becoming more complex and time-consuming. We also present in this paper various risks in web data extraction. Identifying data region from web is a...

An adaptive neural network approach to hypertext clustering

The WWW is an on-line hypertextual collection, and a more sophisticated algorithm for Web page clustering may have to be based on combined term-similarity and hyperlink-similarity measures. It has been observed that nearly all currently employed techniques for document classification on the Web make use of textual information only. In addition, most of these techniques are incapable of discover...

Journal:
  • Computational Statistics & Data Analysis

Volume 41, Issue

Pages -

Publication date: 2002