Efficient Cluster Representation in Similar Document Search

Authors

  • Shankaran Sitarama
  • Uma Mahadevan
  • Mani Abrol
Abstract

Similar document search is the problem of retrieving documents that resemble a given document. In this paper, we describe a cluster-based retrieval scheme that approximates the classic nearest neighbor search by identifying the clusters closest to the input document and restricting attention to those clusters only. Cluster signatures play an important role in the effectiveness of this approximation, since the inclusion of a cluster in the restricted search depends entirely on whether its signature matches the given document. We study three different representations of cluster signatures and their role in performing a similar document search while examining only a fraction of the documents in the target corpus.
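The restricted search described in the abstract can be illustrated with a minimal sketch. The abstract does not say which three signature representations the paper studies, so the centroid signature, the cosine similarity measure, the toy corpus, and all function names below are assumptions made purely for illustration, not the authors' method.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Assumed signature: the mean term-weight vector of a cluster."""
    total = defaultdict(float)
    for vec in vectors:
        for term, weight in vec.items():
            total[term] += weight
    return {term: weight / len(vectors) for term, weight in total.items()}


def cluster_search(query, clusters, signatures, n_clusters=1, k=2):
    """Rank only the documents in the n_clusters best-matching clusters.

    Returns the top-k document ids and the number of documents examined.
    """
    best = sorted(
        signatures,
        key=lambda c: cosine(query, signatures[c]),
        reverse=True,
    )[:n_clusters]
    candidates = [
        (doc_id, vec) for c in best for doc_id, vec in clusters[c].items()
    ]
    ranked = sorted(candidates, key=lambda d: cosine(query, d[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]], len(candidates)


# Toy corpus (hypothetical): two clusters of two documents each.
clusters = {
    "sports": {"d1": {"goal": 2, "match": 1}, "d2": {"match": 2, "team": 1}},
    "finance": {"d3": {"stock": 2, "market": 1}, "d4": {"market": 2, "bank": 1}},
}
signatures = {name: centroid(list(docs.values())) for name, docs in clusters.items()}

query = {"market": 1, "stock": 1}
results, examined = cluster_search(query, clusters, signatures)
print(results, examined)
```

In this toy run only the two documents of the best-matching cluster are compared with the query, rather than all four. The quality of the approximation depends on how well each signature summarizes its cluster, which is the question the abstract raises.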

Similar resources

Efficient document retrieval using text clustering

Similar document retrieval is the problem of finding documents that are most similar to a given query document. In this work, we present a retrieval based on clustering of the documents that approximates the nearest neighbor search. It is done by determining the clusters that are most similar to the query document and restricting the search to the documents in these clusters. Cluster representa...

Document Clustering: A Detailed Review

Document clustering is automatic organization of documents into clusters so that documents within a cluster have high similarity in comparison to documents in other clusters. It has been studied intensively because of its wide applicability in various areas such as web mining, search engines, and information retrieval. It is measuring similarity between documents and grouping similar documents ...

A word-based soft clustering algorithm for documents

Document clustering is an important tool for applications such as Web search engines. It enables the user to have a good overall view of the information contained in the documents. However, existing algorithms suffer from various aspects; hard clustering algorithms (where each document belongs to exactly one cluster) cannot detect the multiple themes of a document, while soft clustering algorit...

Phrase based Clustering Scheme of Suffix Tree Document Clustering Model

Document clustering is one of the difficult and recent research fields in the search engine research. Most of the existing documents clustering techniques use a group of keywords from each document to cluster the documents. Document clustering arises from information retrieval domains, and “It finds grouping for a set of documents belonging to the same cluster are similar and documents belongs ...

Using Web structure and summarisation techniques for Web content mining

The dynamic nature and size of the Internet can result in difficulty finding relevant information. Most users typically express their information need via short queries to search engines and they often have to physically sift through the search results based on relevance ranking set by the search engines, making the process of relevance judgement time-consuming. In this paper, we describe a nov...

Publication date: 2003