A Novel Weighted Phrase-Based Similarity for Web Documents Clustering

Authors

  • Ruilong Yang
  • Qingsheng Zhu
  • Yunni Xia
Abstract

Phrases have been considered a more informative feature term for improving the effectiveness of document clustering. In this paper, a weighted phrase-based document similarity is proposed to compute the pairwise similarities of documents based on the Weighted Suffix Tree Document (WSTD) model. The weighted phrase-based document similarity is applied to the Group-average Hierarchical Agglomerative Clustering (GHAC) algorithm to develop a new Web document clustering approach. According to the structure of the Web documents, different document parts are assigned different levels of significance as structure weights, which are stored in the nodes of the weighted suffix tree; the tree is constructed from sentences rather than whole documents. By mapping each node and its weights in the WSTD model to a unique feature term in the Vector Space Document (VSD) model, the new weighted phrase-based document similarity naturally inherits the TF-IDF term weighting scheme when computing document similarity with weighted phrases. The evaluation experiments indicate that the new clustering approach is very effective at clustering Web documents. Its quality greatly surpasses that of the traditional phrase-based approach, in which the structure of Web documents is ignored. In conclusion, the weighted phrase-based similarity works much better than the ordinary phrase-based similarity.
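The idea of structure-weighted phrases feeding a TF-IDF vector model can be illustrated with a minimal Python sketch. It is not the authors' implementation: word n-grams stand in for suffix tree nodes, the structure weights in `STRUCTURE_WEIGHTS` are invented for illustration, the smoothed IDF formula is an assumption, and all function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical structure weights; the abstract does not state the actual values.
STRUCTURE_WEIGHTS = {"title": 3.0, "heading": 2.0, "body": 1.0}


def weighted_phrase_counts(doc, max_len=3):
    """Sum structure weights over every word n-gram (phrase) of a document.

    `doc` is a list of (part, sentence) pairs, where `part` is a key of
    STRUCTURE_WEIGHTS. N-grams up to `max_len` words are a simplification
    standing in for the nodes of a sentence-level suffix tree.
    """
    counts = Counter()
    for part, sentence in doc:
        words = sentence.lower().split()
        weight = STRUCTURE_WEIGHTS[part]
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += weight
    return counts


def tfidf_vectors(docs, max_len=3):
    """Map each document to a sparse {phrase: weight} vector."""
    counts = [weighted_phrase_counts(d, max_len) for d in docs]
    df = Counter()
    for c in counts:
        df.update(c.keys())  # document frequency of each phrase
    n_docs = len(docs)
    # Smoothed IDF (an assumption) so phrases shared by all documents keep weight.
    return [
        {p: tf * math.log(1 + n_docs / df[p]) for p, tf in c.items()}
        for c in counts
    ]


def cosine(u, v):
    """Cosine similarity of two sparse vectors."""
    dot = sum(w * v[p] for p, w in u.items() if p in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


docs = [
    [("title", "suffix tree clustering"),
     ("body", "web documents are clustered by shared phrases")],
    [("title", "suffix tree clustering"),
     ("body", "a survey of football results")],
    [("title", "football results"),
     ("body", "a survey of football results this season")],
]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

In this toy corpus the first two documents share a title phrase, so their similarity is higher than that of the first and third, which share no words. A pairwise similarity matrix built this way could then be passed to a group-average agglomerative clustering routine.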

Similar resources

Phrase-based Document Similarity Based on an Index Graph Model

Document clustering techniques mostly rely on single term analysis of the document data set, such as the Vector Space Model. To better capture the structure of documents, the underlying data model should be able to represent the phrases in the document as well as single terms. We present a novel data model, the Document Index Graph, which indexes web documents based on phrases, rather than sing...

Web Document Clustering based on Document Structure

Document clustering techniques mostly rely on single term analysis of the document data set, such as the Vector Space Model. To achieve more accurate document clustering, document structure should be reflected in the underlying data model. This paper presents a framework for web document clustering based on two important concepts. The first one is the web document structure, which is currently ...

SYDE 676 Project Report – Fall 2002 Web Document Clustering Using Phrase-based Document Similarity

Measuring the similarity between documents is an essential operation in text mining, especially document clustering. The traditional method of finding the similarity between documents has always been based on extracting individual words from the documents, and using heuristics to give weights to those features. Standard methods in data mining are then used to find the similarity between documen...

Phrase based Clustering Scheme of Suffix Tree Document Clustering Model

Document clustering is one of the difficult and recent research fields in the search engine research. Most of the existing documents clustering techniques use a group of keywords from each document to cluster the documents. Document clustering arises from information retrieval domains, and “It finds grouping for a set of documents belonging to the same cluster are similar and documents belongs ...

Clustering web documents using co-citation, coupling, incoming, and outgoing hyperlinks: a comparative performance analysis of algorithms

Querying search engines with the keyword “jaguars” returns results as diverse as web sites about cars, computer games, attack planes, American football, and animals. More and more search engines offer options to organize query results by categories or, given a document, to return a list of links to topically related documents. While information retrieval traditionally defines similarity of docu...

Journal title:
  • JSW

Volume 6, Issue 

Pages -

Publication date: 2011