Document Clustering in Reduced Dimension Vector Space
Abstract
Document clustering is a popular tool for automatically organizing large collections of texts. Clustering algorithms are usually applied to documents represented as vectors in a high-dimensional term space. We investigate the use of Latent Semantic Analysis (LSA) to create a new vector space that is the optimal representation of the document collection. Documents are projected onto a small subspace of this vector space and then clustered. We compare the performance of clustering algorithms applied to documents represented in the full term space and in a reduced-dimension subspace of the LSA-generated vector space. We report significant improvements in cluster quality for LSA subspaces of optimal dimensionality, and we discuss the procedure for determining the right number of dimensions for the subspace. Moreover, when this number is small, the total running time of the clustering algorithm is comparable to that of the same algorithm run in the full term space.
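The pipeline described in the abstract (build an LSA space, project the documents onto a few dimensions, then cluster) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the paper's implementation: NumPy's SVD stands in for the LSA step, plain Lloyd's k-means stands in for the unnamed clustering algorithm, the function names `lsa_project` and `kmeans` are hypothetical, and the term-document counts are made up.

```python
import numpy as np

def lsa_project(term_doc, k):
    """Return k-dimensional document coordinates from a terms x docs matrix."""
    # Thin SVD: term_doc = U @ diag(s) @ Vt, singular values in descending order.
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    # Row i of the result is document i expressed in the top-k LSA dimensions.
    return (s[:k, None] * Vt[:k]).T

def kmeans(points, init_idx, n_iter=20):
    """Plain Lloyd's k-means, seeded with the points at init_idx (an assumption)."""
    centers = points[list(init_idx)].copy()
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every center.
        d = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Move each center to the mean of the points assigned to it.
        for j in range(len(centers)):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels

# Toy term-document counts (rows: terms, columns: documents); made-up data.
A = np.array([
    [2, 1, 2, 0, 0, 0],  # "cluster"
    [1, 2, 1, 0, 0, 0],  # "vector"
    [1, 1, 2, 0, 0, 1],  # "document"
    [0, 0, 0, 2, 1, 2],  # "pixel"
    [0, 0, 1, 1, 2, 1],  # "image"
], dtype=float)

docs_k = lsa_project(A, k=2)           # 6 documents, 2 LSA dimensions
labels = kmeans(docs_k, init_idx=(0, 3))
print(labels)
```

In this toy setting the clustering step works on 2 coordinates per document instead of 5, which is the kind of saving the abstract alludes to when the number of retained dimensions is small. How to choose `k` for a real collection is the question the paper addresses; the sketch simply fixes it by hand.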
Similar resources
Double Clustering in Latent Semantic Indexing
Document clustering is a widely researched area of information retrieval. The large amount of documents which must be handled needs automatic organizing. A popular approach to clustering documents and messages is the vector space model, which represents texts with feature vectors, usually generated from the set of terms contained in the message. The clustering based on the document-term frequen...
A Joint Semantic Vector Representation Model for Text Clustering and Classification
Text clustering and classification are two main tasks of text mining. Feature selection plays the key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researches to use...
A New Image Segmentation Method Based on Fuzzy Clustering with Multi-Objective Differential Evolution
Image segmentation is one of the most important and difficult steps in machine vision problems and achieving the desired results often requires satisfaction of different objectives. One approach to face this situation uses multi-objective fuzzy clustering of pixels in the feature space. This paper proposes a new strategy for search within the family of multi-objective differential evolution alg...
An Efficient Text Clustering Approach using Affinity Propagation with weight modification
Recently the text mining has emerged as one of the most important fields of data mining because of most of the searching in the web is done on the basis of provided text, also the increasing use of social web network uses the text as major component and extracting the effective information directly or indirectly requires an efficient grouping algorithm which should be capable of providing effic...
Feature Selection and Document Clustering
Feature selection is a basic step in the construction of a vector space or bag of words model [BB99]. In particular, when the processing task is to partition a given document collection into clusters of similar documents a choice of good features along with good clustering algorithms is of paramount importance. This chapter suggests two techniques for feature or term selection along with a numb...