Tech. Report: Matrix dimensionality reduction for LSI using Spherical K-means

Abstract

In this paper, we propose using the Spherical K-means algorithm as a preprocessing step for Latent Semantic Indexing (LSI). LSI is a well-known approach in Information Retrieval (IR). Spherical K-means is a fast clustering algorithm that groups similar documents together, forming K clusters. Using Spherical K-means to form the matrix of normalized concept vectors yields a large reduction in time and space complexity. The results obtained by the proposed approach are compared with those obtained using LSI. We obtained comparable results using a matrix only 10% of the original size (column-wise).
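The clustering step described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the report's implementation. It assumes each document is a term-weight vector (one column of the term-document matrix), uses a simple evenly spaced initialization, and caps the number of iterations. The names `normalize`, `dot`, `spherical_kmeans` and the toy data are hypothetical. The SVD step of LSI, which would then run on the smaller matrix, is not shown.

```python
import math

def normalize(v):
    """Scale a vector to unit Euclidean length (zero vectors are left as-is)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def dot(u, v):
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def spherical_kmeans(docs, k, max_iters=20):
    """Cluster documents by cosine similarity.

    Returns (concepts, labels): k unit-length concept vectors and the
    cluster index assigned to each document.
    """
    docs = [normalize(d) for d in docs]
    n = len(docs)
    # Assumption: deterministic, evenly spaced initial concept vectors.
    concepts = [list(docs[i * n // k]) for i in range(k)]
    labels = None
    for _ in range(max_iters):
        # Assignment: each document joins the most similar concept vector.
        new_labels = [max(range(k), key=lambda j: dot(d, concepts[j]))
                      for d in docs]
        # Update: each concept vector is the normalized sum of its members.
        for j in range(k):
            members = [d for d, lab in zip(docs, new_labels) if lab == j]
            if members:
                concepts[j] = normalize([sum(col) for col in zip(*members)])
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
    return concepts, labels

# Toy data: 6 documents over 4 terms, forming two obvious topics.
docs = [
    [2, 1, 0, 0], [1, 2, 0, 0], [3, 1, 0, 0],
    [0, 0, 1, 2], [0, 0, 2, 1], [0, 0, 1, 3],
]
concepts, labels = spherical_kmeans(docs, k=2)

# Reduced matrix: terms as rows, concept vectors as columns (4 x 2 here,
# versus 4 x 6 for the original term-document matrix).
reduced = [list(row) for row in zip(*concepts)]
print(labels)
```

In this toy case six document columns are replaced by two concept-vector columns. The abstract's 10% figure corresponds to choosing K at roughly one tenth of the number of documents.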


Similar resources

A Systematic Study on Document Representation and Dimensionality Reduction for Text Clustering

Increasingly large text datasets and the high dimensionality associated with natural language are a great challenge in text mining. In this research, a systematic study is conducted of the application of three Dimension Reduction Techniques (DRT) to three different document representation methods in the context of the text clustering problem, using several standard benchmark datasets. The dimensional...


A Similarity-based Probability Model for Latent Semantic Indexing

A dual probability model is constructed for Latent Semantic Indexing (LSI) using the cosine similarity measure. Both the document-document similarity matrix and the term-term similarity matrix arise naturally from the maximum likelihood estimation of the model parameters, and the optimal solutions are the latent semantic vectors of LSI. Dimensionality reduction is justified by the statist...


Comparing Dimension Reduction Techniques for Document Clustering

In this research, a systematic study is conducted of four dimension reduction techniques for the text clustering problem, using five benchmark data sets. Of the four methods, namely Independent Component Analysis (ICA), Latent Semantic Indexing (LSI), Document Frequency (DF) and Random Projection (RP), ICA and LSI are clearly superior when the k-means clustering algorithm is applied, irrespective of t...


Comparing and Combining Dimension Reduction Techniques for Efficient Text Clustering

A great challenge of text mining arises from the increasingly large text datasets and the high dimensionality associated with natural language. In this research, a systematic study is conducted of six Dimension Reduction Techniques (DRT) in the context of the text clustering problem using three standard benchmark datasets. The methods considered include three feature transformation techniques, I...


The influence of semantics in IR using LSI and K-means clustering techniques

In this paper we study the influence of semantics in information retrieval preprocessing. Specifically, we compare the performance achieved with stemming and with semantic lemmatization as preprocessing. Three techniques are used in the study: the direct use of a weighted matrix, the SVD technique in the LSI model, and the bisecting spherical k-means clustering technique. Although the results seem no...




Publication date: 2006