Text Document Clustering Using Dimension Reduction Technique
Abstract
Text document clustering groups a set of documents according to the information they contain and supports the retrieval of results when a user browses the internet. Experimental evidence has shown that Information Retrieval applications benefit from document clustering, which has been used as a tool to improve retrieval performance. Information retrieval is an interdisciplinary field spanning knowledge management and text mining. Dimensionality Reduction (DR) is a typical step in many text mining problems; it transforms sparse data into a more compact representation. DR can be done in two ways: feature reduction and feature selection. This study implements dimensionality reduction through feature selection, combined with the k-means algorithm. Feature selection is carried out with the InfoGain DR technique. This paper presents an experimental analysis of document clustering performance with the InfoGain technique and shows that the method significantly improves Accuracy, Precision and Recall on the BBC Sports Dataset.
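The abstract gives no implementation details, so the following is only a minimal sketch of the general idea: score each term by information gain, keep the top-scoring terms, and cluster the reduced document vectors with k-means. The toy documents, the labels, the binary term-presence representation, and the helper names (`entropy`, `info_gain`, `select_terms`, `kmeans`) are all assumptions made for illustration. Information gain needs class labels, which are assumed to be available here.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def info_gain(docs, labels, term):
    """IG(term) = H(class) - H(class | term present or absent)."""
    present = [y for d, y in zip(docs, labels) if term in d]
    absent = [y for d, y in zip(docs, labels) if term not in d]
    conditional = sum(len(part) / len(labels) * entropy(part)
                      for part in (present, absent) if part)
    return entropy(labels) - conditional

def select_terms(docs, labels, k):
    """Keep the k terms with the highest information gain (ties: alphabetical)."""
    vocab = sorted({t for d in docs for t in d})
    return sorted(vocab, key=lambda t: -info_gain(docs, labels, t))[:k]

def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means. Toy initialisation: the first k points are the
    starting centroids, which only works if those points are distinct."""
    centroids = [list(p) for p in points[:k]]
    assign = []
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        assign = [min(range(k),
                      key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

# Hypothetical toy corpus: each document is a set of tokens.
docs = [
    {"goal", "striker", "match"},    # football
    {"wicket", "bowler", "match"},   # cricket
    {"goal", "penalty", "match"},    # football
    {"wicket", "innings", "match"},  # cricket
]
labels = ["football", "cricket", "football", "cricket"]

selected = select_terms(docs, labels, k=2)
vectors = [[1 if t in d else 0 for t in selected] for d in docs]
clusters = kmeans(vectors, k=2)
print(selected, clusters)
```

In this toy corpus the term "match" appears in every document, so it carries no information about the class and is dropped by the selection step. A real experiment would use a proper vectorizer, a more robust k-means initialisation, and a held-out evaluation.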
Similar resources
Effective Dimension Reduction Techniques for Text Documents
Frequent term based text clustering is a text clustering technique that uses frequent term sets and dramatically decreases the dimensionality of the document vector space, thereby addressing two problems in particular: the very high dimensionality of the data and the very large size of the databases. The Frequent Term based Clustering algorithm (FTC) has shown significant efficiency compared to some well known text cluster...
A Systematic Study on Document Representation and Dimensionality Reduction for Text Clustering
Increasingly large text datasets and the high dimensionality associated with natural language are a great challenge in text mining. In this research, a systematic study is conducted of the application of three Dimension Reduction Techniques (DRT) to three different document representation methods in the context of the text clustering problem, using several standard benchmark datasets. The dimensional...
Comparing and Combining Dimension Reduction Techniques for Efficient Text Clustering
A great challenge of text mining arises from the increasingly large text datasets and the high dimensionality associated with natural language. In this research, a systematic study is conducted of six Dimension Reduction Techniques (DRT) in the context of the text clustering problem using three standard benchmark datasets. The methods considered include three feature transformation techniques, I...
Comparing k-means clusters on parallel Persian-English corpus
This paper compares clusters of aligned Persian and English texts obtained with the k-means method. Text clustering has many applications in various fields of natural language processing. So far, much research has been done on clustering English documents. The question now arises: can those results be extended to other languages? Since the goal of document clustering is grouping of docum...
Comparing Dimension Reduction Techniques for Document Clustering
In this research, a systematic study is conducted of four dimension reduction techniques for the text clustering problem, using five benchmark data sets. Of the four methods (Independent Component Analysis (ICA), Latent Semantic Indexing (LSI), Document Frequency (DF) and Random Projection (RP)), ICA and LSI are clearly superior when the k-means clustering algorithm is applied, irrespective of t...