Document Clustering using Compound Words
Abstract
Document clustering is a text mining and organization technique that automatically groups related documents into clusters. Traditionally, the single words occurring in the documents are used as features to determine the similarity between documents. In this work, we investigate using compound words as features for document clustering. Our experimental results demonstrate that using compound words alone does not improve the performance of the clustering system. Promising results are achieved when the compound words are combined with the original single words as features. We also evaluate several basic clustering algorithms for algorithm selection. Although other investigators have proposed the bisecting K-means method as a good document clustering algorithm, our experimental results demonstrate that, for small datasets, a traditional hierarchical clustering algorithm still achieves the best performance.
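As a rough illustration of combining compound words with single words as features, the following is a minimal sketch only. The abstract does not say how compound words are extracted, so adjacent word pairs are used here as a stand-in (an assumption). The helper names `single_words`, `compound_words`, `features`, and `cosine` are hypothetical, and cosine similarity over raw term counts is one common choice rather than necessarily the measure used in the paper.

```python
from collections import Counter
from math import sqrt


def single_words(text):
    """Lowercase whitespace tokens with surrounding punctuation stripped."""
    tokens = (tok.strip(".,;:!?()\"'").lower() for tok in text.split())
    return [tok for tok in tokens if tok]


def compound_words(tokens):
    """Adjacent word pairs, a stand-in for compound words (assumption)."""
    return [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]


def features(text, use_single=True, use_compound=True):
    """Term-count vector over single words, compound words, or both."""
    tokens = single_words(text)
    counts = Counter()
    if use_single:
        counts.update(tokens)
    if use_compound:
        counts.update(compound_words(tokens))
    return counts


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


# Two toy documents (illustrative data, not from the paper).
doc_a = "Data mining groups related documents."
doc_b = "Text data mining organizes documents."

# Similarity with single and compound words combined.
combined_similarity = cosine(features(doc_a), features(doc_b))

# Similarity with compound words only.
compound_only_similarity = cosine(
    features(doc_a, use_single=False), features(doc_b, use_single=False)
)
```

A pairwise similarity of this kind could then be passed to whichever clustering algorithm is being evaluated, such as hierarchical clustering or bisecting K-means.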
Similar resources
Clustering Documents with Maximal Substrings
This paper provides experimental results showing that maximal substrings can be used as the elementary building blocks of documents in place of the words extracted by a current state-of-the-art supervised word extraction method. Maximal substrings are defined as substrings whose number of occurrences decreases when even a single character is appended to the head or tail. The main feature of maximal s...
A Probabilistic Topic Model Based on Local Word Relations in Overlapping Windows
A probabilistic topic model assumes that documents are generated through a process involving topics and then tries to reverse this process, extracting the topics given the documents. A topic is usually assumed to be a distribution over words. LDA is one of the first and most popular topic models introduced so far. In the document generation process assumed by LDA, each document is a distribution o...
Person Name Disambiguation on the Web Using Query Expansion
The more important web search becomes, the bigger the problem of shared names in web search results. The proposed solution is to form clusters of people from the search results. In this paper, we report our algorithm that disambiguates person names in web search results. Our clustering algorithm is based on hierarchical agglomerative clustering using named entities, compound key words and URLs as features fo...
The Effect of Word Sampling on Document Clustering
Many document clustering techniques depend on the number of word occurrences in documents. In these techniques, words are treated as the dimensions of the clustering space. Since each document contains a huge number of words, studies have been conducted to reduce this high dimensionality for better performance, i.e., word pruning. Sampling was used to choose random documents r...
Domain Based Punjabi Text Document Clustering
Text clustering is a text mining technique that uses a similarity measure to group similar documents into a single cluster and to separate dissimilar documents. The popular clustering algorithms available for text clustering treat a document as a conglomeration of words. The syntactic or semantic relations between words are not given any consideration. Many different algorithms ...