A Frequent Concepts Based Document Clustering Algorithm
نویسندگان
چکیده
This paper presents a novel technique of document clustering based on frequent concepts. The proposed technique, FCDC (Frequent Concepts based document clustering), a clustering algorithm works with frequent concepts rather than frequent items used in traditional text mining techniques. Many well known clustering algorithms deal with documents as bag of words and ignore the important relationships between words like synonyms. the proposed FCDC algorithm utilizes the semantic relationship between words to create concepts. It exploits the WordNet ontology in turn to create low dimensional feature vector which allows us to develop a efficient clustering algorithm. It uses a hierarchical approach to cluster text documents having common concepts. FCDC found more accurate, scalable and effective when compared with existing clustering algorithms like Bisecting K-means , UPGMA and FIHC.
منابع مشابه
خوشهبندی اسناد مبتنی بر آنتولوژی و رویکرد فازی
Data mining, also known as knowledge discovery in database, is the process to discover unknown knowledge from a large amount of data. Text mining is to apply data mining techniques to extract knowledge from unstructured text. Text clustering is one of important techniques of text mining, which is the unsupervised classification of similar documents into different groups. The most important step...
متن کاملA Joint Semantic Vector Representation Model for Text Clustering and Classification
Text clustering and classification are two main tasks of text mining. Feature selection plays the key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researches to use...
متن کاملHybrid Approach for Punjabi Text Clustering
Text Clustering is a text mining technique which is used to group similar documents into single cluster by using some sort of similarity measure and placing dissimilar documents into different clusters. Most of the popular clustering algorithms treats document as conglomeration of words and do not consider the syntactic or semantic relations between words. To overcome this drawback, some algori...
متن کاملA Dynamic Programming Approach To Document Clustering Based On Term Sequence Alignment
Document clustering is unsupervised machine learning technique that, when provided with a large document corpus, automatically sub-divides it into meaningful smaller sub-collections called clusters. Currently, document clustering algorithms use sequence of words (terms) to compactly represent documents and define a similarity function based on the sequences. We believe that the word sequence is...
متن کاملWeb Document Clustering based on Document Structure
Document clustering techniques mostly rely on single term analysis of the document data set, such as the Vector Space Model. To achieve more accurate document clustering, document structure should be reflected in the underlying data model. This paper presents a framework for web document clustering based on two important concepts. The first one is the web document structure, which is currently ...
متن کامل