Comparative Study on Context-Based Document Clustering

Author

Soumen Swarnakar

Abstract

Clustering is an automatic learning technique that groups a set of objects into subsets, or clusters. Objects in the same cluster should be as similar as possible, whereas objects in different clusters should be as dissimilar as possible. Document clustering has become an increasingly important task in analysing large document collections. The challenge in analysing such collections is to organise them in a way that facilitates better search and knowledge extraction without introducing extra cost and complexity. Document clustering plays an important role in many fields, such as information retrieval and data mining. In this paper, document clustering is first performed using the Hierarchical Agglomerative Clustering and K-Means clustering algorithms. Here, the approach is based purely on the frequency counts of the terms present in the documents, and the context of the documents is ignored entirely. The method is therefore modified by incorporating Relatedness, which measures the degree of relevance of the terms with respect to the concepts present in the documents. The clustering is thus not only term-based but also understanding-based (i.e., context-dependent). Next, clustering is performed using Hierarchical Agglomerative Clustering and K-Means with the Relatedness concept. The Davies-Bouldin (DB) index, a well-known metric, is used to compare the quality of the clusters obtained in two settings: first, when Relatedness is not incorporated into the above-mentioned document-clustering algorithms and, second, when Relatedness is integrated into them.
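The term-frequency baseline and the DB-index comparison described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the toy documents, the whitespace tokenizer, the farthest-point initialisation, and all function names (`term_frequency_vectors`, `kmeans`, `davies_bouldin`) are assumptions made for illustration. The Relatedness weighting is not defined in the abstract, so it is left out here, and only K-Means is shown.

```python
import math
from collections import Counter


def term_frequency_vectors(docs):
    """Build raw term-count vectors over a shared, sorted vocabulary."""
    tokenized = [doc.lower().split() for doc in docs]  # naive tokenizer (assumption)
    vocab = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([float(counts[t]) for t in vocab])
    return vocab, vectors


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(points):
    return [sum(col) / len(points) for col in zip(*points)]


def kmeans(vectors, k, max_iter=100):
    """Plain K-Means with a deterministic farthest-point start (assumption)."""
    cents = [vectors[0]]
    while len(cents) < k:
        # Next seed: the point farthest from all seeds chosen so far.
        cents.append(max(vectors, key=lambda v: min(euclidean(v, c) for c in cents)))
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda i: euclidean(v, cents[i])) for v in vectors]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for i in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == i]
            if members:  # keep the old centroid if a cluster empties
                cents[i] = centroid(members)
    return labels


def davies_bouldin(vectors, labels):
    """Davies-Bouldin index; lower values indicate better-separated clusters.

    Assumes at least two clusters with distinct centroids.
    """
    clusters = {}
    for vec, lab in zip(vectors, labels):
        clusters.setdefault(lab, []).append(vec)
    ids = sorted(clusters)
    cents = {i: centroid(clusters[i]) for i in ids}
    # Scatter: mean distance of a cluster's members to its centroid.
    scatter = {i: sum(euclidean(v, cents[i]) for v in clusters[i]) / len(clusters[i])
               for i in ids}
    worst = [max((scatter[i] + scatter[j]) / euclidean(cents[i], cents[j])
                 for j in ids if j != i)
             for i in ids]
    return sum(worst) / len(ids)


# Hypothetical toy corpus: two documents about pets, two about markets.
docs = [
    "cat dog cat pet",
    "dog pet dog cat",
    "stock market stock trade",
    "market trade stock price",
]
vocab, vectors = term_frequency_vectors(docs)
labels = kmeans(vectors, k=2)
print(labels, davies_bouldin(vectors, labels))
```

Under the same assumptions, a context-aware variant would reweight the term vectors before clustering and compare the resulting DB index against this baseline; a lower index would indicate more compact, better-separated clusters.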


Similar resources

A Joint Semantic Vector Representation Model for Text Clustering and Classification

Text clustering and classification are two main tasks of text mining. Feature selection plays a key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researchers to use...


An Empirical Comparison of Distance Measures for Multivariate Time Series Clustering

Multivariate time series (MTS) data are ubiquitous in science and daily life, and measuring their similarity is a core part of the MTS analysis process. Many of the research efforts in this context have focused on proposing novel similarity measures for the underlying data. However, with the countless techniques to estimate similarity between MTS, this field suffers from a lack of comparative...


Extraction of Respiratory Signal Based on Image Clustering and Intensity Parameters at Radiotherapy with External Beam: A Comparative Study

Background: Since tumors located in the thorax region of the body move mainly due to respiration, modern radiotherapy has seen many attempts, such as external markers, strain gauges and spirometers, to monitor patients' breathing signals. With the advent of the fluoroscopy technique, indirect methods were proposed as an alternative approach to extract patients' breathing signals...


A Comparative Study of Ontology Based Term Similarity Measures on PubMed Document Clustering

Recent research shows that ontology as background knowledge can improve document clustering quality with its concept hierarchy knowledge. Previous studies take term semantic similarity as an important measure to incorporate domain knowledge into the clustering process, for example in clustering initialization and term re-weighting. However, not many studies have focused on how different types of term ...


Document Clustering Based on Ontology and a Fuzzy Approach

Data mining, also known as knowledge discovery in databases, is the process of discovering unknown knowledge from a large amount of data. Text mining applies data mining techniques to extract knowledge from unstructured text. Text clustering is one of the important techniques of text mining; it is the unsupervised classification of similar documents into different groups. The most important step...




Publication date: 2014
