Involving Validity Indices in Document Clustering
Abstract
The goal of any clustering algorithm is to find the optimal clustering solution with the optimal number of clusters. To evaluate a clustering solution, a number of validity indices are used during or at the end of the clustering process. These indices can be internal, external, or relative. In this paper, we make two main contributions. First, we present an experimental study comparing the major relative indices in the context of agglomerative document clustering. The objective is to highlight the limits of the existing indices for identifying both the optimal clustering solution and the optimal number of clusters in real datasets. Second, we explore the feasibility of using the relative indices as stopping criteria in agglomerative clustering algorithms. We present a new method that enhances the clustering process with context-awareness to make the indices more reliable for this use.
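To make the idea of scoring an agglomerative hierarchy with a relative index concrete, here is a minimal sketch. It is not the method proposed in the paper: it assumes average-linkage merging, uses the mean silhouette width as one example of a relative index, and works on toy 2-D points standing in for document vectors. The function names (`silhouette`, `average_linkage`, `best_level`) are hypothetical.

```python
import math

def silhouette(points, clusters):
    """Mean silhouette width; `clusters` is a list of lists of point indices."""
    if len(clusters) < 2:
        return -1.0  # undefined for a single cluster; treated as worst here
    total = 0.0
    for ci, members in enumerate(clusters):
        if len(members) == 1:
            continue  # singletons contribute 0 by convention
        for i in members:
            # a: mean distance to the other members of the same cluster
            a = sum(math.dist(points[i], points[j])
                    for j in members if j != i) / (len(members) - 1)
            # b: smallest mean distance to any other cluster
            b = min(sum(math.dist(points[i], points[j]) for j in other) / len(other)
                    for cj, other in enumerate(clusters) if cj != ci)
            if max(a, b) > 0:
                total += (b - a) / max(a, b)
    return total / len(points)

def average_linkage(points, c1, c2):
    """Mean pairwise distance between two clusters."""
    return sum(math.dist(points[i], points[j])
               for i in c1 for j in c2) / (len(c1) * len(c2))

def best_level(points):
    """Merge bottom-up and keep the hierarchy level with the best index value."""
    clusters = [[i] for i in range(len(points))]
    best_score, best_clusters = -1.0, [list(c) for c in clusters]
    while len(clusters) > 2:
        # pick the closest pair of clusters under average linkage
        _, a, b = min((average_linkage(points, clusters[x], clusters[y]), x, y)
                      for x in range(len(clusters))
                      for y in range(x + 1, len(clusters)))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
        score = silhouette(points, clusters)
        if score > best_score:
            best_score, best_clusters = score, [sorted(c) for c in clusters]
    return best_clusters, best_score

# Three well-separated toy groups of three points each.
points = [(0, 0), (0, 1), (1, 0),
          (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]
best, score = best_level(points)
print(len(best), round(score, 3))
```

This sketch scores every level and keeps the best one. A stopping criterion in the stricter sense would halt the merging as soon as the index stops improving, which depends on the index behaving reliably from one merge to the next.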
Similar resources
A comprehensive validity index for clustering
Cluster validity indices are used both for estimating the quality of a clustering algorithm and for determining the correct number of clusters in data. Even though several indices exist in the literature, most of them are only relevant for data sets that contain at least two clusters. This paper introduces a new bounded index for cluster validity called the score function (SF), a double exponen...
Development of An External Cluster Validity Index using Probabilistic Approach and Min-max Distance
Validating a given clustering result is a very challenging task in the real world. For this purpose, several cluster validity indices have been developed in the literature. Cluster validity indices are divided into two main categories: external and internal. External cluster validity indices rely on some supervised information available and internal validity indices utilize the intrinsic structu...
Improving Cluster Method Quality by Validity Indices
Clustering attempts to discover significant groups present in a data set. It is an unsupervised process, and it is difficult to define when a clustering result is acceptable. Thus, several clustering validity indices have been developed to evaluate the quality of the results of clustering algorithms. In this paper, we propose to improve the quality of a clustering algorithm called “CLUSTER” by using a validity ...
Clustering Validity Indices Evaluation with Regard to Semantic Homogeneity
Clustering validity indices are methods for examining and assessing the quality of data clustering results. Various studies provide a thorough evaluation of their performance using both synthetic and real-world datasets. In this work, we describe various approaches to the topic of evaluation of a clustering scheme. Moreover, a new solution to a problem of selecting an appropriate clustering val...
A Probabilistic Topic Model Based on Local Word Relationships in Overlapping Windows
A probabilistic topic model assumes that documents are generated through a process involving topics and then tries to reverse this process to extract the topics from the given documents. A topic is usually assumed to be a distribution over words. LDA is one of the first and most popular topic models introduced so far. In the document generation process assumed by LDA, each document is a distribution o...