Cluster Center Initialization for Categorical Data Using Multiple Attribute Clustering
Abstract
The K-modes clustering algorithm is well known for its efficiency in clustering large categorical datasets. It requires a random selection of initial cluster centers (modes) as seeds, so the clustering results often depend on the choice of initial cluster centers, and non-repeatable cluster structures may be obtained. In this paper, we propose an algorithm to compute fixed initial cluster centers for the K-modes clustering algorithm. It exploits a multiple clustering approach that determines cluster structures from the attribute values of given attributes in a dataset. The algorithm is based on two experimental observations: some data objects do not change cluster membership irrespective of the choice of initial cluster centers, and individual attributes may provide some information about the cluster structures. Most of the time, attributes with few attribute values play a significant role in deciding the cluster membership of individual data objects. The proposed algorithm gives fixed initial cluster centers (ensuring repeatable clustering results), their computation is independent of the order of presentation of the data, and the algorithm has log-linear worst-case time complexity with respect to the number of data objects. We tested the proposed algorithm on various categorical datasets, compared it against random initialization and two other available methods, and show that it performs better than the existing methods.
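To make the repeatability issue concrete, the following is a minimal sketch of plain K-modes in Python. It is not the initialization algorithm proposed in the paper. It only illustrates that the result is a function of the initial modes, so fixed initial modes give repeatable clusterings while randomly sampled ones may not. The function names (`hamming`, `k_modes`, `random_init`) and the toy dataset are assumptions made for illustration only.

```python
import random
from collections import Counter


def hamming(a, b):
    """Simple matching dissimilarity: number of attributes on which a and b differ."""
    return sum(x != y for x, y in zip(a, b))


def k_modes(data, init_modes, max_iter=100):
    """Plain K-modes started from the given initial modes (illustrative sketch)."""
    modes = [tuple(m) for m in init_modes]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each object goes to its nearest mode (ties go to the lowest index).
        new_labels = [
            min(range(len(modes)), key=lambda j: hamming(row, modes[j]))
            for row in data
        ]
        if new_labels == labels:
            break  # memberships are stable, so the modes will not change either
        labels = new_labels
        # Update step: the new mode takes the most frequent value of each attribute.
        for j in range(len(modes)):
            members = [row for row, lab in zip(data, labels) if lab == j]
            if members:  # an empty cluster keeps its previous mode
                modes[j] = tuple(
                    Counter(col).most_common(1)[0][0] for col in zip(*members)
                )
    return labels, modes


def random_init(data, k, seed):
    """Random seeding as in standard K-modes: sample k objects as initial modes."""
    return random.Random(seed).sample(data, k)


# Hypothetical toy dataset with three categorical attributes.
data = [
    ("red", "small", "round"),
    ("red", "small", "oval"),
    ("red", "medium", "round"),
    ("blue", "large", "square"),
    ("blue", "large", "round"),
    ("blue", "medium", "square"),
]

# Fixed initial modes: every run yields the same clustering.
fixed_labels, fixed_modes = k_modes(data, [data[0], data[3]])

# Random initial modes: the outcome may vary with the seed.
random_runs = [k_modes(data, random_init(data, 2, seed))[0] for seed in range(5)]
```

Comparing the entries of `random_runs` with `fixed_labels` shows whether different seeds lead to different partitions on a given dataset. A deterministic initialization removes that source of variation by construction.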
Similar resources
Cluster center initialization algorithm for K-modes clustering
Partitional clustering of categorical data is normally performed by using K-modes clustering algorithm, which works well for large datasets. Even though the design and implementation of K-modes algorithm is simple and efficient, it has the pitfall of randomly choosing the initial cluster centers for invoking every new execution that may lead to non-repeatable clustering results. This paper addr...
A cluster centers initialization method for clustering categorical data
Keywords: the k-modes algorithm; initialization method; initial cluster centers; density; distance. Abstract: The leading partitional clustering technique, k-modes, is one of the most computationally efficient clustering methods for categorical data. However, the performance of the k-modes clustering algorithm, which converges to numerous local minima, strongly depends on initial cluster centers...
Clustering Categorical Data Using Community Detection Techniques
With the advent of the k-modes algorithm, the toolbox for clustering categorical data has an efficient tool that scales linearly in the number of data items. However, random initialization of cluster centers in k-modes makes it hard to reach a good clustering without resorting to many trials. Recently proposed methods for better initialization are deterministic and reduce the clustering cost co...
Initialization of K-modes clustering using outlier detection techniques
The K-modes clustering has received much attention, since it works well for categorical data sets. However, the performance of K-modes clustering is especially sensitive to the selection of initial cluster centers. Therefore, choosing the proper initial cluster centers is a key step for K-modes clustering. In this paper, we consider the initialization of K-modes clustering from the view of outl...
Numerical and Categorical Attributes Data Clustering Using K-Modes and Fuzzy K-Modes
Most of the existing clustering approaches are applicable to purely numerical or categorical data only, but not the both. In general, it is a nontrivial task to perform clustering on mixed data composed of numerical and categorical attributes because there exists an awkward gap between the similarity metrics for categorical and numerical data. This paper therefore presents a general clustering ...