Clustering Large Data Sets with Mixed Numeric and Categorical Values
نویسنده
چکیده
Efficient partitioning of large data sets into homogenous clusters is a fundamental problem in data mining. The standard hierarchical clustering methods provide no solution for this problem due to their computational inefficiency. The k-means based methods are promising for their efficiency in processing large data sets. However, their use is often limited to numeric data. In this paper we present a k-prototypes algorithm which is based on the k-means paradigm but removes the numeric data limitation whilst preserving its efficiency. In the algorithm, objects are clustered against k prototypes. A method is developed to dynamically update the k prototypes in order to maximise the intra cluster similarity of objects. When applied to numeric data the algorithm is identical to the kmeans. To assist interpretation of clusters we use decision tree induction algorithms to create rules for clusters. These rules, together with other statistics about clusters, can assist data miners to understand and identify interesting clusters.
منابع مشابه
خوشهبندی خودکار دادههای مختلط با استفاده از الگوریتم ژنتیک
In the real world clustering problems, it is often encountered to perform cluster analysis on data sets with mixed numeric and categorical values. However, most existing clustering algorithms are only efficient for the numeric data rather than the mixed data set. In addition, traditional methods, for example, the K-means algorithm, usually ask the user to provide the number of clusters. In this...
متن کاملA k-mean clustering algorithm for mixed numeric and categorical data
Use of traditional k-mean type algorithm is limited to numeric data. This paper presents a clustering algorithm based on k-mean paradigm that works well for data with mixed numeric and categorical features. We propose new cost function and distance measure based on co-occurrence of values. The measures also take into account the significance of an attribute towards the clustering process. We pr...
متن کاملClustering the Mixed Numerical and Categorical Datasets using Similarity Weight and Filter Method
Clustering is a challenging task in data mining technique. The aim of clustering is to group the similar data into number of clusters. Various clustering algorithms have been developed to group data into clusters. However, these clustering algorithms work effectively either on pure numeric data or on pure categorical data, most of them perform poorly on mixed categorical and numerical data type...
متن کاملClustering the Mixed Numerical and Categorical Dataset using Similarity Weight and Filter Method
Clustering is a challenging task in data mining technique. The aim of clustering is to group the similar data into number of clusters. Various clustering algorithms have been developed to group data into clusters. However, these clustering algorithms work effectively either on pure numeric data or on pure categorical data, most of them perform poorly on mixed categorical and numerical data type...
متن کاملClustering Algorithm for Incomplete Data Sets with Mixed Numeric and Categorical Attributes
The traditional k-prototypes algorithm is well versed in clustering data with mixed numeric and categorical attributes, while it is limited to complete data. In order to handle incomplete data set with missing values, an improved k-prototypes algorithm is proposed in this paper, which employs a new dissimilarity measure for incomplete data set with mixed numeric and categorical attributes and a...
متن کامل