Subspace Clustering for High Dimensional Categorical Data
Abstract
A fundamental operation in data mining is to partition a dataset into clusters such that objects in the same cluster are more similar to each other than to objects in different clusters, according to some defined criteria [2]. These criteria are usually expressed as a distance: the smaller the distance between two objects, the more similar they are. Clustering is important because it can greatly help specialists, such as biologists, examine and analyze datasets such as DNA sequences. Datasets may contain numerical (real-valued) data, categorical (symbolic) data, or a mix of the two. Technological advances have made data collection easier and faster, resulting in larger, more complex datasets with many objects and attributes (dimensions). Data mining operations and algorithms must therefore be scalable and able to handle different types of attributes. Most existing clustering algorithms either handle both data types but are inefficient on large datasets, or handle large datasets efficiently but are limited to numerical attributes, as is the k-means algorithm [2]. Huang addressed the latter problem with the k-modes algorithm, an extension of the k-means algorithm [2]. He also presented the k-prototypes algorithm, which integrates the k-means and k-modes processes and can therefore be used for datasets with both numerical and categorical attributes [2]. Nakamori's group [4] presented an alternative extension of the k-means algorithm for clustering categorical data, called the k-representatives algorithm, and demonstrated that it gives results as good as those of the k-modes algorithm but more stably [4].
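To make the k-modes idea concrete, here is a minimal sketch in Python. It assumes the simple matching dissimilarity (the number of attributes on which two objects differ) and a per-attribute most-frequent-value update. The function names, the toy data, and the tie-breaking rules are illustrative assumptions, not details taken from [2] or [4].

```python
from collections import Counter


def matching_dissimilarity(a, b):
    """Simple matching distance: number of attributes on which a and b differ."""
    return sum(x != y for x, y in zip(a, b))


def column_modes(rows):
    """Most frequent value of each attribute (ties go to the value seen first)."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def k_modes(data, initial_modes, max_iter=100):
    """Alternate between assigning objects to the nearest mode and updating modes."""
    modes = [tuple(m) for m in initial_modes]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each object joins the cluster of its nearest mode
        # (ties go to the lowest cluster index).
        new_labels = [
            min(range(len(modes)),
                key=lambda j: matching_dissimilarity(row, modes[j]))
            for row in data
        ]
        if new_labels == labels:
            break  # assignments are stable, so the modes will not change either
        labels = new_labels
        # Update step: recompute each mode from its current members.
        for j in range(len(modes)):
            members = [row for row, lab in zip(data, labels) if lab == j]
            if members:  # keep the old mode if a cluster becomes empty
                modes[j] = column_modes(members)
    return labels, modes


# Hypothetical toy dataset with three categorical attributes.
example_data = [
    ("a", "x", "p"), ("a", "x", "q"), ("a", "y", "p"),
    ("b", "z", "r"), ("b", "z", "s"), ("c", "z", "r"),
]
labels, modes = k_modes(example_data, [example_data[0], example_data[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
print(modes)   # [('a', 'x', 'p'), ('b', 'z', 'r')]
```

In this sketch the result depends on `initial_modes` and on how ties among equally frequent values are broken, so a different starting choice can lead to a different partition.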
The k-modes algorithm is unstable because the modes are not unique: the clustering results depend strongly on which modes are selected during the clustering process [4]. The work of the groups of Huang and Nakamori aimed at clustering directly in the whole space spanned by all the dimensions (attributes) of the dataset. If a dataset has millions of dimensions, however, clustering the whole dataset directly is probably unwise. As mentioned above, the (dis)similarity between objects is often determined using distance measures (which need not be the usual Euclidean distance; indeed, Euclidean distance cannot be used for categorical attributes) over the various dimensions of the dataset, and additional dimensions spread the points out until, in very high dimensions, they are almost equidistant from each other [3]. In this case, traditional
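The near-equidistance effect mentioned in the abstract can be illustrated with a small experiment. The sketch below is an assumption-laden illustration, not the authors' procedure: it draws uniformly random categorical objects, measures Hamming (simple matching) distances from one query object to all the others, and reports the relative contrast (d_max - d_min) / d_min. The function names and parameter values are hypothetical.

```python
import random


def hamming(a, b):
    """Number of attributes on which two categorical objects differ."""
    return sum(x != y for x, y in zip(a, b))


def relative_contrast(n_points, n_dims, n_values=4, seed=0):
    """(d_max - d_min) / d_min for distances from one query object to the rest.

    Objects are drawn uniformly at random, which is an assumption made only
    for illustration; real categorical data is rarely uniform.
    """
    rng = random.Random(seed)
    points = [
        [rng.randrange(n_values) for _ in range(n_dims)]
        for _ in range(n_points)
    ]
    query, rest = points[0], points[1:]
    dists = [hamming(query, p) for p in rest]
    d_min, d_max = min(dists), max(dists)
    # A zero minimum distance means an exact duplicate of the query exists.
    return (d_max - d_min) / d_min if d_min else float("inf")


# The contrast between nearest and farthest neighbour shrinks as the
# number of dimensions grows.
for dims in (10, 100, 1000):
    print(dims, round(relative_contrast(200, dims), 3))
```

A small relative contrast means the nearest and farthest objects are at almost the same distance from the query, which is the situation in which distance-based similarity over the full space loses its discriminating power.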
Similar resources
Holo-Entropy Based Categorical Data Hierarchical Clustering
Clustering high-dimensional data is a challenging task in data mining, and clustering high-dimensional categorical data is even more challenging because it is more difficult to measure the similarity between categorical objects. Most algorithms assume feature independence when computing similarity between data objects, or make use of computationally demanding techniques such as PCA for numerica...
CBK-Modes: A Correlation-based Algorithm for Categorical Data Clustering
Categorical data sets are often high-dimensional. For handling the high-dimensionality in the clustering process, some works take advantage of the fact that clusters usually occur in a subspace. In soft subspace clustering approaches, different weights are assigned to each attribute in each cluster, for measuring their respective contributions to the formation of each cluster. In this paper, we...
Automatic Clustering Subspace for High Dimensional Categorical Data Using Neuro-Fuzzy Classification
Clustering has been used extensively as a vital tool of data mining. Data gathering has been deliberated widely, but almost all well-known conventional clustering algorithms tend to break down in high-dimensional spaces because of the inherent sparsity of the data points. Present subspace clustering methods for handling high-dimensional data focus on numerical dimensions. The minimum spanning...
A weighting k-modes algorithm for subspace clustering of categorical data
Traditional clustering algorithms consider all of the dimensions of an input data set equally. However, a common property of high-dimensional data is that data points are highly clustered in subspaces, which means classes of objects are categorized in subspaces rather than the entire space. Subspace clustering is an extension of traditional clustering that seeks to find clusters in differe...
CLICK: Clustering Categorical Data using K-partite Maximal Cliques
Clustering is one of the central data mining problems and numerous approaches have been proposed in this field. However, few of these methods focus on categorical data. The categorical techniques that do exist have significant shortcomings in terms of performance, the clusters they detect, and their ability to locate clusters in subspaces. This work introduces a novel algorithm called Click, wh...
Hierarchical Density-Based Clustering of Categorical Data and a Simplification
A challenge involved in applying density-based clustering to categorical datasets is that the ‘cube’ of attribute values has no ordering defined. We propose the HIERDENC algorithm for hierarchical density-based clustering of categorical data. HIERDENC offers a basis for designing simpler clustering algorithms that balance the tradeoff of accuracy and speed. The characteristics of HIERDENC includ...