Density-Connected Subspace Clustering for High-Dimensional Data
Authors
Abstract
Several application domains, such as molecular biology and geography, produce a tremendous amount of data which can no longer be managed without the help of efficient and effective data mining methods. One of the primary data mining tasks is clustering. However, traditional clustering algorithms often fail to detect meaningful clusters because most real-world data sets are characterized by a high-dimensional, inherently sparse data space. Nevertheless, the data sets often contain interesting clusters which are hidden in various subspaces of the original feature space. Therefore, the concept of subspace clustering has recently been addressed, which aims at automatically identifying subspaces of the feature space in which clusters exist. In this paper, we introduce SUBCLU (density-connected Subspace Clustering), an effective and efficient approach to the subspace clustering problem. Using the concept of density-connectivity underlying the algorithm DBSCAN [EKSX96], SUBCLU is based on a formal clustering notion. In contrast to existing grid-based approaches, SUBCLU is able to detect arbitrarily shaped and positioned clusters in subspaces. The monotonicity of density-connectivity is used to efficiently prune subspaces in the process of generating all clusters in a bottom-up way. While not examining any unnecessary subspaces, SUBCLU delivers for each subspace the same clusters DBSCAN would have found when applied to this subspace separately.
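The bottom-up search described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`dbscan`, `subclu_sketch`), the Euclidean distance, the brute-force neighbor search, and the toy data are all assumptions made for illustration. The sketch clusters every one-dimensional subspace with a plain DBSCAN, joins k-dimensional subspaces that share a prefix into (k+1)-dimensional candidates, discards any candidate that has a k-dimensional projection without clusters (the monotonicity-based pruning), and then re-clusters only the objects that were already clustered in one of the candidate's projections.

```python
from itertools import combinations
from math import sqrt


def dist(p, q, dims):
    """Euclidean distance between p and q using only the attributes in dims."""
    return sqrt(sum((p[d] - q[d]) ** 2 for d in dims))


def dbscan(points, ids, dims, eps, min_pts):
    """Plain DBSCAN on the objects `ids`, projected onto `dims`.

    Returns a list of clusters (sets of object ids); noise is dropped.
    Brute-force neighbor search, so this is O(n^2) per call.
    """
    ids = list(ids)
    neighbors = {
        i: [j for j in ids if dist(points[i], points[j], dims) <= eps]
        for i in ids
    }
    core = {i for i in ids if len(neighbors[i]) >= min_pts}
    assigned, clusters = set(), []
    for seed in ids:
        if seed not in core or seed in assigned:
            continue
        cluster, stack = set(), [seed]
        while stack:
            i = stack.pop()
            if i in assigned:
                continue
            assigned.add(i)
            cluster.add(i)
            if i in core:  # only core objects expand the cluster
                stack.extend(neighbors[i])
        clusters.append(cluster)
    return clusters


def subclu_sketch(points, eps, min_pts):
    """Map each subspace (tuple of attribute indices) to its clusters."""
    n_dims = len(points[0])
    level = {}
    for d in range(n_dims):  # step 1: all one-dimensional subspaces
        found = dbscan(points, range(len(points)), (d,), eps, min_pts)
        if found:
            level[(d,)] = found
    result = dict(level)
    k = 1
    while level:
        next_level = {}
        for s, t in combinations(sorted(level), 2):
            if s[:-1] != t[:-1]:
                continue  # join only subspaces sharing their first k-1 attributes
            cand = s + (t[-1],)
            projections = list(combinations(cand, k))
            # Monotonicity: a cluster in cand implies clusters in every projection.
            if not all(p in level for p in projections):
                continue
            # Start from the projection with the fewest clustered objects.
            best = min(projections, key=lambda p: sum(len(c) for c in level[p]))
            found = []
            for cluster in level[best]:
                found.extend(dbscan(points, sorted(cluster), cand, eps, min_pts))
            if found:
                next_level[cand] = found
        result.update(next_level)
        level = next_level
        k += 1
    return result


# Toy data (hypothetical): two dense groups in attributes 0 and 1,
# while attribute 2 is spread out and carries no cluster structure.
demo_points = [
    (0.0, 0.0, 0.0), (0.1, 0.0, 10.0), (0.0, 0.1, 20.0), (0.1, 0.1, 30.0),
    (5.0, 5.0, 40.0), (5.1, 5.0, 50.0), (5.0, 5.1, 60.0), (5.1, 5.1, 70.0),
]
demo_result = subclu_sketch(demo_points, eps=0.5, min_pts=3)
for subspace in sorted(demo_result):
    print(subspace, [sorted(c) for c in demo_result[subspace]])
```

In this toy run, attribute 2 yields no one-dimensional clusters, so no subspace containing it is ever generated as a candidate. That is the pruning effect the abstract attributes to the monotonicity of density-connectivity.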
Similar resources
An Efficient Density Conscious Subspace Clustering Method using Top-down and Bottom-up Strategies
Clustering high-dimensional data is an emerging research field. Most clustering techniques use distance measures to build clusters. In high-dimensional spaces, traditional clustering algorithms suffer from a problem called the “curse of dimensionality”. Subspace clustering groups similar objects embedded in subspaces of the full space. Recent approaches attempt to find clusters embedded in subspace of h...
Efficient Identification of Subspaces with Small but Substantive Clusters in Noisy Datasets
We propose an efficient filter approach (called ROSMULD) to rank subspaces with respect to their clustering tendency, that is, how likely it is to find areas in the respective subspaces with a (possibly slight but substantive) increase in density. Each data object votes for the subspace with the most unlikely high data density and subspaces are ranked according to the number of received votes. ...
ISC–Intelligent Subspace Clustering, A Density Based Clustering Approach for High Dimensional Dataset
Many real-world data sets consist of a very high dimensional feature space. Most clustering techniques use the distance or similarity between objects as a measure to build clusters. But in high dimensional spaces, distances between points become relatively uniform. In such cases, density based approaches may give better results. Subspace Clustering algorithms automatically identify lower dimens...
Clustering for High Dimensional Data: Density based Subspace Clustering Algorithms
Finding clusters in high-dimensional data is a challenging task, as such data comprises hundreds of attributes. Subspace clustering is an evolving methodology which, instead of finding clusters in the entire feature space, aims at finding clusters in various overlapping or non-overlapping subspaces of the high-dimensional dataset. Density based subspace clustering algorithms t...
New techniques for clustering complex objects
The tremendous amount of data produced nowadays in various application domains such as molecular biology or geography can only be fully exploited by efficient and effective data mining tools. One of the primary data mining tasks is clustering, which is the task of partitioning points of a data set into distinct groups (clusters) such that two points from one cluster are similar to each other wh...