Projective clustering of high dimensional data

Author

  • Vasileios Kandylas
Abstract

Clustering of high-dimensional data can be problematic, because the usual notions of distance or similarity break down in high dimensions. More specifically, it can be shown that, as the number of dimensions increases, the distance to the nearest point approaches the distance to the farthest one. Two approaches are common for dealing with this problem. The idea behind the first is to project all the points to a lower-dimensional subspace and then apply a standard clustering algorithm to the low-dimensional representation. However, if different subsets of the points cluster well in different subspaces of the original feature space, then a global dimensionality reduction will fail. In the second approach, projection and clustering are performed simultaneously, allowing each cluster to have a different subspace associated with it. These projective clustering algorithms compute pairs (Ci, Di), consisting of the points Ci belonging to cluster i and the subspace Di in which those points have low variance. Three algorithms are presented that follow different approaches to projective clustering. The first is a partitional method that iteratively assigns points and re-estimates the cluster centroids, similar to k-means but with projection steps included in the iteration. The second is density-based; it works by extending clusters to nearby points, where proximity in high dimensions is defined based on the variance of the clusters along different axes. The last algorithm is an ensemble method: it repeatedly performs random projections, which are then clustered with the EM algorithm and combined. The partitional method optimizes a well-defined objective function, but scales poorly with the number of dimensions. The density-based method scales linearly in the number of dimensions, but it only finds projections onto axis-parallel subspaces, not arbitrarily rotated ones. The ensemble method can exploit the diversity of the individual solutions and produces high-quality clusters in practice, but lacks theoretical guarantees.
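To make the (Ci, Di) formulation concrete, the following sketch illustrates the partitional (k-means-like) scheme in the spirit described above: assignment and subspace re-estimation alternate, with each cluster keeping a centroid and a projection. The function name, the fixed per-cluster subspace dimension q, the SVD-based subspace estimate, and the stopping test are illustrative assumptions, not the exact procedure of the thesis.

```python
import numpy as np

def projective_kmeans(X, k, q, n_iter=50, seed=0):
    """Alternate between (1) assigning each point to the cluster whose
    q-dimensional affine subspace it is closest to and (2) re-estimating
    each cluster's centroid and subspace from its assigned points."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    labels = rng.integers(0, k, size=n)        # random initial assignment
    centers = np.zeros((k, d))
    bases = np.zeros((k, d, q))                # orthonormal basis per cluster
    for _ in range(n_iter):
        # Update step: centroid and top-q principal directions of each cluster.
        # The orthogonal complement of this basis is the low-variance subspace
        # D_i of the abstract; the assignment cost below is the spread in it.
        for i in range(k):
            pts = X[labels == i]
            if len(pts) == 0:                  # re-seed an empty cluster
                pts = X[rng.integers(0, n, size=1)]
            centers[i] = pts.mean(axis=0)
            _, _, vt = np.linalg.svd(pts - centers[i], full_matrices=False)
            basis = vt[:q].T                   # leading right singular vectors
            if basis.shape[1] < q:             # pad if the cluster is tiny
                basis = np.pad(basis, ((0, 0), (0, q - basis.shape[1])))
            bases[i] = basis
        # Assignment step: residual distance of every point to every subspace.
        resid = np.empty((n, k))
        for i in range(k):
            diff = X - centers[i]
            proj = diff @ bases[i] @ bases[i].T    # component inside the flat
            resid[:, i] = np.linalg.norm(diff - proj, axis=1)
        new_labels = resid.argmin(axis=1)
        if np.array_equal(new_labels, labels):     # assignments stabilized
            break
        labels = new_labels
    return labels, centers, bases
```

For instance, data in which each cluster lies near a different 2-dimensional plane of a 100-dimensional space could be handled with projective_kmeans(X, k=3, q=2), whereas plain k-means would have to rely on full-dimensional Euclidean distances.

The ensemble method can be sketched along similar lines. The components chosen here (Gaussian random projections, scikit-learn's GaussianMixture as the EM clusterer, and a co-association matrix merged by average-linkage agglomeration) are assumed stand-ins for "random projection + EM + combination"; the thesis does not prescribe these exact pieces.

```python
import numpy as np
from sklearn.random_projection import GaussianRandomProjection
from sklearn.mixture import GaussianMixture
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def ensemble_projective_clustering(X, k, n_runs=20, proj_dim=10, seed=0):
    """Cluster random low-dimensional projections with EM, then merge the
    resulting partitions through their co-association matrix."""
    n = X.shape[0]
    coassoc = np.zeros((n, n))                 # note: quadratic in n
    for r in range(n_runs):
        # Project the data onto a random proj_dim-dimensional subspace.
        Xp = GaussianRandomProjection(n_components=proj_dim,
                                      random_state=seed + r).fit_transform(X)
        # EM clustering (Gaussian mixture) on the projected data.
        labels = GaussianMixture(n_components=k,
                                 random_state=seed + r).fit_predict(Xp)
        # Co-association: count how often each pair shares a cluster.
        coassoc += (labels[:, None] == labels[None, :])
    dist = 1.0 - coassoc / n_runs              # consensus dissimilarity
    np.fill_diagonal(dist, 0.0)
    # Agglomerate the consensus dissimilarities into k final clusters.
    Z = linkage(squareform(dist, checks=False), method='average')
    return fcluster(Z, t=k, criterion='maxclust')
```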


Similar resources

High-Dimensional Unsupervised Active Learning Method

In this work, a hierarchical ensemble of projected clustering algorithms for high-dimensional data is proposed. The basic concept of the algorithm is based on the active learning method (ALM), a fuzzy learning scheme inspired by some behavioral features of human brain functionality. The high-dimensional unsupervised active learning method (HUALM) is a clustering algorithm which blurs the da...


k-Means Projective Clustering

In many applications it is desirable to cluster high-dimensional data along various subspaces, which we refer to as projective clustering. We propose a new objective function for projective clustering, taking into account the inherent trade-off between the dimension of a subspace and the induced clustering error. We then present an extension of the k-means clustering algorithm for projective clu...


Projective Clustering Method for the Detection of Outliers in Non-Axis Aligned Subspaces

Clustering in the case of non-axis-aligned subspaces and the detection of outliers is a major challenge due to the curse of dimensionality. Conventional clustering is effective in axis-aligned subspaces only. To address this problem, projective clustering has been defined as an extension of traditional clustering that attempts to find projected clusters in subsets of the dimensions of a data space. A pro...


Projective ART with buffers for the high dimensional space clustering and an application to discover stock associations

Unlike traditional hierarchical and partitional clustering algorithms, which always fail to deal with very large databases, a neural network architecture, Projective Adaptive Resonance Theory (PART), is developed for high-dimensional space clustering. However, the success of the PART algorithm depends on both accurate parameters and a suitable ordering of the input data sets. These disadvantages...


Projective Low-rank Subspace Clustering via Learning Deep Encoder

Low-rank subspace clustering (LRSC) has been considered a state-of-the-art method on small datasets. LRSC constructs a desired similarity graph by low-rank representation (LRR) and employs spectral clustering to segment the data samples. However, effectively applying LRSC to clustering big data becomes a challenge because both LRR and spectral clustering suffer from high computational...



Publication date: 2007