Randomized Dimensionality Reduction for k-Means Clustering
Abstract
We study the topic of dimensionality reduction for k-means clustering. Dimensionality reduction encompasses the union of two approaches: 1) feature selection and 2) feature extraction. A feature selection-based algorithm for k-means clustering selects a small subset of the input features and then applies k-means clustering on the selected features. A feature extraction-based algorithm for k-means clustering constructs a small set of new artificial features and then applies k-means clustering on the constructed features. Despite the significance of k-means clustering as well as the wealth of heuristic methods addressing it, provably accurate feature selection methods for k-means clustering are not known. On the other hand, two provably accurate feature extraction methods for k-means clustering are known in the literature; one is based on random projections and the other is based on the singular value decomposition (SVD). This paper makes further progress toward a better understanding of dimensionality reduction for k-means clustering. Namely, we present the first provably accurate feature selection method for k-means clustering and, in addition, we present two feature extraction methods. The first feature extraction method is based on random projections and it improves upon the existing results in terms of time complexity and number of features needed to be extracted. The second feature extraction method is based on fast approximate SVD factorizations and it also improves upon the existing results in terms of time complexity. The proposed algorithms are randomized and provide constant-factor approximation guarantees with respect to the optimal k-means objective value.
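As an illustration of the feature extraction idea described in the abstract, the following is a minimal sketch, not the paper's algorithm: it projects the data onto a small number of random features using a rescaled random sign matrix, runs plain Lloyd iterations on the projected data, and then evaluates the resulting partition on the original features. The function names, the synthetic data, and the target dimension `r = 20` are assumptions made for this example; the paper's own choice of dimension and its approximation guarantees are not reproduced here.

```python
import numpy as np


def random_sign_projection(X, r, rng):
    """Map n x d data to n x r using a random +/-1 matrix scaled by 1/sqrt(r)."""
    d = X.shape[1]
    R = rng.choice([-1.0, 1.0], size=(d, r)) / np.sqrt(r)
    return X @ R


def lloyd_kmeans(X, k, rng, n_iter=50):
    """Plain Lloyd iterations (illustrative helper); returns one label per row."""
    # Fancy indexing copies the rows, so updating centers leaves X untouched.
    centers = X[rng.choice(X.shape[0], size=k, replace=False)]
    for _ in range(n_iter):
        # Squared distance from every point to every center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return labels


def kmeans_cost(X, labels, k):
    """k-means objective: squared distances of points to their cluster centroid."""
    cost = 0.0
    for j in range(k):
        members = X[labels == j]
        if len(members) > 0:
            cost += ((members - members.mean(axis=0)) ** 2).sum()
    return cost


# Synthetic data (an assumption for the demo): 3 separated groups in 500 dimensions.
rng = np.random.default_rng(0)
n_per_cluster, d, k, r = 100, 500, 3, 20
true_centers = rng.normal(scale=5.0, size=(k, d))
X = np.vstack([c + rng.normal(size=(n_per_cluster, d)) for c in true_centers])

# Cluster on the r extracted features, then score the partition on all d features.
X_reduced = random_sign_projection(X, r, rng)
labels_reduced = lloyd_kmeans(X_reduced, k, rng)
cost_reduced = kmeans_cost(X, labels_reduced, k)

# Baseline for comparison: cluster directly on the original features.
labels_full = lloyd_kmeans(X, k, rng)
cost_full = kmeans_cost(X, labels_full, k)

print("cost using projected features:", cost_reduced)
print("cost using original features: ", cost_full)
```

Scoring both partitions with `kmeans_cost` on the original data is what makes the comparison meaningful: the reduced representation is only used to find the partition, while the objective is always measured in the original space.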
Similar resources
A New Method for Dimensionality Reduction using K-Means Clustering Algorithm for High Dimensional Data Set
Clustering is the process of finding groups of objects such that the objects in a group will be similar to one another and different from the objects in other groups. Dimensionality reduction is the transformation of high-dimensional data into a meaningful representation of reduced dimensionality that corresponds to the intrinsic dimensionality of the data. K-means clustering algorithm often do...
Dimensionality Reduction for k-Means Clustering
In this thesis we study dimensionality reduction techniques for approximate k-means clustering. Given a large dataset, we consider how to quickly compress to a smaller dataset (a sketch), such that solving the k-means clustering problem on the sketch will give an approximately optimal solution on the original dataset. First, we provide an exposition of technical results of [CEM15], which show t...
Clustering on High Dimensional data Using Locally Linear Embedding (LLE) Techniques
Clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). The dimension can be reduced by using dimension reduction techniques. Recently, new nonlinear methods have been introduced for reducing the dimensionality of such data, called Locally Li...
Dimensionality Reduction for Sparse and Structured Matrices
Dimensionality reduction has become a critical tool for quickly solving massive matrix problems. Especially in modern data analysis and machine learning applications, an overabundance of data features or examples can make it impossible to apply standard algorithms efficiently. To address this issue, it is often possible to distill data to a much smaller set of informative features or examples, ...
Clustering based color reduction -Improvements and tips-
This paper presents simple yet powerful improvements in the color reduction field, targeting interactive high-quality applications. Here, maximum distance clustering (MDC) is used to initialize K-means clustering, which eliminates the drawback of clustering-based color reduction that tends to ignore colors with a small number of pixels. Maximum distance clustering’s speed problem due to the problem ...
Journal:
- IEEE Trans. Information Theory
Volume 61
Publication date: 2015