On Lloyd's Algorithm: New Theoretical Insights for Clustering in Practice

نویسندگان

Cheng Tang

Claire Monteleoni

چکیده

We provide new analyses of Lloyd’s algorithm (1982), commonly known as the k-means clustering algorithm. Kumar and Kannan (2010) showed that running k-SVD followed by a constant approximation k-means algorithm, and then Lloyd’s algorithm, will correctly cluster nearly all of the dataset with respect to the optimal clustering, provided the dataset satisfies a deterministic clusterability assumption. This method is viewed as the “Swiss Army knife” for clustering problems, subsuming popular generative models such as Gaussian mixtures. However, it is tailored to high dimensional data, i.e., when d >> k. We analyze Lloyd’s algorithm for general d without using the spectral projection, which leads to a weaker assumption in the case d < k. Surprisingly, we show that a simple and scalable heuristic that combines random sampling with Single-Linkage serves as a good seeding algorithm for Lloyd’s algorithm under this assumption. We then study stopping criteria for Lloyd’s algorithm under the lens of clusterability, accompanied by controlled simulations.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Statistical and Computational Guarantees of Lloyd's Algorithm and its Variants

Clustering is a fundamental problem in statistics and machine learning. Lloyd’s algorithm, proposed in 1957, is still possibly the most widely used clustering algorithm in practice due to its simplicity and empirical performance. However, there has been little theoretical investigation on the statistical and computational guarantees of Lloyd’s algorithm. This paper is an attempt to bridge this ...

متن کامل

Fast Clustering using MapReduce ∗ ( Extended

Clustering problems have numerous applications and are becoming more challenging as the size of the data increases. In this paper, we consider designing clustering algorithms that can be used in MapReduce, the most popular programming environment for processing large datasets. We focus on the practical and popular clustering problems, k-center and k-median. We develop fast clustering algorithms...

متن کامل

Hartigan's K-Means Versus Lloyd's K-Means - Is It Time for a Change?

Hartigan’s method for k-means clustering holds several potential advantages compared to the classical and prevalent optimization heuristic known as Lloyd’s algorithm. E.g., it was recently shown that the set of local minima of Hartigan’s algorithm is a subset of those of Lloyd’s method. We develop a closed-form expression that allows to establish Hartigan’s method for k-means clustering with an...

متن کامل

Improved Automatic Clustering Using a Multi-Objective Evolutionary Algorithm With New Validity measure and application to Credit Scoring

In data mining, clustering is one of the important issues for separation and classification with groups like unsupervised data. In this paper, an attempt has been made to improve and optimize the application of clustering heuristic methods such as Genetic, PSO algorithm, Artificial bee colony algorithm, Harmony Search algorithm and Differential Evolution on the unlabeled data of an Iranian bank...

متن کامل

The Analysis of a Simple k - Means Clustering

K-means clustering is a very popular clustering technique which is used in numerous applications. Given a set of n data points in R d and an integer k, the problem is to determine a set of k points R d , called centers, so as to minimize the mean squared distance from each data point to its nearest center. A popular heuristic for k-means clustering is Lloyd's algorithm. In this paper we present...

متن کامل

ذخیره در منابع من

ذخیره در منابع من قبلا به منابع من ذحیره شده

{@ msg_add @}

با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره شماره

صفحات -

تاریخ انتشار 2016

On Lloyd's Algorithm: New Theoretical Insights for Clustering in Practice

نویسندگان

چکیده

منابع مشابه

Statistical and Computational Guarantees of Lloyd's Algorithm and its Variants

Fast Clustering using MapReduce ∗ ( Extended

Hartigan's K-Means Versus Lloyd's K-Means - Is It Time for a Change?

Improved Automatic Clustering Using a Multi-Objective Evolutionary Algorithm With New Validity measure and application to Credit Scoring

The Analysis of a Simple k - Means Clustering

عنوان ژورنال:

اشتراک گذاری