Sublinear-Time Approximation for Clustering Via Random Sampling
Authors
Abstract
In this paper we present a novel analysis of a random sampling approach for three clustering problems in metric spaces: k-median, min-sum k-clustering, and balanced k-median. For all three problems we consider the following simple sampling scheme: select a small sample of points uniformly at random from the input point set V, then run an approximation algorithm on this sample to compute an approximation of its best possible clustering. Our main technical contribution is a significantly strengthened analysis of the approximation guarantee of this scheme for these clustering problems. The main motivation behind our analysis was to design sublinear-time algorithms for clustering problems. Our second contribution is the development of new approximation algorithms for the aforementioned clustering problems. Using our random sampling approach, we obtain for the first time approximation algorithms whose running time is independent of the input size and depends only on k and the diameter of the metric space.
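The sampling scheme described in the abstract can be sketched roughly as follows for k-median. This is only an illustration under stated assumptions, not the authors' algorithm: the function names `kmedian_cost` and `sample_kmedian` are hypothetical, the distance function is supplied by the caller, and an exhaustive search over k-subsets of the sample stands in for "some approximation algorithm". The abstract does not say how large the sample must be, so `sample_size` is left as a free parameter.

```python
import itertools
import random


def kmedian_cost(points, centers, dist):
    """Sum, over all points, of the distance to the nearest center."""
    return sum(min(dist(p, c) for c in centers) for p in points)


def sample_kmedian(points, k, sample_size, dist, rng=None):
    """Choose k centers by solving k-median on a uniform random sample.

    Assumes k <= min(sample_size, len(points)).
    """
    rng = rng or random.Random()
    # Step 1: draw a small sample uniformly at random (without replacement).
    sample = rng.sample(points, min(sample_size, len(points)))
    # Step 2: stand-in solver, exhaustive search over k-subsets of the sample.
    # The work done here depends on sample_size and k, not on len(points).
    return min(
        itertools.combinations(sample, k),
        key=lambda centers: kmedian_cost(sample, centers, dist),
    )


if __name__ == "__main__":
    # Two well-separated groups on the real line, with |a - b| as the metric.
    points = [float(i % 3) for i in range(300)]
    points += [100.0 + (i % 3) for i in range(300)]
    centers = sample_kmedian(points, k=2, sample_size=20,
                             dist=lambda a, b: abs(a - b),
                             rng=random.Random(0))
    print(sorted(centers), kmedian_cost(points, centers, lambda a, b: abs(a - b)))
```

Only the final call to `kmedian_cost` on the full input touches every point, and it is there purely to inspect the result; the center selection itself reads just the sample.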
Similar resources
Sublinear Algorithms for MAXCUT and Correlation Clustering
We study sublinear algorithms for two fundamental graph problems, MAXCUT and correlation clustering. Our focus is on constructing core-sets as well as developing streaming algorithms for these problems. Constant space algorithms are known for dense graphs for these problems, while Ω(n) lower bounds exist (in the streaming setting) for sparse graphs. Our goal in this paper is to bridge the gap b...
Sublinear Time Approximate Sum via Uniform Random Sampling
We investigate the approximation for computing the sum a_1 + ... + a_n with an input of a list of nonnegative elements a_1, ..., a_n. If all elements are in the range [0, 1], there is a randomized algorithm that can compute a (1+ε)-approximation for the sum problem in time O(n (log log n) / ∑_{i=1}^{n} a_i), where ε is a constant in (0, 1). Our randomized algorithm is based on the uniform random sampling,...
Computing Heat Kernel Pagerank and a Local Clustering Algorithm
Heat kernel pagerank is a variation of Personalized PageRank given in an exponential formulation. In this work, we present a sublinear time algorithm for approximating the heat kernel pagerank of a graph. The algorithm works by simulating random walks of bounded length and runs in time O(log(ε^{-1}) log n / (ε^3 log log(ε^{-1}))), assuming performing a random walk step and sampling from a distribution with b...
K-MC: Approximate K-Means++ in Sublinear Time
The quality of K-Means clustering is extremely sensitive to proper initialization. The classic remedy is to apply k-means++ to obtain an initial set of centers that is provably competitive with the optimal solution. Unfortunately, k-means++ requires k full passes over the data which limits its applicability to massive datasets. We address this problem by proposing a simple and efficient seeding...
Approximate K-Means++ in Sublinear Time
The quality of K-Means clustering is extremely sensitive to proper initialization. The classic remedy is to apply k-means++ to obtain an initial set of centers that is provably competitive with the optimal solution. Unfortunately, k-means++ requires k full passes over the data which limits its applicability to massive datasets. We address this problem by proposing a simple and efficient seeding...