Speeding up k-means Clustering by Bootstrap Averaging

Authors

  • Ian Davidson
  • Ashwin Satyanarayana
Abstract

K-means clustering is one of the most popular clustering algorithms used in data mining. However, clustering is a time-consuming task, particularly with the large data sets found in data mining. In this paper we show how bootstrap averaging with k-means can produce results comparable to clustering all of the data but in much less time. Bootstrap averaging (bootstrap meaning sampling with replacement) consists of running k-means clustering to convergence on small bootstrap samples of the training data and averaging similar cluster centroids to obtain a single model. We show why our approach should take less computation time and empirically illustrate its benefits. We show that the performance of our approach is a monotonic function of the size of the bootstrap sample. However, determining the bootstrap sample size that yields results as good as clustering the entire data set remains an open and important question.
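
The procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names (`kmeans`, `bootstrap_averaged_kmeans`), the parameters `n_samples` and `sample_size`, and above all the rule for deciding which centroids are "similar" (each centroid is grouped with its nearest centroid from the first bootstrap run) are assumptions made for this example. It also assumes `sample_size` is at least `k`.

```python
import numpy as np

def kmeans(X, k, rng, max_iter=100):
    """Plain Lloyd's k-means, run until cluster assignments stop changing."""
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = None
    for _ in range(max_iter):
        # Distance from every point to every centroid, shape (n, k).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # converged
        labels = new_labels
        for j in range(k):
            members = X[labels == j]
            if len(members):  # an empty cluster keeps its previous centroid
                centroids[j] = members.mean(axis=0)
    return centroids

def bootstrap_averaged_kmeans(X, k, n_samples=10, sample_size=100, seed=0):
    """Cluster small bootstrap samples and average similar centroids.

    ASSUMPTION: "similar" is taken to mean "nearest to the same centroid
    of the first bootstrap run"; the abstract does not specify the rule.
    """
    rng = np.random.default_rng(seed)
    reference, groups = None, None
    for _ in range(n_samples):
        # Bootstrap sample: draw row indices with replacement.
        idx = rng.integers(0, len(X), size=sample_size)
        centroids = kmeans(X[idx], k, rng)
        if reference is None:
            reference = centroids
            groups = [[c] for c in centroids]
            continue
        # Attach each new centroid to its nearest reference centroid.
        d = np.linalg.norm(centroids[:, None, :] - reference[None, :, :], axis=2)
        for c, j in zip(centroids, d.argmin(axis=1)):
            groups[j].append(c)
    # One averaged centroid per group forms the single final model.
    return np.array([np.mean(g, axis=0) for g in groups])
```

In this sketch each k-means run only ever touches `sample_size` points per iteration, which is the intuition behind the expected time saving when `n_samples * sample_size` is much smaller than the full data set. A call such as `bootstrap_averaged_kmeans(X, k=3, n_samples=8, sample_size=60)` returns an array of `k` averaged centroids.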

Similar resources

Speeding-Up the K-Means Clustering Method: A Prototype Based Approach

The paper is about speeding up the k-means clustering method so that it processes the data at a faster pace but produces the same clustering result as the k-means method. We present a prototype-based method for this, where prototypes are derived using the leaders clustering method. Along with the prototypes, called leaders, some additional information is also preserved which enables deriving the k mean...

Mining Student data by Ensemble Classification and Clustering for Profiling and Prediction of Student Academic Performance

Applying Data Mining (DM) in education is an emerging interdisciplinary research field also known as Educational Data Mining (EDM). Ensemble techniques have been successfully applied in the context of supervised learning to increase the accuracy and stability of prediction. In this paper, we present a hybrid procedure based on ensemble classification and clustering that enables academicians to ...

Hybrid hierarchical clustering: cluster assessment via cluster validation indices

This paper introduces a hybrid hierarchical clustering method, which is a novel method for speeding up agglomerative hierarchical clustering by seeding the algorithm with clusters obtained from K-means clustering. This work describes a benchmark study comparing the performance of hybrid hierarchical clustering to that of conventional hierarchical clustering. The two clustering methods are compa...

Persistent K-Means: Stable Data Clustering Algorithm Based on K-Means Algorithm

Identifying clusters, or clustering, is an important aspect of data analysis. It is the task of grouping a set of objects in such a way that objects in the same group/cluster are more similar in some sense or another. It is a main task of exploratory data mining, and a common technique for statistical data analysis. This paper proposed an improved version of K-Means algorithm, namely Persistent K...

A Hybrid Data Clustering Algorithm Using Modified Krill Herd Algorithm and K-MEANS

Data clustering is the process of partitioning a set of data objects into meaningful clusters or groups. Due to the vast usage of clustering algorithms in many fields, a lot of research is still going on to find the best and most efficient clustering algorithm. K-means is simple and easy to implement, but it is sensitive to the initialization of cluster centers and hence can become trapped in a local optimum. In this paper...

Publication date: 2003