Adaptive Grids for Clustering Massive Data Sets

Authors

Harsha S. Nagesh

Sanjay Goil

Alok N. Choudhary

Abstract

Clustering is a key data mining problem. Density- and grid-based techniques are a popular way to mine clusters in a large multi-dimensional space, wherein clusters are regarded as regions denser than their surroundings. The attribute values and the ranges of these attributes characterize the clusters. Fine grid sizes lead to a huge amount of computation, while coarse grid sizes result in a loss in the quality of the clusters found. Also, varying the grid size leads to clusters with different cluster descriptions. The technique of adaptive grids makes it possible to use grids based on the data distribution and does not require the user to specify parameters such as the grid size or the density thresholds. Further, clusters could be embedded in a subspace of a high-dimensional space. We propose a modified bottom-up subspace clustering algorithm to discover clusters in all possible subspaces. Our method scales linearly with the data dimensionality and the size of the data set. Experimental results on a wide variety of synthetic and real data sets demonstrate the effectiveness of adaptive grids and the effect of the modified subspace clustering algorithm. Our algorithm explores at least an order of magnitude more subspaces than the original algorithm, and the use of adaptive grids yields, on average, a speedup of two orders of magnitude compared to the method with user-specified grid size and threshold.
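The abstract does not spell out how an adaptive grid is built, so the following is only a rough sketch of one plausible reading: start from a fine uniform histogram along a single dimension, merge adjacent fine bins whose counts are similar into variable-width bins, and then call a bin dense when it holds noticeably more points than a uniform distribution would place there. The function names and the constants `fine_bins`, `merge_tol`, and `alpha` are hypothetical illustrations, not taken from the paper; they stand in for internal defaults rather than user-tuned parameters.

```python
def adaptive_bins(values, lo, hi, fine_bins=20, merge_tol=0.25):
    """Return variable-width bins [(left, right, count), ...] for one dimension.

    Assumes hi > lo and that every value lies in [lo, hi].
    The merge rule is an assumption made for illustration only.
    """
    width = (hi - lo) / fine_bins
    counts = [0] * fine_bins
    for v in values:
        # Clamp so that v == hi falls into the last fine bin.
        idx = min(int((v - lo) / width), fine_bins - 1)
        counts[idx] += 1

    bins = []
    start = 0               # index of the first fine bin in the current run
    run_total = counts[0]   # number of points in the current run
    for i in range(1, fine_bins):
        run_mean = run_total / (i - start)
        # Merge while the next count stays close to the run's mean count.
        if abs(counts[i] - run_mean) <= merge_tol * max(run_mean, 1.0):
            run_total += counts[i]
        else:
            bins.append((lo + start * width, lo + i * width, run_total))
            start = i
            run_total = counts[i]
    bins.append((lo + start * width, hi, run_total))
    return bins


def dense_bins(bins, n_points, lo, hi, alpha=1.5):
    """Keep bins holding more than alpha times the count expected under uniformity."""
    selected = []
    for left, right, count in bins:
        expected = n_points * (right - left) / (hi - lo)
        if count > alpha * expected:
            selected.append((left, right))
    return selected
```

Because the threshold scales with each bin's width, wide sparse bins and narrow crowded bins are judged on the same footing, which is the property a single fixed grid size lacks. For example, calling `adaptive_bins(values, 0.0, 10.0)` on one attribute and passing the result to `dense_bins(bins, len(values), 0.0, 10.0)` yields the dense intervals along that attribute.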

Related resources

Clustering Algorithm for 2D Multi-Density Large Dataset Using Adaptive Grids

Clustering is a key data mining problem. Density-based clustering algorithms have recently gained popularity in the data mining field. Density- and grid-based techniques are a popular way to mine clusters in large spatial datasets, wherein clusters are regarded as regions denser than their surroundings. The attribute values and the ranges of these attributes characterize the clusters. In this paper we a...

Mafia: Efficient and Scalable Subspace Clustering for Very Large Data Sets

Clustering techniques are used in database mining for finding interesting patterns in high-dimensional data. These are useful in various applications of knowledge discovery in databases. Some challenges in clustering for large data sets in terms of scalability, data distribution, understanding end-results, and sensitivity to input order, have received attention in the recent past. Recent approach...

A Scalable Parallel Subspace Clustering Algorithm for Massive Data Sets

Clustering is a data mining problem that finds dense regions in a sparse multi-dimensional data set. The attribute values and ranges of these regions characterize the clusters. Clustering algorithms need to scale with the database size and also with the large dimensionality of the data set. Further, these algorithms need to explore the embedded clusters in a subspace of a high dimensional spa...

PARALLEL ALGORITHMS FOR CLUSTERING HIGH-DIMENSIONAL LARGE-SCALE DATASETS

Clustering techniques for large-scale and high-dimensional data sets have found great interest in recent literature. Such data sets are found both in scientific and commercial applications. Clustering is the process of identifying dense regions in a sparse multi-dimensional data set. Several clustering techniques proposed earlier either lack in scalability to a very large set of dimensions or to...

High Performance Subspace Clustering for Massive Data Sets

Business establishments collect vast amounts of data every day. Leveraging this data for smart decision making is the key to identifying profit opportunities, customer retention and giving a winning touch to the business. The path from large amounts of data to Knowledge Discovery is Information Mining, using a sophisticated set of tools to uncover associations, patterns, and trends; detect devia...

Publication date: 2001
