Disk-Based Successive Sampling for Outlier Detection in High Dimensional Data

Authors

  • Pei Sun
  • Sanjay Chawla
Abstract

We propose a sampling-based outlier detection method for large, high-dimensional data. Our method consists of two phases. In the first phase, we combine a “successive sampling” strategy with a simple randomized partitioning technique to generate a candidate set of outliers. This phase requires one full data scan, and its running time is linear in the size and dimensionality of the data set. An additional data scan, which constitutes the second phase, extracts the actual outliers from the candidate set. The running time of this phase depends on the size of the candidate set and the size of the data set. A major strength of the proposed approach is that no partitioning of the dimensions is required, which makes it particularly suitable for high-dimensional data. Furthermore, our method can handle both continuous and categorical attributes. We also present a detailed experimental evaluation of the proposed method on real and synthetic data sets.

General Terms: Outlier Detection, Sampling, High Dimension, Randomization
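The second phase described in the abstract (one extra scan that checks each candidate against the whole data set) can be illustrated with a minimal Python sketch. The abstract does not give the outlier definition, so this sketch assumes a distance-based score: the Euclidean distance to the k-th nearest neighbour. The function name `verify_candidates` and the parameters `k` and `top_n` are hypothetical, the candidate set is passed in by hand rather than produced by the paper's successive-sampling phase, and only continuous attributes are handled.

```python
import heapq
import math


def verify_candidates(data, candidates, k=2, top_n=1):
    """Rank candidates by k-th nearest-neighbour distance using one data scan.

    data       -- iterable of numeric tuples (stands in for the on-disk data set)
    candidates -- dict mapping record index -> record, kept in memory
    Assumption: the k-NN distance score is an illustrative choice, not
    necessarily the outlier definition used in the paper.
    """
    # One bounded max-heap per candidate, stored as negated distances.
    heaps = {cid: [] for cid in candidates}
    for i, row in enumerate(data):                 # the single data scan
        for cid, cand in candidates.items():       # one distance per candidate
            if i == cid:
                continue                           # do not compare a record to itself
            d = math.dist(row, cand)
            heap = heaps[cid]
            if len(heap) < k:
                heapq.heappush(heap, -d)
            elif d < -heap[0]:
                heapq.heapreplace(heap, -d)        # keep only the k smallest
    # The largest retained distance is the distance to the k-th neighbour.
    scores = {cid: -h[0] for cid, h in heaps.items() if len(h) == k}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Toy data: four clustered points and one far-away point (index 4).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
picked = {1: points[1], 4: points[4]}              # hand-picked candidate set
print(verify_candidates(points, picked, k=2, top_n=1))  # prints [4]
```

Each record read from the data set is compared against every candidate, so the number of distance computations in this sketch grows with the product of the candidate set size and the data set size, while only the candidates and their small heaps need to stay in memory.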

Similar resources

Disk-Based Sampling for Outlier Detection in High Dimensional Data

We propose an efficient sampling based outlier detection method for large high-dimensional data. Our method consists of two phases. In the first phase, we combine a “sampling” strategy with a simple randomized partitioning technique to generate a candidate set of outliers. This phase requires one full data scan and the running time has linear complexity with respect to the size and dimensionali...

Rapid Distance-Based Outlier Detection via Sampling

Distance-based approaches to outlier detection are popular in data mining, as they do not require modeling the underlying probability distribution, which is particularly challenging for high-dimensional data. We present an empirical comparison of various approaches to distance-based outlier detection across a large number of datasets. We report the surprising observation that a simple, sampling...

Outlier Detection for Support Vector Machine using Minimum Covariance Determinant Estimator

The purpose of this paper is to identify the points that affect the performance of one of the important algorithms of data mining, namely the support vector machine. The final classification decision is made based on a small portion of the data called support vectors. The existence of atypical observations among these points will therefore result in deviation from the correct decision. Thus...

Robust high-dimensional semiparametric regression using optimized differencing method applied to the vitamin B2 production data

Background and purpose: With the evolution of science, knowledge, and technology, we deal with high-dimensional data in which the number of predictors may considerably exceed the sample size. The main problems with high-dimensional data are the estimation of the coefficients and interpretation. For high-dimension problems, classical methods are not reliable because of a large number of predictor variable...

Outlier Detection in High Dimensional, Spatial and Sequential Data Sets

Of all the data mining techniques, outlier detection seems closest to the definition of “discovering nuggets of information” in large databases. When an outlier is detected, and determined to be genuine, it can provide insights, which can radically change our understanding of the underlying process. The purpose of the research underlying this thesis was to investigate and devise methods to mine...

Publication date: 2004