Detecting Outliers in High-Dimensional Datasets with Mixed Attributes
Abstract
Outlier detection has attracted substantial attention in many applications and research areas, such as the detection of network intrusions and credit card fraud. Many existing approaches rely on pairwise distances among all points in the dataset. These approaches do not extend easily to current datasets, which usually contain a mix of categorical and continuous attributes and may be distributed across large geographical areas. In addition, current datasets usually have a large number of dimensions. Such datasets tend to be sparse, so traditional concepts such as Euclidean distance or nearest neighbor become unsuitable. We propose ODMAD, a fast outlier detection strategy intended for datasets containing mixed attributes. ODMAD takes the sparseness of the dataset into account and is experimentally shown to be highly scalable with the number of points and the number of attributes in the dataset.
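The abstract's remark that Euclidean distance loses its usefulness in high dimensions can be illustrated with a small experiment. The sketch below is not part of ODMAD and does not reproduce the paper's method or data; it assumes uniformly random points in a unit hypercube, and the function name `relative_contrast` is hypothetical. It measures how much the farthest point differs from the nearest one, relative to the nearest distance.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """Return (d_max - d_min) / d_min for Euclidean distances from one
    random query point to n_points random points in [0, 1]^dim.

    Illustrative assumption: points are drawn uniformly at random.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    d_min, d_max = min(dists), max(dists)
    return (d_max - d_min) / d_min


low = relative_contrast(dim=2)
high = relative_contrast(dim=1000)
print(f"relative contrast, 2 dims:    {low:.2f}")
print(f"relative contrast, 1000 dims: {high:.2f}")
```

Under these assumptions the contrast in 1000 dimensions comes out far smaller than in 2 dimensions: the nearest and farthest points end up at almost the same distance, which is one reason purely distance-based outlier scores become hard to interpret on high-dimensional data.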
Similar resources
Mammalian Eye Gene Expression Using Support Vector Regression to Evaluate a Strategy for Detecting Human Eye Disease
Background and purpose: Machine learning is a class of modern, powerful tools that can solve many important problems people face today. Support vector regression (SVR), a member of the machine learning family, is a way to build a regression model. SVR has been proven to be an effective tool in real-value function estimation. As a supervised-learning appr...
A Robust Method for Detecting DB-Outliers from High Dimensional Datasets
Outlier detection is a popular technique that can be utilized in many modern applications like financial analysis and fraud detection. As data descriptions become more complex, the dimensionality of the datasets involved keeps increasing. However, current research finds that it is extremely difficult to pick out outliers directly from high-dimensional datasets owing to the curse of dimensionality. ...
Finding Key Knowledge Attribute Subspace of Outliers for High Dimensional Dataset
Detecting outliers is an important task in many applications. Since most applications involve high-dimensional data, traditional outlier detection methods become inefficient in such cases. To solve the problem, we propose the concept of outlying reduction by extending attribute reduction in rough set theory. Additionally, we define the key knowledge attribute subspace (KKAS), which can pro...
An Effective Pattern Based Outlier Detection Approach for Mixed Attribute Data
Detecting outliers in mixed attribute datasets is one of the major challenges in real-world applications. Existing outlier detection methods lack effectiveness for mixed attribute datasets, mainly due to their inability to consider interactions among different types of attributes, e.g., numerical and categorical. To address this issue in mixed attribute datasets, we propose a novel Pattern based ...
Brush-Based Ranking For Navigating Within High-Dimensional Datasets
The analysis of high-dimensional data poses a big challenge, as most common visualization techniques do not scale well for displaying a large number of attributes at one time. Therefore, the initial questions arising when analyzing a new dataset typically concern the dimensions themselves in order to assess the relevance of various attributes and to identify clusters of similar (i.e., highly co...