Visualizing Outliers
Author
Abstract
Outliers have more than two centuries’ history in the field of statistics. Recently, they have become a focal topic because of their relevance to terrorism, network intrusions, financial fraud, and other areas where rare events are critical to understanding a process. This paper presents a new algorithm, called hdoutliers, for detecting multidimensional outliers. It is unique for a) dealing with a mixture of categorical and continuous variables, b) dealing with the curse of dimensionality (many columns of data), c) dealing with many rows of data, d) dealing with outliers that mask other outliers, and e) dealing consistently with unidimensional and multidimensional datasets. Unlike ad hoc methods found in many machine learning papers, hdoutliers is based on a distributional model that allows outliers to be tagged with a probability.
Similar resources
Net-Ray: Visualizing and Mining Billion-Scale Graphs
How can we visualize billion-scale graphs? How to spot outliers in such graphs quickly? Visualizing graphs is the most direct way of understanding them; however, billion-scale graphs are very difficult to visualize since the amount of information overflows the resolution of a typical screen. In this paper we propose NET-RAY, an open-source package for visualization-based mining on billion-scale ...
Bagplots, boxplots and outlier detection for functional data
We propose some new tools for visualizing functional data and for identifying functional outliers. The proposed tools make use of robust principal component analysis, data depth and highest density regions. We compare the proposed outlier detection methods with the existing “functional depth” method, and show that our methods have better performance on identifying outliers in French male age-sp...
minotaur: A platform for the analysis and visualization of multivariate results from genome scans with R Shiny.
Genome scans are widely used to identify 'outliers' in genomic data: loci with different patterns compared with the rest of the genome due to the action of selection or other nonadaptive forces of evolution. These genomic data sets are often high dimensional, with complex correlation structures among variables, making it a challenge to identify outliers in a robust way. The Mahalanobis distance...
Scorpion: Explaining Away Outliers in Aggregate Queries
Database users commonly explore large data sets by running aggregate queries that project the data down to a smaller number of points and dimensions, and visualizing the results. Often, such visualizations will reveal outliers that correspond to errors or surprising features of the input data set. Unfortunately, databases and visualization systems do not provide a way to work backwards from an ...
Rainbow plots, bagplots and boxplots for functional data
We propose new tools for visualizing large amounts of functional data in the form of smooth curves. The proposed tools include functional versions of the bagplot and boxplot, and make use of the first two robust principal component scores, Tukey’s data depth and highest density regions. By-products of our graphical displays are outlier detection methods for functional data. We compare ...