RealKrimp - Finding Hyperintervals that Compress with MDL for Real-Valued Data
Abstract
The MDL Principle (induction by compression) is applied with meticulous effort in the Krimp algorithm for the problem of itemset mining, where one seeks exceptionally frequent patterns in a binary dataset. Like many data mining algorithms, Krimp is not designed for real-valued data and cannot handle such data natively. Inspired by Krimp’s success at using the MDL Principle in itemset mining, we develop RealKrimp: an MDL-based, Krimp-inspired mining scheme that seeks exceptionally high-density patterns in a real-valued dataset. We review how to extend the underlying Kraft inequality, which relates probabilities to codelengths, to real-valued data. Based on this extension we introduce the RealKrimp algorithm: an efficient method to find hyperintervals that compress the real-valued dataset, without the need to discretize the data beforehand.
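The link between probabilities and codelengths mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the RealKrimp encoding, which this page does not describe. The function names, the example probabilities, and the idea of coding a real value up to a fixed `cell_volume` are assumptions made for illustration only.

```python
import math

def shannon_codelengths(probs):
    """Codelength in bits, L(x) = -log2 P(x), for each outcome."""
    return [-math.log2(p) for p in probs]

def kraft_sum(lengths):
    """Left-hand side of the Kraft inequality: sum over x of 2^(-L(x))."""
    return sum(2.0 ** -length for length in lengths)

def density_codelength(density, cell_volume):
    """Approximate bits needed to encode a real-valued point to within a
    cell of size `cell_volume`, given the model's density at that point.
    Assumption: the density is roughly constant over the cell."""
    return -math.log2(density * cell_volume)

# Discrete case: codelengths derived from probabilities satisfy Kraft.
probs = [0.5, 0.25, 0.125, 0.125]
lengths = shannon_codelengths(probs)
kraft = kraft_sum(lengths)

# Real-valued case (hypothetical numbers): a point coded with a uniform
# density over the whole data space versus a uniform density over a
# smaller, denser hyperinterval that contains it.
cell_volume = 0.01
bits_whole_space = density_codelength(1.0 / 100.0, cell_volume)
bits_hyperinterval = density_codelength(1.0 / 1.0, cell_volume)

print(lengths, kraft)
print(bits_whole_space, bits_hyperinterval)
```

In this toy setting, a point inside the high-density hyperinterval costs fewer bits than the same point coded against the whole space, which is the intuition behind looking for hyperintervals that compress.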
Similar resources
Mining hyperintervals: Getting to grips with real-valued data
Many data mining tasks, such as clustering, classification, the construction of decision trees, subgroup discovery and itemset mining, often cope poorly with real-valued data. In fact, it is common for data mining methods to work well only on nominal data with few distinct values. We build the theory to fill this gap for data from arbitrary uncountable sets and introduce ...
Compression-based methods for nonparametric on-line prediction, regression, classification and density estimation of time series
Jorma Rissanen has discovered some deep connections between universal coding (or universal data compression) and mathematical statistics. In particular, the MDL principle has been one of the most powerful methods of modern mathematical statistics. In this paper we apply Rissanen’s approach and ideas to some statistical problems concerned with time series. We address the problem of nonparametric...
Analyzing a Greedy Approximation of an MDL Summarization
Many OLAP (On-line Analytical Processing) applications have produced data cubes that summarize and aggregate details of data queries. These data cubes are multi-dimensional matrices where each cell that satisfies a specific property or trait is represented as a 1, notated as a 1-cell in this report. A cell that does not satisfy that specific property is represented as a 0, notated as a 0-cell. ...
Model Selection Based on Minimum Description Length
We introduce the minimum description length (MDL) principle, a general principle for inductive inference based on the idea that regularities (laws) underlying data can always be used to compress data. We introduce the fundamental concept of MDL, called the stochastic complexity, and we show how it can be used for model selection. We briefly compare MDL-based model selection to other approaches ...
NML-Optimal Histogram Density Estimation
Density estimation is one of the central problems in statistical inference and machine learning. Given a sample of observations, the goal of histogram density estimation is to find a piecewise constant density that describes the data best according to some pre-determined criterion. Although histograms are conceptually simple densities, they are very flexible and can model complex properties lik...