Fast and fully-automated histograms for large-scale data sets

Authors

Abstract

G-Enum histograms are a new, fast, and fully automated method for irregular histogram construction. By framing histogram construction as a density estimation problem and its automation as a model selection task, these histograms leverage the Minimum Description Length (MDL) principle to derive two different criteria. Several proven theoretical results about these criteria give insights into their asymptotic behavior, which are used to speed up their optimization. These insights, combined with a greedy search heuristic, allow histograms to be constructed in linearithmic time rather than the polynomial time incurred by previous works. The capabilities of the proposed MDL density estimation method are illustrated with reference to other histogram construction methods from the literature, both on synthetic and large real-world data sets.
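To make the idea concrete, the sketch below illustrates greedy, MDL-style irregular histogram construction: start from a fine regular grid and repeatedly merge the adjacent pair of bins that most reduces a penalized negative log-likelihood. This is not the paper's G-Enum algorithm or its exact criteria; the cost function, the initial grid size, and the naive merge search (the paper's method achieves linearithmic time with cleverer enumeration) are all illustrative assumptions.

```python
import math
import random

def mdl_cost(counts, widths, n):
    # Penalized negative log-likelihood of a piecewise-constant density.
    # An MDL-flavored criterion for illustration, not the paper's formula:
    # data-fit term plus half a log(n) per free bin parameter.
    nll = 0.0
    for c, w in zip(counts, widths):
        if c > 0:
            nll -= c * math.log(c / (n * w))
    return nll + 0.5 * len(counts) * math.log(n)

def greedy_histogram(data, init_bins=64):
    # Bin the data on a fine regular grid, then greedily merge adjacent
    # bins while the MDL-style cost keeps decreasing.
    lo, hi = min(data), max(data)
    edges = [lo + (hi - lo) * i / init_bins for i in range(init_bins + 1)]
    counts = [0] * init_bins
    for x in data:
        k = min(int((x - lo) / (hi - lo) * init_bins), init_bins - 1)
        counts[k] += 1
    n = len(data)
    while len(counts) > 1:
        widths = [edges[i + 1] - edges[i] for i in range(len(counts))]
        base = mdl_cost(counts, widths, n)
        best_gain, best_i = 0.0, None
        for i in range(len(counts) - 1):
            m_counts = counts[:i] + [counts[i] + counts[i + 1]] + counts[i + 2:]
            m_edges = edges[:i + 1] + edges[i + 2:]
            m_widths = [m_edges[j + 1] - m_edges[j] for j in range(len(m_counts))]
            gain = base - mdl_cost(m_counts, m_widths, n)
            if gain > best_gain:
                best_gain, best_i = gain, i
        if best_i is None:
            break  # no merge improves the criterion
        counts[best_i] += counts.pop(best_i + 1)
        edges.pop(best_i + 1)
    return edges, counts

random.seed(0)
sample = [random.gauss(0.0, 1.0) for _ in range(1000)]
bin_edges, bin_counts = greedy_histogram(sample)
```

The naive merge search above re-scans all candidate pairs per iteration, so it is quadratic in the number of bins; a priority queue over merge gains is the usual route to the linearithmic behavior the abstract refers to.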


Similar articles

Fully Automated Cyclic Planning for Large-Scale Manufacturing Domains

In domains such as factory assembly, it is necessary to assemble many identical instances of a particular product. While modern planners can generate assembly plans for single instances of a complex product, generating plans to manufacture many instances of a product is beyond the capabilities of standard planners. We propose ACP, a system which, given a model of a single instance of a product,...


Enabling Reproducibility for Small and Large Scale Research Data Sets

A large portion of scientific results is based on analysing and processing research data. In order for an eScience experiment to be reproducible, we need to be able to identify precisely the data set which was used in a study. Considering evolving data sources, this can be a challenge, as studies often use subsets which have been extracted from a potentially large parent data set. Exporting and sto...


Large-scale data sets (spike, LFP) recorded

Using silicon-based recording electrodes, we recorded neuronal activity of the dorsal hippocampus and dorsomedial entorhinal cortex from behaving rats. The entorhinal neurons were classified as principal neurons and interneurons based on monosynaptic interactions and wave-shapes. The hippocampal neurons were classified as principal neurons and interneurons based on monosynaptic interactions, wa...


Empirical Comparison of Fast Clustering Algorithms for Large Data Sets

Several fast algorithms for clustering very large data sets have been proposed in the literature. CLARA is a combination of a sampling procedure and the classical PAM algorithm, while CLARANS adopts a serial randomized search strategy to find the optimal set of medoids. GAC-R and GAC-RARw exploit genetic search heuristics for solving clustering problems. In this research, we conducted an empiri...


Fast Algorithms for Unsupervised Learning in Large Data Sets

The ability to automatically mine and extract useful information from large datasets has been a common concern for organizations over the last few decades. Data on the internet is increasing rapidly, and consequently the capacity to collect and store very large data is increasing significantly. Existing clustering algorithms are not always efficient and accurate ...



Journal

Journal title: Computational Statistics & Data Analysis

Year: 2023

ISSN: 0167-9473, 1872-7352

DOI: https://doi.org/10.1016/j.csda.2022.107668