Cached Sufficient Statistics for Efficient Machine Learning with Large Datasets

Authors

  • Andrew W. Moore
  • Mary S. Lee
Abstract

This paper introduces new algorithms and data structures for quick counting for machine learning datasets. We focus on the counting task of constructing contingency tables, but our approach is also applicable to counting the number of records in a dataset that match conjunctive queries. Subject to certain assumptions, the costs of these operations can be shown to be independent of the number of records in the dataset and loglinear in the number of non-zero entries in the contingency table. We provide a very sparse data structure, the ADtree, to minimize memory use. We provide analytical worst-case bounds for this structure for several models of data distribution. We empirically demonstrate that tractably-sized data structures can be produced for large real-world datasets by (a) using a sparse tree structure that never allocates memory for counts of zero, (b) never allocating memory for counts that can be deduced from other counts, and (c) not bothering to expand the tree fully near its leaves. We show how the ADtree can be used to accelerate Bayes net structure finding algorithms, rule learning algorithms, and feature selection algorithms, and we provide a number of empirical results comparing ADtree methods against traditional direct counting approaches. We also discuss the possible uses of ADtrees in other machine learning methods, and discuss the merits of ADtrees in comparison with alternative representations such as kd-trees, R-trees and frequent sets.

1 Caching sufficient statistics

Computational efficiency is an important concern for machine learning algorithms, especially when applied to large datasets (Fayyad et al., 1997; Fayyad and Uthurusamy, 1996) or in real-time scenarios. In (Moore et al., 1997) kd-trees with multiresolution cached regression...
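The counting task described in the abstract can be illustrated with a minimal sketch. This is not the paper's ADtree: it keeps only idea (a), never storing a node for a zero count, and omits (b) deducing counts from other counts and (c) leaving the tree unexpanded near the leaves, so its memory use grows much faster than a real ADtree's. The names `ADNode`, `count_query`, and `contingency_table` are hypothetical and chosen for illustration only.

```python
class ADNode:
    """Cached count for one conjunctive query, with children that
    specialise the query on later attributes (illustrative sketch)."""

    def __init__(self, records, first_attr, num_attrs):
        self.count = len(records)
        # children[(attr, value)] -> ADNode. Only values that actually
        # occur get a node, so zero counts are never stored.
        self.children = {}
        for attr in range(first_attr, num_attrs):
            groups = {}
            for rec in records:
                groups.setdefault(rec[attr], []).append(rec)
            for value, subset in groups.items():
                self.children[(attr, value)] = ADNode(subset, attr + 1, num_attrs)


def count_query(root, query):
    """Count records matching a conjunctive query {attr_index: value}.
    The cost depends on the query length, not on the number of records."""
    node = root
    for attr in sorted(query):
        node = node.children.get((attr, query[attr]))
        if node is None:  # no stored node means the count is zero
            return 0
    return node.count


def contingency_table(node, attrs):
    """Return {value_tuple: count} over `attrs` (ascending attribute
    indices), listing only the non-zero cells."""
    if not attrs:
        return {(): node.count}
    first, rest = attrs[0], attrs[1:]
    table = {}
    for (attr, value), child in node.children.items():
        if attr == first:
            for key, cnt in contingency_table(child, rest).items():
                table[(value,) + key] = cnt
    return table


# Toy dataset (made up for this example): 4 records, 3 symbolic attributes.
records = [(0, 1, 0), (0, 1, 1), (1, 0, 0), (0, 1, 0)]
root = ADNode(records, 0, 3)
```

Once the tree is built, `count_query(root, {0: 0, 1: 1})` answers "how many records have attribute 0 equal to 0 and attribute 1 equal to 1" by following two dictionary lookups, and `contingency_table(root, [0, 2])` enumerates only the value combinations that occur in the data.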


Similar resources

The Anchors Hierarchy: Using the Triangle Inequality to Survive High Dimensional Data

This paper is about the use of metric data structures in high-dimensional or non-Euclidean space to permit cached sufficient statistics accelerations of learning algorithms. It has recently been shown that for less than about 10 dimensions, decorating kd-trees with additional "cached sufficient statistics" such as first and second moments and contingency tables can provide satisfying accelerati...


The Anchors Hierarchy: Using the triangle inequality to survive high dimensional data

This paper is about metric data structures in high-dimensional or non-Euclidean space that permit cached sufficient statistics accelerations of learning algorithms. It has recently been shown that for less than about 10 dimensions, decorating kd-trees with additional "cached sufficient statistics" such as first and second moments and contingency tables can provide satisfying acceleration...


Cached Sufficient Statistics for Efficient Machine Learning with Large Datasets

This paper introduces new algorithms and data structures for quick counting for machine learning datasets. We focus on the counting task of constructing contingency tables, but our approach is also applicable to counting the number of records in a dataset that match conjunctive queries. Subject to certain assumptions, the costs of these operations can be shown to be independent of the number of...


Sufficient Statistics for Efficient Machine Learning with Large Datasets

This paper introduces new algorithms and data structures for quick counting for machine learning datasets. We focus on the counting task of constructing contingency tables, but our approach is also applicable to counting the number of records in a dataset that match conjunctive queries. Subject to certain assumptions, the costs of these operations can be shown to be independent of the number of...


Image Color Constancy Using EM and Cached Statistics

Cached statistics are a means of extending the reach of traditional statistical machine learning algorithms into application areas where computational complexity is a limiting factor. Recent work has shown that cached statistics greatly reduce the computational requirements of building a mixture model of a distribution using Expectation-Maximization for a small trade-off in model error. This pap...



Journal:
  • J. Artif. Intell. Res.

Volume 8

Publication date: 1998