Summary Structures for Frequency Queries on Large Transaction Sets

Authors

  • Dow-Yung Yang
  • Akshay Johar
  • Ananth Grama
  • Wojciech Szpankowski
Abstract

As large-scale databases become commonplace, there has been significant interest in mining them for commercial purposes. One of the basic tasks that underlies many of these mining operations is querying of transaction sets for frequencies of specified attribute values. The size of these databases makes it important to develop summary structures capable of high compression ratios as well as supporting fast frequency queries. The nature of the problem and its differences with respect to traditional text compression allow very high compression ratios. In this paper, we propose a binary trie-based summary structure for representing transaction sets. We demonstrate that this trie structure, when augmented with an appropriate set of horizontal pointers, can support frequency queries several orders of magnitude faster than raw transaction data. We improve the memory characteristics of our scheme by compressing the trie into a Patricia trie and demonstrate that this does not have a significant adverse effect on frequency query time. We further reduce the size of this trie by selectively pruning branches to compute a "dominant" trie that is capable of approximate frequency querying. The complement trie, called the "deviant" trie, is also useful in many data mining applications. Recompressing the "dominant" trie into a Patricia trie results in further compression of the trie. Finally, we demonstrate that our binary compressed trie structure has better memory (compression) characteristics compared to related schemes. We support our claims with experimental results on datasets from the IBM synthetic association data generator. This work is supported in part by the National Science Foundation grants EIA-9806741, ACI-9875899, and ACI-9872101. Computing equipment used for this work was supported by National Science Foundation MRI grant EIA-9871053 and by the Intel Corp.
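The idea of a binary trie over transactions can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each transaction is a set of item indices in a fixed range, stores one trie level per item, and keeps a count at every node. The horizontal pointers, Patricia compression, and dominant/deviant pruning described in the abstract are omitted, and the names `TrieNode`, `BinaryTrie`, `insert`, and `frequency` are hypothetical.

```python
class TrieNode:
    """One node of the binary trie; count = transactions passing through it."""
    __slots__ = ("count", "children")

    def __init__(self):
        self.count = 0
        self.children = [None, None]  # index 0: item absent, index 1: item present


class BinaryTrie:
    """Uncompressed binary trie with one level per item (illustrative only)."""

    def __init__(self, num_items):
        self.num_items = num_items
        self.root = TrieNode()

    def insert(self, transaction):
        """Add one transaction, given as a set of item indices in [0, num_items)."""
        node = self.root
        node.count += 1
        for item in range(self.num_items):
            bit = 1 if item in transaction else 0
            if node.children[bit] is None:
                node.children[bit] = TrieNode()
            node = node.children[bit]
            node.count += 1

    def frequency(self, itemset):
        """Count the transactions that contain every item in itemset."""
        required = set(itemset)
        if not required:
            return self.root.count
        last = max(required)

        def walk(node, depth):
            if node is None:
                return 0
            if depth > last:
                # Every required item has been matched on this path.
                return node.count
            if depth in required:
                # The item must be present: follow only the 1-branch.
                return walk(node.children[1], depth + 1)
            # The item is unconstrained: sum over both branches.
            return walk(node.children[0], depth + 1) + walk(node.children[1], depth + 1)

        return walk(self.root, 0)


# Hypothetical toy data: four transactions over three items.
trie = BinaryTrie(num_items=3)
for t in [{0, 1}, {0, 2}, {0, 1, 2}, {1}]:
    trie.insert(t)
print(trie.frequency({0, 1}))
```

Because identical prefixes share nodes, a query touches trie nodes rather than individual transactions. In this plain version an unconstrained item forces the walk into both branches, which is the cost that auxiliary structures such as horizontal pointers are intended to reduce.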

Similar resources

Itemset Support Queries Using Frequent Itemsets and Their Condensed Representations

The purpose of this paper is two-fold: First, we give efficient algorithms for answering itemset support queries for collections of itemsets from various representations of the frequency information. As index structures we use itemset tries of transaction databases, frequent itemsets and their condensed representations. Second, we evaluate the usefulness of condensed representations of frequent...


Probabilistic Models for Query Approximation with Large Sparse Binary Data Sets

Large sparse sets of binary transaction data with millions of records and thousands of attributes occur in various domains: customers purchasing products, users visiting web pages, and documents containing words are just three typical examples. Real-time query selectivity estimation (the problem of estimating the number of rows in the data satisfying a given predicate) is an important practic...


LoT: Dynamic Declustering of TSB-Tree Nodes for Parallel Access to Temporal Data

In this paper, we consider the problem of exploiting I/O parallelism for efficient access to transaction-time temporal databases. As temporal databases maintain historical versions of records in addition to current ones, we consider range queries in both time dimension and key dimension. Multiple disks can be used to read sets of disk blocks in parallel, thereby improving the performance of suc...


OLAP++: Powerful and Easy-to-Use Federations of OLAP and Object Databases

On-Line Analytical Processing (OLAP) systems provide good performance and ease-of-use when retrieving summary information from very large amounts of data. However, the complex structures and relationships inherent in related non-summary data are not handled well by OLAP systems. In contrast, object database systems are built to handle such complexity, but do not support summary querying well. T...


A Study of the Correspondence Between Users' Search Queries and the Alternative Terms Suggested by Articles in the Bibliographic Records of the Latin Databases EBSCO and IEEE

Purpose: This study aims to investigate the correspondence of users' queries with the alternative terms of Latin databases, namely IEEE and EBSCO. Databases display the subject content of their documents through natural or controlled language vocabularies in specified bibliographic fields, along with other bibliographic information, that are called papers' alternative terms. Methodology: We used content an...


Publication date: 2000