A Rigorous Statistical Approach for Identifying Significant Itemsets

Authors

  • Adam Kirsch
  • Michael Mitzenmacher
  • Andrea Pietracaprina
  • Geppino Pucci
  • Eli Upfal
  • Fabio Vandin
Abstract

As advances in technology allow for the collection, storage, and mining of vast amounts of data, the task of screening and assessing the significance of the discovered patterns is becoming a major challenge in data mining applications. In this work, we address significance in the context of frequent itemset mining. Specifically, we develop a novel methodology to identify a meaningful support threshold s for a dataset, such that the family of frequent itemsets with respect to s embodies a substantial deviation from what would be expected in a random dataset, so that these itemsets can be flagged as significant. Our methodology hinges on a Poisson approximation of the distribution of the number of frequent itemsets of a given size, which is the main theoretical result of the paper. A crucial feature of our approach is that, unlike previous work, it takes into account the entire dataset rather than individual discoveries. It is therefore able to distinguish between significant observations and random fluctuations in the data, resulting in fewer false discoveries. Extensive experiments are reported that substantiate the effectiveness of our methodology.
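As a rough illustration of how a Poisson approximation can be used to judge a support threshold, the sketch below computes the probability that a Poisson variable with mean `lam` reaches the observed number of frequent itemsets. It is a simplified sketch under stated assumptions, not the procedure from the paper: the function names, the `alpha` cutoff, and the idea that `lam` (the expected number of frequent itemsets of the given size in a random dataset) is supplied by the caller are all hypothetical, and no correction for testing several thresholds is applied.

```python
import math


def poisson_tail(observed: int, lam: float) -> float:
    """Return P(X >= observed) for X ~ Poisson(lam).

    Intended for moderate values of lam; exp(-lam) underflows for very large lam.
    """
    if observed <= 0:
        return 1.0
    # Accumulate P(X = 0) + ... + P(X = observed - 1), then take the complement.
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, observed):
        term *= lam / i  # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def is_significant(observed: int, lam: float, alpha: float = 0.05) -> bool:
    """Flag a count as significant if its Poisson tail probability is at most alpha.

    alpha is an illustrative cutoff chosen here, not a value from the paper.
    """
    return poisson_tail(observed, lam) <= alpha
```

For example, if a random dataset is expected to contain on average one frequent itemset of a given size at threshold s, then observing ten such itemsets in the real dataset would be flagged by `is_significant(10, 1.0)`, while observing one would not.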

Similar resources

A Hypotheses-based Method for Identifying Skewed Itemsets

Parallel and distributed association rule mining are very important research subjects, with various work addressing them. Data skewness, which describes the degree of non-uniformity of the itemset distribution among database partitions, causes various problems to parallel and distributed association rule mining algorithms, such as the generation of many false candidate itemsets. However, some a...

Enterprise based approach to Mining Frequent Utility Itemsets from Transactional Database

Data mining can be used extensively in the enterprise based applications with business intelligence characteristics to provide a deeper kind of analysis while meeting strict requirements for administration management and security. Business intelligence is information about a company's past performance that is used to help predict the company's future performance. ARM is a well-known technique i...

Candidate Pruning-Based Differentially Private Frequent Itemsets Mining

Frequent Itemsets Mining (FIM) is a typical data mining task and has gained much attention. Due to the consideration of individual privacy, various studies have been focusing on privacy-preserving FIM problems. Differential privacy has emerged as a promising scheme for protecting individual privacy in data mining against adversaries with arbitrary background knowledge. In this paper, we present ...

On Mining Max Frequent Generalized Itemsets

A fundamental task of data mining is to mine frequent itemsets. Since the number of frequent itemsets may be large, a compact representation, namely the max frequent itemsets, has been introduced. On the other hand, the concept of generalized itemsets was proposed. Here, the items form a taxonomy. Although the transactional database only contains items in the leaf level of the taxonomy, a gener...

An Efficient Method for Mining Frequent Itemsets in Market Basket Data Analysis

Discovery of hidden and valuable knowledge from large data warehouses is an important research area and has attracted the attention of many researchers in recent years. Most of Association Rule Mining (ARM) algorithms start by searching for frequent itemsets by scanning the whole database repeatedly and enumerating the occurrences of each candidate itemset. In data mining problems, the size of ...

Publication date: 2008