Discovering Interesting Subsets Using Statistical Analysis

Authors

  • Maitreya Natu
  • Girish Keshav Palshikar
Abstract

This paper presents algorithms for identifying interesting subsets of a given database of records. In many real-life applications, it is important to automatically discover subsets of records that are interesting with respect to a given measure. For example, in a customer support database, it is important to identify subsets of tickets whose service time is unusually large (or small) compared with the service time of the rest of the tickets. We use Student’s t-test to check whether the measure values for a subset and its complement differ significantly. We first discuss the brute-force approach and then present a heuristic-based state-space search algorithm for discovering interesting subsets of the given database. To apply the heuristic-based approach to large data sets, we then present a sampling-based algorithm that combines sampling with the proposed heuristics to identify interesting subsets efficiently. We discuss an application of these techniques to customer support data, where they are used to discover subsets of tickets that have significantly worse (or better) service times than the rest of the tickets.
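The brute-force idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: it assumes that candidate subsets are described by attribute=value combinations, that a subset counts as interesting when the absolute pooled-variance Student's t statistic against its complement exceeds a fixed cutoff, and that very small subsets are skipped. The function names, the `t_threshold` and `min_size` parameters, and the sample tickets are all hypothetical.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance


def student_t(a, b):
    """Two-sample Student's t statistic using the pooled variance."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    if pooled == 0:
        return 0.0  # no spread at all, so no meaningful statistic
    return (mean(a) - mean(b)) / sqrt(pooled * (1 / na + 1 / nb))


def interesting_subsets(records, attributes, measure,
                        t_threshold=2.0, min_size=2):
    """Brute force: test every attribute=value descriptor against its complement.

    t_threshold and min_size are illustrative assumptions, not values
    taken from the paper.
    """
    results = []
    for k in range(1, len(attributes) + 1):
        for attrs in combinations(attributes, k):
            # Value combinations that actually occur in the data.
            seen = {tuple(r[a] for a in attrs) for r in records}
            for values in sorted(seen):
                inside = [r[measure] for r in records
                          if tuple(r[a] for a in attrs) == values]
                outside = [r[measure] for r in records
                           if tuple(r[a] for a in attrs) != values]
                if len(inside) < min_size or len(outside) < min_size:
                    continue
                t = student_t(inside, outside)
                if abs(t) >= t_threshold:
                    results.append((dict(zip(attrs, values)), t))
    return results


# Hypothetical customer support tickets; "hours" is the service time.
tickets = [
    {"priority": "high", "region": "east", "hours": 30},
    {"priority": "high", "region": "west", "hours": 32},
    {"priority": "high", "region": "east", "hours": 29},
    {"priority": "low", "region": "east", "hours": 5},
    {"priority": "low", "region": "west", "hours": 6},
    {"priority": "low", "region": "west", "hours": 4},
    {"priority": "low", "region": "east", "hours": 7},
]

found = interesting_subsets(tickets, ["priority", "region"], "hours")
for descriptor, t in found:
    print(descriptor, round(t, 2))
```

A positive statistic marks a subset with longer service times than its complement, a negative one a subset with shorter times. The number of descriptors grows exponentially with the number of attributes, which is the cost that motivates the heuristic and sampling-based algorithms mentioned in the abstract.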

Similar resources

Discovering occurrences of user-defined patterns in historical data representing collaborative activities in virtual user environment

The paper deals with the analysis of collaborative activities performed in a virtual user environment, with a focus on pattern discovery. All activities are monitored and recorded in a separate database in a defined log format. This log format provides sufficient historical data for various analytical purposes, such as visualization through a timeline or extraction of different statistics based on user expect...

A Comparative Study of Clustering and Biclustering of Microarray Data

There are subsets of genes that have similar behavior under subsets of conditions, so we say that they coexpress, but behave independently under other subsets of conditions. Discovering such coexpressions can be helpful to uncover genomic knowledge such as gene networks or gene interactions. That is why it is of utmost importance to make a simultaneous clustering of genes and conditions to ide...

Twitter data analysis by means of Strong Flipping Generalized Itemsets

Twitter data has recently been used to perform a large variety of advanced analyses. Analysis of Twitter data poses new challenges because the data distribution is intrinsically sparse, owing to the large number of messages posted every day using a wide vocabulary. To address this issue, generalized itemsets, that is, sets of items at different abstraction levels, can be effectively mined an...

Discovering Unique Game Variants

We present a method for computationally discovering diverse playable game variants, using a fine-tuned exploration of game space to explore a game’s design. Using a parameterized implementation of the popular mobile game Flappy Bird, we vary its parameters to create unique and interesting game variants. An evolutionary algorithm is used to find game variants in the playable space which are as f...

A New Study on Biclustering Tools, Biclusters Validation and Evaluation Functions

There are subsets of genes that have similar behavior under subsets of conditions, so we say that they coexpress, but behave independently under other subsets of conditions. Discovering such coexpressions can be helpful to uncover genomic knowledge such as gene networks or gene interactions. That is why it is of utmost importance to make a simultaneous clustering of genes and conditions to iden...

Publication date: 2008