Discovering Interesting Subsets Using Statistical Analysis
Abstract
In this paper we present algorithms for identifying interesting subsets of a given database of records. In many real-life applications, it is important to automatically discover subsets of records that are interesting with respect to a given measure. For example, in a customer support database, it is important to identify subsets of tickets whose service time is much larger (or much smaller) than the service time of the rest of the tickets. We use Student’s t-test to check whether the measure values for a subset and its complement differ significantly. We first discuss the brute-force approach and then present a heuristic-based state-space search algorithm to discover interesting subsets of the given database. To apply the proposed heuristic-based approach to large data sets, we then present a sampling-based algorithm that combines sampling with the proposed heuristics to efficiently identify interesting subsets. We discuss an application of these techniques to customer support data, to discover subsets of tickets that have significantly worse (or better) service times than the rest of the tickets.
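The subset-versus-complement comparison described in the abstract can be illustrated with a minimal sketch. It covers only a brute-force pass over single attribute=value subsets, not the paper's heuristic search or sampling algorithms, which are not detailed here. The function names, the sample `tickets` records, and the cutoff `threshold=2.0` are assumptions made for illustration, and the pooled-variance form of Student's t statistic is used.

```python
from math import sqrt
from statistics import mean, variance


def student_t(a, b):
    """Two-sample Student's t statistic with pooled variance.

    Returns (t, degrees_of_freedom), or None if both samples have zero variance.
    """
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    se = sqrt(pooled * (1 / na + 1 / nb))
    if se == 0:
        return None
    return (mean(a) - mean(b)) / se, df


def interesting_subsets(records, attributes, measure, threshold=2.0):
    """Brute force over single attribute=value subsets (illustrative only).

    A subset is reported when |t| between the subset and its complement
    reaches `threshold`, a hypothetical cutoff rather than a p-value.
    """
    results = []
    for attr in attributes:
        for value in sorted({r[attr] for r in records}):
            inside = [r[measure] for r in records if r[attr] == value]
            outside = [r[measure] for r in records if r[attr] != value]
            if len(inside) < 2 or len(outside) < 2:
                continue  # sample variance needs at least two points per side
            stat = student_t(inside, outside)
            if stat is not None and abs(stat[0]) >= threshold:
                results.append((attr, value, stat[0], stat[1]))
    return sorted(results, key=lambda item: -abs(item[2]))


# Hypothetical customer support tickets (made-up values).
tickets = [
    {"region": "north", "priority": "high", "service_time": 30.0},
    {"region": "north", "priority": "low", "service_time": 32.0},
    {"region": "north", "priority": "high", "service_time": 29.0},
    {"region": "south", "priority": "low", "service_time": 10.0},
    {"region": "south", "priority": "high", "service_time": 12.0},
    {"region": "south", "priority": "low", "service_time": 11.0},
]

for attr, value, t, df in interesting_subsets(
    tickets, ["region", "priority"], "service_time"
):
    print(f"{attr}={value}: t={t:.2f}, df={df}")
```

A positive t means the subset has a larger mean measure than its complement (slower service in this example), and a negative t means a smaller one. In practice the statistic would be converted to a p-value using the t distribution with the returned degrees of freedom.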
Similar resources
Discovering occurrences of user-defined patterns in historical data representing collaborative activities in virtual user environment
The paper deals with analyses of performed collaborative activities in virtual user environment, focused on pattern discovering. All activities are monitored and recorded into separate database within defined log format. This log format provides sufficient historical data for various analytical purposes as visualization through timeline or extraction of different statistics based on user expect...
A Comparative Study of Clustering and Biclustering of Microarray Data
There are subsets of genes that have similar behavior under subsets of conditions, so we say that they coexpress, but behave independently under other subsets of conditions. Discovering such coexpressions can be helpful to uncover genomic knowledge such as gene networks or gene interactions. That is why, it is of utmost importance to make a simultaneous clustering of genes and conditions to ide...
Twitter data analysis by means of Strong Flipping Generalized Itemsets
Twitter data has recently been used to perform a large variety of advanced analyses. Analysis of Twitter data imposes new challenges because the data distribution is intrinsically sparse, due to the large number of messages posted every day using a wide vocabulary. Aimed at addressing this issue, generalized itemsets (sets of items at different abstraction levels) can be effectively mined an...
Discovering Unique Game Variants
We present a method for computationally discovering diverse playable game variants, using a fine-tuned exploration of game space to explore a game’s design. Using a parameterized implementation of the popular mobile game Flappy Bird, we vary its parameters to create unique and interesting game variants. An evolutionary algorithm is used to find game variants in the playable space which are as f...
A New Study on Biclustering Tools, Biclusters Validation and Evaluation Functions
There are subsets of genes that have similar behavior under subsets of conditions, so we say that they coexpress, but behave independently under other subsets of conditions. Discovering such coexpressions can be helpful to uncover genomic knowledge such as gene networks or gene interactions. That is why, it is of utmost importance to make a simultaneous clustering of genes and conditions to iden...