Subjectively Interesting Subgroup Discovery on Real-valued Targets
Abstract
Deriving insights from high-dimensional data is one of the core problems in data mining. The difficulty stems mainly from the exponential number of variable combinations to consider, and the number becomes infinite once weighted combinations are allowed, even if these are restricted to linear ones. Hence, an obvious question is whether we can automate the search for interesting patterns and visualizations. In this paper, we consider the setting where a user wants to learn as efficiently as possible about real-valued attributes, for example, to understand the distribution of crime rates in different geographic areas in terms of other (numerical, ordinal and/or categorical) variables that describe the areas. We introduce a method to find subgroups in the data that are maximally informative (in the formal information-theoretic sense) with respect to a single real-valued target attribute or a set of them. The subgroups are described by a succinct set of other attributes of arbitrary type. The approach is based on the subjective interestingness framework FORSIED, which enables the use of prior knowledge when searching for the most informative non-redundant patterns; hence the method also supports iterative data mining.
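The general idea of scoring subgroups by how informative they are about a real-valued target can be illustrated with a toy sketch. This is not the paper's algorithm: it assumes a simple Gaussian background model whose mean and variance are taken from the full data, measures the information content of a subgroup as the negative log density of its observed mean under that model, and divides by the number of conditions as a stand-in for description length. The function names, the `min_support` parameter, and the example data are all hypothetical.

```python
import math
from itertools import combinations


def information_content(values, prior_mean, prior_var):
    """Negative log density of the subgroup mean under N(prior_mean, prior_var / n).

    Assumption: the user's prior belief is a Gaussian with known mean and variance.
    """
    n = len(values)
    subgroup_mean = sum(values) / n
    var_of_mean = prior_var / n
    return (0.5 * math.log(2 * math.pi * var_of_mean)
            + (subgroup_mean - prior_mean) ** 2 / (2 * var_of_mean))


def find_best_subgroup(rows, target, max_conditions=2, min_support=2):
    """Exhaustively score conjunctions of `attribute == value` conditions.

    Assumes every row has every attribute and the target is not constant.
    """
    targets = [r[target] for r in rows]
    prior_mean = sum(targets) / len(targets)
    prior_var = sum((t - prior_mean) ** 2 for t in targets) / len(targets)

    # All (attribute, value) pairs seen in the data, excluding the target.
    conditions = sorted({(a, v) for r in rows for a, v in r.items() if a != target})

    best = None
    for k in range(1, max_conditions + 1):
        for description in combinations(conditions, k):
            covered = [r[target] for r in rows
                       if all(r[a] == v for a, v in description)]
            if len(covered) < min_support:
                continue  # also skips contradictory descriptions (empty cover)
            # Hypothetical interestingness: information per condition used.
            score = information_content(covered, prior_mean, prior_var) / k
            if best is None or score > best[1]:
                best = (description, score)
    return best


# Made-up example data: areas described by two categorical attributes.
rows = [
    {"urban": "yes", "income": "low", "crime_rate": 9.0},
    {"urban": "yes", "income": "low", "crime_rate": 8.0},
    {"urban": "yes", "income": "high", "crime_rate": 7.0},
    {"urban": "no", "income": "low", "crime_rate": 3.0},
    {"urban": "no", "income": "high", "crime_rate": 2.0},
    {"urban": "no", "income": "high", "crime_rate": 1.0},
    {"urban": "no", "income": "low", "crime_rate": 2.0},
    {"urban": "no", "income": "high", "crime_rate": 2.0},
]
best = find_best_subgroup(rows, target="crime_rate")
print(best)
```

In an iterative setting, the background model would be updated after each pattern is shown so that already-known subgroups stop scoring highly; that step is omitted here.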
Similar resources
A Non-radial Approach for Setting Integer-valued Targets in Data Envelopment Analysis
Data Envelopment Analysis (DEA) has been widely studied in the literature since its inception with the work of Charnes, Cooper and Rhodes in 1978. The methodology behind the classical DEA method is to determine how much improvement in the output (input) dimensions is necessary in order to render them efficient. One of the underlying assumptions of this methodology is that the units consume and prod...
Maximum Entropy Models for Iteratively Identifying Subjectively Interesting Structure in Real-Valued Data
In exploratory data mining it is important to assess the significance of results. Given that analysts have only limited time, it is important that we can measure this with regard to what we already know. That is, we want to be able to measure whether a result is interesting from a subjective point of view. With this as our goal, we formalise how to probabilistically model real-valued data by th...
Towards Knowledge-Intensive Subgroup Discovery
Subgroup discovery can be applied for exploration or descriptive induction in order to discover "interesting" subgroups of the general population, given a certain property of interest. In domains with available background knowledge, the user usually wants to utilize this to improve the quality of the subgroup discovery results. We describe a knowledge-intensive approach for subgroup discovery u...
Association Analysis for Real-valued Data: Definitions and Application to Microarray Data
The discovery of biclusters, which denote groups of items that show coherent values across a subset of all the transactions in a data set, is an important type of analysis performed on real-valued data sets in several domains, such as biology. Several algorithms have been proposed to find different types of biclusters in such data sets. However, the search schemes used by these algorithms are u...
Novel Techniques for Efficient and Effective Subgroup Discovery
Large volumes of data are collected today in many domains. Often, there is so much data available, that it is difficult to identify the relevant pieces of information. Knowledge discovery seeks to obtain novel, interesting and useful information from large datasets. One key technique for that purpose is subgroup discovery. It aims at identifying descriptions for subsets of the data, which have ...
Journal title:
- CoRR
Volume abs/1710.04521
Pages -
Publication date 2017