Strata decision tree: a symbolic data analysis technique

نویسنده

  • Carmen Bravo
چکیده

Symbolic Data Analysis allow for the extension of Data Analysis to symbolic data, that is, more complex data structures than classical or monovalued data. Symbolic data are the symbolic descriptions given to individuals ω ∈ Ω associated to symbolic variables. For example, symbolic descriptions may be given by probability distributions, sets of categories or intervals. Strata Decision Tree is a generalized recursive tree-building algorithm (Breiman et al., 1984) for populations partitioned into strata and described by symbolic data. Common predictor and class variables describe population in all strata. Tree-growing methods are in general recursive non-parametric classification methods that recursively obtain the binary partition in the population which maximizes an information content measure of the new splits with respect to predefined classes. Our segmentation method is applied when the underlying population is not only subdivided into the classes to be investigated, but it is simultaneously composed of groups of individuals, which will be called strata such as when individuals of a country are partitioned into regions, individuals of a region are partitioned into towns, companies are partitioned into economic sectors, etc... Then, it is interesting to explain how strata influence the classification rules obtained by a tree-segmentation method. Some rules are applicable to some strata, and others are not. Our aim is: to predict or to explain the value of the class variable for individuals by the predictors, conditioned on the stratum; to explain how these predictions are affected by stratum membership; to detect sets of strata for which this explanation is the same; to describe a stratum symbolically by the set of prediction rules which apply to it with their importance. The algorithm is applied to classical data and to probabilistic symbolic data for predictors. A probabilistic-modal variable can be the variable Employment, which associates probability distributions to individuals ω. For example, the distribution given by (yes(0.8),no(0.2)) can represent either a subpopulation with 20% of them unemployed, or an individual with 20% of his/her working life unemployed. The algorithm also considers other kinds of symbolic data, as hierarchical dependencies between variables. Hierarchical dependence can be represented by non-applicable rules, as for example, if (smoke=no), then cigarrete mark=Non applicable. As output of the algorithm, symbolic objects describe decisional nodes and strata. Tree nodes are represented by assertions and strata are generalized by sets of assertions. A symbolic object provides a (symbolic) description of the properties of individuals in terms of (symbolic) variables, combined with an operator for selecting individu-

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A New Algorithm for Optimization of Fuzzy Decision Tree in Data Mining

Decision-tree algorithms provide one of the most popular methodologies for symbolic knowledge acquisition. The resulting knowledge, a symbolic decision tree along with a simple inference mechanism, has been praised for comprehensibility. The most comprehensible decision trees have been designed for perfect symbolic data. Classical crisp decision trees (DT) are widely applied to classification t...

متن کامل

بررسی کارایی مدل درختان تصمیم‌گیری در برآورد رسوبات معلق رودخانه‌ای (مطالعه موردی: حوضه سد ایلام)

The real estimation of the volume of sediments carried by rivers in water projects is very important. In fact, achieving the most important ways to calculate sediment discharge has been considered as the objective of the most research projects. Among these methods, the machine learning methods such as decision trees model (that are based on the principles of learning) can be presented. Decision...

متن کامل

Genetic programming system for building block analysis to enhance data analysis and data mining techniques

Recently, many computerized data mining tools and environments have been proposed for finding interesting patters in large data collections. These tools employ techniques that originate from research in various areas such as machine learning, statistical data analysis, and visualization. Each of these techniques makes assumptions concerning the composition of the data collection. If the particu...

متن کامل

Data Mining for Management and Rehabilitation of Water Systems: The Evolutionary Polynomial Regression Approach

Risk-based management and rehabilitation of water distribution systems requires that company asset data are collected and also that a methodology is available to efficiently extract information from data. The process of extracting useful information from data is called knowledge discovery and at its core is data mining. This automated analysis of large or complex datasets is performed to determ...

متن کامل

Optimized Fuzzy Decision Tree for Structured Continuous-Label Classification

Mainly understandable decision trees have been intended for perfect symbolic data. Conventional crisp decision trees (DT) are extensively used for classification purpose. However, there are still many issues particularly when we used the numerical (continuous valued) attributes. Structured continuouslabel classification is one type of classification in which the label is continuous in the data....

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2002