Adjusting for Multiple Testing in Decision Tree Pruning
Author
Abstract
Overfitting is a widely observed pathology of induction algorithms. Overfitted models contain unnecessary structure that reflects nothing more than chance variations in the particular data sample used to construct the model. Portions of these models are literally wrong, and can mislead users. Overfitted models require more storage space and take longer to execute than their correctly-sized counterparts. Finally, overfitting has been shown to reduce the accuracy of induced models on new data [13, 6].

For induction algorithms that build decision trees [1, 12, 14], pruning is a common approach to correcting overfitting. Pruning techniques take an induced tree, examine individual subtrees, and remove those subtrees deemed to be unnecessary. While pruning techniques can differ in several ways, their primary differences concern the criteria used to judge subtrees. Many criteria have been proposed, including statistical significance tests [8], pessimistic error estimates [14], and minimum description length calculations [11].

Most common pruning techniques, however, do not account for one potentially important factor: multiple testing. Multiple testing occurs whenever an induction algorithm examines several candidate models and selects the one that best accords with the data. Any search process necessarily involves multiple testing, and most common induction algorithms involve implicit or explicit search through a space of candidate models. In the case of decision trees, search involves examining many possible subtrees and selecting the best one. Pruning techniques need to account for the number of subtrees examined, because such multiple testing affects the apparent accuracy of models on training data [7].

This paper examines the importance of adjusting for multiple testing. Specifically, it examines the effectiveness of one particular pruning method: Bonferroni pruning.
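The effect of multiple testing on apparent accuracy can be illustrated with a small simulation (this is an illustrative sketch, not an experiment from the paper): every candidate below is a purely random binary classifier with true accuracy 0.5, yet selecting the best of many candidates on the training data yields an inflated apparent accuracy.

```python
import random

random.seed(0)

def apparent_accuracy(n_candidates, n_train=50, trials=2000):
    """Average training-set accuracy of the best of n_candidates
    uninformative (random) binary classifiers, over many trials."""
    total = 0.0
    for _ in range(trials):
        labels = [random.randint(0, 1) for _ in range(n_train)]
        best = 0.0
        for _ in range(n_candidates):
            preds = [random.randint(0, 1) for _ in range(n_train)]
            acc = sum(p == y for p, y in zip(preds, labels)) / n_train
            best = max(best, acc)
        total += best
    return total / trials

# Every candidate's true accuracy is 0.5, but picking the best of
# 50 candidates makes the winner look substantially better than chance
# on the training data -- without any real predictive power.
print(apparent_accuracy(1))    # close to 0.5
print(apparent_accuracy(50))   # noticeably above 0.5
```

This is exactly the inflation that a pruning criterion sees when it compares many candidate subtrees against the same training sample.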
Bonferroni pruning adjusts the results of a standard significance test to account for the number of subtrees examined at a particular node of a decision tree. Evidence that Bonferroni pruning leads to better models supports the hypothesis that multiple testing is an important cause of overfitting.

The next section briefly surveys several relevant approaches to decision tree pruning. Section 3 presents the results of an experiment comparing Bonferroni pruning to other pruning techniques. The final section discusses the implications of the experiment and the content of the full paper.
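The core of the adjustment can be sketched as follows. This is a minimal illustration of the classical Bonferroni bound, not the paper's exact procedure; the function names, the simple p × n correction, and the 0.05 threshold are assumptions chosen for clarity.

```python
def bonferroni_adjust(p_value, n_tests):
    """Bonferroni correction: upper-bound the probability that at
    least one of n_tests tests attains a p-value this small by chance."""
    return min(1.0, p_value * n_tests)

def keep_subtree(best_p, n_subtrees_examined, alpha=0.05):
    """Retain a subtree only if its best significance test survives
    adjustment for the number of subtrees examined at the node."""
    return bonferroni_adjust(best_p, n_subtrees_examined) < alpha

# A p-value of 0.01 looks significant when one subtree is tested ...
print(keep_subtree(0.01, 1))    # True
# ... but not after examining 20 candidate subtrees at the same node.
print(keep_subtree(0.01, 20))   # False
```

The effect is conservative by design: the more candidate subtrees an algorithm examines, the stronger the evidence required before any one of them is retained.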