Optimal instance selection for improved decision tree
Abstract
Instance selection plays an important role in improving the scalability of data mining algorithms, but it can also be used to improve the quality of the data mining results. In this dissertation we present a new optimization-based approach to instance selection that uses a genetic algorithm (GA) to select a subset of instances that produces a simpler decision tree with acceptable accuracy. The resulting trees are likely to be easier for the decision maker to comprehend and interpret, and hence more useful in practice. We present numerical results for several difficult test datasets, which indicate that GA-based instance selection can often reduce the size of the decision tree by an order of magnitude while still maintaining good prediction accuracy. The results suggest that GA-based instance selection works best for low-entropy datasets; the higher the entropy, the smaller the benefit from instance selection. A comparison between the GA and other heuristic approaches, such as random mutation hill climbing (RMHC) and a simple construction heuristic, indicates that the GA is able to obtain a good solution at low computational cost, even for some large datasets. One advantage of instance selection is that it increases the average number of instances associated with the leaves of the decision tree, which helps avoid overfitting; instance selection can therefore serve as an effective alternative to pruning decision trees. Finally, an analysis of the selected instances reveals that instance selection helps to reduce outliers, reduce missing values, and select the instances most useful for separating classes.
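The idea described in the abstract can be illustrated with a small, self-contained sketch. It is not the dissertation's implementation: the toy tree learner, the synthetic data in `make_data`, the fitness function (accuracy on the full dataset minus a per-leaf `penalty`), and all GA settings are assumptions chosen only to show how a bitmask over instances can be evolved toward a smaller tree.

```python
import random

def gini(labels):
    """Gini impurity for binary (0/1) labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def build_tree(rows):
    """Grow an unpruned binary tree from (features, label) rows.
    A leaf is an int class; an internal node is (feature, threshold, left, right)."""
    labels = [y for _, y in rows]
    majority = int(sum(labels) * 2 >= len(labels))
    if len(set(labels)) <= 1:
        return majority
    best = None
    for f in range(len(rows[0][0])):
        for t in sorted({x[f] for x, _ in rows})[1:]:
            left = [y for x, y in rows if x[f] < t]
            right = [y for x, y in rows if x[f] >= t]
            score = len(left) * gini(left) + len(right) * gini(right)
            if best is None or score < best[0]:
                best = (score, f, t)
    if best is None:  # identical feature vectors with mixed labels
        return majority
    _, f, t = best
    return (f, t,
            build_tree([r for r in rows if r[0][f] < t]),
            build_tree([r for r in rows if r[0][f] >= t]))

def predict(tree, x):
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if x[f] < t else right
    return tree

def leaves(tree):
    """Tree size measured as the number of leaves."""
    return 1 if not isinstance(tree, tuple) else leaves(tree[2]) + leaves(tree[3])

def fitness(mask, data, penalty=0.01):
    """Assumed objective: accuracy on all instances minus a size penalty."""
    subset = [row for keep, row in zip(mask, data) if keep]
    if not subset:
        return -1.0
    tree = build_tree(subset)
    acc = sum(predict(tree, x) == y for x, y in data) / len(data)
    return acc - penalty * leaves(tree)

def select_instances(data, pop_size=16, generations=20, mut_rate=0.02, seed=0):
    """Evolve a 0/1 mask over instances; settings are illustrative only."""
    rng = random.Random(seed)
    n = len(data)
    cache = {}
    def score(mask):
        key = tuple(mask)
        if key not in cache:
            cache[key] = fitness(mask, data)
        return cache[key]
    # Seed the population with the "keep everything" mask plus random masks.
    pop = [[1] * n] + [[rng.randint(0, 1) for _ in range(n)]
                       for _ in range(pop_size - 1)]
    for _ in range(generations):
        ranked = sorted(pop, key=score, reverse=True)
        nxt = ranked[:2]  # elitism: the two best masks survive unchanged
        while len(nxt) < pop_size:
            a = max(rng.sample(ranked, 3), key=score)  # tournament selection
            b = max(rng.sample(ranked, 3), key=score)
            child = [rng.choice(pair) for pair in zip(a, b)]  # uniform crossover
            child = [bit ^ (rng.random() < mut_rate) for bit in child]  # bit flips
            nxt.append(child)
        pop = nxt
    return max(pop, key=score)

def make_data(n=40, noise=0.15, seed=1):
    """Hypothetical noisy data: class depends on the first feature only."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = (rng.random(), rng.random())
        y = int(x[0] > 0.5)
        if rng.random() < noise:
            y = 1 - y
        data.append((x, y))
    return data

data = make_data()
full_mask = [1] * len(data)
best_mask = select_instances(data)
full_tree = build_tree(data)
best_tree = build_tree([r for keep, r in zip(best_mask, data) if keep])
print("leaves (all instances):", leaves(full_tree))
print("leaves (selected instances):", leaves(best_tree))
```

Because the initial population contains the all-instances mask and the two best masks are carried over each generation, the returned mask never scores worse than using every instance under this assumed objective. Whether the tree actually shrinks, and by how much, depends on the data and on the penalty weight.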
Similar resources
A Decision Tree for Technology Selection of Nitrogen Production Plants
Nitrogen is produced mainly from its most abundant source, the air, using three processes: membrane, pressure swing adsorption (PSA), and cryogenic. The most common method for evaluating a process is to use selection diagrams based on feasibility studies. Since the selection diagrams are presented by different companies, they are biased, and provide dissimilar and even controversial results. I...
Prosodic Boundary Prediction for Greek Speech Synthesis
In this article, we evaluate features and algorithms for the task of prosodic boundary prediction for Greek. For this purpose, a prosodic corpus composed of generic-domain text was constructed. Feature contribution was evaluated and ranked by applying information gain ranking and correlation-based feature selection filtering methods. The resulting datasets were applied to C4.5 decision t...
Anomaly Detection Using SVM as Classifier and Decision Tree for Optimizing Feature Vectors
With the advancement and development of computer network technologies, the way for intruders has become smoother; therefore, to detect threats and attacks, the importance of intrusion detection systems (IDS) as one of the key elements of security is increasing. One of the challenges of intrusion detection systems is managing the large number of network traffic features. Removing un...
Instance selection for simplified decision trees through the generation and selection of instance candidate subsets
IFSB-ReliefF: A New Instance and Feature Selection Algorithm Based on ReliefF
The growing use of the Internet and phenomena such as sensor networks has led to an unnecessary increase in the volume of information. Though this has many benefits, it also causes problems, such as the need for more storage space and better processors, as well as for data refinement to remove unnecessary data. Data reduction methods provide ways to select useful data from a large amount of duplicate, incomp...