Confidence Decision Trees via Online and Active Learning for Streaming Data
Authors
Abstract
Decision tree classifiers are a widely used tool in data stream mining. The use of confidence intervals to estimate the gain associated with each split leads to very effective methods, like the popular Hoeffding tree algorithm. From a statistical viewpoint, the analysis of decision tree classifiers in a streaming setting requires knowing when enough new information has been collected to justify splitting a leaf. Although some of the issues in the statistical analysis of Hoeffding trees have already been clarified, a general and rigorous study of confidence intervals for splitting criteria is missing. We fill this gap by deriving accurate confidence intervals to estimate the splitting gain in decision tree learning with respect to three criteria: entropy, Gini index, and a third index proposed by Kearns and Mansour. We also extend our confidence analysis to a selective sampling setting, in which the decision tree learner adaptively decides which labels to query in the stream. We provide theoretical guarantees bounding the probability that the decision tree learned via our selective sampling strategy suboptimally classifies the next example in the stream. Experiments on real and synthetic data in a streaming setting show that our trees are indeed more accurate than trees with the same number of leaves generated by state-of-the-art techniques. In addition, our active learning module empirically uses fewer labels without significantly hurting performance.
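To make the idea of a confidence-based split decision concrete, the following is a minimal Python sketch for binary labels. It is not the paper's method: it uses the classical Hoeffding-style test (split when the best gain beats the runner-up by more than a confidence radius), whereas the abstract describes refined intervals whose form is not given here. All function names are hypothetical. The Kearns-Mansour index is taken to be 2*sqrt(q(1-q)), which is an assumption based on their original work and is not stated in the abstract; the Gini index is rescaled to peak at 1 for comparability.

```python
import math


def entropy(q):
    """Binary entropy (in bits) of the positive-class fraction q."""
    if q <= 0.0 or q >= 1.0:
        return 0.0
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)


def gini(q):
    """Gini index, rescaled (assumption) so that its maximum is 1 at q = 0.5."""
    return 4 * q * (1 - q)


def kearns_mansour(q):
    """Index 2*sqrt(q(1-q)); assumed form of the Kearns-Mansour criterion."""
    return 2 * math.sqrt(q * (1 - q))


def split_gain(criterion, left, right):
    """Impurity reduction of a binary split.

    left and right are (n_positive, n_total) counts for each child.
    """
    n = left[1] + right[1]
    if n == 0:
        return 0.0
    parent = criterion((left[0] + right[0]) / n)
    children = 0.0
    for pos, total in (left, right):
        if total > 0:
            children += (total / n) * criterion(pos / total)
    return parent - children


def hoeffding_radius(value_range, delta, n):
    """Classical Hoeffding radius for the mean of n bounded observations."""
    return math.sqrt(value_range ** 2 * math.log(1 / delta) / (2 * n))


def should_split(gains, delta, n, value_range=1.0):
    """Split if the best gain exceeds the second best by more than the radius."""
    best, second = sorted(gains, reverse=True)[:2]
    return best - second > hoeffding_radius(value_range, delta, n)
```

For instance, with 1000 examples at a leaf and delta = 0.05, `should_split([0.30, 0.20], 0.05, 1000)` accepts the split, while `should_split([0.30, 0.29], 0.05, 1000)` waits for more data because the two candidate gains are too close to separate.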
Similar resources
Confidence Decision Trees via Online and Active Learning for Streaming (BIG) Data
Decision tree classifiers are a widely used tool in data stream mining. The use of confidence intervals to estimate the gain associated with each split leads to very effective methods, like the popular Hoeffding tree algorithm. From a statistical viewpoint, the analysis of decision tree classifiers in a streaming setting requires knowing when enough new information has been collected to justify...
Online tree-based ensembles and option trees for regression on evolving data streams
The emergence of ubiquitous sources of streaming data has given rise to the popularity of algorithms for online machine learning. In that context, Hoeffding trees represent the state-of-the-art algorithms for online classification. Their popularity stems in large part from their ability to process large quantities of data with a speed that goes beyond the processing power of any other streaming...
Online Streaming Feature Selection Using Geometric Series of the Adjacency Matrix of Features
Feature Selection (FS) is an important pre-processing step in machine learning and data mining. All the traditional feature selection methods assume that the entire feature space is available from the beginning. However, online streaming features (OSF) are an integral part of many real-world applications. In OSF, the number of training examples is fixed while the number of features grows with t...
On the Use of Provalets in a Predictive Maintenance Use Case
In this paper we report on a predictive maintenance use case using Provalet rule agents for implementing expressive rule-based streaming analytics and decision logic on top of online machine learning prediction models, which are dynamically applied to the streaming data coming from on-board asset monitoring sensors. Provalets are component-based mobile agents for rule-based inference analytics...
Investigating Students' Use of Lecture Videos in Online Courses: A Case Study for Understanding Learning Behaviors via Data Mining
This study investigated students’ learning behaviors in a fully online psychology course which offered 76 streaming lecture videos and supplementary resources, as well as individual and group activities. This paper focuses on students’ use of lecture videos. Data collection included students’ real usage of data on Blackboard Learn 9.1, a course survey, and students’ final grades. The analysis a...
Journal: J. Artif. Intell. Res.
Volume: 60
Publication date: 2017