Active Mining of Data Streams
Abstract
Most previously proposed methods for mining data streams make the unrealistic assumption that a “labeled” data stream is readily available and can be mined at any time. However, in most real-world problems, labeled data streams are rarely immediately available. For this reason, models are refreshed periodically, usually in step with the data availability schedule. This “passive periodic refresh” has several undesirable consequences. In this paper, we propose a new concept of demand-driven active data mining. It estimates the error of the model on the new data stream without knowing the true class labels. When a significantly higher error is suspected, it investigates the true class labels of a selected number of examples in the most recent data stream to verify the suspected higher error.
1 State-of-the-art Stream Mining
State-of-the-art work on mining data streams concentrates on capturing time-evolving trends and patterns with “labeled” data. However, one important aspect that is often ignored or unrealistically assumed is the availability of “class labels” for data streams. Most algorithms make an implicit and impractical assumption that labeled data is readily available. Most works focus on how to detect a change in patterns and how to update the model to reflect such changes when there are “labeled” instances to learn from. However, for many applications, the class labels are not “immediately” available unless dedicated effort and substantial cost are spent to investigate these labels right away. If the true class labels were readily available, data mining models would not be very useful; we might just wait. For example, in credit card fraud detection, we usually do not know whether a particular transaction is fraudulent until at least one month later, after the account holder receives and reviews the monthly statement.
For these reasons, most current applications obtain class labels and update existing models at a preset frequency, usually synchronized with data refresh. The effectiveness of this passive mode is dictated by “statutory and static constraints” rather than by the “demand” for a better model with a lower loss. Mining data streams in such a passive mode has a number of potentially undesirable consequences that contradict the notions of “streaming” and “continuous”. First, it may incur higher loss due to neglected pattern drifts. If either the concept or the data distribution drifts rapidly at an unforecasted rate that the statutory constraints cannot keep up with, the models are likely to be out of date on the data stream, and important business opportunities might be missed. Second, it may cause unnecessary model refreshes. If there is neither conceptual nor distributional change, periodic passive model refresh and re-validation is a waste of resources.
1.1 Demand-driven Active Mining of Data Streams
We propose a demand-driven active stream data mining process that addresses the problems of passive stream data mining. In summary, our particular implementation of active stream data mining has three simple steps:
1. Detect potential changes in data streams “on the fly” while the existing model classifies continuous data streams. The detection process does not use or know any true labels of the stream. One of the change detection methods is a “guess” of the actual loss or error rate of the model on the new data stream.
2. If the guessed loss or error rate of the model in step 1 is much higher than an application-specific tolerable maximum, we choose a small number of data records in the new data stream and investigate their true class labels. With these true class labels, we statistically estimate the true loss of the model.
3. If the statistically estimated loss in step 2 is verified to be higher than the tolerable maximum, we reconstruct the old model by using the same true class labels sampled in the previous step.
In this paper, we concentrate on the first two steps. Our particular implementation extends classification trees.
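The first two steps can be illustrated with a minimal Python sketch. It is not the paper's classification-tree method: as a stand-in, the label-free "guess" is taken to be the mean of one minus the model's probability for its predicted class, and the verification step uses a normal-approximation confidence interval on a random sample. The names `guessed_error`, `sampled_error`, `active_check`, and `label_oracle` are hypothetical, and the rule "rebuild only when the interval's lower bound exceeds the tolerance" is an assumption made for illustration.

```python
import math
import random
from typing import Callable, Sequence, Tuple


def guessed_error(class_probabilities: Sequence[Sequence[float]]) -> float:
    """Step 1 stand-in: guess the error rate without any true labels.

    Assumption: the model outputs class probabilities, so the expected
    error on a record is 1 - (probability of the predicted class).
    """
    if not class_probabilities:
        raise ValueError("need at least one record")
    total = sum(1.0 - max(p) for p in class_probabilities)
    return total / len(class_probabilities)


def sampled_error(
    predictions: Sequence[int],
    label_oracle: Callable[[int], int],
    sample_size: int,
    rng: random.Random,
    z: float = 1.96,
) -> Tuple[float, float, float]:
    """Step 2: estimate the true error from a small labeled sample.

    label_oracle(i) stands for the costly investigation of record i's
    true class. Returns (mean, lower, upper) of a normal-approximation
    interval, clipped to [0, 1].
    """
    n = min(sample_size, len(predictions))
    if n <= 0:
        raise ValueError("need a positive sample size and some records")
    chosen = rng.sample(range(len(predictions)), n)
    mistakes = sum(1 for i in chosen if label_oracle(i) != predictions[i])
    mean = mistakes / n
    half_width = z * math.sqrt(mean * (1.0 - mean) / n)
    return mean, max(0.0, mean - half_width), min(1.0, mean + half_width)


def active_check(
    class_probabilities: Sequence[Sequence[float]],
    predictions: Sequence[int],
    label_oracle: Callable[[int], int],
    tolerance: float,
    sample_size: int,
    rng: random.Random,
) -> str:
    """Return "rebuild" only when the suspected higher error is verified."""
    # Step 1: no labels are requested while the guess stays within tolerance.
    # (Simplification: "much higher" is reduced to a plain comparison.)
    if guessed_error(class_probabilities) <= tolerance:
        return "keep"
    # Step 2: pay for a few labels and verify the suspicion statistically.
    _, lower, _ = sampled_error(predictions, label_oracle, sample_size, rng)
    return "rebuild" if lower > tolerance else "keep"
```

In this sketch the oracle is consulted only after the label-free guess raises suspicion, which mirrors the demand-driven idea: labeling cost is incurred when a change is suspected, not on a fixed schedule.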
Similar Resources
Mining Frequent Patterns in Uncertain and Relational Data Streams using the Landmark Windows
Today, in many modern applications, we search for frequent and repeating patterns in the analyzed data sets. In this search, we look for patterns that appear frequently in a data set and mark them as frequent patterns so that users can make decisions based on these discoveries. Most algorithms presented in the context of data stream mining and frequent pattern detection work either on uncertai...
Mining Multi-Label Data Streams Using Ensemble-Based Active Learning
Data stream classification has drawn increasing attention from the data mining community in recent years, where a large number of stream classification models were proposed. However, most existing models were merely focused on mining from single-label data streams. Mining from multi-label data streams has not been fully addressed yet. On the other hand, although some recent work touched the mul...
Mining Time-Changing Data Streams
Streaming data have gained considerable attention in database and data mining communities because of the emergence of a class of applications, such as financial marketing, sensor networks, internet IP monitoring, and telecommunications that produce these data. Data streams have some unique characteristics that are not exhibited by traditional data: unbounded, fast-arriving, and time-changing. T...
Mining maximal frequent itemsets from data streams
Frequent pattern mining from data streams is an active research topic in data mining. Existing research efforts often rely on a two-phase framework to discover frequent patterns: (1) using internal data structures to store meta-patterns obtained by scanning the stream data; and (2) re-mining the meta-patterns to finalize and output frequent patterns. The defectiveness of such a two-phase framew...
Active learning for sentiment analysis on data streams: Methodology and workflow implementation in the ClowdFlows platform
Sentiment analysis from data streams is aimed at detecting authors' attitude, emotions and opinions from texts in real-time. To reduce the labeling effort needed in the data collection phase, active learning is often applied in streaming scenarios, where a learning algorithm is allowed to select new examples to be manually labeled in order to improve the learner's performance. Even though...
Mining Multidimensional Sequential Patterns over Data Streams
Sequential pattern mining is an active field in the domain of knowledge discovery and has been widely studied for over a decade by data mining researchers. Increasingly, with the constant progress in hardware and software technologies, real-world applications like network monitoring systems or sensor grids generate huge amounts of streaming data. This new data model, seen as a potentially infin...