Introduction to stream: An Extensible Framework for Data Stream Clustering Research with R
Abstract
In recent years, data streams have become an increasingly important area of research for the computer science, database and statistics communities. Data streams are ordered and potentially unbounded sequences of data points created by a typically non-stationary data-generating process. Common data mining tasks associated with data streams include clustering, classification and frequent pattern mining. New algorithms for these types of data are proposed regularly, and it is important to evaluate them thoroughly under standardized conditions. In this paper we introduce stream, a research tool that supports modeling and simulating data streams and provides an extensible framework for implementing, interfacing and experimenting with algorithms for various data stream mining tasks. The main advantage of stream is that it integrates seamlessly with the large existing infrastructure provided by R. In addition to data handling, plotting and easy scripting capabilities, R provides many existing algorithms and enables users to interface code written in many programming languages popular among data mining researchers (e.g., C/C++, Java and Python). In this paper we describe the architecture of stream and focus on its use for data stream clustering research. stream was implemented with extensibility in mind and will be extended in the future to cover additional data stream mining tasks such as classification and frequent pattern mining.
Similar resources
An Algorithm for Streaming Clustering
A simple existing data stream clustering algorithm, DenStream, which is based on DBSCAN, is studied. Based on DenStream, a modified algorithm called DenStream2 is proposed. It follows most of the framework and theory of DenStream. DenStream2 is implemented as a foreign function in an extensible data stream management system (DSMS), where queries over streams are allowed. The generated clusters inferred from...
rEMM: Extensible Markov Model for Data Stream Clustering in R
Clustering streams of continuously arriving data has become an important application of data mining in recent years, and efficient algorithms have been proposed by several researchers. However, clustering alone neglects the fact that data in a data stream is characterized not only by the proximity of data points, which clustering uses, but also by a temporal component. The extensible Markov...
Detecting Concept Drift in Data Stream Using Semi-Supervised Classification
A data stream is a sequence of data generated from various information sources at high speed and in high volume. Classifying data streams faces three challenges: unlimited length, online processing, and concept drift. In related research, the challenge of unlimited stream length is commonly met by dividing the stream into fixed-size windows or by gradual forgetting. Concept drift refer...
Temporal Structure Learning for Clustering Massive Data Streams in Real-Time
This paper describes one of the first attempts to model the temporal structure of massive data streams in real time using data stream clustering. Recently, many data stream clustering algorithms have been developed that efficiently find a partition of the data points in a data stream. However, these algorithms disregard the information represented by the temporal order of the data points in th...
STREAM CORRIDORS AS INVALUABLE URBAN ELEMENTS: SUGGESTIONS FOR IMPROVEMENT OF PAVEH STREAM
The study addresses the importance of urban stream ecosystems from the perspective of urban ecology, human health and social well-being in the context of urban planning. The case study area is the Paveh stream in the City of Paveh. Data for the case study area were gathered from questionnaires, existing scientific and library studies, and interviews with residents and auth...