Generalised interaction mining: probabilistic, statistical and vectorised methods in high dimensional or uncertain databases
Author
Abstract
Knowledge Discovery in Databases (KDD) is the non-trivial process of identifying valid, novel, useful and ultimately understandable patterns in data. The core step of the KDD process is the application of Data Mining (DM) algorithms to efficiently find interesting patterns in large databases. This thesis concerns itself with three inter-related themes: generalised interaction and rule mining; the incorporation of statistics into novel data mining approaches; and probabilistic frequent pattern mining in uncertain databases. An interaction describes an effect that variables have, or appear to have, on each other. Interaction mining is the process of mining structures on variables describing their interaction patterns, usually represented as sets, graphs or rules. Interactions may be complex, represent both positive and negative relationships, and the presence of interactions can influence another interaction or variable in interesting ways. Finding interactions is useful in domains ranging from social network analysis, marketing, the sciences and e-commerce to statistics and finance. Many data mining tasks may be considered as mining interactions, such as clustering; frequent itemset mining; association rule mining; classification rules; graph mining; flock mining; etc. Interaction mining problems can have very different semantics, pattern definitions, interestingness measures and data types. Solving a wide range of interaction mining problems at the abstract level, and doing so efficiently, ideally more efficiently than with specialised approaches, is a challenging problem. This thesis introduces and solves the Generalised Interaction Mining (GIM) and Generalised Rule Mining (GRM) problems. GIM and GRM use an efficient and intuitive computational model based purely on vector-valued functions. The semantics of the interactions, their interestingness measures and the type of data considered are flexible components of vectorised frameworks.
By separating the semantics of a problem from the algorithm used to mine it, the frameworks allow both to vary independently of each other. This makes it easier to develop new methods by focusing purely on a problem's semantics and removing the burden of designing an
Similar resources
A Framework on Data Mining on Uncertain Data with Related Research Issues in Service Industry
There has been a large amount of research work done on mining on relational databases that store data in exact values. However, in many real-life applications such as those commonly used in service industry, the raw data are usually uncertain when they are collected or produced. Sources of uncertain data include readings from sensors (such as RFID tagged in products in retail stores), classific...
Guest Editors' Introduction: Special Section on Mining Large Uncertain and Probabilistic Databases
Recent years have witnessed the emergence of novel database applications in various nontraditional domains, including location-based services, sensor networks, RFID systems, and biological and biometric databases. Traditionally, data mining has been widely used to reveal interesting patterns in the vast amounts of data generated by such applications. However, for most of these emerging domains,...
Similarity Search and Mining in Uncertain Databases
Managing, searching and mining uncertain data has attracted much attention in the database community recently due to new sensor technologies and new ways of collecting data. There are a number of challenges in terms of collecting, modelling, representing, querying, indexing and mining uncertain data. In its scope, the diversity of approaches addressing these topics is very high because the underl...
On Uncertain Probabilistic Data Modeling
Uncertainty in data is caused by various reasons including data itself, data mapping, and data policy. For data itself, data are uncertain because of various reasons. For example, data from a sensor network, Internet of Things or Radio Frequency Identification is often inaccurate and uncertain because of devices or environmental factors. For data mapping, integrated data from various heterogono...
Subspace Clustering for Uncertain Data
Analyzing uncertain databases is a challenge in data mining research. Usually, data mining methods rely on precise values. In scenarios where uncertain values occur, e.g. due to noisy sensor readings, these algorithms cannot deliver high-quality patterns. Besides uncertainty, data mining methods face another problem: high dimensional data. For finding object groupings with locally relevant dimens...