Optimal String Mining Under Frequency Constraints
Abstract
We propose a new algorithmic framework that solves frequency-related data mining queries on databases of strings in optimal time, i.e., in time linear in the input and the output size. The additional space is linear in the input size. Our framework can be used to mine frequent strings, emerging strings, and strings that pass other statistical tests, e.g., the χ²-test. In contrast to this result for strings, no optimal algorithms are known for other pattern domains such as itemsets. The key to our approach is a set of recent results on index structures for strings, among them suffix- and lcp-arrays, and a new preprocessing scheme for range minimum queries. The advantages of array-based data structures (compared with dynamic data structures such as trees) are good locality behavior and extensibility to secondary memory. We test our algorithm on real-world data from computational biology and demonstrate that the approach also works well in practice.
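To make the two main query types concrete, the following is a minimal brute-force sketch of frequent-string and emerging-string mining. It is not the optimal array-based algorithm of the abstract: it enumerates every substring explicitly and is meant only as a reference for what the queries return. The function names, the use of relative frequency, and the growth-rate definition of "emerging" (frequency in a positive database divided by frequency in a negative one) are assumptions made for illustration.

```python
from collections import Counter


def support(db):
    """Map each distinct substring to the number of strings in db that contain it."""
    counts = Counter()
    for s in db:
        # Collect substrings in a set so each string contributes at most once.
        seen = {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}
        counts.update(seen)
    return counts


def frequent_strings(db, min_support):
    """Substrings occurring in at least min_support strings of db."""
    return {p for p, c in support(db).items() if c >= min_support}


def emerging_strings(pos, neg, min_pos_freq, min_growth):
    """Substrings that are frequent in pos and comparatively rare in neg.

    Assumed definition: relative frequency in pos >= min_pos_freq, and
    growth rate (pos frequency / neg frequency) >= min_growth, where a
    substring absent from neg counts as having unbounded growth.
    """
    sup_pos, sup_neg = support(pos), support(neg)
    result = set()
    for p, c in sup_pos.items():
        f_pos = c / len(pos)
        f_neg = sup_neg.get(p, 0) / len(neg)
        if f_pos >= min_pos_freq and (f_neg == 0 or f_pos / f_neg >= min_growth):
            result.add(p)
    return result


if __name__ == "__main__":
    db = ["abab", "abc", "bc"]
    print(sorted(frequent_strings(db, 2)))
    print(sorted(emerging_strings(["abab", "abc"], ["bc", "cc"], 1.0, 2.0)))
```

The brute-force version takes time at least quadratic in the length of each string, which is exactly the cost that an index built from suffix- and lcp-arrays is meant to avoid; it can still serve as a correctness check for a faster implementation on small inputs.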
Similar resources
Efficient String Mining under Constraints Via the Deferred Frequency Index
We propose a general approach for frequency-based string mining, which has many applications, e.g., in contrast data mining. Our contribution is a novel algorithm based on a deferred data structure. Despite its simplicity, our approach is up to 4 times faster and uses about half the memory compared to the best-known algorithm of Fischer et al. Applications in various string domains, e.g. natural...
Impact of Pollution Location on Time and Frequency Characteristics of Leakage Current of Porcelain Insulator String under Different Humidity and Contamination Severity
One of the important factors influencing the performance of outdoor insulators is pollution. Pollution, especially under humid conditions, reduces the surface resistance of the insulator and leads to a flow of leakage current (LC) on the insulator surface, which may result in total flashover. The LC characteristics are affected by parameters such as the nature and severity of the pollution. Loca...
Introducing Softness into Inductive Queries on String Databases
In many application domains (e.g., WWW mining, molecular biology), large string datasets are available and yet under-exploited. The inductive database framework assumes that both such datasets and the various patterns holding within them might be queryable. In this setting, queries which return patterns are called inductive queries and solving them is one of the core research topics for data mi...
Mitašiūnaitė Mining String Data under Similarity and Soft-Frequency Constraints: Application to Promoter Sequence Analysis
An inductive database is a database that contains not only data but also patterns. Inductive databases are designed to support the KDD process. Recent advances in inductive database research have given rise to generic solvers capable of solving inductive queries that are arbitrary Boolean combinations of anti-monotonic and monotonic constraints. They are designed to mine different types of p...
Optimal production strategy of bimetallic deposits under technical and economic uncertainties using stochastic chance-constrained programming
In order to catch up with reality, all the macro-decisions related to long-term mining production planning must be made simultaneously and under uncertain conditions of determinant parameters. By taking advantage of the chance-constrained programming, this paper presents a stochastic model to create an optimal strategy for producing bimetallic deposit open-pit mines under certain and uncertain ...