Optimal exact string matching based on suffix arrays
Authors
Abstract
Using the suffix tree of a string S, decision queries of the type “Is P a substring of S?” can be answered in O(|P|) time and enumeration queries of the type “Where are all z occurrences of P in S?” can be answered in O(|P|+z) time, totally independent of the size of S. However, in large-scale applications such as genome analysis, the space requirements of the suffix tree are a severe drawback. The suffix array is a more space-economical index structure. Using it and an additional table, Manber and Myers (1993) showed that decision queries and enumeration queries can be answered in O(|P|+log |S|) and O(|P|+log |S|+z) time, respectively, but no optimal time algorithms are known. In this paper, we show how to achieve the optimal O(|P|) and O(|P|+z) time bounds for the suffix array. Our approach is not confined to exact pattern matching. In fact, it can be used to efficiently solve all problems that are usually solved by a top-down traversal of the suffix tree. Experiments show that our method is not only of theoretical interest but also of practical relevance.
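To make the two query types concrete, here is a minimal Python sketch of decision and enumeration queries on a suffix array. It uses only plain binary search, which costs O(|P| log |S|) per query; it is not the table-assisted method of Manber and Myers, nor the optimal-time method the abstract announces. The function names (`build_suffix_array`, `find_interval`, `contains`, `occurrences`) are hypothetical, and the naive construction is chosen for brevity, not efficiency.

```python
def build_suffix_array(s):
    """Naive construction: sort suffix start positions by the suffix text.
    Fine for a demonstration, far too slow for genome-scale input."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def find_interval(s, sa, p):
    """Return the half-open interval [lo, hi) of suffix array indices
    whose suffixes start with p. The interval is empty if p does not occur."""
    m = len(p)

    # Lower bound: first suffix whose length-m prefix is >= p.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + m] < p:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose length-m prefix is > p.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + m] <= p:
            lo = mid + 1
        else:
            hi = mid
    return start, lo


def contains(s, sa, p):
    """Decision query: is p a substring of s?"""
    lo, hi = find_interval(s, sa, p)
    return lo < hi


def occurrences(s, sa, p):
    """Enumeration query: all start positions of p in s, in increasing order."""
    lo, hi = find_interval(s, sa, p)
    return sorted(sa[lo:hi])


text = "mississippi"
sa = build_suffix_array(text)
print(contains(text, sa, "ssip"))    # True
print(occurrences(text, sa, "ssi"))  # [2, 5]
```

All occurrences of P occupy one contiguous interval of the suffix array, so once the interval is found, enumeration only has to read it off. That is where the additive z term in the bounds above comes from.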
Similar resources
Optimal Suffix Tree Construction with Large
The suffix tree of a string is the fundamental data structure of combinatorial pattern matching. In this paper, we present a novel, deterministic algorithm for the construction of suffix trees. We settle the main open problem in the construction of suffix trees: we build suffix trees in linear time for integer alphabet.
Average-optimal string matching
The exact string matching problem is to find the occurrences of a pattern of length m from a text of length n symbols. We develop a novel and unorthodox filtering technique for this problem. Our method is based on transforming the problem into multiple matching of carefully chosen pattern subsequences. While this is seemingly more difficult than the original problem, we show that the idea leads...
Fast Approximate String Matching with Suffix Arrays and A* Parsing
We present a novel exact solution to the approximate string matching problem in the context of translation memories, where a text segment has to be matched against a large corpus, while allowing for errors. We use suffix arrays to detect exact n-gram matches, A* search heuristics to discard matches and A* parsing to validate candidate segments. The method outperforms the canonical baseline by a...
Optimal Exact String Matching Based on Suffix Arrays
Using the suffix tree of a string S, decision queries of the type “Is P a substring of S?” can be answered in O(|P|) time and enumeration queries of the type “Where are all z occurrences of P in S?” can be answered in O(|P|+z) time, totally independent of the size of S. However, in large-scale applications such as genome analysis, the space requirements of the suffix tree are a severe drawback. Th...
String Range Matching
Given strings X and Y the exact string matching problem is to find the occurrences of Y as a substring of X. An alternative formulation asks for the lexicographically consecutive set of suffixes of X that begin with Y. We introduce a generalization called string range matching where we want to find the suffixes of X that are in an arbitrary lexicographical range bounded by two strings Y and Z. ...