An Improvement and an Extension on the Hybrid Index for Approximate String Matching
Author
Abstract
In [2], Navarro and Baeza-Yates found their so-called hybrid index to be the best alternative for indexed approximate search in English text. The original hybrid index is based on Levenshtein edit distance. We propose two modifications to the hybrid index. The first accelerates the search. The second makes the index also permit the error of transposing two adjacent characters (“Damerau distance”). A full discussion is presented in Section 11 of [1].

Let ed(A, B) denote the edit distance between strings A and B, |A| the length of A, A_i the ith character of A, and A_{i..j} the substring of A that begins at its ith character and ends at its jth character. Given a length-m pattern string P, a length-n text string T, and an error limit k, the task of approximate string matching is to find those text positions j for which ed(P, T_{h..j}) ≤ k for some h ≤ j. Levenshtein edit distance ed_L(A, B) is the minimum number of single-character insertions, deletions and substitutions needed to transform A into B or vice versa. Damerau edit distance ed_D(A, B) is otherwise similar but also permits the operation of transposing two permanently adjacent characters.

Using an index structure during the search can accelerate approximate string matching. One such index is the hybrid index of Navarro and Baeza-Yates [2] for Levenshtein edit distance. It uses intermediate partitioning, where the pattern is partitioned into j pieces P^1, ..., P^j, and each piece P^i is then searched for with d_i = ⌊k/j⌋ errors. If j > 1 and a hit T_{j−h..j} is found such that ed_L(P^i, T_{j−h..j}) ≤ d_i, the text area T_{j−m−k..j+m+k} is included in a check for a complete match of P with k errors. The hits for each piece P^i are found by a depth-first search (DFS) over a suffix tree built for the text. During the DFS, a dynamic programming table D is filled, where D[r, l] = ed_L(P^i_{1..r}, T_{j+1..j+l}).
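The definitions above can be made concrete with a minimal Python sketch. The function names `levenshtein`, `damerau` and `match_ends` are illustrative and do not come from the paper. The transposition rule assumes that "permanently adjacent" means the restricted form, in which a transposed pair is not edited again. `match_ends` is a plain sequential scan of the text, not the indexed search of the hybrid index.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance: insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]


def damerau(a: str, b: str) -> int:
    """As levenshtein, plus transposition of two adjacent characters.

    Assumption: restricted variant (a transposed pair is not edited again).
    """
    rows = [list(range(len(b) + 1))]
    for i in range(1, len(a) + 1):
        row = [i]
        for j in range(1, len(b) + 1):
            best = min(rows[i - 1][j] + 1,
                       row[j - 1] + 1,
                       rows[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                best = min(best, rows[i - 2][j - 2] + 1)  # transposition
            row.append(best)
        rows.append(row)
    return rows[-1][-1]


def match_ends(pattern: str, text: str, k: int) -> List[int]:
    """1-based text positions j with ed_L(pattern, T_{h..j}) <= k for some h <= j."""
    col = list(range(len(pattern) + 1))
    ends = []
    for j, ct in enumerate(text, start=1):
        new = [0]  # a match may start at any text position
        for r, cp in enumerate(pattern, start=1):
            new.append(min(col[r] + 1, new[r - 1] + 1, col[r - 1] + (cp != ct)))
        col = new
        if col[-1] <= k:
            ends.append(j)
    return ends


print(levenshtein("thesis", "thseis"), damerau("thesis", "thseis"))  # 2 1
print(match_ends("the", "a thesis", 1))                               # [4, 5, 6]
```

The only difference between `match_ends` and the two distance functions is the top row of the table: setting it to zero lets a match begin anywhere in the text.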
When the DFS arrives at a node that corresponds to the text substring T_{j+1..j+l}, the distances ed_L(P^i_{1..r}, T_{j+1..j+l}) are computed for r = 1..m_i, where m_i = |P^i|.

Fig. 1a. Matrix D (rows r = 0..6 for the piece “thesis”, columns l = 0..8 for the text “there..”):

            l:   0  1  2  3  4  5  6  7  8
         text:      t  h  e  r  e
    r = 0        0  1  2  3  4  5  6  7  8
    r = 1  t     1  0  1  2  3  4
    r = 2  h     2  1  0  1  2  3
    r = 3  e     3  2  1  0  1  2
    r = 4  s     4  3  2  1  1  2  2
    r = 5  i     5  4  3  2  2  2  2  2
    r = 6  s     6  5  4  3  3  3  2  2  2

Fig. 1b. Running-time ratio OURS/NBY:

    m_i = m   d_i = k   OURS/NBY (WSJ)   OURS/NBY (yeast)
       5         1          0.23             0.66
       5         2          0.33             0.74
      10         1          0.19             0.20
      10         2          0.33             0.31
      10         3          0.41             0.39
      15         1          0.19             0.19
      15         2          0.34             0.27
      15         3          0.43             0.36
      15         4          0.50             0.49

Fig. 1. Figure a): Matrix D for computing ed_L(P^i_{1..r}, T_{j+1..}), where P^i_{1..m_i} = “thesis”, T_{j+1..} = “there..”, and d_i = 2. Now D[r, 5] ≥ d_i = 2 for r = 1..m_i, and the only way to reach a cell value D[m_i, x] ≤ 2, where x > 5, is to have only matches on the remaining parts of the top-left-to-bottom-right diagonals that start from a value D[h, 5] = 2. The cells in these diagonal extensions (columns l = 6..8) have the value d_i = 2, and the pattern suffixes corresponding to the cell values D[3, 5], D[4, 5] and D[5, 5] are P^i_{4..6} = “sis”, P^i_{5..6} = “is” and P^i_6 = “s”, respectively. Figure b): The ratio between the running time of our improved DFS (OURS) and the running time of the original DFS of Navarro and Baeza-Yates (NBY). We tested with two texts of ≈ 10 MB each: Wall Street Journal articles (WSJ) and the DNA of baker’s yeast (yeast). The computer was a 600 MHz Pentium 3 with 256 MB RAM, Linux OS and the GCC 3.2.1 compiler.

Footnotes: (⋆) Supported by Tampere Graduate School in Information Science and Engineering. (1) Suffix tree: a trie of all suffixes of the text in which each suffix has its own leaf node and the position of each suffix is recorded in the corresponding leaf.

Our main proposal for accelerating the DFS is as follows. When the DFS reaches a depth-l node that corresponds to the text substring T_{j+1..j+l} and where D[r, l] ≥ d_i for r = 1..m_i, the only strings that have T_{j+1..j+l} as a prefix and match P^i with d_i errors are of the form T_{j+1..j+l} ◦ P^i_{h+1..m_i}, where ◦ denotes concatenation and h fulfills the condition D[h, l] = d_i.
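The pruning condition can be tried out on the example of Fig. 1a with the following Python sketch. The helpers `next_column`, `column_for` and `forced_suffixes` are hypothetical names, and the code only follows a single text string instead of walking a suffix tree. It also assumes that h may range over 0..m_i; the paper states only the condition D[h, l] = d_i.

```python
from typing import List, Optional


def next_column(piece: str, prev_col: List[int], text_char: str) -> List[int]:
    """Compute column l of D from column l-1 after reading one more text character."""
    col = [prev_col[0] + 1]  # D[0, l] = l: the match is anchored at the node's start
    for r, pc in enumerate(piece, start=1):
        col.append(min(prev_col[r] + 1,
                       col[r - 1] + 1,
                       prev_col[r - 1] + (pc != text_char)))
    return col


def column_for(piece: str, text_prefix: str) -> List[int]:
    """Column D[0..m, l] for a node whose path label is text_prefix (l = its length)."""
    col = list(range(len(piece) + 1))  # column l = 0
    for ch in text_prefix:
        col = next_column(piece, col, ch)
    return col


def forced_suffixes(piece: str, col: List[int], d: int) -> Optional[List[str]]:
    """If D[r, l] >= d for all r = 1..m, list the suffixes P_{h+1..m} with D[h, l] = d.

    Returns None when the condition does not hold and the DFS would simply go on.
    An empty suffix in the result means the node's own string is already a hit.
    """
    if min(col[1:]) < d:
        return None
    return [piece[h:] for h in range(len(piece) + 1) if col[h] == d]


col = column_for("thesis", "there")
print(col)                                # [5, 4, 3, 2, 2, 2, 3]
print(forced_suffixes("thesis", col, 2))  # ['sis', 'is', 's']
```

At the node for “the” the column still contains values below 2, so `forced_suffixes` returns `None` there and the shortcut does not apply yet.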
In this situation we check directly for the presence of any of these concatenated substrings, and then let the DFS backtrack. Fig. 1a illustrates this, and Fig. 1b shows experimental results from a comparison against the original DFS of [2] when P^i = P and d_i = k.

In addition, we propose the following lemma for partitioning P under Damerau distance. It uses classes of characters, which means permitting a pattern position to match any of the characters enumerated inside square brackets. For example, P = “thes[ei]s” matches the strings “theses” and “thesis”.

Lemma 1. Let P^i, i = 1..j, be j non-overlapping substrings of the pattern P that are ordered so that P^{i+1} occurs to the right of P^i in P, and let m_i = |P^i|. Also let B be some string for which ed_D(P, B) ≤ k, let each P^i be associated with the corresponding number of errors d_i, and let the strings P̄^i, i = 1..j, be defined as follows:
P̄^i = P^i, if i = j or P^i and P^{i+1} do not occur consecutively in P;
P̄^i = P^i_{1..m_i−1} ◦ [P^i_{m_i} P^{i+1}_1], otherwise.
If ∑_{i=1}^{j} d_i ≥ k − j + 1, then one of the strings P̄^i matches inside B with at most d_i errors.
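As a rough illustration of Lemma 1, the sketch below builds the strings P̄^i as Python regular expressions and checks the error-budget condition. The names `extended_pieces` and `budget_ok` and the half-open `(start, end)` representation of the pieces are assumptions made for this example only. The final check covers just the case d_i = 0, where "matches with at most d_i errors" reduces to an exact regular-expression match.

```python
import re
from typing import List, Tuple


def extended_pieces(pattern: str, spans: List[Tuple[int, int]]) -> List[str]:
    """Build the strings P-bar^i of Lemma 1 as regular expressions.

    spans holds 0-based half-open (start, end) positions of the pieces in
    pattern, ordered left to right and non-overlapping.
    """
    out = []
    for i, (start, end) in enumerate(spans):
        piece = pattern[start:end]
        consecutive = i + 1 < len(spans) and spans[i + 1][0] == end
        if consecutive:
            # The last position may also match the first character of the next
            # piece, which covers a transposition across the piece boundary.
            out.append(re.escape(piece[:-1])
                       + "[" + re.escape(piece[-1]) + re.escape(pattern[end]) + "]")
        else:
            out.append(re.escape(piece))
    return out


def budget_ok(errors: List[int], k: int) -> bool:
    """Condition of Lemma 1: sum of d_i >= k - j + 1, with j = number of pieces."""
    return sum(errors) >= k - len(errors) + 1


pieces = extended_pieces("thesis", [(0, 3), (3, 6)])
print(pieces)                # ['th[es]', 'sis']
print(budget_ok([0, 0], 1))  # True

# B = "thseis" is "thesis" with the boundary characters 'e' and 's' transposed,
# so ed_D(P, B) = 1 = k. Neither original piece occurs in B, but P-bar^1 does.
B = "thseis"
print([bool(re.search(p, B)) for p in pieces])  # [True, False]
```

With the plain pieces “the” and “sis” and zero errors per piece, the transposition across the boundary would destroy both pieces and the occurrence would be missed. The character class on the last position of the first piece is what recovers it.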