On-Line Approximate String Matching with Bounded Errors
Abstract
We introduce a new dimension to the widely studied on-line approximate string matching problem by adding an error threshold parameter ε, so that the algorithm is allowed to miss occurrences with probability ε. This relaxation is particularly appropriate for the problem, as approximate searching is used to model many cases where exact answers are not mandatory. We show that the relaxed version of the problem allows us to break the average-case optimal lower bound of the classical problem, achieving average-case O(n log_σ m / m) time with any ε = poly(k/m), where n is the text size, m the pattern length, k the number of differences allowed under edit distance, and σ the alphabet size. Our experimental results show the practicality of this novel and promising research direction. Finally, we extend the proposed approach to the multiple approximate string matching setting, where approximate occurrences of r patterns are sought simultaneously. Again, we can break the average-case optimal lower bound of the classical problem, achieving average-case O(n log_σ(rm) / m) time with any ε = poly(k/m).
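For orientation, the classical (exact, never-missing) version of the problem can be sketched with the standard column-wise dynamic-programming scan, usually attributed to Sellers. This is only the O(mn) baseline that defines what an "occurrence with at most k differences" is; it is not the probabilistic algorithm of the paper, and the function name `approx_search` is a hypothetical choice made for this illustration.

```python
def approx_search(text, pattern, k):
    """Return every 0-based text index j such that some substring of
    `text` ending at j is within edit distance k of `pattern`.

    Classical dynamic-programming baseline (illustrative sketch only,
    not the bounded-error algorithm described in the abstract).
    """
    m = len(pattern)
    # col[i] = minimum edit distance between pattern[:i] and any
    # substring of the text ending at the current text position.
    col = list(range(m + 1))
    ends = []
    for j, c in enumerate(text):
        prev_diag = col[0]  # previous column's value for row i - 1
        # col[0] stays 0: an occurrence may start anywhere in the text.
        for i in range(1, m + 1):
            old = col[i]  # previous column's value for row i
            cost = 0 if pattern[i - 1] == c else 1
            col[i] = min(prev_diag + cost,  # match or substitution
                         old + 1,           # extra character in the text
                         col[i - 1] + 1)    # pattern character skipped
            prev_diag = old
        if col[m] <= k:
            ends.append(j)
    return ends


if __name__ == "__main__":
    print(approx_search("xxabdxx", "abc", 1))
```

With n the text length, m the pattern length, and k the allowed number of differences, this scan inspects every text character m times. The bounds quoted in the abstract concern algorithms that inspect far fewer characters on average, at the price of missing an occurrence with probability ε.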
Similar resources
Adaptive Approximate Record Matching
Typographical data entry errors and incomplete documents produce imperfect records in real-world databases. These errors generate distinct records which belong to the same entity. The aim of Approximate Record Matching is to find multiple records which belong to an entity. In this paper, an algorithm for Approximate Record Matching is proposed that can be adapted automatically with input error...
Self-Bounded Prediction Suffix Tree via Approximate String Matching
Prediction suffix trees (PST) provide an effective tool for sequence modelling and prediction. Current prediction techniques for PSTs rely on exact matching between the suffix of the current sequence and the previously observed sequence. We present a provably correct algorithm for learning a PST with approximate suffix matching by relaxing the exact matching condition. We then present a self-bo...
Fast Approximate String Matching in a Dictionary
A successful technique to search large textual databases allowing errors relies on an on-line search in the vocabulary of the text. To reduce the time of that on-line search, we index the vocabulary as a metric space. We show that with reasonable space overhead we can improve by a factor of two over the fastest on-line algorithms, when the tolerated error level is low (which is reasonable in tex...
Approximate String Matching: Theory and Applications (La Recherche Approchée de Motifs : Théorie et Applications)
Approximate string matching is a fundamental and recurrent problem that arises in most computer science fields. This problem can be defined as follows: let D = {x1, x2, ..., xd} be a set of d words defined on an alphabet Σ, let q be a query also defined on Σ, and let k be a positive integer. We want to build a data structure on D capable of answering the following query: find all words i...
n-Gram/2L-approximation: a two-level n-gram inverted index structure for approximate string matching
Approximate string matching finds all occurrences of a query string in a text database, allowing a specified number of errors. Approximate string matching based on the n-gram inverted index (simply, n-gram Matching) has been widely used. A major reason is that it is scalable for large databases since it is not a main memory algorithm. Nevertheless, n-gram Matching also has drawbacks: th...
Journal title:
- Theor. Comput. Sci.
Volume 412, Issue
Pages -
Publication date 2008