Large Text Searching Allowing Errors
نویسندگان
چکیده
We present a full inverted index for exact and approximate string matching in large texts. The index is composed of a table containing the vocabulary of words of the text and a list of positions in the text corresponding to each word. The size of the table of words is usually much less than 1% of the text size and hence can be kept in main memory, where most query processing takes place. The text, on the other hand, is not accessed at all. The algorithm permits a large number of variations of the exact and approximate string search problem, such as phrases, string matching with sets of characters (range and arbitrary set of characters, complements, wild cards), approximate search with nonuniform costs and arbitrary regular expressions. The whole index can be built in linear time, in a single sequential pass over the text, takes near 1=3 the space of the text, and retrieval times are near O(p n) for typical cases. Experimental results show that the algorithm works well in practice: for a one-gigabyte text collection, all matchings of a phrase of 3 words allowing up to 1 error can be found in approximately 6 seconds and allowing no errors can be found in under half a second. This index has been implemented in a software package called Igrep, which is publicly available. Experiments show that Igrep is much faster than Glimpse in typical queries.
منابع مشابه
Fast Filters for Two Dimensional String Matching Allowing Rotations
We give faster algorithms for searching a 2-dimensional pattern in a 2-dimensional text allowing rotations, mismatches and/or insertion/deletion errors.
متن کاملFast Approximate String Matching in a Dictionary
A successful technique to search large textual databases allowing errors relies on an online search in the vocabulary of the text. To reduce the time of that on-line search, we index the vocabulary as a metric space. We show that with reasonable space overhead we can improve by a factor of two over the fastest online algorithms , when the tolerated error level is low (which is reasonable in tex...
متن کاملDEPARTMENT OF COMPUTER SCIENCE Fast Text Searching With Errors
Searching for a pattern in a text file is a very common operation in many applications ranging from text editors and databases to applications in molecular biology. In many instances the pattern does not appear in the text exactly. Errors in the text or in the query can result from misspelling or from experimental errors (e.g., when the text is a DNA sequence). The use of such approximate patte...
متن کاملAgrep — a Fast Approximate Pattern-matching Tool
Searching for a pattern in a text file is a very common operation in many applications ranging from text editors and databases to applications in molecular biology. In many instances the pattern does not appear in the text exactly. Errors in the text or in the query can result from misspelling or from experimental errors (e.g., when the text is a DNA sequence). The use of such approximate patte...
متن کاملAgrep - A Fast Approximate Pattern-Matching Tool
Searching for a pattern in a text file is a very common operation in many applications ranging from text editor sand databases to applications in molecular biology. In many instances the pattern does not appear in the text exactly. Errors in the text or in the query can result from misspelling or from experimental errors (e.g., when the text is a DNA sequence). The use of such approximate patte...
متن کامل