Large Text Searching Allowing Errors

Authors

Gonzalo Navarro

Nivio Ziviani

Abstract

We present a full inverted index for exact and approximate string matching in large texts. The index is composed of a table containing the vocabulary of words of the text and a list of positions in the text corresponding to each word. The size of the table of words is usually much less than 1% of the text size and hence can be kept in main memory, where most query processing takes place. The text, on the other hand, is not accessed at all. The algorithm permits a large number of variations of the exact and approximate string search problem, such as phrases, string matching with sets of characters (range and arbitrary set of characters, complements, wild cards), approximate search with nonuniform costs, and arbitrary regular expressions. The whole index can be built in linear time, in a single sequential pass over the text, takes about 1/3 of the space of the text, and retrieval times are near O(√n) for typical cases. Experimental results show that the algorithm works well in practice: for a one-gigabyte text collection, all matches of a phrase of 3 words allowing up to 1 error can be found in approximately 6 seconds, and allowing no errors in under half a second. This index has been implemented in a software package called Igrep, which is publicly available. Experiments show that Igrep is much faster than Glimpse in typical queries.
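The two-part structure described in the abstract (a vocabulary table plus a list of text positions per word) can be illustrated with a minimal Python sketch. This is not Igrep's implementation: the function names `build_index`, `edit_distance`, and `search` are hypothetical, the tokenization rule is an assumption, and the approximate search simply scans the whole vocabulary with a textbook edit-distance computation. It shows only why the text itself need not be read at query time.

```python
import re
from collections import defaultdict


def build_index(text):
    """Single pass over the text: map each lowercased word to its start offsets."""
    index = defaultdict(list)
    for match in re.finditer(r"\w+", text.lower()):
        index[match.group()].append(match.start())
    return dict(index)


def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def search(index, word, k=0):
    """Return {vocabulary word: positions} for words within k errors of `word`.

    Only the in-memory vocabulary is scanned; the text is never accessed.
    """
    word = word.lower()
    return {w: pos for w, pos in index.items() if edit_distance(w, word) <= k}


if __name__ == "__main__":
    idx = build_index("The fox sent a fax to the other fox")
    print(search(idx, "fox", k=1))
```

Because the vocabulary is typically a small fraction of the text, even this linear scan of the vocabulary touches far less data than a scan of the text would; the position lists then give the occurrences directly.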

Similar resources

Fast Filters for Two Dimensional String Matching Allowing Rotations

We give faster algorithms for searching a 2-dimensional pattern in a 2-dimensional text allowing rotations, mismatches and/or insertion/deletion errors.

Fast Approximate String Matching in a Dictionary

A successful technique to search large textual databases allowing errors relies on an online search in the vocabulary of the text. To reduce the time of that online search, we index the vocabulary as a metric space. We show that with reasonable space overhead we can improve by a factor of two over the fastest online algorithms, when the tolerated error level is low (which is reasonable in tex...

Fast Text Searching With Errors

Searching for a pattern in a text file is a very common operation in many applications ranging from text editors and databases to applications in molecular biology. In many instances the pattern does not appear in the text exactly. Errors in the text or in the query can result from misspelling or from experimental errors (e.g., when the text is a DNA sequence). The use of such approximate patte...

Agrep — a Fast Approximate Pattern-matching Tool

Agrep - A Fast Approximate Pattern-Matching Tool

Searching for a pattern in a text file is a very common operation in many applications ranging from text editors and databases to applications in molecular biology. In many instances the pattern does not appear in the text exactly. Errors in the text or in the query can result from misspelling or from experimental errors (e.g., when the text is a DNA sequence). The use of such approximate patte...

Publication date: 1997
