One-Gapped q-Gram Filters for Levenshtein Distance
Abstract
We have recently shown that q-gram filters based on gapped q-grams instead of the usual contiguous q-grams can provide orders of magnitude faster and/or more efficient filtering for the Hamming distance. In this paper, we extend the results to the Levenshtein distance, which is more problematic for gapped q-grams because an insertion or deletion in a gap affects a q-gram while a replacement does not. To keep this effect under control, we concentrate on gapped q-grams with just one gap. We demonstrate with experiments that the resulting filters provide a significant improvement over the contiguous q-gram filters. We also develop new techniques for dealing with complex q-gram filters.
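The effect described in the abstract, where an insertion inside the gap changes a gapped q-gram while a replacement there does not, can be illustrated with a minimal sketch. The function name `gapped_qgrams`, the shape notation (`#` for a sampled position, `-` for a gap position), and the example strings are assumptions made for illustration; they are not taken from the paper.

```python
def gapped_qgrams(s, shape):
    """Return the gapped q-grams of s for the given shape.

    shape is a string of '#' (position is sampled) and '-' (position is
    skipped). This notation is an assumption made for this sketch.
    """
    keep = [i for i, c in enumerate(shape) if c == "#"]
    span = len(shape)
    return [
        "".join(s[start + i] for i in keep)
        for start in range(len(s) - span + 1)
    ]


shape = "##-##"          # one gap of length 1 (hypothetical example shape)
original = "ACGTACGT"
replaced = "ACXTACGT"    # replacement at index 2, inside the gap of the first q-gram
inserted = "ACXGTACGT"   # insertion at index 2, inside the gap of the first q-gram

# The first q-gram samples indices 0, 1, 3, 4 and skips index 2.
print(gapped_qgrams(original, shape)[0])  # ACTA
print(gapped_qgrams(replaced, shape)[0])  # ACTA (replacement in the gap is invisible)
print(gapped_qgrams(inserted, shape)[0])  # ACGT (insertion shifts the sampled characters)
```

In this toy case the replacement leaves the first q-gram intact, whereas the insertion destroys it: `ACTA` no longer occurs anywhere among the q-grams of the modified string. This is the asymmetry that makes gapped q-grams harder to use for the Levenshtein distance than for the Hamming distance.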
Similar resources
Finding All Approximate Gapped Palindromes
We study the problem of finding all maximal approximate gapped palindromes in a string. More specifically, given a string S of length n, a parameter q ≥ 0 and a threshold k > 0, the problem is to identify all substrings in S of the form uvw such that (1) the Levenshtein distance between u and w^r is at most k, where w^r is the reverse of w, and (2) v is a string of length q. The best previous work r...
Better Filtering with Gapped q-Grams
A popular and well-studied class of filters for approximate string matching compares substrings of length q, the q-grams, in the pattern and the text to identify text areas that contain potential matches. A generalization of the method that uses gapped q-grams instead of contiguous substrings is mentioned a few times in the literature but has never been analyzed in any depth. In this paper, we repo...
BDD-Based Analysis of Gapped q-Gram Filters
Recently, there has been a surge of interest in gapped q-gram filters for approximate string matching. Important design parameters for filters are, for example, the value of q, the filter threshold and, in particular, the shape (aka seed) of the filter. A good choice of parameters can improve the performance of a q-gram filter by orders of magnitude and optimising these parameters is a nontrivial c...
A Technique for Data Deduplication using Q-Gram Concept with Support Vector Machine
Several systems that rely on consistent data to offer high quality services, such as digital libraries and e-commerce brokers, may be affected by the existence of duplicates, quasi-replicas, or near-duplicate entries in their repositories. Because of that, there have been significant investments from private and government organizations in developing methods for removing replicas from their data ...
Random Projection and Geometrization of String Distance Metrics
Edit distance is not the only way in which the distance between two character sequences can be calculated. Strings can also be compared in somewhat subtler geometric ways. A procedure inspired by Random Indexing can attribute a D-dimensional geometric coordinate to any character N-gram present in the corpus and can subsequently represent the word as a sum of N-gram fragments which the string conta...