Occurrence and Substring Heuristics for δ-Matching
Abstract
We consider a version of pattern matching useful in processing large musical data: δ-matching, which consists in finding matches that are δ-approximate in the sense of a distance measured as the maximum difference between symbols. The alphabet is an interval of integers, and the distance between two symbols a, b is measured as |a − b|. We also consider (δ, γ)-matching, where γ is a bound on the total sum of the differences. We first consider “occurrence heuristics” by adapting exact string matching algorithms to the two notions of approximate string matching. The resulting algorithms are efficient in practice. Then we consider “substring heuristics”. We present δ-matching algorithms that are fast on average provided that the pattern is “non-flat” and the alphabet interval is large. The pattern is “flat” if its structure does not vary substantially. The algorithms, named δ-BM1, δ-BM2 and δ-BM3, can be thought of as members of the generalized Boyer-Moore family of algorithms. This is the first paper on the subject; previously only “occurrence heuristics” have been considered. Our substring heuristics are much stronger and refer to larger parts of texts (not only to single positions). We use δ-versions of suffix tries and subword ...
The work of these authors was partially supported by NATO grant PST.CLG.977017. The work of this author was partially supported by Wellcome Foundation, Royal Society and EPSRC grants.
Author manuscript (hal-00619565, version 1, 19 Mar 2013), published in "Fundamenta Informaticae 56, 1,2 (2003) 1-21".
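As a rough illustration of the two matching notions in the abstract, the sketch below checks every text position directly: a window matches when every symbol differs from the pattern symbol by at most δ and, if a bound γ is given, the differences sum to at most γ. This is a naive quadratic baseline, not the Boyer-Moore style heuristics the paper describes. The function name `delta_gamma_occurrences` and the encoding of symbols as Python integers (for example, pitch numbers) are assumptions made for the example.

```python
def delta_gamma_occurrences(text, pattern, delta, gamma=None):
    """Return all start positions where `pattern` approximately matches `text`.

    A window matches when every symbol difference is at most `delta`
    (delta-matching) and, if `gamma` is given, the differences also
    sum to at most `gamma`. Naive O(n*m) scan; hypothetical helper.
    """
    m = len(pattern)
    hits = []
    for i in range(len(text) - m + 1):
        total = 0
        ok = True
        for j in range(m):
            diff = abs(text[i + j] - pattern[j])
            if diff > delta:
                ok = False  # a single symbol is too far away
                break
            total += diff
            if gamma is not None and total > gamma:
                ok = False  # accumulated difference exceeds the bound
                break
        if ok:
            hits.append(i)
    return hits


# Symbols are integers from an interval alphabet (assumed encoding).
text = [60, 62, 64, 65, 67, 64, 62, 60]
pattern = [61, 63, 64]
print(delta_gamma_occurrences(text, pattern, delta=1))
print(delta_gamma_occurrences(text, pattern, delta=1, gamma=2))
```

With δ = 1 alone, both of the first two windows qualify. Adding γ = 2 rejects the second window, whose differences sum to 3.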
Similar resources
Occurrence and Substring Heuristics for δ-Matching
We consider a version of pattern matching useful in processing large musical data: δ-matching, which consists in finding matches that are δ-approximate in the sense of a distance measured as the maximum difference between symbols. The alphabet is an interval of integers, and the distance between two symbols a, b is measured as |a − b|. We also consider (δ, γ)-matching, where γ is a bound on the total sum of the diff...
Abelian pattern matching in strings
Abelian pattern matching is a new class of pattern matching problems. In abelian patterns, the order of the characters in the substrings does not matter, e.g. the strings abbc and babc represent the same abelian pattern a+2b+c. Therefore, unlike classical pattern matching, we do not look for an exact (ordered) occurrence of a substring, rather the aim here is to find any permutation of a given ...
kmacs: the k-Mismatch Average Common Substring Approach for Phylogeny Reconstruction
The vast majority of sequence comparison methods for phylogeny reconstruction rely on pairwise or multiple sequence alignments. These approaches are in practice not usable for longer sequences such as complete genomes. For this reason alignment-free methods have recently become more popular because they are much faster and usually computable in linear time. Some of these methods are based on re...
A New Family of String Classifiers Based on Local Relatedness
This paper introduces a new family of string classifiers based on local relatedness. We use three types of local relatedness measurements, namely, longest common substrings (LCStr’s), longest common subsequences (LCSeq’s), and window-accumulated longest common subsequences (wLCSeq’s). We show that finding the optimal classifier for two given sets of strings (the positive set and the negative set)...
Fast and Sensitive Probe Selection for DNA Chips Using Jumps in Matching Statistics
The design of large scale DNA microarrays is a challenging problem. So far, probe selection algorithms must trade the ability to cope with large scale problems for a loss of accuracy in the estimation of probe quality. We present an approach based on jumps in matching statistics that combines the best of both worlds. This article consists of two parts. The first part is theoretical. We introduc...