Sequence analysis Shifted Hamming distance: a fast and accurate SIMD-friendly filter to accelerate alignment verification in read mapping
نویسندگان
چکیده
Motivation: Calculating the edit-distance (i.e. minimum number of insertions, deletions and substitutions) between short DNA sequences is the primary task performed by seed-and-extend based mappers, which compare billions of sequences. In practice, only sequence pairs with a small editdistance provide useful scientific data. However, the majority of sequence pairs analyzed by seedand-extend based mappers differ by significantly more errors than what is typically allowed. Such error-abundant sequence pairs needlessly waste resources and severely hinder the performance of read mappers. Therefore, it is crucial to develop a fast and accurate filter that can rapidly and efficiently detect error-abundant string pairs and remove them from consideration before more computationally expensive methods are used. Results: We present a simple and efficient algorithm, Shifted Hamming Distance (SHD), which accelerates the alignment verification procedure in read mapping, by quickly filtering out error-abundant sequence pairs using bit-parallel and SIMD-parallel operations. SHD only filters string pairs that contain more errors than a user-defined threshold, making it fully comprehensive. It also maintains high accuracy with moderate error threshold (up to 5% of the string length) while achieving a 3-fold speedup over the best previous algorithm (Gene Myers’s bit-vector algorithm). SHD is compatible with all mappers that perform sequence alignment for verification. Availability and implementation: We provide an implementation of SHD in C with Intel SSE instructions at: https://github.com/CMU-SAFARI/SHD. Contact: [email protected], [email protected] or [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
منابع مشابه
Shifted Hamming distance: a fast and accurate SIMD-friendly filter to accelerate alignment verification in read mapping
MOTIVATION Calculating the edit-distance (i.e. minimum number of insertions, deletions and substitutions) between short DNA sequences is the primary task performed by seed-and-extend based mappers, which compare billions of sequences. In practice, only sequence pairs with a small edit-distance provide useful scientific data. However, the majority of sequence pairs analyzed by seed-and-extend ba...
متن کاملSequence analysis GateKeeper: a new hardware architecture for accelerating pre-alignment in DNA short read mapping
Motivation: High throughput DNA sequencing (HTS) technologies generate an excessive number of small DNA segments -called short readsthat cause significant computational burden. To analyze the entire genome, each of the billions of short reads must be mapped to a reference genome based on the similarity between a read and ‘candidate’ locations in that reference genome. The similarity measurement...
متن کاملGateKeeper: a new hardware architecture for accelerating pre-alignment in DNA short read mapping
Motivation High throughput DNA sequencing (HTS) technologies generate an excessive number of small DNA segments -called short reads- that cause significant computational burden. To analyze the entire genome, each of the billions of short reads must be mapped to a reference genome based on the similarity between a read and 'candidate' locations in that reference genome. The similarity measuremen...
متن کاملMAGNET: Understanding and Improving the Accuracy of Genome Pre-Alignment Filtering
In the era of high throughput DNA sequencing (HTS) technologies, calculating the edit distance (i.e., the minimum number of substitutions, insertions, and deletions between a pair of sequences) for billions of genomic sequences is the computational bottleneck in todays read mappers. The shifted Hamming distance (SHD) algorithm proposes a fast filtering strategy that can rapidly filter out inval...
متن کاملHobbes: optimized gram-based methods for efficient read alignment
Recent advances in sequencing technology have enabled the rapid generation of billions of bases at relatively low cost. A crucial first step in many sequencing applications is to map those reads to a reference genome. However, when the reference genome is large, finding accurate mappings poses a significant computational challenge due to the sheer amount of reads, and because many reads map to ...
متن کامل