Fast search in DNA sequence databases using punctuation and indexing
Abstract
Exact pattern searching in DNA sequence databases has applications in identifying highly conserved regulatory sequences, designing hybridization probes, and improving the performance of approximate homology search tools such as BLAST and BLAT. We propose a new pattern searching algorithm, CompressedPunctuated-Boyer-Moore (cp-BM), to speed up exact pattern matching in DNA sequences. cp-BM represents each A, T, C, or G character with two bits, packing four characters into one 8-bit byte (4C8B compression), and adds punctuator characters that unambiguously mark the encoding frame of the compressed target sequence. This solves the misalignment problem that arises when searching for patterns under ordinary 4C8B compression. cp-BM searches DNA patterns at least 6 times faster than AGREP for pattern lengths ≥ 128 and between 2-fold and 5-fold faster than d-BM for all pattern lengths. Punctuator indexing and multiple punctuators further improve cp-BM's performance, especially for short sequences, yielding greater than 10-fold speedups over d-BM and AGREP. In addition, cp-BM outperformed BLAT for sequences of 64 or more bases, and was more than three-fold faster for 256-base sequences.
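To make the misalignment problem concrete, here is a minimal Python sketch of plain 4C8B packing. It is an illustration only, not the cp-BM implementation: the bit assignment in `CODE`, the helper name `pack`, and the zero-bit padding of a trailing partial byte are all assumptions, since the abstract does not specify them. The sketch shows that a byte-level search on the packed data finds a pattern only when the occurrence starts on a four-base boundary, which is the ambiguity the punctuators are said to resolve.

```python
# Hypothetical 2-bit code; the abstract does not give the actual assignment.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def pack(seq: str) -> bytes:
    """Pack a DNA string into bytes, four bases per byte (first base in the high bits).

    A trailing group of fewer than four bases is padded with zero bits,
    which is a simplification made for this sketch.
    """
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        byte <<= 2 * (4 - len(chunk))  # left-align a partial final group
        out.append(byte)
    return bytes(out)


text = "ACGTTGCAACGTGGCC"
pattern = "TGCAACGT"      # starts at offset 4 in text, on a byte boundary
shifted = "A" + text      # the same occurrence now starts at offset 5

# Byte-level search succeeds only when the occurrence is frame-aligned.
print(pack(pattern) in pack(text))     # True
print(pack(pattern) in pack(shifted))  # False, although the pattern occurs in shifted
```

In the second search the pattern is still present in the uncompressed string, but its bases straddle byte boundaries in the packed form, so the packed pattern bytes never appear in the packed text.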
Similar resources
Adapting Decision Tree-Based Method to Index Large DNA-Protein Sequence Datasets
The size of biological databases has increased significantly along with the number of users and the rate of queries, and some databases are now of terabyte size. Hence, there is an increasing need to access databases at the fastest possible rate. For biologists, the need is for a means of fast, scalable and accurate searching in biological databases. This may seem ...
Efficient Querying on Genomic Databases by Using Metric Space Indexing Techniques
A genomic database consists of a set of nucleotide sequences, for which an important kind of query is local sequence alignment. This paper investigates two different indexing techniques, namely variations of GNAT trees [1] and M-trees [3], to support fast query evaluation for local alignment by transforming the alignment problem into a variant of the metric space neighborhood search problem.
Adapting and Enhancing the Searching Algorithm Based on Decision Tree Indexing for Large DNA-Protein Datasets
The size of biological databases has increased significantly along with the number of users and the rate of queries, and some databases are now of terabyte size. Hence, there is an increasing need to access databases at the fastest possible rate. For biologists, the need is for a means of fast, scalable and accurate searching in biological databases. This may seem ...
gpALIGNER: A Fast Algorithm for Global Pairwise Alignment of DNA Sequences
Bioinformatics, through the sequencing of the full genomes of many species, increasingly relies on efficient global alignment tools exhibiting both high sensitivity and specificity. Many computational algorithms have been applied to the sequence alignment problem. Dynamic programming, statistical methods, approximation and heuristic algorithms are the most common methods appli...