SST: An algorithm for searching sequence databases in time proportional to the logarithm of the database size

Authors

  • Eldar Giladi
  • Michael G. Walker
  • James Ze Wang
Abstract

We have developed an algorithm, called SST (Sequence Search Tree), that searches a database of DNA sequences for near-exact matches in time proportional to the logarithm of the database size n. In SST, we partition each sequence into fragments of fixed length, called "windows", using multiple offsets. Each window is mapped into a vector of dimension 4^k which contains the frequency of occurrence of its component k-tuples, with k a parameter typically in the range 4 to 6. Then we create a tree-structured index of the windows in vector space, using tree-structured vector quantization (TSVQ). We identify the nearest neighbors of a query sequence by partitioning the query into windows and searching the tree-structured index for nearest-neighbor windows in the database. This yields an O(log n) complexity for the search. SST is most effective for applications in which the target sequences show a high degree of similarity to the query sequence, such as assembling shotgun sequences or matching ESTs to genomic sequence. The algorithm is also an effective filtration method. Specifically, it can be used as a preprocessing step for other search methods to reduce the complexity of searching one large database against another. For the problem of identifying overlapping fragments in the assembly of 120,000 fragments from a 1.5 megabase genomic sequence, SST is 17 to 35 times faster than BLAST when we consider both building and searching the tree. For searching alone (i.e., after building the tree index), SST is 50 to 100 times faster than BLAST.
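The pipeline described in the abstract (windows, k-tuple frequency vectors, a tree-structured index, greedy nearest-neighbor descent) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`windows`, `ktuple_vector`, `build_tree`, `search`), the parameter defaults, and the use of a simple recursive 2-means split as a stand-in for TSVQ training are all assumptions made for illustration, since the abstract does not specify those details.

```python
from itertools import product

ALPHABET = "ACGT"

def windows(seq, width, step):
    """Yield (offset, fragment) for fixed-length windows taken every `step` bases."""
    for start in range(0, len(seq) - width + 1, step):
        yield start, seq[start:start + width]

def ktuple_vector(window, k):
    """Count each overlapping k-tuple in the window; the result has 4**k entries."""
    index = {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=k))}
    counts = [0] * len(index)
    for i in range(len(window) - k + 1):
        j = index.get(window[i:i + k])
        if j is not None:  # skip k-tuples containing ambiguous bases such as N
            counts[j] += 1
    return counts

def dist2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def _mean(items):
    n = len(items)
    return [sum(col) / n for col in zip(*(vec for _, vec in items))]

def _split(items, c0, c1):
    """Assign each (label, vector) item to the nearer of two centroids."""
    left, right = [], []
    for item in items:
        (left if dist2(item[1], c0) <= dist2(item[1], c1) else right).append(item)
    return left, right

def build_tree(items, leaf_size=2, iters=5):
    """Recursively split items with 2-means (assumed stand-in for TSVQ training)."""
    if len(items) <= leaf_size:
        return {"leaf": items}
    c0 = items[0][1]
    c1 = max(items, key=lambda item: dist2(item[1], c0))[1]
    left, right = _split(items, c0, c1)
    if not right:  # all vectors are identical, so no split is possible
        return {"leaf": items}
    for _ in range(iters):
        n0, n1 = _mean(left), _mean(right)
        new_left, new_right = _split(items, n0, n1)
        if not new_left or not new_right:
            break  # keep the last split that had two non-empty sides
        c0, c1, left, right = n0, n1, new_left, new_right
    return {"centroids": (c0, c1),
            "children": (build_tree(left, leaf_size, iters),
                         build_tree(right, leaf_size, iters))}

def search(tree, vec):
    """Follow the nearer centroid at each level, then scan the leaf that is reached."""
    while "leaf" not in tree:
        c0, c1 = tree["centroids"]
        tree = tree["children"][0 if dist2(vec, c0) <= dist2(vec, c1) else 1]
    return min(tree["leaf"], key=lambda item: dist2(item[1], vec))
```

To use the sketch, build `items` as a list of `((name, offset), ktuple_vector(fragment, k))` pairs over every window of every database sequence, call `build_tree(items)` once, and then call `search(tree, ktuple_vector(query_window, k))` for each query window. The descent visits one child per level, so the search cost grows with the tree depth, which is logarithmic in the number of windows only if the splits stay reasonably balanced. The greedy descent is approximate: it can miss the true nearest neighbor when the query falls close to a split boundary.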


Related resources

SST: an algorithm for finding near-exact sequence matches in time proportional to the logarithm of the database size

MOTIVATION Searches for near exact sequence matches are performed frequently in large-scale sequencing projects and in comparative genomics. The time and cost of performing these large-scale sequence-similarity searches is prohibitive using even the fastest of the extant algorithms. Faster algorithms are desired. RESULTS We have developed an algorithm, called SST (Sequence Search Tree), that ...


The state of information retrieval in the Namayeh and Nama databases and an assessment of the effectiveness of controlled vocabulary in indexing these two databases

Purpose: This study was carried out to determine the level of precision, recall, and searching time for “Nama” and “Namayeh” databases, as well as to find out which of the indexing tools (thesaurus and Dewey decimal classification) helps us more in improvement of information retrieval. Methodology: This study is an analytical survey in which the necessary data was collected by direct observati...


Relational Databases Query Optimization using Hybrid Evolutionary Algorithm

Optimizing database queries is a hard research problem. Exhaustive search techniques such as dynamic programming are suitable for queries with a few relations, but as the number of relations in a query increases, they demand too much memory and processing to remain practical, so random and evolutionary methods must be used instead. The use of evolutionary methods, beca...


Two meta-heuristic algorithms for parallel machines scheduling problem with past-sequence-dependent setup times and effects of deterioration and learning

This paper considers identical parallel machines scheduling problem with past-sequence-dependent setup times, deteriorating jobs and learning effects, in which the actual processing time of a job on each machine is given as a function of the processing times of the jobs already processed and its scheduled position on the corresponding machine. In addition, the setup time of a job on each machin...


gpALIGNER: A Fast Algorithm for Global Pairwise Alignment of DNA Sequences

Bioinformatics, through the sequencing of the full genomes for many species, is increasingly relying on efficient global alignment tools exhibiting both high sensitivity and specificity. Many computational algorithms have been applied for solving the sequence alignment problem. Dynamic programming, statistical methods, approximation and heuristic algorithms are the most common methods appli...




Publication date: 1999