Nested Containment List (NCList): a new algorithm for accelerating interval query of genome alignment and interval databases

Authors

  • Alexander V. Alekseyenko
  • Christopher J. Lee
Abstract

MOTIVATION: The exponential growth of sequence databases poses a major challenge to bioinformatics tools for querying alignment and annotation databases. There is a pressing need for methods for finding overlapping sequence intervals that are highly scalable to database size, query interval size, result size and construction/updating of the interval database.

RESULTS: We have developed a new interval database representation, the Nested Containment List (NCList), whose query time is O(n + log N), where N is the database size and n is the size of the result set. In all cases tested, this query algorithm is 5-500-fold faster than other indexing methods tested in this study, such as MySQL multi-column indexing, MySQL binning and R-Tree indexing. We provide performance comparisons both in simulated datasets and real-world genome alignment databases, across a wide range of database sizes and query interval widths. We also present an in-place NCList construction algorithm that yields database construction times that are approximately 100-fold faster than other methods available. The NCList data structure appears to provide a useful foundation for highly scalable interval database applications.

AVAILABILITY: NCList data structure is part of Pygr, a bioinformatics graph database library, available at http://sourceforge.net/projects/pygr
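
The idea behind the name can be illustrated with a minimal sketch. It assumes half-open (start, end) intervals and stores every interval that is contained in another inside that interval's sublist, so that the siblings in any one list are sorted by both start and end and can be binary searched. The function names `build_nclist` and `query_nclist` are hypothetical. This is not Pygr's implementation or the in-place construction algorithm mentioned in the abstract, and it makes no attempt to reproduce the stated complexity bound exactly.

```python
def build_nclist(intervals):
    """Build a nested list of [start, end, children] nodes (illustrative only)."""
    # Sort by start ascending, then end descending, so that a containing
    # interval always comes before the intervals it contains.
    ordered = sorted(intervals, key=lambda iv: (iv[0], -iv[1]))
    root = []    # top-level list: no interval here is contained in another
    stack = []   # chain of currently "open" containers, outermost first
    for start, end in ordered:
        # Drop containers that end too early to contain this interval.
        while stack and end > stack[-1][1]:
            stack.pop()
        node = [start, end, []]
        (stack[-1][2] if stack else root).append(node)
        stack.append(node)
    return root


def query_nclist(nclist, q_start, q_end):
    """Return all stored intervals overlapping the half-open [q_start, q_end)."""
    hits = []
    # Siblings have increasing ends, so binary search for the first
    # sibling whose end lies beyond the query start.
    lo, hi = 0, len(nclist)
    while lo < hi:
        mid = (lo + hi) // 2
        if nclist[mid][1] > q_start:
            hi = mid
        else:
            lo = mid + 1
    # Scan forward while siblings still start before the query end,
    # descending into each overlapping interval's sublist.
    i = lo
    while i < len(nclist) and nclist[i][0] < q_end:
        start, end, children = nclist[i]
        hits.append((start, end))
        hits.extend(query_nclist(children, q_start, q_end))
        i += 1
    return hits


demo_intervals = [(0, 10), (2, 5), (3, 4), (6, 9), (12, 20), (15, 18), (21, 25)]
demo_index = build_nclist(demo_intervals)
demo_hits = query_nclist(demo_index, 5, 13)
print(demo_hits)  # [(0, 10), (6, 9), (12, 20)]
```

Here (2, 5) and (3, 4) are skipped without being visited individually, because the binary search inside the sublist of (0, 10) jumps straight to (6, 9).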

Similar resources

Rewriting queries using views in the presence of arithmetic comparisons

We consider the problem of answering queries using views, where queries and views are conjunctive queries with arithmetic comparisons over dense orders. Previous work only considered limited variants of this problem, without giving a complete solution. We first show that obtaining equivalent rewritings for conjunctive queries with arithmetic comparisons is decidable. Then we consider the proble...

A New Model Selection Test with Application to the Censored Data of Carbon Nanotubes Coating

Model selection for nano- and micro-droplet spreading can be widely used to predict and optimize different coating processes such as ink jet printing, spray painting and plasma spraying. The idea of model selection is to begin with a set of data and rival models and to choose the best one. Decision making on this set is an important question in statistical inference. Some tests and criteria a...

Asymptotic algorithm for computing the sample variance of interval data

The problem of the sample variance computation for epistemic interval-valued data is, in general, NP-hard. Therefore, known efficient algorithms for computing variance require strong restrictions on admissible intervals like the no-subset property or heavy limitations on the number of possible intersections between intervals. A new asymptotic algorithm for computing the upper bound of the samp...

gpALIGNER: A Fast Algorithm for Global Pairwise Alignment of DNA Sequences

Bioinformatics, through the sequencing of the full genomes for many species, is increasingly relying on efficient global alignment tools exhibiting both high sensitivity and specificity. Many computational algorithms have been applied for solving the sequence alignment problem. Dynamic programming, statistical methods, approximation and heuristic algorithms are the most common methods appli...

New Heuristic Algorithm for Flow Shop Scheduling with 3 Machines and 2 Robots Considering the Breakdown Interval of Machines and Robots Simultaneously

Scheduling is an important subject in the production and operations management area. In flow-shop scheduling, the objective is to obtain a sequence of jobs which, when processed in a fixed order of machines, will optimize some well-defined criteria. The concept of transportation time is very important in scheduling. Transportation can be done by robots. In situations where robots are used to transpor...

Journal:
  • Bioinformatics

Volume 23, Issue 11

Pages: -

Publication date: 2007