Nested Containment List (NCList): a new algorithm for accelerating interval query of genome alignment and interval databases
Abstract
MOTIVATION: The exponential growth of sequence databases poses a major challenge to bioinformatics tools for querying alignment and annotation databases. There is a pressing need for methods of finding overlapping sequence intervals that scale well with database size, query interval size, result size, and the cost of constructing and updating the interval database.

RESULTS: We have developed a new interval database representation, the Nested Containment List (NCList), whose query time is O(n + log N), where N is the database size and n is the size of the result set. In all cases tested, this query algorithm is 5- to 500-fold faster than the other indexing methods tested in this study, such as MySQL multi-column indexing, MySQL binning and R-Tree indexing. We provide performance comparisons on both simulated datasets and real-world genome alignment databases, across a wide range of database sizes and query interval widths. We also present an in-place NCList construction algorithm that yields database construction times approximately 100-fold faster than other available methods. The NCList data structure appears to provide a useful foundation for highly scalable interval database applications.

AVAILABILITY: The NCList data structure is part of Pygr, a bioinformatics graph database library, available at http://sourceforge.net/projects/pygr
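The abstract does not spell out the data structure, so the following is only a minimal in-memory sketch of the nested containment idea, not Pygr's implementation or API. It assumes half-open integer intervals (start, end); the class name NCList, the method find_overlaps and the list-of-lists layout are all hypothetical choices made for illustration. Intervals that are fully contained in another interval are moved into that interval's sublist, so within any one list both starts and ends are strictly increasing. That ordering is what allows a binary search to locate the first candidate, followed by a scan that stops at the first non-overlapping entry.

```python
class NCList:
    """Illustrative nested containment list over half-open (start, end) intervals."""

    def __init__(self, intervals):
        # Sort by start ascending, then end descending, so that a containing
        # interval always precedes the intervals it contains.
        ordered = sorted(set(intervals), key=lambda iv: (iv[0], -iv[1]))
        self.top = []   # top-level list of [start, end, children] nodes
        stack = []      # chain of containers enclosing the current position
        for start, end in ordered:
            # Drop containers that end before this interval ends:
            # they cannot contain it.
            while stack and stack[-1][1] < end:
                stack.pop()
            node = [start, end, []]
            (stack[-1][2] if stack else self.top).append(node)
            stack.append(node)

    def find_overlaps(self, qstart, qend):
        """Return every stored interval that overlaps [qstart, qend)."""
        out = []
        self._query(self.top, qstart, qend, out)
        return out

    def _query(self, nodes, qstart, qend, out):
        # Ends are increasing within a sublist, so binary-search for the
        # first node whose end lies beyond the query start.
        lo, hi = 0, len(nodes)
        while lo < hi:
            mid = (lo + hi) // 2
            if nodes[mid][1] > qstart:
                hi = mid
            else:
                lo = mid + 1
        # Starts are increasing too, so scan until a node starts at or
        # after the query end; only overlapping nodes are descended into.
        i = lo
        while i < len(nodes) and nodes[i][0] < qend:
            start, end, children = nodes[i]
            out.append((start, end))
            self._query(children, qstart, qend, out)
            i += 1


if __name__ == "__main__":
    db = NCList([(1, 10), (2, 5), (3, 4), (6, 9), (12, 20), (15, 18), (25, 30)])
    print(db.find_overlaps(4, 7))
```

Because a contained interval can only overlap the query if its container does, sublists of non-overlapping intervals are never visited. A production version would store the lists in flat arrays on disk rather than in nested Python lists; that layout is outside the scope of this sketch.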
Similar resources
Rewriting queries using views in the presence of arithmetic comparisons
We consider the problem of answering queries using views, where queries and views are conjunctive queries with arithmetic comparisons over dense orders. Previous work only considered limited variants of this problem, without giving a complete solution. We first show that obtaining equivalent rewritings for conjunctive queries with arithmetic comparisons is decidable. Then we consider the proble...
A New Model Selection Test with Application to the Censored Data of Carbon Nanotubes Coating
Model selection for nano and micro droplet spreading can be widely used to predict and optimize different coating processes such as ink jet printing, spray painting and plasma spraying. Model selection begins with a set of data and rival models, from which the best model is chosen. Deciding among this set is an important question in statistical inference. Some tests and criteria a...
Asymptotic algorithm for computing the sample variance of interval data
The problem of the sample variance computation for epistemic interval-valued data is, in general, NP-hard. Therefore, known efficient algorithms for computing variance require strong restrictions on admissible intervals like the no-subset property or heavy limitations on the number of possible intersections between intervals. A new asymptotic algorithm for computing the upper bound of the samp...
gpALIGNER: A Fast Algorithm for Global Pairwise Alignment of DNA Sequences
Bioinformatics, through the sequencing of the full genomes for many species, is increasingly relying on efficient global alignment tools exhibiting both high sensitivity and specificity. Many computational algorithms have been applied for solving the sequence alignment problem. Dynamic programming, statistical methods, approximation and heuristic algorithms are the most common methods appli...
New Heuristic Algorithm for Flow Shop Scheduling with 3 Machines and 2 Robots Considering the Breakdown Interval of Machines and Robots Simultaneously
Scheduling is an important subject in production and operations management. In flow-shop scheduling, the objective is to obtain a sequence of jobs which, when processed in a fixed order of machines, will optimize some well-defined criteria. The concept of transportation time is very important in scheduling. Transportation can be done by robots. In situations that robots are used to transpor...
Journal:
- Bioinformatics
Volume 23, Issue 11
Publication date: 2007