Speeding up all-against-all protein comparisons while maintaining sensitivity by considering subsequence-level homology

Authors

  • Lucas D. Wittwer
  • Ivana Piližota
  • Adrian M. Altenhoff
  • Christophe Dessimoz
  • Freddie Salsbury Jr
Abstract

Orthology inference and other sequence analyses across multiple genomes typically start by performing exhaustive pairwise sequence comparisons, a process referred to as "all-against-all". Because this process scales quadratically with the number of sequences analysed, it can become a bottleneck that limits the number of genomes that can be analysed simultaneously. Here, we explored ways of speeding up the all-against-all step while maintaining its sensitivity. By exploiting the transitivity of homology and, crucially, ensuring that homology is defined in terms of consistent protein subsequences, our proof-of-concept resulted in a 4× speedup while recovering >99.6% of all homologs identified by the full all-against-all procedure on empirical sequence sets. In comparison, state-of-the-art k-mer approaches are orders of magnitude faster but only recover 3-14% of all homologous pairs. We also outline ideas to further improve the speed and recall of the new approach. An open source implementation is provided as part of the OMA standalone software at http://omabrowser.org/standalone.
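
The two ideas in the abstract, quadratic growth of the all-against-all step and transitivity restricted to consistent subsequences, can be illustrated with a minimal sketch. This is not the OMA implementation; the function names, the interval representation and the `min_overlap` threshold are assumptions made for illustration only. The sketch assumes each sequence has already been aligned to a shared reference sequence, and it treats two sequences as candidate homologs only if their aligned regions on that reference overlap.

```python
from itertools import combinations


def n_pairs(n):
    """Number of unordered pairs in an exhaustive all-against-all of n sequences."""
    return n * (n - 1) // 2


def overlap(a, b):
    """Length of the overlap between two half-open intervals (start, end)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def candidate_pairs(hits, min_overlap=1):
    """Return pairs of sequences whose aligned regions on a shared reference overlap.

    hits: dict mapping a sequence id to the (start, end) region of the
          reference that the sequence aligns to (hypothetical input format).
    min_overlap: assumed threshold, in residues, for calling regions consistent.
    """
    pairs = []
    for (id1, region1), (id2, region2) in combinations(sorted(hits.items()), 2):
        # Transitivity is only trusted when both alignments cover
        # a common part of the reference sequence.
        if overlap(region1, region2) >= min_overlap:
            pairs.append((id1, id2))
    return pairs


if __name__ == "__main__":
    print(n_pairs(1000))
    hits = {"A": (0, 100), "B": (50, 150), "C": (120, 200)}
    print(candidate_pairs(hits))
```

In this toy input, A and C both match the reference but on disjoint regions, so the pair A-C is not proposed as a candidate. That is the situation the subsequence-level criterion is meant to catch, for example two proteins that share no domain with each other but each share a different domain with a multi-domain reference.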

Similar resources

Enhanced homology searching through genome reading frame predetermination

MOTIVATION Many bioinformatic approaches exist for finding novel genes within genomic sequence data. Traditionally, homology search-based methods are often the first approach employed in determining whether a novel gene exists that is similar to a known gene. Unfortunately, distantly related genes or motifs often are difficult to find using single query-based homology search algorithms against ...

Speeding-up Hirschberg and Hunt-Szymanski LCS Algorithms

Two algorithms are presented that solve the problem of recovering the longest common subsequence of two strings instead of just its length. The first algorithm is an improvement of Hirschberg’s divide-and-conquer algorithm. The second algorithm is an improvement of Hunt-Szymanski algorithm based on an efficient computation of all dominant match points. These two algorithms use bit-vector operat...

Speeding up transposition-invariant string matching

Finding the longest common subsequence (LCS) of two given sequences $A = a_0 a_1 \ldots a_{m-1}$ and $B = b_0 b_1 \ldots b_{n-1}$ is an important and well studied problem. We consider its generalization, transposition-invariant LCS (LCTS), which has recently arisen in the field of music information retrieval. In LCTS, we look for the longest common subsequence between the sequences A + t = (a0 + t)(a1 + t) . . ....

A Reschedule Design for Disrupted Liner Ships Considering Ports Demand and CO2 Emissions: The Case study of Islamic Republic of Iran Shipping Lines

This study presents a MILP model to retrieve or get close to the early schedule of disrupted container vessels. The model is applied to a real case study of Islamic Republic of Iran Shipping Lines (IRISL) considering container demands and CO2 emissions, solved with the CPLEX GAMS solver in less than a minute. Sensitivity analysis on fuel inventory level shows the inevitable influence of the ...

HH-MOTiF: de novo detection of short linear motifs in proteins by Hidden Markov Model comparisons

Short linear motifs (SLiMs) in proteins are self-sufficient functional sequences that specify interaction sites for other molecules and thus mediate a multitude of functions. Computational, as well as experimental biological research would significantly benefit, if SLiMs in proteins could be correctly predicted de novo with high sensitivity. However, de novo SLiM prediction is a difficult compu...

Journal title:

Volume 2, Issue 

Pages -

Publication date: 2014