Maximum Likelihood Genome Assembly

Authors

  • Paul Medvedev
  • Michael Brudno
Abstract

Whole genome shotgun assembly is the process of taking many short sequenced segments (reads) and reconstructing the genome from which they originated. We demonstrate how the technique of bidirected network flow can be used to explicitly model the double-stranded nature of DNA for genome assembly. By combining an algorithm for the Chinese Postman Problem on bidirected graphs with the construction of a bidirected de Bruijn graph, we are able to find the shortest double-stranded DNA sequence that contains a given set of k-long DNA molecules. This is the first exact polynomial time algorithm for the assembly of a double-stranded genome. Furthermore, we propose a maximum likelihood framework for assembling the genome that is the most likely source of the reads, in lieu of the standard maximum parsimony approach (which finds the shortest genome subject to some constraints). In this setting, we give a bidirected network flow-based algorithm that, by taking advantage of high coverage, accurately estimates the copy counts of repeats in a genome. Our second algorithm combines these predicted copy counts with matepair data in order to assemble the reads into contigs. We run our algorithms on simulated read data from Escherichia coli and predict copy counts with extremely high accuracy, while assembling long contigs.
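Two ideas from the abstract can be illustrated with a small sketch: treating a k-mer and its reverse complement as the same double-stranded molecule, and using high coverage to guess repeat copy counts. The code below is only a simplified illustration, not the bidirected network flow algorithm the paper describes. The function names, and the naive rule of dividing each count by the median count, are assumptions made for this example.

```python
from collections import Counter
from statistics import median

# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the sequence of the opposite DNA strand, read 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(kmer: str) -> str:
    """Pick one representative for a k-mer and its reverse complement.

    Both strings describe the same double-stranded molecule, so they
    should be counted together.
    """
    return min(kmer, reverse_complement(kmer))


def count_canonical_kmers(reads, k):
    """Count every k-mer in the reads, merging the two strands."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[canonical(read[i:i + k])] += 1
    return counts


def naive_copy_counts(counts):
    """Naive stand-in for copy-count estimation (an assumption, not the
    paper's method): treat the median count as single-copy coverage and
    round each k-mer's count to the nearest multiple of it."""
    baseline = median(counts.values())
    return {kmer: max(1, round(c / baseline)) for kmer, c in counts.items()}
```

For example, the reads `"AACG"` and `"CGTT"` are reverse complements of each other, so `count_canonical_kmers(["AACG", "CGTT"], 3)` counts each of `AAC` and `ACG` twice. The median-based rule is only reasonable when most k-mers are unique in the genome and coverage is high and fairly uniform.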

Similar resources

Constructing a genome assembly that has the maximum likelihood

We formulate the genome assembly problem as an optimization problem in which the objective function is the likelihood of the assembly given the reads.

SWALO: scaffolding with assembly likelihood optimization

Scaffolding, i.e., ordering and orienting contigs, is an important step in genome assembly. We present a method for scaffolding based on likelihoods of genome assemblies. Generative models for sequencing are used to obtain maximum likelihood estimates of gaps between contigs and to estimate whether linking contigs into scaffolds would lead to an increase in the likelihood of the assembly. We then ...

Phylogenomic inferences from reference-mapped and de novo assembled short-read sequence data using RADseq sequencing of California white oaks (Quercus section Quercus).

The emergence of next-generation sequencing has increased by several orders of magnitude the amount of data available for phylogenetics. Reduced representation approaches, such as restriction-site associated DNA sequencing (RADseq), have proven useful for phylogenetic studies of non-model species at a wide range of phylogenetic depths. However, analysis of these datasets is not uniform and we ...

Environmental factor dependent maximum likelihood method for association study targeted to personalized medicine.

The most challenging strategy for analyzing genome-wide polymorphisms and/or expression profiles is to solve multi-factor causal relationships simultaneously. As a first step, we propose a framework for association studies using the maximum likelihood method that simultaneously handles genetic polymorphisms and epi-genetic information, e.g. environmental factors. We evaluate the theory by applying it...

Likelihood-based inference of population history from low-coverage de novo genome assemblies.

Short-read sequencing technologies have in principle made it feasible to draw detailed inferences about the recent history of any organism. In practice, however, this remains challenging due to the difficulty of genome assembly in most organisms and the lack of statistical methods powerful enough to discriminate between recent, nonequilibrium histories. We address both the assembly and inferenc...

Journal:
  • Journal of computational biology : a journal of computational molecular cell biology

Volume 16, Issue 8

Pages: -

Publication date: 2009