Hybrid assembly of the large and highly repetitive genome of Aegilops tauschii, a progenitor of bread wheat, with the mega-reads algorithm
Abstract
Long sequencing reads generated by single-molecule sequencing technology offer the possibility of dramatically improving the contiguity of genome assemblies. The biggest challenge today is that long reads have relatively high error rates, currently around 15%. The high error rates make it difficult to use these data alone, particularly with highly repetitive plant genomes. Errors in the raw data can lead to insertion or deletion errors (indels) in the consensus genome sequence, which in turn create significant problems for downstream analysis; for example, a single indel may shift the reading frame and incorrectly truncate a protein sequence. Here we describe an algorithm that solves the high error rate problem by combining long, high-error reads with shorter but much more accurate Illumina sequencing reads, whose error rates average <1%. Our hybrid assembly algorithm combines these two types of reads to construct mega-reads, which are both long and accurate, and then assembles the mega-reads using the CABOG assembler, which was designed for long reads. We apply this technique to a large data set of Illumina and PacBio sequences from the species Aegilops tauschii, a large and highly repetitive plant genome that has resisted previous attempts at assembly. We show that the resulting assembled contigs are far larger than in any previous assembly, with an N50 contig size of 486,807 bp. We compare the contigs to independently produced optical maps to evaluate their large-scale accuracy, and to a set of high-quality bacterial artificial chromosome (BAC)-based assemblies to evaluate base-level accuracy.
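The N50 contig size quoted above is a standard contiguity statistic: the contig length L such that contigs of length L or longer together contain at least half of the assembled bases. The following is a minimal sketch of that calculation, assuming the denominator is the total assembled length (some reports use an estimated genome size instead). The function name `n50` and the toy lengths are illustrative and are not taken from the paper.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (in bp).

    Assumption: half of the total *assembled* length is the threshold,
    not half of an estimated genome size.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("need at least one contig length")
    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest contig down until half the bases are covered.
    for length in lengths:
        running += length
        if running >= half_total:
            return length


# Toy contig lengths, invented purely for illustration.
toy_contigs = [80, 70, 50, 40, 30, 20]
print(n50(toy_contigs))  # total is 290, so the threshold 145 is reached at the 70 bp contig
```

Because N50 depends only on the length distribution, it says nothing about correctness, which is why the abstract pairs it with optical-map and BAC-based comparisons.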