Estimation of population heterozygosity and library construction-induced mutation rate from expressed sequence tag collections
Abstract
Unigene alignments obtained from cDNA libraries made using multiple individuals are not currently used to estimate population heterozygosity, as they are known to harbor mutations created during library construction. We describe an estimator of population heterozygosity that utilizes only SNPs unlikely to be library construction artifacts.
Expressed Sequence Tag (EST) projects have become a popular and cost-effective means of initially cataloging a large number of genes in biological systems without genome projects (reviewed in Rudd, 2003). DNA sequencing of several thousand randomly chosen clones from a cDNA library allows thousands of different transcripts to be identified. However, because the likelihood of observing a given transcript is proportional to its expression level in the tissue from which the library is derived, transcripts are often represented by several EST sequences. In a typical EST project that uses an inbred line as the starting material for the cDNA library, ESTs associated with the same transcript can be assembled into a Unigene cluster, and the consensus sequence of that assembly is referred to as a Unigene. If, instead, the ESTs contributing to a Unigene cluster come from cDNA libraries obtained from different individuals or different inbred lines, then Single Nucleotide Polymorphisms (SNPs) can be identified from the resulting alignments (Picoult-Newberg et al., 1999). Although SNPs obtained from haphazard collections of ESTs may have utility as markers, it would be difficult to estimate per-site heterozygosity from such a resource, since the unknown ascertainment scheme could bias any estimates.
On the other hand, if a cDNA library were constructed from an equimolar collection of RNAs from an infinite number of outbred individuals, the alignments associated with different Unigene clusters could be used to estimate per-site heterozygosity using standard population genetics methods for estimating diversity (e.g., Hartl and Clark, 1997). Standard methods could also be applied to a Unigene cluster obtained from a library derived from a finite number of individuals, provided the alignment depth of that cluster is much less than twice the number of individuals used to create the cDNA library, in order to ensure that alleles sampled in the Unigene cluster are likely to be independent of one another. However, the application of standard methods for estimating per-site heterozygosity to collections of ESTs has generally been avoided, as the DNA …
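As a rough illustration of the "standard methods" and the depth condition mentioned above, the following Python sketch computes the textbook unbiased gene diversity, n/(n-1) * (1 - sum of squared allele frequencies), per alignment column, and the chance that all reads in a cluster trace back to distinct chromosomes. It is not the estimator described in the paper: it applies no filtering of library construction artifacts. The function names are hypothetical, and the second function assumes diploid individuals whose 2N chromosome copies are sampled with equal probability and with replacement.

```python
from collections import Counter


def site_heterozygosity(column):
    """Unbiased gene diversity at one alignment column.

    Computes n/(n-1) * (1 - sum(p_i^2)) over the A/C/G/T calls in the
    column; gaps and ambiguity codes are ignored. Returns None when
    fewer than two usable bases are present.
    """
    bases = [b.upper() for b in column if b.upper() in "ACGT"]
    n = len(bases)
    if n < 2:
        return None
    counts = Counter(bases)
    sum_sq = sum((c / n) ** 2 for c in counts.values())
    return n / (n - 1) * (1.0 - sum_sq)


def mean_heterozygosity(alignment):
    """Average per-site diversity over all columns with at least two bases.

    `alignment` is a list of equal-length aligned EST strings
    (an assumed, simplified input format).
    """
    values = [h for h in map(site_heterozygosity, zip(*alignment))
              if h is not None]
    return sum(values) / len(values) if values else None


def prob_all_distinct(depth, n_individuals):
    """Probability that `depth` reads all come from different chromosomes.

    Assumes 2 * n_individuals chromosome copies, each equally likely to
    be sampled, with replacement (a birthday-problem calculation).
    """
    chromosomes = 2 * n_individuals
    p = 1.0
    for i in range(depth):
        p *= max(0.0, 1 - i / chromosomes)
    return p


if __name__ == "__main__":
    toy_alignment = ["ACGT", "ACGA", "ACGT", "ACGT"]  # made-up example data
    print(mean_heterozygosity(toy_alignment))
    # Depth 4 from 100 individuals versus depth 4 from 3 individuals
    print(prob_all_distinct(4, 100), prob_all_distinct(4, 3))
```

Under these assumptions, the probability of sampling only distinct chromosomes stays close to 1 when the depth is small relative to 2N and drops quickly as the depth approaches it, which is the intuition behind the depth condition.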
Similar resources
Estimation of population heterozygosity and library construction-induced mutation rate from expressed sequence tag collections.
Unigene alignments obtained from cDNA libraries made using multiple individuals are not currently used to estimate population heterozygosity, as they are known to harbor mutations created during library construction. We describe an estimator of population heterozygosity that utilizes only SNPs unlikely to be library construction artifacts.
openSputnik—a database to ESTablish comparative plant genomics using unsaturated sequence collections
The public expressed sequence tag collections are continually being enriched with high-quality sequences that represent an ever-expanding range of taxonomically diverse plant species. While these sequence collections provide biased insight into the populations of expressed genes available within individual species and their associated tissues, the information is conceivably of wider relevance i...
Genomic Resources and Tools for Gene Function Analysis in Potato
Potato, a highly heterozygous tetraploid, is undergoing an exciting phase of genomics resource development. The potato research community has established extensive genomic resources, such as large expressed sequence tag (EST) data collections, microarrays and other expression profiling platforms, and large-insert genomic libraries. Moreover, potato will now benefit from a global potato physical...
Correction of sequence-based artifacts in serial analysis of gene expression
Motivation: Serial Analysis of Gene Expression (SAGE) is a powerful technology for measuring global gene expression, through rapid generation of large numbers of transcript tags. Beyond their intrinsic value in differential gene expression analysis, SAGE tag collections afford abundant information on the size and shape of the sample transcriptome and can accelerate novel gene discovery. These la...
Genome analysis Support vector machines for separation of mixed plant–pathogen EST collections based on codon usage
Motivation: Discovery of host and pathogen genes expressed at the plant–pathogen interface often requires the construction of mixed libraries that contain sequences from both genomes. Sequence identification requires high-throughput and reliable classification of genome origin. When using single-pass cDNA sequences, difficulties arise from the short sequence length, the lack of sufficient taxono...