Characterizing bias in population genetic inferences from low-coverage sequencing data.
نویسندگان
چکیده
The site frequency spectrum (SFS) is of primary interest in population genetic studies, because the SFS compresses variation data into a simple summary from which many population genetic inferences can proceed. However, inferring the SFS from sequencing data is challenging because genotype calls from sequencing data are often inaccurate due to high error rates and if not accounted for, this genotype uncertainty can lead to serious bias in downstream analysis based on the inferred SFS. Here, we compare two approaches to estimate the SFS from sequencing data: one approach infers individual genotypes from aligned sequencing reads and then estimates the SFS based on the inferred genotypes (call-based approach) and the other approach directly estimates the SFS from aligned sequencing reads by maximum likelihood (direct estimation approach). We find that the SFS estimated by the direct estimation approach is unbiased even at low coverage, whereas the SFS by the call-based approach becomes biased as coverage decreases. The direction of the bias in the call-based approach depends on the pipeline to infer genotypes. Estimating genotypes by pooling individuals in a sample (multisample calling) results in underestimation of the number of rare variants, whereas estimating genotypes in each individual and merging them later (single-sample calling) leads to overestimation of rare variants. We characterize the impact of these biases on downstream analyses, such as demographic parameter estimation and genome-wide selection scans. Our work highlights that depending on the pipeline used to infer the SFS, one can reach different conclusions in population genetic inference with the same data set. Thus, careful attention to the analysis pipeline and SFS estimation procedures is vital for population genetic inferences.
منابع مشابه
Estimating heterozygosity from a low-coverage genome sequence, leveraging data from other individuals sequenced at the same sites
High-throughput shotgun sequence data makes it possible in principle to accurately estimate population genetic parameters without confounding by SNP ascertainment bias. One such statistic of interest is the proportion of heterozygous sites within an individual’s genome, which is informative about inbreeding and effective population size. However, in many cases, the available sequence data of an...
متن کاملA novel approach to estimating heterozygosity from low-coverage genome sequence An Investigation Submitted to Genetics
High-throughput shotgun sequence data makes it possible in principle to accurately estimate population genetic parameters without confounding by SNP ascertainment bias. One such statistic of interest is the proportion of heterozygous sites within an individual’s genome, which is informative about inbreeding and effective population size. However, in many cases, the available sequence data of an...
متن کاملA novel approach to estimating heterozygosity from low-coverage genome sequence.
High-throughput shotgun sequence data make it possible in principle to accurately estimate population genetic parameters without confounding by SNP ascertainment bias. One such statistic of interest is the proportion of heterozygous sites within an individual's genome, which is informative about inbreeding and effective population size. However, in many cases, the available sequence data of an ...
متن کاملAccounting for bias from sequencing error in population genetic estimates.
Sequencing error presents a significant challenge to population genetic analyses using low-coverage sequence in general and single-pass reads in particular. Bias in parameter estimates becomes severe when the level of polymorphism (signal) is low relative to the amount of error (noise). Choosing an arbitrary quality score cutoff yields biased estimates, particularly with newer, non-Sanger seque...
متن کاملngsTools: methods for population genetics analyses from next-generation sequencing data
SUMMARY Next-generation sequencing technologies produce short reads that are either de novo assembled or mapped to a reference genome. Genotypes and/or single-nucleotide polymorphisms are then determined from the read composition at each site, which become the basis for many downstream analyses. However, for low sequencing depths, e.g. , there is considerable statistical uncertainty in the assi...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید
ثبت ناماگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید
ورودعنوان ژورنال:
- Molecular biology and evolution
دوره 31 3 شماره
صفحات -
تاریخ انتشار 2014