Choosing subsamples for sequencing studies by minimizing the average distance to the closest leaf
Abstract
Imputation of genotypes in a study sample can make use of sequenced or densely genotyped external reference panels consisting of individuals that are not from the study sample. It can also employ internal reference panels, incorporating a subset of individuals from the study sample itself. Internal panels offer an advantage over external panels, as they can reduce imputation errors arising from genetic dissimilarity between a population of interest and a second, distinct population from which the external reference panel has been constructed. As the cost of next-generation sequencing decreases, internal reference panel selection is becoming increasingly feasible. However, it is not clear how best to select individuals to include in such panels. We introduce a new method for selecting an internal reference panel, minimizing the average distance to the closest leaf (ADCL), and compare its performance with that of an earlier algorithm, maximizing phylogenetic diversity (PD). Employing both simulated data and sequences from the 1000 Genomes Project, we show that ADCL provides a significant improvement in imputation accuracy, especially for imputation of sites with low-frequency alleles. This improvement in imputation accuracy is robust to changes in reference panel size, marker density, and length of the imputation target region.
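To make the ADCL criterion concrete, the following is a minimal sketch and not the authors' implementation. It assumes a precomputed symmetric matrix of pairwise distances between leaves (the individuals in the study sample) and searches exhaustively over all panels of a given size, which is only practical for very small inputs. The names `adcl`, `best_panel`, and `dist`, and the distance values, are hypothetical and chosen for illustration.

```python
from itertools import combinations


def adcl(dist, panel):
    """Average, over all leaves, of the distance to the closest leaf in `panel`."""
    n = len(dist)
    return sum(min(dist[i][j] for j in panel) for i in range(n)) / n


def best_panel(dist, k):
    """Exhaustive search for a size-k panel with minimum ADCL (small inputs only)."""
    return min(combinations(range(len(dist)), k), key=lambda p: adcl(dist, p))


# Hypothetical pairwise distances among four leaves forming two tight pairs.
dist = [
    [0.0, 1.0, 6.0, 6.0],
    [1.0, 0.0, 6.0, 6.0],
    [6.0, 6.0, 0.0, 1.0],
    [6.0, 6.0, 1.0, 0.0],
]

panel = best_panel(dist, 2)
print(panel, adcl(dist, panel))
```

In this toy input, a panel that takes one leaf from each pair leaves every individual close to some panel member, whereas a panel drawn from a single pair leaves the other pair far from any reference, which is the situation the ADCL criterion penalizes.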