Choosing subsamples for sequencing studies by minimizing the average distance to the closest leaf

Authors

Jonathan T. L. Kang

Peng Zhang

Sebastian Zöllner

Abstract

Imputation of genotypes in a study sample can make use of sequenced or densely genotyped external reference panels consisting of individuals that are not from the study sample. It can also employ internal reference panels, incorporating a subset of individuals from the study sample itself. Internal panels offer an advantage over external panels, as they can reduce imputation errors arising from genetic dissimilarity between a population of interest and a second, distinct population from which the external reference panel has been constructed. As the cost of next-generation sequencing decreases, internal reference panel selection is becoming increasingly feasible. However, it is not clear how best to select individuals to include in such panels. We introduce a new method for selecting an internal reference panel, minimizing the average distance to the closest leaf (ADCL), and compare its performance relative to an earlier algorithm: maximizing phylogenetic diversity (PD). Employing both simulated data and sequences from the 1000 Genomes Project, we show that ADCL provides a significant improvement in imputation accuracy, especially for imputation of sites with low-frequency alleles. This improvement in imputation accuracy is robust to changes in reference panel size, marker density, and length of the imputation target region.
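To make the ADCL criterion concrete, the following minimal Python sketch scores a candidate panel by the average, over all leaves, of the distance to the nearest panel member, and then picks the best panel of a given size by exhaustive search. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say how the tree is built or how the minimization is carried out, so the pairwise distance matrix `toy_dist`, the function names `adcl` and `select_panel_exhaustive`, and the brute-force search are all hypothetical choices made here for clarity.

```python
from itertools import combinations


def adcl(dist, panel):
    """Average, over all leaves, of the distance to the closest panel member.

    `dist` is a symmetric matrix of pairwise leaf-to-leaf distances
    (assumed to be given; how the tree is inferred is out of scope here).
    """
    n = len(dist)
    return sum(min(dist[i][j] for j in panel) for i in range(n)) / n


def select_panel_exhaustive(dist, k):
    """Return the size-k set of leaves with the smallest ADCL.

    Brute force over all subsets: fine for a toy example, but an
    assumption of this sketch rather than the method used in the paper.
    """
    n = len(dist)
    return min(combinations(range(n), k), key=lambda panel: adcl(dist, panel))


# Hypothetical toy distances: leaves 0-2 form one tight cluster,
# leaves 3-4 form a second, more distant cluster.
toy_dist = [
    [0, 1, 1, 8, 9],
    [1, 0, 1, 8, 9],
    [1, 1, 0, 8, 9],
    [8, 8, 8, 0, 3],
    [9, 9, 9, 3, 0],
]

best = select_panel_exhaustive(toy_dist, 2)
print(best, adcl(toy_dist, best))
```

On this toy input the selected panel contains one leaf from each cluster, so that every unsequenced leaf has a close relative in the panel. Several panels tie for the minimum here, and `min` simply returns the first one it encounters.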

Similar resources

Choosing Subsamples for Sequencing Studies by Minimizing the Average Distance to the Closest Leaf

Imputation of genotypes in a study sample can make use of sequenced or densely genotyped external reference panels consisting of individuals that are not from the study sample. It also can employ internal reference panels, incorporating a subset of individuals from the study sample itself. Internal panels offer an advantage over external panels because they can reduce imputation errors arising ...

Semi-supervised composite kernel learning using distance metric learning techniques

The distance metric plays a key role in many machine learning and computer vision algorithms, so choosing an appropriate distance metric has a direct effect on the performance of such algorithms. Recently, distance metric learning using labeled data or other available supervisory information has become a very active research area in machine learning applications. Studies in this area have shown t...

Population genetic studies of Liza aurata using D-Loop sequencing in the southeast and southwest coasts of the Caspian Sea

Genetic diversity as an important marker of the ecological status of aquatic ecosystems is considered a unique and powerful tool to evaluate biological communities. In order to evaluate the genetic diversity among golden mullet species (Liza aurata) in the southeast and southwest coasts of the Caspian Sea by D-Loop gene sequencing, a total of 23 fin specimens of golden mullet were collected fro...

Missing data imputation in multivariable time series data

Multivariate time series data are found in a variety of fields such as bioinformatics, biology, genetics, astronomy, geography and finance. Many time series datasets contain missing data. Imputing missing data in multivariate time series is a challenging topic that needs to be carefully considered before learning from or predicting time series. Much research has been done on the use of diffe...

Publication date: 2015
