Multivariate Entropy Distance Method for Distinguishing Coding and Non-coding DNA Sequences
Abstract
The multivariate entropy distance (MED) method is a new, highly efficient and accurate gene identification algorithm that uses the so-called entropy-density profile (EDP) for the global description of a DNA sequence of finite length. The EDPs of coding and non-coding sequences are found to show clearly distinct patterns. The EDP of an individual sequence clusters around its respective mean EDP (coding or non-coding). The rapid convergence of the partially averaged EDP makes the MED method practical for gene finding: as few as 20 samples suffice for highly accurate identification of genes across a whole genome. Tests on a dozen prokaryotic genomes yield an overall prediction accuracy of over 99%. The results suggest the value of multivariate, global descriptions for complex biological systems.
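The abstract does not define the EDP, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes the EDP is a 20-component vector over amino-acid frequencies p_i, with s_i = -(p_i log p_i) / H and H the Shannon entropy of the composition, and that a sequence is assigned to whichever mean EDP (coding or non-coding) is closer in Euclidean distance. The function names `edp`, `mean_edp` and `classify` are hypothetical, and translation uses the standard genetic code in a single reading frame.

```python
import math
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
AMINO_ACIDS = sorted(set(AMINO) - {"*"})  # the 20 amino-acid letters


def edp(dna):
    """Assumed EDP: s_i = -(p_i * log p_i) / H over amino-acid frequencies."""
    counts = {a: 0 for a in AMINO_ACIDS}
    # Read the sequence codon by codon in frame 0; skip stops and unknown codons.
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3].upper())
        if aa in counts:
            counts[aa] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no translatable codons in sequence")
    terms = [-(n / total) * math.log(n / total) if n else 0.0
             for n in counts.values()]
    entropy = sum(terms)  # Shannon entropy H of the composition
    if entropy == 0:
        raise ValueError("only one amino acid present; EDP is undefined")
    return [t / entropy for t in terms]


def mean_edp(sequences):
    """Component-wise average of the EDPs of a set of sample sequences."""
    profiles = [edp(s) for s in sequences]
    return [sum(col) / len(profiles) for col in zip(*profiles)]


def classify(dna, coding_mean, noncoding_mean):
    """Label a sequence by its nearest mean EDP (Euclidean distance)."""
    s = edp(dna)
    d_coding = math.dist(s, coding_mean)
    d_noncoding = math.dist(s, noncoding_mean)
    return "coding" if d_coding < d_noncoding else "non-coding"
```

In this sketch, `mean_edp` would be computed once from a small set of known coding samples and once from non-coding samples, and `classify` would then be applied to each candidate open reading frame. The components of an EDP sum to 1 by construction, which makes profiles of sequences with different lengths directly comparable.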
Similar resources
Investigation of Polymorphisms in Non-Coding Region of Human Mitochondrial DNA in 31 Iranian Hypertrophic Cardiomyopathy (HCM) Patients
The D-loop region is a hot spot for mitochondrial DNA (mtDNA) alterations and contains two hypervariable segments, HVS-I and HVS-II. In order to identify polymorphic sites and a potential genetic background accounting for hypertrophic cardiomyopathy (HCM), the complete non-coding region of mtDNA from 31 unrelated HCM patients and 45 normal controls was sequenced. The sequences were aligne...
Phylogenetic Analysis of Three Long Non-coding RNA Genes: AK082072, AK043754 and AK082467
It is now clear that protein is just one of the functional products of the eukaryotic genome. Indeed, a larger part of the human genome is transcribed into non-coding sequences than into protein-coding sequences. In this study, we selected three long non-coding RNAs, namely AK082072, AK043754 and AK082467, which show brain expression and local region conservation among vertebr...
Mutual Information Measure for Distinguishing Coding and Non-Coding DNA Sequences
Several methodologies have been developed to identify genes and classify DNA sequences into coding and non-coding sequences. This classification process is fundamental in gene finding and gene annotation tools and is one of the most challenging tasks in bioinformatics and computational biology. The approach described herein measures mutual information (MIM) found in DNA sequences at the amino a...
P87: The Role of the Long Non-Coding RNA Sequences (LncRNAs) in Neurological Disorders
Precise interpretation of transcriptome sequences in several species has shown that the major part of the genome is transcribed; however, only a small fraction of the transcribed sequences have open reading frames that are conserved during evolution. It is therefore unlikely that many of the transcribed sequences code for proteins. Among all human non-coding transcripts, at least 10000...
Estimating the Location of Protein-Coding Regions in Numerical DNA Sequences Using a Variable-Length Window Based on the Three-Dimensional Z-Curve
In recent years, estimation of protein-coding regions in numerical deoxyribonucleic acid (DNA) sequences using signal processing tools has been a challenging issue in bioinformatics, owing to their 3-base periodicity. Several digital signal processing (DSP) tools have been applied to the task and have concentrated on assigning numerical values to the symbolic DNA sequence, then app...