Estimating the entropy of DNA sequences.

Authors

  • A O Schmitt
  • H Herzel
Abstract

The Shannon entropy is a standard measure for the order state of symbol sequences such as DNA sequences. To incorporate correlations between symbols, the entropy of n-mers (consecutive strands of n symbols) has to be determined. Here, an assay is presented to estimate such higher-order entropies (block entropies) for DNA sequences when the actual number of observations is small compared with the number of possible outcomes. The n-mer probability distribution underlying the dynamical process is reconstructed using elementary statistical principles: the theorem of asymptotic equi-distribution and the Maximum Entropy Principle. Constraints are set to force the constructed distributions to adopt features that are characteristic of the real probability distribution. From the many solutions compatible with these constraints, the one with the highest entropy is the most likely one according to the Maximum Entropy Principle. An algorithm performing this procedure is expounded. It is tested by applying it to various DNA model sequences whose exact entropies are known. Finally, results for a real DNA sequence, the complete genome of the Epstein-Barr virus, are presented and compared with those of other information carriers (texts, computer source code, music). DNA sequences appear to possess much more freedom in the combination of the symbols of their alphabet than written language or computer source code.
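
To make the undersampling problem concrete, the following is a minimal sketch of the naive frequency-count (plug-in) estimate of the block entropy, H_n = -Σ p(w) log2 p(w) over all observed n-mers w. It is not the maximum-entropy reconstruction described in the abstract; the function name `block_entropy`, the sequence length of 2000, and the uniform random test sequence are assumptions chosen only for illustration.

```python
import math
import random
from collections import Counter


def block_entropy(seq: str, n: int) -> float:
    """Naive plug-in estimate of the n-mer (block) entropy, in bits."""
    if n < 1 or n > len(seq):
        raise ValueError("n must satisfy 1 <= n <= len(seq)")
    # Count overlapping n-mers with a sliding window of width n.
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    total = sum(counts.values())
    # Shannon entropy of the observed relative frequencies.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    # Hypothetical model sequence: independent, uniformly distributed
    # nucleotides, for which the exact block entropy is 2n bits.
    rng = random.Random(0)
    seq = "".join(rng.choice("ACGT") for _ in range(2000))
    for n in range(1, 9):
        print(n, round(block_entropy(seq, n), 3), "exact:", 2 * n)
```

Because a sequence of length L contains only L - n + 1 overlapping n-mers, this estimate can never exceed log2(L - n + 1) bits, so it falls below the exact value 2n once 4^n becomes comparable to the number of observations. That is the regime in which a more careful estimator is needed.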

Similar resources

THE ENTROPIES OF THE SEQUENCES OF FUZZY SETS AND THE APPLICATIONS OF ENTROPY TO CARDIOGRAPHY

In this paper, we first introduce the entropy of sequences of fuzzy sets and give some theorems about it. Second, the P and T waves that appear in electrocardiograms are transferred to fuzzy sets; using the definition of entropy for sequences of fuzzy sets, some numerical values are obtained for sequences of P and T waves. Thus any person can make medical predictions for some cardiac...

Clustering of a Number of Genes Affecting in Milk Production using Information Theory and Mutual Information

Information theory is a branch of mathematics. Information theory is used in genetic and bioinformatics analyses and can be used for many analyses related to the biological structures and sequences. Bio-computational grouping of genes facilitates genetic analysis, sequencing and structural-based analyses. In this study, after retrieving gene and exon DNA sequences affecting milk yield in dairy ...

ESTIMATING THE MEAN OF INVERSE GAUSSIAN DISTRIBUTION WITH KNOWN COEFFICIENT OF VARIATION UNDER ENTROPY LOSS

The problem of estimating the mean µ of an inverse Gaussian distribution IG(µ, C µ) with known coefficient of variation c is treated as a decision problem with an entropy loss function. A class of Bayes estimators is constructed and shown to include the MRSE estimator as its closure. Two important members of this class can easily be computed using continued fractions.

Entropy and long-range correlations in random symbolic sequences

The goal of this paper is to develop an estimate for the entropy of random long-range correlated symbolic sequences with elements belonging to a finite alphabet. As a plausible model, we use the high-order additive stationary ergodic Markov chain. Supposing that the correlations between random elements of the chain are weak, we express the differential entropy of the sequence by means of the sym...

Estimating change-points in biological sequences via the cross-entropy method

The genomes of complex organisms, including the human genome, are known to vary in GC content along their length. That is, they vary in the local proportion of the nucleotides G and C, as opposed to the nucleotides A and T. Changes in GC content are often abrupt, producing well-defined regions. We model DNA sequences as a multiple change-point process in which the sequence is separated into seg...

Journal title:
  • Journal of theoretical biology

Volume 188, Issue 3

Pages -

Publication date: 1997