6.047/6.878 Lecture 5: HMMs
Abstract
The previous lecture introduced hidden Markov models (HMMs), a technique used to model types of sequences rather than exact matches. Recall that a Markov chain consists of states Q, initial state probabilities p, and transition probabilities A. A hidden Markov model has the additional property of emitting a series of observable outputs, one from each state, according to emission probabilities E. Because the observations do not allow one to uniquely infer the states, the model is “hidden.”

As discussed in Lecture 4, matching states of an HMM to a nucleotide sequence is somewhat similar to the problem of alignment, in that we wish to map each character to the hidden state that generated it. It differs from nucleotide alignment in that we can reuse states, i.e., return to the same state. However, the principle we used to solve each problem is the same. In alignment, we had an exponential number of possible sequences; in the HMM matching problem, we have an exponential number of possible parse sequences, i.e., choices of generating states. Indeed, in an HMM with k states, at each position we can be in any of the k states; hence, for a sequence of length n, there are k^n possible parses. As we have seen, in both cases we nonetheless avoid doing exponential work by using dynamic programming.

HMMs present several problems of interest beyond simply finding the optimal parse sequence, however. So far, we have only discussed finding the single optimal path that could have generated a given sequence, and scoring (i.e., computing the probability of) such a path. We will also be interested in computing the total probability that a given sequence was generated by a particular HMM, summed over all possible state paths that could have generated it; the method we present is yet another application of dynamic programming, known as the forward algorithm. One motivation for computing this probability is the desire to measure the accuracy of a model.
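As a concrete illustration, the forward recursion can be sketched on a toy two-state HMM. All of the numbers, state labels, and names below are invented for illustration; they are not taken from the lecture. The key point is that each step sums over all states at the previous position, so the total probability over all k^n paths is computed in O(n·k²) time:

```python
# Toy two-state HMM (all probabilities are illustrative, not from the lecture):
# state 0 might model a GC-rich region, state 1 an AT-rich region.
STATES = (0, 1)
p = [0.5, 0.5]                                  # initial state probabilities
A = [[0.9, 0.1],                                # A[j][k] = P(next state k | current state j)
     [0.2, 0.8]]
E = [{'A': 0.2, 'C': 0.3, 'G': 0.3, 'T': 0.2},  # emission probabilities per state
     {'A': 0.3, 'C': 0.2, 'G': 0.2, 'T': 0.3}]

def forward(seq):
    """Total probability that the HMM generates seq, summed over all state paths."""
    # f[k] = P(x_1..x_i, state at position i is k); initialize at position 1.
    f = [p[k] * E[k][seq[0]] for k in STATES]
    for c in seq[1:]:
        # Sum over the previous state j, then multiply by the emission at this position.
        f = [sum(f[j] * A[j][k] for j in STATES) * E[k][c] for k in STATES]
    return sum(f)

print(forward("GGCA"))
```

Note that, unlike Viterbi decoding, the inner step uses a sum over previous states rather than a max; that single change converts the best-path score into the total probability of the sequence.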
Being able to compute the total probability of a sequence allows us to compare alternate models by asking the question: “Given a portion of a genome, how likely is it that each HMM produced this sequence?” Additionally, although we now know the Viterbi decoding algorithm for finding the single optimal path, we will talk about another notion of decoding known as posterior decoding, which finds the most likely state at any position of a sequence (given the knowledge that our HMM produced the entire sequence). The posterior decoding algorithm will apply both the forward algorithm and the closely related backward algorithm. After this discussion, we will pause for an aside on encoding “memory” in a Markov chain before moving on
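Posterior decoding can be sketched by combining the forward and backward passes. Again, the two-state HMM and all probabilities below are invented for illustration; the recurrences themselves are the standard forward-backward ones. The posterior probability of being in state k at position i, given the whole sequence, is f(i, k)·b(i, k) divided by the total sequence probability:

```python
# Toy two-state HMM (illustrative numbers, not from the lecture).
STATES = (0, 1)
p = [0.5, 0.5]                                  # initial state probabilities
A = [[0.9, 0.1],                                # transition probabilities
     [0.2, 0.8]]
E = [{'A': 0.2, 'C': 0.3, 'G': 0.3, 'T': 0.2},  # emission probabilities per state
     {'A': 0.3, 'C': 0.2, 'G': 0.2, 'T': 0.3}]

def posterior_decode(seq):
    """Most likely state at each position, given that the HMM emitted all of seq."""
    n = len(seq)
    # Forward pass: f[i][k] = P(emissions up to position i, state k at position i).
    f = [[p[k] * E[k][seq[0]] for k in STATES]]
    for c in seq[1:]:
        f.append([sum(f[-1][j] * A[j][k] for j in STATES) * E[k][c]
                  for k in STATES])
    # Backward pass: b[i][k] = P(emissions after position i | state k at position i).
    b = [[1.0] * len(STATES)]
    for i in range(n - 1, 0, -1):
        b.insert(0, [sum(A[k][j] * E[j][seq[i]] * b[0][j] for j in STATES)
                     for k in STATES])
    total = sum(f[-1])  # P(seq): the same value the forward algorithm computes
    # Posterior: P(state k at position i | whole sequence).
    post = [[f[i][k] * b[i][k] / total for k in STATES] for i in range(n)]
    path = [max(STATES, key=lambda k: post[i][k]) for i in range(n)]
    return path, post

path, post = posterior_decode("GGCATT")
```

Because each position is decoded independently, the resulting state sequence maximizes the per-position posterior but need not itself be a high-probability (or even valid) path, which is exactly how posterior decoding differs from Viterbi decoding.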