The Statistics of Semi-Probabilistic Alignment
Abstract
Computer-assisted sequence comparison has become an integral part of modern molecular biology. Two types of algorithms have been used: those that search for the optimal alignment (as exemplified by the Smith-Waterman algorithm [1]), and those that identify likely alignments (as exemplified by the HMM-based “Sequence Alignment Modules” [2]). In each case, the quality of an alignment is summarized by an alignment score S; in the probabilistic approaches, S is typically taken to be the logarithm of the total likelihood. An important goal common to the study of all algorithms is to understand the score distribution P(S) for the appropriate null models. This distribution gives the probability that a high score could have arisen by chance and is therefore more meaningful for homology detection than the alignment scores themselves. Rigorous results on such background statistics are known only for gapless alignment [3], whose score distribution follows the so-called Gumbel form, $P(S) = KMN\lambda \exp\left(-\lambda S - KMN e^{-\lambda S}\right)$, for long sequence lengths M and N. There exist explicit formulae relating the hundreds of alignment parameters to the two Gumbel parameters λ and K. For gapped Smith-Waterman alignment, ample empirical evidence suggests that the null score distribution still obeys the Gumbel form, but the dependence of the two Gumbel parameters on the alignment score functions is very complicated and largely unknown. For probabilistic alignments, the log-likelihood score does not even follow a Gumbel distribution, as was recently shown [4].
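As a rough illustration of how the Gumbel form above turns a score into a significance estimate, the following minimal Python sketch evaluates the density P(S) and the corresponding tail probability P(S' ≥ S) = 1 − exp(−KMN e^{−λS}), which follows by integrating the density. The function names and the numerical values of λ, K, M and N are assumptions chosen only for illustration; they are not taken from the paper.

```python
import math


def gumbel_density(score, lam, k, m, n):
    """Gumbel density P(S) = K*M*N*lam * exp(-lam*S - K*M*N*exp(-lam*S))."""
    x = k * m * n * math.exp(-lam * score)
    return lam * x * math.exp(-x)


def gumbel_p_value(score, lam, k, m, n):
    """Tail probability P(S' >= score) = 1 - exp(-K*M*N*exp(-lam*score))."""
    x = k * m * n * math.exp(-lam * score)
    # expm1 keeps precision when x is tiny (i.e. for very high scores).
    return -math.expm1(-x)


if __name__ == "__main__":
    # Hypothetical parameters, for illustration only.
    lam, k, m, n = 0.3, 0.1, 300, 300
    for s in (20, 30, 40, 50):
        print(s, gumbel_density(s, lam, k, m, n), gumbel_p_value(s, lam, k, m, n))
```

For high scores the tail probability is approximately KMN e^{−λS}, so it decays exponentially with rate λ; this is why accurate values of λ and K matter for homology detection.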
Combining the optimal and the probabilistic approaches to sequence alignment, we developed a new “hybrid” algorithm whose alignment score is still Gumbel-distributed. Its Gumbel parameters can be computed accurately and rapidly for a large class of scoring functions (including position-specific ones) and for a wide range of sequence lengths (down to ∼100 amino acids), without the extensive simulations that are commonly required. Using sequences generated by simple evolution models, we have independently checked that the fidelity of homology detection by the hybrid algorithm is comparable to or better than that of the Smith-Waterman algorithm [4].
Related articles
Statistical Significance of Probabilistic Sequence Alignment and Related Local Hidden Markov Models
The score statistics of probabilistic gapped local alignment of random sequences is investigated both analytically and numerically. The full probabilistic algorithm (e.g., the "local" version of maximum-likelihood or hidden Markov model method) is found to have anomalous statistics. A modified "semi-probabilistic" alignment consisting of a hybrid of Smith-Waterman and probabilistic alignment is...
gpALIGNER: A Fast Algorithm for Global Pairwise Alignment of DNA Sequences
Bioinformatics, through the sequencing of the full genomes for many species, is increasingly relying on efficient global alignment tools exhibiting both high sensitivity and specificity. Many computational algorithms have been applied for solving the sequence alignment problem. Dynamic programming, statistical methods, approximation and heuristic algorithms are the most common methods appli...
Correspondence between probabilistic norms and fuzzy norms
In this paper, the connection between Menger probabilistic norms and Höhle probabilistic norms is discussed. In addition, the correspondence between probabilistic norms and Wu-Fang fuzzy (semi-) norms is established. It is shown that a probabilistic norm (with triangular norm $\min$) can generate a Wu-Fang fuzzy semi-norm and conversely, a Wu-Fang fuzzy norm can generate a probabilistic norm.
Statistical significance and extremal ensemble of gapped local hybrid alignment
A “semi-probabilistic” alignment algorithm which combines ideas from Smith-Waterman and probabilistic alignment is proposed and studied in detail. It is predicted that the score statistics of this “hybrid” algorithm is of the universal Gumbel form, with the key Gumbel parameter λ taking on a fixed asymptotic value for a wide variety of scoring parameters. We have also characterized the “extrema...
Probabilistic analysis of the asymmetric digital search trees
In this paper, by applying three functional operators the previous results on the (Poisson) variance of the external profile in digital search trees will be improved. We study the profile built over $n$ binary strings generated by a memoryless source with unequal probabilities of symbols and use a combinatorial approach for studying the Poissonized variance, since the probability distribution o...