Sequentially Fitting "Inclusive" Trees for Inference in Noisy-OR Networks
Abstract
An important class of problems can be cast as inference in noisy-OR Bayesian networks, where the binary state of each variable is a logical OR of noisy versions of the states of the variable's parents. For example, in medical diagnosis, the presence of a symptom can be expressed as a noisy-OR of the diseases that may cause the symptom; on some occasions, a disease may fail to activate the symptom. Inference in richly-connected noisy-OR networks is intractable, but approximate methods (e.g., variational techniques) are showing increasing promise as practical solutions. One problem with most approximations is that they tend to concentrate on a relatively small number of modes in the true posterior, ignoring other plausible configurations of the hidden variables. We introduce a new sequential variational method for bipartite noisy-OR networks that favors including all modes of the true posterior and models the posterior distribution as a tree. We compare this method with other approximations using an ensemble of networks with network statistics that are comparable to the QMR-DT medical diagnostic network.

1 Inclusive variational approximations

Approximate algorithms for probabilistic inference are gaining in popularity and are now even being incorporated into VLSI hardware (T. Richardson, personal communication). Approximate methods include variational techniques (Ghahramani and Jordan 1997; Saul et al. 1996; Frey and Hinton 1999; Jordan et al. 1999), local probability propagation (Gallager 1963; Pearl 1988; Frey 1998; MacKay 1999a; Freeman and Weiss 2001) and Markov chain Monte Carlo methods (Neal 1993; MacKay 1999b). Many algorithms have been proposed in each of these classes.

One problem that most of the above algorithms suffer from is a tendency to concentrate on a relatively small number of modes of the target distribution (the distribution being approximated). In the case of medical diagnosis, different modes correspond to different explanations of the symptoms. Markov chain Monte Carlo methods are usually guaranteed to eventually sample from all the modes, but this may take an extremely long time, even when tempered transitions (Neal 1996) are used.

Figure 1: We approximate $P(x)$ by adjusting the mean and variance of a Gaussian, $Q(x)$. (a) The result of minimizing $D(Q\|P) = \sum_x Q(x)\log(Q(x)/P(x))$, as is done for most variational methods. (b) The result of minimizing $D(P\|Q) = \sum_x P(x)\log(P(x)/Q(x))$.

Preliminary results on local probability propagation in richly connected networks show that it is sometimes able to oscillate between plausible modes (Murphy et al. 1999; Frey 2000), but other results also show that it sometimes diverges or oscillates between implausible configurations (McEliece et al. 1996). Most variational techniques minimize a cost function that favors finding the single, most massive mode, excluding less probable modes of the target distribution (e.g., Saul et al. 1996; Ghahramani and Jordan 1997; Jaakkola and Jordan 1999; Frey and Hinton 1999; Attias 1999). More sophisticated variational techniques capture multiple modes using substructures (Saul and Jordan 1996) or by leaving part of the original network intact and approximating the remainder (Jaakkola and Jordan 1999). However, although these methods increase the number of modes that are captured, they still exclude modes.

Variational techniques approximate a target distribution $P(x)$ using a simpler, parameterized distribution $Q(x)$ (or a parameterized bound). For example, $P(\text{disease}_1, \text{disease}_2, \ldots, \text{disease}_N \mid \text{symptoms})$ may be approximated by a factorized distribution, $Q_1(\text{disease}_1) Q_2(\text{disease}_2) \cdots Q_N(\text{disease}_N)$. For the current set of observed symptoms, the parameters of the Q-distributions are adjusted to make Q as close as possible to P. A common approach to variational inference is to minimize a relative entropy,

$$D(Q\|P) = \sum_x Q(x) \log \frac{Q(x)}{P(x)}. \qquad (1)$$

Notice that $D(Q\|P) \neq D(P\|Q)$. Often $D(Q\|P)$ can be minimized with respect to the parameters of Q using iterative optimization or even exact optimization.

To see how minimizing $D(Q\|P)$ may exclude modes of the target distribution, suppose Q is a Gaussian and P is bimodal with a region of vanishing density between the two modes, as shown in Fig. 1. If we minimize $D(Q\|P)$ with respect to the mean and variance of Q, it will cover only one of the two modes, as illustrated in Fig. 1a. (We assume the symmetry is broken.) This is because $D(Q\|P)$ will tend to infinity if Q is nonzero in a region where P has vanishing density. In contrast, if we minimize $D(P\|Q) = \sum_x P(x)\log(P(x)/Q(x))$ with respect to the mean and variance of Q, it will cover all modes, since $D(P\|Q)$ will tend to infinity if Q vanishes in any region where P is nonzero. See Fig. 1b.

For many problems, including medical diagnosis, it is easy to argue that it is more important that our approximation include all modes than exclude implausible configurations at the cost of excluding other modes. The former leads to a low number of false negatives, whereas the latter may lead to a large number of false negatives (concluding a disease is not present when it is).
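To make the contrast in Fig. 1 concrete, here is a minimal numerical sketch (our own construction, not code from the paper; the mixture and grid settings are illustrative) that fits a Gaussian Q to a bimodal mixture P by grid search, once under each direction of the divergence:

```python
# Fit a Gaussian Q to a bimodal P under D(Q||P) and under D(P||Q).
# Densities are evaluated on a grid, so each divergence is a
# Riemann-sum approximation of the integral.
import numpy as np

x = np.linspace(-8.0, 8.0, 2000)
dx = x[1] - x[0]

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Bimodal target: a two-component mixture with slightly unequal
# weights, so the symmetry is broken as assumed in the text.
p = 0.55 * gauss(x, -2.0, 0.5) + 0.45 * gauss(x, 2.0, 0.5)
p /= p.sum() * dx

def kl(a, b):
    """Discrete approximation of D(a||b) = integral of a log(a/b)."""
    eps = 1e-300  # guard against log(0)
    return float(np.sum(a * (np.log(a + eps) - np.log(b + eps))) * dx)

best = {"D(Q||P)": (np.inf, None), "D(P||Q)": (np.inf, None)}
for mu in np.linspace(-4.0, 4.0, 81):
    for sigma in np.linspace(0.2, 4.0, 77):
        q = gauss(x, mu, sigma)
        q /= q.sum() * dx
        for name, div in (("D(Q||P)", kl(q, p)), ("D(P||Q)", kl(p, q))):
            if div < best[name][0]:
                best[name] = (div, (mu, sigma))

print("argmin D(Q||P):", best["D(Q||P)"][1])  # hugs one mode: mu near -2, sigma near 0.5
print("argmin D(P||Q):", best["D(P||Q)"][1])  # covers both modes: mu near 0, sigma near 2
```

The inclusive fit straddles both modes because, over Gaussians, minimizing $D(P\|Q)$ is equivalent to matching the mean and variance of P.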
Figure 2: Bipartite Bayesian network. The $s_k$'s are observed, the $d_n$'s are hidden.

2 Bipartite noisy-OR networks

Fig. 2 shows a bipartite noisy-OR Bayesian network with N binary hidden variables $d = (d_1, \ldots, d_N)$ and K binary observed variables $s = (s_1, \ldots, s_K)$. Later, we present results on medical diagnosis, where $d_n = 1$ indicates a disease is active, $d_n = 0$ indicates a disease is inactive, $s_k = 1$ indicates a symptom is active and $s_k = 0$ indicates a symptom is inactive. The joint distribution is

$$P(d, s) = \Big[\prod_{k=1}^{K} P(s_k \mid d)\Big] \Big[\prod_{n=1}^{N} P(d_n)\Big]. \qquad (2)$$

In the case of medical diagnosis, this form assumes the diseases are independent.¹ Although some diseases probably do depend on other diseases, this form is considered to be a worthwhile representation of the problem (Shwe et al. 1991). The likelihood for $s_k$ takes the noisy-OR form (Pearl 1988). The probability that symptom $s_k$ fails to be activated ($s_k = 0$) is the product of the probabilities that each active disease fails to activate $s_k$:

$$P(s_k = 0 \mid d) = p_{k0} \prod_{n=1}^{N} p_{kn}^{d_n}. \qquad (3)$$

$p_{kn}$ is the probability that an active $d_n$ fails to activate $s_k$. $p_{k0}$ accounts for a "leak probability"; $1 - p_{k0}$ is the probability that symptom $s_k$ is active when none of the diseases are active.

¹ However, the diseases are dependent given that some symptoms are present.
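Eq. 3 is cheap to evaluate directly. The following is a minimal sketch (ours, not the paper's code; the names and toy numbers are illustrative) of the noisy-OR likelihood for a single symptom:

```python
import numpy as np

def noisy_or_likelihood(s_k, d, p_k0, p_k):
    """P(s_k | d) for one symptom, following Eq. 3.

    d    : binary disease states, shape (N,)
    p_k0 : leak term, the probability the symptom stays off when
           no disease is active
    p_k  : p_k[n] = p_kn, the probability that an active disease n
           fails to activate this symptom, shape (N,)
    """
    # p_k0 * prod_n p_kn^{d_n}: inactive diseases contribute a factor of 1
    p_off = p_k0 * np.prod(np.where(d == 1, p_k, 1.0))
    return p_off if s_k == 0 else 1.0 - p_off

# Two active diseases, each failing to activate the symptom with
# probability 0.3, and leak term 0.99:
# P(s_k = 1 | d) = 1 - 0.99 * 0.3 * 0.3 = 0.9109.
d = np.array([1, 0, 1])
print(noisy_or_likelihood(1, d, 0.99, np.array([0.3, 0.5, 0.3])))
```

Note that $P(s_k = 1 \mid d)$ is one minus a product form; this is what keeps the terms for active symptoms in the posterior below from factorizing.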
Exact inference computes the distribution over d given a subset of observed values in s. However, if $s_k$ is not observed, the corresponding likelihood (node plus edges) may be deleted to give a new network that describes the marginal distribution over d and the remaining variables in s. So, we assume that we are considering a subnetwork where all the variables in s are observed. We reorder the variables in s so that the first J variables are active ($s_k = 1$, $1 \le k \le J$) and the remaining variables are inactive ($s_k = 0$, $J+1 \le k \le K$). The posterior distribution can then be written

$$P(d \mid s) \propto P(d, s) = \Big[\prod_{k=1}^{J} \Big(1 - p_{k0} \prod_{n=1}^{N} p_{kn}^{d_n}\Big)\Big] \Big[\prod_{k=J+1}^{K} \Big(p_{k0} \prod_{n=1}^{N} p_{kn}^{d_n}\Big)\Big] \Big[\prod_{n=1}^{N} P(d_n)\Big]. \qquad (4)$$

Taken together, the two terms in brackets on the right take a simple, product form over the variables in d. So, the first step in inference is to "absorb" the inactive variables in s by modifying the priors $P(d_n)$ as follows:

$$P'(d_n) = \alpha_n P(d_n) \Big(\prod_{k=J+1}^{K} p_{kn}\Big)^{d_n}, \qquad (5)$$

where $\alpha_n$ is a constant that normalizes $P'(d_n)$. Assuming the inactive symptoms have been absorbed, we have

$$P(d \mid s) \propto \Big[\prod_{k=1}^{J} \Big(1 - p_{k0} \prod_{n=1}^{N} p_{kn}^{d_n}\Big)\Big] \Big[\prod_{n=1}^{N} P'(d_n)\Big]. \qquad (6)$$

The term in brackets on the left does not have a product form. The entire expression can be multiplied out to give a sum of $2^J$ product forms, and exact "Quickscore" inference can be performed by combining the results of exact inference in each of the $2^J$ product forms (Heckerman 1989). However, this exponential time complexity makes large problems, such as QMR-DT, intractable.

3 Sequential inference using inclusive variational trees

As described above, many variational methods minimize $D(Q\|P)$ and find approximations that exclude some modes of the posterior distribution. We present a method that minimizes $D(P\|Q)$ sequentially, absorbing one observation at a time, so as not to exclude modes of the posterior. Also, we approximate the posterior distribution with a tree. (Directed and undirected trees are equivalent; we use a directed representation, where each variable has at most one parent.)

The algorithm absorbs one active symptom at a time, producing a new tree by searching for the tree that is closest in the $D(P\|Q)$ sense to the product of the previous tree and the likelihood for the next symptom. This search can be performed efficiently in $O(N^2)$ time using probability propagation in two versions of the previous tree to compute weights for the edges of a new tree, and then applying a minimum-weight spanning-tree algorithm.

Let $T_k(d)$ be the tree approximation obtained after absorbing the kth symptom, $s_k = 1$. Initially, we take $T_0(d)$ to be a tree that decouples the variables and has marginals equal to the marginals obtained by absorbing the inactive symptoms, as described above. Interpreting the tree $T_{k-1}(d)$ from the previous step as the current "prior" over the diseases, we use the likelihood $P(s_k = 1 \mid d)$ for the next symptom to obtain a new estimate of the posterior:

$$\hat{P}(d \mid s_1, \ldots, s_k) \propto T_{k-1}(d)\, P(s_k = 1 \mid d) = T_{k-1}(d) \Big(1 - p_{k0} \prod_{n=1}^{N} p_{kn}^{d_n}\Big) = T_{k-1}(d) - T'_{k-1}(d), \qquad (7)$$

where $T'_{k-1}(d) = T_{k-1}(d)\big(p_{k0} \prod_{n=1}^{N} p_{kn}^{d_n}\big)$ is a modified tree. Let the new tree be $T_k(d) = \prod_n T_k(d_n \mid d_{\pi_k(n)})$, where $\pi_k(n)$ is the index of the parent of $d_n$ in the new tree. The parent function $\pi_k(n)$ and the conditional probability tables of $T_k(d)$ are found by minimizing $D(\hat{P} \| T_k)$.
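As a rough illustration of these sequential inclusive updates, the sketch below replaces the tree with a fully factorized approximation; this simplification (ours, not the paper's algorithm) omits the spanning-tree search and probability propagation, because for a fully factorized Q, minimizing $D(\hat{P}\|Q)$ reduces to matching the exact marginals of the posterior in Eq. 7, which have a closed form. All names and toy numbers are ours.

```python
import numpy as np

def absorb_negative_findings(prior, fail):
    """Eq. 5 for a factorized prior: scale each P(d_n = 1) by the
    probability that d_n fails to activate every inactive symptom,
    then renormalize.  fail[k, n] = p_kn over the inactive symptoms."""
    on = prior * np.prod(fail, axis=0)
    off = 1.0 - prior
    return on / (on + off)

def absorb_positive_finding(q, p_k0, p_k):
    """One sequential step: replace the factorized posterior q by the
    exact marginals of q(d) * P(s_k = 1 | d), which minimizes
    D(P_hat || Q) over fully factorized Q."""
    c = 1.0 - q + q * p_k        # c[n] = E_q[p_kn^{d_n}]
    total = p_k0 * np.prod(c)    # E_q[P(s_k = 0 | d)]
    z = 1.0 - total              # normalizer E_q[P(s_k = 1 | d)]
    loo = total / c              # p_k0 * prod_{m != n} c_m
    return q * (1.0 - p_k * loo) / z

# Toy run: 3 diseases with prior 0.05. Absorb two inactive symptoms,
# then one active symptom caused mainly by disease 0 (low p_k[0]).
q = np.full(3, 0.05)
q = absorb_negative_findings(q, fail=np.array([[0.9, 0.9, 0.9],
                                               [0.9, 0.9, 0.9]]))
q = absorb_positive_finding(q, p_k0=0.9, p_k=np.array([0.1, 0.8, 0.8]))
print(q)  # disease 0's marginal rises well above its prior
```

In the paper's method, these computations are instead carried out by probability propagation in $T_{k-1}(d)$ and $T'_{k-1}(d)$, whose results weight the candidate edges of the new tree before the minimum-weight spanning-tree step.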