Week 08: Probabilistic Inference and Belief Networks

In the real world there is uncertainty and randomness due to several factors, such as partial observability, noisy sensors, and stochastic outcomes. Hence, using probabilities is often required.

Random Variable

Some aspect of the world about which there is uncertainty.

RVs are denoted with a capital letter and have an associated domain.

Unobserved RVs have distributions: a table of probabilities of values. A probability is a single number, e.g. $P(W = rain) = 0.1$. The probabilities sum to 1 and none are negative.
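
For instance, a tiny sketch of such a table in code (the weather variable and its numbers are made up for illustration):

```python
# A distribution over a hypothetical weather variable W,
# stored as a map from value -> probability.
P_W = {"sun": 0.6, "rain": 0.1, "fog": 0.3}

# Sanity checks: probabilities are non-negative and sum to 1.
assert all(p >= 0 for p in P_W.values())
assert abs(sum(P_W.values()) - 1.0) < 1e-9
```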

Joint distribution over a set of RVs: a map from assignments/outcomes/atomic events to reals; $P(X_1 = x_1, \dots, X_n = x_n)$.

Event: a set $E$ of assignments; $P(E) = \sum_{(x_1, \dots, x_n) \in E}{P(x_1, \dots, x_n)}$.

Marginalization (summing out): projecting a joint distribution down to a sub-distribution over a subset of variables: $P(X_1 = x_1) = \sum_{x_2 \in domain(X_2)}{P(X_1 = x_1, X_2 = x_2)}$.
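
A minimal sketch of summing out in code, assuming a small hand-made joint over temperature $T$ and weather $W$ (the numbers are invented for illustration):

```python
# Hypothetical joint distribution P(T, W) as a map from (t, w) -> probability.
P_TW = {
    ("warm", "sun"): 0.4, ("warm", "rain"): 0.1,
    ("cold", "sun"): 0.2, ("cold", "rain"): 0.3,
}

# Marginalize (sum out) W to obtain P(T).
P_T = {}
for (t, w), p in P_TW.items():
    P_T[t] = P_T.get(t, 0.0) + p

print(P_T)  # {'warm': 0.5, 'cold': 0.5}
```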

Conditional probability: $P(a|b) = \frac{P(a, b)}{P(b)}$.

Conditional distribution: a probability distribution over some variables given fixed values of others. If $W$ and $T$ take binary values, $P(W, T)$ is one 2-by-2 table, while $P(W|T)$ is two 2-row tables (one per value of $T$), each summing to 1.

To get the whole conditional distribution at once, select the joint probabilities matching the evidence and normalize the selection (make it sum to 1). Example:

$P(T|rain)$: select rows where $R = rain$, then divide the probabilities by the sum $P(warm, rain) + P(cold, rain)$.

$$ \begin{aligned} P(x_1|x_2) &= \frac{P(x_1, x_2)}{P(x_2)} \\ &= \frac{P(x_1, x_2)}{\sum_{x_1 \in domain(X_1)}{P(x_1, x_2)}} \end{aligned} $$
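
The same select-and-normalize procedure as a sketch over the hypothetical joint $P(T, W)$ from above:

```python
# Hypothetical joint distribution P(T, W).
P_TW = {
    ("warm", "sun"): 0.4, ("warm", "rain"): 0.1,
    ("cold", "sun"): 0.2, ("cold", "rain"): 0.3,
}

# Compute P(T | W = rain): select entries consistent with the evidence...
selected = {t: p for (t, w), p in P_TW.items() if w == "rain"}
# ...then normalize so the selection sums to 1.
total = sum(selected.values())
P_T_given_rain = {t: p / total for t, p in selected.items()}

print(P_T_given_rain)  # {'warm': 0.25, 'cold': 0.75}
```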

Product rule:

$$ \begin{aligned} P(x|y) &= \frac{P(x, y)}{P(y)} \\ \therefore P(x, y) &= P(x|y) \cdot P(y) \end{aligned} $$

Chain rule: $P(x_1, x_2, x_3) = P(x_1) \cdot P(x_2|x_1) \cdot P(x_3|x_1, x_2)$. More generally:

$$ P(x_1, \dots, x_n) = \prod_{i=1}^{n}{P(x_i|x_1, \dots, x_{i-1})} $$
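
As a quick numeric sanity check on the hypothetical joint $P(T, W)$ above, the chain rule just rebuilds a joint entry from a marginal and a conditional:

```python
# Hypothetical joint P(T, W) and its marginal/conditional.
P_TW = {
    ("warm", "sun"): 0.4, ("warm", "rain"): 0.1,
    ("cold", "sun"): 0.2, ("cold", "rain"): 0.3,
}
P_T = {"warm": 0.5, "cold": 0.5}                      # W summed out, as above
P_W_given_T = {(w, t): P_TW[(t, w)] / P_T[t]          # P(w | t) by the product rule
               for (t, w) in P_TW}

# Chain rule: P(t, w) = P(t) * P(w | t).
t, w = "cold", "rain"
assert abs(P_TW[(t, w)] - P_T[t] * P_W_given_T[(w, t)]) < 1e-9
```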

Probabilistic Inference

Computing a desired probability from other known probabilities.

$P(x, y) = P(x|y) \cdot P(y) = P(y|x) \cdot P(x)$. Dividing this by the marginal $P(y)$, we get Bayes' rule:

$$ P(x|y) = \frac{P(y|x) \cdot P(x)}{P(y)} $$

This allows us to invert a conditional distribution; often one conditional is simple to specify but the other is tricky.

$$ \begin{aligned} P(x,y|z) &= P(y,x|z) \\ \\ P(x,y|z) &= \frac{P(x,y,z)}{P(z)} \\ P(y|x,z) &= \frac{P(x,y,z)}{P(x,z)} \\ \therefore P(x,y,z) &= P(y|x,z) \cdot P(x,z) \\ \therefore P(x,y|z) &= \frac{P(y|x,z) \cdot P(x,z)}{P(z)} \\ &= P(y|x,z) \cdot P(x|z) \\ \\ \therefore P(x|y,z) &= \frac{P(y|x,z) \cdot P(x|z)}{P(y|z)} \\ \text{if } z \text{ is implicit}: \\ P(x|y) &= \frac{P(y|x) \cdot P(x)}{P(y)} \text{ (Bayes' rule)} \end{aligned} $$
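
A minimal sketch of inverting a conditional with Bayes' rule; the prior and likelihood numbers below are invented for a diagnosis-style example:

```python
# Hypothetical numbers: P(disease), P(test+ | disease), P(test+ | no disease).
p_d = 0.01          # prior P(x)
p_t_given_d = 0.9   # P(y | x): easy to specify
p_t_given_nd = 0.05 # P(y | not x)

# Marginal P(y) by summing over both cases of x.
p_t = p_t_given_d * p_d + p_t_given_nd * (1 - p_d)

# Bayes' rule inverts the conditional: P(x | y) = P(y | x) * P(x) / P(y).
p_d_given_t = p_t_given_d * p_d / p_t
print(round(p_d_given_t, 3))  # ~0.154
```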

Inference by Enumeration

A more general procedure: computing $P(Y_1, \dots, Y_m|e_1, \dots, e_k)$, where $Y_1, \dots, Y_m$ are the query variables, the $E_i = e_i$ are the observed evidence variables, and $H_1, \dots, H_r$ are the remaining hidden variables. Together, these variables can be referred to as $X_1, \dots, X_n$.

First, select entries consistent with the evidence.

Then, sum out $H$:

$$ P(Y_1, \dots, Y_m, e_1, \dots, e_k) = \sum_{h_1, \dots, h_r}{P(Y_1, \dots, Y_m, h_1, \dots, h_r, e_1, \dots, e_k)} $$

Finally, normalize the remaining entries.
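
Putting the three steps together as a sketch, assuming a made-up joint over a query variable $T$, evidence variable $W$, and hidden variable $M$:

```python
# Hypothetical joint P(T, W, M) over query T, evidence W, hidden M.
P = {
    ("warm", "sun", "low"): 0.25, ("warm", "sun", "high"): 0.15,
    ("warm", "rain", "low"): 0.05, ("warm", "rain", "high"): 0.05,
    ("cold", "sun", "low"): 0.10, ("cold", "sun", "high"): 0.10,
    ("cold", "rain", "low"): 0.10, ("cold", "rain", "high"): 0.20,
}

def infer_T_given_W(evidence_w):
    # 1. Select entries consistent with the evidence, 2. sum out M, ...
    unnormalized = {}
    for (t, w, m), p in P.items():
        if w == evidence_w:
            unnormalized[t] = unnormalized.get(t, 0.0) + p
    # ... 3. normalize the remaining entries.
    total = sum(unnormalized.values())
    return {t: p / total for t, p in unnormalized.items()}

print(infer_T_given_W("rain"))  # {'warm': 0.25, 'cold': 0.75}
```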

Complexity of Models

Simple models are easier to build, explain, and usually lead to lower time complexity and space requirements.

To measure complexity, count the number of free parameters that must be specified; a joint distribution over $n$ variables, each with a domain of size $d$, requires $d^n$ entries in the table, and the number of free parameters is $d^n - 1$ (the last one can be inferred, as the probabilities must sum to 1).
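
For example (arbitrary sizes):

```python
# Table size for a full joint over n variables with domain size d.
n, d = 5, 3                    # arbitrary example sizes
entries = d ** n               # 3**5 = 243 table entries
free_parameters = entries - 1  # 242 (last entry inferred from sum-to-1)
print(entries, free_parameters)
```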

Independence

Two variables are independent if:

$$ \begin{aligned} P(X, Y) &= P(X) \cdot P(Y) \text{, or} \\ \forall x, y: P(x, y) &= P(X=x) \cdot P(Y=y) \end{aligned} $$

That is, their joint distribution factors into a product of two simpler distributions.

Absolute independence: $X {\perp\!\!\!\perp} Y$.

Independence can be used as a modelling assumption; if all the variables are independent, instead of the $d^n - 1$ free parameters of the full joint model, we only need $n$ separate tables of $d$ rows each, i.e. $n(d - 1)$ free parameters.
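
A minimal sketch of the factored representation, with two hypothetical independent binary variables; the joint is reconstructed from the small tables rather than stored explicitly:

```python
# Two hypothetical independent binary variables.
P_X = {"x0": 0.7, "x1": 0.3}
P_Y = {"y0": 0.4, "y1": 0.6}

# Under independence the joint factors: P(x, y) = P(x) * P(y),
# so the full d**n-entry joint table never needs to be stored.
def joint(x, y):
    return P_X[x] * P_Y[y]

print(joint("x1", "y0"))  # 0.12
```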

Absolute (unconditional) independence is very rare; conditional independence is more robust. For a given value of $Z$, the probability of $X$ is independent of $Y$; $X {\perp\!\!\!\perp} Y | Z$ if:

$$ \begin{aligned} &\forall x, y, z: P(x, y|z) = P(x|z) \cdot P(y|z) \text{, or}\\ &\forall x, y, z: P(x|z, y) = P(x|z) \end{aligned} $$

In this case:

$$ \begin{aligned} P(X, Y, Z) &= P(X|Y, Z) \cdot P(Y, Z) \\ &= P(X|Y, Z) \cdot P(Y|Z) \cdot P(Z) \\ &= P(X|Z) \cdot P(Y|Z) \cdot P(Z) \end{aligned} $$

This can occur if $X$ and $Y$ both depend on a common cause $Z$ but not directly on each other: observing $X$ changes the probability of the parent $Z$'s value, which in turn changes the probability of $Y$; once $Z$ is known, however, $X$ carries no further information about $Y$.
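
A minimal numeric sketch of such a common-cause structure ($Z \to X$, $Z \to Y$) with made-up CPTs, checking that conditional independence given $Z$ holds while absolute independence fails:

```python
# Hypothetical common-cause model: Z -> X, Z -> Y (all binary).
P_Z = {0: 0.5, 1: 0.5}
P_X_given_Z = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}  # P_X_given_Z[z][x]
P_Y_given_Z = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.1, 1: 0.9}}  # P_Y_given_Z[z][y]

def joint(x, y, z):
    return P_Z[z] * P_X_given_Z[z][x] * P_Y_given_Z[z][y]

# Conditionally independent given Z: P(x, y | z) == P(x | z) * P(y | z).
z, x, y = 1, 1, 0
p_xy_given_z = joint(x, y, z) / P_Z[z]
assert abs(p_xy_given_z - P_X_given_Z[z][x] * P_Y_given_Z[z][y]) < 1e-9

# But not absolutely independent: P(x, y) != P(x) * P(y) in general.
p_xy = sum(joint(x, y, zz) for zz in P_Z)
p_x = sum(joint(x, yy, zz) for yy in (0, 1) for zz in P_Z)
p_y = sum(joint(xx, y, zz) for xx in (0, 1) for zz in P_Z)
print(round(p_xy, 3), round(p_x * p_y, 3))  # 0.075 vs 0.18
```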

Belief Networks

Graphical Model Notation

Arcs: indicate direct influence between variables.

Nodes: one node per random variable, each with a domain; nodes can be observed (assigned) or unobserved.

Distributions: each node stores a conditional distribution over its values given its parents' values, $P(X | parents(X))$, typically as a conditional probability table (CPT).

D-separation can be used to decide if a set of nodes $X$ is independent of $Y$ given $Z$.

Encoding

BNs implicitly encode joint distributions; the joint can be calculated as a product of local conditional distributions.

Example:

$$ \begin{aligned} A &\rightarrow B \\ A &\rightarrow C \\ B, C &\rightarrow D \\ C &\rightarrow E \\ \end{aligned} $$
$$ \begin{aligned} P(a, b, c, d, e) &= P(e|a, b, c, d) \cdot P(a, b, c, d) \quad\text{(by the product rule)} \\ &= P(e|c) \cdot P(a, b, c, d) \quad\text{(e dependent on c but independent of all others)} \\ &= P(e|c) \cdot P(d|a, b, c) \cdot P(a, b, c) \\ &= P(e|c) \cdot P(d|b, c) \cdot P(a, b, c) \\ &= P(e|c) \cdot P(d|b, c) \cdot P(c|a, b) \cdot P(a, b) \\ &= P(e|c) \cdot P(d|b, c) \cdot P(c|a) \cdot P(b|a) \cdot P(a) \end{aligned} $$

More generally, if you have a full assignment, multiplying the relevant conditionals gives the probability:

$$ P(x_1, \dots, x_n) = \prod_{i=1}^{n}{P(x_i | parents(X_i))} $$
```python
product = 1.0
for i in range(n):  # all n variables, 0-indexed
    # P(x_i | parents(X_i)); `probability` and `parent_values` are assumed helpers
    product *= probability(x[i], parent_values(x, i))
```

Thus, we can reconstruct any entry of the full joint. However, not every BN can represent every full joint.
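
A minimal runnable sketch for the example network above; the CPT numbers are invented purely for illustration:

```python
# Hypothetical CPTs for the boolean network A -> B, A -> C, {B, C} -> D, C -> E.
P_A = {True: 0.6, False: 0.4}
P_B_given_A = {True: {True: 0.7, False: 0.3}, False: {True: 0.1, False: 0.9}}  # [a][b]
P_C_given_A = {True: {True: 0.2, False: 0.8}, False: {True: 0.5, False: 0.5}}  # [a][c]
P_D_given_BC = {(True, True): {True: 0.9, False: 0.1},
                (True, False): {True: 0.6, False: 0.4},
                (False, True): {True: 0.3, False: 0.7},
                (False, False): {True: 0.05, False: 0.95}}                     # [(b, c)][d]
P_E_given_C = {True: {True: 0.8, False: 0.2}, False: {True: 0.1, False: 0.9}}  # [c][e]

def joint(a, b, c, d, e):
    # P(a, b, c, d, e) = P(a) P(b|a) P(c|a) P(d|b, c) P(e|c)
    return (P_A[a] * P_B_given_A[a][b] * P_C_given_A[a][c]
            * P_D_given_BC[(b, c)][d] * P_E_given_C[c][e])

print(joint(True, True, False, True, False))  # 0.6 * 0.7 * 0.8 * 0.6 * 0.9
```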

Inference Enumeration

Computing $P(Y | e)$. If $H$ are the hidden variables and $\alpha$ normalizes the values so the sum of the probabilities is 1:

$$ P(Y|e) = \alpha \cdot P(Y, e) = \alpha \sum_H{P(Y, e, H)} $$

(NB: $\sum_H$ means $\sum_{H_1}{\sum_{H_2}{\dots}}$)

This has to be computed for every value in the domain of $Y$.
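
A minimal sketch of this over the common-cause network from earlier ($Z \to X$, $Z \to Y$, made-up CPTs), computing $P(X|Y=1)$ by summing out the hidden $Z$ and normalizing with $\alpha$:

```python
from itertools import product

# Hypothetical CPTs for Z -> X, Z -> Y (all binary).
P_Z = {0: 0.5, 1: 0.5}
P_X_given_Z = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}
P_Y_given_Z = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.1, 1: 0.9}}

def enumerate_X_given_Y(y_obs):
    # Unnormalized P(X, e): sum the full joint over the hidden variable Z,
    # once for every value of the query variable X.
    unnormalized = {}
    for x, z in product((0, 1), (0, 1)):
        p = P_Z[z] * P_X_given_Z[z][x] * P_Y_given_Z[z][y_obs]
        unnormalized[x] = unnormalized.get(x, 0.0) + p
    alpha = 1.0 / sum(unnormalized.values())  # normalization constant
    return {x: alpha * p for x, p in unnormalized.items()}

print(enumerate_X_given_Y(1))  # {0: 0.375, 1: 0.625}
```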