In the real world, there will be uncertainty and randomness due to several factors:
- Theoretical and modelling limitations
    - Coin toss: incomplete physics model
- Sensory and measurement limitations
    - Measurements not accurate enough
- Computational limitations
    - Computation too time consuming
Hence, using probabilities is often required.
A random variable (RV) is some aspect of the world about which there is uncertainty.
RVs are denoted with a capital letter and have an associated domain.
Unobserved RVs have distributions: a table of probabilities, one per value in the domain. A probability is a single number, e.g. P(W=rain)=0.1. The probabilities sum to 1 and none are negative.
Joint distribution over a set of RVs: a map from assignments/outcomes/atomic events to reals; P(X1=x1,…,Xn=xn).
Event: set E of assignments; P(E)=∑(x1,…,xn∈E)P(x1,…,xn).
Marginalization (summing out): projecting a joint distribution to a sub-distribution over subset of variables: P(X1=x1)=∑x2∈domain(X2)P(X1=x1,X2=x2).
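As a minimal sketch of marginalization (the variables W and T and all numbers below are invented for illustration), a joint distribution can be stored as a Python dict keyed by assignments and then summed out over one variable:

```python
# Joint distribution P(W, T) stored as a dict keyed by (w, t); values are illustrative.
joint = {
    ("rain", "warm"): 0.05, ("rain", "cold"): 0.25,
    ("sun", "warm"): 0.50, ("sun", "cold"): 0.20,
}

# Marginalization (summing out T): P(W=w) = sum over t of P(W=w, T=t)
marginal_w = {}
for (w, t), p in joint.items():
    marginal_w[w] = marginal_w.get(w, 0.0) + p

print(marginal_w)  # {'rain': 0.3, 'sun': 0.7} (up to floating-point rounding)
```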
Conditional probability: P(a∣b) = P(a,b) / P(b).
Conditional distribution: a probability distribution over some variables given fixed values of others. If W and T take binary values, P(W,T) is a 2-by-2 table, and P(W∣T) is two 2-row tables (one per value of T), each summing to 1.
To get the whole conditional distribution at once, select the joint probabilities matching the evidence and normalize the selection (make it sum to 1). Example:
P(T∣rain): select rows where R=rain, then divide the probabilities by the sum P(warm,rain)+P(cold,rain).
$$
P(x_1 \mid x_2) = \frac{P(x_1, x_2)}{P(x_2)} = \frac{P(x_1, x_2)}{\sum_{x_1 \in \operatorname{domain}(X_1)} P(x_1, x_2)}
$$
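A minimal sketch of the select-and-normalize procedure, reusing the same invented joint P(W, T) from the earlier sketch:

```python
# Conditional distribution P(T | W=rain): select entries with w == "rain", then normalize.
joint = {
    ("rain", "warm"): 0.05, ("rain", "cold"): 0.25,
    ("sun", "warm"): 0.50, ("sun", "cold"): 0.20,
}

selected = {t: p for (w, t), p in joint.items() if w == "rain"}
total = sum(selected.values())                       # P(W=rain) = 0.3
conditional = {t: p / total for t, p in selected.items()}

print(conditional)  # approximately {'warm': 0.167, 'cold': 0.833}
```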
Product rule:
$$
P(x \mid y) = \frac{P(x, y)}{P(y)} \quad \therefore \quad P(x, y) = P(x \mid y) \cdot P(y)
$$
Chain rule: P(x1,x2,x3)=P(x1)⋅P(x2∣x1)⋅P(x3∣x1,x2). More generally:
$$
P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid x_1, \ldots, x_{i-1})
$$
We often need to compute a desired probability from other known probabilities.
P(x,y)=P(x∣y)⋅P(y)=P(y∣x)⋅P(x). Dividing by the marginal P(y) gives Bayes' rule:
$$
P(x \mid y) = \frac{P(y \mid x) \cdot P(x)}{P(y)}
$$
This allows us to invert a conditional distribution; often one direction of the conditional is easy to specify while the other is tricky.
$$
\begin{aligned}
P(x,y|z) &= P(y,x|z) \\
\\
P(x,y|z) &= \frac{P(x,y,z)}{P(z)} \\
P(y|x,z) &= \frac{P(x,y,z)}{P(x,z)} \\
\therefore P(x,y,z) &= P(y|x,z) \cdot P(x,z) \\
\therefore P(x,y|z) &= \frac{P(y|x,z) \cdot P(x,z)}{P(z)} \\
&= P(y|x,z) \cdot P(x|z) \\
\\
\therefore P(x|y,z) &= \frac{P(y|x,z) \cdot P(x|z)}{P(y|z)} \\
\text{if } z \text{ is implicit:} \\
P(x|y) &= \frac{P(y|x) \cdot P(x)}{P(y)} \text{ (Bayes' rule)}
\end{aligned}
$$
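A small numeric sketch of Bayes' rule (the scenario and numbers are made up for illustration): given P(y∣x), P(x), and P(y), we recover P(x∣y):

```python
# Hypothetical values: x = "disease present", y = "test positive".
p_y_given_x = 0.9      # P(y | x): probability of a positive test given the disease
p_x = 0.01             # P(x): prior probability of the disease
p_y = 0.05             # P(y): overall probability of a positive test

# Bayes' rule: P(x | y) = P(y | x) * P(x) / P(y)
p_x_given_y = p_y_given_x * p_x / p_y
print(p_x_given_y)     # approximately 0.18
```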
A more general procedure: computing P(Y1,…,Ym ∣ e1,…,ek), where:
- (E1,…,Ek) = (e1,…,ek) are the evidence variables
- Y1,…,Ym are the query variables
- H1,…,Hr are the hidden variables
Together, these variables can be referred to as X1,…,Xn.
First, select entries consistent with the evidence.
Then, sum out H:
$$
P(Y_1, \ldots, Y_m, e_1, \ldots, e_k) = \sum_{h_1, \ldots, h_r} P(Y_1, \ldots, Y_m, h_1, \ldots, h_r, e_1, \ldots, e_k)
$$
Finally, normalize the remaining entries.
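A sketch of the whole procedure against a joint distribution stored as a dict keyed by full assignments; the function name, data layout, and numbers are assumptions for illustration only:

```python
def infer(joint, query_vars, evidence):
    """P(query_vars | evidence) from a joint stored as {tuple of (var, value): prob}."""
    result = {}
    for assignment, p in joint.items():
        a = dict(assignment)
        # Step 1: keep only entries consistent with the evidence.
        if any(a[var] != val for var, val in evidence.items()):
            continue
        # Step 2: sum out the hidden variables by grouping on the query assignment.
        key = tuple(a[var] for var in query_vars)
        result[key] = result.get(key, 0.0) + p
    # Step 3: normalize the remaining entries.
    total = sum(result.values())
    return {key: p / total for key, p in result.items()}

# Example with the invented P(W, T) joint from the earlier sketches:
joint = {
    (("W", "rain"), ("T", "warm")): 0.05, (("W", "rain"), ("T", "cold")): 0.25,
    (("W", "sun"), ("T", "warm")): 0.50, (("W", "sun"), ("T", "cold")): 0.20,
}
print(infer(joint, ["T"], {"W": "rain"}))  # approximately {('warm',): 0.167, ('cold',): 0.833}
```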
Simple models are easier to build, explain, and usually lead to lower time complexity and space requirements.
To measure complexity, count the number of free parameters that must be specified; a joint distribution over n variables, each with a domain size of d, requires d^n entries in the table, and the number of free parameters will be d^n − 1 (the last one can be inferred as probabilities must sum to 1).
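For example, a worked instance of this count with illustrative numbers:
$$
d = 2,\; n = 10: \qquad d^n = 2^{10} = 1024 \text{ entries}, \qquad d^n - 1 = 1023 \text{ free parameters}
$$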
Two variables are independent if:
$$
\begin{aligned}
P(X, Y) &= P(X) \cdot P(Y) \text{, or} \\
\forall x, y: P(x, y) &= P(X{=}x) \cdot P(Y{=}y)
\end{aligned}
$$
That is, their joint distribution factors into a product of two simpler distributions.
Absolute independence: X⊥⊥Y
Independence can be used as a modelling assumption; if all the variables are independent, then instead of the d^n − 1 parameters of the full joint model, we only need n(d − 1) (one small table per variable).
Absolute (unconditional) independence is very rare; conditional independence is more robust. For a given value of Z, the probability of X is independent of Y; X⊥⊥Y∣Z if:
$$
\begin{aligned}
\forall x, y, z: P(x, y \mid z) &= P(x \mid z) \cdot P(y \mid z) \text{, or} \\
\forall x, y, z: P(x \mid z, y) &= P(x \mid z)
\end{aligned}
$$
In this case:
$$
\begin{aligned}
P(X, Y, Z) &= P(X \mid Y, Z) \cdot P(Y, Z) \\
&= P(X \mid Y, Z) \cdot P(Y \mid Z) \cdot P(Z) \\
&= P(X \mid Z) \cdot P(Y \mid Z) \cdot P(Z)
\end{aligned}
$$
This can occur when X and Y both depend on Z but not directly on each other: the value of X modifies the probability of the parent Z's value, which in turn modifies the probability of Y; once Z is fixed, however, X carries no further information about Y.
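A small numeric check of this factorization (all distributions below are invented, and the joint is built from the factored form, so the conditional-independence check holds by construction):

```python
# P(Z), P(X | Z), P(Y | Z) for binary variables; the joint is built as P(X|Z) * P(Y|Z) * P(Z).
p_z = {0: 0.4, 1: 0.6}
p_x_given_z = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}  # p_x_given_z[z][x]
p_y_given_z = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.5, 1: 0.5}}  # p_y_given_z[z][y]

joint = {(x, y, z): p_x_given_z[z][x] * p_y_given_z[z][y] * p_z[z]
         for x in (0, 1) for y in (0, 1) for z in (0, 1)}

# Check the definition: for every x, y, z, P(x, y | z) == P(x | z) * P(y | z).
for (x, y, z), p_xyz in joint.items():
    assert abs(p_xyz / p_z[z] - p_x_given_z[z][x] * p_y_given_z[z][y]) < 1e-9
```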
- A way to describe complex joint distributions
- Sometimes called Bayes nets or Bayesian networks
- Local interactions chain together to give global, indirect interactions
Arcs:
- Allow dependence between variables
- Arcs don’t force dependence
- Arrows may imply causation (but don’t have to)
Nodes:
- One node per RV
- Either assigned (observed) or unassigned (unobserved)
- Conditionally independent of its non-descendants given its parents’ state
- Conditionally independent of all other nodes given its parents, children, and children’s parents’ state
Distributions:
- A collection of distributions (CPTs) over each node; one for each combination of the parents’ values
- P(X∣a1,…,an) for each combination of parent values a1,…,an (a storage sketch follows below)
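One possible way to store such a CPT (a sketch; the parent names and probabilities are invented): a table keyed by each combination of parent values, where every row is a distribution over X summing to 1:

```python
# CPT for a node X with two binary parents A1, A2: cpt[(a1, a2)][x] = P(X=x | A1=a1, A2=a2)
cpt = {
    (0, 0): {0: 0.99, 1: 0.01},
    (0, 1): {0: 0.70, 1: 0.30},
    (1, 0): {0: 0.40, 1: 0.60},
    (1, 1): {0: 0.05, 1: 0.95},
}
# Each row (one per combination of parent values) is itself a distribution over X.
assert all(abs(sum(row.values()) - 1.0) < 1e-9 for row in cpt.values())
```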
D-separation can be used to decide if a set of nodes X is independent of Y given Z.
BNs implicitly encode joint distributions; this can be calculated as a product of local conditional distributions.
Example:
A network with arcs A → B, A → C, B → D, C → D, and C → E.
$$
\begin{aligned}
P(a,b,c,d,e) &= P(e \mid a,b,c,d) \cdot P(a,b,c,d) \quad \text{(by the product rule)} \\
&= P(e \mid c) \cdot P(a,b,c,d) \quad \text{($e$ dependent on $c$ but independent of all others)} \\
&= P(e \mid c) \cdot P(d \mid a,b,c) \cdot P(a,b,c) \\
&= P(e \mid c) \cdot P(d \mid b,c) \cdot P(a,b,c) \\
&= P(e \mid c) \cdot P(d \mid b,c) \cdot P(c \mid a,b) \cdot P(a,b) \\
&= P(e \mid c) \cdot P(d \mid b,c) \cdot P(c \mid a) \cdot P(b \mid a) \cdot P(a)
\end{aligned}
$$
More generally, if you have a full assignment, multiplying the relevant conditionals gives the probability:
$$
P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid \text{parents}(X_i))
$$
```python
# P(x_1, ..., x_n) as a product of local conditionals (pseudocode; variables are 1-indexed)
product = 1
for i in range(1, n + 1):
    product *= probability(x[i], parents(i))  # P(x_i | values of X_i's parents)
```
Thus, we can reconstruct any entry of the full joint. However, not every BN can represent every full joint.
Computing P(Y∣e). If H are the hidden variables and α normalizes the values so the sum of the probabilities is 1:
$$
P(Y \mid e) = \alpha \cdot P(Y, e) = \alpha \sum_{H} P(Y, e, H)
$$
(NB: ∑H means ∑H1∑H2…)
This has to be computed for every value in the domain of Y.
- Answering P(Y): no evidence; all variables except the query are hidden
- Answering P(y∣e): answer P(Y∣e), then pick result for Y=y
- Answering P(Y1=y1,Y2=y2∣e): P(y1∣y2,e)⋅P(y2∣e)
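Putting the pieces together, a sketch of inference by enumeration in a tiny invented chain network A → B → C, where B is hidden when answering P(C ∣ A=a): multiply the local conditionals, sum out B, and normalize:

```python
# Tiny chain Bayes net A -> B -> C with made-up CPTs.
p_a = {0: 0.5, 1: 0.5}
p_b_given_a = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}   # p_b_given_a[a][b]
p_c_given_b = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}   # p_c_given_b[b][c]

def query_c_given_a(a_obs):
    """P(C | A=a_obs): multiply local CPTs, sum out the hidden B, then normalize."""
    unnormalized = {}
    for c in (0, 1):
        unnormalized[c] = sum(p_a[a_obs] * p_b_given_a[a_obs][b] * p_c_given_b[b][c]
                              for b in (0, 1))
    alpha = 1.0 / sum(unnormalized.values())
    return {c: alpha * p for c, p in unnormalized.items()}

print(query_c_given_a(1))  # approximately {0: 0.55, 1: 0.45}
```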