Probability and Bayes Nets

Probability

The general situation is:

  • Observed variables (evidence)

  • Unobserved variables

  • Model: the agent knows something about how the known variables relate to the unknown variables

A distribution is a table of probabilities of values.

A joint distribution over a set of random variables X_1, X_2, \dots, X_n specifies a real number for each assignment.

A probabilistic model is a joint distribution over a set of random variables.

An event is a set E of outcomes, with P(E)=\sum_{(x_1,\dots,x_n)\in E} P(x_1,\dots,x_n).

Marginal distributions are sub-tables which eliminate variables.

For continuous variables, P is a density; it integrates to 1.

Conditional distributions are probability distributions over some variables given fixed values of others.

P(A|B)=\frac{P(A,B)}{P(B)}
P(x_1,x_2,\dots,x_n) = \prod_{i=1}^{n} P(x_i|x_1,\dots,x_{i-1})
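For instance, expanding the chain rule for three variables (any ordering of the variables works):

P(x_1,x_2,x_3)=P(x_1)\,P(x_2|x_1)\,P(x_3|x_1,x_2)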

Probabilistic Inference

Our model is a joint distribution, i.e. a table of probabilities which captures the likelihood of each possible outcome, also known as an assignment.

Given a joint PDF, we can compute any probability distribution P(Q_1\dots Q_k|e_1\dots e_k) using a simple and intuitive procedure known as inference by enumeration.

Three types of variables:

  • Query variables Q_i

  • Evidence variables e_i

  • Hidden variables

General idea: compute distribution on query variable by fixing evidence variables and summing over hidden variables

Example: we want to calculate P(W|S=winter).

The worst-case time complexity is O(d^n), where d is the domain size and n is the number of variables, and the space complexity is also O(d^n) because we have to store the whole table.
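A minimal sketch of inference by enumeration over a joint table stored as a dict; the table values and variable ordering are illustrative, not from the notes:

# Joint distribution P(S, W) as a dict of assignments; numbers are illustrative.
joint = {
    ("winter", "sun"): 0.10, ("winter", "rain"): 0.30,
    ("summer", "sun"): 0.45, ("summer", "rain"): 0.15,
}

def enumerate_inference(joint, query_pos, evidence):
    """P(Query | evidence): keep the rows consistent with the evidence,
    sum out everything else, then normalize. evidence maps position -> value."""
    totals = {}
    for row, p in joint.items():
        if all(row[pos] == val for pos, val in evidence.items()):
            totals[row[query_pos]] = totals.get(row[query_pos], 0.0) + p
    z = sum(totals.values())
    return {value: p / z for value, p in totals.items()}

# P(W | S = winter) from the table above.
print(enumerate_inference(joint, query_pos=1, evidence={0: "winter"}))
# -> {'sun': 0.25, 'rain': 0.75}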

Independence

X, Y independent if and only if:

\forall x,y: P(x,y)=P(x)P(y)

X and Y are conditionally independent given Z if and only if:

\forall x,y,z: P(x,y|z)=P(x|z)P(y|z)

We can leverage independence to decompose a big problem into subproblems.

P(Toothache, Catch, Cavity, Weather) = P(Toothache, Catch, Cavity) P(Weather)

With conditional independence, we can further reduce the size of the joint distribution table:

P(Toothache|Catch, Cavity)=P(Toothache|Cavity)

This holds because Toothache is conditionally independent of Catch given Cavity.
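A quick numerical check of the unconditional independence definition above, with an illustrative two-variable joint (not from the notes):

def independent(joint, tol=1e-12):
    """True when P(x, y) = P(x) P(y) for every assignment in the joint table."""
    xs = {x for x, _ in joint}
    ys = {y for _, y in joint}
    px = {x: sum(joint[(x, y)] for y in ys) for x in xs}   # marginal P(x)
    py = {y: sum(joint[(x, y)] for x in xs) for y in ys}   # marginal P(y)
    return all(abs(joint[(x, y)] - px[x] * py[y]) < tol for x in xs for y in ys)

# A product-form joint: P(x) = (0.3, 0.7), P(y) = (0.2, 0.8).
print(independent({(0, 0): 0.06, (0, 1): 0.24, (1, 0): 0.14, (1, 1): 0.56}))  # True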

Bayes' Rule

Bayes' Rule:

P(x|y)=\frac{P(y|x)P(x)}{P(y)}

This rule lets us build one conditional from its reverse.

Assessing diagnostic probability from causal probability:

P(cause|effect)=\frac{P(effect|cause)P(cause)}{P(effect)}

The posterior (conditional) probability is the prior (unconditional) probability updated with some evidence. Here, P(effect) does not depend on the cause, so we can simply replace it with a normalization factor \alpha.

P(cause|effect)=\alpha \times P(effect|cause)P(cause)

This is an example of a naive Bayes model:

The total number of parameters is linear in n, which can greatly reduce the table size. But this relies on the assumption that each effect is conditionally independent of all the other effects given the cause.
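A small sketch of the naive Bayes posterior P(cause|effects) = \alpha P(cause) \prod_i P(effect_i|cause); the variable names and numbers below are illustrative assumptions:

def naive_bayes_posterior(prior, likelihoods, observed_effects):
    """P(cause | effects) is proportional to P(cause) * prod_i P(effect_i | cause), normalized."""
    unnorm = {}
    for cause, p_cause in prior.items():
        p = p_cause
        for effect, value in observed_effects.items():
            p *= likelihoods[effect][cause][value]
        unnorm[cause] = p
    alpha = 1.0 / sum(unnorm.values())
    return {cause: alpha * p for cause, p in unnorm.items()}

# Illustrative numbers: a cavity causes both toothache and catch.
prior = {"cavity": 0.2, "no_cavity": 0.8}
likelihoods = {
    "toothache": {"cavity": {True: 0.6, False: 0.4},
                  "no_cavity": {True: 0.1, False: 0.9}},
    "catch": {"cavity": {True: 0.9, False: 0.1},
              "no_cavity": {True: 0.2, False: 0.8}},
}
print(naive_bayes_posterior(prior, likelihoods, {"toothache": True, "catch": True}))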

Bayes Nets

It is impractical to store the entire joint distribution in memory.

A Bayes' net is a directed, acyclic graph.

An edge from node A to node B indicates that we store the probability table for P(B|A).

A conditional probability table (CPT) is stored underneath each node.

The CPT is a collection of distributions over X, one for each combination of the parents' values.

Only distributions whose variables are absolutely independent can be represented by a Bayes' net without arcs.

Compactness

A Bayes net is valued for its compactness. A CPT for a Boolean X_i with k Boolean parents has 2^k rows, one for each combination of parent values, so we only need to store 2^k entries for a Boolean X_i (each row requires one number p for X_i = true; the number for X_i = false is just 1 - p). If each variable has no more than k parents, the complete network requires O(n \cdot 2^k) numbers. That is linear in n, as opposed to O(2^n) for the full joint distribution.
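For instance (a standard illustration, not a figure from the notes), with n = 30 Boolean variables and at most k = 5 parents each:

n \cdot 2^k = 30 \cdot 2^5 = 960 \quad \text{vs.} \quad 2^n = 2^{30} \approx 10^9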

Global Semantics

Global semantics defines the full joint distribution as the product of the local conditional distributions:

P(x_1,x_2,\dots,x_n) =\prod_{i=1}^{n}P(x_i|parents(X_i))

Local Semantics

Each node is conditionally independent of its nondescendants given its parents.

Global Semantics \Leftrightarrow Local Semantics

Each node is conditionally independent of all its ancestor nodes in the graph, given all of its parents.

Example: for five random variables

  • B: Burglary occurs.

  • A: Alarm goes off.

  • E: Earthquake occurs.

  • J: John calls.

  • M: Mary calls.

We can construct the Bayes Net according to the causal relationship:

The tail of a directed edge is the cause and the head is the effect.

In the alarm model above, we would store probability tables P(B), P(E), P(A|B,E), P(J|A) and P(M|A).

Given all of the CPTs for a graph, we can calculate the probability of a given assignment using the Bayes reconstitution formula (cumulatively multiply each X_i conditioned on its parents):

P(X_1,X_2,\dots,X_n) =\prod_{i=1}^{n}P(X_i|parents(X_i))
P(-b,-e,+a,+j,-m) = P(-b)\cdot P(-e)\cdot P(+a|-b,-e)\cdot P(+j|+a)\cdot P(-m|+a)
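A runnable sketch of this reconstitution for the alarm net; the CPT numbers are illustrative (textbook-style) assumptions, not values given in these notes:

# P(+b), P(+e), P(+a | b, e), P(+j | a), P(+m | a), with illustrative numbers.
P_B = 0.001
P_E = 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}
P_M = {True: 0.70, False: 0.01}

def bern(value, p_true):
    """P(X = value) when P(X = true) = p_true."""
    return p_true if value else 1.0 - p_true

def joint(b, e, a, j, m):
    """P(b, e, a, j, m) = P(b) P(e) P(a|b,e) P(j|a) P(m|a)."""
    return (bern(b, P_B) * bern(e, P_E) * bern(a, P_A[(b, e)])
            * bern(j, P_J[a]) * bern(m, P_M[a]))

# The example from the notes: P(-b, -e, +a, +j, -m)
print(joint(b=False, e=False, a=True, j=True, m=False))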

Not every BN can represent every joint distribution.

BNs need not actually be causal: the arcs may encode correlation rather than causation.

Topology really encodes conditional independence.

Bayes Nets (Independence)

Important question about a BN: are two variables independent given certain evidence?

  • If yes, can prove using algebra

  • If no, can prove with a counter example

Each node is conditionally independent of all its ancestor nodes in the graph, given all of its parents. But a node may not be independent of its ancestors without the information about its parents.

Three Basic Cases:

  1. Low Pressure \rightarrow Rain \rightarrow Traffic (Causal chain)

    In this chain, Traffic is not independent of Low Pressure. However, given Rain, the influence link is "blocked".

  2. Forums busy \leftarrow Project Due \rightarrow Lab full (Common cause)

    Given Project Due, Forums busy and Lab full are independent.

  3. Raining \rightarrow Traffic \leftarrow Ballgame (Common effect)

    Here, however, Raining and Ballgame are independent without any information about Traffic. Given Traffic, Raining and Ballgame are not independent: they compete as explanations. If you know it isn't raining, it must be a ball game.

The general case:

To get from point A to point B in a graph:

  • In a path: all segments (triples) need to be open.

  • In a network: one open path is fine.

D-separation is a criterion for deciding whether a set of variables is independent of another set of variables in a Bayes net (i.e. X_i \perp\!\!\!\perp X_j \mid \{X_{k_1}, X_{k_2},\dots,X_{k_n}\}).

A path is active if each triple along it is active (i.e. influence can flow; the two ends of the triple are not independent). The algorithm checks all undirected paths between X_i and X_j. If one or more is active, then independence is not guaranteed. Otherwise, it is guaranteed. Once X and Y are "d-separated" by Z, it means X and Y are conditionally independent given Z.
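A tiny sketch of the per-triple activity rules the check relies on (the function and argument names are illustrative):

def triple_active(kind, middle_observed, middle_descendant_observed=False):
    """Can influence flow through a triple X - Y - Z?
    kind: 'chain' (X->Y->Z), 'common_cause' (X<-Y->Z), 'common_effect' (X->Y<-Z)."""
    if kind in ("chain", "common_cause"):
        # Blocked exactly when the middle variable is observed.
        return not middle_observed
    if kind == "common_effect":
        # Active when the middle variable or one of its descendants is observed.
        return middle_observed or middle_descendant_observed
    raise ValueError(kind)

# Raining -> Traffic <- Ballgame: blocked a priori, active once Traffic is observed.
print(triple_active("common_effect", middle_observed=False))  # False
print(triple_active("common_effect", middle_observed=True))   # True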

Given a Bayes net structure, we can run the d-separation algorithm to build a complete list of conditional independencies that are necessarily true, each of the form X_i \perp\!\!\!\perp X_j \mid \{X_{k_1}, X_{k_2},\dots,X_{k_n}\}.

The list determines the set of probability distributions that can be represented.

We can prove the local Markov property through the Bayes reconstitution formula.

The local Markov property (a variable is conditionally independent of its nondescendants given its parents) is a necessary condition for the Bayes reconstitution formula.

In a Bayes net, the fewer arrows you draw, the more independence assumptions you make, the fewer joint probability distributions you can represent, and the smaller the tables at each node.

Bayes Nets (Inference)

Inference is the process of calculating the joint PDF for some set of query variables based on some set of observed variables.

Previously, we performed inference by enumeration to compute probability distributions. We can instead do this by eliminating variables one by one. To eliminate a variable X, we:

  1. Join (multiply together) all factors involving X.

  2. Sum out X.

For four variables like the example below, if we want to calculate P(T|+e) by inference by enumeration, we would have to build the full joint table with 16 rows. With the Bayes net, however, we can eliminate the variables one by one.
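A minimal sketch of the two elimination operations on factors over Boolean variables; the factor encoding (a dict from a sorted tuple of (variable, value) pairs to a number) and the example numbers are assumptions for illustration:

from itertools import product

def join(f1, f2):
    """Multiply two factors into one over the union of their variables."""
    vars1 = {v for row in f1 for v, _ in row}
    vars2 = {v for row in f2 for v, _ in row}
    all_vars = sorted(vars1 | vars2)
    out = {}
    for values in product([True, False], repeat=len(all_vars)):
        assign = dict(zip(all_vars, values))
        sub1 = tuple(sorted((v, assign[v]) for v in vars1))
        sub2 = tuple(sorted((v, assign[v]) for v in vars2))
        if sub1 in f1 and sub2 in f2:
            out[tuple(sorted(assign.items()))] = f1[sub1] * f2[sub2]
    return out

def sum_out(factor, var):
    """Eliminate var from a factor by summing over its values."""
    out = {}
    for row, p in factor.items():
        key = tuple((v, val) for v, val in row if v != var)
        out[key] = out.get(key, 0.0) + p
    return out

# Eliminate R to get P(T) from P(R) and P(T | R); numbers illustrative.
f_R = {(("R", True),): 0.1, (("R", False),): 0.9}
f_T_given_R = {(("R", True), ("T", True)): 0.8, (("R", True), ("T", False)): 0.2,
               (("R", False), ("T", True)): 0.1, (("R", False), ("T", False)): 0.9}
print(sum_out(join(f_R, f_T_given_R), "R"))   # P(T = true) = 0.17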

Bayes Nets (Sampling)

Why sample?

  • Learning: get samples from a distribution you don't know

  • Inference: getting a sample is faster than computing the right answer

Sampling in Bayes' Nets

  • Prior Sampling

  • Rejection Sampling

  • Likelihood weighting

  • Gibbs Sampling

Prior Sampling

Basic idea:

  • Draw N samples from a sampling distribution S

  • Compute an approximate posterior probability \widehat{P}

  • Show this converges to the true probability P

for i in 1, 2, ..., n
	xi = sample(P(Xi | Parents(Xi)))
return (x1,x2,...,xn)
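A runnable version of prior sampling under an assumed CPT encoding (variable names and numbers are illustrative):

import random

def prior_sample(topo_order, cpts):
    """Sample each variable, in topological order, given its already-sampled parents.
    cpts[X] is (parents, table); table maps parent values to P(X = true)."""
    sample = {}
    for X in topo_order:
        parents, table = cpts[X]
        p_true = table[tuple(sample[p] for p in parents)]
        sample[X] = random.random() < p_true
    return sample

# Illustrative two-node net: Rain -> Traffic.
cpts = {
    "Rain": ((), {(): 0.1}),
    "Traffic": (("Rain",), {(True,): 0.8, (False,): 0.1}),
}
samples = [prior_sample(["Rain", "Traffic"], cpts) for _ in range(10000)]
print(sum(s["Traffic"] for s in samples) / len(samples))   # close to 0.17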

Let N_{PS}(x_1,\dots,x_n) be the number of samples generated for event x_1,\dots,x_n.

\lim_{N\rightarrow\infty} \widehat{P}(x_1,\dots,x_n) = \lim_{N\rightarrow\infty} \frac{N_{PS}(x_1,\dots,x_n)}{N} = S_{PS}(x_1,\dots,x_n) = P(x_1,\dots,x_n)

The sampling procedure is consistent.

Consistent

An estimator is said to be consistent if it converges in probability to the true value of the parameter being estimated as the sample size increases.

Rejection Sampling

Let's say we want P(C). There's no point keeping all samples around. We just tally counts of C as we go.

for i in 1, 2, ..., n
	xi = sample(P(Xi | Parents(Xi)))
	if xi not consistent with evidence
		reject
return (x1,x2,...,xn)
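A runnable sketch of rejection sampling under the same assumed CPT encoding as above (names and numbers illustrative):

import random

def rejection_sample(topo_order, cpts, evidence, n):
    """Draw n prior samples but throw away any sample that contradicts the evidence."""
    kept = []
    for _ in range(n):
        sample, ok = {}, True
        for X in topo_order:
            parents, table = cpts[X]
            sample[X] = random.random() < table[tuple(sample[p] for p in parents)]
            if X in evidence and sample[X] != evidence[X]:
                ok = False   # reject as soon as the evidence is violated
                break
        if ok:
            kept.append(sample)
    return kept

# Illustrative Rain -> Traffic net; estimate P(Rain | Traffic = true).
cpts = {"Rain": ((), {(): 0.1}),
        "Traffic": (("Rain",), {(True,): 0.8, (False,): 0.1})}
kept = rejection_sample(["Rain", "Traffic"], cpts, {"Traffic": True}, 20000)
print(sum(s["Rain"] for s in kept) / len(kept))   # close to 0.08 / 0.17 ≈ 0.47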

P(shape|blue)

If we find a red/green, we just throw it away. Then why did we generate it?

Likelihood Weighting

We force the samples to be blue. Every sample is consistent with the evidence now. The idea here is to fix evidence variables and sample the rest.

But this is not necessarily consistent with the joint distribution. So we introduce the weight.

w = 1.0

for i = 1, 2, ..., n
	if Xi is an evidence variable
		xi = observed value of Xi
		w = w * P(xi | Parents(Xi))
	else
		xi = sample(P(Xi | Parents(Xi)))

return (x1,x2,...,xn), w
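A runnable sketch of likelihood weighting under the same assumed CPT encoding (names and numbers illustrative):

import random

def weighted_sample(topo_order, cpts, evidence):
    """Fix evidence variables, sample the rest, and return (sample, weight)."""
    sample, w = {}, 1.0
    for X in topo_order:
        parents, table = cpts[X]
        p_true = table[tuple(sample[p] for p in parents)]
        if X in evidence:
            sample[X] = evidence[X]
            w *= p_true if evidence[X] else 1.0 - p_true   # weight by P(e | parents)
        else:
            sample[X] = random.random() < p_true
    return sample, w

# Illustrative Rain -> Traffic net; estimate P(Rain | Traffic = true).
cpts = {"Rain": ((), {(): 0.1}),
        "Traffic": (("Rain",), {(True,): 0.8, (False,): 0.1})}
pairs = [weighted_sample(["Rain", "Traffic"], cpts, {"Traffic": True})
         for _ in range(20000)]
print(sum(w for s, w in pairs if s["Rain"]) / sum(w for _, w in pairs))   # ≈ 0.47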

Sampling distribution if \bold{z} is sampled and e is the fixed evidence:

S_{WS}(\bold{z},e) = \prod_{i=1}^{l}P(z_i|Parents(Z_i))
\omega(\bold{z},e) = \prod_{i=1}^{m} P(e_i|Parents(E_i))

Together, the weighted sampling distribution is consistent:

S_{WS}(\bold{z},e)\cdot \omega(\bold{z},e) = \prod_{i=1}^{l}P(z_i|Parents(Z_i)) \prod_{i=1}^{m} P(e_i|Parents(E_i)) = P(\bold{z},e)

Likelihood weighting doesn't solve all our problems: evidence influences the choice of downstream variables, but not upstream ones.

The problem is, if J and M are the evidence, the weight is extremely low in some cases, and the evidence cannot affect the choice of A, B and E. But if A and B are the evidence, your simulation is based on the evidence all the way down to J.

A simpler example is the one above; the simulation looks like:

F A Weight
-f +a 0.01
-f +a 0.01
-f +a 0.01
-f +a 0.01
-f +a 0.01
......
+f +a 0.99

Gibbs Sampling

Step 1 Fix evidence

Step 2 Initialize other variables randomly

Step 3 Repeat

Choose a non-evidence variable X

Resample X from P(X | all other variables)
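A runnable Gibbs sketch for an assumed three-variable chain Rain -> Traffic -> Late with Late observed (the structure, CPT numbers, and helper names are illustrative). Note that P(X | all other variables) only depends on X's Markov blanket, i.e. P(X | parents) times the CPT entries of X's children:

import random

cpts = {
    "Rain": ((), {(): 0.1}),
    "Traffic": (("Rain",), {(True,): 0.8, (False,): 0.1}),
    "Late": (("Traffic",), {(True,): 0.6, (False,): 0.05}),
}
evidence = {"Late": True}

def local_prob(X, value, state):
    """P(X = value | parents(X)) under the current state."""
    parents, table = cpts[X]
    p_true = table[tuple(state[p] for p in parents)]
    return p_true if value else 1.0 - p_true

def resample(X, state):
    """Sample X from P(X | all other variables): proportional to
    P(X | parents) times the product of X's children's CPT entries."""
    children = [C for C, (ps, _) in cpts.items() if X in ps]
    weights = {}
    for value in (True, False):
        trial = dict(state, **{X: value})
        w = local_prob(X, value, trial)
        for C in children:
            w *= local_prob(C, trial[C], trial)
        weights[value] = w
    return random.random() < weights[True] / (weights[True] + weights[False])

# Step 1: fix evidence. Step 2: initialize the rest randomly. Step 3: repeat.
state = {X: random.random() < 0.5 for X in cpts}
state.update(evidence)
hits = 0
sweeps = 20000
for _ in range(sweeps):            # no burn-in, for brevity
    for X in cpts:
        if X not in evidence:
            state[X] = resample(X, state)
    hits += state["Rain"]
print(hits / sweeps)               # estimate of P(Rain | Late = true)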
