Probability and Bayes Nets
The general situation is:
Observed variables (evidence)
Unobserved variables
Model: the agent knows something about how the known variables relate to the unknown variables
A distribution is a table of probabilities of values.
A joint distribution over a set of random variables: specifies a real number for each assignment.
A probabilistic model is a joint distribution over a set of random variables.
An event is a set $E$ of outcomes.
Marginal distributions are sub-tables which eliminate variables.
For continuous variables, the distribution is a density; it integrates to 1.
Conditional distributions are probability distributions over some variables given fixed values of others.
Our model is a joint distribution, i.e. a table of probabilities which captures the likelihood of each possible outcome, also known as an assignment.
Three types of variables:
Query variables
Evidence (observed) variables
Hidden variables
General idea: compute distribution on query variable by fixing evidence variables and summing over hidden variables
We can leverage independence to decompose a big problem into subproblems.
With conditional independence, we can further reduce the size of the joint distribution table: $P(\text{Toothache}, \text{Catch}, \text{Cavity}) = P(\text{Toothache} \mid \text{Cavity})\, P(\text{Catch} \mid \text{Cavity})\, P(\text{Cavity})$,
because Toothache is conditionally independent of Catch given Cavity.
Bayes' Rule:
$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$
This rule lets us build one conditional from its reverse.
Assessing diagnostic probability from causal probability:
$$P(\text{cause} \mid \text{effect}) = \frac{P(\text{effect} \mid \text{cause})\, P(\text{cause})}{P(\text{effect})}$$
This is an example of a naive Bayes model:
$$P(\text{Cause}, \text{Effect}_1, \ldots, \text{Effect}_n) = P(\text{Cause}) \prod_i P(\text{Effect}_i \mid \text{Cause})$$
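As a quick numeric illustration of building a diagnostic conditional from the causal direction, here is a minimal Python sketch; the disease/symptom setup and all probabilities are made-up values for illustration only.

```python
# Hypothetical diagnostic example: compute P(disease | symptom) from the
# causal direction P(symptom | disease) via Bayes' rule. All numbers are
# made up for illustration.
p_disease = 0.01                 # prior P(cause)
p_symptom_given_disease = 0.9    # causal P(effect | cause)
p_symptom_given_healthy = 0.05   # causal P(effect | no cause)

# P(effect) by total probability.
p_symptom = (p_symptom_given_disease * p_disease
             + p_symptom_given_healthy * (1 - p_disease))

# Bayes' rule: P(cause | effect) = P(effect | cause) * P(cause) / P(effect).
p_disease_given_symptom = p_symptom_given_disease * p_disease / p_symptom
print(round(p_disease_given_symptom, 3))   # ~0.154
```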
It is impractical to store the entire joint distribution in memory.
A Bayes' net is a directed, acyclic graph.
A conditional probability table (CPT) is stored at each node.
Only distributions whose variables are absolutely independent can be represented by a Bayes' net without arcs.
Global Semantics Global semantics defines the full joint distribution as the product of the local conditional distributions:
$$P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid \text{parents}(X_i))$$
Local Semantics Each node is conditionally independent of its nondescendants given its parents.
Each node is conditionally independent of all its ancestor nodes in the graph, given all of its parents.
Example: for five random variables
B: Burglary occurs.
A: Alarm goes off.
E: Earthquake occurs.
J: John calls.
M: Mary calls.
We can construct the Bayes Net according to the causal relationship:
The tail of a directed edge is the cause and the head is the effect.
Not every BN can represent every joint distribution.
BNs need not actually be causal: the arcs may reflect correlation rather than causation.
Topology really encodes conditional independence.
Important question about a BN: are two variables independent given certain evidence?
If yes, can prove using algebra
If no, can prove with a counter example
Each node is conditionally independent of all its ancestor nodes in the graph, given all of its parents. But a node may not be independent of its ancestors without information about its parents.
Three Basic Cases:
Given Project Due, Forums busy and Lab full are independent.
Here, however, Raining and Ballgame are independent without information about Traffic. Given Traffic, Raining and Ballgame are not independent: they are in competition as explanations. If you know it isn't raining, it must be a ballgame.
The general case:
To get from point A to point B in a graph:
In a path: all segments (triples) need to be open (active).
In a network: one open path is enough.
Given a Bayes net structure, we can run the d-separation algorithm to build a complete list of conditional independencies that are necessarily true, of the form:
$$X_i \perp\!\!\!\perp X_j \mid \{Z_{k_1}, \ldots, Z_{k_m}\}$$
The list determines the set of probability distributions that can be represented.
We can prove the local Markov property through the Bayes reconstitution formula.
The local Markov property (each variable is conditionally independent of its non-descendants given its parents) is a necessary condition for the Bayes reconstitution formula.
In a Bayes net, the fewer arrows you draw, the more independence assumptions you make, the fewer joint probability distributions you can represent, and the smaller the conditional probability tables at each node.
Inference is the process of calculating the joint PDF for some set of query variables based on some set of observed variables.
Why sample?
Learning: get samples from a distribution you don't know
Inference: getting a sample is faster than computing the right answer
Sampling in Bayes' Nets
Prior Sampling
Rejection Sampling
Likelihood weighting
Gibbs Sampling
Prior Sampling
Basic idea: sample the variables in topological order, conditioning each on the values already sampled for its parents.
The sampling procedure is consistent.
Consistent
An estimator is said to be consistent if it converges in probability to the true value of the parameter being estimated as the sample size increases.
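Below is a minimal prior-sampling sketch for a tiny hypothetical net (Rain → Traffic); all CPT numbers are made up. Each variable is sampled in topological order, conditioned on the values already drawn for its parents.

```python
import random

# A minimal prior-sampling sketch for a tiny hypothetical net: Rain -> Traffic.
# All CPT numbers are made up for illustration.
P_rain_true = 0.3                                    # P(Rain = true)
P_traffic_true_given_rain = {True: 0.8, False: 0.1}  # P(Traffic = true | Rain)

def prior_sample():
    # Sample each variable in topological order, conditioned on its parents.
    rain = random.random() < P_rain_true
    traffic = random.random() < P_traffic_true_given_rain[rain]
    return {"Rain": rain, "Traffic": traffic}

samples = [prior_sample() for _ in range(100000)]
# Empirical P(Traffic = true); should approach 0.3 * 0.8 + 0.7 * 0.1 = 0.31.
print(sum(s["Traffic"] for s in samples) / len(samples))
```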
Rejection Sampling
If we draw a sample that is inconsistent with the evidence (a red/green sample in the figure), we just throw it away. Then why did we generate it in the first place?
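A minimal rejection-sampling sketch on the same hypothetical Rain → Traffic net: draw prior samples and discard the ones that contradict the evidence Traffic = true.

```python
import random

# Rejection sampling on the hypothetical Rain -> Traffic net (made-up CPTs):
# draw prior samples and throw away those inconsistent with the evidence.
P_rain_true = 0.3
P_traffic_true_given_rain = {True: 0.8, False: 0.1}

def prior_sample():
    rain = random.random() < P_rain_true
    traffic = random.random() < P_traffic_true_given_rain[rain]
    return rain, traffic

def estimate_rain_given_traffic(n=100000):
    kept = rain_count = rejected = 0
    for _ in range(n):
        rain, traffic = prior_sample()
        if not traffic:        # sample disagrees with evidence Traffic = true...
            rejected += 1      # ...so it is wasted work and gets discarded
            continue
        kept += 1
        rain_count += rain
    return rain_count / kept, rejected / n

estimate, wasted = estimate_rain_given_traffic()
print(estimate)   # exact value is (0.3*0.8) / (0.3*0.8 + 0.7*0.1) ~= 0.774
print(wasted)     # roughly 0.69 of the samples get thrown away
```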
Likelihood Weighting
We force the samples to agree with the evidence (the blue samples in the figure). Every sample is consistent with the evidence now. The idea here is to fix the evidence variables and sample the rest.
But this is not necessarily consistent with the joint distribution. So we introduce the weight.
Together, the weighted sampling distribution is consistent:
$$S_{WS}(z, e) \cdot w(z, e) = \prod_i P(z_i \mid \text{Parents}(Z_i)) \cdot \prod_i P(e_i \mid \text{Parents}(E_i)) = P(z, e)$$
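A minimal likelihood-weighting sketch on the same hypothetical Rain → Traffic net, estimating $P(\text{Rain} = \text{true} \mid \text{Traffic} = \text{true})$: the evidence is fixed rather than sampled, and each sample carries a weight equal to the probability of the evidence given its sampled parents.

```python
import random

# Likelihood weighting on the hypothetical Rain -> Traffic net (made-up CPTs),
# estimating P(Rain = true | Traffic = true).
P_rain_true = 0.3
P_traffic_true_given_rain = {True: 0.8, False: 0.1}

def weighted_sample():
    # Non-evidence variables are sampled from their CPTs as usual.
    rain = random.random() < P_rain_true
    # The evidence variable is fixed to its observed value; the sample is
    # weighted by the probability of that evidence given its sampled parents.
    weight = P_traffic_true_given_rain[rain]
    return rain, weight

numerator = denominator = 0.0
for _ in range(100000):
    rain, w = weighted_sample()
    denominator += w
    if rain:
        numerator += w
print(numerator / denominator)   # should approach 0.24 / 0.31 ~= 0.774
```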
Likelihood weighting doesn't solve all our problems: evidence influences the choice of downstream variables, but not upstream ones.
A simpler example would be the above one. Your simulation is like:
Gibbs Sampling
Step 1 Fix evidence
Step 2 Initialize other variables randomly
Step 3 Repeat
Given a joint PDF, we can compute any probability distribution using a simple and intuitive procedure known as inference by enumeration.
Query variables
Evidence variables
Example: we want to calculate the distribution of a query variable given some observed evidence.
The worst-case time complexity is $O(d^n)$ and the space complexity is $O(d^n)$ (for $n$ variables with $d$ values each), because we have to build and store the whole joint table.
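A minimal inference-by-enumeration sketch; the three Boolean variables $A$, $B$, $C$ and their probabilities are made up just to have a concrete joint table to condition and sum over.

```python
from itertools import product

# Inference by enumeration over a small made-up joint table for Boolean
# variables A, B, C. The joint is built from a hypothetical factorization
# only so that its 2^3 = 8 entries sum to 1.
P_A = {True: 0.4, False: 0.6}
P_B_given_A = {True: 0.7, False: 0.2}      # P(B = true | A)
P_C_given_A = {True: 0.9, False: 0.3}      # P(C = true | A)

joint = {}
for a, b, c in product([True, False], repeat=3):
    pb = P_B_given_A[a] if b else 1 - P_B_given_A[a]
    pc = P_C_given_A[a] if c else 1 - P_C_given_A[a]
    joint[(a, b, c)] = P_A[a] * pb * pc    # full joint table, 8 rows

def query_A_given_c_true():
    # Fix the evidence C = true, sum out the hidden variable B, normalize.
    unnormalized = {a: sum(joint[(a, b, True)] for b in [True, False])
                    for a in [True, False]}
    z = sum(unnormalized.values())          # normalization constant
    return {a: p / z for a, p in unnormalized.items()}

print(query_A_given_c_true())   # {True: ~0.667, False: ~0.333}
```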
$X$ and $Y$ are independent if and only if:
$$\forall x, y:\; P(x, y) = P(x)\, P(y)$$
$X$ and $Y$ are conditionally independent given $Z$ if and only if:
$$\forall x, y, z:\; P(x, y \mid z) = P(x \mid z)\, P(y \mid z)$$
Posterior (conditional) probability is prior (unconditional) probability updated with some evidence. Here, the probability of the evidence $P(e)$ is already known, so we can simply denote $1/P(e)$ with a normalization factor $\alpha$: $P(X \mid e) = \alpha\, P(X, e)$.
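For example, with made-up numbers: if $P(X{=}\text{true}, e) = 0.12$ and $P(X{=}\text{false}, e) = 0.08$, then $P(e) = 0.2$, so $\alpha = 1/0.2 = 5$ and $P(X \mid e) = \alpha \langle 0.12, 0.08 \rangle = \langle 0.6, 0.4 \rangle$.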
The total number of parameters is linear in the number of effects $n$, which can greatly reduce the table size. But this comes under the assumption that each effect is conditionally independent of all other effects given the cause.
An edge from node $A$ to node $B$ indicates that we store the probability table for $P(B \mid A)$.
There is a conditional distribution for each node: the CPT is a collection of distributions over $X$, one for each combination of the parents' values.
Compactness Bayes nets are valued for their compactness. A CPT for a Boolean variable $X_i$ with $k$ Boolean parents has $2^k$ rows, one for each combination of parent values, and we only need to store one entry per row for a Boolean $X_i$. (Each row requires one number $p$ for $X_i = \text{true}$, and the number for $X_i = \text{false}$ is just $1 - p$.) If each variable has no more than $k$ parents, the complete network requires $O(n \cdot 2^k)$ numbers. That is linear in $n$, as opposed to $O(2^n)$ for the full joint distribution.
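For instance (illustrative numbers), with $n = 30$ Boolean variables and at most $k = 5$ parents each, the network needs at most $30 \cdot 2^5 = 960$ stored numbers, whereas the full joint distribution over the same 30 variables needs $2^{30} - 1$ (over a billion) numbers.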
In the alarm model above, we would store the probability tables $P(B)$, $P(E)$, $P(A \mid B, E)$, $P(J \mid A)$, and $P(M \mid A)$.
Given all of the CPTs for a graph, we can calculate the probability of a given assignment using the Bayes reconstitution formula (cumulatively multiply each variable conditioned on its parents):
$$P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid \text{parents}(X_i))$$
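A minimal sketch of this product on the alarm network; the CPT numbers below are illustrative placeholders, not values given in these notes.

```python
# Chain-rule (Bayes reconstitution) product on the alarm network.
# The CPT numbers are illustrative placeholders only.
P_B = {True: 0.001, False: 0.999}
P_E = {True: 0.002, False: 0.998}
P_A_given_BE = {(True, True): 0.95, (True, False): 0.94,
                (False, True): 0.29, (False, False): 0.001}  # P(A=true | B, E)
P_J_given_A = {True: 0.90, False: 0.05}                      # P(J=true | A)
P_M_given_A = {True: 0.70, False: 0.01}                      # P(M=true | A)

def bernoulli(p_true, value):
    return p_true if value else 1 - p_true

def joint_probability(b, e, a, j, m):
    # Multiply each variable's CPT entry, conditioned on its parents.
    return (P_B[b] * P_E[e]
            * bernoulli(P_A_given_BE[(b, e)], a)
            * bernoulli(P_J_given_A[a], j)
            * bernoulli(P_M_given_A[a], m))

# P(-b, -e, a, j, m): alarm rings and both call, with no burglary/earthquake.
print(joint_probability(False, False, True, True, True))
```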
Low Pressure → Rain → Traffic (Causal chain)
According to the chain structure, Traffic is not independent of Low Pressure. However, given Rain, the influence link is "blocked".
Forums busy ← Project Due → Lab full (Common cause)
Raining → Traffic ← Ballgame (Common effect)
D-separation is a criterion for deciding whether a set of variables $X$ is independent of another set of variables $Y$, given a third set $Z$, in a Bayes net (i.e., whether $X \perp\!\!\!\perp Y \mid Z$ holds).
A path is active if each triple along it is active (i.e., influence can flow through the triple; the two end variables of the triple are not independent). The algorithm checks all undirected paths between $X$ and $Y$. If one or more of them is active, then independence is not guaranteed. Otherwise, $X$ and $Y$ are "d-separated" by $Z$, which means $X \perp\!\!\!\perp Y \mid Z$.
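A small sketch of just the per-triple rule that the d-separation check applies along each path (the path enumeration itself is omitted); the function name and the encoding of triple kinds are assumptions for illustration.

```python
def triple_is_active(kind, middle_observed):
    # kind: "chain" (X -> Z -> Y), "common_cause" (X <- Z -> Y),
    #       or "common_effect" (X -> Z <- Y).
    # middle_observed: whether Z is in the evidence set; for the
    # common-effect case, an observed descendant of Z also counts here.
    if kind in ("chain", "common_cause"):
        return not middle_observed   # blocked once the middle node is observed
    if kind == "common_effect":
        return middle_observed       # active only when Z (or a descendant) is observed
    raise ValueError(f"unknown triple kind: {kind}")

# Raining -> Traffic <- Ballgame: the triple becomes active once Traffic is observed.
print(triple_is_active("common_effect", middle_observed=True))   # True
print(triple_is_active("chain", middle_observed=True))           # False (blocked)
```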
Previously, we performed inference by enumeration to compute probability distributions. We can do better by eliminating variables one by one. To eliminate a variable $X$, we:
Join (multiply together) all factors involving $X$.
Sum out $X$.
For four Boolean variables as below, inference by enumeration would require building a joint table with $2^4 = 16$ rows. With the Bayes net, however, we can eliminate the variables one by one.
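A minimal variable-elimination sketch with factors stored as dictionaries; the three-variable chain R → T → L and its CPT numbers are made up for illustration.

```python
from itertools import product

# Factors are (variables, table) pairs, where `table` maps assignments
# (tuples of Boolean values, in the same order as `variables`) to numbers.

def join(f1, f2):
    # Multiply two factors into one over the union of their variables.
    vars1, t1 = f1
    vars2, t2 = f2
    out_vars = vars1 + [v for v in vars2 if v not in vars1]
    table = {}
    for assignment in product([True, False], repeat=len(out_vars)):
        env = dict(zip(out_vars, assignment))
        key1 = tuple(env[v] for v in vars1)
        key2 = tuple(env[v] for v in vars2)
        table[assignment] = t1[key1] * t2[key2]
    return out_vars, table

def sum_out(var, factor):
    # Marginalize a variable out of a factor.
    vars_, table = factor
    i = vars_.index(var)
    out_vars = vars_[:i] + vars_[i + 1:]
    out_table = {}
    for assignment, p in table.items():
        key = assignment[:i] + assignment[i + 1:]
        out_table[key] = out_table.get(key, 0.0) + p
    return out_vars, out_table

# Hypothetical CPTs for the chain R -> T -> L.
f_R = (["R"], {(True,): 0.1, (False,): 0.9})
f_T = (["R", "T"], {(True, True): 0.8, (True, False): 0.2,
                    (False, True): 0.1, (False, False): 0.9})
f_L = (["T", "L"], {(True, True): 0.3, (True, False): 0.7,
                    (False, True): 0.1, (False, False): 0.9})

# Eliminate R, then T, to obtain P(L).
f1 = sum_out("R", join(f_R, f_T))      # factor over T
f2 = sum_out("T", join(f1, f_L))       # factor over L
print(f2)   # P(L=true) = 0.134, P(L=false) = 0.866
```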
Draw $N$ samples from a sampling distribution $S$
Compute an approximate posterior probability $\hat{P}$
Show this converges to the true probability $P$
Let $N_{PS}(x_1, \ldots, x_n)$ be the number of samples generated for the event $x_1, \ldots, x_n$.
Let's say we want a conditional query $P(Q \mid e)$. There's no point keeping all the samples around: we just tally counts of the query variable as we go, rejecting any sample that is inconsistent with the evidence $e$.
Sampling distribution if $z$ is sampled and the evidence $e$ is fixed:
$$S_{WS}(z, e) = \prod_i P(z_i \mid \text{Parents}(Z_i))$$
Each sample is then weighted by $w(z, e) = \prod_i P(e_i \mid \text{Parents}(E_i))$.
The problem is that when the evidence variables sit downstream (at the leaves), the weight can be extremely low in some cases, and the evidence cannot affect the choice of the upstream variables. But when the evidence variables sit upstream (at the roots), your simulation is based on the evidence all the way down to the leaves.
Choose a non-evidence variable $X$
Resample $X$ from $P(X \mid \text{all other variables})$ (this depends only on $X$'s Markov blanket)
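A minimal Gibbs-sampling sketch on a hypothetical chain R → T → L (the same made-up CPTs as the variable-elimination sketch above), estimating $P(R = \text{true} \mid L = \text{true})$: fix the evidence, initialize the other variables randomly, and then repeatedly resample one non-evidence variable conditioned on everything else.

```python
import random

# Gibbs sampling on the hypothetical chain R -> T -> L (made-up CPTs),
# estimating P(R = true | L = true).
P_R = {True: 0.1, False: 0.9}
P_T_given_R = {True: 0.8, False: 0.1}   # P(T=true | R)
P_L_given_T = {True: 0.3, False: 0.1}   # P(L=true | T)

def bern(p_true, value):
    return p_true if value else 1 - p_true

def resample_R(t):
    # P(R | T=t) is proportional to P(R) * P(t | R).
    w_true = P_R[True] * bern(P_T_given_R[True], t)
    w_false = P_R[False] * bern(P_T_given_R[False], t)
    return random.random() < w_true / (w_true + w_false)

def resample_T(r, l):
    # P(T | R=r, L=l) is proportional to P(T | r) * P(l | T).
    w_true = bern(P_T_given_R[r], True) * bern(P_L_given_T[True], l)
    w_false = bern(P_T_given_R[r], False) * bern(P_L_given_T[False], l)
    return random.random() < w_true / (w_true + w_false)

# Step 1: fix the evidence L = true.  Step 2: initialize R and T randomly.
l = True
r, t = random.random() < 0.5, random.random() < 0.5
count_r = 0
steps = 200000
for _ in range(steps):
    # Step 3: repeatedly pick a non-evidence variable and resample it
    # conditioned on everything else (its Markov blanket).
    if random.random() < 0.5:
        r = resample_R(t)
    else:
        t = resample_T(r, l)
    count_r += r
print(count_r / steps)   # exact answer here is 0.026 / 0.134 ~= 0.194
```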