Probability and Bayes Nets
Probability
The general situation is:
Observed variables (evidence)
Unobserved variables
Model: Agent knows something about how the known variables relate to the unknown variables
A distribution is a table of probabilities of values.
A joint distribution over a set of random variables: specifies a real number for each assignment.
A probabilistic model is a joint distribution over a set of random variables.
An event is a set E of outcomes.
Marginal distributions are sub-tables which eliminate variables.
For continuous variables, the distribution is a density. It integrates to 1.
Conditional distributions are probability distributions over some variables given fixed values of others.
Probabilistic Inference
Our model is a joint distribution, i.e. a table of probabilities which captures the likelihood of each possible outcome, also known as an assignment.
Given a joint PDF, we can compute any probability distribution using a simple and intuitive procedure known as inference by enumeration.
Three types of variables:
Query variables
Evidence variables
Hidden variables
General idea: compute distribution on query variable by fixing evidence variables and summing over hidden variables
Example: we want to calculate $P(Q \mid e_1 \ldots e_k)$. We first sum out the hidden variables, $P(Q, e_1 \ldots e_k) = \sum_{h_1 \ldots h_r} P(Q, h_1 \ldots h_r, e_1 \ldots e_k)$, and then normalize.
The worst-case time complexity is $O(d^n)$ and the space complexity is $O(d^n)$ (for $n$ variables with domain size $d$), because we have to store the whole table.
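A minimal sketch of inference by enumeration, assuming a small joint table over three binary variables (Cavity, Toothache, Catch) stored as a Python dict; the variable names and numbers below are illustrative, not from these notes:

```python
# Hypothetical joint distribution over three binary variables:
# Cavity (query), Toothache (evidence), Catch (hidden).
# Each key is an assignment (cavity, toothache, catch); the values sum to 1.
joint = {
    (True,  True,  True):  0.108, (True,  True,  False): 0.012,
    (True,  False, True):  0.072, (True,  False, False): 0.008,
    (False, True,  True):  0.016, (False, True,  False): 0.064,
    (False, False, True):  0.144, (False, False, False): 0.576,
}

def query_cavity_given_toothache(toothache):
    """P(Cavity | Toothache=toothache): fix the evidence, sum out the hidden variable Catch."""
    unnormalized = {}
    for cavity in (True, False):
        unnormalized[cavity] = sum(joint[(cavity, toothache, catch)]
                                   for catch in (True, False))
    z = sum(unnormalized.values())               # normalization constant
    return {cavity: p / z for cavity, p in unnormalized.items()}

print(query_cavity_given_toothache(True))        # {True: 0.6, False: 0.4} for these numbers
```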
Independence
$X$ and $Y$ are independent if and only if: $\forall x, y:\ P(x, y) = P(x)\,P(y)$
$X$ and $Y$ are conditionally independent given $Z$ if and only if: $\forall x, y, z:\ P(x, y \mid z) = P(x \mid z)\,P(y \mid z)$
We can leverage independence to decompose a big problem into subproblems.
With conditional independence, we can further reduce the size of the joint distribution table: $P(\text{Toothache}, \text{Catch}, \text{Cavity}) = P(\text{Toothache} \mid \text{Cavity})\,P(\text{Catch} \mid \text{Cavity})\,P(\text{Cavity})$
Because toothache is conditionally independent of catch given cavity.
Bayes' Rule
Bayes' Rule: $P(x \mid y) = \dfrac{P(y \mid x)\,P(x)}{P(y)}$
This rule lets us build one conditional from its reverse.
Assessing diagnostic probability from causal probability: $P(\text{cause} \mid \text{effect}) = \dfrac{P(\text{effect} \mid \text{cause})\,P(\text{cause})}{P(\text{effect})}$
Posterior (conditional) probability is the prior (unconditional) probability updated with some evidence. Here, $P(\text{effect})$ is already known; we can simply absorb it into a normalization factor $\alpha$, writing $P(\text{cause} \mid \text{effect}) = \alpha\,P(\text{effect} \mid \text{cause})\,P(\text{cause})$.
This is an example of a naive Bayes model: $P(\text{Cause}, \text{Effect}_1, \ldots, \text{Effect}_n) = P(\text{Cause}) \prod_i P(\text{Effect}_i \mid \text{Cause})$
The total number of parameters is linear in $n$, which can greatly reduce the table size. But it rests on the assumption that each effect is conditionally independent of all other effects given the cause.
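A sketch of the corresponding naive Bayes computation under that assumption, with hypothetical CPT values: we only store $P(\text{Cause})$ plus one conditional per effect, then renormalize with $\alpha$.

```python
# Hypothetical naive Bayes model: Cavity causes Toothache and Catch,
# and the effects are conditionally independent given the cause.
p_cavity = {True: 0.2, False: 0.8}
p_toothache_given = {True: 0.6, False: 0.1}   # P(toothache=True | cavity)
p_catch_given     = {True: 0.9, False: 0.2}   # P(catch=True | cavity)

def posterior_cavity(toothache, catch):
    """P(Cavity | toothache, catch) = alpha * P(Cavity) * prod_i P(effect_i | Cavity)."""
    scores = {}
    for cavity in (True, False):
        pt = p_toothache_given[cavity] if toothache else 1 - p_toothache_given[cavity]
        pc = p_catch_given[cavity] if catch else 1 - p_catch_given[cavity]
        scores[cavity] = p_cavity[cavity] * pt * pc
    alpha = 1 / sum(scores.values())            # normalization factor
    return {cavity: alpha * s for cavity, s in scores.items()}

print(posterior_cavity(toothache=True, catch=True))
```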
Bayes Nets
It is impractical to store the entire joint distribution in memory.
Bayes' Net is a directed, acyclic graph.
An edge from node $A$ to node $B$ indicates that we store the probability table for $P(B \mid A)$.
A conditional probability table (CPT) is stored underneath each node: for each node $X$ there is a conditional distribution $P(X \mid \text{parents}(X))$, i.e. a collection of distributions over $X$, one for each combination of its parents' values.
Only distributions whose variables are absolutely independent can be represented by a Bayes' net without arcs.
Compactness A Bayes net is valued for its compactness. A CPT for a Boolean variable $X_i$ with $k$ Boolean parents has $2^k$ rows, one for each combination of parent values, so we just need to store $2^k$ entries for a Boolean $X_i$. (Each row requires one number $p$ for $X_i = \text{true}$; the number for $X_i = \text{false}$ is just $1 - p$.) If each variable has no more than $k$ parents, the complete network requires $O(n \cdot 2^k)$ numbers. That is linear in $n$, as opposed to $O(2^n)$ for the full joint distribution.
Global Semantics Global semantics defines the full joint distribution as the product of the local conditional distributions: $P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid \text{parents}(X_i))$
Local Semantics Each node is conditionally independent of its nondescendants given its parents.
Global Semantics $\Leftrightarrow$ Local Semantics
Each node is conditionally independent of all its ancestor nodes in the graph, given all of its parents.
Example: for five random variables
B: Burglary occurs.
A: Alarm goes off.
E: Earthquake occurs.
J: John calls.
M: Mary calls.
We can construct the Bayes Net according to the causal relationship:
The tail of a directed edge is the cause and the head is the effect.
In the alarm model above, we would store the probability tables $P(B)$, $P(E)$, $P(A \mid B, E)$, $P(J \mid A)$, and $P(M \mid A)$.
Given all of the CPTs for a graph, we can calculate the probability of a given assignment using the Bayes reconstitution formula (cumulatively multiply each variable's probability conditioned on its parents): $P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid \text{parents}(X_i))$. For the alarm net, for example, $P(-b, -e, +a, +j, +m) = P(-b)\,P(-e)\,P(+a \mid -b, -e)\,P(+j \mid +a)\,P(+m \mid +a)$.
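A sketch of the reconstitution formula on the alarm net above; the CPT values below are illustrative placeholders, not numbers given in these notes.

```python
# Illustrative CPTs for the alarm network (numbers are placeholders).
P_B = {True: 0.001, False: 0.999}              # P(Burglary)
P_E = {True: 0.002, False: 0.998}              # P(Earthquake)
P_A_given_BE = {                               # P(A=True | B, E)
    (True, True): 0.95, (True, False): 0.94,
    (False, True): 0.29, (False, False): 0.001,
}
P_J_given_A = {True: 0.90, False: 0.05}        # P(J=True | A)
P_M_given_A = {True: 0.70, False: 0.01}        # P(M=True | A)

def truthy(p_true, value):
    """Probability of `value` when we only store P(X=True)."""
    return p_true if value else 1 - p_true

def joint(b, e, a, j, m):
    """P(b, e, a, j, m) = P(b) P(e) P(a|b,e) P(j|a) P(m|a)."""
    return (P_B[b] * P_E[e]
            * truthy(P_A_given_BE[(b, e)], a)
            * truthy(P_J_given_A[a], j)
            * truthy(P_M_given_A[a], m))

# Probability that both neighbors call and the alarm sounds,
# but there is no burglary and no earthquake:
print(joint(b=False, e=False, a=True, j=True, m=True))
```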
Not every BN can represent every joint distribution.
BNs need not actually be causal: correlated but not causal.
Topology really encodes conditional independence.
Bayes Nets (Independence)
Important question about a BN: are two nodes independent given certain evidence?
If yes, can prove using algebra
If no, can prove with a counter example
Each node is conditionally independent of all its ancestor nodes in the graph, given all of its parents. But a node may not be independent of its ancestors without the information about its parents.
Three Basic Cases:
Low Pressure → Rain → Traffic (Causal chain)
According to the chain's joint distribution $P(x, y, z) = P(x)\,P(y \mid x)\,P(z \mid y)$, Traffic is not independent of Low Pressure in general. However, given Rain, the influence link is "blocked": $P(z \mid x, y) = P(z \mid y)$.
Forums busy ← Project Due → Lab full (Common cause)
Given Project Due, Forums busy and Lab full are independent.
Raining → Traffic ← Ballgame (Common effect)
Here, however, Raining and Ballgame are independent without any information about Traffic. Given Traffic, Raining and Ballgame are no longer independent: they are in competition as explanations. For example, given that there is traffic, learning that it isn't raining makes a ball game more likely.
The general case:
To get from point A to point B in a graph:
In a path: all segments need to be open.
In a network: one open path is enough.
D-separation is a criterion for deciding whether a set of variables $X$ is independent of another set $Y$, given a third set $Z$, in a Bayes net (i.e. whether $X \perp Y \mid Z$).
A path is active if each triple along it is active (i.e. influence can flow; the two end variables of the triple are not independent). The algorithm checks all undirected paths between $X$ and $Y$: if one or more is active, then independence is not guaranteed; otherwise, it is guaranteed. Once $X$ and $Y$ are "d-separated" by $Z$, it means $X \perp Y \mid Z$.
Given a Bayes net structure, we can run the d-separation algorithm to build a complete list of conditional independencies that are necessarily true, of the form $X_i \perp X_j \mid \{Z_{k_1}, \ldots, Z_{k_n}\}$.
The list determines the set of probability distributions that can be represented.
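As a concrete illustration, here is a compact sketch of the d-separation check described above, assuming the net is encoded as a hypothetical dict mapping each node to its list of parents (the encoding and helper names are made up, not from these notes). It enumerates simple undirected paths between the two query nodes and tests whether every consecutive triple along each path is active.

```python
def descendants(graph, node):
    """All descendants of `node` in a DAG given as {child: [parents]}."""
    children = {n: [] for n in graph}
    for child, parents in graph.items():
        for p in parents:
            children[p].append(child)
    found, stack = set(), [node]
    while stack:
        for c in children[stack.pop()]:
            if c not in found:
                found.add(c)
                stack.append(c)
    return found

def triple_active(graph, a, b, c, evidence):
    """Is the consecutive triple a - b - c on a path active given the evidence set?"""
    collider = a in graph[b] and c in graph[b]        # a -> b <- c (common effect)
    if collider:
        return b in evidence or bool(descendants(graph, b) & evidence)
    return b not in evidence                          # causal chain or common cause

def d_connected(graph, x, y, evidence, path=None):
    """True if at least one active undirected path links x and y given the evidence."""
    path = path or [x]
    if x == y:
        return True
    neighbors = set(graph[x]) | {c for c, ps in graph.items() if x in ps}
    for n in neighbors:
        if n in path:
            continue                                  # only simple paths
        if len(path) >= 2 and not triple_active(graph, path[-2], x, n, evidence):
            continue                                  # middle triple is blocked
        if d_connected(graph, n, y, evidence, path + [n]):
            return True
    return False

# Common-effect example: Raining -> Traffic <- Ballgame.
net = {"Raining": [], "Ballgame": [], "Traffic": ["Raining", "Ballgame"]}
print(d_connected(net, "Raining", "Ballgame", set()))        # False: guaranteed independent
print(d_connected(net, "Raining", "Ballgame", {"Traffic"}))  # True: independence not guaranteed
```

With no evidence, the common-effect triple is blocked, so Raining and Ballgame are reported independent; observing Traffic activates the triple, so independence is no longer guaranteed.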
We can prove the local Markov property through the Bayes reconstitution formula.
The local Markov property (a variable is conditionally independent of its non-descendants given its parents) is a necessary condition for the Bayes reconstitution formula.
In a Bayes net, the fewer arrows you draw, the more independence assumptions you make, the fewer joint probability distributions you can represent, and the smaller the CPT at each node.
Bayes Nets (Inference)
Inference is the process of calculating the joint PDF for some set of query variables based on some set of observed variables.
Previously, we performed inference by enumeration to compute probability distributions. We can do better by eliminating variables one by one (variable elimination). To eliminate a variable $X$, we:
Join (multiply together) all factors involving $X$.
Sum out $X$.
For four binary variables as below, if we want to answer a query by inference by enumeration, we have to build the full joint table with $2^4 = 16$ rows. With the Bayes net, however, we can eliminate the variables one by one.
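Here is a minimal sketch of that elimination on a hypothetical chain $X_1 \to X_2 \to X_3 \to X_4$ of binary variables (the CPT numbers are made up). Rather than building the 16-row joint, each step joins one CPT with the current factor and sums out one variable, so we never hold more than a two-entry factor.

```python
# Hypothetical chain X1 -> X2 -> X3 -> X4 with binary variables (numbers made up).
# cpt[(parent_value, child_value)] = P(child = child_value | parent = parent_value).
p_x1 = {True: 0.3, False: 0.7}
cpt_x2 = {(True, True): 0.8, (True, False): 0.2, (False, True): 0.1, (False, False): 0.9}
cpt_x3 = {(True, True): 0.6, (True, False): 0.4, (False, True): 0.3, (False, False): 0.7}
cpt_x4 = {(True, True): 0.9, (True, False): 0.1, (False, True): 0.5, (False, False): 0.5}

def eliminate(prior, cpt):
    """Join a factor over the parent with P(child | parent), then sum out the parent."""
    return {child: sum(prior[parent] * cpt[(parent, child)] for parent in (True, False))
            for child in (True, False)}

# P(X4): eliminate X1, then X2, then X3 -- never more than a 2-entry factor at a time.
p_x2 = eliminate(p_x1, cpt_x2)
p_x3 = eliminate(p_x2, cpt_x3)
p_x4 = eliminate(p_x3, cpt_x4)
print(p_x4)   # a distribution over X4; its entries sum to 1
```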
Bayes Nets (Sampling)
Why sample?
Learning: get samples from a distribution you don't know
Inference: getting a sample is faster than computing the right answer
Sampling in Bayes' Nets
Prior Sampling
Rejection Sampling
Likelihood weighting
Gibbs Sampling
Prior Sampling
Basic idea:
Draw samples from a sampling distribution
Compute an approximate posterior probability
Show this converges to the true probability
Let $N_{PS}(x_1, \ldots, x_n)$ be the number of samples generated for the event $x_1, \ldots, x_n$. Then the estimate $N_{PS}(x_1, \ldots, x_n)/N$ converges to the true probability $P(x_1, \ldots, x_n)$ as the number of samples $N$ grows.
The sampling procedure is consistent.
Consistent
An estimator is said to be consistent if it converges in probability to the true value of the parameter being estimated as the sample size increases.
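A sketch of prior sampling on a hypothetical two-node net Rain → Traffic (the CPT numbers are made up): each variable is sampled in topological order from its CPT, and the empirical counts converge to the true joint probabilities.

```python
import random

# Hypothetical two-node net: Rain -> Traffic (numbers are made up).
P_RAIN = 0.3
P_TRAFFIC_GIVEN_RAIN = {True: 0.8, False: 0.2}   # P(Traffic=True | Rain)

def prior_sample():
    """Sample (rain, traffic) in topological order, each from its CPT."""
    rain = random.random() < P_RAIN
    traffic = random.random() < P_TRAFFIC_GIVEN_RAIN[rain]
    return rain, traffic

samples = [prior_sample() for _ in range(100_000)]
# Empirical estimate of P(Rain=True, Traffic=True) -> 0.3 * 0.8 = 0.24 as N grows.
print(sum(1 for s in samples if s == (True, True)) / len(samples))
```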
Rejection Sampling
Let's say we want $P(Q \mid +e)$ for some query variable $Q$ and evidence $+e$. There's no point keeping all the samples around; we just tally counts of $Q$ as we go.
If we generate a sample that is inconsistent with the evidence, we just throw it away. Then why did we generate it in the first place?
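A sketch of rejection sampling on the same hypothetical Rain → Traffic net, estimating $P(\text{Rain} \mid \text{Traffic} = \text{true})$: samples whose Traffic value disagrees with the evidence are discarded as they are generated.

```python
import random

# Hypothetical CPTs (same made-up net as before).
P_RAIN = 0.3
P_TRAFFIC_GIVEN_RAIN = {True: 0.8, False: 0.2}

def rejection_sample_rain_given_traffic(n):
    """Estimate P(Rain=True | Traffic=True), throwing away samples with Traffic=False."""
    kept, rainy = 0, 0
    for _ in range(n):
        rain = random.random() < P_RAIN
        traffic = random.random() < P_TRAFFIC_GIVEN_RAIN[rain]
        if not traffic:          # inconsistent with the evidence: reject
            continue
        kept += 1
        rainy += rain
    return rainy / kept

print(rejection_sample_rain_given_traffic(100_000))   # ~0.24 / (0.24 + 0.14) ≈ 0.63
```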
Likelihood Weighting
We force the evidence variables to take their observed values, so every sample is now consistent with the evidence. The idea here is to fix the evidence variables and sample the rest.
But this is not necessarily consistent with the joint distribution. So we introduce the weight.
Sampling distribution if $z$ is sampled and the evidence $e$ is fixed: $S_{WS}(z, e) = \prod_i P(z_i \mid \text{Parents}(Z_i))$, with weight $w(z, e) = \prod_i P(e_i \mid \text{Parents}(E_i))$.
Together, the weighted sampling distribution is consistent: $S_{WS}(z, e) \cdot w(z, e) = \prod_i P(z_i \mid \text{Parents}(Z_i)) \prod_i P(e_i \mid \text{Parents}(E_i)) = P(z, e)$.
Likelihood weighting doesn't solve all our problems: evidence influences the choice of downstream variables, but not upstream ones.
The problem is that when the evidence variables lie downstream (at the bottom of the network), the sample weights can be extremely low in some cases, and the evidence cannot affect the choice of the upstream variables. But when the evidence variables lie upstream (at the top), the simulation is guided by the evidence all the way down.
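A sketch of likelihood weighting for the same hypothetical query: the evidence Traffic = true is fixed rather than sampled, and each sample is weighted by $P(\text{evidence} \mid \text{parents})$.

```python
import random

# Hypothetical CPTs (same made-up net as before).
P_RAIN = 0.3
P_TRAFFIC_GIVEN_RAIN = {True: 0.8, False: 0.2}

def likelihood_weighting_rain_given_traffic(n):
    """Estimate P(Rain=True | Traffic=True) with weighted samples."""
    total_weight, rainy_weight = 0.0, 0.0
    for _ in range(n):
        rain = random.random() < P_RAIN          # non-evidence variable: sample as usual
        weight = P_TRAFFIC_GIVEN_RAIN[rain]      # evidence fixed: weight by its CPT entry
        total_weight += weight
        if rain:
            rainy_weight += weight
    return rainy_weight / total_weight

print(likelihood_weighting_rain_given_traffic(100_000))   # ≈ 0.63, same target as before
```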
Gibbs Sampling
Step 1 Fix evidence
Step 2 Initialize other variables randomly
Step 3 Repeat
Choose a non-evidence variable $X$
Resample $X$ from $P(X \mid \text{all other variables})$
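A sketch of Gibbs sampling on a hypothetical chain Rain → Traffic → Late with evidence Late = true (CPT numbers made up): the evidence stays fixed, the other variables start at random values, and each step resamples one non-evidence variable conditioned on everything else.

```python
import random

# Hypothetical chain Rain -> Traffic -> Late; evidence: Late = True.
P_RAIN = 0.3
P_TRAFFIC_GIVEN_RAIN = {True: 0.8, False: 0.2}   # P(Traffic=True | Rain)
P_LATE_GIVEN_TRAFFIC = {True: 0.9, False: 0.1}   # P(Late=True | Traffic)

def resample(weights):
    """Draw True/False in proportion to the two unnormalized weights."""
    return random.random() < weights[True] / (weights[True] + weights[False])

def gibbs_rain_given_late(n_steps, burn_in=1000):
    rain, traffic = random.random() < 0.5, random.random() < 0.5   # random initialization
    rainy = 0
    for step in range(n_steps):
        # Resample Rain from P(Rain | traffic) ∝ P(Rain) P(traffic | Rain).
        rain = resample({v: (P_RAIN if v else 1 - P_RAIN)
                            * (P_TRAFFIC_GIVEN_RAIN[v] if traffic else 1 - P_TRAFFIC_GIVEN_RAIN[v])
                         for v in (True, False)})
        # Resample Traffic from P(Traffic | rain, Late=True) ∝ P(Traffic | rain) P(Late=True | Traffic).
        traffic = resample({v: (P_TRAFFIC_GIVEN_RAIN[rain] if v else 1 - P_TRAFFIC_GIVEN_RAIN[rain])
                               * P_LATE_GIVEN_TRAFFIC[v]
                            for v in (True, False)})
        if step >= burn_in:
            rainy += rain
    return rainy / (n_steps - burn_in)

print(gibbs_rain_given_late(50_000))   # estimate of P(Rain=True | Late=True)
```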