# Probability and Bayes Nets

## Probability

The general situation is:

* Observed variables (evidence)
* Unobserved variables
* Model: the agent knows something about how the known variables relate to the unknown variables

A **distribution** is a table of probabilities of values.

A **joint distribution** over a set of random variables: $$X\_1, X\_2,...,X\_n$$ specifies a real number for each assignment.

A **probabilistic model** is a joint distribution over a set of random variables.

An **event** is a set $$E$$ of outcomes, with probability $$P(E)=\sum\_{(x\_1,\dots,x\_n)\in E} P(x\_1, \dots, x\_n)$$.

**Marginal** distributions are sub-tables which eliminate variables.

For continuous variables, $$P$$ is a probability density; it integrates to 1.

**Conditional distributions** are probability distributions over some variables given fixed values of others.

$$
P(A|B)=\frac{P(A,B)}{P(B)}
$$

$$
P(x\_1,x\_2,\dots x\_n) = \prod\_{i=1}^{n} P(x\_i|x\_1,\dots,x\_{i-1})
$$
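The definitions above can be checked on a tiny table. A minimal sketch, with hypothetical weather/temperature numbers, that computes a marginal and a conditional and verifies the chain rule:

```python
# Joint distribution P(W, T) over hypothetical weather/temperature values.
joint = {
    ("sun", "hot"): 0.4, ("sun", "cold"): 0.2,
    ("rain", "hot"): 0.1, ("rain", "cold"): 0.3,
}

def marginal_w(w):
    """P(W=w): sum the joint over the other variable."""
    return sum(p for (wi, t), p in joint.items() if wi == w)

def conditional_t_given_w(t, w):
    """P(T=t | W=w) = P(w, t) / P(w)."""
    return joint[(w, t)] / marginal_w(w)

# Chain rule: P(w, t) = P(w) * P(t | w) for every assignment.
for (w, t), p in joint.items():
    assert abs(p - marginal_w(w) * conditional_t_given_w(t, w)) < 1e-12
```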

### Probabilistic Inference

Our model is a **joint distribution**, i.e. a table of probabilities which captures the likelihood of each possible **outcome**, also known as an **assignment**.

Given the joint distribution, we can compute any conditional distribution $$P(Q\_1\dots Q\_k|e\_1\dots e\_k)$$ using a simple and intuitive procedure known as **inference by enumeration**.

Three types of variables:

* Query variables $$Q\_i$$
* Evidence variables $$e\_i$$
* Hidden variables

General idea: compute distribution on query variable by fixing evidence variables and summing over hidden variables

**Example:** we want to calculate $$P(W|S=winter)$$

![Untitled](https://p.ipic.vip/u99fjt.png)

The worst-case time complexity is $$O(d^n)$$, and the space complexity is also $$O(d^n)$$ because we have to store the whole table.
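Inference by enumeration can be sketched directly over a joint table. A minimal version with hypothetical numbers, loosely following the example above (season $$S$$ is evidence, temperature $$T$$ is hidden, weather $$W$$ is the query):

```python
# Joint distribution P(S, T, W) with hypothetical numbers.
joint = {
    ("summer", "hot", "sun"): 0.30, ("summer", "hot", "rain"): 0.05,
    ("summer", "cold", "sun"): 0.10, ("summer", "cold", "rain"): 0.05,
    ("winter", "hot", "sun"): 0.10, ("winter", "hot", "rain"): 0.05,
    ("winter", "cold", "sun"): 0.15, ("winter", "cold", "rain"): 0.20,
}

def infer(query_values, season):
    """P(W | S=season): fix evidence, sum out hidden T, normalize."""
    unnorm = {}
    for w in query_values:
        unnorm[w] = sum(p for (s, t, wi), p in joint.items()
                        if s == season and wi == w)   # sum over hidden T
    z = sum(unnorm.values())                          # normalization constant
    return {w: p / z for w, p in unnorm.items()}

dist = infer(["sun", "rain"], "winter")
```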

### Independence

$$X$$ and $$Y$$ are independent if and only if:

$$
\forall x,y:P(x,y)=P(x)P(y)
$$

$$X$$ and $$Y$$ are conditionally independent given $$Z$$ if and only if:

$$
\forall x,y,z:P(x,y|z)=P(x|z)P(y|z)
$$

We can leverage independence to decompose a big problem into subproblems.

$$
P(Toothache, Catch, Cavity, Weather) = P(Toothache, Catch, Cavity) P(Weather)
$$

With conditional independence, we can further reduce the size of joint distribution table:

$$
P(Toothache|Catch, Cavity)=P(Toothache|Cavity)
$$

This holds because Toothache is conditionally independent of Catch given Cavity.
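Conditional independence can be verified numerically from a joint table. A sketch with hypothetical CPT values: it builds the joint as $$P(z)P(x|z)P(y|z)$$, so the check passes by construction:

```python
# Build a joint over (X, Y, Z) that is conditionally independent by design.
p_z = {"+z": 0.5, "-z": 0.5}
p_x_given_z = {("+x", "+z"): 0.8, ("-x", "+z"): 0.2,
               ("+x", "-z"): 0.3, ("-x", "-z"): 0.7}
p_y_given_z = {("+y", "+z"): 0.6, ("-y", "+z"): 0.4,
               ("+y", "-z"): 0.1, ("-y", "-z"): 0.9}

joint = {(x, y, z): p_z[z] * p_x_given_z[(x, z)] * p_y_given_z[(y, z)]
         for x in ("+x", "-x") for y in ("+y", "-y") for z in ("+z", "-z")}

def cond_indep(joint):
    """True iff P(x, y | z) == P(x | z) * P(y | z) for all x, y, z."""
    for z in ("+z", "-z"):
        pz = sum(p for (x, y, zi), p in joint.items() if zi == z)
        for x in ("+x", "-x"):
            px = sum(p for (xi, y, zi), p in joint.items()
                     if xi == x and zi == z) / pz
            for y in ("+y", "-y"):
                py = sum(p for (xi, yi, zi), p in joint.items()
                         if yi == y and zi == z) / pz
                if abs(joint[(x, y, z)] / pz - px * py) > 1e-12:
                    return False
    return True
```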

### Bayes' Rule

Bayes' Rule:

$$
P(x|y)=\frac{P(y|x)}{P(y)}P(x)
$$

This rule lets us build one conditional from its reverse.

Assessing diagnostic probability from causal probability:

$$
P(cause|effect)=\frac{P(effect|cause)P(cause)}{P(effect)}
$$

The posterior (conditional) probability is the prior (unconditional) probability updated with some evidence. Here, $$P(effect)$$ is the same for every value of the cause, so we can simply denote it with a normalization factor $$\alpha$$.

$$
P(cause|effect)=\alpha \times P(effect|cause)P(cause)
$$
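The normalization trick is easy to see in code. A minimal sketch with hypothetical numbers for a single boolean cause:

```python
# Bayes' rule with normalization: P(cause | effect) ∝ P(effect | cause) P(cause).
# All numbers below are hypothetical.
p_cause = 0.01
p_effect_given_cause = 0.9
p_effect_given_not_cause = 0.2

# Unnormalized posterior over {cause, ¬cause}:
u_cause = p_effect_given_cause * p_cause
u_not = p_effect_given_not_cause * (1 - p_cause)

alpha = 1.0 / (u_cause + u_not)   # normalization factor (1 / P(effect))
posterior = alpha * u_cause       # P(cause | effect)
```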

This is an example of a naive Bayes model:

<figure><img src="https://p.ipic.vip/xg8rtd.png" alt="Screenshot 2023-06-09 at 10.39.09 PM"><figcaption></figcaption></figure>

The total number of parameters is linear in $$n$$, which greatly reduces the table size. But this rests on the assumption that each effect is conditionally independent of all other effects given the cause.

## Bayes Nets

It is impractical to store the **entire** joint distribution in memory.

A Bayes net is a directed, acyclic graph.

An edge from node $$A$$ to node $$B$$ indicates that we store the probability table for $$P(B | A)$$.

A conditional probability table (CPT) is stored underneath each node.

The CPT is a collection of distributions over $$X$$, one for each combination of parents' values.


Only distributions whose variables are absolutely independent can be represented by a Bayes' net without arcs.

**Compactness** A Bayes net is valued for its compactness. A CPT for a Boolean $$X\_i$$ with $$k$$ Boolean parents has $$2^k$$ rows, one for **each combination of parent values**, so we only need to store $$2^k$$ numbers for a Boolean $$X\_i$$. (Each row requires one number $$p$$ for $$X\_i = true$$; the number for $$X\_i = false$$ is just $$1 − p$$.) If each variable has no more than $$k$$ parents, the complete network requires $$O(n \cdot 2^k)$$ numbers. That is linear in $$n$$, as opposed to $$O(2^n)$$ for the full joint distribution.

**Global Semantics** Global semantics defines the full joint distribution as the product of the local conditional distributions:

$$
P(x\_1,x\_2,\dots ,x\_n) =\prod\_{i=1}^{n}P(x\_i|parents(X\_i))
$$

**Local Semantics** Each node is conditionally independent of its nondescendants given its parents.

**Global Semantics** $$\Leftrightarrow$$ **Local Semantics**

<figure><img src="https://p.ipic.vip/qed6ki.png" alt="Screenshot 2023-04-27 at 10.38.06 AM"><figcaption></figcaption></figure>

**Each node is conditionally independent of all its ancestor nodes in the graph, given all of its parents.**

**Example:** for five random variables

* B: Burglary occurs.
* A: Alarm goes off.
* E: Earthquake occurs.
* J: John calls.
* M: Mary calls.

We can construct the Bayes Net according to the causal relationship:

<figure><img src="https://p.ipic.vip/fzesun.png" alt="Untitled"><figcaption></figcaption></figure>

The tail of a directed edge is the cause and the head is considered as the effect.

In the alarm model above, we would store probability tables $$P(B),P(E),P(A | B,E),P(J | A)$$ and $$P(M | A)$$.

Given all of the CPTs for a graph, we can calculate the probability of a given assignment using the **Bayes reconstitution formula** (cumulatively multiply each variable conditioned on its parents):

$$
P(X\_1,X\_2,\dots ,X\_n) =\prod\_{i=1}^{n}P(X\_i|parents(X\_i))
$$

$$
P(−b,−e,+a,+j,−m) = P(−b)·P(−e)·P(+a | −b,−e)·P(+j | +a)·P(−m | +a)
$$
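This product can be computed directly from the CPTs. A sketch using the classic textbook values for the alarm network (assumed here; check them against the figure):

```python
# CPTs for the alarm network (classic textbook values, assumed).
p_b = {"+b": 0.001, "-b": 0.999}
p_e = {"+e": 0.002, "-e": 0.998}
p_a = {("+b", "+e"): 0.95, ("+b", "-e"): 0.94,
       ("-b", "+e"): 0.29, ("-b", "-e"): 0.001}   # P(+a | B, E)
p_j = {"+a": 0.90, "-a": 0.05}                     # P(+j | A)
p_m = {"+a": 0.70, "-a": 0.01}                     # P(+m | A)

def joint(b, e, a, j, m):
    """P(B, E, A, J, M): product of each variable given its parents."""
    pa = p_a[(b, e)] if a == "+a" else 1 - p_a[(b, e)]
    pj = p_j[a] if j == "+j" else 1 - p_j[a]
    pm = p_m[a] if m == "+m" else 1 - p_m[a]
    return p_b[b] * p_e[e] * pa * pj * pm

p = joint("-b", "-e", "+a", "+j", "-m")
```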

Not every BN can represent every joint distribution.

**BNs need not actually be causal**: the arrows may reflect correlation rather than causation.

Topology really encodes conditional independence.

### Bayes Nets (Independence)

Important question about a BN: are two variables guaranteed independent given certain evidence?

* If yes, we can prove it using algebra
* If no, we can prove it with a counterexample

Each node is conditionally independent of all its ancestor nodes in the graph, **given all of its parents**. But a node may not be independent of its ancestors without the information about its parents.

**Three Basic Cases:**

1. Low Pressure $$\rightarrow$$ Rain $$\rightarrow$$ Traffic (Causal chain)

   In general, Traffic is not independent of Low Pressure. However, given Rain, the influence link is "blocked".
2. Forums busy $$\leftarrow$$ Project Due $$\rightarrow$$ Lab full (Common cause)

   Given Project Due, Forums busy and Lab full are independent.
3. Raining $$\rightarrow$$ Traffic $$\leftarrow$$ Ballgame (Common effect)

   Here, however, Raining and Ballgame are independent without information about Traffic. Given Traffic, Raining and Ballgame are no longer independent: they compete as explanations. If you know there is traffic and it isn't raining, it must be the ballgame ("explaining away").
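The three cases can be summarized as a small lookup. A simplified sketch (it only considers whether the middle variable of the triple is observed; in the common-effect case, observing any descendant of the middle variable also activates the triple):

```python
def triple_active(kind, middle_observed):
    """Can influence flow through a triple of the given kind?"""
    if kind in ("causal_chain", "common_cause"):
        return not middle_observed   # observing the middle blocks the path
    if kind == "common_effect":
        return middle_observed       # observing the common effect activates it
    raise ValueError(kind)
```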

**The general case:**

To get from point A to point B in a graph:

* Along a single path: all segments need to be open.
* Across the network: one open path is enough.

D-separation is a criterion for deciding whether a set of variables is independent of another set of variables in a Bayes net (i.e. $$X\_i\newcommand{\indep}{\perp\!\!\!\perp} \indep X\_j |{X\_{k\_1}, X\_{k\_2},\dots,X\_{k\_n}}$$).

A path is active if every triple along it is active (i.e. the **influence** can flow through it; the endpoints of the triple are **not independent**). The algorithm checks all undirected paths between $$X\_i$$ and $$X\_j$$: if one or more paths are active, independence is not guaranteed; otherwise, it is guaranteed. Once $$X$$ and $$Y$$ are "d-separated" by $$Z$$, we have $$X\indep Y | Z$$.

Given a Bayes net structure, we can run d-separation algorithm to build a complete list of conditional independencies that are necessarily true of the form:

$$
X\_i \indep X\_j | {X\_{k\_1}, \dots X\_{k\_n}}
$$

The list determines the set of probability distributions that can be represented.

We can prove the local Markov property through the **Bayes reconstitution formula**.

The local Markov property (a variable is conditionally independent of its non-descendants given its parents) is a necessary condition for the Bayes reconstitution formula.

In a Bayes net, the fewer arrows you draw, the more independence assumptions you make, the fewer joint probability distributions you can represent, and the smaller the CPT at each node.

### Bayes Nets (Inference)

Inference is the process of calculating the probability distribution over some set of query variables, given some set of observed variables.

Previously, we performed inference by enumeration to compute probability distributions. In a Bayes net, we can instead **eliminate** variables one by one. To eliminate a variable $$X$$, we:

1. Join (multiply together) all factors involving $$X$$.
2. Sum out $$X$$.

For the four variables below, if we want to calculate $$P(T|+e)$$ by inference by enumeration, we would need the full joint distribution with 16 rows. With the Bayes net, however, we can eliminate the variables one by one.

![Untitled](https://p.ipic.vip/m92gha.jpg)
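The join-then-sum-out step can be sketched on the smallest possible example, a chain $$R \rightarrow T$$ with hypothetical numbers:

```python
# Factors as dicts from value tuples to numbers (hypothetical CPTs).
f_r = {("+r",): 0.1, ("-r",): 0.9}                 # P(R)
f_t_r = {("+r", "+t"): 0.8, ("+r", "-t"): 0.2,
         ("-r", "+t"): 0.1, ("-r", "-t"): 0.9}     # P(T | R)

# Join: multiply entries that agree on R.
joined = {(r, t): f_r[(r,)] * f_t_r[(r, t)]
          for r in ("+r", "-r") for t in ("+t", "-t")}

# Sum out R: collapse the joined factor onto T, leaving P(T).
f_t = {}
for (r, t), p in joined.items():
    f_t[t] = f_t.get(t, 0.0) + p
```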

### Bayes Nets (Sampling)

Why sample?

* Learning: get samples from a distribution you don't know
* Inference: getting a sample is faster than computing the right answer

**Sampling in Bayes' Nets**

* Prior Sampling
* Rejection Sampling
* Likelihood weighting
* Gibbs Sampling

**Prior Sampling**

Basic idea:

* Draw $$N$$ samples from a sampling distribution $$S$$
* Compute an approximate posterior probability $$\widehat{P}$$
* Show this converges to the true probability $$P$$

<figure><img src="https://p.ipic.vip/e0d689.png" alt="Screenshot 2023-05-14 at 9.26.12 PM"><figcaption></figcaption></figure>

```pseudocode
for i in 1, 2, ..., n
	sample(xi, P(Xi|Parents(Xi)))
return (x1,x2,...,xn)
```
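A runnable sketch of prior sampling for a hypothetical two-node net $$R \rightarrow T$$: sample each variable in topological order, conditioned on its already-sampled parents.

```python
import random

def sample_once(rng):
    """One prior sample from a hypothetical net R -> T."""
    r = "+r" if rng.random() < 0.1 else "-r"   # P(+r) = 0.1
    p_t = 0.8 if r == "+r" else 0.1            # P(+t | R)
    t = "+t" if rng.random() < p_t else "-t"
    return r, t

rng = random.Random(0)
samples = [sample_once(rng) for _ in range(100_000)]
est = sum(1 for _, t in samples if t == "+t") / len(samples)
# est approximates P(+t) = 0.1 * 0.8 + 0.9 * 0.1 = 0.17
```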

Let $$N\_{PS}(x\_1,\dots,x\_n)$$ be the number of samples generated for event $$x\_1,\dots, x\_n$$

$$
\lim\_{N\rightarrow\infty} \widehat{P}(x\_1,\dots,x\_n) = \lim\_{N\rightarrow \infty} {\frac{N\_{PS} (x\_1,\dots,x\_n)}{N}} = S\_{PS}(x\_1,\dots,x\_n) = P(x\_1,\dots,x\_n)
$$

The sampling procedure is consistent.

> **Consistent**
>
> An estimator is said to be consistent if it converges in probability to the true value of the parameter being estimated as the sample size increases.

**Rejection Sampling**

Let's say we want $$P(C)$$. There's no point keeping all samples around. We just tally counts of $$C$$ as we go.

```pseudocode
for i in 1, 2, ..., n
	sample(xi, P(Xi|Parents(Xi)))
	if xi not consistent with evidence
		reject: return no sample
return (x1,x2,...,xn)
```
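A runnable sketch of rejection sampling on a hypothetical two-node net $$R \rightarrow T$$, querying $$P(R|+t)$$: generate prior samples and throw away those inconsistent with the evidence.

```python
import random

def sample_once(rng):
    """One prior sample from a hypothetical net R -> T."""
    r = "+r" if rng.random() < 0.1 else "-r"   # P(+r) = 0.1
    p_t = 0.8 if r == "+r" else 0.1            # P(+t | R)
    t = "+t" if rng.random() < p_t else "-t"
    return r, t

rng = random.Random(0)
# Keep only samples consistent with the evidence T = +t.
kept = [r for r, t in (sample_once(rng) for _ in range(200_000)) if t == "+t"]
est = sum(1 for r in kept if r == "+r") / len(kept)
# est approximates P(+r | +t) = 0.08 / 0.17 ≈ 0.47
```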

![Screenshot 2023-05-24 at 8.17.26 PM](https://p.ipic.vip/t8birf.png)

$$P(shape|blue)$$

If we find a red or green sample, we just throw it away. Then why did we generate it in the first place?

**Likelihood Weighting**

<figure><img src="https://p.ipic.vip/fx13gv.png" alt="Screenshot 2023-05-24 at 8.21.48 PM"><figcaption></figcaption></figure>

We force the samples to be blue. Every sample is consistent with the evidence now. The idea here is to fix evidence variables and sample the rest.

But this is not necessarily consistent with the joint distribution. So we introduce the weight.

```pseudocode
w = 1.0

for i = 1, 2, ..., n
	if Xi is an evidence variable
		Xi = observation xi for Xi
		w = w * P(xi|Parents(Xi))
	else
		sample(xi, P(Xi|Parents(Xi)))

return (x1,x2,...,xn), w
```
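A runnable sketch of likelihood weighting on a hypothetical two-node net $$R \rightarrow T$$ with evidence $$T = +t$$: sample only the non-evidence variable and weight each sample by the probability of the evidence.

```python
import random

rng = random.Random(0)
num, den = 0.0, 0.0
for _ in range(100_000):
    r = "+r" if rng.random() < 0.1 else "-r"   # sample non-evidence R; P(+r) = 0.1
    w = 0.8 if r == "+r" else 0.1              # weight = P(+t | r)
    den += w
    if r == "+r":
        num += w
est = num / den
# est approximates P(+r | +t) = 0.08 / 0.17 ≈ 0.47
```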

The sampling distribution if $$\bold{z}$$ is sampled and $$e$$ is the fixed evidence:

$$
S\_{WS}(\bold{z},e) = \prod\_{i=1}^{l}P(z\_i|Parents(Z\_i))
$$

$$
\omega(\bold{z},e) = \prod\_{i=1}^{m} P(e\_i|Parents(E\_i))
$$

Together, the weighted sampling distribution is consistent:

$$
S\_{WS}(\bold{z},e)\cdot \omega(\bold{z},e) = \prod\_{i=1}^{l}P(z\_i|Parents(Z\_i)) \prod\_{i=1}^{m} P(e\_i|Parents(E\_i))
$$

Likelihood weighting doesn't solve all our problems: **evidence influences the choice of downstream variables, but not upstream ones**.

<figure><img src="https://p.ipic.vip/fzesun.png" alt="Untitled"><figcaption></figcaption></figure>

The problem is that if $$J$$ and $$M$$ are the evidence, the weight can be extremely low, and the evidence cannot affect the choice of $$A$$, $$B$$ and $$E$$. But if $$A$$ and $$B$$ are the evidence, the simulation is conditioned on the evidence all the way down to $$J$$.

<figure><img src="https://p.ipic.vip/v36e3q.png" alt="Screenshot 2023-05-24 at 9.03.27 PM"><figcaption></figcaption></figure>

A simpler example is the one above. The simulation looks like:

```
F A Weight
-f +a 0.01
-f +a 0.01
-f +a 0.01
-f +a 0.01
-f +a 0.01
......
+f +a 0.99
```

**Gibbs Sampling**

**Step 1** Fix evidence

<figure><img src="https://p.ipic.vip/ysa6wf.png" alt="Screenshot 2023-05-24 at 9.11.38 PM"><figcaption></figcaption></figure>

**Step 2** Initialize other variables **randomly**

<figure><img src="https://p.ipic.vip/i1jnu6.png" alt="Screenshot 2023-05-24 at 9.12.15 PM"><figcaption></figcaption></figure>

**Step 3** Repeat

![Screenshot 2023-05-24 at 9.13.17 PM](https://p.ipic.vip/7e4m5s.png)

Choose a non-evidence variable $$X$$

Resample $$X$$ from $$P(X|\text{all other variables})$$

![Screenshot 2023-05-24 at 9.15.22 PM](https://p.ipic.vip/189vh2.png)
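The whole loop can be sketched on a hypothetical common-effect net $$R \rightarrow W \leftarrow S$$ with evidence $$W = +w$$: fix the evidence, initialize $$R$$ and $$S$$, then repeatedly resample each from its full conditional.

```python
import random

# Hypothetical CPTs: P(+r), P(+s), and P(+w | R, S).
p_r, p_s = 0.2, 0.3
p_w = {(True, True): 0.95, (True, False): 0.8,
       (False, True): 0.7, (False, False): 0.1}

def resample_r(s, rng):
    """Sample R from P(R | s, +w) ∝ P(R) * P(+w | R, s)."""
    a = p_r * p_w[(True, s)]
    b = (1 - p_r) * p_w[(False, s)]
    return rng.random() < a / (a + b)

def resample_s(r, rng):
    """Sample S from P(S | r, +w) ∝ P(S) * P(+w | r, S)."""
    a = p_s * p_w[(r, True)]
    b = (1 - p_s) * p_w[(r, False)]
    return rng.random() < a / (a + b)

rng = random.Random(0)
r, s = True, True          # arbitrary initialization of non-evidence variables
hits, total = 0, 0
for step in range(200_000):
    r = resample_r(s, rng)
    s = resample_s(r, rng)
    if step >= 1_000:      # discard burn-in samples
        hits += r
        total += 1
est = hits / total
# est approximates P(+r | +w) ≈ 0.43 for these numbers
```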

