Reasoning over Time or Space

Often, we want to reason about a sequence of observations (a changing world).

Markov (Transition) Models

Markov Assumption: the current state depends only on a finite, fixed number of previous states.

The value of $X$ at a given time is called the state.

Structure: chain


$X_1$ is a state of the world.

Two distributions dictate how the chain behaves: $P(X_1)$ and $P(X_t|X_{t-1})$.

The parameters are called transition probabilities or dynamics; they specify how the state evolves over time (together with the initial state probabilities).

Stationarity Assumption: the transition probabilities are the same at all times:

$P(X_i|X_{i-1})=P(X_j|X_{j-1}) \text{ where } i \neq j$

Each time step depends only on the previous one. This is called the (first-order) Markov property.

Example

  • States: $X \in \{\text{rain},\text{sun}\}$

  • Initial distribution: 1.0 $\text{sun}$

  • CPT $P(X_t|X_{t-1})$:

| $X_{t-1}$ | $X_t$ | $P(X_t \mid X_{t-1})$ |
| :---: | :---: | :---: |
| $\text{sun}$ | $\text{sun}$ | 0.9 |
| $\text{sun}$ | $\text{rain}$ | 0.1 |
| $\text{rain}$ | $\text{sun}$ | 0.3 |
| $\text{rain}$ | $\text{rain}$ | 0.7 |

Given $P(X_1)$, $P(x_t)=\sum_{x_{t-1}} P(x_t, x_{t-1}) = \sum_{x_{t-1}} P(x_t|x_{t-1})P(x_{t-1})$.
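As a quick illustration, here is a minimal Python sketch of this update applied to the sun/rain chain above; the numbers come from the example CPT, and the variable names are my own.

```python
# Repeatedly push the state distribution through the transition model:
# P(x_t) = sum over x_{t-1} of P(x_t | x_{t-1}) * P(x_{t-1})

init = {"sun": 1.0, "rain": 0.0}       # P(X_1): "1.0 sun"
trans = {                              # P(X_t | X_{t-1}) from the CPT above
    "sun":  {"sun": 0.9, "rain": 0.1},
    "rain": {"sun": 0.3, "rain": 0.7},
}

def forward_step(dist):
    """One application of the update above."""
    return {x: sum(trans[prev][x] * p for prev, p in dist.items()) for x in dist}

dist = init
for t in range(2, 7):
    dist = forward_step(dist)
    print(t, dist)                     # P(X_2), P(X_3), ... drift toward (0.75, 0.25)
```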

Stationary Distributions

For most chains, the influence of the initial distribution fades over time, and the distribution we end up in is independent of the initial distribution.

The distribution we end up with is called the stationary distribution $P_{\infty}$ of the chain. It satisfies $P_{\infty}(X)=P_{\infty+1}(X)=\sum_{x}P(X|x)P_{\infty}(x)$.

What's $P(X)$ at time $t=\infty$? The equation

$P_{\infty}(X)=P_{\infty+1}(X)=\sum_{x} P(X|x)P_{\infty}(x)$

is actually a linear system:

$P_{\infty}(\text{sun}) = P(\text{sun}|\text{sun}) P_{\infty}(\text{sun}) + P(\text{sun}|\text{rain}) P_{\infty}(\text{rain})$
$P_{\infty}(\text{rain}) = P(\text{rain}|\text{sun}) P_{\infty}(\text{sun}) + P(\text{rain}|\text{rain}) P_{\infty}(\text{rain})$

This system can be solved even though the two equations are not linearly independent; we just add the normalization constraint $P_{\infty}(\text{sun}) + P_{\infty}(\text{rain}) = 1$.

Alternatively, we can simulate the chain for a long (ideally infinite) time, which has better resource requirements.
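Here is a small sketch of both routes for the sun/rain chain, using NumPy (my choice, not part of the original notes): solving the linear system directly, and running the chain forward until it settles.

```python
import numpy as np

# T[i, j] = P(X_t = j | X_{t-1} = i), states ordered [sun, rain] as in the CPT above.
T = np.array([[0.9, 0.1],
              [0.3, 0.7]])

# Route 1: solve pi = pi T together with the normalization constraint sum(pi) = 1.
A = np.vstack([T.T - np.eye(2), np.ones((1, 2))])
b = np.array([0.0, 0.0, 1.0])
pi = np.linalg.lstsq(A, b, rcond=None)[0]
print("linear system:", pi)            # -> approximately [0.75, 0.25]

# Route 2: simulate the chain forward from the initial distribution for many steps.
dist = np.array([1.0, 0.0])            # initial distribution: 1.0 sun
for _ in range(100):
    dist = dist @ T
print("simulation:   ", dist)          # converges to the same stationary distribution
```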

Hidden Markov Models

Markov chains alone are not so useful for most agents: we need observations to update our beliefs. If we want to infer the next state, we need an observation related to the current state.

Hidden Markov Models (HMMs) are based on a Markov chain over states $X$. At each time step, we observe an output (the evidence).


An HMM is defined by:

  • Initial distribution: $P(X_1)$

  • Transitions: $P(X_t|X_{t-1})$

  • Emissions: $P(E_t|X_t)$

Sensor Markov Assumption: $P(E_t|X_{0:t}, E_{0:t-1})=P(E_t|X_t)$. $P(E_t|X_t)$ is also called the observation model.

State $X_n$ is independent of **all past states and all past evidence ($X_{0:n-2}$ and $E_{0:n-1}$) given the previous state** $X_{n-1}$.

Evidence is independent of all past states and all past evidence given the current state.
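As a concrete (hypothetical) illustration of these three ingredients, here is one way to write them down in Python; the weather states reuse the earlier example, while the umbrella evidence variable and its numbers are my own assumptions, not from the notes.

```python
# The three distributions that define an HMM, written as nested dicts.

initial = {"sun": 0.5, "rain": 0.5}            # P(X_1), illustrative values

transition = {                                 # P(X_t | X_{t-1})
    "sun":  {"sun": 0.9, "rain": 0.1},
    "rain": {"sun": 0.3, "rain": 0.7},
}

emission = {                                   # P(E_t | X_t), illustrative values
    "sun":  {"umbrella": 0.1, "no_umbrella": 0.9},
    "rain": {"umbrella": 0.8, "no_umbrella": 0.2},
}
```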

Example of NER

John Smith works at OpenAI

B-PER I-PER O O B-ORG

Here, John is an observation and B-PER is a hidden state.

Inference

Classical inference problems:

  • Filtering (inferring the present): the objective is to determine $B_t(X) = P(X_t | e_{1:t})$ based on the evidence available at each time step.

  • Prediction: given the evidence gathered so far, the goal is to determine a future state, $P(X_{t+k} | e_{1:t})$ where $k \geq 0$.

  • Most likely hidden path (Viterbi alignment): $\text{argmax}_{X_{1:t}}P(X_{1:t}|e_{1:t})$

Prediction

The job of prediction is rather simple. We start with the base case:

$P(X_2|e_{1})=\sum_{x_1}P(X_2,x_1|e_{1})=\sum_{x_1}\frac{P(X_2,x_1,e_1)}{P(e_1)}=\sum_{x_1} \frac{P(X_2|x_1)P(x_1)P(e_1|x_1)}{P(e_1)}=\sum_{x_1}P(X_2|x_1)P(x_1|e_1)$
More generally: $P(X_{t+1}|e_{1:t})=\sum_{x_t} P(X_{t+1}|x_t)P(x_t|e_{1:t})$
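Here is a minimal sketch of k-step prediction, reusing the weather transition table; the starting belief values are made up for illustration.

```python
transition = {                                 # P(X_{t+1} | X_t)
    "sun":  {"sun": 0.9, "rain": 0.1},
    "rain": {"sun": 0.3, "rain": 0.7},
}

def predict(belief):
    """P(X_{t+1} | e_{1:t}) = sum over x_t of P(X_{t+1} | x_t) * P(x_t | e_{1:t})."""
    return {nxt: sum(transition[prev][nxt] * p for prev, p in belief.items())
            for nxt in belief}

belief = {"sun": 0.6, "rain": 0.4}             # P(X_t | e_{1:t}), illustrative numbers
for k in range(1, 4):                          # predict 1, 2, 3 steps ahead
    belief = predict(belief)
    print(k, belief)
```

Note that this is the same push through the transitions as for a plain Markov chain; the only difference is that we start from the filtered belief $P(X_t|e_{1:t})$, and with no new evidence the prediction drifts back toward the stationary distribution.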

Filtering

The idea is to start with $P(X_1)$ and derive $B_t$ in terms of $B_{t-1}$.

There are two steps: Passage of Time and Observation


Passage of Time

The idea is to introduce the previous state as a new variable and then sum it out.

The base case is to derive $P(X_2)$ from $P(X_1)$:

$P(x_2) = \sum_{x_1} P(x_1,x_2)=\sum_{x_1}P(x_1)P(x_2|x_1)$

The general case:

Assume we have the current belief and the transition probabilities:

$B(X_t) = P(X_{t}|e_{1:t})$
$P(X_{t+1}|X_t)$

Then after one time step passes:

$P(X_{t+1}|e_{1:t}) = \sum_{x_t} P(X_{t+1},x_t|e_{1:t})$
$P(X_{t+1}|e_{1:t}) = \sum_{x_t}P(X_{t+1}|x_t, e_{1:t})P(x_t|e_{1:t})$

Given the current state, the next state is independent of all other nodes in the Bayes Net. Therefore,

$P(X_{t+1}|X_t, e_{1:t}) = P(X_{t+1}|X_t)$

More compactly:

$B'(X_{t+1}) = \sum_{x_t} P(X_{t+1}|x_t) B(x_t)$

The basic idea for this formula is that the beliefs are pushed through the transitions.
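For a finite state space this push is just a vector-matrix product. A small NumPy sketch (my own formulation, using the weather transition table and an illustrative belief):

```python
import numpy as np

T = np.array([[0.9, 0.1],      # P(X_{t+1} | X_t), rows = current state [sun, rain]
              [0.3, 0.7]])

def elapse_time(B):
    """B'(X_{t+1}) = sum over x_t of P(X_{t+1} | x_t) * B(x_t), as a matrix product."""
    return B @ T

B = np.array([0.6, 0.4])       # B(X_t) = P(X_t | e_{1:t}), illustrative values
B_prime = elapse_time(B)       # beliefs pushed through the transitions
print(B_prime)                 # -> [0.66, 0.34]
```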

Observation

The base case: given $P(X_1)$ and $P(E|X_1)$, how can we derive $P(X_1|e_1)$?

$P(x_1|e_1) = \frac{P(x_1, e_1)}{P(e_1)}$

Here, $P(e_1)$ is a constant (it does not depend on $x_1$). To see why, write the distributions as a vector and a matrix:

$P(X_1) = \begin{bmatrix}P(x_1) \\ P(x_2) \\ \vdots \\ P(x_m)\end{bmatrix}$

$P(E|X_1) = \begin{bmatrix}P(e_1|x_1)&P(e_1|x_2)&\dots&P(e_1|x_m) \\ P(e_2|x_1)&P(e_2|x_2)&\dots&P(e_2|x_m) \\ \vdots&\vdots&\ddots&\vdots \\ P(e_n|x_1)&P(e_n|x_2)&\dots&P(e_n|x_m)\end{bmatrix}$

$P(E|X_1)\cdot P(X_1) = \begin{bmatrix}P(e_1) \\ P(e_2) \\ \vdots \\ P(e_n)\end{bmatrix}$

Each entry of this product, such as $P(e_1)=\sum_{x}P(e_1|x)P(x)$, is a single number that does not depend on $x_1$.

Thus, $P(x_1|e_1)$ is proportional to $P(x_1, e_1)$ in terms of $x_1$:

$P(x_1|e_1) \propto P(x_1)P(e_1|x_1)$

And some normalization is needed.

For the general case, assume we have the current belief $B'(X_{t+1})=P(X_{t+1}|e_{1:t})$ and the evidence model $P(e_{t+1}|X_{t+1})$. Then, after the evidence comes in:

$P(X_{t+1}|e_{1:t+1})=\frac{P(X_{t+1},e_{t+1}|e_{1:t})}{P(e_{t+1}|e_{1:t})}$

This is proportional to $P(X_{t+1}, e_{t+1}|e_{1:t})$ in terms of $X_{t+1}$:

$P(X_{t+1}, e_{t+1}|e_{1:t})=P(e_{t+1}|e_{1:t}, X_{t+1})P(X_{t+1}|e_{1:t})$

Given $X_{t+1}$, $e_{t+1}$ is conditionally independent of $e_{1:t}$ (for example, given $X_4$, $E_4$ is independent of $E_{1:3}$), so $P(e_{t+1}|e_{1:t}, X_{t+1})=P(e_{t+1}|X_{t+1})$.

Or compactly, $B(X_{t+1}) \propto_{X_{t+1}} P(e_{t+1}|X_{t+1}) B'(X_{t+1})$.

As we get observations, beliefs get reweighted and uncertainty "decreases".
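Continuing the NumPy sketch, here is a small observation update; the umbrella likelihoods are illustrative assumptions, not from the notes:

```python
import numpy as np

# P(e | X) for the observed evidence "umbrella", states [sun, rain]; illustrative numbers.
p_umbrella_given_state = np.array([0.1, 0.8])

def observe(B_prime, likelihood):
    """B(X_{t+1}) proportional to P(e_{t+1} | X_{t+1}) * B'(X_{t+1}), then normalize."""
    weighted = likelihood * B_prime
    return weighted / weighted.sum()

B_prime = np.array([0.66, 0.34])                 # belief after the time update
B = observe(B_prime, p_umbrella_given_state)     # reweighted by the observed evidence
print(B)                                         # -> roughly [0.195, 0.805]
```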


To summarize the above two phases:

Every time step, we start with the current $P(X|\text{evidence})$.

We then update for time:

$P(x_{t}|e_{1:t-1})=\sum_{x_{t-1}}P(x_{t-1}|e_{1:t-1})P(x_t|x_{t-1})$

We update for evidence:

$P(x_t|e_{1:t}) \propto_{x_t} P(x_t|e_{1:t-1})\cdot P(e_t|x_t)$

This is our updated belief $B_t(X)=P(X_t|e_{1:t})$.
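Putting the two updates together gives the filtering loop. A sketch under the same assumed setup (the emission table and evidence sequence are illustrative):

```python
import numpy as np

T = np.array([[0.9, 0.1],                    # P(x_t | x_{t-1}), states [sun, rain]
              [0.3, 0.7]])
E = {"umbrella":    np.array([0.1, 0.8]),    # P(e_t | x_t), illustrative values
     "no_umbrella": np.array([0.9, 0.2])}

def filter_step(B, evidence):
    """One time step: update for time, then update for evidence, then normalize."""
    B_prime = B @ T                          # P(x_t | e_{1:t-1})
    B_new = E[evidence] * B_prime            # proportional to P(x_t | e_{1:t})
    return B_new / B_new.sum()

B = np.array([0.5, 0.5])                     # belief before any evidence
for e in ["umbrella", "umbrella", "no_umbrella"]:
    B = filter_step(B, e)
    print(e, B)                              # B_t(X) = P(X_t | e_{1:t})
```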

Particle Filtering

Sometimes $|X|$ is too big for exact inference. $|X|$ may even be too big to store $B(X)$ (e.g., when $X$ is continuous).

The solution is to use approximate inference.

We just track samples of $X$, not all values. The samples are called particles. Time per step is linear in the number of samples.

Our representation of $P(X)$ is now a list of $N$ particles (samples). Generally, $N \ll |X|$.

Once we have sampled an initial list of particles, the simulation takes a form similar to the forward algorithm, with a time-elapse update followed by an observation update at each time step.

Each particle is moved by sampling its next position from the transition model

$x' = \text{sample}(P(X'|x))$

This is like prior sampling: the samples' frequencies reflect the transition probabilities. With enough samples, the approximation is consistent.

During the observation update for particle filtering, we use the sensor model $P(E|X)$ to weight each particle according to the probability dictated by the observed evidence and the particle's state.

  1. Calculate the weights of all particles as described above.

  2. Calculate the total weight for each state.

  3. If the sum of all weights across all states is 0, reinitialize all particles.

  4. Else, normalize the distribution of total weights over states and resample your list of particles from this distribution.

Rather than tracking weighted samples, we resample: we sample $N$ times from our weighted sample distribution. This is equivalent to renormalizing the distribution.

The reason we resample is that some particles' weights are fairly low and do not contribute much to our reasoning.
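Here is a compact sketch of one particle-filtering step under the same assumed weather/umbrella setup (all numbers illustrative): move each particle with the transition model, weight it with the sensor model, and resample.

```python
import random

transition = {"sun":  {"sun": 0.9, "rain": 0.1},               # P(X' | x)
              "rain": {"sun": 0.3, "rain": 0.7}}
emission   = {"sun":  {"umbrella": 0.1, "no_umbrella": 0.9},   # P(e | x), illustrative
              "rain": {"umbrella": 0.8, "no_umbrella": 0.2}}

def particle_filter_step(particles, evidence):
    # Time elapse: move each particle by sampling from the transition model.
    particles = [random.choices(list(transition[p]),
                                weights=list(transition[p].values()))[0]
                 for p in particles]
    # Observation: weight each particle by P(evidence | particle's state).
    weights = [emission[p][evidence] for p in particles]
    if sum(weights) == 0:                      # all weights zero: reinitialize
        return [random.choice(["sun", "rain"]) for _ in particles]
    # Resample N particles from the weighted distribution.
    return random.choices(particles, weights=weights, k=len(particles))

particles = [random.choice(["sun", "rain"]) for _ in range(200)]   # N = 200 particles
for e in ["umbrella", "umbrella", "no_umbrella"]:
    particles = particle_filter_step(particles, e)
    print(e, "P(rain) approx", particles.count("rain") / len(particles))
```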

