Reasoning over Time or Space
Often, we want to reason about a sequence of observations (a changing world).
Markov (Transition) Models
Markov Assumption: the current state depends only on a finite, fixed number of previous states.
Value of X at a given time is called the state
Structure: chain

X1 is a state of the world.
Two functions dictate how the chain behaves: P(X1) and P(Xt∣Xt−1)
Parameters are called transition probabilities or dynamics; they specify how the state evolves over time (along with the initial state probabilities).
Stationarity Assumption: the transition probabilities are the same at all times.
Each time step depends only on the previous one. This is called the (first-order) Markov property.
Example

State X={rain,sun}
Initial distribution 1.0 sun
CPT P(Xt∣Xt−1)
| Xt−1 | Xt | P(Xt∣Xt−1) |
| :--: | :--: | :--: |
| sun | sun | 0.9 |
| sun | rain | 0.1 |
| rain | sun | 0.3 |
| rain | rain | 0.7 |
Given P(X1), P(xt)=∑xt−1P(xt,xt−1)=∑xt−1P(xt∣xt−1)P(xt−1).
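As a quick check of this recurrence, here is a minimal sketch in Python that propagates the rain/sun chain forward one step (the state names and probabilities come from the CPT above):

```python
# Transition model P(X_t | X_{t-1}) from the CPT above.
T = {"sun": {"sun": 0.9, "rain": 0.1},
     "rain": {"sun": 0.3, "rain": 0.7}}

def step(p_prev):
    """One application of P(x_t) = sum over x_{t-1} of P(x_t | x_{t-1}) P(x_{t-1})."""
    return {x: sum(T[xp][x] * p_prev[xp] for xp in p_prev) for x in p_prev}

p1 = {"sun": 1.0, "rain": 0.0}  # initial distribution: 1.0 sun
p2 = step(p1)
print(p2)  # P(X2): {'sun': 0.9, 'rain': 0.1}
```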
Stationary Distributions
For most chains, the influence of the initial distribution fades over time: the distribution we end up in is independent of where we started.
The distribution we end up with is called the stationary distribution P∞ of the chain. It satisfies P∞(X)=P∞+1(X)=∑xP(X∣x)P∞(x).
What's P(X) at time t=∞?
The stationarity condition is actually a linear system. It can be solved directly together with the normalization constraint ∑xP∞(x)=1 (the stationarity equations alone are not linearly independent).
Alternatively, we can run a simulation for a long (ideally infinite) time, which often has better resource requirements.
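Both approaches can be sketched for the rain/sun chain above. Solving the linear system by hand gives P∞(sun)=0.75; repeated simulation converges to the same answer:

```python
# Stationary distribution of the rain/sun chain, two ways.
# Transition probabilities taken from the CPT above.
T = {"sun": {"sun": 0.9, "rain": 0.1},
     "rain": {"sun": 0.3, "rain": 0.7}}

# 1) Linear system: P∞(sun) = 0.9·P∞(sun) + 0.3·P∞(rain),
#    with P∞(sun) + P∞(rain) = 1. Rearranging:
#    0.1·P∞(sun) = 0.3·(1 − P∞(sun))  =>  P∞(sun) = 0.3 / 0.4 = 0.75
p_sun = 0.3 / (0.1 + 0.3)

# 2) Simulation: iterate the transition update from the initial distribution.
p = {"sun": 1.0, "rain": 0.0}
for _ in range(100):
    p = {x: sum(T[xp][x] * p[xp] for xp in p) for x in p}

print(p_sun, p["sun"])  # both ≈ 0.75
```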
Hidden Markov Models
Markov chains alone are not so useful for most agents: the agent does not observe the state directly, so it needs observations to update its beliefs about the current state.
Hidden Markov Models (HMMs) are based on a Markov chain over states X. At each time step, the model observes outputs (Evidence).

An HMM is defined by:
Initial distribution P(X1)
Transitions: P(Xt∣Xt−1)
Emissions: P(Et∣Xt)
Sensor Markov Assumption: P(Et∣X0:t,E0:t−1)=P(Et∣Xt). P(Et∣Xt) is also called the observation model.
State Xn is independent of all past states and all past evidence (X0:n−2 and E0:n−1) given the previous state Xn−1.
Evidence is independent of all past states and all past evidence given the current state.
Example of NER
John Smith works at OpenAI
B-PER I-PER O O B-ORG
Here, John is an observation and B-PER is a hidden state.
Inference
Classical inference problems:
Filtering (Inferring the present): The objective is to determine Bt(X)=P(Xt∣e1:t) based on the evidence available at each time.
Prediction: Given the evidence gathered, the goal is to determine the future state, P(Xt+k∣e1:t) where k≥0.
Most Likely Hidden Path (Viterbi alignment): argmaxx1:tP(x1:t∣e1:t)
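The most-likely-path problem can be sketched with the standard Viterbi recursion. This example reuses the rain/sun transition model from above; the umbrella emission model and the uniform prior are illustrative assumptions, not from the notes:

```python
# Viterbi: most likely hidden state sequence given the evidence.
# Transition model from the rain/sun CPT above; the umbrella emission
# model and prior below are made up for illustration.
T = {"sun": {"sun": 0.9, "rain": 0.1},
     "rain": {"sun": 0.3, "rain": 0.7}}
E = {"sun": {"umbrella": 0.2, "no_umbrella": 0.8},
     "rain": {"umbrella": 0.9, "no_umbrella": 0.1}}
prior = {"sun": 0.5, "rain": 0.5}

def viterbi(evidence):
    # m[x] = probability of the best path ending in state x
    m = {x: prior[x] * E[x][evidence[0]] for x in prior}
    back = []  # back-pointers for path reconstruction
    for e in evidence[1:]:
        bp, m_next = {}, {}
        for x in T:
            best = max(T, key=lambda xp: m[xp] * T[xp][x])
            bp[x] = best
            m_next[x] = m[best] * T[best][x] * E[x][e]
        back.append(bp)
        m = m_next
    # Follow the back-pointers from the best final state.
    last = max(m, key=m.get)
    path = [last]
    for bp in reversed(back):
        path.append(bp[path[-1]])
    return list(reversed(path))

print(viterbi(["umbrella", "umbrella", "no_umbrella"]))
# -> ['rain', 'rain', 'sun']
```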
Prediction
The job of prediction is rather simple: starting from the filtered belief P(Xt∣e1:t), we repeatedly apply the transition model with no new evidence, P(Xt+k+1∣e1:t)=∑xt+kP(Xt+k+1∣xt+k)P(xt+k∣e1:t).
Filtering
The idea is to start with P(X1) and derive Bt in terms of Bt−1.
There are two steps: Passage of Time and Observation

Passage of Time
The idea is to introduce a new variable and marginalize it out.
The base case is to derive P(X2) from P(X1): P(x2)=∑x1P(x2∣x1)P(x1).
The general case: assume we have the current belief B(Xt)=P(Xt∣e1:t) and the transition probabilities P(Xt+1∣Xt).
Then, after one time step passes:
P(Xt+1∣e1:t)=∑xtP(Xt+1,xt∣e1:t)=∑xtP(Xt+1∣xt,e1:t)P(xt∣e1:t)
Given the current state, the next state is independent of all other nodes in the Bayes net. Therefore,
P(Xt+1∣e1:t)=∑xtP(Xt+1∣xt)P(xt∣e1:t)
More compactly: B′(Xt+1)=∑xtP(Xt+1∣xt)B(xt)
The basic idea for this formula is that beliefs are pushed through the transitions.
Observation
The base case is to derive P(X1∣e1) from P(X1) and P(e1∣X1):
P(x1∣e1)=P(x1,e1)/P(e1)=P(e1∣x1)P(x1)/P(e1)
Here, P(e1) is a constant, because P(e1)=∑x1P(e1∣x1)P(x1) does not depend on x1.
Thus, P(x1∣e1) is proportional to P(x1,e1) as a function of x1, and some normalization is needed at the end.
The general case: assume we have the current belief B′(Xt+1)=P(Xt+1∣e1:t) and the evidence model P(et+1∣Xt+1). Then, after the evidence comes in:
P(Xt+1∣e1:t+1)=P(Xt+1,et+1∣e1:t)/P(et+1∣e1:t)
which is proportional to P(Xt+1,et+1∣e1:t) as a function of Xt+1.
Given Xt+1, Et+1 is conditionally independent of E1:t, so P(et+1∣Xt+1,e1:t)=P(et+1∣Xt+1).
Or compactly: B(Xt+1)∝Xt+1P(et+1∣Xt+1)B′(Xt+1).
As we get observations, beliefs get reweighted and uncertainty "decreases".

To summarize the above two phases:
Every time step, we start with the current belief P(Xt∣e1:t).
We then update for time: B′(Xt+1)=∑xtP(Xt+1∣xt)B(xt)
We update for evidence: B(Xt+1)∝Xt+1P(et+1∣Xt+1)B′(Xt+1)
This is our updated belief Bt+1(X)=P(Xt+1∣e1:t+1).
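The two updates together give the forward algorithm. Here is a minimal sketch, reusing the rain/sun transition model from above (the umbrella emission model is an illustrative assumption, not from the notes):

```python
# Forward algorithm: alternate time-elapse and observation updates.
T = {"sun": {"sun": 0.9, "rain": 0.1},
     "rain": {"sun": 0.3, "rain": 0.7}}
# Illustrative emission model P(E_t | X_t).
E = {"sun": {"umbrella": 0.2, "no_umbrella": 0.8},
     "rain": {"umbrella": 0.9, "no_umbrella": 0.1}}

def filter_beliefs(prior, evidence):
    B = dict(prior)
    for e in evidence:
        # Passage of time: B'(x') = sum over x of P(x'|x) B(x)
        Bp = {x2: sum(T[x][x2] * B[x] for x in B) for x2 in B}
        # Observation: B(x') ∝ P(e|x') B'(x'), then normalize.
        B = {x2: E[x2][e] * Bp[x2] for x2 in Bp}
        z = sum(B.values())
        B = {x2: v / z for x2, v in B.items()}
    return B

belief = filter_beliefs({"sun": 1.0, "rain": 0.0}, ["umbrella", "umbrella"])
print(belief)  # two umbrella observations make rain the more likely state
```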
Particle Filtering
Sometimes ∣X∣ is too big to use exact inference. ∣X∣ may even be too big to store B(X) (e.g., when X is continuous).
The solution is to use approximate inference.
We just track samples of X, not all values. The samples are called particles, and time per step is linear in the number of samples.
Our representation of P(X) is now a list of N particles (samples). Generally, N ≪ ∣X∣.
Once we have sampled an initial list of particles, the simulation takes a similar form to the forward algorithm, with a time-elapse update followed by an observation update at each time step.
Each particle is moved by sampling its next position from the transition model
This is like prior sampling: the samples' frequencies reflect the transition probabilities. With enough samples, the approximation is consistent.
During the observation update for particle filtering, we use the sensor model P(E∣X) to weight each particle according to the probability the observed evidence assigns to that particle's state.
Calculate the weights of all particles as described above.
Calculate the total weight for each state.
If the sum of all weights across all states is 0, reinitialize all particles.
Else, normalize the distribution of total weights over states and resample your list of particles from this distribution.
Rather than tracking weighted samples, we resample: we sample N times from the weighted sample distribution, which is equivalent to renormalizing the distribution.
The reason we resample is that some particles' weights become very low and contribute little to our reasoning.
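The steps above can be sketched as follows (again reusing the rain/sun transition model; the umbrella emission model is an illustrative assumption):

```python
import random

# Transition and (illustrative) emission models for the rain/sun chain.
T = {"sun": {"sun": 0.9, "rain": 0.1},
     "rain": {"sun": 0.3, "rain": 0.7}}
E = {"sun": {"umbrella": 0.2, "no_umbrella": 0.8},
     "rain": {"umbrella": 0.9, "no_umbrella": 0.1}}

def particle_filter(evidence, n=5000, seed=0):
    rng = random.Random(seed)
    particles = ["sun"] * n  # initial distribution: 1.0 sun
    for e in evidence:
        # Time elapse: move each particle by sampling from P(X'|x).
        particles = [rng.choices(list(T[p]), weights=list(T[p].values()))[0]
                     for p in particles]
        # Observation: weight each particle by P(e|x).
        weights = [E[p][e] for p in particles]
        if sum(weights) == 0:
            # Every particle is inconsistent with the evidence: reinitialize.
            particles = [rng.choice(["sun", "rain"]) for _ in range(n)]
        else:
            # Resample n particles in proportion to the weights.
            particles = rng.choices(particles, weights=weights, k=n)
    # Read off the belief as particle frequencies.
    return {x: particles.count(x) / n for x in ("sun", "rain")}

belief = particle_filter(["umbrella", "umbrella"])
print(belief)  # approximates the exact filtered belief
```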
