Given $P(X_1)$, $P(x_t) = \sum_{x_{t-1}} P(x_t, x_{t-1}) = \sum_{x_{t-1}} P(x_t \mid x_{t-1})\, P(x_{t-1})$.
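As a concrete illustration, here is a minimal sketch of this "mini-forward" computation in Python. The two-state weather chain and its transition matrix are made-up values for illustration:

```python
import numpy as np

# Hypothetical 2-state weather chain: states 0 = sun, 1 = rain.
# T[i, j] = P(X_t = j | X_{t-1} = i)  (transition model, made-up numbers)
T = np.array([[0.9, 0.1],
              [0.3, 0.7]])

p = np.array([1.0, 0.0])  # P(X_1): start certain it is sunny

for t in range(10):
    # P(x_t) = sum over x_{t-1} of P(x_t | x_{t-1}) P(x_{t-1})
    p = p @ T
    print(t + 2, p)  # p is now P(X_{t+2})
```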
Stationary Distributions
For most chains, the influence of the initial distribution fades over time: the distribution we end up in is independent of where we started.
The distribution we end up with is called the stationary distribution $P_\infty$ of the chain. It satisfies $P_\infty(X) = P_{\infty+1}(X) = \sum_x P(X \mid x)\, P_\infty(x)$.
This system of equations can be solved directly, even though the equations are not linearly independent (we add the normalization constraint $\sum_x P_\infty(x) = 1$).
Alternatively, we can simulate the chain for a long (ideally infinite) time, which has lower resource requirements.
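A sketch of both approaches, reusing the hypothetical transition matrix from above: solving the (augmented) linear system directly, and simulating until the belief stops changing:

```python
import numpy as np

T = np.array([[0.9, 0.1],
              [0.3, 0.7]])

# Direct solve: P_inf = P_inf @ T together with sum(P_inf) = 1.
# (T^T - I) p = 0 is rank-deficient, so we augment with the normalization row.
A = np.vstack([T.T - np.eye(2), np.ones(2)])
b = np.array([0.0, 0.0, 1.0])
p_inf, *_ = np.linalg.lstsq(A, b, rcond=None)
print("solved:", p_inf)      # -> approximately [0.75, 0.25] for this matrix

# Simulation: repeatedly apply the transition model until convergence.
p = np.array([1.0, 0.0])
for _ in range(1000):
    p = p @ T
print("simulated:", p)       # converges to the same stationary distribution
```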
Hidden Markov Models
Markov chains on their own are not very useful for most agents: we need observations to update our beliefs. If we want to infer the next state, we need an observation of the current state.
Hidden Markov Models (HMMs) are based on a Markov chain over states $X$. At each time step, the model emits an output (evidence).
An HMM is defined by:
Initial distribution: $P(X_1)$
Transitions: $P(X_t \mid X_{t-1})$
Emissions: $P(E_t \mid X_t)$
Sensor Markov Assumption: $P(E_t \mid X_{0:t}, E_{0:t-1}) = P(E_t \mid X_t)$; $P(E_t \mid X_t)$ is also called the observation model.
State $X_n$ is independent of **all past states and all past evidence** ($X_{0:n-2}$, $E_{0:n-1}$) **given the previous state** $X_{n-1}$.
Evidence is independent of all past states and all past evidence given the current state.
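To make the definition concrete, here is a minimal sketch of an HMM's components in Python (a hypothetical two-state weather model with binary "umbrella" evidence; all numbers are made up):

```python
import numpy as np

# States: 0 = sun, 1 = rain.  Evidence: 0 = no umbrella, 1 = umbrella.
P_X1 = np.array([0.5, 0.5])          # initial distribution P(X_1)
T = np.array([[0.9, 0.1],            # transitions P(X_t | X_{t-1})
              [0.3, 0.7]])
E = np.array([[0.8, 0.2],            # emissions P(E_t | X_t):
              [0.1, 0.9]])           # E[x, e] = P(E_t = e | X_t = x)
```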
Example: Named Entity Recognition (NER)
| Observation | John | Smith | works | at | OpenAI |
|---|---|---|---|---|---|
| Hidden state | B-PER | I-PER | O | O | B-ORG |
Here, "John" is an observation and B-PER is a hidden state.
Inference
Classical inference problems:
Filtering (inferring the present): the objective is to determine $B_t(X) = P(X_t \mid e_{1:t})$ using the evidence available at each time step.
Prediction: given the evidence gathered so far, the goal is to determine the distribution over a future state, $P(X_{t+k} \mid e_{1:t})$ where $k \ge 0$.
Most Likely Hidden Path (Viterbi alignment): $\arg\max_{x_{1:t}} P(x_{1:t} \mid e_{1:t})$
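As an illustration of the last problem, here is a minimal Viterbi sketch, assuming the hypothetical umbrella model above. It tracks, for each state, the probability of the best path ending there, plus back-pointers to recover the path:

```python
import numpy as np

def viterbi(P_X1, T, E, evidence):
    """Most likely hidden sequence: argmax over x_{1:t} of P(x_{1:t} | e_{1:t})."""
    # m[x] = probability of the best path ending in state x at the current step.
    m = P_X1 * E[:, evidence[0]]
    backptr = []
    for e in evidence[1:]:
        # scores[i, j] = m[i] * P(X' = j | X = i) * P(e | X' = j)
        scores = m[:, None] * T * E[:, e][None, :]
        backptr.append(scores.argmax(axis=0))
        m = scores.max(axis=0)
    # Walk the back-pointers to recover the best path.
    path = [int(m.argmax())]
    for bp in reversed(backptr):
        path.append(int(bp[path[-1]]))
    return path[::-1]

# Umbrella seen on days 1 and 2, not on day 3 (model from above).
print(viterbi(np.array([0.5, 0.5]),
              np.array([[0.9, 0.1], [0.3, 0.7]]),
              np.array([[0.8, 0.2], [0.1, 0.9]]),
              [1, 1, 0]))   # -> e.g. [1, 1, 0] (rain, rain, sun)
```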
Prediction
The job of prediction is rather simple. We start from the current belief $B_t(X) = P(X_t \mid e_{1:t})$ and push it through the transition model, $P(X_{t+1} \mid e_{1:t}) = \sum_{x_t} P(X_{t+1} \mid x_t)\, P(x_t \mid e_{1:t})$, repeating $k$ times for a $k$-step prediction.
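A sketch under the same hypothetical model: given a filtered belief, each prediction step is one pass through the transition model, so $k$-step prediction is $k$ such passes:

```python
import numpy as np

def predict(belief, T, k):
    """P(X_{t+k} | e_{1:t}) from B_t(X), pushing the belief through T k times."""
    for _ in range(k):
        belief = belief @ T
    return belief

B_t = np.array([0.3, 0.7])   # some filtered belief P(X_t | e_{1:t})
print(predict(B_t, np.array([[0.9, 0.1], [0.3, 0.7]]), k=5))
```

For most chains, as $k$ grows this prediction converges to the stationary distribution from earlier: evidence becomes less informative the further ahead we look.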
Sometimes $|X|$ is too big to use exact inference; it may even be too big to store $B(X)$ (e.g., when $X$ is continuous).
The solution is to use approximate inference.
We just track samples of $X$, not all values. The samples are called particles. Time per step is linear in the number of samples.
Our representation of $P(X)$ is now a list of $N$ particles (samples). Generally, $N \ll |X|$.
Once we have sampled an initial list of particles, the simulation takes a similar form to the forward algorithm, with a time-elapse update followed by an observation update at each time step.
Each particle is moved by sampling its next position from the transition model
$x' = \text{sample}(P(X' \mid x))$
This is like prior sampling: the samples' frequencies reflect the transition probabilities. With enough samples, the approximation is consistent.
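With a discrete state space, this step is a single draw per particle from the corresponding row of the hypothetical transition matrix `T` from above, for example:

```python
import numpy as np

rng = np.random.default_rng(0)
T = np.array([[0.9, 0.1],
              [0.3, 0.7]])

particles = [0, 0, 1, 0, 1]  # current particle states
# Time elapse: move each particle by sampling x' ~ P(X' | x).
particles = [rng.choice(len(T), p=T[x]) for x in particles]
```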
During the observation update for particle filtering, we use the sensor model $P(E \mid X)$ to weight each particle by the probability of the observed evidence given the particle's state.
Calculate the weights of all particles as described above.
Calculate the total weight for each state.
If the sum of all weights across all states is 0, reinitialize all particles.
Else, normalize the distribution of total weights over states and resample your list of particles from this distribution.
Rather than tracking weighted samples, we resample: we sample $N$ times from our weighted sample distribution. This is equivalent to renormalizing the distribution.
The reason we resample is that some particles' weights become fairly low and contribute little to our reasoning.
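Putting the pieces together, here is a minimal sketch of one full particle-filter step (time elapse, weighting, resampling), reusing the hypothetical `T` and `E` matrices from above:

```python
import numpy as np

def particle_filter_step(particles, e, T, E, rng):
    """One particle-filter step: time elapse, weight by evidence, resample."""
    n = len(particles)
    # 1. Time elapse: x' ~ P(X' | x) for each particle.
    particles = np.array([rng.choice(len(T), p=T[x]) for x in particles])
    # 2. Observation update: weight each particle by P(e | x').
    weights = E[particles, e]
    if weights.sum() == 0:
        # All particles are inconsistent with the evidence: reinitialize.
        return rng.integers(len(T), size=n)
    # 3. Resample: draw n new particles from the normalized weight distribution.
    probs = weights / weights.sum()
    return rng.choice(particles, size=n, p=probs)

rng = np.random.default_rng(0)
T = np.array([[0.9, 0.1], [0.3, 0.7]])
E = np.array([[0.8, 0.2], [0.1, 0.9]])
particles = rng.integers(2, size=200)   # uniform initialization
particles = particle_filter_step(particles, e=1, T=T, E=E, rng=rng)
print(np.bincount(particles, minlength=2) / len(particles))  # approx B(X)
```

Resampling particles in proportion to their individual weights is equivalent to the per-state procedure above, since all particles in the same state share the same weight.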