Hidden Markov Model
It is a kind of directed graphical model.
The HMM is a generative probabilistic model, in which a sequence of observable variables X is generated by a sequence of internal hidden states Z. The hidden states cannot be observed directly. The transitions between hidden states are assumed to have the form of a (first-order) Markov chain. They can be specified by the start probability vector π and a transition probability matrix A. The emission probability of an observable can be any distribution with parameters θ conditioned on the current hidden state. The HMM is completely determined by π, A and θ.
There are three fundamental problems for HMMs:
- Given the model parameters and observed data, estimate the optimal sequence of hidden states.
- Given the model parameters and observed data, calculate the likelihood of the data.
- Given just the observed data, estimate the model parameters.
The first and second problems can be solved by the dynamic programming algorithms known as the Viterbi algorithm and the Forward-Backward algorithm, respectively. The last one can be solved by an iterative Expectation-Maximization (EM) algorithm, known as the Baum-Welch algorithm.
A hidden Markov model is a Markov chain for which the state is only partially observable.
Category | Usage | Mathematics | Application Field |
---|---|---|---|
Supervised Learning, Unsupervised Learning | Sequential Labeling | Markov Chain | NLP |
Markov models | System state is fully observable | System state is partially observable |
---|---|---|
System is autonomous | Markov chain | Hidden Markov model |
System is controlled | Markov decision process | Partially observable Markov decision process |
A Markov chain is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event.
A Markov chain is a stochastic process with the Markov property.
The term "Markov chain" refers to the sequence of random variables such a process moves through, with the Markov property defining serial dependence only between adjacent periods (as in a "chain").
For example, a three-state chain can be specified by a transition matrix such as

$$P = \begin{bmatrix} 0.9 & 0.075 & 0.025 \\ 0.15 & 0.8 & 0.05 \\ 0.25 & 0.25 & 0.5 \end{bmatrix}$$
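As a quick illustration, sampling from such a chain only ever looks at the current state's row of P. A minimal sketch (plain NumPy; the seed and starting state are arbitrary):

```python
import numpy as np

# The 3x3 transition matrix P from above: P[i, j] = P(next = j | current = i).
P = np.array([
    [0.90, 0.075, 0.025],
    [0.15, 0.80,  0.05],
    [0.25, 0.25,  0.50],
])

rng = np.random.default_rng(0)  # arbitrary seed

def sample_chain(P, start, length):
    """Sample a state sequence; each step depends only on the previous state."""
    states = [start]
    for _ in range(length - 1):
        states.append(int(rng.choice(len(P), p=P[states[-1]])))
    return states

print(sample_chain(P, start=0, length=10))
```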
The Hidden Markov Model
The urn and ball example
Probabilistic parameters of a hidden Markov model (example)
- X — states
- y — possible observations (the only visible part of an HMM!)
- a — state transition probabilities
- b — output probabilities
An HMM 𝝺 can be defined as the tuple (S, V, A, B, 𝞹):
- S: state set
$S = \{1, 2, 3, \dots, N\}$
- V: output sign set
$V = \{v_1, v_2, \dots, v_M\}$
- A: (state) transition matrix
$A_{N\times N} = [a_{ij}]$, $a_{ij} = P(q_t=j|q_{t-1}=i)$
- B: output sign probability distribution
$B = \{b_j(k)\}$, $b_j(k) = P(o_t = v_k|q_t = j)$
- 𝞹: initial state probability distribution
$\pi = \{\pi_i\}$, $\pi_i = P(q_1 = i)$
The hidden Markov process can be viewed as a doubly stochastic process:
- the first stochastic process is described by the state transition matrix and cannot be observed directly
- the second stochastic process is defined by the output probability matrix and produces the observable signs
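A minimal NumPy sketch of this double stochastic process (our own helper, not any particular library's API): the hidden walk is driven by A, and every visited state emits an observable sign according to B.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed

def sample_hmm(pi, A, B, T):
    """Generate (states, observations) of length T.
    pi: (N,) initial distribution; A: (N, N) transitions;
    B: (N, M) output probabilities, B[j, k] = P(o_t = v_k | q_t = j)."""
    states, obs = [], []
    q = int(rng.choice(len(pi), p=pi))                   # first process: hidden walk
    for _ in range(T):
        states.append(q)
        obs.append(int(rng.choice(B.shape[1], p=B[q])))  # second process: emission
        q = int(rng.choice(len(pi), p=A[q]))
    return states, obs
```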
- For a given HMM 𝝺 = (A, B, 𝞹) and an observation sequence O => calculate the observation probability P(O|𝝺) (i.e. the probability calculation problem)
- For a given HMM 𝝺 = (A, B, 𝞹) and an observation sequence O => find the most probable state transition sequence q (i.e. the prediction problem)
- How to obtain or update the model's parameters from an observation sequence O (i.e. the learning problem)
- based on the Maximum Likelihood principle: find the set of model parameters that maximizes P(O|𝝺)
Coin-flipping example (this will be used to demonstrate the following problems):
𝝺 = (S, V, A, B, 𝞹)
- S = {1, 2, 3}
- V = {H, T}
- A

| A | 1 | 2 | 3 |
|---|---|---|---|
| 1 | 0.90 | 0.05 | 0.05 |
| 2 | 0.45 | 0.10 | 0.45 |
| 3 | 0.45 | 0.45 | 0.10 |

- B

| B | 1 | 2 | 3 |
|---|---|---|---|
| H | 0.50 | 0.75 | 0.25 |
| T | 0.50 | 0.25 | 0.75 |

- 𝞹 = {1/3, 1/3, 1/3}
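One way to encode these parameters for the worked examples that follow (a sketch: states are 0-indexed, and B is stored as state × symbol, i.e. the table above transposed):

```python
import numpy as np

A = np.array([
    [0.90, 0.05, 0.05],
    [0.45, 0.10, 0.45],
    [0.45, 0.45, 0.10],
])
B = np.array([
    [0.50, 0.50],  # state 1: P(H), P(T)
    [0.75, 0.25],  # state 2
    [0.25, 0.75],  # state 3
])
pi = np.array([1/3, 1/3, 1/3])

H, T = 0, 1
O = [H, H, T]  # the observation sequence used below
```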
List all possible state transition sequences, then calculate:
- For the given 𝝺, calculate P(O)
- Calculate P(O, q) (the joint probability of O and q) for each sequence q
- Sum them up to get the observation sequence probability P(O)
Time complexity:
- multiplications: $(2T-1)N^T$
- additions: $N^T - 1$
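A brute-force sketch of this enumeration (reusing the pi, A, B, O arrays from the encoding above; exponential in T, for illustration only):

```python
from itertools import product

def brute_force_likelihood(pi, A, B, O):
    """P(O|λ) by summing P(O, q) over all N^T hidden paths q."""
    N, T = len(pi), len(O)
    total = 0.0
    for q in product(range(N), repeat=T):
        p = pi[q[0]] * B[q[0], O[0]]              # P(q_1) P(o_1 | q_1)
        for t in range(1, T):
            p *= A[q[t-1], q[t]] * B[q[t], O[t]]  # P(q_t | q_{t-1}) P(o_t | q_t)
        total += p
    return total

print(brute_force_likelihood(pi, A, B, O))  # ≈ 0.11953 for the coin example
```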
Using the idea of dynamic programming, we can separate the problem into sub-problems and simplify it. => Forward Algorithm, Backward Algorithm
The forward variable
- Initialization (i from 1 to N) $$ \alpha_1(i) = \pi_ib_i(o_1) $$
- Iteration (t from 1 to T-1, j from 1 to N) $$ \alpha_{t+1}(j) = [\sum_{i=1}^N \alpha_t(i) a_{ij}]b_j(o_{t+1}) $$
- Termination $$ P(O|\lambda) = \sum_{i=1}^N \alpha_T(i) $$
Time complexity:
- multiplications: $N(N+1)(T-1)+N$
- additions: $N(N-1)(T-1)$
Calculate P(HHT|𝝺) for the example:

| $\alpha_t(i)$ | H | H | T |
|---|---|---|---|
| 1 | 0.16667 | 0.15000 | 0.08672 |
| 2 | 0.25000 | 0.05312 | 0.00684 |
| 3 | 0.08333 | 0.03229 | 0.02597 |
P(HHT|𝝺) = 0.11953
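A sketch of the forward pass (plain NumPy; reusing pi, A, B, O from the encoding above) that reproduces this table:

```python
import numpy as np

def forward(pi, A, B, O):
    """alpha[t, i] = P(o_1 .. o_{t+1}, state i at step t+1 | λ)."""
    T, N = len(O), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, O[0]]                       # initialization
    for t in range(T - 1):
        alpha[t+1] = (alpha[t] @ A) * B[:, O[t+1]]   # recursion
    return alpha

alpha = forward(pi, A, B, O)
print(alpha.T)          # the table above, rows = states
print(alpha[-1].sum())  # 0.11953...
```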
The backward variable
- Initialization (i from 1 to N) $$ \beta_T(i) = 1 $$
- Iteration (t from T-1 down to 1, i from 1 to N) $$ \beta_{t}(i) = \sum_{j=1}^N a_{ij}b_j(o_{t+1})\beta_{t+1}(j) $$
- Termination $$ P(O|\lambda) = \sum_{i=1}^N \pi_ib_i(o_1)\beta_1(i) $$
The definition of the backward algorithm is not an exact mirror of the forward algorithm. It could be defined symmetrically, but this form is the one we will reuse in the third problem (learning).
Calculate P(HHT|𝝺) for the example:

| $\beta_t(i)$ | H | H | T (initial) |
|---|---|---|---|
| 1 | 0.25219 | 0.50000 | 1 |
| 2 | 0.20297 | 0.58750 | 1 |
| 3 | 0.32109 | 0.41250 | 1 |

P(HHT|𝝺) = 0.11953
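The matching backward pass (same conventions as the forward sketch; reusing pi, A, B, O) arrives at the same likelihood from the other end:

```python
import numpy as np

def backward(pi, A, B, O):
    """beta[t, i] = P(o_{t+2} .. o_T | state i at step t+1, λ)."""
    T, N = len(O), len(pi)
    beta = np.zeros((T, N))
    beta[T-1] = 1.0                                  # initialization at time T
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, O[t+1]] * beta[t+1])     # recursion, backwards in time
    return beta

beta = backward(pi, A, B, O)
print((pi * B[:, O[0]] * beta[0]).sum())  # 0.11953..., same as forward
```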
We could also solve this problem by brute force, but it is just as expensive, so we'll use a dynamic programming algorithm again.
The Viterbi algorithm is a variation of the forward algorithm.
The Viterbi variable
- base case: $$ \delta_1(i) = \pi_ib_i(o_1) $$
- recursive case: $$ \delta_{t+1}(j) = [\max_i\delta_t(i)a_{ij}]b_j(o_{t+1}) $$
- record the path:
$\psi_t(j)$ records the best previous state (at time t-1) leading to state j at time t
- Initialization (i from 1 to N) $$ \delta_1(i) = \pi_ib_i(o_1), \quad \psi_1(i) = 0 $$
- Iteration (t from 2 to T, j from 1 to N) $$ \delta_t(j) = \max_i[\delta_{t-1}(i)a_{ij}]b_j(o_t), \quad \psi_t(j) = \arg\max_i \delta_{t-1}(i)a_{ij} $$
- Termination $$ P^* = \max_i\delta_T(i), \quad q^*_T = \arg\max_i\delta_T(i) $$
- Backtrack the best path (t from T-1 down to 1) $$ q^*_t = \psi_{t+1}(q^*_{t+1}) $$
Calculate the best state transition sequence for the example:

| $\delta_t(i)$ | H | H | T |
|---|---|---|---|
| 1 | 0.16667 | 0.07500 | 0.03375 |
| 2 | 0.25000 | 0.02812 | 0.00316 |
| 3 | 0.08333 | 0.02812 | 0.00949 |

| $\psi_t(j)$ | H | H | T |
|---|---|---|---|
| 1 | 0 | 1 | 1 |
| 2 | 0 | 3 | 3 |
| 3 | 0 | 2 | 2 |

So the best state transition sequence is (1, 1, 1).
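A Viterbi sketch in the same style (reusing pi, A, B, O); the δ and ψ recursions above map directly onto a max and an argmax over a score matrix:

```python
import numpy as np

def viterbi(pi, A, B, O):
    """Return (P*, best path); the forward recursion with max instead of sum."""
    T, N = len(O), len(pi)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, O[0]]
    for t in range(1, T):
        scores = delta[t-1][:, None] * A       # scores[i, j] = delta_{t-1}(i) a_ij
        psi[t] = scores.argmax(axis=0)         # best predecessor of each state j
        delta[t] = scores.max(axis=0) * B[:, O[t]]
    path = [int(delta[-1].argmax())]           # q*_T
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))     # backtrack: q*_{t-1} = psi_t(q*_t)
    return delta[-1].max(), path[::-1]

print(viterbi(pi, A, B, O))  # (0.03375, [0, 0, 0]), i.e. states (1, 1, 1)
```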
Principle: Maximum Likelihood Estimation - find the parameters that maximize P(O|𝝺).
The HMM 𝝺 is unknown; we are given an observation sequence O.
- Supervised Learning:
  - the state transition sequence (i.e. the answer) is also given
  - pros: simple and effective
  - cons: needs the transition sequence, which usually requires manual tagging => high cost
- Unsupervised Learning:
  - no analytic (closed-form) solution
  - can only reach a locally optimal model
Given initial parameters (A, B, 𝞹):
Because we don't have the state transition sequence, we cannot count the state transition frequencies, state output frequencies, and initial state frequencies directly. => Assume every state transition sequence is possible!
Then calculate the expected values of these counts, and update (A, B, 𝞹) with them.
How do we weight each sequence? For a state transition sequence q, use P(q|O, 𝝺) as its weight.
But this only works in theory; we need a more efficient algorithm.
The Baum-Welch algorithm finds the unknown parameters of a hidden Markov model (HMM). It makes use of the forward-backward algorithm. Over all possible paths, define:

- $\xi_t(i, j)$: the expected probability of the transition (i, j) occurring between time t and t+1 (i.e. state i at time t and state j at time t+1) $$ \xi_t(i, j) = P(q_t = i, q_{t+1} = j \mid O, \lambda) = \frac{\alpha_t(i)\,a_{ij}\,b_j(o_{t+1})\,\beta_{t+1}(j)}{P(O|\lambda)} $$
- $\gamma_t(i)$: the probability of being in state i at time t $$ \gamma_t(i) = P(q_t = i \mid O, \lambda) = \frac{\alpha_t(i)\,\beta_t(i)}{P(O|\lambda)} $$

Considering all the time steps:
- expected number of transitions out of state i $$ \sum_{t=1}^{T-1}\gamma_t(i) $$
- expected number of transitions from state i to state j $$ \sum_{t=1}^{T-1}\xi_t(i, j) $$
Then we can estimate (𝞹, A, B):

$$ \bar{\pi}_i = \gamma_1(i) $$

$$ \bar{a}_{ij} = \frac{\sum_{t=1}^{T-1}\xi_t(i, j)}{\sum_{t=1}^{T-1}\gamma_t(i)} $$

$$ \bar{b}_j(k) = \frac{\sum_{t=1}^{T}\gamma_t(j)\,\delta(o_t, v_k)}{\sum_{t=1}^{T}\gamma_t(j)} $$

where $\delta(o_t, v_k) = 1$ if $o_t = v_k$ and $0$ otherwise.
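In code, both quantities and the re-estimates fall out of one forward and one backward pass. A sketch (reusing the forward and backward functions from the earlier sketches):

```python
import numpy as np

def xi_gamma(alpha, beta, A, B, O):
    """E-step quantities: xi[t, i, j] = ξ_t(i, j), gamma[t, i] = γ_t(i)."""
    likelihood = alpha[-1].sum()               # P(O|λ)
    xi = np.array([
        alpha[t][:, None] * A * (B[:, O[t+1]] * beta[t+1])
        for t in range(len(O) - 1)
    ]) / likelihood
    gamma = alpha * beta / likelihood
    return xi, gamma

def reestimate(xi, gamma, O, M):
    """M-step: plug the expected counts into the formulas above."""
    pi_bar = gamma[0]
    a_bar = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    b_bar = np.stack([gamma[np.array(O) == k].sum(axis=0)
                      for k in range(M)], axis=1) / gamma.sum(axis=0)[:, None]
    return pi_bar, a_bar, b_bar
```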
The iterative procedure:

- Initialize (𝞹, A, B) (e.g. uniformly).
- Iteratively calculate the new parameters ($\bar{\pi}, \bar{A}, \bar{B}$).
- Update the model with them and calculate again...
- ...until convergence.
The Baum-Welch algorithm is a kind of EM algorithm:

- E-step:
  - calculate $\xi_t(i, j)$ and $\gamma_t(i)$
- M-step:
  - estimate the model $\bar{\lambda}$
- Termination condition $$ |\log P(O|\bar{\lambda}) - \log P(O|\lambda)| < \epsilon $$
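Putting it together, a minimal EM loop (reusing forward, backward, xi_gamma and reestimate from the sketches above; it only finds a local optimum):

```python
import numpy as np

def baum_welch(pi, A, B, O, eps=1e-6, max_iter=200):
    """Alternate the E-step and M-step until the log-likelihood change is < eps."""
    prev = -np.inf
    for _ in range(max_iter):
        alpha = forward(pi, A, B, O)                     # E-step inputs
        beta = backward(pi, A, B, O)
        xi, gamma = xi_gamma(alpha, beta, A, B, O)
        pi, A, B = reestimate(xi, gamma, O, B.shape[1])  # M-step
        ll = np.log(alpha[-1].sum())
        if abs(ll - prev) < eps:                         # termination condition above
            break
        prev = ll
    return pi, A, B
```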
Floating-point underflow problem (the probabilities shrink toward zero as T grows):

- for the Forward algorithm
  - scaling factors & log
- for the Viterbi algorithm
  - log
- for the Baum-Welch algorithm
  - scaling factors
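For example, a scaled variant of the forward pass (a sketch along the lines above): normalizing α at every step keeps the values in a safe floating-point range, and the logs of the scaling factors add up to log P(O|𝝺).

```python
import numpy as np

def forward_scaled(pi, A, B, O):
    """Scaled forward pass: returns log P(O|λ) without underflow."""
    alpha = pi * B[:, O[0]]
    log_likelihood = 0.0
    for t in range(len(O)):
        c = alpha.sum()                  # scaling factor at time t
        log_likelihood += np.log(c)
        alpha = alpha / c                # keep alpha in a safe range
        if t < len(O) - 1:
            alpha = (alpha @ A) * B[:, O[t+1]]
    return log_likelihood

# np.exp(forward_scaled(pi, A, B, O)) ≈ 0.11953 for the coin example
```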