Chapter 2 Introduction to Hidden Markov Models
A hidden Markov model \(\{X_t: t \in \mathbb{N}\}\) can be summarized by
\[\begin{align} \Pr(C_t|\boldsymbol{C}^{(t-1)}) &= \Pr(C_t|C_{t-1}) \qquad{t = 2, 3, \dots} \tag{2.1} \end{align}\]
\[\begin{align} \Pr(X_t|\boldsymbol{X}^{(t-1)}, \boldsymbol{C}^{(t)}) &= \Pr(X_t|C_t) \qquad{t \in \mathbb{N}} \tag{2.2} \end{align}\]
where
(2.1) is the unobserved parameter process \(\{C_t\}\), which satisfies the Markov property
(2.2) is the observed state-dependent process \(\{X_t\}\): given the current state \(C_t\), the observation \(X_t\) is conditionally independent of all past and future observations and states
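To make the two-layer structure concrete, here is a minimal R sketch (not from the text) that simulates from an HMM directly following (2.1) and (2.2); Poisson state-dependent distributions and the parameter values are assumptions made purely for illustration.
# Simulate Tn observations from an m-state Poisson HMM
simulate_hmm = function(Tn, delta, Gamma, lambda){
  m = length(delta)
  C = numeric(Tn)
  X = numeric(Tn)
  C[1] = sample(1:m, 1, prob = delta)              # draw C_1 from its initial distribution
  X[1] = rpois(1, lambda[C[1]])                    # draw X_1 given C_1, as in (2.2)
  for(t in 2:Tn){
    C[t] = sample(1:m, 1, prob = Gamma[C[t-1], ])  # Markov transition, as in (2.1)
    X[t] = rpois(1, lambda[C[t]])                  # observation depends only on the current state
  }
  list(states = C, obs = X)
}

# e.g. simulate_hmm(Tn = 10, delta = c(0.5, 0.5),
#                   Gamma = matrix(c(0.9, 0.1, 0.2, 0.8), 2, byrow = TRUE),
#                   lambda = c(1, 3))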
2.1 State-Dependent Distributions
The state-dependent distributions of observation \(X_t\) given that the Markov chain is in state \(i\) at time \(t\) are the probability mass (pmf) or density (pdf) functions
\[p_i (x) = \Pr(X_t = x|C_t = i) \qquad{\text{for } i = 1, \dots, m}\]
or in matrix notation,
\[ \boldsymbol{P} (x) = \begin{pmatrix} p_1 (x) & & 0\\ & \ddots & \\ 0 & & p_m(x) \end{pmatrix} \]
Note: The remaining sections assume discrete observations \(X_t\); the continuous case is analogous.
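For instance, if each state-dependent distribution is Poisson with mean \(\lambda_i\) (as in the exercises below), \(\boldsymbol{P}(x)\) can be built in R as follows; this is a minimal sketch and the function name is an assumption of mine.
# Diagonal matrix of state-dependent pmfs p_i(x), here Poisson with means lambda
P_matrix = function(x, lambda){
  diag(dpois(x, lambda), nrow = length(lambda))
}

P_matrix(2, c(1, 3))   # diag(Pr(X_t = 2 | C_t = 1), Pr(X_t = 2 | C_t = 2))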
2.2 Marginal Distributions
2.2.1 Univariate Case
Define \(u_i (t) = \Pr (C_t = i)\) for \(t = 1, \dots, T\).
The marginal distribution of \(X_t\), where the MC is homogeneous but not necessarily stationary, is given by
\[\begin{align} \Pr (X_t = x) &= \sum_{i=1}^m \Pr (C_t = i) \Pr(X_t = x|C_t = i)\\ &= \sum_{i=1}^m u_i (t) p_i (x) \end{align}\]
or in matrix notation,
\[\begin{align} \Pr(X_t = x) &= (u_1(t), \dots, u_m(t)) \begin{pmatrix} p_1 (x) & & 0\\ & \ddots & \\ 0 & & p_m(x) \end{pmatrix} \begin{pmatrix} 1\\ \vdots\\ 1 \end{pmatrix}\\ &= \boldsymbol{u} (t) \boldsymbol{P} (x) \boldsymbol{1'} \end{align}\]
Since \(\boldsymbol{u} (t+1) = \boldsymbol{u} (t) \boldsymbol{\Gamma}\) (Equation (1.7)), it follows that \(\boldsymbol{u} (t) = \boldsymbol{u} (1) \boldsymbol{\Gamma}^{t-1}\), so
\[\begin{equation} \Pr(X_t = x) = \boldsymbol{u} (1) \boldsymbol{\Gamma}^{t-1} \boldsymbol{P} (x) \boldsymbol{1'} \end{equation}\]
The marginal distribution of \(X_t\) if the MC is stationary is given by
\[\begin{equation} \Pr(X_t = x) = \boldsymbol{\delta P} (x) \boldsymbol{1'} \end{equation}\]
Note: If the MC has stationary distribution \(\boldsymbol{\delta}\), i.e. \(\boldsymbol{\delta \Gamma} = \boldsymbol{\delta}\) and \(\boldsymbol{\delta 1'} = 1\), then \(\boldsymbol{\delta \Gamma}^{t-1} = \boldsymbol{\delta}\) for all \(t \in \mathbb{N}\).
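A hedged R sketch of this computation, assuming Poisson state-dependent distributions and illustrative values for \(\boldsymbol{\Gamma}\), \(\boldsymbol{u}(1)\), and \(\boldsymbol{\lambda}\) (none of these come from the text):
# Pr(X_t = x) = u(1) Gamma^{t-1} P(x) 1'
Gamma  = matrix(c(0.1, 0.9,
                  0.4, 0.6), nrow = 2, byrow = TRUE)
u1     = c(0.5, 0.5)    # distribution of C_1 (not necessarily stationary)
lambda = c(1, 3)        # state-dependent Poisson means

marginal_X = function(x, t, u1, Gamma, lambda){
  u_t = u1
  if(t > 1) for(s in 2:t) u_t = u_t %*% Gamma    # u(t) = u(1) Gamma^{t-1}
  sum(u_t * dpois(x, lambda))                    # u(t) P(x) 1', since P(x) is diagonal
}

marginal_X(x = 2, t = 3, u1, Gamma, lambda)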
2.2.2 Bivariate Case
The bivariate distribution of \(X_t\) and \(X_{t+k}\), where the MC is homogeneous but not necessarily stationary, is given by
\[\begin{align} \Pr(X_t = v, X_{t+k} = w) &= \sum_{i=1}^m \sum_{j=1}^m \Pr (X_t = v, X_{t+k} = w, C_t = i, C_{t+k} = j)\\ &= \sum_{i=1}^m \sum_{j=1}^m \Pr(C_t = i) \Pr(X_t = v|C_t = i) \Pr(C_{t+k} = j|C_t = i) \Pr(X_{t+k} = w|C_{t+k} = j)\\ &= \sum_{i=1}^m \sum_{j=1}^m u_i (t) p_i (v) \gamma_{ij} (k) p_j (w) \end{align}\]
or in matrix notation
\[\begin{align} \Pr(X_t = v, X_{t+k} = w) &= \boldsymbol{u} (t) \boldsymbol{P} (v) \boldsymbol{\Gamma}^k \boldsymbol{P} (w) \boldsymbol{1'} \qquad{\text{where } \gamma_{ij} (k) \text{ denotes the (i, j) element of } \boldsymbol{\Gamma}^k} \end{align}\]
The bivariate distribution of \(X_t\) and \(X_{t+k}\) if the MC is stationary is given by \[\begin{align} \Pr(X_t = v, X_{t+k} = w) &= \boldsymbol{\delta P} (v) \boldsymbol{\Gamma}^k \boldsymbol{P} (w) \boldsymbol{1'} \end{align}\]
Note: The above follows from Equation (9.1) in which \(\Pr(X_t, X_{t+k}, C_{t}, C_{t+k}) = \Pr (C_t) \Pr(X_t|C_t) \Pr(C_{t+k}|C_t) \Pr(X_{t+k}|C_{t+k})\).
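Continuing the sketch above (with the stationary distribution computed as in Chapter 1), the bivariate probability can be evaluated directly from the matrix formula; values remain illustrative:
# Pr(X_t = v, X_{t+k} = w) = delta P(v) Gamma^k P(w) 1'
m     = nrow(Gamma)
delta = solve(t(diag(m) - Gamma + 1), rep(1, m))   # stationary distribution: delta(I - Gamma + U) = 1

bivariate_X = function(v, w, k, delta, Gamma, lambda){
  Gk = diag(length(delta))
  for(s in seq_len(k)) Gk = Gk %*% Gamma           # Gamma^k
  as.numeric(delta %*% diag(dpois(v, lambda)) %*% Gk %*%
               diag(dpois(w, lambda)) %*% rep(1, length(lambda)))
}

bivariate_X(v = 0, w = 2, k = 1, delta, Gamma, lambda)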
2.3 Moments
2.3.1 Univariate Case
Let \(g\) be any function for which the relevant state-dependent expectations exist.
The expectation of \(g(X_t)\), where the MC is homogeneous but not necessarily stationary, is given by
\[\begin{align} E(g(X_t)) &= \sum_{i=1}^m E(g(X_t)|C_t = i) \Pr(C_t = i)\\ &= \sum_{i=1}^m u_i (t) E(g(X_t)|C_t = i) \end{align}\]
For example, taking \(g\) to be the identity gives
\[\begin{align} E(X_t) &= \sum_{i=1}^m E(X_t|C_t = i) \Pr(C_t = i)\\ &= \sum_{i=1}^m u_i (t) E(X_t|C_t = i) \end{align}\]
and in the stationary case,
\[\begin{align} E(g(X_t)) = \sum_{i=1}^m \delta_i E(g(X_t)|C_t = i) \end{align}\]
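For a stationary Poisson HMM, \(E(X_t|C_t = i) = \lambda_i\), so the stationary mean is \(\sum_i \delta_i \lambda_i\); with the illustrative parameters from the sketches above:
# E(X_t) = sum_i delta_i * E(X_t | C_t = i) = sum_i delta_i * lambda_i
sum(delta * lambda)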
2.3.2 Bivariate Case
The expectation of \(g(X_t, X_{t+k})\), where the MC is homogeneous but not necessarily stationary, is given by
\[\begin{align} E(g(X_t, X_{t+k})) &= \sum_{i, j=1}^m E(g(X_t, X_{t+k})|C_t = i, C_{t+k} = j) \Pr(C_t = i) \Pr(C_{t+k}=j|C_t = i)\\ &= \sum_{i, j=1}^m E(g(X_t, X_{t+k})|C_t = i, C_{t+k} = j) u_i(t)\gamma_{ij} (k) \end{align}\]
and in the stationary case,
\[\begin{align} E(g(X_t, X_{t+k})) = \sum_{i, j=1}^m E(g(X_t, X_{t+k})|C_t = i, C_{t+k} = j) \delta_i \gamma_{ij} (k) \end{align}\]
If \(g\) is factorizable as \(g(X_t, X_{t+k}) = g_1 (X_t) g_2 (X_{t+k})\), then the above becomes
\[E(g(X_t, X_{t+k})) = \sum_{i, j=1}^m E(g_1(X_t)|C_t = i)E(g_2(X_{t+k})|C_{t+k}=j) \delta_i \gamma_{ij} (k)\]
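Taking \(g_1(x) = g_2(x) = x\) for a stationary Poisson HMM gives \(E(X_t X_{t+k}) = \boldsymbol{\delta}\, \mathrm{diag}(\boldsymbol{\lambda})\, \boldsymbol{\Gamma}^k \boldsymbol{\lambda}'\), from which the autocovariance follows. A sketch reusing the illustrative parameters above:
# Cov(X_t, X_{t+k}) = delta diag(lambda) Gamma^k lambda' - (delta lambda')^2
autocov_X = function(k, delta, Gamma, lambda){
  Gk = diag(length(delta))
  for(s in seq_len(k)) Gk = Gk %*% Gamma           # Gamma^k
  EXtXtk = as.numeric(delta %*% diag(lambda) %*% Gk %*% lambda)
  EXtXtk - sum(delta * lambda)^2
}

autocov_X(k = 1, delta, Gamma, lambda)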
2.4 Likelihood of Hidden Markov Models
Proposition 2.1 The likelihood is given by
\[\begin{equation} L_T = \boldsymbol{\delta P} (x_1) \boldsymbol{\Gamma P} (x_2) \boldsymbol{\Gamma} \cdots \boldsymbol{\Gamma P} (x_T) \boldsymbol{1'} \tag{2.3} \end{equation}\]
If \(\boldsymbol{\delta}\), the distribution of \(C_1\), is the stationary distribution of the Markov chain, then in addition
\[\begin{equation} L_T = \boldsymbol{\delta \Gamma P} (x_1) \boldsymbol{\Gamma P} (x_2) \boldsymbol{\Gamma} \cdots \boldsymbol{\Gamma P} (x_T) \boldsymbol{1'} \tag{2.4} \end{equation}\]
Proof.
\[\begin{align} L_T &= \Pr(\boldsymbol{X}^{(T)} = \boldsymbol{x}^{(T)})\\ &= \sum_{c_1, c_2, \dots, c_T = 1}^m \Pr(\boldsymbol{X}^{(T)} = \boldsymbol{x}^{(T)}, \boldsymbol{C}^{(T)} = \boldsymbol{c}^{(T)}) \qquad{\text{by LOTP}}\\ &= \sum_{c_1, c_2, \dots, c_T = 1}^m \Pr(C_1 = c_1) \prod_{k=2}^T \Pr(C_k = c_k|C_{k-1}=c_{k-1}) \prod_{k=1}^T \Pr(X_k = x_k|C_k = c_k) \qquad{\text{by Equation (10.1)}}\\ &= \sum_{c_1, c_2, \dots, c_T = 1}^m \left(\delta_{c_1} \gamma_{c_1, c_2} \gamma_{c_2, c_3} \cdots \gamma_{c_{T-1}, c_T} \right) \left( p_{c_1} (x_1) p_{c_2} (x_2) \cdots p_{c_T} (x_T) \right)\\ &= \sum_{c_1, \dots, c_T = 1}^m \delta_{c_1} p_{c_1} (x_1) \gamma_{c_1, c_2} p_{c_2} (x_2) \gamma_{c_2, c_3} \cdots \gamma_{c_{T-1}, c_T} p_{c_T} (x_T)\\ &= \boldsymbol{\delta P} (x_1) \boldsymbol{\Gamma P} (x_2) \boldsymbol{\Gamma} \cdots \boldsymbol{\Gamma P} (x_T) \boldsymbol{1'} \end{align}\]
If \(\boldsymbol{\delta}\) is the stationary distribution, then \(\boldsymbol{\delta P} (x_1) = \boldsymbol{\delta \Gamma P} (x_1)\).
Note: The likelihood is a sum of \(m^T\) terms, each of which is a product of \(2T\) factors, so evaluating it term by term would require \(\mathcal{O} (Tm^T)\) operations. For example, the bivariate distribution in Section 2.2.2 is a sum of \(m^2\) terms, each a product of \(2 \times 2 = 4\) factors. The recursive structure of the likelihood drastically reduces the computational cost: it can be evaluated by the forward algorithm, which requires only \(\mathcal{O} (Tm^2)\) operations. This makes numerical maximization feasible, as shown in the next chapter.
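The \(\mathcal{O}(Tm^T)\) cost of summing over all state sequences can be made explicit with a brute-force sketch (an illustration only, usable for small \(T \geq 2\) and small \(m\); Poisson state-dependent distributions assumed):
# Direct evaluation of L_T by summing over all m^T state sequences
brute_force_likelihood = function(x, delta, Gamma, lambda){
  Tn = length(x)
  m  = length(delta)
  seqs  = as.matrix(expand.grid(rep(list(1:m), Tn)))   # every possible state sequence
  total = 0
  for(r in 1:nrow(seqs)){
    cs   = seqs[r, ]
    term = delta[cs[1]] * dpois(x[1], lambda[cs[1]])
    for(t in 2:Tn) term = term * Gamma[cs[t-1], cs[t]] * dpois(x[t], lambda[cs[t]])
    total = total + term
  }
  total
}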
2.4.1 Forward Algorithm
For \(t = 1, 2, \dots, T\), define
\[\begin{align} \boldsymbol{\alpha}_t &= \boldsymbol{\delta P} (x_1) \boldsymbol{\Gamma P} (x_2) \boldsymbol{\Gamma P} (x_3) \cdots \boldsymbol{\Gamma P} (x_t)\\ &= \boldsymbol{\delta P} (x_1) \prod_{s=2}^t \boldsymbol{\Gamma P} (x_s) \tag{2.5} \end{align}\]
with the convention that an empty product is the identity matrix.
It follows from this definition that
\[\begin{equation} L_T = \boldsymbol{\alpha}_T \boldsymbol{1'} \qquad{ \text{ and }} \qquad{} \boldsymbol{\alpha}_t = \boldsymbol{\alpha}_{t-1} \boldsymbol{\Gamma P} (x_t) \text{ for } t \geq 2 \end{equation}\]
For a homogeneous but not necessarily stationary MC, the likelihood can be computed as \(L_T = \boldsymbol{\alpha}_T \boldsymbol{1'}\) where
\[\boldsymbol{\alpha}_1 = \boldsymbol{\delta P} (x_1)\]
\[\boldsymbol{\alpha}_t = \boldsymbol{\alpha}_{t-1} \boldsymbol{\Gamma P} (x_t) \qquad{\text{for } t = 2, 3, \dots, T}\]
and in the stationary case,
\[\boldsymbol{\alpha}_0 = \boldsymbol{\delta}\]
\[\boldsymbol{\alpha}_t = \boldsymbol{\alpha}_{t-1} \boldsymbol{\Gamma P} (x_t) \qquad{\text{for } t = 1, 2, \dots, T}\]
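A minimal R implementation of the stationary-case recursion, again assuming Poisson state-dependent distributions (the function name and arguments are illustrative):
# Forward algorithm: alpha_0 = delta, alpha_t = alpha_{t-1} Gamma P(x_t), L_T = alpha_T 1'
forward_likelihood = function(x, delta, Gamma, lambda){
  alpha = delta
  for(t in seq_along(x)){
    alpha = alpha %*% Gamma %*% diag(dpois(x[t], lambda))
  }
  sum(alpha)   # L_T = alpha_T 1'
}
For long series the entries of \(\boldsymbol{\alpha}_t\) underflow quickly, so in practice the recursion is usually carried out on scaled (or log) quantities.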
2.5 Exercises
- Consider a stationary two-state Poisson-HMM with parameters
\[\boldsymbol{\Gamma} = \begin{pmatrix} 0.1 & 0.9\\ 0.4 & 0.6 \end{pmatrix} \qquad{} \boldsymbol{\lambda} = (1, 3)\]
In each of the following ways, compute the probability that the first three observations from this model are 0, 2, 1.
(a) Consider all possible sequences of the states of the Markov chain that could have occurred. Compute the probability of each sequence, and the probability of the observations given each sequence.
(b) Apply the formula \(\Pr(X_1 = 0, X_2 = 2, X_3 = 1) = \boldsymbol{\delta P} (0) \boldsymbol{\Gamma P} (2) \boldsymbol{\Gamma P} (1) \boldsymbol{1'}\), where
\[\begin{align} \boldsymbol{P} (x) &= \begin{pmatrix} \frac{\lambda_1^x e^{- \lambda_1}}{x!} & 0\\ 0 & \frac{\lambda_2^x e^{- \lambda_2}}{x!} \end{pmatrix}\\ &= \begin{pmatrix} \frac{1^x e^{-1}}{x!} & 0\\ 0 & \frac{3^x e^{-3}}{x!} \end{pmatrix} \end{align}\]
- Consider the vector \(\boldsymbol{\alpha}_t = (\alpha_t(1), \dots, \alpha_t (m))\) defined by \[\alpha_t(j) = \Pr(\boldsymbol{X}^{(t)} = \boldsymbol{x}^{(t)}, C_t = j), \qquad{j = 1, 2, \dots, m}\]
(a) Use conditional probability and the conditional independence assumptions to show that
\[\begin{equation} \alpha_t(j) = \sum_{i=1}^m \alpha_{t-1} (i) \gamma_{ij} p_j (x_t) \tag{2.6} \end{equation}\]
(b) Verify that the result from (a), written in matrix notation, yields the forward recursion \[\boldsymbol{\alpha}_t = \boldsymbol{\alpha}_{t-1} \boldsymbol{\Gamma P} (x_t) \qquad{t = 2, \dots, T}\]
(c) Hence derive the matrix expression for the likelihood.
2.6 Solutions
Question 1
Part a
# Set up -----------------------------------------------------------------------
## Load package
library(tidyverse)
## Recall statdist function in chapter 1
statdist = function(Gamma){
  # number of components
  m = nrow(Gamma)

  # identity matrix (mxm)
  I = diag(m)
  # ones matrix (mxm)
  U = matrix(rep(1, m*m), nrow=m, ncol=m)
  # vector of ones (1xm)
  one_vec = rep(1, m)

  # I - Gamma + U
  X = I - Gamma + U

  # delta(Im - Gamma + U) = 1
  delta = solve(t(X), one_vec)
  return(delta)
}
# Question 1 --------------------------------------------------------------------
# Parameters
Gamma_pois = matrix(c(0.1, 0.4, 0.9, 0.6), nrow=2, ncol=2)
lambda_pois = c(1, 3)
delta_pois = statdist(Gamma_pois) ## assuming stationary

# All possible state sequences: i = c_1, j = c_2, k = c_3
i = c(rep(1, 4), rep(2, 4))
j = c(rep(c(1, 1, 2, 2), 2))
k = c(rep(c(1, 2), 4))

# values for X
x1 = 0
x2 = 2
x3 = 1
# Function that computes Pr(X=s|C=c)
p = function(lambda1, lambda2, index, s){
  if(index==1){
    ps = ((lambda1^s)*exp(-lambda1))/(factorial(s))
    return(ps)
  }else{
    ps = ((lambda2^s)*exp(-lambda2))/(factorial(s))
    return(ps)
  }
}
# Compute Pr(X=s|C=c)
pi0 = c()
pj2 = c()
pk1 = c()
for(ind in 1:length(i)){
  pi0[ind] = p(lambda_pois[1], lambda_pois[2], i[ind], x1)
  pj2[ind] = p(lambda_pois[1], lambda_pois[2], j[ind], x2)
  pk1[ind] = p(lambda_pois[1], lambda_pois[2], k[ind], x3)
}
# delta_i
delta_i = c(rep(delta_pois[1], 4), rep(delta_pois[2], 4))

# gamma_ij
gamma_ij = c()
for(ind in 1:length(i)){
  gamma_ij[ind] = Gamma_pois[i[ind], j[ind]]
}

# gamma_jk
gamma_jk = c()
for(ind in 1:length(i)){
  gamma_jk[ind] = Gamma_pois[j[ind], k[ind]]
}
# Table for likelihood computation
lktab = as.data.frame(cbind(i, j, k, pi0, pj2, pk1, delta_i, gamma_ij, gamma_jk))

lktab <- lktab %>%
  mutate(product = pi0*pj2*pk1*delta_i*gamma_ij*gamma_jk) %>%
  mutate_if(is.numeric, round, 4)

# Likelihood
sum(lktab$product)
## [1] 0.0073

# Global decoding
lktab[which(lktab$product == max(lktab$product)), ]
##   i j k    pi0   pj2    pk1 delta_i gamma_ij gamma_jk product
## 3 1 2 1 0.3679 0.224 0.3679  0.3077      0.9      0.4  0.0034
The state sequence with the highest probability is \(i = 1, j = 2, k = 1\).
Part b
# P(s) matrices
Ps = function(lambda1, lambda2, s){
  Ps_matrix = matrix(c(((lambda1^s)*exp(-lambda1))/(factorial(s)),
                       0,
                       0,
                       ((lambda2^s)*exp(-lambda2))/(factorial(s))), nrow=2, ncol=2)
  return(Ps_matrix)
}

P0 = Ps(lambda_pois[1], lambda_pois[2], 0)
P2 = Ps(lambda_pois[1], lambda_pois[2], 2)
P1 = Ps(lambda_pois[1], lambda_pois[2], 1)
# Row vector of ones
one_vec = rep(1, 2)

# Apply the formula
delta_pois%*%P0%*%Gamma_pois%*%P2%*%Gamma_pois%*%P1%*%one_vec
##            [,1]
## [1,] 0.00729174
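As a quick cross-check (not part of the original solution), the forward recursion of Section 2.4.1 reproduces the same value, using the objects already defined above:
# alpha_0 = delta, alpha_t = alpha_{t-1} Gamma P(x_t); L_3 = alpha_3 1'
alpha = delta_pois
for(x in c(0, 2, 1)){
  alpha = alpha %*% Gamma_pois %*% Ps(lambda_pois[1], lambda_pois[2], x)
}
sum(alpha)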
Question 2
Part a
\[\begin{align} \alpha_t (j) &= \Pr(\boldsymbol{X}^{(t)} = \boldsymbol{x}^{(t)}, C_t = j)\\ &= \Pr (X_t = x_t | \boldsymbol{X}^{(t-1)} = \boldsymbol{x}^{(t-1)}, C_t = j) \Pr(\boldsymbol{X}^{(t-1)} = \boldsymbol{x}^{(t-1)}, C_t = j) \qquad{\text{by Chain rule}}\\ &= \Pr(X_t = x_t |C_t = j) \Pr(\boldsymbol{X}^{(t-1)} = \boldsymbol{x}^{(t-1)}, C_t = j) \qquad{\text{by conditional independence}}\\ &= \Pr(X_t = x_t |C_t = j) \sum_{i=1}^m \Pr(C_t = j|\boldsymbol{X}^{(t-1)} = \boldsymbol{x}^{(t-1)}, C_{t-1} = i) \Pr(\boldsymbol{X}^{(t-1)} = \boldsymbol{x}^{(t-1)}, C_{t-1} = i) \qquad{\text{by Law of Total Probability}}\\ &= \Pr(X_t = x_t |C_t = j) \sum_{i=1}^m \Pr(C_t = j|C_{t-1} = i) \Pr(\boldsymbol{X}^{(t-1)} = \boldsymbol{x}^{(t-1)}, C_{t-1} = i) \qquad{\text{by conditional independence}}\\ &= \sum_{i=1}^m \Pr(\boldsymbol{X}^{(t-1)} = \boldsymbol{x}^{(t-1)}, C_{t-1} = i) \Pr(C_t = j|C_{t-1} = i) \Pr(X_t = x_t|C_t=j)\\ &= \sum_{i=1}^m \alpha_{t-1} (i) \gamma_{ij} p_j (x_t) \end{align}\]
Part b
Let \(j \in \{1, \dots, m\}\). Then
\[\begin{align} \left(\boldsymbol{\alpha}_{t-1} \boldsymbol{\Gamma P} (x_t)\right)_{1j} &= \sum_{k=1}^m \left(\boldsymbol{\alpha}_{t-1} \boldsymbol{\Gamma}\right)_{1k} \left(\boldsymbol{P} (x_t)\right)_{kj}\\ &= \sum_{k=1}^m \sum_{i=1}^m \left(\boldsymbol{\alpha}_{t-1} \right)_{1i} \left(\boldsymbol{\Gamma}\right)_{ik} \left(\boldsymbol{P} (x_t)\right)_{kj}\\ &= \sum_{i=1}^m \alpha_{t-1} (i) \gamma_{ij} p_j (x_t) \qquad{\text{since } \boldsymbol{P} (x_t) \text{ is diagonal, so } \left(\boldsymbol{P} (x_t) \right)_{kj} = 0 \text{ for all } k \neq j} \end{align}\]
Since the \(j\)-th entry of \(\boldsymbol{\alpha}_{t-1} \boldsymbol{\Gamma P} (x_t)\) is equal to the \(j\)-th entry of \(\boldsymbol{\alpha}_t\) for an arbitrary \(j \in \{1, \dots, m\}\), we conclude that \(\boldsymbol{\alpha}_t = \boldsymbol{\alpha}_{t-1} \boldsymbol{\Gamma P} (x_t)\).
Part c
We will prove for \(t = 1, \dots, T\) that \(\boldsymbol{\alpha}_t = \boldsymbol{\delta P} (x_1) \boldsymbol{\Gamma} \cdots \boldsymbol{\Gamma P} (x_t)\) by induction on \(t\).
Base case:
If \(t=1\), then
\[\begin{align*} \boldsymbol{\alpha}_1 &= (\alpha_1 (1), \dots, \alpha_1 (m))\\ &= (\Pr(X_1 = x_1, C_1 = 1), \dots, \Pr(X_1 = x_1, C_1 = m))\\ &= (\Pr(C_1 = 1) \Pr(X_1 = x_1|C_1 = 1), \dots, \Pr(C_1 = m) \Pr(X_1 = x_1|C_1 = m))\\ &= (\delta_{1} p_1(x_1), \dots, \delta_{m} p_m (x_1))\\ &= \boldsymbol{\delta P} (x_1) \end{align*}\]
Inductive step:
Let \(t \in \{1, \dots, T-1\}\). Suppose \(\boldsymbol{\alpha}_t = \boldsymbol{\delta P} (x_1) \boldsymbol{\Gamma} \cdots \boldsymbol{\Gamma P} (x_t)\). Then
\[\begin{align*} \boldsymbol{\alpha}_{t+1} &= (\alpha_{t+1} (1), \dots, \alpha_{t+1} (m))\\ &= (\Pr(\boldsymbol{X}^{(t+1)} = \boldsymbol{x}^{(t+1)}, C_{t+1} = 1), \dots, \Pr(\boldsymbol{X}^{(t+1)} = \boldsymbol{x}^{(t+1)}, C_{t+1} = m))\\ &= (\sum_{i=1}^m \alpha_t (i) \gamma_{i1} p_1 (x_{t+1}), \dots, \sum_{i=1}^m \alpha_t (i) \gamma_{im} p_m (x_{t+1})) \qquad{\text{by part a}}\\ &= \boldsymbol{\alpha}_t \boldsymbol{\Gamma P} (x_{t+1}) \qquad{\qquad{\text{by part b}}}\\ &= \boldsymbol{\delta P} (x_1) \boldsymbol{\Gamma} \cdots \boldsymbol{\Gamma P} (x_t) \boldsymbol{\Gamma P} (x_{t+1}) \qquad{\text{by the inductive hypothesis}} \end{align*}\]