(After Young et al. 1997)

Let each spoken word be represented by a sequence of speech vectors
or *observations* **O**, defined as

**O** = (**o**_{1}, **o**_{2}, **o**_{3}, ..., **o**_{n})

where **o**_{t} is the speech vector observed at time *t*. The
recognition problem is then to compute

arg max_{i} { *p*(*w*_{i} | **O**) }

(which means "the word *w*_{i} for which the probability of that
word's occurrence, given the observation sequence, is maximum").
So, by Bayes' rule,

*p*(*w*_{i} | **O**) = *p*(**O** | *w*_{i}) *p*(*w*_{i}) / *p*(**O**)

*p*(*w*_{i}) is referred to as the prior probability of word *w*_{i}.

*p*(**O**), the probability of the observation sequence, is the same
for every candidate word, because the observation sequence is fixed;
it can therefore be ignored when maximizing. Hence, *p*(*w*_{i} | **O**)
is maximized by maximizing the product *p*(**O** | *w*_{i}) *p*(*w*_{i}).

The likelihood *p*(**O** | *w*_{i}) is computed from a Markov model
*M*_{i} of word *w*_{i}:

*p*(**O** | *w*_{i}) = *p*(**O** | *M*_{i})
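The decision rule above can be sketched in a few lines. The word
list, likelihood values, and prior probabilities below are
hypothetical numbers chosen purely for illustration, not values from
the text; in a real recognizer each likelihood would come from
evaluating that word's model against the observation sequence.

```python
# Hypothetical per-word likelihoods p(O | w_i), as if computed from
# each word's model M_i, and hypothetical priors p(w_i).
likelihoods = {"yes": 4e-43, "no": 3e-41, "stop": 7e-44}
priors = {"yes": 0.4, "no": 0.4, "stop": 0.2}

# p(O) is the same for every candidate word, so maximizing
# p(O | w_i) * p(w_i) maximizes the posterior p(w_i | O).
best_word = max(likelihoods, key=lambda w: likelihoods[w] * priors[w])
print(best_word)  # prints "no"
```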

For a particular state sequence *X* = *x*(1), *x*(2), ..., *x*(n)
through the model in figure 7.6, the joint probability of the
observations and the state sequence is

*p*(**O**, *X* | *M*_{i}) = *a*_{x(0)x(1)} ∏_{t=1}^{n} *b*_{x(t)}(**o**_{t}) *a*_{x(t)x(t+1)}

where *x*(0) and *x*(n+1) are the model's entry and exit states.
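The joint probability for one state sequence is just a running
product of transition and emission probabilities. The sketch below
assumes a hypothetical model with three emitting states (1–3) plus a
non-emitting entry state 0 and exit state 4, in the spirit of figure
7.6; the transition values and the Gaussian emission density are
invented for illustration only.

```python
import numpy as np

# Hypothetical transition matrix a[i, j] for states 0..4
# (0 = entry, 4 = exit; numbers are made up for illustration).
a = np.zeros((5, 5))
a[0, 1] = 1.0
a[1, 1], a[1, 2] = 0.6, 0.4
a[2, 2], a[2, 3] = 0.7, 0.3
a[3, 3], a[3, 4] = 0.5, 0.5

def b(state, o):
    """Hypothetical emission probability b_j(o_t): a unit-variance
    Gaussian per state; a real system would use mixture densities."""
    means = {1: 0.0, 2: 1.0, 3: 2.0}
    return float(np.exp(-0.5 * (o - means[state]) ** 2) / np.sqrt(2 * np.pi))

def joint_prob(obs, X):
    """p(O, X | M) = a_{x(0)x(1)} * prod_t b_{x(t)}(o_t) * a_{x(t)x(t+1)}.
    X lists the full path: [entry, x(1), ..., x(n), exit]."""
    p = 1.0
    for t, o in enumerate(obs):
        p *= a[X[t], X[t + 1]] * b(X[t + 1], o)  # transition, then emission
    return p * a[X[-2], X[-1]]                   # final hop into the exit state

p = joint_prob([0.1, 0.2, 1.1, 2.0], [0, 1, 1, 2, 3, 4])
```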

However, only the observation sequence is known: the underlying state
sequence *X* is hidden. That is why it is called a *Hidden*
Markov
Model. Given that *X* is unknown, the required likelihood is
computed
by summing over all possible state sequences. Alternatively, the
likelihood
can be approximated by considering only the most likely state sequence.
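Both options described above, summing over all state sequences and
keeping only the best one, share the same recursion and differ only
in whether each step sums or maximizes. The toy model below is
hypothetical (invented entry distribution `pi`, transition matrix
`a` over two emitting states, and pre-evaluated emission
probabilities `B[t, j] = b_j(o_t)`); it is a sketch of the idea, not
a production forward/Viterbi implementation.

```python
import numpy as np

pi = np.array([1.0, 0.0])           # hypothetical entry probabilities
a = np.array([[0.6, 0.4],
              [0.0, 0.9]])          # row sums < 1: remainder exits the model
exit_p = 1.0 - a.sum(axis=1)        # probability of leaving from each state
B = np.array([[0.8, 0.1],           # B[t, j] = b_j(o_t), already evaluated
              [0.5, 0.4],
              [0.1, 0.7]])

def likelihood(reduce_fn):
    """One recursion, two readings: reduce_fn = np.sum gives the exact
    p(O | M) over all state sequences; np.max keeps only the best one."""
    alpha = pi * B[0]
    for t in range(1, len(B)):
        alpha = reduce_fn(alpha[:, None] * a, axis=0) * B[t]
    return reduce_fn(alpha * exit_p, axis=0)

total = likelihood(np.sum)   # sum over all possible state sequences
best = likelihood(np.max)    # Viterbi approximation: most likely sequence
```

By construction the best single sequence can never account for more
probability than the sum over all of them, so `best <= total` always
holds.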

All this, of course, assumes that the state transition probabilities
*a*_{ij} and the observation probabilities *b*_{j}(**o**_{t}) are
known.