(After Rabiner and Juang)
Problem 1: Given an observation sequence O and a model Mi, how do we efficiently compute p(O | Mi ), the probability of the observation sequence, given the model?
We can also view this problem as one of scoring how well a given model matches an observation sequence. If we are trying to choose among several competing models, the solution to this problem allows us to choose the model that best matches the observations.
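As a rough illustration of scoring, the sketch below computes p(O | Mi ) with the forward recursion (the first half of the forward-backward procedure) and uses it to pick between two competing models. Everything in it is an assumption made for the example: the models are discrete HMMs with two states and a binary alphabet, the numbers in MODEL_A and MODEL_B are invented, and no scaling is applied, so it is only suitable for short sequences.

```python
def forward_probability(obs, start, trans, emit):
    """Return p(O | M) for a discrete HMM via the forward recursion.

    obs   : list of observation symbols (integers indexing columns of emit)
    start : start[i]    = p(first state is i)
    trans : trans[i][j] = p(next state is j | current state is i)
    emit  : emit[i][k]  = p(symbol k | state i)
    """
    n_states = len(start)
    # alpha[i] = p(o_1 .. o_t, state at time t is i | M)
    alpha = [start[i] * emit[i][obs[0]] for i in range(n_states)]
    for symbol in obs[1:]:
        alpha = [
            sum(alpha[i] * trans[i][j] for i in range(n_states)) * emit[j][symbol]
            for j in range(n_states)
        ]
    # Summing over the final state gives the probability of the whole sequence.
    return sum(alpha)


# Two hypothetical models (all values are made up for illustration).
MODEL_A = {
    "start": [0.6, 0.4],
    "trans": [[0.7, 0.3], [0.4, 0.6]],
    "emit": [[0.9, 0.1], [0.2, 0.8]],
}
MODEL_B = {
    "start": [0.5, 0.5],
    "trans": [[0.5, 0.5], [0.5, 0.5]],
    "emit": [[0.5, 0.5], [0.5, 0.5]],
}

observations = [0, 0, 1, 0]
scores = {
    name: forward_probability(observations, **model)
    for name, model in (("A", MODEL_A), ("B", MODEL_B))
}
best_model = max(scores, key=scores.get)  # the model that best matches O
print(scores, best_model)
```

The recursion costs on the order of N squared times T operations for N states and T observations, whereas summing over every possible state sequence would cost on the order of N to the power T.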
Problem 2: Given the observation sequence O and the model Mi, how do we choose a corresponding state sequence X that is optimal in some sense (i.e. best "explains" the observations)?
In this problem, we attempt to uncover the hidden part of the model: the unknown state sequence. In most cases, there is no "correct" state sequence to be found. All we require is to find the best state sequence we can for the task at hand, or even a sub-optimal one, provided that it works well enough.
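One common reading of "optimal" is the single most probable state sequence, which the Viterbi algorithm finds by dynamic programming. The sketch below is a minimal version under stated assumptions: a discrete HMM, invented parameter values (START, TRANS, EMIT are hypothetical), and plain probabilities rather than log probabilities, so it is only meant for short sequences.

```python
def viterbi(obs, start, trans, emit):
    """Return (best_path, probability) for a discrete HMM.

    best_path is the state sequence X maximising p(X, O | M).
    """
    n_states = len(start)
    # delta[i] = probability of the best partial path that ends in state i
    delta = [start[i] * emit[i][obs[0]] for i in range(n_states)]
    backpointers = []  # one list per time step after the first
    for symbol in obs[1:]:
        new_delta, pointers = [], []
        for j in range(n_states):
            # Best predecessor of state j at this time step.
            best_i = max(range(n_states), key=lambda i: delta[i] * trans[i][j])
            pointers.append(best_i)
            new_delta.append(delta[best_i] * trans[best_i][j] * emit[j][symbol])
        delta = new_delta
        backpointers.append(pointers)
    # Pick the best final state, then follow the backpointers to the start.
    state = max(range(n_states), key=lambda i: delta[i])
    best_probability = delta[state]
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path, best_probability


# Hypothetical 2-state model over a binary alphabet (values are made up).
START = [0.6, 0.4]
TRANS = [[0.7, 0.3], [0.4, 0.6]]
EMIT = [[0.9, 0.1], [0.2, 0.8]]

best_path, best_probability = viterbi([0, 0, 1], START, TRANS, EMIT)
print(best_path, best_probability)
```

The structure mirrors the forward recursion, with the sum over predecessor states replaced by a maximum and a record of which predecessor achieved it.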
Problem 3: How do we adjust the parameters of each model to maximise p(O | Mi )? In other words, how do we train each model so that the model works as well as it can?
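To give a feel for what such training involves, here is a sketch of one re-estimation step in the style of the Baum-Welch algorithm, for a discrete HMM and a single observation sequence. It makes several simplifying assumptions: all starting parameters are strictly positive, no scaling is used (so sequences must be short), and the initial guesses are invented. Repeating the step should never decrease p(O | Mi ), but it only finds a local maximum.

```python
def forward(obs, start, trans, emit):
    """alpha[t][i] = p(o_1 .. o_t, state at t is i | M)."""
    n = len(start)
    alpha = [[start[i] * emit[i][obs[0]] for i in range(n)]]
    for symbol in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * trans[i][j] for i in range(n)) * emit[j][symbol]
                      for j in range(n)])
    return alpha


def backward(obs, trans, emit):
    """beta[t][i] = p(o_{t+1} .. o_T | state at t is i, M)."""
    n = len(trans)
    beta = [[1.0] * n]
    for symbol in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, [sum(trans[i][j] * emit[j][symbol] * nxt[j] for j in range(n))
                        for i in range(n)])
    return beta


def baum_welch_step(obs, start, trans, emit):
    """One re-estimation step; returns new parameters and the OLD p(O | M)."""
    n, m, T = len(start), len(emit[0]), len(obs)
    alpha, beta = forward(obs, start, trans, emit), backward(obs, trans, emit)
    likelihood = sum(alpha[-1])
    # gamma[t][i] = p(state at t is i | O, M)
    gamma = [[alpha[t][i] * beta[t][i] / likelihood for i in range(n)]
             for t in range(T)]
    # xi[t][i][j] = p(state at t is i and state at t+1 is j | O, M)
    xi = [[[alpha[t][i] * trans[i][j] * emit[j][obs[t + 1]] * beta[t + 1][j] / likelihood
            for j in range(n)] for i in range(n)] for t in range(T - 1)]
    new_start = gamma[0][:]
    # Expected transitions i -> j divided by expected departures from i.
    new_trans = [[sum(xi[t][i][j] for t in range(T - 1)) /
                  sum(gamma[t][i] for t in range(T - 1))
                  for j in range(n)] for i in range(n)]
    # Expected time in state i emitting k divided by expected time in state i.
    new_emit = [[sum(gamma[t][i] for t in range(T) if obs[t] == k) /
                 sum(gamma[t][i] for t in range(T))
                 for k in range(m)] for i in range(n)]
    return new_start, new_trans, new_emit, likelihood


# Hypothetical training data and initial guesses (values are made up).
observations = [0, 0, 1, 0, 1, 1, 0, 0]
start, trans, emit = [0.5, 0.5], [[0.6, 0.4], [0.3, 0.7]], [[0.7, 0.3], [0.4, 0.6]]
for iteration in range(5):
    start, trans, emit, likelihood = baum_welch_step(observations, start, trans, emit)
    print(iteration, likelihood)  # likelihood of the parameters before this update
```

In practice one would train on many sequences per model and work with scaled or log probabilities; this sketch leaves both out to stay short.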
Consider an application of this approach in recognition of isolated words. The first task is to build individual word models. To start with, we merely guess values for the model parameters. Then we use the solution to problem 3 to improve the estimates of the model parameters. The algorithms that are customarily used to solve each of these three problems are:
Solution to problem 1: the Forward-Backward algorithm (strictly, only its forward pass is needed to compute p(O | Mi ); the full algorithm is closely linked to solution 3).
Solution to problem 2: the Viterbi algorithm.
Solution to problem 3: the Baum-Welch algorithm.
The details of these are mostly too complicated to go into in a course of this (introductory) level, though Jurafsky and Martin (2000) do a pretty good job. The Viterbi algorithm is perhaps the easiest place to start, as it is a dynamic programming algorithm, very similar to DTW.