We shall examine two examples:
1) "tagging" words with part-of-speech labels, based on a limited amount of visible context (e.g. the previous 2 words);
2) labelling a sequence of signal parameters (e.g. LPC coefficients) with phonological categories (e.g. phoneme labels).
1. n-gram models for tagging.
Many if not most words belong to several syntactic categories. Consequently, sentences are commonly syntactically ambiguous. Consider "Time flies like an arrow":
Time/N   flies/V   like/ADJ   an/DET   arrow/N
[Time [flies [like an arrow]]]        (Normal "simile" reading)

Time/V   flies/N   like/ADJ   an/DET   arrow/N
[[Time flies] [like an arrow]]        (Curious imperative)

Time/N   flies/N   like/V   an/DET   arrow/N
[[Time flies] [like an arrow]]        (Strange species of insect)
In this case, a good grammar and parsing algorithm ought to be able to give all three parses as possibilities. Yet we know that one of the three structures is more likely than the others. Classically, however, neither the grammar nor the parser provides us with likelihoods, only possibilities.
Another example: "-ing" forms may be verbs (e.g. "was swinging"), adjectives ("low swinging branches") or nouns ("swinging of the branches"). A good parser may be able to discriminate between the different contexts in which the -ing word occurs, but very often a full parse is not necessary: reference to a few adjacent words may be sufficient. E.g.
was Xing:    Xing is probably a Verb
ADJ Xing N:  Xing is probably an Adjective
the Xing of: Xing is probably a Noun
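Such patterns are easy to mechanise. As a minimal sketch (not from the notes), the following awk command applies a crude version of the three rules using only one word of left context; the input lines are invented examples:
echo 'was swinging
low swinging
the swinging' |
awk '$2 ~ /ing$/ {
       # crude version of the rules above, using only the left neighbour
       if      ($1 == "was") print $2 ": probably a Verb"
       else if ($1 == "the") print $2 ": probably a Noun"
       else                  print $2 ": probably an Adjective"
     }'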
A simple model for estimating the likely part of speech of a word in a particular context is the n-gram approach. An n-gram model is built on a list of short word sequences, each of which is paired with one or more part-of-speech labels, together with an associated "score" to indicate the likelihood that the final word in the sequence is of that part of speech. For example:
(Part of the entries for "swinging":)
...
branch was swinging: V, 1
branch was swinging: ADJ, 0
branch was swinging: N, 0
the low swinging: V, 0
the low swinging: ADJ, 0.9
the low swinging: N, 0.1
under the swinging: V, 0
under the swinging: ADJ, 0.9
under the swinging: N, 0.1
...
This is an example of the most common kind of n-gram model, a trigram model (based on 3-word sequences).
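Given such a table in machine-readable form, say a hypothetical file trigram_scores with lines of the form "w1 w2 w3 TAG score" (e.g. "branch was swinging V 1"), the most likely tag for each trigram can be extracted with a sort-and-filter sketch:
sort -k1,3 -k5,5nr trigram_scores |
awk '!seen[$1" "$2" "$3]++ {print $1, $2, $3, $4}' > best_tags
The sort groups the rows by trigram with the highest score first, and awk keeps only the first row of each group, so best_tags holds one best guess per trigram (we use this file again below).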
Q: Where do trigrams, their categories and scores come from?
A: From a tagged corpus.
Several large-scale text corpora have been manually tagged, and they thus provide a useful resource with which to build an n-gram model. Consider, for example, the opening few lines of "Rebecca":
"Last night I dreamt I went to Manderley again. It seemed to me I stood by the iron gate leading to the drive, and for a while I could not enter, for the way was barred to me..."
Construct trigrams, and pair them with categories. Keep a count of how many times each trigram occurs in the text:
Trigram           | 3rd word category | Count
Last night I      | Pronoun           | 1
night I dreamt    | Verb              | 1
I dreamt I        | Pronoun           | 1
dreamt I went     | Verb              | 1
I went to         | Preposition       | 1
went to Manderley | Proper noun       | 1
...
As we work through the text, the count of each trigram will at first be low, often 1. Some trigrams, however, will recur, e.g. "I went to". In some cases the third word of a trigram will be classified differently because of the different syntactic contexts in which the trigram appears; in such cases, the counts will probably differ between the two classifications. By the end of the novel, the list of trigrams, categories and counts will be substantial.
In Unix (e.g. bash):
cat triplets | sort | uniq -c | sort -nr >triplet_counts
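The triplets file itself can be produced with a sliding window. A minimal sketch (not from the notes), assuming the text has already been put one word per line in a hypothetical file words (e.g. with tr -s '[:space:]' '\n' < text > words):
awk 'NR > 2 {print w1, w2, $0}    # emit each window of three consecutive words
     {w1 = w2; w2 = $0}' words > triplets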
Most frequent triplets in the Spoken BNC:
4977 I DON'T KNOW
2974 A LOT OF
2394 I DON'T THINK
1897 DO YOU WANT
1575 IN N IT
1542 ONE OF THE
1459 WHAT DO YOU
1456 I MEAN I
1390 YOU WANT TO
1301 A BIT OF
1249 GON NA BE
1211 BE ABLE TO
1208 THE END OF
1207 DU N NO
1187 YOU HAVE TO
1186 I'M GON NA
1172 IT WAS A
1146 DO YOU KNOW
1124 YOU KNOW I
1122 DO YOU THINK
The total number of triplet tokens in the corpus:
cat triplet_counts | awk 'BEGIN {s=0} {s = s+$1} END {print s}'
5915653
From which we can estimate the probability of each triplet thus:
cat triplet_counts | awk '{print $0, $1/5915653}' >triplet_probabilities
4977 I DON'T KNOW   0.000841327
2974 A LOT OF       0.000502734
2394 I DON'T THINK  0.000404689
1897 DO YOU WANT    0.000320675
1575 IN N IT        0.000266243
1542 ONE OF THE     0.000260664
1459 WHAT DO YOU    0.000246634
1456 I MEAN I       0.000246127
1390 YOU WANT TO    0.00023497
1301 A BIT OF       0.000219925
1249 GON NA BE      0.000211135
1211 BE ABLE TO     0.000204711
1208 THE END OF     0.000204204
1207 DU N NO        0.000204035
1187 YOU HAVE TO    0.000200654
1186 I'M GON NA     0.000200485
1172 IT WAS A       0.000198118
1146 DO YOU KNOW    0.000193723
1124 YOU KNOW I     0.000190004
1122 DO YOU THINK   0.000189666
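Hard-coding the total (5915653) is fragile. The same result can be obtained without it by reading triplet_counts twice, a common awk idiom; a sketch:
awk 'NR == FNR {s += $1; next}    # first pass: sum all the counts
     {print $0, $1/s}             # second pass: append each probability
    ' triplet_counts triplet_counts > triplet_probabilities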
Using the model to tag a new text.
With the complete model, we can tag a new text simply by breaking it into trigrams and referring to the part-of-speech trigram table. For example, from the table of trigrams we can see that "to" is almost always a preposition when it follows "I went".
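As a sketch of this lookup in awk, we can combine the sliding window used earlier with the hypothetical best_tags table built above (lines of the form "w1 w2 w3 TAG"), assuming the new text is one word per line in a file newtext:
awk 'NR == FNR {tag[$1" "$2" "$3] = $4; next}   # load the trigram -> tag table
     FNR > 2 {
       key = w1" "w2" "$0
       print $0, (key in tag ? tag[key] : "UNKNOWN")
     }
     {w1 = w2; w2 = $0}' best_tags newtext
The first two words of the text get no tag here, since they lack two words of left context, and UNKNOWN marks exactly the unseen-trigram problem discussed next.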
The main problem with this model is its incompleteness. For example, "went to Twickenham" is not amongst the "Rebecca" trigrams, even though "went to Manderley" and "went to Monte Carlo" may be. We therefore need a back-off method to estimate the probability of trigrams that were not seen in the original corpus.
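One simple back-off scheme (an illustrative assumption, not a specific method prescribed by these notes) falls back from the unseen trigram "w1 w2 w3" to the bigram "w2 w3", and failing that to "w3" alone. A crude sketch extending the tagger above:
awk 'NR == FNR {
       tri[$1" "$2" "$3] = $4                     # exact trigram
       if (!(($2" "$3) in bi)) bi[$2" "$3] = $4   # bigram fallback (first entry wins)
       if (!($3 in uni)) uni[$3] = $4             # unigram fallback (first entry wins)
       next
     }
     FNR > 2 {
       key3 = w1" "w2" "$0; key2 = w2" "$0
       if      (key3 in tri) t = tri[key3]
       else if (key2 in bi)  t = bi[key2]
       else if ($0 in uni)   t = uni[$0]
       else                  t = "UNKNOWN"
       print $0, t
     }
     {w1 = w2; w2 = $0}' best_tags newtext
A real back-off model would re-estimate probabilities from bigram and unigram counts, with appropriate discounting, rather than simply reusing the first matching tag, but the control flow is the same.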