Statistical Language Modelling

 
We shall examine two examples:
 
1) "tagging" words with part-of-speech labels, based on a limited amount of visible context (e.g. the previous 2 words);
 
2) labelling a sequence of signal parameters (e.g. LPC coefficients) with phonological categories (e.g. phoneme labels).
 
 
1. n-gram models for tagging. 
 
Many, if not most, words belong to several syntactic categories. Consequently, sentences are commonly syntactically ambiguous. Consider "Time flies like an arrow".
 
   N      V     PREP  DET  N
[Time  [flies  [like   an  arrow]]]       (Normal "simile" reading)

    V      N    PREP  DET  N
[[Time  flies]  [like  an  arrow]]        (Curious imperative)

    N      N      V   DET  N
[[Time  flies]  [like  an  arrow]]        (Strange species of insect)
 
 
In this case, a good grammar and parsing algorithm ought to be able to give all three parses as possibilities. Yet we know that one of the three structures is more likely than the others. Neither the grammar nor the parser classically provides us with likelihoods, however, only possibilities.
 
Another example: "-ing" forms may be verbs (e.g. "was swinging"), adjectives ("low swinging branches") or nouns ("swinging of the branches"). A good parser may be able to discriminate between the different contexts in which the -ing word occurs, but very often a full parse is not necessary: reference to a few adjacent words may be sufficient. E.g.
 
    was Xing:       Xing is probably a Verb
    ADJ Xing N:     Xing is probably an Adjective
    the Xing of:    Xing is probably a Noun
 
A simple model for estimating the likely part of speech of a word in a particular context is the n-gram approach. An n-gram model is built on a list of short word sequences, each of which is paired with one or more part-of-speech labels, together with an associated "score" to indicate the likelihood that the final word in the sequence is of that part of speech. For example:
 
(Part of the entries for "swinging":)
 
    ...
    branch was swinging:    V, 1
    branch was swinging:    ADJ, 0
    branch was swinging:    N, 0
    the low swinging:       V, 0
    the low swinging:       ADJ, 0.9
    the low swinging:       N, 0.1
    under the swinging:     V, 0
    under the swinging:     ADJ, 0.9
    under the swinging:     N, 0.1
    ...
 
This is an example of the most common kind of n-gram model, a trigram model (based on 3-word sequences).
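As a small sketch, a trigram-to-tag table of this kind can be queried with standard Unix tools. The file `model.txt`, its format (w1 w2 w3 TAG score) and its entries below are hypothetical, mirroring the "swinging" entries above:

```shell
# Hypothetical model file: w1 w2 w3 TAG score (one entry per line)
printf 'the low swinging V 0\nthe low swinging ADJ 0.9\nthe low swinging N 0.1\n' > model.txt

# Print the highest-scoring tag for a query trigram.
awk -v q="the low swinging" '
    ($1 " " $2 " " $3) == q && $5 > best { best = $5; tag = $4 }
    END { if (tag != "") print tag }
' model.txt
# -> ADJ
```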
 
Q: Where do trigrams, their categories and scores come from?
 
A: from a tagged corpus.
 
Several large-scale text corpora have been manually tagged, and they thus provide a useful resource with which to build an n-gram model. Consider, for example, the opening few lines of "Rebecca":

"Last night I dreamt I went to Manderley again. It seemed to me I stood by the iron gate leading to the drive, and for a while I could not enter, for the way was barred to me..."

Construct trigrams, and pair them with categories. Keep a count of how many times each trigram occurs in the text:

Trigram              3rd-word category   Count
Last night I         Pronoun             1
night I dreamt       Verb                1
I dreamt I           Pronoun             1
dreamt I went        Verb                1
I went to            Preposition         1
went to Manderley    Proper noun         1
...

As we work through the text, the count of each trigram will at first be low, often 1. Some trigrams, however, will occur again, e.g. "I went to". In some cases the third word of a trigram will be classified differently because of the different syntactic contexts in which the trigram appears, and the counts for the two classifications will then probably differ. By the end of the novel, the trigrams, categories and counts will form a substantial list.

In Unix (e.g. bash):

    cat triplets | sort | uniq -c | sort -nr  >triplet_counts
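Here `triplets` is assumed to be a file with one word triplet per line. A minimal sketch of producing such triplets from running text (the sample sentence is inlined for illustration; in practice the output would be redirected to `triplets` and fed to the pipeline above):

```shell
# Split text into one word per line (keeping apostrophes),
# then emit each overlapping triplet of consecutive words.
printf 'Last night I dreamt I went to Manderley again\n' |
tr -cs "[:alpha:]'" '\n' |
awk 'NF { n++; if (n >= 3) print two, one, $0; two = one; one = $0 }'
```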

Most frequent triplets in the Spoken BNC:

   4977 I DON'T KNOW
   2974 A LOT OF
   2394 I DON'T THINK
   1897 DO YOU WANT
   1575 IN N IT
   1542 ONE OF THE
   1459 WHAT DO YOU
   1456 I MEAN I
   1390 YOU WANT TO
   1301 A BIT OF
   1249 GON NA BE
   1211 BE ABLE TO
   1208 THE END OF
   1207 DU N NO
   1187 YOU HAVE TO
   1186 I'M GON NA
   1172 IT WAS A
   1146 DO YOU KNOW
   1124 YOU KNOW I
   1122 DO YOU THINK


To calculate the total number of triplets using a Unix/Linux shell (e.g. bash):

cat triplet_counts | awk 'BEGIN {s=0} {s = s+$1} END {print s}'
5915653

From which we can estimate the probability of each triplet thus:

cat triplet_counts | awk '{print $0, $1/5915653}' >triplet_probabilities

   4977 I DON'T KNOW 0.000841327
   2974 A LOT OF 0.000502734
   2394 I DON'T THINK 0.000404689
   1897 DO YOU WANT 0.000320675
   1575 IN N IT 0.000266243
   1542 ONE OF THE 0.000260664
   1459 WHAT DO YOU 0.000246634
   1456 I MEAN I 0.000246127
   1390 YOU WANT TO 0.00023497
   1301 A BIT OF 0.000219925
   1249 GON NA BE 0.000211135
   1211 BE ABLE TO 0.000204711
   1208 THE END OF 0.000204204
   1207 DU N NO 0.000204035
   1187 YOU HAVE TO 0.000200654
   1186 I'M GON NA 0.000200485
   1172 IT WAS A 0.000198118
   1146 DO YOU KNOW 0.000193723
   1124 YOU KNOW I 0.000190004
   1122 DO YOU THINK 0.000189666

 
Using the model to tag a new text.
 
With the complete model, we can tag a new text simply by breaking it into trigrams and referring to the part-of-speech trigram table. For example, from the table of trigrams we can see that "to" is almost always a preposition when it follows "I went".
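A sketch of this tagging loop in the same style as the earlier pipelines. The model file and its two entries are a toy assumption; a real system would load the full trigram table:

```shell
# Toy hypothetical model: w1 w2 w3 TAG score
printf 'I went to PREP 0.95\nI went to N 0.05\n' > model.txt

# Tag each word by looking up the trigram that ends at that word.
echo "dreamt I went to" |
tr -cs "[:alpha:]'" '\n' |
awk 'NF { n++; w[n] = $0 }
     END {
         # load the best (highest-scoring) tag for each trigram in the model
         while ((getline line < "model.txt") > 0) {
             split(line, f, " ")
             key = f[1] " " f[2] " " f[3]
             if (f[5] > best[key]) { best[key] = f[5]; tag[key] = f[4] }
         }
         for (i = 3; i <= n; i++) {
             k = w[i-2] " " w[i-1] " " w[i]
             print w[i], (k in tag ? tag[k] : "UNSEEN")
         }
     }'
```

The first two words of the text get no tag here (their trigrams are incomplete), and trigrams absent from the model are marked UNSEEN, which is exactly the incompleteness problem discussed below.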
 
The main problem with this model is incompleteness. For example, "went to Twickenham" is not amongst the "Rebecca" trigrams, even though "went to Manderley" and "went to Monte Carlo" may be. Therefore, we need a back-off method to estimate the probability of trigrams that were not seen in the original corpus.
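One simple back-off scheme, sketched here with a tiny hypothetical counts file and an arbitrary weight: if the trigram was never seen, fall back to the relative frequency of its final bigram, scaled down by a constant. This is only a crude illustration, not a principled smoothing method:

```shell
# Tiny counts file for illustration (format as produced above: count w1 w2 w3)
printf '3 went to Manderley\n2 went to Monte\n' > toy_counts

# Estimate P("went to Twickenham"): unseen as a trigram, so back off
# to the final bigram "to Twickenham", scaled by a weight alpha.
awk -v q="went to Twickenham" -v alpha=0.4 '
    { tri[$2 " " $3 " " $4] = $1; bi[$3 " " $4] += $1; total += $1 }
    END {
        split(q, w, " ")
        if (tri[q] > 0)
            p = tri[q] / total
        else if (bi[w[2] " " w[3]] > 0)
            p = alpha * bi[w[2] " " w[3]] / total
        else
            p = alpha * alpha / total   # crude floor for wholly unseen sequences
        print p
    }
' toy_counts
# -> 0.032
```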


Next: Trigram modelling