Introducing Speech and Language Processing

Week 1 homework

1) Download and run the example code from the class (nfsa1.pl and nfst1.pl).
Use the Prolog command

trace.

to observe the code in operation, step by step, and

notrace.

to run the code normally.

2) Write Prolog code for a finite state automaton for the mini-language given in:
http://www.phon.ox.ac.uk/jcoleman/new_SLP/Lecture_1/figure5-18.png

How could you alter it so that it also accepts the following strings?

I want some information.
I want some uh information.
Would you like a first class seat?
How much is a first class seat?
I will need to return in the morning.
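
As a starting point for the second part, here is a minimal sketch in Python (not Prolog, and not the automaton in figure5-18.png) of a table-driven recognizer for just the first two strings above. The state names q0 to q4 and the function name accepts are made up for illustration; your Prolog version will probably be organised differently.

```python
# Toy transition table: (state, word) -> set of possible next states.
# The states q0..q4 are hypothetical and cover only two of the strings above.
TRANSITIONS = {
    ("q0", "i"): {"q1"},
    ("q1", "want"): {"q2"},
    ("q2", "some"): {"q3"},
    ("q3", "uh"): {"q3"},           # optional filler: loop back to the same state
    ("q3", "information"): {"q4"},
}
START = "q0"
FINAL = {"q4"}


def accepts(sentence):
    """Return True if the words lead from START to a final state."""
    # Lower-case, drop sentence-final punctuation, then split into words.
    words = sentence.lower().rstrip(".?!").split()
    current = {START}
    for word in words:
        # Follow every arc labelled with this word from every active state.
        current = set().union(
            *(TRANSITIONS.get((state, word), set()) for state in current)
        )
        if not current:
            return False            # no arc for this word: reject
    return bool(current & FINAL)
```

Because the uh arc loops back to the same state, this toy automaton also accepts any number of repeated uh's. Whether you want that, or a separate state that allows only one, is a design decision to consider in your own answer.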

3) (Separate task in preparation for next week; nothing especially to do with the previous tasks.)

a) Download the plain text of "Treasure Island" from http://ota.ox.ac.uk/text/5730.txt and save it as treasureisland.txt
b) Tokenize the text by replacing every whitespace character with a newline, so that each token is on a line of its own.

Hint: in OS X or Linux, try e.g.

sed 's/ /\
/g' treasureisland.txt > tokens.txt


In Windows, you can do this by opening treasureisland.txt in a text editor (e.g. Notepad++) and using search and replace to replace every space with a newline. (Of course, you can do this on other operating systems too.) Then save the modified file as tokens.txt

Alternatively, you might find it useful to install Cygwin on your Windows system so that you can use *nix commands. It is not a full Linux distribution, just a command-line environment that feels like one.
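
If you have Python 3 installed (an assumption; it is not required for this homework), the same step can be sketched in a way that works on any operating system. The script name make_tokens.py and the function names are hypothetical. Note that, unlike the sed command, this splits on tabs and line breaks as well as spaces, and it never produces empty tokens.

```python
import sys


def tokenize(text):
    """Split text on any run of whitespace and return the list of tokens."""
    # str.split() with no argument treats spaces, tabs and newlines alike
    # and never yields empty strings.
    return text.split()


def tokenize_file(in_path, out_path):
    """Read in_path and write one token per line to out_path."""
    with open(in_path, encoding="utf-8", errors="replace") as src:
        tokens = tokenize(src.read())
    with open(out_path, "w", encoding="utf-8") as dst:
        dst.write("\n".join(tokens) + "\n")
    return len(tokens)


if __name__ == "__main__" and len(sys.argv) == 3:
    # e.g. python make_tokens.py treasureisland.txt tokens.txt
    n = tokenize_file(sys.argv[1], sys.argv[2])
    print(f"wrote {n} tokens")
```

Punctuation stays attached to the words (e.g. "information." with its full stop), just as it does with the search-and-replace approach.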

c) Sort the tokenized text into an alphabetically ordered list. Count the number of instances of each token, to make a list of words with their counts. Sort the list of word counts into descending order by count.

Hint: in OS X or Linux, try e.g.

sort tokens.txt | uniq -c | sort -nr > wordcounts.txt


In Windows, it takes two command-line steps: first

SORT tokens.txt /O sorted.txt

then

UNIQUE /count <sorted.txt >wordcounts.txt

(UNIQUE is not a core Windows/DOS command, but you can download it from http://www.richpasco.org/utilities/unique.html)
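
As an OS-independent alternative, the counting and sorting in step c) can be sketched in Python 3 (again assuming it is installed; the function names are hypothetical). It uses collections.Counter from the standard library and breaks ties between equally frequent tokens alphabetically, which is a choice of this sketch and may differ from the order that sort -nr gives.

```python
import sys
from collections import Counter


def count_tokens(tokens):
    """Return (token, count) pairs, most frequent first."""
    counts = Counter(tokens)
    # Sort by descending count, then alphabetically to make ties deterministic.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def count_file(in_path, out_path):
    """Read one token per line from in_path and write 'count<TAB>token' lines."""
    with open(in_path, encoding="utf-8", errors="replace") as src:
        # Skip blank lines, which arise when the text had consecutive spaces.
        tokens = [line.strip() for line in src if line.strip()]
    ranked = count_tokens(tokens)
    with open(out_path, "w", encoding="utf-8") as dst:
        for token, count in ranked:
            dst.write(f"{count}\t{token}\n")
    return len(ranked)


if __name__ == "__main__" and len(sys.argv) == 3:
    # e.g. python count_words.py tokens.txt wordcounts.txt
    n = count_file(sys.argv[1], sys.argv[2])
    print(f"wrote {n} distinct tokens")
```

The counting is case-sensitive, so "The" and "the" are counted as different tokens, as they are with uniq -c.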