Introducing Speech and Language Processing
Week 1 homework
1) Download and run the example code from the class (nfsa1.pl and nfst1.pl).
Use the Prolog command
trace.
to observe the code in operation step by step, and
notrace.
to run the code normally.
2) Write Prolog code for a finite state automaton for the mini-language
given in:
http://www.phon.ox.ac.uk/jcoleman/new_SLP/Lecture_1/figure5-18.png
How could you alter it so that it can also accept the following strings?
I want some information.
I want some uh information.
Would you like a first class seat?
How much is a first class seat?
I will need to return in the morning.
3) (Separate task in preparation for next week; nothing especially to do
with the previous tasks.)
a) Download the plain text of "Treasure Island" from http://ota.ox.ac.uk/text/5730.txt
b) Tokenize the text by replacing every white space with a new line.
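Hint: in OSX or Linux, with the text saved as treasureisland.txt, try e.g.
cat treasureisland.txt | sed 's/\ /\
/g' > tokens.txt
(The line break inside the quotes is part of the command: the backslash followed by a newline is how sed is told to insert a newline.)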
In Windows, you can do this by opening treasureisland.txt in a text editor
(e.g. Notepad++) and using search and replace to replace every space with a
newline. (Of course you can do this on other operating systems too.) Then
save the modified file as tokens.txt.
Alternatively, you might find it useful to install Cygwin on your Windows
system so that you can use *nix commands. It's not a full Linux
distribution, just a Unix-like command-line environment that runs on Windows.
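The sed hint above splits only on single spaces, so tabs are left alone and runs of several spaces produce empty lines. As a rough sketch of one alternative (not the method used in class), the standard tr command can turn every run of whitespace into a single newline. The function name tokenize and the sample line are made up for illustration, and the file name treasureisland.txt assumes that is how you saved the download.

```shell
# Turn each run of whitespace (spaces, tabs, newlines) into one newline,
# so that every token ends up on its own line.
# "tokenize" is a hypothetical helper name, not part of the class code.
tokenize() {
  tr -s '[:space:]' '\n'
}

# Small demonstration on a made-up sample line (not taken from the novel).
printf 'the  old sea-dog\tat the inn\n' | tokenize

# Applied to the homework file (assumes it was saved as treasureisland.txt):
#   tokenize < treasureisland.txt > tokens.txt
```

Note that punctuation stays attached to the words, just as with the sed version, and that a file beginning with whitespace will still give one empty first line.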
c) Sort the tokenized text into an alphabetically ordered list. Count the
number of instances of each token, to make a list of words with their
counts. Sort the list of word counts into descending order by count.
Hints: in OSX or Linux try e.g.
cat tokens.txt | sort | uniq -c | sort -nr >wordcounts.txt
In Windows, it takes two command line steps: first
SORT tokens.txt > sorted.txt
then
UNIQUE /count <sorted.txt >wordcounts.txt
(UNIQUE is not a core Windows/DOS command, but you can download it from http://www.richpasco.org/utilities/unique.html)
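To see why the *nix pipeline sorts twice, it may help to run it on a tiny made-up token list first. The first sort is needed because uniq only merges identical lines that are adjacent; uniq -c then prefixes each distinct token with its count, and sort -nr orders those lines numerically, largest first. The sketch below is only illustrative: the function name count_tokens and the six sample tokens are assumptions, not part of the homework files.

```shell
# Count the tokens arriving on standard input, most frequent first.
# "count_tokens" is a hypothetical helper name used for illustration.
count_tokens() {
  sort | uniq -c | sort -nr
}

# Made-up sample: "the" occurs 3 times, "sea" twice, "inn" once.
printf 'the\nsea\nthe\ninn\nthe\nsea\n' | count_tokens

# Applied to the homework file:
#   count_tokens < tokens.txt > wordcounts.txt
```

On the sample, "the" should come out on the first line with a count of 3, followed by "sea" with 2 and "inn" with 1. The amount of padding in front of each count differs between systems.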