Parsing: a quick introduction


1. References

Introductory reading:

Clocksin, W. F. and C. S. Mellish (1981) Programming in Prolog. Springer-Verlag. Ch. 9.

Clocksin and Mellish's Ch. 9 grammar is in the file sentence_grammar.pl.

Further reading:

Pereira, F. C. N. and S. M. Shieber (1987) Prolog and Natural-Language Analysis. CSLI Publications. Sections 2.7 (pp. 29-36), 3.4.2 (pp. 61-62), 3.7 (pp. 70-79).

Gazdar, G. and C. Mellish (1989) Natural Language Processing in Prolog. Addison-Wesley. Chapters 4 and 5.

2. Intuitive Parsing                                                   

Step 1. The string:          the quick brown fox jumps over the lazy dogs
Step 2. Tagging:             DET ADJ ADJ N V P DET ADJ N
Step 3. Project N' heads:    DET ADJ ADJ [N]N' V P DET ADJ [N]N'
Step 4. Add N' modifiers:    DET [ADJ ADJ N]N' V P DET [ADJ N]N'
Step 5. Project NP:          DET [[ADJ ADJ N]N']NP V P DET [[ADJ N]N']NP
Step 6. Add NP specifiers:   [DET ADJ ADJ N]NP V P [DET ADJ N]NP
Step 7. Project PP:          [DET ADJ ADJ N]NP V [P [DET ADJ N]NP]PP
Step 8. Project VP:          [DET ADJ ADJ N]NP [V [P DET ADJ N]PP]VP
Step 9. S:                   [[DET ADJ ADJ N]NP [V P DET ADJ N]VP]S

(Labelled bracketing: [ ... ]C marks a constituent of category C; inner brackets are omitted once a constituent has been established.)
The tree is constructed from frontier to root (bottom-up), as single words are grouped into phrases, phrases into clauses, etc.
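The bottom-up grouping in Steps 2-9 can be sketched as repeated reductions over the tag sequence. The following Python sketch is not from the original notes; it collapses Steps 5 and 6 into a single DET + N' → NP reduction, and all names are illustrative.

```python
# Bottom-up grouping as pattern -> label reductions on the tag sequence.
# (Sketch only: Steps 5-6 are collapsed into one DET + N' -> NP reduction.)

REDUCTIONS = [
    (["N"], "N'"),            # Step 3: project N' heads
    (["ADJ", "N'"], "N'"),    # Step 4: add N' modifiers, one ADJ at a time
    (["DET", "N'"], "NP"),    # Steps 5-6: project NP and add its specifier
    (["P", "NP"], "PP"),      # Step 7: project PP
    (["V", "PP"], "VP"),      # Step 8: project VP
    (["NP", "VP"], "S"),      # Step 9: project S
]

def reduce_once(tags):
    """Apply the first reduction whose pattern matches a window of tags."""
    for pattern, label in REDUCTIONS:
        n = len(pattern)
        for i in range(len(tags) - n + 1):
            if tags[i:i + n] == pattern:
                return tags[:i] + [label] + tags[i + n:]
    return None

def parse_bottom_up(tags):
    """Reduce until no reduction applies; a full parse leaves just ["S"]."""
    while (reduced := reduce_once(tags)) is not None:
        tags = reduced
    return tags

# Step 2 tagging of "the quick brown fox jumps over the lazy dogs":
print(parse_bottom_up(
    ["DET", "ADJ", "ADJ", "N", "V", "P", "DET", "ADJ", "N"]))  # ['S']
```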

3. Real parsing 1. (Top-down) recursive descent parsing

Start symbol: S
Input string: "the quick brown fox jumps over the lazy dogs"

Rules:

1)   S   → NP VP
2)   NP  → (DET) ADJ* N
3)   VP  → V NP
4)   VP  → V PP
5)   DET → the; a; an ...
6)   N   → dogs; fox; jumps ...
7)   ADJ → quick; brown; lazy ...
8)   V   → jumps; runs ...
9)   P   → over; onto; in; under ...
10)  PP  → P NP

Initial state 1:

    Tree so far:   S (no structure built yet)
    Stack ("to do" list):  S
    Remainder:     the quick brown fox jumps over the lazy dogs

"Reach down" from the start symbol towards the string. I.e. "how would I generate this string?"

The only way of getting "down" from S is via rule 1: S → NP VP. So build a little bit of structure and put NP and VP on the stack (the list of symbols remaining to be dealt with).

State of play 2:

    Tree so far:   [S [NP ?] [VP ?]]
    Stack:         NP VP
    Remainder:     the quick brown fox jumps over the lazy dogs

The part of the string that remains to be parsed is called the remainder.

Next step 3: expand the leftmost unexpanded daughter first (i.e. NP), using rule 2: NP → (DET) ADJ* N.

    Tree so far:   [S [NP (DET) ADJ* N] [VP ?]]
    Stack:         (DET) ADJ* N VP
    Remainder:     the quick brown fox jumps over the lazy dogs

Expand leftmost unexpanded daughter:    DET → the; a; an ...

As this is a preterminal rule, we require that one of the terminals on the right hand side is a prefix (= an initial substring) of the remainder of the analysis string for this rule to be applicable. This condition is met in this case, as "the" is a prefix of the remainder.
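The prefix condition on preterminal rules amounts to a one-word check, since each terminal here is a single word. A Python sketch (function names are illustrative, not from the notes):

```python
# Preterminal applicability: a rule such as DET -> the; a; an applies only
# if one of its terminals is a prefix of the remainder.

def match_preterminal(terminals, remainder):
    """Return the new remainder if some terminal is a prefix, else None."""
    for word in terminals:
        if remainder[:1] == [word]:
            return remainder[1:]
    return None

remainder = ["the", "quick", "brown", "fox"]
print(match_preterminal(["the", "a", "an"], remainder))  # DET applies
print(match_preterminal(["over", "onto"], remainder))    # P does not: None
```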

State of play 4:

    Tree so far:   [S [NP [DET the] ADJ* N] [VP ?]]
    Stack:         ADJ* N VP
    Remainder:     quick brown fox jumps over the lazy dogs

Expand ADJ*. "quick" and "brown" are ADJ's, so they can be included in the NP.
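Since each adjective is one word, expanding ADJ* amounts to consuming words from the remainder for as long as they are adjectives. A greedy Python sketch (illustrative only; a full backtracking parser might also need to try shorter matches):

```python
# Greedy expansion of a Kleene-starred preterminal such as ADJ*:
# consume words from the remainder while they belong to the category.

def match_star(terminals, remainder):
    """Consume a maximal run of words drawn from terminals."""
    while remainder and remainder[0] in terminals:
        remainder = remainder[1:]
    return remainder

adjs = ["quick", "brown", "lazy"]
print(match_star(adjs, ["quick", "brown", "fox"]))  # ['fox']
print(match_star(adjs, ["fox"]))                    # ['fox'] (zero matches)
```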

State of play 5:

    Tree so far:   [S [NP [DET the] [ADJ quick] [ADJ brown] N] [VP ?]]
    Stack:         N VP
    Remainder:     fox jumps over the lazy dogs

Expand top of the stack (leftmost unexpanded daughter).

First, N. (Rule 6 N → dogs; fox ...)

Then, VP. (Rule 3 VP → V NP)

Then, V. (Rule 8 V → jumps; runs ...)

Then, NP. (Rule 2 NP → (DET) ADJ* N)

At this stage, the state of play is:



    Tree so far:   [S [NP [DET the] [ADJ quick] [ADJ brown] [N fox]]
                      [VP [V jumps] [NP (DET) ADJ* N]]]
    Stack:         (DET) ADJ* N
    Remainder:     over the lazy dogs


Rules 5 and 7 are both preterminal, but neither of them introduces a prefix of "over the lazy dogs". So we must try other expansions of the most recently expanded nonterminal (backtracking). But there are no other expansions of NP, so we must backtrack again, to the VP node. An alternative expansion of VP is rule 4 (VP → V PP). The parse then continues, and eventually all of the material is included in the parse. When none of the string is left, and there are no more categories left on the stack to deal with, the parse is complete.
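The whole procedure (expand the leftmost category, match preterminals against the remainder, backtrack through alternative rules) can be summarised in a short program. The notes use Prolog; the following is a Python sketch of the same strategy, with ADJ* and the optional (DET) expanded into explicit alternatives, as in the DCG of section 4:

```python
# Recursive descent with backtracking over rules 1-10.
# (Sketch only: names and encodings are illustrative, not from the notes.)

GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["DET", "ADJ", "ADJ", "N"], ["DET", "ADJ", "N"], ["DET", "N"],
           ["ADJ", "ADJ", "N"], ["ADJ", "N"], ["N"]],
    "VP": [["V", "NP"], ["V", "PP"]],
    "PP": [["P", "NP"]],
}
LEXICON = {
    "DET": ["the", "a", "an"],
    "N":   ["dogs", "fox", "jumps"],
    "ADJ": ["quick", "brown", "lazy"],
    "V":   ["jumps", "runs"],
    "P":   ["over", "onto", "in", "under"],
}

def parse(stack, remainder):
    """Succeed iff the stack of categories can consume the whole remainder."""
    if not stack:
        return remainder == []            # done when both are exhausted
    cat, rest = stack[0], stack[1:]
    if cat in LEXICON:                    # preterminal: match one word
        if remainder and remainder[0] in LEXICON[cat]:
            return parse(rest, remainder[1:])
        return False
    for expansion in GRAMMAR[cat]:        # nonterminal: try each rule,
        if parse(expansion + rest, remainder):   # backtracking on failure
            return True
    return False

print(parse(["S"], "the quick brown fox jumps over the lazy dogs".split()))
# True
```

Note that VP → V NP is tried first and fails on "over the lazy dogs", exactly as in the walkthrough above, before backtracking finds VP → V PP.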


4. The simplest parsing program: a Prolog DCG (Definite Clause Grammar)


/* DOG_GRAMMAR.PL */

s --> np, vp.

np --> n.        np --> adj, n.    np --> adj, adj, n.
np --> det, n.    np --> det, adj, n.    np --> det, adj, adj, n.

vp --> v, np.    vp --> v, pp.

pp --> p, np.

det --> [the].    det --> [a].    det --> [an].

n --> [dogs].    n --> [fox].    n --> [jumps].

adj --> [quick].    adj --> [brown].    adj --> [lazy].

v --> [jumps].    v --> [runs].

p --> [over].    p --> [onto].
p --> [in].        p --> [under].


/* Generate all sentences */

loop:- s(S,[]), write(S), nl, fail.

5. Difference lists

These Prolog grammars employ difference-list notation for strings of words.

([the,quick,brown,fox],[quick,brown,fox]) indirectly indicates the single-word string [the], with [quick,brown,fox] left as a remainder.

([a,fish,swims],[]), with an empty list as its remainder, indicates the string [a,fish,swims].

This is a bit baffling to explain, but becomes easy enough once you use trace. to watch how the parser works, step by step.
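A difference-list pair can also be pictured in procedural terms: each category maps an input string to its possible remainders, and sequencing threads the remainder of one category into the next. A Python sketch of this reading (illustrative, not from the notes):

```python
# Difference lists, read procedurally: a category is a function from an
# input list to a list of possible remainders. seq() threads the remainder
# of the first category into the second, as the DCG threads its two lists.

def word(w):
    def recognise(s):
        return [s[1:]] if s[:1] == [w] else []
    return recognise

def seq(first, second):
    def recognise(s):
        return [rest2 for rest1 in first(s) for rest2 in second(rest1)]
    return recognise

det = word("the")
n = word("fox")
np = seq(det, n)     # np --> det, n.

# np consumes [the, fox], leaving [runs] as the remainder:
print(np(["the", "fox", "runs"]))   # [['runs']]
```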

At the Prolog ?- prompt, type

[dog_grammar].

to load and compile dog_grammar.pl

Prolog replies:

dog_grammar consulted
yes

?-


At the Prolog prompt, try any of the following queries:

?- s([the,quick,brown,fox,jumps,over,the,lazy,dogs],[]).

?- s(X,[]).


(Type semicolon-return after the reply to generate additional answers.)

?- s([the,X,jumps,over,the,Y],[]).

(Note that "jumps" is listed as both a verb and a noun.)

Other grammars to consult include: sentence_grammar.pl, syllable_grammar.pl

Try:

?- syllable_sequence([dh,@,k,w,i,k,b,r,a,w,n,f,o,k,s],[]).


More on phonological parsing:

Church, K. W. (1983) Phrase Structure Parsing: A Method for Taking Advantage of Allophonic Constraints. Ph. D. thesis, M. I. T. Distributed by IULC, and also published by Kluwer.

Randolph, M. A. (1989) Syllable-based Constraints on Properties of English Sounds. Ph.D. thesis, M. I. T.

Dirksen, A. (1993) Phrase Structure Phonology. In Ellison, T. M. and J. M. Scobbie, eds. (1993) Computational Phonology. Edinburgh Working Papers in Cognitive Science 8.

Coleman, J. (1993) English word-stress in Unification-based Grammar. In Ellison and Scobbie, eds.