You are on page 1of 9

presented in Figure 1, and to t h a n k his colleagues, Richard 5. CONWAY,R. W., AND WORLEY,W. S.

The Cornell--IIASP
Conway, William Maxwell, and Robert Wagner for their system for the 360/65. Teeh. Rep. No. 68-A1, Office of Com-
puter Services, Cornell U., Ithaca, N. Y.
helpful discussions of the material.
6. DAMERAU,F. A technique for computer detection and correc-
RECEIVED APRIL, 1969; REVISED AUGUST, 1969 tion of spelling errors. Comm. ACM 7, 3 (Mar. 1964), 171-176.
7. DAVIDSON,L. Retrieval of misspelled names in an airlines
REFERENCES passenger reservation system. Comm. ACM 5, 3 (Mar. 1962),
1. ALBERGA, CYRIL, N. String similarity and misspellings. 169-171.
Comm. ACM I0, 5 (May 1967), 302-313. 8. FREEMAN,D. N. Error correction in CORC: The Cornell
2. BLAIR,CHARLESR. A program for correcting spelling errors. Computing Language. Ph.D. Th., Cornell U., Ithaca, N. Y.,
Information and Control 3 (Mar. 1960), 60-67. Sept. 1963.
3. CONWAY,R. W., AND MAXWELL,W.L. CORC--the Cornell 9. GLANTZ, I-I. W. On the recognition of information with a
computing language. Comm. ACM 6 (June 1963), 317-321. digital computer. J. ACM 4, 2 (Apr. 1957), 178-188.
4. CONWAY, R. W., AND MAXWELL, W . L . CUPL---an approach 10. HAMMING,R.W. One Man's View of Computer Science. J.
to introductory computing instruction. Tech. Rep. No. 68-4, ACM 16, 1 (Jan. 1969), 3-12.
Dept. of Computer Science, Cornell U., Ithaca, N. Y. 11. JACKSON,M. Mnemonics, Datamation IS (Apr. 1967), 26-29.

An Efficient Context-Free Parsing gramming languages and of programs which " u n d e r s t a n d "
or translate natural languages.
Algorithm Numerous parsing algorithms have been developed.
Some are general, in the sense that they can handle all
context-free grammars, while others can handle only sub-
JAY EARLEY classes of grammars. The latter, restricted algorithms tend
University of California,* Berkeley, California to be much more efficient. The algorithm described here
seems to be the most efficient of the general algorithms,
and also it can handle a larger class of grammars in linear
time than most of the restricted algorithms. We back up
A parsing algorithm which seems to be the most efficient gen-
these claims of efficiency with both a formal investigation
eral context-free algorithm known is described. It is similar to
and an empirical comparison.
both Knuth's LR(k) algorithm and the familiar top-down algo-
This paper is based on the author's 1968 report [I] where
rithm. It has a time bound proportional to n3 (where n is the many of the points studied here appear in much greater
length of the string being parsed) in general; it has an n2 bound
detail. In Section 2 the terminology used in this paper is
for unambiguous grammars; and it runs in linear time on a large
defined. In Section 3 the algorithm is described informally
class of grammars, which seems to include most practical
and in Section 4 it is described precisely. Section 5 is a
context-free programming language grammars. In an empirical
study of the formal efficiency properties of the algorithm
comparison it appears to be superior to the top-down and
and m a y be skipped by those not interested in this aspect.
bottom-up algorithms studied by Griffiths and Petrlck.
Section 6 has the empirical comparison and in Section 7 the
KEY WORDS AND PHRASES~ syntaxanalysis,parsing, context-free grammar, practical use of the algorithm is discussed.
compilers,computationalcomplexity
CR CATEGORIES: 4.12, 5.22, 5.23 2. Terminology
A language is a set of strings over a finite set of symbols.
We call these terminal symbols and represent them b y
1. I n t r o d u c t i o n
lowercase letters: a, b, c. We use a context-free grammar as a
formal device for specifying which strings are in the set.
Context-free grammars (BNF grammars) have been This grammar uses another set of symbols, the non-
used extensively for describing the syntax of programming terminals, which we can think of as syntactic classes. We
languages and natural languages. Parsing algorithms for use capitals for nonterminals: A, B, C. Strings of either
context-free grammars consequently play a large role in terminals or nonterminals are represented b y Greek letters:
the implementation of compilers and interpreters for pro- a, fl, 3,. The e m p t y string is k. a ~ represents
k times
* Computer Science Department. This work was partially sup-
ported by the Office of Naval Research under Contract No. O/ " ' " Or"
NONR3656(23) with the Computer Center, University of Cali-
fornia, Berkeley, and by the Advanced Research Projects Agency [ a I is the number of symbols in a. There is a finite set of
of the Office of the Secretary of Defense (F44620-67-C-0058), productions or rewriting rules of the form A --~ a. The non-
monitored by the Air Force Office of Scientific Research. terminal which stands for "sentence" is called the root R

94 C o m m u n i c a t i o n s of t h e ACM V o l u m e 13 / Number 2 / February, 1970


of the grammar. The productions with a particular non- A recognizer is an algorithm which takes as input a string
terminal D on their left sides are called the alternatives of and either accepts or rejects it depending on whether or not
D. Hereafter we use grammar to mean context-free gram- the string is a sentence of the grammar. A parser is a recog-
mar. nizer which also outputs the set of all legal derivation trees
We will work with this example grammar of simple for the string.
arithmetic expressions, grammar AE:
3. Informal Explanation
E ---~ T
The following is an informal description of the algorithm
E-~E+T as a recognizer: I t scans an input string Xi . . . X~ from
left to right looking ahead some fixed number k of symbols.
T--~P
As each symbol X~ is scanned, a set of states S~ is con-
T-~T*P structed which represents the condition of the recognition
process at that point in the scan. Each state in the set
P-~a
represents (1) a production such that we are currently
The terminal symbols are { a , + , . } , the nonterminals scanning a portion of the input string which is derived
are {E, T, P}, and the root is E. from its right side, (2) a point in that production which
Most of the rest of the definitions are understood to be shows how much of the production's right side we have
with respect to a particular grammar G. We write a ~ f~ recognized so far, (3) a pointer back to the position in the
if ~I~, ~, ~/, A such that a = ~A~ and fl = ~7/~ and A --+ 7/is input string at which we began to look for that instance of
a production. We write a ~ / ~ (fl is derived from a) if 3 the production, and (4) a k-symbol string which is a
strings no, al , • • • , ~ (m > O) such that syntactically allowed successor to that instance of the
production. This quadruple is represented here as a pro-
duction, with a dot in it, followed b y an integer and a
The sequence no, . . . , a,~ is called a derivation (of ~ from string.
a). For example, if we are recognizing a • a with respect to
A sentential form is a string a such that the root R ~ a. grammar A E and we have scanned the first a, we would be
A sentence is a sentential form consisting entirely of termi- in the state set $1 consisting of the following states (ex-
nal symbols. The language defined by a grammar L ( G ) is cluding the k-symbol strings):
the set of its sentences. We may represent any sentential P --~ a. 0
form in at least one way as a derivation tree (or parse tree)
reflecting the steps made in deriving it (though not the T ---~P . 0
order of the steps). For example, in grammar AE, either T-+ T..P 0
derivation
E --~ T . 0
E~E+T~T+T~T+P~T,P+P
E-~E.+T 0
or
Each state represents a possible parse for the beginning
E~E+T~E+P~T+P~T,P+P
of the string, given that only the a. has been seen. All the
is represented by states have 0 as a pointer, since all the productions repre-
sented must have begun at the beginning of the string.
E
There will be one such state set for each position in 'the
string. To aid in recognition, we place k -4- 1 right termina-
E -4- T tors " -~" (a symbol which does not appear elsewhere in
the grammar) at the right end of the input string.
To begin the algorithm, we put the single state
T p
¢,~.R-~ -~ 0
T * P into state set So, where R is the root of the grammar and
where ~ is a new nonterminal.
The degree of ambiguity of a sentence is the number of its
In general, we operate on a state set St as follows: we
distinct derivation trees, A sentence is unambiguous if it process the states in the set in order, performing one of
has degree 1 of ambiguity. A grammar is unambiguous if three operations on each one depending on the form of the
each of its sentences is unambiguous. A grammar has state. These operations may add more states to S~ and
bounded ambiguity if there is a bound b on the degree of may also put states in a new state set Si.1. We describe
ambiguity of any sentence of the grammar. A grammar is these three operations by example.
reduced if every nonterminal appears in some derivation of In grammar AE, with k = 1, So starts as the single state
some sentence. ~ . E -~ -~ 0 (1)

V o l u m e 13 / N u m b e r 2 / F e b r u a r y , 1970 Communications o f t h e ACM 95


The predictor operation is applicable to a state when there If we finish processing S~ and S~+1 remains empty, an
is a nonterminal to the right of the dot. I t causes us to error has occurred in the input string. Otherwise, we start
add one new state to S~ for each alternative of t h a t non- to process S~+1.
terminal. We put the dot at the beginning of the produc- The third operation, the completer, is applicable to a
tion in each new state, since we have not scanned any of state if its dot is at the end of its production. Thus the
its symbols yet. The pointer is set to i, since the state was completer is applicable to each of these states in S~.
created in S~. Thus the predictor adds to S~ all productions I t compares the look-ahead string with Xi+~ .-- X~+~. If
which might generate substrings beginning at X~+~. they match, it goes back to the state set indicated b y the
In our example, we add to So pointer, in this case So, and adds all states from So which
have P to the right of the dot. I t moves the dot over P in
E ~ .E+T 4 0 (2) these states. Intuitively, So is the state set we were in when
E--~ . T 4 0 (3) we went looking for t h a t P. We have now found it, so we
go back to all the states in So which caused us to look for a
The k-symbol look-ahead string is 4, since it is after E in P, and we move the dot over the P in these states to show
the original state. We must now process these two states. that it has been successfully scanned.
The predictor is also applicable to them. Operating on (2), If X2 = -4-, then the completer applied to (6) causes us
it produces to add to $i
E--~ . E + T + 0 (4) T-+P. 4 0
E-~.T + 0 (5) T'-+ P. + 0
with a look-ahead symbol --? because it appears after E in T-+P. * 0
(2). Operating on (3), it produces
Applying the completer to the second of these produces
T --~ . T . P 4 0
E--IT. 4 0
T--~ .P 4 0
E-aT. -5 0
Here the look-ahead symbol is 4 because T is last in the
T --> T . * P 4 0
production and 4 is its look-ahead symbol. Now, the
predictor, operating on (4) produces (4) and (5) again, T "-+ T . * P -5 0
but they axe already in So, so we do nothing. From (5) it
T ---~ T . * P * 0
produces
and finally, from the second of these, we get
T--~ .T.P -+ 0
¢-~E. 4 4 0
T--+.P + 0
E --~ E . + T 4 0
The rest of So is
E--~ E . + T + 0
T --~ . T*P • 0
The scanner then adds to S~
T - ~ .P * 0
E-+ E+.T 4 0
P - ~ .a 4 0
E-+ E+.T + 0
P --~ .a + 0
If the algorithm ever produces an S~+1 consisting of the
19"--> . a * 0
single state
The predictor is not applicable to any of the last three
states. Instead the scanner is, because it is applicable just
,~E4. 4 0
in case there is a terminal to the right of the dot. The then we have correctly scanned an E and the 4, so we are
scanner compares t h a t symbol with X~+i, and if they finished with the string, and it is a sentence of the grammar.
match, it adds the state to S~+1, with the dot moved over A complete run of the algorithm on grammar A E is
one in the state to indicate t h a t t h a t terminal symbol has given in Figure 1. In this example, we have written as
been scanned. one all the states in a state set which differ only in their
If X1 = a, then S1 is look-ahead string. (Thus " 4 + * " as a look-ahead string
stands for three states, with " 4 ", " + " , and " * " as their
P~a. 4 0 respective look-ahead strings.)
P~a. + 0 (6) The technique of using state sets and the look-ahead are
derived from Knuth's work on L R ( k ) grammars [2]. I n
P-+a. • 0
fact our algorithm bears a close relationship to K n u t h ' s
these states being added b y the scanner. algorithm on L R (k) grammars except for the fact t h a t

96 C o m m u n i c a t i o n s o f t h e ACM V o l u m e 13 / N u m b e r 2 / F e b r u a r y , 1970
GRAMMAR A E
THE RECOGNIZER. This is a function of three argu-
root: E --~T [ E-ST input string = a-sa*a ments R E C ( G , Xi • • • X~, k) computed as follows:
T--~P [ T*P
P--~a LetX,+~ =-~ (1-<i-< k + l ) .
Let S~ be empty (0 ~ i _-<n + 1).
~=1
Add (0,0,0,~~) to So.
So ¢-,.Eq q 0 Sa P --~ a. -~+* 2
(X*fa) E-~ .Eq-T -~ + 0 (X*=*) T--~P. "~+* 2
E--~ .T -~-5 0 E ~ E+T. "~+ 0 For i ~-- 0 step 1 until n do
T--~.T*P 4 + * 0 T--~T.*P "~ + * 2 Begin
T ~ .P ~ -5* 0 Process the states of St in order, performing one of the following
P ~ .a -~ -5* 0 S~ T--~T*.P ~-5" 2
(X~=a) P-~.a ~-5. 4 three operations on each state s = (p, j, f, ~>.
Sl P-~a. q-5* 0 (1) Predictor: If s is nonfinal and Cp¢j+l) is a nonterminal,
(X2=+) T-*P. -~+* 0 S~ P-~a. "I.-{.-* 4 then for each q such that C~¢j+n = Dq, and for each
E-~T. ~-5 0 (X~=-~) T--~T*P. -~-5. 2 fl 6 tIk(Cp¢~+2)..- C~o~) add (q, 0, i, fl> to S~.
T --~T.*P -~-5. 0 E--~ E + T . ~+ 0
* -~ E.q q 0 T-~T.*P ~ +* 2
(2) Completer: If s is final and ~ = Xi+l ... Xi+~, then for
E -~ E.-ST ~-5 0 ¢ -~ E.q q 0 each (q,l,g,~> 6 Sf (after all states have been added to Sf) such
E -~ E.-ST "~-5 0 that C~a+~) = Dp, add (q,l + 1, g, ~>to S~.
S~ E-*E+.T q+ 0 (3) Scanner: If s is nonfinal and C~¢y+n is terminal, then if
(X~=a) T-~.T*P q +* 2 S~ ~Eq. q 0
T --~.P -[ -5* 2 Cp(~+l) = X~+i , add (p, j + 1, f, o~>to S~+~.
P -~ .a q +* 2 If S~+iis empty, return rejection.
If i = n and S~+~ = {(0,2,0,-~>}, return acceptance.
Fro. 1 End
Notice t h a t the ordering imposed on state sets is not
i m p o r t a n t to their meaning b u t is simply a device which
he uses a stack rather t h a n pointers to keep track of w h a t allows their members to be processed correctly b y the
to do after a parsing decision is made. algorithm. Also note t h a t i cannot become greater t h a n n
Note also that, although it did not develop this way, without either acceptance or rejection occurring because
our algorithm is in effect a top-down parser [3] in which of the fact t h a t -~ appears only in production zero. This
we carry along all possible parses simultaneously in such does not really represent a complete description of the
a way t h a t we can often combine like subparses. This cuts algorithm until we describe in detail how all these opera-
down on duplication of effort and also avoids the left- tions are implemented on a machine. The following de-
recursion problem. (A straightforward top-down parser scription assumes a knowledge of basic list processing tech-
m a y go into an infinite loop on g r a m m a r s containing left- niques.
recursion, i.e. A --~ A~.) Implementation
4. The Recognizer (1) For each nonterminal, we keep a linked list of its
alternatives, for use in prediction.
The following is a precise description of the recognition (2) T h e states in a state set are kept in a linked list
algorithm for input string X~ • • • X . and g r a m m a r G. so they can be processed in order.
NOT~TmN. N u m b e r the productions of g r a m m a r G ar- (3) I n addition, as each state set S~ is constructed, we
bitrarily 1, . . . , d - 1, where each production is of the put entries into a vector of size i. The f t h entry in this
form vector (0 =< f ~ i) is a pointer to a list of all states in St
with pointer f, i.e. states of the form ~p, j, f, ~} 6 S~ for
D~--~ C ~ "-- C ~ (1 _-< p ~ d -- 1)
some p, j, ~. Thus, to test if a state (p, j, f, ~) has already
where p is the n u m b e r of symbols on the right-hand side been added to S~, we search through the list pointed to
of the p t h production. Add a 0th production b y the f t h entry in this vector. (This takes an amount of
time independent of f . ) T h e vector and lists can be dis-
D0 --+ R -~ carded after S~ is constructed.
(4) For the use of the completer, we also keep, for each
where R is the root of G, and -~ is a new terminal symbol.
state set S~ and nonter~final N, a list of all states <p, j, f, ~)
Definition. A state is a quadruple (p, j, f, a> where p, j,
6 S~ such t h a t C~¢i+1) -- N.
and f are integers ( 0 ~ p = < d - - 1) ( 0 = < j E T ) (0__<f
(5) I f the g r a m m a r contains null productions (A --~ ~),
n -t- 1 ) and a is a string consisting of k terminal symbols.
we cannot implement the completer in a straightforward
state set is an ordered set of states. A final state is one in
way. When performing the completer on a null state
which j = ~. We add a state to a state set b y putting it
(A-~. a i) we want to add to Si each state in S~ with A
last in the ordered set unless it is already a member.
to the right of the dot. But one of these m a y not have been
Definition. Hk ( ' y ) = {ce [ a is terminal, ] a [ = k, and
added to St yet. So we must note this and check for it
fl such t h a t 7 ~ a~}. when we add more states to S~.
H~ (~) is the set of all k-symbol terminal strings which The above implementation description is not m e a n t to
begin some string derived from % This is used in forming be the only way or the best way to implement the al-
the look-ahead string for the states. gorithm. I t is merely a method which does allow the

Volume 13 / N u m b e r 2 / February, 1970 Communications of the ACM 97


algorithm to achieve the time and space bounds which t h a n one way (this can happen; t h a t ' s why we must test
we quote in Section 5. for the existence of a state before we add it to a state set)
The correctness of this recognizer has been proved in then it took at most H i steps to do the operation.
[1]. I t requires no restrictions of any kind on the context- I n the case t h a t the g r a m m a r is unambiguous and re-
free g r a m m a r to be successful. duced, we can show t h a t each such state gets added in
only one way. Assume t h a t the state (q, j + 1, f, a) is added
5. Time and Space Bounds to Si in two different ways b y the completer. T h e n we h a v e
To develop some idea of the efficiency of the algorithm, Sl = ( p l , PPl, f l , Xi,1 . . . Xi+k) C S~,
we can use a formal model of a computer and measure the
time as the n u m b e r of primitive steps executed b y this s2 = (p2, P2, f2, X~+i . . . X~+k) E S~,
model and the space as the n u m b e r of storage locations (q, j, f, a) E Ss, and Sf2 , Dp~ = Cq(i+l) = D~2
used. We use a random access model (described in the
Appendix) because we feel t h a t this model represents most and either pl ~ p~ or fl ~ f2, for otherwise sl and s2 would
accurately the properties of real computers which are be the same state.
relevant to syntax analysis. So we have
We are interested in upper bounds on the time (and less X~ . . . X s C ~ . - . Cq(i+l) ~ X~ . . . Xf~ C ~ ..- C~
i m p o r t a n t the space) as a function of n (the length of the
.
input string) for various classes of context-free grammars. X1 • • • Xi
Specifically, an n 2 algorithm for a subclass A of g r a m m a r s and
means t h a t there is some n u m b e r C (which m a y depend
on the size of the grammar, but not on n), such t h a t Cn 2 Xl "'" XfCql "'" Cq(3+l ) ~ Xl "'" Xf2C!a21 "'" Cp2~2

is an upper bound on the n u m b e r of primitive steps re-


X1 . . . X ,
quired to parse a n y string of length n with respect to a
g r a m m a r in class A. and since pl = p2 and fl = f2 cannot both be true, the
THE GENERAL CASE. Our algorithm is an n 3 recognizer above two derivations of X~ . . - X¢ are represented b y
in general. The reasons for this are: different derivation trees. Therefore since the g r a m m a r is
(a) The n u m b e r of states in any state set S~ is propor- reduced, every nonterminal generates some terminal string,
tional to i (--~i) because the ranges of the p, j, and a and so there exists an ambiguous sentence X ~ • . . X ~ a for
components of a state are bounded, while only the f com- some a.
ponent depends on i, and it is bounded b y n. So if the g r a m m a r is unambiguous, the completer ex-
(b) The scanner and predictor operations each execute ecutes ~ i steps per state set and the time is bounded b y
a bounded n u m b e r of steps per state in any state set. So n 2. Notice t h a t the time is also n 2 for g r a m m a r s with
the total time for processing the states in S~ plus the scanner bounded ambiguity since each state can then be added b y
and predictor operations is ~-~i. the completer only a bounded n u m b e r of times. I n [1] we
(c) T h e completer executes ~--~i steps for each state it show t h a t the time is n 2 for an even larger class of gram-
processes in the worst case because it m a y have to add ~ f mars, and thereby also obtain Younger's n ~ results for
states for Si, the state set pointed back to. So it takes linear and metalinear g r a m m a r s [6].
~ i 2 steps in S i. Kasaini [7] has also obtained independently the result for
(d) Summing from i = 0, • .. , n + 1 gives ~ n ~ steps. unambiguous grammars, but his algorithm (which is a
This bound holds even if the look-ahead feature is not modification of Cocke's) has the disadvantage t h a t it re-
used (by setting/c = 0). The bound is no better t h a n t h a t quires the g r a m m a r in normal form. His algorithm, like
obtained b y Younger [4] for Coeke's algorithm [5], b u t ours, achieves its time bound on a r a n d o m access machine
our algorithm is better for two reasons. Ours does not only.
require the g r a m m a r to be put into a n y special form LINEAR TIME. We now characterize the class of gram-
(Coeke's required normal form), and ours actually does mars which the algorithm will do in time n. We notice t h a t
better t h a n n 3 on most grammars, as we shall show (Cocke's for some g r a m m a r s the n u m b e r of states in a state set can
always requires n3). Furthermore, although Younger's n 3 grow indefinitely with the length of the string being recog-
result is obtained on a Turing machine, his algorithm is in nized. For some others there is a fixed bound on the size
no w a y made faster b y putting it on a r a n d o m access of any state set. We call the latter g r a m m a r s b o u n d e d state
machine. grammars. T h e y can be done b y the algorithm in time n for
UNAMBmUOUS GRAMMARS. T h e completer is the only the following reason. Let b be the bound on the n u m b e r of
operation which forces us to use i 2 steps for each state set states in a n y state set. T h e n the processing of the states
we process, making the whole thing n ~. So the question is, in a state set together with the scanner and predictor re-
in w h a t cases does this operation involve only i steps in- quires ~ b steps, and the completer requires ~ b 2 steps.
stead of i2? I f we examine the state set S~ after the com- Summing over all the state sets gives us ~ b 2 n or , ~ n
pleter has been applied to it, there are at most proportional steps.
to i states in it. So unless some of t h e m were added in more So the class of time n g r a m m a r s for our algorithm in-

98 Communications o f t h e ACM V o l u m e 13 / N u m b e r 2 / F e b r u a r y , 1970


eludes the b o u n d e d state grammars, b u t it actually in- T h e following examples illustrate some of the ideas in
eludes more. We n o w examine h o w this class of g r a m m a r s this section. G r a m m a r U B D A (Figure 2) actually re-
compares with those t h a t can be done in linear time b y requires time proportional to ¢ . Notice t h a t state
other algorithms. M o s t of the previously mentioned "re-
stricted" algorithms w o r k on some subclass of g r a m m a r s A --~ A A . ~x 02
which t h e y can do in time n, and we will henceforth call gets added twice b y the completer. This is signified b y the
t h e m time n algorithms. K n u t h ' s L R ( k ) algorithm [2] superscript on the 0. One can tell b y the looks of the super-
works on a class of g r a m m a r s which includes those of scripts on states
just a b o u t all the others, so his will be a good one for com-
parison if we expect to do well. A ---~ A A . ~x 2
I t turns o u t t h a t almost all L R ( k ) g r a m m a r s are
A --~ A A . ~x 12
b o u n d e d state (except for certain right recursive gram-
mars). A n d even t h o u g h some L R (k) g r a m m a r s m a y n o t A~AA. ~x 0~
be b o u n d e d state, all of t h e m can be done in time n b y
our algorithm if a look-ahead of k or greater is used. I n in $4 t h a t there are H i states, each of which is added H i
fact a n y finite union of L R (k) g r a m m a r s (obtained b y times, in ~-~n state sets, producing the n 3 behavior.
combining the g r a m m a r s in a straightforward w a y in order G r a m m a r B K (Figure 3) has u n b o u n d e d ambiguity, b u t
to generate the union of the languages) is a time n gram- it is a time n g r a m m a r , a n d in fact it is b o u n d e d state.
m a r for our algorithm given the proper look-ahead. (This This can be seen because all the state sets after Si are the
is proved in [1].) same size. G r a m m a r P A L (Figure 4) is an u n a m b i g u o u s
I t is here t h a t the look-ahead feature of the algorithm g r a m m a r which requires time n 2. S~ and S~+1 each h a v e
is m o s t obviously useful. W e can obtain the n 3 a n d n 2 re- i + 4 states in them, up to and including S~, so the total
sults w i t h o u t it, b u t we cannot do all L R (k) g r a m m a r s in n u m b e r of states is ~-~n2.
time n w i t h o u t it. I n addition, a look-ahead of k = 1 is SPACE. Since the space is t a k e n up b y ~ n state sets,
a good practical device for cutting down on a lot of ex- each containing N n states, t h e space b o u n d is n 2 in general.
traneous processing with m a n y c o m m o n grammars. This is comparable to Cocke's and K a s a m i ' s algorithms,
T h e time n g r a m m a r s for our algorithm, then, include which also require n 2. However, it has the a d v a n t a g e over
b o u n d e d state grammars, finite unions of L R (k) gram- Cocke's in t h a t the n 2 is only an u p p e r b o u n d for ours,
mars, a n d others. T h e y include m a n y g r a m m a r s which while his requires n 2 all the time.
are ambiguous, and some with u n b o u n d e d degree of a m -
biguity, b u t u n f o r t u n a t e l y there are also u n a m b i g u o u s
g r a m m a r s which require time n 2.
GRAMMAR BK
root: K --~ ] KJ sentences: x~ (n ->_ 0)
GRAMMAR UBDA J~FII
root: A - + x [AA sentences: X" (n >= 1) F-+x
X-~x
REC(UBDA, x 4, 1) REC(BK, x", 1)
S0 4,-+.Aq -i 0 8a A - + x . qx 2
A--,.x -ix 0 A--~AA. -ix 1 So 4 , ~ . K - I -i o Si (2 =< i ~ n)
A--+.AA -ix 0 A-+AA. -ix 02 K--*. "Ix 0 F --~ x. -ix i-1
81 A u x . -ix 0 A-+A.A qx 2 K--~.KJ -ix 0 F-->x. -ix i--1
4, ~ A . - I -i o A--~A.A -ix 1 4, ~ K.-I -i 0 J oF. -ix i -- 1
A--~A.A -ix 0 4, -+A.-I -i 0 K-+K.J -ix 0 ,1 - ~ I . -Ix i-- 1
A-+.x -ix 1 A-+A.A -ix 0 J --+ .F ~x 0 K-+KJ. -ix 02
A-+.AA -ix 1 A--~.x qx 3 J-+.I -ix 0 K-+K.J -ix 0
A-+.AA -ix 3 F~.x -ix 0 4, -~ K.-I 0
82 A ~ x . -ix 1 I-~.x -Ix 0 J-+.F -ix i
A---~AA. qx 0 $4 A - + x . -ix 3
A~_AA. -ix 2 J ~ .I -ix i
A--~A.A -ix 1 S~ F ~ x . -ix 0 F--~.x -ix i
4,---~A.-I -i 0 A-+AA. -ix i~
A--+AA. -~x 0 3 I ~x. -ix 0 I --~ .x -ix i
A-~A.A -ix 0 J ~F. -ix 0
A-+.x -ix 2 A--+A.A -ix 3
A--~A.A -ix 2 J -+I. -ix 0 Sn%l
A--~.AA ~x 2 K-+KJ. -ix 02 ¢-~K~. ~ 0
A--~A.A -~x 1
4, ~ h . - i -i 0 K--~K.J -ix 0
A--~A.A -ix 0 4,~K.-I -i o
A--~ .x -ix 4 J --~ .F -ix 1
A--+.AA -ix 4 J --~ .I -ix 1
F~.x -ix I
Ss 4,--'~A-I. -i 0 I-+.x -ix 1

FIG. 2 FIG. 3

V o l u m e 13 / N u m b e r 2 / F e b r u a r y , 1970 Communications of t h e ACM 99


GRAMMAR P A L G1 G2 G3 G4
root: root: root: root:
A--~ x [ xAx sentences : x" (n => 1, n o d d ) S --+ Ab S-taB S - + ab } aSb S ---~AB
A - ~ a [ Ab B-*aB [b A--~a[Ab
REC(PAL, xs,0):
B-~bc l bB I Bd
So ¢~ -+.A-~ 0 $4 A ~ x A x . 1
G r a m - Sen-
A .x 0 A--~x. 3 mar tence TD STD BU SBU Ours
A .xAx 0 A-+x.Ax 3 G1 ab n (n~+7n (n2+7n 9n+5 9n+5 4n+7

S~ A --+X. 0 A ----~xA.x 0 +2)/2 +2)/2


G2 anb 3 n + 2 2n+2 11.2n + 7 4n+4 4n+4
x.Ax 0 A --~ xA.x 2
A G3 anb n 5n -- 1 5n -- 1 11.2n-~ -- 5 6n 6n + 4
¢ Ad 0 A--~ .x 4 G4 abncd ~ 2 n+6 ~ 2 n+2 ~ 2 n+s (n 3 + 21n 2 + 463 183 + 8
A .X
1 A ~ .xAx 4 + 15)/3
A -+ . x A x i $5 A ~ xAx. 0
1 A --~ xAx. 2 FIG. 5
$2 A - + x .
A--~ x . A x 1 A-+x. 4
A - * x.Ax 4
A ~ xA.x 0
2 qa --~ A.-~ 0 PROPOSITIONAL CALCULUS GRMViMAR
A ---~ .x
A --~ . xAx 2 A --+ xA.x 1
A -* xA.x 3 root: F - - ~ C I S I P ] U
S~ A --~ xAx. 0 A - * .x 5 C~UDU
A ---> x . 2 A -* xAx. 5 U~(F) I~UIL
A-~, x.Ax 2 L--~L' Ip [q Ir
4' ~A.-~ 0 S-~UV S [ UVU
A ~ xA.x 1 S~ ~b --+A-~. 0 P--~U AP [ UAU
A-.-~ . x 3 Sentence Length PA SBU OUrs
A -+ . xAx 3 p 1 14 18 28
(PAq) 5 89 56 68
FIG. 4 (p'Aq)VrVpVq' 13 232 185 148
p~((q3N(r'V(pAq))) 26 712 277 277
(q'Vr)
N ( , ~ P ' A (qVr)Ap')) 17 1955 223 141
6. Empirical Results ((pAq)V (qAr) V ( r A p ' ) ) 38 2040 562 399
3N((p'Vq')A(r'Vp))
We have programmed the algorithm and tested it against
t h e t o p - d o w n a n d b o t t o m - u p p a r s e r s e v a l u a t e d b y Griffiths FIG. 6
a n d P e t r i c k [8]. T h e s e a r e t h e oldest of t h e c o n t e x t - f r e e
parsers, a n d t h e y d e p e n d h e a v i l y on b a c k t r a c k i n g . P e r -
h a p s b e c a u s e of this, t h e i r u p p e r b o u n d s for t i m e a r e GRAMMAR GRE
e x p o n e n t i a l (C" for some c o n s t a n t C ) . H o w e v e r , t h e y root: X - - + a l X b l Y a
Y - + e [YdY
also c a n do well on some g r a m m a r s , a n d b o t h h a v e b e e n
Sentence Length PA SBU Ours
u s e d in n u m e r o u s c o m p i l e r - c o m p i l e r s , so i t will b e in- ededea 6 35 52 33
teresting to compare our algorithm with them. ededeab 4 10 75 92 45
T h e Griffiths a n d P e t r i c k d a t a is n o t in t e r m s of a c t u a l ededeab 1° 16 99 152 63
r u n n i n g t i m e s b u t in t e r m s of " p r i m i t i v e o p e r a t i o n s . " ededeab 2°° 206 859 2052 633
T h e y h a v e expressed t h e i r a l g o r i t h m s as sets of n o n d e - (ed)%abb 12 617 526 79
(ed)7eabb 18 24352 16336 194
t e r m i n i s t i c r e w r i t i n g rules for a T u r i n g - m a c h i n e - l i k e device. (ed)Seabb 20 86139 54660 251
E a c h a p p l i c a t i o n of one of t h e s e is a p r i m i t i v e o p e r a t i o n .
W e h a v e chosen as o u r p r i m i t i v e o p e r a t i o n t h e a c t of FIO. 7
a d d i n g a s t a t e t o a s t a t e set (or a t t e m p t i n g to a d d one
w h i c h is a l r e a d y there). W e feel t h a t this is c o m p a r a b l e t o
t h e i r p r i m i t i v e o p e r a t i o n b e c a u s e b o t h are in s o m e sense GRAM.MAR NSE
the most complex operation performed by the algorithm root: S--+ AB
whose c o m p l e x i t y is i n d e p e n d e n t of t h e size of t h e g r a m - A ~ a 1SC
m a r or i n p u t string. B--~b I DB
C-~c
W e c o m p a r e t h e a l g o r i t h m s on seven different g r a m m a r s .
D--~d
T w o of t h e i r e x a m p l e s were n o t used b e c a u s e t h e e x a c t Sentence Length SBU Ours
g r a m m a r w a s n o t given. F o r t h e first four, Griffiths a n d adbcddb 7 43 44
P e t r i c k were able to find closed-form expressions for t h e i r ad3bcbcd~bcd4b 18 111 108
results, so we d i d also ( F i g u r e 5). B U a n d T D a r e t h e adbcd2bcd6bcd3b 19 117 114
ad18b 20 120 123
bottom-up and top-down algorithms respectively and SBU
a(bc)~da(bcd)Mbcd4b 24 150 141
a n d S T D a r e t h e i r selective versions. I t is o b v i o u s f r o m a (bcd) Mbcd3bcb 16 100 95
t h e s e results t h a t S B U is b y f a r t h e b e s t of t h e o t h e r al-
g o r i t h m s , a n d t h e rest of t h e i r d a t a b e a r s t h i s out. T h e r e - FIG. 8

100 C o m m u n i c a t i o n s of t h e ACM V o l u m e 13 / Number 2 / February, 1970


fore we compare our algorithm with SBU only. We used In this way, when we reach the terminating state
our algorithm with k = 0. The two are comparable on ~b --~ R ~ . 0 we will have the parse tree for the sentence
G1, G2, G3, the simple grammars, but on G4, which is hanging from R if it is unambiguous, and otherwise we will
very ambiguous, ours is clearly superior--n to n *. have a factored representation of all possible parse trees.
For the next three grammars we present only the raw In [1] a precise description of this process is given.
data (Figures 6-8). The data for our algorithm was ob- The time bounds for the parser are the same as those of
tained b y programming it and having the program com- the recognizer, while the space bound goes up to n 3 in
pute the number of primitive operations it performed. We general in order to store the parse trees.
have also included the data from [8] on PA, the predictive We recommend using consistently a look-ahead of k = 1.
analyzer, which is a modified top-down algorithm. On the In fact it would probably be most efficient to implement
propositional calculus grammar, PA seems to be running the algorithm to do just a look-ahead of 1. To implement
in time n 2, while both SBU and ours run in time n, with the full look-ahead for any k would be more costly in
ours a little faster. Grammar G R E produces two kinds programming effort and less efficient overall since so few
of behavior. All three algorithms go up linearly with the programming languages need the extra look-ahead. Most
number of "b" 's, with SBU using a considerably higher programming languages use only a one character context
constant coefficient. However, P A and SBU go up ex- to disambiguate their constructs, and if two characters
ponentially with the number of "ed" 's, while ours goes are needed in some cases, our algorithm has the nice
up as the square. Grammar NSE is quite simple, and each property that it will not fail, it may just take a little longer.
algorithm takes time n with the same coefficient. Our algorithm has the useful property that it can be
So we conclude that our algorithm is clearly superior modified to handle an extension of context-free grammars
to the backtracking algorithms. I t performs as well as which makes use of the Kleene star notation. In this nota-
the best of them on all seven grammars and is substan- tion: A --~ { B C } . D means A may be rewritten as an
tially faster on some. arbitrary number (including 0) of BC's followed b y a D.
There are at least four distinct general context-free al- I t generates a language equivalent to that generated b y
gorithms besides o u r s - - T D , BU, Kasami's n 2, and Cocke's A --~ D ]BCA. However, the parse structure given to the
n ~. We have shown so far that our algorithm achieves time language is different in the two grammars:
bounds which are as good as those of any of these al-
gorithms, or better. However, we are also interested in A A
how our algorithm compares with these algorithms in a
practical sense, not just at an upper bound. B C A
We have just presented some empirical results in this
section which indicate t h a t our algorithm is better than
T D and BU. Furthermore, our algorithm must be su- B C B C D B C A
perior to Cooke's since his always achieves its upper bound
of n 8. This leaves Kasami's. His algorithm [7] is actually I
D
described as an algorithm for unambiguous grammars, b u t
it can easily be extended to a general algorithm. I n this Structures like t h a t on the left cannot be obtained using
form we suspect that it will have an n 3 bound in general context-free grammars at all, so this extension is useful.
and will be n 2 as often as ours. We are aware of no results The modification to our algorithm which implements it
about the class of grammars that it can parse in time n. involves two additional operations:
(1) Any state of the form
7. T h e P r a c t i c a l Use o f t h e A l g o r i t h m
A ~ o.{,e}*~ f
In this section we discuss the question, in what areas and
in what form can the algorithm best be put to use? is replaced b y
TH~ FoRm. Before we can do much with it, we must
A ~ o~{.~}*~ f
make the recognizer into a parser. This is done b y altering
the recognizer so that it builds a parse tree as it does the A ---~ ,x{~}*.~ f
recognition process. Each time we perform the comple-
indicating that fl may be present or absent.
ter operation adding a state E - ~ a D . ~ g (ignoring look-
(2) Any state of the form
ahead) we construct a pointer from the instance of D in
that state to the state D --~ 7. f which caused us to do the
operation. This indicates that D was parsed as 7. In case
is replaced by
D is ambiguous there will be a set of pointers from it,
one for each completer operation which caused E -+ a D . flg A--+ ,~{ .~}*~ f
to be added to the particular state set. Each symbol in
A ---+ o~{~}*.~ f
.y will also have pointers from it (unless it is terminal),
and so on, thus representing the derivation tree for D. indicating the fl may be repeated or not.

V o l u m e 13 / N u m b e r 2 / February, 1970 C o m m u n i c a t i o n s o f t h e ACM 101


THE UsE. The algorithm will probably be most useful grammars and Kasami's (at least in his paper) only on
in natural language processing systems where the full power unambiguous ones, but ours works on them all and seems
of context-free grammars is used. I t should also be useful to do about as well as other algorithms automatically.
in compiler writing systems and extendible languages. In
Appendix
most compiler writing systems and extendible languages,
the programmer is allowed to express the syntax (or the RANDOM ACCESS MACHINE. This model has an un-
syntax extension) of his language in something like B N F , bounded number of registers (counters), each of which m a y
and the system uses a parser to analyze subsequent pro- contain any nonnegative integer. These registers are named
grams in this language. Programming language grammars (addressed) by successive nonnegative integers. T h e primi-
tend to lie in a restricted subset of context-free grammars tive operations which are allowed on these registers are as
which can be processed efficiently, yet some compiler follows:
writing systems in fact use general parsers, so ours may be (1) Store 0 or the contents of one register into another.
of use here. In addition to its efficiency properties, ours (2) Test the contents of one register against 0 or against
has the advantage that it accepts the grammar in the form the contents of another register for equahty.
in which it is written, so t h a t semantic routines can be (3) Add 1 or subtract 1 from the contents of a register
associated with productions without fear t h a t the parser (taking 0 - 1 = 0).
will not reflect the original structure of the grammar. (4) Add the contents of one register to another.
Our algorithm will not compete so favorably with the The control for this model is a normal finite state device.
time n algorithms, however. Certainly ours will do in time T h e most important property of this machine is t h a t in the
n any grammar that a time n parser can do at all, but above four operations, the register R to be operated on
this does not take into account the constant coefficient of n. m a y be specified in two ways:
Most of the time n algorithms really consist of a two-fold (1) R is the register whose address is n (register n).
process. First they compile from an acceptable grammar a (2) R is the register whose address is the contents of
parser for t h a t particular grammar, and then the grammar register n.
may be discarded and the compiled parser used directly to This second mode (sometimes called indirect addressing)
analyze strings. This allows the time n algorithms to in- plus primitive operation 4 (used for array accessing) gives
corporate much specialized information into the compiled our model the random access property. T h e time is meas-
parser, thus reducing the coefficient of n to something ured by the number of primitive operations performed, and
quite small--probably an order of magnitude less than the space is measured by the number of registers used in
that of our algorithm. any of these operations.
Consequently we have developed a compilation process Acknowledgments. I am deeply indebted to Robert Floyd
for our algorithm which works only on time n grammars for his guidance in this research. I also benefited from dis-
and reduces our coefficient to approximately the same order cussions with Albert Meyer, Rudolph Krutar, and James
of magnitude as those of the time n parsers. This may make Gray, and from detailed criticisms by the referees.
our algorithm competitive with them, but we have not RECEIVED FEBRUARY, 1969; REVISED JUNE, 1969
implemented and tested it, so this is speculation. Some sort
REFERENCES
of efficient time n parser for a larger class of grammars is
needed, however, because most restricted parsers suffer 1. EXRLEr, J. An efficient context-free parsing algorithm. Ph.D.
Thesis, Comput. Sei. Dept., Carnegie-Mellon U., Pittsburgh,
from the problem that the grammar one naturally writes
Pa., 1968.
for many programming languages is not acceptable to 2. KNUTH, D. E. On the translation of languages from left to
them, and much fiddling must be done with the grammar right. Information and Control 8 (1965), 607-639.
to get it accepted. Knuth's algorithm is an exception to 3. FLOYD, R. W. The syntax of programming languages--a
this, but it has the problem that the size of the compiled survey. IEEE Trans. EC-13, 4 (Aug. 1964).
4. YOUNGER, D. H. Recognition and parsing of context-free
parser is much too great for reasonable programming lan- languages in time n 3. Information and Control 10 (1967), 189-
guage grammars (see [1, p. 129]). Unfortunately, our 208.
compiled algorithm, since it is similar to Knuth's, may 5. HAys, D. Automatic language-data processing. In Computer
also have these problems. Applications in the Behavioral Sciences, H. Borko (Ed.)
Prentice Hall, Englewood Cliffs, N.J., 1962.
8. Conclusion 6. YOUNGER, D . H . Context-free language processing in time n 3.
General Electric R & D Center, Schenectady, N.Y., 1966.
In conclusion let us emphasize t h a t our algorithm not 7. KASAMI, T., AND TORII, K. A syntax-analysis procedure for
only matches or surpasses the best previous results for unambiguous context-free grammars. J. ACM 16, 3 (July
times n 3 (Younger), n 2 (Kasami) and n (Knuth), but it 1969), 423--431.
8. GRIFFITHS, T., AND PETRICK, S. On the relative efficiencies of
does this with one single algorithm which does not have context-free grammar recognizers. Comm. ACM 8, 5 (May
specified to it the class of grammars it is operating on and 1965), 289-300.
does not require the grammar in any special form. I n 9. FELDMAN, J., AND GRIES, D. Translator writing systems.
other words, Knuth's algorithm works only 'on L R ( k ) Comm. ACM 11, 2 (Feb. 1968), 77-113.

102 C o m m u n i c a t i o n s of t h e ACM V o l u m e 13 / Number 2 / February, 1970

You might also like