Professional Documents
Culture Documents
The Cornell--IIASP
Conway, William Maxwell, and Robert Wagner for their system for the 360/65. Teeh. Rep. No. 68-A1, Office of Com-
puter Services, Cornell U., Ithaca, N. Y.
helpful discussions of the material.
6. DAMERAU,F. A technique for computer detection and correc-
RECEIVED APRIL, 1969; REVISED AUGUST, 1969 tion of spelling errors. Comm. ACM 7, 3 (Mar. 1964), 171-176.
7. DAVIDSON,L. Retrieval of misspelled names in an airlines
REFERENCES passenger reservation system. Comm. ACM 5, 3 (Mar. 1962),
1. ALBERGA, CYRIL, N. String similarity and misspellings. 169-171.
Comm. ACM I0, 5 (May 1967), 302-313. 8. FREEMAN,D. N. Error correction in CORC: The Cornell
2. BLAIR,CHARLESR. A program for correcting spelling errors. Computing Language. Ph.D. Th., Cornell U., Ithaca, N. Y.,
Information and Control 3 (Mar. 1960), 60-67. Sept. 1963.
3. CONWAY,R. W., AND MAXWELL,W.L. CORC--the Cornell 9. GLANTZ, I-I. W. On the recognition of information with a
computing language. Comm. ACM 6 (June 1963), 317-321. digital computer. J. ACM 4, 2 (Apr. 1957), 178-188.
4. CONWAY, R. W., AND MAXWELL, W . L . CUPL---an approach 10. HAMMING,R.W. One Man's View of Computer Science. J.
to introductory computing instruction. Tech. Rep. No. 68-4, ACM 16, 1 (Jan. 1969), 3-12.
Dept. of Computer Science, Cornell U., Ithaca, N. Y. 11. JACKSON,M. Mnemonics, Datamation IS (Apr. 1967), 26-29.
An Efficient Context-Free Parsing gramming languages and of programs which " u n d e r s t a n d "
or translate natural languages.
Algorithm Numerous parsing algorithms have been developed.
Some are general, in the sense that they can handle all
context-free grammars, while others can handle only sub-
JAY EARLEY classes of grammars. The latter, restricted algorithms tend
University of California,* Berkeley, California to be much more efficient. The algorithm described here
seems to be the most efficient of the general algorithms,
and also it can handle a larger class of grammars in linear
time than most of the restricted algorithms. We back up
A parsing algorithm which seems to be the most efficient gen-
these claims of efficiency with both a formal investigation
eral context-free algorithm known is described. It is similar to
and an empirical comparison.
both Knuth's LR(k) algorithm and the familiar top-down algo-
This paper is based on the author's 1968 report [I] where
rithm. It has a time bound proportional to n3 (where n is the many of the points studied here appear in much greater
length of the string being parsed) in general; it has an n2 bound
detail. In Section 2 the terminology used in this paper is
for unambiguous grammars; and it runs in linear time on a large
defined. In Section 3 the algorithm is described informally
class of grammars, which seems to include most practical
and in Section 4 it is described precisely. Section 5 is a
context-free programming language grammars. In an empirical
study of the formal efficiency properties of the algorithm
comparison it appears to be superior to the top-down and
and m a y be skipped by those not interested in this aspect.
bottom-up algorithms studied by Griffiths and Petrlck.
Section 6 has the empirical comparison and in Section 7 the
KEY WORDS AND PHRASES~ syntaxanalysis,parsing, context-free grammar, practical use of the algorithm is discussed.
compilers,computationalcomplexity
CR CATEGORIES: 4.12, 5.22, 5.23 2. Terminology
A language is a set of strings over a finite set of symbols.
We call these terminal symbols and represent them b y
1. I n t r o d u c t i o n
lowercase letters: a, b, c. We use a context-free grammar as a
formal device for specifying which strings are in the set.
Context-free grammars (BNF grammars) have been This grammar uses another set of symbols, the non-
used extensively for describing the syntax of programming terminals, which we can think of as syntactic classes. We
languages and natural languages. Parsing algorithms for use capitals for nonterminals: A, B, C. Strings of either
context-free grammars consequently play a large role in terminals or nonterminals are represented b y Greek letters:
the implementation of compilers and interpreters for pro- a, fl, 3,. The e m p t y string is k. a ~ represents
k times
* Computer Science Department. This work was partially sup-
ported by the Office of Naval Research under Contract No. O/ " ' " Or"
NONR3656(23) with the Computer Center, University of Cali-
fornia, Berkeley, and by the Advanced Research Projects Agency [ a I is the number of symbols in a. There is a finite set of
of the Office of the Secretary of Defense (F44620-67-C-0058), productions or rewriting rules of the form A --~ a. The non-
monitored by the Air Force Office of Scientific Research. terminal which stands for "sentence" is called the root R
96 C o m m u n i c a t i o n s o f t h e ACM V o l u m e 13 / N u m b e r 2 / F e b r u a r y , 1970
GRAMMAR A E
THE RECOGNIZER. This is a function of three argu-
root: E --~T [ E-ST input string = a-sa*a ments R E C ( G , Xi • • • X~, k) computed as follows:
T--~P [ T*P
P--~a LetX,+~ =-~ (1-<i-< k + l ) .
Let S~ be empty (0 ~ i _-<n + 1).
~=1
Add (0,0,0,~~) to So.
So ¢-,.Eq q 0 Sa P --~ a. -~+* 2
(X*fa) E-~ .Eq-T -~ + 0 (X*=*) T--~P. "~+* 2
E--~ .T -~-5 0 E ~ E+T. "~+ 0 For i ~-- 0 step 1 until n do
T--~.T*P 4 + * 0 T--~T.*P "~ + * 2 Begin
T ~ .P ~ -5* 0 Process the states of St in order, performing one of the following
P ~ .a -~ -5* 0 S~ T--~T*.P ~-5" 2
(X~=a) P-~.a ~-5. 4 three operations on each state s = (p, j, f, ~>.
Sl P-~a. q-5* 0 (1) Predictor: If s is nonfinal and Cp¢j+l) is a nonterminal,
(X2=+) T-*P. -~+* 0 S~ P-~a. "I.-{.-* 4 then for each q such that C~¢j+n = Dq, and for each
E-~T. ~-5 0 (X~=-~) T--~T*P. -~-5. 2 fl 6 tIk(Cp¢~+2)..- C~o~) add (q, 0, i, fl> to S~.
T --~T.*P -~-5. 0 E--~ E + T . ~+ 0
* -~ E.q q 0 T-~T.*P ~ +* 2
(2) Completer: If s is final and ~ = Xi+l ... Xi+~, then for
E -~ E.-ST ~-5 0 ¢ -~ E.q q 0 each (q,l,g,~> 6 Sf (after all states have been added to Sf) such
E -~ E.-ST "~-5 0 that C~a+~) = Dp, add (q,l + 1, g, ~>to S~.
S~ E-*E+.T q+ 0 (3) Scanner: If s is nonfinal and C~¢y+n is terminal, then if
(X~=a) T-~.T*P q +* 2 S~ ~Eq. q 0
T --~.P -[ -5* 2 Cp(~+l) = X~+i , add (p, j + 1, f, o~>to S~+~.
P -~ .a q +* 2 If S~+iis empty, return rejection.
If i = n and S~+~ = {(0,2,0,-~>}, return acceptance.
Fro. 1 End
Notice t h a t the ordering imposed on state sets is not
i m p o r t a n t to their meaning b u t is simply a device which
he uses a stack rather t h a n pointers to keep track of w h a t allows their members to be processed correctly b y the
to do after a parsing decision is made. algorithm. Also note t h a t i cannot become greater t h a n n
Note also that, although it did not develop this way, without either acceptance or rejection occurring because
our algorithm is in effect a top-down parser [3] in which of the fact t h a t -~ appears only in production zero. This
we carry along all possible parses simultaneously in such does not really represent a complete description of the
a way t h a t we can often combine like subparses. This cuts algorithm until we describe in detail how all these opera-
down on duplication of effort and also avoids the left- tions are implemented on a machine. The following de-
recursion problem. (A straightforward top-down parser scription assumes a knowledge of basic list processing tech-
m a y go into an infinite loop on g r a m m a r s containing left- niques.
recursion, i.e. A --~ A~.) Implementation
4. The Recognizer (1) For each nonterminal, we keep a linked list of its
alternatives, for use in prediction.
The following is a precise description of the recognition (2) T h e states in a state set are kept in a linked list
algorithm for input string X~ • • • X . and g r a m m a r G. so they can be processed in order.
NOT~TmN. N u m b e r the productions of g r a m m a r G ar- (3) I n addition, as each state set S~ is constructed, we
bitrarily 1, . . . , d - 1, where each production is of the put entries into a vector of size i. The f t h entry in this
form vector (0 =< f ~ i) is a pointer to a list of all states in St
with pointer f, i.e. states of the form ~p, j, f, ~} 6 S~ for
D~--~ C ~ "-- C ~ (1 _-< p ~ d -- 1)
some p, j, ~. Thus, to test if a state (p, j, f, ~) has already
where p is the n u m b e r of symbols on the right-hand side been added to S~, we search through the list pointed to
of the p t h production. Add a 0th production b y the f t h entry in this vector. (This takes an amount of
time independent of f . ) T h e vector and lists can be dis-
D0 --+ R -~ carded after S~ is constructed.
(4) For the use of the completer, we also keep, for each
where R is the root of G, and -~ is a new terminal symbol.
state set S~ and nonter~final N, a list of all states <p, j, f, ~)
Definition. A state is a quadruple (p, j, f, a> where p, j,
6 S~ such t h a t C~¢i+1) -- N.
and f are integers ( 0 ~ p = < d - - 1) ( 0 = < j E T ) (0__<f
(5) I f the g r a m m a r contains null productions (A --~ ~),
n -t- 1 ) and a is a string consisting of k terminal symbols.
we cannot implement the completer in a straightforward
state set is an ordered set of states. A final state is one in
way. When performing the completer on a null state
which j = ~. We add a state to a state set b y putting it
(A-~. a i) we want to add to Si each state in S~ with A
last in the ordered set unless it is already a member.
to the right of the dot. But one of these m a y not have been
Definition. Hk ( ' y ) = {ce [ a is terminal, ] a [ = k, and
added to St yet. So we must note this and check for it
fl such t h a t 7 ~ a~}. when we add more states to S~.
H~ (~) is the set of all k-symbol terminal strings which The above implementation description is not m e a n t to
begin some string derived from % This is used in forming be the only way or the best way to implement the al-
the look-ahead string for the states. gorithm. I t is merely a method which does allow the
FIG. 2 FIG. 3