
Syntax Analysis

Position of a Parser in the Compiler Model


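[Figure: Source program → Lexical analyzer → Parser and rest of front-end → Intermediate representation. The parser asks the lexical analyzer for the next token ("get next token") and receives (token, tokenval). The lexical analyzer reports lexical errors; the parser and rest of the front-end report syntax errors and semantic errors. The diagram also shows the symbol table.]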

A Parser
[Figure: the parser takes a context-free grammar G and a token stream s (from the lexer) as input. It answers yes if s is in L(G) and no otherwise, and it emits error messages.]

Syntax analyzers (parsers) = CFG acceptors which also output the corresponding derivation when the token stream is accepted
Various kinds: LL(k), LR(k), SLR, LALR

The Parser
A parser implements a C-F grammar
The role of the parser is twofold:
1. To check syntax (= string recognizer)
  o And to report syntax errors accurately
2. To invoke semantic actions
  o For static semantics checking, e.g. type checking of expressions, functions, etc.
  o For syntax-directed translation of the source code to an intermediate representation

Parsing
Universal (any C-F grammar)
  o Cocke-Younger-Kasami
  o Earley
Top-down (C-F grammar with restrictions)
  o Recursive descent (predictive parsing)
  o LL (Left-to-right, Leftmost derivation) methods
Bottom-up (C-F grammar with restrictions)
  o Operator precedence parsing
  o LR (Left-to-right, Rightmost derivation) methods
  o SLR, canonical LR, LALR

Top Down parsing


  o Start from S (the start symbol)
  o Use productions to derive a sequence of tokens
  o For arbitrary strings α, β, γ and for a production A → β:
  o A single step of the derivation is αAγ ⇒ αβγ (substitute β for A)

Example, using the production S → E+S:
    (S + E) + E ⇒ (E + S + E) + E

Parsing Top-Down
Goal: construct a leftmost derivation of the string while reading in the sequential token stream

    S → E+S | E
    E → num | (S)

    Partly-derived string    Lookahead    Input string
    E + S                    (            (1+2+(3+4))+5
    (S) + S                  1            (1+2+(3+4))+5
    (E+S)+S                  1            (1+2+(3+4))+5
    (1+S)+S                  2            (1+2+(3+4))+5
    (1+E+S)+S                2            (1+2+(3+4))+5
    (1+2+S)+S                2            (1+2+(3+4))+5
    (1+2+E)+S                (            (1+2+(3+4))+5
    (1+2+(S))+S              3            (1+2+(3+4))+5
    (1+2+(E+S))+S            3            (1+2+(3+4))+5
    ...

Problem with Top-Down Parsing


Want to decide which production to apply based on next symbol

    S → E+S | E
    E → num | (S)

    Ex1: (1)      S ⇒ E ⇒ (S) ⇒ (E) ⇒ (1)
    Ex2: (1)+2    S ⇒ E+S ⇒ (S)+S ⇒ (E)+S ⇒ (1)+S ⇒ (1)+E ⇒ (1)+2

How did you know to pick E+S in Ex2? If you had picked E followed by (S), you couldn't parse it.

Grammar is Problem
    S → E+S | E
    E → num | (S)

This grammar cannot be parsed top-down with only a single look-ahead symbol!
LL(1) = Left-to-right scanning, Left-most derivation, 1 look-ahead symbol
Is it LL(k) for some k?
If yes, then can rewrite grammar to allow top-down parsing: create LL(1) grammar for same language

Making a Grammar LL(1)


    Original grammar        LL(1) grammar
    S → E+S                 S  → E S'
    S → E                   S' → ε
    E → num                 S' → +S
    E → (S)                 E  → num
                            E  → (S)

Problem: Can't decide which S production to apply until we see the symbol after the first expression
Left-factoring: Factor common S prefix, add new non-terminal S' at decision point. S' derives (+E)*
Also: Convert left recursion to right recursion

Parsing with New Grammar


    S  → E S'
    S' → ε | +S
    E  → num | (S)

    Partly-derived string    Lookahead    Input string
    E S'                     (            (1+2+(3+4))+5
    (S)S'                    1            (1+2+(3+4))+5
    (ES')S'                  1            (1+2+(3+4))+5
    (1S')S'                  +            (1+2+(3+4))+5
    (1+ES')S'                2            (1+2+(3+4))+5
    (1+2S')S'                +            (1+2+(3+4))+5
    (1+2+S)S'                (            (1+2+(3+4))+5
    (1+2+ES')S'              (            (1+2+(3+4))+5
    (1+2+(S)S')S'            3            (1+2+(3+4))+5
    (1+2+(ES')S')S'          3            (1+2+(3+4))+5
    (1+2+(3S')S')S'          +            (1+2+(3+4))+5
    (1+2+(3+ES')S')S'        4            (1+2+(3+4))+5
    ...

Predictive Parsing
LL(1) grammar:
  o For a given non-terminal, the lookahead symbol uniquely determines the production to apply
  o Top-down parsing = predictive parsing
  o Driven by predictive parsing table of non-terminals x terminals → productions

Parsing with Table


    S  → E S'
    S' → ε | +S
    E  → num | (S)

    Partly-derived string    Lookahead    Input string
    E S'                     (            (1+2+(3+4))+5
    (S)S'                    1            (1+2+(3+4))+5
    (ES')S'                  1            (1+2+(3+4))+5
    (1S')S'                  +            (1+2+(3+4))+5
    (1+ES')S'                2            (1+2+(3+4))+5
    (1+2S')S'                +            (1+2+(3+4))+5

          num        +         (          )        $
    S     → E S'               → E S'
    S'               → +S                 → ε      → ε
    E     → num                → (S)
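The table can also drive a parser directly, with an explicit stack instead of procedure calls. The following is a minimal sketch under stated assumptions, not the slides' own implementation: every grammar symbol is encoded as one character (n for num, P for S', $ for end of input), and the names table_entry and ll1_accepts are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Assumed single-character encoding: n = num, P = S', $ = end of input. */
static const char TERMS[] = "n+()$";

/* Right-hand side stored in the table cell (nonterminal, lookahead),
   or NULL for an empty cell. "" stands for the epsilon production. */
static const char *table_entry(char nonterminal, char lookahead) {
    switch (nonterminal) {
    case 'S': return (lookahead == 'n' || lookahead == '(') ? "EP" : NULL;
    case 'P': if (lookahead == '+') return "+S";
              return (lookahead == ')' || lookahead == '$') ? "" : NULL;
    case 'E': if (lookahead == 'n') return "n";
              return lookahead == '(' ? "(S)" : NULL;
    default:  return NULL;
    }
}

/* input is a string over n + ( ) without the end marker; the driver
   treats the end of the string as the lookahead $. */
bool ll1_accepts(const char *input) {
    assert(input != NULL);
    if (strspn(input, "n+()") != strlen(input)) return false;
    char stack[256];
    int top = 0;
    stack[top++] = '$';
    stack[top++] = 'S';                      /* start symbol on top */
    const char *p = input;
    while (top > 0) {
        char look = *p ? *p : '$';
        char x = stack[--top];
        if (strchr(TERMS, x)) {              /* terminal: must match lookahead */
            if (x != look) return false;
            if (*p) p++;
        } else {                             /* nonterminal: consult the table */
            const char *rhs = table_entry(x, look);
            if (!rhs) return false;
            size_t len = strlen(rhs);
            if ((size_t)top + len > sizeof stack) return false;
            for (size_t i = len; i > 0; i--) /* push RHS right to left */
                stack[top++] = rhs[i - 1];
        }
    }
    return *p == '\0';
}
```

With this encoding the running example (1+2+(3+4))+5 becomes "(n+n+(n+n))+n", which ll1_accepts should accept, while "n+" hits an empty cell for (S, $) and is rejected.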

How to Implement This?


Table can be converted easily into a recursive descent parser
3 procedures: parse_S(), parse_S_prime() (for S'), and parse_E()

Recursive-Descent Parser
    /* S → E S' */
    void parse_S() {
        switch (token) {                 /* token is the lookahead token */
        case num: parse_E(); parse_S_prime(); return;
        case '(': parse_E(); parse_S_prime(); return;
        default:  ParseError();
        }
    }

Recursive-Descent Parser (2)


    /* S' → ε | +S */
    void parse_S_prime() {
        switch (token) {
        case '+': token = input.read(); parse_S(); return;
        case ')': return;                /* S' → ε */
        case EOF: return;                /* S' → ε */
        default:  ParseError();
        }
    }

Recursive-Descent Parser (3)


    /* E → num | (S) */
    void parse_E() {
        switch (token) {
        case num:
            token = input.read(); return;
        case '(':
            token = input.read();
            parse_S();
            if (token != ')') ParseError();
            token = input.read(); return;
        default:
            ParseError();
        }
    }
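The three procedures can be assembled into a complete recognizer. This is a sketch under assumptions of its own: the lexer next_token() stands in for the slides' input.read(), the token names (TOK_NUM and so on), the failed flag and the entry point accepts() are all hypothetical, and an error simply marks the parse as failed instead of printing a message.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

enum token_kind { TOK_NUM, TOK_PLUS, TOK_LPAREN, TOK_RPAREN, TOK_EOF, TOK_BAD };

static const char *cursor;       /* next unread character of the input */
static enum token_kind token;    /* the single lookahead token */
static bool failed;              /* set by parse_error() */

/* Hypothetical lexer: skips blanks, folds a run of digits into TOK_NUM. */
static void next_token(void) {
    while (*cursor == ' ') cursor++;
    if (*cursor == '\0') { token = TOK_EOF; return; }
    if (isdigit((unsigned char)*cursor)) {
        while (isdigit((unsigned char)*cursor)) cursor++;
        token = TOK_NUM;
        return;
    }
    switch (*cursor++) {
    case '+': token = TOK_PLUS;   break;
    case '(': token = TOK_LPAREN; break;
    case ')': token = TOK_RPAREN; break;
    default:  token = TOK_BAD;    break;
    }
}

static void parse_error(void) { failed = true; }
static void parse_S(void);

/* E -> num | ( S ) */
static void parse_E(void) {
    if (failed) return;
    switch (token) {
    case TOK_NUM:
        next_token();
        return;
    case TOK_LPAREN:
        next_token();
        parse_S();
        if (failed) return;
        if (token != TOK_RPAREN) { parse_error(); return; }
        next_token();
        return;
    default:
        parse_error();
    }
}

/* S' -> epsilon | + S */
static void parse_S_prime(void) {
    if (failed) return;
    switch (token) {
    case TOK_PLUS:   next_token(); parse_S(); return;
    case TOK_RPAREN: return;     /* S' -> epsilon, ) is in FOLLOW(S') */
    case TOK_EOF:    return;     /* S' -> epsilon, $ is in FOLLOW(S') */
    default:         parse_error();
    }
}

/* S -> E S' */
static void parse_S(void) {
    if (failed) return;
    switch (token) {
    case TOK_NUM:
    case TOK_LPAREN: parse_E(); parse_S_prime(); return;
    default:         parse_error();
    }
}

/* Returns true if text is a sentence of the grammar. */
bool accepts(const char *text) {
    assert(text != NULL);
    cursor = text;
    failed = false;
    next_token();
    parse_S();
    return !failed && token == TOK_EOF;
}
```

Calling accepts("(1+2+(3+4))+5") should return true, and accepts("(1") should return false because parse_E finds end of input where it requires ")".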

Call Tree = Parse Tree


[Figure: the tree of calls to parse_S, parse_E and parse_S' made while parsing the running example (1+2+(3+4))+5, shown next to the parse tree for the same input. The call tree has the same shape as the parse tree.]

How to Construct Parsing Tables?


Needed: Algorithm for automatically generating a predictive parse table from a grammar

Grammar:
    S  → E S'
    S' → ε | +S
    E  → num | (S)

    ??

Parse table:
          num        +         (          )        $
    S     → E S'               → E S'
    S'               → +S                 → ε      → ε
    E     → num                → (S)

Constructing Parse Tables


Can construct predictive parser if:
  o For every non-terminal, every lookahead symbol can be handled by at most 1 production
FIRST(β) for an arbitrary string of terminals and non-terminals β is:
  o Set of symbols that might begin the fully expanded version of β
FOLLOW(X) for a non-terminal X is:
  o Set of symbols that might follow the derivation of X in the input stream

[Figure: the derivation of X within the input stream; FIRST marks the symbols that can begin it, FOLLOW the symbols that can come right after it.]

Parse Table Entries


Consider a production X → β
  o Add → β to the X row for each symbol in FIRST(β)
  o If β can derive ε (β is nullable), add → β for each symbol in FOLLOW(X)
Grammar is LL(1) if no conflicting entries

    S  → E S'
    S' → ε | +S
    E  → num | (S)

          num        +         (          )        $
    S     → E S'               → E S'
    S'               → +S                 → ε      → ε
    E     → num                → (S)

Computing Nullable
X is nullable if it can derive the empty string:
  o If it derives ε directly (X → ε)
  o If it has a production X → YZ... where all RHS symbols (Y, Z) are nullable

Algorithm: assume all non-terminals are non-nullable, apply rules repeatedly until no change

    S  → E S'
    S' → ε | +S
    E  → num | (S)

Only S' is nullable
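The fixed-point algorithm can be sketched in a few lines. The encoding here is an assumption made for illustration: each production is a string "LHS=RHS" over single-character symbols, with P standing for S', n for num and an empty right-hand side for ε. The names PRODS and compute_nullable are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Assumed encoding: "LHS=RHS", P = S', n = num, empty RHS = epsilon. */
static const char *PRODS[] = { "S=EP", "P=", "P=+S", "E=n", "E=(S)" };
enum { NPRODS = sizeof PRODS / sizeof PRODS[0] };

/* Sets nullable[c] for every symbol c. Terminals never become nullable
   because they never appear as a left-hand side. */
void compute_nullable(bool nullable[256]) {
    memset(nullable, 0, 256 * sizeof(bool));   /* assume nothing is nullable */
    bool changed = true;
    while (changed) {                          /* repeat until no change */
        changed = false;
        for (int i = 0; i < NPRODS; i++) {
            assert(PRODS[i][1] == '=');
            unsigned char lhs = (unsigned char)PRODS[i][0];
            bool all_nullable = true;          /* vacuously true for X -> epsilon */
            for (const char *p = PRODS[i] + 2; *p; p++) {
                if (!nullable[(unsigned char)*p]) { all_nullable = false; break; }
            }
            if (all_nullable && !nullable[lhs]) {
                nullable[lhs] = true;
                changed = true;
            }
        }
    }
}
```

For the grammar above the loop settles after two passes: the first pass marks P (that is, S') because of its empty right-hand side, and the second pass finds nothing new, since S and E each begin with a symbol that is not nullable.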

Computing FIRST
Determining FIRST(X)
  o if X is a terminal, then add X to FIRST(X)
  o if X → ε then add ε to FIRST(X)
  o if X is a nonterminal and X → Y1 Y2 ... Yk then a is in FIRST(X) if a is in FIRST(Yi) and ε is in FIRST(Yj) for j = 1...i-1 (i.e., it's possible to have an empty prefix Y1 ... Yi-1); if ε is in FIRST(Y1 Y2 ... Yk) then ε is in FIRST(X)

    S  → E S'
    S' → ε | +S
    E  → num | (S)

FIRST(S)  = { num, ( }
FIRST(S') = { ε, + }
FIRST(E)  = { num, ( }
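The three rules can be applied as a fixed-point iteration. The sketch below assumes a representation that is not in the slides: sets are bit masks over the terminals "n+()$" with one extra bit EPS for ε, productions are "LHS=RHS" strings with P for S' and n for num, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define EPS 0x80u   /* assumed marker bit for epsilon */
static const char *PRODS[] = { "S=EP", "P=", "P=+S", "E=n", "E=(S)" };
enum { NPRODS = sizeof PRODS / sizeof PRODS[0] };
static const char TERMS[] = "n+()$";

/* Bit assigned to terminal c in the set representation. */
static unsigned term_bit(char c) {
    const char *p = strchr(TERMS, c);
    assert(p != NULL && c != '\0');
    return 1u << (p - TERMS);
}

/* FIRST of a symbol string Y1...Yk using the current per-symbol sets:
   keep adding FIRST(Yi) while every earlier Yj can derive epsilon. */
unsigned first_of_string(const char *s, const unsigned first[256]) {
    unsigned out = 0;
    for (; *s; s++) {
        unsigned f = first[(unsigned char)*s];
        out |= f & ~EPS;
        if (!(f & EPS)) return out;   /* Yi is not nullable: stop here */
    }
    return out | EPS;                 /* whole string can vanish */
}

void compute_first(unsigned first[256]) {
    memset(first, 0, 256 * sizeof(unsigned));
    for (const char *t = TERMS; *t; t++)      /* rule 1: terminals */
        first[(unsigned char)*t] = term_bit(*t);
    bool changed = true;
    while (changed) {                         /* rules 2 and 3 until stable */
        changed = false;
        for (int i = 0; i < NPRODS; i++) {
            unsigned char lhs = (unsigned char)PRODS[i][0];
            unsigned add = first_of_string(PRODS[i] + 2, first);
            if ((first[lhs] | add) != first[lhs]) {
                first[lhs] |= add;
                changed = true;
            }
        }
    }
}
```

An empty right-hand side makes first_of_string return just EPS, which covers the X → ε rule without a special case. The result for S, P and E should match the three FIRST sets listed above.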

Computing FOLLOW
Determining FOLLOW(X)
  o if S is the start symbol then $ is in FOLLOW(S)
  o if A → αBβ then add everything in FIRST(β) except ε to FOLLOW(B)
  o if A → αB, or A → αBβ and ε is in FIRST(β), then add FOLLOW(A) to FOLLOW(B)

    S  → E S'
    S' → ε | +S
    E  → num | (S)

FIRST(S)  = { num, ( }
FIRST(S') = { ε, + }
FIRST(E)  = { num, ( }

FOLLOW(S)  = { $, ) }
FOLLOW(S') = { $, ) }
FOLLOW(E)  = { +, ), $ }
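The FOLLOW rules can be iterated to a fixed point in the same style. To keep the sketch short it takes the FIRST sets listed above as given (hard-coded in first_sym) rather than computing them. The bit-mask sets, the "LHS=RHS" production strings with P for S' and n for num, and all function names are assumptions of this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define EPS 0x80u   /* assumed marker bit for epsilon */
static const char *PRODS[] = { "S=EP", "P=", "P=+S", "E=n", "E=(S)" };
enum { NPRODS = sizeof PRODS / sizeof PRODS[0] };
static const char TERMS[] = "n+()$";

static unsigned term_bit(char c) {
    const char *p = strchr(TERMS, c);
    assert(p != NULL && c != '\0');
    return 1u << (p - TERMS);
}

static bool is_nonterminal(char c) { return c == 'S' || c == 'P' || c == 'E'; }

/* FIRST sets copied from the slide (assumed to be computed already). */
static unsigned first_sym(char c) {
    switch (c) {
    case 'S': case 'E': return term_bit('n') | term_bit('(');
    case 'P':           return term_bit('+') | EPS;
    default:            return term_bit(c);
    }
}

void compute_follow(unsigned follow[256]) {
    memset(follow, 0, 256 * sizeof(unsigned));
    follow['S'] = term_bit('$');              /* rule 1: start symbol */
    bool changed = true;
    while (changed) {
        changed = false;
        for (int i = 0; i < NPRODS; i++) {
            unsigned char lhs = (unsigned char)PRODS[i][0];
            for (const char *b = PRODS[i] + 2; *b; b++) {
                if (!is_nonterminal(*b)) continue;
                unsigned add = 0;
                const char *g = b + 1;        /* beta = what follows B */
                for (; *g; g++) {             /* rule 2: FIRST(beta) minus eps */
                    unsigned f = first_sym(*g);
                    add |= f & ~EPS;
                    if (!(f & EPS)) break;
                }
                if (!*g) add |= follow[lhs];  /* rule 3: beta empty or nullable */
                unsigned char B = (unsigned char)*b;
                if ((follow[B] | add) != follow[B]) {
                    follow[B] |= add;
                    changed = true;
                }
            }
        }
    }
}
```

The ")" in FOLLOW(S) comes from E → (S); it then propagates to S' and E through S → E S', which is why the loop needs a second pass.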

Putting it all Together


FIRST(S)  = { num, ( }
FIRST(S') = { ε, + }
FIRST(E)  = { num, ( }
FOLLOW(S)  = { $, ) }
FOLLOW(S') = { $, ) }
FOLLOW(E)  = { +, ), $ }

Consider a production X → β
  o Add → β to the X row for each symbol in FIRST(β)
  o If β can derive ε (β is nullable), add → β for each symbol in FOLLOW(X)

    S  → E S'
    S' → ε | +S
    E  → num | (S)

          num        +         (          )        $
    S     → E S'               → E S'
    S'               → +S                 → ε      → ε
    E     → num                → (S)
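The two table-filling rules can be sketched as follows. The FIRST and FOLLOW sets are copied from the slide rather than computed, and the encoding (bit-mask sets, "LHS=RHS" production strings with P for S' and n for num, production indices as table entries) as well as the name build_table are assumptions of this example.

```c
#include <assert.h>
#include <string.h>

#define EPS 0x80u   /* assumed marker bit for epsilon */
static const char *PRODS[] = { "S=EP", "P=", "P=+S", "E=n", "E=(S)" };
enum { NPRODS = sizeof PRODS / sizeof PRODS[0], NTERMS = 5, NNONTERMS = 3 };
static const char TERMS[] = "n+()$";   /* column order */
static const char NONTERMS[] = "SPE";  /* row order */

static unsigned term_bit(char c) {
    const char *p = strchr(TERMS, c);
    assert(p != NULL && c != '\0');
    return 1u << (p - TERMS);
}

/* FIRST and FOLLOW copied from the slide. */
static unsigned first_sym(char c) {
    switch (c) {
    case 'S': case 'E': return term_bit('n') | term_bit('(');
    case 'P':           return term_bit('+') | EPS;
    default:            return term_bit(c);
    }
}
static unsigned follow_sym(char c) {
    if (c == 'E') return term_bit('+') | term_bit(')') | term_bit('$');
    return term_bit('$') | term_bit(')');           /* S and S' */
}

static unsigned first_of_string(const char *s) {
    unsigned out = 0;
    for (; *s; s++) {
        unsigned f = first_sym(*s);
        out |= f & ~EPS;
        if (!(f & EPS)) return out;
    }
    return out | EPS;
}

/* table[row][col] holds a production index, or -1 for an empty cell.
   Returns the number of conflicting entries (0 means LL(1)). */
int build_table(int table[NNONTERMS][NTERMS]) {
    int conflicts = 0;
    for (int r = 0; r < NNONTERMS; r++)
        for (int c = 0; c < NTERMS; c++) table[r][c] = -1;
    for (int i = 0; i < NPRODS; i++) {
        char lhs = PRODS[i][0];
        int row = (int)(strchr(NONTERMS, lhs) - NONTERMS);
        unsigned f = first_of_string(PRODS[i] + 2);
        unsigned cells = f & ~EPS;                  /* rule for FIRST(beta) */
        if (f & EPS) cells |= follow_sym(lhs);      /* rule for nullable beta */
        for (int c = 0; c < NTERMS; c++) {
            if (!(cells & (1u << c))) continue;
            if (table[row][c] != -1) conflicts++;
            table[row][c] = i;
        }
    }
    return conflicts;
}
```

For this grammar build_table should report zero conflicts and reproduce the table above, with the ε production (index 1) filling the ")" and "$" columns of the S' row.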

Ambiguous Grammars
Construction of predictive parse table for ambiguous grammar results in conflicts

    S → S+S | S*S | num

FIRST(S+S) = FIRST(S*S) = FIRST(num) = { num }

All three productions therefore land in the same table entry, row S and column num.

LL(1) Grammar
A grammar G is LL(1) if it is not left recursive and for each collection of productions
    A → α1 | α2 | … | αn
for nonterminal A the following holds:
1. FIRST(αi) ∩ FIRST(αj) = ∅ for all i ≠ j
2. if αi ⇒* ε then
   2.a. αj ⇏* ε for all j ≠ i
   2.b. FIRST(αj) ∩ FOLLOW(A) = ∅ for all j ≠ i

Non-LL(1) Examples

    Grammar            Not LL(1) because:
    S → S a | a        Left recursive
    S → a S | a        FIRST(a S) ∩ FIRST(a) ≠ ∅
    S → a R | ε        For R: S ⇒* ε and ε ⇒* ε
    R → S | ε
    S → a R a          For R: FIRST(S) ∩ FOLLOW(R) ≠ ∅
    R → S | ε

Impact of Ambiguity
Different parse trees correspond to different evaluations! Thus, program meaning is not defined!!
[Figure: two parse trees for 1 + 2 * 3. Grouping as (1 + 2) * 3 evaluates to 9; grouping as 1 + (2 * 3) evaluates to 7.]

Can We Get Rid of Ambiguity?


Ambiguity is a function of the grammar, not the language!
A context-free language L is inherently ambiguous if all grammars for L are ambiguous
Every deterministic CFL has an unambiguous grammar
  o So, no deterministic CFL is inherently ambiguous
  o No inherently ambiguous programming languages have been invented
To construct a useful parser, must devise an unambiguous grammar

Eliminating Ambiguity
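Often can eliminate ambiguity by adding nonterminals and allowing recursion only on right or left

    S → S+T | T
    T → T * num | num

[Figure: the single parse tree for 1 + 2 * 3 under this grammar. The root S expands to S + T; the left S derives 1 through T, and the right T expands to T * 3 with the inner T deriving 2.]

  o T non-terminal enforces precedence
  o Left-recursion; left associativity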
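The effect of the extra non-terminal can be checked with a small evaluator. This is a sketch with assumptions of its own: a recursive-descent procedure cannot call itself on a left-recursive rule, so S → S+T | T is realised as the loop T (+ T)* and T → T * num | num as num (* num)*, which accept the same strings. The names evaluate, parse_S, parse_T and parse_num are hypothetical, and the input is assumed to be well formed (no error handling).

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

static const char *cur;   /* next unread character */

static void skip_blanks(void) { while (*cur == ' ') cur++; }

/* num: a run of decimal digits, surrounding blanks ignored */
static long parse_num(void) {
    skip_blanks();
    long v = 0;
    while (isdigit((unsigned char)*cur)) v = v * 10 + (*cur++ - '0');
    skip_blanks();
    return v;
}

/* T -> T * num | num, written as num (* num)*.
   All multiplications are finished before control returns to parse_S,
   which is how the T level enforces the higher precedence of *. */
static long parse_T(void) {
    long v = parse_num();
    while (*cur == '*') { cur++; v *= parse_num(); }
    return v;
}

/* S -> S + T | T, written as T (+ T)*.
   The running total is combined with each new T from left to right. */
static long parse_S(void) {
    long v = parse_T();
    while (*cur == '+') { cur++; v += parse_T(); }
    return v;
}

/* Evaluates a well-formed expression over numbers, + and *. */
long evaluate(const char *text) {
    assert(text != NULL);
    cur = text;
    return parse_S();
}
```

With this structure evaluate("1 + 2 * 3") should give 7 and never 9, because the only way to reach * is through parse_T.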

A Closer Look at Eliminating Ambiguity


Precedence enforced by
  o Introduce distinct non-terminals for each precedence level
  o Operators for a given precedence level are specified as RHS for the production
  o Higher precedence operators are accessed by referencing the next-higher precedence non-terminal

Associativity
An operator is either left, right or non associative
  o Left: a + b + c = (a + b) + c
  o Right: a ^ b ^ c = a ^ (b ^ c)
  o Non: a < b < c is illegal (thus undefined)

Position of the recursion relative to the operator dictates the associativity
  o Left (right) recursion → left (right) associativity
  o Non: Don't be recursive, simply reference next higher precedence non-terminal on both sides of operator

Error Handling
A good compiler should assist in identifying and locating errors
  o Lexical errors: important, compiler can easily recover and continue
  o Syntax errors: most important for compiler, can almost always recover
  o Static semantic errors: important, can sometimes recover
  o Dynamic semantic errors: hard or impossible to detect at compile time, runtime checks are required
  o Logical errors: hard or impossible to detect

Error Recovery Strategies


Panic mode
  o Discard input until a token in a set of designated synchronizing tokens is found
Phrase-level recovery
  o Perform local correction on the input to repair the error
Error productions
  o Augment grammar with productions for erroneous constructs
Global correction
  o Choose a minimal sequence of changes to obtain a global least-cost correction

Panic Mode Recovery


Add synchronizing actions to undefined entries based on FOLLOW
Pro: Can be automated
Cons: Error messages are needed

FOLLOW(E)  = { ) $ }
FOLLOW(ER) = { ) $ }
FOLLOW(T)  = { + ) $ }
FOLLOW(TR) = { + ) $ }
FOLLOW(F)  = { + * ) $ }

          id          +              *              (            )         $
    E     E → T ER                                  E → T ER     synch     synch
    ER                ER → + T ER                                ER → ε    ER → ε
    T     T → F TR    synch                         T → F TR     synch     synch
    TR                TR → ε         TR → * F TR                 TR → ε    TR → ε
    F     F → id      synch          synch          F → ( E )    synch     synch

synch: the driver pops the current nonterminal A and skips input until a synchronizing token is found

Phrase-Level Recovery
Change input stream by inserting missing tokens
For example: id id is changed into id * id
Pro: Can be automated
Cons: Recovery not always intuitive

          id          +              *              (            )         $
    E     E → T ER                                  E → T ER     synch     synch
    ER                ER → + T ER                                ER → ε    ER → ε
    T     T → F TR    synch                         T → F TR     synch     synch
    TR    insert *    TR → ε         TR → * F TR                 TR → ε    TR → ε
    F     F → id      synch          synch          F → ( E )    synch     synch

insert *: driver inserts missing * and retries the production
After the insertion, parsing can continue with the entry TR → * F TR.

Error Productions
    E  → T ER
    ER → + T ER | ε
    T  → F TR
    TR → * F TR | ε
    F  → ( E ) | id

Add error production: TR → F TR to ignore missing *, e.g.: id id
Pro: Powerful recovery method
Cons: Cannot be automated

          id          +              *              (            )         $
    E     E → T ER                                  E → T ER     synch     synch
    ER                ER → + T ER                                ER → ε    ER → ε
    T     T → F TR    synch                         T → F TR     synch     synch
    TR    TR → F TR   TR → ε         TR → * F TR                 TR → ε    TR → ε
    F     F → id      synch          synch          F → ( E )    synch     synch
