
COMPILER DESIGN

UNIT 2 NOTES

UNIT -2 (SYNTAX ANALYSIS)


Syntax Analysis: discussion on CFG, LMD, RMD, parse trees; role of a parser; classification
of parsing techniques; brute-force approach; left recursion; left factoring; top-down parsing;
First and Follow; LL(1) grammars; non-recursive predictive parsing; error recovery in
predictive parsing.
INTRODUCTION
By design, every programming language has precise rules that prescribe the syntactic structure of
well-formed programs. In C, for example, a program is made up of functions, a function out of
declarations and statements, a statement out of expressions, and so on.
Syntax analysis is the second phase of the compiler. It takes the stream of tokens produced by the
lexical analyzer as input and generates a syntax tree or parse tree.
Advantages of grammar for syntactic specification :

 A grammar gives a precise and easy-to-understand syntactic specification of a
programming language.
 An efficient parser can be constructed automatically from a properly designed grammar.
 A grammar imparts a structure to a source program that is useful for its translation
into object code and for the detection of errors.
 New constructs can be added to a language more easily when there is a grammatical
description of the language.
CONTEXT-FREE GRAMMARS
A Context-Free Grammar is a quadruple that consists of terminals, non-terminals, start symbol and
productions.
Terminals : These are the basic symbols from which strings are formed.
Non-Terminals : These are the syntactic variables that denote a set of strings. These help to
define the language generated by the grammar.
Start Symbol : One non-terminal in the grammar is denoted as the Start-symbol and the set of
strings it denotes is the language defined by the grammar.
Productions : These specify the manner in which terminals and non-terminals can be combined to
form strings. Each production consists of a non-terminal (the head), followed by an arrow, followed
by a string of non-terminals and terminals (the body).
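As an illustration, the quadruple can be written down directly as a data structure; the representation below (a Python dict) is one possible encoding, not a standard one:

```python
# A context-free grammar as a quadruple (V, T, P, S).
# The field names and dict layout here are illustrative choices.
grammar = {
    "nonterminals": {"E", "T", "F"},              # V
    "terminals": {"id", "+", "*", "(", ")"},      # T
    "start": "E",                                 # S
    "productions": {                              # P: head -> list of bodies
        "E": [["E", "+", "T"], ["T"]],
        "T": [["T", "*", "F"], ["F"]],
        "F": [["(", "E", ")"], ["id"]],
    },
}

# Sanity check: every production head is a non-terminal, and every
# symbol in a body is either a terminal or a non-terminal.
for head, bodies in grammar["productions"].items():
    assert head in grammar["nonterminals"]
    for body in bodies:
        for sym in body:
            assert sym in grammar["terminals"] | grammar["nonterminals"]
```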
Examples of context free grammar


The following grammar (Fig. 4.2) defines simple arithmetic expressions:
expression → expression + term | expression - term | term
term → term * factor | factor
factor → ( expression ) | id
In this grammar, the terminal symbols are id, +, -, *, ( , ).
The nonterminal symbols are expression, term and factor, and expression is the start symbol.

Notational Conventions
To avoid always having to state that "these are the terminals," "these are the nonterminals," and
so on, the following notational conventions for grammars are used:
1. These symbols are terminals:
(a) Lowercase letters early in the alphabet, such as a, b, c.
(b) Operator symbols such as +, *, and so on.
(c) Punctuation symbols such as parentheses, comma, and so on.
(d) The digits 0, 1, . . . , 9.
(e) Boldface strings such as id or if, each of which represents a single terminal symbol.
2. These symbols are nonterminals:
(a) Uppercase letters early in the alphabet, such as A, B, C.
(b) The letter S, which, when it appears, is usually the start symbol.
(c) Lowercase, italic names such as expr or stmt.
(d) When discussing programming constructs, uppercase letters may be used to represent non
terminals for the constructs. For example, nonterminals for expressions, terms, and factors are
often represented by E, T, and F, respectively.
3. Uppercase letters late in the alphabet, such as X, Y, Z, represent grammar symbols; that is,
either nonterminals or terminals.
4. Lowercase letters late in the alphabet, chiefly u, v, . . . , z, represent (possibly empty) strings
of terminals.
5. Lowercase Greek letters, α, β, γ for example, represent (possibly empty) strings of grammar
symbols. Thus, a generic production can be written as A → α, where A is the head and α the body.
6. A set of productions A → α1, A → α2, . . . , A → αk with a common head A (call them
A-productions) may be written A → α1 | α2 | . . . | αk. Call α1, α2, . . . , αk the alternatives for A.

7. Unless stated otherwise, the head of the first production is the start symbol.
Using these conventions, the grammar of the above example can be rewritten concisely as
E → E + T | E - T | T
T → T * F | F
F → ( E ) | id

Derivations
The construction of a parse tree can be made precise by taking a derivational view, in which
productions are treated as rewriting rules. Beginning with the start symbol, each rewriting step
replaces a nonterminal by the body of one of its productions. This derivational view corresponds
to the top-down construction of a parse tree, but the precision afforded by derivations will be
especially helpful when bottom-up parsing is discussed. As we shall see, bottom-up parsing is
related to a class of derivations known as "rightmost" derivations, in which the rightmost
nonterminal is rewritten at each step.
Types of derivations:
The two types of derivation are:
1. Left most derivation
2. Right most derivation.
In a leftmost derivation, the leftmost non-terminal in each sentential form is always chosen for
replacement.
In a rightmost derivation, the rightmost non-terminal in each sentential form is always chosen for
replacement.
Example:
Given grammar G : E → E+E | E*E | ( E ) | -E | id
Sentence to be derived : -(id+id)

LEFTMOST DERIVATION

E ⇒ -E
  ⇒ -(E)
  ⇒ -(E+E)
  ⇒ -(id+E)
  ⇒ -(id+id)

RIGHTMOST DERIVATION

E ⇒ -E
  ⇒ -(E)
  ⇒ -(E+E)
  ⇒ -(E+id)
  ⇒ -(id+id)

 Strings that appear in a leftmost derivation are called left sentential forms.
 Strings that appear in a rightmost derivation are called right sentential forms.
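The leftmost-derivation discipline above can be mechanized: at each step, locate the leftmost non-terminal and replace it by a chosen production body. A minimal Python sketch (the helper name and the hand-supplied production choices are illustrative, not from the notes):

```python
def leftmost_step(sentential, head, body, nonterminals):
    """Replace the leftmost non-terminal in `sentential` (a list of
    symbols) using the production head -> body."""
    for i, sym in enumerate(sentential):
        if sym in nonterminals:
            assert sym == head, "leftmost non-terminal must match the head"
            return sentential[:i] + body + sentential[i + 1:]
    raise ValueError("no non-terminal left to rewrite")

# Derive -(id+id) from E with the grammar E -> E+E | E*E | (E) | -E | id.
N = {"E"}
form = ["E"]
steps = [("E", ["-", "E"]),
         ("E", ["(", "E", ")"]),
         ("E", ["E", "+", "E"]),
         ("E", ["id"]),
         ("E", ["id"])]
for head, body in steps:
    form = leftmost_step(form, head, body, N)
print("".join(form))  # -(id+id)
```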
Sentential forms:

Given a grammar G with start symbol S, if S ⇒* α, where α may contain non-terminals or
terminals, then α is called a sentential form of G.

Parse Trees and Derivations


A parse tree is a graphical representation of a derivation that filters out the order in which
productions are applied to replace nonterminals. Each interior node of a parse tree represents the
application of a production. The interior node is labeled with the nonterminal A in the head of
the production; the children of the node are labeled, from left to right, by the symbols in the body
of the production by which this A was replaced during the derivation. For example, the parse tree
for -(id + id) in Fig. 4.3 results from the derivation (4.8) as well as derivation (4.9).
The leaves of a parse tree are labeled by nonterminals or terminals and, read from left to right,
constitute a sentential form, called the yield or frontier of the tree.
To see the relationship between derivations and parse trees, consider any derivation
α1 ⇒ α2 ⇒ . . . ⇒ αn, where α1 is a single nonterminal A.
For each sentential form αi in the derivation, we can construct a parse tree whose yield is αi. The
process is an induction on i.
BASIS: The tree for α1 = A is a single node labeled A.


THE ROLE OF PARSER


The parser or syntactic analyzer obtains a string of tokens from the lexical analyzer and verifies
that the string can be generated by the grammar for the source language. It reports any syntax
errors in the program. It also recovers from commonly occurring errors so that it can continue
processing its input.

Functions of the parser :


1. It verifies the structure generated by the tokens based on the grammar.
2. It constructs the parse tree.
3. It reports the errors.
4. It performs error recovery.
Issues :
Parser cannot detect errors such as:
1. Variable re-declaration
2. Variable initialization before use.
3. Data type mismatch for an operation.
The above issues are handled by Semantic Analysis phase.

Syntax error handling :


Programs can contain errors at many different levels. For example :
1. Lexical, such as misspelling a keyword.
2. Syntactic, such as an arithmetic expression with unbalanced parentheses.
3. Semantic, such as an operator applied to an incompatible operand.
4. Logical, such as an infinitely recursive call.
Functions of error handler :
1. It should report the presence of errors clearly and accurately.
2. It should recover from each error quickly enough to be able to detect subsequent errors.
3. It should not significantly slow down the processing of correct programs.
Error recovery strategies :
The different strategies that a parser uses to recover from a syntactic error are:
1. Panic mode
2. Phrase level
3. Error productions
4. Global correction
Panic mode recovery:
On discovering an error, the parser discards input symbols one at a time until
synchronizing token is found. The synchronizing tokens are usually delimiters, such as
semicolon or end. It has the advantage of simplicity and does not go into an infinite loop. When
multiple errors in the same statement are rare, this method is quite useful.
Phrase level recovery:
On discovering an error, the parser performs local correction on the remaining input that
allows it to continue. Example: Insert a missing semicolon or delete an extraneous semicolon etc.
Error productions:
The parser is constructed using augmented grammar with error productions. If an error
production is used by the parser, appropriate error diagnostics can be generated to indicate the
erroneous constructs recognized by the input.
Global correction:
Given an incorrect input string x and grammar G, certain algorithms can be used to find a
parse tree for a string y, such that the number of insertions, deletions and changes of tokens is as
small as possible. However, these methods are in general too costly in terms of time and space.
Regular Expressions vs. Context-Free Grammars:
REGULAR EXPRESSION
1. It is used to describe the tokens of programming languages.
2. It is used to check whether the given input is valid or not using transition diagram.

3. The transition diagram has a set of states and edges.
4. It has no start symbol.
5. It is useful for describing the structure of lexical constructs such as identifiers,
constants, keywords, and so forth.
CONTEXT-FREE GRAMMAR
1. It consists of a quadruple (V, T, P, S), where V is the set of variables (non-terminals),
T the set of terminals, P the set of productions, and S the start symbol.
2. It is used to check whether the given input is valid or not using derivation.
3. The context-free grammar has set of productions.
4. It has start symbol.
5. It is useful in describing nested structures such as balanced parentheses, matching
begin-ends and so on.
Eliminating Ambiguity
Sometimes an ambiguous grammar can be rewritten to eliminate the ambiguity. As an example,
we shall eliminate the ambiguity from the following "dangling-else" grammar:
stmt → if expr then stmt
     | if expr then stmt else stmt
     | other
Here "other" stands for any other statement. According to this grammar, the compound
conditional statement
if E1 then S1 else if E2 then S2 else S3
has the parse tree shown in Fig. 4.8. Grammar (4.14) is ambiguous since the string
if E1 then if E2 then S1 else S2
has the two parse trees shown in Fig. 4.9.
In all programming languages with conditional statements of this form, the first parse tree is
preferred. The general rule is, "Match each else with the closest unmatched then." This
disambiguating rule can theoretically be incorporated directly into a grammar, but in practice it is
rarely built into the productions.


We can rewrite the dangling-else grammar (4.14) as the following unambiguous grammar. The
idea is that a statement appearing between a then and an else must be "matched"; that is, the
interior statement must not end with an unmatched or open then. A matched statement is either
an if-then-else statement containing no open statements or it is any other kind of unconditional
statement. Thus, we may use the grammar of Fig. 4.10:
stmt → matched_stmt | open_stmt
matched_stmt → if expr then matched_stmt else matched_stmt | other
open_stmt → if expr then stmt | if expr then matched_stmt else open_stmt
This grammar generates the same strings as the dangling-else grammar (4.14), but it allows only
one parsing for string (4.15); namely, the one that associates each else with the closest previous
unmatched then.

Eliminating Left Recursion:


A grammar is said to be left recursive if it has a non-terminal A such that there is a
derivation A ⇒+ Aα for some string α. Top-down parsing methods cannot handle left-recursive
grammars. Hence, left recursion is eliminated as follows:
If there is a production A → Aα | β, it can be replaced with
A → βA'

A' → αA' | ε
without changing the set of strings derivable from A.
Example : Consider the following grammar for arithmetic expressions:
E → E + T | T
T → T * F | F
F → ( E ) | id
First eliminate the left recursion for E as
E → TE'
E' → +TE' | ε
Then eliminate for T as
T → FT'
T' → *FT' | ε
Thus the grammar obtained after eliminating left recursion is
E → TE'
E' → +TE' | ε
T → FT'
T' → *FT' | ε
F → ( E ) | id
Algorithm to eliminate left recursion:
1. Arrange the non-terminals in some order A1, A2, . . . , An.
2. for i := 1 to n do begin
     for j := 1 to i-1 do begin
       replace each production of the form Ai → Ajγ by the productions
       Ai → δ1γ | δ2γ | . . . | δkγ, where Aj → δ1 | δ2 | . . . | δk are all the current Aj-productions
     end
     eliminate the immediate left recursion among the Ai-productions
   end
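The inner transformation (eliminating immediate left recursion) can be sketched as a function over productions represented as symbol lists; the representation is illustrative, and ε is written as the empty list:

```python
def eliminate_immediate_left_recursion(head, bodies):
    """Split A -> A a1 | ... | A am | b1 | ... | bn into
    A  -> b1 A' | ... | bn A'
    A' -> a1 A' | ... | am A' | epsilon    (epsilon shown as [])."""
    recursive = [b[1:] for b in bodies if b and b[0] == head]
    other = [b for b in bodies if not b or b[0] != head]
    if not recursive:
        return {head: bodies}          # nothing to do
    new = head + "'"                   # naming convention: append a prime
    return {
        head: [b + [new] for b in other],
        new: [a + [new] for a in recursive] + [[]],  # [] is epsilon
    }

# E -> E + T | T   becomes   E -> T E',  E' -> + T E' | epsilon
result = eliminate_immediate_left_recursion("E", [["E", "+", "T"], ["T"]])
```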
Left factoring:
Left factoring is a grammar transformation that is useful for producing a grammar suitable
for predictive parsing. When it is not clear which of two alternative productions to use to expand a

non-terminal A, we can rewrite the A-productions to defer the decision until we have seen
enough of the input to make the right choice.
If there is any production A → αβ1 | αβ2, it can be rewritten as
A → αA'
A' → β1 | β2
Consider the grammar,
G : S → iEtS | iEtSeS | a
E → b
Left factored, this grammar becomes
S → iEtSS' | a
S' → eS | ε
E → b
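One round of left factoring can be sketched as follows; productions are symbol lists, ε is the empty list, and the new non-terminal is named by appending a prime (all representation choices here are illustrative):

```python
from collections import defaultdict

def common_prefix(seqs):
    """Longest common prefix of a list of symbol sequences."""
    prefix = []
    for symbols in zip(*seqs):
        if all(s == symbols[0] for s in symbols):
            prefix.append(symbols[0])
        else:
            break
    return prefix

def left_factor(head, bodies):
    """One round of left factoring: pull the longest common prefix out
    of the alternatives that share a first symbol, so
    A -> a b1 | a b2 | g   becomes   A -> a A' | g,  A' -> b1 | b2
    (an empty list stands for epsilon)."""
    groups = defaultdict(list)
    for b in bodies:
        groups[b[0] if b else None].append(b)
    for first, group in groups.items():
        if first is None or len(group) < 2:
            continue
        prefix = common_prefix(group)
        new = head + "'"
        rest = [b for b in bodies if b not in group]
        return {head: [prefix + [new]] + rest,
                new: [b[len(prefix):] for b in group]}
    return {head: bodies}

# S -> iEtS | iEtSeS | a   becomes   S -> iEtSS' | a,  S' -> epsilon | eS
result = left_factor(
    "S", [["i", "E", "t", "S"], ["i", "E", "t", "S", "e", "S"], ["a"]])
```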
PARSING
It is the process of analyzing a continuous stream of input in order to determine its
grammatical structure with respect to a given formal grammar.
Parse tree:
Graphical representation of a derivation or deduction is called a parse tree. Each interior
node of the parse tree is a non-terminal; the children of the node can be terminals or nonterminals.
Types of parsing:
1. Top-down parsing
2. Bottom-up parsing
Top-down parsing : A parser can start with the start symbol and try to transform it to the
input string. Example : LL parsers.
Bottom-up parsing : A parser can start with the input and attempt to rewrite it into the start
symbol. Example : LR parsers.
Top-down parsing:

Top-down parsing can be viewed as the problem of constructing a parse tree for the input string,
starting from the root and creating the nodes of the parse tree in preorder. Equivalently, top-down
parsing can be viewed as finding a leftmost derivation for an input string.
Example


This sequence of trees corresponds to a leftmost derivation of the input


At each step of a top-down parse, the key problem is that of determining the production to be
applied for a nonterminal, say A. Once an A-production is chosen, the rest of the parsing process
consists of "matching" the terminal symbols in the production body with the input string.
For example, consider the top-down parsing as below

Recursive-Descent Parsing
A recursive-descent parsing program consists of a set of procedures, one for each non terminal.
Execution begins with the procedure for the start symbol, which halts and announces success if
its procedure body scans the entire input string. Pseudo code for a typical nonterminal appears in
Fig. 4.13. Note that this pseudo code is nondeterministic, since it begins by choosing the
A-production to apply in a manner that is not specified.
General recursive descent may require backtracking; that is, it may require repeated scans over
the input. However, backtracking is rarely needed to parse programming language constructs, so
backtracking parsers are not seen frequently.

Algorithm (A typical procedure for a nonterminal in a top-down parser)



Example for recursive decent parsing:


A left-recursive grammar can cause a recursive-descent parser to go into an infinite loop. Hence,
elimination of left-recursion must be done before parsing.
Consider the grammar for arithmetic expressions
E → E + T | T
T → T * F | F
F → ( E ) | id
After eliminating the left recursion the grammar becomes
E → TE'
E' → +TE' | ε
T → FT'
T' → *FT' | ε
F → ( E ) | id
Now we can write the procedures for the grammar as follows:
Recursive procedures:

Procedure E( )
begin
  T( );
  EPRIME( );
end

Procedure EPRIME( )
begin
  if input_symbol = '+' then begin
    ADVANCE( );
    T( );
    EPRIME( );
  end
end

Procedure T( )
begin
  F( );
  TPRIME( );
end

Procedure TPRIME( )
begin
  if input_symbol = '*' then begin
    ADVANCE( );
    F( );
    TPRIME( );
  end
end

Procedure F( )
begin
  if input_symbol = id then
    ADVANCE( );
  else if input_symbol = '(' then begin
    ADVANCE( );
    E( );
    if input_symbol = ')' then ADVANCE( );
    else ERROR( );
  end
  else ERROR( );
end
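The procedures above translate almost directly into a runnable sketch. Assuming the input is already tokenized (so id is a single token) and replacing ERROR( ) with an exception, one illustrative version:

```python
class Parser:
    """Recursive-descent recognizer for E -> TE', E' -> +TE'|eps,
    T -> FT', T' -> *FT'|eps, F -> (E)|id."""

    def __init__(self, tokens):
        self.tokens = tokens + ["$"]   # endmarker
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos]

    def advance(self):                 # ADVANCE( )
        self.pos += 1

    def error(self):                   # ERROR( )
        raise SyntaxError("unexpected token " + self.peek())

    def E(self):                       # E -> T E'
        self.T(); self.EPRIME()

    def EPRIME(self):                  # E' -> + T E' | epsilon
        if self.peek() == "+":
            self.advance(); self.T(); self.EPRIME()

    def T(self):                       # T -> F T'
        self.F(); self.TPRIME()

    def TPRIME(self):                  # T' -> * F T' | epsilon
        if self.peek() == "*":
            self.advance(); self.F(); self.TPRIME()

    def F(self):                       # F -> ( E ) | id
        if self.peek() == "id":
            self.advance()
        elif self.peek() == "(":
            self.advance(); self.E()
            if self.peek() == ")":
                self.advance()
            else:
                self.error()
        else:
            self.error()

def accepts(tokens):
    """True iff the token list is a sentence of the grammar."""
    try:
        p = Parser(tokens)
        p.E()
        return p.peek() == "$"         # all input consumed
    except SyntaxError:
        return False

print(accepts(["id", "+", "id", "*", "id"]))   # True
print(accepts(["id", "+", "*"]))               # False
```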
Stack implementation:
To recognize input id+id*id, the procedures are called in the following order (the INPUT
column shows the part of the input not yet scanned after each call):

PROCEDURE        INPUT
E( )             id+id*id
T( )             id+id*id
F( )             id+id*id
ADVANCE( )       +id*id
TPRIME( )        +id*id
EPRIME( )        +id*id
ADVANCE( )       id*id
T( )             id*id
F( )             id*id
ADVANCE( )       *id
TPRIME( )        *id
ADVANCE( )       id
F( )             id
ADVANCE( )       (end of input)
TPRIME( )        (end of input)

Nonrecursive Predictive Parsing


A nonrecursive predictive parser can be built by maintaining a stack explicitly, rather than
implicitly via recursive calls. The parser mimics a leftmost derivation. If w is the input that has
been matched so far, then the stack holds a sequence of grammar symbols α such that
S ⇒* wα (by a leftmost derivation).
The table-driven parser in Fig. 4.19 has an input buffer, a stack containing a sequence of
grammar symbols, a parsing table constructed by Algorithm 4.31, and an output stream. The
input buffer contains the string to be parsed, followed by the endmarker $. We reuse the symbol
$ to mark the bottom of the stack, which initially contains the start symbol of the grammar on top
of $. The parser is controlled by a program that considers X, the symbol on top of the stack, and
a, the current input symbol. If X is a nonterminal, the parser chooses an X-production by
consulting entry M[X, a] of the parsing table M. (Additional code could be executed here, for
example, code to construct a node in a parse tree.) Otherwise, it checks for a match between the
terminal X and the current input symbol a.


Input buffer:
It consists of strings to be parsed, followed by $ to indicate the end of the input string.
Stack:
It contains a sequence of grammar symbols preceded by $ to indicate the bottom of the stack.
Initially, the stack contains the start symbol on top of $.
Parsing table:
It is a two-dimensional array M[A, a], where A is a non-terminal and a is a terminal.
Predictive parsing program:
The parser is controlled by a program that considers X, the symbol on top of stack, and a, the
current input symbol. These two symbols determine the parser action. There are three
possibilities:
1. If X = a = $, the parser halts and announces successful completion of parsing.
2. If X = a ≠ $, the parser pops X off the stack and advances the input pointer to the next
input symbol.
3. If X is a non-terminal, the program consults entry M[X, a] of the parsing table M. This
entry will either be an X-production of the grammar or an error entry.
If M[X, a] = {X → UVW}, the parser replaces X on top of the stack by WVU, so that U ends up
on top. If M[X, a] = error, the parser calls an error recovery routine.
Algorithm for nonrecursive predictive parsing:
Input : A string w and a parsing table M for grammar G.
Output : If w is in L(G), a leftmost derivation of w; otherwise, an error indication.
Method : Initially, the parser has $S on the stack with S, the start symbol of G on top, and w$ in
the input buffer. The program that utilizes the predictive parsing table M to produce a parse for
the input is as follows:
set ip to point to the first symbol of w$;
repeat
  let X be the top stack symbol and a the symbol pointed to by ip;
  if X is a terminal or $ then
    if X = a then
      pop X from the stack and advance ip
    else error()
  else /* X is a non-terminal */
    if M[X, a] = X → Y1 Y2 . . . Yk then begin
      pop X from the stack;
      push Yk, Yk-1, . . . , Y1 onto the stack, with Y1 on top;
      output the production X → Y1 Y2 . . . Yk
    end
    else error()
until X = $ /* stack is empty */
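The loop above can be sketched in Python. The table M below is written out by hand for the usual expression grammar (it is the table constructed from FIRST and FOLLOW later in these notes); an empty body stands for ε:

```python
# Parsing table for E -> TE', E' -> +TE'|eps, T -> FT',
# T' -> *FT'|eps, F -> (E)|id.  [] is an epsilon body.
M = {
    ("E", "id"): ["T", "E'"],   ("E", "("): ["T", "E'"],
    ("E'", "+"): ["+", "T", "E'"],
    ("E'", ")"): [],            ("E'", "$"): [],
    ("T", "id"): ["F", "T'"],   ("T", "("): ["F", "T'"],
    ("T'", "+"): [],            ("T'", "*"): ["*", "F", "T'"],
    ("T'", ")"): [],            ("T'", "$"): [],
    ("F", "id"): ["id"],        ("F", "("): ["(", "E", ")"],
}
NONTERMINALS = {"E", "E'", "T", "T'", "F"}

def parse(tokens, start="E"):
    """Return the list of productions used (a leftmost derivation),
    or raise SyntaxError."""
    input_ = tokens + ["$"]
    stack = ["$", start]               # start symbol on top of $
    ip = 0
    output = []
    while stack[-1] != "$":
        X, a = stack[-1], input_[ip]
        if X in NONTERMINALS:
            if (X, a) not in M:
                raise SyntaxError(f"no entry M[{X}, {a}]")
            body = M[(X, a)]
            output.append((X, body))
            stack.pop()
            stack.extend(reversed(body))   # push so Y1 ends up on top
        elif X == a:
            stack.pop(); ip += 1           # match a terminal
        else:
            raise SyntaxError(f"expected {X}, saw {a}")
    if input_[ip] != "$":
        raise SyntaxError("input left over")
    return output

derivation = parse(["id", "+", "id", "*", "id"])
```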
Predictive parsing table construction:
The construction of a predictive parser is aided by two functions associated with a grammar G :
1. FIRST
2. FOLLOW
Rules for FIRST( ):
1. If X is a terminal, then FIRST(X) is {X}.
2. If X → ε is a production, then add ε to FIRST(X).
3. If X is a non-terminal and X → aα is a production, then add a to FIRST(X).
4. If X is a non-terminal and X → Y1 Y2 . . . Yk is a production, then place a in FIRST(X) if for
some i, a is in FIRST(Yi) and ε is in all of FIRST(Y1), . . . , FIRST(Yi-1); that is,
Y1 . . . Yi-1 ⇒* ε. If ε is in FIRST(Yj) for all j = 1, 2, . . . , k, then add ε to FIRST(X).
Rules for FOLLOW( ):
1. If S is the start symbol, then FOLLOW(S) contains $.
2. If there is a production A → αBβ, then everything in FIRST(β) except ε is placed in
FOLLOW(B).
3. If there is a production A → αB, or a production A → αBβ where FIRST(β) contains ε, then
everything in FOLLOW(A) is in FOLLOW(B).
Algorithm for construction of predictive parsing table:
Input : Grammar G
Output : Parsing table M
Method :
1. For each production A → α of the grammar, do steps 2 and 3.
2. For each terminal a in FIRST(α), add A → α to M[A, a].
3. If ε is in FIRST(α), add A → α to M[A, b] for each terminal b in FOLLOW(A). If ε is in
FIRST(α) and $ is in FOLLOW(A), add A → α to M[A, $].
4. Make each undefined entry of M be error.
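The FIRST and FOLLOW rules above can be computed as a fixed point: keep applying the rules until no set grows. A sketch for the left-recursion-free expression grammar, where the empty string "" stands for ε:

```python
EPS = ""          # "" plays the role of epsilon
G = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], [EPS]],
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], [EPS]],
    "F":  [["(", "E", ")"], ["id"]],
}
START = "E"
NT = set(G)

def first_of_seq(seq, FIRST):
    """FIRST of a string of grammar symbols (rule 4)."""
    out = set()
    for sym in seq:
        if sym == EPS:
            out.add(EPS); break
        f = FIRST[sym] if sym in NT else {sym}   # rule 1 for terminals
        out |= f - {EPS}
        if EPS not in f:
            break
    else:
        out.add(EPS)        # every symbol could vanish
    return out

FIRST = {A: set() for A in NT}
changed = True
while changed:              # iterate rules 2-4 to a fixed point
    changed = False
    for A, bodies in G.items():
        for body in bodies:
            new = first_of_seq(body, FIRST) - FIRST[A]
            if new:
                FIRST[A] |= new; changed = True

FOLLOW = {A: set() for A in NT}
FOLLOW[START].add("$")      # rule 1
changed = True
while changed:              # iterate rules 2-3 to a fixed point
    changed = False
    for A, bodies in G.items():
        for body in bodies:
            for i, B in enumerate(body):
                if B not in NT:
                    continue
                tail = first_of_seq(body[i + 1:], FIRST)
                new = (tail - {EPS}) | (FOLLOW[A] if EPS in tail else set())
                if new - FOLLOW[B]:
                    FOLLOW[B] |= new; changed = True
```

Running this reproduces the sets listed in the example that follows.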
Example:
Consider the following grammar :
E → E + T | T
T → T * F | F
F → ( E ) | id
After eliminating left recursion the grammar is
E → TE'
E' → +TE' | ε
T → FT'
T' → *FT' | ε
F → ( E ) | id
FIRST( ) :
FIRST(E) = { ( , id }
FIRST(E') = { + , ε }
FIRST(T) = { ( , id }
FIRST(T') = { * , ε }
FIRST(F) = { ( , id }
FOLLOW( ) :
FOLLOW(E) = { $, ) }
FOLLOW(E') = { $, ) }
FOLLOW(T) = { +, $, ) }
FOLLOW(T') = { +, $, ) }
FOLLOW(F) = { +, *, $, ) }
Predictive parsing table :

        id        +          *          (         )        $
E       E→TE'                           E→TE'
E'                E'→+TE'                         E'→ε     E'→ε
T       T→FT'                           T→FT'
T'                T'→ε      T'→*FT'               T'→ε     T'→ε
F       F→id                            F→(E)

Moves made by the parser on input id+id*id :

Stack        Input         Output
$E           id+id*id$
$E'T         id+id*id$     E → TE'
$E'T'F       id+id*id$     T → FT'
$E'T'id      id+id*id$     F → id
$E'T'        +id*id$
$E'          +id*id$       T' → ε
$E'T+        +id*id$       E' → +TE'
$E'T         id*id$
$E'T'F       id*id$        T → FT'
$E'T'id      id*id$        F → id
$E'T'        *id$
$E'T'F*      *id$          T' → *FT'
$E'T'F       id$
$E'T'id      id$           F → id
$E'T'        $
$E'          $             T' → ε
$            $             E' → ε

LL(1) grammar:
If every entry of the parsing table contains at most one production (i.e., the table has no
multiply defined entries), the grammar is called an LL(1) grammar.
Consider the following grammar:
S → iEtS | iEtSeS | a
E → b
After left factoring, we have
S → iEtSS' | a
S' → eS | ε
E → b
To construct a parsing table, we need FIRST( ) and FOLLOW( ) for all the non-terminals.
FIRST(S) = { i, a }
FIRST(S') = { e, ε }
FIRST(E) = { b }
FOLLOW(S) = { $, e }
FOLLOW(S') = { $, e }
FOLLOW(E) = { t }


Parsing table:

        a        b        e                 i           t        $
S       S→a                                 S→iEtSS'
S'                        S'→eS, S'→ε                            S'→ε
E                E→b

The entry M[S', e] contains two productions, S' → eS and S' → ε. Since there is more than one
production in an entry, the grammar is not LL(1).
Actions performed in predictive parsing:
1. Shift
2. Reduce
3. Accept
4. Error
Implementation of predictive parser:
1. Elimination of left recursion, left factoring and ambiguous grammar.
2. Construct FIRST() and FOLLOW() for all non-terminals.
3. Construct predictive parsing table.
4. Parse the given input string using stack and parsing table.

Error Recovery in Predictive Parsing


This discussion of error recovery refers to the stack of a table-driven predictive parser, since it
makes explicit the terminals and nonterminals that the parser hopes to match with the remainder
of the input; the techniques can also be used with recursive-descent parsing.
An error is detected during predictive parsing when the terminal on top of the stack does not
match the next input symbol or when nonterminal A is on top of the stack, a is the next input
symbol, and M[A, a] is error (i.e., the parsing-table entry is empty).
Panic Mode

Panic-mode error recovery is based on the idea of skipping symbols on the input until a token
in a selected set of synchronizing tokens appears. Its effectiveness depends on the choice of
synchronizing set. The sets should be chosen so that the parser recovers quickly from errors that
are likely to occur in practice. Some heuristics are as follows:
1. As a starting point, place all symbols in FOLLOW(A) into the synchronizing set for
nonterminal A. If we skip tokens until an element of FOLLOW(A) is seen and pop A from the
stack, it is likely that parsing can continue.
2. It is not enough to use FOLLOW(A) as the synchronizing set for A. For example, if
semicolons terminate statements, as in C, then keywords that begin statements may not appear in
the FOLLOW set of the nonterminal representing expressions. A missing semicolon after an
assignment may therefore result in the keyword beginning the next statement being skipped.
Often, there is a hierarchical structure on constructs in a language; for example, expressions
appear within statements, which appear within blocks, and so on. We can add to the
synchronizing set of a lower-level construct the symbols that begin higher-level constructs. For


example, we might add keywords that begin statements to the synchronizing sets for the
nonterminals generating expressions.
3. If we add symbols in FIRST(A) to the synchronizing set for nonterminal A, then it may be
possible to resume parsing according to A if a symbol in FIRST(A) appears in the input.
4. If a nonterminal can generate the empty string, then the production deriving ε can be used as a
default. Doing so may postpone some error detection, but cannot cause an error to be missed.
This approach reduces the number of nonterminals that have to be considered during error
recovery.
5. If a terminal on top of the stack cannot be matched, a simple idea is to pop the terminal, issue
a message saying that the terminal was inserted, and continue parsing. In effect, this approach
takes the synchronizing set of a token to consist of all other tokens.
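Heuristic 1 can be sketched as a helper for a table-driven parser: on an error at nonterminal A, skip input until a member of FOLLOW(A) (or the endmarker) appears, then pop A and continue. The token names and the small FOLLOW set below are illustrative:

```python
def panic_mode_skip(tokens, ip, follow_A):
    """Skip input symbols until a synchronizing token (here, a member
    of FOLLOW(A)) or the endmarker $ is found; return the new input
    position.  The caller then pops A off the stack and resumes."""
    while tokens[ip] != "$" and tokens[ip] not in follow_A:
        ip += 1
    return ip

# Illustrative: recovering for an expression nonterminal whose FOLLOW
# set is { ;, ) } after garbage tokens appear in the input.
tokens = ["id", "@", "@", ";", "id", "$"]
new_ip = panic_mode_skip(tokens, 1, {";", ")"})
print(new_ip)          # 3: parsing resumes at the ';'
```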

