
CS1622

Lecture 6 Introduction to Parsing

CS 1622 Lecture 6

The Front End

[Figure: Source code -> Scanner -> tokens -> Parser -> IR, plus an Errors output]

Parser
  Checks the stream of words and their parts of speech (produced by the scanner) for grammatical correctness
  Determines if the input is syntactically well formed
  Guides checking at deeper levels than syntax
  Builds an IR representation of the code

Think of this as the mathematics of diagramming sentences



The Functionality of the Parser


Input: sequence of tokens from lexer
Output: parse tree of the program
  parse tree is generated if the input is a legal program
Instead of parse tree, some parsers produce directly:
  abstract syntax tree (AST) + symbol table, or
  intermediate code, or
  object code

Comparison with Lexical Analysis


Phase     Input                  Output
Scanner   String of characters   String of tokens
Parser    String of tokens       Parse tree

Example
The program: x*y+z
Input to parser: ID TIMES ID PLUS ID
  we'll write tokens as follows: id * id + id
Output of parser: the parse tree

          E
        / | \
       E  +  E
     / | \   |
    E  *  E  id
    |     |
    id    id
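To make the parser's input concrete, here is a minimal scanner sketch that turns the program text into the token sequence shown above. It is an illustration only: the table `TOKEN_SPEC`, the function `scan`, and the identifier pattern are assumptions, not part of the lecture.

```python
import re

# Hypothetical token table; the names follow the ID / TIMES / PLUS spelling above.
TOKEN_SPEC = [
    ("ID", r"[A-Za-z_]\w*"),
    ("TIMES", r"\*"),
    ("PLUS", r"\+"),
    ("SKIP", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def scan(source):
    """Return the list of token kinds in source, skipping whitespace."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at position {pos}")
        if match.lastgroup != "SKIP":
            tokens.append(match.lastgroup)
        pos = match.end()
    return tokens


print(scan("x*y+z"))  # ['ID', 'TIMES', 'ID', 'PLUS', 'ID']
```

The scanner only produces a flat list; recovering the tree structure is the parser's job.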


What must parser do?

Not all strings of tokens are valid programs
  parser must distinguish between valid and invalid strings of tokens
Parser must expose program structure (associativity and precedence)
  parser must return the parse tree
We need:
  A language for describing valid strings of tokens
  A method for distinguishing valid from invalid strings of tokens (and for building the parse tree)

Parser
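[Figure: source -> lexical analyzer -> token -> parser -> parse tree -> rest of front end -> IR; the parser sends "get next token" requests to the lexical analyzer, and the diagram also shows a symbol table]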

Parsing = determining whether a string of tokens can be generated by a grammar



Grammars

Precise, easy-to-understand description of syntax
Context-free grammars -> efficient parsers (automatically!)
Help in translation and error detection
  E.g., attribute grammars
Can add new constructs systematically
  Easier language evolution

Syntax Errors

Many errors are syntactic or exposed by parsing
  e.g., unbalanced ()
Error handling goals:
  Report errors quickly & accurately
  Recover quickly (continue parsing after error)
  Little overhead on parse time


Error Recovery

Panic mode
  Discard tokens until synchronization token found (often ;)
Phrase level
  Local correction: replace a token by another and continue
Error productions
  Encode commonly expected errors in grammar
Global correction
  Find closest input string that is in L(G)
  Too costly in practice
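Panic mode is the easiest of these strategies to sketch in code. The example below is a hedged illustration, not the lecture's implementation: the toy statement form `id = id ;` and the function `parse_statements` are assumptions chosen to keep the example short.

```python
def parse_statements(tokens):
    """Parse a sequence of 'id = id ;' statements.

    Returns (number of well-formed statements, positions where errors were found).
    """
    sync = ";"  # synchronization token
    pos, good, errors = 0, 0, []
    while pos < len(tokens):
        if tokens[pos:pos + 4] == ["id", "=", "id", ";"]:
            good += 1
            pos += 4
        else:
            errors.append(pos)
            # Panic mode: discard tokens up to and including the next ';'.
            while pos < len(tokens) and tokens[pos] != sync:
                pos += 1
            pos += 1
    return good, errors


tokens = ["id", "=", "id", ";", "id", "id", ";", "id", "=", "id", ";"]
print(parse_statements(tokens))  # (2, [4])
```

After the error at position 4 the parser resumes at the statement following the `;`, so the third statement is still checked.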


Context-free Grammars

Precise and easy way to specify the syntactic structure of a programming language
Efficient recognition methods exist
Natural specification of many recursive constructs:
  expr -> expr + expr | term

Context-free Grammar Definition

Terminals T
  Symbols which form strings of L(G), G a CFG (= tokens in the scanner), e.g. if, else, id
Nonterminals N
  Syntactic variables denoting sets of strings of L(G)
  Impose hierarchical structure (e.g., precedence rules)
Start symbol S (in N)
  Denotes the set of strings of L(G)
Productions P
  Rules that determine how strings are formed
  N -> (N|T)*
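As a rough illustration of the four components, a grammar can be written down as a plain data structure. The sketch below is not from the lecture: the class name `Grammar`, the method `check`, and the production `term -> id` are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grammar:
    terminals: frozenset      # T
    nonterminals: frozenset   # N
    start: str                # S, must be a member of N
    productions: tuple        # P, pairs (lhs, rhs) with rhs a tuple of symbols

    def check(self):
        """Return True if S is in N and every production has the shape N -> (N|T)*."""
        if self.terminals & self.nonterminals:
            return False
        if self.start not in self.nonterminals:
            return False
        symbols = self.terminals | self.nonterminals
        return all(
            lhs in self.nonterminals and all(sym in symbols for sym in rhs)
            for lhs, rhs in self.productions
        )


g = Grammar(
    terminals=frozenset({"id", "+"}),
    nonterminals=frozenset({"expr", "term"}),
    start="expr",
    productions=(
        ("expr", ("expr", "+", "expr")),
        ("expr", ("term",)),
        ("term", ("id",)),   # assumed production, added so that term derives something
    ),
)
print(g.check())  # True
```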


Why are regular expressions not enough?

What programs are generated by the following regular expression?
  digit+ ( ( + | - | * | / ) digit+ )*
What important properties does this regular expression fail to express?
  no structure! Generates a list rather than a tree
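A short experiment shows the "list rather than a tree" point. This is a sketch under assumptions: the regular expression is transcribed into Python `re` syntax, and the helper `flat_pieces` is a hypothetical name.

```python
import re

# digit+ ( ( + | - | * | / ) digit+ )*  written in Python's regex syntax
EXPR = re.compile(r"\d+([+\-*/]\d+)*")


def flat_pieces(text):
    """Return the operands and operators in order, or None if text does not match."""
    if EXPR.fullmatch(text) is None:
        return None
    return re.findall(r"\d+|[+\-*/]", text)


print(flat_pieces("1+2*3"))  # ['1', '+', '2', '*', '3']
```

The match succeeds, but all it yields is a flat sequence; nothing records that `2*3` should be grouped before the addition.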


Why are regular expressions not enough?

Write an automaton that accepts strings
  "a", "(a)", "((a))", and "(((a)))"
  "a", "(a)", "((a))", "(((a)))", ..., "(^k a )^k" for any k
    cannot do: regular expressions cannot count.
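By contrast, a recursive definition handles any nesting depth. The sketch below assumes the grammar S -> ( S ) | a for this language; both the grammar and the function name `accepts` are illustrative choices, not taken from the slides.

```python
def accepts(s):
    """Accept exactly the strings (^k a )^k, following S -> ( S ) | a."""
    if s == "a":                      # S -> a
        return True
    if len(s) >= 3 and s[0] == "(" and s[-1] == ")":
        return accepts(s[1:-1])       # S -> ( S )
    return False


for text in ["a", "(a)", "(((a)))", "((a)", "(a))", ""]:
    print(repr(text), accepts(text))
```

Each recursive call matches one opening parenthesis with one closing parenthesis, which is the counting that a finite automaton has no memory for.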


Example: Expression Grammar


expr -> expr op expr
expr -> ( expr )
expr -> - expr
expr -> id
op -> +
op -> -
op -> *
op -> /
op -> ^

Terminals: {id, +, -, *, /, ^, (, )}
Nonterminals: {expr, op}
Start symbol: expr

Notational Conventions

Terminals
  a, b, c, ...
  +, -, ...
  , . ; etc.
  0..9
Nonterminals
  A, B, C, ...
  S start symbol (if present), or first nonterminal in production list
  expr or <expr>
Terminal strings
  u, v, ...
Grammar symbol strings
  α, β, ...
Productions
  A -> α

Shorthands & Derivations


E -> E + E | E * E | (E) | - E | <id>

E => - E    "E derives - E"
=>     derives in 1 step
=>*    derives in n (0..) steps

More Definitions

L(G): language generated by G = set of strings derived from S
S =>+ w : w is a sentence of G (w is a string of terminals)
S =>+ α : α is a sentential form of G (the string can contain nonterminals)
G and G' are equivalent: L(G) = L(G')
A language generated by a grammar (of the form shown) is called a context-free language

Example
G = ({+, -, *, (, ), <id>}, {E}, E, {E -> E + E, E -> E * E, E -> (E), E -> - E, E -> <id>})
Sentence: -(<id> + <id>)
Derivation: E => -E => -(E) => -(E+E) => -(<id>+E) => -(<id> + <id>)

Leftmost derivation, i.e. always replace the leftmost nonterminal
Rightmost derivation analogously
Left/right sentential form
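The leftmost-derivation rule can be replayed mechanically. The following is a small sketch, assuming sentential forms are stored as lists of symbols; the helper `leftmost_derivation` is hypothetical and does not check that each chosen right-hand side is actually a production of G.

```python
NONTERMINALS = {"E"}


def leftmost_derivation(start, choices):
    """Apply each chosen right-hand side to the leftmost nonterminal, in order."""
    form = [start]
    steps = ["".join(form)]
    for rhs in choices:
        # Index of the leftmost nonterminal in the current sentential form.
        index = next(i for i, sym in enumerate(form) if sym in NONTERMINALS)
        form[index:index + 1] = rhs
        steps.append("".join(form))
    return steps


choices = [
    ["-", "E"],          # E -> - E
    ["(", "E", ")"],     # E -> ( E )
    ["E", "+", "E"],     # E -> E + E
    ["<id>"],            # E -> <id>
    ["<id>"],            # E -> <id>
]
print(" => ".join(leftmost_derivation("E", choices)))
# E => -E => -(E) => -(E+E) => -(<id>+E) => -(<id>+<id>)
```

Choosing the rightmost nonterminal instead would give the rightmost derivation of the same sentence.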


Parse Trees
E => -E => -(E) => -(E+E) => -(<id>+E) => -(<id> + <id>)

Parse tree = graphical representation of a derivation ignoring replacement order

        E
       / \
      -   E
        / | \
       (  E  )
        / | \
       E  +  E
       |     |
     <id>  <id>

Expressive Power

CFGs are more powerful than REs
  Can express matching () with CFGs
  Can express most properties desired for programming languages
CFGs cannot express:
  Identifiers declared before used
    L = {wcw | w is in (a|b)*}
  Parameter checking (#formals = #actuals)
    L = {a^n b^m c^n d^m | n >= 1, m >= 1}

Parsing

= determining whether a string of tokens can be generated by a grammar
Two classes based on order in which parse tree is constructed:
  Top-down parsing
    Start construction at root of parse tree
  Bottom-up parsing
    Start at leaves and proceed to root

Derivations and Parse Trees


A derivation is a sequence of productions
  S => ... => ... => ...
A derivation can be drawn as a tree
  Start symbol is the tree's root
  For a production N -> X1 ... Xn, add children X1 ... Xn to node N

Derivation Example

E -> E + E | E * E
E -> id | (E)
Derivation for: id * id + id
Derivation for: id + id + id

See board


Notes on Derivations

A parse tree has
  Terminals at the leaves
  Non-terminals at the interior nodes
An in-order traversal of the leaves is the original input
The parse tree shows the association of operations, the input string does not
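Both notes can be checked on a small example. In the sketch below a parse tree is a nested tuple `(label, child, ...)` and a leaf is a plain string; this encoding and the names `leaves`, `plus_at_root`, and `times_at_root` are assumptions made for illustration.

```python
def leaves(tree):
    """Collect the terminals at the leaves, left to right."""
    if isinstance(tree, str):
        return [tree]
    _label, *children = tree
    result = []
    for child in children:
        result.extend(leaves(child))
    return result


# Two different trees over the same input id * id + id.
plus_at_root = ("E", ("E", ("E", "id"), "*", ("E", "id")), "+", ("E", "id"))
times_at_root = ("E", ("E", "id"), "*", ("E", ("E", "id"), "+", ("E", "id")))

print(" ".join(leaves(plus_at_root)))    # id * id + id
print(" ".join(leaves(times_at_root)))   # id * id + id
print(plus_at_root == times_at_root)     # False
```

The leaves spell the same input in both cases, yet the trees group the operations differently, which is exactly the information the flat string does not carry.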

Terminals

Terminals are so called because there are no rules for replacing them
Once generated, terminals are permanent
Terminals are the tokens of the language represented by the grammar

Left-most and Right-most Derivations

The example is a leftmost derivation
  At each step, replace the left-most non-terminal
    E => E*E => id*E => id*E+E => id*id+E => id*id+id
There is an equivalent notion of a right-most derivation

Right-most Derivation in Detail (1)


E => E+E => E+id => E*E+id => E*id+id => id*id+id

Derivations and Parse Trees

Note that right-most and left-most derivations have the same parse tree
The difference is the order in which branches are added

Summary of Derivations

We are not just interested in whether s is in L(G)

We need a parse tree for s

A derivation defines a parse tree

But one parse tree may have many derivations

Left-most and right-most derivations are important in parser implementation



Ambiguity (Cont.)
This string (id * id + id) has two parse trees

          E                      E
        / | \                  / | \
       E  +  E                E  *  E
     / | \   |                |   / | \
    E  *  E  id               id E  +  E
    |     |                      |     |
    id    id                     id    id

Parse trees
Question 1:

for each of the two parse trees, find the corresponding left-most derivation

Question 2:

for each of the two parse trees, find the corresponding right-most derivation


Ambiguity (Cont.)

A grammar is ambiguous if for some string
(the following three conditions are equivalent)
  it has more than one parse tree
  there is more than one right-most derivation
  there is more than one left-most derivation
Ambiguity is BAD
  Leaves meaning of some programs ill-defined
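Ambiguity can be detected for a particular string by counting its parse trees. The sketch below assumes the grammar E -> E + E | E * E | id (parentheses left out to keep it short); the function `count_parse_trees` is a hypothetical helper, not an algorithm from the lecture.

```python
from functools import lru_cache


def count_parse_trees(tokens):
    """Count the parse trees of tokens under E -> E + E | E * E | id."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(lo, hi):
        # Number of ways E derives tokens[lo:hi].
        if hi - lo == 1:
            return 1 if tokens[lo] == "id" else 0
        total = 0
        for mid in range(lo + 1, hi - 1):
            if tokens[mid] in ("+", "*"):
                # The operator at mid is the root: E -> E op E.
                total += count(lo, mid) * count(mid + 1, hi)
        return total

    return count(0, len(tokens))


print(count_parse_trees(["id", "*", "id", "+", "id"]))  # 2
```

A count above 1 for any string is enough to show that the grammar is ambiguous.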


Dealing with Ambiguity

There are several ways to handle ambiguity
  Most direct method is to rewrite grammar unambiguously
    E -> E' + E | E'
    E' -> id * E' | id | (E)
    Enforces precedence of * over +
  Separate (external of grammar) conflict-resolution rules
    E.g., precedence or associativity rules (e.g. via parser tool declarations)
    Same idea we saw with scanning: instead of complicated REs use symbol table to recognize keywords

Expression Grammars (precedence)

Rewrite the grammar
  use a different nonterminal for each precedence level
  start with the lowest precedence (MINUS)

E -> E - E | E / E | ( E ) | id

rewrite to
  E -> E - E | T
  T -> T / T | F
  F -> id | ( E )

Example
parse tree for id - id / id

E -> E - E | T
T -> T / T | F
F -> id | ( E )

            E
         /  |  \
        E   -   E
        |       |
        T       T
        |     / | \
        F    T  /  T
        |    |     |
        id   F     F
             |     |
             id    id

More than one parse tree?

Attempt to construct a parse tree for id - id / id that shows the wrong precedence.
Question:
  Why do you fail to construct this parse tree?

Associativity

The grammar captures operator precedence, but it is still ambiguous!

fails to express that both subtraction and division are left associative;

e.g., 5-3-2 is equivalent to: ((5-3)-2) and not to: (5-(3-2)).
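The two groupings really do give different values, which is why the grammar has to pick one. A minimal arithmetic check follows; the function names are illustrative only.

```python
from functools import reduce


def subtract_left_assoc(operands):
    """Group as ((a - b) - c) ..."""
    return reduce(lambda acc, x: acc - x, operands)


def subtract_right_assoc(operands):
    """Group as a - (b - (c ...))."""
    return reduce(lambda acc, x: x - acc, reversed(operands))


print(subtract_left_assoc([5, 3, 2]))   # 0, i.e. ((5-3)-2)
print(subtract_right_assoc([5, 3, 2]))  # 4, i.e. (5-(3-2))
```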


Recursion

A grammar is recursive in nonterminal X if:
  X =>+ ... X ...
  recall that =>+ means "in one or more steps": X derives a sequence of symbols that includes an X

A grammar is left recursive in X if:
  X =>+ X ...
  in one or more steps, X derives a sequence of symbols that starts with an X

A grammar is right recursive in X if:
  X =>+ ... X
  in one or more steps, X derives a sequence of symbols that ends with an X

How to fix associativity

The grammar given above is both left and right recursive in nonterminals exp and term (written E and T above)
  try at home: write the derivation steps that show this.

To correctly express operator associativity:
  For left associativity, use left recursion.
  For right associativity, use right recursion.

Here's the correct grammar:
  E -> E - T | T
  T -> T / F | F
  F -> id | ( E )
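To see the effect of the left-recursive grammar, the sketch below parses a token list and prints the grouping it implies. It is an assumption-laden illustration: a hand-written recursive-descent parser cannot call itself on a left-recursive rule directly, so each rule E -> E - T | T is implemented as the loop T (- T)*, a common substitute that produces the same left-associative grouping. The function `parse` and its string output format are hypothetical.

```python
def parse(tokens):
    """Return a fully parenthesised string showing how the grammar groups tokens."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1

    def parse_e():   # E -> E - T | T, written as the loop T (- T)*
        left = parse_t()
        while peek() == "-":
            advance()
            left = f"({left} - {parse_t()})"
        return left

    def parse_t():   # T -> T / F | F, written as the loop F (/ F)*
        left = parse_f()
        while peek() == "/":
            advance()
            left = f"({left} / {parse_f()})"
        return left

    def parse_f():   # F -> id | ( E ); any other token counts as an id
        token = peek()
        if token == "(":
            advance()
            inner = parse_e()
            if peek() != ")":
                raise SyntaxError("expected ')'")
            advance()
            return inner
        if token is None or token in {"-", "/", ")"}:
            raise SyntaxError(f"unexpected token {token!r}")
        advance()
        return token

    result = parse_e()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return result


print(parse(["5", "-", "3", "-", "2"]))   # ((5 - 3) - 2)
print(parse(["a", "-", "b", "/", "c"]))   # (a - (b / c))
```

The first result shows left associativity of subtraction, the second shows division binding tighter than subtraction.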