
MODULE III

Introduction to compiling: Compilers, Analysis of a source program, The phases of a compiler.

Lexical Analysis: The role of the lexical analyzer, Input buffering, Specification of tokens, Recognition of tokens, Finite automata, Conversion of an NFA to a DFA, From a regular expression to an NFA.
COMPILERS
Introduction to Compilers

Translator
A translator is a program that takes a program written in one programming language as input and produces a program in another language as output. If the source language is a high-level language and the object language is a low-level language, then such a translator is called a compiler.

Source Program --> Compiler --> Object Program
Analysis of Source Program

• The analysis part breaks up the source program into constituent pieces and imposes a grammatical structure on them.
• It then uses this structure to create an intermediate representation of the source program.
• If the analysis part detects any error, it must provide informative messages, so the user can take corrective action.
• The analysis part also collects information about the source program and stores it in a data structure called the symbol table (SYMTAB), which is passed along with the intermediate representation to the synthesis part.
• The synthesis part constructs the desired target program from the intermediate representation and the information in the SYMTAB.
• The analysis part is often called the front end, and the synthesis part is called the back end.
Source program
    ↓
Lexical Analyzer
    ↓ token stream
Syntax Analyzer
    ↓ syntax tree
Semantic Analyzer
    ↓ syntax tree
Intermediate Code Generator
    ↓ intermediate representation
Machine-Independent Code Optimizer
    ↓ intermediate representation
Code Generator
    ↓ target machine code
Machine-Dependent Code Optimizer
    ↓
Target machine code

(The Symbol Table is shared by all phases.)
Phases of a compiler
Lexical Analysis (Scanning)

-The first phase of a compiler

-The lexical analyzer reads the stream of characters from the source
program and groups the characters into meaningful sequences
called lexemes.

-For each lexeme, the lexical analyzer produces a token as output of the form
    (token-name, attribute-value)
where token-name is an abstract symbol that is used during syntax analysis, and attribute-value points to an entry in the symbol table for this token.
Eg. position = initial + rate * 60
The lexemes and tokens are:
1) position is a lexeme that would be mapped into a token <id,1>, where id is the token name for identifiers and 1 points to the SYMTAB entry for position.
2) = is a lexeme that is mapped into the token <=>. Since this token needs no attribute value, we have omitted the second component.
3) initial - <id,2>
4) + - <+>
5) rate - <id,3>
6) * - <*>
7) 60 - <60>
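A minimal sketch (not the book's code) of how the statement above maps to (token-name, attribute) pairs, with identifiers getting an index into a small symbol table:

```python
import re

def tokenize(source):
    symtab = []                      # symbol table: identifier names, in order of first use
    tokens = []
    for lexeme in re.findall(r"[A-Za-z_]\w*|\d+|[=+*]", source):
        if lexeme[0].isalpha() or lexeme[0] == "_":
            if lexeme not in symtab:
                symtab.append(lexeme)
            tokens.append(("id", symtab.index(lexeme) + 1))   # <id, table entry>
        elif lexeme.isdigit():
            tokens.append(("number", int(lexeme)))
        else:
            tokens.append((lexeme, None))                     # operators carry no attribute

    return tokens, symtab

tokens, symtab = tokenize("position = initial + rate * 60")
# tokens: [('id', 1), ('=', None), ('id', 2), ('+', None),
#          ('id', 3), ('*', None), ('number', 60)]
```

The regular expression used to split lexemes is an assumption for this tiny example; a real scanner would use the transition-diagram machinery described later in this module.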
Syntax Analysis(Parsing)
 The second phase of the compiler.
 The parser uses the first components of the tokens
produced by the lexical analyzer to create syntax trees.
The syntax tree for the above example is:

        =
       / \
  <id,1>  +
         / \
    <id,2>  *
           / \
      <id,3>  60
Semantic Analysis
The semantic analyzer uses the syntax tree and the information in the
SYMTAB to check the source program for semantic consistency with
the language definition.

It also gathers type information and saves it in either the syntax tree
or the SYMTAB, for subsequent use during intermediate code
generation.

An important part of semantic analysis is type checking, where the compiler checks that each operator has matching operands.
(e.g., the compiler must report an error if a floating-point value is used as an array index).
Eg. Suppose position, initial and rate are float numbers. The lexeme
<60> is an integer. The type checker in semantic analyzer discovers
that the operator * is applied to a float number rate and an int 60. So
int 60 is converted to float.
Intermediate Code Generation.
 In the process of translation from source to
target code, the compiler may construct one or
more intermediate representations.
 This intermediate representation should be
(a) easy to produce and (b) easy to translate
into the target machine code.

Eg. t1 = inttofloat(60)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3
Code Optimization
 The machine independent code optimization phase
attempts to improve the intermediate code so that
better target code will result.
 A simple intermediate code generation algorithm
followed by code optimization is a reasonable way
to generate good target code.
 The optimizer can deduce that the conversion of 60
from int to float can be done once. So the
inttofloat operation can be eliminated by replacing
int 60 by float 60.0
Eg. t1 = id3 * 60.0
    id1 = id2 + t1
Code Generation
 The code generator takes as input an
intermediate representation of the source
program and maps it to the target language.
 If the target language is machine code,
registers or memory locations are selected for
each of the variables used by the program.
 Then the intermediate instructions are
translated into sequences of machine
instructions.

Eg. LDF R2, id3


MULF R2, R2, #60.0
LDF R1, id2
ADDF R1, R1, R2
STF id1, R1
SYMBOL TABLE MANAGEMENT

• An essential function of a compiler is to record the variable names used in the source program and collect information about various attributes of each name.
• This data structure should be designed to allow the compiler to find the record for each name quickly and to store or retrieve data from that record quickly.
position = initial + rate * 60
    ↓ Lexical Analyzer
<id,1> = <id,2> + <id,3> * <60>
    ↓ Syntax Analyzer
        =
       / \
  <id,1>  +
         / \
    <id,2>  *
           / \
      <id,3>  60
    ↓ Semantic Analyzer
        =
       / \
  <id,1>  +
         / \
    <id,2>  *
           / \
      <id,3>  inttofloat(60)
    ↓ Intermediate Code Generator
t1 = inttofloat(60)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3
    ↓ Code Optimizer
t1 = id3 * 60.0
id1 = id2 + t1
    ↓ Code Generator
LDF  R2, id3
MULF R2, R2, #60.0
LDF  R1, id2
ADDF R1, R1, R2
STF  id1, R1
Translation of an assignment statement


Role of Lexical Analyzer

• The main task of the lexical analyzer is to read the input characters, group them into lexemes, and produce tokens.
• The stream of tokens is sent to the parser for syntax analysis.

Source Program --> Lexical Analyzer --(token)--> Parser --> to semantic analysis
(the parser requests each token with getNextToken; both consult the Symbol Table)
Tasks(Role) of Lexical Analyzer
• Identification of lexemes
• Removal of comments and white space (blank, newline, tab, etc.)
• Correlating error messages generated by the compiler with the source program.

Lexical Analyzer processes:
a) Scanning consists of the simple processes that do not require tokenization of the input, such as deletion of comments and compaction of consecutive white-space characters into one.
b) Lexical analysis proper is the more complex process, where the scanner produces the sequence of tokens as output.
Tokens, Patterns and Lexemes

• A token is a pair consisting of a token name and an optional attribute value.
• A pattern is a description of the form that the lexemes of a token may take.
• A lexeme is a sequence of characters in the source program that matches the pattern for a token.
INPUT BUFFERING
• Specialized buffering techniques have been developed to reduce the amount of overhead required to process a single input character.
• Two buffers are alternately reloaded. Each buffer is of the same size N, where N is usually the size of a disk block.

E = M * C * * 2 eof
    ^lexemeBegin    ^forward
Input Buffering…

Two pointers are required:
    lexemeBegin - marks the beginning of the current lexeme.
    forward - scans ahead until a pattern match is found.

Advancing forward requires that we first test whether we have reached the end of one of the buffers, and if so, we must reload the other buffer from the input and move forward to the beginning of the newly loaded buffer.
Sentinels

• Used to mark the end of the input.
• The natural choice is the character eof.
• Any eof that appears other than at the end of a buffer means that the input is at an end.

E = M * eof | C * * 2 eof | eof
    ^lexemeBegin     ^forward

Sentinels at the end of each buffer


switch (*forward++) {
  case eof:
    if (forward is at the end of the first buffer) {
        reload the second buffer;
        forward = beginning of the second buffer;
    }
    else if (forward is at the end of the second buffer) {
        reload the first buffer;
        forward = beginning of the first buffer;
    }
    else  /* eof within a buffer marks the end of input */
        terminate lexical analysis;
    break;
  /* cases for the other characters */
}
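The pseudocode above can be sketched as runnable Python. This is an illustration of the reloading idea only (the class and helper names are assumptions for this sketch, not part of any real scanner API); a NUL character stands in for the eof sentinel:

```python
EOF = "\0"      # sentinel character appended to each buffer
N = 4           # buffer size (a disk block in practice)

class Buffers:
    """Deliver characters one at a time, reloading N-character buffers on demand."""
    def __init__(self, text):
        self.chunks = [text[i:i + N] for i in range(0, len(text), N)]
        self.buf = self._load() + EOF          # current buffer plus sentinel
        self.forward = 0

    def _load(self):
        return self.chunks.pop(0) if self.chunks else ""

    def next_char(self):
        c = self.buf[self.forward]
        self.forward += 1
        if c == EOF:
            # Sentinel at the end of the buffer: reload the other buffer.
            if self.forward == len(self.buf) and (chunk := self._load()):
                self.buf = chunk + EOF
                self.forward = 0
                return self.next_char()
            return None                        # no more input: real end
        return c

b = Buffers("E = M*C**2")
chars = []
while (c := b.next_char()) is not None:
    chars.append(c)
# "".join(chars) == "E = M*C**2"
```

The point of the sentinel is visible in `next_char`: the common path does a single character test, and the more expensive end-of-buffer check runs only when the sentinel is actually seen.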
SPECIFICATION OF TOKENS
Strings and Languages
• An alphabet is a finite set of symbols.
• A string over an alphabet is a finite sequence of symbols drawn from that alphabet.
• |s| denotes the length of a string s. Ex: banana is a string of length 6.
• The set {0,1} is the binary alphabet.
• A language is any countable set of strings over some fixed alphabet.
• Abstract languages: ∅, the empty set, and {ε}, the set containing only the empty string.
• The empty string ε is the identity under concatenation; that is, for any string s, εs = sε = s.
• Exponentiation of strings: s⁰ = ε, and for all i > 0, sⁱ = sⁱ⁻¹s. So s¹ = s, s² = ss, s³ = sss, and so on.
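A small sketch of the string identities above, with ε as the empty Python string and exponentiation as repeated concatenation:

```python
def power(s, i):
    """String exponentiation: s^0 = "" (epsilon), s^i = s^(i-1) s."""
    return "" if i == 0 else power(s, i - 1) + s

eps = ""                            # the empty string epsilon
s = "ab"
assert eps + s == s + eps == s      # epsilon is the identity for concatenation
assert power(s, 0) == ""            # s^0 = epsilon
assert power(s, 3) == "ababab"      # s^3 = sss
assert len("banana") == 6           # |banana| = 6
```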
Operations on Languages

OPERATION                           DEFINITION
union of L and M, written L ∪ M     L ∪ M = {s | s is in L or s is in M}
concatenation of L and M,           LM = {st | s is in L and t is in M}
  written LM
Kleene closure of L, written L*     L* = ∪ (i ≥ 0) Lⁱ
                                    L* denotes "zero or more concatenations of" L
positive closure of L, written L+   L+ = ∪ (i ≥ 1) Lⁱ
                                    L+ denotes "one or more concatenations of" L
Operations on Languages (contd.)
 Example:
 Let L be the set of letters {A, B, . . . , Z, a, b, . . . , z ) and let
D be the set of digits {0,1,.. .9). L and D are, respectively, the
alphabets of uppercase and lowercase letters and of digits.
Other languages constructed from L and D are

1. L ∪ D is the set of letters and digits - strictly speaking, the language with 62 (52 + 10) strings of length one, each of which is either one letter or one digit.
2. LD is the set of 520 (52 × 10) strings of length two, each consisting of one letter followed by one digit. Ex: A1, a1, B0, etc.
3. L⁴ is the set of all 4-letter strings (ex: aaba, bcef).
4. L* is the set of all strings of letters, including ε.
5. L(L ∪ D)* is the set of all strings of letters and digits beginning with a letter.
6. D+ is the set of all strings of one or more digits.
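The operations above can be tried on small finite languages, modelled as Python sets of strings. Kleene closure is an infinite set, so this sketch bounds it by string length:

```python
def concat(L, M):
    """LM = {st | s in L, t in M}."""
    return {s + t for s in L for t in M}

def closure(L, max_len):
    """All strings of L* up to max_len characters (L* itself is infinite)."""
    result, frontier = {""}, {""}
    while frontier:
        frontier = {s for s in concat(frontier, L) if len(s) <= max_len} - result
        result |= frontier
    return result

L = {"a", "b"}          # small stand-in for the letters
D = {"0", "1"}          # small stand-in for the digits
assert L | D == {"a", "b", "0", "1"}             # union
assert concat(L, D) == {"a0", "a1", "b0", "b1"}  # LD: letter followed by digit
assert "" in closure(L, 2)                       # L* always contains epsilon
assert closure(D, 1) == {"", "0", "1"}
```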
Regular Expression & Regular language

 Regular Expression
A notation that allows us to define a pattern in
a high level language.
 Regular language
 Each regular expression r denotes a language
L(r) (the set of sentences relating to the regular
expression r)
Notes: Each word in a program can be expressed in a
regular expression
Eg. Suppose we want to describe the set of valid C identifiers.
If letter_ stands for any letter or the underscore, and digit stands for any digit, then we would describe the language of C identifiers by:
    letter_ ( letter_ | digit )*
The | means union (alternation).
( ) are used to group subexpressions.
* means "zero or more occurrences of".
The juxtaposition of letter_ with the remainder of the expression signifies concatenation.
Rules for constructing regular expressions
The regular expressions are built recursively out of smaller regular
expressions using the following rules.

BASIS:
1. ε is a regular expression denoting {ε}, the language containing only the empty string: L(ε) = {ε}.
2. If a is a symbol in the alphabet Σ, then a is a regular expression, and L(a) = {a}, that is, the language with one string, of length one, with a in its one position.
(We use italics for symbols and boldface for their corresponding regular expressions.)
INDUCTION:
Let r and s be regular expressions with languages
L(r) and L(s). Then
a) (r) | (s) is a regular expression denoting the
language L(r) ∪ L(s)
b) (r)(s) is a regular expression denoting the
language L(r) L(s)
c) (r)* is a regular expression denoting the
language (L(r))*
d) (r) is a regular expression denoting the
language L(r).
Precedence
* has the highest precedence.
Concatenation has the second highest precedence.
| has the lowest precedence.

Eg. (a) | ((b)*(c)) may be replaced by a | b*c.
Algebraic laws of Regular Expressions

AXIOM                       DESCRIPTION
r | s = s | r               | is commutative
r | (s | t) = (r | s) | t   | is associative
(rs)t = r(st)               concatenation is associative
r(s | t) = rs | rt
(s | t)r = sr | tr          concatenation distributes over |
εr = r
rε = r                      ε is the identity element for concatenation
r* = (r | ε)*               relation between * and ε
r** = r*                    * is idempotent
Regular Definitions

• We can give names to certain regular expressions and use those names in subsequent expressions.
    d1 -> r1
    d2 -> r2
    .....
    dn -> rn

e.g. C identifiers are strings of letters, digits and underscores:
    letter_ -> A | B | … | Z | a | b | … | z | _
    digit   -> 0 | 1 | 2 | … | 9
    id      -> letter_ ( letter_ | digit )*

This can also be written as
    letter_ -> [A-Za-z_]
    digit   -> [0-9]
    id      -> letter_ ( letter_ | digit )*

We shall conventionally use italics for the symbols defined in the regular definitions.
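The regular definition above can be tried directly with Python's `re` module, whose character-class syntax matches the bracket form used here:

```python
import re

# letter_(letter_|digit)* for C identifiers, in re syntax
ident = re.compile(r"[A-Za-z_][A-Za-z_0-9]*")

assert ident.fullmatch("rate_1")
assert ident.fullmatch("_tmp")
assert not ident.fullmatch("1rate")   # cannot begin with a digit
```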
Recognition of tokens

• In this topic we study how to take the patterns for all the needed tokens and build a piece of code that examines the input string and finds a prefix that is a lexeme matching one of the patterns.
Consider the following example:

stmt -> if expr then stmt
      | if expr then stmt else stmt
      | ε
expr -> term relop term
      | term
term -> id
      | number

A grammar for branching statements
• For relop, we use the comparison operators.
• The patterns for the tokens (id and number) are:
    digit  -> [0-9]
    digits -> digit+
    number -> digits (. digits)? (E [+-]? digits)?
    letter -> [A-Za-z]
    id     -> letter (letter | digit)*
    if     -> if
    then   -> then
    else   -> else
    relop  -> < | > | <= | >= | = | <>
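The token patterns above translate almost directly into `re` syntax; a sketch (note that for relop the longer alternatives must come first, since `re` picks the first alternative that matches):

```python
import re

number = re.compile(r"[0-9]+(\.[0-9]+)?(E[+-]?[0-9]+)?")  # digits (. digits)? (E [+-]? digits)?
ident  = re.compile(r"[A-Za-z]([A-Za-z]|[0-9])*")         # letter (letter | digit)*
relop  = re.compile(r"<=|>=|<>|<|>|=")                    # longest alternatives first

assert number.fullmatch("6.28E-2")
assert ident.fullmatch("rate1")
assert relop.match("<=").group() == "<="   # not just "<"
```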
The token for white space is
    ws -> (blank | tab | newline)+

• The token ws is different from the other tokens in that, when we recognize it, we do not return it to the parser, but rather restart the lexical analysis from the character that follows the whitespace.
LEXEMES      TOKEN NAME   ATTRIBUTE VALUE
Any ws       -            -
if           if           -
then         then         -
else         else         -
Any id       id           Pointer to table entry
Any number   number       Pointer to table entry
<            relop        LT
<=           relop        LE
=            relop        EQ
<>           relop        NE
>            relop        GT
>=           relop        GE

Tokens, their patterns, and attribute values
Transition Diagrams

• As an intermediate step in the construction of a lexical analyzer, we first convert patterns into "transition diagrams."
• Transition diagrams have a collection of nodes or circles, called states.
• Each state represents a condition that could occur during the process of scanning the input looking for a lexeme that matches one of several patterns.
• Edges are directed from one state of the transition diagram to another.
• Each edge is labeled by a symbol or set of symbols.
• All our transition diagrams are deterministic, meaning that there is never more than one edge out of a given state with a given symbol among its labels.
Some important
conventions about transition diagrams are:
1. Certain states are said to be accepting, or
final.
These states indicate that a lexeme has
been found.
(We always indicate an accepting state by a
double circle, and if there is an action to be
taken — typically returning a token and an
attribute value to the parser — we shall attach
that action to the accepting state.)
2. In addition, if it is necessary to retract the
forward pointer one position (i.e., the lexeme
does not include the symbol that got us to the
accepting state), then we shall additionally place
a * near that accepting state.
3. One state is designated the start state, or initial state; it is indicated by an edge, labeled "start," entering from nowhere. The transition diagram always begins in the start state before any input symbols have been read.
Transition diagram for relop
 We begin in state 0, the start state. If
we see < as the first input symbol, then
among the lexemes that match the
pattern for relop we can only be looking
at <, <>, or <=.
 Therefore go to state 1, and look at the
next character.
 If it is =, then we recognize lexeme <=,
enter state 2, and return the token relop
with attribute LE, the symbolic
constant representing this particular
comparison operator.
 If in state 1 the next character is >, then
instead we have lexeme <>, and enter
state 3 to return an indication that the
not-equals operator has been found.
• State 4 has a * to indicate that we must retract the input one position.
• If in state 0 we see any character besides <, =, or >, we cannot possibly be seeing a relop lexeme, so this transition diagram will not be used.
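The walkthrough above can be sketched as a tiny state machine in Python. This is an illustration of the diagram's logic only (not a full scanner): it returns the token, its attribute, and how many characters must be retracted (the * states):

```python
def relop(s):
    """Recognize a relop lexeme at the start of s; None if no match.
    Returns (token-name, attribute, retract-count)."""
    if not s:
        return None
    c, rest = s[0], s[1:]
    if c == "<":                                  # state 1
        if rest[:1] == "=": return ("relop", "LE", 0)   # state 2
        if rest[:1] == ">": return ("relop", "NE", 0)   # state 3
        return ("relop", "LT", 1)                 # state 4: retract one character
    if c == "=":
        return ("relop", "EQ", 0)                 # state 5
    if c == ">":                                  # state 6
        if rest[:1] == "=": return ("relop", "GE", 0)   # state 7
        return ("relop", "GT", 1)                 # state 8: retract one character
    return None                                   # not a relop lexeme

assert relop("<=") == ("relop", "LE", 0)
assert relop("<x") == ("relop", "LT", 1)   # read one char too far, so retract
assert relop("a") is None
```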
Recognition of Reserved Words and Identifiers
 Usually, keywords like if or then are reserved
so they are not identifiers even though they
look like identifiers.
                  letter or digit (loop)
 start    letter          other
  (9) -----------> (10) ---------> (11)*   return(getToken(), installID())

Transition diagram for ids and keywords
There are two ways that we can handle
reserved words that look like identifiers:
1) Install the reserved words in the symbol table initially.
   When we find an identifier, a call to installID places it in the symbol table if it is not already there and returns a pointer to the symbol-table entry for the lexeme found. Any identifier not in the symbol table during lexical analysis cannot be a reserved word, so its token is id.
   The function getToken examines the symbol-table entry for the lexeme found and returns whatever token name the symbol table says this lexeme represents - either id or one of the keyword tokens that was initially installed in the table.
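A sketch of approach (1): reserved words are installed in the symbol table up front, so installID/getToken (the names used in the text, filled in here with assumed Python bodies) can tell a keyword from an ordinary identifier just by looking at its table entry:

```python
KEYWORDS = ["if", "then", "else"]
symtab = {w: w for w in KEYWORDS}      # keyword entries carry their own token name

def installID(lexeme):
    """Install the lexeme if absent; return its key (standing in for a pointer)."""
    if lexeme not in symtab:
        symtab[lexeme] = "id"          # new ordinary identifier
    return lexeme

def getToken(lexeme):
    """Return the token name recorded for this lexeme's table entry."""
    return symtab[installID(lexeme)]

assert getToken("then") == "then"      # keyword token, pre-installed
assert getToken("rate") == "id"        # ordinary identifier
```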
2) Create separate transition diagrams for each keyword, e.g. for then:

          t        h        e        n      nonletter/digit
 start -> O -----> O -----> O -----> O ---------------------> O*

Transition diagram for then
A transition diagram for unsigned numbers
(figure: from the start state, a loop accepting digits, then an optional fraction ". digits" and an optional exponent "E [+-]? digits", ending in a retracting accept state)
A transition diagram for whitespace
(figure: a loop on delim characters, then a retracting accept state)

• Here we look for one or more "white space" characters, represented by delim. These characters would be blank, tab, newline, etc.
• In state 24, we have found a block of consecutive whitespace characters, followed by a non-whitespace character. We retract the input to begin at the non-whitespace character, but we do not return to the parser.
Design of Lexical Analyzer

• The initial step is to form flowcharts for the valid possible tokens.
• Flowcharts for a lexical analyzer are known as transition diagrams. Components are:
    States - represented by circles
    Edges - the arrows connecting the states
    The labels on the edges indicate the input characters that can appear after that state.

Transition diagram for identifier:

           letter or digit (loop)
 start    letter          delimiter
  (0) -----------> (1) -------------> (2)*

Fig: Transition diagram for identifier

The next step is to produce code for each of the states.
The code for State 0:

State 0: C := GETCHAR();
         if LETTER(C) then goto State 1
         else FAIL()

Here LETTER is a Boolean-valued function that returns true if C is a letter. FAIL is a routine which retracts the lookahead pointer and starts up the next transition diagram, or calls the error routine.
The code for State 1:

State 1: C := GETCHAR();
         if LETTER(C) or DIGIT(C) then goto State 1
         else if DELIMITER(C) then goto State 2
         else FAIL()

Here DIGIT is a Boolean-valued function that returns true if C is one of the digits 0, 1, …, 9. DELIMITER is a function which returns true whenever C is a character that could follow an identifier.
The code for State 2:

State 2: RETRACT();
         return (id, INSTALL())

State 2 indicates that an identifier has been found. Since the delimiter is not part of the token found, the function RETRACT will move the lookahead pointer one character back. * indicates states on which input retraction must take place. The INSTALL() procedure will install the identifier into the symbol table if it is not already there.
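The three states above can be sketched as runnable Python. GETCHAR, LETTER, DIGIT, DELIMITER, RETRACT and INSTALL are the names used in the text; this sketch fills them in with assumed Python equivalents (character tests and slicing):

```python
def recognize_identifier(s):
    """Run the identifier transition diagram on the front of s."""
    pos = 0
    # State 0: the first character must be a letter, else FAIL.
    if pos >= len(s) or not s[pos].isalpha():
        return None
    pos += 1
    # State 1: consume letters and digits until something else appears.
    while pos < len(s) and (s[pos].isalpha() or s[pos].isdigit()):
        pos += 1
    # State 2: RETRACT past the delimiter at s[pos] (it is not part of
    # the lexeme) and INSTALL the identifier.
    lexeme = s[:pos]
    return ("id", lexeme)

assert recognize_identifier("count1 = 0") == ("id", "count1")
assert recognize_identifier("9lives") is None   # FAIL: starts with a digit
```

For simplicity this sketch also accepts an identifier at end of input, where the text's State 1 would call FAIL unless a delimiter follows.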
Token        Code   Value
begin        1      ---
end          2      ---
if           3      ---
then         4      ---
else         5      ---
identifier   6      pointer to symbol table
constant     7      pointer to symbol table
<            8      1
<=           8      2
=            8      3
<>           8      4
>            8      5
>=           8      6

Fig: Tokens recognized
Keywords:

start -> B -> E -> G -> I -> N -> blank/newline -> *    return(1, )
start -> E -> N -> D -> blank/newline -> *              return(2, )
start -> E -> L -> S -> E -> blank/newline -> *         return(5, )
start -> I -> F -> blank/newline -> *                   return(3, )
start -> T -> H -> E -> N -> blank/newline -> *         return(4, )
)
Identifier:

            letter or digit (loop)
 start    letter          not letter or digit
  (23) ----------> (24) ----------------------> (25)*   return(6, INSTALL())

Constant:

            digit (loop)
 start    digit           not digit
  (26) ----------> (27) ----------------------> (28)*   return(7, INSTALL())
Relops:

 start (29) --<--> (30) --not = or >--> (31)*   return(8,1)
                   (30) --=-----------> (32)    return(8,2)
                   (30) -->-----------> (33)    return(8,4)
 start (29) --=--> (34)                         return(8,3)
 start (29) -->--> (35) --not =-------> (36)*   return(8,5)
                   (35) --=-----------> (37)    return(8,6)
Regular Expressions

Strings and Languages

Alphabet or character class
  Denotes any finite set of symbols.
  Eg: {0,1} is an alphabet, with two symbols 0 and 1.

String
  A finite sequence of symbols.
  Eg: 001, 10101, …
Operations with strings

Length: |x| denotes the length of string x, i.e., the number of characters in x.
  'ε' is the empty string, |ε| = 0.

Concatenation of x and y is denoted by x.y or xy, formed by appending string y to x.
  Eg: x = abc, y = de, then x.y = abcde
  xε = εx = x, where ε is the identity under concatenation.

Exponentiation: xⁱ means string x repeated i times.
  Eg: x¹ = x; x² = xx; x³ = xxx; … and x⁰ = ε
Prefix
  A prefix of x is obtained by discarding 0 or more trailing symbols of x.
  Eg: abc, abcd, a, … are prefixes of abcde

Suffix
  A suffix of x is obtained by discarding 0 or more leading symbols of x.
  Eg: cde, e, … are suffixes of abcde

Substring
  A substring of x is obtained by deleting a prefix and a suffix from x.
  Eg: cd, abc, de, abcde, … are substrings of abcde

Every suffix and prefix is a substring, but a substring need not be a suffix or a prefix. ε and x itself are prefixes, suffixes, and substrings of x.
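The definitions above can be checked by enumerating all prefixes, suffixes and substrings with Python slices:

```python
def prefixes(x):   return {x[:i] for i in range(len(x) + 1)}
def suffixes(x):   return {x[i:] for i in range(len(x) + 1)}
def substrings(x): return {x[i:j] for i in range(len(x) + 1)
                                  for j in range(i, len(x) + 1)}

x = "abcde"
assert "abc" in prefixes(x) and "cde" in suffixes(x)
assert "cd" in substrings(x)                   # a substring that is neither
assert "cd" not in prefixes(x) | suffixes(x)   # a prefix nor a suffix
assert prefixes(x) | suffixes(x) <= substrings(x)   # every prefix/suffix is a substring
assert "" in prefixes(x) and x in suffixes(x)  # epsilon and x are both included
```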
Language
It is the set of strings formed from specific alphabet
If L & M are two languages, the possible operations are

Concatenation
  Concatenation of L and M is denoted L.M and is found by selecting a string x from L and a string y from M and joining them in that order:
      LM = {xy | x is in L and y is in M}
      ΦL = LΦ = Φ

Exponentiation
  Lⁱ = LLL…L (i times)
  L⁰ = {ε}, and {ε}L = L{ε} = L

Union
  L ∪ M = {x | x is in L or x is in M}
  Φ ∪ L = L ∪ Φ = L

Closure
  '*' denotes 'zero or more instances of': L* = ∪ (i ≥ 0) Lⁱ
  Eg: let L = {aa}
      L* is the set of all strings of an even number of a's
      L⁰ = {ε}, L¹ = {aa}, L² = {aaaa}, …

'+' is the positive closure, meaning 'one or more instances of'. Excluding {ε}, it is L.(L*):

    L+ = L.(L*) = L . ∪ (i ≥ 0) Lⁱ = ∪ (i ≥ 0) Lⁱ⁺¹ = ∪ (i ≥ 1) Lⁱ
Regular Expressions
  Used to describe the tokens.
  Eg: for identifier,
      identifier = letter ( letter | digit )*
  Used to define a language.

Regular expression construction rules:
1. ε is a regular expression denoting {ε}, that is, the language containing only the empty string.
2. For each a in Σ, a is a regular expression denoting {a}, the language with only one string, that string consisting of the single symbol a.
3. If R and S are regular expressions denoting languages LR and LS respectively, then
   (i)   (R)|(S) is a regular expression denoting LR ∪ LS
   (ii)  (R)(S) is a regular expression denoting LR.LS
   (iii) (R)* is a regular expression denoting (LR)*

A regular expression is defined in terms of primitive regular expressions (the basis) and compound regular expressions (the induction rules). So rules 1 and 2 form the basis, and rule 3 forms the inductive portion.
Eg: Some Regular Expressions
1. a* - denotes all strings of 0 or more a's
2. aa* - denotes the strings of one or more a's (a+)
3. (a/b)* - the set of all strings of a's and b's, i.e. (a*b*)*
4. (aa/ab/ba/bb)* - all strings of even length
5. ε/a/b - strings of length 0 or 1
6. (a/b)(a/b)(a/b) denotes strings of length 3,
   so (a/b)(a/b)(a/b)(a/b)* denotes strings of length 3 or more,
   and ε/a/b/(a/b)(a/b)(a/b)(a/b)* - all strings whose length is not 2
Regular expressions for the tokens:
    keyword    = BEGIN / END / IF / THEN / ELSE
    identifier = letter (letter / digit)*
    constant   = digit+
    relop      = < / <= / = / <> / > / >=

If two regular expressions R and S denote the same language, then R and S are equivalent, i.e. (a/b)* = (a*b*)*

Algebraic laws with Regular Expressions

1. R/S = S/R ( / is commutative)


2. (R/S)/T = R/(S/T)                  (/ is associative)
3. R(ST) = (RS)T                      (. is associative)
4. R(S/T) = RS/RT and (S/T)R = SR/TR  (. distributes over /)
5. εR = Rε = R                        (ε is the identity for concatenation)
Finite Automata

Language Recognizer

• It is a program that identifies the presence of a token in the input.
• It takes a string x as its input and answers 'yes' if x is a sentence of L and 'no' otherwise.

How does it work?
• To determine whether x belongs to a language L, x is decomposed into a sequence of substrings denoted by the primitive subexpressions in R.
Example

• Given R = (a/b)*abb, the set of all strings ending in abb, and x = aabb:
  Since R = R1R2, where R1 = (a/b)* and R2 = abb, it is easy to show that the prefix "a" belongs to the language of R1 and the remaining "abb" belongs to the language of R2, so x is in L(R).
Nondeterministic Automata

It is the generalized transition diagram that is derived from the


expression
a
start a b b 3
0 1 2

Fig: A non deterministic finite automata of (a


b)*abb
The nodes are called states and the labeled edges are called
transitions. Edges can be labeled by ‘є’ & characters. Also same
character can label two or more transitions out of one state. It has
one start state and can be one or more final states(accepting
states).
Transition table

• The tabular form representing the transitions of an NFA. In the transition table, there is a row for each state and a column for each admissible input symbol and ε.
• The entry for row i and symbol a is the set of possible next states for state i on input a.

  State |   a   |  b
  ------+-------+-----
    0   | {0,1} | {0}
    1   |  ---  | {2}
    2   |  ---  | {3}

Fig: Transition table
The path for the input string “aabb” can be represented by the
following sequence of moves

  State | Remaining input
  ------+----------------
    0   | aabb
    0   | abb
    1   | bb
    2   | b
    3   | ε

• The language defined by an NFA is the set of input strings it accepts.
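The moves above can be simulated directly from the transition table: track the set of states the NFA could be in after each input character. A sketch:

```python
# Transition table of the NFA for (a/b)*abb, as shown above.
TABLE = {
    (0, "a"): {0, 1}, (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
ACCEPT = {3}

def nfa_accepts(s):
    """Simulate the NFA by tracking the set of reachable states."""
    states = {0}
    for c in s:
        states = set().union(*(TABLE.get((q, c), set()) for q in states))
    return bool(states & ACCEPT)

assert nfa_accepts("aabb")       # the example string above
assert nfa_accepts("babb")       # any string ending in abb
assert not nfa_accepts("abab")
```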
NFA accepting aa* / bb*

  start (0) --ε--> (1) --a--> (2), with a loop on (2) labeled a
        (0) --ε--> (3) --b--> (4), with a loop on (4) labeled b
Algorithm to construct an NFA from a Regular Expression

Input:  A regular expression R over alphabet Σ
Output: An NFA N accepting the language denoted by R
Method: Decompose R into its primitive components. For each component, construct a finite automaton inductively using the basis and induction rules below.
Finite Automata construction from regular expression

The basis and induction rules are:

1. NFA for "ε":
     (i) --ε--> (f)     where i and f are a new initial state and a new final state

2. NFA for "a":
     (i) --a--> (f)     each state should be new

Each time we need a new state, we give that state a new name. Even if a appears several times in the regular expression R, we give each instance of a a separate finite automaton with its own states.

Having constructed components for the basis regular expressions, we proceed to combine them in ways that correspond to the way compound regular expressions are formed from smaller regular expressions.
3. NFA for "R1 / R2":
   Let N1 and N2 be the NFAs for R1 and R2 respectively.

            ε --> [ N1 ] -- ε
     (i) <                    > (f)
            ε --> [ N2 ] -- ε

   There is a transition on ε from the new initial state i to the initial states of N1 and N2. There is an ε-transition from the final states of N1 and N2 to the new final state f. Any path from i to f must go through either N1 or N2.
4. NFA for "R1R2":
   Let N1 and N2 be the NFAs for R1 and R2 respectively.

     (i) --> [ N1 ] --> [ N2 ] --> (f)

   The initial state of N2 is identified with the accepting state of N1. A path from i to f must go first through N1, then through N2.
5. NFA for "R1*":

                ε (back edge around N1)
     (i) --ε--> [ N1 ] --ε--> (f)
       \____________ε____________/

   In this, we can go from i to f directly along a path labeled ε (zero occurrences), or go through N1 one or more times.
Decomposition of (a/b)*abb

The expression decomposes into primitive components:
    R1 = a, R2 = b, R3 = R1/R2, R4 = (R3), R5 = (R4)*,
    R6 = a, R7 = R5R6, R8 = b, R9 = R7R8, R10 = b, R11 = R9R10

N1 (R1 = a): (2) --a--> (3)        N2 (R2 = b): (4) --b--> (5)

N3 (R3 = R1/R2): new states 1 and 6;
    (1) --ε--> (2) --a--> (3) --ε--> (6)
    (1) --ε--> (4) --b--> (5) --ε--> (6)
N4 (R4 = (R3)) is the same as N3.

N5 (R5 = (R4)*): new states 0 and 7;
    (0) --ε--> (1), (6) --ε--> (7), (0) --ε--> (7), and a back edge (6) --ε--> (1)

N6 (R6 = a): (7') --a--> (8). Concatenation (R7 = R5R6) identifies 7' with 7, giving … (7) --a--> (8).

N8 (R8 = b): (8') --b--> (9). Concatenation (R9 = R7R8) gives … (8) --b--> (9).

N10 (R10 = b): (9') --b--> (10). Concatenation (R11 = R9R10) gives the final NFA:

    start (0) --ε--> (1) --ε--> (2) --a--> (3) --ε--> (6)
                     (1) --ε--> (4) --b--> (5) --ε--> (6)
          (6) --ε--> (1), (6) --ε--> (7), (0) --ε--> (7)
          (7) --a--> (8) --b--> (9) --b--> (10)    with (10) accepting
Deterministic Automata (DFA)

• Since the NFA transition function is multivalued and allows ε-moves, it is difficult to simulate an NFA with a computer program.
• A finite automaton is deterministic if
  (i) it has no transitions on input ε, and
  (ii) for each state s and input symbol a, there is at most one edge labeled a leaving s.
• For each NFA, we can find a DFA accepting the same language.
For the NFA of (a/b)*abb built above:

ε-closure(0) = {0, 1, 2, 4, 7} = A
    A on a -> {3, 8};  A on b -> {5}
ε-closure({3, 8}) = {1, 2, 3, 4, 6, 7, 8} = B
    B on a -> {3, 8};  B on b -> {5, 9}
ε-closure({5}) = {1, 2, 4, 5, 6, 7} = C
    C on a -> {3, 8};  C on b -> {5}
ε-closure({5, 9}) = {1, 2, 4, 5, 6, 7, 9} = D
    D on a -> {3, 8};  D on b -> {5, 10}
ε-closure({5, 10}) = {1, 2, 4, 5, 6, 7, 10} = E
    E on a -> {3, 8};  E on b -> {5}
  State      | a | b
  -----------+---+---
  A (start)  | B | C
  B          | B | D
  C          | B | C
  D          | B | E
  E (accept) | B | C

The resulting DFA: from the start state A, a leads toward B, B on b reaches D, and D on b reaches the accepting state E; every state returns to B on a, and the remaining b-moves go to C.
Minimizing the number of states

A and C behave identically (both go to B on a and to C on b), so they are merged, giving:

  State      | a | b
  -----------+---+---
  A (start)  | B | A
  B          | B | D
  D          | B | E
  E (accept) | B | A
Constructing DFA from NFA

 Algorithm

– Input: a NFA N.
– output: a DFA D accepting the same language
Let us define the function ε-CLOSURE(s) to be the set of states of N built by applying the following rules:
1. s is added to ε-CLOSURE(s).
2. If t is in ε-CLOSURE(s), and there is an edge labeled ε from t to u, then u is added to ε-CLOSURE(s) if u is not already there. Rule 2 is repeated until no more states can be added to ε-CLOSURE(s).
Thus, ε-CLOSURE(s) is the set of states that can be reached from s on ε-transitions only. If T is a set of states, then ε-CLOSURE(T) is the union over all states s in T of ε-CLOSURE(s).
Constructing DFA from NFA

 Algorithm Є- CLOSURE

push all states in T onto stack;
ε-closure(T) := T;
while stack is not empty do
begin
    pop s, the top element, off the stack;
    for each state t with an edge labeled ε from s to t do
        if t is not in ε-closure(T) then
        begin
            add t to ε-closure(T);
            push t onto stack
        end
end
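The ε-CLOSURE algorithm above translates directly into Python. This sketch assumes the NFA's ε-edges are given as a dict from state to the set of states reachable on one ε-move:

```python
def eps_closure(T, eps_edges):
    """Set of states reachable from any state in T on epsilon-moves only."""
    closure = set(T)
    stack = list(T)                    # push all states in T
    while stack:
        s = stack.pop()
        for t in eps_edges.get(s, set()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure

# epsilon-edges of the NFA for (a/b)*abb built earlier (states 0..10):
EPS = {0: {1, 7}, 1: {2, 4}, 3: {6}, 5: {6}, 6: {1, 7}}
assert eps_closure({0}, EPS) == {0, 1, 2, 4, 7}                 # state A
assert eps_closure({3, 8}, EPS) == {1, 2, 3, 4, 6, 7, 8}        # state B
```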
Constructing DFA from NFA

 Algorithm – Subset construction

while there is an unmarked state x = {s1, s2, …, sn} of D do
begin
    mark x;
    for each input symbol a do
    begin
        let T be the set of states to which there is a
            transition on a from some state si in x;
        y := ε-CLOSURE(T);
        if y has not yet been added to the set of states of D then
            make y an unmarked state of D;
        add a transition from x to y labeled a if not already present
    end
end
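The subset construction above, sketched in Python: each D-state is a frozenset of N-states. The NFA encoding (`moves[(state, symbol)]` plus an ε-edge dict) is an assumption of this sketch:

```python
def eps_closure(T, eps):
    closure, stack = set(T), list(T)
    while stack:
        s = stack.pop()
        for t in eps.get(s, set()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure

def subset_construction(start, moves, eps, alphabet):
    """Build the DFA states and transitions for the given NFA."""
    d_start = frozenset(eps_closure({start}, eps))
    dstates, unmarked, dtran = {d_start}, [d_start], {}
    while unmarked:
        x = unmarked.pop()                                   # mark x
        for a in alphabet:
            T = set().union(*(moves.get((s, a), set()) for s in x))
            y = frozenset(eps_closure(T, eps))
            if y and y not in dstates:
                dstates.add(y)
                unmarked.append(y)
            if y:
                dtran[(x, a)] = y
    return d_start, dstates, dtran

# NFA for (a/b)*abb from the worked example:
EPS   = {0: {1, 7}, 1: {2, 4}, 3: {6}, 5: {6}, 6: {1, 7}}
MOVES = {(2, "a"): {3}, (4, "b"): {5}, (7, "a"): {8}, (8, "b"): {9}, (9, "b"): {10}}
start, dstates, dtran = subset_construction(0, MOVES, EPS, "ab")
assert start == frozenset({0, 1, 2, 4, 7})   # state A
assert len(dstates) == 5                     # states A..E of the text
```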
Minimizing the number of states in DFA

 Algorithm

– Input: a DFA M
– output: a minimum state DFA M’
• If some states in M ignore some inputs, add transitions to a "dead" state.
• Let P = {accepting states, all nonaccepting states}
• Let P' = {}
• Loop: for each group G in P do
      partition G into subgroups so that s and t (in G) belong to the same subgroup if and only if, for each input a, states s and t have transitions on a to states in the same group of P;
      put those subgroups in P'
  if (P != P') goto Loop
• Remove any dead states and unreachable states.
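The partition-refinement loop above, sketched in Python on the DFA A..E from the earlier worked example (dead-state and unreachable-state handling omitted):

```python
DTRAN = {("A","a"):"B", ("A","b"):"C", ("B","a"):"B", ("B","b"):"D",
         ("C","a"):"B", ("C","b"):"C", ("D","a"):"B", ("D","b"):"E",
         ("E","a"):"B", ("E","b"):"C"}
ACCEPT = {"E"}
STATES = {"A", "B", "C", "D", "E"}

def minimize(states, accept, dtran, alphabet="ab"):
    """Refine the partition until no group splits; return the final groups."""
    P = [set(accept), states - accept]        # initial partition
    while True:
        def group_of(s):                      # index of the group containing s
            return next(i for i, g in enumerate(P) if s in g)
        P2 = []
        for G in P:                           # split G by where its states go
            buckets = {}
            for s in G:
                key = tuple(group_of(dtran[(s, a)]) for a in alphabet)
                buckets.setdefault(key, set()).add(s)
            P2.extend(buckets.values())
        if len(P2) == len(P):                 # P == P': done
            return P2
        P = P2

groups = minimize(STATES, ACCEPT, DTRAN)
assert len(groups) == 4          # A and C merge: four states remain
assert {"A", "C"} in groups
```

Each group in the result becomes one state of the minimized DFA, matching the four-state table shown earlier.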
NFA to DFA Example 2

(figure: an NFA with states 0-8; recovered transitions: ε-moves from 0 to 1, 3 and 7; 1 --a--> 2, 3 --a--> 4, 4 --b--> 5, 5 --b--> 6, 7 --a--> 7, 7 --b--> 8, 8 --b--> 8)

ε-closure({0}) = {0, 1, 3, 7}
    subset({0,1,3,7}, a) = {2, 4, 7}      subset({0,1,3,7}, b) = {8}
ε-closure({2, 4, 7}) = {2, 4, 7}
    subset({2,4,7}, a) = {7}              subset({2,4,7}, b) = {5, 8}
ε-closure({8}) = {8}
    subset({8}, a) = ∅                    subset({8}, b) = {8}
ε-closure({7}) = {7}
    subset({7}, a) = {7}                  subset({7}, b) = {8}

DFA states:
    A = {0, 1, 3, 7}  (start)
    B = {2, 4, 7}
    C = {8}
    D = {7}
    E = {5, 8}
    F = {6, 8}
Minimizing the Number of States of a DFA

(figure: the DFA for (a/b)*abb before and after minimization; after merging A and C, the minimized DFA is the chain start A --a--> B --b--> D --b--> E (accepting), where every state goes to B on a, B goes to D on b, and A and E go back to A on b)
A language for specifying Lexical Analyzers

• A LEX source program is a specification of a lexical analyzer, consisting of a set of regular expressions together with an action for each regular expression.
• The action is a piece of code which is to be executed whenever a token specified by the corresponding regular expression is recognized.
• The output of LEX is a lexical analyzer program constructed from the LEX source specification.
Creating a Lexical Analyzer
with Lex
lex source program --> lex compiler --> lexical analyzer L

input stream --> lexical analyzer L --> sequence of tokens
A LEX source pgm consists of 2 parts:
Auxiliary definitions and translation rules

Auxiliary Definitions

The auxiliary definitions are statements of the form:
    D1 = R1
    D2 = R2
    .
    .
    Dn = Rn

Eg: letter = A | B | … | Z
    digit = 0 | 1 | … | 9
    identifier = letter (letter | digit)*
Translation Rules

The translation rules of a LEX program are statements of the form:
    P1 {A1}
    P2 {A2}
    .
    .
    Pm {Am}

where each Pi is a regular expression called a pattern and each Ai is a program fragment. The patterns describe the form of the tokens; the program fragment Ai describes what action the lexical analyzer should take when a token matching Pi is found.
AUXILIARY DEFINITIONS
letter = A | B | … | Z
digit = 0 | 1 | … | 9

TRANSLATION RULES
BEGIN                    {return 1}
END                      {return 2}
IF                       {return 3}
THEN                     {return 4}
ELSE                     {return 5}
letter(letter | digit)*  {LEXVAL := INSTALL(); return 6}
digit+                   {LEXVAL := INSTALL(); return 7}
< {LEXVAL:=1;
return 8}
<= {LEXVAL:=2;
return 8}
= {LEXVAL:=3;
return 8}
<> {LEXVAL:=4;
return 8}
> {LEXVAL:=5;
return 8}
>= {LEXVAL:=6;
return 8}
Regular Expressions in Lex
x        match the character x
\.       match the character .
"string" match the contents of the string of characters
.        match any character except newline
^        match the beginning of a line
$        match the end of a line
[xyz]    match one character x, y, or z
[^xyz]   match any character except x, y, and z
[a-z]    match one of a to z (use \ to escape - inside brackets)
r*       closure (match zero or more occurrences)
r+       positive closure (match one or more occurrences)
r?       optional (match zero or one occurrence)
r1 r2    match r1 then r2 (concatenation)
r1|r2    match r1 or r2 (union)
(r)      grouping
r1/r2    match r1 only when followed by r2 (trailing context)
{d}      match the regular expression defined by d
