
UNIT-1
Language processor
A language processor is system software. It is used to translate or convert a source program, written in any high-level language (C, Java, and so on), into an equivalent target program or machine code.

In the conversion of a source program to a target program, four types of system software are involved, as shown in the following sequence:

Source program (written in HLL)
        ↓
Preprocessor
        ↓
Modified source program
        ↓
Compiler / Interpreter
        ↓
Program in assembly language (assembly code)
        ↓
Assembler
        ↓
Relocatable machine code
        ↓
Linker and Loader
        ↓
Absolute machine code

Preprocessor
A preprocessor is system software. It takes a source program as input and generates the equivalent modified source program as output.

Source program → Preprocessor → Modified source program

It processes all preprocessor directives that are present in the program.


Example preprocessor directives in the C language are #include, #define, #ifdef, #endif and so on.
The functionalities of a preprocessor are:

1. File inclusion
2. Macro expansion
3. Conditional compilation
4. Collecting parts of a program into a single unit

File inclusion

When the preprocessor processes any #include directive that is present in the source program, it replaces
the #include directive with the predefined functions that are contained in the corresponding header file.

For example, if the source program in C is as shown below:

#include <stdio.h>

main ( )
{
__
__
___
}

When the preprocessor processes the #include<stdio.h> directive, it generates the modified source program as
shown below:

scanf()
{
.
.
.
}

printf( )
{
.
.
.
}

.
.

main( )
{
.
.
.
}

The directive #include<stdio.h> is replaced with the functions contained in the stdio header file.

Macro expansion

Any name defined using #define is called a macro in the C language. There are two types of macros in C:
1. Object like macro
2. Function like macro

An object-like macro is defined using the following syntax


#define name value

Ex: #define A 100

A function-like macro is defined using the following syntax


#define name(parameters) code

Ex: #define B(x) 100*x

When the preprocessor processes any macro, it scans the entire program and replaces all occurrences of the
macro name with the corresponding value.
For example, when the preprocessor processes the directive
#define A 100

Then it replaces all occurrences of name A in the program with the value 100.

Conditional compilation

Conditional compilation allows us to select the lines of the source program that are to be compiled and those that are to be ignored.
Whenever only some lines of the source program have to be compiled instead of all lines, we use conditional compilation.
In the C language, the following directives are used for conditional compilation:
#if, #ifdef, #ifndef, #else, #elif, #endif and so on.
Conditional compilation directives begin with an #if and end with an #endif.
The syntax of a conditional compilation directive is

#if or #ifdef
statements
[#elif
statements]
[#else
statements]
#endif

If the #if condition is true then all statements until the next #elif, #else or #endif are compiled. Otherwise, if an #else part is present, the statements between the #else and the #endif are compiled.

Ex: #define MAX
void main ( )
{
#ifdef MAX
printf("MAX is defined\n");
#else
printf("MAX is undefined\n");
#endif
}

In the above program, the directive #ifdef MAX checks whether MAX has been defined using a #define statement. As MAX is already defined using #define, only the statement printf("MAX is defined\n"); is compiled into the program.
The modified source program is

void main ( )
{
printf("MAX is defined\n");
}
Compiler

A compiler is system software.

It converts the source program into either assembly code or machine code.

Source program → Compiler → Assembly code or machine code

A compiler converts the source program into equivalent assembly or machine code using a number of phases.

Interpreter

An interpreter is system software.


It converts the source program into some intermediate form and immediately executes that form.

Differences between compiler and interpreter

Compiler

1. A compiler converts the entire source program into machine code or object code before starting the execution of the program.

Source program → Compiler → Machine code (object code)
Input → Object code → Output

After generating the object code, input is given to the object code to produce the required output.
2. After compiling the program, the object code of the program can be executed any number of times. The time required for the second and subsequent executions of the program is less.
3. A compiler generates an explicit object code for the source program.
4. Implementation of a compiler for any language requires more time because it is a large application.
5. Identification of errors in the source program is time consuming.

Interpreter

1. An interpreter converts the source program line by line into some intermediate form and immediately executes each line.

Source program + Input → Interpreter → Output

2. The time required for each execution of the program is the same, as the statements of the program have to be converted again before every execution.
3. An interpreter does not generate an explicit object code for the source program.
4. An interpreter can be implemented in less time than a compiler. The reason for this is that the interpreter does not generate any object code.
5. Identification of errors in the source program is easy in the case of an interpreter: while translating any statement of the program, if an error is identified then it is immediately reported to the user.
6. An interpreter is best for applications which require more interaction with the user during their execution.

Assembler

An assembler is system software. It converts assembly code into equivalent relocatable machine code.

Assembly code → Assembler → Relocatable machine code

The assembly code contains a sequence of instructions of the following form:

mnemonic source operand, destination operand

Examples of mnemonics are ADD, SUB, MUL, DIV and MOV:

ADD A, B
MOV A, B

The functionalities of assembler are:


1) Converting each mnemonic to its equivalent machine code.
2) Assigning addresses to operands (both source & destination).
3) Generating machine instructions in the proper format.
4) Converting the data constants to their equivalent machine representations.

There are two types of assemblers:


1) Single pass
2) Two pass

In a single-pass assembler, the entire conversion is done in one pass.
In a two-pass assembler, the roles of pass 1 and pass 2 are:

Pass 1

Assigning addresses to all statements in the program.


Processing of some assembler directives.

Pass 2

Assembling all instructions.
Processing of the assembler directives left over from pass 1.
Writing the object program.

There are two types of one-pass assembler:

1. Load-and-go
2. Other

A load-and-go assembler produces object code directly in memory for immediate execution.

The other type produces the usual kind of object code for later execution.

Characteristics of load and go assembler

It is useful for program development and testing.


It avoids the overhead of writing the object code out and reading it back.
Both one-pass and two-pass assemblers can be designed as load-and-go.
All addresses must be known at assembly time for a load and go assembler.

Linker (or) Link editor

A linker is system software that takes one or more object files generated by a compiler and combines them into a single executable program.

Obj-1, Obj-2, lib → Linker → Executable program

Each object file is the result of compiling one input source code file.
When a program comprises multiple object files, the linker combines these files into a unified executable program.
Linkers can also take object files from a collection called a library or system library.
The linker also takes care of assigning the objects in a program's address space.

Static linking

1. In static linking, all library modules (object files of the program and library object files) used in the program are copied into the final executable image.
2. Static linking is performed by linkers as the last step in compiling a program.
3. In static linking, the size of the final executable program is large, as all object files are copied into it.
4. If any of the external programs is changed, they have to be recompiled and relinked again.
5. Takes constant load time every time it is loaded into memory for execution.
6. Statically linked programs are faster.
7. No compatibility issues.

Dynamic linking

1. In dynamic linking, only the names of the object files of the program and the library object files are placed in the executable file. The actual linking takes place at run time.
2. Dynamic linking is performed at run time by the operating system.
3. In dynamic linking, the size of the final executable program is smaller, as only the names of object files are included in it.
4. If any of the external programs is changed, they only have to be recompiled.
5. Load time might be reduced if the shared library code is already present in memory.
6. Dynamically linked programs are slower.
7. If a library is changed, applications may have to be reworked to be made compatible with the new version of the library.

Loader

A loader brings an executable file residing on disk into main memory and starts its execution.
The steps or activities performed by a loader are:
1. Reads the header of the executable file to determine the size of the text and data segments.
2. Creates a new address space for the program.
3. Copies instructions and data into the address space.
4. Copies arguments passed to the program onto the stack.
5. Initializes the machine registers, including the stack pointer.
6. Jumps to a start-up routine that copies the program's arguments from the stack to registers and calls the program's main routine.

Types of Loaders

Based on the activities performed by loader, loaders are divided into three types:

1. Compile and Go loaders


2. Absolute loader
3. Program linking loader

Compile and Go loaders

In this type of loader, the assembler does the compiling and the generated machine code is placed directly into memory for execution.
The assembler runs in one part of memory and the assembled machine instructions are put in another part of memory.
This scheme is also called assemble-and-go.

User program → Compile-and-go loader → Executable code of user program (in RAM)

Advantages

Simple and easy to implement.


No additional routines are required to load compiled code into memory

Disadvantages

Wastage of memory space due to the presence of the assembler.
No object file is generated, so the code has to be reassembled every time it is to be run, resulting in more time for execution.
It cannot handle multiple programs written in different languages.

Absolute loader

The absolute loader accepts relocatable object files generated by assemblers and then places them at specified locations in memory.
The starting address of each module is stored in the corresponding object file.

Each module is assembled separately:

Module-1 → Assembler → Object code of module-1 + starting address
Module-2 → Assembler → Object code of module-2 + starting address
...
Module-n → Assembler → Object code of module-n + starting address

The absolute loader then places the object code of each module at its starting address in RAM.

Advantages

It allows multiple source programs written in different languages.


The task of loader becomes simpler.

Disadvantages

The programmer must know about memory management.

If any modification is done in some module, the starting addresses of the following modules may change. The programmer has to take care of this issue.

Program linking or direct linking loader

A program linking loader is used to provide interaction between the segments of a program when the program contains a number of segments.
The assembler should give the following information regarding each segment of the program to the loader:
1. The length of the object code of the segment.
2. A list of external symbols that could be used by other object files.
3. A list of external symbols that the object code is using.

Structure of compiler

There are 2 parts in the compiler: Analysis part and Synthesis part.
Analysis part takes the source program as input and generates an equivalent intermediate form of the source
program as output.
Synthesis part takes intermediate form of the source program as input and generates equivalent machine code as
output.
A data structure called the symbol table is used for storing information about the variables contained in the source program.

Source program → Analysis part → Intermediate form → Synthesis part → Machine code

The analysis part and the synthesis part of the compiler both use the symbol table for getting information regarding any variable in the source program.

Phases of a compiler

There are 6 phases in the compiler: lexical analysis, syntax analysis, semantic analysis, intermediate code generation, code optimization and code generation.

Following diagram shows 6 phases of compiler as well as input and output of each phase.

Source Program
      ↓
Lexical Analysis
      ↓  (sequence of tokens)
Syntax Analysis
      ↓  (syntax tree)
Semantic Analysis
      ↓  (syntax tree)
Intermediate Code Generation
      ↓  (intermediate code)
Code Optimization
      ↓  (optimized intermediate code)
Code Generation
      ↓
Target code

Lexical analysis

It has 3 important roles:

1. It identifies or recognizes the keywords, variables, operators, constants and special symbols present in the source program and generates a token for each of them.
2. It recognizes the blank spaces, tabs and new lines present in the source program but does not generate any token for them.
3. It enters information regarding the variables in the source program into the symbol table.

For example, if the source program contains the statement

position = initial + rate * 60

then the lexical analysis phase generates a token for the variable position, a token for the operator =, a token for the variable initial and so on. The sequence of tokens generated may be v1, o1, v2, o2, v3, o3 and c1.

Syntax analysis

The syntax analysis phase identifies syntax errors in the source program using the syntax rules of the corresponding programming language. If there are no syntax errors in the program then the syntax analysis phase generates a syntax tree as output; otherwise, it displays a description of the syntax errors to the user. In a syntax tree, operators and keywords appear at non-leaf nodes; variables, constants and special symbols appear at leaf nodes.

For the statement: position = initial + rate * 60

Output of syntax analysis phase is

Syntax tree

         =
        / \
position   +
          / \
   initial   *
            / \
        rate   60
Semantic analysis
Semantic analysis phase identifies semantic errors or type mismatch errors present in the source program. If
there are no type mismatch errors then the semantic analysis phase generates syntax tree as output. If a type
mismatch is identified then the semantic analysis phase tries to perform automatic type conversion. If automatic
type conversion is possible then the semantic analysis phase performs automatic conversion and generates
syntax tree as output. Otherwise, it displays error information to the user.

Some examples of type errors or semantic errors are:


1) Using a float variable as an index of an array.

int a[10], b;
b=a[2.5];

2) Adding a variable with a function.

int a;
int b(int, int);
int c;
c=a+b;

In the statement position = initial + rate * 60 , if the data type of all variables is integer then the output of
semantic analysis phase is the following syntax tree

Syntax tree

         =
        / \
position   +
          / \
   initial   *
            / \
        rate   60

In the statement position = initial + rate * 60 , if the data type of all variables is float then the output of
semantic analysis phase is the following syntax tree

Syntax tree

         =
        / \
position   +
          / \
   initial   *
            / \
        rate   inttofloat
                   |
                   60
Intermediate code generation

It generates an equivalent intermediate form of the source program. The intermediate code can be represented in
any one of the following forms

1) syntax tree
2) polish notation
3) three address code

Three address code is the most frequently used format. Property of three address code is: any statement in three
address code contains a maximum of 3 operands (variables or constants) and a maximum of two operators.

For the statement position = initial + rate * 60, the intermediate code in three address code form is

t1=rate*60
t2=initial+t1
position=t2

Code optimization

Code optimization takes the intermediate code as input and optimizes it, i.e. reduces the number of statements. The resulting intermediate code is called the optimized intermediate code.

For the statement position = initial + rate * 60, the optimized intermediate code is

t1=rate*60
position=initial+t1

Code generation

It takes optimized intermediate code as input and generates an equivalent machine code or assembly code as
output.

For the statement position = initial + rate * 60, the target code is

MOV rate, R1
MUL 60, R1
ADD initial, R1
MOV R1, position

Symbol table

The information about the variables in the source program is entered into the symbol table by either the lexical or the semantic analysis phase. This information is used by the semantic analysis phase for identifying semantic errors and also by the code generation phase for determining the memory required by the variables.

Ex: Write output at all phases of compiler for the following statement
i = i * 70 + j + 2
where i and j are float variables.

Lexical analysis phase

Lexeme Token
i - v1
= - o1
i - v2
* - o2
70 - c1
+ - o3
j - v3
+ - o4
2 - c2

Syntax analysis phase

Syntax tree

        =
       / \
      i   +
         / \
        +   2
       / \
      *   j
     / \
    i   70

Semantic analysis phase

Syntax tree

        =
       / \
      i   +
         / \
        +   inttofloat
       / \      |
      *   j     2
     / \
    i   inttofloat
            |
            70

Intermediate code generation phase

Intermediate code

t1= inttofloat (70)


t2= i * t1
t3= t2 + j
t4= inttofloat(2)
t5= t3 + t4
i= t5

Code optimization phase

Optimized intermediate code

t1= inttofloat (70)


t2= i * t1
t3= t2 + j
t4= inttofloat(2)
i= t3 + t4

or

t1= i * 70.0
t2= t1 + j
i= t2 + 2.0

Code generation phase

Assembly code

MOV i, R1
MUL 70.0, R1
ADD j, R1
ADD 2.0, R1
MOV R1, i

Ex: write the output of all phases of compiler for statement x =(a+b) * (c+d) where x, a, b, c are float variables
and d is int variable.

Lexical analysis phase

Lexeme Token
x v1
= o1
( ss1
a v2
+ o2
b v3
) ss2
* o3
( ss3
c v4
+ o4
d v5
) ss4

Syntax analysis phase

Syntax tree

        =
       / \
      x   *
         / \
        +   +
       / \ / \
      a  b c  d

Semantic analysis phase

Syntax tree

        =
       / \
      x   *
         / \
        +   +
       / \ / \
      a  b c  inttofloat
                  |
                  d
Intermediate code generation phase

Intermediate code
t1 = a + b
t2 = inttofloat(d)
t3 = c + t2
t4 = t1 * t3
x = t4

Code optimization phase

Optimized intermediate code


t1 = a + b
t2 = inttofloat(d)
t3 = c + t2
x = t1 * t3

Code generation phase

Assembly code
MOV a, R1
ADD b, R1
MOV c, R2
ADD d, R2
MUL R1, R2
MOV R2, x

Ex: write output for the statement a = b * c + (d + e) at each phase of compiler. Here a, b, c, e are float variables
and d is int variable.

Lexical analysis phase

Lexeme Token
a v1
= o1
b v2
* o2
c v3
+ o3
( ss1
d v4
+ o4
e v5
) ss2

Syntax analysis phase

Syntax tree

        =
       / \
      a   +
         / \
        *   +
       / \ / \
      b  c d  e

Semantic analysis phase

Syntax tree

        =
       / \
      a   +
         / \
        *   +
       / \ / \
      b  c inttofloat  e
              |
              d
Intermediate code generation phase

Intermediate code
t1 = inttofloat(d)
t2 = t1 + e
t3 = b * c
t4 = t3 + t2
a = t4

Code optimization phase

Optimized intermediate code

t1 = inttofloat(d)
t2 = t1 + e
t3 = b * c
a = t3 + t2

Code generation phase

Assembly code
MOV e, R1
ADD d, R1
MOV b, R2
MUL c, R2
ADD R1, R2
MOV R2, a

Pass of a compiler

A pass of a compiler is a group of phases. The group of the first 5 phases, i.e. lexical analysis, syntax analysis, semantic analysis, intermediate code generation and code optimization, is called pass 1 of the compiler.
The last phase, i.e. code generation, is called pass 2 of the compiler. We can use the term front end in place of pass 1 and back end in place of pass 2; that is, front end is another name for pass 1 and back end is another name for pass 2.

LEXICAL ANALYSIS PHASE

Lexical Analysis phase

Lexical analysis phase is the first phase of compiler. It takes source program (or) high level language program
as input and generates a sequence of tokens as output.

Source program (high-level language program) → Lexical analysis phase → Sequence of tokens

Roles of Lexical Analysis Phase

1. It recognizes or identifies the variables, constants, operators, keywords and special symbols present in the program and generates a token for each of these.
2. It recognizes or identifies the blank spaces, tabs, newlines and comments present in the source program but does not generate any token for them.
3. It identifies lexical errors in the source program.

Token

A token is a symbol generated by the lexical analysis phase when it recognizes a variable, constant, keyword, operator or special symbol present in the source program.

For example, if the source program contains the sequence of letters count then the lexical analysis phase recognizes this sequence as a variable or identifier and generates a token, say ID or V. Similarly, if the source program contains the sequence of digits 50 then the lexical analysis phase recognizes this sequence as a constant and generates a token, say C.

Sequence of letters in source program Token generated by lexical analysis

count ID or V
50 C
if IF
< RELOP

Attribute of token

The lexical analysis phase generates the same token name for a number of constructs present in the source program. In this case, an attribute is associated with the token name to uniquely identify the construct. For example, the lexical analysis phase generates the token name ID for any variable or identifier present in the source program. Similarly, it generates the token name C for any constant, and so on. The syntax of a token with an attribute is

<token name, attribute name>

Ex:
If the source program contains the variables count, sum, a1, then when the lexical analysis phase recognizes these variables it generates the tokens <ID, 1>, <ID, 2>, <ID, 3>, where 1, 2, 3 are attributes pointing into the symbol table, i.e. the record numbers of the symbol table where the information of the corresponding variables is stored.

If the source program contains the constants 5, 20, 100, then when the lexical analysis phase recognizes these constants it generates the tokens <C, 5>, <C, 20>, <C, 100>.

If the source program contains the relational operators <, <=, >, >=, ==, !=, then when the lexical analysis phase recognizes these operators it generates the tokens <RELOP, LT>, <RELOP, LE>, <RELOP, GT>, <RELOP, GE>, <RELOP, EQ>, <RELOP, NE>.

Lexical analysis phase generates a different token name for each keyword present in the source program.

Ex: keyword token


if IF
while WHILE
int INT

If the source program contains the special symbols (, {, #, then when the lexical analysis phase recognizes these special symbols it generates the tokens <SS, (>, <SS, {>, <SS, #>.

Lexeme

A lexeme is a sequence of characters for which a token is generated by the lexical analysis phase.

Ex:
count is a lexeme because the lexical analysis phase recognizes count as a variable and generates a token for it.
if is a lexeme because it is recognized as a keyword and a token is generated for it.
<= is a lexeme because it is recognized as an operator and a token is generated for it.
40 is a lexeme because it is recognized as a constant and a token is generated for it.
( is a lexeme because it is recognized as a special symbol and a token is generated for it.

Pattern

A pattern is a rule which describes a set of strings for which the same token name is generated by the lexical analysis phase.

Ex:
For the set of variables {a, count, sum, c1, b12, ab12}, the lexical analysis phase generates the same token name (ID). The pattern that describes this set of variables is
letter (letter/digit)*

Ex:
For the set of constants {5, 20, 100}, the lexical analysis phase generates the same token name (C). The pattern that describes this set of constants is
digit+

Lexical errors

Lexical errors are identified by lexical analysis phase when it scans an unrecognized sequence of characters in
the source program.

Ex:
In a C program, if there is a statement like a= b+c;@, when the lexical analysis phase scans @ symbol then it
reports a lexical error as @ is not an allowed symbol in C language.

Error recovery techniques

The lexical analysis phase uses the following techniques to recover from lexical errors:

1. Deleting an extra character.


2. Inserting a missing character.
3. Replacing an incorrect character by correct character.
4. Transposing two adjacent characters.

Regular expression

It is a notation used for describing a set of strings.

1. ε is a regular expression.
2. Each symbol of an alphabet is a regular expression.
3. If r1 and r2 are regular expressions then their union r1+r2, their concatenation r1r2, their Kleene closures r1*, r2* and their positive closures r1+, r2+ are all regular expressions.

Ex:

If Σ = {a, b} is an alphabet then a, b, a+b, ab, a*, b*, a+, b+ are all regular expressions.

(a+b)ab, a*(a+b)a, (a+b)(a+b) are also regular expressions.

Regular set

Every regular expression indicates a set of strings called regular set or regular language.

Ex: regular expression   regular set

a          {a}
b          {b}
a+b        {a, b}
ab         {ab}
a*         {ε, a, aa, aaa, ...}
b*         {ε, b, bb, bbb, ...}
a+         {a, aa, aaa, ...}
b+         {b, bb, bbb, ...}
(a+b)ab    {aab, bab}
a*(a+b)a   {aa, ba, aaa, aba, aaba, aaaa, ...}
(a+b)(a+b) {aa, ab, ba, bb}
(a+b)*     {ε, a, b, ab, ba, bb, ...} - all strings that can be formed with a's and b's

Ex:
1. Write regular expression for describing set of strings which starts with ab.
ab(a+b)*
2. Write regular expression which indicates set of strings containing aba.
(a+b)*aba(a+b)*
3. Write RE for set of strings starting with a and ending with b.
a(a+b)*b
4. Write RE for the set of strings which contain at least two a's.
(a+b)*a(a+b)*a(a+b)*

Short hand notations in Regular Expressions

* indicates 0 or more instances.
+ indicates 1 or more instances.
? indicates 0 or 1 instance.
[ ] (character class) indicates a choice among a number of alternatives.

Ex:
The regular expression a+b+c+...+z can be written as [a-z]

Ex:

The regular expression 0+1+2+...+9 can be written as [0-9]



Regular Definition

Giving a name to a regular expression is called a regular definition.


Syntax:
name -> regular expression

Ex:
r1 -> ab(a+b)*

1. Write regular definition for indicating set of variables or identifiers

l -> [a-z]
d -> [0-9]
v -> l(l+d)*

2. Write regular definition for numbers

d -> [0-9]
num -> d+(.d+)?

3. Write regular definition for indicating multiline comments

l -> [a-z]
comment -> /*l+*/

Implementation of Lexical analysis phase (or) Recognition of constructs and generating tokens (or) Techniques for generating Tokens

The role of lexical analysis phase can be implemented in two ways:


1. Using Transition Diagrams
2. Using lex tool

Implementation of lexical analysis phase using Transition Diagrams

In this technique, transition diagrams are used to recognize the constructs (variables or identifiers, constants,
operators, keywords and special symbols) present in the source program. The number of transition diagrams
used for recognizing different constructs is as follows:
One diagram for variables, one diagram for constants, one diagram for each keyword, one diagram for each
operator category and one diagram for special symbols.

Transition diagram to recognize variables or identifiers and to generate the corresponding token:

start --l--> (1) --l,d--> (1) --other--> (final)*   Generate(<ID, pointer to symbol table>)

Generate() is a user defined function used to generate the corresponding token for the recognized construct. The * on the final state indicates that the last character read (other) is retracted, i.e. it is not part of the lexeme.

Transition diagram to recognize constants and to generate the corresponding token:

start --d--> (1) --d--> (1) --other--> (final)*   Generate(<C, number>)
             (1) --.--> (2) --d--> (3) --d--> (3) --other--> (final)*   Generate(<C, number>)

Transition diagram to recognize relational operators (<, <=, >, >=, ==, <>) and to generate the corresponding tokens:

start --<--> (1)
    (1) --=--> Generate(<RELOP, LE>)
    (1) -->--> Generate(<RELOP, NE>)
    (1) --other--> * Generate(<RELOP, LT>)

start -->--> (2)
    (2) --=--> Generate(<RELOP, GE>)
    (2) --other--> * Generate(<RELOP, GT>)

start --=--> (3)
    (3) --=--> Generate(<RELOP, EQ>)

Transition diagram to recognize arithmetic operators (+, -, *, /, %) and to generate the corresponding tokens:

start --+--> Generate(<AOP, ADD>)
start -- - --> Generate(<AOP, SUB>)
start --*--> Generate(<AOP, MUL>)
start --/--> Generate(<AOP, DIV>)
start --%--> Generate(<AOP, MOD>)

To recognize keywords, a separate transition diagram has to be used for each keyword.
Transition diagram to recognize the keyword if and to generate the corresponding token:

start --i--> (1) --f--> (final)   Generate(IF)

Transition diagram to recognize the keyword main and to generate the corresponding token:

start --m--> (1) --a--> (2) --i--> (3) --n--> (final)   Generate(MAIN)

Transition diagram to recognize special symbols and to generate the corresponding tokens:

start --(--> Generate(<SS, (>)
start --{--> Generate(<SS, {>)
start --,--> Generate(<SS, ,>)
...

If we want to implement the role of lexical analysis phase using transition diagrams, first we need to determine
the order in which the transition diagrams have to be inspected when a sequence of characters in the source
program is scanned. The transition diagrams for different constructs can be inspected in any order but transition
diagram of variables should be inspected after the transition diagrams of all the keywords.

Lex tool

Lex is a tool available in LINUX operating system. It can be used for implementing the lexical analysis phase.
Lex tool is used in the following way to implement the lexical analysis phase.

1. Write a program in lex language and save the program with .l extension.
Ex:
vvit.l, lex.l, lexical.l

In lex program, include statements which recognize variables or identifiers, constants, special symbols,
keywords and operators present in the program and generate corresponding tokens. Compile the program with
lex compiler. The command to compile the lex program is

$lex file name of program

Ex: $lex vvit.l

Result of compilation is equivalent C program with name lex.yy.c.

2. Compile the C program lex.yy.c using C compiler. The command to compile the C program is

$cc lex.yy.c

The result of compilation is an equivalent object program with name a.out.


3. Run the object program a.out and give source program as input. The command to run the object program is

$ ./a.out

Structure of lex program

A lex program contains three parts:

Declarations part
%%
Translation rules part
%%
Auxiliary functions part

Declaration part

In the declarations part, we can declare variables, constants, header files and regular definitions. To declare
variables, constants and header files, the syntax rules of C language are used. The declaration of variables,
constants and header files should be included between %{ and %}.

Ex:
%{
int a=10;
#define A 100
#include<stdio.h>
%}
To declare regular definitions, following syntax is used.

name regularexpression

Ex: l [a-z]
d [0-9]
v {l}({l}|{d})*

Translation rules part

It contains number of statements of the form

regulardefinitionname or pattern {action}

Here action contains any code in C language or any statements of C language.

Ex:
{v} {printf("variable");}
{n} {printf("const");}
int {printf("keyword");}
"+" {printf("operator");}
; {printf("special symbol");}

Auxiliary functions part

It contains main function as well as supporting functions or user defined functions which are used in the actions
of translation rules part. main function is used to start the execution of program.

Declarations part and auxiliary functions part are optional.

yylex() is a global function in lex. It scans the source program using the pointer yyin.
yyin is a global variable in lex. It is used as pointer to source program and is used by yylex() to scan the source
program.
yytext is a global variable in lex. It contains the sequence of characters currently being scanned by yylex().

The execution of a lex program starts from the main() function. In the main() function, the function yylex() is
called. The function yylex() scans the source program. While scanning, the sequence of scanned characters is
stored into yytext and is compared with the patterns of the translation rules. yylex() chooses the longest possible
match; if two or more rules match the same longest string, the rule listed first wins. When a match occurs, the
corresponding action is executed.

If we want to display output only for recognized constructs, then we need to include the translation rule
. {}
at the end of the translation rules part. The pattern . matches any single character (other than a newline) in the
source program, so any input not matched by an earlier rule is silently discarded.

Example programs

1) Write a lex program to recognize integer numbers in the source program (C program) and generate the
output
number: value of number
whenever a number is recognized.

%{
#include<stdio.h>
%}
d [0-9]
n {d}+
%%
{n} {printf("number: %s", yytext);}
%%
void main()
{
FILE *fp;
char sp[20];
printf("enter source program\n");
scanf("%s", sp);
fp=fopen(sp, "r");
yyin=fp;
yylex();
}

2) Write a lex program to recognize variables in the given source program and generate the output
variable: name of variable
whenever a variable is recognized.

%{
#include<stdio.h>
%}
l [a-z]
d [0-9]
v {l}({l}|{d})*
%%
{v} {printf("variable: %s", yytext);}
%%
void main()
{
FILE *fp;
char sp[20];
printf("enter source program\n");
scanf("%s", sp);
fp=fopen(sp, "r");
yyin=fp;
yylex();
}

3) Write a lex program to recognize keywords: if, int, void, main, float and for in the source program and
display the output
keyword: name of keyword
whenever a keyword is recognized.

%{
#include<stdio.h>
%}
%%

if {printf("keyword: %s", yytext);}
int {printf("keyword: %s", yytext);}
void {printf("keyword: %s", yytext);}
main {printf("keyword: %s", yytext);}
float {printf("keyword: %s", yytext);}
for {printf("keyword: %s", yytext);}
%%
void main()
{
FILE *fp;
char sp[20];
printf("enter source program\n");
scanf("%s", sp);
fp=fopen(sp, "r");
yyin=fp;
yylex();
}

4) Write a lex program to recognize arithmetic operators (+, -, *, /, %) in the source program and display the
output
Operator: symbol of operator
whenever an arithmetic operator is recognized.

%{
#include<stdio.h>
%}
%%
"+" {printf("operator: %s", yytext);}
"-" {printf("operator: %s", yytext);}
"*" {printf("operator: %s", yytext);}
"/" {printf("operator: %s", yytext);}
"%" {printf("operator: %s", yytext);}
%%
void main()
{
FILE *fp;
char sp[20];
printf("enter source program\n");
scanf("%s", sp);
fp=fopen(sp, "r");
yyin=fp;
yylex();
}

5) Write a lex program to recognize blank spaces, tabs and new lines in the source program and display nothing.

%{
#include<stdio.h>
%}
w [ \t\n]
ws {w}+
%%
{ws} {}
%%
void main()
{
FILE *fp;
char sp[20];

printf("enter source program\n");
scanf("%s", sp);
fp=fopen(sp, "r");
yyin=fp;
yylex();
}

While writing regular expressions, if we want a symbol such as *, + or . to stand for itself in the input, and not
act as a shorthand notation, then we have to escape it with a \ or enclose it in double quotes.
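For example, the rules below match a literal +, a literal *, and a real number such as 3.14 (a fragment for the translation rules part; the regular definition d [0-9] is assumed to be present in the declarations part):

```
\+         {printf("plus: %s", yytext);}
"*"        {printf("star: %s", yytext);}
{d}+\.{d}+ {printf("real number: %s", yytext);}
```

Enclosing a symbol in double quotes has the same effect as escaping it with \.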

6) Write a lex program to recognize variables, constants, keywords, operators, special symbols, header files,
comments, blank spaces, tabs and new lines in the given source program and generate corresponding output.

%{
#include<stdio.h>
%}
l [a-z]
d [0-9]
v {l}({l}|{d})*
n {d}+\.{d}+
s [ \t\n]+
c \/\*.+\*\/
h {l}+\.h
%%
if {printf("keyword: %s", yytext);}
int {printf("keyword: %s", yytext);}
void {printf("keyword: %s", yytext);}
main {printf("keyword: %s", yytext);}
.
.
.
{v} {printf("variable: %s", yytext);}
{n} {printf("number: %s", yytext);}
"+" {printf("operator: %s", yytext);}
.
.
.
"{" {printf("special symbol: %s", yytext);}
"," {printf("special symbol: %s", yytext);}
.
.
.
{s} {}
{c} {}
{h} {printf("header file: %s", yytext);}
. {}
%%
void main()
{
FILE *fp;
char sp[20];
printf("enter source program\n");
scanf("%s", sp);
fp=fopen(sp, "r");
yyin=fp;
yylex();
}

7) Write a lex program to count number of lines in the source program and display the count value.

%{
int count=0;
#include<stdio.h>
%}
%%
\n {count=count+1;}
%%
void main()
{
FILE *fp;
char sp[20];
printf("enter source program\n");
scanf("%s", sp);
fp=fopen(sp, "r");
yyin=fp;
yylex();
printf("number of lines: %d", count);
}

8) Write a lex program to count number of words in a given text file.

%{
int count=0;
#include<stdio.h>
%}
w [ \t]
ws {w}+
%%
{ws} {count=count+1;}
%%
void main()
{
FILE *fp;
char sp[20];
printf("enter source program\n");
scanf("%s", sp);
fp=fopen(sp, "r");
yyin=fp;
yylex();
printf("number of words: %d", count);
}
