
Presentation

on

Lexical Analyzer

Presenters
1. Tofael Mahmud Rizvi
ID: 133-15-2821

2. S. M. Neoaz Mahfuz
ID: 133-15-2982

3. Zahidul Islam
ID: 133-15-3061

4. Masuma Akter
ID: 133-15-2989

5. Md. Al-amin
ID: 133-15-3037

Lexical Analyzer
The lexical analyzer reads the source program character by
character and returns the tokens of the source program.
It puts information about identifiers into the symbol table
and classifies each lexical element of the program.

Lexical Analysis

Jeena Thomas, Asst. Professor, CSE, SJCET Palai

How Lexical Analyzer interacts with Parser

Role of Lexical Analyzer


It scans the source code.
When it encounters whitespace, an operator, or a special
symbol, it decides that a word is complete.
It groups the characters into lexemes
and produces a sequence of tokens.
The lexical analyzer interacts with the symbol table as well.
It correlates error messages with the source program.
It sends the stream of tokens to the parser for syntax analysis.
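The steps above can be sketched in a few lines. The token names, patterns, and sample input below are illustrative assumptions, not part of the slides:

```python
import re

# Hypothetical token patterns, tried in order at each position.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("ID",     r"[A-Za-z_]\w*"),
    ("OP",     r"[+\-*/=<>]"),
    ("SKIP",   r"\s+"),          # whitespace marks the end of a word
]

def tokenize(source):
    """Scan character by character, group lexemes, emit tokens,
    and record identifiers in a symbol table."""
    symbol_table = {}
    tokens = []
    pos = 0
    while pos < len(source):
        for name, pattern in TOKEN_SPEC:
            m = re.match(pattern, source[pos:])
            if m:
                lexeme = m.group()
                if name == "ID":
                    symbol_table.setdefault(lexeme, {"first_pos": pos})
                if name != "SKIP":       # whitespace produces no token
                    tokens.append((name, lexeme))
                pos += len(lexeme)
                break
        else:
            raise SyntaxError(f"no valid token at position {pos}")
    return tokens, symbol_table
```

For the input `count = count + 1`, this yields the token stream ID, OP, ID, OP, NUMBER and an entry for `count` in the symbol table.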

Role of Lexical Analyzer


The lexical analyzer can be divided into two processes:
a) Scanning consists of the simple processes that do not
require tokenization of the input,
such as deletion of comments and compaction of consecutive
whitespace characters into one.
b) Lexical analysis proper is the more complex portion, where
the scanner produces the sequence of tokens as output.
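The scanning stage alone can be illustrated without any tokenization; the C-style comment syntax here is an illustrative assumption:

```python
import re

def scan(raw):
    """Scanning stage only: delete comments and compact consecutive
    whitespace characters into one (no tokens produced yet)."""
    # Drop /* ... */ and // ... comments, replacing each with a space.
    no_comments = re.sub(r"/\*.*?\*/|//[^\n]*", " ", raw, flags=re.S)
    # Compact runs of whitespace into a single space.
    return re.sub(r"\s+", " ", no_comments).strip()
```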

Terminologies
Token :
A token is a pair consisting of a token name and an optional attribute
value. The token name is an abstract symbol representing a kind of
lexical unit.
Typically,
1. Each keyword is a token, e.g., then, begin, integer.
2. Each identifier is a token, e.g., a, zap.
3. Each constant is a token, e.g., 123, 123.45, 1.2E3.
4. Each sign is a token, e.g., (, <, <=, +.
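The (token name, attribute) pairing can be made concrete. The function below is a sketch; the keyword and sign sets are small illustrative samples taken from the examples above:

```python
def classify(lexeme):
    """Map a single lexeme to a (token_name, attribute) pair."""
    KEYWORDS = {"then", "begin", "integer"}   # sample keywords from the text
    SIGNS = {"(", "<", "<=", "+"}             # sample signs from the text
    if lexeme in KEYWORDS:
        return (lexeme.upper(), None)   # each keyword is its own token
    if lexeme in SIGNS:
        return ("SIGN", lexeme)
    try:
        return ("CONST", float(lexeme)) # 123, 123.45, 1.2E3 all parse
    except ValueError:
        return ("ID", lexeme)           # e.g., a, zap
```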

Terminologies
Lexeme :
A lexeme is a sequence of characters in the source
program that matches the pattern for a token.

Pattern :
A pattern is a rule describing the set of lexemes
that can represent a particular token in the source
program.
Regular expressions are an important notation for specifying
patterns.
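The three terms fit together as follows: a pattern (here written as a regular expression) describes which character sequences are lexemes of a token. A minimal check, with the identifier pattern chosen as an example:

```python
import re

# Pattern: a rule describing the set of lexemes of the ID token.
ID_PATTERN = re.compile(r"[A-Za-z_]\w*\Z")

def is_lexeme_of_id(s):
    """True if the string matches the ID pattern, i.e. is a valid lexeme."""
    return bool(ID_PATTERN.match(s))
```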

Regular Expression
The regular expressions are built recursively out of smaller regular
expressions.
Each regular expression r denotes a language L(r).
BASIS: There are two rules that form the basis:
1. ε is a regular expression, and L(ε) is {ε}, that is, the language
whose sole member is the empty string.
2. If a is a symbol in the alphabet, then a is a regular expression, and
L(a) = {a}, that is, the language with one string, of length one, with a
in its one position.
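The two basis rules, and the recursive step that builds larger expressions, can be modelled with Python sets standing in for languages. This is a finite sketch (real regular languages may be infinite):

```python
def L_epsilon():
    return {""}        # basis rule 1: L(ε) is {ε}, the empty string

def L_symbol(a):
    return {a}         # basis rule 2: L(a) = {a}, one string of length one

# The recursive step combines smaller languages into bigger ones,
# e.g. by union and concatenation:
def union(L1, L2):
    return L1 | L2

def concat(L1, L2):
    return {x + y for x in L1 for y in L2}
```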

Pattern Specifications
Alphabets:
Any finite set of symbols is an alphabet.
{0,1} is the binary alphabet.
{0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F} is the hexadecimal alphabet.
{a-z, A-Z} is the set of English-language letters.

Strings:
Any finite sequence of symbols from an alphabet is called a string.
The length of a string is the total number of symbol occurrences in it.
A string of zero length is known as the empty string and is denoted
by ε (epsilon).

Pattern Specifications
Special Symbols
Arithmetic-          Addition +, Subtraction -, Modulo %, Multiplication *, Division /
Punctuation-         Comma , , Semicolon ; , Dot . , Arrow ->
Assignment-          =
Special Assignment-  +=, /=, *=, -=
Comparison-          ==, !=, <, <=, >, >=
Preprocessor-        #
Location Specifier-  &
Logical-             &, &&, |, ||, !
Shift Operator-      >>, >>>, <<

Pattern Specifications
Language
A language is a set of strings over
some finite alphabet.
Computer languages are treated as sets, and
mathematical set operations can be performed on
them.
Finite languages can be described by means of regular
expressions.

Longest Match Rule


The Longest Match Rule states that the lexeme scanned
should be determined based on the longest match among
all the tokens available.
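For example, on the input "<=", both "<" and "<=" match token patterns; the Longest Match Rule selects "<=" as one token rather than "<" followed by "=". A minimal sketch, with an illustrative pattern table:

```python
import re

# Candidate patterns; several may match at the same position.
PATTERNS = [("LE", r"<="), ("LT", r"<"), ("ASSIGN", r"="), ("NUMBER", r"\d+")]

def longest_match(text, pos=0):
    """Return the (token, lexeme) whose match at `pos` is longest."""
    best = None
    for name, pattern in PATTERNS:
        m = re.match(pattern, text[pos:])
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best
```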

Lexical Errors
These are the errors thrown by the lexer when it is unable to continue,
that is, when there is no way to recognize a lexeme as a valid token.
The simplest recovery strategy is "panic mode" recovery.
Other error-recovery actions are:
i. Delete one character from the remaining input.
ii. Insert a missing character into the remaining input.
iii. Replace a character by another character.
iv. Transpose two adjacent characters.
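Panic-mode recovery, combined with action (i), can be sketched as: report the offending character, delete it from the remaining input, and keep scanning. The token pattern below is an illustrative assumption:

```python
import re

TOKEN_RE = re.compile(r"\d+|[A-Za-z_]\w*|[+\-*/=]")

def tokenize_with_recovery(source):
    """Emit tokens; on an unrecognized character, delete it from the
    remaining input (panic mode) and continue."""
    tokens, errors, pos = [], [], 0
    while pos < len(source):
        if source[pos].isspace():
            pos += 1
            continue
        m = TOKEN_RE.match(source, pos)
        if m:
            tokens.append(m.group())
            pos = m.end()
        else:
            errors.append((pos, source[pos]))  # report, then delete one char
            pos += 1
    return tokens, errors
```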

Approaches to Building Lexical Analyzers


The lexical analyzer is the only phase that processes input character
by character, so speed is critical.
There are two ways:
Hand-written LA: write the LA yourself and control input buffering directly,
or
Lexer generator tools: a tool that takes specifications of tokens, often
in the regular-expression notation, and produces a table-driven LA.
The most established tool is lex.
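What lex generates is, in essence, a table-driven scanner: the token specification is compiled once, and the scan is then driven entirely by the compiled result. A rough Python analogue of the idea (a sketch of the technique, not lex's actual output):

```python
import re

# Token specification in regular-expression notation, like a lex file.
SPEC = [("NUM", r"\d+"), ("ID", r"[A-Za-z_]\w*"), ("WS", r"\s+")]

# "Compile" the specification into one master pattern with named groups,
# the moral equivalent of lex building its transition tables.
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in SPEC))

def scan_tokens(text):
    """Drive the whole scan from the compiled table."""
    return [(m.lastgroup, m.group()) for m in MASTER.finditer(text)
            if m.lastgroup != "WS"]
```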

Other Lexer generator tools


ANTLR - can generate lexical analyzers and parsers.
DFASTAR - C++.
Flex/lex - C/C++.
Ragel - C, C++, C#, Objective-C, D, Java, Go, and Ruby.
The following lexical analyzers can handle Unicode:
JavaCC - Java.
JFlex - Java.
Quex - C and C++.

Thank You
