
TOKENS

Definition:

 A token is a string of characters, categorized according to
the rules as a symbol (e.g. IDENTIFIER, NUMBER, COMMA, etc.).

 There is a set of strings in the input for which the same
token is produced as output. This set of strings is described
by a rule called a pattern associated with the token.
Lexical analyzer and tokens
 A lexical analyzer generally does nothing with
combinations of tokens, a task left for a
parser.
 For example, a typical lexical analyzer
recognizes parentheses as tokens, but does
nothing to ensure that each '(' is matched
with a ')'.
 The lexical analyzer (either generated
automatically by a tool like lex, or hand-
crafted) reads in a stream of characters,
identifies the lexemes in the stream, and
categorizes them into tokens.
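
The division of labor described above can be illustrated with a minimal sketch. The function names `lex` and `parens_balanced` and the token names LPAREN, RPAREN and OTHER are assumptions made for this illustration, not part of the slides. The lexer accepts unbalanced input without complaint; only the separate parser-side check notices the problem.

```python
def lex(text):
    """Categorize each non-blank character; parentheses become tokens."""
    tokens = []
    for ch in text:
        if ch == "(":
            tokens.append("LPAREN")
        elif ch == ")":
            tokens.append("RPAREN")
        elif not ch.isspace():
            tokens.append("OTHER")
    return tokens


def parens_balanced(tokens):
    """Parser-side check: every LPAREN must have a matching RPAREN."""
    depth = 0
    for tok in tokens:
        if tok == "LPAREN":
            depth += 1
        elif tok == "RPAREN":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0
```

Here `lex("(()")` returns three tokens and raises no error, while `parens_balanced(lex("(()"))` is false.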
TOKEN TYPES

 • Identifiers: x , y11 , elsex_i00


 • Keywords: if , else , while
 • Integers: 2 , 1000 , -500 , +6663554
 • Floating point: 2.0 , 0.00020 , .02
 • Symbols: + , * , - , < , [ , ] , >, = , .. , /
 • Comments: { donʼt change this }
TOKEN VALUES
Some token types have values associated
with them

TYPE          VALUE
IDENT         sqrt
INTCONSTANT   1
RELOP         >
ADDOP         -
Token Definition Example
 Numeric literals in Pascal
◦ Definition of the token unsigned_number

digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

unsigned_integer → digit digit*

unsigned_number → unsigned_integer ( ( . unsigned_integer ) | ε )
                  ( ( e ( + | - | ε ) unsigned_integer ) | ε )
 Recursion is not allowed!
 Notice the use of parentheses to avoid ambiguity
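
As a rough sketch, the unsigned_number definition can be transcribed into a Python regular expression, building each name from the previous ones without recursion. The helper `is_unsigned_number` is a hypothetical name, and the sketch follows the slide in accepting only a lowercase `e` as the exponent marker.

```python
import re

# Regular definitions from the slide, written as regex fragments.
DIGIT = r"[0-9]"
UNSIGNED_INTEGER = rf"{DIGIT}{DIGIT}*"        # one or more digits
UNSIGNED_NUMBER = (
    rf"{UNSIGNED_INTEGER}"
    rf"(?:\.{UNSIGNED_INTEGER})?"             # optional fraction part
    rf"(?:e[+-]?{UNSIGNED_INTEGER})?"         # optional exponent part
)


def is_unsigned_number(text: str) -> bool:
    """Return True if the whole string matches unsigned_number."""
    return re.fullmatch(UNSIGNED_NUMBER, text) is not None
```

Under this definition `123`, `3.14` and `6e-10` match, while `.02` and `1.` do not, because a fraction must have an unsigned_integer on both sides of the point.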
Lexical Analysis

input: x = x * (acc+123)

token          value
identifier     x
equal          =
identifier     x
star           *
left-paren     (
identifier     acc
plus           +
integer        123
right-paren    )

 Tokens are typically represented by numbers
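
The tokenization of `x = x * (acc+123)` can be reproduced with a small regex-driven tokenizer. This is a sketch under assumptions: the `TOKEN_SPEC` table and the `tokenize` function are illustrative names, the token names are taken from the slide, and the identifier and integer patterns are a guess at what the slide intends.

```python
import re

# (token name, pattern) pairs; "skip" marks whitespace that produces no token.
TOKEN_SPEC = [
    ("identifier", re.compile(r"[A-Za-z_][A-Za-z0-9_]*")),
    ("integer", re.compile(r"[0-9]+")),
    ("equal", re.compile(r"=")),
    ("star", re.compile(r"\*")),
    ("plus", re.compile(r"\+")),
    ("left-paren", re.compile(r"\(")),
    ("right-paren", re.compile(r"\)")),
    ("skip", re.compile(r"\s+")),
]


def tokenize(text):
    """Return a list of (token, value) pairs for the input string."""
    tokens = []
    pos = 0
    while pos < len(text):
        for name, pattern in TOKEN_SPEC:
            match = pattern.match(text, pos)
            if match:
                if name != "skip":
                    tokens.append((name, match.group()))
                pos = match.end()
                break
        else:
            # No pattern matched at this position.
            raise ValueError(f"illegal character {text[pos]!r} at position {pos}")
    return tokens
```

Calling `tokenize("x = x * (acc+123)")` yields the nine pairs listed in the table, starting with `("identifier", "x")` and ending with `("right-paren", ")")`.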


Lexemes:

 Character sequence matched by an instance of the token.

Example: “sqrt”
Lexical Analysis
Consider this expression in the C programming language:

sum=3+2;

Tokenized in the following table:

LEXEME   TOKEN TYPE
sum      Identifier
=        Assignment operator
3        Integer
+        Addition operator
2        Integer
;        Semicolon
First Step in Compilation

Source code (character stream)
    ↓ Lexical analysis
Token stream
    ↓ Parsing
Abstract syntax tree
    ↓ Intermediate code generation
Intermediate code
    ↓ Code generation
Assembly code
Lexical Analysis

Source code (character stream):  if (b == 0) a = “hi”;
    ↓ Lexical analysis
Token stream:  if  (  b  ==  0  )  a  =  “hi”  ;
    ↓ Parsing
    ↓ Semantic analysis

CS331 • Lexical Analysis


A Closer Look
 Lexical analysis converts a character stream to
a token stream of pairs <token type, value>

Source:
if (x1 * x2 < 1.0) {
    y = x1;
}

Character stream:
i f ( x 1 * x 2 < 1 . 0 ) { \n

Token stream:
KEY:if LPAREN ID:x1 OP:* ID:x2 RELOP:< NUM:1.0 RPAREN LBRACE


Process of converting a stream of characters into tokens

Grammar:
<STMT> → IFKEY LPAREN <COND> RPAREN <STMT>
       | ID ASSIGNOP <EXPR> SEMI
<COND> → <EXPR> RELOP <EXPR>
<EXPR> → ID | CONSTANT

Parse tree (the parser groups tokens according to the grammar):
<STMT>
    IFKEY LPAREN <COND> RPAREN <STMT>
        <COND>: <EXPR> RELOP <EXPR>
            <EXPR>: ID
            <EXPR>: CONSTANT
        <STMT>: ID ASSIGNOP <EXPR> SEMI
            <EXPR>: CONSTANT

Tokens:
IFKEY LPAREN ID(b) RELOP(EQ) CONSTANT(0) RPAREN ID(a) ASSIGNOP CONSTANT(63) SEMI
    ↑ Lexical analyzer (phase 2) turns lexemes into tokens
Lexemes:
if ( b == 0 ) a = 63 ;
    ↑ Lexical analyzer (phase 1) groups characters into lexemes
Characters:
i f ( b = = 0 ) a = 6 3 ;
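
The two lexer phases in the figure can be sketched as two small functions. The names `lexemes`, `classify` and `tokens`, the lexeme pattern, and the lookup tables are assumptions made for this tiny example language; the slides do not prescribe an implementation.

```python
import re

# Phase 1: group characters into lexemes. '==' is tried before '='.
LEXEME_RE = re.compile(r"\s*(==|[A-Za-z_]\w*|\d+|[()=;])")

KEYWORDS = {"if": "IFKEY"}
FIXED = {"(": "LPAREN", ")": "RPAREN", "=": "ASSIGNOP", ";": "SEMI"}


def lexemes(text):
    """Phase 1: split the character stream into lexemes."""
    text = text.rstrip()
    out, pos = [], 0
    while pos < len(text):
        match = LEXEME_RE.match(text, pos)
        if match is None:
            raise ValueError(f"cannot form a lexeme at position {pos}")
        out.append(match.group(1))
        pos = match.end()
    return out


def classify(lexeme):
    """Phase 2: turn one lexeme into a token."""
    if lexeme in KEYWORDS:
        return KEYWORDS[lexeme]
    if lexeme in FIXED:
        return FIXED[lexeme]
    if lexeme == "==":
        return "RELOP(EQ)"
    if lexeme.isdigit():
        return f"CONSTANT({lexeme})"
    return f"ID({lexeme})"


def tokens(text):
    """Run both phases in sequence."""
    return [classify(lx) for lx in lexemes(text)]
```

For the input `if (b == 0) a = 63;`, phase 1 produces the ten lexemes shown in the figure and phase 2 maps them to the token sequence from IFKEY through SEMI.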



Lexical errors
 What if the user omits the space in “real f” and writes “realf”?
◦ No lexical error: the single token IDENT(“realf”) is
produced instead of the sequence REAL, IDENT(“f”)!

 Typically few lexical error types


◦ illegal chars
◦ unterminated comments
◦ ill-formed constants
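
The three error types above can be detected in a single scan, sketched here for an assumed toy language: Pascal-style `{ ... }` comments, constants that are integers or simple decimals only (no exponents), and a small set of legal symbols. The function name `find_lexical_errors`, the `LEGAL_SYMBOLS` set, and the message wording are all hypothetical.

```python
import re

NUMBER_RE = re.compile(r"[0-9]+(\.[0-9]+)?")   # integer or simple decimal
LEGAL_SYMBOLS = set("+-*/<>=[]().,;:")          # assumed symbol set


def find_lexical_errors(text):
    """Return a list of (position, message) pairs, one per lexical error."""
    errors = []
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch == "{":                            # comment start
            end = text.find("}", i + 1)
            if end == -1:
                errors.append((i, "unterminated comment"))
                break
            i = end + 1
        elif ch.isdigit():                       # candidate constant
            j = i
            while j < n and (text[j].isalnum() or text[j] == "."):
                j += 1
            lexeme = text[i:j]
            if not NUMBER_RE.fullmatch(lexeme):
                errors.append((i, f"ill-formed constant {lexeme!r}"))
            i = j
        elif ch.isalpha() or ch == "_":          # identifier or keyword
            j = i
            while j < n and (text[j].isalnum() or text[j] == "_"):
                j += 1
            i = j
        elif ch.isspace() or ch in LEGAL_SYMBOLS:
            i += 1
        else:
            errors.append((i, f"illegal character {ch!r}"))
            i += 1
    return errors
```

Note that `find_lexical_errors("realf")` returns an empty list: the missing space is not a lexical error, and the mistake only surfaces later, in the parser.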



Issues

 How to break text up into tokens

if (x == 0) a = x<<1;
iff (x == 0) a = x<1;

 How to write the lexer manually
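
One common answer to the first issue is the longest-match rule ("maximal munch"): at each position the lexer takes the longest lexeme that forms a token. The sketch below assumes that rule; the slides pose the question without naming a solution. The group names and the `tokenize` function are illustrative. Listing `<<` before `<` in the operator alternation makes the longer operator win, and keywords are recognized only when the whole identifier matches.

```python
import re

KEYWORDS = {"if", "else", "while"}

# Longer operators come before their prefixes so that '<<' beats '<'.
TOKEN_RE = re.compile(r"""
    (?P<ID>[A-Za-z_]\w*)
  | (?P<NUM>\d+)
  | (?P<OP><<|==|<|=)
  | (?P<PUNCT>[();])
  | (?P<WS>\s+)
""", re.VERBOSE)


def tokenize(text):
    """Return (kind, lexeme) pairs using longest-match tokenization."""
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(f"illegal character {text[pos]!r} at position {pos}")
        kind, lexeme = match.lastgroup, match.group()
        pos = match.end()
        if kind == "WS":
            continue
        if kind == "ID" and lexeme in KEYWORDS:
            kind = "KEY"                 # whole identifier equals a keyword
        tokens.append((kind, lexeme))
    return tokens
```

With this sketch `x<<1` yields a single `<<` operator, `x<1` yields `<`, and `iff` is one identifier instead of the keyword `if` followed by `f`.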



Hand-written lexer
 Overall structure:

Driver:
    Calls GetNextToken
    Prints token type and value

GetNextToken:
    Calls AssembleSimpleToken
    Changes IDs to keywords where necessary
    Returns next token in input stream

AssembleSimpleToken (the FSA):
    Calls GetNextChar repeatedly
    Assembles char sequences into valid tokens
    Returns simple token

GetNextChar:
    Returns the next significant character in the input stream
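
The four-layer structure above might look as follows in Python. This is a minimal sketch, not the course's implementation: the `Lexer` class, the `peek` helper, the token kinds (ID, NUM, SYMBOL, KEYWORD, EOF), and the keyword set are assumptions, and whitespace is skipped at the start of each token instead of inside `get_next_char`.

```python
KEYWORDS = {"if", "else", "while"}


class Lexer:
    def __init__(self, text):
        self.text = text
        self.pos = 0

    def peek(self):
        """Look at the next character without consuming it ('' at end)."""
        return self.text[self.pos] if self.pos < len(self.text) else ""

    def get_next_char(self):
        """GetNextChar: consume and return the next character ('' at end)."""
        ch = self.peek()
        if ch:
            self.pos += 1
        return ch

    def assemble_simple_token(self):
        """AssembleSimpleToken: a small hand-coded FSA over characters."""
        while self.peek().isspace():          # skip whitespace between tokens
            self.get_next_char()
        ch = self.get_next_char()
        if ch == "":
            return ("EOF", "")
        if ch.isalpha() or ch == "_":         # identifier state
            lexeme = ch
            while self.peek().isalnum() or self.peek() == "_":
                lexeme += self.get_next_char()
            return ("ID", lexeme)
        if ch.isdigit():                      # number state
            lexeme = ch
            while self.peek().isdigit():
                lexeme += self.get_next_char()
            return ("NUM", lexeme)
        return ("SYMBOL", ch)                 # any other single character

    def get_next_token(self):
        """GetNextToken: change IDs to keywords where necessary."""
        kind, lexeme = self.assemble_simple_token()
        if kind == "ID" and lexeme in KEYWORDS:
            kind = "KEYWORD"
        return (kind, lexeme)


def driver(text):
    """Driver: call get_next_token until EOF, printing type and value."""
    lexer = Lexer(text)
    tokens = []
    while True:
        token = lexer.get_next_token()
        if token[0] == "EOF":
            break
        print(token[0], token[1])
        tokens.append(token)
    return tokens
```

Keeping keyword recognition in `get_next_token` means the FSA only has to know about identifiers; `while9`, for example, stays an ID because the whole lexeme is compared against the keyword set.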

