
Lexical Analysis

The lexical analyzer is the first phase of a compiler. A program or function that performs lexical
analysis is called a lexical analyzer, lexer, or scanner. A lexer often exists as a single function that is
called by a parser or another function.
- Its main task is to read the input characters of the source program and produce as output
  a sequence of tokens that the parser uses for syntax analysis:
    - group the characters into lexemes
    - produce as output a sequence of tokens
    - pass the tokens as input to the syntactical analyzer
- Interact with the symbol table:
    - insert identifiers
- Strip out:
    - comments
    - whitespace: blank, newline, tab, ...
    - other separators
- Correlate error messages generated by the compiler with the source program:
    - keep track of the number of newlines seen
    - associate a line number with each error message
- Macro expansion
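
The stripping and line-tracking tasks above can be sketched as follows. This is a minimal illustration, not the lexer described in this document: the function name `significant_chars` is hypothetical, C-style `//` and `/* */` comments are assumed, and string literals are deliberately not handled.

```python
def significant_chars(source):
    """Yield (line, char) for every character that is not whitespace
    and not inside a comment. Assumes C-style comments (an assumption)."""
    line = 1
    i = 0
    n = len(source)
    while i < n:
        ch = source[i]
        if ch == "\n":
            line += 1          # keep track of the number of newlines seen
            i += 1
        elif ch in " \t\r":
            i += 1             # strip blanks and tabs
        elif source.startswith("//", i):
            # Line comment: skip up to (not past) the newline.
            while i < n and source[i] != "\n":
                i += 1
        elif source.startswith("/*", i):
            end = source.find("*/", i + 2)
            if end == -1:
                # Associate a line number with the error message.
                raise SyntaxError(f"line {line}: unterminated comment")
            line += source.count("\n", i, end)
            i = end + 2
        else:
            yield line, ch
            i += 1
```

Because every surviving character carries its line number, a later phase can attach that number to any error message it reports.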

Upon receiving a "get next token" command from the parser, the lexical analyzer reads input characters
until it can identify the next token. It then returns to the parser a representation of the token it has found.
If the token is a simple construct such as a parenthesis, comma, or colon, the representation is an integer
code. If the token is a more complex element such as an identifier or constant, the representation is a pair
consisting of an integer code and a pointer to a table. The integer code gives the token type and the
pointer points to the value of that token.
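
The two kinds of representation can be sketched like this. The integer codes, the `symbol_table` list, and the helper names `intern` and `make_token` are all assumptions made for illustration; a list index stands in for the "pointer to a table".

```python
# Hypothetical integer codes for token types.
LPAREN, COMMA, COLON, IDENT, CONST = 1, 2, 3, 4, 5

symbol_table = []  # values of identifiers and constants; the index acts as the pointer


def intern(value):
    """Return the table index of value, inserting it on first sight."""
    if value not in symbol_table:
        symbol_table.append(value)
    return symbol_table.index(value)


def make_token(lexeme):
    """Simple constructs become a bare integer code; identifiers and
    constants become a (code, table index) pair."""
    simple = {"(": LPAREN, ",": COMMA, ":": COLON}
    if lexeme in simple:
        return simple[lexeme]
    if lexeme.isdigit():
        return (CONST, intern(lexeme))
    return (IDENT, intern(lexeme))
```

With this sketch, a repeated identifier maps to the same table entry, so the parser can compare tokens by code and index instead of by string.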

Sometimes lexical analyzers are divided into a cascade of two phases, the first called
"scanning" and the second "lexical analysis".
The scanner is responsible for simple tasks, while the lexical analyzer proper does the
more complex operations.

The lexical analyzer that we have designed takes its input from an input file. It reads one
character at a time from the input file and continues to read until the end of the file is reached. It
recognizes valid identifiers and keywords, and specifies the token values of the keywords.

It also identifies header files, #define statements, numbers, special characters, and various
relational and logical operators, and it ignores white space and comments. It prints the output to a separate
file, specifying the line number.
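
A character-at-a-time reading loop of this kind might look roughly like the sketch below. It is not the implementation described above: the `KEYWORDS` table and its token values are hypothetical, ASCII input is assumed, and header files, #define statements, multi-character operators, and comments are left out to keep it short. An `io.StringIO` object stands in for the input file.

```python
import io

KEYWORDS = {"int": 1, "return": 2, "if": 3}  # hypothetical token values


def lex(stream):
    """Read one character at a time until end of file.
    Returns (line, kind, text, value) records; value is set only for keywords."""
    out = []
    line = 1
    ch = stream.read(1)
    while ch:  # read(1) returns "" at end of file
        if ch == "\n":
            line += 1
            ch = stream.read(1)
        elif ch.isspace():
            ch = stream.read(1)  # ignore white space
        elif ch.isalpha() or ch == "_":
            text = ""
            while ch and (ch.isalnum() or ch == "_"):
                text += ch
                ch = stream.read(1)
            if text in KEYWORDS:
                out.append((line, "keyword", text, KEYWORDS[text]))
            else:
                out.append((line, "identifier", text, None))
        elif ch.isdigit():
            text = ""
            while ch and ch.isdigit():
                text += ch
                ch = stream.read(1)
            out.append((line, "number", text, None))
        else:
            out.append((line, "special", ch, None))
            ch = stream.read(1)
    return out
```

Calling `lex(io.StringIO("int x1 = 42;"))` yields one record per token, each tagged with the line on which it was found.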

Token
A token is a string of characters, categorized according to the rules as a symbol (e.g.,
IDENTIFIER, NUMBER, COMMA). The process of forming tokens from an input stream of
characters is called tokenization, and the lexer categorizes them according to a symbol type. A
token can look like anything that is useful for processing an input text stream or text file.

A lexical analyzer generally does nothing with combinations of tokens; that task is left to the parser.
For example, a typical lexical analyzer recognizes parentheses as tokens, but does nothing to
ensure that each '(' is matched with a ')'.

Consider this expression in the C programming language:

sum=3+2;

It is tokenized as shown in the following table:

Lexeme    Token type
sum       Identifier
=         Assignment operator
3         Number
+         Addition operator
2         Number
;         End of statement

Tokens are frequently defined by regular expressions, which are understood by a lexical analyzer
generator such as lex.
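
To illustrate how regular expressions can define tokens, here is a small sketch in Python that uses the standard `re` module instead of lex. The `TOKEN_SPEC` table, its group names, and the `tokenize` function are assumptions chosen to reproduce the token types of the `sum=3+2;` example; they are not taken from any particular tool.

```python
import re

# (group name, regular expression, human-readable token type)
TOKEN_SPEC = [
    ("IDENT", r"[A-Za-z_]\w*", "Identifier"),
    ("NUMBER", r"\d+", "Number"),
    ("ASSIGN", r"=", "Assignment operator"),
    ("PLUS", r"\+", "Addition operator"),
    ("END", r";", "End of statement"),
    ("SKIP", r"\s+", None),       # white space is discarded
    ("MISMATCH", r".", None),     # any other character is an error
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern, _ in TOKEN_SPEC))
LABELS = {name: label for name, _, label in TOKEN_SPEC}


def tokenize(text):
    """Return a list of (lexeme, token type) pairs."""
    tokens = []
    for match in MASTER.finditer(text):
        kind = match.lastgroup
        if kind == "SKIP":
            continue
        if kind == "MISMATCH":
            raise SyntaxError(f"unexpected character {match.group()!r}")
        tokens.append((match.group(), LABELS[kind]))
    return tokens
```

The order of the alternatives matters: earlier patterns win, so the catch-all `MISMATCH` rule is placed last.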

Tokenization is the process of demarcating and possibly classifying sections of a string of input
characters. The resulting tokens are then passed on to some other form of processing. The
process can be considered a sub-task of parsing input.

Take, for example,

The quick brown fox jumps over the lazy dog

The string isn't implicitly segmented on spaces, as an English speaker would do. The raw input,
the 43 characters, must be explicitly split into the 9 tokens with a given space delimiter (i.e.,
matching the string " " or the regular expression /\s{1}/).
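
The explicit split can be checked with a few lines of Python. This is only an illustration of the two delimiter choices mentioned above; the variable names are arbitrary.

```python
import re

sentence = "The quick brown fox jumps over the lazy dog"

# Split on the literal string " ".
tokens = sentence.split(" ")

# Split on the regular expression \s{1} (exactly one whitespace character).
regex_tokens = re.split(r"\s{1}", sentence)

print(len(sentence), len(tokens))
```

Both approaches give the same list of words for this input, because the sentence contains only single spaces between words.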

The tokens could be represented in XML,

<sentence>
<word>The</word>
<word>quick</word>
<word>brown</word>
<word>fox</word>
<word>jumps</word>
<word>over</word>
<word>the</word>
<word>lazy</word>
<word>dog</word>
</sentence>

Or an s-expression,

(sentence ((word The) (word quick) (word brown) (word fox) (word jumps) (word over) (word the) (word lazy)
(word dog)))
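
Both representations can be produced from a list of words with a short sketch such as the one below. The function names `to_xml` and `to_sexpr` are hypothetical, and the s-expression helper assumes the words contain no spaces or parentheses.

```python
from xml.sax.saxutils import escape


def to_xml(words):
    """Render the words as a <sentence> element with one <word> per token."""
    lines = ["<sentence>"]
    lines += [f"<word>{escape(word)}</word>" for word in words]
    lines.append("</sentence>")
    return "\n".join(lines)


def to_sexpr(words):
    """Render the words as (sentence ((word ...) (word ...)))."""
    inner = " ".join(f"(word {word})" for word in words)
    return f"(sentence ({inner}))"


words = "The quick brown fox jumps over the lazy dog".split(" ")
print(to_xml(words))
print(to_sexpr(words))
```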
Examples of Tokens

