Principles of Compiler Design
A. A. Puntambekar

Table of Contents

Chapter-1  Introduction to Compiling            (1-1) to (1-22)
Chapter-2  Lexical Analysis                     (2-1) to (2-82)
Chapter-3  Syntax Analysis                      (3-1) to (3-130)
Chapter-4  Semantic Analysis                    (4-1) to (4-14)
Chapter-5  Syntax Directed Translation          (5-1) to (5-28)
Chapter-6  Intermediate Code Generation         (6-1) to (6-40)
Chapter-7  Run Time Storage Organization        (7-1) to (7-34)
Chapter-8  Code Generation                      (8-1) to (8-46)
Chapter-9  Code Optimization                    (9-1) to (9-70)
References                                      (R-1)
Chapterwise University Questions with Answers   (P-1) to (P-114)

Features of Book
- More detailed and modular representation of chapters.
- Use of informative, self-explanatory diagrams.
- Use of clear, plain and lucid language, making the understanding very easy.
- Large number of solved examples.
- Approach of the book resembles classroom teaching.

Principles of Compiler Design
ISBN 9788184314571
All rights reserved with Technical Publications. No part of this book should be reproduced in any form, electronic, mechanical, photocopy or any information storage and retrieval system, without prior permission in writing from Technical Publications, Pune.
Published by Technical Publications Pune, #1, Amit Residency, 412, Shaniwar Peth, Pune - 411 030, India.

Preface

The importance of Principles of Compiler Design is well known in computer engineering fields. Overwhelming response to my books on various subjects inspired me to write this book. The book is structured to cover the key aspects of the subject Principles of Compiler Design.
The book uses plain, lucid language to explain the fundamentals of this subject. It provides a logical method of explaining various complicated concepts and stepwise methods to explain the important topics. Each chapter is well supported with necessary illustrations, practical examples and solved problems. All chapters in this book are arranged in a proper sequence that permits each topic to build upon earlier studies. All care has been taken to make students comfortable in understanding the basic concepts of the subject. The book not only covers the entire scope of the subject but explains the philosophy of the subject. This makes the understanding of this subject more clear and makes it more interesting. The book will be very useful not only to the students but also to the subject teachers. The students have to omit nothing and possibly have to cover nothing more.

I wish to express my profound thanks to all those who helped in making this book a reality. Much needed moral support and encouragement was provided on numerous occasions by my family. I wish to thank the Publisher and the entire team of Technical Publications who have taken immense pains to get this book in time with quality printing.

Any suggestion for the improvement of the book will be acknowledged and well appreciated.

Author
A. A. Puntambekar

Dedicated to God

Chapter-1  Introduction to Compiling
1.1 Introduction
1.2 Why to Write Compiler ?
    1.2.1 Compiler : Analysis-Synthesis Model
    1.2.2 Execution of Program
    1.2.3 Analysis of Source Program
1.3 Compilation Process in Brief
    1.3.1 The Phases of Compiler
1.4 Cousins of the Compiler
1.5 Grouping of Phases
    1.5.1 Concept of Pass
1.6 Types of Compiler
1.7 Bootstrapping of Compiler
1.8 Interpreter
1.9 Comparison of Compilers and Interpreters
1.10 Compiler Construction Tools

Chapter-2  Lexical Analysis
2.1 Introduction
2.2 Role of Lexical Analyzer
    2.2.1 Tokens, Patterns, Lexemes
2.3 Input Buffering
2.4 Specification of Tokens
    2.4.1 Strings and Language
    2.4.2 Operations on Language
    2.4.4 Notations used for Representing Regular Expressions
    2.4.5 Non Regular Language
2.5 Recognition of Tokens
2.6 Block Schematic of Lexical Analyzer
2.7 Automatic Construction of Lexical Analyzer
2.8 LEX Specification and Features
2.9 Lexical Analyzer Generator
    2.9.1 Pattern Matching based on NFA's

Chapter-3  Syntax Analysis
    3.1.1 Basic Issues in Parsing
    3.1.3 Why Lexical and Syntax Analyzer are Separated Out ?
3.3 Ambiguous Grammar
3.4 Parsing Techniques
3.5 Top-Down Parser
        3.5.2.1 Predictive LL(1) Parser
        3.5.2.2 Construction of Predictive LL(1) Parser
3.6 Bottom-up Parser
    3.6.1 Shift Reduce Parser
    3.6.2 Operator Precedence Parser
3.7 LR Parser
    3.7.1 SLR Parser
    3.7.2 LR(k) Parser
3.8 Comparison of LR Parsers
3.9 Using Ambiguous Grammar
3.10 Error Recovery in LR Parser
3.11 Automatic Construction of Parser (YACC)
    3.11.1 YACC Specification
Solved Exercise
Review Questions

Chapter-4  Semantic Analysis
4.1 Introduction
4.2 Need of Semantic Analysis
4.3 Type Analysis and Type Checking
    4.3.1 Type Expression
4.4 Simple Type Checker
    4.4.1 Type Checking of Expression
    4.4.2 Type Checking of Statements
4.5 Equivalence of Type Expressions
    4.5.1 Structural Equivalence of Type Expressions
    4.5.2 Name Equivalence of Type Expressions
4.6 Type Conversions
Review Questions

Chapter-5  Syntax Directed Translation
5.1 Introduction
5.2 Syntax Directed Definition (SDD)
    5.2.1 Construction of Syntax Trees
        5.2.1.1 Construction of Syntax Tree for Expression
        5.2.1.2 Directed Acyclic Graph for Expression
5.3 Bottom-Up Evaluation of S-Attributed Definitions
    5.3.1 Synthesized Attributes on the Parser Stack
5.4 L-attributed Definitions
    5.4.1 L-attributed Definition
    5.4.2 Translation Scheme
        5.4.2.1 Guideline for Designing the Translation Scheme
5.5 Top-Down Translation
    5.5.1 Construction of Syntax Tree for the Translation Scheme
5.6 Bottom Up Evaluation of Inherited Attributes
Review Questions

Chapter-6  Intermediate Code Generation
6.1 Benefits of Intermediate Code Generation
6.2 Intermediate Languages
    6.2.1 Types of Three Address Statements
    6.2.2 Implementation of Three Address Code
6.5 Assignment Statements
    6.5.1 Type Conversion
6.6 Arrays
6.7 Boolean Expressions
    6.7.1 Numerical Representation
6.8 Case Statements
6.9 Backpatching
    6.9.1 Backpatching using Boolean Expressions
    6.9.2 Backpatching using Flow of Control Statements
6.10 Procedure Calls
6.11 Intermediate Code Generation using YACC
Solved Exercise
Review Questions

Chapter-7  Run Time Storage Organization
7.1 Source Language Issues
7.2 Storage Organization
    7.2.1 Sub Division of Run Time Memory
7.3 Activation Record
7.4 Storage Allocation Strategies
    7.4.1 Static Allocation
    7.4.2 Stack Allocation
    7.4.3 Heap Allocation
7.5 Variable Length Data
7.6 Access to Non Local Names
    7.6.1 Static Scope or Lexical Scope
    7.6.2 Lexical Scope for Nested Procedures
7.7 Parameter Passing
7.8 Symbol Tables
    7.8.1 Symbol-Table Entries
    7.8.2 How to Store Names in Symbol Table ?
    7.8.3 Symbol Table Management
7.9 Error Detection and Recovery
7.10 Classification of Errors
    7.10.1 Lexical Phase Errors
    7.10.2 Syntactic Phase Errors
    7.10.3 Semantic Errors
Review Questions

Chapter-8  Code Generation
8.1 Introduction
8.2 Code Generation
8.3 Object Code Forms
8.4 Issues in Code Generation
8.5 Target Machine Description
    8.5.1 Cost of the Instruction
8.6 Basic Blocks and Flow Graphs
    8.6.1 Some Terminologies used in Basic Blocks
    8.6.2 Algorithm for Partitioning into Blocks
    8.6.3 Flow Graph
8.7 Next-Use Information
    8.7.1 Storage for Temporary Names
8.8 Register Allocation and Assignment
    8.8.1 Global Register Allocation
    8.8.2 Usage Count
    8.8.3 Register Assignment for Outer Loop
8.9 DAG Representation of Basic Blocks
8.10 Peephole Optimization
    8.10.1 Characteristics of Peephole Optimization
8.11 A Simple Code Generator
8.12 Generating Code from DAG
    8.12.1 Rearranging Order
    8.12.2 Heuristic Ordering
    8.12.3 Labelling Algorithm
8.13 Dynamic Programming
    8.13.1 Principle of Dynamic Programming
    8.13.2 Dynamic Programming Algorithm
8.14 Generic Code Generation Algorithm
Solved Exercise
Review Questions

Chapter-9  Code Optimization
9.1 Introduction
9.2 Classification of Optimization
9.3 Principle Sources of Optimization
    9.3.1 Compile Time Evaluation
    9.3.2 Common Sub Expression Elimination
    9.3.3 Variable Propagation
    9.3.4 Code Movement
    9.3.5 Strength Reduction
    9.3.6 Dead Code Elimination
    9.3.7 Loop Optimization
9.4 Optimization of Basic Blocks
9.5 Loops in Flow Graphs
9.6 Local Optimization
    9.6.1 DAG Based Local Optimization
9.7 Global Optimization
    9.7.1 Control and Data Flow Analysis
9.8 Computing Global Data Flow Information
    9.8.1 Data Flow Analysis
    9.8.2 Data Flow Properties
9.9 Data Flow Equations
    9.9.1 Representing Data Flow Information
    9.9.2 Data Flow Equations for Programming Constructs
9.10 Iterative Data Flow Analysis
    9.10.1 Reaching Definition
        9.10.1.1 Available Expressions
        9.10.1.2 Live Range Identification
    9.10.2 Live Variable Analysis
9.11 Redundant Common Subexpression Elimination
9.12 Copy Propagation
9.13 Induction Variable
Solved Exercise
Review Questions

Chapter-1  Introduction to Compiling

1.1 Introduction

Compilers are basically translators. Designing a compiler for some language is a complex and time consuming process. Since new tools for writing compilers are now available, this process has become a sophisticated activity. While studying the subject compiler it is necessary to understand what a compiler is and how the process of compilation is carried out. In this chapter we will get introduced to the basic concepts of compiler. We will see how the source program is compiled with the help of various phases of compiler. Lastly we will get introduced to various compiler construction tools.

1.2 Why to Write Compiler ?

In this section we will discuss two things : "What is compiler ?" and "Why to write compiler ?" Let us start with "What is compiler ?"

A compiler is a program which takes one language (source program) as input and translates it into an equivalent another language (target program). During this process of translation, if some errors are encountered then the compiler displays them as error messages. The basic model of compiler can be represented as follows.

Fig. 1.1 Compiler (source program → compiler → target program)

The compiler takes a source program written in a higher level language such as C, PASCAL or FORTRAN and converts it into a low level language or a machine level language such as assembly language.

1.2.1 Compiler : Analysis-Synthesis Model

The compilation can be done in two parts : analysis and synthesis. In the analysis part the source program is read and broken down into constituent pieces. The syntax and the meaning of the source string is determined and then an intermediate code is created from the input source program. In the synthesis part this intermediate form of the source language is taken and converted into an equivalent target program. During this process, if certain code has to be optimized for efficient execution then the required code is optimized.
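As a toy illustration of this split (a sketch of our own, not the book's code), the following program "analyzes" a one-operator assignment into an intermediate triple, then "synthesizes" pseudo target instructions from that triple. The mnemonics and register name are hypothetical.

```python
# Toy illustration of the analysis-synthesis model (not from the book):
# analysis reduces a source string to an intermediate form; synthesis
# turns that intermediate form into target instructions.

def analyze(source):
    """Analysis: break 'x = a + b' into an intermediate tuple."""
    target, expr = [s.strip() for s in source.split("=")]
    left, op, right = expr.split()
    return (op, left, right, target)          # intermediate representation

def synthesize(ir):
    """Synthesis: map the intermediate tuple to pseudo target code."""
    op, left, right, target = ir
    mnemonic = {"+": "ADD", "*": "MUL"}[op]   # hypothetical mnemonics
    return [f"MOV {left}, R1", f"{mnemonic} {right}, R1", f"MOV R1, {target}"]

ir = analyze("total = count + rate")
print(synthesize(ir))   # -> ['MOV count, R1', 'ADD rate, R1', 'MOV R1, total']
```

The point of the split is visible even in this toy: `synthesize` never sees the source text, only the intermediate tuple, which is what lets a real compiler swap target machines without redoing the analysis.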
The analysis and synthesis model is as shown in Fig. 1.2.

Fig. 1.2 Analysis and synthesis model

The analysis part is carried out in three sub parts, as shown in Fig. 1.3.

Fig. 1.3 Analysis of source program (lexical analysis, syntax analysis, semantic analysis)

1. Lexical Analysis - In this step the source program is read and then it is broken into a stream of strings. Such strings are called tokens. Hence tokens are nothing but collections of characters having some meaning.

2. Syntax Analysis - In this step the tokens are arranged in a hierarchical structure that ultimately helps in finding the syntax of the source string.

3. Semantic Analysis - In this step the meaning of the source string is determined.

In all these analysis steps the meaning of every source string should be unique. Hence actions in lexical, syntax and semantic analysis are uniquely defined for a given language. After carrying out the synthesis phase the program gets executed.

1.2.2 Execution of Program

To create an executable form of your source program, a compiler program alone is not sufficient. You may require several other programs to create an executable target program. After the synthesis phase a target code gets generated by the compiler. This target program is processed further before it can be run, as shown in Fig. 1.4.

Fig. 1.4 Process of execution of program (skeleton source program → preprocessor → source program → compiler → target assembly program → assembler → relocatable machine code → loader/link-editor, together with library and relocatable code files → executable machine code, run on input data)

The compiler takes a source program written in high level language as an input and converts it into a target assembly language. The assembler then takes this target assembly code as input and produces a relocatable machine code as an output. Then a program loader is called for performing the task of loading and link editing.
The task of the loader is to perform the relocation of an object code. Relocation of an object code means allocation of load time addresses which exist in the memory and placement of load time addresses and data in memory at proper locations. The link editor links the object modules and prepares a single module from several files of relocatable object modules, resolving the mutual references. These files may be library files, and these library files may be referred to by any program.

Properties of Compiler

When a compiler is built it should possess the following properties :

1. The compiler itself must be bug-free.
2. It must generate correct machine code.
3. The generated machine code must run fast.
4. The compiler itself must run fast (compilation time must be proportional to program size).
5. The compiler must be portable (i.e. modular, supporting separate compilation).
6. It must give good diagnostics and error messages.
7. The generated code must work well with existing debuggers.
8. It must have consistent optimization.

1.2.3 Analysis of Source Program

The source program can be analysed in three phases -

1) Linear Analysis : In this type of analysis the source string is read from left to right and grouped into tokens. For example - tokens for a language can be identifiers, constants, relational operators, keywords.

2) Hierarchical Analysis : In this analysis, characters or tokens are grouped hierarchically into nested collections for checking them syntactically.

3) Semantic Analysis : This kind of analysis ensures the correctness of the meaning of the program.

1.3 Compilation Process in Brief

In this section we will focus on "How a source program gets compiled ?"

1.3.1 The Phases of Compiler

As we have discussed earlier, the process of compilation is carried out in two parts : analysis and synthesis.
Again the analysis is carried out in three phases : lexical analysis, syntax analysis and semantic analysis. And the synthesis is carried out with the help of intermediate code generation, code generation and code optimization. Let us discuss these phases one by one.

1. Lexical Analysis

Lexical analysis is also called scanning. It is the phase of compilation in which the complete source code is scanned and the source program is broken up into groups of strings called tokens. A token is a sequence of characters having a collective meaning. For example, if some assignment statement in your source code is as follows,

total = count + rate * 10

then in the lexical analysis phase this statement is broken up into a series of tokens as follows :

1. The identifier total
2. The assignment symbol
3. The identifier count
4. The plus sign
5. The identifier rate
6. The multiplication sign
7. The constant number 10

The blank characters which are used in the programming statement are eliminated during the lexical analysis phase.

2. Syntax Analysis

Syntax analysis is also called parsing. In this phase the tokens generated by the lexical analyzer are grouped together to form a hierarchical structure. The syntax analysis determines the structure of the source string by grouping the tokens together. The hierarchical structure generated in this phase is called a parse tree or syntax tree. For the expression total = count + rate * 10 the parse tree is as shown in Fig. 1.5.

Fig. 1.5 Parse tree for total = count + rate * 10

In the statement 'total = count + rate * 10', first of all rate * 10 will be considered, because in an arithmetic expression the multiplication operation should be performed before the addition. And then the addition operation will be considered. For building such a syntax tree the production rules are to be designed. The rules are usually expressed by a context free grammar.
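Stepping back to the lexical phase for a moment, the token grouping shown above can be sketched with a few regular-expression patterns. This is a minimal illustration of our own, not the book's code; the token names (ID, ASSIGN, etc.) are hypothetical labels.

```python
import re

# A minimal tokenizer sketch: it yields the seven tokens listed in the text
# for "total = count + rate * 10" and discards blanks, as the text notes.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),           # constant numbers, e.g. 10
    ("ID",     r"[A-Za-z_]\w*"),  # identifiers, e.g. total, count, rate
    ("ASSIGN", r"="),             # assignment symbol
    ("PLUS",   r"\+"),            # plus sign
    ("STAR",   r"\*"),            # multiplication sign
    ("SKIP",   r"\s+"),           # blanks are eliminated
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(src):
    return [(m.lastgroup, m.group()) for m in MASTER.finditer(src)
            if m.lastgroup != "SKIP"]

print(tokenize("total = count + rate * 10"))
```

Note that the NUMBER pattern is listed before ID so that a digit run is never mistaken for an identifier; real scanner generators such as LEX resolve such overlaps with similar ordering rules.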
For the above statement the production rules are -

(1) E → identifier
(2) E → number
(3) E → E1 + E2
(4) E → E1 * E2
(5) E → (E)

where E stands for an expression.

- By rule (1), count and rate are expressions, and by rule (2), 10 is also an expression.
- By rule (4) we get rate * 10 as an expression.
- And finally count + rate * 10 is an expression.

3. Semantic Analysis

Once the syntax is checked in the syntax analysis phase, the next phase, i.e. semantic analysis, determines the meaning of the source string. For example, meaning of a source string means matching of parentheses in the expression, or matching of if...else statements, or performing arithmetic operations on expressions that are type compatible, or checking the scope of operation.

Fig. 1.6 Semantic analysis (in the tree for count + rate * 10, the constant 10 is converted by an "int to float" node)

Thus these three phases perform the task of analysis. After these phases an intermediate code gets generated.

4. Intermediate Code Generation

The intermediate code is a kind of code which is easy to generate and which can be easily converted to target code. This code comes in a variety of forms such as three address code, quadruple, triple and postfix. Here we will consider intermediate code in three address code form. This is like an assembly language. The three address code consists of instructions, each of which has at most three operands. For example,

t1 := inttofloat(10)
Some three address instructions may have fewer than three operands for example first and last instruction of above given three address code. i.e. {1 := int to float (10) total = t3 §. Code Optimization The code optimization phase attempts to improve the intermediate code. This is necessary to have a faster executing code or less consumption of memory. Thus by optimizing the code the overall running time of the target program can be improved. 6. Code Generation In code generation phase the target code gets generated. The intermediate code instructions are translated into sequence of machine instructions. Mov rate, 21 MUL #10.0, R1 MOV count, R2 ADD R2, R1 MOV R1, total Example - Show how an input a = b+c*60 get processed in compiler. Show the output at each stage of compiler. Also show the contents of symbol table. Principles of Compiler Design Introduction to Compiling Input processing in compiler Symbol table a=bic +60 | id = idgtid,* 60 Intermediate code generator ty: intiofioat (60) ty Fidget, tzid tty iy ty ida * 60.0 id, Code generator MOVE id, R, MULF #60.9, Ry MOVE id,,R, ADDF Ry, Ry MOVF Ry. id, Token stream Semantic tree Intermediate code Optimized code Machine code Principles of Compiler Design 1-9 Introduction to Compiling Symbol table management To support these phases of compiler a symbol table is maintained. The task of symbol table is to store identifiers (variables) used in the program The symbol table also stores information about attributes of each identifier. The attributes of identifiers are usually its type, its scope, information about the storage allocated for it. The symbol table also stores information about the subroutines used in the program. In case of subroutine, the symbol table stores the name of the subroutine, number of arguments passed to it, type of these arguments, the method of passing these arguments(may be call by value or call by reference) and return type if any. 
Basically the symbol table is a data structure used to store information about identifiers. The symbol table allows us to find the record for each identifier quickly and to store or retrieve data from that record efficiently.

During compilation the lexical analyzer detects an identifier and makes its entry in the symbol table. However, the lexical analyzer cannot determine all the attributes of an identifier, and therefore the attributes are entered by the remaining phases of the compiler. Various phases can use the symbol table in various ways. For example, while doing semantic analysis and intermediate code generation we need to know what the types of identifiers are. Then during code generation, typically the information about how much storage is allocated to an identifier is seen.

Error detection and handling

To err is human. As programs are written by human beings, they cannot be free from errors. In compilation, each phase detects errors. These errors must be reported to the error handler, whose task is to handle the errors so that the compilation can proceed. Normally the errors are reported in the form of a message. When the input characters do not form a token, the lexical analyzer detects it as an error. A large number of errors can be detected in the syntax analysis phase. Such errors are popularly called syntax errors. During semantic analysis, a type mismatch kind of error is usually detected.

Fig. 1.7 Phases of compiler (analysis phase : lexical analyzer, syntax analyzer, semantic analyzer; synthesis phase : intermediate code generator, code optimizer, code generator producing target machine code; symbol table management and error detection and handling interact with all phases)

1.4 Cousins of the Compiler

Sometimes the output of a preprocessor may be given as input to the compiler. Cousins of the compiler means the context in which the compiler typically operates. Such contexts are basically programs such as preprocessors, assemblers, loaders and link editors. Let us discuss them in detail.

1.
Preprocessors - The output of preprocessors may be given as the input to compilers. The tasks performed by preprocessors are given below :

Preprocessors allow the user to use macros in the program. Macro means some set of instructions which can be used repeatedly in the program. Thus the macro preprocessing task is done by preprocessors.

Preprocessors also allow the user to include the header files which may be required by the program. For example :

#include <stdio.h>

By this statement the header file stdio.h can be included, and the user can make use of the functions defined in this header file. This task of the preprocessor is called file inclusion.

Fig. 1.8 Role of preprocessor (skeleton source → preprocessor → source program)

Macro Preprocessor - A macro is a small set of instructions. Whenever a macro name is identified in a program, that name is replaced by the macro definition (i.e. the set of instructions defining the corresponding macro). For a macro there are two kinds of statements : macro definition and macro use. The macro definition is given by a keyword like "define" or "macro" followed by the name of the macro. For example :

#define PI 3.14

Whenever "PI" is encountered in the program it is replaced by the value 3.14.

2. Assemblers - Some compilers produce assembly code as output, which is given to an assembler as input. The assembler is a kind of translator which takes the assembly program as input and produces machine code as output.

Fig. 1.9 Role of assembler

An assembly code is a mnemonic version of machine code. Typical assembly instructions are as given below :

MOV  a, R1
MUL  #5, R1
ADD  #7, R1
MOV  R1, b

The assembler converts these instructions into a binary language which can be understood by the machine. Such a binary code is often called machine code. This machine code is a relocatable machine code that can be passed directly to the loader/linker for execution purposes.
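Stepping back to the preprocessor for a moment, the macro substitution described above (e.g. #define PI 3.14) can be sketched as a toy program. This is entirely our own illustration; a real preprocessor is far more careful about token boundaries than the naive text replacement used here.

```python
# A toy macro preprocessor: it collects `#define NAME VALUE` lines and
# textually substitutes each macro name in the remaining source lines.

def preprocess(source):
    macros, output = {}, []
    for line in source.splitlines():
        if line.startswith("#define"):
            _, name, value = line.split(maxsplit=2)
            macros[name] = value                       # record the definition
        else:
            for name, value in macros.items():
                line = line.replace(name, value)       # naive substitution
            output.append(line)
    return "\n".join(output)

src = "#define PI 3.14\narea = PI * r * r"
print(preprocess(src))   # -> area = 3.14 * r * r
```

The naive `replace` would also rewrite PI inside a longer identifier such as SPIN; real macro processors substitute whole tokens only, which is one reason preprocessing is treated as a phase of its own.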
The assembler converts the assembly program to low level machine language using two passes. A pass means one complete scan of the input program. The output of the second pass is the relocatable machine code.

3. Loaders and Link Editors - A loader is a program which performs two functions : loading and link editing. Loading is a process in which the relocatable machine code is read, the relocatable addresses are altered, and the code with altered instructions and data is placed in the memory at proper locations. The job of the link editor is to make a single program from several files of relocatable machine code. If code in one file refers to a location in another file then such a reference is called an external reference. The link editor resolves such external references as well.

1.5 Grouping of Phases

Fig. 1.10 Front end back end model (source program → front end : lexical analyzer, syntax analyzer, semantic analyzer → intermediate code → back end : code optimizer, code generator → output program)

Different phases of the compiler can be grouped together to form a front end and a back end. The front end consists of those phases that are primarily dependent on the source language and independent of the target language. The front end consists of the analysis part. Typically it includes lexical analysis, syntax analysis, and semantic analysis. Some amount of code optimization can also be done at the front end. The back end consists of those phases that are totally dependent upon the target language and independent of the source language. It includes code generation and code optimization.

The front end - back end model of the compiler is very advantageous because of the following reasons :

1. By keeping the same front end and attaching different back ends one can produce a compiler for the same source language on different machines.

2.
By keeping different front ends and the same back end one can compile several different languages on the same machine.

1.5.1 Concept of Pass

One complete scan of the source language is called a pass. It includes reading an input file and writing to an output file. Many phases can be grouped into one pass. It is difficult to compile the source program in a single pass, because the program may have some forward references. It is desirable to have relatively few passes, because it takes time to read and write intermediate files. On the other hand, if we group several phases into one pass we may be forced to keep the entire program in memory, and therefore the memory requirement may be large. In the first pass the source program is scanned completely, and the generated output will be an easy-to-use form which is equivalent to the source program along with additional information about storage allocation. It is possible to leave blank slots for missing information and fill each slot when the information becomes available. Hence there may be a requirement of more than one pass in the compilation process. A typical arrangement for an optimizing compiler is one pass for scanning and parsing, one pass for semantic analysis, and a third pass for code generation and target code optimization. C and Pascal permit one pass compilation; Modula-2 requires two passes.

1.6 Types of Compiler

In this section we will discuss various types of compilers.

Incremental compiler : An incremental compiler is a compiler which performs recompilation of only the modified source rather than compiling the whole source program. The basic features of an incremental compiler are :

1. It tracks the dependencies between the output and the source program.
2. It produces the same result as a full recompile.
3. It performs less work than a full recompilation.
4. The process of incremental compilation is effective for maintenance.

Cross compiler : Basically there exist three types of languages :

1. Source language, i.e.
the language of the application program.
2. Target language, in which the machine code is written.
3. Implementation language, in which the compiler is written.

There may be a case that all these three languages are different. In other words, there may be a compiler which runs on one machine and produces target code for another machine. Such a compiler is called a cross compiler. Thus by using the cross compilation technique, platform independence can be achieved. To represent a cross compiler, a T diagram is drawn as follows.

Fig. 1.11 (a) T diagram with S as source, T as target and I as implementation language

Fig. 1.11 (b) Cross compiler

For source language L the target language N gets generated, which runs on machine M.

1.7 Bootstrapping of Compiler

Bootstrapping is a process in which a simple language is used to translate a more complicated program, which in turn may handle a far more complicated program, and so on. Writing a compiler for any high level language is a complicated process, and it takes a lot of time to write a compiler from scratch. Hence a simple language is used to generate the target code in stages.

To clearly understand the bootstrapping technique, consider the following scenario. Suppose we want to write a cross compiler for a new language X. The implementation language of this compiler is, say, Y and the target code being generated is in language Z. That is, we create XyZ (a compiler for source language X, written in Y, generating target code in Z). Now if an existing compiler for Y runs on machine M and generates code for M, it is denoted as YmM. Now if we run XyZ using YmM, then we get a compiler XmZ : that is, a compiler for source language X that generates target code in language Z and which runs on machine M. The following diagram illustrates the above scenario.
Fig. 1.12 Bootstrapping with T diagrams (the two T diagrams for XyZ and YmM are joined; the implementation language of the first and the source language of the second must be the same)

Example - Compilers can be created in many different forms. Now we will generate a compiler which takes C language as input and generates assembly language as output, given a machine that understands assembly language.

Step 1 : First we write a compiler for a small subset of C, in assembly language. This is Compiler 1.

Step 2 : Then, using this small subset of C, i.e. C0, a compiler for the full source language C is written. This is Compiler 2.

Step 3 : Finally we compile the second compiler : using Compiler 1, Compiler 2 is compiled.

Step 4 : Thus we get a compiler written in ASM which compiles C and generates code in ASM.

1.8 Interpreter

An interpreter is a kind of translator which produces the result directly when the source language and data are given to it as input. It does not produce an object code; rather, the source program is translated each time it needs execution. The model for an interpreter is as shown in Fig. 1.13.

Fig. 1.13 Interpreter (source program + data → interpreter, direct execution → result)

Languages such as BASIC, SNOBOL and LISP can be translated using interpreters. JAVA also uses an interpreter. The process of interpretation can be carried out in the following phases :

1. Lexical analysis
2. Syntax analysis
3. Semantic analysis
4. Direct execution

Advantages :
- Modification of the user program can be easily made and implemented as execution proceeds.
- The type of object that denotes a variable may change dynamically.
- Debugging a program and finding errors is a simplified task for a program used for interpretation.
- The interpreter for the language makes the program machine independent.

Disadvantages :
- The execution of the program is slower.
- Memory consumption is more.

1.9 Comparison of Compilers and Interpreters

In an interpreter the analysis phase is the same as in a compiler, i.e.
lexical, syntactic and semantic analysis is also performed in an interpreter. In a compiler the object code needs to be generated, whereas in an interpreter it need not be. During the process of interpretation the interpreter exists along with the source program. Binding and type checking are dynamic in the case of an interpreter. Interpretation is less efficient than compilation, because the source program has to be interpreted every time it is to be executed, and every execution requires analysis of the source program. Interpreters can be made portable, as they do not produce object code; this is not possible in the case of a compiler. Interpreters save the time of assembling and linking. A self modifying program can be interpreted but cannot be compiled. The design of an interpreter is simpler than that of a compiler. Interpreters give us an improved debugging environment, as they can check errors like out-of-bound array indexing at run time. Interpreters are most useful in environments where dynamic program modification is required. 1.10 Compiler Construction Tools Writing a compiler is a tedious and time consuming task. There are some specialized tools that help in implementing various phases of compilers. These tools are called compiler construction tools. They are also called compiler-compilers, compiler-generators or translator writing systems. Various compiler construction tools are given below. 1. Scanner generators - These generators generate lexical analyzers. The specification given to these generators is in the form of regular expressions. UNIX has a utility for scanner generation called LEX. The specification given to LEX consists of regular expressions representing the various tokens. 2. Parser generators - These produce the syntax analyzer. The specification given to these generators is in the form of a context free grammar.
Typically, UNIX has a tool called YACC which is a parser generator. 3. Syntax-directed translation engines - In this tool the parse tree is scanned completely to generate intermediate code. The translation is done for each node of the tree. 4. Automatic code generators - These generators take intermediate code as input and convert each rule of the intermediate language into equivalent machine language. The template matching technique is used : the intermediate code statements are replaced by templates that represent the corresponding sequences of machine instructions. 5. Data flow engines - Data flow analysis is required to perform good code optimization; the data flow engines are basically useful in code optimization. Summary * A compiler is a program which takes one language (the source program) as input and translates it into an equivalent other language (the target program). * The compilation can be done in two parts : analysis and synthesis. * The various phases of a compiler are : lexical analysis, syntax analysis, semantic analysis, intermediate code generation, code optimization and code generation. * Along with these phases, symbol table management and error detection and handling are done. The compiler has a front end and back end skeleton. * The cousins of the compiler are the preprocessor, assembler, loader and link editor. * One complete scan of the source language is called a pass. It includes reading an input file and writing to an output file. * An interpreter is a kind of translator which produces the result directly when the source language and data are given to it as input. * Various compiler construction tools are the scanner generator and parser generator (like LEX and YACC), syntax directed translation engines, automatic code generators and data flow engines. In this chapter we have discussed the fundamental concepts of the compiler. Now we will discuss every phase of the compiler in detail in further chapters.
Solved Exercise Q.1 : Mention some of the cousins of the compiler. Ans. : Cousins of the compiler means the context in which the compiler typically operates. Such contexts are basically programs such as preprocessors, assemblers, loaders and link editors. 1. Preprocessors - The preprocessors allow the user to use macros in the program. Preprocessors also allow the user to include the header files which may be required by the program. 2. Assemblers - Some compilers produce assembly code as output, which is given to an assembler as input. The assembler takes the assembly program as input and produces the machine code as output. 3. Loaders and link editors - The loader performs the tasks of loading and link editing. It is a process in which the relocatable machine code is read and the relocatable addresses are altered. The link editor resolves the external references. Q.2 : What are compiler construction tools ? Ans. : Writing a compiler is a tedious and time consuming task. Therefore there are some specialized tools for implementing various phases of a compiler. These tools are called compiler construction tools. Various compiler construction tools are : i) Scanner generator - for example LEX. ii) Parser generator - for example YACC. iii) Syntax directed translation engines. iv) Automatic code generators. v) Data flow engines. Q.3 : What is a preprocessor ? Ans. : Preprocessors are programs which take the input source program and produce output which becomes the input to the compiler. Various functions performed by preprocessors are : i) Macro preprocessing ii) File inclusion iii) Language extensions. Q.4 : What is a cross compiler ? Give an example. Ans. : There may be a compiler which runs on one machine and produces the target code for another machine. Such a compiler is called a cross compiler. Thus, by using the cross compilation technique, platform independence can be achieved.
For example : For the first version of the EQN compiler, the compiler is written in C and the commands are generated for TROFF, as shown in Fig. 1.14. Fig. 1.14 The cross compiler for EQN can be obtained by running it on the PDP-11 through a C compiler which produces output for the PDP-11, as shown in Fig. 1.15. Fig. 1.15 Q.5 : What is an incremental compiler ? Enlist the basic features of an incremental compiler. Ans. : An incremental compiler is a compilation scheme in which only the modified source text gets recompiled and merged with the previously compiled code to form a new target code. Thus an incremental compiler avoids recompilation of the whole source code on a modification; rather, only the modified portion of the source program gets recompiled. Various features of an incremental compiler are : i) During the program development process, modifications to the source program can cause recompilation of the whole source text. This overhead is reduced in an incremental compiler. ii) Run time errors can be patched up just like compile time errors, keeping the program modification as it is. iii) The compilation process is faster. iv) Handling batch programs becomes very flexible using an incremental compiler. v) In an incremental compiler a program structure table is maintained for memory allocation of the target code. When a new statement is compiled, a new entry for it is created in the program structure table. Thus the memory allocation required for an incremental compiler need not be contiguous. Q.6 : With the help of a block schematic explain how "compiler-compilers" can reduce the effort in implementing a new compiler. Ans. : There are various tools that help in the implementation of various phases of a compiler. Such tools are called compiler-compilers. These tools are : i) Scanner generator : These generators generate the lexical analyzer. The specification given to these generators is in the form of regular expressions. For example LEX.
ii) Parser generator : These produce the syntax analyzer. The specification is given to the generator in the form of a context free grammar. For example YACC. iii) Syntax directed translation engine : In this tool the parse tree is scanned completely to generate intermediate code. The translation is done for each node of the tree. iv) Data flow engines : The data flow engines are required to perform code optimization. v) Automatic code generator : It takes the intermediate code as input and produces the machine code as output. The block schematic is shown in Fig. 1.16 : the source program passes through the scanner generator (e.g. LEX, built from regular expressions), then through the parser generator (e.g. YACC, built from a context free grammar) giving a parse tree, then through the syntax directed translation engines giving intermediate code, then through the data flow engines giving optimized code, and finally through the automatic code generator giving the target machine code. Fig. 1.16 Q.7 : Define a pass of a compiler. Which factors decide the number of passes for a compiler ? Ans. : Several phases of a compiler are grouped into one pass; thus a pass of a compiler is simply a collection of phases. For example, lexical analysis, syntax analysis and intermediate code generation may be grouped together to form a single pass. Various factors affecting the number of passes in a compiler are : 1. Forward references 2. Storage limitations 3. Optimization. A compiler can require more than one pass to resolve information that becomes available only later in the source. Review Questions 1. What is a compiler ? Explain the analysis-synthesis model of a compiler. 2. With a neat block diagram, explain the various phases of a compiler. 3. What are the advantages of the front-end and back-end of a compiler ? 4. Explain the concept of compiler-compiler. 5. What is the concept of bootstrapping ? 6. What is an interpreter ? 7. Differentiate between interpreter and compiler. 8. Enlist the properties of a compiler.
9. Explain why it is better to have two passes in a compiler than having one pass. 10. What is the use of the symbol table in the process of compilation ? Lexical Analysis 2.1 Introduction The process of compilation starts with the first phase, called lexical analysis. In this phase the input is scanned completely in order to identify the tokens. The token structure can be recognized with the help of diagrams popularly known as finite automata, and to construct such finite automata regular expressions are used. These diagrams can be translated into a program for identifying tokens. In this chapter we will see what the role of the lexical analyzer in the compilation process is. We will also discuss the method of identifying tokens from the source program. Finally we will learn a tool, LEX, which automates the construction of the lexical analyzer. 2.2 Role of Lexical Analyzer The lexical analyzer is the first phase of the compiler. The lexical analyzer reads the input source program from left to right one character at a time and generates the sequence of tokens. Each token is a single logical cohesive unit such as an identifier, keyword, operator or punctuation mark. These tokens can then be used by the parser to determine the syntax of the source program. The role of the lexical analyzer in the process of compilation is as shown in Fig. 2.1 : the lexical analyzer reads the input and returns tokens to the parser on demand, and the rest of the compiler produces the target code. Fig. 2.1 Role of lexical analyzer As the lexical analyzer scans the source program to recognize the tokens, it is also called a scanner. Apart from token identification, the lexical analyzer also performs the following functions. Functions of lexical analyzer 1. It produces a stream of tokens. 2. It eliminates blanks and comments. 3. It generates the symbol table, which stores the information about identifiers and constants encountered in the input. 4. It keeps track of line numbers. 5. It reports the errors encountered while generating the tokens.
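The five functions listed above can be seen at work in a toy scanner. The sketch below is illustrative only : the token classes, regular expressions and error format are assumptions for demonstration, not the book's own specification.

```python
import re

# Token classes tried in order; COMMENT and WHITESPACE are eliminated,
# NEWLINE is used only to keep track of line numbers.
TOKEN_SPEC = [
    ("COMMENT",    r"//[^\n]*"),
    ("WHITESPACE", r"[ \t]+"),
    ("NEWLINE",    r"\n"),
    ("KEYWORD",    r"\b(?:if|else|int|return)\b"),
    ("NUMBER",     r"\d+"),
    ("IDENTIFIER", r"[A-Za-z_]\w*"),
    ("OPERATOR",   r"[()<>=+\-;{},]"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def scan(source):
    tokens, symbol_table, errors = [], {}, []
    line, pos = 1, 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if not m:
            errors.append((line, source[pos]))     # function 5: error reporting
            pos += 1
            continue
        kind, lexeme = m.lastgroup, m.group()
        if kind == "NEWLINE":
            line += 1                              # function 4: line numbers
        elif kind not in ("WHITESPACE", "COMMENT"):  # function 2: elimination
            if kind in ("IDENTIFIER", "NUMBER"):
                symbol_table.setdefault(lexeme, kind)  # function 3: symbol table
            tokens.append((kind, lexeme))          # function 1: token stream
        pos = m.end()
    return tokens, symbol_table, errors
```

For instance, scan("if (a<10)\n i = i + 2;") yields a token stream starting with ("KEYWORD", "if"), a symbol table containing a, 10, i and 2, and an empty error list.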
The lexical analyzer works in two phases : in the first phase it performs scanning, and in the second phase it does lexical analysis, i.e. it generates the series of tokens. 2.2.1 Tokens, Patterns, Lexemes Let us learn some terminologies which are frequently used when we talk about the activity of lexical analysis. Tokens : A token describes the class or category of the input string. For example, identifiers, keywords and constants are tokens. Patterns : The set of rules that describe the token. Lexemes : A sequence of characters in the source program that is matched with the pattern of the token. For example : int, i, num, ans, choice. Let us take one example of a programming statement to clearly understand these terms : if (a<b) Here "if", "(", "a", "<", "b", ")" are all lexemes, and "if" is a keyword, "(" is an opening parenthesis, "a" is an identifier, "<" is an operator, and so on. Now, a pattern defining the identifier could be : 1. An identifier is a collection of letters. 2. An identifier is a collection of alphanumeric characters whose beginning character is necessarily a letter. When we want to compile a given source program, we submit that program to the compiler. The compiler scans the source program and produces a sequence of tokens; therefore lexical analysis is also called scanning. For example, a piece of source code is given below : int MAX (int a, int b) { if (a>b) return a; else return b; } Lexeme : Token int : keyword MAX : identifier ( : operator int : keyword a : identifier , : operator int : keyword b : identifier ) : operator { : operator if : keyword The blank and new line characters can be ignored. This stream of tokens will be given to the syntax analyzer. 2.3 Input Buffering The lexical analyzer scans the input string from left to right one character at a time. It uses two pointers, begin_ptr (bp) and forward_ptr (fp), to keep track of the portion of the input scanned.
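The begin_ptr/forward_ptr pair just introduced can be sketched in a few lines. This is a simplification for illustration : it assumes a lexeme ends at a blank, as in the text's "int" example, and ignores buffering and other delimiters.

```python
def next_lexeme(buffer, bp):
    """Return the lexeme starting at bp and the position where bp is
    set for the next token (a sketch of the bp/fp scan described above)."""
    fp = bp
    while fp < len(buffer) and buffer[fp] != ' ':
        fp += 1                       # forward_ptr searches for end of lexeme
    lexeme = buffer[bp:fp]
    while fp < len(buffer) and buffer[fp] == ' ':
        fp += 1                       # white space is ignored; bp joins fp here
    return lexeme, fp
```

Starting from bp = 0 on the string "int i = 10", the first call returns the lexeme "int" with the pointers repositioned at "i"; repeated calls walk through "i", "=" and "10" in turn.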
Initially both the pointers point to the first character of the input string, as shown in Fig. 2.2. Fig. 2.2 Initial configuration The forward_ptr moves ahead to search for the end of the lexeme; as soon as a blank space is encountered, it indicates the end of the lexeme. In the above example, as soon as forward_ptr (fp) encounters a blank space, the lexeme "int" is identified. Fig. 2.3 Input buffering When fp encounters white space it ignores it and moves ahead; then both begin_ptr (bp) and forward_ptr (fp) are set at the next token. Fig. 2.4 Input buffering The input characters are thus read from secondary storage. But reading in this way from secondary storage is costly, hence a buffering technique is used : a block of data is first read into a buffer, and then scanned by the lexical analyzer. There are two methods used in this context : the one buffer scheme and the two buffer scheme. 1. One buffer scheme (Fig. 2.5) : only one buffer is used to store the input string. The problem with this scheme is that if a lexeme is very long it crosses the buffer boundary; to scan the rest of the lexeme the buffer has to be refilled, which overwrites the first part of the lexeme. 2. Two buffer scheme (Fig. 2.6) : to overcome the problem of the one buffer scheme, two buffers are used to store the input string, and they are scanned alternately. When the end of the current buffer is reached, the other buffer is filled. The only problem with this method is that a lexeme longer than a buffer still cannot be scanned completely. Fig. 2.5 One buffer scheme for storing the input string Fig. 2.6 Two buffer scheme for storing the input string Initially both bp and fp point to the first character of the first buffer.
Then fp moves towards the right in search of the end of the lexeme. As soon as a blank character is recognized, the string between bp and fp is identified as the corresponding token. To identify the boundary of the first buffer, an end-of-buffer character is placed at the end of the first buffer; similarly, the end of the second buffer is recognized by the end-of-buffer mark present at its end. When fp encounters the first eof, the end of the first buffer is recognized and filling of the second buffer is started. In the same way, when the second eof is reached it indicates the end of the second buffer. Alternately, both the buffers are filled up until the end of the input program, and the stream of tokens is identified. This eof character introduced at the end is called a sentinel, and it is used to identify the end of a buffer. Code for input buffering : if (fp == eof(buff1)) { /* encountered end of first buffer */ /* refill buffer2 */ fp++; } else if (fp == eof(buff2)) { /* encountered end of second buffer */ /* refill buffer1 */ fp++; } else if (fp == eof(input)) return; /* terminate scanning */ else fp++; /* still remaining input has to be scanned */ 2.4 Specification of Tokens To specify tokens, regular expressions are used. When a pattern is matched by some regular expression, the token can be recognized. Let us understand the fundamental concepts of language. 2.4.1 Strings and Language A string is a collection of a finite number of alphabets or letters. Strings are synonymously called words. * The length of a string S is denoted by |S|. * The empty string is denoted by ε. * The empty set of strings is denoted by φ. The following terms are commonly used for strings. Term : Meaning Prefix of string : A string obtained by removing zero or more tail symbols. For example, for the string Hindustan a prefix could be 'Hindu'. Suffix of string : A string obtained by removing zero or more leading symbols.
For example, for the string Hindustan a suffix could be 'stan'. Substring : A string obtained by removing a prefix and a suffix of a given string is called a substring. For example, for the string Hindustan, 'indu' is a substring. Subsequence of string : Any string formed by removing zero or more, not necessarily contiguous, symbols is called a subsequence of the string. For example, 'Hisan' is a subsequence of Hindustan. 2.4.2 Operations on Language As we have seen, a language is a collection of strings. There are various operations which can be performed on languages. Operation : Description Union of two languages L1 and L2 : L1 ∪ L2 = {set of strings in L1 and strings in L2}. Concatenation of two languages L1 and L2 : L1.L2 = {set of strings in L1 followed by strings in L2}. Kleene closure of L : L* = ∪ (i = 0 to ∞) L^i, i.e. L* denotes zero or more concatenations of L. Positive closure of L : L+ = ∪ (i = 1 to ∞) L^i, i.e. L+ denotes one or more concatenations of L. For example, let L be the set of letters L = {A, B, C, ..., Z, a, b, c, ..., z} and D the set of digits D = {0, 1, 2, ..., 9}; then by performing the operations discussed above, new languages can be generated as follows : * L ∪ D is a set of letters and digits. * L.D is a set of strings consisting of a letter followed by a digit. * L5 is a set of strings having length 5 each. * L* is the set of all strings over L, including ε. * L+ is the set of all strings over L, except ε. 2.4.3 Regular Expressions Regular expressions are mathematical symbolisms which describe the set of strings of a specific language. They provide a convenient and useful notation for representing tokens. Here are some rules that describe the definition of regular expressions over an input set denoted by Σ. 1. ε is a regular expression that denotes the set containing the empty string. 2. If R1 and R2 are regular expressions then R = R1 + R2 (also written as R = R1 | R2) is a regular expression representing the union operation. 3.
If R1 and R2 are regular expressions then R = R1.R2 is also a regular expression, representing the concatenation operation. 4. If R1 is a regular expression then R = R1* is also a regular expression, representing the Kleene closure. A language denoted by a regular expression is said to be a regular set or a regular language. Let us see some examples of regular expressions. Example 2.1 : Write a regular expression (R.E.) for a language containing the strings of length two over Σ = {0, 1}. Solution : R.E. = (0+1)(0+1). Example 2.2 : Write a regular expression for a language containing strings which end with "abb" over Σ = {a, b}. Solution : R.E. = (a+b)*abb. Example 2.3 : Write a regular expression for recognizing an identifier. Solution : For denoting an identifier we consider the sets of letters and digits, because an identifier is a combination of letters, or of letters and digits, always having a letter as its first character. Hence the R.E. can be denoted as R.E. = letter (letter + digit)*, where letter = {A, B, ..., Z, a, ..., z} and digit = {0, 1, 2, ..., 9}. Example 2.4 : Design the regular expression (r.e.) for the language accepting all combinations of a's except the null string, over Σ = {a}. Solution : The regular expression has to be built for the language L = {a, aa, aaa, ...}. This set indicates that there is no null string, so we can write R = a+. The + is called positive closure. Example 2.5 : Design a regular expression for the language containing all strings with any number of a's and b's. Solution : The r.e. will be r.e. = (a+b)*. The set for this r.e. will be L = {ε, a, aa, ab, b, ba, bab, abab, ..., any combination of a and b}. The (a + b)* means any combination of a and b, even the null string. Example 2.6 : Construct a regular expression for the language containing all strings having any number of a's and b's except the null string.
Solution : r.e. = (a+b)+. This regular expression gives the set of strings of any combination of a's and b's except the null string. Example 2.7 : Construct the r.e. for the language accepting all strings which end with 00, over the set Σ = {0, 1}. Solution : The r.e. has to be formed such that at the end there is 00. That means r.e. = (any combination of 0's and 1's) 00, i.e. r.e. = (0+1)*00. Thus the valid strings are 100, 0100, 1000, ... : all strings ending with 00. Example 2.8 : Write the r.e. for the language accepting strings which start with 1 and end with 0, over the set Σ = {0, 1}. Solution : The first symbol in the r.e. should be 1 and the last symbol should be 0. So R.E. = 1(0+1)*0. Note that the condition is strictly followed by keeping the starting and ending symbols correctly; in between them there can be any combination of 0 and 1, including the null string. Example 2.9 : Write a regular expression to denote the language L over Σ*, where Σ = {a, b, c}, in which every string is such that any number of a's is followed by any number of b's followed by any number of c's. Solution : Any number of a's means a*. Similarly, any number of b's and any number of c's means b* and c*. So the regular expression is r.e. = a* b* c*. Example 2.10 : Write an r.e. to denote a language L over Σ*, where Σ = {a, b}, such that the 3rd character from the right end of the string is always a. Solution : r.e. = (a+b)* a (a+b) (a+b), i.e. any number of a's and b's, then a as the 3rd symbol from the right, then any two symbols. Thus the valid strings are babb, baaa, abb, baba, aaab and so on. Example 2.11 : Construct an r.e. for the language which consists of exactly two b's over the set Σ = {a, b}. Solution : There should be exactly two b's : any number of a's, b, any number of a's, b, any number of a's. Hence r.e. = a* b a* b a*. Here a* indicates either any number of a's or the null string.
Thus we can derive any string having exactly two b's and any number of a's. 2.4.4 Notations used for Representing Regular Expressions Regular expressions are tiny units which are useful for representing the set of strings belonging to some specific language. Let us see the notations used for writing regular expressions. 1. One or more instances - To represent one or more instances the + sign is used. If r is a regular expression then r+ denotes one or more occurrences of r. For example, the set of strings in which there are one or more occurrences of 'a' over the input set {a} can be written as the regular expression a+. It denotes the set {a, aa, aaa, aaaa, ...}. 2. Zero or more instances - To represent zero or more instances the * sign is used. If r is a regular expression then r* denotes zero or more occurrences of r. For example, the set of strings in which there are zero or more occurrences of 'a' over the input set {a} can be written as a*. It denotes the set {ε, a, aa, aaa, ...}. 3. Character classes - A class of symbols can be denoted by [ ]. For example, [012] means 0 or 1 or 2. Similarly, the complete class of small letters from a to z can be represented by the regular expression [a-z]; the hyphen indicates the range. We can also write a regular expression representing any word of small letters as [a-z]*. As we know, a language can be represented by a regular expression, but not all languages can be represented by r.e.; such languages are called non regular languages. Let us understand them. 2.4.5 Non Regular Languages A language which cannot be described by a regular expression is called a non regular language, and the set denoted by such a language is called a non regular set. There are some languages which cannot be described by regular expressions. For example, we cannot write a regular expression to check whether a given string is a palindrome.
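The regular examples above can be checked mechanically. The aside below translates two of them into Python's re syntax (an illustrative assumption : Python writes union as | rather than +, and fullmatch tests the whole string) : Example 2.2, strings ending in abb, and Example 2.11, strings with exactly two b's.

```python
import re

# (a+b)*abb from Example 2.2, written in Python syntax as (a|b)*abb.
ends_abb = re.compile(r"(a|b)*abb")
# a* b a* b a* from Example 2.11 (exactly two b's).
two_bs = re.compile(r"a*ba*ba*")

def matches(pattern, s):
    # fullmatch insists the pattern describes the entire string,
    # mirroring the language-membership test in the examples.
    return pattern.fullmatch(s) is not None
```

So matches(ends_abb, "aababb") holds while matches(ends_abb, "abba") does not, and matches(two_bs, "abaaba") holds while a string with one or three b's is rejected.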
Similarly, we cannot write a regular expression to check whether a string has balanced parentheses. Thus regular expressions can denote only a fixed number of repetitions; unbounded, unspecified counting cannot be represented by a regular expression. 2.5 Recognition of Tokens For a programming language there are various types of tokens, such as identifiers, keywords, constants and operators. A token is usually represented by a pair : token type and token value. Fig. 2.7 Token representation The token type tells us the category of the token, and the token value gives us the information regarding the token; the token value is also called the token attribute. During the lexical analysis process the symbol table is maintained, and the token value can be a pointer into the symbol table in the case of identifiers and constants. The lexical analyzer reads the input program and generates a symbol table for tokens. For example, we will consider some encoding of tokens as follows : Token : Code : Value if : 1 : - else : 2 : - while : 3 : - for : 4 : - identifier : 5 : ptr to symbol table constant : 6 : ptr to symbol table Consider a program code such as if (a<10) i=i+2; else i=i-2; Our lexical analyzer will generate the following token stream : 1, (8,1), (5,100), (7,1), (6,105), (8,2), (5,107), 10, (5,107), (9,1), (6,110), 2, (5,107), 10, (5,107), (9,2), (6,110). The corresponding symbol table for identifiers and constants will be : Location : Type : Value 100 : identifier : a 105 : constant : 10 107 : identifier : i 110 : constant : 2 In the above example the scanner scans the input string, recognizes "if" as a keyword and returns token type 1, since in the given encoding code 1 indicates the keyword "if"; hence 1 is at the beginning of the token stream. Next is the pair (8,1), where 8 indicates a parenthesis and 1 indicates the opening parenthesis '('.
Then we scan the input 'a'; the analyzer recognizes it as an identifier and searches the symbol table to check whether an entry for it is already present. If not, it inserts the information about this identifier into the symbol table and returns 100. If the same identifier or variable is already present in the symbol table, the lexical analyzer does not insert it again; instead, it returns the location where it is present. 2.6 Block Schematic of Lexical Analyzer Lexical analysis is a process of recognizing tokens from the input source program. Now the question is, how does the lexical analyzer recognize tokens from a given source program ? The lexical analyzer stores the input in a buffer and builds the regular expressions for the corresponding tokens. From these regular expressions, finite automata are built. When a lexeme matches the pattern generated by the finite automata, the specific token gets recognized. The block schematic for this process is shown in Fig. 2.8 : the input buffer feeds the lexical analyzer, which uses a finite state machine simulator and a pattern matching algorithm over the finite automata to produce tokens. Fig. 2.8 Block schematic of lexical analyzer While constructing the lexical analyzer we first design the regular expression for recognizing the corresponding token. A diagram resembling a flowchart is then built; such a diagram is called a transition diagram. The transition diagram elaborates the actions to be taken while recognizing the token. The lexeme is stored in an input buffer, and the forward pointer scans the input character by character, moving from left to right. The transition diagram is used to keep track of the information about the characters that are seen as the forward pointer scans the input. Positions in a transition diagram are called states; they are drawn as circles, and the edges in the diagram represent the transitions from one state to another.
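The symbol-table behaviour just described, insert an identifier or constant on its first occurrence and return the existing location afterwards, can be sketched as follows. The numeric codes follow the encoding table above; the starting location 100 and the address spacing are assumptions for illustration.

```python
# Assumed keyword codes (if=1, else=2, while=3, for=4) per the encoding table;
# identifiers and constants carry a pointer into the symbol table.
KEYWORDS = {"if": 1, "else": 2, "while": 3, "for": 4}
ID_CODE, CONST_CODE = 5, 6

def tokenize(lexemes):
    symtab, stream, next_loc = {}, [], 100   # 100 = assumed first location
    for lex in lexemes:
        if lex in KEYWORDS:
            stream.append((KEYWORDS[lex], None))
        elif lex.isdigit() or lex.isidentifier():
            if lex not in symtab:            # insert only on first occurrence
                symtab[lex] = next_loc
                next_loc += 5                # assumed address spacing
            code = CONST_CODE if lex.isdigit() else ID_CODE
            stream.append((code, symtab[lex]))  # value = symbol-table pointer
        else:
            stream.append(("op", lex))       # operators left symbolic here
    return stream, symtab
```

Tokenizing ["if", "(", "a", "<", "10", ")", "i", "=", "i", "+", "2", ";"] yields (1, None) for the keyword, and both occurrences of i map to the same symbol-table location, just as the text explains.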
Fig. 2.9 Transition diagram There is a special state called the start state, which denotes the starting point of the transition diagram. From the start state we begin recognizing the token, and after recognizing the token we should reach the final state. In Fig. 2.9, S1 is a start state and S2 is a final state. Let us take some examples to understand the concept of the transition table. The transition table is a tabular representation of the transition diagram : we first design the transition diagram and then build the transition table. Fig. 2.10 Transition table The finite state machine (FSM) can be defined as a collection of 5-tuples (Q, Σ, δ, q0, F), where Q is a finite, non empty set of states, Σ is the input alphabet (the input set), q0 in Q is the initial state, F is the set of final states, and δ is the transition function, i.e. the function defining the move to the next state. The finite automaton can be represented by an input tape, a tape head and a finite control. A finite automaton is a mathematical representation of a finite state machine. The machine has an input tape in which the input is placed, occupying one character per cell. The tape head reads a symbol from the input tape. The finite control is always in one of the internal states, and it decides what the next state will be after the tape head reads the input. For example, suppose the current state is q1 and the tape head is pointing to character c; it is the finite control which decides what the next state will be on input c. Example 2.12 : Design a FA which accepts only the input 101 over the input set Σ = {0, 1}. Solution : Fig. 2.12 For Ex. 2.12 Note that the problem statement says only the input 101 will be accepted. Hence in the solution we simply show the transitions for the input 101; no other path is shown for other inputs. Example 2.13 : Design a FA which checks whether a given binary number is even.
Solution : A binary number is made up of 0's and 1's; when a binary number ends with 0 it is always even, and when it ends with 1 it is always odd. For example, 0100 is an even number (it is equal to 4) and 0011 is an odd number (it is equal to 3). So while designing the FA we assume one start state, one state for strings ending in 0 and another state for strings ending in 1. Since we want to check whether the given binary number is even, we make the state for 0 the final state. Fig. 2.13 For Ex. 2.13 The FA indicates clearly that S1 is the state which handles the 1's and S2 is the state which handles the 0's. Let us take the input 01000 : S0 01000 → 0 S2 1000 → 01 S1 000 → 010 S2 00 → 0100 S2 0 → 01000 S2. At the end of the input we are in the final (accept) state, so it is an even number. Similarly, take another input, 1011 : S0 1011 → 1 S1 011 → 10 S2 11 → 101 S1 1 → 1011 S1. Now we are at state S1, which is a non final state. Another way to represent a FA is with the help of a transition table : State : on 0 : on 1 → S0 (start) : S2 : S1 S1 : S2 : S1 * S2 (final) : S2 : S1 Table 2.1 Transition table Thus Table 2.1 lists in the first column all the current states, and under the columns 0 and 1 the next states. The first row of the transition table can be read as : when the current state is S0, on input 0 the next state will be S2 and on input 1 the next state will be S1. The arrow marked at S0 indicates that it is the start state, and the circle marked at S2 indicates that it is the final state. Example 2.14 : Design a FA which accepts only those strings which start with 1 and end with 0. Solution : The FA will have a start state A from which only an edge with input 1 goes to the next state. Fig. 2.14 For Ex. 2.14 In state B, if we read 1 we stay in state B, but if we read 0 at state B we reach state C, which is the final state. In state C, if we read 0 or 1 we go to state C or B respectively.
Note that special care is taken for 0: if the input ends with 0, the machine will be in the final state.

Example 2.15 : Design a FA which accepts an odd number of 1's and any number of 0's.

Solution : In the problem statement it is indicated that there will be a state which is meant for an odd number of 1's, and that will be the final state. There is no condition on the number of 0's.

Fig. 2.15 For Ex. 2.15

At the start, if we read input 1 then we go to state S1, which is a final state, as we have read an odd number of 1's. There can be any number of zeros at any state, and therefore a self loop on input 0 is applied to state S0 as well as to state S1. For example, if the input is 10101101, this string contains any number of zeros but an odd number of ones; the machine processes this input as shown in the ladder diagram, ending in the final state.

Fig. 2.16 Ladder diagram of processing the input

Example 2.16 : Design a FA which checks whether a given unary number is divisible by 3.

Solution :

Fig. 2.17 For Ex. 2.16

A unary number is made up of ones. The number 3 can be written in unary form as 111, the number 5 as 11111, and so on. A unary number divisible by 3 can be 111 or 111111 or 111111111 and so on. The transition table is as follows:

State    Input 1
→ q0     q1
  q1     q2
  q2     q3
  q3     q1

Consider the number 111111, which is equal to 6, i.e. divisible by 3. After a complete scan of this number we reach the final state q3.

Let us test the above FA. If the number is 36, it proceeds by reading the number digit by digit:
From the start state, on digit 3, go to state S1 (the remainder-0 state).
From S1, on digit 6, remain in S1.
The input ends and we are in the final state S1.

Fig. 2.19

Hence the number is divisible by 3, isn't it?
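The unary automaton of Example 2.16 simply counts 1's modulo 3. A minimal Python sketch (it tracks the remainder directly instead of the state names, and, like the automaton, it rejects the empty string):

```python
def unary_div_by_3(ones):
    """Accept a unary string (all 1's) whose length is a positive multiple of 3."""
    remainder = 0
    for ch in ones:
        assert ch == "1"            # unary alphabet: only 1's allowed
        remainder = (remainder + 1) % 3
    return len(ones) > 0 and remainder == 0
```

So "111" and "111111" are accepted while "11111" is not, matching the transition table above.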
Similarly, if the number is 121, which is not divisible by 3, it will be processed as

121 ⊢ 1 S2 21 ⊢ 12 S1 1 ⊢ 121 S2

and S2 is the remainder-1 state.

Example 2.18 : Design a FA which checks whether a given binary number is divisible by three.

Solution : As we have discussed in the previous example, the same logic is applicable to this problem as well. The input number is a binary number, hence the input set will be Σ = {0, 1}. The start state is denoted by S0, the remainder-0 state by S1, the remainder-1 state by S2 and the remainder-2 state by S3.

Fig. 2.20 For Ex. 2.18

Let us take the number 010, which is equivalent to 2. We will scan this number from MSB to LSB:

0 1 0
S1 S2 S3

so we reach state S3, which is the remainder-2 state. Similarly, for the input 1001, which is equivalent to 9, we will be in the final state after scanning the complete input:

1 0 0 1
S2 S3 S2 S1

Thus the number is really divisible by 3.

Example 2.19 : Design a FA which accepts an even number of 0's and an even number of 1's.

Solution : This FA needs four different states for the inputs 0 and 1. The states correspond to: even number of 0's and even number of 1's; even number of 0's and odd number of 1's; odd number of 0's and even number of 1's; odd number of 0's and odd number of 1's. Let us try to design the machine.

Fig. 2.21

Here q0 is the start state as well as the final state. Note carefully that a symmetry of 0's and 1's is maintained. We can associate meanings to each state as:
q0 : state of even number of 0's and even number of 1's.
q1 : state of odd number of 0's and even number of 1's.
q2 : state of odd number of 0's and odd number of 1's.
q3 : state of even number of 0's and odd number of 1's.
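The S1/S2/S3 remainder bookkeeping of Example 2.18 can be folded into arithmetic: reading MSB first, appending a bit b turns remainder r into (2r + b) mod 3. A small Python sketch:

```python
def binary_div_by_3(bits):
    """Scan a binary string MSB-to-LSB, tracking the value modulo 3."""
    r = 0
    for b in bits:
        r = (2 * r + int(b)) % 3   # shifting left doubles the value, then add the bit
    return r == 0
```

For "1001" (nine) the remainder ends at 0, so it is accepted; for "010" (two) it ends at 2, the same remainder-2 outcome traced above.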
Principles of Compiler Design 2-23 Lexical Analysis return (operator, LE) retum (Operator, NE) Fig. 2.27 Transition diagram for relational operator 2.7 Automatic Construction of Lexical Analyzer For efficient design of compiler, various tools have been built for constructing lexical analyzers using the special purpose notations called regular expressions. The regular expressions are used in recognizing the tokens. Now we will discuss a special language that specifies the tokens using regular expressions. A tool called LEX gives this specification. Basically LEX is a Unix utility which generates the lexical analyzer. While discussing LEX we will lear how to write the specification file. In LEX tool, designing the regular expressions for corresponding tokens is a logical task. Having a file name with extension] (often pronounced as dot L) the lex specification file can be created. For example : the specification file name can be x1. This x. is then given to LEX compiler to produce lexyy.c. This lexyy.c is a C program which is actually a lexical analyzer program. As we know that the specification file stores the regular expressions for the tokens, the lex.yy.c file consists of the tabular representation of the transition diagrams constructed for the regular expressions of specification file say x1. The lexemes can be recognized with help of tabular transitions diagrams and some standard routines. In the specification file of LEX actions are associated with each regular expression. This actions are simply pieces of C code. These pieces of C code are directly carried over to the lex.yy.c. Finally the C compiler compiles this generated lex.yy.c and produces an object program a.out. Principles of Compiler Design 2-24 Lexical Analysis When some input stream is given to a.out then sequence of tokens get generated. The above described scenario is modeled below. 
x.l (LEX specification file) → LEX compiler → lex.yy.c
lex.yy.c → C compiler → a.out (lexical analyzer program)
input stream from source program → a.out → stream of tokens

Fig. 2.28 Generation of lexical analyzer using LEX

Now the question arises: how do we write the specification file? The LEX program consists of three parts:
1. Declaration section
2. Rule section
3. Auxiliary procedure section

%{
   declaration section
%}
%%
   rule section
%%
   auxiliary procedure section

• In the declaration section, declarations of variables and constants can be done. Some regular definitions can also be written in this section. The regular definitions are basically components of the regular expressions appearing in the rule section.
• The rule section consists of regular expressions with associated actions. These translation rules can be given in the form

R1   {action1}
R2   {action2}
...
Rn   {actionn}

where each Ri is a regular expression and each actioni is a program fragment describing what action is to be taken for the corresponding regular expression. These actions can be specified by pieces of C code.
• The third section is the auxiliary procedure section, in which all the required procedures are defined. Sometimes these procedures are required by the actions in the rule section.
• The lexical analyzer or scanner works in co-ordination with the parser. When activated by the parser, the lexical analyzer begins reading its remaining input, one character at a time. When the string is matched with one of the regular expressions Ri, the corresponding actioni gets executed, and this actioni returns control to the parser.
• The search for a lexeme is repeated in order to return all the tokens in the source string. The lexical analyzer ignores white spaces and comments in this process.
Let us learn some LEX programming with the help of some examples.

%{
%}
%%
"Rama" |
"Seeta" |
"Neeta"    printf("\n Noun");
"sings" |
"dances" |
"eats"     printf("\n Verb");
%%
main()
{
    yylex();
}
int yywrap()
{
    return 1;
}

This is a simple program that recognizes nouns and verbs from the string. There are 3 sections in the above program.
• The section starting and ending with %{ and %} respectively is the definition section.
• The section starting with %% is called the rule section; this section is closed by %%. Within the %% markers are the regular expressions and actions. Rule 1 gives the definition of a noun and the second rule gives the definition of a verb.
• The third section consists of two functions: the main function and the yywrap function. In the main function, a call to the yylex routine is given. This function is defined in the lex.yy.c program.

First we will compile our above program (x.l) using the lex compiler, and the lex compiler will generate a ready C program named lex.yy.c. This lex.yy.c makes use of the regular expressions and corresponding actions defined in x.l. Hence our above program x.l is called the lex specification file. When we compile the lex.yy.c file using the command cc (here cc means "compile C"), we get an output file named a.out (this is the default output file on the LINUX platform). On execution of a.out we can give the input string. The following commands are used to run the lex program x.l:

$ lex x.l        → This command generates lex.yy.c
$ cc lex.yy.c    → This command compiles lex.yy.c (sometimes gcc can be used)
$ ./a.out        → This command runs the executable file

After entering these commands, a prompt for entering input becomes available. There we can give some valid input:

Rama eats
 Noun
 Verb
Seeta sings
 Noun
 Verb

Then press either control + c or control + d to come out of the output.

vi Editor commands

While writing a LEX program on LINUX, the vi editor is popularly used.
Here are some commonly used vi commands.

• To start the vi editor: at the command prompt we can give the following command
$ vi filename
For example: $ vi x.l. Then the file x.l will be created and opened in the vi editor.

There are two modes in which the vi editor works: insert mode and command mode. To write something in the file we should turn insert mode on. For that we press
ESC i
Now we can type the desired text in the file.

• To exit vi and save changes:
ESC ZZ    or    ESC :wq
• To exit vi without saving changes:
ESC :q!

The following commands are used to handle the text in the file.

Command    Purpose
r          Replaces character under cursor with the next character typed.
R          Keeps replacing characters until ESC is pressed.
a          Append after cursor.
A          Append at end of line.
x          Deletes character under cursor.
dd         Deletes line under cursor.
dw         Deletes word under cursor.
db         Deletes word before cursor.
yy         Copies line, which may be put back by the 'p' command.
P          Puts back before cursor.
p          Puts back after cursor.

• To move to a particular line number, give the following command:
ESC :#
Here # is the desired line number.

Example 2.23 : Write a LEX program for recognizing tokens such as identifiers and keywords. [Here we have given the program name as lexprog.l]

Solution :
%{
%}
identifier [a-zA-Z][a-zA-Z0-9]*
%%
 /* For recognizing preprocessor directives */
#.*    {printf("\n%s is a PREPROCESSOR DIRECTIVE", yytext);}
 /* For recognizing keywords */
int|
float|
char|
double|
while|
for|
do|
break|
continue|
void|
switch|
case|
long|
struct|
const|
typedef|
return|
else|
goto    {printf("\n\t%s is a KEYWORD", yytext);}
if[ \t]*\(    {printf("\n\tif is a KEYWORD\n\t(");}
 /* For recognizing single line comments */
\/\/.*    {printf("\n\n\t%s is a COMMENT\n", yytext);}
 /* Function call or definition */
{identifier}\(    {printf("\n\n FUNCTION CALL/DEFINITION\n\t%s", yytext);}
 /* Block start and end */
\{    {printf("\n BLOCK BEGINS");}
\}    {printf("\n BLOCK ENDS");}
 /* For variables */
{identifier}(\[[0-9]*\])*    {printf("\n\t%s is an IDENTIFIER", yytext);}
 /* For a string */
\".*\"    {printf("\n\t%s is a STRING", yytext);}
 /* For recognizing constants */
[0-9]+    {printf("\n\t%s is a NUMBER", yytext);}
\)(\;)?    {printf("\n\t");ECHO;printf("\n");}
\(    ECHO;
 /* For operators */
=    {printf("\n\t%s is an ASSIGNMENT OPERATOR", yytext);}
<=|>=|<|>|==    {printf("\n\t%s is a RELATIONAL OPERATOR", yytext);}
.|\n    ;
%%   /* End of rules section */
int main(int argc, char **argv)
{
    if(argc > 1)
    {
        FILE *file;
        file = fopen(argv[1], "r");
        if(!file)

Finite state machines can be built for the regular expressions describing certain languages. The design of a lexical analyzer involves three tasks:
(i) Representation of tokens by regular expressions.
(ii) Representation of regular expressions by finite state machines.
(iii) Simulation of the behaviour of the finite state machines.
These three steps lead to the design of the lexical analyzer. The LEX compiler constructs a transition table for a finite state machine from the regular expression patterns in the LEX specification. The lexical analyzer itself consists of a finite automaton simulator that uses the transition table.
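Such a table-driven simulator is tiny. A Python sketch of a DFA simulator and of its nondeterministic counterpart, which must carry a whole set of states along; both machines here are illustrative (they accept strings over {a, b} ending in "ab"), not ones taken from the book's figures:

```python
# DFA table: (state, symbol) -> one next state. States: 0 = no progress,
# 1 = just saw 'a', 2 = just saw "ab" (accepting).
DFA = {(0, "a"): 1, (0, "b"): 0, (1, "a"): 1, (1, "b"): 2,
       (2, "a"): 1, (2, "b"): 0}
# NFA table: (state, symbol) -> a SET of next states.
NFA = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}}
FINALS = {2}

def dfa_accepts(string):
    state = 0
    for sym in string:
        state = DFA[(state, sym)]        # exactly one successor
    return state in FINALS

def nfa_accepts(string):
    current = {0}
    for sym in string:                   # follow every possible successor
        current = set().union(*(NFA.get((q, sym), set()) for q in current))
    return bool(current & FINALS)
```

On any input over {a, b} the two simulators agree, which is the point of the NFA-to-DFA constructions developed in the following pages.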
This transition table basically looks for the regular expression patterns in the input string stored in the input buffer. The finite automata can be deterministic or non-deterministic in nature. A finite automaton is said to be deterministic if for each state S and every input a ∈ Σ there is at most one edge labelled a leaving S. For example:

Fig. 2.29 DFA

A non-deterministic finite automaton (NFA) is a set of states, including a start state and one or more final states, in which for a single input there can be multiple transitions. For example:

Fig. 2.30 NFA

The r = a* b can be drawn as:

Fig. 2.38

Example 2.25 : Build an NFA for the regular expression r = (a + b)* ab.

Solution : As r = (a + b)* ab, we will first build the NFA for (a + b)*.

Fig. 2.39

The NFA can be drawn as follows.

Fig. 2.42

Fig. 2.43

Solution : We will first obtain the ε-closure of every state. The ε-closure of a state is the set of states reachable from it by ε-transitions alone. Hence

ε-closure(0) = {0}
ε-closure(1) = {1}
ε-closure(2) = {2, 3, 4, 6, 9}
ε-closure(3) = {3, 4, 6}
ε-closure(4) = {4}
ε-closure(5) = {3, 4, 5, 6, 8, 9}
ε-closure(6) = {6}
ε-closure(7) = {3, 4, 6, 7, 8, 9}
ε-closure(8) = {3, 4, 6, 8, 9}
ε-closure(9) = {9}

Now we will obtain the δ' transitions for each state on each input symbol.

δ'(0, a) = ε-closure{δ(δ'(0, ε), a)} = ε-closure{δ(0, a)} = ε-closure{1} = {1}
δ'(0, b) = ε-closure{δ(δ'(0, ε), b)} = ε-closure{δ(0, b)} = ∅
δ'(1, a) = ε-closure{δ(1, a)} = ∅
δ'(1, b) = ε-closure{δ(1, b)} = ε-closure{2} = {2, 3, 4, 6, 9}
δ'(2, a) = ε-closure{δ(δ'(2, ε), a)}
         = ε-closure{δ({2, 3, 4, 6, 9}, a)}
         = ε-closure{δ(2, a) ∪ δ(3, a) ∪ δ(4, a) ∪ δ(6, a) ∪ δ(9, a)}
         = ε-closure{5} = {3, 4, 5, 6, 8, 9}
δ'(2, b) = ε-closure{δ(2, b) ∪ δ(3, b) ∪ δ(4, b) ∪ δ(6, b) ∪ δ(9, b)}
         = ε-closure{7} = {3, 4, 6, 7, 8, 9}
δ'(3, a) = ε-closure{δ({3, 4, 6}, a)}
         = ε-closure{δ(3, a) ∪ δ(4, a) ∪ δ(6, a)}
         = ε-closure{5} = {3, 4, 5, 6, 8, 9}
δ'(3, b) = ε-closure{δ(3, b) ∪ δ(4, b) ∪ δ(6, b)}
         = ε-closure{7} = {3, 4, 6, 7, 8, 9}
δ'(4, a) = ε-closure{δ(4, a)} = ε-closure{5} = {3, 4, 5, 6, 8, 9}
δ'(4, b) = ε-closure{δ(4, b)} = ∅
δ'(5, a) = ε-closure{δ({3, 4, 5, 6, 8, 9}, a)}
         = ε-closure{δ(3, a) ∪ δ(4, a) ∪ δ(5, a) ∪ δ(6, a) ∪ δ(8, a) ∪ δ(9, a)}
         = ε-closure{5} = {3, 4, 5, 6, 8, 9}
δ'(5, b) = ε-closure{δ(3, b) ∪ δ(4, b) ∪ δ(5, b) ∪ δ(6, b) ∪ δ(8, b) ∪ δ(9, b)}
         = ε-closure{7}
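The ε-closure sets used throughout this computation can be generated mechanically by a small worklist routine. The ε-edge table below is our reconstruction, inferred from the closure sets listed in the text (an assumption, since Fig. 2.43 itself is not reproduced here); the routine reproduces those sets:

```python
def eps_closure(states, eps_moves):
    """Return all states reachable from `states` via ε-transitions alone."""
    stack, closure = list(states), set(states)
    while stack:
        q = stack.pop()
        for r in eps_moves.get(q, ()):
            if r not in closure:
                closure.add(r)
                stack.append(r)
    return closure

# Reconstructed ε-edges: 2→{3,9}, 3→{4,6}, 5→{8}, 7→{8}, 8→{3,9}
EPS = {2: {3, 9}, 3: {4, 6}, 5: {8}, 7: {8}, 8: {3, 9}}
```

With this table, eps_closure({2}, EPS) yields {2, 3, 4, 6, 9} and eps_closure({5}, EPS) yields {3, 4, 5, 6, 8, 9}, matching the sets above.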
         = ε-closure{7} = {3, 4, 6, 7, 8, 9}
δ'(9, a) = ε-closure{δ(9, a)} = ∅
δ'(9, b) = ε-closure{δ(9, b)} = ∅

The transition table for the δ' function, which represents the equivalent NFA without ε, is as given below:

State     a                   b
→ {0}     {1}                 ∅
  {1}     ∅                   {2, 3, 4, 6, 9}
  {2}     {3, 4, 5, 6, 8, 9}  {3, 4, 6, 7, 8, 9}
  {3}     {3, 4, 5, 6, 8, 9}  {3, 4, 6, 7, 8, 9}
  {4}     {3, 4, 5, 6, 8, 9}  ∅
  {5}     {3, 4, 5, 6, 8, 9}  {3, 4, 6, 7, 8, 9}
  {6}     ∅                   {3, 4, 6, 7, 8, 9}
  {7}     {3, 4, 5, 6, 8, 9}  {3, 4, 6, 7, 8, 9}

Fig. 2.47 (c) Pattern 3

The NFA can be drawn as a combination of the pattern 1, pattern 2 and pattern 3 transition diagrams.

Fig. 2.47 (d) NFA with patterns for matching of input string

To summarize, one can say that LEX tries to match the input string against the patterns one by one, and it tries to match the longest lexeme.

2.9.3 DFA for Lexical Analyzers

In this method the DFA is constructed from the NFAs drawn from the patterns of the regular expressions. Then we try to match the input string with each of the DFAs. The accepting states are recorded and we go on moving along the transitions until the input string terminates. If there are two patterns matching the input string, then the pattern which is mentioned first in the LEX specification file is always considered. For example, consider a LEX specification file whose rule section contains the patterns 0 and 011.
aa You have either reached a page that is unavailable for viewing or reached your viewing limit for this book. Principles of Compiler Design 2-52 Lexical Analysis With these new names the DFA will be as follows - Fig. 2.49 An equivalent DFA ‘um Example 2.30 : Convert the given NFA to DFA. Input o 1 ~ ao {40.4} a | Solution : As we know, first we will compute 5 function. 5({4o}, 9) = {do ai} Hence 8’((q0]-9) = [do a1] Similarly, 8(fa0}1) = {ao} Hence 8°({qo]-1) = [do] Thus we have got a new state [qo,q;]- Let us check how it behaves on input 0 and 1. So, 840-419) = 5090] 4 5 (C41 )-0) {qorai} ¥ {42} = {40-41-42} aa You have either reached a page that is unavailable for viewing or reached your viewing limit for this book. aa You have either reached a page that is unavailable for viewing or reached your viewing limit for this book. aa You have either reached a page that is unavailable for viewing or reached your viewing limit for this book. Principles of Compiler Design 2-56 Lexical Analysis ump Example 2.32: Convert the given NEA to its equivalent DFA. Fig. 2.51 Solution : We will first design a transition table from given transition diagram. “ input! 0 1 ae (40-9) = {40-41} (gol) = fora ” weed (gob 1) = faor93), As we have got a new state [q g, q; }we will compute & transitions for input 0 and 1. (ldo. 91}9) = 5(q0.0)V8(q1, 0) = {doa} VU {a2} (40-411) = fa ¥(Go.a1-1) = 8(4o, NUS, 1) = {90-43} 4. O} = {40-43} ¥(qo-41} 1) = [40/43] aa You have either reached a page that is unavailable for viewing or reached your viewing limit for this book. aa You have either reached a page that is unavailable for viewing or reached your viewing limit for this book. aa You have either reached a page that is unavailable for viewing or reached your viewing limit for this book. Principles of Compiler Design 2-60 Lexical Analysis Step 3: The states obtained [p,,p2,P3,--- Pn] © Qp- The states containing final state in p, is a final state in DFA. 
Now let us see some examples of conversion based on this procedure.

Example 2.33 : Convert the following NFA with ε to an equivalent DFA.

Fig. 2.54

Solution : To convert this NFA we will first find the ε-closures:

ε-closure{q0} = {q0, q1, q2}
ε-closure{q1} = {q1, q2}
ε-closure{q2} = {q2}

Let us start from the ε-closure of the start state:
ε-closure{q0} = {q0, q1, q2}; we will call this state A.

Now let us find the transitions from A on every input symbol.

δ'(A, a) = ε-closure(δ(A, a))
         = ε-closure(δ({q0, q1, q2}, a))
         = ε-closure{δ(q0, a) ∪ δ(q1, a) ∪ δ(q2, a)}
         = ε-closure{q1} = {q1, q2}. Let us call it state B.
δ'(A, b) = ε-closure(δ(A, b))
         = ε-closure{δ(q0, b) ∪ δ(q1, b) ∪ δ(q2, b)}
         = ε-closure{q0} = {q0, q1, q2}, i.e. A.

Hence we can state that δ(A, a) = B

Hence the DFA is

Fig. 2.69

As A = {q0, q1, q2}, in which the final state q2 lies, A is a final state. In B = {q1, q2} the state q2 lies, hence B is also a final state. In C = {q2} the state q2 lies, hence C is also a final state.

Example 2.35 : Convert the given NFA with ε to its equivalent DFA.

Solution :

ε-closure{q0} = {q0, q1, q2}
ε-closure{q1} = {q1}
ε-closure{q2} = {q2}
ε-closure{q3} = {q3}
ε-closure{q4} = {q4}

Now, let ε-closure{q0} = {q0, q1, q2} be state A. Hence

δ'(A, 0) = ε-closure{δ({q0, q1, q2}, 0)}
         = ε-closure{δ(q0, 0) ∪ δ(q1, 0) ∪ δ(q2, 0)}
         = ε-closure{q3}
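The whole conversion procedure amounts to the subset construction. Below is a compact Python sketch; the tiny ε-NFA used in the demonstration mirrors the shape of Example 2.33 (A = {q0, q1, q2}, δ'(A, a) = B = {q1, q2}), but its exact transitions are our own illustrative choice, not the book's Fig. 2.54:

```python
def subset_construction(start, symbols, move, eps):
    """NFA with ε-moves -> DFA whose states are frozensets of NFA states."""
    def closure(S):
        stack, out = list(S), set(S)
        while stack:
            q = stack.pop()
            for r in eps.get(q, ()):
                if r not in out:
                    out.add(r)
                    stack.append(r)
        return frozenset(out)

    start_set = closure({start})
    dfa, todo = {}, [start_set]
    while todo:
        S = todo.pop()
        if S in dfa:
            continue
        dfa[S] = {}
        for a in symbols:
            T = closure(set().union(*(move.get((q, a), set()) for q in S)))
            dfa[S][a] = T            # empty frozenset plays the dead state
            todo.append(T)
    return start_set, dfa

# Illustrative ε-NFA: 0 -ε-> 1 -ε-> 2, plus 0 -a-> 1 and 2 -b-> 0.
EPS_EDGES = {0: {1}, 1: {2}}
MOVES = {(0, "a"): {1}, (2, "b"): {0}}
A, dfa = subset_construction(0, "ab", MOVES, EPS_EDGES)
```

Here A comes out as {0, 1, 2}, its a-successor is {1, 2}, and the b-transition leads back to A, the same shape as the worked example.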
aa You have either reached a page that is unavailable for viewing or reached your viewing limit for this book. Principles of Compiler Design 2-68 Lexical Analysis Solved Exercise Q1 Discuss the role of finite automata in compiler. Ans. : The finite automata are used as the mathematical model that can be used to recognize the regular expressions. The regular expressions are built to match the pattern of the lexeme to identify the tokens. In order to recognize tokens the regular expressions can be created. These regular expressions then can be converted into the equivalent NFA. The NFA then can be converted to DFA. Input string is read character by character and lexical analysis is done using the finite automata so that valid tokens can be generated. Input buffer Lexeme Finite automata DFA |generated| Patterns Pattern matching algorithm Tokens Q2 Write a LEX rule for the following “an identifier that starts and end with digit and maximum length of 20 characters”. Ans. : [0-9] [A-Za-Z] (18) [0-9] Q3 State what strategy LEX should adopt if keywords are not reserved words. Ans. : The keywords are basically the reserved words that are associated with some special meaning. If keywords would not be the reserved words then in LEX regular expressions for these keywords along with the appropriate actions should be introduced. The corresponding, actions for these keywords should be such that the meaning associated with these keywords should remain consistent. Secondly the symbol table should be updated by entering these keywords along with appropriate meaning of them. This makes the keyword to take the appropriate meaning of them. Compiler has to make the consistency between the keywords for any application program, Q4 — Write short note on input buffer with lexical analyzer. Ans.: Refer section 2.3. aa You have either reached a page that is unavailable for viewing or reached your viewing limit for this book. 
Principles of Compiler Design 2-70 Lexical Analysis yyin = file; }//end if yylex (); print£("\n"); return 0; }//end main Q.6 Write a LEX program for calculating number of lines, number of characters and number of words from the input string Ans. : ¥f int Char_Cnt = 0, Word Cnt = 0,Line Cnt = 0; $} word [* \t\n]+ se {word} {Word_Cnt++;Char_Cnt+=yyleng;} \n ({Char_Cnt++;Line_Cnt++;} Char Ent++; 8% main(int argc, char **argv) { if(arge > 1) { FILE *file; file = fopen(argv[1],"0"); if (file) { fprintf(stderr, "could not open $s\n",argv({1]); exit (1); ) yyin = file;/standard input file*/ } yylex (); print£("td td %d \n”,Char_Cnt,Word_Cnt, Line Cnt); return 0; ) int yywrap() t return 1; } Principles of Compiler Design 2-71 Q7 Write a LEX program for checking well formedness of parenthesis. Ans. /*This is a LEX program for Well formedness of brackets*/ finclude #include char temp(10]; int i=0,of=0,cf=0; extern FILE *yyin; %) at MICAS strepy (temp, yytext) 7 printf ("&s\n”, temp) : i=0; while (temp[i]! x if (temp [i] oftt: if (temp[i] cftt; else ; ites if (of==cf) printf (“wellformed!\n") ; else print£(“not\n"); a main(int argc,char *argv[]] ‘ FILE *fp=fopen(argv(1},"r"); yyin=£p; yylex(); fclose (fp); Principles of Compiler Design 2-72 Lexical Analysis Qs Ans. : Describe precisely and concisely the languages denoted by following regular expressions. i) (Ca) (a}b)*)* aa (ab) ii) (a b)" a (ab) (a) (@|b) iti) (bl ab)* (a}b)* i) This is a language contai ing a string of substring "aa". ii) The languge in which strings containing ‘a' as 4!" symbol from right end. iil) This is a language which does not contain substring "aa". Build a DEA containing the language {L=a"|n!= 4]. The DFA which you create must not have more than six states. The input set © = {a}. : The required DFA can be - +-©-O--O-O-@ Construct a DFA containing the language in which strings are m4 mod 3 = 0 and my mod 3 = 1 [ng means total number of a’s}. 
The required DFA is - aa You have either reached a page that is unavailable for viewing or reached your viewing limit for this book. aa You have either reached a page that is unavailable for viewing or reached your viewing limit for this book. aa You have either reached a page that is unavailable for viewing or reached your viewing limit for this book. Principles of Compiler Design 2-76 Lexical Analysis Ans. POR BI IES S JIBS OILS IOS ISO TOS HOCH EH ER ESSE ESE iis This is a LEX program which contains the regular expressions for i) message printing printf statement and ii) result printing printf statement Program name: = pratf.! peeesews peeeettar Seat tece tester i tertecersteettceer st ter ere tracery) a #include #include finclude’y.tab.h” extern char yylval[30]; a) a printé”("{ \t]*\"fa-zA-2}4\"") "7" { strepy(yylval, yytext); return PRINTF; } printé”("[ \t]*\"[a-zA-2] *”/8d"* | "&s"*\"\, [a-zA-Z] #7) 75" {strepy (yylval, yytext); return OUTPUT;)} \nl. ory POSS O ROSSI ISH OB OIE UBISC SOMO IC HD EO iia This is a YACC program which contains the Context Free Grammar for iii) message printing printf statement and iv) result printing printf statement Program name: — prntf.y TUBE SBE ESO HOHE SDS E SO HEEL SH EID ook ian / af #include extern FILE *yyin; %) ‘token PRINTF OUTPUT ue Stmt : Expri|Expr2 Exprl:PRINTF {printf(*\n Valid printf statement”) ;} Exprl:OUTPUT {print£("\n Valid Output statement”) ;} a main(int arge, char ‘argv{]) ‘ FILE *fp=fopen(argv(1],"r"); /*argv{1] -test.c the input program*/ /*The input file is opened for reading purpose*/ yyin=fp; /* yyin is a standard file pointer for the input file to lex program*/ aa You have either reached a page that is unavailable for viewing or reached your viewing limit for this book. aa You have either reached a page that is unavailable for viewing or reached your viewing limit for this book. aa You have either reached a page that is unavailable for viewing or reached your viewing limit for this book. 
Principles of Compiler Design 2-80 Lexical Analysis Q.17 Draw the transition diagram for unsigned numbers. Ans. ; The transition diagram for unsigned integer is - Digit The transition diagram for unsigned real number is - Return (num) Digit Digit Return (num) Q.18 Construct the NFA from the (a/b)* a(a/t) using Thompson's construction algorithm. Ans. : Let us assume RL = (a+b)* R2 =a Then given equation R = Ri R2 Rl ‘The NFA for R1 is aa You have either reached a page that is unavailable for viewing or reached your viewing limit for this book. aa You have either reached a page that is unavailable for viewing or reached your viewing limit for this book. aa You have either reached a page that is unavailable for viewing or reached your viewing limit for this book. Principles of Compiler Design 3-2 Syntax Analysis The parse tree drawn above is for some programming statement. It shows how the statement gets parsed according to their syntactic specification. 3.1.1 Basic Issues in Parsing + There are two important issues in parsing : i) Specification of syntax ii) Representation of input after parsing. * A very important issue in parsing is specification of syntax in programming language. Specification of syntax means how to write any programming statement. There are certain characteristic of specification of syntax - i) This specification should be precise and unambiguous. ii) This specification should be in detail, ie. it should cover all the details of the programming language. iii) This specification should be complete. Such a specification is called “Context Free Grammar”. * Another important issue in parsing is representation of the input after parsing. This is important because all the subsequent phases of compiler take the information from the parse tree being generated. This is important because the information suggested by any input programming statement should not be differed after building the syntax tree for it. 
© Lastly the most crucial issue is the parsing algorithm based on which we get the parse tree for the given input. We will discuss different approaches to parsing: Top-down and bottom-up. And we will study parsing algorithms concerning to these approaches. We are mainly interested in following issues- © How these algorithms work? * Are they efficient in nature? * What are their merits and limitations? Kind of input they require. Keep one thing in mind that we are now interested in simply syntax of the language we are not interested in meaning right now. Hence we will do only syntax analysis at this phase. Checking of the meaning (semantic) from syntactically correct input will be studied in the next phase of compilation. aa You have either reached a page that is unavailable for viewing or reached your viewing limit for this book. aa You have either reached a page that is unavailable for viewing or reached your viewing limit for this book. aa You have either reached a page that is unavailable for viewing or reached your viewing limit for this book. Principtes of Compiler Design 3-6 Syntax Analysis iii) If there exists a vertex A with children R,, Rp, ... Ry then there should be| production A > Ry Ry Ry. iv) The leaf nodes are from set T and interior nodes are from set V. Instead of choosing the arbitrary non-terminal one can choose. i) Either leftmost non-terminal in a sentential form then it is called leftmost derivation. ii) Or rightmost non-terminal in a sentential form, then it is called rightmost derivation. a> Example 3.2: Let G be a context free grammar for which the production rules are given as below. S—aB | bA A alaS|bAA B > b|bS |aBB Derive the string aaabbabbba using above grammar. Solution : Leftmost Derivation S 7 aB aaBB aaaBBB aaabSBB aaabbABB aaabbaBB aaabbabB aaabbabbS aaabbabbbA. 
⇒ aaabbabbba

Rightmost Derivation
S ⇒ aB
⇒ aaBB
⇒ aaBbS
⇒ aaBbbA
⇒ aaBbba
⇒ aaaBBbba
⇒ aaaBbbba
⇒ aaabSbbba
⇒ aaabbAbbba
⇒ aaabbabbba

Principles of Compiler Design 3-10 Syntax Analysis

Solution : Leftmost derivation
E ⇒ E + E
⇒ E + E + E
⇒ a + E + E
⇒ a + E * E + E
⇒ a + b * E + E
⇒ a + b * a + E
⇒ a + b * a + b

Rightmost derivation
E ⇒ E + E
⇒ E + b
⇒ E * E + b
⇒ E * a + b
⇒ E + E * a + b
⇒ E + b * a + b
⇒ a + b * a + b

(Parse trees for the two derivations are shown in the accompanying figures.)

3.3 Ambiguous Grammar

A grammar G is said to be ambiguous if it generates more than one parse tree for some sentence of the language L(G).
For example : E → E + E | E * E | id
Then for id + id * id :
(a) Parse tree 1   (b) Parse tree 2
Fig. 3.6 Ambiguous grammar

Example 3.5 : Find whether the following grammar is ambiguous or not.
S → AB | aaB
A → a | Aa
B → b

Principles of Compiler Design 3-14 Syntax Analysis

input string. After trying all the productions, if we find every production unsuitable for the string match, then the parse tree cannot be built.

3.5.1 Problems with Top-Down Parsing

There are certain problems in top-down parsing. In order to implement the parsing we need to eliminate these problems. Let us discuss these problems and how to remove them.

1) Backtracking
Backtracking is a technique in which, for the expansion of a non-terminal symbol, we choose one alternative, and if some mismatch occurs then we try another alternative, as shown in Fig.
3.9.

If for a non-terminal there are multiple production rules beginning with the same input symbol, then to get the correct derivation we need to try all these alternatives. Secondly, in backtracking we need to move some levels upward in order to check other possibilities. This adds a lot of overhead to the implementation of parsing, and hence it becomes necessary to eliminate backtracking by modifying the grammar.
For example :
S → xPz
P → yw | y
Then the parser may first try P → yw, find a mismatch, and backtrack to try P → y.

2) Left recursion
A left recursive grammar is a grammar containing a derivation of the form
A ⇒+ Aα
Here ⇒+ means deriving in one or more steps, A is a non-terminal and α denotes some string of grammar symbols. If left recursion is present in the grammar then it creates serious problems : because of left recursion the top-down parser can enter an infinite loop. This is shown in Fig. 3.10.

Principles of Compiler Design 3-18 Syntax Analysis

is an ambiguous grammar. We will draw the parse trees for id + id + id as follows.
(a) Parse tree 1   (b) Parse tree 2
Fig. 3.12 Ambiguous grammar

For removing the ambiguity we will apply one rule : if the grammar has a left associative operator (such as +, -, *, /) then induce left recursion, and if the grammar has a right associative operator (such as the exponential operator) then induce right recursion. The unambiguous grammar is
E → E + T
E → T
T → T * F
T → F
F → id
Note one thing : this grammar is unambiguous, but it is left recursive, and elimination of such left recursion is again a must.

(Figure : top-down parsers are classified into backtracking parsers and predictive parsers; predictive parsers are further classified into recursive descent and LL(1) parsers.)
Fig.
3.13 Types of top-down parsers

There are two types by which top-down parsing can be performed :
1. Backtracking
2. Predictive parsing
A backtracking parser will try different production rules to find the match for the input string by backtracking each time. Backtracking is more powerful than predictive parsing, but this technique is slower and it requires exponential time in general. Hence backtracking is not preferred for practical compilers.
As the name suggests, a predictive parser tries to predict the next construction using one or more lookahead symbols from the input string. There are two types of predictive parsers :
1. Recursive descent
2. LL(1) parser
Let us discuss these types along with some examples.

3.5.2 Recursive Descent Parser

A parser that uses a collection of recursive procedures for parsing the given input string is called a Recursive Descent (RD) parser. In this type of parser the CFG is used to build the recursive routines. The RHS of each production rule is directly converted to program code. For each non-terminal a separate procedure is written, and the body of the procedure (code) is the RHS of the corresponding non-terminal.

Basic steps for construction of an RD parser
The RHS of the rule is directly converted into program code, symbol by symbol.
1. If the symbol is a non-terminal then a call to the procedure corresponding to that non-terminal is made.
2. If the symbol is a terminal then it is matched with the lookahead from the input. The lookahead pointer has to be advanced on matching of the input symbol.
3. If the production rule has many alternates then all these alternates have to be combined into a single body of the procedure.
4. The parser should be activated by a procedure corresponding to the start symbol.
Let us take one example to understand the construction of an RD parser. Consider the grammar having start symbol E.
E → num T
T → * num T | ε

procedure E
{
    if lookahead = num then
    {
        match(num);
        T();                      /* call to procedure T */
    }
    else error;
    if lookahead = $
    {
        declare success;          /* return on success */
    }
    else error;
}   /* end of procedure E */

procedure T
{
    if lookahead = '*'
    {
        match('*');
        if lookahead = num
        {
            match(num);
            T();
        }                         /* inner if closed */
        else error;
    }                             /* outer if closed */
    else NULL;                    /* indicates ε; here the other alternate is
                                     combined into the same procedure T */
}   /* end of procedure T */

procedure match(token t)
{
    if lookahead = t
        lookahead = next_token;   /* lookahead pointer is advanced */
    else error;
}   /* end of procedure match */

procedure error
{
    print("Error!!!");
}   /* end of procedure error */

Fig. 3.14 Pseudo code for recursive descent parser

The parser constructed above is a recursive descent parser for the given grammar. We can see from the above code that a procedure is written for each non-terminal and the procedure body is simply a mapping of the RHS of the corresponding rule.

Principles of Compiler Design 3-22 Syntax Analysis

The simple block diagram for the LL(1) parser is as given below.
(Figure : input buffer and stack feed the parsing program, which consults the parsing table)
Fig. 3.15 Model for LL(1) parser

The data structures used by LL(1) are i) input buffer ii) stack iii) parsing table. The LL(1) parser uses the input buffer to store the input tokens. The stack is used to hold the left sentential form. The symbols on the RHS of a rule are pushed onto the stack in reverse order, i.e. from right to left. Thus the use of a stack makes this algorithm non-recursive. The table is basically a two dimensional array, with a row for each non-terminal and a column for each terminal. A table entry is written M[A, a], where A is a non-terminal and a is the current input symbol.
The parser works as follows - The parsing program reads the top of the stack and the current input symbol.
With the help of these two symbols the parsing action is determined. The parsing actions can be :

Top of stack | Input token | Parsing action
$            | $           | Parsing successful, halt.
a            | a           | Pop a and advance lookahead to next token.
a            | b           | Error.
A            | a           | Refer to table M[A, a]; if the entry at M[A, a] is error, report error.
A            | a           | Refer to table M[A, a]; if the entry at M[A, a] is A → PQR, then pop A, then push R, then push Q, then push P.

The parser consults the table M[A, a] each time while taking the parsing actions, hence this type of parsing method is called a table driven parsing algorithm. The configuration of the LL(1) parser can be defined by the top of the stack and the lookahead token. The configurations are performed one by one, and the input is successfully parsed if the parser reaches the accepting configuration.

Principles of Compiler Design 3-26 Syntax Analysis

We can observe in the given grammar that + and ) really follow T.

FOLLOW(T') -
Consider T → F T'. We will map this rule with A →
αBβ, taking A = T, α = F, B = T', β = ε. Then
FOLLOW(T') = FOLLOW(T) = {+, ), $}
Next consider T' → * F T'. We map this rule with A → αBβ, taking A = T', α = * F, B = T', β = ε, giving
FOLLOW(T') = FOLLOW(T') 
Hence FOLLOW(T') = {+, ), $}

FOLLOW(F) -
Consider T → F T' and T' → * F T'. By computational rule 2 :
T → F T'                              T' → * F T'
A → αBβ                               A → αBβ
A = T, α = ε, B = F, β = T'           A = T', α = *, B = F, β = T'
FOLLOW(B) ⊇ FIRST(β) − {ε}            FOLLOW(B) ⊇ FIRST(β) − {ε}
FOLLOW(F) ⊇ FIRST(T') − {ε}           FOLLOW(F) ⊇ FIRST(T') − {ε}
FOLLOW(F) ⊇ {*}                       FOLLOW(F) ⊇ {*}

Consider T' → * F T' by computational rule 3 (since ε ∈ FIRST(T')) :
A → αBβ with A = T', B = F, β = T'
FOLLOW(B) ⊇ FOLLOW(A)
FOLLOW(F) ⊇ FOLLOW(T') = {+, ), $}
Finally FOLLOW(F) = {*} ∪ {+, ), $}
FOLLOW(F) = {*, +, ), $}
To summarize the above computation :

Syntax Analysis

Parse Tree
Step 1 : We will start from the leaf nodes.
Step 2 : Read float from the input.
Step 3 : Reduce float to T using T → float.
Step 4 : Read the next token from the input.
Step 5 : Reduce id to L using L → id.
Step 6 : Read the next token from the input.
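The FOLLOW computation carried out by hand above can be mechanized. The sketch below (the grammar encoding and function names are my own, not from the book) iterates the computational rules to a fixed point for the same expression grammar E → TE', E' → +TE' | ε, T → FT', T' → *FT' | ε, F → (E) | id.

```python
# Fixed-point computation of FIRST and FOLLOW for the expression grammar.
# The empty string "" plays the role of ε; [] encodes an ε-production.
GRAMMAR = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], []],
    "F":  [["(", "E", ")"], ["id"]],
}
NONTERMS = set(GRAMMAR)

def first_of_seq(seq, first):
    """FIRST of a sequence of grammar symbols."""
    out, nullable = set(), True
    for sym in seq:
        if sym in NONTERMS:
            out |= first[sym] - {""}
            if "" not in first[sym]:
                nullable = False
                break
        else:
            out.add(sym)
            nullable = False
            break
    if nullable:
        out.add("")
    return out

def first_follow():
    first = {a: set() for a in NONTERMS}
    follow = {a: set() for a in NONTERMS}
    follow["E"].add("$")                       # rule 1: $ follows the start symbol
    changed = True
    while changed:
        changed = False
        for a, prods in GRAMMAR.items():
            for prod in prods:
                f = first_of_seq(prod, first)
                if not f <= first[a]:
                    first[a] |= f; changed = True
                for i, sym in enumerate(prod):
                    if sym not in NONTERMS:
                        continue
                    rest = first_of_seq(prod[i + 1:], first)
                    # rule 2: FIRST of what follows; rule 3: FOLLOW(lhs) if nullable
                    new = (rest - {""}) | (follow[a] if "" in rest else set())
                    if not new <= follow[sym]:
                        follow[sym] |= new; changed = True
    return first, follow
```

Running it reproduces the sets derived above, FOLLOW(T') = {+, ), $} and FOLLOW(F) = {*, +, ), $}.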
Principles of Compiler Design 3-38 Syntax Analysis

Stack          | Input    | Parsing action
$ E - id2      | * id3 $  | Reduce by E → id
$ E - E        | * id3 $  | Shift
$ E - E *      | id3 $    | Shift
$ E - E * id3  | $        | Reduce by E → id
$ E - E * E    | $        | Reduce by E → E * E
$ E - E        | $        | Reduce by E → E - E
$ E            | $        | Accept

Here we have followed two rules.
1. If the incoming operator has higher priority than the operator in the stack then perform shift.
2. If the operator in the stack has the same or higher priority than the incoming operator then perform reduce.

Example 3.11 : Consider the following grammar
S → T L ;
T → int | float
L → L , id | id
Parse the input string int id, id; using a shift-reduce parser.

Solution :
Stack       | Input buffer    | Parsing action
$           | int id, id ; $  | Shift
$ int       | id, id ; $      | Reduce by T → int
$ T         | id, id ; $      | Shift
$ T id      | , id ; $        | Reduce by L → id
$ T L       | , id ; $        | Shift
$ T L ,     | id ; $          | Shift
$ T L , id  | ; $             | Reduce by L → L , id
$ T L       | ; $             | Shift
$ T L ;     | $               | Reduce by S → T L ;
$ S         | $               | Accept
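The trace of Example 3.11 can be reproduced in a few lines of code. The sketch below (all names are my own) is a naive shift-reduce recognizer that eagerly reduces whenever the top of the stack matches some right-hand side, preferring longer handles. This greedy strategy is not a general parsing algorithm, but it suffices for this small grammar.

```python
# Naive shift-reduce recognizer for the grammar of Example 3.11:
#   S -> T L ;     T -> int | float     L -> L , id | id
RULES = [
    ("S", ["T", "L", ";"]),
    ("T", ["int"]), ("T", ["float"]),
    ("L", ["L", ",", "id"]), ("L", ["id"]),
]

def parse(tokens):
    stack, buf, trace = [], list(tokens) + ["$"], []
    while True:
        # Reduce while some RHS matches the top of the stack (longest first).
        for lhs, rhs in sorted(RULES, key=lambda r: -len(r[1])):
            if stack[-len(rhs):] == rhs:
                stack[:] = stack[:-len(rhs)] + [lhs]
                trace.append(f"reduce by {lhs} -> {' '.join(rhs)}")
                break
        else:
            if buf[0] == "$":
                trace.append("accept" if stack == ["S"] else "error")
                return trace
            stack.append(buf.pop(0))          # otherwise shift
            trace.append(f"shift {stack[-1]}")
```

Calling `parse(["int", "id", ",", "id", ";"])` yields the same sequence of shift and reduce moves as the table above, ending in accept.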
There is one parsing program which is actually a driving program and reads the input Parsing table symbol one at a time from Fig. 3.19 Structure of LR parser the input buffer. OUTPUT The driving program works on following line. 1, It initializes the stack with start symbol and invokes scanner (lexical analyzer) to get next token. 2. It determines 5, the state currently on the top of the stack and a; the current input symbol. 3, It consults the parsing table for the action [s,a, ] which can have one of the four values. i) 5, means shift state i. ii) r, means reduce by rule j. aa You have either reached a page that is unavailable for viewing or reached your viewing limit for this book. aa You have either reached a page that is unavailable for viewing or reached your viewing limit for this book. aa You have either reached a page that is unavailable for viewing or reached your viewing limit for this book. Principles of Compiler Design 3-46 Syntax Analysis 10: S3eas A-vea 12: goto (I0, b) S- be 13: goto (10, S) A+SeA A+eSA Asea Seas Seb 14: goto (10, a) Aras mms Example 3.15: Consider the following grammar : S$ — Aa| bAc| Be| bBa And Bod Coumpute closure (1) and goto (I. Solution : First we will write the grammar using dot operator. Ip 2S eAa S > +bAc S > «Bc S > «bBa Av-d Booed Now we will apply goto on each symbol from state Ig. Each goto (1) will generate new subsequent states. I, : goto (Ip, A) S> Aca Ip : goto (Ip,b) S$ > beAc S— beBa Aved Booed Principles of Compiler Design 3-47 Syntax Analysis Ty: goto (Ip, B) S > Bec Ty: goto (Ip,d) Avds Bods Construction of canonical collection of set of item — 1, For the grammar G initially add S'S in the set of item C. 2. For each set of items 1, in C and for each grammar symbol X (may be terminal or non-terminal) add closure (I,X). This process should be repeated by applying goto(l,,X) for each X in I, such that goto(I,X) is not empty and not in C. 
The construction of sets of items continues until no more sets of items can be added to C. Now we will consider one grammar and construct the sets of items by applying the closure and goto functions.
Example :
E → E + T
E → T
T → T * F
T → F
F → ( E )
F → id
For this grammar we add the augmented production E' → E to I0 and then apply closure(I0).
I0 : E' → • E
E → • E + T
E → • T
T → • T * F
T → • F
F → • ( E )
F → • id

The item I0 is constructed starting from the augmented production E' → • E. Immediately to the right of • is E. Hence we apply closure(I0) and thereby add the E-productions with • at the left end of the rule : E → • E + T and E → • T are added to I0. Again, the rule E → • T which we have added contains the non-terminal T immediately to the right of •, so we have to add the T-productions T → • T * F and T → • F to I0. In the T-productions, T and F respectively come after •. The T-productions are already added, so we only add the F-productions with dots : F → • ( E ) and F → • id. In these two productions ( and id come after the dot; they are terminal symbols and do not derive any rule, hence our closure function terminates here. Since there is no further rule, we stop creating I0.

Now apply goto(I0, E) : shift the dot one position to the right in E' → • E and E → • E + T. Thus I1 becomes,
goto(I0, E)
I1 : E' → E •
E → E • + T
Since in I1 there is no non-terminal after the dot, we cannot apply closure(I1).
By applying goto on T of I0,
goto(I0, T)
I2 : E → T •
T → T • * F
Since in I2 there is no non-terminal after the dot, we cannot apply closure(I2).
By applying goto on F of I0,
goto(I0, F)
I3 : T → F •
Since after the dot in I3 there is nothing, we cannot apply closure(I3).
By applying goto on ( of I0 we get F → ( • E ). But after the dot E comes, hence we will apply closure on E, then on T, then on F.
goto(I0, ()
I4 : F → ( • E )
E → • E + T
E → • T
T → • T * F
T → • F
F → • ( E )
F → • id
By applying goto on id of I0,
goto(I0, id)
I5 : F → id •
Since in I5 there is no non-terminal to the right of the dot, we cannot apply the closure function here. Thus we have completed applying gotos on I0.

We will consider I1 for applying goto. In I1 there are two productions, E' → E • and E → E • + T. There is no point applying goto on E' → E •, hence we will consider E → E • + T.
goto(I1, +)
I6 : E → E + • T
T → • T * F
T → • F
F → • ( E )
F → • id
There is no other rule in I1 for applying goto, so we will move to I2. In I2 there are two productions, E → T • and T → T • * F. We will apply goto on *.
goto(I2, *)
I7 : T → T * • F
F → • ( E )
F → • id
goto cannot be applied on I3.

Apply goto on E in I4. In I4 there are two productions having E after the dot (F → ( • E ) and E → • E + T). Hence we will apply goto on both of these productions. I8 becomes,
goto(I4, E)
I8 : F → ( E • )
E → E • + T
If we apply goto(I4, T) we get E → T • and T → T • * F, which is I2 only; hence we will not create a new state for T. Similarly we will not apply goto on F, ( and id of I4, as we would get the states I3, I4 and I5 again. These gotos are not applied, to avoid repetition.

There is no point applying goto on I5. Hence we now move ahead by applying goto on I6 for T.
goto(I6, T)
I9 : E → E + T •
T → T • * F
Then,
goto(I7, F)
I10 : T → T * F •
Then,
goto(I8, ))
I11 : F → ( E ) •
Applying goto on I9, I10 and I11 produces no new states. Thus there is no item that can be added to the set of items. The collection of sets of items is from I0 to I11.
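The closure and goto computations above are entirely mechanical, so the result is easy to check by machine. The Python sketch below (the item representation and function names are my own) builds the canonical LR(0) collection for the same grammar and finds exactly the twelve states I0 ... I11.

```python
# LR(0) item-set construction (closure and goto) for the grammar:
#   E' -> E,  E -> E + T | T,  T -> T * F | F,  F -> ( E ) | id
# An item is a triple (lhs, rhs, dot) with dot an index into rhs.
PRODS = [
    ("E'", ("E",)),
    ("E", ("E", "+", "T")), ("E", ("T",)),
    ("T", ("T", "*", "F")), ("T", ("F",)),
    ("F", ("(", "E", ")")), ("F", ("id",)),
]
NONTERMS = {"E'", "E", "T", "F"}

def closure(items):
    items = set(items)
    while True:
        # add B -> . gamma for every B just after a dot
        new = {(lhs, rhs, 0)
               for (_, r, d) in items if d < len(r) and r[d] in NONTERMS
               for lhs, rhs in PRODS if lhs == r[d]}
        if new <= items:
            return frozenset(items)
        items |= new

def goto(items, x):
    return closure({(l, r, d + 1) for (l, r, d) in items
                    if d < len(r) and r[d] == x})

def canonical_collection():
    start = closure({("E'", ("E",), 0)})
    states, todo = [start], [start]
    while todo:
        s = todo.pop()
        for x in {r[d] for (_, r, d) in s if d < len(r)}:
            t = goto(s, x)
            if t and t not in states:
                states.append(t); todo.append(t)
    return states
```

`canonical_collection()` returns twelve item sets; for instance goto(I0, id) produces the singleton state {F → id •}, matching I5 above.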
The state Ip is initial state. Note that the DFA recognizes viable prefixes. For example - For the item, 1 goto (Ip, E) ESE Es Ig : goto (1y, +) = Ese teT TEE ©OEe2 © © © Ig :goto (IT) ESE+Te The viable prefixes E, E+ and E+T are recognized here continuing in this fashion. The DFA can be constructed for set of items. Thus DFA helps in recognizing valid viable prefixes that can appear on the top of the stack. Now we will also prepare a FOLLOW set for all the non-terminals because we require FOLLOW set according to rule 2.b of parsing table algorithm. [Note for readers: Below is a manual method of obtaining FOLLOW. We have already discussed the method of obtaining FOLLOW by using rules. This manual method can be used for getting FOLLOW quickly], FOLLOW(F’) = As E’ is a start symbol $ will be placed. = ($} FOLLOWEE) : E’+E_ that means E’ = E = start symbol. .. we will add $. E-E+T the + is following E. -. we will add + F—(B)_ the )is following E. we will add ). «. FOLLOW(€) = {+, ), $) FOLLOW(T) : As E’ > E, E> T the E’ = E = T = start symbol. ..we will add $. E> E+T EoT+T +EOT The + is following T hence we will add +. aa You have either reached a page that is unavailable for viewing or reached your viewing limit for this book. aa You have either reached a page that is unavailable for viewing or reached your viewing limit for this book. aa You have either reached a page that is unavailable for viewing or reached your viewing limit for this book. Principles of Compiler Design 3-58 ‘Syntax Analysis Now we will fill up the goto table for all the non-terminals. In state Ip, goto(ly, E) = ly. Hence goto[0, E] =1, similarly goto(lp T) = I. Hence goto[0, T] = 2 Continuing in this fashion we can fill up the goto entries of SLR parsing table. 
Finally the SLR(1) parsing table will look as -

State |  id    +    *    (    )    $  |  E   T   F
  0   |  s5              s4           |  1   2   3
  1   |        s6                 acc |
  2   |        r2   s7        r2   r2 |
  3   |        r4   r4        r4   r4 |
  4   |  s5              s4           |  8   2   3
  5   |        r6   r6        r6   r6 |
  6   |  s5              s4           |      9   3
  7   |  s5              s4           |          10
  8   |        s6             s11     |
  9   |        r1   s7        r1   r1 |
 10   |        r3   r3        r3   r3 |
 11   |        r5   r5        r5   r5 |

The remaining blank entries in the table are considered as syntactical errors.

Parsing the input using the parsing table -
Now it is time to parse an actual string using the above parsing table. Consider the parsing algorithm.
Input : The input string w that is to be parsed, and the parsing table.
Output : A bottom-up parse of w if w ∈ L(G); if w ∉ L(G), report a syntactical error.
Algorithm :
1. Initially push 0 as the initial state onto the stack and place the input string with $ as end marker on the input tape.
2. If S is the state on top of the stack and a is the symbol from the input buffer pointed to by the lookahead pointer, then
a) If action[S, a] = shift j, then push a, then push j onto the stack, and advance the input lookahead pointer.

Principles of Compiler Design 3-62 Syntax Analysis

Hence we will not consider I2 and I3 again. Now, from I3 the goto transition on * gives
I8 : goto(I3, *)
F → F * •
As there is no point in applying goto on states I4 and I5, we will choose state I6 for the goto transition.
I9 : goto(I6, T)
E → E + T •
T → T • F
F → • F *
F → • a
F → • b

Now we will first obtain FOLLOW of E, T and F, as the FOLLOW computation is required when the SLR parsing table is built.
FOLLOW(E) = {+, $}
FOLLOW(T) = {+, a, b, $}
FOLLOW(F) = {+, *, a, b, $}
As by the rules E → T and T → F we can state E = T = F. But E is the start symbol; then T and F can also act as the start symbol, so we have added $ in FOLLOW(E),
FOLLOW(T) and FOLLOW(F). The SLR parsing table can be constructed as follows.

State |  +    *    a    b    $  |  E   T   F
  0   |            s4   s5      |  1   2   3
  1   |  s6                 acc |
  2   |  r2        s4   s5   r2 |          7
  3   |  r4   s8   r4   r4   r4 |
  4   |  r6   r6   r6   r6   r6 |
  5   |  r7   r7   r7   r7   r7 |
  6   |            s4   s5      |      9   3
  7   |  r3   s8   r3   r3   r3 |
  8   |  r5   r5   r5   r5   r5 |
  9   |  r1        s4   s5   r1 |          7

(The productions are numbered 1 : E → E + T, 2 : E → T, 3 : T → T F, 4 : T → F, 5 : F → F *, 6 : F → a, 7 : F → b.)

Principles of Compiler Design 3-67 Syntax Analysis

If there is an item [X → α • C β, a], then add [C → • γ, b] for each production C → γ and each b ∈ FIRST(βa).
For [S → • C C, $] : β = C and a = $, so b ∈ FIRST(C $) = FIRST(C), and as FIRST(C) = {a, d}, b = a/d. Hence C → • a C with lookahead a/d will be added to I0, and similarly C → • d with lookahead a/d will be added to I0. Hence
I0 : S' → • S, $
S → • C C, $
C → • a C, a/d
C → • d, a/d
Here a/d is used to denote a or d. That means the production C → • a C carries the lookaheads a and d, and similarly for C → • d.

Now apply goto on I0.
I1 : goto(I0, S)
S' → S •, $
Now apply goto on C in I0 : S → C • C, $ is added to I2. As C comes after the dot, we add the rules of C. Here β = ε and a = $, so we add [C → • γ, b] where b ∈ FIRST(ε $) = {$}. Hence C → • a C, $ and C → • d, $ will be added to I2.
I2 : goto(I0, C)
S → C • C, $
C → • a C, $
C → • d, $
Now we will apply goto on a of I0, for the rule C → • a C, a/d. That gives C → a • C, a/d in I3. Since the dot comes before C with β = ε and lookaheads a/d, we add [C → • γ, b] with b ∈ FIRST(ε a) ∪ FIRST(ε d) = {a, d}, i.e. b = a/d.
I3 : goto(I0, a)
C → a • C, a/d
C → • a C, a/d
C → • d, a/d
Now, if we apply goto on d of I0 for the rule C → • d, a/d, we get C → d •, a/d. Hence
I4 : goto(I0, d)
C → d •, a/d
Applying gotos on I0 is now over. We move to I1, but we get no new items from I1; we then apply goto on C in I2.
There is no closure possible in the resulting state.
I5 : goto(I2, C)
S → C C •, $
We will apply goto on a from I2, on the rule C → • a C, $, and we get state I6.
C → a • C, $
Since the dot comes before C with β = ε and lookahead $, we add [C → • γ, b] with b ∈ FIRST(ε $) = {$}.
I6 : goto(I2, a)
C → a • C, $
C → • a C, $
C → • d, $
You can note one thing : I3 and I6 are different, because the second component (the lookahead) in I3 and I6 is different.
Apply goto on d of I2 for the rule C → • d, $.
I7 : goto(I2, d)
C → d •, $
Now if we apply goto on a and d of I3 we get I3 and I4 respectively, and there is no point in repeating the states. So we will apply goto on C of I3.
I8 : goto(I3, C)
C → a C •, a/d
For I4 and I5 there is no point in applying gotos. Applying gotos on a and d of I6 gives I6 and I7 respectively. Hence we will apply goto on C of I6, for the rule C → a • C, $.
I9 : goto(I6, C)
C → a C •, $
For the remaining states I7, I8 and I9 we cannot apply goto. Hence the process of construction of the set of LR(1) items is completed. The set of LR(1) items consists of states I0 to I9.

Principles of Compiler Design 3-73 Syntax Analysis

For instance goto(I0, S) = I1, hence goto[0, S] = 1. Continuing in this fashion we can fill up the LR(1) parsing table as follows.

State |  a    d    $  |  S   C
  0   |  s3   s4      |  1   2
  1   |            acc |
  2   |  s6   s7      |      5
  3   |  s3   s4      |      8
  4   |  r3   r3      |
  5   |            r1  |
  6   |  s6   s7      |      9
  7   |            r3  |
  8   |  r2   r2      |
  9   |            r2  |

(The productions are numbered 1 : S → C C, 2 : C → a C, 3 : C → d.)
The remaining blank entries in the table are considered as syntactical errors.
Parsing the input using the LR(1) parsing table
Using the above parsing table we can parse the input string "aadd" as

Stack      | Input buffer | Action table      | Goto table     | Parsing action
$0         | aadd$        | action[0, a] = s3 |                | Shift
$0a3       | add$         | action[3, a] = s3 |                | Shift
$0a3a3     | dd$          | action[3, d] = s4 |                | Shift
$0a3a3d4   | d$           | action[4, d] = r3 | goto[3, C] = 8 | Reduce by C → d
$0a3a3C8   | d$           | action[8, d] = r2 | goto[3, C] = 8 | Reduce by C → aC
$0a3C8     | d$           | action[8, d] = r2 | goto[0, C] = 2 | Reduce by C → aC
$0C2       | d$           | action[2, d] = s7 |                | Shift
$0C2d7     | $            | action[7, $] = r3 | goto[2, C] = 5 | Reduce by C → d
$0C2C5     | $            | action[5, $] = r1 | goto[0, S] = 1 | Reduce by S → CC
$0S1       | $            |                   |                | Accept

Thus the given input string is successfully parsed using the LR parser or canonical LR parser.

Principles of Compiler Design 3-77 Syntax Analysis

Now consider state I0 : there is a match with an item of the form [A → α • a β, b], and goto(I0, a) = I36.
C → • a C, a/d/$ : applying goto on a gives the state I36. Hence we create the entry action[0, a] = shift 36.
Similarly, for
C → • d, a/d/$
goto(I0, d) = I47, hence action[0, d] = shift 47.
For state I47 :
C → d •, a/d/$
matches [A → α •, a] with A = C, α = d and lookaheads a/d/$, so
action[47, a] = reduce by C → d, i.e. rule 3
action[47, d] = reduce by C → d, i.e. rule 3
action[47, $] = reduce by C → d, i.e. rule 3
S' → S •, $ is in I1, so we create action[1, $] = accept.
The goto table can be filled by using the goto functions. For instance goto(I0, S) = I1, hence goto[0, S] = 1. Continuing in this fashion we can fill up the LALR(1) parsing table as follows.

State |  a     d     $  |  S   C
  0   |  s36   s47      |  1   2
  1   |             acc |
  2   |  s36   s47      |      5
 36   |  s36   s47      |      89
 47   |  r3    r3    r3 |
  5   |             r1  |
 89   |  r2    r2    r2 |
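The traces above all run the same table-driven LR algorithm. The sketch below (the table encoding and names are my own) hard-codes the canonical LR(1) ACTION/GOTO table built earlier for S → CC, C → aC | d and replays the standard driver, accepting the string aadd.

```python
# Table-driven LR driver for the grammar
#   1: S -> C C    2: C -> a C    3: C -> d
# ACTION entries are ('s', state), ('r', rule) or 'acc'; GOTO maps (state, NT).
RULES = {1: ("S", 2), 2: ("C", 2), 3: ("C", 1)}   # rule -> (lhs, RHS length)
ACTION = {
    (0, "a"): ("s", 3), (0, "d"): ("s", 4),
    (1, "$"): "acc",
    (2, "a"): ("s", 6), (2, "d"): ("s", 7),
    (3, "a"): ("s", 3), (3, "d"): ("s", 4),
    (4, "a"): ("r", 3), (4, "d"): ("r", 3),
    (5, "$"): ("r", 1),
    (6, "a"): ("s", 6), (6, "d"): ("s", 7),
    (7, "$"): ("r", 3),
    (8, "a"): ("r", 2), (8, "d"): ("r", 2),
    (9, "$"): ("r", 2),
}
GOTO = {(0, "S"): 1, (0, "C"): 2, (2, "C"): 5, (3, "C"): 8, (6, "C"): 9}

def lr_parse(tokens):
    stack, buf = [0], list(tokens) + ["$"]
    while True:
        act = ACTION.get((stack[-1], buf[0]))
        if act == "acc":
            return True
        if act is None:                      # blank entry: syntactical error
            return False
        if act[0] == "s":                    # shift: push symbol and state
            stack += [buf.pop(0), act[1]]
        else:                                # reduce: pop RHS, push lhs + goto
            lhs, n = RULES[act[1]]
            del stack[len(stack) - 2 * n:]
            stack += [lhs, GOTO[(stack[-1], lhs)]]
```

`lr_parse(["a", "a", "d", "d"])` goes through exactly the configurations shown in the trace and returns True; a string outside the language, such as ada, is rejected.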
Principles of Compiler Design 3-86 Syntax Analysis

The LALR parsing table will be -
(LALR parsing table for the grammar; in the merged state 59 the columns for a and c each contain two reduce entries.)
The parsing table shows multiple entries in Action[59, a] and Action[59, c]. This is called a reduce/reduce conflict. Because of this conflict we cannot parse the input. Thus it is shown that the given grammar is LR(1) but not LALR(1).

3.8 Comparison of LR Parsers

It is time to compare the SLR, LALR and canonical LR parsers on common factors such as size, class of CFG handled, efficiency, and cost in terms of time and space.

1. The SLR parser is the smallest in size; the LALR parser has the same size as SLR; the canonical LR parser is the largest in size.
2. SLR is the easiest method, based on the FOLLOW function; LALR is applicable to a wider class of grammars than SLR; canonical LR is the most powerful of the three.
3. SLR exposes fewer syntactic features of a language than the LR parsers; most of the syntactic features of a language are expressed in LALR.
4.
Error detection is not immediate in SLR, and it is not immediate in LALR either; immediate error detection is done by the canonical LR parser.
5. SLR requires the least time and space; the time and space complexity is higher in LALR, but efficient methods exist for constructing LALR parsers directly; the time and space complexity is highest for the canonical LR parser.

A graphical representation of the classes in the LR family is given below.
Fig. 3.24 Classification of grammars

3.9 Using Ambiguous Grammar

As we have seen in the various parsing methods, if the grammar is ambiguous then it creates conflicts and we cannot parse the input string with such a grammar. But for some languages, such as those of arithmetic expressions, ambiguous grammars are more compact and provide a more natural specification than the equivalent unambiguous grammars. Secondly, using an ambiguous grammar we can add new productions for special constructs.
While using an ambiguous grammar for parsing the input string, we will use disambiguating rules so that each time only one parse tree is generated for a specific input. Thus an ambiguous grammar can be used in a controlled manner for parsing the input. We will consider an ambiguous grammar and try to parse an input string.

Using precedence and associativity to resolve parsing action conflicts
Consider an ambiguous grammar for arithmetic expressions.
E → E + E
E → E * E
E → ( E )
E → id
Now we will build the set of LR(0) items for this grammar.
I0 : E' → • E
E → • E + E
E → • E * E
E → • ( E )
E → • id
I1 : goto(I0, E)
E' → E •
E → E • + E
E → E • * E
I2 : goto(I0, ()
E → ( • E )
E → • E + E
E → • E * E
E → • ( E )
E → • id
I3 : goto(I0, id)
E → id •
I4 : goto(I1, +)
E → E + • E
E → • E + E
E → • E * E
E → • ( E )
E → • id
I5 : goto(I1, *)
E → E * • E
E → • E + E
E → • E * E
E → • ( E )
E → • id
I6 : goto(I2, E)
E → ( E • )
E → E • + E
E → E • * E
I7 : goto(I4, E)
E → E + E •
E → E • + E
E → E • * E
I8 : goto(I5, E)
E → E * E •
E → E • + E
E → E • * E
I9 : goto(I6, ))
E → ( E ) •

Here FOLLOW(E) = {+, *, ), $}. We have computed FOLLOW(E) as it will be required while building the parsing table. Now, using the same rules for building the SLR parsing table, we will generate the parsing table for the above set of items.

Principles of Compiler Design 3-100 Syntax Analysis

Ans. :
        return yytext[0];
}
%%

The YACC program

/* Program Name : calci.y */
%{
double memvar;
%}

/* To define possible symbol types */
%union
{
    double dval;
}

/* Tokens used, which are returned by the lexer */
%token NUMBER
%token MEM
%token LOG SINE nLOG COS TAN

/* Defining the precedence and associativity */
%left '+' '-'                    /* Lowest precedence */
%left '*' '/'
%right '^'
%left LOG SINE nLOG COS TAN      /* Highest precedence */
%nonassoc UMINUS                 /* Unary minus : no associativity */

/* Sets the type for the non-terminal */
%type <dval> expression

/* Start state */
%%
start : statement '\n'
      | start statement '\n'
      ;

/* For storing the answer (memory) */
statement : MEM '=' expression    { memvar = $3; }
          | expression            { printf("Answer = %g\n", $1); }
          ;

/* For binary arithmetic operators */
expression : expression '+' expression    { $$ = $1 + $3; }
           | expression '-' expression    { $$ = $1 - $3; }
           | expression '*' expression    { $$ = $1 * $3; }
           | expression '/' expression
             {
                 /* To handle the divide by zero case */
                 if ($3 == 0)

(ii)  (a, (a, a))
(iii) (a, ((a, a), (a, a)))

c) Construct a leftmost derivation for each of the sentences in (b).
d) Construct a rightmost derivation for each of the sentences in (b).
e) What language does the grammar generate ?

Ans. :
a) The terminals are T = {a, (, ), ','}
   The non-terminals are V = {L, S}
   The start symbol is S.

b) The parse tree for
   i)  (a, a) is
   ii) (a, (a, a)) is
Q.  Construct the SLR parsing table for the grammar :
    1) E → E sub E sup E
    2) E → E sub E
    3) E → E sup E
    4) E → {E}
    5) E → c
    Resolve the conflicts, if any.

Ans. : We will first construct the canonical collection of LR(0) items as follows.

I0 :  E' → •E
      E → •E sub E sup E
      E → •E sub E
      E → •E sup E
      E → •{E}
      E → •c

I1 : goto(I0, E)
      E' → E•
      E → E• sub E sup E
      E → E• sub E
      E → E• sup E

I2 : goto(I0, {)
      E → {•E}
      E → •E sub E sup E
      E → •E sub E
      E → •E sup E
      E → •{E}
      E → •c

I3 : goto(I0, c)
      E → c•

I4 : goto(I1, sub)
      E → E sub •E sup E
      E → E sub •E
      E → •E sub E sup E
      E → •E sub E
      E → •E sup E
      E → •{E}
      E → •c

I5 : goto(I1, sup)
      E → E sup •E
      E → •E sub E sup E
      E → •E sub E
      E → •E sup E
      E → •{E}
      E → •c

I6 : goto(I2, E)
      E → {E•}
      E → E• sub E sup E
      E → E• sub E
      E → E• sup E

I7 : goto(I4, E)
      E → E sub E• sup E
      E → E sub E•
      E → E• sub E sup E
      E → E• sub E
      E → E• sup E

I8 : goto(I5, E)
      E → E sup E•
      E → E• sub E sup E
      E → E• sub E
      E → E• sup E

I9 : goto(I6, })
      E → {E}•

Q.  Find the FIRST and FOLLOW sets for the given grammar.
    S → PQR
    P → a | Rb | ε
    Q → c | dP | ε
    R → e | f

Ans. : The FIRST and FOLLOW sets for the non-terminal symbols are as follows :
FIRST(S) = {a, c, d, e, f}
FIRST(P) = {a, e, f, ε}
FIRST(Q) = {c, d, ε}
FIRST(R) = {e, f}
FOLLOW(S) = {$}
FOLLOW(P) = {c, d, e, f}
FOLLOW(Q) = {e, f}
FOLLOW(R) = {b, $}

Q.  A context free grammar G is as given below.
    S → x | Ay
    B → ε | z
    A → Bx
    Construct the canonical collection of sets of LR(1) items. Also construct the action and goto tables for the same.

Ans. : Let us first construct the canonical set of items I0.
I0 :  S' → •S, $
      S → •x, $
      S → •Ay, $
      A → •Bx, $
      B → •z, x
      B → •, x

I1 : goto(I0, S)
      S' → S•, $

I2 : goto(I0, x)
      S → x•, $

I3 : goto(I0, A)
      S → A•y, $

I4 : goto(I0, B)
      A → B•x, $

I5 : goto(I0, z)
      B → z•, x

I6 : goto(I3, y)
      S → Ay•, $

I7 : goto(I4, x)
      A → Bx•, $

Numbering the productions as 1) S → x, 2) S → Ay, 3) A → Bx, 4) B → z, 5) B → ε, the action and goto tables are :

State       x          y       z       $          S    A    B
0        s2 / r5              s5                  1    3    4
1                                    accept
2                                    r1
3                   s6
4          s7
5          r4
6                                    r2
7                                    r3

Note that a shift/reduce conflict occurs at action[0, x] : on lookahead x, state 0 can either shift x or reduce by B → ε.

Q.7  Consider the following grammar. Compute the set of LR(0) items for this grammar and draw the corresponding DFA.
     S → S op S | x

Ans. :
I0 :  S' → •S
      S → •S op S
      S → •x

I1 : goto(I0, S)
      S' → S•
      S → S• op S

I2 : goto(I0, x)
      S → x•

I3 : goto(I1, op)
      S → S op •S
      S → •S op S
      S → •x

I4 : goto(I3, S)
      S → S op S•
      S → S• op S

The required DFA can be drawn as follows.

Q.8  Consider the following grammar.
     E → E + T | T
     T → id | id[ ] | id[X]
     X → E, E | E
     a) Eliminate left recursion.
     b) Perform left factoring.
     c) Obtain the FIRST and FOLLOW sets.

Ans. :
a) To eliminate left recursion we need to modify the grammar. Using the following rule we can eliminate left recursion :

   A → Aα | β   is replaced by   A → βA'
                                 A' → αA'
                                 A' → ε
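The FIRST-set computation asked for in the exercises above is mechanical: iterate over the productions until no set grows. The following C sketch is my own illustration (not from the book) for the grammar S → PQR, P → a | Rb | ε, Q → c | dP | ε, R → e | f; uppercase letters are nonterminals, lowercase are terminals, and '~' stands for ε.

```c
#include <string.h>

/* first[i] holds the FIRST set of S, P, Q, R (indexed 0..3)
   as a string of symbols; '~' stands for epsilon. */
static char first[4][16];

static int nt_index(char c) {
    switch (c) { case 'S': return 0; case 'P': return 1;
                 case 'Q': return 2; case 'R': return 3; }
    return -1;                              /* terminal or epsilon */
}

static int add(char *set, char c) {         /* returns 1 if the set grew */
    if (strchr(set, c)) return 0;
    size_t n = strlen(set); set[n] = c; set[n + 1] = '\0'; return 1;
}

void compute_first(void)
{
    /* one production per string : left side, then right side ('~' = epsilon) */
    const char *prods[] = { "SPQR", "Pa", "PRb", "P~",
                            "Qc", "QdP", "Q~", "Re", "Rf" };
    int changed = 1;
    while (changed) {                       /* fixpoint iteration */
        changed = 0;
        for (int i = 0; i < 9; i++) {
            const char *rhs = prods[i] + 1;
            int lhs = nt_index(prods[i][0]);
            int nullable_prefix = 1;
            for (int k = 0; rhs[k] && nullable_prefix; k++) {
                int nt = nt_index(rhs[k]);
                nullable_prefix = 0;
                if (nt < 0) {               /* terminal (or '~') : add it, stop */
                    changed |= add(first[lhs], rhs[k]);
                } else {                    /* nonterminal : copy FIRST minus epsilon */
                    for (const char *p = first[nt]; *p; p++) {
                        if (*p == '~') nullable_prefix = 1;
                        else changed |= add(first[lhs], *p);
                    }
                }
            }
            if (nullable_prefix)            /* whole right side can vanish */
                changed |= add(first[lhs], '~');
        }
    }
}
```

Running the iteration reproduces the sets above: FIRST(R) = {e, f}, FIRST(P) = {a, e, f, ε}, FIRST(Q) = {c, d, ε} and FIRST(S) = {a, c, d, e, f}, with S not nullable because R is not.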
I8 : goto(I2, b)
      A → b•, $/a

I9 : goto(I4, a)
      A → Aa•, $

The canonical LR parsing table can then be constructed from these item sets.

Review Questions

1. Define syntax analyzer. Also discuss the various issues in parsing.
2. What is the role of the parser in the compilation process ?
3. Why are the lexical analyzer and the syntax analyzer separated out ?
4. Explain the term ambiguous grammar with an appropriate example.
5. What are the types of parsers ?
6. What are the problems with top-down parsing ?
7. Explain the recursive descent parser with an appropriate example.
8. Write algorithms for computing FIRST and FOLLOW.
9. Discuss the LL(1) parsing method for the following grammar :
   E → TE'
   E' → +TE' | ε
   T → FT'
   T' → *FT' | ε
   F → (E) | id

Principles of Compiler Design                4-2                Semantic Analysis

Source Program

Fig. 4.1

4.3.1 Type Expression

• The systematic way of expressing the type of a language construct is called a type expression. Thus the type expression can be defined as follows :

1) The basic type is a type expression. Hence int, char, float, double, enum are type expressions.

2) While performing type checking, two special basic types are needed : type_error and void.
The type_error is for reporting an error occurring during type checking, and void indicates that there is no type (NULL) associated with the statement.

3) The type name is also a type expression.
   For example : typedef int *INT_PTR;
   This statement defines the type name INT_PTR as a type expression which is actually an integer pointer.

4) The type constructors applied to type expressions are also type expressions. The type constructors are array, product, record (struct), pointer and function.

• Each of these type expressions is explained with the help of an example as follows :

i) Array : An array array(I, T) is a type expression where I is an index set, which is usually a range of integers, and the data type of the array elements is given by T. The type expression is : array(I, T)
   For example : int arr[20];
   The type expression for an array arr having the elements indexed 0, 1, ..., 19 is array(0..19, int)
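A type checker typically represents such type expressions as small trees. The C sketch below is my own illustration (names and layout are assumptions, not the book's): it encodes two basic types plus the array and pointer constructors, and tests structural equivalence of two type expressions.

```c
#include <stdlib.h>

/* Tagged tree of type expressions : basic types, array(I, T), pointer(T). */
typedef enum { T_INT, T_REAL, T_ARRAY, T_POINTER } Tag;

typedef struct Type {
    Tag tag;
    int size;             /* number of elements, for arrays      */
    struct Type *elem;    /* element / target type, for constructors */
} Type;

static Type *mk(Tag tag, int size, Type *elem) {
    Type *t = malloc(sizeof *t);
    t->tag = tag; t->size = size; t->elem = elem;
    return t;
}

/* Structural equivalence : same constructor, recursively equivalent parts. */
int equiv(const Type *a, const Type *b) {
    if (a->tag != b->tag) return 0;
    switch (a->tag) {
    case T_ARRAY:   return a->size == b->size && equiv(a->elem, b->elem);
    case T_POINTER: return equiv(a->elem, b->elem);
    default:        return 1;    /* basic types match by tag alone */
    }
}
```

With this representation, int arr[20]; from the example above would be encoded as mk(T_ARRAY, 20, mk(T_INT, 0, NULL)), i.e., array(0..19, int); two arrays are equivalent only when their index sets and element types agree.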
Principles of Compiler Design                6-2                Intermediate Code Generation

6.2 Intermediate Languages

There are mainly three types of intermediate code representations :
• Syntax tree
• Postfix notation
• Three address code

These three representations are used to represent the intermediate languages. Let us discuss each with the help of examples :

1. Syntax tree

The natural hierarchical structure is represented by syntax trees. A Directed Acyclic Graph or DAG is very much similar to a syntax tree, but it is in a more compact form.
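A syntax tree is easily represented as a small linked structure, and a postorder walk of it yields its linearized form. The C sketch below is my own illustration (not the book's code): it builds the tree for -a*b + -a*b and emits the operators after their operands.

```c
#include <string.h>

/* One node of the syntax tree : an operator or identifier name,
   with left/right children (right is NULL for the unary operator). */
typedef struct Node {
    const char *op;                 /* "a", "b", "uminus", "*", "+" */
    struct Node *left, *right;
} Node;

static char out[64];                /* collected postfix form */

/* Postorder walk : children first, then the node itself. */
static void postfix(const Node *n) {
    if (!n) return;
    postfix(n->left);
    postfix(n->right);
    strcat(out, n->op);
    strcat(out, " ");
}
```

Walking the tree for -a*b + -a*b produces "a uminus b * a uminus b * +", the linearized (postfix) order discussed next; a DAG would simply let both operands of + point at one shared subtree for -a*b.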
The code being generated as intermediate should be such that the remaining processing by the subsequent phases becomes easy. Consider the input string x := -a*b + -a*b for the syntax tree and DAG representation.

Fig. 6.2 Syntax tree and DAG
(In the syntax tree the subexpression -a*b appears as two separate subtrees under the assign node; in the DAG the common subexpression is represented only once and shared.)

2. Postfix notation

The postfix notation is a linearized representation of the syntax tree. Consider that the input expression is x := -a*b + -a*b; then the required postfix form is

x a uminus b * a uminus b * + :=

6.3 Generation of Three Address Code

The translation scheme for various language constructs can be generated by using appropriate semantic actions. In this section we will discuss the translation schemes for declaration statements, array references, Boolean expressions and so on.

6.4 Declarations

In the declarative statements the data items along with their data types are declared. For example :

S → D                  { offset := 0 }
D → id : T             { enter_tab(id.name, T.type, offset);
                         offset := offset + T.width }
T → integer            { T.type := integer; T.width := 4 }
T → real               { T.type := real; T.width := 8 }
T → array[num] of T1   { T.type := array(num.val, T1.type);
                         T.width := num.val × T1.width }
T → ↑T1                { T.type := pointer(T1.type); T.width := 4 }

• Initially the value of offset is set to zero. The computation of offset can be done by using the formula offset := offset + width.

• In the above translation scheme T.type and T.width are synthesized attributes. The type indicates the data type of the corresponding identifier and width indicates the number of memory units occupied by an identifier of the corresponding type. For instance, integer has width 4 and real has width 8.
• The rule D → id : T is a declarative statement for the declaration of id. The enter_tab is a function used for creating the symbol table entry for the identifier along with its type and offset.

• The width of an array is obtained by multiplying the width of each element by the number of elements in the array.

• The width of a pointer type is taken to be 4.

E → E1 * E2    { E.place := newtemp();
                 if E1.type = integer and E2.type = integer then
                 begin
                     append(E.place ':=' E1.place 'int *' E2.place);
                     E.type := integer
                 end
                 else if E1.type = real and E2.type = real then
                 begin
                     append(E.place ':=' E1.place 'real *' E2.place);
                     E.type := real
                 end
                 else if E1.type = integer and E2.type = real then
                 begin
                     t := newtemp();
                     append(t ':=' 'inttoreal' E1.place);
                     append(E.place ':=' t 'real *' E2.place);
                     E.type := real
                 end
                 else if E1.type = real and E2.type = integer then
                 begin
                     t := newtemp();
                     append(t ':=' 'inttoreal' E2.place);
                     append(E.place ':=' E1.place 'real *' t);
                     E.type := real
                 end
                 else
                     E.type := type_error;
               }

Translation scheme for generating three address code for expressions using type conversion

Example 6.2 : Generate the three address code for the expression x := a + b * c + d, assuming x and a are real and b, c and d are integer.

Solution :
t1 := b int* c
t2 := t1 int+ d
t3 := inttoreal t2
t4 := a real+ t3
x := t4

6.6 Arrays

As we know, an array is a collection of elements stored in contiguous memory. For accessing any element of an array what we need is its address. For statically declared arrays it
Production rule                          Semantic action

S → if B then M S1                       { backpatch(B.Tlist, M.state);
                                           S.next := merge(B.Flist, S1.next) }

S → if B then M1 S1 N else M2 S2         { backpatch(B.Tlist, M1.state);
                                           backpatch(B.Flist, M2.state);
                                           S.next := merge(S1.next, merge(N.next, S2.next)) }

S → while M1 B do M2 S1                  { backpatch(S1.next, M1.state);
                                           backpatch(B.Tlist, M2.state);
                                           S.next := B.Flist;
                                           append('goto' M1.state) }

S → begin L end                          { S.next := L.next }

S → A                                    { S.next := NULL }

L → L1 M S                               { backpatch(L1.next, M.state);
                                           L.next := S.next }

L → S                                    { L.next := S.next }

M → ε                                    { M.state := nextstate }

N → ε                                    { N.next := mklist(nextstate);
                                           append('goto _') }

• The synthesized attributes Tlist and Flist are used to generate the jumping code for Boolean expressions. B.Tlist is the list of jumps taken when B is true, and B.Flist is the list of jumps taken when B is false.
• The attribute state is associated with M and is used to record the number (address) of the statement; it is denoted as M.state. The nextstate points to the next statement.

• The attribute next is used to indicate the pointer to the next list. S.next and L.next record the pointer to the next list of S and L respectively. The assignment S.next := NULL initializes S.next to an empty list.

Thus backpatching can be used to translate the flow of control statements in one pass.
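The primitives mklist, merge and backpatch can be implemented over an array of quadruples by using each incomplete goto's unfilled target field as a list link. The C sketch below is my own illustration of this standard trick (quad index 0 is reserved here as the end-of-list marker, an assumption of this sketch):

```c
/* target[i] is the jump target of quad i; while that target is still
   unknown, the same field links quad i to the next quad on its list. */
#define MAXQ 100
static int target[MAXQ];

/* mklist(i) : a list containing only quad i. */
int mklist(int i) { target[i] = 0; return i; }

/* merge(p1, p2) : append list p2 onto list p1 and return the head. */
int merge(int p1, int p2) {
    if (!p1) return p2;
    int i = p1;
    while (target[i]) i = target[i];   /* walk to the end of list p1 */
    target[i] = p2;
    return p1;
}

/* backpatch(p, label) : fill label into every quad on list p.
   Filling the target destroys the link, so save it first. */
void backpatch(int p, int label) {
    while (p) {
        int next = target[p];
        target[p] = label;
        p = next;
    }
}
```

For example, merging the lists for quads 3 and 5 and backpatching the result with label 42 completes both gotos in one call, which is exactly what the semantic actions above do with B.Tlist and B.Flist.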
