The Art of Disassembly

http://aod.anticrack.de and http://board.anticrack.de
A project by: Zero, CuTedEvil, Crick

CHAPTER 0 Welcome To The Art Of Disassembly

What is AoD - The Art of Disassembly?

The Art of Disassembly sees itself as a handbook for writing a disassembler.
Is this book like the "Art of Assembly"? No. We really do not want to mess with this great free
ebook.
This book is very long. Did we really write it all alone? No, not really. Especially for the
theoretical part we have included some of the best articles we found, among them the articles
on SEH, the PE tutorial and "How to build a disassembler". A very long addition I have made is
the chapter "Let's build a compiler", which alone runs to 350 pages. There was really no need
to "rewrite" these articles.
Alongside the theoretical sources we included some good code snippets we found on the web
and at http://board.win32asmcommunity.net.
Of course we respect the work of these authors and do not claim their work as ours. You
should do the same. Therefore we always add a footnote naming the author and/or the
location where we found the article or source.
The practical part of the disassembler was developed during an online course/discussion
at http://board.anticrack.de and has many contributors. We hope to mention them all.
In the end we hope that this book will be a complete handbook for building a disassembler
in assembly language for the win32 environment.
As the development language we decided to use MASM32v7 because it is free and well
supported at http://board.win32asmcommunity.net. But of course you can use (after reading
this book) TASM, NASM, FASM or whatever.
How much will this book cost and where can you get a paperback version?
First, this book will always be free. There is no need to pay for it.
Second, there will never be a printed version. Like the Art of Assembly, this book is only
available in PDF format. And of course we will never sell articles written by others, even
though we have included them here.

Are we allowed to include the articles made by others?
Yes, if we respect the authors, add footnotes crediting them and never claim their work as
ours. Please see this book as an academic publication meant to increase knowledge.
Zero - Main Author
CuTedEvil - Main Coder
Crick - Main Coder

Licensing
No Licensing.
No Freeware or shareware or whatever.
No copyright, copyleft, copytop or copydown… just a little copycenter :D
For included articles not by us please respect their copyrights !
This document IS ABSOLUTELY FREE !!!
So we call it learn-ware.
You are allowed to do with this document whatever you want, as long as you keep it as it is.
Do not extract parts and present them as a full tutorial. People are always happy when they
find the full document.
So you are allowed to teach your grandma, print this document or take it into a pub and
place your beer on it.

Disclaimer
- We disclaim ourselves.
- We are not responsible for damage to your computer when you use the information
described in this book.
- We are not responsible if you lose your hair as a result of this heavy and
complex material.
- We are not responsible for whatever you do with the knowledge you gain from here.
- We are not responsible for the content of articles by other people that we have included.

The Software You Really Need !

There are some tools we will need for writing our disassembler. All tools you need to
download are for free and can be used without licensing for our disassembler.
- MASM32v7 package as our assembler
- RadAsm as IDE for development
- OllyDbg for debugging
Anyway you may need some more links to get informed:
- http://aod.anticrack.de
This is the main site for this project and this document. Check it to get the latest
release. We will offer links to all necessary tools as well as a change-log of this
document.
- http://www.anticrack.de
All information you need for coding assembly and reverse-engineering
- http://board.anticrack.de
This is the place where our disassembler and this document are developed. Please
check out the disassembler forum!
- http://board.win32asmcommunity.net
The main place for asking questions about MASM and assembly coding.
No reverse-engineering topics here please!
- http://www.cs.vu.nl/~dick/PTAPG.html
Parsing Techniques - A practical guide by Dick Grune and Ceriel J.H. Jacobs
That's it!
Free tools, a free book and a free mind will take us on the road of wasting time…



CHAPTER 1 Basic knowledge you need
First we need to have a look at some very important basic knowledge. It is very important that
you understand the first lesson before you start coding a disassembler.
In this chapter we will first take a short journey into the PE file structure of win32 applications.
Then we will discuss some coding techniques which can make your life easier when you are
coding the disassembler engine. We will have a look at modular and procedural coding, and
after this a short overview of object-oriented programming (OOP). This is not a handbook for
good coding, so you should already know something about these coding concepts; therefore we
will not go into deep detail on these topics. Next we will discuss linked lists and tree/graph
structures. Linked lists in particular are very important when you load a file into memory
and want to parse it. Combining linked lists with OOP can be a very powerful tool. After
understanding this we will have a look at parsing problems and how to loop through the bytes
in memory. At the end of this chapter we will look at the concept of opcodes and mnemonics,
which is one of the foundations of our disassembler.
This chapter is for very inexperienced users and should give you the background
knowledge you need to build your own disassembler engine.

Lesson 1 - A little journey into the PE-Filestructure
The PE header is the most important thing you have to understand. It defines the structure
of a normal (PE) file in the win32 environment.
When you are coding a disassembler you have to work with it. You need to detect whether a
file has a valid structure, inspect the different sections, look into the import and
export tables and find the entry point of the application. The next lessons are the
original tutorials by Iczelion. They are the best you can find to get a good overview of the
PE file structure in a win32 assembly environment.
There was really no need to write our own PE tutorial. Most beginning assembly
coders have learned from these tutorials. We respect the work by Iczelion and you should
do the same!

Overview of the PE-File format1

PE stands for Portable Executable. It's the native file format of Win32. Its specification is
derived somewhat from the Unix COFF (Common Object File Format). The meaning of "portable
executable" is that the file format is universal across win32 platforms: the PE loader of every
win32 platform recognizes and uses this file format even when Windows is running on CPU
platforms other than Intel. It doesn't mean your PE executables would be able to port to other
CPU platforms without change. Every win32 executable (except VxDs and 16-bit DLLs) uses
the PE file format. Even NT's kernel mode drivers use the PE file format. Thus studying the PE
file format gives you valuable insights into the structure of Windows.
Let's jump into the general outline of PE file format without further ado.
1. This is the original tutorial by Iczelion

DOS MZ header
DOS stub
PE header
Section table
Section 1
Section 2
Section ...
Section n
The above picture is the general layout of a PE file. All PE files (even 32-bit DLLs) must
start with a simple DOS MZ header. We usually aren't interested in this structure much.
It's provided in the case when the program is run from DOS, so DOS can recognize it as
a valid executable and can thus run the DOS stub which is stored next to the MZ header.
The DOS stub is actually a valid EXE that is executed in case the operating system
doesn't know about PE file format. It can simply display a string like "This program
requires Windows" or it can be a full-blown DOS program depending on the intent of the
programmer. We are also not very interested in the DOS stub: it's usually provided by the
assembler/compiler. In most cases, it simply uses int 21h, service 9 to print a string saying
"This program cannot run in DOS mode".

After the DOS stub comes the PE header. The PE header is a general term for the PE-related
structure named IMAGE_NT_HEADERS. This structure contains many essential fields that
are used by the PE loader. You will become quite familiar with it as you learn more about the
PE file format. When the program is executed in an operating system that knows about the PE
file format, the PE loader can find the starting offset of the PE header from the DOS MZ header.
Thus it can skip the DOS stub and go directly to the PE header, which is the real file header.
The real content of the PE file is divided into blocks called sections. A section is nothing more
than a block of data with common attributes such as code/data, read/write etc. You can think
of a PE file as a logical disk. The PE header is the boot sector and the sections are files on the
disk. The files can have different attributes such as read-only, system, hidden, archive and so
on. I want to make it clear from this point onwards that the grouping of data into a section is
done on the basis of common attributes, not on a logical basis. It doesn't matter how the
code/data are used: if the data/code in the PE file have the same attributes, they can be lumped
together in a section. You should not think of a section as "data", "code" or some other logical
concept: sections can contain both code and data provided that they have the same attributes.
If you have a block of data that you want to be read-only, you can put that data in a section
that is marked as read-only. When the PE loader maps the sections into memory, it examines
the attributes of the sections and gives the memory blocks occupied by the sections the
indicated attributes.
If we view the PE file format as a logical disk, the PE header as the boot sector and the
sections as files, we still don't have enough information to find out where the files reside on the
disk, i.e. we haven't discussed the directory equivalent of the PE file format. Immediately
following the PE header is the section table, which is an array of structures. Each structure
contains information about one section in the PE file, such as its attributes, its file offset and
its virtual offset. If there are 5 sections in the PE file, there will be exactly 5 members in this
structure array. We can then view the section table as the root directory of the logical disk.
Each member of the array is equivalent to a directory entry in the root directory.

That's all about the physical layout of the PE file format. I'll summarize the major steps in
loading a PE file into memory below:
1. When the PE file is run, the PE loader examines the DOS MZ header for the offset
of the PE header. If found, it skips to the PE header.
2. The PE loader checks if the PE header is valid. If so, it goes to the end of the PE
header.
3. Immediately following the PE header is the section table. The PE loader reads
information about the sections and maps those sections into memory using file
mapping. It also gives each section the attributes specified in the section table.
4. After the PE file is mapped into memory, the PE loader concerns itself with the
logical parts of the PE file, such as the import table.
The above steps are an oversimplification and are based on my own observation. There may
be some inaccuracies but they should give you a clear picture of the process. You should
download LUEVELSMEYER's description of the PE file format. It's very detailed and you
should keep it as a reference.

Detecting a valid PE-File2

Theory
How can you verify if a given file is a PE file? That question is difficult to answer. It
depends on the lengths you are willing to go to. You can verify every data structure
defined in the PE file format or be satisfied with verifying only the crucial ones. Most of
the time, it's pretty pointless to verify every single structure in the file. If the crucial structures
are valid, we can assume that the file is a valid PE. And we will use that assumption.
The essential structure we will verify is the PE header itself. So we need to know a little about
it, programmatically. The PE header is actually a structure called IMAGE_NT_HEADERS. It
has the following definition:
IMAGE_NT_HEADERS STRUCT
Signature dd ?
FileHeader IMAGE_FILE_HEADER <>
OptionalHeader IMAGE_OPTIONAL_HEADER32 <>
IMAGE_NT_HEADERS ENDS
Signature is a dword that contains the bytes 50h, 45h, 00h, 00h. In more human terms, it
contains the text "PE" followed by two terminating zeroes. This member is the PE signature so
we will use it in verifying if a given file is a valid PE one.
FileHeader is a structure that contains information about the physical layout of the PE file
such as the number of sections, the machine the file is targeted and so on.
OptionalHeader is a structure that contains information about the logical layout of the PE file.
Despite the "Optional" in its name, it's always present.

Our goal is now clear. If value of the signature member of the IMAGE_NT_HEADERS is
equal to "PE" followed by two zeroes, then the file is a valid PE. In fact, for comparison
purpose, Microsoft has defined a constant named IMAGE_NT_SIGNATURE which we
can readily use.
IMAGE_DOS_SIGNATURE equ 5A4Dh

IMAGE_OS2_SIGNATURE equ 454Eh
IMAGE_OS2_SIGNATURE_LE equ 454Ch
IMAGE_VXD_SIGNATURE equ 454Ch
IMAGE_NT_SIGNATURE equ 4550h
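These constants are easier to remember once you see that they are simply the ASCII signatures read as little-endian numbers. The following lines are only an illustrative sketch in Python (chosen because it runs anywhere, not because the tutorial uses it); the helper names as_word and as_dword are made up for this example and do not come from windows.inc.

```python
import struct

IMAGE_DOS_SIGNATURE = 0x5A4D   # "MZ"
IMAGE_OS2_SIGNATURE = 0x454E   # "NE"
IMAGE_VXD_SIGNATURE = 0x454C   # "LE"
IMAGE_NT_SIGNATURE = 0x4550    # "PE" followed by two zero bytes


def as_word(raw: bytes) -> int:
    """Read two bytes the way an x86 'cmp word ptr' sees them (little-endian)."""
    return struct.unpack("<H", raw)[0]


def as_dword(raw: bytes) -> int:
    """Read four bytes as a little-endian dword."""
    return struct.unpack("<I", raw)[0]


# 'M' is 4Dh and 'Z' is 5Ah; the first byte ends up in the low half of the word.
print(hex(as_word(b"MZ")))            # 0x5a4d
print(hex(as_dword(b"PE\x00\x00")))   # 0x4550
```

This is why the signature can be checked with a single integer comparison instead of a string comparison.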
The next question: how can we know where the PE header is? The answer is simple: the
DOS MZ header contains the file offset of the PE header. The DOS MZ header is defined
as IMAGE_DOS_HEADER structure. You can check it out in windows.inc. The e_lfanew
member of the IMAGE_DOS_HEADER structure contains the file offset of the PE header.
The steps are now as follows:
1. Verify that the given file has a valid DOS MZ header by comparing the first word of
the file with the value IMAGE_DOS_SIGNATURE.
2. If the file has a valid DOS header, use the value in the e_lfanew member to find the
PE header.
3. Compare the first dword of the PE header with the value IMAGE_NT_SIGNATURE.
If both values match, then we can assume that the file is a valid PE.
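Before looking at the MASM version, the three steps can be sketched in a few lines of Python. This is only a minimal illustration under some assumptions: the whole file is already in memory as a bytes object, the function name is_valid_pe is our own, and the wild-pointer problem is handled by an explicit bounds check rather than by anything Windows-specific. The offset 3Ch is where e_lfanew sits inside the 64-byte IMAGE_DOS_HEADER.

```python
import struct

IMAGE_DOS_SIGNATURE = 0x5A4D   # "MZ"
IMAGE_NT_SIGNATURE = 0x4550    # "PE\0\0"
DOS_HEADER_SIZE = 0x40         # size of IMAGE_DOS_HEADER
E_LFANEW_OFFSET = 0x3C         # position of e_lfanew inside the DOS header


def is_valid_pe(data: bytes) -> bool:
    """Apply the three checks from the text to a file held in memory."""
    # Step 1: the first word of the file must be "MZ".
    if len(data) < DOS_HEADER_SIZE:
        return False
    (e_magic,) = struct.unpack_from("<H", data, 0)
    if e_magic != IMAGE_DOS_SIGNATURE:
        return False

    # Step 2: e_lfanew holds the file offset of the PE header.
    (e_lfanew,) = struct.unpack_from("<I", data, E_LFANEW_OFFSET)
    if e_lfanew + 4 > len(data):
        return False  # the offset points outside the file

    # Step 3: the first dword of the PE header must be "PE\0\0".
    (signature,) = struct.unpack_from("<I", data, e_lfanew)
    return signature == IMAGE_NT_SIGNATURE
```

To try it on a real file, read the file in binary mode and pass the bytes in, for example is_valid_pe(open(path, "rb").read()), where path is whatever executable you want to test.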

Example
.386
.model flat,stdcall
option casemap:none
include \masm32\include\windows.inc
include \masm32\include\kernel32.inc
include \masm32\include\comdlg32.inc
include \masm32\include\user32.inc
includelib \masm32\lib\user32.lib
includelib \masm32\lib\kernel32.lib
includelib \masm32\lib\comdlg32.lib
SEH struct
PrevLink dd ? ; the address of the previous seh structure
CurrentHandler dd ? ; the address of the exception handler
SafeOffset dd ? ; The offset where it's safe to continue execution
PrevEsp dd ? ; the old value in esp
PrevEbp dd ? ; The old value in ebp
SEH ends
.data
AppName db "PE tutorial no.2",0
ofn OPENFILENAME <>
FilterString db "Executable Files (*.exe, *.dll)",0,"*.exe;*.dll",0
db "All Files",0,"*.*",0,0
FileOpenError db "Cannot open the file for reading",0
FileOpenMappingError db "Cannot open the file for memory mapping",0
FileMappingError db "Cannot map the file into memory",0
FileValidPE db "This file is a valid PE",0
FileInValidPE db "This file is not a valid PE",0
.data?
buffer db 512 dup(?)
hFile dd ?
hMapping dd ?

pMapping dd ?
ValidPE dd ?
.code
start proc
LOCAL seh:SEH
mov ofn.lStructSize,SIZEOF ofn
mov ofn.lpstrFilter, OFFSET FilterString
mov ofn.lpstrFile, OFFSET buffer
mov ofn.nMaxFile,512
mov ofn.Flags, OFN_FILEMUSTEXIST or OFN_PATHMUSTEXIST or OFN_LONGNAMES or OFN_EXPLORER or OFN_HIDEREADONLY
invoke GetOpenFileName, ADDR ofn
.if eax==TRUE
invoke CreateFile, addr buffer, GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL
.if eax!=INVALID_HANDLE_VALUE
mov hFile, eax
invoke CreateFileMapping, hFile, NULL, PAGE_READONLY,0,0,0
.if eax!=NULL
mov hMapping, eax
invoke MapViewOfFile,hMapping,FILE_MAP_READ,0,0,0
.if eax!=NULL
mov pMapping,eax
assume fs:nothing
push fs:[0]
pop seh.PrevLink
mov seh.CurrentHandler,offset SEHHandler
mov seh.SafeOffset,offset FinalExit
lea eax,seh
mov fs:[0], eax
mov seh.PrevEsp,esp
mov seh.PrevEbp,ebp
mov edi, pMapping
assume edi:ptr IMAGE_DOS_HEADER
.if [edi].e_magic==IMAGE_DOS_SIGNATURE

add edi, [edi].e_lfanew

assume edi:ptr IMAGE_NT_HEADERS
.if [edi].Signature==IMAGE_NT_SIGNATURE
mov ValidPE, TRUE
.else
mov ValidPE, FALSE
.endif
.else
mov ValidPE,FALSE
.endif
FinalExit:
.if ValidPE==TRUE
invoke MessageBox, 0, addr FileValidPE, addr AppName, MB_OK+MB_ICONINFORMATION
.else
invoke MessageBox, 0, addr FileInValidPE, addr AppName, MB_OK+MB_ICONINFORMATION
.endif
push seh.PrevLink
pop fs:[0]
invoke UnmapViewOfFile, pMapping
.else
invoke MessageBox, 0, addr FileMappingError, addr AppName, MB_OK+MB_ICONERROR
.endif
invoke CloseHandle,hMapping
.else
invoke MessageBox, 0, addr FileOpenMappingError, addr AppName, MB_OK+MB_ICONERROR
.endif
invoke CloseHandle, hFile
.else
invoke MessageBox, 0, addr FileOpenError, addr AppName, MB_OK+MB_ICONERROR
.endif
.endif
invoke ExitProcess, 0

start endp
SEHHandler proc C uses edx pExcept:DWORD, pFrame:DWORD, pContext:DWORD, pDispatch:DWORD
mov edx,pFrame
assume edx:ptr SEH
mov eax,pContext
assume eax:ptr CONTEXT
push [edx].SafeOffset
pop [eax].regEip
push [edx].PrevEsp
pop [eax].regEsp
push [edx].PrevEbp
pop [eax].regEbp
mov ValidPE, FALSE
mov eax,ExceptionContinueExecution
ret
SEHHandler endp
end start

Analysis:
The program opens a file and checks whether the DOS header is valid. If it is, the program
checks whether the PE header is valid. If so, it assumes the file is a valid PE. In this example,
I use structured exception handling (SEH) so that we don't have to check for every possible
error: if a fault occurs, we assume that it's because the file is not a valid PE and thus gave our
program wrong information. Windows itself uses SEH heavily in its parameter validation
routines. If you're interested in SEH, read the article by Jeremy Gordon.
The program displays an open file common dialog to the user and when the user chooses an
executable file, it opens the file and maps it into memory. Before it goes on with the
verification, it sets up SEH:
assume fs:nothing
push fs:[0]
pop seh.PrevLink
mov seh.CurrentHandler,offset SEHHandler
mov seh.SafeOffset,offset FinalExit
lea eax,seh
mov fs:[0], eax
mov seh.PrevEsp,esp
mov seh.PrevEbp,ebp
We start by assuming the use of fs register as nothing. This must be done because MASM
assumes the use of fs register to ERROR. Next we store the address of the previous SEH
handler in our structure for use by Windows. We store the address of our SEH handler, the
address where the execution can safely resume if a fault occurs, the current values of esp
and ebp so that our SEH handler can get the state of the stack back to normal before it
resumes the execution of our program.
mov edi, pMapping
assume edi:ptr IMAGE_DOS_HEADER
.if [edi].e_magic==IMAGE_DOS_SIGNATURE

After we are done with setting up SEH, we continue with the verification. We put the
address of the first byte of the target file in edi, which is the first byte of the DOS header.
For ease of comparison, we tell the assembler that it can assume edi as pointing to the
IMAGE_DOS_HEADER structure (which is the truth). We then compare the first word of
the DOS header with the string "MZ" which is defined as a constant in windows.inc
named IMAGE_DOS_SIGNATURE. If the comparison is ok, we continue to the PE
header. If not, we set the value in ValidPE to FALSE, meaning that the file is not a valid
PE.
add edi, [edi].e_lfanew
assume edi:ptr IMAGE_NT_HEADERS
.if [edi].Signature==IMAGE_NT_SIGNATURE
mov ValidPE, TRUE
.else
mov ValidPE, FALSE
.endif
To get to the PE header, we need the value in e_lfanew of the DOS header. This field
contains the file offset of the PE header, relative to the beginning of the file. Thus we add this
value to edi and we get to the first byte of the PE header. This is the place where a fault may
occur. If the file is really not a PE file, the value in e_lfanew will be incorrect and thus using it
amounts to using a wild pointer. If we don't use SEH, we must check the value of
e_lfanew against the file size, which is ugly. If all goes well, we compare the first dword of
the PE header with the string "PE". Again there is a handy constant named
IMAGE_NT_SIGNATURE which we can use. If the result of the comparison is true, we
assume the file is a valid PE.
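The idea of "do not check every offset, just catch the fault" also exists outside of Win32. The Python sketch below is only a loose analogy, not a translation of the SEH code: Python's struct module raises struct.error when asked to read outside the buffer, and catching that exception plays the role of the SEH handler that sets ValidPE to FALSE. The function name is_valid_pe_guarded is hypothetical.

```python
import struct

IMAGE_DOS_SIGNATURE = 0x5A4D   # "MZ"
IMAGE_NT_SIGNATURE = 0x4550    # "PE\0\0"


def is_valid_pe_guarded(data: bytes) -> bool:
    """Check the two signatures and treat any out-of-range read as 'not a PE'."""
    try:
        (e_magic,) = struct.unpack_from("<H", data, 0)
        if e_magic != IMAGE_DOS_SIGNATURE:
            return False
        # A bogus e_lfanew is the "wild pointer" the text warns about.
        (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)
        (signature,) = struct.unpack_from("<I", data, e_lfanew)
        return signature == IMAGE_NT_SIGNATURE
    except struct.error:
        # Equivalent of landing at FinalExit with ValidPE set to FALSE.
        return False
```

The trade-off is the same as in the assembly version: the code stays short, but every fault is interpreted as "invalid file", whatever its real cause.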

If the value in e_lfanew is incorrect, a fault may occur and our SEH handler will get control. It
simply restores the stack pointer and base pointer and resumes execution at the safe offset,
which is at the FinalExit label.
FinalExit:
.if ValidPE==TRUE
invoke MessageBox, 0, addr FileValidPE, addr AppName, MB_OK+MB_ICONINFORMATION
.else
invoke MessageBox, 0, addr FileInValidPE, addr AppName, MB_OK+MB_ICONINFORMATION
.endif
The above code is simplicity itself. It checks the value in ValidPE and displays a message to
the user accordingly.
push seh.PrevLink
pop fs:[0]
When the SEH is no longer used, we dissociate it from the SEH chain.

File-Header3
Let's summarize what we have learned so far:
- The DOS MZ header is called IMAGE_DOS_HEADER. Only two of its members are important
to us: e_magic, which contains the string "MZ", and e_lfanew, which contains the file
offset of the PE header.
- We use the value in e_magic to check if the file has a valid DOS header by comparing it
to the value IMAGE_DOS_SIGNATURE. If both values match, we can assume that the
file has a valid DOS header.
- In order to go to the PE header, we must move the file pointer to the offset specified by
the value in e_lfanew.
- The first dword of the PE header should contain the string "PE" followed by two zeroes.
We compare the value in this dword to the value IMAGE_NT_SIGNATURE. If they match,
then we can assume that the PE header is valid.
We will learn more about the PE header in this tutorial. The official name of the PE header
is IMAGE_NT_HEADERS. To refresh your memory, I show it below.
IMAGE_NT_HEADERS STRUCT
Signature dd ?
FileHeader IMAGE_FILE_HEADER <>
OptionalHeader IMAGE_OPTIONAL_HEADER32 <>
IMAGE_NT_HEADERS ENDS

Signature is the PE signature, "PE" followed by two zeroes. You already know and use this
member.
FileHeader is a structure that contains the information about the physical layout/properties of
the PE file in general.
OptionalHeader is also a structure that contains the information about the logical layout
inside the PE file.
The most interesting information is in OptionalHeader. However, some fields in FileHeader
are also important. We will learn about FileHeader in this tutorial so we can move on to study
OptionalHeader in the next tutorials.
IMAGE_FILE_HEADER STRUCT
Machine WORD ?
NumberOfSections WORD ?
TimeDateStamp dd ?
PointerToSymbolTable dd ?
NumberOfSymbols dd ?
SizeOfOptionalHeader WORD ?
Characteristics WORD ?
IMAGE_FILE_HEADER ENDS

TABLE 1. The File-Header

Field Name            Meaning
Machine               The CPU platform the file is intended for. For the Intel platform, the value
                      is IMAGE_FILE_MACHINE_I386 (14Ch). I tried to use 14Dh and 14Eh as stated in
                      the pe.txt by LUEVELSMEYER but Windows refused to run it. This field is
                      rarely of interest to us except as a quick way of preventing a program from
                      being executed.
NumberOfSections      The number of sections in the file. We will need to modify the value in this
                      member if we add or delete a section from the file.
TimeDateStamp         The date and time the file was created. Not useful to us.
PointerToSymbolTable  Used for debugging.
NumberOfSymbols       Used for debugging.
SizeOfOptionalHeader  The size of the OptionalHeader member that immediately follows this
                      structure. Must be set to a valid value.
Characteristics       Contains flags for the file, such as whether this file is an exe or a dll.
In summary, only three members are somewhat useful to us: Machine, NumberOfSections
and Characteristics. You would normally not change the values of Machine and
Characteristics but you must use the value in NumberOfSections when you're walking the
section table.
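To see the IMAGE_FILE_HEADER layout from another angle, here is a small Python sketch that unpacks its seven fields from a file held in memory. It is only an illustration: read_file_header and is_dll are names invented for this example, and the caller is assumed to have already read e_lfanew from the DOS header. The format string mirrors the structure above (two words, three dwords, two words, 20 bytes in total), and IMAGE_FILE_DLL (2000h) is the Characteristics flag that marks a DLL.

```python
import struct
from collections import namedtuple

FILE_HEADER_FORMAT = "<HHIIIHH"   # WORD, WORD, DWORD, DWORD, DWORD, WORD, WORD
FILE_HEADER_SIZE = struct.calcsize(FILE_HEADER_FORMAT)   # 20 bytes

IMAGE_FILE_MACHINE_I386 = 0x014C
IMAGE_FILE_DLL = 0x2000

FileHeader = namedtuple(
    "FileHeader",
    "Machine NumberOfSections TimeDateStamp PointerToSymbolTable "
    "NumberOfSymbols SizeOfOptionalHeader Characteristics",
)


def read_file_header(data: bytes, e_lfanew: int) -> FileHeader:
    """Unpack IMAGE_FILE_HEADER, which starts right after the 4-byte PE signature."""
    fields = struct.unpack_from(FILE_HEADER_FORMAT, data, e_lfanew + 4)
    return FileHeader._make(fields)


def is_dll(header: FileHeader) -> bool:
    """Test the DLL bit in Characteristics."""
    return bool(header.Characteristics & IMAGE_FILE_DLL)
```

A disassembler would typically read Machine to decide whether it can decode the code at all, and NumberOfSections before it walks the section table.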
I'm jumping the gun here but in order to illustrate the use of NumberOfSections, I need to
digress briefly to the section table.

The section table is an array of structures. Each structure contains the information of a
section. Thus if there are 3 sections, there will be 3 members in this array. You need the value in
NumberOfSections so you know how many members there are in the array. You would think
that checking for the structure with all zeroes in its members would help. Windows does use
this approach. You can verify this fact by setting the value in NumberOfSections to a value
higher than the real value and Windows still runs the file without problem. From my
observation, I think Windows reads the value in NumberOfSections and examines each structure
in the section table. If it finds a structure that contains all zeroes, it terminates the search.
Else it would process until the number of structures specified in NumberOfSections is met. Why
can't we ignore the value in NumberOfSections? Several reasons. The PE specification
doesn't specify that the section table array must end with an all-zero structure. Thus there
may be a situation where the last array member is contiguous to the first section, without
empty space at all. Another reason has to do with bound imports. The new-style binding puts
the information immediately following the section table's last structure array member. Thus
you still need NumberOfSections.

Optional Headers4
We have learned about the DOS header and some members of the PE header. Here's
the last, the biggest and probably the most important member of the PE header, the
optional header.
To refresh your memory, the optional header is a structure that is the last member of
IMAGE_NT_HEADERS. It contains information about the logical layout in the PE file.
There are 31 fields in this structure. Some of them are crucial and some are not useful. I'll
explain only those fields that are really useful.
There is a term that's used frequently in relation to the PE file format: RVA.
RVA stands for relative virtual address. You know what a virtual address is. RVA is a daunting
term for such a simple concept. Simply put, an RVA is a distance from a reference
point in the virtual address space. I bet you're familiar with file offsets: an RVA is exactly
the same thing as a file offset, except that it's relative to a point in the virtual address space,
not to the start of a file. I'll show you an example. If a PE file loads at 400000h in the virtual
address (VA) space and the program starts execution at the virtual address 401000h, we can say
that the program starts execution at RVA 1000h. An RVA is relative to the starting VA of the
module.
Why does the PE file format use RVAs? To help reduce the load on the PE loader. Since
a module can be relocated anywhere in the virtual address space, it would be hell for
the PE loader to fix every relocatable item in the module. In contrast, if all relocatable
items in the file use RVAs, there is no need for the PE loader to fix anything: it simply
relocates the whole module to a new starting VA. It's like the concept of relative path and
absolute path: an RVA is akin to a relative path, a VA is like an absolute path.
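The arithmetic behind RVAs is nothing more than one subtraction or one addition. The short Python sketch below replays the example from the text; the function names and the second load address (10000000h) are assumptions chosen only for illustration.

```python
def va_to_rva(va: int, image_base: int) -> int:
    """Distance of a virtual address from the start of its module."""
    return va - image_base


def rva_to_va(rva: int, image_base: int) -> int:
    """Absolute address of an RVA once the module's load address is known."""
    return image_base + rva


# The example from the text: module loaded at 400000h, execution starts at 401000h.
entry_rva = va_to_rva(0x401000, 0x400000)

# If the module ends up at another base (10000000h is just an assumed value),
# the RVA stays the same and only the resulting VA changes.
relocated_entry_va = rva_to_va(entry_rva, 0x10000000)

print(hex(entry_rva), hex(relocated_entry_va))   # 0x1000 0x10001000
```

This is the relative path versus absolute path idea in numbers: the RVA is stored once in the file, and the VA is computed only when the load address is known.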

TABLE 2. Optional Header

Field                   Meaning
AddressOfEntryPoint     The RVA of the first instruction that will be executed when the PE loader
                        is ready to run the PE file. If you want to divert the flow of execution
                        right from the start, you need to change the value in this field to a new
                        RVA and the instruction at the new RVA will be executed first.
ImageBase               The preferred load address for the PE file. For example, if the value in
                        this field is 400000h, the PE loader will try to load the file into the
                        virtual address space starting at 400000h. The word "preferred" means that
                        the PE loader may not load the file at that address if some other module
                        already occupies that address range.
SectionAlignment        The granularity of the alignment of the sections in memory. For example,
                        if the value in this field is 4096 (1000h), each section must start at a
                        multiple of 4096 bytes. If the first section is at 401000h and its size is
                        10 bytes, the next section must be at 402000h even if the address space
                        between 401000h and 402000h will be mostly unused.
FileAlignment           The granularity of the alignment of the sections in the file. For example,
                        if the value in this field is 512 (200h), each section must start at a
                        multiple of 512 bytes. If the first section is at file offset 200h and its
                        size is 10 bytes, the next section must be located at file offset 400h:
                        the space between file offsets 522 (20Ah) and 1024 (400h) is
                        unused/undefined.
MajorSubsystemVersion   The win32 subsystem version. If the PE file is designed for Win32, the
MinorSubsystemVersion   subsystem version must be 4.0, else the dialog won't have the 3-D look.
SizeOfImage             The overall size of the PE image in memory. It's the sum of all headers
                        and sections aligned to SectionAlignment.
SizeOfHeaders           The size of all headers + section table. In short, this value is equal to
                        the file size minus the combined size of all sections in the file. You can
                        also use this value as the file offset of the first section in the PE file.
Subsystem               Tells which of the NT subsystems the PE file is intended for. For most
                        win32 programs, only two values are used: Windows GUI and Windows CUI
                        (console).
DataDirectory           An array of IMAGE_DATA_DIRECTORY structures. Each structure gives the RVA
                        of an important data structure in the PE file such as the import address
                        table.
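The two alignment fields boil down to "round up to the next multiple". A minimal Python sketch, using only the numbers from the SectionAlignment and FileAlignment rows above (the helper name align_up is our own):

```python
def align_up(value: int, alignment: int) -> int:
    """Round value up to the next multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment


# SectionAlignment = 1000h: a 10-byte section at 401000h pushes the next one to 402000h.
next_section_va = align_up(0x401000 + 10, 0x1000)

# FileAlignment = 200h: a 10-byte section at file offset 200h pushes the next one to 400h.
next_section_offset = align_up(0x200 + 10, 0x200)

print(hex(next_section_va), hex(next_section_offset))   # 0x402000 0x400
```

A value that is already a multiple of the alignment is returned unchanged, which is what you want when a section happens to fill its last block exactly.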

Section Table5
Theory:
Up to this tutorial, we have learned about the DOS header and the PE header. What remains is
the section table. The section table is actually an array of structures immediately following the
PE header. The number of array members is determined by the NumberOfSections field in
the file header (IMAGE_FILE_HEADER) structure. The structure is called
IMAGE_SECTION_HEADER.
IMAGE_SIZEOF_SHORT_NAME equ 8
IMAGE_SECTION_HEADER STRUCT
Name1 db IMAGE_SIZEOF_SHORT_NAME dup(?)
union Misc
PhysicalAddress dd ?
VirtualSize dd ?
ends
VirtualAddress dd ?
SizeOfRawData dd ?
PointerToRawData dd ?
PointerToRelocations dd ?
PointerToLinenumbers dd ?
NumberOfRelocations dw ?
NumberOfLinenumbers dw ?
Characteristics dd ?
IMAGE_SECTION_HEADER ENDS
Again, not all members are useful. I'll describe only the ones that are really important.

TABLE 3. Section Table

Field             Meaning
Name1             Actually the name of this field is "name" but the word "name" is a MASM
                  keyword so we have to use "Name1" instead. This member contains the name of
                  the section. Note that the maximum length is 8 bytes. The name is just a label,
                  nothing more. You can use any name or even leave this field blank. Note that
                  there is no mention of the terminating null. The name is not an ASCIIZ string
                  so don't expect it to be terminated with a null.
VirtualAddress    The RVA of the section. The PE loader examines and uses the value in this field
                  when it's mapping the section into memory. Thus if the value in this field is
                  1000h and the PE file is loaded at 400000h, the section will be loaded at
                  401000h.
SizeOfRawData     The size of the section's data rounded up to the next multiple of the file
                  alignment. The PE loader examines the value in this field so it knows how many
                  bytes in the section it should map into memory.
PointerToRawData  The file offset of the beginning of the section. The PE loader uses the value
                  in this field to find where the data in the section is in the file.
Characteristics   Contains flags such as whether this section contains executable code,
                  initialized data or uninitialized data, and whether it can be written to or
                  read from.
Now that we know about the IMAGE_SECTION_HEADER structure, let's see how we can emulate
the PE loader's job:
1. Read NumberOfSections in IMAGE_FILE_HEADER so we know how many sections there are
in the file.
2. Find the file offset of the section table and move the file pointer to that offset.
The section table immediately follows the optional header, so its offset is e_lfanew + 4
(the signature) + sizeof IMAGE_FILE_HEADER + SizeOfOptionalHeader. Don't use
SizeOfHeaders for this: it's the combined size of all the headers, section table included.
3. Walk the structure array, examining each member.
4. For each structure, we obtain the value in PointerToRawData and move the file
pointer to that offset. Then we read the value in SizeOfRawData so we know how many
bytes we should map into memory. Read the value in VirtualAddress and add the value
in ImageBase to it to get the virtual address the section should start from. And
then we are ready to map the section into memory and mark the attribute of the memory
according to the flags in Characteristics.
5. Walk the array until all the sections are processed.
Note that we didn't make use of the name of the section: it's not really necessary.
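
If you want to try the steps above without building the MASM example first, here is a minimal sketch in Python (not the language of this book; it is used only because it is short). The function names section_table_offset and walk_sections are made up for this illustration, and image_base is passed in by the caller instead of being read from the optional header. It only parses the table: it does not map anything into memory.

```python
import struct

# IMAGE_SECTION_HEADER is 40 bytes: the 8-byte name followed by
# VirtualSize, VirtualAddress, SizeOfRawData, PointerToRawData,
# PointerToRelocations, PointerToLinenumbers (dwords),
# NumberOfRelocations, NumberOfLinenumbers (words), Characteristics (dword).
SECTION_HEADER = struct.Struct("<8sIIIIIIHHI")


def section_table_offset(data):
    """Return (file offset of the section table, number of sections)."""
    e_lfanew, = struct.unpack_from("<I", data, 0x3C)
    if data[e_lfanew:e_lfanew + 4] != b"PE\0\0":
        raise ValueError("not a valid PE")
    # IMAGE_FILE_HEADER (20 bytes) starts right after the 4-byte signature.
    number_of_sections, = struct.unpack_from("<H", data, e_lfanew + 4 + 2)
    size_of_optional_header, = struct.unpack_from("<H", data, e_lfanew + 4 + 16)
    return e_lfanew + 4 + 20 + size_of_optional_header, number_of_sections


def walk_sections(data, image_base):
    """Return one dict per IMAGE_SECTION_HEADER, in table order."""
    offset, count = section_table_offset(data)
    sections = []
    for i in range(count):
        fields = SECTION_HEADER.unpack_from(data, offset + i * SECTION_HEADER.size)
        name, _virtual_size, rva, raw_size, raw_ptr = fields[:5]
        sections.append({
            # The name is not ASCIIZ: cut at the first null, if there is one.
            "name": name.split(b"\0", 1)[0].decode("ascii", "replace"),
            "load_address": image_base + rva,   # ImageBase + VirtualAddress
            "raw_offset": raw_ptr,              # PointerToRawData
            "raw_size": raw_size,               # SizeOfRawData
            "characteristics": fields[9],
        })
    return sections
```

With a file read into memory as bytes, walk_sections(data, 0x400000) gives the address each section would start at if the file were loaded at 400000h.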

Example:
This example opens a PE file and walks the section table, showing the information about
the sections in a listview control.
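.386
.model flat,stdcall
option casemap:none
include \masm32\include\windows.inc
include \masm32\include\kernel32.inc
include \masm32\include\comdlg32.inc
include \masm32\include\user32.inc
include \masm32\include\comctl32.inc
includelib \masm32\lib\comctl32.lib
includelib \masm32\lib\user32.lib
includelib \masm32\lib\kernel32.lib
includelib \masm32\lib\comdlg32.lib
IDD_SECTIONTABLE equ 104
IDC_SECTIONLIST equ 1001
SEH struct
PrevLink dd ? ; the address of the previous seh structure
CurrentHandler dd ? ; the address of the new exception handler
SafeOffset dd ? ; the offset where it's safe to continue execution
PrevEsp dd ? ; the old value in esp
PrevEbp dd ? ; the old value in ebp
SEH ends
.data
AppName db "PE tutorial no.5",0
ofn OPENFILENAME <>
FilterString db "Executable Files (*.exe, *.dll)",0,"*.exe;*.dll",0
db "All Files",0,"*.*",0,0
FileOpenError db "Cannot open the file for reading",0
FileOpenMappingError db "Cannot open the file for memory mapping",0
FileMappingError db "Cannot map the file into memory",0
FileInValidPE db "This file is not a valid PE",0
template db "%08lx",0
SectionName db "Section",0
VirtualSize db "V.Size",0
VirtualAddress db "V.Address",0
SizeOfRawData db "Raw Size",0
RawOffset db "Raw Offset",0
Characteristics db "Characteristics",0
.data?
hInstance dd ?
buffer db 512 dup(?)
hFile dd ?
hMapping dd ?
pMapping dd ?
ValidPE dd ?
NumberOfSections dd ?
.code
start proc
LOCAL seh:SEH
invoke GetModuleHandle,NULL
mov hInstance,eax
mov ofn.lStructSize,SIZEOF ofn
mov ofn.lpstrFilter, OFFSET FilterString
mov ofn.lpstrFile, OFFSET buffer
mov ofn.nMaxFile,512
mov ofn.Flags, OFN_FILEMUSTEXIST or OFN_PATHMUSTEXIST or OFN_LONGNAMES or OFN_EXPLORER or OFN_HIDEREADONLY
invoke GetOpenFileName, ADDR ofn
.if eax==TRUE
invoke CreateFile, addr buffer, GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL
.if eax!=INVALID_HANDLE_VALUE
mov hFile, eax
invoke CreateFileMapping, hFile, NULL, PAGE_READONLY,0,0,0
.if eax!=NULL
mov hMapping, eax
invoke MapViewOfFile,hMapping,FILE_MAP_READ,0,0,0
.if eax!=NULL
mov pMapping,eax
; install the SEH frame so a bad pointer in a damaged file cannot crash us
assume fs:nothing
push fs:[0]
pop seh.PrevLink
mov seh.CurrentHandler,offset SEHHandler
mov seh.SafeOffset,offset FinalExit
lea eax,seh
mov fs:[0], eax
mov seh.PrevEsp,esp
mov seh.PrevEbp,ebp
mov edi, pMapping
assume edi:ptr IMAGE_DOS_HEADER
.if [edi].e_magic==IMAGE_DOS_SIGNATURE
add edi, [edi].e_lfanew
assume edi:ptr IMAGE_NT_HEADERS
.if [edi].Signature==IMAGE_NT_SIGNATURE
mov ValidPE, TRUE
.else
mov ValidPE, FALSE
.endif
.else
mov ValidPE,FALSE
.endif
FinalExit:
push seh.PrevLink
pop fs:[0]
.if ValidPE==TRUE
call ShowSectionInfo
.else
invoke MessageBox, 0, addr FileInValidPE, addr AppName, MB_OK+MB_ICONINFORMATION
.endif
invoke UnmapViewOfFile, pMapping
.else
invoke MessageBox, 0, addr FileMappingError, addr AppName, MB_OK+MB_ICONERROR
.endif
invoke CloseHandle,hMapping
.else
invoke MessageBox, 0, addr FileOpenMappingError, addr AppName, MB_OK+MB_ICONERROR
.endif
invoke CloseHandle, hFile
.else
invoke MessageBox, 0, addr FileOpenError, addr AppName, MB_OK+MB_ICONERROR
.endif
.endif
invoke ExitProcess, 0
invoke InitCommonControls ; never executed: only forces the link with comctl32
start endp
SEHHandler proc C uses edx pExcept:DWORD,pFrame:DWORD,pContext:DWORD,pDispatch:DWORD
mov edx,pFrame
assume edx:ptr SEH
mov eax,pContext
assume eax:ptr CONTEXT
push [edx].SafeOffset ; resume execution at FinalExit
pop [eax].regEip
push [edx].PrevEsp
pop [eax].regEsp
push [edx].PrevEbp
pop [eax].regEbp
mov ValidPE, FALSE
mov eax,ExceptionContinueExecution
ret
SEHHandler endp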
DlgProc proc uses edi esi hDlg:DWORD, uMsg:DWORD, wParam:DWORD, lParam:DWORD
LOCAL lvc:LV_COLUMN
LOCAL lvi:LV_ITEM
.if uMsg==WM_INITDIALOG
mov esi, lParam ; esi == address of the section table
; create the six columns of the listview control
mov lvc.imask,LVCF_FMT or LVCF_TEXT or LVCF_WIDTH or LVCF_SUBITEM
mov lvc.fmt,LVCFMT_LEFT
mov lvc.lx,80
mov lvc.iSubItem,0
mov lvc.pszText,offset SectionName
invoke SendDlgItemMessage,hDlg,IDC_SECTIONLIST,LVM_INSERTCOLUMN,0,addr lvc
inc lvc.iSubItem
mov lvc.fmt,LVCFMT_RIGHT
mov lvc.pszText,offset VirtualSize
invoke SendDlgItemMessage,hDlg,IDC_SECTIONLIST,LVM_INSERTCOLUMN,1,addr lvc
inc lvc.iSubItem
mov lvc.pszText,offset VirtualAddress
invoke SendDlgItemMessage,hDlg,IDC_SECTIONLIST,LVM_INSERTCOLUMN,2,addr lvc
inc lvc.iSubItem
mov lvc.pszText,offset SizeOfRawData
invoke SendDlgItemMessage,hDlg,IDC_SECTIONLIST,LVM_INSERTCOLUMN,3,addr lvc
inc lvc.iSubItem
mov lvc.pszText,offset RawOffset
invoke SendDlgItemMessage,hDlg,IDC_SECTIONLIST,LVM_INSERTCOLUMN,4,addr lvc
inc lvc.iSubItem
mov lvc.pszText,offset Characteristics
invoke SendDlgItemMessage,hDlg,IDC_SECTIONLIST,LVM_INSERTCOLUMN,5,addr lvc
mov edi,NumberOfSections ; already zero-extended to a dword
mov lvi.imask,LVIF_TEXT
mov lvi.iItem,0
assume esi:ptr IMAGE_SECTION_HEADER
.while edi>0
mov lvi.iSubItem,0
invoke RtlZeroMemory,addr buffer,9
invoke lstrcpyn,addr buffer,addr [esi].Name1,9 ; the count includes the terminating null
lea eax,buffer
mov lvi.pszText,eax
invoke SendDlgItemMessage,hDlg,IDC_SECTIONLIST,LVM_INSERTITEM,0,addr lvi
invoke wsprintf,addr buffer,addr template,[esi].Misc.VirtualSize
lea eax,buffer
mov lvi.pszText,eax
inc lvi.iSubItem
invoke SendDlgItemMessage,hDlg,IDC_SECTIONLIST,LVM_SETITEM,0,addr lvi
invoke wsprintf,addr buffer,addr template,[esi].VirtualAddress
lea eax,buffer
mov lvi.pszText,eax
inc lvi.iSubItem
invoke SendDlgItemMessage,hDlg,IDC_SECTIONLIST,LVM_SETITEM,0,addr lvi
invoke wsprintf,addr buffer,addr template,[esi].SizeOfRawData
lea eax,buffer
mov lvi.pszText,eax
inc lvi.iSubItem
invoke SendDlgItemMessage,hDlg,IDC_SECTIONLIST,LVM_SETITEM,0,addr lvi
invoke wsprintf,addr buffer,addr template,[esi].PointerToRawData
lea eax,buffer
mov lvi.pszText,eax
inc lvi.iSubItem
invoke SendDlgItemMessage,hDlg,IDC_SECTIONLIST,LVM_SETITEM,0,addr lvi
invoke wsprintf,addr buffer,addr template,[esi].Characteristics
lea eax,buffer
mov lvi.pszText,eax
inc lvi.iSubItem
invoke SendDlgItemMessage,hDlg,IDC_SECTIONLIST,LVM_SETITEM,0,addr lvi
inc lvi.iItem
dec edi
add esi, sizeof IMAGE_SECTION_HEADER
.endw
.elseif uMsg==WM_CLOSE
invoke EndDialog,hDlg,NULL
.else
mov eax,FALSE
ret
.endif
mov eax,TRUE
ret
DlgProc endp
ShowSectionInfo proc uses edi
mov edi, pMapping
assume edi:ptr IMAGE_DOS_HEADER
add edi,[edi].e_lfanew
assume edi:ptr IMAGE_NT_HEADERS
mov ax,[edi].FileHeader.NumberOfSections
movzx eax,ax
mov NumberOfSections,eax
add edi,sizeof IMAGE_NT_HEADERS
invoke DialogBoxParam, hInstance, IDD_SECTIONTABLE,NULL, addr DlgProc, edi
ret
ShowSectionInfo endp
end start

Analysis:
This example reuses the code of the example in PE tutorial 2. After it verifies that the file is a
valid PE, it calls a function, ShowSectionInfo.
ShowSectionInfo proc uses edi
mov edi, pMapping
assume edi:ptr IMAGE_DOS_HEADER
add edi,[edi].e_lfanew
assume edi:ptr IMAGE_NT_HEADERS
We use edi as the pointer to the data in the PE file. At first, we initialize it to the value of
pMapping which is the address of the DOS header. Then we add the value in e_lfanew to it
so it now contains the address of the PE header.
mov ax,[edi].FileHeader.NumberOfSections
movzx eax,ax
mov NumberOfSections,eax
Since we need to walk the section table, we must obtain the number of sections in this file.
That's the value in the NumberOfSections member of the file header. Don't forget that this
member is of word size, so we zero-extend it before storing it in our dword variable.
add edi,sizeof IMAGE_NT_HEADERS
Edi currently contains the address of the PE header. Adding the size of the PE header to it
will make it point at the section table.
invoke DialogBoxParam, hInstance, IDD_SECTIONTABLE,NULL, addr DlgProc, edi
Call DialogBoxParam to show the dialog box containing the listview control. Note that we
pass the address of the section table as its last parameter. This value will be available in
lParam during WM_INITDIALOG message.

In the dialog box procedure, in response to the WM_INITDIALOG message, we store the
value of lParam (address of the section table) in esi, the number of sections in edi and
then dress up the listview control. When everything is ready, we enter a loop which will
insert the info about each section into the listview control. This part is very simple.
.while edi>0
mov lvi.iSubItem,0
Put this string in the first column.

invoke RtlZeroMemory,addr buffer,9
invoke lstrcpyn,addr buffer,addr [esi].Name1,9
lea eax,buffer
mov lvi.pszText,eax
We will display the name of the section but we must convert it to an ASCIIZ string first.
The count passed to lstrcpyn includes the terminating null, so we pass 9 in order to copy
all 8 bytes of the name.
invoke SendDlgItemMessage,hDlg,IDC_SECTIONLIST,LVM_INSERTITEM,0,addr lvi
Then we display it in the first column.
We continue with this scheme until the last value we want to display for this section is dis-
played. Then we must move to the next structure.
dec edi
add esi, sizeof IMAGE_SECTION_HEADER
.endw
We decrement the value in edi for each section processed. And we add the size of
IMAGE_SECTION_HEADER to esi so it contains the address of the next
IMAGE_SECTION_HEADER structure.

The steps in walking the section table are:
1. Verify that the file is a valid PE
2. Go to the beginning of the PE header
3. Obtain the number of sections from the NumberOfSections field in the file header.
4. Go to the section table by adding the size of the PE header to the address of the PE
header (the section table immediately follows the PE header). The size of the optional
header is recorded in SizeOfOptionalHeader in the file header, so the robust way to
compute it is: address of the PE header + 4 + sizeof IMAGE_FILE_HEADER +
SizeOfOptionalHeader. If you don't use file mapping, you need to move the file pointer
to that offset using SetFilePointer. Note that SizeOfHeaders (a member of
IMAGE_OPTIONAL_HEADER) is not the offset of the section table: it's the combined
size of all the headers, section table included.
5. Process each IMAGE_SECTION_HEADER structure.

Import Table6
We will learn about import table in this tutorial. Let me warn you first. This tutorial is a long
and difficult one for those who aren't familiar with the import table. You may need to read
this tutorial several times and may even have to examine the related structures under a
debugger.
Theory:
First of all, you should know what an import function is. An import function is a function
that is not in the caller's module but is called by the module, thus the name "import". The
import functions actually reside in one or more DLLs. Only the information about the func-
tions is kept in the caller's module. That information includes the function names and the
names of the DLLs in which they reside.
Now how can we find out where in the PE file the information is kept? We must turn to the
data directory for the answer. I'll refresh your memory a bit. Below is the PE header:
IMAGE_NT_HEADERS STRUCT
Signature dd ?
FileHeader IMAGE_FILE_HEADER <>
OptionalHeader IMAGE_OPTIONAL_HEADER <>
IMAGE_NT_HEADERS ENDS
The last member of the optional header is the data directory:
IMAGE_OPTIONAL_HEADER32 STRUCT
....
LoaderFlags dd ?
NumberOfRvaAndSizes dd ?
DataDirectory IMAGE_DATA_DIRECTORY 16 dup(<>)
IMAGE_OPTIONAL_HEADER32 ENDS

The data directory is an array of IMAGE_DATA_DIRECTORY structures, 16 members in total.
If you think of the section table as the root directory of the sections in a PE file, you
should also think of the data directory as the root directory of the logical components stored
inside those sections. To be precise, the data directory contains the locations and sizes of the
important data structures in the PE file. Each member contains information about an
important data structure.
Member  Info inside
0       Export symbols
1       Import symbols
2       Resources
3       Exception
4       Security
5       Base relocation
6       Debug
7       Copyright string
8       Unknown
9       Thread local storage (TLS)
10      Load configuration
11      Bound Import
12      Import Address Table
13      Delay Import
14      COM descriptor

Only the members painted in gold are known to me. Now that you know what each mem-
ber of the data directory contains, we can learn about the member in detail. Each member
of the data directory is a structure called IMAGE_DATA_DIRECTORY which has the fol-
lowing definition:
IMAGE_DATA_DIRECTORY STRUCT
VirtualAddress dd ?
isize dd ?
IMAGE_DATA_DIRECTORY ENDS
VirtualAddress is actually the relative virtual address (RVA) of the data structure. For
example, if this structure is for import symbols, this field contains the RVA of the
IMAGE_IMPORT_DESCRIPTOR array.
isize contains the size in bytes of the data structure referred to by VirtualAddress.
Here's the general scheme on finding important data structures in a PE file:
1.From the DOS header, you go to the PE header
2.Obtain the address of the data directory in the optional header.
3.Multiply the size of IMAGE_DATA_DIRECTORY with the member index you want:
for example if you want to know where the import symbols are, you must multiply
the size of IMAGE_DATA_DIRECTORY (8 bytes) with 1.
4.Add the result to the address of the data directory and you have the address of the
IMAGE_DATA_DIRECTORY structure that contains the info about the desired data
structure.
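
The four steps above boil down to a little pointer arithmetic. Here is a minimal sketch of it in Python (again used only for brevity, it is not the language of this book). The function name data_directory_entry is made up for this illustration, and the offsets 92 and 96 are those of NumberOfRvaAndSizes and DataDirectory in the 32-bit optional header, so the sketch assumes a PE32 file and refuses anything else.

```python
import struct

IMAGE_DIRECTORY_ENTRY_IMPORT = 1     # index of the import symbols member
SIZEOF_DATA_DIRECTORY = 8            # VirtualAddress dd + isize dd


def data_directory_entry(data, index):
    """Return (VirtualAddress, isize) of data directory member `index`."""
    # Step 1: from the DOS header, go to the PE header.
    e_lfanew, = struct.unpack_from("<I", data, 0x3C)
    # The optional header follows the signature (4) and the file header (20).
    optional_header = e_lfanew + 4 + 20
    magic, = struct.unpack_from("<H", data, optional_header)
    if magic != 0x10B:               # assumption: 32-bit optional header only
        raise ValueError("not a PE32 optional header")
    count, = struct.unpack_from("<I", data, optional_header + 92)
    if not 0 <= index < count:
        raise IndexError("no such data directory member")
    # Step 2: address of the data directory inside the optional header.
    directory = optional_header + 96
    # Steps 3 and 4: multiply the member size by the index and add it.
    entry = directory + SIZEOF_DATA_DIRECTORY * index
    return struct.unpack_from("<II", data, entry)
```

Calling data_directory_entry(data, IMAGE_DIRECTORY_ENTRY_IMPORT) returns the RVA and size of the import table; the RVA still has to be converted to a file offset before you can read it from a file on disk.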
Now we will enter into the real discussion about the import table. The address of the
import table is contained in the VirtualAddress field of the second member of the data
directory. The import table is actually an array of IMAGE_IMPORT_DESCRIPTOR structures.
Each structure contains information about a DLL the PE file imports symbols from.
For example, if the PE file imports functions from 10 different DLLs, there will be 10
members in this array. The array is terminated by a member which contains all zeroes. Now
we can examine the structure in detail:
IMAGE_IMPORT_DESCRIPTOR STRUCT
union
Characteristics dd ?
OriginalFirstThunk dd ?
ends
TimeDateStamp dd ?
ForwarderChain dd ?
Name1 dd ?
FirstThunk dd ?
IMAGE_IMPORT_DESCRIPTOR ENDS
The first member of this structure is a union. Actually, the union only provides an alias for
OriginalFirstThunk, so you can also call it "Characteristics". This member contains the RVA
of an array of IMAGE_THUNK_DATA structures.
What is IMAGE_THUNK_DATA? It's a union of dword size. Usually, we interpret it as the
pointer to an IMAGE_IMPORT_BY_NAME structure. Note that IMAGE_THUNK_DATA contains
the pointer to an IMAGE_IMPORT_BY_NAME structure: not the structure itself.
Look at it this way: There are several IMAGE_IMPORT_BY_NAME structures. We collect the
RVAs of those structures (IMAGE_THUNK_DATAs) into an array and terminate it with 0. Then
we put the RVA of the array into OriginalFirstThunk.

The IMAGE_IMPORT_BY_NAME structure contains information about an import function.
Now let's see what the IMAGE_IMPORT_BY_NAME structure looks like:
IMAGE_IMPORT_BY_NAME STRUCT
Hint dw ?
Name1 db ?
IMAGE_IMPORT_BY_NAME ENDS
Hint contains the index into the export table of the DLL the function resides in. This field is
for use by the PE loader so it can look up the function in the DLL's export table
quickly. This value is not essential and some linkers may set the value in this field to 0.
Name1 contains the name of the import function. The name is an ASCIIZ string. Note that
Name1's size is defined as byte but it's really a variable-sized field. It's just that there is no
way to represent a variable-sized field in a structure. The structure is provided so that you
can refer to the data structure with descriptive names.
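
To make the "variable-sized field" concrete, here is a small sketch in Python (used only for brevity) that reads one IMAGE_IMPORT_BY_NAME from a buffer. The function name read_import_by_name is made up for this illustration, and offset is assumed to be a file offset, ie. an RVA that has already been converted.

```python
import struct


def read_import_by_name(data, offset):
    """Parse the IMAGE_IMPORT_BY_NAME found at `offset` in `data`."""
    hint, = struct.unpack_from("<H", data, offset)   # Hint dw ?
    # Name1 starts right after Hint and runs up to the terminating null.
    end = data.index(b"\0", offset + 2)
    return hint, data[offset + 2:end].decode("ascii")
```

The structure definition only gives us names for the two fields; the real length of Name1 is wherever the null byte happens to be.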
Back to the IMAGE_IMPORT_DESCRIPTOR structure. TimeDateStamp and ForwarderChain are
advanced stuff: we will talk about them after you have a firm grasp of the other members.
Name1 contains the RVA to the name of the DLL, in short, the pointer to the name of the
DLL. The string is an ASCIIZ one.
FirstThunk is very similar to OriginalFirstThunk, ie. it contains an RVA of an array of
IMAGE_THUNK_DATA structures (a different array though).
Ok, if you're still confused, look at it this way: There are several
IMAGE_IMPORT_BY_NAME structures. You create two arrays, then fill them with the
RVAs of those IMAGE_IMPORT_BY_NAME structures, so both arrays contain exactly
the same values (i.e. exact duplicate). Now you assign the RVA of the first array to Origi-
nalFirstThunk and the RVA of the second array to FirstThunk.

OriginalFirstThunk        IMAGE_IMPORT_BY_NAME        FirstThunk
        |                                                  |
IMAGE_THUNK_DATA   --->   Function 1           <---   IMAGE_THUNK_DATA
...                --->   ...                  <---   ...
IMAGE_THUNK_DATA   --->   Function n           <---   IMAGE_THUNK_DATA
Now you should be able to understand what I mean. Don't be confused by the name
IMAGE_THUNK_DATA: it's only an RVA into IMAGE_IMPORT_BY_NAME structure. If you
replace the word IMAGE_THUNK_DATA with RVA in your mind, you'll perhaps see it more
clearly. The number of array elements in OriginalFirstThunk and FirstThunk array depends on
the functions the PE file imports from the DLL. For example, if the PE file imports 10 functions
from kernel32.dll, Name1 in the IMAGE_IMPORT_DESCRIPTOR structure will contain the
RVA of the string "kernel32.dll" and there will be 10 IMAGE_THUNK_DATAs in each array.

The next question is: why do we need two arrays that are exactly the same? To answer
that question, we need to know that when the PE file is loaded into memory, the PE
loader will look at the IMAGE_THUNK_DATAs and IMAGE_IMPORT_BY_NAMEs and
determine the addresses of the import functions. Then it replaces the
IMAGE_THUNK_DATAs in the array pointed to by FirstThunk with the real addresses of
the functions. Thus when the PE file is ready to run, the above picture is changed to:
OriginalFirstThunk        IMAGE_IMPORT_BY_NAME        FirstThunk
        |                                                  |
IMAGE_THUNK_DATA   --->   Function 1                  Address of Function 1
...                --->   ...                         ...
IMAGE_THUNK_DATA   --->   Function n                  Address of Function n
The array of RVAs pointed to by OriginalFirstThunk remains unchanged so that if the
need arises to find the names of import functions, the PE loader can still find them.
There is a little twist on this *straightforward* scheme. Some functions are exported by
ordinal only. It means you don't call the functions by their names: you call them by their
positions. In this case, there will be no IMAGE_IMPORT_BY_NAME structure for that
function in the caller's module. Instead, the IMAGE_THUNK_DATA for that function will
contain the ordinal of the function in the low word and the most significant bit (MSB) of
IMAGE_THUNK_DATA set to 1. For example, if a function is exported by ordinal only and
its ordinal is 1234h, the IMAGE_THUNK_DATA for that function will be 80001234h.
Microsoft provides a handy constant for testing the MSB of a dword,
IMAGE_ORDINAL_FLAG32. It has the value of 80000000h.
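
The test against IMAGE_ORDINAL_FLAG32 can be sketched in a few lines of Python (used only for brevity). The function name decode_thunk and the two tags it returns are made up for this illustration.

```python
IMAGE_ORDINAL_FLAG32 = 0x80000000


def decode_thunk(thunk):
    """Classify one 32-bit IMAGE_THUNK_DATA value."""
    if thunk & IMAGE_ORDINAL_FLAG32:
        # MSB set: the ordinal is in the low word.
        return ("ordinal", thunk & 0xFFFF)
    # MSB clear: the value is the RVA of an IMAGE_IMPORT_BY_NAME structure.
    return ("name_rva", thunk)
```

For the value from the example above, decode_thunk(0x80001234) gives the ordinal 1234h.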

Suppose that we want to list the names of ALL import functions of a PE file. We need to follow
the steps below:
1. Verify that the file is a valid PE
2. From the DOS header, go to the PE header
3. Obtain the address of the data directory in OptionalHeader
4. Go to the 2nd member of the data directory. Extract the value of VirtualAddress
5. Use that value to go to the first IMAGE_IMPORT_DESCRIPTOR structure
6. Check the value of OriginalFirstThunk. If it's not zero, follow the RVA in OriginalFirstThunk
to the RVA array. If OriginalFirstThunk is zero, use the value in FirstThunk instead. Some
linkers generate PE files with 0 in OriginalFirstThunk. This is considered a bug. Just to be on
the safe side, we check the value in OriginalFirstThunk first.
7. For each member in the array, we check the value of the member against
IMAGE_ORDINAL_FLAG32. If the most significant bit of the member is 1, then the function is
imported by ordinal and we can extract the ordinal number from the low word of the member.
8. If the most significant bit of the member is 0, use the value in the member as the RVA into
the IMAGE_IMPORT_BY_NAME, skip Hint, and you're at the name of the function.
9. Skip to the next array member, and retrieve the names until the end of the array is reached
(it's null-terminated). Now we are done extracting the names of the functions imported from a
DLL. We go to the next DLL.
10. Skip to the next IMAGE_IMPORT_DESCRIPTOR and process it. Do that until the end of
the array is reached (the IMAGE_IMPORT_DESCRIPTOR array is terminated by a member with
all zeroes in its fields).
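
Before the full MASM example, here is a compact sketch of steps 5 to 10 in Python (used only for brevity, it is not the language of this book). The names list_imports and read_asciiz are made up for this illustration. The caller supplies import_rva (the VirtualAddress from step 4) and rva_to_offset, a function that converts an RVA into an offset in data; how to write such a converter for a file on disk is a separate matter.

```python
import struct

IMAGE_ORDINAL_FLAG32 = 0x80000000
# OriginalFirstThunk, TimeDateStamp, ForwarderChain, Name1, FirstThunk
DESCRIPTOR = struct.Struct("<IIIII")


def read_asciiz(data, offset):
    """Read a null-terminated ASCII string starting at `offset`."""
    end = data.index(b"\0", offset)
    return data[offset:end].decode("ascii")


def list_imports(data, import_rva, rva_to_offset):
    """Yield (dll name, function name or ordinal) for every import."""
    pos = rva_to_offset(import_rva)                     # step 5
    while True:
        fields = DESCRIPTOR.unpack_from(data, pos)
        if not any(fields):                             # all-zero terminator
            break
        original_first_thunk, _stamp, _chain, name_rva, first_thunk = fields
        dll = read_asciiz(data, rva_to_offset(name_rva))
        # Step 6: prefer OriginalFirstThunk, fall back to FirstThunk.
        thunk_pos = rva_to_offset(original_first_thunk or first_thunk)
        while True:
            thunk, = struct.unpack_from("<I", data, thunk_pos)
            if thunk == 0:                              # end of the RVA array
                break
            if thunk & IMAGE_ORDINAL_FLAG32:            # step 7
                yield dll, thunk & 0xFFFF
            else:                                       # step 8: skip Hint
                yield dll, read_asciiz(data, rva_to_offset(thunk) + 2)
            thunk_pos += 4                              # step 9
        pos += DESCRIPTOR.size                          # step 10
```

Imports by name come out as strings and imports by ordinal as integers, so the caller can tell them apart.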
Example:
This example opens a PE file and reads the names of all import functions of that file
into an edit control. It also shows the values in the IMAGE_IMPORT_DESCRIPTOR struc-
tures.

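.386
.model flat,stdcall
option casemap:none
include \masm32\include\windows.inc
include \masm32\include\kernel32.inc
include \masm32\include\comdlg32.inc
include \masm32\include\user32.inc
includelib \masm32\lib\user32.lib
includelib \masm32\lib\kernel32.lib
includelib \masm32\lib\comdlg32.lib
IDD_MAINDLG equ 101
IDC_EDIT equ 1000
IDM_OPEN equ 40001
IDM_EXIT equ 40003
DlgProc proto :DWORD,:DWORD,:DWORD,:DWORD
ShowImportFunctions proto :DWORD
ShowTheFunctions proto :DWORD,:DWORD
AppendText proto :DWORD,:DWORD
SEH struct
PrevLink dd ? ; the address of the previous seh structure
CurrentHandler dd ? ; the address of the new exception handler
SafeOffset dd ? ; the offset where it's safe to continue execution
PrevEsp dd ? ; the old value in esp
PrevEbp dd ? ; the old value in ebp
SEH ends
.data
AppName db "PE tutorial no.6",0
ofn OPENFILENAME <>
FilterString db "Executable Files (*.exe, *.dll)",0,"*.exe;*.dll",0
db "All Files",0,"*.*",0,0
FileOpenError db "Cannot open the file for reading",0
FileOpenMappingError db "Cannot open the file for memory mapping",0
FileMappingError db "Cannot map the file into memory",0
NotValidPE db "This file is not a valid PE",0
CRLF db 0Dh,0Ah,0
ImportDescriptor db 0Dh,0Ah,"================[ IMAGE_IMPORT_DESCRIPTOR ]=============",0
IDTemplate db "OriginalFirstThunk = %lX",0Dh,0Ah
db "TimeDateStamp = %lX",0Dh,0Ah
db "ForwarderChain = %lX",0Dh,0Ah
db "Name = %s",0Dh,0Ah
db "FirstThunk = %lX",0
NameHeader db 0Dh,0Ah,"Hint Function",0Dh,0Ah
db "-----------------------------------------",0
NameTemplate db "%u %s",0
OrdinalTemplate db "%u (ord.)",0
.data?
buffer db 512 dup(?)
hFile dd ?
hMapping dd ?
pMapping dd ?
ValidPE dd ?
.code
start:
invoke GetModuleHandle,NULL
invoke DialogBoxParam, eax, IDD_MAINDLG,NULL,addr DlgProc, 0
invoke ExitProcess, 0
DlgProc proc hDlg:DWORD, uMsg:DWORD, wParam:DWORD, lParam:DWORD
.if uMsg==WM_INITDIALOG
invoke SendDlgItemMessage,hDlg,IDC_EDIT,EM_SETLIMITTEXT,0,0
.elseif uMsg==WM_CLOSE
invoke EndDialog,hDlg,0
.elseif uMsg==WM_COMMAND
.if lParam==0
mov eax,wParam
.if ax==IDM_OPEN
invoke ShowImportFunctions,hDlg
.else ; IDM_EXIT
invoke SendMessage,hDlg,WM_CLOSE,0,0
.endif
.endif
.else
mov eax,FALSE
ret
.endif
mov eax,TRUE
ret
DlgProc endp
SEHHandler proc C uses edx pExcept:DWORD, pFrame:DWORD, pContext:DWORD, pDispatch:DWORD
mov edx,pFrame
assume edx:ptr SEH
mov eax,pContext
assume eax:ptr CONTEXT
push [edx].SafeOffset ; resume execution at FinalExit
pop [eax].regEip
push [edx].PrevEsp
pop [eax].regEsp
push [edx].PrevEbp
pop [eax].regEbp
mov ValidPE, FALSE
mov eax,ExceptionContinueExecution
ret
SEHHandler endp
ShowImportFunctions proc uses edi hDlg:DWORD
LOCAL seh:SEH
mov ofn.lStructSize,SIZEOF ofn
mov ofn.lpstrFilter, OFFSET FilterString
mov ofn.lpstrFile, OFFSET buffer
mov ofn.nMaxFile,512
mov ofn.Flags, OFN_FILEMUSTEXIST or OFN_PATHMUSTEXIST or OFN_LONGNAMES or OFN_EXPLORER or OFN_HIDEREADONLY
invoke GetOpenFileName, ADDR ofn
.if eax==TRUE
invoke CreateFile, addr buffer, GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL
.if eax!=INVALID_HANDLE_VALUE
mov hFile, eax
invoke CreateFileMapping, hFile, NULL, PAGE_READONLY,0,0,0
.if eax!=NULL
mov hMapping, eax
invoke MapViewOfFile,hMapping,FILE_MAP_READ,0,0,0
.if eax!=NULL
mov pMapping,eax
; install the SEH frame so a bad pointer in a damaged file cannot crash us
assume fs:nothing
push fs:[0]
pop seh.PrevLink
mov seh.CurrentHandler,offset SEHHandler
mov seh.SafeOffset,offset FinalExit
lea eax,seh
mov fs:[0], eax
mov seh.PrevEsp,esp
mov seh.PrevEbp,ebp
mov edi, pMapping
assume edi:ptr IMAGE_DOS_HEADER
.if [edi].e_magic==IMAGE_DOS_SIGNATURE
add edi, [edi].e_lfanew
assume edi:ptr IMAGE_NT_HEADERS
.if [edi].Signature==IMAGE_NT_SIGNATURE
mov ValidPE, TRUE
.else
mov ValidPE, FALSE
.endif
.else
mov ValidPE,FALSE
.endif
FinalExit:
push seh.PrevLink
pop fs:[0]
.if ValidPE==TRUE
invoke ShowTheFunctions, hDlg, edi
.else
invoke MessageBox,0, addr NotValidPE, addr AppName, MB_OK+MB_ICONERROR
.endif
invoke UnmapViewOfFile, pMapping
.else
invoke MessageBox, 0, addr FileMappingError, addr AppName, MB_OK+MB_ICONERROR
.endif
invoke CloseHandle,hMapping
.else
invoke MessageBox, 0, addr FileOpenMappingError, addr AppName, MB_OK+MB_ICONERROR
.endif
invoke CloseHandle, hFile
.else
invoke MessageBox, 0, addr FileOpenError, addr AppName, MB_OK+MB_ICONERROR
.endif
.endif
ret
ShowImportFunctions endp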
AppendText proc hDlg:DWORD,pText:DWORD

invoke SendDlgItemMessage,hDlg,IDC_EDIT,EM_REPLACESEL,0,pText
invoke SendDlgItemMessage,hDlg,IDC_EDIT,EM_REPLACESEL,0,addr CRLF
invoke SendDlgItemMessage,hDlg,IDC_EDIT,EM_SETSEL,-1,0
ret
AppendText endp

RVAToOffset PROC uses edi esi edx ecx pFileMap:DWORD,RVA:DWORD

mov esi,pFileMap
assume esi:ptr IMAGE_DOS_HEADER
add esi,[esi].e_lfanew
assume esi:ptr IMAGE_NT_HEADERS
mov edi,RVA ; edi == RVA
mov edx,esi
add edx,sizeof IMAGE_NT_HEADERS
mov cx,[esi].FileHeader.NumberOfSections
movzx ecx,cx
assume edx:ptr IMAGE_SECTION_HEADER
.while ecx>0 ; check all sections
.if edi>=[edx].VirtualAddress
mov eax,[edx].VirtualAddress
add eax,[edx].SizeOfRawData
.if edi<eax ; The address is in this section
mov eax,[edx].VirtualAddress
sub edi,eax ; edi == offset of the RVA from the start of the section
mov eax,[edx].PointerToRawData
add eax,edi ; eax == file offset
ret
.endif
.endif
add edx,sizeof IMAGE_SECTION_HEADER
dec ecx
.endw
assume edx:nothing
assume esi:nothing
mov eax,edi
ret
RVAToOffset endp
ShowTheFunctions proc uses esi ecx ebx hDlg:DWORD, pNTHdr:DWORD
LOCAL temp[512]:BYTE
invoke SetDlgItemText,hDlg,IDC_EDIT,0
invoke AppendText,hDlg,addr buffer
mov edi,pNTHdr
assume edi:ptr IMAGE_NT_HEADERS
mov edi, [edi].OptionalHeader.DataDirectory[sizeof IMAGE_DATA_DIRECTORY].VirtualAddress
invoke RVAToOffset,pMapping,edi
mov edi,eax
add edi,pMapping
assume edi:ptr IMAGE_IMPORT_DESCRIPTOR
.while !([edi].OriginalFirstThunk==0 && [edi].TimeDateStamp==0 && [edi].ForwarderChain==0 && [edi].Name1==0 && [edi].FirstThunk==0)
invoke AppendText,hDlg,addr ImportDescriptor
invoke RVAToOffset,pMapping, [edi].Name1
mov edx,eax
add edx,pMapping
invoke wsprintf, addr temp, addr IDTemplate, [edi].OriginalFirstThunk,[edi].TimeDateStamp,[edi].ForwarderChain,edx,[edi].FirstThunk
invoke AppendText,hDlg,addr temp
.if [edi].OriginalFirstThunk==0
mov esi,[edi].FirstThunk
.else
mov esi,[edi].OriginalFirstThunk
.endif
invoke RVAToOffset,pMapping,esi
add eax,pMapping
mov esi,eax
invoke AppendText,hDlg,addr NameHeader
.while dword ptr [esi]!=0
test dword ptr [esi],IMAGE_ORDINAL_FLAG32
jnz ImportByOrdinal
invoke RVAToOffset,pMapping,dword ptr [esi]
mov edx,eax
add edx,pMapping
assume edx:ptr IMAGE_IMPORT_BY_NAME
mov cx, [edx].Hint

movzx ecx,cx
invoke wsprintf,addr temp,addr NameTemplate,ecx,addr [edx].Name1
jmp ShowTheText
ImportByOrdinal:
mov edx,dword ptr [esi]
and edx,0FFFFh
invoke wsprintf,addr temp,addr OrdinalTemplate,edx
ShowTheText:
invoke AppendText,hDlg,addr temp
add esi,4
.endw
add edi,sizeof IMAGE_IMPORT_DESCRIPTOR
.endw
ret
ShowTheFunctions endp
end start

Analysis:
The program shows an open file dialog box when the user clicks Open in the menu. It
verifies that the file is a valid PE and then calls ShowTheFunctions.
ShowTheFunctions proc uses esi ecx ebx hDlg:DWORD, pNTHdr:DWORD
LOCAL temp[512]:BYTE
Reserve 512 bytes of stack space for string operation.
invoke SetDlgItemText,hDlg,IDC_EDIT,0
Clear the text in the edit control.
invoke AppendText,hDlg,addr buffer
Insert the name of the PE file into the edit control. AppendText just sends
EM_REPLACESEL messages to append the text to the edit control. Note that it sends
EM_SETSEL with wParam=-1 and lParam=0 to the edit control to move the cursor to the
end of the text.
mov edi,pNTHdr
assume edi:ptr IMAGE_NT_HEADERS
mov edi, [edi].OptionalHeader.DataDirectory[sizeof IMAGE_DATA_DIRECTORY].VirtualAddress
Obtain the RVA of the import symbols. edi at first points to the PE header. We use it to go
to the 2nd member of the data directory array and obtain the value of the VirtualAddress
member.

invoke RVAToOffset,pMapping,edi
mov edi,eax
add edi,pMapping
Here comes one of the pitfalls for newcomers to PE programming. Most of the addresses in
the PE file are RVAs and RVAs are meaningful only when the PE file is loaded into memory
by the PE loader. In our case, we do map the file into memory but not the way the PE loader
does. Thus we cannot use those RVAs directly. Somehow we have to convert those RVAs
into file offsets. I wrote the RVAToOffset function just for this purpose. I won't analyze it in
detail here. Suffice it to say that it checks the submitted RVA against the starting and ending
RVAs of all sections in the PE file and uses the value in the PointerToRawData field in the
IMAGE_SECTION_HEADER structure to convert the RVA to a file offset.
To use this function, you pass it two parameters: the pointer to the memory mapped file and
the RVA you want to convert. It returns the file offset in eax. In the above snippet, we must
add the pointer to the memory mapped file to the file offset to convert it to a virtual address.
Seems complicated, huh? :)
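
Stripped of the register juggling, the conversion is only a few lines. Here is a sketch of the same logic in Python (used only for brevity). The name rva_to_offset and the tuple layout of sections are made up for this illustration: each entry holds the VirtualAddress, SizeOfRawData and PointerToRawData of one section header.

```python
def rva_to_offset(sections, rva):
    """Convert an RVA to a file offset.

    `sections` is a list of (VirtualAddress, SizeOfRawData, PointerToRawData)
    tuples, one per IMAGE_SECTION_HEADER.
    """
    for virtual_address, size_of_raw_data, pointer_to_raw_data in sections:
        # Is the RVA inside the range this section occupies?
        if virtual_address <= rva < virtual_address + size_of_raw_data:
            # Distance from the start of the section, moved to the file.
            return rva - virtual_address + pointer_to_raw_data
    # Not inside any section (e.g. it lies in the headers): return it
    # unchanged, like the fallback at the end of RVAToOffset.
    return rva
```

For a section with VirtualAddress 2000h and PointerToRawData 0A00h, the RVA 2010h becomes the file offset 0A10h.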
assume edi:ptr IMAGE_IMPORT_DESCRIPTOR
.while !([edi].OriginalFirstThunk==0 && [edi].TimeDateStamp==0 && [edi].ForwarderChain==0 && [edi].Name1==0 && [edi].FirstThunk==0)
edi now points to the first IMAGE_IMPORT_DESCRIPTOR structure. We will walk the array
until we find the structure with zeroes in all members which denotes the end of the array.
invoke AppendText,hDlg,addr ImportDescriptor
invoke RVAToOffset,pMapping, [edi].Name1
mov edx,eax
add edx,pMapping
We want to display the values of the current IMAGE_IMPORT_DESCRIPTOR structure in the
edit control. Name1 is different from the other members since it contains the RVA to the name
of the dll. Thus we must convert it to a virtual address first.
invoke wsprintf, addr temp, addr IDTemplate, [edi].OriginalFirstThunk,[edi].TimeDateStamp,[edi].ForwarderChain,edx,[edi].FirstThunk
invoke AppendText,hDlg,addr temp
Display the values of the current IMAGE_IMPORT_DESCRIPTOR.

.if [edi].OriginalFirstThunk==0
mov esi,[edi].FirstThunk
.else
mov esi,[edi].OriginalFirstThunk
.endif
Next we prepare to walk the IMAGE_THUNK_DATA array. Normally we would choose to
use the array pointed to by OriginalFirstThunk. However, some linkers erroneously put 0
in OriginalFirstThunk thus we must check first if the value of OriginalFirstThunk is zero. If
it is, we use the array pointed to by FirstThunk instead.
invoke RVAToOffset,pMapping,esi
add eax,pMapping
mov esi,eax
Again, the value in OriginalFirstThunk/FirstThunk is an RVA. We must convert it to a virtual
address.
invoke AppendText,hDlg,addr NameHeader
.while dword ptr [esi]!=0
Now we are ready to walk the array of IMAGE_THUNK_DATAs to look for the names of
the functions imported from this DLL. We will walk the array until we find an entry which
contains 0.
test dword ptr [esi],IMAGE_ORDINAL_FLAG32
jnz ImportByOrdinal
The first thing we do with the IMAGE_THUNK_DATA is to test it against
IMAGE_ORDINAL_FLAG32. This test checks if the most significant bit of the
IMAGE_THUNK_DATA is 1. If it is, the function is imported by ordinal so we have no need
to process it further. We can extract its ordinal from the low word of the
IMAGE_THUNK_DATA and go on with the next IMAGE_THUNK_DATA dword.
invoke RVAToOffset,pMapping,dword ptr [esi]
mov edx,eax
add edx,pMapping
assume edx:ptr IMAGE_IMPORT_BY_NAME
If the MSB of the IMAGE_THUNK_DATA is 0, it contains the RVA of an
IMAGE_IMPORT_BY_NAME structure. We need to convert it to a virtual address first.
mov cx, [edx].Hint
movzx ecx,cx
invoke wsprintf,addr temp,addr NameTemplate,ecx,addr [edx].Name1
jmp ShowTheText
Hint is a word-sized field. We must convert it to a dword-sized value before submitting it to
wsprintf. And we print both the hint and the function name in the edit control.
ImportByOrdinal:
mov edx,dword ptr [esi]
and edx,0FFFFh
invoke wsprintf,addr temp,addr OrdinalTemplate,edx
In case the function is imported by ordinal only, we zero out the high word and display the
ordinal.
ShowTheText:
invoke AppendText,hDlg,addr temp
add esi,4
After inserting the function name/ordinal into the edit control, we skip to the next
IMAGE_THUNK_DATA.
.endw
add edi,sizeof IMAGE_IMPORT_DESCRIPTOR
When all IMAGE_THUNK_DATA dwords in the array are processed, we skip to the next
IMAGE_IMPORT_DESCRIPTOR to process the import functions from other DLLs.

Appendix:
It would be incomplete if I didn't mention something about bound import. In order to
explain what it is, I need to digress a bit. When the PE loader loads a PE file into memory,
it examines the import table and loads the required DLLs into the process address space.
Then it walks the IMAGE_THUNK_DATA array much like we did and replaces the
IMAGE_THUNK_DATAs with the real addresses of the import functions. This step takes
time. If somehow the programmer can predict the addresses of the functions correctly, the
PE loader doesn't have to fix the IMAGE_THUNK_DATAs each time the PE file is run.
Bound import is the product of that idea.
To put it in simple terms, there is a utility named bind.exe that comes with Microsoft
development tools such as Visual Studio. It examines the import table of a PE file and
replaces the IMAGE_THUNK_DATA dwords in the array pointed to by FirstThunk with the
addresses of the import functions. When the file is loaded, the PE loader must check if the
addresses are valid. If the DLL versions do not match the ones in the PE files or if the DLLs
need to be relocated, the PE loader knows that the precomputed addresses are not valid
thus it must walk the array pointed to by OriginalFirstThunk to calculate the new addresses
of import functions.
Bound import doesn't have much significance in our example because we use
OriginalFirstThunk by default. For more information about the bound import, I recommend
LUEVELSMEYER's pe.txt.

Export Table7
Theory:
When the PE loader runs a program, it loads the associated DLLs into the process address
space. It then extracts information about the import functions from the main program. It uses
the information to search the DLLs for the addresses of the functions to be patched into the
main program. The place in the DLLs where the PE loader looks for the addresses of the
functions is the export table.
When a DLL/EXE exports a function to be used by other DLL/EXE, it can do so in two ways:
it can export the function by name or by ordinal only. Say if there is a function named "GetSy-
sConfig" in a DLL, it can choose to tell the other DLLs/EXEs that if they want to call the func-
tion, they must specify it by its name, ie. GetSysConfig. The other way is to export by ordinal.
What's an ordinal? An ordinal is a 16-bit number that uniquely identifies a function in a partic-
ular DLL. This number is unique only within the DLL it refers to. For example, in the above
example, the DLL can choose to export the function by ordinal, say, 16. Then the other DLLs/
EXEs which want to call this function must specify this number in GetProcAddress. This is
called export by ordinal only.
Export by ordinal only is strongly discouraged because it can cause a maintenance problem
for the DLL. If the DLL is upgraded/updated, the programmer of that DLL cannot alter the
ordinals of the functions else other programs that depend on the DLL will break.
Now we can examine the export structure. As with import table, you can find where the export
table is from looking at the data directory. In this case, the export table is the first member of
the data directory. The export structure is called IMAGE_EXPORT_DIRECTORY. There are
11 members in the structure but only some of them are really used.

Field Name               Meaning

nName                    The actual name of the module. This field is necessary
                         because the name of the file can be changed. If that is
                         the case, the PE loader will use this internal name.
nBase                    A number that you must bias against the ordinals to
                         get the indexes into the address-of-function array.
NumberOfFunctions        Total number of functions/symbols that are exported
                         by this module.
NumberOfNames            Number of functions/symbols that are exported by
                         name. This value is not the number of ALL
                         functions/symbols in the module. For that number, you
                         need to check NumberOfFunctions. This value can
                         be 0. In that case, the module may export by ordinal
                         only. If there is no function/symbol to be exported in
                         the first place, the RVA of the export table in the data
                         directory will be 0.
AddressOfFunctions       An RVA that points to an array of RVAs of the
                         functions/symbols in the module. In short, RVAs to all
                         functions in the module are kept in an array and this
                         field points to the head of that array.
AddressOfNames           An RVA that points to an array of RVAs of the names
                         of functions in the module.
AddressOfNameOrdinals    An RVA that points to a 16-bit array that contains the
                         ordinals associated with the function names in the
                         AddressOfNames array above.
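
The table lists only the members that matter. As a rough sketch of the complete record, the following Python snippet unpacks all 11 members as they are declared in <winnt.h> (40 bytes, little endian). The helper name parse_export_directory and the sample values are made up for illustration; the listing in this chapter does the same job in MASM through the IMAGE_EXPORT_DIRECTORY structure.

```python
import struct

# IMAGE_EXPORT_DIRECTORY as declared in <winnt.h>: 11 members, 40 bytes.
EXPORT_DIR = struct.Struct("<IIHHIIIIIII")
FIELDS = (
    "Characteristics", "TimeDateStamp", "MajorVersion", "MinorVersion",
    "nName", "nBase", "NumberOfFunctions", "NumberOfNames",
    "AddressOfFunctions", "AddressOfNames", "AddressOfNameOrdinals",
)

def parse_export_directory(data, offset=0):
    """Return the 11 members as a dict; `offset` is a file offset, not an RVA."""
    return dict(zip(FIELDS, EXPORT_DIR.unpack_from(data, offset)))

# Hand-built sample record (made-up values, for illustration only).
sample = EXPORT_DIR.pack(0, 0, 0, 0, 0x2000, 200, 3, 2, 0x2028, 0x2034, 0x203C)
info = parse_export_directory(sample)
print(info["nBase"], info["NumberOfFunctions"], info["NumberOfNames"])
```

Note that every Address... member is an RVA, so it still has to be converted before it can be used as an offset into the file on disk.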

Just reading the above table may not give you the real picture of the export table. The simpli-
fied explanation below will clarify the concept.
The export table exists for use by the PE loader. First of all, the module must keep the
addresses of all exported functions somewhere so the PE loader can look them up. It keeps
them in an array that is pointed to by the field AddressOfFunctions. The number of elements
in the array is kept in NumberOfFunctions. Thus if the module exports 40 functions, it must
have 40 members in the array pointed to by AddressOfFunctions and NumberOfFunctions
must contain the value 40. Now if some functions are exported by name, the module must
keep the names in the file. It keeps the RVAs of the names in an array so the PE loader can
look them up. That array is pointed to by AddressOfNames, and the number of names is kept in
NumberOfNames. Think about the job of the PE loader: it knows the names of the functions
and must somehow obtain the addresses of those functions. Up to now, the module has two
arrays, the names and the addresses, but there is no linkage between them. Thus we need
something that relates the names of the functions to their addresses. The PE specification
uses indexes into the address array as that essential linkage. Thus if the PE loader finds the
name it looks for in the name array, it can obtain the index into the address table for that
name too. The indexes are kept in another array (the last one) pointed to by the field
AddressOfNameOrdinals. Since this array exists as the linkage between the names and the
addresses, it must have exactly the same number of elements as the name array, i.e. each
name has one and only one associated address. The reverse is not true: an address
may have several names associated with it. Thus we can have "aliases" that refer to the
same address. To make the linkage work, the name and index arrays must run in parallel,
i.e. the first element in the index array must hold the index for the first name and so on.

AddressOfNames          AddressOfNameOrdinals
      |                          |
RVA of Name 1   <-->     Index of Name 1
     ...                        ...
RVA of Name N   <-->     Index of Name N

An example or two is in order. If we have the name of an exported function and we need to
get its address in the module, we can proceed like this:
1. Go to the PE header.
2. Read the virtual address of the export table in the data directory.
3. Go to the export table and obtain the number of names (NumberOfNames).
4. Walk the arrays pointed to by AddressOfNames and AddressOfNameOrdinals in
   parallel, searching for the matching name. If the name is found in the
   AddressOfNames array, extract the value in the associated element of
   the AddressOfNameOrdinals array. For example, if you find the RVA of the
   matching name in the 77th element of the AddressOfNames array, extract the
   value stored in the 77th element of the AddressOfNameOrdinals array. If you
   have examined NumberOfNames elements without finding a match, you know that
   the name is not in this module.
5. Use the value from the AddressOfNameOrdinals array as the index into the
   AddressOfFunctions array. Say, if the value is 5, extract the value in
   the element at index 5 of the AddressOfFunctions array. That value is the
   RVA of the function.
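
The lookup steps above can be sketched in a few lines of Python. The three lists stand in for the arrays that AddressOfFunctions, AddressOfNames and AddressOfNameOrdinals point to; the function name find_export_by_name and the sample data are hypothetical, and the names are assumed to be already read out of the file as strings.

```python
def find_export_by_name(wanted, names, name_ordinals, functions):
    """names[i] pairs with name_ordinals[i], which is an index into functions."""
    for i, name in enumerate(names):      # at most NumberOfNames iterations
        if name == wanted:
            index = name_ordinals[i]      # same position in the parallel array
            return functions[index]       # RVA of the function
    return None                           # the name is not in this module

functions = [0x1000, 0x1040, 0x10A0]      # AddressOfFunctions array (RVAs)
names = ["Alpha", "AlphaAlias", "Gamma"]  # strings the name RVAs point to
name_ordinals = [0, 0, 2]                 # AddressOfNameOrdinals array

print(hex(find_export_by_name("Gamma", names, name_ordinals, functions)))
```

"Alpha" and "AlphaAlias" both map to index 0, which is the alias case described earlier. The function at index 1 has no name at all.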

Now we can turn our attention to the nBase member of the IMAGE_EXPORT_DIRECTORY
structure. You already know that the AddressOfFunctions array contains the addresses of all
exported symbols in a module, and that the PE loader uses the indexes into this array to find
the addresses of the functions. Let's imagine a scenario where we use the indexes into this
array as the ordinals. Since programmers can specify the starting ordinal number in the .def
file, like 200, there would have to be at least 200 elements in the AddressOfFunctions
array. Furthermore, the first 200 elements would not be used, but they would have to exist so
that the PE loader could use the indexes to find the correct addresses. This is not good at all.
The nBase member exists to solve this problem. If the programmer specifies a starting ordinal
of 200, the value in nBase is 200. When the PE loader reads the value in nBase, it knows that
the first 200 elements do not exist and that it should subtract the value in nBase from the
ordinal to obtain the true index into the AddressOfFunctions array. With the use of nBase,
there is no need to provide 200 empty elements.
Note that nBase doesn't affect the values in the AddressOfNameOrdinals array. Despite the
name "AddressOfNameOrdinals", this array contains the true indexes into the
AddressOfFunctions array, not the ordinals.
With the discussion of nBase out of the way, we can continue to the next example.
Suppose that we have the ordinal of a function and we need to obtain the address of that
function. We can do it like this:
1. Go to the PE header.
2. Obtain the RVA of the export table from the data directory.
3. Go to the export table and obtain the value of nBase.
4. Subtract the value in nBase from the ordinal and you have the index into the
   AddressOfFunctions array.
5. Compare the index with the value in NumberOfFunctions. If the index is greater
   than or equal to the value in NumberOfFunctions, the ordinal is invalid.
6. Use the index to obtain the RVA of the function in the AddressOfFunctions array.
Note that obtaining the address of a function from an ordinal is much easier and faster than
using the name of the function. There is no need to walk the AddressOfNames and
AddressOfNameOrdinals arrays. The performance gain, however, must be balanced against
the difficulty of maintaining the module.
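
For comparison with the lookup by name, here is a minimal Python sketch of the ordinal steps. The function name find_export_by_ordinal and the sample values are made up; the extra test for a negative index is an addition that guards against an ordinal smaller than nBase.

```python
def find_export_by_ordinal(ordinal, base, functions):
    """Bias the ordinal by nBase and index the AddressOfFunctions array."""
    index = ordinal - base
    # len(functions) plays the role of NumberOfFunctions.
    if index < 0 or index >= len(functions):
        return None                       # invalid ordinal
    return functions[index]               # RVA of the function

functions = [0x1000, 0x1040, 0x10A0]      # AddressOfFunctions array (RVAs)
print(hex(find_export_by_ordinal(201, 200, functions)))
```

With nBase set to 200, the valid ordinals here are 200, 201 and 202, and no loop over the names is needed.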

In conclusion, if you want to obtain the address of a function from its name, you need to
walk both the AddressOfNames and AddressOfNameOrdinals arrays to obtain the index into
the AddressOfFunctions array. If you have the ordinal of the function, you can go directly
to the AddressOfFunctions array after the ordinal is biased by nBase.
If a function is exported by name, you can use either its name or its ordinal in
GetProcAddress. But what if the function is exported by ordinal only? We come to that now.
"A function is exported by ordinal only" means the function has no entry in either the
AddressOfNames or the AddressOfNameOrdinals array. Remember the two fields
NumberOfFunctions and NumberOfNames. The existence of these two fields is the evidence
that some functions may not have names. The number of functions must be at least equal
to the number of names. The functions that don't have names are exported by their
ordinals only. For example, if there are 70 functions but only 40 entries in the
AddressOfNames array, there are 30 functions in the module that are exported
by their ordinals only. Now how can we find out which functions are exported by ordinal
only? It's not easy. You must find that out by exclusion, i.e. the entries in the
AddressOfFunctions array that are not referenced by the AddressOfNameOrdinals array contain
the RVAs of the functions that are exported by ordinal only.
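
The search by exclusion can be sketched as follows. This is only an illustration in Python with a made-up function name and sample data. Skipping entries whose RVA is 0 is an assumption not stated in the text: it treats such entries as unused slots left by gaps in the ordinal numbering.

```python
def ordinal_only_exports(base, functions, name_ordinals):
    """Return (ordinal, rva) pairs for functions that have no name entry."""
    named = set(name_ordinals)            # indexes referenced by some name
    return [
        (index + base, rva)
        for index, rva in enumerate(functions)
        if index not in named and rva != 0   # assumption: RVA 0 = unused slot
    ]

functions = [0x1000, 0x1040, 0, 0x10A0]   # AddressOfFunctions array (RVAs)
name_ordinals = [0, 3]                    # AddressOfNameOrdinals array
print(ordinal_only_exports(200, functions, name_ordinals))
```

Adding nBase back to the index yields the ordinal that a caller would have to pass to GetProcAddress.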

Example:
This example is similar to the one in the previous tutorial. However, it displays
the values of some members of IMAGE_EXPORT_DIRECTORY structure and also lists the
RVAs, ordinals, and names of the exported functions. Note that this example doesn't
list the functions that are exported by ordinals only.
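.386
.model flat,stdcall
option casemap:none
include \masm32\include\windows.inc
include \masm32\include\kernel32.inc
include \masm32\include\comdlg32.inc
include \masm32\include\user32.inc
includelib \masm32\lib\user32.lib
includelib \masm32\lib\kernel32.lib
includelib \masm32\lib\comdlg32.lib

IDD_MAINDLG equ 101
IDC_EDIT equ 1000
IDM_OPEN equ 40001
IDM_EXIT equ 40003

DlgProc proto :DWORD,:DWORD,:DWORD,:DWORD
ShowExportFunctions proto :DWORD
ShowTheFunctions proto :DWORD,:DWORD
AppendText proto :DWORD,:DWORD

SEH struct
    PrevLink dd ?
    CurrentHandler dd ?
    SafeOffset dd ?
    PrevEsp dd ?
    PrevEbp dd ?
SEH ends
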
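.data
AppName db "PE tutorial no.7",0
ofn OPENFILENAME <>
FilterString db "Executable Files (*.exe, *.dll)",0,"*.exe;*.dll",0
             db "All Files",0,"*.*",0,0
FileOpenError db "Cannot open the file for reading",0
FileOpenMappingError db "Cannot open the file for memory mapping",0
FileMappingError db "Cannot map the file into memory",0
NotValidPE db "This file is not a valid PE",0
NoExportTable db "No export information in this file",0
CRLF db 0Dh,0Ah,0
ExportTable db 0Dh,0Ah,"======[ IMAGE_EXPORT_DIRECTORY ]======",0Dh,0Ah
            db "Name of the module: %s",0Dh,0Ah
            db "nBase: %lu",0Dh,0Ah
            db "NumberOfFunctions: %lu",0Dh,0Ah
            db "NumberOfNames: %lu",0Dh,0Ah
            db "AddressOfFunctions: %lX",0Dh,0Ah
            db "AddressOfNames: %lX",0Dh,0Ah
            db "AddressOfNameOrdinals: %lX",0Dh,0Ah,0
Header db "RVA Ord. Name",0Dh,0Ah
       db "----------------------------------------------",0
template db "%lX %u %s",0

.data?
buffer db 512 dup(?)
hFile dd ?
hMapping dd ?
pMapping dd ?
ValidPE dd ?

.code
start:
    invoke GetModuleHandle,NULL
    invoke DialogBoxParam, eax, IDD_MAINDLG,NULL,addr DlgProc, 0
    invoke ExitProcess, 0

DlgProc proc hDlg:DWORD, uMsg:DWORD, wParam:DWORD, lParam:DWORD
    .if uMsg==WM_INITDIALOG
        invoke SendDlgItemMessage,hDlg,IDC_EDIT,EM_SETLIMITTEXT,0,0
    .elseif uMsg==WM_CLOSE
        invoke EndDialog,hDlg,0
    .elseif uMsg==WM_COMMAND
        .if lParam==0
            mov eax,wParam
            .if ax==IDM_OPEN
                invoke ShowExportFunctions,hDlg
            .else ; IDM_EXIT
                invoke SendMessage,hDlg,WM_CLOSE,0,0
            .endif
        .endif
    .else
        mov eax,FALSE
        ret
    .endif
    mov eax,TRUE
    ret
DlgProc endp

; Exception handler: resume execution at FinalExit with the saved esp/ebp.
SEHHandler proc C pExcept:DWORD, pFrame:DWORD, pContext:DWORD, pDispatch:DWORD
    mov edx,pFrame
    assume edx:ptr SEH
    mov eax,pContext
    assume eax:ptr CONTEXT
    push [edx].SafeOffset
    pop [eax].regEip
    push [edx].PrevEsp
    pop [eax].regEsp
    push [edx].PrevEbp
    pop [eax].regEbp
    mov ValidPE, FALSE
    mov eax,ExceptionContinueExecution
    ret
SEHHandler endp

ShowExportFunctions proc uses edi hDlg:DWORD
    LOCAL seh:SEH
    mov ofn.lStructSize,SIZEOF ofn
    mov ofn.lpstrFilter, OFFSET FilterString
    mov ofn.lpstrFile, OFFSET buffer
    mov ofn.nMaxFile,512
    mov ofn.Flags, OFN_FILEMUSTEXIST or OFN_PATHMUSTEXIST or OFN_LONGNAMES or OFN_EXPLORER or OFN_HIDEREADONLY
    invoke GetOpenFileName, ADDR ofn
    .if eax==TRUE
        invoke CreateFile, addr buffer, GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL
        .if eax!=INVALID_HANDLE_VALUE
            mov hFile, eax
            invoke CreateFileMapping, hFile, NULL, PAGE_READONLY,0,0,0
            .if eax!=NULL
                mov hMapping, eax
                invoke MapViewOfFile,hMapping,FILE_MAP_READ,0,0,0
                .if eax!=NULL
                    mov pMapping,eax
                    ; install the SEH frame before touching the mapped file
                    assume fs:nothing
                    push fs:[0]
                    pop seh.PrevLink
                    mov seh.CurrentHandler,offset SEHHandler
                    mov seh.SafeOffset,offset FinalExit
                    lea eax,seh
                    mov fs:[0], eax
                    mov seh.PrevEsp,esp
                    mov seh.PrevEbp,ebp
                    mov edi, pMapping
                    assume edi:ptr IMAGE_DOS_HEADER
                    .if [edi].e_magic==IMAGE_DOS_SIGNATURE
                        add edi, [edi].e_lfanew
                        assume edi:ptr IMAGE_NT_HEADERS
                        .if [edi].Signature==IMAGE_NT_SIGNATURE
                            mov ValidPE, TRUE
                        .else
                            mov ValidPE, FALSE
                        .endif
                    .else
                        mov ValidPE,FALSE
                    .endif
FinalExit:
                    push seh.PrevLink
                    pop fs:[0]
                    .if ValidPE==TRUE
                        invoke ShowTheFunctions, hDlg, edi
                    .else
                        invoke MessageBox,0, addr NotValidPE, addr AppName, MB_OK+MB_ICONERROR
                    .endif
                    invoke UnmapViewOfFile, pMapping
                .else
                    invoke MessageBox, 0, addr FileMappingError, addr AppName, MB_OK+MB_ICONERROR
                .endif
                invoke CloseHandle,hMapping
            .else
                invoke MessageBox, 0, addr FileOpenMappingError, addr AppName, MB_OK+MB_ICONERROR
            .endif
            invoke CloseHandle, hFile
        .else
            invoke MessageBox, 0, addr FileOpenError, addr AppName, MB_OK+MB_ICONERROR
        .endif
    .endif
    ret
ShowExportFunctions endp
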
AppendText proc hDlg:DWORD,pText:DWORD
    invoke SendDlgItemMessage,hDlg,IDC_EDIT,EM_REPLACESEL,0,pText
    invoke SendDlgItemMessage,hDlg,IDC_EDIT,EM_REPLACESEL,0,addr CRLF
    invoke SendDlgItemMessage,hDlg,IDC_EDIT,EM_SETSEL,-1,0
    ret
AppendText endp

; Convert an RVA into a pointer inside the memory-mapped file.
RVAToFileMap PROC uses edi esi edx ecx pFileMap:DWORD,RVA:DWORD
    mov esi,pFileMap
    assume esi:ptr IMAGE_DOS_HEADER
    add esi,[esi].e_lfanew
    assume esi:ptr IMAGE_NT_HEADERS
    mov edi,RVA ; edi == RVA
    mov edx,esi
    add edx,sizeof IMAGE_NT_HEADERS ; edx -> first section header
    mov cx,[esi].FileHeader.NumberOfSections
    movzx ecx,cx
    assume edx:ptr IMAGE_SECTION_HEADER
    .while ecx>0
        .if edi>=[edx].VirtualAddress
            mov eax,[edx].VirtualAddress
            add eax,[edx].SizeOfRawData ; eax == end of this section
            .if edi<eax
                mov eax,[edx].VirtualAddress
                sub edi,eax ; edi == offset inside the section
                mov eax,[edx].PointerToRawData
                add eax,edi
                add eax,pFileMap
                ret
            .endif
        .endif
        add edx,sizeof IMAGE_SECTION_HEADER
        dec ecx
    .endw
    assume edx:nothing
    assume esi:nothing
    mov eax,edi ; no section matched: return the RVA unchanged
    ret
RVAToFileMap endp

ShowTheFunctions proc uses esi ecx ebx hDlg:DWORD, pNTHdr:DWORD
    LOCAL temp[512]:BYTE
    LOCAL NumberOfNames:DWORD
    LOCAL Base:DWORD
    mov edi,pNTHdr
    assume edi:ptr IMAGE_NT_HEADERS
    mov edi, [edi].OptionalHeader.DataDirectory.VirtualAddress
    .if edi==0
        invoke MessageBox,0, addr NoExportTable,addr AppName,MB_OK+MB_ICONERROR
        ret
    .endif
    invoke RVAToFileMap,pMapping,edi
    mov edi,eax
    assume edi:ptr IMAGE_EXPORT_DIRECTORY
    mov eax,[edi].NumberOfFunctions
    invoke RVAToFileMap, pMapping,[edi].nName
    invoke wsprintf, addr temp,addr ExportTable, eax, [edi].nBase, \
           [edi].NumberOfFunctions, [edi].NumberOfNames, [edi].AddressOfFunctions, \
           [edi].AddressOfNames, [edi].AddressOfNameOrdinals
    invoke AppendText,hDlg,addr temp
    invoke AppendText,hDlg,addr Header
    push [edi].NumberOfNames
    pop NumberOfNames
    push [edi].nBase
    pop Base
    invoke RVAToFileMap,pMapping,[edi].AddressOfNames
    mov esi,eax
    invoke RVAToFileMap,pMapping,[edi].AddressOfNameOrdinals
    mov ebx,eax
    invoke RVAToFileMap,pMapping,[edi].AddressOfFunctions
    mov edi,eax
    .while NumberOfNames>0
        invoke RVAToFileMap,pMapping,dword ptr [esi]
        mov dx,[ebx]
        movzx edx,dx
        mov ecx,edx
        shl edx,2
        add edx,edi
        add ecx,Base
        invoke wsprintf, addr temp,addr template,dword ptr [edx],ecx,eax
        invoke AppendText,hDlg,addr temp
        dec NumberOfNames
        add esi,4
        add ebx,2
    .endw
    ret
ShowTheFunctions endp
end start

Analysis:
mov edi,pNTHdr
mov edi, [edi].OptionalHeader.DataDirectory.VirtualAddress
.if edi==0
invoke MessageBox,0, addr NoExportTable,addr AppName,MB_OK+MB_ICONERROR
ret
.endif
After the program verifies that the file is a valid PE, it goes to the data directory and obtains
the virtual address of the export table. If the virtual address is zero, the file doesn't have any
exported symbol.
mov eax,[edi].NumberOfFunctions
invoke RVAToFileMap, pMapping,[edi].nName
invoke wsprintf, addr temp,addr ExportTable, eax, [edi].nBase, \
       [edi].NumberOfFunctions, [edi].NumberOfNames, [edi].AddressOfFunctions, \
       [edi].AddressOfNames, [edi].AddressOfNameOrdinals
We display the important information in the IMAGE_EXPORT_DIRECTORY structure in the
edit control.
push [edi].NumberOfNames
pop NumberOfNames
push [edi].nBase
pop Base
Since we want to enumerate all function names, we need to know how many names there
are in the export table. nBase is used when we want to convert the indexes into the
AddressOfFunctions array into ordinals.
invoke RVAToFileMap,pMapping,[edi].AddressOfNames
mov esi,eax
invoke RVAToFileMap,pMapping,[edi].AddressOfNameOrdinals
mov ebx,eax
invoke RVAToFileMap,pMapping,[edi].AddressOfFunctions
mov edi,eax

The addresses of the three arrays are stored in esi, ebx, and edi, ready to be accessed.
.while NumberOfNames>0
Continue until all names are processed.
invoke RVAToFileMap,pMapping,dword ptr [esi]
Since esi points to an array of RVAs of the exported names, dereferencing it gives the
RVA of the current name. We convert that RVA to an address inside the mapped file, to be
used in wsprintf later.
mov dx,[ebx]
movzx edx,dx
mov ecx,edx
add ecx,Base
ebx points to the array of ordinals. Its array elements are word-size. Thus we need to
convert the value into a dword first. edx and ecx contain the index into the
AddressOfFunctions array. We will use edx as the pointer into the AddressOfFunctions array.
We add the value of nBase to ecx to obtain the ordinal number of the function.
shl edx,2
add edx,edi
We multiply the index by 4 (each element in the AddressOfFunctions array is 4 bytes in
size) and then add the address of the AddressOfFunctions array to it. Thus edx points to
the RVA of the function.
invoke wsprintf, addr temp,addr template,dword ptr [edx],ecx,eax
We display the RVA, ordinal, and the name of the function in the edit control.
dec NumberOfNames
add esi,4
add ebx,2
.endw
Update the counter and the addresses of the current elements in AddressOfNames and
AddressOfNameOrdinals arrays. Continue until all names are processed.

The PE file format by Bernd Luevelsmeyer
Preface
The PE ("portable executable") file format is the format of executable binaries (DLLs and
programs) for MS Windows NT, Windows 95 and Win32s; in Windows NT, the drivers are in this
format, too. It can also be used for object files and libraries.
The format was designed by Microsoft and standardized by the TIS (tool interface standard)
Committee (Microsoft, Intel, Borland, Watcom, IBM and others) in 1993, apparently based on
a good knowledge of COFF, the "common object file format" used for object files and
executables on several UNIXes and on VMS.
The Win32 SDK includes a header file <winnt.h> containing #defines and typedefs for the
PE format. I will mention the struct-member-names and #defines as we go.
You may also find the DLL "imagehlp.dll" to be helpful. It is part of Windows NT, but
documentation is scarce. Some of its functions are described in the "Developer Network".

General Layout
At the start of a PE file we find an MS-DOS executable ("stub"); this makes any PE file a
valid MS-DOS executable.
After the DOS-stub there is a 32-bit-signature with the magic number
0x00004550 (IMAGE_NT_SIGNATURE).
Then there is a file header (in the COFF-format) that tells on which machine the binary is
supposed to run, how many sections are in it, the time it was linked, whether it is an exe-
cutable or a DLL and so on. (The difference between executable and DLL in this context
is: a DLL can not be started but only be used by another binary, and a binary cannot link
to an executable).
After that, we have an optional header (it is always there but still called "optional" - COFF
uses an "optional header" for libraries but not for objects, that's why it is called "optional").
This tells us more about how the binary should be loaded: the starting address, the
amount of stack to reserve, the size of the data segment etc.
An interesting part of the optional header is the trailing array of 'data directories'; these
directories contain pointers to data in the 'sections'. If, for example, the binary has an
export directory, you will find a pointer to that directory in the array member
IMAGE_DIRECTORY_ENTRY_EXPORT, and it will point into one of the sections.
Following the headers we find the 'sections', introduced by the 'section headers'.
Essentially, the sections' contents are what you really need to execute a program, and all
the header and directory stuff is just there to help you find it.
Each section has some flags about alignment, what kind of data it contains ("initialized
data" and so on), whether it can be shared etc., and the data itself. Most, but not all, sec-
tions contain one or more directories referenced through the entries of the optional
header's "data directory" array, like the directory of exported functions or the directory of
base relocations. Directoryless types of contents are, for example, "executable code" or
"initialized data".

+-------------------+
| DOS-stub |
+-------------------+
| file-header |
+-------------------+
| optional header |
|- - - - - - - - - -|
| |
| data directories |
| |
+-------------------+
| |
| section headers |
| |
+-------------------+
| |
| section 1 |
| |
+-------------------+
| |
| section 2 |
| |
+-------------------+
| |
| ... |
| |
+-------------------+
| |
| section n |
| |
+-------------------+

DOS-stub and Signature
The concept of a DOS-stub is well-known from the 16-bit-windows-executables (which
were in the "NE" format). The stub is used for OS/2-executables, self-extracting archives
and other applications, too.
For PE-files, it is an MS-DOS 2.0 compatible executable that almost always consists of
about 100 bytes that output an error message such as "this program needs windows NT".
You recognize a DOS-stub by validating the DOS-header, being a struct
IMAGE_DOS_HEADER. The first 2 bytes should be the sequence "MZ" (there is a
#define IMAGE_DOS_SIGNATURE for this WORD). You distinguish a PE binary from
other stubbed binaries by the trailing signature, which you find at the offset given by the
header member 'e_lfanew' (which is 32 bits long beginning at byte offset 60). For OS/2
and windows binaries, the signature is a 16-bit-word; for PE files, it is a 32-bit-longword
aligned at an 8-byte-boundary and having the value IMAGE_NT_SIGNATURE #defined to
be 0x00004550.
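
The two checks just described can be sketched in Python as follows. This is a minimal illustration only: the function name pe_header_offset and the hand-built test image are made up, and nothing beyond the "MZ" bytes, 'e_lfanew' at byte offset 60 and the 32-bit signature is examined.

```python
import struct

def pe_header_offset(data):
    """Return the offset of the PE signature, or None if data is not a PE image."""
    if len(data) < 64 or data[:2] != b"MZ":        # IMAGE_DOS_SIGNATURE
        return None
    (e_lfanew,) = struct.unpack_from("<I", data, 60)
    if e_lfanew + 4 > len(data):                   # signature would lie outside
        return None
    if data[e_lfanew:e_lfanew + 4] != b"PE\0\0":   # 0x00004550, little endian
        return None
    return e_lfanew

# Minimal fake image: 64-byte DOS header with e_lfanew = 64, then the signature.
fake = b"MZ" + bytes(58) + struct.pack("<I", 64) + b"PE\0\0"
print(pe_header_offset(fake))
```

The IMAGE_FILE_HEADER starts 4 bytes after the returned offset.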
File Header
To get to the IMAGE_FILE_HEADER, validate the "MZ" of the DOS-header (1st 2 bytes),
then find the 'e_lfanew' member of the DOS-stub's header and skip that many bytes from
the beginning of the file. Verify the signature you will find there. The file header, a struct
IMAGE_FILE_HEADER, begins immediately after it; the members are described top to
bottom.

The first member is the 'Machine', a 16-bit-value indicating the system the binary is intended
to run on. Known legal values are
0x014c  IMAGE_FILE_MACHINE_I386     Intel 80386 processor or better
0x014d                              Intel 80486 processor or better
0x014e                              Intel Pentium processor or better
0x0160                              R3000 (MIPS) processor, big endian
0x0162  IMAGE_FILE_MACHINE_R3000    R3000 (MIPS) processor, little endian
0x0184  IMAGE_FILE_MACHINE_ALPHA    DEC Alpha AXP processor
0x01F0  IMAGE_FILE_MACHINE_POWERPC  IBM Power PC, little endian
Then we have the 'NumberOfSections', a 16-bit-value. It is the number of sections that follow
the headers. We will discuss the sections later.
Next is a timestamp 'TimeDateStamp' (32 bit), giving the time the file was created. You can
distinguish several versions of the same file by this value, even if the "official" version number
was not altered. (The format of the timestamp is not documented except that it should be
somewhat unique among versions of the same file, but apparently it is 'seconds since Janu-
ary 1 1970 00:00:00' in UTC - the format used by most C compilers for the time_t.)
This timestamp is used for the binding of import directories, which will be discussed later.
Warning: some linkers tend to set this timestamp to absurd values which are not the time of
linking in time_t format as described.
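
If the timestamp really is in time_t format, it can be made readable with a couple of lines. The following Python sketch (the function name link_time is made up) interprets the value as seconds since January 1 1970 00:00:00 UTC; as warned above, the result is meaningless for linkers that store something else in this field.

```python
from datetime import datetime, timezone

def link_time(time_date_stamp):
    """Interpret TimeDateStamp as seconds since 1970-01-01 00:00:00 UTC."""
    return datetime.fromtimestamp(time_date_stamp, tz=timezone.utc)

print(link_time(0).isoformat())   # 1970-01-01T00:00:00+00:00
```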
The members 'PointerToSymbolTable' and 'NumberOfSymbols' (both 32 bit) are used for
debugging information. I don't know how to decipher them, and I've found the pointer to be
always 0.
'SizeOfOptionalHeader' (16 bit) is simply sizeof(IMAGE_OPTIONAL_HEADER). You can use
it to validate the correctness of the PE file's structure.
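
The members described so far, together with 'Characteristics' (covered next), make up the 20 bytes of the IMAGE_FILE_HEADER. As a sketch, they can be unpacked like this in Python; the helper name parse_file_header and the sample values are made up, and pe_offset is assumed to be the value read from 'e_lfanew'.

```python
import struct

# IMAGE_FILE_HEADER: 7 members, 20 bytes, little endian.
FILE_HEADER = struct.Struct("<HHIIIHH")
FIELDS = (
    "Machine", "NumberOfSections", "TimeDateStamp", "PointerToSymbolTable",
    "NumberOfSymbols", "SizeOfOptionalHeader", "Characteristics",
)

def parse_file_header(data, pe_offset):
    """The file header starts right after the 4-byte PE signature."""
    return dict(zip(FIELDS, FILE_HEADER.unpack_from(data, pe_offset + 4)))

# Made-up sample: signature followed by a header for an i386 image, 3 sections.
sample = b"PE\0\0" + FILE_HEADER.pack(0x14C, 3, 0, 0, 0, 224, 0x010F)
header = parse_file_header(sample, 0)
print(hex(header["Machine"]), header["NumberOfSections"])
```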

'Characteristics' is 16 bits and consists of a collection of flags, most of them being valid
only for object files and libraries:
Bit 0 (IMAGE_FILE_RELOCS_STRIPPED) is set if there is no relocation informa-
tion in the file. This refers to relocation information per section in
the sections themselves; it is not used for executables, which have
relocation information in the 'base relocation' directory described
below.
Bit 1 (IMAGE_FILE_EXECUTABLE_IMAGE) is set if the file is executable, i.e. it
is not an object file or a library. This flag may also be set if the
linker attempted to create an executable but failed for some reason, and
keeps the image in order to do e.g. incremental linking the next time.
Bit 2 (IMAGE_FILE_LINE_NUMS_STRIPPED) is set if the line number information
is stripped; this is not used for executable files.
Bit 3 (IMAGE_FILE_LOCAL_SYMS_STRIPPED) is set if there is no information
about local symbols in the file (this is not used for executable files).
Bit 4 (IMAGE_FILE_AGGRESIVE_WS_TRIM) is set if the operating system is
supposed to trim the working set of the running process (the amount of RAM
the process uses) aggressively by paging it out. This should be set if it
is a daemon-like application that waits most of the time and only wakes up
once a day, or the like.
Bits 7 (IMAGE_FILE_BYTES_REVERSED_LO) and 15
(IMAGE_FILE_BYTES_REVERSED_HI) are set if the endianness of the file is
not what the machine would expect, so it must swap bytes before reading.
This is unreliable for executable files (the OS expects executables to
be correctly byte-ordered).
Bit 8 (IMAGE_FILE_32BIT_MACHINE) is set if the machine is expected to be a 32
bit machine. This is always set for current implementations; NT5 may
work differently.
Bit 9 (IMAGE_FILE_DEBUG_STRIPPED) is set if there is no debugging information
in the file. This is unused for executable files. According to other
information ([6]), this bit is called "fixed" and is set if the image can
only run if it is loaded at the preferred load address (i.e. it is not
relocatable).
Bit 10 (IMAGE_FILE_REMOVABLE_RUN_FROM_SWAP) is set if the application may not
run from a removable medium such as a floppy or a CD-ROM. In this case,
the operating system is advised to copy the file to the swapfile and exe-
cute it from there.
Bit 11 (IMAGE_FILE_NET_RUN_FROM_SWAP) is set if the application may not run
from the network. In this case, the operating system is advised to copy
the file to the swapfile and execute it from there.

Bit 12 (IMAGE_FILE_SYSTEM) is set if the file is a system file such as a driver.
This is unused for executable files; it is also not used in all the NT
drivers I inspected.
Bit 13 (IMAGE_FILE_DLL) is set if the file is a DLL.
Bit 14 (IMAGE_FILE_UP_SYSTEM_ONLY) is set if the file is not designed to run on
multiprocessor systems (that is, it will crash there because it relies in
some way on exactly one processor).
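
A small Python sketch shows how such a flag word can be turned back into names. The bit numbers and names are the ones listed above; the function name decode_characteristics and the sample value are made up.

```python
CHARACTERISTICS = {
    0: "IMAGE_FILE_RELOCS_STRIPPED",
    1: "IMAGE_FILE_EXECUTABLE_IMAGE",
    2: "IMAGE_FILE_LINE_NUMS_STRIPPED",
    3: "IMAGE_FILE_LOCAL_SYMS_STRIPPED",
    4: "IMAGE_FILE_AGGRESIVE_WS_TRIM",
    7: "IMAGE_FILE_BYTES_REVERSED_LO",
    8: "IMAGE_FILE_32BIT_MACHINE",
    9: "IMAGE_FILE_DEBUG_STRIPPED",
    10: "IMAGE_FILE_REMOVABLE_RUN_FROM_SWAP",
    11: "IMAGE_FILE_NET_RUN_FROM_SWAP",
    12: "IMAGE_FILE_SYSTEM",
    13: "IMAGE_FILE_DLL",
    14: "IMAGE_FILE_UP_SYSTEM_ONLY",
    15: "IMAGE_FILE_BYTES_REVERSED_HI",
}

def decode_characteristics(value):
    """Return the names of all flags set in the 16-bit Characteristics word."""
    return [name for bit, name in sorted(CHARACTERISTICS.items())
            if value & (1 << bit)]

# 0x2102 has bits 1, 8 and 13 set: a 32-bit executable image that is a DLL.
print(decode_characteristics(0x2102))
```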

Relative Virtual Addresses
The PE format makes heavy use of so-called RVAs. An RVA, aka "relative virtual
address", is used to describe a memory address if you don't know the base address. It is
the value you need to add to the base address to get the linear address. The base
address is the address the PE image is loaded to, and may vary from one invocation to
the next.
Example: suppose an executable file is loaded to address 0x400000 and execution starts
at RVA 0x1560. The effective execution start will then be at the address 0x401560. If the
executable were loaded to 0x100000, the execution start would be 0x101560.
Things become complicated because the parts of the PE-file (the sections) are not neces-
sarily aligned the same way the loaded image is. For example, the sections of the file are
often aligned to 512-byte-borders, but the loaded image is perhaps aligned to 4096-byte-
borders. See 'SectionAlignment' and 'FileAlignment' below.
So to find a piece of information in a PE-file for a specific RVA, you must calculate the
offsets as if the file were loaded, but skip according to the file-offsets. As an example,
suppose you knew the execution starts at RVA 0x1560, and want to disassemble the code
starting there. To find the address in the file, you will have to find out that sections in RAM
are aligned to 4096 bytes and the ".code"-section starts at RVA 0x1000 in RAM and is
16384 bytes long; then you know that RVA 0x1560 is at offset 0x560 in that section. Find
out that the sections are aligned to 512-byte-borders in the file and that ".code" begins at
offset 0x800 in the file, and you know that the code execution start is at byte
0x800+0x560=0xd60 in the file.
Then you disassemble and find an access to a variable at the linear address 0x1051d0.
The linear address will be relocated upon loading the binary and is given on the assump-
tion that the preferred load address is used. You find out that the preferred load address is
0x100000, so we are dealing with RVA 0x51d0. This is in the data section which starts at
RVA 0x5000 and is 2048 bytes long. It begins at file offset 0x4800.
Hence, the variable can be found at file offset 0x4800+0x51d0-0x5000=0x49d0.
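
Both calculations of this example can be reproduced with a short Python sketch. The section list holds the numbers used above as (start RVA, size, file offset); the function name rva_to_file_offset and the tuple layout are made up for illustration. A real tool would read these values from the section headers.

```python
def rva_to_file_offset(rva, sections):
    """Find the section containing rva and translate it to a file offset."""
    for start_rva, size, file_offset in sections:
        if start_rva <= rva < start_rva + size:
            return file_offset + (rva - start_rva)
    return None                       # the RVA lies in no section

sections = [
    (0x1000, 16384, 0x800),           # ".code" section
    (0x5000, 2048, 0x4800),           # data section
]

entry = rva_to_file_offset(0x1560, sections)
variable = rva_to_file_offset(0x1051D0 - 0x100000, sections)
print(hex(entry), hex(variable))
```

The subtraction in the last call removes the preferred load address from the linear address to get the RVA.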

Optional Header
Immediately following the file header is the IMAGE_OPTIONAL_HEADER (which, in spite of
the name, is always there). It contains information about how to treat the PE-file exactly.
We'll again go through the members from top to bottom.
The first 16-bit-word is 'Magic' and has, in all the PE-files I looked into, the value
0x010b.
The next 2 bytes are the version of the linker ('MajorLinkerVersion' and 'MinorLinkerVersion')
that produced the file. These values, again, are unreliable and do not always reflect the linker
version properly. (Several linkers simply don't set this field.)
And, coming to think about it, what good is the version if you have got no idea *which* linker
was used?
The next 3 longwords (32 bit each) are intended to be the size of the executable code
('SizeOfCode'), the size of the initialized data ('SizeOfInitializedData', the so-called "data seg-
ment"), and the size of the uninitialized data ('SizeOfUninitializedData', the so-called "bss
segment"). These values are, again, unreliable (e.g. the data segment may actually be split
into several segments by the compiler or linker), and you get better sizes by inspecting the
'sections' that follow the optional header.
Next is a 32-bit-value that is an RVA. This RVA is the offset to the code's entry point
('AddressOfEntryPoint'). Execution starts here; it is e.g. the address of a DLL's LibMain() or a
program's startup code (which will in turn call main()) or a driver's DriverEntry(). If you dare to
load the image "by hand", you call this address to start the process after you have done all
the fixups and the relocations.
The next 2 32-bit-values are the offsets to the executable code ('BaseOfCode') and the initial-
ized data ('BaseOfData'), both of them RVAs again, and both of them being of little interest
because you get more reliable information by inspecting the 'sections' that follow the head-
ers.
There is no offset to the uninitialized data because, being uninitialized, there is little point in
providing this data in the image.

The next entry is a 32-bit-value giving the preferred (linear) load address ('ImageBase') of
the entire binary, including all headers. This is the address (always a multiple of 64 KB)
the file has been relocated to by the linker; if the binary can in fact be loaded to that
address, the loader doesn't need to relocate the file again, which is a win in loading time.
The preferred load address can not be used if another image has already been loaded to
that address (an "address clash", which happens quite often if you load several DLLs that
are all relocated to the linker's default), or the memory in question has been used for
other purposes (stack, malloc(), uninitialized data, whatever). In these cases, the image
must be loaded to some other address and it needs to be relocated (see 'relocation direc-
tory' below). This has further consequences if the image is a DLL, because then the
"bound imports" are no longer valid, and fixups have to be made to the binary that uses
the DLL - see 'import directory' below.
The next 2 32-bit-values are the alignments of the PE-file's sections in RAM ('Section-
Alignment', when the image has been loaded) and in the file ('FileAlignment'). Usually
both values are 32, or FileAlignment is 512 and SectionAlignment is 4096. Sections will
be discussed later.
The next 2 16-bit-words are the expected operating system version
('MajorOperatingSystemVersion' and 'MinorOperatingSystemVersion' [they _do_ like
self-documenting names at MS]). This version information is intended to be the operating
system's (e.g. NT or Win95) version, as opposed to the subsystem's version (e.g. Win32); it
is often not supplied, or supplied incorrectly. The loader doesn't use it, apparently.
The next 2 16-bit-words are the binary's version, ('MajorImageVersion' and 'MinorImage-
Version'). Many linkers don't set this information correctly and many programmers don't
bother to supply it, so it is better to rely on the version-resource if one exists.
The next 2 16-bit-words are the expected subsystem version ('MajorSubsystemVersion'
and 'MinorSubsystemVersion'). This should be the Win32 version or the POSIX version,
because 16-bit-programs or OS/2-programs won't be in PE-format, obviously. This
subsystem version should be supplied correctly, because it *is* checked and used:
If the application is a Win32-GUI-application and runs on NT4, and the subsystem version
is *not* 4.0, the dialogs won't be 3D-style and certain other features will also work "old-
style" because the application expects to run on NT 3.51, which had the program man-
ager instead of explorer and so on, and NT 4.0 will mimic that behaviour as faithfully as
possible.

Then we have a 'Win32VersionValue' of 32 bits. I don't know what it is good for. It has been 0
in all the PE files that I inspected.
Next is a 32-bit-value giving the amount of memory the image will need, in bytes
('SizeOfImage'). It is the sum of all headers' and sections' lengths if aligned to
'SectionAlignment'. It is a hint to the loader how many pages it will need in order to load
the image.
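
As a sketch of that sum, the following Python snippet rounds each length up to 'SectionAlignment' and adds the results. The function names and sample numbers are made up, and the sketch assumes the sections follow each other without gaps in memory.

```python
def align_up(value, alignment):
    """Round value up to the next multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment

def size_of_image(size_of_headers, section_sizes, section_alignment):
    """Sum of the headers and all sections, each aligned to SectionAlignment."""
    total = align_up(size_of_headers, section_alignment)
    for size in section_sizes:
        total += align_up(size, section_alignment)
    return total

# Headers of 0x400 bytes, a 16384-byte and a 2048-byte section, 4096 alignment.
print(hex(size_of_image(0x400, [16384, 2048], 4096)))
```

With these numbers the headers occupy one page, the first section four pages and the second section one page, giving 0x6000 bytes.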
The next thing is a 32-bit-value giving the total length of all headers including the data direc-
tories and the section headers ('SizeOfHeaders'). It is at the same time the offset from the
beginning of the file to the first section's raw data.
Then we have got a 32-bit-checksum ('CheckSum'). This checksum is, for current versions of
NT, only checked if the image is a NT-driver (the driver will fail to load if the checksum isn't
correct). For other binary types, the checksum need not be supplied and may be 0.
The algorithm to compute the checksum is proprietary to Microsoft, and they won't tell you.
However, several tools of the Win32 SDK will compute and/or patch a valid checksum, and
the function CheckSumMappedFile() in imagehlp.dll will do so too.
The checksum is supposed to prevent loading of damaged binaries that would crash anyway
- and a crashing driver would result in a BSOD, so it is better not to load it at all.

Then there is a 16-bit-word 'Subsystem' that tells in which of the NT-subsystems the
image runs:
IMAGE_SUBSYSTEM_NATIVE (1)
The binary doesn't need a subsystem. This is used for drivers.
IMAGE_SUBSYSTEM_WINDOWS_GUI (2)
The image is a Win32 graphical binary. (It can still open a
console with AllocConsole() but won't get one automatically at
startup.)
IMAGE_SUBSYSTEM_WINDOWS_CUI (3)
The binary is a Win32 console binary. (It will get a console
per default at startup, or inherit the parent's console.)
IMAGE_SUBSYSTEM_OS2_CUI (5)
The binary is an OS/2 console binary. (OS/2 binaries will be in
OS/2 format, so this value will seldom be used in a PE file.)
IMAGE_SUBSYSTEM_POSIX_CUI (7)
The binary uses the POSIX console subsystem.
Windows 95 binaries will always use the Win32 subsystem, so the only legal values for
these binaries are 2 and 3; I don't know if "native" binaries on Windows 95 are possible.
The next thing is a 16-bit-value that tells, if the image is a DLL, when to call the DLL's
entry point ('DllCharacteristics'). This seems not to be used; apparently, the DLL is always
notified about everything.
If bit 0 is set, the DLL is notified about process attachment (i.e. DLL load).
If bit 1 is set, the DLL is notified about thread detachments (i.e. thread
terminations).
If bit 2 is set, the DLL is notified about thread attachments (i.e. thread
creations).
If bit 3 is set, the DLL is notified about process detachment (i.e. DLL
unload).

The next 4 32-bit-values are the size of reserved stack ('SizeOfStackReserve'), the size of ini-
tially committed stack ('SizeOfStackCommit'), the size of the reserved heap ('SizeOfHeapRe-
serve') and the size of the committed heap ('SizeOfHeapCommit').
The 'reserved' amounts are address space (not real RAM) that is reserved for the specific
purpose; at program startup, the 'committed' amount is actually allocated in RAM. The 'com-
mitted' value is also the amount by which the committed stack or heap grows if necessary.
(Other sources claim that the stack will grow in pages, regardless of the 'SizeOfStackCommit'
value. I didn't check this.)
So, as an example, if a program has a reserved heap of 1 MB and a committed heap of 64
KB, the heap will start out at 64 KB and is guaranteed to be enlargeable up to 1 MB. The
heap will grow in 64-KB-chunks.
The 'heap' in this context is the primary (default) heap. A process can create more heaps if
it so wishes.
The stack is the first thread's stack (the one that starts main()). The process can create more
threads which will have their own stacks. DLLs don't have a stack or heap of their own, so the
values are ignored for their images. I don't know if drivers have a heap or a stack of their own,
but I don't think so.
After these stack- and heap-descriptions, we find 32 bits of 'LoaderFlags', which I didn't find a
useful description of. I only found a vague note about setting bits that automatically invoke a
breakpoint or a debugger after loading the image; however, this doesn't seem to work.
Then we find 32 bits of 'NumberOfRvaAndSizes', which is the number of valid entries in the
directories that follow immediately. I've found this value to be unreliable; you might wish to
use the constant IMAGE_NUMBEROF_DIRECTORY_ENTRIES instead, or the lesser of both.
After the 'NumberOfRvaAndSizes' there is an array of
IMAGE_NUMBEROF_DIRECTORY_ENTRIES (16) IMAGE_DATA_DIRECTORYs.
Each of these directories describes the location (32 bits RVA called 'VirtualAddress') and size
(also 32 bit, called 'Size') of a particular piece of information, which is located in one of the
sections that follow the directory entries. For example, the security directory is found at the
RVA and has the size that are given at index 4.

The directories that I know the structure of will be discussed later. Defined directory
indexes are:
IMAGE_DIRECTORY_ENTRY_EXPORT (0)
The directory of exported symbols; mostly used for DLLs.
Described below.
IMAGE_DIRECTORY_ENTRY_IMPORT (1)
The directory of imported symbols; see below.
IMAGE_DIRECTORY_ENTRY_RESOURCE (2)
Directory of resources. Described below.
IMAGE_DIRECTORY_ENTRY_EXCEPTION (3)
Exception directory - structure and purpose unknown.
IMAGE_DIRECTORY_ENTRY_SECURITY (4)
Security directory - structure and purpose unknown.
IMAGE_DIRECTORY_ENTRY_BASERELOC (5)
Base relocation table - see below.
IMAGE_DIRECTORY_ENTRY_DEBUG (6)
Debug directory - contents is compiler dependent. Moreover, many
compilers stuff the debug information into the code section and
don't create a separate section for it.
IMAGE_DIRECTORY_ENTRY_COPYRIGHT (7)
Description string - some arbitrary copyright note or the like.
IMAGE_DIRECTORY_ENTRY_GLOBALPTR (8)
Machine Value (MIPS GP) - structure and purpose unknown.

IMAGE_DIRECTORY_ENTRY_TLS (9)
Thread local storage directory - structure unknown; contains
variables that are declared "__declspec(thread)", i.e.
per-thread global variables.
IMAGE_DIRECTORY_ENTRY_LOAD_CONFIG (10)
Load configuration directory - structure and purpose unknown.
IMAGE_DIRECTORY_ENTRY_BOUND_IMPORT (11)
Bound import directory - see description of import directory.
IMAGE_DIRECTORY_ENTRY_IAT (12)
Import Address Table - see description of import directory.
As an example, if we find at index 7 the 2 longwords 0x12000 and 33, and the load address is
0x10000, we know that the copyright data is at address 0x10000+0x12000 (in whatever sec-
tion there may be), and the copyright note is 33 bytes long. If a directory of a particular type is
not used in a binary, the Size and VirtualAddress are both 0.
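The lookup in this example can be sketched in C. The type 'data_directory' and the function 'directory_address' are made-up names for illustration (the real structure is IMAGE_DATA_DIRECTORY); the sketch also clamps 'NumberOfRvaAndSizes' to 16, following the "lesser of both" advice above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DIR_ENTRIES 16            /* IMAGE_NUMBEROF_DIRECTORY_ENTRIES */

/* Hypothetical mirror of IMAGE_DATA_DIRECTORY: two 32-bit fields. */
typedef struct {
    uint32_t virtual_address;     /* 'VirtualAddress', an RVA */
    uint32_t size;                /* 'Size' in bytes */
} data_directory;

/* Return the address of directory 'index' for an image loaded at
   'image_base', or 0 if the directory is unused or out of range.
   'count' is the header's NumberOfRvaAndSizes. */
uint32_t directory_address(const data_directory *dirs, uint32_t count,
                           uint32_t index, uint32_t image_base)
{
    assert(dirs != NULL);
    if (count > DIR_ENTRIES)
        count = DIR_ENTRIES;      /* do not trust an oversized count */
    if (index >= count)
        return 0;
    if (dirs[index].virtual_address == 0 && dirs[index].size == 0)
        return 0;                 /* directory not used in this binary */
    return image_base + dirs[index].virtual_address;
}
```

With the numbers from the text (index 7, RVA 0x12000, load address 0x10000) the function yields 0x22000.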

Section directories
The sections consist of two major parts: first, a section description (of type
IMAGE_SECTION_HEADER) and then the raw section data. So after the data directories
we find an array of 'NumberOfSections' section headers, ordered by the sections' RVAs.
A section header contains:
An array of IMAGE_SIZEOF_SHORT_NAME (8) bytes that make up the name (ASCII) of
the section. If all of the 8 bytes are used there is no 0-terminator for the string! The name
is typically something like ".data" or ".text" or ".bss". There need not be a leading '.', the
names may also be "CODE" or "IAT" or the like. Please note that the names are not at all
related to the contents of the section. A section named ".code" may or may not contain
the executable code; it may just as well contain the import address table; it may also con-
tain the code *and* the address table *and* the initialized data. To find information in the
sections, you will have to look it up via the data directories of the optional header. Do not
rely on the names, and do not assume that the section's raw data starts at the beginning
of a section.
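Because the name field may lack a 0-terminator, it is safest to copy it into a buffer that is one byte larger. A minimal sketch in C, where 'section_name' is a hypothetical helper and not part of any Windows header:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SHORT_NAME_LEN 8          /* IMAGE_SIZEOF_SHORT_NAME */

/* Copy the 8-byte section name into 'out' (9 bytes) and always
   terminate it, since the header field itself may use all 8 bytes. */
void section_name(const unsigned char raw[SHORT_NAME_LEN],
                  char out[SHORT_NAME_LEN + 1])
{
    assert(raw != NULL && out != NULL);
    memcpy(out, raw, SHORT_NAME_LEN);
    out[SHORT_NAME_LEN] = '\0';
}
```

Shorter names are padded with 0-bytes in the header, so the copied string simply ends earlier.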
The next member of the IMAGE_SECTION_HEADER is a 32-bit-union of 'PhysicalAddress'
and 'VirtualSize'. In an object file, this is the address the contents is relocated to; in
an executable, it is the size of the contents. In fact, the field seems to be unused; there
are linkers that enter the size, and there are linkers that enter the address, and I've also
found a linker that enters a 0, and all the executables run like the gentle wind.
The next member is 'VirtualAddress', a 32-bit-value holding the RVA to the section's data
when it is loaded in RAM.
Then we have got 32 bits of 'SizeOfRawData', which is the size of the section's data
rounded up to the next multiple of 'FileAlignment'.
Next is 'PointerToRawData' (32 bits), which is incredibly useful because it is the offset
from the file's beginning to the section's data. If it is 0, the section's data are not contained
in the file and will be arbitrary at load time.

Then we have got 'PointerToRelocations' (32 bits) and 'PointerToLinenumbers' (also 32 bits),
'NumberOfRelocations' (16 bits) and 'NumberOfLinenumbers' (also 16 bits). All of these are
information that's only used for object files. Executables have a special base relocation direc-
tory, and the line number information, if present at all, is usually contained in a special pur-
pose debugging segment or elsewhere.
The last member of a section header is the 32 bits 'Characteristics', which is a bunch of flags
describing how the section's memory should be treated:
If bit 5 (IMAGE_SCN_CNT_CODE) is set, the section contains executable code.
If bit 6 (IMAGE_SCN_CNT_INITIALIZED_DATA) is set, the section contains data that
gets a defined value before execution starts. In other words: the section's data in
the file is meaningful.
If bit 7 (IMAGE_SCN_CNT_UNINITIALIZED_DATA) is set, this section contains uninitial-
ized data and will be initialized to all-0-bytes before execution starts. This is
normally the BSS.
If bit 9 (IMAGE_SCN_LNK_INFO) is set, the section doesn't contain image data but
comments, description or other documentation. This information is part of an object
file and may be information for the linker, such as which libraries are needed.
If bit 11 (IMAGE_SCN_LNK_REMOVE) is set, the data is part of an object file's sec-
tion that is supposed to be left out when the executable file is linked. Often com-
bined with bit 9.
If bit 12 (IMAGE_SCN_LNK_COMDAT) is set, the section contains "common block data",
which are packaged functions of some sort.
If bit 15 (IMAGE_SCN_MEM_FARDATA) is set, we have far data - whatever that means.
This bit's meaning is unsure.
If bit 17 (IMAGE_SCN_MEM_PURGEABLE) is set, the section's data is purgeable - but I
don't think that this is the same as "discardable", which has a bit of its own, see
below. The same bit is apparently used to indicate 16-bit-information as there is
also a define IMAGE_SCN_MEM_16BIT for it. This bit's meaning is unsure.
If bit 18 (IMAGE_SCN_MEM_LOCKED) is set, the section should not be moved in memory?
Perhaps it indicates there is no relocation information? This bit's meaning is
unsure.
If bit 19 (IMAGE_SCN_MEM_PRELOAD) is set, the section should be paged in before exe-
cution starts? This bit's meaning is unsure.
Bits 20 to 23 specify an alignment that I have no information about. There are
#defines IMAGE_SCN_ALIGN_16BYTES and the like. The only value I've ever seen used is
0, for the default 16-byte alignment. I suspect that this is the alignment of
objects in a library file or the like.
If bit 24 (IMAGE_SCN_LNK_NRELOC_OVFL) is set, the section contains some extended
relocations that I don't know about.

If bit 25 (IMAGE_SCN_MEM_DISCARDABLE) is set, the section's data is not needed
after the process has started. This is the case, for example, with the relocation
information. I've seen it also for startup routines of drivers and services that
are only executed once, and for import directories.
If bit 26 (IMAGE_SCN_MEM_NOT_CACHED) is set, the section's data should not be
cached. Don't ask me why not. Does this mean to switch off the 2nd-level-cache?
If bit 27 (IMAGE_SCN_MEM_NOT_PAGED) is set, the section's data should not be
paged out. This is interesting for drivers.
If bit 28 (IMAGE_SCN_MEM_SHARED) is set, the section's data is shared among all
running instances of the image. If it is e.g. the initialized data of a DLL, all
running instances of the DLL will at any time have the same variable contents.
Note that only the first instance's section is initialized. Sections containing
code are always shared copy-on-write (i.e. the sharing doesn't work if reloca-
tions are necessary).
If bit 29 (IMAGE_SCN_MEM_EXECUTE) is set, the process gets 'execute'-access to
the section's memory.
If bit 30 (IMAGE_SCN_MEM_READ) is set, the process gets 'read'-access to the sec-
tion's memory.
If bit 31 (IMAGE_SCN_MEM_WRITE) is set, the process gets 'write'-access to the
section's memory.
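For a disassembler the three access bits are usually the most interesting ones. The following C sketch turns them into a short "rwx"-style string; the macro names and 'section_access' are invented here, and the values are simply derived from the bit numbers given above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Flag values computed from the bit numbers in the list above. */
#define SCN_MEM_EXECUTE (1u << 29)
#define SCN_MEM_READ    (1u << 30)
#define SCN_MEM_WRITE   (1u << 31)

/* Write a summary like "r-x" into 'out' (at least 4 bytes). */
void section_access(uint32_t characteristics, char out[4])
{
    assert(out != NULL);
    out[0] = (characteristics & SCN_MEM_READ)    ? 'r' : '-';
    out[1] = (characteristics & SCN_MEM_WRITE)   ? 'w' : '-';
    out[2] = (characteristics & SCN_MEM_EXECUTE) ? 'x' : '-';
    out[3] = '\0';
}
```

A typical code section (bits 5, 29 and 30 set, i.e. 0x60000020) comes out as "r-x".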
After the section headers we find the sections themselves. They are, in the file, aligned to
'FileAlignment' bytes (that is, after the optional header and after each section's data there
will be padding bytes) and ordered by their RVAs. When loaded (in RAM), the sections
are aligned to 'SectionAlignment' bytes.
As an example, if the optional header ends at file offset 981 and 'FileAlignment' is 512,
the first section will start at byte 1024. Note that you can find the sections via the 'Pointer-
ToRawData' or the 'VirtualAddress', so there is hardly any need to actually fuss around
with the alignments.
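If you do need the rounding, it can be sketched in C like this. 'align_up' is a made-up helper name, and the sketch assumes the alignment is a power of two, which is what linkers normally emit for 'FileAlignment' and 'SectionAlignment'.

```c
#include <assert.h>
#include <stdint.h>

/* Round 'value' up to the next multiple of 'alignment'.
   Assumption: 'alignment' is a non-zero power of two. */
uint32_t align_up(uint32_t value, uint32_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (value + alignment - 1) & ~(alignment - 1);
}
```

For the example above, align_up(981, 512) gives 1024.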

I will try to make an image of it all:

+-------------------+
| DOS-stub          |
+-------------------+
| file-header       |
+-------------------+
| optional header   |
|- - - - - - - - - -|
|                   |----------------+
| data directories  |                |
|                   |                |
|(RVAs to direc-    |-------------+  |
|tories in sections)|             |  |
|                   |---------+   |  |
|                   |         |   |  |
+-------------------+         |   |  |
|                   |-----+   |   |  |
| section headers   |     |   |   |  |
| (RVAs to section  |--+  |   |   |  |
|  borders)         |  |  |   |   |  |
+-------------------+<-+  |   |   |  |
|                   |     | <-+   |  |
| section data 1    |     |       |  |
|                   |     | <-----+  |
+-------------------+<----+          |
|                   |                |
| section data 2    |                |
|                   | <--------------+
+-------------------+
There is one section header for each section, and each data directory will point to one of the
sections (several data directories may point to the same section, and there may be sections
without data directory pointing to them).

Sections' raw data
general
All sections are aligned to 'SectionAlignment' when loaded in RAM, and 'FileAlignment' in
the file. The sections are described by entries in the section headers: You find the sec-
tions in the file via 'PointerToRawData' and in memory via 'VirtualAddress'; the length is in
'SizeOfRawData'.
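Since the data directories hold RVAs but a disassembler usually reads the file, a common task is translating an RVA into a file offset via the section headers. A minimal C sketch, assuming a hypothetical 'section_info' struct that carries only the three fields named above:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical subset of IMAGE_SECTION_HEADER. */
typedef struct {
    uint32_t virtual_address;     /* 'VirtualAddress' (RVA) */
    uint32_t size_of_raw_data;    /* 'SizeOfRawData' */
    uint32_t pointer_to_raw_data; /* 'PointerToRawData' (file offset) */
} section_info;

/* Translate an RVA into a file offset. Returns 1 and stores the
   offset on success, 0 if no section's raw data covers the RVA. */
int rva_to_file_offset(const section_info *sections, size_t count,
                       uint32_t rva, uint32_t *offset)
{
    assert(offset != NULL);
    for (size_t i = 0; i < count; i++) {
        const section_info *s = &sections[i];
        if (s->pointer_to_raw_data == 0)
            continue;             /* no data in the file (e.g. bss) */
        if (rva >= s->virtual_address &&
            rva - s->virtual_address < s->size_of_raw_data) {
            *offset = s->pointer_to_raw_data + (rva - s->virtual_address);
            return 1;
        }
    }
    return 0;
}
```

The sketch ignores RVAs that fall into the headers; a fuller version would treat those as identical to the file offset.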
There are several kinds of sections, depending on what's contained in them. In most
cases (but not in all) there will be at least one data directory in a section, with a pointer to
it in the optional header's data directory array.
code section
First, I will mention the code section. The section will have, at least, the bits
'IMAGE_SCN_CNT_CODE', 'IMAGE_SCN_MEM_EXECUTE' and
'IMAGE_SCN_MEM_READ' set, and 'AddressOfEntryPoint' will point somewhere into the
section, to the start of the function that the developer wants to execute first.
'BaseOfCode' will normally point to the start of this section, but may point to somewhere
later in the section if some non-code-bytes are placed before the code in the section. Nor-
mally, there will be nothing but executable code in this section, and there will be only one
code section, but don't rely on this. Typical section names are ".text", ".code", "AUTO"
and the like.

data section
The next thing we'll discuss is the initialized variables; this section contains initialized static
variables (like "static int i = 5;"). It will have, at least, the bits
'IMAGE_SCN_CNT_INITIALIZED_DATA', 'IMAGE_SCN_MEM_READ' and
'IMAGE_SCN_MEM_WRITE' set. Some linkers may place constant data into a section of
their own that doesn't have the writeable-bit. If part of the data is shareable, or there are other
peculiarities, there may be more sections with the appropriate section-bits set.
The section, or sections, will be in the range 'BaseOfData' up to
'BaseOfData'+'SizeOfInitializedData'. Typical section names are '.data', '.idata', 'DATA' and so on.
bss section
Then there is the uninitialized data (for static variables like "static int k;"); this section is quite
like the initialized data, but will have a file offset ('PointerToRawData') of 0 indicating its con-
tents is not stored in the file, and 'IMAGE_SCN_CNT_UNINITIALIZED_DATA' is set instead
of 'IMAGE_SCN_CNT_INITIALIZED_DATA' to indicate that the contents should be set to 0-
bytes at load-time. This means, there is a section header but no section in the file; the section
will be created by the loader and consist entirely of 0-bytes. The length will be 'SizeOfUnini-
tializedData'. Typical names are '.bss', 'BSS' and the like.
These were the section data that are *not* pointed to by data directories. Their contents and
structure are supplied by the compiler, not by the linker. (The stack-segment and heap-segment
are not sections in the binary but created by the loader from the stacksize- and
heapsize-entries in the optional header.)

copyright
To begin with a simple directory-section, let's look at the data directory
'IMAGE_DIRECTORY_ENTRY_COPYRIGHT'. The contents is a copyright- or description
string in ASCII (not 0-terminated), like "Gonkulator control application, copyright (c)
1848 Hugendubel & Cie". This string is, normally, supplied to the linker with the command
line or a description file. This string is not needed at runtime and may be discarded. It is
not writeable; in fact, the application doesn't need access at all. So the linker will find out
if there is a discardable non-writeable section already and if not, create one (named
'.descr' or the like). It will then stuff the string into the section and let the copyright-direc-
tory-pointer point to the string. The 'IMAGE_SCN_CNT_INITIALIZED_DATA' bit should
be set.
exported symbols
(Note that the description of the export directory was faulty in versions of this text before
1999-03-12. It didn't describe forwarders, exports by ordinal only, or exports with several
names.)
The next-simplest thing is the export directory,
'IMAGE_DIRECTORY_ENTRY_EXPORT'. This is a directory typically found in DLLs; it
contains the entry points of exported functions (and the addresses of exported objects
etc.). Executables may of course also have exported symbols but usually they don't. The
containing section should be "initialized data" and "readable". It should not be "discard-
able" because the process might call "GetProcAddress()" to find a function's entry point at
runtime. The section is normally called '.edata' if it is a separate thing; often enough, it is
merged into some other section like "initialized data".
The structure of the export table ('IMAGE_EXPORT_DIRECTORY') comprises a header
and the export data, that is: the symbol names, their ordinals and the offsets to their entry
points.

First, we have 32 bits of 'Characteristics' that are unused and normally 0. Then there is a 32-
bit-'TimeDateStamp', which presumably should give the time the table was created in the
time_t-format; alas, it is not always valid (some linkers set it to 0). Then we have 2 16-bit-
words of version-info ('MajorVersion' and 'MinorVersion'), and these, too, are often enough
set to 0.
The next thing is 32 bits of 'Name'; this is an RVA to the DLL name as a 0-terminated ASCII
string. (The name is necessary in case the DLL file is renamed - see "binding" at the import
directory.) Then, we have got a 32-bit-'Base'. We'll come to that in a moment.
The next 32-bit-value is the total number of exported items ('NumberOfFunctions'). In addition
to their ordinal number, items may be exported by one or several names, and the next
32-bit-number is the total number of exported names ('NumberOfNames'). In most cases, each
exported item will have exactly one corresponding name and it will be used by that name, but
an item may have several associated names (it is then accessible by each of them), or it may
have no name, in which case it is only accessible by its ordinal number. The use of unnamed
exports (purely by ordinal) is discouraged, because all versions of the exporting DLL would
have to use the same ordinal numbering, which is a maintenance problem.
The next 32-bit-value 'AddressOfFunctions' is a RVA to the list of exported items. It points to
an array of 'NumberOfFunctions' 32-bit-values, each being a RVA to the exported function or
variable.
There are 2 quirks about this list: First, such an exported RVA may be 0, in which case it is
unused. Second, if the RVA points into the section containing the export directory, this is a
forwarded export. A forwarded export is a pointer to an export in another binary; if it is used,
the pointed-to export in the other binary is used instead. The RVA in this case points, as men-
tioned, into the export directory's section, to a zero-terminated string comprising the name of
the pointed-to DLL and the export name separated by a dot, like "otherdll.exportname", or the
DLL's name and the export ordinal, like "otherdll.#19".
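Splitting such a forwarder string can be sketched in C as follows. The struct 'forwarder' and the function 'parse_forwarder' are hypothetical; the sketch assumes the DLL part contains no dot, so it splits at the first one, and it uses fixed-size buffers for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Result of splitting a forwarder string (illustrative layout). */
typedef struct {
    char dll[64];           /* module part, e.g. "otherdll" */
    char name[128];         /* export name; empty if by ordinal */
    unsigned long ordinal;  /* valid only if by_ordinal is non-zero */
    int by_ordinal;
} forwarder;

/* Split "dll.name" or "dll.#ordinal". Returns 1 on success, 0 if
   the string is malformed or does not fit the buffers. */
int parse_forwarder(const char *s, forwarder *out)
{
    assert(s != NULL && out != NULL);
    const char *dot = strchr(s, '.');
    if (dot == NULL || dot == s || dot[1] == '\0')
        return 0;
    size_t dll_len = (size_t)(dot - s);
    if (dll_len >= sizeof out->dll || strlen(dot + 1) >= sizeof out->name)
        return 0;
    memcpy(out->dll, s, dll_len);
    out->dll[dll_len] = '\0';
    out->name[0] = '\0';
    out->ordinal = 0;
    out->by_ordinal = (dot[1] == '#');
    if (out->by_ordinal) {
        char *end;
        out->ordinal = strtoul(dot + 2, &end, 10);
        return end != dot + 2 && *end == '\0';  /* digits only */
    }
    strcpy(out->name, dot + 1);
    return 1;
}
```

Feeding it "otherdll.#19" yields the module "otherdll" and the ordinal 19.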
Now is the time to explain the export ordinal. An export's ordinal is the index into the
AddressOfFunctions-Array (the 0-based position in this array) plus the 'Base' mentioned
above.
In most cases, the 'Base' is 1, which means the first export has an ordinal of 1, the second
has an ordinal of 2 and so on.

After the 'AddressOfFunctions'-RVA we find a RVA to the array of 32-bit-RVAs to symbol
names 'AddressOfNames', and a RVA to the array of 16-bit-ordinals
'AddressOfNameOrdinals'. Both arrays have 'NumberOfNames' elements. The symbol names may
be missing entirely, in which case the 'AddressOfNames' is 0. Otherwise, the pointed-to arrays
are running parallel, which means their elements at each index belong together. The
'AddressOfNames'-array consists of RVAs to 0-terminated export names; the names are
held in a sorted list (i.e. the first array member is the RVA to the alphabetically smallest
name; this allows efficient searching when looking up an exported symbol by name).
According to the PE specification, the 'AddressOfNameOrdinals'-array has the ordinal
corresponding to each name; however, I've found this array to contain the actual index
into the 'AddressOfFunctions'-Array instead.
I'll draw a picture about the three tables:

AddressOfFunctions
        |
        |
        |
        v
exported RVA with ordinal 'Base'
exported RVA with ordinal 'Base'+1
...
exported RVA with ordinal 'Base'+'NumberOfFunctions'-1

AddressOfNames                      AddressOfNameOrdinals
        |                                   |
        |                                   |
        |                                   |
        v                                   v
RVA to first name            <->    Index of export for first name
RVA to second name           <->    Index of export for second name
...                                 ...
RVA to name 'NumberOfNames'  <->    Index of export for name 'NumberOfNames'

Some examples are in order.
To find an exported symbol by ordinal, subtract the 'Base' to get the index, follow the
'AddressOfFunctions'-RVA to find the exports-array and use the index to find the exported
RVA in the array. If it does not point into the export section, you are done. Otherwise, it points
to a string describing the exporting DLL and the name or ordinal therein, and you have to look
up the forwarded export there.
To find an exported symbol by name, follow the 'AddressOfNames'-RVA (if it is 0 there are no
names) to find the array of RVAs to the export names. Search your name in the list. Use the
name's index in the 'AddressOfNameOrdinals'-Array and get the 16-bit-number correspond-
ing to the found name. According to the PE spec, it is an ordinal and you need to subtract the
'Base' to get the export index; according to my experiences it is the export index and you
don't subtract. Using the export index, you find the export RVA in the 'AddressOfFunctions'-
Array, being either the exported RVA itself or a RVA to a string describing a forwarded export.
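Both lookups can be sketched in C. The struct 'export_view' is a made-up convenience: it assumes the three tables have already been located through their RVAs and the name RVAs resolved to strings. The name lookup treats the 16-bit value as a direct index, as observed above, and the caller still has to check the result for a forwarder.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical in-memory view of an export directory. */
typedef struct {
    uint32_t        base;                /* 'Base' */
    uint32_t        number_of_functions; /* 'NumberOfFunctions' */
    uint32_t        number_of_names;     /* 'NumberOfNames' */
    const uint32_t *functions;           /* AddressOfFunctions array */
    const char    **names;               /* resolved names, sorted */
    const uint16_t *name_ordinals;       /* AddressOfNameOrdinals array */
} export_view;

/* By ordinal: index = ordinal - Base. Returns the export RVA or 0. */
uint32_t export_by_ordinal(const export_view *e, uint32_t ordinal)
{
    assert(e != NULL);
    if (ordinal < e->base || ordinal - e->base >= e->number_of_functions)
        return 0;
    return e->functions[ordinal - e->base];
}

/* By name: binary search over the sorted name table, then use the
   parallel 16-bit entry as index into the functions array. */
uint32_t export_by_name(const export_view *e, const char *wanted)
{
    assert(e != NULL && wanted != NULL);
    uint32_t lo = 0, hi = e->number_of_names;
    while (lo < hi) {
        uint32_t mid = lo + (hi - lo) / 2;
        int cmp = strcmp(wanted, e->names[mid]);
        if (cmp == 0) {
            uint16_t index = e->name_ordinals[mid];
            return index < e->number_of_functions ? e->functions[index] : 0;
        }
        if (cmp < 0)
            hi = mid;
        else
            lo = mid + 1;
    }
    return 0;                            /* name not exported */
}
```

A return value of 0 covers both "not found" and an unused slot, which matches the first quirk mentioned above.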
imported symbols
When the compiler finds a call to a function that is in a different executable (mostly in a DLL),
it will, in the most simplistic case, not know anything about the circumstances and simply out-
put a normal call-instruction to that symbol, the address of which the linker will have to fix, like
it does for any external symbol. The linker uses an import library to look up from which DLL
which symbol is imported, and produces stubs for all the imported symbols, each of which
consists of a jump-instruction; the stubs are the actual call-targets. These jump-instructions
will actually jump to an address that's fetched from the so-called import address table. In
more sophisticated applications (when "__declspec(dllimport)" is used), the compiler knows
the function is imported, and outputs a call to the address that's in the import address table,
bypassing the jump.
Anyway, the address of the function in the DLL is always necessary and will be supplied by
the loader from the exporting DLL's export directory when the application is loaded. The
loader knows which symbols in what libraries have to be looked up and their addresses fixed
by searching the import directory.

I will better give you an example. The calls with or without __declspec(dllimport) look like
this:
source:
int symbol(char *);
__declspec(dllimport) int symbol2(char*);
void foo(void)
{
int i=symbol("bar");
int j=symbol2("baz");
}
assembly:
...
call _symbol ; without declspec(dllimport)
...
call [__imp__symbol2] ; with declspec(dllimport)
...
In the first case (without __declspec(dllimport)), the compiler didn't know that '_symbol'
was in a DLL, so the linker has to provide the function '_symbol'. Since the function isn't
there, it will supply a stub function for the imported symbol, being an indirect jump. The
collection of all import-stubs is called the "transfer area" (also sometimes called a "tram-
poline", because you jump there in order to jump to somewhere else). Typically this trans-
fer area is located in the code section (it is not part of the import directory). Each of the
function stubs is a jump to the actual function in the target DLLs. The transfer area looks
like this:
_symbol: jmp [__imp__symbol]
_other_symbol: jmp [__imp__other_symbol]
...

This means: if you use imported symbols without specifying "__declspec(dllimport)" then the
linker will generate a transfer area for them, consisting of indirect jumps. If you do specify
"__declspec(dllimport)", the compiler will do the indirection itself and a transfer area is not
necessary. (It also means: if you import variables or other stuff you must specify
"__declspec(dllimport)", because a stub with a jmp instruction is appropriate for functions
only.)
In any case the address of symbol 'x' is stored at a location '__imp_x'. All these locations
together comprise the so-called "import address table", which is provided to the linker by the
import libraries of the various DLLs that are used. The import address table is a list of
addresses like this:
__imp__symbol: 0xdeadbeef
__imp__symbol2: 0x40100
__imp__symbol3: 0x300100
...
This import address table is a part of the import directory, and it is pointed to by the
IMAGE_DIRECTORY_ENTRY_IAT directory pointer (although some linkers don't set this
directory entry and it works nevertheless; apparently, the loader can resolve imports without
using the directory IMAGE_DIRECTORY_ENTRY_IAT). The addresses in this table are
unknown to the linker; the linker inserts dummies (RVAs to the function names; see below for
more information) that are patched by the loader at load time using the export directory of the
exporting DLL. The import address table, and how it is found by the loader, will be described
in more detail later in this chapter.
Note that this description is C-specific; there are other application building environments that
don't use import libraries. They all need to generate an import address table, though, which
they use to let their programs access the imported objects and functions. C compilers tend to
use import libraries because it is convenient for them - their linkers use libraries anyway.
Other environments use e.g. a description file that lists the necessary DLL names and func-
tion names (like the "module definition file"), or a declaration-style list in the source.
This is how imports are used by the program's code; now we'll look how an import directory is
made up so the loader can use it.

The import directory should reside in a section that's "initialized data" and "readable". The
import directory is an array of IMAGE_IMPORT_DESCRIPTORs, one for each used DLL.
The list is terminated by an IMAGE_IMPORT_DESCRIPTOR that's entirely filled with
0-bytes.
An IMAGE_IMPORT_DESCRIPTOR is a struct with these members:

OriginalFirstThunk
An RVA (32 bit) pointing to a 0-terminated array of
IMAGE_THUNK_DATAs, each describing one imported function. The
array will never change.
TimeDateStamp
A 32-bit-timestamp that has several purposes. Let's pretend that
the timestamp is 0, and handle the advanced cases later.
ForwarderChain
The 32-bit-index of the first forwarder in the list of imported
functions. Forwarders are also advanced stuff; set to all-bits-1
for beginners.
Name
A 32-bit-RVA to the name (a 0-terminated ASCII string) of the
DLL.
FirstThunk
An RVA (32 bit) to a 0-terminated array of
IMAGE_THUNK_DATAs, each describing one imported function. The
array is part of the import address table and will change.
So each IMAGE_IMPORT_DESCRIPTOR in the array gives you the name of the export-
ing DLL and, apart from the forwarder and timestamp, it gives you 2 RVAs to arrays of
IMAGE_THUNK_DATAs, using 32 bits. (The last member of each array is entirely filled
with 0-bytes to mark the end.)
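Walking the descriptor array up to its all-zero terminator can be sketched in C like this. The struct 'import_descriptor' is a hypothetical mirror of the five 32-bit members listed above, and the 'max' bound is an extra safety measure added here, not something the format prescribes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of IMAGE_IMPORT_DESCRIPTOR. */
typedef struct {
    uint32_t original_first_thunk;
    uint32_t time_date_stamp;
    uint32_t forwarder_chain;
    uint32_t name;
    uint32_t first_thunk;
} import_descriptor;

/* Count descriptors up to the all-zero terminator. 'max' bounds the
   walk so a damaged file without terminator cannot overrun the buffer. */
size_t count_import_descriptors(const import_descriptor *d, size_t max)
{
    assert(d != NULL);
    size_t n = 0;
    while (n < max &&
           (d[n].original_first_thunk | d[n].time_date_stamp |
            d[n].forwarder_chain | d[n].name | d[n].first_thunk) != 0)
        n++;
    return n;
}
```

The result is the number of DLLs the image imports from.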

Each IMAGE_THUNK_DATA is, for now, an RVA to an IMAGE_IMPORT_BY_NAME which
describes the imported function. The interesting point is now, the arrays run parallel, i.e.: they
point to the same IMAGE_IMPORT_BY_NAMEs.
No need to be desperate, I will draw another picture. This is the essential contents of one
IMAGE_IMPORT_DESCRIPTOR:
OriginalFirstThunk      FirstThunk
        |                   |
        |                   |
        |                   |
        V                   V
       0-->   func1     <--0
       1-->   func2     <--1
       2-->   func3     <--2
       3-->   foo       <--3
       4-->   mumpitz   <--4
       5-->   knuff     <--5
       6-->0           0<--6   /* the last RVA is 0! */
where the names in the center are the yet-to-be-discussed IMAGE_IMPORT_BY_NAMEs. Each of
them is a 16-bit-number (a hint) followed by an unspecified amount of bytes, being the
0-terminated ASCII name of the imported symbol.
The hint is an index into the exporting DLL's name table (see export directory above). The
name at that index is tried, and if it doesn't match then a binary search is done to find the
name. (Some linkers don't bother to look up correct hints and simply specify 1 all the time, or
some other arbitrary number. This does no harm; it just makes the first attempt to resolve the
name always fail, enforcing a binary search for each name.)
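The "try the hint, then search" strategy can be sketched in C with the standard bsearch(). The function 'find_name_index' and its parameters are invented for illustration; 'names' stands for the exporting DLL's sorted name table with the RVAs already resolved to strings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* bsearch comparator: key is a string, elem points to a table entry. */
static int cmp_name(const void *key, const void *elem)
{
    return strcmp((const char *)key, *(const char *const *)elem);
}

/* Return the index of 'wanted' in the sorted name table, or -1.
   The hint is tried first; a binary search is the fallback. */
long find_name_index(const char *const *names, uint32_t count,
                     uint16_t hint, const char *wanted)
{
    assert(names != NULL && wanted != NULL);
    if (hint < count && strcmp(names[hint], wanted) == 0)
        return (long)hint;               /* the hint was correct */
    const char *const *hit =
        bsearch(wanted, names, count, sizeof names[0], cmp_name);
    return hit != NULL ? (long)(hit - names) : -1;
}
```

A wrong or out-of-range hint only costs one extra string comparison before the search starts.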
To summarize, if you want to look up information about the imported function "foo" from DLL
"knurr", you first find the entry IMAGE_DIRECTORY_ENTRY_IMPORT in the data directo-
ries, get an RVA, find that address in the raw section data and now have an array of
IMAGE_IMPORT_DESCRIPTORs. Get the member of this array that relates to the DLL
"knurr" by inspecting the strings pointed to by the 'Name's.

When you have found the right IMAGE_IMPORT_DESCRIPTOR, follow its 'OriginalFirst-
Thunk' and get hold of the pointed-to array of IMAGE_THUNK_DATAs; inspect the RVAs
and find the function "foo".
Ok, now, why do we have *two* lists of pointers to the IMAGE_IMPORT_BY_NAMEs?

Because at runtime the application doesn't need the imported functions' names but the
addresses. This is where the import address table comes in again. The loader will look up
each imported symbol in the export-directory of the DLL in question and replace the
IMAGE_THUNK_DATA-element in the 'FirstThunk'-list (which until now also points to the
IMAGE_IMPORT_BY_NAME) with the linear address of the DLL's entry point. Remem-
ber the list of addresses with labels like "__imp__symbol"; the import address table,
pointed to by the data directory IMAGE_DIRECTORY_ENTRY_IAT, is exactly the list
pointed to by 'FirstThunk'. (In case of imports from several DLLs, the import address table
comprises the 'FirstThunk'-Arrays of all the DLLs. The directory entry
IMAGE_DIRECTORY_ENTRY_IAT may be missing, the imports will still work fine.) The
'OriginalFirstThunk'-array remains untouched, so you can always look up the original list
of imported names via the 'OriginalFirstThunk'-list.
The import is now patched with the correct linear addresses and looks like this:
OriginalFirstThunk      FirstThunk
        |                   |
        |                   |
        |                   |
        V                   V
       0-->   func1     0-->  exported func1
       1-->   func2     1-->  exported func2
       2-->   func3     2-->  exported func3
       3-->   foo       3-->  exported foo
       4-->   mumpitz   4-->  exported mumpitz
       5-->   knuff     5-->  exported knuff
       6-->0           0<--6

This was the basic structure, for simple cases. Now we'll learn about tweaks in the import
directories.
First, the bit IMAGE_ORDINAL_FLAG (that is: the MSB) of the IMAGE_THUNK_DATA in the
arrays can be set, in which case there is no symbol-name-information in the list and the sym-
bol is imported purely by ordinal. You get the ordinal by inspecting the lower word of the
IMAGE_THUNK_DATA. The import by ordinals is discouraged; it is much safer to import by
name, because the export ordinals might change if the exporting DLL is not in the expected
version.
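The ordinal test can be sketched in a few lines. This is only an illustration in Python, assuming 32-bit thunks; the helper name decode_thunk is made up for this example.

```python
IMAGE_ORDINAL_FLAG32 = 0x80000000  # the MSB of a 32-bit IMAGE_THUNK_DATA


def decode_thunk(thunk):
    """Classify one 32-bit IMAGE_THUNK_DATA value (hypothetical helper)."""
    if thunk == 0:
        return ("end", None)                # a 0 entry terminates the array
    if thunk & IMAGE_ORDINAL_FLAG32:
        return ("ordinal", thunk & 0xFFFF)  # the ordinal is the lower word
    return ("name_rva", thunk)              # RVA of an IMAGE_IMPORT_BY_NAME
```

A caller would feed it each element of the 'OriginalFirstThunk'-array in turn and stop at "end".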
Second, there are the so-called "bound imports".
Think about the loader's task: when a binary that it wants to execute needs a function from a
DLL, the loader loads the DLL, finds its export directory, looks up the function's RVA and cal-
culates the function's entry point. Then it patches the so-found address into the 'FirstThunk'-
list. Given that the programmer was clever and supplied unique preferred load addresses for
the DLLs that don't clash, we can assume that the functions' entry points will always be the
same. They can be computed and patched into the 'FirstThunk'-list at link-time, and that's
what happens with the "bound imports". (The utility "bind" does this; it is part of the Win32
SDK.)
Of course, one must be cautious: The user's DLL may have a different version, or it may be
necessary to relocate the DLL, thus invalidating the pre-patched 'FirstThunk'-list; in this case,
the loader will still be able to walk the 'OriginalFirstThunk'-list, find the imported symbols and
re-patch the 'FirstThunk'-list. The loader knows that this is necessary if a) the versions of the
exporting DLL don't match or b) the exporting DLL had to be relocated.
To decide whether there were relocations is no problem for the loader, but how to find out if
the versions differ? This is where the 'TimeDateStamp' of the
IMAGE_IMPORT_DESCRIPTOR comes in. If it is 0, the import-list has not been bound, and
the loader must fix the entry points always. Otherwise, the imports are bound, and 'TimeDat-
eStamp' must match the 'TimeDateStamp' of the exporting DLL's 'FileHeader'; if it doesn't
match, the loader assumes that the binary is bound to a "wrong" DLL and will re-patch the
import list.
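The loader's decision for old-style binding, as described above, boils down to a small predicate. The following Python sketch is only an illustration; the function and parameter names are assumptions, and the new-style binding described later is not covered.

```python
def must_repatch(import_stamp, dll_stamp, dll_relocated):
    """Decide whether the 'FirstThunk'-list must be fixed up (old-style binding).

    import_stamp  -- 'TimeDateStamp' of the IMAGE_IMPORT_DESCRIPTOR
    dll_stamp     -- 'TimeDateStamp' in the 'FileHeader' of the loaded DLL
    dll_relocated -- True if the DLL is not at its preferred load address
    """
    if import_stamp == 0:
        return True                      # not bound: always fix the entry points
    # bound: the pre-patched addresses hold only for the same, unrelocated DLL
    return dll_relocated or import_stamp != dll_stamp
```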
There is an additional quirk about "forwarders" in the import-list. A DLL can export a symbol
that's not defined in the DLL but imported from another DLL; such a symbol is said to be for-
warded (see the export directory description above).

Now, obviously you can't tell if the symbol's entry point is valid by looking into the times-
tamp of a DLL that doesn't actually contain the entry point. So the forwarded symbols'
entry points must always be fixed up, for safety reasons. In the import list of a binary,
imports of forwarded symbols need to be found so the loader can patch them.
This is done via the 'ForwarderChain'. It is an index into the thunk-lists; the import at the
indexed position is a forwarded export, and the contents of the 'FirstThunk'-list at this
position is the index of the *next* forwarded import, and so on, until the index is "-1" which
indicates there are no more forwards. If there are no forwarders at all, 'ForwarderChain' is
-1 itself.
This was the so-called "old-style" binding.
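Walking the 'ForwarderChain' is a plain linked-list traversal. Here is a minimal Python sketch, assuming the 'FirstThunk'-list has already been read into a list of 32-bit values; the names are made up for this example, and the guard against broken chains is my addition.

```python
NO_FORWARDER = 0xFFFFFFFF  # "-1" seen as an unsigned 32-bit value


def forwarded_imports(forwarder_chain, first_thunk):
    """Return the thunk indices of all forwarded imports (hypothetical helper)."""
    indices = []
    index = forwarder_chain
    while index != NO_FORWARDER:
        # a corrupt file could point outside the list or loop forever
        if index >= len(first_thunk) or index in indices:
            raise ValueError("corrupt forwarder chain")
        indices.append(index)
        index = first_thunk[index]  # the slot holds the index of the next forward
    return indices
```

For instance, with 'ForwarderChain' 1 and a 'FirstThunk'-list whose slot 1 holds 3 and whose slot 3 holds -1, the forwarded imports are those at positions 1 and 3.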
At this point, we should sum up what we have had so far :-)
Ok, I will assume you have found the IMAGE_DIRECTORY_ENTRY_IMPORT and you
have followed it to find the import-directory, which will be in one of the sections. Now
you're at the beginning of an array of IMAGE_IMPORT_DESCRIPTORs the last of which
will be entirely 0-bytes-filled.
To decipher one of the IMAGE_IMPORT_DESCRIPTORs, you first look into the 'Name'-
field, follow the RVA and thusly find the name of the exporting DLL. Next you decide
whether the imports are bound or not; 'TimeDateStamp' will be non-zero if the imports are
bound. If they are bound, now is a good time to check if the DLL version matches yours
by comparing the 'TimeDateStamp's. Now you follow the 'OriginalFirstThunk'-RVA to go
to the IMAGE_THUNK_DATA-array; walk down this array (it is 0-terminated), and
each member will be the RVA of an IMAGE_IMPORT_BY_NAME (unless the hi-bit is set in
which case you don't have a name but are left with a mere ordinal). Follow the RVA, and
skip 2 bytes (the hint), and now you have got a 0-terminated ASCII-string that's the name
of the imported function.
To find the supplied entry point addresses in case it is a bound import, follow the 'First-
Thunk' and walk it parallel to the 'OriginalFirstThunk'-array; the array-members are the
linear addresses of the entry points (leaving aside the forwarders-topic for a moment).
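The walk summed up above can be sketched as follows. This is a Python illustration under simplifying assumptions: 'image' is a byte buffer in which an RVA can be used directly as an index (as in a mapped image), thunks are 32 bits wide, and forwarders are ignored. The function names are made up for this example.

```python
import struct


def cstring(image, offset):
    """Read a 0-terminated ASCII string starting at 'offset'."""
    end = image.index(b"\0", offset)
    return image[offset:end].decode("ascii")


def read_imports(image, import_rva):
    """Return [(dll_name, is_bound, [name_or_ordinal, ...]), ...]."""
    result = []
    pos = import_rva
    while True:
        # OriginalFirstThunk, TimeDateStamp, ForwarderChain, Name, FirstThunk
        desc = struct.unpack_from("<5I", image, pos)
        if desc == (0, 0, 0, 0, 0):
            break                                   # all-0 descriptor ends the array
        original_first_thunk, stamp, _chain, name_rva, _first_thunk = desc
        symbols = []
        thunk_pos = original_first_thunk
        while True:
            (thunk,) = struct.unpack_from("<I", image, thunk_pos)
            if thunk == 0:
                break                               # the thunk array is 0-terminated
            if thunk & 0x80000000:
                symbols.append(thunk & 0xFFFF)      # import by ordinal
            else:
                symbols.append(cstring(image, thunk + 2))  # skip the 2-byte hint
            thunk_pos += 4
        result.append((cstring(image, name_rva), stamp != 0, symbols))
        pos += 20                                   # size of one descriptor
    return result
```

For a file on disk you would first have to translate each RVA into a file offset via the section headers; that step is left out here.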

There is one thing I didn't mention until now: Apparently there are linkers that exhibit a bug
when they build the import directory (I've found this bug in binaries built by a Borland C linker).
These linkers set the 'OriginalFirstThunk' in the IMAGE_IMPORT_DESCRIPTOR to 0 and
create only the 'FirstThunk'-array. Obviously, such import directories cannot be bound (else
the information necessary to re-fix the imports would be lost - you couldn't find the function
names). In this case, you will have to follow the 'FirstThunk'-array to get the imported symbol
names, and you will never have pre-patched entry point addresses. I have found a TIS
document ([6]) describing the import directory in a way that is compatible with this bug, so that
paper may be the origin of the bug.
The TIS document specifies:
IMPORT FLAGS
TIME/DATE STAMP
MAJOR VERSION - MINOR VERSION
NAME RVA
IMPORT LOOKUP TABLE RVA
IMPORT ADDRESS TABLE RVA
as opposed to the structure used elsewhere:
OriginalFirstThunk
TimeDateStamp
ForwarderChain
Name
FirstThunk
The last tweak about the import directories is the so-called "new style" binding (it is described
in [3]), which can also be done with the "bind"-utility. When this is used, the 'TimeDateStamp'
is set to all-bits-1 and there is no forwarderchain; all imported symbols get their address
patched, whether they are forwarded or not. Still, you need to know the DLLs' version, and
you need to distinguish forwarded symbols from ordinary ones. For this purpose, the
IMAGE_DIRECTORY_ENTRY_BOUND_IMPORT directory is created. This will, as far as I
could find out, *not* be in a section but in the header, after the section headers and before the
first section. (Hey, I didn't invent this, I'm only describing it!) This directory tells you, for each
used DLL, from which other DLLs there are forwarded exports.

The structure is an IMAGE_BOUND_IMPORT_DESCRIPTOR, comprising (in this order):

A 32-bit number, giving you the 'TimeDateStamp' of the DLL;
a 16-bit-number 'OffsetModuleName', being the offset from the beginning
of the directory to the 0-terminated name of the DLL;
a 16-bit-number 'NumberOfModuleForwarderRefs' giving you the number of
DLLs that this DLL uses for its forwarders.
Immediately following this struct you find 'NumberOfModuleForwarderRefs' structs that tell
you the names and versions of the DLLs that this DLL forwards from. These structs are
'IMAGE_BOUND_FORWARDER_REF's: A 32-bit-number 'TimeDateStamp'; a
16-bit-number 'OffsetModuleName', being the offset from the beginning of the directory to the
0-terminated name of the forwarded-from DLL; 16 unused bits.
Following the 'IMAGE_BOUND_FORWARDER_REF's is the next
'IMAGE_BOUND_IMPORT_DESCRIPTOR' and so on; the list is terminated by an
all-0-bits-IMAGE_BOUND_IMPORT_DESCRIPTOR.
Sorry for the inconvenience, but that's what it looks like :-)
Now, if you have a new-bound import directory, you load all the DLLs, use the directory
pointer IMAGE_DIRECTORY_ENTRY_BOUND_IMPORT to find the
IMAGE_BOUND_IMPORT_DESCRIPTOR, scan through it and check if the 'TimeDateS-
tamp's of the loaded DLLs match the ones given in this directory. If not, fix them in the
'FirstThunk'-array of the import directory.
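Reading the bound import directory can be sketched like this. The Python below is only an illustration; it assumes 'directory' holds the raw bytes of the IMAGE_DIRECTORY_ENTRY_BOUND_IMPORT data (so the name offsets can be used as indices), and the function name is made up.

```python
import struct


def read_bound_imports(directory):
    """Return [(dll_name, stamp, [(forwarder_dll, stamp), ...]), ...]."""

    def name_at(offset):
        # names are 0-terminated, offsets count from the directory start
        end = directory.index(b"\0", offset)
        return directory[offset:end].decode("ascii")

    result = []
    pos = 0
    while True:
        # TimeDateStamp, OffsetModuleName, NumberOfModuleForwarderRefs
        stamp, name_off, n_refs = struct.unpack_from("<IHH", directory, pos)
        if (stamp, name_off, n_refs) == (0, 0, 0):
            break                               # all-0 descriptor ends the list
        pos += 8
        refs = []
        for _ in range(n_refs):
            # TimeDateStamp, OffsetModuleName, 16 unused bits
            ref_stamp, ref_off, _unused = struct.unpack_from("<IHH", directory, pos)
            refs.append((name_at(ref_off), ref_stamp))
            pos += 8
        result.append((name_at(name_off), stamp, refs))
    return result
```

Each returned stamp would then be compared against the 'TimeDateStamp' of the DLL that was actually loaded.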

resources
The resources, such as dialog boxes, menus, icons and so on, are stored in the data direc-
tory pointed to by IMAGE_DIRECTORY_ENTRY_RESOURCE. It is in a section that has, at
least, the bits 'IMAGE_SCN_CNT_INITIALIZED_DATA' and 'IMAGE_SCN_MEM_READ' set.
A resource base is an 'IMAGE_RESOURCE_DIRECTORY'; it contains several
'IMAGE_RESOURCE_DIRECTORY_ENTRY's each of which in turn may point to an
'IMAGE_RESOURCE_DIRECTORY'. This way, you get a tree of
'IMAGE_RESOURCE_DIRECTORY's with 'IMAGE_RESOURCE_DIRECTORY_ENTRY's
as leaves; these leaves point to the actual resource data.
In real life, the situation is somewhat relaxed. Normally you won't find convoluted trees you
can't possibly sort out. The hierarchy is, normally, like this: one directory is the root. It points
to directories, one for each resource type. These directories point to subdirectories, each of
which will have a name or an ID and point to a directory of the languages provided for this
resource; for each language you will find one resource entry, which will finally point to the
data. (Note that multi-language-resources don't work on Win95, which always uses the same
resource if it is available in several languages - I didn't check which one, but I guess it's the
first it encounters. They do work on NT.)
The tree, without the pointer to the data, may look like this:
                        (root)
                           |
          +----------------+------------------+
          |                |                  |
        menu            dialog              icon
          |                |                  |
    +-----+-----+        +-+----+           +-+----+----+
    |           |        |      |           |      |    |
  "main"     "popup"  0x10  "maindlg"     0x100  0x110 0x120
    |           |        |      |           |      |    |
+---+-+         |        |      |           |      |    |
|     |      default  english default     def.   def.  def.
german english

An IMAGE_RESOURCE_DIRECTORY comprises:
32 bits of unused flags called 'Characteristics';
32 bits 'TimeDateStamp' (again in the common time_t representation),
giving you the time the resource was created (if the entry is set);
16 bits 'MajorVersion' and 16 bits 'MinorVersion', thusly allowing you
to maintain several versions of the resource;
16 bits 'NumberOfNamedEntries' and another 16 bits 'NumberOfIdEntries'.
Immediately following such a structure are
'NumberOfNamedEntries'+'NumberOfIdEntries' structs which are of the
format 'IMAGE_RESOURCE_DIRECTORY_ENTRY', those with the names coming first.
They may point to further 'IMAGE_RESOURCE_DIRECTORY's or they point to
the actual resource data.
An IMAGE_RESOURCE_DIRECTORY_ENTRY consists of:
32 bits giving you the id of the resource or the directory it describes;
32 bits offset to the data or offset to the next sub-directory.
The meaning of the id depends on the level in the tree; the id may be a number (if the hi-
bit is clear) or a name (if the hi-bit is set). If it is a name, the lower 31 bits are the offset
from the beginning of the resource section's raw data to the name (the name consists of
16 bits length and trailing wide characters, in unicode, not 0-terminated).

If you are in the root-directory, the id, if it is a number, is the resource-type:

1: cursor
2: bitmap
3: icon
4: menu
5: dialog
6: string table
7: font directory
8: font
9: accelerators
10: unformatted resource data
11: message table
12: group cursor
14: group icon
16: version information
Any other number is user-defined. Any resource-type with a type-name is always user-
defined.
If you are one level deeper, the id is the resource-id (or resource-name).
If you are another level deeper, the id must be a number, and it is the language-id of the
specific instance of the resource; for example, you can have the same dialog in Australian
English, Canadian French and Swiss German localized forms, and they all share the same
resource-id. The system will choose the dialog to load based on the thread's locale, which in
turn will usually reflect the user's "regional setting". (If the resource cannot be found for the
thread locale, the system will first try to find a resource for the locale using a neutral
sublanguage, e.g. it will look for standard French instead of the user's Canadian French; if it
still can't be found, the instance with the smallest language id will be used. As noted, all this
works only on NT.) To decipher the language id, split it into the primary language id and the
sublanguage id using the macros PRIMARYLANGID() and SUBLANGID(), giving you the bits
0 to 9 or 10 to 15, respectively. The values are defined in the file "winresrc.h".
Language-resources are only supported for accelerators, dialogs, menus, rcdata or
stringtables; other resource-types should be LANG_NEUTRAL/SUBLANG_NEUTRAL.
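The two macros are simple bit operations. As a rough Python equivalent (the function names are mine; the bit layout is the one given above):

```python
def primary_lang_id(lang_id):
    """Bits 0 to 9 of the language id, like PRIMARYLANGID()."""
    return lang_id & 0x3FF


def sub_lang_id(lang_id):
    """Bits 10 to 15 of the language id, like SUBLANGID()."""
    return (lang_id >> 10) & 0x3F


# 0x0C0C is the language id for Canadian French:
# primary language 0x0C (French), sublanguage 0x03 (Canadian)
print(hex(primary_lang_id(0x0C0C)), hex(sub_lang_id(0x0C0C)))
```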

To find out whether the next level below a resource directory is another directory, you
inspect the hi-bit of the offset. If it is set, the remaining 31 bits are the offset from the
beginning of the resource section's raw data to the next directory, again in the format
IMAGE_RESOURCE_DIRECTORY with trailing
IMAGE_RESOURCE_DIRECTORY_ENTRYs.
If the bit is clear, the offset is the distance from the beginning of the resource section's raw
data to the resource's raw data description, an IMAGE_RESOURCE_DATA_ENTRY. It
consists of 32 bits 'OffsetToData' (despite its name this is an RVA to the raw data, that is,
relative to the image base and not to the beginning of the resource section), 32 bits of 'Size'
of the data, 32 bits 'CodePage' and 32 unused bits. (The use of codepages is discouraged,
you should use the 'language'-feature to support multiple locales.)
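Decoding one directory level, including the two hi-bit tests, might look like this. It is a Python sketch under the assumption that 'section' holds the raw data of the resource section; the function name and the shape of the result are made up for this example.

```python
import struct

HI_BIT = 0x80000000


def parse_directory(section, offset=0):
    """Return [(id_or_name, is_subdirectory, target_offset), ...]."""
    # Characteristics, TimeDateStamp, MajorVersion, MinorVersion,
    # NumberOfNamedEntries, NumberOfIdEntries
    header = struct.unpack_from("<IIHHHH", section, offset)
    n_named, n_ids = header[4], header[5]
    entries = []
    pos = offset + 16                        # the entries follow the 16-byte header
    for _ in range(n_named + n_ids):
        ident, target = struct.unpack_from("<II", section, pos)
        if ident & HI_BIT:
            # name: 16 bits length, then that many UTF-16 characters, no terminator
            name_off = ident & ~HI_BIT
            (length,) = struct.unpack_from("<H", section, name_off)
            raw = bytes(section[name_off + 2:name_off + 2 + 2 * length])
            ident = raw.decode("utf-16-le")
        entries.append((ident, bool(target & HI_BIT), target & ~HI_BIT))
        pos += 8
    return entries
```

Calling it again with the target offset of every entry marked as a subdirectory walks the whole tree; the other entries lead to an IMAGE_RESOURCE_DATA_ENTRY.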
The raw data format depends on the resource type; descriptions can be found in the MS
SDK documentation. Note that any string in resources is always in UNICODE except for
user defined resources, which are in the format the developer chooses, obviously.

relocations
The last data directory I will describe is the base relocation directory. It is pointed to by the
IMAGE_DIRECTORY_ENTRY_BASERELOC entry in the data directories of the optional
header. It is typically contained in a section of its own, with a name like ".reloc" and the bits
IMAGE_SCN_CNT_INITIALIZED_DATA, IMAGE_SCN_MEM_DISCARDABLE and
IMAGE_SCN_MEM_READ set.
The relocation data is needed by the loader if the image cannot be loaded to the preferred
load address 'ImageBase' mentioned in the optional header. In this case, the fixed addresses
supplied by the linker are no longer valid, and the loader has to apply fixups for absolute
addresses used for locations of static variables, string literals and so on.
The relocation directory is a sequence of chunks. Each chunk contains the relocation
information for 4 KB of the image. A chunk starts with an 'IMAGE_BASE_RELOCATION' struct.
It consists of 32 bits 'VirtualAddress' and 32 bits 'SizeOfBlock'. It is followed by the chunk's
actual relocation data, being 16 bits each.
The 'VirtualAddress' is the base RVA that the relocations of this chunk need to be applied to;
the 'SizeOfBlock' is the size of the entire chunk in bytes.
The number of trailing relocations is ('SizeOfBlock'-sizeof(IMAGE_BASE_RELOCATION))/2

The relocation information ends when you encounter an IMAGE_BASE_RELOCATION struct
with a 'VirtualAddress' of 0.
Each 16-bit-relocation information consists of the relocation position in the lower 12 bits and a
relocation type in the high 4 bits. To get the relocation RVA, you need to add the
IMAGE_BASE_RELOCATION's 'VirtualAddress' to the 12-bit-position. The type is one of:
IMAGE_REL_BASED_ABSOLUTE (0)
This is a no-op; it is used to align the chunk to a 32-bits-
border. The position should be 0.
IMAGE_REL_BASED_HIGH (1)
The high 16 bits of the relocation must be applied to the 16
bits of the WORD pointed to by the offset, which is the high
word of a 32-bit-DWORD.
IMAGE_REL_BASED_LOW (2)
The low 16 bits of the relocation must be applied to the 16
bits of the WORD pointed to by the offset, which is the low
word of a 32-bit-DWORD.
IMAGE_REL_BASED_HIGHLOW (3)
The entire 32-bit-relocation must be applied to the entire 32
bits in question. This (and the no-op '0') is the only
relocation type I've actually found in binaries.
IMAGE_REL_BASED_HIGHADJ (4)
This is one for the tough. Read yourself (from [6]) and make
sense out of it if you can:
"Highadjust. This fixup requires a full 32-bit value. The high
16-bits is located at Offset, and the low 16-bits is located in
the next Offset array element (this array element is included in
the Size field). The two need to be combined into a signed
variable. Add the 32-bit delta. Then add 0x8000 and store the
high 16-bits of the signed variable to the 16-bit field at
Offset."
IMAGE_REL_BASED_MIPS_JMPADDR (5)
Unknown
IMAGE_REL_BASED_SECTION (6)
Unknown
IMAGE_REL_BASED_REL32 (7)
Unknown

As an example, if you find the relocation information to be
0x00004000 (32 bits, starting RVA)
0x00000010 (32 bits, size of chunk)
0x3012 (16 bits reloc data)
0x3080 (16 bits reloc data)
0x30f6 (16 bits reloc data)
0x0000 (16 bits reloc data)
0x00000000 (next chunk's RVA)
0xff341234
you know the first chunk describes relocations starting at RVA 0x4000 and is 16 bytes long.
Because the header uses 8 bytes and one relocation uses 2 bytes, there are (16-8)/2=4
relocations in the chunk. The first relocation is to be applied to the DWORD at 0x4012, the next
to the DWORD at 0x4080, and the third to the DWORD at 0x40f6. The last relocation is a
no-op. The next chunk has an RVA of 0 and finishes the list.
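The chunk walk from this example can be written down as a short routine. The Python below is a sketch, assuming 'data' holds the raw bytes of the relocation directory; the function name is made up, and the check on the block size is my addition.

```python
import struct


def read_relocations(data):
    """Return [(type, rva), ...] for all chunks of a base relocation directory."""
    result = []
    pos = 0
    while pos + 8 <= len(data):
        base_rva, size_of_block = struct.unpack_from("<II", data, pos)
        if base_rva == 0:
            break                              # a 'VirtualAddress' of 0 ends the list
        if size_of_block < 8:
            raise ValueError("corrupt relocation chunk")
        count = (size_of_block - 8) // 2       # 8 bytes header, 2 bytes per entry
        entries = struct.unpack_from("<%dH" % count, data, pos + 8)
        for entry in entries:
            # high 4 bits: type, low 12 bits: position within the chunk
            result.append((entry >> 12, base_rva + (entry & 0x0FFF)))
        pos += size_of_block
    return result


example = struct.pack("<II4H", 0x4000, 0x10, 0x3012, 0x3080, 0x30F6, 0x0000)
example += struct.pack("<II", 0, 0xFF341234)
for reloc_type, rva in read_relocations(example):
    print(reloc_type, hex(rva))
```

Fed with the example data, it yields three HIGHLOW relocations (type 3) at 0x4012, 0x4080 and 0x40f6, plus the type 0 no-op.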
Now, how do you do a relocation? You know that the image was linked for the preferred
load address 'ImageBase' in the optional header; you also know the address you did load the
image to. If they match, you don't need to do anything. If they don't match, you calculate the
difference actual_base-preferred_base and add that value (signed, it may be negative) to the
values stored at the relocation positions, which you will find with the method described above.
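For the common type IMAGE_REL_BASED_HIGHLOW, applying one relocation comes down to a 32-bit addition. A minimal Python sketch, assuming 'image' is a writable buffer in which RVAs can be used as indices; the function name is made up.

```python
import struct


def apply_highlow(image, rva, preferred_base, actual_base):
    """Add the load delta to the DWORD stored at 'rva' (type 3 relocation)."""
    delta = actual_base - preferred_base           # may be negative
    (value,) = struct.unpack_from("<I", image, rva)
    # mask to 32 bits so a negative delta wraps around correctly
    struct.pack_into("<I", image, rva, (value + delta) & 0xFFFFFFFF)
```

For example, a stored address 0x00401234 in an image linked for 0x00400000 but loaded at 0x10000000 becomes 0x10001234.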

Acknowledgments
Thanks go to David Binette for his debugging and proof-reading. (The remaining errors
are entirely mine.) Also thanks to wotsit.org for letting me put the file on their site.
Copyright
This text is copyright 1999 by B. Luevelsmeyer. It is freeware, and you may use it for any
purpose but on your own risk. It contains errors and it is incomplete. You have been
warned.
Bug reports
Send any bug reports (or other comments) to bernd.luevelsmeyer@iplan.heitec.net

Literature
[1]
"Peering Inside the PE: A Tour of the Win32 Portable Executable File
Format" (M. Pietrek), in: Microsoft Systems Journal 3/1994
[2]
"Why to Use _declspec(dllimport) & _declspec(dllexport) In Code", MS
Knowledge Base Q132044
[3]
"Windows Q&A" (M. Pietrek), in: Microsoft Systems Journal 8/1995
[4]
"Writing Multiple-Language Resources", MS Knowledge Base Q89866
[5]
"The Portable Executable File Format from Top to Bottom" (Randy Kath),
in: Microsoft Developer Network
[6]
Tool Interface Standard (TIS) Formats Specification for Windows Version
1.0 (Intel Order Number 241597, Intel Corporation 1993)

Appendix: hello world
In this appendix I will show how to make programs by hand. The example will use Intel-
assembly, because I don't speak DEC Alpha.
The program will be the equivalent of

#include <stdio.h>
int main(void)
{
puts("hello, world");
return 0;
}
First, I translate it to use Win32 functions instead of the C runtime:

#define STD_OUTPUT_HANDLE -11UL
#define hello "hello, world\n"

__declspec(dllimport) unsigned long __stdcall
GetStdHandle(unsigned long hdl);

__declspec(dllimport) unsigned long __stdcall
WriteConsoleA(unsigned long hConsoleOutput,
              const void *buffer,
              unsigned long chrs,
              unsigned long *written,
              unsigned long unused
             );

static unsigned long written;

void startup(void)
{
    WriteConsoleA(GetStdHandle(STD_OUTPUT_HANDLE), hello, sizeof(hello)-1, &written, 0);
    return;
}

Now I will fumble out the assembly:

startup:
; parameters for WriteConsole(), backwards
6A 00 push 0x00000000
68 ?? ?? ?? ?? push offset _written
6A 0D push 0x0000000d
68 ?? ?? ?? ?? push offset hello
; parameter for GetStdHandle()
6A F5 push 0xfffffff5
2E FF 15 ?? ?? ?? ?? call dword ptr cs:__imp__GetStdHandle@4
; result is last parameter for WriteConsole()
50 push eax
2E FF 15 ?? ?? ?? ?? call dword ptr cs:__imp__WriteConsoleA@20
C3 ret
hello:
68 65 6C 6C 6F 2C 20 77 6F 72 6C 64 0A "hello, world\n"
_written:
00 00 00 00
That was the compiler part. Anyone can do that. From now on we play linker, which is much
more interesting :-)
I need to find the functions WriteConsoleA() and GetStdHandle(). They happen to be in
"kernel32.dll". (That was the 'import library' part.)
Now I can start to make the executable. Question marks will take the place of yet-to-find-out
values; they will be patched afterwards.

First the DOS-stub, starting at 0x0 and being 0x40 bytes long:
00 | 4d 5a 00 00 00 00 00 00 00 00 00 00 00 00 00 00
10 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
20 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
30 | 00 00 00 00 00 00 00 00 00 00 00 00 40 00 00 00
As you can see, this isn't really a MS-DOS program. It's just the header with the signature
"MZ" at the beginning and the e_lfanew pointing immediately after the header, without any
code. That's because it isn't intended to run on MS-DOS; it's just here because the speci-
fication requires it.
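A reader of such a file only needs two things from the stub: the "MZ" signature and the e_lfanew field in the last 4 bytes of the 0x40-byte header. A small Python sketch (the function name is made up):

```python
import struct


def pe_header_offset(data):
    """Return e_lfanew, the file offset of the PE signature."""
    if data[:2] != b"MZ":
        raise ValueError("no MZ signature")
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)  # last field of the header
    return e_lfanew


# the stub from above: "MZ", 0x3a zero bytes, then e_lfanew = 0x40
stub = b"MZ" + bytes(0x3A) + struct.pack("<I", 0x40)
print(hex(pe_header_offset(stub)))
```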
Then the PE signature, starting at 0x40 and being 0x4 bytes long:
50 45 00 00
Now the file-header, which will start at byte 0x44 and is 0x14 bytes long:
Machine 4c 01 ; i386
NumberOfSections 02 00 ; code and data
TimeDateStamp 00 00 00 00 ; who cares?
PointerToSymbolTable 00 00 00 00 ; unused
NumberOfSymbols 00 00 00 00 ; unused
SizeOfOptionalHeader e0 00 ; constant
Characteristics 02 01 ; executable on 32-bit-machine
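To double-check the byte order, the 0x14 bytes of this file-header can be decoded again. The following Python sketch is only a cross-check of the values above; the function name and the dictionary layout are my own choice.

```python
import struct

IMAGE_FILE_EXECUTABLE_IMAGE = 0x0002
IMAGE_FILE_32BIT_MACHINE = 0x0100


def parse_file_header(raw):
    """Decode the 20 bytes of an IMAGE_FILE_HEADER (all fields little-endian)."""
    fields = struct.unpack("<HHIIIHH", raw)
    names = ("Machine", "NumberOfSections", "TimeDateStamp",
             "PointerToSymbolTable", "NumberOfSymbols",
             "SizeOfOptionalHeader", "Characteristics")
    return dict(zip(names, fields))


header = parse_file_header(
    bytes.fromhex("4c01 0200 00000000 00000000 00000000 e000 0201"))
print(hex(header["Machine"]), hex(header["Characteristics"]))
```

The bytes "02 01" thus read as 0x0102, which is IMAGE_FILE_EXECUTABLE_IMAGE combined with IMAGE_FILE_32BIT_MACHINE.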

And the optional header, which will start at byte 0x58 and is 0x60 bytes long:
Magic 0b 01 ; constant
MajorLinkerVersion 00 ; I'm version 0.0 :-)
MinorLinkerVersion 00 ;
SizeOfCode 20 00 00 00 ; 32 bytes of code
SizeOfInitializedData ?? ?? ?? ?? ; yet to find out
SizeOfUninitializedData 00 00 00 00 ; we don't have a BSS
AddressOfEntryPoint ?? ?? ?? ?? ; yet to find out
BaseOfCode ?? ?? ?? ?? ; yet to find out
BaseOfData ?? ?? ?? ?? ; yet to find out
ImageBase 00 00 10 00 ; 1 MB, chosen arbitrarily
SectionAlignment 20 00 00 00 ; 32-bytes-alignment
FileAlignment 20 00 00 00 ; 32-bytes-alignment
MajorOperatingSystemVersion 04 00 ; NT 4.0
MinorOperatingSystemVersion 00 00 ;
MajorImageVersion 00 00 ; version 0.0
MinorImageVersion 00 00 ;
MajorSubsystemVersion 04 00 ; Win32 4.0
MinorSubsystemVersion 00 00 ;
Win32VersionValue 00 00 00 00 ; unused?
SizeOfImage ?? ?? ?? ?? ; yet to find out
SizeOfHeaders ?? ?? ?? ?? ; yet to find out
CheckSum 00 00 00 00 ; not used for non-drivers
Subsystem 03 00 ; Win32 console
DllCharacteristics 00 00 ; unused (not a DLL)
SizeOfStackReserve 00 00 10 00 ; 1 MB stack
SizeOfStackCommit 00 10 00 00 ; 4 KB to start with
SizeOfHeapReserve 00 00 10 00 ; 1 MB heap
SizeOfHeapCommit 00 10 00 00 ; 4 KB to start with
LoaderFlags 00 00 00 00 ; unknown
NumberOfRvaAndSizes 10 00 00 00 ; constant

As you can see, I plan to have only 2 sections, one for code and one for all the rest (data,
constants and import directory). There will be no relocations and no other stuff like
resources. Also I won't have a BSS segment and stuff the variable 'written' into the initial-
ized data. The section alignment is the same in the file and in RAM (32 bytes); this helps
to keep the task easy, otherwise I'd have to calculate RVAs back and forth too much.
Now we set up the data directories, beginning at byte 0xb8 and being 0x80 bytes long:
Address Size
00 00 00 00 00 00 00 00 ; IMAGE_DIRECTORY_ENTRY_EXPORT (0)
?? ?? ?? ?? ?? ?? ?? ?? ; IMAGE_DIRECTORY_ENTRY_IMPORT (1)
00 00 00 00 00 00 00 00 ; IMAGE_DIRECTORY_ENTRY_RESOURCE (2)
00 00 00 00 00 00 00 00 ; IMAGE_DIRECTORY_ENTRY_EXCEPTION (3)
00 00 00 00 00 00 00 00 ; IMAGE_DIRECTORY_ENTRY_SECURITY (4)
00 00 00 00 00 00 00 00 ; IMAGE_DIRECTORY_ENTRY_BASERELOC (5)
00 00 00 00 00 00 00 00 ; IMAGE_DIRECTORY_ENTRY_DEBUG (6)
00 00 00 00 00 00 00 00 ; IMAGE_DIRECTORY_ENTRY_COPYRIGHT (7)
00 00 00 00 00 00 00 00 ; IMAGE_DIRECTORY_ENTRY_GLOBALPTR (8)
00 00 00 00 00 00 00 00 ; IMAGE_DIRECTORY_ENTRY_TLS (9)
00 00 00 00 00 00 00 00 ; IMAGE_DIRECTORY_ENTRY_LOAD_CONFIG (10)
00 00 00 00 00 00 00 00 ; IMAGE_DIRECTORY_ENTRY_BOUND_IMPORT (11)
00 00 00 00 00 00 00 00 ; IMAGE_DIRECTORY_ENTRY_IAT (12)
00 00 00 00 00 00 00 00 ; 13
00 00 00 00 00 00 00 00 ; 14
00 00 00 00 00 00 00 00 ; 15
Only the import directory is in use.

Next are the section headers. First we make the code section, which will contain the above
mentioned assembly. It is 32 bytes long, and so will be the code section. The header begins
at 0x138 and is 0x28 bytes long:
Name 2e 63 6f 64 65 00 00 00 ; ".code"
VirtualSize 00 00 00 00 ; unused
VirtualAddress ?? ?? ?? ?? ; yet to find out
SizeOfRawData 20 00 00 00 ; size of code
PointerToRawData ?? ?? ?? ?? ; yet to find out
PointerToRelocations 00 00 00 00 ; unused
PointerToLinenumbers 00 00 00 00 ; unused
NumberOfRelocations 00 00 ; unused
NumberOfLinenumbers 00 00 ; unused
Characteristics 20 00 00 60 ; code, executable, readable
The second section will contain the data. The header begins at 0x160 and is 0x28 bytes long:
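Name 2e 64 61 74 61 00 00 00 ; ".data"
VirtualSize 00 00 00 00 ; unused
VirtualAddress ?? ?? ?? ?? ; yet to find out
SizeOfRawData ?? ?? ?? ?? ; yet to find out
PointerToRawData ?? ?? ?? ?? ; yet to find out
PointerToRelocations 00 00 00 00 ; unused
PointerToLinenumbers 00 00 00 00 ; unused
NumberOfRelocations 00 00 ; unused
NumberOfLinenumbers 00 00 ; unused
Characteristics 40 00 00 c0 ; initialized, readable, writeable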

The next byte is 0x188, but the sections need to be aligned to 32 bytes (because I chose
so), so we need padding bytes up to 0x1a0:
00 00 00 00 00 00 ; padding
00 00 00 00 00 00
00 00 00 00 00 00
00 00 00 00 00 00
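The padding calculation is the usual "round up to the next multiple" formula. In Python it could be sketched like this (the function name is made up):

```python
def align_up(offset, alignment):
    """Round 'offset' up to the next multiple of 'alignment'."""
    return (offset + alignment - 1) // alignment * alignment


# the headers end at 0x188; with 32-byte alignment the code starts at 0x1a0
padding = align_up(0x188, 0x20) - 0x188
print(hex(align_up(0x188, 0x20)), padding)
```

The same formula gives the other padding runs needed further down in this example.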
Now the first section, being the code section with the above mentioned assembly, *does*
come. It begins at byte 0x1a0 and is 0x20 bytes long:
6A 00 ; push 0x00000000
68 ?? ?? ?? ?? ; push offset _written
6A 0D ; push 0x0000000d
68 ?? ?? ?? ?? ; push offset hello_string
6A F5 ; push 0xfffffff5
2E FF 15 ?? ?? ?? ?? ; call dword ptr cs:__imp__GetStdHandle@4
50 ; push eax
2E FF 15 ?? ?? ?? ?? ; call dword ptr cs:__imp__WriteConsoleA@20
C3 ; ret
Because of the previous section's length we don't need any padding before the next sec-
tion (data), and here it comes, beginning at 0x1c0:
68 65 6C 6C 6F 2C 20 77 6F 72 6C 64 0A ; "hello, world\n"
00 00 00 ; padding to align _written
00 00 00 00 ; _written

Now all that's left is the import directory. It will import 2 functions from "kernel32.dll", and it's
immediately following the variables in the same section. First we will align it to 32 bytes:
00 00 00 00 00 00 00 00 00 00 00 00 ; padding
It begins at 0x1e0 with the IMAGE_IMPORT_DESCRIPTOR:

OriginalFirstThunk ?? ?? ?? ?? ; yet to find out
TimeDateStamp 00 00 00 00 ; unbound
ForwarderChain ff ff ff ff ; no forwarders
Name ?? ?? ?? ?? ; yet to find out
FirstThunk ?? ?? ?? ?? ; yet to find out
We need to terminate the import-directory with a 0-bytes-entry (we are at 0x1f4):

OriginalFirstThunk 00 00 00 00 ; terminator
TimeDateStamp 00 00 00 00 ;
ForwarderChain 00 00 00 00 ;
Name 00 00 00 00 ;
FirstThunk 00 00 00 00 ;
Now there's the DLL name left, and the 2 thunks, and the thunk-data, and the function
names. But we will be finished real soon now!
The DLL name, 0-terminated, beginning at 0x208:

6b 65 72 6e 65 6c 33 32 2e 64 6c 6c 00 ; "kernel32.dll"
00 00 00 ; padding to 32-bit-boundary
The original first thunk, starting at 0x218:

AddressOfData ?? ?? ?? ?? ; RVA to function name "WriteConsoleA"
AddressOfData ?? ?? ?? ?? ; RVA to function name "GetStdHandle"
00 00 00 00 ; terminator

The first thunk is exactly the same list and starts at 0x224:
(__imp__WriteConsoleA@20, at 0x224)
AddressOfData ?? ?? ?? ?? ; RVA to function name "WriteConsoleA"
(__imp__GetStdHandle@4, at 0x228)
AddressOfData ?? ?? ?? ?? ; RVA to function name "GetStdHandle"
00 00 00 00 ; terminator
Now what's left is the two function names in the shape of an
IMAGE_IMPORT_BY_NAME. We are at byte 0x230.
01 00 ; ordinal, need not be correct
57 72 69 74 65 43 6f 6e 73 6f 6c 65 41 00 ; "WriteConsoleA"
02 00 ; ordinal, need not be correct
47 65 74 53 74 64 48 61 6e 64 6c 65 00 ; "GetStdHandle"
Ok, that's about all. The next byte, which we don't really need, is
0x24f. We need to fill the section with padding up to 0x260:
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ; padding
00

We are done. Now that we know all the byte-offsets, we can apply fixups to all those
addresses and sizes that were indicated as "unknown" with '??'-marks. I won't force you to
read that step-by-step (it's quite straightforward), and simply present the result:
DOS-header, starting at 0x0:
00 | 4d 5a 00 00 00 00 00 00 00 00 00 00 00 00 00 00
10 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
20 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
30 | 00 00 00 00 00 00 00 00 00 00 00 00 40 00 00 00
signature, starting at 0x40:

50 45 00 00
file-header, starting at 0x44:

Machine 4c 01 ; i386
NumberOfSections 02 00 ; code and data
TimeDateStamp 00 00 00 00 ; who cares?
PointerToSymbolTable 00 00 00 00 ; unused
NumberOfSymbols 00 00 00 00 ; unused
SizeOfOptionalHeader e0 00 ; constant
Characteristics 02 01 ; executable on 32-bit-machine
optional header, starting at 0x58:

Magic 0b 01 ; constant
MajorLinkerVersion 00 ; I'm version 0.0 :-)
MinorLinkerVersion 00 ;
SizeOfCode 20 00 00 00 ; 32 bytes of code
SizeOfInitializedData a0 00 00 00 ; data section size
SizeOfUninitializedData 00 00 00 00 ; we don't have a BSS
AddressOfEntryPoint a0 01 00 00 ; beginning of code section
BaseOfCode a0 01 00 00 ; RVA to code section
BaseOfData c0 01 00 00 ; RVA to data section
ImageBase 00 00 10 00 ; 1 MB, chosen arbitrarily
SectionAlignment 20 00 00 00 ; 32-bytes-alignment
FileAlignment 20 00 00 00 ; 32-bytes-alignment
MajorOperatingSystemVersion 04 00 ; NT 4.0
MinorOperatingSystemVersion 00 00 ;
MajorImageVersion 00 00 ; version 0.0
MinorImageVersion 00 00 ;
MajorSubsystemVersion 04 00 ; Win32 4.0
MinorSubsystemVersion 00 00 ;
Win32VersionValue 00 00 00 00 ; unused?
SizeOfImage c0 00 00 00 ; sum of all section sizes
SizeOfHeaders a0 01 00 00 ; offset to 1st section
CheckSum 00 00 00 00 ; not used for non-drivers
Subsystem 03 00 ; Win32 console
DllCharacteristics 00 00 ; unused (not a DLL)
SizeOfStackReserve 00 00 10 00 ; 1 MB stack
SizeOfStackCommit 00 10 00 00 ; 4 KB to start with
SizeOfHeapReserve 00 00 10 00 ; 1 MB heap
SizeOfHeapCommit 00 10 00 00 ; 4 KB to start with
LoaderFlags 00 00 00 00 ; unknown
NumberOfRvaAndSizes 10 00 00 00 ; constant
data directories, starting at 0xb8:

Address Size
00 00 00 00 00 00 00 00 ; IMAGE_DIRECTORY_ENTRY_EXPORT (0)
e0 01 00 00 6f 00 00 00 ; IMAGE_DIRECTORY_ENTRY_IMPORT (1)
00 00 00 00 00 00 00 00 ; IMAGE_DIRECTORY_ENTRY_RESOURCE (2)
00 00 00 00 00 00 00 00 ; IMAGE_DIRECTORY_ENTRY_EXCEPTION (3)
00 00 00 00 00 00 00 00 ; IMAGE_DIRECTORY_ENTRY_SECURITY (4)
00 00 00 00 00 00 00 00 ; IMAGE_DIRECTORY_ENTRY_BASERELOC (5)
00 00 00 00 00 00 00 00 ; IMAGE_DIRECTORY_ENTRY_DEBUG (6)
00 00 00 00 00 00 00 00 ; IMAGE_DIRECTORY_ENTRY_COPYRIGHT (7)
00 00 00 00 00 00 00 00 ; IMAGE_DIRECTORY_ENTRY_GLOBALPTR (8)
00 00 00 00 00 00 00 00 ; IMAGE_DIRECTORY_ENTRY_TLS (9)
00 00 00 00 00 00 00 00 ; IMAGE_DIRECTORY_ENTRY_LOAD_CONFIG (10)
00 00 00 00 00 00 00 00 ; IMAGE_DIRECTORY_ENTRY_BOUND_IMPORT (11)
00 00 00 00 00 00 00 00 ; IMAGE_DIRECTORY_ENTRY_IAT (12)
00 00 00 00 00 00 00 00 ; 13
00 00 00 00 00 00 00 00 ; 14
00 00 00 00 00 00 00 00 ; 15
section header (code), starting at 0x138:

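Name 2e 63 6f 64 65 00 00 00 ; ".code"
VirtualSize 00 00 00 00 ; unused
VirtualAddress a0 01 00 00 ; RVA to code section
SizeOfRawData 20 00 00 00 ; size of code
PointerToRawData a0 01 00 00 ; file offset to code section
PointerToRelocations 00 00 00 00 ; unused
PointerToLinenumbers 00 00 00 00 ; unused
NumberOfRelocations 00 00 ; unused
NumberOfLinenumbers 00 00 ; unused
Characteristics 20 00 00 60 ; code, executable, readable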
section header (data), starting at 0x160:

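Name 2e 64 61 74 61 00 00 00 ; ".data"
VirtualSize 00 00 00 00 ; unused
VirtualAddress c0 01 00 00 ; RVA to data section
SizeOfRawData a0 00 00 00 ; size of data section
PointerToRawData c0 01 00 00 ; file offset to data section
PointerToRelocations 00 00 00 00 ; unused
PointerToLinenumbers 00 00 00 00 ; unused
NumberOfRelocations 00 00 ; unused
NumberOfLinenumbers 00 00 ; unused
Characteristics 40 00 00 c0 ; initialized, readable, writeable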
(padding)
00 00 00 00 00 00 ; padding
00 00 00 00 00 00
00 00 00 00 00 00
00 00 00 00 00 00
code section, starting at 0x1a0:

6A 00 ; push 0x00000000
68 d0 01 10 00 ; push offset _written
6A 0D ; push 0x0000000d
68 c0 01 10 00 ; push offset hello_string
6A F5 ; push 0xfffffff5
2E FF 15 28 02 10 00 ; call dword ptr cs:__imp__GetStdHandle@4
50 ; push eax
2E FF 15 24 02 10 00 ; call dword ptr cs:__imp__WriteConsoleA@20
C3 ; ret
data section, beginning at 0x1c0:

68 65 6C 6C 6F 2C 20 77 6F 72 6C 64 0A ; "hello, world\n"
00 00 00 ; padding to align _written
00 00 00 00 ; _written
padding:
00 00 00 00 00 00 00 00 00 00 00 00 ; padding
IMAGE_IMPORT_DESCRIPTOR, starting at 0x1e0:
OriginalFirstThunk 18 02 00 00 ; RVA to orig. 1st thunk
TimeDateStamp 00 00 00 00 ; unbound
ForwarderChain ff ff ff ff ; no forwarders
Name 08 02 00 00 ; RVA to DLL name
FirstThunk 24 02 00 00 ; RVA to 1st thunk
terminator (0x1f4):
OriginalFirstThunk 00 00 00 00 ; terminator
TimeDateStamp 00 00 00 00 ;
ForwarderChain 00 00 00 00 ;
Name 00 00 00 00 ;
FirstThunk 00 00 00 00 ;
The DLL name, at 0x208:
6b 65 72 6e 65 6c 33 32 2e 64 6c 6c 00 ; "kernel32.dll"
00 00 00 ; padding to 32-bit-boundary
original first thunk, starting at 0x218:
AddressOfData 30 02 00 00 ; RVA to function name "WriteConsoleA"
AddressOfData 40 02 00 00 ; RVA to function name "GetStdHandle"
00 00 00 00 ; terminator
first thunk, starting at 0x224:
AddressOfData 30 02 00 00 ; RVA to function name "WriteConsoleA"
AddressOfData 40 02 00 00 ; RVA to function name "GetStdHandle"
00 00 00 00 ; terminator
IMAGE_IMPORT_BY_NAME, at byte 0x230:
57 72 69 74 65 43 6f 6e 73 6f 6c 65 41 00 ; "WriteConsoleA"
IMAGE_IMPORT_BY_NAME, at byte 0x240:
47 65 74 53 74 64 48 61 6e 64 6c 65 00 ; "GetStdHandle"
(padding)
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ; padding
00
First unused byte: 0x260
Alas, this works on NT but not on Windows 95. Windows 95 can't run applications with a section
alignment of 32 bytes; it needs an alignment of 4 KB and, apparently, a file alignment of
512 bytes. So for Windows 95 you'll have to insert a large number of 0-bytes (for padding) and
adjust the RVAs. Thanks go to D. Binette for testing on Windows 95.
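Two small pieces of arithmetic are needed to read the dump above: every dword is stored little-endian (so `a0 01 00 00` means 0x000001a0), and offsets have to be rounded up to the required alignment. Here is a minimal C++ sketch of both. The function names `read_dword` and `align_up` and the shortened flag names are our own choice for illustration; the flag values are the section characteristics bits of the PE format.

```cpp
#include <cassert>
#include <cstdint>

// Section characteristics bits of the PE format (names shortened here).
const std::uint32_t SCN_CNT_CODE             = 0x00000020;
const std::uint32_t SCN_CNT_INITIALIZED_DATA = 0x00000040;
const std::uint32_t SCN_MEM_EXECUTE          = 0x20000000;
const std::uint32_t SCN_MEM_READ             = 0x40000000;
const std::uint32_t SCN_MEM_WRITE            = 0x80000000;

// Read a 32-bit little-endian value from four raw bytes of the file.
std::uint32_t read_dword(const unsigned char* p)
{
    return  static_cast<std::uint32_t>(p[0])
         | (static_cast<std::uint32_t>(p[1]) << 8)
         | (static_cast<std::uint32_t>(p[2]) << 16)
         | (static_cast<std::uint32_t>(p[3]) << 24);
}

// Round value up to the next multiple of alignment.
// alignment must be a power of two (32, 512, 4096, ...).
std::uint32_t align_up(std::uint32_t value, std::uint32_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (value + alignment - 1) & ~(alignment - 1);
}
```

With these, the Characteristics bytes `20 00 00 60` of the code section decode to 0x60000020, which is SCN_CNT_CODE | SCN_MEM_EXECUTE | SCN_MEM_READ, and `align_up(0x1a0, 0x200)` gives 0x200, the first file offset that a 512-byte file alignment would allow for the code section.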

Lesson 2 - What knowledge do I need to code a disassembler?
Well, this is not easy to say. You need solid coding skills as well as an understanding of the
theoretical concepts behind the important parts.
Here is a list of what you should really know:

- Assembly knowledge in the win32 environment
- OOP and how it works
- A good understanding of how parsers work
- Knowledge of the PE file structure
- How SEH works
- Opcodes and mnemonics
- Linked lists
- How to use a debugger
- Maybe knowledge of trees and graphs, if you want to add a polymorphic engine
- A basic understanding of how a disassembler should work

Lesson 3 - What problems do I have to expect during development?
The first main problem you will meet is time: coding a disassembler takes a lot of it. Many
people claim that writing a disassembler is a waste of time. Well, let's waste time! The
knowledge you will gain from doing so is immense!
Another problem is the opcodes and mnemonics which our disassembler has to identify. You can
find one or another opcode/mnemonic list with your search engine, but a full list of all
opcodes is approximately 64 MB in size. So when you load the list into memory, you should
always keep this size in mind!
Just look at this: you have a hex value and want the corresponding mnemonic, so you have to
search the opcode list (64 MB) in memory. You have to do this for every mnemonic, and also for
every hex value you receive that is not a valid opcode (a parsing problem). Can you see now
what speed problem you will get? Maybe we should look for a better way to do these checks.
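One common way around the search is to use the opcode byte itself as an index into a table, so that finding the mnemonic is a single array access. The following C++ sketch shows the idea on a tiny, deliberately incomplete sample of one-byte opcodes. The names `OpcodeEntry`, `kSampleOpcodes` and `lookup_mnemonic` are made up for this illustration, and a real x86 decoder also has to deal with prefixes, two-byte opcodes and ModR/M bytes, which this sketch ignores.

```cpp
#include <cassert>
#include <cstddef>

struct OpcodeEntry {
    unsigned char opcode;     // the opcode byte
    const char*   mnemonic;   // text to print for it
};

// A tiny sample; a real table would cover far more opcodes.
static const OpcodeEntry kSampleOpcodes[] = {
    { 0x50, "push eax" },
    { 0x90, "nop"      },
    { 0xC3, "ret"      },
    { 0xCC, "int3"     },
};

// Returns the mnemonic for a one-byte opcode, or NULL if it is not in the table.
const char* lookup_mnemonic(unsigned char opcode)
{
    static const char* table[256];   // zero-initialised: every slot starts as NULL
    static bool        built = false;

    if (!built) {
        const std::size_t count = sizeof(kSampleOpcodes) / sizeof(kSampleOpcodes[0]);
        for (std::size_t i = 0; i < count; ++i) {
            assert(table[kSampleOpcodes[i].opcode] == NULL);   // no duplicate entries
            table[kSampleOpcodes[i].opcode] = kSampleOpcodes[i].mnemonic;
        }
        built = true;
    }
    return table[opcode];   // one array access instead of a search
}
```

A byte that is not a known opcode simply yields NULL, so the "not a valid opcode" case costs no more than a hit. The lazy table build is not thread-safe; that is acceptable for a sketch but worth keeping in mind.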
Next, the identification of API calls will take some more time. An instruction can have a
corresponding API call, so we need some kind of list to find them. This is the same problem as
the one mentioned above.
If at the end you want to add a polymorphic engine or a randomized garbage producer, you need
to write values back to the file. You have to keep an eye on the file size and the PE
structure. It is easy to see that the file needs some "corrections" after this kind of
manipulation!

Lesson 4 - Modularity and its importance [8]
This is about the physical structure of the program. Many languages, and C++ is no
exception, allow the separate compilation of modules which are then linked together to
form the executable code. The basic conventions of C still apply, but may be a little more
formally expressed: Header (.h) files are effectively the interfaces of the different mod-
ules, and the .c or .cpp files contain the implementation. As to what should constitute a
module, much depends on physical constraints. For a small, relatively simple project it
may be quite satisfactory to include all classes and objects in one module, whereas in a
large system life will be more difficult.
In many senses modularity almost comes for free. A general rule is to group logically
related classes and objects in the same module, setting up interfaces only to those parts
other modules must see. Remember that the goal is twofold: the simplifying of documen-
tation under the divide and rule principle, and the elimination of unnecessary compilation.
The two ideals are COHESION, that is, groups of logically related abstractions; and
LOOSE COUPLING, that is, minimal dependencies between modules.
Once, then, the key abstractions have been identified, it is a relatively simple step to
divide physically the implementation into coherent modules. Abstraction combined with
encapsulation ensures that a given module will include all relevant functions and data.
Encapsulation ensures that the modules are unaware of, and hence unaffected by,
changes to the implementation details in other modules. This is a major advantage.
Regardless of good intentions, in any large system programmers are often led to make
coding decisions which depend on the internal implementation of another module; per-
haps even relying on side-effects.
Modularity may of course be affected by other needs: separate processors in a multi-processor
system; segment size limits; dynamic calling behaviour in virtual memory systems
(consequent on the need for late binding); the building of libraries where reusability is the
main objective; and work allocation in large project teams. In any event it is important to
realise that modularity is purely about the physical design: it is the use of classes and
objects that form the logical basis of the project.
8. Found at http://tutorials.freeskills.com/read/id/286
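To make the split between interface and implementation described in this lesson concrete, here is a minimal C++ sketch. It is squeezed into one listing for brevity; the comments mark what would normally go into a header file and what into a source file. The file names and the `counter` namespace are assumptions made for this example only.

```cpp
#include <cassert>
#include <climits>

// ---- what would live in counter.h: the interface other modules see ----
namespace counter {
    void reset();
    void increment();
    int  value();
}

// ---- what would live in counter.cpp: the hidden implementation ----
namespace {
    int g_count = 0;   // unnamed namespace: not visible outside this file
}

namespace counter {
    void reset()     { g_count = 0; }
    void increment() { assert(g_count < INT_MAX); ++g_count; }
    int  value()     { return g_count; }
}
```

Other modules can only call `counter::reset()`, `counter::increment()` and `counter::value()`. How the count is stored can change without any of them having to be touched, which is the loose coupling the text asks for.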

Lesson 5 - OOP - Magic possibilities or overbloated system?
OOP is one of the most debated topics around high-level languages. But do we really need OOP
here? Why is OOP an interesting option when we code in pure assembly?
Well, first of all OOP is a coding concept. It organizes data structures in memory, leading to
an abstract model of our data. The good thing about OOP is that data is well organized and
gains some very important features such as inheritance. Even in win32 assembly it is possible
to code objects as data structures. We will see this later.
When we code our simple disassembler (later), we will use no OOP concepts. We code it plain
and simple. Later, when we get to the complex disassembler engine, we may need OOP to
manipulate our data in memory and to reorganize it. It will also make for better and clearer
code (well, if you have understood the OOP concept).

Objects [9]
Objects are the central idea behind OOP. The idea is quite simple.
An object is a bundle of variables and related methods.
A method is similar to a procedure; we'll come back to these later.
The basic idea behind an object is that of simulation. Most programs are written with very
little reference to the real world objects the program is designed to work with; in object ori-
ented methodology, a program should be written to simulate the states and activities of
real world objects. This means that apart from looking at data structures when modelling
an object, we must also look at methods associated with that object, in other words,
functions that modify the object's attributes.
A few examples should help explain this concept. First, we turn to one of my favourite
pastimes...
9. Copied from http://www.quiver.freeserve.co.uk/OOP1.htm

Drink!
Say we want to write a program about a pint of beer. If we were writing this program in Mod-
ula-2, we could write something like this:
TYPE BeerType = RECORD
    BeerName: STRING;
    VolumeInPints: REAL;
    Colour: ColourType;
    Proof: REAL;
    PintsNeededToGetYouFull: CARDINAL;
    ...
END;
Now let's say we want to initialise a pint of beer, and take a sip from it. In Modula-2, we might
code this as:
VAR MyPint: BeerType;
BEGIN
    ...
    (* Initialise (i.e. buy) a pint: *)
    MyPint.BeerName := "Harp";
    MyPint.VolumeInPints := 1.00;
    ...
    (* Take a sip *)
    MyPint.VolumeInPints := MyPint.VolumeInPints - 0.1;
    ...
We have constructed this model based entirely on data types; that is, we defined BeerType as
a record structure and gave that structure various fields, e.g. BeerName. This is the
norm for procedural programming.

This is however, not how we look at things when we want to program using objects. If you
remember how we defined an object at the start of this section, you will remember that we
must not only deal with data types, but we must also deal with methods.
A method is an operation which can modify an object's behaviour. In other words, it is
something that will change an object by manipulating its variables.
This means that when we take a real world object, in this case a pint of beer, when we
want to model it using computational objects, we not only look at the data structure that it
consists of, but also all possible operations that we might want to perform on that data.
For our example, we should also define the following methods associated with the Beer-
Type object:
InitialiseBeer - this should allow us to give our beer a name, a volume, etc.
GetVolume - to see how much beer we have left!
Take_A_Sip - for lunchtime pints...
Take_A_Gulp - for Lavery's pints...
Sink_Pint - for post exam pints...
There are loads more methods we could define - we might want a function GetBeerName
to help us order another pint for example. Now, some definitions. An object variable is a
single variable from an object's data structure, for example BeerName is one of Beer-
Type's object variables. Now the important bit from this section:
Only an object's methods can modify its variables
There are a few exceptions, but we'll cover them much later. What this means in our
example is that unlike the Modula code, we cannot directly modify BeerType's variables -
we cannot set BeerName to "Tennents" directly. We must use the object's methods to do
this. In practice, what this means is that we must think very carefully when we define
methods. Say in the above example we discover when writing the main program that we
need to be able to take a drink of arbitrary size; we cannot do this with the above definition,
we can only take a sip, a gulp etc. We must go back and define a new method associated with
BeerType, say Take_Drink, which will take a parameter representing the amount of
beer we wish to drink.
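To show what "only an object's methods can modify its variables" looks like in practice, here is a minimal C++ sketch of the beer object (C++ classes are introduced properly further on). The method names follow the list above; the constructor stands in for InitialiseBeer, and the sip size of 0.1 pints is taken from the Modula-2 example. The class name `Beer` and everything else is our own assumption for illustration.

```cpp
#include <cassert>
#include <string>

class Beer {
public:
    // Plays the role of InitialiseBeer: give the beer a name and a volume.
    Beer(const std::string& name, double volumeInPints)
        : name_(name), volumeInPints_(volumeInPints) {}

    double             GetVolume()   const { return volumeInPints_; }
    const std::string& GetBeerName() const { return name_; }

    // Drink an arbitrary amount; the glass never goes below empty.
    void Take_Drink(double pints)
    {
        assert(pints >= 0.0);
        volumeInPints_ = (pints < volumeInPints_) ? volumeInPints_ - pints : 0.0;
    }
    void Take_A_Sip() { Take_Drink(0.1); }
    void Sink_Pint()  { Take_Drink(volumeInPints_); }

private:
    std::string name_;            // cannot be changed from outside the class
    double      volumeInPints_;   // only the methods above may modify this
};
```

Code outside the class cannot write to `volumeInPints_` directly; it has to go through Take_Drink or one of its shortcuts. That is also why adding the arbitrary-size drink meant adding a method, not just touching a field.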
Another Example
We'll now deal with a real-life example which will help us understand some more object con-
cepts. We will design an object to emulate a counter.
A counter is a variable in a program that is used to hold a value. If you don't know that then
you shouldn't be reading this! To make things very simple, we'll assume that our counter has
only three operations associated with it:
- Initialising the counter to a value
- Incrementing the counter by one
- Getting the current value of the counter
So, when we come to implement the above using objects we will define three methods that
do the above.

You may be thinking that we could implement this very simply in Modula-2 using definition
and implementation modules obtaining the same results as if we used an object oriented
language. Well, we nearly can:
DEFINITION MODULE Counter;

PROCEDURE InitialiseCounter(InitialValue: INTEGER);
PROCEDURE IncrementCounter;
PROCEDURE GetCounterValue(): INTEGER;

END Counter.

IMPLEMENTATION MODULE Counter;

VAR MyCounter: INTEGER;

PROCEDURE InitialiseCounter(InitialValue: INTEGER);
BEGIN
    MyCounter := InitialValue;
END InitialiseCounter;

PROCEDURE IncrementCounter;
BEGIN
    INC(MyCounter);
END IncrementCounter;

PROCEDURE GetCounterValue(): INTEGER;
BEGIN
    RETURN MyCounter;
END GetCounterValue;

BEGIN
    MyCounter := 0;
END Counter.
Because Modula-2 is not object oriented, this will only satisfy one of the requirements for
an object oriented language - encapsulation. This has been covered before; it simply
means that we have implemented information hiding, i.e. we cannot directly access
MyCounter from any module that imports Counter. But being object oriented means a lot
more than just encapsulation, as we'll see next...

Classes [10]
Say we wanted to extend the counter example discussed previously. Perhaps in our Modula-
2 program we need three counters. We could define an array of MyCounter and work through
that. Or say we needed up to 1000 counters. Then we could also declare an array, but that
would waste a lot of memory if we only used a few counters. Perhaps if we needed an unbounded
number of counters we could put them in a linked list and allocate memory as required.
The point of all this is that we are now talking in terms of data structures; all of the above dis-
cussion has nothing to do with the behaviour of the counter itself. When programming with
objects we can ignore anything not directly concerning the behaviour or state of an object; we
instead turn our attention to classes.
A class is a blueprint for an object.
What this basically means is that we provide a blueprint, or an outline of an object. This blue-
print is valid whether we have one or one thousand such objects. A class does not represent
an object; it represents all the information a typical object should have as well as all the meth-
ods it should have. A class can be considered to be an extremely extended TYPE declara-
tion, since not only are variables held but methods too.
10. Copied from http://www.quiver.freeserve.co.uk/OOP1.htm

C++
As an example, let's give the C++ class definition for our counter object.
class Counter {
private:
    int MyCounter;
public:
    Counter() {
        MyCounter = 0;
    }
    void InitialiseCounter(int value) {
        MyCounter = value;
    }
    void IncrementCounter(void) {
        MyCounter++;
    }
    int GetCounterValue(void) {
        return (MyCounter);
    }
};

So, a lot to go through for this little example. You really need to understand the fundamentals
of C before the example will make any sense.
- In the private section, all the object's variables should be placed. These
define the state of the object. As the name suggests, the variables are going
to be private, that is they cannot be accessed from outside the class declara-
tion. This is encapsulation.
- The public section contains all the object's methods. These methods, as the
name suggests, can be accessed outside the class declaration. The methods are
the only means of communication with the object.
- The methods are implemented as C functions or procedures; the three methods
should be easy to understand.
- A class definition can also have a public method that has the same name
as the class itself, in this case Counter. This method is called the class
constructor, and will be explained soon. If you leave it out, the compiler
generates a default one.
- Functions and procedures can also be placed in the private section; these will
not be accessible to the outside world but only within the class declaration.
This can be used to provide support routines to the public routines.

Instantiation
This is an awful big word for a powerfully simple concept. All we have done so far is to
create a class, i.e. a specification for our object; we have not created our object yet. To
create an object that simulates a counter in C++ then, all we have to do is declare in our
main program:
Counter i;
Although this seems just like an ordinary variable declaration, this is much more. The
variable i now represents an instance of the Counter type; a counter object. We can now
communicate with this object by calling its methods, for example we can set the counter
to the value '50' by calling
i.InitialiseCounter(50);
We can increment the counter
i.IncrementCounter();
and we can get the counter value
value = i.GetCounterValue();
When we first instantiate an object (i.e. when we first declare, or create it), the class con-
structor is called. The class constructor is a method with the same name as the class def-
inition. This method should contain any start-up code for the object; any initialisation of
object variables should appear within this method. In the counter example, whenever we
create a new counter object, the first thing that happens to the object is that the variable
MyCounter is initialised to zero.
Remember the question posed at the very start? The power of objects starts to kick in
now. Say we require another counter within our program. All we have to do is declare a
new object, say:
Counter j;

The new counter object will have nothing to do with the previous object. What this means is
that i and j are two distinct objects, each with their own separate values. We can increment
them independently, for example. Should we need 1000 counter objects we could declare an
array of counter objects:
Counter loads[1000];
and then increment one of them using a call such as
loads[321].IncrementCounter();

Java
The equivalent Java class definition for the counter example follows. It is remarkably sim-
ilar to the C++ definition, and differs only in syntax.
class Counter extends Object {
    private int MyCounter;
    Counter() {
        MyCounter = 0;
    }
    public void InitialiseCounter(int value) {
        MyCounter = value;
    }
    public void IncrementCounter() {
        MyCounter++;
    }
    public int GetCounterValue() {
        return (MyCounter);
    }
}
A few brief notes about the differences:
- New classes can be defined with the clause extends Object. This names the
superclass, which will be dealt with in the next section. The clause is optional,
since Object is the default superclass.
- There are no public or private sections; instead all variables and methods are prefixed
with the appropriate qualifier.
- The class constructor definition remains the same.

Instantiating objects in Java is slightly different; the designers knew that the C++ method of
declaring a new object was far too similar to how new variables are declared, so objects are
declared differently:
Counter i;
i = new Counter();
Basically we define a variable to reference the object in the first line. Then we actually create
an instance of the object by a call to new in the second line. Accessing object methods is
done in the exact same way in Java as in C++.

So which...?
A quick diversion from OOP here! At this point you might think it doesn't matter whether
you use C++ or Java, they both implement object oriented technology. Well, C++ can be
used to design programs without implementing any objects; C++ can be used as an
extended C. In Java, you must implement any non-trivial program using objects. This is
because Java has no support for structures (record types) or pointers; all these must be
replaced by object variables and methods. So, if you are using Java, you need to under-
stand object methodology; with C++ this is optional.
Basically, both these languages have hundreds of other features that I don't have time
even to begin to explain; as long as you have a basic understanding of object technologies
and the C language, you should find both rather easy to learn.
Why Bother?
The process of designing and programming objects seems very cumbersome, so why
bother? Well, it's difficult to see from such a small example, but for larger projects, OOP
techniques allow a great deal of flexibility. Objects are used because:
- Encapsulation; in our example we cannot alter the value of the counter other
than by incrementing it or setting it to an initial value. This reduces
possible bugs.
- Modularity; Different programmers or teams can work on different indepen-
dent objects.
- Inheritance; this is covered in the next section.
Basically, objects provide a secure and easily upgradable path for program developers.
Already, a considerable number of developers are moving from normal procedural design
and embracing object oriented technology.
The next section should be easy to follow if you understood this one! By the way, the rea-
son the next few examples are only in Java is because I don't know enough about C++ to
program them!

Inheritance [11]
Another big word for a simple concept. To help explain this, we'll go back to our beer exam-
ple. Say we want to define a new class to represent a pint of an imported French beer. This
class would have all the variables and methods of the normal beer class, but it would have
the following additional information:
- A variable representing the price of the beer
- Two methods to set and get the price of the beer
(We need this information because we are students; everyone knows the price of Harp, but
we would like to know the price of this expensive beer before we order it!)
It would be rather tedious to define a new class, FrenchBeerType which had all the variables
and methods of BeerType plus a few more. Instead, we can define FrenchBeerType to be a
subclass of BeerType.
A subclass is a class definition which takes functionality from a previous class definition.
What this means is that we only define the additional information that the FrenchBeerType
class has.
Informally then, we would create a new class, FrenchBeerType, and tell our compiler that it is
a subclass of BeerType. In the class definition, we would include only the following informa-
tion:
- A variable BeerPrice
- A method SetBeerPrice
- A method GetBeerPrice
We do not need to include any information about BeerName for example; all this is automati-
cally inherited. This means that FrenchBeerType has all the attributes of BeerType plus a few
additional ones. All this talk of beer is making me mad for a pint...
11. Copied from http://www.quiver.freeserve.co.uk/OOP1.htm

Counters, Counters, Counters...
Back to the counter example then! The counter we had in the last section is fine for most
counting purposes. But say in a program we require a counter that can not only be incre-
mented, but can be decremented too. Since this new counter is so similar in behaviour to
our previous counter, it would be mad to define a brand new class with everything that
Counter has plus a new method. Instead, we'll define a new class ReverseCounter that is
a subclass of Counter. We'll do this in Java.
class ReverseCounter extends Counter
{
    public void DecrementCounter() {
        MyCounter--;
    }
}
The extends clause indicates the superclass of a class definition. A superclass is the
"parent" of a subclass; in our beer analogy, BeerType is the superclass of FrenchBeer-
Type, so if we were defining this in Java we would use class FrenchBeerType extends
BeerType. Basically, we are just saying that we want ReverseCounter to be a subclass of
Counter. When we define a brand new class that is not a subclass of anything (as we did
when we defined Counter) we use the superclass Object to indicate we want the default
superclass.
We have defined ReverseCounter to be a subclass of Counter. This means that if we
instantiate a ReverseCounter object, we can use any method that the class Counter
provided, as well as the new methods provided. For example, if i is an object of the
ReverseCounter class, then we can both increment it and decrement it; i.IncrementCounter();
and i.DecrementCounter(); respectively.
Inheritance is a powerful tool. Unlike our simple example, inheritance can be passed on
from generation to generation; we could define a class SuperDuperReverseCounter for
example, that is a subclass of ReverseCounter which could provide added variables or
methods.

Bugs, bugs, bugs...
If you tried to compile the above example and found it wasn't compiling, don't worry! There is
a semi-deliberate mistake left in the code, which I am very usefully going to use to stress a
point.
When defining a class you must consider any possible subclass.
When we defined the Counter class we didn't even know what a subclass was, so we could
be forgiven for breaking this rule then, but not from now on! If we go back to how the class
was defined:
class Counter extends Object {
private int MyCounter;
...
...
}
We can see that the variable MyCounter is defined to be of type private. In Java, this means
that the variable becomes very, very private indeed; in fact, it is only accessible from inside
the class in which it is defined. It is not available to any other class, including its
subclasses. So when we reference MyCounter from inside ReverseCounter the Java compiler
will kick up a fuss, since we are outside the scope of the variable.
So, we should have realised at the time of writing the Counter class that subclasses might
need to get at this variable too. To fix this, all we have to do is change the qualifier of
MyCounter to:
protected int MyCounter;
A variable with a protected qualifier can be accessed from within the class in which it is
defined and from all subclasses of this class (in Java, also from other classes in the same
package). This seems appropriate for our purposes.
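For comparison, here is how the corrected counter and its subclass might look in C++. This is a sketch translated from the Java example, not code from the original tutorial; the function `reverse_counter_demo` is added only to exercise the two classes.

```cpp
#include <cassert>

class Counter {
public:
    Counter() : MyCounter(0) {}
    void InitialiseCounter(int value) { MyCounter = value; }
    void IncrementCounter()           { ++MyCounter; }
    int  GetCounterValue() const      { return MyCounter; }
protected:
    int MyCounter;   // protected rather than private, so subclasses can reach it
};

class ReverseCounter : public Counter {
public:
    void DecrementCounter() { --MyCounter; }   // would not compile if MyCounter were private
};

// Exercises inherited and new methods on the same object.
int reverse_counter_demo()
{
    ReverseCounter i;
    i.InitialiseCounter(5);
    i.IncrementCounter();   // inherited from Counter
    i.DecrementCounter();   // added by ReverseCounter
    i.DecrementCounter();
    assert(i.GetCounterValue() == 4);
    return i.GetCounterValue();
}
```

Changing `protected:` back to `private:` reproduces the compiler error discussed above, this time from a C++ compiler. Note that C++ has no package concept, so here protected really does mean "this class and its subclasses" only.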

Lesson 6 - Linked lists - a powerful tool [12]

Although linked lists sound kind of scary, don't worry, they are really easy to use once
you've got a little practice under your belt! When I first learned this odd way of storing
data, I really thought that I wouldn't be using them again. I certainly learned differently!
Linked lists form the foundation of many data storing schemes in my game!
They are really nice when you don't know how many of a data type you will need, and
don't want to waste space. They are like having a dynamically allocated string that fluctuates
in size as the program runs. Before I really confuse you, let's get into a better explanation!
12. This tutorial was taken from http://www.inversereality.org/tutorials/c++/linkedlists.html and was written by
Justin Deltener

A linked list is a chain of structs or records called nodes. Each node has at least two
members, one of which points to the next item or node in the list! These are defined as Singly
Linked Lists because they only point to the next item, and not the previous. Those that
point to both are called Doubly Linked Lists. Circular Linked Lists, in which the last node
points back to the first, are something different again; I won't go into any depth on them
because they don't concern this tutorial. According to this definition, we could have
our record hold anything we wanted! The only drawback is that each record must be an
instance of the same structure. This means that we couldn't have a record holding a char
point to another structure holding a short, a char array, and a long. Again, they have to be
instances of the same structure for this to work. Another cool aspect is that each structure
can be located anywhere in memory; the nodes don't have to be contiguous in memory!
struct List
{
    long Data;
    List* Next;
    List()
    {
        Next=NULL;
        Data=0;
    }
};
typedef List* ListPtr;
Notice that we define a default constructor for our structure that sets Next equal to NULL.
This is because we need to know when we have reached the end of our linked list. Each Next
item that is NOT equal to NULL means that it is pointing to another allocated instance. If it
does equal NULL, then we have reached the end of our list.

Starting up
First off, we need to set our list pointers to some known location in memory. We will
allocate an instance, then point Head, Tail and CurrentPtr at it. Something like this:
SLList::SLList()
{
    Head = new List;
    Tail = Head;
    CurrentPtr = Head;
}
We don't need a separate temp pointer here because Head will keep pointing to the
allocated memory until it is deleted. Just like we discussed, we allocated an instance of
our structure, then assigned Head, Tail and CurrentPtr to our new instance. This beginning
point is crucial. We must allocate at least one instance right away so our pointers are
actually pointing to something relevant! Well, now we have the smallest possible linked
list, where head = tail. Pointer usage in linked lists makes them a little hard to learn at
first, but once you think of the uses of pointers it starts to come together. Now that we
have our Head and Tail pointers actually pointing to something that is physically there,
let's cover how to add additional nodes to our list.
A linked list with one node is kind of boring :)

Adding a Node
We can actually add nodes in two possible places, the beginning or the end, although the
standard seems to be the end. This makes our linked list act kind of like a queue with the head
node being the oldest and the tail pointing to the newest object. This brings up an interesting
subject also. How will we use our linked list? This is what makes the linked list so powerful.
We could use it as a priority list where the oldest objects get a higher precedence until
deleted from the list. We could also use it as a master listing of items that need to be kept
track of at one time, deleting objects when they need to be, without using any precedence
scheme. Here's some code that will add a node onto the end of the list, and then move the
end pointer so that it really does point at the end.
void SLList::AddANode()
{
    Tail->Next = new List;
    Tail = Tail->Next;
}
Here we add a node onto the end of our list, then move the Tail pointer to point to the new
instance! After this function we can always access our new node through Tail since we allo-
cated a new instance, then made Tail point to it!

Traversing the List

This is actually the most difficult part in dealing with Singly Linked Lists. This is because
we can't immediately access the previous node should we need to, like when we want to
delete a node and reconnect the node before to the node after the one being deleted.
One easier way is to create a function that will traverse through the list a given number of
nodes. This way, we can keep track of which one we are on should we need to delete it,
then we could pass the node to the function and get a pointer to the previous node!
Something like this :
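ListPtr SLList::Previous(long index)
{
    ListPtr temp=Head;
    //stop early rather than walk off the end of the list
    for(long count=0; count<index-1 && temp->Next!=NULL; count++)
    {
        temp=temp->Next;
    }
    return temp;
}
ListPtr SLList::Previous(ListPtr index)
{
    ListPtr temp=Head;
    if(index==Head) //special case, index IS the head :)
    {
        return Head;
    }
    //the NULL test stops us at the last node if index is not in the list
    while(temp->Next != NULL && temp->Next != index)
    {
        temp=temp->Next;
    }
    return temp;
}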

If we know that we've gone into our list a certain number of nodes, we can pass that number
to our Previous function and get a pointer to the previous node. This works well, but is hard to
debug should we be off in our counter etc. I created a second version which lets you pass the
node you are currently at as an argument, then we can be absolutely certain that we will get
the previous node! I use the second version a LOT more. Also notice that the second has
error checking. If we are currently at the Head node and try to go back one, it simply returns
Head instead of returning garbage.
While creating our neat class, I decided to use a class node pointer which we declared as
CurrentPtr. To that end, I created two functions that move our pointer forward and back one
node. If we are at the head and try to go back one node (into nothing), then the function
doesn't move our pointer. Likewise if we are at the end of the list and try to advance to the
next node (nothing), it doesn't move our pointer.
void SLList::Advance()
{
    if(CurrentPtr->Next != NULL)
    {
        CurrentPtr=CurrentPtr->Next;
    }
}
void SLList::Rewind()
{
    if(CurrentPtr != Head)
    {
        CurrentPtr=Previous(CurrentPtr);
    }
}

Deleting a Node
When deleting nodes from a linked list, there are 3 different cases to choose between. The
node to be deleted is the head node, it's a middle node (somewhere between the head and
tail, but neither of them) or it could be the tail node. Each requires a small change to
take into account when deleting the node. Let's go over each one in depth.
void SLList::DeleteANode(ListPtr corpse) //<-- i thought it was funny :)
{
    ListPtr temp;
    if(corpse == Head) //case 1 corpse = Head
    {
        temp=Head;
        Head=Head->Next;
        if(Head == NULL) //that was the only node, so the list is now empty
        {
            Tail=NULL;
        }
        delete temp;
    }
    else if(corpse == Tail) //case 2 corpse is at the end
    {
        temp = Tail;
        Tail=Previous(Tail);
        Tail->Next=NULL;
        delete temp;
    }
    else //case 3 corpse is in middle somewhere
    {
        temp=Previous(corpse);
        temp->Next=corpse->Next;
        delete corpse;
    }
    CurrentPtr=Head; //Reset the class tempptr
}
In the cases below, CurrentNode is actually corpse.

Case 1: CurrentNode = Head node
In this case, the node to be deleted is actually the Head node! This is a special case because
there is no previous node to connect. We simply use our temp pointer to remember where
Head is pointing, advance Head to the next position, then delete the node at our saved
location! Simple huh!
Case 2: CurrentNode = End node
In this case, the node to be deleted is actually the Tail node! This is a special case because
we have a previous node, but no node afterwards to connect to. We save the old location of
Tail using temp, set Tail equal to the previous node, set the Next pointer of Tail equal to NULL
since it is at the end, then delete the node temp points to!
Case 3: CurrentNode is somewhere in between
In this case, there is a node before and a node after our current node. All we need to do is
connect the previous node to the node after our current node. We set temp equal to our
previous node and set its Next pointer to the node after our current one (corpse). Once they
are connected, we can simply delete our current node! That's all there is to deleting nodes!
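The three cases can be tried out in isolation with a small stand-alone list. This is only a sketch: MiniList, Node, append, previous, remove and toString are hypothetical names chosen for this example, not the SLList class built in this lesson, and the sketch folds cases 2 and 3 together because both need the previous node.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the lesson's node type.
struct Node {
    int value;
    Node* next;
};

class MiniList {
public:
    MiniList() : head_(nullptr), tail_(nullptr) {}
    ~MiniList() {
        while (head_ != nullptr) remove(head_);   // free whatever is left
    }
    MiniList(const MiniList&) = delete;
    MiniList& operator=(const MiniList&) = delete;

    // Adds a node at the end and returns it so the caller can delete it later.
    Node* append(int v) {
        Node* n = new Node{v, nullptr};
        if (tail_ == nullptr) head_ = n; else tail_->next = n;
        tail_ = n;
        return n;
    }

    // Returns the node in front of n, or nullptr when n is the head.
    Node* previous(const Node* n) const {
        Node* p = head_;
        while (p != nullptr && p->next != n) p = p->next;
        return p;
    }

    void remove(Node* corpse) {
        assert(corpse != nullptr);
        if (corpse == head_) {                      // case 1: head node
            head_ = head_->next;
            if (head_ == nullptr) tail_ = nullptr;  // the list is now empty
        } else {
            Node* prev = previous(corpse);          // cases 2 and 3
            assert(prev != nullptr);                // corpse must be in the list
            prev->next = corpse->next;              // bridge over the corpse
            if (corpse == tail_) tail_ = prev;      // case 2: tail node
        }
        delete corpse;
    }

    std::string toString() const {
        std::string s;
        for (const Node* p = head_; p != nullptr; p = p->next) {
            if (!s.empty()) s += ' ';
            s += std::to_string(p->value);
        }
        return s;
    }

private:
    Node* head_;
    Node* tail_;
};
```

Deleting the only node of a list is the one situation where both Head and Tail have to change, which is why case 1 checks for an empty list afterwards.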

Before Exit
Before we can exit our program, we have to make certain that all of our dynamically
allocated structures or nodes are deleted, otherwise we will have a memory leak! To fix this,
we can build a routine to de-allocate any remaining nodes. Let's make it automatic and
place it in the class destructor!
SLList::~SLList()
{ ListPtr temp = Head;
  CurrentPtr = Head;
  while(CurrentPtr != NULL)
  { CurrentPtr = CurrentPtr->Next; //step ahead first...
    delete temp;                   //...then free the node we just left
    temp = CurrentPtr;
  }
}
This traverses through the list de-allocating nodes as it moves along until it has reached
the end! That's all there is to it!

Lesson 7 - Trees and Graphs

Just as using an array to store a sequence makes you pay for indexing even when you don't
need it (suggesting a linked list if you need flexibility), using a sorted array is a clunky sort of
bargain if you need to muck with the sequence on anything like a regular basis. There are
sorting algorithms that are fast on an already-mostly-sorted array, but even then you'll wind
up shifting huge pieces of array around to add or remove even a single element. Binary trees
can cheaply store a sorted sequence, with searching (and even indexing if you need it), and
let you add, remove, or muck with nodes at will.

Overview
Binary trees are the result of the same sort of relaxation that leads from an array to a
linked list. We don't really need the indexing that sorted arrays impose on sorted data; if
we throw it away, we're left with only the hierarchy of middle elements that
binary_search() traverses as it executes.
There are also much more paranoid binary-tree implementations that constantly juggle
the tree in bizarre ways, such that it is mathematically guaranteed that the tree will never
become too badly unbalanced (for some formal definition of "too badly"); two popular
flavors of this are red-black trees and splay trees (the latter only guarantee good
performance averaged over a sequence of operations). This approach involves quite a bit of
overhead, though, and adds complexity; in practice, it's rarely worth worrying about this
unless you can't avoid feeding an already-sorted list to your tree. It's exceedingly rare for
a tree to become unbalanced enough to make a difference by accident.
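To make the trade-off concrete, here is a minimal unbalanced binary search tree. It is a sketch under assumptions: TreeNode, bstInsert, bstContains, inorder, height, buildFrom and freeTree are names invented for this example, keys are single characters, and duplicates are simply ignored.

```cpp
#include <cassert>
#include <string>

struct TreeNode {
    char key;
    TreeNode* left;
    TreeNode* right;
};

// Inserts key below root and returns the (possibly new) root of that subtree.
TreeNode* bstInsert(TreeNode* root, char key) {
    if (root == nullptr) return new TreeNode{key, nullptr, nullptr};
    if (key < root->key)      root->left  = bstInsert(root->left, key);
    else if (key > root->key) root->right = bstInsert(root->right, key);
    return root;                       // equal keys are ignored
}

// Same walk as a binary search: go left or right until the key is found.
bool bstContains(const TreeNode* root, char key) {
    while (root != nullptr) {
        if (key == root->key) return true;
        root = (key < root->key) ? root->left : root->right;
    }
    return false;
}

// An inorder walk appends the keys in sorted order.
void inorder(const TreeNode* root, std::string& out) {
    if (root == nullptr) return;
    inorder(root->left, out);
    out += root->key;
    inorder(root->right, out);
}

// Number of nodes on the longest path from root down to a leaf.
int height(const TreeNode* root) {
    if (root == nullptr) return 0;
    int l = height(root->left);
    int r = height(root->right);
    return 1 + (l > r ? l : r);
}

TreeNode* buildFrom(const std::string& keys) {
    TreeNode* root = nullptr;
    for (char k : keys) root = bstInsert(root, k);
    return root;
}

void freeTree(TreeNode* root) {
    if (root == nullptr) return;
    freeTree(root->left);
    freeTree(root->right);
    delete root;
}
```

Inserting the keys in the order "dbfaceg" gives a tree of height 3, while inserting the already-sorted "abcdefg" degenerates into a chain of height 7; both yield the same inorder sequence.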

Reconstruction of Binary Trees from Traversals [13]

Taken together, the three traversals of a binary tree determine it uniquely. But are all three
required? From here on we talk about binary trees whose keys are alphabetic characters, for
convenience. First and foremost we are thinking in terms of saving space.
There is a compact, space-saving representation called the linear or array representation
of a binary tree. Here, if an array a[1...n] is used to represent a binary tree, a[k] has its left
child at a[2k] and its right child at a[2k+1]. But there has to be a value which specifies that
there is no such node, e.g., if a[k] does not have a left child, a[2k] holds a null value. That
slot still has to exist, and hence space is wasted for all the nonexistent nodes. For a
well-balanced binary tree (an AVL tree, for example) the array representation is efficient,
but not for arbitrary ones.
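The index arithmetic of the array representation can be sketched as follows. The names kEmpty, leftChild, rightChild, hasNode, inorderArray and wastedSlots are assumptions made for this example, and the character '\0' is used as the "no such node" value.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// 1-based layout: the children of slot k live in slots 2k and 2k+1.
// kEmpty marks a slot that holds no node (slot 0 is never used).
const char kEmpty = '\0';

std::size_t leftChild(std::size_t k)  { return 2 * k; }
std::size_t rightChild(std::size_t k) { return 2 * k + 1; }

bool hasNode(const std::vector<char>& a, std::size_t k) {
    return k >= 1 && k < a.size() && a[k] != kEmpty;
}

// Inorder walk that follows indices instead of pointers.
void inorderArray(const std::vector<char>& a, std::size_t k, std::string& out) {
    if (!hasNode(a, k)) return;
    inorderArray(a, leftChild(k), out);
    out += a[k];
    inorderArray(a, rightChild(k), out);
}

// Counts the slots 1..n that are reserved but hold no node.
std::size_t wastedSlots(const std::vector<char>& a) {
    std::size_t wasted = 0;
    for (std::size_t k = 1; k < a.size(); ++k) {
        if (a[k] == kEmpty) ++wasted;
    }
    return wasted;
}
```

A full three-node tree fills slots 1 to 3 with nothing wasted, whereas a right-leaning chain of three nodes occupies slots 1, 3 and 7 and leaves four of its seven slots empty.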
We shall base our discussion on the assumption that two traversals represent a binary tree
uniquely. Any inconsistencies with this assumption (they do exist) shall be cited as and when
they are dealt with.
The following is the structure (C style) used for representing a node in the ongoing
discussion:
struct node{
    char data;
    struct node *left,*right;
};
13. Taken from http://www.geocities.com/acmearticles/treerec.htm

A pseudocode of the three traversals follows:

Inorder(x):Inorder(x.left),Visit(x),Inorder(x.right)
Preorder(x):Visit(x),Preorder(x.left),Preorder(x.right)
Postorder(x):Postorder(x.left),Postorder(x.right),Visit(x)
Hereafter we represent the traversals by the data in the nodes, given in the order they
are visited. So a traversal consists of as many characters as there are nonempty nodes in
the tree. The algorithms for tree reconstruction from two traversals are presented below.
Some facts that are made use of in reconstruction are presented first.
1.The first data in preorder traversal represents the root.
2.The last data in postorder traversal represents the root.
3.The traversals can be split into three parts as
Preorder traversal= root(Preorder of root.left)(Preorder of root.right)
Inorder traversal= (inorder of root.left)root(inorder of root.right)
Postorder traversal= (postorder of root.left)(postorder of root.right)root
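The three traversals translate almost word for word into code. The sketch below returns each traversal as a string; the function names are chosen for this example and the node type mirrors the C structure above.

```cpp
#include <cassert>
#include <string>

struct node {
    char data;
    node* left;
    node* right;
};

// Each function returns the data of the visited nodes, in visiting order.
std::string inorder(const node* x) {
    if (x == nullptr) return std::string();
    return inorder(x->left) + x->data + inorder(x->right);
}

std::string preorder(const node* x) {
    if (x == nullptr) return std::string();
    return x->data + preorder(x->left) + preorder(x->right);
}

std::string postorder(const node* x) {
    if (x == nullptr) return std::string();
    return postorder(x->left) + postorder(x->right) + x->data;
}
```

For a tree with root a, left subtree b (children d and e) and right subtree c (right child f), the traversals are preorder "abdecf", inorder "dbeacf" and postorder "debfca". The root a is the first preorder character and the last postorder character, as facts 1 and 2 state.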

Given Inorder and Preorder traversals
The first element of the preorder traversal represents the root. Let the position of that element
in the inorder traversal be i. The string of characters from the first element to the element at
(i-1) constitutes the inorder traversal of the left subtree, and the string of characters beyond i
till the end represents the inorder traversal of the right subtree. Now, as a subtree has as many
characters in its preorder traversal as in its inorder traversal, the preorder traversal can
easily be split apart into the root and the preorder traversals of the left and right subtrees
respectively.
C code for the same is given below (it needs <stdlib.h> for malloc).
struct node * buildtree(char *in,char *pre,int len)
{
    int i,lenright,lenleft;
    struct node *p;
    if(!len)return NULL;
    p=(struct node *)malloc(sizeof(struct node));
    if(!p)return NULL;                     /* out of memory */
    p->data=pre[0];p->left=NULL;p->right=NULL;
    if(len==1)return p;
    for(i=0;i<len&&in[i]!=pre[0];i++);     /* find the root in the inorder string */
    if(i==len)return p;                    /* inconsistent input: root not found */
    lenright=len-i-1;
    lenleft=len-lenright-1;
    p->left=buildtree(in,pre+1,lenleft);
    p->right=buildtree(in+lenleft+1,pre+lenleft+1,lenright);
    return p;
}
The above function returns a pointer to the root of the tree whose inorder and preorder
traversals are given as the first and second parameters respectively and their length as the
third parameter. The length of the inorder and preorder traversals will be the same.
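One way to check such a reconstruction is a round trip: rebuild the tree from its inorder and preorder strings, then print the postorder traversal of the result. The sketch below uses the same splitting idea with std::string and std::unique_ptr instead of raw pointers; Tree, build and postorder are names assumed for this example, and the keys are assumed to be distinct.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>

struct Tree {
    char data;
    std::unique_ptr<Tree> left;
    std::unique_ptr<Tree> right;
};

// Rebuilds a tree from its inorder and preorder strings.
// Both strings must hold the same set of distinct characters.
std::unique_ptr<Tree> build(const std::string& in, const std::string& pre) {
    assert(in.size() == pre.size());
    if (pre.empty()) return nullptr;
    const std::size_t i = in.find(pre[0]);      // root position in the inorder string
    assert(i != std::string::npos);
    std::unique_ptr<Tree> t(new Tree{pre[0], nullptr, nullptr});
    // i characters belong to the left subtree, the rest to the right subtree.
    t->left  = build(in.substr(0, i),  pre.substr(1, i));
    t->right = build(in.substr(i + 1), pre.substr(i + 1));
    return t;
}

std::string postorder(const Tree* t) {
    if (t == nullptr) return std::string();
    return postorder(t->left.get()) + postorder(t->right.get()) + t->data;
}
```

With inorder "dbeacf" and preorder "abdecf" the rebuilt tree has the postorder traversal "debfca".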

Given Inorder and Postorder traversals
The algorithm for this reconstruction is almost the same as the one above; the difference
lies in the fact that the root occurs last in the postorder traversal. Finding the relative
positions etc. works as in the above algorithm.
C code is given below.
struct node * buildtree(char *in,char *post,int len)
{
    int i,lenright,lenleft;
    struct node *p;
    if(!len)return NULL;
    p=(struct node *)malloc(sizeof(struct node));
    if(!p)return NULL;                     /* out of memory */
    p->data=post[len-1];p->left=NULL;p->right=NULL;
    if(len==1)return p;
    for(i=0;i<len&&in[i]!=post[len-1];i++); /* find the root in the inorder string */
    if(i==len)return p;                    /* inconsistent input: root not found */
    lenright=len-i-1;
    lenleft=len-lenright-1;
    p->left=buildtree(in,post,lenleft);
    p->right=buildtree(in+lenleft+1,post+lenleft,lenright);
    return p;
}
The above function returns a pointer to the root of the tree whose inorder and postorder
traversals and their length are given as the first, second and third parameters respectively.

Now what about the combination of the postorder and preorder traversals: does it represent a
unique tree? It seems it should. But let us take this particular case. If the left subtree of the
root is empty, then the preorder and postorder traversals can each be split into two parts as
below.
Preorder traversal= root(Preorder of root.right)
Postorder traversal= (postorder of root.right)root
Consider another case where the right subtree of the root is empty.
Preorder traversal= root(Preorder of root.left)
Postorder traversal= (postorder of root.left)root
These two cases cannot be distinguished from one another. Take preorder: abcd and
postorder: cdba. We can infer that a is the root, that b is the root of one of its subtrees, and
that the other subtree is empty. But which one is empty? Left or right?
This is enough proof that a combination of preorder and postorder traversals does not
represent a unique binary tree. But then does it mean that the inorder traversal carries more
information about the binary tree than the postorder or preorder traversals? Or is our
assumption that a combination of the inorder traversal and one of the other two represents a
unique tree wrong?
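The ambiguity is easy to reproduce. The sketch below builds the two candidate trees for preorder "abcd" and postorder "cdba": one hangs the subtree rooted at b on the left of a, the other on the right. All names (pre, post, in, leftHeavy, rightHeavy) are assumptions of this example.

```cpp
#include <cassert>
#include <string>

struct node {
    char data;
    const node* left;
    const node* right;
};

std::string pre(const node* x) {
    if (x == nullptr) return std::string();
    return x->data + pre(x->left) + pre(x->right);
}

std::string post(const node* x) {
    if (x == nullptr) return std::string();
    return post(x->left) + post(x->right) + x->data;
}

std::string in(const node* x) {
    if (x == nullptr) return std::string();
    return in(x->left) + x->data + in(x->right);
}

// Shared subtree: b with children c and d.
const node c = {'c', nullptr, nullptr};
const node d = {'d', nullptr, nullptr};
const node b = {'b', &c, &d};

// Two different trees: b is the left child in one, the right child in the other.
const node leftHeavy  = {'a', &b, nullptr};
const node rightHeavy = {'a', nullptr, &b};
```

Both trees give preorder "abcd" and postorder "cdba", but their inorder traversals differ: "cbda" for leftHeavy and "acbd" for rightHeavy.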

Non-Recursive algorithm for Binary Tree Reconstruction from inorder and postorder traversals
Step 1: select the first element in Inorder traversal

Step 2: find its position in Postorder traversal, say pos
Step 3: if same position then
{
make it the root and make the partially complete sub-tree its left
subtree.
select next element in Inorder.
}
else
{
all elements in Postorder traversal from present element to the present
root forms the right subtree of the root. Insert these elements just as
inserting into a binary search tree.
if (present element in Inorder = last element in Postorder) Go to step 5
next element selected in Inorder is the element at pos+1.
}
Step 4: go to step 2
Step 5: end.

C code for the same is given below. The code assumes the following. The tree holds data
values of type character, so the inorder traversal is a string of characters. Number the
characters of the inorder traversal in the order in which they appear, e.g., if the inorder
traversal is "abgf" then a->1, b->2, g->3, f->4. The inorder traversal is thus reduced to the
sorted numbers 1 to n, where n is the number of nodes in the tree. The mapped values are then
substituted into the postorder traversal, e.g., if the postorder traversal is "bafg" then the
modified postorder traversal will be {2,1,4,3}. Arrays I and P hold these modified inorder and
postorder traversals. Note that after this transformation the modified inorder traversal does
not carry any data regarding the structure of the tree. The traversals are stored in I and P
from position 1 onwards, with P[0]=I[0]=0.
The following functions and variables are used in the code below:
- findpos(int i,int *p) returns the position of i in the array p.
- insert(int k,node *root) inserts element k into the binary tree rooted at root in the
fashion of binary search tree insertion.
- updateroot(int k,node *root) constructs a new tree whose root holds k; the present tree
rooted at root is made its left subtree.
- root is a global variable that holds a pointer to the root of the binary tree to be
constructed.
- n holds the number of non-null elements in the tree.

At last (phew!), the code follows
void Buildtree()
{
    int presentposition=1,nextposition,position_in_postorder,previousposition,temp;
    insert(I[presentposition],root);
    previousposition=0; //initially NULL
    while(1)
    {
        nextposition=presentposition+1;
        position_in_postorder=findpos(presentposition,P);
        temp=position_in_postorder;
        if(position_in_postorder == presentposition)
        {
            if (findpos(I[nextposition],P)>presentposition)
            {
                updateroot(I[nextposition],root);
                previousposition=presentposition;
                presentposition++;
            }
        }
        else
        {
            position_in_postorder--;
            while(P[position_in_postorder]!=previousposition)
            {
                insert(P[position_in_postorder],root);
                position_in_postorder--;
            }
            if(temp>=n) return; //ending condition
            else
            {
                previousposition=presentposition;
                presentposition=temp+1;
                updateroot(presentposition,root);
            }
        }
    }
}
The above code was subject to a lot of testing and it worked.

Lesson 8 - Parsing or how to loop through bytes [14]
parse vt., vi. parsed, pars'ing [Now Rare]

1. to separate (a sentence) into its parts, explaining
the grammatical form, function, and interrelation of each
part 2. to describe the form, part of speech, and
function of (a word in a sentence)
The word parse is computer science parlance for the act of separating computer input into
meaningful parts for subsequent processing actions.
14. This lesson is taken from http://www.kilowattsoftware.com/tutorial/rexx/parseTutorial.htm (Kilowatt Software's
Classic Rexx Tutorial) and is adapted and modified for this course.
Recommended Reading: http://www.cs.vu.nl/~dick/PTAPG.html (Parsing Techniques - A Practical Guide)

Preparing to parse
Let us learn about parsing by analyzing the following reduction of Descartes' famous
quote:
I think I am
Here is a program that parses the words in the phrase. When a value consists of words
that are separated by only one space, and there are no leading or trailing spaces, the
value is easy to parse into a known number of words as follows.
parse value 'I think I am' with word1 word2 word3 word4
say "'"word1"'"
say "'"word2"'"
say "'"word3"'"
say "'"word4"'"
This shows:
'I'
'think'
'I'
'am'

Here is another program that parses the above phrase.
phrase = 'I think I am'

do while phrase <> ''
parse var phrase word phrase
say "'"word"'"
end
This shows:
'I'
'think'
'I'
'am'
This simple program achieved the desired result. The program is a Rexx parsing idiom. In
each loop iteration, the parse instruction extracts the first word in the phrase, and assigns the
remaining words (after the first word) to the phrase variable. The loop concludes when all of
the words in the phrase have been processed.
When there are more words in the value than there are variables in the template, the trailing
words are assigned to the last variable in the template. Here is an example.
parse value 'Sam likes peaches and cream' with subject verb object
say 'subject:' subject
say 'verb:' verb
say 'object:' object
This shows:
subject: Sam
verb: likes
object: peaches and cream

Now let's make Descartes' quote a little more challenging. Additional spaces in the
original phrase, and punctuation characters, introduce various difficulties.
I think, I am .
Here is the same phrase with spaces represented as dots: · , so they can be seen!
···I··think,··I am··.··
The first parsing challenge is to extract the words within the quote. Let's try to do it with
the words and word built-in functions.
phrase = '···I··think,··I am··.··'

do i=1 for words( phrase )
  say "'"word( phrase, i )"'"
end
This shows:
'I'
'think,'
'I'
'am'
'.'
This simple program worked well, although the second word includes a trailing comma. In
addition, the period is considered a word.

The following is an initial attempt to parse the words in the phrase.
parse value '···I··think,··I am··.··' with word1 word2 word3 word4 word5
say "'"word1"'"
say "'"word2"'"
say "'"word3"'"
say "'"word4"'"
say "'"word5"'"
This shows:
'I'
'think,'
'I'
'am'
'··.··'
Notice the spaces before and after the period.

The following program achieves a better result.
phrase = '···I··think,··I am··.··'

do while phrase <> ''
  parse var phrase word phrase
  say "'"word"'"
end
This shows:
'I'
'think,'
'I'
'am'
'.'
This was our second parsing program. It worked fairly well, although the second word
includes a trailing comma. In addition, the period is considered a word. This time there
are no spaces before and after the period.

Now let's successfully parse the phrase into words.
phrase = '···I··think,··I am··.··'

do while phrase <> ''
  parse var phrase word phrase
  word = strip( translate( word, , ',.;":?()' ) )
  if word <> '' then
    say "'"word"'"
end
This shows:
'I'
'think'
'I'
'am'
The above program translated punctuation characters to spaces, and then stripped spaces.
Any characters remaining after these operations were considered a word.
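For readers who do not use Rexx, the same idea can be sketched in C++. The function parseWords and its punctuation parameter are assumptions of this example, not part of the Rexx tutorial. One difference should be noted: this sketch treats punctuation as a separator, so punctuation inside a word splits it, whereas the Rexx program above replaces it with a space and keeps the result as one value.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Splits phrase into words. Blanks and the listed punctuation characters
// separate words and never become part of one.
std::vector<std::string> parseWords(const std::string& phrase,
                                    const std::string& punctuation = ",.;\":?()") {
    std::vector<std::string> words;
    std::string current;
    for (char ch : phrase) {
        const bool separator =
            ch == ' ' || punctuation.find(ch) != std::string::npos;
        if (!separator) {
            current += ch;                 // still inside a word
        } else if (!current.empty()) {
            words.push_back(current);      // a separator ends the word
            current.clear();
        }
    }
    if (!current.empty()) words.push_back(current);   // word at end of input
    return words;
}
```

Called with the phrase "   I  think,  I am  .  " it returns the four words I, think, I and am.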

How does parsing work ?

The parse statement divides a source string into constituent parts and assigns these to
variables, as directed by the parsing template.
The following picture introduces how parsing is performed, with multiple space dividers
between the variables to assign.

While the template is processed from left to right, several current positions in the source
string are maintained. The motion of these positions is guided by the division specifiers within
the template. In the picture above, the positions are those that would be in effect after the
template's verb term is processed. The object term will be processed next. The previous start
position locates the 'l' in 'likes'. The current end position locates the space between 'likes' and
'peaches'. The next start position locates the 'p' in 'peaches'. With these positions established
the value 'likes' is assigned to variable verb. When the object term is processed, it is the only
term remaining. Consequently, the remainder of the source string is assigned to the object
variable -- it receives the value: 'peaches and cream'.
If a relative position division specifier followed the verb term, the verb variable would receive
that many characters after the previous start position and all positions would be advanced to
that relative position. Study the following effect:
parse value 'Sam likes peaches and cream' with subject verb +2 object
say 'subject:' subject
say 'verb:' verb
say 'object:' object
This shows:
subject: Sam
verb: li
object: kes peaches and cream

The following is another illustration that shows how parsing is performed, with a literal
pattern divider between the variables to assign.
The literal pattern in this example is a quoted comma -- ',' . The previous start position
locates the 't' in 'think'. The current end position locates the ','. The next start position
locates the space between the comma and the 't' in 'therefore'. With these positions
established the value 'I think' is assigned to variable precondition. When the
consequence term is processed, it is the only term remaining. Consequently, the remainder of
the source string is assigned to the consequence variable -- it receives the value:
' therefore I am'. This value contains a leading space.

If a relative position division specifier followed the ',' literal pattern, the next start position
would be that many characters after the comma in the source string.
parse value 'I think, therefore I am' with precondition ',' +1 consequence
This advanced one character position after the comma. As a result, the consequence
variable receives the value 'therefore I am' without a leading space.
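The effect described above can be imitated in C++ as a rough sketch. The function splitAtPattern and its skip parameter are assumptions of this example; skip only mirrors the behaviour the text describes for a relative position after a literal pattern and is not a model of the complete Rexx template rules.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Splits source at the first occurrence of pattern. The first part is the text
// before the pattern; the second part starts `skip` characters after the pattern.
// When the pattern is not found, everything goes to the first part.
std::pair<std::string, std::string> splitAtPattern(const std::string& source,
                                                   const std::string& pattern,
                                                   std::size_t skip = 0) {
    const std::size_t hit = source.find(pattern);
    if (hit == std::string::npos) return {source, std::string()};
    std::size_t start = hit + pattern.size() + skip;
    if (start > source.size()) start = source.size();   // do not run past the end
    return {source.substr(0, hit), source.substr(start)};
}
```

Splitting "I think, therefore I am" at "," gives "I think" and " therefore I am" (with the leading space); passing skip = 1 gives "therefore I am" as the second part.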

Parsing Expressions by Recursive Descent [15]

Parsing expressions by recursive descent poses two classic problems:
1. how to get the abstract syntax tree to follow the precedence and associativity of
operators and
2. how to do so efficiently when there are many levels of precedence.
The classic solution to the first problem does not solve the second. I will present the
classic solution, a well known alternative known as the "Shunting Yard Algorithm", and a less
well known one that I have called "Precedence Climbing".
15. This Lesson was taken from http://www.engr.mun.ca/~theo/Misc/exp_parsing.htm and was written by
Theodore Norvell

Precedence and associativity

Consider the following example grammar, G,
E --> E "+" E
| E "-" E
| "-" E
| E "*" E
| E "/" E
| E "^" E
| "(" E ")"
| v
in which v is a terminal representing identifiers and constants.
We want to build a parser that will
1.Produce an error message if its input is not in the language of this grammar.
2.Produce an "abstract syntax tree" (AST) reflecting the structure of the input, if
the input is in the language of the grammar.
Each (correct) input will have a single AST based on the following precedence and
associativity rules:
Parentheses have precedence over all other operators.
^ (exponentiation) has precedence over /, *, -, and +.
* and / have precedence over - and +.
Unary - has precedence over binary - and +.
^ is right associative while all other operators are left associative.
For example the first three rules tell us that
a ^ b * c ^ d + e ^ f / g ^ (h + i)
parses to the tree
+( *( ^(a,b), ^(c,d) ), /( ^(e,f), ^(g,+(h,i)) ) )
while the last rule tells us that
a - b - c
parses to -(-(a,b),c) rather than -(a,-(b,c)), whereas
a ^ b ^ c
parses to ^(a, ^(b,c)) rather than ^(^(a,b), c).

Aside: I am assuming that the desired output of the parser is an abstract syntax tree
(AST). The same considerations arise if the output is to be some other form such as
reverse-polish notation (RPN), calls to an analyzer and code generator (for one-pass
compilers), or a numerical result (as in a calculator). All the algorithms I present are easily
modified for these forms of output.

Recursive-descent parsing
The idea of recursive-descent parsing is to transform each nonterminal of a grammar into a
subroutine that will recognize exactly that nonterminal in the input.
Left recursive grammars, such as G, are unsuitable because a left-recursive production leads
to an infinite recursion in the recursive-descent parser. While the parser may be partially
correct, it may not terminate.
We can transform G to a non-left-recursive grammar G1 as follows:

E --> P {B P}
P --> v | "(" E ")" | U P
B --> "+" | "-" | "*" | "/" | "^"
U --> "-"
The braces "{" and "}" represent zero or more repetitions of what is inside of them. Thus you
can think of E as having an infinity of alternatives:
E --> P | P B P | P B P B P | ... ad infinitum
The language described by this grammar is the same as that of grammar G: L(G1) = L(G).
Not only is left recursion eliminated, but each choice can be made by looking at the next
token in the input.
Let's look at a recursive descent recognizer based on this grammar. I call this algorithm a
recognizer because all it does is to recognize whether the input is in the language of the
grammar or not. That is, it does not produce an abstract syntax tree, or any other form of
output that represents the contents of the input.
I'll assume that the following subroutines exist:
- "next" returns the next token of input or special marker "end" to represent that there are
no more input tokens. "next" does not alter the input stream.
- "consume" reads one token. When "next=end", consume is still allowed, but has no effect.
- "error" stops the parsing process and reports an error.
Using these, let's construct a subroutine "expect", which I will use throughout this essay:
expect( tok ) is
    if next = tok
        consume
    else
        error
We will now write a subroutine called "Erecognizer". If it does not call "error", then the
input was an expression according to the above grammar. If it does call "error", then the
input contained a syntax error, e.g. unmatched parentheses, a missing operator or operand,
etc.
Erecognizer is
    E
    expect( end )
E is
    P
    while next is a binary operator
        consume
        P
P is
    if next is a v
        consume
    else if next = "("
        consume
        E
        expect( ")" )
    else if next is a unary operator
        consume
        P
    else
        error

Notice how the structure of the recognition algorithm mirrors the structure of the grammar.
This is the essence of recursive descent parsing.
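The recognizer can be turned into running code with very little effort. The sketch below makes some simplifying assumptions that are not part of the essay: every token is a single character, a v is one letter or digit, blanks are skipped, and "error" is modelled as a C++ exception that accepts() turns into false. The class name Recognizer and the constant kEnd are invented for this example.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>

const char kEnd = '\0';   // stands for the "end" marker

class Recognizer {
public:
    explicit Recognizer(const std::string& input) : text_(input), pos_(0) {}

    // True when the whole input is an expression of grammar G1.
    bool accepts() {
        try {
            E();
            expect(kEnd);
            return true;
        } catch (const std::runtime_error&) {
            return false;
        }
    }

private:
    // "next": look at the next token without taking it (blanks are skipped).
    char next() {
        while (pos_ < text_.size() && text_[pos_] == ' ') ++pos_;
        return pos_ < text_.size() ? text_[pos_] : kEnd;
    }
    void consume() { if (next() != kEnd) ++pos_; }
    static void error() { throw std::runtime_error("syntax error"); }
    void expect(char tok) { if (next() == tok) consume(); else error(); }

    static bool isBinary(char c) {
        return c == '+' || c == '-' || c == '*' || c == '/' || c == '^';
    }
    static bool isV(char c) {
        return std::isalnum(static_cast<unsigned char>(c)) != 0;
    }

    // E --> P {B P}
    void E() {
        P();
        while (isBinary(next())) {
            consume();
            P();
        }
    }

    // P --> v | "(" E ")" | U P
    void P() {
        const char t = next();
        if (isV(t)) {
            consume();
        } else if (t == '(') {
            consume();
            E();
            expect(')');
        } else if (t == '-') {
            consume();
            P();
        } else {
            error();
        }
    }

    std::string text_;
    std::size_t pos_;
};
```

For example, Recognizer("-(a - b) ^ 2").accepts() is true, while "(a+b" (unmatched parenthesis), "a b" (missing operator) and "a+" (missing operand) are rejected.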
The difference between a recognizer and a parser is that a parser produces some kind of
output that reflects the structure of the input. Next we will look at a way to modify the above
recognition algorithm to be a parsing algorithm. It will build an AST, according to the
precedence and associativity rules, using a method known as the "shunting yard" algorithm.

The shunting yard algorithm

The idea of the shunting yard algorithm is to keep operators on a stack until we are sure
we have parsed both their operands. The operands are kept on a second stack. The
shunting yard algorithm can be used to directly evaluate expressions as they are parsed
(it is commonly used in electronic calculators for this task), to create a reverse Polish
notation translation of an infix expression, or to create an abstract syntax tree. I'll create
an abstract syntax tree, so my operand stack will contain trees.
When parsing for example x*y+z, we push x on the operand stack, * on the operator
stack, and y on the operand stack. When the + is read, we compare it to the top of the
operator stack, which is *. Since the + has lower precedence than *, we know that both
operands to the * have been read and, in fact, will be on top of the operand stack. The
operands are popped, a new tree is built, *(x,y), and it is pushed on the operand stack.
Then the + is pushed on the operator stack. At the end of an expression the remaining
operators are put into trees with their operands and that is that.
In addition to "next", "consume", "end", "error", and "expect", which are explained in the
previous section, I will assume that the following subroutines and constants exist:
- "binary" converts a token matched by B to an operator.
- "unary" converts a token matched by U to an operator. We require that functions
"unary" and "binary" have disjoint ranges.
- "mkLeaf" converts a token matched by v to a tree.
- "mkNode" takes an operator and one or two trees and returns a tree.
- "push", "pop", "top": the usual stack operations.
- "empty": an empty stack
- "sentinel" is a value that is not in the range of either unary or binary.

In the algorithm that follows I compare operators and the sentinel with a > sign. This
comparison is defined as follows:
- binary(x) > binary(y), if x has higher precedence than y, or x is left associative
and x and y have equal precedence.
- unary(x) > binary(y), if x has precedence higher or equal to y's
- op > unary(y), never (where op is any unary or binary operator)
- sentinel > op, never (where op is any unary or binary operator)
- op > sentinel (where op is any unary or binary operator): This case doesn't arise.
Now we define the following subroutines:
Aside: I hope the pseudo-code notation is fairly clear. I'll just comment that I'm assuming that
parameters are passed by reference, so only 2 stacks are created throughout the execution
of EParser.
Eparser is
    var operators : Stack of Operator <- empty
    var operands : Stack of Tree <- empty
    push( operators, sentinel )
    E( operators, operands )
    expect( end )
    return top( operands )
E( operators, operands ) is
    P( operators, operands )
    while next is a binary operator
        pushOperator( binary(next), operators, operands )
        consume
        P( operators, operands )
    while top(operators) not= sentinel
        popOperator( operators, operands )
P( operators, operands ) is
    if next is a v
        push( operands, mkLeaf( v ) )
        consume
    else if next = "("
        consume
        push( operators, sentinel )
        E( operators, operands )
        expect( ")" )
        pop( operators )
    else if next is a unary operator
        pushOperator( unary(next), operators, operands )
        consume
        P( operators, operands )
    else
        error
popOperator( operators, operands ) is
    if top(operators) is binary
        const t1 <- pop( operands )
        const t0 <- pop( operands )
        push( operands, mkNode( pop(operators), t0, t1 ) )
    else
        push( operands, mkNode( pop(operators), pop(operands) ) )
pushOperator( op, operators, operands ) is
    while top(operators) > op
        popOperator( operators, operands )
    push( operators, op )
The Shunting Yard Algorithm appears to have been invented by Edsger Dijkstra around
1960 in connection with one of the first Algol compilers.
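The following is a compact C++ sketch of the algorithm above. Several things in it are assumptions of this example rather than part of the essay: tokens are single characters, a v is one letter or digit, trees are represented as strings in prefix form such as +(*(x,y),z), errors are thrown as exceptions, and unary minus is given precedence 3 (the same level as binary + and -). The names Op, kSentinel, outranks and ShuntingYard are invented here.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Operator-stack entry; the sentinel marks the bottom of a (sub)expression.
struct Op { char sym; bool unary; bool sentinel; };
const Op kSentinel = {'\0', false, true};
const char kEnd = '\0';

int prec(const Op& o) {
    if (o.unary) return 3;
    switch (o.sym) {
        case '^': return 5;
        case '*': case '/': return 4;
        default:  return 3;                      // binary + and -
    }
}

// The ">" comparison of the text: must `top` be reduced before `op` is pushed?
bool outranks(const Op& top, const Op& op) {
    if (top.sentinel || op.unary) return false;  // "never" cases
    if (top.unary) return prec(top) >= prec(op);
    if (prec(top) != prec(op)) return prec(top) > prec(op);
    return top.sym != '^';                       // equal level: left-associative only
}

class ShuntingYard {
public:
    explicit ShuntingYard(const std::string& s) : text_(s), pos_(0) {}

    // Returns the tree in prefix form; throws std::runtime_error on a syntax error.
    std::string parse() {
        operators_.push_back(kSentinel);
        E();
        expect(kEnd);
        return operands_.back();
    }

private:
    char next() {
        while (pos_ < text_.size() && text_[pos_] == ' ') ++pos_;
        return pos_ < text_.size() ? text_[pos_] : kEnd;
    }
    void consume() { if (next() != kEnd) ++pos_; }
    static void error() { throw std::runtime_error("syntax error"); }
    void expect(char tok) { if (next() == tok) consume(); else error(); }
    static bool isBinary(char c) { return std::string("+-*/^").find(c) != std::string::npos; }
    static bool isV(char c) { return std::isalnum(static_cast<unsigned char>(c)) != 0; }

    void E() {
        P();
        while (isBinary(next())) {
            pushOperator(Op{next(), false, false});
            consume();
            P();
        }
        while (!operators_.back().sentinel) popOperator();
    }
    void P() {
        const char t = next();
        if (isV(t)) {
            operands_.push_back(std::string(1, t));
            consume();
        } else if (t == '(') {
            consume();
            operators_.push_back(kSentinel);
            E();
            expect(')');
            operators_.pop_back();               // drop the inner sentinel
        } else if (t == '-') {
            pushOperator(Op{'-', true, false});
            consume();
            P();
        } else {
            error();
        }
    }
    void popOperator() {
        const Op op = operators_.back(); operators_.pop_back();
        const std::string t1 = operands_.back(); operands_.pop_back();
        if (op.unary) { operands_.push_back("-(" + t1 + ")"); return; }
        const std::string t0 = operands_.back(); operands_.pop_back();
        operands_.push_back(std::string(1, op.sym) + "(" + t0 + "," + t1 + ")");
    }
    void pushOperator(const Op& op) {
        while (outranks(operators_.back(), op)) popOperator();
        operators_.push_back(op);
    }

    std::string text_;
    std::size_t pos_;
    std::vector<Op> operators_;
    std::vector<std::string> operands_;
};
```

ShuntingYard("x*y+z").parse() yields "+(*(x,y),z)", and "a ^ b ^ c" yields "^(a,^(b,c))", matching the associativity rules given earlier.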

The classic solution

The classic solution to recursive-descent parsing of expressions is to create a new
nonterminal for each level of precedence as follows. G2:
E --> T {( "+" | "-" ) T}
T --> F {( "*" | "/" ) F}
F --> P ["^" F]
P --> v | "(" E ")" | "-" T
(The brackets [ and ] enclose an optional part of the production. As before, the braces { and }
enclose parts of the productions that may be repeated 0 or more times. The unquoted
parentheses ( and ) serve only to group elements in a production.)
Grammar G2 describes the same language as the previous two grammars: L(G2) = L(G1) =
L(G).
The grammar is ambiguous; for example, -x*y has two parse trees. The ambiguity is resolved
by staying in each loop (in the productions for E and T) as long as possible and by taking the
option if possible (in the production for F). With that policy in place, all choices can be made
by looking only at the next token of input.
Note that the left-associative and the right-associative operators are treated differently;
left-associative operators are consumed in a loop, while right-associative operators are
handled with right-recursive productions. This is to make the tree building a bit easier.

We can transform this grammar to a parser written in pseudo code.

Eparser is
    var t : Tree
    t <- E
    expect( end )
    return t
E is
    var t : Tree
    t <- T
    while next = "+" or next = "-"
        const op <- binary(next)
        consume
        const t1 <- T
        t <- mkNode( op, t, t1 )
    return t
T is
    var t : Tree
    t <- F
    while next = "*" or next = "/"
        const op <- binary(next)
        consume
        const t1 <- F
        t <- mkNode( op, t, t1 )
    return t
F is
    var t : Tree
    t <- P
    if next = "^"
        consume
        const t1 <- F
        return mkNode( binary("^"), t, t1)
    else
        return t
P is
    var t : Tree
    if next = "("
        consume
        t <- E
        expect( ")" )
        return t
    else if next = "-"
        consume
        t <- T
        return mkNode( unary("-"), t)
    else if next is a v
        t <- mkLeaf( next )
        consume
        return t
    else
        error
It may be worthwhile to trace this algorithm on a few example inputs.
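To make such traces easier, here is a C++ sketch of the classic parser for grammar G2. As assumptions of this example, tokens are single characters, a v is one letter or digit, trees are returned as prefix strings, errors are thrown as exceptions, and the operand of unary minus is parsed with T as in the production P --> "-" T. The class name ClassicParser is invented here.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>

class ClassicParser {
public:
    explicit ClassicParser(const std::string& s) : text_(s), pos_(0) {}

    // Returns the tree in prefix form; throws std::runtime_error on a syntax error.
    std::string parse() {
        const std::string t = E();
        expect('\0');
        return t;
    }

private:
    char next() {
        while (pos_ < text_.size() && text_[pos_] == ' ') ++pos_;
        return pos_ < text_.size() ? text_[pos_] : '\0';
    }
    void consume() { if (next() != '\0') ++pos_; }
    static void error() { throw std::runtime_error("syntax error"); }
    void expect(char tok) { if (next() == tok) consume(); else error(); }
    static std::string mkNode(char op, const std::string& l, const std::string& r) {
        return std::string(1, op) + "(" + l + "," + r + ")";
    }

    // E --> T {( "+" | "-" ) T}     left-associative: handled by a loop
    std::string E() {
        std::string t = T();
        while (next() == '+' || next() == '-') {
            const char op = next();
            consume();
            const std::string t1 = T();
            t = mkNode(op, t, t1);
        }
        return t;
    }
    // T --> F {( "*" | "/" ) F}
    std::string T() {
        std::string t = F();
        while (next() == '*' || next() == '/') {
            const char op = next();
            consume();
            const std::string t1 = F();
            t = mkNode(op, t, t1);
        }
        return t;
    }
    // F --> P ["^" F]               right-associative: handled by recursion
    std::string F() {
        const std::string t = P();
        if (next() == '^') {
            consume();
            return mkNode('^', t, F());
        }
        return t;
    }
    // P --> v | "(" E ")" | "-" T
    std::string P() {
        const char c = next();
        if (c == '(') {
            consume();
            const std::string t = E();
            expect(')');
            return t;
        }
        if (c == '-') {
            consume();
            return "-(" + T() + ")";
        }
        if (std::isalnum(static_cast<unsigned char>(c)) != 0) {
            consume();
            return std::string(1, c);
        }
        error();
        return std::string();   // not reached; error() always throws
    }

    std::string text_;
    std::size_t pos_;
};
```

Parsing "a-b-c" gives "-(-(a,b),c)", "a^b^c" gives "^(a,^(b,c))" and "-x*y" gives "-(*(x,y))". Note that even the single identifier "a" passes through E, T, F and P, which is the cost discussed next.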
Although this is the classic solution, it has a few drawbacks:

- The size of the code is proportional to the number of precedence levels.
- The speed of the algorithm is proportional to the number of precedence levels.
- The number of precedence levels is built in.
When there are a large number of precedence levels, as in the C and C++ languages, the
first two disadvantages become problematic. In Pascal the number of precedence levels was
deliberately kept small because, I suspect, its designer, Niklaus Wirth, was aware of the
shortcomings of this method when the number of precedence levels is large.
The size problem can be overcome by creating one subroutine that is parameterized by
precedence level rather than writing a separate routine for each level. But the speed problem
remains. Note that the number of calls to parse an expression consisting of a single identifier
is proportional to the number of levels of precedence.
I'm not sure who invented what I am calling the classic algorithm.

Precedence climbing
A method that solves all the listed problems of the classic solution, while being simpler
than the shunting-yard algorithm is what I call "precedence climbing".
Consider the input sequence
a ^ b * c + d + e
The E subroutine of the classic solution will deal with this by three calls to T, and by
consuming the 2 "+"s, building a tree
+(+(result of first call, result of second call), result of third call)
We say that this loop directly consumes the two "+" operators.
The precedence climbing algorithm has a similar loop, but it always directly consumes the
first binary operator, then it consumes the next binary operator that is of lower
precedence, then the next operator that is of lower precedence than that. When it consumes a
left-associative operator, the same loop will also consume the next operator of equal
precedence. Let me rewrite the example with operators written at different heights according
to their precedence:
              +   +
          *
      ^
    a   b   c   d   e
One loop can consume all 4 operators, creating the tree
+(+(*(^(result of first call, result of second call), result of 3rd call), result of 4th call),
result of 5th call)

Each operator is assigned a precedence number. To make things more interesting let's add a
few more binary operators and use the following precedence tables:

Unary operators
Operator   Precedence
-          3

Binary operators
Operator   Precedence   Associativity
||         0            Left
&&         1            Left
=          2            Left
+, -       3            Left
*, /       4            Left
^          5            Right

We use the following grammar G3 in which nonterminal Exp is parameterized by a
precedence level. The idea is that Exp(p) recognizes expressions which contain no binary
operators (other than in parentheses) with precedence less than p.
E --> Exp(0)
Exp(p) --> P {B Exp(q)}
P --> U Exp(q) | "(" E ")" | v
B --> "+" | "-" | "*" |"/" | "^" | "||" | "&&" | "="
U --> "-"
The loop implied by the braces, { and }, in the production for Exp(p) presents a problem:
when should the loop be exited? This choice is resolved as follows:
- If the next token is a binary operator and the precedence of that operator is
greater or equal to p, then the loop is (re)entered.
- Otherwise the loop is exited.
In the productions for Exp(p) and P, the recursive use of Exp is parameterized by a value
q. So there is a second choice to resolve: how is q chosen? The value of q is chosen
according to the previous operator:
- In the binary operator case:
    o if the binary operator is left associative, q = the precedence of the operator + 1,
    o if the binary operator is right associative, q = the precedence of the operator.
- After unary operators,
    o q = the precedence of the operator.
Consider what will happen in parsing the expression
a * b - c * d - e * f = g * h - i * j - k * l.
To make things clearer, I'll present this expression 2 dimensionally to show the
precedences of the operators:
2                         =
3         -       -               -       -
4     *       *       *       *       *       *
    a   b   c   d   e   f   g   h   i   j   k   l
      !   !       !       !

The call to Exp(0) will consume exactly the operators indicated by a "!". The sub-expressions
a, b, c*d, e*f, and g*h-i*j-k*l will be parsed by calls to P, Exp(5), Exp(4), Exp(4) and
Exp(3) respectively.
What about right-associative operators? Consider an expression
a^b^c
Because of the different way right-associative operators are treated, Exp(0) will only con-
sume the first ^, as the second will be gobbled up by a recursive call to Exp(5).

A recursive-descent parser based on this method looks something like this:

Eparser is
    var t : Tree
    t <- Exp( 0 )
    expect( end )
    return t
Exp( p ) is
    var t : Tree
    t <- P
    while next is a binary operator and prec(binary(next)) >= p
        const op <- binary(next)
        consume
        const q <- case associativity(op)
                   of Right: prec( op )
                      Left:  1+prec( op )
        const t1 <- Exp( q )
        t <- mkNode( op, t, t1 )
    return t
P is
    if next is a unary operator
        const op <- unary(next)
        consume
        const q <- prec( op )
        const t <- Exp( q )
        return mkNode( op, t )
    else if next = "("
        consume
        const t <- Exp( 0 )
        expect ")"
        return t
    else if next is a v
        const t <- mkLeaf( next )
        consume
        return t
    else
        error
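
The pseudocode above translates almost line for line into a small program. The following is a
minimal sketch in Python and is not part of the original article: the tokenizer, the use of nested
tuples as tree nodes and the names Parser, tokenize, BINARY and UNARY are assumptions made for
illustration. The precedence values are the ones from the tables in this section.

```python
import re

# Precedence and associativity, copied from the tables in the text.
BINARY = {
    "||": (0, "L"), "&&": (1, "L"), "=": (2, "L"),
    "+": (3, "L"), "-": (3, "L"),
    "*": (4, "L"), "/": (4, "L"),
    "^": (5, "R"),
}
UNARY = {"-": 3}
END = "<end>"          # hypothetical end-of-input marker


def tokenize(text):
    # Operators, parentheses and names/numbers; whitespace is skipped.
    return re.findall(r"\|\||&&|[-+*/^=()]|\w+", text) + [END]


class Parser:
    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0

    @property
    def next(self):
        return self.tokens[self.pos]

    def consume(self):
        self.pos += 1

    def expect(self, tok):
        if self.next != tok:
            raise SyntaxError(f"expected {tok!r}, found {self.next!r}")
        self.consume()

    def parse(self):                      # Eparser
        tree = self.exp(0)
        self.expect(END)
        return tree

    def exp(self, p):                     # Exp(p)
        tree = self.primary()
        while self.next in BINARY and BINARY[self.next][0] >= p:
            op = self.next
            prec, assoc = BINARY[op]
            self.consume()
            q = prec if assoc == "R" else prec + 1
            tree = (op, tree, self.exp(q))
        return tree

    def primary(self):                    # P
        tok = self.next
        if tok in UNARY:
            self.consume()
            return (tok, self.exp(UNARY[tok]))
        if tok == "(":
            self.consume()
            tree = self.exp(0)
            self.expect(")")
            return tree
        if re.fullmatch(r"\w+", tok):
            self.consume()
            return tok                    # a leaf is just the token text
        raise SyntaxError(f"unexpected token {tok!r}")
```

Calling Parser("a ^ b * c + d + e").parse() builds the tree +(+(*(^(a, b), c), d), e) as nested
tuples, and Parser("a^b^c").parse() groups to the right because "^" is right associative.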

I first saw this algorithm described by Keith Clarke in a posting to comp.compilers many years
ago. Most recently I used it in a JavaCC parser for a subset of C++. I've also used it in a
parser based on monadic parsing written in Haskell. I'd be happy to mail either grammar to
anyone who is interested.

How to parse and scan through hex bytes

In general our parser has to be very simple.
We suppose that our file is stored in memory as a linked list. Each node of the linked list
contains one hex value (one byte) of our file, e.g. 2F.
Our main problem is to "scan" this linked list and to shout "hello" when we have found a
series of hex values which gives us a working mnemonic code. HOW we check for a working
mnemonic is not important at this point; we discuss this later in Lesson 9 - Opcodes and
Mnemonics (pages 108 ff).
Well, life could be easy. Imagine this:
Each hex value corresponds to exactly one mnemonic. Wow, how easy. We would just take
each node in our linked list and translate it.
But reality looks different!
Consider this:
55 -> PUSH EBP
but
8B45 10 -> MOV EAX,DWORD PTR SS:[EBP+10]
Can you see our problem? The mnemonics have different sizes!
We will first discuss the theory and the pseudo-code for this problem. Please note that this
short chapter will not describe theoretical topics like CF-grammars or similar.
Later I will give you assembly code for this in Lesson 6 - Parsing (pages 218 ff.). Linked
lists in general can be found in Lesson 5 - Linked lists (pages 199 ff.).

So how could a pseudo-code look?

1. Initialise our linked list and set the pointer to the first node.
2. Add the hex value from the node to our stack.
3. Check the stack to see if we have a working mnemonic.
4. If we found one: print this mnemonic, clear the stack, go to the next node and goto 2.
5. If not: go to the next node of the linked list and goto 2.
So this is easy. But will it work? Well, yes and no. In theory it looks good, but when you look
at the pseudo-code you can see that there is no end condition! So in real life this code will
crash when there are no more hex values.
Next, we have not checked whether our stack has grown beyond 15 values. If we cannot find a
working mnemonic within these 15 values, something has gone wrong: either our parser does not
work, or our opcode list is too short and does not contain a matching entry for these 15
nodes, or the file has some strange mnemonics in it!
Why do we have to check for these 15 values?
Because an instruction has a length between 1 and 15 bytes. So simple.
Let us suppose that we have a "good" file, where all hex values result in working code with
no problems.

So an advanced pseudo-code would look like this:

This looks better. We have improved the parsing algorithm and added some important fea-
tures:
- Checking if a single node is equal to a single opcode. We do this only if the
stack is empty.
- Error-handling if we can not get a value from a node.
- Checking if our stack (which grows with each loop) contains a valid complex
opcode.
- Checking if we reached the end of the linked list
Again we suppose that the linked list contains values which fit our opcode list. This
means that all hex values (incl. combinations) can be translated somehow into a disassembled
listing.
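
The features listed above can be sketched as follows. This is a minimal Python sketch under stated
assumptions, not the assembly code developed later in the book: a plain Python list stands in for
the linked list, the name scan is made up, and OPCODES is a hypothetical three-entry toy opcode
list (two of its entries are the examples from this chapter).

```python
MAX_LEN = 15  # an instruction is between 1 and 15 bytes long

# Hypothetical toy opcode list: byte sequence -> disassembly text.
OPCODES = {
    (0x55,): "PUSH EBP",
    (0x8B, 0x45, 0x10): "MOV EAX,DWORD PTR SS:[EBP+10]",
    (0xC3,): "RETN",
}


def scan(nodes):
    """Walk the byte list and return the disassembled lines."""
    listing = []
    stack = []
    for value in nodes:                      # end condition: the list is exhausted
        # Error handling if we cannot get a usable value from a node.
        if not isinstance(value, int) or not 0 <= value <= 0xFF:
            raise ValueError(f"bad node value: {value!r}")
        stack.append(value)
        # With one byte on the stack this is the single-opcode check,
        # with more bytes it is the check for a complex opcode.
        text = OPCODES.get(tuple(stack))
        if text is not None:
            listing.append(text)
            stack.clear()
        elif len(stack) >= MAX_LEN:
            raise ValueError("no opcode matches 15 bytes: " + bytes(stack).hex(" "))
    if stack:
        raise ValueError("input ends inside an instruction: " + bytes(stack).hex(" "))
    return listing
```

For the input 55 8B 45 10 C3 the sketch returns the three lines PUSH EBP, MOV EAX,DWORD PTR
SS:[EBP+10] and RETN. Any byte sequence that is missing from the toy list ends in an error, which
is exactly the problem discussed next.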
Can you feel our big problem?
If we miss an opcode, the parser may produce a wrong disassembly or stop during the
disassembly!
This will be the main problem of one of the next chapters.
We will later give source code for a working main-frame parsing algorithm. Feel free to
optimize this algorithm; ours is just a starting point, kept simple.
You now have some background knowledge about parsing and its problems for us. There are
whole books describing the parsing problem, but do not read them until you really need to
know what CF-grammars are…
No, this pseudo-algorithm is for sure not the best, but it is simplified for easier understanding.
On the next page you will find an answer to a very common problem: how to parse the command
line under MASM. This is where I leave you alone with your thoughts…

A small algorithm which parses the commandline for a filename [16]
It can be used for example to open a txt file directly with your own editor, etc. It checks the
following possibilities:
(1): AppName.exe
(2): "AppName.exe" CommandLine
(3): AppName.exe "CommandLine"
(4): "AppName.exe" "CommandLine"
Are there some more ways and/or errors?

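code:
ProcessCommandLine PROC lpCmdLine:DWORD
    pushad
    mov     edi, lpCmdLine
    xor     ecx, ecx
    dec     ecx                         ; ecx = -1: no length limit for scasb
    mov     al, [edi]
    inc     edi
    .IF al == 22h                       ; program name starts with a quote
        repnz   scasb                   ; skip to just past the closing quote
        inc     edi                     ; skip the separating space
        mov     al, [edi]
        .IF al == 22h                   ; the argument is quoted as well
            inc     edi
            push    edi                 ; remember the start of the argument
            repnz   scasb               ; find its closing quote
            dec     edi
            mov     byte ptr [edi], 0   ; cut the string at the closing quote
            pop     edi
        .ENDIF
    .ELSE                               ; program name is not quoted
@@:     inc     edi
        mov     al, [edi]
        .IF !al                         ; end of string: no quoted argument found
            popad
            xor     eax, eax
            ret
        .ENDIF
        cmp     al, 22h                 ; look for the opening quote of the argument
        jnz     @B
        inc     edi
        push    edi
        repnz   scasb
        dec     edi
        mov     byte ptr [edi], 0
        pop     edi
    .ENDIF
    mov     lpCmdLine, edi
    popad
    mov     eax, lpCmdLine              ; return the pointer to the argument
    ret
ProcessCommandLine ENDP

16. Source by Rennsemmel, http://board.win32asmcommunity.net/showthread.php?s=&threadid=7464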
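
To make the four cases easier to follow, here is a minimal Python sketch of the same idea. It is
not a translation of the MASM routine above: the function name command_line_filename is made up,
and unlike the routine it also accepts an unquoted argument after an unquoted program name.

```python
def command_line_filename(cmdline):
    """Return the file name that follows the program name, or None."""
    # Skip the program name, which may or may not be quoted.
    if cmdline.startswith('"'):
        end = cmdline.find('"', 1)
        if end < 0:
            return None                  # unbalanced quote around the program name
        rest = cmdline[end + 1:]
    else:
        _, _, rest = cmdline.partition(" ")
    rest = rest.strip()
    if not rest:
        return None                      # form (1): no argument at all
    if rest.startswith('"'):
        # Forms (3) and (4): take the text between the quotes.
        end = rest.find('"', 1)
        return rest[1:end] if end >= 0 else rest[1:]
    return rest                          # form (2): unquoted argument
```

A quoted file name may contain spaces, which is the reason forms (3) and (4) have to be handled
separately from form (2).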

A simple Hex-Dump algorithm

This is a small and easy algorithm to dump a file as hex values. It was coded by Hutch [17],
so respect his work.
Original thread:
I prototyped this algo in PowerBASIC inline but as it was a simple port
with only a few fiddles, I converted it to MASM notation as it may be use-
ful to a few people.
The algo takes a file read into a buffer, its length and the buffer to
write the hex dump to.
Important with this algo is to allocate the file length TIMES 4 as the
destination buffer as the hex dump is longer than the original data.
The formatting imposed limitations on the efficiency of this algo, every
second WORD size write is misaligned which will reduce its speed but it is
a lot faster than the one I replaced and I could not see another way to
maintain alignment without making the formatting unacceptable so I kept it
as it is.
Regards,
hutch@movsd.com
17.http://board.win32asmcommunity.net

code:
; #########################################################################
HexDump proc lpString:DWORD,lnString:DWORD,lpbuffer:DWORD
LOCAL lcnt:DWORD
push ebx
push esi
push edi
jmp over_table
align 16
hex_table:
db "00","01","02","03","04","05","06","07","08","09","0A","0B","0C","0D","0E","0F"
db "10","11","12","13","14","15","16","17","18","19","1A","1B","1C","1D","1E","1F"
db "20","21","22","23","24","25","26","27","28","29","2A","2B","2C","2D","2E","2F"
db "30","31","32","33","34","35","36","37","38","39","3A","3B","3C","3D","3E","3F"
db "40","41","42","43","44","45","46","47","48","49","4A","4B","4C","4D","4E","4F"
db "50","51","52","53","54","55","56","57","58","59","5A","5B","5C","5D","5E","5F"
db "60","61","62","63","64","65","66","67","68","69","6A","6B","6C","6D","6E","6F"
db "70","71","72","73","74","75","76","77","78","79","7A","7B","7C","7D","7E","7F"
db "80","81","82","83","84","85","86","87","88","89","8A","8B","8C","8D","8E","8F"
db "90","91","92","93","94","95","96","97","98","99","9A","9B","9C","9D","9E","9F"
db "A0","A1","A2","A3","A4","A5","A6","A7","A8","A9","AA","AB","AC","AD","AE","AF"
db "B0","B1","B2","B3","B4","B5","B6","B7","B8","B9","BA","BB","BC","BD","BE","BF"
db "C0","C1","C2","C3","C4","C5","C6","C7","C8","C9","CA","CB","CC","CD","CE","CF"
db "D0","D1","D2","D3","D4","D5","D6","D7","D8","D9","DA","DB","DC","DD","DE","DF"
db "E0","E1","E2","E3","E4","E5","E6","E7","E8","E9","EA","EB","EC","ED","EE","EF"
db "F0","F1","F2","F3","F4","F5","F6","F7","F8","F9","FA","FB","FC","FD","FE","FF"
over_table:
lea ebx, hex_table ; get base address of table

mov esi, lpString ; address of source string
mov edi, lpbuffer ; address of output buffer
mov eax, esi
add eax, lnString
mov ecx, eax ; exit condition for byte read
mov lcnt, 0
xor eax, eax ; prevent stall
; %%%%%%%%%%%%%%%%%%%%%%% loop code %%%%%%%%%%%%%%%%%%%%%%%%%%%%%
hxlp:
mov al, [esi] ; get BYTE
inc esi
inc lcnt
mov dx, [ebx+eax*2] ; put WORD from table into DX
mov [edi], dx ; write 2 byte string to buffer
add edi, 2
mov BYTE PTR [edi], 32 ; add space
inc edi
cmp lcnt, 8 ; test for half to add "-"
jne @F
mov WORD PTR [edi], " -"
add edi, 2

@@:
cmp lcnt, 16 ; break line at 16 characters
jne @F
dec edi ; overwrite last space
mov WORD PTR [edi], 0A0Dh ; write CRLF to buffer
add edi, 2
mov lcnt, 0
@@:
cmp esi, ecx ; test exit condition
jb hxlp ; unsigned compare, because esi and ecx hold addresses
; %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
inc edi
mov BYTE PTR [edi], 0 ; append terminator
pop edi
pop esi
pop ebx
ret
HexDump endp
; #########################################################################
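
The output layout of the routine is easier to see in a high-level language. The following Python
sketch mirrors the formatting steps of the loop above (two hex digits plus a space per byte, "- "
after the eighth byte, CRLF in place of the last space after the sixteenth byte). The function
name hex_dump is an assumption, and the sketch makes no attempt to match the speed of the table
lookup.

```python
def hex_dump(data):
    """Format bytes the way the HexDump loop does and return the text."""
    out = []
    for index, byte in enumerate(data):
        out.append(f"{byte:02X} ")           # two hex digits and a space
        column = index % 16 + 1              # same role as lcnt in the MASM code
        if column == 8:
            out.append("- ")                 # separator after half a line
        elif column == 16:
            out[-1] = out[-1][:-1] + "\r\n"  # overwrite the last space with CRLF
    return "".join(out)
```

A full line of 16 bytes produces 51 characters, so the TIMES 4 rule for the destination buffer
mentioned in the original thread leaves enough room.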

Lesson 9 - Opcodes and Mnemonics [18]

This chapter discusses the low-level implementation of the 80x86 instruction set. It
describes how the Intel engineers decided to encode the instructions in a numeric format
(suitable for storage in memory) and it discusses the trade-offs they had to make when
designing the CPU. This chapter also presents a historical background of the design
effort so you can better understand the compromises they had to make.
18. This chapter is part of “Art Of Assembly”
(http://webster.cs.ucr.edu/Page_AoAWin/HTML/ISA.html#1013164) - The best free assembly-book

The Importance of the Design of the Instruction Set

In this chapter we will be exploring one of the most interesting and important aspects of CPU
design: the design of the CPU's instruction set. The instruction set architecture (or ISA) is one
of the most important design issues that a CPU designer must get right from the start. Fea-
tures like caches, pipelining, superscalar implementation, etc., can all be grafted on to a CPU
design long after the original design is obsolete. However, it is very difficult to change the
instructions a CPU executes once the CPU is in production and people are writing software
that uses those instructions. Therefore, one must carefully choose the instructions for a CPU.
You might be tempted to take the "kitchen sink" approach to instruction set design and
include as many instructions as you can dream up in your instruction set. This approach fails
for several reasons we'll discuss in the following paragraphs. Instruction set design is the
epitome of compromise management. Good CPU design is the process of selecting what to
throw out rather than what to leave in. It's easy enough to say "let's include everything." The
hard part is deciding what to leave out once you realize you can't put everything on the chip.
Nasty reality #1: Silicon real estate. The first problem with "putting it all on the chip" is that
each feature requires some number of transistors on the CPU's silicon die. CPU designers
work with a "silicon budget" and are given a finite number of transistors to work with. This
means that there aren't enough transistors to support "putting all the features" on a CPU. The
original 8086 processor, for example, had a transistor budget of less than 30,000 transistors.
The Pentium III processor had a budget of over eight million transistors. These two budgets
reflect the differences in semiconductor technology in 1978 vs. 1998.
Nasty reality #2: Cost. Although it is possible to use millions of transistors on a CPU today,
the more transistors you use the more expensive the CPU. Pentium IV processors, for exam-
ple, cost hundreds of dollars (circa 2002). A CPU with only 30,000 transistors (also circa
2002) would cost only a few dollars. For low-cost systems it may be more important to shave
some features and use fewer transistors, thus lowering the CPU's cost.
Nasty reality #3: Expandability. One problem with the "kitchen sink" approach is that it's very
difficult to anticipate all the features people will want. For example, Intel's MMX and SIMD
instruction enhancements were added to make multimedia programming more practical on
the Pentium processor. Back in 1978 very few people could have possibly anticipated the
need for these instructions.

Nasty reality #4: Legacy Support. This is almost the opposite of expandability. Often it is
the case that an instruction the CPU designer feels is important turns out to be less useful
than anticipated. For example, the LOOP instruction on the 80x86 CPU sees very little
use in modern high-performance programs. The 80x86 ENTER instruction is another
good example. When designing a CPU using the "kitchen sink" approach, it is often com-
mon to discover that programs almost never use some of the available instructions.
Unfortunately, you cannot easily remove instructions in later versions of a processor
because this will break some existing programs that use those instructions. Generally,
once you add an instruction you have to support it forever in the instruction set. Unless
very few programs use the instruction (and you're willing to let them break) or you can
automatically simulate the instruction in software, removing instructions is a very difficult
thing to do.
Nasty reality #5: Complexity. The popularity of a new processor is easily measured by
how much software people write for that processor. Most CPU designs die a quick death
because no one writes software specific to that CPU. Therefore, a CPU designer must
consider the assembly programmers and compiler writers who will be using the chip upon
introduction. While a "kitchen sink" approach might seem to appeal to such programmers,
the truth is no one wants to learn an overly complex system. If your CPU does everything
under the sun, this might appeal to someone who is already familiar with the CPU. How-
ever, pity the poor soul who doesn't know the chip and has to learn it all at once.
These problems with the "kitchen sink" approach all have a common solution: design a
simple instruction set to begin with and leave room for later expansion. This is one of the
main reasons the 80x86 has proven to be so popular and long-lived. Intel started with a
relatively simple CPU and figured out how to extend the instruction set over the years to
accommodate new features.

Basic Instruction Design Goals

In a typical Von Neumann architecture CPU, the computer encodes CPU instructions as
numeric values and stores these numeric values in memory. The encoding of these instruc-
tions is one of the major tasks in instruction set design and requires very careful thought.
To encode an instruction we must pick a unique numeric opcode value for each instruction
(clearly, two different instructions cannot share the same numeric value or the CPU will not be
able to differentiate them when it attempts to decode the opcode value). With an n-bit number,
there are 2^n different possible opcodes, so to encode m instructions you will need an
opcode that is at least log2(m) bits long (rounded up to a whole number of bits).
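
As a quick check of this rule, the following Python sketch computes the smallest opcode width for
a given number of instructions. The function name opcode_bits is an assumption chosen for
illustration.

```python
def opcode_bits(instruction_count):
    """Smallest n such that 2**n >= instruction_count."""
    if instruction_count < 1:
        raise ValueError("need at least one instruction")
    # (m - 1).bit_length() is log2(m) rounded up, computed without floating point.
    return (instruction_count - 1).bit_length()
```

For example, 128 instructions fit into seven bits, 129 instructions already need eight bits, and
a set of 20 instructions needs five bits.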
Encoding opcodes is a little more involved than assigning a unique numeric value to each
instruction. Remember, we have to use actual hardware (i.e., decoder circuits) to figure out
what each instruction does and command the rest of the hardware to do the specified task.
Suppose you have a seven-bit opcode. With an opcode of this size we could encode 128 dif-
ferent instructions. To decode each instruction individually requires a seven-line to 128-line
decoder - an expensive piece of circuitry. Assuming our instructions contain certain patterns,
we can reduce the hardware by replacing this large decoder with three smaller decoders.
If you have 128 truly unique instructions, there's little you can do other than to decode each
instruction individually. However, in most architectures the instructions are not completely
independent of one another. For example, on the 80x86 CPUs the opcodes for "mov( eax,
ebx );" and "mov( ecx, edx );" are different (because these are different instructions) but these
instructions are not unrelated. They both move data from one register to another. In fact, the
only difference between them is the source and destination operands. This suggests that we
could encode instructions like MOV with a sub-opcode and encode the operands using other
strings of bits within the opcode.

For example, if we really have only eight instructions, each instruction has two operands,
and each operand can be one of four different values, then we can encode the opcode as
three packed fields containing three, two, and two bits (see Figure 5.1). This encoding
only requires the use of three simple decoders to completely determine what instruction
the CPU should execute. While this is a bit of a trivial case, it does demonstrate one very
important facet of instruction set design - it is important to make opcodes easy to decode
and the easiest way to do this is to break up the opcode into several different bit fields,
each field contributing part of the information necessary to execute the full instruction.
The smaller these bit fields, the easier it will be for the hardware to decode and execute
them.
Figure 5.1 Separating an Opcode into Separate Fields to Ease Decoding
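
A software view of such a field split might look like the following Python sketch. The figure is
not reproduced here, so the bit positions are an assumption: the three instruction bits are
placed in bits 6-4, the first operand in bits 3-2 and the second operand in bits 1-0. The function
names encode and decode are made up as well.

```python
def encode(instruction, operand1, operand2):
    """Pack a 3-bit instruction number and two 2-bit operands into 7 bits."""
    if not (0 <= instruction < 8 and 0 <= operand1 < 4 and 0 <= operand2 < 4):
        raise ValueError("field value out of range")
    return (instruction << 4) | (operand1 << 2) | operand2


def decode(opcode):
    """Split a 7-bit opcode back into its three fields."""
    if not 0 <= opcode < 128:
        raise ValueError("opcode must fit into seven bits")
    instruction = (opcode >> 4) & 0b111   # one 3-to-8 decoder in hardware
    operand1 = (opcode >> 2) & 0b11       # one 2-to-4 decoder
    operand2 = opcode & 0b11              # another 2-to-4 decoder
    return instruction, operand1, operand2
```

Each field is decoded independently of the others, which is the software counterpart of replacing
one seven-line to 128-line decoder with three small ones.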

Although Intel probably went a little overboard with the design of the original 8086 instruction
set, an important design goal is to keep instruction sizes within a reasonable range. CPUs
with unnecessarily long instructions consume extra memory for their programs. This tends to
create more cache misses and, therefore, hurts the overall performance of the CPU. There-
fore, we would like our instructions to be as compact as possible so our programs' code uses
as little memory as possible.
It would seem that if we are encoding 2^n different instructions using n bits, there would be
very little leeway in choosing the size of the instruction. It's going to take n bits to encode
those 2^n instructions, you can't do it with any fewer. You may, of course, use more than n bits;
and believe it or not, that's the secret to reducing the size of a typical program on the CPU.
Before discussing how to use longer instructions to generate shorter programs, a short
digression is necessary. The first thing to note is that we generally cannot choose an arbitrary
number of bits for our opcode length. Assuming that our CPU is capable of reading bytes
from memory, the opcode will probably have to be some even multiple of eight bits long. If the
CPU is not capable of reading bytes from memory (e.g., most RISC CPUs only read memory
in 32 or 64 bit chunks) then the opcode is going to be the same size as the smallest object
the CPU can read from memory at one time (e.g., 32 bits on a typical RISC chip). Any attempt
to shrink the opcode size below this data bus enforced lower limit is futile. Since we're dis-
cussing the 80x86 architecture in this text, we'll work with opcodes that must be an even mul-
tiple of eight bits long.
Another point to consider here is the size of an instruction's operands. Some CPU designers
(specifically, RISC designers) include all operands in their opcode. Other CPU designers
(typically CISC designers) do not count operands like immediate constants or address dis-
placements as part of the opcode (though they do usually count register operand encodings
as part of the opcode). We will take the CISC approach here and not count immediate con-
stant or address displacement values as part of the actual opcode.

With an eight-bit opcode you can only encode 256 different instructions. Even if we don't
count the instruction's operands as part of the opcode, having only 256 different instruc-
tions is somewhat limiting. It's not that you can't build a CPU with an eight-bit opcode,
most of the eight-bit processors predating the 8086 had eight-bit opcodes, it's just that
modern processors tend to have far more than 256 different instructions. The next step
up is a two-byte opcode. With a two-byte opcode we can have up to 65,536 different
instructions (which is probably enough) but our instructions have doubled in size (not
counting the operands, of course).
If reducing the instruction size is an important design goal, we can employ some techniques
from data compression theory to reduce the average size of our instructions. The
basic idea is this: first we analyze programs written for our CPU (or a CPU similar to ours
if no one has written any programs for our CPU) and count the number of occurrences of
each opcode in a large number of typical applications. We then create a sorted list of
these opcodes from most-frequently-used to least-frequently-used. Then we attempt to
design our instruction set using one-byte opcodes for the most-frequently-used instruc-
tions, two-byte opcodes for the next set of most-frequently-used instructions, and three
(or more) byte opcodes for the rarely used instructions. Although our maximum instruc-
tion size is now three or more bytes, most of the actual instructions appearing in a pro-
gram will use one or two byte opcodes, so the average opcode length will be somewhere
between one and two bytes (let's call it 1.5 bytes) and a typical program will be shorter
than had we chosen a two byte opcode for all instructions (see Figure 5.2).

Figure 5.2 Encoding Instructions Using a Variable-Length Opcode
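
The effect on program size can be estimated with a few lines of Python. The counts below are
hypothetical and chosen only for illustration: out of 100 executed instructions, 60 use a one-byte
opcode, 30 a two-byte opcode and 10 a three-byte opcode. The function name average_opcode_length
is an assumption too.

```python
def average_opcode_length(counts):
    """counts maps opcode length in bytes -> number of instructions of that length."""
    total_instructions = sum(counts.values())
    total_bytes = sum(length * number for length, number in counts.items())
    return total_bytes / total_instructions


# Hypothetical instruction mix, not measured data.
mix = {1: 60, 2: 30, 3: 10}
```

With this mix the average comes out at 1.5 bytes per opcode, compared with 2 bytes if every
instruction used a fixed two-byte opcode, even though the longest opcode is now three bytes.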

Although using variable-length instructions allows us to create smaller programs, it comes
at a price. First of all, decoding the instructions is a bit more complicated. Before decoding
an instruction field, the CPU must first decode the instruction's size. This extra step
consumes time and may affect the overall performance of the CPU (by introducing delays
in the decoding step and, thereby, limiting the maximum clock frequency of the CPU).
Another problem with variable length instructions is that it makes decoding multiple
instructions in a pipeline quite difficult (since we cannot trivially determine the instruction
boundaries in the prefetch queue). These reasons, along with some others, are why most
popular RISC architectures avoid variable-sized instructions. However, for our purpose,
we'll go with a variable length approach since saving memory is an admirable goal.
Before actually choosing the instructions you want to implement in your CPU, now would
be a good time to plan for the future. Undoubtedly, you will discover the need for new
instructions at some point in the future, so reserving some opcodes specifically for that
purpose is a really good idea. If you were using the instruction encoding appearing in Figure
5.2 for your opcode format, it might not be a bad idea to reserve one block of 64 one-
byte opcodes, half (4,096) of the two-byte instructions, and half (1,048,576) of the three-
byte opcodes for future use. In particular, giving up 64 of the very valuable one-byte
opcodes may seem extravagant, but history suggests that such foresight is rewarded.
The next step is to choose the instructions you want to implement. Note that although
we've reserved nearly half the instructions for future expansion, we don't actually have to
implement instructions for all the remaining opcodes. We can choose to leave a good
number of these instructions unimplemented (and effectively reserve them for the future
as well). The right approach is not to see how quickly we can use up all the opcodes, but
rather to ensure that we have a consistent and complete instruction set given the compro-
mises we have to live with (e.g., silicon limitations). The main point to keep in mind here is
that it's much easier to add an instruction later than it is to remove an instruction later. So
for the first go-around, it's generally better to go with a simpler design rather than a more
complex design.

The first step is to choose some generic instruction types. For a first attempt, you should limit
the instructions to some well-known and common instructions. The best place to look for help
in choosing these instructions is the instruction sets of other processors. For example, most
processors you find will have instructions like the following:
Data movement instructions (e.g., MOV)
Arithmetic and logical instructions (e.g., ADD, SUB, AND, OR, NOT)
Comparison instructions
A set of conditional jump instructions (generally used after the compare instruc-
tions)
Input/Output instructions
Other miscellaneous instructions
Your goal as the designer of the CPU's initial instruction set is to choose a reasonable set of
instructions that will allow programmers to efficiently write programs (using as few instruc-
tions as possible) without adding so many instructions you exceed your silicon budget or vio-
late other system compromises. This is a very strategic decision, one that CPU designers
should base on careful research, experimentation, and simulation. The job of the CPU
designer is not to create the best instruction set, but to create an instruction set that is optimal
given all the constraints.
Once you've decided which instructions you want to include in your (initial) instruction set, the
next step is to assign opcodes for them. The first step is to group your instructions into sets
by common characteristics of those instructions. For example, an ADD instruction is probably
going to support the exact same set of operands as the SUB instruction. So it makes sense to
put these two instructions into the same group. On the other hand, the NOT instruction generally
requires only a single operand as does a NEG instruction. So you'd probably put these
two instructions in the same group but a different group than ADD and SUB.

Once you've grouped all your instructions, the next step is to encode them. A typical
encoding will use some bits to select the group the instruction falls into, it will use some
bits to select a particular instruction from that group, and it will use some bits to determine
the types of operands the instruction allows (e.g., registers, memory locations, and con-
stants). The number of bits needed to encode all this information may have a direct
impact on the instruction's size, regardless of the frequency of the instruction. For exam-
ple, if you need two bits to select a group, four bits to select an instruction within that
group, and six bits to specify the instruction's operand types, you're not going to fit this
instruction into an eight-bit opcode. On the other hand, if all you really want to do is push
one of eight different registers onto the stack, you can use four bits to select the PUSH
instruction and three bits to select the register (assuming the encoding in Figure 5.2, the
eighth and H.O. bit would have to contain zero).
Encoding operands is always a problem because many instructions allow a large number
of operands. For example, the generic 80x86 MOV instruction requires a two-byte
opcode. However, Intel noticed that the "mov( disp, eax );" and "mov( eax, disp );"
instructions occurred very frequently. So they created a special one byte version of this
instruction to reduce its size and, therefore, the size of those programs that use this
instruction frequently. Note that Intel did not remove the two-byte versions of these
instructions. They have two different instructions that will store EAX into memory or load
EAX from memory. A compiler or assembler would always emit the shorter of the two
instructions when given an option of two or more instructions that wind up doing exactly
the same thing.
Notice an important trade-off Intel made with the MOV instruction. They gave up an extra
opcode in order to provide a shorter version of one of the MOV instructions. Actually, Intel
used this trick all over the place to create shorter and easier to decode instructions. Back
in 1978 this was a good compromise (reducing the total number of possible instructions
while also reducing the program size). Today, a CPU designer would probably want to
use those redundant opcodes for a different purpose; however, Intel's decision was reasonable
at the time (given the high cost of memory in 1978).
To further this discussion, we need to work with an example. So the next section will go
through the process of designing a very simple instruction set as a means of demonstrat-
ing this process.

The Y86 Hypothetical Processor

Because of enhancements made to the 80x86 processor family over the years, Intel's design
goals in 1978, and advances in computer architecture occurring over the years, the encoding
of 80x86 instructions is very complex and somewhat illogical. Therefore, the 80x86 is not a
good candidate for an example architecture when discussing how to design and encode an
instruction set. However, since this is a text about 80x86 assembly language programming,
attempting to present the encoding for some simpler real-world processor doesn't make
sense. Therefore, we will discuss instruction set design in two stages: first, we will develop a
simple (trivial) instruction set for a hypothetical processor that is a small subset of the 80x86,
then we will expand our discussion to the full 80x86 instruction set. Our hypothetical proces-
sor is not a true 80x86 CPU, so we will call it the Y86 processor to avoid any accidental asso-
ciation with the Intel x86 family.
The Y86 processor is a very stripped down version of the x86 CPUs. First of all, the Y86 only
supports one operand size - 16 bits. This simplification frees us from having to encode the
size of the operand as part of the opcode (thereby reducing the total number of opcodes we
will need). Another simplification is that the Y86 processor only supports four 16-bit registers:
AX, BX, CX, and DX. This lets us encode register operands with only two bits (versus the
three bits the 80x86 family requires to encode eight registers). Finally, the Y86 processors
only support a 16-bit address bus with a maximum of 65,536 bytes of addressable memory.
These simplifications, plus a very limited instruction set will allow us to encode all Y86
instructions using a single byte opcode and a two-byte displacement/offset (if needed).
The Y86 CPU provides 20 instructions. Seven of these instructions have two operands, eight
of these instructions have a single operand, and five instructions have no operands at all. The
instructions are MOV (two forms), ADD, SUB, CMP, AND, OR, NOT, JE, JNE, JB, JBE, JA,
JAE, JMP, BRK, IRET, HALT, GET, and PUT. The following paragraphs describe how each of
these work.

The MOV instruction is actually two instruction classes merged into the same instruction.
The two forms of the mov instruction take the following forms:
mov( reg/memory/constant, reg );
mov( reg, memory );
where reg is any of AX, BX, CX, or DX; constant is a numeric constant (using hexadeci-
mal notation), and memory is an operand specifying a memory location. The next section
describes the possible forms the memory operand can take. The "reg/memory/constant"
operand tells you that this particular operand may be a register, memory location, or a
constant.
The arithmetic and logical instructions take the following forms:

add( reg/memory/constant, reg );
sub( reg/memory/constant, reg );
cmp( reg/memory/constant, reg );
and( reg/memory/constant, reg );
or( reg/memory/constant, reg );
not( reg/memory );
Note: the NOT instruction appears separately because it is in a different class than the
other arithmetic instructions (since it supports only a single operand).
The ADD instruction adds the value of the first operand to the second (register) operand,
leaving the sum in the second (register) operand. The SUB instruction subtracts the value
of the first operand from the second, leaving the difference in the second operand. The
CMP instruction compares the first operand against the second and saves the result of
this comparison for use with one of the conditional jump instructions (described in a
moment). The AND and OR instructions compute the corresponding bitwise logical
operation on the two operands and store the result into the second (register) operand. The NOT instruction
inverts the bits in the single memory or register operand.
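
As a rough illustration of the semantics just described, the following Python sketch models the two-operand arithmetic and logical instructions and NOT on 16-bit values. The function names are made up for this example, and the wrap-around behavior on overflow is an assumption; the text does not say what the Y86 does when a result does not fit in 16 bits.

```python
MASK = 0xFFFF  # the Y86 supports a single operand size: 16 bits

def y86_two_operand(op, src, dest):
    """Return the new value of the destination (second) operand.

    src is the value of the first operand (register, memory or constant);
    dest is the current value of the second (register) operand.
    """
    if op == "add":
        return (dest + src) & MASK      # wrap-around is an assumption
    if op == "sub":
        return (dest - src) & MASK      # first operand subtracted from second
    if op == "and":
        return dest & src
    if op == "or":
        return dest | src
    raise ValueError("not a two-operand arithmetic/logical instruction: " + op)

def y86_not(value):
    """Invert all 16 bits of a register or memory operand."""
    return ~value & MASK

print(hex(y86_two_operand("sub", 0x0001, 0x0005)))  # sub( 1, reg ) with reg = 5
print(hex(y86_not(0x00FF)))
```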

The control transfer instructions interrupt the sequential execution of instructions in memory
and transfer control to some other point in memory either unconditionally, or after testing the
result of the previous CMP instruction. These instructions include the following:
ja dest; -- Jump if above (i.e., greater than)
jae dest; -- Jump if above or equal (i.e., greater than or equal)
jb dest; -- Jump if below (i.e., less than)
jbe dest; -- Jump if below or equal (i.e., less than or equal)
je dest; -- Jump if equal
jne dest; -- Jump if not equal
jmp dest; -- Unconditional jump
iret; -- Return from an interrupt
The first six instructions let you check the result of the previous CMP instruction for greater
than, greater or equal, less than, less or equal, equality, or inequality. For example, if you
compare the AX and BX registers with a "cmp( ax, bx );" instruction and execute the JA
instruction, the Y86 CPU will jump to the specified destination location if AX was greater than
BX. If AX was not greater than BX, control will fall through to the next instruction in the pro-
gram.
The JMP instruction unconditionally transfers control to the instruction at the destination
address. The IRET instruction returns control from an interrupt service routine, which we will
discuss later.
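
The interaction between CMP and the jump instructions can be sketched as follows. This is only a model: the two flag names, the helper names, and the treatment of operands as unsigned Python integers are assumptions made for illustration.

```python
def cmp_flags(first, second):
    """Model cmp( first, second ): record how first relates to second."""
    return {"less": first < second, "equal": first == second}

# Each conditional jump tests the saved flags; JMP ignores them.
JUMP_TAKEN = {
    "ja":  lambda f: not f["less"] and not f["equal"],
    "jae": lambda f: not f["less"],
    "jb":  lambda f: f["less"],
    "jbe": lambda f: f["less"] or f["equal"],
    "je":  lambda f: f["equal"],
    "jne": lambda f: not f["equal"],
    "jmp": lambda f: True,
}

def next_ip(mnemonic, flags, dest, fall_through):
    """Return the address of the next instruction to execute."""
    return dest if JUMP_TAKEN[mnemonic](flags) else fall_through

flags = cmp_flags(7, 3)                            # cmp( ax, bx ) with AX=7, BX=3
print(hex(next_ip("ja", flags, 0x0200, 0x0103)))   # AX is above BX, so the jump is taken
```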
The GET and PUT instructions let you read and write integer values. GET will stop and
prompt the user for a hexadecimal value and then store that value into the AX register. PUT
displays (in hexadecimal) the value of the AX register.
The remaining instructions, HALT and BRK, do not require any operands. HALT terminates
program execution, and BRK stops the program in a state from which it can be restarted.
The Y86 processors require a unique opcode for every different instruction, not just the
instruction classes. Although "mov( bx, ax );" and "mov( cx, ax );" are both in the same class,
they must have different opcodes if the CPU is to differentiate them. However, before looking
at all the possible opcodes, perhaps it would be a good idea to learn about all the possible
operands for these instructions.

Addressing Modes on the Y86

The Y86 instructions use five different operand types: registers, constants, and three
memory addressing schemes. Each form is called an addressing mode. The Y86 processor
supports the register addressing mode, the immediate addressing mode, the indirect
addressing mode, the indexed addressing mode, and the direct addressing mode. The
following paragraphs explain each of these modes.
Register operands are the easiest to understand. Consider the following forms of the
MOV instruction:
mov( ax, ax );
mov( bx, ax );
mov( cx, ax );
mov( dx, ax );
The first instruction accomplishes absolutely nothing. It copies the value from the AX reg-
ister back into the AX register. The remaining three instructions copy the values of BX,
CX and DX into AX. Note that these instructions leave BX, CX, and DX unchanged. The
second operand (the destination) is not limited to AX; you can move values to any of
these registers.
Constants are also pretty easy to deal with. Consider the following instructions:
mov( 25, ax );
mov( 195, bx );
mov( 2056, cx );
mov( 1000, dx );
These instructions are all pretty straightforward; they load their respective registers with
the specified hexadecimal constant.

There are three addressing modes which deal with accessing data in memory. The following
instructions demonstrate the use of these addressing modes:
mov( [1000], ax );
mov( [bx], ax );
mov( [1000+bx], ax );
The first instruction above uses the direct addressing mode to load AX with the 16 bit value
stored in memory starting at location $1000.
The "mov( [bx], ax );" instruction loads AX from the memory location specified by the contents
of the bx register. This is an indirect addressing mode. Rather than using the value in BX, this
instruction accesses the memory location whose address appears in BX. Note that the
following two instructions:
mov( 1000, bx );
mov( [bx], ax );
are equivalent to the single instruction:

mov( [1000], ax );
Of course, the single instruction is preferable. However, there are many cases where the
use of indirection is faster, shorter, and better. We'll see some examples of this a little later.
The last memory addressing mode is the indexed addressing mode. An example of this
memory addressing mode is
mov( [1000+bx], ax );
This instruction adds the contents of BX with $1000 to produce the address of the memory
value to fetch. This instruction is useful for accessing elements of arrays, records, and other
data structures.
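
The three memory addressing modes differ only in how the address is computed. The Python sketch below illustrates this; the function names, the wrap-around of addresses at 64K, and the byte order of 16-bit values in data memory (L.O. byte first) are assumptions made for the example.

```python
def effective_address(mode, bx=0, disp=0):
    """Compute the 16-bit address used by a Y86 memory operand."""
    if mode == "direct":        # [disp], e.g. [1000]
        address = disp
    elif mode == "indirect":    # [bx]
        address = bx
    elif mode == "indexed":     # [disp+bx], e.g. [1000+bx]
        address = disp + bx
    else:
        raise ValueError("unknown addressing mode: " + mode)
    return address & 0xFFFF     # assumed: addresses wrap at 64K

def load16(memory, address):
    """Read a 16-bit value stored L.O. byte first (an assumption here)."""
    lo = memory[address & 0xFFFF]
    hi = memory[(address + 1) & 0xFFFF]
    return lo | (hi << 8)

memory = bytearray(65536)
memory[0x1000:0x1002] = bytes([0x34, 0x12])    # the word $1234 at $1000
bx = 0x1000                                     # mov( 1000, bx );
via_bx = load16(memory, effective_address("indirect", bx=bx))        # mov( [bx], ax );
direct = load16(memory, effective_address("direct", disp=0x1000))    # mov( [1000], ax );
print(via_bx == direct)
```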

Encoding Y86 Instructions

Although we could arbitrarily assign opcodes to each of the Y86 instructions, keep in
mind that a real CPU uses logic circuitry to decode the opcodes and act appropriately on
them. A typical CPU opcode uses a certain number of bits in the opcode to denote the
instruction class (e.g., MOV, ADD, SUB), and a certain number of bits to encode each of
the operands.
A typical Y86 instruction takes the form shown in Figure 5.3. The basic instruction is
either one or three bytes long. The instruction opcode consists of a single byte that con-
tains three fields. The first field, the H.O. three bits, defines the instruction. This provides
eight combinations. As you may recall, there are 20 different instructions; we cannot
encode 20 instructions with three bits, so we'll have to pull some tricks to handle the other
instructions. As you can see in Figure 5.3, the basic opcode encodes the MOV instruc-
tions (two instructions, one where the rr field specifies the destination, one where the
mmm field specifies the destination), and the ADD, SUB, CMP, AND, and OR instructions.
There is one additional instruction class: special. The special instruction class provides
a mechanism that allows us to expand the number of available instruction classes;
we will return to this expansion opcode shortly.
Figure 5.3 Basic Y86 Instruction Encoding

To determine a particular instruction's opcode, you need only select the appropriate bits for
the iii, rr, and mmm fields. The rr field contains the destination register (except for the MOV
instruction whose iii field is %111) and the mmm field encodes the source operand. For
example, to encode the "mov( bx, ax );" instruction you would select iii=%110 ("mov( reg, reg );"),
rr=%00 (AX), and mmm=%001 (BX). This produces the one-byte instruction %11000001 or $C1.
Some Y86 instructions require more than one byte. For example, the instruction "mov(
[1000], ax );" loads the AX register from memory location $1000. The encoding for the
opcode is %11000110 or $C6. However, the encoding for the "mov( [2000], ax );" instruction's
opcode is also $C6. Clearly these two instructions do different things, one loads the AX regis-
ter from memory location $1000 while the other loads the AX register from memory location
$2000. To encode an address for the [xxxx] or [xxxx+bx] addressing modes, or to encode the
constant for the immediate addressing mode, you must follow the opcode with the 16-bit
address or constant, with the L.O. byte immediately following the opcode in memory and the
H.O. byte after that. So the three byte encoding for "mov( [1000], ax );" would be $C6, $00,
$10 and the three byte encoding for "mov( [2000], ax );" would be $C6, $00, $20.
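
The packing of the three fields and the byte order of the trailing operand can be sketched in a few lines of Python. The helper names are hypothetical; the field values are the ones given in the text.

```python
RR = {"ax": 0b00, "bx": 0b01, "cx": 0b10, "dx": 0b11}
MOV_TO_REG = 0b110          # iii value for mov( reg/memory/constant, reg )

def opcode_byte(iii, rr, mmm):
    """Pack the fields into one byte: iii in bits 7..5, rr in 4..3, mmm in 2..0."""
    return (iii << 5) | (rr << 3) | mmm

def with_operand(opcode, value):
    """Append a 16-bit address or constant, L.O. byte first."""
    return bytes([opcode, value & 0xFF, (value >> 8) & 0xFF])

# mov( bx, ax ): rr = AX (destination), mmm = %001 (BX as source)
print(hex(opcode_byte(MOV_TO_REG, RR["ax"], 0b001)))

# mov( [1000], ax ) and mov( [2000], ax ): same opcode, different address bytes
print(with_operand(opcode_byte(MOV_TO_REG, RR["ax"], 0b110), 0x1000).hex())
print(with_operand(opcode_byte(MOV_TO_REG, RR["ax"], 0b110), 0x2000).hex())
```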
The special opcode allows the Y86 CPU to expand the set of available instructions. This
opcode handles several zero and one-operand instructions as shown in Figure 5.4 and Fig-
ure 5.5.
Figure 5.4 Single Operand Instruction Encodings

Figure 5.5 Zero Operand Instruction Encodings
There are four one-operand instruction classes. The first encoding (00) further expands
the instruction set with a set of zero-operand instructions (see Figure 5.5). The second
opcode is also an expansion opcode that provides all the Y86 jump instructions (see Fig-
ure 5.6). The third opcode is the NOT instruction. This is the bitwise logical not operation
that inverts all the bits in the destination register or memory operand. The fourth single-
operand opcode is currently unassigned. Any attempt to execute this opcode will halt the
processor with an illegal instruction error. CPU designers often reserve unassigned
opcodes like this one to extend the instruction set at a future date (as Intel did when mov-
ing from the 80286 processor to the 80386).

Figure 5.6 Jump Instruction Encodings
There are seven jump instructions in the Y86 instruction set. They all take the following form:
jxx address;
The JMP instruction copies the 16-bit value (address) following the opcode into the IP register.
Therefore, the CPU will fetch the next instruction from this target address; effectively, the
program "jumps" from the point of the JMP instruction to the instruction at the target address.
The JMP instruction is an example of an unconditional jump instruction. It always transfers
control to the target address. The remaining six instructions are conditional jump instructions.
They test some condition and jump if the condition is true; they fall through to the next
instruction if the condition is false. These six instructions, JA, JAE, JB, JBE, JE, and JNE, let you test
for greater than, greater than or equal, less than, less than or equal, equality, and inequality.
You would normally execute these instructions immediately after a CMP instruction since it
sets the less than and equality flags that the conditional jump instructions test. Note that there
are eight possible jump opcodes, but the Y86 uses only seven of them. The eighth opcode is
another illegal opcode.

The last group of instructions, the zero operand instructions, appear in Figure 5.5. Three
of these instructions are illegal instruction opcodes. The BRK (break) instruction pauses
the CPU until the user manually restarts it. This is useful for pausing a program during
execution to observe results. The IRET (interrupt return) instruction returns control from
an interrupt service routine. We will discuss interrupt service routines later. The HALT
instruction terminates program execution. The GET instruction reads a hexadecimal value
from the user and returns this value in the AX register; the PUT instruction outputs the
value in the AX register.
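
The way the special opcode (iii=%000) fans out into further groups is the first decision a Y86 disassembler would have to make. A minimal sketch, with hypothetical function names, might look like this. It does not distinguish the individual zero-operand instructions, because their bit assignments are given only in Figure 5.5.

```python
def split_fields(opcode):
    """Return the (iii, rr, mmm) fields of a Y86 opcode byte."""
    return opcode >> 5, (opcode >> 3) & 0b11, opcode & 0b111

def classify(opcode):
    """Name the instruction group an opcode byte belongs to."""
    iii, rr, mmm = split_fields(opcode)
    if iii != 0b000:
        return "two-operand"          # MOV, ADD, SUB, CMP, AND, OR
    if rr == 0b00:
        return "zero-operand"         # BRK, IRET, HALT, GET, PUT or illegal
    if rr == 0b01:
        # seven jumps occupy $08..$0E; $0F is the eighth, illegal, opcode
        return "illegal" if mmm == 0b111 else "jump"
    if rr == 0b10:
        return "not"
    return "illegal"                  # rr == %11 is currently unassigned

for opcode in (0xC1, 0x08, 0x0F, 0x10, 0x18):
    print(f"${opcode:02X}: {classify(opcode)}")
```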

Hand Encoding Instructions

Keep in mind that the Y86 processor fetches instructions as bit patterns from memory. It
decodes and executes those bit patterns. The processor does not execute instructions of the
form "mov( ax, bx );" (that is, a string of characters that are readable by humans). Instead, it
executes the bit pattern $C8 from memory. Instructions like "mov( ax, bx );" and "add( 5, cx );"
are human-readable representations of these instructions that we must first convert into
machine code (that is, the binary representation of the instruction that the machine actually
executes). In this section we will explore how to manually accomplish this task.
The first step is to choose an instruction to convert into machine code. We'll start with a very
simple example, the "add( cx, dx );" instruction. Once you've chosen the instruction, you look
up the instruction in one of the figures of the previous section. The ADD instruction is in the
first group (see Figure 5.3) and has an iii field of %101. The source operand is CX, so the
mmm field is %010 and the destination operand is DX so the rr field is %11. Merging these
bits produces the opcode %10111010 or $BA.
Figure 5.7 Encoding ADD( cx, dx );

Now consider the "add( 5, ax );" instruction. Since this instruction has an immediate
source operand, the mmm field will be %111. The destination register operand is AX
(%00) so the full opcode becomes %10100111 or $A7. Note, however, that this does not
complete the encoding of the instruction. We also have to include the 16-bit constant
$0005 as part of the instruction. The binary encoding of the constant must immediately
follow the opcode in memory, so the sequence of bytes in memory (from lowest address
to highest address) is $A7, $05, $00. Note that the L.O. byte of the constant follows the
opcode and the H.O. byte of the constant follows the L.O. byte. This sequence appears
backwards because the bytes are arranged in order of increasing memory address and
the H.O. byte of a constant always appears in the highest memory address.
Figure 5.8 Encoding ADD( 5, ax );
The "add( [2ff+bx], cx );" instruction also contains a 16-bit constant associated with the
instruction's encoding - the displacement portion of the indexed addressing mode. To
encode this instruction we use the following field values: iii=%101, rr=%10, and
mmm=%101. This produces the opcode byte %10110101 or $B5. The complete instruc-
tion also requires the constant $2FF so the full instruction is the three-byte sequence
$B5, $FF, $02.

Figure 5.9 Encoding ADD( [$2ff+bx], cx );
Now consider the "add( [1000], ax );" instruction. This instruction adds the 16-bit contents of
memory locations $1000 and $1001 to the value in the AX register. Once again, iii=%101 for
the ADD instruction. The destination register is AX so rr=%00. Finally, the addressing mode
is the displacement-only addressing mode, so mmm=%110. This forms the opcode
%10100110 or $A6. The instruction is three bytes long since it must encode the displacement
(address) of the memory location in the two bytes following the opcode. Therefore, the com-
plete three-byte sequence is $A6, $00, $10.

Figure 5.10 Encoding ADD( [1000], ax );
The last addressing mode to consider is the register indirect addressing mode, [bx]. The
"add( [bx], bx );" instruction uses the following encoded values: iii=%101, rr=%01
(bx), and mmm=%100 ([bx]). Since the value in the BX register completely specifies the
memory address, there is no need for a displacement field. Hence, this instruction is only
one byte long: the opcode %10101100 or $AC.
Figure 5.11 Encoding the ADD( [bx], bx ); Instruction
You use a similar approach to encode the SUB, CMP, AND, and OR instructions as you
do the ADD instruction. The only difference is that you use different values for the iii field
in the opcode.
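
The hand encodings worked through above can be checked mechanically. The following Python sketch is a table-driven encoder for ADD only; the function name and the way operands are passed (a mode string plus an optional 16-bit value) are choices made for this example, not part of any Y86 tool.

```python
ADD = 0b101
RR = {"ax": 0b00, "bx": 0b01, "cx": 0b10, "dx": 0b11}
MMM = {
    "ax": 0b000, "bx": 0b001, "cx": 0b010, "dx": 0b011,
    "[bx]": 0b100, "[disp+bx]": 0b101, "[disp]": 0b110, "const": 0b111,
}
NEEDS_OPERAND = {"[disp+bx]", "[disp]", "const"}

def encode_add(source, dest, value=None):
    """Encode add( source, dest ); value is the displacement or constant, if any."""
    opcode = (ADD << 5) | (RR[dest] << 3) | MMM[source]
    if source in NEEDS_OPERAND:
        if value is None:
            raise ValueError("this source operand needs a 16-bit value")
        return bytes([opcode, value & 0xFF, (value >> 8) & 0xFF])  # L.O. byte first
    return bytes([opcode])

examples = [
    ("add( cx, dx );",       encode_add("cx", "dx")),
    ("add( 5, ax );",        encode_add("const", "ax", 0x0005)),
    ("add( [2ff+bx], cx );", encode_add("[disp+bx]", "cx", 0x02FF)),
    ("add( [1000], ax );",   encode_add("[disp]", "ax", 0x1000)),
    ("add( [bx], bx );",     encode_add("[bx]", "bx")),
]
for text, code in examples:
    print(text, "->", " ".join(f"${b:02X}" for b in code))
```

Encoding SUB, CMP, AND, or OR would only require a different iii constant in place of ADD.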

The MOV instruction is special because there are two forms of the MOV instruction. You
encode the first form (iii=%110) exactly as you do the ADD instruction. This form copies a
constant or data from memory or a register (the mmm field) into a destination register (the rr
field).
The second form of the MOV instruction (iii=%111) copies data from a source register (rr) to a
destination memory location (that the mmm field specifies). In this form of the MOV instruc-
tion, the source/destination meanings of the rr and mmm fields are reversed so that rr is the
source field and mmm is the destination field. Another difference is that the mmm field may
only contain the values %100 ([bx]), %101 ([disp+bx]), and %110 ([disp]). The destination val-
ues cannot be %000..%011 (registers) or %111 (constant). These latter five encodings are
illegal (the register destination instructions are handled by the other MOV instruction and
storing data into a constant doesn't make any sense).
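
A short sketch of the second MOV form shows how an encoder could reject the five illegal mmm values. The names below are hypothetical and the error handling is just one possible choice.

```python
MOV_TO_MEM = 0b111
RR = {"ax": 0b00, "bx": 0b01, "cx": 0b10, "dx": 0b11}
# Only the three memory addressing modes are legal destinations.
MEM_MMM = {"[bx]": 0b100, "[disp+bx]": 0b101, "[disp]": 0b110}

def encode_mov_store(source_reg, dest_mode, disp=None):
    """Encode mov( source_reg, memory ): rr is the source, mmm the destination."""
    if dest_mode not in MEM_MMM:
        raise ValueError("destination must be [bx], [disp+bx] or [disp]")
    opcode = (MOV_TO_MEM << 5) | (RR[source_reg] << 3) | MEM_MMM[dest_mode]
    if dest_mode == "[bx]":
        return bytes([opcode])                       # no displacement needed
    if disp is None:
        raise ValueError("this addressing mode needs a 16-bit displacement")
    return bytes([opcode, disp & 0xFF, (disp >> 8) & 0xFF])

print(encode_mov_store("ax", "[disp]", 0x1000).hex())   # mov( ax, [1000] );
print(encode_mov_store("dx", "[bx]").hex())             # mov( dx, [bx] );
```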
The Y86 processor supports a single instruction with a single memory/register operand - the
NOT instruction. The NOT instruction has the syntax: "not( reg );" or "not( mem );" where
mem represents one of the memory addressing modes ([bx], [disp+bx], or [disp]). Note that
you may not specify a constant as the operand of the NOT instruction.
Since the NOT instruction has only a single operand, it only uses the mmm field to encode
this operand. The rr field, combined with the iii field, selects the NOT instruction (iii=%000
and rr=%10). Whenever the iii field contains zero this tells the CPU that special decoding is
necessary for the instruction. In this case, the rr field specifies whether we have the NOT
instruction or one of the other specially decoded instructions.
To encode an instruction like "not( ax );" you would simply specify %000 for iii and %10 for the
rr fields. Then you would encode the mmm field the same way you would encode this field for
the ADD instruction. Since mmm=%000 for AX, the encoding of "not( ax );" would be
%00010000 or $10.

Figure 5.12 Encoding the NOT( ax ); Instruction
The NOT instruction does not allow an immediate (constant) operand, hence the opcode
%00010111 ($17) is an illegal opcode.
The Y86 conditional jump instructions also use a special encoding. These instructions are
always three bytes long. The first byte (the opcode) specifies which conditional jump
instruction to execute and the next two bytes specify where the CPU transfers if the con-
dition is met. There are seven different Y86 jump instructions, six conditional jumps and
one unconditional jump. These instructions set iii=%000, rr=%01, and use the mmm
field to select one of the seven possible jumps; the eighth possible opcode is an illegal
opcode (see Figure 5.6). Encoding these instructions is relatively straight-forward. Once
you pick the instruction you want to encode, you've determined the opcode (since there is
a single opcode for each instruction). The opcode values fall in the range $08..$0E ($0F
is the illegal opcode).
The only field that requires some thought is the 16-bit operand that follows the opcode.
This field holds the address of the target instruction to which the (un)conditional jump
transfers if the condition is true (e.g., JE transfers control to this address if the previous
CMP instruction found that its two operands were equal). To properly encode this field
you must know the address of the opcode byte of the target instruction. If you've already
converted the instruction to binary form and stored it into memory, this isn't a problem;
just specify the address of that instruction as the operand of the conditional jump. On the
other hand, if you haven't yet written, converted, and placed that instruction into memory,
knowing its address would seem to require a bit of divination. Fortunately, you can figure
out the target address by computing the lengths of all the instructions between the current
jump instruction you're encoding and the target instruction. Unfortunately, this is an ardu-
ous task.

The best solution is to write all your instructions down on paper, compute their lengths (which
is easy, all instructions are one or three bytes long depending on the presence of a 16-bit
operand), and then assign an appropriate address to each instruction. Once you've done this
(and, assuming you haven't made any mistakes) you'll know the starting address for each
instruction and you can fill in target address operands in your (un)conditional jump instruc-
tions as you encode them. Fortunately, there is a better way to do this, as you'll see in the
next section.
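
The paper-and-pencil procedure can be sketched in Python: derive each instruction's length from its opcode byte, then accumulate the lengths into starting addresses. The function names are hypothetical, and the sketch assumes every opcode it is given is legal (it does not check for the illegal encodings).

```python
def instruction_length(opcode):
    """Return 1 or 3 depending on whether a 16-bit operand follows the opcode."""
    iii = opcode >> 5
    rr = (opcode >> 3) & 0b11
    mmm = opcode & 0b111
    if iii == 0b000:
        if rr == 0b01:                     # jumps always carry a 16-bit target
            return 3
        if rr == 0b10:                     # NOT: only [disp+bx] and [disp] add bytes
            return 3 if mmm in (0b101, 0b110) else 1
        return 1                           # zero-operand instructions
    # Two-operand forms: [disp+bx], [disp] and constants need two more bytes
    return 3 if mmm in (0b101, 0b110, 0b111) else 1

def assign_addresses(opcodes, origin=0):
    """Return the starting address of each instruction, in program order."""
    addresses, address = [], origin
    for opcode in opcodes:
        addresses.append(address)
        address += instruction_length(opcode)
    return addresses

# mov( 0, ax ); add( bx, ax ); a jump; not( ax );
program = [0xC7, 0xA1, 0x08, 0x10]
print([hex(a) for a in assign_addresses(program)])
```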
The last group of instructions, the zero operand instructions, are the easiest to encode. Since
they have no operands they are always one byte long and the instruction uniquely specifies
the opcode for the instruction. These instructions always have iii=%000, rr=%00, and mmm
specifies the particular instruction opcode (see Figure 5.5). Note that the Y86 CPU leaves
three of these instructions undefined (so we can use these opcodes for future expansion).

Using an Assembler to Encode Instructions

Of course, hand coding machine language programs as demonstrated in the previous
section is impractical for all but the smallest programs. Certainly you haven't had to do
anything like this when writing HLA programs. The HLA compiler lets you create a text file
containing human readable forms of the instructions. You might wonder why we can write
such code for the 80x86 but not for the Y86. The answer is to use an assembler or com-
piler for the Y86. The job of an assembler/compiler is to read a text file containing human
readable text and translate that text into the binary encoded representation for the corre-
sponding machine language program.
An assembler or compiler is nothing special. It's just another program that executes on
your computer system. The only thing special about an assembler or compiler is that it
translates programs from one form (source code) to another (machine code). A typical
Y86 assembler, for example, would read lines of text with each line containing a Y86
instruction; it would parse each statement and then write the binary equivalent of each
instruction to memory or to a file for later execution.
Assemblers have two big advantages over coding in machine code. First, they automati-
cally translate strings like "ADD( ax, bx );" and "MOV( ax, [1000]);" to their corresponding
binary form. Second, and probably even more important, assemblers let you attach labels
to statements and refer to those labels within jump instructions; this means that you don't
have to know the target address of an instruction in order to specify that instruction as the
target of a jump or conditional jump instruction. Windows users have access to a very
simple Y86 assembler that lets you specify up to 26 labels in a program (using the
symbols 'A'..'Z'). To attach a label to a statement, you simply preface the instruction with the
label and a colon, e.g.,
L: mov( 0, ax );
To transfer control to a statement with a label attached to it, you simply specify the label
name as the operand of the jump instruction, e.g.,
jmp L;

The assembler will compute the address of the label and fill in the address for you whenever
you specify the label as the operand of a jump or conditional jump instruction. The assembler
can do this even if it hasn't yet encountered the label in the program's source file (i.e., the
label is attached to a later instruction in the source file). Most assemblers accomplish this
magic by making two passes over the source file. During the first pass the assembler deter-
mines the starting address of each symbol and stores this information in a simple database
called the symbol table. The assembler does not emit any machine code during this first
pass. Then the assembler makes a second pass over the source file and actually emits the
machine code. During this second pass it looks up all label references in the symbol table
and uses the information it retrieves from this database to fill in the operand fields of the
instructions that refer to some symbol.
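
A minimal sketch of the two-pass idea follows. To keep it short it skips parsing entirely: each statement is already a tuple of (label, opcode byte, operand), which is a simplification made for this example. The JMP opcode value is also an assumption; the text only says the jumps occupy $08..$0E, and the individual assignments appear in Figure 5.6.

```python
JMP = 0x0E   # assumed position of JMP within the $08..$0E jump range

def assemble(program, origin=0):
    """Assemble (label, opcode, operand) tuples; operand is None, an int or a label."""
    # Pass 1: compute the address of every label; emit nothing.
    symbols = {}
    address = origin
    for label, opcode, operand in program:
        if label is not None:
            symbols[label] = address
        address += 1 if operand is None else 3
    # Pass 2: emit machine code, resolving label references via the symbol table.
    code = bytearray()
    for label, opcode, operand in program:
        code.append(opcode)
        if operand is not None:
            value = symbols[operand] if isinstance(operand, str) else operand
            code += bytes([value & 0xFF, (value >> 8) & 0xFF])
    return bytes(code), symbols

program = [
    (None, JMP,  "L"),     #    jmp L;          forward reference
    (None, 0xC7, 0x0000),  #    mov( 0, ax );
    ("L",  0xA1, None),    # L: add( bx, ax );
    (None, JMP,  "L"),     #    jmp L;          backward reference
]
code, symbols = assemble(program)
print(symbols, code.hex())
```

The forward reference in the first statement is the case that a single pass cannot handle directly, since the address of L is not known until the statements before it have been measured.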

Extending the Y86 Instruction Set

The Y86 CPU is a trivial CPU, suitable only for demonstrating how to encode machine
instructions. However, like any good CPU the Y86 design does provide the capability for
expansion. So if you wanted to improve the CPU by adding new instructions, the ability to
accomplish this exists in the instruction set.
There are two standard ways to increase the number of instructions in a CPU's instruction
set. Both mechanisms require the presence of undefined (or illegal) opcodes on the CPU.
Since the Y86 CPU has several of these, we can expand the instruction set.
The first method is to directly use the undefined opcodes to define new instructions. This
works best when there are undefined bit patterns within an opcode group and the new
instruction you want to add falls into that same group. For example, the opcode
%00011mmm falls into the same group as the NOT instruction. If you decided that you
really needed a NEG (negate, take the two's complement) instruction, using this particular
opcode for this purpose makes a lot of sense because you'd probably expect the NEG
instruction to use the same syntax (and, therefore, decoding) as the NOT instruction.
Likewise, if you want to add a zero-operand instruction to the instruction set, there are
three undefined zero-operand instructions that you could use for this purpose. You'd just
appropriate one of these opcodes and assign your instruction to it.
Unfortunately, the Y86 CPU doesn't have that many illegal opcodes open. For example, if
you wanted to add the SHL, SHR, ROL, and ROR instructions (shift and rotate left and
right) as single-operand instructions, there is insufficient space in the single operand
instruction opcodes to add these instructions (there is currently only one open opcode
you could use). Likewise, there are no two-operand opcodes open, so if you wanted to
add an XOR instruction or some other two-operand instruction, you'd be out of luck.
A common way to handle this dilemma (one the Intel designers have employed) is to use
a prefix opcode byte. This opcode expansion scheme uses one of the undefined opcodes
as an opcode prefix byte. Whenever the CPU encounters a prefix byte in memory, it reads
and decodes the next byte in memory as the actual opcode. However, it does not treat
this second byte as it would any other opcode. Instead, this second opcode byte uses a
completely different encoding scheme and, therefore, lets you specify as many new
instructions as you can encode in that byte (or bytes, if you prefer).

For example, the opcode $FF is illegal (it corresponds to a "mov( dx, const );" instruction) so
we can use this byte as a special prefix byte to further expand the instruction set.
Figure 5.13 Using a Prefix Byte to Extend the Instruction Set
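
From a decoder's point of view, a prefix byte simply selects a second opcode table. The sketch below shows the fetch step only; the table names and the function are hypothetical, and the meaning of the bytes in the extended table is deliberately left open.

```python
PREFIX = 0xFF   # the illegal "mov( dx, const );" opcode, reused as a prefix

def fetch_opcode(code, ip):
    """Return (table, opcode, next_ip) for the instruction starting at ip."""
    if code[ip] == PREFIX:
        # The byte after the prefix is looked up in a separate, second opcode map.
        return "extended", code[ip + 1], ip + 2
    return "basic", code[ip], ip + 1

code = bytes([0xC1, 0xFF, 0x01])   # one basic opcode, then prefix + extended opcode
ip = 0
while ip < len(code):
    table, opcode, ip = fetch_opcode(code, ip)
    print(table, f"${opcode:02X}")
```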

Encoding 80x86 Instructions

The Y86 processor is simple to understand, its instructions are easy to encode by hand, and it is a
great vehicle for learning how to assign opcodes. It's also a purely hypothetical device
intended only as a teaching tool. Therefore, you can now forget all about the Y86; it has
served its purpose. Now it's time to take a look at the actual machine instruction format
for the 80x86 CPU family.
They don't call the 80x86 CPU a Complex Instruction Set Computer for nothing. Although
more complex instruction encodings do exist, no one is going to challenge the assertion
that the 80x86 has a complex instruction encoding. The generic 80x86 instruction takes
the form shown in Figure 5.14. Although this diagram seems to imply that instructions can
be up to 16 bytes long, in actuality the 80x86 will not allow instructions greater than 15
bytes in length.
Figure 5.14 80x86 Instruction Encoding

The prefix bytes are not the "opcode expansion prefix" that the previous sections in this
chapter discussed. Instead, these are special bytes to modify the behavior of existing instructions
(rather than define new instructions). We'll take a look at a couple of these prefix bytes in a
little bit; others we'll leave for discussion in later chapters. The 80x86 certainly supports more
than four prefix values; however, an instruction may have a maximum of four prefix bytes
attached to it. Also note that many prefix bytes are mutually exclusive, and the
results are undefined if you put a pair of mutually exclusive prefix bytes in front of an
instruction.
The 80x86 supports two basic opcode sizes: a standard one-byte opcode and a two-byte
opcode consisting of a $0F opcode expansion prefix byte and a second byte specifying the
actual instruction. One way to view these opcode bytes is as an eight-bit extension of the iii
field in the Y86 encoding. This provides for up to 512 different instruction classes (although
the 80x86 does not yet use them all). In reality, various instruction classes use certain bits in
this opcode for decidedly non-instruction-class purposes. For example, consider the ADD
instruction opcode. It takes the form shown in Figure 5.15.
Note that bit number zero specifies the size of the operands the ADD instruction operates
upon. If this field contains zero then the operands are eight bit registers and memory loca-
tions. If this bit contains one then the operands are either 16-bits or 32-bits. Under 32-bit
operating systems the default is 32-bit operands if this field contains a one. To specify a 16-bit
operand (under Windows or Linux) you must insert a special "operand-size prefix byte" in
front of the instruction.
Bit number one specifies the direction of the transfer. If this bit is zero, then the destination
operand is a memory location (e.g., "add( al, [ebx] );"). If this bit is one, then the destination
operand is a register (e.g., "add( [ebx], al );"). You'll soon see that this direction bit creates a
problem that results in one instruction having two different possible opcodes.

Figure 5.15 80x86 ADD Opcode
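
The size and direction bits can be extracted with two shifts and masks. The sketch below assumes the 000000ds layout for the basic ADD opcodes (that is, the four opcode bytes $00..$03); the function name and the wording of the returned strings are made up for this example.

```python
def decode_add_opcode(opcode):
    """Split a basic 80x86 ADD opcode of the form 000000ds into its d and s bits."""
    if opcode & 0b11111100 != 0:
        raise ValueError("not one of the four basic ADD opcodes")
    d = (opcode >> 1) & 1      # bit 1: direction
    s = opcode & 1             # bit 0: operand size
    return {
        "destination": "register" if d else "memory (r/m operand)",
        "size": "16/32-bit" if s else "8-bit",
    }

for opcode in range(4):
    print(f"${opcode:02X}", decode_add_opcode(opcode))
```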

Encoding Instruction Operands

The "mod-reg-r/m" byte (in Figure 5.14) specifies a basic addressing mode. This byte con-
tains the following fields:
Figure 5.16 MOD-REG-R/M Byte
The REG field specifies an 80x86 register. Depending on the instruction, this can be either
the source or the destination operand. Many instructions have the "d" (direction) field in their
opcode to choose whether this operand is the source (d=0) or the destination (d=1) operand.
This field is encoded using the bit patterns found in the following table:

REG Value   Register if data     Register if data     Register if data
            size is eight bits   size is 16 bits      size is 32 bits
%000        al                   ax                   eax
%001        cl                   cx                   ecx
%010        dl                   dx                   edx
%011        bl                   bx                   ebx
%100        ah                   sp                   esp
%101        ch                   bp                   ebp
%110        dh                   si                   esi
%111        bh                   di                   edi
For certain (single operand) instructions, the REG field may contain an opcode extension
rather than a register value (the R/M field will specify the operand in this case).
The MOD and R/M fields combine to specify the other operand in a two-operand instruc-
tion (or the only operand in a single-operand instruction like NOT or NEG). Remember,
the "d" bit in the opcode determines which operand is the source and which is the desti-
nation. The MOD and R/M fields together specify the following addressing modes:

MOD   Meaning
%00   Register indirect addressing mode, or SIB with no displacement (when
      R/M=%100), or displacement only addressing mode (when R/M=%101).
%01   One-byte signed displacement follows addressing mode byte(s).
%10   Four-byte signed displacement follows addressing mode byte(s).
%11   Register addressing mode.

MOD   R/M    Addressing Mode
%00   %000   [eax]
%01   %000   [eax+disp8]
%10   %000   [eax+disp32]
%11   %000   register (al/ax/eax), see note 1
%00   %001   [ecx]
%01   %001   [ecx+disp8]
%10   %001   [ecx+disp32]
%11   %001   register (cl/cx/ecx)
%00   %010   [edx]
%01   %010   [edx+disp8]
%10   %010   [edx+disp32]
%11   %010   register (dl/dx/edx)
%00   %011   [ebx]
%01   %011   [ebx+disp8]
%10   %011   [ebx+disp32]
%11   %011   register (bl/bx/ebx)
%00   %100   SIB Mode
%01   %100   SIB + disp8 Mode
%10   %100   SIB + disp32 Mode
%11   %100   register (ah/sp/esp)
%00   %101   Displacement Only Mode (32-bit displacement)
%01   %101   [ebp+disp8]
%10   %101   [ebp+disp32]
%11   %101   register (ch/bp/ebp)
%00   %110   [esi]
%01   %110   [esi+disp8]
%10   %110   [esi+disp32]
%11   %110   register (dh/si/esi)
%00   %111   [edi]
%01   %111   [edi+disp8]
%10   %111   [edi+disp32]
%11   %111   register (bh/di/edi)
Note 1: The size bit in the opcode specifies eight or 32-bit register size. To select a 16-bit register
requires a prefix byte.
There are a couple of interesting things to note about this table. First of all, note that there
are two forms of the [reg+disp] addressing modes: one form with an eight-bit displace-
ment and one form with a 32-bit displacement. Addressing modes whose displacement
falls in the range -128..+127 require only a single byte displacement after the opcode;
hence these instructions will be shorter (and sometimes faster) than instructions whose
displacement value is outside this range. It turns out that many offsets are within this
range, so the assembler/compiler can generate shorter instructions for a large percent-
age of the instructions.
The second thing to note is that there is no [ebp] addressing mode. If you look in the table
above where this addressing mode logically belongs, you'll find that its slot is occupied by
the 32-bit displacement only addressing mode. The basic encoding scheme for address-
ing modes didn't allow for a displacement only addressing mode, so Intel "stole" the
encoding for [ebp] and used that for the displacement only mode. Fortunately, anything
you can do with the [ebp] addressing mode you can do with the [ebp+disp8] addressing
mode by setting the eight-bit displacement to zero. True, the instruction is a little bit
longer, but the capabilities are still there. Intel (wisely) chose to replace this addressing
mode because they anticipated that programmers would use this addressing mode less
often than the other register indirect addressing modes (for reasons you'll discover in a
later chapter).
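
A disassembler has to undo this table at run time. The following is a minimal sketch of that
lookup, assuming 32-bit addressing; the function names split_modrm and describe_rm are our own
and not part of any library, and 16-bit operands are left out.

```python
# Register names indexed by the 3-bit REG or R/M value (from the tables above).
REG8 = ["al", "cl", "dl", "bl", "ah", "ch", "dh", "bh"]
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def split_modrm(byte):
    """Split a MOD-REG-R/M byte into its (mod, reg, rm) bit fields."""
    return (byte >> 6) & 0b11, (byte >> 3) & 0b111, byte & 0b111


def describe_rm(mod, rm, wide=True):
    """Return a readable form of the operand selected by MOD and R/M.

    wide=True stands for a size bit of 1 (32-bit registers); wide=False
    selects the eight-bit register names. 16-bit operands are not modelled.
    """
    if mod == 0b11:                      # register addressing mode
        return (REG32 if wide else REG8)[rm]
    if rm == 0b100:                      # slot taken from [esp]: a SIB byte follows
        return ["SIB", "SIB+disp8", "SIB+disp32"][mod]
    if mod == 0b00 and rm == 0b101:      # slot taken from [ebp]
        return "[disp32]"
    disp = ["", "+disp8", "+disp32"][mod]
    return "[" + REG32[rm] + disp + "]"
```

For example, the byte $45 splits into MOD=%01, REG=%000, R/M=%101, and describe_rm(1, 5)
returns "[ebp+disp8]", the form an assembler falls back on when you write [ebp].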

Also missing from this table are addressing modes of the form
[ebx+edx*4], the so-called scaled indexed addressing modes. You'll also notice that the table
is missing addressing modes of the form [esp], [esp+disp8], and [esp+disp32]. In the slots
where you would normally expect these addressing modes you'll find the SIB (scaled index
byte) modes. If these values appear in the MOD and R/M fields then the addressing mode is
a scaled indexed addressing mode with a second byte (the SIB byte) following the MOD-
REG-R/M byte that specifies the registers to use (note that the MOD field still specifies the
displacement size of zero, one, or four bytes). The following diagram shows the layout of this
SIB byte and the following tables explain the values for each field.
Figure 5.17 SIB (Scaled Index Byte) Layout
Scale Value   Index*Scale Value
%00           Index*1
%01           Index*2
%10           Index*4
%11           Index*8

Index Value   Index Register
%000          EAX
%001          ECX
%010          EDX
%011          EBX
%100          Illegal
%101          EBP
%110          ESI
%111          EDI

Base Value    Base Register
%000          EAX
%001          ECX
%010          EDX
%011          EBX
%100          ESP
%101          Displacement-only if MOD = %00, EBP if MOD = %01 or %10
%110          ESI
%111          EDI
The MOD-REG-R/M and SIB bytes are complex and convoluted, no question about that. The
reason these addressing mode bytes are so convoluted is because Intel reused their 16-bit
addressing circuitry in the 32-bit mode rather than simply abandoning the 16-bit format in the
32-bit mode. There are good hardware reasons for this, but the end result is a complex
scheme for specifying addressing modes.
Part of the reason the addressing scheme is so convoluted is because of the special cases
for the SIB and displacement-only modes. You will note that the intuitive encoding of the
MOD-REG-R/M byte does not allow for a displacement-only mode. Intel added a quick
kludge to the addressing scheme replacing the [EBP] addressing mode with the displace-
ment-only mode. Programmers who actually want to use the [EBP] addressing mode have to
use [EBP+0] instead. Semantically, this mode produces the same result but the instruction is
one byte longer since it requires a displacement byte containing zero.

You will also note that if the R/M field of the MOD-REG-R/M byte contains %100 and
MOD does not contain %11 then the addressing mode is an "SIB" mode rather than the
expected [ESP], [ESP+disp8], or [ESP+disp32] mode. The SIB mode is used when an
addressing mode uses a scaled index register, i.e., one of the following
addressing modes:
[reg32+eax*n]            MOD = %00
[reg32+ebx*n]            Note: n = 1, 2, 4, or 8.
[reg32+ecx*n]
[reg32+edx*n]
[reg32+ebp*n]
[reg32+esi*n]
[reg32+edi*n]
[disp8+reg32+eax*n]      MOD = %01
[disp8+reg32+ebx*n]
[disp8+reg32+ecx*n]
[disp8+reg32+edx*n]
[disp8+reg32+ebp*n]
[disp8+reg32+esi*n]
[disp8+reg32+edi*n]
[disp32+reg32+eax*n]     MOD = %10
[disp32+reg32+ebx*n]
[disp32+reg32+ecx*n]
[disp32+reg32+edx*n]
[disp32+reg32+ebp*n]
[disp32+reg32+esi*n]
[disp32+reg32+edi*n]
[disp32+eax*n]           MOD = %00 and BASE field contains %101
[disp32+ebx*n]
[disp32+ecx*n]
[disp32+edx*n]
[disp32+ebp*n]
[disp32+esi*n]
[disp32+edi*n]

In each of these addressing modes, the MOD field of the MOD-REG-R/M byte specifies the
size of the displacement (zero, one, or four bytes). This is indicated via the modes "SIB
Mode," "SIB + disp8 Mode," and "SIB + disp32 Mode." The Base and Index fields of the SIB
byte select the base and index registers, respectively. Note that this addressing mode does
not allow the use of the ESP register as an index register. Presumably, Intel left this particular
mode undefined to provide the ability to extend the addressing modes in a future version of
the CPU (although extending the addressing mode sequence to three bytes seems a bit
extreme).
Like the MOD-REG-R/M encoding, the SIB format redefines the [EBP+index*scale] mode as
a displacement plus index mode. Once again, if you really need this addressing mode, you
will have to use a single byte displacement value containing zero to achieve the same result.
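
The SIB rules above can be sketched in a few lines. This is only an illustration under stated
assumptions: the name describe_sib is hypothetical, the scale field is taken from the two H.O.
bits, the index from the middle three bits and the base from the three L.O. bits, and an index
value of %100 is treated as "no index register", which is how the plain [esp] forms are usually
encoded.

```python
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def describe_sib(mod, sib):
    """Describe the memory operand selected by MOD (%00, %01 or %10) and a SIB byte."""
    scale = 1 << ((sib >> 6) & 0b11)     # %00..%11 -> 1, 2, 4, 8
    index = (sib >> 3) & 0b111
    base = sib & 0b111

    parts = []
    if base == 0b101 and mod == 0b00:
        parts.append("disp32")           # no base register, 32-bit displacement
    else:
        parts.append(REG32[base])
    if index != 0b100:                   # assumption: %100 means "no index register"
        parts.append(REG32[index] + "*" + str(scale))
    if mod == 0b01:
        parts.append("disp8")
    elif mod == 0b10:
        parts.append("disp32")
    return "[" + "+".join(parts) + "]"
```

With MOD=%00, the SIB byte $BB (%10 %111 %011) gives "[ebx+edi*4]", and the SIB byte $05
(%00 %000 %101) gives "[disp32+eax*1]".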

Encoding the ADD Instruction: Some Examples

To figure out how to encode an instruction using this complex scheme, some examples
are warranted. So let's take a look at how to encode the 80x86 ADD instruction using
various addressing modes. The ADD opcode is $00, $01, $02, or $03, depending on the
direction and size bits in the opcode (see Figure 5.15). The following figures each
describe how to encode various forms of the ADD instruction using different addressing
modes.
Figure 5.18 Encoding the ADD( al, cl ); Instruction
There is an interesting side effect of the operation of the direction bit and the MOD-REG-
R/M organization: some instructions have two different opcodes (and both are legal). For
example, we could encode the "add( al, cl );" instruction from Figure 5.18 as $02, $C8 by
reversing the AL and CL registers in the REG and R/M fields and then setting the d bit in
the opcode (bit #1). This issue applies to instructions with two register operands.

Figure 5.19 Encoding the ADD( eax, ecx ); instruction
Note that we can also encode "add( eax, ecx );" using the bytes $03, $C8.

Figure 5.20 Encoding the ADD( disp, edx ); Instruction

Figure 5.21 Encoding the ADD( [ebx], edi ); Instruction

Figure 5.22 Encoding the ADD( [esi+disp8], eax ); Instruction

Figure 5.23 Encoding the ADD ( [ebp+disp32], ebx); Instruction

Figure 5.24 Encoding the ADD( [disp32 +eax*1], ebp ); Instruction

Figure 5.25 Encoding the ADD( [ebx + edi * 4], ecx ); Instruction
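
Since the figures are not reproduced here, the register-to-register cases (Figures 5.18 and
5.19) can be recomputed with a short sketch. It assumes the opcode layout %000000ds described
above; the function name encode_add_reg_reg is our own, and operands are written HLA-style as
add( source, destination ).

```python
REG8 = ["al", "cl", "dl", "bl", "ah", "ch", "dh", "bh"]
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def encode_add_reg_reg(src, dest):
    """Return both two-byte encodings of add(src, dest) for two registers.

    Opcode bits: %000000ds, where s=0 for eight-bit and s=1 for 32-bit
    operands, and d=1 makes the REG field the destination.
    """
    table = REG8 if src in REG8 else REG32
    s = 0 if table is REG8 else 1
    src_n, dest_n = table.index(src), table.index(dest)

    # d=0: REG holds the source, R/M holds the destination (MOD=%11).
    form_d0 = bytes([0x00 | s, 0xC0 | (src_n << 3) | dest_n])
    # d=1: REG holds the destination, R/M holds the source (MOD=%11).
    form_d1 = bytes([0x02 | s, 0xC0 | (dest_n << 3) | src_n])
    return form_d0, form_d1
```

Calling encode_add_reg_reg("al", "cl") yields the byte pairs $00, $C1 and $02, $C8, and
encode_add_reg_reg("eax", "ecx") yields $01, $C1 and $03, $C8. A disassembler must accept both
members of each pair as the same instruction.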

Encoding Immediate Operands

You may have noticed that the MOD-REG-R/M and SIB bytes don't contain any bit combi-
nations you can use to specify an immediate operand. The 80x86 uses a completely dif-
ferent opcode to specify an immediate operand. Figure 5.26 shows the basic encoding for
an ADD immediate instruction.
Figure 5.26 Encoding an ADD Immediate Instruction
There are three major differences between the encoding of the ADD immediate and the
standard ADD instruction. First, and most important, the opcode has a one in the H.O. bit
position. This tells the CPU that the instruction has an immediate constant. This individual
change, however, does not tell the CPU that it must execute an ADD instruction, as you'll
see momentarily.
The second difference is that there is no direction bit in the opcode. This makes sense
because you cannot specify a constant as a destination operand. Therefore, the destina-
tion operand is always the location the MOD and R/M bits specify in the MOD-REG-R/M
field.

In place of the direction bit, the opcode has a sign extension (x) bit. For eight-bit operands,
the CPU ignores this bit. For 16-bit and 32-bit operands, this bit specifies the size of the con-
stant following the ADD instruction. If this bit contains zero then the constant is the same size
as the operand (i.e., 16 or 32 bits). If this bit contains one then the constant is a signed eight-
bit value and the CPU sign extends this value to the appropriate size before adding it to the
operand. This little trick often makes programs quite a bit shorter because one commonly
adds small valued constants to 16 or 32 bit operands.
The third difference between the ADD immediate and the standard ADD instruction is the
meaning of the REG field in the MOD-REG-R/M byte. Since the instruction implies that the
source operand is a constant and the MOD-R/M fields specify the destination operand, the
instruction does not need to use the REG field to specify an operand. Instead, the 80x86
CPU uses these three bits as an opcode extension. For the ADD immediate instruction these
three bits must contain zero (other bit patterns would correspond to a different instruction).
Note that when adding a constant to a memory location, the displacement (if any) associated
with the memory location immediately precedes the immediate (constant) data in the opcode
sequence.
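
To make the role of the x bit concrete, here is a small sketch that encodes an ADD immediate
with a 32-bit register destination. It assumes the opcode layout %100000xs with s=1; the
function name encode_add_imm32reg is hypothetical, and memory destinations are left out.

```python
import struct

REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def encode_add_imm32reg(constant, dest):
    """Encode add(constant, dest) for a 32-bit register destination.

    Opcode bits: %100000xs with s=1 (32-bit operand). REG field = %000
    (the opcode extension for ADD), MOD = %11, R/M = destination register.
    """
    modrm = 0xC0 | (0b000 << 3) | REG32.index(dest)
    if -128 <= constant <= 127:
        # x=1: one signed byte follows; the CPU sign-extends it to 32 bits.
        return bytes([0x83, modrm]) + struct.pack("<b", constant)
    # x=0: a full 32-bit constant follows, least significant byte first.
    return bytes([0x81, modrm]) + struct.pack("<I", constant & 0xFFFFFFFF)
```

Adding 5 to EBX takes three bytes ($83, $C3, $05), while adding $12345678 takes six
($81, $C3, $78, $56, $34, $12).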

Encoding Eight, Sixteen, and Thirty-Two Bit Operands

When Intel designed the 8086 they used one bit (s) in the opcode to select between eight
and sixteen bit integer operand sizes. Later, when they extended the 80x86 architecture
to 32 bits with the introduction of the 80386, they had a problem: with this single bit they
could only encode two sizes, but they needed to encode three (8, 16, and 32 bits). To
solve this problem, they used an operand size prefix byte.
Intel studied their instruction set and came to the conclusion that in a 32-bit environment,
programs were likely to use eight-bit and 32-bit operands far more often than 16-bit
operands. So Intel decided to let the size bit (s) in the opcode select between eight and
thirty-two bit operands, as the previous sections describe. Although modern 32-bit
programs don't use 16-bit operands that often, they do need them now and then. To allow for
16-bit operands, Intel lets you prefix a 32-bit instruction with the operand size prefix byte,
whose value is $66. This prefix byte tells the CPU to operate on 16-bit data rather than
32-bit data.
You do not have to explicitly put an operand size prefix byte in front of your 16-bit instruc-
tions; the assembler will take care of this automatically for you whenever you use a 16-bit
operand in an instruction. However, do keep in mind that whenever you use a 16-bit oper-
and in a 32-bit program, the instruction is longer (by one byte) because of the prefix
value. Therefore, you should be careful about using 16-bit instructions if size (and to a
lesser extent, speed) are important because these instructions are longer (and may be
slower because of their effect on the cache).
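
From the disassembler's side, the size decision described above comes down to two tests. The
sketch below assumes code running in a 32-bit segment and handles only the layout discussed in
this section (an optional $66 prefix followed by an opcode whose L.O. bit is the size bit); the
name operand_size_bits is our own.

```python
OPERAND_SIZE_PREFIX = 0x66


def operand_size_bits(code):
    """Return the operand size (8, 16 or 32) of a 32-bit-mode instruction.

    `code` holds the instruction bytes. Only an optional $66 prefix followed
    by an opcode with a size bit (s) in bit 0 is handled.
    """
    has_prefix = code[0] == OPERAND_SIZE_PREFIX
    opcode = code[1] if has_prefix else code[0]
    if opcode & 1 == 0:
        return 8                # s=0: eight-bit operands, prefix changes nothing
    return 16 if has_prefix else 32
```

The bytes $01, $C1 therefore decode as a 32-bit ADD, while $66, $01, $C1 decode as the same
ADD on 16-bit registers at the cost of one extra byte.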

Alternate Encodings for Instructions

As noted earlier in this chapter, one of Intel's primary design goals for the 80x86 was to cre-
ate an instruction set to allow programmers to write very short programs in order to save pre-
cious (at the time) memory. One way they did this was to create alternate encodings of some
very commonly used instructions. These alternate instructions were shorter than the standard
counterparts and Intel hoped that programmers would make extensive use of these instruc-
tions, thus creating shorter programs.
Good examples of these alternate instructions are the "add( constant, accumulator );"
instructions (the accumulator is AL, AX, or EAX). The 80x86 provides a single byte opcode
for "add( constant, al );" and "add( constant, eax );" (the opcodes are $04 and $05,
respectively). With a one-byte opcode and no MOD-REG-R/M byte, these instructions are one byte
shorter than their standard ADD immediate counterparts. Note that the "add( constant, ax );"
instruction requires an operand size prefix (as does the standard "add( constant, ax );"
instruction), so its opcode is effectively two bytes if you count the prefix byte. This,
however, is still one byte shorter than the corresponding standard ADD immediate.
You do not have to specify anything special to use these instructions. Any decent assembler
will automatically choose the shortest possible instruction it can use when translating your
source code into machine code. However, you should note that Intel only provides alternate
encodings for the accumulator registers. Therefore, if you have a choice of several
instructions to use and the accumulator registers are among these choices, the AL/AX/EAX
registers almost always make the best bet. This is a good reason to take some time
and scan through the encodings of the 80x86 instructions. By familiarizing yourself
with the instruction encodings, you'll know which instructions have special (and, therefore,
shorter) encodings.
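
The one-byte saving is easy to check by building both forms side by side. This is a sketch
only; the function names add_const_to_eax and add_const_to_al are hypothetical, and the
"standard" form uses the full-size constant (x=0) so that the two encodings differ only in the
opcode bytes.

```python
import struct


def add_const_to_eax(constant):
    """Return (standard, short) encodings of add(constant, eax) with a 32-bit constant."""
    imm = struct.pack("<I", constant & 0xFFFFFFFF)
    standard = bytes([0x81, 0xC0]) + imm   # opcode, MOD-REG-R/M, imm32
    short = bytes([0x05]) + imm            # accumulator opcode, imm32
    return standard, short


def add_const_to_al(constant):
    """Return (standard, short) encodings of add(constant, al)."""
    imm = struct.pack("<B", constant & 0xFF)
    standard = bytes([0x80, 0xC0]) + imm   # opcode, MOD-REG-R/M, imm8
    short = bytes([0x04]) + imm            # accumulator opcode, imm8
    return standard, short
```

For constants in the range -128..+127 the sign-extended form ($83) of the standard 32-bit
instruction is only three bytes long, so the accumulator form pays off for larger constants.
A disassembler has to recognise all of these as ADD.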

Putting It All Together

Designing an instruction set that can stand the test of time is a true intellectual challenge.
An engineer must balance several compromises when choosing an instruction set and
assigning opcodes for the instructions. The Intel 80x86 instruction set is a classic exam-
ple of a kludge that people are currently using for purposes the original designers never
intended. However, the 80x86 is also a marvelous testament to the ingenuity of Intel's
engineers who were faced with the difficult task of extending the CPU in ways it was
never intended. The end result, though functional, is extremely complex. Clearly, no one
designing a CPU (from scratch) today would choose the encoding that Intel's engineers
are using. Nevertheless, the 80x86 CPU does demonstrate that careful planning (or just
plain luck) does give the designer the ability to extend the CPU far beyond its original
design.
Historically, an important fact we've learned from the 80x86 family is that it's very poor
planning to assume that your CPU will last only a short time period and that users will
replace the chip and their software when something better comes along. Software devel-
opers usually don't have a problem adapting to a new architecture when they write new
software (assuming financial incentive to do so), but they are very resistant to moving
existing software from one platform to another. This is the primary reason the Intel 80x86
platform remains popular to this day.
Choosing which instructions you want to incorporate into the initial design of a new CPU
is a difficult task. You must balance the desire to provide lots of useful instructions with
the silicon budget and you must also be careful not to include lots of irrelevant instructions
that programmers wind up ignoring for one reason or another. Remember, all future ver-
sions of the CPU will probably have to support all the instructions in the initial instruction
set, so it's better to err on the side of supplying too few instructions rather than too many.
Remember, you can always expand the instruction set in a later version of the chip.

Hand in hand with selecting the optimal instruction set is allowing for easy future expansion of
the chip. You must leave some undefined opcodes available so you can easily expand the
instruction set later on. However, you must balance the number of undefined opcodes with
the number of initial instructions and the size of your opcodes. For efficiency reasons, we
want the opcodes to be as short as possible. We also need a reasonable set of instructions in
the initial instruction set. A reasonable instruction set may consume most of the legal bit
patterns in a small opcode. So a hard decision has to be made: reduce the number of instructions
in the initial instruction set, increase the size of the opcode, or rely on an opcode prefix byte
(which makes the newer instructions you add later longer). There is no easy answer to this
problem; as the CPU designer, you must carefully weigh these choices during the initial CPU
design. Unfortunately, you can't easily change your mind later on.
Most CPUs (Von Neumann architecture) use a binary encoding of instructions and fetch
these instructions from memory. This chapter introduces the concept of binary instruction
encoding via the hypothetical "Y86" processor. This is a trivial (and not very practical) CPU
design that makes it easy to demonstrate how to choose opcodes for a simple instruction set,
encode operands, and leave room for future expansion. Some of the more interesting
features the Y86 demonstrates include the fact that an opcode often contains subfields and
that we usually group instructions by the number and types of operands they support. The Y86
encoding also demonstrates how to use special opcodes to differentiate one group of instructions
from another and to provide undefined (illegal) opcodes that we can use for future expansion.
The Y86 CPU is purely hypothetical and useful only as an educational tool. After exploring
the design of a simple instruction set with the Y86, this chapter began to discuss the encod-
ing of instructions on the 80x86 platform. While the full 80x86 instruction set is far too com-
plex to discuss this early in this text (i.e., there are lots of instructions we still have to discuss
later in this text), this chapter was able to discuss basic instruction encoding using the ADD
instruction as an example. Note that this chapter only touches on the 80x86 instruction
encoding scheme. For a full discussion of 80x86 encoding, see the appendices in this text
and the Intel 80x86 documentation.

1. As in "Everything, including the kitchen sink."
2. Not to mention faster and less expensive.
3. To many CPU designers it is not; however, since this was a design goal for the
8086 we'll follow this path.
4. Assuming this operation treats its single operand as both a source and destination
operand, a common way of handling this instruction.
5. Actually, Intel claims it's a one byte opcode plus a one-byte "mod-reg-r/m"
byte. For our purposes, we'll treat the mod-reg-r/m byte as part of the opcode.
6. The Y86 processor only performs unsigned comparisons.
7. Technically, registers do not have an address, but we apply the term addressing
mode to registers nonetheless.
8. All numeric constants in Y86 assembly language are given in hexadecimal. The "$"
prefix is not necessary.
9. "Parse" means to figure out the meaning of the statement.
10. This program is written with Borland's Delphi and was not ported to Linux by
the time this was written.
11. We could also have used values $F7, $EF, and $E7 since they also correspond to
an attempt to store a register into a constant. However, $FF is easier to decode.
On the other hand, if you need even more prefix bytes for instruction expansion,
you can use these three values as well.

Lesson 10 - Structured Exception Handling (SEH) [19]
Everybody is talking about SEH. It seems to be an advanced topic which only hardcore
experts use and understand. If you come from a high-level language like Delphi, Java or C++
you already know this concept. Handling exceptions as they arise is one important way to
reduce the "crashability" of your application.
I can give you one example:
You code an application which tries to open a file, but the file is not there. In Delphi GUI
code your application will crash, and you have no way to receive a true or false flag for
the operation. But you can simply use exception handling to avoid the crash. With
SEH you first "try" to do the operation. If something goes wrong it will cause an exception.
When this exception is "thrown" your application can execute different code instead, and
the application does not crash. That's it, nothing more.
19. Like the Iczelion tutorials (which we also included), this article is the work of its original
author. It is a great document, so please respect the author's work as we do.

Win32 Exception handling for assembler programmers

by Jeremy Gordon - Background [20]
We're going to examine how to make an application more robust by handling its own
exceptions, rather than permitting the system to do so. An "exception" is an offence com-
mitted by the program, which would otherwise result in the embarrassing appearance of
the dreaded closure message box:-
or its more elaborate counterpart in Windows NT.
20. This lesson is the full article by Jeremy Gordon (Copyright © Jeremy Gordon 1996-2002). There
was no need to write our own crappy lesson. This article still rulez. You can find the original
article with your favourite search engine.

What exception handling does ...
The idea of exception handling (often called "Structured Exception Handling") is that your
application instals one or more callback routines called "exception handlers" at run-time and
then, if an exception occurs, the system will call the routine to let the application deal with the
exception. The hope would be that the exception handler may be able to repair the exception
and continue running either from the same area of code where the exception occurred, or
from a "safe place" in the code as if nothing had happened. No closure message box would
then be displayed and the user would be none the wiser. As part of this repair it may be
necessary to close handles, close temporary files, free device contexts, free memory areas,
inform other threads, then unwind the stack or close down the offending thread. During this
process the exception handler may make a record of what it is doing and save this to a file for
later analysis.
If a repair cannot be achieved, exception handling allows your application to close gracefully,
having done as much clearing up, saving of data, and apologising as it can.

Planned exceptions
The Windows SDK suggests another use for exception handling. It is suggested as a way
to keep track of memory usage. The idea is that an exception will occur if you need to
commit more memory: you intercept it and carry out the memory allocation. This can be
done by intercepting a memory access violation [exception number 0C0000005h], which
would occur if your code tries to read from, or write to, memory which had not been com-
mitted.
Another way suggested to keep track of memory usage is to set the guard page flag in a
call to VirtualAlloc when committing the memory, or later using VirtualProtect. This causes
a guard page exception [080000001h] if an attempt is made to read from, or write to, a
guarded area of memory, after which the guard page flag is released. The exception
handler would therefore be kept informed of the memory requirements and could reset the
flag if required.
These methods are widely used throughout the system, for example, as more stack is
required by a thread, it is automatically enlarged.
An application, however, usually knows what it hopes to do next, so it is much simpler and
quicker to keep track of memory requirements by keeping the top of the memory area as
a data variable, and to check before the start of each series of memory read/write opera-
tions whether the memory area needs to be enlarged or diminished.
This works even if more than one thread uses the same area of memory, since the same
data variable can be used by each thread. In that case, handling the 0C0000005h excep-
tion might only be a backup in case your code went wrong.

And what exception handling cannot do ...
Apart from divide by zero [exception code 0C0000094h] which can easily be avoided by pro-
tective coding, the most common type of exception is an attempt to read from, or write to, an
illegal memory address [0C0000005h]. There are several ways that the second (illegal
address) can arise. For example:-
- wrong index register values when addressing memory
- unexpected continuous loops involving memory access
- mismatch of PUSHes and POPs so execution continues from the wrong place
after return from a CALL
- unforeseen corruption in input data files
It can be seen from this list that exceptions may occur in unexpected circumstances for a
variety of reasons. And it will be precisely this type of exception which may terminate your
program despite the best efforts of your exception handler. In these circumstances at the very
least, the exception handler should try to save important data which would otherwise be lost,
and then retire gracefully, with suitable apologies.
Other program failures
Your program may fail for other reasons which will not result in an exception at all.
The usual causes of this are:-
- insufficient system resources
- continuous loops in your program which do not involve memory access
The result is that your program will not be able to respond to system messages, so it will
appear to the user simply to have stopped. Luckily, however, because it runs in its own virtual
address space other programs will not be affected, although the whole system may appear to
run a little more slowly.

Utterly fatal exceptions
Some errors are so bad that the system cannot even manage to call your exception
handler. Then only if the user is lucky will the system's closure message box appear;
otherwise the devastating bright blue error screen will appear, showing that a "fatal" error
has occurred.
Almost inevitably this is a result of a total crash of the system and a reboot is the only
remedy. Fortunately in Win32 you have to try quite hard to produce such errors, but they
can still occur.
... and where exception handling really scores
Having spent some time on what exception handling cannot do, let's review the instances
where it is invaluable:-
- During program development, to catch and report on errors as an alterna-
tive to debug control.
- When using code written by others which may not be fully trusted.
- When reading from, or writing to, memory areas which may be moved without
notice. For example, while spelunking around system memory areas (which
would be under system control) or memory areas which could possibly be
closed by other processes or threads.
- Using pointers from files which may be corrupted or of the wrong format.
Here exception handling would be much quicker than using the IsBadReadPtr
or IsBadWritePtr APIs to check each pointer immediately prior to its use.
- As a general catch-all for all unforeseen bugs.

Exception handling in practice

The Windows sequence
In order to understand what your code can or should do when handling exceptions, you need
to know in some more detail what the system does when an exception occurs. If you are new
to the subject, the following may not yet be clear. However it is necessary to know these
steps to understand the subject. The steps are as follows:-
1. Windows decides first whether it is an exception which it is willing to send
   to the program's exception handler. If so, if the program is being debugged,
   Windows will notify the debugger of the exception by suspending the program
   and sending EXCEPTION_DEBUG_EVENT (value 1h) to the debugger.
2. If the program is not being debugged or if the exception is not dealt with by
   the debugger, the system sends the exception to your per-thread exception
   handler if you have installed one. A per-thread handler is installed at run-time
   and is pointed to by the first dword in the Thread Information Block
   whose address is at FS:[0].
3. The per-thread exception handler can try to deal with the exception, or it
   may not do so, leaving it for handlers further up the chain, if there are any
   more handlers installed.
4. Eventually if none of the per-thread handlers deal with the exception, if the
   program is being debugged the system will again suspend the program and
   notify the debugger.
5. If the program is not being debugged or if the exception is still not dealt
   with by the debugger, the system will call your final handler if one is
   installed. This will be a final handler installed at run-time by the application
   using the API SetUnhandledExceptionFilter.
6. If your final handler does not deal with the exception after it returns, the
   system final handler will be called. Optionally it will show the system's
   closure message box. Depending on the registry settings, this box may give
   the user a chance to attach a debugger to the program. If no debugger can be
   attached or if the debugger is powerless to assist, the program is doomed and
   the system will call ExitProcess to terminate the program.
7. Before finally terminating the program, though, the system will cause a
   "final unwind" of the stack for the thread in which the exception occurred.

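The order in which the seven steps offer the exception to each party can be modelled in a few
lines. This is a toy model only, not Windows code: the function dispatch and its parameters are
hypothetical, each handler is a plain callable that returns True if it dealt with the
exception, and nothing here calls a real API.

```python
def dispatch(thread_handlers, final_handler=None, debugger=None):
    """Return the list of parties offered an exception, in order.

    thread_handlers: callables, most recently installed first; each
        returns True if it dealt with the exception.
    final_handler, debugger: optional callables with the same convention.
    All names here are hypothetical; this models the order only.
    """
    trail = []

    def offer(name, handler):
        trail.append(name)
        return handler()

    if debugger and offer("debugger (first notification)", debugger):
        return trail
    for i, handler in enumerate(thread_handlers):
        if offer("per-thread handler %d" % i, handler):
            return trail
    if debugger and offer("debugger (second notification)", debugger):
        return trail
    if final_handler and offer("final handler", final_handler):
        return trail
    trail.append("system final handler, final unwind, ExitProcess")
    return trail
```

For instance, dispatch([lambda: False, lambda: True]) stops at the second per-thread handler,
while a program with no handlers at all goes straight to the system final handler.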
Advantages of using assembler for exception handling
Win32 provides only the framework for exception handling, using a handful of APIs. So
most of the code required for exception handling has to be coded by hand.
"C" programmers will use various shortcuts provided by their compilers by including in
their source code statements such as _try, _except, _finally, _catch and _throw.
One real disadvantage in relying on the compiler's code is that it can enlarge the final exe
file enormously.
Also most C programmers would have no idea what code is produced by the compiler
when exception handling is used, and this is a real disadvantage because to handle
exceptions properly you need flexibility, understanding and control. This is because
exceptions can be intercepted and handled in various ways and at various different levels
in your code. Using assembler you can produce tight, reliable and flexible code which you
can tailor closely to your own application.
Multi-threaded applications need particularly careful treatment and assembler provides a
simple and versatile way to add exception handling to such programs.
Information about exception handling at a low level is hard to get hold of, and the samples
in the Win32 Software Development Kit (SDK) concentrate on how to use the "C" com-
piler statements rather than how to hard-wire a program to use the Win32 framework
itself.
The information in this article was obtained using a test program and a debugger, and by
disassembling code produced by "C" compilers. The accompanying programs,
Except1.exe and Except2.exe, demonstrate the techniques described here.

Setting up simple exception handlers

I hope you will be pleasantly surprised to see in practice how easy it is in assembler to add
exception handling to your programs.
The two types of exception handlers

As you have seen above, there are two types of exception handlers.
Type 1 - the "final" exception handler

The "final" exception handler is called by the system if your program is doomed to close.
Because this handler is process-specific it is called irrespective of which thread caused the
exception.
Establishing a final exception handler

Typically, this is established in the main thread as soon as possible after the program entry
point by calling the API SetUnhandledExceptionFilter. It therefore covers the whole program
from that point until termination. There is no need to remove the handler on termination - this
is done automatically by Windows.

Example
START:                                ;program entry point
PUSH ADDR FINAL_HANDLER
CALL SetUnhandledExceptionFilter
; ...
; ...                                 ;code covered by final handler
; ...
CALL ExitProcess
;************************************
FINAL_HANDLER:
; ...
; ...                                 ;code to provide a polite exit
; ...
;(eax=-1 reload context and continue)
MOV EAX,1                             ;eax=1 stops display of closure box
RET                                   ;eax=0 enables display of the box
No chaining of final exception handlers
There can only be one application-defined final exception handler in the process at any
one time. If SetUnhandledExceptionFilter is called a second time in your code the
address of the final exception handler is simply changed to the new value, and the previ-
ous one is discarded.
Type 2 - the "per-thread" exception handler
This type of handler is typically used to guard certain areas of code and is established by
altering the value held by the system at FS:[0]. Each thread in your program has a differ-
ent value for the segment register FS, so this exception handler will be thread specific. It
will be called if an exception occurs during the execution of code protected by the han-
dler.

The value in FS is a 16-bit selector which points to the "Thread Information Block", a structure
which contains important information about each thread. The very first dword in the Thread
Information Block points to a structure which we are going to call an "ERR" structure.
The "ERR" structure is at least 2 dwords as follows:-
1st dword +0 Pointer to next ERR structure
2nd dword +4 Pointer to own exception handler
Establishing a "per-thread" exception handler
So now we can see how easy it is to establish this type of exception handler:-

Example
PUSH ADDR HANDLER
FS PUSH [0]                  ;address of next ERR structure
FS MOV [0],ESP               ;give FS:[0] the ERR address just made
...
...                          ;the code protected by the handler goes here
...
FS POP [0]                   ;restore next ERR structure to FS:[0]
ADD ESP,4h                   ;throw away rest of ERR structure
RET
;***********************
HANDLER:
...
...                          ;exception handler code goes here
...
MOV EAX,1                    ;eax=1 go to next handler
RET                          ;eax=0 reload context & continue execution

Chaining of per-thread exception handlers
In the above code we can see that the 2nd dword of the ERR structure, which is the address
of your handler, is put on the stack first, then the 1st dword, which is the address of the next
ERR structure, is put on the stack by the instruction FS PUSH [0]. Suppose the code which was then protected by this
handler called other functions which needed their own individual protection. Then you may
create another ERR structure and handler to protect that code in exactly the same way. This
is called chaining. In practice this means that when an exception occurs the system will walk
the handler chain by first calling the exception handler most recently established before the
code where the exception occurred. If that handler does not deal with the exception (return-
ing EAX=1), then the system calls the next handler up the chain. Since each ERR structure
contains the address of the next handler up the chain, any number of such handlers can be
established in this way. Each handler might guard against or deal with particular types of
exceptions depending on what is foreseeable in your code. The stack is used to keep the
ERR structure, to avoid write-overs. However there is nothing to stop you using other parts of
memory for the ERR structures if you prefer.
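
The walk of the chain can be pictured with a small model. The following Python sketch is not Win32 code: ordinary objects stand in for ERR structures, `None` stands in for the end of the chain, and the handler names and the choice of exception code are made up for illustration.

```python
# Model only: plain Python objects stand in for ERR structures on the stack.
GO_TO_NEXT_HANDLER = 1    # per-thread handler returned EAX=1
RELOAD_AND_CONTINUE = 0   # per-thread handler returned EAX=0


class Err:
    """Two-dword ERR structure: +0 next ERR, +4 own handler."""

    def __init__(self, next_err, handler):
        self.next_err = next_err
        self.handler = handler


def walk_chain(head, exception_code):
    """Call handlers from the most recently established one upwards.

    Returns (names of handlers called, name of the handler which dealt
    with the exception, or None if every handler returned 1).
    """
    called = []
    err = head                        # the value the system reads from FS:[0]
    while err is not None:            # None models the end of the chain
        called.append(err.handler.__name__)
        if err.handler(exception_code) == RELOAD_AND_CONTINUE:
            return called, err.handler.__name__
        err = err.next_err
    return called, None


def outer_handler(code):              # hypothetical: deals with divide by zero only
    return RELOAD_AND_CONTINUE if code == 0xC0000094 else GO_TO_NEXT_HANDLER


def middle_handler(code):             # hypothetical: deals with nothing
    return GO_TO_NEXT_HANDLER


def inner_handler(code):              # hypothetical: deals with nothing
    return GO_TO_NEXT_HANDLER


err_outer = Err(None, outer_handler)
err_middle = Err(err_outer, middle_handler)
err_inner = Err(err_middle, inner_handler)   # most recent: FS:[0] points here

called, dealt_with_by = walk_chain(err_inner, 0xC0000094)
print(called, dealt_with_by)
```

If every handler returns 1 (for example with code C0000005h in this model) the walk falls off the end of the chain, which is the point at which the final handler would be called.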

Stack unwinds
We're going to look at stack unwinds at this point because they shouldn't keep their
mystery any longer! A "stack unwind" sounds very dramatic, but in practice it's simply all
about calling the exception handlers whose local data is held further down the stack and
then (probably) continuing execution from another stack frame. In other words the program
gets ready to ignore the stack contents between these two positions.
Suppose you have a chain of per-thread handlers established as in this arrangement,
where Function A calls Function B which calls Function C:-

Then the stack will look something like this (addresses increase down the page, +ve):-
3rd Stack Frame    Use of stack by Function C
                   Handler 3
                   Local Data Function C
                   Return address Function C

2nd Stack Frame    Use of stack by Function B
                   Handler 2
                   Local Data Function B
                   Return address Function B

1st Stack Frame    Use of stack by Function A
                   Handler 1
                   Local Data Function A
                   Return address Function A
Here as each function is called things are PUSHed onto the stack: firstly the return address,
then local data, and then the exception handler (this is the "ERR" structure referred to ear-
lier).

Then suppose that an exception occurs in Function C. As we have seen, the system will
cause a walk of the handler chain. Handler 3 will be called first. Suppose Handler 3 does
not deal with the exception (returning EAX=1), then Handler 2 will be called. Suppose
Handler 2 also returns EAX=1 so that Handler 1 is called. If Handler 1 deals with the
exception, it may need to cause a clear-up using local data in the stack frames created by
Functions B and C.
It can do so by causing an Unwind.
This simply repeats the walk of the handler chain again, causing first Handler 3 then Han-
dler 2, then Handler 1 to be called in turn.
The differences between this type of handler chain walk and the walk initiated by the sys-
tem when the exception first occurred are as follows:-
1. This handler walk is initiated by your handler rather than by the system.
2. The exception flag in the EXCEPTION_RECORD should be set to 2h
(EH_UNWINDING). This indicates to the per-thread handler that it is being called
by another handler higher in the chain to clear-up using local data. It should not
attempt to do any more than that and it must return EAX=1.
3. The handler walk stops at the handler immediately before the caller. For example
in the diagram, if Handler 1 initiates the unwind, the last Handler to be called during
the unwind is Handler 2. There is no need for Handler 1 to be called from
within itself because it has access to its own local data to clear-up.
You can see below ("Providing access to local data") how the handler is able to find local
data during the handler walk.
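
The three differences can be seen in a small model. This Python sketch is only a simulation of the walk described above: a list stands in for the handler chain, a dictionary stands in for the EXCEPTION_RECORD, and `make_handler`, `unwind` and `clear_up_log` are names invented for the illustration.

```python
EH_UNWINDING = 0x2
clear_up_log = []


def make_handler(name):
    """Build a model per-thread handler which only clears up when unwinding."""
    def handler(exception_record):
        if exception_record["ExceptionFlag"] & EH_UNWINDING:
            clear_up_log.append(name)    # clear-up using this frame's local data
        return 1                         # must return EAX=1 during an unwind
    handler.__name__ = name
    return handler


def unwind(chain, initiator, exception_record):
    """Walk chain (most recent handler first) up to, not including, initiator."""
    exception_record["ExceptionFlag"] = EH_UNWINDING   # difference 2
    called = []
    for handler in chain:
        if handler is initiator:
            break                        # difference 3: stop before the caller
        handler(exception_record)
        called.append(handler.__name__)
    return called


handler_1 = make_handler("Handler 1")
handler_2 = make_handler("Handler 2")
handler_3 = make_handler("Handler 3")
record = {"ExceptionCode": 0xC0000005, "ExceptionFlag": 0}

# Difference 1: the walk is started by Handler 1, not by the system.
called = unwind([handler_3, handler_2, handler_1], handler_1, record)
print(called)
```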

How the unwind is done
The handler can initiate an unwind using the API RtlUnwind or, as we shall see, it can also
easily be done using your own code. This API can be called as follows:-
PUSH Return value
PUSH pExceptionRecord
PUSH ADDR CodeLabel
PUSH LastStackFrame
CALL RtlUnwind
Where:-
Return value is said to give a return value after the unwind (you would probably not use
this).
pExceptionRecord is a pointer to the exception record, which is one of the structures
sent to the handler when an exception occurs.
CodeLabel is a place from which execution should continue after the unwind and is typically
the code address immediately after the call to RtlUnwind. If this is not specified the
API appears to return in the normal way, however the SDK suggests that it should be
used and it is better to play safe with this type of API.
LastStackFrame is the stack frame at which the unwind should stop. Typically this will
be the stack address of the ERR structure which contains the address of the handler
which is initiating the unwind.
Unlike other APIs you cannot rely on RtlUnwind saving the EBX, ESI or EDI registers - if
you are using these in your code you should ensure that they are saved prior to PUSHing
the first parameter and restored after the CodeLabel.

Own-code Unwind
The following code simulates the unwind (where ebx holds the address of the
EXCEPTION_RECORD structure sent to the handler):-
MOV D[EBX+4],2h ;make the exception flag EH_UNWINDING

FS MOV EDI,[0] ;get 1st per-thread handler address
L2: ;
CMP D[EDI],-1 ;see if it’s the last one
JZ >L3 ;yes, so finish
PUSH EDI,EBX ;push ERR structure, EXCEPTION_RECORD
CALL [EDI+4] ;call handler to run clear-up code
ADD ESP,8h ;remove the two parameters pushed
MOV EDI,[EDI] ;get pointer to next ERR structure
JMP L2 ;and do next if not at end
L3: ;code label when finished
Here each handler is called in turn with the ExceptionFlag set to 2h until the last handler
is reached (the system has a value of -1 in the last ERR structure).
The above code does not check for corruption of the values at [EDI] and at [EDI+4]. The
first is a stack address and could be checked by ensuring that it is above the thread's
stack base given by FS:[8] and below the thread's stack top given by FS:[4]. The second
is a code address and so you could check that it lies within two code labels, one at the
start of your code and one at the end of it. Alternatively you could check that [EDI] and
[EDI+4] could be read by calling the API IsBadReadPtr.
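
The range checks suggested above amount to two comparisons. Here is a minimal Python sketch of the logic only; the function name and all the addresses are hypothetical, and in real code the two stack limits would be read from FS:[8] and FS:[4].

```python
def err_looks_sane(err_addr, handler_addr, stack_low, stack_high,
                   code_start, code_end):
    """Return True if both fields of an ERR structure pass the range checks.

    err_addr     - the value found at [EDI] (pointer to the next ERR structure)
    handler_addr - the value found at [EDI+4] (code address of its handler)
    """
    on_stack = stack_low < err_addr < stack_high
    in_code = code_start <= handler_addr < code_end
    return on_stack and in_code


# Hypothetical values standing in for FS:[8], FS:[4] and two code labels.
STACK_LOW, STACK_HIGH = 0x0012D000, 0x00130000
CODE_START, CODE_END = 0x00401000, 0x00402000

good = err_looks_sane(0x0012FF24, 0x00401080,
                      STACK_LOW, STACK_HIGH, CODE_START, CODE_END)
corrupt = err_looks_sane(0x7FFE0000, 0x00401080,
                         STACK_LOW, STACK_HIGH, CODE_START, CODE_END)
print(good, corrupt)
```

Note that this check assumes the ERR structures are kept on the stack. If you keep them elsewhere in memory, as mentioned earlier, the stack test would need to be replaced.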

Unwind by final handler then continue
It is not just a per-thread handler which can initiate a stack unwind. It can also be done in your
final handler by calling either RtlUnwind or an own-code unwind and then returning EAX= -1.
(See "Continuing execution after final handler called").
Final unwind then terminate
If a final handler is installed and it returns either EAX=0 or EAX=1, the system will cause the
process to terminate. However, before final termination something interesting happens. The
system does a final unwind by going back to the very first handler in the chain (that is to say,
the handler guarding the code in which the exception occurred). This is the very last opportu-
nity for your handler to execute the clear-up code necessary within each stack frame. You
can see this final unwind clearly occurring if you set the accompanying demo program
Except2.exe to allow the exception to go to the final handler and press either F3 or F5 when
there. It also happens in the simpler Except1.exe program.

The information sent to the handlers

Clearly sufficient information must be sent to the handlers for them to be able to try to
repair the exception, make error logs, or report to the user. As we shall see, this informa-
tion is sent by the system itself on the stack, when the handlers are called. In addition to
this you can send your own information to the handlers by enlarging the ERR structure so
that it contains more information.
The information sent to the final handler
The final handler is documented in the Windows Software Development Kit ("SDK") as
the API "UnhandledExceptionFilter". It receives one parameter only, a pointer to the struc-
ture EXCEPTION_POINTERS. This structure is as follows:-
EXCEPTION_POINTERS +0 Pointer to structure:- EXCEPTION_RECORD
+4 Pointer to structure:- CONTEXT record

The structure EXCEPTION_RECORD has these fields:-
EXCEPTION_RECORD +0 ExceptionCode
+4 ExceptionFlag
+8 NestedExceptionRecord
+C ExceptionAddress
+10 NumberParameters
+14 AdditionalData

Where
ExceptionCode gives the type of exception which has occurred. There are a
number of these listed in the SDK and header files, but in prac-
tice, the types which you may come across are:-
C0000005h - Read or write memory violation
C0000094h - Divide by zero
C0000095h - Divide overflow
C00000FDh - The stack went beyond the maximum available size
80000001h - Violation of a guard page in memory set up using VirtualAlloc
The following only occur whilst dealing with exceptions:-
C0000025h - A non-continuable exception - the handler should not try to deal
with it
C0000026h - Exception code used by the system during exception handling.
This code might be used if the system encounters an unexpected
return from a handler. It is also used if no Exception Record is
supplied when calling RtlUnwind.
The following are used in debugging:-
80000003h - Breakpoint occurred because there was an INT3 in the code
80000004h - Single step during debugging
The exception codes follow these rules:

Bits 31-30       Bit 29           Bit 28          Bits 27-0
0=success        0=Microsoft      Reserved        For exception code
1=information    1=Application    Must be zero
2=warning
3=error

A typical own exception code sent by RaiseException might therefore be E0000100h
(error, application, code=100h).

Own user code - this would be sent by your own application by calling the API
RaiseException. This is a quick way to exit code directly into your
handler if required.
Exception flag which gives instructions to the handler. The values can be:-
0 - a continuable exception (can be repaired)
1 - a non-continuable exception (cannot be repaired)
2 - the stack is unwinding - do not try to repair
Nested exception record pointing to another EXCEPTION_RECORD structure if the
handler itself has caused another exception
Exception address - the address in code where the exception occurred
NumberParameters - number of dwords to follow in Additional information
Additional information - array of dwords with further information
This can either be information sent by the application itself when calling
RaiseException, or, if the exception code is C0000005h it will be as
follows:-
1st dword - 0=a read violation, 1=a write violation.
2nd dword - address of access violation
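
The bit rules for the exception code, and the two dwords of additional information for C0000005h, are easy to decode. The following Python sketch simply applies the rules given above; the function names and the sample fault address are invented for the illustration.

```python
SEVERITY = {0: "success", 1: "information", 2: "warning", 3: "error"}


def decode_exception_code(code):
    """Split a 32-bit exception code into the fields described above."""
    return {
        "severity": SEVERITY[(code >> 30) & 0x3],   # bits 31-30
        "application": bool((code >> 29) & 0x1),    # bit 29: 1=Application
        "reserved": (code >> 28) & 0x1,             # bit 28: must be zero
        "code": code & 0x0FFFFFFF,                  # bits 27-0
    }


def describe_access_violation(additional):
    """Interpret the additional information sent with C0000005h."""
    kind = "write" if additional[0] == 1 else "read"
    return f"{kind} violation at {additional[1]:08X}h"


own_code = decode_exception_code(0xE0000100)
print(own_code)
print(decode_exception_code(0xC0000005))
print(describe_access_violation([1, 0x00401000]))   # hypothetical address
```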
The second part of the EXCEPTION_POINTERS structure which is sent to the final handler
points to the CONTEXT record structure which contains the processor-specific values of all
the registers at the time of the exception. WINNT.H contains the CONTEXT structures for var-
ious processors. Your program can find out what sort of processor is being used by calling
GetSystemInfo. CONTEXT is as follows for IA32 (Intel 386 and upwards):-

+0 context flags
(used when calling GetThreadContext)
DEBUG REGISTERS
+4 debug register #0
+8 debug register #1
+C debug register #2
+10 debug register #3
+14 debug register #6
+18 debug register #7
FLOATING POINT / MMX registers
+1C ControlWord
+20 StatusWord
+24 TagWord
+28 ErrorOffset
+2C ErrorSelector
+30 DataOffset
+34 DataSelector
+38 FP registers x 8 (10 bytes each)
+88 Cr0NpxState
SEGMENT REGISTERS
+8C gs register
+90 fs register
+94 es register
+98 ds register
ORDINARY REGISTERS
+9C edi register
+A0 esi register
+A4 ebx register
+A8 edx register
+AC ecx register
+B0 eax register
CONTROL REGISTERS
+B4 ebp register
+B8 eip register
+BC cs register
+C0 eflags register
+C4 esp register
+C8 ss register
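
If you want to check these offsets for yourself, the layout can be rebuilt from dword-sized fields. The following Python sketch uses ctypes purely as an offset calculator (it does not call Windows). The field names follow WINNT.H, the class name Context32 is made up, and the model stops at the ss register, so anything which follows it in the real structure is left out.

```python
import ctypes

DWORD = ctypes.c_uint32


class Context32(ctypes.Structure):
    """IA32 CONTEXT record up to and including the ss register (model only)."""
    _fields_ = [
        ("ContextFlags", DWORD),
        # debug registers
        ("Dr0", DWORD), ("Dr1", DWORD), ("Dr2", DWORD),
        ("Dr3", DWORD), ("Dr6", DWORD), ("Dr7", DWORD),
        # floating point / MMX save area
        ("ControlWord", DWORD), ("StatusWord", DWORD), ("TagWord", DWORD),
        ("ErrorOffset", DWORD), ("ErrorSelector", DWORD),
        ("DataOffset", DWORD), ("DataSelector", DWORD),
        ("RegisterArea", ctypes.c_ubyte * 80),      # 8 FP registers x 10 bytes
        ("Cr0NpxState", DWORD),
        # segment registers
        ("SegGs", DWORD), ("SegFs", DWORD), ("SegEs", DWORD), ("SegDs", DWORD),
        # ordinary registers
        ("Edi", DWORD), ("Esi", DWORD), ("Ebx", DWORD),
        ("Edx", DWORD), ("Ecx", DWORD), ("Eax", DWORD),
        # control registers
        ("Ebp", DWORD), ("Eip", DWORD), ("SegCs", DWORD),
        ("EFlags", DWORD), ("Esp", DWORD), ("SegSs", DWORD),
    ]


for name in ("Ebp", "Eip", "EFlags", "Esp"):
    print(f"{name:7s}+{getattr(Context32, name).offset:X}")
```

These four offsets (ebp, eip, eflags and esp) are the ones a handler is most likely to alter before asking the system to reload the context.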

The information sent to the per-thread handlers
At the time of the call to the per-thread handler, the stack at ESP holds three pointers as follows:-
ESP+4 Pointer to structure:- EXCEPTION_RECORD
ESP+8 Pointer to own ERR structure
ESP+C Pointer to structure:- CONTEXT record
Unlike usual CALLBACKs in Windows, when the per-thread handler is called, the
C calling convention is used (caller to remove the arguments from the stack) not
the PASCAL convention (function to do so). This can be seen from the actual
Kernel32 code used to make the call:-
PUSH Param, CONTEXT record, ERR, EXCEPTION_RECORD

CALL HANDLER
ADD ESP,10h
In practice the first argument, Param, was not found to contain meaningful
information

The EXCEPTION_RECORD and CONTEXT record structures have already been
described above.
The ERR structure is the structure you created on the stack when the handler was estab-
lished and it must contain the pointer to the next ERR structure and the code address of
the handler now being installed (see "Setting up simple exception handlers", above). The
pointer to the ERR structure passed to the per-thread handler is to the top of this struc-
ture. It is possible, therefore, to enlarge the ERR structure so that the handler can receive
additional information.
In a typical arrangement the ERR structure might look like this, where [ESP+8h] points to
the top of this structure when the handler is called:-
ERR +0 Pointer to next ERR structure
+4 Pointer to own exception handler
+8 Code address of "safe-place" for handler
+C Information for handler
+10 Area for flags
+14 Value of EBP at safe-place
As we shall see below ("Continuing execution from a safe-place"), the fields at +8 and
+14 may be used by the handler to recover from the exception.

Providing access to local data
Let's now consider the best position of the ERR structure on the stack relative to the stack
frame, which may well hold local data variables. This is important because the handler may
well need access to this local data in order to clear-up properly. Here is some typical code
which may be used to establish a per-thread handler where there is local data:-
MYFUNCTION: ;procedure entry point

PUSH EBP ;save ebp (used to address stack frame)
MOV EBP,ESP ;use EBP as stack frame pointer
SUB ESP,40h ;make 16 dwords on stack for local data
;******** local data now at [EBP-4] to [EBP-40h]
;********** install handler and its ERR structure
PUSH EBP ;ERR+14h save ebp (being ebp at safe-place)
PUSH 0 ;ERR+10h area for flags
PUSH 0 ;ERR+0Ch information for handler
PUSH ADDR SAFE_PLACE ;ERR+8h new eip at safe-place
PUSH ADDR HANDLER ;ERR+4h address of handler
FS PUSH [0] ;ERR+0h keep next ERR up the chain
FS MOV [0],ESP ;point to ERR just made on the stack
... ;
... ;code which is protected goes here
... ;
JMP >L10 ;normal end if there is no exception
SAFE_PLACE: ;handler sets eip/esp/ebp for here
L10: ;
FS POP [0] ;restore next ERR up the chain
MOV ESP,EBP
POP EBP
RET
;*****************
HANDLER:
RET

Using this code, when the handler is called, the following is on the stack, and with
[ESP+8h] pointing to the top of the ERR structure (ie. ERR+0):-
Stack (addresses increase down the page, +ve)
ERR +0 Pointer to next ERR structure
ERR +4 Pointer to own exception handler
ERR +8 Code address of "safe-place" for handler
ERR +C Information for handler
ERR +10 Area for flags
ERR +14 Value of EBP at safe-place
+18 Local Data
+1C Local Data
+20 Local Data
more local data (continuing down the page)
You can see from this that since the handler is given a pointer to the ERR structure it can
also find the address of local data on the stack. This is because the handler knows the
size of the ERR structure and also the position of the local data on the stack. If the EBP
field is used at ERR+14h as in the above example, that could also be used as a pointer to
the local data.
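
The arithmetic can be checked with a small simulation. The following Python sketch is not Win32 code: it replays the stack instructions of MYFUNCTION on a dictionary standing in for stack memory. The starting ESP, the caller's EBP and the three code addresses are made-up values.

```python
class Stack:
    """Very small model of a downward-growing stack of dwords."""

    def __init__(self, esp):
        self.esp = esp
        self.mem = {}

    def push(self, value):
        self.esp -= 4
        self.mem[self.esp] = value


# Hypothetical values, chosen only for the illustration.
CALLERS_EBP = 0x0012FFB0
SAFE_PLACE = 0x00401050
HANDLER = 0x00401080
PREVIOUS_ERR = 0x0012FFE0            # what FS:[0] held on entry

stack = Stack(esp=0x0012FF80)        # ESP on entry, return address already pushed
stack.push(CALLERS_EBP)              # PUSH EBP
ebp = stack.esp                      # MOV EBP,ESP
stack.esp -= 0x40                    # SUB ESP,40h (16 dwords of local data)
stack.push(ebp)                      # ERR+14h  ebp at safe-place
stack.push(0)                        # ERR+10h  area for flags
stack.push(0)                        # ERR+0Ch  information for handler
stack.push(SAFE_PLACE)               # ERR+8h   new eip at safe-place
stack.push(HANDLER)                  # ERR+4h   address of handler
stack.push(PREVIOUS_ERR)             # ERR+0h   FS PUSH [0]
err = stack.esp                      # FS MOV [0],ESP

print(f"ERR structure at {err:08X}h")
print(f"first local data at ERR+18h = {err + 0x18:08X}h = EBP-40h")
print(f"EBP kept at ERR+14h = {stack.mem[err + 0x14]:08X}h")
```

So a handler given only the pointer to the ERR structure can reach the local data either by adding 18h to that pointer, or through the EBP value kept at ERR+14h.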

Recovering from and Repairing an exception

Continuing execution from a safe-place
Choosing the safe-place
You need to continue execution from a place in the code which will not cause further prob-
lems. The main thing you must bear in mind is that since your program is designed to work
within the Windows framework, your aim is to return to the system as soon as possible in a
controlled manner, so that you can wait for the next system event. If the exception has
occurred during the call by the system to a window procedure, then often a good safe-place
will be near the exit point of the window procedure so that control passes back to the system
cleanly. In this case it will simply appear to the system that your application has returned from
the window procedure in the usual way.
If the exception has occurred, however, in code where there is no window procedure, then
you may need to exercise more control. For example, a thread established to do certain tasks
will probably need to be terminated, reporting to the main thread that it could not complete
the task.
Another major consideration is how easy it is to get the correct EIP, ESP and EBP values at
the safe-place. As we can see below, this may not be at all difficult.
There are so many possible permutations here it is probably pointless to postulate them. The
precise safe-place will depend on the nature of your code and the use you are making of
exception handling.

Example of how to get to safe-place
As an example, though, look again at the code example above in MYFUNCTION. You
can see the code label "SAFE_PLACE". This is a code address from which execution
could continue safely, the handler having done all necessary clearing up.
In the code example, in order to continue execution successfully, it must be borne in mind
that although SAFE_PLACE is within the same stack frame as the one in which the exception
occurred, the values of ESP and EBP need carefully to be set by the handler before execution
continues from EIP.
These 3 registers therefore need to be set and for the following reasons:-
- ESP - to enable the FS POP [0] instruction to work and to POP other values
if necessary
- EBP - to ensure that local data can be addressed within the handler and
to restore the correct ESP value to return from MYFUNCTION
- EIP - to cause execution to continue from SAFE_PLACE
Now you can see that each of these values is readily available from within the handler
function. The correct ESP value is, in fact, exactly the same as the top of the ERR structure
itself (given by [ESP+8h] when the handler is called). The correct EBP value is available
from ERR+14h, because this was PUSHed onto the stack when the ERR structure
was made. And the correct code address of SAFE_PLACE to give to EIP is at ERR+8h.
Now we are ready to see how the handler can ensure that execution continues from a
safe-place, instead of allowing the process to close, should an exception occur.

HANDLER: ;
PUSH EBP ;
MOV EBP,ESP ;
;** now [EBP+8]=pointer to EXCEPTION_RECORD
;** [EBP+0Ch]=pointer to ERR structure
;** [EBP+10h]=pointer to CONTEXT record
PUSH EBX,EDI,ESI ;save registers as required by windows
MOV EBX,[EBP+8] ;get exception record in ebx
TEST D[EBX+4],1h ;see if its a non-continuable exception
JNZ >L5 ;yes, so must not deal with it
TEST D[EBX+4],2h ;see if its EH_UNWINDING (from Unwind)
JZ >L2 ;no
... ;
... ;clear-up code when unwinding
... ;
JMP >L5 ;must return 1 to go to next handler
L2: ;
PUSH 0 ;return value (not used)
PUSH [EBP+8h] ;pointer to this exception record
PUSH ADDR UN23 ;code address for RtlUnwind to return
PUSH [EBP+0Ch] ;pointer to this ERR structure
CALL RtlUnwind ;
UN23: ;
MOV ESI,[EBP+10h] ;get context record in esi
MOV EDX,[EBP+0Ch] ;get pointer to ERR structure
MOV [ESI+0C4h],EDX ;use it as new esp
MOV EAX,[EDX+8] ;get safe place given in ERR structure
MOV [ESI+0B8h],EAX ;insert new eip
MOV EAX,[EDX+14h] ;get ebp at safe place given in ERR
MOV [ESI+0B4h],EAX ;insert new ebp
XOR EAX,EAX ;reload context & return to system eax=0
JMP >L6 ;
L5: ;
MOV EAX,1 ;go to next handler - return eax=1
L6: ;ordinary return (no actual arguments)
POP ESI,EDI,EBX
MOV ESP,EBP
POP EBP
RET

Repairing the exception
In the above example you saw the context being loaded with the new eip, ebp and esp to
cause execution to continue from a safe-place. It may be possible using the same method
of replacing the values for some of the registers in the context, to "repair" the exception,
permitting execution to continue from near the offending code, so that the current task
can be continued.
An obvious example would be a divide by zero, which can be repaired by the handler by
substituting the value 1 for the divisor, and then a return with EAX=0 (if a "per-thread"
handler) causing the system to reload the context and continue execution.
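
To picture the divide by zero repair, here is a toy model in Python. It is not how the system is implemented. It assumes, for the illustration only, that the divisor is held in ECX, that EDX is zero, and that the faulting DIV is simply tried again once the handler has returned 0; the function names are invented.

```python
DIVIDE_BY_ZERO = 0xC0000094


def repair_handler(exception_code, context):
    """Model per-thread handler: substitute 1 for a zero divisor."""
    if exception_code == DIVIDE_BY_ZERO and context["ecx"] == 0:
        context["ecx"] = 1
        return 0            # eax=0: reload context and continue execution
    return 1                # eax=1: go to next handler


def execute_div_ecx(context, handler):
    """Stand-in for DIV ECX, retried after the handler repairs the context."""
    while True:
        if context["ecx"] == 0:
            if handler(DIVIDE_BY_ZERO, context) != 0:
                raise RuntimeError("exception not dealt with by this handler")
            continue        # context reloaded: eip is still at the DIV
        quotient, remainder = divmod(context["eax"], context["ecx"])
        context["eax"], context["edx"] = quotient, remainder
        return context["eax"]


context = {"eax": 10, "edx": 0, "ecx": 0}
result = execute_div_ecx(context, repair_handler)
print(result, context)
```

Whether a quotient obtained with a divisor of 1 is of any use depends, of course, on what your code was trying to do.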

In the case of memory violations, you can make use of the fact that the address of the mem-
ory violation is passed as the second dword in the additional information field of the exception
record. The handler can use this very same value to pass to VirtualAlloc to commit more
memory starting at that place. If this is successful, the handler can then reload the context
(unchanged) and return EAX=0 to continue execution (in the case of a "per-thread" handler).

Continuing execution after final handler called

If you wish you can deal with exceptions in the final handler. You recall that at the begin-
ning of this article I said that the final handler is called by the system when the process is
about to be terminated.
This is true.
The returns in EAX from the final handler are not the same as those from the per-thread
handler. If the return is EAX=1 the process terminates without showing the system's clo-
sure message box, and if EAX=0 the box is shown.
However, there is also a third return code, EAX= -1 which is properly described in the
SDK as "EXCEPTION_CONTINUE_EXECUTION". This return has the same effect as
returning EAX=0 from a per-thread handler, that is, it reloads the context record into the
processor and continues execution from the eip given in the context. Of course, the final
handler may change the context record before returning to the system, in the same way
as a per-thread handler might do so. In this way the final handler can recover from the
exception by continuing execution from a suitable safe-place or it may try to repair the
exception.
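
The three returns from the final handler can be summarised as follows. This Python sketch is only a lookup of the behaviour described above. EXCEPTION_EXECUTE_HANDLER and EXCEPTION_CONTINUE_SEARCH are the SDK names for the values 1 and 0; they are not used elsewhere in this article, and the function names are made up.

```python
EXCEPTION_EXECUTE_HANDLER = 1
EXCEPTION_CONTINUE_SEARCH = 0
EXCEPTION_CONTINUE_EXECUTION = -1


def to_signed32(value):
    """EAX is a 32-bit register, so -1 is seen in a debugger as FFFFFFFFh."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value


def final_handler_outcome(eax):
    """Describe what follows a return from the final handler."""
    result = to_signed32(eax)
    if result == EXCEPTION_CONTINUE_EXECUTION:
        return "reload context and continue execution"
    if result == EXCEPTION_EXECUTE_HANDLER:
        return "terminate without the closure message box"
    if result == EXCEPTION_CONTINUE_SEARCH:
        return "terminate and show the closure message box"
    return "not a return value described here"


for eax in (0xFFFFFFFF, 1, 0):
    print(f"{eax:08X}h: {final_handler_outcome(eax)}")
```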
If you use the final handler to deal with all exceptions instead of using per-thread handlers
you do lose some flexibility, though.

Firstly, you cannot nest final handlers. You can only have one working final handler estab-
lished by SetUnhandledExceptionFilter in your code at any one time. You could, if you
wished, change the address of the final handler as different parts of your code are being pro-
cessed. SetUnhandledExceptionFilter returns the address of the final handler being replaced
so you could make use of this as follows:-
PUSH ADDR FINAL_HANDLER ;

CALL SetUnhandledExceptionFilter ;
PUSH EAX ;keep address of previous handler
... ;
... ;this is the code
... ;being guarded
... ;
CALL SetUnhandledExceptionFilter ;restore previous handler
Note here that at the time of the second call to SetUnhandledExceptionFilter the address of
the previous handler is already on the stack because of the earlier PUSH EAX instruction.
Another difficulty with using the final handler is that the information sent to it is limited to the
exception record and the context record. Therefore you will need to keep the code address of
the safe-place, and the values of ESP and EBP at that safe-place, in static memory. This can
be done easily at run time. For example, when dealing with the WM_COMMAND message
within a window procedure,
PROCESS_COMMAND: ;called on uMsg=111h (WM_COMMAND)

MOV EBPSAFE_PLACE,EBP ;keep ebp at safe-place
MOV ESPSAFE_PLACE,ESP ;keep esp at safe-place
... ;
... ;protected code here
... ;
SAFE_PLACE: ;code-label for safe-place
XOR EAX,EAX ;return eax=0=message processed
RET

In the above example, in order to repair the exception by continuing execution from the
safe-place, the handler would insert the values of EBPSAFE_PLACE at CONTEXT+0B4h
(ebp), ESPSAFE_PLACE at CONTEXT+0C4h (esp), and ADDR SAFE_PLACE into
CONTEXT+0B8h (eip) and then return -1.
Note that in a stack unwind forced by the system because of a fatal exit, only the "per-
thread" handlers (if any) and not the final handler are called. If there are no "per-thread"
handlers, the final handler would have to deal with all clearing-up itself before returning to
the system.

Single-stepping by setting the trap flag within the handler

You can make a simple single-step tester for your program while it is under development by
using the handler's ability to set the trap flag in the register context before returning to the
system. You can arrange for the handler to display the results on the screen, or to dump
them to a file. This may be useful if you suspect that results are being altered under debug-
ger control, or if you need to see quickly how a particular piece of code responds to various
inputs. Insert the following code fragment where you want single-stepping to begin:-
MOV D[SSCOUNT],5
INT 3
SSCOUNT is a data symbol and is set to the number of steps the handler should do before
returning to normal operation. The INT 3 causes a 80000003h exception, so your handler is
called.

The code in your development program should be protected by a per-thread handler
using code like this:-
SS_HANDLER:
PUSH EBP
MOV EBP,ESP
PUSH EBX,EDI,ESI             ;save registers as required by Windows
MOV EBX,[EBP+8]              ;get exception record in ebx
TEST D[EBX+4],01h            ;see if it's a non-continuable exception
JNZ >L14                     ;yes
TEST D[EBX+4],02h            ;see if EH_UNWINDING
JNZ >L14                     ;yes
MOV ESI,[EBP+10h]            ;get context record in esi
MOV EAX,[EBX]                ;get ExceptionCode
CMP EAX,80000004h            ;see if here because trap flag set
JZ >L10                      ;yes
CMP EAX,80000003h            ;see if it's own INT 3 inserted to single-step
JNZ >L14                     ;no
L10:
DEC D[SSCOUNT]               ;stop when correct number done
JZ >L12
OR D[ESI+0C0h],100h          ;set trap flag in context
L12:
...
...                          ;code here to display results to screen
...
XOR EAX,EAX                  ;eax=0 reload context and return to system
JMP >L17
L14:
MOV EAX,1                    ;eax=1 system to go to next handler
L17:
POP ESI,EDI,EBX
MOV ESP,EBP
POP EBP
RET
Here the first call to the handler is caused by the INT 3 (the system objected strongly to
the use of INT 1 when I tried it). On receipt of this exception, which could only come from
the code fragment inserted in the code-to-test, the handler sets the trap flag in the context
before returning. This causes a 80000004h exception to come back to the handler upon
the next instruction. Note that with these exceptions, eip is already at the next instruction
ie. one past the INT 3, or past the instruction executed with the trap flag set. Accordingly
all you have to do in the handler to continue single-stepping is to set the trap flag again
and return to the system.
* Thanks to G.W.Wilhelm, Jr of IBM for this idea
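
It may help to count how often the handler is entered for a given SSCOUNT. The following Python sketch models only the counting logic of SS_HANDLER. It assumes, in line with the remark above that the flag has to be set again each time, that the trap flag in the context is clear whenever the handler is entered; the function names are made up.

```python
BREAKPOINT = 0x80000003      # raised by the INT 3
SINGLE_STEP = 0x80000004     # raised after one instruction with trap flag set
TRAP_FLAG = 0x100            # bit 8 of eflags


def ss_handler(exception_code, context, state):
    """Model of SS_HANDLER: returns 0 (reload context) or 1 (next handler)."""
    if exception_code not in (BREAKPOINT, SINGLE_STEP):
        return 1
    state["sscount"] -= 1                    # DEC D[SSCOUNT]
    if state["sscount"] != 0:
        context["eflags"] |= TRAP_FLAG       # OR D[ESI+0C0h],100h
    return 0


def simulate(sscount):
    """Return the list of exception codes which reach the handler."""
    state = {"sscount": sscount}
    context = {"eflags": 0}
    seen = []
    code = BREAKPOINT                        # the first call comes from the INT 3
    while True:
        seen.append(code)
        ss_handler(code, context, state)
        if not context["eflags"] & TRAP_FLAG:
            return seen                      # normal operation resumes
        context["eflags"] &= ~TRAP_FLAG      # assumption: clear again on next entry
        code = SINGLE_STEP                   # one instruction runs, then a trap


seen = simulate(5)
print([f"{code:08X}h" for code in seen])
```

With SSCOUNT set to 5 the handler in this model is entered five times in all: once for the INT 3 and four times for single steps.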

Exception handling in multi-threaded applications

When it comes to exception handling in multi-threaded applications there is little or no help
from the system. You will need to plan for likely faults and organise your threads accordingly.
The rules applying to the exception handling provided by the system (in the context of a multi-
threaded application) are:-
1. Only one type 1 (final handler) can be in existence at any one time for each process. If
a new thread calls SetUnhandledExceptionFilter, this will simply replace the final handler -
there is no chain of final handlers as there is for the type 2 (per-thread) handlers.
Therefore the simplest way of using the final handler is still probably the best
way in a multi-threaded application - establish it in the main thread as soon as possible
after the program start point.
2. The final handler will be called by the system if the process will be terminating, regardless
of which thread caused the exception.
3. However, there will only be a final unwind (immediately prior to termination) in the per-thread
handlers established for the thread which caused the exception. Even if any
other (innocent) threads have a window and a message loop, the system will not warn
them that the process is about to terminate (no special message will be sent to them
other than usual messages arising from the loss of focus of other windows).
4. Therefore the other (innocent) threads cannot expect a final unwind if the process is to
terminate. And they will remain ignorant of the imminent termination.
5. If, as is likely, these other innocent threads will also need to clear-up on such termination
you will need to inform them from the final handler. The final handler will need to
wait until these other threads have completed clearing up before returning to the system.

6. The way in which the innocent threads are informed of the expected termination of
the program depends on the precise make-up of your code. If the innocent thread
has a window and message loop, then the final handler can use SendMessage to
that window to send an application defined message (must be 400h or above), to
inform that thread to terminate gracefully.
If there is no window and message loop, the final handler could set a public variable
flag, polled from time to time by the other thread. Alternatively you could use
SetThreadContext to force the thread to execute certain termination code, by setting
the value of eip to point to that code. This method would not work if the thread is in
an API, for example, waiting for the return from GetMessage. In that case you
would need to send a message as well, to make sure the thread returned from the
API, so that the new context is set.
7. RaiseException only works on the calling thread, so this cannot be used as a
means of communication between threads to make an innocent thread execute its
own exception handler code.
8.How does the final handler know when it may proceed after informing the other
threads that the program is about to terminate? SendMessage will not return until
the recipient has returned from its window procedure and the final handler could
wait for that return. Alternatively it could poll a flag waiting for a response from the
other thread that it has finished clearing up (note you must call the API Sleep in
the polling loop to avoid over-using the system). Or better still, the final handler
could wait until the other thread has terminated (this can be done using the API
WaitForSingleObject or WaitForMultipleObjects if there is more than one thread).
Alternatively use could be made of the Event or Semaphore APIs.
9.For an example of how these procedures could work in practice, suppose a sec-
ondary thread has the job of re-organising a database and then writing it to disk. It
may be in the middle of this task when the main thread causes an exception which
enters your final handler. Here you could either cause the secondary thread to
abort its job, by causing it to unwind and terminate gracefully, leaving the original
data on disk or alternatively you could permit it to complete the task, and then
inform the handler that it had finished so that the handler could then return to the
system. You would need to stop the secondary thread starting any further such
jobs if your handler had been called. This could be achieved by the handler setting
a flag tested by the secondary thread before it started any job, or by using the Event
APIs.

10.If communication between threads is difficult, there is another way for one thread
to access the stack of another thread, and thereby cause an unwind. This makes
use of the fact that whereas each thread has its own stack, the memory reserved
for that stack is within the address space for the process itself. You can check this
yourself if you watch a multi-threaded application using a debugger. As you move
between threads the values of ESP and EBP will change, but they are all kept
within the address space of the process itself. The value of FS will also be differ-
ent between threads and will point to the Thread Information Block for each
thread. So if you take the following steps one thread can access the stack and
cause an unwind of another:-
a. As each thread is created, record the value of its FS register in a static variable.
b. As each thread closes, it resets its static variable to zero.
c. The handler which needs to unwind other threads should take all the static
variables in turn. For each one which has a non-zero value (i.e. the thread was
running at the time of the exception), the per-thread handlers should be called
with the exception flag of 2 (EH_UNWINDING) and a user flag of, say, 400h to
show that the per-thread handler is being called by your final handler. You
cannot call a per-thread handler in a different thread using RtlUnwind (which is
thread-specific) but it can be done using the following code (where ebx holds
the address of the EXCEPTION_RECORD):-

MOV D[EBX+4],402h ;make the exception flag EH_UNWINDING + 400h

L1: ;
PUSH ES ;
MOV AX,[FS_VALUE] ;get FS value of thread to unwind
MOV ES,AX ;
ES MOV EDI,[0] ;get 1st per-thread handler address
POP ES ;
L2: ;
CMP D[EDI],-1 ;see if it's the last one
JZ >L3 ;yes, so finish
PUSH EDI,EBX ;push ERR structure, EXCEPTION_RECORD
CALL [EDI+4] ;call handler to run clear-up code
ADD ESP,8h ;remove the two parameters pushed
MOV EDI,[EDI] ;get pointer to next ERR structure
JMP L2 ;and do next if not at end
L3: ;code label when finished
;now loop back to L1 with a new FS_VALUE until all threads done
Here you see that the Thread Information Block of each innocent thread is read using the ES
register, which is temporarily given the value of the thread's FS register.
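The walk along the chain of ERR structures can be modelled in portable C. This is only a sketch of the logic, not Windows code: the structure and function names (err, exception_record, unwind_chain, demo_unwind) are made up for the illustration, the chain is an ordinary linked list rather than one read through a selector, and the last frame (whose link is -1) is skipped just as the CMP D[EDI],-1 test skips it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EH_UNWINDING  0x2u    /* system flag: handler is being called to unwind */
#define USER_FLAG_400 0x400u  /* user flag from the text: caller is the final handler */

struct err;                   /* hypothetical model of an ERR structure */
struct exception_record { uint32_t code; uint32_t flags; };
typedef int (*handler_fn)(struct exception_record *rec, struct err *frame);

struct err {
    struct err *prev;         /* +0: next handler up the chain, -1 marks the end */
    handler_fn handler;       /* +4: address of the handler routine */
};

#define CHAIN_END ((struct err *)(intptr_t)-1)

/* Call every handler in the chain except the last one, with flag 402h. */
int unwind_chain(struct err *first, struct exception_record *rec)
{
    int calls = 0;
    assert(first != NULL && rec != NULL);
    rec->flags = EH_UNWINDING | USER_FLAG_400;
    for (struct err *e = first; e != CHAIN_END && e->prev != CHAIN_END; e = e->prev) {
        e->handler(rec, e);   /* handler runs its clear-up code */
        calls++;
    }
    return calls;
}

uint32_t seen_flags;          /* flag value last seen by the demo handler */

int demo_handler(struct exception_record *rec, struct err *frame)
{
    (void)frame;
    seen_flags = rec->flags;
    return 1;                 /* 1 = go to next handler */
}

/* Build a three-frame chain and unwind it; returns the number of handlers called. */
int demo_unwind(void)
{
    struct err last = { CHAIN_END, demo_handler };
    struct err mid  = { &last, demo_handler };
    struct err top  = { &mid, demo_handler };
    struct exception_record rec = { 0xC0000094u, 0 };
    return unwind_chain(&top, &rec);
}
```

In this model demo_unwind calls the handlers of the two newest frames and leaves the oldest frame alone, which on a real thread would be the handler installed by the system.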
Instead of using FS to find the Thread Information Block you could use the following code to
get a 32-bit linear address for it. In this code LDT_ENTRY is a structure of 2 dwords, ax holds
the 16-bit selector value (FS_VALUE) to be converted and hThread is any valid thread han-
dle:-

AND EAX,0FFFFh ;
PUSH ADDR LDT_ENTRY,EAX,[hThread] ;
CALL GetThreadSelectorEntry ;
OR EAX,EAX ;see if failed
JZ >L300 ;yes so return zero
MOV EAX,ADDR LDT_ENTRY ;
MOV DH,[EAX+7] ;get base high
MOV DL,[EAX+4] ;get base mid
SHL EDX,16D ;shift to top of edx
MOV DX,[EAX+2] ;and get base low
OR EDX,EDX ;edx now=linear 32 bit address
L300: ;return nz on success
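The way the three pieces of the base address are put together can be checked with a few lines of C. This is a sketch only: descriptor_base is a hypothetical helper name, and it works on a plain array of the 8 descriptor bytes instead of calling GetThreadSelectorEntry, so it shows the bit arithmetic and nothing else.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* raw = the 8 bytes of a descriptor (the 2 dwords of LDT_ENTRY) */
uint32_t descriptor_base(const uint8_t raw[8])
{
    assert(raw != NULL);
    uint32_t base_low  = (uint32_t)raw[2] | ((uint32_t)raw[3] << 8); /* word at +2 */
    uint32_t base_mid  = raw[4];                                     /* byte at +4 */
    uint32_t base_high = raw[7];                                     /* byte at +7 */
    return base_low | (base_mid << 16) | (base_high << 24);
}
```

For example, the bytes FF 0F 00 E0 FD F3 40 7F give base low = E000h, base mid = FDh and base high = 7Fh, so the linear address is 7FFDE000h.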
The reason why it is important (using the flag 400h) to inform the handler being called that
it is being called by another thread (the final handler) is that the thread being called is still
running because the exception occurred in a different thread. The handler may well need
to suspend the thread in these circumstances, so that the clear-up job can be achieved by
the calling thread. The innocent thread would then be given a safe-place to go to before
calling ResumeThread. All this must be done before the final handler is allowed to return
to the system because on return the system will simply terminate all threads by brute
force.

Except1
This program provides a simple example of how exception handling can be used in practice
in Windows programs written in assembler. The source code is contained in Except1.asm.
This is written in GoAsm syntax. Although the program is a Windows GDI program, it only
relies on message boxes, which is why there is no message loop.
The program has two exception handlers, a final exception handler and a per-thread excep-
tion handler. The final exception handler is created first, then a procedure is called which is in
code protected by the per-thread exception handler. An exception occurs within that proce-
dure and the per-thread handler is called. Within the handler, the user is asked whether the
handler should swallow the exception or not. If the user decides to swallow the exception, the
program would be able to continue to run, but actually in this case it terminates normally. If
the user decides that the exception should not be swallowed by the handler, then the final
exception handler is called (on the way to program closure). In real life, this handler would be
responsible for completing logs and records, closing file handles, releasing memory etc. But
before the program finally finishes, something interesting happens. The system calls the per-
thread exception handler in case there is more clearing up to do in that particular stack frame
using local data. This is the system unwind. All these events are followed from the various
message boxes which appear on the screen.
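The decisions taken by the per-thread handler in Except1 can be summarised in C. This is a simplified model under stated assumptions, not the program itself: the names (err_frame, context_regs, per_thread_handler) are invented for the sketch, the user's answer is passed in as a parameter instead of coming from a message box, and the register context is reduced to the three registers the handler changes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { EH_NONCONTINUABLE = 1, EH_UNWINDING = 2 };      /* exception flags */
enum { CONTINUE_EXECUTION = 0, CONTINUE_SEARCH = 1 };  /* value returned in eax */

/* Hypothetical model of the ERR structure built by PROTECTED_AREA */
struct err_frame {
    uint32_t prev;       /* +0   previous handler in the chain */
    uint32_t handler;    /* +4   address of the handler */
    uint32_t safe_eip;   /* +8   address of the safe-place */
    uint32_t spare1;     /* +0Ch not used in Except1 */
    uint32_t spare2;     /* +10h not used in Except1 */
    uint32_t saved_ebp;  /* +14h ebp at the safe-place */
};

/* Only the registers the handler rewrites in the context record */
struct context_regs { uint32_t ebp, eip, esp; };

int per_thread_handler(uint32_t flags, int user_says_swallow, uint32_t err_addr,
                       const struct err_frame *err, struct context_regs *ctx)
{
    assert(err != NULL && ctx != NULL);
    if (flags & EH_NONCONTINUABLE)
        return CONTINUE_SEARCH;      /* not allowed to touch it */
    if (flags & EH_UNWINDING)
        return CONTINUE_SEARCH;      /* clear-up only, then carry on unwinding */
    if (!user_says_swallow)
        return CONTINUE_SEARCH;      /* exception goes on to the final handler */
    ctx->esp = err_addr;             /* new esp is the address of the ERR structure */
    ctx->eip = err->safe_eip;        /* continue from the safe-place */
    ctx->ebp = err->saved_ebp;       /* with the ebp recorded for it */
    return CONTINUE_EXECUTION;       /* system reloads the context */
}
```

Returning 0 only after the context has been changed is what lets the thread resume at the safe-place. In every other case the handler returns 1 and the exception, or the unwind, moves on to the next handler.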

Source 1: Except1 - Exception Handling
;////////////////////////////////////////////////////////////////////////////
;// //
;// EXCEPT1.ASM - source for Except1.Exe //
;// Simple Demo of Win32 structured exception handling //
;// for assembler programmers //
;// See Except2 for a more complex demo dealing with voluntary //
;// stack unwinds and multiple handler levels //
;// COPYRIGHT NOTE - this file is Copyright Jeremy Gordon 2002 //
;// [McDuck Software] //
;// - e-mail: JG@JGnet.co.uk //
;// - www.GoDevTool.com //
;// LEGAL NOTICE - The author accepts no responsibility for losses //
;// of any type arising from this file or anything wholly or in part //
;// created from it //
;// //
;////////////////////////////////////////////////////////////////////////////
;
;This program only uses Windows message boxes, which is why there is no
;message loop.
;The program has two exception handlers. The final exception handler
;is created first, then a procedure is called which has its own
;per-thread exception handler, capable of swallowing an exception.
;This it does at the option of the user.
;If the user decides to swallow the exception, the program would be able
;to continue to run, but actually in this case it terminates normally.
;If the user decides that the exception should not be swallowed by the
;handler, then the final exception handler is called (on the way to
;program closure). In real life, this handler would be responsible for
;completing logs and records, closing file handles, releasing memory etc.
;But before the program finally finishes, the system calls the per-thread
;exception handler in case there is more clearing up to do in that
;particular stack frame using local data. This is the system unwind.
;

;Written for GoAsm (Jeremy Gordon). Assemble using:-
;GoAsm except1.asm
;Link using:-
;ALINK -oPE except1.obj -entry START kernel32.lib user32.lib gdi32.lib
;(where the lib files are made using ALIB)
;*******************************************************************
;
DATA SECTION
;
;*******************************************************************
FATALMESS DB "I thoroughly enjoyed it and I have already tidied everything up - "
DB "you know, completed records, closed filehandles, "
DB "released memory, that sort of thing .."
DB "Glad this was by design - bye, bye ..",0Dh,0Ah
DB ".. but first, I expect the system will do an unwind ..",0
;******************************
;
CODE SECTION
;
CLEAR_UP: ;all clearing up would be done here
RET
;
FINAL_HANDLER: ;system passes EXCEPTION_POINTERS
PUSH EBX,EDI,ESI ;save registers as required by Windows
CALL CLEAR_UP
PUSH 40h ;exclamation sign + ok button only
PUSH "Except1 - well it's all over for now."
PUSH ADDR FATALMESS,0
CALL MessageBoxA ;wait till ok pressed
MOV EAX,1 ;terminate process without showing system message box
POP ESI,EDI,EBX
RET 4 ;remove the EXCEPTION_POINTERS parameter
;
;********************************* PROGRAM START
START:

;******** first lets make our final handler which would do all clearing up if
;******** the program has to close
PUSH ADDR FINAL_HANDLER
CALL SetUnhandledExceptionFilter
CALL PROTECTED_AREA
CALL CLEAR_UP ;here the program clears up normally
PUSH 40h ;exclamation sign + ok button only
PUSH "Except1","This is a very happy ending",0
CALL MessageBoxA ;wait till ok pressed
PUSH 0 ;code meaning a successful conclusion
CALL ExitProcess ;and finish with aplomb!
;********************************* PROGRAM END
;
PROTECTED_AREA:
PUSH EBP,0,0 ; )create the
PUSH OFFSET SAFE_PLACE ; )ERR structure
PUSH OFFSET HANDLER ; )on the
FS PUSH [0] ; )stack
FS MOV [0],ESP ;point to structure just established on the stack
;
;*********************** and now lets cause the exception ..
XOR ECX,ECX ;set ecx to zero
DIV ECX ;divide by zero, causing exception
;*********************** because of the exception the code never gets to here
;
SAFE_PLACE: ;but the handler will jump to here ..
FS POP [0] ;restore original exception handler from stack
ADD ESP,14h ;throw away remainder of ERR structure made earlier
RET
;
;This simple handler is called by the system when the divide by zero
;occurs. In this handler the user is given a choice of swallowing the
;exception by jumping to the safe-place, or not dealing with it at all,
;in which case the system will send the exception to the FINAL_HANDLER
;

HANDLER:
PUSH EBP
MOV EBP,ESP
;now [EBP+8]=EXCEPTION_RECORD, [EBP+0Ch]=ERR structure, [EBP+10h]=register context
PUSH EBX,EDI,ESI ;save registers as required by Windows
MOV EBX,[EBP+8] ;get exception record in ebx
MOV EAX,[EBX+4] ;get flag sent by the system
TEST AL,1h ;see if its a non-continuable exception
JNZ >.nodeal ;yes, so not allowed by system to touch it
TEST AL,2h ;see if its the system unwinding
JNZ >.unwind ;yes
PUSH 24h ;question mark + YES/NO buttons
PUSH 'Except1','There was an exception - do you want me to swallow it?',0
CALL MessageBoxA ;wait till button pressed
CMP EAX,6 ;see if yes clicked
JNZ >.nodeal ;no
;***************************** go to SAFE_PLACE
MOV ESI,[EBP+10h] ;get register context record in esi
MOV EDI,[EBP+0Ch] ;get pointer to ERR structure in edi
MOV [ESI+0C4h],EDI ;insert new esp (happens to be pointer to ERR)
MOV EAX,[EDI+8] ;get address of SAFE_PLACE given in ERR structure
MOV [ESI+0B8h],EAX ;insert that as new eip in register context
MOV EAX,[EDI+14h] ;get ebp at safe place given in ERR structure
MOV [ESI+0B4h],EAX ;insert that as new ebp in register context
XOR EAX,EAX ;eax=0 reload context and return to system
JMP >.fin
.unwind:
PUSH 40h ;exclamation sign + ok button only
PUSH "Except1"
PUSH "The system calling the handler again for more clearing up (unwinding)"
PUSH 0
CALL MessageBoxA ;wait till ok pressed, then return eax=1
.nodeal:
MOV EAX,1 ;eax=1 system to go to next handler
.fin:
POP ESI,EDI,EBX ;restore registers saved on entry
MOV ESP,EBP
POP EBP
RET
;
;

Except2
This is a more complex program which is intended to demonstrate in more detail the con-
tents of this article.
The source code for Except2.Exe (Except2.asm and Except2.RC) is also provided and
again it is in GoAsm syntax.
The main window is actually a modal dialog. A final handler is set up very early in the pro-
cess. When the "Cause Exception" button is clicked, first the dialog procedure is called
with the command, then 2 further routines are called, the third routine causing an excep-
tion of the type chosen by the radiobuttons. As execution passes through this code, 3 per-
thread exception handlers are created.

The exception is either repaired in situ if possible, or the program recovers in the chosen han-
dler from a safe-place. If the exception is allowed to go to the final handler you can either exit
by pressing F3 or F5, or if you press F7 the final handler will try to recover from the exception.
You can follow events as they occur because each handler displays various messages in the
listbox. There is a slight delay between each message so that you can follow more easily
what is happening, or you can scroll the messages to get them back into view.
When the program is about to terminate, something interesting happens. The system causes
a final unwind with the exception flag set to 2h. The messages sent to the listbox are slowed
down even further because the program will be terminating soon!
You will see that the same type of unwind occurs if you specify that execution should continue
from a "safe-place" or if F7 is pressed from the final handler. This unwind is initiated by the
handler itself.

Source 2: Except2 - Complex Exception Handling
;////////////////////////////////////////////////////////////////////////////
;// //
;// EXCEPT2.ASM - source for Except2.Exe //
;// Complex Demo of Win32 structured exception handling //
;// for assembler programmers //
;// See Except1.asm for a simple demo! //
;// COPYRIGHT NOTE - this file is Copyright Jeremy Gordon 1996-2002 //
;// [McDuck Software] //
;// - e-mail: JG@JGnet.co.uk //
;// - www.GoDevTool.com //
;// LEGAL NOTICE - The author accepts no responsibility for losses //
;// of any type arising from this file or anything wholly or in part //
;// created from it //
;// //
;////////////////////////////////////////////////////////////////////////////
;
;The program uses a modal dialog box as its main window, which is why
;there is no message loop (this is dealt with by the system itself)
;A dialog box is created and the user has the choice of exceptions to choose
;from. The exception can be dealt with in handlers 1, 2 or 3; if it would
;normally cause program exit, it goes to the final handler.
;If it is repaired, this can be done either by returning to the place
;of exception or to a safe-place.
;As a final luxury the final handler may also try to recover from the
;exception, unwinding the stack first of course.
;If you decide to let the system deal with the exception, the system then
;unwinds the stack in exactly the same way as the handler does if the
;program is to try to continue running.
;
;Written for GoAsm (Jeremy Gordon). Assemble using:-
;GoAsm except2.asm
;Resources (dialogs, version and bitmap) compiled using GoRC (Jeremy Gordon) use:-
;GoRC except2.rc
;Link using:-
;ALINK -oPE except2.obj except2.res -entry START kernel32.lib user32.lib gdi32.lib
;(where the lib files are made using ALIB)
;*******************************************************************
;
DATA SECTION
;
;*******************************************************************
MSG DD 7 DUP 0 ;hWnd, +4=message, +8=wParam, +C=lParam, +10h=time, +14h/18h=pt
RECT DD 4 DUP 0 ;rectangle - left, +4 top, +8 right, +0Ch bottom
;****************************** some dwords
lpArguments DD 2 DUP 0 ;holds data when RaiseException called
flOldProtect DD 0 ;holds previous code section access protection
hHeap DD 0 ;handle to temporary memory areas
hList DD 0 ;handle to listbox
hDC DD 0 ;handle to device context of listbox
hCombo DD 0 ;handle to combo box
hInst DD 0 ;handle to main process
CINDEX DD 0 ;index of combobox selection
COUNT DD 0 ;used in getting a random number
MESSDELAY DD 100h ;length of time to keep message on the screen
EBPSAFE_PLACE3 DD 0 ;these are kept solely for
ESPSAFE_PLACE3 DD 0 ;repair by final handler
;******************************* non-doublewords follow
EXC_TYPE DB 0 ;radio button exception type chosen
HANDLER DB 0 ;the handler to repair the exception
CONTINUE DB 0 ;1=continue from handler safe-place
HANDLERFLAG DB 0 ;1=read/write message is new
;2=final handler unwind
;********************************* and some strings
BYETEXT DB 'Have an exceptional day!',0
;********************** combo box messages
COMBO_STRING1 DB 'Deal with the exception in handler ',0
COMBO_STRING3 DB 'Allow exception to go to final handler',0

;********************** exception messages

EXC_MESS0 DB 'Reading from h ... ',0 ;spaces at end to get rub-out
EXC_MESS1 DB 'Writing to h ... ',0 ;spaces at end to get rub-out
EXC_MESS2 DB 'ExceptionCode h now in handler :',0
EXC_MESS3 DB 'Attempting local repair (no unwind)',0
EXC_MESS4 DB 'Repair appears successful',0
EXC_MESS5 DB ' Flag= h (continuable exception)',0
EXC_MESS5A DB ' Flag= h (non-continuable exception)',0
EXC_MESS5B DB ' Flag= h (unwinding)',0
EXC_MESS5C DB ' Local data= h',0
EXC_MESS6 DB 'Handler cannot repair this exception',0
EXC_MESS7 DB 'Memory write error at h',0
EXC_MESS8 DB 'Memory read error at h',0
EXC_MESS9 DB 'Attempt to corrupt code at h',0
EXC_MESS10 DB 'ExceptionCode h in final handler',0
EXC_MESS11 DB 'Handler clear-up code',0
EXC_MESS11A DB 'Handler clear-up code - byebye ........',0
EXC_MESS12 DB 'Ready to do voluntary stack unwind',0
EXC_MESS13 DB ' Exception at eip= h',0
EXC_MESS14 DB 'Hello from safe-place #2!',0
EXC_MESS17 DB 'Key F3=polite end; F5=nasty end; F7=recover',0
EXC_MESS18 DB 'Closing memory heap and dc',0
EXC_MESS19 DB 'There will be an exception in 3rd routine',0
EXC_MESS20 DB ' (protected by handler 3)',0
EXC_MESS21 DB 'Now system will unwind and call ExitProcess ...',0
EXC_MESS22 DB 'Code at h caused an exception',0
EXC_MESS23 DB 'Now for own unwind then get to safe-place ...',0
EXC_MESS24 DB 'Hello from final handler in safe-place #3!',0
;
;*********************** for HEXWRITE
sHEXb DB '0123456789ABCDEF'
;
;*******************************************************************

;* CODE
;*******************************************************************
CODE SECTION
;
CODESTART: ;label for code corruption test
;
HEXWRITE: ;write hex number from eax into [esi] (8 digits, esi advanced by 8)
PUSH EAX,EBX,ECX,EDX
MOV EBX,ADDR sHEXb
MOV ECX,8 ;eight nibbles in a dword
1:
ROL EAX,4 ;get next high order nibble into al
MOV DL,AL
AND EDX,0Fh ;use only least sig nibble
MOV DL,[EBX+EDX] ;get the hex character for it
MOV [ESI],DL ;write the nibble
INC ESI ;ready for next
LOOP 1 ;until all eight nibbles are written
POP EDX,ECX,EBX,EAX
RET
;
ADD_LISTBOXSTRING: ;add a string to listbox, scrolling if required
PUSH EDX,0,180h,[hList] ;LB_ADDSTRING (address in edx)
CALL SendMessageA
PUSH EAX ;keep item index
DEC EAX ;index now one smaller
PUSH 0,EAX ;string to ensure visible
PUSH 197h,[hList] ;LB_SETTOPINDEX
CALL SendMessageA ;scroll listbox now to show string just inserted
PUSH [hList]
CALL UpdateWindow
POP EAX ;restore item index
RET
;
WRITE_LISTBOXLINE: ;write the string in edx to listbox
PUSH EAX
;**************************
CALL ADD_LISTBOXSTRING ;write to listbox
PUSH [MESSDELAY] ;256 milliseconds at start
CALL Sleep ;delay for a while
;**************************
POP EAX
RET
;
WRITE_MEM_ERROR:
PUSH EBX
MOV EDX,ADDR EXC_MESS7 ;correct message if write error
CMP D[EBX+14h],1 ;see if write error flag from 1st part of array
JZ >0 ;yes (write=1, read=0)
MOV EDX,ADDR EXC_MESS8 ;correct message if read error
0:
MOV EAX,[EBX+18h] ;get 2nd part of array (inaccessible address)
MOV ESI,EDX
ADD ESI,22D
CALL HEXWRITE ;write address into message
CALL WRITE_LISTBOXLINE ;write the string in edx to listbox
OR B[HANDLERFLAG],1 ;ensure that read/write message is written into listbox
POP EBX
RET
;
WCE23: ;write memory read/write number into message
PUSH ESI
MOV ESI,EBX
CALL HEXWRITE ;write memory read/write number into message at esi
POP ESI
RET

;
WRITE_CURRENT_EDI: ;correct message in esi
PUSH ECX,EDI
MOV EDX,ADDR EXC_MESS0 ;read message
MOV EBX,13D
CMP B[EXC_TYPE],104D ;see if read test
JZ >1 ;yes
SUB EBX,2
MOV EDX,ADDR EXC_MESS1 ;write message
1:
MOV ESI,EDX ;keep correct message in esi
ADD EBX,EDX ;and correct write-place in ebx
TEST B[HANDLERFLAG],1 ;see if first read/write message
JZ >2 ;no
;************ drawtext is used because it is much quicker than lb_insertstring
;************ insert eventual item in listbox but write over it for now
MOV EAX,EDI ;this message will be displayed at end of test
ADD EAX,1000h ;so ensure it shows correct place of exception occurrence
CALL WCE23 ;write memory read/write number into message
MOV EDX,ESI
CALL ADD_LISTBOXSTRING ;write item to listbox, returning index in eax
PUSH ADDR RECT,EAX ;index of last string written (wParam)
PUSH 198h,[hList] ;LB_GETITEMRECT
CALL SendMessageA ;get client co-ordinates in RECT for string just written
ADD D[RECT],2 ;allow for lhs border
AND B[HANDLERFLAG],0FEh ;don't come here again
2:
MOV EAX,EDI
CALL WCE23 ;write memory read/write number into message
;*********************
PUSH 100h,ADDR RECT ;no clipping
PUSH -1,ESI,[hDC] ;-1=system to count length
CALL DrawTextA
;*********************
POP EDI,ECX

RET
;
WRITE_WHICHADDRESS: ;eax=code address
MOV ESI,ADDR EXC_MESS22
MOV EDX,ESI
ADD ESI,8
CALL HEXWRITE ;write code address into message
RET
;
WRITE_HANDLERDATA: ;eax=exception no., ebx=record, dl=handler no.
PUSH EAX,ESI,EDX
CMP DL,4 ;see if final handler
PUSHFD ;keep flag
JZ >3 ;yes
ADD DL,48D ;convert handler number to ascii char
MOV [ESI+39D],DL ;write the handler number
3:
MOV EDX,ESI ;keep correct message
ADD ESI,14D
CALL HEXWRITE ;write exception number into message
MOV EAX,[EBX+4] ;get exception flag
MOV ESI,ADDR EXC_MESS5 ;continuable
CMP EAX,1
JB >4
MOV ESI,ADDR EXC_MESS5A ;non-continuable
JZ >4
MOV ESI,ADDR EXC_MESS5B ;unwind
4:
MOV EDX,ESI ;keep for WRITE_LISTBOXLINE later
ADD ESI,13D
CALL HEXWRITE ;write exception flag into message


POPFD ;restore flag
JZ >5 ;final handler so don't show local data address
MOV ESI,ADDR EXC_MESS5C
MOV EDX,ESI ;keep for WRITE_LISTBOXLINE later
ADD ESI,19D
MOV EAX,[EBP+0Ch] ;get pointer to ERR structure
CALL HEXWRITE ;write as address of local data
5:
POP EDX,ESI,EAX
RET
;
CLEARUPCODE_MESS: ;handler in edx
CMP DL,1 ;see if handler 1
JNZ >6
TEST B[HANDLERFLAG],2 ;see if final handler doing unwind, though
JNZ >6 ;yes, so do ordinary message
MOV D[MESSDELAY],3000D ;3 seconds
MOV ESI,ADDR EXC_MESS11A
6:
ADD DL,48D ;convert handler number to ascii char
MOV [ESI+8D],DL ;write the handler number into message
MOV EDX,ESI ;keep correct message
RET
;
ADD_STRING:
PUSH ESI,0,143h,[hCombo] ;CB_ADDSTRING (uMsg), handle to combobox
CALL SendMessageA
RET
;
INITIALISE_CONTROLS:
MOV ECX,[EBP+14h] ;get dialog id sent to DialogBoxParamA (lParam)
JCXZ >1 ;it's main dialog
RET ;it must be "about" dialog
1:
;************************* initialise the radio buttons
PUSH 108D ;button to select
PUSH 109D,104D ;last,first in group
PUSH [EBP+8] ;hdlg
CALL CheckRadioButton
;************************* now initialise 2nd lot of radio buttons
PUSH 1 ;indicate check
PUSH 111D ;identifier
PUSH [EBP+8] ;hdlg
CALL CheckDlgButton
;************************* now initialise the list and combo box
PUSH 113D,[EBP+8] ;list box identifier
CALL GetDlgItem ;get list box handle
MOV [hList],EAX ;keep it
PUSH 110D,[EBP+8] ;combo box identifier
CALL GetDlgItem ;get combo box handle
MOV [hCombo],EAX ;keep it
MOV BL,'1' ;handler number to add to message
MOV ESI,ADDR COMBO_STRING1
2:
MOV [ESI+35D],BL ;insert number into message
CALL ADD_STRING
INC BL
CMP BL,'4' ;see if at last message
JNZ 2
MOV [CINDEX],EAX ;keep the selection for later use
PUSH 0,EAX,14Eh,[hCombo] ;CB_SETCURSEL, handle to combobox
CALL SendMessageA
MOV ESI,ADDR COMBO_STRING3
CALL ADD_STRING ;no repair message
RET
;

GET_EXC_TYPE: ;get the chosen exception type
MOV EBX,104D
MOV ESI,6 ;number to do
3:
PUSH EBX,[EBP+8] ;button identifier, hdlg
CALL IsDlgButtonChecked
CMP AL,1 ;see if button is checked
JZ >4 ;yes
INC EBX
DEC ESI
JNZ 3
4:
MOV [EXC_TYPE],BL ;keep type for later tests
RET
;
;***************************************************** PROGRAM START
START:
PUSH 0
CALL GetModuleHandleA
MOV [hInst],EAX
;**************************** establish a handler for the final exit
PUSH ADDR FINAL_HANDLER
CALL SetUnhandledExceptionFilter
;****************************** now create the dialog box
PUSH 0,ADDR DlgProc ;pointer to dialog procedure (param=0=main dialog)
PUSH 0 ;this dialog is the main window (no parent)
PUSH 'MainDialog' ;name of dialog in resource file
PUSH [hInst]
CALL DialogBoxParamA ;this does not return until dialog closed
PUSH 0 ;exit code zero=success if finishes this way
CALL ExitProcess
;****************************************************** PROGRAM END
;
PROCESS_COMMAND: ;called if WM_COMMAND (eax holds wParam)
CMP EAX,99D ;see if "about" clicked
JNZ >0 ;no
PUSH 1,ADDR DlgProc,[EBP+8h] ;param=1
PUSH 'About'
PUSH [hInst]
CALL DialogBoxParamA ;create about dialog, borrowing main dlgproc
RET
0:
CMP EAX,101D ;see if it was "cause exception" button
JZ >1 ;yes
RET
;************************************************* CAUSE EXCEPTION WAS CLICKED
1:
CALL GET_EXC_TYPE ;get the chosen exception type
;************************* next see if check button is checked
PUSH 112D,[EBP+8] ;identifier of safe-place radiobutton
CALL IsDlgButtonChecked
MOV [CONTINUE],AL ;keep this 1=continue from safe-place
;************************* now get the combo box selection
PUSH 0,0,147h ;CB_GETCURSEL (uMsg)
PUSH [hCombo] ;handle to combobox
CALL SendMessageA ;get current selection
INC AL ;handler 1 now = 1
MOV [HANDLER],AL
;***************** clear the listbox
PUSH 0,0,184h ;LB_RESETCONTENT
PUSH [hList] ;handle to listbox
CALL SendMessageA
CALL SECOND_ROUTINE ;run until exception and repair
RET
;
;******************************************************* DIALOG PROCEDURE
;******* The about dialog also comes here, but no static data is re-used
;******* apart from COUNT
DlgProc:
;
PUSH EBP
MOV EBP,ESP
;now [EBP+8]=hDlg, [EBP+0Ch]=uMsg, [EBP+10h]=wParam, [EBP+14h]=lParam
;************************************** create area for local data
SUB ESP,40h ;make space of 16 dwords on stack for local data
;now addressable as [EBP-4] to [EBP-40h]
;************************************** save registers as required by Windows
PUSH EBX,EDI,ESI
;************************************** install handler_1 and its ERR structure
PUSH EBP ;ERR+14h save ebp (being ebp at safe-place1)
PUSH ADDR EXC_MESS16 ;ERR+0Ch safe place 1 message
PUSH ADDR SAFE_PLACE1 ;ERR+8h place for new eip
PUSH ADDR HANDLER_1 ;ERR+4h address of handler routine
FS PUSH [0] ;ERR+0h keep next handler up the chain
;**************************************
INC D[COUNT] ;used in getting a random number
MOV EAX,[EBP+0Ch] ;get uMsg
CMP EAX,136h ;see if WM_CTLCOLORDLG
JZ >3 ;yes
CMP EAX,135h ;see if WM_CTLCOLORBTN
JZ >2 ;yes
CMP EAX,138h ;see if WM_CTLCOLORSTATIC
JNZ >4 ;no
PUSH 120D,[EBP+8]
CALL GetDlgItem ;get control 120 handle
CMP EAX,[EBP+14h] ;see if its the static control for bitmap frame
JZ LONG >8 ;must be kept white
2:
PUSH 1,[EBP+10h] ;1=transparent, wParam
CALL SetBkMode
3:
PUSH 00808040h ;blue colour from default palette
CALL CreateSolidBrush ;create brush as an object with handle in EAX

JMP LONG >9 ;return with the brush handle (deleted on program exit)
4: ;this is needed because dialog=main window (no IDCANCEL)
CMP EAX,110h ;see if WM_INITDIALOG
JNZ >5 ;no
CALL INITIALISE_CONTROLS
JMP >.nonzero ;return non-zero
5:
CMP EAX,10h ;see if WM_CLOSE (sent if sysmenu clicked)
JZ >6 ;yes, so say goodbye and finish
CMP EAX,111h ;see if WM_COMMAND
JNZ >8 ;no
TEST B[HANDLERFLAG],2 ;see if in final handler
JNZ >8 ;yes so ignore command messages
MOV EAX,[EBP+10h] ;wParam
CMP EAX,102D ;see if it was quit button
JZ >6 ;yes, so say goodbye and finish
CMP EAX,100D ;see if "about" OK button
JZ >7 ;yes so remove about dialog
CALL PROCESS_COMMAND
JMP >.nonzero
6:
TEST B[HANDLERFLAG],2 ;see if in final handler
JNZ >8 ;yes so ignore quit/close messages
MOV D[MESSDELAY],1000D ;one second delay
MOV EDX,ADDR BYETEXT ;write "Have an exceptional day!"
7:
PUSH 0,[EBP+8]
CALL EndDialog ;end dialog
.nonzero
MOV EAX,1 ;return non-zero (TRUE=message processed)
JMP >9
;****************************************************** HANDLER SAFE-PLACE 1
SAFE_PLACE1: ;esp/ebp already set to correct values by handler
CALL WRITE_LISTBOXLINE ;write the string in edx to listbox tell user reached here

8:
XOR EAX,EAX ;return zero (FALSE=message not processed)
9:
POP ESI,EDI,EBX
MOV ESP,EBP
POP EBP
RET 10h ;automatically does epilogue code to close stack frame
;
ATTEMPT_CORRUPTION: ;attempt code corruption in random place
MOV ESI,ADDR CODESTART
MOV EDI,ADDR CODEEND
SUB EDI,ESI ;get how many bytes in the routine
;*****************************
;Note that it is possible the code section has a write attribute from its
;own PE file, so first ensure that this is removed ..
PUSH ADDR flOldProtect
PUSH 20h ;PAGE_EXECUTE_READ
PUSH EDI,ESI ;size, start
CALL VirtualProtect
OR EAX,EAX ;check for success
JZ >.fin ;no, so too dangerous to do the test
;***************************** get a random number no higher than edi
XOR EBX,EBX
7:
STC
RCL EBX,1
CMP EDI,EBX ;find how many bits may be looked at
JNB 7
8:
CALL GetTickCount ;get count since Windows started now
MOV EDX,EAX ;keep whole tick count
SUB EAX,[COUNT] ;add another random element
MOV ECX,200D

9:
AND EAX,EBX ;only look at correct number of bits
CMP EDI,EAX ;see if number is now too high
JNB >10 ;no
ROR EDX,5 ;rotate edx 5 times
ADD EAX,EDX ;add extra random element
LOOP 9 ;try again 200 times
JMP 8 ;try again with another tick count
10:
;*********** number now in eax
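ADD ESI,EAX ;get to address to corrupt
PUSH ESI ;keep it
MOV EAX,ESI ;get number to write in eax
MOV ESI,ADDR EXC_MESS9 ;"Attempt to corrupt code at h"
MOV EDX,ESI ;keep for WRITE_LISTBOXLINE
ADD ESI,27D ;get to place for the address in the message
CALL HEXWRITE ;write address to corrupt into message
CALL WRITE_LISTBOXLINE ;write the string in edx to listbox
POP ESI ;restore address to corrupt
MOV B[ESI],90h ;attempt to corrupt code (causes exception)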
.fin
RET
;
MEM_TEST: ;its a memory read/write exception
OR B[HANDLERFLAG],1 ;ensure read/write message is written to listbox
;******** get device context and set up correct font and colour
PUSH [hList]
CALL GetDC
MOV [hDC],EAX ;keep handle of device context of listbox
PUSH 0,0,31h,[hList] ;WM_GETFONT
CALL SendMessageA ;get listbox font
PUSH EAX,[hDC]
CALL SelectObject ;use this font in the dc
PUSH 0FF0000h,[hDC] ;nice blue colour
CALL SetTextColor

;**************************************************************
OR BL,BL ;see if write test
JZ >22 ;yes
;******************************** now for the read test
PUSH 0,1000h,0 ;make "growable" memory, 4K for immediate use
CALL HeapCreate
MOV EDI,EAX
MOV [hHeap],EAX ;keep heap address
MOV ECX,2001h ;ready to read from 8K +1
20:
MOV AL,[EDI] ;read into al
CMP ECX,1 ;unless the last (handler returns to here for last one)
JZ >21 ;listbox message already written
CALL WRITE_CURRENT_EDI ;show user current position
21:
INC EDI
LOOP 20 ;continue so as to cause exception
PUSH [hHeap]
CALL HeapDestroy
JMP >25
;******************************** now for the write test
22:
PUSH 4h ;read & write access
PUSH 2000h ;MEM_RESERVE
PUSH 10000h ;64K
PUSH 0 ;system to decide address
CALL VirtualAlloc
MOV [hHeap],EAX
PUSH 4h ;read & write access
PUSH 1000h ;MEM_COMMIT
PUSH 1000h ;4K
PUSH [hHeap]
CALL VirtualAlloc
MOV EDI,EAX ;base address of allocated 4K
MOV ECX,2001h ;ready to write 8K + 1 byte

23:
MOV B[EDI],'X'
CMP ECX,1 ;unless the last (handler returns to here for last one)
JZ >24 ;listbox message already written
CALL WRITE_CURRENT_EDI ;show user current position
24:
INC EDI
LOOP 23 ;continue so as to cause exception
PUSH 4000h,0,[hHeap] ;MEM_DECOMMIT
CALL VirtualFree ;decommit memory used
PUSH 8000h,0,[hHeap] ;MEM_RELEASE
CALL VirtualFree ;free memory used
25:
;**************************** release the device context
PUSH [hDC],[hList]
CALL ReleaseDC
RET
;
ERROR_ROUTINE: ;the exception will occur in this routine
XOR EBX,EBX
MOV BL,[EXC_TYPE] ;get exception type again
SUB EBX,105D ;see if memory read/write test
JA >30 ;no
CALL MEM_TEST
RET
30:
;*********************** own software exception
DEC EBX ;see if should do own (continuable) software exception
JZ >31 ;yes
CMP EBX,1 ;see if should do own (non-continuable) software exception
JNZ >32 ;no
31: ;0=continuable exception, 1=non-continuable exception
MOV EAX,ADDR AVOID ;get place to restart from
MOV [lpArguments],EAX ;keep in array in memory
MOV [lpArguments+4],ESP ;keep esp too

PUSH ADDR lpArguments ;give array to function

PUSH 2 ;number of arguments in array
PUSH EBX ;continuable or non-continuable exception flag
PUSH 0E0000100h ;exception code
CALL RaiseException
AVOID:
RET
32:
DEC EBX,EBX ;see if divide by zero
JNZ >33 ;no
;*********************** divide by zero exception
XOR ECX,ECX
MOV EAX,66D
DIV CL ;divide by zero to create exception
RET
33: ;must be attempt to corrupt code test
CALL ATTEMPT_CORRUPTION ;attempt code corruption in random place in code
RET
;
THIRD_ROUTINE:
;**************************************
MOV [EBPSAFE_PLACE3],EBP ;these are kept solely for
MOV [ESPSAFE_PLACE3],ESP ;repair by final handler
;**************************************
MOV EDX,ADDR EXC_MESS19 ;"exception will occur in level 3 code"
MOV EDX,ADDR EXC_MESS20 ;"(protected by exception handler 3)"


CALL ERROR_ROUTINE ;exception will be caused by this routine
JMP >4
;************************************** here is the safe place & code
4:
ADD ESP,14h ;throw away handler_3
RET
;
SECOND_ROUTINE:
;**************************************
CALL THIRD_ROUTINE
JMP >5
;************************************** here is the safe place & code
5:
RET
;
;************ here is the routine to "unwind" the stack and go to safe-place
TRYFOR_SAFEPLACE: ;EAX=exception
CMP EAX,0C0000005h ;see if memory read/write exception
JNZ >6 ;no

CALL WRITE_MEM_ERROR ;write type and place of error

6:
MOV EDX,ADDR EXC_MESS12
CALL WRITE_LISTBOXLINE ;write "Ready to do voluntary stack unwind"
;*** now carry out own unwind for other handlers to clear-up using local data
;*** here is the call to the only recently documented API function RtlUnwind
PUSH 0 ;return value (not needed)
PUSH [EBP+8] ;send exception_record to per-thread handlers
PUSH ADDR UN23 ;return address
PUSH [EBP+0Ch] ;pointer to this ERR structure
CALL RtlUnwind
UN23:
;***************************** now change context to suit safe place
;***************************** current context has values as at the exception
MOV [ESI+0C4h],EDX ;insert new esp (happens to be pointer to ERR)
MOV EAX,[EDX+8] ;get safe place given in ERR structure
MOV EAX,[EDX+0Ch] ;get message address in eax
MOV [ESI+0A8h],EAX ;insert new edx
MOV EAX,[EDX+14h] ;get ebp at safe place given in ERR structure
RET
;***************** here is the routine to try repair an exception
ATTEMPT_LOCAL_REPAIR: ;EAX=exception, EBX=exception record
CALL WRITE_LISTBOXLINE ;write "Attempting local repair (no unwind)" (saves eax)
CMP EAX,0E0000100h ;see if own software exception
JZ >11 ;yes
CMP EAX,0C0000094h ;see if divide by zero exception
JZ >9 ;yes
CMP EAX,0C0000005h ;see if memory read/write exception
JNZ >10 ;no
CMP B[EXC_TYPE],104D ;see if memory test

JZ >7 ;yes
JNZ >10 ;no
7:
CALL WRITE_MEM_ERROR ;write type and place of error
;************** read from memory error - the following will work
PUSH 1000h ;allocate another 4K
PUSH 4 ;HEAP_GENERATE_EXCEPTIONS on error=another exception
PUSH [hHeap] ;normally get this from handler structure
CALL HeapAlloc ;allocate another 4K
OR EAX,EAX ;see if error
JZ >10 ;yes
JMP >12
;******** the above did not work for write error because memory has already
;been written to during exception and is therefore "corrupt". You get a
;C0000005h access violation. The way round this is to use the virtual alloc
;function which will permit you to specify the starting place for the new
;memory allocation (which is the same as inaccessible address):-
8:
PUSH 4 ;read and write access
PUSH 1000h ;commit more memory
PUSH 1000h ;another 4K required
PUSH [EBX+18h] ;inaccessible address sent as 2nd part of array
CALL VirtualAlloc ;add another 4K using inaccessible address as base
OR EAX,EAX ;see if error
JZ >10 ;yes
JMP >12
;********************************
9: ;it's a divide by zero exception
MOV D[ESI+0ACh],1D ;replace ecx with 1 to ensure div by 1 next time
JMP >12
10: ;error or unexpected exception return


CALL WRITE_LISTBOXLINE ;write "Handler cannot repair this exception"
STC
RET
11: ;it's an own software exception
MOV EAX,[EDX+14h] ;get ebp at safe place given in ERR structure
MOV [ESI+0B4h],EAX ;insert new ebp in context
MOV EAX,[EBX+14h] ;get from exception record the address to jump to
MOV [ESI+0B8h],EAX ;change eip in context
MOV EAX,[EBX+18h] ;get from exception record the 2nd part of array
MOV [ESI+0C4h],EAX ;which is the ESP at repair place
12:
CALL WRITE_LISTBOXLINE ;write "repair appears successful"
CLC
RET ;return nc on success, c on failure
;
HEAP_CLOSE:
JZ >20 ;yes
JNZ >23 ;no
20:
CALL WRITE_LISTBOXLINE ;write "Closing memory heap and dc"
PUSH [hHeap]
CALL HeapDestroy
JMP >22
21:
PUSH 4000h,0,[hHeap] ;MEM_DECOMMIT
CALL VirtualFree ;decommit memory used

PUSH 8000h,0,[hHeap] ;MEM_RELEASE

CALL VirtualFree
22:
PUSH [hDC],[hList]
CALL ReleaseDC
23:
RET
;
HANDLER_3: ;handler 3
PUSH EBP
MOV EBP,ESP
JNZ >30 ;yes, so exception address is not useful here
MOV EAX,[EBX+0Ch] ;get ExceptionAddress
CALL WRITE_WHICHADDRESS
30:
MOV DL,3 ;indicate 3rd handler
CALL WRITE_HANDLERDATA ;saves edx
JNZ >34 ;yes
JZ >31 ;no
CALL CLEARUPCODE_MESS
CALL HEAP_CLOSE ;close the memory heap and dc if memory test
JMP >34 ;must return 1 to go to next handler
31:
CMP [HANDLER],DL ;see if this handler allowed to deal
JNZ >34 ;no
CMP B[CONTINUE],1 ;see if 1=continue from safe-place
JNZ >32 ;no so deal with exception locally
CALL TRYFOR_SAFEPLACE
JMP >33

32:
CALL ATTEMPT_LOCAL_REPAIR
JNC >33 ;success
33:
XOR EAX,EAX ;reload context and return to system
JMP >35
34:
MOV EAX,1 ;this handler will not deal with this exception
35:
POP ESI,EDI,EBX
MOV ESP,EBP
POP EBP
RET ;ordinary return because was a "C" type call not PASCAL
;
HANDLER_2: ;second handler
PUSH EBP
MOV EBP,ESP
MOV DL,2 ;indicate 2nd handler
JNZ >43 ;yes
JZ >40 ;no
40:
JNZ >43 ;no

JMP >42
41:
JNC >42 ;success
42:
XOR EAX,EAX ;exception was repaired - reload context and try again
JMP >44
43:
MOV EAX,1 ;this handler will not deal with this exception
44:
POP ESI,EDI,EBX
MOV ESP,EBP
POP EBP
;
HANDLER_1:
PUSH EBP
MOV EBP,ESP
MOV DL,1 ;indicate 1st handler
JNZ >53 ;yes
JZ >50 ;no
50:
JNZ >53 ;no

JMP >52
51:
JNC >52 ;success
52:
XOR EAX,EAX ;reload context and return to system
JMP >54
53:
MOV EAX,1 ;go to next handler
54:
POP ESI,EDI,EBX
MOV ESP,EBP
POP EBP
;
FINAL_HANDLER_RECOVERY: ;ebx=exception record, esi=context
MOV EDX,ADDR EXC_MESS23 ;will now do voluntary unwind and safe-place
;
;-- DO NOT REMOVE ---------------- the following unwind systems are alternative
;************* the final handler does not know the last ERR structure
;************* so find it
;FS MOV EAX,[0] ;get pointer to very first ERR structure
;L880:
;CMP D[EAX],-1 ;see if the last one
;JZ >L881 ;yes, so finish
;MOV EAX,[EAX] ;get pointer to next ERR structure
;JMP L880
;L881:
;PUSH ESI ;cannot rely on RtlUnwind to keep this (context)
;;**********************
;PUSH 0 ;return value (not used)
;PUSH EBX ;send exception_record to per-thread handlers

;PUSH ADDR UN25 ;return address

;PUSH EAX ;pointer to last unwind frame
;CALL RtlUnwind
;UN25:
;;**********************
;POP ESI
;JMP >61
;-- DO NOT REMOVE --------------------------------------------------------
;
;********************************** trying own unwind in final handler
MOV D[EBX+4],02h ;indicate eh_unwinding flag for termination code
FS MOV EDI,[0] ;get pointer to very first ERR structure
60:
CMP D[EDI],-1 ;see if the last one
JZ >61 ;yes, so finish
PUSH EDI,EBX ;push ERR structure,exception record
CALL [EDI+4] ;call the associated handler to run clear-up code
ADD ESP,8h ;remove parameters put on the stack
JMP 60
61:
;*******************************************************************
MOV EAX,[EBPSAFE_PLACE3] ;kept earlier in third_routine
MOV EAX,[ESPSAFE_PLACE3] ;in case of this repair
MOV [ESI+0C4h],EAX ;insert new esp
MOV EAX,ADDR SAFE_PLACE3
MOV EAX,ADDR EXC_MESS24 ;hello from safe-place 3 message
MOV [ESI+0A8h],EAX ;insert new edx
RET
;
;*********************** now if exception reached this point it is serious
FINAL_HANDLER: ;this time the system passes only the pointer
MOV EDX,[ESP+4] ;to EXCEPTION_POINTERS - get it in edx


OR B[HANDLERFLAG],2 ;flag that in final handler
;************************** see EXCEPTION_POINTERS structure
MOV ESI,[EDX+4] ;get context record in esi
MOV EBX,[EDX] ;get pointer to Exception Record
MOV EAX,[EBX] ;get exception code
MOV DL,4 ;indicate final handler
CALL WRITE_HANDLERDATA ;saves esi, ebx
MOV EAX,[ESI+0B8h] ;get eip from context
PUSH ESI ;keep context
MOV ESI,ADDR EXC_MESS13 ;Exception at eip= h
MOV EDX,ESI
ADD ESI,25D
CALL HEXWRITE
CALL ADD_LISTBOXSTRING ;write the string in edx to listbox
MOV EDX,ADDR EXC_MESS17 ;"Press F3=polite end, F5=nasty end, F7=recover!"
CALL ADD_LISTBOXSTRING ;write the string in edx to listbox
POP ESI ;restore context
;*************************************** flush any key messages in message queue
0:
CALL GetActiveWindow ;get handle to dialog
PUSH 1 ;PM_REMOVE remove message if there
PUSH 108h,100h,EAX,ADDR MSG ;WM_KEYLAST,WM_KEYFIRST key press filter
CALL PeekMessageA
OR EAX,EAX ;see if there was a key message there
JNZ 0 ;yes, so ignore it
;**************** now wait for correct keypress but let mouse messages through
1: ;note that command messages are sent direct to dlgproc
CALL GetActiveWindow ;get handle to dialog
PUSH 0,0,EAX,ADDR MSG ;get all messages
CALL GetMessageA
MOV EAX,[MSG+4] ;get message
CMP EAX,100h ;see if below WM_KEYFIRST
JB >2 ;yes, so send to dlgproc
CMP EAX,108h ;see if above WM_KEYLAST

JA >2 ;yes, so send to dlgproc

MOV EAX,[MSG+8] ;get virtual key
CMP EAX,76h ;see if F7 pressed
JZ >3 ;yes
JZ >5 ;yes
JZ >4 ;yes
JMP 1 ;no so ignore and wait for other messages
2:
PUSH ADDR MSG
CALL DispatchMessageA ;send mouse message to DlgProc
JMP 1
3:
CALL FINAL_HANDLER_RECOVERY
MOV EAX,-1 ;reload context and continue execution
JMP >7
;*****************************************************************************
4:
PUSH 0 ;ok button only
PUSH 'This is the polite end'
PUSH 'We sincerely offer our grovelling apologies (sic)!'
PUSH [hInst]
CALL WRITE_LISTBOXLINE ;back to the system for unwind and termination
MOV EAX,1 ;terminate process without showing message box
JMP >6
5:
CALL WRITE_LISTBOXLINE ;back to the system for unwind and termination
MOV EAX,0 ;terminate process showing message box
6:
MOV D[MESSDELAY],1000D ;greater delay for final messages from the system
7:

;*********************************************************************
AND B[HANDLERFLAG],0FDh ;clear flag that in final handler
POP ESI,EDI,EBX
RET 4h ;(for what it's worth) remove parameter from the stack
;
CODEEND: ;label for attempted code corruption
;

Lesson 11 - How is a disassembler working? [21]

What is this document about?
This document describes the design and implementation of a tool that takes a 32-bit Windows
executable file, disassembles its raw machine code into a human-readable representation
such as "assembly language", and displays it to the user.
What is the purpose of this document?

Besides serving as my personal notes on what I studied, this document is mainly written for
those of you who may be interested in learning how to write a disassembler. I also make all
the source files available for download. The source is extensively commented, but some
parts of the project may still be difficult to follow without understanding the overall
design, so this document fills that hole.
It is, unfortunately, not possible for me (or anybody) to fully describe every detail of how to
write a disassembler from A to Z. Moreover, I do not claim that my design and implementation
is "the best". In fact, this project was more for educating myself than showing it to others. My
original intent was to write just a framework, then publish it so that other people can extend it.
"Open ended implementation"

The subtitle says "open ended implementation". What I mean is that, as you will learn later
in this document, my implementation is basically incomplete, and you are more than welcome
to take part in it by completing what I left off. To start working on that part, all you
have to do is copy a couple of DLLs (and the associated header and lib files) and start
writing your own "decoder". See the document for details.
I will also complete the project eventually...
[21] This article was found via Google and was written by Tsuyoshi Watanabe. We respect the
work of this author and you should do the same.

NOTE:
I make no guarantee that my design or implementation is the most efficient or correct.
Indeed, my design only reflects how I solved the problem, and yours may well differ.
I make certain assumptions:
- The compiler is Microsoft Visual C++.
- The executable file to be disassembled was built with Microsoft tools (you can change
this easily).
- Only 32-bit executables are supported.

Introduction
Questions
Disassembling machine code into human-readable assembly code sounds complicated.
When you look at the Intel instruction manual, you understand that it is. However, it is not
necessarily difficult to write a disassembler, provided that you decompose the task into
smaller subordinate tasks.
Several questions pop up in your mind when you think about writing a disassembler.
- What does the raw machine code look like?
- How are machine code and assembly code related?
- How do I get to the beginning of the machine code? Where does it come from?
- What kind of documentation and specification do I need?
- etc.

Dumpbin
The easiest way to get answers to those questions is to play with the "dumpbin.exe" utility
provided with the Microsoft Visual C++ tools. This utility comes with every version of
Visual C++ from 2.0 to 6.0, as far as I remember. Note that Visual C++ 1.52, a 16-bit
edition, does NOT come with dumpbin.exe. Instead it came with exehdr.exe or something, and
that doesn't work for 32-bit PE format executable files.
Dumpbin.exe is a powerful PE format executable file dumper utility that can dump all
kinds of stuff from any PE file. Here, we study the output of dumpbin.exe using the /DISASM
switch. It literally "disassembles" the content of the "code" section of a given file.
(Basically we don't have to write a disassembler at all since we already have one!)

The following is a sample of dumpbin output for "NOTEPAD.EXE".
Each instruction appears on a single line (except when it is too long and wraps onto the
following line). In the far left column, you see the address of each instruction. The first
instruction, "cmp", is located at address:
01B41000:
The middle column shows the variable-length "raw machine code" of each instruction. For
example, the first instruction is:
83 3D E8 8E B4 01 00

Finally, the human-readable assembly language instruction appears. It is:

cmp dword ptr ds:[01B48EE8h], 0
You don't need to understand what this really means until you get to a much later part of
this document, but it roughly means "compare the 4-byte value located at address
01B48EE8 in the DS segment against the literal value 0".
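As a peek ahead, the seven raw bytes above can be picked apart by hand. The following is only a minimal sketch and is not part of the original project: the struct and function names are made up for illustration, and the code handles just this one encoding (opcode 83h followed by a ModR/M byte of 3Dh, a 32-bit address and an 8-bit immediate). It mainly shows that the address bytes are stored lowest byte first.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper type, invented for this illustration.
struct CmpMemImm8 {
    bool          ok;         // true if the bytes matched the one form handled here
    std::uint32_t address;    // 32-bit displacement (the absolute address)
    std::int8_t   immediate;  // 8-bit immediate operand
};

// Handles only "83 /7 ib" where ModR/M has mod=00, reg=111 (CMP), r/m=101 (disp32).
inline CmpMemImm8 decode_cmp_mem_imm8(const std::uint8_t* p, std::size_t n) {
    CmpMemImm8 r = {false, 0, 0};
    if (n < 7 || p[0] != 0x83) return r;

    const std::uint8_t modrm = p[1];
    const std::uint8_t mod = modrm >> 6;        // top two bits
    const std::uint8_t reg = (modrm >> 3) & 7;  // middle three bits
    const std::uint8_t rm  = modrm & 7;         // bottom three bits
    if (mod != 0 || reg != 7 || rm != 5) return r;

    // x86 stores multi-byte values little-endian: E8 8E B4 01 -> 01B48EE8h.
    r.address = std::uint32_t(p[2]) | (std::uint32_t(p[3]) << 8) |
                (std::uint32_t(p[4]) << 16) | (std::uint32_t(p[5]) << 24);
    r.immediate = static_cast<std::int8_t>(p[6]);
    r.ok = true;
    return r;
}
```

Feeding it the bytes 83 3D E8 8E B4 01 00 yields the address 01B48EE8h and the immediate 0, which are the two operands dumpbin printed.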
Notice that some instructions are 7 bytes long, like the first instruction, but others may
be 2 bytes long or 5 bytes long, and some are even just 1 byte long. The point is that Intel
x86 processors (from the 8086 up to the current Pentium II) use "variable-length"
instructions as opposed to "fixed-length" instructions. This is one of the differences from
typical RISC processors, whose instructions are all the same length. Also contrast this
with Java byte code. Although Java byte code is not a native "machine" code (well, it
sometimes is... I think Sun has hardware that directly interprets Java byte code), it is a
similar "encoding", and its opcodes are all one byte (any operands follow the opcode).

Intel x86? Which processor are we going to work for?
One of the reasons for Intel's success in the processor business is "backward compatibility"
with legacy code. The following is a brief history of Intel's x86 series processors.
1978 8086 (8088 in 1979)
1982 80286
1985 i386
1989 i486
1993 Pentium
1995 Pentium Pro
1997 Pentium with MMX, Pentium II
1998 Celeron
Each generation of processor became better and better by improving things like:
- expanding the data and address buses to increase the address space
- introducing protected mode for a more reliable operating environment
- increasing the size of the cache
- integrating the FPU
- adding more instructions like MMX that EVERYBODY uses
- adding superscalar pipelining
- increasing the clock rate
- and many other things that I have no idea about

From our disassembler's point of view, we don't have to worry about processor-specific
things. It is all hidden, and the instruction map was never "modified", although new
instructions were added over two decades.
Also, for this project, I intended to completely ignore 16-bit code. However, as I
discovered, it was easier to include the logic that applies only to 16-bit code in the
project, since the processor architecture has its 16-bit and 32-bit modes rather strongly
coupled. In other words, separating the 16-bit stuff from the 32-bit stuff is more work
than simply taking both into the project.
To answer the question of which processors our disassembler works for: it goes back only
as far as the i386. The reason: the 80286 has only a 16-bit protected mode, and 32-bit
Windows runs only in the 32-bit protected mode introduced with the i386.
Which Microsoft Windows?
Our disassembler is going to work primarily with Portable Executable format files (a.k.a.
PE files). The PE format is the standard executable file format for 32-bit Windows. 16-bit
Windows executables are in a format called NE (New Executable), which is not compatible
with the PE format. The types of program that are in PE format are:
- User-mode executable files (EXE, DLL, and others) for Windows 95/98.
- User-mode executable files (EXE, DLL, and others) for Windows NT.
- Kernel-mode executable files (SYS) for Windows NT.
Kernel-mode executables for Windows 95 (and most of 98), normally called VxDs, are in
LE format (a format that is somewhat more compatible with OS/2), which is not compatible
with the PE file format.
However, as you will see, I designed the project in such a way that the piece of software
that "parses" a stream of machine code bytes is completely ignorant of "where" it
comes from. In other words, it could come from either a PE file code section or a VxD's
code segment. So it is possible to extend it so that it will work with non-PE format files.
Still, the vast majority of executables we deal with every day are in PE file format. So we
will only work with PE files.

Any reference needed?
The only documentation that is going to be required is an Intel processor manual. It is
officially called the "Intel Architecture Software Developer's Manual" (ISBN-1555122744).
There are three volumes, and volume 2 contains most of the information we need. The problem
is that this document is not sold in most book stores. However, it is available for free
from Intel's download site.
Intel's official manual is not the only reference that we could use. In fact, there are other
books that also contain the information needed to write a disassembler. I find it helpful to
have several references so that when one book is not clear about something, I can check
the others. I used "The Intel Microprocessors 8086/8088, 80186/80188, 80286, 80386,
80486, Pentium, and Pentium Pro Processor" by Barry B. Brey (ISBN-0132606704).

Overall architecture
Phases of data representation.
Our data is a byte stream of machine code. A disassembler is nothing more than software
that converts an input byte stream into something else. This conversion task can be
broken down into smaller pieces. To find out how many pieces we can break it into, we need
to see how many "phases" our data will go through. The following figure shows three basic
possible phases of data.
The first phase is the start of the processing. There is just a raw byte stream which,
supposedly, means something to the hardware processor. Note that there are no
meaningful boundaries in the stream of bytes.
We would like to transform this stream of bytes into a list of much smaller groups of bytes,
still in raw format, which I call "raw instructions". Each raw instruction should correspond
to a single Intel x86 instruction. If data at this phase is rendered to users, they will
only see a bunch of variable-length hex numbers.
In the final phase, we hope that every raw instruction is converted into a line of words and
numbers that we understand as "assembly language". If data at this phase is rendered,
users see "disassembled instructions".

Two processing tasks
By looking at the figure for the phases that our data will go through, we understand that
there need to be two distinct "processing" tasks.
First, we need to bridge from "Phase I" to "Phase II". I decided to call the processing that
transforms our data from the "machine code byte stream" format into the "raw instructions"
format "Parsing". There may be a better technical word than "Parsing"; it could also be
called something like "tokenization".
The second "processing", which transforms "raw instructions" into "assembly instructions",
is named "Decoding" because what it really does is "interpret" what each byte in a raw
instruction means and put it in another, human-readable form.

These two processings could have been put together in one big "disassembler" processor,
but I thought it was better to separate them into two completely independent processings
because:
The task of parsing involves deciding where the current instruction ends. In other words,
it is primarily concerned with "how many bytes" it should process (read) for a single
instruction. On the other hand, "Decoding" is another kind of task that is not really
concerned (or doesn't want to be concerned) with how many bytes are in an instruction, but
rather with what the data bytes mean, so that it can convert them to a group of keywords
and numbers, which we understand as "assembly code".
It could be argued that by separating them into two, I am producing some redundancy --
basically there could be almost two "paths" for every byte in the input data. However, it
seemed reasonable to say that the cost of the "duplicates" is far less than the time you
will spend debugging a module that performs two logically different tasks simultaneously.
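The split between the two tasks can be sketched as follows. This is a toy illustration under stated assumptions, not the project's Parser or Decoder: the names are invented, only a handful of one-byte opcodes are recognised (NOP, RET, JMP rel8 and MOV r32, imm32, with their 32-bit mode lengths), and the raw instructions are collected in a vector purely to make the three phases visible.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Phase II data: a view onto one instruction inside the input stream.
struct RawInstruction {
    const std::uint8_t* bytes;
    std::size_t         size;
};

// "Parsing" only answers "how many bytes?". Returns 0 outside the tiny subset.
inline std::size_t instruction_length(std::uint8_t opcode) {
    if (opcode == 0x90 || opcode == 0xC3) return 1;   // NOP, RET
    if (opcode == 0xEB) return 2;                     // JMP rel8
    if (opcode >= 0xB8 && opcode <= 0xBF) return 5;   // MOV r32, imm32
    return 0;
}

// Phase I -> Phase II: cut the byte stream at instruction boundaries.
inline std::vector<RawInstruction> parse(const std::uint8_t* p, std::size_t n) {
    std::vector<RawInstruction> out;
    std::size_t i = 0;
    while (i < n) {
        const std::size_t len = instruction_length(p[i]);
        if (len == 0 || len > n - i) break;           // unknown or truncated
        RawInstruction ins = {p + i, len};
        out.push_back(ins);
        i += len;
    }
    return out;
}

// Phase II -> Phase III: "decoding" only answers "what does it mean?".
inline std::string decode(const RawInstruction& ins) {
    static const char* const regs[8] =
        {"eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"};
    const std::uint8_t op = ins.bytes[0];
    char buf[32];
    if (op == 0x90) return "nop";
    if (op == 0xC3) return "ret";
    if (op == 0xEB && ins.size == 2) {
        const int rel = static_cast<std::int8_t>(ins.bytes[1]);
        std::snprintf(buf, sizeof buf, "jmp short %+d", rel);
        return buf;
    }
    if (op >= 0xB8 && op <= 0xBF && ins.size == 5) {
        const std::uint32_t imm =
            std::uint32_t(ins.bytes[1]) | (std::uint32_t(ins.bytes[2]) << 8) |
            (std::uint32_t(ins.bytes[3]) << 16) | (std::uint32_t(ins.bytes[4]) << 24);
        std::snprintf(buf, sizeof buf, "mov %s, %08Xh",
                      regs[op - 0xB8], static_cast<unsigned>(imm));
        return buf;
    }
    return "???";
}
```

Note the redundancy mentioned above: both functions look at the opcode byte, once to measure the instruction and once to interpret it. In exchange, each function stays small enough to test on its own.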

Mapping of "processing" tasks to objects.
In a pure object-oriented design, this mapping is probably a "big NO NO". I am mapping
"processing" to "objects", which doesn't make sense in the OO design world. However, I argue
that "parsing" is done by a "parser" and "decoding" is performed by a "decoder", so I can
map objects to these "XXXers". The following figure shows our "parser" and "decoder".

The green rectangles are objects. As you can see, Parser takes a "machine code byte
stream" as its input and produces "raw instructions". In turn, Decoder takes a "raw
instruction" and converts it to an "assembly instruction".
This is our overall design of the "engine" part of the disassembler.

Other utility objects we need.
You might have noticed that the objects we have so far, Parser and Decoder, have no
interaction with the user. Parser could ask for a "machine code byte stream" from the user
directly, but I don't know how many users can actually hand-craft a machine code byte stream
and give it to the Parser. Meanwhile, when Decoder does its job of decoding raw instructions
into lines of assembly language code, how is it going to show its work to the user? Should
it show each line in a separate Message Box? We might automatically think that the output of
any disassembler should be scrolling output in the standard output console, but it doesn't
have to be. I never mentioned how it is "rendered". Who will be doing this extra work?
The figure below shows a couple of objects that do the "data providing (fetching)" and
"rendering service" part.
"Data stream provider" is someone or a piece of software that somehow "produces" an input
data stream. Our Parser happens to be a consumer of that product. The data stream may
come from a PE file, an LE file, the memory of a running program, or whatever. There
could be several different flavors of "Data stream provider", including the Clipboard, to
which a user may have copied data from somewhere. The point is that it is a bad idea for
the Parser to make assumptions about the source.
For this project, I arbitrarily decided to use a kind of PE dump utility to provide us with
the "data stream providing service".
"Rendering provider" (rendering service provider) is the UI guy. The rendering technique may
be simple console output, dialog-based list box output, or something more sophisticated
(complicated) like a graph of caller-callee relationships, but it is up to the designer of
the "Rendering provider" to decide how to "render" the assembly language lines produced by
Decoder. In this project, I have a Decoder called SimpleDecoder, and it uses "std::cout"
as the rendering provider. Since "std::cout" is a "service" and not really an independent
entity, the SimpleDecoder implementation basically lacks a "rendering provider" piece. More
on this later.
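One way to picture the two helper roles is as a pair of small interfaces that the engine depends on. The sketch below is an assumption made for illustration only: none of these class or function names come from the project, and the "engine" is replaced by a trivial hex dumper so that the example stays short.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical interface: anything that can hand over a block of machine code.
class IDataStreamProvider {
public:
    virtual ~IDataStreamProvider() {}
    virtual const std::uint8_t* Data() const = 0;
    virtual std::size_t         Size() const = 0;
};

// Hypothetical interface: anything that can show one line of output.
class IRenderer {
public:
    virtual ~IRenderer() {}
    virtual void RenderLine(const std::string& line) = 0;
};

// A provider backed by a memory buffer; a PE file or the clipboard could
// sit behind the same interface.
class MemoryProvider : public IDataStreamProvider {
public:
    explicit MemoryProvider(const std::vector<std::uint8_t>& bytes) : bytes_(bytes) {}
    const std::uint8_t* Data() const override { return bytes_.data(); }
    std::size_t         Size() const override { return bytes_.size(); }
private:
    std::vector<std::uint8_t> bytes_;
};

// A renderer that just collects lines; a console or list box renderer
// would plug in the same way.
class CollectingRenderer : public IRenderer {
public:
    void RenderLine(const std::string& line) override { lines.push_back(line); }
    std::vector<std::string> lines;
};

// Stand-in for the engine: it knows neither where the bytes come from
// nor where the text goes.
inline void HexDump(const IDataStreamProvider& in, IRenderer& out) {
    static const char digits[] = "0123456789ABCDEF";
    for (std::size_t i = 0; i < in.Size(); ++i) {
        const std::uint8_t b = in.Data()[i];
        std::string line;
        line += digits[b >> 4];
        line += digits[b & 0x0F];
        out.RenderLine(line);
    }
}
```

Swapping MemoryProvider for a file-backed provider, or CollectingRenderer for one that writes to std::cout, would not require touching HexDump at all.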

Where is user?
Now, we put all the pieces together. Certainly, the most important piece is the "user". As I
described in the previous section, the UI layer is going to interact with the "user".
The user provides an "executable file" to be disassembled. In turn, our "disassembler
system" returns a disassembled file.
So, this completes the section on "architecture".

Getting machine code byte stream

PE file wrapper object
In the previous chapter, we decided to have a utility object that provides the "data stream
providing" service. I also decided that for this project we would use some kind of PE
dumper. Luckily, there are many sample PE dumpers (I used Matt Pietrek's PEDUMP as the
starting point, thanks Matt!).
The following figure shows you which part of the system we will work on in this chapter.

According to the requirements, what we need is an object that is capable of taking a
specified executable file from the user, somehow getting (extracting?) the "code" part of
the executable file, and then making it available to others such as our Parser (but it
could be anybody else who needs a "machine code byte stream").
The requirements of this object are:
- It takes the file name of a target PE executable as an input.
- It understands the PE file format.
- It provides service functions so that a client can obtain the "machine code byte stream".
This is not a terribly involved set of requirements. They can be easily fulfilled by
extending a typical PE file dumper.
Although the topic of PE file dumpers is interesting and important, I decided not to dwell
too long on this subject. Besides, this object is rather an "extra" helper object. We are
more interested in "Parser" and "Decoder" since they provide the "guts" of a disassembler.
For this reason, I will just show my implementation of the "Data stream provider", called
PEFileWrap.

PEFileWrap
Basically, PEFileWrap is a "wrapper" around a PE file which provides a couple of methods,
among others, that give information about the location and size of the "code section"
within a PE file.

A PE file has different parts like these:
- Standard header
- Optional header
- Section table
- Code section
- Initialized data section
- Uninitialized data section
- Import table
- Export table
- Thread local storage
- etc.
However, you don't really care about anything except the "Code section" of a PE file. This
section is where the linker emits all the object code. When the OS loader starts executing a
program, execution begins at the entry point recorded in the optional header, which
normally lies inside this section. This section is the "machine code byte stream" that we
are going to disassemble.
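To make "location and size of the code section" concrete, here is a minimal sketch that walks the headers of a PE image held in memory. It is not how PEFileWrap does it (that code is based on PEDUMP and the Windows header structures); this version reads the fields at their raw offsets as laid out in the PE/COFF format, returns the first section flagged as containing code, and uses names invented for the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical result type for this illustration.
struct CodeSection {
    bool          found;
    std::uint32_t fileOffset;  // PointerToRawData of the section
    std::uint32_t size;        // SizeOfRawData of the section
};

// Little-endian field readers.
inline std::uint16_t rd16(const std::uint8_t* p) {
    return static_cast<std::uint16_t>(p[0] | (p[1] << 8));
}
inline std::uint32_t rd32(const std::uint8_t* p) {
    return std::uint32_t(p[0]) | (std::uint32_t(p[1]) << 8) |
           (std::uint32_t(p[2]) << 16) | (std::uint32_t(p[3]) << 24);
}

inline CodeSection find_code_section(const std::vector<std::uint8_t>& f) {
    CodeSection r = {false, 0, 0};
    if (f.size() < 0x40 || f[0] != 'M' || f[1] != 'Z') return r;   // DOS header

    const std::size_t pe = rd32(&f[0x3C]);                         // e_lfanew
    if (pe > f.size() - 24) return r;                // signature + file header
    if (std::memcmp(&f[pe], "PE\0\0", 4) != 0) return r;

    const std::size_t   coff   = pe + 4;             // 20-byte file header
    const std::uint16_t nsect  = rd16(&f[coff + 2]);  // NumberOfSections
    const std::uint16_t optlen = rd16(&f[coff + 16]); // SizeOfOptionalHeader

    // The section table follows the optional header; each entry is 40 bytes.
    std::size_t sec = coff + 20 + optlen;
    for (std::uint16_t i = 0; i < nsect; ++i, sec += 40) {
        if (sec + 40 > f.size()) return r;
        const std::uint32_t flags = rd32(&f[sec + 36]);  // Characteristics
        if (flags & 0x00000020u) {                       // IMAGE_SCN_CNT_CODE
            r.found = true;
            r.size = rd32(&f[sec + 16]);                 // SizeOfRawData
            r.fileOffset = rd32(&f[sec + 20]);           // PointerToRawData
            return r;
        }
    }
    return r;
}
```

A real tool would also check that the reported range lies inside the file and would cope with executables that have more than one code section.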

The interface (abstract base class) is as follows. The ones we are going to use are the
first three: GetBase, GetCodeSectionOffset and GetCodeSectionSize.
class IPEFileWrap
{
public:
    virtual DWORD GetBase() = 0;                            // used by the disassembler
    virtual DWORD GetCodeSectionOffset() = 0;               // used by the disassembler
    virtual UINT  GetCodeSectionSize() = 0;                 // used by the disassembler
    virtual DWORD GetInitializedDataSectionOffset() = 0;
    virtual UINT  GetInitializedDataSectionSize() = 0;
    virtual DWORD GetUninitializedDataSectionOffset() = 0;
    virtual UINT  GetUninitializedDataSectionSize() = 0;
    virtual DWORD GetImportDataSectionOffset() = 0;
    virtual UINT  GetImportDataSectionSize() = 0;
    virtual DWORD GetExportDataSectionOffset() = 0;
    virtual UINT  GetExportDataSectionSize() = 0;
    virtual DWORD GetResourceSectionOffset() = 0;
    virtual UINT  GetResourceSectionSize() = 0;
};
This interface is defined in the header file "PEUtility.h", which is in the project's shared
Include directory as well as under the PEUtility project source directory. Our disassembler
needs to include this file so that we can use the service.

How to create and use IPEFileWrap object
I decided to package this object in a DLL for two reasons:
1. It makes the project simpler.
2. I can update the DLL independently, since this PE dump stuff could potentially be
another fun project to extend (you are more than welcome to enhance it into the next
generation of PE file content dumper).
At any rate, this object is "hosted" by a server DLL called PEUtility.DLL. The DLL is under
the project's top level Debug/Release directory. The DLL exports a function that you should
call to obtain a pointer to an IPEFileWrap object - it is called CreatePEFileWrap().
extern "C" PEUTILITY_API int CreatePEFileWrap(char* filename, IPEFileWrap** ppx);

enum PEUTILITY_ERROR_CODE
{
    PEUTILITY_SUCCESS = 1,
    PEUTILITY_FAILURE = 0,
    OBJECT_ALREADY_CREATED = -1
};

When you call CreatePEFileWrap, it returns a PEUTILITY_ERROR_CODE. Use this return code to
find out if there was any problem. If you get PEUTILITY_SUCCESS, then
everything went well.
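The calling pattern might look like the sketch below. Because PEUtility.DLL is not available here, everything except the client function is a STUB written for this example: the interface is trimmed to three methods (with a virtual destructor added), the stub takes a const char* rather than the char* of the real export, and the values it returns are made up.

```cpp
#include <cstddef>

typedef unsigned long DWORD;   // stand-ins for the Windows typedefs
typedef unsigned int  UINT;

// Trimmed copy of the interface: only the methods the disassembler uses.
class IPEFileWrap {
public:
    virtual ~IPEFileWrap() {}
    virtual DWORD GetBase() = 0;
    virtual DWORD GetCodeSectionOffset() = 0;
    virtual UINT  GetCodeSectionSize() = 0;
};

enum PEUTILITY_ERROR_CODE {
    PEUTILITY_SUCCESS = 1,
    PEUTILITY_FAILURE = 0,
    OBJECT_ALREADY_CREATED = -1
};

// STUB object with invented values, standing in for the real wrapper.
class StubWrap : public IPEFileWrap {
public:
    DWORD GetBase()              { return 0x01B40000UL; }
    DWORD GetCodeSectionOffset() { return 0x1000UL; }
    UINT  GetCodeSectionSize()   { return 0x6000U; }
};

// STUB standing in for the function exported by PEUtility.DLL.
inline int CreatePEFileWrap(const char* filename, IPEFileWrap** ppx) {
    static StubWrap instance;            // one object for the process lifetime
    if (filename == NULL || ppx == NULL) return PEUTILITY_FAILURE;
    *ppx = &instance;
    return PEUTILITY_SUCCESS;
}

// Client side: check the return code before touching the pointer.
struct CodeRange { bool ok; DWORD offset; UINT size; };

inline CodeRange locate_code(const char* filename) {
    CodeRange r = {false, 0, 0};
    IPEFileWrap* pe = NULL;
    if (CreatePEFileWrap(filename, &pe) != PEUTILITY_SUCCESS || pe == NULL)
        return r;
    r.offset = pe->GetCodeSectionOffset();
    r.size   = pe->GetCodeSectionSize();
    r.ok     = true;
    return r;
}
```

The client never deletes the object it receives; in this sketch the stub owns it, which mirrors the idea that the wrapper is hosted by the DLL.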
NOTE: this implementation is "asking" for a COM implementation. I intentionally made this
a non-COM object because introducing COM here might make the project more complex
and harder to understand. Needless to say, it would be far better to have it as a COM
object.
This PEFileWrap object has a huge drawback. It is not multi-thread ready. In other
words, you can't open more than one PE file at a time with this object. In fact, with the
current implementation, all you can do is create a PE file wrapper once, and the object
continues to exist until the DLL is unloaded (which basically means until the application
is terminated). This problem can be solved by making the object COM-compliant.
After all, if you don't like this implementation, you can use your own PE file dumper utility.
As mentioned before, the only requirement for it in this project is that it can provide the
location and the size of the "machine code byte stream"!
You might want to look at the classes in the PEUtility project. I have the following:
- PEFile - represents a single PE file
- PEFileHeader - represents the "header" part of a PE file
- PEOptionalHeader - represents the "optional header" part of a PE file
- PESectionTable - represents the "section table" part of a PE file
The code is largely based on Matt Pietrek's PEDUMP. However, the code has been decomposed
into these classes from a straight "C" implementation. Each class could be extended
to provide more sophisticated capabilities. For this project, however, my PEFileWrap
fulfills the requirements, so there is no point in equipping it with other capabilities.

Understanding 32-bit Intel Processor Architecture (IA32) for parsing
In the previous chapter, we learned how to get a "machine code byte stream" by using the
service provided by a PE wrapper class called PEFileWrap, which is hosted in PEUtility.DLL.
Our next task is to parse the raw data of the machine code byte stream into a list of
smaller groups of bytes, where each element in the list represents a single instruction.
A design issue that we need to agree.
Before we go any further, we have a "design issue" to settle:

When Parser processes the input byte stream, will it produce an array of raw instructions,
or will it find the end of a single instruction and give control back to the client? To
understand this issue, compare the following possible designs of the Parser:
1. Parser processes the byte stream, and as it finds each instruction, it adds an
entry to an array of pointers to bytes. At the end, there is a dynamic array whose
number of elements equals the number of instructions in the input byte stream.
2. Parser processes the byte stream, and as it finds each instruction, it copies
the instruction's bytes into a buffer. This buffer could be an element of a
dynamically growable array.
3. Parser processes the byte stream, and as it finds each instruction, it returns
a pointer to the beginning of the current instruction within the input byte
stream, together with the number of bytes in that instruction. When the client
says "go ahead", Parser resumes processing immediately beyond the last byte
of the previous instruction.
To make the story short, I used design #3. The reason is efficiency in space as well
as time. There is no copying of data into another location, which would probably need to be
dynamically allocated (actually, I do perform physical copying for caching purposes).
There is no dynamically growable array of pointers. A pointer takes up 4 bytes, so 1000
instructions would need about 4 KB of memory for the pointers alone, which is equivalent to
an entire page.
The only requirement with this design is that the client, in our case Decoder or whoever owns
Decoder, must perform the decoding task on the fly. It can't go back to a previous instruction
once it has proceeded to the next one. This makes the decoding task more linear.
Anyway, that's how my Parser works. Now, let's get to the topic of how to parse an
Intel machine code byte stream so that Parser can find the boundary of the current instruction.
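To make design #3 concrete, here is a minimal sketch in Python (not the author's C++ Parser). The names walk_instructions, length_at and toy_length are made up for illustration, and toy_length is only a stand-in for the real length logic discussed in the rest of this chapter.

```python
from typing import Callable, Iterator, Tuple


def walk_instructions(code: bytes,
                      length_at: Callable[[bytes, int], int]
                      ) -> Iterator[Tuple[int, int]]:
    """Yield (offset, length) for one instruction at a time (design #3).

    Nothing is copied and no array of pointers is built: the client gets
    the position of the current instruction and asks for the next one.
    """
    pos = 0
    while pos < len(code):
        n = length_at(code, pos)
        if n <= 0 or pos + n > len(code):
            raise ValueError(f"cannot find instruction boundary at offset {pos}")
        yield pos, n
        pos += n  # resume immediately beyond the previous instruction


def toy_length(code: bytes, pos: int) -> int:
    # Placeholder assumption: the first byte of each "instruction" holds
    # its total length. Real x86 code needs the rules described below.
    return code[pos]


if __name__ == "__main__":
    stream = bytes([1, 2, 0xAA, 3, 0xBB, 0xCC])
    for offset, length in walk_instructions(stream, toy_length):
        print(offset, length, stream[offset:offset + length].hex())
```

Because the generator only moves forward, the client has to decode each instruction before asking for the next one, which is exactly the "on the fly" requirement mentioned above.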

Understanding Intel Architecture.

The subtitle of this chapter is "Understanding Intel Architecture (IA32) for parsing". The next
chapter talks about the design of Parser. Why do we need an entire chapter for understanding
Intel Architecture? Because our task of "parsing" requires us to understand it. Of course, we
only need to understand a very small part of the Intel Architecture to do our job.
Let's dive into it. Ready?
This is the format of Intel x86 instructions: optional prefixes, then Opcode, ModR/M, SIB,
Displacement and Immediate, in that order. Probably you don't yet understand things like
"ModR/M" and "SIB". You might have some idea of "Opcode" and "Displacement". They
mean lots of things, but we have to remember this while we study the Intel x86 instruction
format:
At this point, we don't care what they mean from the assembly language point of view. All we
want to know is how many bytes each instruction we parse is going to be.
To put it another way, Parser wants to know where the current instruction ends. How
can it be sure that an instruction ends at a particular location? That's what we are going to
find out in this chapter.

Prefix
Let's start with those guys that sit before Opcode.
The ones highlighted with yellow background are called "prefixes" as a whole. As the name
implies, they may appear before Opcode. Their job is to "override" the default attributes
of the processor mode. For instance, when the processor is running in 32-bit mode, the
default address and data (operand) sizes are 32-bit. If you want to move just 16 bits of
data into a register, the default operand size attribute must be overridden for that one
instruction.
The address-size and operand-size prefixes are similar. Their presence flips the attribute
between 16-bit and 32-bit. Let's not go too far on this. Our goal is to understand "how many
bytes we need to read for an instruction". If you are interested, check the Intel processor
manual. (I don't mean to dodge the question; we just don't have to know about it until we
write Decoder.)
The most important things that we should know about prefixes are:
- Each prefix is exactly one byte.
- Every prefix is optional - it may be present or absent.
- The order of appearance is not fixed (I am not entirely sure about this, but the Intel
processor manual says so).
What do they look like? Here you go.
Instruction prefix              F0 F2 F3
Address-size override prefix    67
Operand-size override prefix    66
Segment override prefix         2E 36 3E 26 64 65
They are all in hex, and Instruction Prefix and Segment Prefix have more than one possible
value. Each byte means something, but we don't care what they mean now. (Again, go ahead
and find out what they mean.)
So, what does this all mean to our parser? It means that:
Any instruction may start with at most 4 prefix bytes, which may appear in any order, so we
need to keep reading until all of the prefixes (or none) have been consumed. In addition, the
Address-size prefix and Operand-size prefix are going to influence the subsequent parsing
task, so we had better remember that we saw them, if they exist. That's it.
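The prefix rules can be sketched in a few lines of Python. This is only an illustration, not the author's code: the function name scan_prefixes and the constant names are assumptions, and the limit of four prefix bytes simply follows the statement above.

```python
# Prefix byte values as listed in the table above.
PREFIX_BYTES = frozenset({
    0xF0, 0xF2, 0xF3,                     # instruction prefixes
    0x2E, 0x36, 0x3E, 0x26, 0x64, 0x65,   # segment override prefixes
    0x66,                                 # operand-size override
    0x67,                                 # address-size override
})
MAX_PREFIXES = 4  # assumption taken from the text: at most 4 prefix bytes


def scan_prefixes(code: bytes, pos: int = 0):
    """Return (count, operand_override, address_override) for code[pos:]."""
    count = 0
    operand_override = False
    address_override = False
    while (count < MAX_PREFIXES
           and pos + count < len(code)
           and code[pos + count] in PREFIX_BYTES):
        byte = code[pos + count]
        if byte == 0x66:
            operand_override = True   # remember it: affects later field sizes
        elif byte == 0x67:
            address_override = True
        count += 1
    return count, operand_override, address_override


if __name__ == "__main__":
    print(scan_prefixes(bytes.fromhex("66B83412")))
```

The opcode starts at pos + count, and the two flags are what the parser has to "remember" for the later fields.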
Opcode
Opcode is probably the easiest field to understand (although we don't really care what it
means). It selects which of the operations provided by the processor a particular
instruction performs.
For parsing purposes, the first thing we need to understand is that Opcode itself takes up
either one or two bytes. There is no case where Opcode is absent. If Opcode is not
found, parsing must have gone wrong somewhere.
Anyway, our Parser must be able to read either one byte or two bytes depending on
whether this Opcode is a "One Byte" opcode or a "Two Byte" opcode. How can you tell?
Easy. If you see 0F (hex), it is an escape byte announcing yet another opcode byte.

Besides the size of an opcode itself, we must find out what kind of "operand(s)" a particular
opcode is going to take. Some opcodes don't take any operand, others take just a ModR/M
byte, some take an Immediate, etc.
The red line arrow in the above figure means that the presence of the fields pointed to is
dictated by the field where the arrow originates. Therefore, whether a ModR/M byte will
follow or not depends on the opcode. The same applies to Displacement and Immediate.
So, it is getting a little complicated here. What do we do?
This is the part that took most of my time in this project so far (aside from Decoder, which
will take a lot more). It would be nice if the Intel processor manual had tables that say, "this
and this and that instruction take ModR/M, that and these opcodes take immediate" etc.
Unfortunately, it doesn't.
The Intel processor manual describes the operand requirements for every opcode, but there
aren't any nicely formatted tables. I basically had to create tables of requirements by hand.
The tables I made are:
- Table of "One Byte" opcodes which take ModR/M field.
- Table of "Two Byte" opcodes which take ModR/M field.
- Table of "One Byte" opcodes which take 1-byte Displacement.
- Table of "One Byte" opcodes which take 2/4-byte Displacement.
- Table of "Two Byte" opcodes which take 1-byte Displacement.
- Table of "Two Byte" opcodes which take 2/4-byte Displacement.
- Table of "One Byte" opcodes which take 1-byte Immediate.
- Table of "One Byte" opcodes which take 2/4-byte Immediate.
- Table of "Two Byte" opcodes which take 1-byte Immediate.
- Table of "Two Byte" opcodes which take 2/4-byte Immediate.
If you are lost, that's natural.
The "2/4-byte" part means that the size is either 2 bytes (16-bit) or 4 bytes (32-bit). How
are we going to know which size the operand is? This is where the "default attribute" and
the possible presence of an Operand-size prefix come into play. Parser must "recall" any
prefix that it might have already parsed.
Construction of these tables took a while, and since I did it by hand, there may be errors.
On top of that, due to my lack of complete understanding of assembly language (did I
mention that I never really programmed in assembly language before? I am a C/C++
programmer!), I might have made some mistakes. So far, my test results say that I got it
right, but I will not be surprised if a bug or two pops up due to a bad table.
I am not going to show the contents of these tables here. It wouldn't be exciting anyway. You
can see the tables in the source file, "IA32OpcodePart.cpp". They are in a ".cpp" file because
these tables are static members of a class.
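The "recall the prefix" step boils down to a one-line decision. The following Python sketch is only an illustration; the function and parameter names are made up here and do not come from the author's source.

```python
def variable_field_size(default_32bit: bool, override_seen: bool) -> int:
    """Size in bytes of a "2/4-byte" field.

    default_32bit -- True when the code runs in a 32-bit segment
    override_seen -- True when the matching size-override prefix was parsed
                     (66 for immediates, 67 for ModR/M displacements)
    """
    is_32bit = default_32bit != override_seen  # the prefix flips the default
    return 4 if is_32bit else 2


if __name__ == "__main__":
    # 32-bit code, no prefix: 4 bytes; with a 66 prefix: 2 bytes.
    print(variable_field_size(True, False), variable_field_size(True, True))
```

Note that the prefix flips the default rather than forcing 16-bit: in 16-bit code the same prefix selects the 32-bit size.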
From Parser's point of view, it has to check the current opcode against these requirements
for operand fields, and if any match is found, it has to remember to parse those operand
fields. The size of a field might have to be determined from the operand size (and a
possible override made by a prefix). For instance, if an Immediate is required by a particular
opcode, then after we read (or don't read) the ModR/M, SIB, and displacement fields, we must
remember to parse x bytes for the Immediate operand.
Don't worry about remembering all these details. After all, this part is already implemented,
so you never have to implement it yourself (unless you become sick of my spaghetti and
decide to write your own).

ModR/M
This field is scary looking. What the &*^# is "ModR/M"? In short, this field encodes
one or two operands, one of which could be data in memory. It may also encode
a sort of "sub-opcode", where certain opcodes define a "group" of operations and the actual
operation is determined by looking at part of this ModR/M byte. Even worse, ModR/M
may require a SIB byte, which follows ModR/M. Again, the details of the "meaning" are not so
important here.
As you can see from a couple of red arrows, ModR/M may say that it needs a SIB and/or
Displacement field to completely describe the operand(s).
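Checking for these conditions was not difficult. For a given ModR/M byte (bits 6,7 are the
"Mod" field, bits 0,1,2 are the "R/M" field, and bit 0 is the lowest bit)...
- If Mod is "11", the operand is a register and nothing follows.
- If Mod is not "11", Address-size is 32-bit, and the data at bit positions 0,1,2 is
equal to "100" (e.g. 00000100), then a SIB byte will follow.
- If data at bit positions 6,7 is equal to "01", then there will be a 1-byte
displacement.
- If data at bit positions 6,7 is equal to "10" and Address-size is 32-bit, then
there will be a 4-byte displacement.
- If data at bit positions 6,7 is equal to "10" and Address-size is 16-bit, then
there will be a 2-byte displacement.
- If data at bit positions 6,7 is equal to "00", there is normally no displacement,
except when R/M is "101" with 32-bit Address-size (4-byte displacement) or
R/M is "110" with 16-bit Address-size (2-byte displacement).
From the parser's point of view, it needs to find these signatures at the above mentioned
bit locations, and remember the field requirements if they occur.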
Intel processor manual completely describes the meanings of every possible pattern of
ModR/M byte, so it shouldn't be confusing when implementing a decoder.
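As a hedged illustration of the ModR/M checks, the sketch below follows the field layout given in the Intel manual (Mod in bits 6-7, R/M in bits 0-2). It is written in Python rather than the author's C++, and the name modrm_layout and its parameters are assumptions made for this example.

```python
def modrm_layout(modrm: int, addr32: bool = True, sib: int = None):
    """Return (has_sib, displacement_size) for one ModR/M byte.

    addr32 -- True for 32-bit addressing, False for 16-bit addressing
    sib    -- the SIB byte if the caller has already read it (optional)
    """
    mod = (modrm >> 6) & 0b11
    rm = modrm & 0b111
    if mod == 0b11:                 # register operand: nothing follows
        return False, 0
    if addr32:
        has_sib = rm == 0b100
        if mod == 0b01:
            disp = 1
        elif mod == 0b10:
            disp = 4
        elif rm == 0b101:           # Mod 00, R/M 101: bare 32-bit address
            disp = 4
        elif has_sib and sib is not None and (sib & 0b111) == 0b101:
            disp = 4                # Mod 00, SIB base 101: disp32, no base
        else:
            disp = 0
        return has_sib, disp
    # 16-bit addressing never uses a SIB byte
    if mod == 0b01:
        return False, 1
    if mod == 0b10 or rm == 0b110:  # Mod 00, R/M 110: bare 16-bit address
        return False, 2
    return False, 0


if __name__ == "__main__":
    print(modrm_layout(0x44))       # [reg + SIB + disp8]
    print(modrm_layout(0x35))       # [disp32]
```

The optional sib argument is there because one displacement case can only be decided after the SIB byte has been read.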

SIB
This byte is mostly passive as far as parsing is concerned. Just skip past the SIB byte and
go on to the next field (or to the end of the current instruction if no other fields follow).
The one case the parser must watch for is Mod equal to "00" together with a SIB base field
(bits 0,1,2) of "101", which means that a 4-byte displacement follows.
The meaning of every possible SIB byte is completely described by the Intel processor
manual.

Displacement
The presence of this field, including its size, has already been determined by either Opcode
or ModR/M. Parser needs to advance its location past the displacement field, if it
exists.
The meaning of this field and how it relates to other fields are not our concern at this
moment. It will probably be used for effective address calculation when fetching some
data from memory.

Immediate
Finally, we see the end of the tunnel. Just like Displacement, if this field is required, we
would have known by now. Just skip over the number of bytes this field takes.
When the pointer (or location counter) has advanced past this field, we must be looking at the
beginning of the next instruction. This is where Parser would say, "Done, here is the current
instruction!".
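Putting the fields together, a toy length calculator might look like the Python sketch below. It is not the author's Parser: the OPCODES table is a tiny hand-picked assumption covering only six one-byte opcodes, 32-bit mode is assumed, two-byte (0F) opcodes are left out, and truncated input is not handled.

```python
# opcode -> (takes ModR/M, immediate size); "v" means 2 or 4 bytes.
OPCODES = {
    0x90: (False, 0),     # NOP
    0x6A: (False, 1),     # PUSH imm8
    0x68: (False, "v"),   # PUSH imm16/32
    0xB8: (False, "v"),   # MOV eAX, imm16/32
    0x89: (True, 0),      # MOV r/m, r
    0xFF: (True, 0),      # group opcode, e.g. PUSH r/m
}
PREFIXES = {0xF0, 0xF2, 0xF3, 0x2E, 0x36, 0x3E, 0x26, 0x64, 0x65, 0x66, 0x67}


def instruction_length(code: bytes, pos: int = 0) -> int:
    """Length in bytes of the instruction starting at code[pos]."""
    start = pos
    operand32 = address32 = True          # defaults in 32-bit mode
    while code[pos] in PREFIXES:          # 1. prefixes
        if code[pos] == 0x66:
            operand32 = False
        elif code[pos] == 0x67:
            address32 = False
        pos += 1
    opcode = code[pos]                    # 2. opcode
    pos += 1
    if opcode not in OPCODES:
        raise ValueError(f"opcode {opcode:02X} is not in the toy table")
    has_modrm, imm = OPCODES[opcode]
    if has_modrm:                         # 3. ModR/M, SIB, displacement
        modrm = code[pos]
        pos += 1
        mod, rm = modrm >> 6, modrm & 7
        if mod != 3 and address32:
            if rm == 4:                   # SIB byte follows
                sib = code[pos]
                pos += 1
                if mod == 0 and (sib & 7) == 5:
                    pos += 4
            if mod == 1:
                pos += 1
            elif mod == 2 or (mod == 0 and rm == 5):
                pos += 4
        elif mod != 3:                    # 16-bit addressing
            if mod == 1:
                pos += 1
            elif mod == 2 or rm == 6:
                pos += 2
    if imm == "v":                        # 4. immediate
        pos += 4 if operand32 else 2
    else:
        pos += imm
    return pos - start


if __name__ == "__main__":
    for text in ("6A00", "FF3548204000", "66B83412", "89442404"):
        print(text, instruction_length(bytes.fromhex(text)))
```

The order of the four steps mirrors the order of the sections above; once the immediate has been skipped, pos points at the next instruction.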

Now what?
After going through each part of the instruction format, we have a general idea of what kind
of operations must be implemented for our parser.
If we can translate this instruction format into software objects, it saves us time, because
our understanding of the instruction format is then reflected directly in the design of the
software objects.
The following figure shows the instruction format from our software's point of view. I made
some arbitrary regrouping of the parts (fields) of an instruction that are reasonable for our
software.

For instance, you see that all the prefix fields are merged into a single "prefix part". This
makes sense because of the strong relationship among the four prefixes. Another merger
occurred between ModR/M and SIB. Since SIB is a passive field (it doesn't designate other
fields), it became part of the logic that takes care of ModR/M. The notion of "size" is less
strict in this view.
In the next chapter, we map these parts to C++ objects, and refine the relationships and
interactions among the fields.

Decoding raw instructions

Simple implementation - SimpleDecoder
This chapter is about the Decoder part of a disassembler.
The task of a "decoder" in this context is to take the raw instruction bytes (possibly just one
byte) that represent a single instruction and convert them into a human-readable format
such as a line of assembly code.
Decoder does not have to worry about figuring out how many bytes the current instruction
is made up of, since the Parser object will tell it.
The most primitive type of Decoder is a decoder which does not "decode" at all.

What does SimpleDecoder do?
The following figure shows what my SimpleDecoder does:
As you can see, it entirely skips the most interesting task of translating (decoding) a raw
instruction into an assembly instruction. Instead, it simply converts the raw byte data into
its hexadecimal representation in ASCII characters.
Obviously, SimpleDecoder is so simple that it adds almost no value at all, but it should be a
good example for anybody who wants to play around with InstructionParser.
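What such a "non-decoding" decoder boils down to can be shown in a couple of lines. This is a Python sketch, not the author's C++ SimpleDecoder, and the function name simple_decode is made up for the example.

```python
def simple_decode(raw: bytes) -> str:
    """Convert the raw bytes of one instruction to hexadecimal ASCII text."""
    return " ".join(f"{byte:02X}" for byte in raw)


if __name__ == "__main__":
    print(simple_decode(bytes.fromhex("FF3548204000")))
```

The input is assumed to be exactly the bytes of one instruction, as delivered by the parser.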

Rendering provider
SimpleDecoder's job is to "decode" raw instructions, but it is not responsible for
"rendering decoded information". This task is performed by an implementation of a
"Rendering service provider".
A rendering provider could be another fun project. It can range from simple console output
to a GUI rendering using icons, list views, or whatever.
For SimpleDecoder, I used std::cout as the rendering service provider. In other words, I just
dumped the output into a DOS box.
Again, this is the simplest rendering service I could ask for.
Believe it or not, this is the end of this chapter.

Final words
More sophisticated implementation - Disassembler
This is where you can come in!
I haven't written any decoder that does more than SimpleDecoder does. Eventually, I would
like to write one and share my experience here, but for now, this section is "under
construction"!
Of course, I will be more than happy to work together, or just exchange ideas here!


CHAPTER 2 Let´s build a compiler...
This sixteen-part series, written from 1988 to 1995, is a non-technical introduction to compiler
construction [1] and is Copyright (C) 1988 Jack W. Crenshaw.
You may ask: “This book should be about writing disassemblers, not compilers. What the heck
are you doing here?”
The answer is:

Do you know what a compiler is? How it works? Let me give you a short introduction, and
then you will see WHY I have included chapter 2.
When you code an application, you start by typing source code in your chosen language into
your favorite IDE. Then you usually push a “compile” button, and after a few seconds and
some magic you have a working application.
1. The original URL is: http://compilers.iecc.com/crenshaw/

But how does this magic work?
Well, first your source code is checked for errors. This is called “Lexical Scanning”
and “Parsing”. One part of the compiler (for languages like JavaScript this magician is
called the “Interpreter”) scans the source for typos and for correct “grammar”.
If anything goes wrong you will receive an error like “parsing error in line 546” or “if
without end in line 276”.
If everything is OK, the compiler translates your (human-readable) source code to
assembly code (for example mov eax,0).
When this is finished, the new code (assembly code) is translated to opcodes.

The correct opcode on a 32-bit x86 machine for PUSH 0 is 6A 00
or for PUSH DWORD PTR DS:[402048] it is FF35 48204000
As you can see, each assembly command is translated to hex values that correspond to
it. The hex values are called “Opcodes” and the corresponding commands
“Mnemonics”.
If you take a hex editor and open a file, this is exactly what you get!
With that, the compilation of the application is finished. Reading these hex values and
actually running the program is not our problem. The computer does this with some magic
that we do not need now.
Maybe you can see now WHY compilers are related to disassemblers... No?
OK, here we go:
We want to disassemble a file. Let´s assume we do this manually. We open the file with
a hex editor. Then we take the first hex value, look it up in our opcode/mnemonic table,
and if we find it we write the mnemonic down (like mov eax,0).
If we do not find the value in our table, we take the second hex value as well. Then we look
up the combined value (the first and the second hex value) in our opcode/mnemonic
table. If it is not found, we add the third, and so on. So FF35 48204000 may be PUSH
DWORD PTR DS:[402048]. Of course, the result depends on the processor and the
opcode/mnemonic table we use. Remember: after 15 hex values we should have a result.
If not, something is wrong, because the maximum length of an x86 instruction is 15 bytes!
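The manual procedure just described can be sketched in Python as follows. The table below is a toy assumption that holds only the two examples from the text (a real opcode/mnemonic table is organized very differently), and the names TABLE and lookup are made up for the illustration.

```python
# Toy opcode/mnemonic table with the two examples used in the text.
TABLE = {
    bytes.fromhex("6A00"): "PUSH 0",
    bytes.fromhex("FF3548204000"): "PUSH DWORD PTR DS:[402048]",
}
MAX_LEN = 15  # maximum length of one x86 instruction in bytes


def lookup(code: bytes, pos: int = 0):
    """Try 1, 2, 3, ... bytes starting at pos until the table has a match.

    Returns (mnemonic, number_of_bytes_used).
    """
    for n in range(1, MAX_LEN + 1):
        chunk = code[pos:pos + n]
        if len(chunk) < n:          # ran out of input
            break
        if chunk in TABLE:
            return TABLE[chunk], n
    raise ValueError(f"no match within {MAX_LEN} bytes at offset {pos}")


if __name__ == "__main__":
    data = bytes.fromhex("6A00FF3548204000")
    pos = 0
    while pos < len(data):
        text, used = lookup(data, pos)
        print(text)
        pos += used
```

After each match, the search restarts at the byte following the matched instruction, just as one would do by hand with a hex editor.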

Now you can see that a compiler and a disassembler are exactly the same!
Well, not really... Only in parts.
Imagine this:
We add more functionality to our disassembler. After getting the correct opcodes and
mnemonics we add another magic function: translating the mnemonics to “source” code of
any language.
Then we would have a reversed compiler. This is what we call a decompiler, and as you can
see, the disassembler is one part of it.
Yep. Now you see: if you know how a compiler works, it is easy to understand a decompiler
and a disassembler.
If you are not interested in diving into compiler construction, you can skip this
chapter, but I really recommend reading some of it.
So let´s go, this will be a long and hard but interesting part of this book... See you after
this chapter, with a few more grey hairs...

Part 1 - Introduction
This series of articles is a tutorial on the theory and practice of developing language pars-
ers and compilers. Before we are finished, we will have covered every aspect of compiler
construction, designed a new programming language, and built a working compiler.
Though I am not a computer scientist by education (my Ph.D. is in a different field, Phys-
ics), I have been interested in compilers for many years. I have bought and tried to digest
the contents of virtually every book on the subject ever written. I don't mind telling you that
it was slow going. Compiler texts are written for Computer Science majors, and are tough
sledding for the rest of us. But over the years a bit of it began to seep in. What really
caused it to jell was when I began to branch off on my own and begin to try things on my
own computer. Now I plan to share with you what I have learned. At the end of this series
you will by no means be a computer scientist, nor will you know all the esoterics of com-
piler theory. I intend to completely ignore the more theoretical aspects of the subject.
What you _WILL_ know is all the practical aspects that one needs to know to build a
working system.
This is a "learn-by-doing" series. In the course of the series I will be performing experi-
ments on a computer. You will be expected to follow along, repeating the experiments that
I do, and performing some on your own. I will be using Turbo Pascal 4.0 on a PC clone. I
will periodically insert examples written in TP. These will be executable code, which you
will be expected to copy into your own computer and run. If you don't have a copy of
Turbo, you will be severely limited in how well you will be able to follow what's going on. If
you don't have a copy, I urge you to get one. After all, it's an excellent product, good for
many other uses!
Some articles on compilers show you examples, or show you (as in the case of Small-C)
a finished product, which you can then copy and use without a whole lot of understanding
of how it works. I hope to do much more than that. I hope to teach you HOW the things
get done, so that you can go off on your own and not only reproduce what I have done,
but improve on it.

This is admittedly an ambitious undertaking, and it won't be done in one page. I expect to do
it in the course of a number of articles. Each article will cover a single aspect of compiler the-
ory, and will pretty much stand alone. If all you're interested in at a given time is one aspect,
then you need to look only at that one article. Each article will be uploaded as it is complete,
so you will have to wait for the last one before you can consider yourself finished. Please be
patient.
The average text on compiler theory covers a lot of ground that we won't be covering here.
The typical sequence is:
o An introductory chapter describing what a compiler is.
o A chapter or two on syntax equations, using Backus-Naur Form (BNF).
o A chapter or two on lexical scanning, with emphasis on deterministic and non-deterministic
finite automata.
o Several chapters on parsing theory, beginning with top-down recursive descent, and ending
with LALR parsers.
o A chapter on intermediate languages, with emphasis on P-code and similar reverse polish
representations.
o Many chapters on alternative ways to handle subroutines and parameter passing, type dec-
larations, and such.
o A chapter toward the end on code generation, usually for some imaginary CPU with a sim-
ple instruction set. Most readers (and in fact, most college classes) never make it this far.
o A final chapter or two on optimization. This chapter often goes unread, too.

I'll be taking a much different approach in this series. To begin with, I won't dwell long on
options. I'll be giving you _A_ way that works. If you want to explore options, well and
good ... I encourage you to do so ... but I'll be sticking to what I know. I also will skip over
most of the theory that puts people to sleep. Don't get me wrong: I don't belittle the theory,
and it's vitally important when it comes to dealing with the more tricky parts of a given lan-
guage. But I believe in putting first things first. Here we'll be dealing with the 95% of com-
piler techniques that don't need a lot of theory to handle.
I also will discuss only one approach to parsing: top-down, recursive descent parsing,
which is the _ONLY_ technique that's at all amenable to hand-crafting a compiler. The
other approaches are only useful if you have a tool like YACC, and also don't care how
much memory space the final product uses.
I also take a page from the work of Ron Cain, the author of the original Small C. Whereas
almost all other compiler authors have historically used an intermediate language like P-
code and divided the compiler into two parts (a front end that produces P-code, and a
back end that processes P-code to produce executable object code), Ron showed us that
it is a straightforward matter to make a compiler directly produce executable object code,
in the form of assembler language statements. The code will _NOT_ be the world's tight-
est code ... producing optimized code is a much more difficult job. But it will work, and
work reasonably well. Just so that I don't leave you with the impression that our end prod-
uct will be worthless, I _DO_ intend to show you how to "soup up" the compiler with some
optimization.
Finally, I'll be using some tricks that I've found to be most helpful in letting me understand
what's going on without wading through a lot of boiler plate. Chief among these is the use
of single-character tokens, with no embedded spaces, for the early design work. I figure
that if I can get a parser to recognize and deal with I-T-L, I can get it to do the same with
IF-THEN-ELSE. And I can. In the second "lesson," I'll show you just how easy it is to
extend a simple parser to handle tokens of arbitrary length. As another trick, I completely
ignore file I/O, figuring that if I can read source from the keyboard and output object to the
screen, I can also do it from/to disk files. Experience has proven that once a translator is
working correctly, it's a straightforward matter to redirect the I/O to files. The last trick is
that I make no attempt to do error correction/recovery. The programs we'll be building will
RECOGNIZE errors, and will not CRASH, but they will simply stop on the first error ... just
like good ol' Turbo does. There will be other tricks that you'll see as you go. Most of them
can't be found in any compiler textbook, but they work.

A word about style and efficiency. As you will see, I tend to write programs in _VERY_ small,
easily understood pieces. None of the procedures we'll be working with will be more than
about 15-20 lines long. I'm a fervent devotee of the KISS (Keep It Simple, Sidney) school of
software development. I try to never do something tricky or complex, when something simple
will do. Inefficient? Perhaps, but you'll like the results. As Brian Kernighan has said, FIRST
make it run, THEN make it run fast. If, later on, you want to go back and tighten up the code
in one of our products, you'll be able to do so, since the code will be quite understandable. If
you do so, however, I urge you to wait until the program is doing everything you want it to.
I also have a tendency to delay building a module until I discover that I need it. Trying to antic-
ipate every possible future contingency can drive you crazy, and you'll generally guess wrong
anyway. In this modern day of screen editors and fast compilers, I don't hesitate to change a
module when I feel I need a more powerful one. Until then, I'll write only what I need.
One final caveat: One of the principles we'll be sticking to here is that we don't fool around
with P-code or imaginary CPUs, but that we will start out on day one producing working, exe-
cutable object code, at least in the form of assembler language source. However, you may
not like my choice of assembler language ... it's 68000 code, which is what works on my sys-
tem (under SK*DOS). I think you'll find, though, that the translation to any other CPU such as
the 80x86 will be quite obvious, though, so I don't see a problem here. In fact, I hope some-
one out there who knows the '86 language better than I do will offer us the equivalent object
code fragments as we need them.

THE CRADLE
Every program needs some boiler plate ... I/O routines, error message routines, etc. The
programs we develop here will be no exceptions. I've tried to hold this stuff to an absolute
minimum, however, so that we can concentrate on the important stuff without losing it
among the trees. The code given below represents about the minimum that we need to
get anything done. It consists of some I/O routines, an error-handling routine and a skele-
ton, null main program. I call it our cradle. As we develop other routines, we'll add them to
the cradle, and add the calls to them as we need to. Make a copy of the cradle and save
it, because we'll be using it more than once.
There are many different ways to organize the scanning activities of a parser. In Unix sys-
tems, authors tend to use getc and ungetc. I've had very good luck with the approach
shown here, which is to use a single, global, lookahead character. Part of the initialization
procedure (the only part, so far!) serves to "prime the pump" by reading the first character
from the input stream. No other special techniques are required with Turbo 4.0 ... each
successive call to GetChar will read the next character in the stream.

{--------------------------------------------------------------}
program Cradle;
{--------------------------------------------------------------}
{ Constant Declarations }
const TAB = ^I;
{--------------------------------------------------------------}
{ Variable Declarations }
var Look: char; { Lookahead Character }
{--------------------------------------------------------------}
{ Read New Character From Input Stream }
procedure GetChar;
begin
   Read(Look);
end;
{--------------------------------------------------------------}
{ Report an Error }
procedure Error(s: string);
begin
   WriteLn;
   WriteLn(^G, 'Error: ', s, '.');
end;

{--------------------------------------------------------------}
{ Report Error and Halt }
procedure Abort(s: string);
begin
   Error(s);
   Halt;
end;
{--------------------------------------------------------------}
{ Report What Was Expected }
procedure Expected(s: string);
begin
   Abort(s + ' Expected');
end;
{--------------------------------------------------------------}
{ Match a Specific Input Character }
procedure Match(x: char);
begin
   if Look = x then GetChar
   else Expected('''' + x + '''');
end;

{--------------------------------------------------------------}
{ Recognize an Alpha Character }
function IsAlpha(c: char): boolean;
begin
   IsAlpha := upcase(c) in ['A'..'Z'];
end;
{--------------------------------------------------------------}
{ Recognize a Decimal Digit }
function IsDigit(c: char): boolean;
begin
   IsDigit := c in ['0'..'9'];
end;
{--------------------------------------------------------------}
{ Get an Identifier }
function GetName: char;
begin
   if not IsAlpha(Look) then Expected('Name');
   GetName := UpCase(Look);
   GetChar;
end;

{--------------------------------------------------------------}
{ Get a Number }
function GetNum: char;
begin
   if not IsDigit(Look) then Expected('Integer');
   GetNum := Look;
   GetChar;
end;
{--------------------------------------------------------------}
{ Output a String with Tab }
procedure Emit(s: string);
begin
   Write(TAB, s);
end;
{--------------------------------------------------------------}
{ Output a String with Tab and CRLF }
procedure EmitLn(s: string);
begin
   Emit(s);
   WriteLn;
end;

{--------------------------------------------------------------}
{ Initialize }
procedure Init;
begin
   GetChar;
end;
{--------------------------------------------------------------}
{ Main Program }
begin
   Init;
end.
{--------------------------------------------------------------}
That's it for this introduction. Copy the code above into TP and compile it. Make sure that it
compiles and runs correctly. Then proceed to the first lesson, which is on expression parsing.

Part 2 - Expression Parsing
GETTING STARTED
If you've read the introduction document to this series, you will already know what we're
about. You will also have copied the cradle software into your Turbo Pascal system, and
have compiled it. So you should be ready to go.
The purpose of this article is for us to learn how to parse and translate mathematical
expressions. What we would like to see as output is a series of assembler-language
statements that perform the desired actions. For purposes of definition, an expression is
the right-hand side of an equation, as in
x = 2*y + 3/(4*z)
In the early going, I'll be taking things in _VERY_ small steps. That's so that the beginners
among you won't get totally lost. There are also some very good lessons to be learned
early on, that will serve us well later. For the more experienced readers: bear with me.
We'll get rolling soon enough.

SINGLE DIGITS
In keeping with the whole theme of this series (KISS, remember?), let's start with the abso-
lutely most simple case we can think of. That, to me, is an expression consisting of a single
digit.
Before starting to code, make sure you have a baseline copy of the "cradle" that I gave last
time. We'll be using it again for other experiments. Then add this code:
{---------------------------------------------------------------}
{ Parse and Translate a Math Expression }
procedure Expression;
begin
   EmitLn('MOVE #' + GetNum + ',D0')
end;
{---------------------------------------------------------------}
And add the line "Expression;" to the main program so that it reads:
{---------------------------------------------------------------}
begin
   Init;
   Expression;
end.
{---------------------------------------------------------------}

Now run the program. Try any single-digit number as input. You should get a single line of
assembler-language output. Now try any other character as input, and you'll see that the
parser properly reports an error.
CONGRATULATIONS! You have just written a working translator!
OK, I grant you that it's pretty limited. But don't brush it off too lightly. This little "compiler"
does, on a very limited scale, exactly what any larger compiler does: it correctly recog-
nizes legal statements in the input "language" that we have defined for it, and it produces
correct, executable assembler code, suitable for assembling into object format. Just as
importantly, it correctly recognizes statements that are NOT legal, and gives a meaningful
error message. Who could ask for more? As we expand our parser, we'd better make
sure those two characteristics always hold true.
There are some other features of this tiny program worth mentioning. First, you can see
that we don't separate code generation from parsing ... as soon as the parser knows what
we want done, it generates the object code directly. In a real compiler, of course, the
reads in GetChar would be from a disk file, and the writes to another disk file, but this way
is much easier to deal with while we're experimenting.
Also note that an expression must leave a result somewhere. I've chosen the 68000
register D0. I could have made some other choices, but this one makes sense.
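
If you don't have the cradle handy, this step can be tried in isolation with the sketch below. It is a cut-down stand-in, not the cradle itself: Src and Posn are hypothetical names that feed the parser from a string instead of the keyboard, and only the routines this step needs are included.

```pascal
program SingleDigit;
{ Sketch only. Src and Posn are assumptions that replace keyboard input. }
var
  Src: string;     { text to translate }
  Posn: integer;   { index of the next unread character in Src }
  Look: char;      { lookahead character }

{ Read the next character, or a carriage return once Src is used up }
procedure GetChar;
begin
  if Posn <= Length(Src) then Look := Src[Posn] else Look := #13;
  Posn := Posn + 1;
end;

{ Report what was expected and stop }
procedure Expected(s: string);
begin
  WriteLn('Error: ', s, ' Expected');
  Halt(1);
end;

{ Get a single digit }
function GetNum: char;
begin
  if not (Look in ['0'..'9']) then Expected('Integer');
  GetNum := Look;
  GetChar;
end;

{ Output a line of assembler, indented by a tab }
procedure EmitLn(s: string);
begin
  WriteLn(#9, s);
end;

{ Parse and translate a single-digit expression }
procedure Expression;
begin
  EmitLn('MOVE #' + GetNum + ',D0');
end;

begin
  Src := '7';
  Posn := 1;
  GetChar;
  Expression;
end.
```

Run as is, it should print the single line MOVE #7,D0 (after a tab). Change Src to 'x' and it stops with an "Integer Expected" message instead.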

BINARY EXPRESSIONS
Now that we have that under our belt, let's branch out a bit. Admittedly, an "expression" con-
sisting of only one character is not going to meet our needs for long, so let's see what we can
do to extend it. Suppose we want to handle expressions of the form:
1+2
or 4-3
or, in general, <term> +/- <term>
To do this we need a procedure that recognizes a term and leaves its result somewhere, and
another that recognizes and distinguishes between a '+' and a '-' and generates the appropriate
code. But if Expression is going to leave its result in D0, where should Term leave its
result? Answer: the same place. We're going to have to save the first result of Term some-
where before we get the next one.

OK, basically what we want to do is have procedure Term do what Expression was doing
before. So just RENAME procedure Expression as Term, and enter the following new ver-
sion of Expression:
{---------------------------------------------------------------}
{ Parse and Translate an Expression }
procedure Expression;
begin
Term;
EmitLn('MOVE D0,D1');
case Look of
'+': Add;
'-': Subtract;
else Expected('Addop');
end;
end;
{--------------------------------------------------------------}

Next, just above Expression enter these two procedures:
{--------------------------------------------------------------}
{ Recognize and Translate an Add }
procedure Add;
begin
Match('+');
Term;
EmitLn('ADD D1,D0');
end;
{-------------------------------------------------------------}
{ Recognize and Translate a Subtract }
procedure Subtract;
begin
Match('-');
Term;
EmitLn('SUB D1,D0');
end;
{-------------------------------------------------------------}

When you're finished with that, the order of the routines should be:
o Term (The OLD Expression)
o Add
o Subtract
o Expression
Now run the program. Try any combination of two single digits you can think of, separated
by a '+' or a '-'. You should get a series of four assembler-language instructions out of
each run. Now try some expressions with deliberate errors in them. Does the parser catch
the errors?
Take a look at the object code generated. There are two observations we can make. First,
the code generated is NOT what we would write ourselves. The sequence
MOVE #n,D0
MOVE D0,D1
is inefficient. If we were writing this code by hand, we would probably just load the data
directly to D1.
There is a message here: code generated by our parser is less efficient than the code we
would write by hand. Get used to it. That's going to be true throughout this series. It's true
of all compilers to some extent. Computer scientists have devoted whole lifetimes to the
issue of code optimization, and there are indeed things that can be done to improve the
quality of code output. Some compilers do quite well, but there is a heavy price to pay in
complexity, and it's a losing battle anyway ... there will probably never come a time when
a good assembler-language programmer can't out-program a compiler. Before this ses-
sion is over, I'll briefly mention some ways that we can do a little optimization, just to
show you that we can indeed improve things without too much trouble. But remember,
we're here to learn, not to see how tight we can make the object code. For now, and really
throughout this series of articles, we'll studiously ignore optimization and concentrate on
getting out code that works.

Speaking of which: ours DOESN'T! The code is _WRONG_! As things are working now, the
subtraction process subtracts D1 (which has the FIRST argument in it) from D0 (which has
the second). That's the wrong way, so we end up with the wrong sign for the result. So let's fix
up procedure Subtract with a sign-changer, so that it reads
{-------------------------------------------------------------}
procedure Subtract;
begin
Match('-');
Term;
EmitLn('SUB D1,D0');
EmitLn('NEG D0');
end;
{-------------------------------------------------------------}
Now our code is even less efficient, but at least it gives the right answer! Unfortunately, the
rules that give the meaning of math expressions require that the terms in an expression come
out in an inconvenient order for us. Again, this is just one of those facts of life you learn to live
with. This one will come back to haunt us when we get to division.
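
To see why the sign comes out wrong, it helps to trace the generated code for 4-3 by hand. The little program below does that, modelling D0 and D1 as plain integer variables. It is only an illustration, and it relies on the fact that the 68000's SUB subtracts the source operand from the destination.

```pascal
program SubOrder;
{ Sketch only: models the two data registers as integers to trace 4-3. }
var
  D0, D1: integer;
begin
  D0 := 4;          { MOVE #4,D0   first term }
  D1 := D0;         { MOVE D0,D1   save it }
  D0 := 3;          { MOVE #3,D0   second term }
  D0 := D0 - D1;    { SUB D1,D0    destination minus source }
  WriteLn('after SUB: ', D0);
  D0 := -D0;        { NEG D0 }
  WriteLn('after NEG: ', D0);
end.
```

It prints -1 after the SUB and 1 after the NEG, which is the answer we wanted in the first place.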
OK, at this point we have a parser that can recognize the sum or difference of two digits. Ear-
lier, we could only recognize a single digit. But real expressions can have either form (or an
infinity of others). For kicks, go back and run the program with the single input line '1'.
Didn't work, did it? And why should it? We just finished telling our parser that the only kinds of
expressions that are legal are those with two terms. We must rewrite procedure Expression
to be a lot more broadminded, and this is where things start to take the shape of a real parser.

GENERAL EXPRESSIONS
In the REAL world, an expression can consist of one or more terms, separated by
"addops" ('+' or '-'). In BNF, this is written
<expression> ::= <term> [<addop> <term>]*
We can accommodate this definition of an expression with the addition of a simple loop to
procedure Expression:
{---------------------------------------------------------------}
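{ Parse and Translate an Expression }
procedure Expression;
begin
   Term;
   while Look in ['+', '-'] do begin
      EmitLn('MOVE D0,D1');
      case Look of
       '+': Add;
       '-': Subtract;
      else Expected('Addop');
      end;
   end;
end;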
{--------------------------------------------------------------}

NOW we're getting somewhere! This version handles any number of terms, and it only cost
us two extra lines of code. As we go on, you'll discover that this is characteristic of top-down
parsers ... it only takes a few lines of code to accommodate extensions to the language. That's
what makes our incremental approach possible. Notice, too, how well the code of procedure
Expression matches the BNF definition. That, too, is characteristic of the method. As you get
proficient in the approach, you'll find that you can turn BNF into parser code just about as fast
as you can type!
OK, compile the new version of our parser, and give it a try. As usual, verify that the "com-
piler" can handle any legal expression, and will give a meaningful error message for an illegal
one. Neat, eh? You might note that in our test version, any error message comes out sort of
buried in whatever code had already been generated. But remember, that's just because we
are using the CRT as our "output file" for this series of experiments. In a production version,
the two outputs would be separated ... one to the output file, and one to the screen.

USING THE STACK

At this point I'm going to violate my rule that we don't introduce any complexity until it's
absolutely necessary, long enough to point out a problem with the code we're generating.
As things stand now, the parser uses D0 for the "primary" register, and D1 as a place to
store the partial sum. That works fine for now, because as long as we deal with only the
"addops" '+' and '-', any new term can be added in as soon as it is found. But in general
that isn't true. Consider, for example, the expression
1+(2-(3+(4-5)))
If we put the '1' in D1, where do we put the '2'? Since a general expression can have any
degree of complexity, we're going to run out of registers fast!
Fortunately, there's a simple solution. Like every modern microprocessor, the 68000 has
a stack, which is the perfect place to save a variable number of items. So instead of mov-
ing the term in D0 to D1, let's just push it onto the stack. For the benefit of those unfamil-
iar with 68000 assembler language, a push is written
-(SP)
and a pop, (SP)+ .
So let's change the EmitLn in Expression to read:
EmitLn('MOVE D0,-(SP)');
and the two lines in Add and Subtract to
EmitLn('ADD (SP)+,D0')
and EmitLn('SUB (SP)+,D0'),
respectively. Now try the parser again and make sure we haven't broken it. Once again,
the generated code is less efficient than before, but it's a necessary step, as you'll see.

MULTIPLICATION AND DIVISION

Now let's get down to some REALLY serious business. As you all know, there are other math
operators than "addops" ... expressions can also have multiply and divide operations. You
also know that there is an implied operator PRECEDENCE, or hierarchy, associated with
expressions, so that in an expression like
2 + 3 * 4,
we know that we're supposed to multiply FIRST, then add. (See why we needed the stack?)
In the early days of compiler technology, people used some rather complex techniques to
ensure that the operator precedence rules were obeyed. It turns out, though, that none of this
is necessary ... the rules can be accommodated quite nicely by our top-down parsing tech-
nique. Up till now, the only form that we've considered for a term is that of a single decimal
digit.
More generally, we can define a term as a PRODUCT of FACTORS; i.e.,
<term> ::= <factor> [ <mulop> <factor> ]*
What is a factor? For now, it's what a term used to be ... a single digit.
Notice the symmetry: a term has the same form as an expression. As a matter of fact, we can
add to our parser with a little judicious copying and renaming. But to avoid confusion, the list-
ing below is the complete set of parsing routines. (Note the way we handle the reversal of
operands in Divide.)

{---------------------------------------------------------------}
{ Parse and Translate a Math Factor }
procedure Factor;
begin
EmitLn('MOVE #' + GetNum + ',D0')
end;
{--------------------------------------------------------------}
{ Recognize and Translate a Multiply }
procedure Multiply;
begin
Match('*');
Factor;
EmitLn('MULS (SP)+,D0');
end;

{-------------------------------------------------------------}
{ Recognize and Translate a Divide }
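procedure Divide;
begin
   Match('/');
   Factor;
   EmitLn('MOVE (SP)+,D1');   { D1 = first operand (the dividend) }
   EmitLn('EXG D0,D1');       { swap: D0 = dividend, D1 = divisor }
   EmitLn('DIVS D1,D0');      { D0 = D0 / D1 }
end;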

{---------------------------------------------------------------}
{ Parse and Translate a Math Term }
procedure Term;
begin
   Factor;
   while Look in ['*', '/'] do begin
      EmitLn('MOVE D0,-(SP)');
      case Look of
       '*': Multiply;
       '/': Divide;
      else Expected('Mulop');
      end;
   end;
end;

{--------------------------------------------------------------}
procedure Add;
begin
Match('+');
Term;
EmitLn('ADD (SP)+,D0');
end;
{-------------------------------------------------------------}
procedure Subtract;
begin
Match('-');
Term;
EmitLn('SUB (SP)+,D0');
EmitLn('NEG D0');
end;

{---------------------------------------------------------------}
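{ Parse and Translate an Expression }
procedure Expression;
begin
   Term;
   while Look in ['+', '-'] do begin
      EmitLn('MOVE D0,-(SP)');
      case Look of
       '+': Add;
       '-': Subtract;
      else Expected('Addop');
      end;
   end;
end;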
{--------------------------------------------------------------}
Hot dog! A NEARLY functional parser/translator, in only 55 lines of Pascal! The output is
starting to look really useful, if you continue to overlook the inefficiency, which I hope you
will. Remember, we're not trying to produce tight code here.
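
To watch the precedence rule fall out of the structure, here is a compact sketch that can be run without the cradle. It makes several assumptions: input comes from the string Src (a hypothetical name) rather than the keyboard, Add and Subtract are folded into Expression to save space, and division is left out altogether.

```pascal
program Precedence;
{ Sketch only: string-fed stand-in for the cradle; '/' is left out. }
var
  Src: string;     { text to translate (assumption: no keyboard input) }
  Posn: integer;   { index of the next unread character in Src }
  Look: char;      { lookahead character }

procedure GetChar;
begin
  if Posn <= Length(Src) then Look := Src[Posn] else Look := #13;
  Posn := Posn + 1;
end;

procedure Expected(s: string);
begin
  WriteLn('Error: ', s, ' Expected');
  Halt(1);
end;

procedure Match(x: char);
begin
  if Look = x then GetChar else Expected('''' + x + '''');
end;

procedure EmitLn(s: string);
begin
  WriteLn(#9, s);
end;

{ <factor> ::= <digit> }
procedure Factor;
begin
  if not (Look in ['0'..'9']) then Expected('Integer');
  EmitLn('MOVE #' + Look + ',D0');
  GetChar;
end;

{ <term> ::= <factor> [ '*' <factor> ]* }
procedure Term;
begin
  Factor;
  while Look = '*' do begin
    EmitLn('MOVE D0,-(SP)');      { save the left operand }
    Match('*');
    Factor;
    EmitLn('MULS (SP)+,D0');
  end;
end;

{ <expression> ::= <term> [ <addop> <term> ]* }
procedure Expression;
var Op: char;
begin
  Term;
  while Look in ['+', '-'] do begin
    EmitLn('MOVE D0,-(SP)');      { save the left operand }
    Op := Look;
    Match(Op);
    Term;
    if Op = '+' then
      EmitLn('ADD (SP)+,D0')
    else begin
      EmitLn('SUB (SP)+,D0');
      EmitLn('NEG D0');
    end;
  end;
end;

begin
  Src := '2+3*4';
  Posn := 1;
  GetChar;
  Expression;
end.
```

For 2+3*4 the output should be, in order: MOVE #2,D0, MOVE D0,-(SP), MOVE #3,D0, MOVE D0,-(SP), MOVE #4,D0, MULS (SP)+,D0, ADD (SP)+,D0. The multiply is emitted before the add because Expression has to finish its call to Term, and Term swallows the whole 3*4, before the ADD can be written.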

PARENTHESES
We can wrap up this part of the parser by adding parentheses to math expressions.
As you know, parentheses are a mechanism to force a desired operator precedence.
So, for example, in the expression
2*(3+4) ,
the parentheses force the addition before the multiply. Much more importantly, though, paren-
theses give us a mechanism for defining expressions of any degree of complexity, as in
(1+2)/((3+4)+(5-6))
The key to incorporating parentheses into our parser is to realize that no matter how compli-
cated an expression enclosed by parentheses may be, to the rest of the world it looks like a
simple factor. That is, one of the forms for a factor is:
<factor> ::= (<expression>)
This is where the recursion comes in. An expression can contain a factor which contains
another expression which contains a factor, etc., ad infinitum.

Complicated or not, we can take care of this by adding just a few lines of Pascal to proce-
dure Factor:
{---------------------------------------------------------------}
procedure Expression; Forward;
procedure Factor;
begin
if Look = '(' then begin
Match('(');
Expression;
Match(')');
end
else
EmitLn('MOVE #' + GetNum + ',D0');
end;
{--------------------------------------------------------------}
Note again how easily we can extend the parser, and how well the Pascal code matches
the BNF syntax.
As usual, compile the new version and make sure that it correctly parses legal sentences,
and flags illegal ones with an error message.

UNARY MINUS
At this point, we have a parser that can handle just about any expression, right? OK, try this
input sentence:
-1
WOOPS! It doesn't work, does it? Procedure Expression expects everything to start with an
integer, so it coughs up the leading minus sign. You'll find that +3 won't work either, nor will
something like
-(3-2) .

There are a couple of ways to fix the problem. The easiest (although not necessarily the
best) way is to stick an imaginary leading zero in front of expressions of this type, so that
-3 becomes 0-3. We can easily patch this into our existing version of Expression:
{---------------------------------------------------------------}
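{ Parse and Translate an Expression }
procedure Expression;
begin
   if IsAddop(Look) then
      EmitLn('CLR D0')
   else
      Term;
   while IsAddop(Look) do begin
      EmitLn('MOVE D0,-(SP)');
      case Look of
       '+': Add;
       '-': Subtract;
      else Expected('Addop');
      end;
   end;
end;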
{--------------------------------------------------------------}

I TOLD you that making changes was easy! This time it cost us only three new lines of Pas-
cal. Note the new reference to function IsAddop. Since the test for an addop appeared twice,
I chose to embed it in the new function. The form of IsAddop should be apparent from that for
IsAlpha. Here it is:
{--------------------------------------------------------------}
{ Recognize an Addop }
function IsAddop(c: char): boolean;
begin
IsAddop := c in ['+', '-'];
end;
{--------------------------------------------------------------}
OK, make these changes to the program and recompile. You should also include IsAddop in
your baseline copy of the cradle. We'll be needing it again later. Now try the input -1 again.
Wow! The efficiency of the code is pretty poor ... five lines of code just for loading a simple
constant ... but at least it's correct. Remember, we're not trying to replace Turbo Pascal here.
At this point we're just about finished with the structure of our expression parser. This version
of the program should correctly parse and compile just about any expression you care to
throw at it. It's still limited in that we can only handle factors involving single decimal digits.
But I hope that by now you're starting to get the message that we can accommodate further
extensions with just some minor changes to the parser. You probably won't be surprised to
hear that a variable or even a function call is just another kind of a factor.
In the next session, I'll show you just how easy it is to extend our parser to take care of these
things too, and I'll also show you just how easily we can accommodate multicharacter numbers
and variable names. So you see, we're not far at all from a truly useful parser.

A WORD ABOUT OPTIMIZATION

Earlier in this session, I promised to give you some hints as to how we can improve the
quality of the generated code. As I said, the production of tight code is not the main pur-
pose of this series of articles. But you need to at least know that we aren't just wasting our
time here ... that we can indeed modify the parser further to make it produce better code,
without throwing away everything we've done to date. As usual, it turns out that SOME
optimization is not that difficult to do ... it simply takes some extra code in the parser.
There are two basic approaches we can take:
o Try to fix up the code after it's generated

This is the concept of "peephole" optimization. The general idea is that we
know what combinations of instructions the compiler is going to generate,
and we also know which ones are pretty bad (such as the code for -1, above).
So all we do is to scan the produced code, looking for those combina-
tions, and replacing them by better ones. It's sort of a macro expansion,
in reverse, and a fairly straightforward exercise in pattern-matching.
The only complication, really, is that there may be a LOT of such combina-
tions to look for. It's called peephole optimization simply because it only looks
at a small group of instructions at a time. Peephole optimization can have a
dramatic effect on the quality of the code, with little change to the structure
of the compiler itself. There is a price to pay, though, in the
speed, size, and complexity of the compiler. Looking for all those combina-
tions calls for a lot of IF tests, each one of which is a source of error. And, of
course, it takes time. In the classical implementation of a peephole opti-
mizer, it's done as a second pass to the compiler. The output code is written
to disk, and then the optimizer reads and processes the disk file again. As
a matter of fact, you can see that the optimizer could even be a separate
PROGRAM from the compiler proper. Since the optimizer only looks at the
code through a small "window" of instructions (hence the name), a better
implementation would be to simply buffer up a few lines of output, and scan
the buffer after each EmitLn.

o Try to generate better code in the first place

This approach calls for us to look for special cases BEFORE we Emit them. As a
trivial example, we should be able to identify a constant zero, and Emit a CLR
instead of a load, or even do nothing at all, as in an add of zero, for example.
Closer to home, if we had chosen to recognize the unary minus in Factor instead
of in Expression, we could treat constants like -1 as ordinary constants, rather
than generating them from positive ones. None of these things are difficult to
deal with ... they only add extra tests in the code, which is why I haven't included
them in our program. The way I see it, once we get to the point that we have a
working compiler, generating useful code that executes, we can always go back
and tweak the thing to tighten up the code produced. That's why there are
Release 2.0's in the world.
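
As a rough sketch of the first approach (fixing up the code after it's generated), the program below buffers a single line inside EmitLn and folds one pattern: a constant load into D0 followed immediately by a push of D0. Pending, HavePending, Flush and IsLoadConst are hypothetical names, and the fold is only safe under the assumption that D0 is always reloaded before it is read again, which holds for the code our parser emits but not in general.

```pascal
program Peephole;
{ Sketch only: EmitLn holds back one line so that a load followed by a
  push can be folded. All helper names here are hypothetical. }
var
  Pending: string;        { last line passed to EmitLn, not yet written }
  HavePending: boolean;   { true while Pending holds a line }

{ Write out the held line, if any }
procedure Flush;
begin
  if HavePending then WriteLn(#9, Pending);
  HavePending := false;
end;

{ True for a line of the form 'MOVE #<constant>,D0' }
function IsLoadConst(s: string): boolean;
begin
  IsLoadConst := false;
  if Length(s) >= 10 then
    IsLoadConst := (Copy(s, 1, 6) = 'MOVE #') and
                   (Copy(s, Length(s) - 2, 3) = ',D0');
end;

procedure EmitLn(s: string);
begin
  if HavePending and IsLoadConst(Pending) and (s = 'MOVE D0,-(SP)') then
    { swap the D0 destination for -(SP): the constant is pushed directly }
    Pending := Copy(Pending, 1, Length(Pending) - 2) + '-(SP)'
  else begin
    Flush;
    Pending := s;
    HavePending := true;
  end;
end;

begin
  Pending := '';
  HavePending := false;
  { the unoptimized code the parser produces for 1+2 }
  EmitLn('MOVE #1,D0');
  EmitLn('MOVE D0,-(SP)');
  EmitLn('MOVE #2,D0');
  EmitLn('ADD (SP)+,D0');
  Flush;
end.
```

The four input lines come out as three: MOVE #1,-(SP), MOVE #2,D0, ADD (SP)+,D0. Note the final call to Flush; with a buffered EmitLn, something has to push out the last held line when the parser is done.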
There IS one more type of optimization worth mentioning, that seems to promise pretty tight
code without too much hassle. It's my "invention" in the sense that I haven't seen it suggested
in print anywhere, though I have no illusions that it's original with me.
This is to avoid such a heavy use of the stack, by making better use of the CPU registers.
Remember back when we were doing only addition and subtraction, that we used registers
D0 and D1, rather than the stack? It worked, because with only those two operations, the
"stack" never needs more than two entries.
Well, the 68000 has eight data registers. Why not use them as a privately managed stack?
The key is to recognize that, at any point in its processing, the parser KNOWS how many
items are on the stack, so it can indeed manage it properly. We can define a private "stack
pointer" that keeps track of which stack level we're at, and addresses the corresponding reg-
ister. Procedure Factor, for example, would not cause data to be loaded into register D0, but
into whatever the current "top-of-stack" register happened to be.
What we're doing in effect is to replace the CPU's RAM stack with a locally managed stack
made up of registers. For most expressions, the stack level will never exceed eight, so we'll
get pretty good code out. Of course, we also have to deal with those odd cases where the
stack level DOES exceed eight, but that's no problem either. We simply let the stack spill over
into the CPU stack. For levels beyond eight, the code is no worse than what we're generating
now, and for levels less than eight, it's considerably better.

For the record, I have implemented this concept, just to make sure it works before I men-
tioned it to you. It does. In practice, it turns out that you can't really use all eight levels ...
you need at least one register free to reverse the operand order for division (sure wish the
68000 had an XTHL, like the 8080!). For expressions that include function calls, we would
also need a register reserved for them. Still, there is a nice improvement in code size for
most expressions.
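
The register-stack idea might look something like the sketch below. This is not the implementation mentioned above, only one possible shape for it, and every detail is an assumption: Level is the private "stack pointer", Reg and Operate are hypothetical helpers, input comes from the string Src, and an expression that needs more than eight registers is simply rejected instead of spilling to the CPU stack. In this particular arrangement the right-hand operand always sits one level above the left-hand one, so SUB and DIVS get their operands in the natural order.

```pascal
program RegStack;
{ Sketch only: data registers used as a private expression stack.
  Level, Reg and Operate are hypothetical names; spilling to the CPU
  stack is not implemented. }
var
  Src: string;      { text to translate (assumption: no keyboard input) }
  Posn: integer;    { index of the next unread character in Src }
  Look: char;       { lookahead character }
  Level: integer;   { private stack pointer: D<Level> is top of stack }

procedure GetChar;
begin
  if Posn <= Length(Src) then Look := Src[Posn] else Look := #13;
  Posn := Posn + 1;
end;

procedure Expected(s: string);
begin
  WriteLn('Error: ', s, ' Expected');
  Halt(1);
end;

procedure Match(x: char);
begin
  if Look = x then GetChar else Expected('''' + x + '''');
end;

procedure EmitLn(s: string);
begin
  WriteLn(#9, s);
end;

{ Name of the register that holds stack level n }
function Reg(n: integer): string;
begin
  if n > 7 then Expected('Simpler expression');  { a fuller version would spill }
  Reg := 'D' + Chr(Ord('0') + n);
end;

{ Combine the top two levels, leaving the result in the lower one }
procedure Operate(Op: string);
begin
  EmitLn(Op + ' ' + Reg(Level) + ',' + Reg(Level - 1));
  Level := Level - 1;
end;

procedure Expression; forward;

procedure Factor;
begin
  if Look = '(' then begin
    Match('(');
    Expression;
    Match(')');
  end
  else begin
    if not (Look in ['0'..'9']) then Expected('Integer');
    EmitLn('MOVE #' + Look + ',' + Reg(Level));
    GetChar;
  end;
end;

procedure Term;
var Op: char;
begin
  Factor;
  while Look in ['*', '/'] do begin
    Op := Look;
    Match(Op);
    Level := Level + 1;        { right operand goes one level up }
    Factor;
    if Op = '*' then Operate('MULS') else Operate('DIVS');
  end;
end;

procedure Expression;
var Op: char;
begin
  Term;
  while Look in ['+', '-'] do begin
    Op := Look;
    Match(Op);
    Level := Level + 1;        { right operand goes one level up }
    Term;
    if Op = '+' then Operate('ADD') else Operate('SUB');
  end;
end;

begin
  Src := '1+(2-(3+(4-5)))';
  Posn := 1;
  Level := 0;
  GetChar;
  Expression;
end.
```

For 1+(2-(3+(4-5))) this emits five loads (MOVE #1,D0 through MOVE #5,D4) followed by SUB D4,D3, ADD D3,D2, SUB D2,D1 and ADD D1,D0: nine instructions, no memory traffic, and the result in D0.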
So, you see, getting better code isn't that difficult, but it does add complexity to our
translator ... complexity we can do without at this point. For that reason, I STRONGLY
suggest that we continue to ignore efficiency issues for the rest of this series, secure in
the knowledge that we can indeed improve the code quality without throwing away what
we've done.
Next lesson, I'll show you how to deal with variable factors and function calls. I'll also
show you just how easy it is to handle multicharacter tokens and embedded white space.

Part 3 - More Expressions
INTRODUCTION
In the last installment, we examined the techniques used to parse and translate a general
math expression. We ended up with a simple parser that could handle arbitrarily complex
expressions, with two restrictions:
o No variables were allowed, only numeric factors
o The numeric factors were limited to single digits
In this installment, we'll get rid of those restrictions. We'll also extend what we've done to
include function calls and assignment statements. Remember, though, that the second
restriction was mainly self-imposed ... a choice of convenience on our part, to make life eas-
ier and to let us concentrate on the fundamental concepts. As you'll see in a bit, it's an easy
restriction to get rid of, so don't get too hung up about it. We'll use the trick when it serves us
to do so, confident that we can discard it when we're ready to.

VARIABLES
Most expressions that we see in practice involve variables, such as
b * b + 4 * a * c
No parser is much good without being able to deal with them. Fortunately, it's also quite
easy to do.
Remember that in our parser as it currently stands, there are two kinds of factors allowed:
integer constants and expressions within parentheses. In BNF notation,
<factor> ::= <number> | (<expression>)
The '|' stands for "or", meaning of course that either form is a legal form for a factor.
Remember, too, that we had no trouble knowing which was which ... the lookahead char-
acter is a left paren '(' in one case, and a digit in the other.
It probably won't come as too much of a surprise that a variable is just another kind of fac-
tor. So we extend the BNF above to read:
<factor> ::= <number> | (<expression>) | <variable>
Again, there is no ambiguity: if the lookahead character is a letter, we have a variable; if a
digit, we have a number. Back when we translated the number, we just issued code to
load the number, as immediate data, into D0. Now we do the same, only we load a vari-
able.
A minor complication in the code generation arises from the fact that most 68000 operat-
ing systems, including the SK*DOS that I'm using, require the code to be written in "posi-
tion-independent" form, which basically means that everything is PC-relative. The format
for a load in this language is
MOVE X(PC),D0
where X is, of course, the variable name. Armed with that, let's modify the current version
of Factor to read:

{---------------------------------------------------------------}
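{ Parse and Translate a Math Factor }
procedure Expression; Forward;
procedure Factor;
begin
   if Look = '(' then begin
      Match('(');
      Expression;
      Match(')');
      end
   else if IsAlpha(Look) then
      EmitLn('MOVE ' + GetName + '(PC),D0')
   else
      EmitLn('MOVE #' + GetNum + ',D0');
end;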
{--------------------------------------------------------------}
I've remarked before how easy it is to add extensions to the parser, because of the way it's
structured. You can see that this still holds true here. This time it cost us all of two extra lines
of code. Notice, too, how the if-else-else structure exactly parallels the BNF syntax equation.
OK, compile and test this new version of the parser. That didn't hurt too badly, did it?

FUNCTIONS
There is only one other common kind of factor supported by most languages: the function
call. It's really too early for us to deal with functions well, because we haven't yet
addressed the issue of parameter passing. What's more, a "real" language would include
a mechanism to support more than one type, one of which should be a function type. We
haven't gotten there yet, either. But I'd still like to deal with functions now for a couple of
reasons. First, it lets us finally wrap up the parser in something very close to its final form,
and second, it brings up a new issue which is very much worth talking about.
Up till now, we've been able to write what is called a "predictive parser." That means that
at any point, we can know by looking at the current lookahead character exactly what to
do next. That isn't the case when we add functions. Every language has some naming
rules for what constitutes a legal identifier. For the present, ours is simply that it is one of
the letters 'a'..'z'. The problem is that a variable name and a function name obey the same
rules. So how can we tell which is which? One way is to require that they each be
declared before they are used. Pascal takes that approach. The other is that we might
require a function to be followed by a (possibly empty) parameter list. That's the rule used
in C.
Since we don't yet have a mechanism for declaring types, let's use the C rule for now.
Since we also don't have a mechanism to deal with parameters, we can only handle
empty lists, so our function calls will have the form
x() .
Since we're not dealing with parameter lists yet, there is nothing to do but to call the func-
tion, so we need only to issue a BSR (call) instead of a MOVE.

Now that there are two possibilities for the "If IsAlpha" branch of the test in Factor, let's treat
them in a separate procedure. Modify Factor to read:
{---------------------------------------------------------------}
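{ Parse and Translate a Math Factor }
procedure Expression; Forward;
procedure Factor;
begin
   if Look = '(' then begin
      Match('(');
      Expression;
      Match(')');
      end
   else if IsAlpha(Look) then
      Ident
   else
      EmitLn('MOVE #' + GetNum + ',D0');
end;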
{--------------------------------------------------------------}

and insert before it the new procedure
{---------------------------------------------------------------}
{ Parse and Translate an Identifier }
procedure Ident;
var Name: char;
begin
   Name := GetName;
   if Look = '(' then begin
      Match('(');
      Match(')');
      EmitLn('BSR ' + Name);
      end
   else
      EmitLn('MOVE ' + Name + '(PC),D0')
end;
{---------------------------------------------------------------}

OK, compile and test this version. Does it parse all legal expressions? Does it correctly flag
badly formed ones?
The important thing to notice is that even though we no longer have a predictive parser, there
is little or no complication added with the recursive descent approach that we're using. At the
point where Factor finds an identifier (letter), it doesn't know whether it's a variable name or a
function name, nor does it really care. It simply passes it on to Ident and leaves it up to that
procedure to figure it out. Ident, in turn, simply tucks away the identifier and then reads one
more character to decide which kind of identifier it's dealing with.
Keep this approach in mind. It's a very powerful concept, and it should be used whenever you
encounter an ambiguous situation requiring further lookahead. Even if you had to look sev-
eral tokens ahead, the principle would still work.
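
The "tuck it away and look one character further" trick can be tried on its own with the sketch below. It is a stand-in rather than the real parser: input comes from a string passed to the hypothetical helper Translate, only Ident is exercised, and GetName is assumed to upper-case the name.

```pascal
program IdentDemo;
{ Sketch only: string-fed stand-in for the cradle, exercising Ident. }
var
  Src: string;     { text to translate (assumption: no keyboard input) }
  Posn: integer;   { index of the next unread character in Src }
  Look: char;      { lookahead character }

procedure GetChar;
begin
  if Posn <= Length(Src) then Look := Src[Posn] else Look := #13;
  Posn := Posn + 1;
end;

procedure Expected(s: string);
begin
  WriteLn('Error: ', s, ' Expected');
  Halt(1);
end;

procedure Match(x: char);
begin
  if Look = x then GetChar else Expected('''' + x + '''');
end;

procedure EmitLn(s: string);
begin
  WriteLn(#9, s);
end;

{ Get a one-letter identifier, upper-cased (an assumption here) }
function GetName: char;
begin
  if not (UpCase(Look) in ['A'..'Z']) then Expected('Name');
  GetName := UpCase(Look);
  GetChar;
end;

{ Save the name first, then let the next character decide what it is }
procedure Ident;
var Name: char;
begin
  Name := GetName;
  if Look = '(' then begin
    Match('(');
    Match(')');
    EmitLn('BSR ' + Name);               { function call }
  end
  else
    EmitLn('MOVE ' + Name + '(PC),D0');  { variable load }
end;

{ Hypothetical helper: run Ident over one input string }
procedure Translate(s: string);
begin
  Src := s;
  Posn := 1;
  GetChar;
  Ident;
end;

begin
  Translate('x');
  Translate('f()');
end.
```

The first call prints MOVE X(PC),D0 and the second prints BSR F. The decision is made only after GetName has already consumed the letter, which is exactly the extra character of lookahead described above.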

MORE ON ERROR HANDLING

As long as we're talking philosophy, there's another important issue to point out: error
handling. Notice that although the parser correctly rejects (almost) every malformed
expression we can throw at it, with a meaningful error message, we haven't really had to
do much work to make that happen. In fact, in the whole parser per se (from Ident through
Expression) there are only two calls to the error routine, Expected. Even those aren't nec-
essary ... if you'll look again in Term and Expression, you'll see that those statements
can't be reached. I put them in early on as a bit of insurance, but they're no longer
needed. Why don't you delete them now?
So how did we get this nice error handling virtually for free? It's simply that I've carefully
avoided reading a character directly using GetChar. Instead, I've relied on the error han-
dling in GetName, GetNum, and Match to do all the error checking for me. Astute readers
will notice that some of the calls to Match (for example, the ones in Add and Subtract) are
also unnecessary ... we already know what the character is by the time we get there ...
but it maintains a certain symmetry to leave them in, and the general rule to always use
Match instead of GetChar is a good one.
I mentioned an "almost" above. There is a case where our error handling leaves a bit to
be desired. So far we haven't told our parser what an end-of-line looks like, or what to do
with embedded white space. So a space character (or any other character not part of the
recognized character set) simply causes the parser to terminate, ignoring the unrecog-
nized characters.
It could be argued that this is reasonable behavior at this point. In a "real" compiler, there
is usually another statement following the one we're working on, so any characters not
treated as part of our expression will either be used for or rejected as part of the next one.
But it's also a very easy thing to fix up, even if it's only temporary. All we have to do is
assert that the expression should end with an end-of-line, i.e., a carriage return.
To see what I'm talking about, try the input line
1+2 <space> 3+4

See how the space was treated as a terminator? Now, to make the compiler properly flag this,
add the line
if Look <> CR then Expected('Newline');
in the main program, just after the call to Expression. That catches anything left over in the
input stream. Don't forget to define CR in the const statement:
CR = ^M;
As usual, recompile the program and verify that it does what it's supposed to.

ASSIGNMENT STATEMENTS
OK, at this point we have a parser that works very nicely. I'd like to point out that we got it
using only 88 lines of executable code, not counting what was in the cradle. The compiled
object file is a whopping 4752 bytes. Not bad, considering we weren't trying very hard to
save either source code or object size. We just stuck to the KISS principle.
Of course, parsing an expression is not much good without having something to do with it
afterwards. Expressions USUALLY (but not always) appear in assignment statements, in
the form
<Ident> = <Expression>
We're only a breath away from being able to parse an assignment statement, so let's take
that last step. Just after procedure Expression, add the following new procedure:
{--------------------------------------------------------------}
{ Parse and Translate an Assignment Statement }
procedure Assignment;
var Name: char;
begin
Name := GetName;
Match('=');
Expression;
EmitLn('LEA ' + Name + '(PC),A0');
EmitLn('MOVE D0,(A0)')
end;
{--------------------------------------------------------------}

Note again that the code exactly parallels the BNF. And notice further that the error checking
was painless, handled by GetName and Match.
The reason for the two lines of assembler has to do with a peculiarity in the 68000, which
requires this kind of construct for PC-relative code.
Now change the call to Expression, in the main program, to one to Assignment. That's all
there is to it.
Son of a gun! We are actually compiling assignment statements. If those were the only kind
of statements in a language, all we'd have to do is put this in a loop and we'd have a full-
fledged compiler!
Well, of course they're not the only kind. There are also little items like control statements (IFs
and loops), procedures, declarations, etc. But cheer up. The arithmetic expressions that
we've been dealing with are among the most challenging in a language. Compared to what
we've already done, control statements will be easy. I'll be covering them in the fifth install-
ment. And the other statements will all fall in line, as long as we remember to KISS.

MULTI-CHARACTER TOKENS
Throughout this series, I've been carefully restricting everything we do to single-character
tokens, all the while assuring you that it wouldn't be difficult to extend to multi-character
ones. I don't know if you believed me or not ... I wouldn't really blame you if you were a bit
skeptical. I'll continue to use that approach in the sessions which follow, because it helps
keep complexity away. But I'd like to back up those assurances, and wrap up this portion
of the parser, by showing you just how easy that extension really is. In the process, we'll
also provide for embedded white space. Before you make the next few changes, though,
save the current version of the parser away under another name. I have some more uses
for it in the next installment, and we'll be working with the single-character version.
Most compilers separate out the handling of the input stream into a separate module
called the lexical scanner. The idea is that the scanner deals with all the character-by-
character input, and returns the separate units (tokens) of the stream. There may come a
time when we'll want to do something like that, too, but for now there is no need. We can
handle the multi-character tokens that we need by very slight and very local modifications
to GetName and GetNum.
The usual definition of an identifier is that the first character must be a letter, but the rest
can be alphanumeric (letters or numbers). To deal with this, we need one other recognizer
function:
{--------------------------------------------------------------}
{ Recognize an Alphanumeric }
function IsAlNum(c: char): boolean;
begin
IsAlNum := IsAlpha(c) or IsDigit(c);
end;
{--------------------------------------------------------------}

Add this function to your parser. I put mine just after IsDigit. While you're at it, might as well
include it as a permanent member of Cradle, too.
Now, we need to modify function GetName to return a string instead of a character:
{--------------------------------------------------------------}
{ Get an Identifier }
function GetName: string;
var Token: string;
begin
   Token := '';
   if not IsAlpha(Look) then Expected('Name');
   while IsAlNum(Look) do begin
      Token := Token + UpCase(Look);
      GetChar;
   end;
   GetName := Token;
end;
{--------------------------------------------------------------}

Similarly, modify GetNum to read:
{--------------------------------------------------------------}
{ Get a Number }
function GetNum: string;
var Value: string;
begin
   Value := '';
   if not IsDigit(Look) then Expected('Integer');
   while IsDigit(Look) do begin
      Value := Value + Look;
      GetChar;
   end;
   GetNum := Value;
end;
{--------------------------------------------------------------}
Amazingly enough, those are virtually all the changes required to the parser! The local
variable Name in procedures Ident and Assignment was originally declared as "char", and
must now be declared string[8]. (Clearly, we could make the string length longer if we
chose, but most assemblers limit the length anyhow.) Make this change, and then
recompile and test. _NOW_ do you believe that it's a simple change?

WHITE SPACE
Before we leave this parser for a while, let's address the issue of white space. As it stands
now, the parser will barf (or simply terminate) on a single space character embedded any-
where in the input stream. That's pretty unfriendly behavior. So let's "productionize" the thing
a bit by eliminating this last restriction.
The key to easy handling of white space is to come up with a simple rule for how the parser
should treat the input stream, and to enforce that rule everywhere. Up till now, because white
space wasn't permitted, we've been able to assume that after each parsing action, the looka-
head character Look contains the next meaningful character, so we could test it immediately.
Our design was based upon this principle.
It still sounds like a good rule to me, so that's the one we'll use. This means that every routine
that advances the input stream must skip over white space, and leave the next non-white
character in Look. Fortunately, because we've been careful to use GetName, GetNum, and
Match for most of our input processing, it is only those three routines (plus Init) that we need
to modify.
Not surprisingly, we start with yet another new recognizer routine:
{--------------------------------------------------------------}
{ Recognize White Space }
function IsWhite(c: char): boolean;
begin
IsWhite := c in [' ', TAB];
end;
{--------------------------------------------------------------}

We also need a routine that will eat white-space characters, until it finds a non-white
one:
{--------------------------------------------------------------}
{ Skip Over Leading White Space }
procedure SkipWhite;
begin
while IsWhite(Look) do
GetChar;
end;
{--------------------------------------------------------------}
Now, add calls to SkipWhite to Match, GetName, and GetNum as shown below:
{--------------------------------------------------------------}
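{ Match a Specific Input Character }
procedure Match(x: char);
begin
   if Look <> x then Expected('''' + x + '''')
   else begin
      GetChar;
      SkipWhite;
   end;
end;

{--------------------------------------------------------------}
{ Get an Identifier }
function GetName: string;
var Token: string;
begin
   Token := '';
   if not IsAlpha(Look) then Expected('Name');
   while IsAlNum(Look) do begin
      Token := Token + UpCase(Look);
      GetChar;
   end;
   GetName := Token;
   SkipWhite;
end;

{--------------------------------------------------------------}
{ Get a Number }
function GetNum: string;
var Value: string;
begin
   Value := '';
   if not IsDigit(Look) then Expected('Integer');
   while IsDigit(Look) do begin
      Value := Value + Look;
      GetChar;
   end;
   GetNum := Value;
   SkipWhite;
end;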
{--------------------------------------------------------------}
(Note that I rearranged Match a bit, without changing the functionality.)

Finally, we need to skip over leading blanks where we "prime the pump" in Init:
{--------------------------------------------------------------}
{ Initialize }
procedure Init;
begin
GetChar;
SkipWhite;
end;
{--------------------------------------------------------------}
Make these changes and recompile the program. You will find that you will have to move
Match below SkipWhite, to avoid an error message from the Pascal compiler. Test the pro-
gram as always to make sure it works properly.
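To see the white-space rule in isolation, here is a small stand-alone sketch. It is not part of the parser: it assumes the input comes from a string buffer rather than standard input, and the names Source, Posn, EOL, and GetToken are made up for this illustration. The point is only that every routine which consumes a token ends by calling SkipWhite, so Look always holds the next meaningful character when the caller tests it.

```pascal
program WhiteDemo;
{ Illustration only: Source, Posn, EOL and GetToken are hypothetical names }

const TAB = ^I;
      EOL = #0;                 { assumed end-of-input marker }

var Source: string;             { input buffer standing in for stdin }
    Posn: integer;              { index of the next unread character }
    Look: char;                 { lookahead character }

{ Read the next character from the buffer }
procedure GetChar;
begin
   if Posn <= Length(Source) then begin
      Look := Source[Posn];
      Posn := Posn + 1;
   end
   else
      Look := EOL;
end;

function IsWhite(c: char): boolean;
begin
   IsWhite := c in [' ', TAB];
end;

procedure SkipWhite;
begin
   while IsWhite(Look) do
      GetChar;
end;

function IsAlNum(c: char): boolean;
begin
   IsAlNum := UpCase(c) in ['A'..'Z', '0'..'9'];
end;

{ Collect one token, then restore the rule by skipping white space }
function GetToken: string;
var Token: string;
begin
   Token := '';
   if IsAlNum(Look) then
      while IsAlNum(Look) do begin
         Token := Token + UpCase(Look);
         GetChar;
      end
   else begin
      Token := Look;            { anything else is a one-character token }
      GetChar;
   end;
   GetToken := Token;
   SkipWhite;
end;

begin
   Source := '  alpha  =  12 +' + TAB + 'b2';
   Posn := 1;
   GetChar;
   SkipWhite;                   { prime the pump, as Init does }
   while Look <> EOL do
      WriteLn('[', GetToken, ']');
end.
```

With the sample line in the main block, this should print the five tokens [ALPHA], [=], [12], [+], and [B2], one per line, however many blanks or tabs separate them.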
Since we've made quite a few changes during this session, I'm reproducing the entire parser
below:
{--------------------------------------------------------------}
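program parse;
{--------------------------------------------------------------}
{ Constant Declarations }
const TAB = ^I;
      CR = ^M;
{--------------------------------------------------------------}
{ Variable Declarations }
var Look: char;              { Lookahead Character }

{--------------------------------------------------------------}
{ Read New Character From Input Stream }
procedure GetChar;
begin
   Read(Look);
end;
{--------------------------------------------------------------}
{ Report an Error }
procedure Error(s: string);
begin
   WriteLn;
   WriteLn(^G, 'Error: ', s, '.');
end;

{--------------------------------------------------------------}
{ Report Error and Halt }
procedure Abort(s: string);
begin
   Error(s);
   Halt;
end;
{--------------------------------------------------------------}
{ Report What Was Expected }
procedure Expected(s: string);
begin
   Abort(s + ' Expected');
end;
{--------------------------------------------------------------}
{ Recognize an Alpha Character }
function IsAlpha(c: char): boolean;
begin
   IsAlpha := UpCase(c) in ['A'..'Z'];
end;

{--------------------------------------------------------------}
{ Recognize a Decimal Digit }
function IsDigit(c: char): boolean;
begin
   IsDigit := c in ['0'..'9'];
end;
{--------------------------------------------------------------}
{ Recognize an Alphanumeric }
function IsAlNum(c: char): boolean;
begin
   IsAlNum := IsAlpha(c) or IsDigit(c);
end;
{--------------------------------------------------------------}
{ Recognize an Addop }
function IsAddop(c: char): boolean;
begin
   IsAddop := c in ['+', '-'];
end;

{--------------------------------------------------------------}
{ Recognize White Space }
function IsWhite(c: char): boolean;
begin
   IsWhite := c in [' ', TAB];
end;
{--------------------------------------------------------------}
{ Skip Over Leading White Space }
procedure SkipWhite;
begin
   while IsWhite(Look) do
      GetChar;
end;

{--------------------------------------------------------------}
{ Match a Specific Input Character }
procedure Match(x: char);
begin
   if Look <> x then Expected('''' + x + '''')
   else begin
      GetChar;
      SkipWhite;
   end;
end;

{--------------------------------------------------------------}
{ Get an Identifier }
function GetName: string;
var Token: string;
begin
   Token := '';
   if not IsAlpha(Look) then Expected('Name');
   while IsAlNum(Look) do begin
      Token := Token + UpCase(Look);
      GetChar;
   end;
   GetName := Token;
   SkipWhite;
end;

{--------------------------------------------------------------}
{ Get a Number }
function GetNum: string;
var Value: string;
begin
   Value := '';
   if not IsDigit(Look) then Expected('Integer');
   while IsDigit(Look) do begin
      Value := Value + Look;
      GetChar;
   end;
   GetNum := Value;
   SkipWhite;
end;
{--------------------------------------------------------------}
{ Output a String with Tab }
procedure Emit(s: string);
begin
   Write(TAB, s);
end;
{--------------------------------------------------------------}
{ Output a String with Tab and CRLF }
procedure EmitLn(s: string);
begin
   Emit(s);
   WriteLn;
end;
{---------------------------------------------------------------}
{ Parse and Translate an Identifier }
procedure Ident;
var Name: string[8];
begin
   Name := GetName;
   if Look = '(' then begin
      Match('(');
      Match(')');
      EmitLn('BSR ' + Name);
      end
   else
      EmitLn('MOVE ' + Name + '(PC),D0');
end;

{---------------------------------------------------------------}
{ Parse and Translate a Math Factor }
procedure Expression; Forward;
procedure Factor;
begin
   if Look = '(' then begin
      Match('(');
      Expression;
      Match(')');
      end
   else if IsAlpha(Look) then
      Ident
   else
      EmitLn('MOVE #' + GetNum + ',D0');
end;

{--------------------------------------------------------------}
{ Recognize and Translate a Multiply }
procedure Multiply;
begin
   Match('*');
   Factor;
   EmitLn('MULS (SP)+,D0');
end;
{-------------------------------------------------------------}
{ Recognize and Translate a Divide }
procedure Divide;
begin
   Match('/');
   Factor;
   EmitLn('MOVE (SP)+,D1');
   EmitLn('EXT.L D0');
   EmitLn('DIVS D1,D0');
end;

{---------------------------------------------------------------}
{ Parse and Translate a Math Term }
procedure Term;
begin
   Factor;
   while Look in ['*', '/'] do begin
      EmitLn('MOVE D0,-(SP)');
      case Look of
       '*': Multiply;
       '/': Divide;
      end;
   end;
end;
{--------------------------------------------------------------}
{ Recognize and Translate an Add }
procedure Add;
begin
   Match('+');
   Term;
   EmitLn('ADD (SP)+,D0');
end;

{-------------------------------------------------------------}
{ Recognize and Translate a Subtract }
procedure Subtract;
begin
   Match('-');
   Term;
   EmitLn('SUB (SP)+,D0');
   EmitLn('NEG D0');
end;

{---------------------------------------------------------------}
{ Parse and Translate an Expression }
procedure Expression;
begin
   if IsAddop(Look) then
      EmitLn('CLR D0')
   else
      Term;
   while IsAddop(Look) do begin
      EmitLn('MOVE D0,-(SP)');
      case Look of
       '+': Add;
       '-': Subtract;
      end;
   end;
end;

{--------------------------------------------------------------}
{ Parse and Translate an Assignment Statement }
procedure Assignment;
var Name: string[8];
begin
   Name := GetName;
   Match('=');
   Expression;
   EmitLn('LEA ' + Name + '(PC),A0');
   EmitLn('MOVE D0,(A0)')
end;
{--------------------------------------------------------------}
{ Initialize }
procedure Init;
begin
   GetChar;
   SkipWhite;
end;

{--------------------------------------------------------------}
{ Main Program }
begin
   Init;
   Assignment;
   If Look <> CR then Expected('NewLine');
end.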
{--------------------------------------------------------------}
Now the parser is complete. It's got every feature we can put in a one-line "compiler."
Tuck it away in a safe place. Next time we'll move on to a new subject, but we'll still be
talking about expressions for quite a while. Next installment, I plan to talk a bit about
interpreters as opposed to compilers, and show you how the structure of the parser changes a
bit as we change what sort of action has to be taken. The information we pick up there will
serve us in good stead later on, even if you have no interest in interpreters. See you next
time.

Part 4 - Interpreters
INTRODUCTION
In the first three installments of this series, we've looked at parsing and compiling math
expressions, and worked our way gradually and methodically from dealing with very simple
one-term, one-character "expressions" up through more general ones, finally arriving at a
very complete parser that could parse and translate complete assignment statements, with
multi-character tokens, embedded white space, and function calls. This time, I'm going to
walk you through the process one more time, only with the goal of interpreting rather than
compiling object code.
Since this is a series on compilers, why should we bother with interpreters? Simply because I
want you to see how the nature of the parser changes as we change the goals. I also want to
unify the concepts of the two types of translators, so that you can see not only the differ-
ences, but also the similarities.
Consider the assignment statement
x = 2 * y + 3
In a compiler, we want the target CPU to execute this assignment at EXECUTION time. The
translator itself doesn't do any arithmetic ... it only issues the object code that will cause the
CPU to do it when the code is executed. For the example above, the compiler would issue
code to compute the expression and store the results in variable x.
For an interpreter, on the other hand, no object code is generated. Instead, the arithmetic is
computed immediately, as the parsing is going on. For the example, by the time parsing of
the statement is complete, x will have a new value.

The approach we've been taking in this whole series is called "syntax-driven translation."
As you are aware by now, the structure of the parser is very closely tied to the syntax of
the productions we parse. We have built Pascal procedures that recognize every lan-
guage construct. Associated with each of these constructs (and procedures) is a corre-
sponding "action," which does whatever makes sense to do once a construct has been
recognized. In our compiler so far, every action involves emitting object code, to be exe-
cuted later at execution time. In an interpreter, every action involves something to be
done immediately.
What I'd like you to see here is that the layout ... the structure ... of the parser doesn't
change. It's only the actions that change. So if you can write an interpreter for a given lan-
guage, you can also write a compiler, and vice versa. Yet, as you will see, there ARE dif-
ferences, and significant ones. Because the actions are different, the procedures that do
the recognizing end up being written differently. Specifically, in the interpreter the recog-
nizing procedures end up being coded as FUNCTIONS that return numeric values to their
callers. None of the parsing routines for our compiler did that.
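A rough stand-alone sketch may make the contrast concrete. It recognizes the same tiny rule, a digit plus a digit, twice: once as a procedure whose action is to emit 68000-style code, and once as a function whose action is to do the arithmetic and return the result. Source, Posn, Rewind, Digit, CompileSum, and InterpretSum are hypothetical names for this illustration only, the input comes from a string instead of standard input, and all error checking is left out.

```pascal
program ActionDemo;
{ Illustration only: same rule, two different actions.
  <sum> ::= <digit> '+' <digit>                         }

const TAB = ^I;

var Source: string;             { hypothetical input buffer }
    Posn: integer;
    Look: char;

procedure GetChar;
begin
   Posn := Posn + 1;
   if Posn <= Length(Source) then Look := Source[Posn]
   else Look := #0;
end;

{ Load a fresh input line and prime the lookahead }
procedure Rewind(s: string);
begin
   Source := s;
   Posn := 0;
   GetChar;
end;

{ Shared recognizer: returns the digit's value (no error checking) }
function Digit: integer;
begin
   Digit := Ord(Look) - Ord('0');
   GetChar;
end;

{ Compiler action: emit code that will do the addition later }
procedure CompileSum;
begin
   WriteLn(TAB, 'MOVE #', Digit, ',D0');
   WriteLn(TAB, 'MOVE D0,-(SP)');
   GetChar;                     { skip the '+' }
   WriteLn(TAB, 'MOVE #', Digit, ',D0');
   WriteLn(TAB, 'ADD (SP)+,D0');
end;

{ Interpreter action: do the addition now and return the result }
function InterpretSum: integer;
var Value: integer;
begin
   Value := Digit;
   GetChar;                     { skip the '+' }
   InterpretSum := Value + Digit;
end;

begin
   Rewind('2+3');
   CompileSum;                  { prints four lines of assembler }
   Rewind('2+3');
   WriteLn(InterpretSum);       { prints 5 }
end.
```

Both routines walk the input in exactly the same order. Only what they do at each step differs, and the interpreter version has to be a function because its caller needs the value.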
Our compiler, in fact, is what we might call a "pure" compiler. Each time a construct is rec-
ognized, the object code is emitted IMMEDIATELY. (That's one reason the code is not
very efficient.) The interpreter we'll be building here is a pure interpreter, in the sense that
there is no translation, such as "tokenizing," performed on the source code. These repre-
sent the two extremes of translation. In the real world, translators are rarely so pure, but
tend to have bits of each technique.
I can think of several examples. I've already mentioned one: most interpreters, such as
Microsoft BASIC, translate the source code (tokenize it) into an intermediate form so that
it'll be easier to parse in real time.
Another example is an assembler. The purpose of an assembler, of course, is to produce
object code, and it normally does that on a one-to-one basis: one object instruction per
line of source code. But almost every assembler also permits expressions as arguments.
In this case, the expressions are always constant expressions, and so the assembler isn't
supposed to issue object code for them. Rather, it "interprets" the expressions and com-
putes the corresponding constant result, which is what it actually emits as object code.

As a matter of fact, we could use a bit of that ourselves. The translator we built in the previous
installment will dutifully spit out object code for complicated expressions, even though every
term in the expression is a constant. In that case it would be far better if the translator
behaved a bit more like an interpreter, and just computed the equivalent constant result.
There is a concept in compiler theory called "lazy" translation. The idea is that you typically
don't just emit code at every action. In fact, at the extreme you don't emit anything at all, until
you absolutely have to. To accomplish this, the actions associated with the parsing routines
typically don't just emit code. Sometimes they do, but often they simply return information
back to the caller. Armed with such information, the caller can then make a better choice of
what to do.
For example, given the statement
x = x + 3 - 2 - (5 - 4) ,
our compiler will dutifully spit out a stream of 18 instructions to load each parameter into reg-
isters, perform the arithmetic, and store the result. A lazier evaluation would recognize that
the arithmetic involving constants can be evaluated at compile time, and would reduce the
expression to
x = x + 0 .
An even lazier evaluation would then be smart enough to figure out that this is equivalent to
x = x ,
which calls for no action at all. We could reduce 18 instructions to zero!
Note that there is no chance of optimizing this way in our translator as it stands, because
every action takes place immediately.
Lazy expression evaluation can produce significantly better object code than we have been
able to so far. I warn you, though: it complicates the parser code considerably, because each
routine now has to make decisions as to whether to emit object code or not. Lazy evaluation
is certainly not named that because it's easier on the compiler writer!
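As a very rough sketch of the idea (not the method this series goes on to use), the stand-alone program below handles only a chain of single digits and single-letter variables joined by '+' and '-'. The recognizer GetOperand emits nothing; it just returns a description of what it saw, and the caller decides whether any code is needed. The Operand record and the names GetOperand, Acc, and InReg are assumptions made up for this example, parentheses are not handled, and there is no error checking.

```pascal
program LazyDemo;
{ Illustration only: fold constants, emit code just for variables }

const TAB = ^I;

type Operand = record
        IsConst: boolean;       { true: Value holds a constant }
        Value: integer;
        Name: char;             { variable name when not a constant }
     end;

var Source: string;             { hypothetical input buffer }
    Posn: integer;
    Look: char;

procedure GetChar;
begin
   Posn := Posn + 1;
   if Posn <= Length(Source) then Look := Source[Posn]
   else Look := #0;
end;

{ Recognize an operand and RETURN what was seen; emit nothing }
procedure GetOperand(var Op: Operand);
begin
   Op.IsConst := Look in ['0'..'9'];
   Op.Value := 0;
   Op.Name := ' ';
   if Op.IsConst then
      Op.Value := Ord(Look) - Ord('0')
   else
      Op.Name := UpCase(Look);
   GetChar;
end;

procedure Expression;
var Op: Operand;
    Sign: char;
    Acc: integer;               { constants folded so far }
    InReg: boolean;             { true once D0 holds a partial result }
begin
   Acc := 0;
   InReg := false;
   Sign := '+';
   repeat
      GetOperand(Op);
      if Op.IsConst then begin
         if Sign = '+' then Acc := Acc + Op.Value
         else Acc := Acc - Op.Value;
      end
      else if not InReg then begin
         WriteLn(TAB, 'MOVE ', Op.Name, '(PC),D0');
         if Sign = '-' then WriteLn(TAB, 'NEG D0');
         InReg := true;
      end
      else if Sign = '+' then
         WriteLn(TAB, 'ADD ', Op.Name, '(PC),D0')
      else
         WriteLn(TAB, 'SUB ', Op.Name, '(PC),D0');
      Sign := Look;
      if Look in ['+', '-'] then GetChar;
   until not (Sign in ['+', '-']);
   { Only now decide what the folded constant is worth }
   if not InReg then
      WriteLn(TAB, 'MOVE #', Acc, ',D0')
   else if Acc <> 0 then
      WriteLn(TAB, 'ADD #', Acc, ',D0');
end;

begin
   Source := 'x+3-2-1';
   Posn := 0;
   GetChar;
   Expression;
end.
```

For the sample input the constants cancel, so the only code emitted is the load of X. Recognizing that an assignment of x to itself needs no code at all would take a further step at the assignment level, which this sketch does not attempt.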

Since we're operating mainly on the KISS principle here, I won't go into much more depth
on this subject. I just want you to be aware that you can get some code optimization by
combining the techniques of compiling and interpreting. In particular, you should know
that the parsing routines in a smarter translator will generally return things to their caller,
and sometimes expect things as well. That's the main reason for going over interpretation
in this installment.

THE INTERPRETER
OK, now that you know WHY we're going into all this, let's do it. Just to give you practice,
we're going to start over with a bare cradle and build up the translator all over again. This
time, of course, we can go a bit faster.
Since we're now going to do arithmetic, the first thing we need to do is to change function
GetNum, which up till now has always returned a character (or string). Now, it's better for it to
return an integer. MAKE A COPY of the cradle (for goodness's sake, don't change the ver-
sion in Cradle itself!!) and modify GetNum as follows:
{--------------------------------------------------------------}
{ Get a Number }
function GetNum: integer;
begin
   if not IsDigit(Look) then Expected('Integer');
   GetNum := Ord(Look) - Ord('0');
   GetChar;
end;
{--------------------------------------------------------------}

Now, write the following version of Expression:
{---------------------------------------------------------------}
function Expression: integer;
begin
Expression := GetNum;
end;
{--------------------------------------------------------------}
Finally, insert the statement
Writeln(Expression);
at the end of the main program. Now compile and test.
All this program does is to "parse" and translate a single integer "expression." As always,
you should make sure that it does that with the digits 0..9, and gives an error message for
anything else. Shouldn't take you very long!

OK, now let's extend this to include addops. Change Expression to read:
{---------------------------------------------------------------}
function Expression: integer;
var Value: integer;
begin
   if IsAddop(Look) then
      Value := 0
   else
      Value := GetNum;
   while IsAddop(Look) do begin
      case Look of
       '+': begin
               Match('+');
               Value := Value + GetNum;
            end;
       '-': begin
               Match('-');
               Value := Value - GetNum;
            end;
      end;
   end;
   Expression := Value;
end;
{--------------------------------------------------------------}

The structure of Expression, of course, parallels what we did before, so we shouldn't have
too much trouble debugging it. There's been a SIGNIFICANT development, though,
hasn't there? Procedures Add and Subtract went away! The reason is that the action to
be taken requires BOTH arguments of the operation. I could have chosen to retain the
procedures and pass into them the value of the expression to date, which is Value. But it
seemed cleaner to me to keep Value as strictly a local variable, which meant that the
code for Add and Subtract had to be moved in line. This result suggests that, while the
structure we had developed was nice and clean for our simple-minded translation
scheme, it probably wouldn't do for use with lazy evaluation. That's a little tidbit we'll prob-
ably want to keep in mind for later.
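For comparison, here is a sketch of the alternative just mentioned: Add and Subtract are retained, and the value of the expression to date is passed into them as a var parameter. This is a stand-alone illustration under simplifying assumptions (input from a string buffer named Source, single-digit numbers, no error checking), not a replacement for the version above.

```pascal
program AddSubDemo;
{ Illustration only: Add and Subtract kept as procedures that update
  the caller's running value through a var parameter }

var Source: string;             { hypothetical input buffer }
    Posn: integer;
    Look: char;

procedure GetChar;
begin
   Posn := Posn + 1;
   if Posn <= Length(Source) then Look := Source[Posn]
   else Look := #0;
end;

{ Single-digit number, no error checking }
function GetNum: integer;
begin
   GetNum := Ord(Look) - Ord('0');
   GetChar;
end;

procedure Add(var Value: integer);
begin
   GetChar;                     { skip the '+' }
   Value := Value + GetNum;
end;

procedure Subtract(var Value: integer);
begin
   GetChar;                     { skip the '-' }
   Value := Value - GetNum;
end;

function Expression: integer;
var Value: integer;
begin
   if Look in ['+', '-'] then
      Value := 0
   else
      Value := GetNum;
   while Look in ['+', '-'] do
      case Look of
       '+': Add(Value);
       '-': Subtract(Value);
      end;
   Expression := Value;
end;

begin
   Source := '9-4+2';
   Posn := 0;
   GetChar;
   WriteLn(Expression);         { prints 7 }
end.
```

The structure matches the compiler more closely, at the price of every operator procedure needing access to the caller's running value.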

OK, did the translator work? Then let's take the next step. It's not hard to figure out what pro-
cedure Term should now look like. Change every call to GetNum in function Expression to a
call to Term, and then enter the following form for Term:
{---------------------------------------------------------------}
function Term: integer;
var Value: integer;
begin
   Value := GetNum;
   while Look in ['*', '/'] do begin
      case Look of
       '*': begin
               Match('*');
               Value := Value * GetNum;
            end;
       '/': begin
               Match('/');
               Value := Value div GetNum;
            end;
      end;
   end;
   Term := Value;
end;
{--------------------------------------------------------------}

Now, try it out. Don't forget two things: first, we're dealing with integer division, so, for
example, 1/3 should come out zero. Second, even though we can output multi-digit
results, our input is still restricted to single digits.
That seems like a silly restriction at this point, since we have already seen how easily
function GetNum can be extended. So let's go ahead and fix it right now. The new version
is
{--------------------------------------------------------------}
{ Get a Number }
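function GetNum: integer;
var Value: integer;
begin
   Value := 0;
   if not IsDigit(Look) then Expected('Integer');
   while IsDigit(Look) do begin
      Value := 10 * Value + Ord(Look) - Ord('0');
      GetChar;
   end;
   GetNum := Value;
end;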
{--------------------------------------------------------------}

If you've compiled and tested this version of the interpreter, the next step is to install function
Factor, complete with parenthesized expressions. We'll hold off a bit longer on the variable
names. First, change the references to GetNum, in function Term, so that they call Factor
instead. Now code the following version of Factor:
{---------------------------------------------------------------}
function Expression: integer; Forward;
function Factor: integer;
begin
   if Look = '(' then begin
      Match('(');
      Factor := Expression;
      Match(')');
      end
   else
      Factor := GetNum;
end;
{---------------------------------------------------------------}
That was pretty easy, huh? We're rapidly closing in on a useful interpreter.

A LITTLE PHILOSOPHY
Before going any further, there's something I'd like to call to your attention. It's a concept
that we've been making use of in all these sessions, but I haven't explicitly mentioned it
up till now. I think it's time, because it's a concept so useful, and so powerful, that it makes
all the difference between a parser that's trivially easy, and one that's too complex to deal
with.
In the early days of compiler technology, people had a terrible time figuring out how to
deal with things like operator precedence ... the way that multiply and divide operators
take precedence over add and subtract, etc. I remember a colleague of some thirty years
ago, and how excited he was to find out how to do it. The technique used involved build-
ing two stacks, upon which you pushed each operator or operand. Associated with each
operator was a precedence level, and the rules required that you only actually performed
an operation ("reducing" the stack) if the precedence level showing on top of the stack
was correct. To make life more interesting, an operator like ')' had different precedence
levels, depending upon whether or not it was already on the stack. You had to give it one
value before you put it on the stack, and another to decide when to take it off. Just for the
experience, I worked all of this out for myself a few years ago, and I can tell you that it's
very tricky.
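For readers who have never seen the older approach, the following stand-alone sketch shows its general shape. It is a simplified reconstruction for illustration, not the colleague's actual code: it handles only single-digit operands, the four operators, and parentheses, with no error checking, and the names Nums, Ops, Prec, Reduce, and Evaluate are made up. A '(' is pushed unconditionally, but once on the stack it counts as the lowest precedence, which is a small taste of the double bookkeeping described above.

```pascal
program TwoStackDemo;
{ Illustration only: explicit operand and operator stacks }

const MaxDepth = 32;

var Nums: array[0..MaxDepth] of integer;   { operand stack }
    Ops: array[0..MaxDepth] of char;       { operator stack }
    NumTop, OpTop: integer;

{ Precedence of an operator; '(' on the stack ranks lowest }
function Prec(op: char): integer;
begin
   case op of
    '+', '-': Prec := 1;
    '*', '/': Prec := 2;
   else
      Prec := 0;
   end;
end;

{ Pop one operator and two operands, push the result }
procedure Reduce;
var a, b: integer;
    op: char;
begin
   op := Ops[OpTop];
   OpTop := OpTop - 1;
   b := Nums[NumTop];
   NumTop := NumTop - 1;
   a := Nums[NumTop];
   case op of
    '+': Nums[NumTop] := a + b;
    '-': Nums[NumTop] := a - b;
    '*': Nums[NumTop] := a * b;
    '/': Nums[NumTop] := a div b;
   end;
end;

function Evaluate(s: string): integer;
var i: integer;
    c: char;
begin
   NumTop := 0;
   OpTop := 0;
   Ops[0] := '(';               { sentinel: nothing reduces past it }
   for i := 1 to Length(s) do begin
      c := s[i];
      case c of
       '0'..'9': begin
                    NumTop := NumTop + 1;
                    Nums[NumTop] := Ord(c) - Ord('0');
                 end;
       '(': begin
               OpTop := OpTop + 1;
               Ops[OpTop] := c;
            end;
       ')': begin
               while Ops[OpTop] <> '(' do Reduce;
               OpTop := OpTop - 1;          { discard the '(' }
            end;
       '+', '-', '*', '/':
            begin
               { Reduce while the stacked operator binds at least as tightly }
               while Prec(Ops[OpTop]) >= Prec(c) do Reduce;
               OpTop := OpTop + 1;
               Ops[OpTop] := c;
            end;
      end;
   end;
   while OpTop > 0 do Reduce;
   Evaluate := Nums[1];
end;

begin
   WriteLn(Evaluate('2+3*4'));      { prints 14 }
   WriteLn(Evaluate('(2+3)*4'));    { prints 20 }
   WriteLn(Evaluate('8-2-1'));      { prints 5 }
end.
```

Even in this stripped-down form, the precedence rules live in a table and a loop rather than in the shape of the code, which is exactly what the recursive-descent procedures avoid.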
We haven't had to do anything like that. In fact, by now the parsing of an arithmetic state-
ment should seem like child's play. How did we get so lucky? And where did the prece-
dence stacks go?
A similar thing is going on in our interpreter above. You just KNOW that in order for it to do
the computation of arithmetic statements (as opposed to the parsing of them), there have
to be numbers pushed onto a stack somewhere. But where is the stack?
Finally, in compiler textbooks, there are a number of places where stacks and other struc-
tures are discussed. In the other leading parsing method (LR), an explicit stack is used. In
fact, the technique is very much like the old way of doing arithmetic expressions. Another
concept is that of a parse tree. Authors like to draw diagrams of the tokens in a statement,
connected into a tree with operators at the internal nodes. Again, where are the trees and
stacks in our technique? We haven't seen any. The answer in all cases is that the struc-
tures are implicit, not explicit. In any computer language, there is a stack involved every
time you call a subroutine. Whenever a subroutine is called, the return address is pushed
onto the CPU stack. At the end of the subroutine, the address is popped back off and control
is transferred there. In a recursive language such as Pascal, there can also be local data
pushed onto the stack, and it, too, returns when it's needed.
For example, function Expression contains a local variable called Value, which it fills by a
call to Term. Suppose, in its next call to Term for the second argument, that Term calls Factor,
which recursively calls Expression again. That "instance" of Expression gets another value
for its copy of Value. What happens to the first Value? Answer: it's still on the stack, and will
be there again when we return from our call sequence.
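The hidden stack can be made visible with a small trace. The stand-alone sketch below is a cut-down interpreter (digits, '+', and parentheses only, input from a string, no error checking) with a hypothetical Depth counter added purely for the demonstration. Each time an instance of Expression is about to call Factor for its second argument, it reports the Value it is holding.

```pascal
program DepthDemo;
{ Illustration only: each recursive instance of Expression keeps
  its own copy of Value on the CPU stack }

var Source: string;             { hypothetical input buffer }
    Posn: integer;
    Look: char;
    Depth: integer;             { nesting level, for the trace only }

procedure GetChar;
begin
   Posn := Posn + 1;
   if Posn <= Length(Source) then Look := Source[Posn]
   else Look := #0;
end;

function Expression: integer; forward;

function Factor: integer;
begin
   if Look = '(' then begin
      GetChar;                  { skip '(' }
      Factor := Expression;
      GetChar;                  { skip ')' }
   end
   else begin
      Factor := Ord(Look) - Ord('0');
      GetChar;
   end;
end;

function Expression: integer;
var Value: integer;
begin
   Depth := Depth + 1;
   Value := Factor;
   while Look = '+' do begin
      GetChar;                  { skip '+' }
      WriteLn('depth ', Depth, ': holding ', Value);
      Value := Value + Factor;
   end;
   Depth := Depth - 1;
   Expression := Value;
end;

begin
   Source := '1+(2+(3+4))';
   Posn := 0;
   Depth := 0;
   GetChar;
   WriteLn(Expression);
end.
```

For the sample input this should report holding 1 at depth 1, 2 at depth 2, and 3 at depth 3 before printing the result 10. Three separate copies of Value are alive at once, and nobody had to declare a stack to hold them.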
In other words, the reason things look so simple is that we've been making maximum use of
the resources of the language. The hierarchy levels and the parse trees are there, all right,
but they're hidden within the structure of the parser, and they're taken care of by the order
with which the various procedures are called. Now that you've seen how we do it, it's probably
hard to imagine doing it any other way. But I can tell you that it took a lot of years for compiler
writers to get that smart. The early compilers were too complex to imagine. Funny how
things get easier with a little practice.
The reason I've brought all this up is as both a lesson and a warning. The lesson: things can
be easy when you do them right. The warning: take a look at what you're doing. If, as you
branch out on your own, you begin to find a real need for a separate stack or tree structure, it
may be time to ask yourself if you're looking at things the right way. Maybe you just aren't
using the facilities of the language as well as you could be.
The next step is to add variable names. Now, though, we have a slight problem. For the com-
piler, we had no problem in dealing with variable names ... we just issued the names to the
assembler and let the rest of the program take care of allocating storage for them. Here, on
the other hand, we need to be able to fetch the values of the variables and return them as the
return values of Factor. We need a storage mechanism for these variables.
Back in the early days of personal computing, Tiny BASIC lived. It had a grand total of 26
possible variables: one for each letter of the alphabet. This fits nicely with our concept of sin-
gle-character tokens, so we'll try the same trick. In the beginning of your interpreter, just after
the declaration of variable Look, insert the line:
Table: Array['A'..'Z'] of integer;

We also need to initialize the array, so add this procedure:
{---------------------------------------------------------------}
{ Initialize the Variable Area }
procedure InitTable;
var i: char;
begin
for i := 'A' to 'Z' do
Table[i] := 0;
end;
{---------------------------------------------------------------}
You must also insert a call to InitTable, in procedure Init. DON'T FORGET to do that, or
the results may surprise you!

Now that we have an array of variables, we can modify Factor to use it. Since we don't have
a way (so far) to set the variables, Factor will always return zero values for them, but let's
go ahead and extend it anyway. Here's the new version:
{---------------------------------------------------------------}
function Expression: integer; Forward;
function Factor: integer;
begin
   if Look = '(' then begin
      Match('(');
      Factor := Expression;
      Match(')');
      end
   else if IsAlpha(Look) then
      Factor := Table[GetName]
   else
      Factor := GetNum;
end;
{---------------------------------------------------------------}

As always, compile and test this version of the program. Even though all the variables are
now zeros, at least we can correctly parse the complete expressions, as well as catch any
badly formed expressions.
I suppose you realize the next step: we need to do an assignment statement so we can
put something INTO the variables. For now, let's stick to one-liners, though we will soon
be handling multiple statements.
The assignment statement parallels what we did before:
{--------------------------------------------------------------}
{ Parse and Translate an Assignment Statement }
procedure Assignment;
var Name: char;
begin
   Name := GetName;
   Match('=');
   Table[Name] := Expression;
end;
{--------------------------------------------------------------}
To test this, I added a temporary write statement in the main program, to print out the
value of A. Then I tested it with various assignments to it.

Of course, an interpretive language that can only accept a single line of program is not of
much value. So we're going to want to handle multiple statements. This merely means putting
a loop around the call to Assignment. So let's do that now. But what should be the loop exit
criterion? Glad you asked, because it brings up a point we've been able to ignore up till now.
One of the most tricky things to handle in any translator is to determine when to bail out of a
given construct and go look for something else. This hasn't been a problem for us so far
because we've only allowed for a single kind of construct ... either an expression or an
assignment statement. When we start adding loops and different kinds of statements, you'll
find that we have to be very careful that things terminate properly. If we put our interpreter in
a loop, we need a way to quit. Terminating on a newline is no good, because that's what
sends us back for another line. We could always let an unrecognized character take us out,
but that would cause every run to end in an error message, which certainly seems uncool.
What we need is a termination character. I vote for Pascal's ending period ('.'). A minor com-
plication is that Turbo ends every normal line with TWO characters, the carriage return (CR)
and line feed (LF). At the end of each line, we need to eat these characters before processing
the next one. A natural way to do this would be with procedure Match, except that Match's
error message prints the character, which of course for the CR and/or LF won't look so great.
What we need is a special procedure for this, which we'll no doubt be using over and over.
Here it is:
{--------------------------------------------------------------}
{ Recognize and Skip Over a Newline }
{ (Assumes a constant LF = ^J has been declared alongside CR) }
procedure NewLine;
begin
   if Look = CR then begin
      GetChar;
      if Look = LF then
         GetChar;
   end;
end;
{--------------------------------------------------------------}

Insert this procedure at any convenient spot ... I put mine just after Match. Now, rewrite
the main program to look like this:
{--------------------------------------------------------------}
{ Main Program }
begin
Init;
repeat
Assignment;
NewLine;
until Look = '.';
end.
{--------------------------------------------------------------}
Note that the test for a CR is now gone, and that there are also no error tests within New-
Line itself. That's OK, though ... whatever is left over in terms of bogus characters will be
caught at the beginning of the next assignment statement.
Well, we now have a functioning interpreter. It doesn't do us a lot of good, however, since
we have no way to read data in or write it out. Sure would help to have some I/O!

Let's wrap this session up, then, by adding the I/O routines. Since we're sticking to single-
character tokens, I'll use '?' to stand for a read statement, and '!' for a write, with the char-
acter immediately following them to be used as a one-token "parameter list." Here are the
routines:
{--------------------------------------------------------------}
{ Input Routine }
procedure Input;
begin
Match('?');
Read(Table[GetName]);
end;
{--------------------------------------------------------------}
{ Output Routine }
procedure Output;
begin
Match('!');
WriteLn(Table[GetName]);
end;
{--------------------------------------------------------------}

They aren't very fancy, I admit ... no prompt character on input, for example ... but they
get the job done. The corresponding changes in the main program are shown below. Note
that we use the usual trick of a case statement based upon the current lookahead charac-
ter, to decide what to do.
{--------------------------------------------------------------}
{ Main Program }
begin
Init;
repeat
case Look of
'?': Input;
'!': Output;
else Assignment;
end;
NewLine;
until Look = '.';
end.
{--------------------------------------------------------------}

You have now completed a real, working interpreter. It's pretty sparse, but it works just like
the "big boys." It includes three kinds of program statements (and can tell the difference!), 26
variables, and I/O statements. The only things that it lacks, really, are control statements,
subroutines, and some kind of program editing function. The program editing part, I'm going
to pass on. After all, we're not here to build a product, but to learn things. The control state-
ments, we'll cover in the next installment, and the subroutines soon after. I'm anxious to get
on with that, so we'll leave the interpreter as it stands.
I hope that by now you're convinced that the limitation of single-character names and the
processing of white space are easily taken care of, as we did in the last session. This time, if
you'd like to play around with these extensions, be my guest ... they're "left as an exercise for
the student." See you next time.

Part 5 - Control Constructs
INTRODUCTION
In the first four installments of this series, we've been concentrating on the parsing of
math expressions and assignment statements.
In this installment, we'll take off on a new and exciting tangent: that of parsing and trans-
lating control constructs such as IF statements. This subject is dear to my heart, because
it represents a turning point for me. I had been playing with the parsing of expressions,
just as we have done in this series, but I still felt that I was a LONG way from being able
to handle a complete language. After all, REAL languages have branches and loops and
subroutines and all that. Perhaps you've shared some of the same thoughts. Awhile back,
though, I had to produce control constructs for a structured assembler preprocessor I was
writing. Imagine my surprise to discover that it was far easier than the expression parsing
I had already been through. I remember thinking, "Hey! This is EASY!" After we've fin-
ished this session, I'll bet you'll be thinking so, too.

THE PLAN
In what follows, we'll be starting over again with a bare cradle, and as we've done twice
before now, we'll build things up one at a time. We'll also be retaining the concept of single-
character tokens that has served us so well to date. This means that the "code" will look a lit-
tle funny, with 'i' for IF, 'w' for WHILE, etc. But it helps us get the concepts down pat without
fussing over lexical scanning. Fear not ... eventually we'll see something looking like "real"
code.
I also don't want to have us get bogged down in dealing with statements other than branches,
such as the assignment statements we've been working on. We've already demonstrated that
we can handle them, so there's no point carrying them around as excess baggage during this
exercise. So what I'll do instead is to use an anonymous statement, "other", to take the place
of the non-control statements and serve as a place-holder for them. We have to generate
some kind of object code for them (we're back into compiling, not interpretation), so for want
of anything else I'll just echo the character input.
OK, then, starting with yet another copy of the cradle, let's define the procedure:
{--------------------------------------------------------------}
{ Recognize and Translate an "Other" }
procedure Other;
begin
EmitLn(GetName);
end;
{--------------------------------------------------------------}

Now include a call to it in the main program, thus:
{--------------------------------------------------------------}
{ Main Program }
begin
Init;
Other;
end.
{--------------------------------------------------------------}
Run the program and see what you get. Not very exciting, is it? But hang in there, it's a
start, and things will get better.
The first thing we need is the ability to deal with more than one statement, since a single-
line branch is pretty limited. We did that in the last session on interpreting, but this time
let's get a little more formal. Consider the following BNF:
<program> ::= <block> END
<block> ::= [ <statement> ]*
This says that, for our purposes here, a program is defined as a block, followed by an
END statement. A block, in turn, consists of zero or more statements. We only have one
kind of statement, so far.
What signals the end of a block? It's simply any construct that isn't an "other" statement.
For now, that means only the END statement.

Armed with these ideas, we can proceed to build up our parser. The code for a program (we
have to call it DoProgram, or Pascal will complain) is:
{--------------------------------------------------------------}
{ Parse and Translate a Program }
procedure DoProgram;
begin
Block;
if Look <> 'e' then Expected('End');
EmitLn('END')
end;
{--------------------------------------------------------------}
Notice that I've arranged to emit an "END" command to the assembler, which sort of punctu-
ates the output code, and makes sense considering that we're parsing a complete program
here.

The code for Block is:
{--------------------------------------------------------------}
{ Recognize and Translate a Statement Block }
procedure Block;
begin
while not(Look in ['e']) do begin
Other;
end;
end;
{--------------------------------------------------------------}
(From the form of the procedure, you just KNOW we're going to be adding to it in a bit!)
OK, enter these routines into your program. Replace the call to Other in the main program
with a call to DoProgram. Now try it and see how it works. Well, it's still not much, but
we're getting closer.

SOME GROUNDWORK
Before we begin to define the various control constructs, we need to lay a bit more ground-
work. First, a word of warning: I won't be using the same syntax for these constructs as you're
familiar with from Pascal or C. For example, the Pascal syntax for an IF is:
IF <condition> THEN <statement>
(where the statement, of course, may be compound).
The C version is similar:
IF ( <condition> ) <statement>
Instead, I'll be using something that looks more like Ada:
IF <condition> <block> ENDIF
In other words, the IF construct has a specific termination symbol. This avoids the dangling-
else of Pascal and C and also precludes the need for the brackets {} or begin-end. The syn-
tax I'm showing you here, in fact, is that of the language KISS that I'll be detailing in later
installments. The other constructs will also be slightly different. That shouldn't be a real prob-
lem for you. Once you see how it's done, you'll realize that it really doesn't matter so much
which specific syntax is involved. Once the syntax is defined, turning it into code is straight-
forward.

Now, all of the constructs we'll be dealing with here involve transfer of control, which at
the assembler-language level means conditional and/or unconditional branches. For
example, the simple IF statement
IF <condition> A ENDIF B ....
must get translated into
Branch if NOT condition to L
A
L: B
...
It's clear, then, that we're going to need some more procedures to help us deal with these
branches. I've defined two of them below. Procedure NewLabel generates unique labels.
This is done via the simple expedient of calling every label 'Lnn', where nn is a label num-
ber starting from zero. Procedure PostLabel just outputs the labels at the proper place.

Here are the two routines:
{--------------------------------------------------------------}
{ Generate a Unique Label }
function NewLabel: string;
var S: string;
begin
Str(LCount, S);
NewLabel := 'L' + S;
Inc(LCount);
end;
{--------------------------------------------------------------}
{ Post a Label To Output }
procedure PostLabel(L: string);
begin
WriteLn(L, ':');
end;
{--------------------------------------------------------------}

Notice that we've added a new global variable, LCount, so you need to change the VAR
declarations at the top of the program to look like this:
var Look : char; { Lookahead Character }
    LCount: integer; { Label Counter }
Also, add the following extra initialization to Init:
LCount := 0;
(DON'T forget that, or your labels can look really strange!)
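If you'd like to watch the two label routines on their own before we hang a parser on them, here is a minimal sketch. NewLabel and PostLabel are copied from the text; the program name, the variables First and Second, and the main body are assumptions made purely for illustration.

```pascal
program LabelDemo;
{ Sketch only: the main body and the names First and Second are
  assumptions; NewLabel and PostLabel follow the text. }
var
  LCount: integer;          { Label Counter }

{ Generate a Unique Label }
function NewLabel: string;
var S: string;
begin
  Str(LCount, S);
  NewLabel := 'L' + S;
  Inc(LCount);
end;

{ Post a Label To Output }
procedure PostLabel(L: string);
begin
  WriteLn(L, ':');
end;

var
  First, Second: string;
begin
  LCount := 0;              { the extra initialization from Init }
  First := NewLabel;        { 'L0' }
  Second := NewLabel;       { 'L1' }
  PostLabel(First);         { prints L0: }
  PostLabel(Second);        { prints L1: }
end.
```

Each call hands back a fresh name, so two constructs can never collide on a label, no matter how deeply they are nested.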
At this point I'd also like to show you a new kind of notation. If you compare the form of
the IF statement above with the assembler code that must be produced, you can see
that there are certain actions associated with each of the keywords in the statement:
IF: First, get the condition and issue the code for it.
Then, create a unique label and emit a branch if false.
ENDIF: Emit the label.
These actions can be shown very concisely if we write the syntax this way:
IF
<condition> { Condition;
L = NewLabel;
Emit(Branch False to L); }
<block>
ENDIF { PostLabel(L) }

This is an example of syntax-directed translation. We've been doing it all along ... we've just
never written it down this way before. The stuff in curly brackets represents the ACTIONS to
be taken. The nice part about this representation is that it not only shows what we have to
recognize, but also the actions we have to perform, and in which order. Once we have this
syntax, the code almost writes itself.
About the only thing left to do is to be a bit more specific about what we mean by "Branch if
false."
I'm assuming that there will be code executed for <condition> that will perform Boolean alge-
bra and compute some result. It should also set the condition flags corresponding to that
result. Now, the usual convention for a Boolean variable is to let 0000 represent "false," and
anything else (some use FFFF, some 0001) represent "true."
On the 68000 the condition flags are set whenever any data is moved or calculated. If the
data is a 0000 (corresponding to a false condition, remember), the zero flag will be set. The
code for "Branch on zero" is BEQ. So for our purposes here,
BEQ <=> Branch if false
BNE <=> Branch if true
It's the nature of the beast that most of the branches we see will be BEQ's ... we'll be branch-
ing AROUND the code that's supposed to be executed when the condition is true.

THE IF STATEMENT
With that bit of explanation out of the way, we're finally ready to begin coding the IF-state-
ment parser. In fact, we've almost already done it! As usual, I'll be using our single-char-
acter approach, with the character 'i' for IF, and 'e' for ENDIF (as well as END ... that dual
nature causes no confusion). I'll also, for now, skip completely the character for the
branch condition, which we still have to define.
The code for DoIf is:
{--------------------------------------------------------------}
{ Recognize and Translate an IF Construct }
procedure Block; Forward;
procedure DoIf;
var L: string;
begin
Match('i');
L := NewLabel;
Condition;
EmitLn('BEQ ' + L);
Block;
Match('e');
PostLabel(L);
end;
{--------------------------------------------------------------}

Add this routine to your program, and change Block to reference it as follows:
{--------------------------------------------------------------}
procedure Block;
begin
   while not(Look in ['e']) do begin
      case Look of
       'i': DoIf;
       else Other;
      end;
   end;
end;
{--------------------------------------------------------------}

Notice the reference to procedure Condition. Eventually, we'll write a routine that can
parse and translate any Boolean condition we care to give it. But that's a whole install-
ment by itself (the next one, in fact). For now, let's just make it a dummy that emits some
text. Write the following routine:
{--------------------------------------------------------------}
{ Parse and Translate a Boolean Condition }
{ This version is a dummy }
Procedure Condition;
begin
EmitLn('<condition>');
end;
{--------------------------------------------------------------}
Insert this procedure in your program just before DoIf. Now run the program. Try a string
like
aibece
As you can see, the parser seems to recognize the construct and inserts the object code
at the right places. Now try a set of nested IF's, like
aibicedefe
It's starting to look real, eh?
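For readers who want the whole thing in one place, the following is a rough sketch of the IF parser as a single program. It is not the session's official listing. To keep it self-contained it assumes the input comes from a constant string (Source, stepped through by Cursor; both names are made up for this example) instead of from Read, and it folds the error routines into one. Block uses an else branch so that any letter other than 'i' counts as an "other" statement.

```pascal
program IfDemo;
{ Sketch only. Source, Cursor and the string-fed GetChar are assumptions
  made so the example runs without keyboard input. }
const
  TAB = #9;
  Source: string = 'aibece';   { a IF b ENDIF c END }
var
  Look: char;                  { Lookahead Character }
  Cursor: integer;             { index into Source (assumption) }
  LCount: integer;             { Label Counter }

procedure GetChar;
begin
  Inc(Cursor);
  if Cursor <= Length(Source) then Look := Source[Cursor]
  else Look := #0;             { past the end of Source }
end;

procedure Expected(s: string);
begin
  WriteLn('Error: ', s, ' Expected.');
  Halt(1);
end;

procedure Match(x: char);
begin
  if Look = x then GetChar
  else Expected('''' + x + '''');
end;

function GetName: char;
begin
  if not (UpCase(Look) in ['A'..'Z']) then Expected('Name');
  GetName := UpCase(Look);
  GetChar;
end;

procedure EmitLn(s: string);
begin
  WriteLn(TAB, s);
end;

function NewLabel: string;
var S: string;
begin
  Str(LCount, S);
  NewLabel := 'L' + S;
  Inc(LCount);
end;

procedure PostLabel(L: string);
begin
  WriteLn(L, ':');
end;

procedure Condition;           { dummy version, as in the text }
begin
  EmitLn('<condition>');
end;

procedure Other;
begin
  EmitLn(GetName);
end;

procedure Block; forward;

procedure DoIf;
var L: string;
begin
  Match('i');
  L := NewLabel;
  Condition;
  EmitLn('BEQ ' + L);          { branch around the block if false }
  Block;
  Match('e');
  PostLabel(L);
end;

procedure Block;
begin
  while not (Look in ['e']) do
    case Look of
      'i': DoIf;
      else Other;
    end;
end;

begin
  LCount := 0;
  Cursor := 0;
  GetChar;
  Block;
  if Look <> 'e' then Expected('End');
  EmitLn('END');
end.
```

With Source set to 'aibece' the sketch prints, one item per line: A, <condition>, BEQ L0, B, the label L0:, then C and END. Change Source to 'aibicedefe' to see the nested case; the inner IF gets L1 and the two labels come out in the right order.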

Now that we have the general idea (and the tools such as the notation and the procedures
NewLabel and PostLabel), it's a piece of cake to extend the parser to include other con-
structs. The first (and also one of the trickiest) is to add the ELSE clause to IF. The BNF is
IF <condition> <block> [ ELSE <block>] ENDIF
The tricky part arises simply because there is an optional part, which doesn't occur in the
other constructs.
The corresponding output code should be
<condition>
BEQ L1
<block>
BRA L2
L1: <block>
L2: ...
This leads us to the following syntax-directed translation:
IF
<condition> { L1 = NewLabel;
L2 = NewLabel;
Emit(BEQ L1) }
<block>
ELSE { Emit(BRA L2);
PostLabel(L1) }
<block>
ENDIF { PostLabel(L2) }

Comparing this with the case for an ELSE-less IF gives us a clue as to how to handle
both situations. The code below does it. (Note that I use an 'l' for the ELSE, since 'e' is
otherwise occupied):
{--------------------------------------------------------------}
procedure DoIf;
var L1, L2: string;
begin
Match('i');
Condition;
L1 := NewLabel;
L2 := L1;
EmitLn('BEQ ' + L1);
Block;
if Look = 'l' then begin

Match('l');
L2 := NewLabel;
EmitLn('BRA ' + L2);
PostLabel(L1);
Block;
end;
Match('e');
PostLabel(L2);
end;
{--------------------------------------------------------------}

There you have it. A complete IF parser/translator, in 19 lines of code. One detail before
you run it: Block must now stop when it sees an 'l' as well as an 'e', so change the set in
its while-test to ['e', 'l']. Give it a try now.
Try something like
aiblcede
Did it work? Now, just to be sure we haven't broken the ELSE-less case, try
aibece
Now try some nested IF's. Try anything you like, including some badly formed statements.
Just remember that 'e' is not a legal "other" statement.

THE WHILE STATEMENT

The next type of statement should be easy, since we already have the process down pat.
The syntax I've chosen for the WHILE statement is
WHILE <condition> <block> ENDWHILE
I know, I know, we don't REALLY need separate kinds of terminators for each construct
... you can see that by the fact that in our one-character version, 'e' is used for all of them.
But I also remember MANY debugging sessions in Pascal, trying to track down a way-
ward END that the compiler obviously thought I meant to put somewhere else. It's been
my experience that specific and unique keywords, although they add to the vocabulary of
the language, give a bit of error-checking that is worth the extra work for the compiler
writer.
Now, consider what the WHILE should be translated into. It should be:
L1: <condition>
BEQ L2
<block>
BRA L1
L2:
As before, comparing the two representations gives us the actions needed at each point.
WHILE { L1 = NewLabel;
PostLabel(L1) }
<condition> { Emit(BEQ L2) }
<block>
ENDWHILE { Emit(BRA L1);
PostLabel(L2) }

The code follows immediately from the syntax:
{--------------------------------------------------------------}
{ Parse and Translate a WHILE Statement }
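procedure DoWhile;
var L1, L2: string;
begin
   Match('w');
   L1 := NewLabel;
   L2 := NewLabel;
   PostLabel(L1);
   Condition;
   EmitLn('BEQ ' + L2);   { leave the loop when the condition is false }
   Block;
   Match('e');
   EmitLn('BRA ' + L1);   { go back and test the condition again }
   PostLabel(L2);
end;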
{--------------------------------------------------------------}

Since we've got a new statement, we have to add a call to it within procedure Block:
{--------------------------------------------------------------}
procedure Block;
begin
while not(Look in ['e', 'l']) do begin
case Look of
'i': DoIf;
'w': DoWhile;
else Other;
end;
end;
end;
{--------------------------------------------------------------}
No other changes are necessary.
OK, try the new program. Note that this time, the <condition> code is INSIDE the upper
label, which is just where we wanted it. Try some nested loops. Try some loops within
IF's, and some IF's within loops. If you get a bit confused as to what you should type,
don't be discouraged: you write bugs in other languages, too, don't you? It'll look a lot
more meaningful when we get full keywords.

I hope by now that you're beginning to get the idea that this really IS easy. All we have to do
to accommodate a new construct is to work out the syntax-directed translation of it. The code
almost falls out from there, and it doesn't affect any of the other routines. Once you've gotten
the feel of the thing, you'll see that you can add new constructs about as fast as you can
dream them up.

THE LOOP STATEMENT

We could stop right here, and have a language that works. It's been shown many times
that a high-order language with only two constructs, the IF and the WHILE, is sufficient to
write structured code. But we're on a roll now, so let's richen up the repertoire a bit.
This construct is even easier, since it has no condition test at all ... it's an infinite loop.
What's the point of such a loop? Not much, by itself, but later on we're going to add a
BREAK command, that will give us a way out. This makes the language considerably
richer than Pascal, which has no break, and also avoids the funny WHILE(1) or WHILE
TRUE of C and Pascal.
The syntax is simply
LOOP <block> ENDLOOP
and the syntax-directed translation is:
LOOP { L = NewLabel;
PostLabel(L) }
<block>
ENDLOOP { Emit(BRA L) }

The corresponding code is shown below. Since I've already used 'l' for the ELSE, I've used
the last letter, 'p', as the "keyword" this time.
{--------------------------------------------------------------}
{ Parse and Translate a LOOP Statement }
procedure DoLoop;
var L: string;
begin
Match('p');
L := NewLabel;
PostLabel(L);
Block;
Match('e');
EmitLn('BRA ' + L);
end;
{--------------------------------------------------------------}
When you insert this routine, don't forget to add a line in Block to call it.

REPEAT-UNTIL
Here's one construct that I lifted right from Pascal. The syntax is
REPEAT <block> UNTIL <condition> ,
and the syntax-directed translation is:
REPEAT { L = NewLabel;
PostLabel(L) }
<block>
UNTIL
<condition> { Emit(BEQ L) }

As usual, the code falls out pretty easily:
{--------------------------------------------------------------}
{ Parse and Translate a REPEAT Statement }
procedure DoRepeat;
var L: string;
begin
Match('r');
L := NewLabel;
PostLabel(L);
Block;
Match('u');
Condition;
EmitLn('BEQ ' + L);
end;
{--------------------------------------------------------------}
As before, we have to add the call to DoRepeat within Block. This time, there's a difference,
though. I decided to use 'r' for REPEAT (naturally), but I also decided to use 'u' for UNTIL.
This means that the 'u' must be added to the set of characters in the while-test. These are the
characters that signal an exit from the current block ... the "follow" characters, in compiler jar-
gon.

{--------------------------------------------------------------}
procedure Block;
begin
while not(Look in ['e', 'l', 'u']) do begin
case Look of
'i': DoIf;
'w': DoWhile;
'p': DoLoop;
'r': DoRepeat;
else Other;
end;
end;
end;
{--------------------------------------------------------------}

THE FOR LOOP

The FOR loop is a very handy one to have around, but it's a bear to translate. That's not so
much because the construct itself is hard ... it's only a loop after all ... but simply because it's
hard to implement in assembler language. Once the code is figured out, the translation is
straightforward enough.
C fans love the FOR-loop of that language (and, in fact, it's easier to code), but I've chosen
instead a syntax very much like the one from good ol' BASIC:
FOR <ident> = <expr1> TO <expr2> <block> ENDFOR
The translation of a FOR loop can be just about as difficult as you choose to make it, depend-
ing upon the way you decide to define the rules as to how to handle the limits. Does expr2 get
evaluated every time through the loop, for example, or is it treated as a constant limit? Do you
always go through the loop at least once, as in FORTRAN, or not? It gets simpler if you adopt
the point of view that the construct is equivalent to:
<ident> = <expr1>
TEMP = <expr2>
WHILE <ident> <= TEMP
<block>
<ident> = <ident> + 1
ENDWHILE
Notice that with this definition of the loop, <block> will not be executed at all if <expr1> is ini-
tially larger than <expr2>.
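The rules just chosen can be checked in a few lines of ordinary Pascal. This is only a sketch of the semantics, not of the generated code: CountPasses is a hypothetical helper that counts how often <block> would run, with the limit evaluated once and the counter bumped after every pass.

```pascal
program ForSemantics;
{ Sketch only: CountPasses is a hypothetical helper, not part of the parser. }

function CountPasses(Expr1, Expr2: integer): integer;
var Ident, Temp, Passes: integer;
begin
  Passes := 0;
  Ident := Expr1;            { <ident> = <expr1> }
  Temp := Expr2;             { TEMP = <expr2>, evaluated only once }
  while Ident <= Temp do begin
    Inc(Passes);             { stands in for <block> }
    Inc(Ident);              { step the loop counter }
  end;
  CountPasses := Passes;
end;

begin
  WriteLn(CountPasses(1, 3));   { three passes }
  WriteLn(CountPasses(5, 5));   { one pass }
  WriteLn(CountPasses(4, 2));   { no passes at all }
end.
```

The program prints 3, 1 and 0. The last case is the one that differs from the FORTRAN convention of always going through the loop at least once.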

The 68000 code needed to do this is trickier than anything we've done so far. I had a cou-
ple of tries at it, putting both the counter and the upper limit on the stack, both in registers,
etc. I finally arrived at a hybrid arrangement, in which the loop counter is in memory (so
that it can be accessed within the loop), and the upper limit is on the stack. The translated
code came out like this:
<ident> get name of loop counter
<expr1> get initial value
LEA <ident>(PC),A0 address the loop counter
SUBQ #1,D0 predecrement it
MOVE D0,(A0) save it
<expr2> get upper limit
MOVE D0,-(SP) save it on stack
L1: LEA <ident>(PC),A0 address loop counter
MOVE (A0),D0 fetch it to D0
ADDQ #1,D0 bump the counter
MOVE D0,(A0) save new value
CMP (SP),D0 check for range
BGT L2 skip out if D0 > (SP)
<block>
BRA L1 loop for next pass
L2: ADDQ #2,SP clean up the stack

Wow! That seems like a lot of code ... the line containing <block> seems to almost get lost.
But that's the best I could do with it. I guess it helps to keep in mind that it's really only sixteen
words, after all. If anyone else can optimize this better, please let me know.
Still, the parser routine is pretty easy now that we have the code:
{--------------------------------------------------------------}
{ Parse and Translate a FOR Statement }
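procedure DoFor;
var L1, L2: string;
    Name: char;
begin
   Match('f');
   L1 := NewLabel;
   L2 := NewLabel;
   Name := GetName;
   Match('=');
   Expression;                          { initial value }
   EmitLn('SUBQ #1,D0');                { predecrement it }
   EmitLn('LEA ' + Name + '(PC),A0');
   EmitLn('MOVE D0,(A0)');
   Expression;                          { upper limit }
   EmitLn('MOVE D0,-(SP)');             { keep the limit on the stack }
   PostLabel(L1);
   EmitLn('LEA ' + Name + '(PC),A0');
   EmitLn('MOVE (A0),D0');
   EmitLn('ADDQ #1,D0');                { bump the counter }
   EmitLn('MOVE D0,(A0)');
   EmitLn('CMP (SP),D0');
   EmitLn('BGT ' + L2);                 { skip out if counter > limit }
   Block;
   Match('e');
   EmitLn('BRA ' + L1);
   PostLabel(L2);
   EmitLn('ADDQ #2,SP');                { clean up the stack }
end;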
{--------------------------------------------------------------}

Since we don't have expressions in this parser, I used the same trick as for Condition, and
wrote the routine
{--------------------------------------------------------------}
Procedure Expression;
begin
EmitLn('<expr>');
end;
{--------------------------------------------------------------}
Give it a try. Once again, don't forget to add the call in Block. Since we don't have any input
for the dummy version of Expression, a typical input line would look something like
afi=bece
Well, it DOES generate a lot of code, doesn't it? But at least it's the RIGHT code.

THE DO STATEMENT
All this made me wish for a simpler version of the FOR loop. The reason for all the code
above is the need to have the loop counter accessible as a variable within the loop. If all
we need is a counting loop to make us go through something a specified number of times,
but don't need access to the counter itself, there is a much easier solution. The 68000 has
a "decrement and branch nonzero" instruction built in which is ideal for counting. For
good measure, let's add this construct, too. This will be the last of our loop structures.
The syntax and its translation is:
DO
<expr> { Emit(SUBQ #1,D0);
L = NewLabel;
PostLabel(L);
Emit(MOVE D0,-(SP)) }
<block>
ENDDO { Emit(MOVE (SP)+,D0);
Emit(DBRA D0,L) }

That's quite a bit simpler! The loop will execute <expr> times. Here's the code:
{--------------------------------------------------------------}
{ Parse and Translate a DO Statement }
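procedure Dodo;
var L: string;
begin
   Match('d');
   L := NewLabel;
   Expression;
   EmitLn('SUBQ #1,D0');       { DBRA counts down to -1, so start one low }
   PostLabel(L);
   EmitLn('MOVE D0,-(SP)');    { park the counter while the block runs }
   Block;
   Match('e');                 { consume the ENDDO }
   EmitLn('MOVE (SP)+,D0');
   EmitLn('DBRA D0,' + L);
end;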
{--------------------------------------------------------------}
I think you'll have to agree, that's a whole lot simpler than the classical FOR. Still, each con-
struct has its place.
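Why the SUBQ before the loop? DBRA decrements the register first and falls through only when the result reaches -1, so the counter has to start one lower than the number of passes wanted. The sketch below imitates that instruction sequence in plain Pascal. CountPasses is a hypothetical helper, and the sketch assumes a count of at least 1; it makes no attempt to model what the 68000 does with a count of zero.

```pascal
program DoSemantics;
{ Sketch only: CountPasses is a hypothetical helper that imitates the
  emitted instruction sequence; it assumes N >= 1. }

function CountPasses(N: integer): integer;
var D0, Passes: integer;
begin
  Passes := 0;
  D0 := N - 1;               { SUBQ #1,D0 }
  repeat
    Inc(Passes);             { <block>; the counter sits on the stack meanwhile }
    Dec(D0);                 { DBRA: decrement D0 ... }
  until D0 = -1;             { ... and fall through once it reaches -1 }
  CountPasses := Passes;
end;

begin
  WriteLn(CountPasses(1));
  WriteLn(CountPasses(4));
end.
```

The program prints 1 and 4: the block runs exactly <expr> times.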

THE BREAK STATEMENT

Earlier I promised you a BREAK statement to accompany LOOP. This is one I'm sort of
proud of. On the face of it a BREAK seems really tricky. My first approach was to just use
it as an extra terminator to Block, and split all the loops into two parts, just as I did with the
ELSE half of an IF. That turns out not to work, though, because the BREAK statement is
almost certainly not going to show up at the same level as the loop itself. The most likely
place for a BREAK is right after an IF, which would cause it to exit to the IF construct, not
the enclosing loop. WRONG. The BREAK has to exit the inner LOOP, even if it's nested
down into several levels of IFs.
My next thought was that I would just store away, in some global variable, the ending
label of the innermost loop. That doesn't work either, because there may be a break from
an inner loop followed by a break from an outer one. Storing the label for the inner loop
would clobber the label for the outer one. So the global variable turned into a stack.
Things were starting to get messy.
Then I decided to take my own advice. Remember in the last session when I pointed out
how well the implicit stack of a recursive descent parser was serving our needs? I said
that if you begin to see the need for an external stack you might be doing something
wrong. Well, I was. It is indeed possible to let the recursion built into our parser take care
of everything, and the solution is so simple that it's surprising.
The secret is to note that every BREAK statement has to occur within a block ... there's
no place else for it to be. So all we have to do is to pass into Block the exit address of the
innermost loop. Then it can pass the address to the routine that translates the break
instruction. Since an IF statement doesn't change the loop level, procedure DoIf doesn't
need to do anything except pass the label into ITS blocks (both of them). Since loops DO
change the level, each loop construct simply ignores whatever label is above it and
passes its own exit label along.

All this is easier to show you than it is to describe. I'll demonstrate with the easiest loop,
which is LOOP:
{--------------------------------------------------------------}
procedure DoLoop;
var L1, L2: string;
begin
   Match('p');
   L1 := NewLabel;
   L2 := NewLabel;
   PostLabel(L1);
   Block(L2);
   Match('e');
   EmitLn('BRA ' + L1);
   PostLabel(L2);
end;
{--------------------------------------------------------------}
Notice that DoLoop now has TWO labels, not just one. The second is to give the BREAK
instruction a target to jump to. If there is no BREAK within the loop, we've wasted a label and
cluttered up things a bit, but there's no harm done.

Note also that Block now has a parameter, which for loops will always be the exit address.
The new version of Block is:
{--------------------------------------------------------------}
procedure Block(L: string);
begin
   while not(Look in ['e', 'l', 'u']) do begin
      case Look of
       'i': DoIf(L);
       'w': DoWhile;
       'p': DoLoop;
       'r': DoRepeat;
       'f': DoFor;
       'd': DoDo;
       'b': DoBreak(L);
       else Other;
      end;
   end;
end;
{--------------------------------------------------------------}

Again, notice that all Block does with the label is to pass it into DoIf and DoBreak. The loop
constructs don't need it, because they are going to pass their own label anyway.
The new version of DoIf is:
{--------------------------------------------------------------}
procedure Block(L: string); Forward;

procedure DoIf(L: string);
var L1, L2: string;
begin
   Match('i');
   Condition;
   L1 := NewLabel;
   L2 := L1;
   EmitLn('BEQ ' + L1);
   Block(L);
   if Look = 'l' then begin
      Match('l');
      L2 := NewLabel;
      EmitLn('BRA ' + L2);
      PostLabel(L1);
      Block(L);
   end;
   Match('e');
   PostLabel(L2);
end;
{--------------------------------------------------------------}

Here, the only thing that changes is the addition of the parameter to procedure Block. An
IF statement doesn't change the loop nesting level, so DoIf just passes the label along.
No matter how many levels of IF nesting we have, the same label will be used.
Now, remember that DoProgram also calls Block, so it now needs to pass it a label. An
attempt to exit the outermost block is an error, so DoProgram passes a null label which is
caught by DoBreak:
{--------------------------------------------------------------}
{ Recognize and Translate a BREAK }
procedure DoBreak(L: string);
begin
Match('b');
if L <> '' then
EmitLn('BRA ' + L)
else Abort('No loop to break from');
end;
{--------------------------------------------------------------}
{ Parse and Translate a Program }
procedure DoProgram;
begin
   Block('');
   if Look <> 'e' then Expected('End');
   EmitLn('END')
end;
{--------------------------------------------------------------}

That ALMOST takes care of everything. Give it a try, see if you can "break" it <pun>. Careful,
though. By this time we've used so many letters, it's hard to think of characters that aren't
now representing reserved words. Remember: before you try the program, you're going to
have to edit every occurrence of Block in the other loop constructs to include the new
parameter. Do it just like I did for LOOP.
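Here is a cut-down sketch that shows the label being handed down through the recursion. It is an illustration, not the session's listing: only IF (without ELSE), LOOP and BREAK are included, the error routines are folded into Abort, and the input comes from a constant string. Source and Cursor are names assumed for this example.

```pascal
program BreakDemo;
{ Sketch only. Source, Cursor and the string-fed GetChar are assumptions;
  ELSE and the other loop types are left out to keep it short. }
const
  TAB = #9;
  Source: string = 'pibeee';   { LOOP IF BREAK ENDIF ENDLOOP END }
var
  Look: char;                  { Lookahead Character }
  Cursor, LCount: integer;     { input index (assumed), Label Counter }

procedure GetChar;
begin
  Inc(Cursor);
  if Cursor <= Length(Source) then Look := Source[Cursor]
  else Look := #0;             { past the end of Source }
end;

procedure Abort(s: string);
begin
  WriteLn('Error: ', s, '.');
  Halt(1);
end;

procedure Match(x: char);
begin
  if Look = x then GetChar
  else Abort('''' + x + ''' Expected');
end;

procedure EmitLn(s: string);
begin
  WriteLn(TAB, s);
end;

function NewLabel: string;
var S: string;
begin
  Str(LCount, S);
  NewLabel := 'L' + S;
  Inc(LCount);
end;

procedure Other;
begin
  if not (UpCase(Look) in ['A'..'Z']) then Abort('Name Expected');
  EmitLn(UpCase(Look));
  GetChar;
end;

procedure DoBreak(L: string);
begin
  Match('b');
  if L <> '' then EmitLn('BRA ' + L)
  else Abort('No loop to break from');
end;

procedure Block(L: string); forward;

procedure DoIf(L: string);
var L1: string;
begin
  Match('i');
  EmitLn('<condition>');
  L1 := NewLabel;
  EmitLn('BEQ ' + L1);
  Block(L);                    { IF keeps the enclosing loop's exit label }
  Match('e');
  WriteLn(L1, ':');
end;

procedure DoLoop;
var L1, L2: string;
begin
  Match('p');
  L1 := NewLabel;
  L2 := NewLabel;
  WriteLn(L1, ':');
  Block(L2);                   { a loop hands down its own exit label }
  Match('e');
  EmitLn('BRA ' + L1);
  WriteLn(L2, ':');
end;

procedure Block(L: string);
begin
  while not (Look in ['e', 'l', 'u']) do
    case Look of
      'i': DoIf(L);
      'p': DoLoop;
      'b': DoBreak(L);
      else Other;
    end;
end;

begin
  LCount := 0; Cursor := 0;
  GetChar;
  Block('');                   { null label: no loop to break from yet }
  if Look <> 'e' then Abort('End Expected');
  EmitLn('END');
end.
```

For the input 'pibeee' the sketch prints, one item per line: L0:, <condition>, BEQ L2, BRA L1, L2:, BRA L0, L1:, END. Note that the BRA L1 generated by the BREAK jumps past the end of the loop, not merely past the ENDIF. Change Source to 'be' and the program stops with the "No loop to break from" message instead.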
I said ALMOST above. There is one slight problem: if you take a hard look at the code gener-
ated for DO, you'll see that if you break out of this loop, the value of the loop counter is still
left on the stack. We're going to have to fix that! A shame ... that was one of our smaller rou-
tines, but it can't be helped. Here's a version that doesn't have the problem:
{--------------------------------------------------------------}
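procedure Dodo;
var L1, L2: string;
begin
   Match('d');
   L1 := NewLabel;
   L2 := NewLabel;
   Expression;
   EmitLn('SUBQ #1,D0');
   PostLabel(L1);
   EmitLn('MOVE D0,-(SP)');
   Block(L2);
   Match('e');                 { consume the ENDDO }
   EmitLn('MOVE (SP)+,D0');
   EmitLn('DBRA D0,' + L1);
   EmitLn('SUBQ #2,SP');       { normal exit: rebalance for the ADDQ below }
   PostLabel(L2);
   EmitLn('ADDQ #2,SP');       { BREAK lands here with the counter still stacked }
end;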
{--------------------------------------------------------------}
The two extra instructions, the SUBQ and ADDQ, take care of leaving the stack in the right
shape.

CONCLUSION
At this point we have created a number of control constructs ... a richer set, really, than
that provided by almost any other programming language. And, except for the FOR
loop, it was pretty easy to do. Even that one was tricky only because it's tricky in assem-
bler language.
I'll conclude this session here. To wrap the thing up with a red ribbon, we really should
have a go at having real keywords instead of these mickey-mouse single-character
things. You've already seen that the extension to multi-character words is not difficult, but
in this case it will make a big difference in the appearance of our input code. I'll save that
little bit for the next installment. In that installment we'll also address Boolean expres-
sions, so we can get rid of the dummy version of Condition that we've used here. See you
then.
For reference purposes, here is the completed parser for this session:
{--------------------------------------------------------------}
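program Branch;
{--------------------------------------------------------------}
{ Constant Declarations }
const TAB = ^I;
      CR = ^M;
{--------------------------------------------------------------}
{ Variable Declarations }
var Look : char; { Lookahead Character }
    LCount: integer; { Label Counter }
{--------------------------------------------------------------}
{ Read New Character From Input Stream }
procedure GetChar;
begin
   Read(Look);
end;
{--------------------------------------------------------------}
{ Report an Error }
procedure Error(s: string);
begin
   WriteLn;
   WriteLn(^G, 'Error: ', s, '.');
end;
{--------------------------------------------------------------}
{ Report Error and Halt }
procedure Abort(s: string);
begin
   Error(s);
   Halt;
end;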

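{--------------------------------------------------------------}
{ Report What Was Expected }
procedure Expected(s: string);
begin
   Abort(s + ' Expected');
end;
{--------------------------------------------------------------}
{ Match a Specific Input Character }
procedure Match(x: char);
begin
   if Look = x then GetChar
   else Expected('''' + x + '''');
end;
{--------------------------------------------------------------}
{ Recognize an Alpha Character }
function IsAlpha(c: char): boolean;
begin
   IsAlpha := UpCase(c) in ['A'..'Z'];
end;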

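{--------------------------------------------------------------}
{ Recognize a Decimal Digit }
function IsDigit(c: char): boolean;
begin
   IsDigit := c in ['0'..'9'];
end;
{--------------------------------------------------------------}
{ Recognize an Addop }
function IsAddop(c: char): boolean;
begin
   IsAddop := c in ['+', '-'];
end;
{--------------------------------------------------------------}
{ Recognize White Space }
function IsWhite(c: char): boolean;
begin
   IsWhite := c in [' ', TAB];
end;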

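{--------------------------------------------------------------}
{ Skip Over Leading White Space }
procedure SkipWhite;
begin
   while IsWhite(Look) do
      GetChar;
end;
{--------------------------------------------------------------}
{ Get an Identifier }
function GetName: char;
begin
   if not IsAlpha(Look) then Expected('Name');
   GetName := UpCase(Look);
   GetChar;
end;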

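{--------------------------------------------------------------}
{ Get a Number }
function GetNum: char;
begin
   if not IsDigit(Look) then Expected('Integer');
   GetNum := Look;
   GetChar;
end;
{--------------------------------------------------------------}
{ Generate a Unique Label }
function NewLabel: string;
var S: string;
begin
   Str(LCount, S);
   NewLabel := 'L' + S;
   Inc(LCount);
end;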

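{--------------------------------------------------------------}
{ Post a Label To Output }
procedure PostLabel(L: string);
begin
   WriteLn(L, ':');
end;
{--------------------------------------------------------------}
{ Output a String with Tab }
procedure Emit(s: string);
begin
   Write(TAB, s);
end;
{--------------------------------------------------------------}
{ Output a String with Tab and CRLF }
procedure EmitLn(s: string);
begin
   Emit(s);
   WriteLn;
end;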

{--------------------------------------------------------------}
procedure Condition;
begin
EmitLn('<condition>');
end;
{--------------------------------------------------------------}
{ Parse and Translate a Math Expression }
procedure Expression;
begin
   EmitLn('<expr>');
end;

{--------------------------------------------------------------}
{ Recognize and Translate an IF Construct }
procedure Block(L: string); Forward;
procedure DoIf(L: string);
var L1, L2: string;
begin
   Match('i');
   Condition;
   L1 := NewLabel;
   L2 := L1;
   EmitLn('BEQ ' + L1);
   Block(L);
   if Look = 'l' then begin
      Match('l');
      L2 := NewLabel;
      EmitLn('BRA ' + L2);
      PostLabel(L1);
      Block(L);
   end;
   Match('e');
   PostLabel(L2);
end;

{--------------------------------------------------------------}
{ Parse and Translate a WHILE Statement }
procedure DoWhile;
var L1, L2: string;
begin
   Match('w');
   L1 := NewLabel;
   L2 := NewLabel;
   PostLabel(L1);
   Condition;
   EmitLn('BEQ ' + L2);
   Block(L2);
   Match('e');
   EmitLn('BRA ' + L1);
   PostLabel(L2);
end;

{--------------------------------------------------------------}
{ Parse and Translate a LOOP Statement }
procedure DoLoop;
var L1, L2: string;
begin
   Match('p');
   L1 := NewLabel;
   L2 := NewLabel;
   PostLabel(L1);
   Block(L2);
   Match('e');
   EmitLn('BRA ' + L1);
   PostLabel(L2);
end;

{--------------------------------------------------------------}
{ Parse and Translate a REPEAT Statement }
procedure DoRepeat;
var L1, L2: string;
begin
   Match('r');
   L1 := NewLabel;
   L2 := NewLabel;
   PostLabel(L1);
   Block(L2);
   Match('u');
   Condition;
   EmitLn('BEQ ' + L1);
   PostLabel(L2);
end;

{--------------------------------------------------------------}
{ Parse and Translate a FOR Statement }
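procedure DoFor;
var L1, L2: string;
    Name: char;
begin
   Match('f');
   L1 := NewLabel;
   L2 := NewLabel;
   Name := GetName;
   Match('=');
   Expression;
   EmitLn('SUBQ #1,D0');
   EmitLn('LEA ' + Name + '(PC),A0');
   EmitLn('MOVE D0,(A0)');
   Expression;
   EmitLn('MOVE D0,-(SP)');
   PostLabel(L1);
   EmitLn('LEA ' + Name + '(PC),A0');
   EmitLn('MOVE (A0),D0');
   EmitLn('ADDQ #1,D0');
   EmitLn('MOVE D0,(A0)');
   EmitLn('CMP (SP),D0');
   EmitLn('BGT ' + L2);
   Block(L2);
   Match('e');
   EmitLn('BRA ' + L1);
   PostLabel(L2);
   EmitLn('ADDQ #2,SP');
end;
{--------------------------------------------------------------}
{ Parse and Translate a DO Statement }
procedure Dodo;
var L1, L2: string;
begin
   Match('d');
   L1 := NewLabel;
   L2 := NewLabel;
   Expression;
   EmitLn('SUBQ #1,D0');
   PostLabel(L1);
   EmitLn('MOVE D0,-(SP)');
   Block(L2);
   EmitLn('MOVE (SP)+,D0');
   EmitLn('DBRA D0,' + L1);
   EmitLn('SUBQ #2,SP');
   PostLabel(L2);
   EmitLn('ADDQ #2,SP');
end;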
{--------------------------------------------------------------}
{ Recognize and Translate a BREAK }
procedure DoBreak(L: string);
begin
Match('b');
EmitLn('BRA ' + L);
end;
{--------------------------------------------------------------}
{ Recognize and Translate an "Other" }
procedure Other;
begin
EmitLn(GetName);
end;

{--------------------------------------------------------------}
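{ Recognize and Translate a Statement Block }
procedure Block(L: string);
begin
   while not(Look in ['e', 'l', 'u']) do begin
      case Look of
       'i': DoIf(L);
       'w': DoWhile;
       'p': DoLoop;
       'r': DoRepeat;
       'f': DoFor;
       'd': DoDo;
       'b': DoBreak(L);
       else Other;
      end;
   end;
end;

{--------------------------------------------------------------}
{ Parse and Translate a Program }
procedure DoProgram;
begin
   Block('');
   if Look <> 'e' then Expected('End');
   EmitLn('END')
end;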
{--------------------------------------------------------------}
{ Initialize }
procedure Init;
begin
LCount := 0;
GetChar;
end;

{--------------------------------------------------------------}
{ Main Program }
begin
Init;
DoProgram;
end.
{--------------------------------------------------------------}

Part 6 - Boolean Expressions
INTRODUCTION
In Part V of this series, we took a look at control constructs, and developed parsing rou-
tines to translate them into object code. We ended up with a nice, relatively rich set of
constructs.
As we left the parser, though, there was one big hole in our capabilities: we did not
address the issue of the branch condition. To fill the void, I introduced to you a dummy
parse routine called Condition, which only served as a place-keeper for the real thing.
One of the things we'll do in this session is to plug that hole by expanding Condition into a
true parser/translator.
THE PLAN
We're going to approach this installment a bit differently than any of the others. In those
other installments, we started out immediately with experiments using the Pascal com-
piler, building up the parsers from very rudimentary beginnings to their final forms, without
spending much time in planning beforehand. That's called coding without specs, and it's
usually frowned upon. We could get away with it before because the rules of arithmetic
are pretty well established ... we know what a '+' sign is supposed to mean without having
to discuss it at length. The same is true for branches and loops. But the ways in which
programming languages implement logic vary quite a bit from language to language. So
before we begin serious coding, we'd better first make up our minds what it is we want.
And the way to do that is at the level of the BNF syntax rules (the GRAMMAR).

THE GRAMMAR
For some time now, we've been implementing BNF syntax equations for arithmetic expres-
sions, without ever actually writing them down all in one place. It's time that we did so. They
are:
<expression> ::= <unary op> <term> [<addop> <term>]*
<term>       ::= <factor> [<mulop> <factor>]*
<factor>     ::= <integer> | <variable> | ( <expression> )
(Remember, the nice thing about this grammar is that it enforces the operator precedence
hierarchy that we normally expect for algebra.)
Actually, while we're on the subject, I'd like to amend this grammar a bit right now. The way
we've handled the unary minus is a bit awkward. I've found that it's better to write the
grammar this way:
<expression>    ::= <term> [<addop> <term>]*
<term>          ::= <signed factor> [<mulop> <factor>]*
<signed factor> ::= [<addop>] <factor>
<factor>        ::= <integer> | <variable> | (<expression>)
This puts the job of handling the unary minus onto Factor, which is where it really belongs.
This doesn't mean that you have to go back and recode the programs you've already written,
although you're free to do so if you like. But I will be using the new syntax from now on.

Now, it probably won't come as a shock to you to learn that we can define an analogous
grammar for Boolean algebra. A typical set of rules is:
<b-expression>::= <b-term> [<orop> <b-term>]*
<b-term> ::= <not-factor> [AND <not-factor>]*
<not-factor> ::= [NOT] <b-factor>
<b-factor> ::= <b-literal> | <b-variable> | (<b-expression>)
Notice that in this grammar, the operator AND is analogous to '*', and OR (and exclusive
OR) to '+'. The NOT operator is analogous to a unary minus. This hierarchy is not abso-
lutely standard ... some languages, notably Ada, treat all logical operators as having the
same precedence level ... but it seems natural.
Notice also the slight difference between the way the NOT and the unary minus are han-
dled. In algebra, the unary minus is considered to go with the whole term, and so never
appears but once in a given term. So an expression like
a * -b
or worse yet,
a - -b
is not allowed. In Boolean algebra, though, the expression
a AND NOT b
makes perfect sense, and the syntax shown allows for that.

RELOPS
OK, assuming that you're willing to accept the grammar I've shown here, we now have syntax
rules for both arithmetic and Boolean algebra. The sticky part comes in when we have to
combine the two. Why do we have to do that? Well, the whole subject came up because of
the need to process the "predicates" (conditions) associated with control statements such as
the IF. The predicate is required to have a Boolean value; that is, it must evaluate to either
TRUE or FALSE. The branch is then taken or not taken, depending on that value. What we
expect to see going on in procedure Condition, then, is the evaluation of a Boolean expres-
sion.
But there's more to it than that. A pure Boolean expression can indeed be the predicate of a
control statement ... things like
IF a AND NOT b THEN ....
But more often, we see Boolean algebra show up in such things as
IF (x >= 0) and (x <= 100) THEN ...
Here, the two terms in parens are Boolean expressions, but the individual terms being com-
pared: x, 0, and 100, are NUMERIC in nature. The RELATIONAL OPERATORS >= and <=
are the catalysts by which the Boolean and the arithmetic ingredients get merged together.
Now, in the example above, the terms being compared are just that: terms. However, in gen-
eral each side can be a math expression. So we can define a RELATION to be:
<relation> ::= <expression> <relop> <expression> ,
where the expressions we're talking about here are the old numeric type, and the relops are
any of the usual symbols
=, <> (or !=), <, >, <=, and >=

If you think about it a bit, you'll agree that, since this kind of predicate has a single Bool-
ean value, TRUE or FALSE, as its result, it is really just another kind of factor. So we can
expand the definition of a Boolean factor above to read:
<b-factor> ::= <b-literal>
| <b-variable>
| (<b-expression>)
| <relation>
THAT's the connection! The relops and the relation they define serve to wed the two kinds
of algebra. It is worth noting that this implies a hierarchy where the arithmetic expression
has a HIGHER precedence than a Boolean factor, and therefore than all the Boolean
operators. If you write out the precedence levels for all the operators, you arrive at the fol-
lowing list:
Level   Syntax Element   Operator
0       factor           literal, variable
1       signed factor    unary minus
2       term             *, /
3       expression       +, -
4       b-factor         literal, variable, relop
5       not-factor       NOT
6       b-term           AND
7       b-expression     OR, XOR

If we're willing to accept that many precedence levels, this grammar seems reasonable.
Unfortunately, it won't work! The grammar may be great in theory, but it's no good at all in the
practice of a top-down parser. To see the problem, consider the code fragment:
IF ((((((A + B + C) < 0 ) AND ....
When the parser is parsing this code, it knows after it sees the IF token that a Boolean
expression is supposed to be next. So it can set up to begin evaluating such an expression.
But the first expression in the example is an ARITHMETIC expression, A + B + C. What's
worse, at the point that the parser has read this much of the input line:
IF ((((((A ,
it still has no way of knowing which kind of expression it's dealing with. That won't do,
because we must have different recognizers for the two cases. The situation can be handled
without changing any of our definitions, but only if we're willing to accept an arbitrary amount
of backtracking to work our way out of bad guesses. No compiler writer in his right mind
would agree to that.
What's going on here is that the beauty and elegance of BNF grammar has met face to face
with the realities of compiler technology.
To deal with this situation, compiler writers have had to make compromises so that a single
parser can handle the grammar without backtracking.

FIXING THE GRAMMAR

The problem that we've encountered comes up because our definitions of both arithmetic
and Boolean factors permit the use of parenthesized expressions. Since the definitions
are recursive, we can end up with any number of levels of parentheses, and the parser
can't know which kind of expression it's dealing with.
The solution is simple, although it ends up causing profound changes to our grammar.
We can only allow parentheses in one kind of factor. The way to do that varies consider-
ably from language to language. This is one place where there is NO agreement or con-
vention to help us.
When Niklaus Wirth designed Pascal, the desire was to limit the number of levels of pre-
cedence (fewer parse routines, after all). So the OR and exclusive OR operators are
treated just like an Addop and processed at the level of a math expression. Similarly, the
AND is treated like a Mulop and processed with Term. The precedence levels are
Level   Syntax Element   Operator
0       factor           literal, variable
1       signed factor    unary minus, NOT
2       term             *, /, AND
3       expression       +, -, OR

Notice that there is only ONE set of syntax rules, applying to both kinds of operators. Accord-
ing to this grammar, then, expressions like
x + (y AND NOT z) DIV 3
are perfectly legal. And, in fact, they ARE ... as far as the parser is concerned. Pascal doesn't
allow the mixing of arithmetic and Boolean variables, and things like this are caught at the
SEMANTIC level, when it comes time to generate code for them, rather than at the syntax
level.
The authors of C took a diametrically opposite approach: they treat the operators as different,
and have something much more akin to our seven levels of precedence. In fact, in C there
are no fewer than 17 levels! That's because C also has the operators '=', '+=' and its kin, '<<',
'>>', '++', '--', etc. Ironically, although in C the arithmetic and Boolean operators are treated
separately, the variables are NOT ... there are no Boolean or logical variables in C, so a Bool-
ean test can be made on any integer value.
We'll do something that's sort of in-between. I'm tempted to stick mostly with the Pascal
approach, since that seems the simplest from an implementation point of view, but it results in
some funnies that I never liked very much, such as the fact that, in the expression
IF (c >= 'A') and (c <= 'Z') then ...
the parens above are REQUIRED. I never understood why before, and neither my compiler
nor any human ever explained it very well, either. But now, we can all see that the 'and' oper-
ator, having the precedence of a multiply, has a higher one than the relational operators, so
without the parens the expression is equivalent to
IF c >= ('A' and c) <= 'Z' then
which doesn't make sense.

In any case, I've elected to separate the operators into different levels, although not as
many as in C.
<b-expression>  ::= <b-term> [<orop> <b-term>]*
<b-term>        ::= <not-factor> [AND <not-factor>]*
<not-factor>    ::= [NOT] <b-factor>
<b-factor>      ::= <b-literal> | <b-variable> | <relation>
<relation>      ::= <expression> [<relop> <expression>]
<expression>    ::= <term> [<addop> <term>]*
<term>          ::= <signed factor> [<mulop> <factor>]*
<signed factor> ::= [<addop>] <factor>
<factor>        ::= <integer> | <variable> | (<b-expression>)
This grammar results in the same set of seven levels that I showed earlier. Really, it's
almost the same grammar ... I just removed the option of parenthesized b-expressions as
a possible b-factor, and added the relation as a legal form of b-factor.
There is one subtle but crucial difference, which is what makes the whole thing work.
Notice the square brackets in the definition of a relation. This means that the relop and
the second expression are OPTIONAL.
A strange consequence of this grammar (and one shared by C) is that EVERY expression
is potentially a Boolean expression. The parser will always be looking for a Boolean
expression, but will "settle" for an arithmetic one. To be honest, that's going to slow down
the parser, because it has to wade through more layers of procedure calls. That's one
reason why Pascal compilers tend to compile faster than C compilers. If it's raw speed
you want, stick with the Pascal syntax.
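
To see that "every expression is potentially a Boolean expression" in action, here is a minimal sketch that evaluates input directly instead of emitting 68000 code. It is not part of the compiler built in this series, and several things in it are assumptions made for the illustration: the names Src, Idx and Eval are invented, operands are single digits, the Boolean literals are t and f, the arithmetic is cut down to '+' and '-', the operators use single-character codes chosen for this sketch ('&', '|', '~', '!' for AND, OR, XOR, NOT and '#' for "not equals"), and there is no error checking at all.

```pascal
program BoolGrammar;
{ Sketch only: evaluates input instead of generating 68000 code. }
var
  Src: string;     { input buffer (hypothetical) }
  Idx: integer;    { index of the next unread character }
  Look: char;      { lookahead character }
procedure GetChar;
begin
  if Idx <= Length(Src) then Look := Src[Idx] else Look := #0;
  Inc(Idx);
end;
function BoolExpression: integer; forward;
{ <factor> ::= <integer> | ( <b-expression> ) }
function Factor: integer;
begin
  if Look = '(' then begin
    GetChar;
    Factor := BoolExpression;   { on return, Look is the ')' }
  end
  else Factor := Ord(Look) - Ord('0');
  GetChar;                      { step past the digit or the ')' }
end;
{ <expression> ::= <factor> [<addop> <factor>]*   (cut down) }
function Expression: integer;
var V: integer;
begin
  V := Factor;
  while Look in ['+', '-'] do
    if Look = '+' then begin GetChar; V := V + Factor; end
    else begin GetChar; V := V - Factor; end;
  Expression := V;
end;
{ <relation> ::= <expression> [<relop> <expression>] }
function Relation: integer;
var V, W: integer; Op: char;
begin
  V := Expression;
  if Look in ['=', '#', '<', '>'] then begin
    Op := Look; GetChar;
    W := Expression;
    case Op of                  { -Ord(b) gives -1 for TRUE, 0 for FALSE }
      '=': V := -Ord(V = W);
      '#': V := -Ord(V <> W);
      '<': V := -Ord(V < W);
      '>': V := -Ord(V > W);
    end;
  end;
  Relation := V;
end;
{ <b-factor> ::= <b-literal> | <relation> }
function BoolFactor: integer;
begin
  if UpCase(Look) in ['T', 'F'] then begin
    BoolFactor := -Ord(UpCase(Look) = 'T');
    GetChar;
  end
  else BoolFactor := Relation;
end;
{ <not-factor> ::= [NOT] <b-factor> }
function NotFactor: integer;
begin
  if Look <> '!' then NotFactor := BoolFactor
  else begin
    GetChar;
    NotFactor := not BoolFactor;
  end;
end;
{ <b-term> ::= <not-factor> [AND <not-factor>]* }
function BoolTerm: integer;
var V: integer;
begin
  V := NotFactor;
  while Look = '&' do begin
    GetChar;
    V := V and NotFactor;
  end;
  BoolTerm := V;
end;
{ <b-expression> ::= <b-term> [<orop> <b-term>]* }
function BoolExpression: integer;
var V: integer;
begin
  V := BoolTerm;
  while Look in ['|', '~'] do
    if Look = '|' then begin GetChar; V := V or BoolTerm; end
    else begin GetChar; V := V xor BoolTerm; end;
  BoolExpression := V;
end;
function Eval(S: string): integer;
begin
  Src := S;
  Idx := 1;
  GetChar;
  Eval := BoolExpression;
end;
begin
  WriteLn(Eval('2+3'));          { plain arithmetic is accepted: prints 5 }
  WriteLn(Eval('(1+2)<4&!f'));   { relation AND NOT literal: prints -1 }
  WriteLn(Eval('5<3|f'));        { prints 0 }
end.
```

Notice that the parentheses are handled in exactly one place, Factor, and that Eval always enters at BoolExpression: a purely arithmetic input such as 2+3 simply falls all the way down through the Boolean layers to Relation, which finds no relop and returns the number unchanged.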

THE PARSER
Now that we've gotten through the decision-making process, we can press on with develop-
ment of a parser. You've done this with me several times now, so you know the drill: we begin
with a fresh copy of the cradle, and begin adding procedures one by one. So let's do it.
We begin, as we did in the arithmetic case, by dealing only with Boolean literals rather than
variables. This gives us a new kind of input token, so we're also going to need a new recog-
nizer, and a new procedure to read instances of that token type. Let's start by defining the two
new procedures:
{--------------------------------------------------------------}
{ Recognize a Boolean Literal }
function IsBoolean(c: char): Boolean;
begin
IsBoolean := UpCase(c) in ['T', 'F'];
end;
{--------------------------------------------------------------}
{ Get a Boolean Literal }
function GetBoolean: Boolean;
var c: char;
begin
if not IsBoolean(Look) then Expected('Boolean Literal');
GetBoolean := UpCase(Look) = 'T';
GetChar;
end;
{--------------------------------------------------------------}

Type these routines into your program. You can test them by adding into the main pro-
gram the print statement
WriteLn(GetBoolean);
OK, compile the program and test it. As usual, it's not very impressive so far, but it soon
will be.
Now, when we were dealing with numeric data we had to arrange to generate code to
load the values into D0. We need to do the same for Boolean data. The usual way to
encode Boolean variables is to let 0 stand for FALSE, and some other value for TRUE.
Many languages, such as C, use an integer 1 to represent it. But I prefer FFFF hex (or
-1), because a bitwise NOT also becomes a Boolean NOT. So now we need to emit the
right assembler code to load those values. The first cut at the Boolean expression parser
(BoolExpression, of course) is:
{---------------------------------------------------------------}
{ Parse and Translate a Boolean Expression }
procedure BoolExpression;
begin
if not IsBoolean(Look) then Expected('Boolean Literal');
if GetBoolean then
EmitLn('MOVE #-1,D0')
else
EmitLn('CLR D0');
end;
{---------------------------------------------------------------}

Add this procedure to your parser, and call it from the main program (replacing the print state-
ment you had just put there). As you can see, we still don't have much of a parser, but the
output code is starting to look more realistic.
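
The reason for preferring -1 over 1 as the TRUE value can be checked with a few lines of Pascal. This is only a sketch to illustrate the point, not part of the parser; the program and variable names are made up for the occasion. It relies on the fact that NOT applied to an integer in Turbo Pascal (and Free Pascal) is a bitwise complement.

```pascal
program TrueValue;
{ Sketch: what a bitwise NOT does to two possible encodings of TRUE. }
var
  AllOnes, One, Zero: integer;
begin
  AllOnes := -1;   { TRUE encoded as all bits set (FFFF hex in a 16-bit word) }
  One := 1;        { TRUE encoded as 1, as in C }
  Zero := 0;       { FALSE in both schemes }
  WriteLn('not -1 = ', not AllOnes);   { prints 0: a proper FALSE }
  WriteLn('not  0 = ', not Zero);      { prints -1: a proper TRUE }
  WriteLn('not  1 = ', not One);       { prints -2: neither 0 nor 1 }
end.
```

With the all-ones encoding, the complement of TRUE is FALSE and vice versa, so a single bitwise instruction does the job of a Boolean NOT. With 1 as TRUE, the complement is -2, which is still nonzero and would have to be cleaned up by extra code.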
Next, of course, we have to expand the definition of a Boolean expression. We already have
the BNF rule:
<b-expression> ::= <b-term> [<orop> <b-term>]*
I prefer the Pascal versions of the "orops", OR and XOR. But since we are keeping to single-
character tokens here, I'll encode those with '|' and '~'. The next version of BoolExpression is
almost a direct copy of the arithmetic procedure Expression:
{--------------------------------------------------------------}
{ Recognize and Translate a Boolean OR }
procedure BoolOr;
begin
Match('|');
BoolTerm;
EmitLn('OR (SP)+,D0');
end;

{--------------------------------------------------------------}
{ Recognize and Translate an Exclusive Or }
procedure BoolXor;
begin
Match('~');
BoolTerm;
EmitLn('EOR (SP)+,D0');
end;
{---------------------------------------------------------------}
{ Parse and Translate a Boolean Expression }
procedure BoolExpression;
begin
   BoolTerm;
   while IsOrOp(Look) do begin
      EmitLn('MOVE D0,-(SP)');
      case Look of
       '|': BoolOr;
       '~': BoolXor;
      end;
   end;
end;
{---------------------------------------------------------------}

Note the new recognizer IsOrOp, which is also a copy, this time of IsAddOp:
{--------------------------------------------------------------}
{ Recognize a Boolean Orop }
function IsOrop(c: char): Boolean;
begin
IsOrop := c in ['|', '~'];
end;
{--------------------------------------------------------------}
OK, rename the old version of BoolExpression to BoolTerm, then enter the code above. Com-
pile and test this version. At this point, the output code is starting to look pretty good. Of
course, it doesn't make much sense to do a lot of Boolean algebra on constant values, but
we'll soon be expanding the types of Booleans we deal with.
You've probably already guessed what the next step is: The Boolean version of Term.
Rename the current procedure BoolTerm to NotFactor, and enter the following new version of
BoolTerm. Note that it is much simpler than the numeric version, since there is no equivalent
of division.

{---------------------------------------------------------------}
{ Parse and Translate a Boolean Term }
procedure BoolTerm;
begin
   NotFactor;
   while Look = '&' do begin
      EmitLn('MOVE D0,-(SP)');
      Match('&');
      NotFactor;
      EmitLn('AND (SP)+,D0');
   end;
end;
{--------------------------------------------------------------}

Now, we're almost home. We are translating complex Boolean expressions, although only for
constant values. The next step is to allow for the NOT. Write the following procedure:
{--------------------------------------------------------------}
{ Parse and Translate a Boolean Factor with NOT }
procedure NotFactor;
begin
if Look = '!' then begin
Match('!');
BoolFactor;
EmitLn('EOR #-1,D0');
end
else
BoolFactor;
end;
{--------------------------------------------------------------}
And rename the earlier procedure to BoolFactor. Now try that. At this point the parser should
be able to handle any Boolean expression you care to throw at it. Does it? Does it trap badly
formed expressions?

If you've been following what we did in the parser for math expressions, you know that
what we did next was to expand the definition of a factor to include variables and parens.
We don't have to do that for the Boolean factor, because those little items get taken care
of by the next step. It takes just a one line addition to BoolFactor to take care of relations:
{--------------------------------------------------------------}
{ Parse and Translate a Boolean Factor }
procedure BoolFactor;
begin
if IsBoolean(Look) then
if GetBoolean then
EmitLn('MOVE #-1,D0')
else
EmitLn('CLR D0')
else Relation;
end;
{--------------------------------------------------------------}
You might be wondering when I'm going to provide for Boolean variables and parenthe-
sized Boolean expressions. The answer is, I'm NOT! Remember, we took those out of the
grammar earlier. Right now all I'm doing is encoding the grammar we've already agreed
upon. The compiler itself can't tell the difference between a Boolean variable or expres-
sion and an arithmetic one ... all of those will be handled by Relation, either way.

Of course, it would help to have some code for Relation. I don't feel comfortable, though, add-
ing any more code without first checking out what we already have. So for now let's just write
a dummy version of Relation that does nothing except eat the current character, and write a
little message:
{---------------------------------------------------------------}
{ Parse and Translate a Relation }
procedure Relation;
begin
WriteLn('<Relation>');
GetChar;
end;
{--------------------------------------------------------------}
OK, key in this code and give it a try. All the old things should still work ... you should be able
to generate the code for ANDs, ORs, and NOTs. In addition, if you type any alphabetic char-
acter you should get a little <Relation> place-holder, where a Boolean factor should be. Did
you get that? Fine, then let's move on to the full-blown version of Relation.
To get that, though, there is a bit of groundwork that we must lay first. Recall that a relation
has the form
<relation> ::= <expression> [<relop> <expression>]

Since we have a new kind of operator, we're also going to need a new Boolean function to
recognize it. That function is shown below. Because of the single-character limitation, I'm
sticking to the four operators that can be encoded with such a character (the "not equals"
is encoded by '#').
{--------------------------------------------------------------}
{ Recognize a Relop }
function IsRelop(c: char): Boolean;
begin
IsRelop := c in ['=', '#', '<', '>'];
end;
{--------------------------------------------------------------}
Now, recall that we're using a zero or a -1 in register D0 to represent a Boolean value,
and also that the loop constructs expect the flags to be set to correspond. In implement-
ing all this on the 68000, things get a little bit tricky.
Since the loop constructs operate only on the flags, it would be nice (and also quite effi-
cient) just to set up those flags, and not load anything into D0 at all. This would be fine for
the loops and branches, but remember that the relation can be used ANYWHERE a Bool-
ean factor could be used. We may be storing its result to a Boolean variable. Since we
can't know at this point how the result is going to be used, we must allow for BOTH cases.
Comparing numeric data is easy enough ... the 68000 has an operation for that ... but it
sets the flags, not a value. What's more, the flags will always be set the same (zero if
equal, etc.), while we need the zero flag set differently for each of the different relops.

The solution is found in the 68000 instruction Scc, which sets a byte value to 0000 or FFFF
(funny how that works!) depending upon the result of the specified condition. If we make the
destination byte to be D0, we get the Boolean value needed.
Unfortunately, there's one final complication: unlike almost every other instruction in the
68000 set, Scc does NOT reset the condition flags to match the data being stored. So we
have to do one last step, which is to test D0 and set the flags to match it. It must seem to be
a trip around the moon to get what we want: we first perform the test, then test the flags to set
data into D0, then test D0 to set the flags again. It is sort of roundabout, but it's the most
straightforward way to get the flags right, and after all it's only a couple of instructions.
I might mention here that this area is, in my opinion, the one that represents the biggest dif-
ference between the efficiency of hand-coded assembler language and compiler-generated
code. We have seen already that we lose efficiency in arithmetic operations, although later I
plan to show you how to improve that a bit. We've also seen that the control constructs them-
selves can be done quite efficiently ... it's usually very difficult to improve on the code gener-
ated for an IF or a WHILE. But virtually every compiler I've ever seen generates terrible code,
compared to assembler, for the computation of a Boolean function, and particularly for rela-
tions. The reason is just what I've hinted at above. When I'm writing code in assembler, I go
ahead and perform the test the most convenient way I can, and then set up the branch so
that it goes the way it should. In effect, I "tailor" every branch to the situation. The compiler
can't do that (practically), and it also can't know that we don't want to store the result of the
test as a Boolean variable. So it must generate the code in a very strict order, and it often
ends up loading the result as a Boolean that never gets used for anything.

In any case, we're now ready to look at the code for Relation. It's shown below with its
companion procedures:
{---------------------------------------------------------------}
{ Recognize and Translate a Relational "Equals" }
procedure Equals;
begin
Match('=');
Expression;
EmitLn('CMP (SP)+,D0');
EmitLn('SEQ D0');
end;
{---------------------------------------------------------------}
{ Recognize and Translate a Relational "Not Equals" }
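procedure NotEquals;
begin
   Match('#');
   Expression;
   EmitLn('CMP (SP)+,D0');
   EmitLn('SNE D0');
end;

{---------------------------------------------------------------}
{ Recognize and Translate a Relational "Less Than" }
{ CMP (SP)+,D0 sets the flags for (right operand - left operand), }
{ so a strict "left < right" is the GT condition (SGE would also  }
{ accept equal operands).                                         }
procedure Less;
begin
   Match('<');
   Expression;
   EmitLn('CMP (SP)+,D0');
   EmitLn('SGT D0');
end;
{---------------------------------------------------------------}
{ Recognize and Translate a Relational "Greater Than" }
{ By the same reasoning, "left > right" is the LT condition. }
procedure Greater;
begin
   Match('>');
   Expression;
   EmitLn('CMP (SP)+,D0');
   EmitLn('SLT D0');
end;

{---------------------------------------------------------------}
{ Parse and Translate a Relation }
procedure Relation;
begin
   Expression;
   if IsRelop(Look) then begin
      EmitLn('MOVE D0,-(SP)');
      case Look of
       '=': Equals;
       '#': NotEquals;
       '<': Less;
       '>': Greater;
      end;
      EmitLn('TST D0');
   end;
end;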
{---------------------------------------------------------------}
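
The choice of condition for each relop deserves a second look, because CMP (SP)+,D0 subtracts the popped value (the left-hand operand) from D0 (the right-hand operand), so the flags describe "right minus left". The sketch below is a Pascal model of that arithmetic, not 68000 code. Relate and BoolWord are names invented for this illustration, and overflow is ignored. It suggests that a strict "less than" corresponds to the GT condition and "greater than" to LT, whereas GE and LE would also report TRUE for equal operands.

```pascal
program RelopModel;
{ Hypothetical model of the compare-then-set sequence; not 68000 code. }

{ Encode a Pascal Boolean as -1 (TRUE) or 0 (FALSE) }
function BoolWord(b: Boolean): integer;
begin
  if b then BoolWord := -1 else BoolWord := 0;
end;

{ First is the left operand (pushed on the stack),
  Second is the right operand (left in D0). }
function Relate(Op: char; First, Second: integer): integer;
var Diff: integer;
begin
  Diff := Second - First;                 { what CMP (SP)+,D0 examines }
  case Op of
    '=': Relate := BoolWord(Diff = 0);    { EQ condition }
    '#': Relate := BoolWord(Diff <> 0);   { NE condition }
    '<': Relate := BoolWord(Diff > 0);    { GT: Second > First }
    '>': Relate := BoolWord(Diff < 0);    { LT: Second < First }
  else
    Relate := 0;
  end;
end;

begin
  WriteLn(Relate('<', 3, 5));   { prints -1: 3 < 5 is TRUE }
  WriteLn(Relate('<', 5, 5));   { prints 0: a GE test would say -1 here }
  WriteLn(Relate('>', 5, 3));   { prints -1: 5 > 3 is TRUE }
  WriteLn(Relate('#', 4, 4));   { prints 0: 4 <> 4 is FALSE }
end.
```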

Now, that call to Expression looks familiar! Here is where the editor of your system comes in
handy. We have already generated code for Expression and its buddies in previous sessions.
You can copy them into your file now. Remember to use the single-character versions. Just
to be certain, I've duplicated the arithmetic procedures below. If you're observant, you'll also
see that I've changed them a little to make them correspond to the latest version of the syn-
tax. This change is NOT necessary, so you may prefer to hold off on that until you're sure
everything is working.
{---------------------------------------------------------------}
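{ Parse and Translate an Identifier }
procedure Ident;
var Name: char;
begin
   Name:= GetName;
   if Look = '(' then begin
      Match('(');
      Match(')');
      EmitLn('BSR ' + Name);
      end
   else
      EmitLn('MOVE ' + Name + '(PC),D0');
end;

{---------------------------------------------------------------}
{ Parse and Translate a Math Factor }
procedure Expression; Forward;
procedure Factor;
begin
   if Look = '(' then begin
      Match('(');
      Expression;
      Match(')');
      end
   else if IsAlpha(Look) then
      Ident
   else
      EmitLn('MOVE #' + GetNum + ',D0');
end;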

{---------------------------------------------------------------}
{ Parse and Translate the First Math Factor }
procedure SignedFactor;
begin
if Look = '+' then
GetChar;
if Look = '-' then begin
GetChar;
if IsDigit(Look) then
EmitLn('MOVE #-' + GetNum + ',D0')
else begin
Factor;
EmitLn('NEG D0');
end;
end
else Factor;
end;

{--------------------------------------------------------------}
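{ Recognize and Translate a Multiply }
procedure Multiply;
begin
   Match('*');
   Factor;
   EmitLn('MULS (SP)+,D0');
end;
{-------------------------------------------------------------}
{ Recognize and Translate a Divide }
procedure Divide;
begin
   Match('/');
   Factor;
   EmitLn('MOVE (SP)+,D1');
   EmitLn('EXS.L D0');
   EmitLn('DIVS D1,D0');
end;

{---------------------------------------------------------------}
{ Parse and Translate a Math Term }
procedure Term;
begin
   SignedFactor;
   while Look in ['*', '/'] do begin
      EmitLn('MOVE D0,-(SP)');
      case Look of
       '*': Multiply;
       '/': Divide;
      end;
   end;
end;

{---------------------------------------------------------------}
{ Recognize and Translate an Add }
procedure Add;
begin
   Match('+');
   Term;
   EmitLn('ADD (SP)+,D0');
end;
{---------------------------------------------------------------}
{ Recognize and Translate a Subtract }
procedure Subtract;
begin
   Match('-');
   Term;
   EmitLn('SUB (SP)+,D0');
   EmitLn('NEG D0');
end;

{---------------------------------------------------------------}
{ Parse and Translate an Expression }
procedure Expression;
begin
   Term;
   while IsAddop(Look) do begin
      EmitLn('MOVE D0,-(SP)');
      case Look of
       '+': Add;
       '-': Subtract;
      end;
   end;
end;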
{---------------------------------------------------------------}
There you have it ... a parser that can handle both arithmetic AND Boolean algebra, and
things that combine the two through the use of relops. I suggest you file away a copy of this
parser in a safe place for future reference, because in our next step we're going to be chop-
ping it up.

MERGING WITH CONTROL CONSTRUCTS

At this point, let's go back to the file we had previously built that parses control constructs.
Remember those little dummy procedures called Condition and Expression? Now you
know what goes in their places!
I warn you, you're going to have to do some creative editing here, so take your time and
get it right. What you need to do is to copy all of the procedures from the logic parser,
from Ident through BoolExpression, into the parser for control constructs. Insert them at
the current location of Condition. Then delete that procedure, as well as the dummy
Expression. Next, change every call to Condition to refer to BoolExpression instead.
Finally, copy the procedures IsMulop, IsOrOp, IsRelop, IsBoolean, and GetBoolean into
place. That should do it.
Compile the resulting program and give it a try. Since we haven't used this program in
a while, don't forget that we used single-character tokens for IF, WHILE, etc. Also don't
forget that any letter not a keyword just gets echoed as a block.
Try
ia=bxlye
which stands for "IF a=b X ELSE Y ENDIF".
What do you think? Did it work? Try some others.

ADDING ASSIGNMENTS
As long as we're this far, and we already have the routines for expressions in place, we might
as well replace the "blocks" with real assignment statements. We've already done that before,
so it won't be too hard. Before taking that step, though, we need to fix something else.
We're soon going to find that the one-line "programs" that we're having to write here will really
cramp our style. At the moment we have no cure for that, because our parser doesn't recog-
nize the end-of-line characters, the carriage return (CR) and the line feed (LF). So before
going any further let's plug that hole.
There are a couple of ways to deal with the CR/LFs. One (the C/Unix approach) is just to
treat them as additional white space characters and ignore them. That's actually not such a
bad approach, but it does sort of produce funny results for our parser as it stands now. If it
were reading its input from a source file as any self-respecting REAL compiler does, there
would be no problem. But we're reading input from the keyboard, and we're sort of condi-
tioned to expect something to happen when we hit the return key. It won't, if we just skip over
the CR and LF (try it). So I'm going to use a different method here, which is NOT necessarily
the best approach in the long run. Consider it a temporary kludge until we're further along.
Instead of skipping the CR/LF, we'll let the parser go ahead and catch them, then introduce a
special procedure, analogous to SkipWhite, that skips them only in specified "legal" spots.

Here's the procedure (it assumes that a constant LF = ^J has been declared alongside CR):
{--------------------------------------------------------------}
{ Skip a CRLF }
procedure Fin;
begin
if Look = CR then GetChar;
if Look = LF then GetChar;
end;
{--------------------------------------------------------------}

Now, add two calls to Fin in procedure Block, like this:
{--------------------------------------------------------------}
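{ Recognize and Translate a Statement Block }
procedure Block(L: string);
begin
   while not(Look in ['e', 'l', 'u']) do begin
      Fin;
      case Look of
       'i': DoIf(L);
       'w': DoWhile;
       'p': DoLoop;
       'r': DoRepeat;
       'f': DoFor;
       'd': DoDo;
       'b': DoBreak(L);
       else Other;
      end;
      Fin;
   end;
end;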
{--------------------------------------------------------------}

Now, you'll find that you can use multiple-line "programs." The only restriction is that you
can't separate an IF or WHILE token from its predicate.
Now we're ready to include the assignment statements. Simply change that call to Other
in procedure Block to a call to Assignment, and add the following procedure, copied from
one of our earlier programs. Note that Assignment now calls BoolExpression, so that we
can assign Boolean variables.
{--------------------------------------------------------------}
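{ Parse and Translate an Assignment Statement }
procedure Assignment;
var Name: char;
begin
   Name := GetName;
   Match('=');
   BoolExpression;
   EmitLn('LEA ' + Name + '(PC),A0');
   EmitLn('MOVE D0,(A0)');
end;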
{--------------------------------------------------------------}
With that change, you should now be able to write reasonably realistic-looking programs,
subject only to our limitation on single-character tokens. My original intention was to get
rid of that limitation for you, too. However, that's going to require a fairly major change to
what we've done so far. We need a true lexical scanner, and that requires some structural
changes. They are not BIG changes that require us to throw away all of what we've done
so far ... with care, it can be done with very minimal changes, in fact. But it does require
that care.

This installment has already gotten pretty long, and it contains some pretty heavy stuff, so
I've decided to leave that step until next time, when you've had a little more time to digest
what we've done and are ready to start fresh.
In the next installment, then, we'll build a lexical scanner and eliminate the single-character
barrier once and for all. We'll also write our first complete compiler, based on what we've
done in this session. See you then.

Part 7 - Lexical Scanning
INTRODUCTION
In the last installment, I left you with a compiler that would ALMOST work, except that we
were still limited to single-character tokens. The purpose of this session is to get rid of
that restriction, once and for all. This means that we must deal with the concept of the lex-
ical scanner.
Maybe I should mention why we need a lexical scanner at all ... after all, we've been able
to manage all right without one, up till now, even when we provided for multi-character
tokens.
The ONLY reason, really, has to do with keywords. It's a fact of computer life that the syn-
tax for a keyword has the same form as that for any other identifier. We can't tell until we
get the complete word whether or not it IS a keyword. For example, the variable IFILE
and the keyword IF look just alike, until you get to the third character. In the examples to
date, we were always able to make a decision based upon the first character of the token,
but that's no longer possible when keywords are present. We need to know that a given
string is a keyword BEFORE we begin to process it. And that's why we need a scanner.
In the last session, I also promised that we would be able to provide for normal tokens
without making wholesale changes to what we have already done. I didn't lie ... we can,
as you will see later. But every time I set out to install these elements of the software into
the parser we have already built, I had bad feelings about it. The whole thing felt entirely
too much like a band-aid. I finally figured out what was causing the problem: I was install-
ing lexical scanning software without first explaining to you what scanning is all about,
and what the alternatives are. Up till now, I have studiously avoided giving you a lot of
theory, and certainly not alternatives. I generally don't respond well to the textbooks that
give you twenty-five different ways to do something, but no clue as to which way best fits
your needs. I've tried to avoid that pitfall by just showing you ONE method, that WORKS.
But this is an important area. While the lexical scanner is hardly the most exciting part of
a compiler, it often has the most profound effect on the general "look & feel" of the lan-
guage, since after all it's the part closest to the user. I have a particular structure in mind

for the scanner to be used with KISS. It fits the look & feel that I want for that language. But it
may not work at all for the language YOU'RE cooking up, so in this one case I feel that it's
important for you to know your options.
So I'm going to depart, again, from my usual format. In this session we'll be getting much
deeper than usual into the basic theory of languages and grammars. I'll also be talking about
areas OTHER than compilers in which lexical scanning plays an important role. Finally, I will
show you some alternatives for the structure of the lexical scanner. Then, and only then, will
we get back to our parser from the last installment. Bear with me ... I think you'll find it's worth
the wait. In fact, since scanners have many applications outside of compilers, you may well
find this to be the most useful session for you.

LEXICAL SCANNING
Lexical scanning is the process of scanning the stream of input characters and separating
it into strings called tokens. Most compiler texts start here, and devote several chapters to
discussing various ways to build scanners. This approach has its place, but as you have
already seen, there is a lot you can do without ever even addressing the issue, and in fact
the scanner we'll end up with here won't look much like what the texts describe. The rea-
son? Compiler theory and, consequently, the programs resulting from it, must deal with
the most general kind of parsing rules. We don't. In the real world, it is possible to specify
the language syntax in such a way that a pretty simple scanner will suffice. And as
always, KISS is our motto.
Typically, lexical scanning is done in a separate part of the compiler, so that the parser per
se sees only a stream of input tokens. Now, theoretically it is not necessary to separate
this function from the rest of the parser. There is only one set of syntax equations that
define the whole language, so in theory we could write the whole parser in one module.
Why the separation? The answer has both practical and theoretical bases.
In 1956, Noam Chomsky defined the "Chomsky Hierarchy" of grammars. They are:
o Type 0: Unrestricted (e.g., English)
o Type 1: Context-Sensitive
o Type 2: Context-Free
o Type 3: Regular
A few features of the typical programming language (particularly the older ones, such as
FORTRAN) are Type 1, but for the most part all modern languages can be described
using only the last two types, and those are all we'll be dealing with here.
The neat part about these two types is that there are very specific ways to parse them. It
has been shown that any regular grammar can be parsed using a particular form of
abstract machine called the state machine (finite automaton). We have already imple-
mented state machines in some of our recognizers.

Similarly, Type 2 (context-free) grammars can always be parsed using a push-down automa-
ton (a state machine augmented by a stack). We have also implemented these machines.
Instead of implementing a literal stack, we have relied on the built-in stack associated with
recursive coding to do the job, and that in fact is the preferred approach for top-down parsing.
Now, it happens that in real, practical grammars, the parts that qualify as regular expressions
tend to be the lower-level parts, such as the definition of an identifier:
<ident> ::= <letter> [ <letter> | <digit> ]*
Since it takes a different kind of abstract machine to parse the two types of grammars, it
makes sense to separate these lower-level functions into a separate module, the lexical
scanner, which is built around the idea of a state machine. The idea is to use the simplest
parsing technique needed for the job.
There is another, more practical reason for separating scanner from parser. We like to think of
the input source file as a stream of characters, which we process left to right without
backtracking. In practice that isn't possible. Almost every language has certain keywords such as
IF, WHILE, and END. As I mentioned earlier, we can't really know whether a given character
string is a keyword, until we've reached the end of it, as defined by a space or other delimiter.
So in that sense, we MUST save the string long enough to find out whether we have a key-
word or not. That's a limited form of backtracking.
So the structure of a conventional compiler involves splitting up the functions of the lower-
level and higher-level parsing. The lexical scanner deals with things at the character level,
collecting characters into strings, etc., and passing them along to the parser proper as indivis-
ible tokens. It's also considered normal to let the scanner have the job of identifying key-
words.

STATE MACHINES AND ALTERNATIVES

I mentioned that the regular expressions can be parsed using a state machine. In most
compiler texts, and indeed in most compilers as well, you will find this taken literally.
There is typically a real implementation of the state machine, with integers used to define
the current state, and a table of actions to take for each combination of current state and
input character. If you write a compiler front end using the popular Unix tools LEX and
YACC, that's what you'll get. The output of LEX is a state machine implemented in C, plus
a table of actions corresponding to the input grammar given to LEX. The YACC output is
similar ... a canned table-driven parser, plus the table corresponding to the language syn-
tax.
That is not the only choice, though. In our previous installments, you have seen over and
over that it is possible to implement parsers without dealing specifically with tables,
stacks, or state variables. In fact, in Installment V I warned you that if you find yourself
needing these things you might be doing something wrong, and not taking advantage of
the power of Pascal. There are basically two ways to define a state machine's state:
explicitly, with a state number or code, and implicitly, simply by virtue of the fact that I'm at
a certain place in the code (if it's Tuesday, this must be Belgium). We've relied heavily on
the implicit approaches before, and I think you'll find that they work well here, too.
In practice, it may not even be necessary to HAVE a well-defined lexical scanner. This
isn't our first experience at dealing with multi-character tokens. In Installment III, we
extended our parser to provide for them, and we didn't even NEED a lexical scanner. That
was because in that narrow context, we could always tell, just by looking at the single
lookahead character, whether we were dealing with a number, a variable, or an operator. In
effect, we built a distributed lexical scanner, using procedures GetName and GetNum.
With keywords present, we can't know anymore what we're dealing with, until the entire
token is read. This leads us to a more localized scanner; although, as you will see, the
idea of a distributed scanner still has its merits.

SOME EXPERIMENTS IN SCANNING

Before getting back to our compiler, it will be useful to experiment a bit with the general con-
cepts.
Let's begin with the two definitions most often seen in real programming languages:
<ident> ::= <letter> [ <letter> | <digit> ]*
<number> ::= [<digit>]+
(Remember, the '*' indicates zero or more occurrences of the terms in brackets, and the '+',
one or more.)
We have already dealt with similar items in Installment III. Let's begin (as usual) with a bare
cradle. Not surprisingly, we are going to need a new recognizer:
{--------------------------------------------------------------}
{ Recognize an Alphanumeric Character }
function IsAlNum(c: char): boolean;
begin
   IsAlNum := IsAlpha(c) or IsDigit(c);
end;
{--------------------------------------------------------------}

Using this let's write the following two routines, which are very similar to those we've used
before:
{--------------------------------------------------------------}
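{ Get an Identifier }
function GetName: string;
var x: string[8];
begin
   x := '';
   if not IsAlpha(Look) then Expected('Name');
   while IsAlNum(Look) do begin
      x := x + UpCase(Look);
      GetChar;
   end;
   GetName := x;
end;

{--------------------------------------------------------------}
{ Get a Number }
function GetNum: string;
var x: string[16];
begin
   x := '';
   if not IsDigit(Look) then Expected('Integer');
   while IsDigit(Look) do begin
      x := x + Look;
      GetChar;
   end;
   GetNum := x;
end;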
{--------------------------------------------------------------}
(Notice that this version of GetNum returns a string, not an integer as before.)
You can easily verify that these routines work by calling them from the main program, as in
WriteLn(GetName);
This program will print any legal name typed in (maximum eight characters, since that's what
we told GetName). It will reject anything else.
Test the other routine similarly.

WHITE SPACE
We also have dealt with embedded white space before, using the two routines IsWhite
and SkipWhite. Make sure that these routines are in your current version of the cradle,
and add the line
SkipWhite;
at the end of both GetName and GetNum.
Now, let's define the new procedure:
{--------------------------------------------------------------}
{ Lexical Scanner }
Function Scan: string;
begin
   if IsAlpha(Look) then
      Scan := GetName
   else if IsDigit(Look) then
      Scan := GetNum
   else begin
      Scan := Look;
      GetChar;
   end;
   SkipWhite;
end;
{--------------------------------------------------------------}

We can call this from the new main program:
{--------------------------------------------------------------}
{ Main Program }
begin
   Init;
   repeat
      Token := Scan;
      writeln(Token);
   until Token = CR;
end.
{--------------------------------------------------------------}
(You will have to add the declaration of the string Token at the beginning of the program.
Make it any convenient length, say 16 characters.) Now, run the program. Note how the input
string is, indeed, separated into distinct tokens.
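
If you'd rather check the behavior without typing at the keyboard, here is a minimal sketch of the same scanner driven from a fixed string. Everything outside of Scan, GetName, and GetNum is an assumption made for the demonstration: the Source constant, the Cursor index, the #0 end-of-input sentinel, and the stripped-down GetChar all stand in for the real cradle, and the Expected error checks are left out.

```pascal
program ScanDemo;
{ Sketch only: the cradle's keyboard input is replaced by a fixed string. }

const
   Source: string = 'count1 = 42 +x';   { hypothetical sample input }

var
   Look: char;         { lookahead character }
   Cursor: integer;    { index of the next unread character in Source }

{ Fetch the next character, or #0 once the input is used up }
procedure GetChar;
begin
   if Cursor <= Length(Source) then begin
      Look := Source[Cursor];
      Inc(Cursor);
   end
   else
      Look := #0;
end;

function IsAlpha(c: char): boolean;
begin
   IsAlpha := UpCase(c) in ['A'..'Z'];
end;

function IsDigit(c: char): boolean;
begin
   IsDigit := c in ['0'..'9'];
end;

function IsAlNum(c: char): boolean;
begin
   IsAlNum := IsAlpha(c) or IsDigit(c);
end;

{ Skip blanks and tabs only; newlines are not involved here }
procedure SkipWhite;
begin
   while Look in [' ', #9] do
      GetChar;
end;

function GetName: string;
var x: string[8];
begin
   x := '';
   while IsAlNum(Look) do begin
      x := x + UpCase(Look);
      GetChar;
   end;
   SkipWhite;
   GetName := x;
end;

function GetNum: string;
var x: string[16];
begin
   x := '';
   while IsDigit(Look) do begin
      x := x + Look;
      GetChar;
   end;
   SkipWhite;
   GetNum := x;
end;

function Scan: string;
begin
   if IsAlpha(Look) then
      Scan := GetName
   else if IsDigit(Look) then
      Scan := GetNum
   else begin
      Scan := Look;
      GetChar;
   end;
   SkipWhite;
end;

begin
   Cursor := 1;
   GetChar;
   SkipWhite;
   while Look <> #0 do
      WriteLn(Scan);
end.
```

With the sample input shown, this should print COUNT1, =, 42, +, and X, one token per line.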

STATE MACHINES
For the record, a parse routine like GetName does indeed implement a state machine.
The state is implicit in the current position in the code. A very useful trick for visualizing
what's going on is the syntax diagram, or "railroad-track" diagram. It's a little difficult to
draw one in this medium, so I'll use them very sparingly, but the figure below should give
you the idea:
           |-----> Other---------------------------> Error
           |
   Start -------> Letter ---------------> Other -----> Finish
           ^                        V
           |                        |
           |<----- Letter <---------|
           |                        |
           |<----- Digit  <----------
As you can see, this diagram shows how the logic flows as characters are read. Things
begin, of course, in the start state, and end when a character other than an alphanumeric
is found. If the first character is not alpha, an error occurs. Otherwise the machine will
continue looping until the terminating delimiter is found.
Note that at any point in the flow, our position is entirely dependent on the past history of
the input characters. At that point, the action to be taken depends only on the current
state, plus the current input character. That's what makes this a state machine.
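
For comparison, here is a rough sketch of the same diagram coded the other way, with the state held explicitly in a variable instead of implicitly in the program counter. The state names, the IsIdent function, and the use of #0 to stand for "Other" at the end of the string are all inventions for this illustration; nothing like it appears in the cradle.

```pascal
program IdentStateMachine;
{ Sketch: the identifier diagram with an explicit state variable. }

type
   State = (StStart, StInIdent, StFinish, StError);

function IsAlpha(c: char): boolean;
begin
   IsAlpha := UpCase(c) in ['A'..'Z'];
end;

function IsDigit(c: char): boolean;
begin
   IsDigit := c in ['0'..'9'];
end;

{ True if s starts with an identifier that is followed by a
  delimiter or by the end of the string }
function IsIdent(const s: string): boolean;
var
   st: State;
   i: integer;
   c: char;
begin
   st := StStart;
   i := 1;
   while (st <> StFinish) and (st <> StError) do begin
      { Fetch the next input character; #0 plays the role of "Other" }
      if i <= Length(s) then
         c := s[i]
      else
         c := #0;
      { The next state depends only on the current state and c }
      case st of
         StStart:
            begin
               if IsAlpha(c) then st := StInIdent else st := StError;
            end;
         StInIdent:
            begin
               if not (IsAlpha(c) or IsDigit(c)) then st := StFinish;
            end;
      end;
      Inc(i);
   end;
   IsIdent := (st = StFinish);
end;

begin
   WriteLn(IsIdent('x1'));        { legal identifier }
   WriteLn(IsIdent('9lives'));    { starts with a digit }
end.
```

The two calls should print TRUE and FALSE. Compare this with GetName: the loop, the state variable, and the case statement all do work that the implicit version gets for free from ordinary control flow.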

Because of the difficulty of drawing railroad-track diagrams in this medium, I'll continue to
stick to syntax equations from now on. But I highly recommend the diagrams to you for any-
thing you do that involves parsing. After a little practice you can begin to see how to write a
parser directly from the diagrams. Parallel paths get coded into guarded actions (guarded by
IF's or CASE statements), serial paths into sequential calls. It's almost like working from a
schematic.
We didn't even discuss SkipWhite, which was introduced earlier, but it also is a simple state
machine, as is GetNum. So is their parent procedure, Scan. Little machines make big
machines.
The neat thing that I'd like you to note is how painlessly this implicit approach creates these
state machines. I personally prefer it a lot over the table-driven approach. It also results in a
small, tight, and fast scanner.

NEWLINES
Moving right along, let's modify our scanner to handle more than one line. As I mentioned
last time, the most straightforward way to do this is to simply treat the newline characters,
carriage return and line feed, as white space. This is, in fact, the way the C standard
library routine, isspace, works. We didn't actually try this before. I'd like to do it now, so you
can get a feel for the results.
To do this, simply modify the single executable line of IsWhite to read:
IsWhite := c in [' ', TAB, CR, LF];
We need to give the main program a new stop condition, since it will never see a CR.
Let's just use:
until Token = '.';
OK, compile this program and run it. Try a couple of lines, terminated by the period. I
used:
now is the time

for all good men.
Hey, what happened? When I tried it, I didn't get the last token, the period. The program
didn't halt. What's more, when I pressed the 'enter' key a few times, I still didn't get the
period.
If you're still stuck in your program, you'll find that typing a period on a new line will termi-
nate it.
What's going on here? The answer is that we're hanging up in SkipWhite. A quick look at
that routine will show that as long as we're typing null lines, we're going to just continue to
loop. After SkipWhite encounters an LF, it tries to execute a GetChar. But since the input
buffer is now empty, GetChar's read statement insists on having another line. Procedure
Scan gets the terminating period, all right, but it calls SkipWhite to clean up, and Skip-
White won't return until it gets a non-null line.

This kind of behavior is not quite as bad as it seems. In a real compiler, we'd be reading from
an input file instead of the console, and as long as we have some procedure for dealing with
end-of-files, everything will come out OK. But for reading data from the console, the behavior
is just too bizarre. The fact of the matter is that the C/Unix convention is just not compatible
with the structure of our parser, which calls for a lookahead character. The code that the Bell
wizards have implemented doesn't use that convention, which is why they need 'ungetc'.
OK, let's fix the problem. To do that, we need to go back to the old definition of IsWhite
(delete the CR and LF characters) and make use of the procedure Fin that I introduced last
time. If it's not in your current version of the cradle, put it there now.
Also, modify the main program to read:
{--------------------------------------------------------------}
{ Main Program }
begin
   Init;
   repeat
      Token := Scan;
      writeln(Token);
      if Token = CR then Fin;
   until Token = '.';
end.
{--------------------------------------------------------------}

Note the "guard" test preceding the call to Fin. That's what makes the whole thing work,
and ensures that we don't try to read a line ahead.
Try the code now. I think you'll like it better.
If you refer to the code we did in the last installment, you'll find that I quietly sprinkled calls
to Fin throughout the code, wherever a line break was appropriate. This is one of those
areas that really affects the look & feel that I mentioned. At this point I would urge you to
experiment with different arrangements and see how you like them. If you want your lan-
guage to be truly free-field, then newlines should be transparent. In this case, the best
approach is to put the following lines at the BEGINNING of Scan:
while Look = CR do
Fin;
If, on the other hand, you want a line-oriented language like Assembler, BASIC, or FOR-
TRAN (or even Ada... note that it has comments terminated by newlines), then you'll need
for Scan to return CR's as tokens. It must also eat the trailing LF. The best way to do that
is to use this line, again at the beginning of Scan:
if Look = LF then Fin;
For other conventions, you'll have to use other arrangements. In my example of the last
session, I allowed newlines only at specific places, so I was somewhere in the middle
ground. In the rest of these sessions, I'll be picking ways to handle newlines that I happen
to like, but I want you to know how to choose other ways for yourselves.

OPERATORS
We could stop now and have a pretty useful scanner for our purposes. In the fragments of
KISS that we've built so far, the only tokens that have multiple characters are the identifiers
and numbers. All operators were single characters. The only exception I can think of is the
relops <=, >=, and <>, but they could be dealt with as special cases.
Still, other languages have multi-character operators, such as the ':=' of Pascal or the '++' and
'>>' of C. So while we may not need multi-character operators, it's nice to know how to get
them if necessary.
Needless to say, we can handle operators very much the same way as the other tokens. Let's
start with a recognizer:
{--------------------------------------------------------------}
{ Recognize Any Operator }
function IsOp(c: char): boolean;
begin
   IsOp := c in ['+', '-', '*', '/', '<', '>', ':', '='];
end;
{--------------------------------------------------------------}
It's important to note that we DON'T have to include every possible operator in this list. For
example, the parentheses aren't included, nor is the terminating period. The current version of
Scan handles single-character operators just fine as it is. The list above includes only those
characters that can appear in multi-character operators. (For specific languages, of course,
the list can always be edited.)
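
The modified Scan calls a function GetOp, which isn't listed at this point in the text. A minimal sketch of what it might look like, assuming it follows the same pattern as GetName and GetNum, is given here inside a small test harness. The Source string, the Cursor index, and the simplified GetChar and error message are stand-ins for the real cradle, not part of it.

```pascal
program GetOpDemo;
{ Sketch of a string-returning GetOp, modelled on GetName and GetNum. }

const
   Source: string = '<= :=';   { hypothetical sample input }

var
   Look: char;         { lookahead character }
   Cursor: integer;    { index of the next unread character in Source }

procedure GetChar;
begin
   if Cursor <= Length(Source) then begin
      Look := Source[Cursor];
      Inc(Cursor);
   end
   else
      Look := #0;      { sentinel marking end of input }
end;

function IsOp(c: char): boolean;
begin
   IsOp := c in ['+', '-', '*', '/', '<', '>', ':', '='];
end;

{ Collect a run of operator characters into one token }
function GetOp: string;
var x: string[16];
begin
   x := '';
   if not IsOp(Look) then begin
      WriteLn('Error: Operator Expected.');   { stand-in for Expected('Operator') }
      Halt;
   end;
   while IsOp(Look) do begin
      x := x + Look;
      GetChar;
   end;
   GetOp := x;
end;

begin
   Cursor := 1;
   GetChar;
   WriteLn(GetOp);
   while Look = ' ' do GetChar;   { stand-in for SkipWhite }
   WriteLn(GetOp);
end.
```

This should print <= and then :=. Note that the loop is greedy: any unbroken run of operator characters becomes a single token, so an input such as a=-b would come back with the operator '=-'.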

Now, let's modify Scan to read:
{--------------------------------------------------------------}
{ Lexical Scanner }
Function Scan: string;
begin
   while Look = CR do
      Fin;
   if IsAlpha(Look) then
      Scan := GetName
   else if IsDigit(Look) then
      Scan := GetNum
   else if IsOp(Look) then
      Scan := GetOp
   else begin
      Scan := Look;
      GetChar;
   end;
   SkipWhite;
end;
{--------------------------------------------------------------}
Try the program now. You will find that any code fragments you care to throw at it will be
neatly broken up into individual tokens.

LISTS, COMMAS AND COMMAND LINES

Before getting back to the main thrust of our study, I'd like to get on my soapbox for a
moment.
How many times have you worked with a program or operating system that had rigid rules
about how you must separate items in a list? (Try, the last time you used MSDOS!) Some
programs require spaces as delimiters, and some require commas. Worst of all, some require
both, in different places. Most are pretty unforgiving about violations of their rules.
I think this is inexcusable. It's too easy to write a parser that will handle both spaces and com-
mas in a flexible way. Consider the following procedure:
{--------------------------------------------------------------}
{ Skip Over a Comma }
procedure SkipComma;
begin
   SkipWhite;
   if Look = ',' then begin
      GetChar;
      SkipWhite;
   end;
end;
{--------------------------------------------------------------}
This eight-line procedure will skip over a delimiter consisting of any number (including zero)
of spaces, with zero or one comma embedded in the string.
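
To see the effect in isolation, here is a small self-contained sketch that uses SkipComma to pull the items out of a list with deliberately sloppy delimiters. The Source string, the Cursor index, and the cut-down GetChar are assumptions for the demonstration and take the place of the cradle's keyboard input.

```pascal
program ListDemo;
{ Sketch: reading a list whose items are separated by blanks,
  a comma, or both. }

const
   Source: string = 'a, b  c ,d';   { hypothetical list with mixed delimiters }

var
   Look: char;         { lookahead character }
   Cursor: integer;    { index of the next unread character in Source }
   Item: string;

procedure GetChar;
begin
   if Cursor <= Length(Source) then begin
      Look := Source[Cursor];
      Inc(Cursor);
   end
   else
      Look := #0;      { sentinel marking end of input }
end;

procedure SkipWhite;
begin
   while Look in [' ', #9] do
      GetChar;
end;

{ Skip blanks, at most one comma, and any blanks after it }
procedure SkipComma;
begin
   SkipWhite;
   if Look = ',' then begin
      GetChar;
      SkipWhite;
   end;
end;

function IsAlNum(c: char): boolean;
begin
   IsAlNum := (UpCase(c) in ['A'..'Z']) or (c in ['0'..'9']);
end;

begin
   Cursor := 1;
   GetChar;
   SkipWhite;
   while IsAlNum(Look) do begin
      { Collect one item }
      Item := '';
      while IsAlNum(Look) do begin
         Item := Item + Look;
         GetChar;
      end;
      WriteLn(Item);
      { Accept whatever delimiter the user chose }
      SkipComma;
   end;
end.
```

The four items a, b, c, and d should come out one per line, no matter which mix of blanks and commas separates them.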

TEMPORARILY, change the call to SkipWhite in Scan to a call to SkipComma, and try
inputting some lists. Works nicely, eh? Don't you wish more software authors knew about
SkipComma?
For the record, I found that adding the equivalent of SkipComma to my Z80 assembler-
language programs took all of 6 (six) extra bytes of code. Even in a 64K machine, that's
not a very high price to pay for user-friendliness!
I think you can see where I'm going here. Even if you never write a line of compiler code
in your life, there are places in every program where you can use the concepts of parsing.
Any program that processes a command line needs them. In fact, if you think about it for
a bit, you'll have to conclude that any time you write a program that processes user
inputs, you're defining a language. People communicate with languages, and the syntax
implicit in your program defines that language. The real question is: are you going to
define it deliberately and explicitly, or just let it turn out to be whatever the program ends
up parsing?
I claim that you'll have a better, more user-friendly program if you'll take the time to define
the syntax explicitly. Write down the syntax equations or draw the railroad-track diagrams,
and code the parser using the techniques I've shown you here. You'll end up with a better
program, and it will be easier to write, to boot.

GETTING FANCY
OK, at this point we have a pretty nice lexical scanner that will break an input stream up into
tokens. We could use it as it stands and have a serviceable compiler. But there are some other
aspects of lexical scanning that we need to cover.
The main consideration is <shudder> efficiency. Remember when we were dealing with sin-
gle-character tokens, every test was a comparison of a single character, Look, with a byte
constant. We also used the Case statement heavily.
With the multi-character tokens being returned by Scan, all those tests now become string
comparisons. Much slower. And not only slower, but more awkward, since there is no string
equivalent of the Case statement in Pascal. It seems especially wasteful to test for what used
to be single characters ... the '=', '+', and other operators ... using string comparisons.
Using string comparison is not impossible ... Ron Cain used just that approach in writing
Small C. Since we're sticking to the KISS principle here, we would be truly justified in settling
for this approach. But then I would have failed to tell you about one of the key approaches
used in "real" compilers.
You have to remember: the lexical scanner is going to be called a _LOT_! Once for every
token in the whole source program, in fact. Experiments have indicated that the average
compiler spends anywhere from 20% to 40% of its time in the scanner routines. If there were
ever a place where efficiency deserves real consideration, this is it.
For this reason, most compiler writers ask the lexical scanner to do a little more work, by
"tokenizing" the input stream. The idea is to match every token against a list of acceptable
keywords and operators, and return unique codes for each one recognized. In the case of
ordinary variable names or numbers, we just return a code that says what kind of token they
are, and save the actual string somewhere else.
One of the first things we're going to need is a way to identify keywords. We can always do it
with successive IF tests, but it surely would be nice if we had a general-purpose routine that
could compare a given string with a table of keywords. (By the way, we're also going to need
such a routine later, for dealing with symbol tables.) This usually presents a problem in Pas-
cal, because standard Pascal doesn't allow for arrays of variable lengths. It's a real bother to

have to declare a different search routine for every table. Standard Pascal also doesn't
allow for initializing arrays, so you tend to see code like
Table[1] := 'IF';
Table[2] := 'ELSE';
.
.
Table[n] := 'END';
which can get pretty old if there are many keywords.
Fortunately, Turbo Pascal 4.0 has extensions that eliminate both of these problems. Con-
stant arrays can be declared using TP's "typed constant" facility, and the variable dimen-
sions can be handled with its C-like extensions for pointers.
First, modify your declarations like this:
{--------------------------------------------------------------}
{ Type Declarations }
type Symbol = string[8];
SymTab = array[1..1000] of Symbol;
TabPtr = ^SymTab;
{--------------------------------------------------------------}

(The dimension used in SymTab is not real ... no storage is allocated by the declaration itself,
and the number need only be "big enough.")
Now, just beneath those declarations, add the following:
{--------------------------------------------------------------}
{ Definition of Keywords and Token Types }
const KWlist: array [1..4] of Symbol =
('IF', 'ELSE', 'ENDIF', 'END');
{--------------------------------------------------------------}

Next, insert the following new function:
{--------------------------------------------------------------}
{ Table Lookup }
{ If the input string matches a table entry, return the entry index.
  If not, return a zero. }
function Lookup(T: TabPtr; s: string; n: integer): integer;
var i: integer;
    found: boolean;
begin
   found := false;
   i := n;
   while (i > 0) and not found do
      if s = T^[i] then
         found := true
      else
         dec(i);
   Lookup := i;
end;
{--------------------------------------------------------------}

To test it, you can temporarily change the main program as follows:
{--------------------------------------------------------------}
{ Main Program }
begin
   ReadLn(Token);
   WriteLn(Lookup(Addr(KWList), Token, 4));
end.
{--------------------------------------------------------------}
Notice how Lookup is called: The Addr function sets up a pointer to KWList, which gets
passed to Lookup.
OK, give this a try. Since we're bypassing Scan here, you'll have to type the keywords in
upper case to get any matches.
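
If your compiler supports open array parameters (Free Pascal does, for one), the pointer trick can be avoided altogether. What follows is a sketch of that alternative, not the routine used in the rest of this series: the table is passed directly, and the routine finds its length with High, so neither TabPtr nor the count n is needed.

```pascal
program LookupDemo;
{ Sketch: table lookup using an open array parameter instead of
  a pointer to an oversized array type. }

type
   Symbol = string[8];

const
   KWlist: array[1..4] of Symbol = ('IF', 'ELSE', 'ENDIF', 'END');

{ Return the 1-based position of s in T, or 0 if it is absent }
function Lookup(const T: array of Symbol; const s: string): integer;
var i: integer;
begin
   Lookup := 0;
   { Inside the routine an open array is indexed from 0 to High(T),
     whatever bounds the actual array was declared with }
   for i := High(T) downto 0 do
      if s = T[i] then begin
         Lookup := i + 1;
         Exit;
      end;
end;

begin
   WriteLn(Lookup(KWlist, 'ELSE'));
   WriteLn(Lookup(KWlist, 'ENDIF'));
   WriteLn(Lookup(KWlist, 'WHILE'));
end.
```

The three calls should print 2, 3, and 0. Like the original, this searches from the end of the table toward the front and compares whole strings, so ENDIF is never mistaken for END.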
Now that we can recognize keywords, the next thing is to arrange to return codes for them.
So what kind of code should we return? There are really only two reasonable choices: an
enumerated type or a character code. This seems like an ideal application for the Pascal
enumerated type, so let's try that first. For example, you can define
something like
SymType = (IfSym, ElseSym, EndifSym, EndSym, Ident, Number, Operator);
and arrange to return a variable of this type. Let's give it a try. Insert the line above into your
type definitions.

Now, add the two variable declarations:
Token: Symtype; { Current Token }
Value: String[16]; { String Token of Look }
Modify the scanner to read:
{--------------------------------------------------------------}
{ Lexical Scanner }
procedure Scan;
var k: integer;
begin
   while Look = CR do
      Fin;
   if IsAlpha(Look) then begin
      Value := GetName;
      k := Lookup(Addr(KWlist), Value, 4);
      if k = 0 then
         Token := Ident
      else
         Token := SymType(k - 1);
   end
   else if IsDigit(Look) then begin
      Value := GetNum;
      Token := Number;
   end
   else if IsOp(Look) then begin
      Value := GetOp;
      Token := Operator;
   end
   else begin
      Value := Look;
      Token := Operator;
      GetChar;
   end;
   SkipWhite;
end;
{--------------------------------------------------------------}
(Notice that Scan is now a procedure, not a function.)

Finally, modify the main program to read:
{--------------------------------------------------------------}
{ Main Program }
begin
   Init;
   repeat
      Scan;
      case Token of
       Ident: write('Ident ');
       Number: Write('Number ');
       Operator: Write('Operator ');
       IfSym, ElseSym, EndifSym, EndSym: Write('Keyword ');
      end;
      Writeln(Value);
   until Token = EndSym;
end.
{--------------------------------------------------------------}
What we've done here is to replace the string Token used earlier with an enumerated
type. Scan returns the type in variable Token, and returns the string itself in the new vari-
able Value.
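
The conversion SymType(k - 1) deserves a second look, because it only works if the keyword table and the enumeration are kept in the same order. Here is a small sketch of just that step. The Classify function is made up for the illustration, and the last enumeration value is spelled OperSym rather than Operator, since some current compilers (Free Pascal among them) treat that word as reserved.

```pascal
program SymTypeDemo;
{ Sketch: mapping a keyword table index onto an enumerated type. }

type
   Symbol  = string[8];
   SymType = (IfSym, ElseSym, EndifSym, EndSym, Ident, Number, OperSym);

const
   { Must list the keywords in the same order as the first four
     values of SymType }
   KWlist: array[1..4] of Symbol = ('IF', 'ELSE', 'ENDIF', 'END');

{ Return the token type for an upper-case name }
function Classify(const s: string): SymType;
var
   k: integer;
   found: boolean;
begin
   found := false;
   k := 4;
   while (k > 0) and not found do
      if s = KWlist[k] then
         found := true
      else
         Dec(k);
   if k = 0 then
      Classify := Ident
   else
      Classify := SymType(k - 1);   { table index 1 maps to ordinal 0 }
end;

begin
   WriteLn(Ord(Classify('IF')));
   WriteLn(Ord(Classify('END')));
   WriteLn(Ord(Classify('TOTAL')));
end.
```

This should print 0, 3, and 4: the ordinals of IfSym, EndSym, and Ident. Insert a keyword into one list without updating the other and every keyword after it will quietly be given the wrong type, which is something to keep in mind as the table grows.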

OK, compile this and give it a whirl. If everything goes right, you should see that we are now
recognizing keywords.
What we have now is working right, and it was easy to generate from what we had earlier.
However, it still seems a little "busy" to me. We can simplify things a bit by letting GetName,
GetNum, GetOp, and Scan be procedures working with the global variables Token and Value,
thereby eliminating the local copies. It also seems a little cleaner to move the table lookup
into GetName. The new form for the four procedures is, then:
{--------------------------------------------------------------}
{ Get an Identifier }
procedure GetName;
var k: integer;
begin
   Value := '';
   if not IsAlpha(Look) then Expected('Name');
   while IsAlNum(Look) do begin
      Value := Value + UpCase(Look);
      GetChar;
   end;
   k := Lookup(Addr(KWlist), Value, 4);
   if k = 0 then
      Token := Ident
   else
      Token := SymType(k-1);
end;

{--------------------------------------------------------------}
{ Get a Number }
procedure GetNum;
begin
   Value := '';
   if not IsDigit(Look) then Expected('Integer');
   while IsDigit(Look) do begin
      Value := Value + Look;
      GetChar;
   end;
   Token := Number;
end;

{--------------------------------------------------------------}
{ Get an Operator }
procedure GetOp;
begin
   Value := '';
   if not IsOp(Look) then Expected('Operator');
   while IsOp(Look) do begin
      Value := Value + Look;
      GetChar;
   end;
   Token := Operator;
end;

{--------------------------------------------------------------}
{ Lexical Scanner }
procedure Scan;
var k: integer;
begin
   while Look = CR do
      Fin;
   if IsAlpha(Look) then
      GetName
   else if IsDigit(Look) then
      GetNum
   else if IsOp(Look) then
      GetOp
   else begin
      Value := Look;
      Token := Operator;
      GetChar;
   end;
   SkipWhite;
end;
{--------------------------------------------------------------}

RETURNING A CHARACTER
Essentially every scanner I've ever seen that was written in Pascal used the mechanism of an
enumerated type that I've just described. It is certainly a workable mechanism, but it doesn't
seem the simplest approach to me.
For one thing, the list of possible symbol types can get pretty long. Here, I've used just one
symbol, "Operator," to stand for all of the operators, but I've seen other designs that actually
return different codes for each one.
There is, of course, another simple type that can be returned as a code: the character.
Instead of returning the enumeration value 'Operator' for a '+' sign, what's wrong with just
returning the character itself? A character is just as good a variable for encoding the different
token types, it can be used in case statements easily, and it's sure a lot easier to type. What
could be simpler?
Besides, we've already had experience with the idea of encoding keywords as single charac-
ters. Our previous programs are already written that way, so using this approach will minimize
the changes to what we've already done.
Some of you may feel that this idea of returning character codes is too mickey-mouse. I must
admit it gets a little awkward for multi-character operators like '<='. If you choose to stay with
the enumerated type, fine. For the rest, I'd like to show you how to change what we've done
above to support that approach.
First, you can delete the SymType declaration now ... we won't be needing that. And you can
change the type of Token to char.
Next, to replace SymType, add the following constant string:
const KWcode: string[5] = 'xilee';
(I'll be encoding all idents with the single character 'x'.)
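
To make the encoding concrete, here is a minimal, self-contained sketch of how the value returned by Lookup selects a character from KWcode. It assumes Lookup is the linear table search used earlier in this installment, returning 0 when the string is not in the table. The helper Encode and the sample words are hypothetical and exist only for this illustration.

```pascal
program KWDemo;

type Symbol = string[8];
     SymTab = array[1..1000] of Symbol;
     TabPtr = ^SymTab;

const KWlist: array [1..4] of Symbol =
              ('IF', 'ELSE', 'ENDIF', 'END');
const KWcode: string[5] = 'xilee';

{ Table Lookup: returns the index of s in the table, or 0 if absent }
function Lookup(T: TabPtr; s: string; n: integer): integer;
var i: integer;
    found: boolean;
begin
   found := false;
   i := n;
   while (i > 0) and not found do
      if s = T^[i] then
         found := true
      else
         dec(i);
   Lookup := i;
end;

{ Encode (hypothetical helper): map a name to its one-character code }
function Encode(s: string): char;
begin
   Encode := KWcode[Lookup(Addr(KWlist), s, 4) + 1];
end;

begin
   WriteLn('IF    -> ', Encode('IF'));
   WriteLn('ELSE  -> ', Encode('ELSE'));
   WriteLn('ENDIF -> ', Encode('ENDIF'));
   WriteLn('END   -> ', Encode('END'));
   WriteLn('COUNT -> ', Encode('COUNT'));
end.
```

Because Lookup returns 0 for anything that is not a keyword, adding 1 makes the first character of KWcode ('x') the code for identifiers. IF maps to 'i', ELSE to 'l', and both ENDIF and END to 'e'.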

Lastly, modify Scan and its relatives as follows:
{--------------------------------------------------------------}
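{ Get an Identifier }
procedure GetName;
begin
   Value := '';
   if not IsAlpha(Look) then Expected('Name');
   while IsAlNum(Look) do begin
      Value := Value + UpCase(Look);
      GetChar;
   end;
   Token := KWcode[Lookup(Addr(KWlist), Value, 4) + 1];
end;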

{--------------------------------------------------------------}
{ Get a Number }
procedure GetNum;
begin
   Value := '';
   if not IsDigit(Look) then Expected('Integer');
   while IsDigit(Look) do begin
      Value := Value + Look;
      GetChar;
   end;
   Token := '#';
end;

{--------------------------------------------------------------}
{ Get an Operator }
procedure GetOp;
begin
   Value := '';
   if not IsOp(Look) then Expected('Operator');
   while IsOp(Look) do begin
      Value := Value + Look;
      GetChar;
   end;
   if Length(Value) = 1 then
      Token := Value[1]
   else
      Token := '?';
end;

{--------------------------------------------------------------}
{ Lexical Scanner }
procedure Scan;
begin
   while Look = CR do
      Fin;
   if IsAlpha(Look) then
      GetName
   else if IsDigit(Look) then
      GetNum
   else if IsOp(Look) then
      GetOp
   else begin
      Value := Look;
      Token := '?';
      GetChar;
   end;
   SkipWhite;
end;

{--------------------------------------------------------------}
{ Main Program }
begin
Init;
repeat
Scan;
case Token of
'x': write('Ident ');
'#': Write('Number ');
'i', 'l', 'e': Write('Keyword ');
else Write('Operator ');
end;
Writeln(Value);
until Value = 'END';
end.
{--------------------------------------------------------------}
This program should work the same as the previous version. A minor difference in struc-
ture, maybe, but it seems more straightforward to me.

DISTRIBUTED vs CENTRALIZED SCANNERS

The structure for the lexical scanner that I've just shown you is very conventional, and about
99% of all compilers use something very close to it. This is not, however, the only possible
structure, or even always the best one.
The problem with the conventional approach is that the scanner has no knowledge of con-
text. For example, it can't distinguish between the assignment operator '=' and the relational
operator '=' (perhaps that's why both C and Pascal use different strings for the two). All the
scanner can do is to pass the operator along to the parser, which can hopefully tell from the
context which operator is meant. Similarly, a keyword like 'IF' has no place in the middle of a
math expression, but if one happens to appear there, the scanner will see no problem with it,
and will return it to the parser, properly encoded as an 'IF'.
With this kind of approach, we are not really using all the information at our disposal. In the
middle of an expression, for example, the parser "knows" that there is no need to look for
keywords, but it has no way of telling the scanner that. So the scanner continues to do so.
This, of course, slows down the compilation.
In real-world compilers, the designers often arrange for more information to be passed
between parser and scanner, just to avoid this kind of problem. But that can get awkward,
and certainly destroys a lot of the modularity of the structure.
The alternative is to seek some way to use the contextual information that comes from know-
ing where we are in the parser. This leads us back to the notion of a distributed scanner, in
which various portions of the scanner are called depending upon the context.
In KISS, as in most languages, keywords ONLY appear at the beginning of a statement. In
places like expressions, they are not allowed. Also, with one minor exception (the
multi-character relops) that is easily handled, all operators are single characters, which
means that we don't need GetOp at all.
So it turns out that even with multi-character tokens, we can still always tell from the current
lookahead character exactly what kind of token is coming, except at the very beginning of a
statement.

Even at that point, the ONLY kind of token we can accept is an identifier. We need only to
determine if that identifier is a keyword or the target of an assignment statement.
We end up, then, still needing only GetName and GetNum, which are used very much as
we've used them in earlier installments.
It may seem at first to you that this is a step backwards, and a rather primitive approach.
In fact, it is an improvement over the classical scanner, since we're using the scanning
routines only where they're really needed. In places where keywords are not allowed, we
don't slow things down by looking for them.
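
The claim that the lookahead character alone tells us which routine to call can be sketched as follows. This is only an illustration, not the scanner used in the listings below: it reads from a constant string instead of standard input, and the names Source and Cursor are assumptions made for the demo. Notice that there is no central Scan and no keyword lookup anywhere in it.

```pascal
program LookaheadDemo;

const Source: string = 'x1=12+abc';   { sample input, chosen for illustration }

var Cursor: integer;   { index of the lookahead character in Source }
    Look  : char;      { lookahead character; #0 marks end of input }

{ Advance the lookahead character }
procedure GetChar;
begin
   Inc(Cursor);
   if Cursor <= Length(Source) then
      Look := Source[Cursor]
   else
      Look := #0;
end;

function IsAlpha(c: char): boolean;
begin
   IsAlpha := UpCase(c) in ['A'..'Z'];
end;

function IsDigit(c: char): boolean;
begin
   IsDigit := c in ['0'..'9'];
end;

{ Collect an identifier: letters and digits, folded to upper case }
function GetName: string;
var s: string;
begin
   s := '';
   while IsAlpha(Look) or IsDigit(Look) do begin
      s := s + UpCase(Look);
      GetChar;
   end;
   GetName := s;
end;

{ Collect a run of digits }
function GetNum: string;
var s: string;
begin
   s := '';
   while IsDigit(Look) do begin
      s := s + Look;
      GetChar;
   end;
   GetNum := s;
end;

begin
   Cursor := 0;
   GetChar;
   { The lookahead character decides which routine runs }
   while Look <> #0 do
      if IsAlpha(Look) then
         WriteLn('Name     ', GetName)
      else if IsDigit(Look) then
         WriteLn('Number   ', GetNum)
      else begin
         WriteLn('Operator ', Look);
         GetChar;
      end;
end.
```

For the sample input this reports the name X1, the operator =, the number 12, the operator +, and the name ABC, in that order. Only at the start of a statement would we go one step further and check the name against the keyword table.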

MERGING SCANNER AND PARSER

Now that we've covered all of the theory and general aspects of lexical scanning that we'll be
needing, I'm FINALLY ready to back up my claim that we can accommodate multi-character
tokens with minimal change to our previous work. To keep things short and simple I will
restrict myself here to a subset of what we've done before; I'm allowing only one control con-
struct (the IF) and no Boolean expressions. That's enough to demonstrate the parsing of both
keywords and expressions. The extension to the full set of constructs should be pretty appar-
ent from what we've already done.
All the elements of the program to parse this subset, using single-character tokens, exist
already in our previous programs. I built it by judicious copying of these files, but I wouldn't
dare try to lead you through that process. Instead, to avoid any confusion, the whole program
is shown below:

{--------------------------------------------------------------}
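program KISS;
{--------------------------------------------------------------}
{ Constant Declarations }
const TAB = ^I;
      CR  = ^M;
      LF  = ^J;
{--------------------------------------------------------------}
{ Type Declarations }
type Symbol = string[8];
     SymTab = array[1..1000] of Symbol;
     TabPtr = ^SymTab;
{--------------------------------------------------------------}
{ Variable Declarations }
var Look  : char;      { Lookahead Character }
    LCount: integer;   { Label Counter }
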
{--------------------------------------------------------------}
procedure GetChar;
begin
Read(Look);
end;
{--------------------------------------------------------------}
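{ Report an Error }
procedure Error(s: string);
begin
   WriteLn;
   WriteLn(^G, 'Error: ', s, '.');
end;
{--------------------------------------------------------------}
{ Report Error and Halt }
procedure Abort(s: string);
begin
   Error(s);
   Halt;
end;

{--------------------------------------------------------------}
{ Report What Was Expected }
procedure Expected(s: string);
begin
   Abort(s + ' Expected');
end;
{--------------------------------------------------------------}
{ Recognize an Alpha Character }
function IsAlpha(c: char): boolean;
begin
   IsAlpha := UpCase(c) in ['A'..'Z'];
end;
{--------------------------------------------------------------}
{ Recognize a Decimal Digit }
function IsDigit(c: char): boolean;
begin
   IsDigit := c in ['0'..'9'];
end;

{--------------------------------------------------------------}
{ Recognize an AlphaNumeric Character }
function IsAlNum(c: char): boolean;
begin
   IsAlNum := IsAlpha(c) or IsDigit(c);
end;
{--------------------------------------------------------------}
{ Recognize an Addop }
function IsAddop(c: char): boolean;
begin
   IsAddop := c in ['+', '-'];
end;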
{--------------------------------------------------------------}
{ Recognize a Mulop }
function IsMulop(c: char): boolean;
begin
IsMulop := c in ['*', '/'];
end;

{--------------------------------------------------------------}
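{ Recognize White Space }
function IsWhite(c: char): boolean;
begin
   IsWhite := c in [' ', TAB];
end;
{--------------------------------------------------------------}
{ Skip Over Leading White Space }
procedure SkipWhite;
begin
   while IsWhite(Look) do
      GetChar;
end;
{--------------------------------------------------------------}
{ Match a Specific Input Character }
procedure Match(x: char);
begin
   if Look <> x then Expected('''' + x + '''');
   GetChar;
   SkipWhite;
end;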

{--------------------------------------------------------------}
{ Skip a CRLF }
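procedure Fin;
begin
   if Look = CR then GetChar;
   if Look = LF then GetChar;
   SkipWhite;
end;
{--------------------------------------------------------------}
{ Get an Identifier }
function GetName: char;
begin
   while Look = CR do
      Fin;
   if not IsAlpha(Look) then Expected('Name');
   GetName := UpCase(Look);
   GetChar;
   SkipWhite;
end;

{--------------------------------------------------------------}
{ Get a Number }
function GetNum: char;
begin
   if not IsDigit(Look) then Expected('Integer');
   GetNum := Look;
   GetChar;
   SkipWhite;
end;
{--------------------------------------------------------------}
{ Generate a Unique Label }
function NewLabel: string;
var S: string;
begin
   Str(LCount, S);
   NewLabel := 'L' + S;
   Inc(LCount);
end;

{--------------------------------------------------------------}
{ Post a Label To Output }
procedure PostLabel(L: string);
begin
   WriteLn(L, ':');
end;
{--------------------------------------------------------------}
{ Output a String with Tab }
procedure Emit(s: string);
begin
   Write(TAB, s);
end;
{--------------------------------------------------------------}
{ Output a String with Tab and CRLF }
procedure EmitLn(s: string);
begin
   Emit(s);
   WriteLn;
end;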

{---------------------------------------------------------------}
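{ Parse and Translate an Identifier }
procedure Ident;
var Name: char;
begin
   Name := GetName;
   if Look = '(' then begin
      Match('(');
      Match(')');
      EmitLn('BSR ' + Name);
      end
   else
      EmitLn('MOVE ' + Name + '(PC),D0');
end;

{---------------------------------------------------------------}
{ Parse and Translate a Math Factor }
procedure Expression; Forward;
procedure Factor;
begin
   if Look = '(' then begin
      Match('(');
      Expression;
      Match(')');
      end
   else if IsAlpha(Look) then
      Ident
   else
      EmitLn('MOVE #' + GetNum + ',D0');
end;

{---------------------------------------------------------------}
{ Parse and Translate the First Math Factor }
procedure SignedFactor;
var s: boolean;
begin
   s := Look = '-';
   if IsAddop(Look) then begin
      GetChar;
      SkipWhite;
   end;
   Factor;
   if s then
      EmitLn('NEG D0');
end;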
{--------------------------------------------------------------}

{ Recognize and Translate a Multiply }
procedure Multiply;
begin
   Match('*');
   Factor;
   EmitLn('MULS (SP)+,D0');
end;

{-------------------------------------------------------------}
{ Recognize and Translate a Divide }
procedure Divide;
begin
   Match('/');
   Factor;
   EmitLn('MOVE (SP)+,D1');
   EmitLn('EXS.L D0');
   EmitLn('DIVS D1,D0');
end;
{---------------------------------------------------------------}
{ Completion of Term Processing (called by Term and FirstTerm) }
procedure Term1;
begin
   while IsMulop(Look) do begin
      EmitLn('MOVE D0,-(SP)');
      case Look of
       '*': Multiply;
       '/': Divide;
      end;
   end;
end;

{---------------------------------------------------------------}
procedure Term;
begin
Factor;
Term1;
end;
{---------------------------------------------------------------}
{ Parse and Translate a Math Term with Possible Leading Sign }
procedure FirstTerm;
begin
SignedFactor;
Term1;
end;
{---------------------------------------------------------------}
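{ Recognize and Translate an Add }
procedure Add;
begin
   Match('+');
   Term;
   EmitLn('ADD (SP)+,D0');
end;

{---------------------------------------------------------------}
{ Recognize and Translate a Subtract }
procedure Subtract;
begin
   Match('-');
   Term;
   EmitLn('SUB (SP)+,D0');
   EmitLn('NEG D0');
end;
{---------------------------------------------------------------}
{ Parse and Translate an Expression }
procedure Expression;
begin
   FirstTerm;
   while IsAddop(Look) do begin
      EmitLn('MOVE D0,-(SP)');
      case Look of
       '+': Add;
       '-': Subtract;
      end;
   end;
end;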

{---------------------------------------------------------------}
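{ Parse and Translate a Boolean Condition }
{ This version is a dummy }
procedure Condition;
begin
   EmitLn('Condition');
end;
{---------------------------------------------------------------}
{ Recognize and Translate an IF Construct }
procedure Block; Forward;
procedure DoIf;
var L1, L2: string;
begin
   Match('i');
   Condition;
   L1 := NewLabel;
   L2 := L1;
   EmitLn('BEQ ' + L1);
   Block;
   if Look = 'l' then begin
      Match('l');
      L2 := NewLabel;
      EmitLn('BRA ' + L2);
      PostLabel(L1);
      Block;
   end;
   PostLabel(L2);
   Match('e');
end;
{--------------------------------------------------------------}
{ Parse and Translate an Assignment Statement }
procedure Assignment;
var Name: char;
begin
   Name := GetName;
   Match('=');
   Expression;
   EmitLn('LEA ' + Name + '(PC),A0');
   EmitLn('MOVE D0,(A0)');
end;

{--------------------------------------------------------------}
{ Recognize and Translate a Statement Block }
procedure Block;
begin
   while not (Look in ['e', 'l']) do begin
      case Look of
       'i': DoIf;
       CR: while Look = CR do
              Fin;
       else Assignment;
      end;
   end;
end;
{--------------------------------------------------------------}
{ Parse and Translate a Program }
procedure DoProgram;
begin
   Block;
   if Look <> 'e' then Expected('END');
   EmitLn('END')
end;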

{--------------------------------------------------------------}
{ Initialize }
procedure Init;
begin
LCount := 0;
GetChar;
end;
{--------------------------------------------------------------}
{ Main Program }
begin
Init;
DoProgram;
end.
{--------------------------------------------------------------}

A couple of comments:
(1) The form for the expression parser, using FirstTerm, etc.,
is a little different from what you've seen before. It's
yet another variation on the same theme. Don't let it throw
you ... the change is not required for what follows.
(2) Note that, as usual, I had to add calls to Fin at strategic
spots to allow for multiple lines.
Before we proceed to adding the scanner, first copy this file and verify that it does indeed
parse things correctly. Don't forget the "codes": 'i' for IF, 'l' for ELSE, and 'e' for END or
ENDIF.
If the program works, then let's press on. In adding the scanner modules to the program,
it helps to have a systematic plan. In all the parsers we've written to date, we've stuck to a
convention that the current lookahead character should always be a non-blank character.
We preload the lookahead character in Init, and keep the "pump primed" after that. To
keep the thing working right at newlines, we had to modify this a bit and treat the newline
as a legal token.
In the multi-character version, the rule is similar: The current lookahead character should
always be left at the BEGINNING of the next token, or at a newline.

The multi-character version is shown next. To get it, I've made the following changes:
o Added the variables Token and Value, and the type definitions
needed by Lookup.
o Added the definitions of KWList and KWcode.
o Added Lookup.
o Replaced GetName and GetNum by their multi-character versions.
(Note that the call to Lookup has been moved out of GetName,
so that it will not be executed for calls within an
expression.)
o Created a new, vestigial Scan that calls GetName, then scans
for keywords.
o Created a new procedure, MatchString, that looks for a
specific keyword. Note that, unlike Match, MatchString does
NOT read the next keyword.
o Modified Block to call Scan.
o Changed the calls to Fin a bit. Fin is now called within
GetName.

Here is the program in its entirety:
{--------------------------------------------------------------}
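program KISS;
{--------------------------------------------------------------}
{ Constant Declarations }
const TAB = ^I;
      CR  = ^M;
      LF  = ^J;
{--------------------------------------------------------------}
{ Type Declarations }
type Symbol = string[8];
     SymTab = array[1..1000] of Symbol;
     TabPtr = ^SymTab;

{--------------------------------------------------------------}
{ Variable Declarations }
var Look  : char;         { Lookahead Character }
    Token : char;         { Encoded Token }
    Value : string[16];   { Unencoded Token }
    LCount: integer;      { Label Counter }
{--------------------------------------------------------------}
{ Definition of Keywords and Token Codes }
const KWlist: array [1..4] of Symbol =
              ('IF', 'ELSE', 'ENDIF', 'END');
const KWcode: string[5] = 'xilee';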
{--------------------------------------------------------------}
procedure GetChar;
begin
Read(Look);
end;

{--------------------------------------------------------------}
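{ Report an Error }
procedure Error(s: string);
begin
   WriteLn;
   WriteLn(^G, 'Error: ', s, '.');
end;
{--------------------------------------------------------------}
{ Report Error and Halt }
procedure Abort(s: string);
begin
   Error(s);
   Halt;
end;
{--------------------------------------------------------------}
{ Report What Was Expected }
procedure Expected(s: string);
begin
   Abort(s + ' Expected');
end;

{--------------------------------------------------------------}
{ Recognize an Alpha Character }
function IsAlpha(c: char): boolean;
begin
   IsAlpha := UpCase(c) in ['A'..'Z'];
end;
{--------------------------------------------------------------}
{ Recognize a Decimal Digit }
function IsDigit(c: char): boolean;
begin
   IsDigit := c in ['0'..'9'];
end;
{--------------------------------------------------------------}
{ Recognize an AlphaNumeric Character }
function IsAlNum(c: char): boolean;
begin
   IsAlNum := IsAlpha(c) or IsDigit(c);
end;

{--------------------------------------------------------------}
{ Recognize an Addop }
function IsAddop(c: char): boolean;
begin
   IsAddop := c in ['+', '-'];
end;
{--------------------------------------------------------------}
{ Recognize a Mulop }
function IsMulop(c: char): boolean;
begin
   IsMulop := c in ['*', '/'];
end;
{--------------------------------------------------------------}
{ Recognize White Space }
function IsWhite(c: char): boolean;
begin
   IsWhite := c in [' ', TAB];
end;

{--------------------------------------------------------------}
{ Skip Over Leading White Space }
procedure SkipWhite;
begin
   while IsWhite(Look) do
      GetChar;
end;
{--------------------------------------------------------------}
{ Match a Specific Input Character }
procedure Match(x: char);
begin
   if Look <> x then Expected('''' + x + '''');
   GetChar;
   SkipWhite;
end;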
{--------------------------------------------------------------}
{ Skip a CRLF }
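procedure Fin;
begin
   if Look = CR then GetChar;
   if Look = LF then GetChar;
   SkipWhite;
end;
{--------------------------------------------------------------}
{ Table Lookup }
function Lookup(T: TabPtr; s: string; n: integer): integer;
var i: integer;
    found: boolean;
begin
   found := false;
   i := n;
   while (i > 0) and not found do
      if s = T^[i] then
         found := true
      else
         dec(i);
   Lookup := i;
end;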

{--------------------------------------------------------------}
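{ Get an Identifier }
procedure GetName;
begin
   while Look = CR do
      Fin;
   if not IsAlpha(Look) then Expected('Name');
   Value := '';
   while IsAlNum(Look) do begin
      Value := Value + UpCase(Look);
      GetChar;
   end;
   SkipWhite;
end;

{--------------------------------------------------------------}
{ Get a Number }
procedure GetNum;
begin
   if not IsDigit(Look) then Expected('Integer');
   Value := '';
   while IsDigit(Look) do begin
      Value := Value + Look;
      GetChar;
   end;
   Token := '#';
   SkipWhite;
end;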
{--------------------------------------------------------------}
{ Get an Identifier and Scan it for Keywords }
procedure Scan;
begin
GetName;
Token := KWcode[Lookup(Addr(KWlist), Value, 4) + 1];
end;

{--------------------------------------------------------------}
{ Match a Specific Input String }
procedure MatchString(x: string);
begin
if Value <> x then Expected('''' + x + '''');
end;
{--------------------------------------------------------------}
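{ Generate a Unique Label }
function NewLabel: string;
var S: string;
begin
   Str(LCount, S);
   NewLabel := 'L' + S;
   Inc(LCount);
end;
{--------------------------------------------------------------}
{ Post a Label To Output }
procedure PostLabel(L: string);
begin
   WriteLn(L, ':');
end;

{--------------------------------------------------------------}
{ Output a String with Tab }
procedure Emit(s: string);
begin
   Write(TAB, s);
end;
{--------------------------------------------------------------}
{ Output a String with Tab and CRLF }
procedure EmitLn(s: string);
begin
   Emit(s);
   WriteLn;
end;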

{---------------------------------------------------------------}
{ Parse and Translate an Identifier }
procedure Ident;
begin
   GetName;
   if Look = '(' then begin
      Match('(');
      Match(')');
      EmitLn('BSR ' + Value);
      end
   else
      EmitLn('MOVE ' + Value + '(PC),D0');
end;

{---------------------------------------------------------------}
{ Parse and Translate a Math Factor }
procedure Expression; Forward;
procedure Factor;
begin
   if Look = '(' then begin
      Match('(');
      Expression;
      Match(')');
      end
   else if IsAlpha(Look) then
      Ident
   else begin
      GetNum;
      EmitLn('MOVE #' + Value + ',D0');
   end;
end;

{---------------------------------------------------------------}
{ Parse and Translate the First Math Factor }
procedure SignedFactor;
var s: boolean;
begin
   s := Look = '-';
   if IsAddop(Look) then begin
      GetChar;
      SkipWhite;
   end;
   Factor;
   if s then
      EmitLn('NEG D0');
end;
{--------------------------------------------------------------}

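{ Recognize and Translate a Multiply }
procedure Multiply;
begin
   Match('*');
   Factor;
   EmitLn('MULS (SP)+,D0');
end;

{-------------------------------------------------------------}
{ Recognize and Translate a Divide }
procedure Divide;
begin
   Match('/');
   Factor;
   EmitLn('MOVE (SP)+,D1');
   EmitLn('EXS.L D0');
   EmitLn('DIVS D1,D0');
end;
{---------------------------------------------------------------}
{ Completion of Term Processing (called by Term and FirstTerm) }
procedure Term1;
begin
   while IsMulop(Look) do begin
      EmitLn('MOVE D0,-(SP)');
      case Look of
       '*': Multiply;
       '/': Divide;
      end;
   end;
end;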

{---------------------------------------------------------------}
procedure Term;
begin
Factor;
Term1;
end;
{---------------------------------------------------------------}
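{ Parse and Translate a Math Term with Possible Leading Sign }
procedure FirstTerm;
begin
   SignedFactor;
   Term1;
end;
{---------------------------------------------------------------}
{ Recognize and Translate an Add }
procedure Add;
begin
   Match('+');
   Term;
   EmitLn('ADD (SP)+,D0');
end;

{---------------------------------------------------------------}
{ Recognize and Translate a Subtract }
procedure Subtract;
begin
   Match('-');
   Term;
   EmitLn('SUB (SP)+,D0');
   EmitLn('NEG D0');
end;
{---------------------------------------------------------------}
{ Parse and Translate an Expression }
procedure Expression;
begin
   FirstTerm;
   while IsAddop(Look) do begin
      EmitLn('MOVE D0,-(SP)');
      case Look of
       '+': Add;
       '-': Subtract;
      end;
   end;
end;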

{---------------------------------------------------------------}
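{ Parse and Translate a Boolean Condition }
{ This version is a dummy }
procedure Condition;
begin
   EmitLn('Condition');
end;

{---------------------------------------------------------------}
{ Recognize and Translate an IF Construct }
procedure Block; Forward;
procedure DoIf;
var L1, L2: string;
begin
   Condition;
   L1 := NewLabel;
   L2 := L1;
   EmitLn('BEQ ' + L1);
   Block;
   if Token = 'l' then begin
      L2 := NewLabel;
      EmitLn('BRA ' + L2);
      PostLabel(L1);
      Block;
   end;
   PostLabel(L2);
   MatchString('ENDIF');
end;

{--------------------------------------------------------------}
{ Parse and Translate an Assignment Statement }
procedure Assignment;
var Name: string;
begin
   Name := Value;
   Match('=');
   Expression;
   EmitLn('LEA ' + Name + '(PC),A0');
   EmitLn('MOVE D0,(A0)');
end;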
{--------------------------------------------------------------}

procedure Block;
begin
Scan;
while not (Token in ['e', 'l']) do begin
case Token of
'i': DoIf;
else Assignment;
end;
Scan;
end;
end;

{--------------------------------------------------------------}
{ Parse and Translate a Program }
procedure DoProgram;
begin
   Block;
   MatchString('END');
   EmitLn('END')
end;
{--------------------------------------------------------------}
{ Initialize }
procedure Init;
begin
LCount := 0;
GetChar;
end;
{--------------------------------------------------------------}
{ Main Program }
begin
Init;
DoProgram;
end.
{--------------------------------------------------------------}
Compare this program with its single-character counterpart. I think you will agree that the
differences are minor.

CONCLUSION
At this point, you have learned how to parse and generate code for expressions, Boolean
expressions, and control structures. You have now learned how to develop lexical scanners,
and how to incorporate their elements into a translator. You have still not seen ALL the ele-
ments combined into one program, but on the basis of what we've done before you should
find it a straightforward matter to extend our earlier programs to include scanners.
We are very close to having all the elements that we need to build a real, functional compiler.
There are still a few things missing, notably procedure calls and type definitions. We will deal
with those in the next few sessions. Before doing so, however, I thought it would be fun to
turn the translator above into a true compiler. That's what we'll be doing in the next install-
ment.
Up till now, we've taken a rather bottom-up approach to parsing, beginning with low-level con-
structs and working our way up. In the next installment, I'll also be taking a look from the top
down, and we'll discuss how the structure of the translator is altered by changes in the lan-
guage definition.
See you then.

Part 8 - A Little Philosophy
INTRODUCTION
This is going to be a different kind of session than the others in our series on parsing and
compiler construction. For this session, there won't be any experiments to do or code to
write. This once, I'd like to just talk with you for a while. Mercifully, it will be a short ses-
sion, and then we can take up where we left off, hopefully with renewed vigor.
When I was in college, I found that I could always follow a prof's lecture a lot better if I
knew where he was going with it. I'll bet you were the same.
So I thought maybe it's about time I told you where we're going with this series: what's
coming up in future installments, and in general what all this is about. I'll also share some
general thoughts concerning the usefulness of what we've been doing.

THE ROAD HOME

So far, we've covered the parsing and translation of arithmetic expressions, Boolean expres-
sions, and combinations connected by relational operators. We've also done the same for
control constructs. In all of this we've leaned heavily on the use of top-down, recursive
descent parsing, BNF definitions of the syntax, and direct generation of assembly-language
code. We also learned the value of such tricks as single-character tokens to help us see the
forest through the trees. In the last installment we dealt with lexical scanning, and I showed
you simple but powerful ways to remove the single-character barriers.
Throughout the whole study, I've emphasized the KISS philosophy ... Keep It Simple, Sidney
... and I hope by now you've realized just how simple this stuff can really be. While there are
for sure areas of compiler theory that are truly intimidating, the ultimate message of this
series is that in practice you can just politely sidestep many of these areas. If the language
definition cooperates or, as in this series, if you can define the language as you go, it's possi-
ble to write down the language definition in BNF with reasonable ease. And, as we've seen,
you can crank out parse procedures from the BNF just about as fast as you can type.
As our compiler has taken form, it's gotten more parts, but each part is quite small and sim-
ple, and very much like all the others.
At this point, we have many of the makings of a real, practical compiler. As a matter of fact,
we already have all we need to build a toy compiler for a language as powerful as, say, Tiny
BASIC. In the next couple of installments, we'll go ahead and define that language.

To round out the series, we still have a few items to cover. These include:
o Procedure calls, with and without parameters
o Local and global variables
o Basic types, such as character and integer types
o Arrays
o Strings
o User-defined types and structures
o Tree-structured parsers and intermediate languages
o Optimization
These will all be covered in future installments. When we're finished, you'll have all the
tools you need to design and build your own languages, and the compilers to translate
them.
I can't design those languages for you, but I can make some comments and recommen-
dations. I've already sprinkled some throughout past installments. You've seen, for exam-
ple, the control constructs I prefer.
These constructs are going to be part of the languages I build. I have three languages in
mind at this point, two of which you will see in installments to come:
TINY - A minimal, but usable language on the order of Tiny
BASIC or Tiny C. It won't be very practical, but it will have
enough power to let you write and run real programs that do
something worthwhile.

KISS - The language I'm building for my own use. KISS is
intended to be a systems programming language. It won't have
strong typing or fancy data structures, but it will support
most of the things I want to do with a higher-order language
(HOL), except perhaps writing compilers.

I've also been toying for years with the idea of a HOL-like assembler, with structured control
constructs and HOL-like assignment statements. That, in fact, was the impetus behind my
original foray into the jungles of compiler theory. This one may never be built, simply because
I've learned that it's actually easier to implement a language like KISS, that only uses a sub-
set of the CPU instructions. As you know, assembly language can be bizarre and irregular in
the extreme, and a language that maps one-for-one onto it can be a real challenge. Still, I've
always felt that the syntax used in conventional assemblers is dumb ... why is
MOVE.L A,B
better, or easier to translate, than
B=A ?
I think it would be an interesting exercise to develop a "compiler" that would give the pro-
grammer complete access to and control over the full complement of the CPU instruction set,
and would allow you to generate programs as efficient as assembly language, without the
pain of learning a set of mnemonics. Can it be done? I don't know. The real question may be,
"Will the resulting language be any easier to write than assembly"? If not, there's no point in
it. I think that it can be done, but I'm not completely sure yet how the syntax should look.
Perhaps you have some comments or suggestions on this one. I'd love to hear them.
You probably won't be surprised to learn that I've already worked ahead in most of the areas
that we will cover. I have some good news: Things never get much harder than they've been
so far. It's possible to build a complete, working compiler for a real language, using nothing
but the same kinds of techniques you've learned so far. And THAT brings up some interesting
questions.

WHY IS IT SO SIMPLE?
Before embarking on this series, I always thought that compilers were just naturally com-
plex computer programs ... the ultimate challenge. Yet the things we have done here have
usually turned out to be quite simple, sometimes even trivial.
For a while, I thought it was simply because I hadn't yet gotten into the meat of the
subject. I had only covered the simple parts. I will freely admit to you that, even when I began
the series, I wasn't sure how far we would be able to go before things got too complex to
deal with in the ways we have so far. But at this point I've already been down the road far
enough to see the end of it. Guess what?
THERE ARE NO HARD PARTS!
Then, I thought maybe it was because we were not generating very good object code.
Those of you who have been following the series and trying sample compiles know that,
while the code works and is rather foolproof, its efficiency is pretty awful. I figured that if
we were concentrating on turning out tight code, we would soon find all that missing com-
plexity.
To some extent, that one is true. In particular, my first few efforts at trying to improve effi-
ciency introduced complexity at an alarming rate. But since then I've been tinkering
around with some simple optimizations and I've found some that result in very respect-
able code quality, WITHOUT adding a lot of complexity.
Finally, I thought that perhaps the saving grace was the "toy compiler" nature of the study.
I have made no pretense that we were ever going to be able to build a compiler to com-
pete with Borland and Microsoft. And yet, again, as I get deeper into this thing the differ-
ences are starting to fade away.
Just to make sure you get the message here, let me state it flat out:
USING THE TECHNIQUES WE'VE USED HERE, IT IS POSSIBLE TO BUILD A
PRODUCTION-QUALITY, WORKING COMPILER WITHOUT ADDING A LOT OF
COMPLEXITY TO WHAT WE'VE ALREADY DONE.

Since the series began I've received some comments from you. Most of them echo my own
thoughts: "This is easy! Why do the textbooks make it seem so hard?" Good question.
Recently, I've gone back and looked at some of those texts again, and even bought and read
some new ones. Each time, I come away with the same feeling: These guys have made it
seem too hard.
What's going on here? Why does the whole thing seem difficult in the texts, but easy to us?
Are we that much smarter than Aho, Ullman, Brinch Hansen, and all the rest?
Hardly. But we are doing some things differently, and more and more I'm starting to appreci-
ate the value of our approach, and the way that it simplifies things. Aside from the obvious
shortcuts that I outlined in Part I, like single-character tokens and console I/O, we have made
some implicit assumptions and done some things differently from those who have designed
compilers in the past. As it turns out, our approach makes life a lot easier.
So why didn't all those other guys use it?
You have to remember the context of some of the earlier compiler development. These peo-
ple were working with very small computers of limited capacity. Memory was very limited, the
CPU instruction set was minimal, and programs ran in batch mode rather than interactively.
As it turns out, these caused some key design decisions that have really complicated the
designs. Until recently, I hadn't realized how much of classical compiler design was driven by
the available hardware.
Even in cases where these limitations no longer apply, people have tended to structure their
programs in the same way, since that is the way they were taught to do it.
In our case, we have started with a blank sheet of paper. There is a danger there, of course,
that you will end up falling into traps that other people have long since learned to avoid. But it
also has allowed us to take different approaches that, partly by design and partly by pure
dumb luck, have allowed us to gain simplicity.

Here are the areas that I think have led to complexity in the past:
o Limited RAM Forcing Multiple Passes
I just read "Brinch Hansen on Pascal Compilers" (an excellent book, BTW). He
developed a Pascal compiler for a PC, but he started the effort in 1981 with a 64K
system, and so almost every design decision he made was aimed at making the
compiler fit into RAM. To do this, his compiler has three passes, one of which is
the lexical scanner. There is no way he could, for example, use the distributed
scanner I introduced in the last installment, because the program structure
wouldn't allow it. He also required not one but two intermediate languages, to pro-
vide the communication between phases.
All the early compiler writers had to deal with this issue: Break the compiler up into
enough parts so that it will fit in memory. When you have multiple passes, you
need to add data structures to support the information that each pass leaves
behind for the next. That adds complexity, and ends up driving the design. Lee's
book, "The Anatomy of a Compiler," mentions a FORTRAN compiler developed
for an IBM 1401. It had no fewer than 63 separate passes! Needless to say, in a
compiler like this the separation into phases would dominate the design.
Even in situations where RAM is plentiful, people have tended to use the same
techniques because that is what they're familiar with. It wasn't until Turbo Pascal
came along that we found how simple a compiler could be if you started with dif-
ferent assumptions.

o Batch Processing
In the early days, batch processing was the only choice ... there was no interactive
computing. Even today, compilers run in essentially batch mode.
In a mainframe compiler as well as many micro compilers, considerable effort is
expended on error recovery ... it can consume as much as 30-40% of the compiler
and completely drive the design. The idea is to avoid halting on the first error, but
rather to keep going at all costs, so that you can tell the programmer about as many
errors in the whole program as possible.
All of that harks back to the days of the early mainframes, where turnaround time was
measured in hours or days, and it was important to squeeze every last ounce of infor-
mation out of each run.
In this series, I've been very careful to avoid the issue of error recovery, and instead
our compiler simply halts with an error message on the first error. I will frankly admit
that it was mostly because I wanted to take the easy way out and keep things simple.
But this approach, pioneered by Borland in Turbo Pascal, also has a lot going for it
anyway. Aside from keeping the compiler simple, it also fits very well with the idea of
an interactive system. When compilation is fast, and especially when you have an edi-
tor such as Borland's that will take you right to the point of the error, then it makes a lot
of sense to stop there, and just restart the compilation after the error is fixed.

o Large Programs
Early compilers were designed to handle large programs ... essentially infinite
ones. In those days there was little choice; the ideas of subroutine libraries and
separate compilation were still in the future. Again, this assumption led to
multi-pass designs and intermediate files to hold the results of partial processing.
Brinch Hansen's stated goal was that the compiler should be able to compile itself.
Again, because of his limited RAM, this drove him to a multi-pass design. He
needed as little resident compiler code as possible, so that the necessary tables
and other data structures would fit into RAM.
I haven't stated this one yet, because there hasn't been a need ... we've always
just read and written the data as streams, anyway. But for the record, my plan has
always been that, in a production compiler, the source and object data should all
coexist in RAM with the compiler, a la the early Turbo Pascals. That's why I've
been careful to keep routines like GetChar and Emit as separate routines, in spite
of their small size. It will be easy to change them to read from and write to
memory.
o Emphasis on Efficiency
John Backus has stated that, when he and his colleagues developed the original
FORTRAN compiler, they KNEW that they had to make it produce tight code. In
those days, there was a strong sentiment against HOLs (high-order languages) and in favor of assembly
language, and efficiency was the reason. If FORTRAN didn't produce very good
code by assembly standards, the users would simply refuse to use it. For the
record, that FORTRAN compiler turned out to be one of the most efficient ever
built, in terms of code quality. But it WAS complex!
Today, we have CPU power and RAM size to spare, so code efficiency is not so
much of an issue. By studiously ignoring this issue, we have indeed been able to
Keep It Simple. Ironically, though, as I have said, I have found some optimizations
that we can add to the basic compiler structure, without having to add a lot of com-
plexity. So in this case we get to have our cake and eat it too: we will end up with
reasonable code quality, anyway.

o Limited Instruction Sets
The early computers had primitive instruction sets. Things that we take for granted,
such as stack operations and indirect addressing, came only with great difficulty.
Example: In most compiler designs, there is a data structure called the literal pool.
The compiler typically identifies all literals used in the program, and collects them into
a single data structure. All references to the literals are done indirectly to this pool. At
the end of the compilation, the compiler issues commands to set aside storage and
initialize the literal pool.
We haven't had to address that issue at all. When we want to load a literal, we just do
it, in line, as in
MOVE #3,D0
There is something to be said for the use of a literal pool, particularly on a machine
like the 8086 where data and code can be separated. Still, the whole thing adds a
fairly large amount of complexity with little in return.
Of course, without the stack we would be lost. In a micro, both subroutine calls and
temporary storage depend heavily on the stack, and we have used it even more than
necessary to ease expression parsing.

o Desire for Generality
Much of the content of the typical compiler text is taken up with issues we haven't
addressed here at all ... things like automated translation of grammars, or genera-
tion of LALR parse tables. This is not simply because the authors want to impress
you. There are good, practical reasons why the subjects are there.
We have been concentrating on the use of a recursive-descent parser to parse a
deterministic grammar, i.e., a grammar that is not ambiguous and can be parsed
with one level of lookahead. I haven't made much of this limitation, but
the fact is that this represents a small subset of possible grammars. In fact, there
is an infinite number of grammars that we can't parse using our techniques. The
LR technique is a more powerful one, and can deal with grammars that we can't.
In compiler theory, it's important to know how to deal with these other grammars,
and how to transform them into grammars that are easier to deal with. For exam-
ple, many (but not all) ambiguous grammars can be transformed into unambigu-
ous ones. The way to do this is not always obvious, though, and so many people
have devoted years to developing ways to transform them automatically.
In practice, these issues turn out to be considerably less important. Modern lan-
guages tend to be designed to be easy to parse, anyway. That was a key motiva-
tion in the design of Pascal. Sure, there are pathological grammars that you would
be hard pressed to write unambiguous BNF for, but in the real world the best
answer is probably to avoid those grammars!
In our case, of course, we have sneakily let the language evolve as we go, so we
haven't painted ourselves into any corners here. You may not always have that
luxury. Still, with a little care you should be able to keep the parser simple without
having to resort to automatic translation of the grammar.

We have taken a vastly different approach in this series. We started with a clean sheet of
paper, and developed techniques that work in the context that we are in; that is, a single-user
PC with rather ample CPU power and RAM space. We have limited ourselves to reasonable
grammars that are easy to parse, we have used the instruction set of the CPU to advantage,
and we have not concerned ourselves with efficiency. THAT's why it's been easy.
Does this mean that we are forever doomed to be able to build only toy compilers? No, I don't
think so. As I've said, we can add certain optimizations without changing the compiler struc-
ture. If we want to process large files, we can always add file buffering to do that. These
things do not affect the overall program design.
And I think that's a key factor. By starting with small and limited cases, we have been able to
concentrate on a structure for the compiler that is natural for the job. Since the structure nat-
urally fits the job, it is almost bound to be simple and transparent. Adding capability doesn't
have to change that basic structure. We can simply expand things like the file structure or add
an optimization layer. I guess my feeling is that, back when resources were tight, the struc-
tures people ended up with were artificially warped to make them work under those condi-
tions, and weren't optimum structures for the problem at hand.

CONCLUSION
Anyway, that's my arm-waving guess as to how we've been able to keep things simple.
We started with something simple and let it evolve naturally, without trying to force it into
some traditional mold.
We're going to press on with this. I've given you a list of the areas we'll be covering in
future installments. With those installments, you should be able to build complete, work-
ing compilers for just about any occasion, and build them simply. If you REALLY want to
build production-quality compilers, you'll be able to do that, too.
For those of you who are chafing at the bit for more parser code, I apologize for this
digression. I just thought you'd like to have things put into perspective a bit. Next time,
we'll get back to the mainstream of the tutorial.
So far, we've only looked at pieces of compilers, and while we have many of the makings
of a complete language, we haven't talked about how to put it all together. That will be the
subject of our next two installments. Then we'll press on into the new subjects I listed at
the beginning of this installment.
See you then.

Part 9 - A Top View
INTRODUCTION
In the previous installments, we have learned many of the techniques required to build a
full-blown compiler. We've done assignment statements (with Boolean and arithmetic
expressions), relational operators, and control constructs. We still haven't addressed
procedure or function calls, but even so we could conceivably construct a mini-language without
them. I've always thought it would be fun to see just how small a language one could build
that would still be useful. We're ALMOST in a position to do that now. The problem is: though
we know how to parse and translate the constructs, we still don't know quite how to put them
all together into a language.
In those earlier installments, the development of our programs had a decidedly bottom-up fla-
vor. In the case of expression parsing, for example, we began with the very lowest level con-
structs, the individual constants and variables, and worked our way up to more complex
expressions.
Most people regard the top-down design approach as being better than the bottom-up one. I
do too, but the way we did it certainly seemed natural enough for the kinds of things we were
parsing.
You mustn't get the idea, though, that the incremental approach that we've been using in all
these tutorials is inherently bottom-up. In this installment I'd like to show you that the
approach can work just as well when applied from the top down ... maybe better. We'll con-
sider languages such as C and Pascal, and see how complete compilers can be built starting
from the top.
In the next installment, we'll apply the same technique to build a complete translator for a
subset of the KISS language, which I'll be calling TINY. But one of my goals for this series is
that you will not only be able to see how a compiler for TINY or KISS works, but that you will
also be able to design and build compilers for your own languages. The C and Pascal exam-
ples will help. One thing I'd like you to see is that the natural structure of the compiler
depends very much on the language being translated, so the simplicity and ease of construc-
tion of the compiler depends very much on letting the language set the program structure.

It's a bit much to produce a full C or Pascal compiler here, and we won't try. But we can
flesh out the top levels far enough so that you can see how it goes.
Let's get started.

THE TOP LEVEL

One of the biggest mistakes people make in a top-down design is failing to start at the true
top. They think they know what the overall structure of the design should be, so they go
ahead and write it down.
Whenever I start a new design, I always like to do it at the absolute beginning. In program
design language (PDL), this top level looks something like:
begin
solve the problem
end
OK, I grant you that this doesn't give much of a hint as to what the next level is, but I like to
write it down anyway, just to give me that warm feeling that I am indeed starting at the top.
For our problem, the overall function of a compiler is to compile a complete program. Any def-
inition of the language, written in BNF, begins here. What does the top level BNF look like?
Well, that depends quite a bit on the language to be translated. Let's take a look at Pascal.

THE STRUCTURE OF PASCAL

Most texts for Pascal include a BNF or "railroad-track" definition of the language. Here
are the first few lines of one:
<program> ::= <program-header> <block> '.'
<program-header> ::= PROGRAM <ident>
<block> ::= <declarations> <statements>
We can write recognizers to deal with each of these elements, just as we've done before.
For each one, we'll use our familiar single-character tokens to represent the input, then
flesh things out a little at a time. Let's begin with the first recognizer: the program itself.
To translate this, we'll start with a fresh copy of the Cradle. Since we're back to single-
character names, we'll just use a 'p' to stand for 'PROGRAM.'

To a fresh copy of the cradle, add the following code, and insert a call to it from the main pro-
gram:
{--------------------------------------------------------------}
{ Parse and Translate A Program }
procedure Prog;
var Name: char;
begin
Match('p'); { Handles program header part }
Name := GetName;
Prolog;
Match('.');
Epilog(Name);
end;
{--------------------------------------------------------------}
The procedures Prolog and Epilog perform whatever is required to let the program interface
with the operating system, so that it can execute as a program. Needless to say, this part will
be VERY OS-dependent. Remember, I've been emitting code for a 68000 running under the
OS I use, which is SK*DOS. I realize most of you are using PC's and would rather see some-
thing else, but I'm in this thing too deep to change now!

Anyhow, SK*DOS is a particularly easy OS to interface to. Here is the code for Prolog and
Epilog:
{--------------------------------------------------------------}
{ Write the Prolog }
procedure Prolog;
begin
EmitLn('WARMST EQU $A01E');
end;
{--------------------------------------------------------------}
{ Write the Epilog }
procedure Epilog(Name: char);
begin
EmitLn('DC WARMST');
EmitLn('END ' + Name);
end;
{--------------------------------------------------------------}
As usual, add this code and try out the "compiler." At this point, there is only one legal
input:
px. (where x is any single letter, the program name)
Well, as usual our first effort is rather unimpressive, but by now I'm sure you know that
things will get more interesting. There is one important thing to note: THE OUTPUT IS A
WORKING, COMPLETE, AND EXECUTABLE PROGRAM (at least after it's assembled).

This is very important. The nice feature of the top-down approach is that at any stage you can
compile a subset of the complete language and get a program that will run on the target
machine. From here on, then, we need only add features by fleshing out the language con-
structs. It's all very similar to what we've been doing all along, except that we're approaching
it from the other end.
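As a rough illustration of that point, here is a minimal sketch of the same top-level recognizer, written in Python rather than the Pascal of the Cradle so that it can be run on its own. The names compile_program, look, match and get_name are stand-ins chosen for this sketch. The emitted lines mirror the strings passed to EmitLn above, without the leading tab that EmitLn normally adds.

```python
def compile_program(source):
    """Recognize 'p' <name> '.' and return the emitted assembler lines."""
    pos = 0
    out = []

    def look():
        # Current lookahead character, or "" at end of input.
        return source[pos] if pos < len(source) else ""

    def match(ch):
        nonlocal pos
        if look() != ch:
            raise SyntaxError(f"{ch!r} expected at position {pos}")
        pos += 1

    def get_name():
        nonlocal pos
        ch = look()
        if not ch.isalpha():
            raise SyntaxError(f"name expected at position {pos}")
        pos += 1
        return ch.upper()          # the Cradle's GetName also upcases

    match("p")                     # program header
    name = get_name()
    out.append("WARMST EQU $A01E")  # Prolog
    match(".")
    out.append("DC WARMST")         # Epilog
    out.append("END " + name)
    return out


if __name__ == "__main__":
    for line in compile_program("px."):
        print(line)
```

Even at this stage every accepted input produces a complete prolog/epilog pair, so each later feature can be checked by assembling and running the output.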

FLESHING IT OUT
To flesh out the compiler, we only have to deal with language features one by one. I like to
start with a stub procedure that does nothing, then add detail in incremental fashion. Let's
begin by processing a block, in accordance with its BNF above. We can do this in two
stages. First, add the null procedure:
{--------------------------------------------------------------}
{ Parse and Translate a Pascal Block }
procedure DoBlock(Name: char);
begin
end;
{--------------------------------------------------------------}
and modify Prog to read:
{--------------------------------------------------------------}
procedure Prog;
var Name: char;
begin
Match('p');
Name := GetName;
Prolog;
DoBlock(Name);
Match('.');
Epilog(Name);
end;
{--------------------------------------------------------------}

That certainly shouldn't change the behavior of the program, and it doesn't. But now the defi-
nition of Prog is complete, and we can proceed to flesh out DoBlock. That's done right from
its BNF definition:
{--------------------------------------------------------------}
{ Parse and Translate a Pascal Block }
procedure DoBlock(Name: char);
begin
Declarations;
PostLabel(Name);
Statements;
end;
{--------------------------------------------------------------}
The procedure PostLabel was defined in the installment on branches. Copy it into your cra-
dle.
I probably need to explain the reason for inserting the label where I have. It has to do with the
operation of SK*DOS. Unlike some OS's, SK*DOS allows the entry point to the main program
to be anywhere in the program. All you have to do is to give that point a name. The call to
PostLabel puts that name just before the first executable statement in the main program. How
does SK*DOS know which of the many labels is the entry point, you ask? It's the one that
matches the END statement at the end of the program.
OK, now we need stubs for the procedures Declarations and Statements. Make them null
procedures as we did before.
Does the program still run the same? Then we can move on to the next stage.

DECLARATIONS
The BNF for Pascal declarations is:
<declarations> ::= ( <label list> |
<constant list> |
<type list> |
<variable list> |
<procedure> |
<function> )*
(Note that I'm using the more liberal definition used by Turbo Pascal. In the standard Pas-
cal definition, each of these parts must be in a specific order relative to the rest.)

As usual, let's let a single character represent each of these declaration types. The new form
of Declarations is:
{--------------------------------------------------------------}
{ Parse and Translate the Declaration Part }
procedure Declarations;
begin
while Look in ['l', 'c', 't', 'v', 'p', 'f'] do
case Look of
'l': Labels;
'c': Constants;
't': Types;
'v': Variables;
'p': DoProcedure;
'f': DoFunction;
end;
end;
{--------------------------------------------------------------}
Of course, we need stub procedures for each of these declaration types. This time, they can't
quite be null procedures, since otherwise we'll end up with an infinite While loop. At the very
least, each recognizer must eat the character that invokes it.

Insert the following procedures:
{--------------------------------------------------------------}
{ Process Label Statement }
procedure Labels;
begin
Match('l');
end;
{--------------------------------------------------------------}
{ Process Const Statement }
procedure Constants;
begin
Match('c');
end;
{--------------------------------------------------------------}
{ Process Type Statement }
procedure Types;
begin
Match('t');
end;

{--------------------------------------------------------------}
{ Process Var Statement }
procedure Variables;
begin
Match('v');
end;
{--------------------------------------------------------------}
{ Process Procedure Definition }
procedure DoProcedure;
begin
Match('p');
end;
{--------------------------------------------------------------}
{ Process Function Definition }
procedure DoFunction;
begin
Match('f');
end;
{--------------------------------------------------------------}

Now try out the compiler with a few representative inputs. You can mix the declarations
any way you like, as long as the last character in the program is '.' to indicate the end of
the program. Of course, none of the declarations actually declare anything, so you don't
need (and can't use) any characters other than those standing for the keywords.
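The reason the stubs cannot be completely empty can be seen in a small sketch. It is written in Python instead of Pascal so it runs without the Cradle, and the names make_stub, STUBS and declarations are hypothetical, chosen only for this illustration. Each stub advances the position by one character; if it did not, the loop condition would stay true forever.

```python
def make_stub(keyword, kind):
    """Build a stub recognizer that only eats its own keyword character."""
    def stub(source, pos, seen):
        if source[pos] != keyword:
            raise SyntaxError(f"{keyword!r} expected at position {pos}")
        seen.append(kind)
        return pos + 1      # consuming the character is what ends the loop
    return stub


# One stub per declaration type, keyed by the single-character token.
STUBS = {
    "l": make_stub("l", "label"),
    "c": make_stub("c", "const"),
    "t": make_stub("t", "type"),
    "v": make_stub("v", "var"),
    "p": make_stub("p", "procedure"),
    "f": make_stub("f", "function"),
}


def declarations(source, pos=0):
    """Consume declaration keywords; return (kinds seen, new position)."""
    seen = []
    while pos < len(source) and source[pos] in STUBS:
        pos = STUBS[source[pos]](source, pos, seen)
    return seen, pos


if __name__ == "__main__":
    print(declarations("vclb"))
```

The loop stops at the first character that is not a declaration keyword and leaves it for the next recognizer to deal with.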
We can flesh out the statement part in a similar way. The BNF for it is:
<statements> ::= <compound statement>
<compound statement> ::= BEGIN <statement>
(';' <statement>)* END
Note that statements can begin with any identifier except END.
So the first stub form of procedure Statements is:
{--------------------------------------------------------------}
{ Parse and Translate the Statement Part }
procedure Statements;
begin
Match('b');
while Look <> 'e' do
GetChar;
Match('e');
end;
{--------------------------------------------------------------}

At this point the compiler will accept any number of declarations, followed by the BEGIN
block of the main program. This block itself can contain any characters at all (except an
END), but it must be present.
The simplest form of input is now
'pxbe.'
Try it. Also try some combinations of this. Make some deliberate errors and see what hap-
pens.
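For readers who want to experiment without keying in the Pascal, the whole stub recognizer built so far can be approximated in a few lines of Python. This is only a sketch under stated assumptions: accepts and DECL_KEYS are names invented here, whitespace is not skipped, and running out of input inside the BEGIN block is treated as an error where the Pascal version would simply keep reading.

```python
DECL_KEYS = set("lctvpf")   # single-character declaration keywords


def accepts(source):
    """Return True if source is accepted by the stub Pascal recognizer."""
    pos = 0

    def look():
        return source[pos] if pos < len(source) else ""

    def match(ch):
        nonlocal pos
        if look() != ch:
            raise SyntaxError(f"{ch!r} expected at position {pos}")
        pos += 1

    try:
        match("p")                      # PROGRAM
        if not look().isalpha():
            raise SyntaxError("name expected")
        pos += 1                        # program name
        while look() in DECL_KEYS:      # Declarations: each stub eats one char
            pos += 1
        match("b")                      # Statements: BEGIN ...
        while look() != "e":
            if look() == "":
                raise SyntaxError("'e' expected before end of input")
            pos += 1                    # skip anything up to END
        match("e")
        match(".")
        return True
    except SyntaxError:
        return False


if __name__ == "__main__":
    for text in ("pxbe.", "pxvclbe.", "pxb.", "xbe."):
        print(text, accepts(text))
```

Note that any 'e' inside the block ends it, since 'e' stands for END in the single-character scheme.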
At this point you should be beginning to see the drill. We begin with a stub translator to pro-
cess a program, then we flesh out each procedure in turn, based upon its BNF definition. Just
as the lower-level BNF definitions add detail and elaborate upon the higher-level ones, the
lower-level recognizers will parse more detail of the input program. When the last stub has
been expanded, the compiler will be complete. That's top-down design/implementation in its
purest form.
You might note that even though we've been adding procedures, the output of the program
hasn't changed. That's as it should be. At these top levels there is no emitted code required.
The recognizers are functioning as just that: recognizers. They are accepting input sen-
tences, catching bad ones, and channeling good input to the right places, so they are doing
their job. If we were to pursue this a bit longer, code would start to appear.

The next step in our expansion should probably be procedure Statements. The Pascal
definition is:
<statement> ::= <simple statement> | <structured statement>
<simple statement> ::= <assignment> | <procedure call> | null
<structured statement> ::= <compound statement> |
<if statement> |
<case statement> |
<while statement> |
<repeat statement> |
<for statement> |
<with statement>
These are starting to look familiar. As a matter of fact, you have already gone through the
process of parsing and generating code for both assignment statements and control
structures. This is where the top level meets our bottom-up approach of previous ses-
sions. The constructs will be a little different from those we've been using for KISS, but
the differences are nothing you can't handle.
I think you can get the picture now as to the procedure. We begin with a complete BNF
description of the language. Starting at the top level, we code up the recognizer for that
BNF statement, using stubs for the next-level recognizers. Then we flesh those lower-
level statements out one by one.
As it happens, the definition of Pascal is very compatible with the use of BNF, and BNF
descriptions of the language abound. Armed with such a description, you will find it fairly
straightforward to continue the process we've begun.

You might have a go at fleshing a few of these constructs out, just to get a feel for it. I don't
expect you to be able to complete a Pascal compiler here ... there are too many things such
as procedures and types that we haven't addressed yet ... but it might be helpful to try some
of the more familiar ones. It will do you good to see executable programs coming out the
other end.
If I'm going to address those issues that we haven't covered yet, I'd rather do it in the context
of KISS. We're not trying to build a complete Pascal compiler just yet, so I'm going to stop the
expansion of Pascal here. Let's take a look at a very different language.

THE STRUCTURE OF C
The C language is quite another matter, as you'll see. Texts on C rarely include a BNF
definition of the language. Probably that's because the language is quite hard to write
BNF for.
One reason I'm showing you these structures now is so that I can impress upon you these
two facts:
(1) The definition of the language drives the structure of the
compiler. What works for one language may be a disaster for
another. It's a very bad idea to try to force a given
structure upon the compiler. Rather, you should let the BNF
drive the structure, as we have done here.
(2) A language that is hard to write BNF for will probably be
hard to write a compiler for, as well. C is a popular
language, and it has a reputation for letting you do
virtually anything that is possible to do. Despite the
success of Small C, C is _NOT_ an easy language to parse.

A C program has less structure than its Pascal counterpart. At the top level, everything in C is
a static declaration, either of data or of a function. We can capture this thought like this:
<program> ::= ( <global declaration> )*
<global declaration> ::= <data declaration> |
<function>
In Small C, functions can only have the default type int, which is not declared. This makes the
input easy to parse: the first token is either "int," "char," or the name of a function. In Small C,
the preprocessor commands are also processed by the compiler proper, so the syntax
becomes:
<global declaration> ::= '#' <preprocessor command> |
'int' <data list> |
'char' <data list> |
<ident> <function body>
Although we're really more interested in full C here, I'll show you the code corresponding to
this top-level structure for Small C.

{--------------------------------------------------------------}
procedure Prog;
begin
while Look <> ^Z do begin
case Look of
'#': PreProc;
'i': IntDecl;
'c': CharDecl;
else DoFunction(Int);
end;
end;
end;
{--------------------------------------------------------------}
Note that I've had to use a ^Z to indicate the end of the source. C has no keyword such as
END or the '.' to otherwise indicate the end.
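The dispatch in this loop can be tried out with a small Python sketch. The routines PreProc, IntDecl, CharDecl and DoFunction are not shown in the text, so the sketch assumes, purely for illustration, that each one is a stub that eats exactly one character; classify_small_c and END_OF_SOURCE are names made up here.

```python
END_OF_SOURCE = "\x1a"   # ^Z, used above to mark the end of the source


def classify_small_c(source):
    """Return the kind of each top-level item, one item per character."""
    kinds = []
    pos = 0
    while pos < len(source) and source[pos] != END_OF_SOURCE:
        ch = source[pos]
        if ch == "#":
            kinds.append("preprocessor")
        elif ch == "i":
            kinds.append("int data")
        elif ch == "c":
            kinds.append("char data")
        else:
            kinds.append("function")    # anything else is taken as a name
        pos += 1                        # assumed stub behavior: eat one char
    return kinds


if __name__ == "__main__":
    print(classify_small_c("#icf" + END_OF_SOURCE))
```

One character of lookahead is enough here only because Small C functions carry no declared type.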
With full C, things aren't even this easy. The problem comes about because in full C, func-
tions can also have types. So when the compiler sees a keyword like "int," it still doesn't
know whether to expect a data declaration or a function definition. Things get more com-
plicated since the next token may not be a name ... it may start with an '*' or '(', or combi-
nations of the two.

More specifically, the BNF for full C begins with:
<program> ::= ( <top-level decl> )*
<top-level decl> ::= <function def> | <data decl>
<data decl> ::= [<class>] <type> <decl-list>
<function def> ::= [<class>] [<type>] <function decl>
You can now see the problem: The first two parts of the declarations for data and functions
can be the same. Because of the ambiguity in the grammar as written above, it's not a suit-
able grammar for a recursive-descent parser. Can we transform it into one that is suitable?
Yes, with a little work. Suppose we write it this way:
<top-level decl> ::= [<class>] <decl>
<decl> ::= <type> <typed decl> | <function decl>
<typed decl> ::= <data list> | <function decl>
We can build a parsing routine for the class and type definitions, and have them store away
their findings and go on, without their ever having to "know" whether a function or a data dec-
laration is being processed. To begin, key in the following version of the main program:
{--------------------------------------------------------------}
{ Main Program }
begin
Init;
while Look <> ^Z do begin
GetClass;
GetType;
TopDecl;
end;
end.
{--------------------------------------------------------------}

For the first round, just make the three procedures stubs that do nothing _BUT_ call
GetChar.
Does this program work? Well, it would be hard put NOT to, since we're not really asking
it to do anything. It's been said that a C compiler will accept virtually any input without
choking. It's certainly true of THIS compiler, since in effect all it does is to eat input char-
acters until it finds a ^Z.
Next, let's make GetClass do something worthwhile. Declare the global variable
var Class: char;
and change GetClass to do the following:
{--------------------------------------------------------------}
{ Get a Storage Class Specifier }
Procedure GetClass;
begin
if Look in ['a', 'x', 's'] then begin
Class := Look;
GetChar;
end
else Class := 'a';
end;
{--------------------------------------------------------------}
Here, I've used three single characters to represent the three storage classes "auto,"
"extern," and "static." These are not the only three possible classes ... there are also "reg-
ister" and "typedef," but this should give you the picture. Note that the default class is
"auto."

We can do a similar thing for types. Enter the following procedure next:
{--------------------------------------------------------------}
{ Get a Type Specifier }
procedure GetType;
begin
Typ := ' ';
if Look = 'u' then begin
Sign := 'u';
Typ := 'i';
GetChar;
end
else Sign := 's';
if Look in ['i', 'l', 'c'] then begin
Typ := Look;
GetChar;
end;
end;
{--------------------------------------------------------------}
Note that you must add two more global variables, Sign and Typ.
With these two procedures in place, the compiler will process the class and type definitions
and store away their findings. We can now process the rest of the declaration.

We are by no means out of the woods yet, because there are still many complexities just
in the definition of the type, before we even get to the actual data or function names. Let's
pretend for the moment that we have passed all those gates, and that the next thing in the
input stream is a name. If the name is followed by a left paren, we have a function decla-
ration. If not, we have at least one data item, and possibly a list, each element of which
can have an initializer.
Insert the following version of TopDecl:
{--------------------------------------------------------------}
{ Process a Top-Level Declaration }
procedure TopDecl;
var Name: char;
begin
Name := GetName;
if Look = '(' then
DoFunc(Name)
else
DoData(Name);
end;
{--------------------------------------------------------------}
(Note that, since we have already read the name, we must pass it along to the appropri-
ate routine.)

Finally, add the two procedures DoFunc and DoData:
{--------------------------------------------------------------}
{ Process a Function Definition }
procedure DoFunc(n: char);
begin
Match('(');
Match(')');
Match('{');
Match('}');
if Typ = ' ' then Typ := 'i';
Writeln(Class, Sign, Typ, ' function ', n);
end;

{--------------------------------------------------------------}
{ Process a Data Declaration }
procedure DoData(n: char);
begin
if Typ = ' ' then Expected('Type declaration');
Writeln(Class, Sign, Typ, ' data ', n);
while Look = ',' do begin
Match(',');
n := GetName;
WriteLn(Class, Sign, Typ, ' data ', n);
end;
Match(';');
end;
{--------------------------------------------------------------}
Since we're still a long way from producing executable code, I decided to just have these
two routines tell us what they found.
OK, give this program a try. For data declarations, it's OK to give a list separated by com-
mas. We can't process initializers as yet. We also can't process argument lists for the
functions, but the "(){}" characters should be there.
We're still a _VERY_ long way from having a C compiler, but what we have is starting to
process the right kinds of inputs, and is recognizing both good and bad inputs. In the pro-
cess, the natural structure of the compiler is starting to take form.
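To see the pieces working together, the following Python sketch mirrors GetClass, GetType, TopDecl, DoFunc and DoData in one function. It is an approximation for experimenting, not a translation of the Cradle: parse_c_top_level and its helpers are names invented here, whitespace and newlines are not handled, and the report lines are collected in a list where the Pascal version writes them out with Writeln.

```python
def parse_c_top_level(source):
    """Return one report line per function or data item found."""
    pos = 0
    report = []

    def look():
        # Treat running off the end like the ^Z end marker.
        return source[pos] if pos < len(source) else "\x1a"

    def get_char():
        nonlocal pos
        pos += 1

    def match(ch):
        if look() != ch:
            raise SyntaxError(f"{ch!r} expected at position {pos}")
        get_char()

    def get_name():
        ch = look()
        if not ch.isalpha():
            raise SyntaxError(f"name expected at position {pos}")
        get_char()
        return ch.upper()

    while look() != "\x1a":
        # GetClass: optional storage class, defaulting to 'a'
        cls = "a"
        if look() in ("a", "x", "s"):
            cls = look()
            get_char()
        # GetType: optional 'u', then optional base type
        typ, sign = " ", "s"
        if look() == "u":
            sign, typ = "u", "i"
            get_char()
        if look() in ("i", "l", "c"):
            typ = look()
            get_char()
        # TopDecl: a name followed by '(' is a function, otherwise data
        name = get_name()
        if look() == "(":
            for ch in "(){}":
                match(ch)
            if typ == " ":
                typ = "i"               # functions default to int
            report.append(f"{cls}{sign}{typ} function {name}")
        else:
            if typ == " ":
                raise SyntaxError("Type declaration expected")
            report.append(f"{cls}{sign}{typ} data {name}")
            while look() == ",":
                match(",")
                report.append(f"{cls}{sign}{typ} data {get_name()}")
            match(";")
    return report


if __name__ == "__main__":
    print(parse_c_top_level("xuia,b;f(){}"))
```

Because the keywords are single letters, a name such as 'a', 'x', 's', 'u', 'i', 'l' or 'c' in first position is read as a class or type, so pick other letters for names when trying this out.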

Can we continue this until we have something that acts more like a compiler? Of course we
can. Should we? That's another matter. I don't know about you, but I'm beginning to get dizzy,
and we've still got a long way to go to even get past the data declarations.
At this point, I think you can see how the structure of the compiler evolves from the language
definition. The structures we've seen for our two examples, Pascal and C, are as different as
night and day. Pascal was designed at least partly to be easy to parse, and that's reflected in
the compiler. In general, in Pascal there is more structure and we have a better idea of what
kinds of constructs to expect at any point. In C, on the other hand, the program is essentially
a list of declarations, terminated only by the end of file.
We could pursue both of these structures much farther, but remember that our purpose here
is not to build a Pascal or a C compiler, but rather to study compilers in general. For those of
you who DO want to deal with Pascal or C, I hope I've given you enough of a start so that you
can take it from here (although you'll soon need some of the stuff we still haven't covered yet,
such as typing and procedure calls). For the rest of you, stay with me through the next install-
ment. There, I'll be leading you through the development of a complete compiler for TINY, a
subset of KISS.
See you then.

Part 10 - Introducing “Tiny”
INTRODUCTION
In the last installment, I showed you the general idea for the top-down development of a
compiler. I gave you the first few steps of the process for compilers for Pascal and C, but
I stopped far short of pushing it through to completion. The reason was simple: if we're
going to produce a real, functional compiler for any language, I'd rather do it for KISS, the
language that I've been defining in this tutorial series.
In this installment, we're going to do just that, for a subset of KISS which I've chosen to
call TINY.
The process will be essentially that outlined in Installment IX, except for one notable dif-
ference. In that installment, I suggested that you begin with a full BNF description of the
language. That's fine for something like Pascal or C, for which the language definition is
firm. In the case of TINY, however, we don't yet have a full description ... we seem to be
defining the language as we go. That's OK. In fact, it's preferable, since we can tailor the
language slightly as we go, to keep the parsing easy.
So in the development that follows, we'll actually be doing a top-down development of
BOTH the language and its compiler. The BNF description will grow along with the
compiler.
In this process, there will be a number of decisions to be made, each of which will influ-
ence the BNF and therefore the nature of the language. At each decision point I'll try to
remember to explain the decision and the rationale behind my choice. That way, if you
happen to hold a different opinion and would prefer a different option, you can choose it
instead. You now have the background to do that. I guess the important thing to note is
that nothing we do here is cast in concrete. When YOU'RE designing YOUR language,
you should feel free to do it YOUR way.

Many of you may be asking at this point: Why bother starting over from scratch? We had a
working subset of KISS as the outcome of Installment VII (lexical scanning). Why not just
extend it as needed? The answer is threefold. First of all, I have been making a number of
changes to further simplify the program ... changes like encapsulating the code generation
procedures, so that we can convert to a different target machine more easily. Second, I want
you to see how the development can indeed be done from the top down as outlined in the last
installment. Finally, we both need the practice. Each time I go through this exercise, I get a lit-
tle better at it, and you will, also.

GETTING STARTED
Many years ago there were languages called Tiny BASIC, Tiny Pascal, and Tiny C, each
of which was a subset of its parent full language. Tiny BASIC, for example, had only sin-
gle-character variable names and global variables. It supported only a single data type.
Sound familiar? At this point we have almost all the tools we need to build a compiler like
that.
Yet a language called Tiny-anything still carries some baggage inherited from its parent
language. I've often wondered if this is a good idea. Granted, a language based upon
some parent language will have the advantage of familiarity, but there may also be some
peculiar syntax carried over from the parent that may tend to add unnecessary complexity
to the compiler. (Nowhere is this more true than in Small C.)
I've wondered just how small and simple a compiler could be made and still be useful, if it
were designed from the outset to be both easy to use and to parse. Let's find out. This
language will just be called "TINY," period. It's a subset of KISS, which I also haven't fully
defined, so that at least makes us consistent (!). I suppose you could call it TINY KISS.
But that opens up a whole can of worms involving cuter and cuter (and perhaps more ris-
que) names, so let's just stick with TINY.
The main limitations of TINY will be because of the things we haven't yet covered, such
as data types. Like its cousins Tiny C and Tiny BASIC, TINY will have only one data type,
the 16-bit integer. The first version we develop will also have no procedure calls and will
use single-character variable names, although as you will see we can remove these
restrictions without much effort.
The language I have in mind will share some of the good features of Pascal, C, and Ada.
Taking a lesson from the comparison of the Pascal and C compilers in the previous
installment, though, TINY will have a decided Pascal flavor. Wherever feasible, a lan-
guage structure will be bracketed by keywords or symbols, so that the parser will know
where it's going without having to guess.
One other ground rule: As we go, I'd like to keep the compiler producing real, executable
code. Even though it may not DO much at the beginning, it will at least do it correctly.

Finally, I'll use a couple of Pascal restrictions that make sense: All data and procedures must
be declared before they are used. That makes good sense, even though for now the only
data type we'll use is a word. This rule in turn means that the only reasonable place to put the
executable code for the main program is at the end of the listing.
The top-level definition will be similar to Pascal:
<program> ::= PROGRAM <top-level decls> <main> '.'
Already, we've reached a decision point. My first thought was to make the main block
optional. It doesn't seem to make sense to write a "program" with no main program, but it
does make sense if we're allowing for multiple modules, linked together. As a matter of fact, I
intend to allow for this in KISS. But then we begin to open up a can of worms that I'd rather
leave closed for now. For example, the term "PROGRAM" really becomes a misnomer. The
MODULE of Modula-2 or the Unit of Turbo Pascal would be more appropriate. Second, what
about scope rules? We'd need a convention for dealing with name visibility across modules.
Better for now to just keep it simple and ignore the idea altogether.
There's also a decision in choosing to require the main program to be last. I toyed with the
idea of making its position optional, as in C. The nature of SK*DOS, the OS I'm compiling for,
makes this very easy to do. But this doesn't really make much sense in view of the Pascal-like
requirement that all data and procedures be declared before they're referenced. Since the
main program can only call procedures that have already been declared, the only position
that makes sense is at the end, a la Pascal.

Given the BNF above, let's write a parser that just recognizes the brackets:
{--------------------------------------------------------------}
procedure Prog;
begin
Match('p');
Header;
Prolog;
Match('.');
Epilog;
end;
{--------------------------------------------------------------}
The procedure Header just emits the startup code required by the assembler:
{--------------------------------------------------------------}
{ Write Header Info }
procedure Header;
begin
WriteLn('WARMST', TAB, 'EQU $A01E');
end;
{--------------------------------------------------------------}

The procedures Prolog and Epilog emit the code for identifying the main program, and for
returning to the OS:
{--------------------------------------------------------------}
procedure Prolog;
begin
PostLabel('MAIN');
end;
{--------------------------------------------------------------}
procedure Epilog;
begin
   EmitLn('DC WARMST');
   EmitLn('END MAIN');
end;
{--------------------------------------------------------------}

The main program just calls Prog, and then looks for a clean ending:
{--------------------------------------------------------------}
{ Main Program }
begin
Init;
Prog;
if Look <> CR then Abort('Unexpected data after ''.''');
end.
{--------------------------------------------------------------}
At this point, TINY will accept only one input "program," the null program:
PROGRAM . (or 'p.' in our shorthand.)
Note, though, that the compiler DOES generate correct code for this program. It will run,
and do what you'd expect the null program to do, that is, nothing but return gracefully to
the OS.
As a matter of interest, one of my favorite compiler benchmarks is to compile, link, and
execute the null program in whatever language is involved. You can learn a lot about the
implementation by measuring the overhead in time required to compile what should be a
trivial case. It's also interesting to measure the amount of code produced. In many
compilers, the code can be fairly large, because they always include the whole run-time library
whether they need it or not. Early versions of Turbo Pascal produced a 12K object file for
this case. VAX C generates 50K!
The smallest null programs I've seen are those produced by Modula-2 compilers, and
they run about 200-800 bytes.

In the case of TINY, we HAVE no run-time library as yet, so the object code is indeed tiny: two
bytes. That's got to be a record, and it's likely to remain one since it is the minimum size
required by the OS.
The next step is to process the code for the main program. I'll use the Pascal BEGIN-block:
<main> ::= BEGIN <block> END
Here, again, we have made a decision. We could have chosen to require a "PROCEDURE
MAIN" sort of declaration, similar to C. I must admit that this is not a bad idea at all ... I don't
particularly like the Pascal approach since I tend to have trouble locating the main program in
a Pascal listing. But the alternative is a little awkward, too, since you have to deal with the
error condition where the user omits the main program or misspells its name. Here I'm taking
the easy way out.
Another solution to the "where is the main program" problem might be to require a name for
the program, and then bracket the main by
BEGIN <name>
END <name>
similar to the convention of Modula 2. This adds a bit of "syntactic sugar" to the language.
Things like this are easy to add or change to your liking, if the language is your own design.

To parse this definition of a main block, change procedure Prog to read:
{--------------------------------------------------------------}
procedure Prog;
begin
Match('p');
Header;
Main;
Match('.');
end;
{--------------------------------------------------------------}

and add the new procedure:
{--------------------------------------------------------------}
{ Parse and Translate a Main Program }
procedure Main;
begin
Match('b');
Prolog;
Match('e');
Epilog;
end;
{--------------------------------------------------------------}
Now, the only legal program is:
PROGRAM BEGIN END . (or 'pbe.')
Aren't we making progress??? Well, as usual it gets better. You might try some deliberate
errors here, like omitting the 'b' or the 'e', and see what happens. As always, the compiler
should flag all illegal inputs.

DECLARATIONS
The obvious next step is to decide what we mean by a declaration. My intent here is to
have two kinds of declarations: variables and procedures/functions. At the top level, only
global declarations are allowed, just as in C.
For now, there can only be variable declarations, identified by the keyword VAR (abbrevi-
ated 'v'):
<top-level decls> ::= ( <data declaration> )*
<data declaration> ::= VAR <var-list>
Note that since there is only one variable type, there is no need to declare the type. Later
on, for full KISS, we can easily add a type description.
The procedure Prog becomes:
{--------------------------------------------------------------}
procedure Prog;
begin
Match('p');
Header;
TopDecls;
Main;
Match('.');
end;
{--------------------------------------------------------------}

Now, add the two new procedures:
{--------------------------------------------------------------}
{ Process a Data Declaration }
procedure Decl;
begin
Match('v');
GetChar;
end;
{--------------------------------------------------------------}
{ Parse and Translate Global Declarations }

procedure TopDecls;
begin
while Look <> 'b' do
case Look of
'v': Decl;
else Abort('Unrecognized Keyword ''' + Look + '''');
end;
end;
{--------------------------------------------------------------}
Note that at this point, Decl is just a stub. It generates no code, and it doesn't process a list ...
every variable must occur in a separate VAR statement.
OK, now we can have any number of data declarations, each starting with a 'v' for VAR,
before the BEGIN-block. Try a few cases and see what happens.

DECLARATIONS AND SYMBOLS

That looks pretty good, but we're still only generating the null program for output. A real
compiler would issue assembler directives to allocate storage for the variables. It's about
time we actually produced some code.
With a little extra code, that's an easy thing to do from procedure Decl. Modify it as fol-
lows:
{--------------------------------------------------------------}
{ Parse and Translate a Data Declaration }
procedure Decl;
var Name: char;
begin
Match('v');
Alloc(GetName);
end;
{--------------------------------------------------------------}
The procedure Alloc just issues a command to the assembler to allocate storage:
{--------------------------------------------------------------}
{ Allocate Storage for a Variable }

procedure Alloc(N: char);
begin
WriteLn(N, ':', TAB, 'DC 0');
end;
{--------------------------------------------------------------}

Give this one a whirl. Try an input that declares some variables, such as:
pvxvyvzbe.
See how the storage is allocated? Simple, huh? Note also that the entry point, "MAIN,"
comes out in the right place.
For the record, a "real" compiler would also have a symbol table to record the variables being
used. Normally, the symbol table is necessary to record the type of each variable. But since in
this case all variables have the same type, we don't need a symbol table for that reason. As it
turns out, we're going to find a symbol table necessary even without different types, but let's
postpone that need until it arises.
Of course, we haven't really parsed the correct syntax for a data declaration, since it involves
a variable list. Our version only permits a single variable. That's easy to fix, too.
The BNF for <var-list> is
<var-list> ::= <ident> (, <ident>)*
Adding this syntax to Decl gives this new version:
{--------------------------------------------------------------}
{ Parse and Translate a Data Declaration }
procedure Decl;
var Name: char;
begin
   Match('v');
   Alloc(GetName);
   while Look = ',' do begin
      GetChar;
      Alloc(GetName);
   end;
end;
{--------------------------------------------------------------}
OK, now compile this code and give it a try. Try a number of lines of VAR declarations, try a
list of several variables on one line, and try combinations of the two. Does it work?
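If you'd like to watch the list logic on its own, here is a minimal stand-alone sketch. It is not part of the TINY source: the string Src and the index P are hypothetical scaffolding that feed the parser from a constant instead of the keyboard, and all error checking has been left out.

```pascal
program VarListDemo;
{ Sketch only: Src and P are hypothetical scaffolding that feed the
  parser from a string; error checking is omitted. }

const TAB = ^I;
      Src: string = 'vx,yvzb';

var P: integer;
    Look: char;

{ Read New Character From the Source String }
procedure GetChar;
begin
   Look := Src[P];
   P := P + 1;
end;

{ Get an Identifier }
function GetName: char;
begin
   GetName := UpCase(Look);
   GetChar;
end;

{ Allocate Storage for a Variable }
procedure Alloc(N: char);
begin
   WriteLn(N, ':', TAB, 'DC 0');
end;

{ Parse and Translate a Data Declaration }
procedure Decl;
begin
   GetChar;                      { skip the 'v' }
   Alloc(GetName);
   while Look = ',' do begin
      GetChar;                   { skip the comma }
      Alloc(GetName);
   end;
end;

begin
   P := 1;
   GetChar;
   while Look = 'v' do
      Decl;
end.
```

The input mixes both forms (a list "x,y" followed by a separate "vz"), so the sketch should emit one DC line each for X, Y and Z before it stops at the 'b'.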

INITIALIZERS
As long as we're dealing with data declarations, one thing that's always bothered me
about Pascal is that it doesn't allow initializing data items in the declaration. That feature
is admittedly sort of a frill, and it may be out of place in a language that purports to be a
minimal language. But it's also SO easy to add that it seems a shame not to do so. The
BNF becomes:
<var-list> ::= <var> ( , <var> )*
<var> ::= <ident> [ = <integer> ]
Change Alloc as follows:
{--------------------------------------------------------------}
{ Allocate Storage for a Variable }
procedure Alloc(N: char);
begin
   Write(N, ':', TAB, 'DC ');
   if Look = '=' then begin
      Match('=');
      WriteLn(GetNum);
   end
   else
      WriteLn('0');
end;
{--------------------------------------------------------------}

There you are: an initializer with six added lines of Pascal.
OK, try this version of TINY and verify that you can, indeed, give the variables initial values.
By golly, this thing is starting to look real! Of course, it still doesn't DO anything, but it looks
good, doesn't it?
Before leaving this section, I should point out that we've used two versions of function Get-
Num. One, the earlier one, returns a character value, a single digit. The other accepts a multi-
digit integer and returns an integer value. Either one will work here, since WriteLn will handle
either type. But there's no reason to limit ourselves to single-digit values here, so the correct
version to use is the one that returns an integer. Here it is:
{--------------------------------------------------------------}
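{ Get a Number }
function GetNum: integer;
var Val: integer;
begin
   Val := 0;
   if not IsDigit(Look) then Expected('Integer');
   while IsDigit(Look) do begin
      Val := 10 * Val + Ord(Look) - Ord('0');
      GetChar;
   end;
   GetNum := Val;
end;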
{--------------------------------------------------------------}

As a matter of fact, strictly speaking we should allow for expressions in the data field of
the initializer, or at the very least for negative values. For now, let's just allow for negative
values by changing the code for Alloc as follows:
{--------------------------------------------------------------}
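{ Allocate Storage for a Variable }
procedure Alloc(N: char);
begin
   { These two lines rely on ST and InTable, introduced in the next section }
   if InTable(N) then Abort('Duplicate Variable Name ' + N);
   ST[N] := 'v';
   Write(N, ':', TAB, 'DC ');
   if Look = '=' then begin
      Match('=');
      if Look = '-' then begin
         Write(Look);
         Match('-');
      end;
      WriteLn(GetNum);
   end
   else
      WriteLn('0');
end;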
{--------------------------------------------------------------}
Now you should be able to initialize variables with negative and/or multi-digit values.
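For a quick check of the initializer logic away from the rest of the compiler, something like the following sketch will do. It is an illustration under stated assumptions, not the TINY source: Src and P are hypothetical stand-ins for keyboard input, GetNum has no error checking, and the little driver loop at the bottom replaces Decl.

```pascal
program InitDemo;
{ Sketch only: Src and P are hypothetical scaffolding; GetNum here
  has no error checking. }

const TAB = ^I;
      Src: string = 'a=-12,b=7,c.';

var P: integer;
    Look: char;

{ Read New Character From the Source String }
procedure GetChar;
begin
   Look := Src[P];
   P := P + 1;
end;

{ Get a (Multi-Digit) Number }
function GetNum: integer;
var Val: integer;
begin
   Val := 0;
   while Look in ['0'..'9'] do begin
      Val := 10 * Val + Ord(Look) - Ord('0');
      GetChar;
   end;
   GetNum := Val;
end;

{ Allocate Storage for a Variable, with Optional Initializer }
procedure Alloc(N: char);
begin
   Write(N, ':', TAB, 'DC ');
   if Look = '=' then begin
      GetChar;
      if Look = '-' then begin
         Write(Look);            { pass the sign straight through }
         GetChar;
      end;
      WriteLn(GetNum);
   end
   else
      WriteLn('0');
end;

var N: char;

begin
   P := 1;
   GetChar;
   repeat
      N := UpCase(Look);
      GetChar;
      Alloc(N);
      if Look = ',' then GetChar;
   until Look = '.';
end.
```

Note that the minus sign is never converted to a number; it is simply echoed in front of the digits, and the assembler does the rest. A gets -12, B gets 7, and C, with no initializer, falls back to 0.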

THE SYMBOL TABLE

There's one problem with the compiler as it stands so far: it doesn't do anything to record a
variable when we declare it. So the compiler is perfectly content to allocate storage for sev-
eral variables with the same name. You can easily verify this with an input like
pvavavabe.
Here we've declared the variable A three times. As you can see, the compiler will cheerfully
accept that, and generate three identical labels. Not good.
Later on, when we start referencing variables, the compiler will also let us reference variables
that don't exist. The assembler will catch both of these error conditions, but it doesn't seem
friendly at all to pass such errors along to the assembler. The compiler should catch such
things at the source language level.
So even though we don't need a symbol table to record data types, we ought to install one
just to check for these two conditions. Since at this point we are still restricted to single-char-
acter variable names, the symbol table can be trivial. To provide for it, first add the following
declaration at the beginning of your program:
var ST: array['A'..'Z'] of char;
and insert the following function:
{--------------------------------------------------------------}
{ Look for Symbol in Table }
function InTable(n: char): Boolean;
begin
InTable := ST[n] <> ' ';
end;
{--------------------------------------------------------------}

We also need to initialize the table to all blanks. The following lines in Init will do the
job:
var i: char;
begin
   for i := 'A' to 'Z' do
      ST[i] := ' ';
   ...
Finally, insert the following two lines at the beginning of Alloc:
   if InTable(N) then Abort('Duplicate Variable Name ' + N);
   ST[N] := 'v';
That should do it. The compiler will now catch duplicate declarations. Later, we can also
use InTable when generating references to the variables.
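The duplicate check is easy to try in isolation. The sketch below keeps ST and InTable as given above; the function Declare and its Boolean result are assumptions made for illustration only (the real Alloc calls Abort rather than returning a flag).

```pascal
program SymTabDemo;
{ Sketch only: Declare is a hypothetical helper, not part of TINY }

var ST: array['A'..'Z'] of char;
    i: char;

{ Look for Symbol in Table }
function InTable(n: char): Boolean;
begin
   InTable := ST[n] <> ' ';
end;

{ Record a variable; return False if it was already declared }
function Declare(n: char): Boolean;
begin
   if InTable(n) then
      Declare := False
   else begin
      ST[n] := 'v';
      Declare := True;
   end;
end;

begin
   for i := 'A' to 'Z' do        { initialize the table to all blanks }
      ST[i] := ' ';
   WriteLn(Declare('A'));        { first declaration is accepted }
   WriteLn(Declare('B'));
   WriteLn(Declare('A'));        { duplicate is rejected }
end.
```

The first two calls succeed and the third reports FALSE, which is the case that the input pvavavabe. exercises in the compiler.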

EXECUTABLE STATEMENTS
At this point, we can generate a null program that has some data variables declared and pos-
sibly initialized. But so far we haven't arranged to generate the first line of executable code.
Believe it or not, though, we almost have a usable language! What's missing is the execut-
able code that must go into the main program. But that code is just assignment statements
and control statements ... all stuff we have done before. So it shouldn't take us long to provide
for them, as well.
The BNF definition given earlier for the main program included a statement block, which we
have so far ignored:
<main> ::= BEGIN <block> END
For now, we can just consider a block to be a series of assignment statements:
<block> ::= (Assignment)*

Let's start things off by adding a parser for the block. We'll begin with a stub for the
assignment statement:
{--------------------------------------------------------------}
{ Parse and Translate an Assignment Statement }
procedure Assignment;
begin
   GetChar;
end;
{--------------------------------------------------------------}
{ Parse and Translate a Block of Statements }
procedure Block;
begin
while Look <> 'e' do
Assignment;
end;
{--------------------------------------------------------------}

Modify procedure Main to call Block as shown below:
{--------------------------------------------------------------}
procedure Main;
begin
Match('b');
Prolog;
Block;
Match('e');
Epilog;
end;
{--------------------------------------------------------------}
This version still won't generate any code for the "assignment statements" ... all it does is to
eat characters until it sees the 'e' for 'END.' But it sets the stage for what is to follow.
The next step, of course, is to flesh out the code for an assignment statement. This is some-
thing we've done many times before, so I won't belabor it. This time, though, I'd like to deal
with the code generation a little differently. Up till now, we've always just inserted the Emits
that generate output code in line with the parsing routines. A little unstructured, perhaps, but
it seemed the most straightforward approach, and made it easy to see what kind of code
would be emitted for each construct.
However, I realize that most of you are using an 80x86 computer, so the 68000 code gener-
ated is of little use to you. Several of you have asked me if the CPU-dependent code couldn't
be collected into one spot where it would be easier to retarget to another CPU. The answer,
of course, is yes.

To accomplish this, insert the following "code generation" routines:
{---------------------------------------------------------------}
{ Clear the Primary Register }
procedure Clear;
begin
EmitLn('CLR D0');
end;
{---------------------------------------------------------------}
{ Negate the Primary Register }
procedure Negate;
begin
EmitLn('NEG D0');
end;
{---------------------------------------------------------------}
{ Load a Constant Value to Primary Register }
procedure LoadConst(n: integer);
begin
Emit('MOVE #');
WriteLn(n, ',D0');
end;

{---------------------------------------------------------------}
{ Load a Variable to Primary Register }
procedure LoadVar(Name: char);
begin
   if not InTable(Name) then Undefined(Name);
   EmitLn('MOVE ' + Name + '(PC),D0');
end;
{---------------------------------------------------------------}
{ Push Primary onto Stack }
procedure Push;
begin
   EmitLn('MOVE D0,-(SP)');
end;
{---------------------------------------------------------------}
{ Add Top of Stack to Primary }
procedure PopAdd;
begin
   EmitLn('ADD (SP)+,D0');
end;

{---------------------------------------------------------------}
{ Subtract Primary from Top of Stack }
procedure PopSub;
begin
   EmitLn('SUB (SP)+,D0');
   EmitLn('NEG D0');
end;
{---------------------------------------------------------------}
{ Multiply Top of Stack by Primary }
procedure PopMul;
begin
   EmitLn('MULS (SP)+,D0');
end;

{---------------------------------------------------------------}
{ Divide Top of Stack by Primary }
procedure PopDiv;
begin
   EmitLn('MOVE (SP)+,D7');
   EmitLn('EXT.L D7');
   EmitLn('DIVS D0,D7');
   EmitLn('MOVE D7,D0');
end;
{---------------------------------------------------------------}
{ Store Primary to Variable }
procedure Store(Name: char);
begin
   if not InTable(Name) then Undefined(Name);
   EmitLn('LEA ' + Name + '(PC),A0');
   EmitLn('MOVE D0,(A0)');
end;
{---------------------------------------------------------------}
The nice part of this approach, of course, is that we can retarget the compiler to a new CPU
simply by rewriting these "code generator" procedures. In addition, we will find later that we
can improve the code quality by tweaking these routines a bit, without having to modify the
compiler proper.
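To give a feel for what retargeting involves, here is a hedged sketch of a few of these routines rewritten for the 80x86. None of it comes from the original listing: the choice of AX as the primary register and BX as a scratch register is an assumption, and only enough routines are shown to compile the expression 2-3.

```pascal
program RetargetDemo;
{ Sketch only: an 80x86 version of a few code generator routines.
  The register choice (AX primary, BX scratch) is an assumption. }

const TAB = ^I;

{ Output a String with Tab and CRLF }
procedure EmitLn(s: string);
begin
   WriteLn(TAB, s);
end;

{ Clear the Primary Register }
procedure Clear;
begin
   EmitLn('XOR AX,AX');
end;

{ Negate the Primary Register }
procedure Negate;
begin
   EmitLn('NEG AX');
end;

{ Load a Constant Value to Primary Register }
procedure LoadConst(n: integer);
begin
   WriteLn(TAB, 'MOV AX,', n);
end;

{ Push Primary onto Stack }
procedure Push;
begin
   EmitLn('PUSH AX');
end;

{ Add Top of Stack to Primary }
procedure PopAdd;
begin
   EmitLn('POP BX');
   EmitLn('ADD AX,BX');
end;

{ Subtract Primary from Top of Stack }
procedure PopSub;
begin
   EmitLn('POP BX');
   EmitLn('SUB AX,BX');          { AX = primary - top of stack }
   EmitLn('NEG AX');             { flip the sign: top of stack - primary }
end;

begin
   { The sequence of calls the parser would make for "2-3" }
   LoadConst(2);
   Push;
   LoadConst(3);
   PopSub;
end.
```

The point is that the parsing routines never see a register name. They only call Push, PopSub and friends, so swapping this set of procedures for the 68000 set changes the target without touching the parser.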

Note that both LoadVar and Store check the symbol table to make sure that the variable is
defined. The error handler Undefined simply calls Abort:
{--------------------------------------------------------------}
{ Report an Undefined Identifier }
procedure Undefined(n: string);
begin
Abort('Undefined Identifier ' + n);
end;
{--------------------------------------------------------------}
OK, we are now finally ready to begin processing executable code. We'll do that by
replacing the stub version of procedure Assignment.
We've been down this road many times before, so this should all be familiar to you. In
fact, except for the changes associated with the code generation, we could just copy the
procedures from Part VII. Since we are making some changes, I won't just copy them, but
we will go a little faster than usual.
The BNF for the assignment statement is:
<assignment> ::= <ident> = <expression>
<expression> ::= <first term> ( <addop> <term> )*
<first term> ::= <first factor> <rest>
<term> ::= <factor> <rest>
<rest> ::= ( <mulop> <factor> )*
<first factor> ::= [ <addop> ] <factor>
<factor> ::= <var> | <number> | ( <expression> )

This version of the BNF is also a bit different than we've used before ... yet another "variation
on the theme of an expression." This particular version has what I consider to be the best
treatment of the unary minus. As you'll see later, it lets us handle negative constant values
efficiently. It's worth mentioning here that we have often seen the advantages of "tweaking"
the BNF as we go, to help make the language easy to parse. What you're looking at here is a
bit different: we've tweaked the BNF to make the CODE GENERATION more efficient! That's
a first for this series.
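The efficiency gain is easiest to see side by side. The following sketch is not the parser itself: Compile is a hypothetical driver that looks at a single leading factor held in a string, handles only one-character operands, and assumes the 68000 conventions of this series (D0 as the primary register, variables loaded PC-relative).

```pascal
program NegFactorDemo;
{ Sketch only. Compile is a hypothetical driver that translates one
  leading factor from a string instead of the keyboard. }

const TAB = ^I;

procedure Compile(Src: string);
var Look: char;
begin
   WriteLn('Input: ', Src);
   Look := Src[1];
   if Look = '-' then begin
      Look := Src[2];
      if Look in ['0'..'9'] then
         { negative constant: the sign is folded into the operand }
         WriteLn(TAB, 'MOVE #-', Look, ',D0')
      else begin
         { negative variable: load it, then negate }
         WriteLn(TAB, 'MOVE ', UpCase(Look), '(PC),D0');
         WriteLn(TAB, 'NEG D0');
      end;
   end
   else if Look in ['0'..'9'] then
      WriteLn(TAB, 'MOVE #', Look, ',D0')
   else
      WriteLn(TAB, 'MOVE ', UpCase(Look), '(PC),D0');
end;

begin
   Compile('-3');
   Compile('-x');
end.
```

A negative constant costs a single instruction, because the sign becomes part of the immediate operand. Only a negated variable needs the extra NEG.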
Anyhow, the following code implements the BNF:
{---------------------------------------------------------------}
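{ Parse and Translate a Math Factor }
procedure Expression; Forward;

procedure Factor;
begin
   if Look = '(' then begin
      Match('(');
      Expression;
      Match(')');
      end
   else if IsAlpha(Look) then
      LoadVar(GetName)
   else
      LoadConst(GetNum);
end;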

{--------------------------------------------------------------}
{ Parse and Translate a Negative Factor }
procedure NegFactor;
begin
   Match('-');
   if IsDigit(Look) then
      LoadConst(-GetNum)
   else begin
      Factor;
      Negate;
   end;
end;

{--------------------------------------------------------------}
{ Parse and Translate a Leading Factor }
procedure FirstFactor;
begin
case Look of
'+': begin
Match('+');
Factor;
end;
'-': NegFactor;
else Factor;
end;
end;

{--------------------------------------------------------------}
procedure Multiply;
begin
Match('*');
Factor;
PopMul;
end;
{-------------------------------------------------------------}
procedure Divide;
begin
Match('/');
Factor;
PopDiv;
end;

{---------------------------------------------------------------}
{ Common Code Used by Term and FirstTerm }
procedure Term1;
begin
   while IsMulop(Look) do begin
      Push;
      case Look of
       '*': Multiply;
       '/': Divide;
      end;
   end;
end;
{---------------------------------------------------------------}
procedure Term;
begin
Factor;
Term1;
end;

{---------------------------------------------------------------}
{ Parse and Translate a Leading Term }
procedure FirstTerm;
begin
   FirstFactor;
   Term1;
end;
{--------------------------------------------------------------}
procedure Add;
begin
Match('+');
Term;
PopAdd;
end;

{-------------------------------------------------------------}
procedure Subtract;
begin
Match('-');
Term;
PopSub;
end;
{---------------------------------------------------------------}
{ Parse and Translate an Expression }
procedure Expression;
begin
   FirstTerm;
   while IsAddop(Look) do begin
      Push;
      case Look of
       '+': Add;
       '-': Subtract;
      end;
   end;
end;

{--------------------------------------------------------------}
{ Parse and Translate an Assignment Statement }
procedure Assignment;
var Name: char;
begin
   Name := GetName;
   Match('=');
   Expression;
   Store(Name);
end;
{--------------------------------------------------------------}
OK, if you've got all this code inserted, then compile it and check it out. You should be
seeing reasonable-looking code, representing a complete program that will assemble and
execute. We have a compiler!

BOOLEANS
The next step should also be familiar to you. We must add Boolean expressions and rela-
tional operations. Again, since we've already dealt with them more than once, I won't elabo-
rate much on them, except where they are different from what we've done before. Again, we
won't just copy from other files because I've changed a few things just a bit. Most of the
changes just involve encapsulating the machine-dependent parts as we did for the arithmetic
operations. I've also modified procedure NotFactor somewhat, to parallel the structure of
FirstFactor. Finally, I corrected an error in the object code for the relational operators: The
Scc instruction I used only sets the low 8 bits of D0. We want all 16 bits set for a logical true,
so I've added an instruction to sign-extend the low byte.
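The effect of the missing sign extension can be imitated with ordinary arithmetic. In this sketch the longint D0 is only a stand-in for the low word of the real register, and the starting value $1234 is an arbitrary assumption standing for whatever the compare left behind.

```pascal
program SccDemo;
{ Sketch only: D0 imitates the low 16 bits of the 68000 register }

var D0: longint;

begin
   D0 := $1234;                    { leftover operand from the compare }

   { SEQ D0 on a true compare: only the low byte becomes $FF }
   D0 := (D0 and $FF00) or $FF;
   WriteLn('after SEQ: ', D0);     { 4863 = $12FF, not a clean TRUE }

   { EXT D0: copy bit 7 of the low byte into bits 8..15 }
   if (D0 and $80) <> 0 then
      D0 := D0 or $FF00
   else
      D0 := D0 and $00FF;
   WriteLn('after EXT: ', D0);     { 65535 = $FFFF, i.e. -1 as a word }
end.
```

Without the EXT, the upper byte still holds stale data, so a later 16-bit test or logical operation would see a value that is neither 0 nor -1.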
To begin, we're going to need some more recognizers:
{--------------------------------------------------------------}
function IsOrop(c: char): boolean;
begin
IsOrop := c in ['|', '~'];
end;
{--------------------------------------------------------------}
function IsRelop(c: char): boolean;
begin
IsRelop := c in ['=', '#', '<', '>'];
end;
{--------------------------------------------------------------}

Also, we're going to need some more code generation routines:
{---------------------------------------------------------------}
{ Complement the Primary Register }
procedure NotIt;
begin
EmitLn('NOT D0');
end;
{---------------------------------------------------------------}
{ AND Top of Stack with Primary }
procedure PopAnd;
begin
   EmitLn('AND (SP)+,D0');
end;
{---------------------------------------------------------------}
{ OR Top of Stack with Primary }
procedure PopOr;
begin
   EmitLn('OR (SP)+,D0');
end;

{---------------------------------------------------------------}
{ XOR Top of Stack with Primary }
procedure PopXor;
begin
   EmitLn('EOR (SP)+,D0');
end;
{---------------------------------------------------------------}
{ Compare Top of Stack with Primary }
procedure PopCompare;
begin
   EmitLn('CMP (SP)+,D0');
end;
{---------------------------------------------------------------}
{ Set D0 If Compare was = }
procedure SetEqual;
begin
EmitLn('SEQ D0');
EmitLn('EXT D0');
end;

{---------------------------------------------------------------}
{ Set D0 If Compare was != }
procedure SetNEqual;
begin
EmitLn('SNE D0');
EmitLn('EXT D0');
end;
{---------------------------------------------------------------}
{ Set D0 If Compare was > }
procedure SetGreater;
begin
EmitLn('SLT D0');
EmitLn('EXT D0');
end;
{---------------------------------------------------------------}
{ Set D0 If Compare was < }
procedure SetLess;
begin
EmitLn('SGT D0');
EmitLn('EXT D0');
end;

All of this gives us the tools we need. The BNF for the Boolean expressions is:
<bool-expr> ::= <bool-term> ( <orop> <bool-term> )*
<bool-term> ::= <not-factor> ( <andop> <not-factor> )*
<not-factor> ::= [ '!' ] <relation>
<relation> ::= <expression> [ <relop> <expression> ]
Sharp-eyed readers might note that this syntax does not include the non-terminal "bool-fac-
tor" used in earlier versions. It was needed then because I also allowed for the Boolean con-
stants TRUE and FALSE. But remember that in TINY there is no distinction made between
Boolean and arithmetic types ... they can be freely intermixed. So there is really no need for
these predefined values ... we can just use -1 and 0, respectively.
In C terminology, we could always use the defines:
#define TRUE -1
#define FALSE 0
(That is, if TINY had a preprocessor.) Later on, when we allow for declarations of constants,
these two values will be predefined by the language.
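Why -1 rather than 1 for TRUE? Because TINY's logical operators are really bitwise operators, and -1 has every bit set. The short sketch below illustrates the point using Pascal's own bitwise operators on integers; the constant names TRUE_VAL and FALSE_VAL are made up for the example.

```pascal
program TruthDemo;
{ Sketch only: shows why -1 and 0 work as TRUE and FALSE when the
  logical operators are really bitwise operators. }

const
   TRUE_VAL  = -1;   { all bits set }
   FALSE_VAL = 0;    { all bits clear }

var t, f: integer;

begin
   t := TRUE_VAL;
   f := FALSE_VAL;
   WriteLn('not TRUE       = ', not t);
   WriteLn('not FALSE      = ', not f);
   WriteLn('TRUE and TRUE  = ', t and t);
   WriteLn('TRUE and FALSE = ', t and f);
   WriteLn('TRUE or FALSE  = ', t or f);
   WriteLn('TRUE xor TRUE  = ', t xor t);
end.
```

With this convention, complementing TRUE gives exactly 0 and complementing FALSE gives exactly -1. Had we picked 1 for TRUE, its bitwise complement would be -2, which is still non-zero and would therefore still test as true.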
The reason that I'm harping on this is that I've already tried the alternative, which is to include
TRUE and FALSE as keywords. The problem with that approach is that it then requires lexi-
cal scanning for EVERY variable name in every expression. If you'll recall, I pointed out in
Installment VII that this slows the compiler down considerably. As long as keywords can't be
in expressions, we need to do the scanning only at the beginning of every new statement ...
quite an improvement. So using the syntax above not only simplifies the parsing, but speeds
up the scanning as well.

OK, given that we're all satisfied with the syntax above, the corresponding code is shown
below:
{---------------------------------------------------------------}
procedure Equals;
begin
Match('=');
Expression;
PopCompare;
SetEqual;
end;
{---------------------------------------------------------------}
procedure NotEquals;
begin
Match('#');
Expression;
PopCompare;
SetNEqual;
end;

{---------------------------------------------------------------}
procedure Less;
begin
Match('<');
Expression;
PopCompare;
SetLess;
end;
{---------------------------------------------------------------}
procedure Greater;
begin
Match('>');
Expression;
PopCompare;
SetGreater;
end;

{---------------------------------------------------------------}
procedure Relation;
begin
   Expression;
   if IsRelop(Look) then begin
      Push;
      case Look of
       '=': Equals;
       '#': NotEquals;
       '<': Less;
       '>': Greater;
      end;
   end;
end;

{---------------------------------------------------------------}
{ Parse and Translate a Boolean Factor with Leading NOT }
procedure NotFactor;
begin
   if Look = '!' then begin
      Match('!');
      Relation;
      NotIt;
      end
   else
      Relation;
end;

{---------------------------------------------------------------}
procedure BoolTerm;
begin
   NotFactor;
   while Look = '&' do begin
      Push;
      Match('&');
      NotFactor;
      PopAnd;
   end;
end;
{--------------------------------------------------------------}
procedure BoolOr;
begin
Match('|');
BoolTerm;
PopOr;
end;

{--------------------------------------------------------------}
procedure BoolXor;
begin
Match('~');
BoolTerm;
PopXor;
end;
{---------------------------------------------------------------}
procedure BoolExpression;
begin
   BoolTerm;
   while IsOrop(Look) do begin
      Push;
      case Look of
       '|': BoolOr;
       '~': BoolXor;
      end;
   end;
end;

To tie it all together, don't forget to change the references to Expression in procedures
Factor and Assignment so that they call BoolExpression instead. OK, if you've got all that
typed in, compile it and give it a whirl. First, make sure you can still parse an ordinary
arithmetic expression. Then, try a Boolean one. Finally, make sure that you can assign
the results of relations. Try, for example:
pvx,y,zbx=z>ye.
which stands for:
PROGRAM
VAR X,Y,Z
BEGIN
X = Z > Y
END.
See how this assigns a Boolean value to X?

CONTROL STRUCTURES
We're almost home. With Boolean expressions in place, it's a simple matter to add control
structures. For TINY, we'll only allow two kinds of them, the IF and the WHILE:
<if> ::= IF <bool-expression> <block> [ ELSE <block>] ENDIF
<while> ::= WHILE <bool-expression> <block> ENDWHILE
Once again, let me spell out the decisions implicit in this syntax, which departs strongly from
that of C or Pascal. In both of those languages, the "body" of an IF or WHILE is regarded as
a single statement. If you intend to use a block of more than one statement, you have to build
a compound statement using BEGIN-END (in Pascal) or '{}' (in C). In TINY (and KISS) there
is no such thing as a compound statement ... single or multiple they're all just blocks to these
languages.
In KISS, all the control structures will have explicit and unique keywords bracketing the
statement block, so there can be no confusion as to where things begin and end. This is the
modern approach, used in such respected languages as Ada and Modula 2, and it completely
eliminates the problem of the "dangling else."
Note that I could have chosen to use the same keyword END to end all the constructs, as is
done in Pascal. (The closing '}' in C serves the same purpose.) But this has always led to
confusion, which is why Pascal programmers tend to write things like
end { loop }
or end { if }
As I explained in Part V, using unique terminal keywords does increase the size of the keyword
list and therefore slows down the scanning, but in this case it seems a small price to pay
for the added insurance. Better to find the errors at compile time rather than run time.
One last thought: The two constructs above each have the non-terminals
<bool-expression> and <block>
juxtaposed with no separating keyword. In Pascal we would expect the keywords THEN
and DO in these locations.
I have no problem with leaving out these keywords, and the parser has no trouble either,
ON CONDITION that we make no errors in the bool-expression part. On the other hand, if
we were to include these extra keywords we would get yet one more level of insurance at
very little cost, and I have no problem with that, either. Use your best judgment as to
which way to go.
OK, with that bit of explanation let's proceed. As usual, we're going to need some new
code generation routines. These generate the code for conditional and unconditional
branches:
{---------------------------------------------------------------}
{ Branch Unconditional }
procedure Branch(L: string);
begin
EmitLn('BRA ' + L);
end;
{---------------------------------------------------------------}
{ Branch False }
procedure BranchFalse(L: string);
begin
EmitLn('TST D0');
EmitLn('BEQ ' + L);
end;
{--------------------------------------------------------------}

Except for the encapsulation of the code generation, the code to parse the control constructs
is the same as you've seen before:
{---------------------------------------------------------------}
procedure DoIf;
var L1, L2: string;
begin
   Match('i');
   BoolExpression;
   L1 := NewLabel;
   L2 := L1;
   BranchFalse(L1);
   Block;
   if Look = 'l' then begin
      Match('l');
      L2 := NewLabel;
      Branch(L2);
      PostLabel(L1);
      Block;
   end;
   PostLabel(L2);
   Match('e');
end;

{--------------------------------------------------------------}
procedure DoWhile;
var L1, L2: string;
begin
Match('w');
L1 := NewLabel;
L2 := NewLabel;
PostLabel(L1);
BoolExpression;
BranchFalse(L2);
Block;
Match('e');
Branch(L1);
PostLabel(L2);
end;
{--------------------------------------------------------------}
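
To see the shape of the code these two routines produce, here is a Python sketch of the label layout only. The condition and the statement bodies are passed in as ready-made lines, the label names follow the L0, L1, ... pattern, and all function names are invented for illustration; this is a model of the output, not the compiler itself.

```python
import itertools

def new_label(counter):
    """Return the next label name: L0, L1, ..."""
    return "L" + str(next(counter))

def emit_if(counter, cond, then_part, else_part=None):
    """Lay out IF cond then_part [ELSE else_part] ENDIF as DoIf does."""
    l1 = new_label(counter)
    l2 = l1                                  # without ELSE, both labels coincide
    out = list(cond) + ["TST D0", "BEQ " + l1]
    out += then_part
    if else_part is not None:
        l2 = new_label(counter)
        out += ["BRA " + l2, l1 + ":"]       # skip the ELSE part, then mark its start
        out += else_part
    out.append(l2 + ":")
    return out

def emit_while(counter, cond, body):
    """Lay out WHILE cond body ENDWHILE as DoWhile does."""
    top, done = new_label(counter), new_label(counter)
    out = [top + ":"] + list(cond) + ["TST D0", "BEQ " + done]
    out += body
    out += ["BRA " + top, done + ":"]
    return out

if_code = emit_if(itertools.count(), ["<cond>"], ["<then>"], ["<else>"])
while_code = emit_while(itertools.count(), ["<cond>"], ["<body>"])
```

Printing `"\n".join(if_code)` shows why L2 starts out equal to L1: when there is no ELSE, a single label serves as both the false target and the exit.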

To tie everything together, we need only modify procedure Block to recognize the "keywords"
for the IF and WHILE. As usual, we expand the definition of a block like so:
<block> ::= ( <statement> )*
where
<statement> ::= <if> | <while> | <assignment>
The corresponding code is:
{--------------------------------------------------------------}
procedure Block;
begin
   while not(Look in ['e', 'l']) do begin
      case Look of
       'i': DoIf;
       'w': DoWhile;
      else Assignment;
      end;
   end;
end;
{--------------------------------------------------------------}

OK, add the routines I've given, compile and test them. You should be able to parse the
single-character versions of any of the control constructs. It's looking pretty good!
As a matter of fact, except for the single-character limitation we've got a virtually complete
version of TINY. I call it, with tongue planted firmly in cheek, TINY Version 0.1.

LEXICAL SCANNING
Of course, you know what's next: We have to convert the program so that it can deal with
multi-character keywords, newlines, and whitespace. We have just gone through all that in
Part VII. We'll use the distributed scanner technique that I showed you in that installment. The
actual implementation is a little different because the way I'm handling newlines is different.
To begin with, let's simply allow for whitespace. This involves only adding calls to SkipWhite
at the end of the three routines, GetName, GetNum, and Match. A call to SkipWhite in Init
primes the pump in case there are leading spaces.
Next, we need to deal with newlines. This is really a two-step process, since the treatment of
the newlines with single-character tokens is different from that for multi-character ones. We
can eliminate some work by doing both steps at once, but I feel safer taking things one step
at a time.
Insert the new procedure:
{--------------------------------------------------------------}
{ Skip Over an End-of-Line }
procedure NewLine;
begin
   while Look = CR do begin
      GetChar;
      if Look = LF then GetChar;
      SkipWhite;
   end;
end;
{--------------------------------------------------------------}

Note that we have seen this procedure before in the form of Procedure Fin. I've changed
the name since this new one seems more descriptive of the actual function. I've also
changed the code to allow for multiple newlines and lines with nothing but white space.
The next step is to insert calls to NewLine wherever we decide a newline is permissible.
As I've pointed out before, this can be very different in different languages. In TINY, I've
decided to allow them virtually anywhere. This means that we need calls to NewLine at
the BEGINNING (not the end, as with SkipWhite) of the procedures GetName, GetNum,
and Match.
For procedures that have while loops, such as TopDecl, we need a call to NewLine at the
beginning of the procedure AND at the bottom of each loop. That way, we can be assured
that NewLine has just been called at the beginning of each pass through the loop.
If you've got all this done, try the program out and verify that it will indeed handle white
space and newlines.
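
If you want to convince yourself that the NewLine loop really does swallow blank lines and lines holding only white space, the following Python model may help. The Reader class is a made-up stand-in for the Look/GetChar pair, and it assumes CR-LF line endings.

```python
class Reader:
    """Minimal stand-in for the lookahead character and GetChar."""

    def __init__(self, text):
        self.text = text
        self.pos = 0
        self.look = ""
        self.get_char()

    def get_char(self):
        # An empty string plays the role of end of input.
        self.look = self.text[self.pos] if self.pos < len(self.text) else ""
        self.pos += 1

    def skip_white(self):
        while self.look in (" ", "\t"):
            self.get_char()

    def new_line(self):
        # Each pass consumes one CR (and the LF after it, if any),
        # then any leading white space on the following line.
        while self.look == "\r":
            self.get_char()
            if self.look == "\n":
                self.get_char()
            self.skip_white()

r = Reader("\r\n   \r\n\t\r\nX = 1")
r.new_line()
first_real_char = r.look
```

After the call, the lookahead sits on the first character that is neither an end-of-line nor white space.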
If it does, then we're ready to deal with multi-character tokens and keywords. To begin,
add the additional declarations (copied almost verbatim from Part VII):
{--------------------------------------------------------------}
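{ Type Declarations }
type Symbol = string[8];
     SymTab = array[1..1000] of Symbol;
     TabPtr = ^SymTab;

{--------------------------------------------------------------}
{ Variable Declarations }
var Look : char;             { Lookahead Character }
    Token: char;             { Encoded Token }
    Value: string[16];       { Unencoded Token }
    ST: Array['A'..'Z'] of char;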
{--------------------------------------------------------------}
const NKW = 9;
NKW1 = 10;
const KWlist: array[1..NKW] of Symbol =
('IF', 'ELSE', 'ENDIF', 'WHILE', 'ENDWHILE',
'VAR', 'BEGIN', 'END', 'PROGRAM');
const KWcode: string[NKW1] = 'xilewevbep';
{--------------------------------------------------------------}

Next, add the three procedures, also from Part VII:
{--------------------------------------------------------------}
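{ Table Lookup }
{ If the input string matches a table entry, return the entry }
{ index. If not, return a zero. }
function Lookup(T: TabPtr; s: string; n: integer): integer;
var i: integer;
    found: Boolean;
begin
   found := false;
   i := n;
   while (i > 0) and not found do
      if s = T^[i] then
         found := true
      else
         dec(i);
   Lookup := i;
end;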
{--------------------------------------------------------------}

{--------------------------------------------------------------}
procedure Scan;
begin
GetName;
Token := KWcode[Lookup(Addr(KWlist), Value, NKW) + 1];
end;
{--------------------------------------------------------------}
{--------------------------------------------------------------}
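{ Match a Specific Input String }
procedure MatchString(x: string);
begin
   if Value <> x then Expected('''' + x + '''');
end;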
{--------------------------------------------------------------}
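
The interplay between Lookup and the KWcode string is easy to get wrong by one, so here is a small Python model of it. The table contents are the ones declared above; the function names and the use of Python lists are my own, and because Python strings are indexed from zero the "+ 1" of the Pascal version disappears.

```python
KWLIST = ["IF", "ELSE", "ENDIF", "WHILE", "ENDWHILE",
          "VAR", "BEGIN", "END", "PROGRAM"]
KWCODE = "xilewevbep"   # first code ('x') means "not a keyword"

def lookup(table, s):
    """Search from the end; return a 1-based index, or 0 if s is absent."""
    i = len(table)
    while i > 0 and table[i - 1] != s:
        i -= 1
    return i

def token_for(name):
    """Map an upper-case name to its one-character token code."""
    return KWCODE[lookup(KWLIST, name)]

codes = [token_for(w) for w in ("IF", "ELSE", "ENDWHILE", "PROGRAM", "COUNT")]
```

Note that ENDIF, ENDWHILE and END all map to the same code 'e', which is what lets Block stop on any of them.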

Now, we have to make a fairly large number of subtle changes to the remaining procedures.
First, we must change the function GetName to a procedure, again as we did in
Part VII:
{--------------------------------------------------------------}
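procedure GetName;
begin
   NewLine;
   if not IsAlpha(Look) then Expected('Name');
   Value := '';
   while IsAlNum(Look) do begin
      Value := Value + UpCase(Look);
      GetChar;
   end;
   SkipWhite;
end;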
{--------------------------------------------------------------}
Note that this procedure leaves its result in the global string Value.

Next, we have to change every reference to GetName to reflect its new form. These occur in
Factor, Assignment, and Decl:
{---------------------------------------------------------------}
procedure BoolExpression; Forward;
procedure Factor;
begin
   if Look = '(' then begin
      Match('(');
      BoolExpression;
      Match(')');
      end
   else if IsAlpha(Look) then begin
      GetName;
      LoadVar(Value[1]);
      end
   else
      LoadConst(GetNum);
end;
{--------------------------------------------------------------}

{--------------------------------------------------------------}
procedure Assignment;
var Name: char;
begin
   Name := Value[1];
   Match('=');
   BoolExpression;
   Store(Name);
end;
{---------------------------------------------------------------}

procedure Decl;
begin
   GetName;
   Alloc(Value[1]);
   while Look = ',' do begin
      Match(',');
      GetName;
      Alloc(Value[1]);
   end;
end;
{--------------------------------------------------------------}

(Note that we're still only allowing single-character variable names, so we take the easy way
out here and simply use the first character of the string.)
Finally, we must make the changes to use Token instead of Look as the test character and to
call Scan at the appropriate places. Mostly, this involves deleting calls to Match, occasionally
replacing calls to Match by calls to MatchString, and replacing calls to NewLine by calls to
Scan. Here are the affected routines:
{---------------------------------------------------------------}
procedure DoIf;
var L1, L2: string;
begin
   BoolExpression;
   L1 := NewLabel;
   L2 := L1;
   BranchFalse(L1);
   Block;
   if Token = 'l' then begin
      L2 := NewLabel;
      Branch(L2);
      PostLabel(L1);
      Block;
   end;
   PostLabel(L2);
   MatchString('ENDIF');
end;

{--------------------------------------------------------------}
procedure DoWhile;
var L1, L2: string;
begin
L1 := NewLabel;
L2 := NewLabel;
PostLabel(L1);
BoolExpression;
BranchFalse(L2);
Block;
MatchString('ENDWHILE');
Branch(L1);
PostLabel(L2);
end;

{--------------------------------------------------------------}
procedure Block;
begin
Scan;
while not(Token in ['e', 'l']) do begin
case Token of
'i': DoIf;
'w': DoWhile;
else Assignment;
end;
Scan;
end;
end;

{--------------------------------------------------------------}
procedure TopDecls;
begin
Scan;
while Token <> 'b' do begin
case Token of
'v': Decl;
else Abort('Unrecognized Keyword ' + Value);
end;
Scan;
end;
end;

{--------------------------------------------------------------}
procedure Main;
begin
MatchString('BEGIN');
Prolog;
Block;
MatchString('END');
Epilog;
end;
{--------------------------------------------------------------}
procedure Prog;
begin
MatchString('PROGRAM');
Header;
TopDecls;
Main;
Match('.');
end;

{--------------------------------------------------------------}
{ Initialize }
procedure Init;
var i: char;
begin
   for i := 'A' to 'Z' do
      ST[i] := ' ';
   GetChar;
   Scan;
end;
{--------------------------------------------------------------}
That should do it. If all the changes got in correctly, you should now be parsing programs
that look like programs. (If you didn't make it through all the changes, don't despair. A
complete listing of the final form is given later.)
Did it work? If so, then we're just about home. In fact, with a few minor exceptions we've
already got a compiler that's usable. There are still a few areas that need improvement.

MULTI-CHARACTER VARIABLE NAMES

One of those is the restriction that we still have, requiring single-character variable names.
Now that we can handle multi-character keywords, this one begins to look very much like an
arbitrary and unnecessary limitation. And indeed it is. Basically, its only virtue is that it permits
a trivially simple implementation of the symbol table. But that's just a convenience to the
compiler writers, and needs to be eliminated.
We've done this step before. This time, as usual, I'm doing it a little differently. I think the
approach used here keeps things just about as simple as possible.
The natural way to implement a symbol table in Pascal is by declaring a record type, and
making the symbol table an array of such records. Here, though, we don't really need a type
field yet (there is only one kind of entry allowed so far), so we only need an array of symbols.
This has the advantage that we can use the existing procedure Lookup to search the symbol
table as well as the keyword list. As it turns out, even when we need more fields we can still
use the same approach, simply by storing the other fields in separate arrays.
OK, here are the changes that need to be made. First, add the new typed constant:
NEntry: integer = 0;
Then change the definition of the symbol table as follows:
const MaxEntry = 100;
var ST : array[1..MaxEntry] of Symbol;
(Note that ST is _NOT_ declared as a SymTab. That declaration is a phony one to get
Lookup to work. A SymTab would take up too much RAM space, and so one is never actually
allocated.)

Next, we need to replace InTable:
{--------------------------------------------------------------}
function InTable(n: Symbol): Boolean;
begin
InTable := Lookup(@ST, n, MaxEntry) <> 0;
end;
{--------------------------------------------------------------}
We also need a new procedure, AddEntry, that adds a new entry to the table:
{--------------------------------------------------------------}
{ Add a New Entry to Symbol Table }
procedure AddEntry(N: Symbol; T: char);
begin
if InTable(N) then Abort('Duplicate Identifier ' + N);
if NEntry = MaxEntry then Abort('Symbol Table Full');
Inc(NEntry);
ST[NEntry] := N;
SType[NEntry] := T;
end;
{--------------------------------------------------------------}
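
The "separate arrays" idea can be pictured with a short Python sketch: one list holds the names, a second list of the same length holds the type codes, and an entry is identified by its position in both. This is only a model under my own naming; Python lists grow on demand, so the fixed-size table and the NEntry counter are replaced by len().

```python
MAX_ENTRY = 100

st = []      # symbol names, in order of declaration
stype = []   # parallel list: one-character type code per symbol

def in_table(name):
    """True if name has already been entered."""
    return name in st

def add_entry(name, kind):
    """Append a new symbol, rejecting duplicates and overflow."""
    if in_table(name):
        raise ValueError("Duplicate Identifier " + name)
    if len(st) == MAX_ENTRY:
        raise ValueError("Symbol Table Full")
    st.append(name)
    stype.append(kind)

add_entry("COUNT", "v")
add_entry("TOTAL", "v")
kind_of_total = stype[st.index("TOTAL")]
```

Adding a further field later means adding a third parallel list, without touching the lookup.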

This procedure is called by Alloc:
{--------------------------------------------------------------}
procedure Alloc(N: Symbol);
begin
   if InTable(N) then Abort('Duplicate Variable Name ' + N);
   AddEntry(N, 'v');
   WriteLn(N, ':', TAB, 'DC 0');
end;
{--------------------------------------------------------------}
Finally, we must change all the routines that currently treat the variable name as a single
character. These include LoadVar and Store (just change the type from char to string), and
Factor, Assignment, and Decl (just change Value[1] to Value).

One last thing: change procedure Init to clear the array as shown:
{--------------------------------------------------------------}
{ Initialize }
procedure Init;
var i: integer;
begin
for i := 1 to MaxEntry do begin
ST[i] := '';
SType[i] := ' ';
end;
GetChar;
Scan;
end;
{--------------------------------------------------------------}
That should do it. Try it out and verify that you can, indeed, use multi-character variable
names.

MORE RELOPS
We still have one remaining single-character restriction: the one on relops. Some of the
relops are indeed single characters, but others require two. These are '<=' and '>='. I also
prefer the Pascal '<>' for "not equals," instead of '#'.
If you'll recall, in Part VII I pointed out that the conventional way to deal with relops is to
include them in the list of keywords, and let the lexical scanner find them. But, again, this
requires scanning throughout the expression parsing process, whereas so far we've been
able to limit the use of the scanner to the beginning of a statement.
I mentioned then that we can still get away with this, since the multi-character relops are so
few and so limited in their usage. It's easy to just treat them as special cases and handle
them in an ad hoc manner.
The changes required affect only the code generation routines and procedures Relation and
friends. First, we're going to need two more code generation routines:
{---------------------------------------------------------------}
{ Set D0 If Compare was <= }
procedure SetLessOrEqual;
begin
EmitLn('SGE D0');
EmitLn('EXT D0');
end;

{---------------------------------------------------------------}
{ Set D0 If Compare was >= }
procedure SetGreaterOrEqual;
begin
EmitLn('SLE D0');
EmitLn('EXT D0');
end;
{---------------------------------------------------------------}
Then, modify the relation parsing routines as shown below:
{---------------------------------------------------------------}
{ Recognize and Translate a Relational "Less Than or Equal" }
procedure LessOrEqual;
begin
Match('=');
Expression;
PopCompare;
SetLessOrEqual;
end;

{---------------------------------------------------------------}
procedure NotEqual;
begin
Match('>');
Expression;
PopCompare;
SetNEqual;
end;

{---------------------------------------------------------------}
procedure Less;
begin
Match('<');
case Look of
'=': LessOrEqual;
'>': NotEqual;
else begin
Expression;
PopCompare;
SetLess;
end;
end;
end;

{---------------------------------------------------------------}
procedure Greater;
begin
   Match('>');
   if Look = '=' then begin
      Match('=');
      Expression;
      PopCompare;
      SetGreaterOrEqual;
      end
   else begin
      Expression;
      PopCompare;
      SetGreater;
   end;
end;
{---------------------------------------------------------------}
That's all it takes. Now you can process all the relops. Try it.
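
The ad hoc treatment amounts to one character of lookahead after '<' or '>'. The Python sketch below isolates just that decision; it works on a plain string rather than the compiler's input stream, and the function name is invented for illustration.

```python
def read_relop(text, pos=0):
    """Return (operator, next_pos), peeking one character past '<' or '>'."""
    first = text[pos]
    if first not in ("=", "<", ">"):
        raise ValueError("relop expected")
    nxt = text[pos + 1] if pos + 1 < len(text) else ""
    if first == "<" and nxt in ("=", ">"):
        return first + nxt, pos + 2      # '<=' or '<>'
    if first == ">" and nxt == "=":
        return ">=", pos + 2
    return first, pos + 1                # plain '=', '<' or '>'

samples = [read_relop(s) for s in ("<=b", "<>b", "<b", ">=1", ">1", "=1")]
```

Only the second character is ever examined, so no keyword scanning is needed inside an expression.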

INPUT/OUTPUT
We now have a complete, working language, except for one minor embarrassment: we
have no way to get data in or out. We need some I/O.
Now, the convention these days, established in C and continued in Ada and Modula 2, is
to leave I/O statements out of the language itself, and just include them in the subroutine
library. That would be fine, except that so far we have no provision for subroutines.
Anyhow, with this approach you run into the problem of variable-length argument lists. In
Pascal, the I/O statements are built into the language because they are the only ones for
which the argument list can have a variable number of entries. In C, we settle for kludges
like scanf and printf, and must pass the argument count to the called procedure. In Ada
and Modula 2 we must use the awkward (and SLOW!) approach of a separate call for
each argument.
So I think I prefer the Pascal approach of building the I/O in, even though we don't need
to.

As usual, for this we need some more code generation routines. These turn out to be the
easiest of all, because all we do is to call library procedures to do the work:
{---------------------------------------------------------------}
{ Read Variable to Primary Register }
procedure ReadVar;
begin
EmitLn('BSR READ');
Store(Value);
end;
{---------------------------------------------------------------}
{ Write Variable from Primary Register }
procedure WriteVar;
begin
EmitLn('BSR WRITE');
end;
{--------------------------------------------------------------}
The idea is that READ loads the value from input to the D0, and WRITE outputs it from there.
These two procedures represent our first encounter with a need for library procedures ... the
components of a Run Time Library (RTL). Of course, someone (namely us) has to write these
routines, but they're not part of the compiler itself. I won't even bother showing the routines
here, since these are obviously very much OS-dependent. I _WILL_ simply say that for
SK*DOS, they are particularly simple ... almost trivial. One reason I won't show them here is
that you can add all kinds of fanciness to the things, for example by prompting in READ for
the inputs, and by giving the user a chance to reenter a bad input.

But that is really separate from compiler design, so for now I'll just assume that a library
called TINYLIB.LIB exists. Since we now need it loaded, we need to add a statement to
include it in procedure Header:
{--------------------------------------------------------------}
procedure Header;
begin
   WriteLn('WARMST', TAB, 'EQU $A01E');
   EmitLn('LIB TINYLIB');
end;
{--------------------------------------------------------------}
That takes care of that part. Now, we also need to recognize the read and write commands.
We can do this by adding two more keywords to our list:
{--------------------------------------------------------------}
const NKW = 11;
      NKW1 = 12;
const KWlist: array[1..NKW] of Symbol =
              ('IF', 'ELSE', 'ENDIF', 'WHILE', 'ENDWHILE',
               'READ', 'WRITE', 'VAR', 'BEGIN', 'END',
               'PROGRAM');
const KWcode: string[NKW1] = 'xileweRWvbep';
{--------------------------------------------------------------}

(Note how I'm using upper case codes here to avoid conflict with the 'w' of WHILE.)
Next, we need procedures for processing the read/write statement and its argument list:
{--------------------------------------------------------------}
{ Process a Read Statement }
procedure DoRead;
begin
   Match('(');
   GetName;
   ReadVar;
   while Look = ',' do begin
      Match(',');
      GetName;
      ReadVar;
   end;
   Match(')');
end;

{--------------------------------------------------------------}
{ Process a Write Statement }
procedure DoWrite;
begin
   Match('(');
   Expression;
   WriteVar;
   while Look = ',' do begin
      Match(',');
      Expression;
      WriteVar;
   end;
   Match(')');
end;
{--------------------------------------------------------------}
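
Both routines follow the same "first item, then zero or more comma-item pairs" pattern, which yields one library call per argument. The Python sketch below walks a READ argument list the same way and records what would be generated. The store is shown as a placeholder line rather than real 68000 code, and every name in the sketch is hypothetical.

```python
def do_read(text):
    """Translate '(A, B)' into one 'BSR READ' plus one store per name."""
    pos = 0
    out = []

    def skip():
        nonlocal pos
        while pos < len(text) and text[pos] == " ":
            pos += 1

    def look():
        skip()
        return text[pos] if pos < len(text) else ""

    def match(ch):
        nonlocal pos
        if look() != ch:
            raise ValueError(repr(ch) + " expected")
        pos += 1

    def get_name():
        nonlocal pos
        skip()
        start = pos
        while pos < len(text) and text[pos].isalnum():
            pos += 1
        if start == pos or not text[start].isalpha():
            raise ValueError("Name expected")
        return text[start:pos].upper()

    def read_var():
        out.append("BSR READ")
        out.append("<store D0 in " + get_name() + ">")

    match("(")
    read_var()
    while look() == ",":      # same loop shape as DoRead
        match(",")
        read_var()
    match(")")
    return out

generated = do_read("(a, b)")
```

A WRITE list differs only in that each item is a full expression evaluated into D0 before the call.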

Finally, we must expand procedure Block to handle the new statement types:
{--------------------------------------------------------------}
procedure Block;
begin
   Scan;
   while not(Token in ['e', 'l']) do begin
      case Token of
       'i': DoIf;
       'w': DoWhile;
       'R': DoRead;
       'W': DoWrite;
      else Assignment;
      end;
      Scan;
   end;
end;
{--------------------------------------------------------------}
That's all there is to it. _NOW_ we have a language!

CONCLUSION
At this point we have TINY completely defined. It's not much ... actually a toy compiler.
TINY has only one data type and no subroutines ... but it's a complete, usable language.
While you're not likely to be able to write another compiler in it, or do anything else very
seriously, you could write programs to read some input, perform calculations, and output
the results. Not too bad for a toy.
Most importantly, we have a firm base upon which to build further extensions. I know
you'll be glad to hear this: this is the last time I'll start over in building a parser ... from now
on I intend to just add features to TINY until it becomes KISS. Oh, there'll be other times
we will need to try things out with new copies of the Cradle, but once we've found out how
to do those things they'll be incorporated into TINY.
What will those features be? Well, for starters we need subroutines and functions. Then
we need to be able to handle different types, including arrays, strings, and other
structures. Then we need to deal with the idea of pointers. All this will be upcoming in future
installments.
See you then.
For reference purposes, the complete listing of TINY Version 1.0 is shown below:

{--------------------------------------------------------------}
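program Tiny10;
{--------------------------------------------------------------}
{ Constant Declarations }
const TAB = ^I;
      CR  = ^M;
      LF  = ^J;
      LCount: integer = 0;
      NEntry: integer = 0;
{--------------------------------------------------------------}
{ Type Declarations }
type Symbol = string[8];
     SymTab = array[1..1000] of Symbol;
     TabPtr = ^SymTab;

{--------------------------------------------------------------}
{ Variable Declarations }
var Look : char;             { Lookahead Character }
    Token: char;             { Encoded Token }
    Value: string[16];       { Unencoded Token }

const MaxEntry = 100;
var ST   : array[1..MaxEntry] of Symbol;
    SType: array[1..MaxEntry] of char;

{--------------------------------------------------------------}
{ Definition of Keywords and Token Types }
const NKW = 11;
      NKW1 = 12;
const KWlist: array[1..NKW] of Symbol =
              ('IF', 'ELSE', 'ENDIF', 'WHILE', 'ENDWHILE',
               'READ', 'WRITE', 'VAR', 'BEGIN', 'END',
               'PROGRAM');
const KWcode: string[NKW1] = 'xileweRWvbep';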

{--------------------------------------------------------------}
procedure GetChar;
begin
Read(Look);
end;
{--------------------------------------------------------------}
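{ Report an Error }
procedure Error(s: string);
begin
   WriteLn;
   WriteLn(^G, 'Error: ', s, '.');
end;
{--------------------------------------------------------------}
{ Report Error and Halt }
procedure Abort(s: string);
begin
   Error(s);
   Halt;
end;

{--------------------------------------------------------------}
{ Report What Was Expected }
procedure Expected(s: string);
begin
   Abort(s + ' Expected');
end;
{--------------------------------------------------------------}
{ Report an Undefined Identifier }
procedure Undefined(n: string);
begin
   Abort('Undefined Identifier ' + n);
end;
{--------------------------------------------------------------}
{ Recognize an Alpha Character }
function IsAlpha(c: char): boolean;
begin
   IsAlpha := UpCase(c) in ['A'..'Z'];
end;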

{--------------------------------------------------------------}
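{ Recognize a Decimal Digit }
function IsDigit(c: char): boolean;
begin
   IsDigit := c in ['0'..'9'];
end;
{--------------------------------------------------------------}
{ Recognize an AlphaNumeric Character }
function IsAlNum(c: char): boolean;
begin
   IsAlNum := IsAlpha(c) or IsDigit(c);
end;
{--------------------------------------------------------------}
{ Recognize an Addop }
function IsAddop(c: char): boolean;
begin
   IsAddop := c in ['+', '-'];
end;

{--------------------------------------------------------------}
{ Recognize a Mulop }
function IsMulop(c: char): boolean;
begin
   IsMulop := c in ['*', '/'];
end;
{--------------------------------------------------------------}
{ Recognize a Boolean Orop }
function IsOrop(c: char): boolean;
begin
   IsOrop := c in ['|', '~'];
end;
{--------------------------------------------------------------}
{ Recognize a Relop }
function IsRelop(c: char): boolean;
begin
   IsRelop := c in ['=', '#', '<', '>'];
end;

{--------------------------------------------------------------}
{ Recognize White Space }
function IsWhite(c: char): boolean;
begin
   IsWhite := c in [' ', TAB];
end;
{--------------------------------------------------------------}
{ Skip Over Leading White Space }
procedure SkipWhite;
begin
   while IsWhite(Look) do
      GetChar;
end;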

{--------------------------------------------------------------}
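{ Skip Over an End-of-Line }
procedure NewLine;
begin
   while Look = CR do begin
      GetChar;
      if Look = LF then GetChar;
      SkipWhite;
   end;
end;
{--------------------------------------------------------------}
{ Match a Specific Input Character }
procedure Match(x: char);
begin
   NewLine;
   if Look = x then GetChar
   else Expected('''' + x + '''');
   SkipWhite;
end;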

{--------------------------------------------------------------}
{ Table Lookup }
function Lookup(T: TabPtr; s: string; n: integer): integer;
var i: integer;
    found: Boolean;
begin
   found := false;
   i := n;
   while (i > 0) and not found do
      if s = T^[i] then
         found := true
      else
         dec(i);
   Lookup := i;
end;

{--------------------------------------------------------------}
{ Locate a Symbol in Table }
{ Returns the index of the entry. Zero if not present. }
function Locate(N: Symbol): integer;
begin
Locate := Lookup(@ST, n, MaxEntry);
end;
{--------------------------------------------------------------}
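{ Look for Symbol in Table }
function InTable(n: Symbol): Boolean;
begin
   InTable := Lookup(@ST, n, MaxEntry) <> 0;
end;

{--------------------------------------------------------------}
{ Add a New Entry to Symbol Table }
procedure AddEntry(N: Symbol; T: char);
begin
   if InTable(N) then Abort('Duplicate Identifier ' + N);
   if NEntry = MaxEntry then Abort('Symbol Table Full');
   Inc(NEntry);
   ST[NEntry] := N;
   SType[NEntry] := T;
end;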

{--------------------------------------------------------------}
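{ Get an Identifier }
procedure GetName;
begin
   NewLine;
   if not IsAlpha(Look) then Expected('Name');
   Value := '';
   while IsAlNum(Look) do begin
      Value := Value + UpCase(Look);
      GetChar;
   end;
   SkipWhite;
end;

{--------------------------------------------------------------}
{ Get a Number }
function GetNum: integer;
var Val: integer;
begin
   NewLine;
   if not IsDigit(Look) then Expected('Integer');
   Val := 0;
   while IsDigit(Look) do begin
      Val := 10 * Val + Ord(Look) - Ord('0');
      GetChar;
   end;
   GetNum := Val;
   SkipWhite;
end;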

{--------------------------------------------------------------}
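{ Get an Identifier and Scan it for Keywords }
procedure Scan;
begin
   GetName;
   Token := KWcode[Lookup(Addr(KWlist), Value, NKW) + 1];
end;
{--------------------------------------------------------------}
{ Match a Specific Input String }
procedure MatchString(x: string);
begin
   if Value <> x then Expected('''' + x + '''');
end;
{--------------------------------------------------------------}
{ Output a String with Tab }
procedure Emit(s: string);
begin
   Write(TAB, s);
end;

{--------------------------------------------------------------}
{ Output a String with Tab and CRLF }
procedure EmitLn(s: string);
begin
   Emit(s);
   WriteLn;
end;
{--------------------------------------------------------------}
{ Generate a Unique Label }
function NewLabel: string;
var S: string;
begin
   Str(LCount, S);
   NewLabel := 'L' + S;
   Inc(LCount);
end;

{--------------------------------------------------------------}
{ Post a Label To Output }
procedure PostLabel(L: string);
begin
   WriteLn(L, ':');
end;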
{---------------------------------------------------------------}
procedure Clear;
begin
EmitLn('CLR D0');
end;
{---------------------------------------------------------------}
procedure Negate;
begin
EmitLn('NEG D0');
end;

{---------------------------------------------------------------}
procedure NotIt;
begin
EmitLn('NOT D0');
end;
{---------------------------------------------------------------}
procedure LoadConst(n: integer);
begin
Emit('MOVE #');
WriteLn(n, ',D0');
end;
{---------------------------------------------------------------}
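{ Load a Variable to Primary Register }
procedure LoadVar(Name: string);
begin
   if not InTable(Name) then Undefined(Name);
   EmitLn('MOVE ' + Name + '(PC),D0');
end;

{---------------------------------------------------------------}
{ Push Primary onto Stack }
procedure Push;
begin
   EmitLn('MOVE D0,-(SP)');
end;
{---------------------------------------------------------------}
{ Add Top of Stack to Primary }
procedure PopAdd;
begin
   EmitLn('ADD (SP)+,D0');
end;
{---------------------------------------------------------------}
{ Subtract Primary from Top of Stack }
procedure PopSub;
begin
   EmitLn('SUB (SP)+,D0');
   EmitLn('NEG D0');
end;

{---------------------------------------------------------------}
{ Multiply Top of Stack by Primary }
procedure PopMul;
begin
   EmitLn('MULS (SP)+,D0');
end;
{---------------------------------------------------------------}
{ Divide Top of Stack by Primary }
procedure PopDiv;
begin
   EmitLn('MOVE (SP)+,D7');
   EmitLn('EXT.L D7');
   EmitLn('DIVS D0,D7');
   EmitLn('MOVE D7,D0');
end;

{---------------------------------------------------------------}
{ AND Top of Stack with Primary }
procedure PopAnd;
begin
   EmitLn('AND (SP)+,D0');
end;
{---------------------------------------------------------------}
{ OR Top of Stack with Primary }
procedure PopOr;
begin
   EmitLn('OR (SP)+,D0');
end;
{---------------------------------------------------------------}
{ XOR Top of Stack with Primary }
procedure PopXor;
begin
   EmitLn('EOR (SP)+,D0');
end;

{---------------------------------------------------------------}
{ Compare Top of Stack with Primary }
procedure PopCompare;
begin
   EmitLn('CMP (SP)+,D0');
end;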
{---------------------------------------------------------------}
procedure SetEqual;
begin
EmitLn('SEQ D0');
EmitLn('EXT D0');
end;
{---------------------------------------------------------------}
{ Set D0 If Compare was != }
procedure SetNEqual;
begin
   EmitLn('SNE D0');
   EmitLn('EXT D0');
end;

{---------------------------------------------------------------}
{ Set D0 If Compare was > }
procedure SetGreater;
begin
   EmitLn('SLT D0');
   EmitLn('EXT D0');
end;
{---------------------------------------------------------------}
{ Set D0 If Compare was < }
procedure SetLess;
begin
   EmitLn('SGT D0');
   EmitLn('EXT D0');
end;
{---------------------------------------------------------------}
{ Set D0 If Compare was <= }
procedure SetLessOrEqual;
begin
   EmitLn('SGE D0');
   EmitLn('EXT D0');
end;

{---------------------------------------------------------------}
{ Set D0 If Compare was >= }
procedure SetGreaterOrEqual;
begin
   EmitLn('SLE D0');
   EmitLn('EXT D0');
end;
{---------------------------------------------------------------}
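{ Store Primary to Variable }
procedure Store(Name: string);
begin
   if not InTable(Name) then Undefined(Name);
   EmitLn('LEA ' + Name + '(PC),A0');
   EmitLn('MOVE D0,(A0)')
end;
{---------------------------------------------------------------}
{ Branch Unconditional }
procedure Branch(L: string);
begin
   EmitLn('BRA ' + L);
end;

{---------------------------------------------------------------}
{ Branch False }
procedure BranchFalse(L: string);
begin
   EmitLn('TST D0');
   EmitLn('BEQ ' + L);
end;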
{---------------------------------------------------------------}
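{ Read Variable to Primary Register }
procedure ReadVar;
begin
   EmitLn('BSR READ');
   Store(Value);
end;
{---------------------------------------------------------------}
{ Write Variable from Primary Register }
procedure WriteVar;
begin
   EmitLn('BSR WRITE');
end;

{--------------------------------------------------------------}
{ Write Header Info }
procedure Header;
begin
   WriteLn('WARMST', TAB, 'EQU $A01E');
   EmitLn('LIB TINYLIB');
end;
{--------------------------------------------------------------}
{ Write the Prolog }
procedure Prolog;
begin
   PostLabel('MAIN');
end;
{--------------------------------------------------------------}
{ Write the Epilog }
procedure Epilog;
begin
   EmitLn('DC WARMST');
   EmitLn('END MAIN');
end;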

{---------------------------------------------------------------}
{ Parse and Translate a Math Factor }
procedure BoolExpression; Forward;
procedure Factor;
begin
   if Look = '(' then begin
      Match('(');
      BoolExpression;
      Match(')');
      end
   else if IsAlpha(Look) then begin
      GetName;
      LoadVar(Value);
      end
   else
      LoadConst(GetNum);
end;

{--------------------------------------------------------------}
{ Parse and Translate a Negative Factor }
procedure NegFactor;
begin
   Match('-');
   if IsDigit(Look) then
      LoadConst(-GetNum)
   else begin
      Factor;
      Negate;
   end;
end;

{--------------------------------------------------------------}
{ Parse and Translate a Leading Factor }
procedure FirstFactor;
begin
case Look of
'+': begin
Match('+');
Factor;
end;
'-': NegFactor;
else Factor;
end;
end;

{--------------------------------------------------------------}
procedure Multiply;
begin
Match('*');
Factor;
PopMul;
end;
{-------------------------------------------------------------}
procedure Divide;
begin
Match('/');
Factor;
PopDiv;
end;

{---------------------------------------------------------------}
{ Common Code Used by Term and FirstTerm }
procedure Term1;
begin
while IsMulop(Look) do begin
Push;
case Look of
'*': Multiply;
'/': Divide;
end;
end;
end;
{---------------------------------------------------------------}
procedure Term;
begin
Factor;
Term1;
end;

{---------------------------------------------------------------}
{ Parse and Translate a Leading Term }
procedure FirstTerm;
begin
FirstFactor;
Term1;
end;
{--------------------------------------------------------------}
procedure Add;
begin
Match('+');
Term;
PopAdd;
end;

{-------------------------------------------------------------}
procedure Subtract;
begin
Match('-');
Term;
PopSub;
end;
{---------------------------------------------------------------}
{ Parse and Translate an Expression }
procedure Expression;
begin
FirstTerm;
while IsAddop(Look) do begin
Push;
case Look of
'+': Add;
'-': Subtract;
end;
end;
end;

{---------------------------------------------------------------}
procedure Equal;
begin
Match('=');
Expression;
PopCompare;
SetEqual;
end;
{---------------------------------------------------------------}
{ Recognize and Translate a Relational "Less Than or Equal" }
procedure LessOrEqual;
begin
Match('=');
Expression;
PopCompare;
SetLessOrEqual;
end;

{---------------------------------------------------------------}
procedure NotEqual;
begin
Match('>');
Expression;
PopCompare;
SetNEqual;
end;

{---------------------------------------------------------------}
procedure Less;
begin
Match('<');
case Look of
'=': LessOrEqual;
'>': NotEqual;
else begin
Expression;
PopCompare;
SetLess;
end;
end;
end;

{---------------------------------------------------------------}
procedure Greater;
begin
Match('>');
if Look = '=' then begin
Match('=');
Expression;
PopCompare;
SetGreaterOrEqual;
end
else begin
Expression;
PopCompare;
SetGreater;
end;
end;

{---------------------------------------------------------------}
procedure Relation;
begin
Expression;
if IsRelop(Look) then begin
Push;
case Look of
'=': Equal;
'<': Less;
'>': Greater;
end;
end;
end;

{---------------------------------------------------------------}
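{ Parse and Translate a Boolean Factor with Leading NOT }
procedure NotFactor;
begin
if Look = '!' then begin
Match('!');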
Relation;
NotIt;
end
else
Relation;
end;

{---------------------------------------------------------------}
procedure BoolTerm;
begin
NotFactor;
while Look = '&' do begin
Push;
Match('&');
NotFactor;
PopAnd;
end;
end;
{--------------------------------------------------------------}
procedure BoolOr;
begin
Match('|');
BoolTerm;
PopOr;
end;

{--------------------------------------------------------------}
procedure BoolXor;
begin
Match('~');
BoolTerm;
PopXor;
end;
{---------------------------------------------------------------}
{ Parse and Translate a Boolean Expression }
procedure BoolExpression;
begin
BoolTerm;
while IsOrop(Look) do begin
Push;
case Look of
'|': BoolOr;
'~': BoolXor;
end;
end;
end;

{--------------------------------------------------------------}
{ Parse and Translate an Assignment Statement }
procedure Assignment;
var Name: string;
begin
Name := Value;
Match('=');
BoolExpression;
Store(Name);
end;

{---------------------------------------------------------------}
procedure DoIf;
var L1, L2: string;
begin
BoolExpression;
L1 := NewLabel;
L2 := L1;
BranchFalse(L1);
Block;
if Token = 'l' then begin
L2 := NewLabel;
Branch(L2);
PostLabel(L1);
Block;
end;
PostLabel(L2);
MatchString('ENDIF');
end;

{--------------------------------------------------------------}
procedure DoWhile;
var L1, L2: string;
begin
L1 := NewLabel;
L2 := NewLabel;
PostLabel(L1);
BoolExpression;
BranchFalse(L2);
Block;
MatchString('ENDWHILE');
Branch(L1);
PostLabel(L2);
end;

{--------------------------------------------------------------}
procedure DoRead;
begin
Match('(');
GetName;
ReadVar;
while Look = ',' do begin
Match(',');
GetName;
ReadVar;
end;
Match(')');
end;

{--------------------------------------------------------------}
procedure DoWrite;
begin
Match('(');
Expression;
WriteVar;
while Look = ',' do begin
Match(',');
Expression;
WriteVar;
end;
Match(')');
end;

{--------------------------------------------------------------}
procedure Block;
begin
Scan;
while not(Token in ['e', 'l']) do begin
case Token of
'i': DoIf;
'w': DoWhile;
'R': DoRead;
'W': DoWrite;
else Assignment;
end;
Scan;
end;
end;

{--------------------------------------------------------------}
procedure Alloc(N: Symbol);
begin
AddEntry(N, 'v');
Write(N, ':', TAB, 'DC ');
if Look = '=' then begin
Match('=');
If Look = '-' then begin
Write(Look);
Match('-');
end;
WriteLn(GetNum);
end
else
WriteLn('0');
end;

{--------------------------------------------------------------}
procedure Decl;
begin
GetName;
Alloc(Value);
while Look = ',' do begin
Match(',');
GetName;
Alloc(Value);
end;
end;

{--------------------------------------------------------------}
procedure TopDecls;
begin
Scan;
while Token <> 'b' do begin
case Token of
'v': Decl;
else Abort('Unrecognized Keyword ' + Value);
end;
Scan;
end;
end;

{--------------------------------------------------------------}
procedure Main;
begin
Prolog;
Block;
MatchString('END');
Epilog;
end;

{--------------------------------------------------------------}
procedure Prog;
begin
Header;
TopDecls;
Main;
Match('.');
end;
{--------------------------------------------------------------}
{ Initialize }
procedure Init;
var i: integer;
begin
for i := 1 to MaxEntry do begin
ST[i] := '';
SType[i] := ' ';
end;
GetChar;
Scan;
end;

{--------------------------------------------------------------}
{ Main Program }
begin
Init;
Prog;
if Look <> CR then Abort('Unexpected data after ''.''');
end.
{--------------------------------------------------------------}

Part 11 - Lexical Scan Revisited
INTRODUCTION
I've got some good news and some bad news. The bad news is that this installment is not
the one I promised last time. What's more, the one after this one won't be, either. The
good news is the reason for this installment: I've found a way to simplify and improve the
lexical scanning part of the compiler. Let me explain.

BACKGROUND
If you'll remember, we talked at length about the subject of lexical scanners in Part VII, and I
left you with a design for a distributed scanner that I felt was about as simple as I could make
it ... more so than most that I've seen elsewhere. We used that idea in Part X. The compiler
structure that resulted was simple, and it got the job done.
Recently, though, I've begun to have problems, and they're the kind that send a message that
you might be doing something wrong.
The whole thing came to a head when I tried to address the issue of semicolons. Several
people have asked me about them, and whether or not KISS will have them separating the
statements. My intention has been NOT to use semicolons, simply because I don't like them
and, as you can see, they have not proved necessary.
But I know that many of you, like me, have gotten used to them, and so I set out to write a
short installment to show you how they could easily be added, if you were so inclined.
Well, it turned out that they weren't easy to add at all. In fact it was darned difficult.
I guess I should have realized that something was wrong, because of the issue of newlines.
In the last couple of installments we've addressed that issue, and I've shown you how to deal
with newlines with a procedure called, appropriately enough, NewLine. In TINY Version 1.0, I
sprinkled calls to this procedure in strategic spots in the code.
It seems that every time I've addressed the issue of newlines, though, I've found it to be
tricky, and the resulting parser turned out to be quite fragile ... one addition or deletion here or
there and things tended to go to pot. Looking back on it, I realize that there was a message in
this that I just wasn't paying attention to.
When I tried to add semicolons on top of the newlines, that was the last straw. I ended up with
much too complex a solution. I began to realize that something fundamental had to change.
So, in a way this installment will cause us to backtrack a bit and revisit the issue of scanning
all over again. Sorry about that. That's the price you pay for watching me do this in real time.
But the new version is definitely an improvement, and will serve us well for what is to come.

As I said, the scanner we used in Part X was about as simple as one can get. But anything
can be improved. The new scanner is more like the classical scanner, and not as
simple as before. But the overall compiler structure is even simpler than before. It's also
more robust, and easier to add to and/or modify. I think that's worth the time spent in this
digression. So in this installment, I'll be showing you the new structure. No doubt you'll be
happy to know that, while the changes affect many procedures, they aren't very profound
and so we lose very little of what's been done so far.
Ironically, the new scanner is much more conventional than the old one, and is very much
like the more generic scanner I showed you earlier in Part VII. Then I started trying to get
clever, and I almost clevered myself clean out of business. You'd think one day I'd learn:
K-I-S-S!

THE PROBLEM
The problem begins to show itself in procedure Block, which I've reproduced below:
{--------------------------------------------------------------}
procedure Block;
begin
Scan;
while not(Token in ['e', 'l']) do begin
case Token of
'i': DoIf;
'w': DoWhile;
'R': DoRead;
'W': DoWrite;
else Assignment;
end;
Scan;
end;
end;
{--------------------------------------------------------------}

As you can see, Block is oriented to individual program statements. At each pass through
the loop, we know that we are at the beginning of a statement. We exit the block when we
have scanned an END or an ELSE.
But suppose that we see a semicolon instead. The procedure as it's shown above can't
handle that, because procedure Scan only expects and can only accept tokens that begin
with a letter.
I tinkered around for quite a while to come up with a fix. I found many possible
approaches, but none were very satisfying. I finally figured out the reason.
Recall that when we started with our single-character parsers, we adopted a convention
that the lookahead character would always be prefetched. That is, we would have the
character that corresponds to our current position in the input stream fetched into the global
character Look, so that we could examine it as many times as needed. The rule we
adopted was that EVERY recognizer, if it found its target token, would advance Look to
the next character in the input stream.
That simple and fixed convention served us very well when we had single-character
tokens, and it still does. It would make a lot of sense to apply the same rule to
multi-character tokens.
But when we got into lexical scanning, I began to violate that simple rule. The scanner of
Part X did indeed advance to the next token if it found an identifier or keyword, but it
DIDN'T do that if it found a carriage return, a whitespace character, or an operator.
Now, that sort of mixed-mode operation gets us into deep trouble in procedure Block,
because whether or not the input stream has been advanced depends upon the kind of
token we encounter. If it's a keyword or the target of an assignment statement, the "cursor,"
as defined by the contents of Look, has been advanced to the next token OR to the
beginning of whitespace. If, on the other hand, the token is a semicolon, or if we have hit
a carriage return, the cursor has NOT advanced.
Needless to say, we can add enough logic to keep us on track. But it's tricky, and makes
the whole parser very fragile.

There's a much better way, and that's just to adopt that same rule that's worked so well
before, to apply to TOKENS as well as single characters. In other words, we'll prefetch tokens
just as we've always done for characters. It seems so obvious once you think about it that
way.
Interestingly enough, if we do things this way the problem that we've had with newline
characters goes away. We can just lump them in as whitespace characters, which means that the
handling of newlines becomes very trivial, and MUCH less prone to error than we've had to
deal with in the past.

THE SOLUTION
Let's begin to fix the problem by re-introducing the two procedures:
{--------------------------------------------------------------}
procedure GetName;
begin
SkipWhite;
if Not IsAlpha(Look) then Expected('Identifier');
Token := 'x';
Value := '';
repeat
Value := Value + UpCase(Look);
GetChar;
until not IsAlNum(Look);
end;

{--------------------------------------------------------------}
{ Get a Number }
procedure GetNum;
begin
SkipWhite;
if not IsDigit(Look) then Expected('Number');
Token := '#';
Value := '';
repeat
Value := Value + Look;
GetChar;
until not IsDigit(Look);
end;
{--------------------------------------------------------------}
These two procedures are functionally almost identical to the ones I showed you in Part VII.
They each fetch the current token, either an identifier or a number, into the global string
Value. They also set the encoded version, Token, to the appropriate code. The input stream
is left with Look containing the first character NOT part of the token.

We can do the same thing for operators, even multi-character operators, with a procedure
such as:
{--------------------------------------------------------------}
{ Get an Operator }
procedure GetOp;
begin
Token := Look;
Value := '';
repeat
Value := Value + Look;
GetChar;
until IsAlpha(Look) or IsDigit(Look) or IsWhite(Look);
end;
{--------------------------------------------------------------}
Note that GetOp returns, as its encoded token, the FIRST character of the operator. This
is important, because it means that we can now use that single character to drive the
parser, instead of the lookahead character.

We need to tie these procedures together into a single procedure that can handle all three
cases. The following procedure will read any one of the token types and always leave the
input stream advanced beyond it:
{--------------------------------------------------------------}
{ Get the Next Input Token }
procedure Next;
begin
SkipWhite;
if IsAlpha(Look) then GetName
else if IsDigit(Look) then GetNum
else GetOp;
end;
{--------------------------------------------------------------}
***NOTE that here I have put SkipWhite BEFORE the calls rather than after. This means that,
in general, the variable Look will NOT have a meaningful value in it, and therefore we should
NOT use it as a test value for parsing, as we have been doing so far. That's the big departure
from our normal approach.
Now, remember that before I was careful not to treat the carriage return (CR) and line feed
(LF) characters as white space. This was because, with SkipWhite called as the last thing in
the scanner, the encounter with LF would trigger a read statement. If we were on the last line
of the program, we couldn't get out until we input another line with a non-white character.
That's why I needed the second procedure, NewLine, to handle the CRLF's.
But now, with the call to SkipWhite coming first, that's exactly the behavior we want. The
compiler must know there's another token coming or it wouldn't be calling Next. In other
words, it hasn't found the terminating END yet. So we're going to insist on more data until we
find something.

All this means that we can greatly simplify both the program and the concepts, by treating
CR and LF as whitespace characters, and eliminating NewLine. You can do that simply
by modifying the function IsWhite:
{--------------------------------------------------------------}
{ Recognize White Space }
function IsWhite(c: char): boolean;
begin
IsWhite := c in [' ', TAB, CR, LF];
end;
{--------------------------------------------------------------}
We've already tried similar routines in Part VII, but you might as well try these new ones
out. Add them to a copy of the Cradle and call Next with the following main program:
{--------------------------------------------------------------}
{ Main Program }
begin
Init;
repeat
Next;
WriteLn(Token, ' ', Value);
until Token = '.';
end.
{--------------------------------------------------------------}

Compile it and verify that you can separate a program into a series of tokens, and that you
get the right encoding for each token.
This ALMOST works, but not quite. There are two potential problems: First, in KISS/TINY
almost all of our operators are single-character operators. The only exceptions are the relops
>=, <=, and <>. It seems a shame to treat all operators as strings and do a string compare,
when only a single character compare will almost always suffice. Second, and much more
important, the thing doesn't WORK when two operators appear together, as in (a+b)*(c+d).
Here the string following 'b' would be interpreted as a single operator ")*(."
It's possible to fix that problem. For example, we could just give GetOp a list of legal
characters, and we could treat the parentheses as different operator types than the others. But this
begins to get messy.
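
To see the second problem concretely, here is a minimal sketch of the greedy GetOp at work. It is not part of the compiler: the input comes from a fixed string (the names Source and Cursor are made up for this illustration) instead of Read(Look), the error checks are left out, and GetNum is omitted because the sample contains no digits.

```pascal
program GreedyOp;
{ Sketch only: the input is a fixed string rather than Read(Look). }

const Source: string = '(a+b)*(c+d) .';

var Look  : char;     { lookahead character }
    Token : char;     { encoded token }
    Value : string;   { text of the current token }
    Cursor: integer;  { hypothetical read position in Source }

procedure GetChar;
begin
  if Cursor <= Length(Source) then begin
    Look := Source[Cursor];
    Inc(Cursor);
  end
  else
    Look := ' ';      { assumption: blanks follow the end of input }
end;

function IsAlpha(c: char): boolean;
begin
  IsAlpha := UpCase(c) in ['A'..'Z'];
end;

function IsDigit(c: char): boolean;
begin
  IsDigit := c in ['0'..'9'];
end;

function IsWhite(c: char): boolean;
begin
  IsWhite := c in [' ', #9, #13, #10];
end;

procedure SkipWhite;
begin
  while IsWhite(Look) do GetChar;
end;

procedure GetName;
begin
  Token := 'x';
  Value := '';
  repeat
    Value := Value + UpCase(Look);
    GetChar;
  until not (IsAlpha(Look) or IsDigit(Look));
end;

{ The greedy version: runs on until a letter, digit or blank }
procedure GetOp;
begin
  Token := Look;
  Value := '';
  repeat
    Value := Value + Look;
    GetChar;
  until IsAlpha(Look) or IsDigit(Look) or IsWhite(Look);
end;

procedure Next;
begin
  SkipWhite;
  if IsAlpha(Look) then GetName
  else GetOp;
end;

begin
  Cursor := 1;
  GetChar;
  repeat
    Next;
    WriteLn(Token, ' ', Value);
  until Token = '.';
end.
```

The fifth line of output is `) )*(`: the closing parenthesis, the multiply sign and the opening parenthesis have been swallowed as one "operator." The blank before the final period in Source is deliberate; without it the period would be glued onto the last parenthesis in the same way.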
Fortunately, there's a better way that solves all the problems. Since almost all the operators
are single characters, let's just treat them that way, and let GetOp get only one character at a
time. This not only simplifies GetOp, but also speeds things up quite a bit. We still have the
problem of the relops, but we were treating them as special cases anyway.
So here's the final version of GetOp:
{--------------------------------------------------------------}
{ Get an Operator }
procedure GetOp;
begin
SkipWhite;
Token := Look;
Value := Look;
GetChar;
end;
{--------------------------------------------------------------}

Note that I still give the string Value a value. If you're truly concerned about efficiency, you
could leave this out. When we're expecting an operator, we will only be testing Token anyhow,
so the value of the string won't matter. But to me it seems to be good practice to give
the thing a value just in case.
Try this new version with some realistic-looking code. You should be able to separate any
program into its individual tokens, with the caveat that the two-character relops will scan
into two separate tokens. That's OK ... we'll parse them that way.
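
If you'd rather not type into the Cradle, the sketch below runs the same experiment on a fixed string. Treat it as an illustration under a few assumptions, not as the compiler's code: Source and Cursor are hypothetical names, GetChar hands back a period once the string runs out, and the Expected checks are dropped.

```pascal
program NextDemo;
{ Sketch only: scans a fixed string instead of standard input. }

const TAB = #9;
      CR  = #13;
      LF  = #10;
      Source: string = 'if a<=10'#13#10'b=(a+1)*2 .';

var Look  : char;     { lookahead character }
    Token : char;     { encoded token }
    Value : string;   { text of the current token }
    Cursor: integer;  { hypothetical read position in Source }

procedure GetChar;
begin
  if Cursor <= Length(Source) then begin
    Look := Source[Cursor];
    Inc(Cursor);
  end
  else
    Look := '.';      { assumption: end of input reads as the terminator }
end;

function IsAlpha(c: char): boolean;
begin
  IsAlpha := UpCase(c) in ['A'..'Z'];
end;

function IsDigit(c: char): boolean;
begin
  IsDigit := c in ['0'..'9'];
end;

function IsAlNum(c: char): boolean;
begin
  IsAlNum := IsAlpha(c) or IsDigit(c);
end;

{ CR and LF now count as white space }
function IsWhite(c: char): boolean;
begin
  IsWhite := c in [' ', TAB, CR, LF];
end;

procedure SkipWhite;
begin
  while IsWhite(Look) do GetChar;
end;

procedure GetName;
begin
  SkipWhite;
  Token := 'x';
  Value := '';
  repeat
    Value := Value + UpCase(Look);
    GetChar;
  until not IsAlNum(Look);
end;

procedure GetNum;
begin
  SkipWhite;
  Token := '#';
  Value := '';
  repeat
    Value := Value + Look;
    GetChar;
  until not IsDigit(Look);
end;

{ One character per operator }
procedure GetOp;
begin
  SkipWhite;
  Token := Look;
  Value := Look;
  GetChar;
end;

procedure Next;
begin
  SkipWhite;
  if IsAlpha(Look) then GetName
  else if IsDigit(Look) then GetNum
  else GetOp;
end;

begin
  Cursor := 1;
  GetChar;
  repeat
    Next;
    WriteLn(Token, ' ', Value);
  until Token = '.';
end.
```

The line break inside Source disappears without any help from NewLine, the parentheses come out as separate tokens, and the relop in a<=10 arrives as a '<' token followed by an '=' token.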
Now, in Part VII the function of Next was combined with procedure Scan, which also
checked every identifier against a list of keywords and encoded each one that was found.
As I mentioned at the time, the last thing we would want to do is to use such a procedure
in places where keywords should not appear, such as in expressions. If we did that, the
keyword list would be scanned for every identifier appearing in the code. Not good.
The right way to deal with that is to simply separate the functions of fetching tokens and
looking for keywords. The version of Scan shown below does NOTHING but check for
keywords. Notice that it operates on the current token and does NOT advance the input
stream.
{--------------------------------------------------------------}
{ Scan the Current Identifier for Keywords }
procedure Scan;
begin
if Token = 'x' then
Token := KWcode[Lookup(Addr(KWlist), Value, NKW) + 1];
end;
{--------------------------------------------------------------}
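
Here is a small stand-alone sketch of that keyword check. The keyword table and the code string follow the TINY declarations, but the sizes in the type section and the helper Show are assumptions made only so the example can run on its own.

```pascal
program ScanDemo;
{ Sketch only: exercises Scan on a few hand-fed identifiers. }

type Symbol = string[8];
     SymTab = array[1..1000] of Symbol;
     TabPtr = ^SymTab;

const NKW  = 9;
      NKW1 = 10;
      KWlist: array[1..NKW] of Symbol =
        ('IF', 'ELSE', 'ENDIF', 'WHILE', 'ENDWHILE',
         'READ', 'WRITE', 'VAR', 'END');
      KWcode: string[NKW1] = 'xileweRWve';

var Token: char;        { encoded token }
    Value: string[16];  { text of the current token }

{ Search entries n down to 1; returns 0 if s is not found }
function Lookup(T: TabPtr; s: string; n: integer): integer;
var i: integer;
    found: boolean;
begin
  found := false;
  i := n;
  while (i > 0) and not found do
    if s = T^[i] then
      found := true
    else
      Dec(i);
  Lookup := i;
end;

{ Re-encode the current identifier if it is a keyword }
procedure Scan;
begin
  if Token = 'x' then
    Token := KWcode[Lookup(Addr(KWlist), Value, NKW) + 1];
end;

{ Hypothetical helper: pretend Next has just fetched Name }
procedure Show(Name: string);
begin
  Token := 'x';
  Value := Name;
  Scan;
  WriteLn(Value, ' -> ', Token);
end;

begin
  Show('IF');
  Show('ELSE');
  Show('ENDWHILE');
  Show('WRITE');
  Show('COUNT');
end.
```

A miss makes Lookup return zero, and the "+ 1" then selects the leading 'x' of KWcode, so an ordinary identifier keeps its code. Note that Scan never touches the input stream.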

There is one last detail. In the compiler there are a few places that we must actually check the
string value of the token. Mainly, this is done to distinguish between the different END's, but
there are a couple of other places. (I should note in passing that we could always eliminate
the need for matching END characters by encoding each one to a different character. Right
now we are definitely taking the lazy man's route.)
The following version of MatchString takes the place of the character-oriented Match. Note
that, like Match, it DOES advance the input stream.
{--------------------------------------------------------------}
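{ Match a Specific Input String }
procedure MatchString(x: string);
begin
if Value <> x then Expected('''' + x + '''');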
Next;
end;
{--------------------------------------------------------------}

FIXING UP THE COMPILER

Armed with these new scanner procedures, we can now begin to fix the compiler to use
them properly. The changes are all quite minor, but there are quite a few places where
changes are necessary. Rather than showing you each place, I will give you the general
idea and then just give the finished product.
First of all, the code for procedure Block doesn't change, though its function does:
{--------------------------------------------------------------}
procedure Block;
begin
Scan;
while not(Token in ['e', 'l']) do begin
case Token of
'i': DoIf;
'w': DoWhile;
'R': DoRead;
'W': DoWrite;
else Assignment;
end;
Scan;
end;
end;
{--------------------------------------------------------------}

Remember that the new version of Scan doesn't advance the input stream, it only scans for
keywords. The input stream must be advanced by each procedure that Block calls.
In general, we have to replace every test on Look with a similar test on Token. For example:
{---------------------------------------------------------------}
procedure BoolExpression;
begin
BoolTerm;
while IsOrOp(Token) do begin
Push;
case Token of
'|': BoolOr;
'~': BoolXor;
end;
end;
end;
{--------------------------------------------------------------}

In procedures like Add, we don't have to use Match anymore. We need only call Next to
advance the input stream:
{--------------------------------------------------------------}
procedure Add;
begin
Next;
Term;
PopAdd;
end;
{-------------------------------------------------------------}

Control structures are actually simpler. We just call Next to advance over the control
keywords:
{---------------------------------------------------------------}
procedure DoIf;
var L1, L2: string;
begin
Next;
BoolExpression;
L1 := NewLabel;
L2 := L1;
BranchFalse(L1);
Block;
if Token = 'l' then begin
Next;
L2 := NewLabel;
Branch(L2);
PostLabel(L1);
Block;
end;
PostLabel(L2);
MatchString('ENDIF');
end;
{--------------------------------------------------------------}
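
To make the label juggling in DoIf easier to follow, here is a sketch that prints only the branch skeleton. It is an illustration, not compiler code: IfSkeleton is a hypothetical stand-in for DoIf, its HasElse parameter plays the role of the test for the ELSE token, the condition and the blocks are replaced by placeholder lines, and the 'L' prefix in NewLabel is assumed.

```pascal
program IfLabels;
{ Sketch only: shows which labels an IF with and without ELSE produces. }

const TAB = #9;
      LCount: integer = 0;

{ Generate a unique label (the 'L' prefix is an assumption) }
function NewLabel: string;
var S: string;
begin
  Str(LCount, S);
  NewLabel := 'L' + S;
  Inc(LCount);
end;

procedure PostLabel(L: string);
begin
  WriteLn(L, ':');
end;

procedure EmitLn(s: string);
begin
  WriteLn(TAB, s);
end;

procedure Branch(L: string);
begin
  EmitLn('BRA ' + L);
end;

procedure BranchFalse(L: string);
begin
  EmitLn('TST D0');
  EmitLn('BEQ ' + L);
end;

{ Hypothetical stand-in for DoIf }
procedure IfSkeleton(HasElse: boolean);
var L1, L2: string;
begin
  EmitLn('<condition>');
  L1 := NewLabel;
  L2 := L1;                 { no ELSE yet: both names mean the same label }
  BranchFalse(L1);
  EmitLn('<then block>');
  if HasElse then begin
    L2 := NewLabel;         { the ELSE needs a second label }
    Branch(L2);
    PostLabel(L1);
    EmitLn('<else block>');
  end;
  PostLabel(L2);
end;

begin
  IfSkeleton(false);
  IfSkeleton(true);
end.
```

Without an ELSE only L0 is used, and it is posted right after the block. With an ELSE the false branch goes to L1, the end of the first block jumps to L2, and L2 is posted last. Setting L2 := L1 up front is what lets the final PostLabel(L2) serve both cases.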

That's about the extent of the REQUIRED changes. In the listing of TINY Version 1.1
below, I've also made a number of other "improvements" that aren't really required. Let
me explain them briefly:
(1) I've deleted the two procedures Prog and Main, and combined
their functions into the main program. They didn't seem to
add to program clarity ... in fact they seemed to just
muddy things up a little.
(2) I've deleted the keywords PROGRAM and BEGIN from the
keyword list. Each one only occurs in one place, so it's
not necessary to search for it.
(3) Having been bitten by an overdose of cleverness, I've
reminded myself that TINY is supposed to be a minimalist
program. Therefore I've replaced the fancy handling of
unary minus with the dumbest one I could think of. A giant
step backwards in code quality, but a great simplification
of the compiler. KISS is the right place to use the other
version.
(4) I've added some error-checking routines such as CheckTable
and CheckDup, and replaced in-line code by calls to them.
This cleans up a number of routines.
(5) I've taken the error checking out of code generation
routines like Store, and put it in the parser where it
belongs. See Assignment, for example.
(6) There was an error in InTable and Locate that caused them
to search all locations instead of only those with valid
data in them. They now search only valid cells. This
allows us to eliminate the initialization of the symbol
table, which was done in Init.
(7) Procedure AddEntry now has two arguments, which helps to
make things a bit more modular.
(8) I've cleaned up the code for the relational operators by
the addition of the new procedures CompareExpression and
NextExpression.
(9) I fixed an error in the Read routine ... the earlier version
did not check for a valid variable name.
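
Item (6) is easier to appreciate with the table code in front of you. The sketch below is an illustration only: the type sizes and MaxEntry are assumed values, and the duplicate check is written in-line instead of going through CheckDup and Abort, so that the example stands on its own.

```pascal
program SymDemo;
{ Sketch only: a symbol table that is searched up to NEntry, not MaxEntry. }

type Symbol = string[8];
     SymTab = array[1..1000] of Symbol;
     TabPtr = ^SymTab;

const MaxEntry = 100;          { assumed table size }
      NEntry: integer = 0;     { number of valid entries so far }

var ST   : array[1..MaxEntry] of Symbol;
    SType: array[1..MaxEntry] of char;

{ Search entries n down to 1; returns 0 if s is not found }
function Lookup(T: TabPtr; s: string; n: integer): integer;
var i: integer;
    found: boolean;
begin
  found := false;
  i := n;
  while (i > 0) and not found do
    if s = T^[i] then
      found := true
    else
      Dec(i);
  Lookup := i;
end;

function Locate(N: Symbol): integer;
begin
  Locate := Lookup(Addr(ST), N, NEntry);
end;

function InTable(N: Symbol): boolean;
begin
  InTable := Lookup(Addr(ST), N, NEntry) <> 0;
end;

procedure AddEntry(N: Symbol; T: char);
begin
  if InTable(N) then begin
    WriteLn('Duplicate Identifier ', N);
    Halt;
  end;
  Inc(NEntry);
  ST[NEntry] := N;
  SType[NEntry] := T;
end;

begin
  AddEntry('X', 'v');
  AddEntry('COUNT', 'v');
  WriteLn(InTable('X'));
  WriteLn(InTable('Y'));
  WriteLn(Locate('COUNT'));
end.
```

Because every search starts at NEntry and works down, cells above NEntry are never examined, and it no longer matters what they contain. That is why Init does not have to clear the table.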

CONCLUSION
The resulting compiler for TINY is given below. Other than the removal of the keyword
PROGRAM, it parses the same language as before. It's just a bit cleaner, and more
importantly it's considerably more robust. I feel good about it.
The next installment will be another digression: the discussion of semicolons and such
that got me into this mess in the first place. THEN we'll press on into procedures and
types. Hang in there with me. The addition of those features will go a long way towards
removing KISS from the "toy language" category. We're getting very close to being able to
write a serious compiler.

TINY VERSION 1.1

{--------------------------------------------------------------}
program Tiny11;
{--------------------------------------------------------------}
const TAB = ^I;
CR = ^M;
LF = ^J;
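LCount: integer = 0;
NEntry: integer = 0;
{--------------------------------------------------------------}
{ Type Declarations }
type Symbol = string[8];
SymTab = array[1..1000] of Symbol;
TabPtr = ^SymTab;

{--------------------------------------------------------------}
{ Variable Declarations }
var Look : char; { Lookahead Character }
Token: char; { Encoded Token }
Value: string[16]; { Unencoded Token }
const MaxEntry = 100;
var ST : array[1..MaxEntry] of Symbol;
SType: array[1..MaxEntry] of char;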
{--------------------------------------------------------------}
const NKW = 9;
NKW1 = 10;
const KWlist: array[1..NKW] of Symbol =
('IF', 'ELSE', 'ENDIF', 'WHILE', 'ENDWHILE',
'READ', 'WRITE', 'VAR', 'END');
const KWcode: string[NKW1] = 'xileweRWve';

{--------------------------------------------------------------}
procedure GetChar;
begin
Read(Look);
end;
{--------------------------------------------------------------}
{ Report an Error }
procedure Error(s: string);
begin
WriteLn;
WriteLn(^G, 'Error: ', s, '.');
end;
{--------------------------------------------------------------}
{ Report Error and Halt }
procedure Abort(s: string);
begin
Error(s);
Halt;
end;

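{--------------------------------------------------------------}
{ Report What Was Expected }
procedure Expected(s: string);
begin
Abort(s + ' Expected');
end;
{--------------------------------------------------------------}
{ Report an Undefined Identifier }
procedure Undefined(n: string);
begin
Abort('Undefined Identifier ' + n);
end;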
{--------------------------------------------------------------}
{ Report a Duplicate Identifier }
procedure Duplicate(n: string);
begin
Abort('Duplicate Identifier ' + n);
end;

{--------------------------------------------------------------}
{ Check to Make Sure the Current Token is an Identifier }
procedure CheckIdent;
begin
if Token <> 'x' then Expected('Identifier');
end;
{--------------------------------------------------------------}
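{ Recognize an Alpha Character }
function IsAlpha(c: char): boolean;
begin
IsAlpha := UpCase(c) in ['A'..'Z'];
end;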
{--------------------------------------------------------------}
{ Recognize a Decimal Digit }
function IsDigit(c: char): boolean;
begin
IsDigit := c in ['0'..'9'];
end;

{--------------------------------------------------------------}
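{ Recognize an AlphaNumeric Character }
function IsAlNum(c: char): boolean;
begin
IsAlNum := IsAlpha(c) or IsDigit(c);
end;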
{--------------------------------------------------------------}
{ Recognize an Addop }
function IsAddop(c: char): boolean;
begin
IsAddop := c in ['+', '-'];
end;
{--------------------------------------------------------------}
{ Recognize a Mulop }
function IsMulop(c: char): boolean;
begin
IsMulop := c in ['*', '/'];
end;

{--------------------------------------------------------------}
{ Recognize a Boolean Orop }
function IsOrop(c: char): boolean;
begin
IsOrop := c in ['|', '~'];
end;
{--------------------------------------------------------------}
{ Recognize a Relop }
function IsRelop(c: char): boolean;
begin
IsRelop := c in ['=', '#', '<', '>'];
end;
{--------------------------------------------------------------}
{ Recognize White Space }
function IsWhite(c: char): boolean;
begin
IsWhite := c in [' ', TAB, CR, LF];
end;

{--------------------------------------------------------------}
{ Skip Over Leading White Space }
procedure SkipWhite;
begin
while IsWhite(Look) do
GetChar;
end;
{--------------------------------------------------------------}
{ Table Lookup }
function Lookup(T: TabPtr; s: string; n: integer): integer;
var i: integer;
found: Boolean;
begin
found := false;
i := n;
while (i > 0) and not found do
if s = T^[i] then
found := true
else
dec(i);
Lookup := i;
end;

{--------------------------------------------------------------}
{ Locate a Symbol in Table }
{ Returns the index of the entry. Zero if not present. }
function Locate(N: Symbol): integer;
begin
Locate := Lookup(@ST, n, NEntry);
end;
{--------------------------------------------------------------}
{ Look for Symbol in Table }
function InTable(n: Symbol): Boolean;
begin
InTable := Lookup(@ST, n, NEntry) <> 0;
end;
{--------------------------------------------------------------}
{ Check to See if an Identifier is in the Symbol Table }
{ Report an error if it's not. }
procedure CheckTable(N: Symbol);
begin
if not InTable(N) then Undefined(N);
end;

{--------------------------------------------------------------}
{ Check the Symbol Table for a Duplicate Identifier }
{ Report an error if identifier is already in table. }
procedure CheckDup(N: Symbol);
begin
if InTable(N) then Duplicate(N);
end;
{--------------------------------------------------------------}
{ Add a New Entry to Symbol Table }
procedure AddEntry(N: Symbol; T: char);
begin
CheckDup(N);
Inc(NEntry);
ST[NEntry] := N;
SType[NEntry] := T;
end;

{--------------------------------------------------------------}
procedure GetName;
begin
SkipWhite;
if Not IsAlpha(Look) then Expected('Identifier');
Token := 'x';
Value := '';
repeat
Value := Value + UpCase(Look);
GetChar;
until not IsAlNum(Look);
end;

{--------------------------------------------------------------}
{ Get a Number }
procedure GetNum;
begin
SkipWhite;
if not IsDigit(Look) then Expected('Number');
Token := '#';
Value := '';
repeat
Value := Value + Look;
GetChar;
until not IsDigit(Look);
end;
{--------------------------------------------------------------}
{ Get an Operator }
procedure GetOp;
begin
SkipWhite;
Token := Look;
Value := Look;
GetChar;
end;

{--------------------------------------------------------------}
{ Get the Next Input Token }
procedure Next;
begin
SkipWhite;
if IsAlpha(Look) then GetName
else if IsDigit(Look) then GetNum
else GetOp;
end;
{--------------------------------------------------------------}
{ Scan the Current Identifier for Keywords }
procedure Scan;
begin
if Token = 'x' then
Token := KWcode[Lookup(Addr(KWlist), Value, NKW) + 1];
end;

{--------------------------------------------------------------}
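{ Match a Specific Input String }
procedure MatchString(x: string);
begin
if Value <> x then Expected('''' + x + '''');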
Next;
end;
{--------------------------------------------------------------}
{ Output a String with Tab }
procedure Emit(s: string);
begin
Write(TAB, s);
end;
{--------------------------------------------------------------}
{ Output a String with Tab and CRLF }
procedure EmitLn(s: string);
begin
Emit(s);
WriteLn;
end;

{--------------------------------------------------------------}
{ Generate a Unique Label }
function NewLabel: string;
var S: string;
begin
Str(LCount, S);
NewLabel := 'L' + S;
Inc(LCount);
end;
{--------------------------------------------------------------}
{ Post a Label To Output }
procedure PostLabel(L: string);
begin
WriteLn(L, ':');
end;
{---------------------------------------------------------------}
procedure Clear;
begin
EmitLn('CLR D0');
end;

{---------------------------------------------------------------}
procedure Negate;
begin
EmitLn('NEG D0');
end;
{---------------------------------------------------------------}
procedure NotIt;
begin
EmitLn('NOT D0');
end;
{---------------------------------------------------------------}
procedure LoadConst(n: string);
begin
Emit('MOVE #');
WriteLn(n, ',D0');
end;

{---------------------------------------------------------------}
procedure LoadVar(Name: string);
begin
if not InTable(Name) then Undefined(Name);
EmitLn('MOVE ' + Name + '(PC),D0');
end;
{---------------------------------------------------------------}
procedure Push;
begin
EmitLn('MOVE D0,-(SP)');
end;
{---------------------------------------------------------------}
procedure PopAdd;
begin
EmitLn('ADD (SP)+,D0');
end;

{---------------------------------------------------------------}
procedure PopSub;
begin
EmitLn('SUB (SP)+,D0');
EmitLn('NEG D0');
end;
{---------------------------------------------------------------}
procedure PopMul;
begin
EmitLn('MULS (SP)+,D0');
end;

{---------------------------------------------------------------}
procedure PopDiv;
begin
EmitLn('MOVE (SP)+,D7');
EmitLn('EXT.L D7');
EmitLn('DIVS D0,D7');
EmitLn('MOVE D7,D0');
end;
{---------------------------------------------------------------}
procedure PopAnd;
begin
EmitLn('AND (SP)+,D0');
end;
{---------------------------------------------------------------}
procedure PopOr;
begin
EmitLn('OR (SP)+,D0');
end;

{---------------------------------------------------------------}
procedure PopXor;
begin
EmitLn('EOR (SP)+,D0');
end;
{---------------------------------------------------------------}
{ Compare Top of Stack with Primary }
procedure PopCompare;
begin
EmitLn('CMP (SP)+,D0');
end;
{---------------------------------------------------------------}
procedure SetEqual;
begin
EmitLn('SEQ D0');
EmitLn('EXT D0');
end;

{---------------------------------------------------------------}
{ Set D0 If Compare was != }
procedure SetNEqual;
begin
EmitLn('SNE D0');
EmitLn('EXT D0');
end;
{---------------------------------------------------------------}
{ Set D0 If Compare was > }
procedure SetGreater;
begin
EmitLn('SLT D0');
EmitLn('EXT D0');
end;
{---------------------------------------------------------------}
procedure SetLess;
begin
EmitLn('SGT D0');
EmitLn('EXT D0');
end;

{---------------------------------------------------------------}
{ Set D0 If Compare was <= }
procedure SetLessOrEqual;
begin
EmitLn('SGE D0');
EmitLn('EXT D0');
end;
{---------------------------------------------------------------}
{ Set D0 If Compare was >= }
procedure SetGreaterOrEqual;
begin
EmitLn('SLE D0');
EmitLn('EXT D0');
end;
{---------------------------------------------------------------}
procedure Store(Name: string);
begin
EmitLn('LEA ' + Name + '(PC),A0');
EmitLn('MOVE D0,(A0)')
end;

{---------------------------------------------------------------}
{ Branch Unconditional }
procedure Branch(L: string);
begin
EmitLn('BRA ' + L);
end;
{---------------------------------------------------------------}
{ Branch False }
procedure BranchFalse(L: string);
begin
EmitLn('TST D0');
EmitLn('BEQ ' + L);
end;

{---------------------------------------------------------------}
procedure ReadIt(Name: string);
begin
EmitLn('BSR READ');
Store(Name);
end;
{---------------------------------------------------------------}
{ Write from Primary Register }
procedure WriteIt;
begin
EmitLn('BSR WRITE');
end;
{--------------------------------------------------------------}
procedure Header;
begin
WriteLn('WARMST', TAB, 'EQU $A01E');
end;

{--------------------------------------------------------------}
procedure Prolog;
begin
PostLabel('MAIN');
end;
{--------------------------------------------------------------}
procedure Epilog;
begin
EmitLn('END MAIN');
end;
{---------------------------------------------------------------}
{ Allocate Storage for a Static Variable }
procedure Allocate(Name, Val: string);
begin
WriteLn(Name, ':', TAB, 'DC ', Val);
end;

{---------------------------------------------------------------}
{ Parse and Translate a Math Factor }
procedure BoolExpression; Forward;
procedure Factor;
begin
if Token = '(' then begin
Next;
BoolExpression;
MatchString(')');
end
else begin
if Token = 'x' then
LoadVar(Value)
else if Token = '#' then
LoadConst(Value)
else Expected('Math Factor');
Next;
end;
end;

{--------------------------------------------------------------}
procedure Multiply;
begin
Next;
Factor;
PopMul;
end;
{-------------------------------------------------------------}
procedure Divide;
begin
Next;
Factor;
PopDiv;
end;

{---------------------------------------------------------------}
procedure Term;
begin
Factor;
while IsMulop(Token) do begin
Push;
case Token of
'*': Multiply;
'/': Divide;
end;
end;
end;
{--------------------------------------------------------------}
procedure Add;
begin
Next;
Term;
PopAdd;
end;

{-------------------------------------------------------------}
procedure Subtract;
begin
Next;
Term;
PopSub;
end;

{---------------------------------------------------------------}
procedure Expression;
begin
if IsAddop(Token) then
Clear
else
Term;
while IsAddop(Token) do begin
Push;
case Token of
'+': Add;
'-': Subtract;
end;
end;
end;
{---------------------------------------------------------------}
{ Get Another Expression and Compare }
procedure CompareExpression;
begin
Expression;
PopCompare;
end;

{---------------------------------------------------------------}
{ Get The Next Expression and Compare }
procedure NextExpression;
begin
Next;
CompareExpression;
end;
{---------------------------------------------------------------}
procedure Equal;
begin
NextExpression;
SetEqual;
end;
{---------------------------------------------------------------}
procedure LessOrEqual;
begin
NextExpression;
SetLessOrEqual;
end;

{---------------------------------------------------------------}
procedure NotEqual;
begin
NextExpression;
SetNEqual;
end;
{---------------------------------------------------------------}
procedure Less;
begin
Next;
case Token of
'=': LessOrEqual;
'>': NotEqual;
else begin
CompareExpression;
SetLess;
end;
end;
end;

{---------------------------------------------------------------}
procedure Greater;
begin
Next;
if Token = '=' then begin
NextExpression;
SetGreaterOrEqual;
end
else begin
CompareExpression;
SetGreater;
end;
end;

{---------------------------------------------------------------}
procedure Relation;
begin
Expression;
if IsRelop(Token) then begin
Push;
case Token of
'=': Equal;
'<': Less;
'>': Greater;
end;
end;
end;

{---------------------------------------------------------------}
procedure NotFactor;
begin
if Token = '!' then begin
Next;
Relation;
NotIt;
end
else
Relation;
end;

{---------------------------------------------------------------}
procedure BoolTerm;
begin
NotFactor;
while Token = '&' do begin
Push;
Next;
NotFactor;
PopAnd;
end;
end;
{--------------------------------------------------------------}
procedure BoolOr;
begin
Next;
BoolTerm;
PopOr;
end;

{--------------------------------------------------------------}
procedure BoolXor;
begin
Next;
BoolTerm;
PopXor;
end;
{---------------------------------------------------------------}
procedure BoolExpression;
begin
BoolTerm;
while IsOrOp(Token) do begin
Push;
case Token of
'|': BoolOr;
'~': BoolXor;
end;
end;
end;

{--------------------------------------------------------------}
procedure Assignment;
var Name: string;
begin
CheckTable(Value);
Name := Value;
Next;
MatchString('=');
BoolExpression;
Store(Name);
end;

{---------------------------------------------------------------}
procedure Block; Forward;

procedure DoIf;
var L1, L2: string;
begin
Next;
BoolExpression;
L1 := NewLabel;
L2 := L1;
BranchFalse(L1);
Block;
if Token = 'l' then begin
Next;
L2 := NewLabel;
Branch(L2);
PostLabel(L1);
Block;
end;
PostLabel(L2);
MatchString('ENDIF');
end;

{--------------------------------------------------------------}
procedure DoWhile;
var L1, L2: string;
begin
Next;
L1 := NewLabel;
L2 := NewLabel;
PostLabel(L1);
BoolExpression;
BranchFalse(L2);
Block;
Branch(L1);
PostLabel(L2);
MatchString('ENDWHILE');
end;

{--------------------------------------------------------------}
{ Read a Single Variable }
procedure ReadVar;
begin
CheckIdent;
CheckTable(Value);
ReadIt(Value);
Next;
end;
{--------------------------------------------------------------}
procedure DoRead;
begin
Next;
MatchString('(');
ReadVar;
while Token = ',' do begin
Next;
ReadVar;
end;
MatchString(')');
end;

{--------------------------------------------------------------}
procedure DoWrite;
begin
Next;
MatchString('(');
Expression;
WriteIt;
while Token = ',' do begin
Next;
Expression;
WriteIt;
end;
MatchString(')');
end;

{--------------------------------------------------------------}
procedure Block;
begin
Scan;
while not(Token in ['e', 'l']) do begin
case Token of
'i': DoIf;
'w': DoWhile;
'R': DoRead;
'W': DoWrite;
else Assignment;
end;
Scan;
end;
end;

{--------------------------------------------------------------}
procedure Alloc;
begin
Next;
if Token <> 'x' then Expected('Variable Name');
CheckDup(Value);
AddEntry(Value, 'v');
Allocate(Value, '0');
Next;
end;

{--------------------------------------------------------------}
procedure TopDecls;
begin
Scan;
while Token = 'v' do
Alloc;
while Token = ',' do
Alloc;
end;
{--------------------------------------------------------------}
{ Initialize }
procedure Init;
begin
GetChar;
Next;
end;

{--------------------------------------------------------------}
{ Main Program }
begin
Init;
MatchString('PROGRAM');
Header;
TopDecls;
MatchString('BEGIN');
Prolog;
Block;
MatchString('END');
Epilog;
end.
{--------------------------------------------------------------}

Part 12 - Miscellany
INTRODUCTION
This installment is another one of those excursions into side alleys that don't seem to fit into
the mainstream of this tutorial series. As I mentioned last time, it was while I was writing this
installment that I realized some changes had to be made to the compiler structure. So I had
to digress from this digression long enough to develop the new structure and show it to you.
Now that that's behind us, I can tell you what I set out to in the first place. This shouldn't take
long, and then we can get back into the mainstream.
Several people have asked me about things that other languages provide, but so far I haven't
addressed in this series. The two biggies are semicolons and comments. Perhaps you've
wondered about them, too, and wondered how things would change if we had to deal with
them. Just so you can proceed with what's to come, without being bothered by that nagging
feeling that something is missing, we'll address such issues here.

SEMICOLONS
Ever since the introduction of Algol, semicolons have been a part of almost every modern
language. We've all used them to the point that they are taken for granted. Yet I suspect
that more compilation errors have occurred due to misplaced or missing semicolons than
any other single cause. And if we had a penny for every extra keystroke programmers
have used to type the little rascals, we could pay off the national debt.
Having been brought up with FORTRAN, it took me a long time to get used to using semi-
colons, and to tell the truth I've never quite understood why they were necessary. Since I
program in Pascal, and since the use of semicolons in Pascal is particularly tricky, that
one little character is still by far my biggest source of errors.
When I began developing KISS, I resolved to question EVERY construct in other lan-
guages, and to try to avoid the most common problems that occur with them. That puts
the semicolon very high on my hit list.
To understand the role of the semicolon, you have to look at a little history.
Early programming languages were line-oriented. In FORTRAN, for example, various
parts of the statement had specific columns or fields that they had to appear in. Since
some statements were too long for one line, the "continuation card" mechanism was pro-
vided to let the compiler know that a given card was still part of the previous line. The
mechanism survives to this day, even though punched cards are now things of the distant
past.
When other languages came along, they also adopted various mechanisms for dealing
with multiple-line statements. BASIC is a good example. It's important to recognize,
though, that the FORTRAN mechanism was not so much required by the line orientation
of that language, as by the column-orientation. In those versions of FORTRAN where
free-form input is permitted, it's no longer needed.
When the fathers of Algol introduced that language, they wanted to get away from line-
oriented programs like FORTRAN and BASIC, and allow for free-form input. This
included the possibility of stringing multiple statements on a single line, as in
a=b; c=d; e=e+1;

In cases like this, the semicolon is almost REQUIRED. The same line, without the semico-
lons, just looks "funny":
a=b c= d e=e+1
I suspect that this is the major ... perhaps ONLY ... reason for semicolons: to keep programs
from looking funny.
But the idea of stringing multiple statements together on a single line is a dubious one at best.
It's not very good programming style, and harks back to the days when it was considered
important to conserve cards. In these days of CRT's and indented code, the clarity of
programs is far better served by keeping statements separate. It's still nice to have the OPTION
of multiple statements, but it seems a shame to keep programmers in slavery to the semico-
lon, just to keep that one rare case from "looking funny."
When I started in with KISS, I tried to keep an open mind. I decided that I would use semico-
lons when it became necessary for the parser, but not until then. I figured this would happen
just about the time I added the ability to spread statements over multiple lines. But, as you
can see, that never happened. The TINY compiler is perfectly happy to parse the most com-
plicated statement, spread over any number of lines, without semicolons.
Still, there are people who have used semicolons for so long, they feel naked without them.
I'm one of them. Once I had KISS defined sufficiently well, I began to write a few sample pro-
grams in the language. I discovered, somewhat to my horror, that I kept putting semicolons in
anyway. So now I'm facing the prospect of a NEW rash of compiler errors, caused by
UNWANTED semicolons. Phooey!
Perhaps more to the point, there are readers out there who are designing their own lan-
guages, which may include semicolons, or who want to use the techniques of these tutorials
to compile conventional languages like C. In either case, we need to be able to deal with
semicolons.

SYNTACTIC SUGAR
This whole discussion brings up the issue of "syntactic sugar" ... constructs that are
added to a language, not because they are needed, but because they help make the pro-
grams look right to the programmer. After all, it's nice to have a small, simple compiler, but
it would be of little use if the resulting language were cryptic and hard to program. The
language FORTH comes to mind (a premature OUCH! for the barrage I know that one's
going to fetch me). If we can add features to the language that make the programs easier
to read and understand, and if those features help keep the programmer from making
errors, then we should do so. Particularly if the constructs don't add much to the complex-
ity of the language or its compiler.
The semicolon could be considered an example, but there are plenty of others, such as
the 'THEN' in an IF-statement, the 'DO' in a WHILE-statement, and even the 'PROGRAM'
statement, which I came within a gnat's eyelash of leaving out of TINY. None of these
tokens add much to the syntax of the language ... the compiler can figure out what's going
on without them. But some folks feel that they DO add to the readability of programs, and
that can be very important.
There are two schools of thought on this subject, which are well represented by two of our
most popular languages, C and Pascal.
To the minimalists, all such sugar should be left out. They argue that it clutters up the lan-
guage and adds to the number of keystrokes programmers must type. Perhaps more
importantly, every extra token or keyword represents a trap lying in wait for the
inattentive programmer. If you leave out a token, misplace it, or misspell it, the compiler will get
you. So these people argue that the best approach is to get rid of such things. These folks
tend to like C, which has a minimum of unnecessary keywords and punctuation.
Those from the other school tend to like Pascal. They argue that having to type a few
extra characters is a small price to pay for legibility. After all, humans have to read the
programs, too. Their best argument is that each such construct is an opportunity to tell
the compiler that you really mean for it to do what you said to. The sugary tokens serve as
useful landmarks to help you find your way.

The differences are well represented by the two languages. The most oft-heard complaint
about C is that it is too forgiving. When you make a mistake in C, the erroneous code is too
often another legal C construct. So the compiler just happily continues to compile, and leaves
you to find the error during debug. I guess that's why debuggers are so popular with C pro-
grammers.
On the other hand, if a Pascal program compiles, you can be pretty sure that the program will
do what you told it. If there is an error at run time, it's probably a design error.
The best example of useful sugar is the semicolon itself. Consider the code fragment:
a=1+(2*b+c) b...
Since there is no operator connecting the token 'b' with the rest of the statement, the compiler
will conclude that the expression ends with the ')', and the 'b' is the beginning of a new state-
ment. But suppose I have simply left out the intended operator, and I really want to say:
a=1+(2*b+c)*b...
In this case the compiler will get an error, all right, but it won't be very meaningful since it will
be expecting an '=' sign after the 'b' that really shouldn't be there.
If, on the other hand, I include a semicolon after the 'b', THEN there can be no doubt where I
intend the statement to end. Syntactic sugar, then, can serve a very useful purpose by provid-
ing some additional insurance that we remain on track.
I find myself somewhere in the middle of all this. I tend to favor the Pascal-ers' view ... I'd
much rather find my bugs at compile time than at run time. But I also hate to just throw
verbosity in for no apparent reason, as in COBOL. So far I've consistently left most of the
Pascal sugar out of KISS/TINY. But I certainly have no strong feelings either way, and I also
can see the value of sprinkling a little sugar around just for the extra insurance that it brings.
If you like this latter approach, things like that are easy to add. Just remember that, like the
semicolon, each item of sugar is something that can potentially cause a compile error by its
omission.

DEALING WITH SEMICOLONS

There are two distinct ways in which semicolons are used in popular languages. In
Pascal, the semicolon is regarded as a statement SEPARATOR. No semicolon is required
after the last statement in a block. The syntax is:
<block> ::= <statement> ( ';' <statement>)*
<statement> ::= <assignment> | <if> | <while> ... | null
(The null statement is IMPORTANT!)
Pascal also defines some semicolons in other places, such as after the PROGRAM state-
ment.
In C and Ada, on the other hand, the semicolon is considered a statement TERMINATOR,
and follows all statements (with some embarrassing and confusing exceptions). The syn-
tax for this is simply:
<block> ::= ( <statement> ';')*
Of the two syntaxes, the Pascal one seems on the face of it more rational, but experience
has shown that it leads to some strange difficulties. People get so used to typing a semi-
colon after every statement that they tend to type one after the last statement in a block,
also. That usually doesn't cause any harm ... it just gets treated as a null statement. Many
Pascal programmers, including yours truly, do just that. But there is one place you abso-
lutely CANNOT type a semicolon, and that's right before an ELSE. This little gotcha has
cost me many an extra compilation, particularly when the ELSE is added to existing code.
So the C/Ada choice turns out to be better. Apparently Niklaus Wirth thinks so, too: In his
Modula-2, he abandoned the Pascal approach.
Given either of these two syntaxes, it's an easy matter (now that we've reorganized the
parser!) to add these features to our parser. Let's take the last case first, since it's simpler.

To begin, I've made things easy by introducing a new recognizer:
{--------------------------------------------------------------}
{ Match a Semicolon }
procedure Semi;
begin
MatchString(';');
end;
{--------------------------------------------------------------}
This procedure works very much like our old Match. It insists on finding a semicolon as the
next token. Having found it, it skips to the next one.

Since a semicolon follows a statement, procedure Block is almost the only one we need
to change:
{--------------------------------------------------------------}
procedure Block;
begin
Scan;
while not(Token in ['e', 'l']) do begin
case Token of
'i': DoIf;
'w': DoWhile;
'R': DoRead;
'W': DoWrite;
'x': Assignment;
end;
Semi;
Scan;
end;
end;
{--------------------------------------------------------------}

Note carefully the subtle change in the case statement. The call to Assignment is now
guarded by a test on Token. This is to avoid calling Assignment when the token is a semico-
lon (which could happen if the statement is null).
Since declarations are also statements, we also need to add a call to Semi within procedure
TopDecls:
{--------------------------------------------------------------}
procedure TopDecls;
begin
Scan;
while Token = 'v' do begin
Alloc;
while Token = ',' do
Alloc;
Semi;
end;
end;
{--------------------------------------------------------------}

Finally, we need one for the PROGRAM statement:
{--------------------------------------------------------------}
{ Main Program }
begin
Init;
MatchString('PROGRAM');
Semi;
Header;
TopDecls;
MatchString('BEGIN');
Prolog;
Block;
MatchString('END');
Epilog;
end.
{--------------------------------------------------------------}
It's as easy as that. Try it with a copy of TINY and see how you like it.

The Pascal version is a little trickier, but it still only requires minor changes, and those only to
procedure Block. To keep things as simple as possible, let's split the procedure into two parts.
The following procedure handles just one statement:
{--------------------------------------------------------------}
{ Parse and Translate a Single Statement }
procedure Statement;
begin
Scan;
case Token of
'i': DoIf;
'w': DoWhile;
'R': DoRead;
'W': DoWrite;
'x': Assignment;
end;
end;
{--------------------------------------------------------------}

Using this procedure, we can now rewrite Block like this:
{--------------------------------------------------------------}
procedure Block;
begin
Statement;
while Token = ';' do begin
Next;
Statement;
end;
end;
{--------------------------------------------------------------}
That sure didn't hurt, did it? We can now parse semicolons in Pascal-like fashion.
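To see the practical difference between the two grammars, here is a small stand-alone sketch. It is not part of TINY: it assumes a toy language in which a statement is a single lowercase letter and '.' plays the role of END, and the names Accepts, TerminatedBlock, and SeparatedBlock are made up for the illustration.

```pascal
program SemiDemo;
{ Toy recognizer. A statement is one lowercase letter, and '.'
  stands in for END. All names are illustrative, not part of TINY. }

var
  Src: string;    { input being parsed }
  Idx: integer;   { position of the next unread character }
  Look: char;     { lookahead character }
  Ok: boolean;    { cleared on the first syntax error }

procedure GetChar;
begin
  if Idx <= Length(Src) then begin
    Look := Src[Idx];
    Inc(Idx);
  end
  else
    Look := '.';  { running off the end behaves like END }
end;

{ A statement is a single letter, or nothing at all (the null statement) }
procedure Statement;
begin
  if Look in ['a'..'z'] then GetChar;
end;

{ C/Ada style: <block> ::= ( <statement> ';' )* }
procedure TerminatedBlock;
begin
  while Ok and (Look <> '.') do begin
    Statement;
    if Look = ';' then GetChar else Ok := false;
  end;
end;

{ Pascal style: <block> ::= <statement> ( ';' <statement> )* }
procedure SeparatedBlock;
begin
  Statement;
  while Look = ';' do begin
    GetChar;
    Statement;
  end;
  if Look <> '.' then Ok := false;
end;

function Accepts(S: string; Terminated: boolean): boolean;
begin
  Src := S;
  Idx := 1;
  Ok := true;
  GetChar;
  if Terminated then TerminatedBlock else SeparatedBlock;
  Accepts := Ok;
end;

begin
  WriteLn('a;b;.  terminator: ', Accepts('a;b;.', true));
  WriteLn('a;b.   terminator: ', Accepts('a;b.', true));
  WriteLn('a;b.   separator:  ', Accepts('a;b.', false));
  WriteLn('a;b;.  separator:  ', Accepts('a;b;.', false));
end.
```

Only the second case is rejected: the terminator grammar insists on a semicolon after the last statement. The separator grammar accepts the trailing semicolon in the fourth case purely because the null statement exists, which is why it was flagged as IMPORTANT above.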

A COMPROMISE
Now that we know how to deal with semicolons, does that mean that I'm going to put them in
KISS/TINY? Well, yes and no. I like the extra sugar and the security that comes with knowing
for sure where the ends of statements are. But I haven't changed my dislike for the compila-
tion errors associated with semicolons.
So I have what I think is a nice compromise: Make them OPTIONAL!
Consider the following version of Semi:
{--------------------------------------------------------------}
{ Match a Semicolon }
procedure Semi;
begin
if Token = ';' then Next;
end;
{--------------------------------------------------------------}
This procedure will ACCEPT a semicolon whenever it is called, but it won't INSIST on one.
That means that when you choose to use semicolons, the compiler will use the extra informa-
tion to help keep itself on track. But if you omit one (or omit them all) the compiler won't com-
plain. The best of both worlds.
Put this procedure in place in the first version of your program (the one for C/Ada syntax),
and you have the makings of TINY Version 1.2.
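A quick way to convince yourself that the optional form behaves as described is the following sketch. It assumes a toy input in which each statement is a single lowercase letter and '.' marks the end; CountStatements is a hypothetical helper, not a TINY routine.

```pascal
program OptionalSemi;
{ Toy setup: one lowercase letter per statement, '.' marks the end.
  Names are illustrative only. }

var
  Src: string;
  Idx: integer;
  Look: char;

procedure GetChar;
begin
  if Idx <= Length(Src) then begin
    Look := Src[Idx];
    Inc(Idx);
  end
  else
    Look := '.';
end;

{ Accept a semicolon if one is present, but do not insist on it }
procedure Semi;
begin
  if Look = ';' then GetChar;
end;

{ Count the statements in a block whose terminators are optional }
function CountStatements(S: string): integer;
var N: integer;
begin
  Src := S;
  Idx := 1;
  N := 0;
  GetChar;
  while Look in ['a'..'z'] do begin
    GetChar;     { the statement itself }
    Inc(N);
    Semi;        { optional terminator }
  end;
  CountStatements := N;
end;

begin
  WriteLn(CountStatements('a;b;c;.'));
  WriteLn(CountStatements('abc.'));
  WriteLn(CountStatements('a;bc.'));
end.
```

All three inputs yield three statements, whether the semicolons are all present, all absent, or mixed.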

COMMENTS
Up until now I have carefully avoided the subject of comments. You would think that this
would be an easy subject ... after all, the compiler doesn't have to deal with comments at
all; it should just ignore them. Well, sometimes that's true.
Comments can be just about as easy or as difficult as you choose to make them. At one
extreme, we can arrange things so that comments are intercepted almost the instant they
enter the compiler. At the other, we can treat them as lexical elements. Things tend to get
interesting when you consider things like comment delimiters contained in quoted strings.

SINGLE-CHARACTER DELIMITERS
Here's an example. Suppose we assume the Turbo Pascal standard and use curly braces for
comments. In this case we have single-character delimiters, so our parsing is a little easier.
One approach is to strip the comments out the instant we encounter them in the input stream;
that is, right in procedure GetChar. To do this, first change the name of GetChar to something
else, say GetCharX. (For the record, this is going to be a TEMPORARY change, so best not
do this with your only copy of TINY. I assume you understand that you should always do
these experiments with a working copy.)
Now, we're going to need a procedure to skip over comments. So key in the following one:
{--------------------------------------------------------------}
{ Skip A Comment Field }
procedure SkipComment;
begin
while Look <> '}' do
GetCharX;
GetCharX;
end;
{--------------------------------------------------------------}
Clearly, what this procedure is going to do is to simply read and discard characters from the
input stream, until it finds a right curly brace. Then it reads one more character and returns it
in Look.

Now we can write a new version of GetChar that calls SkipComment to strip out comments:
{--------------------------------------------------------------}
{ Get Character from Input Stream }
{ Skip Any Comments }
procedure GetChar;
begin
GetCharX;
if Look = '{' then SkipComment;
end;
{--------------------------------------------------------------}
Code this up and give it a try. You'll find that you can, indeed, bury comments anywhere
you like. The comments never even get into the parser proper ... every call to GetChar
just returns any character that's NOT part of a comment.
As a matter of fact, while this approach gets the job done, and may even be perfectly sat-
isfactory for you, it does its job a little TOO well. First of all, most programming languages
specify that a comment should be treated like a space, so that comments aren't allowed
to be embedded in, say, variable names. This current version doesn't care WHERE you
put comments.
Second, since the rest of the parser can't even receive a '{' character, you will not be
allowed to put one in a quoted string.
Before you turn up your nose at this simplistic solution, though, I should point out that as
respected a compiler as Turbo Pascal also won't allow a '{' in a quoted string. Try it. And
as for embedding a comment in an identifier, I can't imagine why anyone would want to do
such a thing, anyway, so the question is moot. For 99% of all applications, what I've just
shown you will work just fine.
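If you'd like to see both side effects without wiring anything into TINY, the sketch below runs the same idea over a string. Reading from a string, the #0 end marker, and the Filtered helper are assumptions made to keep the demo self-contained. GetChar also loops where the listing above tests once, so that back-to-back comments are removed as well.

```pascal
program StripAtGetChar;
{ Sketch of comment stripping inside GetChar. Input comes from a
  string, and a #0 sentinel marks its end; both are assumptions. }

const
  EOT = #0;

var
  Src: string;
  Idx: integer;
  Look: char;

{ Raw read: the next character, or EOT once the input is used up }
procedure GetCharX;
begin
  if Idx <= Length(Src) then begin
    Look := Src[Idx];
    Inc(Idx);
  end
  else
    Look := EOT;
end;

{ Discard everything up to and including the closing brace }
procedure SkipComment;
begin
  while (Look <> '}') and (Look <> EOT) do
    GetCharX;
  GetCharX;
end;

{ Filtered read: the parser never sees a comment }
procedure GetChar;
begin
  GetCharX;
  while Look = '{' do
    SkipComment;
end;

{ Hypothetical helper: everything the parser would receive }
function Filtered(S: string): string;
var R: string;
begin
  Src := S;
  Idx := 1;
  R := '';
  GetChar;
  while Look <> EOT do begin
    R := R + Look;
    GetChar;
  end;
  Filtered := R;
end;

begin
  WriteLn(Filtered('cou{ split }nt = 1'));
  WriteLn(Filtered('s = "a{b}c"'));
end.
```

The first line comes out as the single identifier count followed by its assignment, and the second loses the braces and the b from inside the quoted string. Those are exactly the two behaviors described above.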

But, if you want to be picky about it and stick to the conventional treatment, then we need to
move the interception point downstream a little further.
To do this, first change GetChar back to the way it was and change the name called in
SkipComment. Then, let's add the left brace as a possible whitespace character:
{--------------------------------------------------------------}
function IsWhite(c: char): boolean;
begin
IsWhite := c in [' ', TAB, CR, LF, '{'];
end;
{--------------------------------------------------------------}

Now, we can deal with comments in procedure SkipWhite:
{--------------------------------------------------------------}
procedure SkipWhite;
begin
while IsWhite(Look) do begin
if Look = '{' then
SkipComment
else
GetChar;
end;
end;
{--------------------------------------------------------------}
Note that SkipWhite is written so that we will skip over any combination of whitespace
characters and comments, in one call.
OK, give this one a try, too. You'll find that it will let a comment serve to delimit tokens. It's
worth mentioning that this approach also gives us the ability to handle curly braces within
quoted strings, since within such strings we will not be testing for or skipping over
whitespace.
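Here is a rough stand-alone sketch of that behavior. The token collector GetToken is a stand-in for the real scanner and is purely hypothetical, as are the string input and the #0 end marker; only IsWhite, SkipWhite, and SkipComment follow the listings above.

```pascal
program CommentAsWhite;
{ Sketch: comments handled in SkipWhite, so a comment delimits tokens
  and quoted strings are left alone. GetToken is a made-up stand-in
  for the real scanner. }

const
  EOT = #0;
  TAB = #9;
  CR  = #13;
  LF  = #10;

var
  Src: string;
  Idx: integer;
  Look: char;

procedure GetChar;
begin
  if Idx <= Length(Src) then begin
    Look := Src[Idx];
    Inc(Idx);
  end
  else
    Look := EOT;
end;

procedure SkipComment;
begin
  while (Look <> '}') and (Look <> EOT) do
    GetChar;
  GetChar;
end;

function IsWhite(c: char): boolean;
begin
  IsWhite := c in [' ', TAB, CR, LF, '{'];
end;

procedure SkipWhite;
begin
  while IsWhite(Look) do begin
    if Look = '{' then
      SkipComment
    else
      GetChar;
  end;
end;

{ One token: a quoted string, or a run of non-blank characters }
function GetToken: string;
var T: string;
begin
  T := '';
  SkipWhite;
  if Look = '"' then begin
    repeat
      T := T + Look;
      GetChar;
    until (Look = '"') or (Look = EOT);
    if Look = '"' then begin
      T := T + Look;
      GetChar;
    end;
  end
  else
    while not IsWhite(Look) and (Look <> EOT) do begin
      T := T + Look;
      GetChar;
    end;
  GetToken := T;
end;

procedure ShowTokens(S: string);
var T: string;
begin
  Src := S;
  Idx := 1;
  GetChar;
  T := GetToken;
  while T <> '' do begin
    WriteLn('[', T, ']');
    T := GetToken;
  end;
end;

begin
  ShowTokens('cou{ split }nt "a{b}c"');
end.
```

This time the comment splits the identifier into the two tokens cou and nt, and the quoted string arrives with its braces intact, because no whitespace test runs while the string is being collected.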
There's one last item to deal with: Nested comments. Some programmers like the idea of
nesting comments, since it allows you to comment out code during debugging. The code
I've given here won't allow that and, again, neither will Turbo Pascal.

But the fix is incredibly easy. All we need to do is to make SkipComment recursive:
{--------------------------------------------------------------}
procedure SkipComment;
begin
while Look <> '}' do begin
GetChar;
if Look = '{' then SkipComment;
end;
GetChar;
end;
{--------------------------------------------------------------}
That does it. As sophisticated a comment-handler as you'll ever need.

MULTI-CHARACTER DELIMITERS
That's all well and good for cases where a comment is delimited by single characters, but
what about the cases such as C or standard Pascal, where two characters are required?
Well, the principles are still the same, but we have to change our approach quite a bit. I'm
sure it won't surprise you to learn that things get harder in this case.
For the multi-character situation, the easiest thing to do is to intercept the left delimiter
back at the GetChar stage. We can "tokenize" it right there, replacing it by a single char-
acter.

Let's assume we're using the C delimiters '/*' and '*/'. First, we need to go back to the
"GetCharX" approach. In yet another copy of your compiler, rename GetChar to GetCharX and
then enter the following new procedure GetChar:
{--------------------------------------------------------------}
{ Read New Character. Intercept '/*' }
procedure GetChar;
begin
if TempChar <> ' ' then begin
Look := TempChar;
TempChar := ' ';
end
else begin
GetCharX;
if Look = '/' then begin
Read(TempChar);
if TempChar = '*' then begin
Look := '{';
TempChar := ' ';
end;
end;
end;
end;
{--------------------------------------------------------------}

As you can see, what this procedure does is to intercept every occurrence of '/'. It then
examines the NEXT character in the stream. If the character is a '*', then we have found
the beginning of a comment, and GetChar will return a single character replacement for it.
(For simplicity, I'm using the same '{' character as I did for Pascal. If you were writing a C
compiler, you'd no doubt want to pick some other character that's not used elsewhere in
C. Pick anything you like ... even $FF, anything that's unique.)
If the character following the '/' is NOT a '*', then GetChar tucks it away in the new global
TempChar, and returns the '/'.
Note that you need to declare this new variable and initialize it to ' '. I like to do things like
that using the Turbo "typed constant" construct:
const TempChar: char = ' ';
Now we need a new version of SkipComment:
{--------------------------------------------------------------}
procedure SkipComment;
begin
repeat
repeat
GetCharX;
until Look = '*';
GetCharX;
until Look = '/';
GetChar;
end;
{--------------------------------------------------------------}

A few things to note: first of all, function IsWhite and procedure SkipWhite don't need to be
changed, since GetChar returns the '{' token. If you change that token character, then of
course you also need to change the character in those two routines.
Second, note that SkipComment doesn't call GetChar in its loop, but GetCharX. That means
that the trailing '/' is not intercepted and is seen by SkipComment. Third, although GetChar is
the procedure doing the work, we can still deal with the comment characters embedded in a
quoted string, by calling GetCharX instead of GetChar while we're within the string. Finally,
note that we can again provide for nested comments by adding a single statement to
SkipComment, just as we did before.
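For readers who want to experiment with nesting and two-character delimiters, here is one possible sketch. It deliberately swaps in a different technique from the listings above: a depth counter and a one-character Peek instead of TempChar and recursion. Peek, Filtered, the string input, and the #0 end marker are all assumptions for the demo.

```pascal
program NestedCComments;
{ Sketch only: removes C-style comments and lets them nest by keeping
  a depth count. The string input, the #0 end marker, and the Peek
  helper are assumptions made for this demo. }

const
  EOT = #0;

var
  Src: string;
  Idx: integer;
  Look: char;

{ Raw read, with no comment handling }
procedure GetCharX;
begin
  if Idx <= Length(Src) then begin
    Look := Src[Idx];
    Inc(Idx);
  end
  else
    Look := EOT;
end;

{ The character after Look, without consuming it }
function Peek: char;
begin
  if Idx <= Length(Src) then
    Peek := Src[Idx]
  else
    Peek := EOT;
end;

{ Entered with Look on the slash that opens a comment }
procedure SkipComment;
var Depth: integer;
begin
  Depth := 0;
  repeat
    if (Look = '/') and (Peek = '*') then begin
      Inc(Depth);
      GetCharX;      { step onto the asterisk of the opener }
    end
    else if (Look = '*') and (Peek = '/') then begin
      Dec(Depth);
      GetCharX;      { step onto the slash of the closer }
    end;
    GetCharX;
  until (Depth = 0) or (Look = EOT);
end;

{ Filtered read: comments never reach the caller }
procedure GetChar;
begin
  GetCharX;
  while (Look = '/') and (Peek = '*') do
    SkipComment;
end;

function Filtered(S: string): string;
var R: string;
begin
  Src := S;
  Idx := 1;
  R := '';
  GetChar;
  while Look <> EOT do begin
    R := R + Look;
    GetChar;
  end;
  Filtered := R;
end;

begin
  WriteLn(Filtered('a = b /* outer /* inner */ still outer */ + c'));
  WriteLn(Filtered('x = y / z'));
end.
```

The first line loses the whole nested comment in one go, while the lone division sign in the second line passes through untouched. Like the GetChar-level approach for braces, this version would also strip comment delimiters inside quoted strings unless the string reader calls GetCharX directly.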

ONE-SIDED COMMENTS
So far I've shown you how to deal with any kind of comment delimited on the left and the
right. That only leaves the one-sided comments like those in assembler language or in
Ada, that are terminated by the end of the line. In a way, that case is easier. The only pro-
cedure that would need to be changed is SkipComment, which must now terminate at the
newline characters:
{--------------------------------------------------------------}
procedure SkipComment;
begin
repeat
GetCharX;
until Look = CR;
GetChar;
end;
{--------------------------------------------------------------}
If the leading character is a single one, as in the ';' of assembly language, then we're
essentially done. If it's a two-character token, as in the '--' of Ada, we need only modify
the tests within GetChar. Either way, it's an easier problem than the balanced case.
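As a sketch of the two-character case, the following program strips Ada-style '--' comments from a string. It uses a hypothetical Peek helper in place of TempChar, reads from a string with a #0 end marker, and, unlike the listing above, stops ON the line end rather than reading past it, so the parser still sees where the line finished. All of these are choices made for the demo, not part of TINY.

```pascal
program OneSided;
{ Sketch: Ada-style comments that run from a double hyphen to the end
  of the line. Peek, Filtered, and the #0 end marker are assumptions. }

const
  EOT = #0;
  CR  = #13;
  LF  = #10;

var
  Src: string;
  Idx: integer;
  Look: char;

procedure GetCharX;
begin
  if Idx <= Length(Src) then begin
    Look := Src[Idx];
    Inc(Idx);
  end
  else
    Look := EOT;
end;

{ The character after Look, without consuming it }
function Peek: char;
begin
  if Idx <= Length(Src) then
    Peek := Src[Idx]
  else
    Peek := EOT;
end;

{ Discard the rest of the line, stopping on the line end itself }
procedure SkipComment;
begin
  while not (Look in [CR, LF, EOT]) do
    GetCharX;
end;

{ Intercept a double hyphen; a single hyphen is passed through }
procedure GetChar;
begin
  GetCharX;
  if (Look = '-') and (Peek = '-') then SkipComment;
end;

function Filtered(S: string): string;
var R: string;
begin
  Src := S;
  Idx := 1;
  R := '';
  GetChar;
  while Look <> EOT do begin
    R := R + Look;
    GetChar;
  end;
  Filtered := R;
end;

begin
  WriteLn(Filtered('n = n - 1 -- count down' + LF + 'm = 2'));
end.
```

The minus sign in the first statement survives, the comment text disappears, and the second statement still starts on its own line.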

CONCLUSION
At this point we now have the ability to deal with both comments and semicolons, as well as
other kinds of syntactic sugar. I've shown you several ways to deal with each, depending
upon the convention desired. The only issue left is: which of these conventions should we
use in KISS/TINY?
For the reasons that I've given as we went along, I'm choosing the following:
(1) Semicolons are TERMINATORS, not separators
(2) Semicolons are OPTIONAL
(3) Comments are delimited by curly braces
(4) Comments MAY be nested
Put the code corresponding to these cases into your copy of TINY. You now have TINY Ver-
sion 1.2.
Now that we have disposed of these sideline issues, we can finally get back into the main-
stream. In the next installment, we'll talk about procedures and parameter passing, and we'll
add these important features to TINY. See you then.

Part 13 - Procedures
INTRODUCTION
At last we get to the good part!
At this point we've studied almost all the basic features of compilers and parsing. We
have learned how to translate arithmetic expressions, Boolean expressions, control con-
structs, data declarations, and I/O statements. We have defined a language, TINY 1.3,
that embodies all these features, and we have written a rudimentary compiler that can
translate them. By adding some file I/O we could indeed have a working compiler that
could produce executable object files from programs written in TINY. With such a com-
piler, we could write simple programs that could read integer data, perform calculations
with it, and output the results.
That's nice, but what we have is still only a toy language. We can't read or write even a
single character of text, and we still don't have procedures.
It's the features to be discussed in the next couple of installments that separate the men
from the toys, so to speak. "Real" languages have more than one data type, and they
support procedure calls. More than any others, it's these two features that give a
language much of its character and personality. Once we have provided for them, our
languages, TINY and its successors, will cease to be toys and will take on the
character of real languages, suitable for serious programming jobs.
For several installments now, I've been promising you sessions on these two important
subjects. Each time, other issues came up that required me to digress and deal with
them. Finally, we've been able to put all those issues to rest and can get on with the main-
stream of things. In this installment, I'll cover procedures. Next time, we'll talk about the
basic data types.

ONE LAST DIGRESSION

This has been an extraordinarily difficult installment for me to write. The reason has nothing
to do with the subject itself ... I've known what I wanted to say for some time, and in fact I pre-
sented most of this at Software Development '89, back in February. It has more to do with the
approach. Let me explain.
When I first began this series, I told you that we would use several "tricks" to make things
easy, and to let us learn the concepts without getting too bogged down in the details. Among
these tricks was the idea of looking at individual pieces of a compiler at a time, i.e. performing
experiments using the Cradle as a base. When we studied expressions, for example, we
dealt with only that part of compiler theory. When we studied control structures, we wrote a
different program, still based on the Cradle, to do that part. We only incorporated these con-
cepts into a complete language fairly recently. These techniques have served us very well
indeed, and led us to the development of a compiler for TINY version 1.3.
When I first began this session, I tried to build upon what we had already done, and just add
the new features to the existing compiler. That turned out to be a little awkward and tricky ...
much too much to suit me.
I finally figured out why. In this series of experiments, I had abandoned the very useful tech-
niques that had allowed us to get here, and without meaning to I had switched over into a
new method of working, that involved incremental changes to the full TINY compiler.
You need to understand that what we are doing here is a little unique. There have been a
number of articles, such as the Small C articles by Cain and Hendrix, that presented finished
compilers for one language or another. This is different. In this series of tutorials, you are
watching me design and implement both a language and a compiler, in real time.
In the experiments that I've been doing in preparation for this article, I was trying to inject the
changes into the TINY compiler in such a way that, at every step, we still had a real, working
compiler. In other words, I was attempting an incremental enhancement of the language and
its compiler, while at the same time explaining to you what I was doing.

That's a tough act to pull off! I finally realized that it was dumb to try. Having gotten this far
using the idea of small experiments based on single-character tokens and simple, spe-
cial-purpose programs, I had abandoned them in favor of working with the full compiler. It
wasn't working.
So we're going to go back to our roots, so to speak. In this installment and the next, I'll be
using single-character tokens again as we study the concepts of procedures, unfettered
by the other baggage that we have accumulated in the previous sessions. As a matter of
fact, I won't even attempt, at the end of this session, to merge the constructs into the
TINY compiler. We'll save that for later.
After all this time, you don't need more buildup than that, so let's waste no more time and
dive right in.

THE BASICS
All modern CPUs provide direct support for procedure calls, and the 68000 is no exception.
For the 68000, the call is a BSR (PC-relative version) or JSR, and the return is RTS. All we
have to do is to arrange for the compiler to issue these commands at the proper place.
Actually, there are really THREE things we have to address. One of them is the call/return
mechanism. The second is the mechanism for DEFINING the procedure in the first place.
And, finally, there is the issue of passing parameters to the called procedure. None of these
things are really very difficult, and we can of course borrow heavily on what people have done
in other languages ... there's no need to reinvent the wheel here. Of the three issues, that of
parameter passing will occupy most of our attention, simply because there are so many
options available.

A BASIS FOR EXPERIMENTS

As always, we will need some software to serve as a basis for what we are doing. We
don't need the full TINY compiler, but we do need enough of a program so that some of
the other constructs are present. Specifically, we need at least to be able to handle state-
ments of some sort, and data declarations.
The program shown below is that basis. It's a vestigial form of TINY, with single-character
tokens. It has data declarations, but only in their simplest form ... no lists or initializers. It
has assignment statements, but only of the kind
<ident> = <ident>
In other words, the only legal expression is a single variable name. There are no control
constructs ... the only legal statement is the assignment.

Most of the program is just the standard Cradle routines. I've shown the whole thing here, just
to make sure we're all starting from the same point:
{--------------------------------------------------------------}
program Calls;
{--------------------------------------------------------------}
{ Constant Declarations }
const TAB = ^I;
      CR  = ^M;
      LF  = ^J;
{--------------------------------------------------------------}
{ Variable Declarations }
var Look: char;                    { Lookahead Character }
    ST: Array['A'..'Z'] of char;   { Symbol Table: one type code per name }
{--------------------------------------------------------------}
{ Read New Character From Input Stream }
procedure GetChar;
begin
   Read(Look);
end;

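{--------------------------------------------------------------}
{ Report an Error }
procedure Error(s: string);
begin
   WriteLn;
   WriteLn(^G, 'Error: ', s, '.');
end;
{--------------------------------------------------------------}
{ Report Error and Halt }
procedure Abort(s: string);
begin
   Error(s);
   Halt;
end;
{--------------------------------------------------------------}
{ Report What Was Expected }
procedure Expected(s: string);
begin
   Abort(s + ' Expected');
end;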

{--------------------------------------------------------------}
{ Report an Undefined Identifier }
procedure Undefined(n: string);
begin
   Abort('Undefined Identifier ' + n);
end;
{--------------------------------------------------------------}
{ Report a Duplicate Identifier }
procedure Duplicate(n: string);
begin
   Abort('Duplicate Identifier ' + n);
end;
{--------------------------------------------------------------}
{ Get Type of Symbol }
function TypeOf(n: char): char;
begin
   TypeOf := ST[n];
end;

{--------------------------------------------------------------}
{ Look for Symbol in Table }
function InTable(n: char): Boolean;
begin
   InTable := ST[n] <> ' ';
end;
{--------------------------------------------------------------}
{ Add a New Symbol to Table }
procedure AddEntry(Name, T: char);
begin
   if InTable(Name) then Duplicate(Name);
   ST[Name] := T;
end;
{--------------------------------------------------------------}
{ Check an Entry to Make Sure It's a Variable }
procedure CheckVar(Name: char);
begin
   if not InTable(Name) then Undefined(Name);
   if TypeOf(Name) <> 'v' then Abort(Name + ' is not a variable');
end;

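{--------------------------------------------------------------}
{ Recognize an Alpha Character }
function IsAlpha(c: char): boolean;
begin
   IsAlpha := upcase(c) in ['A'..'Z'];
end;
{--------------------------------------------------------------}
{ Recognize a Decimal Digit }
function IsDigit(c: char): boolean;
begin
   IsDigit := c in ['0'..'9'];
end;
{--------------------------------------------------------------}
{ Recognize an AlphaNumeric Character }
function IsAlNum(c: char): boolean;
begin
   IsAlNum := IsAlpha(c) or IsDigit(c);
end;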

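{--------------------------------------------------------------}
{ Recognize an Addop }
function IsAddop(c: char): boolean;
begin
   IsAddop := c in ['+', '-'];
end;
{--------------------------------------------------------------}
{ Recognize a Mulop }
function IsMulop(c: char): boolean;
begin
   IsMulop := c in ['*', '/'];
end;
{--------------------------------------------------------------}
{ Recognize a Boolean Orop }
function IsOrop(c: char): boolean;
begin
   IsOrop := c in ['|', '~'];
end;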

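{--------------------------------------------------------------}
{ Recognize a Relop }
function IsRelop(c: char): boolean;
begin
   IsRelop := c in ['=', '#', '<', '>'];
end;
{--------------------------------------------------------------}
{ Recognize White Space }
function IsWhite(c: char): boolean;
begin
   IsWhite := c in [' ', TAB];
end;
{--------------------------------------------------------------}
{ Skip Over Leading White Space }
procedure SkipWhite;
begin
   while IsWhite(Look) do
      GetChar;
end;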

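{--------------------------------------------------------------}
{ Skip Over an End-of-Line }
procedure Fin;
begin
   if Look = CR then begin
      GetChar;
      if Look = LF then
         GetChar;
   end;
end;
{--------------------------------------------------------------}
{ Match a Specific Input Character }
procedure Match(x: char);
begin
   if Look = x then GetChar
   else Expected('''' + x + '''');
   SkipWhite;
end;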

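{--------------------------------------------------------------}
{ Get an Identifier }
function GetName: char;
begin
   if not IsAlpha(Look) then Expected('Name');
   GetName := UpCase(Look);
   GetChar;
   SkipWhite;
end;
{--------------------------------------------------------------}
{ Get a Number }
function GetNum: char;
begin
   if not IsDigit(Look) then Expected('Integer');
   GetNum := Look;
   GetChar;
   SkipWhite;
end;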

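{--------------------------------------------------------------}
{ Output a String with Tab }
procedure Emit(s: string);
begin
   Write(TAB, s);
end;
{--------------------------------------------------------------}
{ Output a String with Tab and CRLF }
procedure EmitLn(s: string);
begin
   Emit(s);
   WriteLn;
end;
{--------------------------------------------------------------}
{ Post a Label To Output }
procedure PostLabel(L: string);
begin
   WriteLn(L, ':');
end;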

{--------------------------------------------------------------}
{ Load a Variable to the Primary Register }
procedure LoadVar(Name: char);
begin
   CheckVar(Name);
   EmitLn('MOVE ' + Name + '(PC),D0');
end;
{--------------------------------------------------------------}
{ Store the Primary Register }
procedure StoreVar(Name: char);
begin
   CheckVar(Name);
   EmitLn('LEA ' + Name + '(PC),A0');
   EmitLn('MOVE D0,(A0)');
end;

{--------------------------------------------------------------}
{ Initialize }
procedure Init;
var i: char;
begin
   GetChar;
   SkipWhite;
   for i := 'A' to 'Z' do
      ST[i] := ' ';
end;
{--------------------------------------------------------------}
{ Parse and Translate an Expression }
{ Vestigial Version }
procedure Expression;
begin
   LoadVar(GetName);
end;

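{--------------------------------------------------------------}
{ Parse and Translate an Assignment Statement }
procedure Assignment;
var Name: char;
begin
   Name := GetName;
   Match('=');
   Expression;
   StoreVar(Name);
end;
{--------------------------------------------------------------}
{ Parse and Translate a Block of Statements }
procedure DoBlock;
begin
   while not(Look in ['e']) do begin
      Assignment;
      Fin;
   end;
end;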

{--------------------------------------------------------------}
{ Parse and Translate a Begin-Block }
procedure BeginBlock;
begin
   Match('b');
   Fin;
   DoBlock;
   Match('e');
   Fin;
end;
{--------------------------------------------------------------}
{ Allocate Storage for a Variable }
procedure Alloc(N: char);
begin
   if InTable(N) then Duplicate(N);
   ST[N] := 'v';
   WriteLn(N, ':', TAB, 'DC 0');
end;

{--------------------------------------------------------------}
{ Parse and Translate a Data Declaration }
procedure Decl;
var Name: char;
begin
   Match('v');
   Alloc(GetName);
end;
{--------------------------------------------------------------}
{ Parse and Translate Global Declarations }
procedure TopDecls;
begin
   while Look <> 'b' do begin
      case Look of
        'v': Decl;
      else Abort('Unrecognized Keyword ' + Look);
      end;
      Fin;
   end;
end;

{--------------------------------------------------------------}
{ Main Program }
begin
   Init;
   TopDecls;
   BeginBlock;
end.
{--------------------------------------------------------------}
Note that we DO have a symbol table, and there is logic to check a variable name to
make sure it's a legal one. It's also worth noting that I have included the code you've seen
before to provide for white space and newlines. Finally, note that the main program is
delimited, as usual, by BEGIN-END brackets.
Once you've copied the program to Turbo, the first step is to compile it and make sure it
works. Give it a few declarations, and then a begin-block. Try something like:
va (for VAR A)
vb (for VAR B)
vc (for VAR C)
b (for BEGIN)
a=b
b=c
e. (for END.)
As usual, you should also make some deliberate errors, and verify that the program
catches them correctly.
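If you don't have Turbo handy, the following Python sketch may help you see what to expect from that test. It is not the tutorial's program: it reads the input line by line instead of character by character, the name `translate` is made up for this sketch, and the `DC`, `MOVE` and `LEA` forms are assumed to match the code generators used earlier in the series.

```python
def translate(source):
    """Line-oriented stand-in for the baseline translator (a sketch, not the original)."""
    symbols = {}                      # plays the role of ST: name -> type code
    out = []
    lines = [ln.replace(" ", "") for ln in source.splitlines() if ln.strip()]
    pos = 0

    # Global declarations: 'v' followed by a one-letter name, until 'b'.
    while pos < len(lines) and lines[pos] != "b":
        line = lines[pos]
        if len(line) != 2 or line[0] != "v" or not line[1].isalpha():
            raise SyntaxError("Unrecognized Keyword " + line[0])
        name = line[1].upper()
        if name in symbols:
            raise SyntaxError("Duplicate Identifier " + name)
        symbols[name] = "v"
        out.append(name + ":\tDC 0")
        pos += 1
    if pos == len(lines):
        raise SyntaxError("'b' Expected")
    pos += 1                          # skip the 'b' keyword

    # Begin-block: assignments of the form <ident> = <ident>, until 'e'.
    while pos < len(lines) and not lines[pos].startswith("e"):
        target, eq, value = lines[pos].partition("=")
        if eq != "=" or len(target) != 1 or len(value) != 1:
            raise SyntaxError("'=' Expected")
        target, value = target.upper(), value.upper()
        for name in (value, target):
            if symbols.get(name) != "v":
                raise SyntaxError(name + " is not a variable")
        out.append("\tMOVE " + value + "(PC),D0")   # load the source variable
        out.append("\tLEA " + target + "(PC),A0")   # address of the destination
        out.append("\tMOVE D0,(A0)")                # store the value
        pos += 1
    if pos == len(lines):
        raise SyntaxError("'e' Expected")
    return out


if __name__ == "__main__":
    for line in translate("va\nvb\nvc\nb\na=b\nb=c\ne.\n"):
        print(line)
```

Each declaration should produce one storage line, and each assignment a load followed by a two-instruction store. Feeding it an undeclared name, such as `a=z`, raises an error much as CheckVar would.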

DECLARING A PROCEDURE
If you're satisfied that our little program works, then it's time to deal with the procedures.
Since we haven't talked about parameters yet, we'll begin by considering only procedures
that have no parameter lists.
As a start, let's consider a simple program with a procedure, and think about the code we'd
like to see generated for it:
PROGRAM FOO;
.
.
PROCEDURE BAR;                  BAR:
BEGIN                                .
.                                    .
.                                    .
END;                                 RTS
BEGIN { MAIN PROGRAM }          MAIN:
.                                    .
.                                    .
BAR;                                 BSR BAR
.                                    .
.                                    .
END.                                 END MAIN
Here I've shown the high-order language constructs on the left, and the desired assembler
code on the right. The first thing to notice is that we certainly don't have much code to gener-
ate here! For the great bulk of both the procedure and the main program, our existing con-
structs take care of the code to be generated.
The key to dealing with the body of the procedure is to recognize that although a procedure
may be quite long, declaring it is really no different than declaring a variable. It's just one
more kind of declaration. We can write the BNF:
<declaration> ::= <data decl> | <procedure>
This means that it should be easy to modify TopDecls to deal with procedures. What about the
syntax of a procedure? Well, here's a suggested syntax, which is essentially that of Pascal:

<procedure> ::= PROCEDURE <ident> <begin-block>
There is practically no code generation required, other than that generated within the
begin-block. We need only emit a label at the beginning of the procedure, and an RTS at
the end.
Here's the required code:
{--------------------------------------------------------------}
{ Parse and Translate a Procedure Declaration }
procedure DoProc;
var N: char;
begin
   Match('p');
   N := GetName;
   Fin;
   if InTable(N) then Duplicate(N);
   ST[N] := 'p';
   PostLabel(N);
   BeginBlock;
   Return;
end;
{--------------------------------------------------------------}
Note that I've added a new code generation routine, Return, which merely emits an RTS
instruction. The creation of that routine is "left as an exercise for the student."
To finish this version, add the following line within the Case statement in TopDecls:
   'p': DoProc;
I should mention that this structure for declarations, and the BNF that drives it, differs from
standard Pascal. In the Jensen & Wirth definition of Pascal, variable declarations, in fact ALL
kinds of declarations, must appear in a specific sequence, i.e. labels, constants, types, vari-
ables, procedures, and main program. To follow such a scheme, we should separate the two
declarations, and have code in the main program something like
DoVars;
DoProcs;
DoMain;
However, most implementations of Pascal, including Turbo, don't require that order and let
you freely mix up the various declarations, as long as you still don't try to refer to something
before it's declared. Although it may be more aesthetically pleasing to declare all the global
variables at the top of the program, it certainly doesn't do any HARM to allow them to be
sprinkled around. In fact, it may do some GOOD, in the sense that it gives you the opportunity
to do a little rudimentary information hiding. Variables that should be accessed only by the
main program, for example, can be declared just before it and will thus be inaccessible by the
procedures.
OK, try this new version out. Note that we can declare as many procedures as we choose (as
long as we don't run out of single-character names!), and the labels and RTS's all come out
in the right places.
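To make the shape of the output concrete, here is a small Python sketch of this stage of the translator (again, not the Pascal program itself). The class `ProcTranslator`, its method names, and the helper `translate` are invented for illustration; newline handling is simplified to a bare line feed, blanks are not skipped, and the `DC`/`MOVE`/`LEA` forms are assumed from earlier installments. The part to watch is `do_proc`, which posts a label, parses the begin-block, and then emits a single RTS.

```python
class ProcTranslator:
    """Toy translator: 'v' declarations, 'p' procedures, then a 'b' ... 'e' main block."""

    def __init__(self, text):
        self.text, self.pos = text, 0
        self.symbols = {}             # name -> 'v' (variable) or 'p' (procedure)
        self.out = []

    def look(self):
        return self.text[self.pos] if self.pos < len(self.text) else ""

    def match(self, ch):
        if self.look() != ch:
            raise SyntaxError("'" + ch + "' Expected")
        self.pos += 1

    def get_name(self):
        if not self.look().isalpha():
            raise SyntaxError("Name Expected")
        self.pos += 1
        return self.text[self.pos - 1].upper()

    def fin(self):
        if self.look() == "\n":       # simplified: line feed only
            self.pos += 1

    def declare(self, name, kind):
        if name in self.symbols:
            raise SyntaxError("Duplicate Identifier " + name)
        self.symbols[name] = kind

    def begin_block(self):
        self.match("b")
        self.fin()
        while self.look() != "e":     # body: <ident> = <ident> assignments only
            target = self.get_name()
            self.match("=")
            value = self.get_name()
            for name in (value, target):
                if self.symbols.get(name) != "v":
                    raise SyntaxError(name + " is not a variable")
            self.out += ["\tMOVE " + value + "(PC),D0",
                         "\tLEA " + target + "(PC),A0",
                         "\tMOVE D0,(A0)"]
            self.fin()
        self.match("e")
        self.fin()

    def do_proc(self):
        self.match("p")
        name = self.get_name()
        self.fin()
        self.declare(name, "p")
        self.out.append(name + ":")   # label at the start of the procedure
        self.begin_block()
        self.out.append("\tRTS")      # what the Return routine emits

    def top_decls(self):
        while self.look() != "b":
            if self.look() == "v":
                self.match("v")
                name = self.get_name()
                self.declare(name, "v")
                self.out.append(name + ":\tDC 0")
            elif self.look() == "p":
                self.do_proc()
            else:
                raise SyntaxError("Unrecognized Keyword " + self.look())
            self.fin()


def translate(text):
    t = ProcTranslator(text)
    t.top_decls()
    t.begin_block()                   # the main program, still unlabeled at this stage
    return t.out
```

Translating `"va\nvb\npf\nb\na=b\ne\nb\nb=a\ne.\n"` should yield the two storage lines, then `F:`, the body of the procedure, an `RTS`, and finally the code for the main block.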

It's worth noting here that I do _NOT_ allow for nested procedures. In TINY, all procedures
must be declared at the global level, the same as in C. There has been quite a discussion
about this point in the Computer Language Forum of CompuServe. It turns out
that there is a significant penalty in complexity that must be paid for the luxury of nested
procedures. What's more, this penalty gets paid at RUN TIME, because extra code must
be added and executed every time a procedure is called. I also happen to believe that
nesting is not a good idea, simply on the grounds that I have seen too many abuses of the
feature.
Before going on to the next step, it's also worth noting that the "main program" as
it stands is incomplete, since it doesn't have the label and END statement. Let's fix that
little oversight:
{--------------------------------------------------------------}
{ Parse and Translate a Main Program }
procedure DoMain;
begin
   Match('b');
   Fin;
   Prolog;
   DoBlock;
   Epilog;
end;
{--------------------------------------------------------------}

{--------------------------------------------------------------}
{ Main Program }
begin
   Init;
   TopDecls;
   DoMain;
end.
{--------------------------------------------------------------}
Note that DoProc and DoMain are not quite symmetrical. DoProc uses a call to BeginBlock,
whereas DoMain cannot. That's because a procedure is signaled by the keyword PROCE-
DURE (abbreviated by a 'p' here), while the main program gets no keyword other than the
BEGIN itself.
And _THAT_ brings up an interesting question: WHY?
If we look at the structure of C programs, we find that all functions are treated just alike,
except that the main program happens to be identified by its name, "main." Since C functions
can appear in any order, the main program can also be anywhere in the compilation unit.
In Pascal, on the other hand, all variables and procedures must be declared before they're
used, which means that there is no point putting anything after the main program ... it could
never be accessed. The "main program" is not identified at all, other than being that part of
the code that comes after the global BEGIN. In other words, if it ain't anything else, it must be
the main program.

This causes no small amount of confusion for beginning programmers, and for big Pascal
programs sometimes it's difficult to find the beginning of the main program at all. This
leads to conventions such as identifying it in comments:
BEGIN { of MAIN }
This has always seemed to me to be a bit of a kludge. The question comes up: Why
should the main program be treated so much differently than a procedure? In fact, now
that we've recognized that procedure declarations are just that ... part of the global decla-
rations ... isn't the main program just one more declaration, also?
The answer is yes, and by treating it that way, we can simplify the code and make it con-
siderably more orthogonal. I propose that we use an explicit keyword, PROGRAM, to
identify the main program (Note that this means that we can't start the file with it, as in
Pascal). In this case, our BNF becomes:
<declaration> ::= <data decl> | <procedure> | <main program>
<procedure> ::= PROCEDURE <ident> <begin-block>
<main program> ::= PROGRAM <ident> <begin-block>

The code also looks much better, at least in the sense that DoMain and DoProc look more
alike:
{--------------------------------------------------------------}
{ Parse and Translate a Main Program }
procedure DoMain;
var N: char;
begin
   Match('P');
   N := GetName;
   Fin;
   if InTable(N) then Duplicate(N);
   Prolog;
   BeginBlock;
end;
{--------------------------------------------------------------}

{--------------------------------------------------------------}
{ Parse and Translate Global Declarations }
procedure TopDecls;
begin
   while Look <> '.' do begin
      case Look of
        'v': Decl;
        'p': DoProc;
        'P': DoMain;
      else Abort('Unrecognized Keyword ' + Look);
      end;
      Fin;
   end;
end;

{--------------------------------------------------------------}
{ Main Program }
begin
   Init;
   TopDecls;
   Epilog;
end.
{--------------------------------------------------------------}
Since the declaration of the main program is now within the loop of TopDecls, that does
present some difficulties. How do we ensure that it's the last thing in the file? And how do we
ever exit from the loop? My answer for the second question, as you can see, was to bring
back our old friend the period. Once the parser sees that, we're done.
To answer the first question: it depends on how far we're willing to go to protect the program-
mer from dumb mistakes. In the code that I've shown, there's nothing to keep the program-
mer from adding code after the main program ... even another main program. The code will
just not be accessible. However, we COULD access it via a FORWARD statement, which
we'll be providing later. As a matter of fact, many assembler language programmers like to
use the area just after the program to declare large, uninitialized data blocks, so there may
indeed be some value in not requiring the main program to be last. We'll leave it as it is.
If we decide that we should give the programmer a little more help than that, it's pretty easy to
add some logic to kick us out of the loop once the main program has been processed. Or we
could at least flag an error if someone tries to include two mains.
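One way that extra help might look, sketched in Python under heavy simplification: the function below (the name `top_decls` and the flag `stop_after_main` are made up here) walks a string holding only the top-level keyword characters, rather than real source text, and either rejects a second 'P' or stops as soon as the first main program has been handled.

```python
def top_decls(keywords, stop_after_main=False):
    """Walk top-level keyword characters: 'v' data, 'p' procedure, 'P' main, '.' end."""
    seen_main = False
    handled = []
    for key in keywords:
        if key == ".":
            break                     # the period still ends the loop
        if key == "P":
            if seen_main:
                raise SyntaxError("Duplicate main program")
            seen_main = True
        elif key not in ("v", "p"):
            raise SyntaxError("Unrecognized Keyword " + key)
        handled.append(key)
        if seen_main and stop_after_main:
            break                     # stricter policy: nothing may follow the main
    return handled
```

With the default setting, declarations after the main program are still accepted and only a second main is flagged; with `stop_after_main=True` the loop ends right after the first main.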

CALLING THE PROCEDURE

If you're satisfied that things are working, let's address the second half of the equation ...
the call.
Consider the BNF for a procedure call:
<proc_call> ::= <identifier>
For an assignment statement, on the other hand, the BNF is:
<assignment> ::= <identifier> '=' <expression>
At this point we seem to have a problem. The two BNF statements both begin on the
right-hand side with the token <identifier>. How are we supposed to know, when we see
the identifier, whether we have a procedure call or an assignment statement? This looks
like a case where our parser ceases being predictive, and indeed that's exactly the case.
However, it turns out to be an easy problem to fix, since all we have to do is to look at the
type of the identifier, as recorded in the symbol table. As we've discovered before, a
minor local violation of the predictive parsing rule can be easily handled as a special
case.

Here's how to do it:
{--------------------------------------------------------------}
{ Parse and Translate an Assignment Statement }
procedure Assignment(Name: char);
begin
   Match('=');
   Expression;
   StoreVar(Name);
end;

{--------------------------------------------------------------}
{ Decide if a Statement is an Assignment or Procedure Call }
procedure AssignOrProc;
var Name: char;
begin
   Name := GetName;
   case TypeOf(Name) of
     ' ': Undefined(Name);
     'v': Assignment(Name);
     'p': CallProc(Name);
   else Abort('Identifier ' + Name +
              ' Cannot Be Used Here');
   end;
end;

{--------------------------------------------------------------}
{ Parse and Translate a Block of Statements }
procedure DoBlock;
begin
   while not(Look in ['e']) do begin
      AssignOrProc;
      Fin;
   end;
end;
{--------------------------------------------------------------}
As you can see, procedure DoBlock now calls AssignOrProc instead of Assignment. The function
of this new procedure is to simply read the identifier, determine its type, and then call
whichever procedure is appropriate for that type. Since the name has already been read, we
must pass it to the two procedures, and modify Assignment to match. Procedure CallProc is a
simple code generation routine:
{--------------------------------------------------------------}
{ Call a Procedure }
procedure CallProc(N: char);
begin
   EmitLn('BSR ' + N);
end;
{--------------------------------------------------------------}
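The dispatch done by AssignOrProc and CallProc can also be sketched in Python. This is only an illustration under assumptions: the symbol table is a plain dictionary, each statement arrives as its own string, and the function name `assign_or_proc` and the exact error texts are chosen for this sketch rather than taken from the Pascal program.

```python
def assign_or_proc(statement, symbols):
    """Pick assignment or call by looking up the leading identifier's type."""
    text = statement.replace(" ", "")
    if not text[:1].isalpha():
        raise SyntaxError("Name Expected")
    name = text[0].upper()
    kind = symbols.get(name, " ")     # ' ' marks an undeclared name, as in ST

    if kind == " ":
        raise SyntaxError("Undefined Identifier " + name)

    if kind == "v":                   # <assignment> ::= <identifier> '=' <expression>
        if text[1:2] != "=":
            raise SyntaxError("'=' Expected")
        if len(text) != 3 or not text[2].isalpha():
            raise SyntaxError("Name Expected")
        source = text[2].upper()
        if symbols.get(source) != "v":
            raise SyntaxError(source + " is not a variable")
        return ["\tMOVE " + source + "(PC),D0",
                "\tLEA " + name + "(PC),A0",
                "\tMOVE D0,(A0)"]

    if kind == "p":                   # <proc_call> ::= <identifier>
        if len(text) != 1:
            raise SyntaxError("Unexpected text after procedure name")
        return ["\tBSR " + name]

    raise SyntaxError("Identifier " + name + " Cannot Be Used Here")
```

The decision is made entirely from the table entry, before anything after the identifier has been examined, which is the "minor local violation" of predictive parsing described above.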

Well, at this point we have a compiler that can deal with procedures. It's worth noting that
procedures can call procedures to any depth. So even though we don't allow nested
DECLARATIONS, there is certainly nothing to keep us from nesting CALLS, just as we
would expect to do in any language. We're getting there, and it wasn't too hard, was it?
Of course, so far we can only deal with procedures that have no parameters. The proce-
dures can only operate on the global variables by their global names. So at this point we
have the equivalent of BASIC's GOSUB construct. Not too bad ... after all lots of serious
programs were written using GOSUBs, but we can do better, and we will. That's the next
step.

PASSING PARAMETERS
Again, we all know the basic idea of passed parameters, but let's review them just to be safe.
In general the procedure is given a parameter list, for example
PROCEDURE FOO(X, Y, Z)
In the declaration of a procedure, the parameters are called formal parameters, and may be
referred to in the body of the procedure by those names. The names used for the formal
parameters are really arbitrary. Only the position really counts. In the example above, the
name 'X' simply means "the first parameter" wherever it is used.
When a procedure is called, the "actual parameters" passed to it are associated with the for-
mal parameters, on a one-for-one basis.
The BNF for the syntax looks something like this:
<procedure> ::= PROCEDURE <ident> '(' <param-list> ')' <begin-block>
<param-list> ::= <parameter> ( ',' <parameter> )* | null
Similarly, the procedure call looks like:
<proc_call> ::= <ident> '(' <param-list> ')'

Note that there is already an implicit decision built into this syntax. Some languages, such
as Pascal and Ada, permit parameter lists to be optional. If there are no parameters, you
simply leave off the parens completely. Other languages, like C and Modula 2, require the
parens even if the list is empty. Clearly, the example we just finished corresponds to the
former point of view. But to tell the truth I prefer the latter. For procedures alone, the deci-
sion would seem to favor the "listless" approach. The statement
Initialize;
standing alone, can only mean a procedure call. In the parsers we've been writing, we've
made heavy use of parameterless procedures, and it would seem a shame to have to
write an empty pair of parens for each case.
But later on we're going to be using functions, too. And since functions can appear in the
same places as simple scalar identifiers, you can't tell the difference between the two.
You have to go back to the declarations to find out. Some folks consider this to be an
advantage. Their argument is that an identifier gets replaced by a value, and what do you
care whether it's done by substitution or by a function? But we sometimes _DO_ care,
because the function may be quite time-consuming. If, by writing a simple identifier into a
given expression, we can incur a heavy run-time penalty, it seems to me we ought to be
made aware of it.
Anyway, Niklaus Wirth designed both Pascal and Modula 2. I'll give him the benefit of the
doubt and assume that he had a good reason for changing the rules the second time
around!
Needless to say, it's an easy thing to accommodate either point of view as we design a
language, so this one is strictly a matter of personal preference. Do it whichever way you like
best.

Before we go any further, let's alter the translator to handle a (possibly empty) parameter list.
For now we won't generate any extra code ... just parse the syntax. The code for processing
the declaration has very much the same form we've seen before when dealing with VAR-lists:
{--------------------------------------------------------------}
{ Process the Formal Parameter List of a Procedure }
procedure FormalList;
begin
   Match('(');
   if Look <> ')' then begin
      FormalParam;
      while Look = ',' do begin
         Match(',');
         FormalParam;
      end;
   end;
   Match(')');
end;
{--------------------------------------------------------------}

Procedure DoProc needs to have a line added to call FormalList:
{--------------------------------------------------------------}
{ Parse and Translate a Procedure Declaration }
procedure DoProc;
var N: char;
begin
   Match('p');
   N := GetName;
   FormalList;
   Fin;
   if InTable(N) then Duplicate(N);
   ST[N] := 'p';
   PostLabel(N);
   BeginBlock;
   Return;
end;
{--------------------------------------------------------------}

For now, the code for FormalParam is just a dummy one that simply skips the parameter
name:
{--------------------------------------------------------------}
{ Process a Formal Parameter }
procedure FormalParam;
var Name: char;
begin
   Name := GetName;
end;
{--------------------------------------------------------------}
For the actual procedure call, there must be similar code to process the actual parameter list:
{--------------------------------------------------------------}
{ Process an Actual Parameter }
procedure Param;
var Name: char;
begin
   Name := GetName;
end;

{--------------------------------------------------------------}
{ Process the Parameter List for a Procedure Call }
procedure ParamList;
begin
   Match('(');
   if Look <> ')' then begin
      Param;
      while Look = ',' do begin
         Match(',');
         Param;
      end;
   end;
   Match(')');
end;
{--------------------------------------------------------------}
{ Process a Procedure Call }
procedure CallProc(Name: char);
begin
   ParamList;
   Call(Name);
end;
{--------------------------------------------------------------}

Note here that CallProc is no longer just a simple code generation routine. It has some struc-
ture to it. To handle this, I've renamed the code generation routine to just Call, and called it
from within CallProc.
OK, if you'll add all this code to your translator and try it out, you'll find that you can indeed
parse the syntax properly. I'll note in passing that there is _NO_ checking to make sure that
the number (and, later, types) of formal and actual parameters match up. In a production
compiler, we must of course do this. We'll ignore the issue now if for no other reason than that
the structure of our symbol table doesn't currently give us a place to store the necessary
information. Later on, we'll have a place for that data and we can deal with the issue then.
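Just to show what such a check could look like once there is somewhere to keep the data, here is a hedged Python sketch. The side table mapping a procedure name to its parameter count, and the names `parse_param_list`, `declare_proc` and `check_call`, are hypothetical; the tutorial's symbol table has no such field yet. Blanks are not handled, to keep it short.

```python
def parse_param_list(text, pos=0):
    """Parse '(' [name {',' name}] ')' starting at text[pos].

    Returns (names, next_pos). Names are single letters, as in the tutorial.
    """
    def expect(ch, at):
        if text[at:at + 1] != ch:
            raise SyntaxError("'" + ch + "' Expected")
        return at + 1

    def name_at(at):
        if not text[at:at + 1].isalpha():
            raise SyntaxError("Name Expected")
        return text[at].upper(), at + 1

    names = []
    pos = expect("(", pos)
    if text[pos:pos + 1] != ")":      # a possibly empty list
        name, pos = name_at(pos)
        names.append(name)
        while text[pos:pos + 1] == ",":
            name, pos = name_at(pos + 1)
            names.append(name)
    pos = expect(")", pos)
    return names, pos


def declare_proc(table, header):
    """Record the parameter count of a header such as 'pf(x,y)'."""
    if header[:1] != "p" or not header[1:2].isalpha():
        raise SyntaxError("Procedure header Expected")
    names, _ = parse_param_list(header, 2)
    table[header[1].upper()] = len(names)


def check_call(table, call):
    """Check that a call such as 'f(a,b)' passes the declared number of arguments."""
    if not call[:1].isalpha():
        raise SyntaxError("Name Expected")
    name = call[0].upper()
    if name not in table:
        raise SyntaxError("Undefined Identifier " + name)
    names, _ = parse_param_list(call, 1)
    if len(names) != table[name]:
        raise SyntaxError(name + " expects " + str(table[name]) +
                          " parameter(s), got " + str(len(names)))
    return names
```

After `declare_proc(table, "pf(x,y)")`, a call written `f(a,b)` passes the check, while `f(a)` is rejected. Checking parameter types would need still more information per entry.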

THE SEMANTICS OF PARAMETERS

So far we've dealt with the SYNTAX of parameter passing, and we've got the parsing
mechanisms in place to handle it. Next, we have to look at the SEMANTICS, i.e., the
actions to be taken when we encounter parameters. This brings us square up against the
issue of the different ways parameters can be passed.
There is more than one way to pass a parameter, and the way we do it can have a pro-
found effect on the character of the language. So this is another of those areas where I
can't just give you my solution. Rather, it's important that we spend some time looking at
the alternatives so that you can go another route if you choose to.
There are two main ways parameters are passed:
o By value
o By reference (address)
The differences are best seen in the light of a little history.
The old FORTRAN compilers passed all parameters by reference. In other words, what
was actually passed was the address of the parameter. This meant that the called sub-
routine was free to either read or write that parameter, as often as it chose to, just as
though it were a global variable. This was actually quite an efficient way to do things, and
it was pretty simple since the same mechanism was used in all cases, with one exception
that I'll get to shortly.
There were problems, though. Many people felt that this method created entirely too
much coupling between the called subroutine and its caller. In effect, it gave the subrou-
tine complete access to all variables that appeared in the parameter list.

Many times, we didn't want to actually change a parameter, but only use it as an input. For
example, we might pass an element count to a subroutine, and wish we could then use that
count within a DO-loop. To avoid changing the value in the calling program, we had to make a
local copy of the input parameter, and operate only on the copy. Some FORTRAN program-
mers, in fact, made it a practice to copy ALL parameters except those that were to be used as
return values. Needless to say, all this copying defeated a good bit of the efficiency associ-
ated with the approach.
There was, however, an even more insidious problem, which was not really just the fault of
the "pass by reference" convention, but a bad convergence of several implementation deci-
sions.
Suppose we have a subroutine:
SUBROUTINE FOO(X, Y, N)
where N is some kind of input count or flag. Many times, we'd like to be able to pass a literal
or even an expression in place of a variable, such as:
CALL FOO(A, B, J + 1)
Here the third parameter is not a variable, and so it has no address. The earliest FORTRAN
compilers did not allow such things, so we had to resort to subterfuges like:
K = J + 1
CALL FOO(A, B, K)
Here again, there was copying required, and the burden was on the programmer to do it. Not
good.
Later FORTRAN implementations got rid of this by allowing expressions as parameters.
What they did was to assign a compiler-generated variable, store the value of the expression
in the variable, and then pass the address of that variable.
So far, so good. Even if the subroutine mistakenly altered the anonymous variable, who was
to know or care? On the next call, it would be recalculated anyway.

The problem arose when someone decided to make things more efficient. They rea-
soned, rightly enough, that the most common kind of "expression" was a single integer
value, as in:
CALL FOO(A, B, 4)
It seemed inefficient to go to the trouble of "computing" such an integer and storing it in a
temporary variable, just to pass it through the calling list. Since we had to pass the
address of the thing anyway, it seemed to make lots of sense to just pass the address of
the literal integer, 4 in the example above.
To make matters more interesting, most compilers, then and now, identify all literals and
store them separately in a "literal pool," so that we only have to store one value for each
unique literal. That combination of design decisions: passing expressions, optimization
for literals as a special case, and use of a literal pool, is what led to disaster.
To see how it works, imagine that we call subroutine FOO as in the example above, passing
it a literal 4. Actually, what gets passed is the address of the literal 4, which is stored in
the literal pool. This address corresponds to the formal parameter, N, in the subroutine
itself.
Now suppose that, unbeknownst to the programmer, subroutine FOO actually modifies N
to be, say, -7. Suddenly, that literal 4 in the literal pool gets CHANGED, to a -7. From then
on, every expression that uses a 4 and every subroutine that passes a 4 will be using the
value of -7 instead! Needless to say, this can lead to some bizarre and difficult-to-find
behavior. The whole thing gave the concept of pass-by-reference a bad name, although
as we have seen, it was really a combination of design decisions that led to the problem.
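The failure is easy to reproduce in a toy model. The Python sketch below invents a `Machine` whose memory is a list of cells and whose literal pool keeps one shared cell per literal; none of these names come from any real FORTRAN compiler. Subroutine `foo` receives addresses, as in pass-by-reference, and overwrites its third parameter.

```python
class Machine:
    """Toy memory model: every value lives in a numbered cell."""

    def __init__(self):
        self.cells = []
        self.literal_pool = {}        # literal value -> address of its one shared cell

    def alloc(self, value):
        self.cells.append(value)
        return len(self.cells) - 1    # address of the new cell

    def literal(self, value):
        # Each distinct literal is stored once and reused everywhere.
        if value not in self.literal_pool:
            self.literal_pool[value] = self.alloc(value)
        return self.literal_pool[value]


def foo(m, x_addr, y_addr, n_addr):
    """A subroutine that uses its third by-reference parameter as scratch space."""
    m.cells[y_addr] = m.cells[x_addr] + m.cells[n_addr]
    m.cells[n_addr] = -7              # writes through the address it was handed


m = Machine()
a = m.alloc(10)
b = m.alloc(0)

# Safe scheme: copy the literal into a compiler-generated temporary first.
temp = m.alloc(m.cells[m.literal(4)])
foo(m, a, b, temp)
four_after_safe_call = m.cells[m.literal(4)]        # the pool still holds 4

# "Optimized" scheme: pass the pool address of the literal itself.
foo(m, a, b, m.literal(4))
four_after_optimized_call = m.cells[m.literal(4)]   # the pool entry for 4 now holds -7
```

After the second call, every later use of the literal 4 in this model reads -7, which is the behavior described above.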
In spite of the problem, the FORTRAN approach had its good points. Chief among them
is the fact that we don't have to support multiple mechanisms. The same scheme, pass-
ing the address of the argument, works for EVERY case, including arrays. So the size of
the compiler can be reduced.
Partly because of the FORTRAN gotcha, and partly just because of the reduced coupling
involved, modern languages like C, Pascal, Ada, and Modula 2 generally pass scalars by
value.

This means that the value of the scalar is COPIED into a separate value used only for the
call. Since the value passed is a copy, the called procedure can use it as a local variable and
modify it any way it likes. The value in the caller will not be changed.
It may seem at first that this is a bit inefficient, because of the need to copy the parameter.
But remember that we're going to have to fetch SOME value to pass anyway, whether it be
the parameter itself or an address for it. Inside the subroutine, using pass-by-value is defi-
nitely more efficient, since we eliminate one level of indirection. Finally, we saw earlier that
with FORTRAN, it was often necessary to make copies within the subroutine anyway, so
pass-by-value reduces the number of local variables. All in all, pass-by-value is better.
Except for one small detail: if all parameters are passed by value, there is no way for a
called procedure to return a result to its caller! The parameter passed is NOT altered in the
caller, only in the called procedure. Clearly, that won't get the job done.
There have been two answers to this problem, which are equivalent. In Pascal, Wirth pro-
vides for VAR parameters, which are passed-by-reference. What a VAR parameter is, in fact,
is none other than our old friend the FORTRAN parameter, with a new name and paint job for
disguise. Wirth neatly gets around the "changing a literal" problem as well as the "address of
an expression" problem, by the simple expedient of allowing only a variable to be the actual
parameter. In other words, it's the same restriction that the earliest FORTRANs imposed.
C does the same thing, but explicitly. In C, _ALL_ parameters are passed by value. One kind
of variable that C supports, however, is the pointer. So by passing a pointer by value, you in
effect pass what it points to by reference. In some ways this works even better yet, because
even though you can change the variable pointed to all you like, you still CAN'T change the
pointer itself. In a function such as strcpy, for example, where the pointers are incremented as
the string is copied, we are really only incrementing copies of the pointers, so the values of
those pointers in the calling procedure still remain as they were. To modify a pointer, you
must pass a pointer to the pointer.
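Here is a small sketch of that strcpy behavior, written in Python with a list standing in for memory and plain integers standing in for pointers. The memory layout and the variable names are assumptions made for the example; this is not the C library routine itself.

```python
# Toy memory: a flat list of cells; a "pointer" is just an index into it.
memory = list("hello\0") + ["?"] * 6   # source string at 0, destination at 6

def strcpy(dst, src):
    """Copy a NUL-terminated string, advancing local copies of the pointers."""
    while True:
        memory[dst] = memory[src]      # *dst = *src: changes the caller's data
        if memory[src] == "\0":
            break
        dst += 1                       # only the callee's copies move
        src += 1

src_ptr, dst_ptr = 0, 6
strcpy(dst_ptr, src_ptr)
copied = "".join(memory[6:11])
print(copied, src_ptr, dst_ptr)        # hello 0 6
```

The data pointed to has changed, but the caller's two pointers are exactly where they started.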
Since we are simply performing experiments here, we'll look at BOTH pass-by-value and
pass-by-reference. That way, we'll be able to use either one as we need to. It's worth men-
tioning that it's going to be tough to use the C approach to pointers here, since a pointer is a
different type and we haven't studied types yet!

PASS-BY-VALUE
Let's just try some simple-minded things and see where they lead us. Let's begin with the
pass-by-value case. Consider the procedure call:
FOO(X, Y)
Almost the only reasonable way to pass the data is through the CPU stack. So the code
we'd like to see generated might look something like this:
MOVE X(PC),-(SP) ; Push X
MOVE Y(PC),-(SP) ; Push Y
BSR FOO ; Call FOO
That certainly doesn't seem too complex!
When the BSR is executed, the CPU pushes the return address onto the stack and jumps
to FOO. At this point the stack will look like this:
Value of X (2 bytes)
Value of Y (2 bytes)
SP --> Return Address (4 bytes)
So the values of the parameters have addresses that are fixed offsets from the stack
pointer. In this example, the addresses are:
X: 6(SP)
Y: 4(SP)
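The same bookkeeping can be written down as a few lines of Python. This is just a sketch; the function name `offsets_after_call` and its defaults (2-byte parameters, 4-byte return address) are assumptions taken from the stack picture above.

```python
def offsets_after_call(params, word=2, return_size=4):
    """Return {name: offset from SP} once BSR has pushed the return address.

    params are listed in the order they are pushed (leftmost first).
    """
    offsets = {}
    offset = return_size                 # the return address sits at 0(SP)
    for name in reversed(params):        # the last value pushed is nearest SP
        offsets[name] = offset
        offset += word
    return offsets

print(offsets_after_call(["X", "Y"]))    # {'Y': 4, 'X': 6}
```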

Now consider what the called procedure might look like:
PROCEDURE FOO(A, B)
BEGIN
A = B
END
(Remember, the names of the formal parameters are arbitrary ... only the positions count.)
The desired output code might look like:
FOO: MOVE 4(SP),D0
MOVE D0,6(SP)
RTS
Note that, in order to address the formal parameters, we're going to have to know which posi-
tion they have in the parameter list. This means some changes to the symbol table stuff. In
fact, for our single-character case it's best to just create a new symbol table for the formal
parameters.
Let's begin by declaring a new table:
var Params: Array['A'..'Z'] of integer;
We also will need to keep track of how many parameters a given procedure has:
var NumParams: integer;

And we need to initialize the new table. Now, remember that the formal parameter list will
be different for each procedure that we process, so we'll need to initialize that table anew
for each procedure. Here's the initializer:
{--------------------------------------------------------------}
{ Initialize Parameter Table to Null }
procedure ClearParams;
var i: char;
begin
   for i := 'A' to 'Z' do
      Params[i] := 0;
   NumParams := 0;
end;
{--------------------------------------------------------------}

We'll put a call to this procedure in Init, and also at the end of DoProc:
{--------------------------------------------------------------}
{ Initialize }
procedure Init;
var i: char;
begin
   GetChar;
   SkipWhite;
   for i := 'A' to 'Z' do
      ST[i] := ' ';
   ClearParams;
end;
{--------------------------------------------------------------}

{--------------------------------------------------------------}
procedure DoProc;
var N: char;
begin
Match('p');
N := GetName;
FormalList;
Fin;
ST[N] := 'p';
PostLabel(N);
BeginBlock;
Return;
ClearParams;
end;
{--------------------------------------------------------------}
Note that the call within DoProc ensures that the table will be clear when we're in the
main program.

OK, now we need a few procedures to work with the table. The next few functions are essen-
tially copies of InTable, TypeOf, etc.:
{--------------------------------------------------------------}
{ Find the Parameter Number }
function ParamNumber(N: char): integer;
begin
ParamNumber := Params[N];
end;
{--------------------------------------------------------------}
{ See if an Identifier is a Parameter }
function IsParam(N: char): boolean;
begin
IsParam := Params[N] <> 0;
end;
{--------------------------------------------------------------}
{ Add a New Parameter to Table }
procedure AddParam(Name: char);
begin
if IsParam(Name) then Duplicate(Name);
Inc(NumParams);
Params[Name] := NumParams;
end;
{--------------------------------------------------------------}
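For readers who would like to poke at the table logic outside the compiler, here is a rough Python equivalent. The class name `ParamTable` and the use of an exception in place of the Duplicate error routine are my own assumptions; the behavior is meant to mirror the three routines above.

```python
class ParamTable:
    """Maps single-character names to 1-based positions; 0 means 'not a parameter'."""

    def __init__(self):
        self.clear()

    def clear(self):
        self.positions = {}
        self.num_params = 0

    def param_number(self, name):
        return self.positions.get(name, 0)

    def is_param(self, name):
        return self.param_number(name) != 0

    def add_param(self, name):
        if self.is_param(name):
            raise ValueError(f"Duplicate identifier {name}")
        self.num_params += 1
        self.positions[name] = self.num_params

table = ParamTable()
table.add_param("A")
table.add_param("B")
print(table.param_number("B"), table.is_param("X"))   # 2 False
```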

Finally, we need some code generation routines:
{--------------------------------------------------------------}
{ Load a Parameter to the Primary Register }
procedure LoadParam(N: integer);
var Offset: integer;
begin
Offset := 4 + 2 * (NumParams - N);
Emit('MOVE ');
WriteLn(Offset, '(SP),D0');
end;
{--------------------------------------------------------------}
{ Store a Parameter from the Primary Register }
procedure StoreParam(N: integer);
var Offset: integer;
begin
   Offset := 4 + 2 * (NumParams - N);
   Emit('MOVE D0,');
   WriteLn(Offset, '(SP)');
end;

{--------------------------------------------------------------}
{ Push The Primary Register to the Stack }
procedure Push;
begin
   EmitLn('MOVE D0,-(SP)');
end;
{--------------------------------------------------------------}
(The last routine is one we've seen before, but it wasn't in this vestigial version of the
program.)
With those preliminaries in place, we're ready to deal with the semantics of procedures with
calling lists (remember, the code to deal with the syntax is already in place).
Let's begin by processing a formal parameter. All we have to do is to add each parameter to
the parameter symbol table:
{--------------------------------------------------------------}
{ Process a Formal Parameter }
procedure FormalParam;
begin
AddParam(GetName);
end;
{--------------------------------------------------------------}

Now, what about dealing with a formal parameter when it appears in the body of the pro-
cedure? That takes a little more work. We must first determine that it IS a formal parame-
ter. To do this, I've written a modified version of TypeOf:
{--------------------------------------------------------------}
{ Get Type of Symbol }
function TypeOf(n: char): char;
begin
if IsParam(n) then
TypeOf := 'f'
else
TypeOf := ST[n];
end;
{--------------------------------------------------------------}
(Note that, since TypeOf now calls IsParam, it may need to be relocated in your source.)

We also must modify AssignOrProc to deal with this new type:
{--------------------------------------------------------------}
{ Decide if a Statement is an Assignment or Procedure Call }
procedure AssignOrProc;
var Name: char;
begin
Name := GetName;
case TypeOf(Name) of
' ': Undefined(Name);
'v', 'f': Assignment(Name);
'p': CallProc(Name);
else Abort('Identifier ' + Name + ' Cannot Be Used Here');
end;
end;
{--------------------------------------------------------------}

Finally, the code to process an assignment statement and an expression must be extended:
{--------------------------------------------------------------}
{ Parse and Translate an Expression }
{ Vestigial Version }
procedure Expression;
var Name: char;
begin
   Name := GetName;
   if IsParam(Name) then
      LoadParam(ParamNumber(Name))
   else
      LoadVar(Name);
end;

{--------------------------------------------------------------}
procedure Assignment(Name: char);
begin
Match('=');
Expression;
if IsParam(Name) then
StoreParam(ParamNumber(Name))
else
StoreVar(Name);
end;
{--------------------------------------------------------------}
As you can see, these procedures will treat every variable name encountered as either a for-
mal parameter or a global variable, depending on whether or not it appears in the parameter
symbol table. Remember that we are using only a vestigial form of Expression. In the final
program, the change shown here will have to be added to Factor, not Expression.

The rest is easy. We need only add the semantics to the actual procedure call, which we
can do with one new line of code:
{--------------------------------------------------------------}
procedure Param;
begin
Expression;
Push;
end;
{--------------------------------------------------------------}
That's it. Add these changes to your program and give it a try. Try declaring one or two
procedures, each with a formal parameter list. Then do some assignments, using combi-
nations of global and formal parameters. You can call one procedure from within another,
but you cannot DECLARE a nested procedure. You can even pass formal parameters
from one procedure to another. If we had the full syntax of the language here, you'd also
be able to do things like read or write formal parameters or use them in complicated
expressions.

WHAT'S WRONG?
At this point, you might be thinking: Surely there's more to this than a few pushes and pops.
There must be more to passing parameters than this.
You'd be right. As a matter of fact, the code that we're generating here leaves a lot to be
desired in several respects.
The most glaring oversight is that it's wrong! If you'll look back at the code for a procedure
call, you'll see that the caller pushes each actual parameter onto the stack before it calls the
procedure. The procedure USES that information, but it doesn't change the stack pointer.
That means that the stuff is still there when we return. SOMEBODY needs to clean up the
stack, or we'll soon be in very hot water!
Fortunately, that's easily fixed. All we have to do is to increment the stack pointer when we're
finished.
Should we do that in the calling program, or the called procedure? Some folks let the called
procedure clean up the stack, since that requires less code to be generated per call, and
since the procedure, after all, knows how many parameters it's got. But that means that it
must do something with the return address so as not to lose it.

I prefer letting the caller clean up, so that the callee need only execute a return. Also, it
seems a bit more balanced, since the caller is the one who "messed up" the stack in the
first place. But THAT means that the caller must remember how many items it pushed. To
make things easy, I've modified the procedure ParamList to be a function instead of a pro-
cedure, returning the number of bytes pushed:
{--------------------------------------------------------------}
{ Process the Parameter List for a Procedure Call }
function ParamList: integer;
var N: integer;
begin
   N := 0;
   Match('(');
   if Look <> ')' then begin
      Param;
      inc(N);
      while Look = ',' do begin
         Match(',');
         Param;
         inc(N);
      end;
   end;
   Match(')');
   ParamList := 2 * N;
end;
{--------------------------------------------------------------}

Procedure CallProc then uses this to clean up the stack:
{--------------------------------------------------------------}
{ Process a Procedure Call }
procedure CallProc(Name: char);
var N: integer;
begin
N := ParamList;
Call(Name);
CleanStack(N);
end;
{--------------------------------------------------------------}
Here I've created yet another code generation procedure:
{--------------------------------------------------------------}
{ Adjust the Stack Pointer Upwards by N Bytes }
procedure CleanStack(N: integer);
begin
if N > 0 then begin
Emit('ADD #');
WriteLn(N, ',SP');
end;
end;
{--------------------------------------------------------------}

OK, if you'll add this code to your compiler, I think you'll find that the stack is now under
control.
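To see the whole caller-side sequence in one place, here is a minimal Python sketch of a generator for it. The function `gen_call` is hypothetical, and it pushes global variables directly, as in the "code we'd like to see" earlier; the tutorial's compiler reaches the same stack state by loading D0 in Expression and then calling Push.

```python
def gen_call(name, args, word=2):
    """Emit a pass-by-value call with caller cleanup; args are global variable names."""
    lines = [f"MOVE {arg}(PC),-(SP)" for arg in args]   # push each actual parameter
    lines.append(f"BSR {name}")
    nbytes = word * len(args)
    if nbytes > 0:                                       # mirrors the N > 0 test in CleanStack
        lines.append(f"ADD #{nbytes},SP")
    return lines

print("\n".join(gen_call("FOO", ["X", "Y"])))
```

A call with an empty list produces only the BSR, so no useless `ADD #0,SP` is emitted.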
The next problem has to do with our way of addressing relative to the stack pointer. That
works fine in our simple examples, since with our rudimentary form of expressions
nobody else is messing with the stack. But consider a different example as simple as:
PROCEDURE FOO(A, B)
BEGIN
A = A + B
END
The code generated by a simple-minded parser might be:
FOO: MOVE 6(SP),D0 ; Fetch A
MOVE D0,-(SP) ; Push it
MOVE 4(SP),D0 ; Fetch B
ADD (SP)+,D0 ; Add A
MOVE D0,6(SP) ; Store A
RTS
This would be wrong. When we push the first argument onto the stack, the offsets for the
two formal parameters are no longer 4 and 6, but are 6 and 8. So the second fetch would
pick up the low word of the return address, not B.
This is not the end of the world. I think you can see that all we really have to do is to alter
the offset every time we do a push, and that in fact is what's done if the CPU has no sup-
port for other methods.
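For the curious, here is roughly what "alter the offset every time we do a push" might look like. This is a Python sketch under my own assumptions (the class `SPRelativeGen` and its method names are invented for the example); it is not the approach we'll actually take below.

```python
class SPRelativeGen:
    """Emit SP-relative parameter accesses, compensating for temporaries on the stack."""

    def __init__(self, num_params, word=2, return_size=4):
        self.num_params = num_params
        self.word = word
        self.return_size = return_size
        self.pushed = 0                      # bytes pushed since procedure entry
        self.lines = []

    def offset(self, n):
        base = self.return_size + self.word * (self.num_params - n)
        return base + self.pushed            # every push moves the parameters further from SP

    def load_param(self, n):
        self.lines.append(f"MOVE {self.offset(n)}(SP),D0")

    def store_param(self, n):
        self.lines.append(f"MOVE D0,{self.offset(n)}(SP)")

    def push(self):
        self.lines.append("MOVE D0,-(SP)")
        self.pushed += self.word

    def pop_add(self):
        self.lines.append("ADD (SP)+,D0")
        self.pushed -= self.word

gen = SPRelativeGen(num_params=2)   # PROCEDURE FOO(A, B): A is 1, B is 2
gen.load_param(1)                   # fetch A
gen.push()
gen.load_param(2)                   # fetch B, now two bytes further from SP
gen.pop_add()
gen.store_param(1)                  # A = A + B
print("\n".join(gen.lines))
```

Note that the fetch of B comes out as 6(SP) rather than 4(SP), because one word is sitting on the stack at that moment.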

Fortunately, though, the 68000 does have such support. Recognizing that this CPU would be
used a lot with high-order language compilers, Motorola decided to add direct support for this
kind of thing.
The problem, as you can see, is that as the procedure executes, the stack pointer bounces
up and down, and so it becomes an awkward thing to use as a reference to access the formal
parameters. The solution is to define some _OTHER_ register, and use it instead. This regis-
ter is typically set equal to the original stack pointer, and is called the frame pointer.
The 68000 instruction LINK lets you declare such a frame pointer, and sets it equal to the
stack pointer, all in one instruction. As a matter of fact, it does even more than that. Since this
register may have been in use for something else in the calling procedure, LINK also pushes
the current value of that register onto the stack. It can also add a value to the stack pointer, to
make room for local variables.
The complement of LINK is UNLK, which simply restores the stack pointer and pops the old
value back into the register.
Using these two instructions, the code for the previous example becomes:
FOO: LINK A6,#0
MOVE 10(A6),D0 ; Fetch A
MOVE D0,-(SP) ; Push it
MOVE 8(A6),D0 ; Fetch B
ADD (SP)+,D0 ; Add A
MOVE D0,10(A6) ; Store A
UNLK A6
RTS
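If you'd like to convince yourself that the frame pointer really does stay put, the following toy model may help. It is a Python sketch, not an emulator: the class `Stack68k`, the starting address and the parameter values 11 and 22 are all assumptions made for the demonstration, and only the stack effects of LINK and UNLK described above are modeled.

```python
class Stack68k:
    """Toy model of the 68000 stack: byte addresses, 2-byte words, 4-byte longs."""

    def __init__(self, top=0x1000):
        self.sp = top
        self.a6 = 0
        self.mem = {}                 # address -> (size in bytes, value)

    def push(self, value, size=2):
        self.sp -= size
        self.mem[self.sp] = (size, value)

    def pop(self):
        size, value = self.mem.pop(self.sp)
        self.sp += size
        return value

    def link(self, displacement=0):
        self.push(self.a6, size=4)    # save the caller's frame pointer
        self.a6 = self.sp             # frame pointer = current stack pointer
        self.sp += displacement       # a negative displacement reserves locals

    def unlk(self):
        self.sp = self.a6             # discard anything below the frame pointer
        self.a6 = self.pop()          # restore the caller's frame pointer

    def at_a6(self, offset):
        return self.mem[self.a6 + offset][1]

m = Stack68k()
m.push(11)                 # caller pushes X
m.push(22)                 # caller pushes Y
m.push("return", size=4)   # BSR pushes the return address
m.link()                   # LINK A6,#0
a = m.at_a6(10)            # A (that is, X) at 10(A6)
m.push(a)                  # a temporary moves SP but not A6
b = m.at_a6(8)             # B (that is, Y) is still at 8(A6)
total = m.pop() + b
m.unlk()                   # UNLK A6 leaves SP at the return address
print(a, b, total)         # 11 22 33
```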

Fixing the compiler to generate this code is a lot easier than it is to explain it. All we need
to do is to modify the code generation created by DoProc. Since that makes the code a lit-
tle more than one line, I've created new procedures to deal with it, paralleling the Prolog
and Epilog procedures called by DoMain:
{--------------------------------------------------------------}
{ Write the Prolog for a Procedure }
procedure ProcProlog(N: char);
begin
PostLabel(N);
EmitLn('LINK A6,#0');
end;
{--------------------------------------------------------------}
{ Write the Epilog for a Procedure }
procedure ProcEpilog;
begin
EmitLn('UNLK A6');
EmitLn('RTS');
end;
{--------------------------------------------------------------}

Procedure DoProc now just calls these:
{--------------------------------------------------------------}
procedure DoProc;
var N: char;
begin
Match('p');
N := GetName;
FormalList;
Fin;
ST[N] := 'p';
ProcProlog(N);
BeginBlock;
ProcEpilog;
ClearParams;
end;
{--------------------------------------------------------------}

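Finally, we need to change the references to SP in procedures LoadParam and StoreParam:
{--------------------------------------------------------------}
procedure LoadParam(N: integer);
var Offset: integer;
begin
   Offset := 8 + 2 * (NumParams - N);
   Emit('MOVE ');
   WriteLn(Offset, '(A6),D0');
end;
{--------------------------------------------------------------}
procedure StoreParam(N: integer);
var Offset: integer;
begin
   Offset := 8 + 2 * (NumParams - N);
   Emit('MOVE D0,');
   WriteLn(Offset, '(A6)');
end;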
{--------------------------------------------------------------}
(Note that the Offset computation changes to allow for the extra push of A6.)
That's all it takes. Try this out and see how you like it.
At this point we are generating some relatively nice code for procedures and procedure calls.
Within the limitation that there are no local variables (yet) and that no procedure nesting is
allowed, this code is just what we need.
There is still just one little small problem remaining:
WE HAVE NO WAY TO RETURN RESULTS TO THE CALLER!
But that, of course, is not a limitation of the code we're generating, but one inherent in the
call-by-value protocol. Notice that we CAN use formal parameters in any way inside the pro-
cedure. We can calculate new values for them, use them as loop counters (if we had loops,
that is!), etc. So the code is doing what it's supposed to. To get over this last problem, we
need to look at the alternative protocol.

CALL-BY-REFERENCE
This one is easy, now that we have the mechanisms already in place. We only have to
make a few changes to the code generation. Instead of pushing a value onto the stack,
we must push an address. As it turns out, the 68000 has an instruction, PEA, that does
just that.
We'll be making a new version of the test program for this. Before we do anything else,
>>>> MAKE A COPY <<<<
of the program as it now stands, because we'll be needing it again later.
Let's begin by looking at the code we'd like to see generated for the new case. Using the
same example as before, we need the call
FOO(X, Y)
to be translated to:
PEA X(PC) ; Push the address of X
PEA Y(PC) ; Push the address of Y
BSR FOO ; Call FOO

That's a simple matter of a slight change to Param:
{--------------------------------------------------------------}
procedure Param;
begin
EmitLn('PEA ' + GetName + '(PC)');
end;
{--------------------------------------------------------------}
(Note that with pass-by-reference, we can't have expressions in the calling list, so Param can
just read the name directly.)
At the other end, the references to the formal parameters must be given one level of indirec-
tion:
FOO: LINK A6,#0
MOVE.L 12(A6),A0 ; Fetch the address of A
MOVE (A0),D0 ; Fetch A
MOVE D0,-(SP) ; Push it
MOVE.L 8(A6),A0 ; Fetch the address of B
MOVE (A0),D0 ; Fetch B
ADD (SP)+,D0 ; Add A
MOVE.L 12(A6),A0 ; Fetch the address of A
MOVE D0,(A0) ; Store A
UNLK A6
RTS

All of this can be handled by changes to LoadParam and StoreParam:
{--------------------------------------------------------------}
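procedure LoadParam(N: integer);
var Offset: integer;
begin
   Offset := 8 + 4 * (NumParams - N);
   Emit('MOVE.L ');
   WriteLn(Offset, '(A6),A0');
   EmitLn('MOVE (A0),D0');
end;
{--------------------------------------------------------------}
procedure StoreParam(N: integer);
var Offset: integer;
begin
   Offset := 8 + 4 * (NumParams - N);
   Emit('MOVE.L ');
   WriteLn(Offset, '(A6),A0');
   EmitLn('MOVE D0,(A0)');
end;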
{--------------------------------------------------------------}

To get the count right, we must also change one line in ParamList:
ParamList := 4 * N;
That should do it. Give it a try and see if it's generating reasonable-looking code. As you will
see, the code is hardly optimal, since we reload the address register every time a parameter
is needed. But that's consistent with our KISS approach here, of just being sure to generate
code that works. We'll just make a little note here, that here's yet another candidate for opti-
mization, and press on.
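The difference between the two protocols shows up clearly in a small simulation. This Python sketch uses a dictionary as data memory; the addresses 0x100 and 0x102, the starting values, and the helper `ref_param_offset` (which assumes 4-byte addresses above an 8-byte frame header, to match the 12(A6) and 8(A6) in the listing) are all assumptions for illustration.

```python
# Toy data memory: address -> 16-bit value. The addresses are made up.
X_ADDR, Y_ADDR = 0x100, 0x102
memory = {X_ADDR: 5, Y_ADDR: 7}

def ref_param_offset(num_params, n):
    """Offset from A6 of the n-th pushed address (4 bytes each)."""
    return 8 + 4 * (num_params - n)

def foo_by_value(a, b):
    a = a + b                  # changes only the callee's copy

def foo_by_reference(a_addr, b_addr):
    d0 = memory[a_addr]        # MOVE.L 12(A6),A0 then MOVE (A0),D0
    d0 += memory[b_addr]       # fetch B through its address and add
    memory[a_addr] = d0        # MOVE.L 12(A6),A0 then MOVE D0,(A0)

foo_by_value(memory[X_ADDR], memory[Y_ADDR])
after_value = memory[X_ADDR]
foo_by_reference(X_ADDR, Y_ADDR)      # PEA X(PC) and PEA Y(PC)
after_reference = memory[X_ADDR]
print(after_value, after_reference)   # 5 12
print(ref_param_offset(2, 1), ref_param_offset(2, 2))   # 12 8
```

Only the by-reference version leaves a result behind in the caller's X.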
Now we've learned to process parameters using pass-by-value and pass-by-reference. In the
real world, of course, we'd like to be able to deal with BOTH methods. We can't do that yet,
though, because we have not yet had a session on types, and that has to come first.
If we can only have ONE method, then of course it has to be the good ol' FORTRAN method
of pass-by-reference, since that's the only way procedures can ever return values to their
caller.
This, in fact, will be one of the differences between TINY and KISS. In the next version of
TINY, we'll use pass-by-reference for all parameters. KISS will support both methods.

LOCAL VARIABLES
So far, we've said nothing about local variables, and our definition of procedures doesn't
allow for them. Needless to say, that's a big gap in our language, and one that needs to
be corrected.
Here again we are faced with a choice: Static or dynamic storage?
In those old FORTRAN programs, local variables were given static storage just like global
ones. That is, each local variable got a name and allocated address, like any other vari-
able, and was referenced by that name.
That's easy for us to do, using the allocation mechanisms already in place. Remember,
though, that local variables can have the same names as global ones. We need to some-
how deal with that by assigning unique names for these variables.
The characteristic of static storage, of course, is that the data survives a procedure call
and return. When the procedure is called again, the data will still be there. That can be an
advantage in some applications. In the FORTRAN days we used to do tricks like initialize
a flag, so that you could tell when you were entering a procedure for the first time and
could do any one-time initialization that needed to be done.
Of course, the same "feature" is also what makes recursion impossible with static stor-
age. Any new call to a procedure will overwrite the data already in the local variables.
The alternative is dynamic storage, in which storage is allocated on the stack just as for
passed parameters. We also have the mechanisms already for doing this. In fact, the
same routines that deal with passed (by value) parameters on the stack can easily deal
with local variables as well ... the code to be generated is the same. The offset in the
68000 LINK instruction is there for just that reason: we can use it to adjust the
stack pointer to make room for locals. Dynamic storage, of course, inherently supports
recursion.
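A quick experiment shows why static storage and recursion don't mix. In this Python sketch a module-level variable plays the part of a statically allocated local; the function names are hypothetical, and the example is deliberately contrived so that the local is needed again after the recursive call.

```python
_static_n = 0          # one fixed storage slot shared by every activation

def fact_static(n):
    """Factorial whose 'local' lives in static storage, FORTRAN-style."""
    global _static_n
    _static_n = n
    if _static_n <= 1:
        return 1
    sub = fact_static(_static_n - 1)   # the inner call overwrites _static_n
    return _static_n * sub             # so this uses the clobbered value

def fact_dynamic(n):
    """Factorial whose local lives in a fresh stack frame per call."""
    local_n = n
    if local_n <= 1:
        return 1
    return local_n * fact_dynamic(local_n - 1)

print(fact_static(4), fact_dynamic(4))   # 1 24
```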

When I first began planning TINY, I must admit to being prejudiced in favor of static storage.
That's simply because those old FORTRAN programs were pretty darned efficient ... the
early FORTRAN compilers produced a quality of code that's still rarely matched by modern
compilers. Even today, a given program written in FORTRAN is likely to outperform the same
program written in C or Pascal, sometimes by wide margins. (Whew! Am I going to hear
about THAT statement!)
I've always supposed that the reason had to do with the two main differences between FOR-
TRAN implementations and the others: static storage and pass-by-reference. I know that
dynamic storage supports recursion, but it's always seemed to me a bit peculiar to be willing
to accept slower code in the 95% of cases that don't need recursion, just to get that feature
when you need it. The idea is that, with static storage, you can use absolute addressing
rather than indirect addressing, which should result in faster code.
More recently, though, several folks have pointed out to me that there really is no perfor-
mance penalty associated with dynamic storage. With the 68000, for example, you shouldn't
use absolute addressing anyway ... most operating systems require position independent
code. And the 68000 instruction
MOVE 8(A6),D0
has exactly the same timing as
MOVE X(PC),D0.
So I'm convinced, now, that there is no good reason NOT to use dynamic storage.
Since this use of local variables fits so well into the scheme of pass-by-value parameters,
we'll use that version of the translator to illustrate it. (I _SURE_ hope you kept a copy!)
The general idea is to keep track of how many local variables there are. Then we use the
integer in the LINK instruction to adjust the stack pointer downward to make room for them.
Formal parameters are addressed as positive offsets from the frame pointer, and locals as
negative offsets. With a little bit of work, the same procedures we've already created can take
care of the whole thing.

Let's start by creating a new variable, Base:
var Base: integer;
We'll use this variable, instead of NumParams, to compute stack offsets. That means
changing the two references to NumParams in LoadParam and StoreParam:
{--------------------------------------------------------------}
procedure LoadParam(N: integer);
var Offset: integer;
begin
   Offset := 8 + 2 * (Base - N);
   Emit('MOVE ');
   WriteLn(Offset, '(A6),D0');
end;
{--------------------------------------------------------------}
procedure StoreParam(N: integer);
var Offset: integer;
begin
   Offset := 8 + 2 * (Base - N);
   Emit('MOVE D0,');
   WriteLn(Offset, '(A6)');
end;
{--------------------------------------------------------------}

The idea is that the value of Base will be frozen after we have processed the formal parame-
ters, and won't increase further as the new, local variables, are inserted in the symbol table.
This is taken care of at the end of FormalList:
{--------------------------------------------------------------}
{ Process the Formal Parameter List of a Procedure }
procedure FormalList;
begin
   Match('(');
   if Look <> ')' then begin
      FormalParam;
      while Look = ',' do begin
         Match(',');
         FormalParam;
      end;
   end;
   Match(')');
   Fin;
   Base := NumParams;
   NumParams := NumParams + 4;
end;
{--------------------------------------------------------------}

(We add four words to make allowances for the return address and old frame pointer,
which end up between the formal parameters and the locals.)
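It's worth checking the arithmetic once by hand, or by machine. The Python sketch below applies the same offset formula to a procedure with two formal parameters and two locals; the local names P and Q and the helper `frame_offset` are assumptions made for the example.

```python
def frame_offset(base, n):
    """Offset from A6 for table entry n, given base = number of formal parameters."""
    return 8 + 2 * (base - n)

base = 2                     # PROCEDURE FOO(A, B)
num_params = base + 4        # skip four words: return address plus saved A6
params = {name: frame_offset(base, i)
          for i, name in enumerate("AB", start=1)}
local_vars = {name: frame_offset(base, num_params + i)
              for i, name in enumerate("PQ", start=1)}
print(params, local_vars)    # {'A': 10, 'B': 8} {'P': -2, 'Q': -4}
```

The four skipped entries are exactly what carries the numbering across the return address and saved frame pointer, so the first local lands at -2(A6).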
About all we need to do next is to install the semantics for declaring local variables into
the parser. The routines are very similar to Decl and TopDecls:
{--------------------------------------------------------------}
{ Parse and Translate a Local Data Declaration }
procedure LocDecl;
var Name: char;
begin
Match('v');
AddParam(GetName);
Fin;
end;

{--------------------------------------------------------------}
{ Parse and Translate Local Declarations }
function LocDecls: integer;
var n: integer;
begin
n := 0;
while Look = 'v' do begin
LocDecl;
inc(n);
end;
LocDecls := n;
end;
{--------------------------------------------------------------}
Note that LocDecls is a FUNCTION, returning the number of locals to DoProc.

Next, we modify DoProc to use this information:
{--------------------------------------------------------------}
procedure DoProc;
var N: char;
k: integer;
begin
Match('p');
N := GetName;
ST[N] := 'p';
FormalList;
k := LocDecls;
ProcProlog(N, k);
BeginBlock;
ProcEpilog;
ClearParams;
end;
{--------------------------------------------------------------}
(I've made a couple of changes here that weren't really necessary. Aside from rearranging
things a bit, I moved the call to Fin to within FormalList, and placed one inside LocDecl
as well. Don't forget to put one at the end of FormalList, so that we're together here.)

Note the change in the call to ProcProlog. The new argument is the number of WORDS (not
bytes) to allocate space for. Here's the new version of ProcProlog:
{--------------------------------------------------------------}
{ Write the Prolog for a Procedure }
procedure ProcProlog(N: char; k: integer);
begin
PostLabel(N);
Emit('LINK A6,#');
WriteLn(-2 * k)
end;
{--------------------------------------------------------------}
That should do it. Add these changes and see how they work.

CONCLUSION
At this point you know how to compile procedure declarations and procedure calls, with
parameters passed by reference and by value. You can also handle local variables. As
you can see, the hard part is not in providing the mechanisms, but in deciding just which
mechanisms to use. Once we make these decisions, the code to translate the constructs
is really not that difficult. I didn't show you how to deal with the combination of local
variables and pass-by-reference parameters, but that's a straightforward extension to
what you've already seen. It just gets a little more messy, that's all, since we need to sup-
port both mechanisms instead of just one at a time. I'd prefer to save that one until after
we've dealt with ways to handle different variable types.
That will be the next installment, which will be coming soon to a Forum near you. See you
then.

Part 14 - Types
INTRODUCTION
In the last installment (Part XIII: PROCEDURES) I mentioned that in that part and this one,
we would cover the two features that tend to separate the toy language from a real, usable
one. We covered procedure calls in that installment. Many of you have been waiting patiently,
since August '89, for me to drop the other shoe. Well, here it is.
In this installment, we'll talk about how to deal with different data types. As I did in the last
segment, I will NOT incorporate these features directly into the TINY compiler at this time.
Instead, I'll be using the same approach that has worked so well for us in the past: using only
fragments of the parser and single-character tokens. As usual, this allows us to get directly to
the heart of the matter without having to wade through a lot of unnecessary code. Since the
major problems in dealing with multiple types occur in the arithmetic operations, that's where
we'll concentrate our focus.
A few words of warning: First, there are some types that I will NOT be covering in this install-
ment. Here we will ONLY be talking about the simple, predefined types. We won't even deal
with arrays, pointers or strings in this installment; I'll be covering them in the next few.
Second, we also will not discuss user-defined types. That will not come until much later, for
the simple reason that I still haven't convinced myself that user-defined types belong in a lan-
guage named KISS. In later installments, I do intend to cover at least the general concepts of
user-defined types, records, etc., just so that the series will be complete. But whether or not
they will be included as part of KISS is still an open issue. I am open to comments or sugges-
tions on this question.

Finally, I should warn you: what we are about to do CAN add considerable extra compli-
cation to both the parser and the generated code. Handling variables of different types is
straightforward enough. The complexity comes in when you add rules about conversion
between types. In general, you can make the compiler as simple or as complex as you
choose to make it, depending upon the way you define the type-conversion rules. Even if
you decide not to allow ANY type conversions (as in Ada, for example) the problem is still
there, and is built into the mathematics. When you multiply two short numbers, for exam-
ple, you can get a long result.
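A two-line experiment makes the point. This Python sketch assumes "short" means a signed 16-bit integer; the helper `to_int16` is hypothetical and simply imitates what is left when a result is squeezed back into 16 bits.

```python
def to_int16(value):
    """Wrap an integer into the signed 16-bit range, as a 16-bit register would."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

a, b = 300, 300
exact = a * b                 # 90000 needs more than 16 bits
wrapped = to_int16(exact)     # what survives if the result is kept in 16 bits
print(exact, wrapped)         # 90000 24464
```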
I've approached this problem very carefully, in an attempt to Keep It Simple. But we can't
avoid the complexity entirely. As has so often happened, we end up having to trade
code quality against complexity, and as usual I will tend to opt for the simplest approach.

WHAT'S COMING NEXT?

Before diving into the tutorial, I think you'd like to know where we are going from here ...
especially since it's been so long since the last installment.
I have not been idle in the meantime. What I've been doing is reorganizing the compiler itself
into Turbo Units. One of the problems I've encountered is that as we've covered new areas
and thereby added features to the TINY compiler, it's been getting longer and longer. I real-
ized a couple of installments back that this was causing trouble, and that's why I've gone
back to using only compiler fragments for the last installment and this one. The problem is
that it just seems dumb to have to reproduce the code for, say, processing boolean exclusive
OR's, when the subject of the discussion is parameter passing.
The obvious way to have our cake and eat it, too, is to break up the compiler into separately
compilable modules, and of course the Turbo Unit is an ideal vehicle for doing this. This
allows us to hide some fairly complex code (such as the full arithmetic and boolean expres-
sion parsing) into a single unit, and just pull it in whenever it's needed. In that way, the only
code I'll have to reproduce in these installments will be the code that actually relates to the
issue under discussion.
I've also been toying with Turbo 5.5, which of course includes the Borland object-oriented
extensions to Pascal. I haven't decided whether to make use of these features, for two rea-
sons. First of all, many of you who have been following this series may still not have 5.5, and
I certainly don't want to force anyone to have to go out and buy a new compiler just to com-
plete the series. Secondly, I'm not convinced that the O-O extensions have all that much
value for this application. We've been having some discussions about that in CompuServe's
CLM forum, and so far we've not found any compelling reason to use O-O constructs. This is
another of those areas where I could use some feedback from you readers. Anyone want to
vote for Turbo 5.5 and O-O?

In any case, after the next few installments in the series, the plan is to upload to you a
complete set of Units, and complete functioning compilers as well. The plan, in fact, is to
have THREE compilers: One for a single-character version of TINY (to use for our exper-
iments), one for TINY and one for KISS. I've pretty much isolated the differences between
TINY and KISS, which are these:
o TINY will support only two data types: The character and the 16-bit integer. I may also
try to do something with strings, since without them a compiler would be pretty useless.
KISS will support all the usual simple types, including arrays and even floating point.
o TINY will only have two control constructs, the IF and the WHILE. KISS will support a
very rich set of constructs, including one we haven't discussed here before ... the CASE.
o KISS will support separately compilable modules.
One caveat: Since I still don't know much about 80x86 assembler language, all these
compiler modules will still be written to support 68000 code. However, for the programs I
plan to upload, all the code generation has been carefully encapsulated into a single unit,
so that any enterprising student should be able to easily retarget to any other processor.
This task is "left as an exercise for the student." I'll make an offer right here and now: For
the person who provides us the first robust retarget to 80x86, I will be happy to discuss
shared copyrights and royalties from the book that's upcoming.
But enough talk. Let's get on with the study of types. As I said earlier, we'll do this one as
we did in the last installment: by performing experiments using single-character tokens.

THE SYMBOL TABLE

It should be apparent that, if we're going to deal with variables of different types, we're going
to need someplace to record what those types are. The obvious vehicle for that is the symbol
table, and we've already used it that way to distinguish, for example, between local and glo-
bal variables, and between variables and procedures.
The symbol table structure for single-character tokens is particularly simple, and we've used
it several times before. To deal with it, we'll steal some procedures that we've used before.
First, we need to declare the symbol table itself:
{--------------------------------------------------------------}
ST: Array['A'..'Z'] of char; { *** ADD THIS LINE ***}
{--------------------------------------------------------------}
Next, we need to make sure it's initialized as part of procedure Init:
{--------------------------------------------------------------}
{ Initialize }
procedure Init;
var i: char;
begin
   for i := 'A' to 'Z' do
      ST[i] := '?';
   GetChar;
end;
{--------------------------------------------------------------}

We don't really need the next procedure, but it will be helpful for debugging. All it does is
to dump the contents of the symbol table:
{--------------------------------------------------------------}
{ Dump the Symbol Table }
procedure DumpTable;
var i: char;
begin
   for i := 'A' to 'Z' do
      WriteLn(i, ' ', ST[i]);
end;
{--------------------------------------------------------------}
It really doesn't matter much where you put this procedure ... I plan to cluster all the sym-
bol table routines together, so I put mine just after the error reporting procedures.

If you're the cautious type (as I am), you might want to begin with a test program that does
nothing but initializes, then dumps the table. Just to be sure that we're all on the same wave-
length here, I'm reproducing the entire program below, complete with the new procedures.
Note that this version includes support for white space:
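{--------------------------------------------------------------}
program Types;
{--------------------------------------------------------------}
{ Constant Declarations }
const TAB = ^I;
      CR  = ^M;
      LF  = ^J;
{--------------------------------------------------------------}
{ Variable Declarations }
var Look: char;              { Lookahead Character }
    ST: Array['A'..'Z'] of char;
{--------------------------------------------------------------}
{ Read New Character From Input Stream }
procedure GetChar;
begin
   Read(Look);
end;
{--------------------------------------------------------------}
{ Report an Error }
procedure Error(s: string);
begin
   WriteLn;
   WriteLn(^G, 'Error: ', s, '.');
end;
{--------------------------------------------------------------}
{ Report Error and Halt }
procedure Abort(s: string);
begin
   Error(s);
   Halt;
end;
{--------------------------------------------------------------}
{ Report What Was Expected }
procedure Expected(s: string);
begin
   Abort(s + ' Expected');
end;
{--------------------------------------------------------------}
{ Dump the Symbol Table }
procedure DumpTable;
var i: char;
begin
   for i := 'A' to 'Z' do
      WriteLn(i, ' ', ST[i]);
end;
{--------------------------------------------------------------}
{ Recognize an Alpha Character }
function IsAlpha(c: char): boolean;
begin
   IsAlpha := UpCase(c) in ['A'..'Z'];
end;
{--------------------------------------------------------------}
{ Recognize a Decimal Digit }
function IsDigit(c: char): boolean;
begin
   IsDigit := c in ['0'..'9'];
end;
{--------------------------------------------------------------}
{ Recognize an AlphaNumeric Character }
function IsAlNum(c: char): boolean;
begin
   IsAlNum := IsAlpha(c) or IsDigit(c);
end;
{--------------------------------------------------------------}
{ Recognize an Addop }
function IsAddop(c: char): boolean;
begin
   IsAddop := c in ['+', '-'];
end;
{--------------------------------------------------------------}
{ Recognize a Mulop }
function IsMulop(c: char): boolean;
begin
   IsMulop := c in ['*', '/'];
end;
{--------------------------------------------------------------}
{ Recognize a Boolean Orop }
function IsOrop(c: char): boolean;
begin
   IsOrop := c in ['|', '~'];
end;
{--------------------------------------------------------------}
{ Recognize a Relop }
function IsRelop(c: char): boolean;
begin
   IsRelop := c in ['=', '#', '<', '>'];
end;
{--------------------------------------------------------------}
{ Recognize White Space }
function IsWhite(c: char): boolean;
begin
   IsWhite := c in [' ', TAB];
end;
{--------------------------------------------------------------}
{ Skip Over Leading White Space }
procedure SkipWhite;
begin
   while IsWhite(Look) do
      GetChar;
end;
{--------------------------------------------------------------}
{ Skip Over an End-of-Line }
procedure Fin;
begin
   if Look = CR then begin
      GetChar;
      if Look = LF then
         GetChar;
   end;
end;
{--------------------------------------------------------------}
{ Match a Specific Input Character }
procedure Match(x: char);
begin
   if Look = x then GetChar
   else Expected('''' + x + '''');
   SkipWhite;
end;
{--------------------------------------------------------------}
{ Get an Identifier }
function GetName: char;
begin
   if not IsAlpha(Look) then Expected('Name');
   GetName := UpCase(Look);
   GetChar;
   SkipWhite;
end;
{--------------------------------------------------------------}
{ Get a Number }
function GetNum: char;
begin
   if not IsDigit(Look) then Expected('Integer');
   GetNum := Look;
   GetChar;
   SkipWhite;
end;
{--------------------------------------------------------------}
{ Output a String with Tab }
procedure Emit(s: string);
begin
   Write(TAB, s);
end;
{--------------------------------------------------------------}
{ Output a String with Tab and CRLF }
procedure EmitLn(s: string);
begin
   Emit(s);
   WriteLn;
end;
{--------------------------------------------------------------}
{ Initialize }
procedure Init;
var i: char;
begin
   for i := 'A' to 'Z' do
      ST[i] := '?';
   GetChar;
   SkipWhite;
end;
{--------------------------------------------------------------}
{ Main Program }
begin
   Init;
   DumpTable;
end.
{--------------------------------------------------------------}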
OK, run this program. You should get a (very fast) printout of all the letters of the alphabet
(potential identifiers), each followed by a question mark. Not very exciting, but it's a start.
Of course, in general we only want to see the types of the variables that have been
defined. We can eliminate the others by modifying DumpTable with an IF test. Change the
loop to read:
for i := 'A' to 'Z' do
   if ST[i] <> '?' then
      WriteLn(i, ' ', ST[i]);
Now, run the program again. What did you get?

Well, that's even more boring than before! There was no output at all, since at this point
NONE of the names have been declared. We can spice things up a bit by inserting some
statements declaring some entries in the main program. Try these:
ST['A'] := 'a';
ST['P'] := 'b';
ST['X'] := 'c';
This time, when you run the program, you should get an output showing that the symbol table
is working right.

ADDING ENTRIES
Of course, writing to the table directly is pretty poor practice, and not one that will help us
much later. What we need is a procedure to add entries to the table. At the same time, we
know that we're going to need to test the table, to make sure that we aren't redeclaring a
variable that's already in use (easy to do with only 26 choices!). To handle all this, enter
the following new procedures:
{--------------------------------------------------------------}
{ Report Type of a Variable }
function TypeOf(N: char): char;
begin
TypeOf := ST[N];
end;
{--------------------------------------------------------------}
{ Report if a Variable is in the Table }
function InTable(N: char): boolean;
begin
InTable := TypeOf(N) <> '?';
end;

{--------------------------------------------------------------}
{ Check for a Duplicate Variable Name }
procedure CheckDup(N: char);
begin
if InTable(N) then Abort('Duplicate Name ' + N);
end;
{--------------------------------------------------------------}
{ Add Entry to Table }
procedure AddEntry(N, T: char);
begin
CheckDup(N);
ST[N] := T;
end;
{--------------------------------------------------------------}

Now change the three lines in the main program to read:
AddEntry('A', 'a');
AddEntry('P', 'b');
AddEntry('X', 'c');
and run the program again. Did it work? Then we have the symbol table routines needed
to support our work on types. In the next section, we'll actually begin to use them.

ALLOCATING STORAGE
In other programs like this one, including the TINY compiler itself, we have already addressed
the issue of declaring global variables, and the code generated for them. Let's build a vestigial
version of a "compiler" here, whose only function is to allow us to declare variables.
Remember, the syntax for a declaration is:
<data decl> ::= VAR <identifier>
Again, we can lift a lot of the code from previous programs. The following are stripped-down
versions of those procedures. They are greatly simplified since I have eliminated niceties like
variable lists and initializers. In procedure Alloc, note that the new call to AddEntry will also
take care of checking for duplicate declarations:
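{--------------------------------------------------------------}
{ Allocate Storage for a Variable }
procedure Alloc(N: char);
begin
   AddEntry(N, 'v');
   WriteLn(N, ':', TAB, 'DC 0');
end;
{--------------------------------------------------------------}
{ Parse and Translate a Data Declaration }
procedure Decl;
var Name: char;
begin
   Match('v');
   Alloc(GetName);
end;
{--------------------------------------------------------------}
{ Parse and Translate Global Declarations }
procedure TopDecls;
begin
   while Look <> '.' do begin
      case Look of
        'v': Decl;
      else Abort('Unrecognized Keyword ' + Look);
      end;
      Fin;
   end;
end;
{--------------------------------------------------------------}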

Now, in the main program, add a call to TopDecls and run the program. Try allocating a few
variables, and note the resulting code generated. This is old stuff for you, so the results
should look familiar. Note from the code for TopDecls that the program is ended by a termi-
nating period.
While you're at it, try declaring two variables with the same name, and verify that the parser
catches the error.

DECLARING TYPES
Allocating storage of different sizes is as easy as modifying procedure TopDecls to recog-
nize more than one keyword. There are a number of decisions to be made here, in terms
of what the syntax should be, etc., but for now I'm going to duck all the issues and simply
declare by executive fiat that our syntax will be:
<data decl> ::= <typename> <identifier>
where:
<typename> ::= BYTE | WORD | LONG
(By an amazing coincidence, the first letters of these names happen to be the same as
the 68000 assembly code length specifications, so this choice saves us a little work.)
We can create the code to take care of these declarations with only slight modifications.
In the routines below, note that I've separated the code generation parts of Alloc from the
logic parts. This is in keeping with our desire to encapsulate the machine-dependent part
of the compiler.
{--------------------------------------------------------------}
{ Generate Code for Allocation of a Variable }
procedure AllocVar(N, T: char);
begin
WriteLn(N, ':', TAB, 'DC.', T, ' 0');
end;

{--------------------------------------------------------------}
{ Allocate Storage for a Variable }
procedure Alloc(N, T: char);
begin
   AddEntry(N, T);
   AllocVar(N, T);
end;
{--------------------------------------------------------------}
{ Parse and Translate a Data Declaration }
procedure Decl;
var Typ: char;
begin
   Typ := GetName;
   Alloc(GetName, Typ);
end;
{--------------------------------------------------------------}
{ Parse and Translate Global Declarations }
procedure TopDecls;
begin
   while Look <> '.' do begin
      case Look of
        'b', 'w', 'l': Decl;
      else Abort('Unrecognized Keyword ' + Look);
      end;
      Fin;
   end;
end;
{--------------------------------------------------------------}
Make the changes shown to these procedures, and give the thing a try. Use the single
characters 'b', 'w', and 'l' for the keywords (they must be lower case, for now). You will see
that in each case, we are allocating the proper storage size. Note from the dumped sym-
bol table that the sizes are also recorded for later use. What later use? Well, that's the
subject of the rest of this installment.
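As a quick check on what to expect, the sketch below calls a copy of AllocVar once for each size. It is a standalone program, assuming a Pascal compiler such as Free Pascal or Turbo Pascal; the parser is left out, the program name AllocDemo is made up, and the three calls simply stand in for the declarations ba, wb, and lc.

```pascal
program AllocDemo;
{ Standalone sketch: shows the storage directives that AllocVar emits.
  The parser is omitted; the calls below stand in for the
  declarations "ba", "wb" and "lc". }

const TAB = ^I;

{ Generate Code for Allocation of a Variable (copied from the text) }
procedure AllocVar(N, T: char);
begin
   WriteLn(N, ':', TAB, 'DC.', T, ' 0');
end;

begin
   AllocVar('A', 'B');   { byte a }
   AllocVar('B', 'W');   { word b }
   AllocVar('C', 'L');   { long c }
end.
```

Each line of output is a label followed by a DC.B, DC.W, or DC.L directive, so the type letter stored in the symbol table doubles as the assembler size suffix.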

ASSIGNMENTS
Now that we can declare variables of different sizes, it stands to reason that we ought to be
able to do something with them. For our first trick, let's just try loading them into our working
register, D0. It makes sense to use the same idea we used for Alloc; that is, make a load procedure
that can load more than one size. We also want to continue to encapsulate the
machine-dependent stuff. The load procedure looks like this:
{---------------------------------------------------------------}
procedure LoadVar(Name, Typ: char);
begin
Move(Typ, Name + '(PC)', 'D0');
end;
{---------------------------------------------------------------}
On the 68000, at least, it happens that many instructions turn out to be MOVE's. It turns out
to be useful to create a separate code generator just for these instructions, and then call it as
needed:
{---------------------------------------------------------------}
{ Generate a Move Instruction }
procedure Move(Size: char; Source, Dest: String);
begin
EmitLn('MOVE.' + Size + ' ' + Source + ',' + Dest);
end;
{---------------------------------------------------------------}

Note that these two routines are strictly code generators; they have no error-checking or
other logic. To complete the picture, we need one more layer of software that provides
these functions.
First of all, we need to make sure that the type we are dealing with is a loadable type.
This sounds like a job for another recognizer:
{--------------------------------------------------------------}
{ Recognize a Legal Variable Type }
function IsVarType(c: char): boolean;
begin
IsVarType := c in ['B', 'W', 'L'];
end;
{--------------------------------------------------------------}

Next, it would be nice to have a routine that will fetch the type of a variable from the symbol
table, while checking it to make sure it's valid:
{--------------------------------------------------------------}
{ Get a Variable Type from the Symbol Table }
function VarType(Name: char): char;
var Typ: char;
begin
Typ := TypeOf(Name);
if not IsVarType(Typ) then Abort('Identifier ' + Name +
' is not a variable');
VarType := Typ;
end;
{--------------------------------------------------------------}
Armed with these tools, a procedure to cause a variable to be loaded becomes trivial:
{--------------------------------------------------------------}
procedure Load(Name: char);
begin
LoadVar(Name, VarType(Name));
end;
{--------------------------------------------------------------}

(NOTE to the concerned: I know, I know, all this is very inefficient. In a production program,
we probably would take steps to avoid such deep nesting of procedure calls. Don't
worry about it. This is an EXERCISE, remember? It's more important to get it right and
understand it, than it is to make it get the wrong answer, quickly. If you get your compiler
completed and find that you're unhappy with the speed, feel free to come back and hack
the code to speed it up!)
It would be a good idea to test the program at this point. Since we don't have a procedure
for dealing with assignments yet, I just added the lines:
Load('A');
Load('B');
Load('C');
Load('X');
to the main program. Thus, after the declaration section is complete, they will be exe-
cuted to generate code for the loads. You can play around with this, and try different com-
binations of declarations to see how the errors are handled.
I'm sure you won't be surprised to learn that storing variables is a lot like loading them.
The necessary procedures are shown next:
{---------------------------------------------------------------}
{ Store Primary to Variable }
procedure StoreVar(Name, Typ: char);
begin
   EmitLn('LEA ' + Name + '(PC),A0');
   Move(Typ, 'D0', '(A0)');
end;
{--------------------------------------------------------------}
{ Store a Variable from the Primary Register }
procedure Store(Name: char);
begin
StoreVar(Name, VarType(Name));
end;
{--------------------------------------------------------------}
You can test this one the same way as the loads.
Now, of course, it's a RATHER small step to use these to handle assignment statements.
What we'll do is to create a special version of procedure Block that supports only assignment
statements, and also a special version of Expression that only supports single variables as
legal expressions. Here they are:

{---------------------------------------------------------------}
{ Parse and Translate an Expression }
procedure Expression;
var Name: char;
begin
   Load(GetName);
end;
{--------------------------------------------------------------}
{ Parse and Translate an Assignment Statement }
procedure Assignment;
var Name: char;
begin
   Name := GetName;
   Match('=');
   Expression;
   Store(Name);
end;
{--------------------------------------------------------------}
{ Parse and Translate a Block of Statements }
procedure Block;
begin
   while Look <> '.' do begin
      Assignment;
      Fin;
   end;
end;
{--------------------------------------------------------------}
(It's worth noting that the new procedures that permit us to manipulate types are,
if anything, even simpler and cleaner than what we've seen before. This is mostly thanks to
our efforts to encapsulate the code generator procedures.)
There is one small, nagging problem. Before, we used the Pascal terminating period to get us
out of procedure TopDecls. This is now the wrong character ... it's used to terminate Block. In
previous programs, we've used the BEGIN symbol (abbreviated 'b') to get us out. But that is
now used as a type symbol.
The solution, while somewhat of a kludge, is easy enough. We'll use an UPPER CASE 'B' to
stand for the BEGIN. So change the character in the WHILE loop within TopDecls, from '.' to
'B', and everything will be fine.

Now, we can complete the task by changing the main program to read:
{--------------------------------------------------------------}
{ Main Program }
begin
Init;
TopDecls;
Match('B');
Fin;
Block;
DumpTable;
end.
{--------------------------------------------------------------}
(Note that I've had to sprinkle a few calls to Fin around to get us out of Newline troubles.)

OK, run this program. Try the input:
ba { byte a } *** DON'T TYPE THE COMMENTS!!! ***
wb { word b }
lc { long c }
B { begin }
a=a
a=b
a=c
b=a
b=b
b=c
c=a
c=b
c=c
For each declaration, you should get code generated that allocates storage. For each assign-
ment, you should get code that loads a variable of the correct size, and stores one, also of
the correct size.
There's only one small little problem: The generated code is WRONG!

Look at the code for a=c above. The code is:
MOVE.L C(PC),D0
LEA A(PC),A0
MOVE.B D0,(A0)
This code is correct. It will cause the lower eight bits of C to be stored into A, which is a
reasonable behavior. It's about all we can expect to happen.
But now, look at the opposite case. For c=a, the code generated is:
MOVE.B A(PC),D0
LEA C(PC),A0
MOVE.L D0,(A0)
This is NOT correct. It will cause the byte variable A to be stored into the lower eight bits
of D0. According to the rules for the 68000 processor, the upper 24 bits are unchanged.
This means that when we store the entire 32 bits into C, whatever garbage that was in
those high bits will also get stored. Not good.
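To make the failure concrete, here is a small simulation sketch in Pascal (it is not 68000 code). It assumes a compiler with a 32-bit LongInt, such as Free Pascal or Turbo Pascal. The names StaleBitsDemo and MoveB are made up for the illustration, and the starting value of D0 is an arbitrary stand-in for whatever an earlier instruction left behind.

```pascal
program StaleBitsDemo;
{ Simulation sketch, not 68000 code: D0 is modelled as a LongInt.
  MoveB mimics MOVE.B, which replaces only the low eight bits. }

var
   D0: LongInt;   { stand-in for register D0 }
   A:  Byte;      { byte variable a }
   C:  LongInt;   { long variable c }

{ Model of MOVE.B <src>,D0: the upper 24 bits are left alone }
procedure MoveB(Src: Byte);
begin
   { strip the old low byte, then put the new one in its place }
   D0 := D0 - (D0 and $FF) + Src;
end;

begin
   D0 := $12345678;   { assumed leftovers from earlier code }
   A  := 5;
   MoveB(A);          { MOVE.B A(PC),D0 }
   C := D0;           { LEA C(PC),A0 followed by MOVE.L D0,(A0) }
   WriteLn('a = ', A);
   WriteLn('c = ', C, '   (wanted 5)');
end.
```

With these values the program should report c as 305419781 (hex 12345605) rather than 5: the low byte is right, and the upper 24 bits are stale.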
So what we have run into here, early on, is the issue of TYPE CONVERSION, or COER-
CION.
Before we do anything with variables of different types, even if it's just to copy them, we
have to face up to the issue. It is not the easiest part of a compiler. Most of the bugs I
have seen in production compilers have had to do with errors in type conversion for some
obscure combination of arguments. As usual, there is a tradeoff between compiler com-
plexity and the potential quality of the generated code, and as usual, we will take the path
that keeps the compiler simple. I think you'll find that, with this approach, we can keep the
potential complexity in check rather nicely.

THE COWARD'S WAY OUT

Before we get into the details (and potential complexity) of type conversion, I'd like you to see
that there is one super-simple way to solve the problem: simply promote every variable to a
long integer when we load it!
This takes the addition of only one line to LoadVar, although if we are not going to COM-
PLETELY ignore efficiency, it should be guarded by an IF test. Here is the modified version:
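{---------------------------------------------------------------}
{ Load a Variable to Primary Register }
procedure LoadVar(Name, Typ: char);
begin
   if Typ <> 'L' then
      EmitLn('CLR.L D0');
   Move(Typ, Name + '(PC)', 'D0');
end;
{---------------------------------------------------------------}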
(Note that StoreVar needs no similar change.)

If you run some tests with this new version, you will find that everything works correctly
now, albeit sometimes inefficiently. For example, consider the case a=b (for the same
declarations shown above). Now the generated code turns out to be:
CLR.L D0
MOVE.W B(PC),D0
LEA A(PC),A0
MOVE.B D0,(A0)
In this case, the CLR turns out not to be necessary, since the result is going into a byte-sized
variable. With a little bit of work, we can do better. Still, this is not bad, and it is typical
of the kinds of inefficiencies that we've seen before in simple-minded compilers.
I should point out that, by setting the high bits to zero, we are in effect treating the num-
bers as UNSIGNED integers. If we want to treat them as signed ones instead (the more
likely case) we should do a sign extension after the load, instead of a clear before it. Just
to tie this part of the discussion up with a nice, red ribbon, let's change LoadVar as shown
below:
{---------------------------------------------------------------}
{ Load a Variable to Primary Register }
procedure LoadVar(Name, Typ: char);
begin
   if Typ = 'B' then
      EmitLn('CLR.L D0');
   Move(Typ, Name + '(PC)', 'D0');
   if Typ = 'W' then
      EmitLn('EXT.L D0');
end;
{---------------------------------------------------------------}
With this version, a byte is treated as unsigned (as in Pascal and C), while a word is
treated as signed.
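The difference between the two treatments can be sketched with a small simulation. This is a model of the arithmetic only, assuming a Pascal compiler with 16-bit Word and 32-bit LongInt types; ZeroExtendByte and SignExtendWord are hypothetical names chosen for the illustration, not routines from the compiler.

```pascal
program ExtendDemo;
{ Simulation sketch: contrasts clearing D0 before a byte load (zero
  extension) with EXT.L after a word load (sign extension). }

{ Model of CLR.L D0 followed by MOVE.B: the result is 0..255 }
function ZeroExtendByte(B: Byte): LongInt;
begin
   ZeroExtendByte := B;
end;

{ Model of MOVE.W followed by EXT.L: bit 15 is copied upward }
function SignExtendWord(W: Word): LongInt;
begin
   if W >= 32768 then
      SignExtendWord := LongInt(W) - 65536
   else
      SignExtendWord := W;
end;

begin
   WriteLn(ZeroExtendByte(255));     { $FF   stays  255 }
   WriteLn(SignExtendWord(65535));   { $FFFF becomes -1 }
   WriteLn(SignExtendWord(32767));   { $7FFF stays 32767 }
end.
```

A byte holding $FF therefore loads as 255, while a word holding $FFFF loads as -1, which is exactly the unsigned-byte, signed-word split described above.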

A MORE REASONABLE SOLUTION

As we've seen, promoting every variable to long while it's in memory solves the problem, but
it can hardly be called efficient, and probably wouldn't be acceptable even for those of us who
claim to be unconcerned about efficiency. It will mean that all arithmetic operations will be done
to 32-bit accuracy, which will DOUBLE the run time for most operations, and make it even
worse for multiplication and division. For those operations, we would need to call subroutines
to do them, even if the data were byte or word types. The whole thing is sort of a cop-out, too,
since it ducks all the real issues.
OK, so that solution's no good. Is there still a relatively easy way to get data conversion? Can
we still Keep It Simple?
Yes, indeed. All we have to do is to make the conversion at the other end ... that is, we con-
vert on the way _OUT_, when the data is stored, rather than on the way in.
But, remember, the storage part of the assignment is pretty much independent of the data
load, which is taken care of by procedure Expression. In general the expression may be arbi-
trarily complex, so how can procedure Assignment know what type of data is left in register
D0?
Again, the answer is simple: We'll just _ASK_ procedure Expression! The answer can be
returned as a function value.
All of this requires several procedures to be modified, but the mods, like the method, are
quite simple. First of all, since we aren't requiring LoadVar to do all the work of conversion,
let's go back to the simple version:
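{---------------------------------------------------------------}
{ Load a Variable to Primary Register }
procedure LoadVar(Name, Typ: char);
begin
   Move(Typ, Name + '(PC)', 'D0');
end;
{--------------------------------------------------------------}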

Next, let's add a new procedure that will convert from one type to another:
{---------------------------------------------------------------}
{ Convert a Data Item from One Type to Another }
procedure Convert(Source, Dest: char);
begin
if Source <> Dest then begin
if Source = 'B' then
EmitLn('AND.W #$FF,D0');
if Dest = 'L' then
EmitLn('EXT.L D0');
end;
end;
{--------------------------------------------------------------}

Next, we need to do the logic required to load and store a variable of any type. Here are
the routines for that:
{---------------------------------------------------------------}
function Load(Name: char): char;
var Typ : char;
begin
Typ := VarType(Name);
LoadVar(Name, Typ);
Load := Typ;
end;
{--------------------------------------------------------------}
{ Store a Variable from the Primary Register }
procedure Store(Name, T1: char);
var T2: char;
begin
T2 := VarType(Name);
Convert(T1, T2);
StoreVar(Name, T2);
end;
{--------------------------------------------------------------}

Note that Load is a function, which not only emits the code for a load, but also returns the
variable type. In this way, we always know what type of data we are dealing with. When
we execute a Store, we pass it the current type of the variable in D0. Since Store also
knows the type of the destination variable, it can convert as necessary.
Armed with all these new routines, the implementation of our rudimentary assignment
statement is essentially trivial. Procedure Expression now becomes a function, which
returns its type to procedure Assignment:
{---------------------------------------------------------------}
function Expression: char;
begin
Expression := Load(GetName);
end;
{--------------------------------------------------------------}
{ Parse and Translate an Assignment Statement }
procedure Assignment;
var Name: char;
begin
Name := GetName;
Match('=');
Store(Name, Expression);
end;
{--------------------------------------------------------------}

Again, note how incredibly simple these two routines are. We've encapsulated all the type
logic into Load and Store, and the trick of passing the type around makes the rest of the work
extremely easy. Of course, all of this is for our special, trivial case of Expression. Naturally, for
the general case it will have to get more complex. But you're looking now at the FINAL ver-
sion of procedure Assignment!
All this seems like a very simple and clean solution, and it is indeed. Compile this program
and run the same test cases as before. You will see that all types of data are converted prop-
erly, and there are few if any wasted instructions. Only the byte-to-long conversion uses two
instructions where one would do, and we could easily modify Convert to handle this case,
too.
Although we haven't considered unsigned variables in this case, I think you can see that we
could easily fix up procedure Convert to deal with these types as well. This is "left as an exer-
cise for the student."
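To see at a glance which conversions cost instructions, the following standalone sketch prints what Convert emits for every source and destination pair. Convert is copied from the text; the driver loop, the Sizes constant, and a one-procedure EmitLn are assumptions added so the program runs on its own.

```pascal
program ConvertTable;
{ Standalone sketch: prints the code that procedure Convert (copied
  from the text) emits for every source/destination pair. }

const
   TAB   = ^I;
   Sizes = 'BWL';          { the three type codes }

{ Output a String with Tab and CRLF (simplified for this sketch) }
procedure EmitLn(s: string);
begin
   WriteLn(TAB, s);
end;

{ Convert a Data Item from One Type to Another }
procedure Convert(Source, Dest: char);
begin
   if Source <> Dest then begin
      if Source = 'B' then
         EmitLn('AND.W #$FF,D0');
      if Dest = 'L' then
         EmitLn('EXT.L D0');
   end;
end;

var
   i, j: integer;
   S: string;

begin
   S := Sizes;
   for i := 1 to Length(S) do
      for j := 1 to Length(S) do begin
         WriteLn(S[i], ' -> ', S[j], ':');
         Convert(S[i], S[j]);   { prints nothing when no code is needed }
      end;
end.
```

Only three pairs produce code: B to W (the AND.W), W to L (the EXT.L), and B to L (both). The narrowing pairs emit nothing, because the sized MOVE in StoreVar simply leaves the high bits behind.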

LITERAL ARGUMENTS
Sharp-eyed readers might have noticed, though, that we don't even have a proper form of
a simple factor yet, because we don't allow for loading literal constants, only variables.
Let's fix that now.
To begin with, we'll need a GetNum function. We've seen several versions of this, some
returning only a single character, some a string, and some an integer. The one needed
here will return a LongInt, so that it can handle anything we throw at it. Note that no type
information is returned here: GetNum doesn't concern itself with how the number will be
used:
{--------------------------------------------------------------}
{ Get a Number }
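function GetNum: LongInt;
var Val: LongInt;
begin
   if not IsDigit(Look) then Expected('Integer');
   Val := 0;
   while IsDigit(Look) do begin
      Val := 10 * Val + Ord(Look) - Ord('0');
      GetChar;
   end;
   GetNum := Val;
   SkipWhite;
end;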
{---------------------------------------------------------------}

Now, when dealing with literal data, we have one little small problem. With variables, we
know what type things should be because they've been declared to be that type. We have no
such type information for literals. When the programmer says, "-1," does that mean a byte,
word, or longword version? We have no clue. The obvious thing to do would be to use the
largest type possible, i.e. a longword. But that's a bad idea, because when we get to more
complex expressions, we'll find that it will cause every expression involving literals to be pro-
moted to long, as well.
A better approach is to select a type based upon the value of the literal, as shown next:
{--------------------------------------------------------------}
{ Load a Constant to the Primary Register }
function LoadNum(N: LongInt): char;
var Typ : char;
begin
if abs(N) <= 127 then
Typ := 'B'
else if abs(N) <= 32767 then
Typ := 'W'
else Typ := 'L';
LoadConst(N, Typ);
LoadNum := Typ;
end;
{---------------------------------------------------------------}

(I know, I know, the number range isn't really symmetric. You can store -128 in a single
byte, and -32768 in a word. But that's easily fixed, and not worth the time or the added
complexity to fool with it here. It's the thought that counts.)
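For readers who do want the asymmetric limits, one possible fix is sketched here. It assumes signed two's-complement ranges for all three sizes; LiteralType is a hypothetical name, not a routine from the text, and it only picks the type letter without emitting any code.

```pascal
program LiteralTypeDemo;
{ Sketch of a range test that accepts the full two's-complement
  range of each size. LiteralType is a hypothetical helper. }

{ Pick the smallest signed type that can hold N }
function LiteralType(N: LongInt): char;
begin
   if (N >= -128) and (N <= 127) then
      LiteralType := 'B'
   else if (N >= -32768) and (N <= 32767) then
      LiteralType := 'W'
   else
      LiteralType := 'L';
end;

begin
   WriteLn(LiteralType(-128));     { fits a byte }
   WriteLn(LiteralType(128));      { needs a word }
   WriteLn(LiteralType(-32768));   { fits a word }
   WriteLn(LiteralType(32768));    { needs a long }
end.
```

The same two comparisons could replace the abs tests inside LoadNum if the extra precision ever matters.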
Note that LoadNum calls a new version of the code generator routine LoadConst, which
has an added argument to define the type:
{---------------------------------------------------------------}
{ Load a Constant to the Primary Register }
procedure LoadConst(N: LongInt; Typ: char);
var temp:string;
begin
Str(N, temp);
Move(Typ, '#' + temp, 'D0');
end;
{--------------------------------------------------------------}

Now we can modify procedure Expression to accommodate the two possible kinds of factors:
{---------------------------------------------------------------}
{ Parse and Translate an Expression }
function Expression: char;
begin
   if IsAlpha(Look) then
      Expression := Load(GetName)
   else
      Expression := LoadNum(GetNum);
end;
{--------------------------------------------------------------}
(Wow, that sure didn't hurt too bad! Just a few extra lines do the job.)
OK, compile this code into your program and give it a try. You'll see that it now works for
either variables or constants as valid expressions.

ADDITIVE EXPRESSIONS
If you've been following this series from the beginning, I'm sure you know what's coming
next: We'll expand the form for an expression to handle first additive expressions, then
multiplicative, then general expressions with parentheses.
The nice part is that we already have a pattern for dealing with these more complex
expressions. All we have to do is to make sure that all the procedures called by Expres-
sion (Term, Factor, etc.) always return a type identifier. If we do that, the program struc-
ture gets changed hardly at all.

The first step is easy: We can rename our existing function Expression to Term, as we've
done so many times before, and create the new version of Expression:
{---------------------------------------------------------------}
{ Parse and Translate an Expression }
function Expression: char;
var Typ: char;
begin
   if IsAddop(Look) then
      Typ := Unop
   else
      Typ := Term;
   while IsAddop(Look) do begin
      Push(Typ);
      case Look of
       '+': Typ := Add(Typ);
       '-': Typ := Subtract(Typ);
      end;
   end;
   Expression := Typ;
end;
{--------------------------------------------------------------}

Note in this routine how each procedure call has become a function call, and how the
local variable Typ gets updated at each pass.
Note also the new call to a function Unop, which lets us deal with a leading unary minus.
This change is not necessary ... we could still use a form more like what we've done
before. I've chosen to introduce Unop as a separate routine because it will make it easier,
later, to produce somewhat better code than we've been doing. In other words, I'm look-
ing ahead to optimization issues.
For this version, though, we'll retain the same dumb old code, which makes the new rou-
tine trivial:
{---------------------------------------------------------------}
{ Process a Term with Leading Unary Operator }
function Unop: char;
begin
Clear;
Unop := 'W';
end;
{---------------------------------------------------------------}
Procedure Push is a code-generator routine, and now has a type argument:
{---------------------------------------------------------------}
procedure Push(Size: char);
begin
Move(Size, 'D0', '-(SP)');
end;
{---------------------------------------------------------------}

Now, let's take a look at functions Add and Subtract. In the older versions of these routines,
we let them call code generator routines PopAdd and PopSub. We'll continue to do that,
which makes the functions themselves extremely simple:
{---------------------------------------------------------------}
function Add(T1: char): char;
begin
Match('+');
Add := PopAdd(T1, Term);
end;
{-------------------------------------------------------------}
function Subtract(T1: char): char;
begin
Match('-');
Subtract := PopSub(T1, Term);
end;
{---------------------------------------------------------------}

The simplicity is deceptive, though, because what we've done is to defer all the logic to
PopAdd and PopSub, which are no longer just code generation routines. They must also
now take care of the type conversions required.
And just what conversion is that? Simple: Both arguments must be of the same size, and
the result is also of that size. The smaller of the two arguments must be "promoted" to the
size of the larger one.
But this presents a bit of a problem. If the argument to be promoted is the second argu-
ment (i.e. in the primary register D0), we are in great shape. If it's not, however, we're in a
fix: we can't change the size of the information that's already been pushed onto the stack.
The solution is simple but a little painful: We must abandon those lovely "pop the data and
do something with it" instructions thoughtfully provided by Motorola.
The alternative is to assign a secondary register, which I've chosen to be D7. (Why not
D1? Because I have later plans for the other registers.)
The first step in this new structure is to introduce a Pop procedure analogous to the Push.
This procedure will always Pop the top element of the stack into D7:
{---------------------------------------------------------------}
{ Pop Stack into Secondary Register }
procedure Pop(Size: char);
begin
Move(Size, '(SP)+', 'D7');
end;
{---------------------------------------------------------------}

The general idea is that all the "Pop-Op" routines can call this one. When this is done, we will
then have both operands in registers, so we can promote whichever one we need to. To deal
with this, procedure Convert needs another argument, the register name:
{---------------------------------------------------------------}
{ Convert a Data Item from One Type to Another }
procedure Convert(Source, Dest: char; Reg: String);
begin
if Source <> Dest then begin
if Source = 'B' then
EmitLn('AND.W #$FF,' + Reg);
if Dest = 'L' then
EmitLn('EXT.L ' + Reg);
end;
end;
{---------------------------------------------------------------}

The next function does a conversion, but only if the current type T1 is smaller in size than
the desired type T2. It is a function, returning the final type to let us know what it decided
to do:
{---------------------------------------------------------------}
{ Promote the Size of a Register Value }
function Promote(T1, T2: char; Reg: string): char;
var Typ: char;
begin
Typ := T1;
if T1 <> T2 then
if (T1 = 'B') or ((T1 = 'W') and (T2 = 'L')) then begin
Convert(T1, T2, Reg);
Typ := T2;
end;
Promote := Typ;
end;
{---------------------------------------------------------------}

Finally, the following function forces the two registers to be of the same type:
{---------------------------------------------------------------}
{ Force both Arguments to Same Type }
function SameType(T1, T2: char): char;
begin
T1 := Promote(T1, T2, 'D7');
SameType := Promote(T2, T1, 'D0');
end;
{---------------------------------------------------------------}

These new routines give us the ammunition we need to flesh out PopAdd and PopSub:
{---------------------------------------------------------------}
{ Generate Code to Add Primary to the Stack }
function PopAdd(T1, T2: char): char;
begin
Pop(T1);
T2 := SameType(T1, T2);
GenAdd(T2);
PopAdd := T2;
end;
{---------------------------------------------------------------}
{ Generate Code to Subtract Primary from the Stack }
function PopSub(T1, T2: char): char;
begin
Pop(T1);
T2 := SameType(T1, T2);
GenSub(T2);
PopSub := T2;
end;
{---------------------------------------------------------------}

After all the buildup, the final results are almost anticlimactic. Once again, you can see that
the logic is quite simple. All the two routines do is to pop the top-of-stack into D7, force the
two operands to be the same size, and then generate the code.
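To watch the pop-promote-generate sequence in action without the rest of the compiler, the routines above can be dropped into a small stand-alone harness. This is only a sketch: EmitLn and Move here are simplified stand-ins that print to the console (Move is assumed to emit a plain MOVE instruction, as in the earlier code generator), and the main program simply asks for a byte on the stack to be added to a longword in D0.

```pascal
program PopAddTrace;
{ Stand-alone harness. EmitLn and Move are simplified stand-ins that }
{ print to the console; they are assumptions, not the compiler's own. }

procedure EmitLn(S: string);
begin
   WriteLn('    ', S);
end;

procedure Move(Size: char; Source, Dest: string);
begin
   EmitLn('MOVE.' + Size + ' ' + Source + ',' + Dest);
end;

procedure Pop(Size: char);
begin
   Move(Size, '(SP)+', 'D7');
end;

procedure Convert(Source, Dest: char; Reg: string);
begin
   if Source <> Dest then begin
      if Source = 'B' then
         EmitLn('AND.W #$FF,' + Reg);
      if Dest = 'L' then
         EmitLn('EXT.L ' + Reg);
   end;
end;

function Promote(T1, T2: char; Reg: string): char;
var Typ: char;
begin
   Typ := T1;
   if T1 <> T2 then
      if (T1 = 'B') or ((T1 = 'W') and (T2 = 'L')) then begin
         Convert(T1, T2, Reg);
         Typ := T2;
      end;
   Promote := Typ;
end;

function SameType(T1, T2: char): char;
begin
   T1 := Promote(T1, T2, 'D7');
   SameType := Promote(T2, T1, 'D0');
end;

procedure GenAdd(Size: char);
begin
   EmitLn('ADD.' + Size + ' D7,D0');
end;

function PopAdd(T1, T2: char): char;
begin
   Pop(T1);
   T2 := SameType(T1, T2);
   GenAdd(T2);
   PopAdd := T2;
end;

var Typ: char;
begin
   { First operand (on the stack) is a byte, second (in D0) is a long }
   Typ := PopAdd('B', 'L');
   WriteLn('result type: ', Typ);
end.
```

With these stand-ins, the harness should print the pop (MOVE.B (SP)+,D7), the two conversion instructions for D7 (AND.W #$FF,D7 and EXT.L D7), the long add (ADD.L D7,D0), and finally a result type of L.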
Note the new code generator routines GenAdd and GenSub. These are vestigial forms of the
ORIGINAL PopAdd and PopSub. That is, they are pure code generators, producing a regis-
ter-to-register add or subtract:
{---------------------------------------------------------------}
procedure GenAdd(Size: char);
begin
EmitLn('ADD.' + Size + ' D7,D0');
end;
{---------------------------------------------------------------}
procedure GenSub(Size: char);
begin
EmitLn('SUB.' + Size + ' D7,D0');
EmitLn('NEG.' + Size + ' D0');
end;
{---------------------------------------------------------------}
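In case the extra NEG in GenSub looks odd: SUB D7,D0 computes D0 - D7, which is the second operand minus the first, so the sign has to be flipped afterwards. The little simulation below makes the point, using ordinary Pascal variables as stand-ins for the two registers (the values 10 and 3 are arbitrary).

```pascal
program SubOrder;
{ D0 and D7 are ordinary variables standing in for the 68000 registers. }
var D0, D7: longint;
begin
   D7 := 10;        { first operand, popped from the stack }
   D0 := 3;         { second operand, still in the primary register }
   D0 := D0 - D7;   { SUB.L D7,D0 : D0 now holds 3 - 10 = -7 }
   D0 := -D0;       { NEG.L D0    : D0 now holds 10 - 3 = 7 }
   WriteLn(D0);     { prints 7 }
end.
```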

OK, I grant you: I've thrown a lot of routines at you since we last tested the code. But you
have to admit that each new routine is pretty simple and transparent. If you (like me) don't
like to test so many new routines at once, that's OK. You can stub out routines like Con-
vert, Promote, and SameType, since they don't read any inputs. You won't get the correct
code, of course, but things should work. Then flesh them out one at a time.
When testing the program, don't forget that you first have to declare some variables, and
then start the "body" of the program with an upper-case 'B' (for BEGIN). You should find
that the parser will handle any additive expressions. Once all the conversion routines are
in, you should see that the correct code is generated, with type conversions inserted
where necessary. Try mixing up variables of different sizes, and also literals. Make sure
that everything's working properly. As usual, it's a good idea to try some erroneous
expressions and see how the compiler handles them.

WHY SO MANY PROCEDURES?

At this point, you may think I've pretty much gone off the deep end in terms of deeply nested
procedures. There is admittedly a lot of overhead here. But there's a method in my madness.
As in the case of UnOp, I'm looking ahead to the time when we're going to want better code
generation. The way the code is organized, we can achieve this without major modifications
to the program. For example, in cases where the value pushed onto the stack does _NOT_
have to be converted, it's still better to use the "pop and add" instruction. If we choose to test
for such cases, we can embed the extra tests into PopAdd and PopSub without changing
anything else much.

MULTIPLICATIVE EXPRESSIONS
The procedure for dealing with multiplicative operators is much the same. In fact, at the
first level, they are almost identical, so I'll just show them here without much fanfare. The
first one is our general form for Factor, which includes parenthetical subexpressions:
{---------------------------------------------------------------}
{ Parse and Translate a Factor }
function Expression: char; Forward;
function Factor: char;
begin
   if Look = '(' then begin
      Match('(');
      Factor := Expression;
      Match(')');
   end
   else if IsAlpha(Look) then
      Factor := Load(GetName)
   else
      Factor := LoadNum(GetNum);
end;

{--------------------------------------------------------------}
Function Multiply(T1: char): char;
begin
Match('*');
Multiply := PopMul(T1, Factor);
end;
{--------------------------------------------------------------}
function Divide(T1: char): char;
begin
Match('/');
Divide := PopDiv(T1, Factor);
end;

{---------------------------------------------------------------}
{ Parse and Translate a Term }
function Term: char;
var Typ: char;
begin
   Typ := Factor;
   while IsMulop(Look) do begin
      Push(Typ);
      case Look of
       '*': Typ := Multiply(Typ);
       '/': Typ := Divide(Typ);
      end;
   end;
   Term := Typ;
end;
{---------------------------------------------------------------}
These routines parallel the additive ones almost exactly. As before, the complexity is
encapsulated within PopMul and PopDiv. If you'd like to test the program before we get
into that, you can build dummy versions of them, similar to PopAdd and PopSub. Again,
the code won't be correct at this point, but the parser should handle expressions of arbi-
trary complexity.

MULTIPLICATION
Once you've convinced yourself that the parser itself is working properly, we need to figure
out what it will take to generate the right code. This is where things begin to get a little sticky,
because the rules are more complex.
Let's take the case of multiplication first. This operation is similar to the "addops" in that both
operands should be of the same size. It differs in three important respects:
o The type of the product is typically not the same as that of the two operands. For the prod-
uct of two words, we get a longword result.
o The 68000 does not support a 32 x 32 multiply, so a call to a software routine is needed.
This routine will become part of the run-time library.
o It also does not support an 8 x 8 multiply, so all byte operands must be promoted to words.

The actions that we have to take are best shown in the following table:
  T1 -->  |                 |                 |                 |
          |        B        |        W        |        L        |
  T2  V   |                 |                 |                 |
-----------------------------------------------------------------
          |                 |                 |                 |
     B    | Convert D0 to W | Convert D0 to W | Convert D0 to L |
          | Convert D7 to W |                 |                 |
          | MULS            | MULS            | JSR MUL32       |
          | Result = W      | Result = L      | Result = L      |
          |                 |                 |                 |
-----------------------------------------------------------------
          |                 |                 |                 |
     W    | Convert D7 to W |                 | Convert D0 to L |
          | MULS            | MULS            | JSR MUL32       |
          | Result = L      | Result = L      | Result = L      |
          |                 |                 |                 |
-----------------------------------------------------------------
          |                 |                 |                 |
     L    | Convert D7 to L | Convert D7 to L |                 |
          | JSR MUL32       | JSR MUL32       | JSR MUL32       |
          | Result = L      | Result = L      | Result = L      |
          |                 |                 |                 |
-----------------------------------------------------------------
This table shows the actions to be taken for each combination of operand types. There are
three things to note: First, we assume a library routine MUL32 which performs a 32 x 32 mul-
tiply, leaving a >> 32-bit << (not 64-bit) product. If there is any overflow in the process, we
choose to ignore it and return only the lower 32 bits.
Second, note that the table is symmetric ... the two operands enter in the same way. Finally,
note that the product is ALWAYS a longword, except when both operands are bytes. (It's
worth noting, in passing, that this means that many expressions will end up being longwords,
whether we like it or not. Perhaps the idea of just promoting them all up front wasn't all that
outrageous, after all!)

Now, clearly, we are going to have to generate different code for the 16-bit and 32-bit mul-
tiplies. This is best done by having separate code generator routines for the two cases:
{---------------------------------------------------------------}
{ Multiply Top of Stack by Primary (Word) }
procedure GenMult;
begin
EmitLn('MULS D7,D0')
end;
{---------------------------------------------------------------}
{ Multiply Top of Stack by Primary (Long) }
procedure GenLongMult;
begin
EmitLn('JSR MUL32');
end;
{---------------------------------------------------------------}

An examination of the code below for PopMul should convince you that the conditions in the
table are met:
{---------------------------------------------------------------}
{ Generate Code to Multiply Primary by Stack }
function PopMul(T1, T2: char): char;
var T: char;
begin
Pop(T1);
T := SameType(T1, T2);
Convert(T, 'W', 'D7');
Convert(T, 'W', 'D0');
if T = 'L' then
GenLongMult
else
GenMult;
if T = 'B' then
PopMul := 'W'
else
PopMul:= 'L';
end;
{---------------------------------------------------------------}

As you can see, the routine starts off just like PopAdd. The two arguments are forced to
the same type. The two calls to Convert take care of the case where both operands are
bytes. The data themselves are promoted to words, but the routine remembers the type
so as to assign the correct type to the result. Finally, we call one of the two code genera-
tor routines, and then assign the result type. Not too complicated, really.
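One way to convince yourself that PopMul really matches the table is to restate its decisions without emitting any code. The sketch below does that. Larger is a hypothetical helper that returns the common type SameType would settle on (ordering B < W < L), and the program prints, for all nine combinations, which multiply gets used and the type of the product.

```pascal
program MulTable;
{ Hypothetical helper program (not part of the compiler): it restates }
{ the decisions PopMul makes, without emitting any 68000 code.        }

const Types: array[1..3] of char = ('B', 'W', 'L');

{ Larger of two type codes, ordered B < W < L }
function Larger(T1, T2: char): char;
begin
   if (T1 = 'L') or (T2 = 'L') then
      Larger := 'L'
   else if (T1 = 'W') or (T2 = 'W') then
      Larger := 'W'
   else
      Larger := 'B';
end;

var I, J: integer;
    T: char;
begin
   for I := 1 to 3 do
      for J := 1 to 3 do begin
         T := Larger(Types[I], Types[J]);
         Write('T1=', Types[I], ' T2=', Types[J], ': ');
         { Only a longword operand forces the library call }
         if T = 'L' then
            Write('JSR MUL32')
         else
            Write('MULS');
         { Only byte-by-byte yields a word; everything else is long }
         if T = 'B' then
            WriteLn(', result W')
         else
            WriteLn(', result L');
      end;
end.
```

Comparing the nine lines of output against the table, cell by cell, is a quick sanity check before you start feeding real expressions to the compiler.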
At this point, I suggest that you go ahead and test the program. Try all combinations of
operand sizes.

DIVISION
The case of division is not nearly so symmetric. I also have some bad news for you:
All modern 16-bit CPUs support integer divide. The manufacturer's data sheet will describe
this operation as a 32 x 16-bit divide, meaning that you can divide a 32-bit dividend by a 16-
bit divisor. Here's the bad news:
THEY'RE LYING TO YOU!!!
If you don't believe it, try dividing any large 32-bit number (meaning that it has non-zero bits in
the upper 16 bits) by the integer 1. You are guaranteed to get an overflow exception.
The problem is that the instruction really requires that the resulting quotient fit into a 16-bit
result. This won't happen UNLESS the divisor is sufficiently large. When any number is
divided by unity, the quotient will of course be the same as the dividend, which had better fit
into a 16-bit word.
Since the beginning of time (well, computers, anyway), CPU architects have provided this lit-
tle gotcha in the division circuitry. It provides a certain amount of symmetry in things, since it
is sort of the inverse of the way a multiply works. But since unity is a perfectly valid (and
rather common) number to use as a divisor, the division as implemented in hardware needs
some help from us programmers.
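The condition is easy to state in code. The sketch below uses a hypothetical helper, QuotientFits, that simply checks whether the quotient of a 32-bit dividend and a 16-bit divisor lands in the signed 16-bit range; it says nothing about how a particular CPU reports the failure, and it assumes a non-zero divisor.

```pascal
program DivFit;
{ Hypothetical helper: reports whether the quotient of a 32-bit    }
{ dividend and a 16-bit divisor fits in a signed 16-bit word.      }
{ The divisor is assumed to be non-zero.                           }

function QuotientFits(Dividend: longint; Divisor: integer): boolean;
var Q: longint;
begin
   Q := Dividend div Divisor;
   QuotientFits := (Q >= -32768) and (Q <= 32767);
end;

begin
   WriteLn(QuotientFits(100000, 1));     { FALSE: 100000 needs more than 16 bits }
   WriteLn(QuotientFits(100000, 100));   { TRUE: the quotient is only 1000 }
end.
```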
The implications are as follows:
o The type of the quotient must always be the same as that of the dividend. It is independent
of the divisor.
o In spite of the fact that the CPU supports a longword dividend, the hardware-provided
instruction can only be trusted for byte and word dividends. For longword dividends, we need
another library routine that can return a long result.

This looks like a job for another table, to summarize the required actions:
  T1 -->  |                 |                 |                 |
          |        B        |        W        |        L        |
  T2  V   |                 |                 |                 |
-----------------------------------------------------------------
          |                 |                 |                 |
     B    | Convert D0 to W | Convert D0 to W | Convert D0 to L |
          | Convert D7 to L | Convert D7 to L |                 |
          | DIVS            | DIVS            | JSR DIV32       |
          | Result = B      | Result = W      | Result = L      |
          |                 |                 |                 |
-----------------------------------------------------------------
          |                 |                 |                 |
     W    | Convert D7 to L | Convert D7 to L | Convert D0 to L |
          | DIVS            | DIVS            | JSR DIV32       |
          | Result = B      | Result = W      | Result = L      |
          |                 |                 |                 |
-----------------------------------------------------------------
          |                 |                 |                 |
     L    | Convert D7 to L | Convert D7 to L |                 |
          | JSR DIV32       | JSR DIV32       | JSR DIV32       |
          | Result = L      | Result = L      | Result = L      |
          |                 |                 |                 |
-----------------------------------------------------------------
(You may wonder why it's necessary to do a 32-bit division, when the dividend is, say, only a
byte in the first place. Since the number of bits in the result can only be as many as that in the
dividend, why bother? The reason is that, if the divisor is a longword, and there are any high
bits set in it, the result of the division must be zero. We might not get that if we only use the
lower word of the divisor.)
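A quick numeric illustration of that last point, using arbitrary values: the dividend is a small number, the divisor is a longword with a bit set in its upper half, and we compare the true quotient with what we would get by looking only at the divisor's lower 16 bits.

```pascal
program LongDivisor;
var Dividend, Divisor, LowWord: longint;
begin
   Dividend := 100;
   Divisor  := $00010005;            { 65541: a bit is set in the upper word }
   LowWord  := Divisor and $FFFF;    { 5: all that a 16-bit divide would see }
   WriteLn(Dividend div Divisor);    { prints 0, the correct quotient }
   WriteLn(Dividend div LowWord);    { prints 20, which is wrong }
end.
```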

The following code provides the correct function for PopDiv:
{---------------------------------------------------------------}
{ Generate Code to Divide Stack by the Primary }
function PopDiv(T1, T2: char): char;
begin
Pop(T1);
Convert(T1, 'L', 'D7');
if (T1 = 'L') or (T2 = 'L') then begin
Convert(T2, 'L', 'D0');
GenLongDiv;
PopDiv := 'L';
end
else begin
Convert(T2, 'W', 'D0');
GenDiv;
PopDiv := T1;
end;
end;
{---------------------------------------------------------------}

The two code generation procedures are:
{---------------------------------------------------------------}
{ Divide Top of Stack by Primary (Word) }
procedure GenDiv;
begin
   EmitLn('DIVS D0,D7');
   Move('W', 'D7', 'D0');
end;
{---------------------------------------------------------------}
{ Divide Top of Stack by Primary (Long) }
procedure GenLongDiv;
begin
EmitLn('JSR DIV32');
end;
{---------------------------------------------------------------}
Note that we assume that DIV32 leaves the (longword) result in D0.
OK, install the new procedures for division. At this point you should be able to generate code
for any kind of arithmetic expression. Give it a whirl!

BEGINNING TO WIND DOWN

At last, in this installment, we've learned how to deal with variables (and literals) of differ-
ent types. As you can see, it hasn't been too tough. In fact, in some ways most of the
code looks even more simple than it does in earlier programs. Only the multiplication and
division operators require a little thinking and planning.
The main concept that made things easy was that of converting procedures such as
Expression into functions that return the type of the result. Once this was done, we were
able to retain the same general structure of the compiler.
I won't pretend that we've covered every single aspect of the issue. I conveniently ignored
unsigned arithmetic. From what we've done, I think you can see that including it adds
no new challenges, just extra possibilities to test for.
I've also ignored the logical operators And, Or, etc. It turns out that these are pretty easy
to handle. All the logical operators are bitwise operations, so they are symmetric and
therefore work in the same fashion as PopAdd. There is one difference, however: if it is
necessary to extend the word length for a logical variable, the extension should be done
as an UNSIGNED number. Floating point numbers, again, are straightforward to handle
... just a few more procedures to be added to the run-time library, or perhaps instructions
for a math chip.
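To see why the kind of extension matters for logical operators, here is a small sketch with an arbitrary byte value. It widens the same bit pattern both ways, by hand, and then ANDs each result with a mask that tests a bit above the original eight.

```pascal
program ExtendByte;
var B: byte;
    AsSigned, AsUnsigned: integer;
begin
   B := $F0;                            { bit pattern 1111 0000 }
   AsUnsigned := B;                     { zero-extended: 240 }
   if B >= 128 then
      AsSigned := B - 256               { sign-extended: -16 }
   else
      AsSigned := B;
   WriteLn(AsUnsigned, ' ', AsSigned);  { prints 240 -16 }
   { Bit 8 was never part of the byte, yet sign extension sets it }
   WriteLn(AsUnsigned and $0100);       { prints 0 }
   WriteLn(AsSigned and $0100);         { prints 256 }
end.
```

The sign-extended copy carries bits that were never in the original byte, which is exactly what you don't want when the value is being treated as a bit pattern rather than a number.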
Perhaps more importantly, I have also skirted the issue of type CHECKING, as opposed
to conversion. In other words, we've allowed for operations between variables of all com-
binations of types. In general this will not be true ... certainly you don't want to add an inte-
ger, for example, to a string. Most languages also don't allow you to mix up character and
integer variables.
Again, there are really no new issues to be addressed in this case. We are already check-
ing the types of the two operands ... much of this checking gets done in procedures like
SameType. It's pretty straightforward to include a call to an error handler, if the types of
the two operands are incompatible.

In the general case, we can think of every single operator as being handled by a different pro-
cedure, depending upon the type of the two operands. This is straightforward, though
tedious, to implement simply by implementing a jump table with the operand types as indices.
In Pascal, the equivalent operation would involve nested Case statements. Some of the
called procedures could then be simple error routines, while others could effect whatever kind
of conversion we need. As more types are added, the number of procedures goes up by a
square-law rule, but that's still not an unreasonably large number of procedures.
What we've done here is to collapse such a jump table into far fewer procedures, simply by
making use of symmetry and other simplifying rules.
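For comparison, here is roughly what the un-collapsed, nested-Case form might look like for addition alone. This is a hypothetical illustration, not code from the compiler: AddAction just returns a description of what would be done, the action strings are invented labels, and 'S' stands for an imaginary string type that has no legal combination.

```pascal
program TypeDispatch;
{ Hypothetical illustration of dispatching on both operand types. }
{ T1 is the type of the pushed operand (D7), T2 the one in D0.    }

function AddAction(T1, T2: char): string;
begin
   AddAction := 'error: incompatible types';   { default for unlisted pairs }
   case T1 of
    'B': case T2 of
          'B': AddAction := 'ADD.B';
          'W': AddAction := 'promote D7 to W, ADD.W';
          'L': AddAction := 'promote D7 to L, ADD.L';
         end;
    'W': case T2 of
          'B': AddAction := 'promote D0 to W, ADD.W';
          'W': AddAction := 'ADD.W';
          'L': AddAction := 'promote D7 to L, ADD.L';
         end;
    'L': case T2 of
          'B', 'W': AddAction := 'promote D0 to L, ADD.L';
          'L': AddAction := 'ADD.L';
         end;
   end;
end;

begin
   WriteLn(AddAction('B', 'L'));   { prints: promote D7 to L, ADD.L }
   WriteLn(AddAction('W', 'S'));   { prints: error: incompatible types }
end.
```

Nine cells for three types and one operator; you can see how quickly this grows, and why folding the symmetric cases into Promote and SameType is worth the trouble.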

TO COERCE OR NOT TO COERCE

In case you haven't gotten this message yet, it sure appears that TINY and KISS will
probably _NOT_ be strongly typed languages, since I've allowed for automatic mixing and
conversion of just about any type. Which brings up the next issue:
Is this really what we want to do?
The answer depends on what kind of language you want, and the way you'd like it to
behave. What we have not addressed is the issue of when to allow and when to deny the
use of operations involving different data types. In other words, what should be the
SEMANTICS of our compiler? Do we want automatic type conversion for all cases, for
some cases, or not at all?
Let's pause here to think about this a bit more. To do so, it will help to look at a bit of his-
tory.
FORTRAN II supported only two simple data types: Integer and Real. It allowed implicit
type conversion between real and integer types during assignment, but not within expres-
sions. All data items (including literal constants) on the right-hand side of an assignment
statement had to be of the same type. That made things pretty easy ... much simpler than
what we've had to do here.
This was changed in FORTRAN IV to support "mixed-mode" arithmetic. If an expression
had any real data items in it, they were all converted to reals and the expression itself was
real. To round out the picture, functions were provided to explicitly convert from one type
to the other, so that you could force an expression to end up as either type.
This led to two things: code that was easier to write, and code that was less efficient.
That's because sloppy programmers would write expressions with simple constants like 0
and 1 in them, which the compiler would dutifully compile to convert at execution time.
Still, the system worked pretty well, which would tend to indicate that implicit type conver-
sion is a Good Thing.
C is also a weakly typed language, though it supports a larger number of types. C won't
complain if you try to add a character to an integer, for example. Partly, this is helped by
the C convention of promoting every char to integer when it is loaded, or passed through
a parameter list. This simplifies the conversions quite a bit. In fact, in subset C compilers that
don't support long or float types, we end up back where we were in our earlier, simple-minded
first try: every variable has the same representation, once loaded into a register. Makes life
pretty easy!
The ultimate language in the direction of automatic type conversion is PL/I. This language
supports a large number of data types, and you can mix them all freely. If the implicit conver-
sions of FORTRAN seemed good, then those of PL/I should have been Heaven, but it turned
out to be more like Hell! The problem was that with so many data types, there had to be a
large number of different conversions, AND a correspondingly large number of rules about
how mixed operands should be converted. These rules became so complex that no one
could remember what they were! A lot of the errors in PL/I programs had to do with unex-
pected and unwanted type conversions. Too much of a Good Thing can be bad for you!
Pascal, on the other hand, is a language which is "strongly typed," which means that in gen-
eral you can't mix types, even if they differ only in _NAME_, and yet have the same base
type! Niklaus Wirth made Pascal strongly typed to help keep programmers out of trouble, and
the restrictions have indeed saved many a programmer from himself, because the compiler
kept him from doing something dumb. Better to find the bug in compilation rather than the
debug phase. The same restrictions can also cause frustration when you really WANT to mix
types, and they tend to drive an ex-C-programmer up the wall.
Even so, Pascal does permit some implicit conversions. You can assign an integer to a real
value. You can also mix integer and real types in expressions of type Real. The integers will
be automatically coerced to real, just as in FORTRAN (and with the same hidden cost in run-
time overhead).
You can't, however, convert the other way, from real to integer, without applying an explicit
conversion function, Trunc. The theory here is that, since the numerical value of a real num-
ber is necessarily going to be changed by the conversion (the fractional part will be lost), you
really shouldn't do it in "secret."
In the spirit of strong typing, Pascal will not allow you to mix Char and Integer variables, with-
out applying the explicit coercion functions Chr and Ord.
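A tiny example of the rules just described, with arbitrary values: the integer-to-real conversion happens silently, while going from real back to integer, or between Char and Integer, takes an explicit function call.

```pascal
program Coercions;
var R: real;
    I: integer;
    C: char;
begin
   I := 3;
   R := I + 0.75;            { integer coerced to real implicitly: 3.75 }
   I := Trunc(R);            { real to integer needs an explicit call: 3 }
   C := Chr(Ord('A') + 1);   { Char/Integer mixing goes through Ord and Chr }
   WriteLn(I, ' ', C);       { prints 3 B }
end.
```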

Turbo Pascal also includes the types Byte, Word, and LongInt. The first two are basically
the same as unsigned integers. In Turbo, these can be freely intermixed with variables of
type Integer, and Turbo will automatically handle the conversion. There are run-time
checks, though, to keep you from overflowing or otherwise getting the wrong answer.
Note that you still can't mix Byte and Char types, even though they are stored internally in
the same representation.
The ultimate in a strongly-typed language is Ada, which allows _NO_ implicit type conver-
sions at all, and also will not allow mixed-mode arithmetic. Jean Ichbiah's position is that
conversions cost execution time, and you shouldn't be allowed to build in such cost in a
hidden manner. By forcing the programmer to explicitly request a type conversion, you
make it more apparent that there could be a cost involved.
I have been using another strongly-typed language, a delightful little language called
Whimsical, by John Spray. Although Whimsical is intended as a systems programming
language, it also requires explicit conversion EVERY time. There are NEVER any auto-
matic conversions, even the ones supported by Pascal.
This approach does have certain advantages: The compiler never has to guess what to
do: the programmer always tells it precisely what he wants. As a result, there tends to be
a more nearly one-to-one correspondence between source code and compiled code, and
John's compiler produces VERY tight code.
On the other hand, I sometimes find the explicit conversions to be a pain. If I want, for
example, to add one to a character, or AND it with a mask, there are a lot of conversions
to make. If I get it wrong, the only error message is "Types are not compatible." As it hap-
pens, John's particular implementation of the language in his compiler doesn't tell you
exactly WHICH types are not compatible ... it only tells you which LINE the error is in.
I must admit that most of my errors with this compiler tend to be errors of this type, and
I've spent a lot of time with the Whimsical compiler, trying to figure out just WHERE in the
line I've offended it. The only real way to fix the error is to keep trying things until some-
thing works.

So what should we do in TINY and KISS? For the first one, I have the answer: TINY will sup-
port only the types Char and Integer, and we'll use the C trick of promoting Chars to Integers
internally. That means that the TINY compiler will be _MUCH_ simpler than what we've
already done. Type conversion in expressions is sort of moot, since none will be required!
Since longwords will not be supported, we also won't need the MUL32 and DIV32 run-time
routines, nor the logic to figure out when to call them. I _LIKE_ it!
KISS, on the other hand, will support the type Long.
Should it support both signed and unsigned arithmetic? For the sake of simplicity I'd rather
not. It does add quite a bit to the complexity of type conversions. Even Niklaus Wirth has
eliminated unsigned (Cardinal) numbers from his new language Oberon, with the argument
that 32-bit integers should be long enough for anybody, in either case.
But KISS is supposed to be a systems programming language, which means that we should
be able to do whatever operations that can be done in assembler. Since the 68000 supports
both flavors of integers, I guess KISS should, also. We've seen that logical operations need
to be able to extend integers in an unsigned fashion, so the unsigned conversion procedures
are required in any case.

CONCLUSION
That wraps up our session on type conversions. Sorry you had to wait so long for it, but
hope you feel that it was worth the wait.
In the next few installments, we'll extend the simple types to include arrays and pointers,
and we'll have a look at what to do about strings. That should pretty well wrap up the
mainstream part of the series. After that, I'll give you the new versions of the TINY and
KISS compilers, and then we'll start to look at optimization issues.
See you then.

Part 15 - Back To The Future
INTRODUCTION
Can it really have been four years since I wrote installment fourteen of this series? Is it really
possible that six long years have passed since I began it? Funny how time flies when you're
having fun, isn't it?
I won't spend a lot of time making excuses; only point out that things happen, and priorities
change. In the four years since installment fourteen, I've managed to get laid off, get
divorced, have a nervous breakdown, begin a new career as a writer, begin another one as a
consultant, move, work on two real-time systems, and raise fourteen baby birds, three
pigeons, six possums, and a duck. For a while there, the parsing of source code was not high
on my list of priorities. Neither was writing stuff for free, instead of writing stuff for pay. But I do
try to be faithful, and I do recognize and feel my responsibility to you, the reader, to finish
what I've started. As the tortoise said in one of my son's old stories, I may be slow, but I'm
sure. I'm sure that there are people out there anxious to see the last reel of this film, and I
intend to give it to them. So, if you're one of those who's been waiting, more or less patiently,
to see how this thing comes out, thanks for your patience. I apologize for the delay. Let's
move on.

NEW STARTS, OLD DIRECTIONS

Like many other things, programming languages and programming styles change with
time. In 1994, it seems a little anachronistic to be programming in Turbo Pascal, when the
rest of the world seems to have gone bananas over C++. It also seems a little strange to
be programming in a classical style when the rest of the world has switched to object-oriented
methods. Still, in spite of the four-year hiatus, it would be entirely too wrenching a
change, at this point, to switch to, say, C++ with object-orientation. Anyway, Pascal is
still not only a powerful programming language (more than ever, in fact), but it's a wonder-
ful medium for teaching. C is a notoriously difficult language to read ... it's often been
accused, along with Forth, of being a "write-only language." When I program in C++, I find
myself spending at least 50% of my time struggling with language syntax rather than with
concepts. A stray "&" or "*" can not only change the functioning of the program, but its
correctness as well. By contrast, Pascal code is usually quite transparent and easy to
read, even if you don't know the language. What you see is almost always what you get,
and we can concentrate on concepts rather than implementation details. I've said from
the beginning that the purpose of this tutorial series was not to generate the world's fast-
est compiler, but to teach the fundamentals of compiler technology, while spending the
least amount of time wrestling with language syntax or other aspects of software imple-
mentation. Finally, since a lot of what we do in this course amounts to software experi-
mentation, it's important to have a compiler and associated environment that compiles
quickly and with no fuss. In my opinion, by far the most significant time measure in soft-
ware development is the speed of the edit/compile/test cycle. In this department, Turbo
Pascal is king. The compilation speed is blazing fast, and continues to get faster in every
release (how do they keep doing that?). Despite vast improvements in C compilation
speed over the years, even Borland's fastest C/C++ compiler is still no match for Turbo
Pascal. Further, the editor built into their IDE, the make facility, and even their superb
smart linker, all complement each other to produce a wonderful environment for quick
turnaround. For all of these reasons, I intend to stick with Pascal for the duration of this
series. We'll be using Turbo Pascal for Windows, one of the compilers provided with Borland
Pascal with Objects, version 7.0. If you don't have this compiler, don't worry ... nothing we
do here is going to count on your having the latest version. Using the Windows version
helps me a lot, by allowing me to use the Clipboard to copy code from the compiler's edi-
tor into these documents. It should also help you at least as much, copying the code in
the other direction.

I've thought long and hard about whether or not to introduce objects to our discussion. I'm a
big advocate of object-oriented methods for all uses, and such methods definitely have their
place in compiler technology. In fact, I've written papers on just this subject (Refs. 1-3). But
the architecture of a compiler which is based on object-oriented approaches is vastly different
than that of the more classical compiler we've been building. Again, it would seem to be
entirely too much to change these horses in midstream. As I said, programming styles
change. Who knows, it may be another six years before we finish this thing, and if we keep
changing the code every time programming style changes, we may NEVER finish.
So for now, at least, I've determined to continue the classical style in Pascal, though we might
indeed discuss objects and object orientation as we go. Likewise, the target machine will
remain the Motorola 68000 family. Of all the decisions to be made here, this one has been
the easiest. Though I know that many of you would like to see code for the 80x86, the 68000
has become, if anything, even more popular as a platform for embedded systems, and it's for
that application that this whole effort began in the first place. Compiling for the PC, MSDOS
platform, we'd have to deal with all the issues of DOS system calls, DOS linker formats, the
PC file system and hardware, and all those other complications of a DOS environment. An
embedded system, on the other hand, must run standalone, and it's for this kind of applica-
tion, as an alternative to assembly language, that I've always imagined that a language like
KISS would thrive. Anyway, who wants to deal with the 80x86 architecture if they don't have
to?
The one feature of Turbo Pascal that I'm going to be making heavy use of is units. In the past,
we've had to make compromises between code size and complexity, and program functional-
ity. A lot of our work has been in the nature of computer experimentation, looking at only one
aspect of compiler technology at a time. We did this to avoid having to carry around
large programs, just to investigate simple concepts. In the process, we've re-invented the
wheel and re-programmed the same functions more times than I'd like to count. Turbo units
provide a wonderful way to get functionality and simplicity at the same time: You write reus-
able code, and invoke it with a single line. Your test program stays small, but it can do power-
ful things.
One feature of Turbo Pascal units is their initialization block. As with an Ada package, any
code in the main begin-end block of a unit gets executed as the program is initialized. As
you'll see later, this sometimes gives us neat simplifications in the code. Our procedure Init,
which has been with us since Installment 1, goes away entirely when we use units. The various
routines in the Cradle, another key feature of our approach, will get distributed among
the units.

The concept of units, of course, is no different than that of C modules. However, in C (and
C++), the interface between modules comes via preprocessor include statements and
header files. As someone who's had to read a lot of other people's C programs, I've
always found this rather bewildering. It always seems that whatever data structure you'd
like to know about is in some other file. Turbo units are simpler for the very reason that
they're criticized by some: The function interfaces and their implementation are included
in the same file. While this organization may create problems with code security, it also
reduces the number of files by half, which isn't half bad. Linking of the object files is also
easy, because the Turbo compiler takes care of it without the need for make files or other
mechanisms.

STARTING OVER?
Four years ago, in Installment 14, I promised you that our days of re-inventing the wheel, and
recoding the same software over and over for each lesson, were over, and that from now on
we'd stick to more complete programs that we would simply add new features to. I still intend
to keep that promise; that's one of the main purposes for using units. However, because of
the long time since Installment 14, it's natural to want to at least do some review, and anyhow,
we're going to have to make rather sweeping changes in the code to make the transition to
units. Besides, frankly, after all this time I can't remember all the neat ideas I had in my head
four years ago. The best way for me to recall them is to retrace some of the steps we took to
arrive at Installment 14. So I hope you'll be understanding and bear with me as we go back to
our roots, in a sense, and rebuild the core of the software, distributing the routines among the
various units, and bootstrapping ourselves back up to the point we were at lo, those many
moons ago. As has always been the case, you're going to get to see me make all the mis-
takes and execute changes of direction, in real time. Please bear with me ... we'll start getting
to the new stuff before you know it.
Since we're going to be using multiple modules in our new approach, we have to address the
issue of file management. If you've followed all the other sections of this tutorial, you know
that, as our programs evolve, we're going to be replacing older, more simple-minded units
with more capable ones. This brings us to an issue of version control. There will almost cer-
tainly be times when we will overlay a simple file (unit), but later wish we had the simple one
again. A case in point is embodied in our predilection for using single-character variable
names, keywords, etc., to test concepts without getting bogged down in the details of a lexi-
cal scanner. Thanks to the use of units, we will be doing much less of this in the future. Still, I
not only suspect, but am certain that we will need to save some older versions of files, for
special purposes, even though they've been replaced by newer, more capable ones.
To deal with this problem, I suggest that you create different directories, with different ver-
sions of the units as needed. If we do this properly, the code in each directory will remain self-
consistent. I've tentatively created four directories: SINGLE (for single-character experimen-
tation), MULTI (for, of course, multi-character versions), TINY, and KISS.
Enough said about philosophy and details. Let's get on with the resurrection of the software.

THE INPUT UNIT

A key concept that we've used since Day 1 has been the idea of an input stream with one
lookahead character. All the parsing routines examine this character, without changing it,
to decide what they should do next. (Compare this approach with the C/Unix approach
using getchar and ungetc, and I think you'll agree that our approach is simpler.) We'll begin
our hike into the future by translating this concept into our new, unit-based organization.
The first unit, appropriately called Input, is shown below:
{--------------------------------------------------------------}
unit Input;
{--------------------------------------------------------------}
interface
var Look: char; { Lookahead character }
procedure GetChar; { Read new character }
{--------------------------------------------------------------}
implementation
{--------------------------------------------------------------}
procedure GetChar;
begin
Read(Look);
end;

{--------------------------------------------------------------}
{ Unit Initialization }
begin
GetChar;
end.
{--------------------------------------------------------------}
As you can see, there's nothing very profound, and certainly nothing complicated, about this
unit, since it consists of only a single procedure. But already, we can see how the use of units
gives us advantages. Note the executable code in the initialization block. This code "primes
the pump" of the input stream for us, something we've always had to do before, by inserting
the call to GetChar in line, or in procedure Init. This time, the call happens without any special
reference to it on our part, except within the unit itself. As I predicted earlier, this mechanism
is going to make our lives much simpler as we proceed. I consider it to be one of the most
useful features of Turbo Pascal, and I lean on it heavily.
Copy this unit into your compiler's IDE, and compile it. To test the software, of course, we
always need a main program. I used the following, really complex test program, which we'll
later evolve into the Main for our compiler:
{--------------------------------------------------------------}
program Main;
uses WinCRT, Input;
begin
WriteLn(Look);
end.
{--------------------------------------------------------------}

Note the use of the Borland-supplied unit, WinCRT. This unit is necessary if you intend to
use the standard Pascal I/O routines, Read, ReadLn, Write, and WriteLn, which of course
we intend to do. If you forget to include this unit in the "uses" clause, you will get a really
bizarre and indecipherable error message at run time.
Note also that we can access the lookahead character, even though it's not declared in
the main program. All variables declared within the interface section of a unit are global,
but they're hidden from prying eyes; to that extent, we get a modicum of information hiding.
Of course, if we were writing in an object-oriented fashion, we should not allow outside
modules to access the unit's internal variables. But, although Turbo units have a lot in
common with objects, we're not doing object-oriented design or code here, so our use of
Look is appropriate.
Go ahead and save the test program as Main.pas. To make life easier as we get more
and more files, you might want to take this opportunity to declare this file as the compiler's
Primary file. That way, you can execute the program from any file. Otherwise, if you press
Ctrl-F9 to compile and run from one of the units, you'll get an error message. You set the
primary file using the main submenu, "Compile," in the Turbo IDE.
I hasten to point out, as I've done before, that the function of unit Input is, and always has
been, considered to be a dummy version of the real thing. In a production version of a
compiler, the input stream will, of course, come from a file rather than from the keyboard.
And it will almost certainly include line buffering, at the very least, and more likely, a rather
large text buffer to support efficient disk I/O. The nice part about the unit approach is that,
as with objects, we can modify the code in the unit to be as simple or as sophisticated as
we like. As long as the interface, as embodied in the public procedures and the lookahead
character, doesn't change, the rest of the program is totally unaffected. And since units
are compiled, rather than merely included, the time required to link with them is virtually
nil. Again, the result is that we can get all the benefits of sophisticated implementations,
without having to carry the code around as so much baggage.
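To make the point concrete, here is a rough sketch of what a more capable Input might look like: one that reads from a text file, one line at a time, yet exports exactly the same interface (Look and GetChar). This is an illustration under assumptions, not code from the series. The file name source.txt, the use of Ctrl-Z (#26) as an end-of-input marker, and the names Source, Line, and Col are all hypothetical.

```pascal
{ Hypothetical file-based variant of unit Input.  Same interface }
{ as the keyboard version, so no other unit needs to change.     }
unit Input;
interface
var Look: char;                 { Lookahead character }
procedure GetChar;              { Read new character }

implementation
const
  SourceName = 'source.txt';    { assumed file name, for illustration }
  LF         = #10;             { marks the end of each buffered line }
  EndOfInput = #26;             { assumed end marker (Ctrl-Z) }
var
  Source: text;                 { the file being compiled }
  Line: string;                 { current line buffer }
  Col: integer;                 { index of next unread character }

{ Read New Character From the Line Buffer, Refilling As Needed }
procedure GetChar;
begin
  if Col > Length(Line) then begin
    if Eof(Source) then begin
      Look := EndOfInput;
      Exit;
    end;
    { A Turbo Pascal string holds at most 255 characters, so }
    { longer source lines are truncated in this sketch.      }
    ReadLn(Source, Line);
    Line := Line + LF;
    Col := 1;
  end;
  Look := Line[Col];
  Inc(Col);
end;

{ Unit Initialization: open the file and prime the pump }
begin
  Assign(Source, SourceName);
  Reset(Source);                { run-time error if the file is missing }
  Line := '';
  Col := 1;
  GetChar;
end.
```

Because the interface section is unchanged, the test program Main shown earlier should compile against this version without modification; it would simply print the first character of the file instead of waiting for the keyboard.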
In later installments, I intend to provide a full-blown IDE for the KISS compiler, using a
true Windows application generated by Borland's OWL applications framework. For now,
though, we'll obey my #1 rule to live by: Keep It Simple.

THE OUTPUT UNIT

Of course, every decent program should have output, and ours is no exception. Our output
routines included the Emit functions. The code for the corresponding output unit is shown
next:
{--------------------------------------------------------------}
unit Output;
{--------------------------------------------------------------}
interface
procedure Emit(s: string); { Emit an instruction }
procedure EmitLn(s: string); { Emit an instruction line }
{--------------------------------------------------------------}
implementation
const TAB = ^I;
{--------------------------------------------------------------}
{ Emit an Instruction }
procedure Emit(s: string);
begin
Write(TAB, s);
end;
{--------------------------------------------------------------}
{ Emit an Instruction, Followed By a Newline }
procedure EmitLn(s: string);
begin
Emit(s);
WriteLn;
end;
end.
{--------------------------------------------------------------}
(Notice that this unit has no initialization clause, so it needs no begin-block.)
Test this unit with the following main program:
{--------------------------------------------------------------}
program Test;
uses WinCRT, Input, Output;
begin
WriteLn('MAIN:');
EmitLn('Hello, world!');
end.
{--------------------------------------------------------------}
Did you see anything that surprised you? You may have been surprised to see that you
needed to type something, even though the main program requires no input. That's because
of the initialization in unit Input, which still requires something to put into the lookahead char-
acter. Sorry, there's no way out of that box, or rather, we don't _WANT_ to get out. Except for
simple test cases such as this, we will always want a valid lookahead character, so the right
thing to do about this "problem" is ... nothing.
Perhaps more surprisingly, notice that the TAB character had no effect; our line of "instruc-
tions" begins at column 1, same as the fake label. That's right: WinCRT doesn't support tabs.
We have a problem.

There are a few ways we can deal with this problem. The one thing we can't do is to sim-
ply ignore it. Every assembler I've ever used reserves column 1 for labels, and will rebel
to see instructions starting there. So, at the very least, we must space the instructions
over one column to keep the assembler happy. That's easy enough to do: Simply
replace, in procedure Emit, the line:
Write(TAB, s);
with:
Write(' ', s);
I must admit that I've wrestled with this problem before, and find myself changing my
mind as often as a chameleon changes color. For the purposes we're going to be using,
99% of which will be examining the output code as it's displayed on a CRT, it would be
nice to see neatly blocked out "object" code. The line:
SUB1: MOVE #4,D0
just plain looks neater than the different, but functionally identical code,
SUB1:
MOVE #4,D0
In test versions of my code, I included a more sophisticated version of the procedure
PostLabel, that avoids having labels on separate lines, but rather defers the printing of a
label so it can end up on the same line as the associated instruction. As recently as an
hour ago, my version of unit Output provided full support for tabs, using an internal col-
umn count variable and software to manage it. I had, if I do say so myself, some rather
elegant code to support the tab mechanism, with a minimum of code bloat. It was awfully
tempting to show you the "prettyprint" version, if for no other reason than to show off the
elegance.

Nevertheless, the code of the "elegant" version was considerably more complex and larger.
Since then, I've had second thoughts. In spite of our desire to see pretty output, the inescapable
fact is that the two versions of the SUB1: code fragment shown above are functionally
identical; the assembler, which is the ultimate destination of the code, couldn't care less
which version it gets. The prettier one not only takes more code to generate, but also creates
a larger output file, with many more space characters than the minimum needed, so it uses
more disk space and takes longer to assemble. When you look at it that way, it's not very
hard to decide which approach to use, is it?
What finally clinched the issue for me was a reminder to consider my own first command-
ment: KISS. Although I was pretty proud of all my elegant little tricks to implement tabbing, I
had to remind myself that, to paraphrase Senator Barry Goldwater, elegance in the pursuit of
complexity is no virtue. Another wise man once wrote, "Any idiot can design a Rolls-Royce. It
takes a genius to design a VW." So the elegant, tab-friendly version of Output is history, and
what you see is the simple, compact, VW version.
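For the curious, a column-counting Output along the lines described above might look something like the following. This is only a sketch under assumptions, not the discarded "elegant" version: the names Column, TabStop, and PadTo are hypothetical, the tab stop of 8 is an arbitrary choice, and PostLabel here simply prints the label without a newline rather than deferring it.

```pascal
{ Hypothetical Output unit that fakes tabs with a column counter. }
unit Output;
interface
procedure PostLabel(L: string);  { Write a label, stay on the line }
procedure Emit(s: string);       { Emit an instruction }
procedure EmitLn(s: string);     { Emit an instruction line }

implementation
const TabStop = 8;               { assumed: instructions follow 8 columns }
var Column: integer;             { characters written on the current line }

{ Write spaces until Target characters are on the line }
procedure PadTo(Target: integer);
begin
  while Column < Target do begin
    Write(' ');
    Inc(Column);
  end;
end;

{ Write a Label Without Ending the Line }
procedure PostLabel(L: string);
begin
  Write(L, ':');
  Column := Column + Length(L) + 1;
end;

{ Emit an Instruction, Starting at the Tab Stop }
procedure Emit(s: string);
begin
  if Column >= TabStop then begin
    Write(' ');                  { long label: keep at least one space }
    Inc(Column);
  end;
  PadTo(TabStop);
  Write(s);
  Column := Column + Length(s);
end;

{ Emit an Instruction, Followed By a Newline }
procedure EmitLn(s: string);
begin
  Emit(s);
  WriteLn;
  Column := 0;
end;

{ Unit Initialization }
begin
  Column := 0;
end.
```

With this version, PostLabel('SUB1') followed by EmitLn('MOVE #4,D0') puts the label and the instruction on one line, separated by three spaces. Even in this stripped-down form it is visibly more code than the two-procedure unit above, which is exactly the trade-off under discussion.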

THE ERROR UNIT

Our next set of routines are those that handle errors. To refresh your memory, we take the
approach, pioneered by Borland in Turbo Pascal, of halting on the first error. Not only
does this greatly simplify our code, by completely avoiding the sticky issue of error recov-
ery, but it also makes much more sense, in my opinion, in an interactive environment. I
know this may be an extreme position, but I consider the practice of reporting all errors in
a program to be an anachronism, a holdover from the days of batch processing. It's time
to scuttle the practice. So there.
In our original Cradle, we had two error-handling procedures: Error, which didn't halt, and
Abort, which did. But I don't think we ever found a use for the procedure that didn't halt, so
in the new, lean and mean unit Errors, shown next, procedure Error takes the place of
Abort.
{--------------------------------------------------------------}
unit Errors;
{--------------------------------------------------------------}
interface
procedure Error(s: string);
procedure Expected(s: string);
{--------------------------------------------------------------}
implementation
{--------------------------------------------------------------}
{ Write error Message and Halt }
procedure Error(s: string);
begin
WriteLn;
WriteLn(^G, 'Error: ', s, '.');
Halt;
end;
{--------------------------------------------------------------}
{ Write "<something> Expected" }
procedure Expected(s: string);
begin
Error(s + ' Expected');
end;
end.
{--------------------------------------------------------------}

As usual, here's a test program:
{--------------------------------------------------------------}
program Test;
uses WinCRT, Input, Output, Errors;
begin
Expected('Integer');
end.
{--------------------------------------------------------------}
Have you noticed that the "uses" line in our main program keeps getting longer? That's
OK. In the final version, the main program will only call procedures in our parser, so its
use clause will only have a couple of entries. But for now, it's probably best to include all
the units so we can test procedures in them.

SCANNING AND PARSING

The classical compiler architecture consists of separate modules for the lexical scanner,
which supplies tokens in the language, and the parser, which tries to make sense of the
tokens as syntax elements. If you can still remember what we did in earlier installments, you'll
recall that we didn't do things that way. Because we're using a predictive parser, we can
almost always tell what language element is coming next, just by examining the lookahead
character. Therefore, we found no need to prefetch tokens, as a scanner would do.
But, even though there is no functional procedure called "Scanner," it still makes sense to
separate the scanning functions from the parsing functions. So I've created two more units
called, amazingly enough, Scanner and Parser. The Scanner unit contains all of the routines
known as recognizers. Some of these, such as IsAlpha, are pure boolean routines which
operate on the lookahead character only. The other routines are those which collect tokens,
such as identifiers and numeric constants. The Parser unit will contain all of the routines mak-
ing up the recursive-descent parser. The general rule should be that unit Parser contains all
of the information that is language-specific; in other words, the syntax of the language should
be wholly contained in Parser. In an ideal world, this rule should be true to the extent that we
can change the compiler to compile a different language, merely by replacing the single unit,
Parser.
In practice, things are almost never this pure. There's always a small amount of "leakage" of
syntax rules into the scanner as well. For example, the rules concerning what makes up a
legal identifier or constant may vary from language to language. In some languages, the rules
concerning comments permit them to be filtered by the scanner, while in others they do not.
So in practice, both units are likely to end up having language-dependent components, but
the changes required to the scanner should be relatively trivial.

Now, recall that we've used two versions of the scanner routines: One that handled only
single-character tokens, which we used for a number of our tests, and another that pro-
vided full support for multi-character tokens. Now that we have our software separated
into units, I don't anticipate getting much use out of the single-character version, but it
doesn't cost us much to provide for both. I've created two versions of the Scanner unit.
The first one, called Scanner1, contains the single-character version of the recognizers:
{--------------------------------------------------------------}
unit Scanner1;
{--------------------------------------------------------------}
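interface
uses Input, Errors;
function IsAlpha(c: char): boolean;
function IsDigit(c: char): boolean;
function IsAlnum(c: char): boolean;
function IsAddop(c: char): boolean;
function IsMulop(c: char): boolean;
procedure Match(x: char);
function GetName: char;
function GetNumber: char;
{--------------------------------------------------------------}
implementation
{--------------------------------------------------------------}
{ Recognize an Alpha Character }
function IsAlpha(c: char): boolean;
begin
IsAlpha := UpCase(c) in ['A'..'Z'];
end;
{--------------------------------------------------------------}
{ Recognize a Numeric Character }
function IsDigit(c: char): boolean;
begin
IsDigit := c in ['0'..'9'];
end;
{--------------------------------------------------------------}
{ Recognize an Alphanumeric Character }
function IsAlnum(c: char): boolean;
begin
IsAlnum := IsAlpha(c) or IsDigit(c);
end;
{--------------------------------------------------------------}
{ Recognize an Addition Operator }
function IsAddop(c: char): boolean;
begin
IsAddop := c in ['+','-'];
end;
{--------------------------------------------------------------}
{ Recognize a Multiplication Operator }
function IsMulop(c: char): boolean;
begin
IsMulop := c in ['*','/'];
end;
{--------------------------------------------------------------}
{ Match One Character }
procedure Match(x: char);
begin
if Look = x then GetChar
else Expected('''' + x + '''');
end;
{--------------------------------------------------------------}
{ Get an Identifier }
function GetName: char;
begin
if not IsAlpha(Look) then Expected('Name');
GetName := UpCase(Look);
GetChar;
end;
{--------------------------------------------------------------}
{ Get a Number }
function GetNumber: char;
begin
if not IsDigit(Look) then Expected('Integer');
GetNumber := Look;
GetChar;
end;
end.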
{--------------------------------------------------------------}

The following code fragment of the main program provides a good test of the scanner. For
brevity, I'll only include the executable code here; the rest remains the same. Don't forget,
though, to add the name Scanner1 to the "uses" clause.
Write(GetName);
Match('=');
Write(GetNumber);
Match('+');
WriteLn(GetName);
This code will recognize all sentences of the form:
x=0+y
where x and y can be any single-character variable names, and 0 any digit. The code should
reject all other sentences, and give a meaningful error message. If it did, you're in good
shape and we can proceed.

THE SCANNER UNIT

The next, and by far the most important, version of the scanner is the one that handles
the multi-character tokens that all real languages must have. Only the two functions, Get-
Name and GetNumber, change between the two units, but just to be sure there are no
mistakes, I've reproduced the entire unit here. This is unit Scanner:
{--------------------------------------------------------------}
unit Scanner;
{--------------------------------------------------------------}
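interface
uses Input, Errors;
function IsAlpha(c: char): boolean;
function IsDigit(c: char): boolean;
function IsAlnum(c: char): boolean;
function IsAddop(c: char): boolean;
function IsMulop(c: char): boolean;
procedure Match(x: char);
function GetName: string;
function GetNumber: string;
{--------------------------------------------------------------}
implementation
{--------------------------------------------------------------}
{ Recognize an Alpha Character }
function IsAlpha(c: char): boolean;
begin
IsAlpha := UpCase(c) in ['A'..'Z'];
end;
{--------------------------------------------------------------}
{ Recognize a Numeric Character }
function IsDigit(c: char): boolean;
begin
IsDigit := c in ['0'..'9'];
end;
{--------------------------------------------------------------}
{ Recognize an Alphanumeric Character }
function IsAlnum(c: char): boolean;
begin
IsAlnum := IsAlpha(c) or IsDigit(c);
end;
{--------------------------------------------------------------}
{ Recognize an Addition Operator }
function IsAddop(c: char): boolean;
begin
IsAddop := c in ['+','-'];
end;
{--------------------------------------------------------------}
{ Recognize a Multiplication Operator }
function IsMulop(c: char): boolean;
begin
IsMulop := c in ['*','/'];
end;
{--------------------------------------------------------------}
{ Match One Character }
procedure Match(x: char);
begin
if Look = x then GetChar
else Expected('''' + x + '''');
end;
{--------------------------------------------------------------}
{ Get an Identifier }
function GetName: string;
var n: string;
begin
n := '';
if not IsAlpha(Look) then Expected('Name');
while IsAlnum(Look) do begin
n := n + UpCase(Look); { map identifiers to upper case }
GetChar;
end;
GetName := n;
end;
{--------------------------------------------------------------}
{ Get a Number }
function GetNumber: string;
var n: string;
begin
n := '';
if not IsDigit(Look) then Expected('Integer');
while IsDigit(Look) do begin
n := n + Look;
GetChar;
end;
GetNumber := n;
end;
end.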
{--------------------------------------------------------------}
The same test program will test this scanner, also. Simply change the "uses" clause to use
Scanner instead of Scanner1. Now you should be able to type multi-character names and
numbers.

DECISIONS, DECISIONS
In spite of the relative simplicity of both scanners, a lot of thought has gone into them, and
a lot of decisions had to be made. I'd like to share those thoughts with you now so you
can make your own educated decision, appropriate for your application. First, note that
both versions of GetName translate the input characters to upper case. Obviously, there
was a design decision made here, and this is one of those cases where the language
syntax splatters over into the scanner. In the C language, the case of characters in identi-
fiers is significant. For such a language, we obviously can't map the characters to upper
case. The design I'm using assumes a language like Pascal, where the case of charac-
ters doesn't matter. For such languages, it's easier to go ahead and map all identifiers to
upper case in the scanner, so we don't have to worry later on when we're comparing
strings for equality.
We could have even gone a step further, and mapped the characters to upper case right as
they come in, in GetChar. This approach works too, and I've used it in the past, but it's too
confining. Specifically, it will also map characters that may be part of quoted strings,
which is not a good idea. So if you're going to map to upper case at all, GetName is the
proper place to do it.
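A small stand-alone experiment shows why mapping in GetChar is too confining. The program below is a hypothetical illustration, not part of the compiler: it walks over one sample line of source text (an invented example) and builds two results, one with every character mapped, as a GetChar-level mapping would do, and one that leaves the contents of quoted strings alone.

```pascal
program CaseDemo;
{ Hypothetical demo: upper-casing everything vs. sparing quoted strings }
const
  Quote = '''';                 { the single-quote character }
var
  Source: string;               { invented sample line }
  Everything: string;           { result of mapping every character }
  OutsideOnly: string;          { result of mapping outside quotes only }
  InString: boolean;            { true while inside a quoted string }
  i: integer;
begin
  Source := 'msg = ''Hello''';
  Everything := '';
  OutsideOnly := '';
  InString := false;
  for i := 1 to Length(Source) do begin
    { What a mapping in GetChar amounts to: no exceptions }
    Everything := Everything + UpCase(Source[i]);
    { Toggle on each quote, then map only when outside a string }
    if Source[i] = Quote then InString := not InString;
    if InString then
      OutsideOnly := OutsideOnly + Source[i]
    else
      OutsideOnly := OutsideOnly + UpCase(Source[i]);
  end;
  WriteLn(Everything);          { MSG = 'HELLO' }
  WriteLn(OutsideOnly);         { MSG = 'Hello' }
end.
```

The first line of output has clobbered the string literal; the second has not. Doing the mapping in GetName gets the second behavior for free, because GetName never sees the characters inside a quoted string in the first place.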
Note that the function GetNumber in this scanner returns a string, just as GetName does.
This is another one of those things I've oscillated about almost daily, and the last swing
was all of ten minutes ago. The alternative approach, and one I've used many times in
past installments, returns an integer result.
Both approaches have their good points. Since we're fetching a number, the approach
that immediately comes to mind is to return it as an integer. But bear in mind that the
eventual use of the number will be in a write statement that goes back to the outside
world. Someone -- either us or the code hidden inside the write statement -- is going to
have to convert the number back to a string again. Turbo Pascal includes such string con-
version routines, but why use them if we don't have to? Why convert a number from string
to integer form, only to convert it right back again in the code generator, only a few state-
ments later?

Furthermore, as you'll soon see, we're going to need a temporary storage spot for the value
of the token we've fetched. If we treat the number in its string form, we can store the value of
either a variable or a number in the same string. Otherwise, we'll have to create a second,
integer variable.
On the other hand, we'll find that carrying the number as a string virtually eliminates any
chance of optimization later on. As we get to the point where we are beginning to concern
ourselves with code generation, we'll encounter cases in which we're doing arithmetic on
constants. For such cases, it's really foolish to generate code that performs the constant
arithmetic at run time. Far better to let the parser do the arithmetic at compile time, and
merely code the result. To do that, we'll wish we had the constants stored as integers rather
than strings.
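As a rough illustration of what "letting the parser do the arithmetic" means, the stand-alone sketch below folds a sum of constants into a single immediate load. It is not code from the series: the program name, the fixed input string, and the variables Source, Index, and Sum are assumptions made so the example runs without keyboard input, and the error handling is reduced to a bare message.

```pascal
program FoldDemo;
{ Hypothetical demo: fold constant + constant + ... at compile time }
var
  Source: string;               { stand-in for the input stream }
  Index: integer;               { position of the next character }
  Look: char;                   { lookahead character }

{ Read New Character From the Sample String }
procedure GetChar;
begin
  if Index <= Length(Source) then begin
    Look := Source[Index];
    Inc(Index);
  end
  else
    Look := #0;                 { end of input }
end;

{ Recognize a Numeric Character }
function IsDigit(c: char): boolean;
begin
  IsDigit := c in ['0'..'9'];
end;

{ Get a Number (integer version) }
function GetNumber: longint;
var n: longint;
begin
  n := 0;
  if not IsDigit(Look) then begin
    WriteLn('Error: Integer Expected.');
    Halt;
  end;
  while IsDigit(Look) do begin
    n := 10 * n + (Ord(Look) - Ord('0'));
    GetChar;
  end;
  GetNumber := n;
end;

var
  Sum: longint;                 { running value, computed by the compiler }
begin
  Source := '12+30+4';
  Index := 1;
  GetChar;                      { prime the lookahead }
  Sum := GetNumber;
  while Look = '+' do begin
    GetChar;
    Sum := Sum + GetNumber;     { arithmetic done now, not at run time }
  end;
  WriteLn(' MOVE #', Sum, ',D0');   { one instruction: MOVE #46,D0 }
end.
```

With the constants held as strings, the same input would have to produce three loads and two run-time additions, unless each string were converted to a number first, which is the very conversion we were trying to avoid.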
What finally swung me back over to the string approach was an aggressive application of the
KISS test, plus reminding myself that we've studiously avoided issues of code efficiency. One
of the things that makes our simple-minded parsing work, without the complexities of a "real"
compiler, is that we've said up front that we aren't concerned about code efficiency. That
gives us a lot of freedom to do things the easy way rather than the efficient one, and it's a
freedom we must be careful not to abandon voluntarily, in spite of the urges for efficiency
shouting in our ear. In addition to being a big believer in the KISS philosophy, I'm also an
advocate of "lazy programming," which in this context means, don't program anything until
you need it. As P.J. Plauger says, "Never put off until tomorrow what you can put off indefi-
nitely." Over the years, much code has been written to provide for eventualities that never
happened. I've learned that lesson myself, from bitter experience. So the bottom line is: We
won't convert to an integer here because we don't need to. It's as simple as that.

For those of you who still think we may need the integer version (and indeed we may),
here it is:
{--------------------------------------------------------------}
{ Get a Number (integer version) }
function GetNumber: longint;
var n: longint;
begin
n := 0;
if not IsDigit(Look) then Expected('Integer');
while IsDigit(Look) do begin
n := 10 * n + (Ord(Look) - Ord('0'));
GetChar;
end;
GetNumber := n;
end;
{--------------------------------------------------------------}
You might file this one away, as I intend to, for a rainy day.

PARSING
At this point, we have distributed all the routines that made up our Cradle into units that we
can draw upon as we need them. Obviously, they will evolve further as we continue the pro-
cess of bootstrapping ourselves up again, but for the most part their content, and certainly the
architecture that they imply, is defined. What remains is to embody the language syntax into
the parser unit. We won't do much of that in this installment, but I do want to do a little, just to
leave us with the good feeling that we still know what we're doing. So before we go, let's gen-
erate just enough of a parser to process single factors in an expression. In the process, we'll
also, by necessity, find we have created a code generator unit, as well.
Remember the very first installment of this series? We read an integer value, say n, and gen-
erated the code to load it into the D0 register via an immediate move:
MOVE #n,D0
Shortly afterwards, we repeated the process for a variable,
MOVE X(PC),D0

and then for a factor that could be either constant or variable. For old times' sake, let's
revisit that process. Define the following new unit:
{--------------------------------------------------------------}
unit Parser;
{--------------------------------------------------------------}
interface
uses Input, Scanner, Errors, CodeGen;
procedure Factor;
{--------------------------------------------------------------}
implementation
{--------------------------------------------------------------}
procedure Factor;
begin
LoadConstant(GetNumber);
end;
end.
{--------------------------------------------------------------}

As you can see, this unit calls a procedure, LoadConstant, which actually effects the output of
the assembly-language code. The unit also uses a new unit, CodeGen. This step represents
the last major change in our architecture, from earlier installments: The removal of the
machine-dependent code to a separate unit. If I have my way, there will not be a single line of
code, outside of CodeGen, that betrays the fact that we're targeting the 68000 CPU. And this
is one place I think that having my way is quite feasible.
For those of you who wish I were using the 80x86 architecture (or any other one) instead of
the 68000, here's your answer: Merely replace CodeGen with one suitable for your CPU of
choice.
So far, our code generator has only one procedure in it. Here's the unit:
{--------------------------------------------------------------}
unit CodeGen;
{--------------------------------------------------------------}
interface
uses Output;
procedure LoadConstant(n: string);
{--------------------------------------------------------------}
implementation

{--------------------------------------------------------------}
{ Load the Primary Register with a Constant }
procedure LoadConstant(n: string);
begin
  EmitLn('MOVE #' + n + ',D0');
end;
end.
{--------------------------------------------------------------}
Copy and compile this unit, and execute the following main program:
{--------------------------------------------------------------}
program Main;
uses WinCRT, Input, Output, Errors, Scanner, Parser;
begin
  Factor;
end.
{--------------------------------------------------------------}
There it is, the generated code, just as we hoped it would be.

Now, I hope you can begin to see the advantage of the unit-based architecture of our new
design. Here we have a main program that's all of five lines long. That's all of the program we
need to see, unless we choose to see more. And yet, all those units are sitting there, patiently
waiting to serve us. We can have our cake and eat it too, in that we have simple and short
code, but powerful allies. What remains to be done is to flesh out the units to match the capa-
bilities of earlier installments. We'll do that in the next installment, but before I close, let's fin-
ish out the parsing of a factor, just to satisfy ourselves that we still know how. The final
version of CodeGen includes the new procedure, LoadVariable:
{--------------------------------------------------------------}
unit CodeGen;
{--------------------------------------------------------------}
interface
uses Output;
procedure LoadConstant(n: string);
procedure LoadVariable(Name: string);
{--------------------------------------------------------------}
implementation

{--------------------------------------------------------------}
{ Load the Primary Register with a Constant }
procedure LoadConstant(n: string);
begin
  EmitLn('MOVE #' + n + ',D0');
end;
{--------------------------------------------------------------}
{ Load a Variable to the Primary Register }
procedure LoadVariable(Name: string);
begin
  EmitLn('MOVE ' + Name + '(PC),D0');
end;
end.
{--------------------------------------------------------------}

The parser unit itself doesn't change, but we have a more complex version of procedure
Factor:
{--------------------------------------------------------------}
procedure Factor;
begin
  if IsDigit(Look) then
    LoadConstant(GetNumber)
  else if IsAlpha(Look) then
    LoadVariable(GetName)
  else
    Error('Unrecognized character ' + Look);
end;
{--------------------------------------------------------------}
Now, without altering the main program, you should find that our program will process either a
variable or a constant factor. At this point, our architecture is almost complete; we have units
to do all the dirty work, and enough code in the parser and code generator to demonstrate
that everything works. What remains is to flesh out the units we've defined, particularly the
parser and code generator, to support the more complex syntax elements that make up a real
language. Since we've done this many times before in earlier installments, it shouldn't take
long to get us back to where we were before the long hiatus. We'll continue this process in
Installment 16, coming soon. See you then.
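
For readers who want to try the result without the full set of units, the sketch below collapses just enough of Input, Output, Errors, Scanner, Parser and CodeGen into one program to run Factor. The helper bodies are simplified stand-ins rather than the actual unit code from this series, and the source text is supplied as a string through the hypothetical procedure Translate instead of being read from the keyboard.
{--------------------------------------------------------------}
program FactorDemo;
{ Simplified stand-ins for the units; only Factor, LoadConstant
  and LoadVariable follow the listings in the text. }
var
  Src: string;      { source text (assumption: a string, not the keyboard) }
  Cursor: integer;  { index of the next unread character }
  Look: char;       { lookahead character }

procedure GetChar;
begin
  if Cursor <= Length(Src) then Look := Src[Cursor] else Look := #0;
  Inc(Cursor);
end;

procedure Error(s: string);
begin
  WriteLn('Error: ', s);
  Halt(1);
end;

procedure EmitLn(s: string);
begin
  WriteLn('    ', s);
end;

function IsAlpha(c: char): boolean;
begin
  IsAlpha := UpCase(c) in ['A'..'Z'];
end;

function IsDigit(c: char): boolean;
begin
  IsDigit := c in ['0'..'9'];
end;

function GetName: string;
var n: string;
begin
  n := '';
  while IsAlpha(Look) or IsDigit(Look) do begin
    n := n + UpCase(Look);
    GetChar;
  end;
  GetName := n;
end;

function GetNumber: string;
var n: string;
begin
  n := '';
  while IsDigit(Look) do begin
    n := n + Look;
    GetChar;
  end;
  GetNumber := n;
end;

{ Code generator }
procedure LoadConstant(n: string);
begin
  EmitLn('MOVE #' + n + ',D0');
end;

procedure LoadVariable(Name: string);
begin
  EmitLn('MOVE ' + Name + '(PC),D0');
end;

{ Parser }
procedure Factor;
begin
  if IsDigit(Look) then
    LoadConstant(GetNumber)
  else if IsAlpha(Look) then
    LoadVariable(GetName)
  else
    Error('Unrecognized character ' + Look);
end;

{ Hypothetical driver: prime the lookahead, then parse one factor }
procedure Translate(s: string);
begin
  Src := s;
  Cursor := 1;
  GetChar;
  Factor;
end;

begin
  Translate('42');      { emits MOVE #42,D0 }
  Translate('count');   { emits MOVE COUNT(PC),D0 }
end.
{--------------------------------------------------------------}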

REFERENCES
1. Crenshaw, J.W., "Object-Oriented Design of Assemblers and Compilers," Proc. Soft-
ware Development '91 Conference, Miller Freeman, San Francisco, CA, February 1991,
pp. 143-155.
2. Crenshaw, J.W., "A Perfect Marriage," Computer Language, Volume 8, #6, June 1991,
pp. 44-55.
3. Crenshaw, J.W., "Syntax-Driven Object-Oriented Design," Proc. 1991 Embedded Systems
Conference, Miller Freeman, San Francisco, CA, September 1991, pp. 45-60.

Part 16 - Unit Construction
INTRODUCTION
This series of tutorials promises to be perhaps one of the longest-running mini-series in history,
rivalled only by the delay in Volume IV of Knuth. Begun in 1988, the series ran into a
four-year hiatus in 1990 when the "cares of this world," changes in priorities and interests,
and the need to make a living seemed to stall it out after Installment 14. Those of you with
loads of patience were finally rewarded, in the spring of last year, with the long-awaited
Installment 15. In it, I began to try to steer the series back on track, and in the process, to
make it easier to continue on to the goal, which is to provide you with not only enough under-
standing of the difficult subject of compiler theory, but also enough tools, in the form of
canned subroutines and concepts, so that you would be able to continue on your own and
become proficient enough to build your own parsers and translators. Because of that long
hiatus, I thought it appropriate to go back and review the concepts we have covered so far,
and to redo some of the software, as well. In the past, we've never concerned ourselves
much with the development of production-quality software tools ... after all, I was trying to
teach (and learn) concepts, not production practice. To do that, I tended to give you, not com-
plete compilers or parsers, but only those snippets of code that illustrated the particular point
we were considering at the moment.
I still believe that's a good way to learn any subject; no one wants to have to make changes to
100,000 line programs just to try out a new idea. But the idea of just dealing with code snip-
pets, rather than complete programs, also has its drawbacks in that we often seemed to be
writing the same code fragments over and over. Although repetition has been thoroughly
proven to be a good way to learn new ideas, it's also true that one can have too much of a
good thing. By the time I had completed Installment 14 I seemed to have reached the limits of
my abilities to juggle multiple files and multiple versions of the same software functions. Who
knows, perhaps that's one reason I seemed to have run out of gas at that point.
Fortunately, the later versions of Borland's Turbo Pascal allow us to have our cake and eat it
too. By using their concept of separately compilable units, we can still write small subroutines
and functions, and keep our main programs and test programs small and simple. But, once
written, the code in the Pascal units will always be there for us to use, and linking them in is
totally painless and transparent.

Since, by now, most of you are programming in either C or C++, I know what you're think-
ing: Borland, with their Turbo Pascal (TP), certainly didn't invent the concept of separately
compilable modules. And of course you're right. But if you've not used TP lately, or ever,
you may not realize just how painless the whole process is. Even in C or C++, you still
have to build a make file, either manually or by telling the compiler how to do so. You
must also list, using "extern" statements or header files, the functions you want to import.
In TP, you don't even have to do that. You need only name the units you wish to use, and
all of their procedures automatically become available.
It's not my intention to get into a language-war debate here, so I won't pursue the subject
any further. Even I no longer use Pascal on my job ... I use C at work and C++ for my arti-
cles in Embedded Systems Programming and other magazines. Believe me, when I set
out to resurrect this series, I thought long and hard about switching both languages and
target systems to the ones that we're all using these days, C/C++ and PC architecture,
and possibly object-oriented methods as well. In the end, I felt it would cause more confu-
sion than the hiatus itself has. And after all, Pascal still remains one of the best possible
languages for teaching, not to mention production programming. Finally, TP still compiles
at the speed of light, much faster than competing C/C++ compilers. And Borland's smart
linker, used in TP but not in their C++ products, is second to none. Aside from being much
faster than Microsoft-compatible linkers, the Borland smart linker will cull unused proce-
dures and data items, even to the extent of trimming them out of defined objects if they're
not needed. For one of the few times in our lives, we don't have to compromise between
completeness and efficiency. When we're writing a TP unit, we can make it as complete
as we like, including any member functions and data items we may think we will ever
need, confident that doing so will not create unwanted bloat in the compiled and linked
executable.
The point, really, is simply this: By using TP's unit mechanism, we can have all the advan-
tages and convenience of writing small, seemingly stand-alone test programs, without
having to constantly rewrite the support functions that we need. Once written, the TP
units sit there, quietly waiting to do their duty and give us the support we need, when we
need it.

Using this principle, in Installment 15 I set out to minimize our tendency to re-invent the wheel
by organizing our code into separate Turbo Pascal units, each containing different parts of the
compiler. We ended up with the following units:
*Input
*Output
*Errors
*Scanner
*Parser
*CodeGen
Each of these units serves a different function, and encapsulates specific areas of functional-
ity. The Input and Output units, as their name implies, provide character stream I/O and the
all-important lookahead character upon which our predictive parser is based. The Errors unit,
of course, provides standard error handling. The Scanner unit contains all of our boolean
functions such as IsAlpha, and the routines GetName and GetNumber, which process multi-
character tokens.
The two units we'll be working with the most, and the ones that most represent the personal-
ity of our compiler, are Parser and CodeGen. Theoretically, the Parser unit should encapsu-
late all aspects of the compiler that depend on the syntax of the compiled language (though,
as we saw last time, a small amount of this syntax spills over into Scanner). Similarly, the
code generator unit, CodeGen, contains all of the code dependent upon the target machine.
In this installment, we'll be continuing with the development of the functions in these two all-
important units.

JUST LIKE CLASSICAL?

Before we proceed, however, I think I should clarify the relationship between, and the
functionality of these units. Those of you who are familiar with compiler theory as taught
in universities will, of course, recognize the names, Scanner, Parser, and CodeGen, all of
which are components of a classical compiler implementation. You may be thinking that
I've abandoned my commitment to the KISS philosophy, and drifted towards a more con-
ventional architecture than we once had. A closer look, however, should convince you
that, while the names are similar, the functionalities are quite different.
Together, the scanner and parser of a classical implementation comprise the so-called
"front end," and the code generator, the back end. The front end routines process the lan-
guage-dependent, syntax-related aspects of the source language, while the code genera-
tor, or back end, deals with the target machine-dependent parts of the problem. In
classical compilers, the two ends communicate via a file of instructions written in an inter-
mediate language (IL).
Typically, a classical scanner is a single procedure, operating as a co-procedure with the
parser. It "tokenizes" the source file, reading it character by character, recognizing lan-
guage elements, translating them into tokens, and passing them along to the parser. You
can think of the parser as an abstract machine, executing "op codes," which are the
tokens. Similarly, the parser generates op codes of a second abstract machine, which
mechanizes the IL. Typically, the IL file is written to disk by the parser, and read back
again by the code generator.
Our organization is quite different. We have no lexical scanner, in the classical sense; our
unit Scanner, though it has a similar name, is not a single procedure or co-procedure, but
merely a set of separate subroutines which are called by the parser as needed.
Similarly, the classical code generator, the back end, is a translator in its own right, read-
ing an IL "source" file, and emitting an object file. Our code generator doesn't work that
way. In our compiler, there IS no intermediate language; every construct in the source
language syntax is converted into assembly language as it is recognized by the parser.
Like Scanner, the unit CodeGen consists of individual procedures which are called by the
parser as needed.

This "code 'em as you find 'em" philosophy may not produce the world's most efficient code -
- for example, we haven't provided (yet!) a convenient place for an optimizer to work its magic
-- but it sure does simplify the compiler, doesn't it?
And that observation prompts me to reflect, once again, on how we have managed to reduce
a compiler's functions to such comparatively simple terms. I've waxed eloquent on this sub-
ject in past installments, so I won't belabor the point too much here. However, because of the
time that's elapsed since those last soliloquies, I hope you'll grant me just a little time to
remind myself, as well as you, how we got here. We got here by applying several principles
that writers of commercial compilers seldom have the luxury of using. These are:
o The KISS philosophy -- Never do things the hard way without a reason
o Lazy coding -- Never put off until tomorrow what you can put off forever (with credits to P.J.
Plauger)
o Skepticism -- Stubborn refusal to do something just because that's the way it's always been
done.
o Acceptance of inefficient code
o Rejection of arbitrary constraints
As I've reviewed the history of compiler construction, I've learned that virtually every production
compiler in history has suffered from pre-imposed conditions that strongly influenced its
design. The original FORTRAN compiler of John Backus, et al, had to compete with assem-
bly language, and therefore was constrained to produce extremely efficient code. The IBM
compilers for the minicomputers of the 70's had to run in the very small RAM memories then
available -- as small as 4k. The early Ada compiler had to compile itself. Per Brinch Hansen
decreed that his Pascal compiler developed for the IBM PC must execute in a 64k machine.
Compilers developed in Computer Science courses had to compile the widest variety of lan-
guages, and therefore required LALR parsers.
In each of these cases, these preconceived constraints literally dominated the design of the
compiler.

A good example is Brinch Hansen's compiler, described in his excellent book, "Brinch
Hansen on Pascal Compilers" (highly recommended). Though his compiler is one of the
most clear and un-obscure compiler implementations I've seen, that one decision, to
compile large files in a small RAM, totally drives the design, and he ends up with not just
one, but many intermediate files, together with the drivers to write and read them.
In time, the architectures resulting from such decisions have found their way into com-
puter science lore as articles of faith. In this one man's opinion, it's time that they were re-
examined critically. The conditions, environments, and requirements that led to classical
architectures are not the same as the ones we have today. There's no reason to believe
the solutions should be the same, either.
In this tutorial, we've followed the leads of such pioneers in the world of small compilers
for PCs as Leor Zolman, Ron Cain, and James Hendrix, who didn't know enough compiler
theory to know that they "couldn't do it that way." We have resolutely refused to accept
arbitrary constraints, but rather have done whatever was easy. As a result, we have
evolved an architecture that, while quite different from the classical one, gets the job done
in very simple and straightforward fashion.
I'll end this philosophizing with an observation re the notion of an intermediate language.
While I've noted before that we don't have one in our compiler, that's not exactly true; we
_DO_ have one, or at least are evolving one, in the sense that we are defining code gen-
eration functions for the parser to call. In essence, every call to a code generation proce-
dure can be thought of as an instruction in an intermediate language. Should we ever find
it necessary to formalize an intermediate language, this is the way we would do it: emit
codes from the parser, each representing a call to one of the code generator procedures,
and then process each code by calling those procedures in a separate pass, imple-
mented in a back end. Frankly, I don't see that we'll ever find a need for this approach, but
there is the connection, if you choose to follow it, between the classical and the current
approaches.
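
To make that connection concrete, here is a minimal sketch of what such a formalized intermediate language could look like. Everything specific in it is an assumption made for illustration: the opcode names, the ILInstr record, the fixed-size Code array and the procedures Gen and RunBackEnd are all hypothetical. The "front end" is represented by four hand-written calls to Gen, which is what a parser might record for the expression 2+X; the back end then replays them by calling the code generator procedures.
{--------------------------------------------------------------}
program ILDemo;
{ Hypothetical two-pass arrangement; not part of the compiler
  developed in this series. }
type
  OpCode = (opLoadConst, opLoadVar, opPush, opPopAdd);
  ILInstr = record
    Op: OpCode;     { which code generator procedure to call }
    Arg: string;    { its argument, if any }
  end;
var
  Code: array[1..100] of ILInstr;   { the "IL file", kept in memory }
  Count: integer;

procedure EmitLn(s: string);
begin
  WriteLn('    ', s);
end;

{ Front end: record an instruction instead of emitting code }
procedure Gen(Kind: OpCode; A: string);
begin
  Inc(Count);
  Code[Count].Op := Kind;
  Code[Count].Arg := A;
end;

{ Code generator procedures, as called directly in the one-pass design }
procedure LoadConstant(n: string);
begin
  EmitLn('MOVE #' + n + ',D0');
end;

procedure LoadVariable(Name: string);
begin
  EmitLn('MOVE ' + Name + '(PC),D0');
end;

procedure Push;
begin
  EmitLn('MOVE D0,-(SP)');
end;

procedure PopAdd;
begin
  EmitLn('ADD (SP)+,D0');
end;

{ Back end: a separate pass that turns each code into a call }
procedure RunBackEnd;
var i: integer;
begin
  for i := 1 to Count do
    case Code[i].Op of
      opLoadConst: LoadConstant(Code[i].Arg);
      opLoadVar:   LoadVariable(Code[i].Arg);
      opPush:      Push;
      opPopAdd:    PopAdd;
    end;
end;

begin
  Count := 0;
  { What a parser might record for the expression 2+X }
  Gen(opLoadConst, '2');
  Gen(opPush, '');
  Gen(opLoadVar, 'X');
  Gen(opPopAdd, '');
  RunBackEnd;
end.
{--------------------------------------------------------------}
The output is the same four instructions the one-pass compiler would have produced; the only difference is that they pass through the Code array first, which is where an optimizer could be given a chance to look at them.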

FLESHING OUT THE PARSER

Though I promised you, somewhere along about Installment 14, that we'd never again write
every single function from scratch, I ended up starting to do just that in Installment 15. One
reason: that long hiatus between the two installments made a review seem eminently justified
... even imperative, both for you and for me. More importantly, the decision to collect the pro-
cedures into modules (units), forced us to look at each one yet again, whether we wanted to
or not. And, finally and frankly, I've had some new ideas in the last four years that warranted
a fresh look at some old friends. When I first began this series, I was frankly amazed, and
pleased, to learn just how simple parsing routines can be made. But this last time around, I've
surprised myself yet again, and been able to make them just that last little bit simpler, yet.
Still, because of this total rewrite of the parsing modules, I was only able to include so much
in the last installment. Because of this, our hero, the parser, when last seen, was a shadow of
his former self, consisting of only enough code to parse and process a factor consisting of
either a variable or a constant. The main effort of this current installment will be to help flesh
out the parser to its former glory. In the process, I hope you'll bear with me if we sometimes
cover ground we've long since been over and dealt with.

First, let's take care of a problem that we've addressed before: Our current version of pro-
cedure Factor, as we left it in Installment 15, can't handle negative arguments. To fix that,
we'll introduce the procedure SignedFactor:
{--------------------------------------------------------------}
{ Parse and Translate a Factor with Optional Sign }
procedure SignedFactor;
var Sign: char;
begin
  Sign := Look;
  if IsAddop(Look) then
    GetChar;
  Factor;
  if Sign = '-' then Negate;
end;
{--------------------------------------------------------------}

Note that this procedure calls a new code generation routine, Negate:
{--------------------------------------------------------------}
{ Negate Primary }
procedure Negate;
begin
  EmitLn('NEG D0');
end;
{--------------------------------------------------------------}
(Here, and elsewhere in this series, I'm only going to show you the new routines. I'm count-
ing on you to put them into the proper unit, which you should normally have no trouble identi-
fying. Don't forget to add the procedure's prototype to the interface section of the unit.)
In the main program, simply change the procedure called from Factor to SignedFactor, and
give the code a test. Isn't it neat how the Turbo linker and make facility handle all the details?
Yes, I know, the code isn't very efficient. If we input a number, -3, the generated code is:
MOVE #3,D0
NEG D0
which is really, really dumb. We can do better, of course, by simply prepending a minus
sign to the string passed to LoadConstant, but it adds a few lines of code to SignedFactor,
and I'm applying the KISS philosophy very aggressively here. What's more, to tell the truth, I
think I'm subconsciously enjoying generating "really, really dumb" code, so I can have the
pleasure of watching it get dramatically better when we get into optimization methods.
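
For the curious, the "few lines of code" might look something like the variant of SignedFactor in the sketch below. It is not the version used in the rest of this installment, and the scaffolding around it (the string-based input, the hypothetical Translate driver and the simplified scanner routines) consists of stand-ins so that the program runs by itself.
{--------------------------------------------------------------}
program SignFoldDemo;
{ Illustrative variant only; helper routines are simplified
  stand-ins for the Input, Output and Scanner units. }
var
  Src: string;      { source text (assumption: a string, not the keyboard) }
  Cursor: integer;  { index of the next unread character }
  Look: char;       { lookahead character }

procedure GetChar;
begin
  if Cursor <= Length(Src) then Look := Src[Cursor] else Look := #0;
  Inc(Cursor);
end;

procedure EmitLn(s: string);
begin
  WriteLn('    ', s);
end;

function IsDigit(c: char): boolean;
begin
  IsDigit := c in ['0'..'9'];
end;

function IsAlpha(c: char): boolean;
begin
  IsAlpha := UpCase(c) in ['A'..'Z'];
end;

function GetNumber: string;
var n: string;
begin
  n := '';
  while IsDigit(Look) do begin
    n := n + Look;
    GetChar;
  end;
  GetNumber := n;
end;

function GetName: string;
var n: string;
begin
  n := '';
  while IsAlpha(Look) or IsDigit(Look) do begin
    n := n + UpCase(Look);
    GetChar;
  end;
  GetName := n;
end;

procedure LoadConstant(n: string);
begin
  EmitLn('MOVE #' + n + ',D0');
end;

procedure LoadVariable(Name: string);
begin
  EmitLn('MOVE ' + Name + '(PC),D0');
end;

procedure Negate;
begin
  EmitLn('NEG D0');
end;

procedure Factor;
begin
  if IsDigit(Look) then
    LoadConstant(GetNumber)
  else if IsAlpha(Look) then
    LoadVariable(GetName)
  else begin
    WriteLn('Error: unrecognized character');
    Halt(1);
  end;
end;

{ Variant: a minus sign in front of a literal is folded into it }
procedure SignedFactor;
var Sign: char;
begin
  Sign := Look;
  if Look in ['+', '-'] then GetChar;
  if (Sign = '-') and IsDigit(Look) then
    LoadConstant('-' + GetNumber)
  else begin
    Factor;
    if Sign = '-' then Negate;
  end;
end;

{ Hypothetical driver }
procedure Translate(s: string);
begin
  Src := s;
  Cursor := 1;
  GetChar;
  SignedFactor;
end;

begin
  Translate('-3');   { emits MOVE #-3,D0 }
  Translate('-x');   { emits MOVE X(PC),D0 followed by NEG D0 }
end.
{--------------------------------------------------------------}
Only the literal case improves; a negated variable still needs the NEG, since its value is not known at compile time.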

Most of you have never heard of John Spray, so allow me to introduce him to you here.
John's from New Zealand, and used to teach computer science at one of its universities.
John wrote a compiler for the Motorola 6809, based on a delightful, Pascal-like language
of his own design called "Whimsical." He later ported the compiler to the 68000, and for
a while it was the only compiler I had for my homebrewed 68000 system.
For the record, one of my standard tests for any new compiler is to see how the compiler
deals with a null program like:
program main;
begin
end.
My test is to measure the time required to compile and link, and the size of the object file
generated. The undisputed _LOSER_ in the test is the DEC C compiler for the VAX,
which took 60 seconds to compile, on a VAX 11/780, and generated a 50k object file.
John's compiler is the undisputed, once, future, and forever king in the code size depart-
ment. Given the null program, Whimsical generates precisely two bytes of code, imple-
menting the one instruction,
RET
By setting a compiler option to generate an include file rather than a standalone program,
John can even cut this size, from two bytes to zero! Sort of hard to beat a null object file,
wouldn't you say?
Needless to say, I consider John to be something of an expert on code optimization, and I
like what he has to say: "The best way to optimize is not to have to optimize at all, but to
produce good code in the first place." Words to live by. When we get started on optimiza-
tion, we'll follow John's advice, and our first step will not be to add a peephole optimizer or
other after-the-fact device, but to improve the quality of the code emitted before optimiza-
tion. So make a note of SignedFactor as a good first candidate for attention, and for now
we'll leave it be.

TERMS AND EXPRESSIONS

I'm sure you know what's coming next: We must, yet again, create the rest of the procedures
that implement the recursive-descent parsing of an expression. We all know that the hierar-
chy of procedures for arithmetic expressions is:
expression
term
factor
However, for now let's continue to do things one step at a time, and consider only expres-
sions with additive terms in them. The code to implement expressions, including a possibly
signed first term, is shown next:
{--------------------------------------------------------------}
{ Parse and Translate an Expression }
procedure Expression;
begin
  SignedFactor;
  while IsAddop(Look) do
    case Look of
      '+': Add;
      '-': Subtract;
    end;
end;
{--------------------------------------------------------------}

This procedure calls two other procedures to process the operations:
{--------------------------------------------------------------}
{ Parse and Translate an Addition Operation }
procedure Add;
begin
  Match('+');
  Push;
  Factor;
  PopAdd;
end;
{--------------------------------------------------------------}
{ Parse and Translate a Subtraction Operation }
procedure Subtract;
begin
  Match('-');
  Push;
  Factor;
  PopSub;
end;
{--------------------------------------------------------------}

The three procedures Push, PopAdd, and PopSub are new code generation routines. As the
name implies, procedure Push generates code to push the primary register (D0, in our 68000
implementation) to the stack. PopAdd and PopSub pop the top of the stack again, and add it
to, or subtract it from, the primary register. The code is shown next:
{--------------------------------------------------------------}
{ Push Primary to Stack }
procedure Push;
begin
  EmitLn('MOVE D0,-(SP)');
end;
{--------------------------------------------------------------}
{ Add TOS to Primary }
procedure PopAdd;
begin
  EmitLn('ADD (SP)+,D0');
end;
{--------------------------------------------------------------}
{ Subtract TOS from Primary }
procedure PopSub;
begin
  EmitLn('SUB (SP)+,D0');
  Negate;
end;
{--------------------------------------------------------------}
Add these routines to Parser and CodeGen, and change the main program to call Expres-
sion. Voila!
The next step, of course, is to add the capability for dealing with multiplicative terms. To
that end, we'll add a procedure Term, and code generation procedures PopMul and Pop-
Div. These code generation procedures are shown next:
{--------------------------------------------------------------}
{ Multiply TOS by Primary }
procedure PopMul;
begin
  EmitLn('MULS (SP)+,D0');
end;
{--------------------------------------------------------------}
{ Divide Primary by TOS }
procedure PopDiv;
begin
  EmitLn('MOVE (SP)+,D7');
  EmitLn('EXT.L D7');
  EmitLn('DIVS D0,D7');
  EmitLn('MOVE D7,D0');
end;
{--------------------------------------------------------------}
I admit, the division routine is a little busy, but there's no help for it. Unfortunately, while the
68000 CPU allows a division using the top of stack (TOS), it wants the arguments in the
wrong order, just as it does for subtraction. So our only recourse is to pop the stack to a
scratch register (D7), perform the division there, and then move the result back to our primary
register, D0. Note the use of signed multiply and divide operations. This follows an implied,
but unstated, assumption, that all our variables will be signed 16-bit integers. This decision
will come back to haunt us later, when we start looking at multiple data types, type conver-
sions, etc.
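
To see why the operand order matters, the following sketch simulates the primary register, the scratch register and the stack with ordinary Pascal integers. It is a model for reasoning about the generated code, not part of the compiler: the variable names D0, D7 and SP mirror the 68000 registers, and the function Pop is a hypothetical helper. By the time the operator is processed, the first operand is on the stack and the second is in D0.
{--------------------------------------------------------------}
program OperandOrderDemo;
{ Integer model of the register/stack traffic; an illustration,
  not generated code. }
var
  D0, D7: integer;                    { primary and scratch registers }
  Stack: array[1..16] of integer;
  SP: integer;                        { index of the top of stack }

procedure Push;
begin
  Inc(SP);
  Stack[SP] := D0;
end;

function Pop: integer;
begin
  Pop := Stack[SP];
  Dec(SP);
end;

procedure PopSub;
begin
  D0 := D0 - Pop;   { SUB (SP)+,D0 : second - first, the wrong way round }
  D0 := -D0;        { NEG D0       : first - second }
end;

procedure PopDiv;
begin
  D7 := Pop;        { MOVE (SP)+,D7 : first operand into the scratch register }
  D7 := D7 div D0;  { DIVS D0,D7    : first / second }
  D0 := D7;         { MOVE D7,D0    : result back to the primary register }
end;

begin
  SP := 0;
  D0 := 7; Push; D0 := 2; PopSub;
  WriteLn('7-2 = ', D0);   { prints 7-2 = 5 }
  D0 := 7; Push; D0 := 2; PopDiv;
  WriteLn('7/2 = ', D0);   { prints 7/2 = 3 }
end.
{--------------------------------------------------------------}
Addition and multiplication are commutative, which is why PopAdd and PopMul can operate on the top of stack directly without any such juggling.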

Our procedure Term is virtually a clone of Expression, and looks like this:
{--------------------------------------------------------------}
{ Parse and Translate a Term }
procedure Term;
begin
  Factor;
  while IsMulop(Look) do
    case Look of
      '*': Multiply;
      '/': Divide;
    end;
end;
{--------------------------------------------------------------}
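Term calls two procedures, Multiply and Divide, that are not listed in this installment. By analogy with Add and Subtract they presumably look like the versions in the sketch below, but treat that as an assumption rather than the author's listing. The rest of the program is scaffolding so that Term can be run on its own: the scanner and I/O routines are simplified stand-ins, Factor is cut down to numeric constants, and the source text is a hard-coded string.
{--------------------------------------------------------------}
program TermDemo;
{ Multiply and Divide are reconstructed by analogy with Add and
  Subtract; helper routines are simplified stand-ins. }
var
  Src: string;      { source text (assumption: a string, not the keyboard) }
  Cursor: integer;  { index of the next unread character }
  Look: char;       { lookahead character }

procedure GetChar;
begin
  if Cursor <= Length(Src) then Look := Src[Cursor] else Look := #0;
  Inc(Cursor);
end;

procedure Error(s: string);
begin
  WriteLn('Error: ', s);
  Halt(1);
end;

procedure EmitLn(s: string);
begin
  WriteLn('    ', s);
end;

procedure Match(c: char);
begin
  if Look = c then GetChar else Error('''' + c + ''' expected');
end;

function IsDigit(c: char): boolean;
begin
  IsDigit := c in ['0'..'9'];
end;

function IsMulop(c: char): boolean;
begin
  IsMulop := c in ['*', '/'];
end;

function GetNumber: string;
var n: string;
begin
  if not IsDigit(Look) then Error('Number expected');
  n := '';
  while IsDigit(Look) do begin
    n := n + Look;
    GetChar;
  end;
  GetNumber := n;
end;

{ Code generator }
procedure LoadConstant(n: string); begin EmitLn('MOVE #' + n + ',D0'); end;
procedure Push;   begin EmitLn('MOVE D0,-(SP)'); end;
procedure PopMul; begin EmitLn('MULS (SP)+,D0'); end;
procedure PopDiv;
begin
  EmitLn('MOVE (SP)+,D7');
  EmitLn('EXT.L D7');
  EmitLn('DIVS D0,D7');
  EmitLn('MOVE D7,D0');
end;

{ Parser (Factor reduced to constants for this demo) }
procedure Factor;
begin
  LoadConstant(GetNumber);
end;

procedure Multiply;
begin
  Match('*');
  Push;
  Factor;
  PopMul;
end;

procedure Divide;
begin
  Match('/');
  Push;
  Factor;
  PopDiv;
end;

procedure Term;
begin
  Factor;
  while IsMulop(Look) do
    case Look of
      '*': Multiply;
      '/': Divide;
    end;
end;

begin
  Src := '6*7/2';
  Cursor := 1;
  GetChar;
  Term;
end.
{--------------------------------------------------------------}
For the input 6*7/2 the program emits a load of 6, a push, a load of 7 and a MULS, then a second push, a load of 2 and the four-instruction division sequence, which shows the left-to-right evaluation that the while loop in Term provides.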

Our next step is to change some names. SignedFactor now becomes SignedTerm, and the
calls to Factor in Expression, Add, Subtract and SignedTerm get changed to call Term:
{--------------------------------------------------------------}
{ Parse and Translate a Term with Optional Leading Sign }
procedure SignedTerm;
var Sign: char;
begin
  Sign := Look;
  if IsAddop(Look) then
    GetChar;
  Term;
  if Sign = '-' then Negate;
end;
{--------------------------------------------------------------}
...

{--------------------------------------------------------------}
{ Parse and Translate an Expression }
procedure Expression;
begin
  SignedTerm;
  while IsAddop(Look) do
    case Look of
      '+': Add;
      '-': Subtract;
    end;
end;
{--------------------------------------------------------------}
If memory serves me correctly, we once had BOTH a procedure SignedFactor and a pro-
cedure SignedTerm. I had reasons for doing that at the time ... they had to do with the
handling of Boolean algebra and, in particular, the Boolean "not" function. But certainly,
for arithmetic operations, that duplication isn't necessary. In an expression like:
-x*y
it's very apparent that the sign goes with the whole TERM, x*y, and not just the factor x,
and that's the way Expression is coded.
Test this new code by executing Main. It still calls Expression, so you should now be able
to deal with expressions containing any of the four arithmetic operators.

Our last bit of business, as far as expressions go, is to modify procedure Factor to allow for
parenthetical expressions. By using a recursive call to Expression, we can reduce the
needed code to virtually nothing. Five lines added to Factor do the job:
{--------------------------------------------------------------}
procedure Factor;
begin
  if Look = '(' then begin
    Match('(');
    Expression;
    Match(')');
  end
  else if IsDigit(Look) then
    LoadConstant(GetNumber)
  else if IsAlpha(Look) then
    LoadVariable(GetName)
  else
    Error('Unrecognized character ' + Look);
end;
{--------------------------------------------------------------}
At this point, your "compiler" should be able to handle any legal expression you can throw at
it. Better yet, it should reject all illegal ones!

ASSIGNMENTS
As long as we're this close, we might as well create the code to deal with an assignment
statement. This code needs only to remember the name of the target variable where we
are to store the result of an expression, call Expression, then store the number. The pro-
cedure is shown next:
{--------------------------------------------------------------}
{ Parse and Translate an Assignment Statement }
procedure Assignment;
var Name: string;
begin
  Name := GetName;
  Match('=');
  Expression;
  StoreVariable(Name);
end;
{--------------------------------------------------------------}
The assignment calls for yet another code generation routine:

{--------------------------------------------------------------}
{ Store the Primary Register to a Variable }
procedure StoreVariable(Name: string);
begin
  EmitLn('LEA ' + Name + '(PC),A0');
  EmitLn('MOVE D0,(A0)');
end;
{--------------------------------------------------------------}
Now, change the call in Main to call Assignment, and you should see a full assignment state-
ment being processed correctly. Pretty neat, eh? And painless, too.
In the past, we've always tried to show BNF relations to define the syntax we're developing. I
haven't done that here, and it's high time I did. Here's the BNF:
<factor> ::= <variable> | <constant> | '(' <expression> ')'
<signed_term> ::= [<addop>] <term>
<term> ::= <factor> (<mulop> <factor>)*
<expression> ::= <signed_term> (<addop> <term>)*
<assignment> ::= <variable> '=' <expression>
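
Each rule of this grammar corresponds to one procedure in a recursive-descent parser. The sketch below makes the correspondence visible by stripping out all code generation and keeping only the recognition logic: it reports whether a string is a legal assignment. It is a cut-down illustration, not the parser from this installment. The names Accepts, Expect and Ok are hypothetical, <signed_term> is folded into Expression for brevity, the scanner is reduced to inline character tests, and, unlike Error in the text, a syntax error here only clears a flag instead of halting.
{--------------------------------------------------------------}
program BnfCheck;
{ Recognizer only: accepts or rejects a string, generates no code. }
var
  Src: string;      { text being checked }
  Cursor: integer;  { index of the next unread character }
  Look: char;       { lookahead character }
  Ok: boolean;      { cleared on the first syntax error }

procedure GetChar;
begin
  if Cursor <= Length(Src) then Look := Src[Cursor] else Look := #0;
  Inc(Cursor);
end;

procedure Expect(c: char);
begin
  if Look = c then GetChar else Ok := False;
end;

procedure Expression; forward;

{ <factor> ::= <variable> | <constant> | '(' <expression> ')' }
procedure Factor;
begin
  if Look = '(' then begin
    GetChar;
    Expression;
    Expect(')');
  end
  else if Look in ['0'..'9'] then
    while Look in ['0'..'9'] do GetChar
  else if UpCase(Look) in ['A'..'Z'] then
    while UpCase(Look) in ['A'..'Z', '0'..'9'] do GetChar
  else
    Ok := False;
end;

{ <term> ::= <factor> (<mulop> <factor>)* }
procedure Term;
begin
  Factor;
  while Look in ['*', '/'] do begin
    GetChar;
    Factor;
  end;
end;

{ <signed_term> ::= [<addop>] <term>
  <expression>  ::= <signed_term> (<addop> <term>)* }
procedure Expression;
begin
  if Look in ['+', '-'] then GetChar;
  Term;
  while Look in ['+', '-'] do begin
    GetChar;
    Term;
  end;
end;

{ <assignment> ::= <variable> '=' <expression> }
function Accepts(s: string): boolean;
begin
  Src := s;
  Cursor := 1;
  Ok := True;
  GetChar;
  if UpCase(Look) in ['A'..'Z'] then
    while UpCase(Look) in ['A'..'Z', '0'..'9'] do GetChar
  else
    Ok := False;
  Expect('=');
  Expression;
  Accepts := Ok and (Look = #0);   { all input must be consumed }
end;

begin
  WriteLn(Accepts('x=-a*(b+2)/c'));   { TRUE }
  WriteLn(Accepts('x=a+*b'));         { FALSE: factor missing after + }
  WriteLn(Accepts('x=(a+b'));         { FALSE: unbalanced parenthesis }
end.
{--------------------------------------------------------------}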

BOOLEANS
The next step, as we've learned several times before, is to add Boolean algebra. In the
past, this step has at least doubled the amount of code we've had to write. As I've gone
over this step in my mind, I've found myself diverging more and more from what we did in
previous installments. To refresh your memory, I noted that Pascal treats the Boolean
operators pretty much identically to the way it treats arithmetic ones. A Boolean "and" has
the same precedence level as multiplication, and the "or" as addition. C, on the other
hand, sets them at different precedence levels, and all told has a whopping 17 levels. In
our earlier work, I chose something in between, with seven levels. As a result, we ended
up with things called Boolean expressions, paralleling in most details the arithmetic
expressions, but at a different precedence level. All of this, as it turned out, came about
because I didn't like having to put parentheses around the Boolean expressions in state-
ments like:
IF (c >= 'A') and (c <= 'Z') then ...
In retrospect, that seems a pretty petty reason to add many layers of complexity to the
parser. Perhaps more to the point, I'm not sure I was even able to avoid the parens.
For kicks, let's start anew, taking a more Pascal-ish approach, and just treat the Boolean
operators at the same precedence level as the arithmetic ones. We'll see where it leads
us. If it seems to be down the garden path, we can always backtrack to the earlier
approach.

For starters, we'll add the "addition-level" operators to Expression. That's easily done; first,
modify the function IsAddop in unit Scanner to include two extra operators: '|' for "or," and '~'
for "exclusive or":
{--------------------------------------------------------------}
function IsAddop(c: char): boolean;
begin
  IsAddop := c in ['+', '-', '|', '~'];
end;
{--------------------------------------------------------------}
Next, we must include the parsing of the operators in procedure Expression:
{--------------------------------------------------------------}
{ Parse and Translate an Expression }
procedure Expression;
begin
  SignedTerm;
  while IsAddop(Look) do
    case Look of
      '+': Add;
      '-': Subtract;
      '|': _Or;
      '~': _Xor;
    end;
end;
{--------------------------------------------------------------}

(The underscores are needed, of course, because "or" and "xor" are reserved words in
Turbo Pascal.)
Next, the procedures _Or and _Xor:
{--------------------------------------------------------------}
procedure _Or;
begin
  Match('|');
  Push;
  Term;
  PopOr;
end;
{--------------------------------------------------------------}
procedure _Xor;
begin
  Match('~');
  Push;
  Term;
  PopXor;
end;
{--------------------------------------------------------------}

And, finally, the new code generator procedures:
{--------------------------------------------------------------}
{ Or TOS with Primary }
procedure PopOr;
begin
  EmitLn('OR (SP)+,D0');
end;
{--------------------------------------------------------------}
{ Exclusive-Or TOS with Primary }
procedure PopXor;
begin
  EmitLn('EOR (SP)+,D0');
end;
{--------------------------------------------------------------}
Now, let's test the translator (you might want to change the call in Main back to a call to
Expression, just to avoid having to type "x=" for an assignment every time).

So far, so good. The parser nicely handles expressions of the form:
x|y~z
Unfortunately, it also does nothing to protect us from mixing Boolean and arithmetic alge-
bra. It will merrily generate code for:
(a+b)*(c~d)
We've talked about this a bit, in the past. In general the rules for what operations are legal
or not cannot be enforced by the parser itself, because they are not part of the syntax of
the language, but rather its semantics. A compiler that doesn't allow mixed-mode expres-
sions of this sort must recognize that c and d are Boolean variables, rather than numeric
ones, and balk at multiplying them in the next step. But this "policing" can't be done by the
parser; it must be handled somewhere between the parser and the code generator. We
aren't in a position to enforce such rules yet, because we haven't got either a way of
declaring types, or a symbol table to store the types in. So, for what we've got to work with
at the moment, the parser is doing precisely what it's supposed to do.
Anyway, are we sure that we DON'T want to allow mixed-type operations? We made the
decision some time ago (or, at least, I did) to adopt the value 0000 as a Boolean "false,"
and -1, or FFFFh, as a Boolean "true." The nice part about this choice is that bitwise oper-
ations work exactly the same way as logical ones. In other words, when we do an opera-
tion on one bit of a logical variable, we do it on all of them. This means that we don't need
to distinguish between logical and bitwise operations, as is done in C with the operators &
and &&, and | and ||. Reducing the number of operators by half certainly doesn't seem all
bad.
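The claim that bitwise and logical operations coincide under this convention can be checked mechanically. The following is only a sketch in Python, not part of the compiler; the names MASK, as_word and bitwise_matches_logical are made up for the illustration, and a 16-bit word is assumed to match the FFFFh used above.

```python
MASK = 0xFFFF   # one 16-bit word, matching the FFFFh used in the text
FALSE = 0x0000
TRUE = 0xFFFF   # the same bit pattern as a 16-bit -1


def as_word(flag):
    """Encode a Python bool using the 0000h / FFFFh convention."""
    return TRUE if flag else FALSE


def bitwise_matches_logical():
    """Return True if bitwise AND/OR/XOR/NOT agree with the logical operators."""
    for p in (False, True):
        for q in (False, True):
            a, b = as_word(p), as_word(q)
            if (a & b) != as_word(p and q):
                return False
            if (a | b) != as_word(p or q):
                return False
            if (a ^ b) != as_word(p != q):
                return False
        # Bitwise complement, truncated to 16 bits, must equal logical "not".
        if (~as_word(p) & MASK) != as_word(not p):
            return False
    return True


print(bitwise_matches_logical())
```

With C's convention of 1 for "true" the complement check would fail, since the bitwise complement of 1 is nonzero; that is one reason C needs a separate logical operator set.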

From the point of view of the data in storage, of course, the computer and compiler couldn't
care less whether the number FFFFh represents the logical TRUE, or the numeric -1. Should
we? I sort of think not. I can think of many examples (though they might be frowned upon as
"tricky" code) where the ability to mix the types might come in handy. For example, the Dirac
delta function, which could be coded in one simple line:
-(x=0)
or the absolute value function (DEFINITELY tricky code!):
x*(1+2*(x<0))
Please note, I'm not advocating coding like this as a way of life. I'd almost certainly write
these functions in more readable form, using IFs, just to keep from confusing later
maintainers. Still, a moral question arises: Do we have the right to ENFORCE our ideas of good
coding practice on the programmer, by writing the language so he can't do anything else? That's
what Niklaus Wirth did, in many places in Pascal, and Pascal has been criticized for it -- for
not being as "forgiving" as C.
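To see why the two one-liners work, it helps to evaluate them with "true" equal to -1. The sketch below does this in Python; the helper truth is hypothetical and only stands in for the value our relational operators are assumed to produce (-1 for true, 0 for false).

```python
def truth(condition):
    """Map a condition to the -1 / 0 convention assumed in the text."""
    return -1 if condition else 0


def delta(x):
    """-(x=0): yields 1 when x is zero, otherwise 0."""
    return -truth(x == 0)


def absolute(x):
    """x*(1+2*(x<0)): the factor is -1 for negative x, otherwise 1."""
    return x * (1 + 2 * truth(x < 0))


for value in (-7, 0, 7):
    print(value, delta(value), absolute(value))
```

For a negative x the comparison yields -1, so the factor becomes 1 + 2*(-1) = -1 and the product is -x; for all other x the factor stays 1.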
An interesting parallel presents itself in the example of the Motorola 68000 design. Though
Motorola brags loudly about the orthogonality of their instruction set, the fact is that it's far
from orthogonal. For example, you can read a variable from its address:
MOVE X,D0 (where X is the name of a variable)
but you can't write in the same way. To write, you must load an address register with the
address of X. The same is true for PC-relative addressing:
MOVE X(PC),D0 (legal)
MOVE D0,X(PC) (illegal)
When you begin asking how such non-orthogonal behavior came about, you find that some-
one in Motorola had some theories about how software should be written. Specifically, in this
case, they decided that self-modifying code, which you can implement using PC-relative
writes, is a Bad Thing. Therefore, they designed the processor to prohibit it. Unfortunately, in
the process they also prohibited _ALL_ writes of the forms shown above, however benign.
Note that this was not something done by default. Extra design work had to be done, and
extra gates added, to destroy the natural orthogonality of the instruction set.

One of the lessons I've learned from life: If you have two choices, and can't decide which
one to take, sometimes the best thing to do is nothing. Why add extra gates to a proces-
sor to enforce some stranger's idea of good programming practice? Leave the instruc-
tions in, and let the programmers debate what good programming practice is. Similarly,
why should we add extra code to our parser, to test for and prevent conditions that the
user might prefer to do, anyway? I'd rather leave the compiler simple, and let the software
experts debate whether the practices should be used or not.
All of which serves as rationalization for my decision as to how to prevent mixed-type
arithmetic: I won't. For a language intended for systems programming, the fewer rules,
the better. If you don't agree, and want to test for such conditions, we can do it once we
have a symbol table.

BOOLEAN "AND"
With that bit of philosophy out of the way, we can press on to the "and" operator, which goes
into procedure Term. By now, you can probably do this without me, but here's the code, any-
way:
In Scanner,
{--------------------------------------------------------------}
function IsMulop(c: char): boolean;
begin
   IsMulop := c in ['*','/', '&'];
end;
{--------------------------------------------------------------}

In Parser,
{--------------------------------------------------------------}
procedure Term;
begin
Factor;
while IsMulop(Look) do
case Look of
'*': Multiply;
'/': Divide;
'&': _And;
end;
end;

{--------------------------------------------------------------}
{ Parse and Translate a Boolean And Operation }
procedure _And;
begin
Match('&');
Push;
Factor;
PopAnd;
end;
{--------------------------------------------------------------}

and in CodeGen,
{--------------------------------------------------------------}
{ And Primary with TOS }
procedure PopAnd;
begin
   EmitLn('AND (SP)+,D0');
end;
{--------------------------------------------------------------}
Your parser should now be able to process almost any sort of logical expression, and
(should you be so inclined), mixed-mode expressions as well.
Why not "all sorts of logical expressions"? Because, so far, we haven't dealt with the logi-
cal "not" operator, and this is where it gets tricky. The logical "not" operator seems, at first
glance, to be identical in its behavior to the unary minus, so my first thought was to let the
exclusive or operator, '~', double as the unary "not." That didn't work. In my first attempt,
procedure SignedTerm simply ate my '~', because the character passed the test for an
addop, but SignedTerm ignores all addops except '-'. It would have been easy enough to
add another line to SignedTerm, but that would still not solve the problem, because note
that Expression only accepts a signed term for the _FIRST_ argument.
Mathematically, an expression like:
-a * -b
makes little or no sense, and the parser should flag it as an error. But the same expres-
sion, using a logical "not," makes perfect sense:
not a and not b

In the case of these unary operators, choosing to make them act the same way seems an
artificial force fit, sacrificing reasonable behavior on the altar of implementational ease. While
I'm all for keeping the implementation as simple as possible, I don't think we should do so at
the expense of reasonableness. Patching like this would be missing the main point, which is
that the logical "not" is simply NOT the same kind of animal as the unary minus. Consider the
exclusive or, which is most naturally written as:
a~b ::= (a and not b) or (not a and b)
If we allow the "not" to modify the whole term, the last term in parentheses would be inter-
preted as:
not(a and b)
which is not the same thing at all. So it's clear that the logical "not" must be thought of as con-
nected to the FACTOR, not the term.
The idea of overloading the '~' operator also makes no sense from a mathematical point of
view. The implication of the unary minus is that it's equivalent to a subtraction from zero:
-x <=> 0-x
In fact, in one of my more simple-minded versions of Expression, I reacted to a leading addop
by simply preloading a zero, then processing the operator as though it were a binary
operator. But a "not" is not equivalent to an exclusive or with zero ... that would just give
back the original number. Instead, it's an exclusive or with FFFFh, or -1.
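The three identities in this argument are easy to confirm on 16-bit words. This is a small Python sketch; the function names are invented for the illustration and the 16-bit width is an assumption taken from the FFFFh above.

```python
MASK = 0xFFFF  # 16-bit word, as assumed in the text


def neg16(x):
    """Unary minus as a subtraction from zero: -x <=> 0-x."""
    return (0 - x) & MASK


def xor_with_zero(x):
    """Exclusive or with zero: gives back the original number."""
    return (x ^ 0x0000) & MASK


def not16(x):
    """Logical "not": exclusive or with FFFFh."""
    return (x ^ 0xFFFF) & MASK


print(hex(neg16(1)), hex(xor_with_zero(0x1234)), hex(not16(0x1234)))
```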
In short, the seeming parallel between the unary "not" and the unary minus falls apart under
closer scrutiny. "not" modifies the factor, not the term, and it is related to neither the
unary minus nor the exclusive or. Therefore, it deserves a symbol to call its own. What better
symbol than the obvious one, also used by C, the '!' character? Using the rules about the way we
think the "not" should behave, we should be able to code the exclusive or (assuming we'd
ever need to), in the very natural form:
a & !b | !a & b
Note that no parentheses are required -- the precedence levels we've chosen automatically
take care of things.
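The precedence claim can be tried out with a small evaluator. The following Python sketch is not our compiler: it computes values instead of emitting 68000 code, it handles only single-letter variables, and it leaves out '/' for brevity. The function evaluate and its internal helper names are assumptions made for this illustration; "true" is taken to be -1.

```python
ADDOPS = ("+", "-", "|", "~")
MULOPS = ("*", "&")


def evaluate(source, env):
    """Evaluate source, looking up single-letter variables in env."""
    pos = 0

    def peek():
        return source[pos] if pos < len(source) else ""

    def advance():
        nonlocal pos
        pos += 1

    def factor():
        ch = peek()
        if ch == "(":
            advance()
            value = expression()
            if peek() != ")":
                raise SyntaxError("')' expected")
            advance()
            return value
        if ch.isalpha():
            advance()
            return env[ch]
        raise SyntaxError("variable or '(' expected")

    def not_factor():
        # '!' binds to the factor only: exclusive or with -1.
        if peek() == "!":
            advance()
            return factor() ^ -1
        return factor()

    def term():
        value = not_factor()
        while peek() in MULOPS:
            op = peek()
            advance()
            right = not_factor()
            value = value * right if op == "*" else value & right
        return value

    def signed_term():
        # A leading addop is accepted, but only '-' has an effect.
        op = peek()
        if op in ADDOPS:
            advance()
        value = term()
        return -value if op == "-" else value

    def expression():
        value = signed_term()
        while peek() in ADDOPS:
            op = peek()
            advance()
            right = term()
            if op == "+":
                value = value + right
            elif op == "-":
                value = value - right
            elif op == "|":
                value = value | right
            else:
                value = value ^ right
        return value

    result = expression()
    if pos != len(source):
        raise SyntaxError("unexpected character")
    return result


for a in (0, -1):
    for b in (0, -1):
        print(a, b, evaluate("a&!b|!a&b", {"a": a, "b": b}))
```

For every combination of 0 and -1 the unparenthesized form gives the same value as a~b, which is what the chosen precedence levels are meant to guarantee.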

If you're keeping score on the precedence levels, this definition puts the '!' at the top of
the heap. The levels become:
1. !
2. - (unary)
3. *, /, &
4. +, -, |, ~
Looking at this list, it's certainly not hard to see why we had trouble using '~' as the "not"
symbol!

So how do we mechanize the rules? In the same way as we did with SignedTerm, but at the
factor level. We'll define a procedure NotFactor:
{--------------------------------------------------------------}
{ Parse and Translate a Factor with Optional "Not" }
procedure NotFactor;
begin
   if Look = '!' then begin
      Match('!');
      Factor;
      NotIt;
      end
   else
      Factor;
end;
{--------------------------------------------------------------}

and call it from all the places where we formerly called Factor, i.e., from Term, Multiply,
Divide, and _And. Note the new code generation procedure:
{--------------------------------------------------------------}
{ Bitwise Not Primary }
procedure NotIt;
begin
EmitLn('EOR #-1,D0');
end;
{--------------------------------------------------------------}

Try this now, with a few simple cases. In fact, try that exclusive or example,
a&!b|!a&b
You should get the code (without the comments, of course):
MOVE A(PC),D0    ; load a
MOVE D0,-(SP)    ; push it
MOVE B(PC),D0    ; load b
EOR #-1,D0       ; not it
AND (SP)+,D0     ; and with a
MOVE D0,-(SP)    ; push result
MOVE A(PC),D0    ; load a
EOR #-1,D0       ; not it
MOVE D0,-(SP)    ; push it
MOVE B(PC),D0    ; load b
AND (SP)+,D0     ; and with !a
OR (SP)+,D0      ; or with first term
That's precisely what we'd like to get. So, at least for both arithmetic and logical operators,
our new precedence and new, slimmer syntax hang together. Even the peculiar, but legal,
expression with leading addop:
~x
makes sense. SignedTerm ignores the leading '~', as it should, since the expression is
equivalent to:
0~x,
which is equal to x.
When we look at the BNF we've created, we find that our boolean algebra now adds only
one extra line:
<not_factor> ::= [!] <factor>
<factor> ::= <variable> | <constant> | '(' <expression> ')'
<signed_term> ::= [<addop>] <term>
<term> ::= <not_factor> (<mulop> <not_factor>)*
<expression> ::= <signed_term> (<addop> <term>)*
<assignment> ::= <variable> '=' <expression>
That's a big improvement over earlier efforts. Will our luck continue to hold when we get
to relational operators? We'll find out soon, but it will have to wait for the next installment.
We're at a good stopping place, and I'm anxious to get this installment into your hands.
It's already been a year since the release of Installment 15. I blush to admit that all of this
current installment has been ready for almost as long, with the exception of relational
operators. But the information does you no good at all, sitting on my hard disk, and by
holding it back until the relational operations were done, I've kept it out of your hands for
that long. It's time for me to let go of it and get it out where you can get value from it.
Besides, there are quite a number of serious philosophical questions associated with the
relational operators, as well, and I'd rather save them for a separate installment where I
can do them justice.
Have fun with the new, leaner arithmetic and logical parsing, and I'll see you soon with
relationals.

CHAPTER 3 Practical problems and their solutions...
In this chapter we want to discuss important practical problems and their solutions. Most of
these problems seem to have a simple solution, but as you go deeper into the development of
the disassembler you will see that they deserve discussion.
Some of the problems are how to load the file into memory and how to find the file's entry
point. We will also look at how to do complex parsing in assembly language.
This chapter is aimed at inexperienced users and should help to solve some basic problems.

Lesson 1 - Loading Files Into Memory

Lesson 2 - Receiving Infos Of The Sections Of A PE-File

Lesson 3 - Catching The Entry-Point Of A PE-File

Lesson 4 - Linked Lists
Assembler-Source-Code [1]
; #########################################################################
; LinkedList.inc
; The following code is for educational purposes only.
; However, since linkedlists are a fundamental part of programming,
; feel free to use this file as you please.
; #########################################################################
;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
; Initial Linkedlist Code:
; KillEntry, AddEntry, plus initial structure
; EvilHomer2k, 15 August 2002, 3:54 in the morning.
;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
; Initial Example Program:
; bug fixes, and the addition of KillEntryPlusChildren
; Scronty, 15 August 2002, 11:24pm.
;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
; Initial Linkedlist Code Update:
; References to false LinkedObject fields corrected.
; Code in AddEntry altered to include an ObjectSize param for
; new entries.
; EvilHomer2k, 15 August 2002, 9:00 pm.
;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
; [1] This code was taken from http://board.win32asmcommunity.net/showthread.php?s=&threadid=7361&highlight=linked+lists
; Example Program Update:
; Implemented EvilHomer2ks altered AddEntry param (ObjectSize).
; Added Sibling fieldnames in LINKEDOBJECT struct (no procs).
; Added Application-Specific fieldnames in LINKEDOBJECT
; struct (NAME).
; Added NewName and KillName procs for the NAME fieldnames.
; Scronty, 16 August 2002, 11:08am.
; --------
; Changed name of AddEntry procedure to AddChildEntry.
; Added procedure: AddSiblingEntry
; Added procedure: KillEntryPlusYoungerSiblings
; Modified procedure: KillEntry to also patch Sibling links.
; EvilHomer, 18 August 2002, 11:02pm.
;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
; Current Table of Procedures:
; ============================
; -AddChildEntry
; -AddSiblingEntry
; -NewName
; -KillName
; -KillEntry (not recursive)(Checks all links)
; -KillEntryPlusChildren (recursive)(does not check Sibling links)
; -KillEntryPlusYoungerSiblings (recursive)(does not check Parent-Child links)
;
;-------------------------------------------------------------------
;Structure of an Entry in a Linked List
;-------------------------------------------------------------------
_LINKEDOBJECT STRUCT     ;Example structure is minimal - add some more fields to it
; Everything between the "~~~~" lines is mandatory.
;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
pParent         DWORD ?  ;Pointer to my parent if I have one (Parent)
pChild          DWORD ?  ;Pointer to my child if I have one (Child)
pOlderSibling   DWORD ?  ;Pointer to my older sibling if I have one (Older Sibling)
pYoungerSibling DWORD ?  ;Pointer to my younger sibling if I have one (Younger Sibling)
hLock           DWORD ?  ;Handle for freeing this memory
;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
; Add Application-Specific fields here
;_____________________________________
; NAME
pName           DWORD ?  ;Pointer to the name of this object (Name)
hNameLock       DWORD ?  ;Handle for freeing the Name's memory
;_____________________________________
_LINKEDOBJECT ENDS
LINKEDOBJECT TYPEDEF _LINKEDOBJECT
LPLINKEDOBJECT TYPEDEF PTR _LINKEDOBJECT
; ########################################################################
; macros:
; CTEXT macro
; eg.
; invoke MessageBox, NULL, CTEXT("Hello World!"), NULL, MB_OK
CTEXT macro y:vararg
local sym
const segment
ifidni <y>, <>

sym db 0
else
sym db y, 0
endif
const ends
exitm <offset sym>
endm
;## 'return' Macro ##
return MACRO returnvalue
mov eax, returnvalue
ret
ENDM

; ########################################################################
.data
ERRbuff db 128 DUP (0)
; ########################################################################
.code
KillName PROC lpThis:PTR LINKEDOBJECT
    push edi
    mov eax, lpThis
    mov edi, eax
    .if [edi].LINKEDOBJECT.hNameLock != NULL        ;I have a Name
        mov eax,[edi].LINKEDOBJECT.hNameLock        ;This bit happens regardless...
        invoke GlobalUnlock,eax                     ;Unlock this memory
        mov eax,[edi].LINKEDOBJECT.hNameLock        ;Grab the handle to this memory
        invoke GlobalFree,eax                       ;Release this memory
        ;_____________________________________
        ;Null-out Name fields
        mov [edi].LINKEDOBJECT.pName, NULL
        mov [edi].LINKEDOBJECT.hNameLock, NULL
        ;_____________________________________
        invoke MessageBox,NULL,CTEXT("Killed Name!"),CTEXT("Success!"),MB_OK
    .endif
    pop edi
    return TRUE
KillName ENDP

NewName PROC lpThis:PTR LINKEDOBJECT, pszNewName:DWORD
    LOCAL dwSize:DWORD
    push edi
    push esi
    mov eax, lpThis
    mov edi, eax
    ;________________________________
    ;Get the string length
    mov eax, pszNewName
@@:
    mov dl, [eax]
    inc eax
    cmp dl, 0
    jne @B
    sub eax, pszNewName
    dec eax                                         ;correct count
    mov dwSize, eax
    ;________________________________
    mov eax, dwSize
    inc eax                                         ;room for the terminating 0
    invoke GlobalAlloc,GPTR,eax                     ;Allocate memory for name
    mov [edi].LINKEDOBJECT.hNameLock, eax           ;Remember the unlock handle
    .if eax != NULL
        invoke GlobalLock,[edi].LINKEDOBJECT.hNameLock
        mov [edi].LINKEDOBJECT.pName, eax
        .if eax != NULL
            ;________________________________
            ;Copy name into allocated memory
            cld
            mov esi, [pszNewName]
            mov eax, [edi].LINKEDOBJECT.pName
            mov edi, eax
            mov ecx, dwSize
            shr ecx, 2
            rep movsd                               ;copy whole DWORDs
            mov ecx, dwSize
            and ecx, 3
            rep movsb                               ;copy the remaining bytes
            mov BYTE PTR [edi], 0                   ;Append a 0 (edi already points past the copy)
            ;________________________________
            mov eax, lpThis
            invoke MessageBox,NULL,CTEXT("Added Name!"),[eax].LINKEDOBJECT.pName,MB_OK
            ;Return the node-pointer back to the caller
            mov eax, lpThis
            pop esi
            pop edi
            return eax                              ;Return pointer to the Object in EAX
        .else                                       ;GlobalLock failed...
            invoke GetLastError
            invoke wsprintf,addr ERRbuff,CTEXT("GlobalLock err #%lu"),eax
            invoke MessageBox,NULL,addr ERRbuff,CTEXT("Error!"),MB_OK+MB_ICONERROR
            invoke GlobalFree,[edi].LINKEDOBJECT.hNameLock ;Free the memory we Failed to Lock
            mov [edi].LINKEDOBJECT.hNameLock, NULL  ;Do not keep a stale handle
        .endif
    .else                                           ;GlobalAlloc failed...
        invoke GetLastError
        invoke wsprintf,addr ERRbuff,CTEXT("GlobalAlloc err #%lu"),eax
        invoke MessageBox,NULL,addr ERRbuff,CTEXT("Error!"),MB_OK+MB_ICONERROR
    .endif
    pop esi
    pop edi
    xor eax,eax
    ret                                             ;Return ERROR in eax since we have Failed
NewName ENDP

;-------------------------------------------------------------------
;KillEntry Procedure
;-Removes an entry from a Linked-List.
;-Revised to handle parent-child and/or sibling links.
;-Examines Parent<--<THIS>-->Child links and
;-Patches- Parent<-->Child (Bypassing Self).
;-Examines OlderSibling<--<THIS>--YoungerSibling links and
;-Patches- OlderSibling<-->YoungerSibling (Bypassing Self).
;-Also detects and patches NewRoot and NewLast nodes.
;-Releases the allocated memory used by the killed entry.
;-In other words, flawless and transparent removal
;-of a single entry in our List.
;-------------------------------------------------------------------
KillEntry PROC lpThis:PTR LINKEDOBJECT              ;pointer to entry to be deleted
    push edi
    push esi
    mov eax, lpThis
    mov edi, eax
    invoke KillName, edi                            ;Kill any Name for this node
    ;---------- Parent<-->Child links (each side is patched only if it exists)
    mov esi, [edi].LINKEDOBJECT.pParent             ;(fetch ParentPointer, may be NULL)
    mov eax, [edi].LINKEDOBJECT.pChild              ;(fetch ChildPointer, may be NULL)
    .if esi != NULL
        mov [esi].LINKEDOBJECT.pChild, eax          ;link Parent to Child, bypassing me
    .endif
    .if eax != NULL
        mov [eax].LINKEDOBJECT.pParent, esi         ;link Child to Parent, bypassing me
    .endif
    ;---------- OlderSibling<-->YoungerSibling links
    mov esi, [edi].LINKEDOBJECT.pOlderSibling       ;(may be NULL)
    mov eax, [edi].LINKEDOBJECT.pYoungerSibling     ;(may be NULL)
    .if esi != NULL
        mov [esi].LINKEDOBJECT.pYoungerSibling, eax ;link Older to Younger, bypassing me
    .endif
    .if eax != NULL
        mov [eax].LINKEDOBJECT.pOlderSibling, esi   ;link Younger to Older, bypassing me
    .endif
    ;----------
    .if ([edi].LINKEDOBJECT.pParent != NULL) || ([edi].LINKEDOBJECT.pOlderSibling != NULL)
        invoke MessageBox,NULL,CTEXT("killed Child!"),CTEXT("Success!"),MB_OK
    .else                                           ;I am Root and have no Parent..
        invoke MessageBox,NULL,CTEXT("killed Root!"),CTEXT("Success!"),MB_OK
    .endif
    ; (no parent and no child? alone? nothing to repair then)
    mov eax,[edi].LINKEDOBJECT.hLock                ;This bit happens regardless...
    invoke GlobalUnlock,eax                         ;Unlock this memory
    mov eax,[edi].LINKEDOBJECT.hLock                ;Grab the handle to this memory
    invoke GlobalFree,eax                           ;Release this memory
    pop esi
    pop edi
    return TRUE                                     ;cya
KillEntry ENDP
;-------------------------------------------------------------------
;AddChildEntry Procedure
;-Adds an entry to a Linked-List...
;-Allocates memory for a new entry,
;-Examines Parent<--->Child links and
;-Patches- Parent<--<THIS>-->Child (Inserting Self).
;-Bidirection links are preserved.
;-------------------------------------------------------------------
AddChildEntry PROC lpParent:PTR LINKEDOBJECT, ObjectSize:DWORD
    LOCAL lpOldChild:PTR LINKEDOBJECT
    LOCAL hMem:DWORD
    push edi
    push esi
    mov eax, lpParent
    mov esi, eax
    invoke GlobalAlloc,GPTR,ObjectSize
    mov hMem,eax
    .if eax != NULL
        invoke GlobalLock,hMem
        mov edi, eax
        .if eax != NULL
            mov eax, hMem
            mov [edi].LINKEDOBJECT.hLock, eax           ;Remember my unlock handle
            .if esi != NULL                             ;I have a Parent and thus am not Root
                mov eax, [esi].LINKEDOBJECT.pChild
                mov lpOldChild, eax                     ;store possible child
                mov [esi].LINKEDOBJECT.pChild, edi      ;Tell Parent hes my new daddy - APPENDING
                mov [edi].LINKEDOBJECT.pParent, esi     ;Tell Myself I have a Parent
                mov eax, lpOldChild
                .if eax != NULL                         ;and that Parent had a Child - INSERTING!
                    mov eax, lpOldChild
                    mov esi, eax
                    mov [esi].LINKEDOBJECT.pParent, edi ;Tell Child I'm their new sugardaddy
                    mov [edi].LINKEDOBJECT.pChild, esi  ;Tell Myself I have a Child
                    invoke MessageBox,NULL,CTEXT("Child Inserted!"),CTEXT("Success!"),MB_OK
                .else                                   ;I have no kids to worry about
                    mov [edi].LINKEDOBJECT.pChild, NULL
                    invoke MessageBox,NULL,CTEXT("Child Appended!"),CTEXT("Success!"),MB_OK
                .endif
            .else                                       ;I am Root with No Parent and No Child
                mov [edi].LINKEDOBJECT.pParent, NULL
                mov [edi].LINKEDOBJECT.pOlderSibling, NULL
                mov [edi].LINKEDOBJECT.pYoungerSibling, NULL
                invoke MessageBox,NULL,CTEXT("Added Root!"),CTEXT("Success!"),MB_OK
            .endif
            ;_____________________________________
            ;Null-out Application-Specific fields
            ;_____________________________________
            mov eax, edi
            pop esi
            pop edi
            ret                                         ;Return pointer to the new Object in EAX
        .else                                           ;GlobalLock failed...
            invoke GetLastError
            invoke GlobalFree,hMem                      ;Free the memory we Failed to Lock
        .endif
    .else                                               ;GlobalAlloc failed...
        invoke GetLastError
    .endif
    pop esi
    pop edi
    xor eax,eax
    ret                                                 ;Return NULL in eax since we have Failed
AddChildEntry ENDP

;-------------------------------------------------------------------
;AddSiblingEntry Procedure
;-Adds an entry to a Linked-List...
;-Allocates memory for a new entry,
;-Examines OlderSibling<--->YoungerSibling links and
;-Patches- OlderSibling<--<THIS>-->YoungerSibling (Inserting Self).
;-Bidirection links are preserved.
;-lpParent is the OlderSibling after which the new entry is placed.
;-------------------------------------------------------------------
AddSiblingEntry PROC lpParent:PTR LINKEDOBJECT, ObjectSize:DWORD
    LOCAL lpOldChild:PTR LINKEDOBJECT
    LOCAL hMem:DWORD
    push edi
    push esi
    mov eax, lpParent
    mov esi, eax
    invoke GlobalAlloc,GPTR,ObjectSize
    mov hMem,eax
    .if eax != NULL
        invoke GlobalLock,hMem
        mov edi, eax
        .if eax != NULL
            mov eax, hMem
            mov [edi].LINKEDOBJECT.hLock, eax                   ;Remember my unlock handle
            .if esi != NULL                                     ;I have an Older Sibling
                mov eax, [esi].LINKEDOBJECT.pYoungerSibling
                mov lpOldChild, eax                             ;store possible younger sibling
                mov [esi].LINKEDOBJECT.pYoungerSibling, edi     ;Tell Older Sibling about me - APPENDING
                mov [edi].LINKEDOBJECT.pOlderSibling, esi       ;Tell Myself I have an Older Sibling
                mov eax, lpOldChild
                .if eax != NULL                                 ;and it had a Younger Sibling - INSERTING!
                    mov eax, lpOldChild
                    mov esi, eax
                    mov [esi].LINKEDOBJECT.pOlderSibling, edi   ;Tell it I'm its new Older Sibling
                    mov [edi].LINKEDOBJECT.pYoungerSibling, esi ;Tell Myself I have a Younger Sibling
                    invoke MessageBox,NULL,CTEXT("Sibling Inserted!"),CTEXT("Success!"),MB_OK
                .else                                           ;no younger sibling to worry about
                    invoke MessageBox,NULL,CTEXT("Sibling Appended!"),CTEXT("Success!"),MB_OK
                .endif
            .else                                               ;I am Root with No Parent and No Child
                mov [edi].LINKEDOBJECT.pParent, NULL
                mov [edi].LINKEDOBJECT.pOlderSibling, NULL
                mov [edi].LINKEDOBJECT.pYoungerSibling, NULL
                invoke MessageBox,NULL,CTEXT("Added Root!"),CTEXT("Success!"),MB_OK
            .endif
            ;_____________________________________
            ;Null-out Application-Specific fields
            ;_____________________________________
            mov eax, edi
            pop esi
            pop edi
            ret                                                 ;Return pointer to the new Object in EAX
        .else                                                   ;GlobalLock failed...
            invoke GetLastError
            invoke GlobalFree,hMem                              ;Free the memory we Failed to Lock
        .endif
    .else                                                       ;GlobalAlloc failed...
        invoke GetLastError
    .endif
    pop esi
    pop edi
    xor eax,eax
    ret                                                         ;Return NULL in eax since we have Failed
AddSiblingEntry ENDP

;-------------------------------------------------------------------
;KillEntryPlusChildren Procedure
;-Examines Parent<--<THIS>-->Child links and
;-Patches- Parent (Deleting Self).
;-Also detects Child links and recursively removes them.
;-Releases the memory of the entry and of all its Children.
;-------------------------------------------------------------------
KillEntryPlusChildren PROC lpThis:PTR LINKEDOBJECT  ;pointer to entry to be deleted
    push edi
    push esi
    mov eax, lpThis
    mov edi, eax
    invoke KillName, edi                            ;Kill any Name for this node
    .if [edi].LINKEDOBJECT.pParent != NULL          ;I have a Parent and thus am not Root..
        mov eax, [edi].LINKEDOBJECT.pParent
        mov esi, eax
        mov [esi].LINKEDOBJECT.pChild,NULL          ;Kill Parent's link to me
        .if [edi].LINKEDOBJECT.pChild != NULL       ;..and I also have a Child..
            invoke KillEntryPlusChildren, [edi].LINKEDOBJECT.pChild ;Kill Child
            invoke MessageBox,NULL,CTEXT("killed Child Node!"),CTEXT("Success!"),MB_OK
        .else
            invoke MessageBox,NULL,CTEXT("killed End Node!"),CTEXT("Success!"),MB_OK
        .endif
    .else                                           ;I am Root with No Parent
        .if [edi].LINKEDOBJECT.pChild != NULL       ;..and I also have a Child..
            invoke KillEntryPlusChildren, [edi].LINKEDOBJECT.pChild ;Kill Child
        .endif
        invoke MessageBox,NULL,CTEXT("Killed Root!"),CTEXT("Success!"),MB_OK
    .endif
    invoke GlobalUnlock,[edi].LINKEDOBJECT.hLock    ;Unlock this memory
    invoke GlobalFree,[edi].LINKEDOBJECT.hLock      ;Release this memory
    pop esi
    pop edi
    return TRUE                                     ;cya
KillEntryPlusChildren ENDP

;-------------------------------------------------------------------
;KillEntryPlusYoungerSiblings Procedure
;-Examines OlderSibling<--<THIS>-->YoungerSibling links and
;-Patches- OlderSibling<-->YoungerSibling (Deleting Self).
;-Also detects YoungerSibling links and recursively removes them.
;-Releases the memory of the entry and of all its Younger Siblings.
;-------------------------------------------------------------------
KillEntryPlusYoungerSiblings PROC lpThis:PTR LINKEDOBJECT ;pointer to entry to be deleted
    push edi
    push esi
    mov eax, lpThis
    mov edi, eax
    invoke KillName, edi                                ;Kill any Name for this node
    .if [edi].LINKEDOBJECT.pOlderSibling != NULL        ;I have an Older Sibling and thus am not First..
        mov eax, [edi].LINKEDOBJECT.pOlderSibling
        mov esi, eax
        mov [esi].LINKEDOBJECT.pYoungerSibling,NULL     ;Kill Older Sibling's link to me
        .if [edi].LINKEDOBJECT.pYoungerSibling != NULL  ;..and I also have a Younger Sibling..
            invoke KillEntryPlusYoungerSiblings, [edi].LINKEDOBJECT.pYoungerSibling ;Kill it
            invoke MessageBox,NULL,CTEXT("killed Child Sibling Node!"),CTEXT("Success!"),MB_OK
        .else
            invoke MessageBox,NULL,CTEXT("killed End Sibling Node!"),CTEXT("Success!"),MB_OK
        .endif
    .else                                               ;I am First with No Older Sibling
        .if [edi].LINKEDOBJECT.pYoungerSibling != NULL  ;..but I have a Younger Sibling..
            invoke KillEntryPlusYoungerSiblings, [edi].LINKEDOBJECT.pYoungerSibling ;Kill it
        .endif
        invoke MessageBox,NULL,CTEXT("Killed Root!"),CTEXT("Success!"),MB_OK
    .endif
    invoke GlobalUnlock,[edi].LINKEDOBJECT.hLock        ;Unlock this memory
    invoke GlobalFree,[edi].LINKEDOBJECT.hLock          ;Release this memory
    pop esi
    pop edi
    return TRUE                                         ;cya
KillEntryPlusYoungerSiblings ENDP
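The link patching is easier to follow without the memory handling around it. Here is a rough sketch in Python of the same four-pointer node; it is not a translation of the listing. The names Node, add_child, add_sibling and kill_entry are made up for this illustration, and the garbage collector takes the place of GlobalAlloc and GlobalFree.

```python
class Node:
    """One entry with the four mandatory links of the structure above."""

    def __init__(self, name=None):
        self.parent = None
        self.child = None
        self.older = None
        self.younger = None
        self.name = name


def add_child(parent, name=None):
    """Create a node below parent; an existing child is pushed one level down."""
    node = Node(name)
    if parent is not None:
        old_child = parent.child
        parent.child = node
        node.parent = parent
        if old_child is not None:      # inserting between parent and old child
            old_child.parent = node
            node.child = old_child
    return node


def add_sibling(older, name=None):
    """Create a node right after older in the sibling chain."""
    node = Node(name)
    if older is not None:
        old_younger = older.younger
        older.younger = node
        node.older = older
        if old_younger is not None:    # inserting between two siblings
            old_younger.older = node
            node.younger = old_younger
    return node


def kill_entry(node):
    """Unlink a single node, patching each neighbour that exists."""
    if node.parent is not None:
        node.parent.child = node.child
    if node.child is not None:
        node.child.parent = node.parent
    if node.older is not None:
        node.older.younger = node.younger
    if node.younger is not None:
        node.younger.older = node.older
    node.parent = node.child = node.older = node.younger = None


root = add_child(None, "root")
first = add_child(root, "first")
second = add_child(root, "second")   # inserted between root and first
kill_entry(second)                   # root and first are linked again
print(root.child.name, first.parent.name)
```

Checking every neighbour for None before patching it is the important part: a node may have a sibling but no parent, or the other way round.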

Lesson 5 - Parsing [2]
Here we have some code snippets which could be helpful. Parsing tasks vary a great deal
and can become very complex, so these are just some examples to help you on your
way!
[2] This lesson contains various code snippets from different authors, all of which were
posted at http://board.win32asmcommunity.net. Respect the authors' work!

By Sliver
Find_First_Of
Find_Last_Of
Find_First_Not_Of
Required arguments:
1) the string to be searched
2) the separators to be found
3) the starting position
How it works:
1) Find_First_Of returns the position of the first occurrence of any of the given separators.
Assuming the sentence was "Hello everyone how are you doing" and the separator was " " (a
space), it would return 5 in eax. The first letter in the string is at position 0.
2) Find_Last_Of finds the last occurrence of a separator.
3) Find_First_Not_Of finds the first occurrence of something that is not a separator.
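Before reading the assembly, it may help to see the intended behaviour of the three routines in a few lines. The following Python sketch is only a reference model, not Sliver's code; returning -1 when nothing is found is an assumption made for this illustration.

```python
def find_first_of(text, separators, start=0):
    """Position of the first character at or after start that is a separator."""
    for index in range(start, len(text)):
        if text[index] in separators:
            return index
    return -1


def find_last_of(text, separators, start=0):
    """Position of the last character at or after start that is a separator."""
    for index in range(len(text) - 1, start - 1, -1):
        if text[index] in separators:
            return index
    return -1


def find_first_not_of(text, separators, start=0):
    """Position of the first character at or after start that is not a separator."""
    for index in range(start, len(text)):
        if text[index] not in separators:
            return index
    return -1


sentence = "Hello everyone how are you doing"
print(find_first_of(sentence, " "))   # 5, as in the description above
print(find_last_of(sentence, " "))
print(find_first_not_of("aeioufaefeaio", "uoaei"))
```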

code:
; #########################################################################
;
; Find_First_Of / Find_Last_Of / Find_First_Not_Of
; Suppose you had a string -- a paragraph of prose, perhaps -- and you wanted
; to break it up into individual words. You would need to find where the
; separators were, and those could be any of a number of different characters;
; there could be spaces, commas, periods, colons and so on. These procedures
; search for where any one of a given set of characters occurs in a string -- this
; can tell you where the delimiters for the words are. I hope this makes someone's
; life a little easier :-) Cheers, Walter Reid (Sliver)
;
;
; Works like this:
; invoke Find_First_Of, string to be searched, separators, starting position
; returns the location of the first separator in eax
;
; invoke Find_Last_Of, string to be searched, separators, starting position
; returns the location of the last separator in eax
; #########################################################################
.386
.model flat, stdcall
option casemap :none ; case sensitive
; #########################################################################
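include \masm32\include\windows.inc
include \masm32\include\kernel32.inc    ; ExitProcess
include \masm32\include\user32.inc
include \masm32\include\masm32.inc
include \masm32\include\debug.inc       ; PrintText, PrintDec
includelib \masm32\lib\kernel32.lib
includelib \masm32\lib\user32.lib
includelib \masm32\lib\masm32.lib
includelib \masm32\lib\debug.lib

Main PROTO
Find_First_Of PROTO :DWORD, :DWORD, :DWORD
Find_Last_Of PROTO :DWORD, :DWORD, :DWORD
Find_First_Not_Of PROTO :DWORD, :DWORD, :DWORD
; #########################################################################
.code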
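; Returns in eax the position (counted from the start of the string) of the
; last character at or after StartPos that is one of the separators,
; or -1 if there is none.
Find_Last_Of proc uses esi edi lpszSource:DWORD, lpszTarget:DWORD, StartPos:DWORD
    mov esi, lpszSource
    add esi, StartPos
    mov eax, -1                 ; nothing found yet
    xor ecx, ecx                ; ecx = offset from StartPos
next_char:
    mov dl, byte ptr [esi+ecx]
    cmp dl, 0
    je done                     ; end of source string
    mov edi, lpszTarget
next_sep:
    mov dh, byte ptr [edi]
    cmp dh, 0
    je no_match                 ; end of separator list
    cmp dl, dh
    je found
    inc edi
    jmp next_sep
found:
    mov eax, ecx                ; remember the latest hit
    add eax, StartPos
no_match:
    inc ecx
    jmp next_char
done:
    ret
Find_Last_Of endp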
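; Returns in eax the position (counted from the start of the string) of the
; first character at or after StartPos that is one of the separators,
; or -1 if there is none.
Find_First_Of proc uses esi edi lpszSource:DWORD, lpszTarget:DWORD, StartPos:DWORD
    mov esi, lpszSource
    add esi, StartPos
    xor ecx, ecx                ; ecx = offset from StartPos
next_char:
    mov al, byte ptr [esi+ecx]
    cmp al, 0
    je not_found                ; end of source string
    mov edi, lpszTarget
next_sep:
    mov dl, byte ptr [edi]
    cmp dl, 0
    je no_match                 ; end of separator list
    cmp al, dl
    je found
    inc edi
    jmp next_sep
no_match:
    inc ecx
    jmp next_char
found:
    mov eax, ecx
    add eax, StartPos
    ret
not_found:
    mov eax, -1
    ret
Find_First_Of endp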
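; Returns in eax the position (counted from the start of the string) of the
; first character at or after StartPos that is NOT one of the separators,
; or -1 if there is none.
Find_First_Not_Of proc uses esi edi lpszSource:DWORD, lpszTarget:DWORD, StartPos:DWORD
    mov esi, lpszSource
    add esi, StartPos
    xor ecx, ecx                ; ecx = offset from StartPos
next_char:
    mov al, byte ptr [esi+ecx]
    cmp al, 0
    je not_found                ; end of source string
    mov edi, lpszTarget
next_sep:
    mov dl, byte ptr [edi]
    cmp dl, 0
    je found                    ; matched no separator at all
    cmp al, dl
    je is_separator
    inc edi
    jmp next_sep
is_separator:
    inc ecx
    jmp next_char
found:
    mov eax, ecx
    add eax, StartPos
    ret
not_found:
    mov eax, -1
    ret
Find_First_Not_Of endp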
; #########################################################################
.data
Msg1 db "Hi! My name is Walter. How are you?",0
Msg2 db "aeioufaefeaio",0
Txt db " ?.!,",0
Txt2 db "uoaei",0
; #########################################################################
.code
start:
invoke Main
invoke ExitProcess,0

Main proc
invoke Find_First_Of, ADDR Msg1, ADDR Txt, 0

PrintText "Find the first separator ( ?.!,) -- starting at pos 0"
PrintText "in the sentence 'Hi! My name is Walter. How are you?'"
PrintDec eax
PrintText "Value returned is from the first character (0)"
PrintText " "

PrintText " "
invoke Find_First_Of, ADDR Msg1, ADDR Txt2, 14

PrintText "Find the first vowel (uoaei) -- starting at pos 14 (space after 'is')"
PrintDec eax
PrintText " "

PrintText " "
PrintText " "
PrintText " "
invoke Find_Last_Of, ADDR Msg1, ADDR Txt, 0

PrintText "Find the last separator ( ?.!,) -- starting at pos 0"
PrintDec eax
PrintText " "

PrintText " "
invoke Find_Last_Of, ADDR Msg1, ADDR Txt2, 0

PrintText "Find the last vowel (uoaei) -- starting at pos 0"
PrintDec eax

PrintText " "

PrintText " "
PrintText " "
PrintText " "
invoke Find_First_Not_Of, ADDR Msg2, ADDR Txt2, 0

PrintText "Find the first not of separator (uoaei) -- starting at pos 0"
PrintText "in the sentence 'aeioufaefeaio'"
PrintDec eax
PrintText " "

PrintText " "
invoke Find_First_Not_Of, ADDR Msg2, ADDR Txt2, 6

PrintText "Find the first not of separator (uoaei) -- starting at pos 6"
PrintText "in the sentence 'aeioufaefeaio'"
PrintDec eax
ret
Main endp
end start
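For comparison, the three searches above can be sketched in C. This is only an illustrative sketch, not Sliver's code: the function names mirror the procedures above, all offsets are counted from the start of the string, and returning -1 for "nothing found" is a convention chosen for this sketch. It assumes that start is not larger than the length of the string.

```c
#include <stddef.h>
#include <string.h>

/* Offset of the first character at or after `start` that IS in `set`,
   or -1 if there is none. */
static long find_first_of(const char *s, const char *set, size_t start)
{
    size_t i = start + strcspn(s + start, set);
    return s[i] != '\0' ? (long)i : -1;
}

/* Offset of the first character at or after `start` that is NOT in `set`,
   or -1 if every remaining character is in the set. */
static long find_first_not_of(const char *s, const char *set, size_t start)
{
    size_t i = start + strspn(s + start, set);
    return s[i] != '\0' ? (long)i : -1;
}

/* Offset of the last character at or after `start` that is in `set`,
   or -1 if there is none. */
static long find_last_of(const char *s, const char *set, size_t start)
{
    long last = -1;
    size_t i;
    for (i = start; s[i] != '\0'; i++)
        if (strchr(set, s[i]) != NULL)
            last = (long)i;
    return last;
}
```

With the test sentence from the listing, find_first_of(msg, " ?.!,", 0) finds the '!' after "Hi", and find_first_of(msg, "uoaei", 14) finds the 'a' in "Walter".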

Lesson 5 - Parsing
by Eóin
code:
ParseString Proc uses ebx esi edi pStr:DWORD,sPos:DWORD,pBuf:DWORD

InRange MACRO a,b,c
lea ecx,[a-b]
lea edx,[a-c-1]
xor edx,ecx
or ebx,edx
EndM
Ranges MACRO
InRange eax,'a','z'
InRange eax,'0','9'
InRange eax,'A','Z'
EndM
mov esi,pStr
mov edi,pBuf
add esi,sPos
assume esi:ptr byte
assume edi:ptr byte
@@:movzx eax,[esi]
xor ebx,ebx
test eax,eax
jz nlb
Ranges
js @F
inc esi
jmp @B
@@:mov [edi],al
inc esi
inc edi
movzx eax,[esi]
xor ebx,ebx
test eax,eax
jz nlb

Ranges
js @B
nlb:mov [edi],0
mov eax,esi
sub eax,pStr
ret
ParseString EndP
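Usage is simple: call the function with a pointer to the string you wish to parse, the start
position and a pointer to a buffer that receives the parsed part. The return value in eax is the
position where parsing stopped, so it can be passed as the start position of the next call.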
.data
szTest db "This is a test",0
.data?
Pos dd ?
Buf db 64 dup (?)
.code
Invoke ParseString,addr szTest,0,addr Buf
mov Pos,eax ; Buf contains "This",0
Invoke ParseString,addr szTest,Pos,addr Buf

mov Pos,eax ; Buf contains "is",0
…
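The same idea can be sketched in C: skip everything that is not a letter or digit, copy the following run of letters and digits, and return the position where parsing stopped. This is a hedged sketch, not Eóin's code. The names parse_string, in_range and is_token_char are made up for this example, the range test uses unsigned wrap-around instead of the sign-bit trick of the InRange macro, and (as in the assembly version) the caller must supply a buffer that is large enough.

```c
#include <stddef.h>

/* 1 if lo <= c <= hi, using a single unsigned comparison. */
static int in_range(unsigned c, unsigned lo, unsigned hi)
{
    return c - lo <= hi - lo;
}

static int is_token_char(unsigned char c)
{
    return in_range(c, 'a', 'z') || in_range(c, '0', '9') || in_range(c, 'A', 'Z');
}

/* Copies the next alphanumeric token found at or after `pos` into `buf`
   and returns the position in `str` where parsing stopped. */
static size_t parse_string(const char *str, size_t pos, char *buf)
{
    const unsigned char *p = (const unsigned char *)str + pos;

    while (*p != '\0' && !is_token_char(*p))   /* skip separators */
        p++;
    while (is_token_char(*p))                  /* copy the token */
        *buf++ = (char)*p++;
    *buf = '\0';
    return (size_t)(p - (const unsigned char *)str);
}
```

Calling it in a loop with the returned position walks through "This is a test" one word at a time, just like the two Invoke lines above.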

Lesson 5 - Parsing
By Stryker
String Reverse
Output: dlrow leurc olleh
code:
.386
.MODEL flat, stdcall
option casemap:none
INCLUDE \masm32\include\windows.inc
INCLUDE \masm32\include\kernel32.inc
INCLUDELIB \masm32\lib\kernel32.lib
INCLUDE \masm32\include\user32.inc
INCLUDELIB \masm32\lib\user32.lib
.data
mystringdata db "hello cruel world", 0

buffer db 20 DUP(0)
.code
Start:
invoke lstrlen, OFFSET mystringdata

mov ecx, eax
mov esi, OFFSET mystringdata
mov edi, OFFSET buffer
@@:
dec ecx
mov dl, BYTE ptr [esi+ecx]
mov BYTE ptr[edi], dl
inc edi
or ecx, ecx
ja @b
invoke MessageBox, 0, OFFSET buffer, 0, 0
invoke ExitProcess, 0

END Start

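The second version reverses the string in place. Because source and destination overlap, the copy is
only correct up to the centre character; after that it reads bytes that have already been overwritten,
so the first half of the result is mirrored back.
Output: dlrow leuel world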

code:
.386
.MODEL flat, stdcall
option casemap:none
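INCLUDE \masm32\include\windows.inc
INCLUDE \masm32\include\kernel32.inc
INCLUDELIB \masm32\lib\kernel32.lib
INCLUDE \masm32\include\user32.inc
INCLUDELIB \masm32\lib\user32.lib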
.data
mystringdata db "hello cruel world", 0
.code
Start:
invoke lstrlen, OFFSET mystringdata

mov ecx, eax
mov esi, OFFSET mystringdata
mov edi, OFFSET mystringdata
@@:
dec ecx
mov dl, BYTE ptr [esi+ecx]
mov BYTE ptr[edi], dl
inc edi
or ecx, ecx
ja @b
mov BYTE ptr[edi], cl
invoke MessageBox, 0, OFFSET mystringdata, 0, 0
invoke ExitProcess, 0

END Start
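The difference between the two listings is easier to see in a high-level sketch. The C functions below are not Stryker's code and their names are made up for this example: reverse_copy mirrors the first listing (separate buffer), while reverse_in_place shows one common way to reverse a string correctly inside its own buffer, by swapping characters from both ends.

```c
#include <string.h>

/* Reverse src into dst. dst must hold strlen(src) + 1 bytes
   and must not overlap src. */
static void reverse_copy(char *dst, const char *src)
{
    size_t n = strlen(src);
    size_t i;
    for (i = 0; i < n; i++)
        dst[i] = src[n - 1 - i];
    dst[n] = '\0';
}

/* Reverse s inside its own buffer by swapping from both ends. */
static void reverse_in_place(char *s)
{
    size_t n = strlen(s);
    size_t i, j;
    if (n < 2)
        return;
    for (i = 0, j = n - 1; i < j; i++, j--) {
        char tmp = s[i];
        s[i] = s[j];
        s[j] = tmp;
    }
}
```

Swapping stops at the centre, so no byte is read after it has been overwritten.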

Lesson 6 - OOP [3]
Main-File of OOP
.386
.model flat,stdcall
option casemap:none
include \masm32\include\masm32.inc
includelib \masm32\lib\masm32.lib
include \masm32\include\Objects.inc ; Our Object Include Macro Set

include myClass.asm ; The Class Definition File
.data?
myNiceClass dd ? ; Class Instance Handle
.code
start:
mov myNiceClass, $NEW( myClass ) ; init class: myNiceClass = new myClass()
METHOD myNiceClass, myClass, SetVariable ; now set variable: myNiceClass.setVariable();
METHOD myNiceClass, myClass, Print ; and print: print (myNiceClass.myVariable);
; [3] This source is copyrighted by NaN ( jaymeson@hotmail.com ). He submitted it at board.win32assembler.net after I
; asked for help on OOP for MASM32. Please respect this.
DESTROY myNiceClass ; Must Clean up when finished.
invoke ExitProcess, NULL

end start
end

Class-File of OOP
IFNDEF _myClass_
_myClass_ equ 1
; --=====================================================================================--
; #CLASS: myClass
; #VERSION: 1.0
; --=====================================================================================--
; Built by NaN's Object Class Creator
; © Sept 19, 2001
;
; By NaN ( jaymeson@hotmail.com )
; http://nan32asm.cjb.net
;
; --=====================================================================================--
; #AUTHOR: NaN
; #DATE: Sept. 25, 2001
;
; #DESCRIPTION:
;
; Test "Hello World" class for example.
;
; --=====================================================================================--
; CLASS METHOD PROTOS
; --=====================================================================================--
myClass_Init PROTO :DWORD

; --=====================================================================================--
; FUNCTION POINTER PROTOS
; --=====================================================================================--
myCl_destructorPto TYPEDEF PROTO :DWORD
myCl_PrintPto TYPEDEF PROTO :DWORD
myCl_SetVariablePto TYPEDEF PROTO :DWORD
; --=====================================================================================--
; CLASS STRUCTURE
; --=====================================================================================--
CLASS myClass, myCl
CMETHOD destructor ; MUST BE THE FIRST, OR OBJECTS.INC WILL FAIL
CMETHOD Print ; Used to create a Message Box, and Print the Variable Data
CMETHOD SetVariable ; Used to fill the internal buffer with a text string
PrivateBuffer dd 32 dup(?) ; 128 byte buffer
myClass ENDS
.data
BEGIN_INIT
dd offset myCl_destructor_Funct
dd offset myCl_Print_Funct
dd offset myCl_SetVariable_Funct
dd 32 dup( 0 ) ; 32 NULL's for initial buffer.
END_INIT
.code

; --=====================================================================================--
; #METHOD: CONSTRUCTOR (NONE)
;
; #DESCRIPTION: Empty Constructor, that does nothing specific..
;
; --=====================================================================================--
myClass_Init PROC uses edi esi lpTHIS:DWORD
SET_CLASS myClass
SetObject edi, myClass
ReleaseObject edi
ret
myClass_Init ENDP
; --=====================================================================================--
; #METHOD: destructor (NONE)
;
; #DESCRIPTION: Empty Destructor, that does nothing specific..
;
; --=====================================================================================--
myCl_destructor_Funct PROC uses edi lpTHIS:DWORD
SetObject edi, myClass ; edi = this instance
ReleaseObject edi
ret
myCl_destructor_Funct ENDP

; --=====================================================================================--
; #METHOD: Print()
;
; #DESCRIPTION: Creates a Message Box with the String Data, IF and only IF the
; string is set using the SetVariable Method..
;
; --=====================================================================================--
myCl_Print_Funct PROC uses edi lpTHIS:DWORD
SetObject edi, myClass ; edi = this instance
mov al, BYTE PTR [edi].PrivateBuffer ; Get first buffer byte
cmp al, 0 ; See if its NULL
je @F ; Yes, then dont print, and exit
invoke MessageBox, NULL, addr [edi].PrivateBuffer, NULL, MB_OK ; Print Out the Message
@@: ; Exit
ReleaseObject edi
ret
myCl_Print_Funct ENDP
; --=====================================================================================--
; #METHOD: SetVariable (R)
;
; #DESCRIPTION: Fills a private class buffer with string data..
;
; --=====================================================================================--
myCl_SetVariable_Funct PROC uses edi lpTHIS:DWORD
.data
SetDataString db "Hello ASM Coder, this is the NaN/Thomas OOP model!",0
.code
SetObject edi, myClass ; edi = this instance
invoke StrLen, addr SetDataString
mov edx, eax
invoke MemCopy, addr SetDataString, addr [edi].PrivateBuffer, edx
ReleaseObject edi
ret
myCl_SetVariable_Funct ENDP
ENDIF
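The layout of such a class (method pointers first, instance data after them, plus an initial image that is copied into every new instance) can be sketched in C. This is only an analogy to illustrate the idea. How Objects.inc really implements $NEW, METHOD and DESTROY is not shown in this book, and all names below are made up for this example. Print returns the text instead of opening a message box so that the sketch stays portable.

```c
#include <stdlib.h>
#include <string.h>

typedef struct MyClass MyClass;

struct MyClass {
    /* method pointers first, like the CMETHOD entries */
    void        (*destructor)(MyClass *self);
    const char *(*print)(MyClass *self);
    void        (*set_variable)(MyClass *self);
    /* instance data, like PrivateBuffer */
    char private_buffer[128];
};

static void my_destructor(MyClass *self) { (void)self; }

/* Returns the text, or NULL if SetVariable was never called. */
static const char *my_print(MyClass *self)
{
    return self->private_buffer[0] != '\0' ? self->private_buffer : NULL;
}

static void my_set_variable(MyClass *self)
{
    strcpy(self->private_buffer,
           "Hello ASM Coder, this is the NaN/Thomas OOP model!");
}

/* Initial image of an instance, like BEGIN_INIT ... END_INIT. */
static const MyClass my_class_init = {
    my_destructor, my_print, my_set_variable, {0}
};

static MyClass *my_class_new(void)
{
    MyClass *obj = malloc(sizeof *obj);
    if (obj != NULL)
        *obj = my_class_init;      /* copy the initial image */
    return obj;
}

static void my_class_destroy(MyClass *obj)
{
    if (obj != NULL) {
        obj->destructor(obj);      /* the destructor is the first slot */
        free(obj);
    }
}
```

A call such as obj->set_variable(obj) passes the instance pointer explicitly, which is the role lpTHIS plays in the assembly methods.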

Lesson 7 - SEH [4]
SEH.asm
.386
.model flat,stdcall
option casemap:none
INCLUDE SEH.inc
.DATA
szGood DB "SEH succeed :)",0
szCap DB "OK",0
.CODE
main:
InstSehFrame <OFFSET SavePlace1>
; CRASH CODE 1
XOR EAX, EAX
XCHG DWORD PTR [EAX], EAX
SavePlace1:
KillSehFrame
InstSehFrame <OFFSET SavePlace2>
; CRASH CODE 2
XOR EBX, EBX
XOR EDX, EDX
; [4] Code snippet coded by yoda (http://y0da.cjb.net)
MOV EAX, 2
DIV EBX
SavePlace2:
KillSehFrame
INVOKE MessageBox,0,offset szGood,offset szCap,MB_OK

RET
end main

SEH.inc
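COMMENT @
SEH.inc (MASM)
-------
...a lame include file for SEH macros.
by yoda
@
INCLUDE \masm32\include\windows.inc ; CONTEXT, ExceptionContinueExecution, MB_OK
INCLUDE \masm32\include\user32.inc ; MessageBox
INCLUDELIB \masm32\lib\user32.lib
;---- STRUCTs ----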

sSEH STRUCT
OrgEsp DD ?
OrgEbp DD ?
SaveEip DD ?
sSEH ENDS
;---- MACROs ----

InstSehFrame MACRO ContinueAddr
ASSUME FS : NOTHING
IFNDEF SehStruct
SehStruct EQU 1
.DATA
SEH sSEH <>
ENDIF
.CODE
MOV SEH.SaveEip, ContinueAddr
MOV SEH.OrgEbp, EBP
PUSH OFFSET SehHandler
PUSH FS:[0]
MOV SEH.OrgEsp, ESP
MOV FS:[0], ESP
ENDM
KillSehFrame MACRO

POP FS:[0]
ADD ESP, 4
ENDM
;---- ROUTINEs ----

.CODE
SehHandler PROC C pExcept:DWORD,pFrame:DWORD,pContext:DWORD,pDispatch:DWORD
MOV EAX, pContext

ASSUME EAX : PTR CONTEXT
PUSH SEH.SaveEip
POP [EAX].regEip
PUSH SEH.OrgEsp
POP [EAX].regEsp
PUSH SEH.OrgEbp
POP [EAX].regEbp
MOV EAX, ExceptionContinueExecution
RET
SehHandler ENDP
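The macros above save a continuation address together with the stack registers, and the handler restores them so that execution resumes at the saved place. A rough portable analogy in C is setjmp/longjmp. This is explicitly not Win32 SEH: the "fault" below is detected by an ordinary check instead of a CPU exception, and all names are made up for this sketch.

```c
#include <setjmp.h>

/* Plays the role of the sSEH structure: saved resume point and stack state. */
static jmp_buf resume_point;

/* Stands in for the crashing code: "raises" on division by zero. */
static int checked_div(int a, int b)
{
    if (b == 0)
        longjmp(resume_point, 1);   /* like the handler setting regEip/regEsp/regEbp */
    return a / b;
}

/* Returns 1 and stores the quotient on success, 0 if the "exception" fired. */
static int run_guarded(int a, int b, int *result)
{
    if (setjmp(resume_point) != 0)  /* like InstSehFrame <OFFSET SavePlace> */
        return 0;                   /* we resume here after the fault */
    *result = checked_div(a, b);
    return 1;
}
```

run_guarded(2, 0, &r) comes back with 0 instead of crashing, which corresponds to reaching SavePlace2 in SEH.asm.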

Lesson 8 - Trees [5]
CurNode == rootNode
code:
PrefixPrint PROC CurNode:DWORD

mov eax, CurNode
or eax, eax
jnz @F
ret
@@:
push eax
invoke dwtoa, (BINTREE PTR[eax]).ID, OFFSET tmpBufr
invoke SendDlgItemMessage, wHndle, IDE_BINTREEOUTPUT, EM_REPLACESEL,
NULL, OFFSET Newline
invoke SendDlgItemMessage, wHndle, IDE_BINTREEOUTPUT, EM_REPLACESEL,
NULL, OFFSET tmpBufr
pop eax
push eax
mov eax, (BINTREE PTR[eax]).ptLeft
invoke PrefixPrint, eax
pop eax
mov eax, (BINTREE PTR[eax]).ptRight
invoke PrefixPrint, eax
ret
PrefixPrint ENDP
IDValue == id/key of the structure
code:
CreateNode PROC IDValue:DWORD
invoke GetProcessHeap
mov hPrcs, eax
; [5] The following code snippets were coded by stryker and published at
; http://board.win32asmcommunity.net. I have added parts of the thread.
invoke HeapAlloc, eax, HEAP_ZERO_MEMORY, SIZEOF BINTREE

mov hMem, eax
mov edx, IDValue

mov (BINTREE PTR [eax]).ID, edx
mov (BINTREE PTR [eax]).ptLeft, NULL
mov (BINTREE PTR [eax]).ptRight, NULL
ret
CreateNode ENDP
CurNode == rootNode
IDValue == id/key of the structure
code:
FindASpot PROC CurNode:DWORD, IDValue:DWORD
mov eax, CurNode

or eax, eax
jnz @@NodeNotNull
ret
@@NodeNotNull:
mov edx, IDValue

push eax
push edx
cmp edx, (BINTREE PTR [eax]).ID

jb @@GoLeft ; unsigned compare, to match the ja below
ja @@GoRight
invoke MessageBox, NULL, OFFSET BinTreeError, OFFSET BinTreeTitle, MB_OK

pop edx
pop eax
ret
@@GoLeft:

cmp (BINTREE PTR [eax]).ptLeft, NULL

jne @@RecurseOnLeft
push eax
invoke CreateNode, edx
pop ecx
mov (BINTREE PTR [ecx]).ptLeft, eax
jmp @@FoundASpot
@@RecurseOnLeft:
mov eax, (BINTREE PTR [eax]).ptLeft

invoke FindASpot, eax, edx
jmp @@FoundASpot
@@GoRight:
cmp (BINTREE PTR [eax]).ptRight, NULL

jne @@RecurseOnRight
push eax
invoke CreateNode, edx
pop ecx
mov (BINTREE PTR [ecx]).ptRight, eax
jmp @@FoundASpot
@@RecurseOnRight:
mov eax, (BINTREE PTR [eax]).ptRight

invoke FindASpot, eax, edx
@@FoundASpot:
pop edx
pop eax
ret
FindASpot ENDP

nNode == rootNode
code:
DestroyTree PROC nNode:DWORD
mov eax, nNode

or eax, eax
jnz @F
ret
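@@:
push eax
mov eax, (BINTREE PTR [eax]).ptLeft ; free the left subtree first
invoke DestroyTree, eax
pop eax
push eax
mov eax, (BINTREE PTR [eax]).ptRight ; then the right subtree
invoke DestroyTree, eax
pop eax
invoke HeapFree, hPrcs, NULL, eax ; finally the node itself
ret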
DestroyTree ENDP
code:
@@BTN_ADDTOBTREE:
invoke GetDlgItemInt, hWnd, IDE_INTOBTREE, OFFSET bState, FALSE
cmp rootNode, NULL

jne @@rootExists
invoke CreateNode, eax

mov rootNode, eax
jmp @@PRINTCURRENTBINTREE

@@rootExists:
invoke FindASpot, rootNode, eax
@@PRINTCURRENTBINTREE:
invoke SetDlgItemText, hWnd, IDE_BINTREEOUTPUT, NULL

invoke PrefixPrint, rootNode
jmp @@RETURN_TRUE
Assuming we clicked a button called AddToTree...
1. First we get the key/id number of the node from an edit box.
2. We then check whether a root node exists. If not, we call CreateNode; its return value is
the pointer to the allocated memory, which becomes the root. If a root exists, we call the FindASpot
procedure, which recurses until it finds the right spot to place the new leaf node.
3. Then we print the current members of the tree.
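The three steps can be sketched in C as follows. This is a minimal sketch and not stryker's code: the names Node, create_node, insert and preorder are made up for this example, keys are compared as unsigned values, and the pre-order walk writes into an array instead of an edit control.

```c
#include <stdlib.h>

typedef struct Node {
    unsigned id;
    struct Node *left, *right;
} Node;

/* Allocate a zeroed node, like HeapAlloc with HEAP_ZERO_MEMORY. */
static Node *create_node(unsigned id)
{
    Node *n = calloc(1, sizeof *n);
    if (n != NULL)
        n->id = id;
    return n;
}

/* Returns 1 if the key was inserted, 0 if it already exists
   (or if the allocation failed). */
static int insert(Node **root, unsigned id)
{
    while (*root != NULL) {
        if (id == (*root)->id)
            return 0;                               /* duplicate key */
        root = id < (*root)->id ? &(*root)->left : &(*root)->right;
    }
    *root = create_node(id);                        /* found the empty spot */
    return *root != NULL;
}

/* Pre-order walk (node, left, right). Appends the keys to `out`
   and returns the new element count. */
static size_t preorder(const Node *n, unsigned *out, size_t count)
{
    if (n == NULL)
        return count;
    out[count++] = n->id;
    count = preorder(n->left, out, count);
    return preorder(n->right, out, count);
}
```

Passing the address of the root pointer lets one function cover both cases of step 2: an empty tree and an existing root.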
I forgot to add these 2 things on how to print the tree.

code:
InfixPrint PROC CurNode:DWORD
mov eax, CurNode

or eax, eax
jnz @F
ret
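@@:
push eax
mov eax, (BINTREE PTR[eax]).ptLeft
invoke InfixPrint, eax
pop eax
push eax
invoke dwtoa, (BINTREE PTR[eax]).ID, OFFSET tmpBufr
invoke SendDlgItemMessage, wHndle, IDE_BINTREEOUTPUT, EM_REPLACESEL, NULL,
OFFSET Newline
invoke SendDlgItemMessage, wHndle, IDE_BINTREEOUTPUT, EM_REPLACESEL, NULL,
OFFSET tmpBufr
pop eax
mov eax, (BINTREE PTR[eax]).ptRight
invoke InfixPrint, eax
ret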
InfixPrint ENDP
PostfixPrint PROC CurNode:DWORD
mov eax, CurNode

or eax, eax
jnz @F
ret
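@@:
push eax
mov eax, (BINTREE PTR[eax]).ptLeft
invoke PostfixPrint, eax
pop eax
push eax
mov eax, (BINTREE PTR[eax]).ptRight
invoke PostfixPrint, eax
pop eax
invoke dwtoa, (BINTREE PTR[eax]).ID, OFFSET tmpBufr
invoke SendDlgItemMessage, wHndle, IDE_BINTREEOUTPUT, EM_REPLACESEL, NULL,
OFFSET Newline
invoke SendDlgItemMessage, wHndle, IDE_BINTREEOUTPUT, EM_REPLACESEL, NULL,
OFFSET tmpBufr
ret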
PostfixPrint ENDP

Once the spring semester is over I'll release the whole source. Here are the structures
and "variables" I used
code:
_DATA SEGMENT
Newline DB 0Dh, 0Ah, 0
BinTreeTitle DB "stryker", 0
BinTreeError DB "Cannot Add To The Tree. Current", 0Dh, 0Ah
DB "ID Value Already Exists On The Tree.", 0
_DATA ENDS
_BSS SEGMENT
tmpBufr DB 9 DUP(?)
bState DD ?
hPrcs DD ?
hMem DD ?
rootNode DD ?
wHndle DD ?
_BSS ENDS
BINTREE STRUCT
ID DD ?
ptLeft DD ?
ptRight DD ?
BINTREE ENDS
Here's another one on searching for a node in the tree

code:
BSearch PROC nNode:DWORD, IDValue:DWORD
mov eax, nNode

mov edx, IDValue
@@SwingNode:
cmp eax, NULL
je @@BSearchExit
cmp edx, (BINTREE PTR [eax]).ID
jb @@GoLeft ; unsigned compare, to match the ja below
ja @@GoRight
ret
@@GoLeft:
mov eax, (BINTREE PTR [eax]).ptLeft
jmp @@SwingNode
@@GoRight:
mov eax, (BINTREE PTR [eax]).ptRight
jmp @@SwingNode
@@BSearchExit:
xor eax, eax
ret
BSearch ENDP
The return value will be in EAX; if it returns 0 then the key/id doesn't exist, else
...
Preliminary call: invoke BSearch, rootNode, IDorKeyToSearch
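The search loop can be sketched in C like this. It is only a sketch with made-up names (Node, tree_search); like BSearch it walks down from the root without recursion, and it returns a null pointer when the key is not in the tree.

```c
#include <stddef.h>

typedef struct Node {
    unsigned id;
    struct Node *left, *right;
} Node;

/* Returns the node with the given key, or NULL if it is not in the tree. */
static const Node *tree_search(const Node *n, unsigned id)
{
    while (n != NULL && n->id != id)
        n = id < n->id ? n->left : n->right;   /* swing left or right */
    return n;
}
```

Each step discards one whole subtree, so the number of comparisons is bounded by the height of the tree.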
Here's another update. This one will count all the nodes in a tree.
code:
BCount PROC nNode:DWORD, nCount:DWORD
mov eax, nCount

mov ecx, nNode
or ecx, ecx
jz @F
inc eax
push ecx

mov ecx, (BINTREE PTR [ecx]).ptLeft

invoke BCount, ecx, eax
pop ecx
mov ecx, (BINTREE PTR [ecx]).ptRight
invoke BCount, ecx, eax
@@:
ret
BCount ENDP
Return value will be in eax.
Preliminary Call: invoke BCount, rootNode, 0
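BCount passes the running total down into each call and gets the updated total back, which is why the first call starts with 0. A small C sketch of the same accumulator style (Node and count_nodes are names made up for this example):

```c
#include <stddef.h>

typedef struct Node {
    unsigned id;
    struct Node *left, *right;
} Node;

/* Takes the running total and returns the total including this subtree. */
static unsigned count_nodes(const Node *n, unsigned count)
{
    if (n == NULL)
        return count;                             /* nothing to add */
    count = count_nodes(n->left, count + 1);      /* this node plus the left subtree */
    return count_nodes(n->right, count);          /* then the right subtree */
}
```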
Here's another update:

code:
Max PROC A:DWORD, B:DWORD
mov eax, A
mov edx, B
cmp eax, edx
jb @F
ret
@@:
mov eax, edx
ret
Max ENDP
BHeight PROC nNode:DWORD
mov ecx, nNode

or ecx, ecx
jnz @F
mov eax, -1
ret
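@@:
push ecx
mov ecx, (BINTREE PTR [ecx]).ptLeft ; height of the left subtree
invoke BHeight, ecx
pop ecx
push eax
mov ecx, (BINTREE PTR [ecx]).ptRight ; height of the right subtree
invoke BHeight, ecx
pop edx
inc eax
inc edx
invoke Max, eax, edx
ret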
BHeight ENDP
BLevel PROC nNode:DWORD
mov ecx, nNode

or ecx, ecx
jnz @F
xor eax, eax
ret
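@@:
push ecx
mov ecx, (BINTREE PTR [ecx]).ptLeft ; levels of the left subtree
invoke BLevel, ecx
pop ecx
push eax
mov ecx, (BINTREE PTR [ecx]).ptRight ; levels of the right subtree
invoke BLevel, ecx
pop edx
inc eax
inc edx
invoke Max, eax, edx
ret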
BLevel ENDP
BHeight will give you the height of the tree (an empty tree has height -1, a single node 0).
BLevel will give you the number of levels of the tree (an empty tree has 0 levels, a single node 1).
Preliminary call: invoke Function, rootNode
Return value: in EAX.
Here's the final installment for the binary trees. I don't know if this one works perfectly but
I did my best to hunt down the bugs.
To remove a node from the tree, just pass the root node and the key of the node to delete
and call BDelete.
code:
BRemove PROC nParent:DWORD, nNode:DWORD
mov ecx, nParent
mov eax, nNode
cmp (BINTREE PTR [eax]).ptLeft, NULL
jne @@CheckRightNode
cmp (BINTREE PTR [eax]).ptRight, NULL
jne @@ChildOnRight
;Leaf Node
cmp eax, rootNode
jne @F
mov rootNode, 0
jmp @@DeallocateNode
@@:
cmp (BINTREE PTR [ecx]).ptLeft, eax
jne @@NullifyRight
mov (BINTREE PTR [ecx]).ptLeft, NULL
jmp @@DeallocateNode
@@NullifyRight:
mov (BINTREE PTR [ecx]).ptRight, NULL
jmp @@DeallocateNode
@@CheckRightNode:
cmp (BINTREE PTR [eax]).ptRight, NULL
jne @@TwoChildren
;Child On Left
or ecx, ecx
jnz @F
mov ecx, (BINTREE PTR [eax]).ptLeft
mov rootNode, ecx
jmp @@DeallocateNode
@@:
cmp (BINTREE PTR [ecx]).ptLeft, eax
je @@JoinLeft
mov edx, (BINTREE PTR [eax]).ptLeft
mov (BINTREE PTR [ecx]).ptRight, edx
jmp @@DeallocateNode
@@JoinLeft:
mov edx, (BINTREE PTR [eax]).ptLeft
mov (BINTREE PTR [ecx]).ptLeft, edx
jmp @@DeallocateNode
@@ChildOnRight:
;Child On Right
or ecx, ecx
jnz @F
mov ecx, (BINTREE PTR [eax]).ptRight
mov rootNode, ecx
jmp @@DeallocateNode
@@:
cmp (BINTREE PTR [ecx]).ptLeft, eax
je @@JoinRight
mov edx, (BINTREE PTR [eax]).ptRight
mov (BINTREE PTR [ecx]).ptRight, edx
jmp @@DeallocateNode
@@JoinRight:
mov edx, (BINTREE PTR [eax]).ptRight
mov (BINTREE PTR [ecx]).ptLeft, edx
jmp @@DeallocateNode
@@TwoChildren:
;Two Child Nodes
mov edx, eax ; edx = node whose key gets replaced
mov ecx, eax ; ecx = parent of the candidate
mov eax, (BINTREE PTR [eax]).ptLeft ; start in the left subtree
@@FindTheLargestKey:
cmp (BINTREE PTR [eax]).ptRight, NULL
je @@Replace
mov ecx, eax
mov eax, (BINTREE PTR [eax]).ptRight
jmp @@FindTheLargestKey
@@Replace:
;Just copy the contents to its new location
push (BINTREE PTR [eax]).ID
pop (BINTREE PTR [edx]).ID
;Process other structure field names.
;Revert to the 2 cases above. Because the one to replace
;cannot be and will not have 2 child nodes. The one to replace
;will either fall into cases 1 and 2 which is either a leaf node
;or a node with only one child.
invoke BRemove, ecx, eax
ret
@@DeallocateNode:
invoke HeapFree, hPrcs, NULL, eax
ret
BRemove ENDP
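BDelete PROC nNode:DWORD, IDValue:DWORD
mov eax, nNode
mov edx, IDValue
xor ecx, ecx ; the root has no parent
@@SwingNode:
or eax, eax
jz @@BSearchExit
cmp edx, (BINTREE PTR [eax]).ID
jb @@GoLeft ; unsigned compare, to match the ja below
ja @@GoRight
invoke BRemove, ecx, eax
ret
@@GoLeft:
mov ecx, eax ; remember the parent
mov eax, (BINTREE PTR [eax]).ptLeft
jmp @@SwingNode
@@GoRight:
mov ecx, eax ; remember the parent
mov eax, (BINTREE PTR [eax]).ptRight
jmp @@SwingNode
@@BSearchExit:
xor eax, eax
ret
BDelete ENDP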
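The three cases handled by BRemove (leaf, one child, two children) can be sketched compactly in C. This is a hedged sketch with made-up names, not a translation of the listing. Instead of carrying a parent pointer it keeps a pointer to the link that points at the current node, and nodes are assumed to come from malloc/calloc. The small tree_insert helper is only there so the example can be tried out.

```c
#include <stdlib.h>

typedef struct Node {
    unsigned id;
    struct Node *left, *right;
} Node;

/* Helper for trying the sketch: returns 1 if inserted, 0 otherwise. */
static int tree_insert(Node **link, unsigned id)
{
    while (*link != NULL) {
        if (id == (*link)->id)
            return 0;
        link = id < (*link)->id ? &(*link)->left : &(*link)->right;
    }
    *link = calloc(1, sizeof **link);
    if (*link == NULL)
        return 0;
    (*link)->id = id;
    return 1;
}

/* Removes the node with key `id`. Returns 1 if removed, 0 if not found. */
static int tree_delete(Node **link, unsigned id)
{
    Node *victim;

    while (*link != NULL && (*link)->id != id)
        link = id < (*link)->id ? &(*link)->left : &(*link)->right;
    victim = *link;
    if (victim == NULL)
        return 0;

    if (victim->left != NULL && victim->right != NULL) {
        /* Two children: take over the largest key of the left subtree,
           then remove the node that key came from instead. */
        Node **pred = &victim->left;
        while ((*pred)->right != NULL)
            pred = &(*pred)->right;
        victim->id = (*pred)->id;
        link = pred;
        victim = *pred;
    }
    /* victim now has at most one child: splice it out of the tree. */
    *link = victim->left != NULL ? victim->left : victim->right;
    free(victim);
    return 1;
}
```

Because the root pointer is just another link, deleting the root needs no special case here.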

CHAPTER 4 The Basic Skeleton Of
The Disassembler
Before we start coding our disassembler we should define what it will
look like.
First we need to design the GUI. There we define which buttons, lists and menu items we need.
Next we try to modularise the project. We need to define what functionality the disassembler
needs and try to package it into modules. If you are experienced in coding HLLs
like C++ you know that you normally package your "real" code into modules, procedures
and libraries. We do the same here.
Imagine your code getting longer and longer while you have just one file! Your mouse wheel
will be thankful if you avoid that…
Another important point is that changing a module is faster than changing one big long file.
This chapter is the beginning of our "real" disassembler. Even experienced users should
have a look at it, because this code design is our base layout. Of course you can adapt the layout
later to your needs, but please let us all talk about the same thing.

The Basic Skeleton Of The Disassembler
To go further with our disassembler, all readers should work with the same design and GUI.
Well, this is it...

This skeleton shows us HOW the disassembler will look when we are finished. As you can see,
we will get all necessary information about our file: ImageBase, EntryPoint RVA and File Offset,
the number of sections and much more.
So now we have designed the GUI of the disassembler, and as you can see it more or less
reflects all the modules and procedures we need to code.
But where can we start?
Well, designing a software product is always where you should begin. Never start by
hacking some code into your IDE... when the project grows, you will be lost in your own
source and debugging will be a pain!
In this chapter we will make a working skeleton of our disassembler engine with its GUI.
For this we will need the following files:
AodBasicDisasm.Asm
AodBasicDisasm.rc
Const.inc
Idata.inc
Main.inc
PE.asm
Protos.inc
Struct.inc
Types.inc
Udata.inc
The resource files
We will discuss each file on its own. First we look at the main files, then we have a look at
the include files, and last we take a deeper look into the PE file.

Part 1 - AodBasicDisasm.asm
As you can guess, this is the main file of our disassembler engine. Here we "draw" our
GUI for the engine and include all necessary files.
One of the necessary commands is

include Main.inc ;Libraries, Definitions & Modules
Here we bind the needed libraries and modules into the engine. We will take a
deeper look at these files later.
I do it the way I learned from the Iczelion tutorials: I will now give you the source
code of this file, and then we will discuss it nearly line by line...
;======================================================================
; AoD Basic Disassembler
; http://aod.anticrack.de/
;======================================================================
.686
.model flat, stdcall ;32 bit memory model
option casemap :none ;case sensitive
include Main.inc ;Libraries, Definitions & Modules
.code
start:
invoke GetModuleHandle,NULL ;Get the Main hInstance
mov hInstance,eax
mov icex.dwICC,ICC_PROGRESS_CLASS
invoke InitCommonControlsEx,addr icex
mov AllocatedMem,0 ;First use
invoke LoadIconA,hInstance,IDI_ICON
mov hIcon,eax
invoke DialogBoxParam,hInstance,IDD_MAIN,NULL,addr DlgProc,NULL ;Show Main Dialog
invoke ExitProcess,0 ;Leave the application once the dialog is closed

;>-- Dialog Proc --<;
DlgProc proc uses esi edi ebx ebp hWin:HWND,uMsg:UINT,wParam:WPARAM,lParam:LPARAM
push hWin
pop hWnd ;Store Dialog Window Handle
mov eax,uMsg ;Window Msg in EAX
.if eax==WM_INITDIALOG
;-------------------------- Dialog Init ------------------
invoke SendMessage,hWnd,WM_SETICON,ICON_SMALL,hIcon;Set Icon
invoke GetDlgItem,hWin,IDC_DISASM ;Get some handles
mov hDisassembler,eax
invoke GetDlgItem,hWnd,IDC_STATUSBAR
mov hStatusbar,eax
invoke GetDlgItem,hWnd,IDC_PROGRESS
mov hProgressbar,eax
invoke lstrcpy,addr lf.lfFaceName,addr FontC;Set Font
mov lf.lfCharSet,DEFAULT_CHARSET;CharSet
mov lf.lfHeight,-12;Height
mov lf.lfWidth,FW_DONTCARE;Width
mov lf.lfPitchAndFamily,DEFAULT_PITCH OR FF_MODERN;Pitch & Family
invoke CreateFontIndirect,addr lf;Create & Get Font Handle
mov hLfnt,eax;Store Font Handle
invoke SendMessage,hDisassembler,WM_SETFONT,hLfnt,FALSE
;Set Font for Disassembler
invoke lstrcpy,addr lf.lfFaceName,addr FontT ;Set Font
mov lf.lfPitchAndFamily,DEFAULT_PITCH OR FF_DONTCARE ;Pitch & Family
invoke CreateFontIndirect,addr lf ;Create & Get Font Handle
mov hSfnt,eax ;Store Font Handle
invoke SendMessage,hStatusbar,WM_SETFONT,hSfnt,FALSE
;Set Font for Status
invoke CreateSolidBrush,dwListBoxBack;ListBox BackGround
mov hListBoxBack,eax;Store the brush handle

.elseif eax==WM_CTLCOLORLISTBOX
;----------------- Colorize Our Disassembler -----
mov ebx,lParam ;lParam = Control Handle
.if ebx==hDisassembler
invoke SetTextColor,wParam,dwDisasmFore ;wParam = Handle to HDC; Set ForeGround Color
.elseif ebx==hStatusbar
invoke SetTextColor,wParam,dwStatusFore ;Set ForeGround Color
.endif
invoke SetBkColor,wParam,dwListBoxBack ;Set BackGround Color (eax was overwritten by SetTextColor)
mov eax,hListBoxBack ;Return the brush handle
ret
.elseif eax==WM_COMMAND
;---------------------------- WM_COMMAND -----------------
mov eax,wParam
.if ax==IDM_OPEN
;----- MenuItem OPEN ---------------------------------
invoke ResetVars;Reset Variables & Close Files if needed
invoke OpenTheFile;Open the file to be disassembled
cmp eax,0
;If the function succeeds the file is mapped in memory
jz ErrInOpening
invoke CheckPE;Check for valid PE file
cmp eax,0
jz ErrInPE
invoke DisplayWelcome
;invoke DisassembleFile, CodeSection, dwCodeSize;Disassemble it!
;invoke AddLine, offset disNewLine
;invoke AddLine, offset disEnd
.elseif ax==IDM_GOTOOFFSET
;----- MenuItem GOTO OFFSET ----------------------------
invoke DialogBoxParam,hInstance,IDD_GOTOOFFSET,hWin,
addr GotoOffsetDlgProc ,NULL
.elseif ax==IDM_GOTOENTRY
;----- MenuItem GOTO ENTRY POINT -----------------------
invoke SendMessage,hDisassembler,LB_FINDSTRING,
-1,addr szEntryPoint
cmp eax,LB_ERR
jz NotFound

invoke SendMessage,hDisassembler,LB_SETCURSEL,eax,
0;If found, move the cursor at this position
ret
NotFound:
invoke MessageBeep,-1;If not, BEEPs
ret
.elseif ax==IDM_ABOUT
;----- MenuItem ABOUT -------------------------------------
invoke MessageBox,hWnd,addr About,addr CapAbout,
MB_OK;Show About Box
.elseif ax==IDM_EXIT
;----- MenuItem EXIT --------------------------------------
invoke SendMessage,hWnd,WM_CLOSE,NULL,NULL;Same as WM_CLOSE
.endif
.elseif eax==WM_CLOSE
;------------------------------ WM_CLOSE --------------------------
ErrInOpening:
invoke MessageBox, hWnd,addr AreYouSure,addr Exit,MB_YESNO
cmp eax, IDYES
jnz NoExit
invoke DeleteObject,hListBoxBack;Delete the brush
invoke DeleteObject,hLfnt;Delete Font Handles
invoke DeleteObject,hSfnt
invoke ResetVars;Close Files
invoke EndDialog,hWnd,0;The End
.else
NoExit:
ErrInPE:
mov eax,FALSE
ret
.endif
mov eax,TRUE
ret
DlgProc endp
end start

Part 1 - Discussion of AodBasicDisasm.asm

First we have to define our program:
______________________________________________________________________
.686
.model flat, stdcall;32 bit memory model
option casemap :none;case sensitive
include Main.inc;Libraries, Definitions & Modules

______________________________________________________________________
Well, this is not very impressive. We define our memory model and include our Main.inc
file, which pulls in our libraries, definitions and modules.
______________________________________________________________________
start:
invoke GetModuleHandle,NULL;Get the Main hInstance
mov hInstance,eax
mov icex.dwICC,ICC_PROGRESS_CLASS
invoke InitCommonControlsEx,addr icex
mov AllocatedMem,0;First use
invoke LoadIconA,hInstance,IDI_ICON
mov hIcon,eax
invoke DialogBoxParam,hInstance,IDD_MAIN,NULL,addr DlgProc,NULL ;Show Main Dialog
invoke ExitProcess,0 ;Leave the application once the dialog is closed
______________________________________________________________________
Here we have the main routine of our disassembler. We initialise our common controls,
mark that no memory has been allocated yet, load the icon of our application, show our
application as a dialog box and finally exit the application once the dialog has been closed.

______________________________________________________________________
DlgProc proc uses esi edi ebx ebp hWin:HWND,uMsg:UINT,wParam:WPARAM,lParam:LPARAM
______________________________________________________________________
Here we define our dialog procedure. I don't have to explain this...
The next block contains the routine which sets up our GUI when the disassembler
starts:
(I won't go into details here since this is not an assembly course)
______________________________________________________________________
.if eax==WM_INITDIALOG
;-------------------------- Dialog Init ------------------
invoke SendMessage,hWnd,WM_SETICON,ICON_SMALL,hIcon;Set Icon
invoke GetDlgItem,hWin,IDC_DISASM ;Get some handles
mov hDisassembler,eax
invoke GetDlgItem,hWnd,IDC_STATUSBAR
mov hStatusbar,eax
invoke GetDlgItem,hWnd,IDC_PROGRESS
mov hProgressbar,eax
invoke lstrcpy,addr lf.lfFaceName,addr FontC ;Set Font
mov lf.lfCharSet,DEFAULT_CHARSET ;CharSet
mov lf.lfHeight,-12 ;Height
mov lf.lfWidth,FW_DONTCARE ;Width
mov lf.lfPitchAndFamily,DEFAULT_PITCH OR FF_MODERN ;Pitch & Family
invoke CreateFontIndirect,addr lf ;Create & Get Font Handle
mov hLfnt,eax ;Store Font Handle
invoke SendMessage,hDisassembler,WM_SETFONT,hLfnt,FALSE ;Set Font for Disassembler
invoke lstrcpy,addr lf.lfFaceName,addr FontT ;Set Font
mov lf.lfPitchAndFamily,DEFAULT_PITCH OR FF_DONTCARE ;Pitch & Family
invoke CreateFontIndirect,addr lf ;Create & Get Font Handle
mov hSfnt,eax ;Store Font Handle
invoke SendMessage,hStatusbar,WM_SETFONT,hSfnt,FALSE ;Set Font for Status
invoke CreateSolidBrush,dwListBoxBack;ListBox BackGround
mov hListBoxBack,eax;Store the brush handle
______________________________________________________________________

Next we assign colors to our ListBox element, since we want differences in the
disassembled code to be easier to see:
______________________________________________________________________
.elseif eax==WM_CTLCOLORLISTBOX ;----------------- Colorize Our Disassembler -----
mov ebx,lParam ;lParam = Control Handle
.if ebx==hDisassembler
invoke SetTextColor,wParam,dwDisasmFore ;wParam = Handle to HDC; Set ForeGround Color
.elseif ebx==hStatusbar
invoke SetTextColor,wParam,dwStatusFore ;Set ForeGround Color
.endif
invoke SetBkColor,wParam,dwListBoxBack ;Set BackGround Color
mov eax,hListBoxBack ;Return the brush handle
ret
______________________________________________________________________
And now we can start with the interesting part!

Our disassembling routines!
First we have to check which command the user has sent:
______________________________________________________________________
.elseif eax==WM_COMMAND
;---------------------------- WM_COMMAND -----------------
mov eax,wParam
______________________________________________________________________
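For WM_COMMAND the low word of wParam carries the menu or control ID, which is why the listing compares only ax. A small C sketch of this dispatch follows; the numeric IDs and the Action values are invented for the example (the real IDs live in the resource and include files of the project).

```c
#include <stdint.h>

/* Hypothetical command IDs, chosen only for this sketch. */
enum { IDM_OPEN = 1001, IDM_GOTOOFFSET, IDM_GOTOENTRY, IDM_ABOUT, IDM_EXIT };

typedef enum {
    ACT_NONE, ACT_OPEN, ACT_GOTO_OFFSET, ACT_GOTO_ENTRY, ACT_ABOUT, ACT_EXIT
} Action;

/* Decide what to do for a WM_COMMAND message. */
static Action dispatch_command(uintptr_t wParam)
{
    switch (wParam & 0xFFFFu) {            /* low word = command ID, like testing ax */
    case IDM_OPEN:       return ACT_OPEN;
    case IDM_GOTOOFFSET: return ACT_GOTO_OFFSET;
    case IDM_GOTOENTRY:  return ACT_GOTO_ENTRY;
    case IDM_ABOUT:      return ACT_ABOUT;
    case IDM_EXIT:       return ACT_EXIT;
    default:             return ACT_NONE;  /* not one of our commands */
    }
}
```

Masking off the high word matters because it can hold a notification code when the message comes from a control.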

Next comes THE main routine of the disassembler! For now there are 3 commands we want to handle...
1. Open a file
2. Go to a specific offset
3. Go to the Entry Point
Don't be shocked! The next lines are few and do not contain much information yet.
______________________________________________________________________
.if ax==IDM_OPEN
;----- MenuItem OPEN ---------------------------------
invoke ResetVars ;Reset Variables & Close Files if needed
invoke OpenTheFile ;Open the file to be disassembled
cmp eax,0 ;If the function succeeds the file is mapped in memory
jz ErrInOpening
______________________________________________________________________
As we can see, we first reset all variables before we do anything else. This is necessary in case
another disassembled file is still in memory; otherwise we might mix up offsets or other data
of 2 different files...
After this we call a routine which opens the wanted file and maps it into memory. How this
function works is not important yet; it does the job. We will take a deeper look at it later.
Before we come to the interesting part we do some small error handling: if the file could not
be opened, we jump to ErrInOpening.

Now we are ready to do the important parts...

______________________________________________________________________
invoke CheckPE ;Check for valid PE file
cmp eax,0
jz ErrInPE
invoke DisplayWelcome
______________________________________________________________________
This is an important function! We have to check whether we have a valid PE file. If not, we must
stop the disassembling process, or our disassembler may hang or crash.
Well, after this we show a little "Welcome Message" - whatever this means. We don't
have to know yet.
______________________________________________________________________
;invoke DisassembleFile, CodeSection, dwCodeSize ;Disassemble it!
;invoke AddLine, offset disNewLine
;invoke AddLine, offset disEnd
______________________________________________________________________
Yes. This is the heart of our main application. Finally we have reached the core. The core
contains 3 main procedure calls (still commented out in the skeleton):
Disassembling the file and adding our output so that we can see it with our GUI.
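To give an idea of what a validity check on a mapped file involves, here is a minimal C sketch. It is an assumption about the kind of test a routine like CheckPE performs, not its actual code (PE.asm is discussed later). The function name looks_like_pe is made up. The facts used are part of the PE format: the file starts with "MZ", the DWORD at offset 3Ch holds the file offset of the PE header, and that header starts with the signature "PE", 0, 0.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 if the buffer starts with a DOS header that points
   to a "PE\0\0" signature inside the buffer, else 0. */
static int looks_like_pe(const uint8_t *image, size_t size)
{
    uint32_t e_lfanew;

    if (size < 0x40 || image[0] != 'M' || image[1] != 'Z')
        return 0;                                  /* no DOS header */

    /* e_lfanew: little-endian DWORD at offset 3Ch */
    e_lfanew = (uint32_t)image[0x3C]
             | ((uint32_t)image[0x3D] << 8)
             | ((uint32_t)image[0x3E] << 16)
             | ((uint32_t)image[0x3F] << 24);

    if (e_lfanew > size - 4)
        return 0;                                  /* header would lie outside the file */

    return memcmp(image + e_lfanew, "PE\0\0", 4) == 0;
}
```

The bounds check comes before the signature compare, so a damaged or truncated file is rejected instead of being read past its end.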
______________________________________________________________________
.elseif ax==IDM_GOTOOFFSET
;----- MenuItem GOTO OFFSET ----------------------------
invoke DialogBoxParam,hInstance,IDD_GOTOOFFSET,hWin,
addr GotoOffsetDlgProc ,NULL
______________________________________________________________________
This opens the "Goto Offset" dialog. What its dialog procedure does is not important here;
we will take a deeper look at it later.

______________________________________________________________________
.elseif ax==IDM_GOTOENTRY
;----- MenuItem GOTO ENTRY POINT -----------------------
invoke SendMessage,hDisassembler,LB_FINDSTRING,-1,addr szEntryPoint
cmp eax,LB_ERR
jz NotFound
invoke SendMessage,hDisassembler,LB_SETCURSEL,eax,0
;If found, move the cursor at this position
ret
______________________________________________________________________
This routine handles the “Jump to our Entry-Point”. As I said before: We will discuss this later.
______________________________________________________________________
NotFound:
invoke MessageBeep,-1;If not, BEEPs
ret
.elseif ax==IDM_ABOUT
;----- MenuItem ABOUT -------------------------------------
invoke MessageBox,hWnd,addr About,addr CapAbout,MB_OK;Show About Box
.elseif ax==IDM_EXIT
;----- MenuItem EXIT --------------------------------------
invoke SendMessage,hWnd,WM_CLOSE,NULL,NULL;Same as WM_CLOSE
.endif
.elseif eax==WM_CLOSE
;------------------------------ WM_CLOSE --------------------------
______________________________________________________________________
Here we handle the rest of our possible commands, plus something we never want to see:
NotFound is our error handler (a beep) if we do not find an entry point.
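
The Goto Entry Point handler, including its NotFound branch, boils down to a prefix search: the address is formatted with the findEP template (" %08X:") and LB_FINDSTRING returns the first listbox line that begins with that text, ignoring case. As a preview, here is a minimal Python sketch of that idea (Python rather than MASM, purely for illustration). The function name find_line and the sample lines are assumptions made for this example; the real line format of the disassembler output is defined later.

```python
from typing import Optional, Sequence


def find_line(lines: Sequence[str], address: int) -> Optional[int]:
    """Index of the first line starting with ' XXXXXXXX:', or None."""
    prefix = " %08X:" % address            # same template as findEP
    wanted = prefix.lower()                # LB_FINDSTRING ignores case
    for index, line in enumerate(lines):
        if line.lower().startswith(wanted):
            return index
    return None                            # the NotFound case (beep)


if __name__ == "__main__":
    # Hypothetical listbox content, only for the demo.
    listing = [
        " 00401000: push ebp",
        " 00401001: mov ebp,esp",
        " 00401003: ret",
    ]
    print(find_line(listing, 0x00401001))
    print(find_line(listing, 0x00500000))
```

Returning None corresponds to the LB_ERR result that sends the handler to NotFound.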

______________________________________________________________________
ErrInOpening:
invoke MessageBox, hWnd,addr AreYouSure,addr Exit,MB_YESNO
cmp eax, IDYES
jnz NoExit
invoke DeleteObject,hListBoxBack;Delete the brush
invoke DeleteObject,hLfnt;Delete Font Handles
invoke DeleteObject,hSfnt
invoke ResetVars;Close Files
invoke EndDialog,hWnd,0;The End
.else
______________________________________________________________________
Here we handle the problems when we cannot open the wanted file. Maybe it is damaged
or opened by another application. Who knows, but we handle this.
Well, this was the easy beginning of our disassembler engine. I promise that you will get
much harder stuff when we go into details.

Part 2 - PE.asm
.code
;>-- Get Sections Info --<;
GetSections proc uses esi edi ebx
;esi points to PE-HEADER
xor eax,eax
mov ax,word ptr [esi].FileHeader.NumberOfSections;Get # of Sections
mov wSections,ax;Store it
push eax
invoke wsprintfA,addr StatusText,addr stTempSections,eax;Display in Status
invoke SetStatus,addr StatusText
pop eax
push eax
invoke wsprintf,addr StatusText,addr stHex,eax
invoke SetDlgItemText,hWnd,IDC_SECTIONS,addr StatusText
pop eax;Display correct number of sections
cmp ax,MAX_SECTIONS;But check if they fit on sections buffer
jbe NSectionsOk;And adjust if they don't
mov wSections,MAX_SECTIONS

NSectionsOk:
add esi,sizeof IMAGE_NT_HEADERS;1st Section's name (esi points to IMAGE_SECTION_HEADER)
assume edi: ptr SECTION
lea edi,FileSections;edi points to Section's data
assume esi: ptr IMAGE_SECTION_HEADER;Assume esi as an IMAGE_SECTION_HEADER
assume edi: ptr SECTION;Assume edi as a SECTION
xor ebx,ebx;Section Index = 0
GetSectionsInfo:
push esi
push edi
mov ecx,8;Section's Name Length
rep movsb;Copy Name
pop edi
pop esi
mov ax,word ptr [esi].Misc.VirtualSize;VirtualSize
mov word ptr [edi].VirtualSize,ax
mov ax,word ptr [esi].VirtualAddress;VirtualAddress
mov word ptr [edi].VirtualAddress,ax
mov ax,word ptr [esi].SizeOfRawData;PhysicalSize
mov word ptr [edi].RawSize,ax
mov ax,word ptr [esi].PointerToRawData;PhysicalOffset
mov word ptr [edi].RawAddress,ax
mov eax,[esi].Characteristics;Characteristics
mov [edi].Characteristics,eax
add esi,sizeof IMAGE_SECTION_HEADER;Next Section (Source)
add edi,sizeof SECTION;Next Section (Destination)
inc ebx;Inc Section Index
cmp bx,word ptr [wSections];Last Section?
jnz GetSectionsInfo
ret
GetSections endp

;>-- Display Sections Info --<;
DisplaySections proc uses esi
LOCAL dwVOffset:DWORD,\
dwVSize:DWORD,\
dwROffset:DWORD,\
dwRSize:DWORD,\
dwChars:DWORD
assume esi: ptr SECTION

lea esi,FileSections
xor ecx,ecx;Section Index = 0
ShowSections:
push ecx
movzx eax,word ptr [esi].VirtualAddress;VirtualAddress
mov dwVOffset,eax
mov ax,word ptr [esi].VirtualSize;VirtualSize
mov dwVSize,eax
mov ax,word ptr [esi].RawAddress;PhysicalOffset
mov dwROffset,eax
mov ax,word ptr [esi].RawSize;PhysicalSize
mov dwRSize,eax
mov eax,dword ptr [esi].Characteristics;Characteristics
mov dwChars,eax
invoke wsprintfA,addr StatusText,addr stSectionsFound,\
dwVOffset,dwVSize,dwROffset,dwRSize,dwChars;Display In Status ListBox
invoke lstrcat,addr StatusText,esi;Append the section's name
invoke lstrcat,addr StatusText,addr stRightBracket
invoke SetStatus, addr StatusText
add esi,sizeof SECTION;Next Section
pop ecx;Restore Index
inc cx;Next Section
cmp cx,word ptr [wSections];Last Section?
jb ShowSections
ret
DisplaySections endp

;>-- Detects Code Section --<;

GetCodeSection proc uses esi ebx
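movzx eax,EntryPointRVA
push eax
invoke wsprintf,addr StatusText,addr stHex,eax;Format the RVA (assumed: this call is missing in the listing)
invoke SetDlgItemText,hWnd,IDC_EPRVA,addr StatusText
pop eax
push eax
invoke RVAToOffset,eax;Convert EntryPointRVA to Offset
mov EntryPointOffset,ax;eax = EntryPointOffset
invoke wsprintf,addr StatusText,addr stHex,eax;Format the offset (assumed, as above)
invoke SetDlgItemText,hWnd,IDC_EPOFFSET,addr StatusText
pop ebx
add ebx,ImageBase;ebx = ImageBase + EntryPointRVA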
invoke wsprintf,addr szEntryPoint,addr findEP,ebx
movzx ecx,byte ptr [CodeSectionIndex];Get Code Section Index
mov eax,sizeof SECTION
mul ecx
add eax,offset FileSections;Get the stored Code Section
mov esi,eax
movzx eax,word ptr [esi].RawAddress;Get Code Section PhysicalOffset
mov dword ptr [CodeSection],eax
invoke wsprintf,addr StatusText,addr stEntryPoint,ebx,eax
invoke SetStatus, addr StatusText
mov ecx, AllocatedMem
add [CodeSection],ecx;CodeSection = Offset of Code Section in memory
movzx eax,word ptr [esi].VirtualSize;Get Code Section Size
mov dword ptr [dwCodeSize],eax;Store Code Size
movzx eax,word ptr [esi].VirtualAddress
add [VirtualAddr],eax;VirtualAddr holds the Virtual Address of the first instruction in code section
ret
GetCodeSection endp

;>-- Check For Valid PE --<;
CheckPE proc uses esi ebx
mov esi,AllocatedMem ;esi points to beginning of mapped file
cmp word ptr [esi],IMAGE_DOS_SIGNATURE;Check For 'MZ'
jnz NotValidMZ;Jump if not valid
invoke SetStatus,addr stValidMZ;Valid 'MZ' Signature!
assume esi: ptr IMAGE_DOS_HEADER
movzx eax,word ptr [esi].e_lfanew;Get The PE Offset
add esi,eax;esi points to PE Header
cmp word ptr [esi],IMAGE_NT_SIGNATURE;Check For 'PE'
jnz NotValidPE;Jump if not valid
push esi;Store Pointer to PE Header
add esi,sizeof IMAGE_NT_HEADERS - sizeof IMAGE_OPTIONAL_HEADER32
assume esi: ptr IMAGE_OPTIONAL_HEADER32
mov ax,word ptr [esi].AddressOfEntryPoint;Entry Point RVA
mov word ptr [EntryPointRVA],ax
mov ebx,dword ptr [esi].ImageBase;ImageBase
mov ImageBase,ebx
invoke wsprintf,addr StatusText,addr stHex,ebx
invoke SetDlgItemText,hWnd,IDC_IMAGEBASE,addr StatusText
movzx eax,word ptr [esi].Subsystem;SubSystem
mov eax,SubSystem[eax*4];Retrieve SubSystem Text from Array
invoke wsprintf,addr StatusText,addr String,eax
invoke SetDlgItemText,hWnd,IDC_SUBSYSTEM,addr StatusText
mov eax,[esi].DataDirectory.VirtualAddress;Exports RVA
invoke wsprintf,addr StatusText,addr sRVA,eax
invoke SetDlgItemText,hWnd,IDC_AEXPORT,addr StatusText
mov eax,[esi].DataDirectory.isize;Exports Size
invoke wsprintf,addr StatusText,addr sSize,eax
invoke SetDlgItemText,hWnd,IDC_SEXPORT,addr StatusText
mov eax,[esi].DataDirectory[sizeof IMAGE_DATA_DIRECTORY].VirtualAddress;Imports RVA
invoke wsprintf,addr StatusText,addr sRVA,eax;(assumed: mirrors the Exports code above)
invoke SetDlgItemText,hWnd,IDC_AIMPORT,addr StatusText
mov eax,[esi].DataDirectory[sizeof IMAGE_DATA_DIRECTORY].isize;Imports Size
invoke wsprintf,addr StatusText,addr sSize,eax;(assumed)
invoke SetDlgItemText,hWnd,IDC_SIMPORT,addr StatusText
mov eax,[esi].DataDirectory[sizeof IMAGE_DATA_DIRECTORY*2].VirtualAddress;Rsrc RVA
invoke wsprintf,addr StatusText,addr sRVA,eax;(assumed)
invoke SetDlgItemText,hWnd,IDC_ARESOURCE,addr StatusText
mov eax,[esi].DataDirectory[sizeof IMAGE_DATA_DIRECTORY*2].isize;Rsrc Size
invoke wsprintf,addr StatusText,addr sSize,eax;(assumed)
invoke SetDlgItemText,hWnd,IDC_SRESOURCE,addr StatusText
mov eax,dword ptr [esi].SizeOfImage;ImageBaseSize
mov ImageBaseSize,eax
mov dword ptr [VirtualAddr],ebx ;VirtualAddr = ImageBase
pop esi;Retrieve Pointer to PE Header
invoke SetStatus,addr stValidPE;Valid PE Detected.
invoke GetSections;Get the file sections
invoke DisplaySections ;Display their Infos
invoke GetCodeSection ;Determine the Code Section
cmp dword ptr [CodeSection],0;Valid Section?
jz NoCodeSection ;No Code Secion? then Exit
;-------------------------------------------------------;
; VirtualAddr = VirtualAddr + CodeSection's VirtualAddr
;-------------------------------------------------------;
mov eax,1
ret;No Errors, return (1)
NotValidMZ:
invoke SetStatus,addr stNotValidMZ;Not a valid MZ...
jmp PExit
NotValidPE:
invoke SetStatus,addr stNotValidPE;Not a valid PE...
jmp PExit
NoCodeSection:
invoke SetStatus,addr stNoCodeSection;No Code Section...
PExit:
invoke ByeBye
xor eax,eax
ret
CheckPE endp

Part 2 - Discussion of PE.asm
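
CheckPE validates the file in two steps: it compares the first word of the mapped file with IMAGE_DOS_SIGNATURE ('MZ'), then follows e_lfanew to the PE header and compares the signature found there. The following is only a rough Python sketch of those two checks, not the MASM code of this project. The names check_pe and make_image and the synthetic test image are assumptions made for this example. Unlike the listing, which reads e_lfanew and the PE signature as 16-bit words, the sketch reads e_lfanew as a full 32-bit value, compares all four signature bytes, and checks that the offsets stay inside the file.

```python
import struct

IMAGE_DOS_SIGNATURE = 0x5A4D      # 'MZ' read as a little-endian word
IMAGE_NT_SIGNATURE = 0x00004550   # 'PE\0\0' read as a little-endian dword
E_LFANEW_OFFSET = 0x3C            # position of e_lfanew in IMAGE_DOS_HEADER


def check_pe(image: bytes) -> str:
    """Return 'ok', 'not MZ' or 'not PE' for a file image held in memory."""
    if len(image) < E_LFANEW_OFFSET + 4:
        return "not MZ"
    (mz,) = struct.unpack_from("<H", image, 0)
    if mz != IMAGE_DOS_SIGNATURE:
        return "not MZ"
    (e_lfanew,) = struct.unpack_from("<I", image, E_LFANEW_OFFSET)
    if e_lfanew + 4 > len(image):
        return "not PE"               # PE header would lie outside the file
    (sig,) = struct.unpack_from("<I", image, e_lfanew)
    return "ok" if sig == IMAGE_NT_SIGNATURE else "not PE"


def make_image(pe_offset: int = 0x80, signature: bytes = b"PE\0\0") -> bytes:
    """Build a tiny synthetic image (hypothetical helper for the demo)."""
    image = bytearray(pe_offset + 4)
    image[0:2] = b"MZ"
    struct.pack_into("<I", image, E_LFANEW_OFFSET, pe_offset)
    image[pe_offset:pe_offset + 4] = signature
    return bytes(image)


if __name__ == "__main__":
    print(check_pe(make_image()))
    print(check_pe(b"ZM" + bytes(0x40)))
```

The three return values correspond to the success path and to the NotValidMZ and NotValidPE labels in CheckPE.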

Part 3 - Tools.asm
.code
;>-- Clear Disassembler ListBox --<;
ResetDisassembler proc
invoke SendMessage,hDisassembler,LB_RESETCONTENT,0,0
ret
ResetDisassembler endp
;>-- Add Line In Disasm --<;

AddLine proc dwLineToAdd:DWORD
invoke SendMessage,hDisassembler,LB_ADDSTRING,0,dwLineToAdd
ret
AddLine endp
;>-- Display Welcome Message --<;

DisplayWelcome proc
invoke ResetDisassembler
invoke AddLine,addr Welcome1
invoke AddLine,addr EmptyLine
ret
DisplayWelcome endp

;>-- Reset Variables for a New File --<;
ResetVars proc
cmp AllocatedMem,0
jz NotAllocatedBefore;Test if we must release memory used
invoke UnmapViewOfFile,AllocatedMem;and reset variables
invoke CloseHandle,hmapFile
NotAllocatedBefore:
mov FileSize,0
mov AllocatedMem,0
mov AllocatedMemEnd,0
mov CodeSection,0
mov wSections,0
mov ImageBase,0
mov ImageBaseSize,0
mov dwCodeSize,0
mov CurVirtualOffset,0
mov VirtualAddr,0
ret
ResetVars endp
;>-- Show Msgs in Status Bar --<;

SetStatus proc dwMsg:DWORD
invoke SendMessage,hStatusbar,LB_ADDSTRING,0,dwMsg
invoke SendMessage,hStatusbar,LB_SETTOPINDEX ,eax, 0
ret
SetStatus endp
;>-- Clear Status Messages --<;

ClearStatus proc
invoke SendMessage,hStatusbar,LB_RESETCONTENT,0,0
ret
ClearStatus endp
;>-- Display Msg & Exit --<;

ByeBye proc
invoke SetStatus, addr stExiting
mov byte ptr [FileName],0;Clear FileName
invoke SetDlgItemText,hWnd,IDC_FILENAME,addr FileName
ret
ByeBye endp

;>-- Goto Any Virtual Offset --<;

GotoOffsetDlgProc proc hWin:HWND,uMsg:UINT,wParam:WPARAM,lParam:LPARAM
mov eax,uMsg
.if eax==WM_COMMAND
mov eax,wParam
.if ax==IDC_GOTO;--> Goto Button
invoke GetDlgItemText,hWin, IDC_GOTOOFFSET,addr OffsetToGoto+1,9
mov byte ptr [OffsetToGoto], ' ';First char is a space
invoke SendMessage,hDisassembler,LB_FINDSTRING,0,addr OffsetToGoto;Find Offset
cmp eax, LB_ERR
jz OffNotFound
invoke SendMessage,hDisassembler,LB_SETCURSEL,eax,0;Select Line With Offset in Disasm
jmp OffFound
OffNotFound:
invoke MessageBox,hWin,addr OffsetNotFound,addr Err,MB_ICONINFORMATION
jmp TryAgain
.elseif ax==IDC_GOTOCANCEL;--> Cancel Button
invoke SendMessage,hWin,WM_CLOSE,NULL,NULL;Close
.endif
.elseif eax==WM_CLOSE;--> Close Dialog
OffFound:
invoke EndDialog,hWin,0
.else
TryAgain:
mov eax,FALSE
ret
.endif
mov eax,TRUE
ret
GotoOffsetDlgProc endp

;>-- Convert RVA To Offset --<;
RVAToOffset proc uses ebx esi dwRVA:DWORD
lea esi,FileSections;esi = Section's Data
xor ecx,ecx;Section index
mov edx,dwRVA;Move RVA to edx
SearchNewSection:
movzx eax,word ptr [esi].VirtualAddress;RVA Section Start
cmp edx,eax
jl @F
; RVA >= RVA Section Start
movzx ebx,word ptr [esi].RawSize;Get Section's Size
add ebx,eax;RVA Section End
cmp edx,ebx;
jbe SectionFound;RVA Section Start <= RVA <= RVA Section End
@@:
add esi,sizeof SECTION;Next Section
inc cx
cmp cx,wSections;Check if we looped through all sections
jnz SearchNewSection
xor eax,eax;If nothing found return -1
dec eax
ret
SectionFound:
mov byte ptr [CodeSectionIndex],cl;Store the Code Section Index
mov ebx,eax
movzx eax,word ptr [esi].RawAddress;Get CodeSection's PhysicalOffset
sub ebx,eax;ebx = RVA Section Start - Offset Section Start
sub edx,ebx;edx = RVA - (VirtualAddr - PhysicalOffset)
mov eax,edx;Return File Offset in EAX
ret
RVAToOffset endp

;>-- Procedure to open files --<;

OpenTheFile proc
invoke MessageBox,hWnd,addr msgLoadDefault,addr msgCaption,MB_ICONQUESTION or MB_YESNO;Load Default?
cmp eax, IDYES
jnz LoadTheFile
invoke lstrcpy,addr FileName,addr TestingFile
jmp ContinueLoading
LoadTheFile:
mov ofn.lStructSize,SIZEOF ofn ;Prepare ofn structure
push hInstance
pop ofn.hInstance
mov ofn.lpstrFile, OFFSET FileName
mov ofn.nMaxFile,255h
mov ofn.Flags, OFN_FILEMUSTEXIST or \
OFN_PATHMUSTEXIST or OFN_LONGNAMES
invoke GetOpenFileName,addr ofn
ContinueLoading:
invoke CreateFile,addr FileName,\;Open the file as READ_ONLY
GENERIC_READ,\
FILE_SHARE_READ,\
NULL,OPEN_EXISTING,\
FILE_ATTRIBUTE_ARCHIVE,NULL
cmp eax,-1
jz ErrorHappened;CreateFile Failed?
mov hFile,eax;Store the file handle
invoke GetFileSize,hFile,0;
mov FileSize,eax;Store the file size
mov AllocatedMemEnd,eax;Store the end of the AllocatedMem (1)
invoke CreateFileMapping,hFile,0,PAGE_READONLY,0,0,0;Map the file in memory
cmp eax,0;
jz ErrorHappened;Error Creating File Mapping?
mov hmapFile,eax;Store the file mapped handle
invoke MapViewOfFile,eax,FILE_MAP_READ,0,0,0;Map View of File
cmp eax,0;
jz ErrorMapping;Error Mapping View of File? :)
mov AllocatedMem,eax;Store allocated file offset
add AllocatedMemEnd,eax;Store the end of the AllocatedMem (2)
invoke ClearStatus;Clear the status bar
invoke SetDlgItemText,hWnd,IDC_FILENAME,addr FileName;Display File Name
invoke SetStatus,offset stFileLoaded;In the status bar too
invoke CloseHandle,hFile;Close the file (still mapped in memory)
mov eax,1;
ret;Return (1) = succeeded
ErrorMapping:
invoke CloseHandle,hmapFile;Close the mapped file
ErrorHappened:
xor eax,eax
ret;Something's wrong return (0)
OpenTheFile endp

Part 3 - Discussion of Tools.asm
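
The most interesting routine in Tools.asm is RVAToOffset. It walks the stored sections, looks for the one whose range contains the RVA, and then subtracts the difference between the section's VirtualAddress and its RawAddress. The following Python sketch illustrates the same calculation; it is not the MASM routine itself. The Section tuple, the function name rva_to_offset and the sample values are assumptions made for this example. Like the listing, it returns -1 when no section matches. As a deliberate deviation, the sketch treats the section end as exclusive, while the listing also accepts the end address itself (jbe).

```python
from typing import NamedTuple, Sequence


class Section(NamedTuple):
    name: str
    virtual_address: int   # RVA where the section starts in memory
    raw_address: int       # file offset where the section's data starts
    raw_size: int          # number of bytes stored in the file


def rva_to_offset(rva: int, sections: Sequence[Section]) -> int:
    """Translate an RVA to a file offset; -1 if no section contains it."""
    for sec in sections:
        start = sec.virtual_address
        end = start + sec.raw_size          # first RVA past the stored data
        if start <= rva < end:
            # offset = RVA - (VirtualAddress - RawAddress)
            return rva - (start - sec.raw_address)
    return -1


if __name__ == "__main__":
    # Hypothetical layout of a small executable.
    sections = [
        Section(".text", 0x1000, 0x400, 0x600),
        Section(".data", 0x2000, 0xA00, 0x200),
    ]
    print(hex(rva_to_offset(0x1010, sections)))
    print(rva_to_offset(0x5000, sections))
```

With the sample layout, RVA 1010h lies 10h bytes into .text, so it maps to file offset 410h.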

Part 4 - Const.inc
.const
;-- Main Dialog equates -----
IDI_ICON equ 300
IDD_MAIN equ 101
IDD_GOTOOFFSET equ 103
IDC_DISASM equ 1001
IDC_STATUSBAR equ 1002
IDC_FILENAME equ 1005
IDC_PROGRESS equ 1006
IDC_SUBSYSTEM equ 1008
IDC_IMAGEBASE equ 1010
IDC_EPRVA equ 1012
IDC_EPOFFSET equ 1014
IDC_SECTIONS equ 1016
IDC_AEXPORT equ 1019
IDC_SEXPORT equ 1020
IDC_AIMPORT equ 1022
IDC_SIMPORT equ 1023
IDC_ARESOURCE equ 1025
IDC_SRESOURCE equ 1026
;-- Main Menu Items ---------

IDM_FILE equ 3001
IDM_OPEN equ 3002
IDM_EXIT equ 3003
IDM_VIEW equ 3010
IDM_VEXPORT equ 3011
IDM_VIMPORT equ 3012
IDM_VRSRC equ 3013
IDM_VAPI equ 3014
IDM_VSTRINGS equ 3015
IDM_GOTO equ 3020
IDM_GOTOENTRY equ 3021
IDM_GOTOOFFSET equ 3022
IDM_HELP equ 3090
IDM_GETHELP equ 3091
IDM_ABOUT equ 3092

;-- GotoOffset Dialog equates ---

IDC_GOTOOFFSET equ 1002
IDC_GOTO equ 1003
IDC_GOTOCANCEL equ 1004
;-- Constants ---------------------------------------------

MAX_SECTIONS equ 10;Max number of sections allowed
MAX_BUFFER equ 256;Size of Buffers

Part 4 - Discussion of Const.inc

Part 5 - Idata.inc
.data
;== MESSAGES ============================================================================
;-- GotoOffset Msgs ---------------------------------------------------------------------
OffsetNotFound db "No matching offset found!",0
Err db "Err..",0
;-- Titles & Msgs -----------------------------------------------------------------------
CapAbout db "AoD Basic Disassembler",0
About db "AoD Basic Disassembler Stage-1",13,10,"September, 2002",0
AreYouSure db "Are you sure you want to exit?",0
Exit db "Are you nuts!? :p",0
msgLoadDefault db "Load the default file ""test.exe""?",0
msgCaption db "Load File",0
FilterString db "(*.exe)",0,"*.exe",0,"(*.dll)",0,"*.dll",0,0
TestingFile db "test.exe",0
Welcome1 db "---------------------------------",0
Welcome2 db " AoD Basic Disassembler Stage - 1",0
Welcome3 db "---------------------------------",0
;-- Status Msgs -------------------------------------------------------------------------
stFileLoaded db "File loaded.",0
stValidPE db "Valid PE Detected.",0
stValidMZ db "Valid MZ Detected.",0
stNotValidPE db "Invalid PE Detected.",0
stNotValidMZ db "Invalid MZ Detected.",0
stExiting db "Exiting...",0
stTempSections db "Found %X Section(s).",0
stSectionsFound db "Virtual Address %08X - Virtual Size %08X"
 db " - Raw Offset %08X - Raw Size %08X. - Chars. %08X ( ", 0
stRightBracket db " )",0
stEntryPoint db "EntryPoint (RVA) %08X - EntryPoint (Offset) %08X.",0
stNoCodeSection db "Code section couldn't be found! at least in this version :P",0
stHex db "%08X",0
;-- Misc --------------------------------------------------------------------------------
EmptyLine db " ",0
;== TEMPLATES ===========================================================================
;-- To find EntryPoint ------------------------------------------------------------------
szEntryPoint db " 00000000:",0
findEP db " %08X:",0
;-- To Show RVA's & Sizes ---------------------------------------------------------------
sRVA db "RVA: %08X",0
sSize db "Size: %08X",0
;-- Misc --------------------------------------------------------------------------------
String db "%s",0
;== ARRAYS ==============================================================================
;--- SubSystem types --------------------------------------------------------------------
S0 BYTE "Unknown",0
S1 BYTE "Native",0
S2 BYTE "Windows-GUI",0
S3 BYTE "Windows-Console",0
S5 BYTE "OS/2 Console",0
S7 BYTE "Posix Console",0
S8 BYTE "Native Win9x Driver",0
S9 BYTE "Windows CE",0
SubSystem PBYTE S0,S1,S2,S3,S0,S5,S0,S7,S8,S9

;== COLORS ==============================================================================
;-- List Boxes Colors -------------------------------------------------------------------
dwListBoxBack COLORREF White; Back Color for both List Boxes
dwDisasmFore COLORREF 000490093h; Fore Color for Disassembler List Box
dwStatusFore COLORREF Blue; Fore Color for Status List Box
;== FONTS ===============================================================================
;-- Disassembler Font -------------------------------------------------------------------
FontC db "Courier New",0
FontT db "Tahoma",0

Part 5 - Discussion of Idata.inc
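
One detail of Idata.inc deserves a note: the SubSystem array is a table of ten string pointers (indices 0 to 9) that CheckPE indexes directly with the Subsystem value from the optional header, using S0 ("Unknown") for the unused slots 4 and 6. The listing does not range check the index. The Python sketch below shows the same lookup with a fallback for any value that has no entry. It is an illustration only; the name subsystem_name is hypothetical, and mapping values above 9 to "Unknown" is an assumption of this sketch, not something the listing does.

```python
# Display strings taken from the S0..S9 entries of Idata.inc.
SUBSYSTEM_NAMES = {
    1: "Native",
    2: "Windows-GUI",
    3: "Windows-Console",
    5: "OS/2 Console",
    7: "Posix Console",
    8: "Native Win9x Driver",
    9: "Windows CE",
}


def subsystem_name(value: int) -> str:
    """Return the display text for a Subsystem value, 'Unknown' otherwise."""
    return SUBSYSTEM_NAMES.get(value, "Unknown")


if __name__ == "__main__":
    for value in (2, 3, 4, 16):
        print(value, subsystem_name(value))
```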

Part 6 - Main.inc
;--System includes --------
include windows.inc
include kernel32.inc
include user32.inc
include comdlg32.inc
include Comctl32.inc
include gdi32.inc
;-- System libraries ------

includelib kernel32.lib
includelib user32.lib
includelib Comctl32.lib
includelib comdlg32.lib
includelib gdi32.lib
;-- Includes -----------

include Protos.inc
include Types.inc
include Const.inc
include Idata.inc
include Udata.inc
include Struct.inc
;-- Modules ---------------

include Tools.asm
include PE.asm

Part 6 - Discussion of Main.inc

Part 7 - Protos.inc
;-- Main Module Prototypes --------------------------------------------------------------
DlgProc PROTO :HWND,:UINT,:WPARAM,:LPARAM
;-- PE.asm Prototypes -------------------------------------------------------------------
CheckPE PROTO; Check for a valid PE
GetSections PROTO; Get the sections names & info
DisplaySections PROTO; Display info about sections
GetCodeSection PROTO; Detect the code section
;-- Tools.asm Prototypes ----------------------------------------------------------------
ResetVars PROTO; Reset Variables & Close Files
OpenTheFile PROTO; Open the file to disasm
SetStatus PROTO :DWORD; Display a Status Msg
ByeBye PROTO; Display exit message & exit
ClearStatus PROTO; Clear the Status List Box
RVAToOffset PROTO :DWORD; Converts RVA to Offset
AddLine PROTO :DWORD; Display a line in disassembler
DisplayWelcome PROTO; Display the silly welcome text
ResetDisassembler PROTO; Clear the text in disassembler
GotoOffsetDlgProc PROTO :HWND,:UINT,:WPARAM,:LPARAM; Goto any Virtual Offset

Part 7 - Discussion of Protos.inc

Part 8- Struct.inc
.data
;== DEFINITIONS =========================================================================
;-- Section info ------------------------------------------------------------------------
SECTION struct 2;Section's:
sName DWORD 0;Name
sName1 DWORD 0;Name (cont.)
sName2 BYTE 0;Name end
VirtualAddress WORD ?;RVA
VirtualSize WORD ?;Size
RawAddress WORD ?;File Offset
RawSize WORD ?;File Size
Characteristics DWORD ?;Characteristics. (i.e. executable code, in/uninitialized data)
SECTION ends
;== INITIALIZATIONS =====================================================================
icex INITCOMMONCONTROLSEX <sizeof INITCOMMONCONTROLSEX,0>;Common Controls
lf LOGFONT <>;Font
ofn OPENFILENAME <>;FileNameDialog Parameters
FileSections SECTION MAX_SECTIONS dup ({});Sections info

Part 8 - Discussion of Struct.inc
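
The SECTION structure is the project's own compact copy of an IMAGE_SECTION_HEADER: the 8-byte name plus a terminating zero, followed by VirtualAddress, VirtualSize, RawAddress and RawSize. Note that these four fields are declared as WORD, so GetSections keeps only the low 16 bits of the DWORD values found in the file. The following Python sketch shows the same copy loop over a raw section table while keeping the full 32-bit values; it is an illustration under stated assumptions, not the project's code. The names SectionInfo and get_sections and the hand-made header are hypothetical; MAX_SECTIONS is the limit from Const.inc.

```python
import struct
from typing import List, NamedTuple

SECTION_HEADER = struct.Struct("<8sIIIIIIHHI")  # IMAGE_SECTION_HEADER, 40 bytes
MAX_SECTIONS = 10                                # same limit as Const.inc


class SectionInfo(NamedTuple):
    name: str
    virtual_size: int
    virtual_address: int
    raw_size: int
    raw_address: int
    characteristics: int


def get_sections(table: bytes, count: int) -> List[SectionInfo]:
    """Copy up to MAX_SECTIONS headers from a raw section table."""
    count = min(count, MAX_SECTIONS, len(table) // SECTION_HEADER.size)
    result = []
    for i in range(count):
        (name, vsize, vaddr, rsize, raddr,
         _relocs, _lines, _nrelocs, _nlines, chars) = SECTION_HEADER.unpack_from(
            table, i * SECTION_HEADER.size)
        # The 8-byte name is zero padded and not necessarily terminated.
        text = name.split(b"\0", 1)[0].decode("ascii", errors="replace")
        result.append(SectionInfo(text, vsize, vaddr, rsize, raddr, chars))
    return result


if __name__ == "__main__":
    # One hand-made header for a hypothetical ".text" section.
    raw = SECTION_HEADER.pack(b".text", 0x5F0, 0x1000, 0x600, 0x400,
                              0, 0, 0, 0, 0x60000020)
    for sec in get_sections(raw, 1):
        print(sec)
```

Clamping the count before the loop plays the same role as the MAX_SECTIONS check in GetSections, and it also guards against a section count larger than the table actually holds.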

Part 9 - Types.inc
;-- Type Definitions ----
PBYTE TYPEDEF PTR BYTE; Pointer to Byte

Part 9 - Discussion of Types.inc

Part 10 - Udata.inc
.data?
;== BUFFERS =============================================================================
OffsetToGoto db 10 dup (?);Buffer for Offset to look for
FileName db MAX_BUFFER dup (?);Buffer to hold the file name
StatusText db MAX_BUFFER dup (?);Buffer to store the Status Text
;== HANDLES =============================================================================
hInstance HINSTANCE ?;Main hInstance
hWnd HWND ?;Main hWnd
hIcon HICON ?;Icon Handle
hLfnt HFONT ?;Font Handle for Disasm
hSfnt HFONT ?;Font Handle for Status
hListBoxBack HBRUSH ?;Brush Handle
hDisassembler HWND ?;Disassembler (listbox) Handle
hStatusbar HWND ?;Status (ListBox) Handle
hProgressbar HWND ?;ProgressBar Handle
hFile HANDLE ?;File Handle
hmapFile HANDLE ?;Mapped File Handle
;== GLOBALS =============================================================================
FileSize DWORD ?; File Size
ImageBase DWORD ?; PE Image Base
ImageBaseSize DWORD ?; PE Image Base Size
AllocatedMem DWORD ?; Mapped File Offset
AllocatedMemEnd DWORD ?; AllocatedMem + FileSize
CodeSection DWORD ?; Mapped File Code Section Offset
EntryPointRVA WORD ?; EntryPoint (RVA)
EntryPointOffset WORD ?; EntryPoint (Offset)
VirtualAddr DWORD ?; First Instruction to disassemble (VA)
CurVirtualOffset DWORD ?; Current Virtual Address being disassembled
dwCodeSize DWORD ?; Size of Code Section
wSections WORD ?;# of Sections in file
dwCurSection WORD ?
CodeSectionIndex BYTE ?; Code Section Index

Part 10 - Discussion of Udata.inc

Part 11 - AoDBasicDisasm.rc
#define IDI_ICON 300
IDI_ICON ICON DISCARDABLE Res\chip.ico
#include <Res\BasicDisasmMainDlg.rc>
#include <Res\MainMenuMnu.rc>
#include <Res\GotoOffsetDlg.rc>

Part 12 - BasicDisasmMainDlg.Rc
#define IDD_MAIN 101
#define IDC_DISASM 1001
#define IDC_SEP 1004
#define IDC_FILENAME 1005
#define IDC_STATUSBAR 1002
#define IDC_PROGRESS 1006
#define IDC_PEINFO 1003
#define IDC_LSUBSYSTEM 1007
#define IDC_SUBSYSTEM 1008
#define IDC_LIBASE 1009
#define IDC_IMAGEBASE 1010
#define IDC_LEP 1011
#define IDC_EPRVA 1012
#define IDC_LEPO 1013
#define IDC_EPOFFSET 1014
#define IDC_LSECTIONS 1015
#define IDC_SECTIONS 1016
#define IDC_DIRECTORY 1017
#define IDC_LEXPORT 1018
#define IDC_AEXPORT 1019
#define IDC_SEXPORT 1020
#define IDC_LIMPORT 1021
#define IDC_AIMPORT 1022
#define IDC_SIMPORT 1023
#define IDC_LRESOURCE 1024
#define IDC_ARESOURCE 1025
#define IDC_SRESOURCE 1026
#define IDC_FNAME 1027
IDD_MAIN DIALOGEX 6,5,546,387
CAPTION "AoD Basic Disassembler Stage-1"
FONT 8,"MS Sans Serif"
MENU 3000
STYLE 0x10CA0800
EXSTYLE 0x00040000
BEGIN
LISTBOX IDC_DISASM,4,31,450,310,NOT 0x00820000|0x502100C0,0x00000201
CONTROL "",IDC_SEP,"Static",NOT 0x00830000|0x50000012,4,16,450,1,0x00000000
LTEXT "",IDC_FILENAME,44,1,410,12,NOT 0x00830000|0x50001000,0x00000000
LISTBOX IDC_STATUSBAR,4,328,538,56,NOT 0x00820000|0x50210040,0x00000201
CONTROL "",IDC_PROGRESS,"msctls_progress32",NOT 0x10830000|0x40000000,4,20,450,7,0x00000300
LTEXT "PE Information",IDC_PEINFO,460,3,84,15,NOT 0x00830000|0x50000001,0x00000201
LTEXT "SubSystem:",IDC_LSUBSYSTEM,460,24,84,9,NOT 0x00830000|0x50000000,0x00000000
LTEXT "",IDC_SUBSYSTEM,460,36,84,9,NOT 0x00830000|0x50000002,0x00000000
LTEXT "ImageBase:",IDC_LIBASE,460,49,84,9,NOT 0x00830000|0x50000000,0x00000000
LTEXT "",IDC_IMAGEBASE,460,62,84,9,NOT 0x00830000|0x50000002,0x00000000
LTEXT "EntryPoint RVA:",IDC_LEP,460,75,84,9,NOT 0x00830000|0x50000000,0x00000000
LTEXT "",IDC_EPRVA,460,88,84,9,NOT 0x00830000|0x50000002,0x00000000
LTEXT "EntryPoint File Offset:",IDC_LEPO,460,101,84,9,NOT 0x00830000|0x50000000,0x00000000
LTEXT "",IDC_EPOFFSET,460,114,84,9,NOT 0x00830000|0x50000002,0x00000000
LTEXT "# of Sections:",IDC_LSECTIONS,460,127,84,9,NOT 0x00830000|0x50000000,0x00000000
LTEXT "",IDC_SECTIONS,460,140,84,9,NOT 0x00830000|0x50000002,0x00000000
LTEXT "Directory",IDC_DIRECTORY,460,153,84,13,NOT 0x00830000|0x50000001,0x00000001
LTEXT "Export Table:",IDC_LEXPORT,460,169,84,9,NOT 0x00830000|0x50000000,0x00000000
LTEXT "",IDC_AEXPORT,460,182,84,9,NOT 0x00830000|0x50000002,0x00000000
LTEXT "",IDC_SEXPORT,460,195,84,9,NOT 0x00830000|0x50000002,0x00000000
LTEXT "Import Table:",IDC_LIMPORT,460,208,84,9,NOT 0x00830000|0x50000000,0x00000000
LTEXT "",IDC_AIMPORT,460,221,84,9,NOT 0x00830000|0x50000002,0x00000000
LTEXT "",IDC_SIMPORT,460,234,84,9,NOT 0x00830000|0x50000002,0x00000000
LTEXT "Resource Table:",IDC_LRESOURCE,460,247,84,9,NOT 0x00830000|0x50000000,0x00000000
LTEXT "",IDC_ARESOURCE,460,260,84,9,NOT 0x00830000|0x50000002,0x00000000
LTEXT "",IDC_SRESOURCE,460,273,84,9,NOT 0x00830000|0x50000002,0x00000000
LTEXT "File Name:",IDC_FNAME,4,1,36,12,NOT 0x00830000|0x50001000,0x00000000
END

Part 13 - GotoOffsetDlg.Rc
#define IDD_GOTOOFFSET 103
#define IDC_STC5 1001
#define IDC_GOTOOFFSET 1002
#define IDC_GOTO 1003
#define IDC_GOTOCANCEL 1004
IDD_GOTOOFFSET DIALOGEX 6,6,108,36
CAPTION "Goto Offset:"
FONT 8,"MS Sans Serif"
STYLE 0x10CF0000
EXSTYLE 0x00000080
BEGIN
LTEXT "Offset:",IDC_STC5,10,7,24,9,NOT 0x00830000|0x50000000,0x00000000
EDITTEXT IDC_GOTOOFFSET,38,5,64,11,NOT 0x00820000|0x50010000,0x00000200
PUSHBUTTON "Go",IDC_GOTO,8,22,44,11,NOT 0x00820000|0x50010001,0x00000000
PUSHBUTTON "Cancel",IDC_GOTOCANCEL,54,22,44,11,NOT 0x00820000|0x50010000,0x00000000
END

Part 14 - MainMenuMnu.Rc
#define IDM_FILE 3001
#define IDM_OPEN 3002
#define IDM_EXIT 3003
#define IDM_VIEW 3010
#define IDM_VEXPORT 3011
#define IDM_VIMPORT 3012
#define IDM_VRSRC 3013
#define IDM_VAPI 3014
#define IDM_VSTRINGS 3015
#define IDM_GOTO 3020
#define IDM_GOTOENTRY 3021
#define IDM_GOTOOFFSET 3022
#define IDM_HELP 3090
#define IDM_GETHELP 3091
#define IDM_ABOUT 3092
3000 MENU
BEGIN
POPUP "&File"
BEGIN
MENUITEM "&Open",IDM_OPEN
MENUITEM "E&xit",IDM_EXIT
END
POPUP "&View"
BEGIN
MENUITEM "&Exports",IDM_VEXPORT,GRAYED
MENUITEM "&Imports",IDM_VIMPORT,GRAYED
MENUITEM "&Resources",IDM_VRSRC,GRAYED
MENUITEM "&API Calls",IDM_VAPI,GRAYED
MENUITEM "&String References",IDM_VSTRINGS,GRAYED
END
POPUP "&GoTo"
BEGIN
MENUITEM "Goto &Entry Point",IDM_GOTOENTRY,GRAYED
MENUITEM "Goto Virtual &Offset",IDM_GOTOOFFSET,GRAYED
END
POPUP "&Help"

BEGIN
MENUITEM "&Help",IDM_GETHELP,GRAYED
MENUITEM "&About",IDM_ABOUT
END
END

Lesson 2 - Modules And Procedures

CHAPTER 5 A Simple Disassembler-Engine

Lesson 1 - Theory

Lesson 2 - Practice

Lesson 3 - Result And Sources

CHAPTER 6 Building A DLL As Disassembler-Engine

CHAPTER 7 An Advanced Disassembler-Engine

Lesson 1 - Theory

Lesson 2 - Practice

Lesson 3 - Results and Sources

CHAPTER 8 Improving The Disassembler-Engine String-References, API´s and more...

CHAPTER 9 Disassembler Extreme - Polymorphic Code and more...

CHAPTER 10 Appendix
