4, 2012
ISSN 2067 4074
Abstract: Disassembly of a binary file restores the code of a software application in a format that is
readable and understandable for humans. The resulting assembly code file can then be used in reverse
engineering processes to establish the logical flows of the computer program or its vulnerabilities in a
real-world running environment. The paper highlights the features of binary executable files under the x86
architecture and the portable executable format, presents issues in disassembling a machine code file and
intermediate code, describes disassembly algorithms that can be applied to reconstruct, correctly and
completely, a source file written in assembly language, and surveys techniques and tools used in binary
code disassembly.
www.jmeds.eu
time and costs during the running of the computer program.

The disassembly process is one of the three main classes of techniques for reverse engineering of
software [11]. Reverse engineering of software is the process of discovering the technological principles
of a product or system based on the analysis of its structure, function and operation [17].

The main problem of reverse engineering is the intellectual property on software. As a reverse
engineering technique, disassembly is used if the owners of the machine code agree to it.

On the negative side, the disassembly process can be carried out by malicious software developers to
discover the vulnerabilities and holes of computer programs in order to hack them. Also, the discovered
logical flows and algorithms can be used in other commercial computer programs without an agreement with
the owners of the disassembled computer program.

The list of available disassemblers includes tools for Windows, such as IDA Pro, PE Explorer, W32DASM,
BORG Disassembler, HT Editor and diStorm64, and tools for Linux, such as Bastard Disassembler, ciasdis,
objdump, gdb, lida (Linux interactive disassembler) and ldasm.

During the disassembly process, the most difficult issue is to separate the code from the data,
especially when data are inserted in the code segment or code is inserted in the data segment.

The assembly process removes the text-based identifiers and code comments. This issue, together with the
mix of data and code, makes the assembly code obtained after the disassembly process more difficult to
understand.

The machine code is generated for a particular processor or family of processors. In addition, operating
systems check that the machine code file has a valid executable file format. For example, the best known
executable file formats are COM for CP/M and MS-DOS, Portable Executable (PE) for 32-bit and 64-bit
versions of Windows, Executable and Linkable Format (ELF) for Linux and versions of Unix, and Mach Object
(Mach-O) for Mac OS X.

The COM executable file contains x86 instructions in binary format and has the following features:
- The binary code has no internal organization format;
- Execution starts at the first byte of the file, which is placed after the Program Segment Prefix;
- The COM file has a length of less than 64KB;
- The content of the COM file is the image of the program in memory.

The Program Segment Prefix is a data structure used to store the state of a program and has the
following features:
- It is loaded by the operating system before the machine code stored in the COM file;
- It contains data needed by the operating system;
- It has a length of 256 bytes.

The contents of the segment registers for the x86 family of processors are depicted in figure 1.

Figure 1 The contents of the segment registers for COM files

The first executed instruction is always at address CS:0x0100.

For the machine code stored in a COM file and depicted in figure 2, the disassembled code can be viewed
in figure 3 when the COM file is debugged with the MS-DOS application td.exe.

B80700BB090003D88BC3B8004CCD2100
0024313624

Figure 2 Binary executable code of the COM file
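As an illustration of how the bytes in figure 2 map to instructions, the following is a minimal sketch in Python and not the output of td.exe. It assumes 16-bit real-mode decoding and a load offset of 0x0100 (the CS:0x0100 entry mentioned above), and it handles only the four opcode forms that occur in this file. The names REG16, decode_one and linear_sweep are hypothetical helpers introduced for the example.

```python
# Sketch only: a table-driven decoder for the few opcodes present in figure 2.
REG16 = ["ax", "cx", "dx", "bx", "sp", "bp", "si", "di"]  # 16-bit register numbering


def decode_one(code, pos):
    """Decode one instruction at pos; return (length, text) or None if unsupported."""
    op = code[pos]
    rest = len(code) - pos
    if 0xB8 <= op <= 0xBF and rest >= 3:
        # MOV r16, imm16 (the immediate is stored little-endian)
        imm = code[pos + 1] | (code[pos + 2] << 8)
        return 3, f"mov {REG16[op - 0xB8]}, {imm:04X}h"
    if op in (0x03, 0x8B) and rest >= 2 and code[pos + 1] >> 6 == 3:
        # ADD r16, r/m16 or MOV r16, r/m16, register-to-register form only (mod = 11)
        modrm = code[pos + 1]
        dst, src = REG16[(modrm >> 3) & 7], REG16[modrm & 7]
        return 2, f"{'add' if op == 0x03 else 'mov'} {dst}, {src}"
    if op == 0xCD and rest >= 2:
        # INT imm8
        return 2, f"int {code[pos + 1]:02X}h"
    return None


def linear_sweep(code, load_offset=0x100):
    """Decode instructions one after another; report the undecoded tail as data."""
    listing = []
    pos = 0
    while pos < len(code):
        decoded = decode_one(code, pos)
        if decoded is None:
            # Assumption of this sketch: everything after the first unknown opcode is data.
            listing.append((load_offset + pos, "db " + code[pos:].hex().upper()))
            break
        length, text = decoded
        listing.append((load_offset + pos, text))
        pos += length
    return listing


FIGURE_2 = bytes.fromhex("B80700BB090003D88BC3B8004CCD21000024313624")

for offset, text in linear_sweep(FIGURE_2):
    print(f"{offset:04X}  {text}")
```

Under these assumptions the sketch lists mov ax, 0007h; mov bx, 0009h; add bx, ax; mov ax, bx; mov ax, 4C00h; int 21h, and then reports the remaining six bytes as data. Stopping at the first unsupported opcode is a simplification: a complete decoder would also interpret the trailing bytes as instructions, which is exactly the code and data separation problem described above.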
Journal of Mobile, Embedded and Distributed Systems, vol. IV, no. 4, 2012
ISSN 2067 4074
difficult. These features concern the following [14]:
- Code and static data can be inserted in a section in a mixed manner;
- Use of variable-length and unaligned instruction encodings.

These two features make it difficult to identify instructions that are hidden in, or reached by
bypassing, the encodings of other instructions or data bytes. As a result, the x86 executable format makes
it easier to hide malicious code in binary executables.

Assembly instructions are identified on the basis of code patterns delimited within the binary
executable. The x86 code patterns are detailed in [16]. The structures and assembly entities are explained
below.

Stack. It is a data structure used in the x86 architecture to store data temporarily; the esp register
points to the top of the stack; the operating system monitors the stack so that it does not reach a
condition such as underflow or overflow; the stack is a memory area where data are stored linearly; the
other memory area where data can be allocated is the heap; in the heap, data are non-linear and variable
in number and in size;

Functions and stack frames. Each function runs on its own partition of the stack, called the stack frame;
a subroutine uses the function parameters and automatic local variables allocated in the stack frame; a
stack frame is created at the current esp location; the following assembly code is standard for a function
entry:

    push ebp
    mov ebp, esp
    sub esp, X

X represents the number of bytes allocated for the automatic variables used by the function.

The assembly code for the standard exit sequence is:

    mov esp, ebp
    pop ebp
    ret

For the C code presented in chapter 1, the entry point of the main function has the assembly code:

Table 3 Standard entry point of the main function
Code offset  Machine code        Assembly instructions
                                 ; void main(){
00000        55                  push ebp
00001        8B EC               mov ebp,esp
00003        81 EC E4 00 00 00   sub esp,228

The stack frame of the main function is 228 bytes long.

For the same function, the standard exit sequence is:

Table 4 Standard exit sequence of the main function
Code offset  Machine code        Assembly instructions
00041        8B E5               mov esp,ebp
00043        5D                  pop ebp
00044        C3                  ret 0

Non-standard stack frames arise in the following situations [16]:
- Use of uninitialized registers; external functions store data in registers before calling the
  subroutine;
- Restricting the function scope by using the static keyword; external functions cannot interface with
  the static subroutine;
- Use of other types of local variables, such as static variables.

Calling conventions. They specify the rules for calling a subroutine. The rules concern the following:
- The way in which the arguments are passed to the function;
- The way in which the result or results are passed back by a function;
- The call of a function;
- Management of the stack and the stack frame by a function.

For example, for a function named func having two arguments x and y, the assembly code for its call can
be (the order in which the arguments are pushed depends on the calling convention; in Table 5, y is pushed
before x):

    push x
    push y
    call func

The x and y arguments are 32 bits wide, as required by the x86 architecture for storage on the stack
frame of the func function.

For example, consider the C code for the func function:

    int func(int a, int b){
        int c=0;
        c=a+b;
        return c;
    }

For a call to the func routine compiled with the C compiler of Visual Studio 2010, the assembly
instructions recovered from the machine code are:

Table 5 Parameter transfers and func routine call
Code offset  Machine code  Assembly instructions
00033        8B45EC        mov eax, DWORD PTR _y$[ebp]
00036        50            push eax
00037        8B4DF8        mov ecx, DWORD PTR _x$[ebp]
0003A        51            push ecx
0003B        E800000000    call ?func@@YAHHH@Z

Branches. In high-level programming languages it is recommended to avoid goto instructions, because
those languages provide branching structures as dedicated branching statements. The x86 assembly language
has no complex branching instructions; it uses jump instructions to control program flow.

For example, consider the C code for the func routine compiled with the C compiler of Visual Studio 2010:

    int func(int a, int b){
        int c=0;
        if(a<b)
            c=a+b;
        else
            c=a-b;
        return c;
    }

The disassembled code for the If-Then-Else branch structure is:

Table 6 If-Then-Else branch structure
Code offset  Machine code  Assembly instructions
                           ; if(a<b)
00025        8B4508        mov eax, DWORD PTR _a$[ebp]
00028        3B450C        cmp eax, DWORD PTR _b$[ebp]
0002B        7D0B          jge SHORT $LN2@func
                           ; c = a + b;
0002D        8B4508        mov eax, DWORD PTR _a$[ebp]
00030        03450C        add eax, DWORD PTR _b$[ebp]
00033        8945F8        mov DWORD PTR _c$[ebp], eax
                           ; else
00036        EB09          jmp SHORT $LN1@func
                           ; c = a - b;
                           $LN2@func:
00038        8B4508        mov eax, DWORD PTR _a$[ebp]
0003B        2B450C        sub eax, DWORD PTR _b$[ebp]
0003E        8945F8        mov DWORD PTR _c$[ebp], eax
                           ; return c;
                           $LN1@func:
00041        8B45F8        mov eax, DWORD PTR _c$[ebp]

The TRUE branch is the sequence of instructions between code offsets 0x0002D and 0x00037, and the FALSE
branch is delimited by the code offsets 0x00038 and 0x00040.

Blocks of assembly instructions can be skipped by means of jump instructions and of labels assigned to
the instruction that is executed next after a jump in the logical flow of the computer program.

Loops. They are implemented for repetitive operations. To identify the loop structure in a machine code
file, the following elements must be established:
- The value of the condition to repeat the operation set;
- The value of the condition to exit the loop structure;
- The point where the operation set starts;
- The point where the loop structure ends;
- The operation set.

For example, in the func routine written in C language under Visual Studio 2010, the Do-For loop is
implemented:

    int func(int a, int b){
        int c=0, i;
        for(i=1; i<=10; i++)
            c=a+b;
        return c;
    }

After disassembling, the assembly instructions corresponding to the Do-For loop structure are:

Table 7 Do-For loop structure
Code offset  Machine code      Assembly instructions
                               ; for(i=1; i<=10; i++)
00025        C745EC01000000    mov DWORD PTR _i$[ebp], 1
0002C        EB09              jmp SHORT $LN3@func
                               $LN2@func:
0002E        8B45EC            mov eax, DWORD PTR _i$[ebp]
00031        83C001            add eax, 1
00034        8945EC            mov DWORD PTR _i$[ebp], eax
                               $LN3@func:
00037        837DEC0A          cmp DWORD PTR _i$[ebp], 10
0003B        7F0B              jg SHORT $LN1@func
                               ; c=a+b;
0003D        8B4508            mov eax, DWORD PTR _a$[ebp]
00040        03450C            add eax, DWORD PTR _b$[ebp]
00043        8945F8            mov DWORD PTR _c$[ebp], eax
00046        EBE6              jmp SHORT $LN2@func
                               ; return c;
                               $LN1@func:
00048        8B45F8            mov eax, DWORD PTR _c$[ebp]

Besides the code patterns, data patterns can also be delimited in a binary executable. Some techniques
to identify data in a machine code file are explained below [16].

Variables. They are memory areas of a computer program where the data to be processed are stored. Two
types of variables are distinguished:
- Local variables are defined in subroutines and are stored in stack frames; they are accessed as an
  offset from esp or ebp; static variables are not allocated on the stack frame;
- Global variables are accessed via a hardcoded memory address; they are not allocated on the stack and
  do not have a limited scope.

After disassembling a machine code file, it can be observed that the local variables are allocated in the
stack frame of a function within the .text section, and the global variables are defined and allocated in
the .data section. The roles of the .text and .data sections are explained in table 2.

The disassembled machine code for local and global variables is:

Table 8 Disassembled code for local and global variables
Code offset  Machine code      Assembly instructions
                               ; global variable definition and allocation
                               ; int x = 7;
                               ; int y = 9;
                               PUBLIC ?x@@3HA
                               PUBLIC ?y@@3HA
                               _DATA SEGMENT
                               ?x@@3HA DD 07H
                               ?y@@3HA DD 09H
                               _DATA ENDS
                               ; local variable allocation
                               ; int c = 0;
0001E        C745F800000000    mov DWORD PTR _c$[ebp], 0
                               ; int i = 10;
00025        C745EC0A000000    mov DWORD PTR _i$[ebp], 10

Constants. They are memory areas that do not change their content while the machine code is running.

Volatile memory. Volatile variables can be accessed from external or concurrent processes. The hint for
identifying a volatile variable is frequent access to the memory and frequent updates of its value.

Simple accessor methods. They are used to restrict the access to a variable. They receive no parameter
and return the value of a variable.

Simple setter (manipulator) methods. Similar to a simple accessor method, a simple setter method alters
the value of a given variable.

Most computer programs use complex data objects. The data structures that must be identified by a
disassembler are arrays, structures and advanced structures [16].

Arrays are designed to allocate and access multiple data objects of the same type.

Structures are implemented to allocate and access data objects of different data types.

Advanced structures are implemented as support for complex operations in the logical flow of the computer
program.

Other issues regarding data patterns concern object-oriented programming (identification of classes and
objects) and floating point numbers (use of the floating point stack).
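To make the local-variable pattern in Table 8 concrete, the following minimal Python sketch reads the stack-frame offset and the initial value directly from the two machine code entries listed there. It assumes 32-bit code and handles only the single encoding C7 45 disp8 imm32 (mov DWORD PTR [ebp+disp8], imm32). The function name decode_local_store is a hypothetical helper introduced for the example.

```python
import struct


def decode_local_store(machine_code):
    """Decode C7 45 disp8 imm32, i.e. mov DWORD PTR [ebp+disp8], imm32.

    Returns (disp8, imm32). In a standard stack frame, a negative
    displacement addresses memory below ebp, where automatic locals live.
    """
    if len(machine_code) != 7 or machine_code[:2] != b"\xC7\x45":
        raise ValueError("unsupported encoding for this sketch")
    # '<bI': signed 8-bit displacement, then little-endian unsigned 32-bit immediate
    return struct.unpack_from("<bI", machine_code, 2)


# Machine code of the two local variable initializations in Table 8.
for name, hex_bytes in (("c", "C745F800000000"), ("i", "C745EC0A000000")):
    disp, imm = decode_local_store(bytes.fromhex(hex_bytes))
    print(f"{name}: [ebp{disp:+d}] = {imm}")
```

The displacement byte is signed, so F8 and EC denote offsets below ebp. A global variable, by contrast, would appear with an absolute address in the operand rather than an ebp-relative displacement.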
Linear traversal disassembly has the following features:
- Starts at the first byte of the .text section; the .text section contains the binary code of the
  executable, as can be seen in chapter 1;
- Instructions are decoded one after another;
Recursive traversal disassembly consists of the following steps:
- Starts at the first byte of the .text section;
- Whenever a branch instruction is identified, the following actions are done:
  o Determination of the addresses where the branch instruction blocks begin;
  o The branch instruction blocks are disassembled;
Other algorithms: identification of jump tables, speculative disassembly, hybrid disassembly.

The linear traversal disassembly algorithm is presented in [6], and the linear disassembly procedure has
the following content:

    while (startAddr <= addr < endAddr) {
        I = decode instruction at address addr;
        addr += length(I);
    }
    *) according to [6]

The linear disassembly procedure takes as input the address of the function entry point and is executed
until the end of the function, calculated as:

    endAddr = startAddr + sizeCode

where:
startAddr - the address of the function entry point;
sizeCode - length of the .text section;
endAddr - the address of the function end.

The linear traversal disassembly algorithm does not take into account the control flow of the program or
data embedded in the executable code.

As a result, another disassembly algorithm is implemented to remove the disadvantages of linear
disassembly. The algorithm is presented in [6] and has the following content:

    while (startAddr <= addr < endAddr) {
        if (addr has been visited already)
            return;
        I = decode instruction at address addr;
        mark addr as visited;
        if (I is a branch or function call) {
            for each possible target t of I do
                call disassembly procedure for t;
        }
        else addr += length(I);
    }
    *) according to [6]

The recursive disassembly procedure is called with the address of the function entry point; the address
of the function end is calculated as for the linear disassembly procedure.

The weaknesses of the recursive traversal algorithm concern [6]:
- The assumption that control transfers behave reasonably; for example, that a conditional branch has two
  possible targets and that a function call returns to the instruction following the call instruction;
- The difficulty of identifying the set of possible targets of indirect control transfers; indirect jumps
  are handled by ad hoc techniques and speculative disassembly.

The disassembly algorithms work with the following elements identified in or constructed from binary
code [7]:
- Function entry points: functions are instruction blocks that can be independently identified and
  disassembled; the binary code is made up of functions related to each other; the disassembly tool must
  identify the function entry points to bound the parts of the binary code file; functions are identified
  by the instructions usually used to set up a new stack frame; also, the function call instruction can be
  used to identify the binary modules of the computer program;
- Control flow graph: this graph is made up of nodes and edges; the nodes represent basic blocks and an
  edge represents a possible control flow from
  a basic block to another; a basic block has no jumps or jump targets in the middle; a possible control
  flow is implemented by function calls, conditional or unconditional jumps, or return instructions, all
  of which make up the control transfer instructions; a control flow graph can be built for each function;
  the traditional approach to building an intra-procedural control flow graph starts at the function entry
  point, and instructions are disassembled until a control transfer instruction is encountered.

Because x86 instructions have variable length and are not aligned in memory, the disassembly algorithm
tries to decode the binary code into an assembly instruction at each code address or code offset. As a
result, a list of potential assembly instructions is generated. A valid instruction set is then extracted
from the potential instruction list.

Dynamic disassembly works on snapshots of software applications at run time. Unlike static disassembly,
dynamic disassembly analyses only the parts of the binary file that are to be converted into assembly
code.

A static disassembler used together with a debugger becomes a tool for dynamic disassembly.

In dynamic disassembly, the speed of disassembly is not affected by the size of the executable file. In
static disassembly, the disassembly time is directly proportional to the size of the executable file.

Software development technologies have evolved to meet the portability requirements of modern software
applications. The code generated by the compilers of such technologies has a different format from machine
code. This code is called intermediate code, and examples of intermediate code files are files in PE
format for Windows-based development technologies and class files for Java technologies.

The intermediate code is interpreted by a virtual machine at run time in order to be executed by the
Central Processing Unit (CPU). Also, in reverse engineering processes, the intermediate code is
disassembled using software applications like Intermediate Language Disassembler (ILDASM) for Windows
applications or javap for Java applications.

The paragraphs below offer some examples of intermediate code disassembly as reverse engineering
techniques for software applications that have an intermediate code representation.

As a .NET-based disassembly example, the following C# source code is considered:

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;

    namespace AngajatApplication
    {
        class Angajat
        {
            public String Nume;
            public int id;

            public Angajat(String aNume, int nr)
            {
                Nume = aNume;
                id = nr;
                prelDate(aNume, nr);
            }

            public String NumeAngajat()
            {
                return this.Nume;
            }

            public int IDAngajat()
            {
                return this.id;
            }

            public void prelDate(String sNume, int snr) { }

            public static void Main() {
            }
        }
    }

The first part of the intermediate file generated by the .NET compiler is presented in figure 6.
After the disassembly of the class file, the restored code in human-readable format has the following
form:

    Compiled from Angajat.java
    class Angajat extends java.lang.Object {
        public java.lang.String Nume;
        public int id;
        public Angajat(java.lang.String,int);
        public int IDAngajat();
        public java.lang.String NumeAngajat();
        public void prelDate(java.lang.String, int);
    }

    Method Angajat(java.lang.String,int)
       0 aload_0
       1 invokespecial #3 <Method java.lang.Object()>
       4 aload_0
       5 aload_1
       6 putfield #4 <Field java.lang.String Nume>
       9 aload_0
      10 iload_2
      11 putfield #5 <Field int id>
      14 aload_0
      15 aload_1
      16 iload_2
      17 invokevirtual #6 <Method void prelDate(java.lang.String, int)>
      20 return

    Method int IDAngajat()
       0 aload_0
       1 getfield #5 <Field int id>
       4 ireturn

    Method java.lang.String NumeAngajat()
       0 aload_0
       1 getfield #4 <Field java.lang.String Nume>
       4 areturn

    Method void prelDate(java.lang.String, int)
       0 return

After the disassembly process, the human-readable code is analyzed in order to apply reverse engineering
techniques or to classify the computer program as malign or benign for computer systems.

Hex editors are software applications used to view the binary content of a file, including an executable
one. A strong feature of hex editors is the ability to modify the content or to inject new content in
binary form. As an effect, the behavior of the software application is observed after consecutive changes
or code injections. Some hex editing software has complex functions that help its user find more quickly
the areas of the executable file in which the user has an interest. Such hex editor software can be used
by any kind of user, including users with little knowledge of software programming.

File packing is the process of reducing the size of a software application, carried out by a tool called
a file packer. At run time, software called a file unpacker is launched to decompress or unpack the
executable file in memory. The reverse engineering process needs the unpacked form of the executable file.
A packed executable file is identified by its header, which is modified. Manual techniques or automatic
techniques, like file unpacking software, can be used to unpack the executable file. The main problem of
the automatic techniques is to find the unpacking software to be used for a successful unpacking.

File analyzers are software used to identify the packer employed to produce a packed file. Identification
is based on the signature bytes, and it also concerns the compiler or programming language used to develop
the packed software application.

Tools like registry monitors supervise the access of software programs to registry keys. A software
application reads from and writes to registry keys to restore or change a configuration. Useful
information for reverse engineering is obtained from the accesses of a software application to registry
keys.

File monitoring consists of supervising the access of software applications to files stored on disk. The
accessed files can contain sensitive information like security algorithms used in the application, access
data or procedures for some functions and so forth. The file content is a valuable source of information
for the reverse engineering process.

Acknowledgement

Parts of this paper were presented by the author at the 5th International Conference on Security for
Information Technology and Communications, Bucharest, Romania, 31 May - 1 June 2012.