Journal of Mobile, Embedded and Distributed Systems, vol. IV, no. 4, 2012
ISSN 2067-4074

Binary Code Disassembly for Reverse Engineering


Marius POPA
Department of Economic Informatics and Cybernetics
Bucharest University of Economic Studies
ROMANIA
marius.popa@ase.ro

Abstract: The disassembly of a binary file restores the code of a software application to a format that humans can
read and understand. Further, the resulting assembly code can be used in reverse engineering processes
to establish the logical flows of the computer program or its vulnerabilities in a real-world running environment.
The paper highlights the features of binary executable files under the x86 architecture and the portable format,
presents issues of the disassembly process for machine code and intermediate code files, disassembly algorithms
which can be applied for a correct and complete reconstruction of the source file written in assembly language,
and techniques and tools used in binary code disassembly.

Key-Words: disassembly, reverse engineering, native, intermediate code.

1 Binary code and file formats

Modern computer programs are developed in programming languages that have a human-readable form [2], [3], [4], [5]. The source code written by software developers is compiled into a binary format. In software development, there are two classes of binaries:
- Machine code: it is not directly understandable by the software developer, but it is directly executed by the machine; it is generated by the compiler depending on the hardware characteristics;
- Intermediate code: like machine code, it is not directly understandable by the software developer, but it is not directly executed by the machine either; the executable code is obtained after an interpreting process performed by a specialized component called a virtual machine; the best known and most widely used virtual machines are the Java Virtual Machine and the Common Language Runtime (CLR) [10], [11].
Computer programs delivered in machine code format are more difficult to maintain because the executable format is difficult to understand. To carry out maintenance activities, the software developer needs the source code and the documentation. Another way to obtain an understandable form of the machine code is to convert it into assembly language.
Disassembly is the process which converts the machine code into an equivalent format in assembly language. During this process the binary operation codes are translated into assembly instruction mnemonics that can be easily read by software developers.
The practical benefits of the disassembly process and its results are [16]:
- Improved portability for computer programs delivered in machine code format; unlike machine code, intermediate code is portable because it is interpreted by a virtual machine, which must be installed on the host machine;
- Software developers determine the logical flows of the disassembled software application; the algorithms and other programming entities are extracted from the software application and used in other versions or programs;
- Security issues are identified and can be patched without access to the original source code;
- An old version of a computer program is extended with new functionalities and interfaces.
The effects of implementing the disassembly process are quantified in terms of

www.jmeds.eu

time and costs during the running of the computer program.
The disassembly process is one of the three main classes of techniques for the reverse engineering of software [11]. Reverse engineering of software is the process of discovering the technological principles of a product or system based on the analysis of its structure, function and operation [17].
The main problem of reverse engineering is intellectual property on software. As a reverse engineering technique, disassembly may be used only if the owners of the machine code agree to it.
On the negative side, the disassembly process can be carried out by malicious software developers to discover the vulnerabilities and holes of computer programs in order to hack them. Also, the discovered logical flows and algorithms can be used in other commercial computer programs without an agreement with the owners of the disassembled computer program.
The list of available disassemblers includes tools for Windows, such as IDA Pro, PE Explorer, W32DASM, BORG Disassembler, HT Editor and diStorm64, and tools for Linux, such as Bastard Disassembler, ciasdis, objdump, gdb, lida (linux interactive disassembler) and ldasm.
During the disassembly process, the most difficult issue is to separate the code from the data, especially when data are inserted in the code segment or code is inserted in the data segment.
The assembly process removes the text-based identifiers and code comments. This issue, together with the mix of data and code, makes the assembly code obtained after the disassembly process more difficult to understand.
The machine code is generated for a particular processor or family of processors. In addition, operating systems check that the machine code file has a valid executable file format. For example, the best known executable file formats are COM for CP/M and MS-DOS, Portable Executable (PE) for 32-bit and 64-bit versions of Windows, Executable and Linkable Format (ELF) for Linux and versions of Unix, and Mach Object (Mach-O) for Mac OS X.
The COM executable file contains x86 instructions in binary format and has the following features:
- The binary code has no organization format;
- The file execution starts from the first byte, after the Program Segment Prefix;
- The COM file has a length of less than 64 KB;
- The content of the COM file is the image of the program in memory.
The Program Segment Prefix is a data structure used to store the state of a program and has the following features:
- It is loaded by the operating system before the machine code stored in the COM file;
- It contains data needed by the operating system;
- It has a length of 256 bytes.
The contents of the segment registers for the x86 family of processors are depicted in figure 1.

Figure 1 The contents of the segment registers for COM files

The first executed instruction always has the address CS:0x0100.
For the machine code stored in a COM file and depicted in figure 2, the disassembled code can be viewed in figure 3 when the COM file is debugged with the MS-DOS application td.exe.

B80700BB090003D88BC3B8004CCD2100
0024313624

Figure 2 Binary executable code of the COM file
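The decoding of the bytes in figure 2 can be reproduced with a few lines of code. The following Python sketch is only an illustration under stated assumptions: the opcode table is written by hand and covers just the 16-bit x86 encodings that occur in this COM file, the names decode_one, linear_sweep and FIGURE_2 are chosen for the example, and the offsets are file offsets rather than CS-relative addresses. It decodes the bytes one instruction after another without separating code from data, so the bytes stored after int 21h are also turned into instructions.

```python
# Toy linear decoder for the 16-bit x86 bytes of the COM file in figure 2.
# Assumption: only the encodings that occur in this file are handled.

REG16 = ["ax", "cx", "dx", "bx", "sp", "bp", "si", "di"]
REG8 = ["al", "cl", "dl", "bl", "ah", "ch", "dh", "bh"]
# 16-bit memory operands for mod == 0; rm == 6 (direct address) is not handled.
MEM16 = ["bx+si", "bx+di", "bp+si", "bp+di", "si", "di", None, "bx"]


def decode_one(code, pos):
    """Return (length, text) for the instruction that starts at code[pos]."""
    op = code[pos]
    if 0xB8 <= op <= 0xBF:  # mov r16, imm16 (little-endian immediate)
        imm = code[pos + 1] | (code[pos + 2] << 8)
        return 3, f"mov {REG16[op - 0xB8]},{imm:04X}h"
    if op == 0xCD:  # int imm8
        return 2, f"int {code[pos + 1]:02X}h"
    if op == 0x24:  # and al, imm8
        return 2, f"and al,{code[pos + 1]:02X}h"
    if op in (0x00, 0x03, 0x8B):  # opcodes followed by a ModR/M byte
        modrm = code[pos + 1]
        mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
        if op == 0x00 and mod == 0 and MEM16[rm]:  # add [mem], r8
            return 2, f"add [{MEM16[rm]}],{REG8[reg]}"
        if op != 0x00 and mod == 3:  # add r16, r16 or mov r16, r16
            name = "add" if op == 0x03 else "mov"
            return 2, f"{name} {REG16[reg]},{REG16[rm]}"
    raise ValueError(f"unsupported byte {op:02X}h at offset {pos}")


def linear_sweep(code):
    """Decode instructions one after another; undecodable bytes end the sweep."""
    listing, pos = [], 0
    while pos < len(code):
        try:
            length, text = decode_one(code, pos)
        except (ValueError, IndexError):
            # Report whatever is left as raw data bytes and stop.
            tail = ",".join(f"{b:02X}h" for b in code[pos:])
            listing.append((pos, f"db {tail}"))
            break
        listing.append((pos, text))
        pos += length
    return listing


FIGURE_2 = bytes.fromhex("B80700BB090003D88BC3B8004CCD21000024313624")

if __name__ == "__main__":
    for offset, text in linear_sweep(FIGURE_2):
        print(f"{offset:04X}  {text}")
```

In this sketch the six instructions up to int 21h are followed by add [bx+si],al and and al,31h, which are produced from data bytes. The last two bytes, 36 24, are reported as raw data because the sketch does not handle instruction prefixes such as 0x36.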


Figure 3. Disassembled code of the COM file in td.exe

After the assembly instruction int 21h, the next 6 bytes are used to store data in the COM file. The application td.exe treats the 6 bytes as operation codes of binary instructions and tries to disassemble the bytes used for data storage. The assembly instructions generated from the 6 bytes are:

Table 1 Disassembled code from data area of the COM file
Binary code   Assembly code
0000          add [bx+si],al
2431          and al,31h
362400        and ss:al,00h

Because the byte sequence 0x3624 alone does not form a complete instruction, the td.exe application adds the next byte and the disassembled code is and ss:al,00h.
The executable file for the Windows operating system has the following features:
- Eliminates the disadvantages of COM files;
- Inserts a header used to identify and manage the binary code at runtime;
- Contains information regarding relocation in memory;
- Provides different locations for code, data and stack segments.
The contents of the segment registers for the x86 family of processors are depicted in figure 4.

Figure 4. The content of the segment registers for Windows executable files

The address of the first executed instruction is calculated using the information from the executable header.
The binary content of the x86 Windows executable file for the same logical flow as in the above COM file is depicted in figure 5.

Figure 5. Binary executable code of the x86 Windows executable file

The binary executable code is included in the .text section of the Windows exe file. The executable file has a length of 27648 bytes (27 KB). The length of the .text section is 12800 bytes (12.5 KB), between the file offsets 0x00000400 and 0x000035FF inclusive.
Unlike the COM file, the Windows executable file in the Portable Executable (PE) format is structured and contains metadata regarding the internal organization and code relocation at runtime.
The PE file format structure has the following elements [1]:
1. MS-DOS information: used to keep information for MS-DOS and to handle


cross attempts to launch MS-DOS and Windows executables; it includes the DOS header and the MS-DOS stub program;
2. Windows information: has the role of managing the internal virtual memory space allocated for the EXE file by the Windows operating system; the components are the PE signature (the string PE), the file header and the optional header;
3. Section information: includes section headers and sections; each section has a specific type, listed in table 2.

Table 2 Section names in Windows PE file [13]
Name        Content
.bss        Uninitialized data (free format)
.cormeta    CLR metadata that indicates that the object file contains managed code
.data       Initialized data (free format)
.debug$F    Generated Frame Pointer Omission (FPO) debug information (object only, x86 architecture only, and now obsolete)
.debug$P    Precompiled debug types (object only)
.debug$S    Debug symbols (object only)
.debug$T    Debug types (object only)
.drective   Linker options
.edata      Export tables
.idata      Import tables
.idlsym     Includes registered Structured Exception Handler (SEH) (image only) to support Interface Definition Language (IDL) attributes
.pdata      Exception information
.rdata      Read-only initialized data
.reloc      Image relocations
.rsrc       Resource directory
.sbss       Global Pointer (GP)-relative uninitialized data (free format)
.sdata      GP-relative initialized data (free format)
.srdata     GP-relative read-only data (free format)
.sxdata     Registered exception handler data (free format and x86/object only)
.text       Executable code (free format)
.tls        Thread-local storage (object only)
.tls$       Thread-local storage (object only)
.vsdata     GP-relative initialized data (free format and for ARM, SH4, and Thumb architectures only)
.xdata      Exception information (free format)

The section names explained in table 2 are available for binary executable files and object files under the Windows family of operating systems.
In [12], the Win32 Portable Executable file format is explained in depth.
For 64-bit Windows systems, the PE file format has a few modifications, aimed at widening certain fields from 32 bits to 64 bits. The 64-bit PE file format is called PE32+.
Dynamic-Link Library (DLL) files have the same format as executable files. A single bit indicates the different treatment of the two kinds of file.
The content of the PE file sections stored on disk is the same as the content loaded into memory at run time. PE file loading maps the PE sections into the address space. The mapping translates disk offsets into memory offsets, as explained in [12].
After mapping into memory, each PE file section starts at a memory page boundary. On x86 systems the memory pages are 4 KB aligned, and on 64-bit (IA-64) systems the memory pages are 8 KB aligned.

2 Issues of Disassembly Process

The disassembly process transforms machine code into assembly instructions readable by humans (software developers and other interested users). The main task of a disassembler tool is to identify the byte sequences corresponding to an assembly instruction.
Some features of x86 binary executables make the disassembly process more

difficult. These features are the following [14]:
- Code and static data can be inserted in a section in a mixed manner;
- Use of variable-length and unaligned instruction encodings.
These two features make it difficult to identify instructions that are hidden in, or overlap with, the encodings of other instructions or data bytes. As a result, the x86 executable format lends itself to hiding malicious code in binary executables.
Identification of assembly instructions relies on code patterns delimited within the binary executable. The x86 code patterns are detailed in [16]. The structures and assembly entities are explained below.
Stack. It is a data structure used in the x86 architecture to store data temporarily; the esp register points to the top of the stack; the operating system monitors the stack so that it does not reach an underflow or overflow condition; the stack is a memory area where data are stored linearly; the other memory area where data can be allocated is the heap; in the heap, data are non-linear and variable in number and in size;
Functions and stack frames. Each function runs on its own partition of the stack, called a stack frame; a subroutine uses the function parameters and automatic local variables allocated in the stack frame; a stack frame is created at the current esp location; the following assembly code is standard for a function entry:

    push ebp
    mov ebp, esp
    sub esp, X

X represents the number of bytes allocated for the automatic variables used by the function.
The assembly code for the standard exit sequence is:

    mov esp, ebp
    pop ebp
    ret

For the C code presented in chapter 1, the entry point of the main function has the assembly code:

Table 3 Standard entry point of the main function
Code offset   Machine code        Assembly instructions
              ; void main(){
00000         55                  push ebp
00001         8B EC               mov ebp,esp
00003         81 EC E4 00 00 00   sub esp,228

The stack frame of the main function has a length of 228 bytes.
For the same function, the standard exit sequence is:

Table 4 Standard exit sequence of the main function
Code offset   Machine code   Assembly instructions
00041         8B E5          mov esp,ebp
00043         5D             pop ebp
00044         C3             ret 0

Non-standard stack frames arise in the following situations [16]:
- Use of uninitialized registers; other external functions store data in registers before calling the subroutine;
- Establishing the function scope by using the static keyword; external functions cannot interface with the static subroutine;
- Use of other types of local variables, such as static variables.
Calling conventions. They specify the rules for calling a subroutine. The rules cover the following:
- The way in which the arguments are passed to the function;
- The way in which the result or results are passed back by the function;
- The call of the function;
- Management of the stack and the stack frame by the function.
For example, for a function named func having two arguments x and y, the assembly code for its call can be:

    push x
    push y
    call func

The x and y arguments are 32 bits wide, in accordance with the x86 architecture, so that they can be stored in the stack frame of the func function.
For example, consider the C code for the func function:


    int func(int a, int b){
        int c=0;
        c=a+b;
        return c;
    }

The assembly instructions generated from the machine code for the call of the func routine, compiled with the C compiler of Visual Studio 2010, are:

Table 5 Parameter transfers and func routine call
Code offset   Machine code   Assembly instructions
00033         8B45EC         mov eax, DWORD PTR _y$[ebp]
00036         50             push eax
00037         8B4DF8         mov ecx, DWORD PTR _x$[ebp]
0003A         51             push ecx
0003B         E800000000     call ?func@@YAHHH@Z

Branches. In high-level programming languages, it is recommended to avoid goto instructions. The reason is that those programming languages implement branching structures as dedicated branching instructions.
The x86 assembly language does not implement complex branching instructions. It uses jump instructions to control program flow.
For example, consider the C code for the func routine, compiled with the C compiler of Visual Studio 2010:

    int func(int a, int b){
        int c=0;
        if(a<b)
            c=a+b;
        else
            c=a-b;
        return c;
    }

The disassembled code for the If-Then-Else branch structure is:

Table 6 If-Then-Else branch structure
Code offset   Machine code   Assembly instructions
              ; if(a<b)
00025         8B4508         mov eax, DWORD PTR _a$[ebp]
00028         3B450C         cmp eax, DWORD PTR _b$[ebp]
0002B         7D0B           jge SHORT $LN2@func
              ; c = a + b;
0002D         8B4508         mov eax, DWORD PTR _a$[ebp]
00030         03450C         add eax, DWORD PTR _b$[ebp]
00033         8945F8         mov DWORD PTR _c$[ebp], eax
              ; else
00036         EB09           jmp SHORT $LN1@func
              ; c = a - b;
              $LN2@func:
00038         8B4508         mov eax, DWORD PTR _a$[ebp]
0003B         2B450C         sub eax, DWORD PTR _b$[ebp]
0003E         8945F8         mov DWORD PTR _c$[ebp], eax
              ; return c;
              $LN1@func:
00041         8B45F8         mov eax, DWORD PTR _c$[ebp]

The TRUE branch is the sequence of instructions between code offsets 0x0002D and 0x00037, and the FALSE branch is delimited by the code offsets 0x00038 and 0x00040.
Some blocks of assembly instructions can be skipped by means of jump instructions and of labels assigned to the next instruction to be executed after a jump in the logical flow of the computer program.
Loops. They are implemented for repetitive operations. To identify the loop structure in a machine code file, the following elements must be established:
- The value of the condition for repeating the operation set;
- The value of the condition for exiting the loop structure;
- The point where the operation set starts;
- The point where the loop structure ends;
- The operation set.
For example, in the func routine written in C under Visual Studio 2010, the Do-For loop is implemented:

    int func(int a, int b){
        int c=0, i;
        for(i=1; i<=10; i++)
            c=a+b;
        return c;
    }

After disassembling, the assembly instructions corresponding to the Do-For loop structure are:


Table 7 Do-For loop structure
Code offset   Machine code     Assembly instructions
              ; for(i=1; i<=10; i++)
00025         C745EC01000000   mov DWORD PTR _i$[ebp], 1
0002C         EB09             jmp SHORT $LN3@func
              $LN2@func:
0002E         8B45EC           mov eax, DWORD PTR _i$[ebp]
00031         83C001           add eax, 1
00034         8945EC           mov DWORD PTR _i$[ebp], eax
              $LN3@func:
00037         837DEC0A         cmp DWORD PTR _i$[ebp], 10
0003B         7F0B             jg SHORT $LN1@func
              ; c=a+b;
0003D         8B4508           mov eax, DWORD PTR _a$[ebp]
00040         03450C           add eax, DWORD PTR _b$[ebp]
00043         8945F8           mov DWORD PTR _c$[ebp], eax
00046         EBE6             jmp SHORT $LN2@func
              ; return c;
              $LN1@func:
00048         8B45F8           mov eax, DWORD PTR _c$[ebp]

Besides the code patterns, data patterns can be delimited in a binary executable. Some techniques for identifying data in a machine code file are explained below [16].
Variables. They are memory areas of a computer program where the data to be processed are stored. Two types of variables are distinguished:
- Local variables are defined in subroutines and are stored in stack frames; they are accessed as an offset from esp or ebp; static variables are not allocated in the stack frame;
- Global variables are accessed via a hardcoded memory address; they are not allocated on the stack and do not have a limited scope.
After disassembling a machine code file, it can be observed that the instructions which set up local variables in the stack frame of a function are found in the .text section, while the global variables are defined and allocated in the .data section. The roles of the .text and .data sections are explained in table 2.
The disassembled machine code for local and global variables is:

Table 8 Disassembled code for local and global variables
Code offset   Machine code     Assembly instructions
              ; global variable definition and allocation
              ; int x = 7;
              ; int y = 9;
              PUBLIC ?x@@3HA
              PUBLIC ?y@@3HA
              _DATA SEGMENT
              ?x@@3HA DD 07H
              ?y@@3HA DD 09H
              _DATA ENDS
              ; local variable allocation
              ; int c = 0;
0001E         C745F800000000   mov DWORD PTR _c$[ebp], 0
              ; int i = 10;
00025         C745EC0A000000   mov DWORD PTR _i$[ebp], 10

Constants. They are memory areas that do not change their content while the machine code is running.
Volatile memory. Volatile variables can be accessed from external or concurrent processes. The hint for identifying a volatile variable is frequent access to its memory location and frequent updates of its value.
Simple accessor methods. They are used to restrict access to a variable. They receive no parameter and return the value of a variable.
Simple setter (manipulator) methods. Similar to a simple accessor method, a simple setter method alters the value of a given variable.
Most computer programs use complex data objects. The data structures that must be identified by a disassembler are arrays, structures and advanced structures [16].
Arrays are designed to allocate and access multiple data objects of the same type.
Structures are implemented to allocate and access data objects of different data types.
Advanced structures are implemented as support for complex operations in the logical flow of the computer program.
Other issues regarding data patterns concern object-oriented programming (identification of classes and objects) and floating point numbers (use of the floating point stack).
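In the listings of tables 5 to 8, both the ebp-relative operands and the targets of the short jumps are stored as a single signed byte at the end of the machine code. The following Python sketch shows how such a byte can be interpreted. It is an illustration only: the function names are chosen for the example, and the sketch assumes the 2-byte short jump encodings (opcode plus 8-bit displacement) that appear in tables 6 and 7.

```python
def sign_extend8(value):
    """Interpret an unsigned byte (0..255) as a signed 8-bit integer."""
    return value - 0x100 if value & 0x80 else value


def ebp_operand(disp8):
    """Render the memory operand addressed by ebp plus an 8-bit displacement."""
    return f"[ebp{sign_extend8(disp8):+d}]"


def short_jump_target(offset, rel8):
    """Target of a 2-byte short jump located at code offset `offset`.

    The displacement is relative to the offset of the next instruction,
    which is offset + 2 for this encoding.
    """
    return offset + 2 + sign_extend8(rel8)


def is_backward_jump(offset, rel8):
    """A jump to a lower offset is a typical hint for the end of a loop body."""
    return short_jump_target(offset, rel8) <= offset


if __name__ == "__main__":
    # Displacement bytes taken from tables 6 to 8.
    for name, disp in (("_a$", 0x08), ("_b$", 0x0C), ("_c$", 0xF8), ("_i$", 0xEC)):
        print(name, ebp_operand(disp))
    # Short jumps of table 7 as (code offset, displacement byte).
    for offset, rel in ((0x2C, 0x09), (0x3B, 0x0B), (0x46, 0xE6)):
        kind = "backward" if is_backward_jump(offset, rel) else "forward"
        print(f"{offset:05X} -> {short_jump_target(offset, rel):05X} ({kind})")
```

With the values from the tables, the parameters _a$ and _b$ resolve to positive offsets from ebp and the local variables _c$ and _i$ to negative ones. The only backward jump in table 7 is the one at offset 0x00046, which returns to $LN2@func and closes the loop.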


Code optimization is a stage of the compilation process. The stages of optimization are:
- Intermediate representation optimization: data flow and code flow optimizations;
- Code generation optimization: use of the fast machine instructions.
During the disassembly process, the control flow graph is built on sequences of instructions encoded in machine code. In [9], the control flow reconstruction is split into two parts:
- Call graph: the relationships between routines are highlighted; the routines are the nodes, and the calls and returns are the edges;
- Control flow graph: the jumps within a routine are highlighted, and the graph can be built for each routine; the nodes are called basic blocks, and the edges are jumps and fall-through edges; a basic block contains instructions that are executed as a single straight-line sequence.
The reconstruction of the control flow graph faces the following problems [9]:
- Determination of the branch targets;
- Difficulties in establishing the boundaries of the basic blocks;
- The end of a routine is difficult to establish;
- Complicated analysis because of guarded code;
- Several operations assigned to a single instruction;
- Handling multiple entry points and external routines;
- Interlocked or overlapping procedures (optimizing compilers, hand-written assembly);
- Code blocks can contain data blocks.
The control flow graph is approximated after a static analysis of the initial control flow graph.
Compiler and link-time optimizations introduce variable instruction sequences in the machine code. This makes it difficult to detect the function entry points by pattern matching.

3 Techniques and Tools Used in Reverse Engineering

There are different techniques and tools in reverse engineering that apply to software based on Windows platforms. In [8], some of these methods are presented as follows:
- Debugging;
- Disassembly;
- Hex-editing;
- Unpacking;
- File analysis;
- Registry monitoring;
- File monitoring.
Software developers use debuggers to fix bugs in the software under development. Debuggers are used to verify the control flows and the evolution of memory areas during program execution for specific test input data. These features facilitate understanding the algorithms and finding the content of sensitive memory areas.
The disassembly process is presented in the previous chapter together with its issues. There are two major classes of disassembly techniques [15]:
- Static disassembly: the binary file is not executed; the instruction stream is parsed as it is found in the machine code file to establish or approximate the behavior of the computer program;
- Dynamic disassembly: the binary file is executed, and its execution is monitored to identify the actions and behavior of the instructions; the execution is performed for some input sets, and as an effect some instruction streams of the binary file may not be reached.
The issues of static disassembly are the following [15]:
- Variable-length instructions: as can be seen in the previous chapter, the sequences of operation codes of the instructions have variable lengths; the length of each binary instruction is determined from the code offsets;
- Indirect control transfers: they are implemented by dynamic linking, jump tables and so forth;
- Data interleaved with code streams: data blocks can be inserted in binary code sections, making the disassembly more difficult because the disassembly tool must identify the data blocks as not being part of the binary code.
The algorithms applied in static disassembly are [15]:


- Linear traversal disassembly has the following features:
  - Starts at the first byte of the .text section; the .text section contains the binary code of the executable, as can be seen in chapter 1;
  - Instructions are decoded one after another;
- Recursive traversal disassembly consists of the following steps:
  - Starts at the first byte of the .text section;
  - Whenever a branch instruction is identified, the following actions are done:
    o Determination of the addresses where the branch instruction blocks begin;
    o The branch instruction blocks are disassembled;
- Other algorithms: identification of jump tables, speculative disassembly, hybrid disassembly.
The linear traversal disassembly algorithm is presented in [6], and the linear disassembly procedure has the following content:

    while (startAddr <= addr && addr < endAddr) {
        I = decode instruction at address addr;
        addr += length(I);
    }
    *) according to [6]

The linear disassembly procedure takes as input the address of the function entry point and is executed until the end of the function, calculated as:

    endAddr = startAddr + sizeCode

where:
startAddr: the address of the function entry point;
sizeCode: length of the .text section;
endAddr: the address of the function end.
The linear traversal disassembly algorithm does not take into account the control flow of the program or the data embedded in the executable code.
As a result, another disassembly algorithm is implemented to remove the disadvantages of linear disassembly. The algorithm is presented in [6] and it has the following content:

    while (startAddr <= addr && addr < endAddr) {
        if (addr has been visited already)
            return;
        I = decode instruction at address addr;
        mark addr as visited;
        if (I is a branch or function call) {
            for each possible target t of I do
                call disassembly procedure for t;
        }
        else addr += length(I);
    }
    *) according to [6]

The recursive disassembly procedure is called for the address of the function entry point, and the address of the function end is calculated as for the linear disassembly procedure.
The weaknesses of the recursive traversal algorithm are [6]:
- The assumption that control transfers behave reasonably; for example, a conditional branch has two possible targets, and a function call returns to the instruction following the call instruction;
- The difficulty of identifying the set of possible targets of indirect control transfers; indirect jumps are handled by ad hoc techniques and speculative disassembly.
The disassembly algorithms work with the following elements identified or constructed on the binary code [7]:
- Function entry points: functions are instruction blocks that can be independently identified and disassembled; the binary code is made up of functions related to each other; the disassembly tool must identify the function entry points to bound the parts of the binary code file; a function is identified by the instructions usually used to set up a new stack frame; also, the function call instruction can be used to identify the binary modules of the computer program;
- Control flow graph: this graph is made up of nodes and edges; the nodes represent basic blocks and an edge represents a possible control flow from


a basic block to another; a basic block has no jumps or jump targets in the middle; a possible control flow is implemented by function calls, conditional or unconditional jumps, or return instructions, all of which are grouped as control transfer instructions; a control flow graph can be built for each function; the traditional approach for the intra-procedural control flow graph starts with the function entry point, and instructions are disassembled until a control transfer instruction is encountered.
Because the x86 instructions have variable length and are not aligned in memory, for each code address or code offset the disassembly algorithm tries to decode the binary code into an assembly instruction. As a result, a list of potential assembly instructions is generated. A valid instruction set is extracted from the potential instruction list.
Dynamic disassembly works on snapshots of software applications at run time. Unlike static disassembly, dynamic disassembly analyses only the parts of the binary file which are to be converted into assembly code.
A static disassembler used together with a debugger becomes a tool of dynamic disassembly.
In dynamic disassembly the speed of disassembly is not affected by the size of the executable file. In static disassembly, the speed of disassembly is directly proportional to the size of the executable file.
Software development technologies have evolved to meet the portability requirements of modern software applications. The code generated by the compilers of such technologies has a different format from machine code. The code is called intermediate code, and examples of intermediate code files are the PE format for Windows-based development technologies and class files for Java technologies.
The intermediate code is interpreted by a virtual machine at run time in order to be executed by the Central Processing Unit (CPU).
Also, in reverse engineering processes, the intermediate code is disassembled using software applications like the Intermediate Language Disassembler (ILDASM) for Windows applications or javap for Java applications.
In the paragraphs below, some examples of intermediate code disassembly are offered as reverse engineering techniques for software applications that have an intermediate code representation.
As a .NET-based disassembly example, the following C# source code is considered:

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;

    namespace AngajatApplication
    {
        class Angajat
        {
            public String Nume;
            public int id;

            public Angajat(String aNume, int nr)
            {
                Nume = aNume;
                id = nr;
                prelDate(aNume, nr);
            }

            public String NumeAngajat()
            {
                return this.Nume;
            }

            public int IDAngajat()
            {
                return this.id;
            }

            public void prelDate(String sNume, int snr) { }

            public static void Main() {
            }
        }
    }

The first part of the intermediate file generated by the .NET compiler is presented in figure 6.
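A file in the portable executable format, including one that carries .NET intermediate code, can be recognized by two signatures: the bytes 0x4D 0x5A (the string MZ) at the start of the file, and the string PE followed by two zero bytes at the offset stored in the 4-byte little-endian field at position 0x3C of the MS-DOS header. The following Python sketch checks both signatures. It is an illustration only: the function names are chosen for the example, and make_stub builds a synthetic buffer that carries nothing but the two signatures, not a loadable executable.

```python
import struct


def has_pe_signature(data):
    """Return True if `data` starts with MZ and points to a PE signature."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        return False
    # The field at offset 0x3C holds the file offset of the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    return data[pe_offset:pe_offset + 4] == b"PE\0\0"


def make_stub(pe_offset=0x80):
    """Build a synthetic buffer that only carries the two signatures."""
    buf = bytearray(pe_offset + 4)
    buf[0:2] = b"MZ"
    struct.pack_into("<I", buf, 0x3C, pe_offset)
    buf[pe_offset:pe_offset + 4] = b"PE\0\0"
    return bytes(buf)


if __name__ == "__main__":
    print(has_pe_signature(make_stub()))
```

For a real file, the same check could be applied to the bytes read from disk, for example to the first kilobyte of the executable, provided that the stored offset falls within the bytes that were read.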


Figure 6 Intermediate code of the .NET application

In figure 6, the bytes 0x4D5A, corresponding to the string MZ in ASCII encoding, and 0x5045, corresponding to the string PE in ASCII encoding, can be observed as the signature of an executable file in portable format.
For .NET intermediate code disassembly, the ILDASM application is used. Figure 7 highlights the .NET application loaded by ILDASM.

Figure 7 .NET application loaded by ILDASM disassembler

For the .NET application loaded in ILDASM, the following dump options are set out:

Figure 8 Dump options set out for .NET application

After dumping, a human-readable code is generated from the intermediate file, and the metadata assigned to the PE format are presented in the restored file.
Because the restored file is very large, the presentation below contains the restored code of the class Angajat.

    // =============== CLASS MEMBERS DECLARATION ===================

    .class /*02000002*/ private auto ansi beforefieldinit AngajatApplication.Angajat
           extends [mscorlib/*23000001*/]System.Object/*01000001*/
    {
      .field /*04000001*/ public string Nume
      .field /*04000002*/ public int32 id
      .method /*06000001*/ public hidebysig specialname rtspecialname
              instance void .ctor(string aNume,
                                  int32 nr) cil managed
      // SIG: 20 02 01 0E 08
      {
        // Method begins at RVA 0x2050
        // Code size 33 (0x21)
        .maxstack 8
        .language '{3F5162F8-07C6-11D3-9053-00C04FA302A1}', '{994B45C4-E6E9-11D2-903F-00C04FA302A1}', '{5A869D0B-6611-11D3-BD2A-0000F80849BD}'
        // Source File 'D:\Aplicatii CSharp\SECITC2012-4Sol\SECITC2012-4Proj\Program.cs'
        .line 13,13 : 9,45

    'D:\\Aplicatii CSharp\\SECITC2012-4Sol\\SECITC2012-4Proj\\Program.cs'
    //000013: public Angajat(String aNume, int nr)
    IL_0000: /* 02 | */ ldarg.0
    IL_0001: /* 28 | (0A)000011 */ call instance void [mscorlib/*23000001*/]System.Object/*01000001*/::.ctor() /* 0A000011 */
    IL_0006: /* 00 | */ nop
    .line 14,14 : 9,10 ''
    //000014: {
    IL_0007: /* 00 | */ nop
    .line 15,15 : 13,26 ''
    //000015: Nume = aNume;
    IL_0008: /* 02 | */ ldarg.0
    IL_0009: /* 03 | */ ldarg.1
    IL_000a: /* 7D | (04)000001 */ stfld string AngajatApplication.Angajat/*02000002*/::Nume /* 04000001 */
    .line 16,16 : 13,21 ''
    //000016: id = nr;
    IL_000f: /* 02 | */ ldarg.0
    IL_0010: /* 04 | */ ldarg.2
    IL_0011: /* 7D | (04)000002 */ stfld int32 AngajatApplication.Angajat/*02000002*/::id /* 04000002 */
    .line 17,17 : 13,33 ''
    //000017: prelDate(aNume, nr);
    IL_0016: /* 02 | */ ldarg.0
    IL_0017: /* 03 | */ ldarg.1
    IL_0018: /* 04 | */ ldarg.2
    IL_0019: /* 28 | (06)000004 */ call instance void AngajatApplication.Angajat/*02000002*/::prelDate(string, int32) /* 06000004 */
    IL_001e: /* 00 | */ nop
    .line 18,18 : 9,10 ''
    //000018: }
    IL_001f: /* 00 | */ nop
    IL_0020: /* 2A | */ ret
  } // end of method Angajat::.ctor

  .method /*06000002*/ public hidebysig instance string NumeAngajat() cil managed
  // SIG: 20 00 0E
  {
    // Method begins at RVA 0x2074
    // Code size 12 (0xc)
    .maxstack 1
    .locals /*11000001*/ init ([0] string CS$1$0000)
    .line 21,21 : 9,10 ''
    //000019:
    //000020: public String NumeAngajat()
    //000021: {
    IL_0000: /* 00 | */ nop
    .line 22,22 : 13,30 ''
    //000022: return this.Nume;
    IL_0001: /* 02 | */ ldarg.0
    IL_0002: /* 7B | (04)000001 */ ldfld string AngajatApplication.Angajat/*02000002*/::Nume /* 04000001 */
    IL_0007: /* 0A | */ stloc.0
    IL_0008: /* 2B | 00 */ br.s IL_000a

    .line 23,23 : 9,10 ''
    //000023: }
    IL_000a: /* 06 | */ ldloc.0
    IL_000b: /* 2A | */ ret
  } // end of method Angajat::NumeAngajat

  .method /*06000003*/ public hidebysig instance int32 IDAngajat() cil managed
  // SIG: 20 00 08
  {
    // Method begins at RVA 0x208c
    // Code size 12 (0xc)
    .maxstack 1
    .locals /*11000002*/ init ([0] int32 CS$1$0000)
    .line 26,26 : 9,10 ''
    //000024:
    //000025: public int IDAngajat()
    //000026: {
    IL_0000: /* 00 | */ nop


    .line 27,27 : 13,28 ''
    //000027: return this.id;
    IL_0001: /* 02 | */ ldarg.0
    IL_0002: /* 7B | (04)000002 */ ldfld int32 AngajatApplication.Angajat/*02000002*/::id /* 04000002 */
    IL_0007: /* 0A | */ stloc.0
    IL_0008: /* 2B | 00 */ br.s IL_000a

    .line 28,28 : 9,10 ''
    //000028: }
    IL_000a: /* 06 | */ ldloc.0
    IL_000b: /* 2A | */ ret
  } // end of method Angajat::IDAngajat

  .method /*06000004*/ public hidebysig instance void prelDate(string sNume, int32 snr) cil managed
  // SIG: 20 02 01 0E 08
  {
    // Method begins at RVA 0x20a4
    // Code size 2 (0x2)
    .maxstack 8
    .line 30,30 : 53,54 ''
    //000029:
    //000030: public void prelDate(String sNume, int snr) { }
    IL_0000: /* 00 | */ nop
    .line 30,30 : 55,56 ''
    IL_0001: /* 2A | */ ret
  } // end of method Angajat::prelDate

  .method /*06000005*/ public hidebysig static void Main() cil managed
  // SIG: 00 00 01
  {
    .entrypoint
    // Method begins at RVA 0x20a7
    // Code size 2 (0x2)
    .maxstack 8
    .line 32,32 : 35,36 ''
    //000031:
    //000032: public static void Main() { }
    IL_0000: /* 00 | */ nop
    .line 32,32 : 37,38 ''
    IL_0001: /* 2A | */ ret
  } // end of method Angajat::Main

} // end of class AngajatApplication.Angajat

As a Java disassembly example, the following Java source code is considered:

import java.*;
import java.lang.*;

class Angajat extends java.lang.Object {

    public String Nume;
    public int id;

    public Angajat(String aNume, int nr) {
        Nume = aNume;
        id = nr;
        prelDate(aNume, nr);
    }

    public String NumeAngajat() {
        return this.Nume;
    }

    public int IDAngajat() {
        return this.id;
    }

    public void prelDate(String sNume, int snr) { }
}

The bytecode file generated by the Java compiler has the content highlighted in figure 9.

Figure 9 Bytecode content of class file
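The content of a class file such as the one in figure 9 begins with a fixed header. As a minimal sketch, assuming the layout defined by the Java Virtual Machine specification (a 4-byte magic number 0xCAFEBABE followed by the 2-byte minor version, major version and constant pool count, all big-endian), the following Python code reads that header. The function name read_class_header and the sample bytes are hypothetical; the sample is not taken from the Angajat class file.

```python
import struct

CLASS_MAGIC = 0xCAFEBABE  # magic number that opens every Java class file


def read_class_header(data: bytes) -> dict:
    """Parse the first 10 bytes of a class file (big-endian fields)."""
    if len(data) < 10:
        raise ValueError("too short to be a class file")
    magic, minor, major, cp_count = struct.unpack(">IHHH", data[:10])
    if magic != CLASS_MAGIC:
        raise ValueError("missing 0xCAFEBABE magic number")
    return {"minor": minor, "major": major, "constant_pool_count": cp_count}


# Hypothetical header bytes: magic, minor 0, major 0x33, constant pool count 0x1A.
sample = bytes.fromhex("CAFEBABE" "0000" "0033" "001A")
header = read_class_header(sample)
print(header)
```

The major version identifies the class file format revision, so a disassembler can use it to decide which bytecode features to expect.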


After the disassembly process of the class file, the restored code in human-readable format has the following form:

Compiled from Angajat.java
class Angajat extends java.lang.Object {
    public java.lang.String Nume;
    public int id;
    public Angajat(java.lang.String,int);
    public int IDAngajat();
    public java.lang.String NumeAngajat();
    public void prelDate(java.lang.String, int);
}

Method Angajat(java.lang.String,int)
   0 aload_0
   1 invokespecial #3 <Method java.lang.Object()>
   4 aload_0
   5 aload_1
   6 putfield #4 <Field java.lang.String Nume>
   9 aload_0
  10 iload_2
  11 putfield #5 <Field int id>
  14 aload_0
  15 aload_1
  16 iload_2
  17 invokevirtual #6 <Method void prelDate(java.lang.String, int)>
  20 return

Method int IDAngajat()
   0 aload_0
   1 getfield #5 <Field int id>
   4 ireturn

Method java.lang.String NumeAngajat()
   0 aload_0
   1 getfield #4 <Field java.lang.String Nume>
   4 areturn

Method void prelDate(java.lang.String, int)
   0 return

After the disassembly process, the human-readable code is analyzed in order to apply reverse engineering techniques or to classify the computer program as malicious or benign for computer systems.
Hex editors are software applications used to inspect the binary content of a file, including an executable one. A strong feature of hex editors is the ability to modify the content or to inject new content in binary form. As a result, the behavior of the software application can be observed after successive changes or code injections. Some hex editors provide complex functions that help users locate the areas of interest in the executable file more quickly. Hex editors can be used by any kind of user, including users with little knowledge of software programming.
File packing is the process of reducing the size of a software application and is performed by a tool called a file packer. At run time, a piece of software called a file unpacker is launched to decompress, or unpack, the executable file in memory. The reverse engineering process needs the unpacked form of the executable file. A packed executable file is identified by its modified header. Manual techniques or automatic techniques, such as file unpacking software, can be used to unpack the executable file. The main problem of the automatic techniques is finding the unpacking software that unpacks the file successfully.
File analyzers are software tools used to identify the packer employed to produce a packed file. Identification is based on signature bytes and also targets the compiler or programming language used to develop the packed software application.
Tools like registry monitors supervise the access of software programs to registry keys. A software application reads from and writes to registry keys in order to restore or change a configuration. Useful information for reverse engineering is obtained from the accesses of the software application to registry keys.
File monitoring consists of supervising the access of software applications to files stored on disk. The accessed files can contain sensitive information such as the security algorithms used in the application, access data or procedures for certain functions, and so forth. The file content is a valuable source of information for the reverse engineering process.

Acknowledgement
Parts of this paper were presented by the author at the 5th International Conference on Security for Information Technology and Communications, Bucharest, Romania, 31 May - 1 June 2012.

4. Conclusion

Specific techniques and tools, depending on the development platform and technology, must be considered in order to implement a reverse engineering process. The paper has focused on software applications developed on Windows systems, highlighting the specific approach to reverse engineering for software applications developed on this platform.
As a reverse engineering technique, the disassembly process is used to generate the human-readable format of computer programs delivered as machine code or intermediate code files. There are disassembly traversal algorithms that generate the assembly code from the machine code, even if the machine code flows are not 100% covered by the assembly code flows.
Based on the assembly language, a software specialist can apply reverse engineering techniques to investigate the software vulnerabilities of a computer program.
The main problem remains intellectual property. On the one hand, software engineers must settle this problem with the owners of the computer program. On the other hand, a malicious user can break computer programs in order to use them for commercial advantage or to exploit their vulnerabilities to obtain information and other advantages unlawfully.

References
[1] Ashkbiz Danehkar, Inject your code to a Portable Executable file, 27 December 2005, http://www.codeproject.com
[2] Cătălin Boja, Security Survey of Internet Browsers Data Managers, Journal of Mobile, Embedded and Distributed Systems - JMEDS, vol. 3, no. 3, 2011, pp. 109-119
[3] Cătălin Boja, Mihai Doinea, Security Assessment of Web Based Distributed Applications, Informatica Economică, vol. 14, no. 1, 2010, pp. 152-162
[4] Cristian Toma, Security Issues for 2D Barcodes Ticketing Systems, Journal of Mobile, Embedded and Distributed Systems - JMEDS, vol. 3, no. 1, 2011, pp. 34-53
[5] Cristian Toma, Sample Development on Java Smart-Card Electronic Wallet Application, Journal of Mobile, Embedded and Distributed Systems - JMEDS, vol. 1, no. 2, 2009, pp. 60-80
[6] Cullen Linn, Saumya Debray, Obfuscation of Executable Code to Improve Resistance to Static Disassembly, Proceedings of the 10th ACM Conference on Computer and Communications Security, ACM New York, NY, USA, 2003, pp. 290-299
[7] Giovanni Vigna, Static Disassembly and Code Analysis, Malware Detection. Advances in Information Security, Springer, Heidelberg, vol. 35, 2007, pp. 19-42
[8] Hardik Shah, Software Security and Reverse Engineering, http://www.infosecwriters.com/text_resources/pdf/software_security_and_reverse_engineering.pdf
[9] Henrik Theiling, Extracting Safe and Precise Control Flow from Binaries, Proceedings of the Seventh International Conference on Real-Time Systems and Applications, IEEE Computer Society Washington, DC, USA, 2000, pp. 23-30
[10] Marius Popa, Techniques of Program Code Obfuscation for Secure Software, Journal of Mobile, Embedded and Distributed Systems - JMEDS, vol. 3, no. 4, 2011, pp. 205-219
[11] Marius Popa, Characteristics of Program Code Obfuscation for Reverse Engineering of Software, Proceedings of the 4th International Conference on Security for Information Technology and Communications, Bucharest, 17-18 November 2011, ASE Publishing House, Bucharest, pp. 103-112
[12] Matt Pietrek, An In-Depth Look into the Win32 Portable Executable File Format, MSDN Magazine, http://msdn.microsoft.com/en-us/magazine/cc301805.aspx
[13] Microsoft Portable Executable and Common Object File Format Specification, Revision 8.2, 21 September 2010
[14] Richard Wartell, Yan Zhou, Kevin W. Hamlen, Murat Kantarcioglu, and Bhavani Thuraisingham,


Differentiating Code from Data in x86 Binaries, Proceedings of the 2011 European Conference on Machine Learning and Knowledge Discovery in Databases - Volume Part III, Springer-Verlag Berlin, Heidelberg, 2011, pp. 522-536
[15] Roberto Paleari, Static disassembly and analysis of malicious code, 5 July 2007, http://roberto.greyhats.it/talks.html
[16] The Wikibook of x86 Disassembly Using C and Assembly Language, Wikimedia Foundation Inc., 14 January 2008
[17] http://en.wikipedia.org/wiki/Reverse_engineering
