
A

M.Tech DISSERTATION REPORT


on

BioInformatics Techniques for Metamorphic Malware Analysis and Detection

Submitted in partial fulfillment of the requirements for the degree of Master of Technology (Computer Engineering) in the Department of Computer Engineering (June 2011)

Supervisors: Dr. Vijay Laxmi Dr. Manoj Singh Gaur

By: Grijesh Chauhan (2009PCP116)

MALAVIYA NATIONAL INSTITUTE OF TECHNOLOGY JAIPUR

Department of Computer Engineering

Malaviya National Institute of Technology Jaipur

Rajasthan - 302017

CERTIFICATE

This is to certify that the Dissertation Report on BioInformatics Techniques for Metamorphic Malware Detection, by Grijesh Chauhan, is the work completed under my supervision, and is hence approved for submission in partial fulfillment for the Master of Technology in Computer Engineering during academic session 2009-2011.

(Dr.Vijay Laxmi) Reader and Head of Department Date : M.N.I.T., Jaipur

(Dr. M.S.Gaur) Professor Date: M.N.I.T.,Jaipur

Declaration
I, Grijesh Chauhan, declare that this Dissertation titled, BioInformatics Techniques for Metamorphic Malware Analysis and Detection and the work presented in it are my own. I confirm that:

This work was done wholly or mainly while in candidature for a M.Tech. degree at MNIT. Where any part of this Dissertation has previously been submitted for a degree or any other qualification at MNIT or any other institution, this has been clearly stated. Where I have consulted the published work of others, this is always clearly attributed. Where I have quoted from the work of others, the source is always given. With the exception of such quotations, this Dissertation is entirely my own work. I have acknowledged all main sources of help.

Signed:

Date:

Abstract
Modern malware that is metamorphic or polymorphic in nature mutates its code by employing code obfuscation and encryption methods to thwart detection. Conventional signature-based scanners fail to detect such malware. In addition, a signature-based scanner requires frequent updates, and the size of its signature database increases exponentially. To address the problem of detecting known variants of metamorphic malware, we propose a method known as MetamOrphic Malware Exploration Techniques using MSA (MOMENTUM), which uses bioinformatics techniques for protein and DNA matching. Instead of a fixed signature, more sophisticated signatures are extracted using multiple sequence alignment (MSA). Experiments are conducted over an obfuscated malware data set collected from VX Heavens, malware construction tools and user agencies, and over benign samples gathered from a fresh installation of the Windows XP operating system, Cygwin, etc. The data set is segregated into two parts: one for modeling signatures and the other reserved for testing. The results show that the proposed method is capable of identifying malware variants with minimal false alarms and misses.

Acknowledgements
I take immense pleasure in expressing my deep and sincere gratitude to my esteemed guides, Dr. Vijay Laxmi (Head of the Department, Department of Computer Engineering, Malaviya National Institute of Technology) and Dr. Manoj Singh Gaur (Professor, Department of Computer Engineering, Malaviya National Institute of Technology), for their invaluable guidance and for spending precious hours on my work. Their excellent cooperation and suggestions through stimulating and beneficial discussions provided me with an impetus to work and made the completion of this work possible. My sincere thanks go to all faculty members of the Department of Computer Engineering, MNIT Jaipur, for their constant support and for imparting the best knowledge in the M.Tech course. I would like to thank all non-teaching staff members of the Department of Computer Engineering, Malaviya National Institute of Technology, Jaipur, and all those people whose kind favors I have received in completing this Dissertation work.

I would always be indebted to the support and prayers of my parents in completing this work successfully. I thank my friends who have directly or indirectly contributed by giving their valuable suggestions.

Signed:

Date:


Contents
Declaration
Abstract
Acknowledgements
List of Figures
List of Tables

1 Introduction
  1.1 Motivation
  1.2 Objective
  1.3 Related Work
  1.4 Contributions of Thesis
  1.5 Outline

2 Malware and Types
  2.1 Types of Malware
    2.1.1 Virus
    2.1.2 Worms
    2.1.3 Trojans
    2.1.4 Backdoors
    2.1.5 Logic Bombs
    2.1.6 Adware
  2.2 Polymorphic
  2.3 Metamorphic
    2.3.1 Dead Code Insertion
    2.3.2 Reorder Instruction using Jump
    2.3.3 Equivalent Instruction Substitution
    2.3.4 Subroutine Inlining and Outlining
    2.3.5 Independent Instruction Permutation
  2.4 Detection Techniques
    2.4.1 Static Detection
    2.4.2 Dynamic Detection
    2.4.3 Heuristic Detection

3 Bioinformatics Techniques
  3.1 Global Alignment
    3.1.1 Needleman-Wunsch Method
    3.1.2 Levenshtein distance
  3.2 Local Alignment
  3.3 Phylogenetic Tree
  3.4 Multiple Sequence Alignment Method
    3.4.1 Iterative Alignment
    3.4.2 Progressive Alignment

4 Metamorphic Malware Exploration Technique Using MSA (MOMENTUM)
  4.1 Data acquisition
  4.2 Analysis of metamorphism in Tools/Real malware
    4.2.1 Type of obfuscation
    4.2.2 Identification of Base Malware
  4.3 Signature Modeling and Testing
    4.3.1 Single Signature
    4.3.2 Group Signature
    4.3.3 Testing

5 Result and Inferences
  5.1 Evaluation Metrics
  5.2 Intra Family Analysis
  5.3 Inter Family Analysis
  5.4 Comparative Analysis
  5.5 Testing with Signature
  5.6 Comparative Analysis with Antiviruses

6 Conclusions and Future Work

A Executable Unpacking
  A.1 Symptoms of Packed Malicious Executables
  A.2 Manual Unpacking of Packed Executable
  A.3 Executable Unpacking using Ether

Bibliography

List of Figures
2.1 Metamorphic malware variants using obfuscation and embedded with metamorphic engine.
2.2 Subroutine Inlining and Subroutine Outlining
2.3 Subroutine Permutation
3.1 Global Alignment for DNA Sequences
3.2 Local Alignment for DNA Sequences
3.3 Phylogenetic tree and alignment of sequences.
3.4 Multiple Aligned opcode sequences corresponding to malware samples.
3.5 Progressive Alignment
4.1 Brief Outline of Method for Metamorphic Malware Detection
4.2 Method for Investigation of Metamorphism.
4.3 Sum of Pair
4.4 Signature Modeling and Testing
4.5 Extraction of single signature.
4.6 Wildcard based representation of Group signature.
5.1 Intra Family Analysis of malware (Synthetic and Real).
5.2 Inter Family Analysis of malware (Synthetic and Real).
5.3 Detection rate of antiviruses compared with different type of constructed signature.
A.1 Portable Executable Unpacking Procedure
A.2 Userspace Unpacking using Ether

List of Tables
2.1 Different types of Junk code instructions used by metamorphic engine.
2.2 Dictionary of equivalent instructions.
4.1 Dataset Description
4.2 Instruction Replacements
5.1 Comparative Analysis of Malware Samples
5.2 Evaluation Metrics for different types of signatures.

Chapter 1 Introduction
The advent of the Internet has increased the appearance of malware in the digital world. A majority of transactions are performed online by naive users, which has increased the threat of stolen passwords, transaction credentials or personal information. The term malware generally refers to all software that has illicit intentions. Malware is categorized into computer viruses, worms, Trojans, backdoors, rootkits, etc. Malware can also be categorized by mode of propagation, as mobile malware (worms, spyware, botnets, etc.) or static malware such as viruses. The focus of such malicious software is to replicate by exploiting system vulnerabilities.

Conventionally, malware scanners detect malware by matching signatures of known samples. Signature-based scanners are fast but have certain limitations: (a) failure to detect unseen malware, (b) lack of semantic knowledge of the samples and (c) failure to detect obfuscated or encrypted instances. A minor change in the code of a malicious sample would thwart detection. Antivirus companies have evolved better methods for identifying malware, but malware writing is becoming more sophisticated and continues to challenge scanners. Identification of polymorphic and metamorphic malware is difficult, as a simple change in the byte pattern significantly changes the signature of a sample. Maintaining a signature for each malware sample results in (a) growth of the malware database and (b) a window in which a system may be infected by new samples before their signature is created.

The detection process can be categorized as (a) static analysis and (b) dynamic analysis. In static analysis, malware is analyzed by checking the structure (content) of the assembly code without executing the samples. Thus, the system is not infected, and maliciousness is derived from either the control flow graph or the frequencies of opcodes. In dynamic analysis, each malware sample is executed in a controlled environment. The impact of infection is monitored by inspecting the traces left by malware samples (system registry, processor registers, etc.). This method gives more refined output but is expensive with respect to running time.

1.1 Motivation

Metamorphic malware mutates its code on each replication while preserving the functionality of the code. The code is mutated with the help of a small mutation engine called a metamorphic engine. Metamorphic malware uses different obfuscation mechanisms to evade conventional signature-based scanners, which rely on exact string matching. The metamorphic engine is the prime element that keeps the malware hidden from antivirus products. The metamorphic engine is also designed to be small so as to escape detection [8]. This indicates that the metamorphic engine performs structural transformations on the code with a limited set of replacements. A total change in the code is impossible, since the functionality of the malware variant would change and the variant might lose its maliciousness by producing useless code. Malicious programs are less diverse than benign programs, since maliciousness must be preserved for infection and propagation.

DNA and proteins mutate from one generation to another, inheriting some functional and structural similarity from their ancestors. In this work it is assumed that metamorphic malware, like a DNA or protein sequence, transforms its code through modifications in the opcode sequence. The mismatches in the opcode sequence from one generation to another may be considered points of mutation. Exact string matching techniques would therefore fail to detect new malware variants. At this point we shift from the general area of exact matching and exact pattern discovery to the general area of inexact, approximate matching and sequence alignment. The bioinformatics sequence alignment method used in this work aligns sequences based on their evolutionary relationship and is found to be better for signature extraction and detection of malware variants.
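The contrast between exact and approximate matching can be illustrated with a minimal sketch. It assumes two hypothetical opcode sequences (a parent and a child that differs by one inserted NOP) and uses Python's standard difflib.SequenceMatcher as a stand-in for a proper sequence alignment; it is not the alignment method used in this work.

```python
from difflib import SequenceMatcher


def similarity(parent, child):
    """Approximate similarity in [0, 1] between two opcode sequences."""
    return SequenceMatcher(a=parent, b=child, autojunk=False).ratio()


def mutation_points(parent, child):
    """Return the edit operations (tag, i1, i2, j1, j2) where the sequences differ."""
    matcher = SequenceMatcher(a=parent, b=child, autojunk=False)
    return [op for op in matcher.get_opcodes() if op[0] != "equal"]


# Hypothetical opcode sequences: the child carries one inserted junk opcode.
parent = ["mov", "add", "push", "call", "ret"]
child = ["mov", "nop", "add", "push", "call", "ret"]

exact_match = parent == child            # exact matching fails on the variant
score = similarity(parent, child)        # approximate matching stays high
points = mutation_points(parent, child)  # the mismatch marks the mutation point
```

Here the exact comparison rejects the child outright, while the approximate score remains close to 1 and the only reported difference is the inserted opcode.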


1.2 Objective

Motivated by bioinformatics techniques, the objective of this thesis is to detect metamorphic malware. Using the sequence alignment method, two types of signatures are constructed for each malware family: (a) group and (b) single. Unseen malware is tested against the extracted signatures. The obfuscation and metamorphism in malware constructors and real malware are also explored to identify the prominent instructions used for mutating the malware.

1.3 Related Work

The authors of [14] and [15] created a rewriting engine for detecting morphed malware variants. The analysis of malware variants is based on the syntactic as well as the semantic structure of a program. Malware signatures are represented in the form of a control flow graph, and the signature matching technique is based on tree automata.

Krugel et al. [16] proposed a method based on code analysis to identify structural similarity between malicious code (worms). The method is based on the control flow graph (CFG) generated for a worm, which serves as a fingerprint for the worm. Their system is found to be resilient against common code transformation techniques.

The authors of [17] proposed a novel method for analyzing malware based on a code graph. Each malware executable was inspected, and the instructions corresponding to its system call sequence were represented in the form of a topological graph. The code graph system was used to differentiate malware from benign programs by checking the applicability of specific system calls.

The authors of [9] proposed a semantics-based approach for detecting variants of malware. This method is based on the functionality of the system calls executed by malware samples. The main focus is to identify all instructions and their parameters that are used for calling a system call. They propose a pattern matching technique that is able to identify semantically equivalent parts of code. The method is capable of identifying programs that are related to each other and those that are totally dissimilar.

Rachit et al. [13] created a malware normalizer that makes use of term rewriting rules. The method was applied to the virus Win32.Evol. The main objective of their work was to convert program variants into a smaller number of variants, i.e. to convert all programs into a normal form.

In Hunting for Metamorphic Engines [10], Hidden Markov Models (HMMs) were used to represent the statistical properties of a set of metamorphic virus variants. The metamorphic virus data set was generated from the metamorphic engines Second Generation virus generator (G2), Next Generation Virus Construction Kit (NGVCK), Virus Creation Lab for Win32 (VCL32) and Mass Code Generator (MPCGEN). An HMM is trained on a family of metamorphic viruses and determines whether a given program is similar to the viruses the HMM represents.

In [11], the critical API calls were extracted statically using IDA-Pro [6]. Thus, all the late-bound API calls that are made using GetProcAddress, LoadLibraryEx, etc. are not taken into account. In addition, this approach did not work for packed malware.

The authors of [1] proposed a phylogeny model, of the kind used in bioinformatics for extracting information from genes, proteins or nucleotide sequences. The n-gram feature extraction technique was used and a fixed permutation was applied to the code to generate new sequences, called n-perms. Since new variants of malware evolve by incorporating permutations, the n-perm model was developed to capture instruction and block permutations. The experiment was conducted on a limited data set consisting of 9 benign samples and 141 worms collected from VX Heavens [2]. The results showed that similar variants appeared closer together in the phylogenetic tree, in which each node represented a malware variant. The work did not show how the n-perm model would behave if the instructions in a block of code were replaced by equivalent instructions, which could either expand or shrink the block (with respect to the number of instructions in it).
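The difference between n-gram and n-perm features can be sketched as follows. This is a minimal illustration, assuming that an n-perm is an n-gram whose internal ordering is ignored (represented here as a sorted tuple); the function names and the sample opcode sequences are hypothetical and are not taken from [1].

```python
from collections import Counter


def ngrams(opcodes, n):
    """Count order-sensitive windows of n consecutive opcodes."""
    return Counter(tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1))


def nperms(opcodes, n):
    """Count order-insensitive windows: each window is sorted before counting."""
    return Counter(tuple(sorted(opcodes[i:i + n])) for i in range(len(opcodes) - n + 1))


# Two hypothetical variants that differ only by a permutation of instructions.
original = ["push", "mov", "add"]
permuted = ["mov", "push", "add"]

same_ngrams = ngrams(original, 3) == ngrams(permuted, 3)
same_nperms = nperms(original, 3) == nperms(permuted, 3)
```

Under this assumption the two variants share no 3-gram but share the same 3-perm. An equivalent-instruction substitution that lengthens a block would change both feature sets, which is the open question noted above.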

1.4 Contributions of Thesis

In this thesis a novel method to detect metamorphic malware variants is proposed. The method is based on static analysis: unpacked samples are disassembled and the opcode sequences of the samples are used for comparison. The authors of [7] observed that there is a large difference between the opcode sequences of malicious and benign samples. Thus, opcodes can be used to build sequences that represent malware samples. An evolutionary tree, also known as a phylogenetic tree, is constructed for each malware family. A threshold within the family is computed, and unseen samples are detected using this threshold. Two types of signature are constructed for a family: (a) a group signature and (b) a single signature. To extract the single and group signatures, multiple sequence alignment (MSA), a technique used primarily in bioinformatics, is employed. Our experiments show promising results and demonstrate the effectiveness of the method for detecting known samples of metamorphic malware with few false alarms.

Experiments have been conducted on an obfuscated malware data set collected from VX Heavens [2] and from user agencies. Malware variants were also prepared using constructors such as NGVCK, MPCGEN, G2 and PSMPC. Through our experiments we found that obfuscation is minimal in samples created using the constructors. The obfuscation consists primarily of simple instruction replacement and junk code insertion, with the code reordered using jump instructions. Also, most of the malware families generated using the constructors overlap, which indicates minimal obfuscation of the code from one generation to the next.
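The idea of a group signature derived from aligned sequences can be sketched as below. This is only an illustration under stated assumptions: the opcode sequences are taken to be already aligned to equal length with "-" marking gaps, a column becomes part of the signature only when all samples agree on it, and every other column becomes a wildcard "*". The consensus rule, the function names and the sample data are hypothetical and may differ from the construction described in Chapter 4.

```python
def group_signature(aligned, wildcard="*"):
    """Build a wildcard signature from equal-length aligned opcode sequences."""
    if len({len(row) for row in aligned}) != 1:
        raise ValueError("aligned sequences must be non-empty and of equal length")
    # Keep a column only if every sample has the same opcode there.
    return [col[0] if len(set(col)) == 1 else wildcard for col in zip(*aligned)]


def matches(signature, aligned_sample, wildcard="*"):
    """Check an aligned sample against the signature, position by position."""
    return len(signature) == len(aligned_sample) and all(
        s == wildcard or s == o for s, o in zip(signature, aligned_sample)
    )


# Hypothetical family of three aligned variants ("-" is an alignment gap).
family = [
    ["mov", "add", "-", "push", "call"],
    ["mov", "sub", "nop", "push", "call"],
    ["mov", "add", "nop", "push", "call"],
]
signature = group_signature(family)
```

A sample that agrees with the family on the conserved columns matches regardless of what appears in the wildcard positions, which is what allows one signature to cover several variants.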

1.5 Outline

Chapter 2 gives an introduction to malware and the different types of malicious code. The chapter discusses the infection and propagation modes used by malicious software. Polymorphic malware is then briefly introduced, followed by a detailed explanation of metamorphic malware. Later in the chapter, malware detection techniques are described.

Chapter 3 discusses various bioinformatics techniques used in DNA/protein sequence alignment. Two types of sequence alignment, global and local, are described. The phylogenetic tree used to represent evolutionary relationships is explained, with a brief outline of its construction techniques. At the end of the chapter, Multiple Sequence Alignment (MSA), which is used for aligning more than two sequences, is described in detail. The iterative and progressive methods for constructing an MSA are also introduced.

Chapter 4 describes the proposed method and its implementation, known as Metamorphic Malware Exploration Technique Using MSA (MOMENTUM). The chapter explains in detail the dataset preprocessing, which involves unpacking and classification into different families. It describes the steps involved in exploring metamorphism in synthetic and real malware data and highlights the prominent opcode sequences used by malware. Signature modeling is explained in detail, along with the testing of unseen samples against the extracted signatures to validate the detection hypothesis.

Chapter 5 gives details of the experiments conducted along with an analysis of the results. Finally, conclusions and future work are discussed in Chapter 6.

Chapter 2 Malware and Types


Malware can be defined as programs with unethical intentions. They contain instructions that try to find vulnerabilities in computer systems in an unauthorized manner in order to infect machines or steal valuable information from them. Once installed, some malware gives remote attackers access to user machines. Malicious software can be categorized as computer viruses, worms, Trojans, backdoors, adware, spyware, etc. Many malicious programs are distributed along with freeware or open source software with the motive of making money. They are primarily installed on computer systems while browsing sites from which games, movies, web browsers, music, etc. are downloaded. A compromised machine exposes useful information about the system and its user to the attacker's machine, such as (a) credit card numbers or (b) the root password, or (c) is used to launch attacks or send spam messages to other systems. Once a system is infected, the malware may try to delete system files, change registry entries, hide the task manager or launch spying software that monitors the user's keystrokes.

2.1 Types of Malware

Malware can be classified based on its mode of infection and propagation mechanism. Modern malware is more sophisticated in terms of the complexity of its behaviour and the appearance of its code. Present-day malware employs anti-debugging and anti-virtual-machine checks and stays dormant in order to evade detection. As antivirus products become more powerful, malware writing is becoming more complex and continues to challenge them. A brief outline of the various types of malware is given in the subsequent subsections.

2.1.1 Virus

A computer virus is a program that infects a system by replication. Viruses use a host program for infection and are propagated only through human intervention. A virus is activated only when the infected program is executed. Some viruses are harmful and some are written for fun. Harmful viruses can delete system files or freeze the computer by occupying large amounts of hard disk space. Harmless computer viruses display messages to attract users but still replicate by creating clones of themselves. Normally, computer viruses target autorun files, executable system files and macros of document files for the purpose of replication. A computer virus basically has four functions: (a) a search routine, which locates a program or file with a specific file extension to infect, and marks each file it infects to avoid infecting or searching the same file again; (b) a copy routine, which copies the malicious code to a host file, where it may be prepended, appended or added at different locations; (c) an anti-detection mechanism to evade detection by antivirus products, such as encryption, code morphing or interrupt vector table modification; and (d) a payload, which is the main part of any virus, used for self replication.

2.1.2 Worms

Worms are malicious programs that, like computer viruses, are self-replicating, but they use the Internet to spread. The most striking feature of a worm is that it does not require human intervention to spread. A worm exploits two fundamental kinds of vulnerability to propagate: (a) software bugs and (b) security holes. A software bug could be a buffer overflow vulnerability, which appears in a program that uses functions like strcpy instead of bounded functions like strncpy; it allows the attacker to supply oversized input that overruns the buffer and to inject malicious code, as with the well-known program finger. A similar type of software bug is found in programs like sendmail, which delivers messages to programs residing on the local or a remote machine. The recipient program executes, in a new shell, a script that is present in the body of the message. A worm attempts to scan open ports to launch different types of attacks. It also spreads through email by sending spam messages to the contact list of a particular user account. In most cases the user is indirectly induced to open or download attachments, which triggers the malicious activities of the worm. Once a vulnerable system is located, the worm scans the /etc/passwd file for encrypted passwords and tries to crack them by making multiple attempts. Once a username and password are obtained, any malicious code can be remotely executed by the worm using a utility like rexec.

2.1.3 Trojans

A Trojan horse is a non-self-replicating program that enters the computer unnoticed and is usually disguised as a legitimate application. Once a system is infected by a Trojan, it gives an attacker at a remote location unrestricted access to the user's system. Such malicious software requires a host program in which to hide. The basic components of a Trojan horse are a server program and a client program. The server offers a program that attracts the user, in the form of games, images, videos, etc., in which the malicious program hides. After these applications are downloaded to the system, the machine is infected and the Trojan (client program) performs spying activity.

2.1.4 Backdoors

A backdoor is a program created to bypass network security checks and open a channel through which the attacker can control, spy on or interact with the victim machine. Backdoors are planted in software (open source or freeware) before its distribution. When the software is installed and executed, the backdoor opens the channel and connects to the remote machine to leak valuable information about the user and the computer system. Some backdoors are created for legitimate purposes, for example to avoid time-consuming authentication while debugging a network server [18]. Sometimes backdoors make use of Trojans to compromise a computer system. The user's machine is victimized when an image or video containing a backdoor is downloaded. Many backdoors are installed when an ActiveX control is installed on the user's system while browsing certain sites. Most browsers prompt the user when an ActiveX control is downloaded, to protect the machine from attacks.

2.1.5 Logic Bombs

This category of malware can exist stand-alone or can be interleaved inside a legitimate program. Logic bombs do not replicate and have two basic components: (a) a payload, which is capable of performing malicious activities such as formatting the hard disk or deleting system files, and (b) a trigger, which makes the logic bomb more dangerous, as it stays dormant until a specific event occurs and then delivers its malicious payload.

2.1.6 Adware

Adware forces unsolicited advertisements on the user while the user is browsing the Internet. Adware is planted by many companies; it gathers browsing behaviour and tries to create an interest in shopping by popping up many advertisements. Adware can be very dangerous when it redirects the user to an unsolicited site that asks users to fill in information such as email passwords or credit card and CVV numbers, and logs keystrokes to gather this valuable information.

Most of the popular malware today employs encryption and obfuscation to evade detection. Such malware is called polymorphic and metamorphic malware and is described in the subsequent sections.

2.2 Polymorphic

Polymorphic malware encrypts its code with a random key to avoid detection. Each polymorphic virus has a polymorphic engine called the virus decryption routine (VDR), which generates new keys and contains the decryption module for decrypting the encrypted malicious body that is responsible for infecting applications and the system. Once executed, the virus is re-encrypted and added to another vulnerable host application. Thus, when an antivirus scans the malware for a signature it finds a different pattern (as the keys are different), and detection is thwarted.
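The effect of re-encryption on a byte signature can be illustrated with a harmless sketch. It assumes a single-byte XOR as the cipher, two arbitrary keys and an ASCII placeholder in place of a real body; none of these come from a specific virus, and real polymorphic engines use more elaborate schemes.

```python
def xor_bytes(data: bytes, key: int) -> bytes:
    """XOR every byte with a one-byte key (applying it twice restores the data)."""
    return bytes(b ^ key for b in data)


# Placeholder content standing in for the encrypted body (not real code).
body = b"EXAMPLE-BODY-BYTES"
signature = b"BODY"  # a byte pattern a scanner might extract from the plain body

variant_a = xor_bytes(body, 0x5A)  # one generation, hypothetical key 0x5A
variant_b = xor_bytes(body, 0xC3)  # next generation, hypothetical key 0xC3

found_in_plain = signature in body
found_in_a = signature in variant_a
found_in_b = signature in variant_b
```

The signature is present in the plain body but in neither encrypted variant, and the two variants differ from each other, which is why the scanner must either scan the decrypted body in memory or recognise the decryption routine itself.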


Malware scanners perform in-memory scanning of each suspicious sample for detection. Ultimately, malware needs to execute in order to infect the machine and hence must reside in main memory. Thus, the antivirus scans through all samples in memory and matches all patterns against the signatures in its repository. Another major weakness of polymorphic malware is its decryption algorithm. If the scanner can locate the decryption algorithm, it can become a signature for identifying the polymorphic malware. To avoid this, malware authors scramble statements or replace some registers with unused registers to obtain different byte patterns. Another approach is to prepare a dictionary of binary code patterns and their equivalent replacements. Using this table, the polymorphic engine can automatically identify binary patterns and replace them with equivalent code from the dictionary to generate new malware variants.

2.3 Metamorphic

Metamorphic malware is very sophisticated in nature, as it completely modifies its code upon each replication to generate a new malware variant. This makes it very difficult for antivirus products to identify metamorphic malware using signature matching techniques. Metamorphic malware contains an engine, normally referred to as a metamorphic engine, which mutates the code from one generation to the next. Normally the metamorphic engine is kept very small in order to avoid detection. A metamorphic engine alters the program by applying various obfuscation techniques such as (a) junk code insertion, (b) instruction permutation by reordering the control flow using jump instructions, (c) equivalent instruction replacement and (d) subroutine inlining and outlining. Figure 2.1 shows metamorphic malware embedded with a metamorphic engine that applies obfuscation transformations.

2.3.1 Dead Code Insertion

In this technique, garbage code or NOP instructions are inserted into the actual code. This is the simplest obfuscation, as it does not reorder the program code. Garbage code is inserted to confuse the scanner by adding irrelevant byte patterns to the malicious samples. Dead code insertion is illustrated by the instructions marked as garbage code in the following code snippet.

Chapter 2. Malware and Types


Figure 2.1: Metamorphic malware variants using obfuscation and embedded with metamorphic engine.

mov eax, 020H
mov eax, eax      ;Garbage Code
mov ebx, 0ABH
add eax, ebx
add eax, 00H      ;Garbage Code
push eax
pop ebx
push eax          ;Garbage Code
pop eax           ;Garbage Code
nop               ;Garbage Code
add eax, ebx
add eax, 00H      ;Garbage Code
mul ecx
mov [esi], ebx

Some of the junk code instructions used are listed in Table 2.1. The left hand side of the table gives the instructions and the right hand side gives the meaning of each instruction.

2.3.2

Reorder Instruction using Jump

This technique adds jump instructions and garbage code in each mutant. Win95/Zperm is an example of a virus using this technique. Since the virus body is not constant, string based detection is not possible. Consider the following piece of code without any jump instructions:

instruction 1 ;entry point
instruction 2
instruction 3
. . .
instruction n

In later generations the virus body is modified by the engine, which inserts jump instructions at random positions as shown below:

instruction 2
jump 3
instruction 4
jump n
instruction 1 ;entry point
jump 2
instruction 3
jump 4
. . .
instruction n

Table 2.1: Different types of junk code instructions used by metamorphic engine.
Instructions          Meaning
NOP                   No Operation
CLD                   No Operation
PUSHFD; POPFD         No Operation
PUSHAD; POPAD         No Operation
MOV REG, REG          REG := REG
ADD REG, 0            REG := REG + 0
OR REG, 0             REG := REG | 0
AND REG, -1           REG := REG & -1
PUSH REG; POP REG     No Operation
XCHG REG, REG         No Operation
XOR REG, 0            No Operation
SUB REG, 0            No Operation
SBB REG, 0            No Operation
ADC REG, 0            No Operation
SHL REG, 0            No Operation
SHR REG, 0            No Operation
AND REG, 1            REG := REG & 1

2.3.3

Equivalent Instruction Substitution

Some malware like Win95/Zperm [21] and Win32.Evol [8] make use of equivalent instruction substitution as an obfuscation mechanism. In our proposed code morpher, we make use of a dictionary of instructions that can be replaced by equivalent instructions. Instruction replacement can either expand or shrink the code size of the offspring. Our morpher generally increases the size of the generated variants. Table 2.2 lists instructions and their equivalent sets of instructions.
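The dictionary-driven substitution described above can be sketched in a few lines of Python. This is a minimal illustration, not the morpher used in this work: the tuple representation of instructions, the function names `equivalents` and `substitute`, and the `probability` parameter are all assumptions made for the example. Only three rules from Table 2.2 are encoded.

```python
import random

def equivalents(instr):
    """Return alternative instruction sequences for one instruction.

    Instructions are hypothetical (opcode, operand, ...) tuples; register
    operands are strings and immediates are ints. Only a few rules from
    Table 2.2 are encoded here.
    """
    op, *args = instr
    if (op == "mov" and len(args) == 2
            and all(isinstance(a, str) for a in args) and args[0] != args[1]):
        dst, src = args
        return [[("push", src), ("pop", dst)]]   # MOV REG1, REG2
    if op == "sub" and len(args) == 2 and isinstance(args[1], int):
        return [[("add", args[0], -args[1])]]    # SUB Reg, Imm
    if op == "jmp" and len(args) == 1:
        return [[("push", args[0]), ("ret",)]]   # JMP REG
    return []

def substitute(program, rng, probability=1.0):
    """Replace each instruction that has an equivalent with the given probability."""
    out = []
    for instr in program:
        options = equivalents(instr)
        if options and rng.random() < probability:
            out.extend(rng.choice(options))
        else:
            out.append(instr)
    return out

program = [("mov", "eax", "ebx"), ("sub", "eax", 4), ("inc", "ecx"), ("jmp", "edx")]
variant = substitute(program, random.Random(0))
```

With `probability=1.0` every instruction that has a rule is rewritten, so the variant is longer than the original, which matches the remark that the morpher generally increases code size.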

2.3.4

Subroutine Inlining and Outlining

Subroutine inlining is a form of program obfuscation in which some or all calls to a subroutine are replaced by the subroutine's code definition. Code outlining divides a block of code into subroutine(s) and adds a subroutine call for each newly created subroutine. Figure 2.2 shows an example of subroutine inlining for two subroutine calls S1() and S2(), and outlining of code to create a new subroutine S12().

...
Call S1
Call S2
...

S1:
mov eax, ebx
add eax, 12h
push eax
ret
S2:
mul ecx
mov edx, eax
ret

...
mov eax, ebx
add eax, 12h
push eax
mul ecx
mov edx, eax
...

...
mov eax, ebx
call S12
mov edx, eax
...
S12:
push eax
add eax, 12h
mul ecx
ret

Figure 2.2: Subroutine Inlining and Subroutine Outlining

Table 2.2: Dictionary of equivalent instructions.
Instructions             Equivalent Instructions
ADD REG, -1              NEG REG; NOT REG or NOT REG; NEG REG
ADD REG, 0               NOP
ADD REG, 1               INC REG or NOT REG; NEG REG or NEG REG; NOT REG
AND REG, -1              NOP
XOR Reg,-1               NOT Reg
XOR Mem,-1               NOT Mem
MOV Reg,Reg              NOP
SUB Reg,Imm              ADD Reg,-Imm
SUB Mem,Imm              ADD Mem,-Imm
AND REG, 0               MOV REG, 0
AND REG, REG             CMP REG, 0
JMP REG                  PUSH REG; RET
MOV REG, REG             NOP
AND Mem,0                MOV Mem,0
XOR Reg,Reg              MOV Reg,0
SUB Reg,Reg              MOV Reg,0
OR Reg,Reg               CMP Reg,0
AND Reg,Reg              CMP Reg,0
MOV REG1, REG2           PUSH REG2; POP REG1 or XCHG REG1, REG2
NOP                      PUSHFD; POPFD or PUSHAD; POPAD or PUSH REG; POP REG
XOR Reg,0                MOV Reg,0
XOR Mem,0                MOV Mem,0
ADD Reg,0                NOP
ADD Mem,0                NOP
OR Reg,0                 NOP
OR Mem,0                 NOP
AND Reg,-1               NOP
AND Mem,-1               NOP
AND Reg,0                MOV Reg,0
TEST Reg,Reg             CMP Reg,0
LEA Reg,[Imm]            MOV Reg,Imm
LEA Reg,[Reg+Imm]        ADD Reg,Imm
LEA Reg1,[Reg2]          MOV Reg1,Reg2
LEA Reg1,[Reg1+Reg2]     ADD Reg1,Reg2
MOV Reg,Reg              NOP

Subroutine Permutation: Some metamorphic viruses permute their subroutines. If a virus code consists of n subroutines, n! different orderings, and hence n! generations, are possible. Figure 2.3 shows a few permutations of virus code consisting of 5 subroutines.

Figure 2.3: Subroutine Permutation


2.3.5

Independent Instruction Permutation

Transposition or instruction permutation modifies the execution order of instructions that are not interdependent. Consider two instructions op R1, R2 followed by op R3, R4. These two instructions can be swapped provided R1, R2, R3 and R4 are all different. For example, the instructions mov ecx, imm and inc eax are not interdependent, hence they can be swapped:

...
mov ecx, imm
inc eax
...

is equivalent to

...
inc eax
mov ecx, imm
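The register independence test can be sketched as follows. This is a simplified illustration under stated assumptions: instructions are hypothetical (opcode, operand, ...) tuples, only the eight general purpose registers listed in `REGISTERS` are recognised, and dependencies through flags, memory or control flow are ignored. The helper names are illustrative.

```python
# Assumed register set; anything else (immediates, labels) is not a register.
REGISTERS = {"eax", "ebx", "ecx", "edx", "esi", "edi", "ebp", "esp"}

def registers_used(instr):
    """Registers named in the operands of an (opcode, operand, ...) tuple."""
    return {arg for arg in instr[1:] if arg in REGISTERS}

def can_swap(first, second):
    """Two instructions may be swapped if they share no register."""
    return registers_used(first).isdisjoint(registers_used(second))

def swap_independent_pairs(program):
    """Swap adjacent independent instructions, scanning left to right."""
    out = list(program)
    i = 0
    while i + 1 < len(out):
        if can_swap(out[i], out[i + 1]):
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2  # skip the pair that was just swapped
        else:
            i += 1
    return out

program = [("mov", "ecx", 5), ("inc", "eax"), ("add", "eax", "ecx")]
permuted = swap_independent_pairs(program)
```

A real engine would also have to check flag, memory and branch dependencies before swapping; the sketch only captures the register condition stated in the text.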

2.4

Detection Techniques

Malware detection deals with the different mechanisms for filtering out malicious programs. The detection mechanisms can be broadly classified as static, dynamic and heuristic methods.

2.4.1

Static Detection

Static analysis deals with detection of malcode without executing it on a computer system. The disassembled code is scanned for malicious content by examining the import address table (IAT), opcode patterns or byte n-grams. Signatures in the form of byte patterns are extracted from each malicious sample and checked against a repository. Static detection using control flow graphs as signatures is also used to flag maliciousness. The main advantage of static detection is that the system is not infected by the malcode. The approach is also fast, since only surface scanning of the malware program is performed. However, this method cannot detect encrypted malware, as the actual malicious payload is released only during execution.

2.4.2

Dynamic Detection

Dynamic analysis detects maliciousness by executing malware samples in a controlled environment, which keeps the host machine unaffected. Dynamic analysis is particularly useful when dealing with encrypted malware. Code emulation can lead to correct detection, but when used alone it may fail because decryption can consume much of the available time. In order to thwart detection, some malware use multiple jump instructions to defeat dynamic scanners.

2.4.3

Heuristic Detection

Heuristic detection mechanisms can be used along with static or dynamic techniques. Scanners primarily use heuristics for detecting unseen malware samples. Some of the heuristics for detection of malicious code are (a) presence of the entry point in the last section, (b) suspicious section names, (c) large data sections and (d) small import table size. Heuristic detectors are prone to many false alarms, in which benign samples are incorrectly identified as malware.

Chapter 3 Bioinformatics Techniques


Bioinformatics is the application of computer science to biological data. In bioinformatics, biological information is extracted to gain a better understanding of different biological species. Sequence alignment is an elementary method used in any biological study to compare two or more biological sequences (protein or DNA). The alignment method attempts to find regions of high similarity, over the whole sequences or over parts of them, to deduce evolutionary relationships among sequences.

Metamorphic malware, like proteins or nucleotide sequences, has some fragments of code which are inherited from its base malware. These segments of code are only partially changed from one generation to the next. Malcode is transformed by a metamorphic engine to conceal the malicious payload so that maliciousness is not revealed. Fundamentally, code obfuscation is performed by the metamorphic engine to thwart detection. The structures of metamorphic variants are different but they share common functionality. The difference between variants of the same base malware cannot be too large; hence, techniques used in bioinformatics can be applied for their detection. Opcode sequences in malware can be thought of as the counterpart of genes in DNA. The size of the metamorphic engine is usually small to hide it from detection. Each malware sample is represented as a sequence of mnemonics (an opcode sequence) without considering the operands. Initially the approach might appear to be trivial, but metamorphic malware variants cannot undergo total transformation. Our assumption is that some opcode(s) may be replaced with equivalent opcode(s), but a complete change is impossible if functionality is to be preserved. It can be inferred that variants preserve some base malicious code which is transformed by the engine to produce new variant(s). Thus, using sequence alignment techniques, opcode sequences are arranged:

• To determine similarity amongst malware samples.
• To explore frequently occurring patterns in a family of malware. These patterns depict maliciousness.
• To store, retrieve and compare malicious opcode sequences.

The basic approaches to sequence alignment can be broadly categorized as:
1. Global Sequence Alignment
2. Local Sequence Alignment
The global alignment technique aligns sequences over their complete length. This method is particularly useful when the sequences are of more or less similar length. Local sequence alignment, on the other hand, compares segments of all possible lengths to optimize the similarity measure. Local alignment is mainly used when the query sequences have dissimilar sizes. Multiple sequence alignment (MSA) is another form of alignment, used to align three or more sequences. MSA is used to identify conserved regions across a group of sequences. In this work, progressive MSA is implemented using the evolutionary relationship among sequences. The following sections introduce the sequence alignment methods (global, local and MSA).

3.1

Global Alignment

Global alignment is used to align sequences end to end. Figure 3.1 shows a global alignment of two DNA sequences X and Y, with the matches, mismatches and gaps introduced by the global alignment method. Two well known methods of global alignment are (a) Needleman-Wunsch and (b) Levenshtein or edit distance. These methods are briefly discussed in the following subsections.


Figure 3.1: Global Alignment for DNA Sequences

3.1.1

Needleman-Wunsch Method

The Needleman-Wunsch method [20] determines the globally optimal alignment between two sequences X and Y. The basic steps involved in aligning opcode sequences are as follows.

Initialization: In this step a score matrix and a trace back matrix, each of size $(M + 1) \times (N + 1)$, are created, where M and N are the lengths of the two sequences. Let the score and trace back matrices be S and T. The origin cell S(0, 0) is set to 0 and the rest of the first row and first column are filled with the boundary values given below.

Populate Score Matrix: The score of each cell S(i, j) is determined by the scores of its three neighboring cells (top, diagonal and left). In addition to filling the score matrix, the trace back matrix is populated with the directions left (L), diagonal (D) and up (U). The trace back matrix records the direction of the neighboring cell that yields the maximum value for the new cell S(i, j). Thus, S(i, j) is computed as follows:

\[ S(i, 0) = i\,\delta, \qquad S(0, j) = j\,\delta \]
\[ S(i, j) = \max\bigl( S(i-1, j-1) + \sigma(X[i], Y[j]),\; S(i-1, j) + \delta,\; S(i, j-1) + \delta \bigr) \]

where $\sigma(X[i], Y[j])$ is the match/mismatch score for aligning characters X[i] and Y[j], and $\delta$ is the gap penalty.

Traceback: The traceback step recovers the alignment from the trace back matrix. It starts at the bottom-right cell T(M, N) and proceeds until the first row or column is reached. Each cell with direction D depicts a match or mismatch, while cells with direction L or U depict a gap introduced in one of the sequences.
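The three steps above can be sketched in Python as follows. This is a minimal illustration, not the implementation used in this work: the scoring values (`match=1`, `mismatch=-1`, `gap=-1`), the function name and the use of "-" as the gap symbol are assumptions chosen for the example. Ties between directions are broken in the order diagonal, up, left.

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, gap=-1):
    """Globally align two opcode sequences; returns (score, aligned_x, aligned_y)."""
    m, n = len(x), len(y)
    score = [[0] * (n + 1) for _ in range(m + 1)]
    trace = [[None] * (n + 1) for _ in range(m + 1)]
    # Initialization: boundary cells accumulate gap penalties.
    for i in range(1, m + 1):
        score[i][0], trace[i][0] = i * gap, "U"
    for j in range(1, n + 1):
        score[0][j], trace[0][j] = j * gap, "L"
    # Populate the score and trace back matrices.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            candidates = [
                (score[i - 1][j - 1] + sub, "D"),
                (score[i - 1][j] + gap, "U"),
                (score[i][j - 1] + gap, "L"),
            ]
            score[i][j], trace[i][j] = max(candidates, key=lambda c: c[0])
    # Traceback from the bottom-right cell to the origin.
    ax, ay = [], []
    i, j = m, n
    while i > 0 or j > 0:
        move = trace[i][j]
        if move == "D":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif move == "U":
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return score[m][n], ax[::-1], ay[::-1]

s, ax, ay = needleman_wunsch(["mov", "push", "pop", "add"], ["mov", "pop", "add"])
```

In this example the second sequence receives one gap opposite `push`, giving three matches and one gap.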

3.1.2

Levenshtein distance

The Levenshtein distance, also known as edit distance, is an approximate string matching measure that can be used to find occurrences of a pattern in a text. Here it is used to determine the similarity between two sequences. The edit distance is the minimum number of operations (insertions, deletions and substitutions) required to transform one opcode sequence into the other. A common way of computing it is dynamic programming. The Levenshtein distance algorithm for two strings string1 and string2 of lengths m and n is shown below:
1. Create a distance matrix with m + 1 rows and n + 1 columns.
2. Initialize the first column as [0 ... m] and the first row as [0 ... n].
3. For each pair of symbols string1[i] and string2[j]: if string1[i] = string2[j] the cost is 0, otherwise the cost is 1. The value of cell distanceMatrix[i, j] is the minimum of distanceMatrix[i-1, j] + 1, distanceMatrix[i, j-1] + 1 and distanceMatrix[i-1, j-1] + cost.
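A direct Python rendering of these steps is given below as a sketch. The function name is illustrative; the sequences may be strings or lists of opcode mnemonics.

```python
def levenshtein(string1, string2):
    """Edit distance between two sequences using the full distance matrix."""
    m, n = len(string1), len(string2)
    # Step 1: (m + 1) x (n + 1) matrix.
    distance = [[0] * (n + 1) for _ in range(m + 1)]
    # Step 2: first column 0..m and first row 0..n.
    for i in range(m + 1):
        distance[i][0] = i
    for j in range(n + 1):
        distance[0][j] = j
    # Step 3: fill the remaining cells.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if string1[i - 1] == string2[j - 1] else 1
            distance[i][j] = min(
                distance[i - 1][j] + 1,         # deletion
                distance[i][j - 1] + 1,         # insertion
                distance[i - 1][j - 1] + cost,  # match or substitution
            )
    return distance[m][n]

d = levenshtein(["mov", "push", "pop"], ["mov", "pop"])
```

Because the comparison is element-wise, the same function works on opcode lists and on plain character strings.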

3.2

Local Alignment

Smith-Waterman [22] is a local sequence alignment method which can be used to align sequences of arbitrary length. The score and trace back matrices in the Smith-Waterman method are computed in the same way as in the Needleman-Wunsch method, except that zero is included in the maximization to prevent negative similarity scores. A cell with value zero indicates no similarity.

For any two sequences X and Y the score matrix is populated using the equation given below:

\[ S(i, j) = \max\bigl( S(i-1, j-1) + \sigma(X[i], Y[j]),\; S(i-1, j) + \delta,\; S(i, j-1) + \delta,\; 0 \bigr) \]

where S is the score matrix, $\sigma$ is the match/mismatch score and $\delta$ is the gap penalty. The region of highest similarity is found by locating the maximum score in the score matrix. The aligned sequences are retrieved by following the directions in the trace back matrix, starting from the cell with the maximum value. Figure 3.2 depicts a local alignment of DNA sequences.

Figure 3.2: Local Alignment for DNA Sequences
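The local alignment procedure can be sketched in Python as below. The scoring values (`match=2`, `mismatch=-1`, `gap=-1`), the function name and the "-" gap symbol are assumptions made for illustration and are not taken from this work.

```python
def smith_waterman(x, y, match=2, mismatch=-1, gap=-1):
    """Best local alignment of two sequences; returns (score, aligned_x, aligned_y)."""
    m, n = len(x), len(y)
    score = [[0] * (n + 1) for _ in range(m + 1)]
    trace = [[None] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            candidates = [
                (0, None),  # zero floor: no similarity ends the local alignment
                (score[i - 1][j - 1] + sub, "D"),
                (score[i - 1][j] + gap, "U"),
                (score[i][j - 1] + gap, "L"),
            ]
            score[i][j], trace[i][j] = max(candidates, key=lambda c: c[0])
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the maximum cell until a zero cell is reached.
    ax, ay = [], []
    i, j = best_pos
    while score[i][j] > 0:
        move = trace[i][j]
        if move == "D":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif move == "U":
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, ax[::-1], ay[::-1]

best, ax, ay = smith_waterman(
    ["nop", "mov", "push", "pop", "jmp"],
    ["add", "mov", "push", "pop", "ret"],
)
```

Here the differing first and last opcodes are left out of the alignment, and only the shared region `mov push pop` is reported.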

3.3

Multiple Sequence Alignment Method

The multiple sequence alignment (MSA) method is used to align more than two sequences at a time. An MSA can be built up by repeatedly applying global or local alignment to pairs of sequences and then aligning the resulting alignments with further sequences. In the proposed methodology (MOMENTUM), MSA is used in particular to determine related functional and structural aspects of opcode sequences in terms of signature(s). Given a set of k malware samples with opcode sequences $M_1, M_2, \ldots, M_k$, gaps are inserted while aligning the opcode sequences so that all of them have the same length. In this way similar opcode subsequences are conserved and the number of gaps is minimized. Figure 3.3 depicts the MSA of five malware sequences. Two common methods of implementing MSA are:
1. Iterative method
2. Progressive alignment method
The iterative method repeatedly realigns the initial sequences while adding new sequences to the growing MSA. The second, which is the most widely used method of building an MSA, uses a heuristic based progressive technique.

Figure 3.3: Multiple Aligned opcode sequences corresponding to malware samples.

3.3.1

Iterative Method

The iterative alignment method builds an initial alignment of the sequences and then repeatedly realigns them, primarily to improve the overall alignment score. A tree is created which depicts the order in which nodes are aligned. The tree is read bottom-up, repeatedly aligning sequences until the root node is visited, which gives the complete alignment for a family. The main advantage of the iterative alignment method is that it is fast and scales to a large number of sequences. Its limitation is that a misalignment, once made, is preserved and propagated to all sequences.

3.3.2

Progressive Alignment

The progressive technique (also known as the hierarchical or tree method) builds up a final MSA by combining pairwise alignments, beginning with the most similar pair and progressing to the most distantly related. The progressive alignment method identifies the most similar instances and aligns them first. Successively less similar instances are then added to the initial alignment. This process is repeated until the combined alignment of the opcode sequences of a malware family is obtained. ClustalW [23] is a progressive alignment technique which is based on a dynamic programming (DP) approach. Figure 3.4 shows the aligned sequences obtained using the progressive alignment method.

Figure 3.4: Progressive Alignment

The basic progressive alignment approach involves three steps:
1. Compute a distance matrix using pairwise alignment for all pairs of malware sequences in a family.
2. Construct a phylogenetic tree using the distance matrix as heuristic. A phylogenetic tree illustrates the evolutionary relationship among various biological species. Figure 3.5 depicts a phylogenetic tree for five different sequences. In this figure, sets of closely related sequences have a common root node. The Neighbour-Joining (NJ) [24] method is used to construct the tree. The phylogenetic tree is used as a guide tree that defines the order in which the sequences are aligned in the next step.
3. Construct the MSA by traversing the guide tree bottom-up, aligning opcode sequences according to their evolutionary relationship, with the most similar ones aligned first followed by the less similar instances.

Figure 3.5: Phylogenetic tree.

Chapter 4 Metamorphic Malware Exploration Technique Using MSA (MOMENTUM)


Metamorphic malware has the ability to modify and replicate itself. It is equipped with a metamorphic engine which generates variants using code obfuscation techniques. The opcode sequence which represents maliciousness is transformed by the metamorphic engine to obscure the infection mechanism. Sequence alignment methods can be used to determine conserved regions of opcodes which are similar across opcode sequences. The mismatches can also be analyzed to determine semantic equivalence of instructions. In this chapter, we discuss the applicability of various sequence alignment methods in the different phases of the proposed Metamorphic Malware Exploration Technique Using MSA (MOMENTUM) for detection and classification of malware executables. Figure 4.1 briefly outlines the implemented method.

4.1

Data acquisition

Figure 4.1: Brief Outline of Method for Metamorphic Malware Detection

Experiments are conducted on malware and benign samples in Portable Executable (PE) [25] format. The malware samples are collected from varied sources, which include synthetic malware created using virus kits like NGVCK, MPCGEN, G2 and PSMPC, and real malware collected from VX Heavens and user agencies. The gathered malware samples were scanned using 14 antiviruses (trial period) and classified into different families. Benign samples are collected from the System 32 folder of a fresh installation of the Windows XP operating system. Some benign samples are collected from different sites and include games, browsers, media players etc. Each benign sample is also scanned using the antiviruses. Since most of the collected malware samples are packed, samples are unpacked using signature based unpackers like PEiD and GUNPacker [3] and a dynamic unpacker like EtherUnpack. The details of unpacking are discussed in Appendix A. Table 4.1 gives the description of the data set used in the experiment.
Table 4.1: Dataset Description
TYPE            SOURCE                           NO. FAMILIES    NO. SAMPLES
Synthetic       NGVCK, G2, PSMPC, MPCGEN         46              1051
Real Malware    User Agencies, Vx Heavens        57              1330
Benign          System 32, Cygwin, games etc.    1               1064

4.2

Analysis of metamorphism in Tools/Real malware
In the proposed work, the metamorphism among malware samples generated with various constructors is analyzed. A similar experiment is conducted on real malware samples collected from Vx Heavens and user agencies. Initially, pairwise alignments are computed for all opcode sequences of the malware samples using global and local alignment methods. Two types of analysis are performed: (a) intra family and (b) inter family analysis. From the intra family pairwise alignment we obtain the distances between samples, a base file and the opcode sequence alignments between the malware samples. The average distance between samples in a family is computed, which is useful for investigating the degree of metamorphism in a family of malware. With the opcode sequence alignments we can determine the types of instructions contributing to obfuscation. Inter family pairwise alignment between the base malware of each family is performed to determine whether different malware families overlap. Figure 4.2 depicts the method of identification of metamorphism in synthetic and real malware. It is also observed that in most cases mov, push and pop instructions are used.

Figure 4.2: Method for Investigation of Metamorphism.

4.2.1

Type of obfuscation

A metamorphic engine makes use of instruction substitution or permutation as a means of obfuscation. The affected opcodes appear as a mismatch or gap in the alignment and mark a point of mutation. In the malware families studied, both single and multiple instruction replacements are observed. These replacements are made by the metamorphic engine while maintaining the functionality of the variants of a family, in order to evade detection. Table 4.2 lists some of the instructions used for obfuscation in the collected malware samples.

4.2.2

Identification of Base Malware

The Sum of Pair (SOP) alignment method computes the pairwise alignment between every pair of opcode sequences. Three sequences can be aligned at a time by constructing a cube like structure. This method imposes constraints on the system with respect to memory and running time.

Table 4.2: Replacement of opcodes for malware generators (NGVCK, G2, PSMPC, MPCGEN). For all generators, mov, push, pop and jump instructions are replaced.
NGVCK G2 PSMPC add mov int call jnz loop push mov mov pop mov pop lea mov call mov xor cwd mov sub mov movsb push add rep movsb mov xor xor mov and mov cwd mov mov jz int inc mov cmp movsb movsw MPCGEN mov pop cmp mov int mov mov lea jmp int call add add movsw lea jmp movsw mov push pop


To align three sequences the running time complexity is $(2^3 - 1)n^3$ or $O(n^3)$. In general, for k sequences the running time complexity is $O((2^k - 1)n^k)$ or $O(2^k n^k)$. Thus, alignment between two sequences can be extended to k sequences, but the running time increases exponentially. A method known as the Star Sequence Alignment method is therefore used to align multiple sequences. In this method a malware sample $M_c$ is selected as the central or base file. Then the optimal alignment of every instance $M_i$ with $M_c$ is computed, and each new sample is aligned with the base file by inserting gaps, to finally form the multiple aligned sequence. Figure 4.3 depicts the pairwise alignment of the samples and the selection of the central file using the Sum of Pairs method.
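Selection of the central file can be sketched as choosing the sample whose summed pairwise distance to all other samples is smallest. The Python sketch below is an illustration under assumptions: it uses edit distance as the pairwise measure, and the function names `levenshtein` and `select_center` are hypothetical rather than taken from this work.

```python
def levenshtein(a, b):
    """Edit distance between two sequences, keeping only two matrix rows."""
    prev = list(range(len(b) + 1))
    for i, sa in enumerate(a, start=1):
        curr = [i]
        for j, sb in enumerate(b, start=1):
            cost = 0 if sa == sb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def select_center(sequences):
    """Return (index of the central sequence, summed distances per sequence)."""
    totals = [sum(levenshtein(s, t) for t in sequences) for s in sequences]
    center = min(range(len(sequences)), key=totals.__getitem__)
    return center, totals

family = [
    ["mov", "push", "pop"],
    ["mov", "push", "pop", "ret"],
    ["mov", "pop", "ret"],
]
center, totals = select_center(family)
```

This needs only one pairwise comparison per pair of samples, in contrast to the exponential cost of aligning all k sequences simultaneously.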

4.3

Signature Modeling and Testing

In this phase of the method, signature(s) are extracted from the data set. The data set is initially partitioned into a train and a test set. Signatures for each family are extracted from the MSA of the opcode sequences of that family. Figure 4.4 depicts the phases involved in modeling the signatures.

4.3.1

Single Signature

The opcode sequences of each malware family are aligned using MSA. From each row of the aligned MSA, an opcode that appears in at least 60% of the samples in that row is preserved. The combination of all such opcodes from all rows of an MSA is considered the single signature for a family. Figure 4.5 shows a single signature extracted from an MSA of opcode sequences.

Figure 4.3: Malware samples arranged in a star like fashion, with M2 as the base sample, M1 the closest and M5 the farthest sample from the base. The closest sample is the most similar to the base malware sample.

Figure 4.4: Signature Modelling and Testing

Figure 4.5: Extraction of single signature.
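Extraction of a single signature can be sketched as a majority vote over aligned positions. In the Python sketch below the MSA is assumed to be a list of equal-length opcode lists, one per sample, with "-" as the gap symbol; the function name, the data layout and the decision to never keep a gap in the signature are assumptions made for illustration.

```python
from collections import Counter

def single_signature(msa, threshold=0.6, gap="-"):
    """Keep, per aligned position, the opcode present in at least `threshold` of samples."""
    signature = []
    for position in zip(*msa):  # one tuple of opcodes per aligned position
        opcode, count = Counter(position).most_common(1)[0]
        if opcode != gap and count / len(position) >= threshold:
            signature.append(opcode)
    return signature

msa = [
    ["mov", "push", "pop", "add", "ret", "nop"],
    ["mov", "push", "pop", "add", "ret", "inc"],
    ["mov", "xor",  "pop", "add", "ret", "dec"],
    ["mov", "push", "pop", "sub", "ret", "nop"],
    ["mov", "push", "-",   "add", "ret", "-"],
]
signature = single_signature(msa)
```

Positions where no opcode reaches the threshold, such as the last one in this example, contribute nothing to the signature.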

4.3.2

Group Signature

Each malware family is subdivided into a number of smaller groups based on the phylogenetic tree. All samples which are close in terms of distance are grouped to form a subgroup. A subgroup may contain two or more samples; its opcode sequences are aligned using MSA and a single signature is extracted for each subgroup. Thus, for k subgroups we obtain k signatures. An MSA of the k signatures is then created and a wildcard based signature is retained. This signature is referred to as the group signature. The main advantage of a wildcard based group signature is that it saves time during the testing phase; otherwise a test sample would need to be checked against i prominent signatures out of the k subgroup signatures, where i < k. Figure 4.6 shows the wildcard representation of a group signature, where Mt is the malware test sample.

Figure 4.6: Wildcard based representation of Group signature.


4.3.3

Testing

The last module of MOMENTUM determines the family to which an unseen sample (malware or benign) belongs. This is determined by aligning the test sample against the single and group signatures of each family. An unseen sample is said to belong to a family if aligning it with the family's signature(s) yields a high score or a low distance. A threshold is determined for each malware family, and samples in the test set are detected using each type of signature. To compute the threshold for a family, both the malware and the benign samples in the training set are considered. Each variant and each benign sample is matched with the signature(s) and a score is determined. A higher score represents a better match with a signature. The threshold th for a family is determined as follows:

\[ th = \frac{B_{max} + M_{min}}{2} \]

where $B_{min}$ and $B_{max}$ are the minimum and maximum scores of benign samples against the signature(s). Similarly, $M_{min}$ and $M_{max}$ are the lowest and highest scores of malware samples against the signature(s). A test sample t is considered benign if the score obtained by aligning it with the signature is less than the threshold th; otherwise the sample is flagged as malware.
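The threshold rule can be written out as a short Python sketch. The function names and the score values are hypothetical and serve only to show the arithmetic.

```python
def family_threshold(benign_scores, malware_scores):
    """Midpoint between the highest benign score and the lowest malware score."""
    return (max(benign_scores) + min(malware_scores)) / 2

def classify(score, threshold):
    """A sample scoring below the threshold is benign, otherwise malware."""
    return "benign" if score < threshold else "malware"

# Hypothetical training scores against one family's signature.
th = family_threshold(benign_scores=[10, 25, 40], malware_scores=[60, 75, 90])
label = classify(49, th)
```

When the benign and malware score ranges overlap, the midpoint still exists but some training samples fall on the wrong side of it, which is where false positives and false negatives come from.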

Chapter 5 Result and Inferences


The experiments are performed on a machine with an Intel Core i7 870 processor and 8GB RAM. Tools such as the IDA Pro disassembler, GUNPacker and Ether are installed on the machine and used for purposes such as (a) packed executable analysis and (b) disassembling code. The data set consists of synthetic and real malware families. Malware samples are collected from the VX Heavens repository and user agencies, and some have been constructed using malware constructors like NGVCK (Next Generation Virus Construction Kit), G2, PSMPC and MPCGEN. The different phases of the experiments are as follows.
1. Dataset preparation: Collected samples of malware and benign executables are scanned using 14 antiviruses. Using the scan reports of the antiviruses, malware executables are separated into different families. The entire data set is divided into two parts, one for training and the other for testing. Executables are disassembled using the IDA Pro disassembler to obtain their assembly code, and mnemonics are extracted from the assembly representation of each malicious or benign file.
2. Validation of obfuscation: From each representative malware family a central or base file is selected. Sequence alignment techniques are applied within the family to obtain alignments for each pair of samples. The alignments depict points of match and mutation. The total number of mutations in the malware dataset is estimated.
3. Metamorphism in Malware Tools: Inter family pairwise analysis is performed among the base samples selected for each family. If the distance between any two base malware is very small, the families are considered to overlap.
4. Signature Modelling: Two types of signature are extracted from the MSA of each malware family. These signatures are referred to as (a) single and (b) group. A training model is prepared with the malware and benign samples in the dataset and a threshold is determined for each malware family. Unseen samples (of the test set) are tested using the threshold determined during training and the evaluation metrics are computed.

5.1

Evaluation Metrics

Experimental results are evaluated using the metrics TPR, TNR, FPR and FNR. These metrics are computed from the True Positives (TP), True Negatives (TN), False Positives (FP) and False Negatives (FN). TP is the number of malware samples correctly classified as malware, TN is the number of correctly classified benign instances, FP is the number of benign samples incorrectly classified as malware and FN is the number of malicious samples classified as benign. The performance of any detector or scanner can be measured primarily by the True Positive Rate (TPR) and the True Negative Rate (TNR), which are also known as sensitivity and specificity respectively.
1. True Positive Rate (TPR): $TPR = TP/(TP + FN)$
2. False Positive Rate (FPR): $FPR = FP/(FP + TN)$
3. True Negative Rate (TNR): $TNR = TN/(TN + FP)$
4. False Negative Rate (FNR): $FNR = FN/(FN + TP)$
For a protection system, high values of TPR and TNR along with low FPR and FNR are required. This ensures that the scanner is capable of correctly identifying samples as malware or benign.
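The four rates follow directly from the confusion counts, as the Python sketch below shows. The function name and the counts in the example are hypothetical and are not results from this work.

```python
def rates(tp, tn, fp, fn):
    """Compute TPR, FNR, TNR and FPR from confusion-matrix counts."""
    return {
        "TPR": tp / (tp + fn),  # sensitivity
        "FNR": fn / (fn + tp),
        "TNR": tn / (tn + fp),  # specificity
        "FPR": fp / (fp + tn),
    }

# Hypothetical counts: 100 malware and 100 benign test samples.
r = rates(tp=90, tn=80, fp=20, fn=10)
```

By construction TPR + FNR = 1 and TNR + FPR = 1, which gives a quick consistency check on any reported set of metrics.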

5.2

Intra Family Analysis

Figure 5.1 shows the intra family analysis for malware constructors.

Figure 5.1: Intra Family Analysis of malware (Synthetic and Real).

From the graph we can observe the following:
• Non zero values indicate the presence of metamorphism in synthetic data.
• Levenshtein distance is high due to junk code insertion.
• In spite of high values of global distance, local distances are low in most of the samples. This indicates the presence of similar regions in the code.

5.3

Inter Family Analysis

Inter family analysis is performed by comparing the base samples of different families. Figure 5.2 shows the inter family analysis of malware families.


Figure 5.2: Inter Family Analysis of malware (Synthetic and Real).

• The inter family distance is less than the intra family distance. This indicates that most of the malware share some base code and could be detected using a common signature.
• Levenshtein distance is relatively high in comparison with the local and Needleman-Wunsch alignments, because the variable functionality of the code increases the number of gaps in the alignment.

5.4

Comparative Analysis

This section presents a comparative analysis of the different types of samples based on (a) alignment per sample, (b) average sum of distance and (c) degree of obfuscation (refer Table 5.1).
Table 5.1: Comparative Analysis of Malware Samples
Virus Type    Replacement/Alignment    Avg. SoD    Obfuscation
NGVCK         47                       1.03        Average Simple
G2            3                        1.45        Low Simple
MPCGEN        31                       0.61        Average Simple
PSMPC         1                        1.35        Low Weak
Vx Heavens    122                      8.3         Large Complex

• Viruses generated using tools belong to the same family.
• Families of real malware are distinct.
• In PSMPC, loop and jump instructions contribute to obfuscation; this increases the distance between samples.
• NGVCK viruses overlap with real malware (Savior).
• mov, add, sub, push and pop have most often been replaced with equivalent instructions.
• Obfuscation is primarily replacement of a single instruction rather than of multiple instructions. This is validated by observing the global and local alignments of samples.
• The types of mismatch in global and local alignment are the same, suggesting less complex obfuscation.

5.5 Testing with Signature

The malware samples are separated into a number of families using the scanners. For each malware family two types of signature (single and group) are extracted. The single signature is the maximum preserving opcode sequence in the multiple sequence alignment (MSA) of a family of malware. Each row of the MSA depicts the matches, mismatches and gaps corresponding to the opcode sequences. The group signature is the wildcard representation of the signatures of the subfamilies in a family. Table 5.2 shows the values of the evaluation metrics for the different types of signature.

Table 5.2: Evaluation Metrics for different types of signatures.

Types of Signatures    TPR     FNR      TNR     FPR
Single                 0.95    0.046    0.48    0.52
Group                  0.73    0.27     0.99    0.01
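For reference, the four rates reported in Table 5.2 are assumed here to follow the standard confusion matrix definitions. The sketch below computes them from raw counts; the counts in the example are purely illustrative and are not the sample counts of the experiments.

```python
def rates(tp, fn, tn, fp):
    """Standard rates from confusion matrix counts."""
    return {
        "TPR": tp / (tp + fn),  # malware correctly detected
        "FNR": fn / (tp + fn),  # malware missed
        "TNR": tn / (tn + fp),  # benign correctly passed
        "FPR": fp / (tn + fp),  # benign flagged as malware
    }


# Illustrative counts only (50 malware and 50 benign test samples).
example = rates(tp=40, fn=10, tn=45, fp=5)
```

By construction TPR + FNR = 1 and TNR + FPR = 1, which gives a quick consistency check on each row of the table.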

It is observed that the detection rate with the single signature is approximately 95% with an FPR of 52% (Table 5.2). This indicates that most of the malware samples are detected but many benign samples are incorrectly classified as malware. Since the single signature is constructed by extracting the maximum preserving (55%) opcodes in an MSA row, the opcodes responsible for mutations are lost in the signature (they appear to be less dominant). Thus, most of the benign samples in the test set score well against the signature and are detected as malware.

In the case of the group signature, a detection rate of 73% is obtained with a very low false positive rate (FPR = 0.01). This indicates that malware samples in the test set are detected by the wildcard representation of the signature. The group signature depicts the wildcard representation of the signatures of the subfamilies of a family. The opcode sequences present in this signature are absent in benign samples; thus, benign samples can be discriminated from the malware samples.
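The two signature types can be sketched as follows. This is a simplified reading of the description above, not the MOMENTUM implementation: it assumes the rows are already aligned to equal length with "-" marking gaps, interprets "maximum preserving (55%)" as a per-column majority threshold, and uses "*" as a hypothetical wildcard symbol.

```python
from collections import Counter

GAP = "-"        # gap symbol in the aligned rows (assumed)
WILDCARD = "*"   # wildcard symbol in the group signature (assumed)


def single_signature(aligned_rows, threshold=0.55):
    """Keep the dominant opcode of each MSA column if it covers >= threshold of rows."""
    signature = []
    for column in zip(*aligned_rows):
        opcode, count = Counter(column).most_common(1)[0]
        if opcode != GAP and count / len(column) >= threshold:
            signature.append(opcode)
    return signature


def group_signature(subfamily_signatures):
    """Merge equal-length signatures; columns that disagree become wildcards."""
    return [column[0] if len(set(column)) == 1 else WILDCARD
            for column in zip(*subfamily_signatures)]


# Hypothetical aligned opcode rows of one subfamily.
rows = [
    ["push", "mov", "-",   "add", "ret"],
    ["push", "mov", "nop", "sub", "ret"],
    ["push", "mov", "-",   "add", "ret"],
]
single = single_signature(rows)
group = group_signature([single, ["push", "mov", "sub", "ret"]])
```

In this toy example the mutated opcode (sub) disappears from the single signature, while the group signature keeps its position as a wildcard. Note that group_signature assumes signatures of equal length; longer inputs would be silently truncated by zip.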

5.6 Comparative Analysis with Antiviruses

The entire dataset was scanned using 14 antiviruses and the detection rate was computed from their scan reports. Figure 5.3 depicts the detection rates obtained from the antiviruses and from MOMENTUM. The top five detection rates were obtained with Avast, Avira, AVG, GData and Kaspersky (arranged in ascending order of detection rate). It was observed that the detection rate of MOMENTUM is close to that of the top three commercial antivirus products. Some of the malicious files (37 malware in total) were not detected by any of the antiviruses.

Figure 5.3: Detection rate of antiviruses compared with different types of constructed signature.

Out of the 37 malware executables that went undetected by the different antiviruses, our implementation (MOMENTUM) detected 30 with the single signature and 20 with the group signature (wildcard signature). The effectiveness of the method suggests that bioinformatics sequence alignment methods could be used effectively to detect malware. These methods could also be used for generating malware signatures and for assisting scanners in detection.

Chapter 6 Conclusions and Future Work


Malicious software (malware) is a major threat to computer systems, and malware detection mechanisms have become a prominent topic of research. The number of malware has increased at an alarming rate because malware writers are deploying obfuscation methods. Non-signature based detection methods are important because malware writers are producing metamorphic or polymorphic malware. Thus, a strong signature based method is also required to detect this modern malware.

In this thesis the problem of detecting metamorphic malware is addressed using MSA methods. Signatures (single and group) for a malware family are extracted and tested using unseen samples. Metamorphism amongst malware constructors and real malware is explored. It was found in this investigation that the malware constructors used minimal obfuscation, mainly single and multiple instruction replacement. Primarily, the obfuscation found was code reordering. The detection rate of the implemented method (MOMENTUM) is also compared with that of antiviruses. It was observed that the unseen samples were detected using the signatures with low false positives. Also, the detection rate of the implemented method is comparable with that of antiviruses like Avast, Avira and AVG. Some of the malware executables that went undetected by all the antiviruses were detected by MOMENTUM.

In continuation of the present work, a suitable scoring scheme could be devised to identify unseen samples. This could be initiated by assigning weights to the mnemonic pairs that are responsible for mutation. Also, the operands of instructions could be considered to improve detection rates.

Appendix A Executable Unpacking


A packer is a program used to compress or encrypt an executable, thereby reducing its size and protecting it from reverse engineering. Most packers are dependent on a specific file format such as Portable Executable (PE) or Dynamic Link Library (DLL). The packed executable is restored to its original form once it is loaded in memory. Malware authors use packers to avoid detection by antivirus products, as the malicious code is hidden from the scanners. Basically, we can think of a packer as software which places an executable inside another executable; the outer executable is responsible for unpacking the original executable hidden by the packer. The basic function of packers is to encrypt the code, resources and import table. Executable packers insert a random number of jump instructions in order to confuse disassemblers. Advanced packers also encrypt the PE sections so that the antivirus virtually fails to scan the actual malicious code.

Static analysis of packed code is not possible, as the malicious payload is unpacked at runtime. An antivirus using a sandbox environment can unpack the executable by executing each suspicious sample; however, unpacking executables is computationally expensive. If the malware is analyzed for detection without being unpacked, we basically scan the packer code instead of the malicious executable code. Unpacking could be performed using a generic unpacker like GUNPacker [3]. The basic problems with these signature based unpackers are (a) packer signatures need to be updated periodically and (b) difficulty in the detection of multiple layer packed executables.


Another way of unpacking software is to use Ether Unpack [4]. The main problem with Ether is that it requires a dedicated operating system and hardware. Initially, the sample to be unpacked is executed in the guest operating system (Windows XP SP2) and Ether tries to locate all memory writes performed by the executing process. Whenever a memory write operation is performed, the process dump is stored under the images directory. Ether considers each memory write operation as a candidate Original Entry Point (OEP). Figure A.1 depicts the process of unpacking executables (malware/benign) using signature based packers and Ether Unpack.

Figure A.1: Portable Executable Unpacking Procedure

A.1 Symptoms of Packed Malicious Executables

Packed PE files can be detected using signature based, heuristics based or dynamic unpackers. Native and packed malicious code have some differences, which are listed below.


(i.) Nonstandard section names: Most compilers and linkers follow a convention for naming the sections. Executable packers add nonstandard section names like .upx0, .upx1 etc. to the packed code.
(ii.) Small code section: The packed code contains a small code section with a populated data section. Disassemblers also expose the code of the stub instead of the actual code.
(iii.) Missing string table: The string table or symbol table is used by most compilers to store the addresses of symbols instead of maintaining multiple strings in the table. The packer normally encrypts the strings and inserts a garbage address corresponding to each string in the string table.
(iv.) Small import table size: A native executable has populated entries in the Import Address Table (IAT), one for each API. Packed PE samples have a small import table with few imports of common APIs like GetProcAddress or LoadLibrary.
(v.) Execution of code starts at the last section: The PE file is divided into logical structures called sections, such as data, code, reloc etc. Some malware packers hide the original entry point and add a new section, possibly at the end of all the sections.
(vi.) Section characteristics: The characteristics are the flags for each section describing the permissions allotted to that section. The code section normally has its characteristics flag set as executable but lacks write permission. Malware packers either set both execute and write permissions or leave the permissions as 0.
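Some of the symptoms above can be checked mechanically. The following is a minimal sketch covering symptoms (i.), (iv.), (v.) and (vi.) only; it assumes the section names, section characteristics, entry point section and import count have already been extracted from the PE header by some other tool. The list of standard section names and the import threshold are illustrative assumptions, while the two flag values are the execute and write bits of the PE section characteristics field.

```python
IMAGE_SCN_MEM_EXECUTE = 0x20000000  # section is executable
IMAGE_SCN_MEM_WRITE = 0x80000000    # section is writable

# Assumed list of conventional section names; not exhaustive.
STANDARD_SECTIONS = {".text", ".data", ".rdata", ".bss", ".idata",
                     ".edata", ".rsrc", ".reloc", ".tls"}


def packing_symptoms(sections, entry_section, import_count, min_imports=5):
    """Return the packing symptoms found.

    sections: list of dicts with 'name' and 'characteristics', in file order.
    min_imports: hypothetical threshold for a 'small' import table.
    """
    symptoms = []
    names = [s["name"] for s in sections]
    if any(name.lower() not in STANDARD_SECTIONS for name in names):
        symptoms.append("nonstandard section name")
    if import_count < min_imports:
        symptoms.append("small import table")
    if len(sections) > 1 and entry_section == names[-1]:
        symptoms.append("entry point in last section")
    for s in sections:
        flags = s["characteristics"]
        writable_code = (bool(flags & IMAGE_SCN_MEM_EXECUTE)
                         and bool(flags & IMAGE_SCN_MEM_WRITE))
        if writable_code or flags == 0:
            symptoms.append("suspicious section characteristics")
            break
    return symptoms


# Hypothetical section tables for a native and a packed sample.
native = [{"name": ".text", "characteristics": 0x60000020},
          {"name": ".data", "characteristics": 0xC0000040}]
packed = native + [{"name": ".upx1", "characteristics": 0xE0000020}]

native_report = packing_symptoms(native, ".text", import_count=40)
packed_report = packing_symptoms(packed, ".upx1", import_count=2)
```

Such checks are only heuristics: a legitimate executable may trigger one of them, so the number of symptoms is better treated as a hint than as a verdict.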

A.2 Manual Unpacking of Packed Executable

Packed PE files can be identified using signature based packer detectors, which try to match the packer signature of the executable with the known packer signatures stored in a repository. Another way to find out whether an executable has been packed is to perform entropy analysis of the suspicious file. The entropy of the complete file, or of a few bytes from the beginning of the file, could indicate whether the file is packed or not. Following are the steps adopted to manually unpack malicious code (refer Figure

(i.) The preliminary step is to identify the type of packer used to pack the executable. Once the packer is known, we need to locate the original entry point (OEP) of the executable by executing the suspicious sample.
(ii.) The executable is loaded in OllyDebugger, a breakpoint is set and the program is allowed to execute until it stops at the breakpoint. At this point the memory dump is retrieved. The memory dump contains both the unpacked code and the unpacking stub code.
(iii.) The entry point of the dumped executable still points to the starting address of the packer. Since it is required that the unpacked code be executed first, followed by the unpacker code, the entry point is calculated as
RVA Entry Point = OEP - Base Address
(iv.) Finally, the import table is reconstructed by specifying the proper RVA entry point. This rebuilds the import address table.
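Two parts of this section lend themselves to a short sketch: the entropy analysis mentioned before the steps, and the entry point formula of step (iii.). The code below is illustrative only. It computes the Shannon entropy of a byte string in bits per byte (values close to 8 are typical of compressed or encrypted data) and applies RVA Entry Point = OEP - Base Address; the addresses in the example are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(data).values())


def entry_point_rva(oep: int, base_address: int) -> int:
    """RVA Entry Point = OEP - Base Address, as in step (iii.)."""
    if oep < base_address:
        raise ValueError("OEP lies below the image base address")
    return oep - base_address


uniform = shannon_entropy(bytes(range(256)))  # every byte value exactly once
constant = shannon_entropy(b"\x00" * 64)      # a single repeated byte value
rva = entry_point_rva(0x00401000, 0x00400000) # hypothetical OEP and base
```

In practice the entropy would be computed over the file contents or over individual sections, and any cut-off used to call a file packed is a tuning choice rather than a fixed rule.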

A.3 Executable Unpacking using Ether

Malware analysis using a sandbox or controlled environment is gaining prominence. The main reason is that the malware is executed in a guest operating system using the concept of virtualization. In system virtualization the hypervisor or Virtual Machine Monitor (VMM) manages multiple operating systems and the resources. The primary advantage of analyzing in a controlled environment is that the host operating system remains unaffected. We have used Ether patched XEN, which is capable of executing multiple virtual machines, each having its own operating system, on a physical system. The performance of XEN is thus close to that of a native system. The XEN based hypervisor acts as an interface between the guest operating system, which in our case is Windows XP SP2, and the host operating system (Debian Lenny). In order to boot a guest operating system a hypervisor basically requires (a) a disk, (b) a kernel image and (c) a configuration file consisting of the IP address, amount of memory to use etc.

Following are some elements of a hypervisor:

Hypervisor Layer: The hypervisor layer glues the guest and the host operating systems. Using the hypercall layer, the guest operating system interacts with the host operating system.

Interrupt Handler: This component of a typical hypervisor routes interrupts to/from a guest operating system and virtual devices. Likewise, the hypervisor is designed to identify and understand faults or exceptions occurring in a specific guest operating system. A fault occurring in a guest is not transferred to the hypervisor, so that it does not interrupt the working of the hypervisor.

Page Mapper: This is the core module of a hypervisor, which maps the hardware to pages of a specific guest operating system.

Scheduler: The component which transfers control back and forth between the multiple guest operating systems and the scheduler.

To analyze the malware dynamically, the malware is executed on the DomU machine (XP SP2) and its footprints are recorded on the Dom0 system (Debian Lenny) as shown:

$ ether 1 unpack userspace Virus.Win32.Evol.a /root/Desktop/Virus.Win32.Evol.a

where
1 is the Domain ID of the running machine, obtained using xm list
Virus.Win32.Evol.a is the process to be analyzed
/root/Desktop/Virus.Win32.Evol.a is the local path of the malware on Debian Lenny (Dom0).

Ether, as shown in the figure, dumps the sample by finding the Original Entry Point using the memory writes the program performs. The dumped sample can be found in the images directory of Ether.


Figure A.2: Userspace Unpacking using Ether

Publications
1. Vinod P., V. Laxmi, M.S. Gaur, Grijesh Chauhan, Malware Detection using Non-Signature based Method, In Proceeding of IEEE International Conference on Network Communication and Computer-ICNCC 2011, pp-427-43, DOI:978-1-4244-9551-1/11.
2. Vinod P., V. Laxmi, M.S. Gaur, Grijesh Chauhan, Detecting Malicious Files using Non-Signature based Methods, To appear in Oxford Computer Journal.


Bibliography
[1] Md. Enamul Karim, Andrew Walenstein, and Arun Lakhotia (2005) Malware Phylogeny Generation using Permutations of Code. Journal in Computer Virology, 1(1-2):13-23.
[2] VX Heavens: http://vx.netlux.org/lib
[3] GUNPacker: http://www.woodmann.com/collaborative/tools/index.php/GUnPacker
[4] Ether Unpacker: http://ether.gtisc.gatech.edu/
[5] VMPacker: http://www.leechermods.com/2010/01/vmunpacker-16-latest-version.html
[6] The IDA PRO Disassembler. http://www.datarescue.com/idabase
[7] Daniel Bilar. Opcodes as Predictor for Malware. International Journal of Electronic Security and Digital Forensics (2007), 1(2):156-168.
[8] Chouchane, Mohamed R. and Lakhotia, Arun (2006) Using engine signature to detect metamorphic malware. In Proceedings of the 4th ACM workshop on Recurring malcode, WORM 06, 73-78, New York, NY, USA.
[9] Qinghua Zhang and Douglas S. Reeves. MetaAware: Identifying Metamorphic Malware. Computer Security Applications Conference, Annual, 0:411-420, 2007.
[10] Mark Stamp and Wing Wong. Hunting for Metamorphic Engines. 2006.
[11] V. Sai Sathyanarayan, Pankaj Kohli, and Bezawada Bruhadeshwar. Signature Generation and Detection of Malware Families. In ACISP 08: Proceedings of the 13th Australasian conference on Information Security and Privacy, pages 336-349, Berlin, Heidelberg, 2008. Springer-Verlag.
[12] Saul B. Needleman and Christian D. Wunsch. A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins. pages 443-453, 1970.
[13] Rachit Mathur, Antony Maida, and Douglas S. Reeves C.E. Palmer Qinghua Zhang. Normalizing Metamorphic Malware Using Term Rewriting. In Proc. of sixth IEEE International Workshop on Source Code Analysis and Manipulation (SCAM 06), pages 75-84, 2006.
[14] Guillaume Bonfante, Matthieu Kaczmarek, and Jean-Yves Marion. Architecture of a Morphological Malware Detector. Computer Virology, pages 263-270, 2009.
[15] Matthieu Kaczmarek, Guillaume Bonfante, and Jean-Yves Marion. Control Flow Graphs as Malware Signatures. 2007.
[16] Christopher Kruegel, Engin Kirda, Darren Mutz, William Robertson, and Giovanni Vigna. Polymorphic Worm Detection using Structural Information of Executables. In RAID, pages 207-226. Springer-Verlag, 2005.
[17] Heejo Lee and Kyoochang Jeong. Code Graph for Malware Detection. In International conference on Information Networking, ICOIN, pages 1-5. IEEE, 2008.
[18] Sanford, Michael, Computer viruses and malware by John Aycock, SIGACT News, 41(1), March 2010, 44-47.
[19] Mona Singh, Phylogenetics, Lecture Notes: princeton.edu/~mona/Lecture/msa1.pdf
[20] Vladimir Likic, The Needleman-Wunsch algorithm for sequence alignment, http://www.ludwig.edu.au/course/lectures2005/Likic.pdf
[21] Péter Ször and Peter Ferrie. Hunting for metamorphic. In Virus Bulletin Conference, pages 123-144, 2001.
[22] T. F. Smith and M. S. Waterman, Identification of common molecular subsequences, In Journal of Molecular Biology, vol. 147, 1, pp 195-197, 1981, http://www.cs.
[23] ClustalW2 - Multiple Sequence Alignment http://www.ebi.ac.uk/Tools/msa/clustalw2/
[24] N Saitou and M Nei, The Neighbor-joining Method: A New Method for Reconstructing Phylogenetic Trees, Oxford Journals, Life Sciences Medicine, Molecular Biology and Evolution, Volume 4, pp 406-425.
[25] Matt Pietrek, An In-Depth Look into the Win32 Portable Executable File Format http://msdn.microsoft.com/en-us/magazine/cc301805.aspx
