Abstract— The aim of malware analysis is to detect whether a file is infected, in order to prevent system intrusion. The goal of this research is to find the optimal machine learning algorithm for predicting whether a file is malicious by applying different machine learning models to a given dataset. To this end, the implementation and accuracy comparisons are carried out with Python libraries, and a summary analysis is then used to suggest the best machine learning model for detecting malware-infected files. This model can then be used as a layer in a bigger neural network for dynamic malware analysis and attack detection and prevention.

Keywords— malware, machine learning, malware analysis, Decision Tree, Support Vector Machines, Random Forest, Linear Regression

I. INTRODUCTION

The use of the Internet and its widespread resources has increased exponentially over the past few decades. This trend has led to multiple services being made available to users over secure connections, including banking, purchases and even data exchange. This is one of the main reasons hackers distribute different kinds of malware to naïve users. When malware infects a particular computing device, be it a mobile device or a desktop, every transaction in which that device is involved becomes compromised. Most malware is spyware or viruses intended to steal confidential information or money in any form. Such malware then empowers hackers to commit various cyber-crimes such as fake e-payments, denial-of-service attacks, illegal hacking, etc. Despite the many anti-malware measures available, many practical attacks have occurred in the past. Because the ever-changing Internet and new technologies have made network and computer penetration easier, keeping users safe has become really difficult for companies that provide anti-malware solutions. Avoiding attacks and providing users with security updates within a limited period of time is another challenge that such companies face.

Malware protection, and removal once a system is infected, is one of the main tasks of any anti-malware company, as even a single attack could lead to the loss of confidential data, money and the system's privacy. Recent advancements in computer science and hardware have made it possible for machine learning algorithms and models to be applied effectively and efficiently to extract useful information from large datasets. This has made it possible to analyze the data extracted from the attributes of malicious portable executables.

A. Detection Methods

Static Analysis:
It is done by analyzing the malware's program code to learn how the malware works. Reverse engineering, using decompiler and disassembler tools, is used to understand the structure of the malware [7]. It includes the following techniques:
1. String Extraction (error messages)
2. Fingerprinting (computing hashes and detecting hardcoded usernames and files)
3. File Metadata (PE headers)

Dynamic Analysis:
When a file is executed, its behavior is recorded along with other information related to the file, such as its properties and the intentions of its creator. It is faster than static analysis of the malware.

B. Why Machine Learning?

Machine learning is needed in order to detect polymorphic malware that changes its signatures, as well as new malware for which signatures have not been created yet. Because heuristic-based detectors are inaccurate at detecting malware, we need to switch to machine learning algorithms combined with a heuristic approach to offer a high accuracy rate. When relying on a heuristics-based approach, there has to be a threshold for malware triggers, defining the number of heuristics required for the software to be called malicious. For example, we can define a set of suspicious features, such as “registry key changed”, “connection established”, “permission changed”, etc. Any software that shows at least five features from that set will be termed malicious. Although this approach provides some level of effectiveness, it is not always accurate, since some features can have greater “weight” than others; for example, “permission changed” usually has a more severe impact on the device than “registry key changed”. In addition, some feature combinations [8] may be more suspicious than the features by themselves. To take these correlations into account and provide more accurate detection, machine learning techniques can be used.

This research was conducted using a dataset with 4900 records of portable executable attributes of malicious files, which is used to predict whether the test files are malicious or not by using different machine learning algorithms. The models were trained using this dataset, and the accuracy was then tested on another 100 records extracted from a mix of malicious and non-malicious files. The dataset contains more than 50 attributes describing file properties of different malicious files. Visualizations were then made of various attributes of this dataset and of how they affect whether a file is classified as malicious. The machine learning algorithms used for static detection are Linear Regression, Decision Tree, Support Vector Machines and Random Forest. On the basis of the detection and prediction summary analysis, conclusions were drawn about which model performs better than the others and which attribute is best for identifying malware.
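As a minimal sketch of the threshold heuristic described under "Why Machine Learning?", the Python snippet below flags a sample once it shows a given number of suspicious features and contrasts this with a weighted score. Only the three quoted feature names and the ordering of "permission changed" above "registry key changed" come from the text; the remaining feature names, all numeric weights, and the weighted threshold are hypothetical values chosen for illustration.

```python
# Hypothetical weights: only the ordering "permission changed" > "registry key
# changed" comes from the text; the numbers themselves are illustrative.
WEIGHTS = {
    "registry key changed": 1.0,
    "connection established": 1.5,
    "permission changed": 3.0,
    "file deleted": 1.0,          # hypothetical feature
    "process injected": 3.0,      # hypothetical feature
    "autorun entry added": 2.0,   # hypothetical feature
}


def count_heuristic(observed, threshold=5):
    """Flag a sample when it shows at least `threshold` suspicious features."""
    hits = set(observed) & set(WEIGHTS)
    return len(hits) >= threshold


def weighted_heuristic(observed, threshold=6.0):
    """Flag a sample when the summed feature weights reach `threshold`."""
    score = sum(WEIGHTS[f] for f in set(observed) if f in WEIGHTS)
    return score >= threshold


sample = ["permission changed", "process injected"]
print(count_heuristic(sample))     # False: only two features observed
print(weighted_heuristic(sample))  # True: 3.0 + 3.0 reaches the threshold
```

The counting rule misses this sample because it treats every feature equally, while the weighted rule catches it. In a machine learning model, such weights (and interactions between features) would be learned from labelled data rather than set by hand.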
II. RELATED WORK

Previously, static and dynamic analysis were used for malware analysis by Distler [1]. Meanwhile, Ari [2] performed malware analysis using reverse engineering techniques and methodologies, with biscuit apt1 as the malware sample. Flores [3] carried out another malware analysis study with win32.Kryptic. Daoud [4] researched the techniques used by malware to avoid detection by antivirus software. The research conducted by Uppal [5] focused more on the techniques and tools used in malware analysis. Most of the literature we came across during our research focused either on the static analysis method or on techniques used for analyzing malware; only a few works made a substantial comparison or drew conclusions about the best possible machine learning algorithm for predicting malicious files. In contrast, our work compares four different machine learning algorithms for malware analysis in depth, using static analysis of portable executable header attributes to obtain more detailed information about the characteristics of malware.
Network monitoring. Uncover which ports are open, collect network traffic and find vulnerabilities. Tools: Fport (Foundstone, 2008), tcpview (Microsoft, 2008c), nessus (Tenable Network Security, 2008), nmap (Insecure.org, 2008), wireshark (Combs, 2008), and snort (Sourcefire, 2008).
Registry monitoring. Monitor registry activities as they occur. Tools: Regmon (Microsoft, 2008c).
Code. Disassembly, debugging. Tools: IDA Pro, OllyDbg (Yuschuk, 2008).

Table I. Summary of malware analysis tools, showing the analysis type, purpose, and names of commonly used tools.
2. A Threat to Cyber Resilience: A Malware Rebirthing Botnet. Brand, Valli, Woodward (2011). The paper presented a threat to cyber resilience in the form of a theoretical model of a malware rebirthing botnet. [16]
3. Lessons Learned from an Investigation into the Analysis Avoidance Techniques of Malicious Software. Brand, Valli, Woodward (2011). The examiner must understand the anti-analysis techniques that can be used and how to mitigate them, the limitations of existing tools, and how to use a suitable analysis methodology to reveal the purpose of malware. [17]
7. TTAnalyze: A Tool for Analyzing Malware. Bayer, Kruegel, Kirda (2006). Presented TTAnalyze, a tool for dynamically analyzing the behavior of Windows executables. [12]
There has been a lot of past work on malware analysis of different binary malware datasets using various machine learning algorithms. These

B. Decision Tree
The Decision Tree algorithm can also be used for classification of a dataset. It uses a tree representation in which each internal node represents an attribute and each leaf node represents a class label. Here the target attribute is the class, and the leaf nodes are malicious and non-malicious, depending on the other attributes.
It also helps us to determine which attribute is most important in declaring a file malicious or not.
Pseudocode:
While splitting the dataset into smaller subsets, the entropy changes; this change in entropy is the Information Gain. The entropy is given by the formula below:

Entropy(a, b) = -[p_a * log(p_a) + p_b * log(p_b)]

where p_a and p_b are the proportions of instances belonging to classes a and b.

Entropy measures the impurity of a subset: it is lowest when all instances in the subset belong to a single class. [18]

C. Support Vector Machine

Another discriminative classification algorithm, the Support Vector Machine, is formally defined by a separating hyperplane. The algorithm works by plotting the data in n-dimensional space. Hyperplanes are then used to distinguish between the different classes. Out of all the hyperplanes in the given space, the hyperplane that is at the maximum distance from both classes is chosen. This is a supervised learning model which outputs an optimal hyperplane. [18]

D. Random Forest

Random Forest is based on the principle of combining many decision trees, resulting in the formation of a forest. It gives more stable and accurate results. Instead of searching for the most important feature while splitting a node, it searches for the best feature among a random subset of features. [18]

E. Static characteristics of PE files

The PE file format was introduced in Windows NT 3.1 as PE32 and further evolved into the PE32+ format for 64-bit Windows operating systems. The Common Object File Format (COFF) header, standard COFF fields such as header, data directories, section table, and Import Address Table (IAT) are the main contents of a PE file.

A finite set of quantitative (integer, real or binary valued), categorical and labelled numerals can be derived from the different types of features. An example of a numerical feature is CPU (in %) or RAM (in megabytes) usage, while a nominal feature can be a file type (like ∗.dll or ∗.exe) or an Application Program Interface (API) function call (like write() or read()). [6]

On Windows NT operating systems, PE currently supports the IA-32, IA-64, x86-64 (AMD64/Intel 64), ARM and ARM64 instruction set architectures (ISAs). Prior to Windows 2000, Windows NT (and thus PE) supported the MIPS, Alpha, and PowerPC ISAs. Because PE is used on Windows CE, it continues to support several variants of the MIPS, ARM (including Thumb), and SuperH ISAs. [10]

The project implementation has six modules: dataset collection, data pre-processing, feature selection, model selection, a classifier model for predicting malicious or normal files, and a comparison of accuracy across the Logistic Regression, Decision Tree, Random Forest and Support Vector Machine algorithms.

A. Dataset Collection

The collection of data is based on the different PE header attributes of the files present in the dataset, as these are, to a great extent, directly related to the probability of malware being involved in those files. The dataset can be collected from any online repository.

B. Data Pre-Processing

Pre-processing can take a considerable amount of time, as it requires removing NULL values in different attributes along with the unreliable data present in the dataset. Data pre-processing includes cleaning, normalization, and transformation of attributes according to the criteria for the analysis of malware. In order to perform feature selection on particular test data, the data must first be run through a Python script that extracts the PE headers.
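The PE header extraction script itself is not given in this paper. The sketch below is one assumption of how such a script could read basic header attributes using only Python's standard struct module, following the public PE/COFF layout (DOS header, PE signature, COFF file header, optional header magic). The function names parse_pe_header and build_sample are hypothetical, and a synthetic header built in memory stands in for a real executable.

```python
import struct

# Subset of COFF machine types and optional-header magic values
# (from the public PE/COFF format description).
MACHINE_TYPES = {0x014C: "IA-32", 0x0200: "IA-64", 0x8664: "x86-64",
                 0x01C0: "ARM", 0xAA64: "ARM64"}
OPTIONAL_MAGIC = {0x10B: "PE32", 0x20B: "PE32+"}


def parse_pe_header(data):
    """Return basic COFF header attributes from the raw bytes of a PE file."""
    if data[:2] != b"MZ":
        raise ValueError("missing MZ signature")
    # e_lfanew at offset 0x3C holds the file offset of the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    # The 20-byte COFF file header follows the 4-byte signature.
    (machine, n_sections, timestamp, _symtab, _nsyms,
     opt_size, characteristics) = struct.unpack_from("<HHIIIHH", data, pe_offset + 4)
    magic = None
    if opt_size >= 2:
        (magic,) = struct.unpack_from("<H", data, pe_offset + 24)
    return {
        "Machine": MACHINE_TYPES.get(machine, hex(machine)),
        "NumberOfSections": n_sections,
        "TimeDateStamp": timestamp,
        "SizeOfOptionalHeader": opt_size,
        "Characteristics": characteristics,
        "Format": OPTIONAL_MAGIC.get(magic, "unknown"),
    }


def build_sample():
    """Build a synthetic, header-only PE image for demonstration (hypothetical)."""
    dos = bytearray(64)
    dos[:2] = b"MZ"
    struct.pack_into("<I", dos, 0x3C, 64)  # PE signature right after DOS header
    coff = struct.pack("<HHIIIHH", 0x8664, 3, 0, 0, 0, 240, 0x0022)
    optional = struct.pack("<H", 0x20B) + bytes(238)
    return bytes(dos) + b"PE\0\0" + coff + optional


info = parse_pe_header(build_sample())
print(info["Machine"], info["NumberOfSections"], info["Format"])  # x86-64 3 PE32+
```

For a file on disk, the same function could be called on the bytes read with open(path, "rb"). The returned dictionary corresponds to one row of attributes; the 50+ attributes in the dataset would require reading further optional-header and section-table fields.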
The goal of using a bagging technique (an ensemble model) for classification is to combine the results of multiple models to achieve better accuracy. The bagging technique uses bootstrapping to train multiple models and then obtains the final model from their collective result. Multiple subsets are created from the original data by sampling with replacement. A base model is specified and is then trained on each of these subsets independently.

Random forest is one such model that uses the bagging technique to train a classification model. Bootstrapping is done to create the subsets from the data. The base model used in random forest is the decision tree, which here splits the records into two categories (clean and malicious) based on one feature; these are further split into two based on the next feature, selected in order of feature importance. This continues until all the leaf nodes are classified (contain the result malicious/non-malicious). The results of the decision trees trained on the different subsets are then averaged to produce the final output. In this system, all malware samples are labelled as malicious, while the rest are labelled as normal files.

Table IV gives the accuracy of the various models on a predefined test dataset along with the actual dataset. The accuracy rate varies from 89% to 100% for the test dataset and from 65% to 97.8% for the train dataset.
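The bagging procedure described above can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the implementation used in this work: each base model is a depth-one decision tree (a stump) that splits on the binary feature with the highest information gain, the ensemble combines results by majority vote rather than averaging, and the random feature subsetting of a full random forest is omitted. The toy feature columns and records are hypothetical.

```python
import math
import random
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def majority(labels):
    return Counter(labels).most_common(1)[0][0]


def train_stump(rows, labels):
    """Pick the binary feature with the highest information gain."""
    base = entropy(labels)
    fallback = majority(labels)
    best = None
    for f in range(len(rows[0])):
        sides = {0: [], 1: []}
        for row, y in zip(rows, labels):
            sides[row[f]].append(y)
        remainder = sum(len(s) / len(labels) * entropy(s) for s in sides.values() if s)
        gain = base - remainder
        if best is None or gain > best[0]:
            leaves = {v: majority(s) if s else fallback for v, s in sides.items()}
            best = (gain, f, leaves)
    return best[1], best[2]


def train_bagged(rows, labels, n_models=15, seed=0):
    """Train one stump per bootstrap sample (sampling with replacement)."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(len(rows)) for _ in rows]
        models.append(train_stump([rows[i] for i in idx], [labels[i] for i in idx]))
    return models


def predict(models, row):
    """Combine the stump predictions by majority vote."""
    return majority([leaves[row[feature]] for feature, leaves in models])


# Hypothetical binary features: (suspicious section, network import, signed)
rows = [(1, 1, 0), (1, 0, 0), (1, 1, 1), (1, 0, 1),
        (0, 1, 1), (0, 0, 1), (0, 1, 0), (0, 0, 0)]
labels = ["malicious"] * 4 + ["normal"] * 4
models = train_bagged(rows, labels)
print(predict(models, (1, 0, 1)))  # malicious
print(predict(models, (0, 1, 0)))  # normal
```

In this toy data the first feature separates the two classes perfectly, so almost every bootstrap sample yields a stump that splits on it; on real PE attributes, the individual trees would differ more and the vote would matter correspondingly more.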
VI. CONCLUSION