
JOURNAL OF NETWORKS, VOL. 7, NO. 4, APRIL 2012

RobotDroid: A Lightweight Malware Detection Framework on Smartphones
Min Zhao

PLA University of Science and Technology, Nanjing, China
Email: ezhouzhaomin@gmail.com

Tao Zhang, Fangbin Ge, Zhijian Yuan

PLA University of Science and Technology, Nanjing, China
Email: {zhangtao421, gefangbin, nudt_yzj}@gmail.com

Abstract: Smartphones have been widely used in recent years due to their communication and multimedia processing capabilities, and they have therefore become attack targets of malware. The threat of malicious software has become an important factor in smartphone security. Android is the most popular open-source smartphone operating system, and its permission-declaration access control mechanism cannot detect the behavior of malware. In this paper, we propose RobotDroid, a new malware detection framework based on software behavior signatures and an SVM active learning algorithm. Active learning is very efficient for classification problems in which the training set mixes a small number of labeled samples with a large number of unlabeled samples. As a result, RobotDroid can effectively detect many kinds of malicious software and their variants at runtime, and it can extend its malware characteristics database dynamically. Experimental results show that the approach has a high detection rate and low false positive and false negative rates, and that its power and performance impact on the original system is negligible.
Index Terms: smartphone security, malware detection, active learning, Android

I. INTRODUCTION
In recent years, with advances in mobile computing capability and high-speed mobile communication networks, smartphones have become increasingly popular and cheaper, and both the variety of devices and the number of users have increased greatly. Smartphones offer a new computing environment, but the openness of their operating systems and their easy network access make them more vulnerable to malicious attacks and information theft, which brings new challenges for security researchers.
More and more personal information is stored in smartphones, including digital images, personal address books and personal documents. Because smartphones connect easily to other terminals and to many types of networks, software on the terminal can access the network without the permission of its owner, and running this kind of software can leak the user's private information. Meanwhile, malware may also "hide" some high-payment and power-exhausting services that run without the owner's authorization. Traditional malware detection theory, which was proposed for the PC architecture, is not well suited to smartphones with lower computing capability and limited power, so a new type of malware detection mechanism suitable for smartphones is desirable.
© 2012 ACADEMY PUBLISHER
doi:10.4304/jnw.7.4.715-722
In this paper, we propose a new framework named RobotDroid to detect smartphone malware. It is based on an SVM active learning algorithm, and we validated the effectiveness of the method on the Android system. Our experimental results show that the proposed method has good applicability and scalability, can detect a variety of popular malware, and can also detect unknown malware. Its impact on system performance is small, and its cost in terms of the original system's capacity is negligible.
The paper is organized as follows: Section II introduces and evaluates related research work; Section III gives an overview of the architecture of the malware detection system on Android; Section IV describes the design and implementation of the framework in detail; Section V experimentally validates the effectiveness of the detection system built with this method; and Section VI summarizes the paper and describes future research ideas.
II. RELATED WORK
The initial studies on smartphone malware [1, 2, 3, 4] mainly focused on understanding the threats and behaviors of emerging malware. Guo et al. [1] examined various types of attacks that can be launched from a compromised smartphone and suggested many kinds of potential defenses. Racic et al. [2] revealed a vulnerability of MMS/SMS that can be exploited to launch battery exhaustion attacks. Mulliner et al. [3] demonstrated a proof-of-concept malware that crosses service boundaries in Windows CE phones. They also revealed buffer overflow vulnerabilities in MMS [4].
Forrest and Pearlmutter et al. [5] presented a typical host-based anomaly detection approach: system call sequences are monitored and stored in a database, and if the behavior of a program does not appear in the system call sequence database, it is treated as an intrusion. Later work introduced behavior learning algorithms, finite state machines and hidden Markov models built from system call sequences. All of these approaches rely on a clearly defined representation of the monitored program's normal behavior and flag deviations from that predefined normal model. However, these methods ignore the semantics of system calls; their limitation is that a simply obfuscated program can escape detection.
Christodorescu and Somesh et al. [6] proposed a semantics-aware static malware detection technique that tries to identify programs with the same semantics in order to detect obfuscated code. They decompiled code to generate predefined behavior templates and stored them for matching against malware, which makes it possible to detect simply obfuscated malware. Its shortcoming is that it needs to precisely match a predefined template, which limits the number of malware samples that can be detected.
Zhu and Cao et al. [7] made use of social networks to detect cellular network worms. A social relationship graph between terminals is drawn from the network traffic between smartphones; because smartphone users usually open and download content from friends they trust, this graph describes the most likely transmission paths of network worms. The authors proposed two algorithms for partitioning the social relationship graph: balanced partitioning and clustered partitioning. The social network approach can solve the problem of detecting worm propagation in mobile networks, but it cannot be used to detect other types of smartphone malware.
Abhijit and Hu et al. [8, 13] proposed a framework for detecting worms, viruses and Trojan horses on mobile devices. They first present a temporal sequence based on the logical order of program behaviors, and then give an effective representation of malware behaviors. Each of these behaviors may not look threatening when viewed in isolation. The authors validated the framework on Symbian OS. They stored coded sequences for 25 kinds of typical malicious software behavior in a database, and then proposed a two-stage mapping technique that draws on run-time monitoring of system events and process API calls. They use a support vector machine classifier to distinguish between malware and normal software.
Enck et al. presented TaintDroid in [17, 18, 19]. Their system uses dynamic analysis techniques to monitor sensitive information on Android, so it can track a suspicious third-party application that uses sensitive data such as GPS location or address book information. The shortcoming of their method is that an application using sensitive data is not necessarily malware, so many normal applications may be classified as malware.
Shabtai et al. presented in [23] a methodology, known as knowledge-based temporal abstraction, that detects suspicious temporal patterns as malicious behavior. Both works use knowledge-based analysis, while our system is behavior based; these can be complementary techniques. Moreover, their approach is recommended for detecting continuous attacks (e.g., DoS, worm infection), whereas our framework detects terminal malware such as information theft and power exhaustion, the most frequently seen attacks nowadays.
Burguera and Zurutuza et al. [24] give a framework to detect malware on the Android platform. They monitor system calls at the Linux level, generate software behavioral patterns, and classify these patterns with a clustering algorithm. Their method is efficient at detecting malware whose behavior is visible from the Linux kernel, but the behavior of many kinds of malware cannot be seen at the Linux level, for example malware that sends malicious SMS messages or places malicious calls.
III. OVERVIEW OF ROBOTDROID

Figure 1. Architecture of RobotDroid.

Fig. 1 illustrates the architecture of RobotDroid. The upper half is the learning component, which includes the characteristic monitoring module, the characteristic learning module, the behavior signature module and the signature database. The characteristic monitoring module monitors all running software to obtain its running characteristics, which form the original characteristics of normal software behavior and malware behavior. Characteristics of these two types are fed into the learning module to generate behavioral characteristics. The behavior signature module turns these behaviors into behavior signatures and stores them in the signature database. The bottom half of Fig. 1 is the malware detection component, which includes the run-time behavior monitoring module, the behavior signature module, the decision module and the response module. The run-time behavior monitoring module monitors the key points of the service managers and Intent users of Android, and the resulting behavior sequences are then signed with the same algorithm as above. The behavior signatures are compared with the signatures in the signature database, and the response module gives a response if a signature matches a malware signature in the database.
IV. DESIGN AND IMPLEMENTATION
A. Software behavior Signature and algorithm
In smartphone operating systems, the behavior of malware may occur in multiple locations. These actions, combined in a certain temporal order, constitute malicious software behavior; one or a few of them taken separately cannot determine whether the behavior is malicious. The collected actions are therefore processed according to their temporal relations, after which all behaviors are abstracted and signed into software behavior patterns. Code packing and simple obfuscation do not change the behavior of software, so malware and its variants generally share the same run-time behavior patterns and can be detected through the same behavior signature. Compared with feature-based malware detection, a behavior signature database is smaller, so behavior-signature-based detection of malicious software is well suited to resource-constrained mobile devices. New malicious software usually exhibits new behavior signatures that are inconsistent with previously known normal behavior, so behavior-signature-based malware detection can detect new and unknown malware.
This paper defines software behavior [23] as the Intents issued and the system resources accessed by applications in Android-based smartphone operating systems. A software behavior signature is obtained as follows. First, monitoring points are inserted into every service manager in the Android framework. Next, every monitor collects manager calls by process identifier and timestamp and writes them to the manager call log. Finally, the logs are classified by process identifier and sorted by timestamp. Fig. 2 displays the collection of the original behavioral characteristics of a typical process:

Figure 2. Collection of the original behavioral characteristics of a process.

The algorithm for collecting the behavioral characteristics of a process is described in Fig. 3. Log(PID, MID, timestamp) writes an entry to the log file when process PID accesses service manager MID at time timestamp. After a while, the behavior sequence of the process has been collected in chronological order. Function Classify(PID) classifies the behavioral sequences into several classes based on process identifier. Function SortLogs(log) sorts the logs by their timestamps.


Algorithm 1: Behavioral characteristics collecting
for (each manager)
    if (PID accesses this manager)
        timestamp ← now;
        Log(PID, manager, timestamp);
    end if;
end for;
Classify(PID);
SortLogs(log);
Figure 3. Behavior characteristics collecting algorithm.
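The collection steps in Fig. 3 can be sketched in Python as follows. This is only an illustrative sketch: the in-memory list `logs`, the function names `log_access`, `classify` and `sort_logs`, and the manager names in the example are assumptions made here, not part of the RobotDroid implementation, which writes to a log file inside the Android framework.

```python
from collections import defaultdict
import time

# Hypothetical in-memory log; the paper writes to a log file instead.
logs = []

def log_access(pid, manager_id, timestamp=None):
    """Record that process `pid` accessed service manager `manager_id`."""
    if timestamp is None:
        timestamp = time.time()
    logs.append((pid, manager_id, timestamp))

def classify(entries):
    """Group log entries by process identifier."""
    by_pid = defaultdict(list)
    for pid, manager_id, timestamp in entries:
        by_pid[pid].append((timestamp, manager_id))
    return by_pid

def sort_logs(by_pid):
    """Sort each process's entries by timestamp and keep the manager order."""
    return {pid: [m for _, m in sorted(items)] for pid, items in by_pid.items()}

# Example: two processes touching hypothetical managers out of order.
log_access(101, "TelephonyManager", 3.0)
log_access(202, "LocationManager", 1.0)
log_access(101, "SmsManager", 2.0)
sequences = sort_logs(classify(logs))
```

Each value in `sequences` is the chronological sequence of manager accesses for one process, which is the raw material for the behavior signature.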

B. Learning Algorithm for malware signatures


Currently, most malicious software detection approaches use rule-based detection technology [15]. Such approaches can only detect the malicious software covered by a predefined rule database, and they fail to detect a good number of known malware samples and new malware variants. Intelligent anomaly detection technology has therefore been put forward and has become a research hotspot. Anomaly detection techniques commonly draw on the following theories: probability and statistics, artificial neural networks, fuzzy recognition and artificial immune methods.
The traditional rule-based anomaly detection approaches that use statistical methods mostly divide the collected data into normal and abnormal categories. To solve such problems, the samples must first be labeled to build the training sample set. Establishing the training sample set depends on security experts, and the cost is high. Improving classification accuracy requires enough training samples, which increases the cost of building the training sample set; in addition, collecting a large number of learning samples is itself difficult. To address this problem, a learning method is needed that achieves good classification results with a small number of training samples.
Active learning has been proposed as a way to solve this problem. It was proposed by Lewis and Gale et al. [9], and it departs from traditional passive learning, which learns from a given set of known samples: according to the state of the learning process [22], the learner takes the initiative to choose the best samples to study, thus effectively reducing the number of samples that must be evaluated. The Support Vector Machine (SVM) is an intelligent learning algorithm that achieves good classification and generalization ability when trained on a small number of samples [14].
Assuming that the number of Android malware types is limited, we put all existing Android software into a software pool [25]. We can then use a pool-based active learning algorithm to decide whether a behavioral sequence is normal or abnormal.
It is assumed that the instances x are independently and identically distributed according to some underlying distribution F(x), and that the labels are distributed according to some conditional distribution P(y|x). Given an unlabeled pool U, an active learner al has three components: (f, q, X). The first component is a classifier, f: X → {-1, 1}, trained on the current set of labeled data X (and possibly on unlabeled instances in U too). The second component q(X) is the querying function that, given a current labeled set X, decides which instance in U to query next. The active learner can return a classifier f after each query or after some fixed number of queries.
Given a set of labeled training data and a Mercer kernel K, there is a set of hyperplanes that separate the data in the induced feature space F. This set of consistent hypotheses is called the version space (Mitchell, 1982). In other words, hypothesis f is in the version space if for every training instance x_i with label y_i we have f(x_i) > 0 if y_i = 1 and f(x_i) < 0 if y_i = -1. More formally, we give the following definition of the version space:
Definition 1: The set of possible hypotheses is given as:
$$H = \left\{ f \;\middle|\; f(x) = \frac{w \cdot \Phi(x)}{\|w\|},\; w \in W \right\} \tag{1}$$
where $\Phi(x)$ is the image of x in the feature space F and the parameter space W is simply equal to F.
The version space V is then defined as:
$$V = \{ f \in H \mid \forall i \in \{1 \ldots n\},\; y_i f(x_i) > 0 \} \tag{2}$$
Notice that since H is a set of hyperplanes, there is a bijection between unit vectors w and hypotheses f in H. Thus we will redefine V as:
$$V = \{ w \in W \mid \|w\| = 1,\; y_i (w \cdot \Phi(x_i)) > 0,\; i = 1 \ldots n \} \tag{3}$$
The main difference between active learning and
passive learning is how to choose the next unlabeled
instance to query. We use an approach that queries points
so as to attempt to reduce the size of the version space as
much as possible. We take a myopic approach that
greedily chooses the next query based on this criterion.
We also note that myopia is a standard approximation
used in sequential decision making problems. We need
two more definitions before we can proceed:
Definition 2: Area(V) is the surface area that the version space V occupies on the hypersphere $\|w\| = 1$.
Definition 3: Given an active learner al, let $V_i$ denote the version space of al after i queries have been made. Now, given the (i + 1)th query $x_{i+1}$ with feature-space image $\Phi(x_{i+1})$, define:
$$V_i^- = V_i \cap \{ w \in W \mid -(w \cdot \Phi(x_{i+1})) > 0 \} \tag{4}$$
$$V_i^+ = V_i \cap \{ w \in W \mid +(w \cdot \Phi(x_{i+1})) > 0 \} \tag{5}$$
So $V_i^-$ and $V_i^+$ denote the resulting version spaces when the next query $x_{i+1}$ is labeled as -1 and 1 respectively.


We wish to reduce the version space as fast as possible. Intuitively, one good way of doing this is to choose a query that halves the version space. The result is that, for any given number of queries, the learner that chooses successive queries that halve the version space is the learner that minimizes the maximum expected size of the version space, where the maximum is taken over all conditional distributions of y given x.
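As a worked illustration of why halving is attractive, consider the idealized case (an assumption made here for illustration only) in which every query splits the current version space into two parts of exactly equal area. Then, whichever label is returned, the area shrinks geometrically with the number of queries k:

```latex
\mathrm{Area}(V_i^-) = \mathrm{Area}(V_i^+) = \tfrac{1}{2}\,\mathrm{Area}(V_i)
\quad \text{for } i = 0, \ldots, k-1
\quad\Longrightarrow\quad
\mathrm{Area}(V_k) = 2^{-k}\,\mathrm{Area}(V_0)
```

For example, under this assumption ten queries would reduce the area of the version space to $2^{-10} = 1/1024$ of its initial value.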
The following discussion provides motivation for an approach that queries software running instances that split the current version space into two parts that are as equal as possible. Given an unlabeled instance x from the pool, it is not practical to explicitly compute the sizes of the new version spaces V- and V+ (i.e., the version spaces obtained when x is labeled as -1 and +1 respectively). Three ways of approximating this procedure are Simple Margin, MaxMin Margin and Ratio Margin. These three methods are approximations to a software behavioral monitoring component that always halves the version space. After performing some number of executions we then return a classifier by learning an SVM with the labeled instances.
The margin can be used as an indication of the version
space size irrespective of whether the feature vectors
have constant modulus. Thus the explanation for the
MaxMin and Ratio methods still holds even without the
constraint on the modulus of the training feature vectors.
The Simple method can still be used when the training
feature vectors do not have constant modulus, but the
motivating explanation no longer holds since the maximal
margin hyperplane can no longer be viewed as the center
of the largest allowable sphere.
Because Android software behavior sequences are not very sophisticated and the computing capability of smartphones is limited, we use the Simple Margin method. The test results presented in the evaluation section show that this choice is suitable.
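The Simple Margin strategy is commonly described as querying the pool instance that lies closest to the current SVM hyperplane. The sketch below illustrates only that selection step, under the assumption of a linear decision function w·x + b that has already been trained; the weights, the pool and the function names are hypothetical and are not taken from the paper.

```python
def decision_value(w, b, x):
    """Score w.x + b of a linear classifier for feature vector x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def simple_margin_query(w, b, pool):
    """Return the index of the unlabeled instance closest to the hyperplane."""
    return min(range(len(pool)),
               key=lambda i: abs(decision_value(w, b, pool[i])))

# Hypothetical current hyperplane and unlabeled pool of feature vectors.
w, b = [1.0, -1.0], 0.0
pool = [[3.0, 0.0], [1.0, 0.9], [0.0, 2.5]]
query_index = simple_margin_query(w, b, pool)
```

Selecting an instance needs only one dot product per pool entry, which is consistent with the stated preference for a cheap method on a resource-limited handset.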
This article gives an SVM active learning algorithm that is applied to detecting Android malicious software. With a smaller Android training sample set, the classifier achieves higher classification accuracy, which improves the training speed of malicious software detection and reduces the cost of constructing training samples.
The SVM parameters are obtained through training, which requires two types of training samples: signature patterns of normal behavior and signature patterns of abnormal behavior. This paper defines the behavior signature as the behavior track of the software, and slides a window of length k over this track to obtain short sequences of system resource accesses. How to choose the size of the sliding window is the key issue. If the selected sequence length is too short, the ordering relation between resource accesses is lost; if it is too long, it cannot reflect the local ordering in normal and abnormal contexts. Hofmeyr SA et al. [5] conclude from experiments that when the window is longer than 30, no useful information can be determined from the call sequence of the program behavior. Lee W et al. [10] suggest that the most appropriate short sequence length for resource access is 6 or 7. The short sequence length is 6 in this paper, and the experimental results show that this selection is proper. The algorithm displayed in Fig. 4 is used to generate normal short sequences.
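The sliding-window step can be sketched as follows. The function name `slide_window` and the integer-coded trace are illustrative assumptions; the window length of 6 is the value chosen in the text.

```python
def slide_window(sequence, k=6):
    """Return all contiguous short sequences of length k, in order."""
    return [tuple(sequence[i:i + k]) for i in range(len(sequence) - k + 1)]

# Hypothetical behavior trace: each integer stands for one service manager ID.
trace = [1, 4, 2, 7, 7, 3, 5, 1]
windows = slide_window(trace, k=6)
```

A trace of length n yields n - k + 1 short sequences, and a trace shorter than k yields none.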
In the algorithm, NCS represents the normal characteristic sequence database, ncs represents a normal characteristic sequence instance, MCS represents the malware characteristic sequence database, and mcs represents a malware characteristic sequence instance. Sliding a window of length 6 over the behavior signatures of normal software yields the normal characteristic sequences. Applying the same window to malicious software yields both normal sequences and malicious sequences. Because the number of malware samples is far smaller than the number of normal samples, MCS is far smaller than NCS. When a sequence appears in both NCS and MCS, it is deleted from MCS. When a sequence is not included in MCS and does not completely match any ncs in NCS, its similarity to the normal samples is measured with the Hamming distance.
Algorithm 2: Normal Short Sequences Generate
NCS ← slidewindow(nobs);
NCL ← NCS;
MCS ← slidewindow(mobs);
for (mcs in MCS)
    for (ncs in NCS)
        if (mcs = ncs) del mcs from MCS;
    end for;
end for;
for (mcs in MCS)
    d ← MAX;
    for (ncs in NCS)
        d(mcs, ncs) ← Hamming(mcs, ncs);
        if (d(mcs, ncs) < d)
            d ← d(mcs, ncs);
        end if;
    end for;
    if (d > D) del mcs from MCS;
end for;
Figure 4. Normal Short Sequences Generate algorithm.

For two short sequences i and j, the Hamming distance between them is denoted by d(i, j). For each new sequence i, the minimum Hamming distance d_min(i) is defined as min{d(i, j)}, taken over all normal sequences j. d_min(i) expresses the extent to which the sequence deviates from the normal mode.
Finally, for an unmatched sequence i, d_min(i) is compared with a predetermined comparison threshold D to determine whether it is abnormal: if d_min(i) is equal to or larger than D, the sequence i is an ncs. The sample set of abnormal short sequences is obtained in this way.
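The distance computation can be illustrated with a short sketch. The helper names `hamming` and `d_min` and the example sequences are assumptions for illustration; the sketch covers only the computation of d(i, j) and d_min(i), not the subsequent threshold comparison.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(1 for x, y in zip(a, b) if x != y)

def d_min(seq, normal_sequences):
    """Minimum Hamming distance from seq to any normal short sequence."""
    return min(hamming(seq, ncs) for ncs in normal_sequences)

# Hypothetical normal database (NCS) and one unmatched sequence.
ncs_db = [(1, 4, 2, 7, 7, 3), (4, 2, 7, 7, 3, 5)]
candidate = (1, 4, 2, 9, 7, 8)
deviation = d_min(candidate, ncs_db)
```

Here the candidate differs from the first normal sequence in two positions and from the second in all six, so its minimum distance is 2.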
C. Malware detection
The SVM active learning algorithm is used for detection. The most prominent feature of SVM is that it is based on the principle of structural risk minimization. Vapnik et al. [11] maximize the generalization ability of the learner, that is, a classifier trained on a limited training sample set can still guarantee a small error on an independent test set.
Because the normal short sequence samples obtained in the learning process are not complete, the abnormal short sequence samples derived from them may still contain some normal sequences, which causes the SVM classifier to produce classification errors. A decision module is therefore introduced, which uses the malware risk level presented below to make decisions.
Different malicious behaviors cause different losses to the smartphone system and its user, so we introduce a risk factor (RF). RF gives a weight to each short sequence of a malicious act. The base weight is set to 1; if the behavior poses a greater threat to system and user security, it is given an RF greater than 1. We also introduce a risk rank (RR) as a quantitative measure for identifying software as malware. RR is defined as follows:
$$RR = \sum_{i=1}^{n} ncs_i \cdot RF_{ncs_i} \tag{6}$$
We set a malware detection threshold D, whose value is determined from experimental results; our results show that a D value of 17 is the best detection threshold. When the calculated RR is greater than D, the software is ultimately determined to be malware.
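A minimal sketch of the risk rank decision is given below. It assumes, as one possible reading of Eq. (6), that each term is the occurrence count of a flagged short sequence multiplied by its risk factor; the sequence names, counts and RF values are hypothetical, and only the threshold D = 17 comes from the text.

```python
def risk_rank(flagged, risk_factor, default_rf=1.0):
    """Sum occurrence counts weighted by each sequence's risk factor."""
    return sum(count * risk_factor.get(seq, default_rf)
               for seq, count in flagged.items())

D = 17  # detection threshold reported in the text

# Hypothetical flagged sequences with occurrence counts and risk factors.
flagged = {"send_sms": 4, "read_contacts": 3, "get_location": 2}
risk_factor = {"send_sms": 3.0, "read_contacts": 2.0}
rr = risk_rank(flagged, risk_factor)
is_malware = rr > D
```

Sequences without an explicit entry fall back to the base weight of 1, so in this example RR = 4·3 + 3·2 + 2·1 = 20, which exceeds the threshold.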
V. EVALUATION
Computational complexity and battery consumption are two essential factors for security systems on mobile devices, and they need to be considered when making any changes to the software stack on these devices. We have evaluated both aspects for the malware detection framework presented in this paper. As a test system, we used Android Froyo with kernel version 2.6.25 running on an HTC Hero handset. The evaluation of the framework is presented below.
A. Malware detection evaluation
As Android Market is the most famous place for sharing Android software, we chose the 100 most popular applications in Android Market as our normal software test samples and chose 3 typical popular malware families as our malicious software test samples. The malicious software samples are listed in Table I.
Google Inc. has announced that many famous and popular applications have been infected by these three types of malware. We fed the 100 normal applications and 2 selected samples of each type of malware to the characteristic learning module to train the detection engine; another 200 applications were then sent to the detection engine to test the effectiveness of the detection framework.
TABLE I.
MALICIOUS SOFTWARE SAMPLES
Name | Infected software | Report time
Geinimi | Monkey Jump 2, President vs. Aliens, City Defense, Baseball 2010 | 2011.1
DroidDream | Sexy Girls: Hot Japanese, Sex Sound, Super StopWatch and Timer, Super Color Flashlight | 2011.3
Plankton | Angry birds, DroidKungfu, YZHCSMS | 2011.5

We use the false positive rate and the false negative rate to measure the accuracy of malware detection. The False Positive Rate (FPR) and False Negative Rate (FNR) are defined as follows:
$$FPR = \frac{NormalAsMal}{TotalDetected} \times 100\% \tag{7}$$
$$FNR = \frac{MalAsNormal}{TotalDetected} \times 100\% \tag{8}$$

In the two equations above, NormalAsMal is the number of normal applications that the detection system classified as malware, MalAsNormal is the number of malware samples that the detection system classified as normal software, and TotalDetected is the total number of applications examined. The test results are given in Table II:

TABLE II.
MALWARE DETECTION EVALUATION RESULT
Malware type | Infected num | Correct detected | False positive | False negative | Detection rate
Geinimi | 30 | 28 | 3.7% | 3% | 93.3%
DroidDream | 30 | 27 | 5.4% | 4.6% | 90%
Plankton | 30 | 27 | 4.3% | 5.7% | 90%

Table II shows the results of applying the detection framework to three well-known malware families. An SVM active learning engine is built for each type of malware. The framework detects most samples of these three types of malware, and the false positive and false negative rates are small, because the behavior of these three types of malware is very distinct.
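The rate definitions in Eqs. (7) and (8) can be written as a small helper, shown below as a sketch. The function names are illustrative, and the detection rate is assumed here to be the number of correctly detected samples divided by the number of infected samples, which reproduces the detection-rate column of Table II.

```python
def rate_percent(count, total_detected):
    """Express count / total_detected as a percentage."""
    if total_detected <= 0:
        raise ValueError("total_detected must be positive")
    return 100.0 * count / total_detected

def fpr(normal_as_mal, total_detected):
    """False positive rate, Eq. (7)."""
    return rate_percent(normal_as_mal, total_detected)

def fnr(mal_as_normal, total_detected):
    """False negative rate, Eq. (8)."""
    return rate_percent(mal_as_normal, total_detected)

# Detection rate assumed to be correctly detected / infected (Table II rows).
geinimi_rate = rate_percent(28, 30)
droiddream_rate = rate_percent(27, 30)
```

With the counts from Table II, 28 of 30 gives about 93.3% for Geinimi and 27 of 30 gives 90% for DroidDream and Plankton.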
B. Performance evaluation
The primary users of smartphones in general and
Android in particular are usually unable or unwilling to
sacrifice performance for security. Moreover, the
computational power of most smartphones, while being
superior to traditional cell phones, is still lower than
desktop computers. It is therefore necessary that the
security policy model not overly tax the computational
capabilities of the phone.
The message service, the location service and shell scripts are three important executable programs on Android smartphones, and they also serve as our three types of protected objects. Fig. 5 shows the time spent with and without the malware detection system. The results show that the performance decrease is bearable.

Figure 5. Time Consumption Evaluation

The results show that after adding the malware detection system to the stock Android system, the decrease in running efficiency of the three typical applications is tolerable, and the largest decrease does not exceed 15%.
C. Power consumption evaluation
Measurement of battery consumption on Android is difficult because the battery charge level reported by the Android hardware is very coarse grained. Using software to measure battery consumption while accessing GPS simply yields no change in battery level. However, note that since we use a hash table to store normal and abnormal software behavior signature information, the decision-making time consumption is linear.
Therefore, using the same arguments as those for time consumption, we can conclude that the battery consumption overhead caused by our decision-making mechanism is also bearable.
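The hash-table argument can be illustrated with a minimal sketch, assuming signatures are stored as keys of a Python dict; the names `signature_db` and `decide` and the signature strings are hypothetical and not from the paper. Each lookup takes constant time on average, so the decision time grows linearly with the number of observed signatures.

```python
# Hypothetical signature database: behavior signature -> label.
signature_db = {
    "sig-a1": "normal",
    "sig-b2": "malware",
}

def decide(signatures, db):
    """Return True if any observed signature is stored as malware."""
    return any(db.get(sig) == "malware" for sig in signatures)

# One pass over the observed signatures, one hash lookup per signature.
observed = ["sig-a1", "sig-zz", "sig-b2"]
alert = decide(observed, signature_db)
```

Unknown signatures such as "sig-zz" simply miss in the table and do not trigger an alert in this sketch.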
VI. CONCLUSION AND FUTURE WORK
Nearly all market indicators foresee a massive increase in the number of smartphones purchased in the next 5 or even 10 years. This will create the potential for a massive increase in malware generation, particularly in the sector dominated by the market leader, potentially the Android platform.
In this paper we have proposed a new framework to obtain and analyze smartphone application activity in the Android framework. In collaboration with the Android user community, it will be capable of distinguishing between benign and malicious applications of the same name and version, and of detecting anomalous behavior of known applications. We have shown that monitoring software behavioral activity in the Android framework is a feasible way of detecting malware. According to the brief survey in Section II, there are many different approaches to detecting malware on traditional PCs and on smartphone platforms such as Microsoft Windows Mobile, Nokia Symbian, Apple iOS and Google Android. We consider monitoring software behavioral activity in the Android framework to be one of the most accurate techniques for determining the behavior of Android applications, since it provides detailed and effective low-level information. We do realize that framework-level monitoring of Android SDK function calls can contribute to a deeper analysis of malware, providing more useful information about malware behavior and more accurate results. On the other hand, more monitoring capability places a higher demand on the resources consumed on the device.
We have seen that SendTextMessage(), SendMultipartTextMessage(), getPhoneService, and getCurrentLocation() are the SDK functions most used by malware. A benign application could make moderate or heavy use of those function calls and thus trigger false positives, but we trust that slight differences would allow the system to classify Trojans correctly. We have seen that trojanized applications execute more of these kinds of SDK function calls.
The most important contribution of this work is the mechanism we propose for obtaining real traces of application behavior in the Android framework. Previous works have shown that it is possible to obtain behavior information using artificially created user actions or replicas of smartphones, but RobotDroid helps the community obtain real application traces from hundreds or even thousands of applications.
As a next step, we will deploy RobotDroid on Google's Android Market and distribute it to as many users as possible. Users running our application will have the opportunity to see their own smartphone's behavior. We could even alert users when one of their applications shows an abnormal trace. The system can also act as an early warning system, capable of detecting malicious or abnormally behaving applications in the early stages of propagation. By implementing other tools that run on a more powerful PC or server, we have demonstrated that one can obtain behavior-based information and have it processed on a central server.
We have chosen a simple active learning algorithm to
distinguish between benign applications and their
corresponding malware versions. The results have been
encouraging, although some open issues remain. First, the
system always separates the software behavioral signature
vectors into two sets, even if no malware is present. The
software behavioral signature sequence mapping also changes
drastically whenever a malicious signature vector enters
the normal dataset. These issues require manual checks or
further automatic analysis. Second, one could intentionally
submit incorrect data to the system, leaving the dataset
corrupted. One next step is to authenticate the submitting
application so that nobody can send wrong data directly to
the system. Regarding communication between the RobotDroid
application and our server, the current version uses the
public TCP/IP protocol without protecting the privacy of
the transferred data. If an attacker sniffs and manipulates
the traffic in transit, misclassification errors can
result. To avoid this, we will introduce encryption
mechanisms to provide integrity of data and authenticity of
the sender. We have to take into account that applying this
technique on the mobile device might add processor
overhead, resulting in faster battery drain.
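As a rough illustration of the integrity and authenticity goal described above, the following sketch attaches an HMAC-SHA256 tag to each submitted report so that the server can reject altered or unauthenticated data. This is one possible mechanism, not the one the authors commit to. The shared key, the report fields, and the function names sign_report and verify_report are hypothetical, and key distribution as well as confidentiality of the payload are outside the scope of the sketch.

```python
import hashlib
import hmac
import json


def sign_report(key, report):
    """Serialize a report deterministically and compute its HMAC-SHA256 tag."""
    payload = json.dumps(report, sort_keys=True).encode("utf-8")
    tag = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return payload, tag


def verify_report(key, payload, tag):
    """Recompute the tag on the server side and compare in constant time."""
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, tag)


# Hypothetical shared secret and report contents, for illustration only.
SHARED_KEY = b"demo-shared-key"
payload, tag = sign_report(
    SHARED_KEY, {"app": "com.example.app", "vector": [5, 0, 1, 1]}
)

# An attacker who changes the payload in transit cannot produce a valid tag.
tampered = payload.replace(b"5", b"0")

print(verify_report(SHARED_KEY, payload, tag))
print(verify_report(SHARED_KEY, tampered, tag))
```

A keyed hash of this kind costs far less computation than public-key signatures, which matters given the battery concern raised above, but it requires the device and the server to share a secret.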
This work is simply the first step in a longer journey
towards realizing a practical smartphone malware detection
system. The first shortcoming of our detection framework
is that it is implemented inside the Android framework:
users who want to use our system must run our modified
Android platform, which is not suitable for mass
distribution. Secondly, our framework is not platform-independent; migrating
from one platform version to another is not easy. We
plan to find a more efficient algorithm to decrease the time
consumption and to extend our mechanism to other Linux-based embedded systems.
