
Journal of King Saud University – Computer and Information Sciences xxx (2018) xxx–xxx


iABC-AL: Active learning-based privacy leaks threat detection for iOS applications
Arpita Jadhav Bhatt ⇑, Chetna Gupta, Sangeeta Mittal
Department of Computer Science & IT, Jaypee Institute of Information Technology, Noida, India

Article info

Article history:
Received 28 March 2018
Revised 18 May 2018
Accepted 18 May 2018
Available online xxxx

Keywords:
iOS applications
Information security
Static analysis
Permission extraction
Active learning

Abstract

Do iOS applications breach privacy? With a plethora of iOS applications available in the market, most users are unaware of the security risks they pose. These include breaches of the user’s privacy through sharing of personal and sensitive smartphone data without the user’s consent. Apple follows a strict code signing procedure to ensure that applications are developed by trusted enterprises. However, past malware attacks on iOS devices have demonstrated that there is a lack of protection from permission misuse by applications. While machine learning approaches offer promising results in detecting such malicious applications for the Android operating system, there has been minimal research in extending them to the iOS platform due to the unavailability of labeled data sets. In this study, we propose iABC-AL (iOS Application analyzer and Behavior Classifier using Active Learning), a framework to detect malicious iOS applications. The objective of iABC-AL is to protect users from permission-induced privacy risks by (i) maximizing the precision of machine learning based classification models and (ii) minimizing the requirement of labeled training data. To attain this objective, the iABC-AL framework incorporates the category of an application and active learning approaches. A total of 2325 iOS applications were evaluated. Empirical results demonstrate that the proposed approach achieves an accuracy rate of 91.5% and increases the precision of the supervised approach by 14.5%.

© 2018 The Authors. Production and hosting by Elsevier B.V. on behalf of King Saud University. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

⇑ Corresponding author at: Department of Computer Science & IT, Jaypee Institute of Information Technology, A-10, Sector-62, Noida, Uttar Pradesh 201309, India.
E-mail addresses: arpita.jadhav@jiit.ac.in (A.J. Bhatt), chetna.gupta@jiit.ac.in (C. Gupta), sangeeta.mittal@jiit.ac.in (S. Mittal).
Peer review under responsibility of King Saud University.

https://doi.org/10.1016/j.jksuci.2018.05.008
1319-1578

1. Introduction

With the growth of mobile computing platforms and high-speed channels for mobile communication, smartphones are becoming increasingly popular, and their number of users keeps increasing. They are now being used for storing and managing a large amount of personal data like messages, contacts, images, videos, emails and GPS location (Wikipedia, https://en.wikipedia.org/wiki/Smartphone). One of the prime reasons for the success of smartphones is their functional expandability through the installation of third-party applications (apps). After Android, iOS is the most widely used smartphone operating system; it has been designed and marketed by Apple Inc. (Wikipedia, https://en.wikipedia.org/wiki/Apple_Inc).

Users can customize their iOS devices by installing third-party applications from the online portal, App store (Apple Inc., https://www.apple.com/in/ios/app-store/). In recent years, tremendous growth in the number of iOS users has led to an increase in the number of downloads of third-party applications (Statista Inc., https://www.statista.com/statistics/263401/global-apple-iphone-sales-since-3rd-quarter-2007/). Consequently, the risk due to applications that are potentially harmful to the user’s privacy has also increased (Kurtz et al., 2014; Park et al., 2014). Privacy concerns are mainly about leakage of application data like emails and documents, as well as leakage of contextual data like location, audio and video (Kurtz et al., 2014). As crucial data is stored on the smartphone, this information has attracted the attention of cybercriminals and advertising industries (Park et al., 2014). Permission-based resource access is one of the ways to control app penetration in the phone. Changes in permission usage were found either in terms of granting resource access without the user’s knowledge or sharing sensitive information in clear text with third parties. Such privacy violations can be discovered only after extensive manual analysis (Symantec Corporation, https://www.symantec.com/about/newsroom/press-releases/2017/skycure_0718_01). Some malware intentionally planted by researchers, like Jekyll apps, changed its behavior post installation (Wang et al., 2013).

Please cite this article in press as: Bhatt, A.J., et al. iABC-AL: Active learning-based privacy leaks threat detection for iOS applications. Journal of King Saud
University – Computer and Information Sciences (2018), https://doi.org/10.1016/j.jksuci.2018.05.008
Third-party applications access phone data and sensors by asking for permissions during installation (Chandramohan and Tan, 2012). Many apps are implemented either by individual developers or by small enterprises, not all of whom can be trusted for the integrity of their permission-based usage behavior (Agarwal and Hall, 2012). Some apps may access more phone resources while running than they sought permission for during installation (Szydlowski et al., 2012). This malicious behavior may lead to a breach of the phone user’s privacy.

The problem of detecting malicious apps in iOS is aggravated by (i) the closed-source nature of the iOS platform and (ii) the distribution of iOS applications as application binaries, which can’t be directly executed on emulators or Xcode (Apple Inc., https://developer.apple.com/xcode/). Android, on the other hand, is an open source OS, and thus most studies on detecting malicious applications focus on this platform (Martin et al., 2016). Most of the techniques proposed for malicious intent detection in Android can’t be effectively applied to iOS apps due to the non-availability of the large labeled data sets required by the underlying classifiers. Besides this, the number of permissions documented for Android is approximately 135, making it a rich feature set for permission-based malware detection (Wang et al., 2014). Apple documents only 13 permissions for iOS apps (Stackoverflow, https://stackoverflow.com/questions/29894749/complete-list-of-ios-app-permissions). This makes permission-induced privacy risks difficult to handle in iOS apps.

To address the above challenges, this paper proposes a novel framework for iOS devices, iABC-AL, which uses active learning approaches to train a model for the detection of malicious iOS applications. The main objective of iABC-AL is to protect the user’s privacy by maximizing the precision of the classification model and minimizing the requirement of labeled training data. Using 10% of the training data set, the active learning approach provides a precision greater than or equivalent to that obtained when 90% of the training data set is used in traditional supervised classification. In order to test the effectiveness of the proposed approach, a total of 2235 iOS applications (1470 apps from App store (Apple Inc., https://www.apple.com/in/ios/app-store/) and 885 apps from vShare (vShare: Download paid apps for free on iOS 10 (iPhone&iPad) and Android without jailbreak, http://www.vshare.com/)) were evaluated across 20 different application categories, for 13 user permissions (Stackoverflow, https://stackoverflow.com/questions/29894749/complete-list-of-ios-app-permissions).

Major contributions of the proposed work are the following:

• One of the first studies on quantitative analysis of permission-induced privacy breach for iOS apps.
• Design of a novel framework for evaluating and classifying the behavior of iOS applications by utilizing the application’s category information and set of permissions.
• Development of a privacy leak threat detection model based on active learning that achieves comparable precision with very small labeled data sets. Several experiments with different amounts of training data were conducted. The amount of training data was varied from 10% to 90% with an interval of 10%. The results showed that the proposed approach could achieve a best-case precision of 91.3%, TPR of 90.1%, recall of 90.1% and F-measure of 90.2% by utilizing only 10% of the entire data set for training.
• The empirical results show that the proposed model is efficient and scalable for detecting malicious iOS applications.
• Comparison of our proposed work with existing work demonstrates an increased accuracy rate of 88.5%–91.3% against 51–91% in (Pajouh et al., 2017) for different active learning scenarios and machine learning classifiers.

The paper also addresses the following research questions: (i) performance evaluation of iABC-AL for malware detection in iOS apps; (ii) effectiveness of iABC-AL with respect to the amount of labeled data available for training; (iii) performance of iABC-AL over a varying number of iterations using active learning.

The rest of the paper is organized as follows: Section 2 describes the related work. Section 3 describes active learning. Section 4 describes category-wise iOS application privacy breach analysis using active learning. Section 5 depicts the experimental setup. Section 6 presents the empirical analysis of the results obtained. The paper is concluded in Section 7.

2. Related work

Significant research has been done on malicious app detection in Android OS, in contrast to iOS, which hasn’t been studied sufficiently, mainly due to its closed-source features. In this work, information about iOS apps’ code and online behavior has been obtained by using reverse engineering tools (Dale Rapp, https://dalewifisec.wordpress.com/2013/01/24/how-to-find-vulnerabilities-in-mobile-apps-through-reverse-engineering/). This information has been used to classify an app as malicious or benign.

Kurtz et al. performed automated privacy analysis of iOS applications using dynamic analysis (Kurtz et al., 2014). The authors evaluated 1136 iOS apps by tracing sensitive API calls and tracking network connections. However, their approach requires every app to be installed on a device before it can be detected as malicious, which limits its usefulness in real time.

Code inspection and behavior monitoring are the two main ways to detect a malicious app. In this work, behavior detection in terms of granted permissions-based usage has been used to detect apps with ill intent.

Studies on malicious and legitimate applications in Android OS demonstrate the difference in the nature of permission requests in benign and malicious apps (Rovelli and Vigfusson, 2014; Huang et al., 2013). It has been found that over one-third of apps are over-privileged relative to the permissions they sought from the user (Felt et al., 2012). Thus, instead of just looking at the permission set asked for by an app, it is important to keep a tab on actual permission usage.

Wang W et al. explored the permission-induced risk of Android applications using feature ranking based on T-test, correlation coefficient, and mutual information (Wang et al., 2014). Sequential forward selection (SFS) and principal component analysis (PCA) have also been utilized to identify risky permissions and their subsets to detect malicious applications using various machine learning classifiers. We extended this approach to capture permission-based privacy risk detection in iOS apps by tracing sensitive API usage in code and during runtime. Supervised machine learning approaches applied to a dataset of apps and their permissions-based behavior have been used to classify apps as malicious or benign.

However, permission-based malware detection techniques for the iOS platform pose some challenges that were not present in Android apps. For example, the permissions which require the user’s approval number only 13, and even these are not clearly documented by Apple (Stackoverflow, https://stackoverflow.com/questions/29894749/complete-list-of-ios-app-permissions). Given this small set, it is difficult to correctly characterize apps based on permission usage. In this work, this problem has been handled by including the category of an app as an important dimension in deciding permission-based risk.

Most of the studies on Android apps have been on imbalanced datasets due to the unavailability of large samples of malicious apps (Nissim et al., 2017; Nissim, et al., 2014), and (Nissim et al., 2015). In case of our study on iOS, even the sample of benign
applications was also not very large due to the reverse engineering effort required to get an app’s behavior. One of the prominent works for iOS, by Pajouh H et al., addressed this problem by developing intelligent malware threat detection for OS X malware (Pajouh et al., 2017). Their proposed model was based on a supervised approach in which kernel-based SVM and a weighing factor for application library calls were used to detect OS malware, and it gave a detection accuracy of 91%. This work has been taken as the benchmark against which the performance of our proposed model is compared.

However, in this work, active learning approaches over supervised learning models have been explored to overcome the problem of lack of large datasets. Active learning approaches are based on learning from selected best samples of a dataset (Settles, 2009). Active learning has been used in classifying maliciousness in many cases where labeled data is scarce, as in text files of different formats and source code (Settles, 2009; Nissim et al., 2017; Nissim, et al., 2014; Nissim et al., 2015). Apart from that, the use of active learning to establish maliciousness in Android apps has also been studied in (Rashidi et al., 2017; Zhao et al., 2012).

Nissim et al. aimed at detecting new unknown malicious docx files using active learning methods and structural feature extraction methods (Nissim et al., 2015). The experimental results demonstrated that the use of active learning methods detected unknown malware by utilizing only 14% of labeled documents, thereby reducing the labeling efforts by 95.5%. A comparative analysis of the results obtained from active learning approaches and those obtained from SVM-margin and passive learning approaches was performed. The authors have demonstrated that by using active learning approaches the results improved by 91% in acquiring unknown docx file malware. Similarly, in (Nissim et al., 2017), the ALDOCX framework to detect unknown malicious document files using active learning approaches based on structural feature extraction methods has been presented. The framework extracted meta-features from docx files and, by using machine learning algorithms, became capable of detecting new malicious files. Active learning based approaches can be applied to detect malicious PDFs also (Nissim et al., 2014).

Source code is another class of data that may be malicious but for which large labeled samples are not available. Moskovitch et al. in (Moskovitch et al., 2007) proposed an approach to detect unknown malicious code using active learning. They evaluated the approach on more than 3000 files using simple-margin and error reduction active learning methods and achieved good classification results. Besides this, active learning for sequence labeling, information extraction and document segmentation has also been proposed (Settles and Craven, 1070).

Thung F et al. proposed a combined active and semi-supervised approach for defect prediction (Thung et al., 2015). Active learning was helpful in constructing a good classifier model with minimal manual annotation effort. The authors achieved 71% accuracy in predicting labels of 500 defects while using only 50 defects in learning.

The malware detection framework RobotDroid for the Android operating system, using support vector machine and active learning, demonstrated the capability of detecting malicious software and its variants during run-time (Zhao et al., 2012).

Bahman et al. proposed a framework for detecting malicious Android applications based on support vector machine (SVM) and active learning techniques (Rashidi, 2017). For building the active learning model, the expected error reduction querying strategy was used for integrating new instances of Android malware. The model was evaluated by conducting various experiments on the DREBIN malware data set. The results demonstrated that their proposed approach accurately detects malicious Android applications and is also capable of detecting new malware. Ma et al. proposed a malware detection framework, CHABADA, for comparing an Android app’s behavior against its description (Ma et al., 2015). By incorporating active learning approaches, a good set of applications was selected for experts to label. The resultant model gave good results when tested on 22,555 Android applications. In iABC-AL, we examine this approach in more detail by using different active learning scenarios and a number of querying strategies that help in obtaining the best values on standard metrics.

The background study thus suggests that, as with Android apps and other areas of malware detection, active learning based supervised learning can offer promising results for detecting malicious iOS applications. In this paper, we propose a framework named “iABC-AL” that can identify risky permissions across categories of apps to minimize the privacy threats for end users. The framework incorporates active learning approaches to classify applications effectively while minimizing the requirement of a large labeled data set.

3. Active learning approaches

Active learning is termed a special class of semi-supervised machine learning approach in which the learning algorithm queries the user or Oracle (some other source of information) to choose the most informative instance (Settles, 2009; Wikipedia, Active learning (machine learning)). The machine learning algorithm can achieve better accuracy with fewer labeled data by use of active learning approaches (Ramirez-Loaiza et al., 2017), which is the main motivation behind using them. Active learning is a supervised machine learning approach which is used for selecting samples from a data pool. The active learner can query an oracle (for example a human annotator) for annotating unlabeled instances (Settles, 2009). Active learning approaches can be used in many machine learning problems such as speech recognition, information extraction, classification, filtering, malware detection etc. They are also used where unlabeled data is abundant but labels are difficult, time-consuming or expensive to obtain (Settles, 2009).

In this paper, active learning was employed for detecting malicious iOS apps, as most malware detection approaches use rule-based detection techniques (Zhao et al., 2012). The limitation of such detection approaches is that they can detect malware only based on a pre-defined rule database and may fail to detect new malware and its variants. Traditional rule-based approaches for anomaly detection use a statistical method to divide data into benign and malicious categories (Zhao et al., 2012). In order to improve the precision of classifiers during the machine learning process, a large labeled data set is required, which increases the cost of making the training data set (Zhao et al., 2012). The establishment of a labeled training data set depends on security experts and is also expensive. Active learning is used to overcome these limitations of supervised approaches (Zhao et al., 2012), as it selects the best samples for training, thereby reducing the number of samples that are required for evaluation. In this work, active learning approaches were employed to improve the precision rate obtained using the supervised learning approach as well as to reduce the requirement of labeled data. The following sub-section provides an overview of active learning scenarios through which queries can be formulated for classifying the behavior of iOS applications. Fig. 1 describes the taxonomy of active learning approaches.

3.1. Active learning scenarios

Active learning is also called “query learning” (Settles, 2009). The main idea is that the learning algorithm chooses the data from which it learns and thereby performs better with reduced training.
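The query-learning idea can be sketched as a short loop. The example below is a toy illustration only and is not the iABC-AL implementation: it assumes a one-dimensional pool of integer feature values, a hypothetical `oracle` function standing in for the human annotator, and a simple threshold classifier. All function and variable names are invented for this sketch. At each step the learner asks for the label of the unlabeled point closest to its current decision threshold, that is, the point it is least certain about.

```python
def fit_threshold(labeled):
    """Toy 1-D classifier: threshold halfway between the largest point
    labeled 0 and the smallest point labeled 1."""
    max_neg = max(x for x, y in labeled if y == 0)
    min_pos = min(x for x, y in labeled if y == 1)
    return (max_neg + min_pos) / 2.0


def predict(threshold, x):
    """Label 1 above the threshold, 0 otherwise."""
    return 1 if x > threshold else 0


def query_learning(pool, oracle, seed, budget):
    """Repeatedly query the oracle for the most uncertain unlabeled point.

    pool   -- iterable of unlabeled feature values
    oracle -- callable returning the true label of a point (assumed annotator)
    seed   -- initial labeled list [(x, y), ...], at least one example per class
    budget -- maximum number of label queries
    """
    labeled = list(seed)
    known = {x for x, _ in labeled}
    unlabeled = [x for x in pool if x not in known]
    for _ in range(budget):
        if not unlabeled:
            break
        threshold = fit_threshold(labeled)
        # The point nearest the decision threshold is the least certain one.
        query = min(unlabeled, key=lambda x: abs(x - threshold))
        unlabeled.remove(query)
        labeled.append((query, oracle(query)))
    return fit_threshold(labeled), labeled


if __name__ == "__main__":
    pool = list(range(100))
    hidden_truth = lambda x: 1 if x >= 62 else 0  # unknown to the learner
    threshold, labeled = query_learning(pool, hidden_truth, [(0, 0), (99, 1)], budget=8)
    correct = sum(predict(threshold, x) == hidden_truth(x) for x in pool)
    print(len(labeled), correct / len(pool))
```

In this toy setting the learner needs only a handful of labels because each query roughly halves the region in which the class boundary can lie. Real feature spaces and classifiers do not behave this cleanly, so the sketch should be read as an intuition aid rather than a performance claim.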

Fig. 1. Taxonomy of active learning approaches.

Active learning can be used where labeled data is very limited, or where obtaining it is difficult, time-consuming, expensive or requires human effort. An active learning system overcomes the problem of labeling the unlabeled data by posing queries, in the form of unlabeled instances, to an Oracle (for example a human annotator) (Settles, 2009). By this technique, the active learner aims to achieve higher accuracy by using few labeled instances, thereby reducing the cost of a labeled data set. There are numerous query strategies which are used to select the instances that are most informative. The commonly used active learning scenarios are stream-based selective sampling and the pool-based active learning scenario (Settles, 2009; Moskovitch et al., 2007).

3.1.1. Stream-based selective sampling

The approach is also called a sequential active learning approach because each unlabeled instance is picked one at a time from the data source. The learner then decides whether to query the instance or discard it. The main assumption of this approach is that an unlabeled instance needs to be sampled first from the actual distribution. The learner can then decide to request its label (Settles, 2009).

3.1.2. Pool-based active learning scenario

In the pool-based active learning scenario, the queries are picked from a large pool of unlabeled data by using the uncertainty sampling query strategy (Settles, 2009; Ramirez-Loaiza et al., 2017). The uncertainty sampling querying strategy chooses the instance from the pool for which the model is least certain about its label. In the pool-based active learning phase, the learner can begin with a limited number of instances from the labeled training data set. Next, the learner may request labels for the selected instances, learn from the query results and leverage the new knowledge to select the instances that could be queried next. Once an instance is queried, no additional assumptions are made on the learner and the newly labeled instance is added to the labeled data set. Thereafter, the learner can proceed in a standard supervised way. The pool-based active learning approach can be used for real-world learning problems where a large collection of unlabeled data can be collected. The approach assumes that there is a small set of labeled data and that queries are drawn from the pool (Settles, 2009).

Difference between stream-based selective sampling and pool-based

The major difference between the stream-based selective sampling and pool-based active learning approaches is that stream-based selective sampling scans the data sequentially and makes each query decision individually, whereas the pool-based approach evaluates and ranks the entire data set before selecting the best query (Settles, 2009).

In this paper, pool-based and stream-based selective sampling active learning techniques using different query strategy frameworks were applied. A detailed analysis of the empirical results is given in Section 6. The following sub-section describes the query strategy frameworks that were employed in both the pool-based and stream-based selective sampling active learning scenarios.

3.2. Query strategy frameworks

Different active learning scenarios evaluate the informativeness of unlabeled instances sampled from a given distribution (Settles, 2009). With reference to (Rashidi, 2017), we use the notation $x_A$, which refers to the most significant instance in the data set (i.e. the optimal query) with reference to query selection algorithm $A$.

3.2.1. Uncertainty sampling

It is the most commonly used query framework. The following query frameworks can be used to pick the most informative instance (Settles, 2009; Ramirez-Loaiza et al., 2017).

3.2.1.1. Least confident. In this query framework, the instances about whose label the learner is least confident are queried using Eq. (1) (Settles, 2009; Settles, 2012).

$$x_{LC} = \arg\max_x \left(1 - P_\theta(\hat{y} \mid x)\right) \tag{1}$$

where $\hat{y}$ denotes the class label having the highest probability, that is,

$$\hat{y} = \arg\max_y P_\theta(y \mid x) \tag{2}$$

3.2.1.2. Entropy sampling. A more generalized form of uncertainty sampling is entropy sampling (Settles, 2009), which is given by Eqs. (3) and (4) (Settles, 2012). With respect to binary classification, entropy-based sampling selects the instance with posterior probability close to 0.5, and it can also be generalized to multi-label classifiers (Settles, 2009).

$$x_H = \arg\max_x \left(H_\theta(Y \mid x)\right) \tag{3}$$

$$H_\theta(Y \mid x) = -\sum_y P_\theta(y \mid x) \log P_\theta(y \mid x) \tag{4}$$

3.2.1.3. Margin sampling. The query strategy is directly oriented to the linear support vector machine classifier (Moskovitch et al., 2007). A linear hyper-plane that separates the instances with respect to a given class is generated. It is defined by the perpendicular distance between instances from different classes (Moskovitch et al., 2007) having maximal margin. The hyper-plane splits the instances according to the classes. Eq. (5) can be used to select the most informative instance to be queried using margin sampling (Settles, 2009; Settles, 2012).

$$x_M = \arg\min_x \left(P_\theta(\hat{y}_1 \mid x) - P_\theta(\hat{y}_2 \mid x)\right) \tag{5}$$

where $\hat{y}_1$ and $\hat{y}_2$ denote the first and second most probable class labels under model $\theta$ (Settles, 2009; Settles, 2012).

3.2.2. Query by committee

The query by committee approach maintains a committee $\mathcal{C} = \{\theta^{(1)}, \ldots, \theta^{(C)}\}$ of different models, which are trained on labeled
set L. Every member of committee votes for labeling the query can- i) A benign application: an application which is safe to use and
didates (Settles, 2009; Settles, 2012). The framework picks the does not result in user’s data privacy leaks. The application
query instance for which the committee disagrees (Ramirez- makes uses of genuine permissions and does not share user’s
Loaiza et al., 2017). To measure the level of disagreement using data to an unauthorized entity.
query by committee following approaches are used (Settles, ii) A suspicious application: an over-privileged application that
2009; Settles, 2012). makes use of unnecessary/extra permissions than required
as per its category and may cause privacy leak to user’s data.
3.2.2.1. Vote entropy. To select the instance for which the commit- iii) A malicious application: an application that shares user’s
tee maximally, disagrees. Eq. (6) (Settles, 2009; Settles, 2012) is sensitive data with third-party domains, advertisement
used. companies or third-party analytics without user’s permis-
! sion. These applications are very harmful as the users may
X Vðy Þ Vðyi Þ be unaware that these applications may transmit a lot of
xVE ¼ argmaxx  i
 log ð6Þ user’s data over the network.
i
C C

where Vðyi Þ denotes number of votes that label yi receives amongst Fig. 2 shows the proposed architecture for the iABC-AL frame-
the committee members and C, committee size (Settles, 2009; work for analyzing iOS applications and classifying their behavior
Settles, 2012). for privacy leaks. The analysis process begins with the installation
of the iOS application on an iOS device from App store, Cydia store
or any third-party application store. The iOS device can be iPhone/
3.2.2.2. Kullback leibler divergence. The approach calculates the dif-
iPod/iPad. The iOS applications were then reverse engineered for
ference between two probabilities. In order to select the most
static analysis (Dale Rapp, https://dalewifisec.wordpress.co
informative query, the framework selects the instance which has
m/2013/01/24/how-to-find-vulnerabilities-in-mobile-apps-throug
the largest average difference between: any one member of com-
h-reverse-engineering/) wherein privacy related frameworks and
mittee and all learners (consensus) (Settles, 2009; Settles, 2012).
classes were extracted and later mapped into permission variables.
!
1 XC Based on the permission usage along-with the category to which
xKL ¼ argmaxx  DðPhðcÞ jjPc Þ ð7Þ an application belongs, applications are classified as
C c¼1
\[ D\left(P_{\theta^{(c)}} \,\|\, P_C\right) = \sum_i P_{\theta^{(c)}}(y_i \mid x) \, \log \frac{P_{\theta^{(c)}}(y_i \mid x)}{P_C(y_i \mid x)} \tag{8} \]

where \theta^{(c)} denotes a model in the committee and C the entire committee.

\[ P_C(y_i \mid x) = \frac{1}{C} \sum_{c=1}^{C} P_{\theta^{(c)}}(y_i \mid x) \tag{9} \]

A stratified 10-fold cross validation using the machine learning classifiers Naïve Bayes and SVM (support vector machine) was employed. A total of 2325 iOS apps were evaluated. The input was a Boolean permission matrix P_{m×n}, where m represents the number of instances of iOS applications and n represents the number of permissions. With the closed-source nature of the iOS platform, obtaining a labeled data set is difficult for iOS applications. A data set of 2325 applications is not sufficient for the reliable training of supervised machine learning classifiers. Therefore, active learning approaches were employed, as they minimize the requirement of labeled data. The querying strategies that were employed with the classifiers were entropy sampling, Kullback-Leibler divergence, least confident sampling, margin sampling, relevance sampling and vote entropy (Reyes et al., 2016). The detailed analysis of the results is presented in Section 6. The following section describes the working of the iABC-AL framework, which uses active learning approaches to detect malicious applications along with the category of an application as a dimension to detect privacy breach.

4. Category wise iOS application privacy breach analysis using active learning

The iABC-AL framework for analyzing and classifying the behavior of iOS applications comprises two phases: (a) application preprocessing and (b) privacy leak threat detection.
For classifying the behavior of iOS applications, a new class label malicious was added alongside benign and suspicious. The framework classifies an iOS app into one of the following three categories: benign/malicious/suspicious.

Fig. 2. Architecture of iABC-AL.

4.1. Application preprocessing

This phase is used to explore the permission induced risk of iOS applications by analyzing their frameworks and classes. The phase also incorporates the category of the applications as an important feature to classify them as benign, suspicious or malicious. Considering an application's category helps in identifying the misuse of risky permissions for a given category, as applications belonging to the same category request a similar set of permissions (Wang et al., 2014).
Before downloading applications from the app store, Apple's developer guidelines for different application categories were studied. These guidelines describe basic features with respect to a category and help the developers to select a category before uploading the applications on the app store (Apple Inc., https://develo

Please cite this article in press as: Bhatt, A.J., et al. iABC-AL: Active learning-based privacy leaks threat detection for iOS applications. Journal of King Saud
University – Computer and Information Sciences (2018), https://doi.org/10.1016/j.jksuci.2018.05.008
6 A.J. Bhatt et al. / Journal of King Saud University – Computer and Information Sciences xxx (2018) xxx–xxx

per.apple.com/ios/human-interface-guidelines/app-architecture/requesting-permission/). Apple provides 24 different application categories for the developers (Apple Inc., https://developer.apple.com/app-store/categories/). In this work, the top 20 application categories were considered (Statista, https://www.statista.com/statistics/270291/popular-categories-in-the-app-store/). The exhaustive study of the developer guidelines aided in checking the consistency of the results obtained from applying the permission-based ranking algorithm across a category.

Use of developer frameworks for accessing user/device resources

Apple provides a different set of frameworks for its developers, which are broadly categorized as app frameworks, graphics & games, app services, media & web, developer tools and system (Apple Inc., https://developer.apple.com/documentation/). Apart from the frameworks provided by Apple, application developers can use third-party frameworks in an application's source code. These frameworks provide a set of interfaces, classes and methods which, when integrated, can be used to access resources of the device such as camera, photo album, geo-coordinates etc.
Frameworks, classes, and methods used in the application can give us insight into the general behavior of an application, particularly for assessing the permissions or set of permissions the application will require once installed. A benign application uses frameworks that are required for its proper functioning. However, there can be some applications that make use of certain frameworks which are not required for an application to provide its desired functionalities. This behavior, where the application uses permissions other than the required ones, tends to be malicious in nature. In the application preprocessing phase, a Boolean permission matrix to be used as input to machine learning classifiers was constructed through a 4-step process.

Step 1: Application stored and fetched (.app)
In this stage, 2325 apps from different categories were downloaded and reverse engineered to extract the frameworks (later mapped to user permissions). Out of 2325 apps, 1470 apps were downloaded from the App Store and 855 from vShare. The apps downloaded from the App Store were considered benign as they pass Apple's code review process. The apps downloaded from vShare were considered to be suspicious, as vShare is not the official store for distributing iOS applications (vShare: Download paid apps for free on iOS 10(iPhone&iPad) and Android without jailbreak, http://www.vshare.com/) and there are chances of introduction of over-privileged frameworks. Table 1 details the distribution of applications across 20 different categories downloaded from the App Store and vShare.

Table 1
Application data set downloaded from App Store and vShare.

S. No | Category | App Store | vShare
1 | Books | 100 | 50
2 | Business | 150 | 75
3 | Education | 130 | 70
4 | Entertainment | 150 | 75
5 | Finance | 50 | 30
6 | Food & Drink | 50 | 25
7 | Games | 150 | 100
8 | Health & Fitness | 50 | 30
9 | Lifestyle | 70 | 40
10 | Medical | 50 | 25
11 | Music | 80 | 40
12 | Navigation | 30 | 15
13 | News | 50 | 25
14 | Photo & Video | 30 | 20
15 | Productivity | 50 | 25
16 | Shopping | 50 | 40
17 | Social Networking | 50 | 30
18 | Sports | 30 | 15
19 | Travel | 100 | 100
20 | Utilities | 50 | 25
Total | | 1470 | 855

Step 2: Feature Extraction
In this stage, a feature refers to a user/device permission. After installing the applications from the App Store and vShare, the applications were reverse engineered for framework extraction using static analysis (Dale Rapp, https://dalewifisec.wordpress.com/2013/01/24/how-to-find-vulnerabilities-in-mobile-apps-through-reverse-engineering/; Intel, https://software.intel.com/en-us/node/622647). Static analysis of code helps in discovering the frameworks, Objective-C classes and methods used in an application (Szydlowski et al., 2012). The extraction was performed using otool (Wikipedia, https://www.owasp.org/index.php/IOS_Application_Security_Testing_Cheat_Sheet; Raywenderlich, https://www.raywenderlich.com/45645/ios-app-security-analysis-part-1). In this paper, information for 51 privacy related frameworks (provided by the official app store) along with third-party frameworks such as Google analytics, the PL crash reporting framework and ad-supporting frameworks was extracted. Next, important classes, methods and delegates with respect to every framework were identified and API calls were traced for every application using static analysis.
For example, a user's photo gallery can be accessed using the classes and delegates of UIKit (Apple Inc., https://developer.apple.com/documentation/uikit), Photos (Apple Inc., https://developer.apple.com/documentation/photos) or the PhotosUI framework (Apple Inc., https://developer.apple.com/documentation/photosui). Similarly, a user's location can be fetched by using classes and methods of CoreLocation (Apple Inc., https://developer.apple.com/documentation/corelocation) or the MapKit framework (Apple Inc., https://developer.apple.com/documentation/mapkit). Table 2 lists the permissions that require the user's approval before access is granted, together with the corresponding frameworks that were used for permission mapping. User/device resources can be accessed by integrating different frameworks and their delegates and class methods, which in turn requires the user's approval. For this work, all the 51 privacy related frameworks were mapped to 13 user permission variables.

Step 3: Permission mapping
After performing an exhaustive study on the privacy-related frameworks and their classes, the frameworks were grouped based on the resource they accessed. For example, frameworks like MapKit and CoreLocation were grouped together as they can be used to access the user's location. Similarly, frameworks used to access the user's photo album were grouped together. The same process was repeated for the other frameworks. Thus, all the extracted frameworks that were related to the user's privacy were mapped to 13 permission variables. As the list of permission variables is not well documented, the total number of permissions that Apple provides is not available. For this work, we have used the permissions that are listed in Table 2. To illustrate the mapping of frameworks to user permissions, we cite an example of an application 'eBook reader' from the books category. The information on all the frameworks is depicted in Fig. 3.
The application had frameworks for accounts (Accounts, Social, and Twitter), which were mapped to the single permission variable accounts. The application had other frameworks: AddressBook, AVFoundation, CoreLocation, and CoreMotion. These frameworks were mapped to the permission variables contacts, camera, location and motion & fitness. Similarly, the process was repeated for all 2325 apps. A total of 13 permissions were considered, which included: Bluetooth sharing, calendar, camera, health kit, home kit, location services, media library, microphone, motion & fitness, photos, reminders, social media accounts or user accounts. Fig. 3 depicts a snapshot of the extracted frameworks using otool. The application


Table 2
Possible 13 permissions in iOS and some of their frameworks.

S. No | Permission | Framework | Description
1 | Bluetooth | Core Bluetooth (Apple Inc., https://developer.apple.com/documentation/corebluetooth) | Provides classes that applications use to communicate with devices that are equipped with Bluetooth technology.
2 | Calendar | EventKit (Apple Inc., https://developer.apple.com/documentation/eventkit) | Provides interface to access user's calendar and event data.
3 | Camera | AVFoundation (Apple Inc., https://developer.apple.com/documentation/avfoundation) | Provides interfaces to play and record audio & video. AVCaptureDevice provides input, i.e. audio and video, for capturing sessions (Apple Inc., https://developer.apple.com/documentation/avfoundation/avcapturedevice).
  | | UIKit (Apple Inc., https://developer.apple.com/documentation/uikit) | Provides classes and methods to construct and manage graphical user interface of iOS applications.
4 | Photo | UIKit (Apple Inc., https://developer.apple.com/documentation/uikit) | Provides interface to access camera & photo album via UIImagePickerController class. The class provides interface for clicking pictures (Apple Inc., https://developer.apple.com/documentation/uikit/uiimagepickercontroller).
  | | Photos (Apple Inc., https://developer.apple.com/documentation/photos) | Provides classes that help in creating photo editing extensions that are managed by Photos app.
  | | PhotosUI (Apple Inc., https://developer.apple.com/documentation/photosui) |
5 | Cellular-data | CoreTelephony (Apple Inc., https://developer.apple.com/documentation/coretelephony) | Provides information regarding user's service providers for accessing telephony related details.
6 | Contacts | AddressBook (Apple Inc., https://developer.apple.com/documentation/addressbook) | Provides interface to access the database for storing contacts.
7 | Location | CoreLocation (Apple Inc., https://developer.apple.com/documentation/corelocation) | Provides interface to obtain the geographic location of a user.
8 | Microphone | Speech (Apple Inc., https://developer.apple.com/documentation/speech) | Provides interface to recognize live or pre-recorded speech.
9 | Health kit | HealthKit (Apple Inc., https://developer.apple.com/documentation/healthkit) | Allows the application to share health-related data with other applications.
10 | Home kit | HomeKit (Apple Inc., https://developer.apple.com/documentation/homekit) | Provides an interface for users to communicate with and configure home automation accessories.
11 | Notification | UserNotifications (Apple Inc., https://developer.apple.com/documentation/usernotifications) | Provides support for delivering and managing local/remote notifications.
12 | Motion & Fitness | CoreMotion (Apple Inc., https://developer.apple.com/documentation/coremotion) | Provides accelerometer, pedometer and gyroscope interfaces to access data on motion and movement of an iOS device.
13 | Account/Social media accounts | Accounts (Apple Inc., https://developer.apple.com/documentation/accounts) | Provides interface to users to access and manage their external accounts. Provides interface to integrate social media for iOS applications.
  | | Social (Apple Inc., https://developer.apple.com/documentation/social) |
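The grouping of frameworks into permission variables, and the OR over each group that yields one Boolean row of the permission matrix, can be sketched as follows. This is a minimal illustration and not the authors' tooling: the dictionary covers only frameworks named in Table 2 and in the text (UIKit is left out because Table 2 lists it under two permissions), the permission identifiers are made-up labels, and the sample input merely imitates the shape of `otool -L` output for a hypothetical binary.

```python
import re

# Illustrative subset of the framework-to-permission grouping (assumption:
# identifiers such as "motion_fitness" are labels invented for this sketch).
FRAMEWORK_TO_PERMISSION = {
    "CoreBluetooth": "bluetooth",
    "EventKit": "calendar",
    "AVFoundation": "camera",
    "Photos": "photos",
    "PhotosUI": "photos",
    "CoreTelephony": "cellular_data",
    "AddressBook": "contacts",
    "CoreLocation": "location",
    "MapKit": "location",
    "Speech": "microphone",
    "HealthKit": "health_kit",
    "HomeKit": "home_kit",
    "UserNotifications": "notification",
    "CoreMotion": "motion_fitness",
    "Accounts": "accounts",
    "Social": "accounts",
    "Twitter": "accounts",
}

# Fixed column order of the permission matrix (13 permission variables).
PERMISSIONS = sorted(set(FRAMEWORK_TO_PERMISSION.values()))

_FRAMEWORK_RE = re.compile(r"/([A-Za-z0-9_]+)\.framework/")


def linked_frameworks(otool_output):
    """Return the set of framework names found in otool -L style text."""
    return set(_FRAMEWORK_RE.findall(otool_output))


def permission_row(frameworks):
    """One Boolean row: a permission is 1 if ANY framework of its group is present (OR)."""
    used = {FRAMEWORK_TO_PERMISSION[f] for f in frameworks if f in FRAMEWORK_TO_PERMISSION}
    return [1 if p in used else 0 for p in PERMISSIONS]


# Hypothetical listing for an 'eBook reader'-like binary.
sample = (
    "eBookReader:\n"
    "\t/System/Library/Frameworks/Accounts.framework/Accounts (compatibility version 1.0.0, current version 1.0.0)\n"
    "\t/System/Library/Frameworks/Social.framework/Social (compatibility version 1.0.0, current version 87.0.0)\n"
    "\t/System/Library/Frameworks/CoreLocation.framework/CoreLocation (compatibility version 1.0.0, current version 1.0.0)\n"
    "\t/usr/lib/libobjc.A.dylib (compatibility version 1.0.0, current version 228.0.0)\n"
)
row = dict(zip(PERMISSIONS, permission_row(linked_frameworks(sample))))
```

Here Accounts and Social both fall into the accounts group, so the OR yields a single 1 for that permission variable, while permissions with no linked framework stay 0.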

also had crash reporting and Flurry analytics frameworks and their classes, which are shown in Fig. 4.

Fig. 3. Frameworks for 'eBook reader' application.

Step 4: Permission matrix construction
After grouping the frameworks that can be used to access the same resource (Step 3), a permission matrix was constructed. The permission matrix is a Boolean-valued matrix P_{m×n}, where m denotes the number of applications and n the number of permissions. A total of 76 classes across 51 frameworks were considered. For every application, the presence of a framework was marked as 1, and 0 otherwise. The frameworks that can be used to access the same resource were grouped together and, for constructing the permission matrix, an 'OR' operation was performed over each group.
To illustrate the construction of the permission matrix, we consider the same application, eBook reader. The application had the class ACAccount (Apple Inc., https://developer.apple.com/documentation/accounts/acaccount) from the Accounts framework (Apple Inc., https://developer.apple.com/documentation/accounts) and the class TWTweetComposeViewController (Xamarin, https://developer.xamarin.com/api/type/Twitter.TWTweetComposeViewController/) from the Social framework (Apple Inc., https://developer.apple.com/documentation/social) and the Twitter framework for integrating the Twitter account of the user into the app. As these frameworks can be used to access user account details, they were grouped together. For constructing the permission matrix, the OR operation was performed, which resulted in the value 1 for the permission variable accounts. Similarly, the application 'playbooks' had the frameworks CoreImage and ImageIO with their respective classes, which were grouped together as both frameworks can be used to access photos. The process was repeated for all 2325 applications and their frameworks to map them into 13 permission variables. Other frameworks such as third-party analytics, crash reporting and ad supporting frameworks were also studied to classify the behavior of applications.

Adding a new class label 'malicious'
Our previous studies on dynamic analysis (Bhatt, 2017), wherein the application's behavior is examined at runtime (Szydlowski et al., 2012; Intel, https://software.intel.com/en-us/node/622647), have revealed that the applications that contain advertisement frameworks, crash reporting frameworks, analytics


Fig. 4. Third-party frameworks and classes for ‘eBook reader’ application.

frameworks share large amounts of the user's crucial data over the network. This data includes the user's GPS coordinates, emails, passwords, health information, information on social media accounts etc. Moreover, this crucial data is shared over the network in an unencrypted form with other third-party domains for monetary benefits, and users are unaware of this privacy breach. Any app that contains third-party domains, analytics frameworks or ad supporting frameworks and that shares the user's sensitive information such as email IDs, passwords, text messages, searched keywords or location with third-party domains in an unencrypted form and without the user's consent is 'malicious'.
In this work, after extracting the third-party frameworks (using static analysis), a new class label malicious was assigned to 323 apps. Our studies revealed startling facts: 180 of these apps came from the App Store (and were initially assumed benign), whereas 143 came from vShare (and were initially assumed suspicious). Analysis of these frameworks through static analysis revealed that applications from the App Store also contained many privacy breaching frameworks. Thus, the permission matrix with the new class label assigned was used as input to the classifiers. The output of this step was the Boolean permission matrix that was passed to the machine learning classifiers.

4.2. Privacy leak threat detection phase

The privacy leak threat detection phase classifies an iOS application as benign, malicious or suspicious based on active learning approaches. The phase describes how active learning approaches were employed to classify apps. The JCLAL framework was employed to classify applications using active learning (Reyes et al., 2016). The learner takes a set of feature vectors and labels (i.e. benign, suspicious or malicious) as inputs in the training corpus. The feature vector comprises the application category and the permissions used by an app. Thus, the classifier can differentiate between benign, malicious and suspicious applications based on the category and the use of permission variables across it. In this paper, two different machine learning classifiers, Naïve Bayes and support vector machine (SVM), were employed. The experimental setup for training classifiers using active learning is described in the next section.

5. Experimental setup for active learning

The JCLAL framework was employed (Reyes et al., 2016) for conducting experiments and to classify applications using active learning. The input for the framework is a configuration file (an XML or Java file) which comprises a series of parameters. For conducting the experiments, we created different XML configuration files for several query strategies and different active learning scenarios. The configuration file comprises the parameters that are required to run an experiment (see Fig. 5).
The most important parameters of the configuration file include:

i) Evaluation method: describes the evaluation method that is used during the active learning process. For conducting the experiments, 10-fold cross validation was employed on 2325 iOS applications.
ii) Data set: specifies the application data set that is input to the classifiers. In our work, the data set refers to the permission matrix obtained from the application preprocessing phase.
iii) Active learning algorithm: selects the active learning protocol that is used during the learning process.
iv) Stopping criteria: specifies the termination condition for the experiment. The maximum number of iterations is declared by the user according to their criteria. In this work, the maximum number of iterations used was 100.
v) Active learning scenarios: specify the scenarios used in the active learning process. For our empirical studies, pool-based and stream-based selective sampling active learning scenarios have been used.
vi) Query strategies: specify the query strategy used during the active learning process. In our study, six query strategies were employed for the stream-based selective sampling and pool-based active learning scenarios.
vii) Base classifier: for conducting experiments, Naïve Bayes and linear SVM (support vector machine) classifiers were employed for the different active learning scenarios.
viii) Oracle: specifies the oracle used during the process. JCLAL provides two types of oracle. A simulated oracle reveals the hidden class of the selected unlabeled instance. The framework also supports a console human oracle that iteratively asks the user to specify the class of the selected unlabeled instance. During our experiments, we used the simulated oracle.

To measure the effectiveness of iABC-AL, four standard metrics (precision, recall, accuracy and F-score) were employed. All the


Fig. 5. Structure of a configuration file (Reyes et al., 2016).
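The roles of the configuration parameters (base classifier, query strategy, simulated oracle, stopping criterion) can be seen in a generic pool-based loop. The sketch below is not JCLAL code and does not reproduce its configuration format; it is a small pure-Python illustration that assumes a Bernoulli Naïve Bayes base classifier with Laplace smoothing, entropy sampling as the query strategy, and a simulated oracle that simply reveals the stored label. All function names and the toy data are hypothetical.

```python
import math


def train_nb(X, y, classes):
    """Bernoulli Naive Bayes over Boolean features, with Laplace smoothing."""
    n_features = len(X[0])
    model = {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        prior = (len(rows) + 1) / (len(X) + len(classes))
        likelihood = [(sum(r[j] for r in rows) + 1) / (len(rows) + 2)
                      for j in range(n_features)]
        model[c] = (prior, likelihood)
    return model


def predict_proba(model, x):
    """Posterior class distribution for one Boolean feature vector."""
    log_scores = {}
    for c, (prior, likelihood) in model.items():
        s = math.log(prior)
        for xj, pj in zip(x, likelihood):
            s += math.log(pj if xj else 1.0 - pj)
        log_scores[c] = s
    top = max(log_scores.values())
    exp = {c: math.exp(s - top) for c, s in log_scores.items()}
    z = sum(exp.values())
    return {c: v / z for c, v in exp.items()}


def entropy(dist):
    """Shannon entropy of a class distribution (entropy sampling score)."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)


def pool_based_loop(X, hidden_labels, classes, seed_idx, max_iterations):
    """Query the most uncertain pool instance until the stopping criterion is met."""
    labeled = list(seed_idx)
    seeds = set(seed_idx)
    pool = [i for i in range(len(X)) if i not in seeds]
    for _ in range(max_iterations):            # stopping criterion
        if not pool:
            break
        model = train_nb([X[i] for i in labeled],
                         [hidden_labels[i] for i in labeled], classes)
        query = max(pool, key=lambda i: entropy(predict_proba(model, X[i])))
        pool.remove(query)
        labeled.append(query)                  # simulated oracle: label is read from hidden_labels
    return labeled


X = [[1, 0], [0, 1], [1, 1], [0, 0], [1, 0], [0, 1]]
hidden = ["malicious", "benign", "malicious", "benign", "malicious", "benign"]
order = pool_based_loop(X, hidden, ["benign", "malicious", "suspicious"],
                        seed_idx=[0, 1], max_iterations=3)
```

Swapping the `entropy` score for another uncertainty measure changes the query strategy without touching the rest of the loop, which is the kind of separation a configuration file of this sort expresses.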

metrics are based on true positives, true negatives, false positives and false negatives. Higher values of accuracy, precision, recall and F-measure signify higher detection quality (Wang et al., 2014). In this work, three class labels (benign, malicious and suspicious) were used. Let TP denote the number of malicious applications that are correctly classified and FN the number of malicious applications classified as benign or suspicious. TN denotes the number of benign applications correctly classified and FP the number of incorrectly classified benign applications. Precision is defined as the percentage of correct positive predictions (Eq. (10)) (Wang et al., 2014). Recall, also called the true positive ratio (TPR), is defined as the percentage of positive labeled instances predicted as positive (Eq. (11)) (Wang et al., 2014). Accuracy is defined as the percentage of correct predictions (Eq. (12)) (Wang et al., 2014). F-score is computed as the harmonic mean of precision and recall (Eq. (13)) (Wang et al., 2014). The false positive ratio (FPR) is calculated by using Eq. (14).

\[ \text{Precision} = \frac{TP}{TP + FP} \tag{10} \]

\[ \text{Recall} = \frac{TP}{TP + FN} \tag{11} \]

\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{12} \]

\[ F\text{-score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{13} \]

\[ FPR = \frac{FP}{FP + TN} \tag{14} \]

6. Results and evaluation

This section presents a detailed analysis of the empirical results obtained at different stages. It presents (i) the highlighting features of iABC-AL and how these features were achieved, (ii) an in-depth analysis of empirical results using supervised and active learning approaches, and (iii) the research questions addressed in this work.

6.1. Highlights of iABC-AL

The objective of iABC-AL is to protect the end user's privacy by minimizing the percentage of labeled training data and maximizing the precision of the classification model. iABC-AL achieves this by:

i) Reducing false positive rates by incorporating the category of the application.
ii) Improving the precision of classifiers using active learning approaches that utilize a limited labeled data set.

To demonstrate the effectiveness of the iABC-AL framework in maximizing the precision rate of supervised learning approaches, we first present the results obtained by using various machine learning classifiers with/without considering the category of the application in Table 3. For study here, classifiers belonging to


Table 3
Summary of results (%) before/after adding the application's category. C: classifier, TPR: true positive ratio, P: precision, R: recall, F: F-score, Ro: ROC; NB: Naïve Bayes, RT: Random Tree, SVM: support vector machine, KNN: k-nearest neighbor.

C | Without category: TPR | P | R | F | Ro | iABC-AL with category: TPR | P | R | F | Ro
NB | 74.8 | 75.1 | 74.8 | 74.9 | 80.2 | 77.8 | 77.8 | 77 | 80.2 | 77.1
SVM | 74.8 | 74.3 | 74.8 | 74.3 | 79.1 | 80.2 | 80.2 | 80.2 | 79.9 | 82.5
RT | 77.5 | 77.8 | 77.5 | 77.6 | 76.4 | 74.1 | 74.3 | 74.1 | 73.9 | 77.7
KNN | 75.3 | 74.8 | 75.3 | 74.8 | 81.0 | 72.2 | 73.2 | 72.1 | 71.3 | 79.1
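The metrics reported in the tables follow from the four confusion-matrix counts as defined in Eqs. (10) to (14). A minimal sketch, assuming malicious is treated as the positive class and all denominators are non-zero; the function name and the counts in the usage line are illustrative and are not taken from the experiments:

```python
def detection_metrics(tp, tn, fp, fn):
    """Precision, recall (TPR), accuracy, F-score and FPR from confusion counts."""
    precision = tp / (tp + fp)                                  # Eq. (10)
    recall = tp / (tp + fn)                                     # Eq. (11)
    accuracy = (tp + tn) / (tp + tn + fp + fn)                  # Eq. (12)
    f_score = 2 * precision * recall / (precision + recall)     # Eq. (13)
    fpr = fp / (fp + tn)                                        # Eq. (14)
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "f_score": f_score,
        "fpr": fpr,
    }


# Made-up counts, for illustration only.
m = detection_metrics(tp=80, tn=90, fp=10, fn=20)
```

With these made-up counts, recall is 0.8, accuracy is 0.85 and the false positive ratio is 0.1.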

different families of classifiers were deployed, namely: Naïve Bayes (NB), random forest (RF), support vector machine (SVM) and k-nearest neighbor (KNN). These were employed across 13 permissions for a total of 2,325 applications. It can be observed from Table 3 that SVM gave the best TPR as compared with the other classifiers.

1) Reduced false positive rates by incorporating category of the application

The novel idea that we propose is using the application's category as a dimension to improve the detection rate for malicious applications. Applications on the app store are sorted on the basis of category (Apple Inc., https://developer.apple.com/app-store/categories/). For example, the navigation category helps users search for nearby places or track locations, whereas applications from the books category help users download books. The category can help in identifying the malicious nature of an application. For example, it is a normal feature of an application belonging to the navigation category to access the user's current location, whereas the same permission may be risky in the books category. Thus, to improve the accuracy of classification results and to reduce the false positive rate, a ranking of permissions for applications from 20 categories was performed using the correlation coefficient (Wang et al., 2014) and the gain ratio attribute evaluator (Dag et al., 2012). In the case of the iOS platform, the number of permissions that require the user's explicit approval is around 13 (Stackoverflow, https://stackoverflow.com/questions/29894749/complete-list-of-ios-app-permissions), which is far fewer than on Android, where it is approximately 135 (Wang et al., 2014). Adding the category dimension aids in determining the malicious permissions. Permission-based ranking helps to rank the permissions, as applications from the same category request a similar set of permissions. Therefore, misuse of any permission can be easily detected. Table 3 depicts the summary of results obtained by adding the category of the application for improving the accuracy of classifiers.
From Table 3, it can be observed that the TPR for Naïve Bayes and SVM increased by 3 and 5.4 percentage points, respectively. On the other hand, in the case of random tree and k-nearest neighbor, the TPR decreased. Therefore, the Naïve Bayes and SVM classifiers were chosen for the next level, for applying active learning approaches. After applying the correlation coefficient (Wang et al., 2014) and feature selection using gain ratio (Dag et al., 2012), the ranking results were sorted on the basis of first the top 5 risky permissions and then the top 10 risky permissions. The sorted results were then passed to the Naïve Bayes and SVM classifiers for calculation of standard metrics such as TPR, precision, recall, F-score, and ROC.

Permission-based ranking using correlation coefficient

Let X denote a permission variable and C a class variable. The correlation coefficient R (with a value between -1 and +1) signifies how strongly two variables are correlated with each other (Wang et al., 2014). A correlation coefficient R(X, C) of +1 indicates the strongest positive correlation. Eq. (15) gives the formula for calculating the correlation coefficient (Wang et al., 2014).

\[ R(X, C) = \frac{\sum_{n=1}^{N} (X_n - \bar{X})(C_n - \bar{C})}{\sqrt{\sum_{n=1}^{N} (X_n - \bar{X})^2 \, \sum_{n=1}^{N} (C_n - \bar{C})^2}} \tag{15} \]

In this work, the permission variable X is a Boolean variable having the value 1 if an application uses it and 0 otherwise. A correlation coefficient R(X, C) = 0 indicates independence between X and C, and R(X, C) = -1 indicates the strongest negative correlation (Wang et al., 2014). Table 4 shows the results after applying the correlation coefficient to the books category. From Table 4, we can assert that permission-based ranking using the correlation coefficient ranked the presence of third-party frameworks, cellular information and photos as the top three risky frameworks for the books category, whereas the same frameworks had a different ranking for other categories. Application of the correlation coefficient to rank the permissions, together with Apple's guidelines for developers (Apple Inc., https://developer.apple.com/app-store/categories/), aided in identifying the risky permissions across categories. The same process was repeated for the other categories. After applying the correlation coefficient for all categories, the top 5 and top 10 risky permissions were identified. Table 4 also depicts the ranking results obtained after applying the gain ratio feature selection method.

Feature selection using gain ratio

In this work, an attribute-based feature selection method has been applied to rank the permissions of each category. To check the consistency of the permission ranking, results obtained from the gain ratio attribute evaluator (Dag et al., 2012) were compared with the ranking results of the correlation coefficient. A snapshot of ranking results from both the correlation coefficient and gain ratio is depicted in Table 4 for the books category. Likewise, the same process was repeated for the 19 other categories.

Table 4
Permission rank using Correlation Coefficient and Gain Ratio attribute evaluator-based ranking of permissions for books category.

Correlation Coefficient | Gain Ratio Attribute Evaluator
Third party analytics | Third party analytics
Cellular | Cellular
Photos | Bluetooth
Camera | Reminders
Bluetooth | Photos
Social Media Accounts | Camera
Calendar | Calendar
Location | Contacts
Contacts | HealthKit
Reminders | Motion & Fitness
Microphone | Location
Motion & Fitness | Microphone
HealthKit | Social Media Accounts
HomeKit | HomeKit
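A ranking of the kind shown in Table 4 can be obtained by applying Eq. (15) to each column of the Boolean permission matrix against the class variable. The following sketch is illustrative only: it assumes the class is encoded as 1 for malicious and 0 otherwise, sorts permissions by descending R, and returns 0 when a column or the class is constant. None of these choices is specified in the text, and the toy matrix is made up.

```python
from math import sqrt


def correlation(x, c):
    """Correlation coefficient of Eq. (15) for two equal-length numeric lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_c = sum(c) / n
    num = sum((xi - mean_x) * (ci - mean_c) for xi, ci in zip(x, c))
    den = sqrt(sum((xi - mean_x) ** 2 for xi in x) *
               sum((ci - mean_c) ** 2 for ci in c))
    return num / den if den else 0.0   # assumption: constant column scores 0


def rank_permissions(matrix, names, labels):
    """Return (permission, R) pairs sorted from most to least correlated with the class."""
    scores = [(name, correlation([row[j] for row in matrix], labels))
              for j, name in enumerate(names)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy data: 4 apps x 3 permissions; labels 1 = malicious, 0 = not malicious.
names = ["location", "photos", "calendar"]
matrix = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
    [0, 1, 1],
]
labels = [1, 1, 0, 0]
ranking = rank_permissions(matrix, names, labels)
```

In this toy example location is used exactly by the malicious apps (R = 1), photos is unrelated to the class (R = 0), and calendar is used only by the others (R = -1).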


• Gain ratio attribute evaluator

The gain ratio technique is a normalized form of information gain that evaluates an attribute by measuring its gain ratio with respect to a class. The gain ratio is calculated using Eq. (16) (Dag et al., 2012).

\[ \text{GainRatio}(\text{Class}, \text{Attribute}) = \frac{H(\text{Class}) - H(\text{Class} \mid \text{Attribute})}{H(\text{Attribute})} \tag{16} \]
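As a worked illustration of Eq. (16), the sketch below computes the gain ratio of one Boolean permission column with respect to the class labels, using base-2 entropies. The helper names and the toy data are assumptions made for this example, and returning 0 for an attribute with zero entropy is a convention chosen here rather than something stated in the text.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy H(V) of a list of discrete values, in bits."""
    n = len(values)
    return -sum((k / n) * log2(k / n) for k in Counter(values).values())


def conditional_entropy(labels, attribute):
    """H(Class | Attribute): entropy of the labels within each attribute value, weighted."""
    n = len(labels)
    total = 0.0
    for value in set(attribute):
        subset = [l for l, a in zip(labels, attribute) if a == value]
        total += len(subset) / n * entropy(subset)
    return total


def gain_ratio(labels, attribute):
    """Eq. (16): information gain normalized by the attribute's own entropy."""
    h_attr = entropy(attribute)
    if h_attr == 0:
        return 0.0   # assumption: a constant attribute carries no information
    return (entropy(labels) - conditional_entropy(labels, attribute)) / h_attr


labels = ["malicious", "malicious", "benign", "benign"]
perfect = gain_ratio(labels, [1, 1, 0, 0])      # permission used only by malicious apps
useless = gain_ratio(labels, [1, 0, 1, 0])      # permission unrelated to the class
```

A permission that separates the classes perfectly scores 1, and one that is unrelated to the class scores 0; ranking permissions by this value gives the second column of a table such as Table 4.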
Table 5 enumerates the best-case values obtained after running the classifiers (Naïve Bayes and support vector machine) for the Top 5 permissions after applying the correlation coefficient and the gain ratio attribute evaluator. The results obtained from both (correlation coefficient and gain ratio attribute evaluator) were evaluated through standard metrics such as precision, accuracy, recall, F-score, and ROC.
Similarly, best-case values were calculated by running the classifiers for the Top 10 permissions and all 13 permissions for all 20 categories. The detailed experimental results for the Top 10 permissions and for all 13 permissions are not shown in this section. However, Fig. 6 shows the best-case values of all the standard metrics for the Top 5, Top 10 and all 13 permissions for both feature selection methods across all 20 categories. The experimental results demonstrated that the correlation coefficient had higher values of TPR, precision, recall and ROC for most of the categories and for different numbers of permissions when compared with the gain ratio attribute evaluator.

• Inference

From Tables 4 and 5 we can infer that the application of a ranking algorithm helps in determining malicious permissions across categories. The proposed approach of iABC-AL is very effective for the iOS platform because the total number of permissions that require the user's approval is approximately 13. Thus, using the category of an application and applying a ranking algorithm improves the accuracy and precision of the classifiers. The advantage of the proposed approach is that it can be used for other mobile platforms such as Android, where the number of permissions is approximately 135 (Wang et al., 2014). Application of ranking algorithms and feature selection algorithms can help in selecting the most informative permission(s).

Fig. 6. Best case results for correlation coefficient and gain ratio attribute evaluator.

In the case of the iOS platform, the number of permissions is very small; therefore, active learning approaches (pool & stream) for different classifiers (Naïve Bayes & SVM) and query strategies were employed for all 13 permissions.

2) Enhancing the precision of classification model using active learning approaches

The objective of iABC-AL is to protect the user's privacy by improving the precision of machine learning classifiers while utilizing a limited labeled data set. Research studies have identified that machine learning techniques offer promising results for detecting malicious Android applications. However, there has been a lack of research work conducted for the iOS platform (Pajouh et al., 2017), the prime reason being the lack of data sets available for iOS applications. To overcome the lack of large data sets, different active learning approaches (pool-based and stream-based selective sampling) along with various query strategies were employed.

• Pool-based approach & Stream-based selective sampling approach

For incorporating the pool-based/stream-based selective sampling active learning scenarios (detailed in Section 3), six querying strategies were employed for two classifiers (Naïve Bayes and SVM). These strategies are entropy sampling, Kullback-Leibler divergence,

Table 5
Best-case values of classifiers (%) for correlation coefficient and gain ratio attribute evaluation (Naïve Bayes, SVM classifier) for Top 5 permissions. C: application category (1–20), TPR: true positive ratio, Prec.: precision, R: recall, F: F-score, Ro: ROC.

     Maximum Correlation          Maximum Gain Ratio Attribute
C    TPR Prec. R F Ro             TPR Prec. R F Ro
1 82.0 81.7 82.0 81.8 90.8 84.7 85.0 84.7 84.8 91
2 83.6 83.4 83.6 83.0 88.3 81.8 81.4 81.8 81.4 87.2
3 78.5 78.8 78.5 78.6 85.3 72.5 73.8 72.5 73.0 80.3
4 87.1 87.1 87.1 87.1 92.9 87.1 87.1 87.1 86.3 92.3
5 70.0 70.6 70.0 70.3 81.5 75.0 76.2 75.0 75.4 78.9
6 88.0 87.9 88.0 87.7 89 80.0 82.6 80.0 80.5 84.9
7 81.2 81.2 81.2 81.2 89.6 84.0 81.9 82.0 81.9 88.8
8 96.3 88.3 87.5 87.7 89.7 87.5 89 87.5 87.7 89.7
9 75.5 75.3 75.5 75.3 86.8 79.1 78.6 79.1 78.6 84.6
10 76.0 76.0 76.0 75.9 87.7 78.7 79.7 78.7 87.1 87.1
11 80.0 79.8 80.0 75.7 85.5 80.0 81.4 80.0 74.9 82.2
12 86.7 86.7 86.7 86.7 87.4 86.7 86.7 86.7 86.7 87.4
13 77.3 75.7 77.3 75.6 86.6 76.0 74.8 76.0 75.1 84.6
14 82.0 82.4 82.0 82.1 86.9 82.0 82.1 82.0 81.7 84.6
15 81.3 83.7 81.8 81.8 88.5 82.7 89.1 82.7 83.2 88.9
16 73.3 73.8 73.3 71.9 82.6 64.4 52.7 64.4 52.3 71.6
17 86.3 86.7 86.3 86.4 91.6 86.3 86.7 86.3 86.4 91.8
18 77.8 75.6 77.8 75.9 86.7 80 78.6 80 78.8 84
19 81.5 81.7 81.5 81.5 89.6 81 81.5 81 81 90.6
20 84.0 87.4 84.0 84.6 92.8 88 88.8 88 88.2 93

Please cite this article in press as: Bhatt, A.J., et al. iABC-AL: Active learning-based privacy leaks threat detection for iOS applications. Journal of King Saud
University – Computer and Information Sciences (2018), https://doi.org/10.1016/j.jksuci.2018.05.008
12 A.J. Bhatt et al. / Journal of King Saud University – Computer and Information Sciences xxx (2018) xxx–xxx

Table 6
Average/maximum performance (%) over the different active learning query strategies for Pool SVM. C: application category (1–20), T: true positive ratio, P: precision, Re: recall, F: F-score, Ro: ROC.

     Average, Pool SVM            Maximum, Pool SVM
C    T P Re F Ro                  T P Re F Ro
1 84.9 84.3 84.7 83.8 84.5 85.6 85.2 85.3 84.8 85.9
2 82.4 83.3 82.4 81.6 83.3 83.1 83.9 83.1 82.3 83.8
3 75.6 75.3 75.6 73.9 74.9 84.6 84.1 84.6 83.6 84.2
4 83.5 83.9 83.5 83.3 85.7 84.0 84.3 84.0 83.7 86.0
5 63.1 67.0 63.1 62.6 67.1 63.7 67.1 63.7 63.4 67.5
6 88.5 90.0 88.5 88.3 89.9 88.5 90.0 88.5 88.3 89.9
7 78.9 82.2 78.9 78.7 83.9 81.1 83.5 81.1 81.1 85.6
8 85.0 87.7 85.0 85.2 87.6 85.0 87.7 85.0 85.2 87.6
9 82.7 82.1 82.7 80.7 83.0 82.7 82.1 82.7 80.7 83.0
10 71.4 74.7 71.4 71.8 76.2 71.4 74.7 71.4 71.8 76.2
11 82.5 82.7 82.5 80.0 81.2 82.5 82.7 82.5 80.0 81.2
12 80.0 87.4 80.0 79.3 81.6 80.0 87.4 80.0 79.3 81.6
13 82.8 84.3 82.8 82.6 85.5 82.8 84.3 82.8 82.6 85.5
14 80.0 83.3 80.0 79.7 85.3 80.0 83.3 80.0 79.7 85.3
15 54.2 55.6 54.2 54.0 58.5 54.2 55.6 54.2 54.0 58.5
16 68.8 64.3 68.8 64.9 75.9 68.8 64.3 68.8 64.9 75.9
17 85.6 82.2 85.6 82.6 86.3 86.2 82.6 86.2 83.2 86.8
18 75.8 77.0 75.8 73.8 82.6 80.0 79.1 80.0 78.0 85.4
19 81.6 83.4 81.6 81.1 85.2 82.0 83.5 82.0 81.6 85.4
20 80.0 84.0 80.0 81.2 84.0 80.0 84.0 80.0 81.2 84.0

least confident sampling, margin sampling, relevance sampling and vote entropy. A result set of 240 combinations was obtained for each scenario, i.e. 2 classifiers × 6 query strategies × 20 categories = 240, resulting in a total of 480 result combinations across the two scenarios.

Table 6 enumerates the average (over the six querying strategies) and maximum values of the metrics obtained for the pool-based scenario with the SVM classifier. Likewise, results were computed for (i) the pool-based scenario with the Naïve Bayes classifier, (ii) stream-based selective sampling with the SVM classifier and (iii) stream-based selective sampling with the Naïve Bayes classifier. The detailed results have not been included in this section. However, Fig. 7 shows the best-case values of all standard metrics for the different active learning scenarios and classifiers.

The experimental results demonstrated that the application of active learning approaches improved the results of the supervised approaches (Table 3) significantly. Active learning increased the accuracy and precision of Naïve Bayes by 7.13% and 7.41% for the pool-based scenario and by 4.36% and 4.68% for the stream-based selective scenario, respectively. In the case of SVM, the accuracy and precision increased by 4.5% and 7.43% for the pool-based scenario and by 5.9% and 6.82% for the stream-based selective scenario.

6.2. Comparison of results

The main motive for applying active learning was the unavailability of a labeled data set for iOS applications. Because all iOS applications are distributed as iOS binaries on online stores, it is difficult to extract the framework and class information, as they cannot be directly executed on a compiler (Szydlowski et al., 2012). Therefore, reverse engineering in the form of static analysis was applied to obtain the frameworks and permission information of iOS apps.

1) Comparative study on results obtained from active learning approaches with supervised approach

A comparative analysis of the results obtained from the supervised approach and the active learning approaches (pool and stream) is given in Tables 7 and 8. The tables enumerate the values of the standard metrics precision, F-score and ROC (receiver operating characteristic) obtained from the supervised and active learning approaches for the Naïve Bayes and SVM classifiers, respectively.

• The Significance of precision in this work

Precision is defined as the ratio of correctly predicted positive values to the total number of predicted positive values (Wang et al., 2014). The metric gives the percentage of correct positive predictions out of all positive predictions. A high value of precision signifies a low false positive rate.

In our work, the number of permissions is very limited, i.e. approximately 13, compared to other platforms such as Android (approximately 135) (Wang et al., 2014). Hence, identifying risky permissions or misuse of privacy-related permissions is challenging. Therefore, to reduce the false positive rate, the category of the application was considered as an extra dimension to identify the misuse of risky permissions for a given category, along with a detailed study of Apple's guidelines for developers (Apple Inc., https://developer.apple.com/app-store/categories/). In this work, we have preferred precision over F-score or ROC because, with a limited permission set, a high value of precision indicates a low false positive rate. For completeness, the other standard metrics, F-score and ROC (receiver operating characteristic), are reported along with precision in Tables 7 and 8 for both supervised and active learning approaches.
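
The metrics discussed above can be illustrated with a minimal sketch. It uses the standard confusion-matrix definitions of precision, recall, F-score and false positive rate; the function name and the counts are hypothetical and are not taken from the paper's data. Note that precision and false positive rate both decrease in quality as false positives grow, although they are normalized differently (by predicted positives and by actual negatives, respectively).

```python
def classification_metrics(tp, fp, tn, fn):
    """Return precision, recall, F-score and false positive rate.

    Assumes all denominators are non-zero (illustrative sketch only).
    """
    precision = tp / (tp + fp)            # correct positives / predicted positives
    recall = tp / (tp + fn)               # correct positives / actual positives
    f_score = 2 * precision * recall / (precision + recall)
    false_positive_rate = fp / (fp + tn)  # false alarms / actual negatives
    return {
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
        "false_positive_rate": false_positive_rate,
    }


# Hypothetical counts for one application category (not from the paper)
metrics = classification_metrics(tp=40, fp=5, tn=45, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Reducing `fp` in this example raises precision and lowers the false positive rate at the same time, which is the relationship the section relies on.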

Fig. 7. Results from different Active learning scenarios for different classifiers.

• Active learning v/s Supervised for Naïve Bayes classifier

Results obtained from Table 7 show that applying the pool-based active learning scenario has improved the precision


Table 7
Precision, F-score and ROC (Receiver operating characteristic) (%) from supervised and active learning approaches (Pool & Stream) for Naïve Bayes classifier. Supervised approach:
SA, Pool: P, Stream: S.

C Precision F-score ROC


SA P S SA P S SA P S
1 80.6 85.0 83.0 80.2 83.9 82.2 89.9 91.0 91.2
2 76.4 83.9 81.1 76.4 82.0 79.4 86.4 87.6 86.5
3 71.6 85.0 72.2 71.6 83.9 72.0 81.8 91.0 82.9
4 84.5 85.8 86.5 83.8 85.2 84.9 92.3 93.8 92.6
5 70.3 76.1 75.1 69.4 72.1 70.9 75.9 76.9 76.5
6 75.5 90.0 86.7 75.0 88.3 80.1 85.8 92.3 92.3
7 82.3 80.2 81.5 82.1 79.0 80.3 89.5 90.5 90.2
8 83.9 83.5 83.4 82.8 80.3 80.1 91.0 93.4 93.1
9 73.6 80.8 78.9 73.6 78.9 76.7 85.2 85.8 87.9
10 81.4 62.7 69.5 81.3 62.8 70.4 89.8 78.5 82.5
11 74.7 75.7 73.5 73.3 73.5 74.8 86.7 85.9 85.2
12 77.2 90.8 87.5 75.6 84.6 75.2 82.1 80.8 88.2
13 79.7 80.5 78.6 79.8 79.6 77.1 86.8 84.7 84.7
14 80.9 85.6 81.3 80.1 83.7 78.2 89.8 89.6 89.0
15 84.4 75.5 77.5 81.9 73.6 73.6 86.9 90.4 87.4
16 66.0 72.9 72.9 65.7 70.0 70.1 76.4 79.2 78.9
17 78.7 83.7 83.7 77.8 79.8 79.2 90.6 95.4 93.4
18 71.9 79.1 69.1 71.5 78.0 70.7 79.3 87.5 93.3
19 78.9 83.5 82.4 78.5 81.3 81.2 89.4 89.9 91.3
20 83.6 82.1 91.3 82.9 81.6 90.2 93.8 91.2 94.9

Table 8
Precision, F-score and ROC (Receiver operating characteristic) (%) from supervised and active learning approaches (Pool & Stream) for SVM classifier. Supervised approach: SA,
Pool: P, Stream: S.

C Precision F-score ROC


SA P S SA P S SA P S
1 84.8 85.2 84.1 84.7 84.8 83.6 86.1 85.9 84.2
2 82.7 83.9 83.9 82.8 82.3 82.6 84.2 83.8 83.8
3 76.4 84.1 74.6 76.1 83.6 72.1 75.4 84.2 73.2
4 87.1 84.3 83.9 87.1 83.7 83.3 88.8 86.0 85.7
5 64.3 67.1 66.9 64.6 63.4 63.0 67.2 67.5 67.2
6 87.9 90.0 89.5 87.7 88.3 85.9 89.0 89.9 84.3
7 79.0 83.5 83.5 79.0 81.1 81.1 82.5 85.6 85.6
8 82.9 87.7 87.7 82.8 85.2 85.2 84.6 87.6 87.6
9 78.5 82.1 82.1 78.6 80.7 80.7 80.8 83.0 83.0
10 89.7 74.7 79.4 89.5 71.8 78.1 91.0 76.2 86.7
11 83.7 82.7 82.7 82.8 80.0 80.0 83.2 81.2 81.2
12 84.8 87.4 87.4 84.6 79.3 79.3 86.1 81.6 81.6
13 79.3 84.3 84.3 78.9 82.6 82.6 83.1 85.5 85.5
14 80.0 83.3 83.3 80.0 79.7 79.7 83.4 85.3 85.3
15 74.0 55.6 64.0 71.5 54.0 65.4 75.6 58.5 69.6
16 67.2 64.3 64.3 66.7 64.9 64.9 74.2 75.9 75.9
17 88.8 82.6 81.7 88.6 83.2 82.0 89.6 86.8 85.8
18 68.9 79.1 76.0 67.8 78.0 73.0 75.4 85.4 82.0
19 82.9 83.5 83.5 82.4 81.6 81.6 86.0 85.4 85.4
20 83.0 84.0 84.0 82.8 81.2 81.2 84.3 84.0 84.0

results significantly for 15 categories, and the F-score and ROC for 14 categories, for the Naïve Bayes classifier. From Table 7, we can see that for every category except games, health & fitness, medical, productivity, and utilities, the results improved by a minimum of 1% (entertainment category) to a maximum of 15% (food & drink category).

Table 7 also shows that the stream-based selective sampling scenario increased precision for 13 categories, F-score for 5 categories and ROC for 7 categories for the Naïve Bayes classifier. The results improved by a minimum of 0.4% (photo & video category) to a maximum of 11.2% (food & drink category).

• Active v/s Supervised for SVM classifier

Table 8 shows that the pool-based active learning scenario improved the precision values significantly for 14 categories, F-score for 7 categories and ROC for 8 categories for the SVM classifier. The results improved by a minimum of 1% (utilities category) to a maximum of 10.3% (sports category). Stream-based selective sampling improved the precision values significantly for 12 categories, F-score for 6 categories and ROC for 7 categories for SVM. The results improved by a minimum of 0.6% (travel category) to a maximum of 7.1% (sports category).

Therefore, from Tables 7 and 8 we can infer that precision improved more markedly than the other metrics, F-score and ROC, for both active learning scenarios. A high value of precision indicates a low false positive rate. Therefore, we conclude that the proposed approach of adding a category dimension and using active learning approaches protects the end user's privacy by classifying applications with a low false positive rate.

2) Comparative study of results obtained from active learning approaches: Pool v/s Stream


Table 9
Best case Precision, F-score and ROC (%) for Naïve Bayes and SVM classifier for pool/stream-based selective sampling active learning scenario. Precision: P, F-score: F, ROC: R.

Scenario Classifier Category Query Strategy P F R


Pool SVM Food & Drink Vote entropy 90.0 88.3 89.9
Pool Naïve Bayes Navigation Kullback leibler divergence 90.8 84.6 80.8
Stream SVM Food & Drink Relevance sampling 89.5 85.9 84.3
Stream Naïve Bayes Utilities Relevance sampling 91.3 90.2 87.6
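
The probability-based query strategies named in this section (least confident, margin, entropy and relevance sampling) can be sketched as scoring functions over a classifier's predicted class probabilities. This is a minimal illustration using the common textbook definitions, not the authors' implementation; the function names and the probabilities are hypothetical, and the committee-based strategies (vote entropy, Kullback-Leibler divergence) are omitted because they require several models.

```python
import math


def least_confident(probs):
    """Uncertainty = 1 - probability of the most likely class."""
    return 1.0 - max(probs)


def margin(probs):
    """Gap between the two most likely classes (small gap = uncertain)."""
    first, second = sorted(probs, reverse=True)[:2]
    return first - second


def entropy(probs):
    """Shannon entropy of the predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def relevance(probs, positive_class=1):
    """Probability of the positive (here: malicious) class."""
    return probs[positive_class]


def select_query(pool_probs, strategy):
    """Return the index of the unlabeled instance to send to the oracle."""
    scorers = {
        "least_confident": least_confident,
        "margin": lambda p: -margin(p),  # smallest margin is queried first
        "entropy": entropy,
        "relevance": relevance,
    }
    scores = [scorers[strategy](p) for p in pool_probs]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical probabilities [P(benign), P(malicious)] for three unlabeled apps
pool_probs = [[0.90, 0.10], [0.55, 0.45], [0.20, 0.80]]
for name in ("least_confident", "margin", "entropy", "relevance"):
    print(name, "->", select_query(pool_probs, name))
```

In this toy pool the three uncertainty-based strategies pick the borderline application, whereas relevance sampling picks the application most likely to be malicious.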

This section presents an empirical analysis of the results obtained from the pool-based and stream-based selective sampling active learning scenarios. For illustration, the query strategies that gave the highest precision were selected for each classifier (Naïve Bayes, SVM) and scenario (pool, stream) and are summarized in Table 9 along with the corresponding values of F-score and ROC.

In addition, the results of the pool-based and stream-based scenarios were compared on the basis of (i) the number of labeled/unlabeled instances selected for the initial iteration, (ii) true positive rate, (iii) precision, (iv) F-score, (v) ROC, (vi) the initial number of iterations, and (vii) the number of labeled/unlabeled instances after the final iteration, as summarized in Table 10. For this comparison, the relevance sampling query strategy was chosen with the utilities category, as it gave the highest precision of 91.3%. The initial value of the number of iterations was set to 100. Table 10 enumerates the comparison of results from both scenarios.

From Table 10, we can observe that both the pool-based and stream-based active learning scenarios classify the applications by using only 10% of the training data. The initial iteration parameter (t) was set to 100 in the configuration file, but from Table 10 we can deduce that pool-SVM, pool-Naïve Bayes and stream-SVM gave their results after completing 61 iterations each, with a precision of 84%, 82.1%, and 84% respectively. In the case of stream-Naïve Bayes, the results were obtained after 100 iterations with a precision of 91.3%. The final iteration value indicates that the values of precision become constant after completing that many iterations. The following subsection presents the research questions addressed in the paper.

Table 10
Summary of results obtained from pool/stream-based scenarios for Naïve Bayes & SVM classifier for relevance sampling.

Parameters                   Instances: Pool   Instances: Stream
Classifier                   Linear SVM        Linear SVM
No. of applications          75                75
Test set size                7                 7
Initial iteration            100               100
Initial labeled set size     6                 6
Initial unlabeled set size   61                61
Final iteration              62                67
Final labeled set size       67                66
Final unlabeled set size     1                 1
Precision                    84%               84%
True positive ratio          80%               80%
F-score                      81.2%             81.2%
ROC                          84%               84%
Classifier                   Naïve Bayes       Naïve Bayes
No. of applications          75                75
Test set size                7                 7
Initial iteration            100               100
Initial labeled set size     6                 6
Initial unlabeled set size   61                61
Final iteration              62                101
Final labeled set size       67                67
Final unlabeled set size     1                 1
Precision                    82.1%             91.3%
True positive ratio          82.8%             90.1%
F-score                      81.6%             90.2%
ROC                          91.2%             94.9%

6.3. Research questions

To evaluate the effectiveness of the proposed approach, different experiments were conducted to answer the following research questions.

RQ1. How effective is the proposed approach of iABC-AL for malware detection? Can the proposed approach outperform the state-of-art approach?

To answer the above question, we compared the effectiveness of iABC-AL with that of Pajouh et al. (2017) in terms of accuracy. The proposed approach was evaluated by performing a stratified 10-fold cross-validation on a dataset of 2325 iOS applications using various machine learning classifiers. The entire dataset of 2325 applications was split into 10 subsets. The experiment was run 10 times, and each time a different subset was selected for testing and the other 9 subsets for training. The average performance over the ten experiments was reported. iABC-AL obtained an accuracy rate of 88.5%-91.3% against 51%-91% for Pajouh et al. (2017).

The Effectiveness of iABC-AL for detecting malicious applications

Pajouh et al. (2017) have proposed a model that applies a kernel-based support vector machine and a weighting factor for application library calls to detect OS X malware. The authors used a data set comprising 450 benign samples and 152 malware samples. The proposed model detects OS X malware with 91% accuracy. To increase the data set and evaluate the change in accuracy rate, the authors used the synthetic minority over-sampling technique. By using this technique, their model achieved an accuracy rate of 96%. To construct the data set, 460 applications were downloaded from the app store from different categories. Thereafter, Mach-O binary files were extracted from benign and malware samples manually. Later, feature selection techniques were applied to find the most relevant attributes.

The aim of iABC-AL is to protect the end user's privacy by detecting malicious iOS applications. For conducting the experiments, 2325 applications were downloaded and their frameworks were extracted and mapped to permission variables. The frameworks comprise all private shared libraries of an application. The highlighting feature of iABC-AL is detecting malicious iOS applications by considering the application's category. After adding the category of the application, the entire permission matrix was provided to the classifiers. In Pajouh et al. (2017), the authors applied library weighting and the synthetic minority over-sampling technique, which resulted in a data set of 3060 applications. For classifying OS X malware, supervised machine learning approaches using five different classifiers and stratified 10-fold cross-validation were employed. The accuracy rate of the model ranged from 51% to 91% for the different classifiers. After applying the synthetic minority over-sampling technique and creating the data set of 3060 applications, the accuracy of the model ranged from 43.15% to 96.62%. In comparison, iABC-AL, which uses active learning approaches, obtained a precision rate of 88.5%-91.3% and an accuracy rate of 89.5%-91.5% for the different active learning scenarios.

RQ2. How does the performance of iABC-AL vary for different amount of training data?


For a 10-fold cross-validation, 90% of the data set was used for training and 10% for testing. The supervised approaches therefore require that a large amount of labeled training data (i.e. 90%) be available. Our approach, however, employs active learning, which aims to reduce the amount of training data labeled by experts. To address the above research question, we investigated whether the proposed approach can work effectively with a reduced amount of training data and also compared the results with a baseline technique, random sampling (Ramirez-Loaiza et al., 2017). To answer this question, we conducted multiple experiments by varying the amount of training data from 10% to 50% of the entire data, with an interval of 10%. From Table 9, it can be inferred that the relevance sampling query strategy gave the best precision value for the stream scenario and Naïve Bayes classifier in the utilities category. Therefore, for the empirical analysis, the training data for relevance sampling was varied in the configuration file.

Random sampling is the most commonly used baseline when comparing across active learning strategies (Ramirez-Loaiza et al., 2017). This strategy chooses instances randomly from an unlabeled pool, without considering whether the chosen instance provides any information to the classifier. In contrast, an active learning technique selects the most informative instance to be queried, thereby achieving better accuracy while utilizing a limited percentage of the training data set (Ramirez-Loaiza et al., 2017).

To evaluate the effectiveness of active learning approaches, we present a comprehensive empirical evaluation of different active learning scenarios, querying strategies and machine learning classifiers. We also evaluate the most commonly used baseline technique, i.e. random sampling (Ramirez-Loaiza et al., 2017), using Naïve Bayes and SVM (support vector machine). The results obtained from the different active learning scenarios were then compared with those obtained from random sampling through standard metrics such as precision, accuracy, recall, F-score, and ROC (receiver operating characteristic). The performance measures are described in Tables 11–13. Table 11 presents the results for the precision metric obtained by varying the percentage of training data (for active learning), while Tables 12 and 13 enumerate the results for the F-score and ROC metrics respectively.

• Active learning v/s Random sampling

For comparing the results of active learning and random sampling, the relevance sampling querying strategy was selected. The results from the pool-based and stream-based selective sampling scenarios were evaluated for the Naïve Bayes and SVM (support vector machine) classifiers. The application data set was chosen from the utilities category. Results of the relevance sampling querying strategy for (i) the pool-based scenario using Naïve Bayes, (ii) the pool-based scenario using SVM, (iii) stream-based selective sampling using Naïve Bayes and (iv) stream-based selective sampling using SVM were compared with (v) random sampling using Naïve Bayes and (vi) random sampling using SVM.

During empirical evaluation on our dataset, it was identified that all the active learning techniques outperformed random sampling in precision, F-score, and ROC for most of the cases when the percentage of training data varied from 10% to 50%. However, the results of random sampling were better when 90% of the data set was used for labeling. The results for precision, F-score, and ROC are summarized in Tables 11–13 respectively.

Inference

From Tables 11–13, it can be seen that even with 10% of the entire data set as the training data, iABC-AL could detect malware effectively with a precision of 82.6%, 90.1%, 86.7%, and 89.6% in the case of pool-Naïve Bayes, pool-SVM, stream-Naïve Bayes and stream-SVM respectively. Therefore, we can conclude that active learning is better than random sampling because it selects the most informative instance for labeling, whereas random sampling selects a random instance to label. Thus, active learning approaches outperform the random sampling technique by providing better values for the standard metrics. To evaluate the performance of both approaches, we applied different statistical tests as described below.

• Statistical test for comparison of results obtained from Active learning and Random Sampling

From Table 11, Naïve Bayes with active stream sampling gave better precision than random sampling. To establish that these results are not due to chance and are statistically significant, two statistical methods, namely Coefficient of Variance (CoV) and Analysis of Variance (ANOVA), have been used (Veerarajan, 2008). An efficient implementation of these tests can be found in the web platform STAC (Statistical Tests for Algorithms Comparison) (STAC Web Platform, http://tec.citius.usc.es/stac/index.html; Rodriguez-Fdez et al., 2015).

(i) Coefficient of Variance

The coefficient of variance, CoV, for a sample X is defined by Eq. (17) (Veerarajan, 2008).

$$\mathrm{CoV}_X = \frac{\text{Standard Deviation}_X}{\text{Mean}_X} \times 100 \qquad (17)$$

If $\mathrm{CoV}_X > \mathrm{CoV}_Y$, then the variable X (or its distribution) is less consistent, or in other words more variable, than the other variable Y (or its distribution) (Veerarajan, 2008). Table 14 shows a comparison of the CoV of precision obtained by all three types of sampling. As the CoV of stream-based selective sampling is the minimum amongst all three approaches, we can conclude that the stream-based selective sampling scenario is more consistent and stable (Veerarajan, 2008) than the pool-based active learning scenario as well as random sampling. Similar results were obtained for

Table 11
Analysis of precision (%) on varying percentage of training data for utilities category and relevance sampling query strategy. L/UL: labeled/unlabeled instances for active learning, PN: pool Naïve Bayes, SN: stream Naïve Bayes, PS: pool SVM, SS: stream SVM, LR: labeled instances for random sampling, Random-NB: random sampling for Naïve Bayes, Random-SVM: random sampling for SVM.

Train data% L UL LR Naïve Bayes classifier SVM classifier


PN SN Random-NB PS SS Random-SVM
10 6 61 8 73.3 91.3 74.9 84.1 84.1 54.5
20 13 54 15 73.3 91.3 84.1 84.1 84.1 51.7
30 20 47 23 73.3 88.4 81 84.1 84.1 81
40 26 41 30 81.8 88.4 74.5 81.9 81.9 69.8
50 33 34 38 82.2 88.2 89.2 81.9 81.9 73.6


Table 12
Analysis of F-score (%) on varying percentage of training data for utilities category and relevance sampling query strategy.

Train data% L UL LR Naïve Bayes classifier SVM classifier


PN SN Random-NB PS SS Random-SVM
10 6 61 8 75.6 90.2 70.2 81.2 81.2 62.1
20 13 54 15 75.6 90.2 75.3 81.2 81.2 58.8
30 20 47 23 75.6 87.1 79.4 81.2 81.2 79.4
40 26 41 30 81.8 87.1 72.5 79.4 79.4 70
50 33 34 38 82.7 82.7 89.2 79.4 79.4 72.8

Table 13
Analysis of ROC (%) on varying percentage of training data for utilities category and relevance sampling query strategy.

Train data% L UL LR Naïve Bayes classifier SVM classifier


PN SN Random-NB PS SS Random-SVM
10 6 61 8 91.7 94.9 76.8 84.1 84.1 70.4
20 13 54 15 91.7 94.9 85.6 84.1 84.1 65.6
30 20 47 23 91.7 93.2 94.6 84.1 84.1 80
40 26 41 30 92.6 93.2 90.2 81.7 81.7 72
50 33 34 38 92.7 93.3 96.5 81.7 81.7 72.7

Table 14
Coefficient of variance of precision (%) for Naïve Bayes classifier for Pool, Stream and Random sampling. The five precision rows correspond to the 10%–50% training data rows of Table 11.

Metric                    Pool     Stream   Random Sampling
Precision (10%)           73.3     91.3     74.9
Precision (20%)           73.3     91.3     84.1
Precision (30%)           73.3     88.4     81
Precision (40%)           81.8     88.4     74.5
Precision (50%)           82.2     88.2     89.2
Standard Deviation        4.77     1.63     6.24
Mean                      76.78    89.52    80.74
Coefficient of variance   6.21     1.82     7.73

SVM classifier also, where the CoV for the active learning approaches was found to be lower than for random sampling, indicating consistency in the results. The CoV results indicate that active learning approaches of sampling give more consistent classification results than random sampling.

(ii) ANOVA (Analysis of Variance)

The null hypothesis (H0) here is that the precision results of all three approaches are similar (Stats Direct Limited, https://www.statsdirect.com/help/basics/p_values.htm). The ANOVA statistical test was used to determine whether the null hypothesis can be rejected. A one-way ANOVA test has been used as there is only one factor, the sampling approach, with precision as the measured variable. Results of applying ANOVA to the dataset are shown in Table 15.

The first half of the results shows the number of data samples, their sum, mean and variance for each group. These values were used to calculate the ANOVA statistical test, and the results shown in the second half of the table were obtained. These can be interpreted as follows.

The variation within groups, shown in the first column as SS, is less than that between groups. This demonstrates the cohesiveness of the dataset. The next value of interest is the p-value. The p-value is the probability, assuming the null hypothesis is true, of obtaining results at least as extreme as those observed; it is compared against the significance level α, a pre-chosen probability (Stats Direct Limited, https://www.statsdirect.com/help/basics/p_values.htm). If the p-value is less than the significance level, then the null hypothesis is rejected. The significance level α was chosen as 0.05 here. As can be seen in Table 15, for the precision of the Naïve Bayes classifier under the different active learning scenarios and random sampling, the p-value came out to 0.0028, which is well below α. Thus, the null hypothesis H0 can be rejected. This inference is strengthened by examining the f-value, which compares the between-group variation with the within-group variation. As the obtained f-value is greater than Fcrit, H0 is rejected by this criterion as well. Thus, it can be concluded that the alternate hypothesis, that the results of the three approaches are significantly different, is accepted.

After this statistical significance testing, it can be concluded that the proposed approach of iABC-AL using active learning techniques can detect malicious iOS apps with better precision when compared with other supervised approaches and random sampling.

RQ3. How does the performance of iABC-AL vary for a different number of iterations?

The proposed approach takes an iteration parameter to control the number of active learning iterations. To address this research question, the values of iteration were varied from 10 to 100 with

Table 15
Summary of results of ANOVA for single factor, with α = 0.05. SS: sum of squares, df: degrees of freedom, MS: mean square, f-value: calculated F-value, p-value: p-value, Fcrit: critical value of F.

ANOVA: Single Factor


SUMMARY
Groups Count Sum Average Variance
Pool 5 383.9 76.78 22.73
Stream 5 447.60 89.52 2.65
Random Sampling 5 403.70 80.74 38.99
ANOVA
Source of Variation SS df MS f-value p-value Fcrit
Between Groups 425.13 2 212.564 9.907 0.0028 3.885
Within Groups 257.47 12 21.455
Total 682.697 14
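
The statistics in Tables 14 and 15 can be recomputed from the precision values with a short sketch. It uses only Python's standard `statistics` module and the usual one-way ANOVA sums of squares; the function names are hypothetical, and the critical value is copied from Table 15 rather than computed. This is an illustration of the calculation, not the STAC implementation used by the authors.

```python
from statistics import mean, stdev, variance

# Precision values (%) from Table 14 (Naive Bayes, utilities category)
groups = {
    "Pool": [73.3, 73.3, 73.3, 81.8, 82.2],
    "Stream": [91.3, 91.3, 88.4, 88.4, 88.2],
    "Random Sampling": [74.9, 84.1, 81.0, 74.5, 89.2],
}


def coefficient_of_variance(sample):
    """Eq. (17): sample standard deviation divided by the mean, times 100."""
    return stdev(sample) / mean(sample) * 100


def one_way_anova(samples):
    """Return (SS_between, SS_within, df_between, df_within, F)."""
    all_values = [v for s in samples for v in s]
    grand_mean = mean(all_values)
    ss_between = sum(len(s) * (mean(s) - grand_mean) ** 2 for s in samples)
    ss_within = sum((len(s) - 1) * variance(s) for s in samples)
    df_between = len(samples) - 1
    df_within = len(all_values) - len(samples)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, ss_within, df_between, df_within, f_value


cov = {name: coefficient_of_variance(vals) for name, vals in groups.items()}
ss_b, ss_w, df_b, df_w, f_value = one_way_anova(list(groups.values()))

F_CRIT = 3.885  # critical value reported in Table 15 (alpha = 0.05, df = 2 and 12)
reject_h0 = f_value > F_CRIT

for name, value in cov.items():
    print(f"CoV {name}: {value:.2f}")
print(f"SS between = {ss_b:.2f}, SS within = {ss_w:.2f}, F = {f_value:.3f}")
print("Reject H0:", reject_h0)
```

With these inputs the sketch gives CoV values of about 6.21, 1.82 and 7.73 and an F-value of about 9.907, in agreement with Tables 14 and 15.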


Table 16
Analysis of Precision (%) on varying number of iterations for utilities category and relevance sampling query strategy.

Iteration Pool-Naïve Bayes Pool-SVM Stream-Naïve Bayes Stream-SVM


10 60.7 69.4 67.4 69.4
20 65.8 79 73.6 79
30 68.3 83.3 81.4 83.3
40 79.4 77.2 83.8 77.2
50 80 88.1 84.6 88.1
60 82.2 81.9 81.8 81.9
70 73.3 84.1 91.3 84
80 73.3 84.1 91.3 84
90 73.3 84.1 91.3 84
100 73.3 84.1 91.3 84

an interval of 10. The utilities category was selected, with relevance sampling as the querying strategy and a training data percentage of 10%. Table 16 presents the empirical results obtained by varying the iteration parameter, which controls the number of iterations in the active learning process. From Table 16 we can infer that the precision value becomes constant after 70 iterations.

This section presented an analysis of the results obtained from supervised learning, the advantage of using the application's category for detecting malicious applications, and the application of active learning approaches for improving the precision rate of classifiers. We also performed a comparative study of the results obtained from the supervised approach with the results of active learning. Further, an in-depth analysis of the results obtained from different active learning approaches was performed by varying the percentage of training data and the number of iterations. Results of active learning were compared with the random sampling technique by experimentation and statistical methods.

have not been considered in previous research works have aided in identifying malicious applications with better precision rate and accuracy. One more advantage of iABC-AL is that it can detect the malicious applications before they are run, thereby minimizing the privacy threats for the end user to a great extent.

References

Agarwal, Y., Hall, M., 2012. ProtectMyPrivacy: detecting and mitigating privacy leaks on iOS devices using crowdsourcing categories and subject descriptors. Proceeding 11th Annu. Int. Conf. Mob. Syst. Appl. Serv. 6, 97–110.
Apple Inc., "Speech | Apple Developer Documentation." [Online]. Available: https://developer.apple.com/documentation/speech. [Accessed: 07-Feb-2018].
Apple Inc., "Apple Developer Documentation." [Online]. Available: https://developer.apple.com/documentation/. [Accessed: 06-Feb-2018].
Apple Inc., "HomeKit | Apple Developer Documentation." [Online]. Available: https://developer.apple.com/documentation/homekit. [Accessed: 07-Feb-2018].
Apple Inc., "UserNotifications | Apple Developer Documentation." [Online]. Available: https://developer.apple.com/documentation/usernotifications. [Accessed: 07-Feb-2018].
7. Conclusion Apple Inc., ‘‘HealthKit | Apple Developer Documentation.” [Online]. Available:
https://developer.apple.com/documentation/healthkit. [Accessed: 07-Feb-
The increased popularity of smartphones and increase in the 2018].
Apple Inc., ‘‘AddressBook | Apple Developer Documentation.” [Online]. Available:
number of downloads of third-party applications has resulted in https://developer.apple.com/documentation/addressbook. [Accessed: 07-Feb-
the increased possibility of malicious activities. Even the applica- 2018].
tions that seem to be benign can breach the privacy of users by Apple Inc., ‘‘Core Motion | Apple Developer Documentation.” [Online]. Available:
https://developer.apple.com/documentation/coremotion. [Accessed: 07-Feb-
sharing user’s crucial and private information such as location, 2018].
passwords, tweets, phone number etc. In this study, we propose Apple Inc., ‘‘UIKit | Apple Developer Documentation.” [Online]. Available: https://
iABC-AL, a framework based on machine learning approach, which developer.apple.com/documentation/uikit. [Accessed: 07-Feb-2018].
Apple Inc., ‘‘Photos | Apple Developer Documentation.” [Online]. Available: https://
uses active learning approaches to detect malicious iOS applica- developer.apple.com/documentation/photos. [Accessed: 07-Feb-2018].
tions. By applying reverse engineering in form of static analysis, Apple Inc., ‘‘PhotosUI | Apple Developer Documentation.” [Online]. Available:
iABC-AL explores privacy threats that iOS applications may pose https://developer.apple.com/documentation/photosui. [Accessed: 12-Feb-
2018].
to its user’s. The applications were reverse engineered to extract
Apple Inc., ‘‘Core Location | Apple Developer Documentation.” [Online]. Available:
frameworks which were mapped to permission variables and later https://developer.apple.com/documentation/corelocation. [Accessed: 07-Feb-
provided as input to machine learning classifiers. The primary goal 2018].
of the study was to minimize the privacy threats of users by (i) Apple Inc., ‘‘MapKit | Apple Developer Documentation.” [Online]. Available: https://
developer.apple.com/documentation/mapkit. [Accessed: 12-Feb-2018].
minimizing the percentage of labeled training data set and (ii) Apple Inc., ‘‘Core Bluetooth | Apple Developer Documentation.” [Online]. Available:
maximizing the precision of classification model. The highlights https://developer.apple.com/documentation/corebluetooth. [Accessed: 06-Feb-
of iABC-AL include: (i) its ability to reduce false positive rates by 2018].
Apple Inc., ‘‘Categories and Discoverability – App Store – Apple Developer.”
using application’s category as a feature to distinguish malicious [Online]. Available: https://developer.apple.com/app-store/categories/.
applications from the benign ones (ii) its ability to improve the [Accessed: 12-Feb-2018].
precision of classifiers by using active learning approaches which Apple Inc., ‘‘EventKit | Apple Developer Documentation.” [Online]. Available:
https://developer.apple.com/documentation/eventkit. [Accessed: 06-Feb-
work with a limited labeled data set. A total of 2325 iOS applica- 2018].
tions across 20 different application categories were evaluated. Apple Inc., ‘‘Requesting Permission - App Architecture - iOS Human Interface
The empirical results demonstrate that the proposed approach of Guidelines.” [Online]. Available: https://developer.apple.com/ios/human-
interface-guidelines/app-architecture/requesting-permission/. [Accessed: 16-
using category of application and active learning improves the pre- Nov-2017].
cision rate obtained from supervised approaches rate by 14.5% Apple Inc., ‘‘AVFoundation | Apple Developer Documentation.” [Online]. Available:
(best case). The approach works well for iOS platform as it can https://developer.apple.com/documentation/avfoundation. [Accessed: 06-Feb-
2018].
detect malicious iOS applications by reducing the amount of
Apple Inc., ‘‘Social | Apple Developer Documentation.” [Online]. Available: https://
labeled dataset. Our research study can be employed to scrutinize developer.apple.com/documentation/social. [Accessed: 07-Feb-2018].
malicious iOS applications at an initial scan. We also compared Apple Inc., ‘‘Core Telephony | Apple Developer Documentation.” [Online]. Available:
iABC-AL with existing work on static analysis, active learning. With https://developer.apple.com/documentation/coretelephony. [Accessed: 07-Feb-
2018].
the help of empirical results, we demonstrated that addition of Apple Inc., ‘‘App Store - Apple (IN).” [Online]. Available: https://www.apple.com/in/
application’s category and using active learning approaches that ios/app-store/. [Accessed: 01-Feb-2018].

Please cite this article in press as: Bhatt, A.J., et al. iABC-AL: Active learning-based privacy leaks threat detection for iOS applications. Journal of King Saud
University – Computer and Information Sciences (2018), https://doi.org/10.1016/j.jksuci.2018.05.008
18 A.J. Bhatt et al. / Journal of King Saud University – Computer and Information Sciences xxx (2018) xxx–xxx

Apple Inc., ‘‘UIImagePickerController - UIKit | Apple Developer Documentation.” B. Settles, ‘‘Active Learning,” Act. Learn., pp. 89–102, 2012.
[Online]. Available: https://developer.apple.com/documentation/uikit/ ‘‘STAC Web Platform.” [Online]. Available: http://tec.citius.usc.es/stac/index.html.
uiimagepickercontroller. [Accessed: 07-Feb-2018]. [Accessed: 16-May-2018].
Apple Inc., ‘‘Xcode - Apple Developer.” [Online]. Available: https://developer. Stackoverflow, ‘‘Complete list of iOS app permissions – Stack Overflow.” [Online].
apple.com/xcode/. [Accessed: 14-Feb-2018]. Available: https://stackoverflow.com/questions/29894749/complete-list-of-
Apple Inc., ‘‘ACAccount – Accounts | Apple Developer Documentation.” [Online]. ios-app-permissions. [Accessed: 07-Mar-2018].
Available: https://developer.apple.com/documentation/accounts/acaccount. Statista Inc., ‘‘Apple iPhone unit sales worldwide 2007-2018 | Statistic.” [Online].
[Accessed: 12-Feb-2018]. Available: https://www.statista.com/statistics/263401/global-apple-iphone-
Apple Inc., ‘‘Accounts | Apple Developer Documentation.” [Online]. Available: sales-since-3rd-quarter-2007/. [Accessed: 09-Feb-2018].
https://developer.apple.com/documentation/accounts. [Accessed: 07-Feb- Statista, ‘‘Apple: most popular app store categories 2018 | Statistic.” [Online].
2018]. Available: https://www.statista.com/statistics/270291/popular-categories-in-
Apple Inc., ‘‘AVCaptureDevice - AVFoundation | Apple Developer Documentation.” the-app-store/. [Accessed: 09-Mar-2018].
[Online]. Available: https://developer.apple.com/documentation/avfoundation/ Stats Direct Limited, ‘‘P Values (Calculated Probability) and Hypothesis Testing -
avcapturedevice. [Accessed: 07-Feb-2018]. StatsDirect.” [Online]. Available: https://www.statsdirect.com/help/basics/
M.S. Bhatt, Arpita Jadhav, Gupta Chetna, ‘‘Disassembling and Patching iOS p_values.htm. [Accessed: 16-May-2018].
Applications at Assembly Level,” no. August, pp. 10–12, 2017. Symantec Corporation, ‘‘Report Finds Rate of iOS Malware Increasing Faster than
Chandramohan, M., Tan, H.B.K., 2012. Detection of mobile malware in the wild. Android Malware at iPhone Ten Year Anniversary.” [Online]. Available: https://
Computer (Long. Beach. Calif) 45 (9), 65–71. www.symantec.com/about/newsroom/press-releases/2017/skycure_0718_01.
Dag, H., Sayin, K.E., Yenidogan, I., Albayrak, S., Acar, C., 2012. Comparison of feature [Accessed: 01-Feb-2018].
selection algorithms for medical data. Int. Symp. Innov. Intell. Syst. Appl., 1–5 Szydlowski, M., Egele, M., Kruegel, C., Vigna, G., 2012. Challenges for dynamic
Dale Rapp, ‘‘How to Find Vulnerabilities in Mobile Apps through Reverse analysis of iOS applications. In: Lect. Notes Comput. Sci. (including Subser. Lect.
Engineering | daleswifisec.” [Online]. Available: https://dalewifisec.wordpress. Notes Artif. Intell. Lect. Notes Bioinformatics). LNCS, pp. 65–77.
com/2013/01/24/how-to-find-vulnerabilities-in-mobile-apps-through-reverse- F. Thung, X. B. D. Le, and D. Lo, ‘‘Active Semi-supervised Defect Categorization,” IEEE
engineering/. [Accessed: 12-Feb-2018]. Int. Conf. Progr. Compr., vol. 2015–Augus, pp. 60–70, 2015.
Felt, A.P., Ha, E., Egelman, S., Haney, A., Chin, E., Wagner, D., 2012. Android T. Veerarajan, Probability, Statistics and Random Process, Third. 2008. Tata
permissions demystified. Proc. Eighth Symp. Usable Priv. Secur. – SOUPS 12, 1. McGraw-Hill.
Huang, C.Y., Tsai, Y.T., Hsu, C.H., 2013. Performance evaluation on permission-based ‘‘vShare: Download paid apps for free on iOS 10(iPhone&iPad) and Android without
detection for android malware. Smart Innov. Syst. Technol. 21, 111–120. jailbreak.” [Online]. Available: http://www.vshare.com/. [Accessed: 10-Jun-
Intel, ‘‘Dynamic Analysis vs. Static Analysis.” [Online]. Available: https://software. 2017].
intel.com/en-us/node/622647. [Accessed: 12-Feb-2018]. Wang, T., Lu, K., Lu, L., Chung, S., Lee, W., 2013. Jekyll on iOS: when benign apps
A. Kurtz, A. Weinlein, C. Settgast, F. Freiling, ‘‘DiOS: Dynamic Privacy Analysis of iOS become evil. USENIX Secur. Symp., 559–572
Applications,” no. June, 2014. Wang, W., Wang, X., Feng, D., Liu, J., Han, Z., Zhang, X., 2014. Exploring permission-
S. Ma, S. Wang, D. Lo, R. H. Deng, and C. Sun, ‘‘Active Semi-supervised Approach for induced risk in android applications for malicious application detection. IEEE
Checking App Behavior against Its Description,” Proc. - Int. Comput. Softw. Appl. Trans. Inf. Forensics Secur. 9 (11), 1869–1882.
Conf., vol. 2, pp. 179–184, 2015. Wikipedia, ‘‘Active learning (machine learning)”.
Martin, W., Sarro, F., Jia, Y., Zhang, Y., Harman, M., 2016. A survey of app store Wikipedia, ‘‘Apple Inc.” [Online]. Available: https://en.wikipedia.org/wiki/Apple_
analysis for software engineering. IEEE Trans. Softw. Eng. 43 (9), pp. 1–1. Inc.
R. Moskovitch, N. Nissim, and Y. Elovici, ‘‘Malicious Code Detection Using Active Wikipedia, ‘‘IOS Application Security Testing Cheat Sheet - OWASP.” [Online].
Learning,” Privacy, Secur. Trust KDD, no. june 2007, pp. 74–91, 2009. Available: https://www.owasp.org/index.php/IOS_Application_Security_
Nissim, N., Cohen, A., Elovici, Y., 2017. ALDOCX: detection of unknown malicious Testing_Cheat_Sheet. [Accessed: 12-Feb-2018].
microsoft office documents using designated active learning methods based on Wikipedia, ‘‘Smartphone.” [Online]. Available: https://en.wikipedia.org/wiki/
new structural feature extraction methodology. IEEE Trans. Inf. Forensics Secur. Smartphone.
12 (3), 631–646. Xamarin, ‘‘TWTweetComposeViewController Class.” [Online]. Available: https://
N. Nissim et al., ‘‘ALPD: Active learning framework for enhancing the detection of developer.xamarin.com/api/type/Twitter.TWTweetComposeViewController/.
malicious PDF files,” Proc. - 2014 IEEE Jt. Intell. Secur. Informatics Conf. JISIC [Accessed: 12-Feb-2018].
2014, pp. 91–98, 2014. Zhao, M., Zhang, T., Ge, F., Yuan, Z., 2012. Robotdroid: A lightweight malware
N. Nissim, A. Cohen, and Y. Elovici, ‘‘Boosting the detection of malicious documents detection framework on smartphones. J. Networks 7 (4), 715–722.
using designated active learning methods,” Proc. - 2015 IEEE 14th Int. Conf.
Mach. Learn. Appl. ICMLA 2015, pp. 760–765, 2016.
Arpita Jadhav Bhatt obtained her M.E in Software systems for Birla Institute of
Pajouh, H.H., Dehghantanha, A., Khayami, R., Choo, K.K.R., 2017. Intelligent OS X
Information and Technology, Pilani Rajasthan and B.Tech from Rishiraj Institute of
malware threat detection with code inspection. J. Comput. Virol. Hacking Tech.,
1–11 Technology, Indore in 2010 and 2008 0respectively. She is currently working as an
Park, M.W., Choi, Y.H., Eom, J.H., Chung, T.M., 2014. Dangerous Wi-Fi access point: Assistant Professor in Jaypee Institute of Technology, Noida, India. Her areas of
Attacks to benign smartphone applications. Pers. Ubiquitous Comput. 18 (6), interest are mobile application engineering, software engineering, programming in
1373–1386. iOS, mobile computing and operating systems.
Ramirez-Loaiza, M.E., Sharma, M., Kumar, G., Bilgic, M., 2017. Active learning: an
empirical study of common baselines. Data Min. Knowl. Discov. 31 (2), 287– Dr. Chetna Gupta: is Assistant Professor (Senior Grade) in the Department of
313. Computer Science & Engineering from Jaypee Institute of Information and Tech-
B. Rashidi, C. Fung, and E. Bertino, ‘‘Android Malicious Application Detection Using nology, Noida, and India. She obtained her Doctorate in the area of Software Testing.
Support Vector Machine and Active Learning,” 2017. She also holds a Masters of Technology and a Bachelor of Engineering degree in
E. Rashidi, Bahman ; Fung, Carol; Bertino, ‘‘Android Malicious Application Detection Computer Science and Engineering. Her areas of interest are Software Engineering,
Using Permission Vector and Network Traffic Analysis,” pp. 1126–1132, 2017. Requirement Engineering, Software Testing, Software Project Management, Data
Raywenderlich, ‘‘iOS App Security and Analysis: Part 1/2.” [Online]. Available: Structures, Data Mining and Web Applications. She has many publications in
https://www.raywenderlich.com/45645/ios-app-security-analysis-part-1.
international journals and conferences to her credit.
[Accessed: 12-Feb-2018].
Reyes, O., Pérez, E., Del, M., Rodríguez-Hernández, C., Fardoun, H.M., Ventura, S.,
Dr. Sangeeta Mittal: is Assistant Professor in the Department of Computer Science &
2016. JCLAL: a java framework for active learning. J. Mach. Learn. Res. 17, 1–5.
Engineering from Jaypee Institute of Information and Technology, Noida, and India.
I. Rodriguez-Fdez, A. Canosa, M. Mucientes, and A. Bugarin, ‘‘STAC: A web platform
for the comparison of algorithms using statistical tests,” Fuzzy Syst. (FUZZ- She obtained her Doctorate in Computer Science & Engineering. She also holds a
IEEE), 2015 IEEE Int. Conf., pp. 1–8, 2015. Masters of Technology and a Bachelor of Technology degree in Computer Science
Rovelli, P., Vigfusson, Y., 2014. PMDS: permission-based malware detection system. and Engineering. Her areas of interest are Wireless Sensor Networks, Context Aware
10th Int. Conf. Inf. Syst. Secur. (ICISS 2014), 338–357. Systems and Sensor based Smart Environments. She has many publications in
B. Settles and M. Craven, ‘‘An analysis of active learning strategies for sequence international journals and conferences to her credit. She is member of IEEE and
labeling tasks,” Proc. Conf. Empir. Methods Nat. Lang. Process. - EMNLP ’08, p. ACM.
1070, 2008.
B. Settles, ‘‘Active Learning Literature Survey,” Sci. York, no. January, 2009.
