

A Bi-objective Hyper-heuristic Support Vector Machines for Big Data Cyber-Security

Nasser R. Sabar, Xun Yi and Andy Song

Abstract—Cyber security in the context of big data is known to be a critical problem and presents a great challenge to the research community. Machine learning algorithms have been suggested as candidates for handling big data security problems. Among these algorithms, support vector machines (SVMs) have achieved remarkable success on various classification problems. However, to establish an effective SVM, the user needs to define the proper SVM configuration in advance, which is a challenging task that requires expert knowledge and a large amount of manual effort for trial and error. In this work, we formulate the SVM configuration process as a bi-objective optimisation problem in which accuracy and model complexity are considered as two conflicting objectives. We propose a novel hyper-heuristic framework for bi-objective optimisation that is independent of the problem domain. This is the first time that a hyper-heuristic has been developed for this problem. The proposed hyper-heuristic framework consists of a high-level strategy and low-level heuristics. The high-level strategy uses the search performance to control the selection of which low-level heuristic should be used to generate a new SVM configuration. The low-level heuristics each use different rules to effectively explore the SVM configuration search space. To address bi-objective optimisation, the proposed framework adaptively integrates the strengths of decomposition- and Pareto-based approaches to approximate the Pareto set of SVM configurations. The effectiveness of the proposed framework has been evaluated on two cyber security problems: Microsoft malware big data classification and anomaly intrusion detection. The obtained results demonstrate that the proposed framework is very effective, if not superior, compared with its counterparts and other algorithms.

Index Terms—Hyper-heuristics, Big data, Cyber security, Optimisation.
N. R. Sabar is with the Department of Computer Science and Information Technology, La Trobe University, Melbourne, VIC, Australia. Email: n.sabar@latrobe.edu.au
X. Yi and A. Song are with the School of Computer Science and Information Technology, RMIT University, Australia. Email: {xun.yi, andy.song}@rmit.edu.au

I. Introduction

The rapid advancements in technologies and networking such as mobile, social and Internet of Things create massive amounts of digital information. In this context, the term big data has emerged to describe this massive amount of digital information. Big data refers to large and complex datasets containing both structured and unstructured data that are generated on a daily basis and need to be analysed in short periods of time [49]. The term big data differs from a big database in that big data indicates the data is too big, too fast, or too hard for existing tools to handle. Big data is commonly described by three characteristics: volume, variety and velocity (aka the 3Vs). The 3Vs define properties or dimensions of data, where volume refers to an extreme size of data, variety indicates that the data is generated from diverse sources, and velocity refers to the speed of data creation, streaming and aggregation [49]. The complexity and challenge of big data are mainly due to the expansion of all three characteristics (3Vs), rather than just the volume alone [14]. Learning from big data allows researchers, analysts and organisations to make better and faster decisions to enhance their operations and quality of life [38]. Given its practical applications and challenges, this field has attracted the attention of researchers and practitioners from various communities, including academia, industry and government agencies [14].

However, big data raises new issues related not only to the 3Vs characteristics but also to data security. It has been noted that big data does not only increase the scale of the challenges related to security but also creates new and different cyber-security threats that need to be addressed in effective and intelligent ways. Indeed, security is known as the prime concern for any organisation when learning from big data [47]. Examples of big data cyber-security challenges are malware detection, authentication and steganalysis [45]. Among these challenges, malware detection is the most critical in big data cyber-security. The term malware (short for malicious software) refers to various malicious computer programs, such as ransomware, viruses and scareware, that can infect computers and release important information via networks, email or websites [53]. Researchers and organisations have acknowledged the issues that can be caused by these dangerous programs, and therefore new methods should be developed to prevent them. Yet, despite the fact that malware is a crucial issue in big data, very little research has been done in this area [47]. Examples of malware detection methods include signature-based detection methods [22], behaviour-monitoring detection methods [54] and pattern-based detection methods [19], [53]. However, most existing malware detection methods are mainly designed for small-scale datasets and are unable to handle big data within a moderate amount of time. In addition, these methods can be easily evaded by attackers, are very costly to maintain and have very low success rates [53].

To address the above issues, machine learning (ML) algorithms have been proposed for classifying unknown


patterns and malicious software [45], [53]. ML has shown promising results in classifying and identifying unknown malware. Support vector machines (SVMs) are among the most popular ML algorithms and have shown remarkable success in various real-world applications [15]. The popularity of SVMs is due to their strong performance and scalability [40]. However, despite these advantages, the performance of an SVM is strongly affected by its selected configuration [9]. A typical SVM configuration includes the selection of the soft margin parameter (or penalty) and the kernel type as well as its parameters. In the literature, various methodologies have been developed for selecting SVM configurations. These methodologies can be classified based on the formulation of the SVM configuration problem and the optimisation method used [9], [12]. An SVM configuration formulation can rely on either a single criterion, in which case k-fold cross-validation is used to assess the performance of the generated configuration, or multiple criteria, in which case more than one criterion must be used to evaluate the generated configuration, such as the model accuracy and model complexity [46]. The available optimisation methods include grid search methods, gradient-based methods and meta-heuristic methods. Grid search methods are easy to implement and have shown good results [13]. However, they are computationally expensive, which limits their applicability to big data problems. Gradient-based methods are very efficient, but their main shortcomings are that they require the objective function to be differentiable and that they strongly depend on the initial point [4]. Meta-heuristic methods have been suggested to overcome the drawbacks of grid search methods and gradient-based methods [56], [5], [28]. However, the performance of a meta-heuristic method strongly depends on the selected parameters and operators, the selection of which is known to be a very difficult and time-consuming process. In addition, only one kernel is used in most works, and the search is performed over the parameter space of that kernel.
This work presents a novel bi-objective hyper-heuristic framework for SVM configuration optimisation. Hyper-heuristics are more effective than other methods because they are independent of the particular task at hand and can often obtain highly competitive configurations. Our proposed hyper-heuristic framework integrates several key components that differentiate it from existing works to find an effective SVM configuration for big data cyber security. First, the framework considers a bi-objective formulation of the SVM configuration problem, in which the accuracy and model complexity are treated as two conflicting objectives. Second, the framework controls the selection of both the kernel type and kernel parameters as well as the soft margin parameter. Third, the hyper-heuristic framework combines the strengths of decomposition- and Pareto-based approaches in an adaptive manner to find an approximate Pareto set of SVM configurations.

The performance of the proposed framework is validated and compared with that of state-of-the-art algorithms on two cyber security problems: Microsoft malware big data classification and anomaly intrusion detection. The empirical results fully demonstrate the effectiveness of the proposed framework on both problems.

The remainder of this paper is organised as follows. In the next section (Section II), we present a brief overview of related work. The definition and formulation of SVMs are presented in Section III. In Section IV, we describe the proposed hyper-heuristic framework and its main components. In Section V, we discuss the experimental setup, including the benchmark instances and the parameter settings of the proposed framework. In Section VI, we provide the computational results of our framework and compare the framework with other algorithms. Finally, the conclusion of this paper is presented in Section VII.

II. Related work

In this section, we briefly discuss related work on malware detection methods and meta-learning methods, followed by a review of hyper-heuristics for classification problems.

A. Malware detection methods

A recent survey by Ye et al. [53] classified malware detection methods into three types: signature-based detection methods, pattern-based detection methods and cloud-based detection methods. Most existing detection methods use signatures to detect malware [21], [22]. A signature is a unique short string of bytes defined for each known malware program so that it can be used to detect future unknown software [22]. Although signature-based detection methods are able to detect malware, they require constant updating to include the signatures of new malware in the signature database. In addition, they can be easily evaded by malware developers through encryption, polymorphism or obfuscation [53]. Furthermore, the signature database is usually created manually by domain experts, which is known to be a tedious and time-consuming task [16].

Pattern-based detection methods check whether a given piece of software contains a set of patterns. The patterns are extracted by domain experts to distinguish malware from benign files [2], [10], [35]. However, the analysis of malware and the extraction of patterns by domain experts are error-prone and require a huge amount of time [19]. This indicates that manual analysis and extraction are major issues in developing pattern-based detection methods because malware grows very fast [53].

Cloud-based detection methods use a server to store the detection software so that malware detection can be performed in a client-server manner using a cloud-based architecture [54], [41], [53]. However, cloud-based detection methods are highly affected by the available number of cluster nodes and the running time of the detection methods [29]. This can slow down the detection process, and thus multiple malware programs cannot be easily detected.


Generally speaking, due to the economic benefits, malware is becoming increasingly complex, and malware developers employ automated malware development toolkits to write and modify malware code so as to evade detection methods [44]. In addition, existing methods are not scalable enough to deal with big data and are less responsive to new threats due to the quickly changing nature of malware. Machine learning (ML) algorithms have been suggested as malware detection methods to automatically detect malware [53]. However, designing an effective detection method using a machine learning algorithm is a challenging task due to the large number of possible design options and the lack of an intelligent way to choose and/or combine existing options. This work addresses these challenges by proposing a hyper-heuristic framework that searches the space of the design options and their values, and iteratively combines and adapts different options for different problem instances.

B. Meta-learning approaches

A traditional SVM has several tunable parameters that need to be optimised in order to obtain high-quality results [9]. Meta-learning approaches have been widely used to find the best combination of parameters and their values for SVMs. Meta-learning is an approach that aims at understanding the problem characteristics and the best algorithm that fits them [52]. In particular, it tries to discover or learn which problem features contribute to algorithm performance and then recommends the appropriate algorithm for that problem. Soares et al. [43] proposed a meta-learning approach to find the parameter values of the Gaussian kernel for SVMs solving regression problems. The authors used K-NN as a ranking method to select the best value for the kernel width parameter. Reif et al. [36] hybridised meta-learning and case-based reasoning to generate the initial starting solutions for a genetic algorithm. The proposed genetic algorithm is used to find appropriate parameter values for a given classifier to solve a given problem instance. In [3], the authors employed a meta-learning approach that uses classical, distance and distribution statistical information to recommend a kernel method for SVMs. In [24], the authors proposed a hybrid method that combines meta-learning and search algorithms to select SVM parameter values. Other examples that use meta-learning approaches to tune SVMs are [32], [31], [37] and [30].

Although meta-learning approaches have been shown to be effective in tuning SVM parameter values, they still face the problem of over-fitting. This is because the extracted problem features only capture the instances that have been used during the training process. In addition, most existing approaches tune a single kernel method and were tested on small-scale instances. Our proposed framework uses multiple kernel methods, and the selection process is formulated as a bi-objective optimisation to effectively deal with big data problems.

C. Hyper-heuristics

A hyper-heuristic is an emergent search method that seeks to automate the process of combining or generating an effective problem solver [11]. A traditional hyper-heuristic framework takes all possible design options as input and then decides which one should be used. The output of a hyper-heuristic framework is a problem solver rather than a solution [39]. Sim et al. [42] proposed a hyper-heuristic framework to generate a set of attributes that characterise a given instance of the one-dimensional bin packing problem. The authors used the hyper-heuristic framework to predict which heuristic should be used to solve the current problem instance. Ortiz-Bayliss et al. [34] proposed a learning vector quantization neural network based hyper-heuristic framework for solving constraint satisfaction problems. The hyper-heuristic framework was trained to decide which heuristic to select based on the given properties of the instance at hand. In [25], the authors presented a stochastic hyper-heuristic framework for unsupervised matching of partial information. The hyper-heuristic framework was implemented as a feature selection method to determine which subset of features should be selected. In [7], the authors proposed a hyper-heuristic framework to evolve decision trees for software effort prediction. Other examples that use hyper-heuristic frameworks to evolve classifiers are [51], [6] and [8].

III. Problem description

This section is divided into three subsections. We first describe the SVM process, followed by the formulation of the configuration problem. Finally, we present the proposed multi-objective formulation of the SVM configuration problem.

A. Support vector machines

SVMs are a class of supervised learning models that have been widely used for classification and regression [50]. SVMs are based on statistical learning theory and are better able to avoid local optima than other classification algorithms. An SVM is a kernel-based learning algorithm that seeks the optimal hyperplane. The kernel learning process maps the input patterns into a higher-dimensional feature space in which linear separation is feasible. Suppose that we have $L$ sample pairs $\{(x_i, y_i) \mid x_i \in \mathbb{R}^v, y_i \in \mathbb{R}\}$, where $x_i$ is an input vector of dimensionality $v$ and $y_i$ is the output corresponding to $x_i$. The basic idea of the SVM approach is to map the input vector $x_i$ into an $N$-dimensional feature space and then construct the optimal decision-making function in the feature space as follows [50]:

$$\min \Big( \frac{1}{2}\|\omega\|^2 + C \sum_{i=1}^{L} (\xi_i + \xi_i^*) \Big) \tag{1}$$

subject to

$$y_i - f(x_i) \le \varepsilon + \xi_i, \qquad f(x_i) - y_i \le \varepsilon + \xi_i^*, \qquad \xi_i, \xi_i^* \ge 0, \quad i = 1, 2, \ldots, L$$


where $\omega = (\omega_1, \omega_2, \ldots, \omega_N)^T$ is the weight vector; $C$ is the margin parameter (or penalty); $\varepsilon$ is the insensitive loss coefficient, which controls the number of support vectors; and $\xi_i$ and $\xi_i^*$ are two slack variables, which take non-negative values. Equation (1) can be transformed into a dual problem, in which the optimal decision can be obtained by solving the following:

$$f(x) = \sum_{i=1}^{L} (\alpha_i - \alpha_i^*) K(x, x_i) + b \tag{2}$$

where $\alpha_i$ and $\alpha_i^*$ are Lagrange coefficients representing the two slack variables, $b \in \mathbb{R}$ is the bias, and $K(x, x_i)$ is the kernel function

$$K(x, x_i) = \langle \Phi(x), \Phi(x_i) \rangle \tag{3}$$

Here, $\Phi(\cdot)$ represents the mapping function to the feature space. The kernel function is used to compute the dot product of two sample points in the high-dimensional space. Table I summarises the kernel functions that have been widely used in SVMs [9]. In this table, $\alpha$, $\beta$ and $d$ are kernel parameters that need to be set by the user.

TABLE I: Kernel functions

Name                      Formula
Radial                    $K(x, x_i) = \exp(-\alpha \|x - x_i\|^2)$
Polynomial                $K(x, x_i) = (\alpha (x \cdot x_i) + \beta)^d$
Sigmoidal                 $K(x, x_i) = \tanh(\alpha (x \cdot x_i) + \beta)$
ANOVA                     $K(x, x_i) = \big(\sum_k \exp(-\alpha (x^k - x_i^k)^2)\big)^d$
Inverse multi-quadratic   $K(x, x_i) = 1/\sqrt{\|x - x_i\|^2 + \beta}$
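To make Table I concrete, the NumPy sketch below implements the five kernels as we read them from the table; the parameter names alpha, beta and d follow the table, and the exact form of the ANOVA kernel is our reading of the original (the per-component sum raised to the power d), so treat it as an assumption rather than the authors' code.

```python
import numpy as np

def radial(x, xi, alpha):
    # Radial (RBF) kernel: exp(-alpha * ||x - xi||^2)
    return np.exp(-alpha * np.sum((x - xi) ** 2))

def polynomial(x, xi, alpha, beta, d):
    # Polynomial kernel: (alpha * <x, xi> + beta)^d
    return (alpha * np.dot(x, xi) + beta) ** d

def sigmoidal(x, xi, alpha, beta):
    # Sigmoidal kernel: tanh(alpha * <x, xi> + beta)
    return np.tanh(alpha * np.dot(x, xi) + beta)

def anova(x, xi, alpha, d):
    # ANOVA kernel (assumed form): (sum_k exp(-alpha * (x_k - xi_k)^2))^d
    return np.sum(np.exp(-alpha * (x - xi) ** 2)) ** d

def inverse_multiquadratic(x, xi, beta):
    # Inverse multi-quadratic kernel: 1 / sqrt(||x - xi||^2 + beta)
    return 1.0 / np.sqrt(np.sum((x - xi) ** 2) + beta)

x, y = np.array([1.0, 2.0]), np.array([0.5, 1.5])
print(radial(x, y, alpha=0.1), polynomial(x, y, alpha=1.0, beta=1.0, d=2))
```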
The existing kernel functions can be classified as either local or global kernel functions [9]. Local kernel functions have a good learning ability but a poor generalisation ability. By contrast, global kernel functions have a good generalisation ability but a poor learning ability. For example, the radial kernel function is known to be a local function, whereas the polynomial kernel function is a global kernel function. The main challenge lies in determining which kernel function should be used for the current problem instance or the current decision point. This is because the kernel selection process strongly depends on the distribution of the input vectors and the relationship between the input vector and the output vector (predicted variables). However, the feature space distribution is not known in advance and may change during the course of the solution process, especially in big data cyber security. Consequently, different kernel functions may work well for different instances or in different stages of the solution process, and kernel selection may thus have a crucial impact on SVM performance. To address this issue, in this work, we use multiple kernel functions to improve the accuracy of our algorithm and avoid the shortcomings of using a single kernel function.

B. SVM configuration formulation

A traditional SVM configuration specifies the appropriate values for $C$, the kernel type and the kernel parameters. The aim is to find SVM configurations from the space of all possible configurations that minimise the expected error when tested on completely new data. This can be represented as a black-box optimisation problem that seeks an optimal cross-validation error ($I$) and can be expressed as a tuple of the form $\langle SVM, \Theta, D, C, S \rangle$, where [26]

• SVM is the parametrised algorithm,
• $\Theta$ is the search space of the possible SVM configurations ($C$, kernel type and kernel parameters),
• $D$ is the distribution of the set of instances,
• $C$ is the cost function, and
• $S$ is the statistical information.

$$\theta^* \in \arg\min_{\theta \in \Theta} I(\theta) \tag{4}$$

The goal is to optimise the cost function $C: \Theta \times D \mapsto \mathbb{R}$ of the SVM over a set of problem instances $\pi \in D$ to find

$$\theta^* \in \arg\min_{\theta \in \Theta} \frac{1}{|D|} \sum_{\pi \in D} C(\theta, \pi) \tag{5}$$

Each $\theta \in \Theta$ represents one possible configuration of the SVM. The cost function $C$ represents a single execution of the SVM using $\theta$ to solve a problem instance $\pi \in D$. The statistical information $S$ (e.g., a mean value) summarises the output of $C$ obtained when testing the SVM across a set of instances. The main role of the proposed hyper-heuristic framework is to find a $\theta \in \Theta$ such that $C(\theta)$ is optimised.

C. Multi-objective formulation

A multi-objective optimisation problem involves more than one objective function, all of which need to be optimised simultaneously [17]. A general multi-objective optimisation problem of the minimisation type can be represented as follows:

$$\begin{aligned} \min\ & F(X) = [f_1(x), f_2(x), \ldots, f_m(x)] \\ \text{s.t.}\ & \zeta(X) = [\zeta_1(x), \zeta_2(x), \ldots, \zeta_c(x)] \ge 0 \\ & x_i^{(L)} \le x_i \le x_i^{(U)} \end{aligned} \tag{6}$$

where $X = (x_1, x_2, \ldots, x_N)$ is a set of $N$ decision variables, $m$ is the number of objectives $f_i$, $c$ is the number of constraints $\zeta(X)$, $x_i^{(L)}$ is the lower bound on the $i$th decision variable, and $x_i^{(U)}$ is the upper bound on the $i$th decision variable. In a multi-objective optimisation problem, two solutions are compared using the concept of dominance ($\prec$). Given two solutions $a$ and $b$, $a$ is said to dominate $b$ ($a \prec b$) if $a$ is superior or equal to $b$ in all objectives and strictly superior to $b$ in at least one objective [17]:

$$F(a) \prec F(b) \iff \begin{cases} f_i(a) \le f_i(b), & \forall i = 1, \ldots, m \\ \exists i \in \{1, \ldots, m\}: f_i(a) < f_i(b) \end{cases} \tag{7}$$

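Equation (7) translates directly into code. The sketch below is our illustration, not the authors' implementation: it checks dominance for minimisation objectives and filters a set of objective vectors down to its non-dominated subset, which is also the operation behind the archive of Section IV-G. The example (err, NSV) pairs are hypothetical.

```python
from typing import List, Sequence

def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """a dominates b (minimisation): a is no worse in every objective
    and strictly better in at least one, as in Equation (7)."""
    return (all(ai <= bi for ai, bi in zip(a, b))
            and any(ai < bi for ai, bi in zip(a, b)))

def non_dominated(points: List[Sequence[float]]) -> List[Sequence[float]]:
    """Keep only the points not dominated by any other point (the Pareto set)."""
    return [p for i, p in enumerate(points)
            if not any(dominates(q, p)
                       for j, q in enumerate(points) if j != i)]

# Hypothetical (err, NSV) pairs for four SVM configurations:
objectives = [(0.10, 25), (0.08, 40), (0.12, 20), (0.10, 30)]
print(non_dominated(objectives))  # -> [(0.10, 25), (0.08, 40), (0.12, 20)]
```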

A solution $a$ is Pareto optimal if there is no other solution that dominates it. Accordingly, the set of all Pareto-optimal solutions is called the Pareto-optimal set ($PS$), and its image in the objective space is called the Pareto front ($PF$). The main goal of optimisation algorithms is to find the optimal $PS$.

For an SVM, the accuracy can be seen as a trade-off between the complexity (the number of support vectors, $NSV$) and the margin ($C$) [46]. A large number of support vectors may lead to over-fitting, whereas a large value of $C$ chosen to increase the generalisation ability may result in the incorrect classification of some samples. This trade-off can be controlled through the selection of the SVM configuration ($C$, kernel type and kernel parameters). To this end, in this work, we consider the accuracy and the complexity (number of support vectors, $NSV$) achieved over the training instances as two conflicting objectives [46]:
• Accuracy. The accuracy represents the classification performance on a given problem instance. It can be calculated via so-called K-fold cross-validation (CV), in which the given instance is split into $K$ disjoint sets $D_1, \ldots, D_K$ of the same size. For each configuration ($\theta \in \Theta$), the SVM is trained $K$ times. In each iteration, $K-1$ sets are used for training, and the remaining set is used for performance testing. The error ($err$) represents the average number of misclassified data sets over the $K$ training iterations.
• Complexity. The complexity represents the number of support vectors ($NSV$), or the upper bound on the expected number of errors.

The SVM configuration $\theta \in \Theta$ comprises the decision variables ($C$, kernel type and kernel parameters). The bounds on each decision variable represent its range of possible values. The two objectives to be optimised ($m = 2$) can be formulated as follows [46]:

$$\min F(X) = [f_1(x), f_2(x)], \quad \text{where } f_1(x) = err,\ f_2(x) = NSV \tag{8}$$

where $err$ represents the number of misclassified data sets and $NSV$ denotes the number of support vectors.

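To make Equation (8) concrete, the sketch below scores one configuration θ = (C, kernel, kernel parameters) on the two objectives using scikit-learn. This is our choice for illustration (the paper does not name an implementation); the helper name, the dictionary layout of θ, and the stand-in dataset are all assumptions.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

def evaluate_configuration(theta, X, y, k=5):
    """Return (err, nsv) for one SVM configuration theta,
    i.e. f1 and f2 of Equation (8), averaged over K folds."""
    errs, nsvs = [], []
    for train_idx, test_idx in StratifiedKFold(n_splits=k).split(X, y):
        clf = SVC(C=theta["C"], kernel=theta["kernel"], **theta["params"])
        clf.fit(X[train_idx], y[train_idx])
        errs.append(1.0 - clf.score(X[test_idx], y[test_idx]))  # misclassification rate
        nsvs.append(len(clf.support_))                          # number of support vectors
    return float(np.mean(errs)), float(np.mean(nsvs))

X, y = load_digits(return_X_y=True)  # stand-in dataset for illustration
theta = {"C": 10.0, "kernel": "rbf", "params": {"gamma": 0.001}}
print(evaluate_configuration(theta, X, y))
```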

IV. Methodology

The flowchart of the proposed methodology (abbreviated as HH-SVM) is depicted in Figure 1. The methodology has two parts: the SVM and the hyper-heuristic framework. The main role of the hyper-heuristic framework is to generate a configuration ($C$, kernel type and kernel parameters) and send it to the SVM. The SVM uses the generated configuration to solve a given problem instance and then sends the cost function (the mean values of $err$ and $NSV$) back to the hyper-heuristic framework. This process is repeated for a certain number of iterations. In the following subsections, we discuss the proposed hyper-heuristic framework along with its main components.

[Fig. 1: The proposed methodology. The hyper-heuristic framework (high level and low level) passes an SVM configuration θ to the SVM, which returns its cost C.]

A. The proposed hyper-heuristic framework

The proposed hyper-heuristic framework for configuration selection is shown in Figure 2. It has two levels: the high-level strategy and the low-level heuristics [11]. The high-level strategy operates on the heuristic space instead of the solution space. In each iteration, the high-level strategy selects a heuristic from the existing pool of low-level heuristics, applies it to the current solution to produce a new solution and then decides whether to accept the new solution. The low-level heuristics constitute a set of problem-specific heuristics that operate directly on the solution space of a given problem [39].

To address the bi-objective optimisation problem, we propose a population-based hyper-heuristic framework that operates on a population of solutions and uses an archive to save the non-dominated solutions. The proposed framework combines the strengths of decomposition- and Pareto (dominance)-based approaches to effectively approximate the Pareto set of SVM configurations. Our idea is to combine the diversity ability of the decomposition approach with the convergence power of the dominance approach. The decomposition approach operates on the population of solutions, whereas the dominance approach uses the archive. The hyper-heuristic framework generates a new population of solutions using either the old population, the archive, or both. This allows the search to achieve a proper balance between convergence and diversity. It should be noted that seeking good convergence involves minimising the distances between the solutions and $PF$, whereas seeking high diversity involves maximising the distribution of the solutions along $PF$.

The main components of the proposed hyper-heuristic framework are discussed in the following subsections.

B. Solution representation

In our framework, each solution represents one configuration ($\theta \in \Theta$) of the SVM, which is encoded as a one-dimensional array, as shown in Figure 3. In this figure, $C$ is the margin parameter (or penalty), $KF$ is the index of the selected kernel function, and $k_1, k_2, \ldots, k_{KF}$ are the parameters of that kernel function.

[Fig. 3: Solution representation — a one-dimensional array of the form [C, KF, k1, k2, ..., kKF].]
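A minimal sketch of this encoding, under our own naming assumptions (the kernel list follows Table I; the decode helper is hypothetical):

```python
# Hypothetical encoding of one SVM configuration as a flat array:
# position 0 -> C, position 1 -> kernel index KF, positions 2.. -> kernel parameters.
KERNELS = ["radial", "polynomial", "sigmoidal", "anova", "inverse_multiquadratic"]

def decode(solution):
    """Map a one-dimensional array to an SVM configuration (illustrative)."""
    C = solution[0]
    kf = int(solution[1]) % len(KERNELS)  # index of the selected kernel function
    params = solution[2:]                 # k1, k2, ..., interpreted per kernel
    return {"C": C, "kernel": KERNELS[kf], "params": list(params)}

print(decode([10.0, 0, 0.5]))  # e.g. the radial kernel with alpha = 0.5
```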
C. Population initialisation

The population of solutions ($PS$) is randomly initialised. We use the following equation to assign a random value to each decision variable in a given solution ($x$):

$$x_i^p = l_i^p + Rand_i^p(0, 1) \times (u_i^p - l_i^p), \quad p = 1, 2, \ldots, |PS|,\ i = 1, 2, \ldots, d \tag{9}$$

where $i$ is the index of the decision variable, $d$ is the total number of decision variables, $p$ is the index of the solution, $|PS|$ is the population size, $Rand_i^p(0, 1)$ returns a random value in the range $[0, 1]$ for the $i$th decision variable, $l_i^p$ is the lower bound on the value of that decision variable, and $u_i^p$ is the upper bound.
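Equation (9) is ordinary uniform initialisation within the variable bounds; a NumPy sketch (the bounds in the example are hypothetical):

```python
import numpy as np

def initialise_population(pop_size, lower, upper, rng=None):
    """Uniform random initialisation following Equation (9):
    x_i^p = l_i + Rand(0,1) * (u_i - l_i)."""
    rng = rng or np.random.default_rng()
    lower, upper = np.asarray(lower), np.asarray(upper)
    return lower + rng.random((pop_size, lower.size)) * (upper - lower)

# Illustrative bounds for [C, KF, k1]: C in [0.1, 100], kernel index in [0, 4],
# first kernel parameter in [0.01, 2].
population = initialise_population(20, [0.1, 0.0, 0.01], [100.0, 4.0, 2.0])
print(population.shape)  # (20, 3)
```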
D. Fitness calculation

The fitness calculation assigns a value to each solution in the population that indicates how good this solution is compared with those in the current population. In this work, we use the MOEA/D approach to solve the multi-objective optimisation problem for selecting the SVM configuration. In this approach, a given multi-objective optimisation problem is first decomposed into a number of single-objective sub-problems, and then all sub-problems are solved in a collaborative manner [57]. MOEA/D uses a scalarisation function to decompose a given problem into a number of scalarised single-objective sub-problems as follows [57]:

$$g^{te}(x, \lambda) = \max_{i \in m} \big(\lambda_i \, |z_i^* - f_i(x)|\big) \tag{10}$$

where $g^{te}$ is the Tchebycheff decomposition approach, $x$ is a given solution (SVM configuration), $m$ is the number of objectives (in this work, $m = 2$), and $\lambda = (\lambda_1, \lambda_2, \ldots, \lambda_m)$ is a weighting vector such that $\lambda_i \ge 0, \forall i \in m$. $f_i$ is the fitness value for the $i$th objective calculated using Equation (8). $z^* = (z_1^*, z_2^*, \ldots, z_m^*)$ is the ideal or reference point, i.e., $z_i^* = \min\{f_i(x) \mid x \in \Omega\}$ for each $i = 1, 2, \ldots, m$.
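The Tchebycheff scalarisation of Equation (10) is a one-liner in practice; a small sketch, with illustrative weights and reference point:

```python
import numpy as np

def tchebycheff(f, weights, z_star):
    """g_te(x, lambda) = max_i lambda_i * |z*_i - f_i(x)| (Equation (10))."""
    f, weights, z_star = map(np.asarray, (f, weights, z_star))
    return float(np.max(weights * np.abs(z_star - f)))

# Example: objectives (err, NSV); z* holds the best values found so far.
print(tchebycheff(f=[0.10, 30.0], weights=[0.7, 0.3], z_star=[0.05, 20.0]))
# 0.7*|0.05-0.10| = 0.035 ; 0.3*|20-30| = 3.0 -> 3.0
```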
E. High-level strategy

The main role of the high-level strategy is to automate the heuristic selection process [39]. Our proposed high-level strategy consists of the following steps.

1) Select: The selection step involves selecting one heuristic from the existing pool of heuristics using a selection mechanism. In this work, we use a Multi-Armed Bandit (MAB) [39] as an on-line heuristic selection mechanism. In the MAB, the past performance of each heuristic is saved; these performances are then used to decide which heuristic should be selected. Each heuristic is associated with two variables: the empirical reward $q_i$ and the confidence level $n_i$. The empirical reward $q_i$ represents the average reward obtained during the search process using this heuristic; a higher value is better. The confidence level $n_i$ is the number of times that the $i$th heuristic has previously been applied. Based on these two variables, the MAB calculates a confidence interval for each heuristic and then selects the heuristic that maximises the following formula (Equation (11)):

$$\arg\max_{i = LLH_1, \ldots, LLH_n} \left( q_{i(t)} + c \sqrt{\frac{2 \log \sum_{j=LLH_1}^{LLH_n} n_{j(t)}}{n_{i(t)}}} \right) \tag{11}$$

The pool of heuristics in our framework is denoted by $\{LLH_1, \ldots, LLH_n\}$, where $n$ is the total number of heuristics. The index $t$ is the time step, i.e., the number of the current iteration of the search. $c$ is a scaling factor that adjusts the balance between the influence of the empirical reward and the confidence level, so that the confidence interval will not be excessively biased by either of these indicators. For example, a highly rewarded but infrequently used heuristic should most likely be less preferred than a frequently used heuristic whose reward value is only slightly lower.

The empirical reward $q_{i(t)}$ is calculated as follows:

$$q_{i(t)} = \frac{n_{i(t-1)} \times q_{i(t-1)} + r_{i(t)}}{n_{i(t)}} \tag{12}$$

where $r_{i(t)}$ is the accumulated reward value of heuristic $LLH_i$ up to time $t$, computed using Equation (13):

$$r_{i(t)} = \sum \Delta g^{te}(x, \lambda) \tag{13}$$

That is, $r_{i(t)}$ is the sum of the fitness improvements introduced by heuristic $i$ from the beginning of the search up to the current iteration $t$.
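Equations (11)-(13) amount to a UCB-style bandit; a compact sketch under our own naming assumptions (the paper gives the formulas but not code):

```python
import math

class MultiArmedBandit:
    """UCB-style heuristic selection following Equations (11)-(13)."""

    def __init__(self, n_heuristics, c=1.0):
        self.c = c                      # scaling factor in Equation (11)
        self.q = [0.0] * n_heuristics   # empirical rewards q_i
        self.n = [0] * n_heuristics     # confidence levels n_i (times applied)

    def select(self):
        # Apply each heuristic once before trusting the statistics.
        for i, ni in enumerate(self.n):
            if ni == 0:
                return i
        total = sum(self.n)
        return max(range(len(self.q)),
                   key=lambda i: self.q[i] +
                       self.c * math.sqrt(2.0 * math.log(total) / self.n[i]))

    def update(self, i, reward):
        # Running-average form of Equation (12); the reward is the
        # fitness improvement of Equation (13) for this application.
        self.n[i] += 1
        self.q[i] += (reward - self.q[i]) / self.n[i]

mab = MultiArmedBandit(n_heuristics=6, c=1.0)
h = mab.select()
mab.update(h, reward=0.2)
```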
2) Apply: Two tasks are performed in the application step:

• Solution selection. This task determines which solutions should be selected to form the mating pool. In this work, we propose to utilise the advantages of both the decomposition- and Pareto (dominance)-based approaches during the solution selection process. In MOEA/D, each solution in the current population represents a sub-problem. To combine decomposition and dominance, we optimise each sub-problem using information from only its neighbouring sub-problems with probability $p_n$, from both the neighbouring sub-problems and the archive with probability $p_{na}$, or from only the archive with probability $p_a$. A fixed set of neighbouring solutions for each sub-problem is determined using the Euclidean distances between any two solutions based on their weight vectors.
• Heuristic application. In this task, the selected heuristic is applied to the created mating pool to evolve a new set of solutions.


[Fig. 2: Hyper-heuristic framework. The high-level strategy (select LLH → apply LLH → accept solution? → terminate?) sits above a domain barrier; below it, the low-level heuristics LLH1, ..., LLHn operate on the population of solutions Sol1, ..., Solk and the archive.]

3) Accept solution: The acceptance step checks whether the newly generated solutions should be accepted. In this work, we first compare each solution $x$ with its neighbouring sub-problems $y$; $x$ will replace $y$ if it is superior in terms of the scalarisation function, i.e., $g^{te}(x, \lambda) < g^{te}(y, \lambda)$. Next, we update the archive using the non-dominated solutions.

4) Terminate: This step terminates the search process. It checks whether the total number of iterations has been reached and, if so, terminates the search process and returns the set of non-dominated solutions. Otherwise, it starts a new iteration.
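Tying the four steps together, the high-level strategy is a plain select-apply-accept loop. The schematic sketch below is our reading of Section IV-E, not the authors' code; the mating-pool construction and acceptance test are passed in as callables and elided here.

```python
def hyper_heuristic_search(population, archive, heuristics, mab,
                           evaluate, accept, n_iterations):
    """Schematic high-level strategy (Section IV-E): select a low-level
    heuristic, apply it, accept or reject the result, and stop after a
    fixed number of iterations. Illustrative only."""
    for _ in range(n_iterations):
        h = mab.select()                                 # 1) Select, Eq. (11)
        offspring = heuristics[h](population, archive)   # 2) Apply
        reward = accept(population, archive, offspring, evaluate)  # 3) Accept
        mab.update(h, reward)        # credit = fitness improvement, Eq. (13)
    return archive                   # 4) Terminate: return non-dominated set
```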
F. Low-level heuristics

The low-level heuristics (LLHs) are a set of problem-specific rules that operate directly on a given solution. Each LLH takes one or more solutions as input and then modifies them to generate a new solution. In this work, we utilise various sets of heuristics within the proposed framework. These heuristics have been demonstrated to be suitable for different problems and even for different stages of the same problem. They are chosen so as to incorporate various characteristics into the search and to include different search behaviours. The heuristics are as follows [20]:

1) Parametrised Gaussian Mutation

$$x = x + N(Mean, \sigma^2) \tag{14}$$

where $Mean = 0$ and $\sigma^2 = 0.5$.

2) Differential Mutation 1

$$x = x_1 + F \times (x_2 - x_3) \tag{15}$$

3) Differential Mutation 2

$$x = x_1 + F \times (x_2 - x_3) + F \times (x_4 - x_5) \tag{16}$$

4) Differential Mutation 3

$$x = x_1 + F \times (x_1 - x_2) + F \times (x_3 - x_4) \tag{17}$$

where $x_1, x_2, x_3, x_4$ and $x_5$ are five different solutions selected from the mating pool in accordance with the solution selection process discussed in Section IV-E2, and $F$ is a scaling factor whose value is fixed to 0.9 in this study.

5) Arithmetic Crossover

$$x = \lambda \times x_1 + (1 - \lambda) \times x_2 \tag{18}$$

where $\lambda$ is a randomly generated number in the range $[0, 1]$, $x_1$ is the current sub-problem, and $x_2$ is the best solution in its neighbourhood.


6) Polynomial Mutation

$$x = \begin{cases} x_1 + \sigma \times (b - a), & \text{if } Rand \le 0.5 \\ x_1, & \text{otherwise} \end{cases} \tag{19}$$

where

$$\sigma = \begin{cases} (2 \times Rand)^{\frac{1}{\eta + 1}} - 1, & \text{if } Rand \le 0.8 \\ 1 - (2 - 2 \times Rand)^{\frac{1}{\eta + 1}}, & \text{otherwise} \end{cases}$$

Here, $\eta$ is set to 0.3, and $a$ and $b$ are the lower and upper bounds, respectively, on the value of the $i$th decision variable.
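The six operators are standard evolutionary moves; the NumPy sketch below is our reading of Equations (14)-(19) (F = 0.9 and eta = 0.3 as stated in the text; the draw of the random numbers inside polynomial mutation is an assumption, since the paper does not spell it out).

```python
import numpy as np

rng = np.random.default_rng()

def gaussian_mutation(x, sigma=0.5):
    # Equation (14): add zero-mean Gaussian noise to every variable.
    return x + rng.normal(0.0, sigma, size=x.shape)

def differential_mutation_1(x1, x2, x3, F=0.9):
    # Equation (15)
    return x1 + F * (x2 - x3)

def differential_mutation_2(x1, x2, x3, x4, x5, F=0.9):
    # Equation (16)
    return x1 + F * (x2 - x3) + F * (x4 - x5)

def differential_mutation_3(x1, x2, x3, x4, F=0.9):
    # Equation (17)
    return x1 + F * (x1 - x2) + F * (x3 - x4)

def arithmetic_crossover(x1, x2):
    # Equation (18): lam drawn uniformly from [0, 1].
    lam = rng.random()
    return lam * x1 + (1.0 - lam) * x2

def polynomial_mutation(x1, a, b, eta=0.3):
    # Equation (19), applied with probability 0.5 (assumed fresh Rand for sigma).
    if rng.random() > 0.5:
        return x1.copy()
    r = rng.random()
    if r <= 0.8:
        sigma = (2.0 * r) ** (1.0 / (eta + 1.0)) - 1.0
    else:
        sigma = 1.0 - (2.0 - 2.0 * r) ** (1.0 / (eta + 1.0))
    return x1 + sigma * (b - a)

x = np.array([10.0, 0.0, 0.5])
print(gaussian_mutation(x))
```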
G. Archive

The archive saves the set of non-dominated solutions and is updated in each iteration. In this work, the newly generated solutions are first added to the archive. Then, following the concept of NSGA-II [18], we use the non-dominated sorting procedure to divide the archive into several levels of non-domination, such that solutions in the first level have the highest priority to be selected, those in the second level have the second highest priority, and so on. To ensure that the selected solutions are distributed along the Pareto front ($PF$), we may also select some solutions at the lowest level, depending on the crowding distance measure.

V. Experimental setup

This section summarises the benchmark instances that were used to assess the proposed framework and the parameter settings.

A. Benchmark instances

In this work, we analysed our proposed framework on two different cyber security problems with a broad range of different structures and sizes.

1) Microsoft malware big data classification: The first experimental evaluation uses the Microsoft malware big data classification problem, which was introduced for BIG 2015 and hosted at Kaggle (https://www.kaggle.com/c/malware-classification/data). Microsoft provided a total of 500 GB of data of known malware files representing a mix of 9 families (classes), split into two parts: training and testing. A total of 10,868 malware samples are included in the training set, and 10,783 malware samples are included in the testing set. Each sample is a binary file with the extension ".bytes"; the corresponding disassembled file in assembly language (text) has the extension ".asm". The ultimate goal is to train the classification algorithm using the training samples to effectively classify each of the testing samples into one of the 9 categories (malware families) such that the logloss function below is minimised:

$$logloss = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{ij} \log(p_{ij}) \tag{20}$$

where $N$ represents the number of training samples, $M$ is the number of classes, $\log$ is the natural logarithm, and $y_{ij}$ is a true label that takes a value of 1 if $i$ is in class $j$ and 0 otherwise. $p_{ij}$ is the estimated probability that $i$ belongs to class $j$. A further description can be found on the Kaggle web site.

2) Anomaly intrusion detection: In the second experimental evaluation, we used the NSL-KDD (http://nsl.cs.unb.ca/NSL-KDD/) anomaly intrusion detection instances. NSL-KDD includes selected records from the KDDCUP99 dataset, collected by monitoring incoming network traffic. NSL-KDD has been used by many researchers to develop network-based intrusion detection systems (NIDSs). The NSL-KDD problem instance consists of 125,973 training samples and 22,544 testing samples, each classified as either normal or anomalous (i.e., a network attack).

B. Parameter settings

The proposed framework has a few parameters that need to be determined in advance. To this end, we conducted a preliminary investigation to set the values of these parameters. We tested different values for each parameter while keeping the other parameters fixed. Table II shows the parameter settings investigated in our work as well as the final selected values.

TABLE II: The parameter settings of our framework

Parameter                        Investigated range   Final value
Maximum number of generations    5-150                100
Population size, PS              5-30                 20
Archive size                     5-20                 10
pn                               0.1-0.8              0.5
pna                              0.1-0.8              0.3
pa                               0.1-0.8              0.2

VI. Results and comparisons

In this section, we present the results of the experiments that we conducted to evaluate the proposed HH-SVM framework. We conducted two experimental tests. In the first test, HH-SVM was compared with each low-level heuristic individually. In the second test, the results of HH-SVM were compared with those of other algorithms proposed in the literature.

A. HH-SVM compared with individual low-level heuristics

This section compares the proposed HH-SVM with each low-level heuristic (LLH). Our aim is to assess the benefits of the proposed hyper-heuristic framework and the effects of using multiple LLHs on the search performance. To this end, we tested each LLH separately. The outcomes were the results of seven different algorithms, denoted by HH-SVM, LLH1, LLH2, LLH3, LLH4, LLH5 and LLH6. All algorithms were executed under identical conditions, and the same base components were utilised on both problem instances (BIG 2015 and NSL-KDD). The average results over 31 independent runs are compared in Table III. The BIG 2015 results are compared in terms of logloss (Equation (20)), for which lower values are better, whereas the NSL-KDD results are compared based on accuracy, for which higher values are better. In the table, the best results achieved among all algorithms are indicated in bold font.
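For reference, Equation (20) in code (a small sketch; probabilities are clipped to avoid taking the logarithm of zero, and the example labels and probabilities are made up):

```python
import numpy as np

def multiclass_logloss(y_true, p, eps=1e-15):
    """Equation (20): y_true holds class indices (0..M-1),
    p is an N x M matrix of predicted class probabilities."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    n = len(y_true)
    return -np.sum(np.log(p[np.arange(n), y_true])) / n

y = [0, 2, 1]
p = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8], [0.2, 0.6, 0.2]]
print(multiclass_logloss(y, p))  # ~0.364
```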


From the results, we can see that HH-SVM outperforms all other algorithms (LLH1 to LLH6) on both BIG 2015 and NSL-KDD. Table IV reports the numbers of support vectors (NSV) for HH-SVM and the compared algorithms on both instances, for which lower values are better. As seen from this table, the proposed HH-SVM framework produced lower NSV values for both BIG 2015 and NSL-KDD compared with LLH1 to LLH6. These positive results justify the use of the proposed hyper-heuristic framework and the use of the pool of heuristics (LLHs).

TABLE III: Comparison of the HH-SVM results against the results of all low-level heuristics (LLH1 to LLH6) individually

Algorithm / Instance   BIG 2015 (logloss)   NSL-KDD (accuracy)
HH-SVM                 0.0031               85.69
LLH1                   0.0332               77.24
LLH2                   0.0223               66.45
LLH3                   0.0214               80.01
LLH4                   0.0208               79.22
LLH5                   0.0227               80.37
LLH6                   0.0216               76.93

TABLE IV: The NSV values obtained by HH-SVM and the individual low-level heuristics (LLH1 to LLH6)

Algorithm / Instance   BIG 2015   NSL-KDD
HH-SVM                 20         8
LLH1                   33         12
LLH2                   34         17
LLH3                   34         20
LLH4                   42         16
LLH5                   41         22
LLH6                   38         21

To further verify these results, we conducted statistical tests using the Wilcoxon test with a significance level of 0.05. The p-values for the HH-SVM results versus those of LLH1 to LLH6 are reported in Table V. In this table, a p-value of less than 0.05 indicates that HH-SVM is statistically superior to the algorithm considered for comparison, whereas a value greater than 0.05 indicates that the performance of our proposed HH-SVM framework is not significantly superior. From the table, we can clearly see that all p-values are less than 0.05, indicating that HH-SVM is statistically superior to LLH1 to LLH6 across both BIG 2015 and NSL-KDD.

TABLE V: The p-values of HH-SVM compared with the individual low-level heuristics

HH-SVM vs.   BIG 2015 (p-value)   NSL-KDD (p-value)
LLH1         0.001                0.000
LLH2         0.000                0.010
LLH3         0.020                0.011
LLH4         0.000                0.000
LLH5         0.012                0.000
LLH6         0.022                0.000
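The significance test used above is readily available in SciPy; a sketch of the paired comparison, using made-up per-run scores for two algorithms (the per-run data behind Table V are not published in the text):

```python
from scipy.stats import wilcoxon

# Hypothetical paired per-run logloss scores for HH-SVM and LLH1 over 31 runs.
hh_svm = [0.0031, 0.0029, 0.0035, 0.0030, 0.0032, 0.0028, 0.0033, 0.0031,
          0.0030, 0.0034, 0.0029, 0.0031, 0.0032, 0.0030, 0.0033, 0.0029,
          0.0031, 0.0032, 0.0030, 0.0034, 0.0028, 0.0031, 0.0033, 0.0030,
          0.0032, 0.0029, 0.0031, 0.0030, 0.0033, 0.0031, 0.0032]
llh1   = [0.0332, 0.0311, 0.0345, 0.0328, 0.0339, 0.0315, 0.0341, 0.0330,
          0.0322, 0.0348, 0.0319, 0.0333, 0.0340, 0.0326, 0.0344, 0.0317,
          0.0331, 0.0338, 0.0324, 0.0349, 0.0313, 0.0334, 0.0342, 0.0327,
          0.0337, 0.0320, 0.0335, 0.0329, 0.0343, 0.0332, 0.0336]

stat, p_value = wilcoxon(hh_svm, llh1)
print(p_value < 0.05)  # True -> the difference is statistically significant
```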

B. HH-SVM compared with other algorithms

In this section, the results of HH-SVM are compared with those reported in the literature. For BIG 2015, we consider the following algorithms in the comparison:
• XGBoost (AE) [55]
• Random Forest (RF) [23]
• Optimised XGBoost (OXB) [1]

For the NSL-KDD instance, the accuracy results obtained by HH-SVM are compared against those of the following algorithms:
• Gaussian Naive Bayes Tree (GNBT) [48]
• Fuzzy Classifier (FC) [27]
• Decision Tree (DT) [33]

The results of HH-SVM and the other algorithms for the BIG 2015 and NSL-KDD problem instances are summarised in Table VI and Table VII, respectively. Following the literature, the results for BIG 2015 in Table VI are given in the form of the logloss values achieved by the various algorithms, whereas in Table VII, all algorithms are compared in terms of the accuracy measure. In the logloss comparisons, a lower value indicates better performance, whereas in the accuracy comparisons, a higher value indicates better performance. The best result obtained among the compared algorithms is indicated in bold in both tables. As shown in Table VI, HH-SVM has a lower logloss value than those of AE, RF and OXB for the BIG 2015 instance, whereas in Table VII, the accuracy value of HH-SVM is higher than those of GNBT, FC and DT for the NSL-KDD instance. These results demonstrate that HH-SVM is an effective methodology for addressing cyber security problems. The good performance of HH-SVM can be attributed to its ability to design and optimise different SVMs for different problem instances and for different stages of the solution process.

TABLE VI: Comparison of the logloss results of HH-SVM and other algorithms

Algorithm   BIG 2015 (logloss)
HH-SVM      0.0031
AE          0.0748
RF          0.0988557
OXB         0.0063

TABLE VII: Comparison of the accuracy results of HH-SVM and other algorithms

Algorithm   NSL-KDD (accuracy)
HH-SVM      85.69
GNBT        82.02
FC          82.74
DT          80.14

VII. Conclusion

In this work, we proposed a hyper-heuristic SVM optimisation framework for big data cyber security problems. We formulated the SVM configuration process as a bi-objective optimisation problem in which accuracy and model complexity are treated as two conflicting objectives, and solved it using the proposed hyper-heuristic framework. The framework integrates the strengths of decomposition- and Pareto-based approaches to approximate the Pareto set of configurations. Our framework has been tested on two benchmark cyber security problem instances: Microsoft malware big data classification and anomaly intrusion detection. The experimental results demonstrate the effectiveness and potential of the proposed framework in achieving competitive, if not superior, results compared with other algorithms.

References

[1] Mansour Ahmadi, Dmitry Ulyanov, Stanislav Semenov, Mikhail Trofimov, and Giorgio Giacinto. Novel feature extraction, selection and fusion for effective malware family classification. In Proceedings of the Sixth ACM Conference on Data and Application Security and Privacy, pages 183-194. ACM, 2016.
[2] Alfred V. Aho and Margaret J. Corasick. Efficient string matching: an aid to bibliographic search. Communications of the ACM, 18(6):333-340, 1975.
[3] Shawkat Ali and Kate A. Smith-Miles. A meta-learning approach to automatic kernel selection for support vector machines. Neurocomputing, 70(1):173-186, 2006.
[4] Nedjem-Eddine Ayat, Mohamed Cheriet, and Ching Y. Suen. Automatic model selection for the optimization of support vector machine kernels. Pattern Recognition, 38(10):1733-1745, 2005.
[5] Yukun Bao, Zhongyi Hu, and Tao Xiong. A particle swarm optimization and pattern search based memetic algorithm for SVMs parameters optimization. Neurocomputing, 117:98-106, 2013.
[6] Rodrigo C. Barros, Márcio P. Basgalupp, André C. P. L. F. de Carvalho, and Alex A. Freitas. A hyper-heuristic evolutionary algorithm for automatically designing decision-tree algorithms. In Proceedings of the 14th Annual Conference on Genetic and Evolutionary Computation, pages 1237-1244. ACM, 2012.
[7] Márcio P. Basgalupp, Rodrigo C. Barros, Tiago S. da Silva, and André C. P. L. F. de Carvalho. Software effort prediction: a hyper-heuristic decision-tree based approach. In Proceedings of the 28th Annual ACM Symposium on Applied Computing, pages 1109-1116. ACM, 2013.
[8] Márcio P. Basgalupp, Rodrigo C. Barros, and Vili Podgorelec. Evolving decision-tree induction algorithms with a multi-objective hyper-heuristic. In Proceedings of the 30th Annual ACM Symposium on Applied Computing, pages 110-117. ACM, 2015.
[9] Asa Ben-Hur and Jason Weston. A user's guide to support vector machines. Data Mining Techniques for the Life Sciences, pages 223-239, 2010.
[10] David Brumley, Cody Hartwig, Zhenkai Liang, James Newsome, Dawn Song, and Heng Yin. Automatically identifying trigger-based behavior in malware. Botnet Detection, pages 65-88, 2008.
[11] Edmund K. Burke, Matthew Hyde, Graham Kendall, Gabriela Ochoa, Ender Özcan, and John R. Woodward. A classification of hyper-heuristic approaches. In Handbook of Metaheuristics, pages 449-468. Springer, 2010.
[12] Athanassia Chalimourda, Bernhard Schölkopf, and Alex J. Smola. Experimentally optimal ν in support vector regression for different noise models and parameter settings. Neural Networks, 17(1):127-141, 2004.
[13] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.
[14] Min Chen, Shiwen Mao, and Yunhao Liu. Big data: A survey. Mobile Networks and Applications, 19(2):171-209, 2014.
[15] Nello Cristianini and John Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, 2000.
[16] Mohsen Damshenas, Ali Dehghantanha, and Ramlan Mahmoud. A survey on malware propagation, analysis, and detection. International Journal of Cyber-Security and Digital Forensics (IJCSDF), 2(4):10-29, 2013.
[17] Kalyanmoy Deb. Multi-Objective Optimization Using Evolutionary Algorithms, volume 16. John Wiley & Sons, 2001.
[18] Kalyanmoy Deb, Amrit Pratap, Sameer Agarwal, and T. A. M. T. Meyarivan. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation, 6(2):182-197, 2002.
[19] Manuel Egele, Theodoor Scholte, Engin Kirda, and Christopher Kruegel. A survey on automated dynamic malware-analysis techniques and tools. ACM Computing Surveys (CSUR), 44(2):6, 2012.
[20] Agoston E. Eiben, James E. Smith, et al. Introduction to Evolutionary Computing, volume 53. Springer, 2003.
[21] Eric Filiol. Malware pattern scanning schemes secure against black-box analysis. Journal in Computer Virology, 2(1):35-50, 2006.
[22] Eric Filiol, Grégoire Jacob, and Mickaël Le Liard. Evaluation methodology and theoretical model for antiviral behavioural detection strategies. Journal in Computer Virology, 3(1):23-37, 2007.
[23] Luba Gloukhov, Cody Wild, and David Reilly. Malware classification: Distributed data mining with Spark. In Association for the Advancement of Artificial Intelligence, pages 1-6. www.aaai.org, 2015.
[24] Taciana A. F. Gomes, Ricardo B. C. Prudêncio, Carlos Soares, André L. D. Rossi, and André Carvalho. Combining meta-learning and search techniques to select parameters for support vector machines. Neurocomputing, 75(1):3-13, 2012.
[25] Kieran Greer. A stochastic hyperheuristic for unsupervised matching of partial information. Advances in Artificial Intelligence, 2012:13, 2012.
[26] Frank Hutter, Holger H. Hoos, Kevin Leyton-Brown, and Thomas Stützle. ParamILS: an automatic algorithm configuration framework. Journal of Artificial Intelligence Research, 36(1):267-306, 2009.
[27] Pavel Krömer, Jan Platoš, Václav Snášel, and Ajith Abraham. Fuzzy classification by evolutionary algorithms. In Systems, Man, and Cybernetics (SMC), 2011 IEEE International Conference on, pages 313-318. IEEE, 2011.
[28] Ana Carolina Lorena and André C. P. L. F. de Carvalho. Evolutionary tuning of support vector machine parameter values in multiclass problems. Neurocomputing, 71(16):3326-3334, 2008.
[29] Mohammad M. Masud, Tahseen M. Al-Khateeb, Kevin W. Hamlen, Jing Gao, Latifur Khan, Jiawei Han, and Bhavani Thuraisingham. Cloud-based malware detection for evolving data streams. ACM Transactions on Management Information Systems (TMIS), 2(3):16, 2011.
[30] Péricles B. C. Miranda, Ricardo B. C. Prudêncio, André C. P. L. F. Carvalho, and Carlos Soares. Combining meta-learning with multi-objective particle swarm algorithms for support vector machine parameter selection: An experimental analysis. In Neural Networks (SBRN), 2012 Brazilian Symposium on, pages 1-6. IEEE, 2012.

[31] Péricles B. C. Miranda, Ricardo B. C. Prudêncio, André Carlos P. L. F. de Carvalho, and Carlos Soares. Multi-objective optimization and meta-learning for support vector machine parameter selection. In Neural Networks (IJCNN), The 2012 International Joint Conference on, pages 1-8. IEEE, 2012.
[32] Péricles B. C. Miranda, Ricardo B. C. Prudêncio, André P. L. F. De Carvalho, and Carlos Soares. A hybrid meta-learning architecture for multi-objective optimization of support vector machine parameters. Neurocomputing, 143:27-43, 2014.
[33] Mehdi Mohammadi, Bijan Raahemi, Ahmad Akbari, and Babak Nassersharif. New class-dependent feature transformation for intrusion detection systems. Security and Communication Networks, 5(12):1296-1311, 2012.
[34] José Carlos Ortiz-Bayliss, Hugo Terashima-Marín, and Santiago Enrique Conant-Pablos. Learning vector quantization for variable ordering in constraint satisfaction problems. Pattern Recognition Letters, 34(4):423-432, 2013.
[35] Chhabi Rani Panigrahi, Mayank Tiwari, Bibudhendu Pati, and Rajendra Prasath. Malware detection in big data using fast pattern matching: A Hadoop based comparison on GPU. In Mining Intelligence and Knowledge Exploration, pages 407-416. Springer, 2014.
[36] Matthias Reif, Faisal Shafait, and Andreas Dengel. Meta-learning for evolutionary parameter optimization of classifiers. Machine Learning, 87(3):357-380, 2012.
[37] Alejandro Rosales-Pérez, Jesus A. Gonzalez, Carlos A. Coello Coello, Hugo Jair Escalante, and Carlos A. Reyes-Garcia. Surrogate-assisted multi-objective model selection for support vector machines. Neurocomputing, 150:163-172, 2015.
[38] Nasser R. Sabar, Jemal Abawajy, and John Yearwood. Heterogeneous cooperative co-evolution memetic differential evolution algorithm for big data optimization problems. IEEE Transactions on Evolutionary Computation, 21(2):315-327, 2017.
[39] Nasser R. Sabar, Masri Ayob, Graham Kendall, and Rong Qu. A dynamic multiarmed bandit-gene expression programming hyper-heuristic for combinatorial optimization problems. IEEE Transactions on Cybernetics, 45(2):217-228, 2015.
[40] Bernhard Schölkopf and Alexander J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2002.
[41] Sagar Shaw, Manish Kumar Gupta, and Sanjay Chakraborty. Cloud based malware detection. In Proceedings of the 5th International Conference on Frontiers in Intelligent Computing: Theory and Applications: FICTA 2016, volume 1, page 485. Springer, 2017.
[42] Kevin Sim, Emma Hart, and Ben Paechter. A hyper-heuristic classifier for one dimensional bin packing problems: Improving classification accuracy by attribute evolution. Parallel Problem Solving from Nature - PPSN XII, pages 348-357, 2012.
[43] Carlos Soares, Pavel B. Brazdil, and Petr Kuba. A meta-learning method to select the kernel width in support vector regression. Machine Learning, 54(3):195-209, 2004.
[44] Dawn Song, David Brumley, Heng Yin, Juan Caballero, Ivan Jager, Min Kang, Zhenkai Liang, James Newsome, Pongsin Poosankam, and Prateek Saxena. BitBlaze: A new approach to computer security via binary analysis. Information Systems Security, pages 1-25, 2008.
[45] Shan Suthaharan. Big data classification: Problems and challenges in network intrusion prediction with machine learning. ACM SIGMETRICS Performance Evaluation Review, 41(4):70-73, 2014.
[46] Thorsten Suttorp and Christian Igel. Multi-objective optimization of support vector machines. Multi-Objective Machine Learning, pages 199-220, 2006.
[47] Colin Tankard. Big data security. Network Security, 2012(7):5-8, 2012.
[48] Mahbod Tavallaee, Ebrahim Bagheri, Wei Lu, and Ali A. Ghorbani. A detailed analysis of the KDD CUP 99 data set. In Computational Intelligence for Security and Defense Applications, 2009. CISDA 2009. IEEE Symposium on, pages 1-6. IEEE, 2009.
[49] Chun-Wei Tsai, Chin-Feng Lai, Han-Chieh Chao, and Athanasios V. Vasilakos. Big data analytics: a survey. Journal of Big Data, 2(1):21, 2015.
[50] Vladimir Vapnik. The Nature of Statistical Learning Theory. Springer Science & Business Media, 2013.
[51] Alan Vella, David Corne, and Chris Murphy. Hyper-heuristic decision tree induction. In Nature & Biologically Inspired Computing, 2009. NaBIC 2009. World Congress on, pages 409-414. IEEE, 2009.
[52] Ricardo Vilalta and Youssef Drissi. A perspective view and survey of meta-learning. Artificial Intelligence Review, 18(2):77-95, 2002.
[53] Yanfang Ye, Tao Li, Donald Adjeroh, and S. Sitharama Iyengar. A survey on malware detection using data mining techniques. ACM Computing Surveys (CSUR), 50(3):41, 2017.
[54] Yanfang Ye, Tao Li, Shenghuo Zhu, Weiwei Zhuang, Egemen Tas, Umesh Gupta, and Melih Abdulhayoglu. Combining file content and file relations for cloud based malware detection. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 222-230. ACM, 2011.
[55] Mahmood Yousefi-Azar, Vijay Varadharajan, Len Hamey, and Uday Tupakula. Autoencoder-based feature learning for cyber security applications. In Neural Networks (IJCNN), 2017 International Joint Conference on, pages 3854-3861. IEEE, 2017.
[56] Jun Zhang, Zhi-hui Zhan, Ying Lin, Ni Chen, Yue-jiao Gong, Jing-hui Zhong, Henry S. H. Chung, Yun Li, and Yu-hui Shi. Evolutionary computation meets machine learning: A survey. IEEE Computational Intelligence Magazine, 6(4):68-75, 2011.
[57] Qingfu Zhang and Hui Li. MOEA/D: A multiobjective evolutionary algorithm based on decomposition. IEEE Transactions on Evolutionary Computation, 11(6):712-731, 2007.