
A methodology for training set instance selection using mutual information in time series prediction


Miloš B. Stojanović a,*,1,2,3, Miloš M. Božić b, Milena M. Stanković b, Zoran P. Stajić b
a College of Applied Technical Sciences, Aleksandra Medvedev 20, 18000 Niš, Serbia
b Faculty of Electronic Engineering, University of Niš, Aleksandra Medvedeva 14, 18000 Niš, Serbia
Article info
Article history:
Received 21 September 2012
Received in revised form 25 November 2013
Accepted 19 March 2014
Communicated by: P. Zhang
Available online 8 April 2014
Keywords:
Instance selection
Mutual information
Time-series prediction
Abstract
Training set instance selection is an important preprocessing step in many machine learning problems,
including time series prediction, and has to be considered in practice in order to increase the quality of
the predictions and possibly reduce training time. Recently, the use of mutual information (MI) has
been proposed in regression tasks, mostly for feature selection and for identifying the real data in data
sets that contain noise and outliers. This paper proposes a new methodology for training set instance
selection for long-term time series prediction. The proposed methodology combines a recursive
prediction strategy with an advanced instance selection criterion, the nearest neighbor based MI estimator.
An application of the concept of MI is presented for the selection of training instances based on MI
computation between the initial training set instances and the current forecasting instance, for every
prediction step. The novelty of the approach lies in the fact that it fits the initial training subset to the
current forecasting instance, and consequently reduces the uncertainty of the prediction. In this way, by
selecting instances which share a large amount of MI with the current forecasting instance in every
prediction step, error propagation and accumulation can be reduced. Both are well-known
shortcomings of the recursive prediction strategy, so reducing them leads to better forecasting quality. Another
element which sets this approach apart from others is that it is not proposed as an outlier detector, but
for the instance selection of data which do not necessarily have to contain noise and outliers. The results
obtained on the data sets from the NN5 time series prediction competition indicate that the proposed
method increases the quality of long-term time series prediction and reduces the number of
instances needed for building the model.
© 2014 Elsevier B.V. All rights reserved.

Neurocomputing 141 (2014) 236-245
http://dx.doi.org/10.1016/j.neucom.2014.03.006
ISSN 0925-2312
Corresponding author: M.B. Stojanović. Tel.: +381 018 4233099 (Home), +381 065 8136718 (Mobile).
E-mail addresses: milosstojanovic10380@yahoo.com, milos.stojanovic@vtsnis.edu.rs (M.B. Stojanović).
URL: http://www.vtsnis.edu.rs/index_english.html (M.B. Stojanović).
1 Home address: Bulevar doktora Zorana Đinđića 29/5, 18000 Niš, Serbia.
2 Job address: Aleksandra Medvedev 20, 18000 Niš, Serbia.
3 Institution (job): College of Applied Technical Sciences, Niš, Serbia.

1. Introduction
Time series forecasting is nowadays a key topic in various
fields. In financial economics, stock exchange courses are predicted [1];
in computer science, the flow of data through networks
and the frequency of access to websites [2]; in power systems,
distribution companies forecast the load of the following day [3],
etc. Conventional methods for time series forecasting, developed
during the 1970s and 80s, among which the most popular were
linear regression [4], Box-Jenkins ARIMA [5] and exponential
smoothing [6], cannot always provide an accurate and unbiased
estimation of time series in cases when the underlying system
processes are typically non-linear, non-stationary and not defined
a priori. In addition, the choice of the prediction
method and the determination of its parameters very often depend on
knowing the properties of the underlying process. In order to
address these problems, machine learning models, among which
the most popular are artificial neural networks (ANNs) [7] and
support vector machines (SVMs) [8], have established themselves
in the last two decades as serious contenders to classic statistical
models in the area of forecasting. Basically, time series prediction
in machine learning comes down to training a model which
establishes a mapping between training set instances and their
target values. Then, the trained model estimates future values
based on current and past data samples. The determination of
sufficient and necessary data is essential for training a good
forecasting model. If the amount of data in the training set is
insufficient, the forecasting of the model will be poor, and the
model may be prone to underfitting. On the other hand, if
the training set is too large, the information that it provides to
the model could be unnecessary or redundant. As a result, the
model could have poor generalization performance and may be
prone to overfitting. Based on the size of the prediction horizon of
a problem, predictions can be classified into two categories:
short-term and long-term. Take for example electric load
forecasting. If hourly predictions for one day ahead are needed,
it is a short-term forecasting problem requiring values for the next
24 steps. On the other hand, if yearly predictions for the next 10 or
20 years are needed, it is a long-term forecasting problem,
requiring predictions for the next 10 or 20 steps. However, both of
them are classified as long-term time series prediction when the
classification is based on the number of steps in the prediction
horizon. When a one-step-ahead prediction is needed, it is referred
to as short-term prediction; when multi-step-ahead predictions
are needed, it is called long-term time series prediction.
Long-term predictions, where multiple steps ahead have to be
predicted, are especially challenging. The problem that occurs in long-term
prediction is that uncertainty increases with an increase in the
number of steps in the prediction horizon. This uncertainty depends on several
factors, such as the accumulation of errors and the lack of information.
Moreover, with the increase in the number of steps that need to be
predicted, a model selection problem emerges, because the environment
in which the model was developed may change over time
[9]. Another problem that affects the quality of long-term predictions
occurs when the time series consist of daily or shorter
time intervals, i.e. if they contain high frequency data [10]. High
frequency data represents a specific type of forecasting problem,
rendering conventional methods inappropriate and demanding
new approaches [11].
The selection of an appropriate subset of instances that are
included in the initial training set is a very important preprocessing
step, especially in long-term time series prediction tasks. It
may provide improvements in terms of the quality of the output
results and in the reduction of computational time. This problem
is considered in the literature from several different perspectives.
Instance selection can be approached from the aspect of removing
outliers and noise from distorted data sets, as presented in
[12-14]. Instance selection of data that contain outliers aims to
remove elements from the training set that in some way differ
from most other elements in the input set. From another perspective,
this problem can be considered as 'data shifting', where the joint
distribution of inputs and outputs differs between the training and
test stage, as presented in [15-17]. It usually appears in non-stationary
environments, when the training environment is different
from the test one, whether due to a temporal or a spatial
change. There are also various methods based on active learning
that deal with the selection of relevant instances, some of which
can be found in [18-20]. The key idea behind active learning is that
a machine learning algorithm can achieve greater accuracy with
fewer training instances if it is allowed to choose the data from
which it learns. Active learning is also closely related to covariate
shift, where the training input distribution is different from the
test input distribution [21]. Finally, the instance selection of data
which do not necessarily have to contain noise and outliers
determines a subset of the initial training set. This subset can be used to
train a more accurate model, with a possible reduction in computational
time.
In order to perform the selection of instances that the learning
algorithm will use, three main approaches have been used:
incremental, decremental and batch, as presented in [22-24].
In the incremental approach the selection algorithm starts
from an empty set of instances and adds them iteratively, while in the
decremental approach the selection algorithm starts from a full set
of available instances and removes those which do not meet the
predefined selection criterion. The batch method performs several
iterations through the initial training set before removing one
instance. In each iteration it marks instances that are candidates
to be removed in the next iteration. Recently,
evolutionary algorithms, boosting techniques and pruning techniques
have been used to tackle this problem [25-27]. According to
the selection strategy, instance selection can be tackled with
filter and wrapper methods, as presented in [28,29]. In the filter
methods, the selection criterion uses a selection function which is
independent of the training algorithm used to form the regression
model. On the other hand, in the wrapper methods the
selection criterion is based on the evaluation measure obtained by
the regression model. In other words, it is embedded into the
evaluation function of the model. In this way, instances that do
not contribute to the prediction quality are discarded from the
training set.
Most research on instance selection done so
far refers to classification problems [30], while only a few papers
deal with instance selection in regression tasks, especially in the
case of long-term time series prediction. For example, [31] shows a
method of k-surrounding neighbors for the selection of input
vectors, while the output is calculated with the k-nearest neighbors
(kNN) algorithm. In [13] a genetic algorithm is presented to
perform feature and instance selection for linear regression
models. In [32,33] a new distance function, which integrates the
Euclidean distance and the dissimilarity of the trend of a time
series, is defined as a similarity measure between two instances
for long-term time series prediction. By selecting similar instances
in the training set for each testing instance based on the modified
kNN approach, prediction quality can be increased. Only recently a
mutual information (MI) estimator based on nearest neighbors,
which allows MI estimation directly from the data set, has been
introduced for instance selection in time series prediction. Its aim
is to remove outliers and noise from highly distorted data sets
[14,34]. The applied algorithm determines the loss of MI with
respect to its neighbors in such a way that if the loss of MI is similar
to that of the instances near the studied one, then this instance must be
included in the training data set. This approach has proved successful
when applied to training sets
which are artificially distorted by adding noise or outliers. In [35],
the concept of MI is applied for improving short-term load
forecasting through the selection of instances with load
patterns similar to the current forecasting instance.
The research presented in this paper is motivated by the work
presented in [14,34] and represents an extension of the approach
proposed in [35]. It is framed within the instance selection of data
which do not necessarily have to contain noise and outliers.
It proposes a methodology for training subset selection in long-term
recursive time series prediction, using MI to decide
which instances should or should not be included in the training
data set. The methodology is based on a decremental approach and
a filter method which uses MI as the selection criterion. In
this way, by selecting instances which share a large amount of
MI with the current forecasting instance in every prediction step,
error propagation and accumulation can be reduced. Since they are
well-known shortcomings of the recursive prediction strategy, this
can lead to better forecasting quality. In this paper, least squares
support vector machines (LS-SVMs) are used as nonlinear models
to present the application of the proposed methodology.
The rest of the paper is organized as follows: Section 2 presents
the formulation of MI and describes the method used to compute
it, followed by Section 3 which introduces the methodology to
select training set instances. Section 4 briefly reviews the basics of
LS-SVMs. Section 5 includes a variety of experiments to verify the
proposed approach, and finally, Section 6 draws the conclusions.
2. Mutual information
In this section, the basics of mutual information for continuous
variables are briefly explained, followed by the k-nearest neighbors
approach used to estimate it.
Mutual information is commonly used for measuring dependences
between random variables in a way that does not make any
assumptions about the nature of their underlying relationships.
Therefore, MI is in some cases more powerful than estimators
that only consider the linear relationships
between the variables, such as the correlation
coefficient [36]. Moreover, MI can naturally be defined between
groups of variables, and thus can be applied for feature selection
[37-40] and instance selection [14,34], independently of the
final prediction model. The MI of two random variables X and Y
quantifies the information that X and Y share. More formally, MI
measures how much knowledge of one variable reduces uncertainty
about the other. The definition of MI is derived from entropy
in information theory. Let us denote X and Y as continuous random
variables with a joint probability density function $\mu_{X,Y}$ and marginal
density functions $\mu_X(x)$ and $\mu_Y(y)$.
The MI between two random variables X and Y can be
computed as:
$$I(X,Y) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \mu_{X,Y}(x,y) \log \frac{\mu_{X,Y}(x,y)}{\mu_X(x)\,\mu_Y(y)} \, dx \, dy \tag{1}$$
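As a quick sanity check of Eq. (1), consider a standard textbook case that is not taken from this paper: X and Y jointly Gaussian with correlation coefficient $\rho$. For this assumed distribution the double integral has a well-known closed form:
$$I(X,Y) = -\tfrac{1}{2} \ln\left(1 - \rho^2\right)$$
so that $\rho = 0$ gives $I(X,Y) = 0$ (independence), while $\rho = 0.9$ gives $I(X,Y) = -\tfrac{1}{2}\ln(0.19) \approx 0.83$ nats, and the value grows without bound as $|\rho| \to 1$. Closed forms of this kind are useful mainly for validating a numerical MI estimator on synthetic data.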
The estimation of the joint probability density function (PDF)
for the pair (X, Y) is needed for the computation of MI. The most
commonly used methods for PDF estimation are histograms and
kernel estimators presented in [41,42]. However, their usage is
commonly limited to functions of one or two variables because the
number of samples needed for PDF estimation increases exponen-
tially with the number of variables. As a result, the estimator used
in this paper is the k-nearest neighbor (kNN) based MI estimator,
proposed in [43]. The novelty of this estimator lies in its ability to
estimate the MI between two multi-dimensional variables directly
from the data set. This avoids direct PDF estimation which is the
most problematic issue in MI estimation.
Let us consider the set of N input-output pairs $z_i = (x_i, y_i)$,
$i = 1, \ldots, N$, which are independent and identically distributed
realizations of a random variable $Z = (X, Y)$, where x and y can be
either scalars or vectors. For any pair of points $z$ and $z'$, the
maximum norm is used for the comparison of input-output pairs,
defined by:
$$\|z - z'\| = \max\{\|x - x'\|, \|y - y'\|\} \tag{2}$$
while any norms can be used in the X and Y spaces.
The basic idea is to estimate I(X,Y) from the distances in spaces
X, Y and Z from $z_i$ to its k nearest neighbors, averaged over all $z_i$. Let
us denote by $z_{k(i)} = (x_{k(i)}, y_{k(i)})$ the kth nearest neighbor of $z_i$. It should
be noted that $x_{k(i)}$ and $y_{k(i)}$ are the input and output parts of $z_{k(i)}$
respectively, and thus not necessarily the kth nearest neighbors
of $x_i$ and $y_i$. Let us define $d_i^X = \|x_i - x_{k(i)}\|$, $d_i^Y = \|y_i - y_{k(i)}\|$ and
$d_i^Z = \|z_i - z_{k(i)}\|$. Thus, $d_i^Z = \max(d_i^X, d_i^Y)$. Subsequently, the number
$n_i^X$ of points $x_j$ whose distance from $x_i$ is strictly less than $d_i^Z$ is
counted, and similarly the number $n_i^Y$ of points $y_j$ whose distance
from $y_i$ is strictly less than $d_i^Z$ is counted. Then, I(X,Y) can be
estimated by:
$$\hat{I}(X,Y) = \psi(k) - \frac{1}{N} \sum_{i=1}^{N} \left[ \psi(n_i^X + 1) + \psi(n_i^Y + 1) \right] + \psi(N) \tag{3}$$
where $\psi$ is the digamma function defined as:
$$\psi(t) = \frac{\Gamma'(t)}{\Gamma(t)} = \frac{d}{dt} \ln \Gamma(t) \tag{4}$$
and $\Gamma(t)$ is the gamma function defined as:
$$\Gamma(t) = \int_{0}^{\infty} u^{t-1} e^{-u} \, du \tag{5}$$
Function $\psi$ satisfies the recursion $\psi(x+1) = \psi(x) + 1/x$
and $\psi(1) = -C$, where $C = 0.5772156\ldots$ is the Euler-Mascheroni
constant. This paper implements the type of estimator presented
in (3), which is one of the two proposed in [43]. The other one can
be found in [44]. Both types of MI estimators depend on the value
chosen for k, which controls the bias-variance tradeoff. Higher
values of k imply a lower statistical error. On the other hand,
systematic errors increase with an increase in k. Thus, to maintain
a balance between these two errors, as recommended in [45], a
mid-range value of k = 6 will be used. Additionally, an approach
for the selection of parameter k that relies on resampling
methods can be found in [46].
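The estimator in (3) can be sketched in a few lines of Python. This is a minimal brute-force illustration under stated assumptions, not the authors' implementation: the function names (`digamma_int`, `knn_mutual_information`) are hypothetical, the maximum norm is also used inside the X and Y spaces, and the digamma function is evaluated only at positive integers, which is all that (3) requires, using the recursion $\psi(x+1) = \psi(x) + 1/x$ with $\psi(1) = -C$.

```python
EULER_C = 0.5772156649015329  # Euler-Mascheroni constant C


def digamma_int(n):
    """psi(n) for a positive integer n, from psi(1) = -C and psi(x+1) = psi(x) + 1/x."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return -EULER_C + sum(1.0 / j for j in range(1, n))


def _as_vec(v):
    # Accept scalars as one-dimensional vectors.
    return list(v) if isinstance(v, (list, tuple)) else [v]


def _dist(a, b):
    # Assumption: maximum norm inside the X and Y spaces (any norm is allowed there).
    return max(abs(p - q) for p, q in zip(_as_vec(a), _as_vec(b)))


def knn_mutual_information(xs, ys, k=6):
    """Brute-force O(N^2) evaluation of the kNN-based MI estimate of Eq. (3)."""
    n_pts = len(xs)
    if n_pts != len(ys) or not 1 <= k < n_pts:
        raise ValueError("need len(xs) == len(ys) and 1 <= k < N")
    total = 0.0
    for i in range(n_pts):
        dx = [_dist(xs[i], xs[j]) for j in range(n_pts)]
        dy = [_dist(ys[i], ys[j]) for j in range(n_pts)]
        # Joint-space distances use the maximum norm of Eq. (2).
        dz = sorted(max(dx[j], dy[j]) for j in range(n_pts) if j != i)
        d_i = dz[k - 1]  # distance from z_i to its k-th nearest neighbour
        n_x = sum(1 for j in range(n_pts) if j != i and dx[j] < d_i)
        n_y = sum(1 for j in range(n_pts) if j != i and dy[j] < d_i)
        total += digamma_int(n_x + 1) + digamma_int(n_y + 1)
    return digamma_int(k) - total / n_pts + digamma_int(n_pts)
```

The quadratic neighbour search keeps the sketch short; for large N a spatial index would be the usual replacement. Ties in the distances, which occur with integer-valued data, are handled here simply by the strict inequality.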
3. The methodology for instance selection
In this section the methodology for training set instance
selection is presented in combination with a recursive time-
series prediction strategy.
3.1. Recursive time-series prediction strategy
Let us assume that on the basis of the time-series defined in
(6):
$$y = \{y_1, y_2, \ldots, y_{n-1}, y_n, y_{n+1}, \ldots, y_i, \ldots, y_N\}, \quad i = 1, \ldots, N \tag{6}$$
where N represents the overall number of its elements and n is the
number of input lags, it is necessary to form an initial training set
which is needed to select the subsets of the training instances in
the recursive prediction strategy. Let $S = \{X, Y\}$ denote the initial
training set formed on the basis of the time-series y, defined as:
$$X = \begin{bmatrix} y_n & y_{n-1} & \cdots & y_1 \\ y_{n+1} & y_n & \cdots & y_2 \\ \vdots & \vdots & \ddots & \vdots \\ y_{N-1} & y_{N-2} & \cdots & y_{N-n} \end{bmatrix}_{(N-n) \times n} \tag{7}$$
$$Y = \begin{bmatrix} y_{n+1} \\ y_{n+2} \\ \vdots \\ y_N \end{bmatrix}_{(N-n) \times 1} \tag{8}$$
In (7) each row of matrix X represents one training vector
$x_k \in \mathbb{R}^n$ that consists of lagged input variables, while in (8) each
element of vector Y represents a target value $y_k \in \mathbb{R}$, $k = 1, \ldots, (N-n)$,
assigned to the corresponding vector in matrix X. In the case of the
prediction of time-series, training vectors $x_k \in X$ are defined on the
basis of the values of the time-series from the previous steps,
while the outputs $y_k \in Y$ represent the target values of the time
series. $M = (N-n)$ denotes the size of the initial training set, that is,
the overall number of input-output pairs from X and Y. At the
same time, n denotes the number of features, that is, the size of the
regressor. Unlike the features in the case of regression, in
time-series prediction the features of the model are defined through the
values of the time-series from the previous steps. The methodology
can also be used when the input vectors contain
additional features which are not derived from the time-series,
or when the input vectors are sparse, that is, not formed from
all lagged variables. With the aim of simplifying the notation, it
is assumed here that the input vectors consist only of the variables
from the time-series with continuous lags.
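The construction of the initial training set in (7) and (8) can be illustrated with a short Python sketch. The function name `build_training_set` is hypothetical, and the series is assumed to be a plain list whose element 0 corresponds to $y_1$.

```python
def build_training_set(series, n):
    """Build X and Y of Eqs. (7) and (8) from a list series = [y_1, ..., y_N].

    Each row of X holds n lagged values, most recent first, and the matching
    entry of Y is the value that immediately follows them in the series.
    """
    N = len(series)
    if not 0 < n < N:
        raise ValueError("regressor size n must satisfy 0 < n < N")
    X = [list(reversed(series[k:k + n])) for k in range(N - n)]
    Y = [series[k + n] for k in range(N - n)]
    return X, Y


# Example: N = 6 observations and n = 3 lags give M = N - n = 3 instances.
X, Y = build_training_set([1, 2, 3, 4, 5, 6], 3)
```

With this toy series the first row of X is `[3, 2, 1]` with target `4`, matching the first rows of (7) and (8).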
Let us suppose that predictions for the H steps of the time-series
y are needed, that is, $y_{N+1}, y_{N+2}, \ldots, y_{N+H}$. H represents the
size of the horizon of the prediction problem, or the number of
time steps needed to be predicted by a model f. It is important to
clarify that H defines the prediction horizon of the problem, and
not of the model f, which is one in every prediction step. In order
to achieve that aim, the recursive prediction strategy represents
the most intuitive and most commonly used method [47]. Its basic
characteristic is that it relies on predictions from previous steps
with the aim of predicting future steps, instead of the real values
which at that point in time are not available. Accordingly, a single
prediction model is trained, which first performs the prediction for
one step ahead:
$$\hat{y}_{N+1} = f(y_N, y_{N-1}, \ldots, y_{N-n+1}) \tag{9}$$
Then to predict the following step, the same model is used:
$$\hat{y}_{N+2} = f(\hat{y}_{N+1}, y_N, \ldots, y_{N-n+2}) \tag{10}$$
The prediction of step H follows based on:
$$\hat{y}_{N+H} = f(\hat{y}_{N+H-1}, \hat{y}_{N+H-2}, \ldots, \hat{y}_{N+H-n}) \tag{11}$$
The main shortcoming of the recursive prediction strategy lies
in the propagation and accumulation of errors through the steps in
the prediction process. This especially becomes evident with the
increase in the prediction horizon H. If the size of the regressor n
is greater than H, then there are n - H known data in the regressor
for predicting step H. On the other hand, if H exceeds n, all of the
data in the regressor will consist of the predicted values, which
has a negative effect on the quality of the following predictions.
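The recursion in (9)-(11) can be sketched as follows. The helper name `recursive_forecast` and the toy linear-extrapolation model are illustrative assumptions; in the paper, f is a trained regression model.

```python
def recursive_forecast(series, n, horizon, model):
    """Recursive multi-step forecasting in the spirit of Eqs. (9)-(11).

    `model` is any callable that maps a regressor
    [most recent value, ..., oldest value] of length n to a one-step prediction.
    """
    regressor = list(reversed(series[-n:]))  # [y_N, y_{N-1}, ..., y_{N-n+1}]
    predictions = []
    for _ in range(horizon):
        y_hat = model(regressor)
        predictions.append(y_hat)
        # Shift the regressor: the newest prediction enters, the oldest value leaves.
        regressor = [y_hat] + regressor[:-1]
    return predictions


def linear_extrapolation(r):
    # Toy stand-in for f: continue the last observed difference.
    return r[0] + (r[0] - r[1])


preds = recursive_forecast([1, 2, 3, 4], n=2, horizon=3, model=linear_extrapolation)
```

After n steps the regressor contains only predicted values, which is exactly the situation described above in which errors propagate and accumulate.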
3.2. Instance selection
The primary aim of the proposed methodology is to improve
the quality of prediction through the choice of subsets of
the training instances for each of the H prediction steps. In order to
achieve that, prior to training the model during each step of the
prediction, it is necessary to form a new training set which,
based on the selection criterion, fits the current vector
used for the prediction. The term current vector, $x_t = \{y_N, y_{N-1}, \ldots, y_{N-n+1}\}$,
is related to the last n observations in the first prediction
step only. After that, it is updated in every prediction step,
in accordance with the recursive prediction strategy. If $x_k$, $k = 1, \ldots, (N-n)$,
denotes the k-th training vector, that is, the k-th row of
matrix X from (7), and $x_t$ the current vector which is used to predict
the following step, the amount of mutual information between $x_k$
and $x_t$ defines one criterion for the determination of the similarity
between them. Both vectors, $x_k$ and $x_t$, consist of n lagged
variables. Thus, the MI between them is computed according to
(3), where n represents the overall number of input-output pairs.
Here too, as in the case of the kNN algorithm, the starting point
is the assumption that similar training vectors have similar
target values. The addition is that the measure of that
similarity is determined by the amount of mutual information. If $x_k$
and $x_t$ share a significant amount of mutual information, their
target values will share it as well. In other words, on the basis
of the known target value of $x_k$, the uncertainty of the
unknown target value of the vector $x_t$ which needs to be predicted
can be reduced. Thus, by ranking the vectors of the initial training
set with respect to $x_t$ and selecting the subset of those which share a
greater amount of mutual information with it, it is possible to
achieve a greater prediction quality when the model is applied to $x_t$.
The proposed approach for the selection of subsets of the training
instances in accordance with the criterion of mutual information is
shown in Algorithm 1.
Algorithm 1. The selection of instances based on the evaluation
of mutual information
1. Initialization.
From the time series y defined in (6), form the initial
training set $S = \{X, Y\}$ in accordance with the selected size of
the regressor n and Eqs. (7) and (8). Form the initial test
vector $x_t = \{y_N, y_{N-1}, \ldots, y_{N-n+1}\}$ from the last n values of
the time series y.
2. Estimate the amount of mutual information between each
vector of the initial training set $x_k$, $k = 1, \ldots, (N-n)$, and the
current vector $x_t$, on the basis of (3), and store those
values in vector V(k).
3. Sort set S in descending order in accordance with the
evaluated amount of mutual information from vector V.
Then, on the basis of V and S, either:
4. Define the lowest allowed limit (threshold) for the amount of mutual
information between the vectors from the initial training set
and the current vector $x_t$. On the basis of vector
V, choose from the set S those input-output pairs for which
V(k), $k = 1, \ldots, (N-n)$, is greater than this limit. Use them to form
the reduced training subset $S_r = \{X_r, Y_r\}$, or,
5. Define the overall number of vectors which are retained in
the training subset, marked with r. Form a reduced training
subset $S_r = \{X_r, Y_r\}$ on the basis of the first r input-output
pairs from S.
After forming the initial training set, the first step in Algorithm 1
is the evaluation of the mutual information between each input
vector from X and the current vector $x_t$ used to predict the following
step. Then vector V is formed, which determines the relative
significance (ranking) of each input vector from X in relation to $x_t$.
The following step is the sorting of set S in descending order in
relation to vector V. In order to select the vectors, two options are
available: the definition of the overall number of training vectors r
which are retained in the training set, and the definition of the
lowest limit for the amount of mutual information. If the option
which defines the limit of the mutual information is selected, the
algorithm needs to be provided with a threshold parameter. This threshold
determines the 'sensitivity' of the algorithm, that is, the minimal
allowed amount of similarity between the input vectors from X and
$x_t$. All of the training examples from S whose input vectors share
an amount of information with $x_t$ that is greater than the threshold are added
to the training subset $S_r = \{X_r, Y_r\}$. If the option which defines the
overall number of training vectors which need to be retained is
selected, the algorithm needs to be provided with a parameter r.
Then the training subset $S_r = \{X_r, Y_r\}$ is formed from the first r
examples from S.
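Steps 2 to 5 of Algorithm 1 can be sketched as a small ranking routine. This is an illustration only: `select_instances` and its parameter names are hypothetical, and the `score` argument is a placeholder for the MI estimate of (3) between a training vector and the current vector. The toy score used in the example (negative city-block distance) is a stand-in chosen to keep the example short, not the criterion proposed in the paper.

```python
def select_instances(X, Y, x_t, score, r=None, threshold=None):
    """Rank training instances by score(x_k, x_t) and keep a reduced subset.

    Exactly one of `r` (number of instances to keep) or `threshold`
    (lowest allowed score) must be given, mirroring steps 5 and 4.
    """
    if (r is None) == (threshold is None):
        raise ValueError("give exactly one of r or threshold")
    V = [score(x_k, x_t) for x_k in X]                               # step 2
    order = sorted(range(len(X)), key=lambda k: V[k], reverse=True)  # step 3
    if threshold is not None:
        keep = [k for k in order if V[k] > threshold]                # step 4
    else:
        keep = order[:r]                                             # step 5
    return [X[k] for k in keep], [Y[k] for k in keep]


def toy_score(a, b):
    # Stand-in similarity, NOT mutual information: negative city-block distance.
    return -sum(abs(p - q) for p, q in zip(a, b))


X = [[1, 1], [5, 5], [2, 2], [9, 9]]
Y = [10, 50, 20, 90]
X_r, Y_r = select_instances(X, Y, [2, 1], toy_score, r=2)
```

Passing the score as a callable keeps the selection logic independent of how the similarity is estimated, which is in line with the filter character of the method.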
Previously, the terms training instance and training vector were
used. The term training instance refers to one training vector together
with its target value, i.e. one input-output pair from X and Y.
However, in order to simplify the notation, these two terms are
used synonymously, although it is training instances that are selected.
The improvement in the quality of the prediction by using the
proposed algorithm depends on the 'quality' of the initial training
set (the number of available training vectors, the size of the
regressor and the selected features), but also on the values of
the selected parameter r and the MI threshold. As indicated in Section 5,
these values can be determined based on the function of the
mutual information between the vectors of the initial training set
and the current vector which is used in the prediction. As the
results of the testing indicate, in this case a good choice for
parameter r is 25, and for the MI threshold 0.2. This choice is
suitable even though the values of these parameters can vary
among different data sets and depend on the MI function between
the vectors of the initial training set and the current vector used in
the prediction.
3.3. Prediction strategy with instance selection
The procedure for the selection of the subsets of instances
during each step of the prediction in the recursive prediction
strategy is shown in algorithm 2.
Algorithm 2. The selection of instances in the recursive prediction
strategy
1. Initialization.
From the time series y defined in (6), form an initial training
set $S = \{X, Y\}$ in accordance with the selected size of the
regressor n and Eqs. (7) and (8). Form the initial test
vector $x_t = \{y_N, y_{N-1}, \ldots, y_{N-n+1}\}$ from the last n values of
the time series y.
2. On the basis of points 2 to 5 of Algorithm 1, form a reduced
training set $S_r = \{X_r, Y_r\}$.
3. On the basis of the reduced training set, train the local
LS-SVM prediction model.
4. On the basis of this model, form the prediction
for the following step in the time series and store it in vector
$\hat{y}_t$, $t = 1, \ldots, H$.
5. Update vector $x_t$ by shifting it one place and adding the
prediction formed in step 4, in accordance with the
recursive prediction strategy.
6. Repeat steps 2 to 5 H times, where H represents the size of
the prediction horizon, that is, the number of steps that
need to be predicted.
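The loop of Algorithm 2 can be sketched in Python as follows. All names are hypothetical, and both the selection rule and the local model are deliberately simple stand-ins (two nearest neighbours and the mean of their targets) so that the control flow stays visible; in the paper, selection uses MI and the local model is an LS-SVM.

```python
def forecast_with_selection(series, n, horizon, select, fit):
    """Recursive forecasting with a fresh instance selection at every step."""
    N = len(series)
    X = [list(reversed(series[k:k + n])) for k in range(N - n)]  # Eq. (7)
    Y = [series[k + n] for k in range(N - n)]                    # Eq. (8)
    x_t = list(reversed(series[-n:]))                            # step 1
    predictions = []
    for _ in range(horizon):                                     # step 6
        X_r, Y_r = select(X, Y, x_t)                             # step 2
        model = fit(X_r, Y_r)                                    # step 3
        y_hat = model(x_t)                                       # step 4
        predictions.append(y_hat)
        x_t = [y_hat] + x_t[:-1]                                 # step 5
    return predictions


def nearest_two(X, Y, x_t):
    # Stand-in for Algorithm 1: keep the two vectors closest to x_t.
    order = sorted(range(len(X)),
                   key=lambda k: sum(abs(a - b) for a, b in zip(X[k], x_t)))[:2]
    return [X[k] for k in order], [Y[k] for k in order]


def mean_model(X_r, Y_r):
    # Stand-in for the local LS-SVM: predict the mean of the selected targets.
    mean = sum(Y_r) / len(Y_r)
    return lambda x: mean


preds = forecast_with_selection([1, 2, 3] * 4, n=2, horizon=3,
                                select=nearest_two, fit=mean_model)
```

Because a new model is trained at every step, the cost per step is dominated by selection plus training on the reduced set, which is one reason a small subset is attractive.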
4. Least squares support vector machines
In this section, a brief review of LS-SVMs is given, which are
used to train the forecasting models in the experiments.
Least squares support vector machines, as a reformulation of SVMs,
are commonly used for function estimation and for solving non-linear
regression problems [48]. The main property of these methods is that
they obtain a solution from a set of linear equations instead of solving
a quadratic programming problem, as in SVMs. Therefore, LS-SVMs have
a significantly shorter computing time and are easier to optimize,
but at the cost of a lack of sparseness in the solution. The regression model in
primal weight space is expressed as:
$$y(x) = \omega^T \varphi(x) + b \tag{12}$$
where $\omega$ represents the weight vector, b represents a bias term and
$\varphi(x)$ is a function which maps the input space into a higher-dimensional
feature space.
LS-SVMs formulate the optimization problem in primal space
defined by:
$$\min_{\omega, b, e} J_p(\omega, e) = \frac{1}{2} \omega^T \omega + \gamma \frac{1}{2} \sum_{k=1}^{N} e_k^2 \tag{13}$$
subject to equality constraints expressed as:
$$y_k = \omega^T \varphi(x_k) + b + e_k, \quad k = 1, \ldots, N \tag{14}$$
where $e_k$ represents the error variables and $\gamma$ is a regularization parameter
which gives the relative weight to errors and should be optimized
by the user.
Solving this optimization problem in dual space leads to
obtaining $\alpha_k$ and b in the solution represented as:
$$y(x) = \sum_{k=1}^{N} \alpha_k K(x, x_k) + b \tag{15}$$
The dot product $K(x, x_k) = \varphi(x)^T \varphi(x_k)$ represents a kernel function,
while $\alpha_k$ is a Lagrange multiplier. When using a radial basis
function (RBF) kernel defined by:
$$K(x, x_k) = e^{-\frac{\|x - x_k\|^2}{\sigma^2}} \tag{16}$$
the optimal parameter combination $(\gamma, \sigma)$ should be established,
where $\gamma$ denotes the regularization parameter and $\sigma$ is a
kernel parameter. For this purpose, a grid-search algorithm in
combination with k-fold cross-validation is a commonly used
method [49].
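To make the "set of linear equations" concrete, the sketch below trains an LS-SVM regressor with the RBF kernel of (16). It relies on the standard dual system from the LS-SVM literature, which is not written out in this section: the first row enforces $\sum_k \alpha_k = 0$, and the remaining rows read $b + \sum_l (K(x_k, x_l) + \delta_{kl}/\gamma)\,\alpha_l = y_k$. The function names and the plain Gaussian elimination solver are illustrative assumptions; a production implementation would use a numerical library.

```python
import math


def rbf_kernel(a, b, sigma):
    # Eq. (16): K(x, x_k) = exp(-||x - x_k||^2 / sigma^2)
    sq = sum((p - q) ** 2 for p, q in zip(a, b))
    return math.exp(-sq / sigma ** 2)


def solve_linear(A, rhs):
    """Solve A s = rhs by Gaussian elimination with partial pivoting."""
    m = len(rhs)
    M = [row[:] + [rhs[i]] for i, row in enumerate(A)]
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, m):
            f = M[r][col] / M[col][col]
            for c in range(col, m + 1):
                M[r][c] -= f * M[col][c]
    sol = [0.0] * m
    for r in range(m - 1, -1, -1):
        s = sum(M[r][c] * sol[c] for c in range(r + 1, m))
        sol[r] = (M[r][m] - s) / M[r][r]
    return sol


def lssvm_fit(X, Y, gamma, sigma):
    """Return a predictor implementing Eq. (15) for training data (X, Y)."""
    N = len(X)
    A = [[0.0] + [1.0] * N]  # first row: sum of alpha_k equals zero
    for k in range(N):
        row = [1.0] + [rbf_kernel(X[k], X[l], sigma) for l in range(N)]
        row[k + 1] += 1.0 / gamma  # ridge term from the regularization
        A.append(row)
    sol = solve_linear(A, [0.0] + list(Y))
    b, alpha = sol[0], sol[1:]

    def predict(x):
        return sum(a * rbf_kernel(x, x_k, sigma) for a, x_k in zip(alpha, X)) + b

    return predict


f = lssvm_fit([[0.0], [1.0], [2.0], [3.0]], [0.0, 1.0, 4.0, 9.0], gamma=1e6, sigma=1.0)
```

Every training point receives a generally non-zero $\alpha_k$, which is the lack of sparseness mentioned above; it also explains why training on a small selected subset keeps both the linear system and the predictor compact.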
5. Experimental results and discussion
This section presents the experimental results and a discussion of
applying the proposed approach to data sets taken from the NN5
time-series forecasting competition [50]. The goal of the experimental study
is to compare the performance of the proposed instance selection
methodology in combination with a recursive prediction strategy
with that of the strategy without instance selection. The section
begins with a description of the data sets used in the experiments,
followed by the experimental setup. Then the results are presented, and
finally a discussion of the results concludes the section.
In the last decade, several time series forecasting competitions
have been organized in order to compare and evaluate the performance
of different machine learning methods. Among them, the NN5
competition is one of the most interesting, since it includes the
challenges of a real-world multi-step-ahead forecasting task, namely
multiple time series, outliers, missing values and multiple
overlying seasonalities [51]. Each of the 11 time series in the
reduced data set represents roughly two years of daily cash
withdrawal amounts (735 data points) at ATMs in
various cities in the UK. The reduced data set is a typical representative
of the entire data set, which contains 111 time series. For each time
series, the competition task was to forecast the values for the next 56
days as accurately as possible, using the given historical data points.
The prediction quality was evaluated using the symmetric mean
absolute percentage error (SMAPE):
$$\mathrm{SMAPE}\,[\%] = 100 \cdot \frac{1}{H} \sum_{i=1}^{H} \frac{|y_i - \hat{y}_i|}{(y_i + \hat{y}_i)/2} \tag{17}$$
where $y_i$ and $\hat{y}_i$ are the real and the predicted value of the time-series
in the i-th prediction step and H is the size of the prediction horizon. Since
this is a relative error measure, the errors can be averaged over all of
the time series in the reduced data set to obtain a mean SMAPE
defined as:
$$\mathrm{mean\,SMAPE}\,[\%] = \frac{1}{11} \sum_{i=1}^{11} \mathrm{SMAPE}_i \tag{18}$$
where $\mathrm{SMAPE}_i$ denotes the SMAPE of the i-th time series.
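Both error measures translate directly into code. The sketch below follows (17) and (18); the function names are hypothetical, and the code assumes that $y_i + \hat{y}_i$ is never zero, since the measure is undefined in that case.

```python
def smape(actual, predicted):
    """Symmetric mean absolute percentage error of Eq. (17), in percent."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    H = len(actual)
    # Assumes y + p != 0 for every pair; Eq. (17) is undefined otherwise.
    return 100.0 / H * sum(abs(y - p) / ((y + p) / 2.0)
                           for y, p in zip(actual, predicted))


def mean_smape(smape_values):
    """Average of per-series SMAPE values, as in Eq. (18)."""
    return sum(smape_values) / len(smape_values)


# Two forecasts that are off by 10 and 20 units respectively.
error = smape([100.0, 200.0], [110.0, 180.0])
```

For the two-step example above the result is approximately 10.03%, the average of 10/105 and 20/190 expressed in percent.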
The NN5 time series require a missing value interpolation preprocessing
step. In this step, two types of anomalies need to be considered: zero
values, which indicate that no money withdrawal occurred, and missing
observations, for which no value was recorded. About 2.5% of the data are
missing values. The missing values are interpolated with the method
proposed in [51]: if $y_m$ is the missing value, it is replaced with the
median of those values in the set
$\{y_{m-365}, y_{m+365}, y_{m-7}, y_{m+7}\}$ which are available.
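The interpolation rule can be sketched as below. This is only an
illustration under stated assumptions: missing observations are encoded as
`None`, neighbours are read from the original (unfilled) series, and a
value with no available neighbour is left missing. The function name
`impute_missing` is hypothetical, and whether zero values are treated the
same way is left to the caller.

```python
from statistics import median
from typing import List, Optional


def impute_missing(series: List[Optional[float]]) -> List[Optional[float]]:
    """Replace each missing y_m by the median of the available values among
    y_{m-365}, y_{m+365}, y_{m-7} and y_{m+7}."""
    filled = list(series)
    for m, value in enumerate(series):
        if value is not None:
            continue
        candidates = []
        for offset in (-365, 365, -7, 7):
            j = m + offset
            # Keep only neighbours that exist and are themselves observed.
            if 0 <= j < len(series) and series[j] is not None:
                candidates.append(series[j])
        if candidates:
            filled[m] = median(candidates)
    return filled
```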
A recursive forecasting strategy requires the setting of the
regressor size n, Eqs. (9)-(11). Several approaches have been
proposed in the literature to select this value, such as the one
based on the partial autocorrelation function (PACF), proposed in
[11]. Since this aspect is not the central topic of this paper, the
regressor size is determined based on the visual inspection of the
time-series and the analysis presented in [11] and [51]. It is set to 14
for all 11 time-series in the reduced data set. Previous analyses of
M.B. Stojanović et al. / Neurocomputing 141 (2014) 236-245 240
this data set have shown 'day of the week' seasonality, so two
weeks is a long enough period of time to capture the time-series
pattern and to maintain the same setup between the experiments.
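To illustrate the effect of the regressor size, the following sketch builds
input/target pairs from a series using n lagged values. It assumes the
standard autoregressive embedding (the n most recent values as input, the
next value as target); the helper name `build_instances` is hypothetical.
With 735 data points and n = 14 this yields 735 - 14 = 721 instances.

```python
from typing import List, Sequence, Tuple


def build_instances(series: Sequence[float], n: int = 14) -> Tuple[List[List[float]], List[float]]:
    """Turn a series into (regressor, target) pairs with n lagged inputs."""
    X: List[List[float]] = []
    Y: List[float] = []
    for t in range(n, len(series)):
        X.append(list(series[t - n:t]))  # the n values preceding time t
        Y.append(series[t])              # the value to be predicted
    return X, Y
```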
In order to obtain more accurate results, a rolling forecast origin
evaluation was used, as proposed in [52]. This approach overcomes the
shortcomings of fixed-origin evaluation, including its dependence on the
randomness contained in the particular forecasting origin and the limited
sample size of errors [53]. In rolling-origin evaluation, the forecasting
origin is successively updated, and forecasts from new origins are
produced. In general, the rolling-origin procedure provides H(H+1)/2
forecasts, against H from the fixed-origin procedure, where H denotes
the number of steps that need to be predicted.
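The forecast count can be checked with a short enumeration. The sketch
assumes one common reading of rolling-origin evaluation: a held-out window
of H points, with the origin advanced one point at a time and every origin
forecasting up to the end of that window. The function name is
illustrative, and the exact procedure used in the experiments may differ.

```python
from typing import List, Tuple


def rolling_origin_pairs(H: int) -> List[Tuple[int, int]]:
    """List (origin_shift, steps_ahead) pairs inside a held-out window of H points."""
    pairs = []
    for origin in range(H):                   # origin moved forward by 0..H-1 points
        for target in range(origin + 1, H + 1):
            pairs.append((origin, target - origin))
    return pairs
```

For H = 56 this gives 56 * 57 / 2 = 1596 forecasts, compared with 56 from a
single fixed origin.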
The input selection procedure with local model training and recursive
forecasting of the time-series for one prediction horizon ahead was
presented in Section 4 as Algorithm 2. The first step in the algorithm is
the selection of a new training set $(X_r, Y_r)$ from the initial set,
based on the current forecasting vector $x_t$ and the selection option
described in Algorithm 1. After that, the optimal $(\gamma, \sigma)$ pair
is determined based on $(X_r, Y_r)$ using a grid search with 10-fold
cross-validation. The local LS-SVM model is then employed for the
time-series prediction. The whole process is repeated H times by employing
a recursive prediction strategy. An LS-SVM Matlab toolbox can be found in
[54].
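The control flow of the recursive strategy with per-step instance selection
can be sketched as follows. This is not the authors' implementation: the
selector `nearest_r` (squared Euclidean distance) and the local model
`mean_model` are toy stand-ins for the MI based selection and the tuned
LS-SVM respectively, and all names are hypothetical. Only the loop
structure (select, fit, predict, feed back) mirrors the description above.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Model = Callable[[Vector], float]


def build_instances(series: Sequence[float], n: int) -> Tuple[List[Vector], List[float]]:
    """Initial training set: n lagged values as input, the next value as target."""
    X = [list(series[t - n:t]) for t in range(n, len(series))]
    Y = [series[t] for t in range(n, len(series))]
    return X, Y


def nearest_r(r: int) -> Callable:
    """Toy selector (NOT the MI estimator): keep the r vectors closest to x_t."""
    def select(X, Y, x_t):
        def dist(x):
            return sum((a - b) ** 2 for a, b in zip(x, x_t))
        order = sorted(range(len(X)), key=lambda i: dist(X[i]))[:r]
        return [X[i] for i in order], [Y[i] for i in order]
    return select


def mean_model(X_r: List[Vector], Y_r: List[float]) -> Model:
    """Toy local model (NOT an LS-SVM): always predicts the mean selected target."""
    y_bar = sum(Y_r) / len(Y_r)
    return lambda x: y_bar


def recursive_forecast(series: Sequence[float], n: int, H: int,
                       select: Callable, fit: Callable) -> List[float]:
    """Forecast H steps ahead, re-selecting the training subset at every step."""
    X, Y = build_instances(series, n)
    x_t = list(series[-n:])
    forecasts = []
    for _ in range(H):
        X_r, Y_r = select(X, Y, x_t)   # step-specific training subset
        model = fit(X_r, Y_r)          # local model for this step
        y_hat = model(x_t)
        forecasts.append(y_hat)
        x_t = x_t[1:] + [y_hat]        # feed the prediction back into the regressor
    return forecasts
```

Because each prediction is fed back into the regressor, an error made at an
early step influences both the later selections and the later predictions,
which is why the quality of the selected subset matters at every step.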
Several models are generated with different training sets, which are formed
from the initial training set using either the MI threshold option or the
number of inputs option. To be clear, the term 'model' as used here denotes
a set of local models, each trained with its own training set. These
training sets are generated from the initial training set using the same
MI criterion, and the training set differs from one local model to the
next. The feature set is the same for every model, i.e. all the local
models have the same feature structure. In order to assess the increase in
the accuracy of the proposed models and their contribution to forecasting
research, their accuracy was compared with two statistical benchmark
models, ARIMA and Holt-Winters, and also with models which implement
instance selection based on the distance metric proposed in [32,33].
The amount of MI between the initial training set vectors and the
test vector (i.e. the vector with which the prediction is performed)
for the first prediction step in the prediction horizon is given in Fig. 1.
The horizontal axis denotes the number of vectors in the initial training
set, while the vertical axis represents the amount of MI. The
first block from the left in Fig. 1 is related to the first time-series in
the data set, the second block to the second time-series, and so on. From
Fig. 1 it can be observed that most of the vectors from the initial
Fig. 1. The amount of MI between the initial training set vectors and the test vector for the first prediction step.
training set have MI values smaller than 0.1 and that there is a small
number of vectors with MI values greater than 0.3. From Fig. 1 it can
also be observed that the MI curves have a similar shape for all 11
time-series, although their slopes and maximum values differ. Also, for
all 11 time series in Fig. 1, the total number of significant vectors (the
ones that have an MI greater than 0) is around 400. Consider, for example,
the time-series which correspond to block 9 and block 10 in Fig. 1. The
time-series that corresponds to block 9 has a significantly larger number
of vectors with a large amount of MI than the one that corresponds to
block 10. This indicates the quality of the initial training set with
respect to the current vector for which the forecasting model is employed.
In this way it is possible to assess the expected quality of the future
predictions. The predictions obtained with a model trained with a
large number of vectors that share a large amount of MI with the
current test vector should be more accurate than the predictions
obtained with a model trained with fewer vectors that share a
smaller amount of MI with the current test vector.
The selection strategy for parameter r, which determines the overall
number of vectors to be selected during a given step, is the following:
select as few vectors as possible, but enough to train the LS-SVM model,
such that they share as much information as possible with the test vector
during the first prediction step. Although it is possible to select a
different value of parameter r during each of the prediction steps, with
the aim of simplifying the results, parameter r has been selected on the
basis of the mutual information during the first prediction step and
remains unchanged during all the subsequent prediction steps. In addition,
in the case of the recursive strategy, during the first prediction step all
of the values in the regressor are known (true), and thus the evaluation of
the mutual information is more precise than in the following steps.
When it comes to the selection of the MI threshold parameter, which defines
the lowest limit of mutual information between two vectors, it is
necessary to select as great a value as possible, and at the same time
leave a sufficient number of vectors to form the model. As the
results of the tests have shown, a sufficient number of vectors for
training the LS-SVM model ranges from several dozen to one
hundred. The threshold was also selected on the basis of the available
number of vectors with a suitable amount of information during
the first step of the prediction, and it remains unchanged in all
the steps of the prediction. According to the previous analysis and
Fig. 1, for testing purposes 27 different models were built, denoted
with:
• LSSVM: a model trained with the initial training set, which contains
  721 instances,
• LSSVM-[threshold]MI (threshold ∈ {0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35,
  0.4}): models trained with sets that consist of the initial input vectors
  with an MI greater than the threshold in every prediction step,
• LSSVM-rMI (r ∈ {15, 25, 50, 75, 100, 200, 300, 400}): models trained
  with sets that contain the first r instances for each prediction step,
  taken in descending order of MI,
• LSSVM-kkNN (k ∈ {15, 25, 50, 75, 100, 200, 300, 400}): models trained
  with sets that contain the first k instances for each prediction step,
  taken in descending order based on the distance metric proposed in
  [32,33],
• ARIMA: an autoregressive integrated moving average benchmark model,
• HW: the Holt-Winters benchmark model.
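The two MI based selection options listed above can be sketched as index
selections over precomputed MI scores. The scores themselves are assumed to
come from an MI estimator that is not implemented here, and the function
names are hypothetical.

```python
from typing import List, Sequence


def select_by_threshold(mi_scores: Sequence[float], threshold: float) -> List[int]:
    """Threshold option: indices of vectors whose MI exceeds the threshold."""
    return [i for i, score in enumerate(mi_scores) if score > threshold]


def select_top_r(mi_scores: Sequence[float], r: int) -> List[int]:
    """Number-of-inputs option: indices of the r vectors with the largest MI."""
    order = sorted(range(len(mi_scores)), key=lambda i: mi_scores[i], reverse=True)
    return order[:r]
```

The threshold option yields a training set whose size varies from step to
step, whereas the top-r option fixes the size and lets the lowest admitted
MI value vary.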


The mean and median SMAPE values and the ranking results
are presented in Table 1.
The problem of comparing multiple models on multiple time series, and of
inferring whether there are significant general differences in
performance, is discussed in [55] and [56]. In such cases a two-stage
procedure is recommended: first, Friedman's test is used to determine
whether the compared models have the same mean rank; if this null
hypothesis is rejected, a post-hoc test is performed in the second stage
to compare the different models. Friedman's test is a nonparametric test
which is designed to detect differences among two or more groups. Applying
Friedman's test, a p-value of 0.016 is obtained. Since this value is less
than 0.05, the null hypothesis is rejected at the 5% level, which
indicates significant differences in the mean ranks among the compared
models. For the post-hoc test, the multiple comparisons with the best
(MCB) test is used.
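For illustration, the mean ranks and the Friedman chi-square statistic can
be computed from a matrix of per-series errors as sketched below. The
layout `errors[d][m]` (error of model m on series d) and the function names
are assumptions, ties receive average ranks, and no tie correction is
applied to the statistic. The p-value would follow from a chi-square
distribution with k - 1 degrees of freedom, which is not computed here.

```python
from typing import List, Sequence


def rank_row(row: Sequence[float]) -> List[float]:
    """Rank the models on one series (1 = lowest error), averaging ties."""
    order = sorted(range(len(row)), key=lambda j: row[j])
    ranks = [0.0] * len(row)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and row[order[j + 1]] == row[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0          # average of positions i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks


def mean_ranks(errors: Sequence[Sequence[float]]) -> List[float]:
    """Mean rank of every model over all series."""
    k = len(errors[0])
    sums = [0.0] * k
    for row in errors:
        for m, r in enumerate(rank_row(row)):
            sums[m] += r
    return [s / len(errors) for s in sums]


def friedman_statistic(errors: Sequence[Sequence[float]]) -> float:
    """Friedman chi-square statistic for N series and k models."""
    N, k = len(errors), len(errors[0])
    R = mean_ranks(errors)
    return 12.0 * N / (k * (k + 1)) * (sum(r * r for r in R) - k * (k + 1) ** 2 / 4.0)
```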
From Table 1 it can be observed that, in terms of the mean and median
SMAPE, all of the models that implement an MI based instance selection
algorithm (regardless of the selection option or the chosen value of r or
of the MI threshold) outperform the LSSVM model without an instance
selection algorithm. A slightly better performance was determined for the
MI based instance selection models in comparison to the LSSVM-kNN models
and the benchmark ARIMA and HW models. Also, in terms of average
rank, all of the models that implement an MI based instance
selection algorithm outperformed the LSSVM model, the LSSVM-kNN
models, ARIMA and the HW model. The post-hoc MCB test indicates no
significant differences between the models at the 0.05 level. It should be
mentioned that the best meanSMAPE achieved during the NN5 competition for
the reduced data set is 17.6%, obtained with a statistical model, while
the top three entries for computational intelligence models achieved
19%, 19.9% and 20.5% [57].
Table 1
The performance comparison of the individual forecasting models.

Model           meanSMAPE (%)   Avg. rank   medianSMAPE (%)   Avg. rank
LSSVM           21.47           27          20.34             24
LSSVM-0.05MI    19.66           9           18.66             9
LSSVM-0.1MI     19.55           4           18.29             4
LSSVM-0.15MI    19.29           3           18.51             7
LSSVM-0.2MI     19.08           1           19.36             20
LSSVM-0.25MI    19.68           10          18.70             10
LSSVM-0.3MI     19.84           13          19.03             15
LSSVM-0.35MI    19.63           5           18.04             1
LSSVM-0.4MI     19.64           6           18.15             3
LSSVM-15MI      20.72           23          20.03             2
LSSVM-25MI      19.28           2           18.07             18
LSSVM-50MI      19.75           11          19.29             17
LSSVM-75MI      20.01           16          19.27             13
LSSVM-100MI     19.64           7           18.86             14
LSSVM-200MI     19.66           8           18.88             6
LSSVM-300MI     20.01           14          18.48             11
LSSVM-400MI     20.01           15          18.76             2
LSSVM-15kNN     20.21           18          18.59             8
LSSVM-25kNN     20.52           21          18.82             12
LSSVM-50kNN     19.81           12          18.44             5
LSSVM-75kNN     20.06           17          19.35             19
LSSVM-100kNN    20.62           22          19.18             16
LSSVM-200kNN    20.87           25          20.03             21
LSSVM-300kNN    20.75           24          20.36             25
LSSVM-400kNN    20.49           20          20.06             23
ARIMA           20.34           19          20.55             26
HW              21.06           26          20.79             27
Friedman test p-value: 0.016

The predictions obtained with the LSSVM and LSSVM-0.2MI models versus the
real values for time-series No. 9 and No. 10 in the reduced data set are
presented in Fig. 2 (from top to bottom, respectively). While for series
No. 9 both models perform very well (which was expected, considering the
analysis of block 9 in Fig. 1), the improvements were much more significant
for series No. 10. In fact, the predictions obtained from the LSSVM model
for series No. 10 are partially usable for 28 steps ahead at most, and
from that point on they become a flat line. However, the predictions
obtained from the LSSVM-0.2MI model, which implements MI based instance
selection, follow the shape of the curve for all 56 prediction steps.
The computational cost of the proposed instance selection algorithm can
vary in both directions, depending on the initial training set size, the
regressor size, the number of steps in the prediction horizon and the
number of selected vectors per step. In the case studied here, the time
needed to obtain the predictions is multiplied by a factor of
approximately 3, 4, 6, 8 and 12 for the LSSVM-15MI, LSSVM-75MI,
LSSVM-200MI, LSSVM-300MI and LSSVM-400MI models respectively, compared to
the initial LSSVM model. This is a direct consequence of the large
prediction horizon, H = 56, but also of the large number of
selected vectors per step for the LSSVM-200MI, LSSVM-300MI
and LSSVM-400MI models. Nevertheless, the increase in computational time
is compensated by an increase in the quality of the prediction
results.
6. Conclusion
A possible approach for improving long-term time-series prediction is
presented in this paper. The proposed methodology is based on the concept
of mutual information for instance selection, which was previously mostly
used for feature selection and for removing noise and outliers. The
novelty of the presented approach is the use of the MI estimator to make
an objective evaluation of how relevant an input vector is for the
training set, i.e. how well it fits the current prediction step.
Unlike the previous approaches which use an MI estimator
(feature selection, removing noise and outliers, and missing value
interpolation), where the evaluation of the mutual information
was performed only on a training set, the proposed methodology
evaluates the mutual information between the vectors of
the initial training set and the current vector used in the
prediction.
As the experimental results indicate, MI based selection reduced the
initial training set by a factor of several. All of the generated models
performed better than the initial model, regardless of the selected
number of instances or the MI threshold value. It has been shown
that the quality of the training set is more significant than its size.
The models trained with sets of vectors which share a large amount
of information with the forecasting inputs achieved greater accuracy
than the model trained with one much larger set.
Although the calculations in the algorithms are slightly time
consuming, they bring significant improvements in time-series
forecasting accuracy and lead to a reduction in the training
set size needed for model formation.
Further work should include detailed research on how the choice of
parameter k in the MI estimator affects the proposed instance
selection algorithm. Also, the determination of the initial training
set size should be analyzed, in order to constrain its size in cases
when large amounts of historical data are available for instance
selection.
Acknowledgments
This work was supported by the Ministry of Science and
Technological Development, Republic of Serbia (Project number:
III44006).
We thank the editors and the anonymous reviewers for their
helpful comments and suggestions that have improved the quality
of this paper.
Fig. 2. The forecasts from the LSSVM and LSSVM-0.2MI models and real values for time series No. 9 and No. 10.
References
[1] K. Kim, Financial time series forecasting using support vector machines, Neurocomputing 55 (2003) 307-319.
[2] P. Bermolen, D. Rossi, Support vector regression for link load prediction, Comput. Netw. 53 (2009) 191-201.
[3] N. Amjady, Short-term hourly load forecasting using time-series modeling with peak load estimation capability, IEEE Trans. Power Syst. 16 (2001) 798-805.
[4] A.D. Papalexopoulos, T.C. Hesterberg, A regression-based approach to short-term system load forecasting, IEEE Trans. Power Syst. 5 (1990) 1535-1547.
[5] G.E.P. Box, G. Jenkins, Time Series Analysis, Forecasting and Control, Holden-Day, San Francisco, 1976.
[6] C.C. Holt, Forecasting seasonals and trends by exponentially weighted moving averages, Int. J. Forecast. 20 (2004) 5-10.
[7] G. Zhang, B.E. Patuwo, M.Y. Hu, Forecasting with artificial neural networks: the state of the art, Int. J. Forecast. 14 (1998) 35-62.
[8] N.I. Sapankevych, R. Sankar, Time series prediction using support vector machines: a survey, IEEE Comput. Intell. Mag. 4 (2009) 24-38.
[9] H.H. Nguyen, C.W. Chan, Multiple neural networks for a long term time series forecast, Neural Comput. Appl. 13 (2004) 90-98.
[10] N. Kourentzes, S.F. Crone, Forecasting high-frequency time series with neural networks - an analysis of modelling challenges from increasing data frequency, Int. Conf. Data Min. (2008).
[11] S.F. Crone, N. Kourentzes, Input-variable specification for neural networks - an analysis of forecasting low and high time series frequency, in: Proceedings of International Joint Conference on Neural Networks, 2009, pp. 619-626.
[12] J. Zhang, Y. Yim, J. Yang, Intelligent selection of instances for prediction functions in lazy learning algorithms, Artif. Intell. Rev. 11 (1997) 175-191.
[13] J. Tolvi, Genetic algorithms for outlier detection and variable selection in linear regression models, Soft Comput. 8 (2004) 527-533.
[14] A. Guillen, L.J. Herrera, G. Rubio, H. Pomares, A. Lendasse, I. Rojas, New method for instance or prototype selection using mutual information in time series prediction, Neurocomputing 73 (2010) 2030-2038.
[15] J.Q. Candela, M. Sugiyama, A. Schwaighofer, N.D. Lawrence, Dataset Shift in Machine Learning, The MIT Press, Cambridge, MA, 2009.
[16] A.J. Storkey, M. Sugiyama, Mixture regression for covariate shift, Adv. Neural Inf. Process. Syst. 19 (2007) 1337-1344.
[17] J.G.M. Torres, T. Raeder, R.A. Rodriguez, N.V. Chawla, F. Herrera, A unifying view on dataset shift in classification, Pattern Recogn. 45 (2012) 521-530.
[18] D.A. Cohn, Z. Ghahramani, M.I. Jordan, Active learning with statistical models, J. Artif. Intell. Res. 4 (1996) 129-145.
[19] V.S. Iyengar, C. Apte, T. Zhang, Active learning using adaptive resampling, in: Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2000, pp. 91-98.
[20] M.C. Ludl, A. Lewandowski, Adaptive machine learning in delayed feedback domains by selective relearning, Appl. Artif. Intell. 22 (2008) 543-557.
[21] H. Shimodaira, Improving predictive inference under covariate shift by weighting the log-likelihood function, J. Stat. Plann. Inference 90 (2000) 227-244.
[22] D.W. Aha, Tolerating noisy, irrelevant and novel attributes in instance-based learning algorithms, Int. J. Man-Mach. Stud. 36 (1992) 267-287.
[23] D.R. Wilson, T. Martinez, Reduction techniques for instance based learning algorithms, Mach. Learn. 38 (2000) 257-286.
[24] Y. Guo, D. Schuurmans, Discriminative batch mode active learning, Adv. Neural Inf. Process. Syst. (NIPS) (2008) 593-600.
[25] H. Ishibuchi, T. Nakashima, M. Nii, Learning of neural networks with GA-based instance selection, in: Proceedings of the IFSA World Congress and 20th NAFIPS International Conference 4, 2001, pp. 2102-2107.
[26] M. Sebban, R. Nock, S. Lallich, Stopping criterion for boosting-based data reduction techniques: from binary to multiclass problems, J. Mach. Learn. Res. 3 (2002) 863-865.
[27] V.B. Zubek, T.G. Dietterich, Pruning improves heuristic search for cost-sensitive learning, in: Proceedings of the International Conference on Machine Learning, 2002, pp. 27-34.
[28] K. Yu, X. Xu, M. Ester, H.P. Kriegel, Feature weighting and instance selection for collaborative filtering: an information-theoretic approach, Knowl. Inf. Syst. 5 (2003) 201-224.
[29] J.A.O. Lopez, J.A.C. Ochoa, J.F.M. Trinidad, J. Kittler, A review of instance selection methods, Artif. Intell. Rev. 34 (2010) 133-143.
[30] N. Jankowski, M. Grochowski, Comparison of instances selection algorithms I, in: Proceedings of the Artificial Intelligence and Soft Computing - ICAISC 2004, 2004, pp. 598-603.
[31] J. Zhang, Y.S. Yim, J. Yang, Intelligent selection of instances for prediction functions in lazy learning algorithms, Artif. Intell. Rev. 11 (1997) 175-191.
[32] Z. Huang, M.L. Shyu, k-NN based LS-SVM framework for long-term time series prediction, in: Proceedings of the 2010 IEEE International Conference on Information Reuse and Integration, 2010, pp. 69-74.
[33] M.L. Shyu, Z. Huang, Long-term time series prediction using k-NN based LS-SVM framework with multi-value integration, Recent Trends in Information Reuse and Integration, Springer (2012) 191-209.
[34] A. Guillén, L. Herrera, G. Rubio, Instance or prototype selection for function approximation using mutual information, in: Proceedings of the European Symposium on Time Series Prediction - ESTSP 2008, 2008, pp. 67-75.
[35] M. Stojanović, M. Božić, M. Stanković, Z. Stajić, Adaptive least squares support vector machines method for short-term load forecasting based on mutual information for inputs selection, Int. Rev. Electr. Eng. 7 (2012) 3574-3585.
[36] L. Batina, B. Gierlichs, E. Prouff, M. Rivain, F.X. Standaert, N.V. Charvillon, Mutual information analysis: a comprehensive study, J. Cryptol. 24 (2010) 269-291.
[37] A. Sorjamaa, J. Hao, N. Reyhani, Y. Ji, A. Lendasse, Methodology for long-term prediction of time series, Neurocomputing 70 (2007) 2861-2869.
[38] F. Rossi, A. Lendasse, D. Francois, V. Wertz, M. Verleysen, Mutual information for the selection of relevant variables in spectrometric nonlinear modeling, Chemom. Intell. Lab. Syst. 80 (2006) 215-226.
[39] L.J. Herrera, H. Pomares, I. Rojas, M. Verleysen, A. Guillen, Effective input variable selection for function approximation, in: Proceedings of the International Conference on Artificial Neural Networks - ICANN 2006, 2006, pp. 41-50.
[40] G. Doquire, M. Verleysen, Feature selection with missing data using mutual information estimators, Neurocomputing 90 (2012) 3-11.
[41] R. Moddemeijer, On estimation of entropy and mutual information of the continuous distributions, Signal Process. 16 (1989) 233-248.
[42] Y. Moon, B. Rajagopalan, U. Lall, Estimation of mutual information using kernel density estimators, Phys. Rev. E 52 (1995) 2318-2321.
[43] A. Kraskov, H. Stögbauer, P. Grassberger, Estimating mutual information, Phys. Rev. E 69 (2004) 066138.
[44] http://www.klab.caltech.edu/kraskov/MILCA/.
[45] H. Stögbauer, A. Kraskov, S.A. Astakhov, P. Grassberger, Least dependent component analysis based on mutual information, Phys. Rev. E 70 (2004) 066123.
[46] D. François, F. Rossi, V. Wertz, M. Verleysen, Resampling methods for parameter-free and robust feature selection with mutual information, Neurocomputing 70 (2007) 1276-1288.
[47] L.J. Herrera, H. Pomares, I. Rojas, A. Guillen, A. Prieto, O. Valenzuela, Recursive prediction for long term time series forecasting using advanced models, Neurocomputing 70 (2007) 2870-2880.
[48] J.A.K. Suykens, T. Van Gestel, J. De Brabanter, B. De Moor, J. Vandewalle, Least Squares Support Vector Machines, World Scientific, Singapore, 2002.
[49] S. Arlot, A. Celisse, A survey of cross-validation procedures for model selection, Stat. Surv. 4 (2010) 40-79.
[50] http://www.neural-forecasting-competition.com/NN5/datasets.htm.
[51] S.B. Taieb, G. Bontempi, A. Atiya, A. Sorjamaa, A review and comparison of strategies for multi-step ahead time series forecasting based on the NN5 forecasting competition, Expert Syst. Appl. 39 (2012) 7067-7083.
[52] L.J. Tashman, Out-of-sample tests of forecasting accuracy: an analysis and review, Int. J. Forecast. 16 (2000) 437-450.
[53] S.F. Crone, N. Kourentzes, Feature selection for time series prediction - a combined filter and wrapper approach for neural networks, Neurocomputing 73 (2010) 1923-1936.
[54] http://www.esat.kuleuven.be/sista/lssvmlab/.
[55] J. Demšar, Statistical comparisons of classifiers over multiple data sets, J. Mach. Learn. Res. 7 (2006) 1-30.
[56] S. Garcia, F. Herrera, An extension on statistical comparisons of classifiers over multiple data sets for all pairwise comparisons, J. Mach. Learn. Res. 9 (2008) 2677-2694.
[57] http://www.neural-forecasting-competition.com/NN5/results.htm.
Miloš B. Stojanović was born in Niš, Serbia in 1981. He
received his Ph.D. degree in Computer Science from the
Faculty of Electronic Engineering, University of Niš,
Serbia in 2013. Currently he is working as a Teaching
Assistant at the Department of Modern Computer
Technologies in the College of Applied Technical
Sciences in Niš. His research interests include machine
learning (ML) and its application in power systems,
feature and instance selection algorithms for regression
models and the application of ML algorithms in
time-series prediction. His main area of research is
developing and improving ML models for the precise
forecasting of load demand.
Miloš M. Božić was born in Niš, Serbia in 1982. He
received his M.Sc. degree in Electrical Engineering in
2008 from the Faculty of Electronic Engineering,
University of Niš, Serbia. He is currently a research
associate at the Faculty of Electronic Engineering, University
of Niš. His research interests are in the field of artificial
intelligence and machine learning. His main current
areas of research are developing methods for load
forecasting, and feature and instance selection algorithms.
His current interest also lies in the domain of the
application of artificial intelligence in robotics.
Milena M. Stanković was born in Bjelovar, Croatia in
1953. She received her M.Sc. and Ph.D. degrees in
Computer Science in 1984 and 1988 respectively, from
the Faculty of Electronic Engineering, University of
Niš, Serbia. She is currently a professor and the Head
of the CIIT (Computational Intelligence and Information
Technologies) laboratory at the Faculty of Electronic
Engineering, University of Niš. Her main research interest
lies in the domain of the application of spectral
methods for the representation and classification of
discrete functions. Her current interest also lies in the
domain of data mining and web mining methods and
applications.
Zoran P. Stajić was born in 1968 in Leskovac, Serbia.
He received his B.Sc. degree in 1993, M.Sc. degree in
1996, and his Ph.D. degree in 2000 from the University
of Niš, Faculty of Electronic Engineering, Serbia. Since
1993 he has been working as a professor at the Faculty
of Electronic Engineering at the Department of Power
Engineering. His research interests include smart grid,
information and communication technologies and
electrical fraud detection. Since 1998, he has been the
leader of many government-funded projects concerning
innovations, technology development and energy
efficiency. In 2007, he established a private R&D center
in the field of Intelligent Energy Management Systems.