$$I(X;Y) = \iint p_{X,Y}(x,y)\,\log\frac{p_{X,Y}(x,y)}{p_X(x)\,p_Y(y)}\,\mathrm{d}x\,\mathrm{d}y \qquad (1)$$
The estimation of the joint probability density function (PDF)
for the pair (X, Y) is needed for the computation of MI. The most
commonly used methods for PDF estimation are histograms and
kernel estimators presented in [41,42]. However, their usage is
commonly limited to functions of one or two variables because the
number of samples needed for PDF estimation increases exponen-
tially with the number of variables. As a result, the estimator used
in this paper is the k-nearest neighbor (kNN) based MI estimator,
proposed in [43]. The novelty of this estimator lies in its ability to
estimate the MI between two multi-dimensional variables directly
from the data set. This avoids direct PDF estimation which is the
most problematic issue in MI estimation.
Let us consider the set of N input-output pairs $z_i = (x_i, y_i)$, $i = 1, \ldots, N$, which are independent and identically distributed realizations of a random variable $Z = (X, Y)$, where $x$ and $y$ can be either scalars or vectors. For any pair of points $z$ and $z'$, the maximum norm is used for the comparison of input-output pairs, defined by:

$$\|z - z'\| = \max\{\|x - x'\|,\ \|y - y'\|\} \qquad (2)$$

while any norm can be used in the X and Y spaces.
The basic idea is to estimate I(X,Y) from the distances in the spaces X, Y and Z from $z_i$ to its k nearest neighbors, averaged over all $z_i$. Let $z_{k(i)} = (x_{k(i)}, y_{k(i)})$ denote the kth nearest neighbor of $z_i$. It should be noted that $x_{k(i)}$ and $y_{k(i)}$ are the input and output parts of $z_{k(i)}$ respectively, and thus not necessarily the kth nearest neighbors of $x_i$ and $y_i$. Let us define $d_i^X = \|x_i - x_{k(i)}\|$, $d_i^Y = \|y_i - y_{k(i)}\|$ and $d_i^Z = \|z_i - z_{k(i)}\|$. Thus, $d_i = \max(d_i^X, d_i^Y)$. Subsequently, the number $n_i^X$ of points $x_j$ whose distance from $x_i$ is strictly less than $d_i$ is counted, and similarly the number $n_i^Y$ of points $y_j$ whose distance from $y_i$ is strictly less than $d_i$. Then, I(X,Y) can be estimated by:

$$\hat{I}(X;Y) = \psi(k) - \frac{1}{N}\sum_{i=1}^{N}\left[\psi(n_i^X + 1) + \psi(n_i^Y + 1)\right] + \psi(N) \qquad (3)$$
where $\psi$ is the digamma function defined as:

$$\psi(t) = \frac{\Gamma'(t)}{\Gamma(t)} = \frac{\mathrm{d}}{\mathrm{d}t}\ln\Gamma(t) \qquad (4)$$
and $\Gamma(t)$ is the gamma function defined as:

$$\Gamma(t) = \int_0^{\infty} u^{t-1}e^{-u}\,\mathrm{d}u \qquad (5)$$
The function $\psi$ satisfies the recursion $\psi(x+1) = \psi(x) + 1/x$ and $\psi(1) = -C$, where $C \approx 0.5772156$ is the Euler–Mascheroni constant. This paper implements the type of estimator presented in (3), which is one of the two proposed in [43]. The other one can be found in [44]. Both types of MI estimators depend on the value chosen for k, which controls the bias-variance tradeoff. Higher values of k imply a lower statistical error; on the other hand, systematic errors increase with k. Thus, to maintain a balance between these two errors, as recommended in [45], a mid-range value k = 6 will be used. Additionally, an approach for the selection of parameter k that relies on resampling methods can be found in [46].
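To make the estimator concrete, the following is a minimal Python sketch of Eq. (3), assuming NumPy and SciPy are available; the brute-force O(N²) distance computation and the function name are illustrative choices, not part of the original paper.

```python
import numpy as np
from scipy.special import digamma

def knn_mutual_information(x, y, k=6):
    """Estimate I(X;Y) via Eq. (3), using maximum-norm distances, Eq. (2)."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)  # shape (N, dx)
    y = np.asarray(y, dtype=float).reshape(len(y), -1)  # shape (N, dy)
    N = len(x)
    # Pairwise maximum-norm distances in the marginal and joint spaces.
    dx = np.abs(x[:, None, :] - x[None, :, :]).max(axis=2)
    dy = np.abs(y[:, None, :] - y[None, :, :]).max(axis=2)
    dz = np.maximum(dx, dy)
    np.fill_diagonal(dz, np.inf)           # a point is not its own neighbour
    d_i = np.sort(dz, axis=1)[:, k - 1]    # distance to the k-th neighbour
    # n_i^X, n_i^Y: points strictly closer than d_i (excluding the point itself).
    n_x = (dx < d_i[:, None]).sum(axis=1) - 1
    n_y = (dy < d_i[:, None]).sum(axis=1) - 1
    return (digamma(k) + digamma(N)
            - np.mean(digamma(n_x + 1) + digamma(n_y + 1)))
```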
3. The methodology for instance selection
In this section the methodology for training set instance
selection is presented in combination with a recursive time-
series prediction strategy.
3.1. Recursive time-series prediction strategy
Let us assume that, on the basis of the time-series defined in (6):

$$y = \{y_1, y_2, \ldots, y_{n-1}, y_n, y_{n+1}, \ldots, y_i, \ldots, y_N\}, \quad i = 1, \ldots, N \qquad (6)$$
where N represents the overall number of its elements and n is the
number of input lags, it is necessary to form an initial training set
which is needed to select the subsets of the training instances in
the recursive prediction strategy. Let $S = \{X, Y\}$ denote the initial training set formed on the basis of the time-series y, defined as:
$$X = \begin{bmatrix} y_n & y_{n-1} & \cdots & y_1 \\ y_{n+1} & y_n & \cdots & y_2 \\ \vdots & \vdots & \ddots & \vdots \\ y_{N-1} & y_{N-2} & \cdots & y_{N-n} \end{bmatrix}_{(N-n)\times n} \qquad (7)$$

$$Y = \begin{bmatrix} y_{n+1} \\ y_{n+2} \\ \vdots \\ y_N \end{bmatrix}_{(N-n)\times 1} \qquad (8)$$
In (7), each row of matrix X represents one training vector $x_k \in \mathbb{R}^n$ that consists of lagged input variables, while in (8) each element of vector Y represents a target value $y_k \in \mathbb{R}$, $k = 1, \ldots, (N-n)$, assigned to the corresponding vector in matrix X. In the case of time-series prediction, the training vectors $x_k \in X$ are defined on the basis of the values of the time-series from the previous steps, while the outputs $y_k \in Y$ represent the target values of the time series. $M = (N-n)$ denotes the size of the initial training set, that is, the overall number of input-output pairs from X and Y. At the same time, n denotes the number of features, that is, the size of the regressor. Unlike the features in the case of regression, in time-series prediction the features of the model are defined through the values of the time-series from the previous steps. The methodology can also be used in the case when the input vectors contain additional features which were not derived from the time-series, such as sparse input vectors, which are not formed from all lagged variables. With the aim of simplifying the notation, it is assumed here that the input vectors consist only of the variables of the time-series with continuous lags.
Let us suppose that predictions for H steps of the time-series y are needed, that is, $y_{N+1}, y_{N+2}, \ldots, y_{N+H}$. H represents the size of the horizon of the prediction problem, or the number of time steps that need to be predicted by a model f. It is important to clarify that H defines the prediction horizon of the problem, and not of the model f, which is one in every prediction step. For this purpose, the recursive prediction strategy is the most intuitive and most commonly used method [47]. Its basic characteristic is that it relies on the predictions from previous steps in order to predict future steps, in place of the real values which are not yet available at that point in time. Accordingly, a single prediction model is trained, which first performs the prediction one step ahead:

$$\hat{y}_{N+1} = f(y_N, y_{N-1}, \ldots, y_{N-n+1}) \qquad (9)$$
Then to predict the following step, the same model is used:
$$\hat{y}_{N+2} = f(\hat{y}_{N+1}, y_N, \ldots, y_{N-n+2}) \qquad (10)$$
The prediction for step H then follows from:
$$\hat{y}_{N+H} = f(\hat{y}_{N+H-1}, \hat{y}_{N+H-2}, \ldots, \hat{y}_{N+H-n}) \qquad (11)$$
The main shortcoming of the recursive prediction strategy lies in the propagation and accumulation of errors through the steps of the prediction process. This becomes especially evident as the prediction horizon H increases. If the size of the regressor n is greater than H, then there are n - H known values in the regressor when predicting step H. On the other hand, if H exceeds n, all of the data in the regressor will consist of predicted values, which has a negative effect on the quality of the subsequent predictions.
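As an illustration, here is a minimal Python sketch of Eqs. (9)-(11); `f` stands for any fitted one-step-ahead model (an assumption here), and the function simply feeds each prediction back into the regressor.

```python
import numpy as np

def recursive_forecast(f, y, n, H):
    """Predict H steps ahead from series y using a one-step model f."""
    regressor = list(y[-n:][::-1])        # (y_N, y_{N-1}, ..., y_{N-n+1})
    predictions = []
    for _ in range(H):
        y_hat = f(np.array(regressor))    # one step ahead, Eq. (9)
        predictions.append(y_hat)
        regressor = [y_hat] + regressor[:-1]  # shift the prediction in
    return np.array(predictions)
```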
3.2. Instance selection
The primary aim of the proposed methodology is to improve the quality of prediction through the choice of subsets of the training instances for each of the H prediction steps. In order to achieve that, prior to training the model during each step of the prediction, it is necessary to form a new training set which, based on the selection criterion, fits the current vector used for the prediction. The term current vector, $x_t = \{y_N, y_{N-1}, \ldots, y_{N-n+1}\}$, refers to the last n observations in the first prediction step only. After that, it is updated in every prediction step, in accordance with the recursive prediction strategy. If $x_k$, $k = 1, \ldots, (N-n)$, denotes the kth training vector, that is, the kth row of the matrix in (7), and $x_t$ the current vector which is used to predict the following step, the amount of mutual information between $x_k$ and $x_t$ defines one criterion for determining the similarity between them. Both training vectors $x_k$ and $x_t$ consist of n lagged variables. Thus, the MI between them is computed according to (3), where n represents the overall number of input-output pairs.
Here too, as in the case of the kNN algorithm, the starting point is the assumption that similar training vectors have similar target values. The addition here is that the measure of that similarity is determined by the amount of mutual information. If $x_k$ and $x_t$ share a significant amount of mutual information, their target values will share it as well. In other words, on the basis of the known target value of $x_k$, the uncertainty about the unknown target value of the vector $x_t$, which needs to be predicted, can be reduced. Thus, by ranking the vectors of the initial training set with respect to $x_t$ and selecting the subset of those which share a greater amount of mutual information with it, a greater prediction quality can be achieved when the model is applied to $x_t$. The proposed approach for the selection of subsets of the training instances in accordance with the mutual information criterion is shown in algorithm 1.
Algorithm 1. The selection of the instances based on the evaluation of mutual information

1. Initialization: from the time series y defined in (6), form the initial training set $S = \{X, Y\}$ in accordance with the selected size of the regressor n and Eqs. (7) and (8). Form the initial test vector $x_t = \{y_N, y_{N-1}, \ldots, y_{N-n+1}\}$ from the last n values of the time series y.
2. Estimate the amount of mutual information between each vector of the initial training set $x_k$, $k = 1, \ldots, (N-n)$, and the current vector $x_t$, on the basis of (3), and store those values in vector V(k).
3. Sort set S in descending order in accordance with the evaluated amount of mutual information from vector V. Then, on the basis of V and S:
4. Define the lowest allowed limit for the amount of mutual information between the vectors from the initial training set and the current vector $x_t$, denoted by $\beta$. On the basis of vector V, choose from the set S those input-output pairs for which $V(k) > \beta$, $k = 1, \ldots, (N-n)$, holds. Use them to form the reduced training subset $S_r = \{X_r, Y_r\}$; or,
5. Define the overall number of vectors which are retained in the training subset, denoted by r. Form the reduced training subset $S_r = \{X_r, Y_r\}$ on the basis of the first r input-output pairs from S.
After forming the initial training set, the first step in algorithm 1 is the evaluation of the mutual information between each input vector from X and the current vector $x_t$ used to predict the following step. Then vector V is formed, which determines the relative significance (ranking) of each input vector from X in relation to $x_t$. The following step is the sorting of set S in descending order in relation to vector V. In order to select the vectors, two options are available: defining the overall number of training vectors r which are retained in the training set, or defining the lowest limit for the amount of mutual information. If the option which defines the limit of the mutual information is selected, the algorithm needs to be provided with a parameter $\beta$. Parameter $\beta$ determines the 'sensitivity' of the algorithm, that is, the minimal allowed amount of similarity between the input vectors from X and $x_t$. All of the training examples from S whose input vectors share an amount of information with $x_t$ greater than $\beta$ are added to the training subset $S_r = \{X_r, Y_r\}$. If the option which defines the overall number of training vectors to be retained is selected, the algorithm needs to be provided with a parameter r. The training subset $S_r = \{X_r, Y_r\}$ is then formed from the first r examples of S.
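A minimal Python sketch of algorithm 1 follows, reusing the knn_mutual_information function sketched in Section 2. Treating the two n-dimensional vectors as n paired scalar samples follows the paper's remark that n plays the role of the number of input-output pairs in (3); function and parameter names are illustrative assumptions.

```python
import numpy as np

def select_instances(X, Y, x_t, r=None, beta=None, k=6):
    """Algorithm 1: keep the training pairs that share the most MI with x_t."""
    # Step 2: MI between each training vector x_k and the current vector x_t,
    # estimated from their n paired lagged values via Eq. (3).
    V = np.array([knn_mutual_information(x_k, x_t, k=k) for x_k in X])
    order = np.argsort(V)[::-1]           # step 3: sort S by descending MI
    if beta is not None:
        keep = order[V[order] > beta]     # step 4: MI-threshold option
    else:
        keep = order[:r]                  # step 5: fixed-size option
    return X[keep], Y[keep]               # the reduced subset S_r
```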
Previously, the terms training instance and training vector were used. The term training instance refers to one training vector with its added target value, i.e. one input-output pair from X and Y. However, in order to simplify the notation, these two terms will be used synonymously, although it is a choice of training instances that is made.
The improvement in the quality of the prediction obtained by using the proposed algorithm depends on the quality of the initial training set (the number of available training vectors, the size of the regressor and the selected features), but also on the values of the selected parameters r and $\beta$. As indicated in Section 5, these values can be determined based on the function of the mutual information between the vectors of the initial training set and the current vector which is used in the prediction. As the results of the testing indicate, in this case a good choice for parameter r is 25, and for parameter $\beta$ is 0.2. This choice is suitable even though the values of these parameters can vary among different data sets and depend on the MI function between
the vectors of the initial training set and the current vector used in
the prediction.
3.3. Prediction strategy with instance selection
The procedure for the selection of the subsets of instances
during each step of the prediction in the recursive prediction
strategy is shown in algorithm 2.
Algorithm 2. The selection of instances in the recursive prediction strategy

1. Initialization: from the time series y defined in (6), form an initial training set $S = \{X, Y\}$ in accordance with the selected size of the regressor n and Eqs. (7) and (8). Form the initial test vector $x_t = \{y_N, y_{N-1}, \ldots, y_{N-n+1}\}$ from the last n values of the time series y.
2. On the basis of points 2 to 5 of algorithm 1, form a reduced training set $S_r = \{X_r, Y_r\}$.
3. On the basis of the reduced training set, train the local LS-SVM prediction model.
4. On the basis of this model, form and store the prediction for the following step of the time series in vector $\hat{y}_t$, $t = 1, \ldots, H$.
5. Update vector $x_t$ by shifting it one place and adding the prediction formed in step 4, in accordance with the recursive prediction strategy.
6. Repeat steps 2 to 5 H times, where H represents the size of the prediction horizon, that is, the number of steps that need to be predicted.
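Putting the pieces together, here is a minimal Python sketch of algorithm 2 under the same assumptions as the earlier sketches: select_instances is the algorithm 1 sketch above, and train_ls_svm stands in for an LS-SVM trainer such as the toolbox of [54] (a sketch of one appears in Section 4).

```python
import numpy as np

def forecast_with_selection(y, n, H, r=25, k=6):
    """Algorithm 2: recursive forecasting with per-step instance selection."""
    y = np.asarray(y, dtype=float)
    X = np.array([y[i:i + n][::-1] for i in range(len(y) - n)])  # Eq. (7)
    T = y[n:].copy()                                             # Eq. (8)
    x_t = y[-n:][::-1].copy()               # step 1: initial test vector
    predictions = []
    for _ in range(H):                      # step 6: one pass per horizon step
        X_r, Y_r = select_instances(X, T, x_t, r=r, k=k)   # step 2
        model = train_ls_svm(X_r, Y_r)                     # step 3
        y_hat = model(x_t)                                 # step 4
        predictions.append(y_hat)
        x_t = np.concatenate(([y_hat], x_t[:-1]))          # step 5: shift
    return np.array(predictions)
```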
4. Least Squares support vector machines
In this section, a brief review of LS-SVMs is given, which are
used to train the forecasting models in the experiments.
Least squares support vector machines, as a reformulation of SVMs, are commonly used for function estimation and for solving non-linear regression problems [48]. The main property of these methods is that they obtain a solution from a set of linear equations instead of solving a quadratic programming problem, as in SVMs. Therefore, LS-SVMs have a significantly shorter computing time and are easier to optimize, but lack sparseness in the solution. The regression model in primal weight space is expressed as:

$$y(x) = \omega^T \varphi(x) + b \qquad (12)$$

where $\omega$ represents the weight vector, b represents a bias term and $\varphi(x)$ is a function which maps the input space into a higher-dimensional feature space.
LS-SVMs formulate the optimization problem in primal space defined by:

$$\min_{\omega, b, e} J_p(\omega, e) = \frac{1}{2}\omega^T\omega + \gamma\,\frac{1}{2}\sum_{k=1}^{N} e_k^2 \qquad (13)$$
subject to the equality constraints expressed as:

$$y_k = \omega^T\varphi(x_k) + b + e_k, \quad k = 1, \ldots, N \qquad (14)$$

where $e_k$ represents the error variables and $\gamma$ is a regularization parameter which gives the relative weight to errors and should be optimized by the user.
Solving this optimization problem in dual space leads to obtaining $\alpha_k$ and b in the solution represented as:

$$y(x) = \sum_{k=1}^{N}\alpha_k K(x, x_k) + b \qquad (15)$$
The dot product $K(x, x_k) = \varphi(x)^T\varphi(x_k)$ represents a kernel function, while $\alpha_k$ is a Lagrange multiplier. When using a radial basis function (RBF) kernel defined by:

$$K(x, x_k) = e^{-\|x - x_k\|^2/\sigma^2} \qquad (16)$$
the optimal parameter combination ($\gamma$, $\sigma$) should be established, where $\gamma$ denotes the regularization parameter and $\sigma$ is a kernel parameter. For this purpose, a grid-search algorithm in combination with k-fold cross-validation is a commonly used method [49].
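To illustrate why no quadratic program is needed, the following Python sketch solves the single linear system of the standard LS-SVM dual formulation (Eqs. (13)-(15)) with the RBF kernel of Eq. (16); the hyper-parameter defaults are placeholders that would be set by the grid search described above.

```python
import numpy as np

def rbf_kernel(A, B, sigma):
    """K(x, x_k) = exp(-||x - x_k||^2 / sigma^2), Eq. (16)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / sigma ** 2)

def train_ls_svm(X, y, gamma=10.0, sigma=1.0):
    """Solve the dual system [[0, 1^T], [1, K + I/gamma]] [b; alpha] = [0; y]."""
    N = len(X)
    X = np.asarray(X, dtype=float).reshape(N, -1)
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = rbf_kernel(X, X, sigma) + np.eye(N) / gamma
    sol = np.linalg.solve(A, np.concatenate(([0.0], y)))
    b, alpha = sol[0], sol[1:]
    # Eq. (15): y(x) = sum_k alpha_k K(x, x_k) + b
    return lambda x: (rbf_kernel(np.asarray(x, float).reshape(1, -1),
                                 X, sigma) @ alpha)[0] + b
```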
5. Experimental results and discussion
This section presents the experimental results and discussion of
applying the proposed approach to datasets taken from the NN5 time-
series forecasting competition [50]. The goal of the experimental study
is to compare the performance of proposed instance selection
methodology in combination with a recursive prediction strategy in
comparison to the strategy without instance selection. The section
begins with the description of the datasets used in the experiments,
following the experimental setup. Then the results are presented, and
nally a discussion of the results concludes the section.
In the last decade, several time series forecasting competitions have been organized in order to compare and evaluate the performance of different machine learning methods. Among them, the NN5 competition is one of the most interesting, since it includes the challenges of a real-world multi-step-ahead forecasting task, namely multiple time series, outliers, missing values, as well as multiple overlying seasonalities [51]. Each of the 11 time series in the reduced data set represents roughly two years of daily cash withdrawal amounts (735 data points) at ATMs in various cities in the UK. The reduced data set is a typical representative of the entire data set, which contains 111 time series. For each time series, competition participants needed to forecast the values for the next 56 days using the given historical data points as accurately as possible. The prediction quality was evaluated using the symmetric mean absolute percentage error (SMAPE):
$$\mathrm{SMAPE}(\%) = 100\,\frac{1}{H}\sum_{i=1}^{H}\frac{|y_i - \hat{y}_i|}{(y_i + \hat{y}_i)/2} \qquad (17)$$
where $y_i$ and $\hat{y}_i$ are the real and the predicted values of the time-series in the ith prediction step and H is the size of the prediction horizon. Since this is a relative error measure, the errors can be averaged over all of the time series in the reduced data set to obtain a mean SMAPE defined as:

$$\mathrm{meanSMAPE}(\%) = \frac{1}{11}\sum_{i=1}^{11}\mathrm{SMAPE}_i \qquad (18)$$

where $\mathrm{SMAPE}_i$ denotes the SMAPE of the ith time series.
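For completeness, a direct Python transcription of Eqs. (17) and (18); the argument names are illustrative.

```python
import numpy as np

def smape(actual, predicted):
    """Eq. (17): symmetric mean absolute percentage error, in percent."""
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    return 100.0 * np.mean(np.abs(actual - predicted)
                           / ((actual + predicted) / 2.0))

def mean_smape(series_pairs):
    """Eq. (18): SMAPE averaged over the (actual, predicted) pairs
    of all 11 series in the reduced data set."""
    return np.mean([smape(a, p) for a, p in series_pairs])
```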
The NN5 time-series require a missing-value interpolation pre-processing step. Under this step, two types of anomalies need to be considered: zero values, which indicate that no money withdrawal occurred, and missing observations, for which no value was recorded. About 2.5% of the data consists of missing values. For missing-value interpolation, the method proposed in [51] is used: if $y_m$ is the missing value, it is replaced with the median of those values of the set $\{y_{m-365}, y_{m+365}, y_{m-7}, y_{m+7}\}$ which are available.
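Here is a minimal sketch of that interpolation rule, assuming the series is a NumPy array with missing observations encoded as NaN (an encoding choice made here, not in the paper):

```python
import numpy as np

def fill_missing(y):
    """Replace each NaN y[m] with the median of the available values
    among {y[m-365], y[m+365], y[m-7], y[m+7]}."""
    y = np.asarray(y, dtype=float).copy()
    for m in np.flatnonzero(np.isnan(y)):
        neighbours = [m - 365, m + 365, m - 7, m + 7]
        vals = [y[i] for i in neighbours
                if 0 <= i < len(y) and not np.isnan(y[i])]
        if vals:                      # leave the gap if nothing is available
            y[m] = np.median(vals)
    return y
```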
A recursive forecasting strategy requires the setting of the regressor size n, Eqs. (9)-(11). Several approaches have been proposed in the literature to select this value, such as the one based on the partial autocorrelation function (PACF) proposed in [11]. Since this aspect is not the central topic of this paper, the regressor size is determined based on visual inspection of the time-series and the analysis presented in [11] and [51]. It is set to 14 for all 11 time-series in the reduced data set. Previous analyses of
this data set have shown 'day of the week' seasonality, so two weeks is a long enough period of time to capture the time-series pattern and to maintain the same setup between the experiments.
In order to obtain more accurate results, a rolling forecast origin evaluation was used, as proposed in [52]. This approach overcomes the shortcomings of fixed-origin evaluation, including its dependence on the randomness contained in the particular forecasting origin and the limited sample size of errors [53]. In rolling-origin evaluation, the forecasting origin is successively updated, and forecasts from the new origins are produced. In general, the rolling-origin procedure provides H(H+1)/2 forecasts, against H from the fixed origin, where H denotes the number of steps that need to be predicted.
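A minimal sketch of this evaluation scheme, assuming forecast(history, h) is any h-step forecasting routine (such as the algorithm 2 sketch above) and that the last H points of the series are held out:

```python
import numpy as np

def rolling_origin_errors(y, forecast, H):
    """Roll the origin over the last H points; horizons shrink from H to 1,
    yielding H(H+1)/2 individual forecast errors."""
    y = np.asarray(y, dtype=float)
    errors = []
    for h in range(H, 0, -1):
        origin = len(y) - h               # history grows as the origin rolls
        preds = np.asarray(forecast(y[:origin], h))
        errors.extend(np.abs(preds - y[origin:]))
    return np.array(errors)
```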
The instance selection procedure with local model training and recursive forecasting of the time-series for one prediction horizon ahead was presented in Section 3.3 with algorithm 2. The first step in the algorithm is the selection of a new training set ($X_r$, $Y_r$) from the initial set, based on the current forecasting vector $x_t$ and the selection option described in algorithm 1. After that, the optimal ($\gamma$, $\sigma$) pair is determined based on ($X_r$, $Y_r$) using a grid search with 10-fold cross-validation. The local LS-SVM model is then employed for the time-series prediction. The whole process is repeated H times by employing the recursive prediction strategy. An LS-SVM Matlab toolbox can be found in [54].
Several models are generated with different training sets, formed from the initial training set using either the MI-threshold option or the number-of-vectors option. To be clear, the term model as used here refers to a set of local models, each trained with its own training set generated from the initial set using the same MI criterion; for every local model this training set is different. The feature set for every model is the same, i.e. all the local models have the same feature structure. In order to assess the increase in accuracy of the proposed models and their contribution to forecasting research, their accuracy was compared with two statistical benchmark models, ARIMA and Holt-Winters, but also with a model which implements instance selection based on the distance metric proposed in [32,33].
The amount of MI between the initial training set vectors and the test vector (i.e. the vector with which the prediction is performed) for the first prediction step in the prediction horizon is given in Fig. 1. The horizontal axis denotes the index of the vector in the initial training set, while the vertical axis represents the amount of MI. The first block from the left in Fig. 1 relates to the first time-series in the data set, the second block to the second time-series, and so on.

Fig. 1. The amount of MI between the initial training set vectors and the test vector for the first prediction step.

From Fig. 1 it can be observed that most of the vectors from the initial training set have MI values smaller than 0.1 and that there is a small
number of vectors with MI values greater than 0.3. It can also be observed that the MI curves have a similar shape for all 11 time-series, although their slopes and maximum values differ between the series. Also, for all 11 time series in Fig. 1, the total number of significant vectors (those with MI greater than 0) is around 400. Consider, for example, the time-series which correspond to block 9 and block 10 in Fig. 1. The time-series that corresponds to block 9 has a significantly larger number of vectors with a larger amount of MI than the one that corresponds to block 10. This indicates the quality of the initial training set with regard to the current vector with which the forecasting model is employed. In this way it is possible to assess the expected quality of the future predictions: predictions obtained with a model trained with a large number of vectors that share a large amount of MI with the current test vector should be more accurate than predictions obtained with a model trained with fewer vectors that share a smaller amount of MI with it.
The selection strategy for parameter r, which determines the overall number of vectors to be selected during a given step, is the following: select as few vectors as possible, yet enough to train the LS-SVM model, such that they share as much information as possible with the test vector during the first prediction step. Although it is possible to select a different value of parameter r during each of the prediction steps, with the aim of simplifying the results, parameter r has been selected on the basis of the mutual information during the first prediction step and remains unchanged during all the subsequent prediction steps. In addition, in the case of the recursive strategy, during the first prediction step all of the values in the regressor are known (true), and thus the evaluation of the mutual information is more precise than in the following steps.
When it comes to the selection of parameter $\beta$, which defines the lowest limit of mutual information between two vectors, it is necessary to select as large a value as possible while still leaving a sufficient number of vectors to form the model. As the results of the tests have shown, a sufficient number of vectors for training the LS-SVM model ranges from several dozen to one hundred. Parameter $\beta$ was also selected on the basis of the available number of vectors with a suitable amount of information during the first step of the prediction, and it remains unchanged in all the steps of the prediction. According to the previous analysis and Fig. 1, for testing purposes 27 different models were built, denoted with: