
Miloš B. Stojanović (a,*), Miloš M. Božić (b), Milena M. Stanković (b), Zoran P. Stajić (b)

(a) College of Applied Technical Sciences, Aleksandra Medvedeva 20, 18000 Niš, Serbia
(b) Faculty of Electronic Engineering, University of Niš, Aleksandra Medvedeva 14, 18000 Niš, Serbia

Neurocomputing 141 (2014) 236–245, http://dx.doi.org/10.1016/j.neucom.2014.03.006

* Corresponding author. Tel.: +381 018 4233099 (home), +381 065 8136718 (mobile). E-mail addresses: milosstojanovic10380@yahoo.com, milos.stojanovic@vtsnis.edu.rs (M.B. Stojanović). URL: http://www.vtsnis.edu.rs/index_english.html (M.B. Stojanović). Home address: Bulevar doktora Zorana Đinđića 29/5, 18000 Niš, Serbia.

Article info

Article history: Received 21 September 2012; received in revised form 25 November 2013; accepted 19 March 2014; available online 8 April 2014. Communicated by P. Zhang.

Keywords: instance selection; mutual information; time-series prediction

Abstract

Training set instance selection is an important preprocessing step in many machine learning problems, including time series prediction, and has to be considered in practice in order to increase the quality of the predictions and possibly reduce training time. Recently, the usage of mutual information (MI) has been proposed in regression tasks, mostly for feature selection and for identifying the real data in data sets that contain noise and outliers. This paper proposes a new methodology for training set instance selection for long-term time series prediction. The proposed methodology combines a recursive prediction strategy with an advanced instance selection criterion: the nearest neighbor based MI estimator. An application of the concept of MI is presented for the selection of training instances, based on the MI computed between the initial training set instances and the current forecasting instance in every prediction step. The novelty of the approach lies in the fact that it fits the initial training subset to the current forecasting instance, and consequently reduces the uncertainty of the prediction. In this way, by selecting instances which share a large amount of MI with the current forecasting instance in every prediction step, error propagation and accumulation can be reduced, both of which are well known shortcomings of the recursive prediction strategy, thus leading to better forecasting quality. Another element which sets this approach apart from others is that it is not proposed as an outlier detector, but for the instance selection of data which do not necessarily contain noise and outliers. The results obtained on the data sets from the NN5 competition in time series prediction indicate that the proposed method increases the quality of long-term time series prediction and reduces the number of instances needed for building the model.

© 2014 Elsevier B.V. All rights reserved.

1. Introduction

Time series forecasting is nowadays a key topic in various fields. In financial economics, stock exchange prices are predicted [1]; in computer science, the flow of data through networks and the frequency of access to websites [2]; in power systems, distribution companies forecast the load of the following day [3]; and so on. Conventional methods for time series forecasting, developed during the 1970s and 80s, among which the most popular were linear regression [4], Box–Jenkins ARIMA [5] and exponential smoothing [6], cannot always provide an accurate and unbiased estimation of time series in cases when the underlying system processes are typically non-linear, non-stationary and not defined a priori. In addition, very often the choice of the prediction method and the determination of its parameters depend on knowing the properties of the underlying process. In order to address these problems, machine learning models, among which the most popular are artificial neural networks (ANNs) [7] and support vector machines (SVMs) [8], have established themselves in the last two decades as serious contenders to classic statistical models in the area of forecasting. Basically, time series prediction in machine learning comes down to training a model which establishes a mapping between training set instances and their target values. Then, the trained model estimates future values based on current and past data samples. The determination of sufficient and necessary data is essential for training a good forecasting model. If the amount of data in the training set is insufficient, the forecasting of the model will be poor, and the model may be prone to underfitting. On the other hand, if the training set is too large, the information that it provides to the model could be unnecessary or redundant; as a result, the model could have poor generalization performance and may be prone to overfitting.

Predictions can be classified into two categories, short-term and long-term, based on the number of steps in the prediction horizon: when a one-step ahead prediction is needed, it is referred to as short-term prediction, but when multi-step ahead predictions are needed, it is called long-term time-series prediction. Note that this differs from the application-domain usage of the terms. Take for example electric load forecasting: if hourly predictions for one day ahead are needed, it is usually called a short-term forecasting problem, although it requires values for the next 24 steps; if yearly predictions for the next 10 or 20 years are needed, it is called a long-term forecasting problem, requiring predictions for the next 10 or 20 steps. By the number of steps that need to be predicted, however, both are classified as long-term time series prediction. Long-term predictions, where multiple steps ahead have to be predicted, are especially challenging. The problem that occurs in long-term prediction is that uncertainty increases with the number of steps in the prediction horizon; it depends on several factors, such as the accumulation of errors and a lack of information. Moreover, with the increase in the number of steps that need to be predicted, a model selection problem emerges, because the environment in which the model was developed may change over time [9]. Another problem that affects the quality of long-term predictions occurs when the time series consists of daily or shorter time intervals, i.e. when it contains high frequency data [10]. High frequency data represent a specific type of forecasting problem, rendering conventional methods inappropriate and demanding new approaches [11].

The selection of an appropriate subset of instances to be included in the initial training set is a very important preprocessing step, especially in long-term time series prediction tasks. It may provide improvements in terms of the quality of the output results and a reduction of computational time. In the literature, this problem is considered from several different perspectives. Instance selection can be approached from the aspect of removing outliers and noise from distorted data sets, as presented in [12–14]; instance selection of data that contain outliers aims to remove elements from the training set that in some way differ from most other elements in the input set. From another perspective, this problem can be considered as 'data shifting', where the joint distribution of inputs and outputs differs between the training and test stages, as presented in [15–17]; it usually appears in non-stationary environments, when the training environment is different from the test one, whether due to a temporal or a spatial change. There are also various methods based on active learning that deal with the selection of relevant instances, some of which can be found in [18–20]. The key idea behind active learning is that a machine learning algorithm can achieve greater accuracy with fewer training instances if it is allowed to choose the data from which it learns. Active learning is also closely related to covariate shift, where the training input distribution is different from the test input distribution [21]. Finally, the instance selection of data which do not necessarily contain noise and outliers determines a subset of the initial training set; it can be used to train a more accurate model, with a possible reduction in computational time.

In order to perform the selection of instances that the learning algorithm will use, three main approaches have been applied: the incremental, the decremental and the batch approach, as presented in [22–24]. While in the incremental approach the selection algorithm starts from an empty set of instances and adds them iteratively, in the decremental approach the selection algorithm starts from the full set of available instances and removes those which do not meet the predefined selection criterion. The batch method performs several iterations through the initial training set before removing one instance; in each iteration it marks instances that are candidates to be removed in the next iteration. Recently, evolutionary algorithms, boosting techniques and pruning techniques have also been used to tackle this problem [25–27]. According to the selection strategy, instance selection can be tackled with filter and wrapper methods, as presented in [28,29]. In the filter methods, the selection criterion uses a selection function which is independent from the training algorithm used to form the regression model. In the wrapper methods, on the other hand, the selection criterion is based on the evaluation measure obtained by the regression model; in other words, it is embedded into the evaluation function of the model. In this way, instances that do not contribute to the prediction quality are discarded from the training set.

Most of the research on instance selection done so far refers to classification problems [30], while only a few papers deal with instance selection in regression tasks, especially in the case of long-term time series prediction. For example, [31] shows a method of k-surrounding neighbors for the selection of input vectors, while the output is calculated with the k-nearest neighbors (kNN) algorithm. In [13] a genetic algorithm is presented which performs feature and instance selection for linear regression models. In [32,33] a new distance function, which integrates the Euclidean distance and the dissimilarity of the trend of a time series, is defined as a similarity measure between two instances for long-term time series prediction; by selecting similar instances in the training set for each testing instance based on this modified kNN approach, prediction quality can be increased. Only recently has a mutual information (MI) estimator based on nearest neighbors, which allows MI estimation directly from the data set, been introduced for instance selection in time series prediction, with the aim of removing outliers and noise from highly distorted data sets [14,34]. The applied algorithm determines the loss of MI of each instance with respect to its neighbors, in such a way that if the loss of MI is similar to that of the instances near the studied one, then this instance is included in the training data set. This approach has proved successful in situations where it has been applied to training sets which were artificially distorted by adding noise or outliers. In [35], the concept of MI is applied to improving short-term load forecasting through the selection of instances with load patterns similar to the current forecasting instance.

The research presented in this paper is motivated by the work presented in [14,34] and represents an extension of the approach proposed in [35]. It is framed within the instance selection of data which do not necessarily contain noise and outliers. It proposes a methodology for training subset selection in long-term recursive time series prediction, using MI to decide which instances should or should not be included in the training data set. The methodology is based on a decremental approach and a filter method which uses MI as the selection criterion. In this way, by selecting instances which share a large amount of MI with the current forecasting instance in every prediction step, error propagation and accumulation can be reduced. Since these are well known shortcomings of the recursive prediction strategy, this can lead to better forecasting quality. In this paper, least squares support vector machines (LS-SVMs) are used as the nonlinear models with which the application of the proposed methodology is presented.

The rest of the paper is organized as follows: Section 2 presents the formulation of MI and describes the method used to compute it, followed by Section 3, which introduces the methodology for selecting training set instances. Section 4 briefly reviews the basics of LS-SVMs. Section 5 includes a variety of experiments to verify the proposed approach, and finally, Section 6 draws the conclusions.

2. Mutual information

In this section, the basics of mutual information for continuous variables are briefly explained, followed by the k-nearest neighbors approach used to estimate it.

Mutual information is commonly used for measuring dependences between random variables in a way that does not make any assumptions about the nature of their underlying relationships. Therefore, MI is in some cases more powerful than estimators which only consider the linear relationships between the variables, such as the correlation coefficient [36]. Moreover, MI can naturally be defined between groups of variables, and thus can be applied for feature selection [37–40] and instance selection [14,34], independently from the final prediction model. The MI of two random variables X and Y quantifies the information that X and Y share. More formally, MI measures how much knowledge of one variable reduces uncertainty about the other. The definition of MI is derived from entropy in information theory.

Let us denote X and Y as continuous random variables with a joint probability density function $\mu_{X,Y}(x,y)$ and marginal density functions $\mu_X(x)$ and $\mu_Y(y)$. The MI between the two random variables X and Y can be computed as:

$$I(X;Y) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \mu_{X,Y}(x,y)\,\log\frac{\mu_{X,Y}(x,y)}{\mu_X(x)\,\mu_Y(y)}\,dx\,dy \qquad (1)$$

The estimation of the joint probability density function (PDF) of the pair (X, Y) is needed for the computation of MI. The most commonly used methods for PDF estimation are histograms and kernel estimators, presented in [41,42]. However, their usage is commonly limited to functions of one or two variables, because the number of samples needed for PDF estimation increases exponentially with the number of variables. As a result, the estimator used in this paper is the k-nearest neighbor (kNN) based MI estimator proposed in [43]. The novelty of this estimator lies in its ability to estimate the MI between two multi-dimensional variables directly from the data set. This avoids direct PDF estimation, which is the most problematic issue in MI estimation.

Let us consider the set of N input–output pairs $z_i = (x_i, y_i)$, $i = 1, \ldots, N$, which are independent and identically distributed realizations of a random variable $Z = (X, Y)$, where x and y can be either scalars or vectors. For any pair of points z and z', the maximum norm is used for the comparison of input–output pairs, defined by:

$$||z - z'|| = \max\{||x - x'||,\, ||y - y'||\} \qquad (2)$$

while any norms can be used in the X and Y spaces.

The basic idea is to estimate I(X, Y) from the distances in the spaces X, Y and Z from $z_i$ to its k nearest neighbors, averaged over all $z_i$. Let us denote by $z_{k(i)} = (x_{k(i)}, y_{k(i)})$ the kth nearest neighbor of $z_i$. It should be noted that $x_{k(i)}$ and $y_{k(i)}$ are the input and output parts of $z_{k(i)}$ respectively, and thus not necessarily the kth nearest neighbors of $x_i$ and $y_i$. Let us define $d_i^X = ||x_i - x_{k(i)}||$, $d_i^Y = ||y_i - y_{k(i)}||$ and $d_i^Z = ||z_i - z_{k(i)}||$; thus, $d_i = \max(d_i^X, d_i^Y)$. Subsequently, the number $n_i^X$ of points $x_j$ whose distance from $x_i$ is strictly less than $d_i$ is counted, and similarly the number $n_i^Y$ of points $y_j$ whose distance from $y_i$ is strictly less than $d_i$. Then, I(X, Y) can be estimated by:

$$\hat{I}(X,Y) = \psi(k) - \frac{1}{N}\sum_{i=1}^{N}\left[\psi(n_i^X + 1) + \psi(n_i^Y + 1)\right] + \psi(N) \qquad (3)$$

where $\psi$ is the digamma function defined as:

$$\psi(t) = \frac{\Gamma'(t)}{\Gamma(t)} = \frac{d}{dt}\ln\Gamma(t) \qquad (4)$$

and $\Gamma(t)$ is the gamma function defined as:

$$\Gamma(t) = \int_{0}^{\infty} u^{t-1} e^{-u}\,du \qquad (5)$$

The function $\psi$ satisfies the recursion $\psi(x+1) = \psi(x) + 1/x$, and $\psi(1) = -C$, where $C \approx 0.5772156$ is the Euler–Mascheroni constant. This paper implements the type of estimator presented in (3), which is one of the two proposed in [43]; the other one can be found in [44]. Both types of MI estimators depend on the value chosen for k, which controls the bias–variance tradeoff. Higher values of k imply a lower statistical error; on the other hand, systematic errors increase with an increase in k. Thus, to maintain a balance between these two errors, as recommended in [45], a mid-range value of k = 6 will be used. Additionally, an approach for the selection of parameter k that relies on resampling methods can be found in [46].
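To make the estimator concrete, the computation behind Eqs. (2) and (3) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used in the paper: it relies on SciPy's cKDTree and digamma, and approximates the strict inequality "distance < d_i" by shrinking the query radius by a small epsilon.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def knn_mutual_information(x, y, k=6):
    """kNN-based MI estimator of Eq. (3) (Kraskov et al. [43])."""
    # Accept scalars-per-sample (1-D) or vectors-per-sample (2-D).
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    y = np.asarray(y, dtype=float).reshape(len(y), -1)
    n = len(x)
    z = np.hstack([x, y])

    # d_i: maximum-norm distance (Eq. (2)) from z_i to its k-th neighbor;
    # k + 1 neighbors are queried because z_i is its own nearest neighbor.
    d = cKDTree(z).query(z, k=k + 1, p=np.inf)[0][:, -1]

    # n_i^X, n_i^Y: points strictly closer than d_i in the marginal spaces
    # (the point itself is subtracted from each count).
    tx, ty = cKDTree(x), cKDTree(y)
    nx = np.array([len(tx.query_ball_point(x[i], d[i] - 1e-12, p=np.inf)) - 1
                   for i in range(n)])
    ny = np.array([len(ty.query_ball_point(y[i], d[i] - 1e-12, p=np.inf)) - 1
                   for i in range(n)])

    # Eq. (3): psi(k) - <psi(n_x + 1) + psi(n_y + 1)> + psi(N).
    return digamma(k) - np.mean(digamma(nx + 1) + digamma(ny + 1)) + digamma(n)
```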

3. The methodology for instance selection

In this section, the methodology for training set instance selection is presented in combination with a recursive time-series prediction strategy.

3.1. Recursive time-series prediction strategy

Let us assume that, on the basis of the time series defined in (6):

$$y = (y_1, y_2, \ldots, y_{n-1}, y_n, y_{n+1}, \ldots, y_i, \ldots, y_N), \quad i = 1, \ldots, N \qquad (6)$$

where N represents the overall number of its elements and n is the number of input lags, it is necessary to form an initial training set, which is needed to select the subsets of the training instances in the recursive prediction strategy. Let S = {X, Y} denote the initial training set formed on the basis of the time series y, defined as:

$$X = \begin{bmatrix} y_n & y_{n-1} & \cdots & y_1 \\ y_{n+1} & y_n & \cdots & y_2 \\ \vdots & \vdots & & \vdots \\ y_{N-1} & y_{N-2} & \cdots & y_{N-n} \end{bmatrix}_{(N-n)\times n} \qquad (7)$$

$$Y = \begin{bmatrix} y_{n+1} \\ y_{n+2} \\ \vdots \\ y_N \end{bmatrix}_{(N-n)\times 1} \qquad (8)$$

In (7) each row of matrix X represents one training vector $x_k \in R^n$ that consists of lagged input variables, while in (8) each element of vector Y represents a target value $y_k \in R$, k = 1, ..., (N − n), assigned to the corresponding vector in matrix X. In the case of time-series prediction, the training vectors $x_k \in X$ are defined on the basis of the values of the time series from the previous steps, while the outputs $y_k \in Y$ represent the target values of the time series. M = (N − n) denotes the size of the initial training set, that is, the overall number of input–output pairs from X and Y. At the same time, n denotes the number of features, that is, the size of the regressor. Unlike the features in the case of regression, in time-series prediction the features of the model are defined through the values of the time series from the previous steps. The methodology can also be used in the case when the input vectors contain additional features which were not derived from the time series, as in the case of sparse input vectors, which are not formed from all lagged variables. With the aim of simplifying the notation, it is assumed that the input vectors consist only of the variables from the time series with continuous lags.
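As a small illustrative sketch (not part of the original implementation), the embedding of Eqs. (7) and (8) can be written as follows; each row of X holds the n lags ordered from the most recent to the oldest value:

```python
import numpy as np

def build_training_set(y, n):
    """Form the initial training set S = {X, Y} of Eqs. (7) and (8).

    y : the time series (y_1, ..., y_N) as a 1-D array.
    n : the regressor size (number of lagged inputs).
    """
    y = np.asarray(y, dtype=float)
    N = len(y)
    # Row k is (y_{n+k}, y_{n+k-1}, ..., y_{k+1}) in the paper's indexing.
    X = np.array([y[k:k + n][::-1] for k in range(N - n)])
    Y = y[n:]          # one-step-ahead targets y_{n+1}, ..., y_N
    return X, Y
```

For instance, with a series of six values and n = 3, the rows are (y_3, y_2, y_1), (y_4, y_3, y_2), (y_5, y_4, y_3) with targets y_4, y_5, y_6.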


Let us suppose that predictions for H steps of the time series y are needed, that is, $y_{N+1}, y_{N+2}, \ldots, y_{N+H}$. H represents the size of the horizon of the prediction problem, or the number of time steps that need to be predicted by a model f. It is important to clarify that H defines the prediction horizon of the problem, not of the model f, which is one in every prediction step. In order to achieve that aim, the recursive prediction strategy represents the most intuitive and most commonly used method [47]. Its basic characteristic is that it relies on predictions from previous steps in order to predict future steps, instead of on the real values, which at that point in time are not available. Accordingly, a single prediction model is trained, which first performs the prediction for one step ahead:

$$\hat{y}_{N+1} = f(y_N, y_{N-1}, \ldots, y_{N-n+1}) \qquad (9)$$

Then, to predict the following step, the same model is used:

$$\hat{y}_{N+2} = f(\hat{y}_{N+1}, y_N, \ldots, y_{N-n+2}) \qquad (10)$$

The prediction of step H follows, based on:

$$\hat{y}_{N+H} = f(\hat{y}_{N+H-1}, \hat{y}_{N+H-2}, \ldots, \hat{y}_{N+H-n}) \qquad (11)$$

The main shortcoming of the recursive prediction strategy lies in the propagation and accumulation of errors through the steps of the prediction process. This becomes especially evident with an increase in the prediction horizon H. If the size of the regressor n is greater than H, then there are n − H known data in the regressor for predicting step H. On the other hand, if H exceeds n, all of the data in the regressor will consist of predicted values, which has a negative effect on the quality of the following predictions.
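A hedged sketch of the recursive strategy of Eqs. (9)–(11), assuming an already trained one-step model f that takes the n lags ordered from the most recent value:

```python
import numpy as np

def recursive_forecast(f, y, n, H):
    """Recursive strategy of Eqs. (9)-(11): one model f, fed its own
    previous outputs, produces the H-step-ahead forecasts."""
    history = list(np.asarray(y, dtype=float))
    forecasts = []
    for _ in range(H):
        regressor = np.array(history[-1:-n - 1:-1])  # y_t, y_{t-1}, ..., y_{t-n+1}
        y_hat = float(f(regressor))
        forecasts.append(y_hat)
        history.append(y_hat)  # the prediction re-enters the regressor
    return np.array(forecasts)
```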

3.2. Instance selection

The proposed methodology has as its primary aim the improvement of the quality of prediction through the choice of subsets of the training instances for each of the H prediction steps. In order to achieve that, prior to training the model in each step of the prediction, it is necessary to form a new training set which, based on the selection criterion, fits in with the current vector used for the prediction. The term 'current vector', $x_t = \{y_N, y_{N-1}, \ldots, y_{N-n+1}\}$, refers to the last n observations in the first prediction step only; after that, it is updated in every prediction step, in accordance with the recursive prediction strategy. If $x_k$, k = 1, ..., (N − n), denotes the k-th training vector, that is, the k-th row of the matrix from (7), and $x_t$ the current vector which is used to predict the following step, then the amount of mutual information between $x_k$ and $x_t$ defines one criterion for the determination of the similarity between them. Both vectors, $x_k$ and $x_t$, consist of n lagged variables; thus, the MI between them is computed according to (3), where n represents the overall number of input–output pairs. Here too, as in the case of the kNN algorithm, the starting point is the assumption that similar training vectors have similar target values, with the addition that the measure of that similarity is determined by the amount of mutual information. If $x_k$ and $x_t$ share a significant amount of mutual information, their target values will share it as well. In other words, on the basis of the known target value of $x_k$, the uncertainty of the unknown target value of the vector $x_t$ which needs to be predicted can be reduced. Thus, by ranking the vectors of the initial training set with respect to $x_t$ and selecting the subset of those which share a greater amount of mutual information with it, it is possible to achieve a greater prediction quality when the model is employed with $x_t$. The proposed approach for the selection of subsets of the training instances in accordance with the criterion of mutual information is shown in Algorithm 1.

Algorithm 1. The selection of instances based on the evaluation of mutual information

1. Initialization. From the time series y defined in (6), form the initial training set S = {X, Y} in accordance with the selected regressor size n and Eqs. (7) and (8). Form the initial test vector $x_t = \{y_N, y_{N-1}, \ldots, y_{N-n+1}\}$ from the last n values of the time series y.
2. Estimate the amount of mutual information between each vector $x_k$, k = 1, ..., (N − n), of the initial training set and the current vector $x_t$, on the basis of (3), and store those values in vector V(k).
3. Sort set S in descending order in accordance with the estimated amounts of mutual information from vector V. Then, on the basis of V and S, either:
4. Define the lowest allowed limit β for the amount of mutual information between the vectors of the initial training set and the current vector $x_t$. On the basis of vector V, choose from set S those input–output pairs for which V(k) > β, k = 1, ..., (N − n), holds, and use them to form the reduced training subset $S_r = \{X_r, Y_r\}$; or,
5. Define the overall number r of vectors to be retained in the training subset. Form the reduced training subset $S_r = \{X_r, Y_r\}$ on the basis of the first r input–output pairs from S.

After forming the initial training set, the first step in Algorithm 1 is the evaluation of the mutual information between each input vector from X and the current vector $x_t$ used to predict the following step. Then vector V is formed, which determines the relative significance (ranking) of each input vector from X in relation to $x_t$. The following step is the sorting of set S in descending order in relation to vector V. In order to select the vectors, two options are available: the definition of the overall number r of training vectors which are retained in the training set, and the definition of the lowest limit for the amount of mutual information. If the option which defines the limit of the mutual information is selected, the algorithm needs to be provided with a parameter β. Parameter β determines the 'sensitivity' of the algorithm, that is, the minimal allowed amount of similarity between the input vectors from X and $x_t$. All of the training examples from S whose input vectors share an amount of information with $x_t$ greater than β are added to the training subset $S_r = \{X_r, Y_r\}$. If the option which defines the overall number of training vectors to be retained is selected, the algorithm needs to be provided with a parameter r; the training subset $S_r = \{X_r, Y_r\}$ is then formed from the first r examples of S.
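Combining the earlier MI-estimator sketch with the two selection options, Algorithm 1 can be sketched as follows. This is an illustration under the stated assumptions, not the authors' implementation; knn_mutual_information is the function sketched in Section 2.

```python
import numpy as np

def select_instances(X, Y, x_t, k=6, beta=None, r=None):
    """Sketch of Algorithm 1: rank the training instances by their MI
    with the current vector x_t and keep those above the threshold beta
    (step 4) or the top r of them (step 5)."""
    # Step 2: MI between each training vector x_k and x_t, computed with
    # Eq. (3) over the n lag positions treated as paired samples.
    V = np.array([knn_mutual_information(x_k, x_t, k=k) for x_k in X])
    # Step 3: descending order with respect to V.
    order = np.argsort(V)[::-1]
    if beta is not None:        # Step 4: MI-threshold option.
        keep = order[V[order] > beta]
    else:                       # Step 5: fixed-size option.
        keep = order[:r]
    return X[keep], Y[keep]
```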

Previously, the terms training instance and training vector were both used. The term training instance refers to one training vector together with its target value, i.e. one input–output pair from X and Y. However, in order to simplify the notation, these two terms will be used synonymously, although a choice of training instances is made.

The improvement in the quality of the prediction achieved by the proposed algorithm depends on the quality of the initial training set (the number of available training vectors, the size of the regressor and the selected features), but also on the values of the selected parameters r and β. As indicated in Section 5, these values can be determined based on the function of the mutual information between the vectors of the initial training set and the current vector which is used in the prediction. As the results of the testing indicate, in this case a good choice for parameter r is 25, and for parameter β it is 0.2. This choice is suitable even though the values of these parameters can vary among different data sets and depend on the MI function between the vectors of the initial training set and the current vector used in the prediction.

3.3. Prediction strategy with instance selection

The procedure for the selection of the subsets of instances during each step of the prediction in the recursive prediction strategy is shown in Algorithm 2.

Algorithm 2. The selection of instances in the recursive prediction strategy

1. Initialization. From the time series y defined in (6), form an initial training set S = {X, Y} in accordance with the selected regressor size n and Eqs. (7) and (8). Form the initial test vector $x_t = \{y_N, y_{N-1}, \ldots, y_{N-n+1}\}$ from the last n values of the time series y.
2. On the basis of steps 2 to 5 of Algorithm 1, form a reduced training set $S_r = \{X_r, Y_r\}$.
3. On the basis of the reduced training set, train the local LS-SVM prediction model.
4. On the basis of this model, form and store the prediction for the following step of the time series in vector $\hat{y}_t$, t = 1, ..., H.
5. Update vector $x_t$ by shifting it one place and adding the prediction formed in step 4, in accordance with the recursive prediction strategy.
6. Repeat steps 2 to 5 H times, where H represents the size of the prediction horizon, that is, the number of steps that need to be predicted.
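Putting the pieces together, Algorithm 2 can be sketched as below. The helpers build_training_set and select_instances are the earlier sketches, and train_model stands in for training the local LS-SVM of Section 4; it is assumed to return a one-step predictor. This is illustrative only, not the authors' code.

```python
import numpy as np

def forecast_with_selection(y, n, H, train_model, k=6, beta=None, r=25):
    """Sketch of Algorithm 2: recursive forecasting with per-step
    MI-based instance selection."""
    X, Y = build_training_set(y, n)                      # step 1 (Eqs. (7)-(8))
    x_t = np.asarray(y, dtype=float)[:-n - 1:-1]         # last n values, newest first
    predictions = []
    for _ in range(H):                                   # step 6
        X_r, Y_r = select_instances(X, Y, x_t, k=k, beta=beta, r=r)  # step 2
        model = train_model(X_r, Y_r)                    # step 3: local model
        y_hat = float(model(x_t))                        # step 4
        predictions.append(y_hat)
        x_t = np.concatenate(([y_hat], x_t[:-1]))        # step 5: shift in y_hat
    return np.array(predictions)
```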

4. Least squares support vector machines

In this section, a brief review of LS-SVMs is given; they are used to train the forecasting models in the experiments.

Least squares support vector machines, as a reformulation of SVMs, are commonly used for function estimation and for solving non-linear regression problems [48]. The main property of these methods is that they obtain a solution from a set of linear equations instead of solving a quadratic programming problem, as in SVMs. Therefore, LS-SVMs have a significantly shorter computing time and are easier to optimize, but at the cost of a lack of sparseness in the solution. The regression model in the primal weight space is expressed as:

$$y(x) = \omega^{T}\varphi(x) + b \qquad (12)$$

where $\omega$ represents the weight vector, b represents a bias term and $\varphi(x)$ is a function which maps the input space into a higher-dimensional feature space.

LS-SVMs formulate the optimization problem in the primal space defined by:

$$\min_{\omega,b,e} J_p(\omega, e) = \frac{1}{2}\omega^{T}\omega + \frac{\gamma}{2}\sum_{k=1}^{N} e_k^2 \qquad (13)$$

subject to the equality constraints expressed as:

$$y_k = \omega^{T}\varphi(x_k) + b + e_k, \quad k = 1, \ldots, N \qquad (14)$$

where $e_k$ represents the error variables and $\gamma$ is a regularization parameter which gives the relative weight to the errors and should be optimized by the user.

Solving this optimization problem in the dual space leads to obtaining $\alpha_k$ and b in the solution, represented as:

$$y(x) = \sum_{k=1}^{N} \alpha_k K(x, x_k) + b \qquad (15)$$

The dot product $K(x, x_k) = \varphi(x)^{T}\varphi(x_k)$ represents a kernel function, while $\alpha_k$ is a Lagrange multiplier. When using a radial basis function (RBF) kernel defined by:

$$K(x, x_k) = e^{-||x - x_k||^2/\sigma^2} \qquad (16)$$

the optimal parameter combination (γ, σ) should be established, where γ denotes the regularization parameter and σ is a kernel parameter. For this purpose, a grid-search algorithm in combination with k-fold cross-validation is a commonly used method [49].
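For completeness, a compact sketch of the LS-SVM solution is given below. It solves the standard dual linear system from [48] with the RBF kernel of Eq. (16); the default γ and σ values are placeholder assumptions, since the paper tunes (γ, σ) by grid search with k-fold cross-validation. This is a sketch, not the LS-SVMlab toolbox [54] used in the experiments.

```python
import numpy as np

def train_lssvm(X, Y, gamma=10.0, sigma=1.0):
    """LS-SVM regression sketch: solve for (alpha, b) in the dual linear
    system [[0, 1^T], [1, K + I/gamma]] [b; alpha] = [0; Y]."""
    X, Y = np.asarray(X, dtype=float), np.asarray(Y, dtype=float)
    N = len(Y)
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq / sigma ** 2)                 # RBF kernel matrix, Eq. (16)
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = K + np.eye(N) / gamma
    sol = np.linalg.solve(A, np.concatenate(([0.0], Y)))
    b, alpha = sol[0], sol[1:]

    def predict(x):                              # Eq. (15)
        kx = np.exp(-np.sum((X - x) ** 2, axis=1) / sigma ** 2)
        return float(kx @ alpha + b)
    return predict
```

The returned closure can serve as the one-step model assumed by the recursive sketches of Section 3.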

5. Experimental results and discussion

This section presents the experimental results and a discussion of applying the proposed approach to datasets taken from the NN5 time-series forecasting competition [50]. The goal of the experimental study is to compare the performance of the proposed instance selection methodology in combination with a recursive prediction strategy against the same strategy without instance selection. The section begins with a description of the datasets used in the experiments, followed by the experimental setup. Then the results are presented, and finally a discussion of the results concludes the section.

In the last decade, several time series forecasting competitions have been organized in order to compare and evaluate the performance of different machine learning methods. Among them, the NN5 competition is one of the most interesting, since it includes the challenges of a real world multi-step ahead forecasting task, namely multiple time series, outliers, missing values, multiple overlying seasonalities, etc. [51]. Each of the 11 time series in the reduced data set represents roughly two years of daily cash withdrawal amounts (735 data points) at ATMs in various cities in the UK. The reduced data set is a typical representative of the entire data set, which contains 111 time series. For each time series, the competition required forecasting the values for the next 56 days from the given historical data points as accurately as possible.

The prediction quality was evaluated using the symmetric mean absolute percentage error (SMAPE):

$$\mathrm{SMAPE}\,(\%) = \frac{100}{H}\sum_{i=1}^{H}\frac{|y_i - \hat{y}_i|}{(y_i + \hat{y}_i)/2} \qquad (17)$$

where $y_i$ and $\hat{y}_i$ are the real and the predicted value of the time series in the ith prediction step, and H is the size of the prediction horizon. Since this is a relative error measure, the errors can be averaged over all of the time series in the reduced data set to obtain a mean SMAPE, defined as:

$$\mathrm{meanSMAPE}\,(\%) = \frac{1}{11}\sum_{i=1}^{11}\mathrm{SMAPE}_i \qquad (18)$$

where $\mathrm{SMAPE}_i$ denotes the SMAPE of the ith time series.
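A direct transcription of Eq. (17), useful for reproducing the error figures (a sketch; it assumes y_i + ŷ_i never vanishes, which holds for the positive withdrawal amounts considered here):

```python
import numpy as np

def smape(y_true, y_pred):
    """SMAPE of Eq. (17), in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 100.0 * np.mean(np.abs(y_true - y_pred) / ((y_true + y_pred) / 2.0))
```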

The NN5 time series require a missing value interpolation preprocessing step. Under this step, two types of anomalies need to be considered: zero values, which indicate that no money withdrawal occurred, and missing observations, for which no value was recorded. About 2.5% of the data represent missing values. For the missing value interpolation, the method proposed in [51] is used: if $y_m$ is a missing value, it is replaced with the median of the set $\{y_{m-365}, y_{m+365}, y_{m-7}, y_{m+7}\}$, taken over those values which are available.
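A sketch of this interpolation rule, assuming missing observations are encoded as NaN (the encoding is an assumption; the paper does not specify one):

```python
import numpy as np

def fill_missing(y):
    """Replace a missing y_m with the median of
    {y[m-365], y[m+365], y[m-7], y[m+7]}, using those neighbors that
    exist within the series and are themselves observed (after [51])."""
    y = np.asarray(y, dtype=float)
    filled = y.copy()
    for m in np.flatnonzero(np.isnan(y)):
        neigh = [y[m + off] for off in (-365, 365, -7, 7)
                 if 0 <= m + off < len(y) and not np.isnan(y[m + off])]
        if neigh:
            filled[m] = np.median(neigh)
    return filled
```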

A recursive forecasting strategy requires setting the regressor size n in Eqs. (9)–(11). Several approaches have been proposed in the literature to select this value, such as the one based on the partial autocorrelation function (PACF), proposed in [11]. Since this aspect is not the central topic of this paper, the regressor size is determined based on a visual inspection of the time series and the analysis presented in [11] and [51]. It is set to 14 for all 11 time series in the reduced data set. Previous analyses of this data set have shown 'day of the week' seasonality, so two weeks is a long enough period of time to catch the time-series pattern and to maintain the same setup between the experiments.

In order to obtain more accurate results, a rolling forecast origin evaluation was used, as proposed in [52]. This approach overcomes the shortcomings of fixed origin evaluation, including its dependence on the randomness contained in the particular forecasting origin and the limited sample size of errors [53]. In rolling origin evaluation, the forecasting origin is successively updated, and forecasts from the new origins are produced. In general, the rolling-origin procedure provides H(H+1)/2 forecasts, against H from the fixed origin, where H denotes the number of steps that need to be predicted.

The instance selection procedure with local model training and recursive forecasting of the time series for one prediction horizon ahead was presented in Section 3 with Algorithm 2. The first step in the algorithm is the selection of a new training set $(X_r, Y_r)$ from the initial set, based on the current forecasting vector $x_t$ and the selection option described in Algorithm 1. After that, the optimal (γ, σ) pair is determined based on $(X_r, Y_r)$ using a grid search with 10-fold cross-validation. The local LS-SVM model is then employed for the time-series prediction. The whole process is repeated H times by employing the recursive prediction strategy. An LS-SVM Matlab toolbox can be found in [54].
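The (γ, σ) grid search with 10-fold cross-validation mentioned above can be sketched as follows. The grid values are illustrative assumptions, and train_lssvm is the sketch from Section 4:

```python
import numpy as np
from itertools import product

def grid_search_lssvm(X, Y, gammas=(1.0, 10.0, 100.0),
                      sigmas=(0.5, 1.0, 2.0), folds=10):
    """Pick the (gamma, sigma) pair with the lowest mean squared
    validation error over a k-fold split."""
    X, Y = np.asarray(X, dtype=float), np.asarray(Y, dtype=float)
    splits = np.array_split(np.arange(len(Y)), folds)
    best, best_err = None, np.inf
    for gamma, sigma in product(gammas, sigmas):
        fold_errs = []
        for f in range(folds):
            val = splits[f]
            trn = np.concatenate([s for j, s in enumerate(splits) if j != f])
            model = train_lssvm(X[trn], Y[trn], gamma=gamma, sigma=sigma)
            fold_errs.append(np.mean([(model(x) - t) ** 2
                                      for x, t in zip(X[val], Y[val])]))
        if np.mean(fold_errs) < best_err:
            best, best_err = (gamma, sigma), np.mean(fold_errs)
    return best
```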

Several models are generated with different training sets, which are formed from the initial training set using the MI threshold or the number of instances option. To be clear, the term model as used here is associated with a set of local models, each trained with its own training set. These training sets are generated from the initial training set using the MI criterion, and for every local model this training set is different; every model is thus a set of local models trained with training sets generated from the initial set using the same criterion. The feature set for every model is the same, i.e. all the local models have the same feature structure. In order to assess the increase in the accuracy of the proposed models and their contribution to forecasting research, the accuracy of the proposed models was compared with two statistical benchmark models, ARIMA and Holt–Winters, but also with a model which implements instance selection based on the distance metric proposed in [32,33].

The amount of MI between the initial training set vectors and the test vector (i.e. the vector with which the prediction is performed) for the first prediction step in the prediction horizon is given in Fig. 1. The horizontal axis denotes the index of the vector in the initial training set, while the vertical axis represents the amount of MI. The first block from the left in Fig. 1 relates to the first time series in the data set, the second block to the second time series, and so on.

Fig. 1. The amount of MI between the initial training set vectors and the test vector for the first prediction step.

From Fig. 1 it can be observed that most of the vectors from the initial training set have MI values smaller than 0.1 and that there is a small number of vectors with MI values greater than 0.3. It can also be observed that the MI curves are similar for all 11 time series, while the slopes and maximum values of the MI curves differ between them. Also, for all 11 time series in Fig. 1, the total number of significant vectors (the ones that have an MI greater than 0) is around 400. Let us, for example, consider the time series which correspond to block 9 and block 10 in Fig. 1. The time series that corresponds to block 9 has a significantly larger number of vectors with a larger amount of MI, compared to the one that corresponds to block 10. This indicates the quality of the initial training set with respect to the current vector with which the forecasting model is employed; in this way it is possible to assess the expected quality of the future predictions. The predictions obtained with a model trained with a large number of vectors that share a large amount of MI with the current test vector should be more accurate than the predictions obtained with a model trained with fewer vectors that share a smaller amount of MI with the current test vector.

The selection strategy for parameter r, which determines the overall number of vectors to be selected in a given step, is the following: select as few vectors as possible, enough to train the LS-SVM model, such that they share as much information as possible with the test vector in the first prediction step. Although it is possible to select a different value of parameter r in each of the prediction steps, with the aim of simplifying the results, parameter r has been selected on the basis of the mutual information in the first prediction step and remains unchanged in all the subsequent prediction steps. In addition, in the case of the recursive strategy, during the first prediction step all of the values in the regressor are known (true), and thus the evaluation of the mutual information is more precise than in the following steps.

When it comes to the selection of parameter β, which defines the lowest limit of the mutual information between two vectors, it is necessary to select as large a value as possible while at the same time leaving a sufficient number of vectors to form the model. As the results of the tests have shown, a sufficient number of vectors for training the LS-SVM model ranges from several dozen to one hundred. Parameter β was also selected on the basis of the available number of vectors with a suitable amount of information in the first step of the prediction, and it remains unchanged in all the steps of the prediction. According to the previous analysis and Fig. 1, for testing purposes 27 different models were built, denoted as follows:

- LSSVM: the initial model, trained with all 721 instances,
- LSSVM-βMI: models trained with sets that consist of the initial input vectors with an MI greater than β in every prediction step,
- LSSVM-rMI: models trained with sets that contain the first r instances for each prediction step, sorted in descending order based on the MI,
- LSSVM-kkNN: models trained with sets that contain the first k instances for each prediction step, sorted in descending order based on the distance metric proposed in [32,33],
- ARIMA and HW: benchmark models.

The mean and median SMAPE values and the ranking results are presented in Table 1.

The problem of comparing multiple models on multiple time series, and of inferring whether there are significant general differences in performance, is discussed in [55] and [56]. In such cases a two-stage procedure is recommended: first, Friedman's test is used to determine whether the compared models have the same mean rank; if the null hypothesis is rejected, a post-hoc test is performed in the second stage to compare the different models. Friedman's test is a nonparametric test designed to detect differences among two or more groups. Applying Friedman's test, a p-value of 0.016 is obtained; since this is less than 0.05, the null hypothesis is rejected at the 5% level, which indicates significant differences in the mean ranks among the compared models. For the post-hoc test, the multiple comparisons with the best (MCB) test is used.
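As a small illustration of the first stage (an assumption-labeled sketch; SciPy's friedmanchisquare implements the test used here), given the per-series SMAPE scores of each model:

```python
from scipy.stats import friedmanchisquare

def friedman_p_value(score_columns):
    """First stage of the two-stage procedure: Friedman's test over the
    per-series SMAPE scores, one sequence of 11 values per model."""
    stat, p = friedmanchisquare(*score_columns)
    return p
```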

From Table 1 it can be observed that, in terms of the mean and median SMAPE, all of the models that implement an MI based instance selection algorithm (regardless of the selection option or the chosen values of the parameters r and β) outperform the LSSVM model without an instance selection algorithm. A slightly better performance was determined for the MI based instance selection models in comparison to the LSSVM-kNN models and the benchmark ARIMA and HW models. Also, in terms of average rank, all of the models that implement an MI based instance selection algorithm outperformed the LSSVM model, the LSSVM-kNN models, ARIMA and the HW model. The post-hoc test indicates no significant differences between the models by multiple comparisons with the best test at the 0.05 level. It should be mentioned that the best meanSMAPE achieved during the NN5 competition for the reduced data set is 17.6%, obtained with a statistical model, while the top three entries for computational intelligence models were 19%, 19.9% and 20.5% [57].


Table 1
The performance comparison of the individual forecasting models.

Model          meanSMAPE (%)  Avg. rank  medianSMAPE (%)  Avg. rank
LSSVM          21.47          27         20.34            24
LSSVM-0.05MI   19.66           9         18.66             9
LSSVM-0.1MI    19.55           4         18.29             4
LSSVM-0.15MI   19.29           3         18.51             7
LSSVM-0.2MI    19.08           1         19.36            20
LSSVM-0.25MI   19.68          10         18.70            10
LSSVM-0.3MI    19.84          13         19.03            15
LSSVM-0.35MI   19.63           5         18.04             1
LSSVM-0.4MI    19.64           6         18.15             3
LSSVM-15MI     20.72          23         20.03             2
LSSVM-25MI     19.28           2         18.07            18
LSSVM-50MI     19.75          11         19.29            17
LSSVM-75MI     20.01          16         19.27            13
LSSVM-100MI    19.64           7         18.86            14
LSSVM-200MI    19.66           8         18.88             6
LSSVM-300MI    20.01          14         18.48            11
LSSVM-400MI    20.01          15         18.76             2
LSSVM-15kNN    20.21          18         18.59             8
LSSVM-25kNN    20.52          21         18.82            12
LSSVM-50kNN    19.81          12         18.44             5
LSSVM-75kNN    20.06          17         19.35            19
LSSVM-100kNN   20.62          22         19.18            16
LSSVM-200kNN   20.87          25         20.03            21
LSSVM-300kNN   20.75          24         20.36            25
LSSVM-400kNN   20.49          20         20.06            23
ARIMA          20.34          19         20.55            26
HW             21.06          26         20.79            27

Friedman's test p-value: 0.016

The predictions obtained with the LSSVM and LSSVM-0.2MI models versus the real values, for time series No. 9 and No. 10 in the reduced data set, are presented in Fig. 2 (from top to bottom, respectively). While for series No. 9 both models perform very well (which was expected, considering the analysis of block 9 in Fig. 1), the improvements are much more significant for series No. 10. Actually, the predictions obtained from the LSSVM model for series No. 10 are partially usable for 28 steps ahead at the most, and from that point on they become a flat line. However, the predictions obtained from the LSSVM-0.2MI model, which implements MI based instance selection, follow the shape of the curve for all 56 prediction steps.

In terms of the computational complexity of the proposed instance selection algorithm, it can vary in both directions, depending on the initial training set size, the regressor size, the number of steps in the prediction horizon that need to be predicted and the number of selected vectors per step. In this particular case, the time needed to obtain the predictions is multiplied by a factor of approximately 3, 4, 6, 8 and 12 for the LSSVM-15MI, LSSVM-75MI, LSSVM-200MI, LSSVM-300MI and LSSVM-400MI models respectively, compared to the initial LSSVM model. This is a direct consequence of the large prediction horizon size, H = 56, but also of the large number of selected vectors per step for the LSSVM-200MI, LSSVM-300MI and LSSVM-400MI models. Nevertheless, the increase in computational time is compensated by the increase in the quality of the prediction results.

6. Conclusion

A possible approach for improving long-term time-series prediction is presented in this paper. The proposed methodology is based on the concept of mutual information for instance selection, which was previously mostly used for feature selection and for removing noise and outliers. The novelty of the presented approach is the use of the MI estimator in order to make an objective evaluation of how relevant an input vector is for the training set, i.e. how well it fits into the current prediction step. Unlike the previous approaches which use an MI estimator (feature selection, removing noise and outliers, and missing value interpolation), where the evaluation of the mutual information was performed only on the training set, the proposed methodology evaluates the mutual information between the vectors of the initial training set and the current vector used in the prediction.

As the experimental results indicated, the MI estimator reduced the size of the initial training set several-fold. All of the generated models performed better than the initial model, regardless of the selected number of instances or the MI threshold value. It has been shown that the quality of the training set is more significant than its size: the models trained with sets of vectors which share a large amount of information with the forecasting inputs achieved greater accuracy than the models trained with one much larger set.

Although the calculations in the algorithms are somewhat time consuming, they bring significant improvements to the time-series forecasting accuracy and lead to a reduction in the initial training set size needed for model formation.

Further work should include detailed research on how the choice of parameter k in the MI estimator is reflected in the proposed instance selection algorithm. Also, the determination of the initial training set size should be analyzed, in order to constrain its size in cases when large amounts of historical data are available for instance selection.

Acknowledgments

This work was supported by the Ministry of Science and Technological Development, Republic of Serbia (Project number: III44006).

We thank the editors and the anonymous reviewers for their helpful comments and suggestions that have improved the quality of this paper.

Fig. 2. The forecasts from the LSSVM and LSSVM-0.2MI models and real values for time series No. 9 and No. 10.


References

[1] K. Kim, Financial time series forecasting using support vector machines, Neurocomputing 55 (2003) 307–319.
[2] P. Bermolen, D. Rossi, Support vector regression for link load prediction, Comput. Netw. 53 (2009) 191–201.
[3] N. Amjady, Short-term hourly load forecasting using time-series modeling with peak load estimation capability, IEEE Trans. Power Syst. 16 (2001) 798–805.
[4] A.D. Papalexopoulos, T.C. Hesterberg, A regression-based approach to short-term system load forecasting, IEEE Trans. Power Syst. 5 (1990) 1535–1547.
[5] G.E.P. Box, G. Jenkins, Time Series Analysis, Forecasting and Control, Holden-Day, San Francisco, 1976.
[6] C.C. Holt, Forecasting seasonals and trends by exponentially weighted moving averages, Int. J. Forecast. 20 (2004) 5–10.
[7] G. Zhang, B.E. Patuwo, M.Y. Hu, Forecasting with artificial neural networks: the state of the art, Int. J. Forecast. 14 (1998) 35–62.
[8] N.I. Sapankevych, R. Sankar, Time series prediction using support vector machines: a survey, IEEE Comput. Intell. Mag. 4 (2009) 24–38.
[9] H.H. Nguyen, C.W. Chan, Multiple neural networks for a long term time series forecast, Neural Comput. Appl. 13 (2004) 90–98.
[10] N. Kourentzes, S.F. Crone, Forecasting high-frequency time series with neural networks – an analysis of modelling challenges from increasing data frequency, Int. Conf. Data Min. (2008).
[11] S.F. Crone, N. Kourentzes, Input-variable specification for neural networks – an analysis of forecasting low and high time series frequency, in: Proceedings of the International Joint Conference on Neural Networks, 2009, pp. 619–626.
[12] J. Zhang, Y. Yim, J. Yang, Intelligent selection of instances for prediction functions in lazy learning algorithms, Artif. Intell. Rev. 11 (1997) 175–191.
[13] J. Tolvi, Genetic algorithms for outlier detection and variable selection in linear regression models, Soft Comput. 8 (2004) 527–533.
[14] A. Guillen, L.J. Herrera, G. Rubio, H. Pomares, A. Lendasse, I. Rojas, New method for instance or prototype selection using mutual information in time series prediction, Neurocomputing 73 (2010) 2030–2038.
[15] J.Q. Candela, M. Sugiyama, A. Schwaighofer, N.D. Lawrence, Dataset Shift in Machine Learning, The MIT Press, Cambridge, MA, 2009.
[16] A.J. Storkey, M. Sugiyama, Mixture regression for covariate shift, Adv. Neural Inf. Process. Syst. 19 (2007) 1337–1344.
[17] J.G.M. Torres, T. Raeder, R.A. Rodriguez, N.V. Chawla, F. Herrera, A unifying view on dataset shift in classification, Pattern Recogn. 45 (2012) 521–530.
[18] D.A. Cohn, Z. Ghahramani, M.I. Jordan, Active learning with statistical models, J. Artif. Intell. Res. 4 (1996) 129–145.
[19] V.S. Iyengar, C. Apte, T. Zhang, Active learning using adaptive resampling, in: Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2000, pp. 91–98.
[20] M.C. Ludl, A. Lewandowski, Adaptive machine learning in delayed feedback domains by selective relearning, Appl. Artif. Intell. 22 (2008) 543–557.
[21] H. Shimodaira, Improving predictive inference under covariate shift by weighting the log-likelihood function, J. Stat. Plann. Inference 90 (2000) 227–244.
[22] D.W. Aha, Tolerating noisy, irrelevant and novel attributes in instance-based learning algorithms, Int. J. Man-Mach. Stud. 36 (1992) 267–287.
[23] D.R. Wilson, T. Martinez, Reduction techniques for instance based learning algorithms, Mach. Learn. 38 (2000) 257–286.
[24] Y. Guo, D. Schuurmans, Discriminative batch mode active learning, Adv. Neural Inf. Process. Syst. (NIPS) (2008) 593–600.
[25] H. Ishibuchi, T. Nakashima, M. Nii, Learning of neural networks with GA-based instance selection, in: Proceedings of the IFSA World Congress and 20th NAFIPS International Conference 4, 2001, pp. 2102–2107.
[26] M. Sebban, R. Nock, S. Lallich, Stopping criterion for boosting-based data reduction techniques: from binary to multiclass problems, J. Mach. Learn. Res. 3 (2002) 863–865.
[27] V.B. Zubek, T.G. Dietterich, Pruning improves heuristic search for cost-sensitive learning, in: Proceedings of the International Conference on Machine Learning, 2002, pp. 27–34.
[28] K. Yu, X. Xu, M. Ester, H.P. Kriegel, Feature weighting and instance selection for collaborative filtering: an information-theoretic approach, Knowl. Inf. Syst. 5 (2003) 201–224.
[29] J.A.O. Lopez, J.A.C. Ochoa, J.F.M. Trinidad, J. Kittler, A review of instance selection methods, Artif. Intell. Rev. 34 (2010) 133–143.
[30] N. Jankowski, M. Grochowski, Comparison of instances selection algorithms I, in: Proceedings of Artificial Intelligence and Soft Computing – ICAISC 2004, 2004, pp. 598–603.
[31] J. Zhang, Y.S. Yim, J. Yang, Intelligent selection of instances for prediction functions in lazy learning algorithms, Artif. Intell. Rev. 11 (1997) 175–191.
[32] Z. Huang, M.L. Shyu, k-NN based LS-SVM framework for long-term time series prediction, in: Proceedings of the 2010 IEEE International Conference on Information Reuse and Integration, 2010, pp. 69–74.
[33] M.L. Shyu, Z. Huang, Long-term time series prediction using k-NN based LS-SVM framework with multi-value integration, in: Recent Trends in Information Reuse and Integration, Springer, 2012, pp. 191–209.
[34] A. Guillén, L. Herrera, G. Rubio, Instance or prototype selection for function approximation using mutual information, in: Proceedings of the European Symposium on Time Series Prediction – ESTSP 2008, 2008, pp. 67–75.
[35] M. Stojanović, M. Božić, M. Stanković, Z. Stajić, Adaptive least squares support vector machines method for short-term load forecasting based on mutual information for inputs selection, Int. Rev. Electr. Eng. 7 (2012) 3574–3585.
[36] L. Batina, B. Gierlichs, E. Prouff, M. Rivain, F.X. Standaert, N.V. Charvillon, Mutual information analysis: a comprehensive study, J. Cryptol. 24 (2010) 269–291.
[37] A. Sorjamaa, J. Hao, N. Reyhani, Y. Ji, A. Lendasse, Methodology for long-term prediction of time series, Neurocomputing 70 (2007) 2861–2869.
[38] F. Rossi, A. Lendasse, D. Francois, V. Wertz, M. Verleysen, Mutual information for the selection of relevant variables in spectrometric nonlinear modeling, Chemom. Intell. Lab. Syst. 80 (2006) 215–226.
[39] L.J. Herrera, H. Pomares, I. Rojas, M. Verleysen, A. Guillen, Effective input variable selection for function approximation, in: Proceedings of the International Conference on Artificial Neural Networks – ICANN 2006, 2006, pp. 41–50.
[40] G. Doquire, M. Verleysen, Feature selection with missing data using mutual information estimators, Neurocomputing 90 (2012) 3–11.
[41] R. Moddemeijer, On estimation of entropy and mutual information of continuous distributions, Signal Process. 16 (1989) 233–248.
[42] Y. Moon, B. Rajagopalan, U. Lall, Estimation of mutual information using kernel density estimators, Phys. Rev. E 52 (1995) 2318–2321.
[43] A. Kraskov, H. Stögbauer, P. Grassberger, Estimating mutual information, Phys. Rev. E 69 (2004) 066138.
[44] http://www.klab.caltech.edu/kraskov/MILCA/
[45] H. Stögbauer, A. Kraskov, S.A. Astakhov, P. Grassberger, Least dependent component analysis based on mutual information, Phys. Rev. E 70 (2004) 066123.
[46] D. François, F. Rossi, V. Wertz, M. Verleysen, Resampling methods for parameter-free and robust feature selection with mutual information, Neurocomputing 70 (2007) 1276–1288.
[47] L.J. Herrera, H. Pomares, I. Rojas, A. Guillen, A. Prieto, O. Valenzuela, Recursive prediction for long term time series forecasting using advanced models, Neurocomputing 70 (2007) 2870–2880.
[48] J.A.K. Suykens, T. Van Gestel, J. De Brabanter, B. De Moor, J. Vandewalle, Least Squares Support Vector Machines, World Scientific, Singapore, 2002.
[49] S. Arlot, A. Celisse, A survey of cross-validation procedures for model selection, Stat. Surv. 4 (2010) 40–79.
[50] http://www.neural-forecasting-competition.com/NN5/datasets.htm
[51] S.B. Taieb, G. Bontempi, A. Atiya, A. Sorjamaa, A review and comparison of strategies for multi-step ahead time series forecasting based on the NN5 forecasting competition, Expert Syst. Appl. 39 (2012) 7067–7083.
[52] L.J. Tashman, Out-of-sample tests of forecasting accuracy: an analysis and review, Int. J. Forecast. 16 (2000) 437–450.
[53] S.F. Crone, N. Kourentzes, Feature selection for time series prediction – a combined filter and wrapper approach for neural networks, Neurocomputing 73 (2010) 1923–1936.
[54] http://www.esat.kuleuven.be/sista/lssvmlab/
[55] J. Demšar, Statistical comparisons of classifiers over multiple data sets, J. Mach. Learn. Res. 7 (2006) 1–30.
[56] S. Garcia, F. Herrera, An extension on statistical comparisons of classifiers over multiple data sets for all pairwise comparisons, J. Mach. Learn. Res. 9 (2008) 2677–2694.
[57] http://www.neural-forecasting-competition.com/NN5/results.htm

Miloš B. Stojanović was born in Niš, Serbia in 1981. He received his Ph.D. degree in Computer Science from the Faculty of Electronic Engineering, University of Niš, Serbia in 2013. Currently he is working as a Teaching Assistant at the Department of Modern Computer Technologies in the College of Applied Technical Sciences in Niš. His research interests include machine learning (ML) and its application in power systems, feature and instance selection algorithms for regression models, and the application of ML algorithms in time-series prediction. His main area of research is developing and improving ML models for the precise forecasting of load demand.

Miloš M. Božić was born in Niš, Serbia in 1982. He received his M.Sc. degree in Electrical Engineering in 2008 from the Faculty of Electronic Engineering, University of Niš, Serbia. He is currently a research associate at the Faculty of Electronic Engineering, University of Niš. His research interests are in the field of artificial intelligence and machine learning. His main current areas of research are developing methods for load forecasting and feature and instance selection algorithms. His current interest also lies in the domain of the application of artificial intelligence in robotics.


Milena M. Stanković was born in Bjelovar, Croatia in 1953. She received her M.Sc. and Ph.D. degrees in Computer Science in 1984 and 1988, respectively, from the Faculty of Electronic Engineering, University of Niš, Serbia. She is currently a professor and the Head of the CIIT (Computational Intelligence and Information Technologies) laboratory at the Faculty of Electronic Engineering, University of Niš. Her main research interest lies in the domain of the application of spectral methods for the representation and classification of discrete functions. Her current interest also lies in the domain of data mining and web mining methods and applications.

Zoran P. Stajić was born in 1968 in Leskovac, Serbia. He received his B.Sc. degree in 1993, his M.Sc. degree in 1996, and his Ph.D. degree in 2000 from the Faculty of Electronic Engineering, University of Niš, Serbia. Since 1993 he has been working as a professor at the Faculty of Electronic Engineering at the Department of Power Engineering. His research interests include smart grids, information and communication technologies, and electrical fraud detection. Since 1998, he has been the leader of many government-funded projects concerning innovations, technology development, and energy efficiency. In 2007, he established a private R&D center in the field of Intelligent Energy Management Systems.
