
Stock Closing Price Forecasting Using Ensembles of Constructive Neural Networks

R.S. João1, T.F. Guidoni1, J.R. Bertini Jr.2, M.C. Nicoletti1,3, A.O. Artero4
1 DC-UFSCar, S. Carlos, SP, Brazil
2 ICMC-USP, S. Carlos, SP, Brazil
3 FACCAMP, C.L. Paulista, SP, Brazil
4 FCT-UNESP, Presidente Prudente, SP, Brazil
{rafael.joao@dc.ufscar.br, tarcisio.guidoni@comp.ufscar.br, bertini@icmc.usp.br, carmo@dc.ufscar.br, almir@fct.unesp.br}
Abstract - Efficient automatic systems that learn continuously over long periods of time and manage to evolve their knowledge, by discarding obsolete parts of it and acquiring new ones that reflect recent data, are difficult to construct. This paper addresses neural network (NN) learning in non-stationary environments related to financial markets, aiming at forecasting stock closing prices. To face this dynamic scenario, an efficient NN model is required. Constructive Neural Networks (CoNN) were therefore employed, due to their self-adaptation capability, in contrast to regular NNs, which demand parameter adjustment. This paper investigates a possible ensemble organization composed of NNs trained with the Cascade Correlation CoNN algorithm. An ensemble is an effective approach to non-stationary learning because it provides pre-defined rules that enable new learners - with new knowledge - to take part in the ensemble along the data stream processing. Results obtained with data streams related to four different stocks are analysed and favorably compared with those obtained with traditional MLP NNs trained with Backpropagation.

Keywords - learning in non-stationary environments, Backpropagation, Cascade Correlation, ensemble, constructive neural networks, temporal data mining.

I. INTRODUCTION

Conventional automatic learning algorithms are, generally, based on the assumption that the data they rely upon for learning is a random set of records extracted from a stationary distribution. Demands for computational systems capable of processing continuous data streams and adaptively learning from them, however, are pushing research on automatic learning into devising new strategies and methods for dealing with non-stationary data. Real-world systems such as urban traffic control, patient monitoring, image recognition, natural language processing, spam filtering [1], credit card fraud detection [2], stock market prediction [3], robot control, and many others depend on a flexible and agile adaptation of the system, to reflect the continuously incoming data and so remain reliable.
Over the past decades neural networks (NN) have become one of the most successful learning procedures, used in many different areas of knowledge as an automatic inductive method for learning (see for instance [4], [5], [6]). Neural networks can be considered mathematically inspired, data-driven computational models. They can be approached as adaptive procedures which learn to generalize from concrete data, by detecting and representing intrinsic relationships among the given data. Usually automatic learning methods use historical data to construct a predictor (such as a classifier or a regression model). Depending on the nature of the data and, particularly, on its volatility over time, what has been automatically learnt today (represented as rules, NNs, decision trees, etc.) easily becomes obsolete in a short time span. For automatic learning in evolving environments, i.e., environments where new events, data and conditions change over time, traditional learning methods, such as NNs, need to adjust themselves in order to guarantee a reliable prediction of the process behavior, taking into account the continuous changes in the incoming data.
As commented in [7], for automatic learning in non-stationary environments the chosen algorithm, besides coping with the intrinsic difficulties attached to this type of environment, needs to keep its essential characteristics, such as good performance, stability and low associated costs, particularly in relation to processing time and memory. Taking these limitations into account, a powerful strategy to implement incremental learning is the ensemble learning approach [8]. Essentially, an ensemble is a pool of base learners whose outputs are mediated by a pre-defined rule [9]. When processing a data stream, the ensemble can be updated by adding new base classifiers during execution, without halting or retraining. Therefore, such an approach provides a fast and natural way of updating its knowledge about the problem.
The knowledge domain under investigation, i.e., the financial area of the stock market, is heavily affected by data volatility, which can be approached as the degree to which stock closing prices fluctuate over time. The fact that the whole financial area is also substantially affected by internal and external variables contributes to increasing the challenge attached to the task of constructing NN-based automatic stock closing price predictors. The methods reported here aim at helping stockholder decisions by inferring, in advance, the closing price of a stock at the end of the day. Such advance information may support crucial operations such as buying, selling and day trading. A long-term forecast can also be done if stock data is artificially generated for this purpose.

The remainder of this paper is organized as follows: Section II describes the data used in the experiments. Section III describes the two neural network approaches considered in this work, i.e. (A) traditional NNs and (B) constructive NNs organized as ensembles. Section IV presents the experimental results on closing price forecasting for four well-known company stocks, as well as a comparative discussion of both strategies. Finally, Section V concludes the paper and provides directions for future work.
II. DESCRIBING THE DATA

Four datasets downloaded from http://br.financas.yahoo.com were used in the experiments, namely: Coca-Cola, Johnson & Johnson, IBM and McDonald's. The experiments described in Section IV have used both the four 10,000-record datasets and four subsets of them, each with 1,000 records: from each of the 10,000-record datasets, a sample of the last 1,000 records was chosen at the beginning of the experiments (i.e. 13th November 2013). The four 10,000-record datasets as well as their corresponding 1,000-record subsets (referred to as short) share the same data structure, as shown in Table I. The daily records were used in the same order in which they were originally organized; the first record reports information obtained on the 13th November 2013 and the last, on the 24th November 2009.
The data are chronologically organized; the only temporal discontinuity is related to Saturdays and Sundays, when the stock market is not active. The attributes that describe each stock transaction in a particular day are the usual ones related to the stock market: the price at opening time (Open), the highest price reached (High), the lowest price reached (Low), the number of stocks negotiated (Volume) and the closing price (Close), the feature to be predicted. The attribute Adjusted close (in Table I) refers to the stock price at the market closing time, taking into account eventual dividends (cash and stock).
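The paper does not describe the loading step; purely as an illustration, a file in the usual Yahoo Finance historical-data layout (see Table I) could be read as sketched below. The filename is hypothetical and the code is not the authors' implementation.

import pandas as pd

# Hypothetical filename; the paper does not specify how the downloads were stored.
df = pd.read_csv("johnson.csv", parse_dates=["Date"])

# Keep the attributes used by the networks plus the prediction target.
features = ["Open", "High", "Low", "Volume"]
target = "Close"
data = df[["Date"] + features + [target]]

# A 1,000-record subset forms the "short" stream used in part of the experiments;
# here we simply take the most recent 1,000 rows, assuming a chronological file.
short_stream = data.tail(1000).reset_index(drop=True)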
TABLE I. AN EXAMPLE OF A DATA RECORD (FROM THE JOHNSON & JOHNSON DATA STREAM). ASTERISKS IDENTIFY ATTRIBUTES NOT USED.

Attribute     Value
*Date         2013-11-13
Open          93.36
High          93.45
Low           92.32
Volume        8954300
Close         93.34
*Adj. close   93.34

III. NN LEARNING METHODOLOGY (CONVENTIONAL AND CONSTRUCTIVE NNS)

For the experiments involving constructive as well as traditional neural networks, the methodology employed was to partition a set of data into subsets of (approximately) equal size and then simulate it as a stream. The data stream with 1,000 records (SD) was divided into disjoint sets with sizes in {100, 200, 300, 400, 500}, one size at a time, resulting in partitions (Pi) of the original 1,000 records as

described next. The motivation for proposing five different partitions was to evaluate the most convenient chronological window, for each data set, which would produce NNs with a more reliable prediction power. In what follows, n represents the number of groups in a particular partition; a short partitioning sketch is given after the list.
P1 = {C1, C2, C3, C4, C5, C6, C7, C8, C9, C10}, with |Ci| = 100 (n=10)
P2 = {C1, C2, C3, C4, C5}, with |Ci| = 200 (n=5)
P3 = {C1, C2, C3, C4}, with |C1| = |C2| = |C3| = 300 and |C4| = 100 (n=4)
P4 = {C1, C2, C3}, with |C1| = |C2| = 400 and |C3| = 200 (n=3)
P5 = {C1, C2}, with |C1| = |C2| = 500 (n=2)
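The partitions above amount to chunking the 1,000-record stream into consecutive groups; a minimal sketch, with illustrative names not taken from the paper, is:

def partition_stream(stream, group_size):
    """Split a list of records into consecutive groups of group_size;
    the last group keeps whatever records remain (e.g. |C4| = 100 in P3)."""
    return [stream[i:i + group_size] for i in range(0, len(stream), group_size)]

# Five partitions of the 1,000-record stream SD, as described above.
# partitions[k] corresponds to P(k+1); e.g. partitions[0] has 10 groups of 100.
SD = list(range(1000))  # placeholder for the 1,000 data records
partitions = [partition_stream(SD, size) for size in (100, 200, 300, 400, 500)]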
A. The Traditional Neural Network Approach
The experiments using traditional Multi-Layer Perceptrons (MLP) trained with Backpropagation [10] aimed only at producing results for a comparative analysis against those given by the ensembles of constructive NNs. For training and evaluating the several induced NNs, the procedures described in Fig. 1 and Fig. 2 were employed.
procedure vary_hidden_number(SD, av_matrix)
input: SD {stream of data}
output: av_matrix {a bidimensional matrix containing, in each position (k,j), the total average error of the induced NNs, for a particular partition (k) of SD, varying the number of hidden nodes (j) from 2 to 20}
dividing_input_set(SD, [P1,P2,P3,P4,P5])
for j ← 2 to 20 do
begin
  k ← 1
  while k ≤ 5 do
  begin
    av_matrix[k,j] ← train_test_NN(Pk, j, tot_av_error_k,j)
    k ← k + 1
  end
end
end procedure
Fig. 1. Main procedure, which implements both (1) the partition of the data stream and (2) the inducing and testing of NNs (via the train_test_NN procedure, Fig. 2). It stores in a matrix the average errors obtained, taking into account different numbers of hidden neurons.

Procedure vary_hidden_number/2 (Fig. 1) is responsible for obtaining five different partitions of the data stream SD, each one identified by Pi (i=1, ..., 5), as previously described. For each partition, it calls the procedure train_test_NN, which is in charge of inducing a NN for each group of data records and then evaluating the created NN on the next group of the partition. It also controls the number of hidden neurons to be used by the train_test_NN procedure. As a final result, for each partition Pi and number of hidden neurons, it produces a value which represents the total average error of the induced NNs, taking into account Pi and the 19 possible values for the number of hidden nodes.
procedure train_test_NN(Pk, n_hidden, tot_av_error_k,n_hidden)
input: Pk = {C1, ..., Cn}, n_hidden
output: tot_av_error_k,n_hidden
tot_av_error ← 0
n ← |Pk| {n is the number of groups (of data records) in partition Pk}
for i ← 1 to n-1 do
begin
  learn(Ci, NNi, n_hidden)
  test(NNi, Ci+1, av_error_k,i, n_hidden)
  tot_av_error_k,n_hidden ← tot_av_error_k,n_hidden + av_error_k,i
end
tot_av_error_k,n_hidden ← tot_av_error_k,n_hidden/(n-1)
end procedure
Fig. 2. Procedure which induces and tests NNs, taking into account a particular partition of the data stream. Its learn procedure implements the Backpropagation algorithm.

The NN architecture used in this part of the experiments was a feedforward NN with 4 input nodes (each associated with a data record attribute, namely Open, High, Low and Volume), one output node and a single hidden layer. The main goal of the training procedure was to create NNs able to predict the Close value associated with data records described by the four attributes. It is important to mention that the procedure learn/3 in Fig. 2, which takes an input set and trains a NN, also follows a methodology for experimenting with the number of hidden neurons in the single hidden layer of the NNs. Both procedures learn/3 and test/4, which are part of procedure train_test_NN/3, were implemented considering a variation in the number of hidden neurons from 2 up to 20, as described by procedure vary_hidden_number/2.
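The paper does not name the MLP implementation that was used; purely as an illustration, the train-on-Ci / test-on-Ci+1 loop of Fig. 2 could be sketched with scikit-learn's MLPRegressor, which is an assumption of this sketch and not the authors' code.

import numpy as np
from sklearn.neural_network import MLPRegressor

def train_test_nn(partition, n_hidden):
    """Train on each group C_i and test on C_(i+1), mirroring Fig. 2.
    Each group is assumed to be an (X, y) pair: X holds Open, High,
    Low, Volume; y holds Close."""
    errors = []
    for (X_train, y_train), (X_test, y_test) in zip(partition, partition[1:]):
        nn = MLPRegressor(hidden_layer_sizes=(n_hidden,), max_iter=2000)
        nn.fit(X_train, y_train)
        pred = nn.predict(X_test)
        errors.append(np.mean(np.abs(pred - y_test)))  # average error on C_(i+1)
    return np.mean(errors)  # tot_av_error for this partition and n_hidden

# av_matrix mirrors vary_hidden_number/2: partitions indexed by k, hidden nodes j = 2..20.
# av_matrix = [[train_test_nn(P, j) for j in range(2, 21)] for P in partitions]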
B. The Cascade Correlation Ensemble Approach
Prior to describing the ensemble method used for stock market prediction, a brief description of the Cascade Correlation algorithm, used for training the NNs in the ensemble, is given.
The Cascade Correlation (CasCor) algorithm was proposed in [11]; it is a constructive neural network learning algorithm which can be employed for both types of problems, i.e., classification and regression. The algorithm constructs a network by adding hidden neurons one at a time and organizes them in a way that resembles a cascade. The training starts by adding, to an architecture having only input neurons, the output neuron, which initially receives connections from the input neurons (4 in the case study). Once this initial NN is trained, it is possible to compute its error. Then, in order to minimize the error, a new neuron is trained and added to the network. Every new neuron receives connections from all previously added hidden neurons plus the input neurons, and its output is connected to the output neuron. The output neuron is then retrained, this time with an extra connection. Each neuron is trained using the Quickprop algorithm [12], [13], which can be approached as a quick version of the Backpropagation algorithm; however, it deals differently with the output and the hidden neurons. When training the output neuron, the aim of the algorithm is to minimize the network error, so a gradient descent on the residual error is performed. On the other hand, when training a hidden neuron, the objective is to maximize the correlation between the hidden neuron output and the network residual error, so Quickprop performs a gradient ascent on the correlation.
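For reference, the quantity maximized when training a candidate hidden neuron, in the notation of [11], is

S = \sum_{o}\Big|\sum_{p}\big(V_{p}-\bar{V}\big)\big(E_{p,o}-\bar{E}_{o}\big)\Big|

where $V_p$ is the candidate's output for training pattern $p$, $E_{p,o}$ is the residual error observed at output unit $o$ for pattern $p$, and $\bar{V}$ and $\bar{E}_o$ are the corresponding averages over all patterns (in this application there is a single output unit, so the outer sum has one term).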
The process of adding neurons to the network goes on up to a certain stop criterion; a commonly used stopping criterion is when the network reaches a predefined error rate or when a newly added neuron no longer decreases the network error. In this work, each network grows up to 20 hidden neurons, and the best of such 20 networks is used. Since all hidden weights are frozen, the only information to store is the output neuron and the number of hidden neurons which define the best architecture. Now consider defining the ensemble to cope with stock market predictions. Consider a given stock closing price along time as a data stream, where a new data pattern for a given stock is generated every day. To deal with stock closing price prediction, the ensemble works as follows:
(1) When a new data pattern becomes available, a prediction for the closing price value is produced by the ensemble, reported to the user and stored to be further used to estimate the error;
(2) At the end of the day, the actual closing price is stored as the fifth attribute value in the pattern description (see Table I);
(3) When the set of stored patterns reaches a certain volume, it is used to train a new NN, which will be added to the ensemble (referred to as NN_New). Such a set of data is referred to as Last_DS;
(4) The previous set of stored patterns is then discarded and a new one begins to be gathered.
Figure 3 shows the procedure that simulates the execution of the proposed ensemble. As in Fig. 1 and Fig. 2, the procedure shows how the results have been obtained; that is why the inferred values are considered in sets, instead of individually, as in a real application. The ensemble updating rule is similar to the one described in [14]. A predefined maximum number of base learners is required prior to execution, noted as e_size. Once the ensemble size is defined, NNs trained with CasCor/2 are added to the ensemble, one at a time, as soon as a new data set becomes available. Therefore, the ensemble grows along the execution and may reach its maximum size during the data stream processing. Up to the point where the number of created networks is less than or equal to e_size, the procedure add_to_ensemble/2 simply adds the newly created NN_New to the ensemble. Once the ensemble is complete, the procedure substitute_worst/3 is called to replace a particular network of the ensemble with the most recently created NN_New. To properly find which network should be replaced by NN_New, the Last_DS (Ci+1 in the procedure) is used to measure the average error of each network in the ensemble; the network with the highest error (noted as NN_Worst) is chosen to be replaced by NN_New. The error is calculated as the sum of the absolute differences between the actual close value and the estimated one. To estimate the closing price of a new data pattern, such a pattern is given as input to all base learners and then the ensemble mediator calculates the mean of all the output values produced by the base learners. In Fig. 3 the procedure test_ensemble/4 simulates the test on the set Ci+1, calculates the mean error and also identifies the NN_Worst, which will be replaced in the next iteration.
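A minimal sketch of the mediation and replacement rules just described follows; the names are illustrative and the base learners are assumed to expose a predict method returning a scalar for a single pattern.

import numpy as np

class StockEnsemble:
    def __init__(self, e_size):
        self.e_size = e_size      # maximum number of base learners
        self.members = []         # trained base NNs

    def predict(self, x):
        """Mediator: mean of the outputs of all base learners."""
        return np.mean([nn.predict(x) for nn in self.members])

    def update(self, nn_new, last_ds):
        """Add nn_new; once full, replace the member with the highest
        cumulative absolute error on the most recent data set (Last_DS)."""
        if len(self.members) < self.e_size:
            self.members.append(nn_new)
            return
        errors = [sum(abs(nn.predict(x) - close) for x, close in last_ds)
                  for nn in self.members]
        self.members[int(np.argmax(errors))] = nn_new  # drop NN_Worst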

procedure train_test_Ensemble(Pk, e_size, tot_av_error_k,e_size)
input: Pk, e_size
output: tot_av_error_k,e_size
tot_av_error ← 0
n ← |Pk| {n: number of groups (of data records) in Pk}
E ← ∅ {the ensemble is initially empty}
for i ← 1 to n-1 do
begin
  normalization(Ci, nCi)
  CasCor(nCi, NN_New)
  if i < e_size
    then add_to_ensemble(E, NN_New)
    else substitute_worst(E, NN_New, NN_Worst)
  test_ensemble(E, Ci+1, av_error_k,i, NN_Worst)
  tot_av_error_k,e_size ← tot_av_error_k,e_size + av_error_k,i
end
tot_av_error_k,e_size ← tot_av_error_k,e_size/(n-1)
end procedure
Fig. 3. Procedure which defines and tests the ensemble, taking into account a
particular partition of the data stream.

Prior to training, each data set has its pattern values normalized to the range [0,1] by dividing each attribute value by the highest value for that attribute, in a process implemented by procedure normalization/2. The highest value of each attribute is then stored and associated with the network just trained. When the closing value of a new data pattern needs to be inferred, the pattern is normalized using the highest attribute values associated with each network, which, in turn, produces a different normalization for each network. This is required because each network has possibly been trained with values of a different magnitude.
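A sketch of this per-network normalization, with illustrative names and the stored maxima playing the role described above:

import numpy as np

def fit_scaler(patterns):
    """Store the per-attribute maxima of a training set (normalization/2)."""
    return np.max(np.asarray(patterns, dtype=float), axis=0)

def normalize(pattern, maxima):
    """Scale a pattern with the maxima stored for a specific network,
    so each network sees inputs on the same scale it was trained with."""
    return np.asarray(pattern, dtype=float) / maxima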
IV. EXPERIMENTS, RESULTS AND ANALYSIS

As briefly mentioned before, the financial area of the stock market is highly affected by data volatility, being constantly subjected to fluctuations, many of them caused by rumors, political instability and several other factors. Therefore, such data streams pose a hard challenge for automatic learning.

Tables II, III, IV and V summarize the results obtained using NNs trained with the Backpropagation algorithm in a MLP architecture compared to the ensemble of CasCor CoNNs, in the four short data streams, Coca-Cola, IBM, Johnson & Johnson and McDonald's, respectively. As mentioned in Section III, the results for the MLP considered a single-hidden-layer MLP with the number of hidden neurons varying from 2 to 20, yielding 19 different architectures. The results for the ensemble approach considered ensemble sizes varying from 2 to 30 NNs. Due to the vast volume of results obtained, the tables only show the best and the worst result for each data stream for a given Window Size (WS).

TABLE II. RESULTS FOR THE COCA-COLA SHORT DATA STREAM. HN: NUMBER OF HIDDEN NEURONS, AE: AVERAGE ERROR, SD: STANDARD DEVIATION. ES: ENSEMBLE SIZE, AN: AVERAGE NUMBER OF NEURONS IN THE NNS IN THE ENSEMBLE. BEST CASE OF ALL IS BOLD FACED.

WS   | MLP best case      | MLP worst case       | CasCor Ens. best case    | CasCor Ens. worst case
     | HN    AE     SD    | HN    AE      SD     | ES    AN     AE     SD   | ES    AN     AE     SD
100  | -     5.65   4.78  | 20    13.84   12.53  | 2     15.4   4.65   5.02 | 19    11.4   9.28   4.43
200  | 10    6.66   5.26  | -     14.79   0.95   | 2     15.4   6.40   6.67 | 7     16.4   11.3   7.3
300  | -     5.82   2.35  | 19    16.33   14.63  | 14    11.5   10.49  5.58 | 18    15.5   18.77  10.61
400  | 11    5.85   2.3   | -     6.96    3.56   | 4     18.0   4.01   2.94 | 9     18.0   5.99   3.86
500  | 12    6.81   4.12  | 20    17.55   15.78  | 5     20.5   2.86   2.80 | 13    20.5   6.08   4.01
The results report the average error (AE) over the whole stream, followed by the standard deviation. The error is calculated as the average of the absolute differences between predicted and actual closing prices.
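In other words, for a test group with $N$ patterns the reported value is the mean absolute error

AE = \frac{1}{N}\sum_{t=1}^{N}\left|\hat{y}_{t}-y_{t}\right|

where $y_t$ is the observed closing price and $\hat{y}_t$ the predicted one.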
TABLE III. RESULTS FOR THE IBM SHORT DATA STREAM. HN: NUMBER OF
HIDDEN NEURONS, AE: AVERAGE ERROR, SD: STANDARD DEVIATION. ES:
ENSEMBLE SIZE, AN: AVERAGE NUMBER OF NEURONS IN THE NNS IN THE
ENSEMBLE. BEST CASE OF ALL IS BOLD FACED.
WS   | MLP best case      | MLP worst case       | CasCor Ens. best case    | CasCor Ens. worst case
     | HN    AE     SD    | HN    AE      SD     | ES    AN     AE     SD   | ES    AN     AE     SD
100  | 10    4.51   2.12  | -     6.19    4.15   | 3     18.5   3.7    2.75 | 20    16.9   9.33   6.51
200  | -     4.89   1.98  | 19    7.90    4.43   | 2     18.4   3.83   2.92 | 5     19.8   9.74   7.27
300  | 12    2.99   1.15  | 19    6.34    3.45   | 5     15.5   6.92   4.68 | 6     12.5   10.45  7.51
400  | 11    9.87   4.54  | 20    11.77   5.78   | 10    10.33  5.18   4.61 | 13    21.0   13.76  8.79
500  | -     7.99   5.15  | 19    45.13   40.59  | 18    8.5    5.60   4.55 | 5     8.5    24.50  13.28
The NN architectures for the best and worst cases are also described, i.e., the number of hidden neurons (HN) for the MLP, and the ensemble size (ES) together with the average number of hidden neurons composing the ensemble (AN) for the ensemble.

TABLE IV. RESULTS FOR THE JOHNSON AND JOHNSON SHORT DATA STREAM .
HN: NUMBER OF HIDDEN NEURONS, AE: AVERAGE ERROR, SD: STANDARD
DEVIATION. ES: ENSEMBLE SIZE, AN: AVERAGE NUMBER OF NEURONS IN THE
NNS IN THE ENSEMBLE. BEST CASE OF ALL IS BOLD FACED.
WS   | MLP best case      | MLP worst case       | CasCor Ens. best case    | CasCor Ens. worst case
     | HN    AE     SD    | HN    AE      SD     | ES    AN     AE     SD   | ES    AN     AE     SD
100  | -     2.58   1.41  | 20    28.18   26.16  | 15    11.6   1.88   2.19 | 11    14.0   3.51   2.38
200  | 11    3.95   2.39  | 19    10.28   6.21   | 11    15.8   1.31   1.63 | 3     20.4   4.04   2.22
300  | 11    4.14   1.97  | 20    54.16   51.67  | 11    11.75  1.83   1.40 | 3     16.25  3.74   2.04
400  | 10    2.69   1.30  | 20    5.94    2.65   | 13    16.0   2.69   2.00 | 9     19.66  5.92   3.40
500  | -     2.25   1.01  | 19    10.08   8.41   | 2     17.5   3.00   2.09 | 9     20.0   8.32   4.91

TABLE V. RESULTS FOR THE MCDONALD'S SHORT DATA STREAM. HN: NUMBER OF HIDDEN NEURONS, AE: AVERAGE ERROR, SD: STANDARD DEVIATION. ES: ENSEMBLE SIZE, AN: AVERAGE NUMBER OF NEURONS IN THE NNS IN THE ENSEMBLE. BEST CASE OF ALL IS BOLD FACED.
WS   | MLP best case      | MLP worst case       | CasCor Ens. best case    | CasCor Ens. worst case
     | HN    AE     SD    | HN    AE      SD     | ES    AN     AE     SD   | ES    AN     AE     SD
100  | -     4.54   2.20  | 19    7.39    4.18   | 4     19.6   3.41   2.01 | 11    15.1   6.64   5.85
200  | 10    4.56   2.20  | 17    7.64    4.13   | 2     20.4   3.39   2.30 | 12    20.2   5.60   3.29
300  | 13    7.22   5.65  | 19    14.88   12.01  | 2     21.0   4.09   2.41 | 4     19.0   5.45   3.48
400  | -     13.18  10.98 | 20    44.86   42.13  | 17    14.66  1.78   1.19 | 4     18.66  5.24   2.68
500  | 15    17.79  15.65 | 19    30.35   20.98  | 19    17.0   1.11   0.78 | 10    18.5   5.54   2.90

It is important to remember that the experiments with MLPs trained with Backpropagation were conducted aiming at providing results for a comparative analysis against the ensembles of CoNNs, the main goal of this work.

Taking into account all four data streams considered, as well as all the Backpropagation-trained MLPs, in general the best results (i.e., the smallest values for the average error) were obtained by NNs having around 10 hidden neurons, while the less satisfactory results were produced by NNs having a very small (2, 3) or very high (19, 20) number of hidden neurons, with one exception (Table IV). During the experiments it was also observed, as expected, that the higher the number of hidden neurons, the slower the NN training. Particularly in situations where the number of hidden nodes is high and the size of the window is large, results exhibited the highest standard deviation. Also, when using larger data windows there is a higher dispersion among the obtained results; this can be a consequence of the testing set incorporating patterns representing a new trend in the market, which can be characterized as the occurrence of concept drift.

Now, regarding the ensemble of CoNNs, it can be seen that it has given the best results in three out of the four domains. The best result provided by the regular MLP in the IBM domain seems to be an isolated fact: a finer inspection shows that, when considering every possible window size in all domains, the ensemble presented better results in 17 out of the 20 displayed results. This clarifies that the inherent characteristics of the ensemble strategy, combined with the flexibility of CoNNs as base learners, make a suitable approach for data stream processing. The MLP, on the other hand, does not have mechanisms to handle concept drift. Even if a good model is inferred, it will soon be outdated by the concept drift usually embedded in the data stream; also, it cannot handle recurrent drifts, since it does not store knowledge along the data stream processing.
Further analysis concerning the size of the networks shows that, on average, the MLPs were induced with 9.8 hidden neurons, with a standard deviation of 2.53. The CoNNs in the ensemble presented an average of 15.8 hidden neurons, with a standard deviation of 3.6. The ensemble size was, on average, 8.4, with a standard deviation of 6.17. Notice that the MLP networks are smaller than the CasCor networks; this can be due to the different training algorithm employed and also to the incremental nature of CasCor networks. Regarding the ensemble size, the best configuration depends on the domain; evidence for that is the high value of the standard deviation.
Consider now analyzing both strategies when dealing with a data stream with 10,000 patterns, which represents a more realistic scenario for a non-stationary data stream. In such a scenario, concept drifts may occur several times, and the learning system should have mechanisms to overcome them. Table VI presents the results regarding the four considered stocks. Aiming at establishing a baseline performance, for each domain the best result obtained with the MLP is shown, along with the description of its configuration, i.e., Window Size and Number of Hidden Neurons. Also, for each domain, two results regarding the ensemble are shown: the performance with the same window size as the best configuration of the MLP, and the best result obtained with the ensemble.
As can be seen in the results, the ensemble has provided the best results due to its capacity to handle concept drift. Another point to stress regards the window size for which the best values have been obtained. Notice that, for the ensemble approach, the best performances have generally been obtained with a window size of 100 patterns. This is intuitive because the system gathers new information sooner than with other strategies. Aside from that, another trend that can be seen in the results is the reduced number of NNs in the ensemble, which, again, facilitates the quick incorporation of new concepts. Figure 4 shows the original data stream of closing prices and the predictions made by the ensemble approach, considering the configuration that resulted in the best value, as shown in Table VI.

TABLE VI. RESULTS CONSIDERING THE FOUR DOMAINS WITH 10,000 PATTERNS. HN: NUMBER OF HIDDEN NEURONS, AE: AVERAGE ERROR, SD: STANDARD DEVIATION. ES: ENSEMBLE SIZE, AN: AVERAGE NUMBER OF NEURONS IN THE NNS IN THE ENSEMBLE. BEST CASE OF ALL IS BOLD FACED.

Domain            | MLP                      | CasCor Ensemble (same WS)   | CasCor Ensemble (best result)
                  | WS    HN    AE     SD    | ES    AN     AE     SD      | WS    ES    AN     AE     SD
Coca-Cola         | 100   -     5.22   4.22  | 3     15.8   2.57   2.76    | 500   3     15.05  1.74   2.03
IBM               | 300   12    3.04   1.18  | 2     16.43  4.97   10.63   | 100   2     13.11  2.25   6.16
McDonald's        | 500   -     13.20  11.04 | 19    18.50  4.1    4.46    | 100   3     17.3   2.11   1.79
Johnson & Johnson | 100   -     4.53   3.50  | 3     15.11  3.04   2.87    | 100   3     15.11  3.04   2.87


[Fig. 4: four stacked panels (Coca-Cola, IBM, McDonald's, Johnson & Johnson) plotting the observed and predicted daily closing price (USD) over the forecasting period of 10,000 days, from Mar 26, 1974 to Nov 08, 2013.]

Fig. 4. Daily closing price forecast. Each panel corresponds to the best ensemble result on that domain.

It can be seen in Fig. 4 that, in general, the predictions carried out in the four domains fit the whole original stream well. Indeed, these results show that the ensemble approach is suitable to tackle non-stationary learning, in particular stock closing price prediction. It is important to mention that a real stock market prediction system would hardly consider such a long time period (i.e., 10,000 days). This range was chosen to show the forecasting performance of the proposed approach and its ability to handle concept drift, even for long-life applications.
V. CONCLUSIONS

This paper addressed the problem of stock closing price forecasting considering an ensemble of constructive neural networks. The adopted approach presents two characteristics suitable to non-stationary learning: 1) the base learners are CoNNs, which implies no pre-definition of the network architecture and fast training; 2) the ensemble approach provides an effective way to incorporate new knowledge, as well as to discard old knowledge, along the data stream processing. Experimental results on real stock data, considering regular NNs (MLP) as a baseline, show that the approach proposed in this work is indeed an effective method to cope with learning and prediction in dynamic environments, such as financial markets. Future work aiming at enhancing the ensemble performance includes exploring other ways to mediate the ensemble, such as using weights, and designing strategies, guided by the current ensemble error, to avoid inconsistent peaks. Such improvements should be followed by further comparisons against state-of-the-art methods on stock market forecasting.
ACKNOWLEDGMENT
The authors would like to thank FAPESP and CNPq for the financial support.

REFERENCES
[1] F. Fdez-Riverola, E. L. Iglesias, F. Díaz, J. Méndez, J. Corchado, "Applying lazy learning algorithms to tackle concept drift in spam filtering," Expert Systems with Applications, vol. 33, pp. 36-48, 2007.
[2] H. Wang, W. Fan, P. Yu, J. Han, "Mining concept-drifting data streams using ensemble classifiers," In: Proc. of The ACM Conference on Knowledge Discovery and Data Mining, pp. 226-235, 2003.
[3] C. S. Vui, G. K. Soon, C. K. On, R. Alfred, P. Anthony, "A review of stock market prediction with Artificial Neural Network (ANN)," In: Proc. of The IEEE Conference on Control System, Computing and Engineering, pp. 477-482, 2013.
[4] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning, Springer Series in Statistics, Springer-Verlag, 2009.
[5] C. M. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, 1996.
[6] C. M. Bishop, Pattern Recognition and Machine Learning, Springer-Verlag, 2007.
[7] J. R. Bertini Jr., M. C. Nicoletti, L. Zhao, "Ensemble of complete p-partite graph classifiers for non-stationary environments," In: Proc. of The IEEE Congress on Evolutionary Computation, pp. 1802-1809, 2013.
[8] J. R. Bertini, L. Zhao, A. Lopes, "An incremental learning algorithm based on the K-associated graph for non-stationary data classification," Information Sciences, vol. 246, pp. 52-68, 2013.
[9] A. Fern, R. Givan, "Online ensemble learning: an empirical study," Machine Learning, vol. 53, pp. 71-109, 2003.
[10] D. E. Rumelhart, G. E. Hinton, R. J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, no. 6088, pp. 533-536, 1986.
[11] S. Fahlman, C. Lebiere, "The cascade correlation architecture," in: D. S. Touretzky (Ed.), Advances in Neural Information Processing Systems 2, Morgan Kaufmann, San Francisco, CA, pp. 524-532, 1990.
[12] S. Fahlman, "Faster-learning variations on Backpropagation: an empirical study," In: Proc. of the 1988 Connectionist Models Summer School, D. S. Touretzky, G. E. Hinton and T. J. Sejnowski (Eds.), Morgan Kaufmann, San Mateo, CA, pp. 38-51, 1988.
[13] S. Fahlman, "An empirical study of learning speed in Backpropagation networks," School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, Technical Report CMU-CS-88-162, 1988.
[14] N. Street, Y. Kim, "A streaming ensemble algorithm (SEA) for large-scale classification," In: Proc. Int. Conf. Knowledge Discovery and Data Mining, ACM, pp. 377-382, 2001.