Author(s): D. R. Cox
Source: Journal of the Royal Statistical Society. Series B (Methodological), Vol. 23, No. 2
(1961), pp. 414-422
Published by: Wiley for the Royal Statistical Society
Stable URL: http://www.jstor.org/stable/2984031
Accessed: 03-03-2017 13:51 UTC
This content downloaded from 163.1.41.46 on Fri, 03 Mar 2017 13:51:49 UTC
All use subject to http://about.jstor.org/terms
Prediction by Exponentially Weighted Moving Averages

By D. R. Cox

Birkbeck College, University of London
SUMMARY
The mean square error of prediction is calculated for an exponentially
weighted moving average (e.w.m.a.), when the series predicted is a Markov
series, or a Markov series with superimposed error. The best choice of
damping constant is given; the choice is not critical. There is a value of the
Markov correlation ρ₀ below which it is impossible to predict, with an
e.w.m.a., the local variations of the series. The mean square error of an
e.w.m.a. is compared with the minimum possible value, namely that for the
best linear predictor (Wiener). A modified e.w.m.a. is constructed having a
mean square error approaching that of the Wiener predictor. This modification
will be of value if the Markov correlation parameter is negative, and
possibly also when the Markov parameter is near ρ₀.
1. INTRODUCTION
2. STATIONARY SERIES
In many applications it is most unreasonable to suppose that {x(t)} is a simple
stationary process. It is common, even for series that are apparently locally stationary,
for there to be long-term drifts in mean level, and possibly also drifts in variance and
correlation structure; see, for example, Jowett (1955). Although such series can be
represented by stationary processes with a spectrum containing a substantial low
frequency component, this component will usually be unknown in applications.
Hence predictors have to be constructed to deal with series that are subject to
arbitrary long-term variations. It is, of course, an advantage of prediction by a
moving average, rather than by a linear function whose coefficients do not sum to
one, that any drift in mean is followed by the predictor.
It is, however, a reasonable first step to examine the behaviour of the e.w.m.a.
when {x(t)} is stationary. The resulting mean square error of prediction will be a
close approximation to that achieved for a non-stationary series that is locally
stationary over lengths k such that A^k is not negligible.
Consider, then, a stationary process with

E{x(t)} = μ,  var{x(t)} = σ²,  cov{x(t), x(t + k)} = ρ_k σ².  (4)
If we consider (2) as a predictor of x(t + h), we have for the mean square error of
prediction
Δ_e(h; A) = 2σ²/(1 + A) + {2σ²(1 − A)/(1 + A)} Σ_{r=1}^∞ A^r ρ_r − 2σ²(1 − A) Σ_{r=0}^∞ A^r ρ_{r+h}.  (5)
For prediction one step ahead, h = 1, of a Markov series, ρ_r = ρ^r, this becomes

Δ_e(1; A) = 2σ²(1 − ρ) / {(1 − Aρ)(1 + A)}.  (9)
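As a numerical check on the algebra, expression (5) can be evaluated with the infinite sums truncated and compared with the closed form (9). A minimal Python sketch; the function names and the truncation length N are illustrative choices, not part of the original analysis:

```python
def mse_general(A, rho, h, sigma2=1.0, N=5000):
    # Expression (5): m.s.e. of e.w.m.a. prediction h steps ahead, for a
    # stationary series with correlations rho(r); sums truncated at N terms
    s1 = sum(A**r * rho(r) for r in range(1, N))
    s2 = sum(A**r * rho(r + h) for r in range(N))
    return (2*sigma2/(1 + A)
            + 2*sigma2*(1 - A)/(1 + A) * s1
            - 2*sigma2*(1 - A) * s2)

def mse_markov_one_step(A, p, sigma2=1.0):
    # Closed form (9): Markov series, rho(r) = p**r, prediction one step ahead
    return 2*sigma2*(1 - p) / ((1 - A*p) * (1 + A))

# The two expressions agree over a range of damping constants and correlations
for p in (0.2, 0.5, 0.8):
    for A in (0.3, 0.6, 0.9):
        assert abs(mse_general(A, lambda r: p**r, 1) - mse_markov_one_step(A, p)) < 1e-9
```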
A number of conclusions can be drawn from these formulae. We illustrate them
by a detailed analysis of the simplest situation, namely prediction one step ahead,
h = 1, for a Markov series, ω = 1. Then the mean square error (9) is, for given ρ,
minimized by a predictor x_e(t, 1; A_opt), where

A_opt = (1 − ρ)/(2ρ)  (ρ ≥ 1/3),  A_opt = 1  (ρ < 1/3).  (10)
TABLE 1
[table entries not legible in this copy]

TABLE 2
Markov series (ρ > 1/3). Prediction one step ahead. Mean square error of
prediction divided by σ² as function of A
[table entries not legible in this copy]
If we predict more than one step ahead, there is a substantial increase in the
critical value of ρ below which the optimum value of A is one. If we consider a process
with ω < 1, the critical value is not much changed. These facts are illustrated in
Table 3.
TABLE 3
Critical value of ρ for h = 1, 2, 4 and for values of ω
[table entries not legible in this copy]
predictor follows to some extent the local fluctuations in the process as specified by
the Markov series. On the other hand, if ρ < 1/3, the best that the e.w.m.a. can do is
to give an estimate of the mean around which the Markov fluctuations occur. In a
practical case with ρ < 1/3, we would not take A too large, in order to obtain a predictor
insensitive to long-term fluctuations in mean level. Some guidance on this is provided
by Table 4, which gives the variance obtained for a stationary Markov series with
ρ < 1/3 when values of A less than one are used.
TABLE 4
Markov series with ρ < 1/3. Prediction one step ahead. Mean square error of
prediction for values of A near one
[table entries not legible in this copy]
Provided that ρ is not negative, there is not much to be gained by having A greater
than about 0.8 or 0.9. If ρ is appreciably negative, it is clear on general grounds that
an e.w.m.a. is not a good predictor.
It is natural to compare the e.w.m.a. with the optimum linear predictor x_w(t, h)
for the stationary process (Wiener, 1949; Middleton, 1960, Chapter 4). This predictor
is the linear combination of {..., x(t − 1), x(t)} which, for a stationary process with
known parameters, minimizes the mean square error of prediction. For the Markov
series (4), with ρ_k = ρ^k, it is well known that

x_w(t, h) = μ + ρ^h {x(t) − μ},  (12)

with mean square error of prediction

Δ_w(h) = σ²(1 − ρ^{2h}).  (13)
the critical value of ρ (i.e. ρ₀ = 1/3 when h = 1), where the difference in mean square
errors is probably enough to be of practical interest.
There are two objections to the practical use of the Wiener predictor (12). One
relatively unimportant point is that the parameters μ and ρ are assumed known.
However, if the underlying process is genuinely a stationary Markov Gaussian process
with μ, ρ, σ² unknown, the method of maximum likelihood can be used to estimate
x(t + h). This is very nearly the same as replacing ρ and μ in (12) by the usual estimates
based on the observed section of series. If a long portion of the series is available for
estimating μ and ρ, the mean square error of prediction is asymptotically unaffected
by replacing μ and ρ by estimates, whether or not the series is Gaussian.
A much more serious objection to x_w(t, h) is that if the mean shifts from μ to say μ′,
the predictor will be biased, and the mean square error correspondingly increased.
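The effect of such a shift can be seen on an artificial noise-free series. The sketch below uses hypothetical numbers and assumes the Markov form of the Wiener predictor, μ + ρ^h{x(t) − μ}; it contrasts that predictor with its mean held at the pre-shift level against the same predictor with μ replaced by an e.w.m.a., the device considered in the next section:

```python
rho, h, mu0, A = 0.5, 1, 0.0, 0.8

# Noise-free series: level 0, then a sustained shift of the mean to 5
x = [0.0]*50 + [5.0]*50

ewma = 0.0
for xt in x:
    ewma = (1 - A)*xt + A*ewma                # running estimate of the mean

fixed_mean = mu0 + rho**h * (x[-1] - mu0)     # Wiener form, mean held at 0
adaptive   = ewma + rho**h * (x[-1] - ewma)   # mean replaced by the e.w.m.a.

# Long after the shift the series sits at 5: the fixed-mean predictor
# remains biased towards the old level, the adaptive one does not
assert abs(fixed_mean - 5.0) > 2.0
assert abs(adaptive - 5.0) < 0.01
```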
It is at least of theoretical interest to see whether the mean square error of the Wiener
predictor can be approached by a predictor that is not sensitive to long-term
fluctuations, and this we consider in the next section.
Consider a predictor of the form

x̂(t, h) = Σ_{r=0}^∞ g_r x(t − r),  (14)

subject to

Σ_{r=0}^∞ g_r = 1,  (15)

in order that if the process fluctuates around a value μ, then for all t the predictor,
too, should fluctuate around μ. Then, subject to (15), it would be natural to choose
{g_r} to minimize the mean square error of prediction for some plausible form of
process {x(t)}. Such an extension of Wiener's theory is, for continuous time, a special
case of the work of Zadeh and Ragazzini (1950), who show that an integral equation
determining the predictor can be formulated and in principle solved.
If, however, we are interested in a predictor subject to (15), having optimum
properties when {x(t)} is a stationary Markov series, we can argue in a simple way
from first principles. For the minimum mean square error subject to (15) cannot be
less than the value (13) for the Wiener predictor. Therefore, if we can produce a
predictor (14) having a mean square error arbitrarily close to (13), we have solved the
problem. But such a predictor is obviously

x_m(t, h; A) = ρ^h x(t) + (1 − ρ^h) x_e(t, h; A),  (16)

or equivalently

x_m(t, h; A) = (1 − A + Aρ^h) x(t) + A(1 − ρ^h) x_e(t − 1, h; A).  (17)
The predictor is easily computed from the recurrence relation (17). As A approaches
one, the mean square error of (16) will approach that of the Wiener predictor. The
predictor (16) is obtained from the Wiener predictor (12) by replacing μ by an
e.w.m.a. with parameter A. Note that when ρ ≠ 0, (16) is not an e.w.m.a. The
expression (16) can be regarded as a moving average in which the weights for x(t − 1),
x(t − 2), ... fall off exponentially, but in which x(t) receives a weight that is not a
member of the same geometric series. We call (16) a modified exponentially weighted
moving average (m.e.w.m.a.).
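A m.e.w.m.a. computation can be sketched as follows. The Python coding below, with the e.w.m.a. seeded at the first observation, is one reasonable reading of the description above rather than a prescription from it:

```python
def mewma(x, A, rho, h):
    # Modified e.w.m.a.: a Wiener-type weight rho**h on x(t) plus the
    # complementary weight (1 - rho**h) on an e.w.m.a. estimate of the mean
    preds, xe = [], x[0]              # seed the e.w.m.a. at the first value
    for xt in x:
        xe = (1 - A)*xt + A*xe        # e.w.m.a. recurrence
        preds.append(rho**h * xt + (1 - rho**h) * xe)
    return preds

x = [1.0, 2.0, 0.5, 1.5, 3.0, 2.5]

# Limiting cases: rho = 1 reproduces the last observation exactly,
# and rho = 0 reduces to a plain e.w.m.a.
assert mewma(x, 0.8, 1.0, 1) == x
plain, xe = mewma(x, 0.8, 0.0, 1), x[0]
for xt, pt in zip(x, plain):
    xe = (1 - 0.8)*xt + 0.8*xe
    assert pt == xe
```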
If we take A too close to unity the predictor (16) will be sensitive to long-term drifts.
It is therefore sensible to choose A as small as possible subject to the mean square
error of prediction being sufficiently close to its minimum possible value. Now for a
Markov process, the mean square error of (16) is

Δ_m(h; A) = σ²(1 − ρ^{2h}) + σ²(1 − ρ^h)² (1 − A)(1 + Aρ) / {(1 + A)(1 − Aρ)}.  (18)
TABLE 5

h = 1
        m.e.w.m.a.                       e.w.m.a.            Wiener
        Δ_m(1, A)/σ²                     Δ_e(1, A_opt)/σ²    Δ_w(1)/σ²
ρ       A = 0.5   0.7   0.8   0.9   0.95
[table entries not legible in this copy]

h = 2
        m.e.w.m.a.                       e.w.m.a.            Wiener
        Δ_m(2, A)/σ²                     Δ_e(2, A_opt)/σ²    Δ_w(2)/σ²
ρ       A = 0.5   0.7   0.8   0.9   0.95
[table entries not legible in this copy]
Table 5 shows that, if ρ is appreciably negative, the m.e.w.m.a. works very well
with values of A not too near 1. Further, a modest improvement in mean square
error is possible with values of A in the range 0.7–0.9 if 0.4 ≤ ρ ≤ 0.8. One would
probably not often in practice want to use values of A greater than 0.9.
A modified predictor similar to (16) can be formed whenever the corresponding
Wiener predictor is known. Thus if {x(t)} is a second-order autoregressive process,
the Wiener predictor has the form

x_w(t, h) = μ + α_h {x(t) − μ} + β_h {x(t − 1) − μ},  (20)

where α_h and β_h are determined by the autoregressive coefficients.
In modified predictors such as (16) and (20), the exponentially weighted component
can be replaced by any reasonable form of moving average. We shall not attempt
here to examine whether the requirement of minimizing the effect of long-term trends,
or some other requirement, can be used to choose between alternative forms of
moving average.
5. ESTIMATION OF PARAMETERS
In order to make direct use of the above work, a section of observed series must
be analysed to see whether the local fluctuations are in agreement with a Markov
process, or with a Markov process with superimposed error, and if so to estimate the
value of p.
A method of estimation unaffected by long-term drifts is the serial variogram of
Jowett. In this,

½ ave {x(t) − x(t + k)}²  (21)

is plotted against k. For a locally Markov process, (21) will, for sufficiently small k,
be proportional to 1 − ρ^k. Jowett (1955) has given a convenient semi-graphical method
for fitting a local Markov process and this should be adequate for the present purpose;
alternatively the more elaborate method of Davies and Jowett (1958) could be used.
If the agreement with a local Markov series seems good, the optimum A for an
e.w.m.a. can be obtained from (7) and (10) (or Table 1). Table 5 will show whether
there is likely to be a worth-while improvement by using a m.e.w.m.a. predictor (16).
If the series is locally a Markov process with superimposed error, formulae (7) and
(8) can be used. If the process has a more complex local structure, the use of predictors
like (20) is worth considering. If, however, it is for simplicity still desired to use an
e.w.m.a., it seems reasonable to fit a value of ρ to the first few terms of (21) and again
to determine a value of A from (10). Confirmation of values of A by dry-running will
often be advisable.
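The variogram computation and a crude moment fit of ρ can be sketched in Python. Simulated data are used here, since Jowett's series is not reproduced, and the estimator is an illustrative choice rather than the semi-graphical method referred to above:

```python
import random

def variogram(x, k):
    # Expression (21): half the average squared difference at lag k
    d = [(x[t] - x[t + k])**2 for t in range(len(x) - k)]
    return 0.5 * sum(d) / len(d)

# Simulated Markov (first-order autoregressive) series, rho = 0.6, sigma^2 = 1
random.seed(1)
rho_true, x = 0.6, [0.0]
for _ in range(20000):
    x.append(rho_true*x[-1] + random.gauss(0.0, (1 - rho_true**2)**0.5))

# For a Markov series (21) equals sigma^2 (1 - rho^k); estimating sigma^2 by
# the sample variance, the lag-1 variogram gives a moment estimate of rho
m = sum(x)/len(x)
s2 = sum((xi - m)**2 for xi in x)/len(x)
rho_hat = 1 - variogram(x, 1)/s2
assert abs(rho_hat - rho_true) < 0.05
```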
6. SOME EXTENSIONS
The work described in this paper could be extended in various ways, of which
the following are examples.
(a) Exponentially weighted moving averages can be used for filtering or smoothing
instead of for prediction. Further, in some applications, two-sided e.w.m.a.'s could
be used. A possible application is to serial sampling inspection schemes (Cox, 1960).
(b) In some applications we would be interested not only in the error of prediction
at one time point but also in the form of the stochastic process {x̂(t)} of predicted
values. For the predictors we have considered, the transformation from {x(t)} to
{x̂(t)} is linear and hence is described by a transfer function. Note in particular that
if {x(t)} is completely random, then e.w.m.a.'s form a Markov process with correlation
function A^k.
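The last remark can be verified from the weights alone: for completely random input the lag-k autocorrelation of an e.w.m.a. with weights (1 − A)A^r works out to exactly A^k. A small Python check (the truncation length N is an arbitrary choice):

```python
def ewma_output_autocorr(A, k, N=4000):
    # Autocorrelation at lag k of an e.w.m.a. of completely random input,
    # computed from the weights w_r = (1 - A) * A**r, truncated at N terms
    w = [(1 - A)*A**r for r in range(N)]
    var = sum(wr*wr for wr in w)
    cov = sum(w[r]*w[r + k] for r in range(N - k))
    return cov / var

for A in (0.3, 0.7, 0.9):
    for k in (1, 2, 5):
        assert abs(ewma_output_autocorr(A, k) - A**k) < 1e-8
```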
(c) Sometimes, approximately linear trends will be expected over substantial
lengths of series. It is clear that an e.w.m.a. will follow the trend, but with a time lag,
which may be serious if A or h are appreciable. There are various ways of dealing
with this difficulty, of which one is to require that the weights g_r in (14) satisfy not
only (15), but also a further condition

Σ_{r=0}^∞ r g_r = −h.  (22)

This ensures that if x(t) = a + bt, the predicted value at t + h is a + b(t + h), for all a, b
(Zadeh and Ragazzini, 1950). No e.w.m.a. can satisfy (22), but it is easy to construct
simple predictors that do satisfy the condition, for example by combining two
e.w.m.a.'s.
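Taking the further condition to be Σ r g_r = −h (which is what exactness on x(t) = a + bt requires of weights applied to x(t − r)), one such construction combines two e.w.m.a.'s with coefficients chosen to meet both conditions. A Python sketch; the particular damping constants are arbitrary, and one coefficient comes out negative:

```python
def two_ewma_weights(A1, A2, h, N=3000):
    # Combine c1*e.w.m.a.(A1) + c2*e.w.m.a.(A2) so the overall weights g_r
    # satisfy sum g_r = 1 and sum r*g_r = -h.  An e.w.m.a. with constant A
    # has mean lag A/(1 - A), so c1 and c2 solve a 2x2 linear system.
    L1, L2 = A1/(1 - A1), A2/(1 - A2)
    c1 = (-h - L2)/(L1 - L2)
    c2 = 1 - c1
    return [c1*(1 - A1)*A1**r + c2*(1 - A2)*A2**r for r in range(N)]

g = two_ewma_weights(0.5, 0.8, h=1)
assert abs(sum(g) - 1.0) < 1e-9
assert abs(sum(r*gr for r, gr in enumerate(g)) - (-1.0)) < 1e-6

# Applied to an exact linear trend, the prediction is exact
x = [2.0 + 0.3*t for t in range(4000)]
t = len(x) - 1
pred = sum(g[r]*x[t - r] for r in range(len(g)))
assert abs(pred - (2.0 + 0.3*(t + 1))) < 1e-6
```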
(d) The work could be extended to non-linear predictors.
7. EMPIRICAL TRIAL
For an empirical test of some of the conclusions of this paper, some data (Jowett,
1955) on per cent. nitrogen in gas from a blast furnace have been used. The 70
observations were equally spaced in time, and Jowett showed that over short stretches
the serial variogram agreed excellently with a Markov series of variance σ² = 4.5 and
first serial correlation coefficient ρ = 0.56; there was some additional long-term
variation in the series.
According to (10), the optimum e.w.m.a. has A ≈ 0.4. By Table 2, the mean square
error of prediction is insensitive to A, but there should be a just perceptible increase in
mean square error if we took A = 0.2 or 0.6. Empirical mean square errors of prediction
for h = 1 have been calculated and are compared in Table 6 with the theoretical
mean square errors calculated from (9). Although the empirical mean square errors
are slightly below the theoretical values, the general agreement is very satisfactory.
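The comparison of empirical and theoretical mean square errors can be sketched as follows. The blast-furnace series itself is not reproduced here, so a simulated Markov series with ρ = 0.56, σ² = 4.5 stands in for it; the theoretical value from (9) at A = 0.4 is about 3.65, consistent with Table 6:

```python
import random

def empirical_one_step_mse(x, A):
    # Empirical m.s.e. of one-step e.w.m.a. prediction over a series
    m, se = x[0], 0.0
    for t in range(1, len(x)):
        se += (x[t] - m)**2           # m is the prediction of x[t]
        m = (1 - A)*x[t] + A*m        # update after observing x[t]
    return se / (len(x) - 1)

# Simulated Markov series with rho = 0.56, sigma^2 = 4.5 (synthetic data)
random.seed(2)
rho, sigma2, x = 0.56, 4.5, [0.0]
for _ in range(50000):
    x.append(rho*x[-1] + random.gauss(0.0, (sigma2*(1 - rho**2))**0.5))

theory = 2*sigma2*(1 - rho)/((1 - 0.4*rho)*(1 + 0.4))   # equation (9), A = 0.4
empirical = empirical_one_step_mse(x, 0.4)
assert abs(theory - 3.65) < 0.01
assert abs(empirical - theory)/theory < 0.05
```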
Table 5 shows that some reduction in mean square error should be obtainable
from the modified exponentially weighted moving average (16). Its properties have
accordingly been worked out for ρ = 0.56, A = 0.8, and are given in Table 6.
TABLE 6
Mean square errors of prediction, h = 1

                                   Empirical    Theoretical
e.w.m.a., A = 0.2                    3.69          3.72
e.w.m.a., A = 0.4                    3.55          3.65
e.w.m.a., A = 0.6                    3.64          3.73
m.e.w.m.a., A = 0.6, ρ = 0.56        3.30          3.34
Of course, the values in Table 6 are obtained by predicting over the same section
of series as is used for the estimation of the parameters of the process. Hence in
Table 6 the agreement with the theoretical results is partly tautological. However,
the insensitivity to the choice of A of the mean square error for the e.w.m.a. means
that quite substantial errors in estimating ρ will not matter. Provided that the section
of series to be predicted has the same general structure as that used for estimation,
the present methods should be applicable.
ACKNOWLEDGEMENTS
It is a pleasure to thank Mr W. N. Jessop and his colleagues, Operational Research
Dept, Courtaulds Ltd, both for discussions of prediction problems which led to the
present work, and also for the calculation of Tables 1-5, which was done under the
supervision of Mr J. Luckman.
REFERENCES
COX, D. R. (1960), "Serial sampling acceptance schemes derived from Bayes's theorem",
Technometrics, 2, 353-360.
DAVIES, H. M. and JOWETT, G. H. (1958), "The fitting of Markoff serial variation curves",
J. R. statist. Soc. B, 20, 120-142.
JOWETT, G. H. (1955), "The comparison of means of industrial time series", Appl. Statist., 4,
32-46.
MIDDLETON, D. (1960), Introduction to Statistical Communication Theory. New York:
McGraw-Hill.
MUIR, A. (1958), "Automatic sales forecasting", Brit. Computer J., 1, 113-116.
WIENER, N. (1949), The Extrapolation, Interpolation and Smoothing of Stationary Time-Series,
with Engineering Applications. New York: Wiley.
ZADEH, L. A. and RAGAZZINI, J. R. (1950), "An extension of Wiener's theory of prediction",
J. appl. Phys., 21, 645-655.