Author(s): D. R. Cox
Source: Journal of the Royal Statistical Society. Series B (Methodological), Vol. 23, No. 2
(1961), pp. 414-422
Published by: Wiley for the Royal Statistical Society
Stable URL: http://www.jstor.org/stable/2984031
Accessed: 03-03-2017 13:51 UTC
This content downloaded from 163.1.41.46 on Fri, 03 Mar 2017 13:51:49 UTC
All use subject to http://about.jstor.org/terms
Prediction by Exponentially Weighted Moving Averages

By D. R. Cox

Birkbeck College, University of London
SUMMARY
The mean square error of prediction is calculated for an exponentially
weighted moving average (e.w.m.a.), when the series predicted is a Markov
series, or a Markov series with superimposed error. The best choice of
damping constant is given; the choice is not critical. There is a value of the
Markov correlation ρ₀ below which it is impossible to predict, with an
e.w.m.a., the local variations of the series. The mean square error of an
e.w.m.a. is compared with the minimum possible value, namely that for the
best linear predictor (Wiener). A modified e.w.m.a. is constructed having a
mean square error approaching that of the Wiener predictor. This modification
will be of value if the Markov correlation parameter is negative, and
possibly also when the Markov parameter is near ρ₀.
1. INTRODUCTION
2. STATIONARY SERIES
In many applications it is most unreasonable to suppose that {x(t)} is a simple
stationary process. It is common, even for series that are apparently locally stationary,
for there to be long-term drifts in mean level, and possibly also drifts in variance and
correlation structure; see, for example, Jowett (1955). Although such series can be
represented by stationary processes with a spectrum containing a substantial low
frequency component, this component will usually be unknown in applications.
Hence predictors have to be constructed to deal with series that are subject to
arbitrary long-term variations. It is, of course, an advantage of prediction by a
moving average, rather than by a linear function whose coefficients do not sum to
one, that any drift in mean is followed by the predictor.
It is, however, a reasonable first step to examine the behaviour of the e.w.m.a.
when {x(t)} is stationary. The resulting mean square error of prediction will be a
close approximation to that achieved for a non-stationary series that is locally
stationary over lengths k such that A^k is not negligible.
Consider, then, a stationary process with

E{x(t)} = μ,  var{x(t)} = σ²,  cov{x(t), x(t + k)} = ρ_k σ².  (4)
If we consider (2) as a predictor of x(t + h), we have for the mean square error of
prediction
Δ_e(h; A) = 2σ²/(1 + A) + {2σ²(1 − A)/(1 + A)} Σ_{r=1}^∞ A^r ρ_r − 2σ²(1 − A) Σ_{r=0}^∞ A^r ρ_{r+h}.  (5)
For prediction one step ahead, h = 1, of a Markov series, ρ_r = ρ^r, this becomes

Δ_e(1; A) = 2σ²(1 − ρ) / {(1 − Aρ)(1 + A)}.  (9)
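As a numerical check on the algebra, expression (5) can be evaluated with the infinite sums truncated and compared with the closed form (9). A minimal Python sketch; the function names and the truncation length N are illustrative choices, not part of the original analysis:

```python
def mse_general(A, rho, h, sigma2=1.0, N=5000):
    # Expression (5): m.s.e. of e.w.m.a. prediction h steps ahead, for a
    # stationary series with correlations rho(r); sums truncated at N terms
    s1 = sum(A**r * rho(r) for r in range(1, N))
    s2 = sum(A**r * rho(r + h) for r in range(N))
    return (2*sigma2/(1 + A)
            + 2*sigma2*(1 - A)/(1 + A) * s1
            - 2*sigma2*(1 - A) * s2)

def mse_markov_one_step(A, p, sigma2=1.0):
    # Closed form (9): Markov series, rho(r) = p**r, prediction one step ahead
    return 2*sigma2*(1 - p) / ((1 - A*p) * (1 + A))

# The two expressions agree over a range of damping constants and correlations
for p in (0.2, 0.5, 0.8):
    for A in (0.3, 0.6, 0.9):
        assert abs(mse_general(A, lambda r: p**r, 1) - mse_markov_one_step(A, p)) < 1e-9
```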
A number of conclusions can be drawn from these formulae. We illustrate them
by a detailed analysis of the simplest situation, namely prediction one step ahead,
h = 1, for a Markov series, ω = 1. Then the mean square error (9) is, for given ρ,
minimized by a predictor x_e(t, 1; A_opt), where

A_opt = (1 − ρ)/(2ρ)  (ρ ≥ 1/3),  A_opt = 1  (ρ < 1/3).  (10)
TABLE 1
[table entries not legible in this copy]

TABLE 2
Markov series (ρ > 1/3). Prediction one step ahead. Mean square error of
prediction divided by σ² as function of A
[table entries not legible in this copy]
If we predict more than one step ahead, there is a substantial increase in the
critical value of ρ below which the optimum value of A is one. If we consider a process
with ω < 1, the critical value is not much changed. These facts are illustrated in
Table 3.
TABLE 3
Critical value of ρ for h = 1, 2, 4 and for values of ω
[table entries not legible in this copy]
predictor follows to some extent the local fluctuations in the process as specified by
the Markov series. On the other hand, if ρ < 1/3, the best that the e.w.m.a. can do is
to give an estimate of the mean around which the Markov fluctuations occur. In a
practical case with ρ < 1/3, we would not take A too large, in order to obtain a predictor
insensitive to long-term fluctuations in mean level. Some guidance on this is provided
by Table 4, which gives the variance obtained for a stationary Markov series with
ρ < 1/3 when values of A less than one are used.
TABLE 4
Markov series with ρ < 1/3. Prediction one step ahead. Mean square error of
prediction for values of A near one
[table entries not legible in this copy]
Provided that ρ is not negative, there is not much to be gained by having A greater
than about 0.8 or 0.9. If ρ is appreciably negative, it is clear on general grounds that
an e.w.m.a. is not a good predictor.
It is natural to compare the e.w.m.a. with the optimum linear predictor x_w(t, h)
for the stationary process (Wiener, 1949; Middleton, 1960, Chapter 4). This predictor
is the linear combination of {..., x(t − 1), x(t)} which, for a stationary process with
known parameters, minimizes the mean square error of prediction. For the Markov
series (4), with ρ_k = ρ^k, it is well known that

x_w(t, h) = μ + ρ^h {x(t) − μ},  (12)

with mean square error of prediction

Δ_w(h) = σ²(1 − ρ^{2h}).  (13)
the critical value of ρ (i.e. ρ₀ = 1/3 when h = 1), where the difference in mean square
errors is probably enough to be of practical interest.
There are two objections to the practical use of the Wiener predictor (12). One
relatively unimportant point is that the parameters μ and ρ are assumed known.
However, if the underlying process is genuinely a stationary Markov Gaussian process
with μ, ρ, σ² unknown, the method of maximum likelihood can be used to estimate
x(t + h). This is very nearly the same as replacing ρ and μ in (12) by the usual estimates
based on the observed section of series. If a long portion of the series is available for
estimating μ and ρ, the mean square error of prediction is asymptotically unaffected
by replacing μ and ρ by estimates, whether or not the series is Gaussian.
A much more serious objection to x_w(t, h) is that if the mean shifts from μ to say μ′,
the predictor will be biased, and the mean square error correspondingly increased.
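The effect of such a shift can be seen on an artificial noise-free series. The sketch below uses hypothetical numbers and assumes the Markov form of the Wiener predictor, μ + ρ^h{x(t) − μ}; it contrasts that predictor with its mean held at the pre-shift level against the same predictor with μ replaced by an e.w.m.a., the device considered in the next section:

```python
rho, h, mu0, A = 0.5, 1, 0.0, 0.8

# Noise-free series: level 0, then a sustained shift of the mean to 5
x = [0.0]*50 + [5.0]*50

ewma = 0.0
for xt in x:
    ewma = (1 - A)*xt + A*ewma                # running estimate of the mean

fixed_mean = mu0 + rho**h * (x[-1] - mu0)     # Wiener form, mean held at 0
adaptive   = ewma + rho**h * (x[-1] - ewma)   # mean replaced by the e.w.m.a.

# Long after the shift the series sits at 5: the fixed-mean predictor
# remains biased towards the old level, the adaptive one does not
assert abs(fixed_mean - 5.0) > 2.0
assert abs(adaptive - 5.0) < 0.01
```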
It is at least of theoretical interest to see whether the mean square error of the Wiener
predictor can be approached by a predictor that is not sensitive to long-term
fluctuations, and this we consider in the next section.
Consider a predictor of the form

x̂(t, h) = Σ_{r=0}^∞ g_r x(t − r),  (14)

subject to

Σ_{r=0}^∞ g_r = 1,  (15)

in order that if the process fluctuates around a value μ, then for all t the predictor,
too, should fluctuate around μ. Then, subject to (15), it would be natural to choose
{g_r} to minimize the mean square error of prediction for some plausible form of
process {x(t)}. Such an extension of Wiener's theory is, for continuous time, a special
case of the work of Zadeh and Ragazzini (1950), who show that an integral equation
determining the predictor can be formulated and in principle solved.
If, however, we are interested in a predictor subject to (15), having optimum
properties when {x(t)} is a stationary Markov series, we can argue in a simple way
from first principles. For the minimum mean square error subject to (15) cannot be
less than the value (13) for the Wiener predictor. Therefore, if we can produce a
predictor (14) having a mean square error arbitrarily close to (13), we have solved the
problem. But such a predictor is obviously

x_m(t, h; A) = ρ^h x(t) + (1 − ρ^h) x_e(t, h; A),  (16)

or equivalently

x_m(t, h; A) = (1 − A + Aρ^h) x(t) + A(1 − ρ^h) x_e(t − 1, h; A).  (17)
The predictor is easily computed from the recurrence relation (17). As A approaches
one, the mean square error of (16) will approach that of the Wiener predictor. The
predictor (16) is obtained from the Wiener predictor (12) by replacing μ by an
e.w.m.a. with parameter A. Note that when ρ ≠ 0, (16) is not an e.w.m.a. The
expression (16) can be regarded as a moving average in which the weights for x(t − 1),
x(t − 2), ... fall off exponentially, but in which x(t) receives a weight that is not a
member of the same geometric series. We call (16) a modified exponentially weighted
moving average (m.e.w.m.a.).
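A m.e.w.m.a. computation can be sketched as follows. The Python coding below, with the e.w.m.a. seeded at the first observation, is one reasonable reading of the description above rather than a prescription from it:

```python
def mewma(x, A, rho, h):
    # Modified e.w.m.a.: a Wiener-type weight rho**h on x(t) plus the
    # complementary weight (1 - rho**h) on an e.w.m.a. estimate of the mean
    preds, xe = [], x[0]              # seed the e.w.m.a. at the first value
    for xt in x:
        xe = (1 - A)*xt + A*xe        # e.w.m.a. recurrence
        preds.append(rho**h * xt + (1 - rho**h) * xe)
    return preds

x = [1.0, 2.0, 0.5, 1.5, 3.0, 2.5]

# Limiting cases: rho = 1 reproduces the last observation exactly,
# and rho = 0 reduces to a plain e.w.m.a.
assert mewma(x, 0.8, 1.0, 1) == x
plain, xe = mewma(x, 0.8, 0.0, 1), x[0]
for xt, pt in zip(x, plain):
    xe = (1 - 0.8)*xt + 0.8*xe
    assert pt == xe
```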
If we take A too close to unity the predictor (16) will be sensitive to long-term drifts.
It is therefore sensible to choose A as small as possible subject to the mean square
error of prediction being sufficiently close to its minimum possible value. Now for a
Markov process, the mean square error of (16) is

Δ_m(h; A) = σ²(1 − ρ^{2h}) + σ²(1 − ρ^h)² (1 − A)(1 + Aρ) / {(1 + A)(1 − Aρ)}.  (18)
TABLE 5

h = 1
        m.e.w.m.a.                       e.w.m.a.            Wiener
        Δ_m(1, A)/σ²                     Δ_e(1, A_opt)/σ²    Δ_w(1)/σ²
ρ       A = 0.5   0.7   0.8   0.9   0.95
[table entries not legible in this copy]

h = 2
        m.e.w.m.a.                       e.w.m.a.            Wiener
        Δ_m(2, A)/σ²                     Δ_e(2, A_opt)/σ²    Δ_w(2)/σ²
ρ       A = 0.5   0.7   0.8   0.9   0.95
[table entries not legible in this copy]
Table 5 shows that, if ρ is appreciably negative, the m.e.w.m.a. works very well
with values of A not too near 1. Further, a modest improvement in mean square
error is possible with values of A in the range 0.7–0.9 if 0.4 ≤ ρ ≤ 0.8. One would
probably not often in practice want to use values of A greater than 0.9.
A modified predictor similar to (16) can be formed whenever the corresponding
Wiener predictor is known. Thus if {x(t)} is a second-order autoregressive process,
the Wiener predictor has the form

x_w(t, h) = μ + α_h {x(t) − μ} + β_h {x(t − 1) − μ},  (20)

where α_h and β_h are determined by the autoregressive coefficients.
In modified predictors such as (16) and (20), the exponentially weighted component
can be replaced by any reasonable form of moving average. We shall not attempt
here to examine whether the requirement of minimizing the effect of long-term trends,
or some other requirement, can be used to choose between alternative forms of
moving average.
5. ESTIMATION OF PARAMETERS
In order to make direct use of the above work, a section of observed series must
be analysed to see whether the local fluctuations are in agreement with a Markov
process, or with a Markov process with superimposed error, and if so to estimate the
value of p.
A method of estimation unaffected by long-term drifts is the serial variogram of
Jowett. In this,

½ ave {x(t) − x(t + k)}²  (21)

is plotted against k. For a locally Markov process, (21) will, for sufficiently small k,
be proportional to 1 − ρ^k. Jowett (1955) has given a convenient semi-graphical method
for fitting a local Markov process and this should be adequate for the present purpose;
alternatively the more elaborate method of Davies and Jowett (1958) could be used.
If the agreement with a local Markov series seems good, the optimum A for an
e.w.m.a. can be obtained from (7) and (10) (or Table 1). Table 5 will show whether
there is likely to be a worth-while improvement by using a m.e.w.m.a. predictor (16).
If the series is locally a Markov process with superimposed error, formulae (7) and
(8) can be used. If the process has a more complex local structure, the use of predictors
like (20) is worth considering. If, however, it is for simplicity still desired to use an
e.w.m.a., it seems reasonable to fit a value of ρ to the first few terms of (21) and again
to determine a value of A from (10). Confirmation of values of A by dry-running will
often be advisable.
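The variogram computation and a crude moment fit of ρ can be sketched in Python. Simulated data are used here, since Jowett's series is not reproduced, and the estimator is an illustrative choice rather than the semi-graphical method referred to above:

```python
import random

def variogram(x, k):
    # Expression (21): half the average squared difference at lag k
    d = [(x[t] - x[t + k])**2 for t in range(len(x) - k)]
    return 0.5 * sum(d) / len(d)

# Simulated Markov (first-order autoregressive) series, rho = 0.6, sigma^2 = 1
random.seed(1)
rho_true, x = 0.6, [0.0]
for _ in range(20000):
    x.append(rho_true*x[-1] + random.gauss(0.0, (1 - rho_true**2)**0.5))

# For a Markov series (21) equals sigma^2 (1 - rho^k); estimating sigma^2 by
# the sample variance, the lag-1 variogram gives a moment estimate of rho
m = sum(x)/len(x)
s2 = sum((xi - m)**2 for xi in x)/len(x)
rho_hat = 1 - variogram(x, 1)/s2
assert abs(rho_hat - rho_true) < 0.05
```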
6. SOME EXTENSIONS
The work described in this paper could be extended in various ways, of which
the following are examples.
(a) Exponentially weighted moving averages can be used for filtering or smoothing
instead of for prediction. Further, in some applications, two-sided e.w.m.a.'s could
be used. A possible application is to serial sampling inspection schemes (Cox, 1960).
(b) In some applications we would be interested not only in the error of prediction
at one time point but also in the form of the stochastic process {x̂(t)} of predicted
values. For the predictors we have considered, the transformation from {x(t)} to
{x̂(t)} is linear and hence is described by a transfer function. Note in particular that
if {x(t)} is completely random, then e.w.m.a.'s form a Markov process with correlation
function A^k.
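The last remark can be verified from the weights alone: for completely random input the lag-k autocorrelation of an e.w.m.a. with weights (1 − A)A^r works out to exactly A^k. A small Python check (the truncation length N is an arbitrary choice):

```python
def ewma_output_autocorr(A, k, N=4000):
    # Autocorrelation at lag k of an e.w.m.a. of completely random input,
    # computed from the weights w_r = (1 - A) * A**r, truncated at N terms
    w = [(1 - A)*A**r for r in range(N)]
    var = sum(wr*wr for wr in w)
    cov = sum(w[r]*w[r + k] for r in range(N - k))
    return cov / var

for A in (0.3, 0.7, 0.9):
    for k in (1, 2, 5):
        assert abs(ewma_output_autocorr(A, k) - A**k) < 1e-8
```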
(c) Sometimes, approximately linear trends will be expected over substantial
lengths of series. It is clear that an e.w.m.a. will follow the trend, but with a time lag,
which may be serious if A or h are appreciable. There are various ways of dealing
with this difficulty, of which one is to require that the weights g_r in (14) satisfy not
only (15), but also a further condition

Σ_{r=0}^∞ r g_r = −h.  (22)

This ensures that if x(t) = a + bt, the predicted value at t + h is a + b(t + h), for all a, b
(Zadeh and Ragazzini, 1950). No e.w.m.a. can satisfy (22), but it is easy to construct
simple predictors that do satisfy the condition, for example by combining two
e.w.m.a.'s.
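Taking the further condition to be Σ r g_r = −h (which is what exactness on x(t) = a + bt requires of weights applied to x(t − r)), one such construction combines two e.w.m.a.'s with coefficients chosen to meet both conditions. A Python sketch; the particular damping constants are arbitrary, and one coefficient comes out negative:

```python
def two_ewma_weights(A1, A2, h, N=3000):
    # Combine c1*e.w.m.a.(A1) + c2*e.w.m.a.(A2) so the overall weights g_r
    # satisfy sum g_r = 1 and sum r*g_r = -h.  An e.w.m.a. with constant A
    # has mean lag A/(1 - A), so c1 and c2 solve a 2x2 linear system.
    L1, L2 = A1/(1 - A1), A2/(1 - A2)
    c1 = (-h - L2)/(L1 - L2)
    c2 = 1 - c1
    return [c1*(1 - A1)*A1**r + c2*(1 - A2)*A2**r for r in range(N)]

g = two_ewma_weights(0.5, 0.8, h=1)
assert abs(sum(g) - 1.0) < 1e-9
assert abs(sum(r*gr for r, gr in enumerate(g)) - (-1.0)) < 1e-6

# Applied to an exact linear trend, the prediction is exact
x = [2.0 + 0.3*t for t in range(4000)]
t = len(x) - 1
pred = sum(g[r]*x[t - r] for r in range(len(g)))
assert abs(pred - (2.0 + 0.3*(t + 1))) < 1e-6
```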
(d) The work could be extended to non-linear predictors.
7. EMPIRICAL TRIAL
For an empirical test of some of the conclusions of this paper, some data (Jowett,
1955) on per cent. nitrogen in gas from a blast furnace have been used. The 70
observations were equally spaced in time, and Jowett showed that over short stretches
the serial variogram agreed excellently with a Markov series of variance σ² = 4.5 and
first serial correlation coefficient ρ = 0.56; there was some additional long-term
variation in the series.
According to (10), the optimum e.w.m.a. has A ≈ 0.4. By Table 2, the mean square
error of prediction is insensitive to A, but there should be a just perceptible increase in
mean square error if we took A = 0.2 or 0.6. Empirical mean square errors of prediction
for h = 1 have been calculated and are compared in Table 6 with the theoretical
mean square errors calculated from (9). Although the empirical mean square errors
are slightly below the theoretical values, the general agreement is very satisfactory.
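The comparison of empirical and theoretical mean square errors can be sketched as follows. The blast-furnace series itself is not reproduced here, so a simulated Markov series with ρ = 0.56, σ² = 4.5 stands in for it; the theoretical value from (9) at A = 0.4 is about 3.65, consistent with Table 6:

```python
import random

def empirical_one_step_mse(x, A):
    # Empirical m.s.e. of one-step e.w.m.a. prediction over a series
    m, se = x[0], 0.0
    for t in range(1, len(x)):
        se += (x[t] - m)**2           # m is the prediction of x[t]
        m = (1 - A)*x[t] + A*m        # update after observing x[t]
    return se / (len(x) - 1)

# Simulated Markov series with rho = 0.56, sigma^2 = 4.5 (synthetic data)
random.seed(2)
rho, sigma2, x = 0.56, 4.5, [0.0]
for _ in range(50000):
    x.append(rho*x[-1] + random.gauss(0.0, (sigma2*(1 - rho**2))**0.5))

theory = 2*sigma2*(1 - rho)/((1 - 0.4*rho)*(1 + 0.4))   # equation (9), A = 0.4
empirical = empirical_one_step_mse(x, 0.4)
assert abs(theory - 3.65) < 0.01
assert abs(empirical - theory)/theory < 0.05
```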
Table 5 shows that some reduction in mean square error should be obtainable
from the modified exponentially weighted moving average (16). Its properties have
accordingly been worked out for ρ = 0.56, A = 0.8, and are given in Table 6.
TABLE 6
Mean square errors of prediction, h = 1

                                   Empirical    Theoretical
e.w.m.a., A = 0.2                    3.69          3.72
e.w.m.a., A = 0.4                    3.55          3.65
e.w.m.a., A = 0.6                    3.64          3.73
m.e.w.m.a., A = 0.6, ρ = 0.56        3.30          3.34
Of course, the values in Table 6 are obtained by predicting over the same section
of series as is used for the estimation of the parameters of the process. Hence in
Table 6 the agreement with the theoretical results is partly tautological. However,
the insensitivity to the choice of A of the mean square error for the e.w.m.a. means
that quite substantial errors in estimating ρ will not matter. Provided that the section
of series to be predicted has the same general structure as that used for estimation,
the present methods should be applicable.
ACKNOWLEDGEMENTS
It is a pleasure to thank Mr W. N. Jessop and his colleagues, Operational Research
Dept, Courtaulds Ltd, both for discussions of prediction problems which led to the
present work, and also for the calculation of Tables 1-5, which was done under the
supervision of Mr J. Luckman.
REFERENCES
COX, D. R. (1960), "Serial sampling acceptance schemes derived from Bayes's theorem",
Technometrics, 2, 353-360.
DAVIES, H. M. and JOWETT, G. H. (1958), "The fitting of Markoff serial variation curves",
J. R. statist. Soc. B, 20, 120-142.
JOWETT, G. H. (1955), "The comparison of means of industrial time series", Appl. Statist., 4,
32-46.
MIDDLETON, D. (1960), Introduction to Statistical Communication Theory. New York:
McGraw-Hill.
MUIR, A. (1958), "Automatic sales forecasting", Brit. Computer J., 1, 113-116.
WIENER, N. (1949), The Extrapolation, Interpolation and Smoothing of Stationary Time-Series,
with Engineering Applications. New York: Wiley.
ZADEH, L. A. and RAGAZZINI, J. R. (1950), "An extension of Wiener's theory of prediction",
J. appl. Phys., 21, 645-655.