Arthur E. Albert, Leland A. Gardner Jr. Stochastic Approximation and NonLinear Regression

STOCHASTIC APPROXIMATION
AND NONLINEAR REGRESSION

STOCHASTIC
APPROXIMATION
AND NONLINEAR
REGRESSION
ARTHUR E. ALBERT
LELAND A. GARDER, JR.
IIIIIII RESEARCH l\'IOOGRAPH NO. 42

THE I\U.T. PRESS, CAMBRIDGE, MASSACHUSETTS
Copyright 1967
The Massachusetts Illstilllte of Techllology
Set ill Times New Romall

Prillted ami boulld ill the Ullited States 0/ America by
The Riverside Press, Cambridge, A-fassac!lusetts
All rights reserved. No part 0/ this book may be

reproduced or I/Iilized ill (lilY form or by allY meallS,
electrollic or mechallical, illc/udillg photocopyillg,
recordillg, or by allY ill/ormatioll storage (md retrieval
system, wit/lOut permissioll ill writillg /rom the publisher.
Library of COllgress cawlog mrd lIumber: 67-16501
ISBN: 0-262-51148-7 (Paperback)

To Lise
To Margie
Foreword
This is the forty-second volume in the M.I.T. Research Monograph

Series published by the M.I.T. Press. The objective of this series is to
contribute to the professional literature a number of significant pieces
of research, larger in scope than journal articles but normally less
ambitious than finished books. We believe that such studies deserve
a wider circulation than can be accomplished by informal channels, and
we hope that this form of publication will make them readily accessible
to research organizations, libraries, and independent workers.
HOWARD W. JOHNSON
Preface
This monograph addresses the problem of"real-time" curve fitting

in the presence of noise, from the computational and statistical view
points. Specifically, we examine the problem of nonlinear regression
where observations { Yn: n = I, 2, ...} are made on a time series whose
mean-value function {Fn(6)} is known except for a finite number of
parameters (Bl> B2, , Bp) = 6'. We want to estimate this parameter.
In contrast to the traditional formulation, we imagine thedata arriving
in temporal succession. We require that the estimation be carried out in
real time so that, at each instant, the parameter estimate fully reflects all
of the currently available data.
The conventional methods of least-squares and maximum-likelihood
estimation, although computationally feasible in cases where a single
estimate is to be computed after the data have been accumulated, are in
applicable in such a situation. The systems of normal equations that
must be solved in order to produce these estimators are generally so
complex that it is impractical to try to solve them again and again as
each new datum arrives (especially if the rate of data collection is high).
Consequently, we are led :0 consider estimators of the "differential
correction" type. Such estimators are defined recursively. The (n + I)st
estimate (based on the firstn observations)is defined in terms of thenth
by an equation of the form
(tl arbitrary; n = 1,2, . ) ,
where {an} is a suitably chosen sequence of " s m oot hing " vectors. The
ix
X PREFACE
term .. differential correction" refers to the proportionality of the

difference between tn+l andtn (the correction) to the difference between
the nth observation, Yn, and the value that would be predicted by the
regression function if tn were in fact the" true" parameter value.
The choice of smoothing vectors criticalIy affects the computational
simplicity and statistical properties of such recursive estimates. The
main purpose of this monograph is to relate the large-sample statistical
behavior of said estimates (consistency, rate of convergence, large
sample distribution theory, asymptotic efficiency) to the properties of
the regression function and the choice of smoothing vectors. A wide
class of smoothing vectorsis examined. Some are deterministic and some
depend on (are functions of ) the observations.
The techniques used in the analysis are, for the most part, elementary
and, by now, standard to those who are familiar with the literature of
stochastic approximation. However, for the sake of the nonspecialist,
we have tried to keep our treatment self-contained. In all cases, we seek
the asymptotic properties (large n) of the solution to the nonlinear
difference equation which relates tn+l to tn.
As a fortuitous by-product, the results of this monograph also serve
to extend and complement many of the resultsin the stochastic-approxi
mation literature.
The structure of the monograph is as folIows. Part I deals with the
special case of a scalar parameter. Here we discuss probability-one and
mean-square convergence and asymptotic distribution theory of the
estimators for various choices of the smoothing sequence {an}. Part II
deals with the probability-one and mean-square convergence of the
estimators in the vector case for various choices of smoothing vectors
{an}. Examples are liberally sprinkled throughout the book. In fact, an
entire chapter is devoted to the discussion of examples at varying levels
of generality.
The book is written at thefirst-year graduate level, although this level
of maturity is not required uniformly. Certainly the reader should
understand the concept of a limit both in the deterministic and prob
abilistic senses. This much will assure a comfortablejourney through
Chapters2 and3. Chapters4 and5 require acquaintance with theCentral
Limit Theorem. Familiarity with the standard techniques of large
sample theory will also prove useful but is not essential. Chapters 6 and
7 are couched in the language of matrix algebra, but none of the
"classical" results used are deep. The reader who appreciates the
elementary properties of eigenvalues, eigenvectors, and matrix norms
will feel at home.
PREFACE xi
The authors wish to express their gratitude to Nyles Barnert, who

collaborated in the proofs of Theorems 6.1 through 6.3; to Sue M.
McKay, Ruth Johnson, and Valerie Ondrejka, who shared the chore of
typing the original manuscript; to the ARCON Corporation, the
M.I.T. Lincoln Laboratory, the Office of Naval Research, and the
U.S. Air Force Systems Command, who contributed to the authors'
support during the writing of the monograph; and, finally, to the
editorial staff of the Annals of Mathematical Statistics, who were
principally responsible for the writing of this monograph.
ARTHUR E. ALBERT
LELAND A. GARDNER, JR.
Cambridge, Massachusetts
October 1966
Contents
1. Introduction 1
PART I THE SCALAR-PARAMETER CASE
2. Probability-One and Mean-Square Convergence 9

2.1 The Basic Assumptions (AI Through AS"') 9
2.2 Theorems Concerning Probability-One and Mean-
Square Convergence for General Gains 11
2.3 The Prototype Deterministic Gain 17
2.4 Reduction in the Linear Case 18
2.5 Gains That Use Prior Knowledge 19
2.6 Random Gains 20
2.7 Theorems Concerning Probability-One and Mean
Square Convergence for Particular Gains; Ap-
plication to Polynomial Regression 23
2.8 Trigonometric Regression ' - 24
2.9 Exponential Regression 25
3. Moment Convergence Rates 27

3.1 Restricted Gain Sequence 27
3.2 Theorems Concerning Moment Convergence
Rates 28
3.3 Power-Law Derivatives 34
3.4 Relevance to Stochastic A pprox im ation 35
3.5 Generalization 37
xiii
xiv CONTENTS
4. Asymptotic Distribution Theory 38

4.1 Notation for and Relations Betw een Modes of
Convergence 39
4.2 Theorcms Concerning Asymptotic Normality for
General Gains 39
4.3 Alternative to the Continuous Convergence As-
s umption 47
4.4 Large-Sample Variances for Particular
Gains 48
4.5 Oth er Gains 53
4.6 Gain Comparison and Choice of Gain Con-
stants 54
4.7 A General Stochastic Approx imatio n
Theorem 58
5. Asymptotic Efficiency 60
5.1 Asymptotic Linearity 61
5.2 Increased Efficiency via Transformation of the
Parameter Space 61
5.3 Asymptotic Efficiency and Summary
Theorem 65
5.4 Increased Efficiency 72
5.5 Large-Sa m pl e Confidence Intervals 72
5.6 Choice of Indexing Sequence 73
5.7 A Single-Parameter Estimation Problem 74
PART Il THE VECTOR-PARAMETER CASE
6. Mean-Square and Probability-One Convergence 81

6.1 Theorem Concerning Divergencl! to Zero of Prod-
ucts of Elementary Matrices and Assumptions
(B1 Through B5) 83
6.2 Discussion of Assum ptions and Proof 84
6.3 Theorems Concerning Mean Squa r e and Proba
-
bility-One Convergence for General Gains and

Assumptions (Cl Through C6' and Dl Through
D5) 92
6.4 Truncated Vector Iterations 102
6.5 Conjectured Theorem and Assumptions (E 1
Through E6') 1 03
6.6 Batch Processing 1 04
CONTENTS xv
7. Complements and Details 109

7.1Optimum Gains for Recursive Linear Regres-
sion 109
7.2 "Quick and Dirty" Recursive Linear Regres-
sion 115
7.3 Optimum Gains for Recursive Linear Regression.
Batch Processing 117
7.4 "Quick and Dirty" Linear Regression. Batch Proc-
essing 121
7.5 Gain Sequences for Recursive Nonlinear Regres-
sion. The Method of Linearization 122
7.6 Sufficient Conditions for Assumptions EI Through
E6' (E6) When the Gains (Equations 7.48) Are
Used 125
7.7 Limitations of the Recursive Method. III Condi-
tioning 136
7.8 Response Surfaces 139
8. Applications 146
8.1 Vector Observations and Time-Homogeneous Re-
gression 148
8.2 Estimating the Initial State of a Linear System via
Noisy Nonlinear Observations 153
8.3 Estimating Input Amplitude Through an Un-
known Saturating Amplifier 156
8.4 Estimating the Parameters of a Time-Invariant
Linear System 161
8.5 Elliptical Trajectory Parameter Estimation 172
9. Open Problems 182

9.1 Proof of the Conjectured Theorcm 182
9.2 Extensions of Chapters 3 Through 5 to the Vcctor-
Parameter Case 182
9.3 Kalman-Type Filtering Thcory for Nonlinear Sys-
tems 183
Appendix. Lemmas 1 Through 8 189
References 200
Index 203
1. Introduction
Despite the many significant and elegant theoretical developments of

the past several decades, the art of statistical inference on time series is,
from the applied point of view, in its infancy. An important class of
problems, which has been relatively neglected, arises from the fact that
there are. always computations associated with statistical procedures; a
procedure which is "optimal" in the decision theoretic sense can be
somewhat less than optimal from a practical point of view if the associ
ated computations are prohibitively lengthy. This dilemma is com
pounded when we consider a time series as aflolV of data. In"space age"
applications, it is especially important that statistical procedures keep
pace with the incoming data so that, at any instant, all of the available
information has already been processed. The acquisition of new
observations merely serves to update the current state of knowledge.
In this monograph we will investigate nonlinear regression from that
point of view. L et
{Y,,:n = 1,2,}
be a stochastic process whose mean-value sequence is a member of a

family of known sequences, that is to say,
if Y" = F,,(6),
where 6 is a vector parameter which is not known and must be estimated.

1
2 INTRODUCTION
We will explore the asymptotic (large II ) properties of recursive estima

tion schemes for 6 of the form
(1.1)
where tn + 1 is the estimate of 6 based upon the first n observations and
{an} is a suitably chosen sequence of" smoothing vectors."
Without question, estimators of the type of Equation I.I are compu
tationally appealing, provided the smoothing sequence is chosen
reasonably. After each observation, we compute the prediction error
Yn - Fn(tn) and correct tn by adding to it the vector [Yn - Fn(tn)]an
Such recursions are sometimes called " differential correction"
procedures.
In contrast, maximum-likelihood and least-squares estimation
methods, although often efficient in the purely statistical sense, require
the solution of systems of simultaneous nonlinear normal equations.
If we want "running" values of these estimates, the computational
problems are often great.
Of course, the choice of the weights an critically affects the computa
tional simplicity and statistical properties of the recursive estimate
(Equation 1.1). The main purpose of this monograph is to relate the
large-sample statistical behavior of the estimates to the properties of the
regression function and the choice of smoothing vectors.
Estimation schemes of the type of Equation 1.1 find their origins in
Newton's method for finding the root of a nonlinear function. Suppose
that G() is a monotone differentiable function of a real variable, and
we wish to find the root e of the equation
G(x) = O.
If 11 were known to be a reasonably good estimate of (i.e., is close to) e,

then
(1.2)
where the dot denotes differentiation. This equation says that G(e) takes
on nearly the samc values as the line L which passes through the point
(11) G(/l)) with slope G(/l) [i.e., is tangent to the curve y = G(x) at
x = II], provided that e is not too far from II' Solving Equation 1.2
for e, we see that
G(ll)
e
0(11)'
'" I
- 1 -
so that a potentially better estimator for 8 might be (see Figure 1.1)

G(/l)
12 =
11 ' (1.3)
0(11)
-
INTRODUCTION 3
Figure 1.1 Graphical interpretation of Newton's method.
In turn, t2 could be "improved" in the same way, and Equation 1.3

suggests that an ever-improving sequence of estimators for 8 can be
obtained by means of the recursion"
G(tn)
tn+1 In - (n 1). (1.4)
G(tn)
=
It would appear, though, that the first guess 11 must be close to 8 in

order that the linear approximation, Equation 1.2, should be accurate.
This is not essential if 1 GI is bounded above and away from zero:
o < b::; 1 G(x) 1 ::; d < 00.
We choose a number a to satisfy

b
o < a::; d'
and we modify the recursion, Equation 1.4, to read

a
an (1.5)
G(ln)'
=
It is easy to show that In converges to 8 as n -+ 00. Indeed, by the mean

value theorem, we obtain
G(tn) = G(un)(tn - 8), (1.6)
where Un lies between 8 and In. Thus, by Equations 1.5 and 1.6, it follows
that
In+1 - 0 [1 - anG(un)](tn - 8)
{J]n [1 - a G(UJ)]}(11 - 0).
G(tf)
= =
(1.7)
4 INTRODUCTION
But
1 - a- < 1 -
d
a--
G(uJ) < b
- a-
0< 1 < 1
- b - G(IJ) - d '
so that
as n oo.
Let us now complicate matters by letting 0 vary with n. There is a
sequence of monotone differentiable functions, 0", all having a common
root 8:
0,,(6) = 0 (n = 1,2, . ) .
Again, we estimate 6 by sequences of the form

1,,+1 = I" - a"O,,{t,,).
In precisely the same way, in place of Equation 1.7 we obtain
1,,+1 - 6 [1 - a"G"(u,,)] (/,, - 6) {Ii [1 - a/GJ(UJ)]} (/1 - 6).

1=1
= =
Now assuming that

0 < b" < I G,,(x)I Mb" < 00
for all n and all x, we choose a" so that

1. a" has the same sign as G",
1
2. la,,1
Mb,,'
3. 2: I a"b" I = 00
..

Then we have
and I t" +1 - 61 tends once again to zero as noo.

This technique can be applied to the problem of discrete-time curve
fitting: Suppose Yh Y2, '" is a sequence of numbers, and it is known
that this sequence is one of a family of sequences, {F,,(6)}, indexed by a
real parameter 6. Here 8 is not known, and we wish to find that value of
6 for which
Y.. = F,,(6) (n = 1,2, ... ).
INTRODUCTION 5
If we let
Gn(x) = Fn(x) - Yn ,
the desired parameter value is that value of x which makes Gn(x) vanish
identically in n.
Now let noise be introduced, so that the sequence of observations, Yn,
are corrupted versions of Fn(8):
Yn = Fn(8) + Wn (n = 1,2,),
where Wn is (zero mean) noise. Motivated by the previous discussion,
we consider estimation schemes of the form
tn+l = t" + a"[Y,, - F,,(t,,)], (1.8)
wh ic h can be rewritten as
tn+l = tn - lanl Zn(tn). (1. 8a)

For every x, we can regard Zn(x) as an observable random variable with
expectation equal to
GnCx) = sgn tn[f"Cx) - Fn(8)] = I tn(un)I(x - 8), (1.9)
where Un = un(x, 8) lies between x and 8. Thus,
tn+l - 8 = (1 - lantnCun)l)(tn - 8) + anWn, (1.10)
and we are led to the study of certain first-order nonlinear difference
equations with stochastic driving terms.
This brings to mind the literature associated with stochastic approxi
mation, which dates back to a paper by Robbins and Monro (1951).
That paper concerns itself with the problem of estimating the root, say
a, of an unknown (time-homogeneous) regression function G(x), which
is the mean value of an observable random variable Z(x). The distribu

tion of the latter depends on a scalar parameter, x, which can be
controlled by the experimenter. They proposed that a be estimated
recursively by Equation 1.8a, where Z(tn) is the value of an observation
taken at the "level" x tn, and {an} is any nonsummable null sequence
=
of scalars with Ln an2 < 00. The success of the Robbins-Monro

procedure (it converges to a with probability one and in mean square
under a wide range of conditions) encourages us to believe in the
reasonableness of Equation 1.8.
Burkholder (1956) has studied processes of the form of Equation 1.8a
in detail. In fact, he considers the more general situation where the root
of Gn depends upon n but converges to a limit 8 as n -+ 00. (This is not
just an academic generalization, for such a result is needed in the treat
ment of the Kiefer-Wolfowitz procedure for locating the minimum of a
6 INTRODUCTION
time-homogeneous regression function.) Consequently, there will be

some overlap between his work and Chapters 2 through 4 of the present
work. In fact, after appropriate reinterpretation of the symbols, we
obtain some results that are significantly stronger than those given by
Burkholder.
If we view the stochastic-approximation literature as a study in the
asymptotic behavior of the solutions to a certain class of nonlinear first
order difference equations with stochastic driving terms, then the results
of this monograph (particularly Chapters 3 and 4) serve to extend and
complement many of the results in that literature, and accounts for our
choicc of title. However, our primary consideration is nonlinear
regression per se and, for this reason, we often fail to state theorems with
the weakest possible hypotheses; we want to keep their statemcnts and
proofs relatively simple.
We will treat the scalar-parameter case, Equation 1.8, and the general
vector case, Equation 1.1, separately. For the vector-parameter case, we
will treat the topics of strong consistency (probability-one convergence)
and mean-square convergence. In the scalar-parameter case, we also
treat the questions of convergence rates, asymptotic distribution theory,
and efficiency. A wide class of gain sequences are examined. Some are
deterministic, and some depend on the actual data which have been
observed. Examples are sprinkled throughout the body of the mono
graph, and Chapter 8 is devoted exclusively to applications.
The techniques we use are, by now, standard to those who are familiar
with the literature of stochastic approximation, but for the sake of the
nonspecialist we have tried to keep our treatment self-contained. In all
cases, we seek the asymptotic properties of the solutions to the intrinsic
ally nonlinear difference equations of the type 1.1. We accomplish this
by studying the asymptotic properties of certain linear difference equa
tions which, in a sense, dominate the original ones.
Now a word about notation. In Cha p ters 6 through 9, we do not
adhere to the convention which rese rves lower-Crespo upper-)case bold
face symbols for vectors (resp. matrices). The reader must keep in mind
not only this point but also the orders of the various vectors and
matrices involved. The symbol an = O(bn) means that lan/bnl has a
finite limit superior as n tends to infinity, while an = o(b") means the
ratio tends to zero. The balance of the abbreviations are standard and
are defined when they are first used.
We begin by studying the problems of probability-one and mean
square convergence in th e scalar case.
PART I
THE SCALAR-PARAMETER CASE

2. Probability-One and
Mean-Square Convergence
2.t The Basic Assumptions (At Through AS")
Throughout Part I we will use certain assumptions, the first of which

is as follows:
AI. {Y,,:n 1, 2, } is an observable stochastic process of the form

=
Y" = F,,(8) + W", where Wl> W2, have uniformly bounded

variances. The function F,,(8) is known except for a real param
meter 8. However, 8 is known to lie in an interval J = (el> e2),
whose end points may be finite or infinite. For each value of
n, F,,{) is assumed to be monotone and differentiable with
respect to the argument in parentheses.
If J happens to be finite or semifinite, it is reasonable to constrain

estimators for 8 so that they always fall in J. To this end, we define the
limiting operation
if x e2,
if e1 < x < e2, (2.1)
if x el>
and, accordingly, will consider truncated estimation recursions of the
form
1,,+1 = [I" + o"[Y,, - F,,(t,,) ]] (n = 1, 2, ; 11 arbitrary). (2.2)

9
10 PROBABILITY-ONE AND MEAN-SQUARE CONVERGENCE
In the work that follows, we will use certainsymbols and assumptions

(in addition to AI) repeatedly. For thesake of future brevity and ease of
comparison, we list them here once and for all and will refer to them
later by number.
AI'. In addition to Assumption AI, WI> W2,' is a zero-mean

independent process.
A2. For every n ;::: 1, d" = sup IF,,(x)1 < 00, where F" denotes the
xeJ
derivative of F".
"
A3. B
,,
2
= 2: b,,2 -+ 00 with n, where bit = inf
xeJ
1 F,,(x) .I
k=1
d"
A4. sUP- < 00
.
" b"
d,,2 1
AS. IImsup- 2 < .
"B "
AS'. limsup
"
(:)(::) < 1.
AS". I1m
b,,2 0
"
=
B" 2 .
(bn2)2
AS"'.
f 2 < ,,B
00
.
We note that Assumption AS'" implies AS", Assumptions A4 and AS"

imply AS', and Assumption AS' implies AS.
These assumptions are stated in terms of the quantities bn and dn,
which are, in turn, defined in terms of an interval J that is known to
include the true value of the parameter. Obviously, J should be chosen
to be thesmallest interval known to contain e. (In general, the smaller J
is, the weaker are Assumptions A2 through AS'".) In the absence of prior
knowledge, J can be (must be) taken to be the real line.
Even when J is a finite interval, it is not essential that the estimator
sequence be truncated. We could alternatively redefine the regression
function outside of J by linearity and then use an untruncatcd procedure
to estimate e. That is to say, we could define
{F"('2) + Fn(2)(X - '2) if x ;::: '2,

F" *(x) = F,,(x) if '1 < X < '2,
F,,(1) + Fn('1)(X - '1) if x '1>
and we could use the untruncated scheme

THEOREMS FOR GENERAL GAINS 11
Since we know that
and
inf I Fn (x) I = inf

- 00 < x < 00
I Fn*(x)l,
1 < x < 2
any of the Assumptions A2 through AS'" that hold for Fn() over
J = (1o 2) will also hold for Fn *(.) over J* = ( - 00, )
00 . Hence, the
results of this chapter (as well as the next) will apply to the untruncated
estimators tn* whe nevcr they apply to the truncated ones, tn. In most
applications, however, common sense seems to dictate that we should
use truncated procedures whenever we can.
The first theorem demonstrates the strong consistency of the estima
tion sequence, Equation 2.2, for a wide class of gain sequences. [For
J = ( - 00, 00), independent observations and gains which do not depend
on the iterates, the result becomes Burkholder's (1956) Theorem 1 after
an appropriate interpretation of the symbols.]
2.2 Theorems Concerning Probability-One and Mean-Square

Convergence for General Gains
THEOREM 2.1
Let {Yn:n = 1, 2, " '} be an observable process satisfying Assump

tions Al and A2. Let {tn} be a sequence of estimators defined by the
recursion
(t1 arbitrary),
where, for each n, o

n
(') is a Borel function of II real variahles, and
J = (1o 2) is any interval, finite or infinite, known to include the true
parameter value 8. L et p n ) be the n-dimensional cube whose sides arc the
interval J. If the gain sequence {on(')} is chosen so that
1. For each n, the sign of on(x) is constant over J ( n ) and equal to the
sign of Fn(),
1
2. sup IOn(X) I < d for all suitably large values of n,
xeJ(n) n
and
3. 2: bn
n
inf
xe/(n)
IOn(x)I = 00,
then tn converges to 8 with probability one if either

4. L sup lan(x)I < 00

n xel(ft)
or
5. L sup lan(x)12 < 00 and Assumption AI' holds.
n xeJ(ft)
Proof. For notational convenience,denote

aneth t2, " tn) by an,
sup lan(x)I by sup lanl.

xeJ(ft)
Let
and
(2.4)
Then we obtain
and,consequently,
Itn+1 - 81 ITn - 8 + Znl. (2.5)
Indeed,Equation 2.5 clearly holds if (Tn + Zn) E J. Otherwise,because

gl < 8 < g2, 8 is closer to the end point of J nearest to Tn + Zn than
it is to Tn + Zn.
The placement of the absolute-value signs on the right-hand side
makes it awkward to iterate the Inequality 2.5. However,suppose that
we can choose a positive null sequence of real numbers {An} such that
Zn
1Im = 0 a.s., (2.6)
n A"
-
that is,with probability one. Then,by Condition 2, we can choose N so

large that
d" sup la,,1 < 1 (2.7)
and
(2.8)
both hold whenever n N (N a random variable). Fix n N. If

IT" 81 < An/2, then by Equation 2.5 it follows that
-
It"+l - 81 A".
In the contrary case,
IT" - 81> " IZ"I,
which implies that the sign of (T. - 9 + Z.) is equal to the sign of
(T. - 9):
IT. - 9 + Z.I = (T. - 9 + Z.)sgn(T. - 9 + Z.)
= (T. - 9 + Z.)sgn(T. - 9)
= IT. - 91 + Z.sgn(T. - 9).
Setting
X. = Z.sgn(T. - 9), (2.9)
we have, in this case,
It.+l - 91 IT. - 91 + X
In either event, therefore,
It.+1 - 91 max {A., IT. - 91 + XJ (2.10)
if n N, and this is the key relationship for our subsequent analysis.
To establish Equation 2.6, we choose a positive null sequence {AJ
so that
'" (sup la"j)
2
< 00.
L.
n
A
n
2
This is always possible since

I(sup 1a.1>- < 00

under either Condition 4 or S. But then, from Equation 2.4, we obtain

2
2: tf Z 2 const I suua.1 < 00,
( ,,) ( )
n A" A.
so that
It follows from the monotone convergence theorem (Loeve, 1 960,

p. 1 52) that
(:r < 00 a.s.,
which in turn implies Equation 2.6.

Returning our attention to Equation 2.10, we notice that
by virtue of Equation 2.3. Using the mean-value theorem, we find that

I Tn - 81 = 1[1 - anFn(un)](tn - 8)1,
where Un lies between 8 and tn. By Condition 1, we have anFn > O. Thus,
in view of Equation 2.7, it follows that
o < bn inf lanl ::::;; anF'n(un) ::::;; tin sup lanl < 1
if n ;::: N, so that
This combines with Equation 2.10 to give, for all such indices,
I tn +1 - 81 ::::;; max {An> (\ - bn inf lanDltn - 81 + Xn}. (2.11)

If Equation 2.1 1 is iterated from n back to N, we obtain
I tn +1 -
ill ::::;; max
(I
[ max (-AipPn
NSlsn 1
+ L.,
"=1+1
-)
PnPX,,
,
"
(which can be verified by induction), where
Since 1 - x ::::;; e-X (for all x), we see that

p n
--L ::::;; exp [- 2: bj inf 10,11-+ 0
P N-1 =N
j
as n-+ 00 by Condition 3.
We still have to show that
2:" X" < 00 a.s., (2.1 3)
for then, by Lemma 2 of the Appendix, it follows that

n
max
NSISn k=H
2: 1 P....!! X,,-+O a.s. as n -+ 00.
Pk
Since
max
\ A ,Pn \ ::::;; max I A, I -+O
NSlsn p, NSlsn
and
l 1 1N - 810 as noo,
PN-
the desired conclusion will follow from Equation 2.12.
To establish Equation 2.13, we can use either Condition 4 or 5. Under
Condition 4,
Then If Lk I Xkl < 00 and, hence, Lk Xk < 00 a .s. by the monotone

convergence theorem. Under Condition 5, we notice that the random
variables Xlo X2, " Xk are functions of Zl' .. " Zk, Tlo " Tko which

are themselves functions of 110 12, 1k, Wlo' " Wk, where Zk
" Ok Wk =
(see Equations 2.3, 2.4, and 2 9). In turn, lb ' . " tk are functions of
.
Wlo"', Wk-lo so that the Borel field induced by Xlo"', Xk is a sub

field of the Borel field induced by Wlo" " Wk' Thus,
C(Xk+lIXko, Xl) = C[C( Xk+llWk,"', WI)IXk,, Xl]'

The inner expectation is equal to
Ok+l(/Io, Ik+1) sgn (Tk+1 - 8)C(Wk+ll Wk,"', WI) = 0
by virtue of the assumed independence of the W's. Thus,
C(Xk+lI Xko, Xl) = 0 a.s.
and, since
we see that
Theorem D on page 387 of Loeve '(1960) applies, thereby proving

Equation 2.1 3. Q.E.D.
The conditions for mean-square convergence are identical with those
required for probability-one convergence, although the method of proof
differs.
THEOREM 2.2
Let {Yn:n I, 2, } be an observable process satisfying Assump
= . .
tions Al and A2. Let {In} be a sequence of estimators defined by the

recursion
(II arbitrary),
where an() is a Borel function of Il real variables and J = at> g ) is any

2
interval,finite or infinite,known to include the true parameter value O. If
Conditions 1, 2, 3,and either 4 or 5 of Theorem 2.1 hold,then
"lim tff(tn - 0)2 = O.

... ""
Proof. By Equation 2.5, we see that

(t+l - 0)2 :::; (Tn - 0)2 + 2Zn(Tn - 0) + Z,,2, (2.14)
where, as previously,
Furthermore,by virtue of Assumption AI,

(CWn)2:::; CW,,2 :::; a2 < 00 for all n. (2.1 5)
By the mean-value theorem,
T" - 0 = [1 - anF"(un)](t,, - 0),
so that for l arge n,
(T" - 0)2:::; (1 - b" inf lanl)2(tn - 0)2. (2.16)

Suppose that Condition 4 holds. Then, combining Equations 2.14,
2.15, and 2.16, and letting
(2.17)
we obtain
o :::; e+l :::; (1 - b" inf lanl)2en2 + a2 sup lanl2 + 2ae " sup lanl
:::; (1 - bn inf la,,1)2 e,,2 + Ml sup la"l(1 + en), (2.18)
where M's will denote various constants. Since sup la,,1 is summable
and since 0 < b" inf lanl < 1 for large n by Condition 2, Lemma 3 of
the Appendix can be applied to give
sup en2 = A122 < 00.
"
Thus,from Equation 2.18, we obtain
o :::; e+l :::; (1 - bn inf la,,1)2e,,2 + M3 sup la"l. (2.19)
Choose N so large that
b" inf lanl < 1 for n N,
and iterate Equation 2.19 back to N. We get
PROTOTYPE DETERMINISTIC GAIN 17
where
n
Pn = I1 (I - bJ inf laJI) -+ O.
J-l
A special case of Lemma 2 in the Appendix is (a version of the Kron

ecker Lemma):
(2.21)
since sup lakl is summable. But since 0 < Pn/Pk < I for all N ::;; k ::;;n,
it follows that Pn 2/Pk 2 < Pn/Pk. This and Equation 2.21, together with
Equation 2.20, give en 2 -+ O.
Under Condition 5, the Wn are independent, so that for every n, Wn
is independent of an(lh .." In) and Tn(lh'." In). Thus,
8Zn(Tn - 8) = 8[8(Zn(Tn - 8)1 Wh .." Wn -1)]

= 8[an(Tn - 8)8(Wnl Wh . . , Wn -1)] = O. (2.22)
By Equations 2.14, 2.16, and 2.22,
8(ln+l - 8)2::;; (1 - bn inf lanI)28(ln - 8)2 + 8Zn 2,
. (2.23)
and, by Equation 2.15,
o ::;; e+1 ::;; (1 - bn inf lanl)2en2 + (0' sup lanl)2. (2.24)
After iterating back to N, we have
o ::;; e+1 ::;;

(p Pn ) 2eN2 + 0'2 k i ( pknP ) 2 (sup lakl)2. (2.25)
N-l =N
Since nP -+ 0, the first term tends to zero, and the same argument used
earlier shows that
n
lim
Pn 2
n .... .., k=N pk
( )
(sup lakl)2 = O.
Thus, under either Condition 4 or 5, we have en 2 -+ 0 as n -+ 00. Q.E.D.

The conditions of Theorems 2.1 and 2.2 are satisfied by a number of
gain sequences, provided that the regression function satisfies a certain
number of the assumptions A2 through A5" listed at the beginning of
this chapter.
2.3 The Prototype Deterministic Gain
Consider the gain sequence
an(xh X2,"', xn) = ;nn2 sgn Pn . (2.26)

18 PROBABILlTYONE AND MEANSQUARE CONVERGENCE
Since Fn( . ) is monotone for each 17, the sign of Fn(x) is independent of x
and Equation 2.26 does not depend on the arguments. In instances where
speed of realtime computation is an important factor,these determin
istic gains possess the virtue of being computable in advance of the data
acquisition (although there is the possibility of a storage problem).
Since
sup a. (x) = inf a. (x) = an,
xeJ(n) xeJ(n)
Condition 2 of Theorems 2.1 and 2.2 holds under Assumption AS.

Furthermore, Condition 3 is ensured by Assumption A3, because the
Abel-Dini Theorem (Knopp, 1947, p. 290) says that
co
2: --
bk2
IS
{diVergent when r ::s;
0
if B.2 co. (2.27)
k= 1Bk2 +r convergent when r > 0
(This theorem will be used repeatedly.) If is an independent {Yn}
process,Condition S also holds under Assumption A3,because
2:" (sup a )2 = 2:" n:

.
.
< co.
If {Y,,} is not an independent, process,Condition 4 holds when

b" ,.., n
il
(0: > 0)
(by which we mean that the ratio of the two sides has a nonzero limit
inferior and finite limit superior),for then
bn 1
and I anI =
Jj2 ,.., n
1 +,,'
n
which is summable. In particular, we can do nonlinear polynomial

regression,
"
F,,( 8) = 2: fi( 8)ni (p <:: 1 ),
i=O
when the errors are dependent and have nonzero means.
2.4 Reduction in the Linear Case
It is instructive to examine our iterative scheme when the regression

fuilction is linear:
F,,(8) = b,,8 and
(Note the slightly different usage of b" here.) We take J = ( -co, co) ,
and the recursion Equation 2.2 becomes
GAINS THAT USE PRIOR KNOWLEDGE 19
In + 1 = In + an[Y.. - bn1n]
= (1 - anbn)ln + an Yn
B-1 bn v
= -B 2 In + B 2 .1. ..
n n
Iterating back to n = 1, we get
In+l
[ n (-BBr-2l)] 11 + k2:.. ( nn
n Br-1 bk
)-
Bk2 Yk
=l I=k+l B1
= 2
1=1 1
If the initial (no data) estimate is 11 = 0, then
which is precisely the least-squares estimator for (J based upon the first n
observations. In other words, the gain sequence, Equation 2.26, yields a
recursively defined estimator sequence which is identical to the corre
sponding sequence of least-squares estimators.
The variance of the least-squares estimate in the case of independent
identically distributed residuals is easily computed. Since
Yk = bk8 + Wk
with &Wk2 = 0'2, it foIlows that
0'2
&(In+1 - (J)2 = B 2'
n
Thus, In converges in quadratic mean to (J if and only if Bn2 00. But
we have already shown in the preceding paragraph that this condition
implies Conditions 3 and 5. Since in the present case supx I Fn(x)1 = bn,
Conditions 1 and 2 also hold. In short, Conditions 1, 2, 3, and 5 of
Theorem 2.2 are necessary as weIl as sufficient conditions for the
quadratic-mean (and, we might add, almost sure) convergence of the
recursively defined least-squares estimator,
"
In+l = I" + ; 2 (Y" - b"I,,) (11 = 0),
"
in the case of a linear regression function.
2.5 Gains That Use Prior Knowledge
In practical applications, prior knowledge about the true value of (J is

often available. There may, for example, be a nominal value (Jo, pre
dicted by rough theoretical calculations, which will hopefuIly be close
to the true value of 8. Under this supposition, the regression function

can be approximated by the first few terms of its Taylor series expansion:
F1\(8) ::::: F1\(80) + f1\(80)(8 - 80),
If W1\ denotes the error in the nth observation, we can write
Y" ::::: F1\(80) + f1\(80)(8 - 80) + Wn,

or equivalently,
where
Yn* = Yn - [Fn(80) - 80f1\(80)].
The parameter 8 now occurs (approximately) linearly in the mean value
of the observable Yn*, so that the recursive version of the linear least
squares estimator discussed earlier seems appropriate. Accordingly, the
gain sequence would be
(2.28)
and the iteration would be
In+l = In + an[Y1\* - f1\(80)11\]'

But (we are, of course, being heuristic)
Yn* - fn(80)In ::::: Yn - Fn(tn),
so we are led to substitute Yn - Fn(ln) for Yn* - fn(80)ln, and to
consider the" improved" recursion
(or, perhaps, a truncated version). This particular recursive estimation

scheme, that is, the gain sequence, Equation 2.28, is widely used in
current practice, especially the vector version (see Section 7.5 in
Chapter 7).
2.6 Random Gains
From the theoretical point of view, the gain sequences, Equations 2.26
and 2.28, are deterministic special cases of those of the form
(2.29)
RANDOM GAINS 21
with
for all x eJ(n). Some cases in poi nt are the following:
and, more important,
an(Xh , Xn) = (2.30)
The last is referred to as the adaptive" gain and is quite often used in

practice.
The convergence properties of estimates based on gains of the type of
Equation 2.29 are determined by considerations of the following sort. If
Assumptions A3, A4, and AS' hold,then Conditions 2, 3, and S of
Theorems 2.1 and 2.2 hold. Indeed,we see that
dn sup lanl ::; bndBn3n2 < 1

and
The Abel-Dini theorem, Equation 2.27,therefore guarantees the non

summability of bn
inf lanl, as well as the summability of
(sup lanl)2 ::; (bnd'ln2)2 ::; K' :.n

Assumption AS' is not essential if we mod ify the gain sequence
slightly. Let {p.n} be a pos itive null sequence, chosen so that
ILnbn2 _ 00
f Bn2 -
00
(This is always possible since 2.n bn2/Bn2 = under Assumption A3.)
Then, set
an*(Xh . , xn) ILnan(Xh

=
, xn).
22 PROBABILlTYONE AND MEANSQUARE CONVERGENCE
Under Assumptions A3 and A4,it is easy to show that
b 2
L:n bn inf 10n *1 >
-
K' L:n
Bn2
= ex> ,
and
Thus,the modified gain sequence
an * (Xl> " Xn) = fLnan(Xl> " Xn)

satisfies Conditions I, 2,3,and 5 of Theorems 2. I and 2.2 if an is of the
form of Equation 2.29. Hence,the sequence
will be a consistent estimator sequence if { Yn} is an independent process

whose regression function satisfies Assumptions A2,A3,and A4.
'
A still broader class of gain sequences satisfy the conditions
an ()I
bn bn
a a, 2 sgn an sgn Pn
Bn2 ::::; I X ::::; for all x E J(n>, =
Bn
If we impose Assumptions A3,A4,and AS", then it is easy to show that
dn sup lanl--+- O.
The same arguments used in the previous paragraph apply to the respec
tive nonsummability and summability of
L:n bn inf lanl and
Again, Assumption AS" can be dispensed with if an is replaced by

= fLnan, where fLn is a positive null sequence chosen so that
an *
We summarize all the foregoing (in what is actually a corollary to

Theorems 2. I and 2.2) in two more theorems.
CONVERGENCE THEOREM FOR PARTICUL AR GAINS 23
2.7 Theorems Concerning Probability-One and Mean-Square

Convergence for Particular Gains; Application to Polynomial
Regression
THEOREM 2.3
Let {Yn: n = I,2, } be a stochastic process satisfying Assumptions
. . .
AI', A2,and A3. Let
1 .. +1 = [In + an(t!>, In)[Yn - Fn(ln)]]'

where J = (gl, g2) is an interval containing the true parameter O.
Then,as n co, In 0 with probability one and in mean square if
I. a n = sgn Fn ;nn2 and Assumption A5 holds,

'
or
an2(Xh , Xn) sgn Fn
an(xb, Xn) ,
.
n

2.

=
fJn(Xb , Xn) 2
i-I
'Yl(Xl'' Xi)
n ), and Assump
where b n S an(x) , fJn x), 'Yn (x ) ( S dn for all x E J(
tions A4 and A5' hold,
or
abn 1 an () a'b.. t" 0 < a :::; a' < co sgn

3.
B ..2
S x I S
B ..2
lOr some , a .. x) ( =
sgn Fn for all x

n
E J ( ), and Assumptions A4 and A5" hold.
Furthermore, if an(xh, xn) is replaced by a .. *(Xh..., xn) =
ILnan(Xh.. , xn), where ILn is a positive null sequence chosen so that

L: IL nb ..2 co,
n Bn2
=
then Assumptions A5, A5', and A5" can be dispensed with in Conditions
1, 2, and 3, respectively.
For the special case of polynomial regression, most of the conditions
are automatically satisfied and the independence assumption can be
dropped.
THEOREM 2.4
Let {Yn: n = 1, 2, ...} be a stochastic process satisfying Assumption

AI, where
Fn(0) = L: jj( O)ni

p
(OEJ,p;;::;I),
i=O
fp( .) is monotone on J,and

o < inf I.h(x)1 S sup It(x)1 < 00 (j = 0, 1, ., p) .
xeJ xeJ
If, for some 0 < a S a' < 00 and all x EJ("),

a'b"
ab"
B
"
2 S I a"()I
x S B
"
2
and
sgn a,,(x) = sgn/p,
then the estimator
I" + l = [I" + a,,(11o , 1,,)[ Y" - F,,(I,,)] ] ! (11 arbitrary)

converges to 8 with probability one and in mean square as n 00.
Proof Denote
sup I I"(x) I by sup Itl,
xeJ
inf
xel
I t(x)1 by inf I tl.
If fp is nondecreasing,
inf F,,(x)
xeJ
nPinf I/pl[l + 0(1)]
as n 00, and if fp is nonincreasing, we have

sup
xel
F,,(x) S -nPsup I/pl[l + 0(1)].
In either case, F,,(;) is monotone for large n, and it is easy to find
constants 0 < Kl S K2 < 00 and N such that
KlnP S inf I F,,(x) I S sup I F,,(x) I S K2nP
xeJ xel
whenever n N. Thus, Assumption A I and Condition I of Theorems

2.1 and 2.2 hold when n N. Conditions 2, 3, and 4 hold automati
cally. Q.E.D.
Naturally, there are regressions (some of great practical importance)
which fail to satisfy the conditions we require in order to perform
recursive estimation procedures. We will exhibit two cases where one or
more of the conditions of Theorems 2.1 and 2.2 are violated.
1.8 Trigonometric Regression
For
F,,(8) = cos n8 (0 < 8 < )
'IT ,
EXPONENTIAL REGRESSION 25
the monotonicity restriction is violated: Fn(6) = -n sin n6 changes sign

at least once for every n 2 as 6 varies over J = (0,77). Fortunately,
other computationally convenient estimators are available. For ex
ample, we can estimate cos 6 in a consistent (both mean-square and
probability-one) fashion,using the estimator
C2n + v'Cn + 8Cn '
In =
4C1n
where C1n and C2n are the sample autocovariance functions at lags one
and two,respectively. These can,of course,be computed recursively in
n. Knowledge of cos 6 is tantamount to knowledge of 6 when 0 < 6 < 17.
[When J is CO,277),an independent estimate of sin 6 is needed to resolve
the ambiguity in the angle.] This problem and various more realistic
generalizations of it (e.g.,unknown phase and amplitude) are the subject
of a planned paper by the second-named author.
2.9 Exponential Regression
The function
Fn(6) = enD
violates Conditions 2 and 3 of Theorems 2.1 and 2.2 in an essential way.
For, if an{x) is any gain satisfying Condition 2,it follows that
bn inf I anex) I exp [n,d exp [ - n'2] = exp [-n{'2 - '1)],
xeJ(n)
and this is always summable.

However,common sense tells us that if the noise variance is bounded,
one should be able to estimate 6 with ease and accuracy (plot the data on
semilog paper),because the "signal-to-noise ratio" grows exponentially
fast. This is indeed the case. Suppose
and
We let
Yn* =
{log Yn if Yn> A,
log A if Yn A,
where A is a positive constant chosen to save us the embarrassment of
having to take logarithms of numbers near and less than zero. Then
Yn* = n6 + Vn,
where
if Yn> A,
if Yn A.
With high probability, Yn is going to be larger than A when n is large,

so that (heuristically)
where
Here Vn* has a second moment which goes to zero at least as fast as
e-2nO This suggests that we estimate 8 using weighted least squares. The
weights should ideally be chosen equal to the variance of Vn(e-2nO).
Since 8 is not known,we settle for a conservative (over-) estimate of the
variance, e-2n1, and estimate 8 by
Here tn+1 is related to tn by the recursion
tn+! = tn + (ne2n1/"lk2e2k1)[Yn* - nt 1
n (11 = 0).
If the residuals, Wn, a re independent and identically distributed with a

density f and have the property
lim sup IxI3Hf(x) < co
x-+:l::CO
for some positive S, then it can be rigorously shown that S(tn - 8)2 -+ 0
(exponentially fast) as n -+ co.
3. Moment Convergence Rates
We are now going to investigate more closely the large-sample

behavior of our estimates { tn} generated by Equation 2.2 when (a) the
errors of observation are independent with zero means and (b) some
member of the class of gains considered in Theorem 2.3, Condition 3,
is used. In fact, for the balance of our treatment of the scalar-parameter
case only such gains will be considered, so we repeat the delineation once
and for all as foIlows.
3.1 Restricted Gain Sequence
By a restricted gain sequence we mean a function an( ) defined for all

points x = (Xl> , xn ) in the n-fold product space J(n) of the interval J
.
with itself such that (see Assumption AI)

sgn an(x) = Fn
for all x E J(n), and (see Assumption A3)
Bn2
inf -b la,,(x)I a > 0,
xel(n) n
B"2
sup -b I an(x) I a t < ex:>
xe/(n) n
for some numbers a at and all sufficiently large n.

As already indicated in Chapter 2, the gains used in practice have
this property.
27
28 MOMENT CONVERGENCE RATES
Our first result tells us that the mean-square error tends to zero as
IIB,,'.1. whenever there is such a constant a which exceeds t,that is,when
B"2 la,,(x) I
lim inf inf > t.
" SE/(n) b"
The conditions of Theorem 3.1 are the same as those that ensured strong
and mean-square consistency in Theorem 2.3, Condition 3.
3.1 lbeorems Concerning Moment Convergence Rates

THEOREM 3.1.
Suppose {Y,,:n = 1,2,} satisfies Assumptions AI', A2, A3, A4,

and A5w,whereJ = (fl' f ) is any interval known to contain 8. Further
2
more, suppose
sup 8[Y" - F,,(8)]2Q < 00

"
for some positive integer q. Let 11 be arbitrary and
1,,+1 = [I" + a,,(/1o, I")[Y,, - F,,(/,,)]] (n = 1,2,),
where {a,,} is any restricted gain sequence. Then,if a > t,it follows that
8(/" - 8)2Q = 0
(B21/)
as noo.
Proof. We let
a" = a,,(/1o, I..}, T" = I" + a,,[Y" - F,,(/,,)

(and note that this meaning of T" differs from that of Chapter 2 ). Since
the value of I"+1 is that of T" truncated to the intervalJ,which contains
8, the former must be closer to 8 than the latter. Consequently,
(/,,+1 - 8)2P S (T" - 8)'.1.P (3.1)

for any integer p.
We first derive an upper bound for 8(T" - 8)2P which is linear in
8(/" - 8)2P. By the law of the mean
T" - 8 = I" -8 + a,,[F,,(8) - F,,(/,,)) + a"W"

= [I - a"F"(u,,)(/,, - 8) + a"W", (3.2)
where u" lies between I" and 8, and
W" = Y" - F,,(8).

THEOREMS FOR MOMENT CONVERGENCE RATES 29
Thus,
(T" - 8)21' = [1 - a"F"(u,,)]2p(t,, - 8)21'
+ 2p[1 - a"F"(u,,)]2P-l(t,, - 8)2p-la"W"
+ 1=2 (2) [1 I
_ a"F"(u,,)]2p-l(f,, - 8)2p-l(a"W,,)I.
Conditioning by tlo t , , f" is tantamount to conditioning by

2
110 Wlo, W" l. Since the zero-mean W's are presumed independent,
_
the second term on the right-hand side has zero conditional expectation,
giving
C{(T" - 8)21'1 flo , I,,}
= [1 - 2pa"F,,(u,,) I en [-a"F,,(u,,)]I] (f" - 8)21'

+
+ en [1 - a"F"(u,,)]2p-l(t,, - 8)2p-la"ICW,,1
1
2
(p = 1,2,, q). (3.3)

We now utilize the sure bounds imposed by our assumptions. To save
space we set
(3.4)
an abbreviation which will be used throughout this chapter and the
next. Since a"F,,(u,,) = la"I I F,,(u,,)I, we then have
(3.5)
where
a" = a' sup d"lb" < 00,
the reason being that, because f" and 8 belong to J, u" must also.
The Inequality 3.5 will be valid only for all n exceeding some finite index,
which generally depends on a and a'. However, without loss of gener
ality, we may proceed as though the gain restriction is met for n =
1,2,, and thereby obviate continual rewriting of the qualification.
With this understanding, we now majorize the right-hand side of Equa
tion 3.3 by bounding 2pa"F,,(u,,) from below and everything else from
above with the deterministic quantities in Equation 3.5. Following this,
we take expectations and use the sure inequality, Equation 3.1 . The
result is
C(t"+l - 8)21' (l - 2pa{3" + K{3"2)O(t,, - 8)21' 1
+ K'
1=
( )
b
2 E"2
"Cit" - 8121'-1
(p = 1,2,, q), (3.6)

for all n. Here K and K' are some finite constants depending on p, but
not on n, and the latter contains the hypothesized uniform bounds on
the observational error moments.
Inequality 3.6 is the starting point in the derivation of moment
convergence rates.
For the presently hypothesized case a > -1. we introduce the working
definition
X",'= B"'_l(t", 0) -
and multiply Equation 3.6 through by B",21'. We get
GX ; l =:;; (;::J 21' (1 - 2paf3", + Kf3",2)GX,,21'
+ K'
1=2 B"-l
1
( )
21'- f3,,1/2 CIX",121'-I. (3.7)
Since f3n -+ 0 as n -+ 00, we find that
(J1' =
1
(1 _ f3n)1' = 1 + pf3n + 0(f3n2),
where all order relations will be as n -+ 00. Thus, for some c > 0 and all
lare enough n, we have
( B ) 21' (1 - 2paf3", Kf3",2) = 1 1 )f3n 0(f3",2) cf3""
B",: 1
+ - p(2a - + =:;; 1 -
(3.8 )
because 2a 1 > O. Let N be fixed large enough so that Equation 3.8
-
holds with cf3", < 1 for all n N. Introduce the inductive hypothesis
(3.9)
It is afartjar; true that the expectations in Equation 3.7 remain bounded
as n -+ 00 for each index i. Since f3n -+ 0, the summands for i > 2 are
evidently each of smaller order than the (i = 2)-term. Thus, after
substituting Equation 3.8 into Equation 3.7, we have
GX; 1 =:;; (1 - cf3",)C Xn 21' + K"f3" (all n ;:::: N).
Iterating back in n to N, we obtain
",
", ",
X; 1 =:;; n (1
J=N
- cf3j)C XN21' + K" L: n (1
k=N J=k+l
- Cf3J)f3k'
From the scalar version of the identity which is Lemma 1, it follows that
the right-hand side is equal to
K" (1
Q",GXN21' + - Qn),
C
where
..
TI (1 - Cf3f)
Q.. =
f=N
Since Lf3 .. = 00 (Equations 3.4 and 2.27 with r = 0), Q" tends to zero as
n 00. This shows that .;It'p -1 implies .Yf?p' Since .;It'1 holds trivially and
B,,2/B_1 1, the asserted conclusion follows by induction on p =
1 , 2, .. " q. Q.E.D.

For gains with a -!- our technique of estimating convergence rates
requires that we strengthen our assumption 13" 0 to Lf3"2 < 00.
THEOREM 3.2
Let. the hypotheses of Theorem 3.1 hold with Assumption A5"
{
strengthened to A5'". Then
O "2
- 8)2q is at most the order of e r
if a = t,
tff(t" _1_"
if 0 < a < t.
B 4q ..
"
Proof. We first iterate Inequality 3.6 back to the index N for which
Z" = 2paf3" - Kf3,,2 E (0, 1) and log B,,2 > 1
for all n N, which can be done since 13" 0 and B"2 00. This gives
"
tff(t" + 1 - 8)21'
(l - ZJ)tff(tN - 8)21'
JTI

=N
..
L: TI+ K'
.. 21' 1
( 1 - zJ) L: BI<2 tffl!1<
(b ) - 8 1 21'-1.
I<=N J=I<+1 1=2 I<
We apply Lemma 4 with Z = 2pa to get
11>( I"+l - 8)2 I'

t.>
DN-1BtP1C(tN - 8)21'
B" 41'''
K' "
+ -- B L.
21'
" BI<41' .. -113I<1J2CIII<
'" DI< L. - 8121'-1
" k=N 41'''
1=2
(a > 0), (3.10)
where the DI<'s are uniformly bounded in accordance with the lemma.
Consider first the case a = h and set
X.. -- VlogB"-1B_1 (I" - 8)

Multiplication of Equation 3.10 through by B,,21'/(log B,,2)1' gives
-
IX11+ DN-1(lOg B-l"IXNIP
1 S
(log B.I.,
= 0(1) +
K'
1'
"
L D k
21'
L
(
Bk2 log B_l 1'-1/2
B2 l Iog B 2
)
(I og B"2) k=N 1=2 k- k
X (log Bk2) 1/2,8k'/2CIXkI21'-1
1'- (3.11)
as n -+ 00 for all p q. We again make the inductive hypothesis,

Equation 3.9, but with X" redefined as in this paragraph. Then the first
and fourth factors of the i-summands in Equation 3.11 are bounded
uniformly in k N. The second factor is never less than unity while the
third always is. The (i 2)-term therefore dominates, and
=
K H "
0(1) + I- LN ,8k,
og B"2 k=
because log Bk2 increases with k. But the last written sum is 0(1) as
n -+ 00 by Equation L4.3 of the Appendix. In fact, we have (Knopp,
1947, p. 292),
" b 2
k log B ,,2,
L
k=l Bk2
which, incidentally, makes explicit the Abel-Dini Theorem (2.27) for
r = O. (The symbol will always mean the ratio of the two sides tends
to unity.) Therefore, .Yt'1'_l implies .Yt'p, and the proof for a -} is =
completed by induction on p 1,2" , " q.

=
For the remaining case, we redefine the working variate as
(0 < a < -!-)
and multiply Equation 3.10 through by B,,41'''. In place of the final bound
in Equation 3.11, we find
"
eX:!l DN_ltffXN21' + K' L Dk
k=N
x
21'
'" Bk2
L- -B2
( )
"<21'-0 ( B 2k-4"
2 ) 112
,8
1!>IX 12
Ii> k
1'-1
1=2 k-l k
As before, the sum on i is seen to be bounded by a constant independent

of k times the (i 2)-term, so that under the inductive hypothesis
=
.:It'P-l we have
C Xl 0(1) + K" i B k4a'

k=N
The last written sum is the same thing as
" b2
L B 2+21-2a)'
k=N k
which, according to Equation 2.27, tends to a finite limit as n -+ 00 .
Therefore, :/('P holds, and the usual induction finishes the argument.
Q.E.D.
The conclusions of Theorems 3.1 and 3.2 depend on the value of a
that results from our selection of a gain sequence. Letting
c" =
. f B,,2 l a n(x)1 '
In
xeJ(n) bn
we see that there always exists an a > -i when L limn inf c" > -1. and
=
an a < -} when L < -to Generally speaking, the case a -} occurs only=
when L =-} and Cn < t for only finitely many values of n.

It is important to note that the assumption a > -} is a necessary one/or
the conclusion o/Theorem 3./, at least when Assumption AS'" also holds.
To demonstrate this, assume the gain restriction is satisfied by some
largest (smallest) number a (a'), and that Var Y" uo2 > 0 for all n.
Furthermore, let us take J ( - 00 , (0), which forces equality in Equa
=
tion 3.1. Then, if we use Equation 3.S in Equation 3.3 with p I, we get =
C{(t"+l - 8)21110"" In} (1 - 2a"{3n)(I" - 8)2 + a2u20 ,,:

after dropping the positive term a2{3"2(1,, - 8)2. Taking expectations and
iterating back to an appropriate finite index N, we obtain, with e,,2 again
abbreviating &'(1.. - 8)2,
" " n
b2
e+l I1 (1 - 2a"{3;)eN2 + a2u02 L I1 ( I - 2a"{3J) k 4'
J=N k=N I=k+l Bk
If we assume ",{3..2 < 00, we can apply Lemma 4 with z 2a" and =
K =O. Thus, after further weakening our lower bound by dropping the
positive term involving eN2, we have
a2u2 " b2
e+l L k 4
B 4" k=N C B "-4a"
.. k
const " bk2 .
>- '"
- B "4a" k=N
L.
B k2+2(1-2a")
The strictly positive" const" involves a uniform lower bound on the C's,
which exists according to Lemma 4. Using Equation 2.27 once again, we
see that
lim infB 4""e2

" =
+ oo
{A> if a
"
"
t,
.. .. 0 if a < t.
Thus, if the assumption a > 1- of Theorem 3.1 fails, the mean-square
error cannot generally be 0( I/B,,2). Indeed,
"
for all a < t, that is, all cases in which
d" 1
1 sup- < -.
" b" 2a
3.3 Power-Law Derivatives
We have shown in Theorem 2.4 that the conditions of Theorems 3.1

and 3.2 hold for polynomial regression when the function multiplying
the largest power of n is bounded, differentiable, and I I on J. In any -
such problem the sequence of Equation 3.4 goes to 0 as lin. We point

out here that this is true for the larger class of squared infimums of the
form
(- I < {3 < (0), (3.12)
where I" > 0 is any "slowly" increasing or decreasing sequence, i.e.,

one for which
1"+1
I"
=
1 + 0 (!).n
In fact, for any such sequence (see, for example, Lemma 4 in Sacks,
1958), we have
We should not infer from this that nb"2 -+ 00 is necessary to meet our
conditions B"2 -+ 00 and 'L-{3"2 < 00. Indeed, if
1
b2
" = -- , (3.13)
n log n
it is true that Bn2 ;;;; log log 11, and hence {3" o(I/n). We retain this {3..
=
behavior, and make B,,2 increase even more slowly, when we replace
RELEVANCE TO STOCHASTIC APPROXIMATION 35
log n in Equation 3.13 by the product of iterated logarithms (see

Knopp, 1947, p. 293).
At the other end of the spectrum, we cannot (as already noted in
Chapter 2) handle derivatives which increase faster than some finite
power of n, that is, exponentially fast. In such cases, two assumptions are
violated because d,Jbn -+ 00 and bn21 Bn2 + O. Although the latter can be
compensated for by dividing the gains by lIn, the former cannot. An ad
hoc treatment like the one used in Section 2.9 is required.
3.4 Relevance to Stochastic Approximation
At the end of Chapter 1, we rewrote our iterative procedure as a

Robbins-Monro-type process. Here we pursue this point a bit further
and relate the two preceding theorems to some known results in the
theory of stochastic approximation. For this purpose we will take J to be
the entire real axis.
Consider the following situation. For each real number x and integer
n, let Zn(x) be an observable random variable hose mean value can be
written in the form
CZn(X) = Gn(x) = Kn(x)(x - 6) (3.14)
for some 6. Furthermore, suppose sUPx.n Var Zn(x) < 00. The function
Kn(x), which may depend on 6, is assumed one-signed, say infx Kn(x) > 0
for every fixed n. Corresponding to a choice of weights an > 0, we then
estimate 6 by
(n = 1,2",,; t1 arbitrary), (3.15)
where Zn denotes a random variable whose conditional distribution,
given t1 = 1o 1 = 2" " In = n is the distribution of Zn(n) (or,
2 '

equivalently, the conditioning can be on the values of t10 Z1o" ', Zn-l ) '
This is Burkholder's (1956) type-Ao process specialized to the case
where the regression functions all have the same zero. The significance
of our results lies in their validity for a much larger class of Kn(x)'s than
heretofore considered.
To apply Theorems 3.1 and 3.2, it clearly suffices, in accordance with
Equations 1.8 and 1.9, to make the following symbolic identifications:
Kn(x) = IFn(un)l,
Zn(x) = sgn Fn[Fn(x) - Yn], (3.16)
an =
lanl.
Independence of the Yn's is essentially the stated property of the zn's
Our assumptions A2, A3, A4, and AS'" place restrictions on

b" = inf Kn(x), tI" =
sup K,,(x). (3.17)
x x
Thus, if a" is chosen as any restricted gain sequence, we have for the
{
mean-square estimation error of the successive approximations
(Equation 3.15)
1/B"2 for -t < a < 00 ,
,f(t" - 8) 2 is at most the order of log Bn2/Bn2 for a =

-t ,
1/B,,4a. for 0 < a < -t,

(3.18)
as n tends to infinity. There is, of course, a concealed limitation on
applicability: we need to know the n-dependence of the infimums bn in
order to select the proper gain sequence.
In the degenerate case bn bo > 0 and tin = do < 00 for all n, we
=
have essentially the model under which the original Robbins-Monro

process was studied, namely, that an unknown regression function falls
between some pair of fixed straight lines passing through the point to be
estimated (and hence G" might as well be viewed as independent of n).
Since Bn 2/b" = bon, Assumptions A2, A3, A4, and AS'" are obviously
satisfied. Furthermore, a" a/n is a restricted gain sequence, no matter
=
what positive value we fix for a. The resulting Robbins-Monro process is

mean-square convergent for all positive values of
a = boa = minimum slope x gain const.
The way in which the rate of convergence is governed by this product is

given by Equation 3.18 with Bn2 proportional to n. This special case has
been derived by Hodges and Lehmann (1956).
If we assume that
I
dG,,(x) =
K and
dx x =6
for all n, the deviations vii (tn - 8) of the Robbins-Monro process tend
to be normally distributed about the origin in large samples. The
variance (Sacks, 1958, Theorem 1) is
a2
where V(a) = -- ,
2a - 1
provided that a is chosen large enough to make a > -t. Here V is

minimized by a I, that is, by a
= I/bo (cf. Newton's method with a
=
GENERALIZATION 37
constant correction weight). As we will see in the next chapter, the

function V appears in the limiting distribution of our estimates gener
ated by various choices of the restricted gain sequence.
3.5 Generalization
As emphasized in Chapter I, we are interested in deriving general

results in the theory of stochastic approximation only insofar as they
pertain to the analysis of the recursive estimation scheme for nonlinear
regression problems. However, it seems appropriate to note here that we
could have written Theorems 3.1 and 3.2 not only in the wider formula
tion of Equations 3.14 through 3.17 but, moreover, with the first of these
replaced by
G,,(x) = K,,(x)(x - 8,,).
It is not difficult to show that the conclusions hold as written, provided

that the roots of the regression functions are such that
8" - 8 = 0 (,.)
as n 00.
Theorem 3.1 so generalized is Burkholder's (1956) Theorem 2 (after
we ignore the continuous convergence portion of his conditions which
are imposed to show that B"2QC(t,, - 8)2Q is not only 0(1) but tends to
the 2qth moment of a certain normal variate). However, Assumptions
A3 and A5" permit a much larger class of K,,(x)'s than does his corre
sponding assumption that b,,2 is of the form of Equation 3.1 2 without
the i,,'s and the exponent restricted to -1 < {:J o.
4. Asymptotic Distribution Theory
None of the results in Chapter 2 or 3 depended on the nature

of the "iteration space" J other than that it should contain the true
parameter point. However, when we turn to the question of a limiting
distribution for the successive iterates, we will find we need stronger
conditions if one of the end points is finite.
Theorem 4.1 assumes that J is the entire real axis. As already pointed
out at the beginning of Chapter 2, this still covers cases where () is known
to belong to a finite or semifinite interval, say Jo, and an untruncated
procedure is used by linearly extending the regression functions to
(-co, co). On the other hand, when the iterates are confined to such an
interval Jo, the hypotheses of Theorem 4.1' require the existence of
higher-order moments. The number of these, over and above the fourth,
depends on how fast the"signal" is becoming"buried in the noise,"
that is, how fast the regression functions are flattening out (if in fact they
do).
Theorem 4.1 is not immediately applicable (although it does have
theoretical interest in its own right). In fact, it might better be regarded
as a lemma for Theorem 4.2, where we show that its subsidiary hy
potheses are indeed satisfied by some particular gains. The new assump
tion concerning the functions IFnl/bn can be replaced by a different one,
which is discussed after the proofs. First, however, we will need the
following.
38
ASYMPTOTIC NORMALITY FOR GENERAL GAINS 39
4.1 Notation for and Relations Between Modes of Convergence
The following are (standard) symbolic descriptors for the asymp

totic behavior of sequences of random variables. We list them here
and use them in this chapter and the next without referring to them
again.
1. Xn = 01'(1) means Xn remains bounded in probability asn --+ co.
2. Xn 4 X means Xn converges to X in the rth mean(r > 0).
3. Xn X means Xn converges to X with probability one.
4. Xn --+ X or Xn = X + op(l) means Xn converges to X in
p
probability.
5. Xn "" Yn meansXn and Yn have the same limiting distribution.In
particular, if Yn "" Y and Y is normal with mean 0 and variance
I{J2, we write Xn "" N(O, I{J2).
Certain mutual implications will frequently be used: Mode 4 is a conse

quence of either 2 or 3; Mode 4 and a.s. supn I Xnl < co imply Mode
2; and 5 impliesI (seeLoeve, 1960, Chapter 3) .Furthermore, ifXn "" X
and Yn oil), then Xn + Yn "" X, and Mode 4 is preservedthrough
continuous transformations. There is a calculus of01' and01" analogous
=
to0 and 0; for example, op(1)OI'(I) op(l) (seeChernoff, 1956) .=
4.2 Theorems Concerning Asymptotic Normality for General Gains
THEOREM 4.1
Suppose that {Y,,: n 1,2,...} satisfies Assumptions AI', Al, Al,
and AS with J
==
A4, == (- 00, 00), and set
g ,,(x) ==
I lb) 1.
Suppose the functions g10 g,lt' . are continuously convergent at the
point 9; that is to say, for every sequence {xJ tending to 9 as n-+oo,
{g,,(xJ} has a limit. Furthermore, suppose that
and sup "I Y " - F,,(9)II+I < 00,

"
for some 3 > O. Let tl be the arbitrary and
tll+l ==
t" + a.(t1o . . ' tJ[Y" - F,,(tJ]
(n ==
1,2, . . ),
where {aJ is a restricted gain sequence such that
L == lim inf inf Bb"1 1a,.(x)1 > i.

" llel<R) "
40 ASYMPTOTIC DISTRIBUTION THEORY
Then
asn 00, provided that the functionsequences{gn} and{an} satisfy the

following further conditions:
1. l miff [:2 lan(t1>.. " t,,)1 g,,(tn) .\r - = 0,
2. li Iff [::2' a,,2(t1>"', t,,) f = 0, - f'
for some .\ > 1- andf' > O.
Proof. Weintroducethe abbreviations

B,,2
Y" = Tn la,,(t1>.." t,,)1 g,,(t,,),
y,,
' = n2Ia,,(t1>""
71
tn)1 gn(un) (4.1)
where Un is the point, with the indi cat ed property, which arose in
Equation 3.2 from the law of the mean. A ssumption concerning the
(bounded) random variableYn is mad<; underConditionI in the state
ment of the theorem to be proved. Letting .\ be as it is postulated
there, we rewrite our untruncated difference equation
as
In+l - 8 = (I - Yn'fJn)(t" - 8) + anW"
= (I - .\fJn)(ln - 8) - fJn[(Yn' - Yn) + (Yn - .\)]
x (In - 8) + anWn, (4.2)
wherean and Wn are the same as in Equation 3.2 and fJn is stilI given
by Equation 3.4. After iterating this back to an arbitraryfixed integer
N, we obtain
"
tn+l - 8 = n (1 - A{3;){tN - I)
i=N
+ {J,Jy" - Xt" - I) - a"W,,]

= I + II + III + IV. (4.3)
We are going to show, under the conditions of the theorem, thatY, II,
andIII go to zero in first mean faster than 1/Bn, while Bn timesIV has
the asserted limitingnormal distribution.We fixN, as usual, sufficiently
large so that )..f3, < 1 for allj N.
With regard toI: From Lemma 4, we have
..
n (l
'k+l
- )..f3,) ::;
DkBk2i\
B 2"
-
"
= 0 ( 1)
-B
"
' (4.4)
for anyfixedk (in particular, k = N - 1), because ,\

2 exceeds one by
assumption.
With regard to II: Using Equation 4.4 and the Schwarz Inequality,
we obtain
According to the hypothesis of the theorem, we haveL .1- + e 2 for

somee > O. By definition of the limit inferior there corresponds, to any
suche, a finite indexne such that, for alln > n..
inf
xeJ(n)
B ,,
b"
2
la,,(x)I L - e
1
- +
2
e.
The gain restriction is therefore satisfied by some numbera = 1- + e > 1- ,

so thatbyTheorem 3.1
ek = tB'%(tk - fJ)2 = 0 (J
as k -+ 00, independent of the value of a. Next, let " denote the
hypothesized limiting value ofgk(X,J when Xk tends to 8. Then from
Equation 4.1 and the gain restriction,
(say). (4.6)
Since tk, and hence Uk, converges in probability to 8, it follows that

Vk = op(l) ask -+ 00. But for alln andx, we have
( co = sp : (0), < (4.7)
and therefore the t'k'S are bounded random variables and tB't'k2 -+ O.
-+
The sequence following the center dot in Equation 4.5 is thus
0(1) as k 00. By Lemma 4, SUPkN Dk < 00. Therefore,
0(1)0(1)
- 1 > 0, the bound inEquation 4.5 must go to zero asn -+ 00
=
since )"
2
byLemma 5.
With regard to III: In the same way, we find that

" B 2l\-1{3 k
kL
B"ClllIl ::::;; const ; C%(Yk - 'A)2Bkek'
=N "2l\-1

By Condition 1, Theorem 3.1, and Lemma 5, this bound also goes to

zero as n 00.
With regard to IV: The preceding and Equation 4.3 combine to give
" "
B"(t,, - 8) L B" n
'" k=N (1 - 'A{3j)akWk == X"
j=k+l
as n 00. To show that X" has the asserted large-sample normal
distribution, we express this sum in the formulation of Lindeberg's
Central Limit Theorem (Loeve, 1960, p. 377):
"
kL
X" = X"k'
=l
{:. Ii
for k = 1, 2, , N - 1, (4.8)
a"k =
( 1 - Afl,)a. for k=N,N + l, , n.
j=k+1
The multipliers a"k are random variables via
ak = ak(tlo., tk).
From the form of the iteration, it is clear that
110 t2, " tk
tlo Wlo"', Wk-1 are equivalent conditioning sets. (4.9)
tlo X"lo"', X",k-l

Thus,
CX"k = CC{X"kltlo t2, "', tk} = Ca"kC{Wklth Wh, Wk-1} = 0
by the assumed independence of the errors Wk = Yk - Fk( 8). The

summands are therefore centered.
Next, we set
if IX"kl > B
otherwise,
where B > 0 is arbitrary, and
ak = C{Xkl th X"h"', X",k-1}

Uk = C Xk'
Aspecial case of Lindeberg's theorem tells us that

X" ,.., N(O, ,p2)
if the following conditions hold:
"
a. lim L tS'X""(e)X,, = 0,
" k=1
(4.10)
"
c. lim L1 a" ,p2 < 00.
" "=
=
It remains to prove that Equation 4.10 is a consequence of our assump

tions , with the asserted formula for ,p2.
With regard to a: By HOlder's Inequality, we have
O"Xnk(e)X;" :;:; P6{(1+6){ l onkW,, 1 > e} C1/(1+6)0k1+6)W,,2(1+6),

where 28 is the 8 of the theorem's hypothesis. From Equations 4.4, 4.8,
and the gain restriction, we obtain a sure bound
B,,2l1.-1 b"
1 0n,, 1 :;:; const Bn2l1.-1 B = an" (say).
"
By Markov's Inequality, we have
Thus, we obtain
Condition a follows from Lemma 5 because 2,\ - 1 > 0 and

lim" ... f3/ = O.
co
With regard to b: In addition to Equation 4.1, we need one more

abbreviation; namely,
Bn4 ( B
Zn = n2
b,,2 a 110,,', tn) = Tnn2 an2(110" " tn) (4.11)
According to Equation 4.9,

k = 1, 2" " , N - 1,
(4. 12)
k = N, N + 1" " , n.
Thus, from Equation 4.4, we obtain

B 2(211.-1){J k
fflk - akl Dk2 ; 2(211._1) . tS'IZk - Czkl, (4. 13)
,,
because ak = CSk (the left-hand side, of course, being zero for k < N).
But
tS'IZk - tS'zkl tS'IZk - JLI + 1tS'(Z k - fL)1 2ff V. (Zk - JL)2. (4.14)
After substituting Equation 4.14 into Equation 4.I3 and summing over k
from 1 to N, Condition b follows from Condition 2 and Lemma S.
With regard to c: We have from Equation 4.12, in the notation of
Lemma 6,
" " "
{J
l
k a k = a2 B ,,2 k N iXL
(I - >'{JJ)2 :2 [p. + tS'(Zk - JL)]
B
= a2JL'Y,,2(>.) + 0( 1),
where the order term exists for reasons already given in the previous
paragraph. We immediately obtain
t/12 = ....!
2 !....!!:..
2 >' - 1
by the conclusion of Lemma 6. Q.E.D.
Remark. The restriction to gains for which there exists an a > !

guarantees that e " = O(l/Bn) and, consequently, under the present
assumptions, that II and III in Equation 4.3 are both op(l/Bn). As
pointed out following the proof of Theorem 3.2, it is necessary, when
Assumption AS'" holds, to have a > ! to ensure this rate of convergence.
Assumption AS'", in turn, was needed to apply Lemma 6 and get a
definite limit t/12 in Equation 4.10.
THEOREM 4. 1'
Let the hypotheses of Theorem 4.1 hold over J [Et> E2 ], with at =
least one of the end points finite, and suppose we choose the interval so
that 0 is an interior point. In addition, assume there exists an integer
p, 2 p < co, for which
and, corresponding to the smallest such integer,

sup C[Y" - Fn(O)]2P < co.
n
Let t1 be arbitrary, and
tn+1 = [tn + anCt1' . " tn)[ Yn - Fn(tn)]] (n = 1, 2" ,,),
where {an} is any restricted gain sequence having L > t. Then the con
clusion of Theorem 4. 1 holds under Conditions 1 and 2.
Proof. We represent the effect of truncation as an additional

term on the right-hand side of the fundamental formula, Equation
4.3, in the foIlowing way. Let Tn again abbreviate the function
tn + an[ Yn - Fn(tn)]. Define the indicators
if Tn gl> if Tn g2,
otherwise, otherwise,
and the random variable
(4. 15)
In what foIlows we proceed as though both end points are finite. If one
is not, the appropriate term in Un is to be deleted and the ensuing
arguments accordingly modified.
In this notation the truncated recursion is
and Tn - 8 is given by the right-hand side of Equation 4.2:
Tn - 8 = ( 1 - Yn'{3n)(tn - 8) + anWn (4.16)
The meaning of all symbols is the same as before, the only difference
being that Ih, In, Un and 8 now belong to a finite interval. We thus
have
In+1 - 8 = (right-hand side of Equation 4.3) + Un,
The hypotheses of the present theorem include those of Theorem 4.1,

after the latter is rewritten for a finite interval. The conclusion will thus
be at hand once we show that BnUn op(l ). =
From Equation 4.4, since SUPk:.:N Dk < co, we have

46 ASYMPTOTIC DISTRIBUTION TIIEOR Y
After taking expectations and using Lemma 5, we see that it suffices to

prove the stronger statement
(4.17)
as n -+ 00, and this is what we now proceed to do.

If we set
and
then, from Equation 4. 15, we have
All quantities on the right-hand side of Equation 4.16 are surely

bounded, with the possible exception of = Y" - F,,(8). Therefore,
W"
IT"I has as many moments as I W"I,
which by hypothesis is 2p. From the
Holder Inequality and the fact that Cl/rl Xlr is a nondecreasing function
of real numbers r, it follows that
CIU"I [Cx" + Cl/2p1T,,12p . &(2p-ll/2pX,,2p/(2'P-ll

const C(2p-ll/2pX". (4.18)
We seek, therefore, the n-dependence of Cx".

The random variable y,,' in Equation 4.16 belongs to a finite interval
of positive numbers (see Equations 4.1 and 4.7). Hence, for aU large
enough n, Equation 4.16 gives
IT" - 81 It" - 81 + a' ;"2IW"I,

"
because {3" tends to zero. Since we are assuming, without loss of gener
ality, that [1 < 8 < [2,
we can write
8 - [1 28 > 0,
for some such 8. For the right-hand end point, we therefore have
tffX"2 = P{T" - 8
[2 - 8} P{lT" - 81 28}
P{lt" - 81 e} p{IW,,1 b:}

+

tff(t" - 8)21' (a'b,,)2 2p 0"1W:"121'
+
821' eB,,
ALTERNATIVE ASSUMPTION 47
after using Markov's Inequality. The second term, in the notation of

Equation 3.4, is 0(f3nPIBn2p) = o(IIBn2p) and hence, according to
Theorem 3.1, of smaller order than the first. It is clear that <Jxn1 can be
bounded by a sequence with the same n-dependence; hence
Con sequ ently, from Equation 4. 18,
ClUnl =
() )
0 B P-l -
But
1 1 f1n f1n
Bn2P-l = (BnP-2bn)2 . Bn = 0 Bn ( )
by hypothesis. This establishes Equation 4.17 which, as already argued,
is sufficient. Q.E.D.
Remark. The assumption that BnP-2bn 00 .for some finite p is
directed at those situations in which the sequence of derivative functions
tends to zero as n 00, and it places a limitation on the way in which
we allow this to happen. It excludes Equation 3.13, and also infimums
like bn2 = log nln, since then we would have En2 log2 n. However, the
assumption is satisfied by Equation 3.12, with p 2 the first integer
exceeding 2 - f11(f3 + 1). This makes quantitative the required relation
ship between the rate at which the derivatives are approaching zero and
the number of existing noise moments.
4.3 Alternative to the Continuous Convergence Assumption
Referring to the proof of Theorem 4.1, we see that the assumption of

continuous convergence of {gn()} at the point 8 was explicitly used to
show that the expectation of the second term in Equation 4.3 went to
zero faster than lIEn. More specifically, according to Equation 4.5, any
assumption which ensures limk "" C(Yk' - Yk)2 = 0 will suffice. One
__
alternative is that the sequence gl> g2, . . . possesses a common concave

modulus of continuity on J (finite or not as the case may be). That is to
say, there is a function rp with the properties
(-2 -)
+ +
o
lim rp(t) 0, rp()
1 rp() !P(11) rp(t2) rp
11 12 ,
1--0
=
12 , 2
(all t2 11 0),
48 AS YMPTOTIC DISTRIBUTION THEOR Y
such that
sup Ign (X 1 ) - gn(X2)I <pet) (4.19)
IXI-X2IS!
(Xl' X2)eJ
fora lln. Indeed, inplace of Equation4.6, wewould then have

IYk' - Ykl a'lgk(uk) - gk(tk)I a'[<p(luk - 8/) + <p(ltk - 8/)]
2a'<p(ltk - 8/).
By Equation4.7, <P is necessarily bounded; thus, for the same reasons,
tn 8 implies that6'<p2(lfk - 81) --+ 0 ask --+ 00.
The two assumptions are clearly unrelated: the existence of(anykind
of) a common modulus of continuity is an interval property of all the
functions and says nothing about convergence of the sequence (con
tinuous or otherwise) at a particular point. We have selected the con
tinuous convergence assumption because it is implicit inCondition 1 of
Theorem 4.1. This becomes clear when we turn to the question of
satisfying the two provisos for someparticular gain sequences.
The three gains listed below in Theorem 4.2 were introduced in
Equations 2 .2 6, 2.30, and 2 .2 8, respectively. We have added a multiplica
tive gain constant in each case.The first and third gains, it will be noted,
are deterministic, while the second is random.The value of80 appearing
in the third will usually be fixed on the basis of p rior knowledge
concerning the value of8 and might as well be identifiedwith the initial
guessfl'
4.4 Large-Sample Variances for Particular Gains
THEOREM 4.2
Suppose that{Yn : n = 1, 2 , "'} satisfies Assumptions AI', A2 , A3,
A4, and A S" with J = (-00,00). For a llx in J, set
gn(X) = Fh:) I,
I C = "
dn
IIm sup- '
bn
Assume that the functions gh g2, are continuously convergent at 8

.
to a number 'Y. Furthermore, suppose that

< ex:>
Var Yn = u2 and sup tS'1 Yn - Fn(8)12+6
"
for some a > O. Let t1 be arbitrary, and let
In+1 = In + an(/l>" ', tn)[ Yn - Fn(/n)] (n = 1,2", .),

LARGE-SAMPLE VARIANCES FOR PARTICULAR GAINS 49
where {a,,} is one of the following gains:
Gain a,,(xb,XJ
Al sgnF" ;:1
F,,(xJ
2 AI
F"I(X,J
"

"-1
F1I(60)
3 Aa n (60 EJ).
kL
Fk2(60)
=l
The A's are suitably chosen constants. Defi ne the function

x2
Vex) =-- H < x < co).
2x - 1
Then, as 11 -+ 00,
with corresponding variances given by
Gain QI
V(A1,,), provided Al > t,

2 V(A2), provided A2 > c2j2,
3 ( :a),
V Aa provided Aa > c2j2, and gn(60) -+ Yo
as n -+ co.
In every case, the same limiting distribution obtains if, in the norming
sequence, Fi6) is replaced by Fk(tk) for k 1, 2" " , n. =
Proof To save space we set

bk2
bnk = B2 (k=I, 2, . . , n) .
n
n -+
This constitutes a normalized Toeplitz matrix because each column
tends to zero as co and the row sums are identically one. Thus,
n
limfn = f implies lim L bndk = f (4.20)
n 11 k=l
by the Toeplitz Lemma (Knopp, 1947, p. 75). This fact will be used
repeatedly.
To apply Theorem 4.1, we must first verify that the number L, defined
in its hypothesis, exceeds t for each of the three gains under considera
tion. For Gain I this is immediate because L is A1> and the latter is
presumed chosen larger than t. For gain 2 we have (and the same clearly
will be the case for Gain 3)
L
2
lim inf inf ,, la,,(2l(x)I = lim inf inf A2 "
g,,(x,,)
2 b"kgk2(Xk)
=
" xeJ(n) " " xeJ(n)
"=1
-
> "
A2
(d)2 - 2 2
A2 > !. >
lim sup 2 bn" "
C
" k-l b"

In the last line we have used the fact that
"
lim sup
"
2 b"d" lim supI"
"
(4.20a)
"=1
if 0 < inf"/" sup"l" < K < 00. Indeed, if we set/ = lim sup"I", there
corresponds, to any e > 0, a finite index no such that/" < /+ e for all
n > no. For such indices, we have
y"
" "0
= 2 b"d"
"=1
< K 2 b"" + /+ e.
"=1
The first term tends to zero as n 00; hence, there is an nl > no such
that it remains less than e for all n > nl' Thus, for all sufficiently
large n,
y" < /+ 2e,
from which the asserted conclusion follows because e was arbitrary.
The problem is to prove that Conditions 1 and 2 in the statement of
Theorem 4.1 are satisfied with values of (, JL) which yield the asserted
formulas for Q2. To do this we set
1 "
B 2 2 Pi(t,,)
"
S,,2 = = 2 b""g,,2(t,,),
" "=1 "=1
(4.21)
for x in J. Let y" and z" have the meanings respectively given in Equa
tions 4.1 and 4.11 as functions of t1> " ', t,,:
y" =
B2
T" la,,1 g,,(t,,) and Z" =
B,,41
b,,2 a"12
LARGE-SAMPLE VARIANCES FOR PARTICULAR GAINS 51
The first two_ columns of the following table are proportional to these
sequences for the listed gains.
Gain YnlA znlA2 NA p.IA2
1 gn(tn) 1 " 1
2 gn2(tn) gn2(tn)
1 (4.22)
s;::- Sn4 ,,2
3 gn(80)gn(tn) gn2(80) .!
n2(80) n4(80) Yo Y02
We now show, in each of the three cases, that
C(yn - ,\)2 = 0(1), (4.23a)
C(zn - p.)2 = 0(1), (4.23b)
as n -,)- 00 for the corresponding ('\, p.) given in the third and fourth
columns.
First of all, however, we note that each of the asserted ,\ values
exceeds t. Indeed, since
1 :$; lim inf gn(x) :$; lim sup gn(x) :$; c
n n
for alI x E J, any limiting values of gnex) must belong to the interval
[I, c]; in particular, yand Yo. Thus, in the case of Gain 3, '\ A3(ylyo) > =
(c 2/2)(Ilc)
= c/22:: 1- with equality only when c = 1, in which case we
say the problem is asymptotically linear.
With regard to Gain 1: The hypothesized continuous convergence of
the gn's at 8 to y immediately allows us to infer gn(tn) y from tn 8.
But the gn's are bounded, so gn(tn) -,)- " in mean square.
With regard to Gain 2: We consider the identity
According to Equations 4.7 and 4.21,
for all x in J; hence

By the same argument used for Gain 1, the third term goes to zero in
mean square. For the first, from Equation 4.21,
According to Theorem 2.3, Condition 3, tk 8 as k co. Thus, and -+

again by the continuous convergence assumption, the random variable
enclosed in square brackets tends a.s. to zero as k-+ co. Now the
implication of Equation 4.20 is valid when the f,o's and f are random
variables and "lim" is replaced by "a.s. lim" (but is not, incidentally,
when replaced by "P lim"). Hence, SrI2 - ,.2(8) 0 and, because the
variates are bounded, we have
o.
,.
lim tS'IS,.2 - ,.2(J)12 = (4.25)
Furthermore,
,.
lim ,.2(8) = ,,2. (4.26)
Equation 4.23a therefore follows after we square Equation 4.24 and

take expectations.
To establish Equation 4.23b, we note that Z,./A22 is I/S,.2 times y,./A2
( )
Thus,
.! _1_ y,. + _1_ _ .!
,
_
1
A22 S,.2 A2 S,.2
_
,,2 ,,2
=
so that, since" is also no smaller than unity,

-
1 :;2 :2 1 l n2 - 1 1
s
+ IS,.2 - ,,21
It follows from the results of the previous paragraph that this bound
goes to zero in mean square as n -+ co.
With regard to Gain 3: If we use the additional assumption that
{g,.} is convergent at the selected point (Jo, the same type of argument used
in the preceding paragraphs establishes Equations 4.23a and 4.23b for
the asserted .\ and JL in Equation 4.22.
We have thus verified all the (unassumed) hypotheses of Theorem 4.1.
In view of Equations 4.25 and 4.26, we have
SrI =
1
0 ,.
J Fk2(tk)
k=l
1+"
as n -+co, and the limit is a sure one when every t k is replaced by 8.

Hence, by the conclusion of Theorem 4.1, we have
OTHER GAINS 53
It remains to note that Q2 is precisely ,,2p./(2 'A - 1) when we substitute

the values of 'A and p. given in Equation 4.22. Q.E.D.
The foIIowing result clearly requires no independent proof.
THEOREM 4.2'
Let the hypotheses of Theorem 4.2 hold over an interval J = [eh e2],
with at least one of the end points finite, where the interval is so chosen
that 8 is an interior point. In addition, assume there is a finite integer
p 2 such that
with n. Also suppose that

sup C[Y" - F,,(8)]21> < 00.
"
If 11 is arbitrary and
1,,+1 = [I" + a,,(/h, )
"I [Y,, - F,,(t,,)]] (n = 1, 2" ,,),
where {a,,} is one of the three gains listed in Theorem 4.2, then the
conclusion of Theorem 4.2, under its provisos, holds for these truncated
estimates.
4.5 Other Gains

After examining the proof of Theorem 4.2 it is clear how to deduce
the asymptotic distribution of the estimates generated by any restricted
gain sequence which can be appropriately expressed in terms of the g,,' s
and, furthermore, we know what additional conditions (if any) should
be imposed in order to do so. Consider, for example,
b,,2 1 ... b" 1
. a" A sgn A 2 (4.27)
B,,2 F"Ce,,) = B,, g,,(g,,),
=
r"
where {e,,} is any sequence of random variables, taking values in J, which

converges in probability to a limit e as n 00. We first restrict A to
ensure L > t; in this case,
A > :.
2
Mean-square limits of the random variables y" and z" must next be
guaranteed. Here we have
g /,, A2
g,,,,(( e,,)) , z" = g,,2(e,,);
Y,, A
-
_
and therefore we require thatgh g2, ... be continuously converge nt at

the point g to, say, i'. The mean-square limits are then
i' A2
A - > -,
1
p. = - ,
i'z

i'
=
2
after using weak convergence and boundedness. The variance is thus
(4.28)
If gn is define d as the value ofx at which the infimum of I Fn(x) I is

assumed, that is, Fnan) = bn/ sgnFn, then i' 1, and Equation 4.27
=
be comes Gain 1. If we take gn tno then i'

= i' and the variance,
=
Equation 4.28, is the same as that for Gain 2, although the gains are
algebraically different. Finally, the same is true forgn = 80 andGain 3.
The fact that both Gain 2 and Gain 3 are easier to compute than
Equation4.27 is reflected in the strongerlimitationA > c2/2 .
4.6 Gain Comparison and Choice of Gain Constants
We should compare the estimates in Theorem 4.2 on the basis not

only of their relative asymptotic efficiencies but also the amount of labor
involved in calculating and using the corre sponding gain sequences.
We have numbered the gains independently of any such considerations,
but in the order of the increasing analytical restrictions imposed in the
Q2-table.
It is clear that there is no universal ordering of the costs Ch C2, and
C3 associate d with using the respective gains (if, indee d, such a numeri
cal value caneven be assigned), and that the problem must be treated in
the light of the particular application. However, some rather vague
general relationships can be cited. Thus, Gains 1 have the advantage
ove r Gains2 in that they can be computed before the data acquisition
and thereby decrease computation time. Such is also true of Gain 3,
which has the added advantage that it does not require locating the
derivative minima but, rather, just the ir calculation at the sele cted initial
guess. Thus, we might write C3 < C1 < C2 But unless the re is a
re cursive relation between successivebn's, we are faced with the problem
of storing the entireGain1 seque nce.Gain2, on the other hand, can be
inversely generated on line by means of the re cursion
Fn(tn) J.. + Fn+1(tn+1)
(n = 1 2 . . .)
an+l Fn+1(tn+l) an A2 ' "
which is to be initialized by l/a1 F1(t1)/Az The Gain 3 sequence is
=
computed in thesame fashion withtn replaced by 80 for alln 1, and

GAIN COMPARISON AND CHOICE OF GAIN CONSTANTS 55
A2 by A3 Thus, storage considerations suggest inverting the order to

C3 < C2 < C1 Still, there are problems in which the minimum value of
I Fn(x)I is taken on at one and the same end point of J for every n. (Such
is the case for the example worked out at the end of Chapter 5.) This
leads to a further change: C1 = C3 < C2
2.0 ,
\
v
1.5
i.-

\ '-... --
l..----
1.0
0.75 1.0 1.5 2.0
Figure 4.1 The stochastic approximation variance (unction defined in Theorem 4.2.
Turning to the question of relative statistical efficiencies, we note that

the variances for Gains 1, 2, a nd 3 (hereafter denoted by Q1 2, Q22, and
Q32) are functions of several variables via the function V, plotted in
Figure 4.1:
QI 2-= QI 2 (AI,,,),
For Gain 1, Al must be chosen in the open interval (t,00). For Gain 2,
A2 must be chosen in (c2/2,00), while for Gain 3, 00 can be chosen
by the experimenter (this determines Yo), and then A3 must be
chosen in (c2/2,00). For any particular choices of the A" it is not
hard to exhibit regressions such that each gain is, in turn,"optimal"
(has minimal Q2) for some value of the parameter O. Thus, the question
of" which gain to use" has no quick answer.
As a possible guideline for comparing the three types of gains for a
particular regression when 0 (hence y) is not known, we might adopt a
"minimax" criterion for choosing the AI and then compare the variance
56 AS YMPTOTIC DISTRIBUTION THEORY
functions as y varies over its domain. That is to say, we compare

and
as y varies, where the AI* are chosen from their respective constraint sets
to minimize
and
respectively. (For any particular regression problem c is known, and we

will assume that 80, hence Yo, has been determined by considerations of
"nominal" parameter values.) As we will see, only the first function
achieves its minimum on the constraint set.
2
We "minimize" Q2 by
for 1;5; c < V2,

(4. 29)
for c V2,
where the +0 indicates the lack of a minimum over A 2 > t. With

regard to Gains 1 and 3, maximization over 8 E J is the same as maxi
mization over all real numbers y E [1, c]. For the former we see that
max V(Ay) = max {V(A), V(Ac)} (A > t). (4.30)
1 :S7:SC
This is minimized by the value of A = ACc) which makes V(A) and

V(Ac) equal. The solution is simply
(l ;5; c < 00), (4.3 1)
which decreases monotonically from 1 to t. The situation for Gain 3 is a

bit more complicated. Now we seek to minimize Equation 4.30 with
respect to A, subject to the restriction A > c2/2yo t (A is A3/YO). The
solution is given by the right-hand side of Equation 4.31, provided that
Yo > c3/(c + 1), which cannot take place unless c is small enough so
that c3 < c(c + 1). This leads to a rather complicated formula for
Gain 3:
c3
if c3 < c(c + 1) and Yo >
c + l'
(4.32)
otherwise.
The values of the variance resulting from the choices of Equations 4.29,
4.31, and 4.32 are
GAIN COMPARISON AND CHOICE OF GAIN CONSTANTS 57
if 1 c < V2,
(4.33)
if c V2,
c3
c3 < c(c + 1)
+ l'
if and Yo >
c
otherwise.
where 1 :S ","0 :s c. We see that every Q,I 1 with equality when and
only when c ==1.
The same is true for the simpler choices
and (4.34)
which meet the provisos in all problems. The corresponding variances
are
(4.35)
It is interesting to note in Equation 4.33 that Qal < QI1 whenever

c v'2and
"0 == limg.(Bo) > a.s. limg.(tJ == ".

The same is true in Equation 4.35 for every c. Thus, a fortuitous choice
for 80 will make the estimates based on the more easily computed Gain 3
asymptotically more efficient than those based on Gain 2.
In the next chapter we limit our consideration to sequences gh g2 . . .
,
that converge uniformly on J to a continuous limit g. We then, at an
increased computational cost, iterate in a certain transformed parameter
space defined only by g and invert back to J at each step. The result, as
might already be anticipated, is that Q2 V(I) I for all three gains,
= =
because the transformation will be selected to force asymptotic linearity,

that is, c= 1.
Before doing this, however, we point out that our methods of proof
in this chapter (as was the case in Chapter 3) are readily adapted, after
appropriate reinterpretation of the symbols, to yield asymptotic
normality for a general class of stochastic approximation processes.
58 ASYMPTOTIC DISTRIBUTION TIIEOR Y
4.7 A GeaenI Stodulstic Approxim8tioD Theorem
THEOREM 4.3
For every real number x and positive integer n, let Z.(x) be an
observable random variable. Corresponding to a given sequence of
constants "It "lit' " recursively define

1.+1 = I. - u.z. (" = 1, 2" ,,; 11 arbitrary),

where z. is a random variable whose conditional distribution, given
11 tl. la ta,' . " I. - t., is the sameasthe distribution of Z.<tJ.
Let
= =
G.(x) - IZ.(x)
have a zero which converges to a finite number Bas" -+ 00:
G.(BJ - 0, lim B. - B

Furthermore, suppose that (x - BJG.(x) is one-signed (say positive) for

all
x :1= B We impose the following further conditions.
{:.(X
1. There exists b. > 0 and a number" such that
G.(x)
if x:l= B
,.(x) - - BJ
if x - B.
satisfies 1 g,,{x) Co < 00 for all x and all n > no. (We can
always redefine b" so that any strictly positive value ofinf,,> "0. x g,,(x)
is unity. )
2. g,,{x) is continuously convergent at 8 to y.
3. B,,2 = b12 + ... + b,,2-+00 with n and b,,4/B,,4 < 00.
4. 8" 8 + o(I/B,,).
sUPn.x O"I Z,,{x) - G,,{X)12+6 < 00 for some 8 > 0 and Var Z,,(x) is
=
5.
continuously convergent at 8 to a number a2
Then, if {exn} is any positive number sequence so chosen that
IIm- >
B"2
Y \
"2".
i

ex" = 1\
" b"
the random variables
(n = 1, 2, . . )
a
GENERAL STOCHASTIC APPROXIMATION THEOREM 59
have a large-sample normal distribution about 0 with variance

2
V(>.) = 2>' >.
_
1"
Toward Proof. In the case 8" =8 for all n (that is, a common root
of the mean-value functions) and Var Zn(x} 0-2 for all n and x, the
=
validity of the assertion is an almost immediate consequence of Theorem

4.1, after we identify gn(x) as defined here with I Fn(u")J!b,,, etc. (Compare
with Equations 3.14 through 3.I7.) But an examination of the proof of
Theorem 4.1 shows that the perturbation of any quantities which take
values in the parameter space by terms which go to zero faster than lIB"
has no effect on the limiting behavior of {I,,}. Moreover, continuous
convergence of bounded functions of such quantities at 8 yields the same
conclusion that results from assuming the limiting values to start with.
The conclusion of Theorem 4.3 is precisely the conclusion of Burk
holder's (1956) Theorem 3 when
<
e
1
b _ __ (0 !) (4.36)
" n%-
as n -+ 00. (This is not obvious until the symbols in the two statements
are properly related.) As already noted at the end of Chapter 3, our
Condition 3 is much less restrictive than Equation 4.36. Furthermore,
Burkholder assumes that all moments ofZ,,(x) - G,,(x) are finite, albeit
only throughout some neighborhood of 8. Condition 5, at least from
the point of view of application, is in most instances weaker. Indeed, the
distribution of the "noise", Zn(x) - G,,(x}, usually depends on x in a
rather trivial fashion and is often independent of the adjustable param
eter. On the other hand, high-order absolute moments are infinite in
some problems. Finally, Burkholder's assumption that
Gn I
sup I (x) <
00 '
n.x 1 + Ixl
for every I) > 0 is weaker than our Condition 1, provided that

lim sup" b" < 00. But, as already pointed out, we make no such
limitation as Equation 4.36.
5. Asymptotic Efficiency
The third gain considered in Theorem 4.2,

(3)
Fn(80)
an - n
_
2: Fk2(80)
k=1
is appropriate in many applications. As we have noted, it is computa
tionally cheaper than the .. adaptive" second gain, and it can lead to
estimates that are more efficient in large samples. However, the existence
of a stable limiting distribution for these estimates should not depend on
the value of our initial guess, 11 80 Hence, the Gain-3 proviso (that
=
{gn(x)} converge at the particular point x 80) ought to be replaced by

=
the assumption that the sequence possess a limit, say g(x), at every
x inJ. If, in addition, we require that this convergence be uniform on J
and that the limit function be continuous, there will be continuous
convergence at every point of J (in particular, at 8, as also required in
Theorem 4.2). Indeed, if , is arbitrary in J and {xn} is any sequence
tending to " then
Ign(xn) - gW I sup Ign(x) - g(x)1

xel
+ Ig(xn) - gWI
tends to zero as n -+ 00. Therefore, throughout Chapter 4 we might just

as well have hypothesized uniform convergence on J of gl> g2, .. . to a
continuous limit function g. The latter assumption will be a consequence
of the former when J is bounded, provided that each member of the
60
INCREASED EFFICIENCY VIA TRANSFORMATION 61
sequence is continuous. In t he conclusion of T heorem 4.2, we now

identify y and Yo with g evaluated at (J and (Jo, respectively.
5.1 Asymptotic Linearity
Having thus hypothesized t he existence and knowledge of a limit

function, we can now construct recursive estimation schemes t hat, as
will su bsequently be s hown, become asymptotically efficient when (and
only when) the o bservational errors are Gaussian. We demonstrate t his
for the case of a bounded intervalJ, w hich is henceforth identified with
the prior knowledge space of t he parameter. From t he applied point of
view, this does not constitute a significant restriction.
By way of introduction, we note t hat our new assumptions allow us to
write
(5.1)
as n -+ 00 for all x in J (because t he interchange of t he integration and

limit operations is permissible). T he regression functions are t herefore
asymptotically linear in the values of the integral. It is reasonable,
therefore, to estimate recursively t he parameter value
cp f8 geg) d
J l
=
and invert back at each step to obtain the (J-estimate. This is, in fact, t he
method analyzed in the following t heorem.
In some rather simple problems, Equation 5.1 is an equality for every
n (and the major portion of the proof of Theorem 5.1 is o bviated). For
instance, if Fn(J)= kn(J3, and J is any finite interval that docs not include
the origin, then gn{x) =
(X/1)2 for all 11. In such a situation, we would
estimate (J3 by linear least squares and t hen take the cube root.
5.2 Increased Efficiency via Transformation of the Parameter Space
THEOREM 5.1
Let Assumptions AI', A2, A3, A4, and A5"' hold, w here J = [h 2]
is any finite interval containing (J as an interior point. For n I, let
62 ASYMPTOTIC EFFICIENCY
be continuous at everyx in J, and suppose that
SU)) Ig.(x) - g(x) 1 0

1
as n 00, In addition, suppose that

Var Y. - as and sup I[Y. - F.(I)]IP < 00,

where p 2 is the smallest integer (assumed to exist) for which

lim B.P-1b. - +00,

For x in J, define
which takes values in J* - [0, ta>], and let 'I" - cz,-I be the inverse
function (which exists because g is positive and bounded), For y in J*,
define
F.*(y) - FII('I"(Y,
b.* - inf I I .*(Y)I,
".1'
where the dot means differentiation with respect to y. Let tl* be

arbitrary, and
tf + I - [t.* + Q.*[ Y. - FII*(t.*)]Jr f.) (n - 1,2, .. ),
where a.* is anyone of the quantities
-h bll* I.*(tll*)
Sprll*
B*1'

k-I
Ik*I(tk*)
and CPo is an arbitrarilyselected point in J*. Finally, let

t. - 'I"(t.*) (n - 1,2,).
Then, as n 00, we have
Jk-I}: Ikl(l) (til - I)/a ,.., N(O, I),
and t he same holds true w hen every Fk(8) is replaced by Fk(tk) in t he

norming sequence.
Proof. Letting cp <1> ( 8) denote the unknown parameter in t he

=
transformed space, we have, by definition, t' Yn Fn ( 8) Fn * (cp). T he

= =
proof of t he t heorem falls into two parts. We first s how t hat t he starred
INCREASED EFFICIENCY VIA TRANSFORMATION 63
problem is asymptotically a linear one. Since each of the starred gains

has a gain constant purposely chosen to be unity, the Q2-table of
Theorem 4.2 with C = 1 will then give (via Theorem 4.2')
JJ/'k*2(cp)(tn* - cp)/u,.., N(O, 1 ) (5.2)
in all three cases. The second part of the proof will yield the desired
conclusion by the "delta method."
The initial step, then, is to show that our assumption that {Fn} obeys
Assumptions AI' through AS'" on J implies that {Fn *} does on J *. The
basic relation for doing this is
('l"(yd'l"(y) F,,('l"(y)
* (y)
t. - t.
=
" - n dy d!l>(x)
... g,,('Y(y)
dx x='P(II) I
= s n r"b" 'F . (5.3)
g( (y
We immediately see that the sign of Fn*(y) over y E J* is constant and

the same as that of Fn(x) over x E J. Furthermore, we have
bn infgn(x),
gn(x).
bn * = dn* = bn sup
J g (x) J g (x)
Since the range of the limit function cannot be larger than that shared
by every member of the sequence, Equation 4.7 yields
bn
bn* dn* bnco (n = 1 ,2" . , ) .
Co
-,
Thus, not only are Assumptions AI' through A5'" satisfied by the starred
infimums and supremums, but also
lim Bn*I>-2bn* = + 00
"
for the same hypothesized integer p,

We use the uniform convergence to show that
dn*
c * = I,1m sup -
1
.. b..*
= ,
The ratio
[gn(x)/g(x)]
d..*
s p
=
b..* inf [g..(x)/g(x)]
J
actually converges because both n umerator and denominator tend to

unity. Indeed, we have
/ sup gg(x)
J
n(x)
1 \ :::; sup l gn - 11:::; sup Ign(x) - g(x )1 = 0(1)
_
J
(X)
g(x) 1
and, similarly, infl gn(x)/g(x) 1 as n 00. It is a fortiori true that
gn*(Y) IFb*)1 1 = g*(y)

n
=
(uniformly) on J*. In other words, in the Q2-table of Theorem 4.2, we

are to read c )' )'0 = 1 . This establishes Equation 5.2.
= =
To obtain the limiting distri bution of the estimates tn of 8 'Y(9'), =
we expand in the usual fashion:
'Y(t"*) = 'Y(9') + 'Y(9')(I"* - 9') + &(vn) - 'Y(9') ](t"* - 9') ,

where v" is some random point s uch that IVn - 9'1 < It"* - 9'1. The
derivative
. 1
'Y(y) = g('Y(y
is continuous and nonzero at every yin J*. From Equation 5.2, we see
p p .
that In* - 9'; hence 'Y(t'n) 'Y(9'). Thus, after we multiply through by

.
the appropriate norming sequence and use Equation 5.2 as written, it
follows that
Jk=li Fk*2(9') (t" - 8) = 'Y(9') Jk=li Fk*2(9') (t,,* - 9') + 0,,(1) 0,,(1 )
,.., N(O, 'Y2 (9') a2).
But according to the leading equality in Equation 5.3,
Fk*(9') = Fk (8)'Y(9'),
so that
(5.4)
which is the asserted distribution.

Now, from Theorem 2.3, Condition 3, we know that t"* 9' and,
hence, t" 8 by continuity of 'Y. It follows by precisely the same
argument used with regard to Gain 2 in the proof of Theorem 4.2 that
ASYMPTOTIC EFFICIENCY AND SUMMARY THEOREM 65
in the notation of Equation 4.21. This combines with Equation 5.4 to

prove the final stateme nt. O.E.D.
The appropriately normalized deviations tn - 8 of Theorem 5.1 have
a large sample N(O, I) distribution for any of the listed gains, none of
which contain undetermined constants (which is as it should be). The
result is true without variance dependence on unknowns or quantitative
restriction on the limit function g which occurred in the conclusion of
Theorem 4.2. The computation of these transformed estimates is clearly
more time-consuming, but this is the price we must pay for the improve
ment in variance. Since all three gains yield the same limiting distribu
tion, the computationally cheapest third one will usually be used.
5.3 Asymptotic Efficiency and Summary Theorem
The question naturally arises as to how these estimates (or, as a matter

of fact, any of our estimates) compare statist ically with the still more
computationally expensive method of Maximum Likelihood (abbreviated
ML). A good deal of the following discussion pertaining to this topic is
standard material; i t i s included for the sake of compl eteness.
We are going to assume that the observational errors Wn = Y" - F,,(8)
are not only independent but identically distributed as some random
variable W possessing a probability density function; that i s,
P{W" ::;; w} = P{W::;; w} = Ico f(x) dx (n = 1, 2" .. ) .
The minimal assumptions on J, of course, ar e
CW = o,
( Certain higher-order moments are presumed finite when we consider
our methods of estimation.) We further suppose that
hew) =
_ dl o:!(w)
ex ists (on all set s wi t h po sit iv e prob ability) and that
Ch(W) = 0, (5.5)
A sufficient condition for the former i s that W be symmetri cally distrib

uted about the origin. The latter, it will be noted, excludes constant
densities. We are also going to assume that f i s independent of 8
although, of ourse, this need not be the case for the vali dit y of any of
our results.
The likelihood of a realization Y1 = Yl> ... , Yn = Yn, when 8 is the

true parameter value, is simply
8 En.
Here n is an interval, finite or infinite, denoting the "natural" domain

of the parameter, whereasJ was a (not necessarily proper) subinterval of
n defined by a priori knowledge. We have
(5.6)
which is a linear combination of independent, identically distributed

random variables with coefficients depending on the unknown param
eter. In view of our restriction, Equation 5.5, it follows that
(5.7)
where A = A, is independent of 8.
Now let In = In( Yl> ... , Yn) denote a 8-estimate based on the first n
o bservations (rather than Il - 1 as previously). Under regularity condi
tions, the celebrated Cramer-Rao inequality states that
8) 2 > b n2(8)+ {
I + [dbn(8)/d8]}2
(In (5.8)
In2(8) '
_
-
where bn(8) is here the estimate's bias. The usual form in which the
regularity conditions are written is (see, for example, Hodges and
Lehmann, 1951) as follows:
i. n is open.
ii. a log Ln/o8 exists for all 8 En and almost all points
Y = (Yl>, Yn).
iii. co(a log Ln/(8) 2 > 0 for all 8 En.
iv. f Ln dy = 1 and f (tn - 8)Ln dy = b n( 8) may be differentiated
under the (multiple) integral signs.
Our Equation 5.5 ensures Conditions ii and iii, and Condition iv holds
because/does not depend on 8. We note that Conditions ii and iv imply
I!e a log Ln/88 = O.
The ratio of the right-hand side of Equation 5.8 to the left-hand side
is called the (fixed sample size) efficiency of In when 8 is the true
parameter point in n. As is known, a necessary and sufficient condition
for an estimate In to be such that this ratio is unity for all 8 E n is that In
be a sufficient estimate (a statement concerning F1, , F" and f) and

that a log gn/e8 = K,,(8)(t - 8), where g" is the density of t". The right
hand side of Equation 5.8 is only a lower bound on the mean-square
estimation error; there exist problems where the uniform minimum
variance of regular unbiased estimates exceeds 1/1,,2(8) at every 8.
Let us restrict our attention to Consistent Asymptotically Normal
(a bbreviated CAN) estimates of the value of 8 specifying {F,,(8)}, that is,
those for which
,,(8)8
t -
,..., N(O, 1)
as n -+ 00, where 0',,(8) is some positive sequence approaching zero with

increasing sample size for any 8 E n. We assume that
(5.9)
exists (possibly as +00). Here "10 is called the asymptotic efficiency of{t,,}
when 8 is the true parameter value. IfVar t" 0',,2(8) and db,,(8)/d8-+ 0
as n -+ 00 for all 8 E n, then it follows from Equation 5.8 that "10 1 for
all 8 E n.
If a CAN estimate is such that "18 1 for all 8 E n, it is called
=
a Consistent Asymptotically Normal Efficient (abbreviated CANE)

estimate. This definition is made without restrictions entailing "10 1.
CANE estimates sometimes do fail to have minimum asymptotic
variance within the class of CA N estimates because the class is too broad
to permit such a minimum. Le Cam (1953), for example, has shown how
to construct a set of superefficient estimates, that is, "10 > I for some
8 E n, from a given CANE estimate whose asymptotic variance obeys
certain mild conditions. The basic idea is to define the new estimate in
such a way that its bias goes to zero as n -+ 00 for all 8 in n, but its
derivative approaches a strictly negative number at isolated points in n.
In other words, the lower bound in Equation 5.8 is attained and forced
to be asymptotically smaller than 1//,,2(8) at some parameter values.
The saving feature is that a parameter set of superefficiency must have
Lebesque measure zero.
With these remarks as introductory material, we now compute the
asymptotic efficiencies of the estimates which were the subject matter of
Theorems 4.2' and 5.1. In accordance with our initial discussion in this
chapter concerning restrictions on gh g , . . . , we impose the hypotheses
2
of the latter theorem. We take this opportunity to write out in full these
hypotheses (getting rid of some implications among them) and the
results concerning the two types of estimation procedures in the case of
a bounded prior knowledge interval J.
THEOREM 5.2
Let Yn (n 1,2,) be an observable sequence of independent

=
random variables with common variance u2. Let t9' Yn = Fn(8) be pre
scribed up to a parameter value 8 which is known to be an interior point
of a finite interval J = [elf e ]. We impose the following conditions:
2
1. The derivative Fn exists and is one-signed on J for each n.
2. Bn2 = L=l b1<2 -+ 00 as n-+ 00, where bl< inf",el IFIx)l. =
3. "i:.bn4/Bft4 < 00.

4. Each gn(x) = IFn(x)l/bn is continuous on J, and the sequence
converges uniformly to a limit function g(x). We set
C = SUP",el g(x), 1 :::;; C < 00.
5. supn C[Yn - Fn(8)]2P < 00, where p 2 is the smallest integer

(assumed to exist) for which limn BnP-2bn = +00.
Let t1(1) = 80{l) be fixed arbitrarily in J, and let
(n = 1,2,),
where
and the A's are positive constants. Then, as n -+ 00,
where
Q12 = V(Alg(8 provided that Al > t,
Q 22 V(A ) provided that A2 > c2/ 2,
2
(A3 gf8j )
=
Q32 = V provided that A3 > c2/ 2,
and Vex) x2 /(2 x - 1). The same limiting distribution obtains if

=
Fk(8) is replaced by Fk(tl<(I in the norming sequence.

For x EJ, define
<l>(x) r'" gee) de,

J l
=
and let 'Y = <l> -1 be the inverse function. For y in J* = [0, <l> ( )]'
2
define
F"*(y) = F,,('Y(Y.
Let 11* = be arbitraryin J*, and let
[ ]
CPo
'(2)
t:+1 = /,,* + "F,,*(f{!o) [Y" _
F"*(/,,*)]
L
k=l
Fk*2(f{!o)
0
t. = 'Y(t,,*) (n = 1,2, . . ).
Then, as 11-+ co,
j11-I1 1,,1(1) (t. - I)/a - N(O, 1),

and the same holds true if II) is replaced by Fk(tk) in the norming
sequence.
We further now assume that the Y" - F,, ( 8) have a common probability
destiny function!, which does not depend on 8, and derivative/', such
that
J:a/'(W) dw = 0 and J:,./;) dw = A2 > O.
Then the asymptotic efficiency of {/,,(I)} is
(i = 1,2, 3),
and the asymptotic efficiency of {I,,} is
with equality when and only when/is the N(O, 0'2) density.
Proof. From the uniform convergence of g" to g on J, it follows that
d
b" = sup g,,(x) sup g(x)
" I I
as n 00. In other words, the numbers c defined in Theorem 4.2 and

a bove in Condition 4 are indeed equal.
The formulas for the various asymptotic efficiencies immediately
result from Equations 5.5, 5.7, 5.9, and the preceding expressions for the
corresponding 0',,2 (8) s. '
The only statement requiring verification, then, is the final one. We

may obviously assume ,\2 is finite. Then, by the Schwarz Inequality,
(72,\2 =
fOl w2f(w)dw fOlj;) dw IfOl Wf'(W)dWr (5.1 0)
Since w2f(w) is integrable, we must have limlwl .... Ol If(w) = 0, an d
Ol
therefore (assumingf is absolutely continuous),
f _ 00 wf'(w)dw =
"'few)
1+00Ol foo
_ - _ 00 few) dw = -1 .
This proves that (72,\2 1 . The necessary and sufficient condition for
equality in Equation 5. 1 0, that is, (72,\2 1 , is that the integrands w(w)
=
and f'2(W)If(W) be linearly related:

f'(w) = Kwf(w) ( -00 < w < (0) .
This differential equation has a unique probability-density solution with

first and second moments respectively equal to 0 and (72, It is
1 e- w2 /2 a2
few) = ,
V21T(7
as cla ime d. Q.E.D.
The transformed estimates {In} in Theorem 5.2 are thus asymptotically
efficient if and only if the errors of observation are Gaussian (and the
untransformed if, in addition, the problem is asymptotically linear).
Naturally the question arises as to how close 'TJ l/a2,\2 'TJ! is to unity = =
when f i s "approximately" Gaussian.

Let us make this quantitative with Student's I-distribution having v
degrees of freedom:
(-2-1)
V +
)
1_ r
( W
2 -<V+l)/2
(2 ) 1 ( -00 < (0 ) .
__
v
few) - j- + < w (5. 1 1 )
VV1T r
_
To meet Condition 5, we require that v > 2p, where p 2 is an integer

depending on the (unrelated) regression sequence. By symmetry, all odd
ordered moments are zero; in particular, CW = Ch(W) O. The =
variance is (72 = v/(v - 2) and, as v-+- 00, Equation 5.11 approaches the
N(O, 1) density function. After a somewhat lengthy but straightforward
calculation, we obtain the formula
(v - 2)2(V+ 3)
'TJ = 2(v + l)(v+2}'
For large v, TJ = 1
-
(4/v)+ O(l/v2). As the following table shows, the
approach is not too rapid.
Degrees of Freedom Asymptotic Efficiency of

in Student'!! nistrihlltion Transformed Estimates
, 0.34
6 0.43
7 0.'0
8 0."
10 0.63
U 0.7'
20 0.81
2' 0.84
0.89
100 0.94
400 0.99
For the Laplacian density
few) --!:- e-.t2Iwl/a

=
( -co < W < co ) (5.12)
V2u
all moments exist, so Condition 5 holds in every case. The odd-ordered
moments are 0, and u2 is the variance. We find that
,\2 = Iao [d10dwgf(W)] 2 few) dw

-00
=
u,
and hence
TJ = -l .
It is to be noted for this density that

- ao d210g{(w) few) dw
tfh'(W) =
I-ao dw
=
0.
Consequently, by Equations 5.6 and 5.7,

tf 82 log Ln =/:
CB
(8 log80 Ln) 2
802
_
B
(5.13)
because the left-hand side is identically zero. But the assumption of

eq uality in Equation 5.13 for all 0 En is fundamental to the proof that
there is a solution of 8 log Ln/fJO = 0, that is, a ML estimate, which is a
CA N E estimate. (The function h() is a step function in this case.)
Hence, the low asymptotic efficiency of our transformed estimate is only
apparent.
5.4 Increased Efficiency
If the (non-Gaussian) noise density lis prescribed and such that

Ch'(W) = Ch2(W),
the asymptotic efficiency of any of our methods can be improved when
ever there is a CANE solution, On, which makes Equation 5.6 vanish.
(This generally imposes further conditions on the regression functions as
wel l as onf. See, for example, LeCam, 1953.) lf we now let In denote any
of the recursive estimates listed in Theorem 5.2, the quantities
n
.. I
tk(tn)h( Yk - Fk(tk
k l
On = In+ ).2 ::..:=-
:..:: --:
- n:--
- ----
2: tk2(tk)
k=l
will then have the same optimum large-sample statistical properties as
the ML estimates. The fact that these require infinite memory, via the
quantities tk(tn) (k = 1, 2" ", n) , violates our ground rule that we
restrict consideration to computationally feasible estimation schemes.
5.5 Large-Sample Confidence Intervals
Under the conditions of Theorem 5.2, we have shown, for the trans
formed estimates {In}, that
I X
lim P{Sn(tn - 0)
n
< O'x} = . /_
y 27T I
-co
exp (-!e) dg
for every real number x. The norming sequence
(5.14)
is free of unknowns. The value of 0'2 = 0'92 can be consistently estimated,

without further assumptions, by
sn2 = ! [Yk - Fk(tk)]2,

n k=l
and we can therefore set up large-sample confidence intervals for O.
Specifically, given a (small) number a, let Ca be such that 100(1 - a)
percent of the area under the standardized normal curve lies between
- Ca and Ca. Then
Sn
tn Ca S (5.15)
n
CHOICE OF INDEXING SEQUENCE 73
is an asymptotic confidence interval on 8, having confidence coefficient

1 - ct. That is to say, the probability that this interval will cover the
I
unknown 8 tends to - ct as n -+ 00. For any (large) fixed n, there is, of
course, positive probability that the interval in Equation 5.15 will not be
a subset of J. In practice, the confidence interval will be taken as the
intersection of the two.
5.6 Choice of Indexing Sequence
In some problems the integer-valued indexing of the regression

functions results from prior selection of a set of discrete values of a
continuous index (usually denoting "where" or "when" we sample).
Suppose, then, we start with a prescribed function F(T, 8) of two con
tinuous variables. For a particular choice TI> T2, we obtain a sequence

of mean values
(8 EJ;k = 1,2" ,,)
known up to the parameter 8. Generally, there will be no unique

sequence Tl, T2, for which the resulting {FkC')} satisfies Conditions I

through 5 of Theorem 5.2. The most o bvious criterion for choosing such
a sequence is maximization of each of the summands in Equation 5.14.
That is, we define Tk T(lk) by =
(5. 16)
where Ik is determined, in accordance with one of our iterative schemes,

by 11>" ', Ik-1 and T1>" ', Tk-l' The range of T values (it may be vector
valued) over which the maximum is taken will usually be restricted by
further considerations.
A (trivial) example would be
F( T, 8) = (T - 8)3 and J = [0,1].
The squared derivative with respect to 8 is maximized over T EJ by
T(8) = { if 0 8 =::; t,
<
if t =::; 8 < 1.
There is, then, a single regression function
F(x) =
{(I - X)3 ' o =::; x =::; t,
x3, t =::; x =::; 1,
with b1l = i for all n.
5.7 A Single-Parameter Estimation Problem
Example 5.1. The following is a single-parameter specialization of a

multiparameter problem. We seek to estimate recursively, from range
measurements corrupted by additive noise, the true initial range to a
body traveling in an inverse square law central force field with total
system energy E 0, that is, along a parabolic path. Solution of the
=
more realistic mu ltiparameter estimation problem for an el liptic path

(E < 0) is worked out as Example 8.S in Chapter 8.
Figure 5.1 Trajectory of object
r __________-L-- 0
The polar coordinates (r, cp) of the parabola with focus at the origin
shown in Figure 5. 1 are related by
2a '
= (5.17)
r 1 +cos cp
where the angle cp is measured as indicated. If a force f -k/r2 is

=
exerted by the origin on the point P, with (reduced) mass 111, then
[2
a = 2111k'
wherein
I = mr2 ; = const (5.18)
is the conserved angular momentum. The motion is thus determined by

the values of three parameters: 111, k, and I (plus an ini tia l time).
We select
8= a
as the one to be estimated and presume the others given. We assume that
at time t = 0 the coordinates of P are (a, 0), that is, that the turning
angle, which orients the axis of the parabola to the observational co
ordinate system, is also known. Integration of Equation 5.18 with r
gi ven by Equation 5.17 then yields the cubic equation
SINGLE-PARAMETER ESTIMATION PROBLEM 75
z+tZ3
t
K 2 (5.19)
8
=
for
z = tan 111'. (5.20)
There is a single positive root, namely.
(5.21)
(t 0).
ByEquations 5.17 and 5.20 the regression at time tis thus
, = 8(1 + zI) = F(t. 8). (5.22)

wherein z depends nonlinearly on t/82 in accordance with Equation 5.21.
In the following we introduce a sequence of regressions Fk(8) =
F(Tk' 8) by selecting appropriate observation times 0 < 'T1 < 'T2 < " . ,
but for the time being we can continue to work with the continuous time
variable. Furthermore, rather than introduce more symbols, we use 8 as
the dummy variable, where
o < e1 < 8 < e2 < 00,

and the end points are given.
Letting a dot denote differentiation with respect to the parameter (and
not, as is customary, with respect to time), we have from Equation 5.22
r 8z
r(t, 8) = 1+Z2+28z
88
But from Equation 5.19 we have
8z t
( Z 2)
88 I + -2K 3'
8
=
so that
t z
rlr(t, 8) 1 +Z2 - 4K - . --.
82 1+z2
=
Returning to z, by using Equation 5.19 once again, we find that
( x
2
+ 6x - 3
t(t, 8) - H (Z2) , H x) (5.23)
3(1+x) ,
= =
This expression, together with Equation 5.21, is the basis for all further
considerations.
The quadratic numerator in H vanishes at x -3 - 20

= < 0 and
-3 + 20 > O. Since H(O) -1, it follows that =
H(x) > 0 for a ll x> xo = -3 + 2V3 = 0.464. (5.24)

In addition, H(x) increases monotonically with Xo. x>
Now, the solution z
tan 19' increases with for every fixed that
= t fJ,
is, every fixed path. But z Z(t/fJ2), z
so must decrease with increasing
= fJ
t.
for every fixed In particular,
Z(t/fJ2)> Z(t/'22).
If we now define 10 as the positive time at which
Z2(tO/[22) = Xo, (5.25)
we w ill then have
Z2(I/fJ2)> xO'
Consequently, by Equations 5.23 and 5.24, we obtain
1'(/, fJ) < 0 for all 1 > 10 and all fJ E J.
Thus, Condition 1 of Theorem 5.2 will be met if we begin our observa
tions at any time after 10, which is defined by Equation 5.25 in terms of
'2 and K.
We next examine the behavior of I C/, as 00 for 8 in J. From
F fJ)1 t
Equation 5.21, we see that
, z(l/fJ2) -3K2 * '
I1m - %-
t ... '" ( fJ )
t
=
and, from Equation 5.23,

H(x) ! .
"'lim
... '" x =
3
Consequently,
11(1, 8)1
1
3 82
3K *
( ) t* A(fJ)t* = (5.26)
as I fX). Furthermore, with 1 > 10, we obtain
b(1) = !! 11(1,1)1 ! H (Z2 (:2)) = = H

(Z2 (,:2))'
and it is easyto see that
(t
g , 8) IFCt, fJ)1 A(fJ) g(fJ)
=
bet) A('2) =
SINGLE-PARAMETER ESTIMATION PROBLEM 77
uniformly on J as t-+ 00. Thus, the functions gk(8) = g(Tk' 8)

(k 1, 2, ) will satisfy Condition 4 for any sequence
=
increasing to infinity.
There are many such sequences for which Conditions 2 and 3 are met.
For instance, we can take slowly increasing times such as
(any > 0, k = 1,2,).
Then, by Equation 5.26, we will have for n -+ 00
n n
P2(Tk' 8) A2(8) loga k. (5.27)
k-l k=l
According to Sacks' ( 1958) Lemma 4,
n
toga k n loga n. (5.28)
k=l
Thus, as n -+ 00,
bn2 '" log n, Bn2 '" n toga n,
and both Conditions 2 and 3 hold. In addition, Condition 5 is true if the
additive noise in the range measurements has finite fourth-order
moments.
PART II
THE VECTOR-PARAMETER CASE

6. Mean-Square and Probability-One
Convergence
In this chapter we turn our attention to the more realistic situation

where the regression function is known except for a number of scalar
parameters; or equivalently, if there are p > 1 unknowns, a p-dimen
sional vector parameter. We will study the quadratic-mean and almost
sure convergence, in that order, of (column) vector estimator sequences
of the form
tn+l tn an[Yn - Fn(tn)]
= + (n= 1,2" ,,; tl arbitrary). (6.1 )
The scalar observable is
Yn Fn(6) = + Wn,
where {Fn()} is known, 0 is the p-dimensional vector to be estimated,
Wn is the residual error, and {an} is a suitably chosen sequence of p
dimensional gain vectors. Owing to its considerable complexity, the
question of a large-sample distribution for the vector estimates is not
examined in this monograph. However, the technique of analysis used
in Chapters 3 and 4 would be a logical starting point if we were to
consider this problem.
Our approach to the vector-parameter case is patterned after the one
dimensional treatment found in Chapter 2. We linearize the regression,
assuming the existence of first partials, by invoking the mean-value
Fn(tn) Fn(6) Fn'(un)(tn - 6),

theorem:
= +
81
82 MEAN-SQUARE AND PROBABILITY-ONE CONVERGENCE
where
u" lies on the line segment joining tIl and e,
F,,(u,, is the gradient of F" evaluated at and
F,,'(u,,)) is the (row vector) transpose of the (column vector) F,,(u,,)_
U",
From Equation 6.1 we then have

tn+l - e anFn'(un)] (tn - e) a"W",
= [I - +
(6.2)
x n
" "
1
where is the p p identity matrix. Iterating this back to 1 gives =
TI [I - ajF/(uj)](tl - e) TI+l [I - a,F/(u,)]ajWJ,

"
tn+l - e J-l =
I=J +
1*1
(6.3)
where, both now and later, T17 =m A, means the matrix product
A"An-1 Am (i.e., the product is to be read "backward").

ties of {tn - e} are crucially dependent on the asymptotic (large )

It is clear from Equation 6.3 that the large-sample statistical proper
n
properties of iterated transformations of the type
"
P" = TI (I -
1=1
ajb/), (6.4)
al
where and b/ are p-dimensional column and row vectors, respectively.
{ aJ
We begin by studying conditions on deterministic sequences of p-vectors,
and {hj}, which are sufficient to guarantee that P" converges to zero
(that is, the null matrix) as n -+ 00.
In the one-dimensional case, this problem is trivial: P" converges to
zero if the positive a/I/s are such that 2, a/I, 00 and a/lj > 1 only
=
finitely often. (This was so because of the scalar inequality 1 - x :::; e-x. )
In higher dimensions, life is not so simple, and we must think in
terms of matrix eigenvalues. In what follows, we make use of the
following statement.
Definition of norm. For a square symmetric matrix P, let AmiD (P)

and Amax (P), respectively, denote the smallest and largest eigenvalues
of P (all of which are real). For any rectangular matrix P, we use as its
norm
where P' is the transpose of P.

As so defined, "11 is bona fide norm since we are concerned only with
real matrices. If P is a p x 1 matrix (i.e., a column), then IIPII ( P'P)Y., =
the familiar Euclidean length. If P is a 1 x p matrix (i.e., a row), then

THEOREM CONCERNING DIVERGENT PRODUCTS 83
P 'P is a p x p matrix. It has but one nonzero eigenvalue; namely,

PP ' = IIP I12, with, incidentally, an associated right-sided eigenvector P'.
Finally, if P is of the form ah', where a and h are column vectors, then
P 'P = h(a 'a)h' = lIa Il2hh '; therefore, we have
IIPII2 = lIa ll2Amax (hh' ) =
lIa ll2 llhll2.
As is generally the case with matrix norms,
IIPQII ::; IIPII IIQII,
liP + QII ::; IIPII + IIQII,
provided that P and Q are such that the operations are defined.
It is evident that a sequence of matrices {P,,} converges to the matrix
of zeros as n-+oo if and only if IIP"II-+O. In the particular case where
P" is given by Equation 6.4, we are tempted to make use of the (correct)
inequality
..
IIP"II ::; 11 III - a jh / II

i-1
in order to find conditions on the vector sequences {aj} and {hj} which
will ensure liP,,11 -+ O. This approach proves to be fruitless. In fact, it can
be shown that
111- ah' lI 1
for any "elementary matrix" of the form I - ah', where a and hare
column vectors, so the above-cited inequality tells us nothing about the
convergence of liP"II.
The successful approach involves grouping successive products
together and exploring an inequality of the form
where lie is a set of consecutive integers. This idea is the basis of the
following theorem.
6.1 Theorem Concerning Divergence to Zero of Products of Elementary

Matrices and Assumptions (81 Through BS)
THEOREM 6.1
Let {a j} and {h j} be a sequence of p-vectors and define
"
P" = 11 (I - ajh/ )
i=1
for all n 1. Then we have

0
lim II Pnl1
n
=
if the following assumptions hold:

B1. /lan/l /lbn/l 0 as n 00.
B2. I /lan/l /lbn/l = 00.
n
B3. There exists a sequence of integers 1 =
-
VI < V2 < V3 such
that, with Pk Vk+l
= Vb we have
(k = 1, 2, )
and
11m 1
. f-
10
k Pk
\
"min
( ""
Jell<
bjb/
IIbJ 112
) 2 0
L. T > ,
=
where both now and later Jk is the index set

{Vb Vk + I, , Vk+l - I}.
6.2 Discussion of Assumptions and Proof

Before embarking on the proof of this fundamental result, let us try
to give some insight into the meaning of the assumptions. The first,
second, and fourth are assumptions concerning the rate of decay of the
product /lan/l /lbn/l. The first two are particularly in the spirit of their
one-dimensional analogues.
Assumption B3 has the following interpretation. For any set of P
dimensional unit vectors UIo U2, , Uro we have

Now I Cu/x) I is precisely the distance from UJ to the hyperplane through

the origin whose unit normal is x; therefore, 2/=1 CU/X)2 d2(x) is the =
sum of the squared distances from the u/s to that hyperplane. Since
d2Cx) is continuous in x, it actually achieves its minimum on the (com
pact) surface of the unit sphere. Thus, the value of
D ISCUSSION OF ASSUMPTIONS AND PROOF 85
is the sum of the squared distances from the u/s to the particular
(p - 1 )-dimensional hyperplane that best fits" the vector set U1> . " Ur.
.
Assumption B3, therefore, requires that the normalized vector sequence

hl/llhd, h2/llh211,' can be broken up into groups of finite size, the kth
. .
group containing r = Pk members, in such a way that the average

squared distance from the vectors hj/llhjl l (jeJk) to any (p - 1 )
dimensional hyperplane remains bounded away from zero as k -+ 00.
Loosely speaking, the sequence h1> h2,' must therefore Urepeatedly
. .
span" Euclidean p-space. No direction must be neglected. This makes

good intuitive sense. Indeed, let x be a generic point in p-space and set
Xn+l = Pnx. Then we have
Xn+1 =
Xn - (hn'xn)an;
that is, Xn+1 is obtained from Xn by subtracting off a piece of Xn
pointing in the direction an. If Pn is to map every x into the origin as
n -+ 00, as it must if liPn ll -+ 0, then hn must have at least a nonzero
projection on all p-coordinate axes infinitely often as n -+ 00. Assump
tion B3 requires just this, in a precise way.
There is also a relationship that exists between T2 and the limiting
value of the ratio of the largest to smallest eigenvalue of the matrix
1 h"hk' (which is the subject matter of Lemma 7b of the Appendix) .
=
This ratio, sometimes called the conditioning number, is a measure of

how close a nonsingular matrix is to being singular.
In the scalar-parameter case, we required that the gain have the same,
presumed constant, sign as the regression-function derivative. In the
present notation, the requirement would read
for all sufficiently large n. In higher dimensions, the natural analogue of

the product numerator is the inner product, and of the absolute values,
the lengths. Therefore, we must ensure that
lim inf an'hnlll an ll ll hn il

ft
> 0,
and it might seem reasonable that this is sufficient. But Assumption B5

demands much more of the cosines of the angles between the an's and
hn's. It requires that their smallest limit point be strictly larger than a
certain positive quantity IX, which depends on every member of both the
86 MEANSQUARE AND PROBABILITY-ONE CONVERGENCE
sequences {an} and {bn}. (We note 7"2 =::; 1 is always the case, so 0 =::; ex; <
1, as should be.) Moreover, the lower bound in Assumption B5 is an
essential one. This is graphically demonstrated by the following example
in which Assumptions BI through B4 hold,
lim"infan'bn/ilanll llbnli = IX >0

(so that Assumption B5 is "just barely" violated), but P" does not
converge to zero.
Example 6.1. We take

[
{ [l
if n is odd,
-l [COS cp]
a" - n . ,
Sin cp
if n

is even,
h. ]
where 0 < cp < 'TT/2. Assumptions B I and B2 are immediate because
Ilanll = I/n and Ilbnll 1. The limit inferior in Assumption B5 is simply
=
lim infa,,'bn/llanll Ilbnll = min (cos cp, sin cp) =::; I/V2,
"
with equality only at cp = 'TT/4. With regard to Assumption B4, we have
for any strictly increasing integer sequence {Vk} whose successive

differences remain bounded. For the particular choice Vk 2k - 1, we =
]'
have
[1 0 .

J=2k-l
b.b/
J
=
0 1
and therefore
identically in k. It can be further shown that the value of 7"2 in Assump

tion B3 cannot exceed 1- for any choice of indices ..In other words,
IX = I/V2 i s true in all cases, and Assumption B5 will be
even more violated ifPk > 2. We now take cp = 'TT/4, which gives
DISCUSSION OF ASSUMPTIONS AND PROOF 87
With this choice of the angle, we thereby satisfy Assumptions B 1

through B4 and violate Assumption B5 to the extent that equality rather
than inequality holds. But, for any rp, we have
sin ] [ sin ]
(I - b,a/ ) [ -cos rp
rp
-cos
=
rp
rp
=
1;,
whether j is odd or even. Thus, we see that
for all n l, and we have exhibited a (nontrivial) fixed point of every

one of the transformations. Therefore, {Pn } cannot tend to the zero
matrix.
In particular, then, the lower bound in Assumption B5 cannot gener
ally be replaced by zero, in contrast to what we might have expected by
analogy with the scalar-parameter case. We now prove the theorem.
Proof of Theorem 6.1. From Assumption Bl it follows that

supn III- anbn'll M < 00. For any n, let K K(n) be the largest
= =
integer such that "K :S; n, so that "K :S; n :S; "K+l - 1 . Then we have
n.
Pn Il (I - a,h/)pvx-1
i=VK
=
and, by Assumption B3,

n
IIPnll:s; Il II I - a,h/ IIIIPvK-d :S; MqIlPvx-11l

i=VK
It therefore suffices to show that
(6.5)
as K tends to infinity with n over some subset of the positive integers.

To do this, we set
(6.6)
where
(6.7)
By virtue of Assumptions Bl and B3, we have

where, unless otherwise noted, k runs over all positive integers. It is not
difficult to see that
Qk'Qk = I - k(Tk + Tk') + k2Ek

for some matrix Ek> whose norm is uniformly bounded in k. Thus, since
the matrices are symmetric,
IIQkl12 =
IIQk'Qkll S III - k(Tk + Tk')11 + O(k2)
= Amax [I k(Tk + Tk')] + O(k2)
-
= 1 - kAmiD (Tk + Tk') + O(k2) (6.9)

as k 00. Consequently, if we can show that
lim inf AmiD (Tk
k
+ Tk') > 3c > 0, (6.10)
for some such number c, we are done. For then, since k 0, from
Equations 6.9 and 6.10 we have
Os IIQkl12 S 1 - 2Ck
(say) for all large enough k. But I - x S e-X is always true, so that
(6.11)
Since the square root of the sum of squares is never smaller than the
sum of the absolute values,
by Assumption B2. The bound in Equation 6.11 will therefore tend

to 0 as n 00, and Equation 6.5 wiIl be afortiori true.
To demonstrate the validity of Inequality 6.10 is the main burden of
the proof. By Equation 6.6
Tk + Tk' = 1k 2 (ajh/
Je/,.
+ hja/)
= } 2 rlvju/
Uk Je/,.
+ ujv/), (6.13)
where
The unit vector VJ can be decomposed into its components in the

direction of UJ and orthogonal to it:
(6.14)
DISCUSSION OF ASSUMPTIONS AND PROOF 89
where
Here al is the cosine of the angle between 81 and hi which, by Assump

tion B5, is positive for all j E Jk and all large enough k. We assume,
hereafter, that the index k is so restricted. Since
"min (Tk + Tk') = min

UXII1
x'(Tk + Tk')x,
Equations 6.13 and 6.14 yield
"min (Tk + Tk') Uk L 'AX'V/)(x'u/)

min A2
IIXII-1 lei"
=
min :- L '/[aAX'U/)2 + VI
IlxlI-1 Uk Jeilc
- a/2 (x'OJ)(x'uJ

1IX11-1Uk [
min A2 flkak L (X'UJ)2
lei"
- )'k yl-.:"'l---a"""'k2 L
lei"
]
Ix'OJ! IX'uA , (6.15)
where
If uJ and OJ are imbedded in an orthonormal basis for p-space, it follows

from the Fourier representation for x in this coordinate system that
I = IIxll 2 (X'UJ)2 + (X'OJ)2
with equality when p = . Thus,

2
Ix'OJI vi - (X'UJ)2.
If we set
(6.16)
this combines with Equation 6.15 to give
"min (Tk + Tk')

min A2
IIXII-l
[
kak L J2(x) - )'k V I - ak2 L eAX) V I
uk fl Je/" e lei"
-
]
el(x) .
(6. 17)
We successively decrease the lower bound in Equation 6.17 by taking

the minimum over larger x-sets in p-space. We have
90 MEANSQUARE AND PROBABILlTYONE CONVERGENCE
where, since Uj is the normalized hi>

(6.18)
Thus, the set of all unit length vectors x is a subset of those for which
L el(x) Ak
je/k
In turn, the set of all real numbers ej in the unit interval which satisfy
2: el Ak
J e/k
contains the set of those of the form of Eq uation 6. I 6 which satisfy the
inequality. Consequently, the lower bound in Equation 6. I 7 can be
weakened to
AmiD (Tk + Tk') min

llkS 1: /2SPk
[ Ak L el
ie/k
- Bk L ejYl
ie/k
- ]
el ,
le/k
where we have set
(6.19)
After applying the Schwarz Inequality to the second term on the right
hand side and setting
we obtain
Inequality 6.10 will thus follow if the lowcr bound in Inequality 6.20 has
a strictly positive limit inferior as k co. We now complete the proof
by showing, as a consequence of Assumptions 83 through 85, that this
is indeed the casco
In the original notation, the numerator quantities in Equation 6.19
are
while the common denominator !J.k is given by Equation 6.7. Since

!J.k2
::;; Pk'Yk 2, we have
DISCUSSION OF ASSU/lfPTlONS AND PROOF 91
lim i nf PleA Ie lim inf 2VPie ,B1e ale > 2Vp > 0 (6.2 1 )
Ie Ie )'Ie P
according to Assumptions B4 and B5. From these two assumptions, it

also follows that
rIm sup Ble ., - 28

" A" =
vI _ .,2
for some 8 > O. For any such number, we see that
.,. - 3"1 - <;I -2->0,
.,.- (6.22)
"(I - + [.,. - 3"1 - <;1]1
because the left-hand side increases steadily from zero when viewed as a
function of 8. Using Equation 6.18 and Assumption B3, we now fix k(8)
so large that
and fA:c
J>./;:
PIc
I'1m I. nf
"
JAk- - e = ., - e
Pic
hold simultaneously for k k(8). For all such indices, we can, therefore,
write
gk(Z) min g(z), (6.23)
t-BS2S1
where
This function is strictly convex on [0, I ], as can be easily seen from an

examination of g (sin 6) over 0 6 TTl 2. It has roots at Z 0 and at =
.,-3
=.,.
Zo = - la,
"(1 - + [T - 3r
the last by the definition of Equation 6.22. Therefore, g(.) must be
strictly positive over [., - e, I ], because Zo < ., - e. This, together with
Equations 6.23 and 6.2 1 , implies the desired conclusion for Equation
6.20. Q.E.D.
Let us now return to the sequence of estimates (Equation 6.1), and
focus our attention on the resulting difference equation (Equation 6.3).
We allow the gain vector a, to depend on the firstj iterates, so that the
leading product is writte n
n
Pn(tl> .. " tn) TI [I - altl> ' . " tj)h/(tj)].
= j=1
In keeping with the notation of Theorem 6.1, we have set

Fiu,) = hitJ),
which is indeed the case because UJ depends on the iterates only through
the value of tJ It should be clear that the above sequence of matrix
valued random variables {P,,} will (surely) tend to the zero matrix as
n -+ 00, if we require that the vectors a,,(xh' . " x,,) and F ,, ( y) satisfy
Assumptions BI through B5 of Theorem 6.1 uniformly in all vector
arguments Xl, " X" and y . Such are, in fact, the first five assumptions

of the following theorem. The sixth takes care of the additional term in
Equation 6.3 arising from the stochastic residuals Wh W2, '
-
and
ensures mean-square convergence of lit" al to zero.
6.3 Theorems Concerning Mean-Square and Probability-One

Convergence for General Gains and Assumptions
(Cl Through C6 ' and Dl Through DS)
THEOREM 6.2
Let {Y,,:n = 1, 2, } be a real-valued stochastic process of the
form Y" = F,,(O) + W", where F,,( ) is known up to the p-dimensional
parameter e, and Wh W2,'" have uniformly bounded variances. For
each n, let a,,() be a Borel measurable mapping of the product space
X RP into RP (Euclidean p-space), and let
(n = I, 2, ... ; tl arbitrary).
Denote the gradient vector of F" by F" and suppose the following
assumptions hold:
Cl. lim sup Ila,,(xh"', x,,) II IIF,,(y) II = O.

n Xl,"' ,Xn.Y
C3. There exists a sequence of integers I = Vl < V2 < V3 such

that, with pk = Vk+l - Vk,
p;S;Pk;S;q < oo (k = 1 2 . . . ),
"
and
lim inf ..!.. inf ..\ m (L FiYJ)F/(YJ

in J e/" II FJ(yJ)112
)- - 'T
2 > 0
,
k PkY.,. . . . Y."+l-l
ASSUMPTIONS Cl THROUGH C6' AND Dl THROUGH DS 93
where
Jk = {Vb Vk + 1" " , Vk+l - I }.
max I l aj(xl>"',xj)II II Fj(Yj)1I
.::.
je:,:... ./ .::.
It -::-
. -:-
- - __,_;_.,__..".
C4. lim sup sup _
k l aj(xl>" . ,xj) I I I I Fj(Y j)1I

je/" I
Xlo .. .X.1c+l-1.Y./c ... .y./c+l-lmm
= p < 00.
. f a,,'(Xl> x,,)F"(y) > a,

. f
C5 I ImIn
"
In
.
... Y lIa,,(xl>"',X,,)I I I I F,,(y)II

" Xl. ...X
where
1 - 1'2
a =
J 1 _
1'2 + ('T/p)2'
Then Cl lt" - 61120 as noo if either
C6. L: sup l I a,,(xh"',x,,)11 < 00 or
n Xl_ .x"
C6'. {W,,} is a zero-mean independent process and

2: sup l I a,,(xl> " x,,) 112 < 00.
n
.
Xl. 0. Xn
Proof. The argument consists of three main parts. As in the proof of

Theorem 6.1, given any n, let K = K(n) be the index such that
VK n VK+l - I.
We first show, with a minimum of effort, that
Cll t"+l - 6112 M1Cl l t'K - 6112 + M2CYzllt'K - 611 2 + M3 AK2,

(6.24)
where AK
0 as K 00 with n. ( M's, with or without affixes, will stand
for finite constants which will be reused without regard to meaning.) It
is more than sufficient, therefore, to have
(6.25 )
as k 00 over the integers. This is immediate, if we can prove that
(6.26 )
holds for all large enough k, say k N, and some sequences having the
properties
Mk > 0 , lim t:.k = 0, t:.k = 00, Bk < 00. (6.27)
Indeed, after iterating Equation 6.26 back to N, we obtain
It follows from Equation 6.27 and Lemma 2 that this upper bound goes
to zero as k te nds to infinity. The sought-after conclusion will thus be
at hand.
The second and third parts of the proof establish Equations 6.26
and 6.27 under Assumptions C6 and C6', respectively. In the former
case, the argument is relatively straightforward. Under Assumption C6',
however, the details are a bit more complicated, but we are finally able
to use the independence to establish the desired inequality with some
(other) sequences which obey Equation 6.27.
Proofof Equation 6.24. Ite rate Equation 6.2 back from n + 1 to VK,
where K = K(n) is as before. We obtain
" " "
t"+l - 8 J=VK
=
J=VK I=J+l (I - a,h,') aJWJ,
TI (I - ajh/)(tvK - 8) + 2: TI
(6.28)
where it will be necessary to remember that aJ and hJ are now vector
valued ra ndom v ari ables :
(6.29)
We "square" both sides of Equation 6.28, take e x pectations, and then
bound from above (in the obvious way) the two squared norms and the
inner product. The result is
(6.30)
From Assumption Cl it follows that there is a r ea l number M such that

sup" III - a"h,,'11 < M;
therefore, by Assumption C3, we have
"
J=m
TI III - ajh/ II < Mq (6.31)
for all vie 112 n VIe+l

- 1 and all k l . (Unqualified deterministic
bounds on random variables are to be read as sure ones.) Under either
Assumptions C6 or C6', we have
Jel"
max IlaJ11
Jel"
max sup
Xl. .XI
Ilaj(xh, xj)11 = Ale -'Jo- 0 (6.32)
ASSUMPTIONS Cl THROUGH C6' AND Dl THROUGH DS 95
as k -+ 00. If we apply these results with k = K(n) to Equation 6.30 and

then use the Schwarz Inequality, we get
M-2 Qcllt"+1 - 611 2
ClltYK - 6112 + 2AKCYolltYK - 6112C L 1 WJI
JelK
2
( )
AK2 C( L I WJI ) .
2
+
JelK
From Equation 6.32 and this, Equation 6.24 follows, becauseJk contains
no more than q indices for any k, and sup" 8 W"2is finite by hypothesis.
Proofof Equations 6.26 and 6.27 under Assumption C6. We return to

Equation 6.28, set n VK + 1
= -
I , and then replace K by an arbitrary
integer k. After again "squaring," we have
(6.33)
where, i n contrast to Equation 6.6,
Qk =
TI (I - ajh/) =
Qk(th"',tYk+l-1) (k = 1,2,,, . ) (6.34)
Jeh
is stochastic. The deterministic quantity to be used here in place of
Equation 6.7 is
Ak
= C k xl i'XJ ) aj(xh' XJ)112II Fj(y)II2r

(6.35)
Because ofAssumptions CI and C3, we have

limAk O. (6.36)
k
=
We formally define Tk the way we did in Equation 6.6, but with the
summands given by Equation 6.29 and Ak by Equation 6.35. Using
Assumption C4 in addition to CI, we see that Equation 6.9 remains true
for the matrix Qk of Equation 6.34. Furthermore, by virtue of the uniform
nature of Assumptions C3 through C5, the same ( long) type of argu
ment which led to Equation 6.10 proves, for the present situation, that
(k N) (6.37)
holds for some (deterministic) c > 0 and N < 00. We now apply the
Schwarz Inequality to the second term on the right-hand side of

Equation 6.33 and then majorize the bound using Equations 6.31, 6.32,
and 6.37. After taking expectations, we obtain for k N, in the
notation of Equation 6.25,
e+1 !5: (1 - cl1k)2ek2 + 2M2QAkekt'Y. 2: I Wil ( ie;"

)2
(
+ M2QAk2C 2: I Wil )2
(I - cl1k)2ek2 + M1Akek + M2Ak2
ie/l<
!5:
!5: (l - cl1k)2ek2 + M3Ak(l + ek), (6.38)
because second-order noise moments are uniformly bounded. For Ak

defined in Equation 6.32, we have, after making use of Assumption C6,
Ak < 00, (6.39)
because the maximum value of a number of terms is certainly smaller

than their sum. With Equations 6.36 and 6.39 we satisfy the hypoth
eses of Lemma 3, and by its conclusion have
sup ek2 < 00.

k
Hence, from Equation 6.38 there results
e+1 !5: (I Cl1k) ek2

- + M4 Ak (6.40)
(say) for all large enough k. It remains to be seen (for the same reason
that Equation 6.12 followed from Assumption B2) that Assumption C2
implies
l1k 00 = (6.41 )
for the sequence of Equation 6.35.
Proof of Equations 6.26 and 6.27 under Assumption C6'. When

Equation 6.39 fails to hold, we need a tighter bound on the expectation
of the cross term in Equation 6.33 than the one used in Equation 6.38.
Specifically, we will show that
L = 11(t." 8)'Q,c' ""ii-1 (I -."")alw/l S cil./ce/c2 + Mlil./cA/c2

r
- -
1:7" '-1+1
(6.42)
no matter what c > 0 in Equation 6.37. All results of the previous

ASSUMPTIONS CI THROUGH C6' AND DI THROUGH DS 97
6.39,
C5;
paragraph, with the exception of were derived from Assumptions
in particular, the balance of the second bound on e+l
6.38
Cl through
6.42, -+
in Equation remains true as written. Thus, given the val,idity of
Equation we wiII have, because Ak 0,
=:; - c Ak)2ek2 + c Akek2 + M1AkAk2 + M2Ak2 (6.43)

- -!cAk)2ek2 + M3Ak2
e+l (l
(I
for all large enough Just as Assumption C6 led to Equation 6.39, we
k.
see that Assumption C6' implies that
(6.44)
Equation 6.43 is the desired inequality of Equation 6.26, while Equa.
tions 6.36, 6.41, and 6.44 are collectively the statement of Equation 6.27.
It remains, therefore, to establish Equation 6.42. We begin by carry
Equation 6.34, we find that

ing out the matrix multiplication called for. Using the definition of
where, as earlier, Ak is given by Equation 6.35 and
E
in view of Assumption C4. After much manipulation, it turns out that
the norm of the matrix i k is also uniformly bounded:
k;<:sulp jesuJpk II EjkII < M2

Thus, the left hand side of Equation 6.42 is equal to
L = 1t'(tVk - 0)' jeJ2:k (I + I=Vk aih/) ajWj

- t'(tvk - O),Ak(Tk Tk') jeh2: ajWj
+
+ ,s'(tvk -0)'Ak2 2: Ejk ajWjl I I + II IIII. (6.45)

ieJk = +
Consider the first term. Since a, and h, depend on the estimates up
i-I, Wi is independent of
through time i, and hence on the observational errors up through time
and
for all j E Jk Consequently, we h ave
{ [ I k aiht']a,wil wlo"" Wi -l }
C (tvk - 6)' 1 +
= (tvk - 6') [1 + : aiht']a/f{ Wil Wlo '., Wi -l} = 0

I=Vk
for allj E Jk and, hence, each of the Pk unconditional expectations must
vanish, giving
I I I = O. (6.46)
For the second term, in view of Equation 6.32, we have
III I ::; 6' Iltvk - 61111k IITk + Tk'il L Ilailil Wil ::; M311kAkek, (6.47)
ielk
where M3 involves All (with the last given meaning) and the uniform
bound on residual variances. Similarly. we have
IIIII ::; 6"lltvk - 61111k2 L II Eikll lla,111 Wil ::; M411k2Akek. (6.48)
ielk
where M4 involves M2 and so on. Since llk -+- 0 as k -+- 00, Equations
6.45 through 6.48 combine to give
where the identity is in the number c > 0 appearing in Equation 6.37.

Calling these two square roots a and b, respectively. we obtain
L ::; cllkek2 + AfsllkAk2

2
from the generally valid inequality ab ::; a + b2 This is the assertion
of Equation 6.42. Q.E.D.
The proof of almost sure convergence can be executed under Assump

tions Cl through C5 and either Assumption C6 or C6'. Actually,
Assumptions Cl through C5 can be weakened somewhat, and it is under
such a set of weaker conditions that we prove the following.
THEOREM 6.3
In the notation and under the assumptions of Theorem 6.2, t"

converges to 6 with probability one as n -+- 00. This remains true if
Assumptions Cl through C5 are replaced by the following conditions:
For every sequence of p-vectors Xl, X2 and Ylo Y2,. ,
ASSUMPTIONS Cl THROUGlI C6' AND D1 TlIROUGlI DS 99
Dl . lim n l I an(xh,xn )II IIFn(Yn )11 = O.

D2. 2n Il an(x1, ,xn )II II Fn (Yn )11 = 00.
D3. There exists a sequence of integers 1 = V1 < V2 < V3 such
that, with p" = VI<+1 - v",
P !5: PI< !5: q< (k = 1,2,)

00
and
where
max I laj(xh,x/)I I I F
I iy/)11
D4 rISUp
.:.... I;:..
k
/E:..,;. -::-..,.
- ..- _---,,....,,..:------,,-
p
_
m in lI aj (xh. . . , x/)II II Fi y/)11 00

jE < .
h
=
1 - .,.2
D5. lim inf
n
an'(xlo,xn)Fn(Yn)
I lan(Xh, Xn)I I n(Yn)1
I IF > II:
=
J1 - .,.2 + (.,./pi
The v,,'s, .,.2 ,and p can depend on the sequences {xn} and {Yn}.
Proof Let tlo t2, t3, be a realization of the random-variable

sequence recursively defined in the statement of Theorem 6.2. From

Equation 6.28, we have
"
t
Il n+l - 611 TI III
/=VK
-
ajh/ll lltvK -
611
n n
-
+ 2: TI 1 III ajh/ll llaJ WJII , (6.48)
j=VK 1=/+
where K = K(n) is the (now, possibly random) integer defined at the
outset of the proof of Theorem 6.2. The random vectors a, and hj in
Equation 6.29 are seen to satisfy Assumptions Dl through D5 with
probability one, when we set xJ = tj and Yj = Uj = uitj). Let
n
sn+1 = 2: ajWj (6.49)
"=1
The same two arguments used in the final paragraph of the proof of
Theorem 2.1, to show that Equation 2.13 follows from Condition 4 or
5, apply here to show (component-wise) that
Sn
&.s.
S (6.50)
follows from Assumptions C6 or C6' for some finite vector-valued

s.
random variable We will not actually use Equation 6.50 until later on,
and for the present we only make note of the fact that it implies
(6.51)
as n-+oo. Under Assumption 01, there is a scalar-valued random
variable M = M(t!> t2,) such that Equation 6.31 holds with prob
ability one. Thus, from Equation 6.48,
Iltn+l - 611 - 611 + qMq imaxK IlaiWill ,

M qlltvK
el
and hence, because of Equation 6.51, tn will converge a.s. to 6 once we

have shown that
(6.52)
as m tends to infinity over the positive integers.
Rather than derive a recursive relationship between the successive

deviations in Equation 6.52 (as was previously done for their mean
squares, em2), we set n + 1 = Vm+l
in Equation 6.3 and consider the
resulting formula for t;m+l - 6 in terms of all past iterates, that is,
those with consecutive indices in J1,J2, . . . ,Jm. (Note the dummy index
)
k in the proof of Theorem 6.2 is here replaced by Ill. We first rewrite the
sum on the right-hand side of Equation 6.3, with n left arbitrary. We
define
R, {0 (I - aM for j= 1,2,,n,
for j;::: n + I,

where the n-dependence must be remembered. The sum in question is

then
" "
i-1 R1+1aiWi = 1=1L R1+1S1+1 - R1+1Si

L
"
L (R'SI - R1+1S/) S,,+1
1=1 =
"
+
L [R 1+1(1 - ajh/)s, - R 1+1S,] Sn+1

1=1 = +
"
L R 1+1ajh/s, Sn+1
1=1
= - +
But Lemma 1 tells us that L1=1 R 1+1ajh/ - R1. We can thereby

= I
obtain, with S1
incorporate the established limit of Equation 6.50 into the result and
0,
=
ASSUMPT[ONS CI THROUGH C6' AND DI THROUGH D5 101
n n
2: RJ+1ajh/(sj - s). (6.53)

1=1 RJ+1ajWj R1s (Sn+1 - s) - 1=1
2: = +
Now, taking Vm+1 1,from Equations 6.3 and 6.53 we obtain

n = -
tVm+l - k=1 s) (s m+l - s)

m Ym 1-V1
e n Qk(tv1
= - e + +
- 2: 2: n (I - a,h/)ajh/(sj - s), (6.54)

k=1 l=j+1 lei"
where Qk is given by Equation 6.34.

To show that the norm of the leading product matrix goes to 0 a.s. as
m -+ 00 is an easy matter. We set
(6.55)
in contrast to the deterministic definition of Equation 6.35. However, it

follows from Assumptions 01 through 03, just as Equations 6.36 and
6.41 did from the uniform version of these assumptions, that
llk
A
-+
&.sO
,
(6.56)
as k -+ 00. Furthermore, Assumptions 03 through 05 imply the

existence of a random variable c > 0 and an integer-valued random
variable N < 00 such that
for all k N (6.57)
with probability one. Thus, by Equation 6.56, we have
liD Qkl liD: Qk l DN (l

- Cl!.k) Cexp { -c kt l!.k} 0
(6.58)
as n -+ 00, where Cis some positive random variable. It follows that the
norm of the first vector on the right-hand side of Equation 6.54 goes to
zero as m -+ 00. That the second does also is an immediate consequence
of Equation 6.50 and the fact that -+ 00 with m. Vm
It remains to appropriately bound the norm of the third term in
Jk,
Equation 6.54. For any j E Equations 6.31 and 6.34 imply
m 1 (I - a h ) vm (I a h ) V
I v l=iy
j+1 i .' I Il m=tyl
v

I l=j+1 (I - aih/)I
- i / II I k+ft,, + 1
l=nk+1 IIQ"' MQ,

where void products are to be read as unity. This, plus Equation 6.55,
gives
m m
::; Mq L TI II Qdl kdk'
k=ll=k+l
(6.59)
where, according to Equation 6.50,
dk max IlsJ - 511 0 (6.60)

Jeh
=
as k -7 00. It is now a simple matter to show that the last sum in

Equation 6.59 tends a.s. to zero as In -7 00, which will complete the
proof. Using Equation 6.57, we have
N-l m N m m
L TI II Qdl I =k+l
k=I I =N+l
L I =k+l
TI II Qdl kdk + k=N TI II Qdl k dk
m m
1
::; c' TI (1 - C I) + - L bmkdk (6.61)
I =N+l Ck=N
after setting
By Equation 6.58, bmk 0 as In -7 00 for every fixed k; in particular,

the first term in Equation 6.61 is a.s. 0(1) as III -7 00. According
to Lemma 1, we have
as In -7 00 for the same reason. It therefore follows from the Toeplitz

Lemma (Knopp, 1947, p. 75) and Equation 6.60 that the second term in
Equation 6.61 tends a.s. to zero as III -7 00. Q.E.D.
6.4 Truncated Vector Iterations

The asymptotic behavior of truncated vector estimator sequences
poses difficult analytical problems, and we have been only partially
successful in dealing with them. However, because of their great practical
importance, we feel that a discussion of truncated procedures is in order
even though our results are not complete.
ASSUMPTIONS EI THROUGH E6' 1 03
Suppose that prior information is available which allows us to state

with certainty that the unknown 6 is an interior point of a closed ( not
necessarily bounded) convex subset, 8P, of p-dimensional Euclidean
space. (In the absence of prior knowledge, f!J R".) For any vector x in =
R",there is always a unique point [x]JI' in 8P which is closest to x in the

sense of minimum Euclidean distance. In particular, if x E 8P, then
[x]ao x. If x rf: 8P, then [x].J' is a boundary point of f!J.
=
The natural generalization of the truncated scalar iteration of Chapters

2 through 5 is
t"+1 = [t" + a " [ Y,, - F,,(t,,)]].J',
where tl is arbitrary in fJ' (which reduces to Equation 2.2 if p 1 and fJ' =
is the interval [Eb E2]). In higher dimensions, the computational and

analytical problem associated with actual evaluation of [xlao can be
considerable. One must carry out the minimization of a quadratic form
over a convex set and, if fJ' is chosen inconveniently, this gets messy.
Furthermore, it must be repeated for every new estimate. For this
reason, one should, whenever possible, take fJ' to be either a sphere or a
rectangular parallelepiped whose sides are parallel to the coordinate
planes. In these cases, the evaluation of [xla' is easy. If fJ' is the sphere of
radius r centered on a point 60, then
. ' x - 60
[xlao = mID (r, Ilx - (011) x + 60
Il _ 0011
If fJ' is the rectangular parallelepiped {x: al XI {3h i = 1,2, . , p},
where XI is the ith component of x, then the ith component of the vector
[xlj> is simply [xa!:, which was introduced in Equation 2.1.
We conjecture that the following proposition is a valid extension of
Theorems 6.2 and 6.3, but the methods used to establish those results do
not seem to work for the present situation.
6.5 Conjectured Theorem and Assumptions (El Through E6 ')

CONJECTURED THEOREM
Let {Y,,: n = I, 2, . . . } be a real-valued stochastic process of the form
Y" F,,(O) + W", where F,,() is known up to the p-dimensional
=
parameter 0, and Wlt W2, have uniformly bounded variances. Let

[!jJ be a closed convex subset of Euclidean p-space known to contain 6 as

an interior point. For each n, let a ,, ( ) be a Borel measurable mapping of
PJ'<"> = X 8P into R", and let
t"+ 1 = [t" + a ,,(tb , t,,)[ Y" - F,,(t,,)]]j>
(n = 1,2,; tl arbitrary in [!jJ).
1 04 MEAN-SQUARE AND PROBABILITY-ONE CONVERGENCE
Denote the gradient of Fn by Fn, and suppose the following conditions

are met:
E1. lim sup I I an(x) I II Fn(Y)1I
n
o.
xe(nl.)'e
=
E2. 2: inf lI an(x)1I I IFn(y)1I 00.

n xe(nl.)'e
=
E3. There exists a sequence of integers 1 = Vl < V2 < V3 such

that, with PIe = V"+l - V",
(k = 1 , 2, . ),
and
1 . l
1m Inf -
.
Inf
\
"min L.
(
" F;(YJF/ (Y;
...
) -T
-
2
> 0,
" PIe (y.k..).k+ l-l)e(l'kl l ;(Yj)II 2
ieJk Ir
where
J" = {Vb V" + 1,, VIe+1 -
I}.
maxIl a;(xlo' " , X;)II IIF;(Yi) II
E4. lim sup sup
eJ..;.:.k-c-
_i_ ---,-;- ____ __
" (Xl..X.k+ 1_1)ea'(k+ 1-11.ye&' minIl a;(xlo, xj)IIIIF;(Y;)11

ie J k
p < 00. =
1 -
J I-T2 + (T/p)2
T2
=
Then tn e as n 00 both in mean square and with probability one,if

either
E6. 2: sup II an(x) II < 00,
n xe&,(nl
or
E6'. {Wn} is a zero-mean independent process and
2:n xsup
e&,(nl
Il a n(x) 112 < 00.
In practice, Assumptions El through E6' represent a substantial

relaxation of Assumptions CI throu.gh C6', particularly when f!I is a
small compact set.
6.6 Batch Processing

One of the reasons we feel so strongly about the veracity of the
conjecture is that the part of the proposition concerning mean-square
BATCH PROCESSING 1 05
convergence is true if the estimation scheme is only slightly modified;

specifically, if the data are collected and processed in batches. Actually,
batch processing is sometimes the most natural way to handle an
incoming flow of data. For example, if information is being collected
simultaneously from many sources (say, Pk sources reporting at time k),
then it is reasonable to process all the currently available data before
updating the most recent estimate. A less convincing, but still plausible,
case can be made for batch-processing data that arrive in a steady
stream. In either case, the raw data consist of scalar observations
y..
= F.. (8) + W.. (n= 1,2" ,,), These are grouped together so that
the data processed at the kth instant is the vector of observations
(k = 1,2,),
where the regression vector Fk(8) and the residual vector Wk each have
Pk components and are defined in the obvious way.
The recursion considered is of the form
Sk+l = Sk + Ak[Yk - FkCSk),

where Ak is a suitably chosen P x P k matrix. We now d efin e
t" = Sk for al l n eJk
This vector estimate "keeps pace" with the incoming data, but instead
of ch an ging after each observation, it changes only after Ph P2, . .
observations.
The mean-square consistency of truncated recursive estimators
SUbjected to batch processing is the substance of the next theorem.
THEOREM 6.4
Let { Y..}, {anC )}, and 9 be as defined in the previous statement of the
Conjectured Theorem, and suppose that the Assumptions E 1 through E5
and either E6 or E6' hold. Let S1 be arbitrary in 9, and let
(k = 1,2" , . ),
whe re
Here AI< is the matrix whose columns are
where aJ = aj(th " til and

tJ SIc = for all j e J".
Then ells" - 011 2 O as k oo.
Proof. We define the p-vectors

Z" =
Ak[Y" - Fk(O)] = A"Wk,
(6.62)
' "
where Wk has the PIc components W.k, W.k+ 1" W.k+ 1-1' Then we
have
s"+1 = [SIc + Z,,]J'.
If SIc + Zk e(!J, then Ilsk+l - Oil = liS" + Z" - Oil. If SIc + Z" does
not belong to 9, the hyperplane that passes through the boundary
point SIc +1 and is perpendicular to the line joining SIc +1 and SIc + Z"
separates the latter from all points in 9. (This is the classical way of
constructing a hyperplane that separates a convex body from any point
o utside.) Thus, all points in 9, 0 in particular, are closer to Sk+1 than to
SIc + Z". Consequently, we see that
Ils"+1 - Oil liS" + Z" - Oil
is true in either case.
By the mean-value theorem, we have
F,,(s,,) = FkCO) + n"'(s,, - Ok),
where H" is the matrix whose columns are, respectively, the gradient
vectors of F.k, F'k +h , F'k + 1 -1 evaluated at various points in (!/J.

Thus, we have
and
Ilsk+1 - Oil 11(1 - Akllk')(s" - 0) + Z"II
From this and Equation 6.62, we deduce the inequality
Ils"+1 - 011 2 III - A"Hd 21Is" - 011 2
+ 2W,,'Ak'(I - Akllk')(s" - 0) + IIA"WkI1 2. (6.63)
If we can show that
(6.64)
BATCH PROCESSING 1 07
holds for some (deterministic) c > 0 and number sequence {L1k} such
that
I1k-+ 0, (6.65)
then we will be finished.
Indeed, if we set
ek2 =
Cllsk - 8112,
it follows from Equation 6.63, after first majorizing the middle term
with norms, that
e+l (I - cl1k)2ek2 + M1C2(IIAkIl Il WkID2ek + 6( IIAkIl Il WkID2.
(6.66)
Since
IIAkll2 tr (Ak' Ak) =
2: lIajll2
je/k
2: sup IIalx)II 2, (6.67)

je/k xefl'CI)
it follows that
IIAkll q ma x sup lIaj(x)1I = ak' ( 6 . 68)

j e1k fl'CI)
xe
Under Assumption E6,

(6.69)
By hypothesis, all residua second moments are finite, so from Equation

6.(i{i it follows thM.
e+l (I - cl1k)2ek2 + A1 akek + Afaak2 (1 cl1k)2ek2

2
-
+ M4ak(I + ek)'
According to Lemma 3, Equations 6.65 and 6.69 imply SUPk ek2 < 00;
therefore,
e+l (1 Cl1k) ek2
- + MSak'
The argument that followed Equation 6.27 shows that ek2 -+ 0 as

k -+ 00. On the other hand, when Assumption E6' is true, the cross term
in Equation 6.63 has zero expectation, because Wk is independent of
Ak, Ilk' and Sk' We now find that
e+l (1 Cl1k) ek2 + Maak2.

-
The sequence of Equation 6.68 is square summable under Assumption

E6', so again ek2 tends to zero.
It remains, therefore, to establish Equations 6.64 through 6.65. Let

b./c, .. " b./c + 1 -1 be the columns of Ilk, so that b,
= F(uJ for some
UJ E . Then we have
and
III - AkHk'I12 = 11(1 - AkHk')'(I - AkHk)II
II - J

/c
(ajb/ + bja/) I + IIAkHk'I12
II - 2: (ajb/
J el"
+ hja/) II + 2: Ila,1121IbJI12.
iel"
(6.70)
We set
where
The latter satisfies Equation 6.65, according to Assumptions El and E2.

Furthermore, Equation 6.70 and Assumption E4 give
III - AkHk'I12 III - Ak(Tk + Tk') II + MAk2
- 1 - Ak.\mln (Tk + Tk') + MAk2.
The same argument used in the proof of Theorem 6.1 applies here, under
Assumptions E3 through E5, to prove that
Amln (Tk + Tk') > 3c > 0

for all large enough k, and hence Equation 6.64. Q.E.D.
The following is a rather trivial coroIIary to Theorem 6.4, but it will

prove useful in the discussion of certain applications.
Corollary. If the sequence of gain vectors {an()} satisfies El, E2, E4,
E5, and either E6 or E6', then so does
an ( . )
* = cp"an(')
for any sequence {CPn} of scalars bounded from above and below away
from zero.
7. Complements and Details
I n this cha pter we wi ll exa mi ne vari ous rati ona les for choosi ng gai n
se que nces for vect or -para me ter re cursi ve esti mati on s chemes . M otivate d
by considerati on of the li near case , two types of gai ns will be dis cusse d i n
detai l. T he firs t cate gory of gai ns possesses a n optima l pr oper ty whe n
a pplied t o li near regressi on . T he ot her has the virt ue o fex tre me compu
ta ti ona l simp licity. T he res ults of T he ore m 6.4 are s pe cialize d a nd
a pplie d dire ct ly to these par ticular gai ns i n T he ore m 7 .1 . We be gi n our
dis cuss ion wit h a look a t li near re gressi on fr om the re cursi ve poi nt of
view.
7.1 Optimum Gains for Recursive Linear Regression
Suppose one obser ve s ra ndom var ia bles,
Y" = h,,'a + W", (7.1)
where {hn} is a k nown se quence of p-di me nsi ona l ve ct ors, a is not k nown,
a nd the W,,'s are ass ume d to be i nde pe nde nt ra nd om varia bles wit h
commori u nk nown varia nces a2 S uppose furt her tha t s o meone prese nts
us wit h a n esti mat or til tha t is base d upon (is a meas urable functi on of)
the firs t n - 1 obser vati ons . We constr uct a n esti mat or tn+l tha t
i ncor p orates the nt h obser vati on YII i n the foll owing wa y:
tn+l = t" + an[YII - bn'tn1. (7.2)

109
1 10 COMPLEMENTS AND DETAILS
How shoul d th e ( determi ni stic) gai n vector an be ch osen if we wish to

minimiz e th e mea n- squa re dista nce from tn+l t o 81
Th is qu estion ca n be answered i n a stra igh tforwar d ma nner. Let
Bn = C( tn+l - 8)( tn+l 8r (n = 1,2,, , . ).

a
-
Since
t9'lI tn+1 - 8112 = tr t9'( tn+l - 8)( t"+l - 8r, (7.3)
it i s cl ea r tha t ansh oul d be ch osen to mi n i mize th e tra ce of Bn. Su bst itu t
i ng Equa ti o n 7.1 i nto Equ ation 7 .2, ex ploiti ng the i ndependence of the
W' s, a nd completing th e squa re, we find th at
Bn - Bn-l
_ (Bn-In
h )(Bn_lh..)'
1 + hn'Bn-1hn
+ ( 1 + hn'Bn-lhn) an - ( Bn-1bn
1 + b"'Bn-lhn an )( -
Bn_lh,,
h '
1 + hn'B"-1 n ),
(7.4)
T hus,
tr Bn = tr Bn-l
h n'B_lh"
1 + hnB' n-l"
-
h ) an
+ ( 1 + hn'Bn-1n
I -
1 + Bh:lnh r 7 )
( .5
Thu s, if th e estima tor tn i s gi ven ( with second- order moment ma trix
B"-I,) th e appropr ia te valu e of an ( wh ich mi ni mizes tr Bn) i s given by
(7.6)
Wh en an is so ch osen,
Bn - B n-l _ \?-.. _I"o,,}\?-n- "0,,)
1
(7.7)
-
1 + n h B' n_l"h '
)
and
B"h" = Bn-lhn 1 ( -
hn'B .. _1h,,
1 + hnB' n-1hn
1 Bn- hn
:...,..::-:- = an
1 + hn'Bn-1hn
Thus, the same end is achieved by choosing
an = Bnbn, (7.8)
wh ere Bni s defined i n terms of Bn-1 by Equation 7 .7. Th is result leads
u s to gi ve seri ous consi deration to gai n sequences defined i teratively by
OPTIMUM GAINS FOR RECURSIVE LINEAR REGRESSION II I
Equati ons 7 .7 a nd 7.8, a t lea st i nsofar a s re cur sive e stima ti o n pr ocedure s

for li near regre ssi o n are concer ne d. Le t u s, there fore, consider i n de tai l
the re cur sive e stima ti on scheme
t"+l = tIl + a"[Y,, - b,,'t,,],

a" = B"b", (7 .9)
I n order to "get the re cur si on started," ini tia l condi ti ons for tIl a nd
B" -1 mu st be spe cifie d for some n no + 1 . I f thi s i s d one , i t is ea sy t o
=
sh ow tha t the re cur si ons ca n be solve d in closed for m:
(k = 0, 1 , 2,), (7 . lO)
and
(7 . 1 1)
pr ov ide d that Bno i s posi tive definite . T o pr ove Equa ti on 7. lO , we pr o

cee d by i ndu cti on . It is true for k = 0 . Su ppose i t hold s for k. T he n
N ow we u se Equati on 7.9 , wri te Bno+k+1 i n term s o f Bno +k a nd

bno +k+10 a nd carr y out the mU lti pli ca ti o n on the righ t- ha nd side of
Equa ti on 7.12. We there by obtai n the ide nti ty ma trix, a nd th is
e sta bli she s Equa ti o n 7 . lO. Equa ti on 7 .11 is verifie d by su bsti tu ti on .
Le t u s examine the conse que nce s of some spe cia l ini tial condi tions.
I f we take no = 0, t1 0, a nd Bo Ro, the n
( -1 L" bjb/)-1'
= =
= Ro +
B" ( 7.1 3)
1=1
a nd
( 7 .14)
Thi s i s exa ctly the ex pre ssi on for the condi tional expe cta ti on of 6 , give n
Y10 , Y", in the ca se where the re sidua ls have a spheri call y symme tric

j oint nor ma l di stri bu ti on wi th varia nce (72, a nd 6 ha s a pri or nor ma l

di stri bu ti on wi th zer o mea n a nd covaria nce (72Ro .
Supp ose, o n the othe r ha nd, tha t we wai t for p ob se rva ti ons to
accumula te be fore a ttemp ti ng to e stima te the p-dimensi ona l parame te r
O. I f we a ssume tha t hl, h2' ... , hI' are linea rly i ndependen t and take, as
our" fir st" e stima te,
the n, s in ce YJ b/O WJ' we se e tha t

= +
- 0) et bib/) l et bJWJ).
(tp+1 = -
W e hav e, the re fore,
C(tp+1 - O)(tp+1 - 0)' a2( bJh/)

=
J=1
1
- .
)
Th us, if we tak e
Bp ( hjh/ ,
I' -1
no = p, and =
J=1 '
we ded uce from Equa ti on 7. 10 tha t
for n p.
F urthe rmore, b y Equation 7 .11 ,
tn+l Bn (J= hJYJ + i bIYJ)

1 1=1'+1
=
(7 .15)
wh ich is pre cisely th e lea st-s qua res es tima tor for 0 based upon
YhI nYth2, , Yn.
e more conventional matrix notati on, th e Bayesia n and lea st
s qua res estima tors, Equa ti ons 7. 1 4 and 7.15, ca n be wri tte n as
and
res pectivel y, wh ere Hn' is th e n x p matrix wh os e rows are h/ (j =
YJ
1, 2, , n), and Yn is the n vector whos ej th component is th e s cal ar
obs ervation Thus , depending upon th e initial condi tions, th e
recurs ion of Eq uation 7.9 can yiel d the B ayesian es tima tor of 0 (condi
tional ex pectation) in a Gauss ian formulation, or the leas t-sq uares
estimator for 0 (no ass umptions concerning dis tributio n theory of
res idual s being n ecess ary) .
OPTIMUM GAINS FOR RECURSIVE LINEAR REGRESSION 11 3
From th e large- samp le p oint of view, the i nitial condi ti ons a re of no

consequence. In fa ct , by Equation 7 . 1 1,
C(tn+l - 6)(tn+l - 6)' = Bn[Ano + a2Bn -1]Bn,
where
Therefore,
Iflltn+l - 6112 = tr Bn[Ano + a2Bn -1]Bn
= tr BnAno Bn + a2 tr Bn
app roa ch es zero if and only if tr Bn -+ O. Since
.\max (Bn) S tr Bn S P.\max (Bn),
th is redu ce s the que sti on of tn's mea n- square consiste ncy t o the stud y of
Bn' s large st ei ge nva lue. ( W e could resort to The ore m 6.2, bu t we will se e
tha t the specia l featur es of th is linea r pr oble m mak e the hyp othe se s of
The orem 6.2 unne ce ssari ly strong.) B y Equati on, 7 .10 ,
Si nce .\max (Bn) = l/.\mln (Bn -1) , and since .\mln (Bn -1) -+ 00 i f and onl y
if
for every unit vector x, w e must fi n d cond itions wh ich ensure tha t
l i m .\mln
n
( i hih/)
11
= 00. (7 . 16)
Equa tion 7 .16 wi ll hold if there is a s eque nce of integ ers 1 VI <
V < . . . , with
=
2
P S Pk = Vk+l - Vk S q < 00 and Jk = {vbVk + I, ,Vk+l - I},
s uch that
(7 . 17a)
114 COMPLEMENTS AND DETAILS
an d
L mi n IIhJI12
'"
n=1 Je/"
= 00. (7.17b)
Fo r the n
"min (Vk +i -1 hJh/ )

i=l
= "min(i L bJh/)
n=1 ie/"
i "min (L bih/ )
n= l leI"
mi n IlbJl12"mln ( L hJh/lllhJI12)
nml Je/" leI"
T2 L mi n IlhJI12
k
-+ 00.
nal leI"
Since
i s a nonde crea sing sequen ce, it converge s to the sa me limit as doe s

SYu 1 -1' Thi s e stabli she s E qua tio n 7.16.
As we men tioned earlier, the special na ture o f the gain s being u se d in

thi s linear-regre ssion proble m cau se s some o f the condi tio ns o f Theorem
6.2 to be irrelevan t. I n fact, A ssump tion s C2, C3, an d C4 are by the m
selve s sufficien t to guara ntee E qua tion 7.17. Qui te ob viou sly, Assum p
tion C3 i s the sa me as E qua tion 7.17a, an d A ssumptio n C4 im plie s that
mi n IIBjhJll llh,11 ! max IIBjhjlll hl ,11

Jeh P leh

1
pq
- L
le/k
IIB,bill Ilh,ll;
there for e, we have
by A ssu mptio n C2. Since

IIB/II = "max (Bj) = 1/"mln (BJ-l),
and since
"QUICK AND DIRTY" RECURSIVE LINEAR REGRESSION 115
i s a p osit ive nonde crea sing se que nce, we see t ha t

l i m s up IIBili < M.
j .... ""
T hus,
which impl ie s E quat ion 7 .1 7b .

In summary, Assumptions C2, C3, and C4 of Theorem 6.2 are sufficient
for Equation 7.17, which in turn implies Equation 7.16. This latter condi
tion, in turn, is necessary and sufficient for the recursion of Equation 7.9 to
generate a mean-square consistent estimator for the linear-regression
parameter 6 , regardless of the initial conditions imposed upon the t" and
B" sequences.
7.2 "Quick and Dirty" Recursive Linear Regression
I n s ituat ion s where data p our i n a t a very high rate a nd"rea l-t ime"
e st imate s are a cutely de sired, we ma y be will ing t o trade off stat ist ical
effi cie ncy f or comp utat iona l speed, s o long a s consiste ncy is pre se rved .
T he ga in seque nce
(7 .18)
with t he a ssociated re curs ion
(tl a rb itra ry)
furn ishe s a very hand y e st imat ion scheme. T he ga ins used in E quat ion
7 .9 all owed u s t o find a closed f orm expre ssion f or t" and t o stud y it s
a sympt otic pr opert ie s d ire ctl y ( wit hout re cour se t o The orem 6.2). T h is
i s not p ossible in the pre sen t ca se. H oweve r, if we a ssume tha t E quat ion
7 .17 holds and, in add it i on, t ha t
o < lim inf Ilh"+lll < lim sup Ilh"+lll < ex)
(7 .19a)
n Ilh,,11 -
n Ilh,,11
a nd
l im ( 7 .l9b)
n
then Assumptions C1 through C5 and C6' are satisfied. To see why this
is so, we begin by pointing out that in the present case
by Equation 7.19b, which guarantees Assumption C1. By Equation

7.17b, we have Ln lihill2 00; therefore, by the Abel-Dini Theorem
=
(Equation 2.27, Chapter 2),
(7.20)
which establishes Assumption C2. Assumption C3 is the same as Equa

tion 7.17a, while Equation 7.19a implies that
therefore, if i, j E In,
Consequently
while
Therefore,
as k -+ 00, by Equation 7.19b.

This establishes Assumption C4. Assumption C5 is immediate, since
the gain vector and the gradient vector are collinear, while Assumption
C6' follows from the nonsummability of lihil12 and the Abel-Dini
Theorem.
OPTIMUM GAINS FOR BATCH PROCESSING 117
Since the hypotheses of Theorems 6.2 and 6.3 are identical in the
linear case, we infer: The recursion defined by the gain of Equation 7.18
generates an estimator sequence that is consistent in both the mean-square
and almost-sure senses, prorided that the conditions of Equations 7.17 and
7.19 are satisfied by the regression l:ectors {hn}.
7.3 Optimum Gains for Recursive Linear Regression. Batch Processing

Under those circumstances where it is more natural to group scalar
observations together and process them in batches, each" observation"
can be thought of as a vector Yk whose mean value is a vector of the
form Hk'6, where Hk' is a matrix whose rows are the transposes of the
h-vectors that are associated with the (scalar) components of Yk' Thus,
,Y
if 1 VI < V < . . . and
Y
=
2
. - [..'" : 1
y
1.2. . )
_ .. +l
(k -
.- .
then
(k = 1.2,).
Now consider the same question that was posed in the early part of
Section 7.1. If the observations Yh Y2,"', Yk- have been used to
form the estimator Sk, and jf
l
1
Rk-l = 21f(sk - 6)(Sk - 6)'
a
is known, which matrix Ak has the property of minimizing
Ifllsk+l - 6112,
where
Sk+1 = Sk + Ak[Yk - Hk'Sk] ? (7.2 1 )
By analogy with Section 7.1, we define
Tk = (Hk'Rk-IHk + J) (7.22)
and
(7.23)
By substituting Equation 7.21 into Equation 7.23 and completing the

square, we find that
1
Rk = Rk-1 - Rk-1HkTk- Ilk'Rk_1
+ (Ak - Rk-1HkTk -l )Tk(Ak - Rk-1HkTk -1 ) '. (7.24)
The last term above is a nonnegative definite matrix whose trace is non
negative. Therefore, 611sk+1 - 8112 a2 tr Rk is minimized when the
=
third term vanishes, that is, when we take

=
-1
Ak Rk-1HkTk .
In this case, we see that
=
-
Rk Rk-1 - Rk-1HkTk lHk'Rk_1.
If the last equation is postmuitiplied by Hk, we find that
-
RkHk = Rk-1Hk(I - Tk l Hk'Rk_1Hk)
-
= R_lHk[1 - (Hk'Rk-1Hk + 1) l Hk'Rk_1llk]
-1 =
= Rk-1Hk(Hk'Rk-1Hk + 1) Rk-1llkTk -1 = Ak
Thus, Cllsk+1 - 8112 is minimized if we take
(7.25)
where
(7.26)
This, in turn, strongly suggests that we give serious consideration to
recursions of the form
(k = 1, 2" .. ) , (7.27)
where Rk satisfies Equation 7.26 for k 1, 2" " and Ro is arbitrary.
=
The analogy to Section 7.1 is so close that it seems reasonable to assume

that some sort of relationship exists between the recursions of Equa
tions 7.25 through 7.27 and 7.9. This is indeed the case. In fact, the
recursion of Equations 7.25 through 7.27 (suitably initialized) yields an
estimator sequence which is a subsequence of the one generated by
Equation 7.9.
To prove this assertion, iterate Equation 7.24 from k back to ko + 1.
We find that
Sk+1 =
[i =Iiko+l
(I R/HJH/> ] sleo +1
+
ttL (I
J= +1 R;II;II;') RjlliY/. (7.28)
]
OPTIMUM GAINS FOR BATCH PROCESSING 1 19
If Rko is nonsingu\ar, it is easy to prove by induction (using Equation
]
7.26) that
k
-1
Rk =
[
Rkol + L H,H/
J=ko+l
(k ko + 1" ). (7.29)= . .
Thus, if j > ko,

1- RJH,H/ = RJ(R, -1 - H,H/) = R,R,-_\. (7.30)
Inserting Equation 7.30 into Equation 7.28, we see that
Sk+l = (
Rk RkolSkO+l + i
J=ko+1
HJYJ ) (7.31)
After substituting Equation 7.29 into Equation 7.31, we obtain
Sk+1 =
[Rka1 J=iko+1 H,H/] -1RkolSko +1
+
+ [Rkol J=iko+l HJH/] -1 ('=ko+l

+ i H,y,) . (7.32)
We now consider two special initial conditions:

ko = 0, (7.33)
and
ko = I,
If the starting conditions are at ko = 0 according to Equation 7.33,
then
(7.35)
You will recall that H/ is the matrix whose rows are h;" h1 + It .
"
h/+
.
1 - It and Y/s components are Y. ,, Y./+ 1

I -1 ' Thus,
H,H/ =
L hth/,
tell
(7.36)
where J, = {v" VJ + 1" " , VJ+l - I}, and

HJYJ = L
tell
Ytht (7.37)
Substituting Equations 7.36 and 7.37 into Equation 7.35, we obtain
(7.38)
Now if we compare Equation 7.38 with Equation 7.1 4, we see that
if Ro = Bo. Thus, the kth estimator generated by the batch-process

recursion is the same as the vkth estimator generated by the scalar
recursion, provided that the initial "covariance" matrices are the same.
Similarly, under the initial condition of Equation 7.34 (assuming III
has rank p), we have
Sk+1 =
[1-111-11' It 1-111-1/] -ltt H1Y1)
+
=
(Vk 2: bjb1') (Vk 2: YjbJ),
+ 1 -1 -1 + 1 -1
1= 1 1= 1
which is identical to Equation 7.15, the least-squares cstimator based
upon the fi rst Vk + 1-I observations.
We now recall that
Bn 1 2 C(t +1 O)(tn+1 0)'

=
CJ
n - -
(cf. the discussion immediately preceding Equation 7.10), satisfies the

recursion of Equation 7.9. Since Sk tvk, it follows from Equation 7.23
=
that the kth element of the recursively defined sequence of Equation 7.26
is identical with the (vk+1 I)th element of the recursively defined
sequence of Equation 7.7, if k 2:: ko and
-
In other words, the Rk matrices can be computed by means of the

recursion of Equation 7.7 , thereby circumventing the necessity of
computing the matrix inversion which is called for in Equation 7.26.
(Actually, the rccursion (7.7) is carrying out the matrix inversion, but it
is being done" painlessly.")
In summary: The batch-processing recursion,

Sk+l = Sk + Ak[Yk - I-Ik'Sk],
with
and Bn satisfying the recursion (Equation 7.7) for n 2:: Vko, generates a
sequence of estimators II'hich (depending upon initial conditions) is a
subsequence of those generated by the recursion Equation 7.9, and,
"QUICK AND DIRTY" BATCH PROCESSING 121
consequently, their asymptotic beharior is determined by consideration of

the condition of Equation 7.l7.
7.4 "Quick and Dirty" Linear Regression. Batch Processing

Preserving the notation of Sections 7.1 through 7.3, we assume that
Equations 7.17 and 7.19 hold, and we consider the "batch-processing"
recursion
Sk+l = Sk + Ak*[Yk - Hk'Sk],
where Ak* is the p x Pk matrix whose column vectors are
Yk is the pIc-dimensional vector whose components are
and the Vk are chosen to satisfy Equation 7.17.

In Section 7.2, we showed that Equations 7.17 and 7.19 imply
Assumptions Cl through C6' , which are the same as Assumptions El
through E6' when f!i' is all of Euclidean p-space. If we can show that
for Vk :::;; n < Vk+h then the arguments of Section 7.2 can be applied to
the present gain vectors, and Assumptions El through E6' can again be
established. The inequality is indeed true. Under Equations 7.17 and
7.19, we have
:::;; 2: Ilhil12 Cl ) 2: II h iII 2

n n
+ h ll2 :::;; ( l
cdn +
1=1 1=1
ifvk:::;; n < Vk+l ' Thus, Theorem 6.4 applies to the untruncated (as well
as truncated) batch-pr .)cessing recursion; therefore, Sk converges to 0 in
the mean square.
We now turn our attention to the question of gain sequences for truly
nonlinear regression problems.
7.5 Gain Sequences for Recursive Nonlinear Regression. The Method of

Linearization
I n most applications , s ome pri or knowledg e is avail abl e concerni ng
the true value o f the reg ressio n parameter. For ex ampl e, i fe repres ents
the vector o f o rbital parameters of a s atelli te, there is a no mi nal val ueeo
w hich is the valuee should be if thi ngs were to go ex actly as pl anned . In
the abs ence of a maj or mis fortune, the actual (unkno wn) value ofe will
be clos e to the nomi nal (k nown) val ue ofeo. If the reg ress ion functi o n is
Fn(e), then we can wr ite
Fn(e) = Fn(eo) + Fn'(;n)(e - eo),
where Fn(;n) is the gradient of Fn, evaluated at some point on the line
segment which joins e and eo. Thus, if the observations Yn are of th e
f orm
Yn = Fn(e) + Wn,
we can wri te
Yn * = Fn'(;n)e + Wn, (7.41)
where
When viewed in terms of the "transformed" observations Yn *, Equation

7.41 looks very much l i ke a linear-regression problem except for the fact
that ; n (hence Fn(;n) and Yn *) is not known. However, ifeo is close toe,
;n must be close to eo and so Fn(;n) is close to Fn(eo). If we let
Yn = Yn - [Fn(eo) - Fn'(eo)eo],
we deduce f rom E quati o n 7.41 that
Yn Fn'(eo)e + Wn (7.42)
( wh e re means "approximately equal" ) . In turn, Equation 7.42
suggests that it would be worthwhile trying the recursive linear
regression schemes developed in Sections 7.1 through 7.4 on the trans
formed observations Yn. That is to say, we "pretend" that
Yn = hn'e + Wn,
where
hn = Fn(eo).
We estimate e b y a recursi ve s cheme of the f orm
tn +1 = tn
an[ Yn - hn'tn],
+
where an is defined in terms of hn by Equations 7.9 or 7.18, and we

investigate the consequences.
METHOD OF LINEARIZATION 123
Actually, since we know that
Y" - h"'t,, = Y" - [F,,(00) + F"'(Oo)(t,, - (0)],
and since the right-hand side is approximately equal to Y" - F,,(t,,) if

t" is near 00 (which it will be if 00 and t" are both near 0), we will be
equally justified in studying the recursion
t,,+1 = t" + a"[Y,, - Fn(t,,)].
This is no great surprise. But by "deriving" the recursion via the
technique of linearization, we have been led to consider two particularly
promising gain sequences:
(7.43)
where
(n 1 , 2, . . ) (7.44)
and
=
(7.45)
The B" recursion is initialized at no, where B"o can be any positive
definite matrix. In this case, in closed form, we can write
if n > no. (7.46)
The other sequence is the nonlinear version of the "quick and dirty"
gain:
(7.47)
where h" is given by Equation 7.44.

The preceding argument was based upon the idea of approximating
the regression function by the first two terms of its Taylor-series
expansion about some nominal parameter value 00, which is assumed to
be close to the true parameter value. If the estimator sequence tn
approaches 0, then, after a time, a "better" nominal value of 0 is
available. Why not, then, evaluate the gradient F" at the most current
estimate fore, and use either the gain of Equation 7.43 or 7.47, but with
h" F,,(t,,)? Such gains are adaptive (depend upon the observations),
=
whereas those with h" = Fn(Oo ) are deterministic. The reader will recall
from Chapter 4 that adaptive gains may or may not be more efficient in
the scalar-parameter case (compare Theorem 4.2), and we feel it is safe
to conjecture that a similar situation exists in the vector case.
If the data are to be processed in batches, then we consider recursions

of the form
S,,+1 = [s" + A"(Y,, - F,,(s,,]jIO,

where Y" is the Pk-dimensional vector whose components are Y.",
Y.,,+h, Y'n+l-1, F,,(s,,) is the Pk-dimensional vector whose com
ponents are F.,,(s,,), F +l(S,,), .. " FVn+ 1 -1(S,,), and An is the P x Pk
matrix whose column vectors are given either by Equation 7.39 or
Equation 7.40, with hk equal to the gradient of Fk evaluated either at
some (fixed) nominal value 60 or at some recent estimate of 6.
The preceding discussion is intended to be informal and heuristic. Its
purpose is to motivate the study of a few particular gain sequences in the
context of our convergence theorems. In subsequent sections, we will
exhibit various sufficient conditions on the regression function Fn(),
which imply Assumptions El through E6' (E6) for the above-mentioned
gains.
Before proceeding, we should point out that these gains (and their
attendant recursions) have been used in practice for a long time. Their
"discovery" undoubtedly occurred by means of the technique of
linearization. However, to our knowledge, the question of convergence
has not been treated before. For ease of future reference, we list the
various gains which are to be investigated in the sequel.
7.5.1. Single-ObserL'Otion Recursion. We have

tn+1 = [t" + aneth t2,"', t")(Yn - F,,(tn]jIO,
=
:F,,(!; ) (7.48a)
a" n ,,
2
j=1
II:Fi(;j)112
an = B,,:F,,(;,,), (7.48b)
where
and, for eachj, !;j maps (th t2, .. " tj) into (ffJ. The gains can be classified
as deterministic if ;j(th t2, " tj) = 60 E [j! for all j; adaptire if

!;j(th t2," ', tJ = tj (j = 1, 2,,,); quasi-adaptiL'e if !;lt h t2," ', tj) =

t,,( jl> where {n(j )} is a nondecreasing sequence of integers with n(j) =::;; j.
7.5.2. Batch-Processing Recursion. Let 1 = "1 < "2 < . . . be a

sequence of integers chosen so that Assumption E3 holds. Let
Sk+1 = [Sk + Ak(s1, " ', Sk)(Yk - Fk(Sk))]jIO,
SUFFICIENT CONDITIONS FOR El THROUGH E6' 125
where
and
Here, we consid er the g ai ns
(7,48c)
and
(7,48d)
In the present case,
map (S1o . "
Deterministic if !;v/c
Adaptive i f !;v/c
=
=
00 E f/J
Sic
}
Sk) i nto f/J. The g ains can be clas sified as
(k = 1, 2" , ,),
Quasi-adaptirJe i f !;v/c = Sn(k)
where n (k) is a n oild ecreasi ng i nteg er sequ ence with n (k) k.
7.6 Sufficient Conditions for Assumptions El Through E6' (E6) When

the Gains (Equations 7.48) Are Used
THEOREM 7.1
L et {Fn ()} b e a s equen ce of real-valu ed fun ctions d efin ed over a
p -di men si on al closed convex set f/J. W e as sume that each Fn{) has
b ound ed, second- order mixed partial d erivatives ov er f/J and that for
some x E f!/, the foll owing condi tions hold true.
F l. lim su p IIGn{Y)lI/llhn*1I < 00,

)'e9'(p)
where
[ ]
is the p x p matrix whose ith column is
02Fn(,F,,)loglog,
2Fn(,F,,)log2og, (i 1,2" , ', p)
=
2Fn(,F,,)logpog, ; = y,
and
satisfies the conditions:
IlblBI

Fl. lim
.....
lib.*ral = O.
/-1
..
..
F3.
-1
lib.*111 00
F5 . There is a sequence of integers
1 = Pl < PI < ... ,

with
such that
.
1im I n f! \
k ... ", Pk. "mIn ieJ"
( ""
L. II hi*112
h/h/' - *2 )- 7" >
0
,
where
Jk = {Vk' Vk + I" ", Vk+l - I}.
Let r(f?) be the rad ius of the smallest closed sphere containing f?
a. For th e gains of Eq uations 7.48a, c, A ssumptions El through

E6' h old if r(9') i s sufficiently small. ( I n th e case of batch
pr ocessing, we i ntend th at the b atches of d ata corr espond to the
index sets i nd uced by A ssumpti o n F5 . Th us, the k t h batch ofd at a
consists of { Yj;j E Jk}.)
b. If As sumptions F2, F 3, and F4 are str eng thened to
F2' . Kln6 IIhn*1I K2n6 for some positiv e K1, K2, and S, then
A ss umptions El through E6 hold for th e g ains of Eq uati ons
7.48a, c if r(9') is s ufficiently s mal l .
c. For the gains of Equations 7 .48b, d, A ss umptions EI through
E6' hold i f we assume, i n add ition to Assumptions FI through
F5, that
4(-r*/ K qa*) 2 1 - (p/q) -r*2

[1 + (-r*/Kqa*)2]2
F6 .
> 1 _ (p/q)-r*2 + (l/pqK8Q)-r*6'
provided r(gIJ) is sufficiently small, where
a*2 10
1m sup -
1 \ ( " h/h/*' )
I lhI*112 .
=
k -"O Pk
" max
L..
le/k
d. For the gains of Equations 7.48b, d, Assumptions FI,

F2', F5, and F6 together imply Assumptions EI through E6,
provided that r(gIJ) is suitably small.
Comment: In practice, the true parameter value 6 is not known

exactly but is generally known to lie in some neighborhood f!lJ of a
nominal value 80 In such cases, the vectors hn* would most naturally be
chosen equal to Fn(60). Theorem 7.1 says that Assumptions EI through
E6' will obtain under various subsets of Assumptions F 1 through F6 if f!lJ
is "sufficiently small." Just holV small could be specified quantitatively,
as will be seen in the proof. H owever, since a quantitative bound is so
complicated and conservative that we feel it contributes little to our
understanding, we do not include it. The purpose of Theorem 7.1 is to
furnish a set of relatively easy to understand conditions that furnish
insight into the circumstances under which our estimator recursions
will converge, always subject to the proviso that r(f!lJ) should fall below
some threshold value.
Proof of a. If E f!lJ, we can write Fn() hn* + Gn( - x), where =
Gn is the matrix of Fn's second-order mixed partials evaluated at

various points in f!lJ. If we let
(7.49a)
it follows from Assumption FI that
sup Ilr n() II C 1 r(f!lJ) lihn* II (7.49b)

liea'
Since
it therefore follows that
for all E f!lJ if r(f!jJ) is chosen small enough to ensure the leftmost
inequality. By Assumption F4, we have
(7 . 5 1 )
therefore, for the gains of Equations 7 48 a, . c, there exist positive

constants C2 and C2' such that
11
C2(1 - C1r(&I'llhn*11 :s; Ilanll L

J=l
Ilh/11 2:s; C2'(1 + C1r(&I'llhn*11
(7.52)
u n iformly in an' s ar gument for all n. Assumptions E l , E2, E4, and E6'
n ow foll ow w hen Eq uations 7 .49 thr ough 7 .52, F2, F3 , and the Abel
Di n i theor em (2.27) ar e combined i n w hat is, by now , routine fashion.
To pr ove Assumption E5 f or the gains of Eq uations 7 . 48a , c, w e
n otice that
therefore, by Eq uation 7 .49a , w e have

a/FlY) hl'hj* + r/(y)rj(j) + (r/(y) + r/(fh/ (7
.53)
Ilhj *+ rJ(J) IIllh/ + rJ(y) II

IlaiM IIFb)II =
Multiplying the numerator and denominator of the last expression by

Ilh/I!2, we find that
a/Fly) _
II a, II IIFiy)!! -
Using Schwarz's Inequality and Inequality 7.49b, we see that
and, by Equati on 7 .49b , that
IIh/1l C4r(&I')
1
IIh/ + ri()11
-
for all E f!/, provided that r(&I') is suitably small. Thus, the left-hand
side of Equation 7.53 is bounded below (uniformly) by (\ C3r(9' x -
(1 - C4r(.9'2, which can be made arbitrarily close to one by taking

r(.9') small enough. Since Assumption E5 requires the left-hand side of
Equation 7.53 to be bounded below uniformly by some number which
is strictly less than one, Assumption E5 will therefore hold if r(.9') is
suitably small.
Assumption E3 does not depend upon the particular gain sequence,

and so we prove it now, once and for all, as follows:
where
By Equation 7.49b, if r(&) is sufficiently small, we have
Ilh/il
Ilh/ + rill - 1 + rJ, (7.55)
_
where IrJI :::;; C5r(f3IJ). So, if e > 0 is given,

mln (10) mln
( h *h 2*' ) tr Irillh*h/* 2*,
1 6 1 /1 - 16" l 1
"
p( * - e)
T2 - q C5r(&)
by Assumption F5 if k is suitably large and r(f3IJ) is suitably small. Here
2 is a nonnegative d e finite matrix. From the Courant-Fischer character
ization of eigenvalues, Schwarz's Inequality, and Equations 7.49b and
7.55),
min (3)
2: Ilr/llllh/11
- - 2 JeJ" IlrJ + h/112
>
*
=
- 2 '1;
1 " Ilt11I Clr} /I,r - Csr(&).
(where the infimum is taken as the x/s vary over p-space).

Since e is arbitrary, we see that
(7.56)
provided that r(&) is appropriately small. This proves E3.

Proof ofb. Since Assumption implies Assumptions F2'

through F2
F4, Assumptions El through ES hold as before and Equation 7.S0 can
be strengthened to
K1(1 - C1r(f!IJn6 I F,.(;)I K2(I C1r(f!IJn6,
::s; ::s; +
for all ; E f!IJ. Thus,

K3n6 = 0 (1+6
n )
1
Ila,.1 1 ::s;
LP6
-,.--
1=1
which implies Assumption E6.
Proof of c. For the gains of Equations 7.48b, d, we have
'\mln (Bvk+ l-l)I Fn(;n)1 Ilanll '\max (Bn)IIFn(;n)ll,

::s; ::s; (7.S7)
where
(m = n, "k+1 - 1),
and k is chosen so that
By Equation 7.490, w e see that
['\max C/l-l (h/ + fi)(h/ fi)' R)] -llllhn*11 - Ilfnlll I I

+ + ::s; a,.
['\mln ttl (h/ fi)(h/ fi)' R)r\l hn*11 Ilfnll) . (7.S8)

::s; + + + +
By Equation 7.49b, w e have
Ilhn*11 - Ilfnll (1 - C1r(f!IJl h,.* I , and Ilh,.*11 I f,.11

+
Furthermore, since R
is assumed to be nonnegative definite and since
(A + B)
'\mln '\mln
(A) + '\mln (B)
for symmetric matrices,
i=l (h/ + fi)(h/ fi)' + R)

'\mln (i +
'\mln Ct h/h/' ) - 2 it Ilrillllh/il

'\mln ct h/h/' ) - 2C1r() it1 I h/1 12.

SUFFICIENT CONDITIONS FOR El THROUGH E6' 13 1
By Lemma 7b, if Il > is given,
>'mln (Jl hJ*h/') C*;,; ) Ct h/h/' )

Il
>'max
; ( T*;'; Il) Ilh/1 2

it
if nis large.
Consequently,
>'mln Ct (h/ + rJ)(h/ + ri)' + R)

[(T*;K}Il) - 2Cl r()] it Ilh/1I2.

(7.60)
Combining Equations 7.57, 7.50, and 7.60, we obtain
Ilanll [CuK}Il
) - 2Cl r()] C1r(Y' n I hn*1 , (7.61)
-1 (l +
i=Ll I h/1 I2
p
when is large (the result holding uniformly in an's argument, of

n
course). By Assumption F3, Equations 7.50, 7.61, and the Abcl-Dini

Theorem ( 2 .27), Assumption E6' holds, while Equations 7.50, 7.61,
and Assumption F 2 imply Assumption EI.
On the other hand, since n EJk, we have
iLm1 I h/ rJI12 + tr R
Yk + 1-1
5 Ct i J ( llh/1 2 + 2 l1rill l h/1l I ri l 2)

+ + + tr R
5 (1 + C1r(2 (i I h/1I2 + max

J= 1 i eJ,,; I h/1I2)
q + tr R. (7.62)
By Equation 7.51,
max
ieJk I h/1I2 K2Qllh,,*112 5
and by Assumptions F 2 and F3, if is given, we have

Il >
tr R + q{ l + Clr(2K2qllhn* 1 2
n Il <
2:
J=1 I h/1 2
for large n. Combining Equations 7.58, 7.59, and 7.62, we find that
if n is large. Equation 7.63 and the Abel-Dini Theorem imply Assump

tion E2. To prove Assumption E4, we notice that, for n E J,,,
(7.64)
By Equation 7.51, we see that
max Ilb/1 2
Jeh
min Ilb/1 2:S;
K2q, max
Jeh
IlbJ* 1 2 :s; Ilb:kI12K29,
Jeh
and, by virtue of Assumption F2,
Ilb:kl12 II
/1 2 - qK2q
-::-.::.....;.::..:!-- <-
2
J-1
Ilb
if k is large (that is, if n is large). Thus, p (as defined by Assumption E4)
is bounded above by
(I C1r(f1' )2 4 [ (l - CLr(f1'2 + ] (I )
- C1r(f1') (pK q) T*2 - 2pK2QC (f1')
+ II
+ II
1 1r _ 211
Since II is arbitrary,
(l + C1r(f1'4(pK49)
p :s; (l _ C1r(f1'2(T*:a 2pK29C1r(f1'"
_
(7.65)
This establishes Assumption E4 and will play a role in the proof of

Assumption E5.
By Equation 7.490, we can always write

F,,(y) = F,,(;,,) +
r"*(;'" y),
where
r"*(;, y) = r,,(y) - r,,(;)
satisfies
Ilr,,*1I 2C1r(9) llh,,*11
for all ;, y E 9.
Thus, for the gains of Equations 7.48b, d,
a,,'F,,(y) F,,'(;,,)BmF,,(;,,) + F,,'(;,,)Bmr,,*
_
lIa"IIIIF,,(y)1I - IIBmF,,(;,,) IIIF,,(;,,) r,,* 11

+
> [ F,,'(;,,)BmF,,(;,,) ][ IIF,,(;,,) II

+
]
- IIBmF,,(;,,) I'IIF,,(;,,)1I II F,,(;,,) II I r,,*1I
*
IIr,, II IIBmF,,(;,,)11
{n,
where
for Equation 7.48b,
m-
P l-b for Equation 7.48d.
By Equations 7.SO and 7.66,
IIt.<t.> II + IIr,,*1I (I 3C1r(9lIh,,*II,

+
and
I IF,,(;,,)II 2! (I - C1r(&'lIh,,*1 I ,
Letting
by Lemma 70 we have,
F,,'(;,,)BmF"(;,,) > 2/e % 1
m ( +
/em) -1
IIBmF,,(;,,) 1I IIF,,(;,,)II - '
Thus, we see that
I . f
imin . f
in a,,'(x)F,,(y)
" xea'(n'yea' II a,,(x)II IIF,,(y)1I
2!
..
hm.: nf 2/em%( 1 + /em)-1
(I - 2C1r(Y'9)
1
2C1r(&')
+
3C1r( ) - 1 _ 3C1r(.o/')
2! lim inf 2/em%(1 +
/em) -1 - Car(&') (7.66)
m
if r(&') is small.
If we can show that
lim inf K"

n-+ClO
(K:**)2 - C9r(fl'),
a
(7.67)
then E5 will follow if r(fl') is small enough. This is so because
72 () 7U - C10r(P/), and p ( p::: q) + Cllr(P/),
when r(P/) is small, by virtue of Equations 7.56 and 7.65. Since

(l - 72)/(1 - r2 + 72/p2) decreases with r2 and increases with p2, it
follows that
1 - 72 1 (p/q)r*2
{P/q)7*2 + (1/pqK8q)r*8 C12r()
-
+
1 - 72 + 72/p2 1 _
if r(P/) is small.
On the other hand, if Equation 7.67 holds,
li

nf 2K,,%( l + K,,) -1 (;:*) [ (K::*rr 1 - C13r(P/),
l +
and so, by Assumption F6,
li!!'f2K.%(1 + J-l > (1 -+rlpl)%

_
if ret/') is sufficiently small. Thus, ES follows from Equation 7.66, when

r(t/') is small.
To prove Equation 7.67, we note that
A..s. (8. -1) A..s. (B. 1)

i
"
-
1
1
A..s. r '5' (Il
Ls;,. l
+ rj)(bl + r/)'] if m eJIe
By virtue of the fact that
A..s. (B. -1) = min x'Bm-lX,

IIltll=1
we see that
A..s. L. (hl + r/)(hl + f/)' ]

- mm Ilh *1 2 "mm [(h/ Irl)( h /2 rl)'] + +
> .
ieln
I
\ .
lh * 1

leln I
. I h l* I 2 {mln [h/
h/'] + mln [rih , *' + h/r/]}
fl Ilh/1 12 i fl
n Ilh /11 2 i n
m . Ilh *11 2 { min [h /h/'] 2 IIr,ll

m
ieln
I
I lh * 1 1 2
leln Ilh *1}1 .
I - leln
I
(We used Schwarz's Inequality in the last step.)

By Equation 7.49b, we therefore have
"min (Bm - 1) kf min Ilh/1 2 {"min [ L: hl /jl :] - 2qClr()} .

,,=1 lei. lei. h
I
Since
we see that
"min (Bm - 1)
min
,,= 1 lei.
Ilh/1 2 {"min [ L: hl hj *JI :] - 2qClr()} -
lei. 1
q min
leh
Ilh/1 2
if m eJk.
In much the same fashion,
"max (Bm - 1) !S:
max Ilh/112{"max [ L: bl ;*J :]

n=1 lei. h I lei. I
+ q[CI 2r2(.9') + C1r(.9')] } "max(R)
+
if m Elk.
Thus, if m eJk, we have
"min (Bm -1)

Km = "max (Bm -1 )
By virtue of Assumptions F3, F4, and F5, the sums in the denominator
and numerator approach +00, while the ratio of the second term in the
numerator to the sum in the denominator approaches zero by Assump
tion F2. Using the discrete version of L'Hospital's rule, we find that
lim inf Km
m .... ""
and by Assumptions F4, F5, and F6, we see that the last is greater than
or equal to
(C1;qr - Cgr([ljJ),
which proves Equation 7.67.

The proof of part d is in the same vein as b, and we leave the details to
the reader. Q. E.D.
7.7 Limitations of the Recursive Method. III Conditioning
In the parlance of numerical analysis, a matrix. HH ' is said to be ill

conditioned if
.\max ( HH')/ .\mln ( HH')

is large but finite. The column vectors of such matrices are "just barely"
linearly independent, and, when one tries to compute the value ofx that
minimizes
liz - H'x112
(that is, x = ( HH')-1 Hz), one finds that the numerical solution is
ex.tremely sensitive to round-off errors (compare Householder, 1964,
Chap. 5). The notion of ill conditioning extends naturally to the large
sample theory of recursive linear regression if we call a linear-regression
function (actually a sequence of regression vectors) {h Ill ill conditioned
whenever we have
lim sup
.\max (11 hjh/)
.\mln (2: hjb/ )
= 00. (7.68)
...... co
Jm1
This extension of the terminology is a reasonable one; for, if observa

tions are made on a process of the form
(n = 1 " 2 ...) ,
and if we attempt to estimate I) recursively (by means of Equation 7.9), it
is necessary to compute BII = (Li 1 hjb/) -1 at each step of the recur
=
sion. If {hIll is ill conditioned, this computation becomes increasingly

unstable with regard to round-off errors.
If {hll} is ill conditioned, this does not preclude the possibility that
.\max (B,,) -+ O. Such situations are very perplexing from the practical
point of view. On the one hand, theoretical considerations lead us to
LIMITATIONS OF RECURSIVE METHOD 1 37
expect consistency from the recursively computed least-squares

estimator (cf. Section 7.1). On the other hand, numerical considerations
can easily cause the recursion to generate a nonsensical output.
The "classic" instance of such a situation arises in polynomial
regression wheref,,(O) = 2f=o 8inl and
1
In fact, polynomial regression is a particular instance of a more general

class of ill-conditioned regression functions:
THEOREM 7.2
If 2" IIhnl12 = 00 and l im" h"/lIh,,1I = h, then {hIll is ill conditioned.
(We defer the proof till the end of this section.) For instance, if
it is clear that
b" [0].
IIh"lI
-
I'
therefore, Theorem 7.2 applies. At the same time, we have
The first factor on the right-hand side is less than a constant times n - 4;
therefore, tr B" = O( Iln) _ 0. In cases such as these, we can only advise
the practitioner to exercise extreme caution in designing his computa
tional program.
In light of Lemma 7b, ill-conditioned linear-regression functions must
necessarily t'iolate at least one of the hypotheses of Theorem 7.1. If, in
particular, the regression is ill conditioned owing to the fact that
b"
L IIh,,1I2 = 00, and -h
" IIb"lI ,
it follows that
(n+k b.b'' ) =
lim
n --oo Am1n 2:n Ilh j 1'12
i=
kAm1n (bb') = 0
for any k, which means that Assumptions C3, D3, and 3 of Chapter 6
are I.:iolated. This of itself does not preclude consistency (for example,
least-squares polynomial regression). However, the theorems of
Chapters 6 and 7 don't apply. In particular, the "quick and dirty"
recursion applied to polynomial regression cannot be shown to be
consistent.
These observations apply even more strongly to the case of nonlinear
regression. A nonlinear regression function exhibits the pathology of ill
conditioning if Equation 7.68 holds when bn is the gradient of the
regression function evaluated at the true parameter value.
Proof of Theorem 7.2. Since det A is equal to the product of A's
eigenvalues, it must be that
det A Amax (A)[Amln (A)],, -1
if A is p x p and nonnegative definite. On the other hand, we see that
Amax (A) !p tr (A).

Combining these results, we find that
-[ ] -G r.
tr A)
Aax (A) Amax (A) P-1 > (
det (A)
(7.69a)
_
Alnl (A)Amax (A) Amln (A)

In the case at hand, we can write
bn/llhnil = b + rn, where IIrnll-+ 0;
therefore,
det ( bJb/) = [ IIhill2 (hh' + rib' + br/ + rJr/)] .

im1
det
'=1
Since
it follows that
(7.69b)
RESPONSE SURFACES 139
B ut
Since
the discrete version of L'Hospital's rule applies. Since \lrn\\-+ 0 , it

follows that \\Rn \\-+ 0 , from which it follows that
det (hh' + Rn] -+ det (hh'] = O. (7.69c)
We combine Equations 7.69a, b, and c, and the theorem is established.

Q.E.D.
7.8 Response Surfaces
Until now, we have motivated recursive-estimation procedures by

considering regression problems in the setting of time-series analysis. It
i s in these applications that the demands of .. on-line" computation
make recursive-estimation techniques particularly attractive. In such
cases, the regression function is typically of the form
Fn(e) = F(tn;e) (n = 1,2,),

where, f or each e, F( ; e) is a continuous f unction of t ime, a nd
(7 .70)
are the sampling instants. The large-sample properties of recursive
estimation sequences are determined by the analytic properties of
F( ; e) for large values of t.
However, the scope of regression analysis also embraces experimental
situations where the regression function is of the form F(t; e), t now
denoting a (possibly abstract) variable that the experimenter can choose
more or less at will (with replication if so desired) from a certain set of
values. In particular, the constraint of 7.70 is not present. In fact, the
values of the independent variable t are usually chosen from a set that is
bounded (in an appropriate metric) or compact (in an appropriate
topology). For example, F(t; e) might be the mean yield of a chemical
process when the control variables (temperature, pressure, input
quantities, and so on) are represented by the vector t and the external
variables (not under the control of the experimenter, and, indeed,

generally unknown) are denoted by O.
In such cases (where t is a finite dimensional vector), the regression
function describes a surface that is indexed by 0 as t varies over its
domain. This surface is called a response sur/ace, and if 0 is not known,
the job of fitting the proper response surface to data Yl o Y2 , Y3, ' "
(which are noisy observations taken at settings tl o t2, t3, of the

independent control variable) is equivalent to choosing the "correct"

value of 0 on the basis of the noisy observations
Y" = F(t,, ; 0) + W" (n = 1, 2" , , ),
In most cases, the experimenter wishes to estimate 0 once and for all
after all the data have been collected. One could apply the recursive
method, but its chief selling point, the availability of a running estimate
for 0 , is of no great value. However, when questions of sequential
experimentation arise, this feature regains its allure.
For example, suppose the experimenter wishes to determine the
correct response surface, and suppose he can make observations using
either of two different experimental procedures. Under the first pro
cedure, his observations take the form
V" = F(t,, ; 0) + v" (n 1, 2" , . ),
=
the v" being independent, zero-mean measurement errors with variance

Uu2 . The second procedure generates observations of the form
U" = F(t,. ; 0) + u" (n = 1, 2, , , . ),
the Un being independent, zero-mean errors with variance Uu 2 . If Uu2 and
uu2 were known, and if data were costly to obtain, it is clear that a
sophisticated experimentalist would choose the observation procedure
with the smaller variance and use it exclusively. However, if the vari
ances are not known a priori, a sensible thing to do is to allocate some
experimental effort to estimate Uu2 and Uu2 and then sample from the
population with the lower variance estimate. Or, one could proceed
sequentially, sampling from each population according to the outcome
of a chance device, whose probability law increasingly favors the
population with the lower variance estimate. Such a procedure requires
a running estimate for 0 in order that the variance estimates be com
putable after each new observation. [Actually, the sequential design of
experiments demands a far more sophisticated approach, but the
present oversimplified procedure suffices to motivate the application of
recursive methods to the fitting of response surfaces. The interested
reader is advised to refer to Chernoff's paper ( 1 959) for a proper
introduction to sequential experimentation.]
RESPONSE SURFACES 141
From the theoretical point of view, the most appealing feature of the
recursive method, applied to the determination of response surfaces, is
the wide class of regressions (apparently much larger than those in
time-series applications) that satisfy the hypotheses of Theorem 7. 1 .
The following theorem demonstrates the great simplifications that ob
tain when the independent variable t is constrained to a compact set.
THEOREM 7 . 3
Let !'/ be a compact set, let &' be a convex, compact subset of p
dimensional Euclidean space, and suppose that F( ; . ) is a real-valued
function defined over !'/ &', having the following properties :
G I . 0 2 F/oO,001 exists and is continuous over !'/ &', and
G2. II F II is continuous and positive on !'/ &',
where F is the column vector whose components are
8F
(i = 1 , 2, , , ' , p).
80,
Let tb t2 , . . . be a sequence of points from !'/ a nd let
F,,(x) = F(t,.; x).

G3. If there exists a sequence of integers
with
such that
k-..
lim inf det
o
( 2 F/(X)F/(X)
l e i"
= D2 > 0
for some x E &', then Conclusion a of Theorem 7. 1 holds.
Proof. (J 2 F/oO,oOI) is continuous on !'/ &', which is compact;

therefore, I () 2F/()0,OOI) I is uniformly bounded. Therefore I ( 8 2F,,/80,8 (1) I
is uniformly bounded in i, j, and n. The continuity and positivity of II F II
over !'/ &' implies the existence of positive Kl and K2 , such that
Kl s IIF,,(x) II S K2
for all n and all x E &'. These facts establish Assumptions Fl through F4.
If B is p x p and nonnegative definite, we find that
det (B) S .\mln (B)'\-; (B) S .\mln (B) [tr (B)],, - l .

1 42 COMPLEMENTS A ND DETAILS
Since
and since
we conclude that Assumption F5 holds if Assumption G3 holds.
Commen t : The set :T may be abstract, compact with respect to an

arbitrary topology, provided that F( ; ) is continuous on :T (8) in
.
the induced product topology. However, in most (but not all) applica
tions, :T will be a closed bounded subset of some finite dimensional
Euclidean space.
We close this chapter by exhibiting examples of regression functions
of the form F,,(6) F( I,, ; 6) which violate the conditions that justify the
=
recursive method if tIl 00, but which satisfy the conditions of Theorem
7. 3 if the tIl are chosen appropriately from a finite interval.
Example 7. 1 . It has been shown in Chapter 2 that the regression

F( t,, ; 0)= cos 01" or sin 01" violates the conditions of Theorem 2. 1
if tIl = n (or m) . However, if is known to lie in an interval 0 < " 2 =:;;
=:;; fJ2 and {tIl} is a suitably chosen sequence, the difficulty disappears. In
fact, to make the problem more interesting, consider
F(t,,; 6) = 01 sin 02 1",
wh ere
o < T1 = inf I" < s up I" = T2 < 7T/2fJ2 ,

n n
a nd
The function
F(I,6) = 0 1 sin 02 1
sa tisfies Assumptions G l a nd G2 o ve r :T (#J i f we take :T = [Th T2 ].
On the other ha nd, a little a l geb ra shows that
d et [F2" + l(6) F" + l (6) + F2,, (6) F ,,(6) ]
-
_ [124022 (1 22 n + 1 - 1 2 ")] [
--
Sin 02(12,, + 1 - 12,,)
0 2 (1 2 ,, + 1 - 1 2 ,, )
sin O 2(12,, + 1 + 12,,) 2
-
0 2 (t 2 ,, + 1 + 12 ,,)
]
The first factor is larger than 1(" 1 " 2 ) 2 T1 T3 If we let

ILn = 02 (t2,, + 1 - 12,,) ,
R ESPONSE SURFACES 143
a nd
we see that
o < IX 2 Ts S IL" S w" - 2IX2 Tl < w" S 2T2f32 < 77.
Since sin e/e has a negative derivative which is bounded a way from zero
in the interval [IX2 TS' 2T2f32 ], we find that
m
. (--
Sin IL" sin w,, 2
- -- > 0. )
" IL"
f
w"
This establishes Assumption 03 if we take II Ic = 2k - 1 (k = 1, 2, ).
Example 7.2. It was also shown in Chapter 2 that the exponential

regression eOtm violates the conditions of Theorem 2. 1 in an essential
way if t" = nT. B ut consider the more general regression
F(t,, ; (J) = e8l!" + e 8 a! ,. ,
a nd suppose it is known that - co < IXl S (J l S f3 1 < IX 2 S (J2 S f32 < co .

I f the sampling i nstants a re chosen so that
o < Tl = inf t" < sup t1l = T2 < co,
" 11
a nd
then Assumptions G l a nd G2 hold for the function

F(t ; 8) = e8l! + e8at
defined over
Furthermore,
det (F2 " + lF2 " + 1 + F2"F2 11)
= (t211 + 1 t211) [exp 2 (J2t" + (J1 t211 + 1)][ 1 - exp (J2 - (J1)(t211 + 1 - t2,,)]
2 2
T14[e xp 2(IX 2 t" + IX1 t211 + 1) ] [1 - exp (IX 2 - f3 1)Ts] .

2
Since {t,,} is a bounded sequence, the second factor is bounded away

from zero ; therefore, Assumption G3 holds with VI< = 2k - 1 (k =
1 , 2" " ) .
Another generalized version of the e x ponential regression is given by
F(t1l ; 8) = (J1 eBat,. .
If we assume that
144 COMPLEMENTS AND DErAILS
where the a's and /fs can be positive or negative. and if the sampling
instants are chosen so that
Tl == inf l. < sup t. == Til. and o < Ta == inf (Ia. + 1 - 'IIJ.

then
F(I ; 8) - 'lei,}
satisfies Assumptions 01 and 02 on [Tl Til] 8'. Moreover. we see
that
det (flla + l"a + l + [

tll.faJ - '111 exp [2''(IIIa + 1 - laJ1 (Ia. + l - IJ ]
alIlTall exp 2aa(llIa + l - laJ,
and the last is bounded away from zero since the In are bounded. Thus,
Assumption G3 holds for this regression as well.
Example 7.3. In Section 7.7, we showed that polynomial regressions
2 O/Inl
"
F(tn ; 6) =
1-0
fail to satisfy Assumptions C3, D3, and E3 of Chapter 6 if tn -+- 00.
However, if the sampling instants are suitably chosen from a compact
set, this difficulty also evaporates. To illustrate this, consider the case of
a first-degree polynomial
F(t ; 6) = 00 + Olt
sampled at times {In} in the interval [Tl' T2 ] and having the property
that
Letting
Fn(6) = F(ln ; 6),
we find, as usual, that
det (F2 n + l F n + l + F2nn) =

2 T3
(12n + 1 - 12n)
Assumptions G l and G2 are satisfied over any compact 6-set ; therefore,

the conclusions of Theorem 7.3 hold. In particular, notice that the
problem is no longer ill conditioned.
In closing, we point out that all three examples require that samples
be made over a bounded interval, in such a way that
RESPONSE S URFACES 145
One such scheme (defined over the interval [0, 1 ] , with T3 = t) chooses
11= -1" and
(j = 1, 2, . . " 2 k - 1 ; k = I, 2, . . . ) .
8. Applications
Before we can apply the recursive methods described in Chapters 6

and 7 to particular regression problem,s, several decisions must be made.
Should the data be processed one observation at a time or in batches?
I n the latter case, what should the batch sizes be '! Should the recursion
be truncated? Which type of gains should be used, deterministic or
adaptive?
At this writing, definitive answers to these questions are not available.
It seems clear, however, that each issue should be weighed in the context
of the consistency and computability of the resulting estimator
recursion.
Consistency (in either the mean-square or almost-sure sense) is the
most important consideration. A procedure for which consistency
cannot be established (or conjectured with high certainty) should be
held in less esteem than one to which the theorems of Chapters 6 and 7
apply. This comment is particularly relevant to the decision concerning
truncation. For example, suppose that a particular gain sequence is
being contemplated and that one or more of Assumptions Cl through
C6' of Theorem 6.2 are violated. However, suppose it is known a priori
that the true parameter value 6 lies inside a given sphere fY!, which is
sufficiently small so that Assumptions E I through E6' are satisfied. (This
situation often arises in practice.) If an untruncated procedure were
used, convergence could not be ensured. However, a batch-processing
recursion, truncated over Y' with batch sizes chosen so that Assumptions
146
APPLICATIONS 147
E3 through E5 hold, does converge to a in the mean square (by Theorem

6.4). If data-processing considerations make batch processing un
feasible, the single-observation recursion, truncated over &J, appears to
be the natural alternative in such a case. Although the theorem con
cerning convergence of truncated single-observation recursions is
conjectural, we are confident enough in its validity to feel safe in
recommending it under the above-mentioned circumstances.
In some applications, single-observation recursions may be dictated
by cost considerations. If observations are very expensive and we are
estimating a as part of a sequential hypothesis-testing procedure, the
single-observation recursion is the natural one to choose. In other
applications, the data may be collected in batches (for example, multiple
sensors reporting simultaneously) and should be so processed.
If one is confronted with a situation where a free choice exists, we tend
to favor batch processing. From the theoretical point of view, it would
appear that once the gains are decided on, the batch sizes should be
chosen to make Assumptions E3 through E5 hold. In all likelihood,
though, the choice of batch size is of no practical consequence and can
be chosen purely for convenience (ideally, though, as large as computa
tionally convenient).
The considerations governing the choice of gain sequence are clear
cut in the case of linear regression but not so well defined in the non
linear case. In the linear case, the gains of the type 7.9 or 7.39, depending
on whether the single-observation or batch-processing recursion is used,
are preferred unless the rate of data acquisition is so high that the data
processing facility is swamped. In this case, the" quick and dirty" gains
of Equations 7.1S and 7.40 yield estimator sequences which can be
computed more quickly and thus can better keep pace with the data.
One sacrifices statistical efficiency by doing so, though, and i n order to
guarantee consistency, two additional conditions (Equations 7.19a, b)
must be verified.
For nonlinear regression, the decision is more delicate. Referring to
Theorem 7.1, we see that the "quick and dirty" gains (7.4Sa, c) yield
convergence under a set of conditions which are weaker than those
required by the "linearized least-squares" gains (7.4Sb, d). If the
additional Assumption F6 can be verified, the choice between gains of
the type 7.4Sa, c and 7.4Sb, d involves the weighing of efficiency versus
computability. (The "linearized least-squares" gain probably yields a
more efficient recursion, since it entails more computation.) If, on the
other hand, the extra condition cannot be verified, the "linearized
least-squares" gains may not yield a convergent estimator. Thus, the use
of the "quick and dirty" gain is the conservative course of action.
148 APPLICATIONS
Alternatively, one might use the " quick and di rty" gain initially to get
things started and then switch to the other type of gain, under the
supposition that the linearized version of the problem is, by then, an
adequate approximation. This a ppro ach can be investigated analytically
i n the spirit of the present work, but we will not pu rsue it further.
If the " linearized least-squares" gains of Equations 7. 48b, d are to be
used, the results of the scalar-parameter case presented in Theorem 4.2
for Gains 2 and 3 show that we cannot state a priori that the adaptive
version will be more efficient than the deterministic version (as one
mi ght expect). At this time, we can offer little in the way of guidelines
for choosing between adaptive and deterministic linearized least-squares
gains. However, adaptive gains must be compute d after each cycle of the
recursion and so, if pressed for time, we may be compelled to resort to
the quasi-adaptive or deterministic versions. On the other hand, if
"quick and dirty" gains are being used because of time considerations,
the sensible thing to do is to use the deterministic versions. These can be
stored in memory and need not be computed in real-time. If the " quick
and dirty" gains are being used because Assumption F6 of Theorem 7.1
cannot be established, the adaptive version might conceivably speed up
the convergence rate somewhat.
We will now display some examples and show, in each case, how to go
about verifying the conditions which will guarantee consistency of the
recursive-estimation procedure used.
8.1 Vector Observations and Time-Homogeneous Regression
Example 8.1. Suppose the observations are an r-dimensional vector

stochastic process of the form
Yk = f(8) + Z/. (k = 1 , 2" , . ) ,
where the components of the residual vectors have uniformly bounded
variances. An estimate for the unknown p-vector (p :::; r) a is sought .
The " classical " approach would involve estimatin g the mean-value
vector f( a) by
A I "
f" - L Y,
n 'l
=
and solv ing (by least squares perhaps) for the value of e" that "comes
closest" (in some sense) to making the equati ons
f(On) = e"
work.
TIME-HOMOGENEOUS REGRESSION 149
By contrast, the recursive approach estimates directly :

Sk+1 = Sk + AdYk f(sk)]' -
0
The batch-processing recursion is the natural one to use when the
"observations" are vectors to begin with.
On the other hand, if the components of Yk are observed one at a
time, we could write
and
f(O) =
[f1(0)]
: ,
fr(O)
so that
Yll = FIl(O) + ZIl (n = 1, 2,,, . ),
where
(i = 1 , 2", ',r; k = 0, 1" . .). (8.1)
I n this case, we could justifiably consider the single-observation recur
sion. However, for the purposes of this example, we confine our atten
tion to the batch-processing recursion.
We will assume the following :
o is known to lie inside a prescribed p-dimensional sphere 9. 8.2
()
The components of each of the vector-valued functions
1,() = grad};() (i= 1, 2, .. , r)
are continuously differentiable over Y'. (8.3)
For each x E 9, the set of vectors Il(X) 12(x), .. , Ir(x)
has rank p and all have positive lengths. (8.4)
We also assume that either
{Zn} is an independent (scalar) process with mean zero (8.5)
or
for some S > O. (8.6)
We will consider the (truncated, batch-processed) recursion
Sk+l = [Sk + Ak(Yk f(sk))]ao,
E 9,
-
Sl =
(8.7)
0
1 50 APPLICATIONS
where Ak can be any of the following p x r matrices :
Ak =
1 r
k (J1
IltiOo)112
-1
)
(t1(00)"'" t,(Oo
(deterministic, "quick and dirty"), (S.8a)
Ak = k1 (J1 tlOo)t/(Oo)) -1(t (00) 1,(00

r
1 " ' "
1( - 1
(deterministic, "linearized least-squares ) " , (S.Sc)
= k Ij(Sk)I/(Sk)) (t (Sk),
r
Ak 1, 1,(Sk
J 1
. .
(adaptive, "linearized least-squares"). (S.Sd)

We will verify, i n detail, that Equation S.Sa furnishes a mean-square
convergent-estimator sequence and will sketch the arguments which are
relevant to the corresponding proofs for Equations S.Sb, c, and d.
Let
Vk = (k - l)r + 1 , and
so that the number of indices in Jk is
(k = 1" 2 . . . ),
and let the column vectors of Ak be denoted by
an (Vk n < vk+1),
where
and
u - 1,2" , . ) .
is defined by Equation 8.1.] Under Equation 8.3, the matrices

[GFIIA'(X)b Xp)
. , whose column vectors are
[iPFII(1)/8Y18Y,] (i = 1 , 2, . ,p), (S.9)

2F,,(Y)/8yp8Y, ,-xc
TIME-lIO MO GENEO US REGRESSIO N 151
are uniformly bounded in norm a s (Xl' X2, .

. " xp) varies over the com
pact set f?JJ(P>. By Equation 8. 4,
Ilhnll = II Fn(Oo) II min 11/;(00)11 > 0 ;
i=I.2 . . . , r
therefore, Assumptions Ft and F3 of Theorem 7. 1 are satisfied. On the

other hand,
Ilhnll =s; max 11/;(00)11 < 00;
1=1
r
therefore, Assumptions F2 and F4 hold. Since
Amln C?;k IrCIi)

I=l . r
11/;(00)11-2Amln C 11(00)//(00 ) ,
Equation 8.4 implies Assumption F 5. Since the gains are given by
Equation 7. 48c, it therefore follows from Theorem 6.4 (via Theorem
7.10) , that Sn converges to 0 if f?JJ is small enough and Equation 8.5 holds.
If, instead of Equation 8.5, we assume Equation 8.6, Assumptions E I
through E6 are established as follows :
Let
Yk* = k6yk, Fk*(X) = k6f(x), Zk* = k6Zk>
hn* = k6hn = k6,Fn(Oo) (Vk =s; < Vk + 1)'
(8. 10)
11
A k* = the p x r matrix whose column vectors are
112) hn*
1
an* = (Vk+i-1
j=1
Ilh/
-
(vk =s; 11 < vk+l),
and consider the recursion
I f we can show that Sk* e in quadratic mean, we will be done. This is

so because the Corollary to Theorem 6.4 will then guarantee the mean
square convergence of
(8.12)
with
S ince
fPkAk*(Yk* - Fk *(Sk = Ak(Yk - Fk(Sk
for every k, the recursions of Equations 8. 12 and 8.7 (hence Sk and Sk)
are identical, which immediately establishes the mean-square conver
gence of Sk under Equation 8.6.
152 APPLICATIONS
To establish Equation 8. 1 1, notice that

n n - 1
- < k -- + 1
r r
if
(k - l)r + 1 = "k n "k + l -1 = kr.
Thus by Equation 8.10,
(r Ilhnll Ilhn*11 ( + 1 r Ilhnll, (8.13)
and
(8. 1 4)
where Gn* ( Xl> .. " xp) is the matrix whose columns are given by
Equation 8.9, with F,. replaced by k6Fn ("k 11 < "k+1 ) ' Thus, Fk*
satisfies Assumption FI of Theorem 7.1, since
the right-hand side being uniformly bounded by virtue of an earlier

argument. Under Equations 8. 3 and 8. 4,
o < min II li(6 0) II Ilhill max Illi(60)!1 < 00;
1= 1 ..T 1= 1 . .T
therefore, by Equation 8. 1 3,
K1n6 Ilhn*11 K2n6,
which establishes Assumption F2'. Since
h/ /' hjh/
L h 2 =
ieJk Il /11 ieJk
L Ilhil12
,
Assumption F5 holds by an earlier argument, and Assumptions EI

through E6 therefore hold if (J/J is small enough and Equation 8.11
follows by Theorem 6.4 via Theorem 7.1b.
The treatment of the gain given by Equation 8.8b is virtually identical
except for one small detail. The adaptive, "quick and dirty" gain of
Equation 8.8b is not exactly of the form 7.48c. Whereas Equation 7.48c
requires that the columns of Ak be of the form
ESTIMATING INITIAL STATE O F LINEAR SYSTEM 1 53
where ;j takes values in [!!J and depends on the observations through

time j, the gain 8.8b is of the form
[Vk +i
-1
;=1
]
IIF;(sIJI12 -1 F n(Sk) (Ilk :s; n < "k+ 1),
and Sk depends on the observations up through time "k 1. Nonethe
-
less, the proof of Theorem 7. 1 goes over word for word, and the same
arguments used to establish the convergence of the recursion under the
gain 8.8a can be applied verbatim to 8. 8b.
The" linearized least-squares" gains of Equations 8.8c, d are treated
similarly except that an additional assumption concerning the condi
tioning number of LI= 11;(60)1;'(60) is called for in order to meet
=
Assumption F6 of Theorem 7.1.
In the very special case where Zn 0 for every nand r
= p, the
regression problem reduces to that of finding the root of the equations
11(6) = Y1
12(6)=
Y2
(8. 1 5)
Ip(6) = Yp.
In the absence of noise, the vector" observations" are all the same :
and so Equation 8.7 becomes

Sk+l = [Sk + Ak(Y - f(sk]'p. (8. 1 6)
The preceding results show that
lim Sn = 6,
where 6 is the root of Equation 8. I 5 in &, provided that fYJ is small

n
enough. Actually, the rate of convergence can be speeded up consider

ably in the noiseless case by eliminating the damping factor Ilk from the
gains of the type 8.8a, b, c, d. Convergence, then, follows from an easy
extension of Theorem 6. 1 to the case of batch processing.
8.2 Estimating the Initial State of a Linear System via Noisy

Nonlinear Observations
Example 8.2. Suppose that a particle is moving back and forth

along the x-axis, its position at time t = kT being denoted by x(k), and
suppose that x(k) satisfies the second-order difference equation
x(k + 1) + x(k - 1) = sin ka (k = 1,2,,),
154 APPL ICATIO NS
Observer
True angular displacement of particle
[=arctan .(kll
" Particle's apparent position
, '-
, l at time kr
1--- . ( k 1 -----l
' ... ..
,1
Figure 8.1 Observation geometry.
where IX is known but the initial conditions

x(O) = 81,
are not known. S uppose further that an observer located one unit away
from the origin on a line passing through the origin and normal to the
x-axis makes noisy observations on the angular displacement of the
particle at the instants k-r. (See Figure 8.1.)
Thus, the observations take the form
Yk = arctan x(k) + Wk (k = 1, 2" , . ),
where it is assumed that the Wk are independent wit h zero-mean and
common variances. We want to estimate
from the Y's. We proceed as follows:

The position of t he particle can be written in closed form as
x(k; 6) = rp(k) + hk'6,
where
Sin
. . (k - n)7T
cp (k) = "1 nIX Sin '
2
and
hk =
[COS k7T/2] .
sin k7T/2
We assume that IX is such that the system does not resonate and that 6 is
known to lie within a sphere&' of radius R, centered at the origin. In this
case, there is a scalar C such that
sup Ix(k;6)1 C.
k,8e8'
ESTIMATING INITIAL STATE O F LINEAR SYSTEM 1 55
Our observations take the form
where
We estimate 8 by means of a scalar-observation recursion, truncated

over f!jJ:
S" + 1 [s" + a,,( Y"
= F,,(sn))]'?
-
The gains a n can be chosen in a variety of ways. We will concentrate on

" linearized least-squares" gains
where the gradients Fi can be evaluated either at some nominal value 80,
(deterministic version) or at the then-most-recent estimate s} (adaptive
case).
In general,
therefore,
and
i f n is even,
i f n is odd.
If we let
it is easy to see that the deterministic gains take the form

156 APPLICATIONS
while the adaptive gains take the form
In either case, we have
and
uniformly in a,,'s argument.
It is now an easy matter to verify Assumptions EI through E6' of the
conjectured theorem in Chapter 6. Assumptions EI, E2, and E6' hold
because
o < C3/n Il an ll IIF"II C21n
uniformly in an's and FII'S argument. Assumptions E3 and E4 hold with
T2 = 1- and p C2/C3 if we choose Vk 2k - 1 (k = 1, 2,) . For
=
then Pk = 2 for all k and

I
Am1n
(
2k FjF/) = lAml" (/) = 1,
Pk j=-l IIFj))2
while
and
Finally, Assumption E5 holds, since

1 a"'F,, j I - T2

1 - T2 + T21p2
>
= )I a n )) IIFn)1
The same results can be obtained if a batch-processing recursion
(with linearized least-squares gains) truncated over f!I' is used. Theorem
6.4 can then be applied.
8.3 Estimating Input Amplitude Through an Unknown

Saturating Amplifier
Example 8.3. Amplifiers are, ideally, memoryless linear devices .

Real amplifiers only approximate this performance. They are practically
memoryless but linear only over a certain range of i nputs . Typically, as
ESTIMATING INPUT AMPLITUDE 157
-------i
(0 )
--------_#_- AIIIpIItude of .......
(b)
Figure 8.2 (a) Performance of an ideal amplifier. (b) Performance of a real

amplifier.
the input amplitude increases, the ampl ifier saturates. (See Figures
8.2a and 8.2b.) A model that is frequently used to describe the input
output relationship of a saturating amplifier states that
2S
Yout (t) = - ;;; arctan
(rrA
(t)
)
2S Yin

Such an amplifier has the property that
Yout (t) ::::: A Yin (t) if A Yin (t) S,

and
Yout (t)::::: S if A Yin (t)>> S.
Here A is called the amplification factor, and S is called the saturation
level of the amplifier.
Suppose that a sinusoid B sin (27rft + 'I-') of known frequency and
phase but unknown amplitude is passed through an amplifier whose
saturation level S is unknown, and suppose that the output is observed
in the presence of wide-band noise. In other words, we observe
2S [(rrAB) Sin. (2rrft + '1') + Z(t).

yet) = -;;; arctan
2S ]
1 58 APPLICATIONS
On the basis of these observations, it is possible to estimate Sand AB.

(If A is known, we can deduce B. Otherwise we can only estimate their
product.)
For notational convenience, we will sample Y(t) at times
{ (k; ) /2'1T/ if
'Y k is odd,
(2k: I '1T ) /2'1T/

-
t"
'Y
=
_
if kis even .
If we set
kodd,
keven,
Z" = Z(t,,), Y" = Y(t,,),

and
F,,(8) = 81 arctan 82K", (8. 1 7)
then we can write
Y" = F,,(8) + Z" (k = 1,2, ) ..
We assume that
and
and esti mate 8 via the batch-processing recursion
s,. + 1 = [s" + A"(Y,, f,,(s,.))]ao,

where fJ' is the rectangular parallelepiped, [cxlt,8d [CX2, ,821,
-
and
(8. 1 80)
or
(8. 1 8b)
The conditions of Theorem 7. 1 are dealt with as follows: The gradient

of the regression function is
(8. 1 9)
ESTIMATING INPUT AMPL ITUDE 1 59
and the matrix of Fk's mixed partials, the first column evaluated at
x = (X2Xl) . the second at z = (Z2Zl) , is given by

[1 + (Z2gk)2] -1 ]
-2Z1Z2gk[1 + (Z2gk)2]-2
Given the existing assumptions, the norm ofF k(') is uniformly (in k and
6) bounded above and away from zero, and the col umns of Gk are
uniformly bounded. This establishes Assumptions Fl through F4. To
establish Assumption F5, we choose
(k = 1,2,,, .) .
The norms iiFk(X) ii are uniformly bounded (in x and k); t herefore, it
suffices to show that for some x E & and some S > 0,
(8.20)
.
for all k. Equation 8.20 will fol low if we show that F 2k and F 2k-l are
linearly independent for every k Since F2k-l F2k+3 and F2k
= F2k+4. =
it suffices to do so for k I . Let us assume the contrary:

=
for some x E {!/ and some nonzero a.
Then, since gl = 1 and g2 = - 1 V2, we have, by Equation 8.19,
and
Multiplying the first relation by the second, we obtain
arctan (X2) arctan ( - x2/V2)

(8.21)
1 + (X2)2 1 + (X2/V2)2
By assumption, we see that -'"2 > 0; therefore, the left-hand side is

positive, while the right-hand side is negative. This establishes a
contradiction. Therefore, F2k and F2k-1 are indeed linearly independent
160 APPLICATIO NS
for every k, and Assumption F5 holds. If (Pi - a,) (i I, 2) are small,

=
Sn converges t o e in the mean square if the gain of Equation 8.18a is

used. If Equation 8.18b is used, the additional restriction on the
conditioning number of the matrices
(k = 1 , 2)
must be satisfied.
In this treatment of the example, we chose batches of two observa
tions each. If we choose batches of size four, each " observation" is of
the form
(k = 1 , 2, ,,,)
where now
does not depend upon k, since Fk( ) = Fk+4( ) (k = I, 2,

. . . .
.) and Zk
is defined in the obvious way. The recursion becomes
where
or
Assumptions FI through F4 of Theorem 7. 1 follow from previous

arguments, while Assumption F5 follows from the fact that
ESTIMATING PARAMETERS OF A LINEAR SYSTEM 161
As usual, a restriction on the conditioning number of 2:1 1 FlOo)F/(Oo) =
must be met if the second (" linearized least-squares") type is to be used.
8.4 Estimating the Parameters of a Time-Invariant Linear System
Example 8.4. Here we consider recursive estimation of the param

eters defining a stable time-invariant linear system when it is driven by
appropriate inputs. The output is observed in the presence of additive,
but not necessarily white, noise. We treat bot h continuous and discrete
time systems. When the continuous output is sampled at regular inter
vals, as is usually done in practice, the two estimation procedures are
very much the same. As we will see, however, there is a single difference
wh ich is important from the computational point of view.
Our results on the asymptotic behavior of the estimates are quite
complete (strong and mean-square convergence plus asymptotic
normality). Although only an indication of proof i s given for some
results, each can be established rigorously under the stated conditions.
Consider an output X(I) that satisfies a stable pth-order linear di ffer
ent ial equation
dlx(t)
L.. '-
dtl
8 _
g(t ) ( - 00 < t < (0) , (8.22C)
1=0 -
or a stable pth-order linear difference equation, which we write as
- 1, 0, + 1,,, . ) .
I'
L: 8lx(t j) get) (t (8.22D)
1=0
- = = "',
In either case, if
get) = cos wI, (8.23)
the steady-state output takes the form

x(t) = A cos wt + B sin wI, (8 . 24)
where A and B depend nonlinearly on the 8's.
To exh ibit this dependence in the continuous case, we can compute
the even- and odd-ordered time derivatives of 8 .24 and substitute into
8.22C. Letting [xl denote the integral part of x, we find that
= (A cos wI + B sin wI) O 82j( _l)'w21

dlx(l) [1'/2)
,o 8, diI
I'
I
B cos wt)
[(1'-1)/2J
+ (A sin wI L: 82J+1( _1)1+1w21+1.
1=0
-
162 APPLICATIO NS
If this is to equal 8.23 for all 1 > 0, the coefficient of cos wi must be
unity and that of sin wI must be zero. As a result, we have
A
(8.25)
where a and {3 are linearly related to the unknown parameters via

[(1'-1)/2)
2
{3= L 821+1(-1) Iw J+1 . (8.26C)
1-0
In the discrete case, Equation 8.25 is again easily shown to be a valid

relation, after we redefine a and {3 by
I' I'
a = L f)j cosjw, {3 = L f)J sinjw. (8.260)
1=0 1=0
We note that Equation 8.25 holds reciprocally, that is, with A inter
changed with a and B with {3, thereby making explicit the nonlinear
dependence of A and B on the f)'s.
For the sake of convenience we are going to restrict attention to the
case where the number of unknown parameters is even, that is, where
p = 2q+l (8.27)
for some integer q O. The modifications required when p+ 1 is odd

will be clear.
To estimate the parameters in the continuous-time case, we will take
as our input
q
g(t) = L cos Akl (- 00 < 1 < 00), (8.28)
k=O
where the A'S are distinct positive angular frequencies to be chosen so

that
Ak A, =F a multiple of 'IT if k =F j.
The superposition principle immediately allows us to write the steady

state output as
Fc(/; 6) = L (Ak cos Akl + Bk sin Akl)

q
(I > 0), (8.29)
k=O
where the 2(q+ I)-coefficients, AI.: and BI.:, are related to the f)'s via
Equations 8.25 and 8.26C after setting w = A and affixing the subscript
k = 0, I, ..., q to each of A, B, ex, {3, and A. In view of Equation 8.27,
[P/2] and [(p - 1)/2] are both equal to q; therefore (with "e" for even
and" 0" for odd), we have
(8.30)
where
( 8.3Ia)
2
(-I) Q'\0 Q
2
(-1) Q'\1 q
1 ,
( _1 ) Q'\q2q
(8.3Ib)
-'\03 H l
'\05 <-I).,,"
2
-,\13 '\15 (-I) q'\1 q+1
[ _'\q3 ,\q 5
In the discrete-time case, we take as our input

(- 1) Q'\lq+1.,!
q
) (8. 3 2)
1 (_1 )1
get ) = ;- + L eos Wkt + ;- (t " ,,- 1, 0, + 1, ,
v 2 k=l
.
. 2
v
=
where
The corresponding output is
Fa(t; 8) = '!.. + i
.
v2
(Ak cos wkt + Bk sin Wkt) + vq/l2 (_1 ) 1
t, 2, . . ),
k =l
(t = (8.33)
where Ao, A1, 010 . ", Aq, Bq, Aq+1 are related to 00, 010 , 02q+l via
Equations 8.25 and 8.26D, after we affix k = 0, 1 , . . ., q + 1 to each of

A, B, ex , {3, and wand set

Wo = 0, Wq+l = 7T
1 64 APPLICATIONS
(so that B k = flk = 0 for k = 0 and q + 1). Thus, we have
/XO
/Xl
Y=
fll
/Xq

n [:: 1
82q+1

no, (8.34)
flq
/Xq+l
where
1 1 1
1 cos Wl cos 2Wl cos (2q + l)wl
0 - sin Wl -sin 2wl -sin (2q + l)wl
n= (8 . 35)
cos Wq cos 2wq cos (2q + l)wq

0 - sin Wq - sin 2wq -sin (2q + l)wq
-1 -1
is 2(q + 1) by 2(q + 1), while each of 1\.e and 1\.0 is q + 1 by q + 1.

Without loss of generality, we take the positive integers as the sampling
instants for a continuous output. (For a constant increment 7' = 1, the
restrictions are placed on Ak7' rather than on Ak') Our observations, in
both the sampled continuous and discrete cases, are then of the form
(t = 1,2,,, . ), (8.36)
where, according to Equations 8.29 and 8.33,
Ao
Bo
Al which are related to e via
= Bl the linear equations 8.30 (8.37C)
with k = 0, 1, . . . , q,
Aq
Bq
or
Ao
A1
B1
which are related to e via
; = the linear equations 8.34 (8.370)
with k = 0, 1, . . , q + 1.
Aq
Bq
Aq+l
and
or
We will assume that {Zt: t I, 2,

= } is some zero-mean, second order
. .
.
stationary process. Initially, our only restriction on the unknown

covariance sequence is that
0'/1 = CZtZt+lhl = 0 (), (8.39)
as h 00 for some e, 0 < e < I (and thus the noise process need not
possess a spectral density).
We are now in a position to write down the procedure for estimating 6.
We affix an n to a parameter symbol to denote an estimate of the
parameter based on the first n observations. Thus, ;" is a vector-valued
function of Yl> Y2," ', Y" which estimates ; Ak, n estimates the ,
component Ak, and so on.
Step 1. Recursively estimate the (p + I)-dimensional parameter; by

naive least squares:
; n = ;,,-1 + Bnh"(Y,, - h,,';,,_1),

where
(B"_1h,,)(B"_1h,,),

I + hn'Bn-1hn
Bn = B10- 1 _
(n 2q + 3),
i nitialized by
2q+2
;2q+2 = B2 q+ 2 2: ht Yt
t- 1
166 APPLICATIONS
(See the d iscussion leading to Equation 7.15; here we have decreased the
iteration i ndex n by one.)
Step 2. Assume that a finite number A is known for which
max 1 0ji < A.
1=0.1. .2q+l
Compute
where i n the sampled continuous case

I" = {x: Ixl (q+ 1) max 1,,12J+l}
J=O.I. .q
and i n the discrete case
lie = {x: Ixl 2(q+ I)A} for aU k = 0, I, ... , q+ 1.
Step 3. In the sampled continuous case, form the vectors a" and "
whose components are, respectively, a"." and P"." (k = 1,,, q). Then
estimate the even-numbered components of 8 by 8e." and the odd
n umbered components by 80 II, where
and
In the discrete case, form the vector y " whose components are given
by Equation 8 .35 with the ale's and PIc'S replaced throughout by their
respective estimates a"." and p",,,. Then estimate 8 by
8" = a-1y".
We first show that Equation 8.39 is sufficient for probability-one and
mean-square comergence ofO" to 8 as n co: Let y" and z", respectively,
be the vector of the first n observations and first n noise realizations, and
let
H" = (hI> h2' . . . , h,,]
be the 2(q + 1) by n matrix whose columns are given either by Equa
tion 8 .38C or 8 .38D. Equations 8.36 and the closed-form expression
for ;" combine to give
(8.40)
where
"
B" -1 = L hth/, n > 2(q+ 1).
1-1
ESTIMATING PARAMETERS OF A L INEAR SYSTEM 167
From the identities (Knopp, 1947, p. 4S0)
n'\
"
L cos 2'\t
1-1
=
sin
-.-, cos (n + 1),\
SIn 1\
} (,\ : a mUltiple of 17) , (S.41)
si n'\
sin 2'\t =
1-1
? , sin (n + 1),\
SIn 1\
it follows for n tending to infinity that
1 "
- L cos2,\t
n 1=1
} = 1- + 0 (!) (,\ : a multiple of 17) ,
1 n 11
- L sin2,\t
n ,=l
(S.42)
Consequently, for the h-vectors defined either by Equation S. 38C or

S.3SD, we have (2/n)8" -1 = I + {l/n)E" for some matrix En, whose
elements remain uniformly bounded as n --+ co. From Equation 8.40,
therefore,
(S.43)
and ;n will converge to ; in the same probabilistic sense that the

2(q + I)-vector on the right-hand side converges to the zero vector.
Ignoring the multiplication factor, each entry is of the form
where C, is either a cosine or sine. Using our assumption, S.39, we find

that
as n --+ co. It follows from a Law of Large Numbers for centered

dependent random variables (Parzen, 1960, p. 419) that 2" tends to zero
with probability one as n --+ co.
1 68 APPLICATIONS
In Step 2, therefore, each Ak." and Bk." is a strongly consistent

estimate of the corresponding val ues of Ak and Bk Since strong con
vergence is retained through continuous transformations, we have
as n co. But Clk." is uniformly bounded in n, so it follows that Clk."

converges to Clk in every mean (see the introductory material of Chapter
4). The same clearly applies to the convergence of 13k." to 13k' Con
sequently, the components of the vectors a" and {3" or y" in Step 3
converge with probability one and in mean square to the corresponding
components of a and (3 or y. The 6 estimates are obtained by applying
linear transformations to these vector estimates. Consequently, both
types of convergence are maintained.
It is an easy matter to establ ish asymptotic normality of our estimates
when we restrict attention to independent noise processes with
and sup 6"/Z,/2+8 < co. (8. 44)
t=l.
2. ..
From Equation 8.43, we find that
Vn(" - ; ) H"z"
2 2 "
"'"
. r
vn
= . r
-vnt=1
2: htZt
as n co. From Equation 8.44 and Liapounov's Central Limit

Theorem (Loeve, 1960, p. 275), it can be shown that the vector on the
right-hand side tends to normality with zero-mean vector and co
variance matrix equal to
6"( htZt) ( i hsZ.) = lim i

'
lim
" n

t=1 ,=1 " n
hth:6"ZtZ
t=1 .=1
4
0'2
= li,?1 Ii'" B" -1 = 20'21, (8.45)
where 1 is the (p + I)-by-(p + I) identity. Equation 8. 45 is true in both

the sampled continuous and discrete cases. It is now easy to establish the
asymptotic normality of the 6 estimates.
Turning first to the relations in Equation 8.37C, we consider the
matrix of partial derivatives
J
0(110' 0:10
II" flo, flb

"
fl,>
"
_
8{Ao Bo. A1 B1 '. A" B,)

evaluated at the true parameter values. Since Clk and 13k are interior points
of Ik, the vector
r;:J
ESTIAfA TING PA RA METERS O F A LINEA R SYSTEM 169
resulting from Step 2 tends, after appropriate standardization, to zero

mean normality with covariance matrix 2O'2J/J (by the vector version of
the " delta method" used in the proof of Theorem 5.1). After computing
the derivatives, we find that J' J is a diagonal matrix, and that the second
q + 1 diagonal entries arc identical to the first q + 1, namely,
(rx02 + f302)2, (rx12 + f312)2,. , (rxq2 + f3/)2.

I f we set
(8.46)
it follows that Vn(ex" - ex) and Vn(3" - (3) are asymptotically in
dependent and identically distributed as a (q + I)-dimensional normal
random variable with 0 mean and covariance matrix 2O'2P2, where
P = drag [Po, Ph.. " pq].
Consequently, for the even and odd components of the estimate of 0 in

Step 3, there results
Vn(8e." - 8e) - N(O, 2O'2Ae - 1 p 2Ae -1/ ) ,

(8.47C)
Vn(80." - 80) - N(O, 2O'2Ae -1P...2Ae -11),
where
P... =
[AD' A '
po -
dl ag - P1 . . . -
pq
, , Aq
]
1
and these two (q + I)-vectors become independent in large samples.
The formula for the covariances of the odd components results from
the fact that in Equation 8.31 we have
Ao = diag [AD Ah " Aq]Ae.

'
Hence, in the sampled-continuous case, it is necessary only to invert the
(q + I)-by-(q + 1) matrix Ae to obtain the estimate of the 2(q + 1)
vector 8.
In the discrete case, we find in precisely the same way that Vn(y" - y)
has as covariance matrix of its limiting normal distribution, 2O'2Q2, where
Q = diag [Po, Ph Ph" ', Pq, Pq, pq+d
is given by Equation 8.46 with the a's and f3's now computed from
Equation 8.26C rat her than Equation 8.26D. Consequently, for the
estimate in Step 3, we have
(8.47 D)
170 APPLICATIONS
in 2(q + I) dimensions. The entries in P and Q, respectively, are given

by the formulas
Pic =
Lo 82j( -1 ) J'\lc2Jr + Lto 82J+ 1( -1)J'\lc(k2J+ 10,r I, . . " q),
= (8.48C)
and
(k = 0, 1" , ',q + I) . (8.48D)
The limiting distributions 8. 47C and 8.47D depend on the unknown

8's only via the values of the p's. Fo r the latter situation, we add to Step
2 the calculation of
Pic. n = IX. n + fJ. n'
and let Qn denote t he matrix Q with Pic replaced by its cor.sistent
estimate Pic. n' Then we have
If the noise variance (72 is unknown, it can be consistently estimated by

adding to Step I the calculation of
A similar procedure can be carried out in the sampled-continuous case.

Consequently, we can set up large-sample confidence regions on a.
When the independent errors share a common normal distribution,
an is the Maximum-Likelihood estimate of a for every n > p + I in both
the sampled-continuous and discrete-time cases. This is true because the
least-squares estimate ;n of ; becomes the Maximum-Likelihood
estimate, and the Maximum-Likelihood estimate of the I-I vector
valued function which relates a to ; is the function of the Maximum
Likelihood estimate, namely, an.
This optimum property is conditional on the given regression vectors
hi, h2' ... . There remains the problem of delineating them by an
appropriate choice of the input frequencies. In the discrete-time case, the
answer, at least from the computational point of view, is clear. The
particular selection
(k = 1, 2, ,, ', q) (8.49)
ESTI.UATING PARAMETERS OF A LINEAR SYSTEM 17 1
makes Equation 8.3 5, after normal ization, an orthogonal matrix and

thereby obviates inversion in Step 3. Using Equation 8.41, we find that
nn' = diag [2(q + I), q + 1" ", q + 1, 2(q + 1)].
Consequently,
1) 1 L.. (/Xk COSJWk + t'k ) /X +l

1 . "
R SIO ) ( -1)1
q +I q
2 /xo
+ JWk + (2
8;
= --
(q + q + k=l
(j = 0, 1,.", 2q + 1), (8.50)
and the estimate of 8J is obtained by merely substituting for /Xk and f3k
the quantities /Xk." and f3k... which result from Step 2. The limiting
covariance matrix in 8.47D reduces to a Toeplitz matrix with entries
where the p's are given by 8.48D and 8.49.

In the sampled-continuous case, the choice of 0 < Ao < Al < ... < Aq
is not so obvious. To carry out Step 3 , we must invert the (q + 1)-by
(q + 1) matrix 1\.., given in (8.3 1b). A procedure for doing this is given
in Lemma 8. (Take TI - A-l and II
= q + 1; then the conclusion
=
gives the row vectors of the inverse of 1\.e'.) An analysis of the method
would show that certain choices for the A'S make the inversion numeri
cally difficult.
On the other hand, we would like to pick these input frequencies to
make our estimate statistically accurate, which we measure by the
determinant of the limiting covariance (called the generalized variance).
In this regard, it is unimportant how we label the parameters; therefore,
the determinant of the limiting covariance matrix of vn(6" - 6) is
simply the product of the determinants of the two matrices in 8.47C.
The square root of this general ized vari ance is proportional to
(8.51)
The numerator is n=o Pk2, where Pk is given by 8.48C and depends on

the unknown parameters. For given bounds on the components of 6, the
f unction in Equation 8.51 can be examined for any particular choice of
the input freq uencies.
172 APPLICATIONS
8.5 Elliptical Trajectory Parameter Estimation
Example 8.5. To a first approximation, the trajectory of a small

Earth sateIlite is an ellipse with one of its focii located at the Earth's
center of mass. If a polar-coordinate system is chosen in the plane of
this eIlipse (the origi n being located at the Earth's center of mass), the
(r, \I-'}coordinates of the satellite at any time t satisfy the eq uation
a(l - e) 2
ret) =
1+ e cos ('Y{t) - )
a
' (8 . 52)
where ret) is the distance from the Earth's center of mass to the satellite
at time t, 'Y(t) is the angle between a radius vector from the Earth's
center of mass to the sateIlite and the reference direction of the co
ordinate system, a is the length of the eIlipse's major semiaxis, e is the
eccentricity of the ellipse, and a is the angle between the ellipse's major
axis and the reference direction. (See Figure 8.3.)
Noisy observations Yl(t), Y2(t), and yit) are made on r(t), \F{t), and
;(t) = dr/dt, respectively. Thus we have
Y3(t) = I'(t) + Z3(t).
(8.53)
We wish to reconstruct ret) and \F(t) from the noisy data, so that the
position of the satellite can be predicted at any instant of time. We
begin our analysis by deriving parametric representations of rand \1'.
The functional forms of ret) and \f(t), which depend upon the param
eters a, e, a, and '1-"(0), can be deduced from Newton's laws.
In polar coordinates, the" F = ma " equations become
ar = i - r'Y2 = -1-'/r2 (p. a known constant), (8.54)

all = rlY + 2;'Y = 0, (8.55)
Major axis
Reference direction
Figure 8.3 Elliptical trajectory of a small Earth satellite.

ELLIPTICAL TRAJECTORY PARAMETER ESTIMATION 1 73
where the dots denote time derivatives throughout the example.

Equation 8.55 can be rewritten
1 d .
(r2'1') O.
r dl
=
The last equation implies that r20/= const. Thus, we have

r20/ = M, (8.56)
which expresses the conservation of angular momentum.
Here M is related to a and e. If Equation 8.52 is differentiated with
respect to 1 and M/r2 is substituted for 0/, there results
. Me sin ('Y(/) - 0:).

r( 1 ) = (8.57)
a(1 e2) _
Now, we differentiate; and use the same substitution :

i( Mae cos ('I'(t) - a) .
t)
=
(8.58)
a(l _ e2)ra
We substitute 8.56 through 8.58 into Equation 8.54. Thus we obtain
(8.59)
Finally, we substitute 8.52 i nto Equation 8.59 and solve for M2. We find
that
M.2 = a/L{ 1 - e2); (8.60)
therefore, Equation 8.56 becomes
(8.61)
Substituting 8.52 i nto Equation 8.61, we can integrate the differential
equation :
'I'(o-a (a(1 - e2%
J
d'Y
/L -\1 = t. (8.62)
'I'(O)-a (1 + e cos 'Y)2
Equation 8.62 expresses 'Y{t ) as an implicit function of four parameters
('Y(O), 0:, e, and a) .
IPI"(t ) could be solved for explicitly, the resulting expression could be
substituted into Equation 8.52, thereby causing ret ) [hence ;(/)] to be
represented as functions of these parameters. Unfortunately, the in
tegral 8.62 cannot be represented in terms of elementary functions. We
must consequently resort to a clever change of variable.
1 74 APPLICATIONS
Before proceeding, let us point out that we have greatly simplified the
problem by assuming that the plane of the orbit is known exactly,
thereby reducing the number of unknown parameters by two. We will
now add one more simplifying assumption, namely, that a (the length of
the major semiaxis) is known. Under this assumption, we can choose
the unit of length so that
a =1.
S ince I-' has the dimensionality of cubed length over squared time, we
can also choose the unit of time so that
1-'=1.
Fundamental Equations 8.52 and 8.61 become
ret) = (1 - e2)/{1 + e cos ('1-'(t) - a)}, (8.63)

r2(t)'f'(t) = VI - e2 (8.64)
Now, we consider the following change of variable :
E(t) [ L a] 27T
'I-'(t -
( e + cos ('Y(t ) - a) )
=
{ arc cos
I + e cos (llr(t) - a)
if sin ('Y - a) 0,
2"
( e + cos ('-F(t) - a) ) if sin ('Y - a) O.
- arc cos
1 + e cos eY(t) a)-
<
(8.65)
(Here, [('Y - a)/27T]
is the greatest integer in ('Y - a)/27T.)
As ('Y - a)
varies from 0 to 00, so does E(and in a monotone fashion). Furthermore,
if k7T :$; 'Y- a
:$; (k + 1)7T,the same holds for E: k7T:$; E:$; (k + 1)7T.
In fact,
E(t) =k7T whenever 'Y(t) - a =k7T (k = 1 , 2, ,, .) .
As a n immediate consequence, the transformation can be inverted as

follo ws :
'Y(t) - a = [E27T(t)] 27T

E(t) )
arc cos (eecos cos
E(t) - 1
-
if sin E 0,
{ 27T - arc cos (eecos- cos

E(t) - 1)
E(t)
if sin E < O. (8.66)
ELLIPTICAL TRAJECTORY PARAMETER ESTIMATION 175
As consequences of Equations 8.65 and 8.66, we obtain

e + cos ('Y - a)
- < 00,
1
cos E -- os 'I'" (8.67)
+ e cos ('I'" - a)'
= a
and
e - cosE
cos('Y - a) = Os E < 00. (8.68)
ecos E - I'
Here E(t) is called the eccentric anomaly at time t.
As a consequence of Equations 8.68 and 8.63,
r(t) = 1 - e cos E( t). (8.69)
Since V(t) = (d'YldE)(dEldt), we can write Equation 8.64 as
d'YdE
r2 (t) (8.70)
.
=
dE dt
Differentiating Equation 8.68 and using 8.69, we find that
. f\J/'
d'Y ( I - e2)
\.. - =

sm a ) sm E. (8.71)
dE r2
Computing sin ('Y - a) from Equation 8.68, we obtain
(I - e2)Yz
sin ('Y - a ) = sin E. (8.72)
r
Combining Equations 8.71 and 8.72, we have
d'Y (I - e2 )Yz.
= (8.73)
dE r
After substituting 8.73 i nto Equation 8.70, we obtain
1
dE
.
r = (8.74)
dt
iB(I) (I - e cos E)dE

Now, we use 8.69 in Equation 8.74 and integrate:
B(O) = t.
This yields
E(t) - e sin E(t) = t + (E( O) - e sin (0. (8.75)
The quantity E(t) - e sin E(t) is called the mean anomaly at time t.
We will parametrize the unknowns as follows:
82 = E( O) - e sin E( O), and 83 = a. (8.76)
176 APPL ICATIONS
We have chosen 82 to be the mean anomaly at t ime zero instead of

\1"(0), because this parametrization is more useful in orbit determina
tion. It enters explicitly in the representation for E(t) :
E (I) - 81 sin E(I) = 1 + 82 (8.77)
Since x - 81 sin x is monotone increasing when 0 < 81 < 1 (which it

must be for an ellipse), we can solve Equation 8.77 for E(I) as a function
of 81 and 82 Letting 8 be the column vector whose components are
defined in Equation 8.76, we can write Equations 8.69 and 8.66 as
ret; 8) = 1 - 81 cos E(t), (8.78)
'1"( 1 ; 8) 83 + [E2(7T1 )] 27T

E(t) )
=
( 8181cos- cos
arc cos
E(t) 1
if sin E ;;::: 0,
81 - cos E(t) )
_
arc cos ( if sin E 0,

+ (8.79)
27T <
81 cos E(t)
-
{ 1 _
the dependence of E on the parameters being suppressed in Equation

8.79 to save space. In the sequel, we will generally express E's dependence
upon 81 and 82 by writing E(t ; 8) instead of E(t; 81, 82), it being under
stood that E's dependence on 81 and 82 is given implicitly by Equation
8.77.
We are now able to set up the desired recursive-estimation scheme.
We will assume that bounds are known for 81 > 82, and 83 :
Hence, the truncation set will be a rectangular parallelepiped :
We will estimate 8 via a truncated, batch-processing, " quick and dirty,"

adaptive recursion : Let T be a sampling interval chosen so that the
residual vectors
are independent. Also let
(n = 1 " 2 . . . ),
ELLIPTICAL TRAJECTORY PARAMETER ESTIMATION 1 77
and
[YlY (t,,)] [r(I"; ] 8)

Y" = 1
2( ,,) , and F,,(8) = :-(1,,; 8)
Y 3(1,,) r(I,,; 8)
The y's and z's are defined in 8.53. Let 5 1 = 80 E f1J and
,
S,, + 1 =
[s" + A"(Y,, - F,,(5,,))] (n = 1 , 2, ) ,
where A" is a 3 x 3 matrix :
and
h3 ,, - 2 = grad 8) l a - 8n
r(I,,; }
h3 ,, - 1 = grad '1'(1,,; 8) l a = 8n (n = 1 . 2. . ) . (8.8 1 )
h3" = grad ;(1,,; 8) l a - sn
The following formulas are necessary to carry out the recursi on :

Combining Equation 8.57 with Equation 8.60 (and remembering that
a = p. = 1 in this example), we find that
;(t ; 8) = 81 ( 1 2
- (1 ) - % sin ('1'(1; 8) - (3). (8.82)
which, together with Equati ons 8.78 and 8.79. define the components of
F". Furthermore, we have
and
(Equations 8.83 through 8.85 will be derived at the end of this example .)
1 78 APPL ICATIO NS
A typical computation cycle might go like this, where (J1n. (J2n, and (Jan
den ote the components of the (column) vector Sn'
I. S ubst it ute Sn for 0 in Equat i on 8.77 and solve for E(tn ; sn).
2. Compute r(tn ; sn) from Eq uation 8.78.
3. Compute y(tn ; sn) from Equati on 8.79.
4. Compute cos ['F(tn ; sn) - (Jan]. and sin ['F(tn ; sn) - (Jan] from
Equation 8.68.
5. Compute ;(tn ; sn) from Equation 8 .82 .
6. Compute han - 2 , han - I > h an' using Equations 8.81 and 8.83 through
8.8 5.
7. Update Li a li hj I1 2 to L 1 II hj l1 2 and form An.
8. Form the col umn vector Fn(sn) of q uantities in (2), (3), and ( 5).
9. Observe Yn and compute [sn + An( Yn - Fn(sn H.. = Sn + 1 '
10. Begin the next cycle.
The gains that we use are given by Equation 7.48c, so we urify

Assump tions F1 through F5 of Theorem 7. 1 . We begin by pointing out
the following :
{ Il grad r et ; 0 ) 11 } { Il grad r et ; 0) 11 }
o < inf Il grad 'F(t ; 0) 1 1 sup II grad 'Y(t ; 0 ) 11 < co .
& ,, 9' , 1 > 0 & ,, 9' , 1 > 0
II grad ret ; 0) 1 1 I I grad ret ; 0) 11

(8.86)
This follows from Equations 8.83 through 8.8 5 by virtue of Equation
8.80.
Straightforward differentiation of Equations 8.83 through 8.8 5 will
verify that each element of the matrices of second-order mixed partial
( 0 ) derivatives of r, 'Y, and ; is uniformly bounded for 0 E 9' and t > O.
Thus.
If we define hn * as was done in Equation 8.81, except that the derivatives

are all evaluated at 0 0 instead of Sn, then, by Equation 8.86, we find that
there are constants K1 and K2 such that
o < K1 I l hn* 11 K2 < co (8.87)
for all n. F I through F 4 now follow immediately.
To prove F 5, it suffices to prove that
a
3 1 3 + } > O.
( h*n + .h*'n )
lim inf Amln L
n __ co 1= 1 Il h !n + i I1 2
ELLIPTICAL TR AJECTO R Y PA R AMETER ESTIMATION 1 79
I n view of Equation 8.87, it therefore suffices to show that
lim inf '\mln

n -+ oo
( :i h:" + iM + i)
1=1
= lim inf '\mln (II nH,,/) > 0, (8.88)
n, ... oo
where
Hn = (h:n + h Mn + 2, h:n + 3).
The matrix H nH,, ' has three nonnegative eigenvalues
o ::; '\In ::; '\2n ::; '\3n'
Therefore,
(8.89)
Since
'\3n ::; '\I n + '\2n + '\3n = tr H nH n ' =
3
2
2:
i=l
I! h:n + i I1 , (8.90)
{
i t follows from Equations 8 .89 and 8.90 that
A min (H nH " ' ) = A1'1 -
>
3
I det Hn l } 2
L II Mn + i 1l 2
im1
In the light o f this and Equation 8.87, we find that Equation 8.88 holds
if I det Hnl is bounded away from zero . If det II " is expanded by co
factors of its last row, we see that
I det Hn I -
- 810{COS2 2
(' - 830) + sin 2 (' - 830)
r r
2 -
810 .
r
2
r
2
} _
Since r(tn) is uniformly bounded, it follows that lim infn ... I det Hn l > O . co
We must now derive Equations 8.83 through 8.85. The following

identities are basic to the derivations that follow : From Equation 8.67,
cos E + e cos E cos (' - a) e + cos ('Y - a) ,
=
and thus
cos E - e = {l - e cos E) cos ('Y - a),
or
cos E - e = r cos ('Y - a) , (8.9 1 )
after using Equation 8.69 (which we restate for convenience) :
r = 1 - e cos E. (8.92)
Differentiating Equation 8.92 with respect to time, we obtain
f = Ee si n E.
From Equation 8.75,

E(1 - e cos E) = 1 .
1 80 APPLICATIONS
and from Equation 8 .92,

Er = 1,
and thus
. . E
r = e sm - (8.93)
r
We differentiate Equation 8.77 with respect t o 0l( = e) :
: (1 - 01 cos E) - sin E = O. (8. 94)

By Equation 8.92,
oE sin E
= (8.95)
801 -r-'
We differentiate Equation 8.77 with respect to 02 :
BE
O (1 - 01 cos E) = 1 .
o 2
Thus, we have
(8.96)
and finally,
= O.
oE
(8.97)
B 03
We differentiate Equation 8.92 with respect to 01 :

or . oE
- = - cos E + 0l sm E - .
B o
Using Equations 8 .92 and 8.95, we obtain

or 01 - cos E.
= (8.98)
B Ol r
By Equation 8.91,
or
0 01
= - cos (\f" - 03) ,
Similarly, we have
(8.99)
By Equation 8.93,
;;2 = 01( 1 - 012) % sin (\f" - 03), and

or
o Oa
=
O.
ELLIPTICAL TRAJECTORY PARAMETER ESTIMATION 181
We differentiate Equation 8.68 with respect t o 81 :

8 - SID ('Y - 83
8'1" . 8r -
- I (cos E - 81) 881 sin E 8E I
8 1
( =
- r - 881 - ,.
We use 8.92, 8.93, 8.95, and 8.98 to show that
8'1"
881 = (I - 81 2) - % -;:r
sin E
(r + ( I - 81 2 . (8. 1 00)
By 8.93, we have
(, +
I I .
81 2) SI D ('Y - 83) ,
8'1"
8 1 8 =
I _
We differentiate 8.68 with respect to 82 :

8r I
2 ( - SID ('I" - 83 2r (81 - cos E) - 882 - -r SID. E -
8'1" . I 8E
-
88 =
882
Using Equations 8.99, 8.96, and 8.93, we obtain
812)%
882 =
8'1" (l
r2 '
-
(8. 1 0 1 )
and finally
= 1.
8'1"
8 3 8 (8. 1 02)
We differentiate 8.93 with respect to 81 :

sin E 81 cos E 8E 81 sin E -
8r
=
8;
-
881 --
r + r - 881 r 2 881 -
sin E 2
- r3- (r + r81 cos E - 812 + 81 cos E)
81)ei E)
=
C( l - 281 2)%
=
r 2 SID. ('I" - ) a .
Similarly,
(COS E 8E sin E !!..
882 81 r 882 r2 882)
8; _
8 (cos E 81) 8 cos ('I" )

r =
r _
= - a .
Finally, we obtain
883 = O.
8;
9. Open Problems
In closing, we wish to call attention to a number of problems that are

related to the work contained in this monograph.
9.1 Proof of the Conjectured Theorem
In Chapter 6 we were forced to state as conjecture a theorem pertain

ing to the almost sure and quadratic-mean convergence of scalar
observation, truncated-estimation recursions of the form
tn+l = [tn + an(Yn - Fn(tn))]jJ,

where !II is a closed convex subset of Euclidean p-space.
There is little doubt in our mind regarding the correctness of the
theorem, and we hope that one of our readers wilf have better luck than
we did in inventing a correct proof.
9.2 Extensions of Chapters 3 Through 5 to the Vector-Parameter Case
In Chapter 7 we discussed two distinctly different sets of gain sequences:

the "linearized least-squares" gains 7.48b, d, and the "quick and dirty"
gains 7.480, c. Under a reasonable set of regularity conditions, both
types of gains yield convergent estimator sequences. The latter family is
unquestionably more convenient from the computational point of view,
whereas the former is more efficient in the statistical sense (at least in the
182
KALMAN-TYPE FILTERING THEORY 183
case of linear regression). In the general case, it is not unreasonable

to expect that a tradeoff exists between computational convenience and
statistical efficiency, not only for the classes of gains already discussed
but also for any others that one may dream up.
In order to investigate this issue quantitatively, the techniques of
Chapters 3 through 5 will have to be extended to the vector-parameter
case. Such results will also serve the equally important purpose of
allowing one to construct asymptotic confidence sets for nonlinear
regression parameters.
9.3 Kalman-Type Filtering Theory for Nonlinear Systems
Throughout this monograph, the parameter that is to be estimated

does not change in time. However, in many applications it is desired to
estimate a quantity that changes in time, according to an equation of the
form
(9.1)
where the functional form of II(') is known, VII is a zero-mean sto

chastic process, and incomplete information about the values of Oil is
available through observations YII of the form
(9.2)
(When n(') is the identity transformation and VII is zero for each n, the
problem reduces to ordinary regression.)
When Fn() and n(') are linear functions of their argument and the
vector processes {Vn} and {WII} are mutually and temporally indepen
dent, Kalman has developed a recursive theory of smoothing and
prediction which generates estimates for 8n which are optimal in a
number of statistical senses. For example, if anln denotes the estimate of
8n based upon the observations Yh Y2," ', YII' then
811+1111+1 = 1I(anlll) + An[YII+1 - FII+1(alllll)] (9.3)

where the smoothing vectors An (or matrices as the case may be) are
defined recursively in terms of the second-order noise statistics for {Vn}
and {Wn} and the transformations (which, in the linear case, are matrices)
n and Fn. (See Kalman, 1960.)
Motivated by the computational appeal of the recursive formulation,
the prevailing approach in cases when n(') and/or Fn() are nonlinear
has been the "method of linearization," coupled with Kalman filtering.
Briefly, this approach involves the "linearization" of Equations 9.1 and
184 OPEN PROBLEMS
9.2, which is accomplished by expanding cI>n(') and Fn() in a Taylor

series about a "nominal value," 6, usually, but not always, chosen to
satisfy a noiseless version of the "state equation," Equation 9.1 :
(9.4)
When this is done, and if all terms of nonlinear order are ignored, we
find that
6n+1 cI>n(6) + 4>n(6)(6n - 6) + Vn;
therefore, by Equation 9.4,
(6n+1 - 6+1) cI>n(6) + 4>n(6)(6n - 6) + Vn, (9.5)

and
Y" - Fn(6) Fn(6)(6n - 6) + Wn, (9.6)
where 4>n(6) and Fn(6) are, respectively, the matrices of cI>n and F,,'s
first partial derivatives evaluated at 6.
If the Kalman filtering theory is applied to the linear approximation
Equations 9.5 and 9.6, we find that
On+11"+1 = cI>n(Onl") + An[Yn+1 - F,,+1(Onln)], (9.7)
where now An is defined recursively in terms of the second-order noise
statistics for {Wn} and {V,, } and in terms of the matrices Fn(6) and
4>n(6).
Although this technique meets with wide acceptance in applications,
little if any work (to the best of our knowledge) has been directed toward
the analysis of the "steady-state" operating characteristics of such
schemes. Of particular interest are such questions as: What is the large-
sample (large n) mean-square estimation error of Onl"? What is the
quantitative nature of the tradeoff between computational convenience
and accuracy that one experiences with various choices of the gains An?
The estimation recursion 9.7 looks so much like the recursions for
regression-parameter estimation that there is every reason to hope that
the analytic approaches developed in this monograph can be carried
over and extended to the more general case. Indeed, when the state and
observation Equations 9.1 and 9.2 are scalar relations, our previous
methods can be applied and furnish a bound on the limiting mean
square prediction error.
From the first n observations, we recursively predict 8"+1 by
t,,+1( = 8"+1In in the previous notation):
(n = 1, 2, . . . ) . (9.8)
KALMAN-TYPE FILTERING THEOR Y 1 85
Here 11 is an arbitrary random variable estimating the initial state 01

(each assumed to have finite second-order moments). We use gains {an}
that minimize, subject to a certain constraint, the steady-state prediction
error under the following assumptions:
HI. The noise processes {Vn} and {Wn} are mutually and temporally
independent with zero means and some finite (nonzero) variances
O'v2 and O'w2
H2. The derivatives d>n(x) and Fn(x) are one-signed for each n.
H3. For all x, fJn' I(bn(x) I fJn and bn IFn(x) I bn', where
fJn c1fJn' and bn' C2bn for some 1 C1> C2 < 00.
H4. fJ lim sup fJn < 1' where c = C1C
n c-
=
2
H5. bn -+ 00 as n -+ 00.
The significance of the last two assumptions will be discussed after we

prove the following theorem.
THEOREM 9.1
Let {8n} and {Yn} be scalar-valued processes defined by Equations 9.1
and 9.2 which satisfy Assumptions H I through H5. Let {t n} be generated
by Equation 9.8 with
fJ n
an sgn (d>nFn) .
C bn
=
Then, for the mean-square prediction error, we have

2
lim sup C(tn - 8n)2 O'(V 1)2
..
I-fJ2 1--
c
with equality in the linear case (when C = 1).
Proof. The usual Taylor-series expansion gives

tn+1 - On+1 = (d>n - anFn)(1n - On) - Vn + anWn,
where the derivatives are evaluated at some intermediate points. We
square this, take expectations conditional on 01> V1> ... , Vn-1; 11>
W1>, Wn-1> and use Assumption HI. The result combines with
Assumptions H2 and H3 to yield
e +1 (fJn - lanlbn)2en2 + O'v2 + O'w2an2, (9.9)
where en2 = C(tn - 8n)2. A lower inequality holds with fJn - la nlbn
replaced by fJn' - lanlbn' O. Let us set
..
Bn" = (1 - B,,) TI BJ (9.10)
1="+1
186 OPEN PROBLEMS
Then, after iterating Equation 9.9 back to (say) n = 1, we have
(9.11)
By Assumption H4 and the choice of lail, we find that
Pi - lallbi = Pi 1 -
( ) (9.12)
is bounded away from unity for all large enoughj, so the leading product
in Equation 9.11 goes to zero as n tends to infinity. According to Lemma
1,
n n
L Bnk = I -
TI Bj-> 1 .
k=l /=1
It remains to apply Equation 4.20a to the summation in Equation 9.11

to conclude that
The equality is a consequence of Equation 9.12, Assumption H4, and

Assumption H5. Q.E.D.
It is important to note that Assumption H4 does not require

"stability" of the state equation On+1 <I)n(ln) + Vn, except in the
=
limit of indefinitely large C1 or C2' At the other end of the spectrum when
f3n' f3n f3 and bn bn' (the linear case), Assumption H4 allows
= = =
systems On + 1 = f30n + Vn with f3 < 00 arbitrary. The reason for this,

of course, is Assumption H5 which states that infx I Fn(x)l-+ 00. A
smaller and smaller error therefore results in estimating On, whose
fluctuations are increasing without bound from an observation Yn In
such cases, the result is exact:
lim en2 = uv2
n
This is the same error that would obtain at each step if we were able to
observe the state variable directly.
By stating our result in a slightly more general form, we can see why
we selected the particular gains in the statement of the theorem and also
investigate the situation when bn -+ O. We define, for positive numbers
x, the sequence of functions
(9.14)
KALMAN-T YPE FILTERING THEOR Y 187
By filling in the details of the proof, it is not difficult to see that under
Assumptions H I through H3 the first line in Equation 9.13 holds true,
namely,
lim sup e"2 =::; lim sup Q,,(lani), (9.15)
" "
provided that
(9.16a)
and
"
TI (f3, - 10,1 b,)2 0 (9. 16b)

i-I
as n 00. For given sequences {f3n} and {bn}, the function 9.14 has a
unique minimum at the point
-R" + V R,,2 + 4ay2aW2

Xo = --.:.:..---:::--",..-
.".- --'---"- (9.17)
2aw2
where
R" aw2(l - f3n2) + ay2bn2

= (9.18)
f3nbn
Under Assumption H5, when bn 00, we have
a 2bn
Rn P n
oo
as n 00, assuming always that f3n has a finite limit superior. Con
sequently,
and
(9.19)
independent of the unknown variances. If we were to use this Xo as lanl,

we would not meet the condition of Equation 9.160; hence the reason
for the division by c. Assumption H4 ensures that Equation 9.16b is
satisfied.
As one might expect, the situation is entirely different when
H5'.b,,O.
In this case, we have for Equation 9.18
Rn "'" aw2(1 - f3n2)

= 00 '
f3nbn
188 OPEN PROBLEMS
assuming that 13" does not tend to unity. In the same way as in the
previous paragraph, we find for Equation 9.17 that
2
13"
b -+ 0
O'v
Xo -
O'w 2 I - 13"2 ,,
as n tends to infinity. Since O'v
2
/O'W2 is not known, this suggests using
gains with
(a > 0), (9.20)
and assuming, in place of Assumption H4, that
H4'. 13 < 1.
Then Equation 9.16a holds, at least for all large enough n (which is
enough), as does Equation 9.l6b. Thus, for the gains a" =
sgn I>J,,) 10,,1 in Equation 9.20, the result of Equation 9. 15 reads
(9.2 1)
and there is equality when 13,,' 13" 13 (and lim sup should be re
= =
placed by lim). But this is precisely the mean-square error resulting from
not using the observations at all (that is, by setting a 0 in Equation =
9.20). In other words, for a stable system

(0 < 13 < I)
and an observational equation
Y" =
Fn(8n} + W" supx IFn(x}1 :s: C2 infx IFn(x)l-+ 0,
with
the "observation-free" predictor 1,,+1 = f3tn does just as well as the
"optimized" version of Equation 9.8.
Finally, when b" bn' tends to a nonzero finite constant (say unity)
=
and f3n' f3n 13 (0 < 13 < I), the same approach will lead to time
= =
independent gains lanl a, where

=
- 13 Qo (9.22)
a Qo +
2
- O'w
and
(9.23)
These two equations combine to give a quadratic (Kalman's "variance
equation") whose positive square-root solution is the minimum mean
square linear prediction error Qo. The optimum gain now depends on
O'i and O'w as well as 13. This result is a special case of Kalman's linear
2
theory (see his Example I), and we include it only as a point of com
parison.
Appendix. Lemmas 1 Through 8
Lemma 1
Let AI> A2, be a sequence of square matrices. Then, for all

1 :::;
k :::; nand n I,
2:
n f1n (I - A,)A1 = 1 -
n
f1 (I - A,),
I=/c '=1+1 I=/c
where products are to be read backwards and void ones defined as the
identity.
Proof.
n n n
We have
f1 (I - AI) - f1 (I AI) = f1 (I AI)[I (I - Aj)]

n
- - -
1=/+1 1=/ 1=1+1
= f1 (I - AI)A/
1=1+1
Thus,
The sum over j from k to nof the right-hand side collapses to yield the
asserted result. Q.E.D.
189
190 APPENDIX
Lemma 2
Let P" = TI7=l (I - a,) when a, E (0, I) for all j Nand P"_ 0
as n - 00. Then, if Lk Xk < 00,
" P
max 2: 2xJ_O
lSkS" i=k PI
Proof The maximum in question is equal to the larger of the two

values which result from maximizing over 1 k N - 1 and over
N k n. The former is O(Pn) = 0(1) as n _ 00, and we must prove
the latter is also. To save writing, we set RJ+l = PnlPJ and SJ+l =
Xl + ... + Xi (where all void products are to be read as unity). Then we
have the identity
" "
2: RJ+lXJ = 2: (Rj+lSJ+1 - Rj+lS')
i=k J=k
"
= 2: (Ris, - Ri+1S,) + S"+l - Rk sk
i=k
"
= - 2: RJ+laiSi + Sn+l - Rk sle
i=k
But, according to Lemma I, the last expression is unaltered if we subtract
a constant from every subscripted s. Using S limn Sn, we therefore have
=
I i-k Pp"i Xi I IS"+l - sl + IpPk-"l (Sk - S) \ + Ij=k a 'pp " (sJ - S) I

i
(L2.l)
in the original notation. With regard to the second term, given e > 0 we
choose nl > Nso that ISle - sl < e for all k > nh and then n2 > nl so
that [17=n1 (1 - aJ) < e for all n > n2. Then, since products with
indices exceeding Nincrease with fewer terms,
max
NSkSn
I p
" (S -
p k-l k
S)\
= max max Ii (I - a,)lsk - sl, max Ii (1 - aj)lsle - sl}
{NSkSn1i=k NSkSn i-k
max { max ISle - sle, (1 - an)e} const e.
=
NSkSn1
Setting
n
ani = aJ [l (I - a,),
I=J+l
LEMMAS I THROUGH 8 191
we have for the maximum of the final term in Equation L2.l,
max
NSI<sn J=I< 1i
anl sJ - S) I i
J=N
an/lsJ - s l (L2.2)
We see that anJO as n 00 for each fixed j and, using Lemma 1, we

find that
n pn
L anJ 1 - __ 1
PN-1
=
J=N
as n 00. From ISJ - s l 0 as j 00 and the Toeplitz Lemma
(Knopp, 1947,p. 75) we infer that the bound in Equation L2.2 must go
to zero as n 00. Q.E.D.
Lemma 3
Let {an} and {an} be positive number sequences such that an 2 for all
n N and an < 00. If
e+1 (1 - an)2en2 + an(I + en),

then
n
sup en2 < 00.
It is a/ortior; true that the conclusion is valid when (1 - an)2 is replaced

by the smaller quantity 1 - 2an
Proof. If en2 1, then clearly e+1 M < 00 for any n. If en2 > 1,
then en < en2 and
e+1 (1 + an)en2 + an
if n N,because (1 - an)2 + an 1 + an when an 2. In every case,
M, (1 + an)en2 + an}
therefore,
e+1 max { (n N).
If we iterate this back to N, we find that
2
e,,+1 max
{[MP
max -
"+
NSI<S" PI<
L.
1
J=I<+, J
J -PP ]
n a ' -e
p
P"
N-l
N 2 + TI
n n
J JN
-pP GJ ' }
(L3.1)
where
"
TI (1 + aJ).
P" =
J=1
Since 1 + eX is valid for all real numbers, we have
n "
x
p
" exp L aJ exp L aJ < 00,
PI< J=I<+1 J=1
192 APPENDIX
for all N s k s nand all n N. The assumed summability of {u,,} thus

shows that both terms in Equation L3.1 must remain uniformly
bounded. Q.E.D.
Lemma 4
Let {bIll be any real-number sequence such that p"2 < 00, where
p" = b,,2/B,,2 and B,,2 = b12+ ... +b,,2 (that is, Assumption ASj.
Suppose Z > 0 and K O. Define
Z" = zp" - Kp,,2,

and let N be fixed sufficiently large so that 0 < Z" < 1 holds for all
n N. Then
k = N, N + 1,, n
for some 0 < CIe-1 SiS DIe-1 < 00, which do not depend on nand
have the property
lim Cle lim Die 1.
... co
= =
k co k
Proof. The left-hand inequality in
exp
{ __}
_ X < 1 -x < exp {-x} (U.l)
- x 1
is valid for all x < 1. (See Knopp, 1947, p. 198, for this as well as
and form the product on j from k to n.
GO
Equation U.4.) We set x = ZJ
2 __I -
( )
Since
" 2
Z " Z21 " Z
= 2 ZI + __ S 2 ZJ + 2 __ J ,
I-Ie - 1 ZI I-Ie 1 - ZJ J-Ie J-Ie 1 - ZJ
this gives
CIe-1 exp
{- i Z } S Ii (1 - zJ ) S exp
{- i } Z/ , (U.2)
J
J-Ie J-Ie I-Ie
{-GO
with
CIe-1 = exp 2
J-Ie 1 -
1
__
ZJ
Z2
}
tending to 1 as k -+ 00 because {zJ2} is summable. From the right-hand
inequality in Equation U.l, we have
? = 1 - PJ < exp{-pJ};
LEMMAS I THROUGH 8 193
therefore,
m-l
--2 nn -2-
- L: PJ}
BY-I exp {n
=:; (L4.3)
Bn
=
J=k BJ J=k
But ZJ =:; ZPi> and therefore
This combines with Equation L4.2 to give the asserted lower bound.
To prove the upper bound, we use
exp {x} ( 1 + r, > (L4.4)
which is valid for all positive numbers and For the choices x y.
and
we find that
exp {zJ} (-BBJ2-ll )(1-P/)(Z-KD/)

( Bl )"(_ 1 -(,,+K)P/+KP/2
_)
1 -_)
( BJ-1
Bl )"(_
1 PJ -("+K)P1
=
BJ-1

1 - PJ
because 1/(1 - Pi) exceeds 1. Consequently, after inverting and forming
the product over j, we have
exp { - J=kL:n zJ} i=kn B2"I ( 1 ) K)D/ Bk;;
n B;;
=:;
I
1 _
B
2z
R
t'J
("+
(L4.S) =
n
S,
where
log (z + K) Jn PJ log 1 1 PJ ( + K) J PJ log - PJ 1
co
o < S = _ =:; z 1
(L4.6)
Equations L4.5 and L4.6 combine with Equation L4.2 to give the
asserted upper bound with
D/C-l exp {(Z + K) JtP/Iog pJ
= 1
Setting PJ in the left-hand member of Equation L4.1, we see that

x =
the last written sum is majorized by 'Ll=k Pl/(l - Pj). This goes to 0 as
k since it is the tail of a convergent series, and therefore Dk -+ 1.
-+ 00
Q.E.D.
194 APPENDIX
Lemma 5
Let {bn} be any real-number sequence such that
B,,2 = b12 + . . . + bn2-,>-00 and f3n = bn2/Bn2-,>-Oasll-'>-00
(that is, Assumptions A3 and AS"). Define
Bk2f3k
f3nk(Z) = Jj"2Z (k = 1,2,.,n)
n
for z > O. Then
"
.
ltm
" k=1
2: f3nk(Z)" = -
Z
if limn n = , finite or not.
Proof. For every fixed k, f3nk -'>- 0 as 11-'>- 00. The conclusion follows
immediately from the Toeplitz Lemma (Knopp, 1947, p. 75) if we can
show that the row sums
as n -'>- 00. By the Abel-Dini Theorem, Equation 2.27,the numerator as

well as the denominator approaches +00 with n. However,the value of
lim Rn is obtainable as the limiting value of the ratio of successive
numerator differences to denominator differences (Hobson, 1957, p. 7),
that is, of
B; If3n+ 1 f3n+ 1
1 - (1
_
B;l - B,,22 - - f3n+li

This ratio, in turn,is indeterminate (0/0),as f3n -'>- O. But we can replace
f3" by a continuous variable f3 and apply L'Hospital's rule to the resulting
function. Thus, after differentiating, we have
1
lim R" lim (1 1 Q.E.D.
(3)2-
= =
" 8 .... 0 Z - Z
-.
Lemma 6
Let {b,,} be any real-number sequence such that
B"2 = b12 + . . . + b"2 -'>- 00 and 2: f3"2 < 00,
where f3" = b,,2/B,,2 (that is, Assumptions A3 and AS"'). Then, for
any z > t,
LEMMAS J THROUGH 8 195
as n 00, where N is chosen so that zf3, < 1 for all j > N.
Proof. From Lemma 4, with K = 0, we have
where Cle2 and Dle2 both tend to 1 as k 00. We thus have
where f3nk( ) was defined in the hypothesis of Lemma 5 for all positive
arguments. After we take limits on both sides of this inequality, we find
that the desired conclusion follows from that of Lemma 5. Q.E.D.
Lemma 7
(a) Let B be a positive definite p x p matrix with eigenvalues

o < "1 "2 ... "I' and associated unit eigenvectors <Ph <P2 , <PI'. . . .
Then, for every vector x,
with equality holding if
(b) Let {hn} be a sequence of p-dimensional vectors satisfying

Assumptions F2, F3, F4, and F5. Then
ct hih/)
( "'n hh ')
. "max K2q
lIm sup - '
n_oo " T 2
mIn L., i'
1= 1
where
and
196 APPENDIX
Proofof Q. Let y = B2X. Then

x'Bx y'y y'y
IlxllllBxl1 = y
IIB-12 IIIIB!:'yll = V(y'B ly)(y'By)
The last is bounded below by 2VApIAd(1 + AplAl) according to
Theorem 13 in Beckenback and Bellman (1961). The second statement
is verified by substitution.
Proofofb.
Amln (i hih/)
i=l
> Amln ( hih/)
i=l

k 1
i
i=l
Amln (L hih/) ,
lell
where k is chosen to be the largest integer for which Vk =:; n. But,
"min (L hih/)
le/l
min l!htlI2T/pJ>
le/l
where
and, of course,
Pi = Vi+1 - vi> the number of elements in Ji
Let
Then
and therefore,
Thus, we have
The numerator of the second term is bounded by
K2(Q+1)llhvk+ 1-1112q.
LEMMAS J THROUGH 8 19 7
Since lim infJ-+oo Tl/Kl T2/K 2Q, Assumption F2 implies that the
second term approaches zero as k -+ 00. The first term on the right-hand
side is indeterminate. The discrete version of L'Hospital's rule can be
applied (Hobson, 195 7, p. 7 , Section 6), and we find that
lim sup
Ie / Ie
2: 2: IIhdl2 2:.. (Tl/Kl) 2: IIhdl2 lim sup Kl/T/2 K2Q/T2.

Ie-+oo 1-11E11 J l / ... .",
IEll
This is the desired result. Q.E.D.
Lemma 8
Let ' 1 0 r2, .. . , '" be any distinct real numbers, and let R =
[rio r2, " ', rIO] be the (Vandermonde) matrix whose jth column is
Then the ith row of R -1 is
where
(k = 1,2, . . , n ,
)
and alo . . . , a" are such that
"
TI (x - 'Ic) =
g..(x).
1e1
Proof We introduce the n x n matrix
A
=
[.Il- - - - - - - - - - l,
a.. ! a,,-1 a2 al
198 APPENDIX
and notice that ArJ = 'irJ because x = 'J is a root of gn(x) O. In other =
words, 'J is the eigenvalue of A associated with the right-sided eigen

vector rJ. If we post-multiply A - xl by
o o o
x 1 o o
2
E(x) = x x o
[! l
we find that
(A - xl)E(x) = i l
.-;(;)-- -; ---: -l-()-- .- : l(;)

- - -- - -
.
identically in all real numbers x. Now set
1/ = [gn-l(")'" ',gl(")' 1].

Then
I;'(A - "I)E(,,) = [0, 0, .. . , 0].
But E(x) is nonsingular for all x (its determinant is unity), so Ii must be a
left-sided eigenvector of A. Letting L be the matrix whose columns are
these vectors, we can easily see that L'R commutes with the diagonal
matrix of roots '1> '2, . . " 'n; in terms of entries, (" - 'J)I/rJ O. By =
hypothesis, the ,'s are distinct numbers, and therefore the two sets of
eigenvectors are biorthogonal. In other words, we have
L'R = D, or
for some diagonal matrix D.
We complete the proof by showing that the ith entry of D is indeed
the one given in the statement of the lemma. To do this, we multiply
n
gk(X) by X -k-1 and sum on k up to n - 1, giving
n-l n- l k
L gk(X) Xn-k -1
k=l
= (n - l) xn-1 - L L aJxn-J-I
k=IJ-I
After collecting the coefficients of the a's, we have
LEMMAS J THROUGH 8 199
If we set the arbitrary constant equal to -an, the right-hand side is the
derivative of gn(x). After differentiating the product form of the char
acteristic polynomial, we obtain the identity
n-1 n
xn-1 + 2 gk(X)Xn-k-1 =
2 I1 (x - 'J).
k=l k=l Jk
In particular, at x = " the left-hand side becomes the inner product,
l;'r" and the right-hand side becomes DJ, (r, - 'j) . Q.E.D.
The reader will note that the preceding is precisely the sort of analysis
used in deriving the solution of an nth-order difference equation with
coefficients 010, an. The proof immediately suggests itself if one
knows that the right-sided eigenvectors of the matrix defining the
corresponding first-order vector difference equation is a Vandermonde
matrix of characteristic roots. Only the point of view is different; we
start with the characteristic roots and construct the coefficients.
References
Beckenback, E. F., and R. Bellman (1961). Inequalities, Springer-Verlag,

Berlin.
Burkholder, D. L. (1956). "On a Class of Stochastic Approximation
Processes," Ann. Math. Statist., 27, 1044-1059.
Chernoff, H. (1956). "Large Sample Theory: Parametric Case," Anll. Math.
Statist., 25, 463 -483.
Chernoff, H. (1959). "The Sequential Design of Experiments," AIIII. Math.
Statist., 30, 755-770.
Hobson, E. W. (1957). The Theory 0/ Functions 0/ a Real Variable and the
Theory 0/ Fourier Series, Vol. 2, Dover, New York.
Hodges, J. L., Jr., and E. L. Lehmann (1951). " Some Applications of the
Cramer-Rao Inequality," Proceedings 0/ the Second Berkeley Symposium
011 Mathematical Statistics and Probability, University of California
Press, Berkeley, California, pp. 13-22.
Hodges, J. L., Jr., and E. L. Lehmann (1956). "Two Approximations to the
Robbins-Monro Process," Proceedings 0/ the Third Bl'I"keley Symposium
on Mathematical Statistics and Probability, Vol. 1, University of California
Press, Berkeley, California, pp. 96-104.
Householder, A. S. (1964). The Theory 0/ Matrices in Numerical Analysis,
Blaisdell, New York.
Kalman, R. E. (1960). " A New Approach to Linear Filtering and Prediction
Problems," Journal 0/ Basic Engineering, 82, 35-45.
Knopp, K. (1947). Theory and Applicatioll 0/ Infinite Series, Hafner, New
York.
LeCam, L. (1953). On Some Asymptotic Properties 0/ Maximum Likelihood
Estimates and Related Bayes Estimates, University of California Publica
tions in Statistics, I, No. 11, 277-330.
Locve, M. (1960). Probability Theory, Van Nostrand, New York.
200
REFERENCES 201
Parzen, E. (1960). Modem Probabiliiy Theory alld its Applications, Wiley,

New York.
Robbins, H., and S. Monro (1951)." A Stochastic Approximation Method,"
Ann. Math. Statist., 22, 400-407.
Sacks, J. (1958)... Asymptotic Distribution of Stochastic Approximation
Procedures," Anll. Math. Statist., 29, 373-405.
Index
Abel-Dini Theorem, 18 Efficiency, asymptotic, 60,65,72

Amplifier, saturating, 156 Efficiency for certain non-Gaussian
Anomaly, eccentric and mean, 175 residuals. 70
Assumptions AI, 9 Estimates, CAN and C A N E, 67
AI'-A5"', 10
BI-B5,84 Gain sequences, adaptive, 21
Cl-C6',92 deterministic, 17
01-05,99 modified,22
EI-E6',104 optimum for linear regression, 109,
Fl-F5, 126 117
GI-G3,141 "quick and dirty," 115, 121
HI-HS,185 restricted, 27
Batch processing, 104,124 Initial state, estimation of, 153

Burkholder, D. L., 5, 11,35,37,39 Intervals, confidence, large-sample. 72
Kalman filtering. 183

Central Limit Theorem, Liapounov's,
Kiefer-Wolfowitz procedure,S
168
Lindeberg's, 43 Kronecker Lemma, 17
Conditioning. 111, 136

Linearity, asymptotic, 51,61
Continuity. modulus of, 47
Linear system, estimating parameters
Convergence. scalar-parameter, II, 23
of, 161
modes of. 39
Linearization, method of, 122,183
rates of, 28
vector-parameter, 92 Matrix, elementary, 83
Cramer-Rao Inequality, regularity Maximum Likelihood, 65
conditions for, 66
Newton's method. 2, 36
Differential corrections procedure, 2 Normality, asymptotic, 39
203
204 INDEX
Recursion, truncated, 9,102 Stochastic approximation, 5, 6, 35, 37

typical computational cycle of, 178 general theorem on, 58
Regression, exponential, 25 Surf'lces, response, 139
nonlinear polynomial, 18
recursive linear, 109
Toeplitz Lemma, 50
time-homogeneous, 148
Transformation of parameter space,
trigonometric, 24
61
Regression function, definition of, ex
Trajectory, estimating parameters of,
tended by linearity, 10,38
172
Robbins-Monro process, 5, 35,36
Sequential experimentation, 140 Variances, large-sample, 48
Frequently Used Symbols, and Where Defined or Redefined
r .]: 9 A. 87,95,101,108
d. 10 T. 8 7,95,108
b. 10 Q. 87,95,101
B.2 10 T 84,92,99,104
F. 10 p 84,92,99,104
e.' 16,93 (l 84,92,99,104
{3. 29 II. 84,92,99,104
g.(. ) 39 p. 84,92,99,104
c 48 fJIJ 103
n) 49 rb 103
Fn 82 D. 110,111,112,130
1111 82 Km 133,135

Arthur E. Albert, Leland A. Gardner Jr. Stochastic Approximation and NonLinear Regression

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Arthur E. Albert, Leland A. Gardner Jr. Stochastic Approximation and NonLinear Regression

Uploaded by

Copyright:

Available Formats

STOCHASTIC APPROXIMATION

AND NONLINEAR REGRESSION

IIIIIII RESEARCH l\'IOOGRAPH NO. 42

Set ill Times New Romall

All rights reserved. No part 0/ this book may be

Library of COllgress cawlog mrd lIumber: 67-16501

ISBN: 0-262-51148-7 (Paperback)

This is the forty-second volume in the M.I.T. Research Monograph

This monograph addresses the problem of"real-time" curve fitting

term .. differential correction" refers to the proportionality of the

The authors wish to express their gratitude to Nyles Barnert, who

PART I THE SCALAR-PARAMETER CASE

2. Probability-One and Mean-Square Convergence 9

3. Moment Convergence Rates 27

4. Asymptotic Distribution Theory 38

PART Il THE VECTOR-PARAMETER CASE

6. Mean-Square and Probability-One Convergence 81

bility-One Convergence for General Gains and

7. Complements and Details 109

9. Open Problems 182

Appendix. Lemmas 1 Through 8 189

Despite the many significant and elegant theoretical developments of

be a stochastic process whose mean-value sequence is a member of a

where 6 is a vector parameter which is not known and must be estimated.

We will explore the asymptotic (large II ) properties of recursive estima

If 11 were known to be a reasonably good estimate of (i.e., is close to) e,

so that a potentially better estimator for 8 might be (see Figure 1.1)

Figure 1.1 Graphical interpretation of Newton's method.

In turn, t2 could be "improved" in the same way, and Equation 1.3

It would appear, though, that the first guess 11 must be close to 8 in

We choose a number a to satisfy

and we modify the recursion, Equation 1.4, to read

It is easy to show that In converges to 8 as n -+ 00. Indeed, by the mean

Again, we estimate 6 by sequences of the form

1,,+1 - 6 [1 - a"G"(u,,)] (/,, - 6) {Ii [1 - a/GJ(UJ)]} (/1 - 6).

Now assuming that

for all n and all x, we choose a" so that

and I t" +1 - 61 tends once again to zero as noo.

tn+l = tn - lanl Zn(tn). (1. 8a)

is the mean value of an observable random variable Z(x). The distribu

of scalars with Ln an2 < 00. The success of the Robbins-Monro

time-homogeneous regression function.) Consequently, there will be

THE SCALAR-PARAMETER CASE

2.t The Basic Assumptions (At Through AS")

Throughout Part I we will use certain assumptions, the first of which

AI. {Y,,:n 1, 2, } is an observable stochastic process of the form

Y" = F,,(8) + W", where Wl> W2, have uniformly bounded

If J happens to be finite or semifinite, it is reasonable to constrain

1,,+1 = [I" + o"[Y,, - F,,(t,,) ]] (n = 1, 2, ; 11 arbitrary). (2.2)

In the work that follows, we will use certainsymbols and assumptions

AI'. In addition to Assumption AI, WI> W2,' is a zero-mean

We note that Assumption AS'" implies AS", Assumptions A4 and AS"

{F"('2) + Fn(2)(X - '2) if x ;::: '2,

and we could use the untruncated scheme

Since we know that

inf I Fn (x) I = inf

2.2 Theorems Concerning Probability-One and Mean-Square

Let {Yn:n = 1, 2, " '} be an observable process satisfying Assump

where, for each n, o

then tn converges to 8 with probability one if either

4. L sup lan(x)I < 00

Proof. For notational convenience,denote

sup lan(x)I by sup lanl.