
Chapter 8

Nonlinear Least Squares Theory


For real world data, it is hard to believe that linear specifications are universal in characterizing all economic relationships. A straightforward extension of linear specifications is to consider specifications that are nonlinear in parameters. For example, the function $\alpha + \beta x^{\gamma}$ offers more flexibility than the simple linear function $\alpha + \beta x$. Although such an extension is quite natural, it also creates various difficulties. First, deciding on an appropriate nonlinear function is typically difficult. Second, it is usually cumbersome to estimate nonlinear specifications and analyze the properties of the resulting estimators. Last, but not least, estimation results of a nonlinear specification may not be easily interpreted.

Despite these difficulties, more and more empirical evidence shows that many economic relationships are in fact nonlinear. Examples include nonlinear production functions, regime switching in output series, and time series models that can capture asymmetric dynamic patterns. In this chapter, we concentrate on the estimation of and hypothesis testing for nonlinear specifications. For more discussion of nonlinear regressions we refer to Gallant (1987), Gallant and White (1988), Davidson and MacKinnon (1993) and Bierens (1994).
8.1 Nonlinear Specifications
We consider the nonlinear specification
\[
y = f(x; \beta) + e(\beta), \qquad (8.1)
\]
where $f$ is a given function with $x$ an $\ell \times 1$ vector of explanatory variables and $\beta$ a $k \times 1$ vector of parameters, and $e(\beta)$ denotes the error of the specification. Note that for
a nonlinear specification, the number of explanatory variables $\ell$ need not be the same as the number of parameters $k$. This formulation includes the linear specification as a special case with $f(x; \beta) = x'\beta$ and $\ell = k$. Clearly, nonlinear functions that can be expressed in a linear form should be treated as linear specifications. For example, a specification involving a structural change is nonlinear in parameters:
\[
y_t =
\begin{cases}
\alpha + \beta x_t + e_t, & t \le t^*, \\
(\alpha + \delta) + \beta x_t + e_t, & t > t^*,
\end{cases}
\]
but it is equivalent to the linear specification:
\[
y_t = \alpha + \delta D_t + \beta x_t + e_t,
\]
where $D_t = 0$ if $t \le t^*$ and $D_t = 1$ if $t > t^*$. Our discussion in this chapter focuses on the specifications that cannot be expressed as linear functions.
There are numerous nonlinear specifications considered in empirical applications. A flexible nonlinear specification is
\[
y_t = \alpha + \beta\, \frac{x_t^{\gamma} - 1}{\gamma} + e_t,
\]
where $(x_t^{\gamma} - 1)/\gamma$ is the so-called Box-Cox transform of $x_t$, which yields different functions depending on the value of $\gamma$. For example, the Box-Cox transform yields $x_t - 1$ when $\gamma = 1$, $1 - 1/x_t$ when $\gamma = -1$, and a value close to $\ln x_t$ when $\gamma$ approaches zero. This function is thus more flexible than, e.g., the linear specification $\alpha + \beta x$ and the nonlinear specification $\alpha + \beta x^{\gamma}$. Note that the Box-Cox transformation is often applied to positively valued variables.
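A minimal numerical sketch of the Box-Cox transform and its limiting behavior; the function name and sample values below are illustrative assumptions, not from the text:

```python
import numpy as np

def box_cox(x, lam):
    """Box-Cox transform (x**lam - 1)/lam; returns log(x) in the limit lam -> 0."""
    x = np.asarray(x, dtype=float)
    if np.isclose(lam, 0.0):
        return np.log(x)
    return (x**lam - 1.0) / lam

x = np.array([0.5, 1.0, 2.0, 4.0])   # positively valued, as the text recommends
print(box_cox(x, 1.0))     # equals x - 1
print(box_cox(x, -1.0))    # equals 1 - 1/x
print(box_cox(x, 1e-8))    # numerically close to log(x)
```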
In the study of firm behavior, the celebrated CES (constant elasticity of substitution) production function suggests characterizing the output $y$ by the following nonlinear function:
\[
y = \alpha \left[ \delta L^{-\gamma} + (1-\delta) K^{-\gamma} \right]^{-1/\gamma},
\]
where $L$ denotes labor, $K$ denotes capital, and $\alpha$, $\delta$, and $\gamma$ are parameters such that $\alpha > 0$, $0 < \delta < 1$ and $\gamma \ge -1$. The elasticity of substitution for a CES production function is
\[
s = \frac{d \ln(K/L)}{d \ln(\mathrm{MP}_L/\mathrm{MP}_K)} = \frac{1}{1+\gamma} \ge 0,
\]
where MP denotes marginal product. This function includes the linear, Cobb-Douglas, and Leontief production functions as special cases. To estimate the CES production function, the following nonlinear specification is usually considered:
\[
\ln y = \ln\alpha - \frac{1}{\gamma} \ln\!\left[ \delta L^{-\gamma} + (1-\delta) K^{-\gamma} \right] + e;
\]
for a different estimation strategy, see Exercise 8.3. On the other hand, the translog (transcendental logarithmic) production function is nonlinear in variables but linear in parameters:
\[
\ln y = \beta_1 + \beta_2 \ln L + \beta_3 \ln K + \beta_4 (\ln L)(\ln K) + \beta_5 (\ln L)^2 + \beta_6 (\ln K)^2,
\]
and hence can be estimated by the OLS method.
In the time series context, a nonlinear AR($p$) specification is
\[
y_t = f(y_{t-1}, \ldots, y_{t-p}) + e_t.
\]
For example, the exponential autoregressive (EXPAR) specification takes the following form:
\[
y_t = \sum_{j=1}^{p} \left[ \alpha_j + \beta_j \exp\!\left( -\gamma\, y_{t-1}^2 \right) \right] y_{t-j} + e_t,
\]
where in some cases one may replace $y_{t-1}^2$ in the exponential function with $y_{t-j}^2$ for $j = 1, \ldots, p$. This specification was designed to describe physical vibrations whose amplitude depends on the magnitude of $y_{t-1}$.
As another example, consider the self-exciting threshold autoregressive (SETAR) specification:
\[
y_t =
\begin{cases}
a_0 + a_1 y_{t-1} + \cdots + a_p y_{t-p} + e_t, & \text{if } y_{t-d} \in (-\infty, c], \\
b_0 + b_1 y_{t-1} + \cdots + b_p y_{t-p} + e_t, & \text{if } y_{t-d} \in (c, \infty),
\end{cases}
\]
where $d$ is known as the delay parameter, an integer between 1 and $p$, and $c$ is the threshold parameter. Note that the SETAR model is different from the structural change model in that its parameters switch from one regime to the other depending on whether a past realization $y_{t-d}$ exceeds the threshold value $c$. This specification can be easily extended to allow for $r$ threshold parameters, so that the specification switches among $r+1$ different dynamic structures.
The SETAR specification above can be written as
\[
y_t = a_0 + \sum_{j=1}^{p} a_j y_{t-j} + \Big( \delta_0 + \sum_{j=1}^{p} \delta_j y_{t-j} \Big) \mathbf{1}_{\{y_{t-d} > c\}} + e_t,
\]
where $a_j + \delta_j = b_j$, and $\mathbf{1}$ denotes the indicator function. To avoid abrupt changes of parameters, one may replace the indicator function with a smooth function $h$ so as
to allow for smoother transitions of structures. It is typical to choose the function h as
a distribution function, e.g.,
\[
h(y_{t-d}; c, s) = \frac{1}{1 + \exp[-(y_{t-d} - c)/s]},
\]
where $c$ is still the threshold value and $s$ is a scale parameter. This leads to the following smooth threshold autoregressive (STAR) specification:
\[
y_t = a_0 + \sum_{j=1}^{p} a_j y_{t-j} + \Big( \delta_0 + \sum_{j=1}^{p} \delta_j y_{t-j} \Big)\, h(y_{t-d}; c, s) + e_t.
\]
Clearly, this specification behaves similarly to a SETAR specification when $|(y_{t-d} - c)/s|$ is very large. For more nonlinear time series models and their motivations we refer to Tong (1990).
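To illustrate the two threshold specifications, the sketch below simulates a STAR process with $p = 1$ and $d = 1$; shrinking the scale parameter makes the logistic transition approach the sharp SETAR switch. All coefficient values here are arbitrary choices for illustration.

```python
import numpy as np

def simulate_star(T, a, delta, c, s, burn=200, seed=0):
    """Simulate y_t = a0 + a1*y_{t-1} + (d0 + d1*y_{t-1}) * h(y_{t-1}; c, s) + e_t,
    where h is the logistic transition with threshold c and scale s."""
    rng = np.random.default_rng(seed)
    a0, a1 = a
    d0, d1 = delta
    y = np.zeros(T + burn)
    e = rng.standard_normal(T + burn)
    for t in range(1, T + burn):
        h = 1.0 / (1.0 + np.exp(-(y[t - 1] - c) / s))
        y[t] = a0 + a1 * y[t - 1] + (d0 + d1 * y[t - 1]) * h + e[t]
    return y[burn:]

y_smooth = simulate_star(500, a=(0.2, 0.6), delta=(-0.4, -0.9), c=0.0, s=1.0)
y_sharp  = simulate_star(500, a=(0.2, 0.6), delta=(-0.4, -0.9), c=0.0, s=0.05)  # close to SETAR
```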
Another well known nonlinear specification is the so-called artificial neural network, which has been widely used in cognitive science, engineering, biology and linguistics. A 3-layer neural network can be expressed as
\[
f(x_1, \ldots, x_p; \theta) = g\Big( \beta_0 + \sum_{i=1}^{q} \beta_i\, h\Big( \gamma_{i0} + \sum_{j=1}^{p} \gamma_{ij} x_j \Big) \Big),
\]
where $\theta$ is the parameter vector containing all the $\beta$ and $\gamma$ coefficients, and $g$ and $h$ are some pre-specified functions. In the jargon of the neural network literature, this specification contains $p$ input units in the input layer (each corresponding to an explanatory variable $x_j$), $q$ hidden units in the hidden (middle) layer with the $i$-th hidden-unit activation $h_i = h(\gamma_{i0} + \sum_{j=1}^{p} \gamma_{ij} x_j)$, and one output unit in the output layer with the activation $o = g(\beta_0 + \sum_{i=1}^{q} \beta_i h_i)$. The functions $h$ and $g$ are known as activation functions, and the parameters in these functions are connection weights. That is, the input values simultaneously activate the $q$ hidden units, and these hidden-unit activations in turn determine the output value. The output value is supposed to capture the behavior of the target (dependent) variable $y$. In the context of nonlinear regression, we can write
\[
y = g\Big( \beta_0 + \sum_{i=1}^{q} \beta_i\, h\Big( \gamma_{i0} + \sum_{j=1}^{p} \gamma_{ij} x_j \Big) \Big) + e.
\]
For a multivariate target y, networks with multiple outputs can be constructed similarly
with g being a vector-valued function.
In practice, it is typical to choose h as a sigmoid (S-shaped) function bounded
within a certain range. For example, two leading choices of h are the logistic function
$h(x) = 1/(1 + e^{-x})$, which is bounded between 0 and 1, and the hyperbolic tangent function
\[
h(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}},
\]
which is bounded between $-1$ and 1. The function $g$ may be the identity function or the same as $h$. Although the class of neural networks is highly nonlinear in parameters, it possesses two appealing properties. First, a neural network is capable of approximating any Borel-measurable function to any degree of accuracy, provided that the number of hidden units $q$ is sufficiently large. Second, to achieve a given degree of approximation accuracy, neural networks are relatively more parsimonious than, e.g., polynomial and trigonometric expansions. For more details on artificial neural networks and their relationships to econometrics we refer to Kuan and White (1994).
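The 3-layer network above is straightforward to evaluate; a minimal sketch with logistic hidden units and an identity output activation, where all shapes and parameter values are illustrative assumptions:

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def network_output(X, beta0, beta, gamma0, Gamma, g=lambda z: z):
    """o = g(beta0 + sum_i beta_i * h(gamma_i0 + x'gamma_i)) with logistic h.
    X is T x p, Gamma is q x p, gamma0 has length q, beta has length q."""
    hidden = logistic(gamma0 + X @ Gamma.T)   # T x q hidden-unit activations
    return g(beta0 + hidden @ beta)           # T outputs

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 3))             # p = 3 inputs
y_hat = network_output(X, beta0=0.1, beta=np.array([1.0, -0.5]),
                       gamma0=np.zeros(2), Gamma=rng.standard_normal((2, 3)))  # q = 2 hidden units
```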
8.2 The Method of Nonlinear Least Squares
Formally, we consider the nonlinear specification (8.1): $y = f(x; \beta) + e(\beta)$, where $f: \mathbb{R}^{\ell} \times \Theta_1 \to \mathbb{R}$, $\Theta_1$ denotes the parameter space, a subspace of $\mathbb{R}^{k}$, and $e(\beta)$ is the specification error. Given $T$ observations of $y$ and $x$, let
\[
y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_T \end{bmatrix}, \qquad
f(x_1, \ldots, x_T; \beta) = \begin{bmatrix} f(x_1; \beta) \\ f(x_2; \beta) \\ \vdots \\ f(x_T; \beta) \end{bmatrix}.
\]
The nonlinear specification (8.1) can now be expressed as
\[
y = f(x_1, \ldots, x_T; \beta) + e(\beta),
\]
where $e(\beta)$ is the vector of errors.
8.2.1 Nonlinear Least Squares Estimator
Our objective is to find a $k$-dimensional surface that best fits the data $(y_t, x_t)$, $t = 1, \ldots, T$. Analogous to the OLS method, the method of nonlinear least squares (NLS)
suggests minimizing the following NLS criterion function with respect to $\beta$:
\[
Q_T(\beta) = \frac{1}{T}\,[y - f(x_1, \ldots, x_T; \beta)]'\,[y - f(x_1, \ldots, x_T; \beta)]
= \frac{1}{T} \sum_{t=1}^{T} [y_t - f(x_t; \beta)]^2. \qquad (8.2)
\]
Note that $Q_T$ is also a function of the data $y_t$ and $x_t$; we omit the arguments $y_t$ and $x_t$ just for convenience.
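The criterion (8.2) is simple to code for any given mean function; a minimal sketch in which the exponential specification $f(x; \beta) = \beta_0 e^{\beta_1 x}$ and all numerical values are only illustrative choices:

```python
import numpy as np

def nls_criterion(beta, y, x, f):
    """Q_T(beta) = (1/T) * sum_t [y_t - f(x_t; beta)]^2."""
    resid = y - f(x, beta)
    return resid @ resid / y.shape[0]

f_exp = lambda x, b: b[0] * np.exp(b[1] * x)    # illustrative nonlinear specification
rng = np.random.default_rng(2)
x = rng.uniform(0.0, 2.0, size=200)
y = f_exp(x, [1.5, 0.8]) + 0.1 * rng.standard_normal(200)
print(nls_criterion(np.array([1.5, 0.8]), y, x, f_exp))
```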
The first order condition of the NLS minimization problem is a system of $k$ nonlinear equations with $k$ unknowns:
\[
\nabla_{\beta} Q_T(\beta) = -\frac{2}{T}\, \nabla_{\beta} f(x_1, \ldots, x_T; \beta)\, [y - f(x_1, \ldots, x_T; \beta)] \;\stackrel{\text{set}}{=}\; 0,
\]
where
\[
\nabla_{\beta} f(x_1, \ldots, x_T; \beta) = \big[ \nabla_{\beta} f(x_1; \beta) \quad \nabla_{\beta} f(x_2; \beta) \quad \cdots \quad \nabla_{\beta} f(x_T; \beta) \big]
\]
is a $k \times T$ matrix. A solution to this minimization problem is some $\tilde{\beta}$ in $\Theta_1$ that solves the first order condition, $\nabla_{\beta} Q_T(\tilde{\beta}) = 0$, and satisfies the second order condition: $\nabla^2_{\beta} Q_T(\tilde{\beta})$ is positive definite. We thus impose the following identification requirement; cf. [ID-1] for linear specifications.
[ID-2] $f(x; \beta)$ is twice continuously differentiable in the second argument on $\Theta_1$, such that for given data $(y_t, x_t)$, $t = 1, \ldots, T$, $\nabla^2_{\beta} Q_T(\beta)$ is positive definite at some interior point of $\Theta_1$.
While [ID-2] ensures that a minimum of $Q_T(\beta)$ can be found, it does not guarantee the uniqueness of this solution. For a given data set, there may exist multiple solutions to the NLS minimization problem such that each solution is a local minimum of $Q_T(\beta)$. This result is stated below; cf. Theorem 3.1.
Theorem 8.1 Given the specification (8.1), suppose that [ID-2] holds. Then, there exists a solution that minimizes the NLS criterion function (8.2).
Writing $f(x_1, \ldots, x_T; \beta)$ as $f(\beta)$, we have
\[
\nabla^2_{\beta} Q_T(\beta) = -\frac{2}{T} \sum_{t=1}^{T} \nabla^2_{\beta} f(x_t; \beta)\,[y_t - f(x_t; \beta)] + \frac{2}{T}\, [\nabla_{\beta} f(\beta)][\nabla_{\beta} f(\beta)]'.
\]
For linear regressions, $f(\beta) = X\beta$ so that $\nabla_{\beta} f(\beta) = X'$ and $\nabla^2_{\beta} f(x_t; \beta) = 0$. It follows that $\nabla^2_{\beta} Q_T(\beta) = 2(X'X)/T$, which is positive definite if, and only if, $X$ has full
column rank. This shows that [ID-2] is, in effect, analogous to [ID-1] for the OLS method. Compared with the OLS case, the NLS minimization problem may not have a closed-form solution because the first order condition is, in general, a system of nonlinear equations; see also Exercise 8.1.
The minimizer of $Q_T(\beta)$ is known as the NLS estimator and will be denoted as $\hat{\beta}_T$. Let $\hat{y}$ denote the vector of NLS fitted values with the $t$-th element $\hat{y}_t = f(x_t; \hat{\beta}_T)$, and $\hat{e}$ denote the vector of NLS residuals $y - \hat{y}$ with the $t$-th element $\hat{e}_t = y_t - \hat{y}_t$. Denote the transpose of $\nabla_{\beta} f(\beta)$ as $\Xi(\beta)$. Then by the first order condition,
\[
\Xi(\hat{\beta}_T)'\, \hat{e} = [\nabla_{\beta} f(\hat{\beta}_T)]\, \hat{e} = 0.
\]
That is, the residual vector is orthogonal to every column vector of $\Xi(\hat{\beta}_T)$. Geometrically, $f(\beta)$ defines a surface on $\Theta_1$, and, for any $\beta$ in $\Theta_1$, the columns of $\Xi(\beta)$ span a $k$-dimensional linear subspace tangent to this surface at the point $f(\beta)$. Thus, $y$ is orthogonally projected onto this surface at $f(\hat{\beta}_T)$ so that the residual vector is orthogonal to the tangent space at that point. In contrast with linear regressions, there may be more than one orthogonal projection and hence multiple solutions to the NLS minimization problem. There is also no guarantee that the sum of the NLS residuals is zero; see Exercise 8.2.
Remark: The marginal response to a change in the $i$-th regressor is $\partial f(x_t; \beta)/\partial x_{ti}$. Thus, one should be careful in interpreting the estimation results because a parameter in a nonlinear specification is not necessarily the marginal response to a change in a regressor.
8.2.2 Nonlinear Optimization Algorithms
When a solution to the first order condition of the NLS minimization problem cannot be obtained analytically, the NLS estimates must be computed using numerical methods. To optimize a nonlinear function, an iterative algorithm starts from some initial value of the argument and then repeatedly calculates the next value according to a particular rule until an optimum is reached approximately. It should be noted that when there are multiple optima, an iterative algorithm may not be able to locate the global optimum. In fact, it is more common for an algorithm to get stuck at a local optimum, except in some special cases, e.g., when optimizing a globally concave (convex) function. In the literature, several new methods, such as the simulated annealing algorithm, have been proposed to find the global solution. These methods have not yet become standard because they are typically difficult to implement and computationally very intensive. We will therefore confine ourselves to those commonly used local
methods.
To minimize $Q_T(\beta)$, a generic algorithm can be expressed as
\[
\beta^{(i+1)} = \beta^{(i)} + s^{(i)} d^{(i)},
\]
so that the $(i+1)$-th iterated value $\beta^{(i+1)}$ is obtained from $\beta^{(i)}$, the value from the previous iteration, by an adjustment $s^{(i)} d^{(i)}$, where $d^{(i)}$ characterizes the direction of change in the parameter space and $s^{(i)}$ controls the amount of change. Different algorithms result from different choices of $s$ and $d$. As maximizing $Q_T$ is equivalent to minimizing $-Q_T$, the methods discussed here are readily modified for maximization problems.
Consider the first-order Taylor expansion of $Q_T(\beta)$ about $\beta^{\dagger}$:
\[
Q_T(\beta) \approx Q_T(\beta^{\dagger}) + [\nabla_{\beta} Q_T(\beta^{\dagger})]'\,(\beta - \beta^{\dagger}).
\]
Replacing $\beta$ with $\beta^{(i+1)}$ and $\beta^{\dagger}$ with $\beta^{(i)}$, we have
\[
Q_T\big(\beta^{(i+1)}\big) \approx Q_T\big(\beta^{(i)}\big) + \big[\nabla_{\beta} Q_T\big(\beta^{(i)}\big)\big]'\, s^{(i)} d^{(i)}.
\]
Note that this approximation is valid when $\beta^{(i+1)}$ is in the neighborhood of $\beta^{(i)}$. Let $g(\beta)$ denote the gradient vector of $Q_T$, $\nabla_{\beta} Q_T(\beta)$, and let $g^{(i)}$ denote $g(\beta)$ evaluated at $\beta^{(i)}$. If $d^{(i)} = -g^{(i)}$,
\[
Q_T\big(\beta^{(i+1)}\big) \approx Q_T\big(\beta^{(i)}\big) - s^{(i)}\, g^{(i)\prime} g^{(i)}.
\]
As $g^{(i)\prime} g^{(i)}$ is non-negative, we can find a positive and small enough $s$ such that $Q_T$ is decreasing. Clearly, when $\beta^{(i)}$ is already a minimum of $Q_T$, $g^{(i)}$ is zero so that no further adjustment is possible. This suggests the following algorithm:
\[
\beta^{(i+1)} = \beta^{(i)} - s^{(i)} g^{(i)}.
\]
Choosing $d^{(i)} = g^{(i)}$ leads to
\[
\beta^{(i+1)} = \beta^{(i)} + s^{(i)} g^{(i)},
\]
which can be used to search for a maximum of $Q_T$.
Given the search direction, one may want to choose $s^{(i)}$ such that the next value of the objective function, $Q_T\big(\beta^{(i+1)}\big)$, is a minimum. This suggests that the first order condition below should hold:
\[
\frac{\partial Q_T\big(\beta^{(i+1)}\big)}{\partial s^{(i)}}
= \nabla_{\beta} Q_T\big(\beta^{(i+1)}\big)'\, \frac{\partial \beta^{(i+1)}}{\partial s^{(i)}}
= -g^{(i+1)\prime} g^{(i)} = 0.
\]
Let $H^{(i)}$ denote the Hessian matrix of $Q_T$ evaluated at $\beta^{(i)}$:
\[
H^{(i)} = \nabla^2_{\beta} Q_T(\beta)\big|_{\beta = \beta^{(i)}} = \nabla_{\beta}\, g(\beta)\big|_{\beta = \beta^{(i)}}.
\]
Then by Taylor's expansion of $g$, we have
\[
g^{(i+1)} \approx g^{(i)} + H^{(i)} \big( \beta^{(i+1)} - \beta^{(i)} \big) = g^{(i)} - H^{(i)} s^{(i)} g^{(i)}.
\]
It follows that
\[
0 = g^{(i+1)\prime} g^{(i)} \approx g^{(i)\prime} g^{(i)} - s^{(i)}\, g^{(i)\prime} H^{(i)} g^{(i)},
\]
or equivalently,
\[
s^{(i)} = \frac{g^{(i)\prime} g^{(i)}}{g^{(i)\prime} H^{(i)} g^{(i)}}.
\]
The step length $s^{(i)}$ is non-negative whenever $H^{(i)}$ is positive definite. The algorithm derived above now reads
\[
\beta^{(i+1)} = \beta^{(i)} - \frac{g^{(i)\prime} g^{(i)}}{g^{(i)\prime} H^{(i)} g^{(i)}}\, g^{(i)},
\]
which is known as the steepest descent algorithm. If $H^{(i)}$ is not positive definite, $s^{(i)}$ may be negative, so that this algorithm may point in a wrong direction.
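A sketch of one steepest descent update with the step length just derived; supplying the gradient and Hessian as callables is an implementation choice for illustration, not anything prescribed in the text:

```python
import numpy as np

def steepest_descent_step(beta, grad, hess):
    """One update beta - s*g with s = (g'g)/(g'Hg); assumes g'Hg > 0."""
    g = grad(beta)
    H = hess(beta)
    s = (g @ g) / (g @ H @ g)
    return beta - s * g
```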
As the steepest descent algorithm adjusts parameters along the opposite of the gradient direction, it may run into difficulty when, e.g., the nonlinear function being optimized is flat around the optimum. The algorithm may then iterate back and forth without much progress in approaching an optimum. An alternative is to consider the second-order Taylor expansion of $Q_T(\beta)$ around some $\beta^{\dagger}$:
\[
Q_T(\beta) \approx Q_T(\beta^{\dagger}) + g^{\dagger\prime}\,(\beta - \beta^{\dagger}) + \tfrac{1}{2}\,(\beta - \beta^{\dagger})'\, H^{\dagger}\,(\beta - \beta^{\dagger}),
\]
where $g^{\dagger}$ and $H^{\dagger}$ are $g$ and $H$ evaluated at $\beta^{\dagger}$, respectively. From this expansion, the first order condition of $Q_T(\beta)$ may be expressed as
\[
g^{\dagger} + H^{\dagger}\,(\beta - \beta^{\dagger}) \approx 0,
\]
so that $\beta \approx \beta^{\dagger} - (H^{\dagger})^{-1} g^{\dagger}$. This suggests the following algorithm:
\[
\beta^{(i+1)} = \beta^{(i)} - \big( H^{(i)} \big)^{-1} g^{(i)},
\]
where the step length is 1, and the direction vector is $-\big( H^{(i)} \big)^{-1} g^{(i)}$. This is also known as the Newton-Raphson algorithm. This algorithm is more difficult to implement because it involves matrix inversion at each iteration step.
From Taylor's expansion we can also see that
\[
Q_T\big(\beta^{(i+1)}\big) - Q_T\big(\beta^{(i)}\big) \approx -\tfrac{1}{2}\, g^{(i)\prime} \big( H^{(i)} \big)^{-1} g^{(i)},
\]
where the right-hand side is negative provided that $H^{(i)}$ is positive definite. When this approximation is good, the Newton-Raphson algorithm usually (but not always) results in a decrease in the value of $Q_T$. This algorithm may point in a wrong direction if $H^{(i)}$ is not positive definite; this happens when, e.g., $Q_T$ is concave at $\beta^{(i)}$. When $Q_T$ is (locally) quadratic with local minimum $\beta^*$, the second-order expansion about $\beta^{(i)}$ is exact, and hence
\[
\beta^* = \beta^{(i)} - \big( H^{(i)} \big)^{-1} g^{(i)}.
\]
In this case, the Newton-Raphson algorithm can reach the minimum in a single step.
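The single-step property is easy to verify numerically for an exactly quadratic criterion (cf. Exercise 8.1); the numerical values below are arbitrary illustrations:

```python
import numpy as np

def newton_step(beta, grad, hess):
    """One Newton-Raphson update: beta - H(beta)^{-1} g(beta)."""
    return beta - np.linalg.solve(hess(beta), grad(beta))

# Quadratic criterion Q(beta) = a + b'beta + beta'C beta with C positive definite.
C = np.array([[2.0, 0.3], [0.3, 1.0]])
b = np.array([1.0, -2.0])
grad = lambda beta: b + 2.0 * C @ beta
hess = lambda beta: 2.0 * C
beta1 = newton_step(np.zeros(2), grad, hess)
print(np.allclose(grad(beta1), 0.0))   # True: the first order condition holds after one step
```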
Alternatively, we may also add a step length to the Newton-Raphson algorithm:
\[
\beta^{(i+1)} = \beta^{(i)} - s^{(i)} \big( H^{(i)} \big)^{-1} g^{(i)},
\]
where $s^{(i)}$ may be found by minimizing $Q_T\big(\beta^{(i+1)}\big)$. In practice, it is more typical to choose $s^{(i)}$ such that $Q_T$ is decreasing at each iteration.
An algorithm that avoids computing the second-order derivatives is the so-called Gauss-Newton algorithm. When $Q_T(\beta)$ is the NLS criterion function,
\[
H(\beta) = -\frac{2}{T} \sum_{t=1}^{T} \nabla^2_{\beta} f(x_t; \beta)\,[y_t - f(x_t; \beta)] + \frac{2}{T}\, \Xi(\beta)' \Xi(\beta),
\]
where $\Xi(\beta) = [\nabla_{\beta} f(\beta)]'$. It is therefore convenient to ignore the first term on the right-hand side and approximate $H(\beta)$ by $2\,\Xi(\beta)'\Xi(\beta)/T$. There are some advantages to this approximation. First, only the first-order derivatives need to be computed. Second, this approximation is guaranteed to be positive definite under [ID-2]. The resulting algorithm is
\[
\beta^{(i+1)} = \beta^{(i)} + \Big[ \Xi\big(\beta^{(i)}\big)' \Xi\big(\beta^{(i)}\big) \Big]^{-1} \Xi\big(\beta^{(i)}\big)' \Big[ y - f\big(\beta^{(i)}\big) \Big].
\]
Observe that the adjustment term can be obtained as the OLS coefficient estimate from regressing $y - f\big(\beta^{(i)}\big)$ on $\Xi\big(\beta^{(i)}\big)$; this regression is thus known as the Gauss-Newton regression. The iterated values can be easily computed by performing the Gauss-Newton regression repeatedly. The performance of this algorithm may be quite different from that of the
Newton-Raphson algorithm because it utilizes only an approximation to the Hessian
matrix.
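A minimal sketch of the Gauss-Newton iterations, in which each update is computed as the least-squares coefficient of the Gauss-Newton regression; the exponential mean function, its Jacobian, and all starting values are illustrative assumptions:

```python
import numpy as np

def gauss_newton(y, x, f, jac, beta0, tol=1e-8, max_iter=100):
    """NLS by Gauss-Newton: regress y - f(beta) on Xi(beta) and update until the step is tiny."""
    beta = np.asarray(beta0, dtype=float)
    for _ in range(max_iter):
        resid = y - f(x, beta)
        Xi = jac(x, beta)                                   # T x k matrix of derivatives of f
        step, *_ = np.linalg.lstsq(Xi, resid, rcond=None)   # Gauss-Newton regression
        beta = beta + step
        if np.linalg.norm(step) < tol:
            break
    return beta

f_exp = lambda x, b: b[0] * np.exp(b[1] * x)
jac_exp = lambda x, b: np.column_stack([np.exp(b[1] * x), b[0] * x * np.exp(b[1] * x)])
rng = np.random.default_rng(3)
x = rng.uniform(0.0, 2.0, size=300)
y = f_exp(x, [1.5, 0.8]) + 0.1 * rng.standard_normal(300)
print(gauss_newton(y, x, f_exp, jac_exp, beta0=[1.0, 0.5]))   # close to (1.5, 0.8)
```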
To maintain a correct search direction for the steepest descent and Newton-Raphson algorithms, it is important to ensure that $H^{(i)}$ is positive definite at each iteration. A simple approach is to correct $H^{(i)}$, if necessary, by adding an appropriate matrix to it. A popular correction is
\[
H_c^{(i)} = H^{(i)} + c^{(i)} I,
\]
where $c^{(i)}$ is a positive number chosen to force $H_c^{(i)}$ to be a positive definite matrix.
Let $\tilde{H} = H^{-1}$. One may also compute
\[
\tilde{H}_c^{(i)} = \tilde{H}^{(i)} + c I,
\]
because it is the inverse of $H^{(i)}$ that matters in the algorithm. Such a correction is used in, for example, the so-called Marquardt-Levenberg algorithm.
The quasi-Newton method, on the other hand, corrects $\tilde{H}^{(i)}$ iteratively by adding a symmetric correction matrix $C^{(i)}$:
\[
\tilde{H}^{(i+1)} = \tilde{H}^{(i)} + C^{(i)},
\]
with the initial value $\tilde{H}^{(0)} = I$. This method includes the Davidon-Fletcher-Powell (DFP) algorithm and the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm, where the latter is the algorithm used in the GAUSS program. In the DFP algorithm,
\[
C^{(i)} = \frac{\delta^{(i)} \delta^{(i)\prime}}{\delta^{(i)\prime} \gamma^{(i)}}
- \frac{\tilde{H}^{(i)} \gamma^{(i)} \gamma^{(i)\prime} \tilde{H}^{(i)}}{\gamma^{(i)\prime} \tilde{H}^{(i)} \gamma^{(i)}},
\]
where $\delta^{(i)} = \beta^{(i+1)} - \beta^{(i)}$ and $\gamma^{(i)} = g^{(i+1)} - g^{(i)}$. The BFGS algorithm contains an additional term in the correction matrix.
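In practice one rarely codes these corrections by hand; handing the NLS criterion to an off-the-shelf quasi-Newton routine is a common shortcut. A sketch using scipy's BFGS implementation, with an illustrative exponential specification and arbitrary starting values (none of which come from the text):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
x = rng.uniform(0.0, 2.0, size=300)
y = 1.5 * np.exp(0.8 * x) + 0.1 * rng.standard_normal(300)

def Q_T(beta):
    """NLS criterion (1/T) * sum of squared residuals."""
    resid = y - beta[0] * np.exp(beta[1] * x)
    return resid @ resid / y.shape[0]

result = minimize(Q_T, x0=np.array([1.0, 0.5]), method="BFGS")   # gradient via finite differences
print(result.x, result.success)
```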
To implement an iterative algorithm, one must choose a vector of initial values to start the algorithm and a stopping rule to terminate the iteration procedure. Initial values are usually specified by the researcher or by random number generation; prior information, if available, should also be taken into account. For example, if the parameter is a probability, the algorithm may be initialized by, say, 0.5 or by a number randomly generated from the uniform distribution on [0, 1]. Without prior information, it is also typical to generate initial values from a normal distribution. In practice, one would generate many sets of initial values and then choose the one that leads to a better result
(for example, a better fit of the data). Of course, this search process is computationally
demanding.
When an algorithm results in no further improvement, a stopping rule must be
invoked to terminate the iterations. Typically, an algorithm stops when one of the
following convergence criteria is met: for a pre-determined, small positive number c,
1. $\big\| \beta^{(i+1)} - \beta^{(i)} \big\| < c$, where $\|\cdot\|$ denotes the Euclidean norm,

2. $\big\| g\big(\beta^{(i)}\big) \big\| < c$, or

3. $\big| Q_T\big(\beta^{(i+1)}\big) - Q_T\big(\beta^{(i)}\big) \big| < c$.
For the Gauss-Newton algorithm, one may stop the algorithm when $T R^2$ is close to zero, where $R^2$ is the coefficient of determination of the Gauss-Newton regression. As the residual vector must be orthogonal to the tangent space at the optimum, this stopping rule amounts to checking whether the first order condition is satisfied approximately. In some cases, an algorithm may never meet its pre-set convergence criterion and hence keeps on iterating. To circumvent this difficulty, an optimization program usually sets a maximum number of iterations so that the program terminates automatically once the number of iterations reaches this upper bound.
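A sketch of how the three convergence criteria might be combined in code; the tolerance value and the decision to stop when any one criterion is met are assumptions for illustration:

```python
import numpy as np

def converged(beta_new, beta_old, grad_new, Q_new, Q_old, tol=1e-6):
    """Stop when the parameter change, the gradient, or the change in Q_T is small."""
    return (np.linalg.norm(beta_new - beta_old) < tol
            or np.linalg.norm(grad_new) < tol
            or abs(Q_new - Q_old) < tol)
```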
8.3 Asymptotic Properties of the NLS Estimators
8.3.1 Consistency
As the NLS estimator does not have an analytic form in general, a different approach is needed to establish NLS consistency. Intuitively, when the NLS objective function $Q_T(\beta)$ is close to $\mathrm{IE}[Q_T(\beta)]$ for all $\beta$, it is reasonable to expect that the minimizer of $Q_T(\beta)$, i.e., the NLS estimator $\hat{\beta}_T$, is also close to a minimum of $\mathrm{IE}[Q_T(\beta)]$. Given that $Q_T$ is nonlinear in $\beta$, a ULLN must be invoked to justify the closeness between $Q_T(\beta)$ and $\mathrm{IE}[Q_T(\beta)]$, as discussed in Section 5.6.
To illustrate how consistency can be obtained, we consider a special case. Suppose that $\mathrm{IE}[Q_T(\beta)]$ is a continuous function on the compact parameter space $\Theta_1$ such that $\beta_o$ is its unique, global minimum. The NLS estimator $\hat{\beta}_T$ is such that
\[
Q_T(\hat{\beta}_T) = \inf_{\beta \in \Theta_1} Q_T(\beta).
\]
Suppose also that $Q_T$ has a SULLN effect, i.e., there is a set $\Omega_0$ such that $\mathrm{IP}(\Omega_0) = 1$ and
\[
\sup_{\beta \in \Theta_1} \big| Q_T(\beta) - \mathrm{IE}[Q_T(\beta)] \big| \to 0,
\]
for all $\omega \in \Omega_0$. Set
\[
\epsilon = \inf_{\beta \in B^c \cap \Theta_1} \big( \mathrm{IE}[Q_T(\beta)] - \mathrm{IE}[Q_T(\beta_o)] \big),
\]
where $B$ is an open neighborhood of $\beta_o$. Then for $\omega \in \Omega_0$, we can choose $T$ sufficiently large such that
\[
\mathrm{IE}[Q_T(\hat{\beta}_T)] - Q_T(\hat{\beta}_T) < \frac{\epsilon}{2},
\]
and that
\[
Q_T(\hat{\beta}_T) - \mathrm{IE}[Q_T(\beta_o)] \le Q_T(\beta_o) - \mathrm{IE}[Q_T(\beta_o)] < \frac{\epsilon}{2},
\]
because the NLS estimator $\hat{\beta}_T$ minimizes $Q_T(\beta)$. It follows that for $\omega \in \Omega_0$,
\[
\mathrm{IE}[Q_T(\hat{\beta}_T)] - \mathrm{IE}[Q_T(\beta_o)]
\le \big( \mathrm{IE}[Q_T(\hat{\beta}_T)] - Q_T(\hat{\beta}_T) \big)
+ \big( Q_T(\hat{\beta}_T) - \mathrm{IE}[Q_T(\beta_o)] \big) < \epsilon,
\]
for all $T$ sufficiently large. This shows that, compared with all $\beta$ outside the neighborhood $B$ of $\beta_o$, $\hat{\beta}_T$ will eventually render $\mathrm{IE}[Q_T(\beta)]$ closer to $\mathrm{IE}[Q_T(\beta_o)]$ with probability one. Thus, $\hat{\beta}_T$ must be in $B$ for large $T$. As $B$ is arbitrary, $\hat{\beta}_T$ must converge to $\beta_o$ almost surely. Convergence in probability of $\hat{\beta}_T$ to $\beta_o$ can be established using a similar argument; see e.g., Amemiya (1985) and Exercise 8.4.
The preceding discussion shows that what matters for consistency is the effect of a SULLN (WULLN). Recall from Theorem 5.34 that, to ensure a SULLN (WULLN), $Q_T$ should obey a SLLN (WLLN) for each $\beta \in \Theta_1$ and also satisfy a Lipschitz-type continuity condition:
\[
|Q_T(\beta) - Q_T(\beta^{\dagger})| \le C_T\, \|\beta - \beta^{\dagger}\| \quad \text{a.s.},
\]
with $C_T$ bounded almost surely (in probability). If the parameter space $\Theta_1$ is compact and convex, we have from the mean-value theorem and the Cauchy-Schwarz inequality that
\[
|Q_T(\beta) - Q_T(\beta^{\dagger})| \le \|\nabla_{\beta} Q_T(\beta^{\ddagger})\|\, \|\beta - \beta^{\dagger}\| \quad \text{a.s.},
\]
where $\beta$ and $\beta^{\dagger}$ are in $\Theta_1$ and $\beta^{\ddagger}$ is the mean value of $\beta$ and $\beta^{\dagger}$, in the sense that $\beta^{\ddagger}$ lies on the line segment between $\beta$ and $\beta^{\dagger}$. Hence, the Lipschitz-type condition would hold by setting
\[
C_T = \sup_{\beta \in \Theta_1} \|\nabla_{\beta} Q_T(\beta)\|.
\]
Observe that in the NLS context,
\[
Q_T(\beta) = \frac{1}{T} \sum_{t=1}^{T} \big[ y_t^2 - 2 y_t f(x_t; \beta) + f(x_t; \beta)^2 \big],
\]
and
\[
\nabla_{\beta} Q_T(\beta) = -\frac{2}{T} \sum_{t=1}^{T} \nabla_{\beta} f(x_t; \beta)\, \big[ y_t - f(x_t; \beta) \big].
\]
Hence, $\nabla_{\beta} Q_T(\beta)$ cannot be almost surely bounded in general. (It would be bounded if, for example, the $y_t$ were bounded random variables and both $f$ and $\nabla_{\beta} f$ were bounded functions.) On the other hand, it is practically more plausible that $\nabla_{\beta} Q_T(\beta)$ is bounded in probability. This is the case when, for example, $\mathrm{IE}\|\nabla_{\beta} Q_T(\beta)\|$ is bounded uniformly in $\beta$. As such, we shall restrict our discussion below to WULLN and weak consistency of $\hat{\beta}_T$.
To proceed, we assume that the identification requirement [ID-2] holds with probability one. The discussion above motivates the additional conditions given below.
[C1] $\{(y_t\;\; w_t')'\}$ is a sequence of random vectors, and $x_t$ is a vector containing some elements of $\mathcal{Y}^{t-1}$ and $\mathcal{W}^{t}$.

(i) The sequences $\{y_t^2\}$, $\{y_t f(x_t; \beta)\}$ and $\{f(x_t; \beta)^2\}$ all obey a WLLN for each $\beta$ in $\Theta_1$, where $\Theta_1$ is compact and convex.

(ii) $y_t$, $f(x_t; \beta)$ and $\nabla_{\beta} f(x_t; \beta)$ all have bounded second moments uniformly in $\beta$.

[C2] There exists a unique parameter vector $\beta_o$ such that $\mathrm{IE}(y_t \mid \mathcal{Y}^{t-1}, \mathcal{W}^{t}) = f(x_t; \beta_o)$.
Condition [C1] is analogous to [B1] so that stochastic regressors are allowed. [C1](i) requires that each component of $Q_T(\beta)$ obeys a standard WLLN. [C1](ii) implies
\[
\mathrm{IE}\big\|\nabla_{\beta} Q_T(\beta)\big\|
\le \frac{2}{T} \sum_{t=1}^{T} \Big( \big[\mathrm{IE}\|\nabla_{\beta} f(x_t; \beta)\|^2\, \mathrm{IE}(y_t^2)\big]^{1/2}
+ \big[\mathrm{IE}\|\nabla_{\beta} f(x_t; \beta)\|^2\, \mathrm{IE} f(x_t; \beta)^2\big]^{1/2} \Big) \le \Delta,
\]
for some $\Delta$ which does not depend on $\beta$. This in turn implies that $\nabla_{\beta} Q_T(\beta)$ is bounded in probability (uniformly in $\beta$) by Markov's inequality. Condition [C2] is analogous to [B2] and requires $f(x_t; \beta)$ to be a correct specification of the conditional mean function. Thus, $\beta_o$ globally minimizes $\mathrm{IE}[Q_T(\beta)]$ because the conditional mean must minimize the mean squared error.
Theorem 8.2 Given the nonlinear specification (8.1), suppose that [C1] and [C2] hold. Then, $\hat{\beta}_T \xrightarrow{\;\mathrm{IP}\;} \beta_o$.
Theorem 8.2 is not completely satisfactory because it is concerned with convergence to the global minimum. As noted in Section 8.2.2, an iterative algorithm is not guaranteed to find a global minimum of the NLS objective function. Hence, it is more reasonable to expect that the NLS estimator only converges to some local minimum of $\mathrm{IE}[Q_T(\beta)]$. A simple proof of such a local consistency result is not yet available. We therefore omit the details and assert only that the NLS estimator converges in probability to a local minimum $\beta^*$. Note that $f(x; \beta^*)$ is, at most, an approximation to the conditional mean function.
8.3.2 Asymptotic Normality
Given that the NLS estimator $\hat{\beta}_T$ is weakly consistent for some $\beta^*$, we will sketch a proof that, with more regularity conditions, the suitably normalized NLS estimator is asymptotically distributed as a normal random vector.

First note that by the mean-value expansion of $\nabla_{\beta} Q_T(\hat{\beta}_T)$ about $\beta^*$,
\[
\nabla_{\beta} Q_T(\hat{\beta}_T) = \nabla_{\beta} Q_T(\beta^*) + \nabla^2_{\beta} Q_T(\bar{\beta}_T)\,(\hat{\beta}_T - \beta^*),
\]
where $\bar{\beta}_T$ is a mean value of $\hat{\beta}_T$ and $\beta^*$. Clearly, the left-hand side is zero because $\hat{\beta}_T$ is the NLS estimator and hence solves the first order condition. By [ID-2], the Hessian matrix is invertible, so that
\[
\sqrt{T}\,(\hat{\beta}_T - \beta^*) = -\big[\nabla^2_{\beta} Q_T(\bar{\beta}_T)\big]^{-1}\, \sqrt{T}\, \nabla_{\beta} Q_T(\beta^*).
\]
The asymptotic distribution of $\sqrt{T}\,(\hat{\beta}_T - \beta^*)$ is therefore the same as that of the right-hand side.
Let $H_T(\beta) = \mathrm{IE}[\nabla^2_{\beta} Q_T(\beta)]$ and let $\mathrm{vec}$ denote the operator such that, for a matrix $A$, $\mathrm{vec}(A)$ is the vector that stacks all the column vectors of $A$. By the triangle inequality,
\[
\big\| \mathrm{vec}\big(\nabla^2_{\beta} Q_T(\bar{\beta}_T)\big) - \mathrm{vec}\big(H_T(\beta^*)\big) \big\|
\le \big\| \mathrm{vec}\big(\nabla^2_{\beta} Q_T(\bar{\beta}_T)\big) - \mathrm{vec}\big(H_T(\bar{\beta}_T)\big) \big\|
+ \big\| \mathrm{vec}\big(H_T(\bar{\beta}_T)\big) - \mathrm{vec}\big(H_T(\beta^*)\big) \big\|.
\]
The first term on the right-hand side converges to zero in probability, provided that $\nabla^2_{\beta} Q_T(\beta)$ also obeys a WULLN. As $\bar{\beta}_T$ is a mean value of $\hat{\beta}_T$ and $\beta^*$, weak consistency of $\hat{\beta}_T$ implies that $\bar{\beta}_T$ also converges in probability to $\beta^*$. This shows that, when $H_T(\beta)$ is continuous in $\beta$, the second term also converges to zero in probability. Consequently, $\nabla^2_{\beta} Q_T(\bar{\beta}_T)$ is essentially close to $H_T(\beta^*)$.
The result above shows that the normalized NLS estimator, $\sqrt{T}\,(\hat{\beta}_T - \beta^*)$, is asymptotically equivalent to
\[
-H_T(\beta^*)^{-1}\, \sqrt{T}\, \nabla_{\beta} Q_T(\beta^*),
\]
and hence they must have the same limiting distribution. Under suitable regularity conditions,
\[
\sqrt{T}\, \nabla_{\beta} Q_T(\beta^*) = -\frac{2}{\sqrt{T}} \sum_{t=1}^{T} \nabla_{\beta} f(x_t; \beta^*)\, \big[ y_t - f(x_t; \beta^*) \big]
\]
obeys a CLT, i.e., $(V_T^*)^{-1/2}\, \sqrt{T}\, \nabla_{\beta} Q_T(\beta^*) \xrightarrow{\;D\;} N(0, I_k)$, where
\[
V_T^* = \mathrm{var}\left( \frac{2}{\sqrt{T}} \sum_{t=1}^{T} \nabla_{\beta} f(x_t; \beta^*)\, \big[ y_t - f(x_t; \beta^*) \big] \right).
\]
Then for $D_T^* = H_T(\beta^*)^{-1}\, V_T^*\, H_T(\beta^*)^{-1}$, we immediately obtain the following asymptotic normality result:
\[
(D_T^*)^{-1/2}\, H_T(\beta^*)^{-1}\, \sqrt{T}\, \nabla_{\beta} Q_T(\beta^*) \xrightarrow{\;D\;} N(0, I_k),
\]
which in turn implies
\[
(D_T^*)^{-1/2}\, \sqrt{T}\,(\hat{\beta}_T - \beta^*) \xrightarrow{\;D\;} N(0, I_k).
\]
As in linear regression, asymptotic normality of the normalized NLS estimator remains valid when $D_T^*$ is replaced by its consistent estimator $\hat{D}_T$:
\[
\hat{D}_T^{-1/2}\, \sqrt{T}\,(\hat{\beta}_T - \beta^*) \xrightarrow{\;D\;} N(0, I_k).
\]
Thus, finding a consistent estimator for $D_T^*$ is important in practice.
Consistent estimation of $D_T^*$ is completely analogous to that for linear regressions; see Chapter 6.3. First observe that $H_T(\beta^*)$ is
\[
H_T(\beta^*) = \frac{2}{T} \sum_{t=1}^{T} \mathrm{IE}\Big( \big[\nabla_{\beta} f(x_t; \beta^*)\big] \big[\nabla_{\beta} f(x_t; \beta^*)\big]' \Big)
- \frac{2}{T} \sum_{t=1}^{T} \mathrm{IE}\Big( \nabla^2_{\beta} f(x_t; \beta^*)\, \big[ y_t - f(x_t; \beta^*) \big] \Big),
\]
which can be consistently estimated by its sample counterpart:
\[
\hat{H}_T = \frac{2}{T} \sum_{t=1}^{T} \big[\nabla_{\beta} f(x_t; \hat{\beta}_T)\big] \big[\nabla_{\beta} f(x_t; \hat{\beta}_T)\big]'
- \frac{2}{T} \sum_{t=1}^{T} \nabla^2_{\beta} f(x_t; \hat{\beta}_T)\, \hat{e}_t.
\]
Let $\epsilon_t = y_t - f(x_t; \beta^*)$. When the $\epsilon_t$ are uncorrelated with $\nabla^2_{\beta} f(x_t; \beta^*)$, $H_T(\beta^*)$ depends only on the expectation of the outer product of $\nabla_{\beta} f(x_t; \beta^*)$, so that $\hat{H}_T$ simplifies to
\[
\hat{H}_T = \frac{2}{T} \sum_{t=1}^{T} \big[\nabla_{\beta} f(x_t; \hat{\beta}_T)\big] \big[\nabla_{\beta} f(x_t; \hat{\beta}_T)\big]'.
\]
This estimator is analogous to $\sum_{t=1}^{T} x_t x_t'/T$ for $M_{xx}$ in linear regression.
If $\beta^* = \beta_o$ so that $f(x_t; \beta_o)$ is the conditional mean of $y_t$, then $V_T^*$ is
\[
V_T^o = \frac{4}{T} \sum_{t=1}^{T} \mathrm{IE}\Big( \epsilon_t^2\, \big[\nabla_{\beta} f(x_t; \beta_o)\big] \big[\nabla_{\beta} f(x_t; \beta_o)\big]' \Big).
\]
When there is conditional homoskedasticity, $\mathrm{IE}(\epsilon_t^2 \mid \mathcal{Y}^{t-1}, \mathcal{W}^{t}) = \sigma_o^2$, $V_T^o$ simplifies to
\[
V_T^o = \sigma_o^2\, \frac{4}{T} \sum_{t=1}^{T} \mathrm{IE}\Big( \big[\nabla_{\beta} f(x_t; \beta_o)\big] \big[\nabla_{\beta} f(x_t; \beta_o)\big]' \Big),
\]
which can be consistently estimated by
\[
\hat{V}_T = \hat{\sigma}_T^2\, \frac{4}{T} \sum_{t=1}^{T} \big[\nabla_{\beta} f(x_t; \hat{\beta}_T)\big] \big[\nabla_{\beta} f(x_t; \hat{\beta}_T)\big]',
\]
where $\hat{\sigma}_T^2 = \sum_{t=1}^{T} \hat{e}_t^2 / T$ is a consistent estimator for $\sigma_o^2$. In this case,
\[
\hat{D}_T = \hat{\sigma}_T^2 \left[ \frac{1}{T} \sum_{t=1}^{T} \big[\nabla_{\beta} f(x_t; \hat{\beta}_T)\big] \big[\nabla_{\beta} f(x_t; \hat{\beta}_T)\big]' \right]^{-1}.
\]
This estimator is analogous to the standard OLS variance matrix estimator $\hat{\sigma}_T^2\,(X'X/T)^{-1}$ for linear regressions.
When there is conditional heteroskedasticity such that $\mathrm{IE}(\epsilon_t^2 \mid \mathcal{Y}^{t-1}, \mathcal{W}^{t})$ is a function of the elements of $\mathcal{Y}^{t-1}$ and $\mathcal{W}^{t}$, $V_T^o$ can be consistently estimated by
\[
\hat{V}_T = \frac{4}{T} \sum_{t=1}^{T} \hat{e}_t^2\, \big[\nabla_{\beta} f(x_t; \hat{\beta}_T)\big] \big[\nabla_{\beta} f(x_t; \hat{\beta}_T)\big]',
\]
so that
\[
\hat{D}_T = \left[ \frac{2}{T} \sum_{t=1}^{T} \big[\nabla_{\beta} f(x_t; \hat{\beta}_T)\big] \big[\nabla_{\beta} f(x_t; \hat{\beta}_T)\big]' \right]^{-1}
\hat{V}_T
\left[ \frac{2}{T} \sum_{t=1}^{T} \big[\nabla_{\beta} f(x_t; \hat{\beta}_T)\big] \big[\nabla_{\beta} f(x_t; \hat{\beta}_T)\big]' \right]^{-1}
= \hat{H}_T^{-1}\, \hat{V}_T\, \hat{H}_T^{-1}.
\]
This is White's heteroskedasticity-consistent covariance matrix estimator for nonlinear regressions. If $\{\epsilon_t\}$ is not a martingale difference sequence with respect to $\mathcal{Y}^{t-1}$ and $\mathcal{W}^{t}$, $V_T^*$ can be consistently estimated using a Newey-West type estimator; see Exercise 8.7.
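A sketch of how these covariance estimators might be computed from the NLS output; the inputs are the $T \times k$ matrix of gradients of $f$ evaluated at $\hat{\beta}_T$ and the residual vector, and the function name and interface are illustrative assumptions:

```python
import numpy as np

def nls_covariance(Xi_hat, resid, robust=True):
    """Estimate of the asymptotic covariance of sqrt(T)*(beta_hat - beta*):
    sigma2 * [(1/T) Xi'Xi]^{-1} under conditional homoskedasticity (robust=False),
    or the White-type sandwich estimator (robust=True)."""
    T = Xi_hat.shape[0]
    M_inv = np.linalg.inv(Xi_hat.T @ Xi_hat / T)
    if not robust:
        sigma2 = resid @ resid / T
        return sigma2 * M_inv
    S = (Xi_hat * resid[:, None] ** 2).T @ Xi_hat / T   # (1/T) sum e_t^2 grad_f grad_f'
    return M_inv @ S @ M_inv
```

Standard errors for $\hat{\beta}_T$ would then be the square roots of the diagonal of this matrix divided by $T$.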
8.4 Hypothesis Testing
We again consider testing linear restrictions on the parameters, so that the null hypothesis is $R\beta_o = r$, where $R$ is a $q \times k$ matrix and $r$ is a $q \times 1$ vector of pre-specified constants. More generally, one may want to test nonlinear restrictions $r(\beta_o) = 0$, where $r$ is now an $\mathbb{R}^{q}$-valued nonlinear function. By linearizing $r$, the testing principles for linear restrictions carry over to this case.
The Wald test now evaluates the difference between the NLS estimates and the hypothetical values. When the normalized NLS estimates, $T^{1/2}(\hat{\beta}_T - \beta_o)$, have an asymptotic normal distribution with asymptotic covariance matrix $D_T$, we have under the null hypothesis
\[
\hat{\Gamma}_T^{-1/2}\, \sqrt{T}\, R\,(\hat{\beta}_T - \beta_o) = \hat{\Gamma}_T^{-1/2}\, \sqrt{T}\,(R\hat{\beta}_T - r) \xrightarrow{\;D\;} N(0, I_q),
\]
where $\hat{\Gamma}_T = R \hat{D}_T R'$, and $\hat{D}_T$ is a consistent estimator for $D_T$. It follows that the Wald statistic is
Wald statistic is
J
T
= T(R

T
r)

1
T
(R

T
r)

2
(q),
which is of the same form as the Wald statistic based on the OLS estimator.
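A sketch of the Wald statistic and its asymptotic $\chi^2(q)$ p-value; the interface (passing $\hat{D}_T$, $R$ and $r$ explicitly) is an illustrative choice rather than a prescribed implementation:

```python
import numpy as np
from scipy.stats import chi2

def wald_test(beta_hat, D_hat, R, r, T):
    """W_T = T * (R b - r)' [R D_hat R']^{-1} (R b - r) for H0: R beta = r."""
    diff = R @ beta_hat - r
    Gamma_hat = R @ D_hat @ R.T
    stat = T * diff @ np.linalg.solve(Gamma_hat, diff)
    return stat, chi2.sf(stat, df=R.shape[0])
```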
Remark: A well known problem with the Wald test for nonlinear hypotheses is that the statistic is not invariant with respect to the way $r(\beta) = 0$ is expressed. For example, the Wald tests perform quite differently against the two equivalent hypotheses $\beta_1 \beta_2 = 1$ and $\beta_1 = 1/\beta_2$; see, e.g., Gregory and Veall (1985) and Phillips and Park (1988).
Exercises
8.1 Suppose that $Q_T(\beta)$ is quadratic in $\beta$:
\[
Q_T(\beta) = a + b'\beta + \beta' C \beta,
\]
where $a$ is a scalar, $b$ a vector and $C$ a symmetric, positive definite matrix. Find the first order condition for minimizing $Q_T(\beta)$ and the resulting solution. Is the OLS criterion function (3.2) quadratic in $\beta$?
8.2 Let $\hat{e}_t = y_t - \hat{y}_t$ denote the $t$-th NLS residual. Is $\sum_{t=1}^{T} \hat{e}_t$ zero in general? Why or why not?
8.3 Given the nonlinear specification of the CES production function
\[
\ln y = \ln\alpha - \frac{1}{\gamma} \ln\!\left[ \delta L^{-\gamma} + (1-\delta) K^{-\gamma} \right] + e,
\]
find the second order Taylor expansion of $\ln y$ around $\gamma = 0$. How would you estimate this linearized function, and how can you calculate the original parameters $\alpha$, $\delta$, and $\gamma$?
8.4 Suppose that $\mathrm{IE}[Q_T(\beta)]$ is a continuous function on the compact parameter space $\Theta_1$ such that $\beta_o$ is its unique, global minimum. Also suppose that the NLS estimator $\hat{\beta}_T$ is such that
\[
Q_T(\hat{\beta}_T) = \inf_{\beta \in \Theta_1} Q_T(\beta).
\]
Prove that when $Q_T$ has a WULLN effect, $\hat{\beta}_T$ converges in probability to $\beta_o$.
8.5 Apply Theorem 8.2 to discuss the consistency property of the OLS estimator for the linear specification $y_t = x_t'\beta + e_t$.
8.6 Let $\epsilon_t = y_t - f(x_t; \beta_o)$. If $\{\epsilon_t\}$ is a martingale difference sequence with respect to $\mathcal{Y}^{t-1}$ and $\mathcal{W}^{t}$ such that $\mathrm{IE}(\epsilon_t^2 \mid \mathcal{Y}^{t-1}, \mathcal{W}^{t}) = \sigma_o^2$, state the conditions under which $\hat{\sigma}_T^2 = \sum_{t=1}^{T} \hat{e}_t^2 / T$ is consistent for $\sigma_o^2$.
8.7 Let $\epsilon_t = y_t - f(x_t; \beta^*)$, where $\beta^*$ may not be the same as $\beta_o$. If $\{\epsilon_t\}$ is not a martingale difference sequence with respect to $\mathcal{Y}^{t-1}$ and $\mathcal{W}^{t}$, give consistent estimators for $V_T^*$ and $D_T^*$.
References
Amemiya, Takeshi (1985). Advanced Econometrics, Cambridge, MA: Harvard University Press.

Bierens, Herman J. (1994). Topics in Advanced Econometrics, New York, NY: Cambridge University Press.

Davidson, Russell and James G. MacKinnon (1993). Estimation and Inference in Econometrics, New York, NY: Oxford University Press.

Gallant, A. Ronald (1987). Nonlinear Statistical Inference, New York, NY: John Wiley & Sons.

Gallant, A. Ronald and Halbert White (1988). A Unified Theory of Estimation and Inference for Nonlinear Dynamic Models, Oxford, UK: Basil Blackwell.

Kuan, Chung-Ming and Halbert White (1994). Artificial neural networks: An econometric perspective, Econometric Reviews, 13, 1-91.