
Newton Raphson Algorithm

STA705 Spring 2006


Let f(x) be a function (possibly multivariate) and suppose we are interested in determining
the maximum of f and, often more importantly, the value of x which maximizes f. The most
common statistical application of this problem is finding a Maximum Likelihood Estimate (MLE).
This document discusses the Newton Raphson method.
1 Motivation
Newton Raphson maximization is based on a Taylor series expansion of the function f(x). Specifically, if we expand f(x) about a point a,

    f(x) ≈ f(a) + (x - a)^T f'(a) + (1/2)(x - a)^T f''(a)(x - a)

where f'(·) is the gradient vector and f''(·) is the hessian matrix of second derivatives. This creates
a quadratic approximation for f. We know how to maximize a quadratic function (take derivatives,
set equal to zero, and solve):
    d/dx [ f(a) + (x - a)^T f'(a) + (1/2)(x - a)^T f''(a)(x - a) ] = f'(a) + (x - a)^T f''(a) = 0

    x = a - [f'(a)]^T [f''(a)]^{-1}
The Newton Raphson process iterates this equation. Specifically, let x_0 be a starting point for
the algorithm and define successive estimates x_1, x_2, . . . recursively through the equation

    x_{i+1} = x_i - [f'(x_i)]^T [f''(x_i)]^{-1}
If the function f(x) is quadratic, then of course the quadratic approximation is exact and the
Newton Raphson method converges to the maximum in one iteration. If the function is concave,
then the Newton Raphson method is guaranteed to converge to the correct answer. If the function
is convex for some values of x, then the algorithm may or may not converge: the NR algorithm may
converge to a local maximum rather than the global maximum, it might converge to a local minimum,
or it might cycle between two points. Starting the algorithm near the global maximum is the best
practical method for helping convergence to the global maximum.
Fortunately, loglikelihoods are typically approximately quadratic (the reason asymptotic
normality occurs for many random variables). Thus, the NR algorithm is an obvious choice for
finding MLEs. The starting value for the algorithm is often a simpler estimate (in terms of ease
of computation) of the parameter, such as a method of moments estimator.
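As an illustrative aside (not part of the original notes), the Newton Raphson update takes only a few lines of Python. The names `newton_raphson`, `grad`, and `hess` and the stopping rule below are assumptions; the Hessian system is solved rather than explicitly inverted, which is numerically equivalent to the update above.

```python
import numpy as np

def newton_raphson(grad, hess, x0, tol=1e-8, max_iter=100):
    """Maximize f by iterating x <- x - f''(x)^{-1} f'(x).

    grad(x) returns the gradient vector f'(x) and hess(x) the Hessian
    matrix f''(x).  Iteration stops once the Newton step is tiny.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        step = np.linalg.solve(hess(x), grad(x))  # f''(x)^{-1} f'(x)
        x = x - step
        if np.linalg.norm(step) < tol:
            break
    return x
```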
Example
Let X_1, . . . , X_n ~ Gamma(α, β) and suppose we want the joint maximum likelihood estimate
of (α, β). The loglikelihood is

    ln f(x|α, β) = ln [ (β^{nα} / Γ^n(α)) (∏ x_i)^{α-1} exp(-β Σ x_i) ]

                 = nα ln β - n ln Γ(α) + (α - 1) Σ ln x_i - β Σ x_i
Solving for the MLE analytically requires solving the equations

    ∂/∂α ln f(x|α, β) = n ln β - n ∂ ln Γ(α)/∂α + Σ ln x_i = 0

    ∂/∂β ln f(x|α, β) = nα/β - Σ x_i = 0
These two equations cannot be solved analytically (the Gamma function is difficult to work with).
The two equations do provide us with the gradient

    f'(α, β) = [ n ln β - n ∂ ln Γ(α)/∂α + Σ ln x_i
                 nα/β - Σ x_i ]
The hessian matrix is

    f''(α, β) = [ -n ∂^2 ln Γ(α)/∂α^2    n/β
                   n/β                   -nα/β^2 ]
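As a sketch (not from the notes), the gradient and hessian translate directly into Python, with `scipy.special.digamma` and `scipy.special.polygamma(1, ·)` supplying ∂ ln Γ(α)/∂α and ∂^2 ln Γ(α)/∂α^2; the function and argument names are illustrative choices.

```python
import numpy as np
from scipy.special import digamma, polygamma

def gamma_grad(alpha, beta, n, sum_log_x, sum_x):
    """Gradient of the Gamma loglikelihood with respect to (alpha, beta)."""
    return np.array([
        n * np.log(beta) - n * digamma(alpha) + sum_log_x,  # d/d alpha
        n * alpha / beta - sum_x,                           # d/d beta
    ])

def gamma_hess(alpha, beta, n):
    """Hessian of the Gamma loglikelihood with respect to (alpha, beta)."""
    return np.array([
        [-n * polygamma(1, alpha), n / beta],
        [n / beta,                 -n * alpha / beta**2],
    ])
```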
The starting values of the algorithm may be found using the method of moments. Since E[X_i] = α/β and V[X_i] = α/β^2, the method of moments estimators are α_M = x̄^2/s^2 and β_M = x̄/s^2.
Suppose we have data (in truth actually generated with α = 2 and β = 3) such that n = 1000,
Σ ln X_i = -646.0951, X̄ = 0.6809364, and s^2 = 0.2235679. The algorithm begins at α_M = 2.073976 and β_M = 3.04577. After 3 iterations, the NR algorithm stabilizes at α̂ = 2.060933 and β̂ = 3.026616.
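A minimal sketch of this run, reusing the `gamma_grad` and `gamma_hess` helpers sketched above and the summary statistics just reported (the variable names are assumptions):

```python
import numpy as np

# Summary statistics reported above.
n = 1000
sum_log_x = -646.0951
xbar, s2 = 0.6809364, 0.2235679
sum_x = n * xbar

# Method-of-moments starting values: alpha_M = xbar^2/s^2, beta_M = xbar/s^2.
alpha, beta = xbar**2 / s2, xbar / s2

for i in range(25):
    g = gamma_grad(alpha, beta, n, sum_log_x, sum_x)
    H = gamma_hess(alpha, beta, n)
    step = np.linalg.solve(H, g)               # Newton step
    alpha, beta = alpha - step[0], beta - step[1]
    if np.linalg.norm(step) < 1e-10:
        break

print(alpha, beta)  # should stabilize near 2.0609 and 3.0266
```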
Note that there is no need to treat this situation as a multivariate problem. The partial
derivative for β may be solved in terms of α:

    ∂/∂β ln f(x|α, β) = nα/β - Σ x_i = 0

    β = nα/Σ x_i = α/x̄

Thus, for each value of α, the maximum over β is attained at β = nα/Σ x_i = α/x̄, and we can reduce the
problem to the one-dimensional problem of maximizing
    ln f(x|α, β = α/x̄) = nα ln(α/x̄) - n ln Γ(α) + (α - 1) Σ ln x_i - (α/x̄) Σ x_i

                       = nα ln α - nα ln x̄ - n ln Γ(α) + (α - 1) Σ ln x_i - nα
This is called the profile loglikelihood. The first and second derivatives are

    ∂/∂α = n[1 + ln α] - n ln x̄ - n ∂ ln Γ(α)/∂α + Σ ln x_i - n

         = n ln α - n ln x̄ - n ∂ ln Γ(α)/∂α + Σ ln x_i

    ∂^2/∂α^2 = n/α - n ∂^2 ln Γ(α)/∂α^2
Again starting the algorithm at α_M = 2.073976, the algorithm converges to α̂ = 2.060933 after
???? iterations.
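A sketch of this one-dimensional iteration, using the profile score and curvature derived above and the same summary statistics (names and tolerance are assumptions):

```python
import numpy as np
from scipy.special import digamma, polygamma

n, xbar, s2, sum_log_x = 1000, 0.6809364, 0.2235679, -646.0951

def profile_score(alpha):
    """First derivative of the profile loglikelihood in alpha."""
    return n * np.log(alpha) - n * np.log(xbar) - n * digamma(alpha) + sum_log_x

def profile_curvature(alpha):
    """Second derivative of the profile loglikelihood in alpha."""
    return n / alpha - n * polygamma(1, alpha)

alpha = xbar**2 / s2                   # method-of-moments start, about 2.074
for i in range(25):
    step = profile_score(alpha) / profile_curvature(alpha)
    alpha -= step
    if abs(step) < 1e-10:
        break

print(alpha, alpha / xbar)  # alpha-hat near 2.0609; beta-hat = alpha-hat / xbar
```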