
Conjugate Gradients

Although the method of steepest ascent is a well-known procedure based upon a
very plausible idea, we have seen that the method is computationally not very
effective. We therefore turn our attention to some more advanced methods in
which the path taken is related to, though not identical with, the gradient
direction.
The first of these is known as the method of conjugate gradients. This method
was originally devised by Hestenes and Stiefel [1952] for solving the system of
linear algebraic equations

(3.2.20)

AX = B,
where A is a real, symmetric, positive-definite matrix, by minimizing the
corresponding quadratic form

(3.2.21)

y = \frac{1}{2} X^T A X - B^T X.
The equivalence of these two problems is established by the relationship

(3.2.22)

\nabla y = A X - B = 0.
Hence the vector X_0 that minimizes Equation (3.2.21) will also be the solution
vector of Equation (3.2.20). We shall proceed to develop the method based
upon the minimization of the quadratic form given by Equation (3.2.21). The
extension of the algorithm to the optimization of a more general objective
function will then be discussed.
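This equivalence is easy to verify numerically. The following sketch uses an arbitrary illustrative symmetric positive-definite matrix and right-hand side (assumed data, not from the text) and checks that the gradient of the quadratic form vanishes at the solution of AX = B:

```python
import numpy as np

# Illustrative data (assumed): A must be real, symmetric, positive-definite.
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
B = np.array([1.0, 2.0, 3.0])

def y(X):
    """Quadratic form of Equation (3.2.21): y = (1/2) X^T A X - B^T X."""
    return 0.5 * X @ A @ X - B @ X

def grad_y(X):
    """Gradient, Equation (3.2.22): grad y = A X - B."""
    return A @ X - B

X_star = np.linalg.solve(A, B)           # solution of A X = B
print(np.linalg.norm(grad_y(X_star)))    # essentially zero: X_star minimizes y
```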
The idea behind the conjugate-gradient procedure is similar to that of
steepest descent in that a sequence of one-dimensional searches is carried out
in directions which are determined by the partial derivatives of the objective
function. Unlike the method of steepest descent, however, the search vectors are not
equal to the negative gradient vectors; rather, a sequence of search vectors is
determined in such a manner that each search vector is conjugate to those that
precede it. The algorithm is guaranteed to minimize a quadratic function of n
independent variables with no more than n iterations, a condition known as
quadratic convergence.
Proceeding from an arbitrary initial search point X_0, we locate a sequence
of points that are successively closer to the minimum as follows:

(3.2.23)

X_{i+1} = X_i + \alpha_i P_i,

where \alpha_i is a positive scalar that defines the distance between X_i and X_{i+1} along the
search vector P_i. Notice that the minimum along P_i will occur where P_i is
tangent to the family of contours given by Equation (3.2.21). Stated differently,


the gradient of y at X_{i+1} will be normal to P_i, i.e.,

(3.2.24)

\nabla y_{i+1}^T P_i = P_i^T \nabla y_{i+1} = 0.

If we apply Equation (3.2.23) recursively, we obtain

(3.2.25)

X_k = X_i + \sum_{j=i}^{k-1} \alpha_j P_j.

In particular,

(3.2.26)

X_n = X_{i+1} + \sum_{j=i+1}^{n-1} \alpha_j P_j.

Subtracting X_{i+1} from each side of Equation (3.2.26) and premultiplying by A,
we obtain

(3.2.27)

A(X_n - X_{i+1}) = \sum_{j=i+1}^{n-1} \alpha_j A P_j.

Forming the gradient of Equation (3.2.21), however, we see that

(3.2.28)

A(X_n - X_{i+1}) = \nabla y_n - \nabla y_{i+1}.

Hence Equation (3.2.27) becomes

(3.2.29)

\nabla y_n = \nabla y_{i+1} + \sum_{j=i+1}^{n-1} \alpha_j A P_j.

As a special case of the above expression we have

(3.2.30)

\nabla y_{i+1} = \nabla y_i + \alpha_i A P_i.
Let us now develop a criterion for defining the search vectors P_j. Premultiplying
Equation (3.2.29) by P_i^T gives

(3.2.31)

P_i^T \nabla y_n = P_i^T \nabla y_{i+1} + \sum_{j=i+1}^{n-1} \alpha_j P_i^T A P_j.

The first term on the right-hand side vanishes because of Equation (3.2.24). If
we now choose the P_j such that

(3.2.32)

P_i^T A P_j = 0

for i \neq j, then the summation term in Equation (3.2.31) will also vanish, so that

(3.2.33)

P_i^T \nabla y_n = 0.
The condition expressed by Equation (3.2.32) is known as A-conjugacy, and the
set of vectors P_i is said to be A-conjugate.
It can easily be shown that the A-conjugacy condition is sufficient to
ensure that the vectors P_0, P_1, \ldots, P_{n-1} are linearly independent. Therefore, the
vector set P_i is a basis in the n-dimensional vector space, so that \nabla y_n can be
expressed in terms of one or more of the P_i. Thus Equation (3.2.33) will be
nonzero for at least one i unless \nabla y_n = 0. We see, then, that the condition that
the vectors P_i be A-conjugate causes \nabla y_n to vanish identically. Since this
condition exists only where the quadratic form is minimized, we conclude that
the quadratic form is thus minimized after no more than n one-dimensional
searches in the directions P_0, P_1, \ldots, P_{n-1}.
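The linear-independence claim can be illustrated with a small numerical sketch (random illustrative data, not from the text): A-orthogonalizing the standard basis by a Gram-Schmidt process in the inner product u^T A v yields an A-conjugate set, which is indeed a basis:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)        # a random symmetric positive-definite matrix

# Gram-Schmidt in the A inner product: produce vectors with P_i^T A P_j = 0, i != j.
P = []
for e in np.eye(n):
    v = e.copy()
    for p in P:
        v -= (p @ A @ e) / (p @ A @ p) * p
    P.append(v)
P = np.array(P)

G = P @ A @ P.T                    # Gram matrix in the A inner product
off_diag = G - np.diag(np.diag(G))
print(np.max(np.abs(off_diag)))    # essentially zero: the set is A-conjugate
print(np.linalg.matrix_rank(P))    # equals n: the vectors are linearly independent
```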
We still have not specified exactly how the P_i are chosen. Let us
arbitrarily let P_0 = -\nabla y_0, and then let

(3.2.34)

P_{i+1} = -\nabla y_{i+1} + \beta_i P_i,

where the \beta_i are positive scalars that must be determined. This choice of
P_i can be shown to satisfy the A-conjugacy condition expressed by Equation
(3.2.32). For further development of this point the reader is referred to
Beckman [1960].
From Equation (3.2.32), we can write

(3.2.35)

P_i^T A P_{i+1} = 0.

Combining this result with Equation (3.2.34) gives

(3.2.36)

P_i^T A (-\nabla y_{i+1} + \beta_i P_i) = 0,

or

(3.2.37)

\beta_i = \frac{P_i^T A \nabla y_{i+1}}{P_i^T A P_i}.
Equation (3.2.37) can be used to compute the \beta_i if we so desire. In a practical
sense, however, notice that Equation (3.2.37) requires an explicit knowledge of
the matrix A. If a large problem (i.e., a problem of high dimensionality) is
being solved on a digital computer, then the n \times n matrix A must be stored in the
computer. This can occupy a significant portion of core storage and hence limit
the size of a problem that can be solved by the conjugate-gradient method.
For this reason we will not use Equation (3.2.37); instead we shall seek an
expression for \beta_i that does not contain the matrix A.
From Equation (3.2.30), we can write

(3.2.38)

P_i^T \nabla y_{i+1} = P_i^T \nabla y_i + \alpha_i P_i^T A P_i.

From Equation (3.2.24), however, we see that the left-hand side of Equation
(3.2.38) vanishes, so that

(3.2.39)

\alpha_i = -\frac{P_i^T \nabla y_i}{P_i^T A P_i}.

Now let us premultiply Equation (3.2.34) by \nabla y_{i+1}^T. This gives

(3.2.40)

\nabla y_{i+1}^T P_{i+1} = -\nabla y_{i+1}^T \nabla y_{i+1} + \beta_i \nabla y_{i+1}^T P_i.

Again referring to Equation (3.2.24), we see that the last term in Equation (3.2.40)
vanishes. Hence,

(3.2.41)

\nabla y_i^T P_i = -\nabla y_i^T \nabla y_i

(the index has been lowered by one; the relation also holds for i = 0, since
P_0 = -\nabla y_0). Substituting this result into Equation (3.2.39) gives

(3.2.42)

\alpha_i = \frac{\nabla y_i^T \nabla y_i}{P_i^T A P_i}.
Let us again make use of Equation (3.2.30) to write

(3.2.43)

\nabla y_{i+1} - \nabla y_i = \alpha_i A P_i,

which can be rearranged to give

(3.2.44)

A P_i = \frac{1}{\alpha_i}(\nabla y_{i+1} - \nabla y_i).

Recall that A is assumed to be symmetric; hence

(3.2.45)

P_i^T A = (A P_i)^T = \frac{1}{\alpha_i}(\nabla y_{i+1} - \nabla y_i)^T,

so that

(3.2.46)

\beta_i = \frac{(\nabla y_{i+1} - \nabla y_i)^T \nabla y_{i+1}}{\alpha_i P_i^T A P_i}.

Utilizing Equation (3.2.30) one more time, we can write

(3.2.47)

\nabla y_{i+1}^T P_{i-1} = \nabla y_i^T P_{i-1} + \alpha_i P_i^T A P_{i-1} = 0

because of Equations (3.2.24) and (3.2.32). From Equation (3.2.34), however,
we have

(3.2.48)

\nabla y_i = -P_i + \beta_{i-1} P_{i-1}.

Combining this result with Equation (3.2.47) gives

(3.2.49)

\nabla y_{i+1}^T \nabla y_i = -\nabla y_{i+1}^T P_i + \beta_{i-1} \nabla y_{i+1}^T P_{i-1} = -\nabla y_{i+1}^T P_i.

Since \nabla y_{i+1}^T P_i vanishes because of Equation (3.2.24), we conclude that

(3.2.50)

\nabla y_{i+1}^T \nabla y_i = 0.

Hence Equation (3.2.46) becomes

(3.2.51)

\beta_i = \frac{\nabla y_{i+1}^T \nabla y_{i+1}}{\alpha_i P_i^T A P_i}.

A simplified expression for \beta_i can now be obtained by substituting
Equation (3.2.42) into Equation (3.2.51). This yields

(3.2.52)

\beta_i = \frac{\nabla y_{i+1}^T \nabla y_{i+1}}{\nabla y_i^T \nabla y_i},

which does not require explicit knowledge of the matrix A.
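The recursions just derived can be exercised end to end on a small randomly generated positive-definite system (illustrative data, not from the text). The sketch below checks both the A-conjugacy of the resulting search vectors and convergence in n steps:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)            # symmetric positive-definite
B = rng.standard_normal(n)

X = np.zeros(n)
g = A @ X - B                          # gradient, Eq. (3.2.22)
P = -g                                 # P0 = -grad y0
directions = []
for _ in range(n):
    alpha = (g @ g) / (P @ A @ P)      # Eq. (3.2.42)
    X = X + alpha * P                  # Eq. (3.2.23)
    g_new = A @ X - B
    beta = (g_new @ g_new) / (g @ g)   # Eq. (3.2.52): no matrix A needed
    directions.append(P)
    P = -g_new + beta * P              # Eq. (3.2.34)
    g = g_new

# Search vectors satisfy Eq. (3.2.32), and the minimum is found in n steps:
conj = max(abs(directions[i] @ A @ directions[j])
           for i in range(n) for j in range(i + 1, n))
print(conj)                            # essentially zero
print(np.allclose(X, np.linalg.solve(A, B)))   # True
```

Note that only Equation (3.2.42) uses A explicitly here, and only because the exact line minimum of a quadratic is available in closed form; for a general objective function that step is replaced by a numerical one-dimensional search.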

Several other mathematical relationships can be shown to exist among
the various vectors that appear in the algorithm. For a detailed account of
these relationships, as well as a more rigorous exposition of the conjugate-gradient
method, the reader is referred to a paper by Beckman [1960].
Let us now summarize the algorithm, including a generalization to the
maximization problem. Choose an arbitrary point X_0 and evaluate the gradient
vector \nabla y_0. Let P_0 = \pm\nabla y_0 (the plus sign corresponding to a maximization
problem, the minus to a minimization). Obtain

(3.2.23)

X_{i+1} = X_i + \alpha_i P_i

as the point on the P_i vector where the objective function is extremized. This
point is located by conducting a one-dimensional search along P_i. Determine
\nabla y_{i+1}, the gradient vector, at X_{i+1}. Compute \beta_i in accordance with Equation
(3.2.52), and determine a new search vector

(3.2.53)

P_{i+1} = \pm\nabla y_{i+1} + \beta_i P_i.

Again the plus sign is chosen for a maximization problem and the minus sign
corresponds to a minimization.
Although the method is supposed to find the optimum of a quadratic
form with no more than n iterations, it is in fact rather sensitive to roundoff
error. Fletcher and Reeves [1964] suggest that the computation be restarted
with P_i = \pm\nabla y_i (i.e., with a pure gradient step) after every n + 1 iterations as
an effective way to minimize the problem of cumulative roundoff.
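The summarized algorithm (minimization case, with the periodic restart) can be sketched as follows. The golden-section line search, the bracket [0, a_max], and the test function are illustrative assumptions, not from the text:

```python
import numpy as np

def golden_section(phi, a, b, tol=1e-9):
    """One-dimensional search: minimize a unimodal function phi on [a, b]."""
    r = (np.sqrt(5.0) - 1.0) / 2.0
    x1, x2 = b - r * (b - a), a + r * (b - a)
    f1, f2 = phi(x1), phi(x2)
    while b - a > tol:
        if f1 < f2:
            b, x2, f2 = x2, x1, f1
            x1 = b - r * (b - a)
            f1 = phi(x1)
        else:
            a, x1, f1 = x1, x2, f2
            x2 = a + r * (b - a)
            f2 = phi(x2)
    return 0.5 * (a + b)

def fletcher_reeves(f, grad, X0, a_max=1.0, iters=60, eps=1e-8):
    """Conjugate gradients (minimization), restarted after every n + 1 steps."""
    X = np.asarray(X0, dtype=float)
    n = X.size
    g = grad(X)
    P = -g                                   # P0 = -grad y0
    for k in range(iters):
        alpha = golden_section(lambda a: f(X + a * P), 0.0, a_max)
        X = X + alpha * P                    # Eq. (3.2.23)
        g_new = grad(X)
        if np.max(np.abs(g_new)) <= eps:     # all partials small: converged
            break
        if (k + 1) % (n + 1) == 0:
            P = -g_new                       # periodic restart against roundoff
        else:
            beta = (g_new @ g_new) / (g @ g) # Eq. (3.2.52)
            P = -g_new + beta * P            # Eq. (3.2.53), minus sign
        g = g_new
    return X

# Illustrative use (assumed quadratic test function with minimum at (1, 2)):
f = lambda X: (X[0] - 1.0) ** 2 + 10.0 * (X[1] - 2.0) ** 2
df = lambda X: np.array([2.0 * (X[0] - 1.0), 20.0 * (X[1] - 2.0)])
X_min = fletcher_reeves(f, df, [0.0, 0.0])
print(X_min)                                 # close to the minimizer (1, 2)
```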
Our discussion thus far has been concerned exclusively with the
optimization of quadratic forms. We can see that the method is applicable to a
broader class of functions, however, by expanding the objective function in a
Taylor series:

(3.2.54)

y(X) = y(X_0) + \nabla y^T\big|_{X_0}(X - X_0) + \frac{1}{2}(X - X_0)^T H (X - X_0) + \cdots,

where H is the Hessian matrix consisting of the second partial derivatives of y
evaluated at X_0; i.e.,

(3.2.55)

h_{ij} = \frac{\partial^2 y}{\partial x_i \, \partial x_j}\bigg|_{X_0}.

Notice that H is a real, symmetric matrix, providing the objective function is not
linear.

If X_0 represents an extremum of y(X), then

(3.2.56)

\nabla y\big|_{X_0} = 0,

and Equation (3.2.54) becomes

(3.2.57)

y(X) - y(X_0) = \frac{1}{2}(X - X_0)^T H (X - X_0).

Hence the quantity y(X) - y(X_0) can be expressed by Equation (3.2.57), providing
X is sufficiently close to X_0 so that the higher-order terms in the Taylor-series
expansion vanish. In the neighborhood of an optimum, then, we would expect
the conjugate-gradient method to perform quite efficiently when applied to any
continuous and differentiable nonlinear function.
Implementation of the method does not require explicit knowledge of the matrix
H. The first partial derivatives of the objective function must be known, however,
in order to determine the gradient vector \nabla y. These partial derivatives can be
determined by finite differences. Unfortunately, the method is susceptible to
the accumulation of roundoff errors, a shortcoming that becomes particularly
troublesome when the partial derivatives are approximated numerically.
Therefore, the use of a restart after every n + 1 iterations is strongly advised.
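A finite-difference approximation of the kind alluded to here might look as follows (a sketch; the central-difference formula and the step size h are assumptions, with h trading truncation error against roundoff error):

```python
import numpy as np

def grad_fd(f, X, h=1e-6):
    """Approximate each partial derivative of f at X by a central difference."""
    X = np.asarray(X, dtype=float)
    g = np.zeros_like(X)
    for i in range(X.size):
        e = np.zeros_like(X)
        e[i] = h
        g[i] = (f(X + e) - f(X - e)) / (2.0 * h)
    return g

# Illustrative check against a function with a known gradient:
f = lambda X: X[0] ** 2 + 9.0 * X[1] ** 2
print(grad_fd(f, [1.0, 1.0]))      # close to the analytic gradient (2, 18)
```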
An appreciation for quadratically convergent methods can be obtained from
Figure 3.14, where a quadratic function is minimized by the method of steepest
descent and by a conjugate-gradient search. The simple gradient (steepest-descent)
procedure, which is shown by the solid lines, tends to zigzag back and
forth near the optimum, requiring many search iterations. This type of
oscillation near an optimum is very characteristic of steepest-descent
techniques. On the other hand, the conjugate-gradient procedure, shown by
dashed lines, finds the minimum of the function in only two iterations, thus
avoiding the problem of oscillations near an optimum. Quadratically
convergent gradient methods are sometimes referred to as second-order
methods (cf. Crockett and Chernoff [1955]).

Figure 3.14. Comparison of Behavior of Steepest Descent and Conjugate
Gradient Search in Minimization of Quadratic Function

EXAMPLE 3.2.4

Resolve Example 3.2.1 using the method of conjugate gradients, with x_{1,0} = 1
and x_{2,0} = 1 as an initial point.

In vector notation,

X_0 = \begin{bmatrix} 1 \\ 1 \end{bmatrix},

and, from Example 3.2.1,

\nabla y\big|_{X_0} = -\begin{bmatrix} 4 \\ 72 \end{bmatrix},

so that P_0 = -\nabla y|_{X_0}. From Equation (3.2.23), we have

X_1 = \begin{bmatrix} 1 \\ 1 \end{bmatrix} + \alpha_0 \begin{bmatrix} 4 \\ 72 \end{bmatrix}, \qquad \alpha_0 > 0.

The objective function can be expressed as a function of \alpha_0 as follows:

y(\alpha_0) = (4\alpha_0 - 2)^2 + 9(72\alpha_0 - 4)^2.

Minimizing y(\alpha_0), we obtain y = 3.1594 at \alpha_0 = 0.0557. Hence

X_1 = \begin{bmatrix} 1.223 \\ 5.011 \end{bmatrix}.

The gradient can now be determined as

\nabla y\big|_{X_1} = \begin{bmatrix} -3.554 \\ 0.197 \end{bmatrix},

and \beta_0 can be computed as

\beta_0 = \frac{(3.554)^2 + (0.197)^2}{(4)^2 + (72)^2} = 0.00244.

Making use of Equations (3.2.34) and (3.2.23), we obtain

P_1 = \begin{bmatrix} 3.554 \\ -0.197 \end{bmatrix} + 0.00244\begin{bmatrix} 4 \\ 72 \end{bmatrix} = \begin{bmatrix} 3.564 \\ -0.022 \end{bmatrix}.

Solving for \alpha_1 as before [i.e., expressing y(X_2) as a function of \alpha_1 and
minimizing with respect to \alpha_1] yields y = 5.91 \times 10^{-10} at \alpha_1 = 0.4986. Hence

X_2 = \begin{bmatrix} 3.0000 \\ 5.0000 \end{bmatrix},

which is, for all practical purposes, the desired result.


Notice that this two-dimensional function has been minimized with the
determination of only two points. This is, of course, to be expected, since the
objective function is a quadratic and the conjugate-gradient algorithm is
quadratically convergent.
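Example 3.2.1's objective function is not restated in this excerpt; from the gradient (-4, -72) at (1, 1) and the minimum at (3, 5) quoted above, it appears to be y = (x_1 - 3)^2 + 9(x_2 - 5)^2. Under that assumption the example can be reproduced numerically, using the closed-form line minimum of Equation (3.2.39) in place of an explicit one-dimensional search:

```python
import numpy as np

# Assumed reconstruction of Example 3.2.1: y = (x1 - 3)^2 + 9 (x2 - 5)^2,
# i.e. A = diag(2, 18) and B = (6, 90) in the notation of Equation (3.2.21).
A = np.diag([2.0, 18.0])
B = np.array([6.0, 90.0])

X = np.array([1.0, 1.0])
g = A @ X - B                         # = (-4, -72), as quoted in the example
P = -g                                # P0 = -grad y0
for _ in range(2):                    # two variables: two searches suffice
    alpha = -(P @ g) / (P @ A @ P)    # exact line minimum, Eq. (3.2.39)
    X = X + alpha * P                 # Eq. (3.2.23)
    g_new = A @ X - B
    beta = (g_new @ g_new) / (g @ g)  # Eq. (3.2.52)
    P = -g_new + beta * P             # Eq. (3.2.34)
    g = g_new

print(X)                              # approximately [3. 5.]
```

Under this reconstruction the intermediate quantities come out as alpha_0 ≈ 0.0557, beta_0 ≈ 0.00244, and alpha_1 ≈ 0.4986.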
EXAMPLE 3.2.5

Resolve Example 3.2.3 using the method of conjugate gradients and a
golden-ratio search. Let \epsilon = 10^{-5} be the largest permissible magnitude of each of
the partial derivatives for which the problem will be considered to have converged
to an optimum. The partial derivatives are to be evaluated numerically.

Some selected results of this problem, obtained with a digital computer,
are given in Table 3.8.
Table 3.8. RESULTS OF CONJUGATE GRADIENT OPTIMIZATION

  n       x_1,n       x_2,n       y_n
  0       0.5000      0.5000      5.7899
  1       0.1800      0.7716      7.1772 x 10^-3
  2       0.1703      0.7618      3.8202 x 10^-3
  3       0.1922      0.7264      1.2420 x 10^-3
  5       0.1977      0.6984      2.9656 x 10^-4
 10       0.1998      0.6685      4.0192 x 10^-7
 16       0.19998     0.66683     5.0831 x 10^-12

The computation was terminated after 16 iterations, since each partial
derivative became less than or equal to 10^{-5} in magnitude. Notice that this
problem required less than one tenth as many iterations as optimal steepest
descent for a comparable solution (cf. Example 3.2.3). The computational effort per
iteration is slightly greater than with optimal steepest descent.
Variable-metric algorithm

The variable-metric algorithm is another sophisticated gradient
technique, originally devised by Davidon [1959] to minimize a quadratic
function with no more than n steps. The method was later modified and
improved upon by Fletcher and Powell [1964]. This later version of the algorithm is an
extremely powerful gradient method for extremizing any unconstrained,
continuous, and differentiable objective function. We shall present the Fletcher
and Powell version of the algorithm and discuss its advantages and
disadvantages.
Like the method of conjugate gradients, the variable-metric algorithm is
designed to extremize the function

(3.2.58)

y = y(x_1, x_2, \ldots, x_n)

by conducting a sequence of one-dimensional searches. These searches begin
at some arbitrary point X_0 and proceed to locate a succession of improved
points in accordance with

(3.2.59)

X_{i+1} = X_i + \alpha_i P_i,

where \alpha_i is some positive constant.

Now let us recall a few significant relationships that relate to the
minimization of a quadratic form. First, each X_{i+1} represents the position of an
extremum along P_i. Hence the gradient of the objective function will be
orthogonal to P_i at X_{i+1}; i.e.,

(3.2.60)

\nabla y_{i+1}^T P_i = 0,

where

(3.2.61)

\nabla y_{i+1} = \nabla y(X)\big|_{X = X_{i+1}}.

[The validity of Equation (3.2.60) is, of course, not restricted to a quadratic
form.] Also, we have established that

(3.2.62)

\nabla y_{i+1} = \nabla y_i + \alpha_i A P_i.

Finally, we have shown that an extremization algorithm will be quadratically
convergent, providing the P_i are A-conjugate; i.e.,

(3.2.63)

P_i^T A P_j = 0, \qquad i \neq j.

Any algorithm that satisfies Equation (3.2.63) will, apart from numerical
roundoff errors, minimize a quadratic form with no more than n one-dimensional
searches in the P_i directions, i = 0, 1, \ldots, n - 1.
Thus far the description of the variable-metric algorithm parallels that of
the method of conjugate gradients. The algorithms differ in the manner in
which the search vectors P_1, P_2, \ldots, P_{n-1} are chosen.

In the variable-metric algorithm the search vectors are chosen as