
Adaptive Filters

V. John Mathews Scott C. Douglas

Copyright 2003 V John Mathews and Scott C Douglas

Contents
4 Stochastic Gradient Adaptive Filters
  4.1 Gradient Adaptation
    4.1.1 An Analogy
    4.1.2 The Method of Steepest Descent
    4.1.3 Implementation of the Steepest Descent Algorithm
  4.2 Stochastic Gradient Adaptive Filters
    4.2.1 The Least-Mean-Square Algorithm
    4.2.2 General Stochastic Gradient Adaptive Filters
    4.2.3 Examples of LMS Adaptive Filters
  4.3 Main Points of This Chapter
  4.4 Bibliographical Notes
  4.5 Exercises



Chapter 4 Stochastic Gradient Adaptive Filters


This chapter introduces a class of adaptive filters that employ a gradient descent optimization procedure. Implementing this procedure exactly requires knowledge of the input signal statistics, which are almost always unknown for real-world problems. Instead, an approximate version of the gradient descent procedure can be applied to adjust the adaptive filter coefficients using only the measured signals. Such algorithms are collectively known as stochastic gradient algorithms.

4.1 Gradient Adaptation

We introduce the method of gradient descent in this section using a real-world analogy. We develop the concept of a cost function using this analogy. The gradient descent procedure can be used to find the minimum of this function. We then apply these ideas to the adaptive filtering problem and derive an entire family of stochastic gradient adaptive filters.

4.1.1 An Analogy

Consider Figure 4.1, which shows a bowl-shaped surface and a ball perched on the edge of this bowl. If we were to let this ball go, gravity would cause the ball to roll down the sides of this bowl to the bottom. If we observe the ball's movement from directly above the bowl, its path would look something like that shown in Figure 4.2. The elliptical curves in the figure denote contours of equal height, and the path that the ball travels is indicated by the dotted line. Gravity's net pull on the ball at any time instant would be in a direction perpendicular to the line that is tangential to the contour line at the ball's current location. Moreover, the ball would descend faster for steeper sections of the bowl. The shape of the bowl's surface plays an important role in the path the ball takes to reach the bottom. In particular, if the surface has two or more depressions where the ball could sit idle, there is no guarantee that the ball will descend to the lowest point on the surface.


Figure 4.1: A ball rolling into a valley is a useful analogy for visualizing the method of steepest descent.

Figure 4.2: The path of a ball descending into the valley.


Figure 4.3 shows a surface with two depressions, which are also known as local minima of the surface. A ball placed nearer to the right-most local minimum will travel to that local minimum as opposed to the lower global minimum point on the left.

4.1.2 The Method of Steepest Descent

The above simple analogy illustrates some of the features of an optimization procedure called the method of steepest descent. As the name implies, the method relies on the slope at any point on the surface to provide the best direction in which to move. The steepest descent direction gives the greatest change in elevation of the surface of the cost function for a given step laterally. The steepest descent procedure uses the knowledge of this direction to move to a lower point on the surface and find the bottom of the surface in an iterative manner.

Mathematical Preliminaries

Consider a system identification problem in which we wish to have the output of a linear filter match a desired response signal d(n) as closely as possible. For simplicity of our discussion, we choose the FIR filter structure for the system model. The output of this filter is given by
d̂(n) = Σ_{i=0}^{L-1} w_i(n) x(n-i) = W^T(n) X(n),    (4.1)

where X(n) = [x(n) x(n-1) ... x(n-L+1)]^T is a vector of input signal samples and W(n) = [w_0(n) w_1(n) ... w_{L-1}(n)]^T is a vector containing the coefficients of the FIR filter at time n. Our objective is to find the coefficient vector W(n) that best models the input-output relation of the unknown system such that some positive-valued cost function of the estimation error

e(n) = d(n) - d̂(n),    (4.2)

is the smallest among all possible choices of the coefficient vector. An additional constraint on this cost function is that it has no local minima, due to the nature of the search method as illustrated by our analogy.

Cost Functions

We need to define an appropriate cost function to formulate the steepest descent algorithm mathematically. In analogy with the example discussed above, this cost function provides a surface on which we can descend to find the lowest point. The location of this lowest point defines the optimum values for the coefficients.


For our main discussion, we consider the mean-square-error cost function defined in Chapter 2 as

J(n) = E{e^2(n)} = E{(d(n) - W^T(n)X(n))^2}.    (4.3)

Recall from Chapter 2 that J(n) is a quadratic, non-negative function of the coefficient vector. If the autocorrelation matrix R_XX(n) is invertible, the cost function has a unique minimum given by

W_opt(n) = R_XX^{-1}(n) P_dX(n).    (4.4)

Our objective is to iteratively descend to the bottom of the cost function surface, so that W(n) approaches W_opt(n), using a strategy analogous to that of the ball rolling in a bowl.

The Algorithm

Consider Figure 4.4, which shows the mean-square-error cost function for a single-coefficient FIR filter with parameter w_1(n). Shown in the figure are five different points in the range of the unknown parameter, along with the tangents of the cost function at each point. We notice the following facts from the figure:

1. The cost function has no local minima.

2. At the optimum parameter value associated with the minimum of the cost function, the slope of the function is zero.

3. The slope of the cost function is always positive at points located to the right of the optimum parameter value. Conversely, the slope of the cost function is always negative at points located to the left of the optimum parameter value.

4. For any given point, the larger the distance from this point to the optimum value, the larger is the magnitude of the slope of the cost function.

These facts suggest an iterative approach for finding the parameter value associated with the minimum of the cost function: simply move the current parameter value in the direction opposite to that of the slope of the cost function at the current parameter value. Furthermore, if we make the magnitude of the change in the parameter value proportional to the magnitude of the slope of the cost function, the algorithm will make large adjustments of the parameter value when its value is far from the optimum value and will make smaller adjustments to the parameter value when the value is close to the optimum value. This approach is the essence of the steepest descent algorithm.


Figure 4.3: A ball cannot be expected to descend to the lowest point on a surface with multiple depressions.

Figure 4.4: Mean-square-error cost function for a single-coefficient FIR filter.


We can generalize the above approach for an arbitrary cost function J(n) and a vector of parameters W(n). The new coefficient vector W(n+1) is computed in this case as

W(n+1) = W(n) - η ∂J(n)/∂W(n),    (4.5)

where ∂J(n)/∂W(n) denotes a vector whose ith value is given by ∂J(n)/∂w_i(n) and η is a proportionality constant. This vector is known as the gradient of the error surface. For the mean-square-error cost function, the above algorithm becomes

W(n+1) = W(n) - (μ/2) ∂E{e^2(n)}/∂W(n),    (4.6)

where we have defined η = μ/2. The parameter μ is termed the step size of the algorithm. The additional factor of 1/2 in (4.6) is introduced for notational convenience.

Characteristics of Cost Functions

We are not limited to mean-square-error cost functions or those that depend on statistical expectations. In general, we can consider arbitrary functions Φ(e(n)) of the error that have the following characteristics:

1. The function Φ(e(n)) is an even function of the estimation error signal; i.e., Φ(e(n)) = Φ(-e(n)).

2. The function Φ(e(n)) is monotonically increasing in the argument |e(n)|. In other words, for two errors e_1 and e_2, the inequality |e_1| < |e_2| implies that Φ(e_1) < Φ(e_2).

Examples of commonly-employed cost functions that satisfy the above two characteristics include:

  Mean-square-error:              E{e^2(n)}
  Mean-absolute-error:            E{|e(n)|}
  Mean-Kth-power-error:           E{|e(n)|^K}
  Mean-normalized-squared-error:  E{ e^2(n) / Σ_{j=n-L+1}^{n} x^2(j) }
  Least-squares error:            Σ_{i=1}^{n} e_n^2(i),  where e_n(i) = d(i) - W^T(n)X(i)
  Instantaneous squared error:    e^2(n)

The least-squares error criterion was considered extensively in Chapter 2 and will be discussed further in Chapter 5. The last error criterion listed above is an instantaneous approximation of the mean-square-error criterion. This approximation forms the basis of stochastic gradient adaptive filtering algorithms.


4.1.3 Implementation of the Steepest Descent Algorithm

To implement the steepest descent algorithm, we must first evaluate the partial derivatives of the cost function with respect to the coefficient values. Since derivatives and expectations are both linear operations, we can change the order in which the two operations are performed on the squared estimation error. With this change, we have

∂E{e^2(n)}/∂W(n) = E{ ∂e^2(n)/∂W(n) }
                 = E{ 2e(n) ∂e(n)/∂W(n) }
                 = E{ 2e(n) ∂(d(n) - W^T(n)X(n))/∂W(n) }
                 = -2E{e(n)X(n)}.    (4.7)

Thus, we can restate the steepest descent algorithm as

W(n+1) = W(n) + μ E{e(n)X(n)}.    (4.8)

To proceed further, we must evaluate the expectation in (4.8) directly. This expectation is

E{e(n)X(n)} = E{X(n)(d(n) - d̂(n))}
            = E{d(n)X(n)} - E{X(n)X^T(n)W(n)}
            = P_dX(n) - R_XX(n)W(n),    (4.9)

where P_dX(n) = E{d(n)X(n)} is the cross-correlation vector of the desired response signal and the input vector at time n and R_XX(n) is the autocorrelation matrix of the input vector. Thus, the steepest descent procedure for mean-square-error minimization can be written as

W(n+1) = W(n) + μ (P_dX(n) - R_XX(n)W(n)).    (4.10)

Table 4.1 shows a MATLAB function for implementing the steepest descent algorithm for a given autocorrelation matrix and cross-correlation vector.
Example 4.1: Behavior of the Steepest Descent Algorithm

Consider a two-coefficient system with autocorrelation matrix and cross-correlation vector given by

R_XX(n) = [1 0.5; 0.5 1]  and  P_dX(n) = [1.5 1.5]^T,

respectively.


Table 4.1: MATLAB function for performing the steepest descent search.

function [W] = steepdes(mu,W0,R,P,num_iter);
%  This function adapts a finite-impulse-response (FIR) filter
%  using the method of steepest descent.
%
%  Input parameters:
%    mu       = step size
%    W0       = Initial value of W(0) coefficients (L x 1)
%    R        = Input autocorrelation matrix (L x L)
%    P        = Cross-correlation vector (L x 1)
%    num_iter = number of iterations for simulation
%
%  Output of program:
%    W        = Evolution of coefficients (L x (num_iter + 1))

L = length(W0);
start_iter = 1;
end_iter = num_iter;
W = zeros(L,end_iter);
W(:,1:start_iter) = W0*ones(1,start_iter);
for n = start_iter:end_iter;
  W(:,n+1) = W(:,n) + mu*(P - R*W(:,n));
end;



These statistics correspond to a set of optimum MMSE coefficients given by W_opt(n) = [1 1]^T.


The mean-squared error surface for this problem is plotted in Figure 4.5. We now investigate how the steepest descent algorithm behaves for different choices of the step size parameter and starting coefficient values W(0).

Figure 4.6 shows the evolution of the coefficients for a step size of μ = 0.01 and three different starting vectors W(0). For each of the three adaptation curves, a single dot (·) denotes one iteration of the algorithm. As can be seen from this graph, all the adaptation curves approach the optimum coefficient values W_opt = [1 1]^T. For the two initial starting vectors that fall along the principal axes of the elliptical contours of the MSE surface, adaptation occurs along a straight line in the two-dimensional coefficient space. In contrast, when W(0) = [3.0 1.5]^T, the coefficients take a curved path towards the bottom of the error surface.

Figure 4.7 shows the evolution of the coefficients for each of the initial starting vectors for μ = 0.1. The behavior of the algorithm is similar to that shown in Figure 4.6, except that the spatial distances between successive values of W(n) are increased, indicating faster adaptation for this step size as compared to the previous case.

Figure 4.8 shows the behavior of the algorithm for a step size of μ = 1. We have traced the coefficient paths for each of the different starting conditions using dashed lines in this figure. The larger dots in the figure indicate the coefficient values after individual iterations. The results of Figure 4.8 indicate that the behavior of the coefficients is more erratic for starting vectors of W(0) = [3.0 1.5]^T and W(0) = [0.5 0.5]^T, as the coefficients oscillate between the two sides of the error surface.

Figures 4.9 and 4.10 show the evolution of the coefficients w_1(n) and w_2(n), respectively, for different step sizes with an initial coefficient vector W(0) = [3 1.5]^T. The x-axes on both plots are logarithmic in scale. We can see that a larger step size causes faster convergence of the coefficients to their optimum values. However, the behavior of the coefficient vector is more erratic for very large step sizes. We can also observe from each of the figures that the corrections made to the coefficient values are smaller when the coefficients are in the vicinity of their optimum values as compared to the changes made during the initial stages of adaptation. This characteristic is desirable for any adaptation algorithm, as it enables the coefficients to smoothly approach their optimum values.
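As an illustration, the steepdes function of Table 4.1 can be driven directly with the statistics of this example; the short MATLAB sketch below (with illustrative variable names) also computes the optimum solution for comparison.

% Statistics of Example 4.1
R = [1 0.5; 0.5 1];              % input autocorrelation matrix
P = [1.5; 1.5];                  % cross-correlation vector
Wopt = R\P;                      % optimum MMSE coefficients, equal to [1; 1]

% Steepest descent search from one of the starting vectors of Example 4.1
mu = 0.01;                       % step size
W0 = [3; 1.5];                   % initial coefficient vector
W = steepdes(mu,W0,R,P,1000);    % coefficient evolution, L x 1001
plot(W(1,:),W(2,:),'.');         % path of the coefficients in the (w1,w2) plane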

We can see from Example 4.1 that the choice of step size is critical in obtaining good results with the steepest descent method. Too small a step size requires an excessive number of iterations to reach the vicinity of the minimum point on the error surface. Too large a step size causes the path to bounce from one side of the surface to the other, which can slow convergence as well. An excessively large step size will cause the next cost to be greater than the current cost, and the algorithm may diverge! Clearly, the success of the algorithm hinges on a good step size choice. Guidelines for selecting a good value for the step size can be determined through a performance analysis of the steepest descent algorithm.


Figure 4.5: The mean-squared error surface for Example 4.1.


Figure 4.6: Evolution of the coefficients in Example 4.1 for different starting values of W(n) with μ = 0.01.



Figure 4.7: Evolution of the coefficients in Example 4.1 for different starting values of W(n) with μ = 0.1.

Steady-State Properties of the Algorithm

Example 4.1 suggests that the steepest descent algorithm can converge to the minimum point on the error surface for a proper choice of step size. However, we have not yet proven that such convergence of the steepest descent algorithm will occur in general. To pursue this issue further, assume that the autocorrelation matrix and cross-correlation vector are constant over time, such that R_XX(n) = R_XX and P_dX(n) = P_dX. We ask the question: what coefficient values W(n) are not changed by the steepest descent update? Let W_ss be such a value of the coefficient vector. We can write the steepest descent update for this special value of W(n) as

W(n+1) = W(n) + μ (P_dX - R_XX W(n)) = W(n) = W_ss.    (4.11)

The algorithm applies no correction to the coefficient vector in this situation, indicating that the system has converged to a stationary point. Equations (4.8) and (4.11) imply that

E{e(n)X(n)} = 0    (4.12)

at the stationary point of the system. The above condition is the same as the orthogonality principle described in Chapter 2. This result implies that if the steepest descent algorithm converges, then the coefficient values at convergence correspond to the optimal solution to the minimum mean-square-error estimation problem!


Figure 4.8: Evolution of the coefficients in Example 4.1 for different starting values of W(n) for μ = 1.

Figure 4.9: Evolution of w_1(n) for different step sizes in Example 4.1.


Figure 4.10: Evolution of w_2(n) for different step sizes in Example 4.1.

The steepest descent procedure can potentially be used to find this optimal solution iteratively. As further evidence of this fact, we can determine the value of W(n) = W_ss at the stationary point of the iteration by solving the L equations defined by (4.11) to get

R_XX W_ss = P_dX.    (4.13)

We can solve for W_ss if the inverse of the autocorrelation matrix exists. The steady-state solution in this case is

W_ss = R_XX^{-1} P_dX = W_opt,    (4.14)

which is simply the optimal solution for the MMSE estimation problem. The solution in (4.14) is unique whenever R_XX^{-1} exists. In other words, there exists only one possible stationary point for the iteration, and it corresponds to the optimum MMSE solution for the problem. The value of the mean-squared error at this stationary point corresponds to the minimum mean-squared error value for this problem and can be evaluated using (2.50) as
E{e^2(n) | W(n) = W_opt} = σ_d^2 - P_dX^T R_XX^{-1} P_dX = σ_d^2 - P_dX^T W_opt.    (4.15)


Convergence of the Steepest Descent Method

Given that the stationary point of the steepest descent algorithm is the optimum MMSE solution, a second, equally-important consideration is whether the algorithm converges at all. We now explore the conditions on the step size to guarantee convergence for a single-coefficient system. The results that we derive are similar in flavor to more complete results that we will derive in later chapters for data-driven approximate versions of the steepest descent method. For a single-coefficient system with L = 1, the evolution equation in (4.10) is given by

w(n+1) = w(n) + μ (p_dx - r_xx(0) w(n))
       = (1 - μ r_xx(0)) w(n) + μ p_dx,    (4.16)

where p_dx = E{d(n)x(n)} and r_xx(0) = E{x^2(n)}. This equation is simply a first-order scalar difference equation in the coefficient w(n). In fact, the coefficient sequence w(n+1) is exactly the same as the output y(n) of a linear, time-invariant digital filter defined by the equation

y(n) = a y(n-1) + ν(n),    (4.17)

where a = 1 - μ r_xx(0), y(-1) = w(0), and the input signal is given by ν(n) = μ p_dx u(n), where u(n) is the discrete-time step function. From the theory of digital filters, we know that the stability of a causal, linear, time-invariant discrete-time filter in (4.17) is controlled by the constant a. For |a| < 1, the digital filter of (4.17) is stable; i.e., the sequence y(n) is finite-valued as n tends toward infinity. Using this relationship, we find that the steepest descent method is stable if and only if

-1 < (μ r_xx(0) - 1) < 1.    (4.18)

Adding one to both sides of the above inequalities and dividing all quantities by r_xx(0), we find that the conditions given by

0 < μ < 2/r_xx(0)    (4.19)

guarantee the convergence of the steepest descent method for a single-coefficient system. Note that r_xx(0) is also the power in the input signal, a quantity that can be easily estimated using signal measurements. We can also show that the coefficient of the steepest descent method converges to its optimal value w_opt = p_dx/r_xx(0) when the system is stable. To see this, let us subtract w_opt from both sides of (4.16). After substituting p_dx = r_xx(0) w_opt in the resulting equation, we get

[w(n+1) - w_opt] = (1 - μ r_xx(0)) [w(n) - w_opt].    (4.20)


It is easy to see from the above equation that if 0 < μ < 2/r_xx(0), the coefficient error w(n) - w_opt decreases exponentially to zero as the number of iterations n increases. These results indicate three important facts concerning the stability of the steepest descent method:

1. For stable operation, the step size must be positive. This result is intuitively pleasing, as a negative step size would cause the coefficients to move up the mean-square-error surface.

2. The range of stable step sizes decreases as the input signal power increases. This fact also makes sense, as the input data power is directly related to the curvature of the mean-square-error surface used by the steepest descent method. If the curvature of the error surface is too great, the oscillatory behavior observed in previous examples becomes more likely as the step size is increased.

3. When the system operates in a stable manner, the coefficient converges to its optimal value in stationary environments. This fact is also essential if an adaptive filter is to be useful in practice.

This single-coefficient example does not illustrate the dependence of the step size bounds on the filter length L. We defer such a discussion to the next section.
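The stability bound in (4.19) is easy to verify numerically. The MATLAB sketch below iterates the scalar recursion (4.16) for illustrative values of r_xx(0) and p_dx (these particular numbers are assumptions chosen only for the demonstration) and shows convergence for step sizes inside the bound and divergence outside it.

rxx0 = 1; pdx = 1.5;               % illustrative scalar statistics
wopt = pdx/rxx0;                   % optimal coefficient w_opt = p_dx/r_xx(0)
for mu = [0.1 1.5 2.5]             % the first two satisfy 0 < mu < 2/rxx0; the last does not
  w = 0;                           % initial coefficient value
  for n = 1:50
    w = w + mu*(pdx - rxx0*w);     % scalar steepest descent update (4.16)
  end;
  disp([mu w-wopt]);               % coefficient error after 50 iterations
end;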

4.2 Stochastic Gradient Adaptive Filters

The method of steepest descent can be used to find the optimum minimum mean-square-error estimate of W(n) in an iterative fashion. However, this procedure depends on the statistics of the input and desired response signals and not on the actual measured signals. In practice, the input signal statistics are not known a priori. Moreover, if these statistics were known and if the autocorrelation matrix R_XX(n) were invertible, we could find the optimum solution given in (4.14) directly in one step! Thus, the method of steepest descent, as described in the previous section, is not useful as an estimation procedure on its own in most practical situations. We now describe a simple approximation that yields a practical and efficient variation of the steepest descent algorithm.

The Instantaneous Gradient

We can see from (4.8) that the method of steepest descent depends on the input data and desired response signal statistics through the expectation operation that is performed on the product e(n)X(n). This product is the gradient of the squared error function -e^2(n)/2 with respect to the coefficient vector W(n). We can consider the vector e(n)X(n) as an approximation of the true gradient of the mean-squared error estimation surface. This approximation is known as the instantaneous gradient of the mean-squared error surface.


Our approach to developing a useful and realizable adaptive algorithm is to replace the gradient vector E{e(n)X(n)} in the steepest descent update in (4.8) by its instantaneous approximation e(n)X(n). Adaptive filters that are based on the instantaneous gradient approximation are known as stochastic gradient adaptive filters.

4.2.1 The Least-Mean-Square Algorithm

We get the following strategy for updating the coefficients by using the instantaneous gradient approximation in the steepest descent algorithm:

W(n+1) = W(n) + μ e(n)X(n),    (4.21)

where the error e(n) is given by

e(n) = d(n) - W^T(n)X(n).    (4.22)

The coefficient vector W(n) may be initialized arbitrarily and is typically chosen to be the zero vector. The only difference between the procedure of (4.21) and (4.22) and the steepest descent procedure of (4.8) is that we have removed the expectation operator E{·} from the gradient estimate. The above algorithm has become known as the Least-Mean-Square (LMS) adaptive filter, a name coined by its originators [Widrow 1960]. Because of its simplicity and properties, it is the most widely-used adaptive filter today. Table 4.2 lists a MATLAB function that implements the LMS adaptive filter.

REMARK 4.1: Substituting e(n)X(n) for E{e(n)X(n)} is a crude approximation for the gradient of the mean-square-error surface. However, the value of e(n)X(n) points in the same direction as the true gradient on average. In other words, the instantaneous gradient is an unbiased estimate of the true gradient. Since the step size parameter μ is chosen to be a small value, any errors introduced by the instantaneous gradient are averaged over several iterations, and thus the performance loss incurred by this approximation is relatively small.
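The unbiasedness claimed in Remark 4.1 can be checked numerically by holding the coefficient vector fixed and averaging the instantaneous gradient over many samples; the average should approach P_dX - R_XX W. The signals and coefficient values in the sketch below are hypothetical choices made only for this check.

L = 2; N = 100000;
x = randn(N,1);                         % white, unit-variance Gaussian input (assumed)
d = filter([1 1],1,x) + 0.1*randn(N,1); % desired response from an assumed two-tap system plus noise
W = [0.2; -0.3];                        % an arbitrary, fixed coefficient vector
g = zeros(L,1);
for n = L:N
  X = x(n:-1:n-L+1);                    % input vector [x(n); x(n-1)]
  e = d(n) - W'*X;                      % estimation error at the fixed W
  g = g + e*X;                          % accumulate instantaneous gradients e(n)X(n)
end;
g = g/(N-L+1);                          % sample average of the instantaneous gradient
% For this white input, RXX = eye(2) and PdX = [1; 1], so the true gradient
% PdX - RXX*W equals [0.8; 1.3]; the average g should be close to this value.
disp(g');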

Table 4.2: MATLAB function for applying the FIR LMS adaptive filter.

function [W,dhat,e] = fir_lms(mu,W0,x,d);
%  This function adapts a finite-impulse-response (FIR) filter
%  using the least-mean-square (LMS) adaptive algorithm.
%
%  Input parameters:
%    mu = step size
%    W0 = Initial value of W(0) coefficients (L x 1)
%    x  = input data signal (num_iter x 1)
%    d  = desired response signal (num_iter x 1)
%
%  Output of program:
%    W    = Evolution of coefficients (L x (num_iter + 1))
%    dhat = output of adaptive filter (num_iter x 1)
%    e    = error of adaptive filter (num_iter x 1)

L = length(W0);
start_iter = 1;
end_iter = min([length(x) length(d)]);
W = zeros(L,end_iter);
dhat = zeros(end_iter,1);
e = zeros(end_iter,1);
W(:,1:start_iter) = W0*ones(1,start_iter);
X = zeros(L,1);
for n = start_iter:end_iter;
  X(2:L) = X(1:L-1);
  X(1) = x(n);
  dhat(n) = X'*W(:,n);
  e(n) = d(n) - dhat(n);
  W(:,n+1) = W(:,n) + mu*e(n)*X;
end;

4.2.2 General Stochastic Gradient Adaptive Filters

Recall from our discussion of the steepest descent algorithm that the choice of cost function J(n) = E{e^2(n)} was an arbitrary one and that other cost functions can provide adequate error surfaces for a gradient search. Some alternative cost functions were discussed in Section 4.1.2. We now consider a particular class of cost functions of the form

J(n) = E{g(e(n))},    (4.23)

where g(e(n)) is an even function of e(n). We can develop a family of steepest descent procedures that attempt to minimize the cost function in (4.23) using (4.5).


The coefficient vector update is given by

W(n+1) = W(n) - μ ∂E{g(e(n))}/∂W(n)
       = W(n) + μ E{f(e(n))X(n)},    (4.24)

where we define f(e) to be

f(e) = dg(e)/de.    (4.25)

We can use the instantaneous gradient approximation to provide realizable adaptive filters of the form

W(n+1) = W(n) + μ f(e(n))X(n).    (4.26)

The only difference of this general form of the stochastic gradient adaptive filter from the LMS adaptive filter is the use of the nonlinearity f(·) on the error e(n) in the update. From the constraints on Φ(e) presented in Section 4.1.2, we see that g(e) is an even function that monotonically increases with |e|. Consequently, the nonlinearity f(·) is an odd function that preserves the polarity of e(n); i.e.,

sgn(f(e(n))) = sgn(e(n)),    (4.27)

where the sgn(·) operation is defined to be

sgn(e) = { 1 if e > 0;  0 if e = 0;  -1 if e < 0 }.    (4.28)

We can derive many useful stochastic gradient adaptive filters from the general structure given in (4.26) using different functions g(e). We now describe several such adaptive filters.

The Sign-Error Adaptive Filter

Consider the mean-absolute-error cost function

J(n) = E{|e(n)|}.    (4.29)

Since the derivative of |e(n)| with respect to the error is f(e(n)) = sgn(e(n)), we obtain the following stochastic gradient adaptive filter using (4.26):

W(n+1) = W(n) + μ̄ sgn(e(n))X(n),    (4.30)

where we have defined μ̄ = μ for convenience. The coefficient vector update for this adaptive filter is known as the sign-error algorithm, or simply the sign algorithm, as it uses the sign of the error in the gradient update. Although we derived this algorithm from a gradient descent argument, it is interesting to note that it has in the past been interpreted as a simplified LMS update algorithm, where the sign operation allows a simpler multiplier structure in dedicated signal processing hardware [Duttweiler 1981].
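A minimal MATLAB sketch of the general update (4.26) with a pluggable error nonlinearity is given below; the function name fir_sg is illustrative, the structure follows the fir_lms function of Table 4.2, and passing f = @sign yields the sign-error algorithm of (4.30) (with the step size argument playing the role of μ̄).

function [W,dhat,e] = fir_sg(mu,W0,x,d,f);
%  Stochastic gradient FIR adaptive filter with error nonlinearity f
%  (a sketch; f is a function handle, e.g., f = @sign for the sign-error
%  algorithm). The remaining arguments are as in Table 4.2.
L = length(W0);
num_iter = min([length(x) length(d)]);
W = zeros(L,num_iter+1);
W(:,1) = W0;
dhat = zeros(num_iter,1);
e = zeros(num_iter,1);
X = zeros(L,1);
for n = 1:num_iter;
  X(2:L) = X(1:L-1);                  % shift the input vector
  X(1) = x(n);
  dhat(n) = X'*W(:,n);                % filter output
  e(n) = d(n) - dhat(n);              % estimation error
  W(:,n+1) = W(:,n) + mu*f(e(n))*X;   % general stochastic gradient update (4.26)
end;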

The Least-Mean-Kth-Power Adaptive Filter


We can generalize the mean-square-error and mean-absolute-error cost functions in a natural way by defining this cost function as

J(n) = E{|e(n)|^K},    (4.31)

where K is a positive integer. Following a similar development as before and noting that d|e|^K/de = K|e|^{K-1} sgn(e), we arrive at the following least-mean-Kth-power adaptive filter:

W(n+1) = W(n) + μ̄ |e(n)|^{K-1} sgn(e(n))X(n),    (4.32)

where we have defined μ̄ = μK for convenience. It has been shown that this algorithm can achieve better performance than the LMS adaptive filter by adjusting the integer-valued parameter K for certain signal and noise statistics [Walach 1984].

Quantized-Error Algorithms

Consider a piecewise-linear cost function g(e) shown in Figure 4.11a. We can derive a stochastic gradient adaptive filter for which the nonlinearity f(e) is as shown in Figure 4.11b. This nonlinearity represents a quantizer, since values of e in different ranges are mapped to specific constants. In a digital computer, quantization of signals is necessary for implementing algorithms in general. In dedicated VLSI hardware, however, it may be necessary to quantize certain signals to a smaller number of bits in order to allow a reasonable multiplier structure. Thus, we are motivated to study the performance of these quantized stochastic gradient adaptive filters to see how they behave relative to floating-point versions that suffer from the effects of quantization to a much lesser degree.

Quantized-error algorithms can also be designed to provide larger than normal coefficient changes when the estimation errors are large in magnitude and smaller changes when the estimation errors are smaller in magnitude. Such algorithms include as special cases the sign-error algorithm in (4.30); the dual-sign algorithm, where f(e(n)) is given by

f(e(n)) = { K sgn(e(n))  if |e(n)| ≥ t_0
            sgn(e(n))    if |e(n)| < t_0,    (4.33)

where K and t_0 are parameters of the nonlinearity [Kwong 1986]; and the power-of-two quantized algorithm, where f(e(n)) is given by

f(e(n)) = { 2^{⌈log2(|e(n)|)⌉} sgn(e(n))  if |e(n)| < 1
            sgn(e(n))                     if |e(n)| ≥ 1,    (4.34)

where ⌈·⌉ denotes the next largest integer value [Ping 1986].
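For illustration, the dual-sign and power-of-two nonlinearities in (4.33) and (4.34) can be written as MATLAB function handles and passed to a general update such as the fir_sg sketch above; the parameter values K = 4 and t0 = 1 are only examples.

K = 4; t0 = 1;                          % example parameters of the dual-sign nonlinearity
f_dual = @(e) sign(e).*(1 + (K-1)*(abs(e) >= t0));            % dual-sign quantizer (4.33)
f_pow2 = @(e) sign(e).*min(2.^ceil(log2(max(abs(e),eps))),1); % power-of-two quantizer (4.34)
% The max(abs(e),eps) term only guards against log2(0) when e = 0.

% For example, a dual-sign adaptive filter:
% [W,dhat,e] = fir_sg(mu,W0,x,d,f_dual);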


Figure 4.11: a) A piecewise-linear cost function. b) The resulting quantizer nonlinearity.

Block LMS Algorithm

Consider the following error criterion that is based on a finite sum of squared errors:
E{ (1/N) Σ_{i=n-N+1}^{n} e_n^2(i) } = (1/N) Σ_{i=n-N+1}^{n} E{(d(i) - W^T(n)X(i))^2},    (4.35)

where the subscript on the error e_n(i) explicitly indicates that its calculation depends on the coefficients at time n. Moreover, since W(n) is used for N consecutive time samples, we need to consider updating the coefficients only once every N samples. Using the instantaneous approximation to the gradient of this cost function as in (4.26), we arrive at the following block LMS adaptive filter:

W(n+N) = W(n) + (μ/N) Σ_{i=n-N+1}^{n} e_n(i)X(i).    (4.36)

This update uses an average of a set of consecutive instantaneous gradients to adjust the coefficients of the filter in one step. This averaging results in a more accurate estimate of the gradient of the mean-squared error surface as the block length is increased. However, the adaptive filter coefficients are updated less frequently, and this may result in a slower speed of adaptation.

At first glance, the block LMS algorithm looks more complicated than the LMS algorithm because of the summation of the consecutive gradient terms. However, since the coefficients of the filter are fixed over the block, efficient convolution techniques employing fast Fourier transform (FFT) algorithms can be used to implement the filtering operation.


Moreover, FFT-based techniques can also be used to implement the gradient summation, leading to significant savings in multiplications for long block lengths [Clark 1981]. We will discuss the performance and behavior of many of the stochastic gradient adaptive filters discussed above in the following chapters.
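A direct time-domain sketch of the block LMS update (4.36) follows; the function name and the zero-padding of the first few input vectors are illustrative choices, and none of the FFT-based speedups mentioned above are included.

function [W] = blk_lms(mu,W0,x,d,N);
%  Block LMS adaptive FIR filter (time-domain sketch of update (4.36)).
%  N is the block length; W holds the coefficients after each block.
L = length(W0);
num_blocks = floor(min([length(x) length(d)])/N);
W = zeros(L,num_blocks+1);
W(:,1) = W0;
for k = 1:num_blocks;
  g = zeros(L,1);                          % accumulated gradient over the block
  for i = (k-1)*N+1:k*N;
    X = zeros(L,1);
    X(1:min(L,i)) = x(i:-1:max(i-L+1,1));  % input vector, zero-padded at the start
    e = d(i) - W(:,k)'*X;                  % error computed with the block's coefficients
    g = g + e*X;                           % sum of instantaneous gradients
  end;
  W(:,k+1) = W(:,k) + (mu/N)*g;            % block update (4.36)
end;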

4.2.3 Examples of LMS Adaptive Filters

By far the most popular adaptive filter, the LMS adaptive filter has been studied extensively by many in the signal processing community. We conclude this chapter with several simulation examples to illustrate the LMS adaptive filter's behavior.
Example 4.2: Stationary System Identification

This example considers the identification of the system in Example 4.1 using measurements of its input and output signals. For this system, we generated a correlated input data sequence using the single-pole IIR digital filter whose input-output relationship is given by

x(n) = a x(n-1) + b ν(n),

where ν(n) is an i.i.d., zero-mean, unit-variance Gaussian sequence and a and b have been chosen as a = 0.5 and b = √3/2. The desired response signal was generated using the following FIR model:

d(n) = x(n) + x(n-1) + η(n),

where η(n) is an i.i.d. zero-mean Gaussian sequence with variance σ_η^2 = 0.01. The statistics of this problem match those in Example 4.1, allowing us to compare the results of the LMS adaptation with those produced by the steepest descent algorithm.

Figure 4.12 shows the evolution of the coefficients for 1000 iterations of both the steepest descent and the LMS adaptive filter superimposed on the MSE surface for a step size of μ = 0.01. The coefficient vector was initialized as W(0) = [3 1.5]^T. Each dot on the solid-line curve indicates one iteration of the LMS algorithm, and the solid line is an ensemble average of one hundred different runs of the LMS adaptive filter over independent data sets with identical individual statistics. The dashed line on the plot corresponds to the path of the coefficients adapted using the steepest descent method. The same information is plotted as a function of time in Figure 4.13. The evolutions of the LMS adaptive filter coefficients, both as individual and ensemble averages of the convergence paths, closely follow the path produced by the steepest descent algorithm. However, the behavior of the coefficients of the LMS adaptive filter is more noisy for each individual run. The coefficients of both systems approach the optimum filter coefficient values in this example.
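A single run of this simulation can be reproduced with a few lines of MATLAB using the fir_lms function of Table 4.2; the sketch below follows the signal model given above.

num_iter = 1000;
a = 0.5; b = sqrt(3)/2;                  % single-pole input model of this example
nu = randn(num_iter,1);                  % unit-variance Gaussian driving sequence
x = filter(b,[1 -a],nu);                 % correlated input x(n) = a*x(n-1) + b*nu(n)
eta = sqrt(0.01)*randn(num_iter,1);      % measurement noise with variance 0.01
d = x + [0; x(1:end-1)] + eta;           % desired response d(n) = x(n) + x(n-1) + eta(n)

mu = 0.01;
W0 = [3; 1.5];                           % initial coefficient vector used in the text
[W,dhat,e] = fir_lms(mu,W0,x,d);         % run the LMS adaptive filter
plot(0:num_iter,W');                     % trajectories of w1(n) and w2(n)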

Figure 4.12: Evolution of the coefficients of the LMS (dotted curve), ensemble-averaged LMS (solid curve), and steepest descent (dashed curve) algorithms in Example 4.2 for μ = 0.01.


Figure 4.13: Evolution of the coefficients of the LMS (dotted curve), ensemble-averaged LMS (solid curve), and steepest descent (dashed curve) algorithms in Example 4.2 for μ = 0.01.


Figure 4.14 displays the evolution of the error signal e(n) for the LMS adaptive filter. Starting from large initial values, the errors decrease to smaller values as time progresses. The error never goes to zero because of the random noise η(n) that perturbs our measurements d(n) of the system.

We now investigate the behavior of the LMS algorithm for a larger step size μ = 0.1. Figures 4.15 and 4.16 show the behaviors of the coefficients for this case. We can see that the evolution of the LMS adaptive filter coefficients follows the general path of those adapted using the steepest descent algorithm. However, the behavior of the LMS adaptive filter coefficients is considerably more erratic for this larger step size. As we might expect, the coefficients approach their optimum values much faster for this larger step size.

Figures 4.17 and 4.18 show the evolutions of the absolute value of the first adaptive filter coefficient and the squared value of the estimation error, respectively, for a single experiment of the LMS adaptive filter operating with a step size of μ = 1 in this case. Clearly, the evolution of the system is erratic, with large variations in both the magnitudes of the filter coefficients and the estimation error. Since the steepest descent procedure converges in this case as observed in Example 4.1, we infer that the behaviors of the LMS and steepest descent adaptation procedures are quite different for large step sizes. The reasons for these differences are explored in the next chapter.

Example 4.3: Nonstationary Channel Equalization

We now consider an example drawn from digital communications, in which an adaptive filter is used to compensate for the non-ideal characteristics of a communications channel. Figure 4.19 shows the block diagram of the system, in which a message is encoded in the form of a digital bit stream before it is modulated and transmitted over a channel. At the receiver, the signal is sampled and then processed to retrieve the original message. For this example, we model the encoding, transmission, and decoding of the signal as a time-varying linear filter whose output is corrupted by noise. The task of the adaptive filter is to recover the original bits transmitted by developing an approximate inverse of the channel. This process is known as equalization. Because the properties of the channel are typically unknown or changing over time, an adaptive filter is used to approximate the inverse of this system. To initially adapt the filter, a known series of bits is transmitted over the channel, and the adaptive filter is trained using a delayed version of this known sequence, where the sample delay is chosen for best performance. Then, a decision-directed technique can be used to maintain the proper equalization of the channel.

For our example, we assume that the noise is negligible and that the channel can be modeled using the first-order difference equation given by

x(n) = a(n) x(n-1) + s(n),

where s(n) are the bits transmitted and a(n) is a time-varying coefficient. The bit sequence s(n) is an i.i.d. binary sequence where Pr(s(n) = 1) = Pr(s(n) = -1) = 0.5.


Figure 4.14: Evolution of error e(n) in Example 4.2 for μ = 0.01.

Figure 4.15: Evolution of the coefficients of the LMS (dotted curve), ensemble-averaged LMS (solid curve), and steepest descent (dashed curve) algorithms in Example 4.2 for μ = 0.1.


Figure 4.16: Evolution of the coefficients of the LMS (dotted curve), ensemble-averaged LMS (solid curve), and steepest descent (dashed curve) algorithms in Example 4.2 for μ = 0.1.


Figure 4.17: Evolution of the absolute value of the first coefficient of the LMS adaptive filter in Example 4.2 for μ = 1.


Figure 4.18: Evolution of the squared error e^2(n) in Example 4.2 for μ = 1.


Figure 4.19: Block diagram of an adaptive equalizer used in digital communication systems.

The actual behavior of the coefficient {a(n)} is

a(n) = { 0                for 0 ≤ n ≤ 100
         9(n - 100)/2000  for 101 ≤ n ≤ 300
         0.9              for 301 ≤ n ≤ 600.

Thus, the coefficient a(n) undergoes a linear change from a(100) = 0 to a(300) = 0.9. The inverse system for the channel in the absence of any noise is described by the relationship

s(n) = x(n) - a(n) x(n-1).

Consequently, we can use a two-coefficient adaptive filter whose input signal is x(n) and whose desired response signal is s(n - Δ) = s(n) to equalize the received signal for the effects of the channel. The optimal coefficient vector is given by

W_opt(n) = [1  -a(n)]^T.

The adaptive filter coefficients were initialized to their optimum values W(0) = [1 0]^T in this example in order to observe the tracking behavior of the system. Figure 4.20 shows the evolution of the filter coefficients w_1(n) and w_2(n) for a step size of μ = 0.1. The adaptive filter coefficients track their optimum values as the system function changes, with a lag from the true coefficient values. This lag error is in general greater for smaller step sizes due to the decreased speed of adaptation for smaller step sizes. We can also see that, even though the optimum value of the first coefficient does not change, the value of w_1(n) produced by the adaptive filter changes. This effect is due to the coupled nature of the coefficient adaptation. Figure 4.21 shows the behavior of the same system for μ = 0.01, in which case the lag error in the coefficients is much greater.
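A sketch of this nonstationary equalization experiment, again built around fir_lms from Table 4.2 and the noiseless channel model given above, is shown below.

num_iter = 600;
s = sign(randn(num_iter,1));             % i.i.d. +/-1 bit sequence (Pr(+1) = Pr(-1) = 0.5)
n = (1:num_iter)';
a = min(max(9*(n-100)/2000,0),0.9);      % time-varying channel coefficient a(n)
x = zeros(num_iter,1);                   % received signal x(n) = a(n)*x(n-1) + s(n)
x(1) = s(1);                             % a(1) = 0, so x(1) = s(1)
for k = 2:num_iter;
  x(k) = a(k)*x(k-1) + s(k);
end;

mu = 0.1;
W0 = [1; 0];                             % initialize at the optimum coefficients for n = 0
[W,shat,e] = fir_lms(mu,W0,x,s);         % equalize, with s(n) as the desired response
plot(0:num_iter,W');                     % compare with Wopt(n) = [1; -a(n)]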

Example 4.4: Adaptive Line Enhancement

In Example 2.13 of Chapter 2, we considered the task of line enhancement, whereby a sinusoidal signal is recovered from a noisy version of the sinusoid using a one-step linear predictor. Figure 4.22 shows the block diagram of the adaptive system. In this example, we employ the LMS algorithm to find the coefficients of the filter. For this example, we choose the signals to be the same as those for Example 2.13, so that we can compare the adaptive filter's output with that of the optimum MMSE fixed-coefficient line enhancer.

Figure 4.23 plots the difference between the output d̂(n) of the LMS adaptive line enhancer and the output of the optimum MMSE line enhancer, given by d̂_o(n) = W_opt^T(n)X(n), for a step size of μ = 0.0001. Initial convergence of the system occurs over the first 5000 samples. Figure 4.24 shows the spectra of the input signal as well as the enhanced signals as obtained from the optimum MMSE estimator and from the adaptive LMS line enhancer for the sequence of values 5001 ≤ n < 10000.



Figure 4.20: Tracking of optimal coefficients in Example 4.3 for μ = 0.1.

Figure 4.21: Tracking of optimal coefficients in Example 4.3 for μ = 0.01.


As can be seen, the adaptive line enhancer's performance closely follows that of the fixed system. In verification of this fact, Figure 4.25 shows the output signals of both the optimum MMSE and adaptive LMS line enhancers after convergence, along with the original uncorrupted sinusoid. Clearly, both line enhancers perform nearly equally well, indicating that the LMS adaptive line enhancer can achieve performance similar to that of the optimum MMSE line enhancer after a sufficient number of iterations.
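For completeness, a MATLAB sketch of the adaptive line enhancer is given below. The sinusoid frequency, noise level, and predictor length are placeholder values, since the actual signal parameters come from Example 2.13; the one-step prediction structure of Figure 4.22 is what matters here.

num_iter = 20000;
L = 16;                                  % predictor length (placeholder value)
omega = 0.3*pi; sigma_v = 0.5;           % placeholder sinusoid frequency and noise level
x = sin(omega*(1:num_iter)') + sigma_v*randn(num_iter,1);   % noisy sinusoid

% One-step linear prediction: the filter input is x(n-1), the desired
% response is x(n), and the filter output is the enhanced signal.
mu = 0.0001;
W0 = zeros(L,1);
[W,dhat,e] = fir_lms(mu,W0,[0; x(1:end-1)],x);
plot(dhat(5001:5100));                   % enhanced output after initial convergence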

4.3 Main Points of This Chapter

• The method of steepest descent is an iterative procedure for finding the minimum point of a smooth error surface. When searching the MMSE surface for an FIR system model, the method of steepest descent converges to the optimum MMSE solution for adequately small step sizes.

• Convergence of the method of steepest descent is controlled by the autocorrelation statistics of the input signal, the cross-correlation between the input and desired response signals, and the step size. Too large a step size can cause divergence of the algorithm.

• Stochastic gradient adaptive algorithms are approximate implementations of steepest descent procedures in which an instantaneous estimate of the cost function Φ(e(n)) is used in place of the expected value E{Φ(e(n))}.

• The least-mean-square (LMS) adaptive filter is a stochastic gradient version of the method of steepest descent that minimizes the mean-squared estimation error. The LMS algorithm is the most widely-used adaptive algorithm for FIR filters due to its computational simplicity and robust adaptation properties.

• Variants of the LMS adaptive filter include the sign-error, least-mean-K, and block LMS adaptive filters as well as adaptive filters with quantized updates. These other adaptive filters are useful in certain situations, depending on the implementation constraints and signals being processed.

• It is seen through examples that the LMS adaptive filter's behavior closely follows that of the method of steepest descent for small step sizes, and the LMS adaptive algorithm can achieve performance that approaches that of the optimum MMSE estimator in certain situations.


Figure 4.22: The configuration of the adaptive line enhancer for Example 4.4.


Figure 4.23: Difference between the outputs of the LMS adaptive line enhancer and the optimum MMSE line enhancer in Example 4.4.



Figure 4.24: Spectra of the original noisy signal, the output of the optimum MMSE line enhancer, and the LMS adaptive line enhancer in Example 4.4.

Figure 4.25: Time series of the original noiseless signal, the optimum MMSE line enhancer output, and the LMS adaptive line enhancer output in Example 4.4.


4.4 Bibliographical Notes

Method of Steepest Descent. The method of steepest descent first appeared in the context of the theory of optimization of parameterized functions [Curry 1944]. An excellent introduction to these methods can be found in [Luenberger 1984]. Newton's method is another well-known algorithm for determining the minimum of a locally-quadratic error surface (see Exercise 4.3). We also discuss this method in the context of recursive least-squares adaptive filters in Chapter 10.

Development of the LMS Adaptive Filter. The least-mean-square adaptive filter grew out of the efforts of several researchers working in the field of learning systems in the late 1950s and early 1960s. The work of Widrow and Hoff [Widrow 1960] is often credited as the first appearance of the algorithm in the literature, although the work of Rosenblatt [Rosenblatt 1957] is similar in both motivation and developed results. References to even earlier works than these have been noted in the literature; for example, Tsypkin [Tsypkin 1973] credits [Kaczmarz 1937] as the original work on the LMS algorithm with normalized step size. In the control literature, the LMS adaptive filter often appears in its continuous-time form and is referred to as the MIT rule, in deference to the promoters of the algorithm in that field [Åström 1995].

Variations of the LMS Adaptive Filter. Due to the difficulties in computing multiplications in early digital hardware, early users of adaptive filters were forced to approximate the LMS adaptive filter's implementation using reduced-bit multiplications and additions. For an early application of these ideas, see [Lucky 1966]. A formal study of the sign-error adaptive filter is presented in [Gersho 1984], and least-mean-K adaptive algorithms are presented in [Walach 1984]. A balanced presentation and analysis of several types of adaptive filters involving nonlinearities in the gradient update term can be found in [Duttweiler 1982]. Algorithms involving dual-sign and power-of-two error quantizers are considered in [Kwong 1986] and [Ping 1986], respectively. We explore the performance and behavior of these modified algorithms more extensively in Chapter 7. Work in the mid-1960s on the fast Fourier transform [Cooley 1965] and fast convolution [Stockham 1966] paved the way for the development of the block LMS adaptive filter [Clark 1981]. For a good review of more recent work in block and frequency-domain adaptive filters, see [Shynk 1992]. These algorithms are also discussed in Chapter 8.

Applications of Adaptive Filters. One of the first successful widespread applications of adaptive filters was in digital communications, where a modified version of the LMS adaptive filter was used in channel equalization [Lucky 1966]. Applications followed in geophysical exploration [Burg 1967], radar and sonar [Capon 1969, Frost 1972, Haykin 1985], medicine [Widrow 1975], speech processing and coding [Makhoul 1975, Gibson 1984], echo cancellation [Gritton 1984], image processing and coding [Benvenuto 1986],


spread-spectrum communications [Milstein 1986], beamforming [Van Veen 1988], and noise control [Elliott 1993], among others. A good review of applications in noise cancellation can be found in [Widrow 1975]. Quereshi [Quereshi 1988] gives an excellent overview of adaptive filters as used in digital communications for channel equalization. A discussion of linear prediction as it applies to adaptive line enhancement appears in [Zeidler 1990].


4.5 Exercises

4.1. Adaptive Filters are Nonlinear and Time-Invariant Systems: Show, using the classical definitions of linearity and time-invariance, that the LMS adaptive filter is a nonlinear and time-invariant system.

4.2. The Sign-Sign Adaptive Filter: Consider the following search technique, based on a simplification of the gradient search technique described by (4.5):

W(n+1) = W(n) - μ sgn(∂J(n)/∂W(n))
       = W(n) + μ sgn(e(n)) sgn(X(n)),

where [sgn(X)]_i = sgn(x_i) is as defined in (4.28) and J(n) is the mean-square-error cost function.

a. Is the above search technique a true gradient search procedure? Why or why not?

b. Consider the one-dimensional case, for which W(n) = w(n). Explain why the above search technique will not converge to lim_{n→∞} w(n) = w_opt in general. Determine a bound on the coefficient error lim_{n→∞} |w(n) - w_opt| for an arbitrary initial value w(0). Hint: The bound depends on the value of the chosen step size μ.

c. Even though the above method works well for most signals, there are a few situations in which the adaptive filter will diverge for all positive choices of μ. Show that the following situation is one such case. The input signal x(n) is periodic with a period of three samples, and the first period is given by 3, -1, and -1. The desired response signal takes a constant value of one for all samples. The adaptive filter has three coefficients. Assume that at the beginning of some period, the adaptive filter coefficients are all zero.

4.3. Newton's Method: Newton's method is a classical technique for finding the minimum of a locally-quadratic performance surface [Luenberger 1984]. The algorithm is defined as

W(n+1) = W(n) - (F(n))^{-1} ∂J(n)/∂W(n),    (4.37)

where F(n) is an L x L-element matrix whose (i, j)th value is given by

[F(n)]_{i,j} = ∂^2 J(n) / (∂w_i(n) ∂w_j(n)).    (4.38)


a. Determine F(n) for the mean-square error criterion J(n) = E{e^2(n)}. Is the matrix a function of the time index n?

b. For your result in part a, determine conditions on the input and desired response signals so that F(n) can be inverted for all n.

c. Derive the coefficient update equation for Newton's method for the mean-square error criterion, and describe its convergence properties.

d. Describe the difficulties in developing a stochastic gradient version of Newton's method. Consider the amount of computation and knowledge of the signals required.

4.4. Constant Modulus Error Criteria: Consider the following cost function for a steepest descent procedure:

J(n) = E{(A^2 - (W^T(n)X(n))^2)^2},

where A is a known amplitude. Such a constant modulus cost function makes use of the knowledge that the squared value of the desired output signal of the system is a constant value A^2 at each iteration, a situation that is realistic in many digital communication systems.

a. Derive the steepest descent procedure for this cost function.

b. Determine a stochastic gradient version of this steepest descent procedure. How is it similar to the LMS adaptive filter?

c. Repeat parts a) and b) for the general constant modulus cost function given by

J(n) = E{ | |A|^m - |W^T(n)X(n)|^m |^p },

where |·| denotes absolute value and m and p are positive integers.

4.5. The Statistics of the Output of a Single-Pole Filter With an I.I.D. Input Signal: Consider the input signal model given by

x(n) = a x(n-1) + b ν(n),

where ν(n) is an i.i.d. sequence with zero mean value and unit variance and a, b, and x(0) are values to be specified.

a. Find an expression for x(0) in terms of a and b such that the random sequence {x(n)} has stationary second-order statistics, i.e., E{x(n-i)x(n-j)} = r_xx(i-j) for all n-i > 0, n-j > 0.

b. For your value of x(0) in part a, find expressions for a and b in terms of r_xx(0) and r_xx(1).


4.6. Equation Error Adaptive Recursive Filters: Consider the identification of a recursive linear system as described in Example 2.16. We wish to develop an adaptive method for identifying such systems. An identification algorithm that employs feedback of the desired response signal d(n) in the system model as in Example 2.16 is known as an equation error algorithm. (Another class of algorithms that uses delayed samples of d̂(n) in the system model is known as the output error algorithms. Output error adaptive recursive filters are described in Chapter 13.) Derive the coefficient-updating strategy of an equation-error LMS adaptive filter using the recursive system model

d̂(n) = Σ_{i=0}^{L} b_i(n) x(n-i) + Σ_{i=1}^{N} a_i(n) d(n-i),

where x(n) and d(n) are the input signal and the desired response signal, respectively, of the adaptive filter. Explain the possible advantages and disadvantages of this adaptive filter over the adaptive FIR filter.

4.7. Adaptive Quadratic Filters: Develop an adaptive LMS quadratic filter that models the relationship between the input signal and the desired response signal as

d̂(n) = Σ_{i1=0}^{L-1} Σ_{i2=i1}^{L-1} h_2(i1, i2; n) x(n-i1) x(n-i2).

A quadratic system identification problem is briefly discussed in Example 2.17.

4.8. Linear Phase Adaptive Filters: Derive an LMS adaptive filter that is constrained such that w_i(n) = w_{L-i}(n), so that the filter coefficients at any time correspond to those of a linear phase filter.

4.9. Adaptive Filters With Variable Update Equations: Develop a stochastic gradient adaptive filter that attempts to minimize the following cost function:

J(n) = { E{|e^2(n)|}  if |e(n)| < 1
         E{|e^3(n)|}  if |e(n)| ≥ 1.

Discuss the possible advantages and disadvantages of your algorithm over the LMS adaptive filter.

4.10. The Backpropagation Algorithm for a Single Artificial Neuron: Consider the block diagram of the system in Figure 4.26, which depicts the structure of an L-input, one-output artificial neuron. When several of these structures are cascaded together, they form a feedforward artificial neural network. The output of this system is

y(n) = f( Σ_{i=1}^{N} w_i(n) x_i(n) ),    (4.39)



Figure 4.26: A single artificial neuron.

where x_i(n) is the ith input signal and w_i(n) is the ith neuron coefficient. A common choice for the function f(u) is

f(u) = (e^u - e^{-u}) / (e^u + e^{-u}) = tanh(u),

which is also known as the sigmoid function in the neural network field.

a. Derive a stochastic gradient algorithm for adjusting the ith coefficient of the artificial neuron to approximately minimize the mean-squared error J(n) = E{e^2(n)}, where e(n) = d(n) - y(n). Express your answer in vector form.

b. From your result in part a, is the update for w_i(n) linear in the instantaneous values of the parameters {w_i(n)}?

4.11. The Complex LMS Adaptive Filter: The generalization of the LMS adaptive filter to complex-valued signals and coefficients is useful in communication systems, where the complex signal representation is used to describe the in-phase and quadrature components of the received signal. Let x(n) be defined as

x(n) = x_R(n) + j x_I(n),

where x_R(n) and x_I(n) are the real and imaginary components of the input signal. Similarly, let

w_i(n) = w_{R,i}(n) + j w_{I,i}(n)

L1

y(n) =
i=0

wi (n)x(n i)

as before. Define the error signal e(n) as

e(n) = d(n) - y(n) = (d_R(n) - y_R(n)) + j(d_I(n) - y_I(n)).

Show that the stochastic gradient algorithm for adjusting the coefficient vector W(n) to approximately minimize the mean-squared value of the absolute value of the error, given by E{|e(n)|^2}, is

W(n+1) = W(n) + μ e(n) X*(n),    (4.40)

where the ith element of X*(n) is the complex conjugate of x(n-i+1). In this case, the differentiation of a real-valued function f(u) with respect to its complex-valued argument u = u_R + j u_I is defined as

∂f(u)/∂u = ∂f(u)/∂u_R + j ∂f(u)/∂u_I.    (4.41)

4.12. The Filtered-X LMS Adaptive Filter: Consider the block diagram of the system in Figure 4.27, where the output of an adaptive filter is passed through another fixed FIR filter with impulse response vector H = [h_0  h_1  ⋯  h_{M-1}]^T. Such a block diagram often arises in adaptive control systems. The error signal e(n) in this situation is given by

$$e(n) = d(n) - \sum_{m=0}^{M-1} h_m\, y(n-m).$$

a. Develop a version of the LMS adaptive filter that minimizes the mean-squared error cost function E{e^2(n)}. In your derivation, assume that

$$\frac{\partial y(n-m)}{\partial W(n)} \approx \frac{\partial y(n-m)}{\partial W(n-m)} = X(n-m).$$

b. Draw a block diagram of the resulting system that uses the fewest multiplications and additions possible. Hint: The minimum number of multiplications necessary is 2L + M + 1 per iteration.

c. What are the implications of the assumption that you used in part a to derive the algorithm for the choice of step size in this system?

Figure 4.27: LMS adaptive filter for adaptive control.
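Under the approximation in part a, the gradient estimate uses the input filtered by the fixed FIR filter, which is why structures of this kind are usually called filtered-X LMS algorithms. The sketch below is one such realization; the buffering, initialization, and function name are assumptions, and it is not arranged to achieve the minimum multiplication count asked for in part b.

```python
import numpy as np

def filtered_x_lms(x, d, h, L, mu):
    """Sketch of a filtered-X LMS recursion consistent with the assumption in
    part a: the regressor used in the update is x(n) filtered by the fixed
    FIR filter h."""
    w = np.zeros(L)
    M = len(h)
    xf = np.convolve(x, h)[:len(x)]       # input filtered by the fixed filter
    y = np.zeros(len(x))
    e = np.zeros(len(x))
    for n in range(len(x)):
        x_vec  = np.array([x[n - i]  if n - i >= 0 else 0.0 for i in range(L)])
        xf_vec = np.array([xf[n - i] if n - i >= 0 else 0.0 for i in range(L)])
        y[n] = w @ x_vec                                     # adaptive filter output
        y_h = sum(h[m] * y[n - m] for m in range(M) if n - m >= 0)
        e[n] = d[n] - y_h                                    # error after the fixed filter
        w = w + mu * e[n] * xf_vec                           # update with the filtered regressor
    return w, e
```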

Figure 4.28: Cascade-form LMS adaptive filter. (In the figure, the input x(n) passes through the cascade of the sections a_0 + a_1 z^{-1} + a_2 z^{-2} and b_0 + b_1 z^{-1} + b_2 z^{-2} to produce the model output ŷ(n).)

4.13. Cascade-Form LMS Adaptive Filter: Consider the cascade-form structure of the system model shown in Figure 4.28. Develop an LMS adaptive filter that attempts to minimize the squared estimation error at each time instant for the parameters a_0(n), a_1(n), a_2(n), b_1(n), and b_2(n). Does the mean-square error surface for this problem have a unique minimum? Hint: Consider the approximation used in Problem 4.12 above.

4.14. Optimum MMSE Solution for Nonstationary Channel Equalization:

a. Show through direct solution of the equation R_XX(n) W_opt(n) = P_dX that the minimum mean-square-error solution for the nonstationary channel equalization problem in Example 4.3 is given by

$$W_{opt}(n) = \begin{bmatrix} 1 \\ a(n) \end{bmatrix}. \qquad (4.42)$$

b. Does this result hold if {s(n)} is a nonstationary i.i.d. random sequence? Explain.

4.15. The Continuous-Time LMS Adaptive Filter: Consider the continuous-time system defined as

$$y(t) = \int w(s)\, x(t-s)\, ds, \qquad (4.43)$$

where w(t) is the impulse response of the continuous-time filter. Determine a differential equation update for w(t) of the form

$$\frac{dw(t)}{dt} = -\mu\, \frac{d\, e^2(t)}{dw(t)}, \qquad (4.44)$$

where e(t) = d(t) − y(t).
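A crude way to experiment with an update of this kind is to discretize both the convolution integral and the gradient flow. The sketch below uses a simple Euler step; the grid spacing dt, the tap count, and the factor of 2 from differentiating e²(t) are all part of this illustrative discretization rather than the continuous-time result itself.

```python
import numpy as np

def continuous_time_lms(x, d, num_taps, mu, dt):
    """Euler-discretized sketch of a gradient-flow update for w(s): the impulse
    response is represented on a grid of num_taps points spaced dt apart, so the
    integral in (4.43) becomes a discrete sum."""
    w = np.zeros(num_taps)
    e = np.zeros(len(x))
    for k in range(len(x)):
        x_vec = np.array([x[k - i] if k - i >= 0 else 0.0 for i in range(num_taps)])
        y = dt * (w @ x_vec)                    # y(t) ~ integral of w(s) x(t-s) ds
        e[k] = d[k] - y
        w = w + dt * mu * 2.0 * e[k] * x_vec    # Euler step of the gradient flow
    return w, e
```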

4.16. Computing Assignment on Adaptive Prediction: This assignment evaluates the performance of the LMS adaptive filter in a prediction problem. For this, we consider an input signal that is generated using the model

$$x(n) = 0.44\,\nu(n) + 1.5\,x(n-1) - x(n-2) + 0.25\,x(n-3),$$

where ν(n) is a zero-mean, i.i.d., Gaussian-distributed random process with unit variance.

a. Obtain an expression for the power spectral density of x(n).

b. Find the coefficients of the MMSE one-step linear predictor for x(n) that employs four coefficients.

c. Develop an adaptive LMS predictor employing four coefficients for x(n).

d. Evaluate the 4 × 4 autocorrelation matrix and the 4-element cross-correlation vector for this prediction problem. Derive the evolution equations for the mean values of each adaptive predictor coefficient for μ = 0.01 and zero initial coefficient values. Find the steady-state misadjustment for this step size.

e. Generate a 2000-sample sequence using the model for x(n) described earlier. Evaluate the mean coefficient behavior using fifty independent experiments. Compare the empirical averages with the theoretical equations of part d.

f. Plot the mean-squared prediction error obtained by averaging the squared prediction errors of the fifty experiments. If the steady state appears to have been reached, evaluate the mean-square prediction error as the ensemble average of the time average of the last one hundred samples of the squared errors in each run over the fifty experiments. Compare the empirical misadjustment with its theoretical value.

g. Explain the possible reasons for the differences between the theoretical and empirical results.
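For parts e and f, a Monte Carlo study along the following lines can be used; the random seed, the buffering, and the way the averages are accumulated are implementation choices rather than part of the assignment.

```python
import numpy as np

def run_prediction_experiment(num_samples=2000, num_runs=50, mu=0.01, order=4):
    """Sketch of the ensemble study: generate x(n) from the stated model, run a
    4-coefficient LMS one-step predictor, and average coefficient trajectories
    and squared prediction errors over independent runs."""
    rng = np.random.default_rng(0)
    w_avg = np.zeros((num_samples, order))
    mse = np.zeros(num_samples)
    for _ in range(num_runs):
        nu = rng.standard_normal(num_samples + 3)
        x = np.zeros(num_samples + 3)
        for n in range(3, num_samples + 3):
            x[n] = 0.44 * nu[n] + 1.5 * x[n-1] - x[n-2] + 0.25 * x[n-3]
        x = x[3:]
        w = np.zeros(order)
        for n in range(order, num_samples):
            x_vec = x[n-order:n][::-1]       # x(n-1), ..., x(n-4)
            e = x[n] - w @ x_vec             # one-step prediction error
            w = w + mu * e * x_vec
            w_avg[n] += w / num_runs
            mse[n] += e**2 / num_runs
    return w_avg, mse
```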

4.17. Computing Assignment on Adaptive Interference Cancellation: One significant problem that occurs in test equipment such as electrocardiographs (ECG) and electroencephalographs (EEG) is the inability to completely isolate the devices from line voltages. Since the measurements made by these machines typically range in the microvolts, even a small leakage of the line voltage can completely obscure the desired measurements. Fortunately, the source of interference is known in this case, and we can use this information to cancel the interference adaptively. A block diagram of the system one would employ for this application is shown in Figure 4.29. The desired response signal contains the signal f(n) that we want to extract. The interference signal differs from the input signal by an unknown initial phase and an unknown amplitude value, as shown in the figure. Assuming that f(n) is uncorrelated with the source of interference x(n), we can argue that the estimate of d(n) using x(n) will contain only the interference and, therefore, the estimation error signal is a cleaner version of the signal f(n).

Figure 4.29: Block diagram of an adaptive interference canceller. (In the figure, the desired response is f(n) plus a sinusoidal interference of amplitude B at frequency ω₀, the reference input is a sinusoid of amplitude A at the same frequency with a different phase, and the output of the interference estimator is subtracted from the desired response to produce the estimate f̂(n).)

a. Develop an adaptive interference canceller using the ideas described above.

b. To simulate an ECG signal, generate a triangular waveform f(n) with period twenty samples and a peak value of 0.1 volt. Also generate a sinusoidal signal x(n) with amplitude 1 volt and frequency 60 Hz, sampled at a rate of 200 samples/second. Generate 2000 samples of each signal. You can simulate the corrupted signal using the model

$$d(n) = f(n) + 0.5 \sin\!\left( \frac{120\pi\, (n - 0.25)}{200} \right).$$

By trial and error, as well as your understanding of the predictability of sinusoids, find a good choice for the number of coefficients for the adaptive filter and the step size. Plot the enhanced version of f(n) obtained as the estimation error of the adaptive filter. Comment on the performance of the interference canceller you developed.
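The following sketch sets up one possible version of the simulation in part b; the way the triangular wave is constructed, the two-tap filter length, and the step size are illustrative guesses to be refined by experiment.

```python
import numpy as np

def interference_canceller_demo(num_samples=2000, num_taps=2, mu=0.01):
    """Sketch of the part-b simulation: a triangular 'ECG' f(n), a 60 Hz
    reference sampled at 200 Hz, and an LMS filter that estimates the leaked
    interference so that the error output approximates f(n)."""
    n = np.arange(num_samples)
    f = 0.1 * 2.0 * np.abs((n % 20) / 20.0 - 0.5)           # triangular wave, period 20, peak 0.1
    x = np.sin(2.0 * np.pi * 60.0 * n / 200.0)              # reference interference
    d = f + 0.5 * np.sin(120.0 * np.pi * (n - 0.25) / 200.0)
    w = np.zeros(num_taps)
    e = np.zeros(num_samples)
    for k in range(num_samples):
        x_vec = np.array([x[k - i] if k - i >= 0 else 0.0 for i in range(num_taps)])
        y = w @ x_vec                  # estimate of the leaked line interference
        e[k] = d[k] - y                # enhanced (cleaned) version of f(n)
        w = w + mu * e[k] * x_vec
    return e, f
```

Two coefficients are enough in principle because matching the unknown amplitude and phase of a single sinusoid requires only two degrees of freedom; longer filters and other step sizes can be compared experimentally as the exercise suggests.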

4.18. Computing Assignment on Adaptive Frequency Tracking: In this assignment, we consider the problem of tracking the instantaneous frequency of an FM signal modeled as

$$x(t) = \cos\bigl( 2\pi (0.25)\, t - 0.025 \cos(4\pi t) \bigr) + \eta(n),$$

where η(n) is an additive noise signal that is uncorrelated with the sinusoidal components. We can compute the instantaneous frequency of this signal by finding the derivative of the instantaneous phase function given by

$$\phi(t) = 2\pi (0.25)\, t - 0.025 \cos(4\pi t).$$

Our approach to finding the instantaneous frequency is to use an L-coefficient predictor for x(t) after sampling it and then to evaluate the frequency corresponding to the peak of the autoregressive spectrum estimate obtained from the coefficients at each time. See Example 2.15 for a description of the autoregressive spectrum estimation technique.

a. Generate 2000 samples of a discrete version of the input FM signal by sampling it at a rate of 1000 samples/second. The noise component may be modeled as a white, Gaussian process with zero mean value and variance σ² = 0.01. Develop an adaptive predictor for this signal.

b. By trial and error, find good choices of the step size and coefficient length for this adaptive filter to track the frequencies. You can use the model of the instantaneous frequency to guide you in your selection process. Estimate the instantaneous frequency by calculating the autoregressive spectrum estimate every ten samples. After compensating for the normalization of the frequency variable during sampling, plot the estimated instantaneous frequencies against their true values.

c. Document your observations on this experiment.
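The sketch below outlines the predictor-plus-spectrum-peak procedure; the predictor length, step size, FFT size, and update interval are illustrative guesses, and x is assumed to be the sampled, noisy FM signal stored in a NumPy array. The FFT size must be large enough to resolve the expected frequency variations.

```python
import numpy as np

def track_frequency(x, fs, L=4, mu=0.05, nfft=1 << 17, hop=10):
    """Sketch of part b: run an LMS one-step predictor on the sampled signal x
    and, every `hop` samples, locate the peak of the autoregressive spectrum
    estimate 1/|A(e^{jw})|^2 formed from the current predictor coefficients."""
    w = np.zeros(L)
    freqs = []
    for n in range(L, len(x)):
        x_vec = x[n-L:n][::-1]            # x(n-1), ..., x(n-L)
        e = x[n] - w @ x_vec              # prediction error
        w = w + mu * e * x_vec
        if n % hop == 0:
            a = np.concatenate(([1.0], -w))                  # AR polynomial A(z)
            spectrum = 1.0 / np.abs(np.fft.rfft(a, nfft))**2
            freqs.append(np.argmax(spectrum) * fs / nfft)    # peak bin -> Hz
    return np.array(freqs)
```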
4.19. Computing Assignment on FIR Filter Design: This assignment guides you through the design of time-invariant FIR filters from specifications using the LMS adaptive filter. Consider the problem of designing a linear phase FIR filter that meets the following specifications:

$$0.9 \le |H(\omega)| \le 1.1 \ ; \quad 0 \le |\omega| \le \pi/4$$
$$|H(\omega)| \le 0.01 \ ; \quad \pi/2 \le |\omega| \le \pi.$$

We can design this filter using the adaptive filter by creating the appropriate input and desired response signals for the adaptive filter. Create an input signal as

$$x(n) = \sum_{i=1}^{K} A_i \cos(\omega_i n + \theta_i),$$

where the frequencies ω_i are uniformly sampled from the passband and stopband of the desired filter response, and the phase values θ_i are uncorrelated with each other and uniformly distributed in the range [−π, π). Let the ideal filter response be

$$H_I(\omega) = \begin{cases} e^{j\phi_I(\omega)} & ; \ 0 \le |\omega| \le \pi/4 \\ 0 & ; \ \text{otherwise}. \end{cases}$$

(What should φ_I(ω) be for an L-coefficient filter to have linear phase characteristics?) Since the input is a sum of sinusoids, we can easily find the output of the ideal filter and use it as the desired response signal for the adaptive filter. Use an adaptive filter with the above input signal and desired response signal, and find the coefficients of the adaptive filter when it reaches steady state. You may even average the coefficients over a long duration of time after the adaptive filter has effectively converged. You must verify that the approximation obtained using the adaptive filter meets the specifications. When you perform the experiments, keep the following points in mind. The number of sinusoidal components in the input signal should be fairly large. The amplitude values A_i may all be chosen to be the same; if they are chosen differently, you are weighting the sinusoids differently, thereby emphasizing the specifications in certain regions more than in others. Remember that the acceptable values of the step size depend on the input signal power. You may have to run the adaptive filter for a long time. Consider impulse response lengths up to 128 samples. Choose the design that employs the minimum number of coefficients and still meets the specifications.
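The sketch below illustrates one way to set up this design procedure. The band sampling, the number of sinusoids, the filter length, the step size, the iteration count, and the standard linear-phase choice φ_I(ω) = −ω(L−1)/2 are all illustrative assumptions; more iterations and coefficient averaging may be needed in practice.

```python
import numpy as np

def design_fir_with_lms(L=32, K=64, mu=1e-4, num_iters=100000, seed=0):
    """Sketch of the design procedure: a sum of sinusoids drawn from the pass
    and stop bands drives an L-coefficient LMS filter whose desired response
    is the ideal linear-phase filter's output (a delay of (L-1)/2 samples in
    the passband, zero in the stopband)."""
    rng = np.random.default_rng(seed)
    omegas = np.concatenate([np.linspace(0.0, np.pi/4, K // 2),
                             np.linspace(np.pi/2, np.pi, K // 2)])
    thetas = rng.uniform(-np.pi, np.pi, size=K)
    amps = np.ones(K)                              # equal amplitudes, as recommended
    delay = (L - 1) / 2.0                          # linear-phase group delay
    passband = omegas <= np.pi / 4
    w = np.zeros(L)
    for n in range(num_iters):
        # regressor: x(n-i) = sum_k A_k cos(w_k (n-i) + theta_k), i = 0..L-1
        x_vec = np.cos(np.outer(n - np.arange(L), omegas) + thetas) @ amps
        # ideal output: passband sinusoids delayed by (L-1)/2, stopband removed
        d = np.cos(omegas[passband] * (n - delay) + thetas[passband]) @ amps[passband]
        e = d - w @ x_vec
        w = w + mu * e * x_vec
    return w
```

After convergence, the frequency response of the resulting coefficients should be checked against the specifications, and the filter length adjusted until the shortest acceptable design is found.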
