Two Fundamental Learning
Paradigms
Non-associative
an organism acquires knowledge about the properties of a single repetitive stimulus.
Associative
an organism acquires knowledge about the
relationship of either one stimulus to another, or one
stimulus to the organism’s own behavioral response to
that stimulus.
2/120
Examples of Associative
Learning(1/2)
Classical conditioning
Association of an unconditioned stimulus (US) with a
conditioned stimulus (CS).
CS’s such as a flash of light or a sound tone produce
weak responses.
US’s such as food or a shock to the leg produce a
strong response.
With repeated presentation of the CS followed by the US, the CS begins to evoke the response associated with the US.
Example: If a flash of light is always followed by
serving of meat to a dog, after a number of learning
trials the light itself begins to produce salivation.
3/120
Examples of Associative
Learning(2/2)
Operant conditioning
Formation of a predictive relationship between a
stimulus and a response.
Example:
Place a hungry rat in a cage which has a lever
on one of its walls. Measure the spontaneous
rate at which the rat presses the lever by virtue
of its random movements around the cage. If
the rat is promptly presented with food when
the lever is pressed, the spontaneous rate of
lever pressing increases!
4/120
Reflexive and Declarative Learning
Reflexive learning
repetitive learning is involved and recall does not
involve any awareness or conscious evaluation.
Declarative learning
established by a single trial or experience and
involves conscious reflection and evaluation for its
recall.
Constant repetition of declarative knowledge
often manifests itself in reflexive form.
5/120
Important Aspects of Human Memory(1/4)
Input
Stimulus information is processed in two distinct stages.
6/120
Important Aspects of Human Memory(2/4)
Input
(Figure: stages through which input stimulus information is processed.)
8/120
Important Aspects of Human Memory(4/4)
Long-term memory involves
plastic changes in the brain which take the form of strengthening or weakening of existing synapses, and
the formation of new synapses.
The learning mechanism distributes the memory over different areas:
this makes memory robust to damage, and
permits the brain to work easily from partially corrupted information.
Reflexive and declarative memories may actually involve different neuronal circuits.
9/120
Learning Algorithms(1/2)
Define an architecture-dependent procedure to
encode pattern information into weights
Learning proceeds by modifying connection
strengths.
Learning is data driven:
A set of input–output patterns derived from a (possibly
unknown) probability distribution.
Output pattern might specify a desired system response for a
given input pattern
Learning involves approximating the unknown function as described by the given data.
10/120
Learning Algorithms(2/2)
Learning is data driven:
Alternatively, the data might comprise patterns that
naturally cluster into some number of unknown
classes
Learning problem involves generating a suitable
classification of the samples.
11/120
Supervised Learning(1/2)
Data comprises a set of Q discrete samples
T = {(Xk, Dk)}, k = 1, …, Q
where each sample relates an input vector Xk ∈ Rn to an output vector Dk ∈ Rp.
(Figure: an example function, 3x^5 − 1.2x^4 − 12.27x^3 + 3.288x^2 + 7.182x, described by a set of noisy data points.)
12/120
Supervised Learning(2/2)
The set of samples describes the behavior of an unknown function f : Rn → Rp which is to be characterized.
(Figure: an example function, f(x) = 3x^5 − 1.2x^4 − 12.27x^3 + 3.288x^2 + 7.182x, described by a set of noisy data points over x ∈ [−2, 2].)
13/120
The Supervised Learning Procedure
(Diagram: a neural network produces signal Sk in response to input Xk; the error between Sk and the desired response Dk is fed back for network adaptation.)
16/120
Clustering and Classification(1/3)
Given a set of data samples {Xi}, Xi ∈ Rn, is it possible to identify well defined “clusters”, where each cluster defines a class of vectors which are similar in some broad sense?
(Figure: scatter of two-dimensional data points forming two clusters.)
17/120
Clustering and Classification(2/3)
Clusters help establish a classification structure within a data set that has no categories defined in advance.
Classes are derived from clusters by appropriate labeling.
(Figure: two clusters of data points, Cluster 1 and Cluster 2, with their cluster centroids marked.)
18/120
Clustering and Classification(3/3)
The goal of pattern classification is to assign an input pattern to one of a finite number of classes.
Quantization vectors are called codebook vectors.
(Figure: clustered data points in two dimensions.)
19/120
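The slides do not prescribe an algorithm for locating cluster centroids. As an illustration only, the following MATLAB sketch uses a simple two-means style iteration (an assumption, not a method from the text) to find two centroids that could serve as codebook vectors; the synthetic data and all variable names are hypothetical.

% Illustrative two-cluster centroid search (k-means style); not from the slides.
% Requires implicit expansion (MATLAB R2016b or later).
X = [randn(2,50)+1.5, randn(2,50)-1.5];   % synthetic 2-D samples forming two clusters
C = X(:, randperm(size(X,2), 2));         % initialize two centroids from random samples
for iter = 1:20
    d1 = sum((X - C(:,1)).^2, 1);         % squared distance of each sample to centroid 1
    d2 = sum((X - C(:,2)).^2, 1);         % squared distance of each sample to centroid 2
    idx = (d2 < d1) + 1;                  % assign each sample to the nearer centroid
    for j = 1:2
        C(:,j) = mean(X(:, idx == j), 2); % recompute centroid as the cluster mean
    end
end
disp(C)                                   % the two centroids (codebook vectors)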
Characteristics of Supervised and
Unsupervised Learning
20/120
General Philosophy of Learning:
Principle of Minimal Disturbance
Adapt to reduce the output error for the
current training pattern, with minimal
disturbance to responses already learned.
21/120
Error Correction and Gradient
Descent Rules
Error correction rules alter the weights of a
network using a linear error measure to reduce
the error in the output generated in response to
the present input pattern.
Gradient rules alter the weights of a network
during each pattern presentation by employing
gradient information with the objective of
reducing the mean squared error (usually
averaged over all training patterns).
22/120
Learning Objective for TLNs(1/4)
Pattern and weight vectors are augmented with the bias input +1:
Xk = (x0, x1k, …, xnk)T, Xk ∈ Rn+1
Wk = (w0k, w1k, …, wnk)T, Wk ∈ Rn+1
(Diagram: a TLN with inputs +1, x1, …, xn, weights w0, …, wn, a summing junction Σ and output signal S.)
Objective: To design the weights of a TLN to correctly classify a given set of patterns.
23/120
Learning Objective for TLNs(2/4)
Assumption: A training set of the following form is given:
T = {(Xk, dk)}, k = 1, …, Q, Xk ∈ Rn+1, dk ∈ {0,1}
The patterns with desired output dk = 1 form the set X1, and those with dk = 0 form the set X0.
25/120
Learning Objective for TLNs(4/4)
Context: TLNs
Find a weight vector WS such that for all Xk ∈ X1, S(yk) = 1; and for all Xk ∈ X0, S(yk) = 0.
Positive inner products translate to a +1 signal, and negative inner products to a 0 signal.
This translates to saying that for all Xk ∈ X1, XkTWS > 0; and for all Xk ∈ X0, XkTWS < 0.
26/120
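As a small illustration of this objective, the condition can be checked in MATLAB by testing the sign of each inner product XkTWS. This is a sketch only; the example patterns and the candidate weight vector below are assumptions, not data from the text.

% Augmented patterns: the first row is the bias input +1 (example values assumed).
X1 = [1 1; 0.8 0.9; 0.9 0.7];   % two patterns of class X1, one per column
X0 = [1 1; 0.1 0.2; 0.2 0.1];   % two patterns of class X0, one per column
WS = [-1; 1; 1];                % a candidate solution weight vector (assumed)
ok1 = all(X1' * WS > 0)         % true if every Xk in X1 gives Xk'*WS > 0
ok0 = all(X0' * WS < 0)         % true if every Xk in X0 gives Xk'*WS < 0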
Pattern Space(1/2)
Points that satisfy XTWS = 0 define a separating hyperplane in pattern space.
Two dimensional case: pattern space points on one side of this hyperplane (with an orientation indicated by the arrow) yield positive inner products with WS and thus generate a +1 neuron signal.
27/120
Pattern Space(2/2)
Two dimensional case: pattern space points on the other side of the hyperplane generate a negative inner product with WS and consequently a neuron signal equal to 0.
Points in C0 and C1 are thus correctly classified by such a placement of the hyperplane.
28/120
A Different View: Weight Space
(1/2)
Weight vector is a variable
vector.
WTXk = 0 represents a
hyperplane in weight space
Always passes through the
origin since W = 0 is a
trivial solution of WTXk = 0.
29/120
A Different View: Weight Space
(2/2)
Called the pattern
hyperplane of pattern Xk.
Locus of all points W such that WTXk = 0.
Divides the weight space
into two parts: one which
generates a positive inner
product WTXk > 0, and the
other a negative inner
product WTXk<0.
30/120
Identifying a Solution Region from Orientated Pattern Hyperplanes(1/2)
For each pattern Xk in pattern space there is a corresponding hyperplane in weight space.
Conversely, for each point in weight space there is a corresponding hyperplane in pattern space.
(Figure: weight space (w1, w2) showing pattern hyperplanes X1, X2, X3 and the solution region.)
31/120
Identifying a Solution Region from Orientated Pattern Hyperplanes(2/2)
A solution region in weight space is identified by four pattern hyperplanes, with
χ1 = {X1, X2}
χ0 = {X3, X4}
(Figure: weight space (w1, w2) with the four pattern hyperplanes X1–X4 bounding the solution region.)
32/120
Requirements of the Learning
Procedure
Linear separability guarantees the existence of a
solution region.
Points to be kept in mind in the design of an automated weight update procedure:
It must consider each pattern in turn to assess the correctness of the present classification.
It must subsequently adjust the weight vector to eliminate a classification error, if any.
Since the set of all solution vectors forms a convex cone, the weight update procedure should terminate as soon as it penetrates the boundary of this cone (solution region).
33/120
Design in Weight Space(1/3)
Assume: Xk ∈ X1 and the inner product WkTXk is erroneously non-positive.
For correct classification, shift the weight vector to some position Wk+1 where the inner product is positive.
(Figure: weight space showing Wk, Wk+1, the pattern hyperplane of Xk, and the half-spaces WTXk > 0 and WTXk < 0.)
34/120
Design in Weight Space(2/3)
The smallest perturbation in Wk that produces the desired change is the perpendicular distance from Wk onto the pattern hyperplane.
(Figure: weight space showing the perpendicular shift from Wk to Wk+1 across the pattern hyperplane of Xk.)
35/120
Design in Weight Space(3/3)
In weight space, the direction perpendicular to the pattern hyperplane is none other than that of Xk itself.
(Figure: weight space showing Wk updated to Wk+1 along the direction of Xk.)
36/120
Simple Weight Change Rule: Perceptron Learning Law
If Xk ∈ X1 and WkTXk ≤ 0, then Wk+1 = Wk + ηXk
If Xk ∈ X0 and WkTXk ≥ 0, then Wk+1 = Wk − ηXk
Otherwise, leave the weights unchanged; η > 0 is the learning rate.
38/120
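A minimal MATLAB sketch of this weight change rule follows. It is illustrative only: the example patterns (the AND function), the epoch limit, and η = 1 are assumptions, not values from the text.

% Perceptron learning on augmented patterns (columns of X) with targets d in {0,1}.
X = [1 1 1 1; 0 0 1 1; 0 1 0 1];   % example augmented patterns: bias row of +1s, then x1, x2
d = [0 0 0 1];                     % example desired outputs (AND function)
W = zeros(3,1); eta = 1;           % initial weights and learning rate (assumed)
for epoch = 1:100
    changed = false;
    for k = 1:size(X,2)
        y = X(:,k)' * W;                       % activation for pattern k
        if d(k) == 1 && y <= 0                 % misclassified pattern of X1
            W = W + eta * X(:,k); changed = true;
        elseif d(k) == 0 && y >= 0             % misclassified pattern of X0
            W = W - eta * X(:,k); changed = true;
        end
    end
    if ~changed, break; end                    % stop once all patterns are classified correctly
end
disp(W')                                       % a solution weight vector WS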
Linear Containment
Consider the set X0’ in which each element of X0 is negated.
Given a weight vector Wk, for any Xk ∈ X1 ∪ X0’, XkTWk > 0 implies correct classification and XkTWk < 0 implies incorrect classification.
X’ = X1 ∪ X0’ is called the adjusted training set.
Assumption of linear separability guarantees the existence of a solution weight vector WS, such that XkTWS > 0 ∀ Xk ∈ X’.
We say X’ is a linearly contained set.
39/120
Recast of Perceptron Learning with Linearly Contained Data
For the adjusted training set X’, a pattern is misclassified whenever XkTWk ≤ 0, and the update is simply Wk+1 = Wk + ηXk.
41/120
Perceptron Convergence Theorem
Given: A linearly contained training set X’ and any initial weight vector W1.
Let SW be the weight vector sequence generated in response to presentation of a training sequence SX upon application of the Perceptron learning law. Then for some finite index k0 we have: Wk0 = Wk0+1 = Wk0+2 = … = WS as a solution vector.
See the text for detailed proofs.
42/120
Hand-worked Example
(Diagram: a two-input TLN with bias input +1, weights w0, w1, w2, and inputs x1, x2.)
43/120
Classroom Exercise(1/2)
44/120
Classroom Exercise(2/2)
45/120
MATLAB Simulation
(a) Hyperplane movement depicted during Perceptron Learning
(Figure: the separating hyperplane in (x1, x2) pattern space plotted at iterations k = 5, 15, 25, 35 for the patterns (0,0), (0,1), (1,0), (1,1).)
(Figure: the corresponding weight trajectory plotted in (w0, w1, w2) weight space.)
46/120
Perceptron Learning and Non-separable Sets
Theorem:
Given a finite set of training patterns X, there exists a number M such that if we run the Perceptron learning algorithm beginning with any initial set of weights, W1, then any weight vector Wk produced in the course of the algorithm will satisfy ||Wk|| ≤ ||W1|| + M.
47/120
Two Corollaries(1/2)
If, in a finite set of training patterns X, each
pattern Xk has integer (or rational) components
xik, then the Perceptron learning algorithm will
visit a finite set of distinct weight vectors Wk.
48/120
Two Corollaries(2/2)
For a finite set of training patterns X, with
individual patterns Xk having integer (or rational)
components xik the Perceptron learning algorithm
will, in finite time, produce a weight vector that
correctly classifies all training patterns iff X is
linearly separable, or leave and re-visit a specific
weight vector iff X is linearly non-separable.
49/120
Handling Linearly Non-separable
Sets: The Pocket Algorithm(1/2)
Philosophy: Incorporate positive reinforcement in a way that rewards weights that yield a low-error solution.
Pocket algorithm works by remembering the
weight vector that yields the largest number of
correct classifications on a consecutive run.
50/120
Handling Linearly Non-separable
Sets: The Pocket Algorithm(2/2)
This weight vector is kept in the “pocket”, and
we denote it as Wpocket .
While updating the weights in accordance with
Perceptron learning, if a weight vector is
discovered that has a longer run of consecutively
correct classifications than the one in the pocket,
it replaces the weight vector in the pocket.
51/120
Pocket Algorithm:
Operational Summary(1/2)
52/120
Pocket Algorithm:
Operational Summary(2/2)
53/120
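The operational summary itself is not reproduced on these slides. The following MATLAB sketch fills it in from the description above (a pocketed weight vector keeping the longest run of consecutive correct classifications); the variable names, the iteration budget, and the training data X, d are assumptions.

% Pocket algorithm: perceptron updates plus a "pocket" holding the best weights so far.
% X: augmented patterns (columns), d: targets in {0,1}; assumed to exist in the workspace.
W = zeros(size(X,1),1); Wpocket = W;    % current and pocketed weight vectors
run = 0; run_pocket = 0; eta = 1;       % lengths of consecutive correct runs (assumed eta)
for t = 1:10000
    k = randi(size(X,2));               % pick a training pattern at random
    y = X(:,k)' * W;
    correct = (d(k) == 1 && y > 0) || (d(k) == 0 && y < 0);
    if correct
        run = run + 1;
        if run > run_pocket             % longer correct run: replace the pocketed weights
            Wpocket = W; run_pocket = run;
        end
    else                                % misclassification: perceptron update, reset the run
        if d(k) == 1, W = W + eta*X(:,k); else, W = W - eta*X(:,k); end
        run = 0;
    end
end

After training, Wpocket is taken as the output, even when X is not linearly separable.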
Pocket Convergence Theorem
Given a finite set of training examples, X, and a probability p < 1, there exists an integer k0 such that after any k > k0 iterations of the pocket algorithm, the probability that the pocket weight vector Wpocket is optimal exceeds p.
54/120
Linear Neurons and Linear Error
Consider a training set of the form T = {(Xk, dk)}, Xk ∈ Rn+1, dk ∈ R.
To allow the desired output to vary smoothly or continuously over some interval, consider a linear signal function: sk = yk = XkTWk
The linear error ek due to a presented training pair (Xk, dk) is the difference between the desired output dk and the neuronal signal sk: ek = dk − sk = dk − XkTWk
55/120
Operational Details of α–LMS
α–LMS works with the linear signal sk = XkTWk and corrects the weights in proportion to the error, along the direction of the input Xk itself:
Wk+1 = Wk + η ek Xk / ||Xk||²
Each iteration reduces the error by a factor of η.
(Diagram: a linear neuron with inputs +1, x1, …, xn and weights w0k, …, wnk producing sk = XkTWk; the weight vector moves from Wk to Wk+1 in the direction of Xk.)
57/120
α –LMS: Operational Summary
58/120
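The operational summary is not reproduced on the slide. A minimal MATLAB sketch of the α–LMS iteration, under the assumption that training data and a learning rate are available, could look as follows; the epoch count, η value, and variable names are assumptions.

% alpha-LMS: normalized error-correction updates on augmented patterns.
% X: augmented patterns (columns), d: real-valued desired outputs; assumed to exist.
W = zeros(size(X,1),1); eta = 0.5;                   % initial weights and 0 < eta < 2 (assumed)
for epoch = 1:50
    for k = 1:size(X,2)
        e = d(k) - X(:,k)' * W;                      % linear error e_k = d_k - X_k' W_k
        W = W + eta * e * X(:,k) / (X(:,k)' * X(:,k)); % update along X_k, normalized by ||X_k||^2
    end
end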
MATLAB Simulation Example
Synthetic data set: points scattered in the y direction about the fifth-order function
f = 3x(x−1)(x−1.9)(x+0.7)(x+1.8) ± ε
(Figure: the noisy data points plotted over the interval x ∈ [−2, 2].)
59/120
MATLAB Simulation Example
This is achieved by first generating a random scatter ε in the interval [0,1].
(Figure: the function f = 3x(x−1)(x−1.9)(x+0.7)(x+1.8) ± ε overlaid on the scattered data points, x ∈ [−2, 2].)
60/120
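A sketch of how such a data set might be generated in MATLAB. Only the fifth-order function itself comes from the slide; the number of points, the sampling of x over [−2, 2], and the use of the [0,1] scatter as ±ε noise are assumptions.

max_points = 500;                                 % number of samples (assumed)
x = 4*rand(1, max_points) - 2;                    % sample x over the plotted interval [-2, 2] (assumed)
f = 3*x.*(x-1).*(x-1.9).*(x+0.7).*(x+1.8);        % fifth-order function from the slide
eps_scatter = rand(1, max_points);                % random scatter in the interval [0,1]
d = f + sign(randn(1, max_points)).*eps_scatter;  % desired outputs: f +/- epsilon
plot(x, d, '.')                                   % visualize the noisy data set

Variables of this kind (x, d, max_points) are what the later MATLAB Simulation Example code assumes to be present in the workspace.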
Computer Simulation of α-LMS Algorithm
(Figure: the neuron's fit to the data at iteration 1, iteration 10, and iteration 50 (MSE = 0.04), plotted as y versus x.)
61/120
A Stochastic Setting
Assumption that the training set T is well defined
in advance is incorrect when the setting is
stochastic.
In such a situation, instead of deterministic
patterns, we have a sequence of samples {(Xk, dk)}
assumed to be drawn from a statistically
stationary population or process.
For adjustment of the neuron weights in response
to some pattern-dependent error measure, the error
computation has to be based on the expectation of
the error over the ensemble.
62/120
Definition of Mean Squared Error (MSE)(1/2)
We introduce the square error on a pattern Xk as:
εk = (1/2)(dk − XkTWk)² = (1/2)ek²   (1)
   = (1/2)(dk² − 2dk XkTWk + WkT Xk XkT Wk)   (2)
Assumption: The weights are held fixed at Wk while computing the expectation.
63/120
Definition of Mean Squared Error (MSE)(2/2)
The mean-squared error can now be computed by taking the expectation on both sides of (2):
ε = E[εk] = (1/2)E[dk²] − E[dk XkT]Wk + (1/2)WkT E[Xk XkT]Wk   (3)
64/120
Our problem…
To find the optimal weight vector that minimizes the mean-square error.
65/120
Cross Correlations(1/2)
For convenience of expression we define the vector P as the cross-correlation between the desired scalar output dk and the input vector Xk:
PT ≜ E[dk XkT] = E[(dk, dk x1k, …, dk xnk)]   (4)
and R ≜ E[Xk XkT] as the input correlation matrix. The mean-square error (3) then becomes:
ε = E[εk] = (1/2)E[dk²] − PTWk + (1/2)WkT R Wk   (6)
∇ε = (∂ε/∂w0, …, ∂ε/∂wn)T = −P + RW   (7)
To find the optimal set of weights Ŵ, simply set ∇ε = 0, which yields RŴ = P, i.e. the Wiener solution Ŵ = R⁻¹P.
69/120
Computing the Optimal Filter(1/2)
First compute R and P. Substituting the optimal weight vector Ŵ = R⁻¹P into Eqn. (6) gives the minimum mean-square error:
εmin = (1/2)E[dk²] + (1/2)(R⁻¹P)T R (R⁻¹P) − PT R⁻¹P   (11)
     = (1/2)E[dk²] − (1/2)PTŴ   (12)
70/120
Computing the Optimal Filter(2/2)
For the treatment of weight update procedures we reformulate the expression for mean-square error in terms of the deviation V = W − Ŵ of the weight vector from the Wiener solution.
71/120
Computing R(1/2)
Substituting W = V + Ŵ into Eqn. (6):
ε = (1/2)E[dk²] + (1/2)(V + Ŵ)T R (V + Ŵ) − PT(V + Ŵ)   (13)
  = (1/2)E[dk²] + (1/2)(VTRV + VTRŴ + ŴTRV + ŴTRŴ) − PTV − PTŴ   (14)
Noting that VTRŴ + ŴTRV = 2ŴTRV = 2PTR⁻¹RV = 2PTV, the cross terms cancel, leaving
ε = εmin + (1/2)VTRV   (15)
  = εmin + (1/2)(W − Ŵ)T R (W − Ŵ)   (16)
72/120
Computing R(2/2)
Note that since the mean-square error ε is non-negative, we must have VTRV ≥ 0. This implies that R is a positive semi-definite matrix.
73/120
Diagonalization of R
Assume that R has distinct eigenvalues λi. Then we can construct a matrix Q whose columns are the corresponding eigenvectors ηi of R:
Q = (η0 η1 ⋯ ηn)   (17)
R can be diagonalized using an orthogonal similarity transformation as follows. Having constructed Q, and knowing that each column satisfies Rηi = λiηi, we have:
RQ = (η0 η1 ⋯ ηn) diag(λ0, λ1, …, λn)   (18)
76/120
Eigenvectors of R(2/2)
However, vectors passing through the origin along the principal axes must take the form RV = λV. Therefore, for the principal axes V':
RV' = λV'   (22)
Clearly, V' is an eigenvector of R.
77/120
Steepest Descent Search with Exact Gradient Information
Steepest descent search uses exact gradient information available from the mean-square error surface to direct the search in weight space.
The figure shows a projection of the square-error function on the ε – wik plane.
78/120
Steepest Descent Procedure Summary(1/2)
Provide an appropriate weight increment to wik to push the error towards the minimum, which occurs at ŵi.
Perturb the weight in a direction that depends on which side of the optimal weight ŵi the current weight value wik lies.
79/120
Steepest Descent Procedure Summary(2/2)
If wik is to the right of ŵi, say at wik2 where the error gradient is positive, we need to decrease wik.
This rule is summarized in the following statement:
If ∂ε/∂wik > 0 (wik > ŵi), decrease wik
If ∂ε/∂wik < 0 (wik < ŵi), increase wik
80/120
Weight Update Procedure(1/3)
It follows logically, therefore, that the weight component should be updated in proportion with the negative of the gradient:
wik+1 = wik + η(−∂ε/∂wik),  i = 0, 1, …, n   (27)
81/120
Weight Update Procedure(2/3)
Vectorially we may write
Wk +1 = Wk + η ( − ∇ε ) (28)
82/120
Weight Update Procedure(3/3)
Equation (28) is the steepest descent update
procedure. Note that steepest descent uses
exact gradient information at each step to
decide weight changes.
83/120
Convergence of Steepest Descent – 1(1/2)
Question: What can one say about the stability of the algorithm? Does it converge for all values of η?
To answer this question, consider the following series of substitutions and transformations. From Eqns. (28) and (21):
Wk+1 = Wk + η(−∇ε)   (30)
     = Wk − ηRVk   (31)
     = Wk + ηR(Ŵ − Wk)   (32)
     = (I − ηR)Wk + ηRŴ   (33)
84/120
Convergence of Steepest Descent – 1(2/2)
Expressed in terms of the deviation Vk = Wk − Ŵ, Eqn. (33) becomes:
Vk+1 = (I − ηR)Vk   (34)
85/120
Steepest Descent Convergence – 2(1/2)
Rotation to the principal axes of the elliptic contours can be effected by using V = QV':
V'k+1 = Q⁻¹(I − ηR)Q V'k = (I − ηD)V'k   (37)
86/120
Steepest Descent Convergence – 2(2/2)
where D is the diagonal eigenvalue matrix. Recursive application of Eqn. (37) yields:
V'k = (I − ηD)^k V'0   (38)
It follows from this that for stability and convergence of the algorithm:
lim k→∞ (I − ηD)^k = 0   (39)
87/120
Steepest Descent Convergence – 3(1/2)
This requires that
lim k→∞ (1 − ηλmax)^k = 0   (40)
or
0 < η < 2/λmax   (41)
88/120
Steepest Descent Convergence – 3(2/2)
If this condition is satisfied then we have
lim k→∞ V'k = lim k→∞ Q⁻¹(Wk − Ŵ) = 0   (42)
or
lim k→∞ Wk = Ŵ   (43)
89/120
Computer Simulation Example(1/2)
This simulation example employs the fifth-order function data scatter with the data shifted in the y direction by 0.5.
Consequently, the values of R, P and the Wiener solution are respectively:
R = [1 0.500; 0.500 1.61],  P = [0.8386; −1.4625]   (44)
Ŵ = [1.5303; −1.3834]   (45)
90/120
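As a quick sanity check of Eqn. (45), the Wiener solution can be recomputed directly from the R and P of Eqn. (44):

R = [1 0.500; 0.500 1.61];   % correlation matrix from Eqn. (44)
P = [0.8386; -1.4625];       % cross-correlation vector from Eqn. (44)
What = R \ P                 % solves R*What = P; gives approximately [1.5303; -1.3834]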
Computer Simulation Example(2/2)
Exact gradient information is available since the correlation matrix R and the cross-correlation vector P are known.
The weights are updated using the equation:
Wk+1 = Wk + 2ηR(Ŵ − Wk)   (46)
91/120
MATLAB Simulation Example(1/3)
eta = .01; % Set learning rate
R = zeros(2,2); % Initialize correlation matrix
P = zeros(2,1); % Initialize cross-correlation vector (assumed; not shown on the slide)
X = [ones(1,max_points);x]; % Augment input vectors (x, d, max_points assumed in the workspace from the data-generation step)
for k = 1:max_points
R = R + X(:,k)*X(:,k)'; % Compute R
P = P + d(k)*X(:,k); % Compute P (assumed; not shown on the slide)
end
R = R/max_points;
P = P/max_points;
D = mean(d.^2); % Estimate E[dk^2] (assumed; not shown on the slide)
weiner=inv(R)*P; % Compute the Weiner solution
errormin = (D - P'*inv(R)*P)/2; % Find the minimum error per Eqn. (12), assuming D = E[dk^2]
92/120
MATLAB Simulation Example (2/3)
shift1 = linspace(-12,12, 21); % Generate a weight space matrix
shift2 = linspace(-9,9, 21);
for i = 1:21 % Compute a weight matrix about
shiftwts(1,i) = weiner(1)+shift1(i); % Weiner solution
shiftwts(2,i) = weiner(2)+shift2(i);
end
94/120
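The slide with the iteration loop itself is not included here. A minimal sketch of steepest descent using Eqn. (46) and the quantities computed by the preceding code follows; the iteration count is an assumption, and the initial weights are taken from the trajectory figure on the next slide.

W = [-3.9; 6.27];                  % initial weight vector W0, as in the trajectory figure
for k = 1:200
    W = W + 2*eta*R*(weiner - W);  % steepest descent update, Eqn. (46)
end
disp(W')                           % converges towards the Wiener solution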
Smooth Trajectory towards the Wiener Solution
Steepest descent uses exact gradient information to search the Wiener solution in weight space.
(Figure: error contours in (w0, w1) weight space; a smooth weight trajectory runs from W0 = (−3.9, 6.27)T to the Wiener solution (1.53, −1.38).)
95/120
µ -LMS: Approximate Gradient
Descent(1/2)
The problem with steepest descent is that
true gradient information is only available
in situations where the data set is
completely specified in advance.
98/120
µ-LMS employs εk for ε = E[εk](1/2)
The gradient computation modifies to:
∇̃εk = (∂εk/∂w0k, …, ∂εk/∂wnk)T = ek (∂ek/∂w0k, …, ∂ek/∂wnk)T = −ek Xk   (47)
Wk+1 = Wk + η(−∇̃εk) = Wk + η(dk − sk)Xk   (48)
     = Wk + ηek Xk   (49)
99/120
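A minimal MATLAB sketch of the µ–LMS update of Eqn. (49); the training data, learning rate and iteration count are assumptions.

% mu-LMS: update with the instantaneous gradient estimate -e_k X_k.
% X: augmented patterns (columns), d: desired outputs; assumed to exist in the workspace.
W = zeros(size(X,1),1); eta = 0.01;   % initial weights and learning rate (assumed)
for t = 1:5000
    k = randi(size(X,2));             % draw a sample (stochastic setting)
    e = d(k) - X(:,k)' * W;           % linear error e_k = d_k - s_k
    W = W + eta * e * X(:,k);         % W_{k+1} = W_k + eta*e_k*X_k, Eqn. (49)
end

Unlike α–LMS, the correction here is not normalized by ||Xk||².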
µ-LMS employs εk for ε = E[εk](2/2)
What value does the long-term average of ∇̃εk converge to? Taking the expectation of both sides of Eqn. (47):
E[∇̃εk] = −E[ek Xk] = −E[dk Xk − Xk XkT W]   (50)
       = RW − P   (51)
       = ∇ε   (52)
100/120
Observations(1/8)
Since the long-term average of ∇̃εk approaches ∇ε, we can safely use ∇̃εk as an unbiased estimate of ∇ε.
That's what makes µ-LMS work!
101/120
Observations(2/8)
If the data set is finite (deterministic), then one can compute ∇ε accurately by first collecting the different gradients ∇̃εk over all training patterns and then averaging them.
102/120
Observations(3/8)
Even if the data set is deterministic, we still use ∇̃εk to update the weights. After all, if the data set becomes large, collection of all the gradients becomes expensive in terms of storage. It is much easier to just go ahead and use ∇̃εk.
Be clear about the approximation made: we are estimating the true gradient (which should be computed from E[εk]) by a gradient computed from the instantaneous sample error εk. Although this may seem to be a rather drastic approximation, it works.
103/120
Observations(4/8)
In the deterministic case we can justify this
as follows: if the learning rate η , is kept
small, the weight change in each iteration
will be small and consequently the weight
vector W will remain “somewhat constant”
over Q iterations where Q is the number of
patterns in the training set.
104/120
Observations(5/8)
Of course this is provided that Q is a small number! To see this, observe the total weight change ΔW over Q iterations from the kth iteration:
ΔW = −∑_{i=0}^{Q−1} ∂εk+i/∂Wk+i   (53)
   = −Q (1/Q) ∑_{i=0}^{Q−1} ∂εk+i/∂Wk   (54)
   = −Q ∂/∂Wk [ (1/Q) ∑_{i=0}^{Q−1} εk+i ]   (55)
   = −Q ∂ε/∂Wk   (56)
105/120
Observations(6/8)
Where ε denotes the mean-square error. Thus
the weight updates follow the true gradient on
average.
106/120
Observations(7/8)
Observe that steepest descent search is guaranteed to find the Wiener solution provided the learning rate condition (41) is satisfied.
107/120
Observations(8/8)
Although α-LMS and µ-LMS are similar algorithms, α-LMS works on the normalized training set. What this simply means is that α-LMS also uses gradient information, and will eventually search out the Wiener solution of the normalized training set. However, in one case the two algorithms are identical: the case when input vectors are bipolar. (Why?)
108/120
µ -LMS Algorithm:
Convergence in the Mean (1)(1/2)
Definition 0.1 The µ-LMS algorithm is convergent in the mean if the average of the weight vector Wk approaches the optimal solution Ŵ as the number of iterations k approaches infinity:
E[Wk] → Ŵ as k → ∞   (57)
110/120
µ -LMS Algorithm:
Convergence in the Mean (2)(1/2)
Consider the µ -LMS weight update equation:
Wk +1 = Wk + η ( d k − sk ) X k (59)
= Wk + η ( d k − X kT Wk ) X k (60)
= Wk + ηd k X k − ηX kT Wk X k (61)
= Wk - ηX k X kT Wk + ηd k X k (62)
= ( I - ηX k X kT ) Wk + ηd k X k (63)
111/120
µ-LMS Algorithm: Convergence in the Mean (2)(2/2)
Taking the expectation of both sides of Eqn. (63) yields:
E[Wk+1] = (I − ηE[Xk XkT]) E[Wk] + ηE[dk Xk]   (64)
        = (I − ηR) E[Wk] + ηP   (65)
112/120
µ-LMS Algorithm: Convergence in the Mean (3)(1/2)
Appropriate substitution yields:
E[Wk+1] = (I − ηQDQT) E[Wk] + ηQDQT Ŵ
        = (QQT − ηQDQT) E[Wk] + ηQDQT Ŵ   (66)
        = Q(I − ηD)QT E[Wk] + ηQDQT Ŵ
113/120
µ-LMS Algorithm: Convergence in the Mean (3)(2/2)
And subtraction of QTŴ from both sides gives:
QT E[Wk+1] − QTŴ = (I − ηD) QT E[Wk] − (I − ηD) QTŴ   (68)
                 = (I − ηD) QT (E[Wk] − Ŵ)   (69)
114/120
µ-LMS Algorithm: Convergence in the Mean (4)(1/2)
And
Ṽ'k+1 = (I − ηD) Ṽ'k   (71)
where Ṽk = E[Wk] − Ŵ and Ṽ'k = QT Ṽk. Since D is diagonal, Eqn. (71) decouples into the scalar recursions
ṽ'ik+1 = (1 − ηλi) ṽ'ik,  i = 0, 1, …, n   (72)
115/120
µ-LMS Algorithm: Convergence in the Mean (4)(2/2)
Recursive application of Eqn. (72) yields:
ṽ'ik = (1 − ηλi)^k ṽ'i0,  i = 0, 1, …, n   (73)
To ensure convergence in the mean, ṽ'ik → 0 as k → ∞, since this condition requires that the deviation of E[Wk] from Ŵ should tend to 0.
116/120
µ-LMS Algorithm: Convergence in the Mean (5)(1/2)
Therefore, from Eqn. (73):
|1 − ηλi| < 1,  i = 0, 1, …, n   (74)
If this condition is satisfied for the largest eigenvalue λmax then it will be satisfied for all other eigenvalues. We therefore conclude that if
0 < η < 2/λmax   (75)
then the µ-LMS algorithm is convergent in the mean.
117/120
µ-LMS Algorithm: Convergence in the Mean (5)(2/2)
Further, since tr[R] = ∑_{i=0}^{n} λi ≥ λmax (where tr[R] is the sum of the diagonal elements of R), convergence in the mean is assured if 0 < η < 2/tr[R].
118/120
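As an illustration of these bounds, the learning-rate limits can be computed from the R of Eqn. (44); this is a sketch, and the quoted numerical values are only approximate.

R = [1 0.500; 0.500 1.61];   % correlation matrix from Eqn. (44)
lambda = eig(R);             % eigenvalues of R (approximately 0.72 and 1.89)
eta_max = 2/max(lambda)      % bound of Eqn. (75): approximately 1.06
eta_safe = 2/trace(R)        % more conservative bound 2/tr[R]: approximately 0.77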
Random Walk towards the Wiener Solution
Assume the familiar fifth-order function data scatter.
µ-LMS uses a local estimate of the gradient to search the Wiener solution in weight space.
(Figure: error contours in (w0, w1) weight space with a jagged weight trajectory; the Wiener solution (0.339, −1.881) and the µ-LMS solution (0.5373, −2.3311) are marked.)
119/120
Adaptive Noise Cancellation
120/120