
Covariance and Correlation Matrix

Given sample $\{x_n\}_{n=1}^N$, where $x_n \in \mathbb{R}^d$, $x_n = (x_{1n}, x_{2n}, \ldots, x_{dn})^T$

sample mean $\bar{x} = \frac{1}{N} \sum_{n=1}^N x_n$, and entries of the sample mean are $\bar{x}_i = \frac{1}{N} \sum_{n=1}^N x_{in}$

sample covariance matrix is a $d \times d$ matrix $Z$ with entries $Z_{ij} = \frac{1}{N-1} \sum_{n=1}^N (x_{in} - \bar{x}_i)(x_{jn} - \bar{x}_j)$

sample correlation matrix is a $d \times d$ matrix $C$ with entries $C_{ij} = \frac{\frac{1}{N-1} \sum_{n=1}^N (x_{in} - \bar{x}_i)(x_{jn} - \bar{x}_j)}{\sigma_{x_i} \sigma_{x_j}}$, where $\sigma_{x_i}$ and $\sigma_{x_j}$ are the sample standard deviations


p. 168

Covariance and Correlation Matrix Example


Given sample:

$x_1 = \begin{pmatrix} 1.2 \\ 0.9 \end{pmatrix}, \; x_2 = \begin{pmatrix} 2.5 \\ 3.9 \end{pmatrix}, \; x_3 = \begin{pmatrix} 0.7 \\ 0.4 \end{pmatrix}, \; x_4 = \begin{pmatrix} 4.2 \\ 5.8 \end{pmatrix}$

$\bar{x} = \begin{pmatrix} 2.15 \\ 2.75 \end{pmatrix}$

$Z = \begin{pmatrix} 2.443333 & 3.940000 \\ 3.940000 & 6.523333 \end{pmatrix}$

$C = \begin{pmatrix} \frac{2.443333}{1.563117 \cdot 1.563117} & \frac{3.940000}{1.563117 \cdot 2.554082} \\ \frac{3.940000}{2.554082 \cdot 1.563117} & \frac{6.523333}{2.554082 \cdot 2.554082} \end{pmatrix} = \begin{pmatrix} 1.000000 & 0.986893 \\ 0.986893 & 1.000000 \end{pmatrix}$

Observe, if the sample is z-normalized, $x_{ij}^{\text{new}} = \frac{x_{ij} - \bar{x}_i}{\sigma_{x_i}}$ (mean 0, standard deviation 1), then $C$ equals $Z$. See cov(), cor(), scale() in R.
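The example can be checked in R with a minimal sketch (the matrix X below simply holds the four sample vectors as rows):

X <- rbind(c(1.2, 0.9), c(2.5, 3.9), c(0.7, 0.4), c(4.2, 5.8))
colMeans(X)    # sample mean: 2.15 2.75
cov(X)         # sample covariance matrix Z
cor(X)         # sample correlation matrix C
cov(scale(X))  # covariance of the z-normalized data, equals cor(X)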

p. 169

Principal Component Analysis with NN


Principal Component Analysis (PCA) is a technique for

- dimensionality reduction
- lossy data compression
- feature extraction
- data visualization

Idea: orthogonal projection of the data onto a lower-dimensional linear space, such that the variance of the projected data is maximized.

[Figure: two-dimensional data points projected onto the principal direction u_1]

p. 170

Maximize Variance of Projected Data


Given data $\{x_n\}_{n=1}^N$, where $x_n$ has dimensionality $d$.
Goal: project the data onto a space having dimensionality $m < d$ while maximizing the variance of the projected data.
Let us consider the projection onto a one-dimensional space ($m = 1$). Define the direction of this space using a $d$-dimensional vector $u_1$. The mean of the projected data is $u_1^T \bar{x}$, where $\bar{x}$ is the sample mean

$\bar{x} = \frac{1}{N} \sum_{n=1}^N x_n$

p. 171

Maximize Variance of Projected Data (cont.)


The variance of the projected data is given by

$\frac{1}{N} \sum_{n=1}^N \left( u_1^T x_n - u_1^T \bar{x} \right)^2 = u_1^T S u_1$

where $S$ is the data covariance matrix defined by

$S = \frac{1}{N} \sum_{n=1}^N (x_n - \bar{x})(x_n - \bar{x})^T$

Goal: maximize the projected variance $u_1^T S u_1$ with respect to $u_1$. To prevent $u_1$ from growing to infinity, use the constraint $u_1^T u_1 = 1$, which gives the optimization problem:

maximize $\; u_1^T S u_1$
subject to $\; u_1^T u_1 = 1$
p. 172

Maximize Variance of Projected Data (cont.)


Lagrangian form (one Lagrange multiplier $\lambda_1$):

$L(u_1, \lambda_1) = u_1^T S u_1 - \lambda_1 (u_1^T u_1 - 1)$

Setting the derivative with respect to $u_1$ to zero,

$\frac{\partial L(u_1, \lambda_1)}{\partial u_1} = 0$

gives

$S u_1 = \lambda_1 u_1$

The last equation says that $u_1$ must be an eigenvector of $S$. Finally, by left-multiplying by $u_1^T$ and making use of $u_1^T u_1 = 1$, one can see that the variance is given by

$u_1^T S u_1 = \lambda_1 .$

Observe that the variance is maximized when $u_1$ is the eigenvector having the largest eigenvalue $\lambda_1$.
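A small R sketch illustrating this result, with the $(1/N)$ sample covariance matrix standing in for $S$ and illustrative toy data:

set.seed(1)
N <- 200
X <- matrix(rnorm(2 * N), N, 2) %*% matrix(c(2, 1.5, 0, 1), 2, 2)   # correlated 2-d toy data
xbar <- colMeans(X)
Xc <- sweep(X, 2, xbar)                       # centered data
S <- t(Xc) %*% Xc / N                         # covariance matrix S (1/N version, as on the slide)
e <- eigen(S)                                 # eigenvalues returned in decreasing order
u1 <- e$vectors[, 1]                          # eigenvector with the largest eigenvalue
S %*% u1 - e$values[1] * u1                   # S u1 = lambda1 u1 (differences are ~0)
mean((Xc %*% u1)^2)                           # projected variance u1' S u1 ...
e$values[1]                                   # ... equals the largest eigenvalue lambda1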
p. 173

Second Principal Component


The second eigenvector $u_2$ should also be of unit length and orthogonal to $u_1$ (so that after projection $u_2^T x$ is uncorrelated with $u_1^T x$).

maximize $\; u_2^T S u_2$
subject to $\; u_2^T u_2 = 1, \; u_2^T u_1 = 0$

Lagrangian form (two Lagrange multipliers $\lambda_1$, $\lambda_2$):

$L(u_2, \lambda_1, \lambda_2) = u_2^T S u_2 - \lambda_2 (u_2^T u_2 - 1) - \lambda_1 (u_2^T u_1 - 0)$

This gives the solution

$u_2^T S u_2 = \lambda_2$

which implies that $u_2$ should be the eigenvector of $S$ with the second largest eigenvalue $\lambda_2$. The remaining dimensions are given by the eigenvectors with decreasing eigenvalues.

p. 174

PCA Example
[Figure: three panels of a 2-d example: the data (data.xy) with the first and second eigenvectors; the projection onto the first eigenvector (cbind(data.x.eig.1, rep(0, N))); and the projection onto both orthogonal eigenvectors (data.x.eig)]
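The three panels can be reproduced along the following lines; the object names data.xy, data.x.eig, and data.x.eig.1 are taken from the axis labels, while the toy data itself is illustrative:

set.seed(2)
N <- 100
data.xy <- cbind(rnorm(N, sd = 2), rnorm(N))
data.xy[, 2] <- data.xy[, 2] + 0.8 * data.xy[, 1]     # correlated toy data
U <- eigen(cov(data.xy))$vectors                      # columns are the eigenvectors u1, u2
data.x.eig <- scale(data.xy, scale = FALSE) %*% U     # centered data projected on both eigenvectors
data.x.eig.1 <- data.x.eig[, 1]                       # projection on the first eigenvector only
plot(data.xy, asp = 1)                                # data with eigenvector directions drawn from the mean
arrows(mean(data.xy[, 1]), mean(data.xy[, 2]),
       mean(data.xy[, 1]) + 2 * U[1, ], mean(data.xy[, 2]) + 2 * U[2, ])
plot(cbind(data.x.eig.1, rep(0, N)), asp = 1)         # projection on the first eigenvector
plot(data.x.eig, asp = 1)                             # projection on both orthogonal eigenvectors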

p. 175

Proportion of Variance

In image and speech processing problems the inputs are usually highly correlated.

If the dimensions are highly correlated, then there will be only a small number of eigenvectors with large eigenvalues ($m \ll d$). As a result, a large reduction in dimensionality can be attained.
The proportion of variance explained by the first $m$ eigenvectors is

$\frac{\lambda_1 + \lambda_2 + \ldots + \lambda_m}{\lambda_1 + \lambda_2 + \ldots + \lambda_m + \ldots + \lambda_d}$

[Figure: proportion of variance explained, digit class 1 (USPS database); proportion of variance (about 0.5 to 1.0) against the number of eigenvectors (up to 250)]

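In R, this curve can be computed directly from the eigenvalues of the covariance matrix; a minimal sketch (the random matrix X merely stands in for the real data):

X <- matrix(rnorm(100 * 20), 100, 20)        # stand-in for the n x d data matrix
lambda <- eigen(cov(X))$values               # eigenvalues, largest first
prop.var <- cumsum(lambda) / sum(lambda)     # proportion of variance explained by the first m
plot(prop.var, type = "l", xlab = "Eigenvectors", ylab = "Proportion of variance")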
p. 176

PCA Second Example


[Figure: a 256 x 256 image divided into pieces of size 8 x 8]

Segment a 256 x 256 image into $32 \cdot 32 = 1024$ image pieces of size $8 \times 8 = 64$ pixels: $x_1, x_2, \ldots, x_{1024} \in \mathbb{R}^{64}$

Determine the mean: $\bar{x} = \frac{1}{1024} \sum_{i=1}^{1024} x_i$

Determine the covariance matrix $S$ and the $m$ eigenvectors $u_1, u_2, \ldots, u_m$ having the largest corresponding eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_m$

Create the eigenvector matrix $U$, where $u_1, u_2, \ldots, u_m$ are column vectors

Project the image pieces $x_i$ into the subspace as follows: $z_i^T = U^T (x_i^T - \bar{x}^T)$

p. 177

PCA Second Example (cont.)

Reconstruct the image pieces by back-projecting them to the original space as $x_i^T = U z_i^T + \bar{x}^T$. Note, the mean is added (it was subtracted in the previous step) because the data is not normalized.
[Figure: proportion of variance explained in the image (about 0.6 to 1.0, over up to 64 eigenvectors), the original image, and reconstructions with 16, 32, 48, and 64 eigenvectors]
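The patch pipeline of the last two slides can be sketched in R as follows; the object names and the random stand-in for the image pieces are illustrative:

patches <- matrix(runif(1024 * 64), 1024, 64)        # stand-in for the real 8x8 image pieces (one per row)
xbar <- colMeans(patches)                            # mean patch
Xc <- sweep(patches, 2, xbar)                        # centered patches
m <- 16                                              # number of eigenvectors kept
U <- eigen(cov(patches))$vectors[, 1:m]              # 64 x m matrix with u1, ..., um as columns
Z <- Xc %*% U                                        # projection: z_i = U'(x_i - xbar), one z per row
rec <- sweep(Z %*% t(U), 2, xbar, "+")               # reconstruction: x_i is approximately U z_i + xbar
mean((rec - patches)^2)                              # mean squared reconstruction error; shrinks as m grows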

p. 178

PCA with a Neural Network


[Figure: a single linear unit with inputs x_1, x_2, ..., x_d, weights w_1, w_2, ..., w_d, and output V]

$V = w^T x = \sum_{j=1}^{d} w_j x_j$

Apply the Hebbian learning rule

$\Delta w_i = \eta V x_i ,$

such that after some update steps the weight vector $w$ should point in the direction of maximum variance.

p. 179

PCA with a Neural Network (cont.)


Suppose that there is a stable equilibrium point for $w$, such that the average weight change is zero:

$0 = \langle \Delta w_i \rangle = \eta \, \langle V x_i \rangle = \eta \, \Big\langle \sum_j w_j x_j x_i \Big\rangle = \eta \sum_j C_{ij} w_j = \eta \, (Cw)_i$

The angle brackets indicate an average over the input distribution $P(x)$, and $C$ denotes the correlation matrix with

$C_{ij} \equiv \langle x_i x_j \rangle , \quad \text{or} \quad C \equiv \langle x x^T \rangle$

Note, $C$ is symmetric ($C_{ij} = C_{ji}$) and positive semi-definite, which implies that its eigenvalues are positive or zero and its eigenvectors can be taken as orthogonal.

p. 180

PCA with a Neural Network (cont.)

- At our hypothetical equilibrium point, $w$ is an eigenvector of $C$ with eigenvalue 0
- Such an equilibrium is never stable, because $C$ has some positive eigenvalues, and any component of $w$ along a corresponding eigenvector would grow exponentially
- One can constrain the growth of $w$, e.g. by renormalization ($\|w\| = 1$) after each update step
- A more elegant idea: add a weight decay proportional to $V^2$ to the Hebbian learning rule (Oja's rule)

$\Delta w_i = \eta V (x_i - V w_i)$

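A sketch of Oja's rule in R on zero-mean two-dimensional data; the toy data, the learning rate eta, and the number of passes are illustrative choices:

set.seed(3)
N <- 500
X <- scale(cbind(rnorm(N, sd = 3), rnorm(N)), scale = FALSE)   # zero-mean data
w <- rnorm(2); w <- w / sqrt(sum(w^2))        # random initial weight vector of unit length
eta <- 0.01
for (epoch in 1:50) {
  for (n in sample(N)) {
    x <- X[n, ]
    V <- sum(w * x)                           # output V = w' x
    w <- w + eta * V * (x - V * w)            # Oja's rule: dw_i = eta * V * (x_i - V * w_i)
  }
}
w                                             # converges (up to sign) to the leading eigenvector
eigen(cov(X))$vectors[, 1]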
p. 181

PCA with a Neural Network Example


[Figure: the weight vector learned by Oja's rule (blue vector) and the largest eigenvector (red vector), plotted over the 2-d data data.xy]

p. 182

Some insights into Oja's Rule


Oja's rule converges to a weight vector $w$ with the following properties:

- unit length: $\|w\| = 1$
- eigenvector direction: $w$ lies in a maximal eigenvector direction of $C$
- variance maximization: $w$ lies in a direction that maximizes $\langle V^2 \rangle$

Oja's learning rule is still limited, because we can construct only the first principal component of the z-normalized data.

p. 183

Construct the first m principal components

Single-layer network with the $i$-th output $V_i$ given by $V_i = \sum_j w_{ij} x_j = w_i^T x$, where $w_i$ is the weight vector for the $i$-th output

Oja's m-unit learning rule:

$\Delta w_{ij} = \eta V_i \Big( x_j - \sum_{k=1}^{m} V_k w_{kj} \Big)$

Sanger's learning rule:

$\Delta w_{ij} = \eta V_i \Big( x_j - \sum_{k=1}^{i} V_k w_{kj} \Big)$

Both rules reduce to Oja's 1-unit rule for the $m = 1$ and $i = 1$ case
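A sketch of Sanger's rule in R for the first m components (Oja's m-unit rule differs only in that the inner sum always runs over all k = 1, ..., m); the toy data and learning rate are illustrative:

set.seed(4)
N <- 500; d <- 3; m <- 2
A <- matrix(c(3, 1, 0, 0, 2, 1, 0, 0, 1), d, d)               # mixing matrix for correlated toy data
X <- scale(matrix(rnorm(N * d), N, d) %*% A, scale = FALSE)   # zero-mean data
W <- 0.1 * matrix(rnorm(m * d), m, d)                         # row W[i, ] is the weight vector w_i
eta <- 0.005
for (epoch in 1:100) {
  for (n in sample(N)) {
    x <- X[n, ]
    V <- as.vector(W %*% x)                                   # outputs V_i = w_i' x
    for (i in 1:m) {
      back <- as.vector(t(W[1:i, , drop = FALSE]) %*% V[1:i]) # sum over k <= i of V_k w_kj
      W[i, ] <- W[i, ] + eta * V[i] * (x - back)              # Sanger's rule update for w_i
    }
  }
}
W                                                             # rows approach the first m eigenvectors (up to sign)
t(eigen(cov(X))$vectors[, 1:m])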
p. 184

Oja's and Sanger's Rule

- In both cases the $w_i$ vectors converge to orthogonal unit vectors
- With Sanger's rule the weight vectors become exactly the first $m$ principal components, in order: $w_i = \pm c_i$, where $c_i$ is the normalized eigenvector of the correlation matrix $C$ belonging to the $i$-th largest eigenvalue $\lambda_i$
- Oja's m-unit rule converges to $m$ weight vectors that span the same subspace as the first $m$ eigenvectors, but in general it does not find the eigenvector directions themselves

p. 185

Linear Auto-Associative Network


[Figure: auto-associative network; the original features x_1, ..., x_d are mapped (extraction) to m bottleneck units z_1, ..., z_m, the extracted features, and mapped back (reconstruction) to the reconstructed features x_1, ..., x_d]

- The network is trained to perform the identity mapping
- Idea: the bottleneck units represent significant features of the input data
- Train the network by minimizing the sum-of-squares error

$\frac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{d} \left( y_k(x^{(n)}) - x_k^{(n)} \right)^2$
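A sketch, under illustrative choices of data, bottleneck size, and step length, of training such a linear auto-associative network by gradient descent on the sum-of-squares error:

set.seed(5)
N <- 300; d <- 5; m <- 2
X <- matrix(rnorm(N * d), N, d) %*% diag(c(3, 2, 1, 0.5, 0.2))
X <- scale(X, scale = FALSE)                 # centered inputs, also used as targets
W1 <- matrix(rnorm(d * m, sd = 0.1), d, m)   # extraction weights (input -> bottleneck)
W2 <- matrix(rnorm(m * d, sd = 0.1), m, d)   # reconstruction weights (bottleneck -> output)
eta <- 0.01
for (it in 1:5000) {
  Z <- X %*% W1                              # bottleneck activations (linear), N x m
  E <- Z %*% W2 - X                          # reconstruction error, N x d
  g2 <- t(Z) %*% E / N                       # gradient of the mean squared error w.r.t. W2
  g1 <- t(X) %*% (E %*% t(W2)) / N           # gradient of the mean squared error w.r.t. W1
  W1 <- W1 - eta * g1
  W2 <- W2 - eta * g2
}
0.5 * sum((X %*% W1 %*% W2 - X)^2)           # should approach the PCA reconstruction error for m = 2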
p. 186

Linear Auto-Associative Network (cont.)

- As with the Oja/Sanger update rules, this type of learning can be considered unsupervised learning, since no independent target data is provided
- The error function has a unique global minimum when the hidden units have linear activation functions
- At this minimum the network performs a projection onto the m-dimensional subspace which is spanned by the first m principal components of the data
- Note, however, that the weight vectors need not be orthogonal or normalized

p. 187

Non-Linear Auto-Associative Network


[Figure: non-linear auto-associative network; the original features x_1, ..., x_d feed a non-linear hidden layer, followed by a linear bottleneck layer with the m extracted features z_1, ..., z_m, another non-linear hidden layer, and a linear output layer producing the reconstructed features x_1, ..., x_d]

p. 188
