
Covariance and Correlation Matrix

Given sample $\{x_n\}_{n=1}^N$, where $x_n \in \mathbb{R}^d$, $x_n = (x_{1n}, x_{2n}, \ldots, x_{dn})^T$

sample mean $\bar{x} = \frac{1}{N} \sum_{n=1}^N x_n$, and entries of the sample mean are $\bar{x}_i = \frac{1}{N} \sum_{n=1}^N x_{in}$

sample covariance matrix is a $d \times d$ matrix $Z$ with entries $Z_{ij} = \frac{1}{N-1} \sum_{n=1}^N (x_{in} - \bar{x}_i)(x_{jn} - \bar{x}_j)$

sample correlation matrix is a $d \times d$ matrix $C$ with entries $C_{ij} = \frac{\frac{1}{N-1} \sum_{n=1}^N (x_{in} - \bar{x}_i)(x_{jn} - \bar{x}_j)}{\sigma_{x_i} \sigma_{x_j}}$, where $\sigma_{x_i}$ and $\sigma_{x_j}$ are the sample standard deviations


p. 168

Covariance and Correlation Matrix Example


Given sample:

$x_1 = \begin{pmatrix} 1.2 \\ 0.9 \end{pmatrix}, \; x_2 = \begin{pmatrix} 2.5 \\ 3.9 \end{pmatrix}, \; x_3 = \begin{pmatrix} 0.7 \\ 0.4 \end{pmatrix}, \; x_4 = \begin{pmatrix} 4.2 \\ 5.8 \end{pmatrix}$

$\bar{x} = \begin{pmatrix} 2.15 \\ 2.75 \end{pmatrix}$

$Z = \begin{pmatrix} 2.443333 & 3.940000 \\ 3.940000 & 6.523333 \end{pmatrix}$

$C = \begin{pmatrix} \frac{2.443333}{1.563117 \cdot 1.563117} & \frac{3.940000}{1.563117 \cdot 2.554082} \\ \frac{3.940000}{2.554082 \cdot 1.563117} & \frac{6.523333}{2.554082 \cdot 2.554082} \end{pmatrix} = \begin{pmatrix} 1.000000 & 0.986893 \\ 0.986893 & 1.000000 \end{pmatrix}$

Observe, if the sample is z-normalized, $x_{ij}^{\text{new}} = \frac{x_{ij} - \bar{x}_i}{\sigma_{x_i}}$ (mean 0, standard deviation 1), then $C$ equals $Z$. See cov(), cor(), scale() in R.
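The example can be checked in R with a minimal sketch (the matrix X below simply holds the four sample vectors as rows):

X <- rbind(c(1.2, 0.9), c(2.5, 3.9), c(0.7, 0.4), c(4.2, 5.8))
colMeans(X)    # sample mean: 2.15 2.75
cov(X)         # sample covariance matrix Z
cor(X)         # sample correlation matrix C
cov(scale(X))  # covariance of the z-normalized data, equals cor(X)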

p. 169

Principal Component Analysis with NN


Principal Component Analysis (PCA) is a technique for

- dimensionality reduction
- lossy data compression
- feature extraction
- data visualization

Idea: orthogonal projection of the data onto a lower-dimensional linear space, such that the variance of the projected data is maximized.

[Figure: two-dimensional data points projected onto the principal direction u_1]

p. 170

Maximize Variance of Projected Data


Given data $\{x_n\}_{n=1}^N$, where $x_n$ has dimensionality $d$.
Goal: project the data onto a space having dimensionality $m < d$ while maximizing the variance of the projected data.
Let us consider the projection onto a one-dimensional space ($m = 1$). Define the direction of this space using a $d$-dimensional vector $u_1$. The mean of the projected data is $u_1^T \bar{x}$, where $\bar{x}$ is the sample mean

$\bar{x} = \frac{1}{N} \sum_{n=1}^N x_n$

p. 171

Maximize Variance of Projected Data (cont.)


The variance of the projected data is given by

$\frac{1}{N} \sum_{n=1}^N \left( u_1^T x_n - u_1^T \bar{x} \right)^2 = u_1^T S u_1$

where $S$ is the data covariance matrix defined by

$S = \frac{1}{N} \sum_{n=1}^N (x_n - \bar{x})(x_n - \bar{x})^T$

Goal: maximize the projected variance $u_1^T S u_1$ with respect to $u_1$. To prevent $u_1$ from growing to infinity, use the constraint $u_1^T u_1 = 1$, which gives the optimization problem:

maximize $\; u_1^T S u_1$
subject to $\; u_1^T u_1 = 1$
p. 172

Maximize Variance of Projected Data (cont.)


Lagrangian form (one Lagrange multiplier $\lambda_1$):

$L(u_1, \lambda_1) = u_1^T S u_1 - \lambda_1 (u_1^T u_1 - 1)$

Setting the derivative with respect to $u_1$ to zero,

$\frac{\partial L(u_1, \lambda_1)}{\partial u_1} = 0$

gives

$S u_1 = \lambda_1 u_1$

The last equation says that $u_1$ must be an eigenvector of $S$. Finally, by left-multiplying by $u_1^T$ and making use of $u_1^T u_1 = 1$, one can see that the variance is given by

$u_1^T S u_1 = \lambda_1 .$

Observe that the variance is maximized when $u_1$ is the eigenvector having the largest eigenvalue $\lambda_1$.
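A small R sketch illustrating this result, with the $(1/N)$ sample covariance matrix standing in for $S$ and illustrative toy data:

set.seed(1)
N <- 200
X <- matrix(rnorm(2 * N), N, 2) %*% matrix(c(2, 1.5, 0, 1), 2, 2)   # correlated 2-d toy data
xbar <- colMeans(X)
Xc <- sweep(X, 2, xbar)                       # centered data
S <- t(Xc) %*% Xc / N                         # covariance matrix S (1/N version, as on the slide)
e <- eigen(S)                                 # eigenvalues returned in decreasing order
u1 <- e$vectors[, 1]                          # eigenvector with the largest eigenvalue
S %*% u1 - e$values[1] * u1                   # S u1 = lambda1 u1 (differences are ~0)
mean((Xc %*% u1)^2)                           # projected variance u1' S u1 ...
e$values[1]                                   # ... equals the largest eigenvalue lambda1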
p. 173

Second Principal Component


The second eigenvector $u_2$ should also be of unit length and orthogonal to $u_1$ (so that after projection $u_2^T x$ is uncorrelated with $u_1^T x$).

maximize $\; u_2^T S u_2$
subject to $\; u_2^T u_2 = 1, \; u_2^T u_1 = 0$

Lagrangian form (two Lagrange multipliers $\lambda_1$, $\lambda_2$):

$L(u_2, \lambda_1, \lambda_2) = u_2^T S u_2 - \lambda_2 (u_2^T u_2 - 1) - \lambda_1 (u_2^T u_1 - 0)$

This gives the solution

$u_2^T S u_2 = \lambda_2$

which implies that $u_2$ should be the eigenvector of $S$ with the second largest eigenvalue $\lambda_2$. The remaining dimensions are given by the eigenvectors with decreasing eigenvalues.

p. 174

PCA Example
[Figure: three panels of a 2-d example: the data (data.xy) with the first and second eigenvectors; the projection onto the first eigenvector (cbind(data.x.eig.1, rep(0, N))); and the projection onto both orthogonal eigenvectors (data.x.eig)]
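The three panels can be reproduced along the following lines; the object names data.xy, data.x.eig, and data.x.eig.1 are taken from the axis labels, while the toy data itself is illustrative:

set.seed(2)
N <- 100
data.xy <- cbind(rnorm(N, sd = 2), rnorm(N))
data.xy[, 2] <- data.xy[, 2] + 0.8 * data.xy[, 1]     # correlated toy data
U <- eigen(cov(data.xy))$vectors                      # columns are the eigenvectors u1, u2
data.x.eig <- scale(data.xy, scale = FALSE) %*% U     # centered data projected on both eigenvectors
data.x.eig.1 <- data.x.eig[, 1]                       # projection on the first eigenvector only
plot(data.xy, asp = 1)                                # data with eigenvector directions drawn from the mean
arrows(mean(data.xy[, 1]), mean(data.xy[, 2]),
       mean(data.xy[, 1]) + 2 * U[1, ], mean(data.xy[, 2]) + 2 * U[2, ])
plot(cbind(data.x.eig.1, rep(0, N)), asp = 1)         # projection on the first eigenvector
plot(data.x.eig, asp = 1)                             # projection on both orthogonal eigenvectors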

p. 175

Proportion of Variance

In image and speech processing problems the inputs are usually highly correlated.

If the dimensions are highly correlated, then there will be only a small number of eigenvectors with large eigenvalues ($m \ll d$). As a result, a large reduction in dimensionality can be attained.
The proportion of variance explained by the first $m$ eigenvectors is

$\frac{\lambda_1 + \lambda_2 + \ldots + \lambda_m}{\lambda_1 + \lambda_2 + \ldots + \lambda_m + \ldots + \lambda_d}$

[Figure: proportion of variance explained, digit class 1 (USPS database); proportion of variance (about 0.5 to 1.0) against the number of eigenvectors (up to 250)]

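In R, this curve can be computed directly from the eigenvalues of the covariance matrix; a minimal sketch (the random matrix X merely stands in for the real data):

X <- matrix(rnorm(100 * 20), 100, 20)        # stand-in for the n x d data matrix
lambda <- eigen(cov(X))$values               # eigenvalues, largest first
prop.var <- cumsum(lambda) / sum(lambda)     # proportion of variance explained by the first m
plot(prop.var, type = "l", xlab = "Eigenvectors", ylab = "Proportion of variance")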
p. 176

PCA Second Example


[Figure: a 256 x 256 image divided into pieces of size 8 x 8]

Segment a 256 x 256 image into $32 \cdot 32 = 1024$ image pieces of size $8 \times 8 = 64$ pixels: $x_1, x_2, \ldots, x_{1024} \in \mathbb{R}^{64}$

Determine the mean: $\bar{x} = \frac{1}{1024} \sum_{i=1}^{1024} x_i$

Determine the covariance matrix $S$ and the $m$ eigenvectors $u_1, u_2, \ldots, u_m$ having the largest corresponding eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_m$

Create the eigenvector matrix $U$, where $u_1, u_2, \ldots, u_m$ are column vectors

Project the image pieces $x_i$ into the subspace as follows: $z_i^T = U^T (x_i^T - \bar{x}^T)$

p. 177

PCA Second Example (cont.)

Reconstruct the image pieces by back-projecting them to the original space as $x_i^T = U z_i^T + \bar{x}^T$. Note, the mean is added (it was subtracted in the previous step) because the data is not normalized.
[Figure: proportion of variance explained in the image (about 0.6 to 1.0, over up to 64 eigenvectors), the original image, and reconstructions with 16, 32, 48, and 64 eigenvectors]
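The patch pipeline of the last two slides can be sketched in R as follows; the object names and the random stand-in for the image pieces are illustrative:

patches <- matrix(runif(1024 * 64), 1024, 64)        # stand-in for the real 8x8 image pieces (one per row)
xbar <- colMeans(patches)                            # mean patch
Xc <- sweep(patches, 2, xbar)                        # centered patches
m <- 16                                              # number of eigenvectors kept
U <- eigen(cov(patches))$vectors[, 1:m]              # 64 x m matrix with u1, ..., um as columns
Z <- Xc %*% U                                        # projection: z_i = U'(x_i - xbar), one z per row
rec <- sweep(Z %*% t(U), 2, xbar, "+")               # reconstruction: x_i is approximately U z_i + xbar
mean((rec - patches)^2)                              # mean squared reconstruction error; shrinks as m grows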

p. 178

PCA with a Neural Network


[Figure: a single linear unit with inputs x_1, x_2, ..., x_d, weights w_1, w_2, ..., w_d, and output V]

$V = w^T x = \sum_{j=1}^{d} w_j x_j$

Apply the Hebbian learning rule

$\Delta w_i = \eta V x_i ,$

such that after some update steps the weight vector $w$ should point in the direction of maximum variance.

p. 179

PCA with a Neural Network (cont.)


Suppose that there is a stable equilibrium point for $w$, such that the average weight change is zero:

$0 = \langle \Delta w_i \rangle = \eta \, \langle V x_i \rangle = \eta \, \Big\langle \sum_j w_j x_j x_i \Big\rangle = \eta \sum_j C_{ij} w_j = \eta \, (Cw)_i$

The angle brackets indicate an average over the input distribution $P(x)$, and $C$ denotes the correlation matrix with

$C_{ij} \equiv \langle x_i x_j \rangle , \quad \text{or} \quad C \equiv \langle x x^T \rangle$

Note, $C$ is symmetric ($C_{ij} = C_{ji}$) and positive semi-definite, which implies that its eigenvalues are positive or zero and its eigenvectors can be taken as orthogonal.

p. 180

PCA with a Neural Network (cont.)

- At our hypothetical equilibrium point, $w$ is an eigenvector of $C$ with eigenvalue 0
- Such an equilibrium is never stable, because $C$ has some positive eigenvalues, and any component of $w$ along a corresponding eigenvector would grow exponentially
- One can constrain the growth of $w$, e.g. by renormalization ($\|w\| = 1$) after each update step
- A more elegant idea: add a weight decay proportional to $V^2$ to the Hebbian learning rule (Oja's rule)

$\Delta w_i = \eta V (x_i - V w_i)$

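A sketch of Oja's rule in R on zero-mean two-dimensional data; the toy data, the learning rate eta, and the number of passes are illustrative choices:

set.seed(3)
N <- 500
X <- scale(cbind(rnorm(N, sd = 3), rnorm(N)), scale = FALSE)   # zero-mean data
w <- rnorm(2); w <- w / sqrt(sum(w^2))        # random initial weight vector of unit length
eta <- 0.01
for (epoch in 1:50) {
  for (n in sample(N)) {
    x <- X[n, ]
    V <- sum(w * x)                           # output V = w' x
    w <- w + eta * V * (x - V * w)            # Oja's rule: dw_i = eta * V * (x_i - V * w_i)
  }
}
w                                             # converges (up to sign) to the leading eigenvector
eigen(cov(X))$vectors[, 1]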
p. 181

PCA with a Neural Network Example


[Figure: the weight vector learned by Oja's rule (blue vector) and the largest eigenvector (red vector), plotted over the 2-d data data.xy]

p. 182

Some insights into Oja's Rule


Oja's rule converges to a weight vector $w$ with the following properties:

- unit length: $\|w\| = 1$
- eigenvector direction: $w$ lies in a maximal eigenvector direction of $C$
- variance maximization: $w$ lies in a direction that maximizes $\langle V^2 \rangle$

Oja's learning rule is still limited, because we can construct only the first principal component of the z-normalized data.

p. 183

Construct the first m principal components

Single-layer network with the $i$-th output $V_i$ given by $V_i = \sum_j w_{ij} x_j = w_i^T x$, where $w_i$ is the weight vector for the $i$-th output

Oja's m-unit learning rule:

$\Delta w_{ij} = \eta V_i \Big( x_j - \sum_{k=1}^{m} V_k w_{kj} \Big)$

Sanger's learning rule:

$\Delta w_{ij} = \eta V_i \Big( x_j - \sum_{k=1}^{i} V_k w_{kj} \Big)$

Both rules reduce to Oja's 1-unit rule for the $m = 1$ and $i = 1$ case
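A sketch of Sanger's rule in R for the first m components (Oja's m-unit rule differs only in that the inner sum always runs over all k = 1, ..., m); the toy data and learning rate are illustrative:

set.seed(4)
N <- 500; d <- 3; m <- 2
A <- matrix(c(3, 1, 0, 0, 2, 1, 0, 0, 1), d, d)               # mixing matrix for correlated toy data
X <- scale(matrix(rnorm(N * d), N, d) %*% A, scale = FALSE)   # zero-mean data
W <- 0.1 * matrix(rnorm(m * d), m, d)                         # row W[i, ] is the weight vector w_i
eta <- 0.005
for (epoch in 1:100) {
  for (n in sample(N)) {
    x <- X[n, ]
    V <- as.vector(W %*% x)                                   # outputs V_i = w_i' x
    for (i in 1:m) {
      back <- as.vector(t(W[1:i, , drop = FALSE]) %*% V[1:i]) # sum over k <= i of V_k w_kj
      W[i, ] <- W[i, ] + eta * V[i] * (x - back)              # Sanger's rule update for w_i
    }
  }
}
W                                                             # rows approach the first m eigenvectors (up to sign)
t(eigen(cov(X))$vectors[, 1:m])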
p. 184

Oja's and Sanger's Rule

- In both cases the $w_i$ vectors converge to orthogonal unit vectors
- With Sanger's rule the weight vectors become exactly the first $m$ principal components, in order: $w_i = \pm c_i$, where $c_i$ is the normalized eigenvector of the correlation matrix $C$ belonging to the $i$-th largest eigenvalue $\lambda_i$
- Oja's m-unit rule converges to $m$ weight vectors that span the same subspace as the first $m$ eigenvectors, but in general it does not find the eigenvector directions themselves

p. 185

Linear Auto-Associative Network


[Figure: auto-associative network; the original features x_1, ..., x_d are mapped (extraction) to m bottleneck units z_1, ..., z_m, the extracted features, and mapped back (reconstruction) to the reconstructed features x_1, ..., x_d]

- The network is trained to perform the identity mapping
- Idea: the bottleneck units represent significant features of the input data
- Train the network by minimizing the sum-of-squares error

$\frac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{d} \left( y_k(x^{(n)}) - x_k^{(n)} \right)^2$
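A sketch, under illustrative choices of data, bottleneck size, and step length, of training such a linear auto-associative network by gradient descent on the sum-of-squares error:

set.seed(5)
N <- 300; d <- 5; m <- 2
X <- matrix(rnorm(N * d), N, d) %*% diag(c(3, 2, 1, 0.5, 0.2))
X <- scale(X, scale = FALSE)                 # centered inputs, also used as targets
W1 <- matrix(rnorm(d * m, sd = 0.1), d, m)   # extraction weights (input -> bottleneck)
W2 <- matrix(rnorm(m * d, sd = 0.1), m, d)   # reconstruction weights (bottleneck -> output)
eta <- 0.01
for (it in 1:5000) {
  Z <- X %*% W1                              # bottleneck activations (linear), N x m
  E <- Z %*% W2 - X                          # reconstruction error, N x d
  g2 <- t(Z) %*% E / N                       # gradient of the mean squared error w.r.t. W2
  g1 <- t(X) %*% (E %*% t(W2)) / N           # gradient of the mean squared error w.r.t. W1
  W1 <- W1 - eta * g1
  W2 <- W2 - eta * g2
}
0.5 * sum((X %*% W1 %*% W2 - X)^2)           # should approach the PCA reconstruction error for m = 2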
p. 186

Linear Auto-Associative Network (cont.)

- As with the Oja/Sanger update rules, this type of learning can be considered unsupervised learning, since no independent target data is provided
- The error function has a unique global minimum when the hidden units have linear activation functions
- At this minimum the network performs a projection onto the m-dimensional subspace which is spanned by the first m principal components of the data
- Note, however, that the weight vectors need not be orthogonal or normalized

p. 187

Non-Linear Auto-Associative Network


[Figure: non-linear auto-associative network; the original features x_1, ..., x_d feed a non-linear hidden layer, followed by a linear bottleneck layer with the m extracted features z_1, ..., z_m, another non-linear hidden layer, and a linear output layer producing the reconstructed features x_1, ..., x_d]

p. 188
