10.12.2014
Goals of Today’s Lecture
Introduction
The unknown parameters µ and σ can be estimated from data.
Multivariate Data
In that case
$$X = \begin{pmatrix} X^{(1)} \\ X^{(2)} \end{pmatrix},$$
where $X^{(1)}$ is temperature and $X^{(2)}$ is pressure.
A possible data-set now consists of $n$ vectors of dimension 2 (or $m$ in the general case):
$$x_1, \ldots, x_n,$$
where $x_i \in \mathbb{R}^2$ (or $\mathbb{R}^m$).
Remark
Here the situation is more general in the sense that we do not have a response variable; rather, we want to model “relationships” between (any of) the variables.
Expectation and Covariance Matrix
What about dependency?
[Figure: four scatter plots of Temperature B against Temperature A, illustrating different forms of dependency between the two components.]
We need a measure to characterize the dependency between the
different components.
$$-1 \le \rho \le 1$$
The formal definition of the correlation is based on the covariance.
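As a minimal sketch (not part of the original slides), the empirical correlation can be computed from the empirical covariance; the NumPy code and the simulated temperature data below are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
temp_a = rng.normal(20.0, 1.0, size=100)                # temperature A (made-up data)
temp_b = 0.8 * temp_a + rng.normal(0.0, 0.5, size=100)  # temperature B, correlated with A

# rho = Cov(A, B) / (sd(A) * sd(B)); np.cov and std(ddof=1) use the n-1 denominator
cov_ab = np.cov(temp_a, temp_b)[0, 1]
rho = cov_ab / (temp_a.std(ddof=1) * temp_b.std(ddof=1))
print(rho)                                # always lies in [-1, 1]
print(np.corrcoef(temp_a, temp_b)[0, 1])  # same value via the built-in
```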
The covariance matrix $\Sigma$ is an $m \times m$ matrix with elements
$$\Sigma_{jk} = \operatorname{Cov}\big(X^{(j)}, X^{(k)}\big) = E\big[(X^{(j)} - \mu_j)(X^{(k)} - \mu_k)\big].$$
We also write $\operatorname{Var}(X)$ or $\operatorname{Cov}(X)$ instead of $\Sigma$. The special symbol $\Sigma$ is used in order to avoid confusion with the sum sign $\sum$.
The covariance matrix contains a lot of information, e.g.
$$\Sigma_{jj} = \operatorname{Var}\big(X^{(j)}\big).$$
Again, from a real data-set we can estimate these quantities with
$$\hat{\mu} = \big[\bar{x}^{(1)}, \bar{x}^{(2)}, \ldots, \bar{x}^{(m)}\big]^T,$$
$$\hat{\Sigma}_{jk} = \frac{1}{n-1} \sum_{i=1}^{n} \big(x_i^{(j)} - \hat{\mu}_j\big)\big(x_i^{(k)} - \hat{\mu}_k\big).$$
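These estimators are straightforward to code; a minimal NumPy sketch on made-up data, checking that the formula above agrees with np.cov (which also divides by $n-1$):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(50, 3))  # n = 50 observations of an m = 3 dimensional vector

mu_hat = x.mean(axis=0)                   # component-wise sample means
xc = x - mu_hat                           # centered data
sigma_hat = xc.T @ xc / (x.shape[0] - 1)  # the estimator above, n-1 denominator
print(np.allclose(sigma_hat, np.cov(x, rowvar=False)))  # True
```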
Illustration: Empirical Correlation
Source: Wikipedia
Linear Transformations
The following table illustrates how the expectation and the variance
(covariance matrix) change when linear transformations are applied to
the univariate random variable X or the multivariate random vector X .
Univariate: $Y = a + bX$, with $E[Y] = a + b\,E[X]$ and $\operatorname{Var}(Y) = b^2\,\operatorname{Var}(X)$.
Multivariate: $Y = a + BX$, with $E[Y] = a + B\,E[X]$ and $\operatorname{Var}(Y) = B\,\Sigma_X\,B^T$.
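These rules hold exactly for the empirical moments as well, which a short sketch can confirm; the vector $a$ and matrix $B$ below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=(100, 2))   # n = 100 samples of a 2-dimensional X

a = np.array([1.0, -2.0, 0.5])
B = np.array([[2.0, 0.0],
              [1.0, 3.0],
              [0.0, -1.0]])     # maps R^2 to R^3

y = x @ B.T + a                 # Y = a + BX, applied row by row

# the table's rules, verified on the sample mean and sample covariance
print(np.allclose(y.mean(axis=0), a + B @ x.mean(axis=0)))                      # True
print(np.allclose(np.cov(y, rowvar=False), B @ np.cov(x, rowvar=False) @ B.T))  # True
```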
Principal Component Analysis (PCA)
Illustration of Artificial 3-Dim Data-Set
[Figure: 3-D scatter plot of the artificial data-set, with axes $x^{(1)}$, $x^{(2)}$, $x^{(3)}$.]
We have to be more precise about what we mean by “variation”.
Of course, our data-points $x_i \in \mathbb{R}^m$ will then have new coordinates
$$z_i^{(k)} = x_i^T b_k, \qquad k = 1, \ldots, m.$$
- And so on...
The new basis vectors are the so-called principal components.
PCA: Illustration in Two Dimensions
[Figure: two-dimensional illustration of PCA; axis labels include $\log_{10}(\text{width})$ and the second principal component.]
We could find the first basis vector $b_1$ by solving the following maximization problem:
$$\max_{b \,:\, \|b\| = 1} \widehat{\operatorname{Var}}(Xb).$$
It can be shown that $b_1$ is the (standardized) eigenvector of $\hat{\Sigma}_X$ that corresponds to the largest eigenvalue.
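A small numerical illustration of this fact (the simulated data and the random search over unit vectors are not from the slides): no unit direction yields a larger projected variance than the top eigenvector:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.multivariate_normal([0.0, 0.0], [[3.0, 1.5], [1.5, 1.0]], size=200)

sigma_hat = np.cov(x, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(sigma_hat)  # eigh sorts eigenvalues ascending
b1 = eigvecs[:, -1]                           # unit eigenvector of the largest eigenvalue

# compare Var(Xb) for b1 against many random unit vectors b
candidates = rng.normal(size=(1000, 2))
candidates /= np.linalg.norm(candidates, axis=1, keepdims=True)
var_proj = np.var(x @ candidates.T, axis=0, ddof=1)
print(np.var(x @ b1, ddof=1) >= var_proj.max())  # True: lambda_1 bounds every unit projection
```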
To summarize: We are performing a transformation to new variables
$$z_i = B^T (x_i - \hat{\mu}),$$
where the transformation matrix $B$ is orthogonal and contains the $b_k$'s as columns.
Hence, we have
$$\widehat{\operatorname{Var}}(Z) = \hat{\Sigma}_Z = B^T \hat{\Sigma}_X B = \begin{pmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & \lambda_m \end{pmatrix},$$
where $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_m \ge 0$.
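This diagonalization is easy to check on simulated data (the mixing matrix below is an arbitrary choice that makes the components correlated):

```python
import numpy as np

rng = np.random.default_rng(4)
mix = np.array([[1.0, 0.5, 0.2],
                [0.0, 1.0, 0.3],
                [0.0, 0.0, 1.0]])
x = rng.normal(size=(100, 3)) @ mix   # correlated 3-dimensional data

eigvals, B = np.linalg.eigh(np.cov(x, rowvar=False))
B = B[:, ::-1]                        # columns b_k, sorted by decreasing eigenvalue
z = (x - x.mean(axis=0)) @ B          # z_i = B^T (x_i - mu_hat), row by row

# the sample covariance of the scores is diagonal with the lambda_k on the diagonal
print(np.allclose(np.cov(z, rowvar=False), np.diag(eigvals[::-1])))  # True
```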
Scaling Issues
PCA and Dimensionality Reduction
Sometimes we see a sharp drop when plotting the eigenvalues (in
decreasing order).
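One informal rule built on this observation is to keep just enough components to explain a fixed share of the total variance; a sketch with hypothetical eigenvalues exhibiting such a drop:

```python
import numpy as np

# hypothetical eigenvalues with a sharp drop after the second one
eigvals = np.array([5.2, 2.1, 0.15, 0.08, 0.03])

explained = eigvals / eigvals.sum()
cumulative = np.cumsum(explained)
p = int(np.searchsorted(cumulative, 0.95)) + 1  # smallest p explaining >= 95%
print(explained.round(3), p)                    # here the drop suggests p = 2
```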
PCA and Dimensionality Reduction
Illustration
[Figure: the same 3-D scatter plot of the artificial data-set, with axes $x^{(1)}$, $x^{(2)}$, $x^{(3)}$.]
[Figure: pairwise scatterplot matrix of the scores on PC1, PC2 and PC3.]
Loadings
        PC1           PC2          PC3
x1  -0.395118451   0.06887762   0.91604437
x2  -0.009993467   0.99680385  -0.07926044
x3  -0.918575822  -0.04047172  -0.39316727
Screeplot
[Figure: screeplot of the variances of the three principal components; y-axis "Variances", ranging from 0.0 to 0.5.]
Applying PCA to NIR-Spectra
A spectrum has the property that the different variables are “ordered”
and we can plot one observation as a “function” (see plot on next
slide).
Illustration: Spectra (centered at each wavelength)
[Figure: centered spectra plotted against wavelength; vertical axis from −0.4 to 0.4.]
Scatterplot of the First Two Principal Components
[Figure: scatter plot of the scores on the first principal component (x-axis) against the second principal component (y-axis).]
Scree-Plot
[Figure: scree-plot of the variances of the principal components; y-axis "Variances", ranging from 0 to 14.]
Alternative Interpretations
$$x_i - \hat{\mu} = \hat{x}_i + e_i,$$
where
$$\hat{x}_i = \sum_{k=1}^{p} z_i^{(k)} b_k, \qquad e_i = \sum_{k=p+1}^{m} z_i^{(k)} b_k.$$
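The decomposition is easy to verify numerically; a sketch on made-up data, keeping p = 2 of m = 3 components:

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=(60, 3))
xc = x - x.mean(axis=0)         # centered data

eigvals, B = np.linalg.eigh(np.cov(x, rowvar=False))
B = B[:, ::-1]                  # principal components as columns
z = xc @ B                      # scores

p = 2
x_hat = z[:, :p] @ B[:, :p].T   # projection onto the first p components
e = z[:, p:] @ B[:, p:].T       # residual, built from the remaining components
print(np.allclose(xc, x_hat + e))  # True: x_i - mu_hat = x_hat_i + e_i
```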
Linear Algebra
It can be shown that the data matrix consisting of the $\hat{x}_i$ is the best approximation of our original (centered) data matrix if we restrict ourselves to matrices of rank $p$ (with respect to the Frobenius norm), i.e. it has the smallest
$$\sum_{i,j} \big(e_i^{(j)}\big)^2.$$
Statistics
It is the best approximation in the sense that it has the smallest sum of variances
$$\sum_{j=1}^{m} \widehat{\operatorname{Var}}\big(E^{(j)}\big).$$
PCA via Singular Value Decomposition (SVD)
We can also get the principal components from the singular value
decomposition (SVD) of the data matrix X.
$$X = U D V^T,$$
where
- $U$ is $n \times n$ and orthogonal,
- $D$ is $n \times m$ and generalized diagonal (containing the so-called singular values in descending order),
- $V$ is $m \times m$ and orthogonal.
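A minimal sketch relating the SVD to the quantities above; it uses the economy-size SVD (full_matrices=False), which has the same $V$ and singular values as the full decomposition described on the slide:

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(size=(80, 4))
xc = x - x.mean(axis=0)         # SVD of the *centered* data matrix

U, d, Vt = np.linalg.svd(xc, full_matrices=False)

eigvals, B = np.linalg.eigh(np.cov(x, rowvar=False))
# d^2 / (n-1) are the eigenvalues of the sample covariance ...
print(np.allclose(np.sort(d**2 / (len(x) - 1)), np.sort(eigvals)))  # True
# ... and the rows of V^T are the principal components (up to sign)
print(np.allclose(np.abs(Vt[0]), np.abs(B[:, -1])))                 # True
```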
Properties of SVD
For the centered data matrix $X$ we have
$$\big(X^T X\big)_{jk} = \sum_{i=1}^{n} \big(x_i x_i^T\big)_{jk} = \sum_{i=1}^{n} x_i^{(j)} x_i^{(k)} = (n-1)\,\hat{\Sigma}_{jk}.$$
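A quick numerical check of this identity; it holds because the data matrix is centered, consistent with the centering used throughout:

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(size=(40, 3))
xc = x - x.mean(axis=0)   # centered data matrix

print(np.allclose(xc.T @ xc, (len(xc) - 1) * np.cov(x, rowvar=False)))  # True
```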
Summary