
CS195-5: Introduction to Machine Learning

Lecture 5
Greg Shakhnarovich
September 15, 2006
Revised October 24, 2006
Announcements
Collaboration policy on Psets
Projects
Clarifications for Problem Set 1
The correlation question
N values in each of two samples:
  e_i = y_i − ŵ^T x_i, the prediction error
  z_i = a^T x_i, a linear function evaluated on the training examples.
Show that cor({e_i}, {z_i}) = 0.
Develop an intuition, before you attack the derivation: Play with these in Matlab!
  Generate a random w∗ and a random X.
  Compute Xw∗; generate and add Gaussian noise ε.
  Fit ŵ, calculate {e_i}.
  Generate a random a, calculate {z_i}; plot them!
  Calculate correlation.
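A minimal Matlab sketch of this experiment (the sizes, noise level, and variable names below are illustrative, not prescribed by the problem set):

  N = 100; d = 3;
  X = [ones(N, 1), randn(N, d)];        % random inputs, with a constant column
  w_star = randn(d + 1, 1);             % the "true" w*
  y = X * w_star + 0.1 * randn(N, 1);   % targets: X w* plus Gaussian noise
  w_hat = X \ y;                        % least-squares fit
  e = y - X * w_hat;                    % prediction errors {e_i}
  a = randn(d + 1, 1);                  % an arbitrary linear function
  z = X * a;                            % {z_i} = a' * x_i
  plot(z, e, '.');                      % no visible linear trend
  corrcoef(z, e)                        % off-diagonal entries ~ 0

With the constant column included, the off-diagonal entries of corrcoef(z, e) come out as zero up to numerical precision, for any choice of a.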
More notation
A ≜ B means A is defined by B (the first time A is introduced).
A ≡ B, for varying A and/or B, means they are always equal.
E.g., f(x) ≡ 1 means f returns 1 regardless of the input x.
a ∼ p(a): the random variable a is drawn from the density p(a).
Review
Uncertainty in ŵ as an estimate of w∗:
  ŵ ∼ N(ŵ; w∗, σ² (X^T X)^{−1})
Generalized linear regression (a short Matlab sketch follows below):
  f(x; w) = w_0 + w_1 φ_1(x) + w_2 φ_2(x) + … + w_m φ_m(x)
Multivariate Gaussians
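The sketch of the generalized linear form, using a polynomial basis in Matlab (the basis choice and the data are purely illustrative):

  x = linspace(-1, 1, 50)';              % 1D inputs
  y = sin(3 * x) + 0.1 * randn(50, 1);   % noisy targets
  Phi = [ones(50, 1), x, x.^2, x.^3];    % phi_j(x) = x^j, plus the constant
  w = Phi \ y;                           % least-squares estimate of [w_0 ... w_3]
  yhat = Phi * w;                        % fitted values f(x; w)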
Today
More on Gaussians
Introduction to classification
Projections
Linear discriminant analysis
Refresher on probability
Variance of a r.v. a:
  σ²_a = E[(a − μ_a)²], where μ_a = E[a].
Standard deviation: √(σ²_a) = σ_a. Measures the spread around the mean.
Generalization to two variables: covariance
  Cov_{a,b} ≜ E_{p(a,b)}[(a − μ_a)(b − μ_b)]
Measures how the two variables deviate together from their means (co-vary).
Correlation and covariance
Correlation:
  cor(a, b) ≜ Cov_{a,b} / (σ_a σ_b)

[Figure: three scatter plots of samples (a, b) showing different degrees of correlation.]

cor(a, b) measures the linear relationship between a and b.
−1 ≤ cor(a, b) ≤ +1; +1 or −1 means a is a linear function of b.
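A quick Matlab check of this definition (the particular dependence between a and b below is made up for illustration):

  N = 1000;
  a = randn(N, 1);
  b = 0.7 * a + 0.3 * randn(N, 1);               % b co-varies with a
  c_ab = mean((a - mean(a)) .* (b - mean(b)));   % Cov_{a,b}
  r = c_ab / (std(a, 1) * std(b, 1))             % cor(a, b)
  corrcoef(a, b)                                 % built-in; off-diagonal matches r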
Covariance matrix
For a random vector x = [x_1, …, x_d]^T,

  Cov_x ≜ [ σ²_{x_1}       Cov_{x_1,x_2}  …  Cov_{x_1,x_d} ]
          [ Cov_{x_2,x_1}  σ²_{x_2}       …  Cov_{x_2,x_d} ]
          [   ⋮              ⋮            ⋱     ⋮          ]
          [ Cov_{x_d,x_1}  Cov_{x_d,x_2}  …  σ²_{x_d}      ]

Square, symmetric, with a non-negative main diagonal (variances ≥ 0).
Under that definition, one can show:
  Cov_x = E[(x − μ_x)(x − μ_x)^T],
i.e. the expectation of the outer product of x − μ_x with itself.
Note: so far nothing Gaussian-specific!
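A Matlab sketch of this identity on sample data (the mixing matrix below is arbitrary, used only to make the components correlated):

  N = 1000; d = 3;
  X = randn(N, d) * [1 0 0; 0.5 2 0; 0 0.3 1];   % correlated samples, one per row
  mu = mean(X, 1);
  Xc = X - repmat(mu, N, 1);                     % center: x - mu_x
  C = (Xc' * Xc) / N                             % mean outer product (x - mu)(x - mu)'
  % C agrees with the built-in cov(X, 1), which also normalizes by N.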
Covariance matrix decomposition
Any covariance matrix can be decomposed:

  Σ = R diag(λ_1, …, λ_d) R^T

where R is a rotation matrix, and λ_j ≥ 0 for all j = 1, …, d.

Rotation in 2D:

  R = [ cos θ   −sin θ ]
      [ sin θ    cos θ ]
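In Matlab, eig recovers such a decomposition (up to a possible reflection, since eig only guarantees an orthogonal R; the covariance below is an arbitrary example):

  Sigma = [2 0.8; 0.8 1];
  [R, L] = eig(Sigma);        % Sigma = R * L * R', with R orthogonal
  lambda = diag(L)            % eigenvalues, non-negative for a valid covariance
  norm(R * L * R' - Sigma)    % ~ 0: the decomposition reproduces Sigma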
Rotation matrices
  Σ = R diag(λ_1, …, λ_d) R^T

Rotation matrix R:
  orthonormal: if its columns are r_1, …, r_d, then r_i^T r_i = 1 and r_i^T r_j = 0 for i ≠ j.
  From this it follows that R^T = R^{−1} (R^T reverses the rotation produced by R).
  The columns r_i specify the basis for the new (rotated) coordinate system.
R determines the orientation of the ellipse (the so-called principal directions).
The inner diag(λ_1, …, λ_d) specifies the scaling along each of the principal directions.
Interpretation of the whole product: rotate, scale, and rotate back.
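A tiny Matlab check of these properties for a 2D rotation (the angle is arbitrary):

  theta = pi / 6;
  R = [cos(theta), -sin(theta); sin(theta), cos(theta)];
  R' * R                  % the identity: the columns are orthonormal
  norm(R' - inv(R))       % ~ 0, i.e. R^T = R^{-1}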
Covariance and correlation for Gaussians
Suppose (for simplicity) μ = 0. What happens if we rotate the data by R^T?
The new covariance matrix is just

  Cov_{R^T x} = R^T Σ R = diag(λ_1, …, λ_d)

The components of x are now uncorrelated (the covariances are zero). This is known as a whitening transformation (strictly, the decorrelating rotation; full whitening also rescales each component by 1/√λ_j).
For Gaussians, this also means the components are independent.
Not true for all distributions!
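A Matlab sketch of this effect (the covariance is arbitrary; chol is used only to generate zero-mean samples with that covariance):

  N = 5000;
  Sigma = [2 0.8; 0.8 1];
  X = randn(N, 2) * chol(Sigma);   % rows are samples with covariance ~ Sigma
  [R, L] = eig(Sigma);
  Y = X * R;                       % each row is (R' * x)' for the corresponding x
  cov(Y, 1)                        % ~ diag(lambda): off-diagonal entries near 0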
Classification versus regression
Formally: just like in regression, we want to learn a mapping from X to Y, but Y is discrete and finite.
One approach is to (naively) ignore that Y is such.
Regression on the indicator matrix:
  Code the possible values of the label as 1, …, C.
  Define the matrix Y with
    Y_ic = 1 if y_i = c, and 0 otherwise.
  This defines C independent regression problems; solving them with least squares yields
    Ŷ = X (X^T X)^{−1} X^T Y.
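A Matlab sketch of regression on the indicator matrix (the toy data and the max-over-columns read-out at the end are illustrative additions, not from the slide):

  N = 90; C = 3; d = 2;
  X = [ones(N, 1), randn(N, d)];          % inputs with a constant column
  y = ceil(C * rand(N, 1));               % labels coded as 1, ..., C
  Y = zeros(N, C);
  Y(sub2ind([N C], (1:N)', y)) = 1;       % indicator matrix: Y(i,c) = 1 iff y_i = c
  W = (X' * X) \ (X' * Y);                % C least-squares problems at once
  Yhat = X * W;                           % fitted scores, one column per class
  [scores, ypred] = max(Yhat, [], 2);     % one common read-out: pick the largest score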
Classification as regression
Suppose we have a binary problem, y ∈ {−1, +1}.
Assuming the standard model y = f(x; w) + ε, and solving with least squares, we get ŵ.
This corresponds to using squared loss as a measure of classification performance!
Does this make sense?
How do we decide on the label based on f(x; ŵ)?
Classification as regression: example
A 1D example:
[Figure: training points with labels y = +1 and y = −1 plotted against x, the fitted line w_0 + w^T x, and the resulting decision regions ŷ = −1 and ŷ = +1.]
Classification as regression

  f(x; w) = w_0 + w^T x

Can't just take y = f(x; ŵ), since it won't be a valid label.
A reasonable decision rule: decide on y = +1 if f(x; ŵ) ≥ 0, otherwise y = −1, i.e.

  ŷ = sign(w_0 + w^T x)

This specifies a linear classifier: the linear decision boundary (hyperplane) given by the equation w_0 + w^T x = 0 separates the space into two half-spaces.
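A Matlab sketch of this rule end to end (the two Gaussian blobs are made-up data; sign(0) is mapped to +1 by hand to match the rule above):

  N = 100; d = 2;
  X = [randn(N/2, d) + 1.5; randn(N/2, d) - 1.5];   % two blobs, one per class
  y = [ones(N/2, 1); -ones(N/2, 1)];
  Xa = [ones(N, 1), X];                % prepend a 1 to absorb w_0
  w = Xa \ y;                          % least-squares fit to the +/-1 labels
  ypred = sign(Xa * w);                % the decision rule
  ypred(ypred == 0) = 1;               % decide +1 exactly on the boundary
  training_error = mean(ypred ~= y)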
Classification as regression
Seems to work well here... but not so well here?
[Figure: two example data sets, one where the least-squares boundary separates the classes well and one where it does not.]
Geometry of projections
[Figure: the (x_1, x_2) plane with the line w_0 + w^T x = 0, the normal vector w, the offset w_0/‖w‖ of the line from the origin, and a point x_0 at distance (w_0 + w^T x_0)/‖w‖ from the line, with x_0^⊥ its projection.]
w^T x = 0: a line passing through the origin and orthogonal to w.
w^T x + w_0 = 0 shifts the line along w (by −w_0/‖w‖).
x^⊥ is the projection of x on w.
Set up a new 1D coordinate system: x ↦ (w_0 + w^T x)/‖w‖.
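A small Matlab illustration of this 1D coordinate (w, w_0, and the points are arbitrary):

  w = [2; -1]; w0 = 0.5;
  X = randn(10, 2);              % ten points, one per row
  z = (w0 + X * w) / norm(w);    % the 1D coordinate of each point
  % sign(z) tells which half-space a point lies in; abs(z) is its distance
  % from the hyperplane w_0 + w'x = 0.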
Distribution in 1D projection
Consider a projection given by w^T x = 0 (i.e., w is the normal).
Each training point x_i is projected to a scalar z_i = w^T x_i.
We can study how well the projected values corresponding to different classes are separated.
This is a function of w; some projections may be better than others.
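A Matlab sketch comparing two candidate directions on made-up two-class data:

  N = 200;
  X1 = randn(N, 2) + repmat([2 0], N, 1);   % class 1
  X2 = randn(N, 2) - repmat([2 0], N, 1);   % class 2
  w = [1; 0];                               % candidate direction (normal)
  z1 = X1 * w; z2 = X2 * w;                 % projected values for each class
  [mean(z1), mean(z2); var(z1), var(z2)]    % means far apart, modest spread
  % With w = [0; 1] the projected means coincide and the classes overlap badly.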
Linear discriminant and dimensionality reduction
The discriminant function f(x; w) = w_0 + w^T x reduces the dimension of the examples from d to 1.
[Figure: the direction w and the level sets f(x; w) = −1, f(x; w) = 0, and f(x; w) = +1.]
Projections and classification
What objective are we optimizing the 1D projection for?
1D projections of a Gaussian
Let p(x) = N(x; μ, Σ).
For any A, p(Ax) = N(Ax; Aμ, A Σ A^T).
To get the marginal of the 1D projection onto the direction defined by a unit vector v:
  Make R a rotation such that R [1, 0, …, 0]^T = v.
  Compute σ²_v = v^T Σ v; that's the variance of the marginal.
Let's assume for now μ = 0 (but think about what happens if it's not!)
Matlab demo: margGausDemo.m
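margGausDemo.m itself is not reproduced here; the following is just an independent Matlab check of the same fact (Σ and v are arbitrary):

  Sigma = [2 0.8; 0.8 1];
  v = [1; 1] / sqrt(2);                % a unit direction
  s2 = v' * Sigma * v                  % predicted variance of the 1D marginal
  X = randn(10000, 2) * chol(Sigma);   % zero-mean samples with covariance Sigma
  var(X * v, 1)                        % empirical variance, close to s2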
Objective: class separation
We want to minimize overlap between projections of the two classes.
One way to approach that: make the class projections a) compact, b) far apart.
Next time
Continue with linear discriminant analysis, and talk about the optimal way to place the decision boundary.