
CS195-5: Introduction to Machine Learning

Lecture 5
Greg Shakhnarovich
September 15, 2006
Revised October 24, 2006
Announcements
Collaboration policy on Psets
Projects
Clarifications for Problem Set 1
The correlation question
N values in each of two samples:
  e_i = y_i − ŵ^T x_i, the prediction error
  z_i = a^T x_i, a linear function evaluated on the training examples.
Show that cor({e_i}, {z_i}) = 0.
Develop an intuition, before you attack the derivation: Play with these in Matlab!
  Generate a random w∗ and a random X.
  Compute Xw∗; generate and add Gaussian noise ε.
  Fit ŵ, calculate {e_i}.
  Generate a random a, calculate {z_i}; plot them!
  Calculate correlation.
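A minimal Matlab sketch of this experiment (the sizes, noise level, and variable names below are illustrative, not prescribed by the problem set):

  N = 100; d = 3;
  X = [ones(N, 1), randn(N, d)];        % random inputs, with a constant column
  w_star = randn(d + 1, 1);             % the "true" w*
  y = X * w_star + 0.1 * randn(N, 1);   % targets: X w* plus Gaussian noise
  w_hat = X \ y;                        % least-squares fit
  e = y - X * w_hat;                    % prediction errors {e_i}
  a = randn(d + 1, 1);                  % an arbitrary linear function
  z = X * a;                            % {z_i} = a' * x_i
  plot(z, e, '.');                      % no visible linear trend
  corrcoef(z, e)                        % off-diagonal entries ~ 0

With the constant column included, the off-diagonal entries of corrcoef(z, e) come out as zero up to numerical precision, for any choice of a.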
More notation
A ≜ B means A is defined by B (the first time A is introduced).
A ≡ B, for varying A and/or B, means they are always equal.
E.g., f(x) ≡ 1 means f returns 1 regardless of the input x.
a ∼ p(a): the random variable a is drawn from the density p(a).
Review
Uncertainty in ŵ as an estimate of w∗:
  ŵ ∼ N(ŵ; w∗, σ² (X^T X)^{−1})
Generalized linear regression (a short Matlab sketch follows below):
  f(x; w) = w_0 + w_1 φ_1(x) + w_2 φ_2(x) + … + w_m φ_m(x)
Multivariate Gaussians
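The sketch of the generalized linear form, using a polynomial basis in Matlab (the basis choice and the data are purely illustrative):

  x = linspace(-1, 1, 50)';              % 1D inputs
  y = sin(3 * x) + 0.1 * randn(50, 1);   % noisy targets
  Phi = [ones(50, 1), x, x.^2, x.^3];    % phi_j(x) = x^j, plus the constant
  w = Phi \ y;                           % least-squares estimate of [w_0 ... w_3]
  yhat = Phi * w;                        % fitted values f(x; w)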
Today
More on Gaussians
Introduction to classification
Projections
Linear discriminant analysis
Refresher on probability
Variance of a r.v. a:
  σ²_a = E[(a − μ_a)²], where μ_a = E[a].
Standard deviation: √(σ²_a) = σ_a. Measures the spread around the mean.
Generalization to two variables: covariance
  Cov_{a,b} ≜ E_{p(a,b)}[(a − μ_a)(b − μ_b)]
Measures how the two variables deviate together from their means (co-vary).
Correlation and covariance
Correlation:
  cor(a, b) ≜ Cov_{a,b} / (σ_a σ_b)

[Figure: three scatter plots of samples (a, b) showing different degrees of correlation.]

cor(a, b) measures the linear relationship between a and b.
−1 ≤ cor(a, b) ≤ +1; +1 or −1 means a is a linear function of b.
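A quick Matlab check of this definition (the particular dependence between a and b below is made up for illustration):

  N = 1000;
  a = randn(N, 1);
  b = 0.7 * a + 0.3 * randn(N, 1);               % b co-varies with a
  c_ab = mean((a - mean(a)) .* (b - mean(b)));   % Cov_{a,b}
  r = c_ab / (std(a, 1) * std(b, 1))             % cor(a, b)
  corrcoef(a, b)                                 % built-in; off-diagonal matches r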
Covariance matrix
For a random vector x = [x_1, …, x_d]^T,

  Cov_x ≜ [ σ²_{x_1}       Cov_{x_1,x_2}  …  Cov_{x_1,x_d} ]
          [ Cov_{x_2,x_1}  σ²_{x_2}       …  Cov_{x_2,x_d} ]
          [   ⋮              ⋮            ⋱     ⋮          ]
          [ Cov_{x_d,x_1}  Cov_{x_d,x_2}  …  σ²_{x_d}      ]

Square, symmetric, with a non-negative main diagonal (variances ≥ 0).
Under that definition, one can show:
  Cov_x = E[(x − μ_x)(x − μ_x)^T],
i.e. the expectation of the outer product of x − μ_x with itself.
Note: so far nothing Gaussian-specific!
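A Matlab sketch of this identity on sample data (the mixing matrix below is arbitrary, used only to make the components correlated):

  N = 1000; d = 3;
  X = randn(N, d) * [1 0 0; 0.5 2 0; 0 0.3 1];   % correlated samples, one per row
  mu = mean(X, 1);
  Xc = X - repmat(mu, N, 1);                     % center: x - mu_x
  C = (Xc' * Xc) / N                             % mean outer product (x - mu)(x - mu)'
  % C agrees with the built-in cov(X, 1), which also normalizes by N.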
Covariance matrix decomposition
Any covariance matrix can be decomposed:

  Σ = R diag(λ_1, …, λ_d) R^T

where R is a rotation matrix, and λ_j ≥ 0 for all j = 1, …, d.

Rotation in 2D:

  R = [ cos θ   −sin θ ]
      [ sin θ    cos θ ]
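In Matlab, eig recovers such a decomposition (up to a possible reflection, since eig only guarantees an orthogonal R; the covariance below is an arbitrary example):

  Sigma = [2 0.8; 0.8 1];
  [R, L] = eig(Sigma);        % Sigma = R * L * R', with R orthogonal
  lambda = diag(L)            % eigenvalues, non-negative for a valid covariance
  norm(R * L * R' - Sigma)    % ~ 0: the decomposition reproduces Sigma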
Rotation matrices
  Σ = R diag(λ_1, …, λ_d) R^T

Rotation matrix R:
  orthonormal: if its columns are r_1, …, r_d, then r_i^T r_i = 1 and r_i^T r_j = 0 for i ≠ j.
  From this it follows that R^T = R^{−1} (R^T reverses the rotation produced by R).
  The columns r_i specify the basis for the new (rotated) coordinate system.
R determines the orientation of the ellipse (the so-called principal directions).
The inner diag(λ_1, …, λ_d) specifies the scaling along each of the principal directions.
Interpretation of the whole product: rotate, scale, and rotate back.
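A tiny Matlab check of these properties for a 2D rotation (the angle is arbitrary):

  theta = pi / 6;
  R = [cos(theta), -sin(theta); sin(theta), cos(theta)];
  R' * R                  % the identity: the columns are orthonormal
  norm(R' - inv(R))       % ~ 0, i.e. R^T = R^{-1}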
Covariance and correlation for Gaussians
Suppose (for simplicity) μ = 0. What happens if we rotate the data by R^T?
The new covariance matrix is just

  Cov_{R^T x} = R^T Σ R = diag(λ_1, …, λ_d)

The components of x are now uncorrelated (the covariances are zero). This is known as a whitening transformation (strictly, the decorrelating rotation; full whitening also rescales each component by 1/√λ_j).
For Gaussians, this also means the components are independent.
Not true for all distributions!
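A Matlab sketch of this effect (the covariance is arbitrary; chol is used only to generate zero-mean samples with that covariance):

  N = 5000;
  Sigma = [2 0.8; 0.8 1];
  X = randn(N, 2) * chol(Sigma);   % rows are samples with covariance ~ Sigma
  [R, L] = eig(Sigma);
  Y = X * R;                       % each row is (R' * x)' for the corresponding x
  cov(Y, 1)                        % ~ diag(lambda): off-diagonal entries near 0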
Classification versus regression
Formally: just like in regression, we want to learn a mapping from X to Y, but Y is discrete and finite.
One approach is to (naively) ignore that Y is such.
Regression on the indicator matrix:
  Code the possible values of the label as 1, …, C.
  Define the matrix Y with
    Y_ic = 1 if y_i = c, and 0 otherwise.
  This defines C independent regression problems; solving them with least squares yields
    Ŷ = X (X^T X)^{−1} X^T Y.
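A Matlab sketch of regression on the indicator matrix (the toy data and the max-over-columns read-out at the end are illustrative additions, not from the slide):

  N = 90; C = 3; d = 2;
  X = [ones(N, 1), randn(N, d)];          % inputs with a constant column
  y = ceil(C * rand(N, 1));               % labels coded as 1, ..., C
  Y = zeros(N, C);
  Y(sub2ind([N C], (1:N)', y)) = 1;       % indicator matrix: Y(i,c) = 1 iff y_i = c
  W = (X' * X) \ (X' * Y);                % C least-squares problems at once
  Yhat = X * W;                           % fitted scores, one column per class
  [scores, ypred] = max(Yhat, [], 2);     % one common read-out: pick the largest score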
Classification as regression
Suppose we have a binary problem, y ∈ {−1, +1}.
Assuming the standard model y = f(x; w) + ε, and solving with least squares, we get ŵ.
This corresponds to using squared loss as a measure of classification performance!
Does this make sense?
How do we decide on the label based on f(x; ŵ)?
Classification as regression: example
A 1D example:
[Figure: training points with labels y = +1 and y = −1 plotted against x, the fitted line w_0 + w^T x, and the resulting decision regions ŷ = −1 and ŷ = +1.]
Classification as regression

  f(x; w) = w_0 + w^T x

Can't just take y = f(x; ŵ), since it won't be a valid label.
A reasonable decision rule: decide on y = +1 if f(x; ŵ) ≥ 0, otherwise y = −1, i.e.

  ŷ = sign(w_0 + w^T x)

This specifies a linear classifier: the linear decision boundary (hyperplane) given by the equation w_0 + w^T x = 0 separates the space into two half-spaces.
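A Matlab sketch of this rule end to end (the two Gaussian blobs are made-up data; sign(0) is mapped to +1 by hand to match the rule above):

  N = 100; d = 2;
  X = [randn(N/2, d) + 1.5; randn(N/2, d) - 1.5];   % two blobs, one per class
  y = [ones(N/2, 1); -ones(N/2, 1)];
  Xa = [ones(N, 1), X];                % prepend a 1 to absorb w_0
  w = Xa \ y;                          % least-squares fit to the +/-1 labels
  ypred = sign(Xa * w);                % the decision rule
  ypred(ypred == 0) = 1;               % decide +1 exactly on the boundary
  training_error = mean(ypred ~= y)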
Classification as regression
Seems to work well here... but not so well here?
[Figure: two example data sets, one where the least-squares boundary separates the classes well and one where it does not.]
Geometry of projections
[Figure: the (x_1, x_2) plane with the line w_0 + w^T x = 0, the normal vector w, the offset w_0/‖w‖ of the line from the origin, and a point x_0 at distance (w_0 + w^T x_0)/‖w‖ from the line, with x_0^⊥ its projection.]
w^T x = 0: a line passing through the origin and orthogonal to w.
w^T x + w_0 = 0 shifts the line along w (by −w_0/‖w‖).
x^⊥ is the projection of x on w.
Set up a new 1D coordinate system: x ↦ (w_0 + w^T x)/‖w‖.
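A small Matlab illustration of this 1D coordinate (w, w_0, and the points are arbitrary):

  w = [2; -1]; w0 = 0.5;
  X = randn(10, 2);              % ten points, one per row
  z = (w0 + X * w) / norm(w);    % the 1D coordinate of each point
  % sign(z) tells which half-space a point lies in; abs(z) is its distance
  % from the hyperplane w_0 + w'x = 0.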
Distribution in 1D projection
Consider a projection given by w^T x = 0 (i.e., w is the normal).
Each training point x_i is projected to a scalar z_i = w^T x_i.
We can study how well the projected values corresponding to different classes are separated.
This is a function of w; some projections may be better than others.
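A Matlab sketch comparing two candidate directions on made-up two-class data:

  N = 200;
  X1 = randn(N, 2) + repmat([2 0], N, 1);   % class 1
  X2 = randn(N, 2) - repmat([2 0], N, 1);   % class 2
  w = [1; 0];                               % candidate direction (normal)
  z1 = X1 * w; z2 = X2 * w;                 % projected values for each class
  [mean(z1), mean(z2); var(z1), var(z2)]    % means far apart, modest spread
  % With w = [0; 1] the projected means coincide and the classes overlap badly.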
Linear discriminant and dimensionality reduction
The discriminant function f(x; w) = w_0 + w^T x reduces the dimension of the examples from d to 1.
[Figure: the direction w and the level sets f(x; w) = −1, f(x; w) = 0, and f(x; w) = +1.]
Projections and classification
What objective are we optimizing the 1D projection for?
1D projections of a Gaussian
Let p(x) = N(x; μ, Σ).
For any A, p(Ax) = N(Ax; Aμ, A Σ A^T).
To get the marginal of the 1D projection onto the direction defined by a unit vector v:
  Make R a rotation such that R [1, 0, …, 0]^T = v.
  Compute σ²_v = v^T Σ v; that's the variance of the marginal.
Let's assume for now μ = 0 (but think about what happens if it's not!)
Matlab demo: margGausDemo.m
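margGausDemo.m itself is not reproduced here; the following is just an independent Matlab check of the same fact (Σ and v are arbitrary):

  Sigma = [2 0.8; 0.8 1];
  v = [1; 1] / sqrt(2);                % a unit direction
  s2 = v' * Sigma * v                  % predicted variance of the 1D marginal
  X = randn(10000, 2) * chol(Sigma);   % zero-mean samples with covariance Sigma
  var(X * v, 1)                        % empirical variance, close to s2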
Objective: class separation
We want to minimize overlap between projections of the two classes.
One way to approach that: make the class projections a) compact, b) far apart.
Next time
Continue with linear discriminant analysis, and talk about the optimal way to place the decision boundary.