6.867 Machine learning, lecture 1 (Jaakkola)
image, and x_i is the i-th pixel in any image x, then

    f(x) = { y_t,  if x_{ti} = x_i for all i = 1, ..., d and some t = 1, ..., n
           { 1,    otherwise                                                      (1)
would appear to solve the task. In fact, it is always possible to come up with such a
perfect binary function if the training images are distinct (no two images have identical
pixel intensities for all pixels). But do we expect such rules to be useful for images not
in the training set? Even an image of the same person varies somewhat each time the
image is taken (orientation is slightly different, lighting conditions may have changed, etc.).
These rules provide no sensible predictions for images that are not identical to those in the
training set. The primary reason why such trivial rules do not suffice is that our task is
not to correctly classify the training images. Our task is to find a rule that works well for
all new images we would encounter in the access control setting; the training set is merely
a helpful source of information to find such a function. To put it a bit more formally, we
would like to find classifiers that generalize well, i.e., classifiers whose performance on the
training set is representative of how well they work on yet unseen images.
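To make the point concrete, the rule of Eq. (1) can be sketched as a lookup table. This is a hypothetical illustration with our own toy "images" and helper name, not part of the lecture:

```python
import numpy as np

def memorizing_rule(train_x, train_y):
    """The 'perfect' rule of Eq. (1): return the stored label y_t when the
    query matches a training image pixel for pixel, and 1 otherwise."""
    def f(x):
        for x_t, y_t in zip(train_x, train_y):
            if np.array_equal(x_t, x):    # x_ti = x_i for every pixel i
                return y_t
        return 1                          # arbitrary default for unseen images
    return f

# two distinct 4-pixel "images" with labels in {-1, +1}
train_x = [np.array([0.1, 0.9, 0.4, 0.3]), np.array([0.8, 0.2, 0.5, 0.7])]
train_y = [1, -1]
f = memorizing_rule(train_x, train_y)

print(f(train_x[1]))                        # exact match: reproduces the label -1
print(f(np.array([0.1, 0.9, 0.4, 0.31])))   # a tiny lighting change: default label 1
```

The rule has zero training error, yet the slightest perturbation of a stored image falls through to the default label, which is exactly the failure to generalize described above.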
Model selection
So how can we find classifiers that generalize well? The key is to constrain the set of
possible binary functions we can entertain. In other words, we would like to find a class of
binary functions such that if a function in this class works well on the training set, it is also
likely to work well on the unseen images. The right class of functions to consider cannot
be too large in the sense of containing too many clearly different functions. Otherwise we
are likely to find rules similar to the trivial ones that are close to perfect on the training
set but do not generalize well. The class of functions should not be too small either, or we
run the risk of not finding any functions in the class that work well even on the training
set. If they don't work well on the training set, how could we expect them to work well on
the new images? Finding the right class of functions is a key problem in machine learning, also
known as the model selection problem.
Linear classifiers through origin
Let's just fix the function class for now. Specifically, we will consider only a type of linear
classifiers. These are thresholded linear mappings from images to labels. More formally,
we only consider functions of the form

    f(x; θ) = sign(θ_1 x_1 + ... + θ_d x_d) = sign(θ^T x)    (2)
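As a concrete sketch (assuming numpy; the helper name is ours, not the lecture's), Eq. (2) amounts to a dot product followed by a threshold. We adopt the common convention sign(0) = +1 so the output is always in {-1, +1}:

```python
import numpy as np

def linear_classifier(theta, x):
    """f(x; theta) = sign(theta^T x), a linear classifier through origin."""
    return 1 if theta @ x >= 0 else -1   # convention: sign(0) = +1

theta = np.array([2.0, -1.0, 0.5])
print(linear_classifier(theta, np.array([1.0, 1.0, 1.0])))  # theta^T x = 1.5, so +1
print(linear_classifier(theta, np.array([0.0, 3.0, 1.0])))  # theta^T x = -2.5, so -1
```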
Cite as: Tommi Jaakkola, course materials for 6.867 Machine Learning, Fall 2006. MIT OpenCourseWare
(http://ocw.mit.edu/), Massachusetts Institute of Technology. Downloaded on [DD Month YYYY].
where θ = [θ_1, ..., θ_d]^T is a column vector of real-valued parameters. Different settings of
the parameters give different functions in this class, i.e., functions whose value or output
in {-1, 1} could be different for some input images x. Put another way, the functions in
our class are parameterized by θ ∈ R^d.
We can also understand these linear classifiers geometrically. The classifier changes its
prediction only when the argument to the sign function changes from positive to negative
(or vice versa). Geometrically, in the space of image vectors, this transition corresponds to
crossing the decision boundary where the argument is exactly zero: all x such that θ^T x = 0.
The equation defines a plane in d dimensions, a plane that goes through the origin since
x = 0 satisfies the equation. The parameter vector θ is normal (orthogonal) to this plane;
this is clear since the plane is defined as all x for which θ^T x = 0. The vector θ, as the
normal to the plane, also specifies the direction in the image space along which the value of
θ^T x would increase the most. Figure 1 below tries to illustrate these concepts.
[Figure 1: A linear classifier through origin. Training images (marked x) on the side of the decision boundary θ^T x = 0 where θ^T x > 0 are labeled +1, and those on the side where θ^T x < 0 are labeled -1.]
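These geometric facts can also be verified numerically. A minimal sketch with our own numbers: two points satisfying θ^T x = 0 lie on the decision boundary, and the direction connecting them, which lies within the plane, is orthogonal to θ.

```python
import numpy as np

theta = np.array([1.0, -2.0, 1.0])   # a parameter vector in d = 3 dimensions

# two points on the decision boundary, i.e., theta^T x = 0
x1 = np.array([2.0, 1.0, 0.0])       # 2 - 2 + 0 = 0
x2 = np.array([0.0, 1.0, 2.0])       # 0 - 2 + 2 = 0
print(theta @ x1, theta @ x2)        # both zero: the plane passes through both points

# x1 - x2 is a direction lying within the plane, hence orthogonal to theta
print(theta @ (x1 - x2))             # zero: theta is normal to the plane
```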
Before moving on, let's figure out whether we lost some useful properties of images as a
result of restricting ourselves to linear classifiers. In fact, we did. Consider, for example,
how nearby pixels in face images relate to each other (e.g., continuity of skin). This
information is completely lost. The linear classifier is perfectly happy (i.e., its ability
to classify images remains unchanged) if we get images where the pixel positions have
been reordered, provided that we apply the same transformation to all the images. This
permutation of pixels merely reorders the terms in the argument to the sign function in Eq.
(2). A linear classifier therefore does not have access to information about which pixels are
close to each other in the image.
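A quick numerical check of this invariance (a sketch with our own random data): applying the same permutation to the pixels of an image and to the coordinates of θ merely reorders the terms θ_1 x_1 + ... + θ_d x_d, so the argument of the sign function is unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=5)          # parameters of some linear classifier
x = rng.normal(size=5)              # an "image" of d = 5 pixels

perm = rng.permutation(5)           # one fixed reordering of the pixel positions
# permuting both the image and the parameters leaves theta^T x unchanged
print(np.allclose(theta @ x, theta[perm] @ x[perm]))  # True
```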
Learning algorithm: the perceptron
Now that we have chosen a function class (perhaps suboptimally) we still have to find a
specific function in this class that works well on the training set. This is often referred to
as the estimation problem. Let's be a bit more precise. We'd like to find a linear classifier
that makes the fewest mistakes on the training set. In other words, we'd like to find θ that
minimizes the training error

    E(θ) = (1/n) Σ_{t=1}^n 1{ y_t ≠ f(x_t; θ) } = (1/n) Σ_{t=1}^n Loss(y_t, f(x_t; θ))    (3)
where Loss(y, y') = 1 if y ≠ y' and 0 otherwise; this is known as the zero-one loss. A simple
mistake-driven algorithm for minimizing this error is the perceptron: starting from some
initial setting of the parameters (e.g., θ = 0), we cycle through the training images and
update the parameters according to

    θ' = θ + y_t x_t   if y_t ≠ f(x_t; θ)    (4)
In other words, the parameters (classifier) are changed only if we make a mistake. These
updates tend to correct mistakes. To see this, note that when we make a mistake the sign
of θ^T x_t disagrees with y_t and the product y_t θ^T x_t is negative; the product is positive for
correctly classified images. Suppose we make a mistake on x_t. Then the updated parameters
are given by θ' = θ + y_t x_t, written here in a vector form. If we consider classifying the
same image x_t after the update, then

    y_t θ'^T x_t = y_t (θ + y_t x_t)^T x_t = y_t θ^T x_t + y_t² x_t^T x_t = y_t θ^T x_t + ‖x_t‖²    (5)
In other words, the value of y_t θ^T x_t increases as a result of the update (becomes more
positive). If we consider the same image repeatedly, then we will necessarily change the
parameters such that the image is classified correctly, i.e., the value of y_t θ^T x_t becomes
positive. Mistakes on other images may steer the parameters in different directions, so it
may not be clear that the algorithm converges to something useful if we repeatedly cycle
through the training images.
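The computation in Eq. (5) is easy to confirm numerically. Below is a small check with made-up numbers (our own, not from the lecture): after an update on a misclassified image, y_t θ^T x_t grows by exactly ‖x_t‖².

```python
import numpy as np

theta = np.array([1.0, -1.0])
x_t = np.array([2.0, 3.0])
y_t = 1                               # with this theta, y_t * theta^T x_t = -1 < 0: a mistake

before = y_t * (theta @ x_t)          # value before the update
theta_new = theta + y_t * x_t         # the perceptron update, Eq. (4)
after = y_t * (theta_new @ x_t)       # value after the update

# the increase equals ||x_t||^2 = 4 + 9 = 13, as Eq. (5) predicts
print(before, after, after - before)
```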
Analysis of the perceptron algorithm
The perceptron algorithm ceases to update the parameters only when all the training
images are classified correctly (no mistakes, no updates). So, if the training images are
possible to classify correctly with a linear classifier, will the perceptron algorithm find such
a classifier? Yes, it does, and it will converge to such a classifier in a finite number of
updates (mistakes). We'll show this in lecture 2.
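While the formal proof is deferred to lecture 2, the behavior is easy to observe empirically. The sketch below (our own toy data and helper names, assuming numpy) cycles the perceptron through a linearly separable training set until a full pass produces no updates, at which point the training error E(θ) of Eq. (3) is zero:

```python
import numpy as np

def f(theta, x):
    """f(x; theta) = sign(theta^T x), with the convention sign(0) = +1."""
    return 1 if theta @ x >= 0 else -1

def training_error(theta, X, y):
    """E(theta) of Eq. (3): fraction of misclassified training images."""
    return sum(f(theta, x_t) != y_t for x_t, y_t in zip(X, y)) / len(y)

# a linearly separable toy training set (label = sign of the first coordinate)
X = np.array([[2.0, 1.0], [1.0, -1.0], [-1.0, 0.5], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

theta = np.zeros(2)
mistakes = 0
while True:
    updated = False
    for x_t, y_t in zip(X, y):
        if f(theta, x_t) != y_t:     # mistake: apply the update of Eq. (4)
            theta = theta + y_t * x_t
            mistakes += 1
            updated = True
    if not updated:                   # a clean pass: no more updates, converged
        break

print(mistakes, theta, training_error(theta, X, y))
```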