
Machine Learning Notes

Sarah Constantin October 12, 2011


Goal: find techniques relevant to the collaborative filtering problem of showing top matches. Book: Smola.

Basic Algorithms

Spam filtering is our example. x_1 ... x_m is a set of emails. Labels: spam or ham. The e-mail generating process is some joint distribution p(x, y) over emails and labels. Think of a text as a bag of words: a score for the incidence of each word.
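A minimal sketch of the bag-of-words idea (my own Python illustration, not from the book): drop the word order and keep only counts.

```python
from collections import Counter

def bag_of_words(text):
    """Represent a text purely by its word counts, ignoring order."""
    return Counter(text.lower().split())

print(bag_of_words("Cheap meds cheap offer"))
# Counter({'cheap': 2, 'meds': 1, 'offer': 1})
```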

1.1 Naive Bayes
p(y|x) = p(x|y) p(y) / p(x)

p(y) is the probability of receiving a spam or ham email: m_ham and m_spam are the numbers of ham and spam emails. We don't know p(x|y) or p(x). Likelihood ratio:

L(x) = p(spam|x) / p(ham|x) = p(x|spam) p(spam) / (p(x|ham) p(ham))

If the likelihood ratio is above a threshold, call it spam. We don't really know the conditional probabilities, so we estimate. Treat all words as having independent probabilities (they don't, but whatever) so p(w|y) can be estimated by counting the frequency of that word in that class. Store the probabilities, and use their product to approximate p(x|y). This is called Bayesian filtering.
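A rough sketch of that counting scheme (my own Python illustration; the add-one smoothing is an extra I threw in so unseen words don't zero out the product, the notes only say "estimate by counting"). Classify as spam when the log likelihood ratio is above the threshold 0.

```python
import math
from collections import Counter

def train_naive_bayes(emails, labels):
    """Count word frequencies per class; labels are 'spam' or 'ham'."""
    counts = {'spam': Counter(), 'ham': Counter()}
    priors = Counter(labels)
    for text, y in zip(emails, labels):
        counts[y].update(text.lower().split())
    return counts, priors

def log_likelihood_ratio(text, counts, priors, alpha=1.0):
    """log L(x) = log p(spam|x) - log p(ham|x), with add-alpha smoothing."""
    vocab = set(counts['spam']) | set(counts['ham'])
    score = math.log(priors['spam']) - math.log(priors['ham'])
    for w in text.lower().split():
        p_w_spam = (counts['spam'][w] + alpha) / (sum(counts['spam'].values()) + alpha * len(vocab))
        p_w_ham = (counts['ham'][w] + alpha) / (sum(counts['ham'].values()) + alpha * len(vocab))
        score += math.log(p_w_spam) - math.log(p_w_ham)
    return score

emails = ["cheap meds now", "meeting at noon", "cheap cheap offer", "lunch tomorrow"]
labels = ["spam", "ham", "spam", "ham"]
counts, priors = train_naive_bayes(emails, labels)
print(log_likelihood_ratio("cheap meds", counts, priors) > 0)  # spam if ratio above threshold
```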

1.2 Simple Classifier

Define the means μ_+ and μ_− of the two categories. For a new email, use the class label which corresponds to the closer mean, in Euclidean distance. The classification rule can be expressed with dot products:

||μ_+||² = <μ_+, μ_+> = (1/m_+²) Σ_{y_i = y_j = 1} <x_i, x_j>

<μ_+, x> = (1/m_+) Σ_{y_i = 1} <x_i, x>

So the classification rule is f(x) = Σ_i α_i <x_i, x> + b, where b = ½ [ (1/m_−²) Σ_{y_i = y_j = −1} <x_i, x_j> − (1/m_+²) Σ_{y_i = y_j = 1} <x_i, x_j> ] and α_i = y_i / m_{y_i}. This is the difference between the two distances; we care whether it's positive or negative. It may be better to map into a feature space, so long as we have a dot product. If we have a reproducing kernel Hilbert space, the inner product is in the form of a kernel, k(x, x') = <φ(x), φ(x')>. That allows us to kernelize our basic algorithm:

f(x) = Σ_i α_i k(x_i, x) + b

That takes us from a linear classifier to a nonlinear function and a nonlinear decision boundary.
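A sketch of the kernelized rule f(x) = Σ_i α_i k(x_i, x) + b with α_i = y_i / m_{y_i} (my own Python illustration; the RBF kernel and the toy data are assumptions, not from the notes).

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    return np.exp(-gamma * np.sum((a - b) ** 2))

def fit_mean_classifier(X, y, kernel=rbf_kernel):
    """Kernelized 'closer mean' rule: f(x) = sum_i alpha_i k(x_i, x) + b."""
    X, y = np.asarray(X, float), np.asarray(y)
    m_plus, m_minus = np.sum(y == 1), np.sum(y == -1)
    alpha = np.where(y == 1, 1.0 / m_plus, -1.0 / m_minus)   # alpha_i = y_i / m_{y_i}
    K = np.array([[kernel(a, b) for b in X] for a in X])
    # b = 1/2 (||mu_-||^2 - ||mu_+||^2), written with kernel evaluations
    b = 0.5 * (K[np.ix_(y == -1, y == -1)].mean() - K[np.ix_(y == 1, y == 1)].mean())
    return lambda x: np.sign(sum(a * kernel(xi, np.asarray(x, float)) for a, xi in zip(alpha, X)) + b)

X = [[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.1, 0.9]]
y = [-1, -1, 1, 1]
f = fit_mean_classifier(X, y)
print(f([0.1, 0.0]), f([1.0, 0.8]))   # expect -1.0 and 1.0
```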

Perceptron

In real life we do not always have a training set! As emails arrive, we want our algorithm to classify them, and then learn from any corrections to mistakes the system makes. Such algorithms are said to be online. Assume labels are plus or minus 1, and we assign 1 to spam and -1 to non-spam. The Perceptron maintains a weight vector w and classifies x_t according to the rule ŷ_t = sign(<w, x_t> + b), where b is some offset. This is a linear classifier that separates its domain into half spaces. If ŷ_t = y_t then no updates are made. But if ŷ_t ≠ y_t the weights are updated as w ← w + y_t x_t and b ← b + y_t. Just as before, we can replace x_i and x with φ(x_i) and φ(x) to get a kernelized version. This should converge to the correct linear classifier.

The rate of convergence depends on the margin. The margin of an observation x with associated label y is γ(x, y) = y(<w, x> + b). The margin of an entire set of observations is γ(X, Y) = min_i γ(x_i, y_i).

The margin measures the distance from x to the hyperplane defined by {x : <w, x> + b = 0}. That is, it's a measure of how close it is to the dividing line.
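A minimal online version of the update rule above (my own Python illustration; the toy stream is made up, and I break ties at the decision boundary toward +1).

```python
import numpy as np

def perceptron(stream):
    """Online perceptron: predict each example, update w and b only on mistakes."""
    w, b = None, 0.0
    for x, y in stream:                       # y in {+1, -1}
        x = np.asarray(x, float)
        if w is None:
            w = np.zeros_like(x)
        y_hat = 1.0 if np.dot(w, x) + b >= 0 else -1.0
        if y_hat != y:                        # mistake: w <- w + y x, b <- b + y
            w += y * x
            b += y
    return w, b

data = [([1.0, 1.0], 1), ([2.0, 1.5], 1), ([-1.0, -1.0], -1), ([-2.0, -0.5], -1)]
w, b = perceptron(data * 5)                   # a few passes over the toy stream
print(w, b)
```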

Novikoff's Theorem. Let (X, Y) be a dataset with at least one example labeled +1 and one example labeled -1. Let R = max_t ||x_t||, and assume that there exists (w*, b*) such that ||w*|| = 1 and γ ≤ y_t(<w*, x_t> + b*) for all t. Then the Perceptron will make at most (1 + R²)(1 + (b*)²) / γ² mistakes. THIS DOES NOT DEPEND ON DIMENSIONALITY.

Proof: Without loss of generality, we can ignore iterations with no mistakes and no updates. Also assume we started with w_0 = 0 and b_0 = 0. Using the facts w_t = w_{t-1} + y_t x_t and b_t = b_{t-1} + y_t,

<w_t, w*> + b_t b* = <w_{t-1}, w*> + b_{t-1} b* + y_t(<x_t, w*> + b*) ≥ <w_{t-1}, w*> + b_{t-1} b* + γ

By induction, it follows that <w_t, w*> + b_t b* ≥ tγ. On the other hand, we made an update because y_t(<x_t, w_{t-1}> + b_{t-1}) < 0 (i.e. the classifier got it wrong). By using y_t² = 1, we get

||w_t||² + b_t² = ||w_{t-1}||² + b_{t-1}² + y_t²||x_t||² + 1 + 2y_t(<w_{t-1}, x_t> + b_{t-1}) ≤ ||w_{t-1}||² + b_{t-1}² + ||x_t||² + 1

Since ||x_t||² ≤ R², we can apply induction to conclude ||w_t||² + b_t² ≤ t[R² + 1]. Combining the upper and lower bounds, we get

tγ ≤ <w_t, w*> + b_t b* ≤ ||(w_t, b_t)|| ||(w*, b*)|| = sqrt(||w_t||² + b_t²) sqrt(1 + (b*)²) ≤ sqrt(t(R² + 1)) sqrt(1 + (b*)²)

so t ≤ (1 + R²)(1 + (b*)²) / γ².

This gives the result. The Perceptron was the beginning of neural networks. The standard neural networks approach was to make lots of nodes; our approach is to increase the complexity of the feature map.

2.1 K-means

This has the advantage of being UNSUPERVISED. No labels. Define prototype vectors μ_1 ... μ_k and indicator variables r_ij, where r_ij = 1 if and only if x_i is assigned to cluster j. We want to minimize the distortion

J(r, μ) = ½ Σ_{i=1}^m Σ_{j=1}^k r_ij ||x_i − μ_j||²

Two-stage strategy for minimizing over r and μ:

1. Keep μ fixed and determine r. This decomposes into m independent problems: r_ij = 1 if j = argmin_{j'} ||x_i − μ_{j'}||².
2. Keep r fixed and determine μ; you minimize J as a quadratic function of μ, with optimum μ_j = Σ_i r_ij x_i / Σ_i r_ij.

Sample mean! Algorithm stops when this stops changing.
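A sketch of those two alternating steps (my own Python illustration; the random initialization from the data points and the empty-cluster guard are assumptions the notes don't discuss).

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Alternate: assign each point to its closest prototype, then reset each
    prototype to the sample mean of its cluster; stop when prototypes stop changing."""
    X = np.asarray(X, float)
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)]            # initial prototypes
    for _ in range(n_iter):
        d = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # squared distances
        r = d.argmin(axis=1)                                     # assignment step
        new_mu = np.array([X[r == j].mean(axis=0) if np.any(r == j) else mu[j]
                           for j in range(k)])                   # update step: sample means
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return mu, r                                                 # prototypes and last assignment

X = [[0, 0], [0.1, 0.2], [5, 5], [5.2, 4.9]]
mu, r = kmeans(X, k=2)
print(mu, r)
```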

Density Estimation

I'm skipping standard statistics and probability theory I already know, but it's there. Hoeffding's Theorem: if the X_i are iid with bounded range inside [a, b] and mean μ, and X̄ is their sample average, then

P(|X̄ − μ| > ε) ≤ 2 exp(−2mε² / (b − a)²)

This is EXPONENTIALLY better than the Chebyshev or Markov inequalities. This is NICE WORK IF YOU CAN GET IT. Hoeffding bound: if you want confidence 1 − δ of having |X̄ − μ| < ε, we need

ε ≥ |b − a| sqrt((log 2 − log δ) / 2m)
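A quick numerical check of what the bound demands (my own Python arithmetic, assuming a [0, 1]-valued variable): solve 2 exp(−2mε²/(b−a)²) ≤ δ for m.

```python
import math

def hoeffding_sample_size(eps, delta, a=0.0, b=1.0):
    """Smallest m with 2 exp(-2 m eps^2 / (b - a)^2) <= delta."""
    return math.ceil((b - a) ** 2 * math.log(2.0 / delta) / (2.0 * eps ** 2))

print(hoeffding_sample_size(0.1, 0.05))    # ~185 samples
print(hoeffding_sample_size(0.01, 0.05))   # ~18445: eps / 10 costs roughly 100x the samples
```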

To improve our confidence interval by a factor of 10, we need 100 times as many observations. This bound is tight, but there are even better bounds if you also have a bound on the variance of the random variable. Curse of dimensionality: as the dimension of the vectors to estimate increases, we require exponentially more samples to do density estimation. Given a histogram, we can smooth it out with a smoothing kernel; this is called the Parzen windows estimate:

p(x) = (1 / (m r^d)) Σ_{i=1}^m h((x − x_i) / r)

Popular choices for h:

h(x) = (2π)^{-1/2} e^{-x²/2}, the Gaussian kernel
h(x) = (3/4) max(0, 1 − x²), the Epanechnikov kernel
h(x) = (1/2) e^{-|x|}, the Laplace kernel
h(x) = (1/2) χ_{[-1,1]}(x), the uniform kernel, and so on.

The Epanechnikov kernel has the attractive (yeah baby!) property of compact support. Basically this is a nice convolution. Density estimation can be used to perform classification and regression. If you have a density estimate p(x) given your observations, you can do Bayes:

p(y|x) = p(x|y) p(y) / p(x) = [ (m_y / m) · (1 / (m_y r^d)) Σ_{y_i = y} h((x − x_i)/r) ] / [ (1 / (m r^d)) Σ_{i=1}^m h((x − x_i)/r) ]

This is problematic because it may require sums over a large number of observations. For binary classification we can simplify this. For p(y = 1|x) > 0.5 we estimate y = 1, and in the converse case we estimate y = -1.

f(x) = [ Σ_i y_i h((x − x_i)/r) ] / [ Σ_i h((x − x_i)/r) ] = Σ_i y_i w_i(x), where w_i(x) = h((x − x_i)/r) / Σ_j h((x − x_j)/r)

This is a weighted combination of the labels, with weights which depend on the proximity of x to an observation x_i. In other words it's a smoothed-out nearest neighbor classifier. It's called the Watson-Nadaraya estimator.
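A sketch of both estimators in one dimension (my own Python illustration using the Gaussian kernel; the data and bandwidth r are made up).

```python
import numpy as np

def gaussian_kernel(z):
    return np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)

def parzen_density(x, X, r):
    """p(x) = 1/(m r^d) sum_i h((x - x_i)/r), here with d = 1."""
    X = np.asarray(X, float)
    return gaussian_kernel((x - X) / r).sum() / (len(X) * r)

def watson_nadaraya(x, X, y, r):
    """Smoothed nearest-neighbor rule: weighted average of the labels."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    w = gaussian_kernel((x - X) / r)
    return np.dot(w / w.sum(), y)

X = [0.0, 0.2, 0.4, 2.0, 2.2, 2.4]
y = [-1, -1, -1, 1, 1, 1]
print(parzen_density(0.3, X, r=0.5))
print(np.sign(watson_nadaraya(1.8, X, y, r=0.5)))   # close to the +1 cluster
```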

3.1 Exponential Family
p(x; θ) = p_0(x) exp(<φ(x), θ> − g(θ))

p_0 is called the base measure, φ(x) is a map to the sufficient statistics, θ is called the natural parameter (it lives in the dual space), and g(θ) is a normalization constant known as the log-partition function:

g(θ) = log ∫ exp(<φ(x), θ>) dx

Binary model: the options are 0 or 1 and φ(x) = x. Then g(θ) = log(e^0 + e^θ) = log(1 + e^θ), so p(x = 0; θ) = 1/(1 + e^θ) and p(x = 1; θ) = e^θ/(1 + e^θ): different Bernoulli distributions. The log-partition function is convex. Moreover, ∂_θ g(θ) = E[φ(x)] and ∂²_θ g(θ) = Var[φ(x)]. g is referred to as the cumulant-generating function; higher derivatives generate higher order cumulants of φ(x) under p(x; θ).
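A quick numerical sanity check of those two identities for the Bernoulli case (my own Python sketch; the particular θ is arbitrary).

```python
import numpy as np

def g(theta):
    """Log-partition function of the binary model: g(theta) = log(1 + e^theta)."""
    return np.log1p(np.exp(theta))

theta, eps = 0.7, 1e-4
p1 = np.exp(theta) / (1.0 + np.exp(theta))                       # p(x = 1; theta) = E[phi(x)]
dg = (g(theta + eps) - g(theta - eps)) / (2 * eps)               # numerical first derivative
d2g = (g(theta + eps) - 2 * g(theta) + g(theta - eps)) / eps**2  # numerical second derivative
print(dg, p1)              # dg/dtheta matches the mean
print(d2g, p1 * (1 - p1))  # d2g/dtheta2 matches the variance
```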

Reproducing Kernel Hilbert Spaces

Can I use them for regression? Can I use them online? YES, say Kivinen, Smola, and Williamson. There exists a kernel k : X × X → R and a dot product such that <f(·), k(x, ·)> = f(x). That's the reproducing property: the inner product with the kernel gives you the original value back. H is the closure of the span of all k(x, ·) with x ∈ X. All functions are linear combinations of kernel functions. Regularizer: Ω[f] = ½||f||², and of course ∂_f Ω[f] = f. For the evaluation functional e_x[f] = f(x), the derivative, using the reproducing property, is k(x, ·).

We have pairs of observations (x_i, y_i). [In our case: x_i is a person's responses on a questionnaire, y_i is that person's compatibility score relative to some target.] Our aim is to predict the likely outcome y at a location x. We want to minimize a loss function which penalizes the deviation between an observation y at location x and the prediction f(x), based on a set of observations. Minimize the empirical risk plus an additional regularization term to avoid overly complex hypotheses:

R_reg[f] = (1/m) Σ_i c(x_i, y_i, f(x_i)) + λ Ω[f]

Minimizing this can be slow if the number of observations is large; you use a form of stochastic gradient descent: f ← f − η ∂_f R[f, t], where R[f, t] = c(x_t, y_t, f(x_t)) + λ Ω[f] and η is the learning rate. So

f ← f − η (c'(x_t, y_t, f(x_t)) k(x_t, ·) + λ f) = (1 − ηλ) f − η c'(x_t, y_t, f(x_t)) k(x_t, ·)

We can do this computationally as follows: express f as a kernel expansion

f(x) = Σ_i α_i k(x_i, x)

where the x_i are previously seen training patterns. Then we get α_t ← −η c'(x_t, y_t, f(x_t)) and α_i ← (1 − ηλ) α_i for i ≠ t. That is, at each iteration the kernel expansion can grow by one term. Once we have computed f(x_t), α_t is obtained from the derivative of c at (x_t, y_t, f(x_t)). At each iteration the coefficients α_i with i ≠ t are shrunk by (1 − ηλ).

One loss function is the soft margin, given by c(x, y, g(x)) = max(0, 1 − y g(x)), where g(x) = f(x) + b and b is the intercept. The update equations become: if y_t g(x_t) < 1, then α_i ← (1 − ηλ) α_i, α_t ← η y_t, and b ← b + η y_t; otherwise α_i ← (1 − ηλ) α_i, α_t ← 0, and b ← b.

In regression, the squared loss function is common: c = ½(y − f(x))². The update is (α_i, α_t) ← ((1 − ηλ) α_i, η(y_t − f(x_t))): store the prediction error on every observation.
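A sketch of the soft-margin update loop described above (my own Python illustration; the RBF kernel, the rates η and λ, and the toy stream are assumptions).

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    return np.exp(-gamma * np.sum((np.asarray(a, float) - np.asarray(b, float)) ** 2))

def online_soft_margin(stream, lam=0.01, eta=0.1, kernel=rbf):
    """Online learning in an RKHS with the soft-margin loss.
    g(x) = f(x) + b with f(x) = sum_i alpha_i k(x_i, x).
    Each step: shrink old coefficients by (1 - eta*lam); if y*g(x) < 1,
    append alpha_t = eta*y and move the intercept, otherwise append 0."""
    xs, alphas, b = [], [], 0.0
    for x, y in stream:
        gx = sum(a * kernel(xi, x) for a, xi in zip(alphas, xs)) + b
        alphas = [(1.0 - eta * lam) * a for a in alphas]
        if y * gx < 1.0:
            alphas.append(eta * y)
            b += eta * y
        else:
            alphas.append(0.0)
        xs.append(x)
    return lambda x: sum(a * kernel(xi, x) for a, xi in zip(alphas, xs)) + b

stream = [([0.0, 0.0], -1), ([1.0, 1.0], 1), ([0.1, 0.2], -1), ([0.9, 1.1], 1)] * 3
g = online_soft_margin(stream)
print(np.sign(g([0.0, 0.1])), np.sign(g([1.0, 0.9])))   # expect -1.0 and 1.0
```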
