
Notes from CaltechX Learning From Data

Nick Schandler
NSchandler@gmail.com

December 2016

1 Feasibility of Learning

1.1 Hoeffding Inequality

Consider an unknown target function $f : X \to Y$ and a hypothesis $h \approx f$ selected from a hypothesis set H.
The Hoeffding inequality states that for a given hypothesis h,

$$P[|\nu - \mu| > \epsilon] \le 2e^{-2\epsilon^2 N}$$

where $\nu$ is the sample frequency and $\mu$ is the population frequency of $h(x) = f(x)$. The sample frequency and
population frequency can be replaced by $E_{in}(h)$ and $E_{out}(h)$, respectively, to give us:

$$P[|E_{in}(h) - E_{out}(h)| > \epsilon] \le 2e^{-2\epsilon^2 N}$$

Because the Hoeffding inequality only applies to a single hypothesis, if we select a given hypothesis g
from a finite hypothesis set H with M hypotheses, we can bound $P[|E_{in}(g) - E_{out}(g)| > \epsilon]$ using the union
bound:

$$P[|E_{in}(g) - E_{out}(g)| > \epsilon] \le 2Me^{-2\epsilon^2 N}$$
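
The bound is easy to check by simulation. Below is a minimal sketch (not from the course) that estimates $P[|\nu - \mu| > \epsilon]$ for repeated samples from a biased coin; the values of N, eps, and mu are arbitrary illustrative choices:

import numpy as np

rng = np.random.default_rng(0)
N, eps, mu = 100, 0.1, 0.5                  # sample size, tolerance, true frequency
trials = 100_000                            # number of repeated experiments

nu = rng.binomial(N, mu, size=trials) / N   # in-sample frequencies
empirical = np.mean(np.abs(nu - mu) > eps)
bound = 2 * np.exp(-2 * eps**2 * N)
print(f"P[|nu - mu| > eps] ~ {empirical:.4f} <= bound {bound:.4f}")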

1.2 Extension to Infinite Hypothesis Sets

1.2.1 Growth Function

To extend the Hoeffding Inequality to infinite hypothesis sets, we introduce the growth function $m_H(N)$,
which counts the maximum number of dichotomies on any N points:

$$m_H(N) = \max_{x_1, \ldots, x_N \in X} |H(x_1, \ldots, x_N)|$$

Characteristics of the growth function include:

1. The number of dichotomies is bounded: $|H(x_1, x_2, \ldots, x_N)| \le m_H(N) \le 2^N$
2. If $m_H$ has a finite break point k, then $m_H(N) \le \sum_{i=0}^{k-1} \binom{N}{i}$ (a numerical check appears after the definitions below)

We also introduce a few definitions:


1. If a hypothesis set H can achieve all $2^N$ dichotomies on a set of N points, these N points are shattered
by H.
2. If no data set of size k can be shattered by H, then k is a break point for H and $m_H(k) < 2^k$.
3. The Vapnik-Chervonenkis (VC) dimension of a hypothesis set H, $d_{vc}(H)$, is the largest value of N for
which $m_H(N) = 2^N$. If $m_H(N) = 2^N$ for all N, then $d_{vc}(H) = \infty$.
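
To make the polynomial bound above concrete, the following sketch evaluates $\sum_{i=0}^{k-1} \binom{N}{i}$ (the Sauer bound) and compares it to $2^N$; the choice k = 4 (the break point of the 2D perceptron, which has $d_{vc} = 3$) is illustrative:

from math import comb

def growth_bound(N: int, k: int) -> int:
    # Sauer bound: m_H(N) <= sum_{i=0}^{k-1} C(N, i) when k is a break point
    return sum(comb(N, i) for i in range(k))

k = 4                                  # e.g. the 2D perceptron (d_vc = 3)
for N in (3, 4, 10, 100):
    print(N, growth_bound(N, k), 2**N)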

1.2.2 Vapnik-Chervonenkis Inequality

We now use the growth function to establish the Vapnik-Chervonenkis Inequality, which shows that
learning is possible, at least in a probabilistic sense.

First, we establish that $m_H(N)$ can replace M for infinite hypothesis sets, albeit with some modifications to
various constants (proof in Appendix B). This gives us the Vapnik-Chervonenkis Inequality:

$$P[|E_{in}(h) - E_{out}(h)| > \epsilon] \le 4 m_H(2N) e^{-\frac{1}{8}\epsilon^2 N}$$

Next, we establish that if $m_H(k) < 2^k$ for some k, then $m_H(N) \le \sum_{i=0}^{k-1} \binom{N}{i}$ for all N (proof in Appendix B). This shows
that $m_H(N)$ is bounded by a polynomial in N. Since $m_H(N)$ is bounded by a polynomial in N, the right-hand side of the Vapnik-
Chervonenkis Inequality goes to 0 as $N \to \infty$, establishing that learning is possible in the case of infinite
hypothesis sets with a finite VC dimension.

Rearranging the VC Inequality gives us a formula for the generalization bound, which holds with probability $1 - \delta$:

$$E_{out} \le E_{in} + \sqrt{\frac{8}{N} \ln \frac{4 m_H(2N)}{\delta}}$$
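
As a worked example, the error bar above can be evaluated numerically. The sketch below additionally assumes the standard polynomial bound $m_H(2N) \le (2N)^{d_{vc}} + 1$, which is not derived in these notes:

import numpy as np

def vc_error_bar(N: int, dvc: int, delta: float) -> float:
    # sqrt((8/N) ln(4 m_H(2N) / delta)), with m_H(2N) <= (2N)**dvc + 1
    mH_2N = (2.0 * N) ** dvc + 1.0
    return np.sqrt(8.0 / N * np.log(4.0 * mH_2N / delta))

# e.g., a hypothesis set with d_vc = 3, N = 10,000 points, 95% confidence
print(vc_error_bar(10_000, 3, 0.05))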

1.3 Bias-Variance Decomposition

The VC analysis was one approach to quantifying the approximation-generalization tradeoff, though it
was developed only for the case of binary targets. While it could be extended to real-valued targets, the
complementary approach of bias-variance decomposition is well-suited to this task and can provide additional
insights.
The bias-variance decomposition decomposes Eout into two terms:
1. How well H can approximate f
2. How well we can zoom in on a good $h \in H$
The bias-variance decomposition applies to real-valued targets and uses squared error as the loss function.
We work under a framework in which we calculate the squared error at each point x, averaged across all data
sets D. The error of a hypothesis g across all x and across all data sets D is:

$$E_D[E_{out}(g^{(D)})] = E_D[E_X[(g^{(D)}(x) - f(x))^2]] = E_X[E_D[(g^{(D)}(x) - f(x))^2]]$$

where $g^{(D)}$ indicates that the hypothesis g is a function of the data set D that is chosen from the population
of data sets. We evaluate

$$\begin{aligned}
E_D[(g^{(D)}(x) - f(x))^2] &= E_D[(g^{(D)}(x) - \bar{g}(x) + \bar{g}(x) - f(x))^2] \\
&= E_D[(g^{(D)}(x) - \bar{g}(x))^2 + (\bar{g}(x) - f(x))^2 + 2(g^{(D)}(x) - \bar{g}(x))(\bar{g}(x) - f(x))] \\
&= E_D[(g^{(D)}(x) - \bar{g}(x))^2] + (\bar{g}(x) - f(x))^2
\end{aligned}$$

where $\bar{g}(x)$ is the average hypothesis: $\bar{g}(x) = E_D[g^{(D)}(x)]$. (The cross term vanishes because $E_D[g^{(D)}(x) - \bar{g}(x)] = 0$.)


$E_D[(g^{(D)}(x) - \bar{g}(x))^2]$ is $\mathrm{var}(x)$ and $(\bar{g}(x) - f(x))^2$ is $\mathrm{bias}(x)$. Therefore,

$$E_D[E_{out}(g^{(D)})] = E_X[E_D[(g^{(D)}(x) - f(x))^2]] = E_X[\mathrm{bias}(x) + \mathrm{var}(x)] = \mathrm{bias} + \mathrm{var}$$
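
The decomposition can be estimated by simulation: fit a model on many independent data sets, average the resulting hypotheses to get $\bar{g}$, and measure bias and var on a grid of x values. In this sketch the target sin(πx), the constant hypothesis set, and N = 2 are illustrative choices, not prescribed by the notes:

import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return np.sin(np.pi * x)                 # illustrative target function

xs = np.linspace(-1, 1, 1001)                # grid approximating E_X[.]
n_datasets, N = 10_000, 2                    # many data sets D, each of size N

# Hypothesis set: constant models h(x) = b; least squares gives b = mean(y)
preds = np.empty((n_datasets, xs.size))
for d in range(n_datasets):
    x_train = rng.uniform(-1, 1, N)
    preds[d] = f(x_train).mean()             # g^(D) is a constant function

g_bar = preds.mean(axis=0)                   # average hypothesis g_bar(x)
bias = np.mean((g_bar - f(xs)) ** 2)         # E_X[(g_bar(x) - f(x))^2]
var = np.mean(preds.var(axis=0))             # E_X[E_D[(g^(D)(x) - g_bar(x))^2]]
print(f"bias = {bias:.3f}  var = {var:.3f}  expected E_out = {bias + var:.3f}")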

1.3.1 Bias-Variance Tradeoff with Stochastic Noise

2 Models

2.1 Perceptron Learning Algorithm

For a binary classification problem, the perceptron learning algorithm (PLA) attempts to find a hyperplane that will
linearly separate the data such that $y = \mathrm{sign}(w^T x)$. The PLA is as follows:
1. Initialize w with arbitrary weights
2. Pick any misclassified point: $y_n \ne \mathrm{sign}(w^T x_n)$
3. Update the weight vector to be $w \leftarrow w + y_n x_n$, and repeat from step 2
Provided that the data is linearly separable, so that there exists some w for which $y_n = \mathrm{sign}(w^T x_n)$ for all n, the
PLA will converge.
If the data is not linearly separable, the perceptron will never converge. However, we can instead
use the pocket algorithm, in which we run the PLA for a set number of iterations and, from the models
produced after each iteration, pick the one with the lowest Ein (breaking ties in some arbitrary
manner).
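
A minimal sketch of the PLA in Python; the random choice among misclassified points and the iteration budget are implementation choices, and tracking the best-Ein weights across iterations would give the pocket algorithm:

import numpy as np

def pla(X, y, max_iters=10_000, rng=None):
    """Perceptron learning algorithm. X is N x (d+1) with a leading column
    of 1s; y is in {-1, +1}."""
    rng = rng or np.random.default_rng(0)
    w = np.zeros(X.shape[1])                 # arbitrary initial weights
    for _ in range(max_iters):
        wrong = np.where(np.sign(X @ w) != y)[0]
        if wrong.size == 0:                  # all points classified: converged
            return w
        n = rng.choice(wrong)                # pick any misclassified point
        w = w + y[n] * X[n]                  # PLA update
    return w                                 # not separated within the budget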

2.2 Linear Regression

Linear regression attempts to approximate real-valued targets with $h(x) = w^T x$, and uses squared error
$(h(x) - f(x))^2$ as its loss measure. In matrix form:

$$E_{in}(w) = \frac{1}{N} \|Xw - y\|^2$$

where

$$X = \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{bmatrix}, \quad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}$$

Minimizing Ein gives

$$\nabla E_{in}(w) = \frac{2}{N} X^T (Xw - y) = 0$$
$$X^T X w = X^T y$$
$$w = X^{\dagger} y$$

where $X^{\dagger} = (X^T X)^{-1} X^T$ is the pseudo-inverse of X.
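
This solution is a one-liner in numpy: np.linalg.pinv computes the pseudo-inverse (and also covers the case where $X^T X$ is singular, which the explicit formula does not). The synthetic data below is purely illustrative:

import numpy as np

def linear_regression(X, y):
    # w = X_dagger y
    return np.linalg.pinv(X) @ y

# Illustrative usage on synthetic data (leading column of 1s for the bias)
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.uniform(-1, 1, (50, 2))])
y = X @ np.array([0.5, -1.0, 2.0]) + 0.1 * rng.standard_normal(50)
print(linear_regression(X, y))   # close to [0.5, -1.0, 2.0]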

2.3 Logistic Regression

Consider a target function $f : \mathbb{R}^d \to [0, 1]$ that is not an outcome, but a probability:

$$P(y|x) = \begin{cases} f(x) & y = +1 \\ 1 - f(x) & y = -1 \end{cases}$$

We model $\theta(s) = \frac{e^s}{1 + e^s}$ and $h(x) = \theta(w^T x)$, selecting g(x) based on the criterion of maximum likelihood.
We note that $\theta(-s) = 1 - \theta(s)$, giving us $P(y|x) = \theta(y w^T x)$.


The likelihood of $D = (x_1, y_1), \ldots, (x_N, y_N)$ is $\prod_{n=1}^{N} P(y_n|x_n) = \prod_{n=1}^{N} \theta(y_n w^T x_n)$. Maximizing the likelihood is
equivalent to:

$$\text{Minimize } -\frac{1}{N} \ln\left(\prod_{n=1}^{N} \theta(y_n w^T x_n)\right) = \frac{1}{N} \sum_{n=1}^{N} \ln\left(\frac{1}{\theta(y_n w^T x_n)}\right)$$

We take as our error measure the "cross-entropy error"

$$E_{in}(w) = \frac{1}{N} \sum_{n=1}^{N} \ln\left(1 + e^{-y_n w^T x_n}\right)$$

(the quotes indicate this is not technically a cross-entropy error). Gradient descent can be used to minimize
the error function numerically. We initialize w(0) (typically to be the zero vector) and compute the gradient as

$$\nabla E_{in} = -\frac{1}{N} \sum_{n=1}^{N} \frac{y_n x_n}{1 + e^{y_n w(t)^T x_n}}$$
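
A minimal batch gradient descent sketch for this model; the learning rate and iteration count are arbitrary illustrative choices:

import numpy as np

def logistic_regression(X, y, eta=0.1, n_iters=5_000):
    """Batch gradient descent on E_in; y is in {-1, +1}."""
    w = np.zeros(X.shape[1])                 # w(0): the zero vector
    for _ in range(n_iters):
        s = y * (X @ w)                      # s_n = y_n w^T x_n
        grad = -(y / (1.0 + np.exp(s))) @ X / len(y)
        w = w - eta * grad                   # fixed-learning-rate step
    return w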

2.4 Neural Networks

2.5 Support Vector Machines

Support Vector Machines (SVMs) are an extension of the Perceptron Learning Algorithm. Instead of choosing any
hyperplane that linearly separates the data, we choose the unique hyperplane that maximizes the
distance to the nearest points (the margin).
Fatter margins imply fewer dichotomies, suggesting that a fatter margin corresponds to a smaller H and
better generalization.

2.5.1 Hard-Margin Support Vector Machines

A hard margin support vector machine involves a strict linear separation of data by a hyperplane. If the data
is not linearly separable, we either move to a soft-margin support vector machine, or perform a nonlinear
transformation to x (or both).
We want to maximize the distance from the separating hyperplane $w^T x = 0$ to the nearest data point $x_n$ (the margin). There are two
normalizations of the representation that we make in order to simplify the analysis later:
1. We normalize w so that $|w^T x_n| = 1$ for the nearest point $x_n$. This can be done because if $w^T x = 0$ defines a hyperplane, so
does any multiple $c \cdot w^T x = 0$. This normalization allows simpler interpretations of w later.
2. We pull $w_0$ out of w and label it b. The hyperplane is now $w^T x + b = 0$.
From the geometry of linear algebra, w is orthogonal to the hyperplane $w^T x + b = 0$. To get the distance between
$x_n$ and the hyperplane, we take any point x on the hyperplane and project $x_n - x$ onto w. Because $\hat{w} = \frac{w}{\|w\|}$ is a unit
vector, the length of this projection is $|\hat{w}^T(x_n - x)|$. This gives us

$$\begin{aligned}
\text{distance} &= \frac{1}{\|w\|} |w^T x_n - w^T x| \\
&= \frac{1}{\|w\|} |w^T x_n + b - w^T x - b| \\
&= \frac{1}{\|w\|}
\end{aligned}$$

where the last line follows because $w^T x + b = 0$ for any x on the plane and $|w^T x_n + b| = 1$ from our
normalization of w. Our optimization problem is therefore:
$$\begin{aligned}
\text{Maximize } \quad & \frac{1}{\|w\|} \\
\text{subject to } \quad & \min_{n=1,2,\ldots,N} |w^T x_n + b| = 1 \\
& w \in \mathbb{R}^d, \, b \in \mathbb{R}
\end{aligned}$$

Or equivalently,

$$\begin{aligned}
\text{Minimize } \quad & \frac{1}{2} w^T w \\
\text{subject to } \quad & y_n(w^T x_n + b) \ge 1 \text{ for } n = 1, 2, \ldots, N \\
& w \in \mathbb{R}^d, \, b \in \mathbb{R}
\end{aligned}$$

One subtlety is that the new constraint would allow all $y_n(w^T x_n + b) > 1$, violating the original equality
constraint. However, this will not happen at the optimum: if equality did not hold for at least one $(x_n, y_n)$, then
w and b could each be scaled down slightly, reducing the objective while still maintaining the constraints.
Because the constraints are inequalities, we must apply the Karush-Kuhn-Tucker (KKT) conditions to gen-
eralize our Lagrangian formulation. In this formulation, we want to minimize

$$\mathcal{L}(w, b, \alpha) = \frac{1}{2} w^T w - \sum_{n=1}^{N} \alpha_n \left( y_n(w^T x_n + b) - 1 \right)$$

w.r.t. w and b, and maximize the above w.r.t. $\alpha$ subject to each $\alpha_n \ge 0$. We first tackle the unconstrained
minimization:

$$\nabla_w \mathcal{L} = w - \sum_{n=1}^{N} \alpha_n y_n x_n = 0$$
$$\frac{\partial \mathcal{L}}{\partial b} = -\sum_{n=1}^{N} \alpha_n y_n = 0$$

Substituting the above into our original Lagrangian formulation, we get

$$\mathcal{L}(\alpha) = \sum_{n=1}^{N} \alpha_n - \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} y_n y_m \alpha_n \alpha_m x_n^T x_m$$

which we maximize subject to the constraints $\alpha_n \ge 0$ for $n = 1, \ldots, N$ and $\sum_{n=1}^{N} \alpha_n y_n = 0$.

The solution can be found numerically through quadratic programming, by solving

$$\min_{\alpha} \frac{1}{2} \alpha^T Q \alpha - 1^T \alpha, \quad \text{where } Q = \begin{bmatrix} y_1 y_1 x_1^T x_1 & y_1 y_2 x_1^T x_2 & \cdots & y_1 y_N x_1^T x_N \\ y_2 y_1 x_2^T x_1 & y_2 y_2 x_2^T x_2 & \cdots & y_2 y_N x_2^T x_N \\ \vdots & \vdots & \ddots & \vdots \\ y_N y_1 x_N^T x_1 & y_N y_2 x_N^T x_2 & \cdots & y_N y_N x_N^T x_N \end{bmatrix}$$

subject to:

$$y^T \alpha = 0 \quad \left( \text{from } \frac{\partial \mathcal{L}}{\partial b} = 0 \text{ above} \right)$$
$$\alpha \ge 0 \quad \text{(KKT condition)}$$

We then use the solution to find $w = \sum_{n=1}^{N} \alpha_n y_n x_n$ from $\nabla_w \mathcal{L}$ above.

From the KKT complementary slackness condition $\alpha_n (y_n(w^T x_n + b) - 1) = 0$, we see that for each point either the slack $(y_n(w^T x_n + b) - 1) = 0$,
indicating that the point lies on the margin and is a support vector, or the Lagrange multiplier $\alpha_n = 0$, indicating
that it is an interior point.

Finally, we use the knowledge that $y_n(w^T x_n + b) = 1$ for all support vectors to find b, by simply substituting in
$y_n$ and $x_n$ for any support vector.
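
A sketch of this dual QP solved numerically, assuming the cvxopt package is available and the data is linearly separable (otherwise the hard-margin dual is unbounded):

import numpy as np
from cvxopt import matrix, solvers            # assumed available

def hard_margin_svm(X, y):
    """Solve the hard-margin dual; X is N x d, y is a float array in {-1, +1}."""
    N = len(y)
    Q = np.outer(y, y) * (X @ X.T)            # Q_nm = y_n y_m x_n^T x_m
    sol = solvers.qp(
        matrix(Q), matrix(-np.ones(N)),       # min (1/2) a^T Q a - 1^T a
        matrix(-np.eye(N)), matrix(np.zeros(N)),   # -alpha <= 0, i.e. alpha >= 0
        matrix(y.reshape(1, -1)), matrix(0.0))     # y^T alpha = 0
    alpha = np.array(sol["x"]).ravel()
    w = (alpha * y) @ X                       # w = sum_n alpha_n y_n x_n
    sv = int(np.argmax(alpha))                # index of a support vector
    b = y[sv] - w @ X[sv]                     # from y_n (w^T x_n + b) = 1
    return w, b, alpha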

2.5.2 Nonlinear transformations

In performing a non-linear transformation before applying support vector machines, conceptually we are
simply performing a transformation $\Phi : X \to Z$ and then applying the SVM as usual. However, we can observe
that we do not need to explicitly construct Z, as all we use from this space is the inner product $\Phi(x)^T \Phi(x')$.
This leads to the so-called kernel trick, where we compute $K(x, x') = \Phi(x)^T \Phi(x')$ without actually performing
the transformations $\Phi(x)$ and $\Phi(x')$. Using kernels provides two main benefits:

1. Efficiency: computation costs are $O(d)$ rather than $O(\tilde{d})$, where $\tilde{d}$ is the (possibly much larger) dimension of Z
2. We can perform transformations to spaces that we cannot explicitly describe, including infinite-
dimensional Z

The only requirement for $K(x, x')$ is that it is an inner product in some space Z. It can be shown that
$K(x, x')$ is a valid kernel iff it is symmetric and satisfies Mercer's condition: the so-called kernel matrix

$$\begin{bmatrix} K(x_1, x_1) & K(x_1, x_2) & \cdots & K(x_1, x_N) \\ K(x_2, x_1) & K(x_2, x_2) & \cdots & K(x_2, x_N) \\ \vdots & \vdots & \ddots & \vdots \\ K(x_N, x_1) & K(x_N, x_2) & \cdots & K(x_N, x_N) \end{bmatrix}$$

is positive semi-definite for any $x_1, \ldots, x_N$.
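
As an illustration, the widely used RBF kernel $K(x, x') = e^{-\gamma \|x - x'\|^2}$ corresponds to an infinite-dimensional Z; the sketch below builds its kernel matrix on random data and confirms positive semi-definiteness numerically (the value of gamma and the data are arbitrary):

import numpy as np

def rbf_kernel(X1, X2, gamma=1.0):
    # K(x, x') = exp(-gamma ||x - x'||^2)
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))
K = rbf_kernel(X, X)
# Mercer's condition: the kernel matrix is positive semi-definite
print(np.linalg.eigvalsh(K).min() >= -1e-10)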

2.5.3 Soft Margin SVM

Another option for dealing with data that is not linearly separable is to allow for margin violations, i.e.
points for which $y_n(w^T x_n + b) \ge 1$ fails. Note that margin-violating points may or may not be correctly classified.
To account for these violations, we now require $y_n(w^T x_n + b) \ge 1 - \xi_n$ with $\xi_n \ge 0$. Our total violation of
the margin is $\sum_{n=1}^{N} \xi_n$. Our optimization is now:

$$\begin{aligned}
\text{Minimize } \quad & \frac{1}{2} w^T w + C \sum_{n=1}^{N} \xi_n \\
\text{s.t. } \quad & y_n(w^T x_n + b) \ge 1 - \xi_n \text{ for } n = 1, \ldots, N \\
& \xi_n \ge 0 \text{ for } n = 1, \ldots, N \\
& w \in \mathbb{R}^d, \, b \in \mathbb{R}, \, \xi \in \mathbb{R}^N
\end{aligned}$$

The Lagrange formulation of this optimization is:

$$\mathcal{L}(w, b, \xi, \alpha, \beta) = \frac{1}{2} w^T w + C \sum_{n=1}^{N} \xi_n - \sum_{n=1}^{N} \alpha_n \left( y_n(w^T x_n + b) - 1 + \xi_n \right) - \sum_{n=1}^{N} \beta_n \xi_n$$

We minimize w.r.t. w, b, and $\xi$, and maximize w.r.t. each $\alpha_n \ge 0$ and $\beta_n \ge 0$:

$$\nabla_w \mathcal{L} = w - \sum_{n=1}^{N} \alpha_n y_n x_n = 0$$
$$\frac{\partial \mathcal{L}}{\partial b} = -\sum_{n=1}^{N} \alpha_n y_n = 0$$
$$\frac{\partial \mathcal{L}}{\partial \xi_n} = C - \alpha_n - \beta_n = 0$$

The dual formulation is the same as in the hard-margin case, except that now $0 \le \alpha_n \le C$ instead of
$0 \le \alpha_n$:

$$\begin{aligned}
\text{Maximize } \quad & \mathcal{L}(\alpha) = \sum_{n=1}^{N} \alpha_n - \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} y_n y_m \alpha_n \alpha_m x_n^T x_m \\
\text{s.t. } \quad & 0 \le \alpha_n \le C \text{ for } n = 1, \ldots, N \\
& \sum_{n=1}^{N} \alpha_n y_n = 0
\end{aligned}$$

We now have two types of support vectors:

1. Margin support vectors: $y_n(w^T x_n + b) = 1$ ($\xi_n = 0$)
2. Non-margin support vectors: $y_n(w^T x_n + b) < 1$ ($\xi_n > 0$)
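
A sketch using scikit-learn's SVC (assumed available), which solves this soft-margin dual; the synthetic, slightly noisy data is illustrative. Points with $\alpha_n = C$ are the non-margin support vectors:

import numpy as np
from sklearn.svm import SVC                   # assumed available

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))
y = np.sign(X[:, 0] + X[:, 1] + 0.3 * rng.standard_normal(200))

clf = SVC(kernel="linear", C=1.0).fit(X, y)   # C trades margin size vs violations
alpha = np.abs(clf.dual_coef_).ravel()        # alpha_n for the support vectors
non_margin = np.isclose(alpha, clf.C)         # alpha_n = C  <=>  xi_n > 0
print(len(clf.support_), "support vectors,", int(non_margin.sum()), "non-margin")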

2.5.4 Generalization of SVM

Since the separating hyperplane depends only on the support vectors, we know that removing any non-
support-vector point from the data set will not change the hypothesis. Thus, the leave-one-out cross-validation
error contributed by these points is 0, and so Ecv can be at most the fraction of points that are support
vectors. We also know that leave-one-out cross-validation on N points is an unbiased estimate of Eout for a
data set of size N − 1, giving us the following property:

$$E[E_{out}] \le \frac{S}{N}$$

where S is the number of support vectors.

2.5.5 Relation of SVM and Regularization

In regularization, we optimize Ein subject to $w^T w \le C$, whereas in SVM we reverse this, minimizing $w^T w$
subject to Ein = 0. This points to the built-in regularization that is inherent in SVM.

2.6 Radial Basis Functions

3 Learning Algorithms

3.1 Gradient Descent

Gradient descent is a general method for nonlinear optimization. We start at w(0) and descend along the
gradient by iteratively updating

$$w(n+1) = w(n) - \eta \frac{\nabla E_{in}(w(n))}{\|\nabla E_{in}(w(n))\|} \quad \text{(fixed step size)}$$

In the above formulation, the step size remains fixed. However, it typically makes more sense for the
step size to scale with the size of the gradient, so that steps become smaller as we approach the function
minimum. To achieve this, we multiply the unit-length direction $-\frac{\nabla E_{in}(w(n))}{\|\nabla E_{in}(w(n))\|}$ by $\|\nabla E_{in}(w(n))\|$, so that $\eta$ now
represents a fixed learning rate. This gives us

$$w(n+1) = w(n) - \eta \nabla E_{in}(w(n)) \quad \text{(fixed learning rate)}$$

In batch gradient descent, $\nabla E_{in}$ is based on the entire sample. In stochastic gradient descent (SGD), we instead
calculate the gradient of a single example at a time. The expected direction given by this process is

$$E_n[\nabla e(h(x_n), y_n)] = \frac{1}{N} \sum_{n=1}^{N} \nabla e(h(x_n), y_n) = \nabla E_{in}$$

The benefits of stochastic gradient descent include:

1. It is computationally cheaper - we only need to calculate the gradient for a single data point at a time
2. Randomization can help avoid local minima and get us out of flat regions
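
A minimal SGD sketch; the caller supplies the pointwise error gradient, and the learning rate and epoch count are illustrative. The example gradient shown is the logistic-regression gradient from Section 2.3:

import numpy as np

def sgd(X, y, grad_e, eta=0.05, epochs=100, rng=None):
    """Stochastic gradient descent: one randomly ordered example per update.

    grad_e(w, x_n, y_n) returns the gradient of the pointwise error e."""
    rng = rng or np.random.default_rng(0)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for n in rng.permutation(len(y)):       # randomized pass over the data
            w = w - eta * grad_e(w, X[n], y[n])
    return w

# Example pointwise gradient: the logistic regression error from Section 2.3
def grad_log(w, x, yn):
    return -yn * x / (1.0 + np.exp(yn * (w @ x)))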

4 Overfitting

4.1 Deterministic Noise and Stochastic Noise

Regularization is the process of restricting our hypothesis set, either directly through a hard constraint or
indirectly through a soft constraint, in order to reduce overfitting and improve our out-of-sample performance.
We first identify two sources of noise: stochastic noise and deterministic noise. Stochastic noise is the $\epsilon$ we
typically think of as noise, the difference between our target function and the data. Deterministic noise is
the difference between the target function and our hypothesis set's best approximation of that function. It
is the part of the target function which cannot be modeled. Two main differences between stochastic noise
and deterministic noise are:

1. Deterministic noise depends on H - a more sophisticated H can better approximate f and will therefore
have lower deterministic noise.
2. Deterministic noise is fixed for a given x - given a point x, the difference between the target function
and H's best approximation is fixed.

In the bias-variance framework with stochastic noise, we had

$$E_{D,\epsilon}[(g^{(D)}(x) - y)^2] = E_{D,x}[(g^{(D)}(x) - \bar{g}(x))^2] + E_x[(\bar{g}(x) - f(x))^2] + E_{\epsilon,x}[\epsilon(x)^2]$$

where the bias is (roughly) equivalent to the deterministic noise, and $\sigma^2 = E_{\epsilon,x}[\epsilon(x)^2]$ is the stochastic noise. We
deal with these two types of noise through regularization and validation.

4.2 Regularization

Regularization typically takes the form of a soft-order constraint, in which we minimize our error function
subject to a constraint on our weights.

4.2.1 Regularizing Linear Regression

In the context of linear regression, our basic vanilla regularization minimizes

$$E_{in}(w) = \frac{1}{N} (Zw - y)^T (Zw - y) \quad \text{s.t. } w^T w \le C$$

The geometry of this constrained optimization shows us that at the solution, $\nabla E_{in}(w_{reg}) \propto -w_{reg}$, where we
define the constant of proportionality to be $-\frac{2\lambda}{N}$ so that

$$\nabla E_{in}(w_{reg}) = -\frac{2\lambda}{N} w_{reg}$$
$$\nabla E_{in}(w_{reg}) + \frac{2\lambda}{N} w_{reg} = 0$$

which is the condition for minimizing $E_{in}(w) + \frac{\lambda}{N} w^T w$, showing that the unconstrained augmented-error
formulation is equivalent to our original constrained minimization formulation. As with unconstrained
linear regression, in the case of a regularized linear regression we can still derive a closed-form
solution:

$$E_{aug}(w) = E_{in}(w) + \frac{\lambda}{N} w^T w = \frac{1}{N} \left[ (Zw - y)^T (Zw - y) + \lambda w^T w \right]$$

Minimizing $E_{aug}(w)$ by setting $\nabla E_{aug}(w) = 0$ gives:

$$Z^T (Zw - y) + \lambda w = 0$$

so

$$w_{reg} = (Z^T Z + \lambda I)^{-1} Z^T y$$
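
A minimal sketch of this closed form; np.linalg.solve is used instead of an explicit matrix inverse for numerical stability:

import numpy as np

def ridge_regression(Z, y, lam):
    # w_reg = (Z^T Z + lambda I)^{-1} Z^T y, via a linear solve
    d = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ y)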

4.2.2 General Regularization

Regularization can be used to emphasize certain weights by taking $\Omega(h) = \sum_{q=0}^{Q} \gamma_q w_q^2$, or even more generally
can also include cross-terms through use of a Tikhonov regularizer $\Omega(h) = w^T \Gamma^T \Gamma w$. Regardless of its
form, through either geometrical reasoning or basic Lagrangian theory, we can show that our constrained
minimization of Ein occurs where $\nabla E_{in} = c \nabla \Omega(h)$, where c is some constant, so

$$\text{Minimize } E_{aug} = E_{in} + \frac{\lambda}{N} \Omega(h)$$

solves

$$\text{Minimize } E_{in} \quad \text{s.t. } \Omega(h) \le C$$

This provides the theoretical justification for the better generalization properties of the augmented error, by
showing its equivalence to a smaller hypothesis set in the VC framework. We can also compare our augmented
in-sample error to our VC bound on Eout:

$$E_{aug}(h) = E_{in}(h) + \frac{\lambda}{N} \Omega(h), \quad \text{vs.} \quad E_{out}(h) \le E_{in}(h) + \Omega(H)$$

In the above, the two $\Omega$'s are different functions; we use the same letter to denote them simply to draw
attention to regularization's role as a proxy for the generalization error, showing why $E_{aug}$ is a better proxy
than $E_{in}$ for $E_{out}$.

4.3 Validation

While regularization estimates the overfit penalty of a model, validation estimates the out-of-sample error
directly. In selecting a size K for our validation set, we have two competing goals: we would like to have a
large training set of size N − K, so that $E_{out}(g) \approx E_{out}(g^-)$ (where $g^-$ is the hypothesis trained on the N − K
training points), but we would also like to have a large validation set of size K, so that $E_{out}(g^-) \approx E_{val}(g^-)$.
Accomplishing both of these goals would give us confidence that $E_{out}(g) \approx E_{val}(g^-)$.

4.3.1 Cross-validation

Through the technique of cross-validation, we can simultaneously accomplish both a large K and a large
N − K. In K-fold cross-validation, the data is split into K groups of N/K data points each. K models are fit
to the data, each one using a different group as the validation set, with the remaining K − 1 groups used
as the training set. We average the K validation errors to arrive at our cross-validation estimate $E_{cv}$ of $E_{val}(g^-)$. At the
extreme, we have leave-one-out cross-validation, in which K = N and each validation set is a single point.
While cross-validation provides an unbiased estimate of Eout (proof in Appendix B), the variance of Ecv is larger than that of an
independent sample of N validation points, due to the correlation between the fold estimates.
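
A minimal sketch of K-fold cross-validation; fit and error are caller-supplied functions, and setting n_folds equal to N gives leave-one-out:

import numpy as np

def cross_validation_error(X, y, fit, error, n_folds):
    """K-fold CV: average validation error over n_folds disjoint folds."""
    idx = np.random.default_rng(0).permutation(len(y))
    errs = []
    for fold in np.array_split(idx, n_folds):
        train = np.setdiff1d(idx, fold)            # everything outside the fold
        g_minus = fit(X[train], y[train])          # g^- trained without the fold
        errs.append(error(g_minus, X[fold], y[fold]))
    return float(np.mean(errs))                    # estimate of E_out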

A Math Background
Some results from linear algebra of particular utility to machine learning include (proofs in Appendix B):

$$\mathrm{tr}(AB) = \mathrm{tr}(BA)$$
$$\nabla_A \, \mathrm{tr}(AB) = B^T$$
$$\nabla_{A^T} f(A) = (\nabla_A f(A))^T$$
$$\nabla_A \, \mathrm{tr}(ABA^T C) = CAB + C^T A B^T$$
$$\nabla_A |A| = |A| (A^{-1})^T, \text{ for all non-singular square } A$$
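
These identities are easy to sanity-check numerically; the sketch below verifies the first and fourth on random matrices using finite differences (the tolerances and matrix sizes are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
A, B, C = (rng.standard_normal((4, 4)) for _ in range(3))

print(np.isclose(np.trace(A @ B), np.trace(B @ A)))          # tr(AB) = tr(BA)

def num_grad(f, A, eps=1e-6):
    # Finite-difference approximation of the matrix gradient of f at A
    G = np.zeros_like(A)
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            E = np.zeros_like(A)
            E[i, j] = eps
            G[i, j] = (f(A + E) - f(A - E)) / (2 * eps)
    return G

G = num_grad(lambda M: np.trace(M @ B @ M.T @ C), A)
print(np.allclose(G, C @ A @ B + C.T @ A @ B.T, atol=1e-4))  # fourth identity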

B Proofs
Proof. Proof that mH (N ) is bounded by a polynomial in N goes here

Proof. Proof that mH (N ) can replace M goes here

Proof. Proof that Ecv is an unbiased estimate of Eout goes here


Theorem 1. $\mathrm{tr}(AB) = \mathrm{tr}(BA)$

Proof.

$$\begin{aligned}
\mathrm{tr}(AB) &= (AB)_{11} + (AB)_{22} + \ldots + (AB)_{nn} \\
&= a_{11}b_{11} + a_{12}b_{21} + \ldots + a_{1n}b_{n1} \\
&\quad + a_{21}b_{12} + a_{22}b_{22} + \ldots + a_{2n}b_{n2} \\
&\quad + a_{31}b_{13} + a_{32}b_{23} + \ldots + a_{3n}b_{n3} \\
&\quad + \ldots \\
&\quad + a_{n1}b_{1n} + a_{n2}b_{2n} + \ldots + a_{nn}b_{nn}
\end{aligned}$$

Summing each column shows these are the elements of $(BA)_{11} + (BA)_{22} + \ldots + (BA)_{nn} = \mathrm{tr}(BA)$. Therefore,

$$\mathrm{tr}(AB) = \mathrm{tr}(BA)$$

Theorem 2. $\nabla_A \, \mathrm{tr}(AB) = B^T$

Proof.

$$\mathrm{tr}(AB) = \mathrm{tr} \begin{bmatrix} a_1^T b_1 & a_1^T b_2 & \cdots & a_1^T b_n \\ a_2^T b_1 & a_2^T b_2 & \cdots & a_2^T b_n \\ \vdots & \vdots & \ddots & \vdots \\ a_n^T b_1 & a_n^T b_2 & \cdots & a_n^T b_n \end{bmatrix} = \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij} b_{ji}$$

where $a_i^T$ is the i-th row of A and $b_j$ is the j-th column of B. Therefore,

$$\frac{\partial \, \mathrm{tr}(AB)}{\partial a_{ij}} = b_{ji},$$

so

$$\nabla_A \, \mathrm{tr}(AB) = B^T$$

Theorem 3. $\nabla_{A^T} f(A) = (\nabla_A f(A))^T$

Proof.
Theorem 4. $\nabla_A \, \mathrm{tr}(ABA^T C) = CAB + C^T A B^T$

Proof.

Theorem 5. $\nabla_A |A| = |A| (A^{-1})^T$, for all non-singular square A

Proof.

