Coursera Stanford Machine Learning Summary

October 9, 2017
Abstract
1 Notations
1. $m$ (or $m_{train}$) is the number of training examples (training set).
a. $m_{CV}$ is the number of cross-validation examples (CV set), on which we will find the optimal
parameters for the algorithm.
b. $m_{test}$ is the number of test examples (test set).
2. $n$ is the number of features of our examples, i.e. the number of axes (dimensions) of the data.

3. $X$ (or $X_{train}$) is the matrix of training data. Every row is a specific example (of length $n$), and
every column is a specific feature of all examples (of length $m$). X(:, j) is the j-th feature. X(i, :)
is the i-th example, which we will denote as $x^{(i)}$. $x_j^{(i)}$ is the j-th feature of the i-th example. The
size of $X$ is (m, n).
a. $X_{CV}$ is the matrix of cross-validation examples. Its size is ($m_{CV}$, n).
b. $X_{test}$ is the matrix of test examples. Its size is ($m_{test}$, n).
c. Usually we add a bias/intercept feature, a column of constant 1s, to our matrix of examples,
$X$. We'll denote the result $X_b$. Its size is (m, n + 1).
4. $y$ (or $y_{train}$) is a column vector of results. Its size is (m, 1).
a. $y_{CV}$ is a column vector of results for the CV set. Its size is ($m_{CV}$, 1).
b. $y_{test}$ is a column vector of results for the test set. Its size is ($m_{test}$, 1).
c. In some cases the result of each example will be a vector and not a single value. In this
case, $y$ will be a matrix of size (m, K).
5. $\theta$ is a column vector of coefficients for our hypothesis. Its size is (n + 1, 1), including the bias
coefficient, $\theta_0$.
a. In some cases we would like to refer to $\theta$ without the bias term, $\theta_0$. We denote it by
$\bar{\theta} = \theta(2 : n + 1, :)$, and its size is (n, 1).
b. In some cases we would like to refer to $\theta$ where the place of the bias term exists, but is
equal to 0. We denote it by $\tilde{\theta}$.
c. In some cases $\theta$ will be a matrix of size (n + 1, K).
6. $h_\theta(X)$ is a hypothesis, i.e. some function which will change in each algorithm, which takes $\theta$ as
an input and approximates the desired results of the example set $X$. Our objective is to find the
$\theta$ which gives the best hypothesis, i.e. which predicts the results best.
7. The Cost Function, $J$, is the function which tells us how far the predictions of our hypothesis
($h_\theta(X)$) are from the real results ($y$). It will usually be a function of $\theta$. We would like to find
the $\theta$ which minimizes $J$.
8. $\lambda$ is a regularization parameter.
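As a small illustration of the shapes above, here is a minimal Python sketch (plain lists, no libraries) of adding the bias column to get $X_b$. The helper name add_bias and the numbers are made up for this example.

```python
def add_bias(X):
    """Return X_b: every example (row) of X with a constant 1 prepended."""
    return [[1.0] + list(row) for row in X]

# m = 3 examples, n = 2 features (made-up values)
X = [[2.0, 3.0], [4.0, 5.0], [6.0, 7.0]]
m, n = len(X), len(X[0])

X_b = add_bias(X)                    # size (m, n + 1)
theta = [0.0] * (n + 1)              # size (n + 1, 1), theta[0] is the bias coefficient
```

Here X(i, :) corresponds to X[i - 1], since Python indexes from 0.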
2 Linear Regression
1. In this case we have continuous results that we wish to predict.
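2. Our hypothesis is
$$h_\theta(X) = \sum_{j=0}^{n} \theta_j x_j = X_b \theta$$
where $x_j$ is the j-th column of $X_b$ ($x_0$ being the column of 1s), which results in a column vector of results (m, 1).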

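3. The Cost function is
$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2 = \frac{1}{2m} (h_\theta(X) - y)^T (h_\theta(X) - y) = \frac{1}{2m} (X_b \theta - y)^T (X_b \theta - y)$$
The Cost function with regularization is
$$J(\theta) = \frac{1}{2m} \Big( \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2 + \lambda \sum_{j=1}^{n} \theta_j^2 \Big) = \frac{1}{2m} \big( (h_\theta(X) - y)^T (h_\theta(X) - y) + \lambda \tilde{\theta}^T \tilde{\theta} \big)$$
where $\tilde{\theta}$ is $\theta$ with the bias entry $\theta_0$ set to 0.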
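The regularized cost above can be sketched in plain Python as follows. This is only an illustration under assumptions: the names predict and cost and the tiny data set are made up, matrices are lists of rows, and the bias coefficient is left out of the penalty as in the formula.

```python
def predict(X_b, theta):
    """h_theta(X) = X_b * theta, one prediction per example (row)."""
    return [sum(t * x for t, x in zip(theta, row)) for row in X_b]

def cost(X_b, y, theta, lam=0.0):
    """J(theta) = (squared error + lam * sum_{j>=1} theta_j^2) / (2m)."""
    m = len(y)
    errors = [h - yi for h, yi in zip(predict(X_b, theta), y)]
    squared_error = sum(e * e for e in errors)
    penalty = lam * sum(t * t for t in theta[1:])   # theta[0] (bias) is not penalized
    return (squared_error + penalty) / (2 * m)

# made-up data lying exactly on y = 2x, with the bias column already added
X_b = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [2.0, 4.0, 6.0]

perfect_fit = cost(X_b, y, [0.0, 2.0])
poor_fit = cost(X_b, y, [0.0, 1.0])
poor_fit_regularized = cost(X_b, y, [0.0, 1.0], lam=3.0)
```

With theta = [0, 2] the fit is exact, so the unregularized cost is 0. For theta = [0, 1] the squared errors sum to 14, and the penalty with lam = 3 adds 3 to that sum before dividing by 2m.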
2.1 Minimizing J
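1. Numerical Approximation:
a. Gradient Descent:
$$\theta_j := \theta_j - \alpha \frac{\partial J}{\partial \theta_j} = \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} = \theta_j - \frac{\alpha}{m} (h_\theta(X) - y)^T x_j$$
and all together:
$$\theta := \theta - \alpha \frac{\partial J}{\partial \theta} = \theta - \frac{\alpha}{m} X_b^T (h_\theta(X) - y)$$
With regularization it becomes:
$$\theta := \theta - \alpha \frac{\partial J}{\partial \theta} = \theta - \frac{\alpha}{m} \big( X_b^T (h_\theta(X) - y) + \lambda \tilde{\theta} \big)$$
where $\alpha$ is the learning rate and $\tilde{\theta}$ is $\theta$ with $\theta_0$ set to 0. The product $X_b^T (h_\theta(X) - y)$ is of size (n + 1, m) × (m, 1) = (n + 1, 1).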
Two important things are scaling the features and choosing $\alpha$:
i. In case our features are not on the same scale, we bring them to the same scale by
$$x_j \leftarrow \frac{x_j - \mu_j}{\sigma_j}, \quad \text{where} \quad \mu_j = \frac{1}{m} \sum_{i=1}^{m} x_j^{(i)}, \quad \sigma_j^2 = \frac{1}{m} \sum_{i=1}^{m} (x_j^{(i)} - \mu_j)^2$$
ii. We need to choose $\alpha$ carefully. If its value is too high, then the step will skip
over the minimum, and $J$ will increase instead of decrease. If its value is too
low, then the convergence will be very slow. To check if the value is too high or
too low we can plot the value of $J$ against the number of iterations. If we fix a number
of iterations, then we should try some values for $\alpha$, such as in the range of
{..., 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, ...}, and we should choose the one which gives the
lowest value for $J_{CV}$.
b. There are other methods for finding a minimum numerically given the function and its
derivative(s), such as Conjugate gradient, BFGS, and L-BFGS. These are more sophisticated
than plain GD, and don't need us to provide $\alpha$.
We can implement this in Octave by using fminunc() in the following way:
% costFunction(theta) must return both the cost J and its gradient
options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = rand(2, 1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options)
2. Least Squares:
The minimum value of $J$ is attained when $\theta = (X_b^T X_b)^{-1} X_b^T y$. If $X_b^T X_b$ is not
invertible, we use the pseudo-inverse (pinv) instead.
With regularization, the minimum value of $J$ is obtained when $\theta = (X_b^T X_b + \lambda L)^{-1} X_b^T y$, where
$$L = \begin{pmatrix} 0 & & & \\ & 1 & & \\ & & \ddots & \\ & & & 1 \end{pmatrix}$$
is of size (n + 1, n + 1), and for $\lambda > 0$ the matrix $X_b^T X_b + \lambda L$ will always be invertible.
3. We should use LS whenever $n$ is not too large, since we then get an exact answer right away.
When $n$ is large, inverting the matrix will take a lot of time, and then we will prefer to use GD.
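The feature scaling and the vectorized gradient descent update of this section can be sketched together in plain Python. This is a minimal sketch under assumptions: the function names, the made-up sizes/prices data, and the choices alpha = 0.1 and 500 iterations are ours, not part of the summary.

```python
def scale(column):
    """Scale one feature: (x - mu) / sigma, with sigma^2 = mean((x - mu)^2)."""
    m = len(column)
    mu = sum(column) / m
    sigma = (sum((x - mu) ** 2 for x in column) / m) ** 0.5
    return [(x - mu) / sigma for x in column], mu, sigma

def gradient_descent(X_b, y, alpha, iterations, lam=0.0):
    """Repeat theta := theta - alpha/m * (X_b^T (X_b theta - y) + lam * theta_tilde)."""
    m, width = len(y), len(X_b[0])
    theta = [0.0] * width
    for _ in range(iterations):
        errors = [sum(t * x for t, x in zip(theta, row)) - yi for row, yi in zip(X_b, y)]
        grad = [sum(e * row[j] for e, row in zip(errors, X_b)) for j in range(width)]
        reg = [0.0] + [lam * t for t in theta[1:]]      # bias entry is not regularized
        theta = [t - alpha / m * (g + r) for t, g, r in zip(theta, grad, reg)]
    return theta

# made-up data: one feature on a large scale
sizes = [1000.0, 1500.0, 2000.0, 2500.0]
prices = [200.0, 290.0, 410.0, 500.0]

scaled, mu, sigma = scale(sizes)
X_b = [[1.0, s] for s in scaled]                        # add the bias column
theta = gradient_descent(X_b, prices, alpha=0.1, iterations=500)
```

After scaling, the feature has mean 0 and variance 1, so for this single-feature case the least squares solution is simply the mean of the prices for $\theta_0$ and the mean of (scaled size × price) for $\theta_1$, and gradient descent converges to the same values.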
3 Logistic Regression
1. Assume we want to classify things into two categories. Then y is a vector with entries of {0, 1},
and we would like to get predictions of 0s and 1s.
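2. Our hypothesis is $h_\theta(X) = \frac{1}{1 + e^{-X_b \theta}}$, or $g(z) = \frac{1}{1 + e^{-z}}$ when $z = X_b \theta$. That way, $0 \le h_\theta(X) \le 1$,
$$h_\theta(X) \; \begin{cases} < \frac{1}{2} & X_b \theta < 0 \\ = \frac{1}{2} & X_b \theta = 0 \\ > \frac{1}{2} & X_b \theta > 0 \end{cases}$$
and we interpret it as the probability of the result being 1.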

3. There's a decision boundary we set in order to decide when the result should be 1 and
when it should be 0. We set it to 0.5 if we want 50:50 chances for each result. However, if we want to
predict 1 only when there's higher certainty, then we raise the boundary.
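4. The Cost function is
$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} [y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)}))] = -\frac{1}{m} \big( y^T \log(g(X_b \theta)) + (1 - y)^T \log(g(-X_b \theta)) \big)$$
where $-\log(h_\theta(x^{(i)}))$ contributes to the error in case $y^{(i)} = 1$, and $-\log(1 - h_\theta(x^{(i)}))$ contributes
in case $y^{(i)} = 0$. The vectorized form uses $1 - g(z) = g(-z)$.
With regularization this becomes:
$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} [y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)}))] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2 =$$
$$-\frac{1}{m} \big( y^T \log(g(X_b \theta)) + (1 - y)^T \log(g(-X_b \theta)) \big) + \frac{\lambda}{2m} \tilde{\theta}^T \tilde{\theta}$$
where $\tilde{\theta}$ is $\theta$ with the bias entry set to 0.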
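5. In the case of logistic regression, we don't have an analogue to the LS method, and can only
minimize $J$ numerically, for example with gradient descent:
$$\theta_j := \theta_j - \alpha \frac{\partial J}{\partial \theta_j} = \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} = \theta_j - \frac{\alpha}{m} (h_\theta(X) - y)^T x_j$$
and with regularization and vectorized:
$$\theta := \theta - \alpha \frac{\partial J}{\partial \theta} = \theta - \frac{\alpha}{m} \big( X_b^T (h_\theta(X) - y) + \lambda \tilde{\theta} \big)$$
with $\tilde{\theta}$ denoting $\theta$ with its bias entry set to 0.
We emphasize that while the formulas are the same as for linear regression, $h_\theta$ is different.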
6. In case we are interested in multi-class classification we do the following:
a. Assume we have $K + 1$ classes, so that $y^{(i)} \in \{0...K\}$.
b. We repeat the algorithm $K + 1$ times, where each time we wish to distinguish between the
k-th class in $\{0...K\}$ and all the other classes.
c. In each repetition we get the probability $h_\theta(x^{(i)})$ of the result belonging to the k-th class, and
in the end we choose the class with the maximal probability.
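The hypothesis, cost and gradient descent step of logistic regression can be sketched in plain Python as below. This is an illustrative sketch only: the function names, the symmetric made-up data set, alpha = 0.1 and the 200 iterations are assumptions, and the decision boundary is taken at 0.5.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hypothesis(X_b, theta):
    """h_theta(X) = g(X_b * theta), one probability per example."""
    return [sigmoid(sum(t * x for t, x in zip(theta, row))) for row in X_b]

def cost(X_b, y, theta, lam=0.0):
    m = len(y)
    h = hypothesis(X_b, theta)
    log_loss = -sum(yi * math.log(hi) + (1 - yi) * math.log(1 - hi)
                    for yi, hi in zip(y, h)) / m
    return log_loss + lam / (2 * m) * sum(t * t for t in theta[1:])

def gradient_step(X_b, y, theta, alpha, lam=0.0):
    """theta := theta - alpha/m * (X_b^T (h - y) + lam * theta_tilde)."""
    m = len(y)
    errors = [hi - yi for hi, yi in zip(hypothesis(X_b, theta), y)]
    reg = [0.0] + [lam * t for t in theta[1:]]          # bias entry is not regularized
    return [t - alpha / m * (sum(e * row[j] for e, row in zip(errors, X_b)) + reg[j])
            for j, t in enumerate(theta)]

# made-up data: negative feature values belong to class 0, positive ones to class 1
X_b = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]
y = [0, 0, 1, 1]

theta = [0.0, 0.0]
initial_cost = cost(X_b, y, theta)      # every prediction is 0.5, so the cost is log(2)
for _ in range(200):
    theta = gradient_step(X_b, y, theta, alpha=0.1)
final_cost = cost(X_b, y, theta)
predictions = [1 if h >= 0.5 else 0 for h in hypothesis(X_b, theta)]
```

The update has the same form as for linear regression; only hypothesis() differs. For multi-class classification the same training loop would be run once per class, keeping one theta per class and picking the class with the largest probability.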
3.1 Support Vector Machines/Large Margin Separator

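We would like to separate our different clusters with a separator which is as far as possible
from each of them, so that the separation is the most reliable and new examples that aren't in
the training set (CV, test) will also be separated correctly. We exchange the logarithm function
with a different cost function:
$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} [y^{(i)} \mathrm{cost}_1(\theta^T x^{(i)}) + (1 - y^{(i)}) \mathrm{cost}_0(\theta^T x^{(i)})] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$$
a. This translates in the linear case to predicting
$$h_\theta(X) = \begin{cases} 1 & \text{if } X_b \theta \ge 1 \\ 0 & \text{if } X_b \theta \le -1 \end{cases}$$
Therefore, when training, we force that
$$\begin{cases} \theta^T x_i \ge 1 & \text{if } y_i = 1 \\ \theta^T x_i \le -1 & \text{if } y_i = 0 \end{cases}$$
which will guarantee that the unregularized cost function will be 0, and under these
constraints we only minimize the regularization:
$$\min_\theta \sum_{j=1}^{n} \theta_j^2 = \tilde{\theta}^T \tilde{\theta}$$
where $\tilde{\theta}$ is $\theta$ with the bias entry set to 0.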
b. We can also use non-linear kernels:
Instead of the linear $X_b \theta$, we can use $\theta^T f$, where $f_i = d(x, l_i)$ for some landmarks $l_i$ and some
distance function $d$ between them. For example, we can use as landmarks our examples
themselves, $l^{(i)} = x^{(i)}$, $i = 1...m$, and a gaussian distance function
$$d(x, l_i) = \exp\Big( -\frac{\|x - l_i\|^2}{2 \sigma^2} \Big)$$
In this case, large $\sigma$ corresponds to a wide gaussian which may give high bias, while small $\sigma$
corresponds to a narrow gaussian which may give high variance.
The cost function is
$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} [y^{(i)} \mathrm{cost}_1(\theta^T f^{(i)}) + (1 - y^{(i)}) \mathrm{cost}_0(\theta^T f^{(i)})] + \frac{\lambda}{2m} \sum_{j=1}^{m} \theta_j^2$$
There are other kernels in use, such as polynomial kernels, in which $d(x, l) = (l^T x + c)^d$ (the exponent $d$ being the degree of the polynomial).
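To illustrate the gaussian kernel features, here is a minimal plain-Python sketch that uses the examples themselves as landmarks. The function names, the three made-up points and the two values of sigma are assumptions chosen only to show the effect of a narrow versus a wide gaussian.

```python
import math

def gaussian_kernel(x, landmark, sigma):
    """d(x, l) = exp(-||x - l||^2 / (2 sigma^2))."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, landmark))
    return math.exp(-squared_distance / (2 * sigma ** 2))

def kernel_features(x, landmarks, sigma):
    """f_i = d(x, l_i) for every landmark l_i."""
    return [gaussian_kernel(x, l, sigma) for l in landmarks]

# made-up examples, also used as the landmarks l(i) = x(i)
X = [[0.0, 0.0], [1.0, 0.0], [4.0, 3.0]]

f_narrow = kernel_features(X[0], X, sigma=0.5)   # only very close landmarks matter
f_wide = kernel_features(X[0], X, sigma=5.0)     # distant landmarks still contribute
```

The feature for a landmark equal to the example itself is always 1. With the narrow gaussian the far landmark contributes almost nothing, while with the wide gaussian it still gives exp(-0.5), which is the sense in which a large sigma smooths the model.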
4 Neural Networks
4.1 Forward Propagation
1. We have an input layer, output layer, and hidden layers in between. We denote the number of
layers by L.
2. The input layer is in fact $X$, but we will also denote it as $a^{(1)}$.
3. In every step we add a bias term, $a_0^{(l)}$, to the previous layer to receive $a_b^{(l)}$, and we have a matrix
of weights, $\Theta^{(l)}$, which acts as logistic regression to pass from layer $l$ to $l + 1$, thus giving us
$a^{(l+1)}$:
$$z^{(l+1)} = a_b^{(l)} \Theta^{(l)}, \quad a^{(l+1)} = g(z^{(l+1)})$$
The size of $\Theta^{(l)}$ is ($s_l + 1$, $s_{l+1}$), where $s_l$ is the number of units in layer $l$ (without the bias unit),
so for example, $s_1 = n$ (the number of features) and $s_L = K$ (the number of outputs/classes).
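4. Our hypothesis is the output layer, $h_\Theta(X) = a^{(L)} = g(z^{(L)})$.
5. The Cost function is:
$$J(\Theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} [y_k^{(i)} \log(h_\Theta(x^{(i)})_k) + (1 - y_k^{(i)}) \log(1 - h_\Theta(x^{(i)})_k)] + \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{j_l=1}^{s_l} \sum_{j_{l+1}=1}^{s_{l+1}} (\Theta^{(l)}_{j_l, j_{l+1}})^2 =$$
$$-\frac{1}{m} \mathrm{tr}\big( y^T \log(g(a_b^{(L-1)} \Theta^{(L-1)})) + (1 - y)^T \log(g(-a_b^{(L-1)} \Theta^{(L-1)})) \big) + \frac{\lambda}{2m} \sum_{l=1}^{L-1} \|\tilde{\Theta}^{(l)}\|_F^2$$
where $y$ is the (m, K) matrix of results, the trace sums the $K$ per-class terms, and $\tilde{\Theta}^{(l)}$ is $\Theta^{(l)}$ with its bias row (index 0) set to 0, which is why the sum over $j_l$ starts at 1.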
6. For example, if we want to use NN for multi-class classification, we need to convert the
vector of results $y$ of length $m$ with entries in $\{1...K\}$ to a matrix $Y$ of size (K, m), where the i-th
result $y_i = k$ is converted to the i-th column with 1 in the k-th place, and 0 elsewhere:
$$y = \begin{pmatrix} 3 \\ 4 \\ 1 \\ \vdots \\ 2 \end{pmatrix} \mapsto Y = \begin{pmatrix} 0 & 0 & 1 & \cdots & 0 \\ 0 & 0 & 0 & \cdots & 1 \\ 1 & 0 & 0 & \cdots & 0 \\ 0 & 1 & 0 & \cdots & 0 \end{pmatrix}$$
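Forward propagation and the conversion of $y$ to a 0/1 matrix can be sketched in plain Python as follows. This is a sketch under assumptions: the function names are ours, each weight matrix is stored as a list of rows of size ($s_l + 1$, $s_{l+1}$) as in this section, and the weights are hand-picked (not trained) so that the small network computes XNOR of two binary inputs.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, thetas):
    """Propagate one example x; thetas[l] has size (s_l + 1, s_{l+1})."""
    a = list(x)
    for theta in thetas:
        a_b = [1.0] + a                                        # add the bias unit
        z = [sum(a_b[i] * theta[i][j] for i in range(len(a_b)))
             for j in range(len(theta[0]))]                    # z = a_b * Theta
        a = [sigmoid(v) for v in z]
    return a                                                   # output layer a^(L)

def one_hot(y, K):
    """(K, m) matrix: column i has a 1 in row y[i] (labels 1..K) and 0 elsewhere."""
    return [[1 if label == k else 0 for label in y] for k in range(1, K + 1)]

# hand-picked weights: hidden units compute AND and NOR, the output unit computes OR
theta1 = [[-30.0, 10.0], [20.0, -20.0], [20.0, -20.0]]         # size (2 + 1, 2)
theta2 = [[-10.0], [20.0], [20.0]]                             # size (2 + 1, 1)

inputs = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
outputs = [round(forward(x, [theta1, theta2])[0]) for x in inputs]

Y = one_hot([3, 4, 1, 2], K=4)
```

The rounded outputs are 1 exactly when both inputs agree, and Y reproduces the four columns written out in the example above (without the elided middle columns).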
4.2 Computing The Gradient of J - Back Propagation
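1. We denote $\delta_k^{(L)} = a_k^{(L)} - y_k$ the difference between the result of the FP and the actual result.
Vectorizing, we get $\delta^{(L)} = a^{(L)} - y = g(z^{(L)}) - y$ of length $s_L = K$.
2. Now we calculate, for $l = L - 1, ..., 2$,
$$\delta^{(l)} = \Theta^{(l)} \delta^{(l+1)} .* g'(z^{(l)}) = \Theta^{(l)} \delta^{(l+1)} .* (g(z^{(l)}) (1 - g(z^{(l)})))$$
of length $s_l + 1$, where .* is the element-wise product and $g(z^{(l)}) = a^{(l)}$ is extended by the bias unit. We then take out $\delta_0^{(l)}$ to get $\delta^{(l)}$ of length $s_l$.
3. Now we calculate $\Delta^{(l)} = a_b^{(l)} (\delta^{(l+1)})^T$, summed over the training examples, and get a matrix of size ($s_l + 1$, $s_{l+1}$), the same size as $\Theta^{(l)}$.
4. We finish by calculating $\frac{\partial J(\Theta)}{\partial \Theta^{(l)}} = \frac{1}{m} (\Delta^{(l)} + \lambda \tilde{\Theta}^{(l)})$, where $\tilde{\Theta}^{(l)}$ is $\Theta^{(l)}$ with its bias row set to 0.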
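The back propagation steps can be sketched in plain Python for a network with one hidden layer, a single training example (m = 1) and no regularization. Everything here is an assumption made for illustration: the function names, the weights and the example are made up, and the finite-difference comparison at the end is a sanity check that is not part of the summary above.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, theta1, theta2):
    a1_b = [1.0] + list(x)
    a2 = [sigmoid(sum(a1_b[i] * theta1[i][j] for i in range(len(a1_b))))
          for j in range(len(theta1[0]))]
    a2_b = [1.0] + a2
    a3 = [sigmoid(sum(a2_b[i] * theta2[i][k] for i in range(len(a2_b))))
          for k in range(len(theta2[0]))]
    return a1_b, a2_b, a3

def cost(x, y, theta1, theta2):
    _, _, a3 = forward(x, theta1, theta2)
    return -sum(yk * math.log(ak) + (1 - yk) * math.log(1 - ak) for yk, ak in zip(y, a3))

def backprop(x, y, theta1, theta2):
    a1_b, a2_b, a3 = forward(x, theta1, theta2)
    delta3 = [ak - yk for ak, yk in zip(a3, y)]                 # delta^(L) = a^(L) - y
    # Theta^(2) delta^(3) .* g'(z^(2)), written as a .* (1 - a); entry 0 is the bias
    delta2_b = [sum(theta2[i][k] * delta3[k] for k in range(len(delta3)))
                * a2_b[i] * (1 - a2_b[i]) for i in range(len(a2_b))]
    delta2 = delta2_b[1:]                                       # take out delta_0
    grad2 = [[a2_b[i] * d for d in delta3] for i in range(len(a2_b))]   # a_b delta^T
    grad1 = [[a1_b[i] * d for d in delta2] for i in range(len(a1_b))]
    return grad1, grad2

x, y = [0.5, -1.0], [1.0, 0.0]
theta1 = [[0.1, -0.2], [0.3, 0.4], [-0.5, 0.2]]     # size (2 + 1, 2)
theta2 = [[0.2, -0.1], [0.4, 0.3], [-0.3, 0.5]]     # size (2 + 1, 2)
grad1, grad2 = backprop(x, y, theta1, theta2)

# finite-difference estimate of dJ/dTheta1[1][0] for comparison
eps = 1e-6
theta1[1][0] += eps
cost_plus = cost(x, y, theta1, theta2)
theta1[1][0] -= 2 * eps
cost_minus = cost(x, y, theta1, theta2)
theta1[1][0] += eps
numeric = (cost_plus - cost_minus) / (2 * eps)
```

Each gradient matrix has the same size as its weight matrix, and the numerical estimate should agree with grad1[1][0] up to the finite-difference error. For m examples the outer products would be summed and divided by m, and the regularization term added as in step 4.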
5 In case our algorithm doesn't work well
1. Gradient Descent: If the value of $J$ doesn't converge, we should take a smaller $\alpha$.
2. Underfitting (High Bias): In case both $J_{train}$ and $J_{CV}$ are high, we might not have enough
features. We can:
a. Decrease the regularization parameter $\lambda$.
b. Add polynomial (higher powers) features of existing features.
c. Add different kinds of features which depend on existing features, such as log, square root,
etc.
d. Add different features, i.e. get more data on the existing examples.
e. Add hidden layers and/or units in hidden layers.
3. Overfitting (High Variance): In case $J_{train}$ is low, but $J_{CV}$ is high, we might have too many
features or not enough data. We can:
a. Increase the regularization parameter $\lambda$.
b. Reduce the number of features.
c. Increase the number of training examples.
d. Decrease hidden layers and/or units in hidden layers.

4. Error evaluation in case of skewed data (the amounts of data in the different classes are on different scales):
We say that an outcome is positive if it is in the class we want to check.
We denote
$$\text{Precision} = \frac{\text{True Positives}}{\text{Predicted Positives}} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} = \frac{TP}{TP + FP}$$
$$\text{Recall} = \frac{\text{True Positives}}{\text{Actual Positives}} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} = \frac{TP}{TP + FN}$$
If we want to predict positive only when we're highly confident, we would get high precision
and low recall, and if we're okay with predicting positive also when we're not very confident, we
would get high recall and low precision.
We would like both of them to be high, and can do so by checking on the CV set and maximizing
the value of
$$F_1 = 2 \frac{P R}{P + R}$$
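Precision, recall and $F_1$ can be computed from 0/1 labels as in the following plain-Python sketch. The function name, the made-up scores and the two decision boundaries are assumptions for illustration, and returning 0 when a denominator is empty is a convention of this sketch rather than something stated in the summary.

```python
def precision_recall_f1(actual, predicted):
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0    # convention when nothing is predicted positive
    recall = tp / (tp + fn) if tp + fn else 0.0       # convention when there are no actual positives
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# made-up skewed data: 3 positives out of 10, with predicted probabilities
actual = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.95, 0.8, 0.6, 0.7, 0.3, 0.2, 0.1, 0.4, 0.05, 0.1]

def evaluate(boundary):
    predicted = [1 if s >= boundary else 0 for s in scores]
    return precision_recall_f1(actual, predicted)

balanced = evaluate(0.5)    # predicts positive more readily
cautious = evaluate(0.9)    # predicts positive only when highly confident
```

Raising the boundary from 0.5 to 0.9 moves precision from 0.75 to 1 but drops recall from 1 to 1/3, and $F_1$ falls accordingly, which is the trade-off described above.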
6 K means
1. Now we don't have a vector/matrix of results $y$, but only the set of examples with features $X$, and
we would like to classify them into different classes (clusters) based on the features alone.
2. We scale the features.
3. We choose the number of clusters, $K$, and then randomly choose $K$ centers for the desired clusters,
$\mu_k$, $k = 1...K$.
4. We classify every example $x^{(i)}$ to the k-th cluster by seeking the closest center $\mu_k$.
5. We now take the whole k-th cluster with the examples classified to it, and define its new center
$\mu_k$ to be the mean of all these examples. We repeat steps 4 and 5 until the centers stop moving.
6. We can view this algorithm as optimizing
$$J(c^{(1)} ... c^{(m)}, \mu_1 ... \mu_K) = \frac{1}{m} \sum_{i=1}^{m} \|x^{(i)} - \mu_{c^{(i)}}\|^2$$
where $c^{(i)}$ is the index of the cluster which the i-th example is in.
7. This function isn't convex, therefore we may land in a local minimum instead of the global one. Therefore
we repeat the algorithm a number of times (about 100), each time with a random initialization, and look
for the run with the minimal cost.
8. We choose the number of clusters by a priori need. If we don't have any, we can try using the elbow
method to see where the rate of decrease of the cost function slows down.
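The steps above can be sketched in plain Python as follows. This is a minimal sketch under assumptions: the function names and the six made-up points are ours, the initial centers are drawn from the examples themselves, an empty cluster simply keeps its old center, and feature scaling is skipped because both features are already on the same scale.

```python
import random

def squared_distance(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

def assign(X, centers):
    """Index of the closest center for every example (ties go to the lower index)."""
    return [min(range(len(centers)), key=lambda k: squared_distance(x, centers[k]))
            for x in X]

def k_means(X, initial_centers, max_iterations=100):
    centers = [list(c) for c in initial_centers]
    for _ in range(max_iterations):
        labels = assign(X, centers)
        new_centers = []
        for k, old in enumerate(centers):
            members = [x for x, c in zip(X, labels) if c == k]
            # an empty cluster keeps its old center (an assumption of this sketch)
            new_centers.append([sum(col) / len(members) for col in zip(*members)]
                               if members else old)
        if new_centers == centers:          # centers stopped moving
            break
        centers = new_centers
    labels = assign(X, centers)
    cost = sum(squared_distance(x, centers[c]) for x, c in zip(X, labels)) / len(X)
    return labels, centers, cost

def best_of_restarts(X, K, restarts=10, seed=0):
    """Run K-means from several random initializations and keep the lowest cost."""
    rng = random.Random(seed)
    runs = [k_means(X, rng.sample(X, K)) for _ in range(restarts)]
    return min(runs, key=lambda run: run[2])

# made-up data: two well separated groups of three points
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
labels, centers, cost = best_of_restarts(X, K=2)
```

For this data the best run puts the first three points in one cluster and the last three in the other. Repeating the call for several values of K and plotting the returned cost is one way to apply the elbow method.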
7 Principal Components Analysis (PCA)
1. This method is used to reduce the number of dimensions used to represent our data. It can be
useful for reducing the number of features, saving storage, and visualizing data.
2. We scale the features by using mean normalization, $x_j \leftarrow \frac{x_j - \mu_j}{\sigma_j}$.
3. We compute the covariance matrix $\Sigma = \frac{1}{m} \sum_{i=1}^{m} (x^{(i)}) (x^{(i)})^T$.
4. We then compute the eigenvalues and eigenvectors of the matrix $\Sigma$. We denote the eigenvectors
as $u_i$, $i = 1...n$, and the eigenvalues as $s_i$, $i = 1...n$.
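Steps 2 to 4 can be sketched in plain Python as below. This is only an illustration under assumptions: the function names and the four made-up examples are ours, and power iteration is used here as a simple stand-in for a full eigendecomposition routine, so it finds only the eigenvector with the largest eigenvalue (and assumes the starting vector is not orthogonal to it).

```python
def scale_columns(X):
    """Mean normalization of every feature: (x_j - mu_j) / sigma_j."""
    m = len(X)
    columns = []
    for col in zip(*X):
        mu = sum(col) / m
        sigma = (sum((v - mu) ** 2 for v in col) / m) ** 0.5
        columns.append([(v - mu) / sigma for v in col])
    return [list(row) for row in zip(*columns)]

def covariance(X):
    """Sigma = (1/m) * sum_i x(i) x(i)^T for already scaled examples."""
    m, n = len(X), len(X[0])
    return [[sum(row[a] * row[b] for row in X) / m for b in range(n)] for a in range(n)]

def top_eigenpair(S, iterations=100):
    """Power iteration: repeatedly apply S and normalize."""
    v = [1.0] + [0.0] * (len(S) - 1)
    for _ in range(iterations):
        w = [sum(S[a][b] * v[b] for b in range(len(v))) for a in range(len(v))]
        norm = sum(c * c for c in w) ** 0.5
        v = [c / norm for c in w]
    eigenvalue = sum(v[a] * S[a][b] * v[b] for a in range(len(v)) for b in range(len(v)))
    return eigenvalue, v

# made-up data: m = 4 examples, n = 2 positively correlated features
X = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]]
X_scaled = scale_columns(X)
Sigma = covariance(X_scaled)
s1, u1 = top_eigenpair(Sigma)
```

For two scaled features with correlation r the covariance matrix is [[1, r], [r, 1]], so here (r = 0.6) the largest eigenvalue is 1.6 and its eigenvector points along the diagonal (1, 1).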
