The Simple + Practical Path To Machine Learning Capability - Models With Learned Parameters

9/22/2016 The Simple + Practical Path to Machine Learning Capability: Models with Learned Parameters
The Simple + Practical Path to Machine Learning Capability:
Models with Learned Parameters
Dan Kuster - September 20, 2016
In part one, we showed how the machine learning process is like the scientific thinking
process (https://indico.io/blog/simple-practical-path-to-machine-learning-capability-part1/), and
in part two, we introduced a benchmark task and showed how to get our machine learning
system up and running with a simple nearest neighbors model
(https://indico.io/blog/simple-practical-path-to-machine-learning-capability-part2/).
Now you have all the necessary parts of a machine learning system:
- A problem worth solving, defined as a specific machine learning task. Here we're solving the task of
  optical character recognition, a 10-way classification task. Specifically, given a 28x28 grayscale image
  of a handwritten number, label it as the correct digit (e.g., "4").
- Data + labels to feed our model.
- Model to represent and exploit knowledge for this data distribution and task. So far, we've only used
  the nearest neighbors model, which looks for the most similar example in the known dataset, and
  assumes the input has the same label as the known example.
- A principled way to evaluate and compare performance (i.e., evaluation metrics).
But maybe you are wondering: what about better models?

Time to crank it up! In this post, we continue the scientific thinking process by extending our
simple starting model into increasingly powerful models.
We'll show how to implement a particularly useful kind of model in Tensorflow. In fact, you can
follow these steps to implement any model:
1. Evaluate on a dataset that wasn't used to train the model.
2. Inspect the errors.
3. Identify patterns of errors.
4. Diagnose the patterns of errors, in the context of the model. What about the model led it to make
   the wrong predictions?
5. Make a new hypothesis.
6. Implement the new hypothesis as a new model.
7. Train the new model.
8. Visualize the result.
(and iterate!)
Shortcutting this process is a common failure mode; you've been warned! Experts practice
exactly the same process, perhaps even more systematically. But familiarity with common
patterns of errors, diagnoses, and solutions allows experts to zip through the iterations. From
the perspective of someone who is learning, it might look like the expert is skipping steps and
going straight to the complicated stuff. But that isn't the case! Experts spend the most time
working on complicated stuff because the simple stuff has been tried/solved already :).
There is no reason to assume your particular problem cannot be solved by simpler models
until you have tested them. Simpler models have many benefits, so start simple! Skip around if
you must, but make sure you understand how to practice the entire end-to-end scientific
thinking process.
1. Evaluate predictions
To improve our machine learning system, we need to understand how the current system
makes errors. Then we test a hypothesis (i.e., a new model) to see if we can eliminate errors.
When does the nearest neighbors model make mistakes? Let's evaluate the first 1000 dev
images, and inspect the examples where the nearest neighbor prediction was wrong.
# np, tqdm, nearest_neighbor, dev_images and dev_labels come from the previous article
dev_images = dev_images[0:1000]
dev_labels = dev_labels[0:1000]
pred = np.zeros(len(dev_images))
for i, query_image in tqdm(enumerate(dev_images)):
    pred[i] = nearest_neighbor(query_image)
acc = np.zeros(len(dev_images))
for i, pred_label in enumerate(pred):
    if int(pred_label) == int(dev_labels[i]):
        acc[i] = 1
    else:
        print("example %d: predicted %d, real answer is %d" % (i, pred[i], dev_labels[i]))
accuracy = acc.sum() / len(acc)
print("Accuracy: %.3f%%, evaluated on %d examples" % (accuracy * 100., len(acc)))
Evaluate the first 1000 dev examples.

47 mistakes: 95.3% accuracy.
From just the predicted vs. real labels, can we start to understand why this model made
mistakes? There do seem to be some patterns. For example, a handwritten 4 vs. 9 might have
very similar pixels, depending on the roundness of the top loop. And perhaps there is a
pattern to the predictions: it seems like 7, 9, and 1 are incorrectly predicted more frequently
than others.
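One way to check that impression is to tally the mistakes instead of eyeballing the printed list. The sketch below is a minimal example on made-up `pred` and `real` lists (hypothetical stand-ins for `pred` and `dev_labels` above), so the counts it prints are not the article's results.

```python
from collections import Counter

# Hypothetical toy predictions and labels; substitute pred and dev_labels.
pred = [9, 7, 4, 1, 9, 3, 7, 9]
real = [4, 7, 9, 7, 9, 3, 1, 8]

# Keep only the (predicted, real) pairs where the model was wrong.
mistakes = [(p, r) for p, r in zip(pred, real) if p != r]

# How often is each digit the (incorrect) prediction?
wrong_by_predicted = Counter(p for p, _ in mistakes)
# Which (predicted, real) confusions occur most often?
wrong_pairs = Counter(mistakes)

for digit, count in wrong_by_predicted.most_common():
    print("predicted %d incorrectly %d times" % (digit, count))
```

Counting by pair as well as by predicted digit helps separate "the model over-predicts 9" from "the model specifically confuses 4 with 9".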
2. Inspect the errors
It is good to exercise intuition, but we can't stop there! Now that we know which examples
were incorrectly predicted, we can inspect the actual image data of those examples (feel free
to inspect correct predictions too). Here are a few errors and a code snippet to do it for
yourself:
from matplotlib import pyplot as plt
ex = 5  # change this value to whichever example you want to view
plt.imshow(dev_images[ex].reshape((28, 28)), cmap="gray_r")
plt.show()
Example 5: predicted 9, real answer is 4.

Maybe the model made a mistake because the overall shapes are similar,
but the sharp difference in a few pixels at the top of the loop is not enough
to overcome the rest of the shape (which looks very similar for 4 and 9).
Example 86: predicted 2, real answer is 6. Is this *my* handwriting?

Just kidding. I don't know what to make of this blob. Next!

Example 177: predicted 4, real answer is 9.
As in example 5 above, a sharp visual difference that spans only a few pixels
is not enough to overcome the overall similarity of the rest of the shape.

Example 180: predicted 9, real answer is 8.
Most of the pixels (the top loop) do look like a 9, but the bottom loop distinguishes it as an 8.
The bottom loop is relatively small here, but a nearest neighbors model doesn't care.
3. Look for patterns of errors
OK, we've only shown a few illustrative images here, but we looked at many more, and
encourage you to do the same. Observe the errors. Do you notice anything in common to
these incorrectly predicted examples?
What patterns did you notice? We noticed how, in some examples, a small region of pixels can
be very important for determining the correct digit. Where the discriminative part of the
handwriting is small, and the overall stroke is ambiguous, the model makes mistakes.
4. Diagnose errors: how did the model make mistakes?
When a dataset is large and diverse, there are bound to be bad examples (e.g., example #86
in the figure above). But if we can detect a systematic pattern and understand how it happens
in the specific context of this model and data distribution, then we can think about formulating
a better hypothesis.
Many of the pixel locations, especially the border pixels, always have a value of 0. Other
locations are more interesting.
Keeping in mind that MNIST data were generated from black/white data and then converted
into grayscale, we expect a bimodal distribution like this. We also find a clear difference in
distributions between pixels at the edges of the image (where values are always 0), and pixels
in the middle of the image (e.g., pixels 215, 325, 658), which generally have a bimodal
distribution as well. Thinking in terms of a model, we'd like to inform our model with some
information about the distributions of values at each pixel location, so it can learn which values
at which pixel locations are important to predict digits.
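To look at these per-pixel distributions yourself, something like the following sketch may help. It assumes the images are stored as rows of 784 gray values; the `images` array here is random stand-in data with a zeroed border (an assumption for illustration, not MNIST), so swap in `dev_images` to see the real distributions.

```python
import numpy as np

rng = np.random.RandomState(0)
# Hypothetical stand-in for dev_images: 200 images of 784 pixels in [0, 255].
images = rng.randint(0, 256, size=(200, 784))
# Mimic an empty border by zeroing the top and bottom rows of each 28x28 image.
images[:, :28] = 0
images[:, -28:] = 0

# Pixel locations whose value is 0 in every image carry no information.
always_zero = (images == 0).all(axis=0)
print("pixel locations that are always 0: %d of %d" % (always_zero.sum(), images.shape[1]))

# Histogram of gray values at one interior pixel location (e.g., pixel 325).
counts, bin_edges = np.histogram(images[:, 325], bins=8, range=(0, 256))
print(counts)
```

On real MNIST data, the histogram for an interior pixel should show the two peaks (near 0 and near 255) described above, rather than the flat distribution of this random stand-in.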
So, now that we have identified a pattern of errors, how do we explain the pattern in the
context of the model? It is great practice to think through this; please take a moment to
express your thoughts before continuing on.
It can be helpful to write ideas on paper or whiteboard. Do whatever is most natural for you:
draw something, sketch logical diagrams, or write words. For example:
We noticed errors where the overall stroke of handwriting might be similar to
an incorrect digit (green), and a small discriminative region (red)
is not able to overcome it.
Our explanation for the pattern of mistakes? In the nearest neighbors model, specifically in the
sse_distance function, each pixel contributes equally to the prediction. But the useful
information is not uniformly distributed across pixels. Some pixels are more informative than
others, like the region at the top of the 9 vs. 4 in the figure above. Intuitively, this makes sense.
When we humans read handwriting, we focus on specific parts of the stroke to discriminate
one character from another. So yes, the errors make sense in the context of a nearest
neighbors model! The model is being fooled when the discriminative regions of handwriting
are small and the overall stroke shape is ambiguous.
Diagnosis: each pixel is a feature that could be exploited to make predictions, but the nearest
neighbors model doesn't have the ability to weight the informative pixels more strongly than
other pixels.
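The diagnosis can be made concrete with a toy example. The sketch below assumes `sse_distance` is a plain sum of squared pixel differences (our reading of the function from part two), and introduces a hypothetical `weighted_sse_distance` purely for illustration; the six-"pixel" images and the weights are made up.

```python
import numpy as np

def sse_distance(a, b):
    # Assumed form of the distance from part two: every pixel counts equally.
    return np.sum((a - b) ** 2)

def weighted_sse_distance(a, b, weights):
    # Hypothetical variant: each pixel's squared error is scaled by a weight.
    return np.sum(weights * (a - b) ** 2)

# Pixel 0 is the small discriminative region; pixels 1-5 are the overall stroke.
query = np.array([1.0, 0.5, 0.5, 0.5, 0.5, 0.5])
cand_a = np.array([0.0, 0.5, 0.5, 0.5, 0.5, 0.5])  # same stroke, wrong discriminative pixel
cand_b = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 0.0])  # different stroke, right discriminative pixel

# Unweighted: candidate A looks closer, even though it differs where it matters.
print(sse_distance(query, cand_a), sse_distance(query, cand_b))

# Weighted: emphasizing pixel 0 makes candidate B the nearer neighbor.
weights = np.array([5.0, 1.0, 1.0, 1.0, 1.0, 1.0])
print(weighted_sse_distance(query, cand_a, weights),
      weighted_sse_distance(query, cand_b, weights))
```

The weights here are chosen by hand; the rest of the article is about learning them from data instead.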
5. Strategy to fix the errors -> a new hypothesis
Now that we have a hypothetical explanation for the pattern of errors we have observed, can we
think of a strategy to eliminate those errors? How would you change the model to give more
predictive power to the discriminative regions of handwriting (regardless of size), while still
being sensitive to the overall shape?
There are many possible strategies, and we encourage you to try a few :). One particularly
useful strategy exploits the concept of weights. We give each pixel a weight parameter, so that
the contribution of each pixel (to the prediction) can be scaled according to how informative it
is. Let's assume for now that we have a method for discovering the optimal value of each
weight parameter. How would such a model look? The simplest version would be something
like a weighted average:
$$\text{Predicted class} = \text{weight}_1 \cdot \text{pixel}_1 + \dots + \text{weight}_n \cdot \text{pixel}_n$$
But there are a few problems with a weighted average. First, how do we discover good values for
each parameter? Secondly, this only allows us to predict two classes: a target class and "not
the target class". But we need to predict 10 different target classes.
The solution to the first problem is probabilities! Instead of arbitrary weights and pixel values,
if we frame everything in terms of probability, we can compare predictions on equal terms and
learn which parameter values yield the observed probabilities. So, how to do it?
Use logistic regression to make probabilistic predictions

The purpose of our model is to guess the label [0,1,2,3,...,9] for an image of handwriting. But
guessing an answer is pretty crude, especially when the answer might be ambiguous (e.g.,
"probably a 4 or maybe a 9 but definitely not a 1"). What if we could include some
measure of uncertainty in each guess?
Let's state it precisely: we want to predict a conditional distribution of responses, y, for each
input, X. Each element in the conditional distribution should represent the probability that a
given input is labeled with the given target class. This is a fundamental concept of machine
learning, so let's walk through it in detail. We'll start with the binary classification scenario,
then show how to extend it to any number of target classes.
By definition, probabilities sum to 1. Thus, for two target classes, we define p as the probability
of the target class (the event or outcome we want to predict), given some input, X. Then (1-p)
must be the probability of the other class. Since we are interested in estimating the relative
likelihood of outcomes, we construct a ratio of those probabilities, called the odds ratio:
$$\text{odds} = \frac{p}{1-p}$$
But there is a difficulty with using such a ratio. Inputs could be any real number, but
probabilities must be bounded on [0,1], and the ratio itself can never be negative. We need a
quantity that spans the same range as the inputs, so that any input can be mapped back to a
bounded probability. Thus, we take the logarithm of the odds ratio:
$$\mathrm{logit}(p) = \log\frac{p}{1-p}$$
This is called the "logit", and it is simply a log-transformed odds ratio. (Its inverse is the
"sigmoid", or logistic, function.) You can think of it like a coin flip, where the coin is biased
to land heads with whatever odds are observed. When p = 0.5, outcomes are equally likely and
the coin is fair. For other values of p, the coin is biased towards one outcome or the other.
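A few numbers make the coin-flip picture easier to see. This is a minimal sketch using only the standard library; the function names `odds` and `logit` are ours, chosen to match the terms in the text.

```python
import math

def odds(p):
    # Probability of the target class divided by the probability of the other class.
    return p / (1.0 - p)

def logit(p):
    # Log-transformed odds; maps probabilities in (0, 1) onto the whole real line.
    return math.log(odds(p))

for p in [0.1, 0.5, 0.9]:
    print("p=%.1f  odds=%.3f  logit=%.3f" % (p, odds(p), logit(p)))
```

A fair coin (p = 0.5) has odds of 1 and a logit of 0, and the logit is symmetric: p = 0.9 and p = 0.1 give values of the same size with opposite sign.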
Next, we want to parameterize the logit function. In other words, we want to include our
hypothesis in this equation, refactoring it to have parameters of some form. Let's assume
our learned parameters take the form of a linear model. A linear model has a bias and a slope;
you might remember the form of a linear equation from mathematics classes:
$$y = mx + b$$
We want to follow the same convention as other machine learning practitioners (easier to
share models when we speak the same language). So we'll change the variable names
slightly. The slope, m, becomes a weight, w. And the y-intercept, b, is the bias.
$$y = wx + b$$
Perhaps you've recognized how this (linear) equation is just the first few terms from a series
expansion, and you could easily extend it to an n-degree polynomial hypothesis. But we're
starting simple, with the linear model.
Now we can plug the linear model (hypothesis) into the logit model. This essentially
makes our hypothesis probabilistic:
$$\log\frac{p}{1-p} = wx + b$$
Solving (for the inverse-logit) gives us the logistic function:
$$p = \frac{1}{1 + e^{-(wx + b)}}$$
This is a really important result. We have written a mathematical hypothesis that lets us
observe ratios of outcomes as probabilities, in terms of a bias parameter, b, and weight
parameters, w_n. It is important because we now have a way to compute a probability for any
given input example, once we have values for those parameters.
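For readers who want the intermediate algebra, the step from the logit to the logistic function can be sketched as follows, using the same symbols as above (p for the probability, w for the weight, b for the bias, x for the input):

```latex
\begin{aligned}
\log\frac{p}{1-p} &= wx + b \\
\frac{p}{1-p} &= e^{wx+b} \\
p &= (1-p)\,e^{wx+b} \\
p\,\bigl(1 + e^{wx+b}\bigr) &= e^{wx+b} \\
p &= \frac{e^{wx+b}}{1 + e^{wx+b}} = \frac{1}{1 + e^{-(wx+b)}}
\end{aligned}
```

The last equality divides numerator and denominator by e^{wx+b}, which gives the familiar form of the logistic function.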
But the equation above is for a single input only, and for this task we have 784 features (i.e.,
pixels) to consider for each input, X. No problem, let's expand it out:
$$p = \frac{1}{1 + e^{-(b + w_1 x_1 + w_2 x_2 + \dots + w_n x_n)}}, \quad n = 784$$
Now, we have a mathematical hypothesis that lets us observe ratios of outcomes as
probabilities, in terms of a bias parameter, b, and many weight parameters, w_n. This is exactly
what we want! Using this equation, we can make probabilistic predictions about a target class,
considering the values at each pixel.
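In code, the expanded hypothesis is a dot product followed by the logistic function. The sketch below uses random stand-in values for the pixels and weights (nothing here is learned, and the names `logistic`, `x`, `w`, `b` are ours), so the printed probability is only illustrative.

```python
import numpy as np

def logistic(z):
    # Inverse of the logit: maps any real number to a probability in (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.RandomState(0)
x = rng.rand(784)           # hypothetical normalized pixel values in [0, 1)
w = rng.randn(784) * 0.01   # hypothetical weights (not learned here)
b = 0.0                     # bias

# b + w_1*x_1 + ... + w_784*x_784, then squashed into a probability.
p = logistic(b + np.dot(w, x))
print("P(target class | x) = %.3f" % p)
```

With all weights and the bias at zero, the model would output exactly 0.5 for every image: it has no opinion until the parameters are trained.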
But we still need to generalize from binary classification (2 target classes) to multi-way
classification (10 target classes). The trick is recognizing that binary classification (1 vs. 0) can
be framed as a bunch of "one vs. the rest" classifiers, where each class gets a turn as the "1"
class in a binary classifier. Generalizing binary logistic regression in this way yields softmax
classification. We'll explore the details in another article; for now it is enough to know that the
softmax function is like a multi-way logistic function. Our final model looks like this:
$$y_{pred} = \mathrm{softmax}(xw + b)$$
where w now holds one column of 784 weights per digit class, and b holds one bias per class.
OK! Now that we have a hypothesis for making structured probabilistic predictions into
multiple classes for a given input vector of 784 pixels, how do we discover good parameter
values for b and the weights [w_1, ..., w_n]? We train the model.
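As a preview of the softmax function mentioned above, here is a minimal NumPy sketch. The `scores` vector is made up for illustration (one score per digit class); in the model it would come from the weighted sum of pixels plus the bias.

```python
import numpy as np

def softmax(scores):
    # Subtract the max score for numerical stability; this does not change the result.
    shifted = scores - np.max(scores)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

# Hypothetical scores for the 10 digit classes.
scores = np.array([0.1, 0.2, 0.0, 0.3, 2.5, 0.1, 0.0, 0.4, 0.2, 1.9])
probs = softmax(scores)
print(probs.round(3))
print("most likely digit: %d" % np.argmax(probs))
```

The outputs are all positive and sum to 1, so they can be read as "probably a 4, maybe a 9" rather than a single hard guess.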
Train the model to learn good parameters

To train a machine learning model, we implement a training loop:
1. Initialize parameters. Good initial values could be zero, or sampled from some statistical
   distribution.
2. Feed an input example, X, and label, y, into a probabilistic model.
3. For each input, evaluate the model using the current parameter values to guess a label. This
   step is called "inference" or "prediction".
4. Compare the predicted label, y_pred, with the real label, y, and evaluate a loss function. This
   yields a loss value to indicate how wrong the model was.
5. Update parameters by feeding the loss value into an optimizer. Optimization is a deeply
   technical topic and we'll cover it separately in another article. For now, think of the optimizer as a
   function that uses loss values to take steps away from bad parameter values (and hopefully,
   towards good parameter values).
6. Iterate by repeating steps 2-5. Stop when the parameters stop changing, you run out of data, or
   the predictions are good enough using some other criterion.
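Before moving to Tensorflow, the steps above can be sketched in plain NumPy for the binary case. This is a toy under stated assumptions, not the article's implementation: the two-feature dataset is synthetic, the optimizer is hand-written full-batch gradient descent, and the learning rate and step count are arbitrary choices.

```python
import numpy as np

rng = np.random.RandomState(0)

# Hypothetical toy task: 2 features, label is 1 when the features sum to more than 1.
X = rng.rand(200, 2)
y = (X.sum(axis=1) > 1.0).astype(float)

# 1. Initialize parameters.
w = np.zeros(2)
b = 0.0
learning_rate = 0.5
eps = 1e-12  # avoids log(0)

losses = []
for step in range(500):
    # 2-3. Feed the inputs and run inference with the current parameters.
    y_pred = 1.0 / (1.0 + np.exp(-(X.dot(w) + b)))
    # 4. Evaluate the loss (mean cross entropy over all examples).
    loss = -np.mean(y * np.log(y_pred + eps) + (1 - y) * np.log(1 - y_pred + eps))
    losses.append(loss)
    # 5. Update parameters by stepping against the gradient of the loss.
    grad_w = X.T.dot(y_pred - y) / len(y)
    grad_b = np.mean(y_pred - y)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b
    # 6. Iterate (here: a fixed number of steps).

accuracy = np.mean((y_pred > 0.5) == (y == 1.0))
print("final loss %.3f, training accuracy %.3f" % (losses[-1], accuracy))
```

Tensorflow automates the gradient computation in step 5, which is what makes it practical to swap in more complicated models later.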
Tensorflow implementation:
Start as in previous articles, importing modules and loading data:
import numpy as np
import tensorflow as tf
from skdata.mnist.views import OfficialVectorClassification
from tqdm import tqdm

try:
    from itertools import izip_longest  # Python 2
except ImportError:
    from itertools import zip_longest as izip_longest  # Python 3 name for the same function

view = OfficialVectorClassification()
train_idxs = view.fit_idxs[:]
dev_idxs = view.val_idxs[:]
holdout_idxs = view.tst_idxs[:]

train_images = []
train_labels = []
dev_images = []
dev_labels = []
for idx in train_idxs:
    train_images.append(view.all_vectors[idx])  # image
    train_labels.append(view.all_labels[idx])  # label
for idx in dev_idxs:
    dev_images.append(view.all_vectors[idx])
    dev_labels.append(view.all_labels[idx])
Previously, we didn't care about the range of values in the input images, because a nearest
neighbors model does not have any learned parameters, just lookups. Here we apply a simple
normalization to convert the unsigned 8-bit grayscale input pixels from [0,255] range to
floating point numbers in the range [0,1]:
train_images = np.array(train_images) / 255.
dev_images = np.array(dev_images) / 255.
We also need to define a new utility function for encoding numerical label values as "one-hot"
values. We'll explore the reasoning and implementation for this in the upcoming softmax
article. But the gist is that one-hot encoding gives us a convenient way to index the class
labels into a bunch of "one vs. the rest" classifications.
def one_hot(dense_label_vector, n_labels=None):
    """
    Given a dense vector (of class labels), returns the sparse "one hot" encoding of that vector.
    Uses numpy ops, with default on_value=1.0 and off_value=0.0.
    """
    if not n_labels:
        n_labels = np.max(dense_label_vector) + 1
    oh = np.eye(n_labels)[dense_label_vector]
    return oh
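A quick usage check shows what the encoding looks like. The function body is repeated (without the docstring) so that this snippet runs on its own; the `labels` list is a made-up example.

```python
import numpy as np

def one_hot(dense_label_vector, n_labels=None):
    # Same logic as the utility above, repeated so this snippet is self-contained.
    if not n_labels:
        n_labels = np.max(dense_label_vector) + 1
    return np.eye(n_labels)[dense_label_vector]

labels = [4, 0, 9, 1]
encoded = one_hot(labels, n_labels=10)
print(encoded.shape)   # one row per label, one column per class
print(encoded[0])      # a single 1.0 at index 4

# np.argmax along each row recovers the original dense labels.
print(np.argmax(encoded, axis=1))
```

Passing `n_labels=10` explicitly is a safe habit: if it is inferred from the data, a batch that happens to lack the digit 9 would be encoded with fewer than 10 columns.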
Previous models didn't have any learned parameters; there was nothing to train. Now we need
to implement a training loop. First we write a simple utility function to yield batches from a
sequence of input examples:
def batches(iterable, n, fillvalue=None):
    """
    Yield batches of n examples from a sequence of input examples
    """
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)
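The `[iter(iterable)] * n` idiom is compact but not obvious, so here is a small self-contained demonstration. The try/except import is our addition so the snippet runs on either Python 2 or Python 3.

```python
try:
    from itertools import izip_longest  # Python 2, as in the article
except ImportError:
    from itertools import zip_longest as izip_longest  # Python 3 name

def batches(iterable, n, fillvalue=None):
    # n references to the SAME iterator, so each output tuple consumes n items.
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)

result = list(batches(range(7), 3))
print(result)  # [(0, 1, 2), (3, 4, 5), (6, None, None)]
```

Note the padding: when the number of examples is not a multiple of `n`, the last batch is filled out with `fillvalue`, so callers that index with the batch contents need to skip those entries.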
Next, we translate our (dense) class labels from something like [4,0,9,1, ...] into one-hot vectors
like [[0,0,0,0,1,0,0,0,0,0], [1,0,0,0,0,0,0,0,0,0], [0,0,0,0,0,0,0,0,0,1], [0,1,0,0,0,0,0,0,0,0], ...].
train_labels_onehot = one_hot(train_labels)
dev_labels_onehot = one_hot(dev_labels)
We also have a few hyperparameters. Models have learned parameters that encode
knowledge about the data distribution and task being optimized. In this case, our parameters
are biases and weights. Hyperparameters operate at a meta-level, governing the behavior of
the modeling process itself. For example, n_batch determines how many examples are
processed in a batch.
Feel free to tinker with these. For example, what happens if you make n_batch=1000? Why
is that?
# params
learning_rate = 0.01
n_epochs = 100
n_batch = 100
display_each = 1
Finally, we can start building the Tensorflow graph to define our model, in much the same way
as the official Tensorflow tutorial, MNIST for Experts
(https://www.tensorflow.org/versions/r0.10/tutorials/mnist/pros/index.html). We want to feed
data in as examples, X, and labels, y, so we need to define placeholders:
x = tf.placeholder("float", [None, 784])  # images have 784 pixels
y = tf.placeholder("float", [None, 10])  # 10 target classes, one for each digit
We also need initial values for model parameters. Unlike placeholders, which will take on
whatever value is fed in from the data during training, we want these parameter values to be
initialized to a real value, and then updated as the model learns. Here we'll initialize weights
and biases with zeros. But you could try other strategies too, like random sampling from a
normal distribution. Note that we need to tell Tensorflow the shapes of these variables.
w = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))
Now we can define a model in one line! When people talk about how deep learning
frameworks like Theano and Tensorflow enable rapid experimentation, they are referencing
this symbolic expression functionality. See how this model is exactly like the math we derived
above!
There is some effort to get everything loaded and initialized, but once you've done that,
expressing models is very transparent. We use the tf.matmul function instead of simple
(elementwise) multiplication because x holds a batch of 784-pixel vectors and w is a 784 x 10
matrix; tf.matmul computes the vectorized matrix product for the whole batch at once.
y_pred = tf.nn.softmax(tf.matmul(x, w) + b)
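To evaluate the prediction against the known class label for each example, we need a loss
function. Like the logistic function we derived above, the softmax function is returning a
probabilistic prediction. The cross entropy loss is simply a convenient way to add up
predictions across the target classes, and reduce them from a vector to a single number that
represents prediction error. We take the log of the prediction from y_pred, multiply it by the
one-hot label y, and sum over the 10 classes with tf.reduce_sum; negating that sum gives a
log loss for each example. Then we apply tf.reduce_mean to average the loss across the
examples in the batch. The minus sign matters: log probabilities are negative, so negating
them yields a positive loss that shrinks as the probability of the true class approaches 1.
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y * tf.log(y_pred), reduction_indices=[1]))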
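A worked example in NumPy may help show what the cross entropy measures, written with the conventional minus sign so that the loss is positive. The two predictions below are made up: one confident and correct, one torn between 4 and 9.

```python
import numpy as np

# One-hot labels for two examples (true digits 4 and 9).
y = np.array([[0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]], dtype=float)

# Hypothetical softmax outputs; each row sums to 1.
y_pred = np.full((2, 10), 0.01)
y_pred[0, 4] = 0.91   # confident and right
y_pred[1, 4] = 0.46   # unsure between 4 ...
y_pred[1, 9] = 0.46   # ... and 9

# Summing over classes (axis 1) keeps only -log of the probability assigned
# to the true class; the mean then averages over the examples.
per_example = -np.sum(y * np.log(y_pred), axis=1)
cross_entropy = np.mean(per_example)
print(per_example.round(3), round(cross_entropy, 3))
```

The uncertain prediction is penalized much more heavily than the confident correct one, which is exactly the signal the optimizer needs.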
An optimizer takes a sequence of loss values (e.g., cross_entropy values) and updates
parameters (e.g., b and w values). Tensorflow has a number of optimizers available, so we
pick one and instantiate it. Here we'll use vanilla mini-batch gradient descent:
train_op = tf.train.GradientDescentOptimizer(learning_rate).minimize(cross_entropy)
Finally, we construct a training loop to feed input examples, evaluate these ops, and optimize
model parameters. Tensorflow will take care of ops dependencies, so when we ask for
train_op and y_pred values, it will also compute all the other values it needs to compute
them.
The MNIST dataset has relatively few examples, so we'll make many passes (epochs) through
the data, shuffling the order of examples each time so that each minibatch gets different
examples. Periodically, we evaluate accuracy metrics.
# Accuracy ops, defined once outside the loop so the graph does not grow every epoch
correct_prediction = tf.equal(tf.argmax(y_pred, 1), tf.argmax(y, 1))  # is this a correct prediction?
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))  # convert from booleans to floats

# Train loop
examples = list(zip(train_images, train_labels_onehot))  # list() so it can be indexed on Python 3
idxs = list(range(len(examples)))
with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())
    for epoch in range(n_epochs):
        np.random.shuffle(idxs)  # shuffle in place
        for batch in batches(idxs, n_batch):
            batch = [idx for idx in batch if idx is not None]  # drop padding in the last batch
            xs = [examples[idx][0] for idx in batch]  # images
            ys = [examples[idx][1] for idx in batch]  # one-hot labels
            _, c = sess.run([train_op, y_pred], feed_dict={
                x: xs,
                y: ys
            })
        # evaluate the accuracy op on the dev data -> actual values for this epoch
        if (epoch + 1) % display_each == 0:
            acc = accuracy.eval(feed_dict={
                x: dev_images,
                y: dev_labels_onehot
            })
            print("Epoch: %d, accuracy: %.5f" % (epoch, acc))
Load it up and run it! You should get results similar to this:
.
.
.
The optimizer and shuffling ops here are non-deterministic, so you may get slightly different
results. But you should get accuracies around 92%.
Wait a second, 92% is worse than the last article!

Yes it is. Why is that?
You have the tools you need to become an expert; time to earn some experience! Logistic
regression models are great for making principled probabilistic predictions from arbitrary input
features. But ultimately, logistic regression is just a linear model. Linear models are
surprisingly useful, but if optical character recognition were solvable with a linear model, this
problem would not have been worthy of study for the past two decades!
Science is iterative, and you just did an iteration! Congrats! Time to do another: what is your
next hypothesis?
Visualization informs better models
To develop better models, you need to understand the strengths and deficiencies of your
current model. We showed a simple version of this above, using grayscale distributions and
visual inspection of a few examples. This is a good basic method, because every dataset has
examples and distributions to inspect.
But optical character recognition is a visual task, and our human eyes + brains are very good
at seeing visual patterns. Can we exploit visualization to discover patterns of errors for this
specific task? You be the judge! There are many examples of MNIST visualizations. Here are a
couple of our favorites, a creative take on the standard confusion matrix
(http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html#example-model-selection-plot-confusion-matrix-py)
from @genekogan (https://twitter.com/genekogan) and
@AlecRad (https://twitter.com/alecrad):
Each element is the sample with the highest probability.

Model is a simple 1-layer neural network, trained on ~3k samples.
Accuracy is around 88%.
Image credit: Gene Kogan (https://twitter.com/genekogan/status/709490984757886977).
Comparing three different types of model:

(left) logistic regression, (mid) a multi-layered/deep neural network, (right) a convolutional neural network
Image credit: Alec Radford (https://twitter.com/AlecRad/status/709565459646050305).
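The visualizations above build on the standard confusion matrix, which is easy to compute yourself. Here is a minimal NumPy sketch; the `real` and `pred` arrays are made-up stand-ins for dev labels and model predictions.

```python
import numpy as np

# Hypothetical labels and predictions; substitute real model outputs.
real = np.array([4, 9, 9, 8, 1, 7, 4, 4])
pred = np.array([9, 9, 4, 9, 1, 7, 4, 9])

n_classes = 10
confusion = np.zeros((n_classes, n_classes), dtype=int)
# Rows index the real digit, columns the predicted digit.
# np.add.at accumulates correctly even when the same (row, column) pair repeats.
np.add.at(confusion, (real, pred), 1)

print("4 predicted as 9: %d times" % confusion[4, 9])
print("correct predictions: %d of %d" % (np.trace(confusion), len(real)))
```

Correct predictions land on the diagonal, so any bright off-diagonal cell points directly at a pattern of errors worth diagnosing.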
Next, we'll show how to implement neural networks in Tensorflow. By changing a few lines
of code to implement a more powerful model, we'll go from about 92% accuracy to better
than 99%.