
MACHINE LEARNING: AN OVERVIEW
Tony Cooper
Senior Data Scientist
tonycooper@kpmg.co.nz

July 2016
kpmg.com/nz

Agenda
Introduction
Machine Learning
- Deep Learning
- Transfer Learning
- Reinforcement Learning


Introduction
Last meeting: Machine Learning, what it can do
This meeting: Machine Learning, how it works
Not covering: how to do Machine Learning (e.g. test/train split)
Not covering: applications (see e.g. the long list at
http://www.deeplearningpatterns.com/doku.php/applications)

Also not covering:


Speech
Text (Buffalo buffalo Buffalo, buffalo buffalo, buffalo Buffalo buffalo)
Audio
Time Series
Graphs
Internet of Things
Bots (e.g. Siri)
Big Data

Reminder: Last Meeting (Nickle Lu)


Amazing things AI has done
Why AI can do those things
Why AI will eat everything
How I learned AI in my career
How can you apply it in yours

Presenter: Tony Cooper

5 years Stanford PhD
Thesis: a Computer-Intensive Statistics project on numerical methods for the
bootstrap (unfinished)

DSIR: Consulting Statistician

Funds Management: database technology (not Big Data)
Double-Digit Numerics: Consulting Data Scientist
KPMG: set up the Data Science Innovation Lab
3 years' experience with Deep Learning (2 years CNNs, 3 years H2O)

KPMG Innovation Lab


Big Data
Spark (Spark Meetup, Auckland, 5 September 2016 at KPMG)
R
H2O

Machine Learning

Recommender Systems
Computational Advertising
Hyperpersonalisation (segmentation with segment size 1)
Computer Vision
GPU programming

KPMG Hardware
7-node Spark cluster (7 x 2 Xeon)
1 GPU server:
4 x Tesla K80 GPU (4 x 24 GB GPU RAM, 4 x 5,000 cores, 4 x 5.8 TeraFLOPS)
2 x Xeon, 14 cores each (56 threads)
1 TB RAM
6 TB SSD

Agenda
Introduction
Machine Learning
- Deep Learning
- Transfer Learning
- Reinforcement Learning

Machine Learning

[Diagram: overlapping terminology: Statistical Learning (SL), Machine Learning (ML), AI, Machine Intelligence]

Types of Machine Learning

Unsupervised
Supervised

Semi-supervised

Machine Learning Resources

Technical

especially 7.10.2, 7.10.3

Practical

Dummies
(can download the R and Python code without buying the book)

Experts
Machine Learning's best kept secret

Interesting

Courses, MOOCs (e.g. Udacity Deep Learning)

Internet
http://deeplearning.stanford.edu/tutorial/
http://cs231n.stanford.edu/
(some images in this presentation taken from there)

Contests (esp Kaggle.com)


Glossaries
http://envisat.esa.int/handbooks/meris/CNTR4-2-5.html
http://www.wildml.com/deep-learning-glossary/
http://deeplearning4j.org/glossary.html

Kaggle.com competitions (Titanic is a good starter), scripts (kernels), and real data

Tip: use containers

Docker
Can run Ubuntu on Windows
All set up for you, e.g. the Google TensorFlow course at Udacity

A Taste of Machine Learning?


Regression Example: Recommender System

[Figure: a ratings matrix for three users (Alice, Bob, Chad) and five items, with many
ratings missing, factorised into a 3 x 2 user-factor matrix H (entries h11 ... h32) and an
item-factor matrix W (entries w11 ... w25). Each known rating is modelled as the product
of a user row of H and an item column of W, e.g.]

r12 = h11*w12 + h12*w22

11 equations in 16 unknowns

Generically:
rating = h1*w1 + h2*w2 + ... + hk*wk
Solved using Alternating Least Squares (the machine chooses the features: latent features)
The machine did the work for us in deciding what features to use

R pseudo code
# ratings matrix (NA = unknown rating)
R = matrix(nr=3, nc=5, data=c(4,2,1,NA,5,NA,3,3,3,NA,4,2,1,3,NA))
# initial users (latent factor) matrix
h = matrix(nr=3, nc=2, data=rnorm(6))
# initial items (latent factor) matrix
w = matrix(nr=5, nc=2, data=rnorm(10))
# find h, w to minimise the squared error over the known ratings
for (iter in 1:5) {
  # update users, holding items fixed
  for (i in 1:3) {
    h[i, ] = solve(...)
  }
  # update items, holding users fixed
  for (j in 1:5) {
    w[j, ] = solve(...)
  }
}
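A minimal runnable version of the same idea, with the least-squares updates that the pseudo code leaves as solve(...) filled in. This is a sketch, not the presenter's exact code; the small ridge term lambda and the 20 iterations are assumptions added for numerical stability.

# Alternating Least Squares on the small ratings matrix above.
# Each update solves a ridge-regularised least-squares problem
# using only the ratings that are actually observed.
R <- matrix(nrow = 3, ncol = 5, data = c(4,2,1,NA,5,NA,3,3,3,NA,4,2,1,3,NA))
k <- 2                      # number of latent features
lambda <- 0.01              # small ridge term (assumption, keeps solve() stable)
set.seed(1)
h <- matrix(rnorm(3 * k), nrow = 3)   # user factors
w <- matrix(rnorm(5 * k), nrow = 5)   # item factors

for (iter in 1:20) {
  # update each user's factors, holding the item factors fixed
  for (i in 1:3) {
    obs <- which(!is.na(R[i, ]))
    W <- w[obs, , drop = FALSE]
    h[i, ] <- solve(t(W) %*% W + lambda * diag(k), t(W) %*% R[i, obs])
  }
  # update each item's factors, holding the user factors fixed
  for (j in 1:5) {
    obs <- which(!is.na(R[, j]))
    H <- h[obs, , drop = FALSE]
    w[j, ] <- solve(t(H) %*% H + lambda * diag(k), t(H) %*% R[obs, j])
  }
}

round(h %*% t(w), 1)   # reconstructed ratings, including the missing cells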

Another Taste: Beyond Linear Regression

[Table: a sample of Auckland house sales. Columns: Suburb (Mt Roskill, Mt Albert, Mt Eden),
List Price, Agreement Date, Type (R / APT), Title, Sale Method, Bedrooms, Land Area,
Floor Area, Existing/New, Valuation, Valuation Year, Sale Price. Many cells are missing.]


Linear Regression (200 years old)

Essentially a weighted combination of the inputs, e.g.
SalePrice = w0 + w1*Suburb + w2*ListPrice + w3*AgreementDate + w4*Type + w5*Title + w6*SaleMethod +
w7*Bedrooms + w8*LandArea + w9*FloorArea + w10*Existing + w11*ValuationYear
Pros
Simple to understand and interpret (taught at high school)
Simple to compute in Excel (Least Squares)

Problems
The world isn't linear
Doesn't handle interactions easily (Samuel Johnson: "Your manuscript is good and original; but what is original is not
good, and what is good is not original")
Doesn't handle missing values at all
Doesn't handle correlated inputs well
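As a concrete illustration, fitting this kind of weighted combination in R is one call to lm(). This is only a sketch: the data frame sales and its column names are hypothetical stand-ins for the house sales table above, not the presenter's data.

# Ordinary least-squares fit of sale price on the other columns
# (assumes a data frame `sales` with these column names exists).
fit <- lm(SalePrice ~ Suburb + ListPrice + AgreementDate + Type + Bedrooms +
            LandArea + FloorArea + ValuationYear,
          data = sales)
summary(fit)                           # the estimated weights w0, w1, ...
predict(fit, newdata = sales[1:5, ])   # predicted sale prices for a few rows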

Simple Example: Actual Function

[Plot: the true Response as a function of Input (roughly 0 to 3000)]

Simple Example: Noise Added

[Plot: noisy Response samples drawn from the same function]

Linear Fit (Underfitting)

[Plot: a straight-line fit to the noisy samples, which underfits]

Cubic Fit

[Plot: a cubic polynomial fit to the noisy samples]

Quartic Fit

[Plot: a quartic polynomial fit to the noisy samples]

Quintic Fit

[Plot: a quintic polynomial fit to the noisy samples]

Overfitting

[Plot: a very flexible fit that chases the noise (overfitting)]

Support Vector Regression Fit

[Plot: a Support Vector Machine fit with C = 127.578, gamma = 1.22]

Sigmoid

[Plot: the sigmoid activation function]

Neural Network: 1 Neuron (sigmoid)

[Plot: neural network fit with hidden layer size = 1]

y = w1*h1 + w2*h2 + ... + wn*hn

Linear: hi is constant

y = w1*s1 + w2*s2 + ... + wn*sn

Neural network: si is a sigmoid

A neural network is just a bunch of weighted sigmoid regressions; n is the number of nodes
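A sketch of the hidden-layer-size experiment from the next few slides, on a toy 1-D problem. The true function, the noise level, and the use of the nnet package are assumptions for illustration, not the presenter's setup.

# Fit small neural networks of increasing width to noisy 1-D data.
library(nnet)

set.seed(1)
x <- seq(0, 3000, length.out = 200)
y <- 6 + 4 * sin(x / 500) + rnorm(length(x), sd = 1)   # made-up noisy samples
d <- data.frame(x = as.numeric(scale(x)), y = y)        # scale the input for the optimiser

for (size in c(1, 2, 5, 10)) {
  fit <- nnet(y ~ x, data = d, size = size, linout = TRUE,
              maxit = 500, trace = FALSE)
  cat("hidden nodes:", size,
      " residual SS:", round(sum(fit$residuals^2), 1), "\n")
}
# The slides' point: 1 node underfits, around 5 nodes follow the curve,
# and 10 nodes start to chase the noise (overfitting).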

Neural Network: 2 Neurons (sigmoid)

[Plot: neural network fit with hidden layer size = 2]

Neural Network: 5 Neurons (sigmoid)

[Plot: neural network fit with hidden layer size = 5]

Neural Network: 10 Neurons (sigmoid)

[Plot: neural network fit with hidden layer size = 10]

Overfitting: 10 is too many neurons

Neural Network 5 Neurons (sigmoid)

playground.tensorflow.org

Neural Networks are just weighted regressions

y = w1*s1 + w2*s2 + ... + wn*sn

Neural Network: there is a theorem (the universal approximation theorem) that says
you can model anything with a single-layer Neural Network
But instead of going wide it can be more effective to go deep

Agenda
Introduction
Machine Learning
- Deep Learning
- Transfer Learning
- Reinforcement Learning

- Deep Learning

Detect complex interactions among features
Learn low-level features from minimally processed raw data
Work with high-cardinality class memberships
Work with unlabelled data

Lots of hype, but it's mostly true

Fosbury flop analogy, gold rush analogy
Unreasonably effective

Example - Drive a car

Which way to turn the steering wheel?

Same Problem as

[Table: the same house sales data shown earlier (Suburb, List Price, Agreement Date, Type,
Title, Sale Method, Bedrooms, Land Area, Floor Area, Existing/New, Valuation, Valuation
Year, Sale Price)]

Build a model to predict output from input

Example: Two-Variable Classification

e.g.
X1 = House Price,
X2 = House Area,
Y = whether or not the house sells at auction

Model: predict whether or not the house will sell at auction
(obviously fake data, for illustration only)

Linear Regression

(X1, X2) model and (X1, X2, X1*X2) model

Tree

(X1, X2) model and (X1, X2, X1*X2) model

Random Forest

(X1, X2) model and (X1, X2, X1*X2) model

Support Vector Machine

(X1, X2) model and (X1, X2, X1*X2) model

Gradient Boosting

(X1, X2) model and (X1, X2, X1*X2) model

Single Layer Neural Network 5 nodes


(X1, X2) model and (X1, X2, X1*X2) model

Going Deeper

playground.tensorflow.org

Adding X1*X2

Dropping X1 and X2: Feature Engineering

Feature Engineering and Feature Selection is hard!

An art and a science, computationally difficult
O(2^n), where n = number of features; n can be thousands
How did we know to add X1*X2? (see the sketch below)
Can we get the machine to do it for us?
Yes: Deep Learning
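A small sketch of why the hand-made interaction term matters: a logistic regression fitted with and without an X1*X2 feature. The data-generating rule here is an assumption, chosen so that the interaction is exactly what separates the classes.

# Hand-engineered interaction feature vs. raw inputs (illustrative data only).
set.seed(1)
n  <- 500
x1 <- runif(n); x2 <- runif(n)
y  <- as.integer(x1 * x2 > 0.25)            # classes separated by X1*X2

without <- glm(y ~ x1 + x2,             family = binomial)
with    <- glm(y ~ x1 + x2 + I(x1 * x2), family = binomial)

# training accuracy of each model
mean((predict(without, type = "response") > 0.5) == y)
mean((predict(with,    type = "response") > 0.5) == y)
# The raw (X1, X2) model can only draw a straight boundary and misclassifies many
# points; adding the X1*X2 feature lets the same linear model find the curved boundary.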

Deep Learning: no X1*X2

Go and Play!

(use ReLU)

ReLU (Rectified Linear Unit): f(x) = max(0, x)
(plays a similar role to the sigmoid but has advantages)

Types of Neural Networks (different plumbing)

Recurrent Neural Networks (including LSTM)
Deep Belief Networks
Deep Boltzmann Machines
Autoencoders
Convolutional Neural Networks

Recurrent Neural Networks (including LSTM)

Exploit repeated patterns that occur over, say, time or across sentences by
feeding data repeatedly through the network

Recurrent Neural Networks (including LSTM)

LSTM = Long Short-Term Memory

Autoencoders (strange, but the most fun)

Train the output to match the input

deeplearning4j.org

Autoencoders

Compression
Dimension reduction (resembles PCA; see the sketch below)
Noise reduction (MRI example)
Drawing stuff
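A minimal sketch of an autoencoder used for dimension reduction, using H2O from R (both mentioned earlier in the deck). The iris data and the 2-node bottleneck are illustrative choices, not the presenter's example.

# Toy autoencoder: squeeze 4 numeric inputs through a 2-node bottleneck.
library(h2o)
h2o.init()

iris_h2o <- as.h2o(iris[, 1:4])                       # four numeric input columns

ae <- h2o.deeplearning(x = 1:4,
                       training_frame = iris_h2o,
                       autoencoder = TRUE,            # train the output to match the input
                       hidden = c(2),                 # 2-node bottleneck, like a nonlinear PCA
                       activation = "Tanh",
                       epochs = 50)

codes <- h2o.deepfeatures(ae, iris_h2o, layer = 1)    # the 2-D compressed representation
errs  <- h2o.anomaly(ae, iris_h2o)                    # per-row reconstruction error
head(as.data.frame(codes))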

Drawing Stuff

More Drawing Stuff (DeepDream): messing with weights

Optimise the picture instead of optimising the weights:
find the best picture that turns on the "dog" neuron

More Drawing Stuff (DeepDream): messing with weights

More Drawing Stuff

- Deep Learning: Convolutional Neural Networks

Essentially networks of weights connected by activation functions (e.g. sigmoid)

Convolutions
Just functions that combine pixels in a weighted way
A way of getting a correlation between a shape and parts of an image
Example: find red circles in an image, find edges
Measure how much the image part matches the shape (see the sketch below)
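A small worked example of "combining pixels in a weighted way": convolving a toy greyscale image with a vertical-edge kernel in base R. The image and the kernel are made up for illustration.

# 2-D convolution (strictly, cross-correlation) of a toy image with an edge kernel.
img <- matrix(0, nrow = 8, ncol = 8)
img[, 5:8] <- 1                          # left half dark, right half bright

kernel <- matrix(c(-1, 0, 1,
                   -1, 0, 1,
                   -1, 0, 1), nrow = 3, byrow = TRUE)   # vertical-edge detector

convolve2d <- function(image, k) {
  kr <- nrow(k); kc <- ncol(k)
  out <- matrix(0, nrow(image) - kr + 1, ncol(image) - kc + 1)
  for (i in 1:nrow(out)) {
    for (j in 1:ncol(out)) {
      patch <- image[i:(i + kr - 1), j:(j + kc - 1)]
      out[i, j] <- sum(patch * k)        # weighted combination of the pixels
    }
  }
  out
}

convolve2d(img, kernel)   # large values where the vertical edge sits, zero elsewhere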

Convolutions Example: Gabor filters


Find correlations with these shapes in the image

Example - Find edges in Images

The mathematics behind convolutions

[Animated GIF illustrating the convolution operation]

Hierarchy of Image Features: Max Pooling

We rescale the picture to find what we are looking for at different sizes and to
find more complicated shapes (see the sketch below)

A Convolutional Neural Network is just stacked layers of convolutions and pooling

(clarifai.com)

It creates a hierarchy of features at decreasing resolutions
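Continuing the toy example above, a max-pooling step just keeps the largest value in each small block, halving the resolution. Again, this is an illustrative sketch, not library code.

# 2x2 max pooling: downsample a feature map by keeping the maximum of each block.
max_pool <- function(fm, size = 2) {
  rows <- seq(1, nrow(fm) - size + 1, by = size)
  cols <- seq(1, ncol(fm) - size + 1, by = size)
  out <- matrix(0, length(rows), length(cols))
  for (i in seq_along(rows)) {
    for (j in seq_along(cols)) {
      out[i, j] <- max(fm[rows[i]:(rows[i] + size - 1),
                          cols[j]:(cols[j] + size - 1)])
    }
  }
  out
}

fm <- matrix(1:16, nrow = 4)   # a pretend 4 x 4 feature map
max_pool(fm)                   # 2 x 2 summary at half the resolution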

Another example (Le Cun)

Google's Inception V3 network

Google's Inception V3 network performance

Inception V4 is out: see arXiv 1602.07261

It's not all convolutions (but still weights)

Deep Learning Software (examples)


Theano, Keras, Caffe, CNTK, MXNet, H2O, Neon, Deeplearning4j, etc
TensorFlow has lots of community support
NVIDIA DIGITS: easy to use, a web interface to Caffe
MXNet and H2O can be used from R

Agenda
Introduction
Machine Learning
- Deep Learning
- Transfer Learning
- Reinforcement Learning

- Transfer Learning
(one more reason why the robots will be eating our lunch)

Example: a cat detector to turn on the sprinklers

Transfer Learning
Transfer Learning uses an existing set of weights for an existing Deep Learning
network and adapts (retrains) some of the layers for a new set of images. This
lets us (and robots) transfer learning to new tasks.

The next two examples show a case where the last three layers of a network
downloaded from the internet are retrained to distinguish between cars and
SUVs. The fruit image example does no retraining but takes weights from the
second last layer as inputs to a model trained in R.

- SUV / CAR example

ImageNet Network (MATLAB)

ImageNet Network (MATLAB)

Results - before

Results - after

Fruit Classification Demo

0 correct out of 57 images

Demo: use the outputs from the second-to-last layer (before classification), 1000 columns

Demo R code using the 1000 columns


library(readr)
library(xgboost)
library(caret)
library(plyr)

setwd("~/MATLAB/matconvnet")
cat("reading data file\n")
train <- read_csv("fruit.csv", col_names=FALSE)
rows = nrow(train)
cols = ncol(train)
train[[cols]] <- factor(train[[cols]], labels=c("peaches", "apricots"))
set.seed(2) # 2 gives a nice even split
inTrain <- sample(rows) # no test sample
x.train <- train[inTrain, -cols]
y.train <- train[[cols]][inTrain]
fitControl <- trainControl(method = "repeatedcv", number=4, repeats=50) #, classProbs=TRUE, summaryFunction=twoClassSummary)
cat("training model\n")
xgbGrid <- expand.grid(.nrounds = 4, .max_depth = 3, .eta = 0.3, .gamma = 0, .colsample_bytree = 0.6, .min_child_weight = 1) # good
xgb <- train(x=x.train, y=y.train, method='xgbTree', trControl=fitControl, tuneGrid=xgbGrid)
save(xgb, file="xgb.RData")
pred <- predict(xgb, newdata=x.train)
confusionMatrix(pred, y.train)
# output the predicted values
cat("saving predicted labels\n")
pred <- predict(xgb, newdata=train[, -cols])
pred <- data.frame(pred) # convert to 0,1 variable
write_csv(pred, "fruitresults.csv", col_names=FALSE)
cat("done\n")

Result

57 correct out of 57 images

Agenda
Introduction
Machine Learning
- Deep Learning
- Transfer Learning
- Reinforcement Learning

- Reinforcement Learning
(one more reason why the robots will be eating our lunch)

Reinforcement Learning
Semi-supervised learning (not all input data has labels)
Labels are sparse
The labels come some time after the input data (when we get our
reward we don't know which actions in the past contributed to that
reward)

Deep Reinforcement Learning

Input data can be very complicated, such as images

Example Applications

Traffic light control (single video camera at the intersection)
Elevator control
Game playing (e.g. Go)
Computational Advertising
(delivering the right ad to the right person at the right time; similar to A/B testing but
much smarter)

Example: Flappy Bird

Most of the time do nothing
Occasionally hit the space bar
Input = picture (state) + action
Labels (output) = one point, or death
Labels (rewards) come after the actions
Similar to an advertising campaign

The state is complicated (but simple compared to some games)

Problem: code is developed for each case

Solution: Deep Reinforcement Learning

DeepMind (now bought by Google)

A single code base plays all Atari games

OpenAI Gym: a platform for reinforcement learning testing

https://gym.openai.com/envs

Algorithm (not as difficult as it looks)

Q-Learning Model
Using Q turns a semi-supervised problem into a supervised problem
Q(s,a) is the long-run mean reward from taking action a in state s
Q(s,a) = E[ r(t+1) + gamma*r(t+2) + gamma^2*r(t+3) + ... | s, a ]
We don't know Q, but we play lots of games using trial and error and
record all the rewards from playing various actions a.
We update Q as we go
It's not as difficult as it looks! (c.f. the Kalman Filter)
Q(s,a) <- Q(s,a) + alpha * ( r + gamma * max over a' of Q(s',a') - Q(s,a) )

Q-Learning Algorithm
Loop through frames and games, updating
Q(s,a) <- Q(s,a) + alpha * ( r + gamma * max over a' of Q(s',a') - Q(s,a) )

where
s = current frame + previous frame
a = action (space bar or nothing)
Q is a deep neural network approximation of the expectation E[.]
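To make the update rule concrete, here is a tabular Q-learning sketch in R on a made-up 5-state corridor. The environment, states, and rewards are assumptions for illustration only; in the Flappy Bird / Atari setting the table is replaced by a deep network.

# Tabular Q-learning on a toy corridor: start in state 1; action 2 ("right") moves
# toward state 5, which pays a reward of 1; action 1 ("left") moves back.
set.seed(1)
n_states  <- 5
n_actions <- 2                      # 1 = left, 2 = right
Q     <- matrix(0, n_states, n_actions)
alpha <- 0.1                        # learning rate
gamma <- 0.9                        # discount factor
eps   <- 0.1                        # exploration probability

for (episode in 1:500) {
  s <- 1
  while (s < n_states) {
    a <- if (runif(1) < eps) sample(n_actions, 1) else which.max(Q[s, ])
    s_next <- if (a == 2) s + 1 else max(1, s - 1)
    r <- if (s_next == n_states) 1 else 0
    # the Q-learning update from the slide
    Q[s, a] <- Q[s, a] + alpha * (r + gamma * max(Q[s_next, ]) - Q[s, a])
    s <- s_next
  }
}

round(Q, 2)   # the "right" column dominates, and values grow as we near the goal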

Demo

Demo

Results (videos)
https://www.youtube.com/watch?v=W2CAghUiofY

https://www.youtube.com/watch?v=TmPfTpjtdgg
