
MACHINE LEARNING: AN OVERVIEW
Tony Cooper
Senior Data Scientist
tonycooper@kpmg.co.nz

July 2016
kpmg.com/nz

Agenda
Introduction
Machine Learning
- Deep Learning
- Transfer Learning
- Reinforcement Learning


Introduction
Last meeting: Machine Learning, what it can do
This meeting: Machine Learning, how it works
Not covering: how to do Machine Learning (e.g. test/train split)
Not covering: applications (see e.g. the long list at
http://www.deeplearningpatterns.com/doku.php/applications)

Also not covering:


Speech
Text (Buffalo buffalo Buffalo, buffalo buffalo, buffalo Buffalo buffalo)
Audio
Time Series
Graphs
Internet of Things
Bots (e.g. Siri)
Big Data

Reminder: Last Meeting (Nickle Lu)


Amazing things AI has done
Why AI can do those things
Why AI will eat everything
How I learned AI in my career
How can you apply it in yours

Presenter: Tony Cooper

5 years Stanford PhD
Thesis: a Computer-Intensive Statistics project on numerical methods for the
bootstrap (unfinished)

DSIR: Consulting Statistician

Funds Management: database technology (not Big Data)
Double-Digit Numerics: Consulting Data Scientist
KPMG: set up the Data Science Innovation Lab
3 years' experience with Deep Learning (2 years CNNs, 3 years H2O)

KPMG Innovation Lab


Big Data
Spark (Spark Meetup, Auckland, 5 September 2016 at KPMG)
R
H2O

Machine Learning

Recommender Systems
Computational Advertising
Hyperpersonalisation (segmentation with segment size 1)
Computer Vision
GPU programming

KPMG Hardware
7-node Spark cluster (7 x 2 Xeon)
1 GPU server:
4 x Tesla K80 GPU (4 x 24 GB GPU RAM, 4 x 5,000 cores, 4 x 5.8 TeraFLOPS)
2 x Xeon, 14 cores each (56 threads)
1 TB RAM
6 TB SSD

Agenda
Introduction
Machine Learning
- Deep Learning
- Transfer Learning
- Reinforcement Learning

Machine Learning

[Diagram: overlapping terminology: Statistical Learning (SL), Machine Learning (ML), AI, Machine Intelligence]

Types of Machine Learning

Unsupervised
Supervised

Semi-supervised

Machine Learning Resources

Technical

especially 7.10.2, 7.10.3

Practical

Dummies
(can download the R and Python code without buying the book)

Experts
Machine Learning's best kept secret

Interesting

Courses, MOOCs (e.g. Udacity Deep Learning)

Internet
http://deeplearning.stanford.edu/tutorial/
http://cs231n.stanford.edu/
(some images in this presentation taken from there)

Contests (esp Kaggle.com)


Glossaries
http://envisat.esa.int/handbooks/meris/CNTR4-2-5.html
http://www.wildml.com/deep-learning-glossary/
http://deeplearning4j.org/glossary.html

Kaggle.com competitions (Titanic is a good starter), scripts (kernels), and real data

Tip: use containers

Docker
Can run Ubuntu on Windows
All set up for you, e.g. the Google TensorFlow course at Udacity

A Taste of Machine Learning?


Regression Example: Recommender System

[Figure: a ratings matrix for three users (Alice, Bob, Chad) and five items, with many
ratings missing, factorised into a 3 x 2 user-factor matrix H (entries h11 ... h32) and an
item-factor matrix W (entries w11 ... w25). Each known rating is modelled as the product
of a user row of H and an item column of W, e.g.]

r12 = h11*w12 + h12*w22

11 equations in 16 unknowns

Generically:
rating = h1*w1 + h2*w2 + ... + hk*wk
Solved using Alternating Least Squares (the machine chooses the features: latent features)
The machine did the work for us in deciding what features to use

R pseudo code
# ratings matrix (NA = unknown rating)
R = matrix(nr=3, nc=5, data=c(4,2,1,NA,5,NA,3,3,3,NA,4,2,1,3,NA))
# initial users (latent factor) matrix
h = matrix(nr=3, nc=2, data=rnorm(6))
# initial items (latent factor) matrix
w = matrix(nr=5, nc=2, data=rnorm(10))
# find h, w to minimise the squared error over the known ratings
for (iter in 1:5) {
  # update users, holding items fixed
  for (i in 1:3) {
    h[i, ] = solve(...)
  }
  # update items, holding users fixed
  for (j in 1:5) {
    w[j, ] = solve(...)
  }
}
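A minimal runnable version of the same idea, with the least-squares updates that the pseudo code leaves as solve(...) filled in. This is a sketch, not the presenter's exact code; the small ridge term lambda and the 20 iterations are assumptions added for numerical stability.

# Alternating Least Squares on the small ratings matrix above.
# Each update solves a ridge-regularised least-squares problem
# using only the ratings that are actually observed.
R <- matrix(nrow = 3, ncol = 5, data = c(4,2,1,NA,5,NA,3,3,3,NA,4,2,1,3,NA))
k <- 2                      # number of latent features
lambda <- 0.01              # small ridge term (assumption, keeps solve() stable)
set.seed(1)
h <- matrix(rnorm(3 * k), nrow = 3)   # user factors
w <- matrix(rnorm(5 * k), nrow = 5)   # item factors

for (iter in 1:20) {
  # update each user's factors, holding the item factors fixed
  for (i in 1:3) {
    obs <- which(!is.na(R[i, ]))
    W <- w[obs, , drop = FALSE]
    h[i, ] <- solve(t(W) %*% W + lambda * diag(k), t(W) %*% R[i, obs])
  }
  # update each item's factors, holding the user factors fixed
  for (j in 1:5) {
    obs <- which(!is.na(R[, j]))
    H <- h[obs, , drop = FALSE]
    w[j, ] <- solve(t(H) %*% H + lambda * diag(k), t(H) %*% R[obs, j])
  }
}

round(h %*% t(w), 1)   # reconstructed ratings, including the missing cells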

Another Taste: Beyond Linear Regression

[Table: a sample of Auckland house sales. Columns: Suburb (Mt Roskill, Mt Albert, Mt Eden),
List Price, Agreement Date, Type (R / APT), Title, Sale Method, Bedrooms, Land Area,
Floor Area, Existing/New, Valuation, Valuation Year, Sale Price. Many cells are missing.]


Linear Regression (200 years old)

Essentially a weighted combination of the inputs, e.g.
SalePrice = w0 + w1*Suburb + w2*ListPrice + w3*AgreementDate + w4*Type + w5*Title + w6*SaleMethod +
w7*Bedrooms + w8*LandArea + w9*FloorArea + w10*Existing + w11*ValuationYear
Pros
Simple to understand and interpret (taught at high school)
Simple to compute in Excel (Least Squares)

Problems
The world isn't linear
Doesn't handle interactions easily (Samuel Johnson: "Your manuscript is good and original; but what is original is not
good, and what is good is not original")
Doesn't handle missing values at all
Doesn't handle correlated inputs well
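As a concrete illustration, fitting this kind of weighted combination in R is one call to lm(). This is only a sketch: the data frame sales and its column names are hypothetical stand-ins for the house sales table above, not the presenter's data.

# Ordinary least-squares fit of sale price on the other columns
# (assumes a data frame `sales` with these column names exists).
fit <- lm(SalePrice ~ Suburb + ListPrice + AgreementDate + Type + Bedrooms +
            LandArea + FloorArea + ValuationYear,
          data = sales)
summary(fit)                           # the estimated weights w0, w1, ...
predict(fit, newdata = sales[1:5, ])   # predicted sale prices for a few rows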

Simple Example: Actual Function

[Plot: the true Response as a function of Input (roughly 0 to 3000)]

Simple Example: Noise Added

[Plot: noisy Response samples drawn from the same function]

Linear Fit (Underfitting)

[Plot: a straight-line fit to the noisy samples, which underfits]

Cubic Fit

[Plot: a cubic polynomial fit to the noisy samples]

Quartic Fit

[Plot: a quartic polynomial fit to the noisy samples]

Quintic Fit

[Plot: a quintic polynomial fit to the noisy samples]

Overfitting

[Plot: a very flexible fit that chases the noise (overfitting)]

Support Vector Regression Fit

[Plot: a Support Vector Machine fit with C = 127.578, gamma = 1.22]

Sigmoid

[Plot: the sigmoid activation function]

Neural Network: 1 Neuron (sigmoid)

[Plot: neural network fit with hidden layer size = 1]

y = w1*h1 + w2*h2 + ... + wn*hn

Linear: hi is constant

y = w1*s1 + w2*s2 + ... + wn*sn

Neural network: si is a sigmoid

A neural network is just a bunch of weighted sigmoid regressions; n is the number of nodes
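A sketch of the hidden-layer-size experiment from the next few slides, on a toy 1-D problem. The true function, the noise level, and the use of the nnet package are assumptions for illustration, not the presenter's setup.

# Fit small neural networks of increasing width to noisy 1-D data.
library(nnet)

set.seed(1)
x <- seq(0, 3000, length.out = 200)
y <- 6 + 4 * sin(x / 500) + rnorm(length(x), sd = 1)   # made-up noisy samples
d <- data.frame(x = as.numeric(scale(x)), y = y)        # scale the input for the optimiser

for (size in c(1, 2, 5, 10)) {
  fit <- nnet(y ~ x, data = d, size = size, linout = TRUE,
              maxit = 500, trace = FALSE)
  cat("hidden nodes:", size,
      " residual SS:", round(sum(fit$residuals^2), 1), "\n")
}
# The slides' point: 1 node underfits, around 5 nodes follow the curve,
# and 10 nodes start to chase the noise (overfitting).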

Neural Network: 2 Neurons (sigmoid)

[Plot: neural network fit with hidden layer size = 2]

Neural Network: 5 Neurons (sigmoid)

[Plot: neural network fit with hidden layer size = 5]

Neural Network: 10 Neurons (sigmoid)

[Plot: neural network fit with hidden layer size = 10]

Overfitting: 10 is too many neurons

Neural Network 5 Neurons (sigmoid)

playground.tensorflow.org

Neural Networks are just weighted regressions

y = w1*s1 + w2*s2 + ... + wn*sn

Neural Network: there is a theorem (the universal approximation theorem) that says
you can model anything with a single-layer Neural Network
But instead of going wide it can be more effective to go deep

Agenda
Introduction
Machine Learning
- Deep Learning
- Transfer Learning
- Reinforcement Learning

- Deep Learning

Detect complex interactions among features
Learn low-level features from minimally processed raw data
Work with high-cardinality class memberships
Work with unlabelled data

Lots of hype, but it's mostly true

Fosbury flop analogy, gold rush analogy
Unreasonably effective

Example - Drive a car

Which way to turn the steering wheel?

Same Problem as

[Table: the same house sales data shown earlier (Suburb, List Price, Agreement Date, Type,
Title, Sale Method, Bedrooms, Land Area, Floor Area, Existing/New, Valuation, Valuation
Year, Sale Price)]

Build a model to predict output from input

Example: Two-Variable Classification

e.g.
X1 = House Price,
X2 = House Area,
Y = whether or not the house sells at auction

Model: predict whether or not the house will sell at auction
(obviously fake data, for illustration only)

Linear Regression

(X1, X2) model and (X1, X2, X1*X2) model

Tree

(X1, X2) model and (X1, X2, X1*X2) model

Random Forest

(X1, X2) model and (X1, X2, X1*X2) model

Support Vector Machine

(X1, X2) model and (X1, X2, X1*X2) model

Gradient Boosting

(X1, X2) model and (X1, X2, X1*X2) model

Single Layer Neural Network 5 nodes


(X1, X2) model and (X1, X2, X1*X2) model

Going Deeper

playground.tensorflow.org

Adding X1*X2

Dropping X1 and X2: Feature Engineering

Feature Engineering and Feature Selection is hard!

An art and a science, computationally difficult
O(2^n), where n = number of features; n can be thousands
How did we know to add X1*X2? (see the sketch below)
Can we get the machine to do it for us?
Yes: Deep Learning
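A small sketch of why the hand-made interaction term matters: a logistic regression fitted with and without an X1*X2 feature. The data-generating rule here is an assumption, chosen so that the interaction is exactly what separates the classes.

# Hand-engineered interaction feature vs. raw inputs (illustrative data only).
set.seed(1)
n  <- 500
x1 <- runif(n); x2 <- runif(n)
y  <- as.integer(x1 * x2 > 0.25)            # classes separated by X1*X2

without <- glm(y ~ x1 + x2,             family = binomial)
with    <- glm(y ~ x1 + x2 + I(x1 * x2), family = binomial)

# training accuracy of each model
mean((predict(without, type = "response") > 0.5) == y)
mean((predict(with,    type = "response") > 0.5) == y)
# The raw (X1, X2) model can only draw a straight boundary and misclassifies many
# points; adding the X1*X2 feature lets the same linear model find the curved boundary.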

Deep Learning: no X1*X2

Go and Play!

(use ReLU)

ReLU (Rectified Linear Unit): f(x) = max(0, x)
(plays a similar role to the sigmoid but has advantages)

Types of Neural Networks (different plumbing)

Recurrent Neural Networks (including LSTM)
Deep Belief Networks
Deep Boltzmann Machines
Autoencoders
Convolutional Neural Networks

Recurrent Neural Networks (including LSTM)

Exploit repeated patterns that occur over, say, time or across sentences by
feeding data repeatedly through the network

Recurrent Neural Networks (including LSTM)

LSTM = Long Short-Term Memory

Autoencoders (strange, but the most fun)

Train the output to match the input

deeplearning4j.org

Autoencoders

Compression
Dimension reduction (resembles PCA; see the sketch below)
Noise reduction (MRI example)
Drawing stuff
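A minimal sketch of an autoencoder used for dimension reduction, using H2O from R (both mentioned earlier in the deck). The iris data and the 2-node bottleneck are illustrative choices, not the presenter's example.

# Toy autoencoder: squeeze 4 numeric inputs through a 2-node bottleneck.
library(h2o)
h2o.init()

iris_h2o <- as.h2o(iris[, 1:4])                       # four numeric input columns

ae <- h2o.deeplearning(x = 1:4,
                       training_frame = iris_h2o,
                       autoencoder = TRUE,            # train the output to match the input
                       hidden = c(2),                 # 2-node bottleneck, like a nonlinear PCA
                       activation = "Tanh",
                       epochs = 50)

codes <- h2o.deepfeatures(ae, iris_h2o, layer = 1)    # the 2-D compressed representation
errs  <- h2o.anomaly(ae, iris_h2o)                    # per-row reconstruction error
head(as.data.frame(codes))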

Drawing Stuff

More Drawing Stuff (DeepDream): messing with weights

Optimise the picture instead of optimising the weights:
find the best picture that turns on the "dog" neuron

More Drawing Stuff (DeepDream): messing with weights

More Drawing Stuff

- Deep Learning: Convolutional Neural Networks

Essentially networks of weights connected by activation functions (e.g. sigmoid)

Convolutions
Just functions that combine pixels in a weighted way
A way of getting a correlation between a shape and parts of an image
Example: find red circles in an image, find edges
Measure how much the image part matches the shape (see the sketch below)
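A small worked example of "combining pixels in a weighted way": convolving a toy greyscale image with a vertical-edge kernel in base R. The image and the kernel are made up for illustration.

# 2-D convolution (strictly, cross-correlation) of a toy image with an edge kernel.
img <- matrix(0, nrow = 8, ncol = 8)
img[, 5:8] <- 1                          # left half dark, right half bright

kernel <- matrix(c(-1, 0, 1,
                   -1, 0, 1,
                   -1, 0, 1), nrow = 3, byrow = TRUE)   # vertical-edge detector

convolve2d <- function(image, k) {
  kr <- nrow(k); kc <- ncol(k)
  out <- matrix(0, nrow(image) - kr + 1, ncol(image) - kc + 1)
  for (i in 1:nrow(out)) {
    for (j in 1:ncol(out)) {
      patch <- image[i:(i + kr - 1), j:(j + kc - 1)]
      out[i, j] <- sum(patch * k)        # weighted combination of the pixels
    }
  }
  out
}

convolve2d(img, kernel)   # large values where the vertical edge sits, zero elsewhere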

Convolutions Example: Gabor filters


Find correlations with these shapes in the image

Example - Find edges in Images

The mathematics behind convolutions

[Animated GIF illustrating the convolution operation]

Hierarchy of Image Features: Max Pooling

We rescale the picture to find what we are looking for at different sizes and to
find more complicated shapes (see the sketch below)

A Convolutional Neural Network is just stacked layers of convolutions and pooling

(clarifai.com)

It creates a hierarchy of features at decreasing resolutions
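Continuing the toy example above, a max-pooling step just keeps the largest value in each small block, halving the resolution. Again, this is an illustrative sketch, not library code.

# 2x2 max pooling: downsample a feature map by keeping the maximum of each block.
max_pool <- function(fm, size = 2) {
  rows <- seq(1, nrow(fm) - size + 1, by = size)
  cols <- seq(1, ncol(fm) - size + 1, by = size)
  out <- matrix(0, length(rows), length(cols))
  for (i in seq_along(rows)) {
    for (j in seq_along(cols)) {
      out[i, j] <- max(fm[rows[i]:(rows[i] + size - 1),
                          cols[j]:(cols[j] + size - 1)])
    }
  }
  out
}

fm <- matrix(1:16, nrow = 4)   # a pretend 4 x 4 feature map
max_pool(fm)                   # 2 x 2 summary at half the resolution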

Another example (Le Cun)

Google's Inception V3 network

Google's Inception V3 network performance

Inception V4 is out: see arXiv 1602.07261

It's not all convolutions (but still weights)

Deep Learning Software (examples)


Theano, Keras, Caffe, CNTK, MXNet, H2O, Neon, Deeplearning4j, etc
TensorFlow has lots of community support
NVIDIA DIGITS: easy to use, a web interface to Caffe
MXNet and H2O can be used from R

Agenda
Introduction
Machine Learning
- Deep Learning
- Transfer Learning
- Reinforcement Learning

- Transfer Learning
(one more reason why the robots will be eating our lunch)

Example: a cat detector to turn on the sprinklers

Transfer Learning
Transfer Learning uses an existing set of weights for an existing Deep Learning
network and adapts (retrains) some of the layers for a new set of images. This
lets us (and robots) transfer learning to new tasks.

The next two examples show a case where the last three layers of a network
downloaded from the internet are retrained to distinguish between cars and
SUVs. The fruit image example does no retraining but takes weights from the
second last layer as inputs to a model trained in R.

- SUV / CAR example

ImageNet Network (MATLAB)

ImageNet Network (MATLAB)

Results - before

Results - after

Fruit Classification Demo

0 correct out of 57 images

Demo: use the outputs from the second-to-last layer (before classification), 1000 columns

Demo R code using the 1000 columns


library(readr)
library(xgboost)
library(caret)
library(plyr)

setwd("~/MATLAB/matconvnet")
cat("reading data file\n")
train <- read_csv("fruit.csv", col_names=FALSE)
rows = nrow(train)
cols = ncol(train)
train[[cols]] <- factor(train[[cols]], labels=c("peaches", "apricots"))
set.seed(2) # 2 gives a nice even split
inTrain <- sample(rows) # no test sample
x.train <- train[inTrain, -cols]
y.train <- train[[cols]][inTrain]
fitControl <- trainControl(method = "repeatedcv", number=4, repeats=50) #, classProbs=TRUE, summaryFunction=twoClassSummary)
cat("training model\n")
xgbGrid <- expand.grid(.nrounds = 4, .max_depth = 3, .eta = 0.3, .gamma = 0, .colsample_bytree = 0.6, .min_child_weight = 1) # good
xgb <- train(x=x.train, y=y.train, method='xgbTree', trControl=fitControl, tuneGrid=xgbGrid)
save(xgb, file="xgb.RData")
pred <- predict(xgb, newdata=x.train)
confusionMatrix(pred, y.train)
# output the predicted values
cat("saving predicted labels\n")
pred <- predict(xgb, newdata=train[, -cols])
pred <- data.frame(pred) # convert to 0,1 variable
write_csv(pred, "fruitresults.csv", col_names=FALSE)
cat("done\n")

Result

57 correct out of 57 images

Agenda
Introduction
Machine Learning
- Deep Learning
- Transfer Learning
- Reinforcement Learning

- Reinforcement Learning
(one more reason why the robots will be eating our lunch)

Reinforcement Learning
Semi-supervised learning (not all input data has labels)
Labels are sparse
The labels come some time after the input data (when we get our
reward we don't know which actions in the past contributed to that
reward)

Deep Reinforcement Learning

Input data can be very complicated, such as images

Example Applications

Traffic light control (single video camera at the intersection)
Elevator control
Game playing (e.g. Go)
Computational Advertising
(delivering the right ad to the right person at the right time; similar to A/B testing but
much smarter)

Example: Flappy Bird

Most of the time do nothing
Occasionally hit the space bar
Input = picture (state) + action
Labels (output) = one point, or death
Labels (rewards) come after the actions
Similar to an advertising campaign

The state is complicated (but simple compared to some games)

Problem: code is developed for each case

Solution: Deep Reinforcement Learning

DeepMind (now bought by Google)

A single code base plays all Atari games

OpenAI Gym: a platform for reinforcement learning testing

https://gym.openai.com/envs

Algorithm (not as difficult as it looks)

Q-Learning Model
Using Q turns a semi-supervised problem into a supervised problem
Q(s,a) is the long-run mean reward from taking action a in state s
Q(s,a) = E[ r(t+1) + gamma*r(t+2) + gamma^2*r(t+3) + ... | s, a ]
We don't know Q, but we play lots of games using trial and error and
record all the rewards from playing various actions a.
We update Q as we go
It's not as difficult as it looks! (c.f. the Kalman Filter)
Q(s,a) <- Q(s,a) + alpha * ( r + gamma * max over a' of Q(s',a') - Q(s,a) )

Q-Learning Algorithm
Loop through frames and games, updating
Q(s,a) <- Q(s,a) + alpha * ( r + gamma * max over a' of Q(s',a') - Q(s,a) )

where
s = current frame + previous frame
a = action (space bar or nothing)
Q is a deep neural network approximation of the expectation E[.]
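To make the update rule concrete, here is a tabular Q-learning sketch in R on a made-up 5-state corridor. The environment, states, and rewards are assumptions for illustration only; in the Flappy Bird / Atari setting the table is replaced by a deep network.

# Tabular Q-learning on a toy corridor: start in state 1; action 2 ("right") moves
# toward state 5, which pays a reward of 1; action 1 ("left") moves back.
set.seed(1)
n_states  <- 5
n_actions <- 2                      # 1 = left, 2 = right
Q     <- matrix(0, n_states, n_actions)
alpha <- 0.1                        # learning rate
gamma <- 0.9                        # discount factor
eps   <- 0.1                        # exploration probability

for (episode in 1:500) {
  s <- 1
  while (s < n_states) {
    a <- if (runif(1) < eps) sample(n_actions, 1) else which.max(Q[s, ])
    s_next <- if (a == 2) s + 1 else max(1, s - 1)
    r <- if (s_next == n_states) 1 else 0
    # the Q-learning update from the slide
    Q[s, a] <- Q[s, a] + alpha * (r + gamma * max(Q[s_next, ]) - Q[s, a])
    s <- s_next
  }
}

round(Q, 2)   # the "right" column dominates, and values grow as we near the goal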

Demo

Demo

Results (videos)
https://www.youtube.com/watch?v=W2CAghUiofY

https://www.youtube.com/watch?v=TmPfTpjtdgg
