
Introduction to Machine Learning with H2O and Python


Jo-fai (Joe) Chow
Data Scientist
joe@h2o.ai
@matlabulous

PyData Amsterdam Tutorial at GoDataDriven


7th April, 2017
Slides and Code Examples:
bit.ly/joe_h2o_tutorials

About Me
2010-2015: Civil (Water) Engineer
Consultant (UK): Utilities, Asset Management, Constrained Optimization
Industrial PhD (UK): Infrastructure Design Optimization, Machine Learning + Water Engineering
Discovered H2O in 2014
2015-Present: Data Scientist
Virgin Media (UK)
Domino Data Lab (Silicon Valley, US)
2016-Present: H2O.ai (Silicon Valley, US)

About Me: The Long Story
bit.ly/joe_kaggle_story
R + H2O + Domino for Kaggle
Guest Blog Post for Domino & H2O (2014)

Agenda
About H2O.ai
Company
Machine Learning Platform
Tutorial
H2O Python Module
Download & Install
Step-by-Step Examples:
Basic Data Import / Manipulation
Regression & Classification (Basics)
Regression & Classification (Advanced)
Using H2O in the Cloud

Agenda
About H2O.ai
Company (background information)
Machine Learning Platform
Tutorial
H2O Python Module
Download & Install
Step-by-Step Examples (for beginners):
Basic Data Import / Manipulation
Regression & Classification (Basics)
(Short Break)
Regression & Classification (Advanced) (as if I am working on Kaggle competitions)
Using H2O in the Cloud
About H2O.ai

Company Overview
Founded: 2011 (venture-backed, debuted in 2012)
Products: H2O (Open Source In-Memory AI Prediction Engine), Sparkling Water, Steam
Mission: Operationalize Data Science, and provide a platform for users to build beautiful data products
Team: 70 employees, including Distributed Systems Engineers doing Machine Learning and world-class visualization designers
Headquarters: Mountain View, CA

Our Team

Kuba

Joe

Scientific Advisory Council

Joe (2015)

http://www.h2o.ai/gartner-magic-quadrant/

Check out our website: h2o.ai

H2O Machine Learning Platform

High Level Architecture

[Diagram] Data sources (HDFS, S3, NFS, local files, SQL, "your imagination") feed the H2O Compute Engine: distributed, in-memory data storage with loss-less compression. Inside the engine: load data; exploratory and descriptive analysis; feature engineering and selection; supervised and unsupervised modeling; model evaluation and selection; prediction. Data prep and models export as Plain Old Java Objects (POJOs) to a production scoring environment.
High Level Architecture: Import Data from Multiple Sources (HDFS, S3, NFS, local files, SQL)
High Level Architecture: Fast, Scalable & Distributed Compute Engine Written in Java
Algorithms Overview

Supervised Learning
Statistical Analysis: Generalized Linear Models (Binomial, Gaussian, Gamma, Poisson and Tweedie); Naïve Bayes
Ensembles: Distributed Random Forest (classification or regression models); Gradient Boosting Machine (produces an ensemble of decision trees with increasingly refined approximations)
Deep Neural Networks: Deep Learning (multi-layer feed-forward neural networks starting with an input layer followed by multiple layers of nonlinear transformations)

Unsupervised Learning
Clustering: K-means (partitions observations into k clusters/groups of the same spatial size; automatically detects the optimal k)
Dimensionality Reduction: Principal Component Analysis (linearly transforms correlated variables to independent components); Generalized Low Rank Models (extend the idea of PCA to handle arbitrary data consisting of numerical, Boolean, categorical, and missing data)
Anomaly Detection: Autoencoders (find outliers using nonlinear dimensionality reduction via deep learning)
H2O Deep Learning in Action

High Level Architecture: Multiple Interfaces
H2O + Python

H2O + R

H2O Flow (Web) Interface

High Level Architecture: Export Standalone Models (POJO) for Production
docs.h2o.ai

H2O + Python Tutorial

Learning Objectives
Start and connect to a local H2O cluster from Python.
Import data from Python data frames, local files, or the web.
Perform basic data transformation and exploration.
Train regression and classification models using various H2O machine
learning algorithms.
Evaluate models and make predictions.
Improve performance by tuning and stacking.
Connect to an H2O cluster in the cloud.

Install H2O
h2o.ai -> Download -> Install in Python

Start and Connect to a
Local H2O Cluster
py_01_data_in_h2o.ipynb

Local H2O Cluster
Import the H2O module, then start a local H2O cluster.
nthreads = -1 means using ALL CPU resources.
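A minimal sketch of what the notebook does at this step, assuming the `h2o` package and a Java runtime are installed (e.g. via `pip install h2o`); the memory cap is an illustrative value, not the tutorial's setting:

```python
# Start (or connect to) a local H2O cluster from Python.
INIT_KWARGS = {
    "nthreads": -1,        # -1 = use all available CPU cores
    "max_mem_size": "4G",  # cap the JVM heap (illustrative value)
}

def start_local_cluster():
    """Start a local H2O cluster and return the connection."""
    import h2o  # imported lazily so the sketch loads without H2O installed
    return h2o.init(**INIT_KWARGS)
```

Calling `start_local_cluster()` launches the Java backend and prints a cluster status table.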
Importing Data into H2O
py_01_data_in_h2o.ipynb
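The notebook imports data three ways; a hedged sketch (the file name, URL and column names below are illustrative, not the tutorial's data):

```python
# Create a small CSV locally so the example is self-contained.
import csv

rows = [("x1", "x2", "y"), (1.0, 2.0, 0), (3.0, 4.0, 1)]
with open("toy.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)

def import_examples():
    """Three common routes into an H2OFrame (needs a running cluster)."""
    import h2o
    import pandas as pd
    # 1. From a local file:
    frame_from_file = h2o.import_file("toy.csv")
    # 2. From the web (placeholder URL):
    frame_from_web = h2o.import_file("https://example.com/data.csv")
    # 3. From a pandas DataFrame:
    frame_from_pandas = h2o.H2OFrame(pd.read_csv("toy.csv"))
    return frame_from_file, frame_from_web, frame_from_pandas
```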

Basic Data Transformation &
Exploration
py_02_data_manipulation.ipynb
(see notebooks)
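The flavour of H2OFrame manipulation covered in the notebook, sketched with assumed column names (`x1`, `x2`, `y` are placeholders):

```python
COLUMNS = ["x1", "x2", "y"]  # assumed schema for the sketch

def explore_and_transform(frame):
    """Typical H2OFrame exploration and transformation steps."""
    frame.describe()                         # summary statistics per column
    x1 = frame["x1"]                         # select a column
    subset = frame[frame["y"] > 0, :]        # filter rows by a condition
    frame["x3"] = frame["x1"] * frame["x2"]  # derive a new column
    # 70/15/15 train/valid/test split:
    train, valid, test = frame.split_frame(ratios=[0.7, 0.15], seed=1234)
    return train, valid, test
```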

Regression Models (Basics)
py_03a_regression_basics.ipynb
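A minimal sketch of training basic H2O regression models, under the assumption that `train` and `test` are H2OFrames with a numeric target `y` and features `x1`, `x2` (names are placeholders, not the tutorial's actual columns):

```python
FEATURES = ["x1", "x2"]
TARGET = "y"

def train_regression_models(train, test):
    """Train a GLM and a GBM, then evaluate and predict on test data."""
    from h2o.estimators.glm import H2OGeneralizedLinearEstimator
    from h2o.estimators.gbm import H2OGradientBoostingEstimator

    glm = H2OGeneralizedLinearEstimator(family="gaussian")
    glm.train(x=FEATURES, y=TARGET, training_frame=train)

    gbm = H2OGradientBoostingEstimator(seed=1234)
    gbm.train(x=FEATURES, y=TARGET, training_frame=train)

    print(gbm.model_performance(test_data=test).mse())  # held-out MSE
    preds = gbm.predict(test)
    return glm, gbm, preds
```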

Algorithms Overview (recap of the earlier overview)
docs.h2o.ai

Regression Performance: MSE
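The MSE that H2O reports is the familiar mean of squared residuals; a quick pure-Python check on toy numbers (not the tutorial's data):

```python
def mse(actual, predicted):
    """Mean squared error: average of squared residuals."""
    n = len(actual)
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n

actual    = [3.0, 5.0, 2.0]
predicted = [2.5, 5.5, 2.0]
print(mse(actual, predicted))  # (0.25 + 0.25 + 0.0) / 3, about 0.1667
```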

Classification Models (Basics)
py_04_classification_basics.ipynb

Classification Performance: Confusion Matrix

Confusion Matrix
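A confusion matrix simply counts predictions against true labels; a toy binary example in pure Python (labels invented for illustration):

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count (actual, predicted) label pairs."""
    return Counter(zip(actual, predicted))

actual    = [1, 0, 1, 1, 0, 0]
predicted = [1, 0, 0, 1, 0, 1]
cm = confusion_matrix(actual, predicted)
# TP = (1,1), TN = (0,0), FN = (1,0), FP = (0,1)
print(cm[(1, 1)], cm[(0, 0)], cm[(1, 0)], cm[(0, 1)])  # 2 2 1 1
accuracy = (cm[(1, 1)] + cm[(0, 0)]) / len(actual)     # 4/6
```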

Regression Models (Tuning)
py_03b_regression_grid_search.ipynb

Improving Model Performance (Step-by-Step)

Model Settings                              MSE (CV)   MSE (Test)
GBM with default settings                   N/A        0.4551
GBM with manual settings                    N/A        0.4433
Manual settings + cross-validation          0.4502     0.4433
Manual + CV + early stopping                0.4429     0.4287
CV + early stopping + full grid search      0.4378     0.4196
CV + early stopping + random grid search    0.4227     0.4047
Stacking models from random grid search     N/A        0.3969

Cross-Validation
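Cross-validation holds out each of k folds once while training on the rest; in H2O this is a single `nfolds=5` argument on the estimator. The fold mechanics, sketched in pure Python (equal-sized contiguous folds for simplicity):

```python
def kfold_indices(n_rows, k):
    """Yield (train_idx, valid_idx) pairs for each of k folds."""
    fold_size = n_rows // k
    indices = list(range(n_rows))
    for i in range(k):
        valid = indices[i * fold_size:(i + 1) * fold_size]
        train = indices[:i * fold_size] + indices[(i + 1) * fold_size:]
        yield train, valid

folds = list(kfold_indices(10, 5))
# 5 folds: each validates on 2 rows and trains on the other 8.
```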

Early Stopping
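Early stopping halts training once the validation score stops improving; H2O exposes this through parameters such as `stopping_rounds`, `stopping_metric` and `stopping_tolerance`. The sketch below is a simplified illustration of the idea (H2O's actual rule uses moving averages and a relative tolerance), on a made-up score sequence:

```python
def stop_round(scores, stopping_rounds=2):
    """Index at which training stops: the first point where
    `stopping_rounds` consecutive rounds fail to beat the best
    score seen so far (lower is better, e.g. MSE)."""
    best = float("inf")
    since_best = 0
    for i, s in enumerate(scores):
        if s < best:
            best, since_best = s, 0
        else:
            since_best += 1
            if since_best >= stopping_rounds:
                return i
    return len(scores) - 1

mse_per_round = [0.50, 0.45, 0.43, 0.44, 0.44, 0.45]
print(stop_round(mse_per_round))  # 4: rounds 3 and 4 fail to beat 0.43
```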

Grid Search

Combination   Parameter 1   Parameter 2
1             0.7           0.7
2             0.7           0.8
3             0.7           0.9
4             0.8           0.7
5             0.8           0.8
6             0.8           0.9
7             0.9           0.7
8             0.9           0.8
9             0.9           0.9
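The 9 combinations in the table are just the Cartesian product of the two parameter lists, and a random grid search samples a subset of them. Parameter names below are illustrative; in H2O you would pass a dict like this to `H2OGridSearch`:

```python
import itertools
import random

hyper_params = {
    "param_1": [0.7, 0.8, 0.9],
    "param_2": [0.7, 0.8, 0.9],
}

# Full grid: every combination of the parameter values.
full_grid = list(itertools.product(*hyper_params.values()))
print(len(full_grid))  # 9 combinations, matching the table

# Random grid search: try only a random subset of the grid.
random.seed(1234)
random_subset = random.sample(full_grid, 4)
```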
Regression Models (Ensembles)
py_03c_regression_ensembles.ipynb

https://github.com/h2oai/h2o-meetups/blob/master/2017_02_23_Metis_SF_Sacked_Ensembles_Deep_Water/stacked_ensembles_in_h2o_feb2017.pdf
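Stacking trains a meta-model on base models' predictions; even plain blending can help, as this toy pure-Python illustration shows (all numbers invented): two predictors biased in opposite directions cancel out when averaged.

```python
def mse(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

y  = [1.0, 2.0, 3.0]
p1 = [1.2, 2.2, 3.2]   # base model 1: consistently predicts high
p2 = [0.8, 1.8, 2.8]   # base model 2: consistently predicts low
blend = [(a + b) / 2 for a, b in zip(p1, p2)]

print(mse(y, p1), mse(y, p2), mse(y, blend))  # about 0.04, 0.04, 0.0
```

In H2O proper, stacking is done with `H2OStackedEnsembleEstimator`, which fits a metalearner on the cross-validated predictions of the base models.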

Lowest MSE = Best Performance
Classification Models (Ensembles)
py_04_classification_ensembles.ipynb

Highest AUC = Best Performance
H2O in the Cloud
py_05_h2o_in_the_cloud.ipynb
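Connecting to a cluster that is already running in the cloud, rather than launching one locally, is a sketch like this; the IP below is a placeholder (a TEST-NET example address), and 54321 is H2O's default port:

```python
CLUSTER_IP = "203.0.113.10"   # placeholder address, not a real cluster
CLUSTER_PORT = 54321          # H2O's default port

def connect_to_cloud():
    """Attach this Python session to a remote H2O cluster."""
    import h2o
    # h2o.connect attaches to an existing cluster; h2o.init with the
    # same ip/port arguments would also connect rather than launch.
    return h2o.connect(ip=CLUSTER_IP, port=CLUSTER_PORT)
```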

Recap

Learning Objectives
Start and connect to a local H2O cluster from Python.
Import data from Python data frames, local files, or the web.
Perform basic data transformation and exploration.
Train regression and classification models using various H2O machine
learning algorithms.
Evaluate models and make predictions.
Improve performance by tuning and stacking.
Connect to an H2O cluster in the cloud.

H2O Tutorial: Friday 4:00 pm
Slides and Code Examples: bit.ly/joe_h2o_tutorials

H2O Sparkling Water Talk: Saturday 3:15 pm
Thanks!

Organizers & Sponsors

Code, Slides & Documents
bit.ly/h2o_meetups
docs.h2o.ai

Contact
joe@h2o.ai
@matlabulous
github.com/woobe

Find us at the PyData conference for live demos.
Please search/ask questions on Stack Overflow using the tag `h2o` (not "H2 zero").
