
Introduction to Machine Learning with H2O and Python


Jo-fai (Joe) Chow
Data Scientist
joe@h2o.ai
@matlabulous

PyData Amsterdam Tutorial at GoDataDriven


7th April, 2017
Slides and Code Examples:
bit.ly/joe_h2o_tutorials

About Me
2010-2015: Civil (Water) Engineer
Consultant (UK): Utilities, Asset Management, Constrained Optimization
Industrial PhD (UK): Infrastructure Design Optimization, Machine Learning + Water Engineering
Discovered H2O in 2014
2015-Present: Data Scientist
Virgin Media (UK)
Domino Data Lab (Silicon Valley, US)
2016-Present: H2O.ai (Silicon Valley, US)

About Me: The Long Story
bit.ly/joe_kaggle_story
R + H2O + Domino for Kaggle
Guest Blog Post for Domino & H2O (2014)

Agenda
About H2O.ai
Company
Machine Learning Platform
Tutorial
H2O Python Module
Download & Install
Step-by-Step Examples:
Basic Data Import / Manipulation
Regression & Classification (Basics)
Regression & Classification (Advanced)
Using H2O in the Cloud

Agenda
About H2O.ai
Company (background information)
Machine Learning Platform
Tutorial
H2O Python Module
Download & Install
Step-by-Step Examples (for beginners):
Basic Data Import / Manipulation
Regression & Classification (Basics)
(Short Break)
Regression & Classification (Advanced) (as if I am working on Kaggle competitions)
Using H2O in the Cloud
About H2O.ai

Company Overview
Founded: 2011 (venture-backed, debuted in 2012)
Products: H2O (Open Source In-Memory AI Prediction Engine), Sparkling Water, Steam
Mission: Operationalize Data Science, and provide a platform for users to build beautiful data products
Team: 70 employees, including Distributed Systems Engineers doing Machine Learning and world-class visualization designers
Headquarters: Mountain View, CA

Our Team

Kuba

Joe

Scientific Advisory Council

Joe (2015)

http://www.h2o.ai/gartner-magic-quadrant/

Check out our website: h2o.ai

H2O Machine Learning Platform

High Level Architecture

[Diagram] Data sources (HDFS, S3, NFS, local files, SQL, "your imagination") feed the H2O Compute Engine: distributed, in-memory data storage with loss-less compression. Inside the engine: load data; exploratory and descriptive analysis; feature engineering and selection; supervised and unsupervised modeling; model evaluation and selection; prediction. Data prep and models export as Plain Old Java Objects (POJOs) to a production scoring environment.
High Level Architecture: Import Data from Multiple Sources (HDFS, S3, NFS, local files, SQL)
High Level Architecture: Fast, Scalable & Distributed Compute Engine Written in Java
Algorithms Overview

Supervised Learning
Statistical Analysis: Generalized Linear Models (Binomial, Gaussian, Gamma, Poisson and Tweedie); Naïve Bayes
Ensembles: Distributed Random Forest (classification or regression models); Gradient Boosting Machine (produces an ensemble of decision trees with increasingly refined approximations)
Deep Neural Networks: Deep Learning (multi-layer feed-forward neural networks starting with an input layer followed by multiple layers of nonlinear transformations)

Unsupervised Learning
Clustering: K-means (partitions observations into k clusters/groups of the same spatial size; automatically detects the optimal k)
Dimensionality Reduction: Principal Component Analysis (linearly transforms correlated variables to independent components); Generalized Low Rank Models (extend the idea of PCA to handle arbitrary data consisting of numerical, Boolean, categorical, and missing data)
Anomaly Detection: Autoencoders (find outliers using nonlinear dimensionality reduction via deep learning)
H2O Deep Learning in Action

High Level Architecture: Multiple Interfaces
H2O + Python

H2O + R

H2O Flow (Web) Interface

High Level Architecture: Export Standalone Models (POJO) for Production
docs.h2o.ai

H2O + Python Tutorial

Learning Objectives
Start and connect to a local H2O cluster from Python.
Import data from Python data frames, local files, or the web.
Perform basic data transformation and exploration.
Train regression and classification models using various H2O machine
learning algorithms.
Evaluate models and make predictions.
Improve performance by tuning and stacking.
Connect to an H2O cluster in the cloud.

Install H2O
h2o.ai -> Download -> Install in Python

Start and Connect to a
Local H2O Cluster
py_01_data_in_h2o.ipynb

Local H2O Cluster
Import the H2O module, then start a local H2O cluster.
nthreads = -1 means using ALL CPU resources.
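A minimal sketch of what the notebook does at this step, assuming the `h2o` package and a Java runtime are installed (e.g. via `pip install h2o`); the memory cap is an illustrative value, not the tutorial's setting:

```python
# Start (or connect to) a local H2O cluster from Python.
INIT_KWARGS = {
    "nthreads": -1,        # -1 = use all available CPU cores
    "max_mem_size": "4G",  # cap the JVM heap (illustrative value)
}

def start_local_cluster():
    """Start a local H2O cluster and return the connection."""
    import h2o  # imported lazily so the sketch loads without H2O installed
    return h2o.init(**INIT_KWARGS)
```

Calling `start_local_cluster()` launches the Java backend and prints a cluster status table.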
Importing Data into H2O
py_01_data_in_h2o.ipynb
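The notebook imports data three ways; a hedged sketch (the file name, URL and column names below are illustrative, not the tutorial's data):

```python
# Create a small CSV locally so the example is self-contained.
import csv

rows = [("x1", "x2", "y"), (1.0, 2.0, 0), (3.0, 4.0, 1)]
with open("toy.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)

def import_examples():
    """Three common routes into an H2OFrame (needs a running cluster)."""
    import h2o
    import pandas as pd
    # 1. From a local file:
    frame_from_file = h2o.import_file("toy.csv")
    # 2. From the web (placeholder URL):
    frame_from_web = h2o.import_file("https://example.com/data.csv")
    # 3. From a pandas DataFrame:
    frame_from_pandas = h2o.H2OFrame(pd.read_csv("toy.csv"))
    return frame_from_file, frame_from_web, frame_from_pandas
```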

Basic Data Transformation &
Exploration
py_02_data_manipulation.ipynb
(see notebooks)
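The flavour of H2OFrame manipulation covered in the notebook, sketched with assumed column names (`x1`, `x2`, `y` are placeholders):

```python
COLUMNS = ["x1", "x2", "y"]  # assumed schema for the sketch

def explore_and_transform(frame):
    """Typical H2OFrame exploration and transformation steps."""
    frame.describe()                         # summary statistics per column
    x1 = frame["x1"]                         # select a column
    subset = frame[frame["y"] > 0, :]        # filter rows by a condition
    frame["x3"] = frame["x1"] * frame["x2"]  # derive a new column
    # 70/15/15 train/valid/test split:
    train, valid, test = frame.split_frame(ratios=[0.7, 0.15], seed=1234)
    return train, valid, test
```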

Regression Models (Basics)
py_03a_regression_basics.ipynb
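A minimal sketch of training basic H2O regression models, under the assumption that `train` and `test` are H2OFrames with a numeric target `y` and features `x1`, `x2` (names are placeholders, not the tutorial's actual columns):

```python
FEATURES = ["x1", "x2"]
TARGET = "y"

def train_regression_models(train, test):
    """Train a GLM and a GBM, then evaluate and predict on test data."""
    from h2o.estimators.glm import H2OGeneralizedLinearEstimator
    from h2o.estimators.gbm import H2OGradientBoostingEstimator

    glm = H2OGeneralizedLinearEstimator(family="gaussian")
    glm.train(x=FEATURES, y=TARGET, training_frame=train)

    gbm = H2OGradientBoostingEstimator(seed=1234)
    gbm.train(x=FEATURES, y=TARGET, training_frame=train)

    print(gbm.model_performance(test_data=test).mse())  # held-out MSE
    preds = gbm.predict(test)
    return glm, gbm, preds
```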

Algorithms Overview (recap of the earlier overview)
docs.h2o.ai

Regression Performance: MSE
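The MSE that H2O reports is the familiar mean of squared residuals; a quick pure-Python check on toy numbers (not the tutorial's data):

```python
def mse(actual, predicted):
    """Mean squared error: average of squared residuals."""
    n = len(actual)
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n

actual    = [3.0, 5.0, 2.0]
predicted = [2.5, 5.5, 2.0]
print(mse(actual, predicted))  # (0.25 + 0.25 + 0.0) / 3, about 0.1667
```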

Classification Models (Basics)
py_04_classification_basics.ipynb

Classification Performance: Confusion Matrix

Confusion Matrix
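A confusion matrix simply counts predictions against true labels; a toy binary example in pure Python (labels invented for illustration):

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count (actual, predicted) label pairs."""
    return Counter(zip(actual, predicted))

actual    = [1, 0, 1, 1, 0, 0]
predicted = [1, 0, 0, 1, 0, 1]
cm = confusion_matrix(actual, predicted)
# TP = (1,1), TN = (0,0), FN = (1,0), FP = (0,1)
print(cm[(1, 1)], cm[(0, 0)], cm[(1, 0)], cm[(0, 1)])  # 2 2 1 1
accuracy = (cm[(1, 1)] + cm[(0, 0)]) / len(actual)     # 4/6
```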

Regression Models (Tuning)
py_03b_regression_grid_search.ipynb

Improving Model Performance (Step-by-Step)

Model Settings                              MSE (CV)   MSE (Test)
GBM with default settings                   N/A        0.4551
GBM with manual settings                    N/A        0.4433
Manual settings + cross-validation          0.4502     0.4433
Manual + CV + early stopping                0.4429     0.4287
CV + early stopping + full grid search      0.4378     0.4196
CV + early stopping + random grid search    0.4227     0.4047
Stacking models from random grid search     N/A        0.3969

Cross-Validation
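Cross-validation holds out each of k folds once while training on the rest; in H2O this is a single `nfolds=5` argument on the estimator. The fold mechanics, sketched in pure Python (equal-sized contiguous folds for simplicity):

```python
def kfold_indices(n_rows, k):
    """Yield (train_idx, valid_idx) pairs for each of k folds."""
    fold_size = n_rows // k
    indices = list(range(n_rows))
    for i in range(k):
        valid = indices[i * fold_size:(i + 1) * fold_size]
        train = indices[:i * fold_size] + indices[(i + 1) * fold_size:]
        yield train, valid

folds = list(kfold_indices(10, 5))
# 5 folds: each validates on 2 rows and trains on the other 8.
```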

Early Stopping
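Early stopping halts training once the validation score stops improving; H2O exposes this through parameters such as `stopping_rounds`, `stopping_metric` and `stopping_tolerance`. The sketch below is a simplified illustration of the idea (H2O's actual rule uses moving averages and a relative tolerance), on a made-up score sequence:

```python
def stop_round(scores, stopping_rounds=2):
    """Index at which training stops: the first point where
    `stopping_rounds` consecutive rounds fail to beat the best
    score seen so far (lower is better, e.g. MSE)."""
    best = float("inf")
    since_best = 0
    for i, s in enumerate(scores):
        if s < best:
            best, since_best = s, 0
        else:
            since_best += 1
            if since_best >= stopping_rounds:
                return i
    return len(scores) - 1

mse_per_round = [0.50, 0.45, 0.43, 0.44, 0.44, 0.45]
print(stop_round(mse_per_round))  # 4: rounds 3 and 4 fail to beat 0.43
```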

Grid Search

Combination   Parameter 1   Parameter 2
1             0.7           0.7
2             0.7           0.8
3             0.7           0.9
4             0.8           0.7
5             0.8           0.8
6             0.8           0.9
7             0.9           0.7
8             0.9           0.8
9             0.9           0.9
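The 9 combinations in the table are just the Cartesian product of the two parameter lists, and a random grid search samples a subset of them. Parameter names below are illustrative; in H2O you would pass a dict like this to `H2OGridSearch`:

```python
import itertools
import random

hyper_params = {
    "param_1": [0.7, 0.8, 0.9],
    "param_2": [0.7, 0.8, 0.9],
}

# Full grid: every combination of the parameter values.
full_grid = list(itertools.product(*hyper_params.values()))
print(len(full_grid))  # 9 combinations, matching the table

# Random grid search: try only a random subset of the grid.
random.seed(1234)
random_subset = random.sample(full_grid, 4)
```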
Regression Models (Ensembles)
py_03c_regression_ensembles.ipynb

https://github.com/h2oai/h2o-meetups/blob/master/2017_02_23_Metis_SF_Sacked_Ensembles_Deep_Water/stacked_ensembles_in_h2o_feb2017.pdf
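Stacking trains a meta-model on base models' predictions; even plain blending can help, as this toy pure-Python illustration shows (all numbers invented): two predictors biased in opposite directions cancel out when averaged.

```python
def mse(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

y  = [1.0, 2.0, 3.0]
p1 = [1.2, 2.2, 3.2]   # base model 1: consistently predicts high
p2 = [0.8, 1.8, 2.8]   # base model 2: consistently predicts low
blend = [(a + b) / 2 for a, b in zip(p1, p2)]

print(mse(y, p1), mse(y, p2), mse(y, blend))  # about 0.04, 0.04, 0.0
```

In H2O proper, stacking is done with `H2OStackedEnsembleEstimator`, which fits a metalearner on the cross-validated predictions of the base models.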

Lowest MSE = Best Performance
Classification Models (Ensembles)
py_04_classification_ensembles.ipynb

Highest AUC = Best Performance
H2O in the Cloud
py_05_h2o_in_the_cloud.ipynb
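Connecting to a cluster that is already running in the cloud, rather than launching one locally, is a sketch like this; the IP below is a placeholder (a TEST-NET example address), and 54321 is H2O's default port:

```python
CLUSTER_IP = "203.0.113.10"   # placeholder address, not a real cluster
CLUSTER_PORT = 54321          # H2O's default port

def connect_to_cloud():
    """Attach this Python session to a remote H2O cluster."""
    import h2o
    # h2o.connect attaches to an existing cluster; h2o.init with the
    # same ip/port arguments would also connect rather than launch.
    return h2o.connect(ip=CLUSTER_IP, port=CLUSTER_PORT)
```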

Recap

Learning Objectives
Start and connect to a local H2O cluster from Python.
Import data from Python data frames, local files, or the web.
Perform basic data transformation and exploration.
Train regression and classification models using various H2O machine
learning algorithms.
Evaluate models and make predictions.
Improve performance by tuning and stacking.
Connect to an H2O cluster in the cloud.

H2O Tutorial: Friday 4:00 pm
Slides and Code Examples: bit.ly/joe_h2o_tutorials

H2O Sparkling Water Talk: Saturday 3:15 pm
Thanks!

Organizers & Sponsors

Code, Slides & Documents
bit.ly/h2o_meetups
docs.h2o.ai

Contact
joe@h2o.ai
@matlabulous
github.com/woobe

Find us at the PyData conference for live demos.
Please search/ask questions on Stack Overflow using the tag `h2o` (not "H2 zero").
