Open Platform for AI & ML modeling

Scalable Automatic Machine Learning in H2O
Erin LeDell Ph.D. 
@ledell
Data Science Conference 5.0 
Belgrade, Serbia
Nov 2019

What is H2O?
H2O.ai, the company
• Founded in 2012
• Advised by Stanford Professors Hastie, Tibshirani & Boyd
• Headquarters: Mountain View, California, USA
H2O, the platform
• Open Source Software (Apache 2.0 Licensed)
• R, Python, Scala, Java and Web Interfaces
• Distributed Machine Learning Algorithms for Big Data

Agenda
• H2O Platform
• Automatic Machine Learning (AutoML)
• H2O AutoML Overview
• Resources
Slides ⬇ https://tinyurl.com/automl-dsc5

H2O Machine Learning Platform
• Distributed (multi-core + multi-node) implementations of
cutting-edge ML algorithms.
• Core algorithms written in high-performance Java.
• APIs available in R, Python, Scala; web GUI.
• Easily deploy models to production as pure Java code.
• Works on Hadoop, Spark, EC2, your laptop, etc.

H2O Distributed Computing
H2O Cluster
• Multi-node cluster with shared memory model.
• All computations in memory.
• Each node sees only some rows of the data.
• No limit on cluster size.
H2O Frame
• Distributed data frames (collection of vectors).
• Columns are distributed (across nodes) arrays.
• Works just like R's data.frame or Python Pandas DataFrame.
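
The idea that each node holds only some rows of every column can be illustrated with a toy sketch. This is not H2O's implementation: `shard_rows` and `distributed_mean` are hypothetical names, and plain Python lists stand in for cluster nodes.

```python
def shard_rows(column, n_nodes):
    """Split one column into contiguous row chunks, one chunk per node."""
    size = -(-len(column) // n_nodes)  # ceiling division
    return [column[i * size:(i + 1) * size] for i in range(n_nodes)]


def distributed_mean(chunks):
    """Each 'node' summarises its own rows; only the small summaries are combined."""
    partials = [(sum(chunk), len(chunk)) for chunk in chunks]  # local work per node
    total = sum(s for s, _ in partials)                        # combine step
    count = sum(n for _, n in partials)
    return total / count


chunks = shard_rows(list(range(10)), n_nodes=3)
print([len(c) for c in chunks], distributed_mean(chunks))
```

The point of the sketch is that the full column never has to sit on one node: only the per-node partial sums and counts are combined.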

H2O Machine Learning Features
• Supervised & unsupervised machine learning algos
(GBM, RF, DNN, GLM, Stacked Ensembles, etc.)
• Imputation, normalization & auto one-hot-encoding
• Automatic early stopping
• Cross-validation, grid search & random search
• Variable importance, model evaluation metrics, plots

Intro to Automatic Machine Learning

Goals & Features of AutoML
• 🏆 Train the best model in the least amount of time.
• 📉 Reduce the human effort & expertise required
in machine learning.
• 📈 Improve the performance of machine learning
models.
• 🔄 Increase reproducibility & establish a baseline
for scientific research or applications.

Aspects of Automatic Machine Learning
Data Prep
Model Generation
Ensembles

Aspects of Automatic Machine Learning
Data Preprocessing
• Imputation, one-hot encoding, standardization
• Feature selection and/or feature extraction (e.g. PCA)
• Count/Label/Target encoding of categorical features
Model Generation
• Cartesian grid search or random grid search
• Bayesian Hyperparameter Optimization
• Individual models can be tuned using a validation set
Ensembles
• Ensembles often out-perform individual models
• Stacking / Super Learning (Wolpert, Breiman)
• Ensemble Selection (Caruana)
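
The difference between Cartesian and random grid search under Model Generation can be sketched in a few lines of plain Python. The hyperparameter names resemble GBM settings, but the values in `SPACE` and the helper names are illustrative assumptions, not a recommended search space.

```python
import itertools
import random

# Illustrative search space (values are made up for the example).
SPACE = {
    "max_depth": [3, 5, 7, 9],
    "learn_rate": [0.01, 0.05, 0.1],
    "sample_rate": [0.6, 0.8, 1.0],
}


def cartesian_grid(space):
    """Every combination of every value: exhaustive but grows multiplicatively."""
    keys = list(space)
    return [dict(zip(keys, combo))
            for combo in itertools.product(*(space[k] for k in keys))]


def random_grid(space, max_models, seed):
    """A random subset of the Cartesian grid, capped at max_models combinations."""
    rng = random.Random(seed)
    return rng.sample(cartesian_grid(space), max_models)


print(len(cartesian_grid(SPACE)))           # size of the full grid
print(random_grid(SPACE, max_models=5, seed=1))
```

Random search trades exhaustiveness for a fixed budget: the number of models trained is set by `max_models` rather than by the size of the space.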

Different Flavors of AutoML
https://tinyurl.com/flavors-of-automl

H2O AutoML (v3.26)
Data Preprocessing
• Imputation, one-hot encoding, standardization
• Feature selection and/or feature extraction (e.g. PCA)
• Count/Label/Target encoding of categorical features
Model Generation
• Cartesian grid search or random grid search
• Bayesian Hyperparameter Optimization
• Individual models can be tuned using a validation set
Ensembles
• Ensembles often out-perform individual models:
• Stacking / Super Learning (Wolpert, Breiman)
• Ensemble Selection (Caruana)

Random Grid Search & Stacking
• Random Grid Search combined with Stacked
Ensembles is a powerful combination. 
• Ensembles perform particularly well if the models
they are based on (1) are individually strong,  
and (2) make uncorrelated errors. 
• Stacking uses a second-level metalearning algorithm
to find the optimal combination of base learners.
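
A minimal sketch of stacking, assuming two very simple base learners on toy 1-D data and an unconstrained least-squares metalearner. All function names and the data are hypothetical; this is not how H2O implements Stacked Ensembles. The metalearner is trained on cross-validated (out-of-fold) predictions so that it never sees a base model's prediction for a row that model was trained on.

```python
import random


def fit_mean(xs, ys):
    """Base learner 1: always predicts the training mean."""
    m = sum(ys) / len(ys)
    return lambda x: m


def fit_line(xs, ys):
    """Base learner 2: simple 1-D least-squares line."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)


def out_of_fold(fit, xs, ys, k=5):
    """Cross-validated predictions: each row is predicted by a model that never saw it."""
    preds = [0.0] * len(xs)
    for fold in range(k):
        train = [i for i in range(len(xs)) if i % k != fold]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        for i in range(fold, len(xs), k):
            preds[i] = model(xs[i])
    return preds


def fit_metalearner(p1, p2, ys):
    """Least-squares weights (w1, w2) minimising sum((y - w1*p1 - w2*p2)^2)."""
    a = sum(u * u for u in p1)
    b = sum(u * v for u, v in zip(p1, p2))
    d = sum(v * v for v in p2)
    e = sum(u * y for u, y in zip(p1, ys))
    f = sum(v * y for v, y in zip(p2, ys))
    det = a * d - b * b
    return (e * d - b * f) / det, (a * f - b * e) / det


def mse(preds, ys):
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)


rng = random.Random(0)
xs = [i / 10 for i in range(50)]
ys = [x * x + rng.gauss(0, 1) for x in xs]  # toy data, made up for the example

p_mean = out_of_fold(fit_mean, xs, ys)
p_line = out_of_fold(fit_line, xs, ys)
w1, w2 = fit_metalearner(p_mean, p_line, ys)
p_stack = [w1 * u + w2 * v for u, v in zip(p_mean, p_line)]

print(mse(p_mean, ys), mse(p_line, ys), mse(p_stack, ys))
```

Because the weights (1, 0) and (0, 1) are both candidates for the least-squares fit, the stacked combination cannot do worse than either base learner on the out-of-fold predictions it was fitted to. Performance on new data still has to be checked separately.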

H2O AutoML
• Basic data pre-processing (as in all H2O algos).
• Trains a random grid of GBMs, DNNs, GLMs, etc.
using a carefully chosen hyper-parameter space.
• Individual models are tuned using cross-validation.
• Two Stacked Ensembles are trained (“All Models”
ensemble & a lightweight “Best of Family” ensemble).
• Returns a sorted “Leaderboard” of all models.
• All models can be easily exported to production.
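
The workflow above might look roughly like this in the H2O Python API. It is a sketch, not a definitive recipe: `run_automl`, `predictor_columns`, `train_path` and `response` are hypothetical names, and the example assumes a binary classification problem with the data in a single file. The H2O imports sit inside the function so the helper can be loaded without a running cluster.

```python
def predictor_columns(columns, response):
    """All columns except the response are used as predictors."""
    return [c for c in columns if c != response]


def run_automl(train_path, response, max_models=20, seed=1):
    # Imported here so this sketch can be loaded without H2O installed.
    import h2o
    from h2o.automl import H2OAutoML

    h2o.init()                           # start or connect to a local H2O cluster
    train = h2o.import_file(train_path)  # train_path is supplied by the caller
    # Assumption: classification, so the response is converted to a categorical.
    train[response] = train[response].asfactor()

    aml = H2OAutoML(max_models=max_models, seed=seed)
    aml.train(x=predictor_columns(train.columns, response),
              y=response,
              training_frame=train)

    print(aml.leaderboard)  # all models, sorted by the default metric
    return aml.leader       # the top model on the leaderboard
```

A call such as `run_automl("train.csv", "response")` would need a real file and response column in place of these placeholder values.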

H2O AutoML Leaderboard
[Figure: example leaderboard for binary classification (Higgs 10k)]

AutoML Pro Tips: Exclude Algos
• If you have sparse, wide data (e.g. text), use
the exclude_algos argument to turn off the
tree-based models (GBM, RF). 
• If you want tree-based algos only, turn off
GLM and DNNs via exclude_algos.
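
A sketch of both tips, assuming the algorithm names used by `exclude_algos` in the Python API ("GBM", "DRF" for Random Forest, "GLM", "DeepLearning"). The helper names and the `data_kind` labels are hypothetical. XGBoost, where available, is also tree-based and could be added to the first list.

```python
TREE_ALGOS = ["GBM", "DRF"]               # DRF is H2O's name for its Random Forest
NON_TREE_ALGOS = ["GLM", "DeepLearning"]  # DeepLearning covers the DNNs


def exclude_algos_for(data_kind):
    """Pick the exclude_algos list for the two situations described above."""
    if data_kind == "sparse_wide":  # e.g. text features: switch off tree models
        return TREE_ALGOS
    if data_kind == "trees_only":   # keep only tree-based learners
        return NON_TREE_ALGOS
    raise ValueError(f"unknown data_kind: {data_kind!r}")


def build_automl(data_kind, **kwargs):
    # Imported here so this sketch can be loaded without H2O installed.
    from h2o.automl import H2OAutoML
    return H2OAutoML(exclude_algos=exclude_algos_for(data_kind), **kwargs)
```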

AutoML Pro Tips: Time & Model Limits
• AutoML will stop after 1 hour unless you
change max_runtime_secs. 
• If you need reproducibility, you must use
max_models instead. Running with
max_runtime_secs is not reproducible since
available resources on a machine may change
from run to run.
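
The two stopping strategies can be expressed as alternative argument sets for `H2OAutoML`. `limit_args` is a hypothetical helper. Setting `seed` and excluding DeepLearning for reproducible runs goes beyond the slide and reflects advice in the H2O AutoML documentation, so treat both as assumptions to verify for your version. Depending on the version, the default time limit may also still apply when only `max_models` is set.

```python
def limit_args(reproducible, max_models=20, max_runtime_secs=3600, seed=1):
    """Return keyword arguments for H2OAutoML under one of two stopping strategies."""
    if reproducible:
        # Stop after a fixed number of models, independent of machine speed.
        return {
            "max_models": max_models,
            "seed": seed,
            # Assumption from the H2O docs: DeepLearning is not reproducible by default.
            "exclude_algos": ["DeepLearning"],
        }
    # Time-boxed run: the set of models trained depends on available resources.
    return {"max_runtime_secs": max_runtime_secs}


def build_automl(reproducible):
    # Imported here so this sketch can be loaded without H2O installed.
    from h2o.automl import H2OAutoML
    return H2OAutoML(**limit_args(reproducible))
```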

AutoML Pro Tips: Cluster memory
• Reminder: All H2O models are stored in H2O
Cluster memory.
• Make sure to give the H2O Cluster a lot of
memory if you’re going to create hundreds or
thousands of models.
• e.g. h2o.init(max_mem_size = "80G")

H2O AutoML Roadmap
• Automatic target encoding of high cardinality categorical cols
• Better support for wide datasets via feature selection/extraction
• Support text input directly via Word2Vec
• Variable importance for Stacked Ensembles
• Improvements to the models we train based on benchmarking
• Fully customizable model list (grid space, etc)
• New algorithms: SVM, GAM

AutoML Benchmarks
 
arXiv paper ⬇  
https://tinyurl.com/automlbenchmark

AutoML Benchmarks
https://openml.github.io/automlbenchmark/results.html

AutoML Benchmarks
https://youtu.be/WlXhpXv9kDU

Learn H2O AutoML!
• Docs: https://tinyurl.com/h2o-automl-docs
• R & Py tutorials: https://tinyurl.com/h2o-automl-tutorials

H2O Resources
• Documentation: http://docs.h2o.ai
• Tutorials: https://github.com/h2oai/h2o-tutorials
• Slidedecks: https://github.com/h2oai/h2o-meetups
• Videos: https://www.youtube.com/user/0xdata
• Stack Overflow: https://stackoverflow.com/tags/h2o
• Google Group: https://tinyurl.com/h2ostream
• Gitter: http://gitter.im/h2oai/h2o-3
• Events & Meetups: http://h2o.ai/events

Thank you!
@ledell on Github, Twitter
erin@h2o.ai
