Scalable Automatic Machine Learning in H2O
Erin LeDell Ph.D.

@ledell
Data Science Conference 5.0

Belgrade, Serbia
Nov 2019
What is H2O?
H2O.ai, the company
• Founded in 2012
• Advised by Stanford Professors Hastie, Tibshirani & Boyd
• Headquarters: Mountain View, California, USA
H2O, the platform
• Open Source Software (Apache 2.0 Licensed)
• R, Python, Scala, Java and Web Interfaces
• Distributed Machine Learning Algorithms for Big Data
Agenda
• H2O Platform
• Automatic Machine Learning (AutoML)
• H2O AutoML Overview
• Resources
Slides ⬇ https://tinyurl.com/automl-dsc5
H2O Platform
H2O Machine Learning Platform
• Distributed (multi-core + multi-node) implementations of
cutting edge ML algorithms.
• Core algorithms written in high performance Java.
• APIs available in R, Python, Scala; web GUI.
• Easily deploy models to production as pure Java code.
• Works on Hadoop, Spark, EC2, your laptop, etc.
H2O Distributed Computing
H2O Cluster
• Multi-node cluster with shared memory model.
• All computations in memory.
• Each node sees only some rows of the data.
• No limit on cluster size.
H2O Frame
• Distributed data frames (collection of vectors).
• Columns are distributed (across nodes) arrays.
• Works just like R’s data.frame or Python Pandas DataFrame
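To illustrate what "each node sees only some rows" means in practice, here is a conceptual sketch in plain Python. It is not H2O's implementation: the function names (`partition_rows`, `local_stats`, `distributed_mean`) and the sample data are made up for illustration. It only shows how a column statistic can be computed from per-node partial results.

```python
from typing import List, Tuple


def partition_rows(column: List[float], n_nodes: int) -> List[List[float]]:
    """Split one column into contiguous row chunks (at most one per node)."""
    chunk = max(1, -(-len(column) // n_nodes))  # ceiling division
    return [column[i:i + chunk] for i in range(0, len(column), chunk)]


def local_stats(chunk: List[float]) -> Tuple[float, int]:
    """Map step: a node summarises only the rows it holds."""
    return sum(chunk), len(chunk)


def distributed_mean(column: List[float], n_nodes: int) -> float:
    """Reduce step: combine per-node partial sums into a global mean."""
    partials = [local_stats(c) for c in partition_rows(column, n_nodes)]
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


# Hypothetical column of seven rows spread over three "nodes".
ages = [23.0, 35.0, 41.0, 29.0, 52.0, 38.0, 47.0]
chunks = partition_rows(ages, n_nodes=3)
mean = distributed_mean(ages, n_nodes=3)
print(len(chunks), mean)
```

No node ever needs the full column; only the small partial results travel between nodes.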
H2O Machine Learning Features
• Supervised & unsupervised machine learning algos
(GBM, RF, DNN, GLM, Stacked Ensembles, etc.)
• Imputation, normalization & auto one-hot-encoding
• Automatic early stopping
• Cross-validation, grid search & random search
• Variable importance, model evaluation metrics, plots
Intro to Automatic Machine Learning
Goals & Features of AutoML
• 🏆 Train the best model in the least amount of time.
• 📉 Reduce the human effort & expertise required
in machine learning.
• 📈 Improve the performance of machine learning
models.
• 🔄 Increase reproducibility & establish a baseline
for scientific research or applications.
Aspects of Automatic Machine Learning
Data Preprocessing
• Imputation, one-hot encoding, standardization
• Feature selection and/or feature extraction (e.g. PCA)
• Count/Label/Target encoding of categorical features
Model Generation
• Cartesian grid search or random grid search
• Bayesian Hyperparameter Optimization
• Individual models can be tuned using a validation set
Ensembles
• Ensembles often out-perform individual models
• Stacking / Super Learning (Wolpert, Breiman)
• Ensemble Selection (Caruana)
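The difference between Cartesian and random grid search can be sketched in a few lines of plain Python. This is an illustration only: the hyperparameter names and values in `space` are hypothetical, and building the full grid before sampling is only sensible for small search spaces.

```python
import itertools
import random
from typing import Any, Dict, List

Space = Dict[str, List[Any]]


def cartesian_grid(space: Space) -> List[Dict[str, Any]]:
    """Enumerate every combination of hyperparameter values."""
    keys = sorted(space)
    combos = itertools.product(*(space[k] for k in keys))
    return [dict(zip(keys, values)) for values in combos]


def random_grid(space: Space, n_models: int, seed: int) -> List[Dict[str, Any]]:
    """Sample n_models distinct combinations from the full grid."""
    rng = random.Random(seed)  # a fixed seed makes the sample repeatable
    grid = cartesian_grid(space)
    return rng.sample(grid, min(n_models, len(grid)))


# Hypothetical search space for a tree-based model.
space = {
    "max_depth": [3, 5, 7, 9],
    "learn_rate": [0.01, 0.05, 0.1],
    "sample_rate": [0.6, 0.8, 1.0],
}
full = cartesian_grid(space)             # 4 * 3 * 3 = 36 candidate models
subset = random_grid(space, 8, seed=42)  # only 8 of them get trained
print(len(full), len(subset))
```

Random search trades exhaustive coverage for a fixed training budget, which matters once the grid grows beyond a handful of parameters.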
Different Flavors of AutoML
https://tinyurl.com/flavors-of-automl
H2O’s AutoML
H2O AutoML (v3.26)
Data Preprocessing
• Imputation, one-hot encoding, standardization
• Feature selection and/or feature extraction (e.g. PCA)
• Count/Label/Target encoding of categorical features
Model Generation
• Cartesian grid search or random grid search
• Bayesian Hyperparameter Optimization
• Individual models can be tuned using a validation set
Ensembles
• Ensembles often out-perform individual models:
• Stacking / Super Learning (Wolpert, Breiman)
• Ensemble Selection (Caruana)
Random Grid Search & Stacking
• Random Grid Search combined with Stacked
Ensembles is a powerful combination.
• Ensembles perform particularly well if the models
they are based on (1) are individually strong,
and (2) make uncorrelated errors.
• Stacking uses a second-level metalearning algorithm
to find the optimal combination of base learners.
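A toy version of the metalearning step may help. The sketch below is a deliberate simplification, not the Super Learner algorithm or H2O's Stacked Ensemble: it blends just two base learners with a single weight chosen by least squares, and all data and function names are hypothetical. In practice the metalearner is usually fit on cross-validated (held-out) base-learner predictions rather than on the training fit.

```python
from typing import List


def mse(y: List[float], pred: List[float]) -> float:
    """Mean squared error between targets and predictions."""
    return sum((a - b) ** 2 for a, b in zip(y, pred)) / len(y)


def fit_blend_weight(y: List[float], p1: List[float], p2: List[float]) -> float:
    """Least-squares weight w in [0, 1] for the blend w*p1 + (1-w)*p2."""
    d = [a - b for a, b in zip(p1, p2)]
    r = [t - b for t, b in zip(y, p2)]
    denom = sum(x * x for x in d)
    if denom == 0:  # identical base learners: any weight gives the same blend
        return 0.5
    w = sum(x * t for x, t in zip(d, r)) / denom
    return min(1.0, max(0.0, w))  # clip to keep a convex combination


def blend(w: float, p1: List[float], p2: List[float]) -> List[float]:
    """Apply the metalearner: a weighted average of base predictions."""
    return [w * a + (1 - w) * b for a, b in zip(p1, p2)]


# Hypothetical targets and predictions from two base learners
# that make their larger errors on different rows.
y = [1.0, 0.0, 1.0, 1.0, 0.0, 0.0]
p1 = [0.9, 0.4, 0.6, 0.8, 0.3, 0.1]
p2 = [0.6, 0.1, 0.9, 0.7, 0.2, 0.4]

w = fit_blend_weight(y, p1, p2)
stacked = blend(w, p1, p2)
print(w, mse(y, p1), mse(y, p2), mse(y, stacked))
```

Because a weight of 0 or 1 reproduces a single base learner, the fitted blend can never have a higher error than the better of the two on the data used for fitting; the gain grows as the base learners' errors become less correlated.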
H2O AutoML
• Basic data pre-processing (as in all H2O algos).
• Trains a random grid of GBMs, DNNs, GLMs, etc.
using a carefully chosen hyper-parameter space.
• Individual models are tuned using cross-validation.
• Two Stacked Ensembles are trained (“All Models”
ensemble & a lightweight “Best of Family” ensemble).
• Returns a sorted “Leaderboard” of all models.
• All models can be easily exported to production.
H2O AutoML in Python
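The code shown on this slide is not part of the extracted text. As a stand-in, here is a minimal sketch of a typical H2O AutoML run in Python, assuming the `h2o` package and a Java runtime are installed. The file name, target column, and the helper names `automl_settings` and `run_automl` are hypothetical, and the sketch assumes a classification task.

```python
from typing import Any, Dict, List, Optional


def automl_settings(max_models: int = 20, seed: int = 1,
                    exclude_algos: Optional[List[str]] = None) -> Dict[str, Any]:
    """Collect keyword arguments to pass to H2OAutoML."""
    settings: Dict[str, Any] = {"max_models": max_models, "seed": seed}
    if exclude_algos:
        settings["exclude_algos"] = list(exclude_algos)
    return settings


def run_automl(train_path: str, target: str, **settings: Any):
    """Train AutoML on a CSV file and return (leaderboard, best model)."""
    # Imported lazily so the helper above works without h2o installed.
    import h2o
    from h2o.automl import H2OAutoML

    h2o.init()
    train = h2o.import_file(train_path)
    # Assumption: the target is categorical, so convert it to a factor.
    train[target] = train[target].asfactor()
    predictors = [c for c in train.columns if c != target]

    aml = H2OAutoML(**settings)
    aml.train(x=predictors, y=target, training_frame=train)
    return aml.leaderboard, aml.leader


settings = automl_settings(max_models=10, seed=42)
# Hypothetical file and column names; uncomment to run against a real dataset:
# leaderboard, best = run_automl("train.csv", "response", **settings)
```

The returned leaderboard is an H2O frame of models sorted by a default metric, and the leader is the top-ranked model.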
H2O AutoML in R
H2O AutoML in Flow GUI
H2O AutoML Leaderboard
Example Leaderboard
for binary classification
(Higgs 10k)
AutoML Pro Tips!
AutoML Pro Tips: Exclude Algos
• If you have sparse, wide data (e.g. text), use
the exclude_algos argument to turn off the
tree-based models (GBM, RF).

• If you want tree-based algos only, turn off
GLM and DNNs via exclude_algos.
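The two tips translate into two short lists. The sketch below uses the algorithm names that H2O accepts for `exclude_algos` ("DRF" is H2O's name for Random Forest, "DeepLearning" for DNNs); the helper `algos_to_exclude` is hypothetical and only packages the slide's advice.

```python
from typing import List

TREE_BASED = ["GBM", "DRF"]          # tree-based models named on the slide
NON_TREE = ["GLM", "DeepLearning"]   # GLM and DNNs


def algos_to_exclude(sparse_wide_data: bool = False,
                     trees_only: bool = False) -> List[str]:
    """Return the exclude_algos list matching one of the two tips."""
    if sparse_wide_data and trees_only:
        raise ValueError("the two tips are mutually exclusive")
    if sparse_wide_data:
        return list(TREE_BASED)   # turn off the tree-based models
    if trees_only:
        return list(NON_TREE)     # keep only the tree-based models
    return []


text_data_exclusions = algos_to_exclude(sparse_wide_data=True)
tree_only_exclusions = algos_to_exclude(trees_only=True)
# With h2o installed, the list is passed straight through, for example:
# H2OAutoML(max_models=20, exclude_algos=text_data_exclusions)
print(text_data_exclusions, tree_only_exclusions)
```

If XGBoost is available in the cluster it is also tree-based, so it may be worth adding "XGBoost" to the first list for sparse, wide data.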
AutoML Pro Tips: Time & Model Limits
• AutoML will stop after 1 hour unless you
change max_runtime_secs.

• If you need reproducibility, you must use
max_models instead. Running with
max_runtime_secs is not reproducible since
available resources on a machine may change
from run to run.
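A small simulation shows why a time limit is not reproducible while a model limit is. This is a simplified model, not H2O's scheduler: the per-model training times and the relative `speed` factor are hypothetical, and it only counts models that finish within the budget.

```python
from typing import List


def models_within_time(train_secs: List[float], budget_secs: float,
                       speed: float = 1.0) -> int:
    """Count models that finish before a wall-clock budget runs out."""
    elapsed, finished = 0.0, 0
    for secs in train_secs:
        elapsed += secs / speed  # a faster machine needs less wall-clock time
        if elapsed > budget_secs:
            break
        finished += 1
    return finished


def models_within_count(train_secs: List[float], max_models: int) -> int:
    """Count models trained under a model limit; machine speed is irrelevant."""
    return min(max_models, len(train_secs))


# Hypothetical training times (seconds) for five candidate models.
train_secs = [300.0, 600.0, 900.0, 1200.0, 1500.0]

fast = models_within_time(train_secs, budget_secs=3600, speed=2.0)
slow = models_within_time(train_secs, budget_secs=3600, speed=1.0)
fixed = models_within_count(train_secs, max_models=4)
print(fast, slow, fixed)
```

With the one-hour budget the two machines train different numbers of models, so their leaderboards differ; with a model limit both train the same set.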
AutoML Pro Tips: Cluster memory
• Reminder: All H2O models are stored in H2O
Cluster memory.
• Make sure to give the H2O Cluster a lot of
memory if you’re going to create hundreds or
thousands of models.
• e.g. h2o.init(max_mem_size = "80G")
H2O AutoML Roadmap
• Automatic target encoding of high cardinality categorical cols
• Better support for wide datasets via feature selection/extraction
• Support text input directly via Word2Vec
• Variable importance for Stacked Ensembles
• Improvements to the models we train based on benchmarking
• Fully customizable model list (grid space, etc)
• New algorithms: SVM, GAM
H2O AutoML in the Wild
AutoML Benchmarks
arXiv paper ⬇ https://tinyurl.com/automlbenchmark
https://openml.github.io/automlbenchmark/results.html
https://youtu.be/WlXhpXv9kDU
Learn H2O AutoML!
• Docs: https://tinyurl.com/h2o-automl-docs
• R & Py tutorials: https://tinyurl.com/h2o-automl-tutorials
H2O Resources
• Documentation: http://docs.h2o.ai
• Tutorials: https://github.com/h2oai/h2o-tutorials
• Slidedecks: https://github.com/h2oai/h2o-meetups
• Videos: https://www.youtube.com/user/0xdata
• Stack Overflow: https://stackoverflow.com/tags/h2o
• Google Group: https://tinyurl.com/h2ostream
• Gitter: http://gitter.im/h2oai/h2o-3
• Events & Meetups: http://h2o.ai/events
Thank you!
@ledell on GitHub, Twitter
erin@h2o.ai

