H2O.ai
Machine Intelligence
H2O Design and
Infrastructure
R Summit and Workshop, Copenhagen
27 Jun 2015
Matt Dowle
Overview
1. What exactly is H2O
2. How it works
I'll start an 8 node cluster live on EC2 now
Click
4 mins to start up, 2 slides
H2O
Machine learning e.g. Deep Learning
In-memory, parallel and distributed
1. Data > 240GB needle-in-haystack; e.g. fraud
2. Data < 240GB, compute intensive, parallel across 100's of cores
3. Data < 240GB where feature engineering > 240GB
Speed for i) production and ii) interaction
Developed in the open on GitHub
Liberal Apache license
Use from R, Python or H2O Flow … simultaneously
I now work here
8-node cluster on EC2 is now ready
LIVE 15MIN DEMO
To use from R
# If Java is not already installed:
$ sudo add-apt-repository -y ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get -y install oracle-java8-installer
$ sudo apt-get -y install oracle-java8-set-default
$ java -version
$ R
> install.packages("h2o")
That's it.
Start H2O
> library(h2o)
> h2o.init()
H2O is not running yet, starting it now...
Successfully connected to http://127.0.0.1:54321
R is connected to H2O cluster:
H2O cluster uptime: 1 sec 397 ms
H2O cluster version: 2.8.4.4
H2O cluster total nodes: 1
H2O cluster total memory: 26.67 GB
H2O cluster total cores: 32
h2o.importFile
23GB .csv, 9 columns, 500e6 rows
> DF <- h2o.importFile("/dev/shm/test.csv")
   user  system elapsed 
  0.775   0.058  50.559
> head(DF)
    id1   id2          id3 id4 id5  id6 v1 v2      v3
1 id076 id035 id0000003459  20  80 8969  4  3 43.1525
2 id062 id023 id0000002848  99  49 7520  5  2 86.9519
3 id001 id052 id0000007074  89  16 8183  1  3 19.6696
23GB .csv, 9 columns, 500e6 rows
library(h2o)
h2o.importFile("/dev/shm/test.csv")  # 50 seconds, parallel
library(data.table)
fread("/dev/shm/test.csv")  # 5 minutes, single thread
library(readr)
read_csv("/dev/shm/test.csv")  # 12 minutes, single thread
h2o.importFile also
● compresses the data in RAM
● profiles the data while reading; e.g. stores min
and max per column, for later efficiency gains
● both are included in the 50 seconds
● accepts a directory of multiple files
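A rough sketch of the first two bullets in plain Java: one pass records a column's min and max, and a column whose range fits in one byte is stored as byte offsets from the min. The slides do not describe H2O's actual encoding, so the class `ColumnProfileSketch`, its methods and the one-byte scheme are all assumptions for illustration. The sample values are the id4 column from head(DF).

```java
// Hypothetical illustration only; not H2O's real chunk format.
class ColumnProfileSketch {
    final long min, max;   // profile gathered while reading
    final byte[] packed;   // each value stored as (value - min), one byte per row

    private ColumnProfileSketch(long min, long max, byte[] packed) {
        this.min = min; this.max = max; this.packed = packed;
    }

    static ColumnProfileSketch pack(long[] values) {
        long min = Long.MAX_VALUE, max = Long.MIN_VALUE;
        for (long v : values) { min = Math.min(min, v); max = Math.max(max, v); }
        long range = max - min;
        // A negative range means the subtraction overflowed, so the range is too wide.
        if (values.length > 0 && (range < 0 || range > 255))
            throw new IllegalArgumentException("range does not fit in one byte");
        byte[] packed = new byte[values.length];
        for (int i = 0; i < values.length; i++) packed[i] = (byte) (values[i] - min);
        return new ColumnProfileSketch(min, max, packed);
    }

    // Decode row i back to its original value.
    long at(int i) { return min + (packed[i] & 0xFF); }

    public static void main(String[] args) {
        ColumnProfileSketch id4 = pack(new long[] {20, 99, 89});
        System.out.println(id4.min + " " + id4.max + " " + id4.at(1)); // 20 99 99
    }
}
```

Here three 8-byte longs shrink to three bytes plus the stored min and max, and the min/max come for free from the same pass.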
hex <- h2o.importFile(conn, path)
summary(hex)
hex$Year <- as.factor(hex$Year)
myY <- "IsDepDelayed"
myX <- c("Origin", "Dest", "Year", "UniqueCarrier",
"DayOfWeek", "Month", "Distance", "FlightNum")
dl <- h2o.deeplearning(y = myY, x = myX,
training_frame = hex, hidden=c(20,20,20,20),
epochs = 1, variable_importances = T)
Standard R
How it works
Slides by Cliff Click
CTO and Co-Founder
Simple Data-Parallel Coding
● Map/Reduce Per-Row: Stateless
● Example from Linear Regression, Σ y²
● Auto-parallel, auto-distributed
● Fortran speed, Java Ease
double sumY2 = new MRTask() {
double map( double d ) { return d*d; }
double reduce( double d1, double d2 ) {
return d1+d2;
}
}.doAll( data );
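For readers without an H2O cluster, the same stateless map/reduce shape can be tried on one machine with JDK parallel streams. This is a stand-in, not H2O's MRTask: the class and method names are assumptions, and the JDK parallelises across local cores only.

```java
import java.util.stream.DoubleStream;

// Plain-JDK analogue of the stateless per-row map/reduce; not the H2O API.
class SumOfSquares {
    static double sumY2(double[] y) {
        return DoubleStream.of(y)
                .parallel()                 // split the rows across local cores
                .map(d -> d * d)            // map(): per row, no shared state
                .reduce(0.0, Double::sum);  // reduce(): associative pairwise add
    }

    public static void main(String[] args) {
        System.out.println(sumY2(new double[] {1.0, 2.0, 3.0, 4.0})); // 30.0
    }
}
```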
Simple Data-Parallel Coding
● Map/Reduce Per-Row: Stateful
● Linear Regression Pass1: Σ x, Σ y, Σ y²
class LRPass1 extends MRTask {
double sumX, sumY, sumY2;// I can have State?
void map( double X, double Y ) {
sumX += X; sumY += Y; sumY2 += Y*Y;
}
void reduce( LRPass1 that ) {
sumX += that.sumX ;
sumY += that.sumY ;
sumY2 += that.sumY2;
}
}
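A self-contained sketch of why the state is safe: each task instance owns its fields, and only reduce() combines instances. The driver `doAll` below is a hypothetical, sequential stand-in that creates one instance per block of rows; the real MRTask runs the instances in parallel and across nodes.

```java
// Hypothetical plain-Java analogue of LRPass1; the chunked driver is an assumption.
class LRPass1Sketch {
    double sumX, sumY, sumY2;   // state private to one task instance

    void map(double x, double y) {
        sumX += x; sumY += y; sumY2 += y * y;
    }

    void reduce(LRPass1Sketch that) {
        sumX += that.sumX; sumY += that.sumY; sumY2 += that.sumY2;
    }

    // One fresh instance per block of rows, then fold the partial sums together.
    static LRPass1Sketch doAll(double[] xs, double[] ys, int blockSize) {
        LRPass1Sketch total = new LRPass1Sketch();
        for (int start = 0; start < xs.length; start += blockSize) {
            LRPass1Sketch part = new LRPass1Sketch();
            int end = Math.min(start + blockSize, xs.length);
            for (int i = start; i < end; i++) part.map(xs[i], ys[i]);
            total.reduce(part);
        }
        return total;
    }

    public static void main(String[] args) {
        LRPass1Sketch r = doAll(new double[] {1, 2, 3, 4, 5},
                                new double[] {2, 4, 6, 8, 10}, 2);
        System.out.println(r.sumX + " " + r.sumY + " " + r.sumY2); // 15.0 30.0 220.0
    }
}
```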
Non-blocking distributed KV
● Uniques
● Uses distributed hash set
class Uniques extends MRTask {
DNonBlockingHashSet<Long> dnbhs = new ...;
void map( long id ) { dnbhs.add(id); }
void reduce( Uniques that ) {
dnbhs.putAll(that.dnbhs);
}
};
long uniques = new Uniques().
doAll( vecVisitors ).dnbhs.size();
Setting dnbhs in <init> makes it an input field.
Shared across all maps(). Often read-only.
This one is written, so needs a reduce.
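On a single machine the same idea can be sketched with a JDK concurrent set shared by all workers. `ConcurrentHashMap.newKeySet()` is used here as a stand-in for the distributed hash set on the slide, which is an assumption: the JDK set lives in one JVM, so no reduce step is needed to merge per-node copies.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.LongStream;

// Single-JVM stand-in for the Uniques task; not the H2O API.
class UniquesSketch {
    static int countUniques(long[] visitorIds) {
        Set<Long> seen = ConcurrentHashMap.newKeySet();   // thread-safe, shared by all workers
        LongStream.of(visitorIds).parallel().forEach(id -> seen.add(id));
        return seen.size();
    }

    public static void main(String[] args) {
        System.out.println(countUniques(new long[] {7, 3, 7, 9, 3, 3})); // 3
    }
}
```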
Limitations
● Code runs distributed...
● No I/O or Machine Resource allocation
● No new threads, no locks, no System.exit()
● No global / static variables
● Instead they become node-local
● "Small" global read state: in constructor
● "Small" global writable state: use reduce()
● "Big" state: read/write distributed arrays (Vecs)
● Runs one (big step) to completion, then
another...
Strengths
● Code runs distributed & parallel without effort
● Millions & billions of rows; 1000's of cores
● Single-threaded coding style
● No concurrency issues
● Excellent resource management
● "No knobs needed" for GC or CPUs or network
● No "data placement", no "hot blocks" or "hot locks"
How Does It Work?
(Code)
Distributed Fork / Join
● T = new MRTask().doAll(data);
● Log tree fan-out copy...
[Diagram sequence: the Task is copied from one JVM to all eight JVMs in the cluster]
● Within a single node, classic Fork/Join
● Divide & Conquer, parallel map over local data
[Diagram sequence: inside one JVM the Task splits into four Tasks, one per block of data; each map() yields a Task.B]
● Parallel, eager reduces
● Log-tree reduce
● Reduce to the same top-level instance
[Diagram sequence: pairs of Task.B are combined by reduce() until one Task.B remains in the JVM]
● Reductions back up the log-tree
● Final reduction into the original instance
[Diagram sequence: the per-JVM Task.B results are reduced pairwise across the cluster, from eight to four to two to one]
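The single-node part of this picture is the JDK's own fork/join framework, and can be sketched as below. The class name, the leaf size of 1,000 rows and the sum-of-squares workload are assumptions for illustration; the cross-JVM fan-out and reduce are H2O's and are not shown.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Divide & conquer on one node: split, map the leaves in parallel, reduce up the tree.
class ForkJoinSumSketch extends RecursiveTask<Double> {
    static final int LEAF = 1_000;   // assumed leaf size
    final double[] data;
    final int lo, hi;

    ForkJoinSumSketch(double[] data, int lo, int hi) {
        this.data = data; this.lo = lo; this.hi = hi;
    }

    @Override
    protected Double compute() {
        if (hi - lo <= LEAF) {                 // leaf: map over local rows
            double s = 0;
            for (int i = lo; i < hi; i++) s += data[i] * data[i];
            return s;
        }
        int mid = (lo + hi) >>> 1;             // divide
        ForkJoinSumSketch left = new ForkJoinSumSketch(data, lo, mid);
        ForkJoinSumSketch right = new ForkJoinSumSketch(data, mid, hi);
        left.fork();                           // left half may run on another thread
        double r = right.compute();            // right half runs here
        return left.join() + r;                // reduce as soon as both halves finish
    }

    static double sumOfSquares(double[] data) {
        return ForkJoinPool.commonPool().invoke(new ForkJoinSumSketch(data, 0, data.length));
    }

    public static void main(String[] args) {
        double[] ones = new double[10_000];
        java.util.Arrays.fill(ones, 1.0);
        System.out.println(sumOfSquares(ones)); // 10000.0
    }
}
```

Each level of the recursion halves the range, so both the split and the reduce take a logarithmic number of steps in the number of leaves.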
How Does It Work?
(Data)
Distributed Data Taxonomy
Distributed Parallel Execution
[Diagram: Vecs spanning the heaps of JVM 1 to JVM 4]
● All CPUs grab Chunks in parallel
● F/J load balances
● Code moves to Data
● Map/Reduce & F/J handles all sync
● H2O handles all comm, data management
Distributed Data Taxonomy
Frame – a collection of Vecs
Vec – a collection of Chunks
Chunk – a collection of 1e3 to 1e6 elems
elem – a Java double
Row i – i'th elements of all the Vecs in a
Frame
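The taxonomy above maps onto a few small classes. This is a single-JVM sketch with hypothetical names (`FrameSketch`, `at`, `row`); in H2O the Chunks of one Vec live on different nodes, which is not modelled here.

```java
import java.util.Arrays;

// Illustrative data model only: Frame -> Vecs -> Chunks -> doubles.
class FrameSketch {
    static final class Chunk {             // a contiguous run of one column's elements
        final double[] elems;
        Chunk(double... elems) { this.elems = elems; }
    }

    static final class Vec {               // one column, stored as a sequence of Chunks
        final Chunk[] chunks;
        Vec(Chunk... chunks) { this.chunks = chunks; }

        double at(long row) {              // walk the chunks to find the row's element
            long offset = row;
            for (Chunk c : chunks) {
                if (offset < c.elems.length) return c.elems[(int) offset];
                offset -= c.elems.length;
            }
            throw new IndexOutOfBoundsException("row " + row);
        }
    }

    static final class Frame {             // a collection of Vecs
        final Vec[] vecs;
        Frame(Vec... vecs) { this.vecs = vecs; }

        double[] row(long i) {             // row i = i'th element of every Vec
            double[] r = new double[vecs.length];
            for (int c = 0; c < vecs.length; c++) r[c] = vecs[c].at(i);
            return r;
        }
    }

    public static void main(String[] args) {
        Vec v1 = new Vec(new Chunk(4.0, 5.0), new Chunk(1.0));
        Vec v2 = new Vec(new Chunk(3.0, 2.0), new Chunk(3.0));
        System.out.println(Arrays.toString(new Frame(v1, v2).row(2))); // [1.0, 3.0]
    }
}
```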
Summary
Distributed Coding Taxonomy
● No Distribution Coding:
● Whole Algorithms, Whole Vector-Math
● REST + JSON: e.g. load data, GLM, get results
● R, Python, Web, bash/curl
● Simple Data-Parallel Coding:
● Map/Reduce-style: e.g. Any dense linear algebra
● Java/Scala foreach* style
● Complex Data-Parallel Coding
● K/V Store, Graph Algo's, e.g. PageRank
Summary: Writing (distributed) Java
● Most simple Java “just works”
● Scala API is experimental, but will also "just work"
● Fast: parallel distributed reads, writes, appends
● Reads same speed as plain Java array loads
● Writes, appends: slightly slower (compression)
● Typically memory bandwidth limited
– (may be CPU limited in a few cases)
● Slower: conflicting writes (but follows strict JMM)
● Also supports transactional updates
Summary: Writing Analytics
● We're writing Big Data Distributed Analytics
● Deep Learning
● Generalized Linear Modeling (ADMM, GLMNET)
– Logistic Regression, Poisson, Gamma
● Random Forest, GBM, KMeans, PCA, ...
● Come write your own (distributed) algorithm!!!
Further articles from H2O
Efficient Low Latency Java and GCs
A K/V Store For In-Memory Analytics: Part 1
A K/V Store For In-Memory Analytics, Part 2
H2O Architecture
http://h2o.ai/about/
