GBM:
Distributed Tree Algorithms
on H2O
Cliff Click, CTO 0xdata
cliffc@0xdata.com
http://0xdata.com
http://cliffc.org/blog
0xdata.com 2
H2O is...
● Pure Java, Open Source: 0xdata.com
● https://github.com/0xdata/h2o/
● A Platform for doing Math
● Parallel Distributed Math
● In-memory analytics: GLM, GBM, RF, Logistic Reg
● Accessible via REST & JSON
● A K/V Store: ~150ns per get or put
● Distributed Fork/Join + Map/Reduce + K/V
Agenda
● Building Blocks For Big Data:
● Vecs & Frames & Chunks
● Distributed Tree Algorithms
● Access Patterns & Execution
● GBM on H2O
● Performance
A Collection of Distributed Vectors
// A Distributed Vector
// much more than 2 billion elements
class Vec {
  long length();                  // more than an int's worth
  // fast random access
  double at(long idx);            // Get the idx'th elem
  boolean isNA(long idx);
  void set(long idx, double d);   // writable
  void append(double d);          // variable sized
}
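As a rough illustration of how a vector with a `long` index can sit on top of several smaller arrays, here is a minimal single-JVM sketch. It is not H2O's `Vec`: the class name `ChunkedVecSketch`, the `starts` offset table, the non-empty-chunk requirement and the use of NaN as the NA marker are all assumptions made for the example.

```java
import java.util.Arrays;

/** Hypothetical stand-in for a Vec: elements live in several non-empty arrays ("chunks"). */
final class ChunkedVecSketch {
    private final double[][] chunks;   // each inner array is one chunk
    private final long[] starts;       // starts[c] = global index of chunk c's first element
    private final long length;

    ChunkedVecSketch(double[][] chunks) {
        this.chunks = chunks;
        this.starts = new long[chunks.length];
        long n = 0;
        for (int c = 0; c < chunks.length; c++) {
            starts[c] = n;
            n += chunks[c].length;
        }
        this.length = n;
    }

    long length() { return length; }

    /** Binary-search the chunk that holds idx. */
    private int chunkOf(long idx) {
        if (idx < 0 || idx >= length) throw new IndexOutOfBoundsException("idx=" + idx);
        int c = Arrays.binarySearch(starts, idx);
        return c >= 0 ? c : -c - 2;    // not a chunk start: take the chunk before the insertion point
    }

    double at(long idx) {
        int c = chunkOf(idx);
        return chunks[c][(int) (idx - starts[c])];
    }

    boolean isNA(long idx) { return Double.isNaN(at(idx)); }   // assumption: NaN marks a missing value

    void set(long idx, double d) {
        int c = chunkOf(idx);
        chunks[c][(int) (idx - starts[c])] = d;
    }
}
```

The global index is a `long` while each chunk is indexed by an `int`, which is how the total length can exceed what a single Java array allows.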
Frames
A Frame: Vec[]
[Figure: a Frame with columns age, sex, zip, ID, car, laid out across the heaps of JVM 1 to JVM 4]
● Vecs aligned in heaps
● Optimized for concurrent access
● Random access any row, any JVM
● But faster if local... more on that later
Distributed Data Taxonomy
A Chunk, Unit of Parallel Access
[Figure: five Vecs across the heaps of JVM 1 to JVM 4]
● Typically 1e3 to 1e6 elements
● Stored compressed
● In byte arrays
● Get/put is a few clock cycles including compression
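To make "stored compressed in byte arrays" concrete, the sketch below packs a chunk of doubles into one byte per element using a bias and a scale, and decompresses on each read. The scheme, the class name `ByteChunkSketch` and the `tryCompress` helper are assumptions for illustration; the slides do not say which encodings H2O actually uses.

```java
/** Hypothetical one-byte-per-element chunk: value = bias + scale * unsignedByte. */
final class ByteChunkSketch {
    private final byte[] mem;     // compressed payload
    private final double bias;    // smallest value in the chunk
    private final double scale;   // step between representable values

    private ByteChunkSketch(byte[] mem, double bias, double scale) {
        this.mem = mem;
        this.bias = bias;
        this.scale = scale;
    }

    /** Returns a compressed chunk, or null if some value is not bias + k*scale with k in 0..255. */
    static ByteChunkSketch tryCompress(double[] vals, double scale) {
        double min = Double.POSITIVE_INFINITY;
        for (double v : vals) min = Math.min(min, v);
        byte[] mem = new byte[vals.length];
        for (int i = 0; i < vals.length; i++) {
            double k = (vals[i] - min) / scale;
            if (k != Math.rint(k) || k > 255) return null;   // does not fit in one byte (NaN also lands here)
            mem[i] = (byte) (int) k;
        }
        return new ByteChunkSketch(mem, min, scale);
    }

    int len() { return mem.length; }

    /** Decompress on the fly: one load, one mask, one multiply-add. */
    double at(int i) { return bias + scale * (mem[i] & 0xFF); }
}
```

A column of ages, for example, fits this encoding with `scale = 1`, taking one byte per row instead of eight.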
Distributed Parallel Execution
[Figure: five Vecs across the heaps of JVM 1 to JVM 4]
● All CPUs grab Chunks in parallel
● F/J load balances
● Code moves to Data
● Map/Reduce & F/J handles all sync
● H2O handles all comm, data management
Distributed Data Taxonomy
Frame – a collection of Vecs
Vec – a collection of Chunks
Chunk – a collection of 1e3 to 1e6 elems
elem – a Java double
Row i – i'th elements of all the Vecs in a Frame
Distributed Coding Taxonomy
● No Distribution Coding: (Read the docs!)
● Whole Algorithms, Whole Vector-Math
● REST + JSON: e.g. load data, GLM, get results
● Simple Data-Parallel Coding: (This talk!)
● Per-Row (or neighbor row) Math
● Map/Reduce-style: e.g. Any dense linear algebra
● Complex Data-Parallel Coding: (Join our GIT!)
● K/V Store, Graph Algos, e.g. PageRank
Simple Data-Parallel Coding
● Map/Reduce Per-Row: Stateless
● Example from Linear Regression: Σ y²
● Auto-parallel, auto-distributed
● Near Fortran speed, Java Ease
double sumY2 = new MRTask() {
  double map( double d ) { return d*d; }
  double reduce( double d1, double d2 ) {
    return d1+d2;
  }
}.doAll( vecY );
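The same stateless map/reduce shape can be tried on a single JVM with standard Java parallel streams. This is only an analogy for the slide's `MRTask`, not H2O code; `SumSquaresSketch` and the array-of-arrays "chunks" are assumptions.

```java
import java.util.Arrays;

/** Hypothetical single-JVM analogue of the stateless Σ y² task. */
final class SumSquaresSketch {
    /** map: square each element; reduce: add. Chunks are processed in parallel. */
    static double sumOfSquares(double[][] chunks) {
        return Arrays.stream(chunks)
                     .parallel()
                     .mapToDouble(chunk -> {
                         double s = 0;
                         for (double y : chunk) s += y * y;   // map, plus a local reduce
                         return s;
                     })
                     .sum();                                   // reduce across chunks
    }
}
```

Because the map has no state and addition is associative, the runtime is free to split and combine the work in any order.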
Simple Data-Parallel Coding
● Map/Reduce Per-Row: Stateful
● Linear Regression Pass1: Σ x, Σ y, Σ y²
class LRPass1 extends MRTask {
  double sumX, sumY, sumY2; // I Can Haz State?
  void map( double X, double Y ) {
    sumX += X; sumY += Y; sumY2 += Y*Y;
  }
  void reduce( LRPass1 that ) {
    sumX  += that.sumX ;
    sumY  += that.sumY ;
    sumY2 += that.sumY2;
  }
}
Simple Data-Parallel Coding
● Map/Reduce Per-Row: Batch Stateful
class LRPass1 extends MRTask {
  double sumX, sumY, sumY2;
  void map( Chunk CX, Chunk CY ) {     // Whole Chunks
    for( int i=0; i<CX.len; i++ ) {    // Batch!
      double X = CX.at(i), Y = CY.at(i);
      sumX += X; sumY += Y; sumY2 += Y*Y;
    }
  }
  void reduce( LRPass1 that ) {
    sumX  += that.sumX ;
    sumY  += that.sumY ;
    sumY2 += that.sumY2;
  }
}
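For readers without an H2O cluster, the batch stateful pattern can be approximated with the JDK's Fork/Join framework: each task owns private sums, `map` fills them from one chunk, and `reduce` folds a sibling's sums in. `LRPass1Sketch`, its `doAll` helper and the `double[][]` chunk layout are hypothetical; only `RecursiveTask` and `ForkJoinPool` are real JDK classes.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

/** Hypothetical Fork/Join analogue of LRPass1: per-task state, rolled up by reduce. */
final class LRPass1Sketch extends RecursiveTask<LRPass1Sketch> {
    double sumX, sumY, sumY2;            // private to this task until reduced
    private final double[][] xs, ys;     // xs[c] and ys[c] are matching chunks
    private final int lo, hi;            // this task covers chunks [lo, hi)

    LRPass1Sketch(double[][] xs, double[][] ys, int lo, int hi) {
        this.xs = xs; this.ys = ys; this.lo = lo; this.hi = hi;
    }

    void map(double[] cx, double[] cy) { // whole chunk at once
        for (int i = 0; i < cx.length; i++) {
            sumX += cx[i]; sumY += cy[i]; sumY2 += cy[i] * cy[i];
        }
    }

    void reduce(LRPass1Sketch that) {
        sumX += that.sumX; sumY += that.sumY; sumY2 += that.sumY2;
    }

    @Override protected LRPass1Sketch compute() {
        if (hi - lo <= 1) {                       // leaf: at most one chunk
            if (hi > lo) map(xs[lo], ys[lo]);
            return this;
        }
        int mid = (lo + hi) >>> 1;
        LRPass1Sketch left  = new LRPass1Sketch(xs, ys, lo, mid);
        LRPass1Sketch right = new LRPass1Sketch(xs, ys, mid, hi);
        left.fork();                              // left half runs asynchronously
        reduce(right.compute());                  // right half runs in this thread
        reduce(left.join());
        return this;
    }

    static LRPass1Sketch doAll(double[][] xs, double[][] ys) {
        return ForkJoinPool.commonPool().invoke(new LRPass1Sketch(xs, ys, 0, xs.length));
    }
}
```

No locking is needed because every `map` writes only to its own task's fields; sharing happens solely through `reduce`.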
Distributed Trees
● Overlay a Tree over the data
● Really: Assign a Tree Node to each Row
● Number the Nodes
● Store "Node_ID" per row in a temp Vec
● Make a pass over all Rows
● Nodes not visited in order...
● but all rows, all Nodes efficiently visited
● Do work (e.g. histogram) per Row/Node
Vec nids = v.makeZero();
… nids.set(row,nid)...
Distributed Trees
● An initial Tree
● All rows at nid==0
● MRTask: compute stats
● Use the stats to make a decision...
● (varies by algorithm)!
X  Y     nids
A   1.2  0
B   3.1  0
C  -2.0  0
D   1.1  0
[Figure: Tree with a single node, nid=0; MRTask.sum=3.4]
Distributed Trees
● Next layer in the Tree (and MRTask across rows)
● Each row: decide!
– If "1<Y<1.5" go left else right
● Compute stats per new leaf
● Each pass across all rows builds entire layer
X  Y     nids
A   1.2  1
B   3.1  2
C  -2.0  2
D   1.1  1
[Figure: Tree; nid=0 splits on 1 < Y < 1.5 into nid=1 and nid=2; per-leaf sums shown: 1.1 and 2.3]
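The pass on this slide can be replayed in a few lines of plain Java. The sketch below hard-codes the slide's decision (rows with 1 < Y < 1.5 go left to nid=1, all others go right to nid=2) and sums Y per node; the class and method names are hypothetical, and it assumes every row starts at nid 0.

```java
/** Hypothetical replay of one tree-layer pass over rows held in plain arrays. */
final class LayerPassSketch {
    /**
     * Moves every row at node 0 to child 1 (left) or 2 (right) and
     * returns the sum of Y per node, indexed by node id.
     */
    static double[] splitRoot(double[] y, int[] nids) {
        double[] sums = new double[3];
        for (int r = 0; r < y.length; r++) {
            if (nids[r] == 0)
                nids[r] = (1 < y[r] && y[r] < 1.5) ? 1 : 2;   // the slide's decision
            sums[nids[r]] += y[r];                             // stats for the new leaf
        }
        return sums;
    }
}
```

With the four rows shown (Y = 1.2, 3.1, -2.0, 1.1) the node ids become 1, 2, 2, 1, so nid=1 accumulates 1.2 + 1.1 and nid=2 accumulates 3.1 - 2.0.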
Distributed Trees
● Another MRTask, another layer...
● i.e., a 5-deep tree takes 5 passes
X  Y     nids
A   1.2  3
B   3.1  2
C  -2.0  2
D   1.1  4
[Figure: Tree; nid=0 splits on 1 < Y < 1.5; a further split on Y==1.1 with children nid=3 and nid=4; other labels shown: leaf, sum=1.1, sum=-2., sum=3.1]
Distributed Trees
● Each pass is over one layer in the tree
● Builds per-node histogram in map+reduce calls
class Pass extends MRTask2<Pass> {
  void map( Chunk chks[] ) {
    Chunk nids = chks[...];              // Node-IDs per row
    for( int r=0; r<nids.len; r++ ) {    // All rows
      int nid = (int)nids.at80(r);       // Node-ID THIS row
      // Lazy: not all Chunks see all Nodes
      if( dHisto[nid]==null ) dHisto[nid]=...
      // Accumulate histogram stats per node
      dHisto[nid].accum(chks,r);
    }
  }
}
new Pass().doAll(myDataFrame,nids);
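The slide elides how `dHisto` is allocated and merged. One plausible shape, written as self-contained Java, is below: histograms are created lazily per node inside `map`, and `reduce` adds another task's bins into this one. The fixed bin count, the value range and the name `HistoPassSketch` are assumptions, not H2O's histogram class.

```java
/** Hypothetical per-node histogram pass: lazy allocation in map, roll-up in reduce. */
final class HistoPassSketch {
    static final int NBINS = 4;                    // assumption: 4 equal-width bins
    static final double MIN = -4.0, MAX = 4.0;     // assumption: known value range

    final long[][] histo;                          // histo[nid] stays null until a row for nid is seen

    HistoPassSketch(int numNodes) { histo = new long[numNodes][]; }

    /** map: accumulate one chunk of rows into the per-node histograms. */
    void map(double[] y, int[] nids) {
        for (int r = 0; r < y.length; r++) {
            int nid = nids[r];
            if (histo[nid] == null) histo[nid] = new long[NBINS];   // lazy: not every chunk sees every node
            int bin = (int) ((y[r] - MIN) / (MAX - MIN) * NBINS);
            bin = Math.max(0, Math.min(NBINS - 1, bin));            // clamp out-of-range values
            histo[nid][bin]++;
        }
    }

    /** reduce: fold another task's histograms into this one. */
    void reduce(HistoPassSketch that) {
        for (int n = 0; n < histo.length; n++) {
            if (that.histo[n] == null) continue;
            if (histo[n] == null) { histo[n] = that.histo[n].clone(); continue; }
            for (int b = 0; b < NBINS; b++) histo[n][b] += that.histo[n][b];
        }
    }
}
```

The reduce step is what makes the histograms grow with the number of nodes in a level, since every task may carry one histogram per node it touched.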
Distributed Trees
● Each pass analyzes one Tree level
● Then decide how to build next level
● Reassign Rows to new levels in another pass
– (actually merge the two passes)
● Builds a Histogram-per-Node
● Which requires a reduce() call to roll up
● All Histograms for one level done in parallel
Distributed Trees: utilities
● “score+build” in one pass:
● Test each row against decision from prior pass
● Assign to a new leaf
● Build histogram on that leaf
● “score”: just walk the tree, and get results
● “compress”: Tree from POJO to byte[]
● Easily 10x smaller, can still walk, score, print
● Plus utilities to walk, print, display
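To show what "compress" and "can still walk, score" might look like, the sketch below flattens a small object tree into a `byte[]` and scores a row by walking the bytes directly. The byte layout, the float precision and the class `CompressedTreeSketch` are invented for this example; the slides do not describe H2O's actual format.

```java
import java.nio.ByteBuffer;

/** Hypothetical tree compressor: POJO nodes in, byte[] out, scoring straight from the bytes. */
final class CompressedTreeSketch {
    /** POJO tree node; a leaf when left == null. */
    static final class Node {
        final int col; final double split, value; final Node left, right;
        Node(double value) { this(-1, 0, value, null, null); }
        Node(int col, double split, Node left, Node right) { this(col, split, 0, left, right); }
        private Node(int col, double split, double value, Node left, Node right) {
            this.col = col; this.split = split; this.value = value; this.left = left; this.right = right;
        }
    }

    /**
     * Assumed layout: leaf  = [0][float value]                            (5 bytes)
     *                 split = [1][byte col][float split][int leftSize]    (10 bytes) + left + right.
     * Column numbers must fit in a signed byte (0..127).
     */
    static byte[] compress(Node n) {
        if (n.left == null)
            return ByteBuffer.allocate(5).put((byte) 0).putFloat((float) n.value).array();
        byte[] l = compress(n.left), r = compress(n.right);
        return ByteBuffer.allocate(10 + l.length + r.length)
                .put((byte) 1).put((byte) n.col).putFloat((float) n.split)
                .putInt(l.length).put(l).put(r).array();
    }

    /** Walks the bytes; no Node objects are rebuilt. Rows with row[col] < split go left. */
    static double score(byte[] tree, double[] row) {
        ByteBuffer bb = ByteBuffer.wrap(tree);
        int pos = 0;
        while (bb.get(pos) == 1) {
            int col = bb.get(pos + 1);
            float split = bb.getFloat(pos + 2);
            int leftSize = bb.getInt(pos + 6);
            pos += 10;                                  // start of the left subtree
            if (!(row[col] < split)) pos += leftSize;   // going right: skip over the left subtree
        }
        return bb.getFloat(pos + 1);
    }
}
```

Storing the size of the left subtree in each split is what lets the walker jump to the right child without decoding anything in between.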
GBM on Distributed Trees
● GBM builds 1 Tree, 1 level at a time, but...
● We run the entire level in parallel & distributed
● Built breadth-first because it's "free"
● More data offset by more CPUs
● Classic GBM otherwise
● Build residuals tree-by-tree
● Tuning knobs: trees, depth, shrinkage, min_rows
● Pure Java
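"Build residuals tree-by-tree" and the shrinkage and min_rows knobs can be seen in a deliberately tiny, single-threaded boosting loop. The sketch assumes squared-error loss, a single numeric feature and trees fixed at depth 1 (stumps), and it searches splits exhaustively rather than from histograms, so it illustrates the classic algorithm only, not H2O's distributed implementation. All names are hypothetical.

```java
import java.util.Arrays;

/** Hypothetical minimal GBM: squared error, one feature, depth-1 trees. */
final class GbmSketch {
    /** A depth-1 regression tree on the single feature. */
    static final class Stump {
        final double split, left, right;
        Stump(double split, double left, double right) {
            this.split = split; this.left = left; this.right = right;
        }
        double predict(double x) { return x < split ? left : right; }
    }

    /** Fits the stump with the lowest squared error against residuals r; null if no split has minRows per side. */
    static Stump fit(double[] x, double[] r, int minRows) {
        Stump best = null;
        double bestErr = Double.POSITIVE_INFINITY;
        for (double s : x) {                                   // candidate splits: every observed x
            double sl = 0, sr = 0;
            int nl = 0, nr = 0;
            for (int i = 0; i < x.length; i++)
                if (x[i] < s) { sl += r[i]; nl++; } else { sr += r[i]; nr++; }
            if (nl < minRows || nr < minRows) continue;
            double ml = sl / nl, mr = sr / nr, err = 0;
            for (int i = 0; i < x.length; i++) {
                double d = r[i] - (x[i] < s ? ml : mr);
                err += d * d;
            }
            if (err < bestErr) { bestErr = err; best = new Stump(s, ml, mr); }
        }
        return best;
    }

    /** Returns the training-set predictions after up to `trees` boosting rounds. */
    static double[] train(double[] x, double[] y, int trees, double shrinkage, int minRows) {
        double mean = 0;
        for (double v : y) mean += v / y.length;
        double[] f = new double[y.length];
        Arrays.fill(f, mean);                                  // initial model: the mean of y
        double[] r = new double[y.length];
        for (int t = 0; t < trees; t++) {
            for (int i = 0; i < y.length; i++) r[i] = y[i] - f[i];   // residuals of the current model
            Stump s = fit(x, r, minRows);
            if (s == null) break;
            for (int i = 0; i < y.length; i++) f[i] += shrinkage * s.predict(x[i]);
        }
        return f;
    }
}
```

Each round fits the next tree to what the previous trees got wrong, and shrinkage scales down how much of that correction is applied.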
GBM on Distributed Trees
● Limiting factor: latency in turning over a level
● About 4x faster than R single-node on covtype
● Does the per-level compute in parallel
● Requires sending histograms over network
– Can get big for very deep trees
Summary: Write (parallel) Java
● Most simple Java “just works”
● Fast: parallel distributed reads, writes, appends
● Reads same speed as plain Java array loads
● Writes, appends: slightly slower (compression)
● Typically memory bandwidth limited
– (may be CPU limited in a few cases)
● Slower: conflicting writes (but follows strict JMM)
● Also supports transactional updates
Summary: Writing Analytics
● We're writing Big Data Analytics
● Generalized Linear Modeling (ADMM, GLMNET)
– Logistic Regression, Poisson, Gamma
● Random Forest, GBM, KMeans++, KNN
● State-of-the-art Algorithms, running Distributed
● Solidly working on 100G datasets
● Heading for Tera Scale
● Paying customers (in production!)
● Come write your own (distributed) algorithm!!!
Cool Systems Stuff...
● … that I ran out of space for
● Reliable UDP, integrated w/RPC
● TCP is reliably UNReliable
● Already have a reliable UDP framework, so no prob
● Fork/Join Goodies:
● Priority Queues
● Distributed F/J
● Surviving fork bombs & lost threads
● K/V does JMM via hardware-like MESI protocol
The Platform
[Figure: JVM 1 and JVM 2, each with the same stack, linked by K/V get/put and UDP / TCP]
Stack labels per JVM, in the order shown:
● NFS, HDFS
● byte[]
● extends Iced
● extends DTask
● AutoBuffer
● RPC
● extends DRemoteTask: D/F/J
● extends MRTask: User code?
Other Simple Examples
● Filter & Count (underage males):
● (can pass in any number of Vecs or a Frame)
long count = new MRTask() {
  long map( long age, long sex ) {
    return (age<=17 && sex==MALE) ? 1 : 0;
  }
  long reduce( long d1, long d2 ) {
    return d1+d2;
  }
}.doAll( vecAge, vecSex );
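A single-JVM analogue of this filter-and-count, using only the standard library, is sketched below. `FilterCountSketch` and the coding of `MALE` as 1 are assumptions.

```java
import java.util.stream.IntStream;

/** Hypothetical single-JVM analogue of the filter-and-count task. */
final class FilterCountSketch {
    static final long MALE = 1;   // assumption: the sex column codes male as 1

    /** Counts rows with age <= 17 and sex == MALE, scanning rows in parallel. */
    static long countUnderageMales(long[] age, long[] sex) {
        return IntStream.range(0, age.length)
                        .parallel()
                        .filter(i -> age[i] <= 17 && sex[i] == MALE)   // map: keep or drop
                        .count();                                       // reduce: add up the keeps
    }
}
```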
Other Simple Examples
● Filter into new set (underage males):
● Can write or append subset of rows
– (append order is preserved)
class Filter extends MRTask {
  void map(Chunk CRisk, Chunk CAge, Chunk CSex) {
    for( int i=0; i<CAge.len; i++ )
      if( CAge.at(i)<=17 && CSex.at(i)==MALE )
        CRisk.append(CAge.at(i)); // build a set
  }
};
Vec risk = new AppendableVec();
new Filter().doAll( risk, vecAge, vecSex );
...risk... // all the underage males
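The "append order is preserved" property has a close analogue in ordered Java streams, which keep encounter order in `toArray()` even when run in parallel. The sketch is plain JDK code with hypothetical names, not H2O's `AppendableVec`.

```java
import java.util.stream.IntStream;

/** Hypothetical single-JVM analogue of filtering rows into a new, order-preserving vector. */
final class FilterSketch {
    static final double MALE = 1;   // assumption: the sex column codes male as 1

    /** Returns the ages of rows with age <= 17 and sex == MALE, in their original row order. */
    static double[] underageMaleAges(double[] age, double[] sex) {
        return IntStream.range(0, age.length)
                        .parallel()
                        .filter(i -> age[i] <= 17 && sex[i] == MALE)
                        .mapToDouble(i -> age[i])
                        .toArray();   // ordered stream: output follows row order
    }
}
```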
Other Simple Examples
● Group-by: count of car-types by age
class AgeHisto extends MRTask {
  long carAges[][];   // count of cars by age
  void map( Chunk CAge, Chunk CCar ) {
    carAges = new long[numAges][numCars];
    for( int i=0; i<CAge.len; i++ )
      carAges[(int)CAge.at(i)][(int)CCar.at(i)]++;
  }
  void reduce( AgeHisto that ) {
    for( int i=0; i<carAges.length; i++ )
      for( int j=0; j<carAges[i].length; j++ )
        carAges[i][j] += that.carAges[i][j];
  }
}
Setting carAges in map() makes it an output field.
Private per-map call, single-threaded write access.
Must be rolled-up in the reduce call.
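The output-field pattern (a private table per map call, merged afterwards) can be mimicked on one JVM as follows. `AgeHistoSketch` and its helpers are hypothetical; the reduce here returns a fresh table instead of updating in place so that it is safe to use as a parallel stream reduction.

```java
import java.util.stream.IntStream;

/** Hypothetical single-JVM analogue of the group-by: private counts per chunk, merged by reduce. */
final class AgeHistoSketch {
    /** map: each chunk builds its own table, so no two threads write the same array. */
    static long[][] map(int[] age, int[] car, int numAges, int numCars) {
        long[][] counts = new long[numAges][numCars];
        for (int i = 0; i < age.length; i++) counts[age[i]][car[i]]++;
        return counts;
    }

    /** reduce: element-wise sum of two tables, returned as a new table. */
    static long[][] reduce(long[][] a, long[][] b) {
        long[][] out = new long[a.length][];
        for (int i = 0; i < a.length; i++) {
            out[i] = a[i].clone();
            for (int j = 0; j < out[i].length; j++) out[i][j] += b[i][j];
        }
        return out;
    }

    static long[][] doAll(int[][] ageChunks, int[][] carChunks, int numAges, int numCars) {
        return IntStream.range(0, ageChunks.length)
                        .parallel()
                        .mapToObj(c -> map(ageChunks[c], carChunks[c], numAges, numCars))
                        .reduce(new long[numAges][numCars], AgeHistoSketch::reduce);
    }
}
```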
Other Simple Examples
● Uniques
● Uses distributed hash set
class Uniques extends MRTask {
  DNonBlockingHashSet<Long> dnbhs = new ...;
  void map( long id ) { dnbhs.add(id); }
  void reduce( Uniques that ) {
    dnbhs.putAll(that.dnbhs);
  }
};
long uniques = new Uniques().
  doAll( vecVisitors ).dnbhs.size();
Setting dnbhs in <init> makes it an input field.
Shared across all maps(). Often read-only.
This one is written, so needs a reduce.
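On one JVM the shared-input-field idea looks like a concurrent set that every map call writes into. The sketch uses the JDK's `ConcurrentHashMap.newKeySet()` as a stand-in for the slide's `DNonBlockingHashSet`; `UniquesSketch` is a hypothetical name. With a single shared set there is nothing left to merge, whereas the distributed version on the slide still needs `reduce` to combine the sets built on different machines.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical single-JVM analogue of the Uniques task. */
final class UniquesSketch {
    /** Counts distinct ids across all chunks using one thread-safe set. */
    static long countUniques(long[][] chunks) {
        Set<Long> seen = ConcurrentHashMap.newKeySet();   // shared by all map calls
        Arrays.stream(chunks).parallel().forEach(chunk -> {
            for (long id : chunk) seen.add(id);           // concurrent adds are safe on this set
        });
        return seen.size();
    }
}
```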
