[db analytics showcase Sapporo 2018] B33　H2O4GPU and GoAI: harnessing the power of GPUs.

H2O4GPU and GoAI: harnessing the power of GPUs.
Mateusz Dymczyk
Senior Software Engineer
H2O.ai
@mdymczyk

Agenda
• About me
• About H2O.ai
• A bit of history: H2O-3
• Moving forward: feature engineering & Driverless AI
• The need for GPUs
• GPU overview
• Machine Learning + GPUs = why? how?
• About GoAI
• About H2O4GPU
• Q&A

About me
• M.Sc. in Computer Science @ AGH UST in Poland
• Ph.D. dropout (machine learning)
• Previously NLP/ML @ Fujitsu Laboratories, Kanagawa
• Currently Lead/Senior Machine Learning Engineer @
H2O.ai (remotely from Tokyo)
• Conference speaker (Strata Beijing/NY/Singapore,
Hadoop World Tokyo etc.)

About H2O.ai
FOUNDED 2012, SERIES C IN NOV, 2017
PRODUCTS • DRIVERLESS AI – AUTOMATED MACHINE LEARNING
• H2O OPEN SOURCE MACHINE LEARNING
• SPARKLING WATER
• H2O4GPU OS ML GPU LIBRARY
MISSION DEMOCRATIZE AI
TEAM • ~100 EMPLOYEES
• SEVERAL KAGGLE GRANDMASTERS
• DISTRIBUTED SYSTEMS ENGINEERS DOING MACHINE LEARNING
• WORLD-CLASS VISUALIZATION DESIGNERS
OFFICES MOUNTAIN VIEW, LONDON, PRAGUE

Community Adoption
* DATA FROM GOOGLE ANALYTICS EMBEDDED IN THE END USER PRODUCT

Select Customers
Financial, Insurance, Marketing, Telecom, Healthcare, Retail, Advisory & Accounting
“Overall customer satisfaction is very high.” - Gartner

H2O-3 Overview
• Distributed implementations of cutting edge ML algorithms.
• Core algorithms written in high performance Java.
• APIs available in R, Python, Scala, REST/JSON.
• Interactive Web GUI called H2O Flow.
• Easily deploy models to production with H2O Steam.

H2O-3 Distributed Computing
• Multi-node cluster with shared memory model.
• All computations in memory.
• Each node sees only some rows of the data.
• No limit on cluster size.
• Distributed data frames (collection of vectors).
• Columns are distributed (across nodes) arrays.
• Works just like R’s data.frame or Python Pandas DataFrame
H2O Frame
H2O Cluster
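The row-partitioning idea above can be sketched in a single process. This is only an illustration under stated assumptions: the "nodes" are plain Python lists, the helper names `partition_rows` and `distributed_mean` are hypothetical, and no H2O-3 API is used.

```python
# Minimal single-process sketch of a row-partitioned column.
# "Nodes" are simulated with plain lists; no H2O API is used here.

def partition_rows(column, n_nodes):
    """Split one column into contiguous row chunks, one per simulated node."""
    chunk = (len(column) + n_nodes - 1) // n_nodes  # ceiling division
    return [column[i:i + chunk] for i in range(0, len(column), chunk)]

def distributed_mean(chunks):
    """Each node reduces its own rows; only (sum, count) pairs are combined."""
    partials = [(sum(c), len(c)) for c in chunks]   # map: local to each node
    total = sum(s for s, _ in partials)             # reduce: combine partials
    count = sum(n for _, n in partials)
    return total / count

ages = [23, 35, 41, 29, 52, 47, 31, 38]             # made-up column values
chunks = partition_rows(ages, n_nodes=3)
mean_age = distributed_mean(chunks)
```

Because each node only ever touches its own rows, the amount of data exchanged between nodes is a pair of numbers per node, regardless of how many rows the column has.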

H2O-3 Algorithms
Supervised Learning
Statistical Analysis
• Generalized Linear Models: Binomial, Gaussian, Gamma, Poisson and Tweedie
• Naïve Bayes
Ensembles
• Distributed Random Forest: Classification or regression models
• Gradient Boosting Machine: Produces an ensemble of decision trees with increasingly refined approximations
Deep Neural Networks
• Deep learning: Creates multi-layer feed forward neural networks starting with an input layer followed by multiple layers of nonlinear transformations
Unsupervised Learning
Clustering
• K-means: Partitions observations into k clusters/groups of the same spatial size. Automatically detects optimal k
Dimensionality Reduction
• Principal Component Analysis: Linearly transforms correlated variables to independent components
• Generalized Low Rank Models: Extend the idea of PCA to handle arbitrary data consisting of numerical, Boolean, categorical, and missing data
Anomaly Detection
• Autoencoders: Find outliers using nonlinear dimensionality reduction with deep learning

DriverlessAI & Feature Engineering

The Need for Automation
“The United States alone faces a shortage of 140,000 to
190,000 people with analytical expertise and 1.5 million
managers and analysts”
–McKinsey Prediction for 2018

Recipe for Success
Auto Feature Generation: Kaggle Grand Master Out of the Box
• Automatic Text Handling
• Frequency Encoding
• Cross Validation Target Encoding
• Truncated SVD
• Clustering and more
[Figure: Feature Transformations, Original Features -> Generated Features]
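Two of the transformations listed above, frequency encoding and cross validation target encoding, can be sketched in a few lines. This is a minimal illustration and not Driverless AI's implementation: the function names, the simple index-based fold assignment, and the toy data are all assumptions.

```python
from collections import Counter

def frequency_encode(values):
    """Replace each category with its relative frequency in the column."""
    counts = Counter(values)
    n = len(values)
    return [counts[v] / n for v in values]

def cv_target_encode(values, targets, n_folds=2):
    """Out-of-fold target encoding: each row is encoded with the mean target
    of its category computed on the other folds only, to limit target leakage."""
    prior = sum(targets) / len(targets)   # fallback for categories unseen in training
    encoded = [0.0] * len(values)
    for fold in range(n_folds):
        sums, counts = Counter(), Counter()
        for i, (v, t) in enumerate(zip(values, targets)):
            if i % n_folds != fold:       # training part for this fold
                sums[v] += t
                counts[v] += 1
        for i, v in enumerate(values):
            if i % n_folds == fold:       # held-out part for this fold
                encoded[i] = sums[v] / counts[v] if counts[v] else prior
    return encoded

# Made-up toy data for illustration.
publisher = ["ESPN", "NBC", "ESPN", "ESPN", "NBC", "CNN"]
clicked = [1, 0, 1, 0, 0, 1]
freq = frequency_encode(publisher)
te = cv_target_encode(publisher, clicked)
```

Encoding each row only from the other folds is what distinguishes this from plain target encoding, where a row's own label would leak into its feature.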

Recipe for Success
Driverless AI
AI to do AI

3 Pillars
Speed Accuracy Interpretability

Moore’s Law
[Figure: 40 Years of Microprocessor Trend Data, 1980-2020, log scale from 10^2 to 10^7. Series: Transistors (thousands); Single-threaded perf, 1.5X per year and later 1.1X per year.]
Original data up to the year 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten. New plot and data collected for 2010-2015 by K. Rupp.

GPU
[Figure: performance trends, 1980-2020, log scale from 10^2 to 10^7. Series: GPU-Computing perf, 1.5X per year, 1000X by 2025; Single-threaded perf, 1.5X per year and later 1.1X per year.]
Original data up to the year 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten. New plot and data collected for 2010-2015 by K. Rupp.
Stack: APPLICATIONS, SYSTEMS, ALGORITHMS, CUDA, ARCHITECTURE

GPU Shortcomings
[Figure: GPU memory hierarchy. The GPU has Global Memory; groups of threads each have Shared memory; each Thread has Local memory. The CPU has Host Memory.]
• CPU copies data from host to GPU memory via PCI-E
• CPU launches kernels
• SLOW!!!

GPU Open Analytics Initiative (GOAI)
github.com/gpuopenanalytics
GPU Data Frame (GDF)
Ingest/Parse -> Exploratory Analysis -> Feature Engineering -> ML/DL Algorithms -> Grid Search -> Scoring -> Model Export

GPU architecture
Low latency vs High throughput
GPU
• Optimized for data-parallel,
throughput computation
• Architecture tolerant of
memory latency
• More transistors dedicated to
computation
CPU
• Optimized for low-latency
access to cached data sets
• Control logic for out-of-order
and speculative execution

GPU Enhanced Applications
Application Code
• GPU: Use GPU to Parallelize Compute-Intensive Functions
• CPU: Rest of Sequential CPU Code

Machine Learning and GPUs
$$
\underbrace{A}_{m \times k} \; \underbrace{B}_{k \times n} = \underbrace{C}_{m \times n}
$$

Matrix Multiplication
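$$
\underbrace{\begin{bmatrix}
a_{1,1} & a_{1,2} & a_{1,3} & \dots & a_{1,k} \\
a_{2,1} & a_{2,2} & a_{2,3} & \dots & a_{2,k} \\
a_{3,1} & a_{3,2} & a_{3,3} & \dots & a_{3,k} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
a_{m,1} & a_{m,2} & a_{m,3} & \dots & a_{m,k}
\end{bmatrix}}_{A}
\underbrace{\begin{bmatrix}
b_{1,1} & b_{1,2} & b_{1,3} & \dots & b_{1,n} \\
b_{2,1} & b_{2,2} & b_{2,3} & \dots & b_{2,n} \\
b_{3,1} & b_{3,2} & b_{3,3} & \dots & b_{3,n} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
b_{k,1} & b_{k,2} & b_{k,3} & \dots & b_{k,n}
\end{bmatrix}}_{B}
=
\underbrace{\begin{bmatrix}
c_{1,1} & c_{1,2} & c_{1,3} & \dots & c_{1,n} \\
c_{2,1} & c_{2,2} & c_{2,3} & \dots & c_{2,n} \\
c_{3,1} & c_{3,2} & c_{3,3} & \dots & c_{3,n} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
c_{m,1} & c_{m,2} & c_{m,3} & \dots & c_{m,n}
\end{bmatrix}}_{C}
$$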

Matrix Multiplication
$$
A B = C, \qquad A \text{ is } m \times k, \quad B \text{ is } k \times n, \quad C \text{ is } m \times n
$$
Elements of C: C[0,0], C[0,1], ..., C[m-1,n-1]
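Each element of C depends only on one row of A and one column of B, which is why the product parallelizes so well. The sketch below makes that independence explicit in plain Python; the function names are illustrative, and a real GPU implementation would hand each (i, j) pair to its own thread (or use a tuned library) rather than loop over them.

```python
def matmul_element(A, B, i, j):
    """C[i][j] is the dot product of row i of A and column j of B."""
    return sum(A[i][p] * B[p][j] for p in range(len(B)))

def matmul(A, B):
    m, k, n = len(A), len(B), len(B[0])
    assert all(len(row) == k for row in A), "inner dimensions must match"
    # Every (i, j) below is independent of all the others; on a GPU each one
    # could be assigned to its own thread.
    return [[matmul_element(A, B, i, j) for j in range(n)] for i in range(m)]

A = [[1, 2, 3],
     [4, 5, 6]]          # 2 x 3
B = [[7, 8],
     [9, 10],
     [11, 12]]           # 3 x 2
C = matmul(A, B)         # 2 x 2
```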

Matrix Operations in ML
Matrix Multiplication!
All black lines are matrix multiplications!

Practical Machine Learning
Machine Learning

H2O4GPU
• Open-Source: https://github.com/h2oai/h2o4gpu
• Collection of important ML algorithms ported to the GPU (with CPU fallback option):
• Gradient Boosted Machines
• GLM
• Truncated SVD
• PCA
• KMeans
• (soon) Field Aware Factorization Machines
• Performance optimized, multi-GPU support (certain algorithms)
• Used within our own Driverless AI Product to boost performance 30X
• Scikit-Learn compatible Python API (and now R API)

H2O4GPU Algorithms
• XGBoost: 10X
• GLM: 5X
• K-means: 40X
• SVD: 5X

Gradient Boosting Machines
• Based upon XGBoost
• Raw floating point data -> Binned into Quantiles
• Quantiles are stored in compressed form instead of as floats
• Compressed Quantiles are efficiently transferred to GPU
• Sparsity is handled directly with high GPU efficiency
• Multi-GPU by sharding rows using NVIDIA NCCL AllReduce
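The binning step can be sketched as follows. This is a simplified illustration, not XGBoost's or H2O4GPU's actual code: the cut points are taken at evenly spaced ranks, the helper names are hypothetical, and "compression" here just means one byte per value, whereas a real implementation may pack bins more tightly.

```python
from bisect import bisect_right

def quantile_cuts(values, n_bins):
    """Pick n_bins - 1 cut points at evenly spaced ranks of the sorted values."""
    ordered = sorted(values)
    return [ordered[(len(ordered) * q) // n_bins] for q in range(1, n_bins)]

def bin_values(values, cuts):
    """Map each float to a small integer bin index (0 .. len(cuts))."""
    return [bisect_right(cuts, v) for v in values]

feature = [0.1, 2.5, 0.7, 9.3, 4.4, 1.2, 7.8, 3.1]  # made-up feature column
cuts = quantile_cuts(feature, n_bins=4)
bins = bin_values(feature, cuts)
packed = bytes(bins)  # one byte per value instead of a 4- or 8-byte float
```

Tree building then only needs the bin index of each value, so the smaller representation is what gets copied to the GPU.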

KMeans
• Significantly faster than Scikit-learn implementation (up to 50x)
• Significantly faster than other GPU implementations (5x-10x)
• Supports kmeans|| initialization
• Supports multiple GPUs by sharding the dataset
• Supports batching data if it exceeds GPU memory
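Sharding works for k-means because one iteration only needs per-cluster sums and counts, which can be computed on each shard separately and then added together. The sketch below shows one plain Lloyd iteration over two simulated devices; it does not implement kmeans|| initialization, the function names are hypothetical, and the shards are ordinary Python lists rather than GPU buffers.

```python
def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    dists = [sum((p - c) ** 2 for p, c in zip(point, cen)) for cen in centroids]
    return dists.index(min(dists))

def shard_partials(shard, centroids):
    """Work done per shard (per device): per-cluster coordinate sums and counts."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for point in shard:
        k = nearest(point, centroids)
        counts[k] += 1
        for d in range(dim):
            sums[k][d] += point[d]
    return sums, counts

def lloyd_step(shards, centroids):
    """Combine the partials from all shards and recompute the centroids."""
    dim = len(centroids[0])
    total_sums = [[0.0] * dim for _ in centroids]
    total_counts = [0] * len(centroids)
    for shard in shards:
        sums, counts = shard_partials(shard, centroids)
        for k in range(len(centroids)):
            total_counts[k] += counts[k]
            for d in range(dim):
                total_sums[k][d] += sums[k][d]
    # Keep the old centroid if a cluster received no points.
    return [
        [s / total_counts[k] for s in total_sums[k]] if total_counts[k] else centroids[k]
        for k in range(len(centroids))
    ]

shards = [
    [(0.0, 0.0), (0.0, 2.0)],      # rows held by simulated device 0
    [(10.0, 10.0), (10.0, 12.0)],  # rows held by simulated device 1
]
centroids = [[1.0, 1.0], [9.0, 9.0]]
new_centroids = lloyd_step(shards, centroids)
```

The same accumulation pattern covers batching: if the data exceeds device memory, each batch can be treated as one more shard whose partials are added before the centroids are updated.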

Truncated SVD & PCA
• Matrix decomposition
• Popular for text processing
and dimensionality reduction
• GPU optimizes linear algebra
operations

Truncated SVD & PCA
• The intrinsic dimensionality of certain datasets is much lower than the
original (e.g. here 4096 vs. actual ~200)
• PCA can reduce the dimensionality and preserve most of the explained
variance at the same time
• Better input for further modeling - takes less time
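The "preserve most of the explained variance" claim can be checked on a toy case. The sketch below is an assumption-laden illustration: it handles only 2-D data, where the covariance matrix has closed-form eigenvalues, and the data points are made up. Real workloads would use an SVD or PCA routine from a linear algebra library.

```python
import math

def explained_variance_ratio_2d(xs, ys):
    """Share of total variance captured by each principal component of 2-D data.
    The 2x2 covariance matrix has closed-form eigenvalues, so no linear
    algebra library is needed for this toy case."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    # Eigenvalues of the symmetric matrix [[sxx, sxy], [sxy, syy]].
    half_trace = (sxx + syy) / 2
    spread = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    lam1, lam2 = half_trace + spread, half_trace - spread
    total = lam1 + lam2
    return lam1 / total, lam2 / total

# Two strongly correlated columns: the data is nearly one-dimensional.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
first, second = explained_variance_ratio_2d(xs, ys)
```

Here the first component carries almost all of the variance, so dropping the second one loses very little information while halving the input width.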

Field Aware Factorization Machines
* under development
• Click Through Rate (CTR):
• One of the most important tasks in computational advertising
• Percentage of users who actually click on ads
• Until recently solved with logistic regression, which is bad at finding feature conjunctions
(it learns the effect of each variable or feature individually)
Clicked   Publisher (P)   Advertiser (A)   Gender (G)
Yes       ESPN            Nike             Male
No        NBC             Adidas           Male

Field Aware Factorization Machines
* under development
• Separates the data into fields (Publisher, Advertiser, Gender) and features (ESPN, NBC,
Adidas, Nike, Male, Female)
• Uses a latent space for each pair to generate the model
• Used to win the first prize of three CTR competitions hosted by Criteo, Avazu, Outbrain,
and also the third prize of RecSys Challenge 2015.
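The "latent space for each pair" idea can be sketched for the ESPN / Nike / Male row. In a field-aware factorization machine each feature keeps one latent vector per other field, and a pair of features interacts through the vectors each holds for the other's field. The sketch below follows that standard formulation; the latent values are invented for illustration (not learned), the function names are hypothetical, and this is not the H2O4GPU implementation, which was still under development.

```python
import math
from itertools import combinations

# Hypothetical latent vectors w[(feature, other_field)], 2 dimensions each.
# The numbers are made up for illustration, not learned from data.
w = {
    ("ESPN", "A"): [0.5, 0.1], ("ESPN", "G"): [0.2, 0.3],
    ("Nike", "P"): [0.4, 0.2], ("Nike", "G"): [0.1, 0.6],
    ("Male", "P"): [0.3, 0.3], ("Male", "A"): [0.2, 0.5],
}

def ffm_score(row):
    """row maps field -> active feature (one-hot, so every x_j equals 1).
    For each pair of features, take the latent vector each one keeps for the
    other feature's field, and sum the dot products."""
    score = 0.0
    for (f1, j1), (f2, j2) in combinations(sorted(row.items()), 2):
        v1, v2 = w[(j1, f2)], w[(j2, f1)]
        score += sum(a * b for a, b in zip(v1, v2))
    return score

def click_probability(row):
    """Squash the pairwise score into a probability with the logistic function."""
    return 1.0 / (1.0 + math.exp(-ffm_score(row)))

row = {"P": "ESPN", "A": "Nike", "G": "Male"}
score = ffm_score(row)
prob = click_probability(row)
```

Unlike plain logistic regression, every term in the score is a feature conjunction, such as ESPN together with Nike, which is the kind of interaction the previous slide says logistic regression struggles to find.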

More info
• Documentation: http://docs.h2o.ai
• Online Training: http://learn.h2o.ai
• Tutorials: https://github.com/h2oai/h2o-tutorials
• Slidedecks: https://github.com/h2oai/h2o-meetups
• Video Presentations: https://www.youtube.com/user/0xdata
• Events & Meetups: http://h2o.ai/events
• Code: http://github.com/h2oai/
• Questions:
• https://stackoverflow.com/questions/tagged/h2o4gpu
• https://gitter.im/h2oai/{h2o-3,h2o4gpu}

Thank you!
@mdymczyk
mateusz@h2o.ai
