H2O4GPU and GoAI: harnessing the power of GPUs.
Mateusz Dymczyk
Senior Software Engineer
H2O.ai
@mdymczyk
Agenda
• About me
• About H2O.ai
• A bit of history: H2O-3
• Moving forward: feature engineering & Driverless AI
• The need for GPUs
• GPU overview
• Machine Learning + GPUs = why? how?
• About GoAI
• About H2O4GPU
• Q&A
About me
• M.Sc. in Computer Science @ AGH UST in Poland
• Ph.D. dropout (machine learning)
• Previously NLP/ML @ Fujitsu Laboratories, Kanagawa
• Currently Lead/Senior Machine Learning Engineer @
H2O.ai (remotely from Tokyo)
• Conference speaker (Strata Beijing/NY/Singapore,
Hadoop World Tokyo etc.)
About H2O.ai
FOUNDED: 2012, SERIES C IN NOV 2017
PRODUCTS:
• DRIVERLESS AI – AUTOMATED MACHINE LEARNING
• H2O OPEN SOURCE MACHINE LEARNING
• SPARKLING WATER
• H2O4GPU OS ML GPU LIBRARY
MISSION: DEMOCRATIZE AI
TEAM:
• ~100 EMPLOYEES
• SEVERAL KAGGLE GRANDMASTERS
• DISTRIBUTED SYSTEMS ENGINEERS DOING MACHINE LEARNING
• WORLD-CLASS VISUALIZATION DESIGNERS
OFFICES: MOUNTAIN VIEW, LONDON, PRAGUE
Community Adoption
* DATA FROM GOOGLE ANALYTICS EMBEDDED IN THE END USER PRODUCT
Select Customers
Financial, Insurance, Marketing, Telecom, Healthcare, Retail, Advisory & Accounting
“Overall customer satisfaction is very high.” - Gartner
A bit of history: H2O-3
H2O-3 Overview
• Distributed implementations of cutting edge ML algorithms.
• Core algorithms written in high performance Java.
• APIs available in R, Python, Scala, REST/JSON.
• Interactive Web GUI called H2O Flow.
• Easily deploy models to production with H2O Steam.
H2O-3 Distributed Computing
• Multi-node cluster with shared memory model.
• All computations in memory.
• Each node sees only some rows of the data.
• No limit on cluster size.
• Distributed data frames (collection of vectors).
• Columns are distributed (across nodes) arrays.
• Works just like R’s data.frame or Python Pandas DataFrame
H2O Frame
H2O Cluster
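The row-wise distribution described above can be illustrated with a minimal sketch. It is not H2O-3 code (the core is written in Java); the `Chunk` class and both function names are hypothetical, and the "nodes" are plain Python objects. Each node summarises only its own rows, and the partial results are combined into a global answer.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Chunk:
    """Rows of a frame held by one (hypothetical) node, stored column-wise."""
    columns: Dict[str, List[float]]


def partial_sum_count(chunk: Chunk, column: str) -> Tuple[float, int]:
    """Map step: a node summarises only the rows it holds."""
    values = chunk.columns[column]
    return sum(values), len(values)


def distributed_mean(chunks: List[Chunk], column: str) -> float:
    """Reduce step: combine per-node partial results into a global mean."""
    total, count = 0.0, 0
    for chunk in chunks:
        s, n = partial_sum_count(chunk, column)
        total += s
        count += n
    return total / count


# Three "nodes", each seeing only some rows of the same two columns.
cluster = [
    Chunk({"age": [30.0, 40.0], "income": [50.0, 60.0]}),
    Chunk({"age": [50.0], "income": [70.0]}),
    Chunk({"age": [20.0, 60.0], "income": [40.0, 80.0]}),
]
mean_age = distributed_mean(cluster, "age")
```

Only the small `(sum, count)` pairs travel between nodes, not the rows themselves, which is why this pattern scales with cluster size.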
H2O-3 Algorithms
Supervised Learning
Statistical Analysis
• Generalized Linear Models: Binomial, Gaussian, Gamma, Poisson and Tweedie
• Naïve Bayes
Ensembles
• Distributed Random Forest: Classification or regression models
• Gradient Boosting Machine: Produces an ensemble of decision trees with increasingly refined approximations
Deep Neural Networks
• Deep learning: Creates multi-layer feed-forward neural networks starting with an input layer followed by multiple layers of nonlinear transformations
Unsupervised Learning
Clustering
• K-means: Partitions observations into k clusters/groups of the same spatial size. Can automatically detect the optimal k
Dimensionality Reduction
• Principal Component Analysis: Linearly transforms correlated variables to uncorrelated components
• Generalized Low Rank Models: Extend the idea of PCA to handle arbitrary data consisting of numerical, Boolean, categorical, and missing data
Anomaly Detection
• Autoencoders: Find outliers using nonlinear dimensionality reduction with deep learning
Driverless AI & Feature Engineering
The Need for Automation
“The United States alone faces a shortage of 140,000 to
190,000 people with analytical expertise and 1.5 million
managers and analysts”
–McKinsey Prediction for 2018
Recipe for Success
Auto Feature Generation
Kaggle Grand Master Out of the Box
• Automatic Text Handling
• Frequency Encoding
• Cross Validation Target Encoding
• Truncated SVD
• Clustering and more
[Diagram: Original Features → Feature Transformations → Generated Features]
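Two of the transformations listed above, frequency encoding and cross-validation target encoding, can be sketched in a few lines. This is only an illustration of the general techniques, not Driverless AI's implementation, which the slides do not describe. The function names, the interleaved fold assignment, and the fallback to the out-of-fold mean for unseen categories are all assumptions.

```python
from collections import Counter, defaultdict
from typing import List, Sequence


def frequency_encode(values: Sequence[str]) -> List[float]:
    """Replace each category by the fraction of rows in which it occurs."""
    counts = Counter(values)
    return [counts[v] / len(values) for v in values]


def cv_target_encode(values: Sequence[str], targets: Sequence[float],
                     n_folds: int = 2) -> List[float]:
    """Replace each category by its mean target computed on the OTHER folds.

    Using only out-of-fold rows keeps a row's own target out of its encoding.
    Folds are assigned by row index modulo n_folds (an assumption of this sketch).
    """
    n = len(values)
    if n_folds < 2 or n < n_folds:
        raise ValueError("need at least 2 folds and at least one row per fold")
    encoded = [0.0] * n
    for fold in range(n_folds):
        train = [i for i in range(n) if i % n_folds != fold]
        # Fallback for categories never seen in the training folds.
        fallback = sum(targets[i] for i in train) / len(train)
        sums, counts = defaultdict(float), defaultdict(int)
        for i in train:
            sums[values[i]] += targets[i]
            counts[values[i]] += 1
        for i in range(fold, n, n_folds):  # rows held out in this fold
            v = values[i]
            encoded[i] = sums[v] / counts[v] if counts[v] else fallback
    return encoded


publishers = ["ESPN", "ESPN", "NBC", "NBC", "ESPN", "NBC"]
clicked = [1, 0, 1, 1, 1, 0]
publisher_frequency = frequency_encode(publishers)
publisher_target = cv_target_encode(publishers, clicked, n_folds=2)
```

Both functions turn a categorical column into a numeric one that a tree or linear model can consume directly.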
Driverless AI
AI to do AI
3 Pillars
Speed, Accuracy, Interpretability
The need for GPUs
Moore’s Law
[Figure: 40 Years of Microprocessor Trend Data, 1980-2020, log scale from 10^2 to 10^7. Series: transistors (thousands) and single-threaded performance, annotated with growth of 1.5X per year and 1.1X per year. Original data up to the year 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten. New plot and data collected for 2010-2015 by K. Rupp.]
GPU
[Figure: microprocessor trend plot, 1980-2020, log scale from 10^2 to 10^7. GPU-computing performance: 1.5X per year, 1000X by 2025. Single-threaded performance: annotated 1.5X per year and 1.1X per year. Labeled layers: Applications, Systems, Algorithms, CUDA, Architecture. Original data up to the year 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten. New plot and data collected for 2010-2015 by K. Rupp.]
GoAI
GPU Shortcomings
[Diagram: a GPU with global memory and several groups of threads; each thread has its own local memory and each group has shared memory. A CPU with host memory sits beside it.]
• CPU copies data from host to GPU memory via PCI-E
• CPU launches kernels
• SLOW!!!
GPU Open Analytics Initiative (GOAI)
github.com/gpuopenanalytics
GPU Data Frame (GDF)
Ingest/Parse → Exploratory Analysis → Feature Engineering → ML/DL Algorithms → Grid Search → Scoring → Model Export
GOAI Data Flow
GPU Overview
GPU architecture
Low latency vs High throughput
GPU
• Optimized for data-parallel,
throughput computation
• Architecture tolerant of
memory latency
• More transistors dedicated to
computation
CPU
• Optimized for low-latency
access to cached data sets
• Control logic for out-of-order
and speculative execution
GPU Enhanced Applications
[Diagram: application code split between GPU and CPU]
• GPU: use GPU to parallelize compute-intensive functions
• CPU: rest of sequential CPU code
Machine Learning on GPU
Machine Learning and GPUs
\[
\underbrace{A}_{m \times k} \; \underbrace{B}_{k \times n} = \underbrace{C}_{m \times n}
\]
Matrix Multiplication
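\[
\underbrace{\begin{bmatrix}
a_{1,1} & a_{1,2} & a_{1,3} & \dots & a_{1,k} \\
a_{2,1} & a_{2,2} & a_{2,3} & \dots & a_{2,k} \\
a_{3,1} & a_{3,2} & a_{3,3} & \dots & a_{3,k} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
a_{m,1} & a_{m,2} & a_{m,3} & \dots & a_{m,k}
\end{bmatrix}}_{A}
\underbrace{\begin{bmatrix}
b_{1,1} & b_{1,2} & b_{1,3} & \dots & b_{1,n} \\
b_{2,1} & b_{2,2} & b_{2,3} & \dots & b_{2,n} \\
b_{3,1} & b_{3,2} & b_{3,3} & \dots & b_{3,n} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
b_{k,1} & b_{k,2} & b_{k,3} & \dots & b_{k,n}
\end{bmatrix}}_{B}
=
\underbrace{\begin{bmatrix}
c_{1,1} & c_{1,2} & c_{1,3} & \dots & c_{1,n} \\
c_{2,1} & c_{2,2} & c_{2,3} & \dots & c_{2,n} \\
c_{3,1} & c_{3,2} & c_{3,3} & \dots & c_{3,n} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
c_{m,1} & c_{m,2} & c_{m,3} & \dots & c_{m,n}
\end{bmatrix}}_{C}
\]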
C[0,0]  C[0,1]  ...  C[m-1,n-1]
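Each entry of C depends only on one row of A and one column of B, so all m × n entries can be computed independently of each other. The sketch below makes that explicit in plain Python: `output_element` (a hypothetical name) stands in for the work a single GPU thread could do, and the nested comprehension plays the role of launching one such unit of work per output cell. It runs sequentially on the CPU and is meant only to show the structure, not to be fast.

```python
from typing import List

Matrix = List[List[float]]


def output_element(a: Matrix, b: Matrix, i: int, j: int) -> float:
    """One independent unit of work: dot product of row i of A and column j of B."""
    return sum(a[i][p] * b[p][j] for p in range(len(b)))


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Compute C = A B for A of shape m x k and B of shape k x n."""
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions of A and B must match")
    # No cell of C reads any other cell of C, so these could all run in parallel.
    return [[output_element(a, b, i, j) for j in range(n)] for i in range(m)]


a = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]        # 2 x 3
b = [[7.0, 8.0],
     [9.0, 10.0],
     [11.0, 12.0]]           # 3 x 2
c = matmul(a, b)             # 2 x 2: [[58.0, 64.0], [139.0, 154.0]]
```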
Matrix Operations in ML
Matrix Multiplication!
All black lines are matrix multiplications!
H2O4GPU
Practical Machine Learning
Machine Learning
H2O4GPU
• Open-Source: https://github.com/h2oai/h2o4gpu
• Collection of important ML algorithms ported to the GPU (with CPU fallback option):
  • Gradient Boosted Machines
  • GLM
  • Truncated SVD
  • PCA
  • KMeans
  • (soon) Field Aware Factorization Machines
• Performance optimized, multi-GPU support (certain algorithms)
• Used within our own Driverless AI product to boost performance 30X
• Scikit-Learn compatible Python API (and now R API)
H2O4GPU Algorithms
• XGBoost: 10X
• GLM: 5X
• K-means: 40X
• SVD: 5X
Gradient Boosting Machines
• Based upon XGBoost
• Raw floating point data is binned into quantiles
• Quantiles are stored in compressed form instead of as floats
• Compressed quantiles are efficiently transferred to the GPU
• Sparsity is handled directly with high GPU efficiency
• Multi-GPU by sharding rows, using NVIDIA NCCL AllReduce
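The binning step can be illustrated with a small sketch: each raw float is replaced by the index of the quantile bin it falls into, and the indices are stored as single bytes. This shows the general idea only. XGBoost's actual quantile sketching and bit-packing are more elaborate and are not reproduced here; the function names and the simple rank-based choice of cut points are assumptions.

```python
from array import array
from bisect import bisect_right
from typing import List, Sequence


def quantile_cuts(values: Sequence[float], n_bins: int) -> List[float]:
    """Pick n_bins - 1 cut points at evenly spaced ranks of the sorted values."""
    ordered = sorted(values)
    return [ordered[(len(ordered) * q) // n_bins] for q in range(1, n_bins)]


def bin_feature(values: Sequence[float], cuts: Sequence[float]) -> array:
    """Replace each float by its bin index, stored as one unsigned byte.

    Works for up to 256 bins; a float64 column would need 8 bytes per value.
    """
    return array("B", (bisect_right(cuts, v) for v in values))


feature = [0.1, 2.5, 0.7, 9.0, 4.2, 3.3, 7.7, 1.0]
cuts = quantile_cuts(feature, n_bins=4)   # [1.0, 3.3, 7.7]
bins = bin_feature(feature, cuts)         # bin indices 0..3, two values per bin
```

With 4 bins the eight values shrink from 64 bytes of doubles to 8 bytes of indices, which is the kind of saving that makes the host-to-GPU transfer cheaper. Tree splits are then searched over bin boundaries rather than raw values.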
[db analytics showcase Sapporo 2018] B33 H2O4GPU and GoAI: harnessing the power of GPUs.
KMeans
• Significantly faster than Scikit-learn implementation (up to 50x)
• Significantly faster than other GPU implementations (5x-10x)
• Supports kmeans|| initialization
• Supports multiple GPUs by sharding the dataset
• Supports batching data if it exceeds GPU memory
12 with kmeans||
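Sharding and batching both rely on the same property of a k-means iteration: the per-cluster sums and counts can be accumulated piece by piece and combined at the end. A minimal sketch of one iteration under that scheme follows. It is plain CPU Python, not the H2O4GPU implementation; the function names are hypothetical, and the kmeans|| initialization is not implemented (the starting centroids are hand-picked).

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def nearest(point: Point, centroids: Sequence[Point]) -> int:
    """Index of the centroid with the smallest squared Euclidean distance."""
    return min(
        range(len(centroids)),
        key=lambda c: (point[0] - centroids[c][0]) ** 2
        + (point[1] - centroids[c][1]) ** 2,
    )


def kmeans_step(batches: Sequence[Sequence[Point]],
                centroids: Sequence[Point]) -> List[Point]:
    """One Lloyd iteration that only ever looks at one batch at a time."""
    k = len(centroids)
    sums = [[0.0, 0.0] for _ in range(k)]
    counts = [0] * k
    for batch in batches:  # a batch stands in for one GPU shard or one in-memory chunk
        for p in batch:
            c = nearest(p, centroids)
            sums[c][0] += p[0]
            sums[c][1] += p[1]
            counts[c] += 1
    # Empty clusters keep their previous centroid.
    return [
        (sums[c][0] / counts[c], sums[c][1] / counts[c]) if counts[c] else centroids[c]
        for c in range(k)
    ]


batches = [
    [(0.0, 0.0), (10.0, 10.0), (0.0, 2.0)],
    [(12.0, 10.0), (2.0, 0.0)],
    [(2.0, 2.0), (10.0, 12.0), (12.0, 12.0)],
]
start = [(0.0, 0.0), (10.0, 10.0)]
updated = kmeans_step(batches, start)  # [(1.0, 1.0), (11.0, 11.0)]
```

In a multi-GPU setting each device would hold its own `sums` and `counts`, and only those small arrays would need to be combined between iterations.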
Truncated SVD & PCA
• Matrix decomposition
• Popular for text processing
and dimensionality reduction
• GPU accelerates linear algebra
operations
Truncated SVD & PCA
• The intrinsic dimensionality of certain datasets is much lower than the
original (e.g. here 4096 vs. actual ~200)
• PCA can reduce the dimensionality and preserve most of the explained
variance at the same time
• Better input for further modeling - takes less time
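The notion of explained variance can be made concrete on a toy case. The sketch below handles only two-dimensional data, where the eigenvalues of the 2 × 2 covariance matrix have a closed form, so no linear algebra library is needed. The function names are hypothetical, and a real PCA on thousands of dimensions would use an SVD or eigendecomposition routine instead.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def covariance_2d(points: Sequence[Point]) -> Tuple[float, float, float]:
    """Sample variances of x and y and their covariance."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return sxx, syy, sxy


def explained_variance_ratio(points: Sequence[Point]) -> List[float]:
    """Share of total variance carried by each of the two principal components."""
    sxx, syy, sxy = covariance_2d(points)
    # Eigenvalues of the symmetric matrix [[sxx, sxy], [sxy, syy]].
    mean = (sxx + syy) / 2
    radius = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    total = sxx + syy
    return [(mean + radius) / total, (mean - radius) / total]


# Two recorded dimensions, but the points lie on a line: one intrinsic dimension.
on_a_line = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
ratios = explained_variance_ratio(on_a_line)
```

For points lying exactly on a line, the first component carries essentially all of the variance, so the second dimension can be dropped without losing information. The same reasoning, applied to many dimensions, is what lets PCA shrink a dataset before further modeling.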
Field Aware Factorization Machines
* under development
• Click Through Rate (CTR) prediction:
  • One of the most important tasks in computational advertising
  • CTR is the percentage of users who actually click on ads
  • Until recently solved with logistic regression, which is bad at finding feature conjunctions (it learns the effect of each variable or feature individually)
Clicked   Publisher (P)   Advertiser (A)   Gender (G)
Yes       ESPN            Nike             Male
No        NBC             Adidas           Male
Field Aware Factorization Machines
* under development
• Separates the data into fields (Publisher, Advertiser, Gender) and features (ESPN, NBC, Adidas, Nike, Male, Female)
• Uses a separate latent vector for each (feature, field) pair to generate the model
• Used to win the first prize of three CTR competitions hosted by Criteo, Avazu, and Outbrain, and also the third prize of RecSys Challenge 2015.
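The pairwise interaction term of a field-aware factorization machine can be sketched as follows. For every pair of active features, the model takes the latent vector that the first feature keeps for the second feature's field, and the one the second feature keeps for the first feature's field, and adds their dot product to the score. This is a sketch of the published FFM scoring rule only, not the H2O4GPU implementation (which the slides describe as under development). The linear and bias terms are omitted, features are assumed one-hot (value 1), and the latent values are hand-picked rather than trained.

```python
import math
from itertools import combinations
from typing import Dict, List, Tuple

# One active (field, feature) pair per field, as in the CTR table above.
Row = List[Tuple[str, str]]
# latent[(feature, other_field)]: vector `feature` uses against features of `other_field`.
Latent = Dict[Tuple[str, str], List[float]]


def ffm_score(row: Row, latent: Latent) -> float:
    """Sum of field-aware pairwise interactions for one row."""
    score = 0.0
    for (field1, feat1), (field2, feat2) in combinations(row, 2):
        v1 = latent[(feat1, field2)]
        v2 = latent[(feat2, field1)]
        score += sum(x * y for x, y in zip(v1, v2))
    return score


# Hypothetical, untrained latent vectors of dimension 2.
latent = {
    ("ESPN", "Advertiser"): [1.0, 0.0],
    ("ESPN", "Gender"): [0.5, 0.5],
    ("Nike", "Publisher"): [2.0, 1.0],
    ("Nike", "Gender"): [1.0, 1.0],
    ("Male", "Publisher"): [1.0, -1.0],
    ("Male", "Advertiser"): [0.0, 3.0],
}
row = [("Publisher", "ESPN"), ("Advertiser", "Nike"), ("Gender", "Male")]
score = ffm_score(row, latent)                  # 2.0 + 0.0 + 3.0 = 5.0
click_probability = 1.0 / (1.0 + math.exp(-score))
```

Unlike logistic regression, the score here is built entirely from feature conjunctions such as (ESPN, Nike), which is what makes the model suited to CTR data.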
Demo
More info
• Documentation: http://docs.h2o.ai
• Online Training: http://learn.h2o.ai
• Tutorials: https://github.com/h2oai/h2o-tutorials
• Slidedecks: https://github.com/h2oai/h2o-meetups
• Video Presentations: https://www.youtube.com/user/0xdata
• Events & Meetups: http://h2o.ai/events
• Code: http://github.com/h2oai/
• Questions:
  • https://stackoverflow.com/questions/tagged/h2o4gpu
  • https://gitter.im/h2oai/{h2o-3,h2o4gpu}
Thank you!
@mdymczyk
mateusz@h2o.ai
Q&A
