Cognitive Systems / v3.1 / May 28 / © 2018 IBM Corporation
IBM PowerAI LMS and DDL
—
Florin Manaila
Senior IT Architect and Inventor
Cognitive Systems (HPC and Deep Learning)
IBM Systems Hardware Europe
AGENDA
• PowerAI Overview
• Large Model Support (LMS)
• Distributed Deep Learning Library (DDL)
• LMS Demo
New ways of innovation: traditional runtime vs. self-aware runtime

Traditional runtime:
• Decisions made at design time, ad hoc, based on guesses about the future
• No understanding of user goals
• Optimizes for system metrics (utilization)
• Static: Decide, Act; does not improve without user action

Self-aware runtime:
• Decisions made at runtime, evidence-based
• Understands user goals
• Optimizes for application metrics (science accomplished)
• Improves without user action: Observe, Decide, Act
AI Infrastructure Stack

• Transform & Prep Data (ETL)
• Micro-Services / Applications: segment specific (finance, retail, healthcare, automotive)
• AI APIs (e.g., Watson): speech, vision, NLP, sentiment
• In-House APIs
• Machine & Deep Learning Libraries & Frameworks: TensorFlow, Caffe, PyTorch, SparkML, Snap.ML
• Distributed Computing: Spark, MPI
• Data Lake & Data Stores: Hadoop HDFS, NoSQL DBs, parallel file system
• Accelerated Infrastructure

Offering tiers: Vision, Enterprise (L1-L3 support), Base
What’s in the training of deep neural networks?

• Neural network model: billions of parameters, gigabytes
• Computation: iterative gradient-based search, millions of iterations, mainly matrix operations
• Data: millions of images or sentences, terabytes

Goal: search for the best parameters to make the model fit the data.
Workload characteristics: both compute and data intensive!
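The "gigabytes of parameters" figure above follows from simple arithmetic. A minimal sketch (illustrative numbers, not measurements from this deck) sizing one copy of a model's parameters:

```python
def model_bytes(n_params: int, bytes_per_param: int = 4) -> int:
    """Memory for one copy of the parameters (FP32 by default)."""
    return n_params * bytes_per_param

# A hypothetical 1-billion-parameter model stored in FP32:
gib = model_bytes(1_000_000_000) / 2**30
print(f"{gib:.1f} GiB")  # roughly 3.7 GiB for the weights alone
```

Training additionally keeps gradients and optimizer state, so the working set is typically several times the raw weight size, which is what motivates LMS later in the deck.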
Data processing stages for distributed deep learning:

• Training data on storage: NVMe, SSD
• Coordination and data prep: POWER9 CPU, PCIe Gen 4
• GPU computation: GPU, 2nd-generation NVLink
• Parameter data exchange across systems: network, NVLink, GPU memory

Hillery Hunter, IBM, GTC 2018
Balanced system design for seamless data movement: IBM AC922 architecture

• Two POWER9 CPUs connected by a 4-byte X-Bus, each with DDR4 memory (170 GB/s) and coherent access to system memory (2 TB)
• Each CPU attaches to two NVIDIA V100 GPUs over NVLink at 150 GB/s; GPU-to-GPU NVLink also runs at 150 GB/s
• PCIe Gen4 / CAPI 2.0 to a Multi-Host Socket Direct InfiniBand adapter
• PCIe NVMe flash storage
NVLink CPU-GPU: reducing communication time, keeping hardware utilized

§ NVLink reduces communication time and overhead
§ Data moves GPU-to-GPU and memory-to-GPU faster, for shorter training times

ImageNet / AlexNet, minibatch size = 128:
• x86-based PCIe-attached GPU system: 170 ms
• POWER8 + Tesla P100 + NVLink: 78 ms

Advantage: data communication and GPU performance
Hillery Hunter, IBM, GTC 2018
IBM PowerAI at a glance
Large Model Support

Traditional model support (competitors): the GPU hangs off the CPU (DDR4) over PCIe; limited memory on the GPU forces a trade-off in model size / data resolution.

PowerAI with Large Model Support (LMS): the GPU connects to the POWER CPU (DDR4) over NVLink; use system memory and the GPU to support more complex models and higher-resolution data.
Large AI Models Train ~4 Times Faster
POWER9 servers with NVLink to GPUs vs. x86 servers with PCIe to GPUs

Caffe with LMS (Large Model Support), runtime of 1000 iterations, GoogLeNet model on an enlarged ImageNet dataset (2240x2240):
• Xeon x86 E5-2640 v4 with 4x V100 GPUs: 3.1 hours
• Power AC922 with 4x V100 GPUs: 49 minutes
3.8x faster
LMS Usage in Caffe (PowerAI 5.1)

LMS enables processing of high-definition images, large models, and higher batch sizes that don’t fit in GPU memory today (the maximum GPU memory available on NVIDIA P100 GPUs is 16 GB).

LMS options:
• -lms <size in KB>
• -lms_frac <x>, where 0 < x < 1.0

Example of running IBM Caffe with LMS for a deep residual network (ResNet-152):
/opt/DL/caffe-ibm/bin/caffe train -gpu 0,1,2,3 -solver=solver.prototxt -lms 10000 -lms_frac=0.5

Note that suitable “lms” and “lms_frac” values depend on the following factors:
• batch size used
• model used
• number of GPUs used
• system memory available
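To make the tuning trade-off concrete, here is a hypothetical helper (not part of PowerAI; the function name, headroom value, and thresholds are all assumptions) sketching one way to pick an -lms_frac starting point: keep as much of the working set GPU-resident as fits in the 16 GB card, minus some headroom:

```python
def suggest_lms_frac(working_set_gb: float, gpu_mem_gb: float = 16.0,
                     headroom_gb: float = 2.0) -> float:
    """Fraction of the working set to keep GPU-resident, in (0, 1)."""
    usable = gpu_mem_gb - headroom_gb
    if working_set_gb <= usable:
        return 0.9  # everything fits; keep almost all tensors on the GPU
    # Otherwise keep only the fraction that fits, with a small floor.
    return max(0.1, round(usable / working_set_gb, 2))

print(suggest_lms_frac(28.0))  # ~28 GB working set on a 16 GB GPU -> 0.5
```

In practice the batch size, model, GPU count, and available system memory listed above all shift this estimate, so the value is a starting point for experimentation, not a formula.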
PowerAI Distributed Deep Learning Library (DDL)

A communication library for distributed deep learning training:
• Enables deep learning software to scale to hundreds of servers with GPUs
• Works across a variety of system sizes
• Works with a variety of network types and switch topologies

Released results @ 256 P100 GPUs:
• Better scaling efficiency than Facebook AI Research: 95% (IBM) vs. <90% (FB)
• Higher image-recognition accuracy than Microsoft: 33.8% (IBM) vs. 29.8% (MS)

TECHNICAL DETAILS: https://arxiv.org/abs/1708.02188
Costly inter-GPU communication time can limit scaling
Example: DevBox with 4 TITAN X GPUs
Training cycle: computation plus communication
Hillery Hunter, IBM, GTC 2018
Scaling within a system suffers without peer-to-peer
Example: 16 GPUs on PCIe in a system (topology matrix of PIX/PXB PCIe paths omitted)
P2P GPUDirect is key
Hillery Hunter, IBM, GTC 2018
What does DDL do? DDL for TensorFlow:

1. Places the job on the GPU local to the CPU (negotiating to use the NVLink interface)
2. Places the job on its nearest neighbor, to leverage NVLink GPU:GPU communication
3. Places the job on the same system, on the other socket
4. Sends the job, integrating RDMA over IB (not present in the frameworks themselves), to a remote system and its first GPU

The same kind of intelligence you see in good HPC job schedulers, but created with specific tuning for our architecture.
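The four-step preference order can be sketched as a cheapest-link-first search. This is an illustrative model only (the names and the slot bookkeeping are hypothetical, not the DDL API):

```python
# Placement tiers, cheapest interconnect first, per the list above.
PLACEMENT_ORDER = [
    "local GPU on this CPU (NVLink CPU:GPU)",
    "nearest-neighbor GPU (NVLink GPU:GPU)",
    "GPU on the other socket of the same system",
    "first GPU of a remote system (RDMA over IB)",
]

def place(free: list) -> str:
    """Pick the first placement tier that still has capacity.
    `free` holds the number of open slots per tier, cheapest first."""
    for tier, slots in zip(PLACEMENT_ORDER, free):
        if slots > 0:
            return tier
    raise RuntimeError("no capacity anywhere")

print(place([0, 0, 1, 4]))  # local and neighbor full -> other socket
```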
Communication paths
PowerAI DDL: fully utilize bandwidth for links within each node and across all nodes, so learners communicate as efficiently as possible.
(Diagram: two nodes on a network switch; each node has two POWER CPUs with DDR4, four GPUs with GPU memory, PCIe-attached storage, and IB/Ethernet network links.)
Hillery Hunter, IBM, GTC 2018
PowerAI DDL training ResNet-50, ImageNet-1K, Caffe (speedup vs. number of nodes, ideal vs. actual):

#GPUs               4     8     16    32    64    128   256
#Nodes              1     2     4     8     16    32    64
Speedup             1.0   2.0   3.9   7.9   15.5  30.5  60.6
Scaling efficiency  1.00  1.00  .98   .99   .97   .95   .95
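The scaling-efficiency row is simply speedup divided by node count, relative to the one-node baseline. Recomputing it from the speedup row:

```python
nodes   = [1, 2, 4, 8, 16, 32, 64]
speedup = [1.0, 2.0, 3.9, 7.9, 15.5, 30.5, 60.6]

# Efficiency = measured speedup / ideal (linear) speedup.
efficiency = [s / n for s, n in zip(speedup, nodes)]
for n, e in zip(nodes, efficiency):
    print(f"{n:2d} nodes: {e:.1%}")
```

The last entry, 60.6 / 64, is about 94.7%, which the deck rounds to .95.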
PowerAI DDL training ResNet-101, ImageNet-22K, Caffe (speedup vs. number of nodes, ideal vs. actual):

#GPUs               4     8     64    256
#Nodes              1     2     16    64
Speedup             1.0   1.8   3.9   13.8
Scaling efficiency  1.00  .92   .86   .85

7 hours to 33.8% top-1 accuracy using 256 GPUs
TensorFlow 1.4 Performance on IBM POWER9 with V100: single node

35% more images processed per second vs. tested x86 systems

ResNet-50 testing on the ILSVRC 2012 dataset (aka ImageNet 2012): training on 1.2M images, validation on 50K images

§ Results are based on IBM internal measurements running 1000 iterations of HPM ResNet-50 on 1.2M images and validation on 50K images, with the dataset from ILSVRC 2012, also known as ImageNet 2012.
§ Software: TensorFlow 1.4.0 framework and HPM ResNet-50 from https://github.com/tensorflow/benchmarks.git (commit: f5d85aef), with the following parameters: batch size: 64 per GPU; iterations: 1100; data: ImageNet; local-parameter-device: gpu; variable-update: replicated

Date of testing: November 26, 2017
TensorFlow 1.4 Performance on IBM POWER9 with V100: multiple nodes

Distributed deep learning: IBM POWER9™ with NVIDIA Tesla V100 results in 2.3x more data processed on TensorFlow versus tested x86 systems (2.3x more images processed per second).

The PowerAI Distributed Deep Learning (DDL) library provides innovative distribution methods enabling AI frameworks to scale to multiple servers, leveraging all attached GPUs.

ResNet-50 testing on the ILSVRC 2012 dataset (also known as ImageNet 2012): training on 1.2M images, validation on 50K images

Date of testing: December 2, 2017
DDL TF operator functions/semantics

1. Init function: must be called before any real TF operators. Typically, this op can be executed on the CPU using an additional session. The input is the DDL configuration, which informs the targeted network topology and the mapping of learners onto it. The output consists of MPI information (rank, size) and the GPU assignment for TF.
2. Bcast function: broadcast synchronizes all the trainable parameters (i.e., weights and biases) before training. Broadcast can be called once Init has been called and has completed on the assigned GPU device. Each and every trainable parameter must be broadcast to ensure good convergence.
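The broadcast semantics can be shown in plain Python (illustrative only, not the DDL API): every learner overwrites its randomly initialized parameters with the root learner's copy, so all replicas start identical:

```python
def bcast(params_per_learner, root=0):
    """Return each learner's parameters after broadcasting from `root`."""
    root_copy = list(params_per_learner[root])
    return [list(root_copy) for _ in params_per_learner]

# Three learners with divergent random initializations:
learners = [[0.3, -0.1], [0.7, 0.2], [-0.5, 0.9]]
print(bcast(learners))  # every learner now holds [0.3, -0.1]
```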
3. AllReduceN function: an aggregated version of AllReduce. Essentially, it takes an array of N tensors, performs an allreduce in a single shot, and returns an array of N reduced tensors. The benefits of using AllReduceN are better performance and simpler integration.
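A minimal sketch of the AllReduceN semantics in plain Python (illustrative, not the DDL API; sum is assumed as the reduction): each of the N gradient tensors is reduced element-wise across all learners in one call, and every learner receives the same N results:

```python
def all_reduce_n(tensors_per_learner):
    """tensors_per_learner[learner][i] -> reduced[i], summed over learners."""
    n = len(tensors_per_learner[0])
    reduced = []
    for i in range(n):
        # Gather tensor i from every learner, then sum element-wise.
        cols = [learner[i] for learner in tensors_per_learner]
        reduced.append([sum(vals) for vals in zip(*cols)])
    return reduced

# Two learners, each holding N = 2 gradient tensors:
grads = [
    [[1.0, 2.0], [10.0]],
    [[3.0, 4.0], [20.0]],
]
print(all_reduce_n(grads))  # [[4.0, 6.0], [30.0]]
```

Aggregating all N tensors in one shot, rather than issuing N separate allreduces, is what the slide credits for the better performance and simpler integration.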
Thank you

Florin Manaila
Senior IT Architect and Inventor
Cognitive Systems (HPC and Deep Learning)
florin.manaila@de.ibm.com
Presented at the OpenPOWER Boot Camp in Zurich.