AI at Scale
COGNITIVE SYSTEMS
Ing. Florin Manaila
Senior Architect and Inventor
Cognitive Systems (Distributed Deep Learning and HPC)
IBM Systems Hardware Europe
Member of the IBM Academy of Technology (AoT)
July 9, 2020
Technical R&D today disruption
2
Figure: Today, opportunistic discovery by humans feeds simulation and experiments that lead to a new product. With Cognitive Discovery, comprehensive discovery by cognitive systems drives simulation & inference and experiments toward a new product.
IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
Next-Generation
Infrastructure Stack
3IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
Problem
―
4
 Datasets are large and growing
 The size of a batch of samples is large and growing
 Sample sizes are large and growing
 More and more sophisticated models are being designed, some with
hundreds of layers
 GPU memory capacity is growing as well (but slower)
 Limited by cost, technology, physical space
 Energy costs are increasing year over year
 Large CO2e footprint per training cycle
IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
What’s in the training of deep neural networks?
Neural network model: billions of parameters (gigabytes)
Computation: iterative gradient-based search, millions of iterations, mainly matrix operations
Data: millions of images or sentences (terabytes)
Workload characteristics: Both compute and data intensive!
5IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
Distributed Deep Learning
Common options
6
Figure: four common options, ordered from longer to shorter training time:
 Single accelerator (1x accelerator)
 Data parallel (4x accelerators, data split across systems)
 Model parallel (4x accelerators, model split across accelerators)
 Data and model parallel (4x n accelerators)
IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
Data-Parallel Framework
Distributed Learning
7
Figure: A large dataset is split into partitions across nodes and GPUs. Node 0 holds Partition 0, divided into sub-partitions (0,0)-(0,3) on GPU 0-3; Node 1 holds Partition 1, divided into sub-partitions (1,0)-(1,3) on GPU 0-3.
IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
Scaling
Misperception
8IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
AI Frameworks and Multi-GPU
Single GPU utilization
9
Figure: IBM AC922 node architecture: two POWER9 CPUs connected by an X-Bus (4B), each with coherent access to 2 TB of DDR4 system memory (170 GB/s), NVLink 150 GB/s links to two NVIDIA V100 GPUs per CPU, a Multi-Host Socket Direct InfiniBand adapter on PCIe Gen4 / CAPI 2.0, and PCIe NVMe flash storage.
Unless told explicitly, AI frameworks make use of a single GPU!
IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
10
Figure: the same AC922 node architecture as on the previous slide (two POWER9 CPUs, four NVIDIA V100 GPUs on NVLink 150 GB/s, 2 TB DDR4 system memory, InfiniBand, PCIe NVMe flash storage).
When used explicitly, the multi-GPU model in AI frameworks makes use of all GPUs available on the host, or of those assigned by SLURM if an interactive session is requested with a specific number of GPUs!
AI Frameworks and Multi-GPU
4x GPUs utilization
IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
11
InfiniBand EDR Switch
AI Frameworks and Multi-GPU
12x GPU utilization using the collective communication operation called “AllReduce”
IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
Multi GPU in Keras
Scenarios
Training models with
weights merge on CPU
Training models with
weights merge on CPU
using cpu_relocation
(recommended for
IC922)
Training models with
weights merge on GPU
(recommended for
AC922)
12IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
Issues
 Batch Size
 GPU data starvation aka the CPUs can’t keep up with the GPUs
 Saving your parallel models
 Counting the available GPUs has a nasty side-effect
IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
Issues
GPU data starvation
IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
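One common mitigation is to overlap CPU-side input preparation with GPU compute. The sketch below is not from the original deck; it is a minimal illustration using the standard tf.data API, where the file pattern, image size, label handling, and batch size are all assumptions.

import tensorflow as tf

# Minimal sketch (assumed file layout and hyper-parameters): keep the GPUs fed
# by preparing the next batches on the CPU while the current batch is training.
BATCH_SIZE = 256

def parse_fn(path):
    # Decode and resize on the CPU; map() runs this in parallel across cores.
    image = tf.io.decode_jpeg(tf.io.read_file(path), channels=3)
    image = tf.image.resize(image, (224, 224)) / 255.0
    label = tf.constant(0)  # placeholder label; real code would derive it from the path
    return image, label

dataset = (tf.data.Dataset.list_files("train/*.jpg")  # hypothetical dataset location
           .shuffle(10000)
           .map(parse_fn, num_parallel_calls=tf.data.experimental.AUTOTUNE)
           .batch(BATCH_SIZE)
           .prefetch(tf.data.experimental.AUTOTUNE))  # overlap input pipeline with training

# model.fit(dataset, epochs=20)  # the GPUs no longer wait on the input pipeline

For keras.utils.Sequence generators, raising workers and setting use_multiprocessing=True in fit() serves the same purpose.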
Training models with weights merge on CPU
Example 1
import tensorflow as tf
from keras.applications import Xception
from keras.utils import multi_gpu_model
import numpy as np
num_samples = 1000
height = 224
width = 224
num_classes = 1000
# Instantiate the base model (or "template" model).
# We recommend doing this under a CPU device scope,
# so that the model's weights are hosted on CPU memory.
# Otherwise they may end up hosted on a GPU, which would
# complicate weight sharing.
with tf.device('/cpu:0'):
    model = Xception(weights=None,
                     input_shape=(height, width, 3),
                     classes=num_classes)
IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
Training models with weights merge on CPU
Example 1
# Replicates the model on 4 GPUs.
# This assumes that your machine has 4 available GPUs.
parallel_model = multi_gpu_model(model, gpus=4)
parallel_model.compile(loss='categorical_crossentropy',
optimizer='rmsprop')
# Generate dummy data.
x = np.random.random((num_samples, height, width, 3))
y = np.random.random((num_samples, num_classes))
# This `fit` call will be distributed on 4 GPUs.
# Since the batch size is 256, each GPU will process 64 samples.
parallel_model.fit(x, y, epochs=20, batch_size=256)
# Save model via the template model (which shares the same weights):
model.save('my_model.h5')
NOTE:
To save the multi-gpu model, use .save(fname) or .save_weights(fname) with the template model (the argument you passed to multi_gpu_model), rather than the model returned by multi_gpu_model.
IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
Training models with weights merge on CPU using cpu_relocation
Example 2
..
# Not needed to change the device scope for model definition:
model = Xception(weights=None, ..)
try:
    parallel_model = multi_gpu_model(model, cpu_relocation=True)
    print("Training using multiple GPUs..")
except ValueError:
    parallel_model = model
    print("Training using single GPU or CPU..")
parallel_model.compile(..)
..
IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
Training models with weights merge on GPU (recommended for AC922)
Example 3
..
# Not needed to change the device scope for model definition:
model = Xception(weights=None, ..)
try:
    parallel_model = multi_gpu_model(model, cpu_merge=False)
    print("Training using multiple GPUs..")
except:
    parallel_model = model
    print("Training using single GPU or CPU..")
parallel_model.compile(..)
..
IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
Next-Generation
Software Stack
19IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
20
AI Infrastructure Stack
ON-CLOUD and ON-PREM
 Micro-Services / Applications: segment specific (Finance, Retail, Healthcare, Automotive)
 Governance AI (Fairness, Explainable AI, Model Health, Accuracy)
 APIs (external and in-house): Speech, Vision, NLP, Sentiment
 Machine & Deep Learning Libraries & Frameworks: TensorFlow, Caffe, PyTorch, SparkML, Snap.ML
 Distributed Computing: Spark, MPI
 Data Lake & Data Stores: Hadoop HDFS, NoSQL DBs, Parallel File System
 Transform & Prep Data (ETL)
 Accelerated Infrastructure
IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
Watson ML Community Edition (WMLCE)
21
Version 1.7.0, delivered via Bare Metal or Containers
Frameworks and libraries: TensorFlow (Estimator, Probability, Serving, TensorBoard), PyTorch (APEX), Caffe2, TensorRT, ONNX, XGBoost, Bazel, RAPIDS.AI (cuDF, cuML), SnapML (Local, MPI, Spark), DASK, Horovod, DDL, Large Model Support (LMSv2), Spectrum MPI, AIX360, AIF360
Supporting libraries: CUDA, cuDNN, NCCL, libevent, libgdf, libgdf_cffi, libopencv, libprotobuf, parquet-cpp, thrift-cpp, arrow-cpp, pyarrow, gflags, magma, cupy, py-opencv, etc.
IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
Watson ML Community Edition (WMLCE)
22
Version 1.7.0, delivered via Bare Metal or Containers
Training: TensorFlow (Estimator, Probability, Serving, TensorBoard), PyTorch (APEX), Caffe2, ONNX, XGBoost, Bazel, RAPIDS.AI (cuDF, cuML), SnapML (Local, MPI, Spark), DASK, Horovod, DDL, Large Model Support (LMSv2), Spectrum MPI, AIX360, AIF360
Inference: TensorFlow Serving Server, TensorRT, ONNX, Protobuf
Supporting libraries: CUDA, cuDNN, NCCL, libevent, libgdf, libgdf_cffi, libopencv, libprotobuf, parquet-cpp, thrift-cpp, arrow-cpp, pyarrow, gflags, magma, cupy, py-opencv, etc.
IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
AI Explainability and Fairness toolkits on POWER
23IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
24
IBM Watson Machine Learning
Community Edition
Docker Containers
IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
WMLCE - Installation
25
After you have installed Anaconda on your user profile, add the IBM WMLCE conda channel:
$ conda config --prepend channels https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda/
Create a Python virtual environment:
$ conda create --name wmlce-1.7.0 python=3.7
$ conda activate wmlce-1.7.0
Install WMLCE:
$ conda install powerai
Optional packages:
$ conda install powerai-rapids
$ conda install py-xgboost-gpu
https://www.ibm.com/support/knowledgecenter/SS5SF7_1.7.0/navigation/wmlce_install.html#wmlce_install__install
IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
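As a quick sanity check after installation (my own sketch, not part of the official instructions), import the shipped frameworks inside the activated environment and confirm they see the GPUs:

# Run inside the activated wmlce-1.7.0 environment.
import tensorflow as tf
import torch

# TensorFlow 2.x: list the visible GPU devices (older releases may need the
# tf.config.experimental.list_physical_devices alias).
print("TensorFlow", tf.__version__, tf.config.list_physical_devices("GPU"))

# PyTorch: confirm CUDA is usable and count the devices.
print("PyTorch", torch.__version__, torch.cuda.is_available(), torch.cuda.device_count())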
IBM Large Model
Support (LMS)
26IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
IBM Large Model Support (LMS)
Swap-out unused parameters to large CPU memory (TB order)
27IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
Figure: Normal backpropagation keeps unused layer parameters in GPU memory throughout the forward and backward passes. Backpropagation with LMS (swap) swaps unused parameters out to CPU memory during the forward pass and swaps them back into GPU memory just before they are needed in the backward pass.
Background
Neural networks are growing deeper and wider
In the near future, the memory needed to hold the network parameters may exceed GPU memory (16 GB, 40 GB, etc.)
 Large Model Support is required in deep learning frameworks
CPU-GPU NVLink plays the key role
IBM Large Model Support (LMS)
28
Allows seamlessly moving layers of a model between GPU and CPU memory to overcome GPU memory limits, which enables
training of:
 Deeper models
 Higher resolution data
 Larger batch sizes
Supported frameworks: TensorFlow, PyTorch, Keras
TFLMSv2 introduces four hyper-parameters to work
with:
 swapout_threshold: The number of tensors to hold within
GPU memory before pushing them to system memory.
 swapin_ahead: The larger swapin_ahead is, the earlier a
tensor is swapped in to the GPU memory from the host
memory.
 swapin_groupby: Multiple swap-in operations of the
same tensor will be grouped or fused into one swap-in
operation for better performance if they are close to each
other (the distance between them is within
swapin_groupby).
 sync_mode: Whether to do synchronisation between
data transfer and kernel computation or not.
IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
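As a rough sketch of how these hyper-parameters are typically wired into a Keras training run: the module path, class name, and values below are assumptions inferred from the TFLMSv2 parameter list above, so check the WMLCE knowledge center for the exact API of your release.

# Hedged sketch: assumes TFLMSv2 exposes an LMS Keras callback in the
# tensorflow_large_model_support module; parameter values are illustrative only.
from tensorflow_large_model_support import LMS

lms_callback = LMS(swapout_threshold=1,  # tensors kept in GPU memory before swapping out
                   swapin_ahead=1,       # how early tensors are swapped back into GPU memory
                   swapin_groupby=0,     # fuse nearby swap-ins of the same tensor
                   sync_mode=0)          # synchronisation between data transfer and kernels

# model.fit(x, y, epochs=20, batch_size=256, callbacks=[lms_callback])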
What’s possible with Large Model Support
29
 8.3x image resolution - Keras ResNet50
 14.4x image resolution – ResNet152v2
 7x MRI resolution - 3D U-Net 3D image segmentation
IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
Distributed
Deep Learning
30IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
Distributed Deep Learning
Goals
31
The overall goal of distributed
deep learning is to reduce the
training time
To this end, the primary features are:
 Automatic Topology Detection
 Rankfile generation
 Automatic mpirun option handling
 Efficiency in scalability
IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
32
Distributed Deep Learning
How does it work?
 A process is created for each GPU in the cluster
 Each process contains a copy of the model
 Mini-batch is spread across all of the processes
 Each process uses different input data
 After each iteration, all of the processes sync and average together their
gradients
IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
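The steps above amount to synchronous data-parallel SGD. The sketch below is my own conceptual illustration of that pattern using plain torch.distributed; DDL and Horovod implement the same idea with topology-aware, fused AllReduce and are what you would use in practice.

import torch
import torch.distributed as dist

# Conceptual sketch: one process per GPU, each with its own model replica and
# data shard; gradients are averaged across all ranks after every iteration.
def train_step(model, optimizer, loss_fn, inputs, targets):
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)  # each process uses different input data
    loss.backward()                         # local gradients

    world_size = dist.get_world_size()
    for param in model.parameters():        # sync and average gradients across ranks
        dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
        param.grad /= world_size

    optimizer.step()                        # every replica applies the same update
    return loss.item()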
33
Tools and Libraries
The following tools and libraries provide the communication functions necessary to perform distributed
training, primarily allReduce and broadcast.
 IBM Spectrum MPI: Classic tool for distributed computing. Still commonly used for distributed deep
learning.
 NVIDIA NCCL: Nvidia’s gpu-to-gpu communication library. Since NCCL2, between-node communication
is supported.
 IBM DDL: Provides a topology-aware allReduce. Capable of optimally dividing communication across
hierarchies of fabrics. Utilizes different communication protocols at different hierarchies. When WMLCE is
installed, all related frameworks come with IBM DDL support; you don’t have to compile additional
software packages, only modify your training scripts to use the distributed deep learning APIs.
Integrations into deep learning frameworks that enable distributed training on top of these communication
libraries include:
 TensorFlow Distribution Strategies. Native TensorFlow distribution methods.
 IBM DDL. Provides integrations into common frameworks, including a TensorFlow operator that integrates
IBM DDL with TensorFlow, and similar for PyTorch.
 Horovod [Sergeev et al. 2018]. Provides integration libraries into common frameworks which enable
distributed training with common communication libraries; IBM DDL or NCCL can be used as the
backend for the Horovod implementation.
IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
Horovod
distributed training framework
34
 Distributed training framework for:
• TensorFlow
• Keras
• PyTorch
 Separates infrastructure from ML
 Easy installation on top of ML frameworks:
 Best performance with NCCL or DDL - uses
bandwidth-optimal communication protocols
(NVLINK, RDMA (InfiniBand, RoCE)) if available
 Named after the traditional Russian folk dance where
participants dance in a circle with linked hands
$ conda install horovod
IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
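The per-script changes Horovod asks for are small. Below is a minimal sketch using the standard horovod.tensorflow.keras API; the model, optimizer, and learning rate are placeholders.

import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()  # one process per GPU, launched by mpirun/ddlrun/horovodrun

# Pin each process to a single local GPU.
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

model = tf.keras.applications.ResNet50(weights=None, classes=1000)  # placeholder model

# Scale the learning rate by the number of workers and wrap the optimizer so
# gradients are averaged with AllReduce (NCCL or DDL backend).
opt = tf.keras.optimizers.SGD(0.01 * hvd.size(), momentum=0.9)
opt = hvd.DistributedOptimizer(opt)

model.compile(loss='categorical_crossentropy', optimizer=opt)

# Rank 0 broadcasts the initial weights so all replicas start identical.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
# model.fit(dataset, epochs=20, callbacks=callbacks, verbose=1 if hvd.rank() == 0 else 0)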
Horovod with DDL
Running
35
$ ddlrun -H host1,host2,host3,host4 -mpiarg "-x HOROVOD_FUSION_THRESHOLD=16777216" python hpms/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model resnet50 --batch_size 64 --variable_update=horovod
I 20:42:52.209 12173 12173 DDL:29 ] [MPI:0 ] ==== IBM Corp. DDL 1.1.0 + (MPI 3.1) ====
...
----------------------------------------------------------------
total images/sec: 5682.34
----------------------------------------------------------------
IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
Horovod Architecture
• Multiple Towers (here 2)
• Each Tower:
 Runs in context of individual OS
process (own PID)
 Has own data pipeline to read and
augment
 Runs own training step
 Synchronization Step via
hvd.DistributedOptimizer()
Figure: two towers (Rank 0 and Rank 1), each an individual process, meeting at hvd.DistributedOptimizer(), the point of gradient synchronization where the first NCCL log output appears.
IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
How to Start Horovod Jobs
In the samples below, train.py is our training code:
• 2 GPUs: mpirun -np 2 --allow-run-as-root -H localhost:2 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x
LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl ^openib python train.py
• 4 GPUs: mpirun -np 4 --allow-run-as-root -H localhost:4 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x
LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl ^openib python train.py
• Single GPU: mpirun -np 1 --allow-run-as-root -H localhost:1 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x
LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl ^openib python train.py
A better way is to use horovodrun or ddlrun:
One Training codebase, One way to start (MPI based), Easy to orchestrate, No parameter server
IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
horovodrun -np 16 -H compute1:4,compute2:4,compute3:4,compute4:4 python train.py
Horovod Rank Enumeration
 Parameters given by Horovod
• hvd.size() – total number of GPUs working in this job
• hvd.rank() – rank id assigned to this specific tower/worker
o Perform special steps in single rank (mostly rank 0)
o Checkpointing
o TensorBoard log writing
 Pitfall: hvd.local_rank() is not unique when doing multi-node jobs!
IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
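A short sketch of the rank-0-only pattern described above (callback choices and paths are placeholders):

import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]

# Checkpointing and TensorBoard logging only on global rank 0, so the workers
# do not all write the same (or conflicting) files.
if hvd.rank() == 0:
    callbacks.append(tf.keras.callbacks.ModelCheckpoint('checkpoint-{epoch}.h5'))  # placeholder path
    callbacks.append(tf.keras.callbacks.TensorBoard(log_dir='./logs'))             # placeholder path

# Remember: hvd.local_rank() is only unique within a node; use hvd.rank() for
# "do this exactly once in the whole job" logic.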
Horovod MPI Rank Enumeration
A sample with 4 GPUs on 2 nodes
Figure: a caller launches four towers (individual processes) across two nodes. Node 1 hosts hvd.rank() = 0 (hvd.local_rank() = 0) and hvd.rank() = 1 (hvd.local_rank() = 1); node 2 hosts hvd.rank() = 2 (hvd.local_rank() = 0) and hvd.rank() = 3 (hvd.local_rank() = 1). hvd.size() = 4.
IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
WMLCE
and SLURM
integration
40IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
41
SLURM template example for 4x IBM AC922s
Batch AI
IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
42
SLURM template example for 4x IBM AC922s
Batch AI with Horovod and Pytorch
IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
43
MNIST example for 4x IBM AC922s
Batch AI with Horovod and Pytorch
IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
44
MNIST example for 4x IBM AC922s
Batch AI with Horovod and Pytorch
IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
45
MNIST example for 4x IBM AC922s
Batch AI with Horovod and Pytorch
IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
46
MNIST example for 4x IBM AC922s
Batch AI with Horovod and Pytorch
IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
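The SLURM and MNIST slides above are screenshots that did not survive extraction. As a stand-in, here is a compact hedged sketch of what a Horovod + PyTorch MNIST training script for such a setup typically looks like (standard horovod.torch and torchvision APIs; the model, hyper-parameters, and data path are assumptions). It would be launched with the horovodrun or ddlrun commands shown earlier, or from an sbatch script.

import torch
import torch.nn.functional as F
from torchvision import datasets, transforms
import horovod.torch as hvd

hvd.init()
torch.cuda.set_device(hvd.local_rank())  # one process per GPU, pinned by local rank

train_set = datasets.MNIST('data-%d' % hvd.rank(), train=True, download=True,
                           transform=transforms.ToTensor())
sampler = torch.utils.data.distributed.DistributedSampler(
    train_set, num_replicas=hvd.size(), rank=hvd.rank())  # each rank gets a different shard
loader = torch.utils.data.DataLoader(train_set, batch_size=64, sampler=sampler)

model = torch.nn.Sequential(torch.nn.Flatten(),
                            torch.nn.Linear(28 * 28, 128), torch.nn.ReLU(),
                            torch.nn.Linear(128, 10)).cuda()  # placeholder model

optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size(), momentum=0.9)
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())

hvd.broadcast_parameters(model.state_dict(), root_rank=0)  # identical starting weights
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

for epoch in range(5):
    sampler.set_epoch(epoch)
    for images, labels in loader:
        images, labels = images.cuda(), labels.cuda()
        optimizer.zero_grad()
        loss = F.cross_entropy(model(images), labels)
        loss.backward()   # gradients are averaged across ranks via AllReduce
        optimizer.step()
    if hvd.rank() == 0:
        print('epoch %d done, last loss %.4f' % (epoch, loss.item()))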
47IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
Thank you
48
Florin Manaila
—
florin.manaila@de.ibm.com
ibm.com
Editor's Notes
  • #10: Training models on GPU using Keras & Tensorflow is seamless. If you have an NVIDIA card and you have installed CUDA, the libraries will automatically detect it and use it for training. But what if you are a spoilt brat and you have multiple GPUs? Well unfortunately you will have to work a bit to achieve multi-GPU training.
  • #11: There are multiple ways to parallelise a network depending on what you want to achieve, but the two main approaches are model and data parallelization. The first can help you if your model is too complex to fit in a single GPU, while the latter helps when you want to speed up the execution. The main idea is that you pass your model through the method and it is copied across different GPUs. The original input is split into chunks which are fed to the various GPUs and then aggregated as a single output. This method can be used for achieving parallel training and predictions; nevertheless, keep in mind that for training it does not scale linearly with the number of GPUs due to the required synchronization.
  • #12: In synchronized data-parallel distributed deep learning, the major computation steps are: 1. Compute the gradient of the loss function using a minibatch on each GPU. 2. Compute the mean of the gradients by inter-GPU communication. 3. Update the model.