AI at Scale
COGNITIVE SYSTEMS
Ing. Florin Manaila
Senior Architect and Inventor
Cognitive Systems (Distributed Deep Learning and HPC)
IBM Systems Hardware Europe
Member of the IBM Academy of Technology (AoT)
July 9, 2020
Technical R&D today disruption
2
Figure: Today, opportunistic discovery by humans feeds simulation and experiments that lead to a new product. With Cognitive Discovery, comprehensive discovery by cognitive systems drives simulation & inference and experiments toward a new product.
IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
Next-Generation
Infrastructure Stack
3IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
Problem
―
4
 Datasets are large and growing
 The size of a batch of samples is large and growing
 Sample sizes are large and growing
 More and more sophisticated models are being designed, some with
hundreds of layers
 GPU memory capacity is growing as well (but slower)
 Limited by cost, technology, physical space
 Energy costs are increasing year over year
 Large CO2e footprint per training cycle
IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
What’s in the training of deep neural networks?
Neural network model: billions of parameters (gigabytes)
Computation: iterative gradient-based search, millions of iterations, mainly matrix operations
Data: millions of images or sentences (terabytes)
Workload characteristics: Both compute and data intensive!
5IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
Distributed Deep Learning
Common options
6
Figure: four common options, ordered from longer to shorter training time:
 Single accelerator (1x accelerator)
 Data parallel (4x accelerators, data split across systems)
 Model parallel (4x accelerators, model split across accelerators)
 Data and model parallel (4x n accelerators)
IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
Data-Parallel Framework
Distributed Learning
7
Figure: A large dataset is split into partitions across nodes and GPUs. Node 0 holds Partition 0, divided into sub-partitions (0,0)-(0,3) on GPU 0-3; Node 1 holds Partition 1, divided into sub-partitions (1,0)-(1,3) on GPU 0-3.
IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
Scaling
Misperception
8IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
AI Frameworks and Multi-GPU
Single GPU utilization
9
Figure: IBM AC922 node architecture: two POWER9 CPUs connected by an X-Bus (4B), each with coherent access to 2 TB of DDR4 system memory (170 GB/s), NVLink 150 GB/s links to two NVIDIA V100 GPUs per CPU, a Multi-Host Socket Direct InfiniBand adapter on PCIe Gen4 / CAPI 2.0, and PCIe NVMe flash storage.
Unless told explicitly, AI frameworks make use of a single GPU!
IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
10
Figure: the same AC922 node architecture as on the previous slide (two POWER9 CPUs, four NVIDIA V100 GPUs on NVLink 150 GB/s, 2 TB DDR4 system memory, InfiniBand, PCIe NVMe flash storage).
When used explicitly, the multi-GPU model in AI frameworks makes use of all GPUs available on the host, or of those assigned by SLURM if an interactive session is requested with a specific number of GPUs!
AI Frameworks and Multi-GPU
4x GPUs utilization
IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
11
InfiniBand EDR Switch
AI Frameworks and Multi-GPU
12x GPU utilization using the collective communication operation called “AllReduce”
IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
Multi GPU in Keras
Scenarios
Training models with
weights merge on CPU
Training models with
weights merge on CPU
using cpu_relocation
(recommended for
IC922)
Training models with
weights merge on GPU
(recommended for
AC922)
12IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
Issues
 Batch Size
 GPU data starvation aka the CPUs can’t keep up with the GPUs
 Saving your parallel models
 Counting the available GPUs has a nasty side-effect
IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
Issues
GPU data starvation
IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
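One common mitigation is to overlap CPU-side input preparation with GPU compute. The sketch below is not from the original deck; it is a minimal illustration using the standard tf.data API, where the file pattern, image size, label handling, and batch size are all assumptions.

import tensorflow as tf

# Minimal sketch (assumed file layout and hyper-parameters): keep the GPUs fed
# by preparing the next batches on the CPU while the current batch is training.
BATCH_SIZE = 256

def parse_fn(path):
    # Decode and resize on the CPU; map() runs this in parallel across cores.
    image = tf.io.decode_jpeg(tf.io.read_file(path), channels=3)
    image = tf.image.resize(image, (224, 224)) / 255.0
    label = tf.constant(0)  # placeholder label; real code would derive it from the path
    return image, label

dataset = (tf.data.Dataset.list_files("train/*.jpg")  # hypothetical dataset location
           .shuffle(10000)
           .map(parse_fn, num_parallel_calls=tf.data.experimental.AUTOTUNE)
           .batch(BATCH_SIZE)
           .prefetch(tf.data.experimental.AUTOTUNE))  # overlap input pipeline with training

# model.fit(dataset, epochs=20)  # the GPUs no longer wait on the input pipeline

For keras.utils.Sequence generators, raising workers and setting use_multiprocessing=True in fit() serves the same purpose.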
Training models with weights merge on CPU
Example 1
import tensorflow as tf
from keras.applications import Xception
from keras.utils import multi_gpu_model
import numpy as np
num_samples = 1000
height = 224
width = 224
num_classes = 1000
# Instantiate the base model (or "template" model).
# We recommend doing this under a CPU device scope,
# so that the model's weights are hosted on CPU memory.
# Otherwise they may end up hosted on a GPU, which would
# complicate weight sharing.
with tf.device('/cpu:0'):
    model = Xception(weights=None,
                     input_shape=(height, width, 3),
                     classes=num_classes)
IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
Training models with weights merge on CPU
Example 1
# Replicates the model on 4 GPUs.
# This assumes that your machine has 4 available GPUs.
parallel_model = multi_gpu_model(model, gpus=4)
parallel_model.compile(loss='categorical_crossentropy',
optimizer='rmsprop')
# Generate dummy data.
x = np.random.random((num_samples, height, width, 3))
y = np.random.random((num_samples, num_classes))
# This `fit` call will be distributed on 4 GPUs.
# Since the batch size is 256, each GPU will process 64 samples.
parallel_model.fit(x, y, epochs=20, batch_size=256)
# Save model via the template model (which shares the same weights):
model.save('my_model.h5')
NOTE:
To save the multi-gpu model, use .save(fname) or .save_weights(fname) with the template model (the argument you passed to multi_gpu_model), rather than the model returned by multi_gpu_model.
IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
Training models with weights merge on CPU using cpu_relocation
Example 2
..
# Not needed to change the device scope for model definition:
model = Xception(weights=None, ..)
try:
    parallel_model = multi_gpu_model(model, cpu_relocation=True)
    print("Training using multiple GPUs..")
except ValueError:
    parallel_model = model
    print("Training using single GPU or CPU..")
parallel_model.compile(..)
..
IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
Training models with weights merge on GPU (recommended for AC922)
Example 3
..
# Not needed to change the device scope for model definition:
model = Xception(weights=None, ..)
try:
    parallel_model = multi_gpu_model(model, cpu_merge=False)
    print("Training using multiple GPUs..")
except:
    parallel_model = model
    print("Training using single GPU or CPU..")
parallel_model.compile(..)
..
IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
Next-Generation
Software Stack
19IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
20
AI Infrastructure Stack
ON-CLOUD and ON-PREM
 Micro-Services / Applications: segment specific (Finance, Retail, Healthcare, Automotive)
 Governance AI (Fairness, Explainable AI, Model Health, Accuracy)
 APIs (external and in-house): Speech, Vision, NLP, Sentiment
 Machine & Deep Learning Libraries & Frameworks: TensorFlow, Caffe, PyTorch, SparkML, Snap.ML
 Distributed Computing: Spark, MPI
 Data Lake & Data Stores: Hadoop HDFS, NoSQL DBs, Parallel File System
 Transform & Prep Data (ETL)
 Accelerated Infrastructure
IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
Watson ML Community Edition (WMLCE)
21
Version 1.7.0, delivered via Bare Metal or Containers
Frameworks and libraries: TensorFlow (Estimator, Probability, Serving, TensorBoard), PyTorch (APEX), Caffe2, TensorRT, ONNX, XGBoost, Bazel, RAPIDS.AI (cuDF, cuML), SnapML (Local, MPI, Spark), DASK, Horovod, DDL, Large Model Support (LMSv2), Spectrum MPI, AIX360, AIF360
Supporting libraries: CUDA, cuDNN, NCCL, libevent, libgdf, libgdf_cffi, libopencv, libprotobuf, parquet-cpp, thrift-cpp, arrow-cpp, pyarrow, gflags, magma, cupy, py-opencv, etc.
IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
Watson ML Community Edition (WMLCE)
22
Version 1.7.0, delivered via Bare Metal or Containers
Training: TensorFlow (Estimator, Probability, Serving, TensorBoard), PyTorch (APEX), Caffe2, ONNX, XGBoost, Bazel, RAPIDS.AI (cuDF, cuML), SnapML (Local, MPI, Spark), DASK, Horovod, DDL, Large Model Support (LMSv2), Spectrum MPI, AIX360, AIF360
Inference: TensorFlow Serving Server, TensorRT, ONNX, Protobuf
Supporting libraries: CUDA, cuDNN, NCCL, libevent, libgdf, libgdf_cffi, libopencv, libprotobuf, parquet-cpp, thrift-cpp, arrow-cpp, pyarrow, gflags, magma, cupy, py-opencv, etc.
IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
AI Explainability and Fairness toolkits on POWER
23IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
24
IBM Watson Machine Learning
Community Edition
Docker Containers
IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
WMLCE - Installation
25
After you have installed Anaconda on your user profile, add the IBM WMLCE conda channel:
$ conda config --prepend channels https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda/
Create a Python virtual environment:
$ conda create --name wmlce-1.7.0 python=3.7
$ conda activate wmlce-1.7.0
Install WMLCE:
$ conda install powerai
Optional packages:
$ conda install powerai-rapids
$ conda install py-xgboost-gpu
https://www.ibm.com/support/knowledgecenter/SS5SF7_1.7.0/navigation/wmlce_install.html#wmlce_install__install
IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
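As a quick sanity check after installation (my own sketch, not part of the official instructions), import the shipped frameworks inside the activated environment and confirm they see the GPUs:

# Run inside the activated wmlce-1.7.0 environment.
import tensorflow as tf
import torch

# TensorFlow 2.x: list the visible GPU devices (older releases may need the
# tf.config.experimental.list_physical_devices alias).
print("TensorFlow", tf.__version__, tf.config.list_physical_devices("GPU"))

# PyTorch: confirm CUDA is usable and count the devices.
print("PyTorch", torch.__version__, torch.cuda.is_available(), torch.cuda.device_count())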
IBM Large Model
Support (LMS)
26IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
IBM Large Model Support (LMS)
Swap-out unused parameters to large CPU memory (TB order)
27IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
Figure: Normal backpropagation keeps unused layer parameters in GPU memory throughout the forward and backward passes. Backpropagation with LMS (swap) swaps unused parameters out to CPU memory during the forward pass and swaps them back into GPU memory just before they are needed in the backward pass.
Background
Neural networks are growing deeper and wider
In the near future, the memory needed to hold the network parameters may exceed GPU memory (16 GB, 40 GB, etc.)
 Large Model Support is required in deep learning frameworks
CPU-GPU NVLink plays the key role
IBM Large Model Support (LMS)
28
Allows seamlessly moving layers of a model between GPU and CPU memory to overcome GPU memory limits, which enables
training of:
 Deeper models
 Higher resolution data
 Larger batch sizes
Supported frameworks: TensorFlow, PyTorch, Keras
TFLMSv2 introduces four hyper-parameters to work
with:
 swapout_threshold: The number of tensors to hold within
GPU memory before pushing them to system memory.
 swapin_ahead: The larger swapin_ahead is, the earlier a
tensor is swapped in to the GPU memory from the host
memory.
 swapin_groupby: Multiple swap-in operations of the
same tensor will be grouped or fused into one swap-in
operation for better performance if they are close to each
other (the distance between them is within
swapin_groupby).
 sync_mode: Whether to do synchronisation between
data transfer and kernel computation or not.
IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
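As a rough sketch of how these hyper-parameters are typically wired into a Keras training run: the module path, class name, and values below are assumptions inferred from the TFLMSv2 parameter list above, so check the WMLCE knowledge center for the exact API of your release.

# Hedged sketch: assumes TFLMSv2 exposes an LMS Keras callback in the
# tensorflow_large_model_support module; parameter values are illustrative only.
from tensorflow_large_model_support import LMS

lms_callback = LMS(swapout_threshold=1,  # tensors kept in GPU memory before swapping out
                   swapin_ahead=1,       # how early tensors are swapped back into GPU memory
                   swapin_groupby=0,     # fuse nearby swap-ins of the same tensor
                   sync_mode=0)          # synchronisation between data transfer and kernels

# model.fit(x, y, epochs=20, batch_size=256, callbacks=[lms_callback])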
What’s possible with Large Model Support
29
 8.3x image resolution - Keras ResNet50
 14.4x image resolution – ResNet152v2
 7x MRI resolution - 3D U-Net 3D image segmentation
IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
Distributed
Deep Learning
30IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
Distributed Deep Learning
Goals
31
The overall goal of distributed
deep learning is to reduce the
training time
To this end, the primary features are:
 Automatic Topology Detection
 Rankfile generation
 Automatic mpirun option handling
 Efficiency in scalability
IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
32
Distributed Deep Learning
How does it work?
 A process is created for each GPU in the cluster
 Each process contains a copy of the model
 Mini-batch is spread across all of the processes
 Each process uses different input data
 After each iteration, all of the processes sync and average together their
gradients
IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
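The steps above amount to synchronous data-parallel SGD. The sketch below is my own conceptual illustration of that pattern using plain torch.distributed; DDL and Horovod implement the same idea with topology-aware, fused AllReduce and are what you would use in practice.

import torch
import torch.distributed as dist

# Conceptual sketch: one process per GPU, each with its own model replica and
# data shard; gradients are averaged across all ranks after every iteration.
def train_step(model, optimizer, loss_fn, inputs, targets):
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)  # each process uses different input data
    loss.backward()                         # local gradients

    world_size = dist.get_world_size()
    for param in model.parameters():        # sync and average gradients across ranks
        dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
        param.grad /= world_size

    optimizer.step()                        # every replica applies the same update
    return loss.item()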
33
Tools and Libraries
The following tools and libraries provide the communication functions necessary to perform distributed
training, primarily allReduce and broadcast.
 IBM Spectrum MPI: Classic tool for distributed computing. Still commonly used for distributed deep
learning.
 NVIDIA NCCL: Nvidia’s gpu-to-gpu communication library. Since NCCL2, between-node communication
is supported.
 IBM DDL: Provides a topology-aware allReduce. Capable of optimally dividing communication across
hierarchies of fabrics. Utilizes different communication protocols at different hierarchies. When WMLCE is
installed, all related frameworks come with IBM DDL support; you don’t have to compile additional
software packages, only modify your training scripts to use the distributed deep learning APIs.
Integrations into deep learning frameworks that enable distributed training on top of these communication
libraries include:
 TensorFlow Distribution Strategies. Native TensorFlow distribution methods.
 IBM DDL. Provides integrations into common frameworks, including a TensorFlow operator that integrates
IBM DDL with TensorFlow, and similar for PyTorch.
 Horovod [Sergeev et al. 2018]. Provides integration libraries into common frameworks which enable
distributed training with common communication libraries; IBM DDL or NCCL can be used as the
backend for the Horovod implementation.
IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
Horovod
distributed training framework
34
 Distributed training framework for:
• TensorFlow
• Keras
• PyTorch
 Separates infrastructure from ML
 Easy installation on top of ML frameworks:
 Best performance with NCCL or DDL - uses
bandwidth-optimal communication protocols
(NVLINK, RDMA (InfiniBand, RoCE)) if available
 Named after the traditional Russian folk dance where
participants dance in a circle with linked hands
$ conda install horovod
IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
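The per-script changes Horovod asks for are small. Below is a minimal sketch using the standard horovod.tensorflow.keras API; the model, optimizer, and learning rate are placeholders.

import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()  # one process per GPU, launched by mpirun/ddlrun/horovodrun

# Pin each process to a single local GPU.
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

model = tf.keras.applications.ResNet50(weights=None, classes=1000)  # placeholder model

# Scale the learning rate by the number of workers and wrap the optimizer so
# gradients are averaged with AllReduce (NCCL or DDL backend).
opt = tf.keras.optimizers.SGD(0.01 * hvd.size(), momentum=0.9)
opt = hvd.DistributedOptimizer(opt)

model.compile(loss='categorical_crossentropy', optimizer=opt)

# Rank 0 broadcasts the initial weights so all replicas start identical.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
# model.fit(dataset, epochs=20, callbacks=callbacks, verbose=1 if hvd.rank() == 0 else 0)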
Horovod with DDL
Running
35
$ ddlrun -H host1,host2,host3,host4 -mpiarg "-x HOROVOD_FUSION_THRESHOLD=16777216" python hpms/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model resnet50 --batch_size 64 --variable_update=horovod
I 20:42:52.209 12173 12173 DDL:29 ] [MPI:0 ] ==== IBM Corp. DDL 1.1.0 + (MPI 3.1) ====
...
----------------------------------------------------------------
total images/sec: 5682.34
----------------------------------------------------------------
IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
Horovod Architecture
• Multiple Towers (here 2)
• Each Tower:
 Runs in context of individual OS
process (own PID)
 Has own data pipeline to read and
augment
 Runs own training step
 Synchronization Step via
hvd.DistributedOptimizer()
Figure: two towers (Rank 0 and Rank 1), each an individual process, meeting at hvd.DistributedOptimizer(), the point of gradient synchronization where the first NCCL log output appears.
IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
How to Start Horovod Jobs
In the samples below, train.py is our training code:
• 2 GPUs: mpirun -np 2 --allow-run-as-root -H localhost:2 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x
LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl ^openib python train.py
• 4 GPUs: mpirun -np 4 --allow-run-as-root -H localhost:4 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x
LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl ^openib python train.py
• Single GPU: mpirun -np 1 --allow-run-as-root -H localhost:1 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x
LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl ^openib python train.py
A better way is to use horovodrun or ddlrun:
One Training codebase, One way to start (MPI based), Easy to orchestrate, No parameter server
IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
horovodrun -np 16 -H compute1:4,compute2:4,compute3:4,compute4:4 python train.py
Horovod Rank Enumeration
 Parameters given by Horovod
• hvd.size() – total number of GPUs working in this job
• hvd.rank() – rank id assigned to this specific tower/worker
o Perform special steps in single rank (mostly rank 0)
o Checkpointing
o TensorBoard log writing
 Pitfall: hvd.local_rank() is not unique when doing multi-node jobs!
IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
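A short sketch of the rank-0-only pattern described above (callback choices and paths are placeholders):

import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]

# Checkpointing and TensorBoard logging only on global rank 0, so the workers
# do not all write the same (or conflicting) files.
if hvd.rank() == 0:
    callbacks.append(tf.keras.callbacks.ModelCheckpoint('checkpoint-{epoch}.h5'))  # placeholder path
    callbacks.append(tf.keras.callbacks.TensorBoard(log_dir='./logs'))             # placeholder path

# Remember: hvd.local_rank() is only unique within a node; use hvd.rank() for
# "do this exactly once in the whole job" logic.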
Horovod MPI Rank Enumeration
A sample with 4 GPUs on 2 nodes
Figure: a caller launches four towers (individual processes) across two nodes. Node 1 hosts hvd.rank() = 0 (hvd.local_rank() = 0) and hvd.rank() = 1 (hvd.local_rank() = 1); node 2 hosts hvd.rank() = 2 (hvd.local_rank() = 0) and hvd.rank() = 3 (hvd.local_rank() = 1). hvd.size() = 4.
IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
WMLCE
and SLURM
integration
40IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
41
SLURM template example for 4x IBM AC922s
Batch AI
IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
42
SLURM template example for 4x IBM AC922s
Batch AI with Horovod and Pytorch
IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
43
MNIST example for 4x IBM AC922s
Batch AI with Horovod and Pytorch
IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
44
MNIST example for 4x IBM AC922s
Batch AI with Horovod and Pytorch
IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
45
MNIST example for 4x IBM AC922s
Batch AI with Horovod and Pytorch
IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
46
MNIST example for 4x IBM AC922s
Batch AI with Horovod and Pytorch
IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
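The SLURM and MNIST slides above are screenshots that did not survive extraction. As a stand-in, here is a compact hedged sketch of what a Horovod + PyTorch MNIST training script for such a setup typically looks like (standard horovod.torch and torchvision APIs; the model, hyper-parameters, and data path are assumptions). It would be launched with the horovodrun or ddlrun commands shown earlier, or from an sbatch script.

import torch
import torch.nn.functional as F
from torchvision import datasets, transforms
import horovod.torch as hvd

hvd.init()
torch.cuda.set_device(hvd.local_rank())  # one process per GPU, pinned by local rank

train_set = datasets.MNIST('data-%d' % hvd.rank(), train=True, download=True,
                           transform=transforms.ToTensor())
sampler = torch.utils.data.distributed.DistributedSampler(
    train_set, num_replicas=hvd.size(), rank=hvd.rank())  # each rank gets a different shard
loader = torch.utils.data.DataLoader(train_set, batch_size=64, sampler=sampler)

model = torch.nn.Sequential(torch.nn.Flatten(),
                            torch.nn.Linear(28 * 28, 128), torch.nn.ReLU(),
                            torch.nn.Linear(128, 10)).cuda()  # placeholder model

optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size(), momentum=0.9)
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())

hvd.broadcast_parameters(model.state_dict(), root_rank=0)  # identical starting weights
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

for epoch in range(5):
    sampler.set_epoch(epoch)
    for images, labels in loader:
        images, labels = images.cuda(), labels.cuda()
        optimizer.zero_grad()
        loss = F.cross_entropy(model(images), labels)
        loss.backward()   # gradients are averaged across ranks via AllReduce
        optimizer.step()
    if hvd.rank() == 0:
        print('epoch %d done, last loss %.4f' % (epoch, loss.item()))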
47IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
Thank you
48
Florin Manaila
—
florin.manaila@de.ibm.com
ibm.com
Editor's Notes
  • #10: Training models on GPU using Keras & Tensorflow is seamless. If you have an NVIDIA card and you have installed CUDA, the libraries will automatically detect it and use it for training. But what if you are a spoilt brat and you have multiple GPUs? Well unfortunately you will have to work a bit to achieve multi-GPU training.
  • #11: There are multiple ways to parallelise a network depending on what you want to achieve, but the two main approaches are model and data parallelization. The first can help you if your model is too complex to fit in a single GPU, while the latter helps when you want to speed up the execution. The main idea is that you pass your model through the method and it is copied across different GPUs. The original input is split into chunks which are fed to the various GPUs and then aggregated as a single output. This method can be used for achieving parallel training and predictions; nevertheless, keep in mind that for training it does not scale linearly with the number of GPUs due to the required synchronization.
  • #12: In synchronized data-parallel distributed deep learning, the major computation steps are: 1. Compute the gradient of the loss function using a minibatch on each GPU. 2. Compute the mean of the gradients by inter-GPU communication. 3. Update the model.