Distributed DNN Training
Infrastructure, Challenges, and Lessons learned
O’Reilly AI Conference 2018
Wee Hyong Tok
Principal Data Scientist Lead
Microsoft
@weehyong
Kaarthik Sivashanmugam
Principal Software Engineer
Microsoft
@kaarthikss
Overview
• A quick ramp-up on distributed deep learning
• Learn how to get started with distributed deep learning and the infrastructure behind it
• Learn about common pitfalls and how to avoid them
Before 2017: training ResNet-50 on a single NVIDIA M40 GPU took 14 days (about 10^18 single-precision operations).
2017:
• April (Facebook): ResNet-50 in 1 hour, using 32 CPUs and 256 NVIDIA P100 GPUs
• September (UC Berkeley, TACC, UC Davis): ResNet-50 in 31 minutes, using 1,600 CPUs
• November (Preferred Networks, ChainerMN): ResNet-50 in 15 minutes, using 1,024 P100 GPUs
Why Distributed Training?
You start with a first model and then build more models; over time, dataset size, model size, and model complexity all keep growing. Building larger models on larger datasets raises new questions:
• Will the model fit on 1 GPU?
• How should I partition the dataset?
Distributed DNN Training 101
Data Parallelism
1. Parallel training on different machines
2. Update the parameter server synchronously/asynchronously
3. Refresh the local model with the new parameters, go to 1 and repeat

Model Parallelism
1. The global model is partitioned into K sub-models without overlap.
2. The sub-models are distributed over K local workers and serve as their local models.
3. In each mini-batch, the local workers compute the gradients of their local weights by backpropagation.
(A sketch of both schemes follows below.)
Credits: Taifeng Wang, DMTK team
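To make the contrast concrete, here is a minimal NumPy sketch of how the data and the model are split under the two schemes. Shapes and helper names are made up for illustration; this is not the implementation of any particular toolkit.

import numpy as np

# Data parallelism: every worker holds a full copy of the model and
# trains on its own shard of each mini-batch.
K = 4                                       # number of workers
model = np.zeros(100)                       # full model, replicated on each worker
batch = np.random.rand(256, 100)            # one global mini-batch
shards = np.array_split(batch, K)           # one shard per worker

def local_gradient(model, shard):
    # placeholder for a forward/backward pass on one worker
    return shard.mean(axis=0) - model

grads = [local_gradient(model, s) for s in shards]  # computed in parallel in practice
model += 0.01 * np.mean(grads, axis=0)      # parameter server averages and updates

# Model parallelism: the model itself is partitioned into K sub-models
# without overlap; each worker holds only its own layers.
layers = [np.random.rand(100, 100) for _ in range(8)]
sub_models = [layers[i::K] for i in range(K)]        # layer assignment per worker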
tf.train.ClusterSpec({
  "local": [
    "localhost:2222",
    "localhost:2223"
]})
Credits: https://www.tensorflow.org/deploy/distributed
Tasks:
/job:local/task:0
/job:local/task:1
tf.train.ClusterSpec({
"worker": [
"worker0.example.com:2222",
"worker1.example.com:2222",
"worker2.example.com:2222"
],
"ps": [
"ps0.example.com:2222",
"ps1.example.com:2222"
]})
Credits: https://www.tensorflow.org/deploy/distributed
Tasks:
/job:worker/task:0
/job:worker/task:1
/job:worker/task:2
/job:ps/task:0
/job:ps/task:1
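In the TensorFlow 1.x distributed runtime shown above, each task in the ClusterSpec is backed by a server process. The following is a minimal sketch; the task identity is hard-coded here, whereas in practice it would come from command-line flags.

import tensorflow as tf  # TensorFlow 1.x API, as used in these slides

cluster = tf.train.ClusterSpec({
    "worker": ["worker0.example.com:2222",
               "worker1.example.com:2222",
               "worker2.example.com:2222"],
    "ps": ["ps0.example.com:2222",
           "ps1.example.com:2222"]
})

# Every process runs the same script and identifies itself by job name
# and task index, e.g. this process is /job:worker/task:0.
job_name, task_index = "worker", 0
server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)

if job_name == "ps":
    server.join()  # parameter servers block here and serve variables
# Workers continue to build the graph and run training, as in the next snippet.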
with tf.device("/job:ps/task:0"):
  weights_1 = tf.Variable(...)
  biases_1 = tf.Variable(...)

with tf.device("/job:ps/task:1"):
  weights_2 = tf.Variable(...)
  biases_2 = tf.Variable(...)

with tf.device("/job:worker/task:7"):
  input, labels = ...
  layer_1 = tf.nn.relu(tf.matmul(input, weights_1) + biases_1)
  logits = tf.nn.relu(tf.matmul(layer_1, weights_2) + biases_2)
  # ...
  train_op = ...

with tf.Session("grpc://worker7.example.com:2222") as sess:
  for _ in range(10000):
    sess.run(train_op)
Create variables on 2 tasks
Run compute-intensive part of
the model in the worker job
Credits: https://www.tensorflow.org/deploy/distributed
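Pinning each variable to a parameter-server task by hand, as above, gets tedious for real models. A common alternative in the same TensorFlow 1.x API is tf.train.replica_device_setter, which places variables on the ps tasks round-robin while keeping the compute ops on the worker. A rough sketch with made-up shapes:

import tensorflow as tf  # TensorFlow 1.x API

cluster = tf.train.ClusterSpec({
    "worker": ["worker0.example.com:2222"],
    "ps": ["ps0.example.com:2222", "ps1.example.com:2222"]
})

# Variables are assigned to /job:ps tasks round-robin; ops stay on the worker.
with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:0", cluster=cluster)):
    weights_1 = tf.Variable(tf.random_normal([784, 256]))
    biases_1 = tf.Variable(tf.zeros([256]))
    weights_2 = tf.Variable(tf.random_normal([256, 10]))
    biases_2 = tf.Variable(tf.zeros([10]))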
Distributed Training (K80)
Src: https://bit.ly/2qWbqrU
ImageNet, 8x NVIDIA Tesla K80, CUDA/cuDNN: 8.0 / 5.1
1.0TB EFS (burst 100MB/sec for 12 hours, continuous 50MB/sec)
Batch size per GPU: InceptionV3 64, ResNet-50 64, ResNet-152 32
variable_update: distributed_replicated, cross_replica_sync: True
4 to 6 parameter servers, 64 GPUs
Distributed Training (DGX-1)
Src: https://bit.ly/2qWbqrU
ImageNet, 8 GPUs: 8x NVIDIA Tesla P100, CUDA/cuDNN: 8.0 / 5.1
1.0TB EFS (burst 100MB/sec for 12 hours, continuous 50MB/sec)
Batch size per GPU: InceptionV3 64, ResNet-50 64, ResNet-152 64
Distributed TensorFlow
Src: https://arxiv.org/abs/1802.05799
From 1 to 128 NVIDIA Pascal GPUs
Typical Steps
• Training involves an iterative method for minimizing an objective function
• Gradient computation is performed in each iteration
• Each iteration uses a random subset of the input data
• Iterations are parallelizable
• Large scale (many training samples, many model parameters) requires distributed training (see the sketch below)
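These steps correspond to the standard mini-batch loop sketched below in NumPy, with a placeholder gradient function purely for illustration; it is the per-iteration gradient computation that gets parallelized in distributed training.

import numpy as np

def objective_gradient(params, batch):
    # placeholder for backpropagation on one mini-batch
    return batch.mean(axis=0) - params

params = np.zeros(10)
data = np.random.rand(100000, 10)  # training samples
learning_rate = 0.01

for step in range(1000):
    idx = np.random.choice(len(data), size=256, replace=False)  # random subset
    grad = objective_gradient(params, data[idx])                # gradient computation
    params -= learning_rate * grad                              # model update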
Deep Learning - Illustration
On a single worker the training loop is: 1. Read data; 2. Compute model updates (gradients); 3. Update the model.
Distributed Deep Learning - Data Parallelism
Each worker runs the same loop on its own data: 1. Read data; 2. Compute model updates (gradients); 3. Average the gradients across workers; then update the model with the averaged gradients.
Synchronous: gradients for different batches are computed on each node and averaged across nodes (an allreduce-based sketch follows below).
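One way to implement this synchronous averaging without parameter servers is the allreduce approach described in the Horovod paper cited on the Distributed TensorFlow slide. The following is a minimal sketch with a toy model and loss, not the presenters' setup; the optimizer wrapper averages gradients across all workers on every step.

import tensorflow as tf            # TensorFlow 1.x API
import horovod.tensorflow as hvd   # Horovod, as described in arXiv:1802.05799

hvd.init()

# Typical layout: one process per GPU, pinned via the local rank.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Toy model and loss, just to keep the sketch self-contained.
x = tf.random_normal([64, 100])
w = tf.Variable(tf.zeros([100, 1]))
loss = tf.reduce_mean(tf.square(tf.matmul(x, w) - 1.0))

opt = tf.train.GradientDescentOptimizer(0.01 * hvd.size())  # scale LR with worker count
opt = hvd.DistributedOptimizer(opt)   # allreduce-averages gradients across workers
train_op = opt.minimize(loss)

# Start every worker from identical weights.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]
with tf.train.MonitoredTrainingSession(hooks=hooks, config=config) as sess:
    for _ in range(100):
        sess.run(train_op)

Processes are typically launched one per GPU (for example with mpirun), so no parameter servers are involved.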
Distributed Deep Learning
Workers:
• Process training data
• Compute gradients
• Send the gradients to parameter servers
Parameter servers:
• Compute averages of gradients
Two layouts are common: a single parameter server that averages all gradients sent from the workers, or several parameter servers (Parameter Server 1, 2, 3) where each one averages only its own part of the gradient (sketched below).
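A rough NumPy sketch of the second layout, with made-up sizes: each parameter server owns one shard of the parameter vector and averages only the corresponding slice of every worker's gradient.

import numpy as np

num_workers, num_ps = 3, 3
grad_size = 12

# Each worker computes a full gradient on its own mini-batch.
worker_grads = [np.random.rand(grad_size) for _ in range(num_workers)]

# Each parameter server averages only its own shard of the gradient.
shards = np.array_split(np.arange(grad_size), num_ps)
averaged_shards = [np.mean([g[s] for g in worker_grads], axis=0) for s in shards]

# Logically, the concatenation of the shards is the fully averaged gradient.
averaged_grad = np.concatenate(averaged_shards)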
Reality 1 – Ratio of workers to parameter
servers is important
Reality 2 – Increasing complexity when moving
towards distributed deep learning training
From Single Node/Device to Multiple Nodes/Devices
Saturate a single device/node first before distributing.
Important to:
• Understand how to scale your model
• Maximize CPU / GPU utilization (a single-node multi-GPU sketch follows below)

Nodes vs. devices:
• Single node, single device: simple use cases
• Single node, multiple devices: sufficient for most use cases
• Multiple nodes, single device each: X (rarely makes sense)
• Multiple nodes, multiple devices: XXL use cases
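Before going multi-node, a single node with several GPUs can often be saturated using simple in-graph "towers". A rough TensorFlow 1.x sketch with a toy model and one shared variable, for illustration only:

import tensorflow as tf  # TensorFlow 1.x API

num_gpus = 4
opt = tf.train.GradientDescentOptimizer(0.01)
w = tf.get_variable("w", [100, 1], initializer=tf.zeros_initializer())

tower_grads = []
for i in range(num_gpus):
    with tf.device("/gpu:%d" % i):        # one "tower" per GPU
        x = tf.random_normal([64, 100])   # stand-in for this GPU's share of the batch
        loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))
        tower_grads.append(tf.gradients(loss, [w])[0])

# Average the per-GPU gradients and apply one update to the shared weights.
avg_grad = tf.reduce_mean(tf.stack(tower_grads), axis=0)
train_op = opt.apply_gradients([(avg_grad, w)])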
Distributed Deep Learning – Get Started!
• Cognitive Toolkit (CNTK): https://bit.ly/2HWNuwP
• TF Jobs
• And more ...
Optimizing Distributed Deep Learning
The "physics" that constrains distributed training: CPU, GPU, memory, network, storage, and the deep learning library.
Running Deep Learning
Infrastructure @ Scale
Infrastructure – Background
Build a deep learning infrastructure that supports hundreds of users: data scientists, researchers, and developers. Lots of jobs! Different toolkits!
Infrastructure - Background
• 1st generation infra built from the ground up when GPUs were not available in any public cloud
• Ever-increasing demand for GPUs (millions of hours of GPU time)
• Maximize GPU utilization and minimize wasted cycles
Infrastructure Considerations
• Single vs. multi-tenancy
• Multitenancy scope (cluster-level, node-level, GPUs)
• Tenant isolation
• Workload device requirements
• CPU, GPU, other hardware accelerators
• Reuse high-performance CPU cluster built for big data jobs
• Connectivity (nodes, devices)
GPU Device Interconnect
• NVLink
• GPUDirect RDMA
• GPUDirect P2P
Interconnect topology referenced in Project
Olympus
Credits: CUDA-MPI Blog (https://bit.ly/2KnmN58)
From CUDA to NCCL 1 to NCCL 2
NCCL is a multi-GPU communication library. The progression runs from multi-core CPU to GPU (CUDA), to multi-GPU within a node (NCCL 1), to multi-GPU across multiple nodes (NCCL 2).
Credits: NCCL Tutorial (https://bit.ly/2KpPP44)
NCCL 2.x (multi-node)
Credits: NCCL Tutorial (https://bit.ly/2KpPP44)
Infrastructure Considerations (contd.)
• Data storage
• Co-locate Data Engineering storage infrastructure (cluster-local)
• File reads are sequential
• Toolkit support for HDFS (reading from HDFS does not mean data-locality-aware computation)
• Mountable network-attached storage
• Job scheduler
• Support for GPU (and other accelerators)
• Cluster sharing with other types of jobs
• Sharing ratio among teams
• Support for Docker containers
Infrastructure Challenges
• Optimal node placement / device allocation
• Pack and minimize distribution
• Simple resource counting vs. interconnect-aware scheduling
• Repacking (if possible) would be good to have
• Static nature of job topology in deep learning
• Gang scheduling (suboptimal workarounds)
• Heterogeneity of resources & portability of workloads
• Support different generations of GPU devices in the same cluster
Infrastructure Challenges (contd.)
• Data lifecycle integration
• Consumer of data and producer of models
• Workflows for dataflow, model reuse, etc.
• Abstracting implementation details in end-to-end experience
Reality 3 – GPU memory errors happen!
What we learned on DNN Infrastructure
• Surprising rate of memory/ECC issues with GPUs
• Job schedulers used for big data jobs are not a great fit for training jobs
• Multitenancy is hard to get right but XXL clusters could greatly benefit
from it
• No one size fits all when it comes to task placement, hardware needs, toolkit selection, optimization techniques
• Need end-user tools for debugging/troubleshooting, profiling
Optimizing Distributed Deep Learning
Recap of the "physics": CPU, GPU, memory, network, storage, and the deep learning library.
Summary
Distributed TensorFlow
Src: https://arxiv.org/abs/1802.05799
Going from 1 to 128 NVIDIA Pascal GPUs
Distributed DNN Training
Infrastructure, Challenges, and
Lessons learned
O’Reilly AI Conference 2018
Wee Hyong Tok
Principal Data Scientist Lead
Microsoft
@weehyong
Kaarthik Sivashanmugam
Principal Software Engineer
Microsoft
@kaarthikss
Thank you