Distributed DNN Training
Infrastructure, Challenges, and Lessons learned
O’Reilly AI Conference 2018
Wee Hyong Tok
Principal Data Scientist Lead
Microsoft
@weehyong
Kaarthik Sivashanmugam
Principal Software Engineer
Microsoft
@kaarthikss
Overview
• A quick ramp-up on distributed deep learning
• Learn how to get started with distributed deep learning and the infrastructure behind it
• Learn about common pitfalls and how to avoid them
Before 2017: training ResNet-50 on a single NVIDIA M40 GPU took 14 days (about 10^18 single-precision operations).
2017:
• April (Facebook): ResNet-50 in 1 hour, using 32 CPUs and 256 NVIDIA P100 GPUs
• September (UC Berkeley, TACC, UC Davis): ResNet-50 in 31 minutes, using 1,600 CPUs
• November (Preferred Networks, ChainerMN): ResNet-50 in 15 minutes, using 1,024 P100 GPUs
Why Distributed Training?
You start with a first model and then build more models; over time, dataset size, model size, and model complexity all keep growing. Building larger models on larger datasets raises new questions:
• Will the model fit on 1 GPU?
• How should I partition the dataset?
Distributed DNN Training 101
Data Parallelism
1. Parallel training on different machines
2. Update the parameter server synchronously/asynchronously
3. Refresh the local model with the new parameters, go to 1 and repeat

Model Parallelism
1. The global model is partitioned into K sub-models without overlap.
2. The sub-models are distributed over K local workers and serve as their local models.
3. In each mini-batch, the local workers compute the gradients of their local weights by backpropagation.
(A sketch of both schemes follows below.)
Credits: Taifeng Wang, DMTK team
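To make the contrast concrete, here is a minimal NumPy sketch of how the data and the model are split under the two schemes. Shapes and helper names are made up for illustration; this is not the implementation of any particular toolkit.

import numpy as np

# Data parallelism: every worker holds a full copy of the model and
# trains on its own shard of each mini-batch.
K = 4                                       # number of workers
model = np.zeros(100)                       # full model, replicated on each worker
batch = np.random.rand(256, 100)            # one global mini-batch
shards = np.array_split(batch, K)           # one shard per worker

def local_gradient(model, shard):
    # placeholder for a forward/backward pass on one worker
    return shard.mean(axis=0) - model

grads = [local_gradient(model, s) for s in shards]  # computed in parallel in practice
model += 0.01 * np.mean(grads, axis=0)      # parameter server averages and updates

# Model parallelism: the model itself is partitioned into K sub-models
# without overlap; each worker holds only its own layers.
layers = [np.random.rand(100, 100) for _ in range(8)]
sub_models = [layers[i::K] for i in range(K)]        # layer assignment per worker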
tf.train.ClusterSpec({
  "local": [
    "localhost:2222",
    "localhost:2223"
]})
Credits: https://www.tensorflow.org/deploy/distributed
Tasks:
/job:local/task:0
/job:local/task:1
tf.train.ClusterSpec({
"worker": [
"worker0.example.com:2222",
"worker1.example.com:2222",
"worker2.example.com:2222"
],
"ps": [
"ps0.example.com:2222",
"ps1.example.com:2222"
]})
Credits: https://www.tensorflow.org/deploy/distributed
Tasks:
/job:worker/task:0
/job:worker/task:1
/job:worker/task:2
/job:ps/task:0
/job:ps/task:1
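In the TensorFlow 1.x distributed runtime shown above, each task in the ClusterSpec is backed by a server process. The following is a minimal sketch; the task identity is hard-coded here, whereas in practice it would come from command-line flags.

import tensorflow as tf  # TensorFlow 1.x API, as used in these slides

cluster = tf.train.ClusterSpec({
    "worker": ["worker0.example.com:2222",
               "worker1.example.com:2222",
               "worker2.example.com:2222"],
    "ps": ["ps0.example.com:2222",
           "ps1.example.com:2222"]
})

# Every process runs the same script and identifies itself by job name
# and task index, e.g. this process is /job:worker/task:0.
job_name, task_index = "worker", 0
server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)

if job_name == "ps":
    server.join()  # parameter servers block here and serve variables
# Workers continue to build the graph and run training, as in the next snippet.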
with tf.device("/job:ps/task:0"):
  weights_1 = tf.Variable(...)
  biases_1 = tf.Variable(...)

with tf.device("/job:ps/task:1"):
  weights_2 = tf.Variable(...)
  biases_2 = tf.Variable(...)

with tf.device("/job:worker/task:7"):
  input, labels = ...
  layer_1 = tf.nn.relu(tf.matmul(input, weights_1) + biases_1)
  logits = tf.nn.relu(tf.matmul(layer_1, weights_2) + biases_2)
  # ...
  train_op = ...

with tf.Session("grpc://worker7.example.com:2222") as sess:
  for _ in range(10000):
    sess.run(train_op)
Create variables on 2 tasks
Run compute-intensive part of
the model in the worker job
Credits: https://www.tensorflow.org/deploy/distributed
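Pinning each variable to a parameter-server task by hand, as above, gets tedious for real models. A common alternative in the same TensorFlow 1.x API is tf.train.replica_device_setter, which places variables on the ps tasks round-robin while keeping the compute ops on the worker. A rough sketch with made-up shapes:

import tensorflow as tf  # TensorFlow 1.x API

cluster = tf.train.ClusterSpec({
    "worker": ["worker0.example.com:2222"],
    "ps": ["ps0.example.com:2222", "ps1.example.com:2222"]
})

# Variables are assigned to /job:ps tasks round-robin; ops stay on the worker.
with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:0", cluster=cluster)):
    weights_1 = tf.Variable(tf.random_normal([784, 256]))
    biases_1 = tf.Variable(tf.zeros([256]))
    weights_2 = tf.Variable(tf.random_normal([256, 10]))
    biases_2 = tf.Variable(tf.zeros([10]))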
Distributed Training (K80)
Src: https://bit.ly/2qWbqrU
ImageNet, 8x NVIDIA Tesla K80, CUDA/cuDNN: 8.0 / 5.1
1.0TB EFS (burst 100MB/sec for 12 hours, continuous 50MB/sec)
Batch size per GPU: InceptionV3 64, ResNet-50 64, ResNet-152 32
variable_update: distributed_replicated, cross_replica_sync: True
4 to 6 parameter servers, 64 GPUs
Distributed Training (DGX-1)
Src: https://bit.ly/2qWbqrU
ImageNet, 8 GPUs: 8x NVIDIA Tesla P100, CUDA/cuDNN: 8.0 / 5.1
1.0TB EFS (burst 100MB/sec for 12 hours, continuous 50MB/sec)
Batch size per GPU: InceptionV3 64, ResNet-50 64, ResNet-152 64
Distributed TensorFlow
Src: https://arxiv.org/abs/1802.05799
From 1 to 128 NVIDIA Pascal GPUs
Typical Steps
• Training involves an iterative method for minimizing an objective function
• Gradient computation is performed in each iteration
• Each iteration uses a random subset of the input data
• Iterations are parallelizable
• Large scale (many training samples, many model parameters) requires distributed training (see the sketch below)
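These steps correspond to the standard mini-batch loop sketched below in NumPy, with a placeholder gradient function purely for illustration; it is the per-iteration gradient computation that gets parallelized in distributed training.

import numpy as np

def objective_gradient(params, batch):
    # placeholder for backpropagation on one mini-batch
    return batch.mean(axis=0) - params

params = np.zeros(10)
data = np.random.rand(100000, 10)  # training samples
learning_rate = 0.01

for step in range(1000):
    idx = np.random.choice(len(data), size=256, replace=False)  # random subset
    grad = objective_gradient(params, data[idx])                # gradient computation
    params -= learning_rate * grad                              # model update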
Deep Learning - Illustration
On a single worker the training loop is: 1. Read data; 2. Compute model updates (gradients); 3. Update the model.
Distributed Deep Learning - Data Parallelism
Each worker runs the same loop on its own data: 1. Read data; 2. Compute model updates (gradients); 3. Average the gradients across workers; then update the model with the averaged gradients.
Synchronous: gradients for different batches are computed on each node and averaged across nodes (an allreduce-based sketch follows below).
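One way to implement this synchronous averaging without parameter servers is the allreduce approach described in the Horovod paper cited on the Distributed TensorFlow slide. The following is a minimal sketch with a toy model and loss, not the presenters' setup; the optimizer wrapper averages gradients across all workers on every step.

import tensorflow as tf            # TensorFlow 1.x API
import horovod.tensorflow as hvd   # Horovod, as described in arXiv:1802.05799

hvd.init()

# Typical layout: one process per GPU, pinned via the local rank.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Toy model and loss, just to keep the sketch self-contained.
x = tf.random_normal([64, 100])
w = tf.Variable(tf.zeros([100, 1]))
loss = tf.reduce_mean(tf.square(tf.matmul(x, w) - 1.0))

opt = tf.train.GradientDescentOptimizer(0.01 * hvd.size())  # scale LR with worker count
opt = hvd.DistributedOptimizer(opt)   # allreduce-averages gradients across workers
train_op = opt.minimize(loss)

# Start every worker from identical weights.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]
with tf.train.MonitoredTrainingSession(hooks=hooks, config=config) as sess:
    for _ in range(100):
        sess.run(train_op)

Processes are typically launched one per GPU (for example with mpirun), so no parameter servers are involved.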
Distributed Deep Learning
Workers:
• Process training data
• Compute gradients
• Send the gradients to parameter servers
Parameter servers:
• Compute averages of gradients
Two layouts are common: a single parameter server that averages all gradients sent from the workers, or several parameter servers (Parameter Server 1, 2, 3) where each one averages only its own part of the gradient (sketched below).
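A rough NumPy sketch of the second layout, with made-up sizes: each parameter server owns one shard of the parameter vector and averages only the corresponding slice of every worker's gradient.

import numpy as np

num_workers, num_ps = 3, 3
grad_size = 12

# Each worker computes a full gradient on its own mini-batch.
worker_grads = [np.random.rand(grad_size) for _ in range(num_workers)]

# Each parameter server averages only its own shard of the gradient.
shards = np.array_split(np.arange(grad_size), num_ps)
averaged_shards = [np.mean([g[s] for g in worker_grads], axis=0) for s in shards]

# Logically, the concatenation of the shards is the fully averaged gradient.
averaged_grad = np.concatenate(averaged_shards)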
Reality 1 – Ratio of workers to parameter
servers is important
Reality 2 – Increasing complexity when moving
towards distributed deep learning training
From Single Node/Device to Multiple Nodes/Devices
Saturate a single device/node first before distributing.
Important to:
• Understand how to scale your model
• Maximize CPU / GPU utilization (a single-node multi-GPU sketch follows below)

Nodes vs. devices:
• Single node, single device: simple use cases
• Single node, multiple devices: sufficient for most use cases
• Multiple nodes, single device each: X (rarely makes sense)
• Multiple nodes, multiple devices: XXL use cases
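Before going multi-node, a single node with several GPUs can often be saturated using simple in-graph "towers". A rough TensorFlow 1.x sketch with a toy model and one shared variable, for illustration only:

import tensorflow as tf  # TensorFlow 1.x API

num_gpus = 4
opt = tf.train.GradientDescentOptimizer(0.01)
w = tf.get_variable("w", [100, 1], initializer=tf.zeros_initializer())

tower_grads = []
for i in range(num_gpus):
    with tf.device("/gpu:%d" % i):        # one "tower" per GPU
        x = tf.random_normal([64, 100])   # stand-in for this GPU's share of the batch
        loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))
        tower_grads.append(tf.gradients(loss, [w])[0])

# Average the per-GPU gradients and apply one update to the shared weights.
avg_grad = tf.reduce_mean(tf.stack(tower_grads), axis=0)
train_op = opt.apply_gradients([(avg_grad, w)])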
Distributed Deep Learning – Get Started!
• Cognitive Toolkit (CNTK): https://bit.ly/2HWNuwP
• TF Jobs
• And more ...
Optimizing Distributed Deep Learning
The "physics" that constrains distributed training: CPU, GPU, memory, network, storage, and the deep learning library.
Running Deep Learning
Infrastructure @ Scale
Infrastructure – Background
Build a deep learning infrastructure that supports hundreds of users: data scientists, researchers, and developers. Lots of jobs! Different toolkits!
Infrastructure - Background
• 1st generation infra built from the ground up when GPUs were not available in any public cloud
• Ever-increasing demand for GPUs (millions of hours of GPU time)
• Maximize GPU utilization and minimize wasted cycles
Infrastructure Considerations
• Single vs. multi-tenancy
• Multitenancy scope (cluster-level, node-level, GPUs)
• Tenant isolation
• Workload device requirements
• CPU, GPU, other hardware accelerators
• Reuse high-performance CPU cluster built for big data jobs
• Connectivity (nodes, devices)
GPU Device Interconnect
• NVLink
• GPUDirect RDMA
• GPUDirect P2P
Interconnect topology referenced in Project
Olympus
Credits: CUDA-MPI Blog (https://bit.ly/2KnmN58)
From CUDA to NCCL 1 to NCCL 2
NCCL is a multi-GPU communication library. The progression runs from multi-core CPU to GPU (CUDA), to multi-GPU within a node (NCCL 1), to multi-GPU across multiple nodes (NCCL 2).
Credits: NCCL Tutorial (https://bit.ly/2KpPP44)
NCCL 2.x (multi-node)
Credits: NCCL Tutorial (https://bit.ly/2KpPP44)
Infrastructure Considerations (contd.)
• Data storage
• Co-locate Data Engineering storage infrastructure (cluster-local)
• File reads are sequential
• Toolkit support for HDFS (reading from HDFS does not mean data-locality-aware computation)
• Mountable network-attached storage
• Job scheduler
• Support for GPU (and other accelerators)
• Cluster sharing with other types of jobs
• Sharing ratio among teams
• Support for Docker containers
Infrastructure Challenges
• Optimal node placement / device allocation
• Pack and minimize distribution
• Simple resource counting vs. interconnect-aware scheduling
• Repacking (if possible) would be good to have
• Static nature of job topology in deep learning
• Gang scheduling (suboptimal workarounds)
• Heterogeneity of resources & portability of workloads
• Support different generations of GPU devices in the same cluster
Infrastructure Challenges (contd.)
• Data lifecycle integration
• Consumer of data and producer of models
• Workflows for dataflow, model reuse, etc.
• Abstracting implementation details in end-to-end experience
Reality 3 – GPU memory errors happen!
What we learned on DNN Infrastructure
• Surprising rate of memory/ECC issues with GPUs
• Job schedulers used for big data jobs are not a great fit for training jobs
• Multitenancy is hard to get right but XXL clusters could greatly benefit
from it
• No one size fits all when it comes to task placement, hardware needs, toolkit selection, optimization techniques
• Need end-user tools for debugging/troubleshooting, profiling
Optimizing Distributed Deep Learning
Recap of the "physics": CPU, GPU, memory, network, storage, and the deep learning library.
Summary
Distributed TensorFlow
Src: https://arxiv.org/abs/1802.05799
Going from 1 to 128 NVIDIA Pascal GPUs
Distributed DNN Training
Infrastructure, Challenges, and
Lessons learned
O’Reilly AI Conference 2018
Wee Hyong Tok
Principal Data Scientist Lead
Microsoft
@weehyong
Kaarthik Sivashanmugam
Principal Software Engineer
Microsoft
@kaarthikss
Thank you