Pipeline Parallel Training of
Large-scale Neural Network
Changjiang Gou
Zhejiang Lab
January 2022
1
Agenda
1. Introduction
2. Fundamentals
3. Core Techniques
4. Evaluation on BERT
5. To be continued
6. Closing notes
2
Why do we need it?
Compared to data parallelism, the most prevalent approach, Pipeline Parallelism (PP) can:
1. Train models too large for a single device
2. Lower communication overhead (by around 90%)
3. Overlap computation and communication
Introduction
But a naive implementation suffers from:
1. Idle devices
2. Low throughput
3. State staleness
3
Fundamentals
Partition the NN (layers l1 ... l5) into several stages s1 ... s4, each a contiguous sequence of layers, and assign each stage to a device.
All devices run different tasks on different data streams.
[Pipeline schedule diagram: forward (F) and backward (B) passes of successive micro-batches across devices d1-d4 over time]
4
[Diagram: per-device memory (number of stashed activations, 0-3) and computation over time for devices d1-d4]
[Pie chart: memory consumption of a Transformer - Model 5%, Optimizer 16%, Activations 79%]
[Pipeline schedule diagram repeated: F and B passes across devices d1-d4 over time]
Fundamentals
[Coarse-grain computation graph: forward nodes f1 ... f5 with weights w1 ... w5, a loss node, and backward nodes b1 ... b5]
5
Core Techniques
[GPipe schedule diagram: F and B passes of 4 micro-batches across devices d1-d4, with a bubble between the forward and backward phases]
NeurIPS19, GPipe
• Micro-batching
Divide a mini-batch of size N into M micro-batches; at the end of the
mini-batch, gradients are accumulated and applied to update the
parameters.
• Gradient checkpointing
Each device stores only the output activations at stage boundaries.
During the backward pass, it re-computes the forward function
(sub-linear memory cost). Both ideas are sketched below.
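A minimal sketch of both ideas on a single device, assuming `model` is an nn.Sequential and a recent PyTorch; an illustration, not GPipe's actual API:

```python
import torch
from torch.utils.checkpoint import checkpoint_sequential

def train_step(model, loss_fn, optimizer, batch_x, batch_y, num_microbatches=4):
    """One mini-batch = num_microbatches micro-batches, one optimizer update."""
    optimizer.zero_grad()
    for x, y in zip(batch_x.chunk(num_microbatches),
                    batch_y.chunk(num_microbatches)):
        # Gradient checkpointing: keep only segment-boundary activations and
        # re-run the forward pass during backward (sub-linear memory cost).
        out = checkpoint_sequential(model, 2, x, use_reentrant=False)
        loss = loss_fn(out, y) / num_microbatches  # scale so gradients average
        loss.backward()                            # accumulates into .grad
    optimizer.step()                               # one update per mini-batch
```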
[Pie chart: time consumption - computation dominates at 67%; the remainder (23%, 4%, 3%, 3%) is split among weight update, recompute, load imbalance, bubble, and setup]
6
SOSP19, PipeDream
[1F1B schedule diagram: after a warm-up phase, each device d1-d4 alternates one forward (F) and one backward (B) pass per micro-batch, largely eliminating the idle slots ("improvement")]
• 1F1B: one-forward-one-backward
To eliminate idle slots, each device alternates between forward and backward computation (see the sketch below).
• Weight stashing to reduce staleness
• Discrepancy in weight versions can prevent the model from converging
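A toy sketch of the 1F1B order on one stage, assuming hypothetical `stage.recv_activation`/`stage.recv_grad`/`stage.forward`/`stage.backward` helpers; real systems overlap these with compute via asynchronous p2p:

```python
def run_stage_1f1b(stage, num_microbatches, num_stages, stage_id):
    warmup = num_stages - stage_id - 1   # forwards before the first backward
    fwd_cache = []                       # activations stashed for backward
    for _ in range(warmup):                      # warm-up: forward only
        fwd_cache.append(stage.forward(stage.recv_activation()))
    for _ in range(num_microbatches - warmup):   # steady state: one F, one B
        fwd_cache.append(stage.forward(stage.recv_activation()))
        stage.backward(fwd_cache.pop(0), stage.recv_grad())
    for _ in range(warmup):                      # cool-down: drain backwards
        stage.backward(fwd_cache.pop(0), stage.recv_grad())
```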
Core Techniques
7
SOSP19, PipeDream
• 1F1B: one-forward-one-backward
To eliminate idle slots, each device alternates between forward and backward computation.
• Weight stashing to reduce staleness
• Discrepancy in weight versions can prevent the model from converging:
for instance, d1 starts the F computation on the green micro-batch at time t1 and the B computation at time t2, but the weights have already been updated at times t3 and t4 in between.
Core Techniques
[1F1B schedule diagram, annotated with the times t1-t4 referenced above]
8
SOSP19, PipeDream
• Weight stashing to reduce staleness
Weight stashing: for a given micro-batch (denoted by colors), each stage maintains its own version of the latest weights, which is used for both its F and B computation.
An instance: at t1, device d1 uses the weights updated by the yellow micro-batch and stores this version until it is used again at t2; for device d2, the weights used at time t3 were just updated by the blue micro-batch and are kept only until time t2.
Shortcoming: weight inconsistency across stages. A minimal sketch of the version store follows.
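A minimal sketch of per-stage weight stashing, where `apply_layer`, `compute_grad`, and `sgd_update` are hypothetical helpers over a plain dict of tensors; not PipeDream's actual code:

```python
import copy

class StashingStage:
    def __init__(self, weights):
        self.weights = weights   # latest version, updated after each backward
        self.stash = {}          # micro-batch id -> version used in its forward

    def forward(self, mb_id, x):
        # Record exactly the weights that produce these activations.
        self.stash[mb_id] = copy.deepcopy(self.weights)
        return apply_layer(self.weights, x)

    def backward(self, mb_id, grad_in):
        # Reuse the stashed version, so F and B see identical weights.
        stashed = self.stash.pop(mb_id)
        grad = compute_grad(stashed, grad_in)
        sgd_update(self.weights, grad)   # only the latest version is updated
        return grad
```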
Core Techniques
[1F1B schedule diagram, annotated with the weight-stashing times t1-t4 referenced above]
9
SOSP19, PipeDream
• Weight stashing to reduce staleness
Vertical sync: for a given micro-batch, only the input stage picks up the latest weights; the version used is then propagated along with the activations and gradients (a p2p operation).
An instance: at t1, device d1 uses the weights updated by the yellow micro-batch and stores this version until it is used again at t2; for device d2, the weights used at time t3 are the ones forwarded from d1, and are kept only until time t2.
Each device still stashes several versions of the weights.
Core Techniques
[1F1B schedule diagram, annotated with the vertical-sync times t1-t4 referenced above]
10
ICML21, PipeDream-2BW
[2BW schedule diagram: 1F1B pipeline across devices d1-d4, annotated with the times t1-t4 at which weight versions are created and retired]
• Double-buffered weight update
Each device stashes at most 2 versions of the weights, reducing the memory footprint (see the sketch after this list).
An instance:
1. at t1, a training period starts with weights W1;
2. at t2, another training period starts with W2;
3. at t3, training with W1 terminates on a mini-batch consisting of 4 micro-batches, and W1 is discarded;
4. at t4, a third period starts with the weights just updated at t3;
5. only 2 versions are needed at any time!
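A minimal sketch of the two-version buffer; weight versions are opaque objects here, and the eviction rule is simplified (a real implementation waits until no in-flight micro-batch references the old version):

```python
class DoubleBufferedWeights:
    def __init__(self, w0):
        self.versions = [w0]        # at most two live versions

    def latest(self):
        return self.versions[-1]    # newly injected micro-batches start here

    def oldest(self):
        return self.versions[0]     # in-flight micro-batches finish here

    def commit(self, new_weights):
        # A mini-batch of M micro-batches finished: publish the new version
        # and discard the oldest once nothing in flight references it.
        self.versions.append(new_weights)
        if len(self.versions) > 2:
            self.versions.pop(0)
```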
Core Techniques
[2BW schedule detail: F and B slots colored by which of the two buffered weight versions they use, annotated with t1-t4]
11
SC20, GEMS
Core Techniques
[GEMS schedule diagram: two model replicas; stages s1-s4 of replica 1 and the reversed stages s4-s1 of replica 2 are mapped onto devices d1-d4]
Model parallelism with a second model replica to increase memory efficiency
[Timeline: F and B passes of the two replicas interleaved; memory shown as the sum over all devices]
13
SC20, GEMS
Core Techniques
Model parallelism with a second model replica to increase memory efficiency
[Schedule diagram repeated: interleaved F and B passes of the two replicas]
It is designed for:
• Extremely large DNNs, e.g., ResNet-1k with 1000 layers, which consume huge amounts of memory
• High-resolution inputs, where even batch size 1 is large enough that micro-batching is not feasible
Example:
• High-resolution histopathology images of 100,000 x 100,000 pixels.
14
SC21, Chimera
Core Techniques
[Bidirectional pipeline diagrams: a "down" pipeline and an "up" pipeline (model replicas 1 and 2), each running micro-batches through stages s1-s4 mapped in opposite orders onto devices d1-d4; combined, the bubbles of one pipeline are filled by the other]
A placement sketch follows.
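A tiny sketch of the bidirectional stage placement, assuming D stages on D devices: the down pipeline maps stage i to device i, the up pipeline maps stage i to device D-1-i, so each device hosts one stage of each replica:

```python
def chimera_placement(num_devices):
    down = {stage: stage for stage in range(num_devices)}
    up = {stage: num_devices - 1 - stage for stage in range(num_devices)}
    return down, up

# e.g. 4 devices: down = {0:0, 1:1, 2:2, 3:3}, up = {0:3, 1:2, 2:1, 3:0}
```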
15
SC21, Chimera
Core Techniques
Gradient synchronization between model replicas
• After all local computation
Synchronize gradients (an allreduce operation) after the two micro-batches of both directions, i.e., ∧ and ∨, have completed.
[Schedule diagram: allreduce for stages s1-s4 placed after all local computation]
• Eager, as soon as gradients are ready
Synchronize the gradients (allreduce) of the first and last stage as soon as they are ready. The bubble is reduced! A sketch of this idea follows.
[Schedule diagram: eager allreduce for the first and last stages overlapped with remaining computation]
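A sketch of eager synchronization using PyTorch autograd hooks, assuming torch.distributed is initialized and `replica_group` is the process group of this stage's replicas; illustrative only, not Chimera's actual code:

```python
import torch.distributed as dist

def enable_eager_allreduce(stage_module, replica_group):
    handles = []
    def make_hook():
        def hook(grad):
            # Launch the allreduce the moment this gradient is ready, so it
            # overlaps with the rest of the backward pass. Wait on all
            # handles and divide by the replica count before optimizer.step().
            handles.append(dist.all_reduce(grad, op=dist.ReduceOp.SUM,
                                           group=replica_group, async_op=True))
            return grad
        return hook
    for p in stage_module.parameters():
        p.register_hook(make_hook())
    return handles
```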
16
SC21, Chimera
Core Techniques
Combine Pipeline and Data Parallelism
[Schedule diagrams: two pipeline replicas, one on devices d1-d4 and one on devices d5-d8, each running the bidirectional schedule with stages s1-s4; gradients are synchronized across replicas]
With data parallelism:
• p2p communication is reduced, since fewer devices are used for pipeline stages, e.g., 4 stages here instead of 8.
• allreduce communication is increased due to gradient synchronization. High-bandwidth interconnects (such as IB, Cray Aries, Slingshot, NVLink) can partially alleviate this.
• The workload on each stage is reduced.
• It is important to find a sweet spot between D (the number of stages) and W (the number of replicas).
17
SC21, Chimera
Core Techniques
Performance modelling
!"
!#
!$
!%
F
F
F
F B
B
B
B
F
F
F
F B
B
B
B
F
F
F
F B
B
B
B
F
F
F
F B
B
B
B F
F
F
F B
B
B
B
F
F
F
F B
B
B
B
F
F
F
F B
B
B
B
F
F
F
F B
B
B
B
Runtime of a single training iteration (a sketch in code follows the definitions):
T = (t_f · N_f + t_b · N_b + Com_p2p · C_p2p) + max{ Com_uncovered(i) : i ∈ [0, D−1] }
• t_f, t_b: runtime of a single forward and backward computation, respectively
• Com_p2p · C_p2p: p2p communication cost between stages times its count on the critical path; classical model for one transfer: α + βL, with L the size of the message
• N_f, N_b: number of forward and backward computations on the critical path, respectively
• Com_allreduce = 2·log2(W)·α + 2·(W−1)·βL/W, the classical Rabenseifner algorithm, with W the number of stage replicas
• Com_uncovered(i): the part of Com_allreduce that cannot be covered by the bubble on device i
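The model above written out as a small cost function; α, β and the counts are illustrative parameters:

```python
import math

def p2p_cost(alpha, beta, msg_size):
    # Classical alpha-beta model for one point-to-point transfer.
    return alpha + beta * msg_size

def allreduce_cost(alpha, beta, msg_size, W):
    # Rabenseifner's algorithm over W stage replicas.
    return 2 * math.log2(W) * alpha + 2 * (W - 1) * beta * msg_size / W

def iteration_time(t_f, t_b, N_f, N_b, C_p2p, com_p2p, uncovered):
    # uncovered[i]: part of the allreduce not hidden by bubbles on device i.
    return t_f * N_f + t_b * N_b + com_p2p * C_p2p + max(uncovered)
```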
[Timeline: training iteration 0 followed by training iteration 1]
18
Evaluation on BERT
Setup: 4 workstations interconnected by IB, each equipped with 8 V100 GPUs connected by NVLink.
Memory: [results figure]
20
Evaluation on BERT
Setup: 4 workstations interconnected by IB, each equipped with 8 V100 GPUs connected by NVLink.
Throughput: [results figure]
21
To be continued
Auto parallelism
SOSP19, PipeDream; PPoPP21, DAPPLE
[Figure 2: PipeDream framework overview]
• Micro-benchmarks profile computation time, memory overhead, etc.
• Partitioning is formulated as a multi-constraint optimization problem
• Dynamic programming is the core technique to partition and map the DNN (a sketch follows)
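A minimal sketch of the dynamic program behind such partitioners: split per-layer compute times into a given number of contiguous stages so that the slowest stage is as fast as possible (communication and replication terms omitted for brevity):

```python
from functools import lru_cache

def best_partition(layer_times, num_stages):
    """Minimal bottleneck stage time over contiguous partitions."""
    n = len(layer_times)
    prefix = [0]
    for t in layer_times:
        prefix.append(prefix[-1] + t)

    @lru_cache(maxsize=None)
    def solve(i, k):
        # Place layers i..n-1 into k stages.
        if k == 1:
            return prefix[n] - prefix[i]
        return min(max(prefix[j] - prefix[i], solve(j, k - 1))
                   for j in range(i + 1, n - k + 2))

    return solve(0, num_stages)

# e.g. best_partition([3, 1, 4, 1, 5, 9], 3) -> 9
```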
22
To be continued
Extremely memory-efficient training
24
To tackle the low throughput of GEMS in high-resolution histopathology image scenarios:
• Mixed-precision training: FP16 together with FP32 (see the sketch after this list).
• Re-computation: trades computation for memory.
• The ZeRO technique from DeepSpeed: trades communication for memory.
• Harnessing sparsity: remove zeros from computation and storage, trading accuracy for computation and memory.
• ...
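A minimal sketch of the mixed-precision item using PyTorch AMP (FP16 compute where safe, FP32 master weights via loss scaling); an illustration, not a recipe specific to GEMS:

```python
import torch

scaler = torch.cuda.amp.GradScaler()

def amp_step(model, loss_fn, optimizer, x, y):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():   # FP16 forward where safe
        loss = loss_fn(model(x), y)
    scaler.scale(loss).backward()     # scaled gradients avoid FP16 underflow
    scaler.step(optimizer)            # unscales, then updates FP32 weights
    scaler.update()                   # adapts the loss scale
```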
Closing notes
We investigated several SOTA pipeline parallel training techniques in ML:
• Compared to data parallelism, PP enables training out-of-core ML models
• Compared to model parallelism, PP enhances throughput
• It is a multi-objective optimization problem: computation efficiency (fewer bubbles), memory overhead (lower is better), and convergence guarantees (synchronous updates)
• It lays the foundations for auto parallelism
25
Closing notes
Thank you
26