Pipeline Parallel Training of
Large-scale Neural Network
Changjiang Gou
Zhejiang Lab
January 2022
1
Agenda
1. Introduction
2. Fundamentals
3. Core Techniques
4. Evaluation on BERT
5. To be continued
6. Closing notes
2
Why do we need it?
Compared to data parallelism, the most prevalent approach, Pipeline Parallelism (PP) can:
1. Train models too large for a single device
2. Lower communication overhead (by around 90%)
3. Overlap computation and communication
Introduction
But a naive implementation suffers from:
1. Idle devices
2. Low throughput
3. State staleness
3
Fundamentals
Partition the NN (layers l1 ... l5) into several stages s1 ... s4, each a contiguous sequence of layers, and assign each stage to a device.
All devices run different tasks on different data streams.
[Pipeline schedule diagram: forward (F) and backward (B) passes of successive micro-batches across devices d1-d4 over time]
4
[Diagram: per-device memory (number of stashed activations, 0-3) and computation over time for devices d1-d4]
[Pie chart: memory consumption of a Transformer - Model 5%, Optimizer 16%, Activations 79%]
[Pipeline schedule diagram repeated: F and B passes across devices d1-d4 over time]
Fundamentals
[Coarse-grain computation graph: forward nodes f1 ... f5 with weights w1 ... w5, a loss node, and backward nodes b1 ... b5]
5
Core Techniques
[GPipe schedule diagram: F and B passes of 4 micro-batches across devices d1-d4, with a bubble between the forward and backward phases]
NeurIPS19, GPipe
• Micro-batching
Divide a mini-batch of size N into M micro-batches; at the end of the
mini-batch, gradients are accumulated and applied to update the
parameters.
• Gradient checkpointing
Each device stores only the output activations at stage boundaries.
During the backward pass, it re-computes the forward function
(sub-linear memory cost). Both ideas are sketched below.
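A minimal sketch of both ideas on a single device, assuming `model` is an nn.Sequential and a recent PyTorch; an illustration, not GPipe's actual API:

```python
import torch
from torch.utils.checkpoint import checkpoint_sequential

def train_step(model, loss_fn, optimizer, batch_x, batch_y, num_microbatches=4):
    """One mini-batch = num_microbatches micro-batches, one optimizer update."""
    optimizer.zero_grad()
    for x, y in zip(batch_x.chunk(num_microbatches),
                    batch_y.chunk(num_microbatches)):
        # Gradient checkpointing: keep only segment-boundary activations and
        # re-run the forward pass during backward (sub-linear memory cost).
        out = checkpoint_sequential(model, 2, x, use_reentrant=False)
        loss = loss_fn(out, y) / num_microbatches  # scale so gradients average
        loss.backward()                            # accumulates into .grad
    optimizer.step()                               # one update per mini-batch
```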
[Pie chart: time consumption - computation dominates at 67%; the remainder (23%, 4%, 3%, 3%) is split among weight update, recompute, load imbalance, bubble, and setup]
6
SOSP19, PipeDream
[1F1B schedule diagram: after a warm-up phase, each device d1-d4 alternates one forward (F) and one backward (B) pass per micro-batch, largely eliminating the idle slots ("improvement")]
• 1F1B: one-forward-one-backward
To eliminate idle slots, each device alternates between forward and backward computation (see the sketch below).
• Weight stashing to reduce staleness
• Discrepancy in weight versions can prevent the model from converging
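A toy sketch of the 1F1B order on one stage, assuming hypothetical `stage.recv_activation`/`stage.recv_grad`/`stage.forward`/`stage.backward` helpers; real systems overlap these with compute via asynchronous p2p:

```python
def run_stage_1f1b(stage, num_microbatches, num_stages, stage_id):
    warmup = num_stages - stage_id - 1   # forwards before the first backward
    fwd_cache = []                       # activations stashed for backward
    for _ in range(warmup):                      # warm-up: forward only
        fwd_cache.append(stage.forward(stage.recv_activation()))
    for _ in range(num_microbatches - warmup):   # steady state: one F, one B
        fwd_cache.append(stage.forward(stage.recv_activation()))
        stage.backward(fwd_cache.pop(0), stage.recv_grad())
    for _ in range(warmup):                      # cool-down: drain backwards
        stage.backward(fwd_cache.pop(0), stage.recv_grad())
```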
Core Techniques
7
SOSP19, PipeDream
• 1F1B: one-forward-one-backward
To eliminate idle slots, each device alternates between forward and backward computation.
• Weight stashing to reduce staleness
• Discrepancy in weight versions can prevent the model from converging:
for instance, d1 starts the F computation on the green micro-batch at time t1 and the B computation at time t2, but the weights have already been updated at times t3 and t4 in between.
Core Techniques
[1F1B schedule diagram, annotated with the times t1-t4 referenced above]
8
SOSP19, PipeDream
• Weight stashing to reduce staleness
Weight stashing: for a given micro-batch (denoted by colors), each stage maintains its own version of the latest weights, which is used for both its F and B computation.
An instance: at t1, device d1 uses the weights updated by the yellow micro-batch and stores this version until it is used again at t2; for device d2, the weights used at time t3 were just updated by the blue micro-batch and are kept only until time t2.
Shortcoming: weight inconsistency across stages. A minimal sketch of the version store follows.
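A minimal sketch of per-stage weight stashing, where `apply_layer`, `compute_grad`, and `sgd_update` are hypothetical helpers over a plain dict of tensors; not PipeDream's actual code:

```python
import copy

class StashingStage:
    def __init__(self, weights):
        self.weights = weights   # latest version, updated after each backward
        self.stash = {}          # micro-batch id -> version used in its forward

    def forward(self, mb_id, x):
        # Record exactly the weights that produce these activations.
        self.stash[mb_id] = copy.deepcopy(self.weights)
        return apply_layer(self.weights, x)

    def backward(self, mb_id, grad_in):
        # Reuse the stashed version, so F and B see identical weights.
        stashed = self.stash.pop(mb_id)
        grad = compute_grad(stashed, grad_in)
        sgd_update(self.weights, grad)   # only the latest version is updated
        return grad
```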
Core Techniques
[1F1B schedule diagram, annotated with the weight-stashing times t1-t4 referenced above]
9
SOSP19, PipeDream
• Weight stashing to reduce staleness
Vertical sync: for a given micro-batch, only the input stage picks up the latest weights; the version used is then propagated along with the activations and gradients (a p2p operation).
An instance: at t1, device d1 uses the weights updated by the yellow micro-batch and stores this version until it is used again at t2; for device d2, the weights used at time t3 are the ones forwarded from d1, and are kept only until time t2.
Each device still stashes several versions of the weights.
Core Techniques
[1F1B schedule diagram, annotated with the vertical-sync times t1-t4 referenced above]
10
ICML21, PipeDream-2BW
[2BW schedule diagram: 1F1B pipeline across devices d1-d4, annotated with the times t1-t4 at which weight versions are created and retired]
• Double-buffered weight update
Each device stashes at most 2 versions of the weights, reducing the memory footprint (see the sketch after this list).
An instance:
1. at t1, a training period starts with weights W1;
2. at t2, another training period starts with W2;
3. at t3, training with W1 terminates on a mini-batch consisting of 4 micro-batches, and W1 is discarded;
4. at t4, a third period starts with the weights just updated at t3;
5. only 2 versions are needed at any time!
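A minimal sketch of the two-version buffer; weight versions are opaque objects here, and the eviction rule is simplified (a real implementation waits until no in-flight micro-batch references the old version):

```python
class DoubleBufferedWeights:
    def __init__(self, w0):
        self.versions = [w0]        # at most two live versions

    def latest(self):
        return self.versions[-1]    # newly injected micro-batches start here

    def oldest(self):
        return self.versions[0]     # in-flight micro-batches finish here

    def commit(self, new_weights):
        # A mini-batch of M micro-batches finished: publish the new version
        # and discard the oldest once nothing in flight references it.
        self.versions.append(new_weights)
        if len(self.versions) > 2:
            self.versions.pop(0)
```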
Core Techniques
[2BW schedule detail: F and B slots colored by which of the two buffered weight versions they use, annotated with t1-t4]
11
SC20, GEMS
Core Techniques
[GEMS schedule diagram: two model replicas; stages s1-s4 of replica 1 and the reversed stages s4-s1 of replica 2 are mapped onto devices d1-d4]
Model parallelism with a second model replica to increase memory efficiency
[Timeline: F and B passes of the two replicas interleaved; memory shown as the sum over all devices]
13
SC20, GEMS
Core Techniques
Model parallelism with a second model replica to increase memory efficiency
[Schedule diagram repeated: interleaved F and B passes of the two replicas]
It is designed for:
• Extremely large DNNs, e.g., ResNet-1k with 1000 layers, which consume huge amounts of memory
• High-resolution inputs, where even batch size 1 is large enough that micro-batching is not feasible
Example:
• High-resolution histopathology images of 100,000 x 100,000 pixels.
14
SC21, Chimera
Core Techniques
[Bidirectional pipeline diagrams: a "down" pipeline and an "up" pipeline (model replicas 1 and 2), each running micro-batches through stages s1-s4 mapped in opposite orders onto devices d1-d4; combined, the bubbles of one pipeline are filled by the other]
A placement sketch follows.
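A tiny sketch of the bidirectional stage placement, assuming D stages on D devices: the down pipeline maps stage i to device i, the up pipeline maps stage i to device D-1-i, so each device hosts one stage of each replica:

```python
def chimera_placement(num_devices):
    down = {stage: stage for stage in range(num_devices)}
    up = {stage: num_devices - 1 - stage for stage in range(num_devices)}
    return down, up

# e.g. 4 devices: down = {0:0, 1:1, 2:2, 3:3}, up = {0:3, 1:2, 2:1, 3:0}
```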
15
SC21, Chimera
Core Techniques
Gradient synchronization between model replicas
• After all local computation
Synchronize gradients (an allreduce operation) after the two micro-batches of both directions, i.e., ∧ and ∨, have completed.
[Schedule diagram: allreduce for stages s1-s4 placed after all local computation]
• Eager, as soon as gradients are ready
Synchronize the gradients (allreduce) of the first and last stage as soon as they are ready. The bubble is reduced! A sketch of this idea follows.
[Schedule diagram: eager allreduce for the first and last stages overlapped with remaining computation]
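A sketch of eager synchronization using PyTorch autograd hooks, assuming torch.distributed is initialized and `replica_group` is the process group of this stage's replicas; illustrative only, not Chimera's actual code:

```python
import torch.distributed as dist

def enable_eager_allreduce(stage_module, replica_group):
    handles = []
    def make_hook():
        def hook(grad):
            # Launch the allreduce the moment this gradient is ready, so it
            # overlaps with the rest of the backward pass. Wait on all
            # handles and divide by the replica count before optimizer.step().
            handles.append(dist.all_reduce(grad, op=dist.ReduceOp.SUM,
                                           group=replica_group, async_op=True))
            return grad
        return hook
    for p in stage_module.parameters():
        p.register_hook(make_hook())
    return handles
```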
16
SC21, Chimera
Core Techniques
Combine Pipeline and Data Parallelism
[Schedule diagrams: two pipeline replicas, one on devices d1-d4 and one on devices d5-d8, each running the bidirectional schedule with stages s1-s4; gradients are synchronized across replicas]
With data parallelism:
• p2p communication is reduced, since fewer devices are used for pipeline stages, e.g., 4 stages here instead of 8.
• allreduce communication is increased due to gradient synchronization. High-bandwidth interconnects (such as IB, Cray Aries, Slingshot, NVLink) can partially alleviate this.
• The workload on each stage is reduced.
• It is important to find a sweet spot between D (the number of stages) and W (the number of replicas).
17
SC21, Chimera
Core Techniques
Performance modelling
!"
!#
!$
!%
F
F
F
F B
B
B
B
F
F
F
F B
B
B
B
F
F
F
F B
B
B
B
F
F
F
F B
B
B
B F
F
F
F B
B
B
B
F
F
F
F B
B
B
B
F
F
F
F B
B
B
B
F
F
F
F B
B
B
B
Runtime of a single training iteration (a sketch in code follows the definitions):
T = (t_f · N_f + t_b · N_b + Com_p2p · C_p2p) + max{ Com_uncovered(i) : i ∈ [0, D−1] }
• t_f, t_b: runtime of a single forward and backward computation, respectively
• Com_p2p · C_p2p: p2p communication cost between stages times its count on the critical path; classical model for one transfer: α + βL, with L the size of the message
• N_f, N_b: number of forward and backward computations on the critical path, respectively
• Com_allreduce = 2·log2(W)·α + 2·(W−1)·βL/W, the classical Rabenseifner algorithm, with W the number of stage replicas
• Com_uncovered(i): the part of Com_allreduce that cannot be covered by the bubble on device i
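The model above written out as a small cost function; α, β and the counts are illustrative parameters:

```python
import math

def p2p_cost(alpha, beta, msg_size):
    # Classical alpha-beta model for one point-to-point transfer.
    return alpha + beta * msg_size

def allreduce_cost(alpha, beta, msg_size, W):
    # Rabenseifner's algorithm over W stage replicas.
    return 2 * math.log2(W) * alpha + 2 * (W - 1) * beta * msg_size / W

def iteration_time(t_f, t_b, N_f, N_b, C_p2p, com_p2p, uncovered):
    # uncovered[i]: part of the allreduce not hidden by bubbles on device i.
    return t_f * N_f + t_b * N_b + com_p2p * C_p2p + max(uncovered)
```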
[Timeline: training iteration 0 followed by training iteration 1]
18
Evaluation on BERT
Setup: 4 workstations interconnected by IB, each equipped with 8 V100 GPUs connected by NVLink.
Memory: [results figure]
20
Evaluation on BERT
Setup: 4 workstations interconnected by IB, each equipped with 8 V100 GPUs connected by NVLink.
Throughput: [results figure]
21
To be continued
Auto parallelism
SOSP19, PipeDream; PPoPP21, DAPPLE
[Figure 2: PipeDream framework overview]
• Micro-benchmarks profile computation time, memory overhead, etc.
• Partitioning is formulated as a multi-constraint optimization problem
• Dynamic programming is the core technique to partition and map the DNN (a sketch follows)
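A minimal sketch of the dynamic program behind such partitioners: split per-layer compute times into a given number of contiguous stages so that the slowest stage is as fast as possible (communication and replication terms omitted for brevity):

```python
from functools import lru_cache

def best_partition(layer_times, num_stages):
    """Minimal bottleneck stage time over contiguous partitions."""
    n = len(layer_times)
    prefix = [0]
    for t in layer_times:
        prefix.append(prefix[-1] + t)

    @lru_cache(maxsize=None)
    def solve(i, k):
        # Place layers i..n-1 into k stages.
        if k == 1:
            return prefix[n] - prefix[i]
        return min(max(prefix[j] - prefix[i], solve(j, k - 1))
                   for j in range(i + 1, n - k + 2))

    return solve(0, num_stages)

# e.g. best_partition([3, 1, 4, 1, 5, 9], 3) -> 9
```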
22
To be continued
Extremely memory-efficient training
24
To tackle the low throughput of GEMS in high-resolution histopathology image scenarios:
• Mixed-precision training: FP16 together with FP32 (see the sketch after this list).
• Re-computation: trades computation for memory.
• The ZeRO technique from DeepSpeed: trades communication for memory.
• Harnessing sparsity: remove zeros from computation and storage, trading accuracy for computation and memory.
• ...
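A minimal sketch of the mixed-precision item using PyTorch AMP (FP16 compute where safe, FP32 master weights via loss scaling); an illustration, not a recipe specific to GEMS:

```python
import torch

scaler = torch.cuda.amp.GradScaler()

def amp_step(model, loss_fn, optimizer, x, y):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():   # FP16 forward where safe
        loss = loss_fn(model(x), y)
    scaler.scale(loss).backward()     # scaled gradients avoid FP16 underflow
    scaler.step(optimizer)            # unscales, then updates FP32 weights
    scaler.update()                   # adapts the loss scale
```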
Closing notes
We investigated several SOTA pipeline parallel training techniques in ML:
• Compared to data parallelism, PP enables training out-of-core ML models
• Compared to model parallelism, PP enhances throughput
• It is a multi-objective optimization problem: computation efficiency (fewer bubbles), memory overhead (lower is better), and convergence guarantees (synchronous updates)
• It lays the foundations for auto parallelism
25
Closing notes
Thank you
26