PipeDream: Generalized Pipeline
Parallelism for DNN Training
Deepak Narayanan
Stanford University
In collaboration with many others
Deep Neural Networks have empowered state-of-the-art results across a range of applications…
Image Classification (cat vs. dog), Machine Translation (வணக்கம் என் பெயர் தீபக் → "Hello, my name is Deepak"), Game Playing, Speech-to-Text
…but they first need to be trained!
Input x_1 (an image) with label y_1 = tiger; prediction ŷ_1 = lion; loss(y_1, ŷ_1)
Activations flow forward through the network; gradients flow backward
Weight parameters W optimized using standard iterative optimization procedures:
W = W − η ⋅ ∇W
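To make the update rule above concrete, here is a minimal PyTorch sketch of one training iteration; the model, loss function, data, and learning rate are illustrative placeholders, not anything from the talk.

```python
# Minimal sketch of one training iteration, matching the slide's update rule.
# The model, data, and learning rate are placeholders chosen for illustration.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
loss_fn = nn.CrossEntropyLoss()
eta = 0.1                              # learning rate η

x = torch.randn(8, 32)                 # x_1: a batch of inputs
y = torch.randint(0, 10, (8,))         # y_1: ground-truth labels

y_hat = model(x)                       # forward pass: activations -> prediction ŷ_1
loss = loss_fn(y_hat, y)               # loss(y_1, ŷ_1)
loss.backward()                        # backward pass: gradients ∇W

with torch.no_grad():                  # W = W − η ⋅ ∇W
    for w in model.parameters():
        w -= eta * w.grad
        w.grad = None
```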
Generalized Pipeline Parallelism for DNN Training
Parallelizing DNN training: Data Parallelism
Worker 1 … Worker n: n copies of the same model, with inputs sharded across workers
Gradient aggregation using all_reduce(.): ∇W = ∇W_1 + ∇W_2 + ⋯ + ∇W_n
Parallelizing DNN training: Data Parallelism
Despite many performance optimizations, communication overhead remains high!
(Measured on 8xV100s with NVLink on AWS, using PyTorch + NCCL 2.4.)
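The gradient aggregation described above can be sketched with torch.distributed as below; this is an illustrative snippet (launched with torchrun, e.g. two processes), not PipeDream's or the benchmark code, and the model, data, and learning rate are placeholder assumptions.

```python
# Sketch of the data-parallel pattern: each worker computes gradients on its
# own input shard, then gradients are summed across workers with all_reduce.
# Launch with e.g.: torchrun --nproc_per_node=2 data_parallel_sketch.py
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.nn.functional as F

dist.init_process_group(backend="gloo")    # NCCL would be used on GPU clusters
rank, world = dist.get_rank(), dist.get_world_size()

torch.manual_seed(0)                       # same initial weights on every worker
model = nn.Linear(32, 10)                  # n copies of the same model

torch.manual_seed(rank + 1)                # different (placeholder) input shard per worker
x = torch.randn(8, 32)
y = torch.randint(0, 10, (8,))

loss = F.cross_entropy(model(x), y)
loss.backward()

for p in model.parameters():               # ∇W = ∇W_1 + ∇W_2 + ⋯ + ∇W_n
    dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
    p.grad /= world                        # average before the update

with torch.no_grad():                      # identical SGD step on every worker
    for p in model.parameters():
        p -= 0.1 * p.grad
```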
Model Parallelism: An alternative to data parallelism
• Single version of weights split over workers (Worker 1 … Worker n); all inputs pass through every worker
• Activations and gradients sent between workers using send(.) and recv(.)
Model Parallelism: An alternative to data parallelism
Layers partitioned over workers → low throughput due to poor resource utilization!
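Below is a minimal sketch of the send(.)/recv(.) pattern for model parallelism across two workers; the two-stage split, tensor shapes, and loss are assumptions made for illustration, not the talk's code, and the weights are not updated.

```python
# Sketch of model parallelism over two workers: layers are partitioned across
# ranks, and activations/gradients move between them with send(.)/recv(.).
# Launch with: torchrun --nproc_per_node=2 model_parallel_sketch.py
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.nn.functional as F

dist.init_process_group(backend="gloo")
rank = dist.get_rank()

if rank == 0:                               # Worker 1: first part of the model
    stage = nn.Linear(32, 64)
    x = torch.randn(8, 32)
    act = stage(x)
    dist.send(act.detach(), dst=1)          # send activations forward
    grad = torch.empty_like(act)
    dist.recv(grad, src=1)                  # receive gradient of the activations
    act.backward(grad)                      # finish the backward pass locally
else:                                       # Worker n: last part of the model
    stage = nn.Linear(64, 10)
    buf = torch.empty(8, 64)
    dist.recv(buf, src=0)                   # receive activations
    act = buf.requires_grad_()
    y = torch.randint(0, 10, (8,))
    loss = F.cross_entropy(stage(act), y)
    loss.backward()
    dist.send(act.grad, dst=0)              # send activation gradients backward
```

With a single input in flight, only one worker is busy at a time, which is exactly the utilization problem pipelining addresses.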
Solution: Pipelining can increase throughput
Pipelining: injecting multiple
inputs into the system
Pipelining in DNN training != Traditional pipelining
• How should weight and activation versions be managed?
• Backward pass operators depend on internal state (𝑊, activations)
• Backward pass for inputs should use the same weight version as
corresponding forward pass
Challenge 1: Pipelining leads to weight version mismatches
Naïve pipelining leads to mismatch in weight versions
Input 𝒏 sees updates in backward pass not seen in the forward
pass, leading to incorrect gradients
Forward pass for input n uses weight version W_n; its backward pass uses a later version W_{n+x} (x > 0); the update then produces W_{n+1}
Weight stashing: A solution to version mismatches
With weight stashing, the forward pass for input n uses W_n and its backward pass reuses the same stashed W_n; the update then produces W_{n+1}
Store multiple <weight, activation> versions
• Ensures same weight versions used in both forward and backward pass
• Worst-case memory footprint similar to data parallelism (≈ d ⋅ |W|/d = |W| per worker: the input stage stashes up to d versions of its 1/d share of the weights)
Stashed weights: W_n, W_{n+1}, W_{n+2}
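The functional sketch below (an illustration only, not PipeDream's implementation; the layer, shapes, and learning rate are assumptions) shows the bookkeeping: each microbatch's forward pass snapshots the weights it used, and its backward pass reuses that snapshot even after later updates.

```python
# Illustrative sketch of weight stashing on one pipeline stage.
import torch
import torch.nn.functional as F

live_W = torch.randn(16, 16) * 0.1       # the stage's current weights
stash = {}                                # microbatch id -> (stashed W, output)

def forward(mb_id, x):
    W = live_W.clone().requires_grad_()   # snapshot W_n used by this microbatch
    out = F.linear(x, W)
    stash[mb_id] = (W, out)
    return out

def backward(mb_id, grad_output, lr=0.1):
    global live_W
    W, out = stash.pop(mb_id)
    out.backward(grad_output)             # gradient w.r.t. the *stashed* W_n ...
    live_W = live_W - lr * W.grad         # ... applied to the latest weights

# Naive pipelining interleaves several in-flight microbatches:
y0 = forward(0, torch.randn(4, 16))
y1 = forward(1, torch.randn(4, 16))       # starts before microbatch 0 finishes
backward(0, torch.ones_like(y0))
backward(1, torch.ones_like(y1))          # still sees the weights its forward saw
```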
Pipelining in DNN training != Traditional pipelining
• How should weight and activation versions be managed?
• Backward pass operators depend on internal state (𝑊, activations)
• Backward pass for inputs should use the same weight version as
corresponding forward pass
• How should the DNN operators be partitioned into pipeline stages?
• Each operator has a different computation time
• Activations and gradients need to be communicated across stages
Challenge 2: How do we assign operators to pipeline stages?
Stage 1, Stage 2, Stage 3 with per-stage compute times t_1, t_2, t_3 and inter-stage communication times t_comm(1→2), t_comm(2→3)
• Desideratum #1: t_1, t_2, t_3 as close to each other as possible
• Compute resources seldom idle → better hardware efficiency
• Desideratum #2: t_comm(1→2) and t_comm(2→3) minimized
• Less communication → better hardware efficiency
See SOSP paper for details on PipeDream's optimizer!
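For intuition only, the toy partitioner below captures just Desideratum #1: split a chain of profiled layer times into contiguous stages so the slowest stage is as fast as possible. PipeDream's real optimizer (see the SOSP paper) additionally models communication, stage replication, and the hardware topology; the layer times here are made-up numbers.

```python
# Toy stage-partitioning sketch: minimize the maximum per-stage compute time.
from itertools import combinations

def partition(layer_times, num_stages):
    """Return stage boundaries minimizing the bottleneck (max per-stage time)."""
    n = len(layer_times)
    best_bounds, best_bottleneck = None, float("inf")
    for cuts in combinations(range(1, n), num_stages - 1):   # exhaustive; fine for small n
        bounds = (0,) + cuts + (n,)
        stage_times = [sum(layer_times[a:b]) for a, b in zip(bounds, bounds[1:])]
        if max(stage_times) < best_bottleneck:
            best_bounds, best_bottleneck = bounds, max(stage_times)
    return best_bounds, best_bottleneck

times = [4, 7, 5, 9, 3, 6, 8, 2]          # hypothetical per-layer compute times (ms)
print(partition(times, 3))                # ((0, 3, 5, 8), 16): t_1=16, t_2=12, t_3=16
```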
Evaluation
Setup
• Integrated PipeDream with PyTorch in ~3000 lines of Python code
• Integrated with PyTorch’s communication library
• NCCL backend for Data Parallelism baselines
• Gloo backend for PipeDream
• Experiments run on three different server types
• Cluster A: 4xV100 GPUs, PCIe intra-server, and 10 Gbps inter-server (Azure)
• Cluster B: 8xV100 GPUs, NVLink intra-server, and 25 Gbps inter-server (AWS)
• Cluster C: 1xTitan X, and 40 Gbps inter-server (private)
PipeDream > Data Parallelism (DP) end-to-end
(Chart annotations: 5.3x faster, 2.5x faster)
PipeDream vs. Data Parallelism on Time-to-Accuracy
Experiments on 4 different tasks: image classification, translation, language modeling, video captioning
With the same number of GPUs, PipeDream is up to 5.3x faster than Data Parallelism
Takeaways
• Model and data parallelism often suffer from high communication
overhead and low resource utilization for certain models and deployments
• PipeDream shows pipelining can be used to accelerate distributed training
for models that fit on a single worker
• Pipelining, when combined with data and model parallelism in a principled
way, achieves end-to-end speedups of up to 5.3x compared to data
parallelism, with similar worst-case memory footprint
Appeared at SOSP 2019
…but modern Deep Neural Networks are becoming
extremely large!
700 GB in 32-bit precision
Figure from “Language Models are Few-Shot Learners”, Brown et al.
Generalized Pipeline Parallelism for DNN Training
Background: GPipe
How should weight and activation versions be managed?
>> Single weight version
>> Periodic pipeline flushes update weight version across workers
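In code, this looks like plain gradient accumulation with a synchronous update at each flush; the single-stage sketch below is an illustration with a placeholder model and data (not GPipe's implementation), and it ignores the pipeline bubble that makes flushes expensive.

```python
# Sketch of GPipe-style semantics: one weight version; gradients from all
# microbatches in a minibatch accumulate, then the flush applies one update.
import torch
import torch.nn as nn

stage = nn.Linear(16, 16)
opt = torch.optim.SGD(stage.parameters(), lr=0.1)
MICROBATCHES = 4

for step in range(3):                      # each iteration = one minibatch
    opt.zero_grad()
    for _ in range(MICROBATCHES):          # microbatches injected into the pipeline
        x = torch.randn(2, 16)
        stage(x).sum().backward()          # gradients accumulate in .grad
    opt.step()                             # pipeline flush: one weight update for all workers
```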
GPipe and PipeDream make different tradeoffs
• GPipe: pipeline flushes are expensive
• PipeDream: high memory footprint from weight versions
Generalized Pipeline Parallelism for DNN Training
Double-buffered weight updates: high
throughput and low memory footprint
How should weight and activation versions be managed?
>> Two weight versions (shadow version and main version)
Double-buffered weight updates: high throughput and low memory footprint
Weight updates from inputs 1 to 4 are accumulated and applied as a single weight update, generating a new weight version; input 5 then uses this new version throughout (both its forward and backward pass)
Use activation recomputation to limit the memory footprint of intermediate activations
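The single-stage simulation below sketches the 2BW bookkeeping (an illustration under simplifying assumptions, not the PipeDream-2BW implementation): at most two weight versions are live, gradients from one generation of inputs are coalesced into one update, and each input's backward pass reuses the version its forward pass read. The layer, shapes, and learning rate are placeholders.

```python
# Illustrative sketch of double-buffered weight updates (2BW) on one stage.
import torch
import torch.nn.functional as F

newest = torch.randn(16, 16) * 0.1            # newest weight version, e.g. W^(0)
in_flight = {}                                 # input id -> (version used, output)
grad_accum, seen = torch.zeros_like(newest), 0
D, LR = 4, 0.1                                 # inputs per update (pipeline depth), learning rate

def forward(i, x):
    W = newest.clone().requires_grad_()        # read the newest version; kept until i's backward
    out = F.linear(x, W)
    in_flight[i] = (W, out)
    return out

def backward(i, grad_output):
    global newest, grad_accum, seen
    W, out = in_flight.pop(i)
    out.backward(grad_output)                  # same version as input i's forward pass
    grad_accum += W.grad
    seen += 1
    if seen == D:                              # inputs 1..4 done: one coalesced update
        newest = newest - LR * grad_accum / D  # generates the next version
        grad_accum, seen = torch.zeros_like(newest), 0

for i in range(1, 6):                          # inputs 1..4 use W^(0); input 5 uses the new version
    out = forward(i, torch.randn(2, 16))
    backward(i, torch.ones_like(out))          # real pipelining overlaps these steps across inputs
```

Activation recomputation (e.g., torch.utils.checkpoint in PyTorch) can be layered on top so that intermediate activations are recomputed during the backward pass rather than stored between an input's forward and backward passes.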
Double-buffered weight updates: weight semantics
• Assuming a per-GPU microbatch size of 𝑏, minibatch size 𝐵 = 𝑏 ⋅ 𝑑, where 𝑑
is the depth of the pipeline
• Weight update semantics of data parallelism: W^(t+1) = W^(t) − ν ⋅ ∇f(W^(t))
• Weight update semantics with 2BW almost unchanged (note additional delay term of 1 in gradient computation): W^(t+1) = W^(t) − ν ⋅ ∇f(W^(t−1))
• Semantics similar with replicated stages or gradient aggregation (minibatch
size 𝐵 multiplied by appropriate scale factor)
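As a tiny numeric illustration (not from the talk) of the delay term, apply both update rules to f(w) = w² with ∇f(w) = 2w and ν = 0.1; both drive w toward the optimum at 0.

```python
# Vanilla SGD uses ∇f(W^(t)); the 2BW-style rule uses ∇f(W^(t-1)).
nu, w_sgd, w_2bw, prev = 0.1, 1.0, 1.0, 1.0    # prev holds W^(t-1) for the delayed rule
for t in range(20):
    w_sgd = w_sgd - nu * (2 * w_sgd)
    w_2bw, prev = w_2bw - nu * (2 * prev), w_2bw
print(round(w_sgd, 4), round(w_2bw, 4))         # both decay towards 0
```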
Evaluation
Setup
• Experiments run on p3.16xlarge instances on AWS (8xV100 servers with NVLink)
• Baselines are hybrid parallelism (no pipelining), PipeDream, and GPipe
• Model and associated activations do not fit on a single worker for many
models, so data parallelism is not applicable
• Evaluation on BERT models with various numbers of transformer layers (24 to
192), and a GPT-2 model with 760 million parameters
2BW has weight update semantics similar to data
parallelism
Training loss trajectory identical; accuracy on downstream GLUE tasks unchanged
Training loss while pre-training a BERT-24 model with identical hyperparameters, and
downstream GLUE task accuracy after finetuning pre-trained models 3 times
PipeDream-2BW faster than baselines
1.9x faster than GPipe; 6.9x faster than hybrid parallelism (no pipelining)
Throughput in sequences/second for PipeDream-2BW and baselines on various models
PipeDream-2BW has low memory footprint
PipeDream-2BW with activation recomputation (R)
has similar memory footprint to model parallelism
Memory footprint for various systems, using a
fixed per-GPU microbatch size of 4
Takeaways
• Model parallelism can be used to train large models that do not fit on a single
worker, but suffers from low resource utilization
• PipeDream-2BW carefully manages weight versions and uses activation
recomputation when necessary to limit memory footprint
• PipeDream-2BW can accelerate training by up to 6.9x compared to optimized
baselines that do not use pipelining, and up to 1.9x compared to GPipe
Preprint on arXiv: https://arxiv.org/pdf/2006.09503.pdf
Conclusion
https://cs.stanford.edu/~deepakn/
Pipeline parallelism can accelerate distributed training both in regimes
where model metadata (weight parameters and intermediate
activations) fit on a single worker, and where they do not
Code open sourced on GitHub: https://github.com/msr-fiddle/pipedream
Feedback
Your feedback is important to us.
Don’t forget to rate
and review the sessions.