Announcement
HW3 will be out today. Get started ASAP!
• There will be additional TA office hours held by the creators of this homework: Amanda and Emmy
• It is due Nov 30th, two weeks from now, excluding the Thanksgiving holiday
Final Project Presentation will be a conference-poster-style session at the GHC 7107 Atrium
• More instructions on the course website
Scaling Up LLM Pretraining: Parallel Training
Chenyan Xiong
11-667
Outline
Parallel Training
• Data Parallelism
• Pipeline Parallelism
• Tensor Parallelism
• Combination of Parallelism
• ZeRO Optimizer
Parallel Training: Overview
As scale grows, training with one GPU is not enough
• There are many ways to improve efficiency of single-GPU training
• Checkpointing: moving parts of the operations to CPU memory
• Quantizing different parts of the optimization to reduce GPU memory cost
• Eventually more FLOPs are needed
Different setups of parallel training:
• When model training fits on a single GPU
→ Data parallelism
• When model training cannot fit on a single GPU
→ Model parallelism: pipeline or tensor
Split the training data batch across different GPUs
• Each GPU maintains its own copy of the model and optimizer
• Each GPU gets a different local data batch and calculates its gradients
Parallel Training: Data Parallelism
[Diagram: three GPUs (GPU 1-3), each holding a full copy of the Transformer layers; GPU i runs the forward pass f(x_i, θ) and backward pass g(x_i, θ) on its own local batch x_i.]
Split the training data batch across different GPUs
• Each GPU maintains its own copy of the model and optimizer
• Each GPU gets a different local data batch and calculates its gradients
• Gather the local gradients on each GPU for a global update
Parallel Training: Data Parallelism
[Diagram: the same three-GPU setup; after the backward pass, the local gradients g(x_i, θ) are all-gathered so that every GPU holds the global gradients g(x_{1:3}, θ).]
Split the training data batch across different GPUs
• Each GPU maintains its own copy of the model and optimizer
• Each GPU gets a different local data batch and calculates its gradients
• Gather the local gradients on each GPU for a global update
Parallel Training: Data Parallelism
[Diagram: the same data-parallel setup, with the all-gather of local gradients into global gradients g(x_{1:3}, θ) on every GPU.]
Communication:
• The full gradient tensor between every pair of GPUs, at each training batch
• Not an issue between GPUs in the same machine or between machines with InfiniBand
• Needs workarounds without fast cross-GPU connections
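To make the communication pattern concrete, below is a minimal sketch of one data-parallel training step in PyTorch, assuming one process per GPU (e.g., launched with torchrun) and an already initialized process group; the model, batch, and loss function names are illustrative.

```python
# One data-parallel step: every rank holds a full model copy, computes gradients
# on its local batch, then averages gradients across ranks so all replicas
# apply the same global update.
import torch
import torch.distributed as dist

def data_parallel_step(model, optimizer, local_batch, loss_fn):
    optimizer.zero_grad()
    logits = model(local_batch["input_ids"])          # forward pass f(x_i, θ) on the local batch
    loss = loss_fn(logits, local_batch["labels"])
    loss.backward()                                    # local gradients g(x_i, θ)

    world_size = dist.get_world_size()
    for param in model.parameters():                   # communicate the full gradient tensor
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size                   # average: global gradient g(x_{1:N}, θ)

    optimizer.step()                                   # identical update on every rank
    return loss.item()
```

In practice, torch.nn.parallel.DistributedDataParallel performs the same gradient averaging automatically and overlaps it with the backward pass.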
Parallel Training: Model Parallelism
LLM sizes grew quickly and passed the limit of single-GPU memory
Solution: split network parameters (and thus their gradients and corresponding optimizer states) across different GPUs
| Memory item | Cost of 10B model | Function of parameter count (Ψ) |
| Parameter bytes | 20 GB | 2Ψ |
| Gradient bytes | 20 GB | 2Ψ |
| Optimizer state: 1st-order momentum | 20 GB | 2Ψ |
| Optimizer state: 2nd-order momentum | 20 GB | 2Ψ |
| Total per model instance | 80 GB | 8Ψ |
Table 1: Memory consumption of training solely with BF16 (ideal case) for a model with Ψ parameters
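As a quick sanity check of Table 1, assuming 2 bytes per value in BF16 for the parameters, gradients, and both Adam momentum terms:

```python
# Back-of-the-envelope check of Table 1 (ideal all-BF16 case).
psi = 10e9            # parameter count Ψ of a 10B model
bytes_per_value = 2   # BF16

parameters = bytes_per_value * psi   # 2Ψ bytes
gradients  = bytes_per_value * psi   # 2Ψ bytes
adam_m     = bytes_per_value * psi   # 1st-order momentum, 2Ψ bytes
adam_v     = bytes_per_value * psi   # 2nd-order momentum, 2Ψ bytes
total      = parameters + gradients + adam_m + adam_v   # 8Ψ bytes

print(parameters / 1e9, total / 1e9)   # 20.0 GB per component, 80.0 GB in total
```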
Parallel Training: Model Parallelism
Two ways of splitting network parameters
[Diagram: pipeline parallelism splits the model by layers, placing different Transformer layers on GPU 1 and GPU 2; tensor parallelism splits the tensors of each Transformer layer across GPU 1 and GPU 2.]
Parallel Training: Pipeline Parallelism
Split the network by layers, arrange devices in layer order to form a pipeline, and pass data through the devices [7]
[7] Huang et al. “GPipe: Easy Scaling with Micro-Batch Pipeline Parallelism”. NeurIPS 2019
Figure 7: Illustration of Pipeline Parallelism [7]
Parallel Training: Pipeline Parallelism
Split the network by layers, arrange devices in layer order to form a pipeline, and pass data through the devices [7]
[7] Huang et al. “GPipe: Easy Scaling with Micro-Batch Pipeline Parallelism”. NeurIPS 2019
Figure 7: Illustration of Pipeline Parallelism [7]
Split batches for more fine-grained pipelines
Parallel Training: Pipeline Parallelism
Split the network by layers, arrange devices in layer order to form a pipeline, and pass data through the devices [7]
[7] Huang et al. “GPipe: Easy Scaling with Micro-Batch Pipeline Parallelism”. NeurIPS 2019
Figure 7: Illustration of Pipeline Parallelism [7]
Communication:
• Activations between adjacent devices in the forward pass
• Partial gradients between adjacent devices in the backward pass
Parallel Training: Pipeline Parallelism
Split the network by layers, arrange devices in layer order to form a pipeline, and pass data through the devices [7]
Pros: Conceptually simple and not coupled to the network architecture; all networks have multiple layers.
Cons: Compute is wasted in the bubble, and the bubble gets bigger with more devices and bigger batches.
[7] Huang et al. “GPipe: Easy Scaling with Micro-Batch Pipeline Parallelism”. NeurIPS 2019
Figure 7: Illustration of Pipeline Parallelism [7]
Communication:
• Activations between adjacent devices in the forward pass
• Partial gradients between adjacent devices in the backward pass
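Below is a minimal sketch of the layer split, assuming two visible GPUs. There is no micro-batching here, so the bubble is maximal; GPipe-style schedulers additionally slice the batch into micro-batches to keep both devices busy. All shapes and layer choices are illustrative.

```python
# Naive pipeline parallelism across two GPUs (no micro-batching).
# The first half of the layers lives on cuda:0, the second half on cuda:1;
# activations cross devices in the forward pass, and gradients flow back
# through the same path in the backward pass.
import torch
import torch.nn as nn

d_model, n_layers = 1024, 8
layers = [nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True) for _ in range(n_layers)]
stage0 = nn.Sequential(*layers[: n_layers // 2]).to("cuda:0")
stage1 = nn.Sequential(*layers[n_layers // 2 :]).to("cuda:1")

x = torch.randn(4, 128, d_model, device="cuda:0")   # (batch, sequence, hidden), illustrative
h = stage0(x)                     # forward on GPU 0
h = h.to("cuda:1")                # communication: activations move to the next stage
out = stage1(h)                   # forward on GPU 1
loss = out.float().pow(2).mean()  # placeholder loss
loss.backward()                   # backward sends partial gradients back across the stage boundary
```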
Outline
Parallel Training
• Data Parallelism
• Pipeline Parallelism
• Tensor Parallelism
• Combination of Parallelism
• ZeRO Optimizer
Parallel Training: Tensor Parallelism
Split the parameter tensors of network layers across different devices for parallel matrix operations
[8] Shoeybi et al. “Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism”. arXiv 2019
Figure 8: Tensor Parallelism of MLP blocks and Self-attention Blocks [8]
Parallel Training: Tensor Parallelism
Split the parameter tensors of network layers across different devices for parallel matrix operations
[8] Shoeybi et al. “Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism”. arXiv 2019
Figure 8: Tensor Parallelism of MLP blocks and Self-attention Blocks [8]
Parallel Training: Tensor Parallelism
Split the parameter tensors of network layers across different devices for parallel matrix operations
Pros: No bubble
Cons: Different blocks are best split differently, requiring lots of customization
[8] Shoeybi et al. “Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism”. arXiv 2019
Figure 8: Tensor Parallelism of MLP blocks and Self-attention Blocks [8]
Parallel Training: Tensor Parallelism
Split the parameter tensors of network layers across different devices for parallel matrix operations
[8] Shoeybi et al. “Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism”. arXiv 2019
Figure 9: Communication of Tensor Parallelism for a Transformer Layer [8]
Communication:
• All-gather of partial activations and gradients for each split tensor
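A single-process sketch of a Megatron-style column-parallel linear layer: the weight matrix is split along its output dimension, each shard produces a slice of the activation, and the full activation is recovered by concatenation (an all-gather across the tensor-parallel group when the shards live on different GPUs). Shapes are illustrative.

```python
# Column-parallel linear layer, simulated in one process with two weight shards.
# On real hardware each shard lives on a different GPU and the concatenation
# is a dist.all_gather over the tensor-parallel group.
import torch

d_in, d_out, tp_degree = 1024, 4096, 2
x = torch.randn(8, d_in)                        # the input is replicated on every TP rank
w = torch.randn(d_out, d_in) * 0.02             # full weight, conceptually never materialized
w_shards = torch.chunk(w, tp_degree, dim=0)     # split the output dimension across TP ranks

partial = [x @ w_i.T for w_i in w_shards]       # each rank computes its slice of the output
y_tp = torch.cat(partial, dim=-1)               # all-gather of partial activations

y_ref = x @ w.T                                 # what a single GPU would compute
print(torch.allclose(y_tp, y_ref, atol=1e-5))   # True: same result, computed in shards
```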
Parallel Training: Combining Different Parallelism
Data parallelism and model parallelism are often used together.
• There is rarely a reason not to also use data parallelism
Pipeline parallelism and tensor parallelism can also be used together.
[9] Narayanan et al. “Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM”. SC 2021.
Figure 10: Combination of Tensor Parallelism and Pipeline Parallelism [9]
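The degrees of the three kinds of parallelism multiply to the total GPU count; the numbers below are purely illustrative and not the configuration of any particular system.

```python
# Factoring a cluster into parallelism degrees:
# world_size = tensor_parallel * pipeline_parallel * data_parallel.
world_size = 1024        # total GPUs
tensor_parallel = 8      # tensor-parallel group, typically within one machine (fast links)
pipeline_parallel = 16   # pipeline stages, typically across machines
data_parallel = world_size // (tensor_parallel * pipeline_parallel)

assert tensor_parallel * pipeline_parallel * data_parallel == world_size
print(data_parallel)     # 8 data-parallel replicas of the (sharded) model
```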
Outline
Parallel Training
• Data Parallelism
• Pipeline Parallelism
• Tensor Parallelism
• Combination of Parallelism
• ZeRO Optimizer
ZeRO: Redundancy in Data Parallelism
The majority of GPU memory consumption is on the optimization side: gradients and optimizer momenta
| Memory item | Cost of 10B model | Function of parameter count (Ψ) |
| Parameter bytes | 20 GB | 2Ψ |
| Gradient bytes | 20 GB | 2Ψ |
| Optimizer state: 1st-order momentum | 20 GB | 2Ψ |
| Optimizer state: 2nd-order momentum | 20 GB | 2Ψ |
| Total per model instance | 80 GB | 8Ψ |
Table 1: Memory consumption of training solely with BF16 (ideal case) for a model with Ψ parameters
ZeRO: Reduce Memory Redundancy
ZeRO Optimizer: split GPU memory consumption across multiple GPUs during data parallelism
[10] Rajbhandari et al. “ZeRO: Memory Optimizations Toward Training Trillion Parameter Models”. arXiv 2019.
Figure 11: ZeRO Optimizer Stages [10]
Stage 1: Split Optimizer States
Stage 2: +Split Gradients
ZeRO: Redundancy in Data Parallelism
ZeRO Stage 1 and 2: reducing memory redundancy
Observation:
• In data parallelism, each device only has access to its local gradients
• An all-gather over all gradients is required anyway
ZeRO: Redundancy in Data Parallelism
An example way to implement ZeRO Stage 1
[Diagram: data parallelism across GPU 1-3 with sharded Adam optimizer states. After the all-gather of gradients, GPU i holds only the 1st-moment shard m(x, θ_i) and the 2nd-moment shard v(x, θ_i), and performs the Adam parameter update for its shard of the parameters.]
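A minimal sketch of Stage-1-style optimizer-state sharding using PyTorch's ZeroRedundancyOptimizer on top of ordinary DistributedDataParallel; MyTransformerLM and the hyperparameters are placeholders.

```python
# ZeRO Stage 1 style sharding: each data-parallel rank stores Adam's 1st/2nd
# moments only for its shard of the parameters, updates that shard, and then
# syncs the updated parameters with the other ranks.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.optim import ZeroRedundancyOptimizer

dist.init_process_group("nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

model = DDP(MyTransformerLM().cuda(), device_ids=[local_rank])   # MyTransformerLM is a placeholder
optimizer = ZeroRedundancyOptimizer(
    model.parameters(),
    optimizer_class=torch.optim.AdamW,   # the m(x, θ_i) and v(x, θ_i) shards live on rank i
    lr=1e-4,
)
# The training loop is unchanged: DDP all-reduces gradients as usual, then each
# rank's optimizer.step() updates only its parameter shard and broadcasts the result.
```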
ZeRO: Reduce Memory Redundancy
ZeRO Optimizer: split GPU memory consumption across multiple GPUs during data parallelism
[10] Rajbhandari et al. “ZeRO: Memory Optimizations Toward Training Trillion Parameter Models”. arXiv 2019.
Figure 11: ZeRO Optimizer Stages [10]
| Stage | Communication |
| Stage 1: Split Optimizer States | Free ride with data parallelism |
| Stage 2: +Split Gradients | Free ride with data parallelism |
ZeRO: Reduce Memory Redundancy
ZeRO Optimizer: split GPU memory consumption across multiple GPUs during data parallelism
[10] Rajbhandari et al. “ZeRO: Memory Optimizations Toward Training Trillion Parameter Models”. arXiv 2019.
Figure 11: ZeRO Optimizer Stages [10]
| Stage | Communication |
| Stage 1: Split Optimizer States | Free ride with data parallelism |
| Stage 2: +Split Gradients | Free ride with data parallelism |
| Stage 3: +Split Parameters | |
ZeRO: Redundancy in Data Parallelism
Sharding parameters and passing them when needed
[Diagram: the same data-parallel setup with sharded Adam momentum m(x, θ_i) and v(x, θ_i) per GPU; parameters are additionally sharded and passed between GPUs when needed for the forward and backward passes.]
ZeRO: Reduce Memory Redundancy
ZeRO Optimizer: split GPU memory consumption across multiple GPUs during data parallelism
Pros: Stages 1 and 2 free ride with data parallelism and bring huge GPU memory savings
Cons: Open-source support is not as good
Note: Stage 3 is different from tensor parallelism. It passes parameters when needed but still performs the computation of the full layer/network on one GPU. It is data parallelism with GPU memory sharding.
[10] Rajbhandari et al. “ZeRO: Memory Optimizations Toward Training Trillion Parameter Models”. arXiv 2019.
Figure 11: ZeRO Optimizer Stages [10]
| Stage | Communication |
| Stage 1: Split Optimizer States | Free ride with data parallelism |
| Stage 2: +Split Gradients | Free ride with data parallelism |
| Stage 3: +Split Parameters | All-gather parameters |
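In open-source PyTorch, the closest widely used counterpart to Stage-3-style sharding is FullyShardedDataParallel, which shards parameters, gradients, and optimizer states across the data-parallel ranks and all-gathers parameters on the fly. A minimal sketch, with MyTransformerLM again a placeholder:

```python
# Stage-3-style sharding with PyTorch FSDP: parameters, gradients, and optimizer
# states are sharded across data-parallel ranks; full parameters are all-gathered
# only while the corresponding layer's forward/backward is running.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

model = FSDP(MyTransformerLM().cuda())                        # MyTransformerLM is a placeholder
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)    # optimizer states follow the shards
# As noted above, each full layer still runs on a single GPU: this is data
# parallelism with GPU memory sharding, not tensor parallelism.
```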
A peek into a real large-scale pretraining workflow
Lots of first-hand information was released through the FAIR OPT pretraining run:
https://github.com/facebookresearch/metaseq/tree/main/projects/OPT/chronicles
Background
A group of researchers and engineers is tasked with pretraining a large-scale model like GPT-3
• 1024 A100 80GB GPUs to use. Yes!
Constraints:
• Task given around the beginning of Nov 2021
• Goal is to pretrain a 175-billion-parameter-scale model by the end of the year
• Which at minimum requires 33 days on 1K A100s
• With no previous experience in large-scale pretraining at all
Challenge #1: Much Research Work Doesn’t Scale
Hope: we started with high hopes that all our research improvements at small scale would give us a better GPT
https://github.com/facebookresearch/metaseq/blob/main/projects/OPT/chronicles/10_percent_update.md
Challenge #1: Much Research Work Doesn’t Scale
Reality: short timeline, big money on the line, nothing too fancy
https://github.com/facebookresearch/metaseq/blob/main/projects/OPT/chronicles/10_percent_update.md
Challenge #2: Hardware Failures
GPU machines are not very reliable. With 1024 A100s, you are guaranteed to have bad nodes.
https://github.com/facebookresearch/metaseq/blob/main/projects/OPT/chronicles/27_percent_update.md
Challenge #2: Hardware Failures
Solution? Hopefully better tooling in the future, but right now:
https://github.com/facebookresearch/metaseq/blob/main/projects/OPT/chronicles/27_percent_update.md
Challenge #2: Hardware Failures
Forming an on-call group to watch OPT training
[Meme: “Alchemy Furnace of the LLM Era” / “We Watching LLM Training”]
Challenge #3: Optimization Stability
Lots of optimization stability issues:
• Loss explodes, gradients overflow/underflow, training stagnates…
The Importance of Scaling Law
Essential to determine what to do at large scale using observations at smaller scales
Final Remarks from OPT
Other Notable Literature on Scaling Up
Different configurations of layer normalization: pre-layernorm, post-layernorm, and their combinations
• Xiong et al. “On Layer Normalization in the Transformer Architecture”. ICML 2020
• Zhang and Sennrich. “Root Mean Square Layer Normalization”. NeurIPS 2019
Position embeddings for longer contexts and expressiveness
• Su et al. “Roformer: Enhanced transformer with rotary position embedding.” arXiv 2021
Stability improvement from adaptive initialization
• Liu et al. “Understanding the Difficulty of Training Transformers”. EMNLP 2020
Quiz: What can we do to reduce communication overhead if only slow network connections are available between GPUs?