Announcement
HW3 will be out today. Get started ASAP!
• There will be additional TA office hours held by the creators of this homework: Amanda and Emmy
• It is due Nov 30th, two weeks from now, excluding the Thanksgiving holiday
Final Project Presentation will be a conference-poster-style session at the GHC 7107 Atrium
• More instructions on the course website
Scaling Up LLM Pretraining: Parallel Training
Chenyan Xiong
11-667
Outline
Parallel Training
• Data Parallelism
• Pipeline Parallelism
• Tensor Parallelism
• Combination of Parallelism
• ZeRO Optimizer
Parallel Training: Overview
As scale grows, training with one GPU is not enough
• There are many ways to improve efficiency of single-GPU training
• Checkpointing: moving parts of the operations to CPU memory
• Quantizing different parts of the optimization to reduce GPU memory cost
• Eventually more FLOPs are needed
Different setups of parallel training:
• When model training fits on a single GPU
→ Data parallelism
• When model training cannot fit on a single GPU
→ Model parallelism: pipeline or tensor
Split the training data batch across different GPUs
• Each GPU maintains its own copy of the model and optimizer
• Each GPU gets a different local data batch and calculates its gradients
Parallel Training: Data Parallelism
[Diagram: three GPUs (GPU 1-3), each holding a full copy of the Transformer layers; GPU i runs the forward pass f(x_i, θ) and backward pass g(x_i, θ) on its own local batch x_i.]
Split the training data batch across different GPUs
• Each GPU maintains its own copy of the model and optimizer
• Each GPU gets a different local data batch and calculates its gradients
• Gather the local gradients on each GPU for a global update
Parallel Training: Data Parallelism
[Diagram: the same three-GPU setup; after the backward pass, the local gradients g(x_i, θ) are all-gathered so that every GPU holds the global gradients g(x_{1:3}, θ).]
Split the training data batch across different GPUs
• Each GPU maintains its own copy of the model and optimizer
• Each GPU gets a different local data batch and calculates its gradients
• Gather the local gradients on each GPU for a global update
Parallel Training: Data Parallelism
[Diagram: the same data-parallel setup, with the all-gather of local gradients into global gradients g(x_{1:3}, θ) on every GPU.]
Communication:
• The full gradient tensor between every pair of GPUs, at each training batch
• Not an issue between GPUs in the same machine or between machines with InfiniBand
• Needs workarounds without fast cross-GPU connections
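To make the communication pattern concrete, below is a minimal sketch of one data-parallel training step in PyTorch, assuming one process per GPU (e.g., launched with torchrun) and an already initialized process group; the model, batch, and loss function names are illustrative.

```python
# One data-parallel step: every rank holds a full model copy, computes gradients
# on its local batch, then averages gradients across ranks so all replicas
# apply the same global update.
import torch
import torch.distributed as dist

def data_parallel_step(model, optimizer, local_batch, loss_fn):
    optimizer.zero_grad()
    logits = model(local_batch["input_ids"])          # forward pass f(x_i, θ) on the local batch
    loss = loss_fn(logits, local_batch["labels"])
    loss.backward()                                    # local gradients g(x_i, θ)

    world_size = dist.get_world_size()
    for param in model.parameters():                   # communicate the full gradient tensor
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size                   # average: global gradient g(x_{1:N}, θ)

    optimizer.step()                                   # identical update on every rank
    return loss.item()
```

In practice, torch.nn.parallel.DistributedDataParallel performs the same gradient averaging automatically and overlaps it with the backward pass.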
Parallel Training: Model Parallelism
LLM sizes grew quickly and passed the limit of single-GPU memory
Solution: split network parameters (and thus their gradients and corresponding optimizer states) across different GPUs
| Memory item | Cost of 10B model | Function of parameter count (Ψ) |
| Parameter bytes | 20 GB | 2Ψ |
| Gradient bytes | 20 GB | 2Ψ |
| Optimizer state: 1st-order momentum | 20 GB | 2Ψ |
| Optimizer state: 2nd-order momentum | 20 GB | 2Ψ |
| Total per model instance | 80 GB | 8Ψ |
Table 1: Memory consumption of training solely with BF16 (ideal case) for a model with Ψ parameters
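As a quick sanity check of Table 1, assuming 2 bytes per value in BF16 for the parameters, gradients, and both Adam momentum terms:

```python
# Back-of-the-envelope check of Table 1 (ideal all-BF16 case).
psi = 10e9            # parameter count Ψ of a 10B model
bytes_per_value = 2   # BF16

parameters = bytes_per_value * psi   # 2Ψ bytes
gradients  = bytes_per_value * psi   # 2Ψ bytes
adam_m     = bytes_per_value * psi   # 1st-order momentum, 2Ψ bytes
adam_v     = bytes_per_value * psi   # 2nd-order momentum, 2Ψ bytes
total      = parameters + gradients + adam_m + adam_v   # 8Ψ bytes

print(parameters / 1e9, total / 1e9)   # 20.0 GB per component, 80.0 GB in total
```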
Parallel Training: Model Parallelism
Two ways of splitting network parameters
[Diagram: pipeline parallelism splits the model by layers, placing different Transformer layers on GPU 1 and GPU 2; tensor parallelism splits the tensors of each Transformer layer across GPU 1 and GPU 2.]
Parallel Training: Pipeline Parallelism
Split the network by layers, arrange devices in layer order to form a pipeline, and pass data through the devices [7]
[7] Huang et al. “GPipe: Easy Scaling with Micro-Batch Pipeline Parallelism”. NeurIPS 2019
Figure 7: Illustration of Pipeline Parallelism [7]
Parallel Training: Pipeline Parallelism
Split the network by layers, arrange devices in layer order to form a pipeline, and pass data through the devices [7]
[7] Huang et al. “GPipe: Easy Scaling with Micro-Batch Pipeline Parallelism”. NeurIPS 2019
Figure 7: Illustration of Pipeline Parallelism [7]
Split batches for more fine-grained pipelines
Parallel Training: Pipeline Parallelism
Split the network by layers, arrange devices in layer order to form a pipeline, and pass data through the devices [7]
[7] Huang et al. “GPipe: Easy Scaling with Micro-Batch Pipeline Parallelism”. NeurIPS 2019
Figure 7: Illustration of Pipeline Parallelism [7]
Communication:
• Activations between adjacent devices in the forward pass
• Partial gradients between adjacent devices in the backward pass
Parallel Training: Pipeline Parallelism
Split the network by layers, arrange devices in layer order to form a pipeline, and pass data through the devices [7]
Pros: Conceptually simple and not coupled to the network architecture; all networks have multiple layers.
Cons: Compute is wasted in the bubble, and the bubble gets bigger with more devices and bigger batches.
[7] Huang et al. “GPipe: Easy Scaling with Micro-Batch Pipeline Parallelism”. NeurIPS 2019
Figure 7: Illustration of Pipeline Parallelism [7]
Communication:
• Activations between adjacent devices in the forward pass
• Partial gradients between adjacent devices in the backward pass
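Below is a minimal sketch of the layer split, assuming two visible GPUs. There is no micro-batching here, so the bubble is maximal; GPipe-style schedulers additionally slice the batch into micro-batches to keep both devices busy. All shapes and layer choices are illustrative.

```python
# Naive pipeline parallelism across two GPUs (no micro-batching).
# The first half of the layers lives on cuda:0, the second half on cuda:1;
# activations cross devices in the forward pass, and gradients flow back
# through the same path in the backward pass.
import torch
import torch.nn as nn

d_model, n_layers = 1024, 8
layers = [nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True) for _ in range(n_layers)]
stage0 = nn.Sequential(*layers[: n_layers // 2]).to("cuda:0")
stage1 = nn.Sequential(*layers[n_layers // 2 :]).to("cuda:1")

x = torch.randn(4, 128, d_model, device="cuda:0")   # (batch, sequence, hidden), illustrative
h = stage0(x)                     # forward on GPU 0
h = h.to("cuda:1")                # communication: activations move to the next stage
out = stage1(h)                   # forward on GPU 1
loss = out.float().pow(2).mean()  # placeholder loss
loss.backward()                   # backward sends partial gradients back across the stage boundary
```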
Outline
Parallel Training
• Data Parallelism
• Pipeline Parallelism
• Tensor Parallelism
• Combination of Parallelism
• ZeRO Optimizer
Parallel Training: Tensor Parallelism
Split the parameter tensors of network layers across different devices for parallel matrix operations
[8] Shoeybi et al. “Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism”. arXiv 2019
Figure 8: Tensor Parallelism of MLP blocks and Self-attention Blocks [8]
Parallel Training: Tensor Parallelism
Split the parameter tensors of network layers across different devices for parallel matrix operations
[8] Shoeybi et al. “Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism”. arXiv 2019
Figure 8: Tensor Parallelism of MLP blocks and Self-attention Blocks [8]
Parallel Training: Tensor Parallelism
Split the parameter tensors of network layers across different devices for parallel matrix operations
Pros: No bubble
Cons: Different blocks are best split differently, requiring lots of customization
[8] Shoeybi et al. “Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism”. arXiv 2019
Figure 8: Tensor Parallelism of MLP blocks and Self-attention Blocks [8]
Parallel Training: Tensor Parallelism
Split the parameter tensors of network layers across different devices for parallel matrix operations
[8] Shoeybi et al. “Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism”. arXiv 2019
Figure 9: Communication of Tensor Parallelism for a Transformer Layer [8]
Communication:
• All-gather of partial activations and gradients for each split tensor
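A single-process sketch of a Megatron-style column-parallel linear layer: the weight matrix is split along its output dimension, each shard produces a slice of the activation, and the full activation is recovered by concatenation (an all-gather across the tensor-parallel group when the shards live on different GPUs). Shapes are illustrative.

```python
# Column-parallel linear layer, simulated in one process with two weight shards.
# On real hardware each shard lives on a different GPU and the concatenation
# is a dist.all_gather over the tensor-parallel group.
import torch

d_in, d_out, tp_degree = 1024, 4096, 2
x = torch.randn(8, d_in)                        # the input is replicated on every TP rank
w = torch.randn(d_out, d_in) * 0.02             # full weight, conceptually never materialized
w_shards = torch.chunk(w, tp_degree, dim=0)     # split the output dimension across TP ranks

partial = [x @ w_i.T for w_i in w_shards]       # each rank computes its slice of the output
y_tp = torch.cat(partial, dim=-1)               # all-gather of partial activations

y_ref = x @ w.T                                 # what a single GPU would compute
print(torch.allclose(y_tp, y_ref, atol=1e-5))   # True: same result, computed in shards
```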
Parallel Training: Combining Different Parallelism
Data parallelism and model parallelism are often used together.
• There is rarely a reason not to also use data parallelism
Pipeline parallelism and tensor parallelism can also be used together.
[9] Narayanan et al. “Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM”. SC 2021.
Figure 10: Combination of Tensor Parallelism and Pipeline Parallelism [9]
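The degrees of the three kinds of parallelism multiply to the total GPU count; the numbers below are purely illustrative and not the configuration of any particular system.

```python
# Factoring a cluster into parallelism degrees:
# world_size = tensor_parallel * pipeline_parallel * data_parallel.
world_size = 1024        # total GPUs
tensor_parallel = 8      # tensor-parallel group, typically within one machine (fast links)
pipeline_parallel = 16   # pipeline stages, typically across machines
data_parallel = world_size // (tensor_parallel * pipeline_parallel)

assert tensor_parallel * pipeline_parallel * data_parallel == world_size
print(data_parallel)     # 8 data-parallel replicas of the (sharded) model
```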
Outline
Parallel Training
• Data Parallelism
• Pipeline Parallelism
• Tensor Parallelism
• Combination of Parallelism
• ZeRO Optimizer
ZeRO: Redundancy in Data Parallelism
The majority of GPU memory consumption is on the optimization side: gradients and optimizer momenta
| Memory item | Cost of 10B model | Function of parameter count (Ψ) |
| Parameter bytes | 20 GB | 2Ψ |
| Gradient bytes | 20 GB | 2Ψ |
| Optimizer state: 1st-order momentum | 20 GB | 2Ψ |
| Optimizer state: 2nd-order momentum | 20 GB | 2Ψ |
| Total per model instance | 80 GB | 8Ψ |
Table 1: Memory consumption of training solely with BF16 (ideal case) for a model with Ψ parameters
ZeRO: Reduce Memory Redundancy
ZeRO Optimizer: split GPU memory consumption across multiple GPUs during data parallelism
[10] Rajbhandari et al. “ZeRO: Memory Optimizations Toward Training Trillion Parameter Models”. arXiv 2019.
Figure 11: ZeRO Optimizer Stages [10]
Stage 1: Split Optimizer States
Stage 2: +Split Gradients
ZeRO: Redundancy in Data Parallelism
ZeRO Stage 1 and 2: reducing memory redundancy
Observation:
• In data parallelism, each device only has access to its local gradients
• An all-gather over all gradients is required anyway
ZeRO: Redundancy in Data Parallelism
An example way to implement ZeRO Stage 1
[Diagram: data parallelism across GPU 1-3 with sharded Adam optimizer states. After the all-gather of gradients, GPU i holds only the 1st-moment shard m(x, θ_i) and the 2nd-moment shard v(x, θ_i), and performs the Adam parameter update for its shard of the parameters.]
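A minimal sketch of Stage-1-style optimizer-state sharding using PyTorch's ZeroRedundancyOptimizer on top of ordinary DistributedDataParallel; MyTransformerLM and the hyperparameters are placeholders.

```python
# ZeRO Stage 1 style sharding: each data-parallel rank stores Adam's 1st/2nd
# moments only for its shard of the parameters, updates that shard, and then
# syncs the updated parameters with the other ranks.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.optim import ZeroRedundancyOptimizer

dist.init_process_group("nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

model = DDP(MyTransformerLM().cuda(), device_ids=[local_rank])   # MyTransformerLM is a placeholder
optimizer = ZeroRedundancyOptimizer(
    model.parameters(),
    optimizer_class=torch.optim.AdamW,   # the m(x, θ_i) and v(x, θ_i) shards live on rank i
    lr=1e-4,
)
# The training loop is unchanged: DDP all-reduces gradients as usual, then each
# rank's optimizer.step() updates only its parameter shard and broadcasts the result.
```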
ZeRO: Reduce Memory Redundancy
ZeRO Optimizer: split GPU memory consumption across multiple GPUs during data parallelism
[10] Rajbhandari et al. “ZeRO: Memory Optimizations Toward Training Trillion Parameter Models”. arXiv 2019.
Figure 11: ZeRO Optimizer Stages [10]
| Stage | Communication |
| Stage 1: Split Optimizer States | Free ride with data parallelism |
| Stage 2: +Split Gradients | Free ride with data parallelism |
ZeRO: Reduce Memory Redundancy
ZeRO Optimizer: split GPU memory consumption across multiple GPUs during data parallelism
[10] Rajbhandari et al. “ZeRO: Memory Optimizations Toward Training Trillion Parameter Models”. arXiv 2019.
Figure 11: ZeRO Optimizer Stages [10]
| Stage | Communication |
| Stage 1: Split Optimizer States | Free ride with data parallelism |
| Stage 2: +Split Gradients | Free ride with data parallelism |
| Stage 3: +Split Parameters | |
ZeRO: Redundancy in Data Parallelism
Sharding parameters and passing them when needed
[Diagram: the same data-parallel setup with sharded Adam momentum m(x, θ_i) and v(x, θ_i) per GPU; parameters are additionally sharded and passed between GPUs when needed for the forward and backward passes.]
ZeRO: Reduce Memory Redundancy
ZeRO Optimizer: split GPU memory consumption across multiple GPUs during data parallelism
Pros: Stages 1 and 2 free ride with data parallelism and bring huge GPU memory savings
Cons: Open-source support is not as good
Note: Stage 3 is different from tensor parallelism. It passes parameters when needed but still performs the computation of the full layer/network on one GPU. It is data parallelism with GPU memory sharding.
[10] Rajbhandari et al. “ZeRO: Memory Optimizations Toward Training Trillion Parameter Models”. arXiv 2019.
Figure 11: ZeRO Optimizer Stages [10]
| Stage | Communication |
| Stage 1: Split Optimizer States | Free ride with data parallelism |
| Stage 2: +Split Gradients | Free ride with data parallelism |
| Stage 3: +Split Parameters | All-gather parameters |
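In open-source PyTorch, the closest widely used counterpart to Stage-3-style sharding is FullyShardedDataParallel, which shards parameters, gradients, and optimizer states across the data-parallel ranks and all-gathers parameters on the fly. A minimal sketch, with MyTransformerLM again a placeholder:

```python
# Stage-3-style sharding with PyTorch FSDP: parameters, gradients, and optimizer
# states are sharded across data-parallel ranks; full parameters are all-gathered
# only while the corresponding layer's forward/backward is running.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

model = FSDP(MyTransformerLM().cuda())                        # MyTransformerLM is a placeholder
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)    # optimizer states follow the shards
# As noted above, each full layer still runs on a single GPU: this is data
# parallelism with GPU memory sharding, not tensor parallelism.
```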
A peek into a real large-scale pretraining workflow
Lots of first-hand information was released through the FAIR OPT pretraining run:
https://github.com/facebookresearch/metaseq/tree/main/projects/OPT/chronicles
Background
A group of researchers and engineers is tasked with pretraining a large-scale model like GPT-3
• 1024 A100 80GB GPUs to use. Yes!
Constraints:
• Task given around the beginning of Nov 2021
• Goal is to pretrain a 175-billion-parameter-scale model by the end of the year
• Which at minimum requires 33 days on 1K A100s
• With no previous experience in large-scale pretraining at all
Challenge #1: Much Research Work Doesn’t Scale
Hope: we started with high hopes that all our research improvements at small scale would give us a better GPT
https://github.com/facebookresearch/metaseq/blob/main/projects/OPT/chronicles/10_percent_update.md
Challenge #1: Much Research Work Doesn’t Scale
Reality: short timeline, big money on the line, nothing too fancy
https://github.com/facebookresearch/metaseq/blob/main/projects/OPT/chronicles/10_percent_update.md
Challenge #2: Hardware Failures
GPU machines are not very reliable. With 1024 A100s, you are guaranteed to have bad nodes.
https://github.com/facebookresearch/metaseq/blob/main/projects/OPT/chronicles/27_percent_update.md
Challenge #2: Hardware Failures
Solution? Hopefully better tooling in the future, but right now:
https://github.com/facebookresearch/metaseq/blob/main/projects/OPT/chronicles/27_percent_update.md
Challenge #2: Hardware Failures
Forming an on-call group to watch OPT training
[Meme: “Alchemy Furnace of the LLM Era” / “We Watching LLM Training”]
Challenge #3: Optimization Stability
Lots of optimization stability issues:
• Loss explodes, gradients overflow/underflow, training stagnates…
The Importance of Scaling Law
Essential to determine what to do at large scale using observations at smaller scales
Final Remarks from OPT
Other Notable Literature on Scaling Up
Different configurations of layer normalization: pre-layernorm, post-layernorm, and their combinations
• Xiong et al. “On Layer Normalization in the Transformer Architecture”. ICML 2020
• Zhang and Sennrich. “Root Mean Square Layer Normalization”. NeurIPS 2019
Position embeddings for longer contexts and expressiveness
• Su et al. “Roformer: Enhanced transformer with rotary position embedding.” arXiv 2021
Stability improvement from adaptive initialization
• Liu et al. “Understanding the Difficulty of Training Transformers”. EMNLP 2020
Quiz: What can we do to reduce communication overhead if only slow network connections are available between GPUs?