Parallel Training
• Data Parallelism
• Pipeline Parallelism
• Tensor Parallelism
• Combination of Parallelism
• ZeRO Optimizer
As model scale grows, training on a single GPU is no longer enough
• There are many ways to improve the efficiency of single-GPU training
• Checkpointing / offloading: recomputing activations instead of storing them, or moving parts of the computation and optimizer state to CPU memory (see the sketch after this list)
• Quantizing different parts of the optimizer state to reduce GPU memory cost
• Eventually, more FLOPs are needed than a single GPU can provide
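A minimal sketch of activation (gradient) checkpointing in PyTorch, one common single-GPU memory-saving technique: activations inside each block are recomputed during the backward pass instead of being stored. The `Block` / `CheckpointedNet` classes and the tensor sizes are hypothetical placeholders chosen only for illustration.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class Block(nn.Module):
    # Hypothetical residual feed-forward block
    def __init__(self, dim):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        return x + self.ff(x)

class CheckpointedNet(nn.Module):
    def __init__(self, dim=1024, depth=12):
        super().__init__()
        self.blocks = nn.ModuleList([Block(dim) for _ in range(depth)])

    def forward(self, x):
        for blk in self.blocks:
            # Activations inside each block are not kept; they are
            # recomputed during backward, trading compute for memory.
            x = checkpoint(blk, x, use_reentrant=False)
        return x

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CheckpointedNet().to(device)
x = torch.randn(8, 128, 1024, device=device, requires_grad=True)
model(x).sum().backward()
```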
Different setups of parallel training:
• When model training fits on a single GPU
→ Data parallelism (see the DDP sketch below)
• When model training does not fit on a single GPU
→ Model parallelism: pipeline or tensor parallelism (see the tensor-parallel sketch below)
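A minimal sketch of data parallelism with PyTorch DistributedDataParallel (DDP): every GPU holds a full model replica, each rank processes its own slice of the data, and gradients are all-reduced before the optimizer step. It assumes a launch with `torchrun --nproc_per_node=N`; the model, batch, and learning rate are hypothetical placeholders.

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("nccl")            # one process per GPU
    rank = dist.get_rank()
    device = rank % torch.cuda.device_count()
    torch.cuda.set_device(device)

    model = nn.Linear(1024, 1024).to(device)   # full replica on every GPU
    ddp_model = DDP(model, device_ids=[device])
    opt = torch.optim.AdamW(ddp_model.parameters(), lr=1e-3)

    for step in range(10):
        x = torch.randn(32, 1024, device=device)  # each rank sees its own data shard
        loss = ddp_model(x).pow(2).mean()
        loss.backward()                            # gradients all-reduced across ranks
        opt.step()
        opt.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```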
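A minimal sketch of the tensor-parallel idea: a linear layer whose weight matrix is split column-wise across ranks, so no single GPU stores the whole layer, with an all-gather to reassemble the output. It assumes `torch.distributed` is already initialized (e.g. via `init_process_group`), shows the forward pass only (a real implementation wraps the collective in a custom autograd function), and the `ColumnParallelLinear` name is a hypothetical placeholder.

```python
import torch
import torch.distributed as dist
import torch.nn as nn

class ColumnParallelLinear(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        world = dist.get_world_size()
        assert out_features % world == 0
        # Each rank stores only 1/world of the output columns.
        self.local = nn.Linear(in_features, out_features // world, bias=False)

    def forward(self, x):
        local_out = self.local(x)                  # [batch, out_features / world]
        parts = [torch.empty_like(local_out) for _ in range(dist.get_world_size())]
        dist.all_gather(parts, local_out)          # collect every rank's slice
        return torch.cat(parts, dim=-1)            # full [batch, out_features] output
```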