Splitting LLMs Across Multiple GPUs: Techniques, Tools, and Best Practices
Introduction
LLMs power advanced NLP tasks such as conversational AI, text generation, and machine translation. Models with hundreds of billions of parameters deliver exceptional performance, yet they require extensive GPU resources. A single-GPU system can quickly run out of memory when loading or training a large language model, so splitting LLMs across multiple GPUs has become essential for operational efficiency. This guide explains how to split and load LLMs across multiple GPUs while addressing GPU memory constraints and improving inference speed. We cover parallelism methods, including model and data parallelism, and explain how they apply to distributed LLM training.
Prerequisites
Why Split Large Language Models Across Multiple GPUs?
Modern large language models such as PaLM and Megatron-scale networks contain hundreds of billions of parameters. A single GPU often cannot host such massive models because available VRAM options of 12 GB, 24 GB, or even 80 GB may be insufficient to hold all model parameters, activations, and optimizer states. Splitting an LLM across multiple GPUs addresses this bottleneck in two main ways: by replicating the model and dividing the data across devices (data parallelism), or by dividing the model itself across devices (model parallelism).
Splitting LLMs remains a fundamental practice for advanced AI workloads, whether you run multi-GPU deep learning on a single machine or distributed LLM training across multiple servers.
Model Parallelism vs. Data Parallelism
There are two main approaches to using multiple GPUs with LLMs, each suited to different workloads:
Data Parallelism: This method runs a full replica of the model on each GPU and assigns each GPU a unique segment of the data to process. During training, each GPU computes gradients on its data subset before synchronizing them across all GPUs (see the sketch after this list).
Model Parallelism: This approach splits the model itself across multiple GPUs, with each GPU holding specific layers or parameters. The model’s parameters can be distributed at various granularities, including tensors, layers, and pipeline stages.
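As a quick illustration of data parallelism on a single machine, the sketch below uses PyTorch’s built-in torch.nn.DataParallel, which replicates a model on every visible GPU and splits each input batch among them. The model and batch shapes are placeholders; for serious training, the DistributedDataParallel workflow shown later in this article is the recommended approach.

import torch
import torch.nn as nn

# Placeholder model; substitute your own network.
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))

if torch.cuda.device_count() > 1:
    # Replicates the model on every visible GPU and scatters each batch across them.
    model = nn.DataParallel(model)
model = model.to("cuda")

inputs = torch.randn(64, 1024, device="cuda")  # one batch, split automatically across GPUs
outputs = model(inputs)                         # results are gathered back on the default GPU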
Types of Model Parallelism
Model parallelism can be further divided into specific techniques.
Tensor Parallelism: Tensor parallelism involves splitting the weights of each layer across multiple GPUs at the tensor level. In large matrix multiplication operations, the distribution across GPUs allows each GPU to process different parts of the matrix independently.
Pipeline Parallelism: Pipeline Parallelism distributes different layers across multiple GPUs, so each GPU processes a specific segment of the model. For example, if GPU 0 completes the first batch during the forward pass, it quickly sends the output to GPU 1. That way, GPU 1 can process the data for the next stage, and this cycle continues. By smartly staggering these mini-batches, all the GPUs can operate simultaneously.
Sharded Data Parallelism: This technique combines data parallelism with parameter sharding (each GPU stores only a portion of the parameters) to reduce memory requirements while maintaining efficient training; see the FSDP sketch after this list.
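For a concrete picture of sharded data parallelism, the sketch below uses PyTorch’s FullyShardedDataParallel (FSDP), which shards parameters, gradients, and optimizer states across ranks. It assumes the process group has already been initialized (as described in the DDP section later) and uses a placeholder MyModel class.

import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Assumes torch.distributed.init_process_group("nccl") has already been called
# and that each process has set its own CUDA device.
model = MyModel().cuda()  # placeholder model class

# FSDP shards parameters, gradients, and optimizer states across all ranks,
# so each GPU holds only a fraction of the full model state.
sharded_model = FSDP(model)

optimizer = torch.optim.AdamW(sharded_model.parameters(), lr=1e-4)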
GPU Memory Management: The Hidden Challenge
GPU memory management is often the primary performance bottleneck in multi-GPU systems. A naive approach pushes model slices onto multiple GPUs but ignores cross-GPU communication costs and potential memory fragmentation. During multi-GPU LLM inference, careful allocation of layers, tensors, and pipeline stages is essential.
Key considerations:
Model weights are only part of the footprint: activations, gradients, and optimizer states consume additional memory during training.
During inference, the KV (key-value) cache grows with sequence length and batch size and can rival the weights themselves.
Cross-GPU communication adds latency, so partitioning should minimize the data transferred between devices.
Uneven partitioning or memory fragmentation can leave one GPU saturated while others sit underutilized.
A rough memory estimate (see the sketch below) helps decide how many GPUs you need before loading the model.
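As a back-of-the-envelope aid, the helper below estimates memory from parameter count and precision, using the common rule of thumb of 2 bytes per parameter for FP16 weights plus a configurable multiplier for training overhead (gradients and optimizer states). The exact multiplier depends on the optimizer and framework, so treat the numbers as a rough starting point rather than a guarantee.

def estimate_memory_gb(num_params: float, bytes_per_param: int = 2, training_overhead: float = 8.0) -> dict:
    """Rough GPU memory estimate for a model with num_params parameters.

    bytes_per_param: 2 for FP16/BF16 weights, 4 for FP32.
    training_overhead: extra bytes per parameter for gradients and optimizer
    states (roughly 8-16 with Adam in mixed precision, depending on setup).
    """
    weights_gb = num_params * bytes_per_param / 1e9
    training_gb = num_params * (bytes_per_param + training_overhead) / 1e9
    return {"inference_weights_gb": weights_gb, "training_estimate_gb": training_gb}

# Example: a 70B-parameter model in FP16 needs ~140 GB just for weights,
# far beyond a single 80 GB GPU, before activations or KV cache are counted.
print(estimate_memory_gb(70e9))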
For a deeper look at GPU memory fundamentals, read our Introduction to CUDA tutorial.
Tools and Libraries for Splitting LLM to Multiple GPUs
Many open-source frameworks support multi-GPU deep learning for large models. The following sections outline leading options and how each contributes to GPU parallelism for large language models.
PyTorch DistributedDataParallel
PyTorch’s DistributedDataParallel (DDP) is one of the most widely used approaches to distributed training for large language models. It synchronizes gradients across multiple GPUs and nodes: each process runs the same model on its own data subset, and gradients are averaged after each training iteration. Key benefits include minimal code changes over single-GPU training, gradient communication that overlaps with the backward computation, and scaling from a single machine to multiple nodes.
HuggingFace Accelerate
HuggingFace’s Accelerate library provides a straightforward way to run multi-GPU inference with minimal code changes. It supports automatic model sharding through the device_map="auto" parameter, along with tools for distributed inference. The following example illustrates this:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "togethercomputer/LLaMA-2-7B-32K"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"  # Automatically distributes layers across available GPUs
)
Accelerate fills GPU 0 first, then moves on to GPU 1, and continues across the remaining GPUs until the entire model is loaded. This provides an automatic, straightforward way to load an LLM across multiple GPUs without manual partitioning. For a more in-depth explanation, check out Multi-GPU on Raw PyTorch with Hugging Face’s Accelerate Library.
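If you want finer control over how much of each GPU Accelerate may fill, from_pretrained also accepts a max_memory mapping alongside device_map="auto". The limits below are illustrative values, not recommendations; adjust them to your hardware and leave headroom for activations and the KV cache.

from transformers import AutoModelForCausalLM
import torch

# Cap how much memory may be placed on each device; anything that does not
# fit under these limits can spill over to CPU RAM.
model = AutoModelForCausalLM.from_pretrained(
    "togethercomputer/LLaMA-2-7B-32K",
    torch_dtype=torch.float16,
    device_map="auto",
    max_memory={0: "10GiB", 1: "10GiB", "cpu": "30GiB"},
)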
Ollama Multiple GPUs
The Ollama framework provides a solution for running LLMs with efficient CPU and GPU inference. Ollama can use multiple GPUs when you set environment variables or edit a settings file to specify how the model is partitioned among them. The official Ollama documentation provides detailed instructions for splitting model weights.
Environment Variables for GPU Partitioning
GPU settings are configured by exporting the appropriate environment variables. For example:
export OLLAMA_GPU_COUNT=5
export OLLAMA_GPU_MEMORY_LIMIT=16GB
Here:
OLLAMA_GPU_COUNT sets the number of GPUs Ollama should use (five in this example).
OLLAMA_GPU_MEMORY_LIMIT caps the GPU memory Ollama may allocate (16 GB in this example).
For more advanced usage of Ollama, check out how to run LLMs with Ollama on H100 GPUs for maximum efficiency.
vLLM
The vLLM library is a recent development for efficiently executing large language model inference. It provides a highly optimized transformer inference engine and introduces PagedAttention to handle the memory demands of the KV (key-value) cache when processing long prompts. vLLM supports distributed inference and serving.
You can use multiple GPUs or machines to serve a model through vLLM when it exceeds the capacity of a single GPU. When initializing a serving instance, users can set the tensor parallel size, as we can see in the example below:
from vllm import LLM
# Initialize model with tensor parallelism across 4 GPUs
llm = LLM(model="meta-llama/Llama-2-70b-hf", tensor_parallel_size=4)
# Generate text for multiple prompts in parallel
outputs = llm.generate(["Write a book", "Explain artificial intelligence"])
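To control generation, vLLM accepts a SamplingParams object alongside the prompts. Continuing from the llm object above, the values below are illustrative defaults rather than tuned recommendations.

from vllm import SamplingParams

# Sampling settings applied to every prompt in the batch.
sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)

outputs = llm.generate(["Write a book", "Explain artificial intelligence"], sampling_params)
for output in outputs:
    # Each result carries the original prompt and one or more completions.
    print(output.prompt, output.outputs[0].text)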
DeepSpeed
Microsoft developed DeepSpeed as a library for optimizing large-scale model training. DeepSpeed’s Zero Redundancy Optimizer (ZeRO) partitions model states across GPUs to eliminate memory redundancy. The ZeRO stages include:
Stage 1: partitions optimizer states across GPUs.
Stage 2: additionally partitions gradients.
Stage 3: additionally partitions the model parameters themselves, offering the largest memory savings.
DeepSpeed also supports offloading to CPU memory and NVMe storage through features known as ZeRO-Offload and ZeRO-Infinity.
You can enable ZeRO stage 3 and set offload_param to use the CPU when required. The following is a portion of a typical DeepSpeed configuration file:
{
"train_batch_size": 8,
"fp16": { "enabled": true },
"zero_optimization": {
"stage": 3,
"offload_param": { "device": "cpu" }
}
}
DeepSpeed will then manage the distribution of model states and gradients according to this configuration.
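In the training script, the configuration is typically passed to deepspeed.initialize, which wraps the model and optimizer. The sketch below assumes a placeholder MyModel class and that the JSON above is saved as ds_config.json.

import deepspeed

model = MyModel()  # placeholder model class

# deepspeed.initialize wraps the model according to ds_config.json (ZeRO stage 3
# with CPU parameter offload) and returns an engine that handles backward() and step().
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="ds_config.json",
)

# Training step: the engine replaces the usual loss.backward() and optimizer.step().
# outputs = model_engine(batch)
# loss = compute_loss(outputs)
# model_engine.backward(loss)
# model_engine.step()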
Megatron-LM
NVIDIA offers this framework through a GitHub repository for training massive transformer models such as GPT-2, GPT-3, and T5. Megatron-LM combines tensor parallelism with pipeline parallelism to achieve massive parallel processing.
It lets users set the tensor parallel size, which controls how each layer is split across GPUs, and the pipeline parallel size, which controls how the model is divided into stages. Megatron-LM also contains advanced techniques useful for training models with billions of parameters from scratch.
Distributed Training Across Multiple Machines
Training an LLM across multiple machines requires a distributed environment in which each node contributes multiple GPUs. PyTorch’s distributed communication backend (NCCL) creates seamless connections between all processes.
Distributed LLM Training with PyTorch DDP: A Minimal Example
PyTorch’s DistributedDataParallel (DDP) trains models across multiple GPUs by replicating the model on each GPU and synchronizing gradients so that the result matches single-device training. The following steps summarize a minimal DDP setup.
1. Initialize the Process Group
You must configure the distributed backend to handle communication between processes. The following example shows how to use the NCCL (NVIDIA Collective Communications Library) backend for GPU operations:
import torch.distributed as dist
dist.init_process_group(backend="nccl")
It initializes a communication group that connects all processes. Each computing process will correspond to a single GPU.
2. Configure the Device and Wrap the Model with DDP
Identify the GPU for the current process by checking the LOCAL_RANK environment variable and then assign the model to that GPU. Then, wrap the model with DistributedDataParallel:
import os
import torch
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
model = MyModel().to(local_rank)
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
When loss.backward() is executed, DDP synchronizes and averages gradients across all processes.
3. Use a DistributedSampler in the DataLoader
Instead of randomly shuffling data with shuffle=True, use DistributedSampler to ensure each process receives a unique subset of the dataset. For example:
from torch.utils.data import DataLoader, DistributedSampler
sampler = DistributedSampler(train_dataset)
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, sampler=sampler)
The method prevents multiple processes from accessing the same training samples.
4. Implement the Training Loop
Within each process, run the usual training loop: fetch batches from train_loader, move them to the GPU, compute the loss, call loss.backward(), and then optimizer.step(). DDP handles gradient synchronization automatically during the backward pass, so no extra code is required. Call sampler.set_epoch(epoch) at the start of each epoch so the DistributedSampler reshuffles the data differently every epoch. Once training concludes, clean up with dist.destroy_process_group(). A minimal loop is sketched below.
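This is a minimal sketch of such a loop, assuming the model, local_rank, sampler, and train_loader defined in the previous steps, plus a hypothetical optimizer, loss_fn, and num_epochs, with batches arriving as (inputs, targets) pairs.

for epoch in range(num_epochs):
    sampler.set_epoch(epoch)  # reshuffle so each epoch sees a different data order
    for inputs, targets in train_loader:
        inputs = inputs.to(local_rank)
        targets = targets.to(local_rank)

        optimizer.zero_grad()
        outputs = model(inputs)
        loss = loss_fn(outputs, targets)
        loss.backward()   # DDP averages gradients across processes here
        optimizer.step()

# Tear down the process group once training is finished.
dist.destroy_process_group()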
5. Launch Training with torchrun
Launch your training script with the torchrun utility:
torchrun --nproc_per_node=4 train_ddp.py
Here, --nproc_per_node=4 starts four processes on the node (one per GPU), and train_ddp.py is your training script.
Thanks to the initial setup, each process acquires its appropriate LOCAL_RANK and synchronizes with other processes through the predefined process group.
Common Errors and Debugging
This section summarizes frequent errors when distributing large language models across multiple GPUs, along with their typical causes and suggested fixes.
Addressing memory overflow, slow synchronization, and inefficient parallelism will significantly improve the stability and performance of your multi-GPU LLM environment.
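When chasing memory overflow in particular, PyTorch’s built-in memory queries are a quick first diagnostic. The snippet below simply prints per-GPU allocation figures and assumes nothing beyond CUDA being available.

import torch

# Report current and peak memory usage on every visible GPU.
for device_id in range(torch.cuda.device_count()):
    allocated = torch.cuda.memory_allocated(device_id) / 1e9
    peak = torch.cuda.max_memory_allocated(device_id) / 1e9
    total = torch.cuda.get_device_properties(device_id).total_memory / 1e9
    print(f"GPU {device_id}: {allocated:.1f} GB allocated, "
          f"{peak:.1f} GB peak, {total:.1f} GB total")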
Multimodal LLM Considerations
Multimodal LLMs, which combine text with images and potentially audio or video, are becoming more common. They are often larger than text-only models, which makes multi-GPU strategies essential. The parallelism techniques described above (tensor, pipeline, and sharded data parallelism) apply equally to multimodal models, with the added consideration that each modality’s encoder contributes its own parameters and activations to the memory budget.
Conclusion
Splitting large language models across multiple GPUs has become essential for researchers and developers working on cutting-edge NLP and multimodal AI systems. This article presented practical methods for handling the computational and memory limits of modern LLMs, including model and data parallelism and tools such as DeepSpeed, Hugging Face Accelerate, and Megatron-LM.
Applying the right strategies enables multi-GPU systems to support scalable training and faster inference. As models evolve to integrate more data types, future AI development will rest on efficient GPU memory management and optimized distributed training methods. Practitioners who understand these parallelism architectures, use the open-source tools above, and anticipate common performance issues can get the most out of large and multimodal LLMs.