# 1. Install Megatron Core with required dependencies
pip install megatron-core
pip install --no-build-isolation transformer-engine[pytorch]
# 2. Clone repository for examples
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
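As a quick sanity check (our own suggestion, not part of the official instructions), you can verify that both packages import cleanly before moving on:
# Verify the installation by importing both packages
python -c "import megatron.core; import transformer_engine; print('Megatron Core and Transformer Engine import OK')"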
Complete Installation Guide - Docker, pip variants (dev, lts, etc.), source installation, and system requirements
- NEW! DeepSeek & MoE training with FP8 examples are now available, including optimized configurations for DeepSeek-V3, Qwen2, and Mixtral models with FP8 precision support.
- [2025/05] Megatron Core v0.11.0 brings new capabilities for multi-data center LLM training (blog).
Previous News
- [2024/07] Megatron Core v0.7 improves scalability and training resiliency and adds support for multimodal training (blog).
- [2024/06] Megatron Core added support for Mamba-based models. Check out our paper An Empirical Study of Mamba-based Language Models and code example.
- [2024/01 Announcement] NVIDIA has released the core capabilities in Megatron-LM into Megatron Core in this repository. Megatron Core expands upon Megatron-LM's GPU-optimized techniques with more cutting-edge innovations on system-level optimizations, featuring composable and modular APIs. Explore the [Megatron Core intro](#megatron-core) for more details.
Table of Contents
Getting Started
Core Features
Training
Resources
- Examples - Training scripts and tutorials
- Documentation - Official docs
- Community & Support - Get help and contribute
Megatron-LM/
├── megatron/
│   ├── core/                    # Megatron Core (kernels, parallelism, building blocks)
│   │   ├── models/              # Transformer models
│   │   ├── transformer/         # Transformer building blocks
│   │   ├── tensor_parallel/     # Tensor parallelism
│   │   ├── pipeline_parallel/   # Pipeline parallelism
│   │   ├── distributed/         # Distributed training (FSDP, DDP)
│   │   ├── optimizer/           # Optimizers
│   │   ├── datasets/            # Dataset loaders
│   │   ├── inference/           # Inference engines
│   │   └── export/              # Model export (e.g. TensorRT-LLM)
│   ├── training/                # Training scripts
│   ├── inference/               # Inference server
│   ├── legacy/                  # Legacy components
│   └── post_training/           # Post-training (RLHF, etc.)
├── examples/                    # Ready-to-use training examples
├── tools/                       # Utility tools
├── tests/                       # Comprehensive test suite
└── docs/                        # Documentation
Reference implementation that includes Megatron Core plus everything needed to train models.
Best for:
- Training state-of-the-art foundation models at scale with cutting-edge performance on latest NVIDIA hardware
- Research teams exploring new architectures and training techniques
- Learning distributed training concepts and best practices
- Quick experimentation with proven model configurations
What you get:
- Pre-configured training scripts for GPT, Llama, DeepSeek, Qwen, and more.
- End-to-end examples from data prep to evaluation
- Research-focused tools and utilities
Composable library with GPU-optimized building blocks for custom training frameworks.
Best for:
- Framework developers building on top of modular and optimized components
- Research teams needing custom training loops, optimizers, or data pipelines
- ML engineers requiring fault-tolerant training pipelines
What you get:
- Composable transformer building blocks (attention, MLP, etc.)
- Advanced parallelism strategies (TP, PP, DP, EP, CP)
- Pipeline schedules and distributed optimizers
- Mixed precision support (FP16, BF16, FP8)
- GPU-optimized kernels and memory management
- High-performance dataloaders and dataset utilities
- Model architectures (LLaMA, Qwen, GPT, Mixtral, Mamba, etc.)
Libraries used by Megatron Core:
- Megatron Energon (NEW!) - Multi-modal data loader (text, images, video, audio) with distributed loading and dataset blending
- Transformer Engine - Optimized kernels and FP8 mixed precision support
- Resiliency Extension (NVRx) - Fault tolerant training with failure detection and recovery
Libraries using Megatron Core:
- NeMo Framework - Enterprise framework with cloud-native support and end-to-end examples
- TensorRT Model Optimizer (ModelOpt) - Model optimization toolkit for quantization, pruning, and distillation
Compatible with: HuggingFace Accelerate, Colossal-AI, DeepSpeed
We strongly recommend using the previous release of the PyTorch NGC Container rather than the latest one, for optimal compatibility with Megatron Core releases and testing. Our releases are always based on the previous month's NGC container, which ensures compatibility and stability.
This container comes with all dependencies pre-installed with compatible versions and optimized configurations for NVIDIA GPUs:
- PyTorch (latest stable version)
- CUDA, cuDNN, NCCL (latest stable versions)
- Support for FP8 on NVIDIA Hopper, Ada, and Blackwell GPUs
- For best performance, use the NVIDIA Turing GPU architecture or later
# Run container with mounted directories
docker run --runtime=nvidia --gpus all -it --rm \
-v /path/to/megatron:/workspace/megatron \
-v /path/to/dataset:/workspace/dataset \
-v /path/to/checkpoints:/workspace/checkpoints \
nvcr.io/nvidia/pytorch:25.04-py3
Megatron Core offers support for two NGC PyTorch containers:
- dev: a moving head that supports the most recent upstream dependencies
- lts: long-term support of NGC PyTorch 24.01
Both containers can be combined with mlm, which adds the package dependencies for Megatron-LM on top of Megatron Core.
# Install packages for the dev environment (tracks the most recent upstream dependencies)
pip install megatron-core[dev]
# Install packages for LTS support NGC PyTorch 24.01
pip install megatron-core[lts]
For a version of Megatron Core with only torch, run:
pip install megatron-core
For dependencies required by Megatron-LM, please run:
pip install megatron-core[mlm]
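Since dev, lts, and mlm are standard pip extras, they can be combined in a single install; for example (a hedged sketch assuming the extras compose as usual):
# Example: dev dependencies plus the Megatron-LM extras in one install
pip install megatron-core[dev,mlm]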
For development or the latest features, install from source. For hybrid (Mamba-based) models, Megatron Core also requires mamba; if the pre-built wheel on PyPI does not fit your environment, you can fall back to the install script Megatron Core uses in its CI system. For this, please install uv first:
export UV_VERSION=0.7.2
export PATH="$HOME/.local/bin:$PATH"
curl -LsSf https://astral.sh/uv/${UV_VERSION}/install.sh | sh
export UV_PROJECT_ENVIRONMENT=./venv
export PATH="$UV_PROJECT_ENVIRONMENT/bin:$PATH"
export UV_LINK_MODE=copy
Run the following commands to build the upstream dependencies from source:
# Clone and install
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
# Optional: checkout specific release
git checkout core_r0.13.0
bash docker/common/install.sh --environment {dev,lts}
- FP8 Support: NVIDIA Hopper, Ada, Blackwell GPUs
- Recommended: NVIDIA Turing architecture or later
- CUDA/cuDNN/NCCL: Latest stable versions
- PyTorch: Latest stable version
- Transformer Engine: Latest stable version
- Python: 3.12 recommended
For our latest performance benchmarking results, please refer to NVIDIA NeMo Framework Performance Summary.
Our codebase efficiently trains models from 2B to 462B parameters across thousands of GPUs, achieving up to 47% Model FLOP Utilization (MFU) on H100 clusters.
Benchmark Configuration:
- Vocabulary size: 131,072 tokens
- Sequence length: 4096 tokens
- Model scaling: Varied hidden size, attention heads, and layers to achieve target parameter counts
- Communication optimizations: Fine-grained overlapping with DP (--overlap-grad-reduce, --overlap-param-gather), TP (--tp-comm-overlap), and PP (enabled by default)
Key Results:
- 6144 H100 GPUs: Successfully benchmarked 462B parameter model training
- Superlinear scaling: MFU increases from 41% to 47-48% with model size
- End-to-end measurement: Throughputs include all operations (data loading, optimizer steps, communication, logging)
- Production ready: Full training pipeline with checkpointing and fault tolerance
- Note: Performance results measured without training to convergence
Our weak scaled results show superlinear scaling (MFU increases from 41% for the smallest model considered to 47-48% for the largest models); this is because larger GEMMs have higher arithmetic intensity and are consequently more efficient to execute.
We also strong scaled the standard GPT-3 model (our version has slightly more than 175 billion parameters due to larger vocabulary size) from 96 H100 GPUs to 4608 GPUs, using the same batch size of 1152 sequences throughout. Communication becomes more exposed at larger scale, leading to a reduction in MFU from 47% to 42%.
# Distributed training example (2 GPUs, mock data)
torchrun --nproc_per_node=2 examples/run_simple_mcore_train_loop.py
# 8 GPUs, FP8 precision, mock data
./examples/llama/train_llama3_8b_fp8.sh
{"text": "Your training text here..."}
{"text": "Another training sample..."}
python tools/preprocess_data.py \
--input data.jsonl \
--output-prefix processed_data \
--tokenizer-type HuggingFaceTokenizer \
--tokenizer-model /path/to/tokenizer.model \
--workers 8 \
--append-eod
- --input: Path to input JSON/JSONL file
- --output-prefix: Prefix for output binary files (.bin and .idx)
- --tokenizer-type: Tokenizer type (HuggingFaceTokenizer, GPT2BPETokenizer, etc.)
- --tokenizer-model: Path to tokenizer model file
- --workers: Number of parallel workers for processing
- --append-eod: Add end-of-document token
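With the settings above, preprocessing typically writes processed_data_text_document.bin and processed_data_text_document.idx (the output prefix plus the JSON field name). Training scripts then consume them via --data-path, pointing at the prefix without the extension; a minimal sketch:
# Point a training run at the preprocessed dataset (prefix only, no .bin/.idx)
torchrun --nproc_per_node=8 pretrain_gpt.py \
    --data-path processed_data_text_document \
    --tokenizer-type HuggingFaceTokenizer \
    --tokenizer-model /path/to/tokenizer.model
    # plus the remaining model, optimizer, and logging arguments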
# Standard DDP - replicate model on each GPU
torchrun --nproc_per_node=8 pretrain_gpt.py \
--data-parallel-sharding-strategy no_shard
# Megatron's optimized FSDP (~15% faster than PyTorch FSDP2)
--use-custom-fsdp
# PyTorch FSDP2
--use-torch-fsdp2
# Sharding strategies
--data-parallel-sharding-strategy optim # Shard optimizer states (ZeRO-1)
--data-parallel-sharding-strategy optim_grads # Shard gradients + optimizer (ZeRO-2)
--data-parallel-sharding-strategy optim_grads_params # Shard parameters + gradients + optimizer (ZeRO-3)
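Putting these pieces together, a hedged sketch of a fully sharded (ZeRO-3-style) launch with Megatron's custom FSDP:
# Illustrative only: shard parameters, gradients, and optimizer state across 8 GPUs
torchrun --nproc_per_node=8 pretrain_gpt.py \
    --use-custom-fsdp \
    --data-parallel-sharding-strategy optim_grads_params
    # plus the usual model, data, and logging arguments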
Split individual model layers across GPUs:
--tensor-model-parallel-size 4 # 4-way tensor parallelism
--sequence-parallel # Enable sequence parallelism (recommended with TP)
Split model depth across GPUs:
--pipeline-model-parallel-size 8 # 8 pipeline stages
--virtual-pipeline-model-parallel-size 4 # Virtual pipeline for better load balancing
Split long sequences across GPUs for handling long contexts:
--context-parallel-size 2 # 2-way context parallelism
--cp-comm-type p2p # Communication: p2p, a2a, allgather, a2a+p2p
--hierarchical-context-parallel-sizes 2 4 # Hierarchical context parallelism
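For example, a hedged sketch of a long-context run that splits 8K-token sequences across two GPUs per context-parallel group:
# Illustrative only: 2-way context parallelism with point-to-point communication
torchrun --nproc_per_node=8 pretrain_gpt.py \
    --seq-length 8192 \
    --context-parallel-size 2 \
    --cp-comm-type p2p
    # plus the usual model, data, and logging arguments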
For Mixture of Experts (MoE) models:
--expert-model-parallel-size 4 # 4-way expert parallelism
--num-experts 8 # 8 experts per MoE layer
--moe-grouped-gemm # Optimize expert computation
Based on NVIDIA NeMo production configurations:
| Model | Size | GPUs | TP | PP | CP | EP | Notes |
|---|---|---|---|---|---|---|---|
| Llama-3 | 8B | 8 | 1 | 1 | 2 | 1 | CP for long seqlen (8K) |
| Llama-3 | 70B | 64 | 4 | 4 | 2 | 1 | TP+PP |
| Llama-3.1 | 405B | 1024 | 8 | 8 | 2 | 1 | 3D parallelism for scale |
| GPT-3 | 175B | 128-512 | 4 | 8 | 1 | 1 | Large model config |
| Mixtral | 8x7B | 64 | 1 | 4 | 1 | 8 | EP for MoE |
| Mixtral | 8x22B | 256 | 4 | 4 | 8 | 8 | Combined TP+EP for large MoE |
| DeepSeek-V3 | 671B | 1024 | 2 | 16 | 1 | 64 | Large MoE config |
Important: When combining Expert Parallelism (EP) with Tensor Parallelism (TP), Sequence Parallelism (SP) must be enabled.
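As an illustration of that rule, a hedged sketch of a Mixtral-style launch combining TP and EP with the required sequence parallelism (the parallel sizes here are placeholders, not a tuned configuration):
# Illustrative only: TP=4 and EP=8 across 32 GPUs, with sequence parallelism enabled
torchrun --nnodes=4 --nproc_per_node=8 pretrain_gpt.py \
    --tensor-model-parallel-size 4 \
    --sequence-parallel \
    --expert-model-parallel-size 8 \
    --num-experts 8 \
    --moe-grouped-gemm
    # plus the usual model, data, and optimizer arguments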
| Feature | Flag | Benefit |
|---|---|---|
| FlashAttention | --attention-backend | Faster attention and lower memory usage |
| FP8 Training | --fp8-hybrid | Faster training |
| Activation Checkpointing | --recompute-activations | Reduced memory usage |
| Data Parallelism Communication Overlap | --overlap-grad-reduce | Faster distributed training |
| Distributed Optimizer | --use-distributed-optimizer | Reduced checkpointing time |
NVIDIA NeMo Framework Performance Tuning Guide - Comprehensive performance optimization guide covering advanced tuning techniques, communication overlaps, memory optimizations, and profiling options.
FlashAttention is a fast and memory-efficient attention algorithm. We recommend the default usage, which uses cuDNN for attention via Transformer Engine and provides up to 50% speedups on forward and 84% on backward propagation with FP8 kernels. The flash-attn package is also supported via --use-flash-attn.
--fp16 # Standard FP16
--bf16 # BFloat16 (recommended for large models)
--fp8-hybrid # FP8 training (Hopper, Ada, and Blackwell GPUs)
# For limited memory
--recompute-activations
# For extreme memory constraints
--recompute-granularity full \
--recompute-method uniform
--overlap-grad-reduce
--overlap-param-gather
--use-distributed-optimizer
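A hedged sketch of how these flags combine in a single launch (parameter-gather overlap is used together with the distributed optimizer):
# Illustrative only: overlap gradient reduce-scatter and parameter all-gather with compute
torchrun --nproc_per_node=8 pretrain_gpt.py \
    --use-distributed-optimizer \
    --overlap-grad-reduce \
    --overlap-param-gather
    # plus the usual model and data arguments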
- Documentation - Official documentation
- Issues - Bug reports and feature requests
We ❤️ contributions! Ways to contribute:
- Report bugs - Help us improve reliability
- Suggest features - Shape the future of Megatron Core
- Improve docs - Make Megatron Core more accessible
- Submit PRs - Contribute code improvements
@article{megatron-lm,
title={Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism},
author={Shoeybi, Mohammad and Patwary, Mostofa and Puri, Raul and LeGresley, Patrick and Casper, Jared and Catanzaro, Bryan},
journal={arXiv preprint arXiv:1909.08053},
year={2019}
}