# Running a High-Performance GPT-OSS-120B Inference Server with TensorRT-LLM

In the guide below, we will walk you through how to launch your own
high-performance TensorRT-LLM server for **gpt-oss-120b** for inference.
This guide covers both low-latency and max-throughput cases.

The typical **low-latency** use case is maximizing the number of tokens per second per user at limited concurrency (4, 8, or 16 users).

For **maximum throughput**, the goal is to maximize the number of tokens produced per GPU per second. The former indicates how fast the system can produce tokens for an individual user; the latter measures how many tokens a "chip" can generate per unit of time.

<br/><br/>
## Prerequisites

- 1x NVIDIA B200/GB200/H200 GPU (8x NVIDIA B200/H200 GPUs or 4x GB200 GPUs in a single node recommended for higher performance)
- CUDA Toolkit 12.8 or later
- Docker with NVIDIA Container Toolkit installed (see the quick check below)
- Fast SSD storage for model weights
- Access to the gpt-oss-120b model checkpoint

We have a forthcoming guide for getting great performance on H100; the guide below focuses on the GPUs listed above.
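
Before moving on, you can sanity-check that the GPU driver is visible and that the NVIDIA Container Toolkit can hand GPUs to Docker. This is a minimal check, assuming `nvidia-smi` is available inside the NGC image (which is typical), and is not part of the original setup steps:

```bash
# Confirm the host driver sees the expected GPUs
nvidia-smi

# Confirm Docker can pass the GPUs through via the NVIDIA Container Toolkit
docker run --rm --gpus all nvcr.io/nvidia/tensorrt-llm/release:gpt-oss-dev nvidia-smi
```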

<br/><br/>

## Launching the TRTLLM docker container

The container image that you will use will be pulled from NVIDIA's NGC. This container is multi-platform and will run on both x64 and arm64 architectures: `nvcr.io/nvidia/tensorrt-llm/release:gpt-oss-dev`

Run the following docker command to start the TensorRT-LLM container in interactive mode:

```bash
docker run --rm --ipc=host -it \
    --ulimit stack=67108864 \
    --ulimit memlock=-1 \
    --gpus all \
    -p 8000:8000 \
    -e TRTLLM_ENABLE_PDL=1 \
    -e TRT_LLM_DISABLE_LOAD_WEIGHTS_IN_PARALLEL=True \
    -v ~/.cache:/root/.cache:rw \
    nvcr.io/nvidia/tensorrt-llm/release:gpt-oss-dev \
    /bin/bash
```

<br/><br/>

This command:
- Automatically removes the container when stopped (`--rm`)
- Allows the container to interact with the host's IPC resources and shared memory for optimal performance (`--ipc=host`)
- Runs the container in interactive mode (`-it`)
- Sets stack size and memory-lock limits for optimal performance (`--ulimit stack`, `--ulimit memlock`)
- Maps port 8000 from the container to your host (`-p 8000:8000`)
- Enables PDL for low-latency performance optimization (`TRTLLM_ENABLE_PDL=1`)
- Disables parallel weight loading (`TRT_LLM_DISABLE_LOAD_WEIGHTS_IN_PARALLEL=True`)

Lastly, the command mounts your user `.cache` directory into the container so that downloaded model checkpoints, which are saved to `~/.cache/huggingface/hub/` by default, persist across runs. This prevents having to re-download the weights each time you rerun the container.
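
Once you are inside the container, a quick check (not part of the original steps, but useful) confirms that the GPUs are visible and that the `trtllm-serve` entry point is available:

```bash
# Inside the container: confirm the expected GPUs are visible
nvidia-smi

# Confirm the serving CLI is on the PATH
trtllm-serve --help
```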

<br/><br/>

## Running the TensorRT-LLM Server

As pointed out in the introduction, this guide covers the low-latency and max-throughput cases. Each requires a different configuration and launch command. We will first cover the low-latency use case, followed by the max-throughput use case.

### Low-latency Use-Case

#### Creating the Extra Options Configuration

To run a server for low-latency workloads, create a YAML configuration file, `low_latency.yaml`, as follows:

```bash
cat <<EOF > low_latency.yaml
enable_attention_dp: false
enable_mixed_sampler: true
cuda_graph_config:
  max_batch_size: 8
  enable_padding: true
moe_config:
  backend: TRTLLM
EOF
```

> Note: If you are using NVIDIA H200 GPUs, it is highly recommended to set `moe_config.backend` to TRITON to use the OpenAI Triton MoE kernels. See the section [(H200 Only) Using OpenAI Triton Kernels for MoE](#h200-only-using-openai-triton-kernels-for-moe) for more details.

#### Launching TensorRT-LLM Serve

To launch the TensorRT-LLM server and serve the model with the **low latency** config, run one of the following commands. Commands are provided for different GPU configurations (1x GPU, 8x GPUs, 4x GPUs):

<details open> <summary>1x B200/GB200/H200</summary>

```bash
mpirun -n 1 --oversubscribe --allow-run-as-root \
    trtllm-serve openai/gpt-oss-120b \
    --host 0.0.0.0 \
    --port 8000 \
    --backend pytorch \
    --tp_size 1 \
    --ep_size 1 \
    --trust_remote_code \
    --extra_llm_api_options low_latency.yaml \
    --kv_cache_free_gpu_memory_fraction 0.75
```
</details>

<details> <summary>8x B200/H200</summary>

```bash
mpirun -n 1 --oversubscribe --allow-run-as-root \
    trtllm-serve openai/gpt-oss-120b \
    --host 0.0.0.0 \
    --port 8000 \
    --backend pytorch \
    --tp_size 8 \
    --ep_size 8 \
    --trust_remote_code \
    --extra_llm_api_options low_latency.yaml \
    --kv_cache_free_gpu_memory_fraction 0.75
```
</details>

<details> <summary>4x GB200/B200/H200</summary>

```bash
mpirun -n 1 --oversubscribe --allow-run-as-root \
    trtllm-serve openai/gpt-oss-120b \
    --host 0.0.0.0 \
    --port 8000 \
    --backend pytorch \
    --tp_size 4 \
    --ep_size 4 \
    --trust_remote_code \
    --extra_llm_api_options low_latency.yaml \
    --kv_cache_free_gpu_memory_fraction 0.75
```
</details>

<br/><br/>
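
Once the server reports ready (see the health check later in this guide), an informal way to eyeball single-request latency in the low-latency setup is to time a short completion. This is a quick smoke test, not a benchmark:

```bash
# Time a single short request against the running server
time curl -s localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
    "model": "openai/gpt-oss-120b",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 64
}' > /dev/null
```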
### Max-Throughput Use-Case

#### Creating the Extra Options Configuration

To run a server for max-throughput workloads, create a YAML configuration file, `max_throughput.yaml`, as follows:

```bash
cat <<EOF > max_throughput.yaml
enable_attention_dp: true
cuda_graph_config:
  max_batch_size: 640
  enable_padding: true
stream_interval: 10
moe_config:
  backend: CUTLASS
EOF
```

> Note: If you are using NVIDIA H200 GPUs, it is highly recommended to set `moe_config.backend` to TRITON to use the OpenAI Triton MoE kernels. See the section [(H200 Only) Using OpenAI Triton Kernels for MoE](#h200-only-using-openai-triton-kernels-for-moe) for more details.

#### Launching TensorRT-LLM Serve

To launch the TensorRT-LLM server and serve the model with the **max throughput** config, run one of the following commands. Commands are provided for different GPU configurations (1x GPU, 8x GPUs, 4x GPUs):

<details open> <summary>1x B200/GB200/H200</summary>

```bash
mpirun -n 1 --oversubscribe --allow-run-as-root \
    trtllm-serve openai/gpt-oss-120b \
    --host 0.0.0.0 \
    --port 8000 \
    --backend pytorch \
    --tp_size 1 \
    --ep_size 1 \
    --max_batch_size 640 \
    --trust_remote_code \
    --extra_llm_api_options max_throughput.yaml \
    --kv_cache_free_gpu_memory_fraction 0.9
```
</details>

<details> <summary>8x B200/H200</summary>

```bash
mpirun -n 1 --oversubscribe --allow-run-as-root \
    trtllm-serve openai/gpt-oss-120b \
    --host 0.0.0.0 \
    --port 8000 \
    --backend pytorch \
    --tp_size 8 \
    --ep_size 8 \
    --max_batch_size 640 \
    --trust_remote_code \
    --extra_llm_api_options max_throughput.yaml \
    --kv_cache_free_gpu_memory_fraction 0.9
```
</details>

<details> <summary>4x GB200/B200/H200</summary>

```bash
mpirun -n 1 --oversubscribe --allow-run-as-root \
    trtllm-serve openai/gpt-oss-120b \
    --host 0.0.0.0 \
    --port 8000 \
    --backend pytorch \
    --tp_size 4 \
    --ep_size 4 \
    --max_batch_size 640 \
    --trust_remote_code \
    --extra_llm_api_options max_throughput.yaml \
    --kv_cache_free_gpu_memory_fraction 0.9
```
</details>

<br/><br/>

This command:
- Binds the server to `0.0.0.0` on port 8000, which is the port mapped to your host by the docker command
- Uses the PyTorch backend and specifies the tensor and expert parallel sizes (`--tp_size`, `--ep_size`)
- References the low-latency or max-throughput configuration file for extra options (`--extra_llm_api_options`)
- Reserves a fraction of free GPU memory for the KV cache (`--kv_cache_free_gpu_memory_fraction`)
- Enables attention data parallelism across all GPUs in the max-throughput scenario (`enable_attention_dp: true` in the config)

The initialization may take several minutes as the server loads and optimizes the model.

<br/><br/>
## (Optional) Using the MXFP4 Checkpoints

For MXFP4 checkpoints, the default quantization on Blackwell is `W4A8_MXFP4_MXFP8` for the CUTLASS and TRTLLM backends and `W4A8_MXFP4_FP8` for the Triton backend. On Hopper, the default quantization for MXFP4 checkpoints is `W4A16_MXFP4`.
You can override the quantization algorithm by setting:

```bash
export OVERRIDE_QUANT_ALGO=W4A16_MXFP4
```

<br/><br/>
## (H200 Only) Using OpenAI Triton Kernels for MoE

OpenAI ships a set of Triton kernels optimized for its MoE models. TensorRT-LLM can leverage these kernels on Hopper-based GPUs like NVIDIA's H200 to gain significant speed-ups. The NGC TensorRT-LLM container image mentioned above already includes the required kernels, so you do not need to build or install anything. It is highly recommended to enable them with the steps below.

### Selecting Triton as the MoE backend

To use the Triton MoE backend with **trtllm-serve** (or other similar commands), add this snippet to the YAML file passed via `--extra_llm_api_options`:

```yaml
moe_config:
  backend: TRITON
```

Alternatively, the TRITON backend can be enabled by passing the CLI flag to the trtllm-serve command at runtime:

```bash
--moe_backend TRITON
```
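
For example, on H200 you could fold the Triton backend into the low-latency configuration created earlier. The filename `low_latency_h200.yaml` is just an illustrative name; only the `moe_config.backend` value differs from the original file:

```bash
cat <<EOF > low_latency_h200.yaml
enable_attention_dp: false
enable_mixed_sampler: true
cuda_graph_config:
  max_batch_size: 8
  enable_padding: true
moe_config:
  backend: TRITON
EOF
```

Then pass `--extra_llm_api_options low_latency_h200.yaml` to the launch commands shown above.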

<br/><br/>

## Test the Server with a Sample Request

You can query the health/readiness of the server using:

```bash
curl -s -o /dev/null -w "Status: %{http_code}\n" "http://localhost:8000/health"
```

When a `Status: 200` code is returned, the server is ready for queries. Note that the very first query may take longer due to initialization and compilation.
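
If you are scripting the deployment, a small polling loop (a sketch using standard shell tools, not part of the original guide) can block until the health endpoint returns 200:

```bash
# Poll the health endpoint until it returns HTTP 200
until [ "$(curl -s -o /dev/null -w '%{http_code}' http://localhost:8000/health)" = "200" ]; do
    echo "Waiting for the TensorRT-LLM server to become ready..."
    sleep 10
done
echo "Server is ready."
```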

Once the server is running, you can test it with a simple curl request:

```bash
curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
    "model": "openai/gpt-oss-120b",
    "messages": [
        {
            "role": "user",
            "content": "What is NVIDIAs advantage for inference?"
        }
    ],
    "max_tokens": 1024,
    "top_p": 0.9
}' -w "\n"
```

<details><summary><b>Show Example Output</b></summary>

```json
{
"id": "chatcmpl-c440e2a3e7e14cd699295afc3739bf42",
"object": "chat.completion",
"created": 1754358426,
"model": "openai/gpt-oss-120b",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "<|channel|>analysis<|message|>The user asks: \"What is NVIDIA's advantage for inference?\" The user wants:
\"What is NVIDIA's advantage for inference?\" Likely they want a detailed answer about why NVIDIA has advantages for
inference tasks (e.g., GPU acceleration, TensorRT, software stack, Tensor Cores, hardware, performance, low latency,
etc.). The user wants a short answer? Not specified. The user wants a straightforward answer. Probably a brief
explanation: hardware features like Tensor cores, optimized libraries (TensorRT), ecosystem, software stack,
integration with frameworks, cuDNN, inference GPU architecture, high throughput, low latency, FP16, INT8, etc.\n\nWe
need to produce a response: Provide NVIDIA's advantage for inference. Provide specifics: hardware (Tensor cores, RT
cores, Ampere architecture), software (TensorRT, CUDA, cuDNN, Triton Inference Server), developer ecosystem, network
effects, etc.\n\nAlso mention inference performance vs competition, DPUs, etc.\n\nWe'll answer concisely but thorough.
\n\nWe also must not mention we are ChatGPT or mention policy. Provide final answer.\n\nLet's produce a clear,
well-structured answer.\n\n<|end|><|start|>assistant<|channel|>final<|message|>**NVIDIA’s edge in AI inference comes
from a tightly‑integrated stack of hardware, software, and ecosystem tools that together deliver the
highest‑throughput, lowest‑latency execution of deep‑learning models.**\n\n| **Layer** | **Key Advantages** | **Why
It Matters for Inference** |\n|-----------|-------------------|---------------------------------|\n| **GPU
Architecture** | • **Tensor Cores** (FP16/FP8/INT8) – dedicated matrix‑multiply units that give 2–8× higher ops/W
than classic CUDA cores. <br>• **CUDA SMs with mixed‑precision support** – can run FP16/INT8/INT4 with minimal
accuracy loss. <br>• **NVLink / NVSwitch** – high‑bandwidth, low‑latency inter‑GPU communication for multi‑GPU
serving. | Maximizes raw arithmetic density, letting a single GPU handle many concurrent requests. |\n|
**Inference‑focused Silicon** | • **A100 / H100 “Tensor‑Core” GPUs** – FP8 support (H100) and sparsity acceleration
(up to 2× throughput). <br>• **T4/Ti, L40, L4, Jetson edge modules** – power‑optimized variants for data‑center,
edge, and robotics. | Provides the right performance‑per‑watt for cloud, on‑prem, or edge deployments. |\n|
**Software Stack** | • **TensorRT** – a compiler & runtime that fuses layers, applies precision calibration, and
auto‑tunes kernels for each GPU. <br>• **CUDA, cuDNN, cuBLAS** – low‑level libraries tuned for every generation of
GPU. <br>• **Triton Inference Server** – model‑agnostic serving, model‑versioning, batching, GPU sharing, and
scaling across CPUs/GPUs. | Turns raw GPU power into production‑ready, low‑latency services with minimal engineering
effort. |\n| **Model Optimizations** | • **Quantization (INT8/FP8) & Structured Sparsity** – supported natively by
TensorRT and the hardware. <br>• **Automatic Mixed‑Precision (AMP)** – retains accuracy while cutting compute. |
Reduces memory bandwidth and compute cost while keeping accuracy within acceptable bounds. |\n| **Ecosystem &
Compatibility** | • **Broad framework support** – TensorFlow, PyTorch, ONNX, JAX, etc., all compile to TensorRT.
<br>• **NVIDIA NGC** – pre‑optimized model zoo, containers, and reference pipelines. <br>• **MLOps tools** –
NVIDIA Merlin, Clara, Metropolis, etc., for recommendation, medical, vision pipelines. | Engineers can
plug‑and‑play, accelerate, and ship models faster. |\n| **Scalability & Deployment Flexibility** | • **DGX Cloud,
EGX, Jetson, and Orin** – end‑to‑end solutions from cloud to edge. <br>• **Multi‑Instance GPU (MIG)** – partition
a single A100 into up to 7 isolated inference instances. <br>• **NVIDIA AI Enterprise** – managed software suite
for on‑prem data‑centers. | Allows the same code to run on a laptop, an edge device, or a massive data‑center
cluster. |\n| **Performance Benchmarks** | • **Industry‑leading latency/throughput** on MLPerf Inference (FP8,
INT8). <br>• **Sparsity‑aware kernels** give >2× speedup on H100 with < 0.1 % accuracy loss. | Demonstrates
real‑world advantage in the most respected benchmark suite. |\n|",
"reasoning_content": null,
"tool_calls": []
},
"logprobs": null,
"finish_reason": "length",
"stop_reason": null,
"disaggregated_params": null
}
],
"usage": {
"prompt_tokens": 17,
"total_tokens": 1041,
"completion_tokens": 1024
},
"prompt_token_ids": null
}

```
</details>

The server exposes a standard OpenAI-compatible API endpoint that accepts JSON requests. You can adjust parameters like `max_tokens`, `temperature`, and others according to your needs.
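
Because the endpoint follows the OpenAI Chat Completions format, token streaming should also work in the usual way; the request below is a sketch that assumes the standard `stream` field is honored by the server:

```bash
curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
    "model": "openai/gpt-oss-120b",
    "messages": [
        {"role": "user", "content": "Summarize what TensorRT-LLM does in one sentence."}
    ],
    "max_tokens": 256,
    "stream": true
}'
```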
363+
364+
</br></br>
365+
366+
## Troubleshooting Tips
367+
368+
- If you encounter CUDA out-of-memory errors, try reducing `max_batch_size`, `max_seq_len`, or `--kv_cache_free_gpu_memory_fraction`
369+
- Ensure your model checkpoints are compatible with the expected format
370+
- For performance issues, check GPU utilization with `nvidia-smi` while the server is running
371+
- If the container fails to start, verify that the NVIDIA Container Toolkit is properly installed
372+
- For connection issues, make sure port 8000 is not being used by another application
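
The commands below are one way to run the utilization and port checks mentioned above; they are generic Linux tooling rather than anything specific to TensorRT-LLM:

```bash
# Watch GPU utilization and memory while the server handles requests
watch -n 1 nvidia-smi

# Check whether another process is already listening on port 8000
ss -ltnp | grep ':8000' || echo "Port 8000 is free"
```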
373+
374+
</br></br>
375+
376+
## Performance Tuning
377+
378+
The configuration provided is optimized for 8xB200 GPUs, but you can adjust
379+
several parameters for your specific workload:
380+
381+
- `max_batch_size`: Controls how many requests can be batched together
382+
- `max_draft_len`: The number of tokens Eagle can speculate ahead
383+
- `kv_cache_free_gpu_memory_fraction`: Controls memory allocation for the KV cache
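
For instance, on a machine with less free GPU memory you might lower the batch size and the KV-cache fraction at launch time. The values below are illustrative starting points, not tuned recommendations; if you change `--max_batch_size`, also lower `cuda_graph_config.max_batch_size` in the YAML file to match:

```bash
# Example: smaller batch and KV-cache budget than the defaults shown earlier
mpirun -n 1 --oversubscribe --allow-run-as-root \
    trtllm-serve openai/gpt-oss-120b \
    --host 0.0.0.0 \
    --port 8000 \
    --backend pytorch \
    --tp_size 1 \
    --ep_size 1 \
    --max_batch_size 128 \
    --trust_remote_code \
    --extra_llm_api_options max_throughput.yaml \
    --kv_cache_free_gpu_memory_fraction 0.8
```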
