Releases: NVIDIA/TensorRT-LLM

v1.1.0rc0

16 Aug 00:09
26f413a
Pre-release

Announcement Highlights:

  • Model Support

    • Add model gpt-oss (#6645)
    • Support Aggregate mode for phi4-mm (#6184)
    • Add support for Eclairv2 model - cherry-pick changes and minor fix (#6493)
    • Support running heterogeneous model execution for Nemotron-H (#6866)
    • Add whisper support (Bert Attention on SM100 and GPTAttention for cross attention on SM100) (#5527)
  • API

    • [BREAKING CHANGE] Enable TRTLLM sampler by default (#6216)
  • Benchmark

    • Improve Llama4 performance for small max_seqlen cases (#6306)
    • Multimodal benchmark_serving support (#6622)
    • Add perf-sweep scripts (#6738)
  • Feature

    • Support reloading LoRA adapters evicted from the CPU cache (#6510)
    • Add FP8 context MLA support for SM120 (#6059)
    • Enable guided decoding with speculative decoding (part 1: two-model engine) (#6300)
    • Include attention dp rank info with KV cache events (#6563)
    • Clean up ngram auto mode, add max_concurrency to configs (#6676)
    • Add NCCL Symmetric Integration for All Reduce (#4500)
    • Remove input_sf swizzle for module WideEPMoE (#6231)
    • Enable guided decoding with disagg serving (#6704)
    • Make fused_moe_cute_dsl work on blackwell (#6616)
    • Move kv cache measure into transfer session (#6633)
    • Optimize CUDA graph memory usage for spec decode cases (#6718)
    • Core Metrics Implementation (#5785)
    • Resolve KV cache divergence issue (#6628)
    • AutoDeploy: Optimize prepare_inputs (#6634)
    • Enable FP32 mamba ssm cache (#6574)
    • Support SharedTensor on MultimodalParams (#6254)
    • Improve dataloading for benchmark_dataset by using batch processing (#6548)
    • Store the block of context request into kv cache (#6683)
    • Add standardized GitHub issue templates and disable blank issues (#6494)
    • Improve the performance of online EPLB on Hopper by better overlapping (#6624)
    • Enable guided decoding with CUDA graph padding and draft model chunked prefill (#6774)
    • CUTLASS MoE FC2+Finalize fusion (#3294)
    • Add GPT OSS support for AutoDeploy (#6641)
    • Add LayerNorm module (#6625)
    • Support custom repo_dir for SLURM script (#6546)
    • DeepEP LL combine FP4 (#6822)
    • AutoTuner tuning config refactor and valid tactic generalization (#6545)
    • Hopper W4A8 MoE supports ModelOpt ckpt for PyT backend (#6200)
    • Add support for Hopper MLA chunked prefill (#6655)
    • Helix: extend mapping to support different CP types (#6816)
  • Documentation

    • Remove the outdated features that were marked as Experimental (#5995)
    • Add LoRA feature usage doc (#6603)
    • Add deployment guide section for VDR task (#6669)
    • Add doc for multimodal feature support matrix (#6619)
    • Move AutoDeploy README.md to torch docs (#6528)
    • Add checkpoint refactor docs (#6592)
    • Add K2 tool calling examples (#6667)
    • Add the workaround doc for H200 OOM (#6853)
    • Update moe support matrix for DS R1 (#6883)
    • BREAKING CHANGE: Mismatch between docs and actual commands (#6323)

What's Changed

  • Qwen3: Fix eagle hidden states by @IzzyPutterman in #6199
  • [None][fix] Upgrade dependencies version to avoid security vulnerability by @yibinl-nvidia in #6506
  • [None][chore] update readme for perf release test by @ruodil in #6664
  • [None][test] remove trt backend cases in release perf test and move NIM cases to llm_perf_nim.yml by @ruodil in #6662
  • [None][fix] Explicitly add tiktoken as required by kimi k2 by @pengbowang-nv in #6663
  • [None][doc]: remove the outdated features which marked as Experimental by @nv-guomingz in #5995
  • [https://nvbugs/5375966][chore] Unwaive test_disaggregated_deepseek_v3_lite_fp8_attention_dp_one by @yweng0828 in #6658
  • [TRTLLM-6892][infra] Run guardwords scan first in Release Check stage by @yiqingy0 in #6659
  • [None][chore] optimize kv cache transfer for context TEP and gen DEP by @chuangz0 in #6657
  • [None][chore] Bump version to 1.1.0rc0 by @yiqingy0 in #6651
  • [TRTLLM-6683][feat] Support LoRA reload CPU cache evicted adapter by @amitz-nv in #6510
  • [None][test] correct test-db context for perf yaml file by @ruodil in #6686
  • [None] [feat] Add model gpt-oss by @hlu1 in #6645
  • [https://nvbugs/5409414][fix] fix Not registered specs by @xinhe-nv in #6660
  • [None][feat] : Add FP8 context MLA support for SM120 by @peaceh-nv in #6059
  • [TRTLLM-6092][doc] Add LoRA feature usage doc by @shaharmor98 in #6603
  • [TRTLLM-6409][feat] Enable guided decoding with speculative decoding (part 1: two-model engine) by @syuoni in #6300
  • [TRTLLM-6881][feat] Include attention dp rank info with KV cache events by @pcastonguay in #6563
  • [None][infra] Fix guardwords by @EmmaQiaoCh in #6711
  • [None][package] Pin cuda-python version to >=12,<13 by @yiqingy0 in #6702
  • [None][doc] Add deployment guide section to the official doc website by @nv-guomingz in #6669
  • [None][fix] disagg ctx pp4 + gen pp4 integ test by @raayandhar in #6489
  • [None][feat] Clean up ngram auto mode, add max_concurrency to configs by @mikeiovine in #6676
  • [None][chore] Remove py_executor from disagg gh team by @pcastonguay in #6716
  • [https://nvbugs/5423962][fix] Address broken links by @chenopis in #6531
  • [None][fix] Migrate to new cuda binding package name by @tongyuantongyu in #6700
  • [https://nvbugs/5410687][fix] Hopper w4a8 groupwise MoE interleave by @symphonylyh in #6708
  • [None][feat] Add NCCL Symmetric Integration for All Reduce by @Tabrizian in #4500
  • [TRTLLM-6785][feat] BREAKING CHANGE Enable TRTLLM sampler by default by @dcampora in #6216
  • [TRTQA-2920][fix] Add failed cases into waives.txt by @xinhe-nv in #6719
  • [TRTLLM-5252][test] add for mistral_small_3.1_24b perf test by @ruodil in #6685
  • [TRTLLM-6744][feat] Remove input_sf swizzle for module WideEPMoE by @StudyingShao in #6231
  • [None][fix] Fix unnecessary GPU synchronization in torch sampler caused by incorrect tensor reference by @zhanghaotong in #6626
  • [TRTLLM-6854][feat] Enable guided decoding with disagg serving by @syuoni in #6704
  • [TRTLLM-5252][fix] Propagate mapping to intermediate layers by @2ez4bz in #6611
  • [None][test] fix yml condition error under qa folder by @ruodil in #6734
  • [None][doc] Add doc for multimodal feature support matrix by @chang-l in #6619
  • [TRTLLM-6898][feat] make fused_moe_cute_dsl work on blackwell by @limin2021 in #6616
  • [https://nvbugs/5436461][infra] Adjust free_gpu_memory_fraction of test_eagle3 to prevent OOM on CI by @leslie-fang25 in #6631
  • [None][refactor] Combine resmooth_to_fp8_e8m0 and transform_sf_into_required_layout by @yuxianq in #6654
  • [https://nvbugs/5437106][fix] Fix llama4 scout TRTLLM attn_backend by @JunyiXu-nv in #6690
  • [None][fix] Remove lock related typo in py_executor by @lancelly in #6653
  • [None][feat] move kv cache measure into transfer session by @zhengd-nv in #6633
  • [None][fix]revert kvcache transfer by @chuangz0 in #6709
  • [TRTLLM-6650][fix] Enhance CUDA graph + Beam search to correctly handle padding by @stnie in #6665
  • [TRTLLM-6308][feat] Support Aggregate mode for phi4-mm by @Wanli-Jiang in #6184
  • [None][feat] Optimize CUDA graph memory usage for spec decode cases by @mikeiovine in #6718
  • [TRTLLM-7025] [infra] Reorganize CODEOWNERS to rectify examples mapping by @venkywonka in #6762
  • [None][doc] Move AutoDeploy README.md to torch docs by @Fridah-nv in #6528
  • [None][fix] WAR GPT OSS on H20 with Triton MOE by @dongfengy in #6721
  • [TRTLLM-6420][feat] add support for Eclairv2 model - cherry-pick changes and minor fix by @yibinl-nvidia in #6493
  • [None][feat] Core Metrics Implementation by @hcyezhang in #5785
  • [https://nvbugs/5398180][feat] Improve Llama4 performance for small max_seqlen cases by @nv-yilinf in #6306
  • [TRTLLM-6637][feat]...
Read more

v1.0.0rc6

07 Aug 10:54
a16ba64
Pre-release

Announcement Highlights:

  • Model Support

  • Feature

    • Add LoRA support for Gemma3 (#6371)
    • Add support for scheduling attention DP requests (#6246)
    • Multi-block mode for Hopper spec dec XQA kernel (#4416)
    • LLM sleep & wakeup Part 1: virtual device memory (#5034)
    • best_of/n for pytorch workflow (#5997)
    • Add speculative metrics for trt llm bench (#6476)
    • (Breaking Change) Hopper SWA non-cyclic kernels + KV reuse + Spec Dec (#6379)
    • Allow sending more than 2GiB through MPI by using mpi4py.util.pkl5 (#6522)
    • check input tokens + improve error handling (#5170)
    • Add support for fused gate_up_proj scales for FP8 blockwise (#6496)
    • Add vLLM KV Pool support for XQA kernel (#6013)
    • Switch to internal version of MMProjector in Gemma3 (#6572)
    • Enable fp8 SwiGLU to minimize host overhead (#6540)
    • Add Qwen3 MoE support to TensorRT backend (#6470)
    • Establish UCX connections via ZMQ (#6090)
    • Disable add special tokens for Llama3.3 70B (#6482)
  • API

  • Benchmark

    • ADP schedule balance optimization (#6061)
    • allreduce benchmark for torch (#6271)
  • Documentation

    • Make example SLURM scripts more parameterized (#6511)
    • blog: Scaling Expert Parallelism in TensorRT-LLM (Part 2: Performance Status and Optimization) (#6547)
    • Exposing the latest tech blogs in README.md (#6553)
    • update known issues (#6247)
    • trtllm-serve doc improvement. (#5220)
    • Adding GPT-OSS Deployment Guide documentation (#6637)
    • Exposing the GPT OSS model support blog (#6647)
    • Add llama4 hybrid guide (#6640)
    • Add DeepSeek R1 deployment guide. (#6579)
    • Create deployment guide for Llama4 Scout FP8 and NVFP4 (#6550)
  • Known Issues

    • On bare-metal Ubuntu 22.04 or 24.04, please install the cuda-python==12.9.1 package after installing the TensorRT-LLM wheel. This resolves an incompatibility with the default cuda-python 13, which otherwise fails with ImportError: cannot import name 'cuda' from 'cuda'.
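
A minimal sketch of the failing import described in the last known issue, usable as a guard in a launch script (assumption: the guard and its error message are illustrative additions, not part of TensorRT-LLM):

```python
# Sketch of the cuda-python incompatibility described in the known issue above.
# Assumption: the guard and its message are illustrative, not TensorRT-LLM code.
try:
    # cuda-python 12.x exposes the driver bindings as `cuda.cuda`;
    # with cuda-python 13.x this import fails with:
    #   ImportError: cannot import name 'cuda' from 'cuda'
    from cuda import cuda  # noqa: F401
except ImportError as exc:
    raise SystemExit(
        "Incompatible cuda-python detected. Install cuda-python==12.9.1 "
        "after installing the TensorRT-LLM wheel."
    ) from exc
```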

What's Changed

  • [fix] Fix missing fields in xqa kernel cache key by @lowsfer in #6282
  • [TRTLLM-6364][infra] Validate for PR titles to ensure they follow the required format by @niukuo in #6278
  • [fix] Update get_trtllm_bench_build_command to handle batch size and tokens by @venkywonka in #6313
  • refactor: Remove unused buffers and bindings from sampler by @Funatiq in #6484
  • chore: Make example SLURM scripts more parameterized by @kaiyux in #6511
  • fix: Fix missing key by @zerollzeng in #6471
  • [https://nvbugs/5419066][fix] Use trt flow LLM by @crazydemo in #6467
  • [TRTLLM-4279] fix: Add a protection test for checking trtllm custom ops by @yali-arch in #6515
  • [https://nvbugs/5419069][fix] Fix the mismatched layer name components. by @hyukn in #6417
  • [None][doc] blog: Scaling Expert Parallelism in TensorRT-LLM (Part 2: Performance Status and Optimization) by @kaiyux in #6547
  • [None][chore] Disable add special tokens for Llama3.3 70B by @chenfeiz0326 in #6482
  • [None][doc] Exposing the latest tech blogs in README.md by @juney-nvidia in #6553
  • [None][fix] update nemotron nas tests free_gpu_memory_fraction=0.8 by @xinhe-nv in #6552
  • [None][infra] Pin the version for triton to 3.3.1 (#6508) (#6519) by @chzblych in #6549
  • [https://nvbugs/5340941][https://nvbugs/5375785] - fix: Wrap attentio… by @liji-nv in #6355
  • [TRTLLM-6657][feat] Add LoRA support for Gemma3 by @brb-nv in #6371
  • [https://nvbugs/5381276][fix] fix warning for fused_a_gemm by @yunruis in #6402
  • [None][Infra] - Skip failed tests in post-merge by @EmmaQiaoCh in #6558
  • [AutoDeploy] merge feat/ad-2025-07-22 by @lucaslie in #6520
  • [TRTLLM-6624][feat] skip post blackwell by @xinhe-nv in #6357
  • [TRTLLM-6357][test] Add accuracy tests for Qwen3 by @reasonsolo in #6177
  • [None][fix] Serialize the window_size in the kv event by @richardhuo-nv in #6526
  • [None][feat] Add support of scheduling attention dp request by @Shunkangz in #6246
  • [None][refactor] Simplify finish reasons handling in DecoderState by @Funatiq in #6524
  • [None][infra] add eagle3 one model accuracy tests by @jhaotingc in #6264
  • [TRTLLM-6224][infra] Upgrade dependencies to DLFW 25.06 and CUDA 12.9.1 by @yiqingy0 in #5678
  • use cudaSetDevice to create context, fix nvbug 5394497 by @chuangz0 in #6403
  • [None][feat] Multi-block mode for Hopper spec dec XQA kernel by @jhaotingc in #4416
  • [TRTLLM-6473][test] add speculative decoding and ep load balance cases into QA test list by @crazydemo in #6436
  • [fix] Fix DeepSeek w4a8 weight loading by @jinyangyuan-nvidia in #6498
  • chore: add EXAONE4 accuracy test by @yechank-nvidia in #6397
  • test: modify max_lora_rank of phi4_multimodal to 320 by @ruodil in #6474
  • [None][chore] Mass integration of release/0.21 (part5) by @dc3671 in #6544
  • [None][infra] update namelist by @niukuo in #6465
  • [https://nvbugs/5430932][infra] update namelist by @niukuo in #6585
  • [None][chore] add online help to build_wheel.py and fix a doc link by @zhenhuaw-me in #6391
  • test: move ministral_8b_fp8 to fp8_specific gpu list(exclude Ampere) by @ruodil in #6533
  • [TRTLLM-5563][infra] Move test_rerun.py to script folder by @yiqingy0 in #6571
  • [None][infra] Enable accuracy test for eagle3 and chunked prefill by @leslie-fang25 in #6386
  • [None][infra] Enable test of chunked prefill with logit post processor by @leslie-fang25 in #6483
  • [TRTLLM-4406][feat] LLM sleep & wakeup Part 1: virtual device memory by @tongyuantongyu in #5034
  • [None][fix] remove closed bugs by @xinhe-nv in #6576
  • [None][fix] xqa precision for fp16/bf16 kv cache by @Bruce-Lee-LY in #6573
  • [None][fix] Revert commit 48ddc3d & add test for disagg server with different max_num_tokens by @LinPoly in #6259
  • [None][chore] Bump version to 1.0.0rc6 by @yiqingy0 in #6597
  • [None][chore] Add unit test for Gemma3 lora by @brb-nv in #6560
  • [TRTLLM-6364] [fix] Update PR title regex to allow optional spaces between ticket and type by @niukuo in #6598
  • [None][infra] Waive failed case in post-merge on main by @EmmaQiaoCh in #6602
  • [None][test] update invalid test name by @crazydemo in #6596
  • [TRTLLM-5271][feat] best_of/n for pytorch workflow by @evezhier in #5997
  • [None][chore] Update Gemma3 closeness check to mitigate flakiness by @brb-nv in #6591
  • [TRTLLM-6685][feat] Add speculative metrics for trt llm bench by @kris1025 in #6476
  • [None][doc] Fix blog4 typo by @syuoni in #6612
  • [TRTQA-2920][fix] Add failed cases into waives.txt by @xinhe-nv in #6581
  • [TRTLLM-6856][feat] add disaggregated serving tests to QA list by @xinhe-nv in #6536
  • [https://nvbugs/5433581][infra] Update install docs and CI script for SBSA deep_gemm workaround by @chzblych in #6607
  • [TRTLLM-5990][doc] trtllm-serve doc improvement. by @nv-guomingz in #5220
  • [None][chore] Add readme for perf test by @ruodil in #6443
  • [https://nvbugs/5436461][infra] Skip test_eagle3 test with device memory check by @leslie-fang25 in #6617
  • [None][chore] ucx establish connection with zmq by @chuangz0 in #6090
  • [TRTLLM-6674][feat] (Breaking Change) Hopper SWA non-cyclic kernels + KV reuse + Spec Dec by @symphonylyh in #6379
  • [None][fix] Remove expand configuration from mamba2 mixer by @danielafrimi in #6521
  • [TRTLLM-6826][feat] Allow sending more than 2GiB through MPI by using mpi4py.util.pkl5 by @amitz-nv in h...
Read more

v1.0.0rc5

04 Aug 09:45
fbee279
Pre-release

Announcement Highlights:

  • Model Support
  • Feature
    • Deepseek R1 FP8 Support on Blackwell (#6486)
    • Auto-enable ngram with concurrency <= 32. (#6232)
    • Support turning on/off spec decoding dynamically (#6363)
    • Improve LoRA cache memory control (#6220)
    • Support structural tag in C++ runtime and upgrade xgrammar to 0.1.21 (#6408)
    • Update long rope for phi3.5/phi4-mini/phi4-mm (#6353)
    • Add support for external multimodal embeddings (#6263)
    • Add support for disaggregation with pp with pytorch backend (#6369)
    • Add _prepare_and_schedule_batch function in PyExecutor (#6365)
    • Add status tags to LLM API reference (#5707)
    • Remove cudaStreamSynchronize when using relaxed acceptance (#5262)
    • Support JSON Schema in OpenAI-Compatible API (#6321)
    • Support chunked prefill on spec decode 2 model (#6104)
    • Enhance beam search support with CUDA graph integration (#6217)
    • Enable Overlap scheduler + Beam Search in TRTLLM Sampler (#6223)
    • Add KV cache reuse support for multimodal models (#5444)
    • Multistream initial support for torch compile flow (#5847)
    • Support nanobind bindings (#6185)
    • Support Weight-Only-Quantization in PyTorch Workflow (#5850)
    • Support pytorch LoRA adapter eviction (#5616)
  • API
    • [BREAKING CHANGE] Change default backend to PyTorch in trtllm-serve (#5717)
  • Bug Fixes
    • fix: remove duplicate layer multiplication in KV cache size calculation (#6481)
    • Fix illegal memory access in MLA (#6437)
    • Fix nemotronNAS loading for TP>1 (#6447)
    • Switch placement of image placeholder for mistral 3.1 (#6435)
    • Fix wide EP when using DeepEP with online EPLB (#6429)
    • Move kv_cache_free_gpu_mem_fraction arg to benchmark command in tests (#6463)
    • Fix bugs caused by None attention_bias during Qwen3 model convert engine (#6344)
    • Add a runtime flag to enable fail fast when attn window is too large to fit at least one sequence in KV cache (#5974)
    • Fix PD + MTP + overlap scheduler accuracy issue (#6136)
    • Fix bug of Qwen3 when using fp4 on sm120 (#6065)
  • Benchmark
    • Fixes to parameter usage and low latency configuration. (#6343)
    • Add Acceptance Rate calculation to benchmark_serving (#6240)
  • Performance
    • Enable AllReduce-associated fusion patterns in Llama3/4. (#6205)
    • Optimize Mtp performance (#5689)
    • Customize cublasLt algo for Llama 3.3 70B TP4 (#6315)
    • Add non UB AR + Residual + Norm + Quant fusion (#6320)
  • Infrastructure
    • Remove auto_assign_reviewers option from .coderabbit.yaml (#6490)
    • Use build stage wheels to speed up docker release image build (#4939)
  • Documentation
    • Add README for wide EP (#6356)
    • Update Llama4 deployment guide: update config & note concurrency (#6222)
    • Add Deprecation Policy section (#5784)
  • Known Issues
    • If you encounter the error OSError: CUDA_HOME environment variable is not set, set the CUDA_HOME environment variable (a short sketch follows this list)
    • The aarch64 Docker image and wheel package for 1.0.0rc5 are broken. This will be fixed in the upcoming weekly release.
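
A minimal sketch for the CUDA_HOME workaround mentioned above (assumptions: CUDA is installed under /usr/local/cuda, and setting the variable before importing tensorrt_llm is sufficient; exporting CUDA_HOME in the shell before launching works as well):

```python
# Sketch of the CUDA_HOME workaround from the known issues above.
# Assumption: /usr/local/cuda is where the CUDA toolkit lives on this machine.
import os

os.environ.setdefault("CUDA_HOME", "/usr/local/cuda")

import tensorrt_llm  # noqa: E402  (imported only after CUDA_HOME is set)

print(tensorrt_llm.__version__)
```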

What's Changed

Read more

v0.21.0

04 Aug 14:23
751d5f1

TensorRT-LLM Release 0.21.0

Key Features and Enhancements

  • Model Support
    • Added Gemma3 VLM support
  • Features
    • Added large-scale EP support
    • Integrated NIXL into the communication layer of the disaggregated service
    • Added fabric Memory support for KV Cache Transfer
    • Added MCP in ScaffoldingLLM
    • Added support for w4a8_mxfp4_fp8 quantization
    • Added support for fp8 rowwise quantization
    • Added generation logits support in TRTLLM Sampler
    • Added log probs support in TRTLLM Sampler
    • Optimized TRTLLM Sampler perf single beam single step
    • Enabled Disaggregated serving for Qwen-3
    • Added EAGLE3 support for Qwen-3
    • Fused finalize and allreduce for Qwen-MoE model
    • Refactored Fused MoE module
    • Added support for chunked attention on Blackwell and Hopper
    • Introduced sliding-window attention kernels for the generation phase on Blackwell
    • Updated DeepSeek FP8 TRT-LLM Gen cubins to improve performance in large batch size scenarios
    • Added FP8 block-scale GEMM support on SM89
    • Enabled overlap scheduler between draft forwards
    • Added Piecewise cuda graph support for MLA
    • Added model-agnostic one-engine eagle3
    • Enabled Finalize + Allreduce + add + rmsnorm fusion
    • Integrated TRT-LLM Gen FP8 block scale MoE with Pytorch workflow kernel autotuner
    • Added support for Eagle3 + disaggregated serving in two model speculative decoding flow
    • Validated Llama 3.1 models on H200 NVL
  • Benchmark:
    • Added all_reduce.py benchmark script for testing
    • Added beam width to trtllm-bench latency command
    • Fixed trtllm-bench iter_stats and cuda_graph_batch_sizes errors
    • Enabled trtllm-bench to run LoRA and add basic e2e perf testing capability for LoRA
    • Supported post_proc for bench
    • Added no_kv_cache_reuse option and streaming support for trtllm serve bench

Infrastructure Changes

  • The base Docker image for TensorRT-LLM is updated to nvcr.io/nvidia/pytorch:25.05-py3.
  • The base Docker image for TensorRT-LLM Backend is updated to nvcr.io/nvidia/tritonserver:25.05-py3.
  • The dependent public PyTorch version is updated to 2.7.1.
  • The dependent TensorRT version is updated to 10.11.
  • The dependent NVIDIA ModelOpt version is updated to 0.31.
  • The dependent NCCL version is updated to 2.27.5.

API Changes

  • Set _AutoDeployLlmArgs as primary config object
  • Removed decoder request from decoder interface
  • Enhanced the torch_compile_config in llm args
  • Removed the redundant use_kv_cache field from PytorchConfig
  • Moved allreduce_strategy from committed api to reference

Fixed Issues

  • Fixed disaggregated service hang when MNNVL two-shot AllReduce is enabled (#4678)
  • Fixed EP load balancer with MTP layer and route offset by EP rank (#4767)
  • Fixed cuda graph padding for spec decoding (#4853)
  • Fixed llama 4 long context issue (#4809)
  • Fixed max_num_sequences calculation with overlap scheduling (#4532)
  • Fixed chunked prefill + overlap scheduling (#5761)
  • Fixed trtllm-bench hang issue due to LLM API IPC (#4798)
  • Fixed index out of bounds error in spec decoding (#5954)
  • Fixed MTP illegal memory access in cuda graph warmup (#5947)
  • Fixed no free slots error with spec decode + disagg (#5975)
  • Fixed one-off attention window size for Gemma3 1B (#5564)

Known Issues

  • accuracy/test_cli_flow::TestGpt2::test_beam_search_large is broken.
  • Enabling disaggregated serving, MTP, and the overlap scheduler at the same time can lead to accuracy problems.
  • In 0.21, full chunked attention support has been added to ensure that the LLaMA4 model can run functionally with > 8K sequence length. There is a known performance regression on Hopper (affecting only the LLaMA4 model) due to this functional enhancement; the root cause has been identified and the fix will be part of a future release.

What's Changed

Read more

v1.0.0rc4

22 Jul 08:24
69e9f6d
Pre-release

Announcement Highlights:

  • Model Support
    • Add phi-4-multimodal model support (#5644)
    • Add EXAONE 4.0 model support (#5696)
  • Feature
    • Add support for two-model engine KV cache reuse (#6133)
    • Unify name of NGram speculative decoding (#5937)
    • Add retry knobs and handling in disaggregated serving (#5808)
    • Add Eagle-3 support for qwen3 dense model (#5879)
    • Remove padding of FusedMoE in attention DP (#6064)
    • Enhanced handling of decoder requests and logits within the batch manager (#6055)
    • Add support for Modelopt fp8_pb_wo quantization scheme (#6106)
    • Update deepep dispatch API (#6037)
    • Add support for benchmarking individual gemms in MOE benchmark (#6080)
    • Simplify token availability calculation for VSWA (#6134)
    • Migrate EAGLE3 and draft/target speculation to Drafter (#6007)
    • Enable guided decoding with overlap scheduler (#6000)
    • Use cacheTransceiverConfig as knobs for disagg service (#5234)
    • Add vectorized loading for finalize kernel in MoE Trtllm backend (#5919)
    • Enhance ModelConfig for kv cache size calculations (#5868)
    • Clean up drafter/resource manager creation logic (#5805)
    • Add core infrastructure to enable loading of custom checkpoint formats (#5372)
    • Cleanup disable_fp4_allgather (#6006)
    • Use session abstraction in data transceiver and cache formatter (#5611)
    • Add support for Triton request cancellation (#5898)
    • Support TRTLLM_DEEP_EP_TOKEN_LIMIT to allow running DeepEP on memory-constrained GPUs (#5684)
    • Remove enforced sorted order of batch slots (#3502)
    • Use huge page mapping for host accessible memory on GB200 (#5963)
  • API
    • [BREAKING CHANGE] Unify KvCacheConfig in LLM class for pytorch backend (#5752)
    • [BREAKING CHANGE] Rename cuda_graph_config padding_enabled field (#6003) (a hedged migration sketch covering both API changes follows this list)
  • Bug Fixes
    • Skip prompt length checking for generation only requests (#6146)
    • Avoid memory calls during broadcast for single GPU (#6010)
    • Record kv-cache size in MLACacheFormatter (#6181)
    • Always use py_seq_slot in runtime (#6147)
    • Update beam search workspace estimation to new upper bound (#5926)
    • Update disaggregation handling in sampler (#5762)
    • Fix TMA error with GEMM+AR on TP=2 (#6075)
    • Fix scaffolding aime test in test_e2e (#6140)
    • Fix KV Cache overrides in trtllm-bench (#6103)
    • Remove duplicated KVCache transmission check (#6022)
    • Release slots with spec decode + disagg (#5975) (#6032)
    • Add propagation of trust_remote_code to AutoConfig (#6001)
    • Move NCCL group in all-gather and reduce-scatter OPs outside the outer loop (#6053)
    • Pad DeepEP fp4 recv tensors if empty (#6048)
    • Adjust window sizes of VSWA at torch backend (#5880)
    • Fix MOE benchmark to rotate buffers to prevent L2 cache reuse (#4135)
    • Fix eagle3 two model disaggregated serving test (#6014)
    • Update torch.compile option to fix triton store_cubin error (#5865)
    • Fix chunked prefill + overlap scheduling (#5761)
    • Fix mgmn postprocess error (#5835)
    • Fallback to cubins for fp8 fmha kernels on Ada (#5779)
    • Enhance _check_arguments to filter illegal requests for pytorch backend (#5541)
    • Rewrite completion API to avoid repetitive tokens (#5201)
    • Fix disagg + speculative decoding (#5558)
  • Benchmark
    • Add latency support for trtllm bench (#3730)
  • Performance
    • Optimize TRTLLM Sampler perf single beam single step (#5550)
    • Performance Optimization for MNNVL TwoShot Kernel (#5925)
    • Enable 128x256 tile shapes for FP4 MOE CUTLASS backend (#5986)
    • Enable cuda graph by default (#5480)
  • Infrastructure
    • Add script to map tests <-> jenkins stages & vice-versa (#5431)
    • Speedup beam search unit tests with fixtures for LLM (#5843)
    • Fix single-GPU stage failed will not raise error (#6165)
    • Update bot help messages (#5277)
    • Update jenkins container images (#6094)
    • Set up the initial config for CodeRabbit (#6128)
    • Upgrade NIXL to 0.3.1 (#5991)
    • Upgrade modelopt to 0.33 (#6058)
    • Support show all stage name list when stage name check failed (#5946)
    • Run docs build only if PR contains only doc changes (#5184)
  • Documentation
    • Update broken link of PyTorchModelEngine in arch_overview (#6171)
    • Add initial documentation for trtllm-bench CLI. (#5734)
    • Add documentation for eagle3+disagg+dynamo (#6072)
    • Update llama-3.3-70B guide (#6028)
  • Known Issues
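
For the two breaking API changes above, a hedged migration sketch (assumptions: KvCacheConfig and CudaGraphConfig are importable from tensorrt_llm.llmapi in this release, and the renamed padding field is spelled enable_padding; verify both against the installed version's LLM API reference, and treat the model path as a placeholder):

```python
# Hedged migration sketch for the unified KvCacheConfig and the renamed
# cuda_graph_config padding field. The names below are assumptions, not
# verified against 1.0.0rc4; the model path is only a placeholder.
from tensorrt_llm.llmapi import LLM, CudaGraphConfig, KvCacheConfig

llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    # KV cache knobs for the PyTorch backend now flow through one KvCacheConfig.
    kv_cache_config=KvCacheConfig(free_gpu_memory_fraction=0.85),
    # `padding_enabled` was renamed; `enable_padding` is the assumed new name.
    cuda_graph_config=CudaGraphConfig(enable_padding=True),
)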

What's Changed

  • [TRTLLM-6164][TRTLLM-6165] chore: add runtime example for pytorch by @Superjomn in #5956
  • fix: Fix MoE benchmark by @syuoni in #5966
  • [TRTLLM-6160] chore: add sampling examples for pytorch by @Superjomn in #5951
  • Use huge page mapping for host accessible memory on GB200 by @dongxuy04 in #5963
  • Breaking change: perf: [TRTLLM-4662] Enable cuda graph by default by @dominicshanshan in #5480
  • fix: set allreduce strategy to model config by @WeiHaocheng in #5955
  • chore: Mass integration of release/0.21 (part 3) by @dc3671 in #5909
  • infra: [TRTLLM-6242] install cuda-toolkit to fix sanity check by @ZhanruiSunCh in #5709
  • Waive L0 test by @yiqingy0 in #6002
  • [Nvbug/5383670] fix: switch test case to non-fp4 ckpt for more GPU coverage by @kaiyux in #5882
  • fix #4974: A thread leak issue in scaffolding unittest by @ccs96307 in #5020
  • feat: EXAONE4.0 support by @yechank-nvidia in #5696
  • [TRTLLM-5653][infra] Run docs build only if PR contains only doc changes by @zhanga5 in #5184
  • feat: Update Gemma3 Vision Encoder by @brb-nv in #5973
  • enh: Bidirectional mask with multiple images for Gemma3 by @brb-nv in #5976
  • refactor: Remove enforced sorted order of batch slots by @Funatiq in #3502
  • [fix] fix eagle3 two model disaggregated serving test by @Tabrizian in #6014
  • perf: Enable 128x256 tile shapes for FP4 MOE CUTLASS backend by @djns99 in #5986
  • [nvbugs-5318143] fix: restrict PyTorch memory usage to avoid OOMs by @ixlmar in #5964
  • doc: update EXAONE 4.0 news by @yechank-nvidia in #6034
  • [Model load] Fix llama min-latency model load by @arekay in #5883
  • fix: Fix MOE benchmark to rotate buffers to prevent L2 cache reuse by @djns99 in #4135
  • Doc: Update llama-3.3-70B guide by @jiahanc in #6028
  • infra: [TRTLLM-6331] Support show all stage name list when stage name check failed by @ZhanruiSunCh in #5946
  • [Infra][TRTLLM-6013] - Fix stage name in single stage test rerun report by @yiqingy0 in #5672
  • [Fix] check for ImportError or ModuleNotFoundError for deep_ep_utils by @lucaslie in #6026
  • infra: [TRTLLM-6313] Fix the package sanity stage 'Host Node Name' in… by @ZhanruiSunCh in #5945
  • chore: [Breaking Change] Rename cuda_graph_config padding_enabled fie… by @nv-guomingz in #6003
  • test: add recursive updating pytorch config and change MOE backend format in perf test by @ruodil in #6046
  • test: add llama_v3.3_70b_cases in perf test by @ruodil in #6035
  • [infra] add more log on reuse-uploading by @niukuo in #6036
  • fix: adjust window sizes of VSWA at torch backend by @jaedeok-nvidia in #5880
  • [nvbugs/5385972][nvbugs/5387423][Fix] Minor fix for llava_next/llava_onevision by @MinaHuai in #5998
  • Fix: pad DeepEP fp4 recv tensors if empty by @yuantailing in #6048
  • [fix] Move NCCL group in all-gather and reduce-scatter OPs outside the outer loop by @jinyangyuan-nvidia in #6053
  • support TRTLLM_DEEP_EP_TOKEN_LIMIT to allow run deep-ep on memory-con… by @ttyio in #5684
  • Cherry-pick #5947 by @lfr-0531 in #5989
  • test: Add regression tests for Gemma3 VLM by @brb-nv in #6033
  • feat/add latency support for trtllm bench by @danielafrimi in #3730
  • feat: Add support for Triton request cancellation by @achartier in #5898
  • [fix] Fix Triton build by @Tabrizian in #6076
  • fix: Unable to load phi4-model with tp_size>1 by @Wanli-Jiang in #5962
  • chore: Bump version to 1.0.0rc4 by @yiqingy0 in #6086
  • chore: upgrade modelopt to 0.33 by @nv-guomingz in #6058
  • [nvbug/5347489][nvbug/5388036] increase timeout in disagg worker test by @zhengd-nv in #6041
  • feat: use sessi...
Read more

v1.0.0rc3

16 Jul 08:25
cfcb97a
Pre-release

Announcement Highlights:

  • Model Support
    • Support Mistral3.1 VLM model (#5529)
    • Add TensorRT-Engine Qwen3 (dense) model support (#5650)
  • Feature
    • Add support for MXFP8xMXFP4 in pytorch (#5411)
    • Log stack trace on error in openai server (#5749)
    • Refactor the topk parallelization part for the routing kernels (#5705)
    • Adjust free GPU memory fraction in KvCacheConfig for DeepSeek R1 tests (#5774)
    • Support FP8 row-wise dense GEMM in torch flow (#5615)
    • Move DeepEP from Docker images to wheel building (#5534)
    • Add user-provided speculative decoding support (#5204)
    • Add optional module cache for TRT-LLM Gen Gemm interfaces (#5743)
    • Add streaming scaffolding_llm.generate_async support (#5345)
    • Detokenize option in /v1/completions request (#5382)
    • Support n-gram speculative decoding with disagg (#5732)
    • Return context response immediately when stream_interval > 1 (#5836)
    • Add support for sm121 (#5524)
    • Add LLM speculative decoding example (#5706)
    • Update xgrammar version to 0.1.19 (#5830)
    • Some refactor on WideEP (#5727)
    • Integrate TRT-LLM Gen FP4 block scale MoE with Pytorch workflow kernel autotuner (#5764)
    • Update transformers to 4.53.0 (#5747)
    • Share PyTorch tensor between processes (#5396)
    • Custom masking utils for Gemma3 VLM (#5853)
    • Remove support for llmapi + TRT backend in Triton (#5856)
    • Add TRTLLM MoE nvfp4 cubins for mid-high concurrency; attention_dp for TRTLLM MoE (#5723)
    • Enable kvcache to be reused during request generation (#4028)
    • Simplify speculative decoding configs (#5639)
    • Add binding type build argument (pybind, nanobind) (#5802)
    • Add the ability to write a request timeline (#5258)
    • Support deepEP fp4 post quant all2all dispatch (#5881)
    • Refactor the rest routing part for the routing kernels in the MoE TRT-LLM backend (#5771)
    • Move vision parts from processor to model for Gemma3 (#5888)
  • API
    • [BREAKING CHANGE] Rename mixed_sampler to enable_mixed_sampler (#5751)
    • [BREAKING CHANGE] Rename LLM.autotuner_enabled to enable_autotuner (#5876) (a rename sketch follows this list)
  • Bug Fixes
    • Fix test_generate_with_seed CI failure. (#5772)
    • Improve fp4_block_scale_moe_runner type check (#5681)
    • Fix prompt adapter TP2 case (#5782)
    • Fix disaggregated serving with attention DP (#4993)
    • Ignore nvshmem_src_*.txz from confidentiality-scan (#5831)
    • Fix a quote error introduced in #5534 (#5816)
    • Fix the accuracy issue when reduce_fusion is enabled for GEMMA model. (#5801)
    • Fix lost requests for disaggregated serving (#5815)
    • Update unit tests: skip all_close assert for dropout in attention, increase tolerance for rope op test (#5855)
    • Fix GEMM+AR fusion on blackwell (#5563)
    • Catch inference failures in trtllm-bench (#5841)
    • Add is_fp8_output key to XQA kernel cubin hashing (solves Eagle3-one-engine Hopper fp8 bug) (#5813)
    • Skip rope scaling for local layers in Gemma3 VLM (#5857)
    • Fix llama4 multimodal support (#5809)
    • Fix Llama4 Scout FP4 crash issue (#5925)
    • Fix max batch size and max tokens in kv cache estimations for Nemotron-H (#5371)
    • Fix moe regression for sm120 (#5823)
    • Fix Qwen2.5VL FP8 support (#5029)
    • Fix the illegal memory access issue in moe gemm on SM120 (#5636)
    • Avoid nesting NCCL group in allgather and reduce scatter OPs (#5866)
    • Fix tileN cannot % 16==0 & support sm89 deepgemm bmm (#5531)
    • Fix incremental detokenization (#5825)
    • Fix MoE workspace info by storing Torch tensor itself instead of data_ptr (#5900)
    • Make the bench serving script compatible with different usages (#5905)
    • Fix mistral unit tests due to transformers upgrade (#5904)
    • Fix the Llama3.1 405B hanging issue. (#5698) (#5925)
    • Fix Gemma3 unit tests due to transformers upgrade (#5921)
    • Extend triton exit time for test_llava (#5971)
    • Fix alltoall for llama4 (apply_router_weight_on_input=True) (#5902)
    • Remove SpecConfig and fix thread leak issues (#5931)
    • Fast redux detection in trtllm gen routing kernel (#5941)
    • Fix cancel request logic (#5800)
    • Fix errors in wide-ep scripts (#5992)
    • Fix error in post-merge-tests (#5949)
  • Benchmark
  • Performance
    • Optimize TRTLLM Sampler perf single beam single step (#5550)
  • Infrastructure
    • Fix a syntax issue in the image check (#5775)
    • Speedup fused moe tests (#5726)
    • Set the label community action to only run on upstream TRTLLM (#5806)
    • Update namelist in blossom-ci (#5838)
    • Update nspect version (#5832)
    • Reduce redundant test cases for TRTLLM Gen FP8 MoE (#5845)
    • Parallelize torch unittests (#5714)
    • Use current_image_tags.properties in rename_docker_images.py (#5846)
    • Fix two known NSPECT high vulnerability issues and reduce image size (#5434)
  • Documentation
    • Update the document of qwen3 and cuda_graph usage (#5705)
    • Update cuda_graph_config usage part in DS R1 docs (#5796)
    • Add llama4 Maverick eagle3 and max-throughput and low_latency benchmark guide (#5810)
    • Fix link in llama4 Maverick example (#5864)
    • Add instructions for running gemma in disaggregated serving (#5922)
    • Add qwen3 disagg perf metrics (#5822)
    • Update the disagg doc (#5938)
    • Update the link of the diagram (#5953)
  • Known Issues
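
A hedged sketch of the two renames called out in the API section above (assumption: both options are plain keyword arguments on the LLM constructor in this release; the model path is only a placeholder):

```python
# Hedged rename sketch: `mixed_sampler` -> `enable_mixed_sampler`,
# `autotuner_enabled` -> `enable_autotuner`. Treat the argument names as
# assumptions and check them against the 1.0.0rc3 LLM API reference.
from tensorrt_llm.llmapi import LLM

llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    enable_mixed_sampler=True,  # formerly `mixed_sampler`
    enable_autotuner=True,      # formerly `autotuner_enabled`
)
```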

What's Changed

  • feat: Add support for MXFP8xMXFP4 in pytorch by @djns99 in #5535
  • [Doc] update the document of qwen3 and cuda_graph usage by @byshiue in #5703
  • [Infra] - Fix a syntax issue in the image check by @chzblych in #5775
  • chore: log stack trace on error in openai server by @zhengd-nv in #5749
  • fix: [nvbug/5368507] Fix test_generate_with_seed CI failure. by @bobboli in #5772
  • Refactor the topk parallelization part for the routing kernels by @ChristinaZ in #5567
  • test: [CI] remove closed bugs by @xinhe-nv in #5770
  • [TRTLLM-5530][BREAKING CHANGE] refactor: LLM arglist rename mixed_sampler to enable_mixed_sampler by @Superjomn in #5751
  • fix: Adjust free GPU memory fraction in KvCacheConfig for DeepSeek R1 tests by @yizhang-nv in #5774
  • [TRTLLM-5812][feat] support FP8 row-wise dense GEMM in torch flow by @DylanChen-NV in #5615
  • feat: Optimize TRTLLM Sampler perf single beam single step by @dcampora in #5550
  • Refactor: move DeepEP from Docker images to wheel building by @yuantailing in #5534
  • [TRTLLM-6291] feat: Add user-provided speculative decoding support by @Funatiq in #5204
  • [ci] speedup fused moe tests by @omera-nv in #5726
  • [feat] Adds optional module cache for TRT-LLM Gen Gemm interfaces by @davidclark-nv in #5743
  • chore: [Breaking Change] Rename cuda_graph_config padding_enabled fie… by @nv-guomingz in #5795
  • feat: add MultimodalParams & putting all multimodal params into it and refactor HyperCLOVAX & Qwen2/2.5-VL by @yechank-nvidia in #5522
  • Revert "chore: [Breaking Change] Rename cuda_graph_config padding_enabled fie…" by @nv-guomingz in #5818
  • [fix] https://nvbugs/5333654 Unwaive to check ci status and improve torch compile multi-gpu coverage by @liji-nv in #5700
  • [fix] improve fp4_block_scale_moe_runner type check by @Alcanderian in #5681
  • feat(scaffolding): add streaming scaffolding_llm.generate_async support by @dc3671 in #5345
  • [None][infra] Set the label community action to only run on upstream TRTLLM by @poweiw in #5806
  • Waive some test_llama_eagle3 unittests by @venkywonka in #5811
  • [NvBug 5362426] fix: Fix prompt adapter TP2 case by @syuoni in #5782
  • chore: bump version to 1.0.0rc3 by @yiqingy0 in #5819
  • doc: update cuda_graph_config usage part in DS R1 docs by @nv-guomingz in #5796
  • fix: Disaggregate serving with attention DP by @VALLIS-NERIA in #4993
  • Fix: ignore nvshmem_src_*.txz from confidentiality-scan by @yuantailing in #5831
  • tests: waive failed cases on main by @xinhe-nv in #5781
  • [Infra] - Waive L0 test by @yiqingy0 in #5837
  • update namelist in blossom-ci by @niukuo in #5838
  • Fix a quote error introduced in #5534 by @yuantailing in #5816
  • [feat]: Detokenize option in /v1/completions request by @Wokzy in #5382
  • [5305318] fix: Fix the accuracy issue when reduce_fusion is enabled for GEMMA model. by @hyukn in #5801
  • [TRTLLM-5847][feat] Support n-gram speculative decoding with disagg by @raayandhar in #5732
  • [TRTLLM-5878] update nspect version by @niukuo in #5832
  • feat: Return context response immediately when stream_interval > 1 by @kaiyux in https://github.c...
Read more

v1.0.0rc2

08 Jul 07:04
66f299a
Pre-release

Announcement Highlights:

  • Model Support
  • Feature
    • Add KV events support for sliding window attention (#5580)
    • Add beam search support to the PyTorch Workflow (#5333)
    • Support more parameters in openai worker of scaffolding (#5115)
    • Enable CUDA graphs for Nemotron-H (#5646)
    • Add spec dec param to attention op for pytorch workflow (#5146)
    • Fuse w4a8 moe pre-quant scale on Hopper (#5613)
    • Support torch compile for attention dp (#5086)
    • Add W4A16 GEMM support for pytorch workflow (#4232)
    • Add request_perf_metrics to triton LLMAPI backend (#5554)
    • Add AutoDeploy fp8 quantization support for bmm (#3849)
    • Refactor moe permute and finalize op by removing duplicated code (#5557)
    • Support duplicate_kv_weight for qwen3 blockwise scale (#5459)
    • Add LoRA support for pytorch backend in trtllm-serve (#5376)
  • API
    • Enhance yaml loading arbitrary options in LlmArgs (#5610)
    • Add back allreduce_strategy parameter into TorchLlmArgs (#5637)
    • Add LLmArgs option to force using dynamic quantization (#5346)
    • Remove ptuning knobs from TorchLlmArgs (#5595)
    • [BREAKING CHANGE] Enhance the LLM args PyTorch config, part 1 (cuda_graph_config) (#5014)
  • Bug Fixes
    • Fix missing arg to alltoall_prepare_maybe_dispatch (#5669)
    • Fix attention DP doesn't work with embedding TP (#5642)
    • Fix broken cyclic reference detect (#5417)
    • Fix permission for local user issues in NGC docker container. (#5373)
    • Fix mtp vanilla draft inputs (#5568)
  • Benchmark
    • Add wide-ep benchmarking scripts (#5760)
  • Performance
    • Reduce DeepEPLowLatency memory and time (#5712)
    • Use tokenizers API to optimize incremental detokenization perf (#5574)
    • Conditionally enable SWAP AB for speculative decoding (#5404)
    • Unify new_tokens format sample state to trtllm samper tokens format (#5513)
    • Replace allgather with AllToAllPrepare (#5570)
    • Optimizations on weight-only batched gemv kernel (#5420)
    • Optimize MoE sort kernels for large-scale EP (#5435)
    • Avoid reswizzle_sf after allgather. (#5504)
  • Infrastructure
    • Always use x86 image for the Jenkins agent and few clean-ups (#5753)
    • Reduce unnecessary kernel generation (#5476)
    • Update the auto-community label action to be triggered every hour (#5658)
    • Improve dev container tagging (#5551)
    • Update the community action to more appropriate api (#4883)
    • Update nccl to 2.27.5 (#5539)
    • Upgrade xgrammar to 0.1.18 (#5364)
  • Documentation
    • Fix outdated config in DeepSeek best perf practice doc (#5638)
    • Add pd dynamic scaling readme (#5540)
    • Add feature support matrix for PyTorch backend (#5037)
    • 1.0 LLM API doc updates (#5629)
  • Known Issues

What's Changed

  • [TRTLLM-5831][feat] Add LoRA support for pytorch backend in trtllm-serve by @talorabr in #5376
  • [CI] reduce mamba2 ssm test parameterization by @tomeras91 in #5571
  • perf: Avoid reswizzle_sf after allgather. by @bobboli in #5504
  • [feat][test] reuse MPI pool executor across tests by @omera-nv in #5566
  • [TRTLLM-5965] perf: Optimize MoE sort kernels for large-scale EP by @syuoni in #5435
  • [feat] Optimizations on weight-only batched gemv kernel by @Njuapp in #5420
  • [ci] remove MMLU if followed by GSM8K by @omera-nv in #5578
  • [TRTLLM-5530][BREAKING CHANGE]: enhance the llm args pytorch config part 1(cuda_graph_config) by @nv-guomingz in #5014
  • Deduplicate waive list by @yiqingy0 in #5546
  • [fix] speedup modeling unittests by @omera-nv in #5579
  • feat : support duplicate_kv_weight for qwen3 blockwise scale by @dongjiyingdjy in #5459
  • [TRTLLM-5331] large-scale EP: perf - Replace allgaher with AllToAllPrepare by @WeiHaocheng in #5570
  • doc: Minor update to DeepSeek R1 best practice by @kaiyux in #5600
  • [nvbug/5354946][fix] Fix mtp vanilla draft inputs by @lfr-0531 in #5568
  • refactor: decoder state setup by @Funatiq in #5093
  • [Infra][main] Cherry-pick from release/0.21: Update nccl to 2.27.5 (#5539) by @EmmaQiaoCh in #5587
  • [TRTLLM-5989, TRTLLM-5991, TRTLLM-5993] doc: Update container instructions (#5490) by @ixlmar in #5605
  • [ci] move eagle1 and medusa tests to post-merge by @omera-nv in #5604
  • chore [TRTLLM-6009]: remove ptuning knobs from TorchLlmArgs by @Superjomn in #5595
  • [fix][ci] missing class names in post-merge test reports by @omera-nv in #5603
  • refactor: [TRTLLM-6150] Refactor moe permute and finalize op by removing duplicated code by @limin2021 in #5557
  • chore: remove cuda_graph_ prefix from cuda_graph_config filed members. by @nv-guomingz in #5585
  • feat: AutoDeploy fp8 quantization support for bmm by @meenchen in #3849
  • feature: unify new_tokens format sample state to trtllm samper tokens format by @netanel-haber in #5513
  • [fix]: Fix main test skip issue by @yizhang-nv in #5503
  • chores: [TRTLLM-6072] 1.0 LLMAPI doc updates by @hchings in #5629
  • add feature support matrix for PyTorch backend by @QiJune in #5037
  • test: [CI] remove closed bugs by @xinhe-nv in #5572
  • test: [CI] Add failed cases into waives.txt by @xinhe-nv in #5569
  • rcca: test default kv_cache_reuse option for pytorch multimodal by @StanleySun639 in #5544
  • [TRTLLM-6104] feat: add request_perf_metrics to triton LLMAPI backend by @xuanzic in #5554
  • test: [CI] Add failed cases into waives.txt by @xinhe-nv in #5582
  • feat: W4A16 GEMM by @danielafrimi in #4232
  • test: Reduce number of C++ test cases by @Funatiq in #5437
  • [https://nvbugs/5318059][test] Unwaive test by @pamelap-nvidia in #5624
  • [Infra] - Add some timeout and unwaive a test which dev fixed by @EmmaQiaoCh in #5631
  • [#5403][perf] Conditionally enable SWAP AB for speculative decoding by @zoheth in #5404
  • [TRTLLM-5277] chore: refine llmapi examples for 1.0 (part1) by @Superjomn in #5431
  • chore: Mass integration of release/0.21 by @dc3671 in #5507
  • refactor: Clean up DecodingInput and DecodingOutput by @Funatiq in #5617
  • perf: Use tokenizers API to optimize incremental detokenization perf by @kaiyux in #5574
  • [feat] Support torch compile for attention dp by @liji-nv in #5086
  • feat: add LLmArgs option to force using dynamic quantization by @achartier in #5346
  • [TRTLLM-5644][infra] Update the community action to more appropriate api by @poweiw in #4883
  • fix: add missing self. from PR #5346 by @achartier in #5653
  • [Bug] attention DP doesn't work with embedding TP by @PerkzZheng in #5642
  • fix: Add back allreduce_strategy parameter into TorchLlmArgs by @HuiGao-NV in #5637
  • perf: better heuristic for allreduce by @yilin-void in #5432
  • feat: fuse w4a8 moe pre-quant scale on Hopper by @xiaoweiw-nv in #5613
  • [chore] 2025-07-02 update github CI allowlist by @niukuo in #5661
  • doc: Add pd dynamic scaling readme by @Shunkangz in #5540
  • chore: enhance yaml loading arbitrary options in LlmArgs by @Superjomn in #5610
  • Feat/pytorch vswa kvcachemanager by @qixiang-99 in #5151
  • [TRTLLM-1316] refactor: Remove unnecessary pipeline parallelism logic from postProcessRequest by @Funatiq in #5489
  • [https://nvbugspro.nvidia.com/bug/5329655] [feat] Pytorch path add spec dec param to attention op by @jhaotingc in #5146
  • [Infra] - Set default timeout to 1hr and remove some specific settings by @EmmaQiaoCh in #5667
  • [TRTLLM-6143] feat: Improve dev container tagging by @ixlmar in #5551
  • feat:[AutoDeploy] E2E build example for llama4 VLM by @Fridah-nv in #3922
  • fix: Fix missing arg to alltoall_prepare_maybe_dispatch by @syuoni in https://...
Read more

v1.0.0rc1

03 Jul 05:38
de97799
Pre-release

Announcement Highlights:

  • Model Support
  • Features
    • Add support for YARN in NemotronNAS models (#4906)
    • Add support for per expert activation scaling factors (#5013)
    • Add ReDrafter support for Qwen (#4875)
    • Add NGrams V2 support (#4569)
    • Use inference mode in update_requests to improve perf of TRTLLM Sampler (#5538)
    • Expose bias and FP8_MXFP4 MOE CUTLASS backend features to pytorch (#5410)
    • Support nvfp4 model and fp8 kv cache for MLA chunked prefill (Blackwell)
    • Large-scale EP (part 8: Online EP load balancer integration for PCIe fp8) (#5226)
    • Prevent serialization of entire LoRA adapters in each request (#5080)
    • Remove cutlass min latency code from AutoTuner. (#5394)
    • Opensource MOE MXFP8-MXFP4 implementation (#5222)
    • Add chunked prefill support for MLA (Blackwell) (#4651)
    • Support disaggregated serving in TRTLLM Sampler (#5328)
    • Support multiCtasKvMode for high-throughput MLA kernels (#5426)
    • Add MTP support for Online EPLB (#5213)
    • Add debug hook to support dump tensor data and add new debug functions easily (#5182)
  • API
    • Add request_perf_metrics to LLMAPI (#5497)
    • Remove batch_manager::KvCacheConfig and use executor::KvCacheConfig instead (#5384)
  • Bug Fixes
    • Fix mPtrExpertCounts allocation in MoE TRT-LLM backend (nvfp4) (#5519)
    • Fix block scale fp8 support for deepseek v3 on Blackwell. (#5514)
    • Fix the issue MoE autotune fallback failed to query default heuristic (#5520)
    • Remove the seq_len of 4096 from FP8 block scale MoE tuning configs. (#5485)
    • Fix the unexpected keyword argument 'streaming' (#5436)
  • Benchmark
    • Update trtllm-bench to support new Pytorch default. (#5491)
    • Add support for TRTLLM CustomDataset (#5511)
    • Make benchmark_serving part of the library (#5428)
  • Performance
    • Improve XQA-MLA perf (#5468)
    • Optimize swizzle_sf, unswizzle_sf, reswizzle_sf (#5318)
  • Infrastructure
    • Allow configuring linking of NVRTC wrapper (#5189)
    • Add timeout setting for long tests found in post-merge (#5501)
  • Documentation
    • Fix benchmark cmd in disagg scripts (#5515)
  • Known Issues
    • multi-GPU model support on RTX Pro 6000

What's Changed

  • feature: make trtllmsampler new_tokens format the universal format by @netanel-haber in #4401
  • [fix] Add 1 and draft_token_num to seq_len when overlap scheduling is enabled during memory estimation by @HuiGao-NV in #5343
  • test: [CI] remove closed bugs by @xinhe-nv in #5400
  • refactor: manage cache indirection in decoder state by @Funatiq in #5315
  • tests: update benchmark test lists by @xinhe-nv in #5365
  • chore: delete mamba hybrid, since it is now called NemotronH by @vegaluisjose in #5409
  • [Infra] - Waive failed tests in post-merge and increase some timeout setting by @EmmaQiaoCh in #5424
  • Add debug hook to support dump tensor data and add new debug functions easily by @HuiGao-NV in #5182
  • Chore: remove unused variables by @QiJune in #5314
  • Fix test Pytorch model engine by @Tabrizian in #5416
  • Add MTP support for Online EPLB by @dongxuy04 in #5213
  • waive test_moe.py::test_moe_fp8[autotune] by @QiJune in #5455
  • fix: fix bug of qwen3 + eagle3 + finalize_moe_fusion by @byshiue in #5369
  • [AutoDeploy] Merge feat/ad_2025_06_13 feature branch by @lucaslie in #5454
  • feat: Dynamically remove servers in PD by @Shunkangz in #5270
  • tests: Set kv cache free memory fraction in test case by @HuiGao-NV in #5433
  • fix (NvBug 5354925): Fix static EPLB by @syuoni in #5411
  • test: Add LLGuidance test and refine guided decoding by @syuoni in #5348
  • CI: update multi gpu test triggering file list by @QiJune in #5466
  • start OAIServer with max_beam_width=1 for TorchSampler by @netanel-haber in #5427
  • chore: bump version to 1.0.0rc1 by @yiqingy0 in #5460
  • [https://jirasw.nvidia.com/browse/TRTLLM-4645] support mutliCtasKvMode for high-throughput MLA kernels by @PerkzZheng in #5426
  • CI: waive test_ad_build_small_multi by @QiJune in #5471
  • feat: Remove not used padding_idx in models by @HuiGao-NV in #5385
  • [nvbug/5354956] fix: unexpected keyword argument 'streaming' by @kaiyux in #5436
  • Move 3 disaggregated cases from 4 GPUs devices to 1 GPU device by @HuiGao-NV in #5457
  • Fix: fix nvbug 5356427 by @HuiGao-NV in #5464
  • feat: Make benchmark_serving part of the library by @kaiyux in #5428
  • [TRTLLM-5974][feat] Support disaggregated serving in TRTLLM Sampler by @dcampora in #5328
  • [chore] Disable block reuse when draft model speculation is being used by @mikeiovine in #5448
  • chore: split _build_model method for TorchLlm and TrtLlm by @QiJune in #5418
  • [fix][test] remove test in global scope by @omera-nv in #5470
  • [fix][ci] dont build wheel for cpp tests by @omera-nv in #5443
  • CI: reduce BF16 test cases in B200 by @QiJune in #5482
  • Add sleep function for disagg gen-only benchmarking by @qiaoxj07 in #5398
  • CI: enable test cases on single device type by @HuiGao-NV in #5484
  • [5356427] fix: Remove the seq_len of 4096 from FP8 block scale MoE tuning configs. by @hyukn in #5485
  • feat: chunked prefill for MLA (Blackwell) by @jmydurant in #4651
  • Add unit test for routing kernels by @ChristinaZ in #5405
  • [CI] Waive test_fp8_block_scales_4gpus[ep4-mtp_nextn=0-fp8kv=True-attention_dp=True-cuda_graph=True-overlap_scheduler=True-torch_compile=False] by @venkywonka in #5494
  • [Infra] - Add timeout setting for long tests found in post-merge by @EmmaQiaoCh in #5501
  • Revert "feature: unify new_tokens format sample state to trtllm samper new_tokens format (#4401)" by @netanel-haber in #5474
  • keep sm90 headsize 128 cubins by @qsang-nv in #5320
  • opensource: Opensource MOE MXFP8-MXFP4 implementation by @djns99 in #5222
  • [TRTLLM-6019] feat: Remove cutlass min latency code from AutoTuner. by @hyukn in #5394
  • [TRTLLM-5921][feat] Prevent serialization of entire LoRA adapters in each request by @amitz-nv in #5080
  • feat: large-scale EP(part 8: Online EP load balancer integration for PCIe fp8) by @dongxuy04 in #5226
  • [chore] Allow configuring linking of NVRTC wrapper by @AlessioNetti in #5189
  • perf: Optimize swizzle_sf, unswizzle_sf, reswizzle_sf by @bobboli in #5318
  • [fix][ci] trigger multigpu tests for deepseek changes by @omera-nv in #5423
  • tests: waive tests by @xinhe-nv in #5458
  • doc: Fix benchmark cmd in disagg scripts by @kaiyux in #5515
  • [perf] improve XQA-MLA perf by @lowsfer in #5468
  • feat: Add support for TRTLLM CustomDataset by @kaiyux in #5511
  • [feat] Add progress bar to benchmark by @arekay in #5173
  • Add trtllm-bench reviewers. by @FrankD412 in #5452
  • [CI] move flashinfer llama tests to post merge by @omera-nv in #5506
  • [fix][ci] move torch tests to run under torch stage by @omera-nv in #5473
  • refactor: remove batch_manager::KvCacheConfig and use executor::KvCacheConfig instead by @Funatiq in #5384
  • [TRTLLM-3602][feat] support nvfp4 model and fp8 kv cache for MLA chunked prefill (Blackwell) by @jmydurant in #5475
  • fix: MoE autotune fallback failed to query default heuristic by @rosenrodt in #5520
  • Update allow list 2025_06_26 by @yuanjingx87 in #5526
  • fix: Mapping rank boundary check bug by @venkywonka in #4935
  • Update trtllm-bench to support new Pytorch default. by @FrankD412 in https://gith...
Read more

v1.0.0rc0

25 Jun 10:23
ebadc13
Pre-release

Announcement Highlights:

  • Model Support
  • Features
    • Add EAGLE3 support for Qwen3 (#5206)
    • Add Piecewise cuda graph support for MLA (#4467)
    • Integrate TRT-LLM Gen FP8 block scale MoE with Pytorch workflow kernel autotuner (#5207)
    • Re-implement LlmResponse in Python to reduce host overhead of pybind (#5224)
    • Add no_kv_cache_reuse option and streaming support for trtllm serve bench (#4971)
    • Add LLGuidance Support for PyTorch Backend (#5214)
    • Fuse finalize and allreduce for the Qwen-MoE model (#5223)
    • Support stream_interval (#5284)
  • API
    • Add llm args to tune python gc threshold (#5141)
    • Introduce ResourceManagerType enum for resource management (#5246)
    • [BREAKING CHANGE] Make PyTorch LLM the default (#5312)
    • Remove TrtGptModelOptionalParams (#5165)
  • Bug Fixes
    • Fix trtllm-llmapi-launch multiple LLM instances (#4727)
    • Fix the deterministic issue in the MTP Eagle path (#5285)
    • Fix missing clientId when serializing and deserializing responses (#5231)
  • Benchmark
  • Performance
    • Optimize MoE supplementary kernels for large-scale EP (#5215)
    • Improve performance of XQA-MLA for sm120 (#5087)
  • Infrastructure
    • Update dependencies with NGC PyTorch 25.05 and TRT 10.11 (#4885)
    • Add Multi-node CI testing support via Slurm (#4771)
  • Documentation
    • Add document of benchmarking for Qwen3 (#5158)
    • Update contributing md for internal developers (#5250)
    • blog: Disaggregated Serving in TensorRT-LLM (#5353)
    • Update mtp documents (#5387)
  • Known Issues
    • multi-GPU model support on RTX Pro 6000

What's Changed

Read more

v0.20.0

19 Jun 04:19
7965842

TensorRT-LLM Release 0.20.0

Key Features and Enhancements

  • Model Support
    • Added Qwen3 support. Refer to the “Qwen3” section in examples/models/core/qwen/README.md.
    • Added HyperCLOVAX-SEED-Vision support in PyTorch flow. Refer to examples/models/contrib/hyperclovax/README.md
    • Added Dynasor-CoT in scaffolding examples. Refer to examples/scaffolding/contrib/Dynasor/README.md
    • Added Mistral Small 3.1 24B VLM support in TRT workflow
    • Added Gemma3-1b-it support in PyTorch workflow
    • Added Nemotron-H model support
    • Added Eagle-3 support for LLAMA4
  • PyTorch workflow
    • Added lora support
    • Added return logits support
    • Adopt new logprob definition in PyTorch flow
    • Enabled per-request stats with PyTorch backend
    • Enabled LogitsProcessor in PyTorch backend
  • Benchmark:
    • Added beam width support to the low latency benchmark.
    • Fixed trtllm-bench iter_stats and cuda_graph_batch_sizes errors.
    • Removed the deprecated Python runtime benchmark.
    • Added benchmark support for scaffolding.
  • Multimodal models
    • Added support in trtllm-serve
    • Added support in trtllm-bench; currently limited to image inputs only
  • Supported DeepSeek-R1 W4A8 on Hopper
  • Added RTX Pro 6000 support on a single GPU
  • Integrated Llama4 input processor
  • Added CGA reduction FHMA kernels on Blackwell
  • Enabled chunked context for FlashInfer
  • Supported KV cache reuse for MLA
  • Added Piecewise CUDA Graph support
  • Supported multiple LoRA adapters and TP
  • Added KV cache-aware router for disaggregated serving
  • Unfused attention for native support
  • Added group_rms_norm kernel to normalize multiple inputs in a single operator
  • Added smart router for the MoE module
  • Added head size 72 support for QKV preprocessing kernel
  • Added MNNVL MoE A2A support
  • Optimized Large Embedding Tables in Multimodal Models
  • Supported Top-K logprobs and prompt_logprobs in LLMAPI (see the sketch after this list)
  • Enabled overlap scheduler in TRT workflow via executor API
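
A minimal sketch of requesting Top-K logprobs through the Python LLM API, as mentioned in the list above. The exact field names on SamplingParams and on the result object are assumptions, and the model id is a placeholder.

    from tensorrt_llm import LLM, SamplingParams

    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model id
    params = SamplingParams(
        max_tokens=16,
        logprobs=2,         # assumed: top-2 logprobs per generated token
        prompt_logprobs=2,  # assumed: top-2 logprobs per prompt token
    )
    result = llm.generate(["The capital of France is"], params)[0]
    print(result.outputs[0].text)
    print(result.outputs[0].logprobs)  # per-token logprobs, if populated
    print(result.prompt_logprobs)      # prompt logprobs, if populated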

Infrastructure Changes

  • The TRT-LLM team now formally releases a Docker image on NGC.
  • The pre-built TensorRT-LLM wheel on PyPI is now linked against PyTorch 2.7.0, which uses the CXX11 ABI (see the version-check sketch after this list)
  • The dependent TensorRT version is updated to 10.10.0
  • The dependent CUDA version is updated to 12.9.0
  • The dependent public PyTorch version is updated to 2.7.0
  • The dependent NVIDIA ModelOpt version is updated to 0.29.0
  • The dependent NCCL version is maintained at 2.25.1
  • Open-sourced the XQA kernels
  • The dependent datasets version was upgraded to 3.1.0
  • Migrated the Triton backend into the TensorRT-LLM repo as a TensorRT-LLM submodule
  • Downgraded the GCC toolset version from 13 to 11
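
For convenience, the snippet below prints the locally installed versions so they can be compared against the dependency list above; it is a quick sketch, not an official verification tool.

    # Print installed versions to check against the Infrastructure Changes list.
    import torch
    import tensorrt_llm

    print("tensorrt_llm:", tensorrt_llm.__version__)  # expect 0.20.0
    print("torch       :", torch.__version__)         # expect 2.7.0 (CXX11 ABI)
    print("cuda (torch):", torch.version.cuda)        # expect 12.9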

API Changes

  • [Breaking Change]: Enable scheduling overlap by default (see the opt-out sketch after this list)
  • Remove deprecated GptSession/V1 from TRT workflow
  • Set _AutoDeployLlmArgs as primary config object
  • Allow overriding CLI arguments with YAML file in trtllm-serve
  • Introduced multimodal embedding field in LlmRequest
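
Because scheduling overlap is now on by default, users who need the previous behavior have to opt out explicitly. The sketch below assumes a disable_overlap_scheduler argument on the LLM constructor; confirm the exact knob name in the LlmArgs reference before using it.

    from tensorrt_llm import LLM

    # `disable_overlap_scheduler` is an assumed knob name for opting out of the
    # new scheduling-overlap default; the model id is a placeholder.
    llm = LLM(
        model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model id
        disable_overlap_scheduler=True,            # assumption, see note above
    )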

Fixed Issues

  • Fixed a hang when the context server does not have enough capacity for the KV cache (#3095)
  • Fixed C++ decoder synchronization in PyTorch (#3106)
  • Fixed a bug where a CUDA stream created as a default argument was initialized at import time (#3764)
  • Fixed an attention DP bug in the Qwen3 MoE model (#4141)
  • Fixed an illegal memory access when running LLaMA 4 with CUDA Graph enabled (#4101)
  • Reset planned states to avoid a memory leak in TrtllmAttentionWrapper (#4227)

Known Issues

  • multi-GPU model support on RTX Pro 6000

What's Changed

  • Refine doc by @juney-nvidia in #4420
  • Refine doc by @juney-nvidia in #4421
  • refine doc by @juney-nvidia in #4422
  • Remove vila test by @Tabrizian in #4376
  • [TRTLLM-4618][feat] Add Nemotron Super 49B FP8 test on RTX6000 Pro (SM120) by @farazkh80 in #4363
  • tests: add qa test mentioned in docs by @crazydemo in #4357
  • [Infra] - Always push the release images in the post-merge job by @chzblych in #4426
  • tests: Add test cases for rcca cases by @crazydemo in #4347
  • chore: cleanup perf_evaluator code by @Superjomn in #3833
  • feat: Add pp support for hybrid attn/mamba model by @yuxianq in #4358
  • fix: wrong argument name enable_overlap_scheduler by @kaiyux in #4433
  • Update "Roadmap" link under README.md to the issues with Roadmap label by @AdamzNV in #4425
  • fix potential issues in allreduce fusion kernel and ut by @yilin-void in #4226
  • [TRTLLM-4638] feat(scaffolding): update Reward Controller to PRM specific controller with step split by @dc3671 in #4337
  • feat: NIXL interface integration by @Shixiaowei02 in #3934
  • Downgrade the logger level for fallback tactic warning. by @hyukn in #4440
  • Test: Improve model re-use in C++ DGX tests for CI stability by @DomBrown in #4263
  • fix: temp disable the problem test by @Shixiaowei02 in #4445
  • Add llama4 disagg accuracy tests by @Tabrizian in #4336
  • [https://nvbugs/5123103][fix] Fix torch compile for DeepSeekV3 by @liji-nv in #3952
  • [Docs] - Reapply #4220 by @chzblych in #4434
  • [TRTLLM-4618][feat] Fix cutlass MoE GEMM fallback failure on FP8 + add e2e test for Mixtral 8x7B FP8 on RTX6000 Pro (SM120) by @farazkh80 in #4335
  • [Feat] add chunked-attention kernels on Hopper (for llama4) by @PerkzZheng in #4291
  • test(perf): Add some Llama-3_3-Nemotron-Super-49B-v1 integration-perf-tests (TRT flow, trtllm-bench) by @venkywonka in #4128
  • fix: [nvbugs/5287097] Align PP layer distribution between pytorch and TRT flow. by @yuxianq in #4399
  • feat: Low Precision Allreduce for PCIe based GPU by @kanghui0204 in #4344
  • test: [CI] Add failed cases into waives.txt by @xinhe-nv in #4429
  • [TRTLLM-4932] Add CLI accuracy tests for Llama-3.3-70B-Instruct and LLM API BF16 variant by @moraxu in #4362
  • test: update test filter in perf test yml file to select cases by gpu name and add cases for RTX 6000 pro by @ruodil in #4282
  • [AutoDeploy] HF factory improvements by @lucaslie in #4371
  • chore: bump version to 0.21.0rc0 by @ZhanruiSunCh in #4465
  • doc: [TRTLLM-325]Integrate the NGC image in Makefile automation and document by @MartinMarciniszyn in #4400
  • chore: bump version to 0.20.0 by @ZhanruiSunCh in #4469
  • fix: replace the image links in the blog by @Shixiaowei02 in #4490
  • fix: cleanup process tree for disaggregated test by @tongyuantongyu in #4116
  • Cherry pick #4508 by @QiJune in #4512
  • Cherry pick #4447 by @yuxianq in #4517
  • chore: Remove unused script by @kaiyux in #4485
  • chore: Deprecate autopp. by @yuxianq in #4471
  • fix: Fix trtllm sampler beam width bug by @dcampora in #4507
  • tests: update api change from decoder to sampler in test by @crazydemo in #4479
  • docs: Add KV Cache Management documentation by @Funatiq in #3908
  • test: add failed case in waive list and fix some test script issue for perf test by @ruodil in #4528
  • Add tritonrelease container by @Tabrizian in #4544
  • fix: [TRTLLM-325]WAR against security vulnerabilities in Python packages by @MartinMarciniszyn in #4539
  • [5141290][5273694][5260696] fix: Fix mrope argument missing issue in the summary tasks for Qwen model. by @hyukn in #4432
  • test: waive hanging cases for perf test by @ruodil in #4563
  • [nvbugs/5274894] fix: Moving finished context requests to generation by @Funatiq in #4576
  • [5234029][5226211] chore: Unwaive multimodal tests for Qwen model. by @hyukn in #4519
  • test(perf): Extend the Llama-Nemotron-Nano-8B perf-integration-tests (pyt) by @venkywonka in #4407
  • test: fix for perf sanity test and skip fp8 deepseek blackwell cases by @ruodil in #4598
  • [5180961] chore: Unwai...
Read more