Releases: NVIDIA/TensorRT-LLM

v1.1.0rc0

16 Aug 00:09
26f413a
Pre-release

Announcement Highlights:

  • Model Support

    • Add model gpt-oss (#6645)
    • Support Aggregate mode for phi4-mm (#6184)
    • Add support for Eclairv2 model - cherry-pick changes and minor fix (#6493)
    • Support running heterogeneous model execution for Nemotron-H (#6866)
    • Add whisper support (Bert Attention on SM100 and GPTAttention for cross attention on SM100) (#5527)
  • API

    • [BREAKING CHANGE] Enable TRTLLM sampler by default (#6216)
  • Benchmark

    • Improve Llama4 performance for small max_seqlen cases (#6306)
    • Multimodal benchmark_serving support (#6622)
    • Add perf-sweep scripts (#6738)
  • Feature

    • Support reloading LoRA adapters evicted from the CPU cache (#6510)
    • Add FP8 context MLA support for SM120 (#6059)
    • Enable guided decoding with speculative decoding (part 1: two-model engine) (#6300)
    • Include attention dp rank info with KV cache events (#6563)
    • Clean up ngram auto mode, add max_concurrency to configs (#6676)
    • Add NCCL Symmetric Integration for All Reduce (#4500)
    • Remove input_sf swizzle for module WideEPMoE (#6231)
    • Enable guided decoding with disagg serving (#6704)
    • Make fused_moe_cute_dsl work on blackwell (#6616)
    • Move kv cache measure into transfer session (#6633)
    • Optimize CUDA graph memory usage for spec decode cases (#6718)
    • Core Metrics Implementation (#5785)
    • Resolve KV cache divergence issue (#6628)
    • AutoDeploy: Optimize prepare_inputs (#6634)
    • Enable FP32 mamba ssm cache (#6574)
    • Support SharedTensor on MultimodalParams (#6254)
    • Improve dataloading for benchmark_dataset by using batch processing (#6548)
    • Store the block of context request into kv cache (#6683)
    • Add standardized GitHub issue templates and disable blank issues (#6494)
    • Improve the performance of online EPLB on Hopper by better overlapping (#6624)
    • Enable guided decoding with CUDA graph padding and draft model chunked prefill (#6774)
    • CUTLASS MoE FC2+Finalize fusion (#3294)
    • Add GPT OSS support for AutoDeploy (#6641)
    • Add LayerNorm module (#6625)
    • Support custom repo_dir for SLURM script (#6546)
    • DeepEP LL combine FP4 (#6822)
    • AutoTuner tuning config refactor and valid tactic generalization (#6545)
    • Hopper W4A8 MoE supports ModelOpt ckpt for PyT backend (#6200)
    • Add support for Hopper MLA chunked prefill (#6655)
    • Helix: extend mapping to support different CP types (#6816)
  • Documentation

    • Remove the outdated features that were marked as Experimental (#5995)
    • Add LoRA feature usage doc (#6603)
    • Add deployment guide section for VDR task (#6669)
    • Add doc for multimodal feature support matrix (#6619)
    • Move AutoDeploy README.md to torch docs (#6528)
    • Add checkpoint refactor docs (#6592)
    • Add K2 tool calling examples (#6667)
    • Add the workaround doc for H200 OOM (#6853)
    • Update moe support matrix for DS R1 (#6883)
    • BREAKING CHANGE: Mismatch between docs and actual commands (#6323)

What's Changed

  • Qwen3: Fix eagle hidden states by @IzzyPutterman in #6199
  • [None][fix] Upgrade dependencies version to avoid security vulnerability by @yibinl-nvidia in #6506
  • [None][chore] update readme for perf release test by @ruodil in #6664
  • [None][test] remove trt backend cases in release perf test and move NIM cases to llm_perf_nim.yml by @ruodil in #6662
  • [None][fix] Explicitly add tiktoken as required by kimi k2 by @pengbowang-nv in #6663
  • [None][doc]: remove the outdated features which marked as Experimental by @nv-guomingz in #5995
  • [https://nvbugs/5375966][chore] Unwaive test_disaggregated_deepseek_v3_lite_fp8_attention_dp_one by @yweng0828 in #6658
  • [TRTLLM-6892][infra] Run guardwords scan first in Release Check stage by @yiqingy0 in #6659
  • [None][chore] optimize kv cache transfer for context TEP and gen DEP by @chuangz0 in #6657
  • [None][chore] Bump version to 1.1.0rc0 by @yiqingy0 in #6651
  • [TRTLLM-6683][feat] Support LoRA reload CPU cache evicted adapter by @amitz-nv in #6510
  • [None][test] correct test-db context for perf yaml file by @ruodil in #6686
  • [None] [feat] Add model gpt-oss by @hlu1 in #6645
  • [https://nvbugs/5409414][fix] fix Not registered specs by @xinhe-nv in #6660
  • [None][feat] : Add FP8 context MLA support for SM120 by @peaceh-nv in #6059
  • [TRTLLM-6092][doc] Add LoRA feature usage doc by @shaharmor98 in #6603
  • [TRTLLM-6409][feat] Enable guided decoding with speculative decoding (part 1: two-model engine) by @syuoni in #6300
  • [TRTLLM-6881][feat] Include attention dp rank info with KV cache events by @pcastonguay in #6563
  • [None][infra] Fix guardwords by @EmmaQiaoCh in #6711
  • [None][package] Pin cuda-python version to >=12,<13 by @yiqingy0 in #6702
  • [None][doc] Add deployment guide section to the official doc website by @nv-guomingz in #6669
  • [None][fix] disagg ctx pp4 + gen pp4 integ test by @raayandhar in #6489
  • [None][feat] Clean up ngram auto mode, add max_concurrency to configs by @mikeiovine in #6676
  • [None][chore] Remove py_executor from disagg gh team by @pcastonguay in #6716
  • [https://nvbugs/5423962][fix] Address broken links by @chenopis in #6531
  • [None][fix] Migrate to new cuda binding package name by @tongyuantongyu in #6700
  • [https://nvbugs/5410687][fix] Hopper w4a8 groupwise MoE interleave by @symphonylyh in #6708
  • [None][feat] Add NCCL Symmetric Integration for All Reduce by @Tabrizian in #4500
  • [TRTLLM-6785][feat] BREAKING CHANGE Enable TRTLLM sampler by default by @dcampora in #6216
  • [TRTQA-2920][fix] Add failed cases into waives.txt by @xinhe-nv in #6719
  • [TRTLLM-5252][test] add for mistral_small_3.1_24b perf test by @ruodil in #6685
  • [TRTLLM-6744][feat] Remove input_sf swizzle for module WideEPMoE by @StudyingShao in #6231
  • [None][fix] Fix unnecessary GPU synchronization in torch sampler caused by incorrect tensor reference by @zhanghaotong in #6626
  • [TRTLLM-6854][feat] Enable guided decoding with disagg serving by @syuoni in #6704
  • [TRTLLM-5252][fix] Propagate mapping to intermediate layers by @2ez4bz in #6611
  • [None][test] fix yml condition error under qa folder by @ruodil in #6734
  • [None][doc] Add doc for multimodal feature support matrix by @chang-l in #6619
  • [TRTLLM-6898][feat] make fused_moe_cute_dsl work on blackwell by @limin2021 in #6616
  • [https://nvbugs/5436461][infra] Adjust free_gpu_memory_fraction of test_eagle3 to prevent OOM on CI by @leslie-fang25 in #6631
  • [None][refactor] Combine resmooth_to_fp8_e8m0 and transform_sf_into_required_layout by @yuxianq in #6654
  • [https://nvbugs/5437106][fix] Fix llama4 scout TRTLLM attn_backend by @JunyiXu-nv in #6690
  • [None][fix] Remove lock related typo in py_executor by @lancelly in #6653
  • [None][feat] move kv cache measure into transfer session by @zhengd-nv in #6633
  • [None][fix]revert kvcache transfer by @chuangz0 in #6709
  • [TRTLLM-6650][fix] Enhance CUDA graph + Beam search to correctly handle padding by @stnie in #6665
  • [TRTLLM-6308][feat] Support Aggregate mode for phi4-mm by @Wanli-Jiang in #6184
  • [None][feat] Optimize CUDA graph memory usage for spec decode cases by @mikeiovine in #6718
  • [TRTLLM-7025] [infra] Reorganize CODEOWNERS to rectify examples mapping by @venkywonka in #6762
  • [None][doc] Move AutoDeploy README.md to torch docs by @Fridah-nv in #6528
  • [None][fix] WAR GPT OSS on H20 with Triton MOE by @dongfengy in #6721
  • [TRTLLM-6420][feat] add support for Eclairv2 model - cherry-pick changes and minor fix by @yibinl-nvidia in #6493
  • [None][feat] Core Metrics Implementation by @hcyezhang in #5785
  • [https://nvbugs/5398180][feat] Improve Llama4 performance for small max_seqlen cases by @nv-yilinf in #6306
  • [TRTLLM-6637][feat]...
Read more

v1.0.0rc6

07 Aug 10:54
a16ba64
Pre-release

Announcement Highlights:

  • Model Support

  • Feature

    • Add LoRA support for Gemma3 (#6371)
    • Add support for scheduling attention DP requests (#6246)
    • Multi-block mode for Hopper spec dec XQA kernel (#4416)
    • LLM sleep & wakeup Part 1: virtual device memory (#5034)
    • best_of/n for pytorch workflow (#5997)
    • Add speculative metrics for trt llm bench (#6476)
    • (Breaking Change) Hopper SWA non-cyclic kernels + KV reuse + Spec Dec (#6379)
    • Allow sending more than 2GiB through MPI by using mpi4py.util.pkl5 (#6522)
    • check input tokens + improve error handling (#5170)
    • Add support for fused gate_up_proj scales for FP8 blockwise (#6496)
    • Add vLLM KV Pool support for XQA kernel (#6013)
    • Switch to internal version of MMProjector in Gemma3 (#6572)
    • Enable fp8 SwiGLU to minimize host overhead (#6540)
    • Add Qwen3 MoE support to TensorRT backend (#6470)
    • Establish UCX connections via ZMQ (#6090)
    • Disable add special tokens for Llama3.3 70B (#6482)
  • API

  • Benchmark

    • ADP schedule balance optimization (#6061)
    • allreduce benchmark for torch (#6271)
  • Documentation

    • Make example SLURM scripts more parameterized (#6511)
    • blog: Scaling Expert Parallelism in TensorRT-LLM (Part 2: Performance Status and Optimization) (#6547)
    • Exposing the latest tech blogs in README.md (#6553)
    • update known issues (#6247)
    • trtllm-serve doc improvement. (#5220)
    • Adding GPT-OSS Deployment Guide documentation (#6637)
    • Exposing the GPT OSS model support blog (#6647)
    • Add llama4 hybrid guide (#6640)
    • Add DeepSeek R1 deployment guide. (#6579)
    • Create deployment guide for Llama4 Scout FP8 and NVFP4 (#6550)
  • Known Issues

    • On bare-metal Ubuntu 22.04 or 24.04, please install the cuda-python==12.9.1 package after installing the TensorRT-LLM wheel. This resolves an incompatibility with the default cuda-python 13, which otherwise fails with ImportError: cannot import name 'cuda' from 'cuda'.
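
A minimal sketch of the failing import described in the last known issue, usable as a guard in a launch script (assumption: the guard and its error message are illustrative additions, not part of TensorRT-LLM):

```python
# Sketch of the cuda-python incompatibility described in the known issue above.
# Assumption: the guard and its message are illustrative, not TensorRT-LLM code.
try:
    # cuda-python 12.x exposes the driver bindings as `cuda.cuda`;
    # with cuda-python 13.x this import fails with:
    #   ImportError: cannot import name 'cuda' from 'cuda'
    from cuda import cuda  # noqa: F401
except ImportError as exc:
    raise SystemExit(
        "Incompatible cuda-python detected. Install cuda-python==12.9.1 "
        "after installing the TensorRT-LLM wheel."
    ) from exc
```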

What's Changed

  • [fix] Fix missing fields in xqa kernel cache key by @lowsfer in #6282
  • [TRTLLM-6364][infra] Validate for PR titles to ensure they follow the required format by @niukuo in #6278
  • [fix] Update get_trtllm_bench_build_command to handle batch size and tokens by @venkywonka in #6313
  • refactor: Remove unused buffers and bindings from sampler by @Funatiq in #6484
  • chore: Make example SLURM scripts more parameterized by @kaiyux in #6511
  • fix: Fix missing key by @zerollzeng in #6471
  • [https://nvbugs/5419066][fix] Use trt flow LLM by @crazydemo in #6467
  • [TRTLLM-4279] fix: Add a protection test for checking trtllm custom ops by @yali-arch in #6515
  • [https://nvbugs/5419069][fix] Fix the mismatched layer name components. by @hyukn in #6417
  • [None][doc] blog: Scaling Expert Parallelism in TensorRT-LLM (Part 2: Performance Status and Optimization) by @kaiyux in #6547
  • [None][chore] Disable add special tokens for Llama3.3 70B by @chenfeiz0326 in #6482
  • [None][doc] Exposing the latest tech blogs in README.md by @juney-nvidia in #6553
  • [None][fix] update nemotron nas tests free_gpu_memory_fraction=0.8 by @xinhe-nv in #6552
  • [None][infra] Pin the version for triton to 3.3.1 (#6508) (#6519) by @chzblych in #6549
  • [https://nvbugs/5340941][https://nvbugs/5375785] - fix: Wrap attentio… by @liji-nv in #6355
  • [TRTLLM-6657][feat] Add LoRA support for Gemma3 by @brb-nv in #6371
  • [https://nvbugs/5381276][fix] fix warning for fused_a_gemm by @yunruis in #6402
  • [None][Infra] - Skip failed tests in post-merge by @EmmaQiaoCh in #6558
  • [AutoDeploy] merge feat/ad-2025-07-22 by @lucaslie in #6520
  • [TRTLLM-6624][feat] skip post blackwell by @xinhe-nv in #6357
  • [TRTLLM-6357][test] Add accuracy tests for Qwen3 by @reasonsolo in #6177
  • [None][fix] Serialize the window_size in the kv event by @richardhuo-nv in #6526
  • [None][feat] Add support of scheduling attention dp request by @Shunkangz in #6246
  • [None][refactor] Simplify finish reasons handling in DecoderState by @Funatiq in #6524
  • [None][infra] add eagle3 one model accuracy tests by @jhaotingc in #6264
  • [TRTLLM-6224][infra] Upgrade dependencies to DLFW 25.06 and CUDA 12.9.1 by @yiqingy0 in #5678
  • use cudaSetDevice to create context, fix nvbug 5394497 by @chuangz0 in #6403
  • [None][feat] Multi-block mode for Hopper spec dec XQA kernel by @jhaotingc in #4416
  • [TRTLLM-6473][test] add speculative decoding and ep load balance cases into QA test list by @crazydemo in #6436
  • [fix] Fix DeepSeek w4a8 weight loading by @jinyangyuan-nvidia in #6498
  • chore: add EXAONE4 accuracy test by @yechank-nvidia in #6397
  • test: modify max_lora_rank of phi4_multimodal to 320 by @ruodil in #6474
  • [None][chore] Mass integration of release/0.21 (part5) by @dc3671 in #6544
  • [None][infra] update namelist by @niukuo in #6465
  • [https://nvbugs/5430932][infra] update namelist by @niukuo in #6585
  • [None][chore] add online help to build_wheel.py and fix a doc link by @zhenhuaw-me in #6391
  • test: move ministral_8b_fp8 to fp8_specific gpu list(exclude Ampere) by @ruodil in #6533
  • [TRTLLM-5563][infra] Move test_rerun.py to script folder by @yiqingy0 in #6571
  • [None][infra] Enable accuracy test for eagle3 and chunked prefill by @leslie-fang25 in #6386
  • [None][infra] Enable test of chunked prefill with logit post processor by @leslie-fang25 in #6483
  • [TRTLLM-4406][feat] LLM sleep & wakeup Part 1: virtual device memory by @tongyuantongyu in #5034
  • [None][fix] remove closed bugs by @xinhe-nv in #6576
  • [None][fix] xqa precision for fp16/bf16 kv cache by @Bruce-Lee-LY in #6573
  • [None][fix] Revert commit 48ddc3d & add test for disagg server with different max_num_tokens by @LinPoly in #6259
  • [None][chore] Bump version to 1.0.0rc6 by @yiqingy0 in #6597
  • [None][chore] Add unit test for Gemma3 lora by @brb-nv in #6560
  • [TRTLLM-6364] [fix] Update PR title regex to allow optional spaces between ticket and type by @niukuo in #6598
  • [None][infra] Waive failed case in post-merge on main by @EmmaQiaoCh in #6602
  • [None][test] update invalid test name by @crazydemo in #6596
  • [TRTLLM-5271][feat] best_of/n for pytorch workflow by @evezhier in #5997
  • [None][chore] Update Gemma3 closeness check to mitigate flakiness by @brb-nv in #6591
  • [TRTLLM-6685][feat] Add speculative metrics for trt llm bench by @kris1025 in #6476
  • [None][doc] Fix blog4 typo by @syuoni in #6612
  • [TRTQA-2920][fix] Add failed cases into waives.txt by @xinhe-nv in #6581
  • [TRTLLM-6856][feat] add disaggregated serving tests to QA list by @xinhe-nv in #6536
  • [https://nvbugs/5433581][infra] Update install docs and CI script for SBSA deep_gemm workaround by @chzblych in #6607
  • [TRTLLM-5990][doc] trtllm-serve doc improvement. by @nv-guomingz in #5220
  • [None][chore] Add readme for perf test by @ruodil in #6443
  • [https://nvbugs/5436461][infra] Skip test_eagle3 test with device memory check by @leslie-fang25 in #6617
  • [None][chore] ucx establish connection with zmq by @chuangz0 in #6090
  • [TRTLLM-6674][feat] (Breaking Change) Hopper SWA non-cyclic kernels + KV reuse + Spec Dec by @symphonylyh in #6379
  • [None][fix] Remove expand configuration from mamba2 mixer by @danielafrimi in #6521
  • [TRTLLM-6826][feat] Allow sending more than 2GiB through MPI by using mpi4py.util.pkl5 by @amitz-nv in h...
Read more

v1.0.0rc5

04 Aug 09:45
fbee279
Pre-release

Announcement Highlights:

  • Model Support
  • Feature
    • Deepseek R1 FP8 Support on Blackwell (#6486)
    • Auto-enable ngram with concurrency <= 32. (#6232)
    • Support turning on/off spec decoding dynamically (#6363)
    • Improve LoRA cache memory control (#6220)
    • Support structural tag in C++ runtime and upgrade xgrammar to 0.1.21 (#6408)
    • Update long rope for phi3.5/phi4-mini/phi4-mm (#6353)
    • Add support for external multimodal embeddings (#6263)
    • Add support for disaggregation with pp with pytorch backend (#6369)
    • Add _prepare_and_schedule_batch function in PyExecutor (#6365)
    • Add status tags to LLM API reference (#5707)
    • Remove cudaStreamSynchronize when using relaxed acceptance (#5262)
    • Support JSON Schema in OpenAI-Compatible API (#6321)
    • Support chunked prefill on spec decode 2 model (#6104)
    • Enhance beam search support with CUDA graph integration (#6217)
    • Enable Overlap scheduler + Beam Search in TRTLLM Sampler (#6223)
    • Add KV cache reuse support for multimodal models (#5444)
    • Multistream initial support for torch compile flow (#5847)
    • Support nanobind bindings (#6185)
    • Support Weight-Only-Quantization in PyTorch Workflow (#5850)
    • Support pytorch LoRA adapter eviction (#5616)
  • API
    • [BREAKING CHANGE] Change default backend to PyTorch in trtllm-serve (#5717)
  • Bug Fixes
    • fix: remove duplicate layer multiplication in KV cache size calculation (#6481)
    • Fix illegal memory access in MLA (#6437)
    • Fix nemotronNAS loading for TP>1 (#6447)
    • Switch placement of image placeholder for mistral 3.1 (#6435)
    • Fix wide EP when using DeepEP with online EPLB (#6429)
    • Move kv_cache_free_gpu_mem_fraction arg to benchmark command in tests (#6463)
    • Fix bugs caused by None attention_bias during Qwen3 model convert engine (#6344)
    • Add a runtime flag to enable fail fast when attn window is too large to fit at least one sequence in KV cache (#5974)
    • Fix PD + MTP + overlap scheduler accuracy issue (#6136)
    • Fix bug of Qwen3 when using fp4 on sm120 (#6065)
  • Benchmark
    • Fixes to parameter usage and low latency configuration. (#6343)
    • Add Acceptance Rate calculation to benchmark_serving (#6240)
  • Performance
    • Enable AllReduce-associated fusion patterns in Llama3/4. (#6205)
    • Optimize Mtp performance (#5689)
    • Customize cublasLt algo for Llama 3.3 70B TP4 (#6315)
    • Add non UB AR + Residual + Norm + Quant fusion (#6320)
  • Infrastructure
    • Remove auto_assign_reviewers option from .coderabbit.yaml (#6490)
    • Use build stage wheels to speed up docker release image build (#4939)
  • Documentation
    • Add README for wide EP (#6356)
    • Update Llama4 deployment guide: update config & note concurrency (#6222)
    • Add Deprecation Policy section (#5784)
  • Known Issues
    • If you encounter the error OSError: CUDA_HOME environment variable is not set, set the CUDA_HOME environment variable (a short sketch follows this list)
    • The aarch64 Docker image and wheel package for 1.0.0rc5 are broken. This will be fixed in the upcoming weekly release.
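
A minimal sketch for the CUDA_HOME workaround mentioned above (assumptions: CUDA is installed under /usr/local/cuda, and setting the variable before importing tensorrt_llm is sufficient; exporting CUDA_HOME in the shell before launching works as well):

```python
# Sketch of the CUDA_HOME workaround from the known issues above.
# Assumption: /usr/local/cuda is where the CUDA toolkit lives on this machine.
import os

os.environ.setdefault("CUDA_HOME", "/usr/local/cuda")

import tensorrt_llm  # noqa: E402  (imported only after CUDA_HOME is set)

print(tensorrt_llm.__version__)
```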

What's Changed

Read more

v0.21.0

04 Aug 14:23
751d5f1

TensorRT-LLM Release 0.21.0

Key Features and Enhancements

  • Model Support
    • Added Gemma3 VLM support
  • Features
    • Added large-scale EP support
    • Integrated NIXL into the communication layer of the disaggregated service
    • Added fabric Memory support for KV Cache Transfer
    • Added MCP in ScaffoldingLLM
    • Added support for w4a8_mxfp4_fp8 quantization
    • Added support for fp8 rowwise quantization
    • Added generation logits support in TRTLLM Sampler
    • Added log probs support in TRTLLM Sampler
    • Optimized TRTLLM Sampler perf single beam single step
    • Enabled Disaggregated serving for Qwen-3
    • Added EAGLE3 support for Qwen-3
    • Fused finalize and allreduce for Qwen-MoE model
    • Refactored Fused MoE module
    • Added support for chunked attention on Blackwell and Hopper
    • Introduced sliding-window attention kernels for the generation phase on Blackwell
    • Updated DeepSeek FP8 TRT-LLM Gen cubins to improve performance in large batch size scenarios
    • Added FP8 block-scale GEMM support on SM89
    • Enabled overlap scheduler between draft forwards
    • Added Piecewise cuda graph support for MLA
    • Added model-agnostic one-engine eagle3
    • Enabled Finalize + Allreduce + add + rmsnorm fusion
    • Integrated TRT-LLM Gen FP8 block scale MoE with Pytorch workflow kernel autotuner
    • Added support for Eagle3 + disaggregated serving in two model speculative decoding flow
    • Validated Llama 3.1 models on H200 NVL
  • Benchmark:
    • Added all_reduce.py benchmark script for testing
    • Added beam width to trtllm-bench latency command
    • Fixed trtllm-bench iter_stats and cuda_graph_batch_sizes errors
    • Enabled trtllm-bench to run LoRA and add basic e2e perf testing capability for LoRA
    • Supported post_proc for bench
    • Added no_kv_cache_reuse option and streaming support for trtllm serve bench

Infrastructure Changes

  • The base Docker image for TensorRT-LLM is updated to nvcr.io/nvidia/pytorch:25.05-py3.
  • The base Docker image for TensorRT-LLM Backend is updated to nvcr.io/nvidia/tritonserver:25.05-py3.
  • The dependent public PyTorch version is updated to 2.7.1.
  • The dependent TensorRT version is updated to 10.11.
  • The dependent NVIDIA ModelOpt version is updated to 0.31.
  • The dependent NCCL version is updated to 2.27.5.

API Changes

  • Set _AutoDeployLlmArgs as primary config object
  • Removed decoder request from decoder interface
  • Enhanced the torch_compile_config in llm args
  • Removed the redundant use_kv_cache field from PytorchConfig
  • Moved allreduce_strategy from committed api to reference

Fixed Issues

  • Fixed disaggregated service hang when MNNVL two-shot AllReduce is enabled (#4678)
  • Fixed EP load balancer with MTP layer and route offset by EP rank (#4767)
  • Fixed cuda graph padding for spec decoding (#4853)
  • Fixed llama 4 long context issue (#4809)
  • Fixed max_num_sequences calculation with overlap scheduling (#4532)
  • Fixed chunked prefill + overlap scheduling (#5761)
  • Fixed trtllm-bench hang issue due to LLM API IPC (#4798)
  • Fixed index out of bounds error in spec decoding (#5954)
  • Fixed MTP illegal memory access in cuda graph warmup (#5947)
  • Fixed no free slots error with spec decode + disagg (#5975)
  • Fixed one-off attention window size for Gemma3 1B (#5564)

Known Issues

  • accuracy/test_cli_flow::TestGpt2::test_beam_search_large is broken.
  • Enabling disaggregated serving, MTP, and the overlap scheduler at the same time can lead to accuracy problems.
  • In 0.21, full chunked attention support has been added to ensure that the LLaMA4 model can run functionally with > 8K sequence length. There is a known performance regression on Hopper (affecting only the LLaMA4 model) due to this functional enhancement; the root cause has been identified and the fix will be part of a future release.

What's Changed

Read more

v1.0.0rc4

22 Jul 08:24
69e9f6d
Pre-release

Announcement Highlights:

  • Model Support
    • Add phi-4-multimodal model support (#5644)
    • Add EXAONE 4.0 model support (#5696)
  • Feature
    • Add support for two-model engine KV cache reuse (#6133)
    • Unify name of NGram speculative decoding (#5937)
    • Add retry knobs and handling in disaggregated serving (#5808)
    • Add Eagle-3 support for qwen3 dense model (#5879)
    • Remove padding of FusedMoE in attention DP (#6064)
    • Enhanced handling of decoder requests and logits within the batch manager (#6055)
    • Add support for Modelopt fp8_pb_wo quantization scheme (#6106)
    • Update deepep dispatch API (#6037)
    • Add support for benchmarking individual gemms in MOE benchmark (#6080)
    • Simplify token availability calculation for VSWA (#6134)
    • Migrate EAGLE3 and draft/target speculation to Drafter (#6007)
    • Enable guided decoding with overlap scheduler (#6000)
    • Use cacheTransceiverConfig as knobs for disagg service (#5234)
    • Add vectorized loading for finalize kernel in MoE Trtllm backend (#5919)
    • Enhance ModelConfig for kv cache size calculations (#5868)
    • Clean up drafter/resource manager creation logic (#5805)
    • Add core infrastructure to enable loading of custom checkpoint formats (#5372)
    • Cleanup disable_fp4_allgather (#6006)
    • Use session abstraction in data transceiver and cache formatter (#5611)
    • Add support for Triton request cancellation (#5898)
    • Support TRTLLM_DEEP_EP_TOKEN_LIMIT to allow running DeepEP on memory-constrained GPUs (#5684)
    • Remove enforced sorted order of batch slots (#3502)
    • Use huge page mapping for host accessible memory on GB200 (#5963)
  • API
    • [BREAKING CHANGE] Unify KvCacheConfig in LLM class for pytorch backend (#5752)
    • [BREAKING CHANGE] Rename cuda_graph_config padding_enabled field (#6003) (a hedged migration sketch covering both API changes follows this list)
  • Bug Fixes
    • Skip prompt length checking for generation only requests (#6146)
    • Avoid memory calls during broadcast for single GPU (#6010)
    • Record kv-cache size in MLACacheFormatter (#6181)
    • Always use py_seq_slot in runtime (#6147)
    • Update beam search workspace estimation to new upper bound (#5926)
    • Update disaggregation handling in sampler (#5762)
    • Fix TMA error with GEMM+AR on TP=2 (#6075)
    • Fix scaffolding aime test in test_e2e (#6140)
    • Fix KV Cache overrides in trtllm-bench (#6103)
    • Remove duplicated KVCache transmission check (#6022)
    • Release slots with spec decode + disagg (#5975) (#6032)
    • Add propagation of trust_remote_code to AutoConfig (#6001)
    • Move NCCL group in all-gather and reduce-scatter OPs outside the outer loop (#6053)
    • Pad DeepEP fp4 recv tensors if empty (#6048)
    • Adjust window sizes of VSWA at torch backend (#5880)
    • Fix MOE benchmark to rotate buffers to prevent L2 cache reuse (#4135)
    • Fix eagle3 two model disaggregated serving test (#6014)
    • Update torch.compile option to fix triton store_cubin error (#5865)
    • Fix chunked prefill + overlap scheduling (#5761)
    • Fix mgmn postprocess error (#5835)
    • Fallback to cubins for fp8 fmha kernels on Ada (#5779)
    • Enhance _check_arguments to filter illegal requests for pytorch backend (#5541)
    • Rewrite completion API to avoid repetitive tokens (#5201)
    • Fix disagg + speculative decoding (#5558)
  • Benchmark
    • Add latency support for trtllm bench (#3730)
  • Performance
    • Optimize TRTLLM Sampler perf single beam single step (#5550)
    • Performance Optimization for MNNVL TwoShot Kernel (#5925)
    • Enable 128x256 tile shapes for FP4 MOE CUTLASS backend (#5986)
    • Enable cuda graph by default (#5480)
  • Infrastructure
    • Add script to map tests <-> jenkins stages & vice-versa (#5431)
    • Speedup beam search unit tests with fixtures for LLM (#5843)
    • Fix single-GPU stage failed will not raise error (#6165)
    • Update bot help messages (#5277)
    • Update jenkins container images (#6094)
    • Set up the initial config for CodeRabbit (#6128)
    • Upgrade NIXL to 0.3.1 (#5991)
    • Upgrade modelopt to 0.33 (#6058)
    • Support show all stage name list when stage name check failed (#5946)
    • Run docs build only if PR contains only doc changes (#5184)
  • Documentation
    • Update broken link of PyTorchModelEngine in arch_overview (#6171)
    • Add initial documentation for trtllm-bench CLI. (#5734)
    • Add documentation for eagle3+disagg+dynamo (#6072)
    • Update llama-3.3-70B guide (#6028)
  • Known Issues
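
For the two breaking API changes above, a hedged migration sketch (assumptions: KvCacheConfig and CudaGraphConfig are importable from tensorrt_llm.llmapi in this release, and the renamed padding field is spelled enable_padding; verify both against the installed version's LLM API reference, and treat the model path as a placeholder):

```python
# Hedged migration sketch for the unified KvCacheConfig and the renamed
# cuda_graph_config padding field. The names below are assumptions, not
# verified against 1.0.0rc4; the model path is only a placeholder.
from tensorrt_llm.llmapi import LLM, CudaGraphConfig, KvCacheConfig

llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    # KV cache knobs for the PyTorch backend now flow through one KvCacheConfig.
    kv_cache_config=KvCacheConfig(free_gpu_memory_fraction=0.85),
    # `padding_enabled` was renamed; `enable_padding` is the assumed new name.
    cuda_graph_config=CudaGraphConfig(enable_padding=True),
)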

What's Changed

  • [TRTLLM-6164][TRTLLM-6165] chore: add runtime example for pytorch by @Superjomn in #5956
  • fix: Fix MoE benchmark by @syuoni in #5966
  • [TRTLLM-6160] chore: add sampling examples for pytorch by @Superjomn in #5951
  • Use huge page mapping for host accessible memory on GB200 by @dongxuy04 in #5963
  • Breaking change: perf: [TRTLLM-4662] Enable cuda graph by default by @dominicshanshan in #5480
  • fix: set allreduce strategy to model config by @WeiHaocheng in #5955
  • chore: Mass integration of release/0.21 (part 3) by @dc3671 in #5909
  • infra: [TRTLLM-6242] install cuda-toolkit to fix sanity check by @ZhanruiSunCh in #5709
  • Waive L0 test by @yiqingy0 in #6002
  • [Nvbug/5383670] fix: switch test case to non-fp4 ckpt for more GPU coverage by @kaiyux in #5882
  • fix #4974: A thread leak issue in scaffolding unittest by @ccs96307 in #5020
  • feat: EXAONE4.0 support by @yechank-nvidia in #5696
  • [TRTLLM-5653][infra] Run docs build only if PR contains only doc changes by @zhanga5 in #5184
  • feat: Update Gemma3 Vision Encoder by @brb-nv in #5973
  • enh: Bidirectional mask with multiple images for Gemma3 by @brb-nv in #5976
  • refactor: Remove enforced sorted order of batch slots by @Funatiq in #3502
  • [fix] fix eagle3 two model disaggregated serving test by @Tabrizian in #6014
  • perf: Enable 128x256 tile shapes for FP4 MOE CUTLASS backend by @djns99 in #5986
  • [nvbugs-5318143] fix: restrict PyTorch memory usage to avoid OOMs by @ixlmar in #5964
  • doc: update EXAONE 4.0 news by @yechank-nvidia in #6034
  • [Model load] Fix llama min-latency model load by @arekay in #5883
  • fix: Fix MOE benchmark to rotate buffers to prevent L2 cache reuse by @djns99 in #4135
  • Doc: Update llama-3.3-70B guide by @jiahanc in #6028
  • infra: [TRTLLM-6331] Support show all stage name list when stage name check failed by @ZhanruiSunCh in #5946
  • [Infra][TRTLLM-6013] - Fix stage name in single stage test rerun report by @yiqingy0 in #5672
  • [Fix] check for ImportError or ModuleNotFoundError for deep_ep_utils by @lucaslie in #6026
  • infra: [TRTLLM-6313] Fix the package sanity stage 'Host Node Name' in… by @ZhanruiSunCh in #5945
  • chore: [Breaking Change] Rename cuda_graph_config padding_enabled fie… by @nv-guomingz in #6003
  • test: add recursive updating pytorch config and change MOE backend format in perf test by @ruodil in #6046
  • test: add llama_v3.3_70b_cases in perf test by @ruodil in #6035
  • [infra] add more log on reuse-uploading by @niukuo in #6036
  • fix: adjust window sizes of VSWA at torch backend by @jaedeok-nvidia in #5880
  • [nvbugs/5385972][nvbugs/5387423][Fix] Minor fix for llava_next/llava_onevision by @MinaHuai in #5998
  • Fix: pad DeepEP fp4 recv tensors if empty by @yuantailing in #6048
  • [fix] Move NCCL group in all-gather and reduce-scatter OPs outside the outer loop by @jinyangyuan-nvidia in #6053
  • support TRTLLM_DEEP_EP_TOKEN_LIMIT to allow run deep-ep on memory-con… by @ttyio in #5684
  • Cherry-pick #5947 by @lfr-0531 in #5989
  • test: Add regression tests for Gemma3 VLM by @brb-nv in #6033
  • feat/add latency support for trtllm bench by @danielafrimi in #3730
  • feat: Add support for Triton request cancellation by @achartier in #5898
  • [fix] Fix Triton build by @Tabrizian in #6076
  • fix: Unable to load phi4-model with tp_size>1 by @Wanli-Jiang in #5962
  • chore: Bump version to 1.0.0rc4 by @yiqingy0 in #6086
  • chore: upgrade modelopt to 0.33 by @nv-guomingz in #6058
  • [nvbug/5347489][nvbug/5388036] increase timeout in disagg worker test by @zhengd-nv in #6041
  • feat: use sessi...
Read more

v1.0.0rc3

16 Jul 08:25
cfcb97a
Pre-release

Announcement Highlights:

  • Model Support
    • Support Mistral3.1 VLM model (#5529)
    • Add TensorRT-Engine Qwen3 (dense) model support (#5650)
  • Feature
    • Add support for MXFP8xMXFP4 in pytorch (#5411)
    • Log stack trace on error in openai server (#5749)
    • Refactor the topk parallelization part for the routing kernels (#5705)
    • Adjust free GPU memory fraction in KvCacheConfig for DeepSeek R1 tests (#5774)
    • Support FP8 row-wise dense GEMM in torch flow (#5615)
    • Move DeepEP from Docker images to wheel building (#5534)
    • Add user-provided speculative decoding support (#5204)
    • Add optional module cache for TRT-LLM Gen Gemm interfaces (#5743)
    • Add streaming scaffolding_llm.generate_async support (#5345)
    • Detokenize option in /v1/completions request (#5382)
    • Support n-gram speculative decoding with disagg (#5732)
    • Return context response immediately when stream_interval > 1 (#5836)
    • Add support for sm121 (#5524)
    • Add LLM speculative decoding example (#5706)
    • Update xgrammar version to 0.1.19 (#5830)
    • Some refactor on WideEP (#5727)
    • Integrate TRT-LLM Gen FP4 block scale MoE with Pytorch workflow kernel autotuner (#5764)
    • Update transformers to 4.53.0 (#5747)
    • Share PyTorch tensor between processes (#5396)
    • Custom masking utils for Gemma3 VLM (#5853)
    • Remove support for llmapi + TRT backend in Triton (#5856)
    • Add TRTLLM MoE nvfp4 cubins for mid-high concurrency; attention_dp for TRTLLM MoE (#5723)
    • Enable kvcache to be reused during request generation (#4028)
    • Simplify speculative decoding configs (#5639)
    • Add binding type build argument (pybind, nanobind) (#5802)
    • Add the ability to write a request timeline (#5258)
    • Support deepEP fp4 post quant all2all dispatch (#5881)
    • Refactor the rest routing part for the routing kernels in the MoE TRT-LLM backend (#5771)
    • Move vision parts from processor to model for Gemma3 (#5888)
  • API
    • [BREAKING CHANGE] Rename mixed_sampler to enable_mixed_sampler (#5751)
    • [BREAKING CHANGE] Rename LLM.autotuner_enabled to enable_autotuner (#5876) (a rename sketch follows this list)
  • Bug Fixes
    • Fix test_generate_with_seed CI failure. (#5772)
    • Improve fp4_block_scale_moe_runner type check (#5681)
    • Fix prompt adapter TP2 case (#5782)
    • Fix disaggregated serving with attention DP (#4993)
    • Ignore nvshmem_src_*.txz from confidentiality-scan (#5831)
    • Fix a quote error introduced in #5534 (#5816)
    • Fix the accuracy issue when reduce_fusion is enabled for GEMMA model. (#5801)
    • Fix lost requests for disaggregated serving (#5815)
    • Update unit tests: skip all_close assert for dropout in attention, increase tolerance for rope op test (#5855)
    • Fix GEMM+AR fusion on blackwell (#5563)
    • Catch inference failures in trtllm-bench (#5841)
    • Add is_fp8_output key to XQA kernel cubin hashing (solves Eagle3-one-engine Hopper fp8 bug) (#5813)
    • Skip rope scaling for local layers in Gemma3 VLM (#5857)
    • Fix llama4 multimodal support (#5809)
    • Fix Llama4 Scout FP4 crash issue (#5925)
    • Fix max batch size and max tokens in kv cache estimations for Nemotron-H (#5371)
    • Fix moe regression for sm120 (#5823)
    • Fix Qwen2.5VL FP8 support (#5029)
    • Fix the illegal memory access issue in moe gemm on SM120 (#5636)
    • Avoid nesting NCCL group in allgather and reduce scatter OPs (#5866)
    • Fix tileN cannot % 16==0 & support sm89 deepgemm bmm (#5531)
    • Fix incremental detokenization (#5825)
    • Fix MoE workspace info by storing Torch tensor itself instead of data_ptr (#5900)
    • Make the bench serving script compatible with different usages (#5905)
    • Fix mistral unit tests due to transformers upgrade (#5904)
    • Fix the Llama3.1 405B hanging issue. (#5698) (#5925)
    • Fix Gemma3 unit tests due to transformers upgrade (#5921)
    • Extend triton exit time for test_llava (#5971)
    • Fix alltoall for llama4 (apply_router_weight_on_input=True) (#5902)
    • Remove SpecConfig and fix thread leak issues (#5931)
    • Fast redux detection in trtllm gen routing kernel (#5941)
    • Fix cancel request logic (#5800)
    • Fix errors in wide-ep scripts (#5992)
    • Fix error in post-merge-tests (#5949)
  • Benchmark
  • Performance
    • Optimize TRTLLM Sampler perf single beam single step (#5550)
  • Infrastructure
    • Fix a syntax issue in the image check (#5775)
    • Speedup fused moe tests (#5726)
    • Set the label community action to only run on upstream TRTLLM (#5806)
    • Update namelist in blossom-ci (#5838)
    • Update nspect version (#5832)
    • Reduce redundant test cases for TRTLLM Gen FP8 MoE (#5845)
    • Parallelize torch unittests (#5714)
    • Use current_image_tags.properties in rename_docker_images.py (#5846)
    • Fix two known NSPECT high vulnerability issues and reduce image size (#5434)
  • Documentation
    • Update the document of qwen3 and cuda_graph usage (#5705)
    • Update cuda_graph_config usage part in DS R1 docs (#5796)
    • Add llama4 Maverick eagle3 and max-throughput and low_latency benchmark guide (#5810)
    • Fix link in llama4 Maverick example (#5864)
    • Add instructions for running gemma in disaggregated serving (#5922)
    • Add qwen3 disagg perf metrics (#5822)
    • Update the disagg doc (#5938)
    • Update the link of the diagram (#5953)
  • Known Issues
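
A hedged sketch of the two renames called out in the API section above (assumption: both options are plain keyword arguments on the LLM constructor in this release; the model path is only a placeholder):

```python
# Hedged rename sketch: `mixed_sampler` -> `enable_mixed_sampler`,
# `autotuner_enabled` -> `enable_autotuner`. Treat the argument names as
# assumptions and check them against the 1.0.0rc3 LLM API reference.
from tensorrt_llm.llmapi import LLM

llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    enable_mixed_sampler=True,  # formerly `mixed_sampler`
    enable_autotuner=True,      # formerly `autotuner_enabled`
)
```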

What's Changed

  • feat: Add support for MXFP8xMXFP4 in pytorch by @djns99 in #5535
  • [Doc] update the document of qwen3 and cuda_graph usage by @byshiue in #5703
  • [Infra] - Fix a syntax issue in the image check by @chzblych in #5775
  • chore: log stack trace on error in openai server by @zhengd-nv in #5749
  • fix: [nvbug/5368507] Fix test_generate_with_seed CI failure. by @bobboli in #5772
  • Refactor the topk parallelization part for the routing kernels by @ChristinaZ in #5567
  • test: [CI] remove closed bugs by @xinhe-nv in #5770
  • [TRTLLM-5530][BREAKING CHANGE] refactor: LLM arglist rename mixed_sampler to enable_mixed_sampler by @Superjomn in #5751
  • fix: Adjust free GPU memory fraction in KvCacheConfig for DeepSeek R1 tests by @yizhang-nv in #5774
  • [TRTLLM-5812][feat] support FP8 row-wise dense GEMM in torch flow by @DylanChen-NV in #5615
  • feat: Optimize TRTLLM Sampler perf single beam single step by @dcampora in #5550
  • Refactor: move DeepEP from Docker images to wheel building by @yuantailing in #5534
  • [TRTLLM-6291] feat: Add user-provided speculative decoding support by @Funatiq in #5204
  • [ci] speedup fused moe tests by @omera-nv in #5726
  • [feat] Adds optional module cache for TRT-LLM Gen Gemm interfaces by @davidclark-nv in #5743
  • chore: [Breaking Change] Rename cuda_graph_config padding_enabled fie… by @nv-guomingz in #5795
  • feat: add MultimodalParams & putting all multimodal params into it and refactor HyperCLOVAX & Qwen2/2.5-VL by @yechank-nvidia in #5522
  • Revert "chore: [Breaking Change] Rename cuda_graph_config padding_enabled fie…" by @nv-guomingz in #5818
  • [fix] https://nvbugs/5333654 Unwaive to check ci status and improve torch compile multi-gpu coverage by @liji-nv in #5700
  • [fix] improve fp4_block_scale_moe_runner type check by @Alcanderian in #5681
  • feat(scaffolding): add streaming scaffolding_llm.generate_async support by @dc3671 in #5345
  • [None][infra] Set the label community action to only run on upstream TRTLLM by @poweiw in #5806
  • Waive some test_llama_eagle3 unittests by @venkywonka in #5811
  • [NvBug 5362426] fix: Fix prompt adapter TP2 case by @syuoni in #5782
  • chore: bump version to 1.0.0rc3 by @yiqingy0 in #5819
  • doc: update cuda_graph_config usage part in DS R1 docs by @nv-guomingz in #5796
  • fix: Disaggregate serving with attention DP by @VALLIS-NERIA in #4993
  • Fix: ignore nvshmem_src_*.txz from confidentiality-scan by @yuantailing in #5831
  • tests: waive failed cases on main by @xinhe-nv in #5781
  • [Infra] - Waive L0 test by @yiqingy0 in #5837
  • update namelist in blossom-ci by @niukuo in #5838
  • Fix a quote error introduced in #5534 by @yuantailing in #5816
  • [feat]: Detokenize option in /v1/completions request by @Wokzy in #5382
  • [5305318] fix: Fix the accuracy issue when reduce_fusion is enabled for GEMMA model. by @hyukn in #5801
  • [TRTLLM-5847][feat] Support n-gram speculative decoding with disagg by @raayandhar in #5732
  • [TRTLLM-5878] update nspect version by @niukuo in #5832
  • feat: Return context response immediately when stream_interval > 1 by @kaiyux in https://github.c...
Read more

v1.0.0rc2

08 Jul 07:04
66f299a
Pre-release

Announcement Highlights:

  • Model Support
  • Feature
    • Add KV events support for sliding window attention (#5580)
    • Add beam search support to the PyTorch Workflow (#5333)
    • Support more parameters in openai worker of scaffolding (#5115)
    • Enable CUDA graphs for Nemotron-H (#5646)
    • Add spec dec param to attention op for pytorch workflow (#5146)
    • Fuse w4a8 moe pre-quant scale on Hopper (#5613)
    • Support torch compile for attention dp (#5086)
    • Add W4A16 GEMM support for pytorch workflow (#4232)
    • Add request_perf_metrics to triton LLMAPI backend (#5554)
    • Add AutoDeploy fp8 quantization support for bmm (#3849)
    • Refactor moe permute and finalize op by removing duplicated code (#5557)
    • Support duplicate_kv_weight for qwen3 blockwise scale (#5459)
    • Add LoRA support for pytorch backend in trtllm-serve (#5376)
  • API
    • Enhance yaml loading arbitrary options in LlmArgs (#5610)
    • Add back allreduce_strategy parameter into TorchLlmArgs (#5637)
    • Add LLmArgs option to force using dynamic quantization (#5346)
    • Remove ptuning knobs from TorchLlmArgs (#5595)
    • [BREAKING CHANGE] Enhance the LLM args PyTorch config, part 1 (cuda_graph_config) (#5014)
  • Bug Fixes
    • Fix missing arg to alltoall_prepare_maybe_dispatch (#5669)
    • Fix attention DP doesn't work with embedding TP (#5642)
    • Fix broken cyclic reference detect (#5417)
    • Fix permission for local user issues in NGC docker container. (#5373)
    • Fix mtp vanilla draft inputs (#5568)
  • Benchmark
    • Add wide-ep benchmarking scripts (#5760)
  • Performance
    • Reduce DeepEPLowLatency memory and time (#5712)
    • Use tokenizers API to optimize incremental detokenization perf (#5574)
    • Conditionally enable SWAP AB for speculative decoding (#5404)
    • Unify new_tokens format sample state to trtllm samper tokens format (#5513)
    • Replace allgather with AllToAllPrepare (#5570)
    • Optimizations on weight-only batched gemv kernel (#5420)
    • Optimize MoE sort kernels for large-scale EP (#5435)
    • Avoid reswizzle_sf after allgather. (#5504)
  • Infrastructure
    • Always use x86 image for the Jenkins agent and few clean-ups (#5753)
    • Reduce unnecessary kernel generation (#5476)
    • Update the auto-community label action to be triggered every hour (#5658)
    • Improve dev container tagging (#5551)
    • Update the community action to more appropriate api (#4883)
    • Update nccl to 2.27.5 (#5539)
    • Upgrade xgrammar to 0.1.18 (#5364)
  • Documentation
    • Fix outdated config in DeepSeek best perf practice doc (#5638)
    • Add pd dynamic scaling readme (#5540)
    • Add feature support matrix for PyTorch backend (#5037)
    • 1.0 LLM API doc updates (#5629)
  • Known Issues

What's Changed

  • [TRTLLM-5831][feat] Add LoRA support for pytorch backend in trtllm-serve by @talorabr in #5376
  • [CI] reduce mamba2 ssm test parameterization by @tomeras91 in #5571
  • perf: Avoid reswizzle_sf after allgather. by @bobboli in #5504
  • [feat][test] reuse MPI pool executor across tests by @omera-nv in #5566
  • [TRTLLM-5965] perf: Optimize MoE sort kernels for large-scale EP by @syuoni in #5435
  • [feat] Optimizations on weight-only batched gemv kernel by @Njuapp in #5420
  • [ci] remove MMLU if followed by GSM8K by @omera-nv in #5578
  • [TRTLLM-5530][BREAKING CHANGE]: enhance the llm args pytorch config part 1(cuda_graph_config) by @nv-guomingz in #5014
  • Deduplicate waive list by @yiqingy0 in #5546
  • [fix] speedup modeling unittests by @omera-nv in #5579
  • feat : support duplicate_kv_weight for qwen3 blockwise scale by @dongjiyingdjy in #5459
  • [TRTLLM-5331] large-scale EP: perf - Replace allgaher with AllToAllPrepare by @WeiHaocheng in #5570
  • doc: Minor update to DeepSeek R1 best practice by @kaiyux in #5600
  • [nvbug/5354946][fix] Fix mtp vanilla draft inputs by @lfr-0531 in #5568
  • refactor: decoder state setup by @Funatiq in #5093
  • [Infra][main] Cherry-pick from release/0.21: Update nccl to 2.27.5 (#5539) by @EmmaQiaoCh in #5587
  • [TRTLLM-5989, TRTLLM-5991, TRTLLM-5993] doc: Update container instructions (#5490) by @ixlmar in #5605
  • [ci] move eagle1 and medusa tests to post-merge by @omera-nv in #5604
  • chore [TRTLLM-6009]: remove ptuning knobs from TorchLlmArgs by @Superjomn in #5595
  • [fix][ci] missing class names in post-merge test reports by @omera-nv in #5603
  • refactor: [TRTLLM-6150] Refactor moe permute and finalize op by removing duplicated code by @limin2021 in #5557
  • chore: remove cuda_graph_ prefix from cuda_graph_config filed members. by @nv-guomingz in #5585
  • feat: AutoDeploy fp8 quantization support for bmm by @meenchen in #3849
  • feature: unify new_tokens format sample state to trtllm samper tokens format by @netanel-haber in #5513
  • [fix]: Fix main test skip issue by @yizhang-nv in #5503
  • chores: [TRTLLM-6072] 1.0 LLMAPI doc updates by @hchings in #5629
  • add feature support matrix for PyTorch backend by @QiJune in #5037
  • test: [CI] remove closed bugs by @xinhe-nv in #5572
  • test: [CI] Add failed cases into waives.txt by @xinhe-nv in #5569
  • rcca: test default kv_cache_reuse option for pytorch multimodal by @StanleySun639 in #5544
  • [TRTLLM-6104] feat: add request_perf_metrics to triton LLMAPI backend by @xuanzic in #5554
  • test: [CI] Add failed cases into waives.txt by @xinhe-nv in #5582
  • feat: W4A16 GEMM by @danielafrimi in #4232
  • test: Reduce number of C++ test cases by @Funatiq in #5437
  • [https://nvbugs/5318059][test] Unwaive test by @pamelap-nvidia in #5624
  • [Infra] - Add some timeout and unwaive a test which dev fixed by @EmmaQiaoCh in #5631
  • [#5403][perf] Conditionally enable SWAP AB for speculative decoding by @zoheth in #5404
  • [TRTLLM-5277] chore: refine llmapi examples for 1.0 (part1) by @Superjomn in #5431
  • chore: Mass integration of release/0.21 by @dc3671 in #5507
  • refactor: Clean up DecodingInput and DecodingOutput by @Funatiq in #5617
  • perf: Use tokenizers API to optimize incremental detokenization perf by @kaiyux in #5574
  • [feat] Support torch compile for attention dp by @liji-nv in #5086
  • feat: add LLmArgs option to force using dynamic quantization by @achartier in #5346
  • [TRTLLM-5644][infra] Update the community action to more appropriate api by @poweiw in #4883
  • fix: add missing self. from PR #5346 by @achartier in #5653
  • [Bug] attention DP doesn't work with embedding TP by @PerkzZheng in #5642
  • fix: Add back allreduce_strategy parameter into TorchLlmArgs by @HuiGao-NV in #5637
  • perf: better heuristic for allreduce by @yilin-void in #5432
  • feat: fuse w4a8 moe pre-quant scale on Hopper by @xiaoweiw-nv in #5613
  • [chore] 2025-07-02 update github CI allowlist by @niukuo in #5661
  • doc: Add pd dynamic scaling readme by @Shunkangz in #5540
  • chore: enhance yaml loading arbitrary options in LlmArgs by @Superjomn in #5610
  • Feat/pytorch vswa kvcachemanager by @qixiang-99 in #5151
  • [TRTLLM-1316] refactor: Remove unnecessary pipeline parallelism logic from postProcessRequest by @Funatiq in #5489
  • [https://nvbugspro.nvidia.com/bug/5329655] [feat] Pytorch path add spec dec param to attention op by @jhaotingc in #5146
  • [Infra] - Set default timeout to 1hr and remove some specific settings by @EmmaQiaoCh in #5667
  • [TRTLLM-6143] feat: Improve dev container tagging by @ixlmar in #5551
  • feat:[AutoDeploy] E2E build example for llama4 VLM by @Fridah-nv in #3922
  • fix: Fix missing arg to alltoall_prepare_maybe_dispatch by @syuoni in https://...
Read more

v1.0.0rc1

03 Jul 05:38
de97799
Pre-release

Announcement Highlights:

  • Model Support
  • Features
    • Add support for YARN in NemotronNAS models (#4906)
    • Add support for per expert activation scaling factors (#5013)
    • Add ReDrafter support for Qwen (#4875)
    • Add NGrams V2 support (#4569)
    • Use inference mode in update_requests to improve perf of TRTLLM Sampler (#5538)
    • Expose bias and FP8_MXFP4 MOE CUTLASS backend features to pytorch (#5410)
    • Support nvfp4 model and fp8 kv cache for MLA chunked prefill (Blackwell)
    • Large-scale EP (part 8: Online EP load balancer integration for PCIe fp8) (#5226)
    • Prevent serialization of entire LoRA adapters in each request (#5080)
    • Remove cutlass min latency code from AutoTuner. (#5394)
    • Opensource MOE MXFP8-MXFP4 implementation (#5222)
    • Add chunked prefill support for MLA (Blackwell) (#4651)
    • Support disaggregated serving in TRTLLM Sampler (#5328)
    • Support multiCtasKvMode for high-throughput MLA kernels (#5426)
    • Add MTP support for Online EPLB (#5213)
    • Add debug hook to support dump tensor data and add new debug functions easily (#5182)
  • API
    • Add request_perf_metrics to LLMAPI (#5497)
    • Remove batch_manager::KvCacheConfig and use executor::KvCacheConfig instead (#5384)
  • Bug Fixes
    • Fix mPtrExpertCounts allocation in MoE TRT-LLM backend (nvfp4) (#5519)
    • Fix block scale fp8 support for deepseek v3 on Blackwell. (#5514)
    • Fix the issue MoE autotune fallback failed to query default heuristic (#5520)
    • Remove the seq_len of 4096 from FP8 block scale MoE tuning configs. (#5485)
    • Fix the unexpected keyword argument 'streaming' (#5436)
  • Benchmark
    • Update trtllm-bench to support new Pytorch default. (#5491)
    • Add support for TRTLLM CustomDataset (#5511)
    • Make benchmark_serving part of the library (#5428)
  • Performance
    • Improve XQA-MLA perf (#5468)
    • Optimize swizzle_sf, unswizzle_sf, reswizzle_sf (#5318)
  • Infrastructure
    • Allow configuring linking of NVRTC wrapper (#5189)
    • Add timeout setting for long tests found in post-merge (#5501)
  • Documentation
    • Fix benchmark cmd in disagg scripts (#5515)
  • Known Issues
    • multi-GPU model support on RTX Pro 6000

What's Changed

  • feature: make trtllmsampler new_tokens format the universal format by @netanel-haber in #4401
  • [fix] Add 1 and draft_token_num to seq_len when overlap scheduling is enabled during memory estimation by @HuiGao-NV in #5343
  • test: [CI] remove closed bugs by @xinhe-nv in #5400
  • refactor: manage cache indirection in decoder state by @Funatiq in #5315
  • tests: update benchmark test lists by @xinhe-nv in #5365
  • chore: delete mamba hybrid, since it is now called NemotronH by @vegaluisjose in #5409
  • [Infra] - Waive failed tests in post-merge and increase some timeout setting by @EmmaQiaoCh in #5424
  • Add debug hook to support dump tensor data and add new debug functions easily by @HuiGao-NV in #5182
  • Chore: remove unused variables by @QiJune in #5314
  • Fix test Pytorch model engine by @Tabrizian in #5416
  • Add MTP support for Online EPLB by @dongxuy04 in #5213
  • waive test_moe.py::test_moe_fp8[autotune] by @QiJune in #5455
  • fix: fix bug of qwen3 + eagle3 + finalize_moe_fusion by @byshiue in #5369
  • [AutoDeploy] Merge feat/ad_2025_06_13 feature branch by @lucaslie in #5454
  • feat: Dynamically remove servers in PD by @Shunkangz in #5270
  • tests: Set kv cache free memory fraction in test case by @HuiGao-NV in #5433
  • fix (NvBug 5354925): Fix static EPLB by @syuoni in #5411
  • test: Add LLGuidance test and refine guided decoding by @syuoni in #5348
  • CI: update multi gpu test triggering file list by @QiJune in #5466
  • start OAIServer with max_beam_width=1 for TorchSampler by @netanel-haber in #5427
  • chore: bump version to 1.0.0rc1 by @yiqingy0 in #5460
  • [https://jirasw.nvidia.com/browse/TRTLLM-4645] support mutliCtasKvMode for high-throughput MLA kernels by @PerkzZheng in #5426
  • CI: waive test_ad_build_small_multi by @QiJune in #5471
  • feat: Remove not used padding_idx in models by @HuiGao-NV in #5385
  • [nvbug/5354956] fix: unexpected keyword argument 'streaming' by @kaiyux in #5436
  • Move 3 disaggregated cases from 4 GPUs devices to 1 GPU device by @HuiGao-NV in #5457
  • Fix: fix nvbug 5356427 by @HuiGao-NV in #5464
  • feat: Make benchmark_serving part of the library by @kaiyux in #5428
  • [TRTLLM-5974][feat] Support disaggregated serving in TRTLLM Sampler by @dcampora in #5328
  • [chore] Disable block reuse when draft model speculation is being used by @mikeiovine in #5448
  • chore: split _build_model method for TorchLlm and TrtLlm by @QiJune in #5418
  • [fix][test] remove test in global scope by @omera-nv in #5470
  • [fix][ci] dont build wheel for cpp tests by @omera-nv in #5443
  • CI: reduce BF16 test cases in B200 by @QiJune in #5482
  • Add sleep function for disagg gen-only benchmarking by @qiaoxj07 in #5398
  • CI: enable test cases on single device type by @HuiGao-NV in #5484
  • [5356427] fix: Remove the seq_len of 4096 from FP8 block scale MoE tuning configs. by @hyukn in #5485
  • feat: chunked prefill for MLA (Blackwell) by @jmydurant in #4651
  • Add unit test for routing kernels by @ChristinaZ in #5405
  • [CI] Waive test_fp8_block_scales_4gpus[ep4-mtp_nextn=0-fp8kv=True-attention_dp=True-cuda_graph=True-overlap_scheduler=True-torch_compile=False] by @venkywonka in #5494
  • [Infra] - Add timeout setting for long tests found in post-merge by @EmmaQiaoCh in #5501
  • Revert "feature: unify new_tokens format sample state to trtllm samper new_tokens format (#4401)" by @netanel-haber in #5474
  • keep sm90 headsize 128 cubins by @qsang-nv in #5320
  • opensource: Opensource MOE MXFP8-MXFP4 implementation by @djns99 in #5222
  • [TRTLLM-6019] feat: Remove cutlass min latency code from AutoTuner. by @hyukn in #5394
  • [TRTLLM-5921][feat] Prevent serialization of entire LoRA adapters in each request by @amitz-nv in #5080
  • feat: large-scale EP(part 8: Online EP load balancer integration for PCIe fp8) by @dongxuy04 in #5226
  • [chore] Allow configuring linking of NVRTC wrapper by @AlessioNetti in #5189
  • perf: Optimize swizzle_sf, unswizzle_sf, reswizzle_sf by @bobboli in #5318
  • [fix][ci] trigger multigpu tests for deepseek changes by @omera-nv in #5423
  • tests: waive tests by @xinhe-nv in #5458
  • doc: Fix benchmark cmd in disagg scripts by @kaiyux in #5515
  • [perf] improve XQA-MLA perf by @lowsfer in #5468
  • feat: Add support for TRTLLM CustomDataset by @kaiyux in #5511
  • [feat] Add progress bar to benchmark by @arekay in #5173
  • Add trtllm-bench reviewers. by @FrankD412 in #5452
  • [CI] move flashinfer llama tests to post merge by @omera-nv in #5506
  • [fix][ci] move torch tests to run under torch stage by @omera-nv in #5473
  • refactor: remove batch_manager::KvCacheConfig and use executor::KvCacheConfig instead by @Funatiq in #5384
  • [TRTLLM-3602][feat] support nvfp4 model and fp8 kv cache for MLA chunked prefill (Blackwell) by @jmydurant in #5475
  • fix: MoE autotune fallback failed to query default heuristic by @rosenrodt in #5520
  • Update allow list 2025_06_26 by @yuanjingx87 in #5526
  • fix: Mapping rank boundary check bug by @venkywonka in #4935
  • Update trtllm-bench to support new Pytorch default. by @FrankD412 in https://gith...
Read more

v1.0.0rc0

25 Jun 10:23
ebadc13
Pre-release

Announcement Highlights:

  • Model Support
  • Features
    • Add EAGLE3 support for Qwen3 (#5206)
    • Add Piecewise cuda graph support for MLA (#4467)
    • Integrate TRT-LLM Gen FP8 block scale MoE with Pytorch workflow kernel autotuner (#5207)
    • Re-implement LlmResponse in Python to reduce host overhead of pybind (#5224)
    • Add no_kv_cache_reuse option and streaming support for trtllm serve bench (#4971)
    • Add LLGuidance Support for PyTorch Backend (#5214)
    • Fuse finalize and allreduce for the Qwen-MoE model (#5223)
    • Support stream_interval (#5284)
  • API
    • Add llm args to tune python gc threshold (#5141)
    • Introduce ResourceManagerType enum for resource management (#5246)
    • [BREAKING CHANGE] Make PyTorch LLM the default (#5312)
    • Remove TrtGptModelOptionalParams (#5165)
  • Bug Fixes
    • Fix trtllm-llmapi-launch multiple LLM instances (#4727)
    • Fix the deterministic issue in the MTP Eagle path (#5285)
    • Fix missing clientId when serializing and deserializing responses (#5231)
  • Benchmark
  • Performance
    • Optimize MoE supplementary kernels for large-scale EP (#5215)
    • Improve performance of XQA-MLA for sm120 (#5087)
  • Infrastructure
    • Update dependencies with NGC PyTorch 25.05 and TRT 10.11 (#4885)
    • Add Multi-node CI testing support via Slurm (#4771)
  • Documentation
    • Add document of benchmarking for Qwen3 (#5158)
    • Update contributing md for internal developers (#5250)
    • blog: Disaggregated Serving in TensorRT-LLM (#5353)
    • Update mtp documents (#5387)
  • Known Issues
    • multi-GPU model support on RTX Pro 6000

What's Changed

Read more

v0.20.0

19 Jun 04:19
7965842

TensorRT-LLM Release 0.20.0

Key Features and Enhancements

  • Model Support
    • Added Qwen3 support. Refer to the “Qwen3” section in examples/models/core/qwen/README.md.
    • Added HyperCLOVAX-SEED-Vision support in PyTorch flow. Refer to examples/models/contrib/hyperclovax/README.md
    • Added Dynasor-CoT in scaffolding examples. Refer to examples/scaffolding/contrib/Dynasor/README.md
    • Added Mistral Small 3.1 24B VLM support in TRT workflow
    • Added Gemma3-1b-it support in PyTorch workflow
    • Added Nemotron-H model support
    • Added Eagle-3 support for LLAMA4
  • PyTorch workflow
    • Added lora support
    • Added return logits support
    • Adopt new logprob definition in PyTorch flow
    • Enabled per-request stats with PyTorch backend
    • Enabled LogitsProcessor in PyTorch backend
  • Benchmark:
    • Added beam width support to the low latency benchmark.
    • Fixed trtllm-bench iter_stats and cuda_graph_batch_sizes errors.
    • Removed the deprecated Python runtime benchmark.
    • Added benchmark support for scaffolding.
  • Multimodal models
    • Added support in trtllm-serve
    • Added support in trtllm-bench; currently limited to image inputs only
  • Supported DeepSeek-R1 W4A8 on Hopper
  • Added RTX Pro 6000 support on a single GPU
  • Integrated Llama4 input processor
  • Added CGA reduction FHMA kernels on Blackwell
  • Enabled chunked context for FlashInfer
  • Supported KV cache reuse for MLA
  • Added Piecewise CUDA Graph support
  • Supported multiple LoRA adapters and TP
  • Added KV cache-aware router for disaggregated serving
  • Unfused attention for native support
  • Added group_rms_norm kernel to normalize multiple inputs in a single operator
  • Added smart router for the MoE module
  • Added head size 72 support for QKV preprocessing kernel
  • Added MNNVL MoE A2A support
  • Optimized Large Embedding Tables in Multimodal Models
  • Supported Top-K logprobs and prompt_logprobs in LLMAPI (see the sketch after this list)
  • Enabled overlap scheduler in TRT workflow via executor API
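
A minimal sketch of requesting Top-K logprobs through the Python LLM API, as mentioned in the list above. The exact field names on SamplingParams and on the result object are assumptions, and the model id is a placeholder.

    from tensorrt_llm import LLM, SamplingParams

    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model id
    params = SamplingParams(
        max_tokens=16,
        logprobs=2,         # assumed: top-2 logprobs per generated token
        prompt_logprobs=2,  # assumed: top-2 logprobs per prompt token
    )
    result = llm.generate(["The capital of France is"], params)[0]
    print(result.outputs[0].text)
    print(result.outputs[0].logprobs)  # per-token logprobs, if populated
    print(result.prompt_logprobs)      # prompt logprobs, if populated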

Infrastructure Changes

  • The TRT-LLM team now formally releases a Docker image on NGC.
  • The pre-built TensorRT-LLM wheel on PyPI is now linked against PyTorch 2.7.0, which uses the CXX11 ABI (see the version-check sketch after this list)
  • The dependent TensorRT version is updated to 10.10.0
  • The dependent CUDA version is updated to 12.9.0
  • The dependent public PyTorch version is updated to 2.7.0
  • The dependent NVIDIA ModelOpt version is updated to 0.29.0
  • The dependent NCCL version is maintained at 2.25.1
  • Open-sourced the XQA kernels
  • The dependent datasets version was upgraded to 3.1.0
  • Migrated the Triton backend into the TensorRT-LLM repo as a TensorRT-LLM submodule
  • Downgraded the GCC toolset version from 13 to 11
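
For convenience, the snippet below prints the locally installed versions so they can be compared against the dependency list above; it is a quick sketch, not an official verification tool.

    # Print installed versions to check against the Infrastructure Changes list.
    import torch
    import tensorrt_llm

    print("tensorrt_llm:", tensorrt_llm.__version__)  # expect 0.20.0
    print("torch       :", torch.__version__)         # expect 2.7.0 (CXX11 ABI)
    print("cuda (torch):", torch.version.cuda)        # expect 12.9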

API Changes

  • [Breaking Change]: Enable scheduling overlap by default (see the opt-out sketch after this list)
  • Remove deprecated GptSession/V1 from TRT workflow
  • Set _AutoDeployLlmArgs as primary config object
  • Allow overriding CLI arguments with YAML file in trtllm-serve
  • Introduced multimodal embedding field in LlmRequest
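
Because scheduling overlap is now on by default, users who need the previous behavior have to opt out explicitly. The sketch below assumes a disable_overlap_scheduler argument on the LLM constructor; confirm the exact knob name in the LlmArgs reference before using it.

    from tensorrt_llm import LLM

    # `disable_overlap_scheduler` is an assumed knob name for opting out of the
    # new scheduling-overlap default; the model id is a placeholder.
    llm = LLM(
        model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model id
        disable_overlap_scheduler=True,            # assumption, see note above
    )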

Fixed Issues

  • Fixed a hang when the context server does not have enough capacity for the KV cache (#3095)
  • Fixed C++ decoder synchronization in PyTorch (#3106)
  • Fixed a bug where a CUDA stream created as a default argument was initialized at import time (#3764)
  • Fixed an attention DP bug in the Qwen3 MoE model (#4141)
  • Fixed an illegal memory access when running LLaMA 4 with CUDA Graph enabled (#4101)
  • Reset planned states to avoid a memory leak in TrtllmAttentionWrapper (#4227)

Known Issues

  • multi-GPU model support on RTX Pro 6000

What's Changed

  • Refine doc by @juney-nvidia in #4420
  • Refine doc by @juney-nvidia in #4421
  • refine doc by @juney-nvidia in #4422
  • Remove vila test by @Tabrizian in #4376
  • [TRTLLM-4618][feat] Add Nemotron Super 49B FP8 test on RTX6000 Pro (SM120) by @farazkh80 in #4363
  • tests: add qa test mentioned in docs by @crazydemo in #4357
  • [Infra] - Always push the release images in the post-merge job by @chzblych in #4426
  • tests: Add test cases for rcca cases by @crazydemo in #4347
  • chore: cleanup perf_evaluator code by @Superjomn in #3833
  • feat: Add pp support for hybrid attn/mamba model by @yuxianq in #4358
  • fix: wrong argument name enable_overlap_scheduler by @kaiyux in #4433
  • Update "Roadmap" link under README.md to the issues with Roadmap label by @AdamzNV in #4425
  • fix potential issues in allreduce fusion kernel and ut by @yilin-void in #4226
  • [TRTLLM-4638] feat(scaffolding): update Reward Controller to PRM specific controller with step split by @dc3671 in #4337
  • feat: NIXL interface integration by @Shixiaowei02 in #3934
  • Downgrade the logger level for fallback tactic warning. by @hyukn in #4440
  • Test: Improve model re-use in C++ DGX tests for CI stability by @DomBrown in #4263
  • fix: temp disable the problem test by @Shixiaowei02 in #4445
  • Add llama4 disagg accuracy tests by @Tabrizian in #4336
  • [https://nvbugs/5123103][fix] Fix torch compile for DeepSeekV3 by @liji-nv in #3952
  • [Docs] - Reapply #4220 by @chzblych in #4434
  • [TRTLLM-4618][feat] Fix cutlass MoE GEMM fallback failure on FP8 + add e2e test for Mixtral 8x7B FP8 on RTX6000 Pro (SM120) by @farazkh80 in #4335
  • [Feat] add chunked-attention kernels on Hopper (for llama4) by @PerkzZheng in #4291
  • test(perf): Add some Llama-3_3-Nemotron-Super-49B-v1 integration-perf-tests (TRT flow, trtllm-bench) by @venkywonka in #4128
  • fix: [nvbugs/5287097] Align PP layer distribution between pytorch and TRT flow. by @yuxianq in #4399
  • feat: Low Precision Allreduce for PCIe based GPU by @kanghui0204 in #4344
  • test: [CI] Add failed cases into waives.txt by @xinhe-nv in #4429
  • [TRTLLM-4932] Add CLI accuracy tests for Llama-3.3-70B-Instruct and LLM API BF16 variant by @moraxu in #4362
  • test: update test filter in perf test yml file to select cases by gpu name and add cases for RTX 6000 pro by @ruodil in #4282
  • [AutoDeploy] HF factory improvements by @lucaslie in #4371
  • chore: bump version to 0.21.0rc0 by @ZhanruiSunCh in #4465
  • doc: [TRTLLM-325]Integrate the NGC image in Makefile automation and document by @MartinMarciniszyn in #4400
  • chore: bump version to 0.20.0 by @ZhanruiSunCh in #4469
  • fix: replace the image links in the blog by @Shixiaowei02 in #4490
  • fix: cleanup process tree for disaggregated test by @tongyuantongyu in #4116
  • Cherry pick #4508 by @QiJune in #4512
  • Cherry pick #4447 by @yuxianq in #4517
  • chore: Remove unused script by @kaiyux in #4485
  • chore: Deprecate autopp. by @yuxianq in #4471
  • fix: Fix trtllm sampler beam width bug by @dcampora in #4507
  • tests: update api change from decoder to sampler in test by @crazydemo in #4479
  • docs: Add KV Cache Management documentation by @Funatiq in #3908
  • test: add failed case in waive list and fix some test script issue for perf test by @ruodil in #4528
  • Add tritonrelease container by @Tabrizian in #4544
  • fix: [TRTLLM-325]WAR against security vulnerabilities in Python packages by @MartinMarciniszyn in #4539
  • [5141290][5273694][5260696] fix: Fix mrope argument missing issue in the summary tasks for Qwen model. by @hyukn in #4432
  • test: waive hanging cases for perf test by @ruodil in #4563
  • [nvbugs/5274894] fix: Moving finished context requests to generation by @Funatiq in #4576
  • [5234029][5226211] chore: Unwaive multimodal tests for Qwen model. by @hyukn in #4519
  • test(perf): Extend the Llama-Nemotron-Nano-8B perf-integration-tests (pyt) by @venkywonka in #4407
  • test: fix for perf sanity test and skip fp8 deepseek blackwell cases by @ruodil in #4598
  • [5180961] chore: Unwai...
Read more