Hi,
We are very pleased to announce the 0.14.0 version of TensorRT-LLM. This update includes:
## Key Features and Enhancements
- Enhanced the `LLM` class in the LLM API:
  - Added support for calibration with an offline dataset.
  - Added support for Mamba2.
  - Added support for `finish_reason` and `stop_reason` (see the sketch after this list).
- Added FP8 support for CodeLlama.
- Added `__repr__` methods for class `Module`, thanks to the contribution from @1ytic in #2191.
- Added BFloat16 support for fused gated MLP.
- Updated ReDrafter beam search logic to match Apple ReDrafter v1.1.
- Improved `customAllReduce` performance.
- The draft model can now copy logits directly over MPI to the target model's process in `orchestrator` mode. This fast logits copy reduces the delay between draft token generation and the beginning of target model inference.
- NVIDIA Volta GPU support is deprecated and will be removed in a future release.
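For reference, here is a minimal sketch of reading the new `finish_reason` and `stop_reason` fields through the LLM API. The model name is only a placeholder, and parameter names such as `max_tokens` may differ slightly between versions:

```python
from tensorrt_llm import LLM, SamplingParams

# Any supported Hugging Face model works here; this one is a placeholder.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# Use a stop string so stop_reason has something to report.
sampling_params = SamplingParams(max_tokens=64, stop=["\n\n"])

for output in llm.generate(["The capital of France is"], sampling_params):
    completion = output.outputs[0]
    print(completion.text)
    # finish_reason: why decoding ended (e.g. "stop" or "length");
    # stop_reason: which stop string or token triggered it, if any.
    print(completion.finish_reason, completion.stop_reason)
```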
## API Changes
- [BREAKING CHANGE] The default `max_batch_size` of the `trtllm-build` command is set to `2048`.
- [BREAKING CHANGE] Removed `builder_opt` from the `BuildConfig` class and the `trtllm-build` command.
- Added logits post-processor support to the `ModelRunnerCpp` class (see the sketch after this list).
- Added `isParticipant` method to the C++ `Executor` API to check if the current process is a participant in the executor instance.
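As an illustration of the logits post-processor hook: a post-processor is a callable that receives the request id, the logits tensor for the next step, the generated token ids so far, and the CUDA stream, and mutates the logits in place before sampling. The callback signature below follows the executor API; the `logits_processor_map` registration keyword is an assumption for illustration and may differ in your installed version:

```python
import torch
from tensorrt_llm.runtime import ModelRunnerCpp

def ban_token_post_processor(req_id, logits, token_ids, stream_ptr, client_id):
    """Toy example: mask out token id 42 in place before sampling."""
    # Run the edit on the same CUDA stream the runtime is using.
    with torch.cuda.stream(torch.cuda.ExternalStream(stream_ptr)):
        logits[..., 42] = float("-inf")

# Registration: the keyword name is an assumption; check the
# ModelRunnerCpp signature in your installed version.
runner = ModelRunnerCpp.from_dir(
    engine_dir="./engine",
    logits_processor_map={"ban_42": ban_token_post_processor},
)
```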
## Model Updates
- Added support for NemotronNas, see `examples/nemotron_nas/README.md`.
- Added support for Deepseek-v1, see `examples/deepseek_v1/README.md`.
- Added support for Phi-3.5 models, see `examples/phi/README.md`.
## Fixed Issues
- Fixed a typo in `tensorrt_llm/models/model_weights_loader.py`, thanks to the contribution from @wangkuiyi in #2152.
- Fixed a duplicated module import in `tensorrt_llm/runtime/generation.py`, thanks to the contribution from @lkm2835 in #2182.
- Enabled `share_embedding` for models that have no `lm_head` in the legacy checkpoint conversion path, thanks to the contribution from @lkm2835 in #2232.
- Fixed a `kv_cache_type` issue in the Python benchmark, thanks to the contribution from @qingquansong in #2219.
- Fixed an issue with SmoothQuant calibration with custom datasets, thanks to the contribution from @Bhuvanesh09 in #2243.
- Fixed an issue surrounding `trtllm-build --fast-build` with fake or random weights, thanks to @ZJLi2013 for flagging it in #2135.
- Fixed missing `use_fused_mlp` when constructing `BuildConfig` from a dict, thanks for the fix from @ethnzhng in #2081.
- Fixed lookahead batch layout for `numNewTokensCumSum`. (#2263)
## Infrastructure Changes
- The dependent ModelOpt version is updated to v0.17.
## Documentation
- @Sherlock113 added a tech blog to the latest news in #2169, thanks for the contribution.
## Known Issues
- Replit Code is not supported with transformers 4.45+.
We are updating the `main` branch regularly with new features, bug fixes and performance optimizations. The `rel` branch will be updated less frequently, and the exact frequencies depend on your feedback.
Thanks,
The TensorRT-LLM Engineering Team