- Models
- BART and mBART support in encoder-decoder models
- FairSeq Neural Machine Translation (NMT) family
- Mixtral-8x7B model
- Support weight loading for HuggingFace Mixtral model
- OpenAI Whisper
- Mixture of Experts support
- MPT - Int4 AWQ / SmoothQuant support
- Baichuan FP8 quantization support
- Features
- [Preview] Speculative decoding
- Add Python binding for `GptManager`
- Add a Python class `ModelRunnerCpp` that wraps C++ `gptSession`
- System prompt caching
- Enable split-k for weight-only cutlass kernels
- FP8 KV cache support for XQA kernel
- New Python builder API and `trtllm-build` command (already applied to blip2 and OPT)
- Support `StoppingCriteria` and `LogitsProcessor` in Python generate API (thanks to the contribution from @zhang-ge-hao)
- fMHA support for chunked attention and paged kv cache
- Bug fixes
- Fix tokenizer usage in quantize.py #288, thanks to the contribution from @0xymoro
- Fix LLaMA with LoRA error #637
- Fix LLaMA GPTQ failure #580
- Fix Python binding for InferenceRequest issue #528
- Fix CodeLlama SQ accuracy issue #453
- Performance
- MMHA optimization for MQA and GQA
- LoRA optimization: cutlass grouped gemm
- Optimize Hopper warp specialized kernels
- Optimize AllReduce for parallel attention on Falcon and GPT-J
- Enable split-k for weight-only cutlass kernel when SM>=75
- Documentation
- Models
- ChatGLM3
- InternLM (contributed by @wangruohui)
- Mistral 7B (developed in collaboration with Mistral.AI)
- MQA/GQA support for MPT (and GPT) models (contributed by @bheilbrun)
- Qwen (contributed by @Tlntin and @zhaohb)
- Replit Code V-1.5 3B (external contribution)
- T5, mT5, Flan-T5 (Python runtime only)
- Features
- Add runtime statistics related to active requests and KV cache utilization from the batch manager (see the batch manager documentation)
- Add `sequence_length` tensor to support proper lengths in beam-search (when beam-width > 1 - see tensorrt_llm/batch_manager/GptManager.h)
- BF16 support for encoder-decoder models (Python runtime - see examples/enc_dec)
- Improvements to memory utilization (CPU and GPU - including memory leaks)
- Improved error reporting and memory consumption
- Improved support for stop and bad words
- INT8 SmoothQuant and INT8 KV Cache support for the Baichuan models (see examples/baichuan)
- INT4 AWQ Tensor Parallelism support and INT8 KV cache + AWQ/weight-only support for the GPT-J model (see examples/gptj)
- INT4 AWQ support for the Falcon models (see examples/falcon)
- LoRA support for the GPT model (functional preview only: limited to the Python runtime, QKV only, and not optimized for runtime performance; see the "Run LoRA with the Nemo checkpoint" section in the GPT example)
- Multi-GPU support for encoder-decoder models (Python runtime - see examples/enc_dec)
- New heuristic for launching the Multi-block Masked MHA kernel (similar to FlashDecoding - see decoderMaskedMultiheadAttentionLaunch.h)
- Prompt-Tuning support for GPT and LLaMA models (see the Prompt-tuning Section in the GPT example)
- Performance optimizations in various CUDA kernels
- Possibility to exclude input tokens from the output (see `excludeInputInOutput` in `GptManager`)
- Python binding for the C++ runtime (GptSession - see `pybind`)
- Support for different micro batch sizes for context and generation phases with pipeline parallelism (see `GptSession::Config::ctxMicroBatchSize` and `GptSession::Config::genMicroBatchSize` in tensorrt_llm/runtime/gptSession.h)
- Support for "remove input padding" for encoder-decoder models (see examples/enc_dec)
- Support for context and generation logits (see `mComputeContextLogits` and `mComputeGenerationLogits` in tensorrt_llm/runtime/gptModelConfig.h)
- Support for `logProbs` and `cumLogProbs` (see `"output_log_probs"` and `"cum_log_probs"` in `GptManager`)
- Update to CUTLASS 3.x
- Bug fixes
- Fix for ChatGLM2 #93 and #138
- Fix tensor names error "RuntimeError: Tensor names (`host_max_kv_cache_length`) in engine are not the same as expected in the main branch" #369
- Fix weights split issue in BLOOM when `world_size = 2` ("array split does not result in an equal division") #374
- Fix SmoothQuant multi-GPU failure when tensor parallelism is 2 #267
- Fix a crash in GenerationSession if stream keyword argument is not None #202
- Fix a typo when calling PyNVML API [BUG] code bug #410
- Fix bugs related to the improper management of the `end_id` for various models [C++ and Python]
- Fix memory leaks [C++ code and Python models]
- Fix the std::bad_alloc error when running gptManagerBenchmark (issue: gptManagerBenchmark std::bad_alloc error #66)
- Fix a bug in pipeline parallelism when beam-width > 1
- Fix a bug with Llama GPTQ due to improper support of GQA
- Fix issue #88
- Fix an issue with the Huggingface Transformers version #16
- Fix link jump in windows readme.md #30 - by @yuanlehome
- Fix typo in batchScheduler.h #56 - by @eltociear
- Fix typo #58 - by @RichardScottOZ
- Fix Multi-block MMHA: Difference between `max_batch_size` in the engine builder and `max_num_sequences` in TrtGptModelOptionalParams? #65
- Fix the log message to be more accurate on KV cache #224
- Fix Windows release wheel installation: Failed to install the release wheel for Windows using pip #261
- Fix missing torch dependencies: [BUG] The batch_manage.a choice error in --cpp-only when torch's cxx_abi version is different with gcc #151
- Fix linking error during compiling google-test & benchmarks #277
- Fix logits dtype for Baichuan and ChatGLM: segmentation fault caused by the lack of bfloat16 #335
- Minor bug fixes
- TensorRT-LLM v0.5.0 is the first public release.