Change Log

Versions 0.7.0 / 0.7.1

Models
- BART and mBART support in encoder-decoder models
- FairSeq Neural Machine Translation (NMT) family
- Mixtral-8x7B model
  - Support weight loading for HuggingFace Mixtral model
- OpenAI Whisper
- Mixture of Experts support
- MPT: INT4 AWQ / SmoothQuant support
- Baichuan FP8 quantization support
Features
- [Preview] Speculative decoding
- Add Python binding for GptManager
- Add a Python class ModelRunnerCpp that wraps the C++ GptSession
- System prompt caching
- Enable split-k for weight-only CUTLASS kernels
- FP8 KV cache support for XQA kernel
- New Python builder API and trtllm-build command (already applied to blip2 and OPT)
- Support StoppingCriteria and LogitsProcessor in Python generate API (thanks to the contribution from @zhang-ge-hao)
- FMHA support for chunked attention and paged KV cache
Bug fixes
- Fix tokenizer usage in quantize.py #288, thanks to the contribution from @0xymoro
- Fix LLaMA with LoRA error #637
- Fix LLaMA GPTQ failure #580
- Fix Python binding for InferenceRequest issue #528
- Fix CodeLlama SQ accuracy issue #453
Performance
- MMHA optimization for MQA and GQA
- LoRA optimization: CUTLASS grouped GEMM
- Optimize Hopper warp specialized kernels
- Optimize AllReduce for parallel attention on Falcon and GPT-J
- Enable split-k for weight-only CUTLASS kernels when SM >= 75
Documentation
- Add documentation for convert/build workflow

Versions 0.6.0 / 0.6.1

Models
- ChatGLM3
- InternLM (contributed by @wangruohui)
- Mistral 7B (developed in collaboration with Mistral.AI)
- MQA/GQA support to MPT (and GPT) models (contributed by @bheilbrun)
- Qwen (contributed by @Tlntin and @zhaohb)
- Replit Code V-1.5 3B (external contribution)
- T5, mT5, Flan-T5 (Python runtime only)
Features
- Add runtime statistics related to active requests and KV cache utilization from the batch manager (see the batch manager documentation)
- Add sequence_length tensor to support proper lengths in beam-search (when beam-width > 1 - see tensorrt_llm/batch_manager/GptManager.h)
- BF16 support for encoder-decoder models (Python runtime - see examples/enc_dec)
- Improvements to memory utilization (CPU and GPU, including fixes for memory leaks)
- Improved error reporting and memory consumption
- Improved support for stop and bad words
- INT8 SmoothQuant and INT8 KV Cache support for the Baichuan models (see examples/baichuan)
- INT4 AWQ Tensor Parallelism support and INT8 KV cache + AWQ/weight-only support for the GPT-J model (see examples/gptj)
- INT4 AWQ support for the Falcon models (see examples/falcon)
- LoRA support for the GPT model (functional preview only: limited to the Python runtime, QKV only, and not optimized for runtime performance; see the "Run LoRA with the Nemo checkpoint" section in the GPT example)
- Multi-GPU support for encoder-decoder models (Python runtime - see examples/enc_dec)
- New heuristic for launching the Multi-block Masked MHA kernel (similar to FlashDecoding - see decoderMaskedMultiheadAttentionLaunch.h)
- Prompt-tuning support for GPT and LLaMA models (see the Prompt-tuning section in the GPT example)
- Performance optimizations in various CUDA kernels
- Option to exclude input tokens from the output (see excludeInputInOutput in GptManager)
- Python binding for the C++ runtime (GptSession - see pybind)
- Support for different micro batch sizes for context and generation phases with pipeline parallelism (see GptSession::Config::ctxMicroBatchSize and GptSession::Config::genMicroBatchSize in tensorrt_llm/runtime/gptSession.h)
- Support for "remove input padding" for encoder-decoder models (see examples/enc_dec)
- Support for context and generation logits (see mComputeContextLogits and mComputeGenerationLogits in tensorrt_llm/runtime/gptModelConfig.h)
- Support for logProbs and cumLogProbs (see "output_log_probs" and "cum_log_probs" in GptManager)
- Update to CUTLASS 3.x
Bug fixes
- Fix for ChatGLM2 #93 and #138
- Fix tensor names error "RuntimeError: Tensor names (host_max_kv_cache_length) in engine are not the same as expected in the main branch" #369
- Fix weights split issue in BLOOM when world_size = 2 ("array split does not result in an equal division") #374
- Fix SmoothQuant multi-GPU failure when tensor parallelism is 2 #267
- Fix a crash in GenerationSession if stream keyword argument is not None #202
- Fix a typo when calling the PyNVML API: [BUG] code bug #410
- Fix bugs related to the improper management of the end_id for various models [C++ and Python]
- Fix memory leaks [C++ code and Python models]
- Fix the std::bad_alloc error when running gptManagerBenchmark #66
- Fix a bug in pipeline parallelism when beam-width > 1
- Fix a bug with LLaMA GPTQ due to improper support of GQA
- Fix issue #88
- Fix an issue with the HuggingFace Transformers version #16
- Fix link jump in Windows readme.md #30 - by @yuanlehome
- Fix typo in batchScheduler.h #56 - by @eltociear
- Fix typo #58 - by @RichardScottOZ
- Fix Multi-block MMHA: Difference between max_batch_size in the engine builder and max_num_sequences in TrtGptModelOptionalParams? #65
- Fix the log message to be more accurate on KV cache #224
- Fix Windows release wheel installation: Failed to install the release wheel for Windows using pip #261
- Fix missing torch dependencies: [BUG] The batch_manage.a choice error in --cpp-only when torch's cxx_abi version is different with gcc #151
- Fix linking error during compiling google-test & benchmarks #277
- Fix logits dtype for Baichuan and ChatGLM: segmentation fault caused by the lack of bfloat16 #335
- Minor bug fixes

Version 0.5.0

TensorRT-LLM v0.5.0 is the first public release.