TensorRT-LLM Requests #632

ncomly-nvidia · 2023-12-11T19:35:22Z

Hi all, this issue will track the feature requests you've made to TensorRT-LLM & provide a place to see what TRT-LLM is currently working on.

Last update: Jan 14th, 2024
🚀 = in development

Models

Features & Optimizations

Context Chunking - [Feature request] Dynamic splitfuse from Deepspeed (2x throughput) #317
Speculative Decoding - Feature: Speculative sampling / Assisted Generation #169, Smaller available space for paged KV cache compared with vLLM #224, Falcon-40b build causing memory leaks and failure #226
implementation done - documentation in progress

KV Cache

Reuse KV Cache - [Feature reuqest] support interactive-generation #292, Add automatic reuse of common key value cache blocks between requests #620
Attention Sinks (StreamingLLM, H2O) - Attention sink #104

Quantization

StarCoder INT8 SQ - Feature request: Support SmoothQuant variant of StarCoder #324
Qwen INT4 - [Feature request] AutoAWQ support #345
INT8 Weight only - Support weight only quantization from bfloat16 to int8? #110

Sampling

🚀 Support frequency_penalty - Support for frequency_penalty #275
Logit Manipulators - Add Transformers logits manipulators #241
Combine repetition & presence penalties - Support for combining repetition_penalty, presence_penalty #274
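
For readers unfamiliar with the three penalties named above, here is a minimal sketch of how they are commonly defined: a multiplicative repetition penalty (as in the CTRL paper and Hugging Face Transformers) and additive presence and frequency penalties (as in the OpenAI API). The function name `apply_penalties`, its parameters, and the order in which the penalties are applied are assumptions for illustration only; this is not TensorRT-LLM's implementation.

```python
from collections import Counter


def apply_penalties(logits, generated_ids,
                    repetition_penalty=1.0,
                    presence_penalty=0.0,
                    frequency_penalty=0.0):
    """Return a penalized copy of `logits` (hypothetical helper, illustration only)."""
    counts = Counter(generated_ids)  # how often each token id was already generated
    out = list(logits)
    for tok, n in counts.items():
        # Multiplicative repetition penalty: shrink positive logits,
        # push negative logits further down.
        if out[tok] > 0:
            out[tok] /= repetition_penalty
        else:
            out[tok] *= repetition_penalty
        # Additive penalties: a flat cost for having appeared at all,
        # plus a cost proportional to the number of appearances.
        out[tok] -= presence_penalty + frequency_penalty * n
    return out


logits = [2.0, -1.0, 0.5, 3.0]
history = [0, 0, 1]  # token 0 generated twice, token 1 once
print(apply_penalties(logits, history,
                      repetition_penalty=2.0,
                      presence_penalty=0.5,
                      frequency_penalty=0.25))
```

Tokens that have not been generated yet (ids 2 and 3 here) keep their original logits; only previously generated tokens are penalized.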

Workflow

Front-ends

OpenAI compatible API - Provide an interface similar to OpenAI API #334
Flag for end-of-stream - Flag indicate end of stream #240
Load from Buffer - GptManager add support for loading from buffer #144
Paged KV Cache Utilization Metric - How to know the utility of paged kv cache ? #512
Log Probabilities - Return log probabilities for tokens #238
Return only new tokens - How to get the newly generated tokens only? #227

Integrations

🚀 LlamaIndex
🚀 LangChain
Mojo - Question about a Mojo Integration #556

Usage / Installation

pip install - waiting for pre-built wheel package #790

Platform Support

Jetson - Nvidia Jetson device Support #62, How can I running successful on jetson orin NX? #488, TensorRT installation in TRT-LLM #619
V100, T4 MHA - Why FMHA is not supported in V100 and T4 #320

teis-e · 2024-04-04T18:37:58Z

Please add CohereAI!!

CohereForAI/c4ai-command-r-plus

EwoutH · 2024-04-22T09:03:02Z

Llama 3 would be great (both 8B and 70B): #1470

Maybe quantized to 8 or even 4 bit.

StephennFernandes · 2024-04-22T22:02:06Z

Currently, Llama 3 throws a bunch of errors when converting to TensorRT-LLM.

Any idea about the support for Llama 3?

EwoutH · 2024-04-23T15:06:56Z

Phi-3-mini should be amazing! Such a small 3.8B model could run quantized on many GPUs, with as little as 4GB VRAM.

Paper: https://arxiv.org/abs/2404.14219
Model weights: https://huggingface.co/collections/microsoft/phi-3-6626e15e9585a200d2d761e3
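
As a rough check on the 4GB figure, the weight footprint can be estimated from the parameter count and the bits per weight. This is a back-of-the-envelope sketch: the helper name `weight_memory_gb` is hypothetical, and the estimate covers weights only, ignoring KV cache, activations, and runtime overhead, which all add to real VRAM usage.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate weight storage in decimal GB (weights only, no KV cache)."""
    return num_params * bits_per_weight / 8 / 1e9


PHI3_MINI_PARAMS = 3.8e9  # parameter count quoted above

for bits in (16, 8, 4):
    print(f"{bits}-bit weights: {weight_memory_gb(PHI3_MINI_PARAMS, bits):.1f} GB")
```

Under these assumptions, 4-bit weights take about 1.9 GB and 8-bit weights about 3.8 GB, so a 4-bit build leaves some headroom on a 4GB card while an 8-bit build leaves very little.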

oscarbg · 2024-05-04T14:51:58Z

+1 for Phi-3

user-0a · 2024-05-18T05:22:42Z

+1 for Command R Plus!

CohereForAI/c4ai-command-r-plus

khan-yin · 2024-06-25T16:34:46Z

Hello @ncomly-nvidia, I am a student interested in the project! Are there any recent good-first-issue feature requests under Features & Optimizations? 🤣

chenpinganan · 2024-07-02T11:13:17Z

+1 for OpenBMB/MiniCPM-V-2

FenardH · 2024-08-05T07:16:28Z

Any news on support for jetson platform? Thanks in advance.

anubhav-agrawal-mu-sigma · 2024-09-03T09:58:56Z

Requesting support for Meta's m4t v2 model, like how whisper support is provided.

johnnynunez · 2024-09-25T06:56:53Z

How is it going for Jetson AGX? It would be nice if everything were compatible before the Jetson Thor launch.

ampdot-io · 2024-09-28T04:14:38Z

LLaMa 3.2 multimodal vision models anytime soon?

hello-11 · 2024-11-18T04:27:30Z

cc @laikhtewari for vis.

johnnynunez · 2024-11-18T07:32:37Z

congrats Nvidia: https://www.jetson-ai-lab.com/tensorrt_llm.html

Mavericky-j · 2024-11-21T03:02:00Z

Any news on support for jetson platform? Thanks in advance.

You can refer to the v0.12-jetson branch.

ncomly-nvidia added the good first issue Good for newcomers label Dec 18, 2023

ncomly-nvidia pinned this issue Dec 18, 2023

symphonylyh mentioned this issue Dec 18, 2023

Is it even possible to have multiple input layers #641

Closed

erenup mentioned this issue Dec 29, 2023

Add Roberta and few new tests for Bert #778

Closed

tp-nan mentioned this issue Mar 13, 2024

[Feature Request] More realistic benchmark and throughput optimization #1292

Closed
