
Is it even possible to have multiple input layers #641

Closed
forrestjgq opened this issue Dec 12, 2023 · 7 comments
Labels: feature request (New feature or request), triaged (Issue has been triaged by maintainers)

@forrestjgq

forrestjgq commented Dec 12, 2023

Hi guys:

We're developing a new model using tensorrt-llm, and this model has more than one input layer. I checked the GptSession and GptDecoder code, and it seems that only input IDs can be passed to the model?

Is it possible for us to pass more data to the model? That data will NOT change throughout the whole generation process.
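For example, something like this hypothetical call (the `generate` signature and the `extra_feature_*` names below are made up to illustrate what we want, not real TensorRT-LLM API):

```python
# Hypothetical sketch (not real TensorRT-LLM API): besides input_ids,
# pass two static tensors that stay constant across generation steps.
outputs = session.generate(
    input_ids=input_ids,        # standard token input
    extra_feature_a=feature_a,  # static side input 1, fixed for the whole request
    extra_feature_b=feature_b,  # static side input 2, fixed for the whole request
)
```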

Thanks!

@forrestjgq
Author

Or do you have a plan to support the LLaVA model or other multimodal models?

@byshiue
Collaborator

byshiue commented Dec 13, 2023

You can use several input layers. Currently, all k/v caches are passed via input layers.

LLaVA is on our roadmap; adding @ncomly-nvidia to share more details.
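For example, at the plain TensorRT level a network can declare any number of named inputs. A minimal sketch with the TensorRT Python API (not TensorRT-LLM internals; the extra tensor names and shapes are illustrative):

```python
import tensorrt as trt

# Declare several named network inputs; the runtime binds each one by
# name at inference time, so extra inputs work the same way as input_ids.
builder = trt.Builder(trt.Logger(trt.Logger.WARNING))
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))

input_ids = network.add_input("input_ids", trt.int32, (-1, -1))
extra_a = network.add_input("extra_a", trt.float32, (-1, -1, 4096))
extra_b = network.add_input("extra_b", trt.float32, (-1, 16))
```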

@byshiue byshiue assigned byshiue and ncomly-nvidia and unassigned byshiue Dec 13, 2023
@byshiue byshiue added the feature request New feature or request label Dec 13, 2023
@forrestjgq
Author

> You can use several input layers. Currently, all k/v caches are passed via input layers.
>
> LLaVA is on our roadmap; adding @ncomly-nvidia to share more details.

Our plan is to add 2 extra input layers besides input_ids, based on Llama, and build the model into a TRT engine.

In the ensemble pipeline, the data for these 2 layers will be produced by the preprocessing model and passed to trtllm as named tensors (see the sketch after the questions below).

We have 2 questions:

  1. Is this plan feasible?
  2. Will these 2 extra inputs be delivered to the TRT engine on every forward pass by the GPT manager (or something else)?
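As a sketch of what we mean (the tensor and model names are placeholders; the preprocessing model would produce the real data, which we fake with numpy here):

```python
import numpy as np
import tritonclient.http as httpclient

# Send input_ids plus two extra named tensors to a Triton ensemble.
client = httpclient.InferenceServerClient(url="localhost:8000")

tensors = [
    ("input_ids", np.zeros((1, 32), dtype=np.int32), "INT32"),
    ("extra_a", np.zeros((1, 32, 4096), dtype=np.float32), "FP32"),
    ("extra_b", np.zeros((1, 16), dtype=np.float32), "FP32"),
]
inputs = []
for name, arr, dtype in tensors:
    inp = httpclient.InferInput(name, list(arr.shape), dtype)
    inp.set_data_from_numpy(arr)
    inputs.append(inp)

result = client.infer(model_name="ensemble", inputs=inputs)
```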

@ncomly-nvidia ncomly-nvidia added the triaged Issue has been triaged by maintainers label Dec 18, 2023
@symphonylyh
Collaborator

@forrestjgq for 2 extra layers do you mean 2 extra inputs? Here are some examples that might be useful to you:

  1. position_ids and token_type_ids: encoder-decoder models like BART require an extra input called position_ids, as you can see here
  2. The BLIP-2 example has input_embeds instead of input_ids to handle the visual encoder's embeddings, treating them as a prompt tuning table; see here (a sketch of this pattern follows the list)
  3. Last, if your goal is to enable LLaVA, the good news is that we have internally supported LLaVA and, more generally, a multi-modal family class. Stay tuned, we'll release it soon. We can use TensorRT-LLM Requests #632 as the tracker
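To make item 2 concrete, here is a rough sketch of that pattern in plain PyTorch (illustrative shapes, not the actual TensorRT-LLM implementation):

```python
import torch

# Build the embedding sequence outside the LM: embed the text tokens,
# then prepend the visual encoder's embeddings, so the engine consumes
# an input_embeds tensor in place of input_ids.
embed = torch.nn.Embedding(32000, 4096)      # stand-in for the LM's token embedding
text_ids = torch.randint(0, 32000, (1, 16))  # tokenized prompt
visual_embeds = torch.randn(1, 32, 4096)     # assumed visual encoder output

text_embeds = embed(text_ids)                                  # (1, 16, 4096)
input_embeds = torch.cat([visual_embeds, text_embeds], dim=1)  # (1, 48, 4096)
```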

@symphonylyh
Collaborator

@forrestjgq we have now released multi-modal support for BLIP with OPT or T5, and LLaVA. Please take a look :)
Announcement: #847

Closing the issue for now since LLaVA is supported. Feel free to re-open, or open a new issue if you encounter any problems.

@forrestjgq
Author

forrestjgq commented Jan 16, 2024

@symphonylyh
Great!

One more question: how do we deploy it in Triton server?

@symphonylyh
Collaborator

@forrestjgq the simplest approach is to write a Triton Python backend; there are some examples in the Triton repo. For more general support and the in-flight batching feature, work is still in progress, and we expect to have an update in Feb.
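For reference, a minimal Python-backend skeleton looks like this (the tensor names are placeholders, and a real backend would invoke the TensorRT-LLM runtime inside execute()):

```python
import numpy as np
import triton_python_backend_utils as pb_utils

class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            # Read the named input tensor from the request.
            input_ids = pb_utils.get_input_tensor_by_name(
                request, "input_ids").as_numpy()
            # ... run generation here; this sketch just echoes zeros.
            output_ids = np.zeros_like(input_ids)
            out = pb_utils.Tensor("output_ids", output_ids)
            responses.append(
                pb_utils.InferenceResponse(output_tensors=[out]))
        return responses
```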
