Mllama example does not run properly for v0.15 when using the tensorrt_llm_bls endpoint #669

Open
here4dadata opened this issue Dec 24, 2024 · 0 comments

When following the steps highlighted in the mllama examples, we run into two issues.

  1. The `cross_kv_cache_fraction` parameter is expected to be set in `tensorrt_llm/config.pbtxt`, whereas the examples do not set it at all, so following them as written fails. Setting it manually to something like 0.5 gets past this issue.
  2. When sending the example curl request, but replacing `ensemble` with `tensorrt_llm_bls`, we end up with the following error:
    Traceback (most recent call last):
      File "/models/tensorrt_llm_bls/1/model.py", line 108, in execute
        for res in res_gen:
      File "/models/tensorrt_llm_bls/1/lib/decode.py", line 223, in decode
        gen_response = self._generate_non_streaming(
      File "/models/tensorrt_llm_bls/1/lib/triton_decoder.py", line 350, in _generate_non_streaming
        r = self._exec_triton_request_single(triton_req)
      File "/models/tensorrt_llm_bls/1/lib/triton_decoder.py", line 149, in _exec_triton_request_single
        raise pb_utils.TritonModelException(responses.error().message())
    c_python_backend_utils.TritonModelException: Executor failed process requestId 5 due to the following error: Encountered an error in forwardAsync function: GenericLlmRequest::getEncoderInputLen - Do not have encoder length! (/workspace/tensorrt_llm/cpp/include/tensorrt_llm/batch_manager/llmRequest.h:580)
    1  0x78deb6f675e6 tensorrt_llm::batch_manager::GenericLlmRequest<std::shared_ptr<tensorrt_llm::runtime::ITensor>, std::shared_ptr<tensorrt_llm::runtime::CudaStream> >::getEncoderInputLen() const + 246
    2  0x78deb6f87d98 tensorrt_llm::batch_manager::kv_cache_manager::KVCacheManager::getRemainingBlocksToCompletion(tensorrt_llm::batch_manager::LlmRequest const&) const + 312
    3  0x78deb6f51172 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x2b50172) [0x78deb6f51172]
    4  0x78deb6f5152f tensorrt_llm::batch_manager::GuaranteedNoEvictScheduler::operator()(tensorrt_llm::batch_manager::kv_cache_manager::KVCacheManager const&, tensorrt_llm::common::OptionalRef<tensorrt_llm::batch_manager::kv_cache_manager::KVCacheManager const>, tensorrt_llm::common::OptionalRef<tensorrt_llm::batch_manager::BasePeftCacheManager const>, std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&) const + 47
    5  0x78deb6f5259f /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x2b5159f) [0x78deb6f5259f]
    6  0x78deb6f4dfa1 tensorrt_llm::batch_manager::CapacityScheduler::operator()(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, tensorrt_llm::common::OptionalRef<tensorrt_llm::batch_manager::kv_cache_manager::KVCacheManager>, tensorrt_llm::common::OptionalRef<tensorrt_llm::batch_manager::BasePeftCacheManager const>, tensorrt_llm::common::OptionalRef<tensorrt_llm::batch_manager::kv_cache_manager::KVCacheManager const>) const + 97
    7  0x78deb6fe32f9 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&) + 649
    8  0x78deb7021297 tensorrt_llm::executor::Executor::Impl::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) + 455
    9  0x78deb7027755 tensorrt_llm::executor::Executor::Impl::executionLoop() + 1365
    10 0x78dfa8308253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x78dfa8308253]
    11 0x78dfa7e6bac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x78dfa7e6bac3]
    12 0x78dfa7efca04 clone + 68
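For issue 1, this is the workaround block we added to `tensorrt_llm/config.pbtxt`, following the standard Triton `parameters` format used by the other entries in that file (0.5 is just the placeholder value we used, not a tuned setting):

```
parameters: {
  key: "cross_kv_cache_fraction"
  value: {
    string_value: "0.5"
  }
}
```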

What does this encoder error mean: "Encountered an error in forwardAsync function: GenericLlmRequest::getEncoderInputLen - Do not have encoder length!"? We set the encoder input length when building the TensorRT-LLM engine with the flag `--max_encoder_input_len 8200`. Is there another parameter we have to populate when sending requests to the BLS endpoint?
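For context, this is a sketch of the request we are sending. The only change from the working `ensemble` call is the model name in the URL; the field names below (`text_input`, `max_tokens`) follow the generic TensorRT-LLM backend generate example, and we have omitted the image field from the mllama example here, so treat the payload shape as illustrative rather than exact:

```python
import json
import urllib.request

# Illustrative payload; field names assumed from the generic generate example.
payload = {
    "text_input": "<|image|>Describe the image.",
    "max_tokens": 64,
}

# Same request that works against /v2/models/ensemble/generate,
# with the model name swapped to tensorrt_llm_bls.
url = "http://localhost:8000/v2/models/tensorrt_llm_bls/generate"
req = urllib.request.Request(
    url,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# urllib.request.urlopen(req)  # requires a running Triton server
```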
