Mllama example does not run properly for v0.15 when using the tensorrt_llm_bls endpoint #669

Open
here4dadata opened this issue Dec 24, 2024 · 0 comments

When following the steps highlighted in the mllama examples, we run into two issues.

  1. The `cross_kv_cache_fraction` parameter is expected to be set in `tensorrt_llm/config.pbtxt`, whereas the examples do not set it at all, so following them as written fails. Setting it manually to something like 0.5 gets past this issue.
  2. When sending the example curl request, but replacing `ensemble` with `tensorrt_llm_bls`, we end up with the following error:
    Traceback (most recent call last):
      File "/models/tensorrt_llm_bls/1/model.py", line 108, in execute
        for res in res_gen:
      File "/models/tensorrt_llm_bls/1/lib/decode.py", line 223, in decode
        gen_response = self._generate_non_streaming(
      File "/models/tensorrt_llm_bls/1/lib/triton_decoder.py", line 350, in _generate_non_streaming
        r = self._exec_triton_request_single(triton_req)
      File "/models/tensorrt_llm_bls/1/lib/triton_decoder.py", line 149, in _exec_triton_request_single
        raise pb_utils.TritonModelException(responses.error().message())
    c_python_backend_utils.TritonModelException: Executor failed process requestId 5 due to the following error: Encountered an error in forwardAsync function: GenericLlmRequest::getEncoderInputLen - Do not have encoder length! (/workspace/tensorrt_llm/cpp/include/tensorrt_llm/batch_manager/llmRequest.h:580)
    1  0x78deb6f675e6 tensorrt_llm::batch_manager::GenericLlmRequest<std::shared_ptr<tensorrt_llm::runtime::ITensor>, std::shared_ptr<tensorrt_llm::runtime::CudaStream> >::getEncoderInputLen() const + 246
    2  0x78deb6f87d98 tensorrt_llm::batch_manager::kv_cache_manager::KVCacheManager::getRemainingBlocksToCompletion(tensorrt_llm::batch_manager::LlmRequest const&) const + 312
    3  0x78deb6f51172 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x2b50172) [0x78deb6f51172]
    4  0x78deb6f5152f tensorrt_llm::batch_manager::GuaranteedNoEvictScheduler::operator()(tensorrt_llm::batch_manager::kv_cache_manager::KVCacheManager const&, tensorrt_llm::common::OptionalRef<tensorrt_llm::batch_manager::kv_cache_manager::KVCacheManager const>, tensorrt_llm::common::OptionalRef<tensorrt_llm::batch_manager::BasePeftCacheManager const>, std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&) const + 47
    5  0x78deb6f5259f /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x2b5159f) [0x78deb6f5259f]
    6  0x78deb6f4dfa1 tensorrt_llm::batch_manager::CapacityScheduler::operator()(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, tensorrt_llm::common::OptionalRef<tensorrt_llm::batch_manager::kv_cache_manager::KVCacheManager>, tensorrt_llm::common::OptionalRef<tensorrt_llm::batch_manager::BasePeftCacheManager const>, tensorrt_llm::common::OptionalRef<tensorrt_llm::batch_manager::kv_cache_manager::KVCacheManager const>) const + 97
    7  0x78deb6fe32f9 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&) + 649
    8  0x78deb7021297 tensorrt_llm::executor::Executor::Impl::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) + 455
    9  0x78deb7027755 tensorrt_llm::executor::Executor::Impl::executionLoop() + 1365
    10 0x78dfa8308253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x78dfa8308253]
    11 0x78dfa7e6bac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x78dfa7e6bac3]
    12 0x78dfa7efca04 clone + 68
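For issue 1, this is the workaround block we added to `tensorrt_llm/config.pbtxt`, following the standard Triton `parameters` format used by the other entries in that file (0.5 is just the placeholder value we used, not a tuned setting):

```
parameters: {
  key: "cross_kv_cache_fraction"
  value: {
    string_value: "0.5"
  }
}
```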

What does this encoder error mean: "Encountered an error in forwardAsync function: GenericLlmRequest::getEncoderInputLen - Do not have encoder length!"? We set the encoder input length when building the TensorRT-LLM engine with the flag `--max_encoder_input_len 8200`. Is there another parameter we have to populate when sending requests to the BLS endpoint?
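For context, this is a sketch of the request we are sending. The only change from the working `ensemble` call is the model name in the URL; the field names below (`text_input`, `max_tokens`) follow the generic TensorRT-LLM backend generate example, and we have omitted the image field from the mllama example here, so treat the payload shape as illustrative rather than exact:

```python
import json
import urllib.request

# Illustrative payload; field names assumed from the generic generate example.
payload = {
    "text_input": "<|image|>Describe the image.",
    "max_tokens": 64,
}

# Same request that works against /v2/models/ensemble/generate,
# with the model name swapped to tensorrt_llm_bls.
url = "http://localhost:8000/v2/models/tensorrt_llm_bls/generate"
req = urllib.request.Request(
    url,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# urllib.request.urlopen(req)  # requires a running Triton server
```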
