When following the steps highlighted in the examples for `mllama`, we run into two issues.
The `cross_kv_cache_fraction` parameter is expected to be set in `tensorrt_llm/config.pbtxt`, whereas it is not set at all in the examples, and attempting to follow them as written fails. You can set it manually to something like `0.5` to get past this issue.
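For reference, this is roughly the block we added to `tensorrt_llm/config.pbtxt` to get past the check (a minimal sketch; `0.5` is simply the value we tried, not a tuned setting):

```
# Fraction of the KV cache budget reserved for cross-attention.
# 0.5 is just the value we used to get past the missing-parameter error.
parameters: {
  key: "cross_kv_cache_fraction"
  value: {
    string_value: "0.5"
  }
}
```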
When actually sending the example curl request, but replacing `ensemble` with `tensorrt_llm_bls`, we end up with the following error:

```
Traceback (most recent call last):
  File "/models/tensorrt_llm_bls/1/model.py", line 108, in execute
    for res in res_gen:
  File "/models/tensorrt_llm_bls/1/lib/decode.py", line 223, in decode
    gen_response = self._generate_non_streaming(
  File "/models/tensorrt_llm_bls/1/lib/triton_decoder.py", line 350, in _generate_non_streaming
    r = self._exec_triton_request_single(triton_req)
  File "/models/tensorrt_llm_bls/1/lib/triton_decoder.py", line 149, in _exec_triton_request_single
    raise pb_utils.TritonModelException(responses.error().message())
c_python_backend_utils.TritonModelException: Executor failed process requestId 5 due to the following error: Encountered an error in forwardAsync function: GenericLlmRequest::getEncoderInputLen - Do not have encoder length! (/workspace/tensorrt_llm/cpp/include/tensorrt_llm/batch_manager/llmRequest.h:580)
1  0x78deb6f675e6 tensorrt_llm::batch_manager::GenericLlmRequest<std::shared_ptr<tensorrt_llm::runtime::ITensor>, std::shared_ptr<tensorrt_llm::runtime::CudaStream> >::getEncoderInputLen() const + 246
2  0x78deb6f87d98 tensorrt_llm::batch_manager::kv_cache_manager::KVCacheManager::getRemainingBlocksToCompletion(tensorrt_llm::batch_manager::LlmRequest const&) const + 312
3  0x78deb6f51172 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x2b50172) [0x78deb6f51172]
4  0x78deb6f5152f tensorrt_llm::batch_manager::GuaranteedNoEvictScheduler::operator()(tensorrt_llm::batch_manager::kv_cache_manager::KVCacheManager const&, tensorrt_llm::common::OptionalRef<tensorrt_llm::batch_manager::kv_cache_manager::KVCacheManager const>, tensorrt_llm::common::OptionalRef<tensorrt_llm::batch_manager::BasePeftCacheManager const>, std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&) const + 47
5  0x78deb6f5259f /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x2b5159f) [0x78deb6f5259f]
6  0x78deb6f4dfa1 tensorrt_llm::batch_manager::CapacityScheduler::operator()(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, tensorrt_llm::common::OptionalRef<tensorrt_llm::batch_manager::kv_cache_manager::KVCacheManager>, tensorrt_llm::common::OptionalRef<tensorrt_llm::batch_manager::BasePeftCacheManager const>, tensorrt_llm::common::OptionalRef<tensorrt_llm::batch_manager::kv_cache_manager::KVCacheManager const>) const + 97
7  0x78deb6fe32f9 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&) + 649
8  0x78deb7021297 tensorrt_llm::executor::Executor::Impl::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) + 455
9  0x78deb7027755 tensorrt_llm::executor::Executor::Impl::executionLoop() + 1365
10 0x78dfa8308253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x78dfa8308253]
11 0x78dfa7e6bac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x78dfa7e6bac3]
12 0x78dfa7efca04 clone + 68
```
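For context, the request was essentially the example curl with only the target model changed; a trimmed sketch is below (the image-related fields from the example payload are elided, and `text_input`/`max_tokens` are assumed from the standard BLS generate schema):

```bash
# Same request as the multimodal example, only pointed at tensorrt_llm_bls
# instead of ensemble; the image fields from the example payload are omitted here.
curl -X POST localhost:8000/v2/models/tensorrt_llm_bls/generate \
  -d '{"text_input": "Describe the attached image.", "max_tokens": 64}'
```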
What is this encoder error: `Encountered an error in forwardAsync function: GenericLlmRequest::getEncoderInputLen - Do not have encoder length!`? We set the encoder input length when building the TRT-LLM engine with the flag `--max_encoder_input_len 8200` ... Is there another parameter we have to populate when sending requests to the BLS endpoint?
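For completeness, the engine build looked roughly like this (a sketch only: the checkpoint and output paths are placeholders and the rest of our build flags are omitted; `--max_encoder_input_len 8200` is the relevant part):

```bash
# Hypothetical sketch of our trtllm-build invocation; paths are placeholders
# and other flags from the real command are left out.
trtllm-build \
    --checkpoint_dir ./mllama_checkpoint \
    --output_dir ./mllama_engine \
    --max_encoder_input_len 8200
```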