Update: This issue is resolved by switching to a Docker container with an up-to-date tensorrtllm_backend built from the main branch (rather than a stable release). However, when the service is idle, the orchestrator process consumes 200% CPU, and every other tritonserver process consumes 100% CPU.
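For anyone trying to reproduce the idle CPU usage, a minimal way to measure per-process utilization inside the container (assuming the standard `pgrep` and sysstat tools are available) is:

```bash
# List all tritonserver processes with their full command lines
# (the orchestrator plus one process per context/generation instance)
pgrep -af tritonserver

# Sample per-process CPU usage once per second for 5 seconds while idle;
# pidstat comes from the sysstat package (apt-get install sysstat if missing)
pidstat -u -p "$(pgrep -d, tritonserver)" 1 5
```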
I followed the guidance in https://github.com/triton-inference-server/tensorrtllm_backend/tree/main/all_models/disaggregated_serving to create all models based on qwen2.5-14b, with tp_size=2 for both the context and generation models. Everything looks fine while tritonserver launches. However, when a request is sent using inflight_batcher_llm/client/end_to_end_grpc_client.py, the following error message appears:
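For reference, the request was sent roughly like this (flag names are from memory and may differ between releases; verify against the client's `--help` output):

```bash
# Hypothetical invocation of the gRPC client against the default gRPC port;
# the -u/-p/-o flags are assumptions, not confirmed against this exact version
python3 inflight_batcher_llm/client/end_to_end_grpc_client.py \
    -u localhost:8001 \
    -p "What is machine learning?" \
    -o 64
```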
The Triton container version I am currently using is 24.10, with a TensorRT-LLM v0.14.0 engine.
@kaiyux