disaggregated_serving_bls: CPU usage #653

Open
gary-wjc opened this issue Dec 2, 2024 · 1 comment
gary-wjc commented Dec 2, 2024

I followed the guidance in https://github.com/triton-inference-server/tensorrtllm_backend/tree/main/all_models/disaggregated_serving to create all models based on qwen2.5-14b, with tp_size=2 for both the context and generation models. Everything looks fine while tritonserver launches. However, when a request is sent using inflight_batcher_llm/client/end_to_end_grpc_client.py, the following error appears:

Received an error from server:
in ensemble 'ensemble', Failed to process the request(s) for model 'disaggregated_serving_bls_0_0', message: TritonModelException: Context model context failed with error: Context-only and generation-only requests are NOT currently supported in orchestrator mode. (/tmp/tritonbuild/tensorrtllm/inflight_batcher_llm/src/utils.cc:647)
1       0x7ff2e008e881 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x13881) [0x7ff2e008e881]
2       0x7ff2e0098001 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x1d001) [0x7ff2e0098001]
3       0x7ff2e00a3744 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x28744) [0x7ff2e00a3744]
4       0x7ff2e00920d5 TRITONBACKEND_ModelInstanceExecute + 101
5       0x7ff2f03070b4 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a70b4) [0x7ff2f03070b4]
6       0x7ff2f030742b /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a742b) [0x7ff2f030742b]
7       0x7ff2f0425ccd /opt/tritonserver/bin/../lib/libtritonserver.so(+0x2c5ccd) [0x7ff2f0425ccd]
8       0x7ff2f030b864 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1ab864) [0x7ff2f030b864]
9       0x7ff2efbcc253 /lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7ff2efbcc253]
10      0x7ff2ef95bac3 /lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7ff2ef95bac3]
11      0x7ff2ef9eca04 clone + 68

The Triton container version I currently use is 24.10, with a v0.14.0 TensorRT-LLM engine.
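
For reference, the request was sent roughly as follows. This is a minimal sketch using tritonclient.grpc rather than the full client script; the tensor names (text_input, max_tokens, text_output) are the usual tensorrtllm_backend ensemble names and may differ in your model repository:

import numpy as np
import tritonclient.grpc as grpcclient

# Connect to the Triton gRPC endpoint (default port 8001).
client = grpcclient.InferenceServerClient(url="localhost:8001")

# Typical inputs of the tensorrtllm_backend "ensemble" model;
# the names here are assumptions, check your config.pbtxt.
text = np.array([["What is the capital of France?"]], dtype=object)
max_tokens = np.array([[64]], dtype=np.int32)

inputs = [
    grpcclient.InferInput("text_input", text.shape, "BYTES"),
    grpcclient.InferInput("max_tokens", max_tokens.shape, "INT32"),
]
inputs[0].set_data_from_numpy(text)
inputs[1].set_data_from_numpy(max_tokens)

result = client.infer(model_name="ensemble", inputs=inputs)
print(result.as_numpy("text_output"))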
@kaiyux

gary-wjc changed the title from "disaggregated_serving_bls: fail to process request" to "disaggregated_serving_bls: CPU usage" on Dec 3, 2024

gary-wjc commented Dec 3, 2024

This issue was solved by switching to a Docker container with an up-to-date tensorrtllm_backend built from the main branch (rather than a stable release). However, when the service is idle, the orchestrator process still consumes 200% CPU, and every other tritonserver process occupies 100% CPU.
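
A quick way to confirm the idle busy-wait is to sample per-process CPU usage of all tritonserver processes. This is a minimal sketch using psutil, not part of the original report:

import time
import psutil

# Find all running tritonserver processes by name.
procs = [p for p in psutil.process_iter(["name"]) if p.info["name"] == "tritonserver"]

# The first cpu_percent() call primes the counter; the second returns
# usage over the elapsed interval. Values above 100% mean more than
# one core is busy.
for p in procs:
    p.cpu_percent(None)
time.sleep(5)
for p in procs:
    print(p.pid, f"{p.cpu_percent(None):.0f}%")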
