Triton server dynamic_batching not working with multiple requests #661

Open
kazyun opened this issue Dec 13, 2024 · 1 comment
Labels
bug Something isn't working

Comments


kazyun commented Dec 13, 2024

System Info

  • GPU: A800 80 GB × 2
  • Container: nvcr.io/nvidia/tritonserver:24.11-trtllm-python-py3
  • Model: Qwen2.5-14B-Instruct

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. Add dynamic_batching to tensorrt_llm/config.pbtxt:

    dynamic_batching {
      preferred_batch_size: [ 32 ]
      max_queue_delay_microseconds: 10000
      default_queue_policy: { max_queue_size: 32 }
    }

  2. Limit the model to a single GPU instance:

    instance_group [
      {
        count: 1
        kind: KIND_GPU
        gpus: [ 0 ]
      }
    ]

  3. Simulate 10 concurrent requests (a client sketch follows this list).
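
For illustration, a minimal client sketch for step 3, assuming Triton's HTTP generate endpoint on localhost:8000 and the standard tensorrt_llm ensemble model; the model name and the text_input/max_tokens field names are assumptions and may differ in your deployment:

    import time
    from concurrent.futures import ThreadPoolExecutor

    import requests

    # Hypothetical endpoint: adjust host, port, and model name for your setup.
    URL = "http://localhost:8000/v2/models/ensemble/generate"

    def send_request(i: int) -> float:
        """Send one generate request and return its latency in seconds."""
        start = time.time()
        resp = requests.post(
            URL,
            # "text_input"/"max_tokens" follow the usual tensorrt_llm
            # ensemble schema; your model's input names may differ.
            json={"text_input": f"Request {i}: write a short greeting.",
                  "max_tokens": 128},
            timeout=120,
        )
        resp.raise_for_status()
        return time.time() - start

    if __name__ == "__main__":
        wall_start = time.time()
        # Fire all 10 requests at once so they can land in the same batching window.
        with ThreadPoolExecutor(max_workers=10) as pool:
            latencies = list(pool.map(send_request, range(10)))
        print("per-request latencies:", [f"{t:.1f}s" for t in latencies])
        print(f"total wall time: {time.time() - wall_start:.1f}s")

If dynamic batching works, the total wall time should stay close to a single request's latency; if requests serialize (as reported below), it grows roughly linearly with the request count.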

Expected behavior

Expect these 10 requests to be processed simultaneously and return results.

Actual behavior

With the model instance limited to one, concurrent requests are processed sequentially, one after another: if processing and generating the full text for one request takes 10 seconds, the second request only starts after those 10 seconds, so two requests take 20 seconds in total.

Additional notes

If you need me to provide the complete config.pbtxt file, feel free to ask.

kazyun added the bug label on Dec 13, 2024
jadhosn commented Dec 17, 2024

Try increasing your max_queue_delay_microseconds to a larger value to give your calls a chance to arrive within the same time window. This will help you debug whether dynamic_batching is broken, or whether the parameters you chose are too tight.

For example, try max_queue_delay_microseconds = 10000000; since the delay is in microseconds, the queue would then wait 10 seconds for incoming requests.
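
For illustration, a sketch of the tensorrt_llm config.pbtxt fragment with that debugging value (other fields carried over from the reproduction above; a 10-second window is only meant to prove that batching happens at all, not for production use):

    dynamic_batching {
      preferred_batch_size: [ 32 ]
      # 10,000,000 microseconds = 10 s batching window, a debugging value only
      max_queue_delay_microseconds: 10000000
      default_queue_policy: { max_queue_size: 32 }
    }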
