Concurrency Issue: Multiple Requests Not Being Processed Simultaneously #1947

Open

hahmad2008 opened this issue Nov 7, 2024 · 1 comment

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

I'm running into an issue with sglang where multiple requests are not being processed concurrently, even though their combined token count is well below the maximum context length, and the GPU memory should be able to handle them.

I'm sending 4 concurrent requests of 1,000 tokens each, 4,000 tokens in total. This fits comfortably within the 8,000-token context length, and GPU memory should be more than enough to handle all 4 requests at once.

However, only 3 of the requests are processed at the same time; the 4th ends up queued and, as a result, has a much higher Time to First Token (TTFT) than the others.

What I Expected:

  • All 4 requests should be processed simultaneously since the sum of their tokens is well within the GPU's context capacity (and GPU memory is also sufficient).

Questions

  • Is there a limit within sglang that's preventing more than 3 concurrent requests from being processed at once, even though there’s enough memory and token capacity?
  • Could there be a memory management or resource allocation issue that’s causing this behavior?
  • Is there any configuration I'm missing, or anything I should adjust, to allow all requests to be processed simultaneously? (See the launch sketch after this list.)
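
A minimal sketch (not a confirmed fix) of relaunching the server with the scheduler-related knobs from the ServerArgs dump below made explicit. The flag names are assumed to mirror the ServerArgs fields (max_running_requests, schedule_conservativeness, mem_fraction_static), the value 8 for --max-running-requests is purely illustrative, and the exact flags should be checked against `python -m sglang.launch_server --help` for this sglang version.

```python
# Illustrative relaunch with the concurrency-related knobs spelled out.
# Flag names are assumed to mirror the ServerArgs fields shown below; verify
# them with `python -m sglang.launch_server --help` before relying on this.
import subprocess

subprocess.run([
    "python", "-m", "sglang.launch_server",
    "--model-path", "solidrust/gemma-2-9b-it-AWQ",
    "--port", "10000",
    "--context-length", "8192",
    "--mem-fraction-static", "0.91",
    "--max-running-requests", "8",          # explicit cap instead of the default None
    "--schedule-conservativeness", "1.0",
], check=True)
```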

Reproduction

Setup:

  • Model: solidrust/gemma-2-9b-it-AWQ
  • GPU: Single GPU with 24 GB memory
  • Context Length: 8,000 tokens
  • Request Length: 1,000 tokens each
  • Concurrent Requests: 4 requests, each with 1,000 tokens (see the client sketch below)
  • max-tokens: 1024
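
Client side of the reproduction, as a minimal sketch: it assumes the server's OpenAI-compatible /v1/completions endpoint on the host and port from the ServerArgs dump below, uses a repeated word as a stand-in for a roughly 1,000-token prompt, and takes the first streamed chunk as an approximation of time to first token.

```python
# Minimal reproduction client: fire 4 identical ~1,000-token requests at once
# and record an approximate time-to-first-token (TTFT) for each.
# Assumes the OpenAI-compatible /v1/completions endpoint from the ServerArgs below.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://127.0.0.1:10000/v1/completions"
PROMPT = "word " * 1000  # stand-in for a roughly 1,000-token prompt

def one_request(i: int) -> float:
    start = time.time()
    with requests.post(
        URL,
        json={
            "model": "solidrust/gemma-2-9b-it-AWQ",
            "prompt": PROMPT,
            "max_tokens": 1024,
            "stream": True,
        },
        stream=True,
    ) as resp:
        for line in resp.iter_lines():
            if line:  # first streamed SSE chunk ~= first generated token
                return time.time() - start
    return float("nan")

with ThreadPoolExecutor(max_workers=4) as pool:
    ttfts = list(pool.map(one_request, range(4)))
print("approx. TTFT per request (s):", [round(t, 2) for t in ttfts])
```

Plain threads are enough here because each worker just blocks on its HTTP stream, so all 4 requests hit the server at essentially the same time.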

ServerArgs:
server_args=ServerArgs(model_path='solidrust/gemma-2-9b-it-AWQ', tokenizer_path='solidrust/gemma-2-9b-it-AWQ', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', dtype='bfloat16', device='cuda', kv_cache_dtype='auto', trust_remote_code=True, context_length=8192, quantization=None, served_model_name='solidrust/gemma-2-9b-it-AWQ', chat_template=None, is_embedding=False, host='127.0.0.1', port=10000, mem_fraction_static=0.91, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=8192, schedule_policy='lpm', schedule_conservativeness=1.0, tp_size=1, stream_interval=1, random_seed=740475642, constrained_json_whitespace_pattern=None, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, api_key=None, file_storage_pth='SGLang_storage', dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', attention_backend='flashinfer', sampling_backend='flashinfer', disable_flashinfer=False, disable_flashinfer_sampling=False, disable_radix_cache=False, disable_regex_jump_forward=False, disable_cuda_graph=True, disable_cuda_graph_padding=True, disable_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, enable_mixed_chunk=False, enable_torch_compile=False, max_torch_compile_bs=32, torchao_config='', enable_p2p_check=False, triton_attention_reduce_in_fp32=False, lora_paths=None, max_loras_per_batch=8)

Logs:

[13:15:58 TP0] Prefill batch. #new-seq: 1, #new-token: 1008, #cached-token: 5, cache hit rate: 0.55%, token usage: 0.00, #running-req: 0, #queue-req: 0
[13:15:59 TP0] Prefill batch. #new-seq: 3, #new-token: 2923, #cached-token: 26, cache hit rate: 0.79%, token usage: 0.02, #running-req: 1, #queue-req: 0
[13:16:01 TP0] Decode batch. #running-req: 4, #token: 4035, token usage: 0.09, gen throughput (token/s): 1.93, #queue-req: 0
[13:16:02 TP0] Decode batch. #running-req: 4, #token: 4195, token usage: 0.10, gen throughput (token/s): 132.88, #queue-req: 0
[13:16:04 TP0] Decode batch. #running-req: 4, #token: 4355, token usage: 0.10, gen throughput (token/s): 132.35, #queue-req: 0
[13:16:05 TP0] Decode batch. #running-req: 4, #token: 4515, token usage: 0.11, gen throughput (token/s): 132.35, #queue-req: 0
[13:16:06 TP0] Decode batch. #running-req: 3, #token: 3488, token usage: 0.08, gen throughput (token/s): 120.36, #queue-req: 0
[13:16:08 TP0] Decode batch. #running-req: 2, #token: 2553, token usage: 0.06, gen throughput (token/s): 67.14, #queue-req: 0
[13:16:10 TP0] Decode batch. #running-req: 2, #token: 2633, token usage: 0.06, gen throughput (token/s): 66.08, #queue-req: 0
[13:16:11 TP0] Decode batch. #running-req: 1, #token: 1352, token usage: 0.03, gen throughput (token/s): 37.88, #queue-req: 0
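
As a rough sanity check on memory headroom, the decode-batch lines above can be used to back out the KV-cache token pool size, assuming the logged token usage is the fraction of that pool in use (usage is printed with only two decimals, so the numbers are approximate):

```python
# Back out the KV token pool size implied by the decode-batch log lines,
# assuming token usage = #token / pool size (two-decimal rounding in the logs).
samples = [(4195, 0.10), (4515, 0.11), (3488, 0.08), (2553, 0.06)]
for tokens, usage in samples:
    print(f"{tokens:>5} tokens at usage {usage:.2f} -> pool ~ {tokens / usage:,.0f} tokens")
# Every sample implies a pool of roughly 40,000-45,000 tokens, far above the
# ~4,000 tokens these four requests occupy at once.
```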

Environment

sglang: 0.3.3.post1

@hahmad2008 (Author)

@merrymercy Could you please help?
