Concurrency Issue: Multiple Requests Not Being Processed Simultaneously #1947

Open

hahmad2008 opened this issue Nov 7, 2024 · 1 comment

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

I'm running into an issue with sglang where multiple requests are not being processed concurrently, even though their combined token count is well below the maximum context length, and the GPU memory should be able to handle them.

I'm sending 4 concurrent requests of 1,000 tokens each, 4,000 tokens in total. This fits comfortably within the 8,000-token context length, and GPU memory should be more than enough to handle all 4 requests at once.

However, only 3 of the requests are processed at the same time; the 4th ends up queued and, as a result, has a much higher Time to First Token (TTFT) than the others.

What I Expected:

  • All 4 requests should be processed simultaneously since the sum of their tokens is well within the GPU's context capacity (and GPU memory is also sufficient).

Questions

  • Is there a limit within sglang that's preventing more than 3 concurrent requests from being processed at once, even though there’s enough memory and token capacity?
  • Could there be a memory management or resource allocation issue that’s causing this behavior?
  • Is there any configuration I'm missing, or anything I should adjust, to allow all requests to be processed simultaneously? (See the launch sketch after this list.)
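
A minimal sketch (not a confirmed fix) of relaunching the server with the scheduler-related knobs from the ServerArgs dump below made explicit. The flag names are assumed to mirror the ServerArgs fields (max_running_requests, schedule_conservativeness, mem_fraction_static), the value 8 for --max-running-requests is purely illustrative, and the exact flags should be checked against `python -m sglang.launch_server --help` for this sglang version.

```python
# Illustrative relaunch with the concurrency-related knobs spelled out.
# Flag names are assumed to mirror the ServerArgs fields shown below; verify
# them with `python -m sglang.launch_server --help` before relying on this.
import subprocess

subprocess.run([
    "python", "-m", "sglang.launch_server",
    "--model-path", "solidrust/gemma-2-9b-it-AWQ",
    "--port", "10000",
    "--context-length", "8192",
    "--mem-fraction-static", "0.91",
    "--max-running-requests", "8",          # explicit cap instead of the default None
    "--schedule-conservativeness", "1.0",
], check=True)
```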

Reproduction

Setup:

  • Model: solidrust/gemma-2-9b-it-AWQ
  • GPU: Single GPU with 24 GB memory
  • Context Length: 8,000 tokens
  • Request Length: 1,000 tokens each
  • Concurrent Requests: 4 requests, each with 1,000 tokens (see the client sketch below)
  • max-tokens: 1024
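
Client side of the reproduction, as a minimal sketch: it assumes the server's OpenAI-compatible /v1/completions endpoint on the host and port from the ServerArgs dump below, uses a repeated word as a stand-in for a roughly 1,000-token prompt, and takes the first streamed chunk as an approximation of time to first token.

```python
# Minimal reproduction client: fire 4 identical ~1,000-token requests at once
# and record an approximate time-to-first-token (TTFT) for each.
# Assumes the OpenAI-compatible /v1/completions endpoint from the ServerArgs below.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://127.0.0.1:10000/v1/completions"
PROMPT = "word " * 1000  # stand-in for a roughly 1,000-token prompt

def one_request(i: int) -> float:
    start = time.time()
    with requests.post(
        URL,
        json={
            "model": "solidrust/gemma-2-9b-it-AWQ",
            "prompt": PROMPT,
            "max_tokens": 1024,
            "stream": True,
        },
        stream=True,
    ) as resp:
        for line in resp.iter_lines():
            if line:  # first streamed SSE chunk ~= first generated token
                return time.time() - start
    return float("nan")

with ThreadPoolExecutor(max_workers=4) as pool:
    ttfts = list(pool.map(one_request, range(4)))
print("approx. TTFT per request (s):", [round(t, 2) for t in ttfts])
```

Plain threads are enough here because each worker just blocks on its HTTP stream, so all 4 requests hit the server at essentially the same time.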

ServerArgs:
server_args=ServerArgs(model_path='solidrust/gemma-2-9b-it-AWQ', tokenizer_path='solidrust/gemma-2-9b-it-AWQ', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', dtype='bfloat16', device='cuda', kv_cache_dtype='auto', trust_remote_code=True, context_length=8192, quantization=None, served_model_name='solidrust/gemma-2-9b-it-AWQ', chat_template=None, is_embedding=False, host='127.0.0.1', port=10000, mem_fraction_static=0.91, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=8192, schedule_policy='lpm', schedule_conservativeness=1.0, tp_size=1, stream_interval=1, random_seed=740475642, constrained_json_whitespace_pattern=None, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, api_key=None, file_storage_pth='SGLang_storage', dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', attention_backend='flashinfer', sampling_backend='flashinfer', disable_flashinfer=False, disable_flashinfer_sampling=False, disable_radix_cache=False, disable_regex_jump_forward=False, disable_cuda_graph=True, disable_cuda_graph_padding=True, disable_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, enable_mixed_chunk=False, enable_torch_compile=False, max_torch_compile_bs=32, torchao_config='', enable_p2p_check=False, triton_attention_reduce_in_fp32=False, lora_paths=None, max_loras_per_batch=8)

Logs:

[13:15:58 TP0] Prefill batch. #new-seq: 1, #new-token: 1008, #cached-token: 5, cache hit rate: 0.55%, token usage: 0.00, #running-req: 0, #queue-req: 0
[13:15:59 TP0] Prefill batch. #new-seq: 3, #new-token: 2923, #cached-token: 26, cache hit rate: 0.79%, token usage: 0.02, #running-req: 1, #queue-req: 0
[13:16:01 TP0] Decode batch. #running-req: 4, #token: 4035, token usage: 0.09, gen throughput (token/s): 1.93, #queue-req: 0
[13:16:02 TP0] Decode batch. #running-req: 4, #token: 4195, token usage: 0.10, gen throughput (token/s): 132.88, #queue-req: 0
[13:16:04 TP0] Decode batch. #running-req: 4, #token: 4355, token usage: 0.10, gen throughput (token/s): 132.35, #queue-req: 0
[13:16:05 TP0] Decode batch. #running-req: 4, #token: 4515, token usage: 0.11, gen throughput (token/s): 132.35, #queue-req: 0
[13:16:06 TP0] Decode batch. #running-req: 3, #token: 3488, token usage: 0.08, gen throughput (token/s): 120.36, #queue-req: 0
[13:16:08 TP0] Decode batch. #running-req: 2, #token: 2553, token usage: 0.06, gen throughput (token/s): 67.14, #queue-req: 0
[13:16:10 TP0] Decode batch. #running-req: 2, #token: 2633, token usage: 0.06, gen throughput (token/s): 66.08, #queue-req: 0
[13:16:11 TP0] Decode batch. #running-req: 1, #token: 1352, token usage: 0.03, gen throughput (token/s): 37.88, #queue-req: 0
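
As a rough sanity check on memory headroom, the decode-batch lines above can be used to back out the KV-cache token pool size, assuming the logged token usage is the fraction of that pool in use (usage is printed with only two decimals, so the numbers are approximate):

```python
# Back out the KV token pool size implied by the decode-batch log lines,
# assuming token usage = #token / pool size (two-decimal rounding in the logs).
samples = [(4195, 0.10), (4515, 0.11), (3488, 0.08), (2553, 0.06)]
for tokens, usage in samples:
    print(f"{tokens:>5} tokens at usage {usage:.2f} -> pool ~ {tokens / usage:,.0f} tokens")
# Every sample implies a pool of roughly 40,000-45,000 tokens, far above the
# ~4,000 tokens these four requests occupy at once.
```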

Environment

sglang: 0.3.3.post1

@hahmad2008 (Author)

@merrymercy Could you please help?
