Checklist
1. I have searched related issues but cannot get the expected help.
2. The bug has not been fixed in the latest version.
3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
5. Please use English, otherwise it will be closed.
Describe the bug
I'm running into an issue with sglang where multiple requests are not processed concurrently, even though their combined token count is well below the maximum context length and GPU memory should be able to hold them.
I'm sending 4 concurrent requests of 1,000 tokens each, 4,000 tokens in total. This should easily fit within the 8,192-token context length, and GPU memory should be more than enough to handle all 4 requests at once.
However, only 3 of the requests are processed at the same time; the 4th ends up queued and sees a much higher time to first token (TTFT) than the others.
What I Expected:
All 4 requests should be processed simultaneously, since their combined token count is well within the configured context length and GPU memory is also sufficient.
Questions
Is there a limit within sglang that prevents more than 3 concurrent requests from being processed at once, even though there is enough memory and token capacity?
Could a memory-management or resource-allocation issue be causing this behavior?
Is there any configuration I'm missing, or anything I should adjust, to allow all 4 requests to be processed simultaneously? (See the relaunch sketch below.)
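On the last question: one thing I could try is relaunching the server with the concurrency and token-budget caps set explicitly instead of leaving them at None. The sketch below is a guess, not a known fix; the flag names mirror the ServerArgs fields shown under Reproduction (max_running_requests, max_total_tokens), and the chosen values are placeholders.

```python
# Hypothetical relaunch: flag names mirror the ServerArgs fields reported
# below; the values (8, 8192) are placeholders, not known-good settings.
import subprocess

subprocess.run([
    "python", "-m", "sglang.launch_server",
    "--model-path", "solidrust/gemma-2-9b-it-AWQ",
    "--port", "10000",
    "--context-length", "8192",
    "--max-running-requests", "8",   # explicit concurrency cap instead of None
    "--max-total-tokens", "8192",    # explicit KV token budget instead of None
    "--mem-fraction-static", "0.91",
])
```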
Reproduction
Setup:
Model: solidrust/gemma-2-9b-it-AWQ
GPU: Single GPU with 24 GB memory
Context Length: 8,192 tokens
Request Length: 1,000 tokens each
Concurrent Requests: 4 (see the client sketch below)
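A minimal client-side sketch of the test, for reference. It assumes the OpenAI-compatible /v1 endpoint that sglang serves on the port above; the prompt content and max_tokens value are placeholders rather than my exact payload.

```python
# Fires 4 concurrent streaming requests and records each one's time to first
# token (TTFT). Prompt content and max_tokens are placeholders.
import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://127.0.0.1:10000/v1", api_key="EMPTY")

async def one_request(idx: int, prompt: str) -> None:
    start = time.perf_counter()
    stream = await client.chat.completions.create(
        model="solidrust/gemma-2-9b-it-AWQ",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,  # placeholder completion budget
        stream=True,
    )
    ttft = None
    async for _chunk in stream:
        if ttft is None:
            ttft = time.perf_counter() - start  # first streamed chunk
    if ttft is not None:
        print(f"request {idx}: TTFT = {ttft:.2f}s")

async def main() -> None:
    prompt = "word " * 1000  # roughly 1,000-token prompt (placeholder content)
    await asyncio.gather(*(one_request(i, prompt) for i in range(4)))

asyncio.run(main())
```

Run against the server above, this reproduces the pattern described earlier: 3 requests start streaming together while the 4th starts noticeably later.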
ServerArgs:
server_args=ServerArgs(
    model_path='solidrust/gemma-2-9b-it-AWQ', tokenizer_path='solidrust/gemma-2-9b-it-AWQ',
    tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', dtype='bfloat16',
    device='cuda', kv_cache_dtype='auto', trust_remote_code=True, context_length=8192,
    quantization=None, served_model_name='solidrust/gemma-2-9b-it-AWQ', chat_template=None,
    is_embedding=False, host='127.0.0.1', port=10000, mem_fraction_static=0.91,
    max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192,
    max_prefill_tokens=8192, schedule_policy='lpm', schedule_conservativeness=1.0,
    tp_size=1, stream_interval=1, random_seed=740475642,
    constrained_json_whitespace_pattern=None, log_level='info', log_level_http=None,
    log_requests=False, show_time_cost=False, api_key=None, file_storage_pth='SGLang_storage',
    dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0,
    json_model_override_args='{}', attention_backend='flashinfer', sampling_backend='flashinfer',
    disable_flashinfer=False, disable_flashinfer_sampling=False, disable_radix_cache=False,
    disable_regex_jump_forward=False, disable_cuda_graph=True, disable_cuda_graph_padding=True,
    disable_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False,
    enable_mixed_chunk=False, enable_torch_compile=False, max_torch_compile_bs=32,
    torchao_config='', enable_p2p_check=False, triton_attention_reduce_in_fp32=False,
    lora_paths=None, max_loras_per_batch=8)
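One hypothesis for the 3-vs-4 behavior, which I have not verified against the scheduler source: the admission budget may be the KV-cache token pool, which is sized from free GPU memory rather than from context_length, and each running request may reserve its prompt length plus a completion budget. With made-up numbers, that arithmetic already stops at 3:

```python
# Back-of-envelope sketch with hypothetical numbers: kv_pool_tokens and
# reserved_new_tokens are guesses, not values read from the server.
kv_pool_tokens = 7_000       # hypothetical KV-cache pool, sized from GPU memory
prompt_len = 1_000           # per-request prompt size (from the setup above)
reserved_new_tokens = 1_024  # hypothetical per-request completion reservation

per_request = prompt_len + reserved_new_tokens  # tokens reserved per admitted request
print(kv_pool_tokens // per_request)            # -> 3 with these numbers
```

If something like this is happening, the 4th request cannot reserve its tokens until a running request finishes, which would match both the queuing and the higher TTFT.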
Logs:
Environment
sglang: 0.3.3.post1