Scheduler methods #1913
base: main
Conversation
Very cool work! Are you on the Slack channel? Let's have an offline discussion: https://join.slack.com/t/sgl-fru7574/shared_invite/zt-2s97q9mki-hAaMglU8sV6pQvi3dttgIw
```python
self.workers.append(send_to)
base_gpu_id += server_args.tp_size
```

```python
if self.pre_raidx:
    import threading
```
Move this import to the top of the file.
Can you fix the CI tests? If this change is lightweight, we can also merge it.
```python
try:
    node = deepcopy(self.tree_cache.root_node)
    send_data = RadixCacheSend(
        gpu_id=self.gpu_id, root_node=node, time=time.time()
    )
```
The radix cache may even contain GPU tensors. Please send only a simplified version without any GPU tensors.
I checked the PyTorch docs and found that we may be able to use torch.multiprocessing. What do you think?
https://pytorch.org/docs/stable/notes/multiprocessing.html
torch.multiprocessing is not helpful here, because we do not need to transfer any TreeNode.value in the radix tree. You should implement a function that drops all TreeNode.value entries from the tree.
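The suggestion above (copy the tree structure but drop every TreeNode.value so nothing referencing GPU memory crosses the process boundary) could be sketched as follows. The TreeNode class here is a hypothetical minimal stand-in, not sglang's actual node class:

```python
class TreeNode:
    """Hypothetical minimal radix-tree node; the real sglang class has more fields."""
    def __init__(self, key=None, value=None):
        self.key = key          # token ids labeling the edge into this node
        self.value = value      # KV-cache indices (a GPU tensor in practice)
        self.children = {}      # key -> TreeNode

def strip_values(node: TreeNode) -> TreeNode:
    """Return a structural copy of the tree with every TreeNode.value dropped,
    so it can be pickled and sent between processes without GPU tensors."""
    stripped = TreeNode(key=node.key, value=None)
    for child_key, child in node.children.items():
        stripped.children[child_key] = strip_values(child)
    return stripped
```

The stripped copy preserves keys and topology, which is all the scheduler needs for prefix matching.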
We have added two new load-balancing methods: resources_aware and pre_radix.
resources_aware
resources_aware dynamically schedules requests based on each worker's GPU resource usage. The comparison results for resources_aware are shown in the figure.
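The selection rule could look like the following sketch. The stat fields and numbers are hypothetical illustrations, not the PR's actual data structures:

```python
# Hypothetical per-worker load statistics; in practice the workers would
# report real GPU memory usage and queue lengths back to the scheduler.
workers = [
    {"id": 0, "free_kv_slots": 1200, "pending_reqs": 3},
    {"id": 1, "free_kv_slots": 4000, "pending_reqs": 1},
    {"id": 2, "free_kv_slots": 800,  "pending_reqs": 7},
]

def pick_worker(workers):
    """Prefer the worker with the fewest pending requests, breaking ties
    by the largest free KV-cache capacity."""
    return min(workers, key=lambda w: (w["pending_reqs"], -w["free_kv_slots"]))
```

Compared with round_robin, this avoids piling new requests onto a worker that is already memory-constrained or backlogged.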
The script and environment that produce the results are as follows.

Serving:

```shell
/workspace/bin/micromamba run -n sglang python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B --host 127.0.0.1 --port 8080 --mem-fraction-static 0.7 --dp-size 8 --load-balance-method resources_aware
```

Benchmark:

```shell
/workspace/bin/micromamba run -n sglang python3 -m sglang.bench_serving --backend sglang --host 127.0.0.1 --port 8080 --dataset-name random --tokenizer meta-llama/Meta-Llama-3.1-8B --model meta-llama/Meta-Llama-3.1-8B --random-output-len 1024 --random-input-len 4096 --random-range-ratio 0.5 --seed 1234 --num-prompts 90000 --request-rate 15.7
```
pre_radix
pre_radix builds on resources_aware. It can greatly improve the KV-cache hit rate and is mainly intended for multi-turn dialogue workloads. Its results are as follows:
We also measured the cache hit rate during inference; the results are as follows:

[Figure: round_robin cache hit rate vs. pre_radix cache hit rate]
The script and environment that produce the results are as follows:

```shell
/workspace/bin/micromamba run -n sglang python3 -m sglang.launch_server --model-path Qwen/Qwen2-7B --host 127.0.0.1 --port 8080 --mem-fraction-static 0.7 --dp-size 8 --load-balance-method pre_radix
```

```shell
/workspace/bin/micromamba run -n sglang python3 /workspace/sglang/benchmark/multi_turn_chat/bench_sglang.py --tokenizer Qwen/Qwen2-7B --port 8080 --parallel 128 --min-len-q 128 --max-len-q 256 --min-len-a 256 --max-len-a 512 --turns 20 --num-qa 256
```
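The prefix-aware routing idea behind pre_radix could be sketched as follows. This is a simplification under assumed names: each worker's radix tree is flattened here to plain token-id lists, and the function names are hypothetical rather than the PR's actual implementation:

```python
def match_len(cached_tokens, request_tokens):
    """Length of the shared prefix between a cached sequence and a request."""
    n = 0
    for a, b in zip(cached_tokens, request_tokens):
        if a != b:
            break
        n += 1
    return n

def pick_worker_pre_radix(worker_prefixes, request_tokens, fallback):
    """Route to the worker whose cache shares the longest prefix with the
    request, maximizing KV-cache reuse; fall back to a resources-aware
    choice when no worker's cache matches at all."""
    best_id, best_len = None, 0
    for worker_id, prefixes in worker_prefixes.items():
        for cached in prefixes:
            m = match_len(cached, request_tokens)
            if m > best_len:
                best_id, best_len = worker_id, m
    return best_id if best_id is not None else fallback
```

In a multi-turn chat, every follow-up turn shares the conversation history as a prefix, so routing it back to the worker that already holds that prefix is what drives the hit-rate gains shown above.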
By the way, we modified the benchmark code so that the number of rounds in each multi-turn dialogue is a random value, to make the experimental results more convincing.
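That randomization could be as simple as the following sketch (the function name and bounds are hypothetical; the actual benchmark edit is not shown in this PR excerpt):

```python
import random

def sample_turns(min_turns=2, max_turns=20, seed=None):
    """Draw a random number of dialogue rounds per conversation instead of
    a fixed --turns value, so results are not tied to one conversation length."""
    rng = random.Random(seed)
    return rng.randint(min_turns, max_turns)
```

Seeding per conversation keeps the run reproducible while still varying the turn counts.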