[rollout] fix: use the same request_ids in each rollout turn for better tracking in rollout backend #5271
PeterSH6 wants to merge 1 commit into verl-project:main
Conversation
Code Review
The pull request modifies the generate method in AsyncLLMServerManager to use the same request_id for each rollout turn, which is intended to make trajectories easier to track and abort in the backend. The change replaces uuid4().hex with the input request_id when calling server.generate.remote. This is reasonable and aligns with the stated goal of better tracking. However, it is crucial to ensure that the request_id is properly managed and unique across different trajectories to avoid unintended side effects.
      server = self._choose_server(request_id)
      output = await server.generate.remote(
    -     request_id=uuid4().hex,  # use new request_id for each turn
    +     request_id=request_id,  # use the same request_id for better tracking
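For readers skimming the diff, here is a minimal, self-contained sketch of the call path after this change. `AsyncLLMServerManagerSketch`, `FakeServer`, and the simplified `generate` signature are illustrative stand-ins, not the actual verl classes:

```python
import asyncio
from uuid import uuid4


class FakeServer:
    """Stand-in for a rollout backend that records the request_id of each turn."""

    def __init__(self) -> None:
        self.seen: list[str] = []

    async def generate(self, request_id: str, prompt: str) -> str:
        self.seen.append(request_id)
        return f"completion for {prompt!r}"


class AsyncLLMServerManagerSketch:
    """Illustrative manager: one trajectory-level request_id for routing and every turn."""

    def __init__(self, servers: list[FakeServer]) -> None:
        self._servers = servers

    def _choose_server(self, request_id: str) -> FakeServer:
        # Deterministic choice within a process: one trajectory -> one server.
        return self._servers[hash(request_id) % len(self._servers)]

    async def generate(self, request_id: str, prompt: str) -> str:
        server = self._choose_server(request_id)
        # After the fix: forward the trajectory-level request_id
        # instead of minting uuid4().hex for every turn.
        return await server.generate(request_id=request_id, prompt=prompt)


async def main() -> None:
    servers = [FakeServer(), FakeServer()]
    manager = AsyncLLMServerManagerSketch(servers)
    trajectory_id = uuid4().hex
    for turn in range(3):
        await manager.generate(request_id=trajectory_id, prompt=f"turn {turn}")
    # Every turn reaches the backend under the same request_id.
    assert [rid for s in servers for rid in s.seen] == [trajectory_id] * 3


asyncio.run(main())
```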
Using the same request_id across multiple turns could lead to issues if the backend relies on unique request_ids for identifying individual requests. This might cause incorrect caching or interference between turns. It's important to ensure that the backend system can handle the same request_id for related turns within a trajectory without any conflicts or errors.
Consider adding a mechanism to ensure uniqueness across different trajectories, perhaps by incorporating a trajectory ID or session ID into the request_id.
Suggested change:

    -     request_id=request_id,  # use the same request_id for better tracking
    +     request_id=f"{request_id}_turn",  # use the same request_id for better tracking
Code Review
This pull request correctly fixes a bug in AsyncLLMServerManager where a new, random request_id was generated for each generation turn, ignoring the request_id associated with the trajectory. By propagating the correct request_id to the backend server, this change enables consistent tracking of multi-turn rollouts and ensures sticky sessions function as intended for prefix caching. The fix is sound and aligns with the stated goals of improving traceability and control over rollouts.
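To make the sticky-session point concrete, here is a toy model (not any real engine's cache) of why a stable request_id matters for prefix caching: a hit is only possible when consecutive turns of a trajectory reach the server that already holds their shared prefix.

```python
class ToyServer:
    """Toy per-server prefix cache; real engines such as vLLM manage KV blocks instead."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.cached_prefixes: set[str] = set()

    def generate(self, prompt: str) -> str:
        # A "hit" means some previously served prompt is a prefix of this one.
        hit = any(prompt.startswith(p) for p in self.cached_prefixes)
        self.cached_prefixes.add(prompt)
        return "hit" if hit else "miss"


def route(request_id: str, servers: list[ToyServer]) -> ToyServer:
    # Deterministic routing keyed on the trajectory-level request_id.
    return servers[sum(request_id.encode()) % len(servers)]


servers = [ToyServer("s0"), ToyServer("s1")]
request_id = "traj-abc123"          # stays the same for every turn after this PR
history = "system: ...\nuser: q1\n"
print(route(request_id, servers).generate(history))   # miss: cold cache
history += "assistant: a1\nuser: q2\n"
print(route(request_id, servers).generate(history))   # hit: same server, shared prefix
```

With a fresh uuid4().hex per turn, the second call could land on a different server and miss the cached prefix entirely.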
There's an issue report that the same `request_id` …
      server = self._choose_server(request_id)
      output = await server.generate.remote(
    -     request_id=uuid4().hex,  # use new request_id for each turn
    +     request_id=request_id,  # use the same request_id for better tracking
We'd better use a new request_id for each turn.
For each trajectory, if we use the same request_id across turns, it is easier to track its progress in the vllm/sglang/trtllm backend.
It also becomes easier to abort a single trajectory when needed.
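As a hedged illustration of the abort benefit (the registry below is made up for this sketch; real backends such as vLLM key in-flight requests by request id and can abort by that id), reusing one request_id per trajectory lets a single abort call cover every pending turn:

```python
import asyncio


class RolloutTracker:
    """Illustrative registry: request_id -> in-flight generation tasks."""

    def __init__(self) -> None:
        self._inflight: dict[str, set[asyncio.Task]] = {}

    def track(self, request_id: str, coro) -> asyncio.Task:
        task = asyncio.ensure_future(coro)
        self._inflight.setdefault(request_id, set()).add(task)
        task.add_done_callback(lambda t: self._inflight[request_id].discard(t))
        return task

    def abort(self, request_id: str) -> None:
        # One call cancels every pending turn that shares this request_id.
        for task in list(self._inflight.get(request_id, ())):
            task.cancel()


async def main() -> None:
    tracker = RolloutTracker()
    req = "traj-abc123"
    turn = tracker.track(req, asyncio.sleep(60))  # stand-in for a long generation turn
    tracker.abort(req)
    try:
        await turn
    except asyncio.CancelledError:
        print(f"aborted trajectory {req}")


asyncio.run(main())
```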
What does this PR do?
Checklist Before Starting
- Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `veomni`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`, `cfg`, `reward`, like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API, add `[BREAKING]` to the beginning of the title, e.g. `[BREAKING][fsdp, megatron] feat: dynamic batching`

Test
API and Usage Example
    # Add code snippet or script demonstrating how to use this

Design & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- Apply pre-commit checks: `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- Once your PR is ready for CI, send a message in the `ci-request` channel in the `verl` Slack workspace. (If not accessible, please try the Feishu group (飞书群).)
- If your PR changes the `recipe` submodule, please also update the reference to the submodule commit via `git submodule update --remote` or `cd recipe && git pull origin main`.