
Commit 0640ff5

yyDing1 authored and sunnweiwei committed
[doc] feat: update doc of reward loop (volcengine#3880)
### What does this PR do?

- add more explanation
- update all pictures used in the doc

### Checklist Before Starting

- [ ] Search for similar PRs. Paste at least one query link here: ...
- [ ] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`
  - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

> For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

### API and Usage Example

> Demonstrate how the API changes if any, and provide usage example(s) if possible.

```python
# Add code snippet or script demonstrating how to use this
```

### Design & Code Changes

> Demonstrate the high-level design if this PR is complex, and list the specific changes.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- [ ] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [ ] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [ ] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ...
- [ ] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)
1 parent f8e3557 commit 0640ff5

1 file changed: +47 −34 lines changed

docs/advance/reward_loop.rst

Lines changed: 47 additions & 34 deletions
@@ -1,10 +1,10 @@
 Reward Loop
 ===========
 
-Last updated: 10/21/2025.
+Last updated: 10/23/2025.
 
 .. warning::
-    Reward Loop is still in progress.
+    Reward Loop is ready for use, but the API may change in future releases.
 
 Reward Loop is designed for more flexible and easy-to-use reward computation.
 
@@ -14,18 +14,20 @@ Reward Loop is designed for more flexible and easy-to-use reward computation.
 - Support broader reward model interface (including discriminative and generative models)
 - Make user customized reward function more flexible
 
-.. image:: https://github.com/yyDing1/verl-materials/blob/main/reward_loop_overview.png?raw=true
+.. image:: https://github.com/yyDing1/verl-materials/blob/main/reward_loop_overview.svg?raw=true
 
 Async Reward Computation
 ------------------------
 
-Reward Loop refactors the design of the reward manager so that each sample is processed asynchronously in the `run_single` function, to enable more efficient and flexible reward computation.
-Take the `NaiveRewardLoopManager` as an example:
+RewardLoopManager
+~~~~~~~~~~~~~~~~~
+
+The Reward Loop refactors the design of the reward manager so that each sample is processed asynchronously in the ``run_single`` function.
+This asynchronous design enables the Reward Loop to handle multiple reward computations concurrently, significantly improving computation efficiency.
 
 .. code:: python
 
     class RewardLoopManagerBase(ABC):
-        @abstractmethod
         async def run_single(self, data: DataProto) -> dict:
             # ... (data preprocessing)
             if self.is_async_reward_score:
@@ -52,14 +54,16 @@ Take the `NaiveRewardLoopManager` as an example:
             # ... (reward postprocessing)
             return final_result
 
-To support this feature, user-customized reward functions can be implemented as either synchronous or asynchronous.
-`RewardLoopManager` automatically determines whether the user-customized reward function is asynchronous or synchronous and handles it accordingly, ensuring that the current process remains non-blocking.
+User-defined reward functions can be implemented as either synchronous or asynchronous.
+``RewardLoopManager`` automatically detects the type of the user-defined function and executes it accordingly, ensuring that the reward computation process remains non-blocking.
+
+User-Customized Reward Function
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-Reward Model Interface
-----------------------
+Users can define custom reward functions, for instance, by integrating external generative rewards or rule-based rewards to accommodate diverse scenario requirements.
 
-In the RewardLoopManger, we directly expose the reward model interface to support more complex reward computation scenarios involving reward models.
-For example, a user-defined reward function can be written as follows:
+To facilitate this, the Reward Loop directly exposes the reward model interface, enabling complex reward computation pipelines that involve model-based scoring.
+A user-defined reward function may look like the following:
 
 .. code:: python
 
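To make the sync/async handling described above concrete, here is a minimal sketch of how a manager can detect whether a user reward function is a coroutine and run either kind without blocking the event loop. This is illustrative only, not verl's actual code; `dispatch_reward_fn` and the two toy reward functions are hypothetical names.

```python
# Illustrative sketch only (not verl's implementation): accept either a sync
# or an async user reward function and keep the event loop non-blocking.
import asyncio
import inspect


async def dispatch_reward_fn(user_reward_fn, **kwargs):
    """Run a user reward function without blocking the event loop."""
    if inspect.iscoroutinefunction(user_reward_fn):
        # Async user function: await it directly.
        return await user_reward_fn(**kwargs)
    # Sync user function: run it in a worker thread so other samples
    # can keep being scored concurrently.
    return await asyncio.to_thread(user_reward_fn, **kwargs)


def rule_based_score(solution_str, ground_truth, **_):
    # A purely synchronous, rule-based reward.
    return {"score": float(solution_str.strip() == ground_truth)}


async def model_based_score(solution_str, ground_truth, **_):
    # An async reward, e.g. one that would query a reward model server.
    await asyncio.sleep(0)  # stand-in for an HTTP call
    return {"score": 1.0}


async def main():
    results = await asyncio.gather(
        dispatch_reward_fn(rule_based_score, solution_str="42", ground_truth="42"),
        dispatch_reward_fn(model_based_score, solution_str="41", ground_truth="42"),
    )
    print(results)


if __name__ == "__main__":
    asyncio.run(main())
```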
@@ -73,36 +77,45 @@ For example, a user-defined reward function can be written as follows:
     ):
         """Compute the reward score."""
 
+        # Step 1: Prepare prompt and request payload
         grm_prompt = GRM_PROMPT_TEMPLATE.format(problem=extra_info["question"], solution=solution_str)
         messages = [{"role": "user", "content": grm_prompt}]
         sampling_params = {"temperature": 0.7, "top_p": 0.8, "max_tokens": 4096}
-        chat_complete_request = {
-            "messages": messages,
-            **sampling_params,
-        }
+        chat_complete_request = {"messages": messages, **sampling_params}
+
+        # Step 2: Send async request to the reward model
+        # here, chat_complete sends async http request to the router address
         result = await chat_complete(
             router_address=reward_router_address,
             chat_complete_request=chat_complete_request,
         )
-        grm_response = result.choices[0].message.content
+
+        # Step 3: Parse model response and extract score
+        grm_response = result.choices[0].message.content.strip()
         try:
-            score = int(grm_response.split("\n\n")[-1].strip())
+            score_str = grm_response.split("\n\n")[-1].strip()
+            score = int(score_str)
         except Exception:
             score = 0
-        return {"score": score, "acc": score == 10}
 
-We provide runable examples in the `recipe/fapo` directory.
+        return {"score": score}
 
-Reward models with single router
---------------------------------
+Runnable examples are provided in the ``recipe/fapo`` directory for reference.
 
-We launch multiple reward servers first and then register them in the reward router. This router will forward the requests to the registered reward servers with load balancing and return the results.
-So we can expose the unique reward router address to the user-customized reward function, and the user can use this address to access the reward models.
+Reward Models and Router
+------------------------
+
+To support flexible and scalable reward model computation, the Reward Loop implements a reward router that coordinates requests among multiple reward model servers.
+
+Each reward model runs as an independent server and is registered with the router.
+The router forwards requests to the registered reward servers with load balancing and returns the results.
+
+This design allows us to expose a single unified router address to user-defined reward functions, enabling them to access various reward models seamlessly through the same interface.
 
 RewardModelManager
 ~~~~~~~~~~~~~~~~~~
 
-.. image:: https://github.com/yyDing1/verl-materials/blob/main/reward_loop_full.png?raw=true
+.. image:: https://github.com/yyDing1/verl-materials/blob/main/reward_loop_full.svg?raw=true
 
 `RewardModelManager` will launch multiple reward servers and register them in the reward router.
 
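To illustrate why the per-sample async design pays off, the following self-contained sketch scores a batch of samples concurrently, with every call going through one router address. All names here (`fake_chat_complete`, `score_one`, the address) are placeholders, not verl APIs.

```python
# Illustrative sketch (not part of the commit): each per-sample reward call
# awaits a (stubbed) request to the router, and asyncio.gather keeps all of
# them in flight at once through the same router address.
import asyncio
import random


async def fake_chat_complete(router_address: str, chat_complete_request: dict) -> str:
    # Stand-in for an HTTP request to the reward router.
    await asyncio.sleep(random.uniform(0.1, 0.3))
    return "8"


async def score_one(router_address: str, solution_str: str) -> dict:
    response = await fake_chat_complete(
        router_address=router_address,
        chat_complete_request={"messages": [{"role": "user", "content": solution_str}]},
    )
    try:
        score = int(response)
    except ValueError:
        score = 0
    return {"score": score}


async def main():
    router_address = "127.0.0.1:30000"  # hypothetical router address
    solutions = [f"solution {i}" for i in range(8)]
    # All eight reward requests run concurrently rather than one after another.
    results = await asyncio.gather(*(score_one(router_address, s) for s in solutions))
    print(results)


if __name__ == "__main__":
    asyncio.run(main())
```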

@@ -116,18 +129,18 @@ RewardModelManager
             Initialize the reward model manager.
 
             Args:
-            config (RewardModelConfig): Reward model configuration.
-            worker_group (RayWorkerGroup, optional): Worker group. Defaults to None.
+                config (RewardModelConfig): Reward model configuration.
+                worker_group (RayWorkerGroup, optional): Worker group. Defaults to None.
             """
             self.config = config
             self.worker_group = worker_group
             self._initialize_llm_servers()
             self._initialize_router()
             if self.config.rollout.free_cache_engine:
-            self.sleep()
+                self.sleep()
 
-Router
-~~~~~~
+Reward Router
+~~~~~~~~~~~~~
 
 The router is to forward the requests to the registered reward servers with load balancing.
 - For sglang reward servers, we directly use the sglang router to forward the requests.
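The load-balancing behavior described here can be approximated by a very small round-robin forwarder. The sketch below is illustrative only, not verl's router; the class name, endpoint path, and server addresses are assumptions.

```python
# Illustrative sketch (not verl's router implementation): a minimal round-robin
# forwarder that keeps a list of registered reward-server addresses and sends
# each chat-completion request to the next one.
import asyncio
from itertools import cycle

import aiohttp


class RoundRobinRewardRouter:
    def __init__(self, server_addresses: list[str]):
        self._servers = cycle(server_addresses)

    async def chat_complete(self, chat_complete_request: dict) -> dict:
        # Pick the next registered reward server (simple load balancing).
        server = next(self._servers)
        async with aiohttp.ClientSession() as session:
            async with session.post(
                f"http://{server}/v1/chat/completions", json=chat_complete_request
            ) as resp:
                return await resp.json()


async def main():
    # Hypothetical addresses of two already-launched reward model servers.
    router = RoundRobinRewardRouter(["127.0.0.1:30001", "127.0.0.1:30002"])
    result = await router.chat_complete(
        {"messages": [{"role": "user", "content": "Score this solution."}]}
    )
    print(result)


if __name__ == "__main__":
    asyncio.run(main())
```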
@@ -168,10 +181,10 @@ The router is to forward the requests to the registered reward servers with load balancing.
             # Placeholder for aiohttp client
             self.client = None
 
-Integrate with AgentLoop
-------------------------
+Agent Reward Loop
+-----------------
 
-Reward Loop can be integrated with AgentLoop to enable sample-wise rollout and reward computation.
+RewardLoop can be integrated with AgentLoop to enable sample-wise rollout and reward computation.
 
-.. image:: https://github.com/yyDing1/verl-materials/blob/main/agent_reward_loop.png?raw=true
+.. image:: https://github.com/yyDing1/verl-materials/blob/main/agent_reward_loop.svg?raw=true
 
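A rough picture of what sample-wise rollout plus reward computation means in practice: each sample has its own coroutine that first generates a rollout and then computes its reward, so reward calls for finished samples overlap with rollouts still in progress. This is an illustrative sketch; `generate_rollout` and `compute_reward` are hypothetical stand-ins, not AgentLoop APIs.

```python
# Illustrative sketch (not the actual AgentLoop code): per-sample pipelines
# run concurrently, each doing rollout followed by reward computation.
import asyncio
import random


async def generate_rollout(prompt: str) -> str:
    await asyncio.sleep(random.uniform(0.1, 0.3))  # stand-in for generation
    return f"response to {prompt!r}"


async def compute_reward(prompt: str, response: str) -> float:
    await asyncio.sleep(random.uniform(0.05, 0.1))  # stand-in for a reward model call
    return float(len(response) % 2)


async def run_one_sample(prompt: str) -> dict:
    response = await generate_rollout(prompt)
    reward = await compute_reward(prompt, response)
    return {"prompt": prompt, "response": response, "reward": reward}


async def main():
    prompts = [f"prompt {i}" for i in range(4)]
    results = await asyncio.gather(*(run_one_sample(p) for p in prompts))
    for r in results:
        print(r)


if __name__ == "__main__":
    asyncio.run(main())
```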

0 commit comments
