### What does this PR do?
- add more explanation
- update all pictures used in the doc
### Checklist Before Starting
- [ ] Search for similar PRs. Paste at least one query link here: ...
- [ ] Format the PR title as `[{modules}] {type}: {description}` (This
will be checked by the CI)
- `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`,
`trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`,
`ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`,
`env`, `tool`, `ckpt`, `doc`, `data`
- If this PR involves multiple modules, separate them with `,` like
`[megatron, fsdp, doc]`
- `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add `[BREAKING]` to the beginning of the title.
- Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`
### Test
> For changes that can not be tested by CI (e.g., algorithm
implementation, new model support), validate by experiment(s) and show
results like training curve plots, evaluation results, etc.
### API and Usage Example
> Demonstrate how the API changes if any, and provide usage example(s)
if possible.
```python
# Add code snippet or script demonstrating how to use this
```
### Design & Code Changes
> Demonstrate the high-level design if this PR is complex, and list the
specific changes.
### Checklist Before Submitting
> [!IMPORTANT]
> Please check all the following items before requesting a review,
otherwise the reviewer might deprioritize this PR for review.
- [ ] Read the [Contribute
Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [ ] Apply [pre-commit
checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure
--color=always`
- [ ] Add / Update [the
documentation](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/volcengine/verl/tree/main/.github/workflows)
to cover all the code. If not feasible, explain why: ...
- [ ] Once your PR is ready for CI, send a message in [the `ci-request`
channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the
`verl` Slack
workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).
(If not accessible, please try [the Feishu group
(飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)
-Reward Loop refactors the design of the reward manager so that each sample is processed asynchronously in the `run_single` function, to enable more efficient and flexible reward computation.
-Take the `NaiveRewardLoopManager` as an example:
+RewardLoopManager
+~~~~~~~~~~~~~~~~~
+
+The Reward Loop refactors the design of the reward manager so that each sample is processed asynchronously in the ``run_single`` function.
+This asynchronous design enables the Reward Loop to handle multiple reward computations concurrently, significantly improving computation efficiency.
@@ -52,14 +54,16 @@ Take the `NaiveRewardLoopManager` as an example:
 # ... (reward postprocessing)
 return final_result
 
-To support this feature, user-customized reward functions can be implemented as either synchronous or asynchronous.
-`RewardLoopManager` automatically determines whether the user-customized reward function is asynchronous or synchronous and handles it accordingly, ensuring that the current process remains non-blocking.
+User-defined reward functions can be implemented as either synchronous or asynchronous.
+``RewardLoopManager`` automatically detects the type of the user-defined function and executes it accordingly, ensuring that the reward computation process remains non-blocking.
+
+User-Customized Reward Function
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-Reward Model Interface
-----------------------
+Users can define custom reward functions, for instance, by integrating external generative rewards or rule-based rewards to accommodate diverse scenario requirements.
 
-In the RewardLoopManager, we directly expose the reward model interface to support more complex reward computation scenarios involving reward models.
-For example, a user-defined reward function can be written as follows:
+To facilitate this, the Reward Loop directly exposes the reward model interface, enabling complex reward computation pipelines that involve model-based scoring.
+A user-defined reward function may look like the following:
 
 .. code:: python
 
@@ -73,36 +77,45 @@ For example, a user-defined reward function can be written as follows:
-We provide runnable examples in the `recipe/fapo` directory.
+return {"score": score}
 
-Reward models with single router
---------------------------------
+Runnable examples are provided in the ``recipe/fapo`` directory for reference.
 
-We launch multiple reward servers first and then register them in the reward router. This router will forward the requests to the registered reward servers with load balancing and return the results.
-So we can expose the unique reward router address to the user-customized reward function, and the user can use this address to access the reward models.
+Reward Models and Router
+------------------------
+
+To support flexible and scalable reward model computation, RewardLoop implements a reward router that coordinates requests among multiple reward model servers.
+
+Each reward model runs as an independent server and is registered with the router.
+This router will forward the requests to the registered reward servers with load balancing and return the results.
+
+This design allows us to expose a single unified router address to user-defined reward functions, enabling them to access various reward models seamlessly through the same interface.
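
For readers of this change, here is a minimal sketch of the two mechanisms the updated doc describes: running a user-defined reward function without blocking whether it is synchronous or asynchronous, and a router that spreads requests across registered reward servers. Every name in it (`call_reward_fn`, `RoundRobinRouter`, the server addresses, the exact-match scoring) is hypothetical and not taken from the verl codebase. The doc only says the router load-balances, so round-robin is an assumed policy here, and the network request is replaced by a placeholder.

```python
import asyncio
import inspect


class RoundRobinRouter:
    """Hypothetical stand-in for the reward router (assumed round-robin policy)."""

    def __init__(self):
        self._servers = []
        self._next = 0

    def register(self, address):
        # Each reward model server registers its address with the router.
        self._servers.append(address)

    def pick(self):
        # Choose the next registered server in rotation.
        if not self._servers:
            raise RuntimeError("no reward servers registered")
        address = self._servers[self._next % len(self._servers)]
        self._next += 1
        return address


async def call_reward_fn(reward_fn, *args, **kwargs):
    """Run a reward function without blocking the event loop."""
    if inspect.iscoroutinefunction(reward_fn):
        # Asynchronous user function: await it directly.
        return await reward_fn(*args, **kwargs)
    # Synchronous user function: run it in a worker thread.
    return await asyncio.to_thread(reward_fn, *args, **kwargs)


def sync_reward(solution, ground_truth):
    # Rule-based reward: exact match after stripping whitespace.
    return {"score": float(solution.strip() == ground_truth.strip())}


async def async_reward(solution, ground_truth, router):
    # Model-based reward: a real function would send a request to `address`.
    address = router.pick()
    await asyncio.sleep(0)  # placeholder for the network round trip
    score = float(solution.strip() == ground_truth.strip())
    return {"score": score, "server": address}


async def main():
    router = RoundRobinRouter()
    router.register("http://reward-0:8000")  # hypothetical addresses
    router.register("http://reward-1:8000")
    # All three reward computations are scheduled concurrently.
    return await asyncio.gather(
        call_reward_fn(sync_reward, "42", "42"),
        call_reward_fn(async_reward, "41", "42", router),
        call_reward_fn(async_reward, "42", "42", router),
    )


results = asyncio.run(main())
```

The sketch checks the function type with `inspect.iscoroutinefunction` and offloads synchronous functions with `asyncio.to_thread` (Python 3.9+). The actual `RewardLoopManager` may detect and schedule functions differently.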