
Commit 0640ff5

yyDing1 authored and sunnweiwei committed
[doc] feat: update doc of reward loop (volcengine#3880)
### What does this PR do?

- add more explanation
- update all pictures used in the doc

### Checklist Before Starting

- [ ] Search for similar PRs. Paste at least one query link here: ...
- [ ] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`
  - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

> For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

### API and Usage Example

> Demonstrate how the API changes if any, and provide usage example(s) if possible.

```python
# Add code snippet or script demonstrating how to use this
```

### Design & Code Changes

> Demonstrate the high-level design if this PR is complex, and list the specific changes.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- [ ] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [ ] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [ ] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ...
- [ ] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)
1 parent f8e3557 commit 0640ff5

1 file changed: +47 −34 lines changed

docs/advance/reward_loop.rst

Lines changed: 47 additions & 34 deletions
@@ -1,10 +1,10 @@
 Reward Loop
 ===========
 
-Last updated: 10/21/2025.
+Last updated: 10/23/2025.
 
 .. warning::
-    Reward Loop is still in progress.
+    Reward Loop is ready for use, but the API may change in future releases.
 
 Reward Loop is designed for more flexible and easy-to-use reward computation.
 
@@ -14,18 +14,20 @@ Reward Loop is designed for more flexible and easy-to-use reward computation.
 - Support broader reward model interface (including discriminative and generative models)
 - Make user customized reward function more flexible
 
-.. image:: https://github.com/yyDing1/verl-materials/blob/main/reward_loop_overview.png?raw=true
+.. image:: https://github.com/yyDing1/verl-materials/blob/main/reward_loop_overview.svg?raw=true
 
 Async Reward Computation
 ------------------------
 
-Reward Loop refactors the design of the reward manager so that each sample is processed asynchronously in the `run_single` function, to enable more efficient and flexible reward computation.
-Take the `NaiveRewardLoopManager` as an example:
+RewardLoopManager
+~~~~~~~~~~~~~~~~~
+
+The Reward Loop refactors the design of the reward manager so that each sample is processed asynchronously in the ``run_single`` function.
+This asynchronous design enables the Reward Loop to handle multiple reward computations concurrently, significantly improving computation efficiency.
 
 .. code:: python
 
     class RewardLoopManagerBase(ABC):
-        @abstractmethod
         async def run_single(self, data: DataProto) -> dict:
             # ... (data preprocessing)
             if self.is_async_reward_score:
@@ -52,14 +54,16 @@ Take the `NaiveRewardLoopManager` as an example:
             # ... (reward postprocessing)
             return final_result
 
-To support this feature, user-customized reward functions can be implemented as either synchronous or asynchronous.
-`RewardLoopManager` automatically determines whether the user-customized reward function is asynchronous or synchronous and handles it accordingly, ensuring that the current process remains non-blocking.
+User-defined reward functions can be implemented as either synchronous or asynchronous.
+``RewardLoopManager`` automatically detects the type of the user-defined function and executes it accordingly, ensuring that the reward computation process remains non-blocking.
+
+User-Customized Reward Function
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-Reward Model Interface
-----------------------
+Users can define custom reward functions, for instance, by integrating external generative rewards or rule-based rewards to accommodate diverse scenario requirements.
 
-In the RewardLoopManger, we directly expose the reward model interface to support more complex reward computation scenarios involving reward models.
-For example, a user-defined reward function can be written as follows:
+To facilitate this, the Reward Loop directly exposes the reward model interface, enabling complex reward computation pipelines that involve model-based scoring.
+A user-defined reward function may look like the following:
 
 .. code:: python
 
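To make the sync/async handling described above concrete, here is a minimal sketch of how a manager can detect whether a user reward function is a coroutine and run either kind without blocking the event loop. This is illustrative only, not verl's actual code; `dispatch_reward_fn` and the two toy reward functions are hypothetical names.

```python
# Illustrative sketch only (not verl's implementation): accept either a sync
# or an async user reward function and keep the event loop non-blocking.
import asyncio
import inspect


async def dispatch_reward_fn(user_reward_fn, **kwargs):
    """Run a user reward function without blocking the event loop."""
    if inspect.iscoroutinefunction(user_reward_fn):
        # Async user function: await it directly.
        return await user_reward_fn(**kwargs)
    # Sync user function: run it in a worker thread so other samples
    # can keep being scored concurrently.
    return await asyncio.to_thread(user_reward_fn, **kwargs)


def rule_based_score(solution_str, ground_truth, **_):
    # A purely synchronous, rule-based reward.
    return {"score": float(solution_str.strip() == ground_truth)}


async def model_based_score(solution_str, ground_truth, **_):
    # An async reward, e.g. one that would query a reward model server.
    await asyncio.sleep(0)  # stand-in for an HTTP call
    return {"score": 1.0}


async def main():
    results = await asyncio.gather(
        dispatch_reward_fn(rule_based_score, solution_str="42", ground_truth="42"),
        dispatch_reward_fn(model_based_score, solution_str="41", ground_truth="42"),
    )
    print(results)


if __name__ == "__main__":
    asyncio.run(main())
```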
@@ -73,36 +77,45 @@ For example, a user-defined reward function can be written as follows:
     ):
         """Compute the reward score."""
 
+        # Step 1: Prepare prompt and request payload
         grm_prompt = GRM_PROMPT_TEMPLATE.format(problem=extra_info["question"], solution=solution_str)
         messages = [{"role": "user", "content": grm_prompt}]
         sampling_params = {"temperature": 0.7, "top_p": 0.8, "max_tokens": 4096}
-        chat_complete_request = {
-            "messages": messages,
-            **sampling_params,
-        }
+        chat_complete_request = {"messages": messages, **sampling_params}
+
+        # Step 2: Send async request to the reward model
+        # here, chat_complete sends async http request to the router address
         result = await chat_complete(
             router_address=reward_router_address,
             chat_complete_request=chat_complete_request,
         )
-        grm_response = result.choices[0].message.content
+
+        # Step 3: Parse model response and extract score
+        grm_response = result.choices[0].message.content.strip()
         try:
-            score = int(grm_response.split("\n\n")[-1].strip())
+            score_str = grm_response.split("\n\n")[-1].strip()
+            score = int(score_str)
         except Exception:
             score = 0
-        return {"score": score, "acc": score == 10}
 
-We provide runable examples in the `recipe/fapo` directory.
+        return {"score": score}
 
-Reward models with single router
---------------------------------
+Runnable examples are provided in the ``recipe/fapo`` directory for reference.
 
-We launch multiple reward servers first and then register them in the reward router. This router will forward the requests to the registered reward servers with load balancing and return the results.
-So we can expose the unique reward router address to the user-customized reward function, and the user can use this address to access the reward models.
+Reward Models and Router
+------------------------
+
+To support flexible and scalable reward model computation, the Reward Loop implements a reward router that coordinates requests among multiple reward model servers.
+
+Each reward model runs as an independent server and is registered with the router.
+The router forwards requests to the registered reward servers with load balancing and returns the results.
+
+This design allows us to expose a single unified router address to user-defined reward functions, enabling them to access various reward models seamlessly through the same interface.
 
 RewardModelManager
 ~~~~~~~~~~~~~~~~~~
 
-.. image:: https://github.com/yyDing1/verl-materials/blob/main/reward_loop_full.png?raw=true
+.. image:: https://github.com/yyDing1/verl-materials/blob/main/reward_loop_full.svg?raw=true
 
 `RewardModelManager` will launch multiple reward servers and register them in the reward router.
 
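To illustrate why the per-sample async design pays off, the following self-contained sketch scores a batch of samples concurrently, with every call going through one router address. All names here (`fake_chat_complete`, `score_one`, the address) are placeholders, not verl APIs.

```python
# Illustrative sketch (not part of the commit): each per-sample reward call
# awaits a (stubbed) request to the router, and asyncio.gather keeps all of
# them in flight at once through the same router address.
import asyncio
import random


async def fake_chat_complete(router_address: str, chat_complete_request: dict) -> str:
    # Stand-in for an HTTP request to the reward router.
    await asyncio.sleep(random.uniform(0.1, 0.3))
    return "8"


async def score_one(router_address: str, solution_str: str) -> dict:
    response = await fake_chat_complete(
        router_address=router_address,
        chat_complete_request={"messages": [{"role": "user", "content": solution_str}]},
    )
    try:
        score = int(response)
    except ValueError:
        score = 0
    return {"score": score}


async def main():
    router_address = "127.0.0.1:30000"  # hypothetical router address
    solutions = [f"solution {i}" for i in range(8)]
    # All eight reward requests run concurrently rather than one after another.
    results = await asyncio.gather(*(score_one(router_address, s) for s in solutions))
    print(results)


if __name__ == "__main__":
    asyncio.run(main())
```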

@@ -116,18 +129,18 @@ RewardModelManager
             Initialize the reward model manager.
 
             Args:
-            config (RewardModelConfig): Reward model configuration.
-            worker_group (RayWorkerGroup, optional): Worker group. Defaults to None.
+                config (RewardModelConfig): Reward model configuration.
+                worker_group (RayWorkerGroup, optional): Worker group. Defaults to None.
             """
             self.config = config
             self.worker_group = worker_group
             self._initialize_llm_servers()
             self._initialize_router()
             if self.config.rollout.free_cache_engine:
-            self.sleep()
+                self.sleep()
 
-Router
-~~~~~~
+Reward Router
+~~~~~~~~~~~~~
 
 The router is to forward the requests to the registered reward servers with load balancing.
 - For sglang reward servers, we directly use the sglang router to forward the requests.
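The load-balancing behavior described here can be approximated by a very small round-robin forwarder. The sketch below is illustrative only, not verl's router; the class name, endpoint path, and server addresses are assumptions.

```python
# Illustrative sketch (not verl's router implementation): a minimal round-robin
# forwarder that keeps a list of registered reward-server addresses and sends
# each chat-completion request to the next one.
import asyncio
from itertools import cycle

import aiohttp


class RoundRobinRewardRouter:
    def __init__(self, server_addresses: list[str]):
        self._servers = cycle(server_addresses)

    async def chat_complete(self, chat_complete_request: dict) -> dict:
        # Pick the next registered reward server (simple load balancing).
        server = next(self._servers)
        async with aiohttp.ClientSession() as session:
            async with session.post(
                f"http://{server}/v1/chat/completions", json=chat_complete_request
            ) as resp:
                return await resp.json()


async def main():
    # Hypothetical addresses of two already-launched reward model servers.
    router = RoundRobinRewardRouter(["127.0.0.1:30001", "127.0.0.1:30002"])
    result = await router.chat_complete(
        {"messages": [{"role": "user", "content": "Score this solution."}]}
    )
    print(result)


if __name__ == "__main__":
    asyncio.run(main())
```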
@@ -168,10 +181,10 @@ The router is to forward the requests to the registered reward servers with load balancing.
             # Placeholder for aiohttp client
             self.client = None
 
-Integrate with AgentLoop
-------------------------
+Agent Reward Loop
+-----------------
 
-Reward Loop can be integrated with AgentLoop to enable sample-wise rollout and reward computation.
+RewardLoop can be integrated with AgentLoop to enable sample-wise rollout and reward computation.
 
-.. image:: https://github.com/yyDing1/verl-materials/blob/main/agent_reward_loop.png?raw=true
+.. image:: https://github.com/yyDing1/verl-materials/blob/main/agent_reward_loop.svg?raw=true
 
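A rough picture of what sample-wise rollout plus reward computation means in practice: each sample has its own coroutine that first generates a rollout and then computes its reward, so reward calls for finished samples overlap with rollouts still in progress. This is an illustrative sketch; `generate_rollout` and `compute_reward` are hypothetical stand-ins, not AgentLoop APIs.

```python
# Illustrative sketch (not the actual AgentLoop code): per-sample pipelines
# run concurrently, each doing rollout followed by reward computation.
import asyncio
import random


async def generate_rollout(prompt: str) -> str:
    await asyncio.sleep(random.uniform(0.1, 0.3))  # stand-in for generation
    return f"response to {prompt!r}"


async def compute_reward(prompt: str, response: str) -> float:
    await asyncio.sleep(random.uniform(0.05, 0.1))  # stand-in for a reward model call
    return float(len(response) % 2)


async def run_one_sample(prompt: str) -> dict:
    response = await generate_rollout(prompt)
    reward = await compute_reward(prompt, response)
    return {"prompt": prompt, "response": response, "reward": reward}


async def main():
    prompts = [f"prompt {i}" for i in range(4)]
    results = await asyncio.gather(*(run_one_sample(p) for p in prompts))
    for r in results:
        print(r)


if __name__ == "__main__":
    asyncio.run(main())
```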

0 commit comments
