verl-project · tardis-key · Feb 11, 2026 · gemini-code-assist
diff --git a/README.md b/README.md
@@ -89,7 +89,7 @@ verl is fast with:
 - Compatible with Hugging Face Transformers and Modelscope Hub: [Qwen-3](https://github.com/volcengine/verl/blob/main/examples/grpo_trainer/run_qwen3-8b.sh), Qwen-2.5, Llama3.1, Gemma2, DeepSeek-LLM, etc
 - Supervised fine-tuning.
 - Reinforcement learning with [PPO](examples/ppo_trainer/), [GRPO](examples/grpo_trainer/), [GSPO](https://github.com/verl-project/verl-recipe/tree/main/gspo/), [ReMax](examples/remax_trainer/), [REINFORCE++](https://verl.readthedocs.io/en/latest/examples/config.html#algorithm), [RLOO](examples/rloo_trainer/), [PRIME](https://github.com/verl-project/verl-recipe/tree/main/prime/), [DAPO](https://github.com/verl-project/verl-recipe/tree/main/dapo/), [DrGRPO](https://github.com/verl-project/verl-recipe/tree/main/drgrpo), [KL_Cov & Clip_Cov](https://github.com/verl-project/verl-recipe/tree/main/entropy) etc.
-  - Support model-based reward and function-based reward (verifiable reward) for math, [coding](https://github.com/volcengine/verl-recipe/tree/main/dapo), etc
+  - Support model-based reward and function-based reward (verifiable reward) for math, [coding](https://github.com/verl-project/verl-recipe/tree/main/dapo), etc
   - Support vision-language models (VLMs) and [multi-modal RL](examples/grpo_trainer/run_qwen2_5_vl-7b.sh) with Qwen2.5-vl, Kimi-VL
   - [Multi-turn with tool calling](https://github.com/volcengine/verl/tree/main/examples/sglang_multiturn)
 - LLM alignment recipes such as [Self-play preference optimization (SPPO)](https://github.com/verl-project/verl-recipe/tree/main/sppo)

@@ -152,7 +152,7 @@ Chat completion vs Token in token out
 Almost all agent frameworks (LangGraph, CrewAI, LlamaIndex, etc) call LLM with OpenAI chat completion api, and 
 keep chat history as messages. So user may expect that we should use the chat completion api in multi-turn rollout.
 
-But based on our recent experience on single-turn training on DAPO and multi-turn training on `retool <https://github.com/volcengine/verl-recipe/tree/main/retool>`_,
+But based on our recent experience on single-turn training on DAPO and multi-turn training on `retool <https://github.com/verl-project/verl-recipe/tree/main/retool>`_,
 we found the token_ids from apply the final messages may not equal to the token_ids by concat prompt_ids and response_ids in each turn.
 
 .. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/multi_turn.png?raw=true
@@ -234,5 +234,5 @@ Next
 ----
 
 - :doc:`Agentic RL Training<../start/agentic_rl>`: Quick start agentic RL training with gsm8k dataset.
-- `LangGraph MathExpression <https://github.com/volcengine/verl-recipe/tree/main/langgraph_agent/example>`_: Demonstrate how to use LangGraph to build agent loop.
-- `Retool <https://github.com/volcengine/verl-recipe/tree/main/retool>`_: End-to-end retool paper reproduction using tool agent.
+- `LangGraph MathExpression <https://github.com/verl-project/verl-recipe/tree/main/langgraph_agent/example>`_: Demonstrate how to use LangGraph to build agent loop.
+- `Retool <https://github.com/verl-project/verl-recipe/tree/main/retool>`_: End-to-end retool paper reproduction using tool agent.
@@ -30,11 +30,11 @@ Refer to the table below to reproduce RL training from different pre-trained che
 | NVIDIA GPU | Qwen/Qwen2-7B-Instruct           | GRPO (FSDP2)    | 89.8         | [log](https://github.com/eric-haibin-lin/verl-data/blob/experiments/gsm8k/qwen2-7b-fsdp2.log)                                                                                                                                 |
 | NVIDIA GPU | Qwen/Qwen2-7B-Instruct           | GRPO (Megatron) | 89.6         | [log](https://github.com/eric-haibin-lin/verl-data/blob/experiments/gsm8k/qwen2-7b_math_megatron.log)                                                                                                                         |
 | NVIDIA GPU | Qwen/Qwen2.5-7B-Instruct         | ReMax           | 97           | [script](https://github.com/eric-haibin-lin/verl/blob/main/examples/remax_trainer/run_qwen2.5-3b_seq_balance.sh), [wandb](https://wandb.ai/liziniu1997/verl_remax_example_gsm8k/runs/vxl10pln)                                |
-| NVIDIA GPU | Qwen/Qwen2.5-7B-Instruct         | SPPO            | 65.6 (MATH)  | [SPPO script](https://github.com/volcengine/verl-recipe/tree/main/sppo/README.md)                                                                                                                                             |
+| NVIDIA GPU | Qwen/Qwen2.5-7B-Instruct         | SPPO            | 65.6 (MATH)  | [SPPO script](https://github.com/verl-project/verl-recipe/tree/main/sppo/README.md)                                                                                                                                             |
 | NVIDIA GPU | Qwen/Qwen2.5-7B-Instruct         | GRPO-LoRA       | 93.4         | [command and logs](https://github.com/eric-haibin-lin/verl-data/blob/experiments/gsm8k/Qwen2.5-7B-bsz64_8-prompt512-resp1024-lorarank32-score0.934.log)                                                                       |
 | NVIDIA GPU | Mixtral-8x22B-Instruct-v0.1      | Instruct model  | 83.7         | [Qwen Blog](https://qwen.ai/blog?id=qwen2.5-llm)                                                                                                                                                                              |
 | NVIDIA GPU | Mixtral-8x22B-Instruct-v0.1      | RLOO (Megatron) | 92.3         | [wandb](https://api.wandb.ai/links/ppo_dev/sbuiuf2d)                                                                                                                                                                          |
-| NVIDIA GPU | Qwen/Qwen2.5-7B-Instruct         | SPIN            | 92           | [script](https://github.com/volcengine/verl-recipe/tree/main/spin/README.md)                                                                                                                                                  |
+| NVIDIA GPU | Qwen/Qwen2.5-7B-Instruct         | SPIN            | 92           | [script](https://github.com/verl-project/verl-recipe/tree/main/spin/README.md)                                                                                                                                                  |
 | NVIDIA GPU | Qwen/Qwen2-7B-Instruct           | GPG             | 88           | [log](https://github.com/diqiuzhuanzhuan/verldata/blob/main/run_logs/qwen2-7b_math.log), [wandb](https://wandb.ai/diqiuzhuanzhuan/verl_gpg_example_gsm8k_math/runs/ab86c4va)                                                  |
 | NVIDIA GPU | Qwen/Qwen2-7B-Instruct           | GPG (Megatron)  | 88           | [log](https://github.com/diqiuzhuanzhuan/verldata/blob/main/run_logs/qwen2-7b_math_megatron.log), [wandb](https://wandb.ai/diqiuzhuanzhuan/verl_gpg_example_gsm8k_math/runs/yy8bheu8)                                         |
 | NVIDIA GPU | Qwen/Qwen2.5-VL-7B-Instruct      | GRPO (Megatron) | 65.4 (GEO3k) | [script](https://github.com/volcengine/verl/blob/main/examples/grpo_trainer/run_qwen2_5_vl-7b-megatron.sh), [wandb](https://api.wandb.ai/links/megatron-core-moe-dev/1yngvkek)                                                |

@@ -170,11 +170,11 @@ if self.overlong_buffer_cfg.enable:
 
 Most experiments in the paper, including the best-performant one, are run without Overlong Filtering because it's somehow overlapping with Overlong Reward Shaping in terms of properly learning from the longest outputs. So we don't implement it here.
 
-### What's the difference between [the `recipe/dapo` directory in the `main` branch](https://github.com/volcengine/verl-recipe/tree/main/dapo) and the [`recipe/dapo` branch](https://github.com/verl-project/verl-recipe/tree/main/dapo/recipe/dapo)?
+### What's the difference between [the `recipe/dapo` directory in the `main` branch](https://github.com/verl-project/verl-recipe/tree/main/dapo) and the [`recipe/dapo` branch](https://github.com/verl-project/verl-recipe/tree/main/dapo/recipe/dapo)?
 
 [The `recipe/dapo` branch](https://github.com/verl-project/verl-recipe/tree/main/dapo/recipe/dapo) is for **as-is reproduction** and thus won't be updated with new features.
 
-[The `recipe/dapo` directory in the `main` branch](https://github.com/volcengine/verl-recipe/tree/main/dapo) works as an example of how to extend the latest `verl` to implement an algorithm recipe, which will be maintained with new features.
+[The `recipe/dapo` directory in the `main` branch](https://github.com/verl-project/verl-recipe/tree/main/dapo) works as an example of how to extend the latest `verl` to implement an algorithm recipe, which will be maintained with new features.
 
 ### Why can't I produce similar results after modifications?
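
Every hunk above makes the same substitution: links to `github.com/volcengine/verl-recipe` become `github.com/verl-project/verl-recipe`, while links to the main `volcengine/verl` repository in the context lines are left untouched. A minimal sketch of that rewrite follows, assuming only the `verl-recipe` links are meant to move. The helper name `rewrite_recipe_links` is hypothetical and not part of verl.

```python
import re

# Assumption: only the verl-recipe repository links change organization;
# links to github.com/volcengine/verl itself are intentionally kept.
NEW_PREFIX = "github.com/verl-project/verl-recipe"

# The negative lookahead avoids touching a differently named repository
# that merely starts with "verl-recipe" (for example "verl-recipe-extra").
_OLD_LINK = re.compile(r"github\.com/volcengine/verl-recipe(?![\w-])")


def rewrite_recipe_links(text: str) -> str:
    """Return text with stale verl-recipe links pointed at verl-project."""
    return _OLD_LINK.sub(NEW_PREFIX, text)


if __name__ == "__main__":
    stale = "[coding](https://github.com/volcengine/verl-recipe/tree/main/dapo)"
    print(rewrite_recipe_links(stale))
```

Running the same pattern as a search over the docs tree is one way to check that no stale `verl-recipe` link was missed by this change.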