Commit c77bb54
[perf, data] feat: DP workload balance (volcengine#3605)
### What does this PR do?
Mitigate workload imbalance across data-parallel (DP) ranks.
As shown in the figure below, all DP ranks must synchronize after each
mini-batch, so stragglers that receive longer sequences delay every other worker.
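
To make the straggler effect concrete, here is a minimal sketch that assumes each rank's compute time for a mini-batch is already known. The function name `dp_step_utilization` and the timings are hypothetical and are not part of verl.

```python
def dp_step_utilization(rank_times):
    """Fraction of time the ranks spend computing when all of them
    must wait for the slowest rank before synchronizing."""
    step_time = max(rank_times)   # the straggler sets the step time
    busy_time = sum(rank_times)   # total useful compute across ranks
    return busy_time / (step_time * len(rank_times))


# Hypothetical timings in seconds: rank 0 got the long sequences.
utilization = dp_step_utilization([10.0, 4.0])
print(f"utilization: {utilization:.2f}")
```

In this made-up case rank 1 sits idle for 6 of the 10 seconds, which is the gap that balancing tries to close.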



> Add **concise** overview of what this PR aims to achieve or
accomplish. Reference related GitHub issues and PRs that help with the
review.
### Checklist Before Starting
- [x] Search for similar PRs. Paste at least one query link here:
volcengine#3401
- [x] Format the PR title as `[{modules}] {type}: {description}` (This
will be checked by the CI)
- `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`,
`trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`,
`ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`,
`env`, `tool`, `ckpt`, `doc`, `data`
- If this PR involves multiple modules, separate them with `,` like
`[megatron, fsdp, doc]`
- `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add `[BREAKING]` to the beginning of the title.
- Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`
### Test
> For changes that can not be tested by CI (e.g., algorithm
implementation, new model support), validate by experiment(s) and show
results like training curve plots, evaluation results, etc.
In the figure below, the run whose name ends with the suffix `Balance`
achieves higher MFU when training Qwen2.5-Math-7 with GRPO.
<img width="5056" height="2656" alt="W B Chart 2025_9_24 16_52_24"
src="https://github.com/user-attachments/assets/b83bd7a2-3c74-4a09-8212-2f9b754c4ef1"
/>
### API and Usage Example
Split the data into `chunks` workload-balanced chunks:
```python
_balance_data_proto(DataProto_obj, chunks)
```
### Design & Code Changes
> Demonstrate the high-level design if this PR is complex, and list the
specific changes.
As shown in the figure, the leftmost panel shows the unsplit data with a
global batch size of 16.
When DP = 2, the existing method splits the batch sequentially between
the two ranks. In this case, rank 0 receives more tokens than rank 1.
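
The sequential split can be sketched as follows. The helper `sequential_split` and the 16 sequence lengths are hypothetical, chosen to be deliberately skewed. They are not taken from the figure or from verl's code.

```python
def sequential_split(seqlens, dp_size):
    """Give each rank a contiguous slice of the batch, ignoring length."""
    per_rank = len(seqlens) // dp_size
    return [seqlens[r * per_rank:(r + 1) * per_rank] for r in range(dp_size)]


# Hypothetical global batch of 16 sequence lengths (in tokens).
seqlens = [4096, 3584, 3072, 2560, 2048, 2048, 1536, 1536,
           1024, 1024, 512, 512, 512, 256, 256, 128]

shards = sequential_split(seqlens, dp_size=2)
tokens_per_rank = [sum(shard) for shard in shards]
print(tokens_per_rank)
```

Both ranks receive 8 samples, yet rank 0 processes several times more tokens than rank 1.
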
The rightmost panel shows our design. We model the workload generated by
each data entry and use the Karmarkar-Karp algorithm to split the batch
into two equal parts, keeping the total workload of each part as close
as possible.
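
For readers unfamiliar with Karmarkar-Karp (the largest differencing method), here is a minimal k-way sketch. It is an illustration under stated assumptions, not the code in this PR: the function name `karmarkar_karp` is hypothetical, and this plain variant does not force every part to contain the same number of samples.

```python
import heapq
from itertools import count


def karmarkar_karp(workloads, k):
    """Partition item indices into k groups with similar total workload."""
    tie = count()  # unique tie-breaker so the heap never compares lists
    heap = []
    for idx, w in enumerate(workloads):
        # Each item starts as its own k-way partition: one loaded slot
        # and k - 1 empty slots, sorted by descending sum.
        parts = [(w, [idx])] + [(0, []) for _ in range(k - 1)]
        heapq.heappush(heap, (-w, next(tie), parts))
    if not heap:
        return [[] for _ in range(k)]
    while len(heap) > 1:
        # Take the two partial partitions with the largest spread.
        _, _, a = heapq.heappop(heap)
        _, _, b = heapq.heappop(heap)
        # Pair the heaviest slot of one with the lightest slot of the other.
        merged = [(sa + sb, ia + ib)
                  for (sa, ia), (sb, ib) in zip(a, reversed(b))]
        merged.sort(key=lambda slot: slot[0], reverse=True)
        spread = merged[0][0] - merged[-1][0]
        heapq.heappush(heap, (-spread, next(tie), merged))
    return [sorted(indices) for _, indices in heap[0][2]]


workloads = [8, 7, 6, 5, 4]
parts = karmarkar_karp(workloads, k=2)
loads = [sum(workloads[i] for i in part) for part in parts]
print(parts, loads)
```

The method is a heuristic: on this input it yields loads of 16 and 14, while the optimal split (8 + 7 versus 6 + 5 + 4) is perfectly even.
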
The workload can be calculated with the FLOPs formula in verl. Here, we
use a rough, hardcoded estimate of the FLOPs: `seqlens**2 + seqlens *
24576` (attention + MLP of a 7B model).
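
As a quick illustration of the estimate above, the sketch below evaluates the same expression for two sequence lengths. The function name `estimate_workload` is hypothetical; only the formula comes from this PR description.

```python
def estimate_workload(seqlens):
    """Rough per-sample cost: a quadratic attention term plus a linear
    MLP term, using the hardcoded 7B-model constant 24576."""
    return [s * s + s * 24576 for s in seqlens]


costs = estimate_workload([1024, 4096])
print(costs)
print(costs[1] / costs[0])
```

A sequence that is 4x longer costs more than 4x as much under this estimate, because of the quadratic attention term. This is why balancing on raw token counts alone can still leave ranks uneven.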

### Checklist Before Submitting
> [!IMPORTANT]
> Please check all the following items before requesting a review,
otherwise the reviewer might deprioritize this PR for review.
- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure
--color=always`
- [x] Add / Update [the
documentation](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/volcengine/verl/tree/main/.github/workflows)
to cover all the code. If not feasible, explain why: ...
- [ ] Once your PR is ready for CI, send a message in [the `ci-request`
channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the
`verl` Slack
workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).
(If not accessible, please try [the Feishu group
(飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)

1 parent 0640ff5 commit c77bb54
2 files changed, +42 -11 lines changed

- File 1 (+26 -7): hunks at new lines 57-63, 914-948, and 1123-1128.
- File 2 (+16 -4): hunks at new lines 24-39 and 308-329.