
Fix: skip trainer config validation in bench mode#535

Merged
pan-x-c merged 1 commit into agentscope-ai:main from shiweijiezero:fix/skip-trainer-check-bench-mode
May 9, 2026

Conversation

@shiweijiezero
Contributor

Bug

In bench mode, cluster.trainer_gpu_num is left at its default 0 because the cluster validator deliberately skips trainer GPU allocation for bench / explore / serve (these modes don't train) — see config_validator.py:244.

But the trainer config validator (config_validator.py:1168) still keeps bench in its whitelist. So for any local-model bench run (i.e. external_model.enable=false), the call chain reaches:

trinity/trainer/verl/verl_config.py:430
    if train_batch_size % (world_size // sp_size) != 0:
                              ↑ trainer_gpu_num = 0 → ZeroDivisionError
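The failure can be demonstrated standalone. The values below are hypothetical; in the real run `world_size` derives from `cluster.trainer_gpu_num`, which bench mode leaves at its default 0:

```python
# Minimal standalone demonstration of the crash described above.
# Hypothetical values: world_size stands in for the trainer world size
# computed from cluster.trainer_gpu_num, which bench mode leaves at 0.
train_batch_size = 256
sp_size = 1
world_size = 0  # trainer_gpu_num defaults to 0 in bench mode

try:
    divisible = train_batch_size % (world_size // sp_size) == 0
except ZeroDivisionError:
    divisible = None  # the divisibility check crashes before it can answer
```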

Repro: any YAML with `mode: bench`, a non-external model (`external_model.enable=false`), and engine_num × tensor_parallel_size equal to the total GPU count.
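A hypothetical minimal repro config might look like this (field names follow the PR description; the exact schema may differ from the project's real config):

```yaml
# Hypothetical repro sketch, not a verified config file.
mode: bench
model:
  model_path: /path/to/local/model   # non-external model (external_model.enable=false)
explorer:
  rollout_model:
    engine_num: 4
    tensor_parallel_size: 2          # 4 × 2 = 8 = total GPUs, nothing left for the trainer
cluster:
  node_num: 1
  gpu_per_node: 8
```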

Fix

Drop bench from the trainer-config-check whitelist. Bench mode runs explorer-only and never touches the trainer, so its trainer parallelism config doesn't need validation. Same fast-path semantics as the existing external_model.enable=true check immediately below at line 1170.
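The gating logic the fix produces can be sketched as follows (function and parameter names are assumptions for illustration, not the actual `config_validator.py` API):

```python
# Illustrative sketch of the post-fix gating, assuming hypothetical names.
EXPLORER_ONLY_MODES = ("bench", "explore", "serve")

def needs_trainer_config_check(mode: str, external_model_enabled: bool) -> bool:
    """Return True only when a local trainer will actually run."""
    if mode in EXPLORER_ONLY_MODES:
        # These modes never construct a trainer, so its parallelism
        # settings (which divide by trainer_gpu_num) need no validation.
        return False
    if external_model_enabled:
        # Existing fast path: an external model means no local trainer.
        return False
    return True
```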

After the fix, all 6 modes behave correctly:

| mode | trainer GPU allocated | trainer config check |
|---|---|---|
| train | yes | yes |
| both | yes | yes |
| colocate | yes (1 GPU placeholder) | yes |
| bench | no | no (was: yes, broken) |
| explore | no | no (already) |
| serve | no | no (already) |

Test

A local-model bench run (Qwen3.6-27B + frozen_lake_obscure eval, 1 node × 8 GPU, engine_num=4 × TP=2) reproduces the ZeroDivisionError on main and runs cleanly with this patch.

config.trainer.trainer_config is not accessed by any module outside config_validator.py (verified via repo-wide grep), so skipping synchronize_config() in bench mode has no downstream effect.

Bench mode runs explorer-only; cluster.trainer_gpu_num is left at 0
because the cluster validator (line 244) skips trainer GPU allocation
for bench/explore/serve. The trainer config validator however still
kept 'bench' in its whitelist, so any local-model bench run hit:

    trinity/trainer/verl/verl_config.py:430
    if train_batch_size % (world_size // sp_size) != 0:
        ZeroDivisionError: integer division or modulo by zero

Drop bench from the whitelist; same fast-path semantics as the existing
external_model.enable check immediately below.
@pan-x-c pan-x-c merged commit 5b1d8a7 into agentscope-ai:main May 9, 2026
1 check passed
