
Fix: skip trainer config validation in bench mode#535

Merged
pan-x-c merged 1 commit into agentscope-ai:main from shiweijiezero:fix/skip-trainer-check-bench-mode
May 9, 2026

Conversation

@shiweijiezero
Contributor

Bug

In bench mode, cluster.trainer_gpu_num is left at its default 0 because the cluster validator deliberately skips trainer GPU allocation for bench / explore / serve (these modes don't train) — see config_validator.py:244.

But the trainer config validator (config_validator.py:1168) still keeps bench in its whitelist. So for any local-model bench run (i.e. external_model.enable=false), the call chain reaches:

trinity/trainer/verl/verl_config.py:430
    if train_batch_size % (world_size // sp_size) != 0:
                              ↑ trainer_gpu_num = 0 → ZeroDivisionError
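The failure can be demonstrated standalone. The values below are hypothetical; in the real run `world_size` derives from `cluster.trainer_gpu_num`, which bench mode leaves at its default 0:

```python
# Minimal standalone demonstration of the crash described above.
# Hypothetical values: world_size stands in for the trainer world size
# computed from cluster.trainer_gpu_num, which bench mode leaves at 0.
train_batch_size = 256
sp_size = 1
world_size = 0  # trainer_gpu_num defaults to 0 in bench mode

try:
    divisible = train_batch_size % (world_size // sp_size) == 0
except ZeroDivisionError:
    divisible = None  # the divisibility check crashes before it can answer
```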

Repro: any YAML with `mode: bench`, a non-external model (`external_model.enable=false`), and engine_num × tensor_parallel_size equal to the total GPU count.
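A hypothetical minimal repro config might look like this (field names follow the PR description; the exact schema may differ from the project's real config):

```yaml
# Hypothetical repro sketch, not a verified config file.
mode: bench
model:
  model_path: /path/to/local/model   # non-external model (external_model.enable=false)
explorer:
  rollout_model:
    engine_num: 4
    tensor_parallel_size: 2          # 4 × 2 = 8 = total GPUs, nothing left for the trainer
cluster:
  node_num: 1
  gpu_per_node: 8
```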

Fix

Drop bench from the trainer-config-check whitelist. Bench mode runs explorer-only and never touches the trainer, so its trainer parallelism config doesn't need validation. Same fast-path semantics as the existing external_model.enable=true check immediately below at line 1170.
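The gating logic the fix produces can be sketched as follows (function and parameter names are assumptions for illustration, not the actual `config_validator.py` API):

```python
# Illustrative sketch of the post-fix gating, assuming hypothetical names.
EXPLORER_ONLY_MODES = ("bench", "explore", "serve")

def needs_trainer_config_check(mode: str, external_model_enabled: bool) -> bool:
    """Return True only when a local trainer will actually run."""
    if mode in EXPLORER_ONLY_MODES:
        # These modes never construct a trainer, so its parallelism
        # settings (which divide by trainer_gpu_num) need no validation.
        return False
    if external_model_enabled:
        # Existing fast path: an external model means no local trainer.
        return False
    return True
```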

After the fix, all 6 modes behave correctly:

| mode | trainer GPU allocated | trainer config check |
|---|---|---|
| train | yes | yes |
| both | yes | yes |
| colocate | yes (1 GPU placeholder) | yes |
| bench | no | no (was: yes, broken) |
| explore | no | no (already) |
| serve | no | no (already) |

Test

A local-model bench run (Qwen3.6-27B + frozen_lake_obscure eval, 1 node × 8 GPU, engine_num=4 × TP=2) reproduces the ZeroDivisionError on main and runs cleanly with this patch.

config.trainer.trainer_config is not accessed by any module outside config_validator.py (verified via repo-wide grep), so skipping synchronize_config() in bench mode has no downstream effect.

Bench mode runs explorer-only; cluster.trainer_gpu_num is left at 0
because the cluster validator (line 244) skips trainer GPU allocation
for bench/explore/serve. The trainer config validator however still
kept 'bench' in its whitelist, so any local-model bench run hit:

    trinity/trainer/verl/verl_config.py:430
    if train_batch_size % (world_size // sp_size) != 0:
        ZeroDivisionError: integer division or modulo by zero

Drop bench from the whitelist; same fast-path semantics as the existing
external_model.enable check immediately below.
@pan-x-c pan-x-c merged commit 5b1d8a7 into agentscope-ai:main May 9, 2026
1 check passed
