Conversation

@qgallouedec (Member) commented on Sep 26, 2025

This PR belongs to a sequence of PRs that aim to refactor the generation part of GRPO/RLOO, to allow for easier customization and, ultimately, tool calling.

Previous:

Next:

Instead of extracting prompt_ids in one central place for all generation methods, like this:

prompt_inputs = super()._prepare_inputs(prompt_inputs)
prompt_ids, prompt_mask = prompt_inputs["input_ids"], prompt_inputs["attention_mask"]
# everything except the ids/mask is forwarded to the model unchanged
forward_kwargs = {k: v for k, v in prompt_inputs.items() if k not in ["input_ids", "attention_mask"]}
# strip padding so each prompt becomes a plain list of token ids
prompt_ids = [p[m].tolist() for p, m in zip(prompt_ids, prompt_mask.bool())]

we now rely on each generation method to provide the prompt_ids itself.
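
For illustration, here is a minimal sketch of that shape. The method names (_generate_with_vllm, _generate_with_transformers) and attributes (self.llm, self.sampling_params, self.processing_class, self.max_completion_length) are assumptions for the example, not TRL's actual internals: each backend hands back the prompt ids it actually consumed, alongside the completion ids.

# Sketch only: hypothetical method and attribute names, not the actual TRL implementation.
def _generate_with_vllm(self, prompts: list[str]):
    # vLLM reports the tokenized prompt on each RequestOutput, so the backend
    # can return exactly the prompt ids it generated from
    outputs = self.llm.generate(prompts, sampling_params=self.sampling_params)
    prompt_ids = [list(out.prompt_token_ids) for out in outputs]
    completion_ids = [list(out.outputs[0].token_ids) for out in outputs]
    return prompt_ids, completion_ids

def _generate_with_transformers(self, prompts: list[str]):
    # assumes left padding, so every completion starts right after the padded prompt block
    enc = self.processing_class(prompts, return_tensors="pt", padding=True).to(self.model.device)
    sequences = self.model.generate(**enc, max_new_tokens=self.max_completion_length)
    prompt_len = enc["input_ids"].shape[1]
    prompt_ids = [ids[mask.bool()].tolist() for ids, mask in zip(enc["input_ids"], enc["attention_mask"])]
    completion_ids = [seq[prompt_len:].tolist() for seq in sequences]
    return prompt_ids, completion_ids

With this shape, a custom generation path (e.g. one that does tool calling) only needs to return its own prompt_ids/completion_ids rather than having to match the trainer's central re-tokenization.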

Benchmark

I ran a quick benchmark with two configs; results are here:

I can't remember which run corresponds to before and which one corresponds to after 😅

import os

from datasets import load_dataset
from trl import GRPOTrainer, GRPOConfig

os.environ["TRACKIO_PROJECT"] = "4152"
os.environ["TRACKIO_SPACE_ID"] = "qgallouedec/trackio"

# Dummy reward function: count the number of unique characters in the completions
def reward_num_unique_chars(completions, **kwargs):
    return [len(set(c)) for c in completions]

dataset = load_dataset("trl-lib/tldr", split="train[:500]")
dataset = dataset.select_columns(["prompt"])  # select_columns returns a new dataset, so reassign it

training_args = GRPOConfig(
    output_dir="tmp",
    per_device_train_batch_size=4,  # reduce the batch size to reduce memory usage
    gradient_accumulation_steps=8,
    num_generations=4,  # reduce the number of generations to reduce memory usage
    max_completion_length=256,  # reduce the completion length to reduce memory usage
    steps_per_generation=4,
    logging_steps=1,
    num_train_epochs=1,
    vllm_mode="colocate",
    use_vllm=True,
    report_to="trackio"
)
trainer = GRPOTrainer(
    model="Qwen/Qwen3-0.6B-Base",
    reward_funcs=[reward_num_unique_chars],
    args=training_args,
    train_dataset=dataset,
)
trainer.train()

https://huggingface.co/spaces/qgallouedec/trackio?project=4152

qgallouedec and others added 18 commits October 1, 2025 08:49
Co-authored-by: Quentin Gallouédec, sergiopaniego, lewtun, Kashif Rasul
@lewtun (Member) left a comment

Thanks for the clean refactor!

Overall LGTM, with one question: is the difference across the benchmark runs within the variance of a repeated run (i.e. if you run one of these configs again on main, does it also fluctuate from the original main run)?

[Screenshot 2025-10-07 at 07:46:10]
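
For what it's worth, one way to estimate that variance (a sketch only, not something run as part of this PR) would be to repeat the identical config on main, changing nothing but the seed, and compare the resulting reward curves:

# Sketch: rerun the same benchmark config twice on main, varying only the seed, to gauge
# run-to-run noise before attributing any gap to the refactor. In practice each run would
# likely be launched as its own process.
from datasets import load_dataset
from trl import GRPOTrainer, GRPOConfig

def reward_num_unique_chars(completions, **kwargs):
    return [len(set(c)) for c in completions]

dataset = load_dataset("trl-lib/tldr", split="train[:500]").select_columns(["prompt"])

for seed in (42, 123):  # two arbitrary seeds
    training_args = GRPOConfig(
        output_dir=f"tmp-seed-{seed}",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=8,
        num_generations=4,
        max_completion_length=256,
        steps_per_generation=4,
        logging_steps=1,
        num_train_epochs=1,
        use_vllm=True,
        vllm_mode="colocate",
        seed=seed,  # the only thing that changes between the two runs
        report_to="trackio",
    )
    trainer = GRPOTrainer(
        model="Qwen/Qwen3-0.6B-Base",
        reward_funcs=[reward_num_unique_chars],
        args=training_args,
        train_dataset=dataset,
    )
    trainer.train()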
