Reproduction
Thank you for your hard work.
I observed a couple of issues with the current TRL release when training VLMs with GRPO:
- multi-image inputs are not supported (a sketch of a multi-image example follows this list)
- if, for some reason (I am still investigating why), the model generates an image token id in the completion, a mismatch error is raised between the number of image tokens and the image feature size
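For context, this is roughly what a multi-image example looks like in the conversational dataset format; the field names follow the Hugging Face conversational convention, and the exact schema of my custom dataset is assumed here:

    example = {
        "prompt": [
            {
                "role": "user",
                "content": [
                    {"type": "image"},
                    {"type": "image"},
                    {"type": "text", "text": "Compare these two images."},
                ],
            }
        ],
        # Two PIL images attached to a single prompt
        "images": [image_0, image_1],
    }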
Using code similar to the following, where the dataset is a custom dataset with multiple images per prompt:
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Custom dataset whose prompts each contain multiple images
dataset = load_dataset("custom_dataset_with_multiple_images_in_prompt", split="train")

# Define the reward function, which rewards completions that are close to 20 characters
def reward_len(completions, **kwargs):
    return [-abs(20 - len(completion)) for completion in completions]

training_args = GRPOConfig(output_dir="Llava7b-GRPO")
trainer = GRPOTrainer(
    model="llava-hf/llava-1.5-7b-hf",
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
Some minor code changes were required at:

trl/trl/trainer/grpo_trainer.py, line 1369 (commit 5d914a4):
    kwargs = {"images": [[img] for img in images]}

trl/trl/trainer/grpo_trainer.py, line 1105 (commit 5d914a4):
    model_inputs["pixel_values"] = pixel_values[start : start + batch_size]
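A minimal sketch of the two changes, assuming each element of `images` is itself the list of images for one prompt, and that rows of `pixel_values` correspond to individual images rather than to examples (both are my assumptions about the multi-image layout, not TRL's actual fix):

    from itertools import accumulate

    # Line 1369: pass each example's image list through unchanged instead of
    # wrapping a single image per example (assumes `images` is a list of lists).
    kwargs = {"images": images}

    # Line 1105: slice pixel_values by cumulative image counts, so each batch
    # gets every row belonging to its examples.
    bounds = [0, *accumulate(len(imgs) for imgs in images)]
    model_inputs["pixel_values"] = pixel_values[bounds[start] : bounds[start + batch_size]]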
For the image token generation issue, replacing any generated image tokens is required at:

trl/trl/trainer/grpo_trainer.py, line 1587 (commit 5d914a4):
    completion_ids = prompt_completion_ids[:, prompt_length:]
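A minimal sketch of the replacement, assuming the model config exposes the image token id as `image_token_index` (as LLaVA-style configs do) and that substituting the pad token is acceptable; the variable names follow the surrounding trainer code, and this is my workaround rather than an official fix:

    completion_ids = prompt_completion_ids[:, prompt_length:]
    # Replace any image token ids the model generated with the pad token, so the
    # processor's image-token count keeps matching the image feature size.
    image_token_id = model.config.image_token_index
    pad_token_id = processing_class.tokenizer.pad_token_id
    completion_ids = completion_ids.masked_fill(completion_ids == image_token_id, pad_token_id)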
System Info
Ubuntu 24
TRL 0.20.0
Transformers 4.54.1
PEFT 0.17.0
Checklist
- I have checked that my issue isn't already filed (see open issues)
- I have included my system information
- Any code provided is minimal, complete, and reproducible (more on MREs)
- Any code provided is properly formatted in code blocks (no screenshots; more on code blocks)
- Any traceback provided is complete