prompt+completion training vs messages training for VLM #4077

### Reproduction

I followed this cookbook, https://huggingface.co/learn/cookbook/fine_tuning_vlm_trl, and tried to implement it myself. However, the model usually seems to converge to a local optimum.
The cookbook uses this data format:
```python
def format_data(sample):
    # `system_message` is assumed to be defined earlier, as in the cookbook.
    return {
        "images": [sample["image"]],
        "messages": [
            {
                "role": "system",
                "content": [{"type": "text", "text": system_message}],
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "image": sample["image"],
                    },
                    {
                        "type": "text",
                        "text": sample["query"],
                    },
                ],
            },
            {
                "role": "assistant",
                "content": [{"type": "text", "text": sample["label"][0]}],
            },
        ],
    }
```
I then found the TRL docs (https://huggingface.co/docs/trl/en/sft_trainer#training-vision-language-models) and realized that the prompt and completion should be in two separate fields to make it work. I switched to the following format and it worked:
```python
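def format_data(sample):  # function name is illustrative
    # The resized copy goes in `images`; the prompt below still references the original image.
    images = [sample["image"].resize((512, 512))]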
    prompt = [
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "image": sample["image"],
                },
                {
                    "type": "text",
                    "text": sample['prompt'],
                }
            ],
        },
    ]

    completion = [
        {
            "role": "assistant",
            "content": sample["completion"],
        },
    ]
    return {
        "images": images,
        "prompt": prompt,
        "completion": completion,
    }
```
Is this intended, or is it just a version difference? I am training Qwen-2.5-VL-3B.
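
To make the difference between the two layouts concrete, here is a minimal sketch of how a sample in the first (`messages`) format could be split into the second (`prompt` + `completion`) format. `split_messages` is a hypothetical helper, not a TRL function, and it assumes the last message is always the assistant reply. A string stands in for the PIL image in the toy sample.

```python
from typing import Any


def split_messages(example: dict[str, Any]) -> dict[str, Any]:
    """Split a `messages`-style sample into `prompt` and `completion` fields.

    Assumption: the final message is the assistant reply. Everything before
    it (system and user turns) becomes the prompt.
    """
    messages = example["messages"]
    if not messages or messages[-1]["role"] != "assistant":
        raise ValueError("expected the last message to be the assistant reply")
    return {
        "images": example["images"],
        "prompt": messages[:-1],       # system + user turns
        "completion": messages[-1:],   # assistant turn only, kept as a list
    }


# Toy sample in the `messages` layout (hypothetical values).
sample = {
    "images": ["<image placeholder>"],
    "messages": [
        {"role": "system", "content": [{"type": "text", "text": "You are a chart expert."}]},
        {
            "role": "user",
            "content": [
                {"type": "image", "image": "<image placeholder>"},
                {"type": "text", "text": "What is the highest value?"},
            ],
        },
        {"role": "assistant", "content": [{"type": "text", "text": "42"}]},
    ],
}

converted = split_messages(sample)
print([m["role"] for m in converted["prompt"]])
print([m["role"] for m in converted["completion"]])
```

The only structural difference between the two layouts is where the assistant turn lives, so a helper like this could be applied per sample before building the training set.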


### System Info

- Platform: Linux-6.1.123+-x86_64-with-glibc2.35
- Python version: 3.12.11
- TRL version: 0.24.0.dev0
- PyTorch version: 2.8.0+cu126
- accelerator(s): NVIDIA A100-SXM4-40GB
- Transformers version: 4.56.1
- Accelerate version: 1.10.1
- Accelerate config: not found
- Datasets version: 4.0.0
- HF Hub version: 0.34.4
- bitsandbytes version: 0.47.0
- DeepSpeed version: not installed
- Diffusers version: 0.35.1
- Liger-Kernel version: not installed
- LLM-Blender version: not installed
- OpenAI version: 1.106.1
- PEFT version: 0.17.1
- vLLM version: not installed

### Checklist

- [x] I have checked that my issue isn't already filed (see [open issues](https://github.com/huggingface/trl/issues?q=is%3Aissue))
- [x] I have included my system information
- [x] Any code provided is minimal, complete, and reproducible ([more on MREs](https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/creating-and-highlighting-code-blocks))
- [x] Any code provided is properly formatted in code blocks, (no screenshot, [more on code blocks](https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/creating-and-highlighting-code-blocks))
- [x] Any traceback provided is complete
