add VLM SFT engine pipeline #3590
base: main
Conversation
Code Review
This pull request introduces a supervised fine-tuning (SFT) pipeline for Vision Language Models (VLMs), including data processing, training scripts, and necessary modifications to the dataset and engine components. The changes are extensive and enable multi-turn and multi-image training for models like Qwen 2.5 VL.
My review identified a few critical and high-severity issues: a leftover breakpoint() call that would halt execution, in-place modification of data structures that can lead to subtle bugs and race conditions, and an incorrect docstring in a data preprocessing script. Addressing these points will improve the robustness and maintainability of the new VLM SFT pipeline.
```python
multi_modal_inputs = batch.pop("multi_modal_inputs", None)
if multi_modal_inputs is not None:
    assert len(multi_modal_inputs) == len(batch["input_ids"]), \
        "Length of 'multi_modal_inputs' must match the batch size."
```
The function rearrange_micro_batches modifies its input batch by calling pop("multi_modal_inputs", None). This is an undocumented side effect that can cause bugs if the caller reuses the batch object afterwards, since it will be missing the multi_modal_inputs key. Prefer reading the value without mutating the dictionary, or working on a copy; TensorDict.exclude() fits well here because it returns a new TensorDict without the specified key.
Suggested change:

```diff
-multi_modal_inputs = batch.pop("multi_modal_inputs", None)
-if multi_modal_inputs is not None:
-    assert len(multi_modal_inputs) == len(batch["input_ids"]), \
-        "Length of 'multi_modal_inputs' must match the batch size."
+multi_modal_inputs = batch.get("multi_modal_inputs", None)
+if multi_modal_inputs is not None:
+    batch = batch.exclude("multi_modal_inputs")
+    assert len(multi_modal_inputs) == len(batch["input_ids"]), \
+        "Length of 'multi_modal_inputs' must match the batch size."
```
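To illustrate the difference the suggestion is driving at, here is a minimal sketch using plain dicts as a stand-in for TensorDict (the function names are hypothetical, and the dict comprehension stands in for `TensorDict.exclude()`): `pop()` destructively removes the key from the caller's batch, while a `get()` plus copy-without-key approach leaves the caller's object intact.

```python
def read_with_pop(batch):
    # Mutates `batch`: the key is gone from the caller's dict after this call.
    return batch.pop("multi_modal_inputs", None)

def read_with_get(batch):
    # Reads without mutating; a new dict is built locally if the key exists.
    multi_modal_inputs = batch.get("multi_modal_inputs", None)
    if multi_modal_inputs is not None:
        # Stand-in for TensorDict.exclude(): a copy without the key.
        batch = {k: v for k, v in batch.items() if k != "multi_modal_inputs"}
    return multi_modal_inputs

batch_a = {"input_ids": [[1, 2]], "multi_modal_inputs": [{"pixel_values": 0}]}
batch_b = {"input_ids": [[1, 2]], "multi_modal_inputs": [{"pixel_values": 0}]}

read_with_pop(batch_a)
read_with_get(batch_b)

print("multi_modal_inputs" in batch_a)  # False: side effect on the caller
print("multi_modal_inputs" in batch_b)  # True: caller's batch is untouched
```

This is why a caller that builds micro-batches in a loop and reuses the original batch would silently lose its multimodal inputs after the first call with the `pop()` version.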
What does this PR do?
This PR adds a VLM SFT engine pipeline, which complements volcengine/verl#3589.
Currently implemented features: