Description
Reminder
- I have read the README and searched the existing issues.
System Info
- llamafactory version: 0.9.1
- Platform: Linux-5.10.134-14.zncgsl6.x86_64-x86_64-with-glibc2.35
- Python version: 3.11.9
- PyTorch version: 2.6.0+cu124 (GPU)
- Transformers version: 4.52.1
- Datasets version: 2.16.0
- Accelerate version: 1.7.0
- PEFT version: 0.15.2
- TRL version: 0.9.6
- GPU type: NVIDIA A800-SXM4-80GB
- DeepSpeed version: 0.16.8
Reproduction
When fine-tuning a text-only large language model (e.g., Qwen2) with the Sequence Parallelism (sp) feature enabled, the program crashes during the data preprocessing stage.
The error message is AttributeError: Qwen2TokenizerFast has no attribute image_token_id.
This appears to be caused by a mismatch between the model choice and the data processing configuration. The specific data pipeline for Sequence Parallelism (_get_sequence_parallel_dataset) seems to be designed to handle multimodal data, as it attempts to access the tokenizer's image_token_id attribute. However, the tokenizer for a text-only model like Qwen2 does not contain this attribute, leading to a crash.
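The missing attribute can be confirmed in isolation with a short, purely illustrative snippet (not part of the training run; the model id mirrors the reproduction steps below):

```python
# Minimal standalone check (illustrative only): a text-only Qwen tokenizer exposes
# no image_token_id, which is exactly what the SP preprocessing tries to read.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
print(type(tokenizer).__name__)              # e.g. Qwen2TokenizerFast
print(hasattr(tokenizer, "image_token_id"))  # False for a text-only tokenizer
# Accessing the attribute directly reproduces the error from the traceback below:
# tokenizer.image_token_id  ->  AttributeError: Qwen2TokenizerFast has no attribute image_token_id
```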
Steps to Reproduce
1. Choose a text-only model as the base model, e.g., --model_name_or_path qwen/Qwen3-8B.
2. Prepare a text-only instruction fine-tuning dataset.
3. Enable the Sequence Parallelism feature when launching the training, e.g., by setting --sequence_parallel_size 2.
4. Execute the training script.
5. The program crashes during the data loading and preprocessing stage.
Actual Behavior
The training script is terminated by an AttributeError. The full error stack trace is provided below:
[rank0]: The above exception was the direct cause of the following exception:
[rank0]: Traceback (most recent call last):
[rank0]: File "/mnt/tenant-home_speed/czh/360-LLaMA-Factory-sp/src/train.py", line 28, in <module>
[rank0]: main()
[rank0]: File "/mnt/tenant-home_speed/czh/360-LLaMA-Factory-sp/src/train.py", line 19, in main
[rank0]: run_exp()
[rank0]: File "/mnt/tenant-home_speed/czh/360-LLaMA-Factory-sp/src/llamafactory/train/tuner.py", line 50, in run_exp
[rank0]: run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank0]: File "/mnt/tenant-home_speed/czh/360-LLaMA-Factory-sp/src/llamafactory/train/sft/workflow.py", line 47, in run_sft
[rank0]: dataset_module = get_dataset(template, model_args, data_args, training_args, stage="sft", **tokenizer_module)
[rank0]: File "/mnt/tenant-home_speed/czh/360-LLaMA-Factory-sp/src/llamafactory/data/loader.py", line 313, in get_dataset
[rank0]: dataset = _get_sequence_parallel_dataset(
[rank0]: File "/mnt/tenant-home_speed/czh/360-LLaMA-Factory-sp/src/llamafactory/data/loader.py", line 253, in _get_sequence_parallel_dataset
[rank0]: sp_dataset = padded_dataset.map(
[rank0]: File "/usr/local/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 592, in wrapper
[rank0]: out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
[rank0]: File "/usr/local/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 557, in wrapper
[rank0]: out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
[rank0]: File "/usr/local/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 3185, in map
[rank0]: for rank, done, content in iflatmap_unordered(
[rank0]: File "/usr/local/lib/python3.11/site-packages/datasets/utils/py_utils.py", line 654, in iflatmap_unordered
[rank0]: [async_result.get(timeout=0.05) for async_result in async_results]
[rank0]: File "/usr/local/lib/python3.11/site-packages/datasets/utils/py_utils.py", line 654, in <listcomp>
[rank0]: [async_result.get(timeout=0.05) for async_result in async_results]
[rank0]: File "/usr/local/lib/python3.11/site-packages/multiprocess/pool.py", line 774, in get
[rank0]: raise self._value
[rank0]: AttributeError: Qwen2TokenizerFast has no attribute image_token_id
[rank0]:[W1010 09:35:45.217851562 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
Expected behavior
The training framework should ideally support Sequence Parallelism for text-only models. If this configuration is intentionally unsupported, the framework should raise a clear configuration error (e.g., "Sequence Parallelism is not supported for text-only models in this implementation") rather than crashing with a low-level AttributeError.
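To make the expectation concrete, here is a minimal sketch of the two options (graceful fallback for text-only models vs. an explicit configuration check); the function names are hypothetical and do not reflect the actual 360-LLaMA-Factory-sp code:

```python
# Illustrative sketch only -- `resolve_image_token_id` and
# `check_sequence_parallel_compat` are hypothetical names, not the project's real API.
from transformers import PreTrainedTokenizerBase


def resolve_image_token_id(tokenizer: PreTrainedTokenizerBase):
    """Option 1: degrade gracefully so text-only models keep working with SP."""
    # getattr with a default avoids the AttributeError raised by text-only
    # tokenizers such as Qwen2TokenizerFast; callers treat None as "no image token".
    return getattr(tokenizer, "image_token_id", None)


def check_sequence_parallel_compat(tokenizer: PreTrainedTokenizerBase, sequence_parallel_size: int) -> None:
    """Option 2: fail fast with a readable configuration error instead of a low-level AttributeError."""
    if sequence_parallel_size > 1 and not hasattr(tokenizer, "image_token_id"):
        raise ValueError(
            "Sequence Parallelism preprocessing currently expects a multimodal tokenizer "
            f"with `image_token_id`, but got {type(tokenizer).__name__}. "
            "Use a multimodal model or set sequence_parallel_size to 1."
        )
```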
Others
I noticed that this issue appeared with the latest update, i.e., the October adaptation for the Qwen2.5-VL series models. Please check the corresponding code and provide a corrected version.