Two 4090s: loading the dataset is 5-6x slower than on a single 4090 #9693
Sleep-Enough asked this question in Q&A · Unanswered
I'm trying to run LoRA fine-tuning of qwen2-vl-2b on two 4090s.
With a single 4090 and otherwise identical config, loading the dataset finishes in about half an hour:
Running tokenizer on dataset (num_proc=12): 100%|██████████| 96470/96470 [37:11<00:00, 43.23 examples/s]
With two 4090s, loading still hadn't finished after 2 hours.
Here I had first set preprocessing_num_workers: 30:
Running tokenizer on dataset (num_proc=30): 81%|████████ | 78240/96470 [2:34:40<24:37, 12.34 examples/s]
I then set preprocessing_num_workers back to 12:
Running tokenizer on dataset (num_proc=12): 112470 examples [29:49, 15.55 examples/s]
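For context, my understanding is that this progress bar comes from Hugging Face `datasets.Dataset.map`, with `preprocessing_num_workers` passed through as `num_proc` to fork that many CPU worker processes. A minimal sketch of the equivalent call (the function and column names here are illustrative, not LLaMA-Factory's actual internals):

```python
# Illustrative only: names ("text", preprocess) are made up; the real qwen2-vl
# preprocessing also applies the chat template and image placeholder tokens.
from datasets import Dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")
ds = Dataset.from_dict({"text": ["hello world"] * 1000})

def preprocess(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

# num_proc forks 12 CPU worker processes; tokenization itself is pure CPU
# work, so by itself it should not touch the GPU.
tokenized = ds.map(preprocess, batched=True, num_proc=12,
                   desc="Running tokenizer on dataset")
```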
Either way, preprocessing runs extremely slowly, and both CPU and GPU utilization are very high.
GPU utilization (screenshot):
I don't understand why dataset loading uses the GPU at all, or why it only uses one of the two cards.
CPU utilization (screenshot):
Run command:
CUDA_VISIBLE_DEVICES=0,1 FORCE_TORCHRUN=1 llamafactory-cli train /root/autodl-tmp/zhou_vlm/yaml_dir/four_class_train/train.yaml
Attached: train.yaml and the deepspeed config.
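My current guess: with FORCE_TORCHRUN=1 and two visible GPUs, llamafactory-cli launches one process per GPU, and if each rank runs the tokenization map itself, preprocessing_num_workers: 30 turns into 2 × 30 = 60 CPU workers fighting over the same cores, with the work also done twice. The usual guard is a "main process first" pattern: rank 0 preprocesses and fills the datasets cache while the other rank waits, then reuses the cache. A generic sketch of that pattern follows; this is not a claim about LLaMA-Factory's exact code (transformers exposes the same idea as the TrainingArguments.main_process_first context manager):

```python
# Generic "main process first" guard under torch.distributed: rank 0 runs the
# map and fills the datasets fingerprint cache; other ranks wait at a barrier,
# then their own map call becomes a near-instant cache hit.
import torch.distributed as dist

def map_with_main_process_first(dataset, fn, num_proc):
    is_main = (not dist.is_initialized()) or dist.get_rank() == 0
    if not is_main:
        dist.barrier()   # non-main ranks wait here until rank 0 finishes
    result = dataset.map(fn, batched=True, num_proc=num_proc)
    if is_main and dist.is_initialized():
        dist.barrier()   # release the waiting ranks; they hit the cache
    return result
```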
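A workaround I'm considering: tokenize once in a single-process run, persist the result, and point later multi-GPU runs at the saved copy (I believe recent LLaMA-Factory versions expose a tokenized_path option for exactly this, but I haven't verified it). A plain datasets sketch of the idea, with a hypothetical cache path:

```python
# Workaround sketch: preprocess once, save to disk, and reload in later runs
# so the multi-GPU job never re-tokenizes. CACHE_DIR is a hypothetical path.
from datasets import Dataset, load_from_disk
from transformers import AutoTokenizer

CACHE_DIR = "/root/autodl-tmp/tokenized_cache"

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")
ds = Dataset.from_dict({"text": ["example"] * 100})  # stand-in for real data

def preprocess(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = ds.map(preprocess, batched=True, num_proc=12)
tokenized.save_to_disk(CACHE_DIR)     # run once, single process

reloaded = load_from_disk(CACHE_DIR)  # later runs load instantly, no re-map
```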