Sequence_parallel causes NCCL communication timeout #49

@wongzhenhao

Description

Reminder

  • I have read the README and searched the existing issues.

System Info

[Screenshot attached in place of the environment / error output]

I find that whenever I set sequence_parallel_size: 2 or higher, the problem shown above (an NCCL communication timeout) occurs; as soon as I set sequence_parallel_size: 1, training runs through without any issue.
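For anyone reproducing this, a minimal NCCL sanity check along the lines of the sketch below (the file name and timeout value are arbitrary, and it is unrelated to LLaMA-Factory code) can help rule out basic driver/NCCL problems between the two GPUs before involving sequence parallelism:

```python
# nccl_sanity_check.py  (file name is arbitrary)
# Run with, e.g.:  NCCL_DEBUG=INFO torchrun --nproc_per_node=2 nccl_sanity_check.py
import os
from datetime import timedelta

import torch
import torch.distributed as dist


def main():
    # torchrun sets RANK / LOCAL_RANK / WORLD_SIZE for us.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Use a short timeout so a hang surfaces quickly instead of after the
    # much longer default process-group timeout.
    dist.init_process_group(backend="nccl", timeout=timedelta(seconds=60))

    # Each rank contributes (rank + 1); after all_reduce every rank should
    # hold the sum over ranks (3.0 everywhere with 2 ranks).
    x = torch.full((4,), float(dist.get_rank() + 1), device="cuda")
    dist.all_reduce(x)
    torch.cuda.synchronize()
    print(f"rank {dist.get_rank()}: {x.tolist()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

If this simple all_reduce completes on both GPUs, the hang is more likely specific to the sequence-parallel code path rather than to the NCCL/driver setup.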

Reproduction

```yaml
### model
model_name_or_path: /mnt/public/data/lh/models/Qwen2.5-7B-Instruct
trust_remote_code: true

### method
stage: sft
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z3_config.json  # choices: [ds_z0_config.json, ds_z2_config.json, ds_z3_config.json]

### dataset
dataset: Infinity-Instruct,SYNTHETIC-1-SFT-Data  # alpaca_zh_demo
template: qwen
cutoff_len: 16384
max_samples: 100
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: /mnt/public/data/wongzhenhao/RAG/sft_ckpt/Qwen2.5-7B-Instruct/full/sft_16384_fp16
logging_steps: 10
save_steps: 1000
plot_loss: true
overwrite_output_dir: true
report_to: none  # disable wandb

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 1
learning_rate: 5.0e-7
num_train_epochs: 1.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000

### eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500
max_grad_norm: 1.0

sequence_parallel_size: 2
```
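To make the setting concrete: with sequence_parallel_size: 2 and cutoff_len: 16384, each rank in the sequence-parallel group holds roughly half of the tokens of a sample, and both ranks must issue the same collective calls; a rank waiting on a collective its peer never enters is a classic way to hit an NCCL watchdog timeout. The sketch below only illustrates that splitting pattern (the shard_sequence helper and the pad_token_id default are made up for this example), it is not the repository's actual implementation:

```python
# Illustration only: how one 16384-token sample could be sharded across a
# sequence-parallel group of size 2 (not 360-LLaMA-Factory / LLaMA-Factory code).
import torch


def shard_sequence(input_ids: torch.Tensor, sp_size: int, sp_rank: int,
                   pad_token_id: int = 0) -> torch.Tensor:
    """Pad the sequence to a multiple of sp_size, then return this rank's slice.

    If ranks sliced ragged (unpadded) lengths differently, they could end up
    issuing different collectives, which typically surfaces as an NCCL timeout.
    """
    seq_len = input_ids.size(-1)
    remainder = seq_len % sp_size
    if remainder:
        pad = input_ids.new_full((sp_size - remainder,), pad_token_id)
        input_ids = torch.cat([input_ids, pad], dim=-1)
    chunk = input_ids.size(-1) // sp_size
    return input_ids[sp_rank * chunk:(sp_rank + 1) * chunk]


# Example: cutoff_len = 16384 split across sequence_parallel_size = 2
sample = torch.arange(16384)
print(shard_sequence(sample, sp_size=2, sp_rank=0).shape)  # torch.Size([8192])
print(shard_sequence(sample, sp_size=2, sp_rank=1).shape)  # torch.Size([8192])
```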

Expected behavior

I hope you can help resolve this single-node communication problem, and ideally the bug behind the multi-node communication issue as well.

Others

No response
