Description
Reminder
- I have read the README and searched the existing issues.
System Info
I found that the problem described above occurs whenever I set sequence_parallel_size to 2 or higher; with sequence_parallel_size: 1, training runs through without any issue.
Reproduction
model
model_name_or_path: /mnt/public/data/lh/models/Qwen2.5-7B-Instruct
trust_remote_code: true
method
stage: sft
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z3_config.json # choices: [ds_z0_config.json, ds_z2_config.json, ds_z3_config.json]
dataset
dataset: Infinity-Instruct,SYNTHETIC-1-SFT-Data # alpaca_zh_demo
template: qwen
cutoff_len: 16384
max_samples: 100
overwrite_cache: true
preprocessing_num_workers: 16
output
output_dir: /mnt/public/data/wongzhenhao/RAG/sft_ckpt/Qwen2.5-7B-Instruct/full/sft_16384_fp16
logging_steps: 10
save_steps: 1000
plot_loss: true
overwrite_output_dir: true
report_to: none # disable wandb
train
per_device_train_batch_size: 1
gradient_accumulation_steps: 1
learning_rate: 5.0e-7
num_train_epochs: 1.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500
max_grad_norm: 1.0
sequence_parallel_size: 2
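Since the failure only appears with sequence_parallel_size >= 2, a quick pre-flight check of the parallel layout may help narrow it down. The sketch below is only an illustration under assumptions that this report does not confirm: that the number of GPUs and cutoff_len must both be divisible by sequence_parallel_size (a common constraint for sequence parallelism). The function name `check_sequence_parallel` and the GPU count of 8 are hypothetical and not part of the framework or of this setup.

```python
from typing import List


def check_sequence_parallel(world_size: int, cutoff_len: int, sp_size: int) -> List[str]:
    """Return a list of human-readable problems with the parallel layout.

    Assumption (not confirmed by the report): both the GPU count and the
    sequence length must split evenly across sequence-parallel ranks.
    """
    problems: List[str] = []
    if sp_size < 1:
        problems.append("sequence_parallel_size must be >= 1")
        return problems
    if world_size % sp_size != 0:
        problems.append(
            f"world_size={world_size} is not divisible by sequence_parallel_size={sp_size}"
        )
    if cutoff_len % sp_size != 0:
        problems.append(
            f"cutoff_len={cutoff_len} is not divisible by sequence_parallel_size={sp_size}"
        )
    return problems


if __name__ == "__main__":
    # world_size=8 is a hypothetical GPU count; cutoff_len and sp_size come from the config above.
    for problem in check_sequence_parallel(world_size=8, cutoff_len=16384, sp_size=2):
        print(problem)
```

If this check reports nothing, the layout itself is probably not the cause, and the communication error is more likely to come from how the process groups are set up.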
Expected behavior
I hope you can help resolve this single-node communication bug, and ideally the multi-node communication bug as well.
Others
No response
