
moe : RuntimeError: v must have shape (batch_size, seqlen_k, num_heads_k, head_size) #66

@JCode012

Reminder

  • I have read the README and searched the existing issues.

System Info

Single machine with 8× H20 GPUs

Reproduction

[rank5]: File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 721, in redispatch
[rank5]: return self._handle.redispatch_boxed(keyset, *args, **kwargs)
[rank5]: File "/usr/local/lib/python3.10/dist-packages/torch/_library/custom_ops.py", line 324, in backend_impl
[rank5]: result = self._backend_fns[device_type](*args, **kwargs)
[rank5]: File "/usr/local/lib/python3.10/dist-packages/torch/_compile.py", line 32, in inner
[rank5]: return disable_fn(*args, **kwargs)
[rank5]: File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/eval_frame.py", line 632, in _fn
[rank5]: return fn(*args, **kwargs)
[rank5]: File "/usr/local/lib/python3.10/dist-packages/torch/_library/custom_ops.py", line 367, in wrapped_fn
[rank5]: return fn(*args, **kwargs)
[rank5]: File "/usr/local/lib/python3.10/dist-packages/flash_attn/flash_attn_interface.py", line 96, in _flash_attn_forward
[rank5]: out, softmax_lse, S_dmask, rng_state = flash_attn_gpu.fwd(
[rank5]: RuntimeError: v must have shape (batch_size, seqlen_k, num_heads_k, head_size)
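For reference, the check that raises this error enforces a fixed layout for the packed tensors: q is (batch_size, seqlen_q, num_heads, head_size), k is (batch_size, seqlen_k, num_heads_k, head_size), and v must match k exactly. Below is a minimal sketch of that contract with plain torch tensors; the shape values are illustrative, not taken from this run.

import torch

# Illustrative values only; the real ones come from the model config and batch.
batch_size, seqlen_q, seqlen_k = 1, 1024, 1024
num_heads, num_heads_k, head_size = 16, 4, 128

q = torch.randn(batch_size, seqlen_q, num_heads, head_size, dtype=torch.bfloat16)
k = torch.randn(batch_size, seqlen_k, num_heads_k, head_size, dtype=torch.bfloat16)
v = torch.randn(batch_size, seqlen_k, num_heads_k, head_size, dtype=torch.bfloat16)

# flash_attn_gpu.fwd requires v to share k's sequence length, KV head count,
# and head size. If some split (e.g. a sequence-parallel scatter) changes one
# of v's dims relative to k, this is the RuntimeError that gets raised.
assert v.shape == (batch_size, seqlen_k, num_heads_k, head_size)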

deepspeed src/train.py \
    --stage sft \
    --do_train \
    --model_name_or_path /mnt/tidalfs-idc01/dataset/redaccel/models/Moonlight-16B-A3B-Instruct \
    --dataset agent_v1.9_am_thinking_mathcode_only \
    --template deepseek3 \
    --finetuning_type lora \
    --lora_target all \
    --output_dir saves/360/moonlight_16bA3b/lora/sft_sp \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 1024 \
    --max_samples 1000 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --warmup_ratio 0.0 \
    --logging_steps 1 \
    --save_steps 100 \
    --learning_rate 8.0e-6 \
    --num_train_epochs 2.0 \
    --plot_loss \
    --deepspeed examples/deepspeed/ds_z3_config.json \
    --bf16 True \
    --ddp_timeout 180000000 \
    --sequence_parallel_size 8 \
    --flash_attn fa2 \
    --enable_liger_kernel false \
    --preprocessing_num_workers 128 \
    --trust_remote_code false
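One way to narrow this down (a hypothetical debugging aid, not part of the setup above) is to log the shapes of the tensors entering each attention module, to see how the sequence-parallel split has already reshaped the inputs by the time flash-attn is called. The module-name matching below is a heuristic and may need adjusting for the Moonlight model definition.

import torch

def log_attention_inputs(model):
    # Registers forward pre-hooks that print the shapes of all tensor inputs
    # reaching modules whose class name contains "Attention".
    def make_hook(name):
        def hook(module, args, kwargs):
            shapes = [tuple(t.shape)
                      for t in list(args) + list(kwargs.values())
                      if isinstance(t, torch.Tensor)]
            print(f"[{name}] input tensor shapes: {shapes}")
        return hook

    for name, module in model.named_modules():
        if "Attention" in type(module).__name__:
            module.register_forward_pre_hook(make_hook(name), with_kwargs=True)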

Expected behavior

No response

Others

No response
