
moe : RuntimeError: v must have shape (batch_size, seqlen_k, num_heads_k, head_size) #66

@JCode012

Reminder

  • I have read the README and searched the existing issues.

System Info

Single machine with 8× H20 GPUs

Reproduction

[rank5]: File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 721, in redispatch
[rank5]: return self._handle.redispatch_boxed(keyset, *args, **kwargs)
[rank5]: File "/usr/local/lib/python3.10/dist-packages/torch/_library/custom_ops.py", line 324, in backend_impl
[rank5]: result = self._backend_fns[device_type](*args, **kwargs)
[rank5]: File "/usr/local/lib/python3.10/dist-packages/torch/_compile.py", line 32, in inner
[rank5]: return disable_fn(*args, **kwargs)
[rank5]: File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/eval_frame.py", line 632, in _fn
[rank5]: return fn(*args, **kwargs)
[rank5]: File "/usr/local/lib/python3.10/dist-packages/torch/_library/custom_ops.py", line 367, in wrapped_fn
[rank5]: return fn(*args, **kwargs)
[rank5]: File "/usr/local/lib/python3.10/dist-packages/flash_attn/flash_attn_interface.py", line 96, in _flash_attn_forward
[rank5]: out, softmax_lse, S_dmask, rng_state = flash_attn_gpu.fwd(
[rank5]: RuntimeError: v must have shape (batch_size, seqlen_k, num_heads_k, head_size)
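For reference, the check that raises this error enforces a fixed layout for the packed tensors: q is (batch_size, seqlen_q, num_heads, head_size), k is (batch_size, seqlen_k, num_heads_k, head_size), and v must match k exactly. Below is a minimal sketch of that contract with plain torch tensors; the shape values are illustrative, not taken from this run.

import torch

# Illustrative values only; the real ones come from the model config and batch.
batch_size, seqlen_q, seqlen_k = 1, 1024, 1024
num_heads, num_heads_k, head_size = 16, 4, 128

q = torch.randn(batch_size, seqlen_q, num_heads, head_size, dtype=torch.bfloat16)
k = torch.randn(batch_size, seqlen_k, num_heads_k, head_size, dtype=torch.bfloat16)
v = torch.randn(batch_size, seqlen_k, num_heads_k, head_size, dtype=torch.bfloat16)

# flash_attn_gpu.fwd requires v to share k's sequence length, KV head count,
# and head size. If some split (e.g. a sequence-parallel scatter) changes one
# of v's dims relative to k, this is the RuntimeError that gets raised.
assert v.shape == (batch_size, seqlen_k, num_heads_k, head_size)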

deepspeed src/train.py \
    --stage sft \
    --do_train \
    --model_name_or_path /mnt/tidalfs-idc01/dataset/redaccel/models/Moonlight-16B-A3B-Instruct \
    --dataset agent_v1.9_am_thinking_mathcode_only \
    --template deepseek3 \
    --finetuning_type lora \
    --lora_target all \
    --output_dir saves/360/moonlight_16bA3b/lora/sft_sp \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 1024 \
    --max_samples 1000 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --warmup_ratio 0.0 \
    --logging_steps 1 \
    --save_steps 100 \
    --learning_rate 8.0e-6 \
    --num_train_epochs 2.0 \
    --plot_loss \
    --deepspeed examples/deepspeed/ds_z3_config.json \
    --bf16 True \
    --ddp_timeout 180000000 \
    --sequence_parallel_size 8 \
    --flash_attn fa2 \
    --enable_liger_kernel false \
    --preprocessing_num_workers 128 \
    --trust_remote_code false
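One way to narrow this down (a hypothetical debugging aid, not part of the setup above) is to log the shapes of the tensors entering each attention module, to see how the sequence-parallel split has already reshaped the inputs by the time flash-attn is called. The module-name matching below is a heuristic and may need adjusting for the Moonlight model definition.

import torch

def log_attention_inputs(model):
    # Registers forward pre-hooks that print the shapes of all tensor inputs
    # reaching modules whose class name contains "Attention".
    def make_hook(name):
        def hook(module, args, kwargs):
            shapes = [tuple(t.shape)
                      for t in list(args) + list(kwargs.values())
                      if isinstance(t, torch.Tensor)]
            print(f"[{name}] input tensor shapes: {shapes}")
        return hook

    for name, module in model.named_modules():
        if "Attention" in type(module).__name__:
            module.register_forward_pre_hook(make_hook(name), with_kwargs=True)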

Expected behavior

No response

Others

No response
