[Tracking] Qwen3-14B A8W8 decode kernels and TPOT optimization #665

Description

@vegetabledoww

Summary

Track the Qwen3-14B A8W8 PyPTO kernel work and the follow-up decode TPOT optimization path.

This issue is associated with PR #642, which adds the Qwen3-14B A8W8 kernel path in pypto-lib. The current end-to-end A8W8 path can run successfully with the matching serving-side PR, but the remaining performance gap appears to be dominated by decode task fanout and scheduling overhead rather than one isolated small operator.

Current status

  • The combined pypto-lib + pypto-serving A8W8 path can run an end-to-end Qwen3-14B A8W8 decode flow.
  • The latest measured baseline is around 360-371 ms/token TPOT.
  • A previous fused QK norm path caused repeated text output. The root cause was narrowed down to the fused branch taking a batched Q RoPE path that was not numerically equivalent to the unfused path.
  • Switching Q RoPE back to per-Q-head rotation restores output quality, and single-layer debug showed bitwise agreement for the corrected Q RoPE output.
  • The quality fix alone does not provide a speedup. The best corrected fused-QK-norm path observed so far is around 429.6 ms/token, which is slower than the 360-371 ms/token baseline.

Measurement basis

Decode profiling suggests the main cost is task granularity and fanout:

  • Device wall time is around 356 ms/step.
  • Each decode step emits about 38,705 AICore task records.
  • Average task runtime is about 45 us.
  • Dispatch-to-finish latency is about 61 us, which suggests roughly 25% scheduling/head-tail overhead per task ((61 - 45) / 61 ≈ 26%).
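
The arithmetic behind these figures can be reproduced with a short script. This is only a sketch: it assumes the 45 us and 61 us values are per-record means over the same decode step as the 356 ms wall time, and the "tasks in flight" number is an estimate derived here, not a measured value.

```python
# Figures reported in the profiling bullets above.
wall_ms_per_step = 356.0
tasks_per_step = 38_705
avg_task_runtime_us = 45.0
dispatch_to_finish_us = 61.0

# Per-task overhead: dispatch-to-finish time that is not spent running.
overhead_us = dispatch_to_finish_us - avg_task_runtime_us
overhead_fraction = overhead_us / dispatch_to_finish_us

# Summed task runtime per step, and the average number of tasks that
# would have to be running concurrently to fit into the wall time
# (derived estimate, assuming the averages cover all task records).
total_task_runtime_ms = tasks_per_step * avg_task_runtime_us / 1000.0
avg_tasks_in_flight = total_task_runtime_ms / wall_ms_per_step

print(f"overhead per task: {overhead_us:.0f} us ({overhead_fraction:.1%})")
print(f"summed task runtime: {total_task_runtime_ms:.0f} ms/step")
print(f"average tasks in flight: {avg_tasks_in_flight:.1f}")
```

With many short tasks, the fixed per-task overhead is paid tens of thousands of times per step, which is why reducing the number of tasks is expected to matter more than speeding up any single one.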

Static group breakdown per layer:

| Area | Groups per layer | Notes |
| --- | --- | --- |
| MLP gate/up/silu/down | 224 | Largest fanout source, about 69% of groups |
| out_proj | 40 | Already investigated through several submit/fusion attempts |
| QKV projection | 28 | Smaller than MLP fanout |
| Other attention/norm work | 32 | Secondary priority |
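
As a quick check of the "about 69%" figure, the per-area shares can be computed from the group counts listed in the breakdown. The dictionary below simply restates those counts; nothing else is assumed.

```python
# Static submit groups per layer, as listed in the breakdown above.
groups_per_layer = {
    "MLP gate/up/silu/down": 224,
    "out_proj": 40,
    "QKV projection": 28,
    "Other attention/norm work": 32,
}

total_groups = sum(groups_per_layer.values())
shares = {area: count / total_groups for area, count in groups_per_layer.items()}

for area, share in shares.items():
    print(f"{area:<28} {groups_per_layer[area]:>4} {share:6.1%}")
print(f"{'total':<28} {total_groups:>4}")
```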

Findings so far

  • The QUANT_ROWMAX_TRANSPOSE umbrella switch is not a necessary condition for text repetition.
  • QWEN_A8W8_FUSED_QKV_DEQUANT=1 with QWEN_A8W8_FUSED_QK_NORM=1 reproduced the repetition issue even without the ACT/OUT rowmax transpose switches.
  • QWEN_A8W8_FUSED_QKV_DEQUANT=1 with QWEN_A8W8_FUSED_QK_NORM=0 restored text quality, but TPOT was still poor.
  • Single-layer debug showed q_pre_rope and k_pre_rope were bitwise identical between fused and unfused paths, so the Q/K norm math itself was not the root cause.
  • The fused branch's Q RoPE implementation was the quality issue.
  • Model-DSL-level submit fusion or tile-size tuning has limited headroom because several attempts either had no measurable benefit or hit backend buffer limits.
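
The bitwise comparisons mentioned in the single-layer debug can be sketched as follows. This is a minimal illustration, not the tooling actually used: it assumes the tensors are dumped as flat lists of float32 values, and the helper names and sample data are hypothetical.

```python
import struct
from typing import Optional, Sequence


def float32_bits(value: float) -> int:
    """Return the IEEE-754 binary32 bit pattern of value as an integer."""
    return struct.unpack("<I", struct.pack("<f", value))[0]


def first_bitwise_mismatch(ref: Sequence[float], test: Sequence[float]) -> Optional[int]:
    """Index of the first element whose float32 bits differ, or None if all match."""
    if len(ref) != len(test):
        raise ValueError("tensors must have the same number of elements")
    for i, (a, b) in enumerate(zip(ref, test)):
        if float32_bits(a) != float32_bits(b):
            return i
    return None


# Hypothetical flattened dumps of one tensor from the two paths.
unfused_q = [0.125, -1.5, 3.0, 0.0]
fused_q = [0.125, -1.5, 3.0, -0.0]  # -0.0 == 0.0 numerically, but the bits differ

print(first_bitwise_mismatch(unfused_q, unfused_q))
print(first_bitwise_mismatch(unfused_q, fused_q))
```

Comparing bit patterns rather than using a tolerance is stricter than a numerical closeness check, which is what makes "bitwise identical" a strong statement about q_pre_rope and k_pre_rope.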

Latest dual-output experiment results

The recent matmul_dual / matmul_dual_acc experiments further confirm that the gate/up fanout cannot be reduced by simple DSL composition alone:

  • After adding L0 tiling support to matmul_dual, the Right-buffer issue can be removed, but the tensor-level path still preloads two complete RHS tiles, so Mat buffer usage still exceeds the limit.
  • Smaller K chunks can avoid the Mat buffer limit, but ordinary dual-output accumulation across chunks needs an AIV add, and the generated SSA is currently incorrect.
  • matmul_dual_acc can remove the AIV add, but the model-level second Acc output still triggers an Acc-to-Acc tmov before store, which ptoas does not support today.
  • Therefore the next useful path is lower-level support for multi-Acc output store/layout handling, or a dedicated dual-output lowering that explicitly manages RHS staging, accumulator lifetime, and output stores.

Ruled-out or low-priority paths

  • FUSE_OUT_PROJ_NPAIR=1 was measured around 411.3 ms/token and did not provide a useful improvement.
  • Connecting a8w8_matmul_dequant_acc to out_proj did not provide a substantial gain and ran into backend constraints.
  • MLP_TN=512 failed compilation due to Mat buffer usage exceeding the current limit:
    • gate_proj_aic Mat buffer usage 655360 > 524288
    • up_proj_aic Mat buffer usage 655360 > 524288
  • The default DOWN_SPLITK_ATOMIC=0 already keeps the down projection coarse at the model submit level. Enabling atomic split-K would likely increase submit count and should not be the first optimization path.
  • SiLU submit grouping is not expected to have enough upside compared with the larger MLP projection fanout.
  • Simple DSL-level matmul_dual composition is blocked by Mat buffer pressure, incorrect cross-chunk SSA generation, or unsupported Acc-to-Acc movement before storing the second accumulator output.
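
To put the MLP_TN=512 compile failure in perspective, the reported Mat buffer usage can be compared against the limit quoted in the error. The sketch below only restates the two numbers from the error messages; the helper name is hypothetical and no assumption is made about how the 655360 bytes break down.

```python
MAT_BUFFER_LIMIT_BYTES = 524_288  # limit quoted in the compile error (512 KiB)


def mat_buffer_overage(usage_bytes: int, limit_bytes: int = MAT_BUFFER_LIMIT_BYTES) -> int:
    """Bytes by which usage exceeds the limit; 0 if it fits."""
    return max(0, usage_bytes - limit_bytes)


# Usage values reported for MLP_TN=512.
reported_usage = {"gate_proj_aic": 655_360, "up_proj_aic": 655_360}

for kernel, usage in reported_usage.items():
    over = mat_buffer_overage(usage)
    print(
        f"{kernel}: {usage // 1024} KiB used, {over // 1024} KiB over, "
        f"{usage / MAT_BUFFER_LIMIT_BYTES:.0%} of limit"
    )
```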

Proposed next step

Add a backend-supported dual-output matmul path for MLP gate/up projection instead of continuing to force the optimization at the model DSL level.

A possible API shape:

gate_acc, up_acc = matmul_dual_acc(A, W_gate, W_up)

The key lowering requirement is that this must not simply expand to two independent matmuls and must not rely on unsupported model-level Acc-to-Acc movement. The lowering should:

  • Reuse the A tile load/prepare work.
  • Keep gate and up outputs mathematically independent.
  • Stage B tiles and C accumulators so the Mat buffer limit is not exceeded.
  • Support storing multiple Acc outputs with an explicit, supported layout/store path.
  • Preserve the current downstream SiLU, quant/dequant, and down-projection semantics.
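
The intended semantics of the requirements above can be written down as a pure-Python reference. This is a sketch of the expected results only, not of the backend lowering: the function name follows the proposed API shape, while the k_chunk parameter, the list-of-lists representation, and the sample data are assumptions made for illustration.

```python
from typing import List, Tuple

Matrix = List[List[int]]


def matmul_ref(a: Matrix, w: Matrix) -> Matrix:
    """Plain reference matmul: a is M x K, w is K x N."""
    k_dim, n_dim = len(w), len(w[0])
    return [[sum(row[k] * w[k][n] for k in range(k_dim)) for n in range(n_dim)] for row in a]


def matmul_dual_acc_ref(
    a: Matrix, w_gate: Matrix, w_up: Matrix, k_chunk: int = 2
) -> Tuple[Matrix, Matrix]:
    """Reference semantics: two independent accumulators fed by one pass over A."""
    m_dim, k_dim = len(a), len(a[0])
    n_dim = len(w_gate[0])
    gate_acc = [[0] * n_dim for _ in range(m_dim)]
    up_acc = [[0] * n_dim for _ in range(m_dim)]
    for k0 in range(0, k_dim, k_chunk):
        k1 = min(k0 + k_chunk, k_dim)
        # The A chunk is sliced once and consumed by both outputs.
        a_chunk = [row[k0:k1] for row in a]
        for m in range(m_dim):
            for n in range(n_dim):
                for kk in range(k1 - k0):
                    a_val = a_chunk[m][kk]
                    # The accumulators never read each other, so gate and
                    # up stay mathematically independent.
                    gate_acc[m][n] += a_val * w_gate[k0 + kk][n]
                    up_acc[m][n] += a_val * w_up[k0 + kk][n]
    return gate_acc, up_acc


# Small integer example (K=3 with k_chunk=2 exercises a partial last chunk).
A = [[1, -2, 3], [4, 5, -6]]
W_gate = [[1, 0], [2, -1], [0, 3]]
W_up = [[-1, 2], [0, 1], [4, 0]]

gate_acc, up_acc = matmul_dual_acc_ref(A, W_gate, W_up)
print(gate_acc == matmul_ref(A, W_gate), up_acc == matmul_ref(A, W_up))
```

A reference like this could serve as the golden output when checking that a real dual-output lowering matches two independent matmuls.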

Suggested validation plan

  1. Add a compile-only microbenchmark using the current Qwen3-14B MLP decode shape:
    • M=16
    • K=512
    • N=256
    • Two RHS tensors for gate/up
  2. Verify the new lowering compiles without exceeding the Mat buffer limit.
  3. Verify both Acc outputs can be stored without an unsupported Acc-to-Acc tmov.
  4. Wire the path into models/qwen3/14b/decode_layer_a8w8.py behind a default-off flag such as QWEN_A8W8_DUAL_GATE_UP_PROJ=1.
  5. Run 16-token and 48-token quality checks.
  6. Re-profile TPOT and AICore task records.
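
The default-off gating in step 4 could look roughly like the following. How models/qwen3/14b/decode_layer_a8w8.py actually reads its flags is not shown in this issue, so the environment-variable lookup, the helper names, and the path labels are all assumptions.

```python
import os
from typing import Mapping, Optional

FLAG_NAME = "QWEN_A8W8_DUAL_GATE_UP_PROJ"  # flag name suggested in the plan above


def dual_gate_up_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """True only when the flag is set to exactly "1"; unset or anything else is off."""
    env = os.environ if env is None else env
    return env.get(FLAG_NAME, "0") == "1"


def select_gate_up_path(env: Optional[Mapping[str, str]] = None) -> str:
    """Return a label for the gate/up projection path to build (labels are illustrative)."""
    return "dual_output" if dual_gate_up_enabled(env) else "separate_matmuls"


print(select_gate_up_path({}))
print(select_gate_up_path({FLAG_NAME: "1"}))
```

Treating every value other than "1" as off keeps the existing path as the default, so the experimental lowering cannot be enabled by accident.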

Acceptance criteria

  • The default path remains correct and stable.
  • The experimental dual-output path compiles successfully for the Qwen3-14B A8W8 MLP decode shape.
  • Both gate and up Acc outputs are stored through a supported lowering path, without an unsupported Acc-to-Acc tmov.
  • 16-token and 48-token output quality checks pass without repeated-output regression.
  • TPOT improves versus the current ~360-371 ms/token baseline.
  • Task records or static submit group count decrease in the MLP gate/up area.
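
Two of these criteria lend themselves to an automated check. The sketch below is one hypothetical way to flag repeated output and to compare TPOT against the baseline; the n-gram size, the repeat threshold, and the choice to compare against the low end of the 360-371 ms/token range are assumptions, not part of the existing quality checks.

```python
from typing import Dict, Sequence, Tuple


def has_repeated_ngram(tokens: Sequence[int], n: int = 4, max_repeats: int = 2) -> bool:
    """True if any n-gram occurs more than max_repeats times in tokens."""
    counts: Dict[Tuple[int, ...], int] = {}
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        counts[gram] = counts.get(gram, 0) + 1
        if counts[gram] > max_repeats:
            return True
    return False


def tpot_improved(new_ms: float, baseline_low_ms: float = 360.0) -> bool:
    """Conservative check: require beating the low end of the baseline range."""
    return new_ms < baseline_low_ms


# Hypothetical 16-token outputs, expressed as token ids.
healthy = list(range(16))
looping = [7, 8, 9, 10] * 4

print(has_repeated_ngram(healthy), has_repeated_ngram(looping))
print(tpot_improved(429.6))
```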
