Summary
Track the Qwen3-14B A8W8 PyPTO kernel work and the follow-up decode TPOT optimization path.
This issue is associated with PR #642, which adds the Qwen3-14B A8W8 kernel path in pypto-lib. The current end-to-end A8W8 path can run successfully with the matching serving-side PR, but the remaining performance gap appears to be dominated by decode task fanout and scheduling overhead rather than one isolated small operator.
Current status
- The combined pypto-lib + pypto-serving A8W8 path can run an end-to-end Qwen3-14B A8W8 decode flow.
- The latest measured baseline is around 360-371 ms/token TPOT.
- A previous fused QK norm path caused repeated text output. The root cause was narrowed down to the fused branch taking a batched Q RoPE path that was not numerically equivalent to the unfused path.
- Switching Q RoPE back to per-Q-head rotation restores output quality, and single-layer debug showed bitwise agreement for the corrected Q RoPE output.
- The quality fix alone does not provide a speedup. The best corrected fused-QK-norm path observed so far is around 429.6 ms/token, which is slower than the 360-371 ms/token baseline.
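The kind of bitwise comparison used in single-layer debug can be sketched as follows. This is a minimal illustration in plain Python, not the project's debug tooling: `f32_bits`, `compare_bitwise`, and the sample values are hypothetical. It shows why a bit-pattern check catches drift that a tolerance-based check would accept.

```python
import struct

def f32_bits(x: float) -> int:
    """Return the IEEE-754 binary32 bit pattern of x as an unsigned integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def compare_bitwise(ref, test):
    """Compare two flat float sequences by their float32 bit patterns."""
    if len(ref) != len(test):
        raise ValueError("length mismatch")
    mismatches = [i for i, (a, b) in enumerate(zip(ref, test))
                  if f32_bits(a) != f32_bits(b)]
    max_abs_diff = max((abs(a - b) for a, b in zip(ref, test)), default=0.0)
    return {
        "bitwise_equal": not mismatches,
        "mismatches": mismatches,
        "max_abs_diff": max_abs_diff,
    }

# Hypothetical stand-ins for a flattened Q RoPE output from each path.
q_unfused = [0.5, -1.25, 3.0, 0.1]
q_fused = list(q_unfused)

# Perturb one element by a single float32 ULP to imitate a non-equivalent path.
one_ulp_up = struct.unpack("<f", struct.pack("<I", f32_bits(3.0) + 1))[0]
q_drift = [0.5, -1.25, one_ulp_up, 0.1]

report_same = compare_bitwise(q_unfused, q_fused)
report_drift = compare_bitwise(q_unfused, q_drift)
print(report_same)
print(report_drift)
```

The one-ULP difference is far below any typical tolerance, yet it is flagged here, which is the property needed when checking that a fused path is equivalent to the unfused one.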
Measurement basis
Decode profiling suggests the main cost is task granularity and fanout:
- Device wall time is around 356 ms/step.
- Each decode step emits about 38,705 AICore task records.
- Average task runtime is about 45 us.
- Dispatch-to-finish latency is about 61 us, which suggests roughly 16 us per task (about 26%) of scheduling/head-tail overhead.
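The arithmetic behind these figures can be reproduced with a short sketch. The input values are the ones quoted above; the derived "tasks in flight" number is an interpretation that assumes the 45 us average applies to all task records in a step.

```python
# Figures quoted from the decode profile.
task_records_per_step = 38_705
avg_task_runtime_us = 45.0
dispatch_to_finish_us = 61.0
device_wall_ms = 356.0

# Per-task time that is not spent executing the task body.
overhead_us = dispatch_to_finish_us - avg_task_runtime_us
overhead_fraction = overhead_us / dispatch_to_finish_us

# Sum of task runtimes across one decode step, in milliseconds.
summed_runtime_ms = task_records_per_step * avg_task_runtime_us / 1000.0

# Average number of tasks executing concurrently over the step
# (assumption: the average runtime is representative of every record).
avg_tasks_in_flight = summed_runtime_ms / device_wall_ms

print(f"overhead per task: {overhead_us:.0f} us ({overhead_fraction:.1%})")
print(f"summed task runtime: {summed_runtime_ms:.1f} ms per step")
print(f"average tasks in flight: {avg_tasks_in_flight:.2f}")
```

Because the per-task overhead is paid once per task record, reducing the number of records per step lowers the total overhead even if the arithmetic work stays the same.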
Static group breakdown per layer:
| Area | Groups per layer | Notes |
| --- | --- | --- |
| MLP gate/up/silu/down | 224 | Largest fanout source, about 69% of groups |
| out_proj | 40 | Already investigated through several submit/fusion attempts |
| QKV projection | 28 | Smaller than MLP fanout |
| Other attention/norm work | 32 | Secondary priority |
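The shares in the table can be recomputed directly from the group counts. The sketch below uses only the numbers listed above; the dictionary name is illustrative.

```python
# Static submit groups per layer, as listed in the breakdown.
groups_per_layer = {
    "MLP gate/up/silu/down": 224,
    "out_proj": 40,
    "QKV projection": 28,
    "Other attention/norm work": 32,
}

total_groups = sum(groups_per_layer.values())
shares = {area: count / total_groups for area, count in groups_per_layer.items()}

print(f"total groups per layer: {total_groups}")
for area, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{area:28s} {groups_per_layer[area]:4d}  {share:6.1%}")
```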
Findings so far
- The QUANT_ROWMAX_TRANSPOSE umbrella switch is not a necessary condition for text repetition.
- QWEN_A8W8_FUSED_QKV_DEQUANT=1 with QWEN_A8W8_FUSED_QK_NORM=1 reproduced the repetition issue even without the ACT/OUT rowmax transpose switches.
- QWEN_A8W8_FUSED_QKV_DEQUANT=1 with QWEN_A8W8_FUSED_QK_NORM=0 restored text quality, but TPOT was still poor.
- Single-layer debug showed q_pre_rope and k_pre_rope were bitwise identical between fused and unfused paths, so the Q/K norm math itself was not the root cause.
- The fused branch's Q RoPE implementation was the quality issue.
- Model-DSL-level submit fusion or tile-size tuning has limited headroom because several attempts either had no measurable benefit or hit backend buffer limits.
Latest dual-output experiment results
The recent matmul_dual / matmul_dual_acc experiments further confirm that this is not solvable by simple DSL composition alone:
- After adding L0 tiling support to matmul_dual, the Right-buffer issue can be removed, but the tensor-level path still preloads two complete RHS tiles, so Mat buffer usage still exceeds the limit.
- Smaller K chunks can avoid the Mat buffer limit, but ordinary dual-output accumulation across chunks needs an AIV add, and the generated SSA is currently incorrect.
- matmul_dual_acc can remove the AIV add, but the model-level second Acc output still triggers an Acc-to-Acc tmov before store, which ptoas does not support today.
- Therefore the next useful path is lower-level support for multi-Acc output store/layout handling, or a dedicated dual-output lowering that explicitly manages RHS staging, accumulator lifetime, and output stores.
Ruled-out or low-priority paths
- FUSE_OUT_PROJ_NPAIR=1 was measured around 411.3 ms/token and did not provide a useful improvement.
- Connecting a8w8_matmul_dequant_acc to out_proj did not provide a substantial gain and ran into backend constraints.
- MLP_TN=512 failed compilation due to Mat buffer usage exceeding the current limit:
  - gate_proj_aic Mat buffer usage 655360 > 524288
  - up_proj_aic Mat buffer usage 655360 > 524288
- Default DOWN_SPLITK_ATOMIC=0 already keeps down projection coarse at the model submit level. Enabling atomic split-K would likely increase submit count and should not be the first optimization path.
- SiLU submit grouping is not expected to have enough upside compared with the larger MLP projection fanout.
- Simple DSL-level matmul_dual composition is blocked by either Mat buffer pressure, incorrect cross-chunk SSA generation, or unsupported Acc-to-Acc movement before storing the second accumulator output.
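To put the MLP_TN=512 failure in proportion, the two compiler messages can be parsed and compared against the limit. This is a small sketch that assumes the message format is exactly as quoted above; `parse_overruns` is a hypothetical helper, and the unit of the reported numbers is not stated in the log (if they are bytes, the usage is 640 KiB against a 512 KiB limit).

```python
import re

LOG = """gate_proj_aic Mat buffer usage 655360 > 524288
up_proj_aic Mat buffer usage 655360 > 524288"""

# Matches lines of the form "<kernel> Mat buffer usage <used> > <limit>".
PATTERN = re.compile(r"^(\S+) Mat buffer usage (\d+) > (\d+)$")

def parse_overruns(log: str) -> list[dict]:
    """Extract kernel name, usage, limit, and overrun from each matching line."""
    overruns = []
    for line in log.splitlines():
        match = PATTERN.match(line.strip())
        if match is None:
            continue
        used, limit = int(match.group(2)), int(match.group(3))
        overruns.append({
            "kernel": match.group(1),
            "used": used,
            "limit": limit,
            "over": used - limit,
            "ratio": used / limit,
        })
    return overruns

overruns = parse_overruns(LOG)
for item in overruns:
    print(f"{item['kernel']}: {item['ratio']:.2f}x the limit, over by {item['over']}")
```

Both kernels come out at 1.25x the limit, so any staging scheme for a dual-output path needs to reduce the resident RHS footprint by at least 20% relative to this configuration.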
Proposed next step
Add a backend-supported dual-output matmul path for MLP gate/up projection instead of continuing to force the optimization at the model DSL level.
A possible API shape:
gate_acc, up_acc = matmul_dual_acc(A, W_gate, W_up)
The key lowering requirement is that this must not simply expand to two independent matmuls and must not rely on unsupported model-level Acc-to-Acc movement. The lowering should:
- Reuse the A tile load/prepare work.
- Keep gate and up outputs mathematically independent.
- Stage B tiles and C accumulators so the Mat buffer limit is not exceeded.
- Support storing multiple Acc outputs with an explicit, supported layout/store path.
- Preserve the current downstream SiLU, quant/dequant, and down-projection semantics.
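The intended numerics of the requirements above can be pinned down with a small reference model. The sketch below is plain Python, not PyPTO code: `matmul_dual_acc_ref` and `matmul_ref` are hypothetical names, the shapes are reduced so the pure-Python loops stay fast, and buffer placement is not modeled. It only illustrates that each A chunk is read once, that the two accumulators stay independent, and that chunked K accumulation matches two separate matmuls exactly for integer inputs.

```python
import random

def matmul_dual_acc_ref(a, w_gate, w_up, k_chunk):
    """Reference semantics: two independent accumulators fed from shared A chunks."""
    m, k, n = len(a), len(a[0]), len(w_gate[0])
    gate_acc = [[0] * n for _ in range(m)]
    up_acc = [[0] * n for _ in range(m)]
    for k0 in range(0, k, k_chunk):
        k1 = min(k0 + k_chunk, k)
        for i in range(m):
            a_chunk = a[i][k0:k1]  # sliced once, reused for both outputs
            for j in range(n):
                g = u = 0
                for kk, a_val in enumerate(a_chunk, start=k0):
                    g += a_val * w_gate[kk][j]
                    u += a_val * w_up[kk][j]
                # Accumulate in place; no gate/up cross-talk and no extra add pass.
                gate_acc[i][j] += g
                up_acc[i][j] += u
    return gate_acc, up_acc

def matmul_ref(a, w):
    """Plain single-output matmul used as the ground truth."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*w)] for row in a]

def random_int8_matrix(rng, rows, cols):
    return [[rng.randint(-128, 127) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
M, K, N, K_CHUNK = 4, 32, 8, 8  # reduced demo shape, not the decode shape
A = random_int8_matrix(rng, M, K)
W_gate = random_int8_matrix(rng, K, N)
W_up = random_int8_matrix(rng, K, N)

gate_acc, up_acc = matmul_dual_acc_ref(A, W_gate, W_up, K_CHUNK)
print(gate_acc == matmul_ref(A, W_gate), up_acc == matmul_ref(A, W_up))
```

A reference like this could serve as the expected-value side of a kernel test, since integer accumulation makes exact equality a reasonable check.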
Suggested validation plan
- Add a compile-only microbenchmark using the current Qwen3-14B MLP decode shape:
  - M=16
  - K=512
  - N=256
  - Two RHS tensors for gate/up
- Verify the new lowering compiles without exceeding the Mat buffer limit.
- Verify both Acc outputs can be stored without unsupported Acc-to-Acc tmov.
- Wire the path into models/qwen3/14b/decode_layer_a8w8.py behind a default-off flag such as QWEN_A8W8_DUAL_GATE_UP_PROJ=1.
- Run 16-token and 48-token quality checks.
- Re-profile TPOT and AICore task records.
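The default-off gating in the plan above could look like the following sketch. It assumes the flag is read from an environment variable and that only the literal value 1 enables the path; how decode_layer_a8w8.py actually reads its existing switches is not shown in this issue, and the function names and path labels are hypothetical.

```python
import os

FLAG_NAME = "QWEN_A8W8_DUAL_GATE_UP_PROJ"

def dual_gate_up_enabled(env=None) -> bool:
    """Return True only when the flag is explicitly set to "1" (default off)."""
    env = os.environ if env is None else env
    return env.get(FLAG_NAME, "0") == "1"

def select_gate_up_path(env=None) -> str:
    """Pick the experimental dual-output path or the existing separate matmuls."""
    return "dual_output" if dual_gate_up_enabled(env) else "separate_matmuls"

print(select_gate_up_path({}))                 # flag unset
print(select_gate_up_path({FLAG_NAME: "1"}))   # flag enabled
```

Passing the environment as an argument keeps the selection testable without mutating the process environment.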
Acceptance criteria
- The default path remains correct and stable.
- The experimental dual-output path compiles successfully for the Qwen3-14B A8W8 MLP decode shape.
- Both gate and up Acc outputs are stored through a supported lowering path, without unsupported Acc-to-Acc tmov.
- 16-token and 48-token output quality checks pass without repeated-output regression.
- TPOT improves versus the current ~360-371 ms/token baseline.
- Task records or static submit group count decrease in the MLP gate/up area.
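The repeated-output criterion can be made mechanical with a simple n-gram check over generated token IDs. This is only a sketch: the issue does not specify how the quality checks are implemented, and the function names, the n-gram length, and the repeat threshold are hypothetical choices that would need tuning against known-good outputs.

```python
from collections import Counter

def max_ngram_repeats(token_ids, n=4):
    """Return the highest occurrence count of any n-gram in the sequence."""
    if len(token_ids) < n:
        return 0
    grams = Counter(tuple(token_ids[i:i + n]) for i in range(len(token_ids) - n + 1))
    return max(grams.values())

def looks_repetitive(token_ids, n=4, max_repeats=2):
    """Flag a sequence when some n-gram occurs more than max_repeats times."""
    return max_ngram_repeats(token_ids, n) > max_repeats

# Synthetic 48-token sequences standing in for real decode output.
healthy = list(range(48))
looping = [1, 2, 3, 4] * 12

print(max_ngram_repeats(healthy), looks_repetitive(healthy))
print(max_ngram_repeats(looping), looks_repetitive(looping))
```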
Related