[Tracking] Qwen3-14B A8W8 decode kernels and TPOT optimization #665

## Summary

Track the Qwen3-14B A8W8 PyPTO kernel work and the follow-up optimization path for decode time per output token (TPOT).

This issue is associated with PR #642, which adds the Qwen3-14B A8W8 kernel path in `pypto-lib`. The current end-to-end A8W8 path can run successfully with the matching serving-side PR, but the remaining performance gap appears to be dominated by decode task fanout and scheduling overhead rather than one isolated small operator.

## Current status

- The combined `pypto-lib` + `pypto-serving` A8W8 path can run an end-to-end Qwen3-14B A8W8 decode flow.
- The latest measured baseline is around 360-371 ms/token TPOT.
- A previous fused QK norm path caused repeated text output. The root cause was narrowed down to the fused branch taking a batched Q RoPE path that was not numerically equivalent to the unfused path.
- Switching Q RoPE back to per-Q-head rotation restores output quality, and single-layer debug showed bitwise agreement for the corrected Q RoPE output.
- The quality fix alone does not provide a speedup: the best corrected fused-QK-norm path observed so far is around 429.6 ms/token, which is slower than the 360-371 ms/token baseline.

## Measurement basis

Decode profiling suggests the main cost is task granularity and fanout:

- Device wall time is around 356 ms/step.
- Each decode step emits about 38,705 AICore task records.
- Average task runtime is about 45 us.
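- Dispatch-to-finish latency is about 61 us, about 16 us more than the average task runtime, which suggests that roughly a quarter (16/61, about 26%) of each task's dispatch-to-finish time is scheduling/head-tail overhead.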
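
The overhead estimate above can be reproduced from the listed averages. The sketch below uses only the figures in this section; the derived concurrency number is an estimate that assumes the averages apply uniformly across all task records.

```python
# Figures copied from the profiling summary in this section.
wall_us = 356_000                 # device wall time per decode step
task_records = 38_705             # AICore task records per step
avg_run_us = 45                   # average task runtime
avg_dispatch_to_finish_us = 61    # average dispatch-to-finish latency

# Per-task time that is not spent running the kernel body.
overhead_us = avg_dispatch_to_finish_us - avg_run_us
overhead_share = overhead_us / avg_dispatch_to_finish_us

# Summed task runtime versus wall time gives an implied average concurrency.
total_run_us = task_records * avg_run_us
avg_concurrency = total_run_us / wall_us

print(f"per-task overhead: {overhead_us} us ({overhead_share:.1%} of dispatch-to-finish)")
print(f"summed task runtime: {total_run_us / 1000:.0f} ms")
print(f"implied average concurrency: {avg_concurrency:.1f} tasks in flight")
```

If the implied concurrency is well below the number of available cores, that would be consistent with the fanout and scheduling hypothesis, but the core count is not stated here and should be checked against the actual device.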

Static group breakdown per layer:

| Area | Groups per layer | Notes |
| --- | ---: | --- |
| MLP gate/up/silu/down | 224 | Largest fanout source, about 69% of groups |
| out_proj | 40 | Already investigated through several submit/fusion attempts |
| QKV projection | 28 | Smaller than MLP fanout |
| Other attention/norm work | 32 | Secondary priority |
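
The percentage in the table can be checked against the row counts. A short sketch that recomputes each area's share from the numbers above (no values other than the table's are assumed):

```python
# Static group counts per layer, copied from the table above.
groups_per_layer = {
    "MLP gate/up/silu/down": 224,
    "out_proj": 40,
    "QKV projection": 28,
    "Other attention/norm work": 32,
}

total = sum(groups_per_layer.values())
shares = {area: count / total for area, count in groups_per_layer.items()}

for area, count in groups_per_layer.items():
    print(f"{area:28s} {count:4d} {shares[area]:6.1%}")
print(f"{'total':28s} {total:4d}")
```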

## Findings so far

- The `QUANT_ROWMAX_TRANSPOSE` umbrella switch is not a necessary condition for text repetition.
- `QWEN_A8W8_FUSED_QKV_DEQUANT=1` with `QWEN_A8W8_FUSED_QK_NORM=1` reproduced the repetition issue even without the ACT/OUT rowmax transpose switches.
- `QWEN_A8W8_FUSED_QKV_DEQUANT=1` with `QWEN_A8W8_FUSED_QK_NORM=0` restored text quality, but TPOT was still poor.
- Single-layer debug showed `q_pre_rope` and `k_pre_rope` were bitwise identical between fused and unfused paths, so the Q/K norm math itself was not the root cause.
- The fused branch's Q RoPE implementation was the quality issue.
- Model-DSL-level submit fusion or tile-size tuning has limited headroom because several attempts either had no measurable benefit or hit backend buffer limits.
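
The "bitwise identical" comparisons above are stricter than a tolerance-based check. A minimal sketch of such a comparison, assuming the debug tensors have been dumped as flat sequences of float32 values; the helper names are hypothetical and are not part of `pypto-lib`, and a different dtype would need a different packing format.

```python
import struct


def f32_bits(x: float) -> int:
    """Return the IEEE 754 binary32 bit pattern of x as an unsigned integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bitwise_equal_f32(a, b) -> bool:
    """True only if both sequences have equal length and identical bit patterns."""
    if len(a) != len(b):
        return False
    return all(f32_bits(x) == f32_bits(y) for x, y in zip(a, b))


def first_mismatch(a, b):
    """Index of the first element whose bits differ, or None if the common prefix matches."""
    for i, (x, y) in enumerate(zip(a, b)):
        if f32_bits(x) != f32_bits(y):
            return i
    return None
```

Unlike `==`, this treats `0.0` and `-0.0` as different, which is the behavior wanted when claiming that a fused and an unfused path produce the same bits.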

## Latest dual-output experiment results

The recent `matmul_dual` / `matmul_dual_acc` experiments further confirm that this is not solvable by simple DSL composition alone:

- After adding L0 tiling support to `matmul_dual`, the Right-buffer issue can be removed, but the tensor-level path still preloads two complete RHS tiles, so Mat buffer usage still exceeds the limit.
- Smaller K chunks can avoid the Mat buffer limit, but ordinary dual-output accumulation across chunks needs an AIV add, and the generated SSA is currently incorrect.
- `matmul_dual_acc` can remove the AIV add, but the model-level second Acc output still triggers an Acc-to-Acc `tmov` before store, which `ptoas` does not support today.
- Therefore the next useful path is lower-level support for multi-Acc output store/layout handling, or a dedicated dual-output lowering that explicitly manages RHS staging, accumulator lifetime, and output stores.

## Ruled-out or low-priority paths

- `FUSE_OUT_PROJ_NPAIR=1` was measured at around 411.3 ms/token, slower than the 360-371 ms/token baseline, so it did not provide a useful improvement.
- Connecting `a8w8_matmul_dequant_acc` to `out_proj` did not provide a substantial gain and ran into backend constraints.
- `MLP_TN=512` failed compilation due to Mat buffer usage exceeding the current limit:
  - `gate_proj_aic Mat buffer usage 655360 > 524288`
  - `up_proj_aic Mat buffer usage 655360 > 524288`
- Default `DOWN_SPLITK_ATOMIC=0` already keeps down projection coarse at the model submit level. Enabling atomic split-K would likely increase submit count and should not be the first optimization path.
- SiLU submit grouping is not expected to have enough upside compared with the larger MLP projection fanout.
- Simple DSL-level `matmul_dual` composition is blocked in every variant tried so far, by Mat buffer pressure, incorrect cross-chunk SSA generation, or unsupported Acc-to-Acc movement before storing the second accumulator output.
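
For the `MLP_TN=512` compile failure listed above, the size of the overshoot indicates how much staging would have to shrink. A quick calculation using only the two byte counts from the error messages:

```python
# Byte counts copied from the compile errors for gate_proj_aic and up_proj_aic.
MAT_LIMIT_BYTES = 524_288
MLP_TN512_USAGE_BYTES = 655_360

excess_bytes = MLP_TN512_USAGE_BYTES - MAT_LIMIT_BYTES
usage_ratio = MLP_TN512_USAGE_BYTES / MAT_LIMIT_BYTES
required_reduction = excess_bytes / MLP_TN512_USAGE_BYTES

print(f"limit:  {MAT_LIMIT_BYTES // 1024} KiB")
print(f"usage:  {MLP_TN512_USAGE_BYTES // 1024} KiB")
print(f"excess: {excess_bytes // 1024} KiB ({usage_ratio:.2f}x the limit)")
print(f"usage must drop by at least {required_reduction:.0%} to fit")
```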

## Proposed next step

Add a backend-supported dual-output matmul path for MLP gate/up projection instead of continuing to force the optimization at the model DSL level.

A possible API shape:

```text
gate_acc, up_acc = matmul_dual_acc(A, W_gate, W_up)
```

The key lowering requirement is that this must not simply expand to two independent matmuls and must not rely on unsupported model-level Acc-to-Acc movement. The lowering should:

- Reuse the A tile load/prepare work.
- Keep gate and up outputs mathematically independent.
- Stage B tiles and C accumulators so the Mat buffer limit is not exceeded.
- Support storing multiple Acc outputs with an explicit, supported layout/store path.
- Preserve the current downstream SiLU, quant/dequant, and down-projection semantics.
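
The semantic side of these requirements can be pinned down with a plain reference model. The following is a minimal pure-Python sketch, not the proposed lowering: it assumes int8-like integer inputs with exact integer accumulation, ignores quant scales and dequant, and uses a hypothetical `k_chunk` parameter to mimic staging RHS tiles in K chunks. The name `matmul_dual_acc_ref` is illustrative only.

```python
import random


def matmul_ref(a, w):
    """Plain integer matmul: a is M x K, w is K x N, both lists of lists."""
    m, k, n = len(a), len(w), len(w[0])
    return [[sum(a[i][p] * w[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


def matmul_dual_acc_ref(a, w_gate, w_up, k_chunk):
    """Reference for a dual-output matmul that shares A across both outputs.

    Each K chunk of A is sliced once and reused for both accumulators.
    The gate and up accumulators stay separate for the whole computation.
    """
    m, k = len(a), len(a[0])
    gate_acc = [[0] * len(w_gate[0]) for _ in range(m)]
    up_acc = [[0] * len(w_up[0]) for _ in range(m)]
    for k0 in range(0, k, k_chunk):
        k1 = min(k0 + k_chunk, k)
        a_tile = [row[k0:k1] for row in a]   # A chunk prepared once
        for acc, w in ((gate_acc, w_gate), (up_acc, w_up)):
            w_tile = w[k0:k1]                # one RHS chunk staged at a time
            for i in range(m):
                for j in range(len(w_tile[0])):
                    acc[i][j] += sum(a_tile[i][p] * w_tile[p][j]
                                     for p in range(k1 - k0))
    return gate_acc, up_acc


# Small illustrative shapes; the microbenchmark shape in this issue is larger.
rng = random.Random(0)
M, K, N = 4, 32, 6
A = [[rng.randint(-128, 127) for _ in range(K)] for _ in range(M)]
W_gate = [[rng.randint(-128, 127) for _ in range(N)] for _ in range(K)]
W_up = [[rng.randint(-128, 127) for _ in range(N)] for _ in range(K)]

gate_acc, up_acc = matmul_dual_acc_ref(A, W_gate, W_up, k_chunk=8)
assert gate_acc == matmul_ref(A, W_gate)
assert up_acc == matmul_ref(A, W_up)
```

Because the accumulation is exact integer arithmetic, the chunked result must match two independent matmuls exactly, so a backend path could be compared against this model without a tolerance.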

## Suggested validation plan

1. Add a compile-only microbenchmark using the current Qwen3-14B MLP decode shape:
   - `M=16`
   - `K=512`
   - `N=256`
   - Two RHS tensors for gate/up
2. Verify the new lowering compiles without exceeding the Mat buffer limit.
3. Verify both Acc outputs can be stored without unsupported Acc-to-Acc `tmov`.
4. Wire the path into `models/qwen3/14b/decode_layer_a8w8.py` behind a flag that is off by default and enabled with, for example, `QWEN_A8W8_DUAL_GATE_UP_PROJ=1`.
5. Run 16-token and 48-token quality checks.
6. Re-profile TPOT and AICore task records.
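
The flag gating in step 4 could look like the sketch below. The flag name is the one suggested in the plan; the helper `env_flag_enabled` and the parsing rule (only the value `1` enables the path) are assumptions, not existing `pypto-lib` code.

```python
import os


def env_flag_enabled(name: str, environ=None) -> bool:
    """Return True only when the variable equals "1" after trimming whitespace.

    An unset variable, or any other value, leaves the feature off.
    """
    env = os.environ if environ is None else environ
    return env.get(name, "0").strip() == "1"


# Evaluated once at import time so the default path is unchanged when unset.
USE_DUAL_GATE_UP = env_flag_enabled("QWEN_A8W8_DUAL_GATE_UP_PROJ")
```

Accepting only `1` keeps the experimental path from being enabled by accident through values such as `true` or an empty string.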

## Acceptance criteria

- The default path remains correct and stable.
- The experimental dual-output path compiles successfully for the Qwen3-14B A8W8 MLP decode shape.
- Both gate and up Acc outputs are stored through a supported lowering path, without unsupported Acc-to-Acc `tmov`.
- 16-token and 48-token output quality checks pass without repeated-output regression.
- TPOT improves versus the current ~360-371 ms/token baseline.
- Task records or static submit group count decrease in the MLP gate/up area.
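
The repeated-output criterion can be made mechanical. A minimal sketch, assuming the quality check has access to the generated token IDs; the function name, the n-gram length, and the repeat threshold are illustrative assumptions and would need tuning against known-good and known-bad outputs.

```python
def has_repeated_ngram_run(tokens, n=4, min_repeats=3):
    """Detect an n-gram that repeats back-to-back at least min_repeats times.

    With n=2 and min_repeats=3, the sequence [7, 8, 7, 8, 7, 8] is flagged.
    """
    if n <= 0 or min_repeats <= 1:
        raise ValueError("n must be positive and min_repeats must exceed 1")
    span = n * min_repeats
    for start in range(len(tokens) - span + 1):
        pattern = tokens[start:start + n]
        # Every following window of length n must equal the first one.
        if all(tokens[start + r * n:start + (r + 1) * n] == pattern
               for r in range(1, min_repeats)):
            return True
    return False
```

With the default parameters a run must cover at least 12 tokens, so the 16-token check can only catch repetition that starts almost immediately; the 48-token check gives the detector more room.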

## Related

- pypto-lib PR: #642
- Matching serving-side integration: hw-native-sys/pypto-serving#48
- Matching serving-side tracking issue: hw-native-sys/pypto-serving#52
