Context
khimaros/qwen3-tts.cpp commit 391dded adds three related optimizations:
- `ggml_flash_attn_ext` in the talker step, code predictor step, and code predictor graph
- Static KV cache: full-`n_ctx`-sized K/V views with `ggml_set_rows` scatter writes instead of `ggml_cpy` to dynamic views
- Vocoder graph caching (we have our own graph cache — already removed in e3104eb for a related correctness reason)
Plus the KV-memory-zeroing follow-up in c215ab0 to defend flash-attn against stale bytes past n_used.
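One plausible reason the zeroing follow-up matters: with a full-`n_ctx` view, slots past `n_used` are masked with `-inf`, so their softmax weight is exactly 0, but a kernel may still multiply that weight into the stale value, and in IEEE 754 arithmetic `0 * NaN` is `NaN`. The sketch below is a plain C++ illustration with scalar K/V slots, not the ggml kernel; `masked_attention` and its parameters are hypothetical names, and whether the Metal kernel actually behaves this way is an assumption.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Illustrative only: single-query attention over a fixed-size cache of scalar
// K/V slots. Slots at index >= n_used get an additive -inf mask, so their
// softmax weight is exactly 0, but they still take part in the weighted sum.
// Assumes 1 <= n_used <= k.size() and k.size() == v.size().
float masked_attention(const std::vector<float>& k, const std::vector<float>& v,
                       float q, std::size_t n_used) {
    const float neg_inf = -std::numeric_limits<float>::infinity();
    std::vector<float> score(k.size());
    float max_score = neg_inf;
    for (std::size_t i = 0; i < k.size(); ++i) {
        const float mask = (i < n_used) ? 0.0f : neg_inf;
        score[i] = q * k[i] + mask;  // a stale inf or NaN in k[i] would leak through here
        max_score = std::fmax(max_score, score[i]);
    }
    float denom = 0.0f;
    float out = 0.0f;
    for (std::size_t i = 0; i < k.size(); ++i) {
        const float w = std::exp(score[i] - max_score);  // exp(-inf) == 0 for masked slots
        denom += w;
        out += w * v[i];  // 0 * NaN is NaN, so a stale slot can poison the sum
    }
    return out / denom;
}
```

With two used slots and `NaN` left in the unused value slots, the result is `NaN`; after zeroing the unused slots, the same call returns the ordinary softmax-weighted average of the two used values.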
What was tried
Full port attempted on 2026-04-19. It built cleanly; 0.6B (all backends) and 1.7B on CPU were fine.
Metal backend destabilized 1.7B generation:
| Smoketest row | Baseline audio | With flash attn | Observation |
|---|---|---|---|
| 1.7b_f16_auto_clone | 51s | 327.66s | hit max_audio_tokens ceiling |
| 1.7b_f16_auto_instr | 37.9s | 0.86s | degenerate |
| 1.7b_f16_auto_clone_instr | 20s | 2.22s | degenerate |
| 1.7b_q8_0_auto_basic | 10.6s | ❌ FAIL (wav <10KB) | hard failure |
| 1.7b_q8_0_auto_clone | 173s | 327.66s | hit ceiling |
All 1.7B CPU rows remained normal, which isolates the problem to Metal's `ggml_flash_attn_ext` interacting with either the F16 `-inf` mask, our `head_dim=128`, or the large talker `n_ctx` (~2060 for default `max_audio_tokens=2048`).
Reverted — no commit landed.
What's needed to unblock
- Reduce the interaction surface: try F32 mask, smaller `n_ctx`, or per-head-dim kernel selection in GGML's Metal backend.
- Test on a known-good Metal flash-attn workload (e.g. llama.cpp) to confirm the GGML Metal flash-attn kernel itself is correct for our shapes.
- Alternatively, adopt only the static KV cache (khimaros's `ggml_set_rows` scatter + full-n_ctx views) while keeping the manual QK^T / softmax / V attention path. Would still be a mild perf win and would not exercise the destabilizing path.
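The static-KV-only option in the last bullet can be pictured with a small stand-in: one buffer of `n_ctx` rows allocated once, with new K or V rows scattered into it by index so tensor shapes stay fixed between steps. This is a minimal plain C++ sketch of the idea, not the `ggml_set_rows` API; `StaticCache` and its `set_rows` member are hypothetical names introduced for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a static KV cache: one zero-initialized buffer of
// n_ctx rows, allocated once, so shapes never change between decode steps.
struct StaticCache {
    std::size_t n_ctx;
    std::size_t head_dim;
    std::vector<float> data;  // n_ctx * head_dim floats, row-major

    StaticCache(std::size_t n_ctx_, std::size_t head_dim_)
        : n_ctx(n_ctx_), head_dim(head_dim_), data(n_ctx_ * head_dim_, 0.0f) {}

    // Scatter write: copy source row i into cache row row_ids[i].
    // Returns false and leaves the cache untouched on a size mismatch
    // or an out-of-range row id.
    bool set_rows(const std::vector<float>& src,
                  const std::vector<std::int64_t>& row_ids) {
        if (src.size() != row_ids.size() * head_dim) return false;
        for (std::int64_t id : row_ids) {
            if (id < 0 || static_cast<std::size_t>(id) >= n_ctx) return false;
        }
        for (std::size_t i = 0; i < row_ids.size(); ++i) {
            const std::size_t dst = static_cast<std::size_t>(row_ids[i]) * head_dim;
            for (std::size_t d = 0; d < head_dim; ++d) {
                data[dst + d] = src[i * head_dim + d];
            }
        }
        return true;
    }
};
```

Because the buffer starts zeroed and is only ever written row by row, the rows past the last written position stay zero, which is the same property the zeroing follow-up aims for.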
Upstream reference