Port flash attention + static KV cache (blocked: Metal destabilizes 1.7B)

## Context

khimaros/qwen3-tts.cpp commit [391dded](https://github.com/khimaros/qwen3-tts.cpp/commit/391dded) adds three related optimizations:

1. `ggml_flash_attn_ext` in the talker step, code predictor step, and code predictor graph
2. Static KV cache: full-`n_ctx`-sized K/V views with `ggml_set_rows` scatter writes instead of `ggml_cpy` to dynamic views
3. Vocoder graph caching (we had our own graph cache, already removed in e3104eb for a related correctness reason)

A follow-up commit, [c215ab0](https://github.com/khimaros/qwen3-tts.cpp/commit/c215ab0), zeroes KV memory to defend flash-attn against stale bytes past `n_used`.
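To illustrate why stale bytes past `n_used` can matter even when those slots are masked, here is a minimal plain C++ sketch. It is not the GGML or Metal kernel, and the function name `attend` is hypothetical. It assumes the failure mode is the usual one: a masked slot gets softmax weight exactly 0, but if its stale V entry happens to decode as NaN (or Inf), `0 * NaN` still poisons the accumulated output.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Single-query attention over scalar values, for illustration only.
// scores[i] is the already-masked logit for cache slot i (-inf if masked),
// v[i] is the value stored in that slot.
float attend(const std::vector<float>& scores, const std::vector<float>& v) {
    float max_s = -std::numeric_limits<float>::infinity();
    for (float s : scores) max_s = std::fmax(max_s, s);

    float denom = 0.0f;
    float acc = 0.0f;
    for (std::size_t i = 0; i < scores.size(); ++i) {
        float w = std::exp(scores[i] - max_s); // exp(-inf) == 0 for masked slots
        denom += w;
        acc += w * v[i];                       // 0 * NaN is still NaN
    }
    return acc / denom;
}
```

With two live slots holding `1` and `3` and one masked slot, the result is `2` when the masked slot holds `0`, and NaN when it holds a stale NaN. A kernel that reads the full `n_ctx` span and relies on the mask alone is exposed to this, which is presumably what the zeroing guards against.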

## What was tried

A full port was attempted on 2026-04-19. It built cleanly, and 0.6B (all backends) and 1.7B on CPU were fine.

**Metal backend destabilized 1.7B generation:**

| Smoketest row | Baseline audio | With flash attn | Observation |
|---|---|---|---|
| 1.7b_f16_auto_clone | 51s | 327.66s | hit max_audio_tokens ceiling |
| 1.7b_f16_auto_instr | 37.9s | 0.86s | degenerate |
| 1.7b_f16_auto_clone_instr | 20s | 2.22s | degenerate |
| 1.7b_q8_0_auto_basic | 10.6s | ❌ FAIL (wav <10KB) | hard failure |
| 1.7b_q8_0_auto_clone | 173s | 327.66s | hit ceiling |

All 1.7B CPU rows remained normal, which isolates the problem to Metal's `ggml_flash_attn_ext` interacting with either the F16 `-inf` mask, our `head_dim=128`, or the large talker `n_ctx` (~2060 for default `max_audio_tokens=2048`).

Reverted — no commit landed.

## What's needed to unblock

- Reduce the interaction surface: try F32 mask, smaller `n_ctx`, or per-head-dim kernel selection in GGML's Metal backend.
- Test on a known-good Metal flash-attn workload (e.g. llama.cpp) to confirm the GGML Metal flash-attn kernel itself is correct for our shapes.
- Alternatively, adopt **only the static KV cache** (khimaros's `ggml_set_rows` scatter + full-`n_ctx` views) while keeping the manual QK^T / softmax / V attention path. This would still be a mild perf win and would not exercise the destabilizing path.
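For reference, the static-cache idea in the last option can be sketched in plain C++ without GGML. The `StaticKV` type and its `set_rows` method are hypothetical stand-ins that only mimic the shape of a `ggml_set_rows` scatter into a full-`n_ctx` buffer; they are not the upstream implementation. The buffer is allocated once at `n_ctx` rows and zero-initialized, and each step writes its new rows at explicit positions instead of copying into a view whose size changes per step.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one layer's K (or V) cache:
// n_ctx rows of row_dim floats, allocated once and zero-filled.
struct StaticKV {
    std::size_t n_ctx;
    std::size_t row_dim;
    std::vector<float> data;

    StaticKV(std::size_t n_ctx_, std::size_t row_dim_)
        : n_ctx(n_ctx_), row_dim(row_dim_), data(n_ctx_ * row_dim_, 0.0f) {}

    // Scatter write: copy row r of src into cache row idx[r].
    // src holds idx.size() rows of row_dim floats, stored contiguously.
    void set_rows(const std::vector<float>& src,
                  const std::vector<std::int64_t>& idx) {
        assert(src.size() == idx.size() * row_dim);
        for (std::size_t r = 0; r < idx.size(); ++r) {
            assert(idx[r] >= 0 && static_cast<std::size_t>(idx[r]) < n_ctx);
            std::copy(src.begin() + r * row_dim,
                      src.begin() + (r + 1) * row_dim,
                      data.begin() + static_cast<std::size_t>(idx[r]) * row_dim);
        }
    }
};
```

Because the buffer shape never changes, the attention graph can keep reading fixed full-size K/V views and rely on masking to hide unused rows, whichever attention path (manual or flash) consumes them.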

## Upstream reference

- khimaros 391dded: https://github.com/khimaros/qwen3-tts.cpp/commit/391dded
- khimaros c215ab0: https://github.com/khimaros/qwen3-tts.cpp/commit/c215ab0
