Port flash attention + static KV cache (blocked: Metal destabilizes 1.7B) #2

Description

@rmusser01

Context

khimaros/qwen3-tts.cpp commit 391dded adds three related optimizations:

  1. ggml_flash_attn_ext in the talker step, code predictor step, and code predictor graph
  2. Static KV cache: full-n_ctx-sized K/V views with ggml_set_rows scatter writes instead of ggml_cpy to dynamic views
  3. Vocoder graph caching (we have our own graph cache — already removed in e3104eb for a related correctness reason)

Plus the KV-memory-zeroing follow-up in c215ab0 to defend flash-attn against stale bytes past n_used.
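
To make item 2 and the zeroing follow-up concrete, here is a minimal standalone sketch of the idea using only the C++ standard library. It is not the upstream code and does not call ggml: `StaticKvCache` and its `set_rows` method are hypothetical names that only mimic the shape of a full-`n_ctx` buffer written through row-index scatter (the role `ggml_set_rows` plays upstream), with the buffer zero-filled at allocation so rows past `n_used` never hold stale bytes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one layer's K (or V) cache. Not the project's real type.
struct StaticKvCache {
    std::size_t n_ctx;        // fixed capacity in rows (one row per token position)
    std::size_t row_size;     // floats per row, e.g. n_head_kv * head_dim
    std::size_t n_used = 0;   // number of leading rows that hold valid data
    std::vector<float> data;  // n_ctx * row_size floats, allocated once

    // Zero-fill up front so rows past n_used are well-defined (the zeroing step).
    StaticKvCache(std::size_t n_ctx_, std::size_t row_size_)
        : n_ctx(n_ctx_), row_size(row_size_), data(n_ctx_ * row_size_, 0.0f) {}

    // Scatter write: copy the i-th source row to cache row row_ids[i].
    // The buffer never changes size, so a graph built over it can keep fixed shapes.
    void set_rows(const std::vector<float>& src,
                  const std::vector<std::int64_t>& row_ids) {
        assert(src.size() == row_ids.size() * row_size);
        for (std::size_t i = 0; i < row_ids.size(); ++i) {
            assert(row_ids[i] >= 0);
            const std::size_t row = static_cast<std::size_t>(row_ids[i]);
            assert(row < n_ctx);
            std::copy_n(src.data() + i * row_size, row_size,
                        data.data() + row * row_size);
            n_used = std::max(n_used, row + 1);
        }
    }
};
```

For example, constructing `StaticKvCache cache(4, 2)` and calling `cache.set_rows({1, 2, 3, 4}, {0, 2})` fills rows 0 and 2, leaves rows 1 and 3 zeroed, and sets `n_used` to 3.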

What was tried

A full port was attempted on 2026-04-19. It built cleanly, and 0.6B (all backends) and 1.7B on CPU generated normally.

Metal backend destabilized 1.7B generation:

| Smoketest row | Baseline audio | With flash attn | Observation |
| --- | --- | --- | --- |
| 1.7b_f16_auto_clone | 51s | 327.66s | hit max_audio_tokens ceiling |
| 1.7b_f16_auto_instr | 37.9s | 0.86s | degenerate |
| 1.7b_f16_auto_clone_instr | 20s | 2.22s | degenerate |
| 1.7b_q8_0_auto_basic | 10.6s | ❌ FAIL (wav <10KB) | hard failure |
| 1.7b_q8_0_auto_clone | 173s | 327.66s | hit ceiling |

All 1.7B CPU rows remained normal, which isolates the problem to Metal's `ggml_flash_attn_ext` interacting with either the F16 `-inf` mask, our `head_dim=128`, or the large talker `n_ctx` (~2060 for default `max_audio_tokens=2048`).
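
As a generic illustration of why the `-inf` mask is on the suspect list, the sketch below shows one well-known way an additive `-inf` mask can turn into NaNs in a max-subtracted softmax: if every position in a row is masked, the row maximum is `-inf` and `-inf - (-inf)` is NaN. Whether anything like this happens inside the Metal kernel is not established here. `masked_softmax` is a hypothetical helper in plain C++, and the finite alternative `-65504` (the most negative finite F16 value) is only an example of a non-infinite mask value.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper: numerically stable softmax over (scores + additive mask).
// mask[i] is 0 for visible positions and a large negative value for hidden ones.
std::vector<float> masked_softmax(const std::vector<float>& scores,
                                  const std::vector<float>& mask) {
    assert(!scores.empty() && scores.size() == mask.size());
    std::vector<float> out(scores.size());

    // Apply the mask and track the row maximum for the usual stability trick.
    float max_val = -std::numeric_limits<float>::infinity();
    for (std::size_t i = 0; i < scores.size(); ++i) {
        out[i] = scores[i] + mask[i];
        max_val = std::max(max_val, out[i]);
    }

    // If every entry is -inf, max_val is -inf and (-inf) - (-inf) yields NaN here.
    float sum = 0.0f;
    for (float& x : out) {
        x = std::exp(x - max_val);
        sum += x;
    }
    for (float& x : out) {
        x /= sum;
    }
    return out;
}
```

With a partially masked row the result is well-behaved (masked entries get probability exactly 0). With a fully masked row every output is NaN, while the same row masked with a finite value such as `-65504` stays finite.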

Reverted — no commit landed.

What's needed to unblock

  • Reduce the interaction surface: try F32 mask, smaller `n_ctx`, or per-head-dim kernel selection in GGML's Metal backend.
  • Test on a known-good Metal flash-attn workload (e.g. llama.cpp) to confirm the GGML Metal flash-attn kernel itself is correct for our shapes.
  • Alternatively, adopt only the static KV cache (khimaros's `ggml_set_rows` scatter + full-n_ctx views) while keeping the manual QK^T / softmax / V attention path. Would still be a mild perf win and would not exercise the destabilizing path.
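
For the last option, one thing worth keeping in mind is that any attention path reading the full-`n_ctx` view and relying on masking still depends on the zeroing from c215ab0. The sketch below is a simplified single-head, single-query version of the manual QK^T / softmax / V path in plain C++ (no ggml; `attend` and its parameters are hypothetical). It shows that a masked row gets softmax weight 0, but `0 * NaN` is still NaN, so stale V bytes past `n_used` can poison the output unless they are zeroed or never read.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical sketch: attention for one query over the first n_rows rows of
// K and V (row-major, q.size() floats per row). Rows >= n_used are masked out.
std::vector<float> attend(const std::vector<float>& q,
                          const std::vector<float>& k,
                          const std::vector<float>& v,
                          std::size_t n_rows, std::size_t n_used) {
    const std::size_t head_dim = q.size();
    assert(head_dim > 0 && n_used > 0 && n_used <= n_rows);
    assert(k.size() >= n_rows * head_dim && v.size() >= n_rows * head_dim);

    const float scale = 1.0f / std::sqrt(static_cast<float>(head_dim));
    const float neg_inf = -std::numeric_limits<float>::infinity();

    // scores[r] = scale * dot(q, K[r]) + mask[r]
    std::vector<float> w(n_rows);
    float max_score = neg_inf;
    for (std::size_t r = 0; r < n_rows; ++r) {
        float dot = 0.0f;
        for (std::size_t d = 0; d < head_dim; ++d) {
            dot += q[d] * k[r * head_dim + d];
        }
        w[r] = dot * scale + (r < n_used ? 0.0f : neg_inf);
        max_score = std::max(max_score, w[r]);
    }

    // Softmax over the scores; masked rows end up with weight exactly 0.
    float sum = 0.0f;
    for (float& x : w) {
        x = std::exp(x - max_score);
        sum += x;
    }

    // Weighted sum of V rows. Masked rows are still multiplied in: 0 * NaN = NaN.
    std::vector<float> out(head_dim, 0.0f);
    for (std::size_t r = 0; r < n_rows; ++r) {
        for (std::size_t d = 0; d < head_dim; ++d) {
            out[d] += (w[r] / sum) * v[r * head_dim + d];
        }
    }
    return out;
}
```

Calling this with `n_rows == n_used` models a dynamic view that never touches the unused tail, while `n_rows == n_ctx` models the static full-size view. With a zeroed tail both give the same result; with NaN in the tail of V, only the full-size view returns NaN.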

Upstream reference
