[Bug] Vulkan backend intermittently outputs "-" despite VAD detecting speech #233

@peterolejua

Description

Whisper's Vulkan backend intermittently produces "-" output even when the audio clearly contains speech. VAD confirms speech is present (multiple segments detected), but Whisper completes transcription in ~0.16-0.20s (abnormally fast) and returns only "-".

The CPU backend was tested as an alternative but hangs indefinitely at 0% CPU usage on the same system.

Steps to Reproduce

  1. Use voxtype 0.6.2 with Vulkan backend on AMD GPU (RX 6600, RADV)
  2. Use small.en model
  3. Dictate multiple recordings in succession
  4. Intermittently, a recording will transcribe as "-" instead of the spoken text

The failure appears random: it can occur as little as 3 seconds after a successful transcription or after a longer idle period. I have found no consistent pattern.

Evidence: VAD confirms speech, Whisper fails

This is the key diagnostic. With VAD enabled, the logs show that 18.2s of audio with 5 detected speech segments was sent to Whisper, which returned "-" in 0.17s:

INFO Recording stopped (18.2s)
whisper_vad_segments_from_probs: Final speech segments after filtering: 5
whisper_vad_segments_from_probs: VAD segment 0: start = 0.13, end = 5.37 (duration: 5.24)
whisper_vad_segments_from_probs: VAD segment 1: start = 5.83, end = 7.29 (duration: 1.46)
whisper_vad_segments_from_probs: VAD segment 2: start = 7.78, end = 13.47 (duration: 5.69)
whisper_vad_segments_from_probs: VAD segment 3: start = 13.79, end = 15.74 (duration: 1.95)
whisper_vad_segments_from_probs: VAD segment 4: start = 15.97, end = 17.92 (duration: 1.95)
INFO Transcribing 18.1s of audio...
whisper_backend_init_gpu: using Vulkan0 backend
INFO Transcription completed in 0.17s: "-"

This shows that audio capture is working, since VAD found speech in the same buffer that Whisper received. The fault therefore appears to lie in the Whisper Vulkan inference path.
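To quantify how much speech VAD found in the failing recording, the segment lines from the log above can be summed. The following is a minimal sketch, assuming the log format stays exactly as shown. The names `SEGMENT_RE` and `speech_seconds` are made up for illustration and are not part of voxtype or whisper.cpp.

```python
import re

# Matches VAD segment lines in the format shown in the log above.
SEGMENT_RE = re.compile(r"VAD segment (\d+): start = ([\d.]+), end = ([\d.]+)")


def speech_seconds(log_text: str) -> float:
    """Sum (end - start) over every VAD segment line found in log_text."""
    total = 0.0
    for match in SEGMENT_RE.finditer(log_text):
        start = float(match.group(2))
        end = float(match.group(3))
        total += end - start
    return total


# The five segment lines copied from the failing run.
LOG = """\
whisper_vad_segments_from_probs: VAD segment 0: start = 0.13, end = 5.37 (duration: 5.24)
whisper_vad_segments_from_probs: VAD segment 1: start = 5.83, end = 7.29 (duration: 1.46)
whisper_vad_segments_from_probs: VAD segment 2: start = 7.78, end = 13.47 (duration: 5.69)
whisper_vad_segments_from_probs: VAD segment 3: start = 13.79, end = 15.74 (duration: 1.95)
whisper_vad_segments_from_probs: VAD segment 4: start = 15.97, end = 17.92 (duration: 1.95)
"""

print(f"{speech_seconds(LOG):.2f}s of detected speech")
```

For this run the segments add up to roughly 16.3s of speech in an 18.2s recording, so a result of "-" cannot be explained by a mostly silent buffer.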

Failure rate

13 failures out of 188 transcriptions (~7%) in a single session.

All failures share the same signature:

  • Transcription completes in 0.16-0.20s (vs 0.20-0.77s for successful runs)
  • Output is exactly "-"
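For anyone who wants to check their own logs for the same signature, here is a rough sketch of a scanner. It assumes the "Transcription completed in ...s" line keeps the format shown in the log earlier in this report. The function name `summarize` and the second sample line are illustrative only and were not taken from a real run.

```python
import re

# Matches lines such as: INFO Transcription completed in 0.17s: "-"
COMPLETED_RE = re.compile(r'Transcription completed in ([\d.]+)s: "(.*)"')


def summarize(log_lines):
    """Return (total, failure_times).

    total counts every completed transcription; failure_times lists the
    durations (in seconds) of those whose output was exactly "-".
    """
    total = 0
    failure_times = []
    for line in log_lines:
        match = COMPLETED_RE.search(line)
        if match is None:
            continue
        total += 1
        if match.group(2) == "-":
            failure_times.append(float(match.group(1)))
    return total, failure_times


SAMPLE = [
    'INFO Transcription completed in 0.17s: "-"',            # from this report
    'INFO Transcription completed in 0.42s: "hello world"',  # illustrative only
    "INFO Recording stopped (18.2s)",                        # ignored: not a result line
]

total, failure_times = summarize(SAMPLE)
print(f"{len(failure_times)} of {total} transcriptions returned only a dash")
```

Applied to the session described here, the same count would be 13 failures out of 188 completed transcriptions.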

Attempted mitigations (none fixed it)

  • PipeWire node suspension: set session.suspend-timeout-seconds = 0 and node.always-process = true — did not help
  • USB autosuspend: already disabled (power/control = on) — not the cause
  • VAD: enabled with threshold 0.3 — VAD correctly detects speech but Whisper still fails
  • CPU backend: voxtype setup gpu --disable — transcription hangs at 0% CPU indefinitely
  • Direct PipeWire source name: could not be tested, because voxtype/CPAL recognizes only ALSA device names, not PipeWire node names

Environment

  • OS: Arch Linux (kernel 6.18.9-zen1-2-zen)
  • Voxtype: 0.6.2
  • GPU: AMD Radeon RX 6600 (Navi 23), RADV driver
  • Vulkan: 1.4.328, Mesa 25.3.5
  • Model: small.en
  • Audio: PipeWire 1.4.10, USB mic (Trust GXT 232), device = "default"
  • CPU: AMD Ryzen 5 7600 (Zen 4)

Possible upstream issue

This may be a whisper.cpp Vulkan backend bug rather than a voxtype issue. Related: ggml-org/whisper.cpp#2400, ggml-org/whisper.cpp#2596.

Suggestion

When VAD detects speech but Whisper returns only "-", ".", or "...", voxtype could automatically retry the transcription (perhaps after re-initializing the Vulkan state) rather than outputting the dash.
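The retry idea could look roughly like the sketch below. It is written in Python only to show the logic and is not voxtype code. `transcribe`, `reinit_backend`, and `speech_segments` are hypothetical stand-ins for whatever voxtype uses internally, and treating any output with no letters or digits as a failure is my own assumption.

```python
from typing import Callable


def is_degenerate(text: str) -> bool:
    """True if text contains no letters or digits, e.g. "-", ".", "..." or ""."""
    return not any(ch.isalnum() for ch in text)


def transcribe_with_retry(
    transcribe: Callable[[bytes], str],   # hypothetical: runs Whisper on the audio
    reinit_backend: Callable[[], None],   # hypothetical: resets the GPU backend state
    audio: bytes,
    speech_segments: int,                 # number of segments reported by VAD
    max_retries: int = 2,
) -> str:
    """Retry only when VAD found speech but the output carries no words."""
    text = transcribe(audio)
    retries = 0
    while speech_segments > 0 and is_degenerate(text) and retries < max_retries:
        reinit_backend()
        text = transcribe(audio)
        retries += 1
    return text


if __name__ == "__main__":
    # Fake backend that fails once and then succeeds, to exercise the loop.
    outputs = iter(["-", "hello world"])
    result = transcribe_with_retry(
        lambda audio: next(outputs), lambda: None, b"", speech_segments=5
    )
    print(result)
```

Gating the retry on the VAD segment count keeps genuinely silent recordings from being transcribed repeatedly, and the retry cap bounds the added latency if the backend keeps failing.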
