Description
Whisper's Vulkan backend intermittently produces "-" output even when the audio clearly contains speech. VAD confirms speech is present (multiple segments detected), but Whisper completes transcription in ~0.16-0.20s (abnormally fast) and returns only "-".
The CPU backend was tested as an alternative but hangs indefinitely at 0% CPU usage on the same system.
Steps to Reproduce
- Use voxtype 0.6.2 with the Vulkan backend on an AMD GPU (RX 6600, RADV)
- Use the small.en model
- Dictate multiple recordings in succession
- Intermittently, a recording will transcribe as "-" instead of the spoken text
The failure appears random — it can happen immediately after a successful transcription (3-second gap) or after longer idle periods. No consistent pattern.
Evidence: VAD confirms speech, Whisper fails
This is the key diagnostic. With VAD enabled, the logs show that 18.2s of audio with 5 detected speech segments was sent to Whisper, which returned "-" in 0.17s:
INFO Recording stopped (18.2s)
whisper_vad_segments_from_probs: Final speech segments after filtering: 5
whisper_vad_segments_from_probs: VAD segment 0: start = 0.13, end = 5.37 (duration: 5.24)
whisper_vad_segments_from_probs: VAD segment 1: start = 5.83, end = 7.29 (duration: 1.46)
whisper_vad_segments_from_probs: VAD segment 2: start = 7.78, end = 13.47 (duration: 5.69)
whisper_vad_segments_from_probs: VAD segment 3: start = 13.79, end = 15.74 (duration: 1.95)
whisper_vad_segments_from_probs: VAD segment 4: start = 15.97, end = 17.92 (duration: 1.95)
INFO Transcribing 18.1s of audio...
whisper_backend_init_gpu: using Vulkan0 backend
INFO Transcription completed in 0.17s: "-"
This shows the audio capture is working: speech reached Whisper, so the failure is in the Whisper Vulkan inference.
Failure rate
13 failures out of 188 transcriptions (~7%) in a single session.
All failures share the same signature:
- Transcription completes in 0.16-0.20s (vs 0.20-0.77s for successful runs)
- Output is exactly "-"
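The signature above can be checked mechanically against a saved log. The following is a minimal sketch, assuming each completed transcription is logged in the same format as the `Transcription completed in 0.17s: "-"` line shown earlier; the `summarize` helper and the "hello world" sample line are made up for illustration and are not part of voxtype.

```python
import re

# Matches lines like: INFO Transcription completed in 0.17s: "-"
# (format assumed from the log excerpt in this report)
COMPLETED = re.compile(r'Transcription completed in (\d+(?:\.\d+)?)s: "(.*)"\s*$')


def summarize(log_lines):
    """Return (total, failures).

    total    -- number of completed transcriptions found in the log
    failures -- durations (seconds) of runs whose output was exactly "-"
    """
    total = 0
    failures = []
    for line in log_lines:
        match = COMPLETED.search(line)
        if match is None:
            continue  # not a completion line
        total += 1
        if match.group(2).strip() == "-":
            failures.append(float(match.group(1)))
    return total, failures


# The first three lines are taken from the log above; the last one is an
# invented example of a successful run.
sample = [
    'INFO Recording stopped (18.2s)',
    'INFO Transcribing 18.1s of audio...',
    'INFO Transcription completed in 0.17s: "-"',
    'INFO Transcription completed in 0.41s: "hello world"',
]

total, failures = summarize(sample)
print(f"{len(failures)} of {total} transcriptions returned only a dash")
```

The check keys on the output text rather than the duration, because the 0.16-0.20s failure range touches the lower end of the 0.20-0.77s range seen for successful runs.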
Attempted mitigations (none fixed it)
- PipeWire node suspension: set session.suspend-timeout-seconds = 0 and node.always-process = true; did not help
- USB autosuspend: already disabled (power/control = on); not the cause
- VAD: enabled with threshold 0.3; VAD correctly detects speech but Whisper still fails
- CPU backend: voxtype setup gpu --disable; transcription hangs at 0% CPU indefinitely
- Direct PipeWire source name: voxtype/CPAL doesn't recognize PipeWire node names, only ALSA names
Environment
- OS: Arch Linux (kernel 6.18.9-zen1-2-zen)
- Voxtype: 0.6.2
- GPU: AMD Radeon RX 6600 (Navi 23), RADV driver
- Vulkan: 1.4.328, Mesa 25.3.5
- Model: small.en
- Audio: PipeWire 1.4.10, USB mic (Trust GXT 232), device = "default"
- CPU: AMD Ryzen 5 7600 (Zen 4)
Possible upstream issue
This may be a whisper.cpp Vulkan backend bug rather than a voxtype issue. Related: ggml-org/whisper.cpp#2400, ggml-org/whisper.cpp#2596.
Suggestion
When VAD detects speech but Whisper returns only "-", ".", or "...", voxtype could automatically retry the transcription (perhaps re-initializing the Vulkan state) rather than outputting the dash.
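The retry idea could look roughly like the sketch below. It is written in Python only to show the control flow and is not voxtype code: `transcribe` and `reinit_backend` are hypothetical callbacks standing in for whatever voxtype uses internally, and `max_retries` is an assumed parameter.

```python
from typing import Callable

# Outputs treated as "nothing was transcribed" (taken from the suggestion above).
DEGENERATE_OUTPUTS = {"-", ".", "..."}


def transcribe_with_retry(
    transcribe: Callable[[], str],        # hypothetical: runs one transcription
    reinit_backend: Callable[[], None],   # hypothetical: resets GPU/Vulkan state
    speech_segments: int,                 # number of segments VAD reported
    max_retries: int = 2,                 # assumed limit, to avoid looping forever
) -> str:
    """Retry only when VAD found speech but the result is a bare placeholder."""
    text = transcribe()
    attempts = 0
    while (
        speech_segments > 0
        and text.strip() in DEGENERATE_OUTPUTS
        and attempts < max_retries
    ):
        reinit_backend()
        text = transcribe()
        attempts += 1
    return text


# Demo with a fake transcriber that fails once, then succeeds.
fake_results = iter(["-", "hello"])
reinit_calls = []
result = transcribe_with_retry(
    transcribe=lambda: next(fake_results),
    reinit_backend=lambda: reinit_calls.append("reinit"),
    speech_segments=5,
)
print(result, len(reinit_calls))
```

Gating the retry on the VAD segment count keeps recordings that really are silent from being retried, and the retry cap means a persistent failure still returns after a bounded number of attempts.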