Fix invalid multibyte string crash in BPE decoder#2

Merged
TroyHernandez merged 2 commits into main from fix/multibyte-decode on Feb 27, 2026

@TroyHernandez
Contributor

Summary

  • decode_bpe_bytes() crashes with "invalid multibyte string, element 1" when BPE token decoding produces partial UTF-8 byte sequences
  • This happens intermittently during transcription depending on the audio content
  • The fix writes the raw bytes to a temp file, reads them back, and passes the result through iconv() so that invalid sequences are stripped instead of raising an error
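
The approach in the last bullet can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: the function name `decode_bytes_lenient` is hypothetical, and it assumes the platform's iconv drops unconvertible bytes when given `sub = ""` for a UTF-8 to UTF-8 conversion.

```r
# Hypothetical helper (name is an assumption, not taken from the package).
# Turns a raw vector of BPE-decoded bytes into a UTF-8 string, dropping any
# bytes that do not form a complete, valid UTF-8 sequence.
decode_bytes_lenient <- function(bytes) {
  if (length(bytes) == 0L) return("")

  # Write the bytes to a temp file and make sure it is cleaned up afterwards.
  tmp <- tempfile()
  on.exit(unlink(tmp), add = TRUE)
  writeBin(as.raw(bytes), tmp)

  # Read the same number of bytes back as one string, without interpreting
  # them as characters in the current locale.
  txt <- readChar(tmp, nchars = length(bytes), useBytes = TRUE)

  # Re-encode UTF-8 to UTF-8; sub = "" replaces unconvertible bytes with
  # nothing, so partial sequences are removed rather than causing an error.
  iconv(txt, from = "UTF-8", to = "UTF-8", sub = "")
}

# "h", then a complete two-byte "é" (0xC3 0xA9), then a dangling lead byte.
bytes <- as.raw(c(0x68, 0xc3, 0xa9, 0xc3))
out <- decode_bytes_lenient(bytes)
print(validUTF8(out))
```

Stripping is lossy by design: a character whose bytes are split across two tokens is dropped if the two halves are decoded separately, so a sketch like this fits best when applied to the concatenated bytes of a whole segment.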

Test plan

  • Verified fix on two TTS-generated audio files that consistently triggered the crash
  • Run R CMD check

🤖 Generated with Claude Code

Whisper's decoder can emit EOT early, skipping speech in the middle of
a 30s chunk. This adds a seek loop (matching the Python reference
implementation) that re-encodes from the last timestamp position when
the model stops before the end of the chunk.

Also improves transcribe_long overlap dedup: trims overlapping segments
at chunk boundaries instead of dropping them, filters hallucinated
segments from padded chunks, and caps timestamps to actual audio duration.

rawToChar() crashes when BPE token decoding produces partial UTF-8
byte sequences. Write raw bytes to temp file and read back with
iconv to gracefully strip invalid sequences instead of erroring.
@TroyHernandez TroyHernandez merged commit f74f733 into main Feb 27, 2026
2 checks passed
@TroyHernandez TroyHernandez deleted the fix/multibyte-decode branch March 13, 2026 17:36