Fix invalid multibyte string crash in BPE decoder #2
Merged
Whisper's decoder can emit EOT early, skipping speech in the middle of a 30s chunk. This adds a seek loop (matching the Python reference implementation) that re-encodes from the last timestamp position when the model stops before the end of the chunk. Also improves transcribe_long overlap dedup: trims overlapping segments at chunk boundaries instead of dropping them, filters hallucinated segments from padded chunks, and caps timestamps to actual audio duration.
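The seek loop described above might look roughly like the following. This is a minimal sketch, not the code from this PR: `seek_decode()` and `decode_window()` are hypothetical names, and `decode_window(seek)` is assumed to return a data frame of segments with absolute `start`/`end` times in seconds for the window beginning at `seek`.

```r
# Hypothetical sketch of a seek loop. decode_window() stands in for the real
# model call and is NOT part of the package API.
seek_decode <- function(decode_window, chunk_start, chunk_end, window = 30) {
  seek <- chunk_start
  out <- list()
  while (seek < chunk_end) {
    segs <- decode_window(seek)  # data.frame(start, end, text), absolute seconds
    if (nrow(segs) == 0) {
      # Nothing decoded from this position: skip a full window.
      seek <- seek + window
      next
    }
    out[[length(out) + 1]] <- segs
    last_end <- max(segs$end)
    # Resume from the last timestamp; always move forward so a stuck
    # timestamp cannot cause an infinite loop.
    seek <- if (last_end > seek) last_end else seek + window
  }
  if (length(out) == 0) {
    return(data.frame(start = numeric(0), end = numeric(0),
                      text = character(0), stringsAsFactors = FALSE))
  }
  do.call(rbind, out)
}

# Toy decoder that "emits EOT" after a single segment on every call.
fake_decode <- function(seek) {
  all <- data.frame(start = c(0, 12), end = c(5, 20),
                    text = c("a", "b"), stringsAsFactors = FALSE)
  head(all[all$end > seek, ], 1)
}

res <- seek_decode(fake_decode, chunk_start = 0, chunk_end = 30)
```

With the toy decoder, a single pass would stop after the first segment. The loop re-decodes from the last timestamp and picks up the second segment as well.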
rawToChar() crashes when BPE token decoding produces partial UTF-8 byte sequences. Write raw bytes to temp file and read back with iconv to gracefully strip invalid sequences instead of erroring.
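A minimal sketch of the temp-file plus `iconv()` approach described above, assuming the bytes arrive as a raw vector. The helper name `decode_bytes_safe()` is hypothetical and this is not necessarily how `decode_bpe_bytes()` is implemented in the PR.

```r
# Hypothetical helper: turn raw bytes into a valid UTF-8 string, dropping
# any partial or invalid byte sequences instead of raising an error.
decode_bytes_safe <- function(bytes) {
  bytes <- as.raw(bytes)
  if (length(bytes) == 0) return("")

  tmp <- tempfile()
  on.exit(unlink(tmp), add = TRUE)

  # Write the bytes out verbatim.
  writeBin(bytes, tmp)

  # Read them back byte-wise, without interpreting them as characters.
  txt <- readChar(tmp, nchars = length(bytes), useBytes = TRUE)

  # sub = "" asks iconv() to drop bytes it cannot convert.
  iconv(txt, from = "UTF-8", to = "UTF-8", sub = "")
}

# "hi" followed by the first two bytes of a three-byte UTF-8 character.
partial <- as.raw(c(0x68, 0x69, 0xe4, 0xbd))
out <- decode_bytes_safe(partial)
```

The trade-off is that a character split across token boundaries is dropped rather than reassembled, which is acceptable when the goal is only to avoid the crash.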
Summary
- decode_bpe_bytes() crashes with "invalid multibyte string, element 1" when BPE token decoding produces partial UTF-8 byte sequences
- Write raw bytes to a temp file and read them back with iconv() to gracefully strip invalid sequences instead of erroring

Test plan
- R CMD check

🤖 Generated with Claude Code