Add segment-level and word-level timestamps #1
Merged
Segment timestamps use Whisper's built-in timestamp tokens (<|0.00|> through <|30.00|>), with logit suppression rules that enforce well-formed timestamp generation: forward-only, paired, and capped at 30s.

Word timestamps use cross-attention DTW alignment. During decoding, cross-attention weights are captured from model-specific alignment heads, then dynamic time warping maps each token to audio frames. Subword tokens are merged into words with start/end times.

API: transcribe(..., timestamps=TRUE) returns a segments data.frame; transcribe(..., word_timestamps=TRUE) returns a words data.frame. Both work with single chunks and with long audio (time offsets are applied automatically).
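The DTW step described above can be sketched as follows. This is a minimal illustration and not the PR's dtw_align(): the names dtw_path_sketch() and token_times_sketch() are hypothetical, the cost matrix is assumed to be negated attention weights with tokens in rows and audio frames in columns, and the 0.02 s frame length assumes Whisper's 1500 encoder frames per 30 s window.

```r
# Hypothetical sketch: monotonic DTW path through a tokens x frames cost matrix.
dtw_path_sketch <- function(cost) {
  n <- nrow(cost)
  m <- ncol(cost)
  acc <- matrix(Inf, n + 1L, m + 1L)   # accumulated cost, padded by one row/column
  acc[1, 1] <- 0
  step <- matrix(0L, n + 1L, m + 1L)   # which predecessor was chosen for each cell
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      # Predecessors: 1 = diagonal, 2 = previous token, 3 = previous frame
      prev <- c(acc[i, j], acc[i, j + 1L], acc[i + 1L, j])
      k <- which.min(prev)
      acc[i + 1L, j + 1L] <- cost[i, j] + prev[k]
      step[i + 1L, j + 1L] <- k
    }
  }
  # Backtrack from the last token/frame to the first
  i <- n
  j <- m
  tok <- integer(0)
  frm <- integer(0)
  while (i > 0L && j > 0L) {
    tok <- c(i, tok)
    frm <- c(j, frm)
    k <- step[i + 1L, j + 1L]
    if (k == 1L) {
      i <- i - 1L
      j <- j - 1L
    } else if (k == 2L) {
      i <- i - 1L
    } else {
      j <- j - 1L
    }
  }
  data.frame(token = tok, frame = frm)
}

# Hypothetical helper: convert the path into per-token start/end times.
# seconds_per_frame = 0.02 is an assumption (30 s / 1500 encoder frames).
token_times_sketch <- function(path, seconds_per_frame = 0.02) {
  first <- tapply(path$frame, path$token, min)
  last <- tapply(path$frame, path$token, max)
  data.frame(
    token = as.integer(names(first)),
    start = (as.numeric(first) - 1) * seconds_per_frame,
    end = as.numeric(last) * seconds_per_frame
  )
}

# Toy input: 3 tokens, 6 frames, each token attending to two consecutive frames
attn <- matrix(0, 3, 6)
attn[1, 1:2] <- 1
attn[2, 3:4] <- 1
attn[3, 5:6] <- 1
path <- dtw_path_sketch(-attn)
times <- token_times_sketch(path)
```

Merging subword tokens into words would then take the start of the first token and the end of the last token in each word.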
decode_bpe_bytes() was a stub that only handled the space token, causing non-ASCII characters (accented Latin, CJK, etc.) to come out garbled. Now fully reverses the GPT-2 byte-to-unicode mapping with a cached lookup table.
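A rough sketch of reversing the GPT-2 byte-to-unicode mapping with a cached table is shown below. It is not the PR's decode_bpe_bytes(): the names byte_decoder_sketch() and decode_bpe_sketch() and the environment-based cache are assumptions made for illustration, and the input is assumed to be a valid BPE token string in UTF-8.

```r
# Hypothetical cache holding the code point -> byte table
.bpe_cache <- new.env(parent = emptyenv())

byte_decoder_sketch <- function() {
  if (is.null(.bpe_cache$table)) {
    # GPT-2 keeps these "printable" bytes as their own code points
    printable <- c(33:126, 161:172, 174:255)
    # Every other byte is shifted to code points 256, 257, ... in ascending order
    other <- setdiff(0:255, printable)
    bytes <- c(printable, other)
    codepoints <- c(printable, 256L + seq_along(other) - 1L)
    # table[codepoint + 1] holds the original byte value
    table <- integer(max(codepoints) + 1L)
    table[codepoints + 1L] <- bytes
    .bpe_cache$table <- table
  }
  .bpe_cache$table
}

decode_bpe_sketch <- function(token_text) {
  cps <- utf8ToInt(token_text)                      # code points of the BPE string
  raw_bytes <- as.raw(byte_decoder_sketch()[cps + 1L])
  out <- rawToChar(raw_bytes)                       # reassemble the original bytes
  Encoding(out) <- "UTF-8"
  out
}

# " café" is stored by the BPE vocabulary as the code points below
# (288 is the stand-in for the space byte; 195 and 169 are the two bytes of "é")
decoded <- decode_bpe_sketch(intToUtf8(c(288, 99, 97, 102, 195, 169)))
```

Because multi-byte characters can be split across tokens, a decoder along these lines would normally concatenate the token strings for a word or segment before converting back to bytes.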
Summary
- Segment timestamps (timestamps=TRUE): Enables Whisper's built-in timestamp tokens (<|0.00|> through <|30.00|>) with logit suppression rules enforcing forward-only, paired timestamps capped at 30s. Returns a segments data.frame with start, end, text.
- Word timestamps (word_timestamps=TRUE): Captures cross-attention weights from model-specific alignment heads during decoding, then uses DTW alignment to map each token to audio frames. Subword tokens are merged into words. Returns a words data.frame with word, start, end.

Test plan
- Tests covering apply_timestamp_rules(), extract_segments(), dtw_align(), medfilt1(), group_into_words() (81 tests, all passing)
- Tests under at_home() + model_exists("tiny") for both timestamps=TRUE and word_timestamps=TRUE
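For readers unfamiliar with the forward-only and pairing rules named above, here is a simplified sketch of how such logit suppression might look. It is not the PR's apply_timestamp_rules(): the function name, argument layout, and 1-based token ids are assumptions, text tokens are assumed to occupy the ids below eot, and timestamp tokens are assumed to occupy timestamp_begin through the end of the vocabulary.

```r
# Hypothetical sketch of timestamp logit suppression.
# logits: numeric vector indexed by (1-based) token id
# tokens: ids sampled so far, excluding the prompt
timestamp_rules_sketch <- function(logits, tokens, timestamp_begin, eot) {
  n <- length(tokens)
  ts_ids <- timestamp_begin:length(logits)
  is_ts <- tokens >= timestamp_begin

  if (n == 0L) {
    # The first sampled token must be a timestamp
    logits[-ts_ids] <- -Inf
    return(logits)
  }

  last_ts <- is_ts[n]
  # With fewer than two tokens, treat the missing token as a timestamp
  prev_ts <- n < 2L || is_ts[n - 1L]

  if (last_ts) {
    if (prev_ts) {
      logits[ts_ids] <- -Inf             # pair complete: text must follow
    } else {
      logits[seq_len(eot - 1L)] <- -Inf  # segment just closed: no text allowed
    }
  }

  if (any(is_ts)) {
    last_id <- tail(tokens[is_ts], 1L)
    # After a closing timestamp the next one may repeat it;
    # otherwise timestamps must strictly increase
    floor_id <- if (last_ts && !prev_ts) last_id else last_id + 1L
    if (floor_id > timestamp_begin) {
      logits[timestamp_begin:(floor_id - 1L)] <- -Inf
    }
  }
  logits
}

# Toy vocabulary: ids 1-4 text, 5 = end of text, 6 = other special, 7-10 timestamps
allowed_after <- function(tokens) {
  which(is.finite(timestamp_rules_sketch(rep(0, 10), tokens, 7L, 5L)))
}
```

In the toy vocabulary, allowed_after(c(7L, 2L, 8L)) lists the ids still permitted after a segment has just been closed with timestamp id 8.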