Add segment-level and word-level timestamps #1

Merged
TroyHernandez merged 4 commits into main from timestamps on Feb 18, 2026

Summary

  • Segment timestamps (timestamps=TRUE): Enables Whisper's built-in timestamp tokens (<|0.00|> through <|30.00|>), with logit suppression rules that force timestamps to be forward-only, paired, and capped at 30 s. Returns a segments data.frame with columns start, end, and text.
  • Word timestamps (word_timestamps=TRUE): Captures cross-attention weights from model-specific alignment heads during decoding, then uses dynamic time warping (DTW) to map each token to audio frames. Subword tokens are merged into words. Returns a words data.frame with columns word, start, and end.
  • Both work with single chunks and long audio (automatic time offsets per chunk).
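
The subword merging step in the word-timestamp bullet can be illustrated with a small sketch. It is written in Python for brevity and is not the package's group_into_words(); the `Word` class, the `merge_subwords` name, and the rule "a leading space starts a new word" are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Word:
    word: str
    start: float
    end: float


def merge_subwords(tokens):
    """Merge (text, start, end) subword tokens into words.

    Assumption: a token whose decoded text begins with a space opens a
    new word; any other token continues the previous word.
    """
    words = []
    for text, start, end in tokens:
        if text.startswith(" ") or not words:
            # New word: drop the leading space, keep this token's times.
            words.append(Word(text.lstrip(" "), start, end))
        else:
            # Continuation: extend the text and push the end time out.
            words[-1].word += text
            words[-1].end = end
    return words
```

Each merged word keeps the start time of its first subword and the end time of its last one.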

Test plan

  • Unit tests for apply_timestamp_rules(), extract_segments(), dtw_align(), medfilt1(), group_into_words() (81 tests, all passing)
  • Integration tests guarded by at_home() + model_exists("tiny") for both timestamps=TRUE and word_timestamps=TRUE
  • Manual verification with JFK sample audio — segments and word times are accurate and monotonic

Segment timestamps use Whisper's built-in timestamp tokens (<|0.00|>
through <|30.00|>) with logit suppression rules that enforce proper
timestamp generation (forward-only, paired, capped at 30s).
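
As a rough illustration of these suppression rules, the Python sketch below masks the next-token logits based on the tokens sampled so far. It is not the package's apply_timestamp_rules(); the function name and the token layout (text ids below `eot`, timestamp ids from `ts_begin` upward) are assumptions, and the 30 s cap is taken to be implied by the largest timestamp id in the vocabulary.

```python
NEG_INF = float("-inf")


def suppress_for_timestamps(logits, tokens, eot, ts_begin):
    """Return a copy of `logits` with disallowed next tokens set to -inf.

    Hypothetical layout: ids < eot are text, `eot` ends the sequence,
    ids >= ts_begin are timestamp tokens in increasing time order.
    """
    out = list(logits)
    last_is_ts = len(tokens) >= 1 and tokens[-1] >= ts_begin
    prev_is_ts = len(tokens) < 2 or tokens[-2] >= ts_begin

    if not tokens:
        # A segment has to open with a timestamp.
        for i in range(ts_begin):
            out[i] = NEG_INF
    elif last_is_ts and prev_is_ts:
        # An opening timestamp was just emitted: text must come next.
        for i in range(ts_begin, len(out)):
            out[i] = NEG_INF
    elif last_is_ts:
        # Text was closed by a timestamp: only a timestamp or EOT may follow.
        for i in range(eot):
            out[i] = NEG_INF

    # Forward-only: never allow a timestamp earlier than the last one seen.
    seen = [t for t in tokens if t >= ts_begin]
    if seen:
        for i in range(ts_begin, seen[-1]):
            out[i] = NEG_INF
    return out
```

The pairing rule is what lets a decoder read segments back out afterwards: each run of text is bracketed by a start and an end timestamp.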

Word timestamps use cross-attention DTW alignment: during decoding,
cross-attention weights are captured from model-specific alignment
heads, then dynamic time warping maps each token to audio frames.
Subword tokens are merged into words with start/end times.
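
The alignment step can be sketched as a plain dynamic-programming DTW over a token-by-frame cost matrix. This Python version is only an illustration, not the package's dtw_align(): the toy attention matrix is invented, the cost is taken as 1 minus the attention weight to keep the example simple, and the 0.02 s frame duration reflects Whisper's 50 encoder frames per second.

```python
def dtw_path(cost):
    """Return the minimum-cost monotonic path through `cost` (tokens x frames)."""
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    step = [[0] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Predecessors: 0 = diagonal, 1 = previous token, 2 = previous frame.
            prev = (acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1])
            k = min(range(3), key=prev.__getitem__)
            acc[i][j] = cost[i - 1][j - 1] + prev[k]
            step[i][j] = k
    # Walk back from the last cell to recover the path.
    i, j, path = n, m, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        k = step[i][j]
        if k == 0:
            i, j = i - 1, j - 1
        elif k == 1:
            i -= 1
        else:
            j -= 1
    path.reverse()
    return path


def token_start_times(path, n_tokens, frame_seconds=0.02):
    """First frame assigned to each token, converted to seconds."""
    first = [None] * n_tokens
    for tok, frame in path:
        if first[tok] is None:
            first[tok] = frame
    return [f * frame_seconds for f in first]


# Toy attention weights: 3 tokens x 6 audio frames (made-up values).
attn = [
    [0.9, 0.8, 0.1, 0.0, 0.0, 0.0],
    [0.1, 0.1, 0.9, 0.9, 0.1, 0.0],
    [0.0, 0.0, 0.0, 0.1, 0.9, 0.9],
]
cost = [[1.0 - w for w in row] for row in attn]
starts = token_start_times(dtw_path(cost), n_tokens=3)
```

Here each token is assigned the two frames where its attention peaks, so the tokens start at frames 0, 2, and 4.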

API: transcribe(..., timestamps=TRUE) returns segments data.frame,
transcribe(..., word_timestamps=TRUE) returns words data.frame.
Both work with single chunks and long audio (automatic time offsets).
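
The per-chunk time offsets can be pictured as follows. This is a Python sketch under the assumption of fixed, non-overlapping 30-second chunks; the PR does not state how the package splits long audio, and `offset_segments` is a hypothetical name.

```python
def offset_segments(chunks, chunk_seconds=30.0):
    """Shift chunk-relative segment times onto the full-audio timeline.

    `chunks` is a list with one entry per audio chunk; each entry is a
    list of dicts with "start", "end", and "text" keys.
    """
    merged = []
    for index, segments in enumerate(chunks):
        offset = index * chunk_seconds  # where this chunk begins
        for seg in segments:
            merged.append({
                "start": seg["start"] + offset,
                "end": seg["end"] + offset,
                "text": seg["text"],
            })
    return merged
```

A segment at 1.5 s inside the second chunk would land at 31.5 s in the merged result.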

decode_bpe_bytes() was a stub that only handled the space token,
causing non-ASCII characters (accented Latin, CJK, etc.) to come
out garbled. It now fully reverses the GPT-2 byte-to-unicode mapping
using a cached lookup table.
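
For readers unfamiliar with the GPT-2 byte-to-unicode mapping, the Python sketch below shows the reversal with a cached table. It illustrates the technique only and is not the package's decode_bpe_bytes(); the function names are hypothetical.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def unicode_to_byte():
    """Build the reverse of GPT-2's byte-to-unicode table once and cache it."""
    # Bytes that GPT-2 maps to themselves (printable, non-space characters).
    keep = list(range(33, 127)) + list(range(161, 173)) + list(range(174, 256))
    byte_values = keep[:]
    code_points = keep[:]
    shifted = 0
    for b in range(256):
        if b not in keep:
            # Remaining bytes are remapped to code points 256, 257, ...
            byte_values.append(b)
            code_points.append(256 + shifted)
            shifted += 1
    return {chr(c): b for b, c in zip(byte_values, code_points)}


def decode_bpe_text(token_text):
    """Turn a BPE token string back into the UTF-8 text it encodes."""
    table = unicode_to_byte()
    raw = bytes(table[ch] for ch in token_text)
    # Partial multi-byte sequences decode to U+FFFD instead of raising.
    return raw.decode("utf-8", errors="replace")
```

Under this mapping the space byte appears as "Ġ" and "é" appears as the two characters "Ã©", which is why a stub that handles only the space token garbles accented and CJK text.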

TroyHernandez merged commit ac6cefd into main on Feb 18, 2026
2 checks passed
TroyHernandez deleted the timestamps branch on February 18, 2026 at 00:37