Add segment-level and word-level timestamps by TroyHernandez · Pull Request #1 · cornball-ai/whisper

TroyHernandez · 2026-02-17T13:28:06Z

Summary

Segment timestamps (timestamps=TRUE): Enables Whisper's built-in timestamp tokens (<|0.00|> through <|30.00|>) and applies logit suppression rules so that timestamps only move forward, come in pairs, and never exceed 30 s. Returns a segments data.frame with columns start, end, and text.
Word timestamps (word_timestamps=TRUE): Captures cross-attention weights from model-specific alignment heads during decoding, then uses dynamic time warping (DTW) to map each token to audio frames. Subword tokens are merged into words. Returns a words data.frame with columns word, start, and end.
Both modes work on single chunks and on long audio, where each chunk's times are offset automatically.
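
As a rough illustration of the segment-timestamp rules (forward-only, paired, capped at 30 s), here is a simplified sketch modeled on the rules used in the reference Whisper decoder. It is written in Python for brevity and is not the PR's apply_timestamp_rules(); the toy vocabulary layout (N_TEXT, EOT, TS_BEGIN) and the function name are assumptions made only for this example.

```python
import math

# Hypothetical toy vocabulary: ids 0..9 are text, then EOT, then timestamps.
N_TEXT = 10
EOT = N_TEXT
TS_BEGIN = EOT + 1          # id of <|0.00|>
N_TS = 1501                 # <|0.00|> .. <|30.00|> in 0.02 s steps
VOCAB = TS_BEGIN + N_TS     # no id exists past <|30.00|>, so 30 s is the cap
NEG_INF = -math.inf


def timestamp_rules_sketch(logits, sampled):
    """Return a copy of `logits` with disallowed next tokens set to -inf.

    `sampled` holds the token ids generated so far in this chunk.
    """
    out = list(logits)
    if not sampled:
        # The chunk must open with a timestamp token.
        for i in range(TS_BEGIN):
            out[i] = NEG_INF
        return out

    last_ts = sampled[-1] >= TS_BEGIN
    prev_ts = len(sampled) < 2 or sampled[-2] >= TS_BEGIN
    if last_ts:
        if prev_ts:
            # A pair was just completed (or the chunk just opened):
            # the next token cannot be another timestamp.
            for i in range(TS_BEGIN, VOCAB):
                out[i] = NEG_INF
        else:
            # A segment was just closed: text may not follow directly,
            # only a timestamp or EOT.
            for i in range(EOT):
                out[i] = NEG_INF

    # Forward-only: forbid timestamps earlier than the last one emitted.
    stamps = [t for t in sampled if t >= TS_BEGIN]
    if stamps:
        if last_ts and not prev_ts:
            floor = stamps[-1]        # may repeat to open the next segment
        else:
            floor = stamps[-1] + 1    # must advance, so segments are non-empty
        for i in range(TS_BEGIN, min(floor, VOCAB)):
            out[i] = NEG_INF
    return out
```

For example, after the tokens `[TS_BEGIN, 3, TS_BEGIN + 50]` (open at 0.00 s, one text token, close at 1.00 s), the sketch leaves only EOT and timestamps at or after 1.00 s available.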

Test plan

Unit tests for apply_timestamp_rules(), extract_segments(), dtw_align(), medfilt1(), group_into_words() (81 tests, all passing)
Integration tests guarded by at_home() + model_exists("tiny") for both timestamps=TRUE and word_timestamps=TRUE
Manual verification with JFK sample audio: segment and word times are accurate and monotonic

Segment timestamps use Whisper's built-in timestamp tokens (<|0.00|> through <|30.00|>) with logit suppression rules that enforce proper timestamp generation (forward-only, paired, capped at 30s). Word timestamps use cross-attention DTW alignment: during decoding, cross-attention weights are captured from model-specific alignment heads, then dynamic time warping maps each token to audio frames. Subword tokens are merged into words with start/end times. API: transcribe(..., timestamps=TRUE) returns segments data.frame, transcribe(..., word_timestamps=TRUE) returns words data.frame. Both work with single chunks and long audio (automatic time offsets).
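
The word-timestamp pipeline described here (DTW over attention scores, then merging subword tokens) can be sketched in a few lines. This is an illustrative Python sketch, not the package's dtw_align() or group_into_words(); the function names, the 0.02 s frame duration default, and the "leading space starts a new word" rule are assumptions. The input is assumed to be an already normalized token-by-frame score matrix (high = strong attention), since raw non-negative weights would make longer paths look cheaper.

```python
def dtw_path_sketch(cost):
    """Classic DTW over a tokens x frames cost matrix; returns the
    monotonic path as a list of (token_index, frame_index) pairs."""
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    step = [[0] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # 0 = diagonal, 1 = previous token/same frame, 2 = same token/previous frame
            cands = (acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1])
            k = min(range(3), key=lambda t: cands[t])
            acc[i][j] = cost[i - 1][j - 1] + cands[k]
            step[i][j] = k
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        k = step[i][j]
        if k == 0:
            i, j = i - 1, j - 1
        elif k == 1:
            i -= 1
        else:
            j -= 1
    path.reverse()
    return path


def token_times_sketch(scores, seconds_per_frame=0.02):
    """Map each token to (start, end) seconds from its first and last frame."""
    cost = [[-s for s in row] for row in scores]  # strong attention = low cost
    first, last = {}, {}
    for tok, frame in dtw_path_sketch(cost):
        first.setdefault(tok, frame)
        last[tok] = frame
    return [(first[t] * seconds_per_frame, (last[t] + 1) * seconds_per_frame)
            for t in range(len(scores))]


def merge_words_sketch(pieces, times):
    """Merge subword pieces into words; a leading space starts a new word."""
    words = []
    for piece, (start, end) in zip(pieces, times):
        if piece.startswith(" ") or not words:
            words.append({"word": piece.strip(), "start": start, "end": end})
        else:
            words[-1]["word"] += piece
            words[-1]["end"] = end
    return words


# Toy input: 3 tokens x 6 frames, each token attending to two frames.
scores = [
    [1, 1, -1, -1, -1, -1],
    [-1, -1, 1, 1, -1, -1],
    [-1, -1, -1, -1, 1, 1],
]
words = merge_words_sketch([" time", "stamp", " now"], token_times_sketch(scores))
```

On the toy input, the first two pieces merge into one word spanning frames 0 to 3 and the third piece becomes a second word spanning frames 4 to 5. For long audio, a chunk's start offset would be added to every start and end value.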

decode_bpe_bytes() was a stub that only handled the space token, causing non-ASCII characters (accented Latin, CJK, etc.) to come out garbled. Now fully reverses the GPT-2 byte-to-unicode mapping with a cached lookup table.
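
To show what reversing the GPT-2 byte-to-unicode mapping involves, here is a short Python sketch. The forward table follows the well-known GPT-2 construction; the function names and the use of functools.lru_cache as the cache are assumptions for illustration and do not reflect the R implementation of decode_bpe_bytes().

```python
from functools import lru_cache


@lru_cache(maxsize=1)
def byte_to_unicode_table():
    """GPT-2 style table: every byte 0..255 maps to a printable character."""
    # Bytes that are already printable keep their own code point.
    keep = (list(range(0x21, 0x7F))      # '!' .. '~'
            + list(range(0xA1, 0xAD))    # U+00A1 .. U+00AC
            + list(range(0xAE, 0x100)))  # U+00AE .. U+00FF
    table = {b: chr(b) for b in keep}
    # Remaining bytes (controls, space, etc.) are shifted above U+00FF.
    shift = 0
    for b in range(256):
        if b not in table:
            table[b] = chr(256 + shift)
            shift += 1
    return table


@lru_cache(maxsize=1)
def unicode_to_byte_table():
    """Cached reverse lookup, built once from the forward table."""
    return {ch: b for b, ch in byte_to_unicode_table().items()}


def decode_bpe_text_sketch(text):
    """Turn byte-level BPE text back into a normal UTF-8 string."""
    reverse = unicode_to_byte_table()
    raw = bytes(reverse[ch] for ch in text)
    return raw.decode("utf-8", errors="replace")


def encode_for_demo(text):
    """Forward mapping, used here only to build test input."""
    table = byte_to_unicode_table()
    return "".join(table[b] for b in text.encode("utf-8"))
```

A stub that only maps the space marker back to a space would leave multi-byte characters as their stand-in code points, which matches the garbling described above; the full reverse table restores the original bytes before UTF-8 decoding.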

TroyHernandez added 4 commits February 15, 2026 13:48

Update README and CLAUDE.md with timestamp documentation

3a81306

Add peak VRAM and speed benchmarks to models table

aaff392

Fix UTF-8 byte decoding in tokenizer

e613105

TroyHernandez merged commit ac6cefd into main Feb 18, 2026
2 checks passed

TroyHernandez deleted the timestamps branch February 18, 2026 00:37
