fix: replace metadata-based offset compensation with CBR byte-count math #468
Merged
rany2 merged 1 commit into rany2:master on Mar 22, 2026
The inter-chunk offset_compensation accumulated timing drift on long texts because it derived each chunk's baseline from Microsoft's reported metadata offsets plus a hardcoded 8,750,000-tick padding constant.

The output format is audio-24khz-48kbitrate-mono-mp3 (48 kbps CBR). For any CBR stream the relationship between byte count and duration is exact:

    ticks = total_bytes * 8 * 10_000_000 // 48_000

Every binary audio message, including encoded silence from the AI's variable inter-sentence pauses, is counted. This replaces the metadata-domain accumulation that was vulnerable to:

- Microsoft's integer-overflow bug in long-text Offset values
- Variable AI pause lengths that no single constant can model
- Compounding per-chunk errors across dozens of chunks

Intra-chunk word/sentence boundary offsets from Microsoft's metadata are still used as-is (accurate within a single chunk). Only the inter-chunk compensation is changed.

Fixes cumulative ~3-minute drift observed at ~4 hours of audio.
Owner: Why close this?

Contributor (Author): Thought you weren't interested since it's been days.

Contributor (Author): Thanks for merging, sorry I caused you issues.
Summary
Replace the inter-chunk offset_compensation logic with exact arithmetic
derived from the actual MP3 byte count. This eliminates cumulative subtitle
timing drift on long texts (~3 min drift at ~4 hours).
Problem
Communicate.__stream currently derives each chunk's offset baseline from the
previous chunk's last metadata offset + a hardcoded 8_750_000-tick (~875 ms)
padding constant. Three things cause this to accumulate unbounded error:

- Integer-overflow bug: Offset values in audio.metadata become unreliable on long texts, so last_duration_offset is incorrect as source material.
- Variable AI pauses: the pause length varies with factors such as punctuation and prosody. No single constant can model it.
- Compounding: each chunk's baseline builds on the already-compensated previous chunk. Per-chunk errors feed forward.
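To make the compounding effect concrete, here is a toy model of the old scheme, not the library's actual code. It assumes the error at each chunk boundary equals the true trailing gap minus the fixed padding; the function name and the gap values are hypothetical and chosen only for illustration.

```python
from typing import Sequence

FIXED_PADDING_TICKS = 8_750_000  # hardcoded padding used by the old scheme (~875 ms)


def accumulated_drift(actual_gap_ticks: Sequence[int]) -> int:
    """Toy model: total drift after carrying each chunk's error into the next.

    actual_gap_ticks holds the (hypothetical) true gap, in 100 ns ticks,
    between a chunk's last reported offset and the real end of its audio.
    """
    drift = 0
    for gap in actual_gap_ticks:
        # The old baseline assumed every gap equals the constant, so the
        # difference is added to whatever drift was already carried forward.
        drift += gap - FIXED_PADDING_TICKS
    return drift


# Illustrative input only: 100 chunk boundaries whose true gap is 900 ms each.
drift_ticks = accumulated_drift([9_000_000] * 100)
print(drift_ticks / 10_000_000, "seconds of drift")
```

With these made-up inputs a 25 ms mismatch per boundary grows to 2.5 seconds over 100 chunks, which is the feed-forward behaviour described in the list above.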
Fix
The output format audio-24khz-48kbitrate-mono-mp3 is 48 kbps CBR.
For any constant-bitrate stream, the byte-to-duration conversion is exact:

ticks = total_bytes * 8 * 10_000_000 // 48_000
Every WebSocket binary message — including encoded silence from the AI's
variable pauses — is counted. The "unpredictable pauses" that defeat the
fixed-constant approach are automatically and precisely accounted for because
they exist as real MP3 frames in the audio payload.
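The byte-count approach can be sketched as follows. The constants and the two counter names (chunk_audio_bytes, cumulative_audio_bytes) come from this PR; the ByteCountClock class and its method names are hypothetical stand-ins for the state handling in communicate.py, and the reset-on-turn.end flow is an assumption rather than the merged code.

```python
TICKS_PER_SECOND = 10_000_000  # 100 ns ticks
MP3_BITRATE_BPS = 48_000       # 48 kbps CBR output format


def bytes_to_ticks(total_bytes: int) -> int:
    """Convert a CBR byte count to a duration in ticks using integer math."""
    return total_bytes * 8 * TICKS_PER_SECOND // MP3_BITRATE_BPS


class ByteCountClock:
    """Hypothetical helper (not part of edge-tts) tracking audio bytes per chunk."""

    def __init__(self) -> None:
        self.chunk_audio_bytes = 0
        self.cumulative_audio_bytes = 0
        self.offset_compensation = 0

    def on_audio(self, payload: bytes) -> None:
        # Count every audio payload, including frames that encode silence.
        self.chunk_audio_bytes += len(payload)

    def on_turn_end(self) -> int:
        # Assumed flow: fold the finished chunk into the running total, then
        # derive the next chunk's baseline from the total byte count.
        self.cumulative_audio_bytes += self.chunk_audio_bytes
        self.chunk_audio_bytes = 0
        self.offset_compensation = bytes_to_ticks(self.cumulative_audio_bytes)
        return self.offset_compensation


clock = ByteCountClock()
clock.on_audio(b"\x00" * 6000)   # 6000 bytes at 48 kbps is exactly 1 second
clock.on_audio(b"\x00" * 3000)   # another 0.5 seconds
first_baseline = clock.on_turn_end()
```

Because the baseline is recomputed from the running total each time, an error in one chunk cannot carry into the next; the sketch deliberately ignores any container or tag bytes that a real stream might include.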
What changes
| File | Change |
| --- | --- |
| constants.py | TICKS_PER_SECOND = 10_000_000 and MP3_BITRATE_BPS = 48_000 |
| typing.py | Add chunk_audio_bytes and cumulative_audio_bytes to CommunicateState |
| communicate.py | On turn.end, compute offset_compensation from cumulative byte count instead of metadata + constant |

What does NOT change
- Intra-chunk word/sentence boundary offsets are still taken as-is from Microsoft's
metadata (accurate within a single chunk).
- SubMaker, save(), stream() API: all unchanged.
- last_duration_offset is still set for backward compatibility / debugging.

Why this is correct
The only scenario where this could break is if Microsoft switched to VBR
encoding, which would require a different outputFormat string and would
therefore be trivially detectable.
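One possible guard for that scenario is sketched below. It assumes the bitrate is encoded in the format string as "<N>kbitrate", as it is in audio-24khz-48kbitrate-mono-mp3; both function names are hypothetical and not part of the library. The string does not state CBR versus VBR on its own, so the check only catches a changed format string.

```python
import re
from typing import Optional

EXPECTED_BITRATE_BPS = 48_000  # mirrors MP3_BITRATE_BPS from this PR


def bitrate_from_output_format(output_format: str) -> Optional[int]:
    """Extract the bitrate in bits per second, or None if none is encoded."""
    match = re.search(r"-(\d+)kbitrate-", output_format)
    if match is None:
        return None
    return int(match.group(1)) * 1000


def format_matches_byte_math(output_format: str) -> bool:
    """True only when the format is MP3 at the bitrate the byte math assumes."""
    return (
        output_format.endswith("-mp3")
        and bitrate_from_output_format(output_format) == EXPECTED_BITRATE_BPS
    )


print(format_matches_byte_math("audio-24khz-48kbitrate-mono-mp3"))
```

A caller could fall back to metadata-based offsets, or raise an error, whenever this check fails.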
Testing
Validated on texts producing up to 50 minutes of continuous audio with
real-time word-boundary tracking against actual playback position.
Zero observable drift. SRT timestamps remain synchronized with the audio
throughout.