
fix: replace metadata-based offset compensation with CBR byte-count math #468

Merged
rany2 merged 1 commit into rany2:master from hcgiub001:fix-timing-drift-byte-count
Mar 22, 2026

Conversation

@hcgiub001
Contributor

Summary

Replace the inter-chunk offset_compensation logic with exact arithmetic
derived from the actual MP3 byte count. This eliminates cumulative subtitle
timing drift on long texts (~3 min drift at ~4 hours).

Problem

Communicate.__stream currently derives each chunk's offset baseline from the
previous chunk's last metadata offset + a hardcoded 8_750_000-tick (~875 ms)
padding constant. Three things cause this to accumulate unbounded error:

  1. Microsoft's integer-overflow bug: raw Offset values in
    audio.metadata become unreliable on long texts, so
    last_duration_offset is an unreliable input to begin with.
  2. Variable AI pauses: the silence between chunks depends on content,
    punctuation, and prosody. No single constant can model it.
  3. Circular error compounding: each chunk's compensation is derived from
    the already-compensated previous chunk, so per-chunk errors feed forward.
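
To illustrate points 2 and 3, here is a toy model (not the library's code) of how a fixed padding constant accumulates error when the real gap at the end of each chunk varies. The function name and the gap values are invented for illustration; only the 8_750_000-tick constant comes from the description above.

```python
# Toy model: every chunk's real trailing gap is replaced by one fixed constant,
# and the resulting error is carried into every later chunk.
FIXED_PADDING_TICKS = 8_750_000  # hardcoded padding from the old approach (875 ms)


def drift_with_fixed_padding(true_gaps_ticks):
    """Return the cumulative error in ticks after each chunk.

    true_gaps_ticks holds hypothetical real gaps, one per chunk.
    """
    drift = 0
    history = []
    for gap in true_gaps_ticks:
        # Per-chunk error is (assumed gap - real gap); it is never corrected later.
        drift += FIXED_PADDING_TICKS - gap
        history.append(drift)
    return history


# Invented example: real gaps of 0.5 s, 0.6 s and 0.7 s.
print(drift_with_fixed_padding([5_000_000, 6_000_000, 7_000_000]))
```

In this model the error only stays at zero if every real gap happens to equal the constant, which is the situation point 2 rules out.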

Fix

The output format audio-24khz-48kbitrate-mono-mp3 is 48 kbps CBR.
For any constant-bitrate stream, the byte-to-duration conversion is exact:

ticks = total_bytes × 8 × 10,000,000 ÷ 48,000

Every WebSocket binary message — including encoded silence from the AI's
variable pauses — is counted. The "unpredictable pauses" that defeat the
fixed-constant approach are automatically and precisely accounted for because
they exist as real MP3 frames in the audio payload.
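
A minimal sketch of the conversion above, using integer arithmetic only. The constant names match the ones this PR adds to constants.py; the function name audio_bytes_to_ticks is hypothetical and is not claimed to exist in the codebase.

```python
TICKS_PER_SECOND = 10_000_000  # one tick is 100 ns
MP3_BITRATE_BPS = 48_000       # 48 kbps, i.e. 6000 audio bytes per second


def audio_bytes_to_ticks(total_bytes: int) -> int:
    """Convert a count of CBR audio bytes to a duration in 100 ns ticks.

    Multiplying before dividing keeps the result within one tick of the
    exact value; floor division truncates any fractional tick.
    """
    return total_bytes * 8 * TICKS_PER_SECOND // MP3_BITRATE_BPS


# 6000 bytes at 48 kbps is exactly one second of audio.
print(audio_bytes_to_ticks(6000))
```

Because Python integers are arbitrary precision, the intermediate product cannot overflow, even for many hours of audio.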

What changes

| File | Change |
| --- | --- |
| constants.py | Add TICKS_PER_SECOND = 10_000_000 and MP3_BITRATE_BPS = 48_000 |
| typing.py | Add chunk_audio_bytes and cumulative_audio_bytes to CommunicateState |
| communicate.py | Count audio bytes per chunk; at turn.end, compute offset_compensation from cumulative byte count instead of metadata + constant |
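
The bookkeeping described in the table can be sketched roughly as follows. This is not the merged diff: ChunkState is a hypothetical stand-in for CommunicateState, and on_audio / on_turn_end are invented handler names. Only the field and constant names are taken from this PR.

```python
from typing import TypedDict

TICKS_PER_SECOND = 10_000_000
MP3_BITRATE_BPS = 48_000


class ChunkState(TypedDict):
    """Hypothetical stand-in for the fields added to CommunicateState."""
    chunk_audio_bytes: int
    cumulative_audio_bytes: int
    offset_compensation: int


def on_audio(state: ChunkState, payload: bytes) -> None:
    # Count every audio byte received for the current chunk.
    state["chunk_audio_bytes"] += len(payload)


def on_turn_end(state: ChunkState) -> None:
    # Fold the finished chunk into the running total, then derive the next
    # chunk's baseline from the total alone (no metadata, no constant).
    state["cumulative_audio_bytes"] += state["chunk_audio_bytes"]
    state["chunk_audio_bytes"] = 0
    state["offset_compensation"] = (
        state["cumulative_audio_bytes"] * 8 * TICKS_PER_SECOND // MP3_BITRATE_BPS
    )


state: ChunkState = {
    "chunk_audio_bytes": 0,
    "cumulative_audio_bytes": 0,
    "offset_compensation": 0,
}
on_audio(state, b"\x00" * 6000)  # one second of audio at 48 kbps
on_turn_end(state)
print(state["offset_compensation"])
```

Each baseline is recomputed from the cumulative total, so an error in one chunk's metadata has no path into the next chunk's compensation.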

What does NOT change

  • Intra-chunk word/sentence boundary offsets still come from Microsoft's
    metadata (accurate within a single chunk).
  • SubMaker, save(), stream() API — all unchanged.
  • last_duration_offset is still set for backward compatibility / debugging.
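
As a rough illustration of how the two sources combine, an absolute timestamp is the intra-chunk metadata offset plus the byte-derived compensation. Both helper names below are hypothetical and are not part of SubMaker or the public API.

```python
TICKS_PER_MILLISECOND = 10_000  # 10_000_000 ticks per second / 1000


def to_absolute_offset(intra_chunk_offset_ticks: int, offset_compensation_ticks: int) -> int:
    """Shift a metadata offset (relative to its chunk) onto the global timeline."""
    return intra_chunk_offset_ticks + offset_compensation_ticks


def ticks_to_srt_timestamp(ticks: int) -> str:
    """Format a tick count as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = ticks // TICKS_PER_MILLISECOND
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{ms:03d}"


# A word 250 ms into a chunk that starts after exactly 1 s of prior audio.
absolute = to_absolute_offset(2_500_000, 10_000_000)
print(ticks_to_srt_timestamp(absolute))
```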

Why this is correct

| Property | Old approach | New approach |
| --- | --- | --- |
| Inter-chunk offset source | Last metadata offset + 875 ms constant | Actual MP3 bytes × exact CBR arithmetic |
| Affected by MS integer overflow | Yes | No: metadata not used for accumulation |
| Handles variable AI pauses | No | Yes: silence is encoded in the bytes |
| Error accumulation | Compounds per chunk | None (each baseline is recomputed from the cumulative byte count, independent of earlier compensations) |
| Mathematical guarantee | None | Exact for CBR up to integer-division truncation of less than one tick (100 ns); no floating-point rounding |

The only scenario where this could break is if Microsoft switched to VBR
encoding, which would require a different outputFormat string — trivially
detectable.
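
One possible guard, sketched under the assumption that the nominal bitrate always appears in the format name as "<N>kbitrate", as it does in audio-24khz-48kbitrate-mono-mp3. Both function names are hypothetical. The check only confirms that the name still advertises 48 kbps; it cannot prove the stream is actually CBR.

```python
import re

MP3_BITRATE_BPS = 48_000


def bitrate_from_output_format(output_format: str) -> int:
    """Extract the nominal bitrate (bits per second) from a format name."""
    match = re.search(r"(\d+)kbitrate", output_format)
    if match is None:
        raise ValueError(f"no bitrate found in output format: {output_format!r}")
    return int(match.group(1)) * 1000


def check_output_format(output_format: str) -> None:
    """Fail loudly if the format name no longer matches the byte-count math."""
    bitrate = bitrate_from_output_format(output_format)
    if bitrate != MP3_BITRATE_BPS:
        raise ValueError(f"expected {MP3_BITRATE_BPS} bps, got {bitrate} bps")


check_output_format("audio-24khz-48kbitrate-mono-mp3")  # passes silently
```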

Testing

Validated on texts producing up to 50 minutes of continuous audio with
real-time word-boundary tracking against actual playback position.
Zero observable drift. SRT timestamps remain synchronized with the audio
throughout.

@hcgiub001 force-pushed the fix-timing-drift-byte-count branch 2 times, most recently from b81f2e6 to b6f76d0, March 17, 2026 11:43
The inter-chunk offset_compensation accumulated timing drift on long
texts because it derived each chunk's baseline from Microsoft's reported
metadata offsets plus a hardcoded 8,750,000-tick padding constant.

The output format is audio-24khz-48kbitrate-mono-mp3 (48 kbps CBR).
For any CBR stream the relationship between byte count and duration is
exact:  ticks = total_bytes * 8 * 10_000_000 // 48_000.

Every binary audio message — including encoded silence from the AI's
variable inter-sentence pauses — is counted.  This replaces the
metadata-domain accumulation that was vulnerable to:

  - Microsoft's integer-overflow bug in long-text Offset values
  - Variable AI pause lengths that no single constant can model
  - Compounding per-chunk errors across dozens of chunks

Intra-chunk word/sentence boundary offsets from Microsoft's metadata
are still used as-is (accurate within a single chunk).  Only the
inter-chunk compensation is changed.

Fixes cumulative ~3-minute drift observed at ~4 hours of audio.
@hcgiub001 force-pushed the fix-timing-drift-byte-count branch from b6f76d0 to cafe9a5, March 17, 2026 11:45
@hcgiub001 closed this Mar 21, 2026
@rany2 reopened this Mar 21, 2026
@rany2
Owner

rany2 commented Mar 21, 2026

Why close this?

@hcgiub001
Contributor Author

I thought you weren't interested, since it's been days.

@rany2 merged commit 9965046 into rany2:master Mar 22, 2026
3 checks passed
@hcgiub001
Contributor Author

thanks for merging, sorry I caused you issues
