Description
Component(s)
processor/batch
Describe the issue you're reporting
Problem description
The batch processor's timeout-based flushing mechanism accumulates delays over time, causing batches to drift significantly from their intended schedule. This creates issues for use cases that require consistent time alignment.
Current behaviour
When using timeout-based batching (e.g. timeout: 60s), we observe that:
- Individual batches are delayed: each batch arrives at the next processor slightly after the configured interval (60.2–60.3 seconds instead of 60.0)
- Delays accumulate over time: because each delay is carried into the next interval rather than corrected, the batching window drifts further from the ideal schedule with every flush
Evidence
Real-world timing data with a 60s timeout configuration, logging the time at which each batch of metrics arrives at the next processor:
1:09:08.684
1:10:08.733 (49 ms drift)
1:11:08.980 (296 ms drift)
1:12:09.487 (803 ms drift)
1:13:09.962 (1278 ms drift)
...
2:07:39.094 (30+ seconds drift)
(drift = difference between ideal batch start time and actual start time)
Script to reproduce the problem: github.com/franm7/batch-processor-script
Root Cause Analysis
Current implementation in startLoop:
```go
func (b *shard[T]) startLoop() {
	// ...
	for {
		select {
		// ...
		case <-timerCh:
			if b.batch.itemCount() > 0 {
				b.sendItems(triggerTimeout)
			}
			b.resetTimer()
		}
	}
}
```
Once the timer fires (timerCh receives a value), the current batch is flushed if non-empty, and only then is the timer reset for the next interval:
```go
func (b *shard[T]) resetTimer() {
	if b.hasTimer() {
		b.timer.Reset(b.processor.timeout)
	}
}
```
Small delays from goroutine scheduling, flush processing time, and timer inaccuracies compound over time because the timer is always reset to the full timeout after the flush completes. Individual delays are never accounted for: if a delay happens, nothing corrects it, and it shifts all future batching windows.
Impact
The drift makes it difficult to aggregate data that should be grouped together. When we expect batches to arrive at consistent intervals for time-based processing, the delay causes incorrect grouping.
Questions:
- Are maintainers aware of this drift accumulation behavior? Is it considered expected or unintended?
- Would the community be interested in exploring solutions for use cases requiring time-aligned batching?
Tip
React with 👍 to help prioritize this issue. Please use comments to provide useful context, avoiding "+1" or "me too", to help us triage it.