
[batchprocessor] Batch processor timeout accumulates drift over time, causing misaligned batching windows #13717

@franm7

Description


Component(s)

processor/batch

Describe the issue you're reporting

Problem description

The batch processor's timeout-based flushing mechanism accumulates delays over time, causing batches to drift significantly from their intended schedule. This creates issues for use cases that require consistent time alignment.

Current behaviour

When using timeout-based batching (e.g. timeout: 60s), we observe that:

  • Individual batches are delayed: each batch arrives at the next processor slightly after the configured interval (60.2-60.3 seconds instead of 60.0)
  • Delays accumulate: each uncorrected delay shifts the batching window further over time

Evidence

Real-world timing data with a 60s timeout configuration, logging the time at which metrics arrive at the next processor:

1:09:08.684
1:10:08.733 (49 ms drift)
1:11:08.980 (296 ms drift)
1:12:09.487 (803 ms drift)
1:13:09.962 (1278 ms drift) 

...

2:07:39.094 (30+ seconds drift)

(drift = difference between ideal batch start time and actual start time)

Script to reproduce the problem: github.com/franm7/batch-processor-script
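
For reference, the drift in the table above can be computed with a few lines of Go. This is a simplified stand-in for the linked script, not taken from it; the recordArrival helper is made up for illustration:

package main

import (
    "fmt"
    "time"
)

// recordArrival (hypothetical helper) reports how far a batch arrival has
// drifted from the ideal schedule: anchor + n*interval.
func recordArrival(anchor time.Time, interval time.Duration, n int, arrival time.Time) {
    ideal := anchor.Add(time.Duration(n) * interval)
    fmt.Printf("batch %d arrived %s, drift=%v\n", n, arrival.Format("15:04:05.000"), arrival.Sub(ideal))
}

func main() {
    // Using the first two figures from the table: 60s timeout, first batch at 1:09:08.684.
    anchor, _ := time.Parse("15:04:05.000", "01:09:08.684")
    arrival, _ := time.Parse("15:04:05.000", "01:10:08.733")
    recordArrival(anchor, 60*time.Second, 1, arrival) // prints drift=49ms
}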

Root Cause Analysis

Current implementation in startLoop:

func (b *shard[T]) startLoop() {
    ...
    for {
        select {
        ...
        case <-timerCh:
            if b.batch.itemCount() > 0 {
                b.sendItems(triggerTimeout)
            }
            b.resetTimer()
        }
    }
}
When the timer fires (timerCh receives a value), the shard flushes the current batch (if non-empty) and only then resets the timer for the next full interval:

func (b *shard[T]) resetTimer() {
    if b.hasTimer() {
        b.timer.Reset(b.processor.timeout)
    }
}

Small delays from goroutine scheduling, flush processing time, and timer inaccuracy compound over time because individual delays are never accounted for: when a delay happens, nothing corrects it, so it shifts all future batching windows.
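
For illustration only, here is a small standalone Go sketch (not the processor's code; the function names and the simulated 5 ms delay are made up) contrasting the two timer strategies: resetting the timer to a full interval after each flush, versus computing each deadline from a fixed anchor. With the first strategy the per-cycle delay accumulates; with the second it does not:

package main

import (
    "fmt"
    "time"
)

// flushDrifting mimics the current approach: after each flush the timer is
// reset to a full interval, so flush latency is added to every cycle.
func flushDrifting(interval time.Duration, cycles int) {
    start := time.Now()
    timer := time.NewTimer(interval)
    defer timer.Stop()
    for i := 1; i <= cycles; i++ {
        <-timer.C
        time.Sleep(5 * time.Millisecond) // simulated flush / scheduling delay
        timer.Reset(interval)            // next window starts "now", not on schedule
        drift := time.Since(start) - time.Duration(i)*interval
        fmt.Printf("drifting cycle %d: drift=%v\n", i, drift)
    }
}

// flushAligned computes every deadline from a fixed anchor, so per-cycle
// delays do not accumulate: the i-th flush is scheduled at anchor + i*interval.
func flushAligned(interval time.Duration, cycles int) {
    anchor := time.Now()
    timer := time.NewTimer(interval)
    defer timer.Stop()
    for i := 1; i <= cycles; i++ {
        <-timer.C
        time.Sleep(5 * time.Millisecond) // same simulated delay
        next := anchor.Add(time.Duration(i+1) * interval)
        timer.Reset(time.Until(next)) // re-anchor to the fixed schedule
        drift := time.Since(anchor) - time.Duration(i)*interval
        fmt.Printf("aligned cycle %d: drift=%v\n", i, drift)
    }
}

func main() {
    const interval = 100 * time.Millisecond // shortened from 60s for the demo
    flushDrifting(interval, 5)
    flushAligned(interval, 5)
}

An actual fix would have to fit into the processor's existing timer handling (for example, the paths where the timer is stopped for size-based flushes), so this sketch is only meant to show the principle of anchoring deadlines to a fixed schedule.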

Impact

The drift makes it difficult to aggregate data that should be grouped together: when batches are expected to arrive at consistent intervals for time-based processing, the accumulated delay leads to incorrect grouping.

Questions:

  1. Are maintainers aware of this drift accumulation behavior? Is it considered expected or unintended?
  2. Would the community be interested in exploring solutions for use cases requiring time-aligned batching?

