
Collaborator

@paul1r paul1r commented Oct 17, 2025

What this PR does / why we need it:
We have seen partition ingesters overwhelm the WAL disk space while reading. This PR ties free disk space into the retry mechanism that already exists for ingesters: when WAL disk usage rises above a configured threshold, an ErrReadOnly error is returned. The ingester then falls into a retry backoff loop until the periodic flushing has cleaned up the WAL disk.

If the partition ingesters cannot keep up, consumption lag will increase, which gives us a metric that can be used to scale up the partition ingesters.

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:

Checklist

  • Reviewed the CONTRIBUTING.md guide (required)
  • Documentation added
  • Tests updated
  • Title matches the required conventional commits format, see here
    • Note that Promtail is considered to be feature complete, and future development for logs collection will be in Grafana Alloy. As such, feat PRs are unlikely to be accepted unless a case can be made for the feature actually being a bug fix to existing behavior.
  • Changes that require user attention or interaction to upgrade are documented in docs/sources/setup/upgrade/_index.md
  • If the change is deprecating or removing a configuration option, update the deprecated-config.yaml and deleted-config.yaml files respectively in the tools/deprecated-config-checker directory. Example PR

@paul1r paul1r requested a review from a team as a code owner October 17, 2025 17:30
Contributor

github-actions bot commented Oct 17, 2025

💻 Deploy preview deleted (feat(ingester): Add WAL throttling for partition ingesters).

Contributor

@benclive benclive left a comment

I like this solution but, unfortunately, I think it will drop records in the Kafka consumer code: the Push logic gives up once enough retries for the same record have elapsed, and it explicitly checks for the ErrReadOnly result and fails the retry logic.

Retry/Abort logic: pkg/ingester/kafka_consumer.go:157
ErrReadOnly abort logic: pkg/ingester/kafka_consumer.go:204

A potential fix is to back off indefinitely in the case of ErrReadOnly instead of aborting, but I don't know if that interferes with shutdown?

Collaborator Author

paul1r commented Oct 22, 2025

> I like this solution but, unfortunately, I think it will drop records in the Kafka consumer code: the Push logic gives up once enough retries for the same record have elapsed, and it explicitly checks for the ErrReadOnly result and fails the retry logic.
>
> Retry/Abort logic: pkg/ingester/kafka_consumer.go:157
> ErrReadOnly abort logic: pkg/ingester/kafka_consumer.go:204
>
> A potential fix is to back off indefinitely in the case of ErrReadOnly instead of aborting, but I don't know if that interferes with shutdown?

I believe the backoff logic is indefinite (kafka_consumer.go:218)?

The dskit.Backoff.Ongoing() call checks for a cancelled context, which should cover shutdown cases.

Contributor

@benclive benclive left a comment

Discussed offline: the Kafka consumer logic is actually handling this case already.

if cfg.Enabled && cfg.CheckpointDuration < 1 {
	return fmt.Errorf("invalid checkpoint duration: %v", cfg.CheckpointDuration)
}
if cfg.DiskFullThreshold < 0 || cfg.DiskFullThreshold > 1 {
Contributor

This means 0 cannot be used to disable it, even though the docs mention that option.

Collaborator Author

How so? This only rejects values that are less than zero or greater than 1, so 0 still passes validation.

Contributor

Misread like an idiot 😆
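For anyone else reading the condition quickly, the range check in this thread can be exercised with a small standalone sketch. `validateThreshold` and its error message are hypothetical; only the `< 0 || > 1` condition comes from the diff above.

```go
package main

import "fmt"

// validateThreshold applies the same range check as the diff under review:
// values below 0 or above 1 are rejected, and the endpoints 0 and 1 are valid.
// The error text is made up for this sketch.
func validateThreshold(threshold float64) error {
	if threshold < 0 || threshold > 1 {
		return fmt.Errorf("invalid disk full threshold: %v (must be between 0 and 1)", threshold)
	}
	return nil
}

func main() {
	for _, t := range []float64{-0.1, 0, 0.9, 1, 1.5} {
		fmt.Printf("%v -> %v\n", t, validateThreshold(t))
	}
}
```

Only -0.1 and 1.5 produce an error here, so a value of 0 remains a valid setting.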

@paul1r paul1r merged commit cce4511 into main Oct 22, 2025
121 of 125 checks passed
@paul1r paul1r deleted the paul1r/wal_throttle branch October 22, 2025 19:11