-
Notifications
You must be signed in to change notification settings - Fork 3.8k
feat(ingester): Add WAL throttling for partition ingesters #19538
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
💻 Deploy preview deleted (feat(ingester): Add WAL throttling for partition ingesters). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this solution but, unfortunately, I think it drop records in the kafka consumer code because the Push logic will give up if enough retries for the same record have elapsed, plus it is explicitly checking for the ErrReadOnly result and failing the retry logic.
Retry/Abort logic: pkg/ingester/kafka_consumer.go:157
ErrReadOnly abort logic: pkg/ingester/kafka_consumer.go:204
A potentially fix is to backoff indefinitely in the case of the ErrReadOnly instead of aborting, but I don't know if that interferes with shutdown?
I believe the backoff logic is indefinite (kafka_consumer.go:218)? The dskit.Backoff.Ongoing() call checks for a cancelled context, which should cover shutdown cases |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed offline: the kafka consumer logic is actually handling this case already.
| if cfg.Enabled && cfg.CheckpointDuration < 1 { | ||
| return fmt.Errorf("invalid checkpoint duration: %v", cfg.CheckpointDuration) | ||
| } | ||
| if cfg.DiskFullThreshold < 0 || cfg.DiskFullThreshold > 1 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cannot use 0 to disable it (but is mentioned in the docs).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How so? This is validating the number is less than zero or greater than 1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Misread like an idiot 😆
What this PR does / why we need it:
It has been seen that reading from partition ingesters can overwhelm the WAL disk space. This PR ties the disk free space into the retry mechanism that already exists for ingesters, such that when the disk is above a certain threshold, an
ErrReadOnlyerror is returned. The ingester will fall into a retry backoff loop in this case, until the WAL disk has been cleaned up by the periodic flushing that occurs.In case of the partition ingesters not being able to keep up, the consumption lag will increase, resulting in a metric that can be used to scale up the partition ingesters.
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Checklist
CONTRIBUTING.mdguide (required)featPRs are unlikely to be accepted unless a case can be made for the feature actually being a bug fix to existing behavior.docs/sources/setup/upgrade/_index.mddeprecated-config.yamlanddeleted-config.yamlfiles respectively in thetools/deprecated-config-checkerdirectory. Example PR