TBS: drop and recreate badger db after exceeding storage limit for TTL time #15106
Conversation
This pull request does not have a backport label. Could you fix it @carsonip? 🙏
When the code is mostly ready, I plan to split this PR into two: a refactoring PR with no behavior changes, and a bugfix PR with the actual "drop and recreate" logic.
```go
	return nil
}

timer := time.NewTicker(min(time.Minute, ttl)) // Eval db size every minute as badger reports them with 1m lag, but use min to facilitate testing
```
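For readers following along, here is a minimal sketch of the loop this ticker drives; `overLimit` and `dropAndRecreate` are hypothetical hooks standing in for the PR's actual storage-limit check and DB-swap logic, not names from the codebase:

```go
package storage

import (
	"context"
	"time"
)

// dropLoop ticks at min(1m, TTL) and, once the storage limit has been
// exceeded continuously for at least TTL, asks the caller to drop and
// recreate the DB. overLimit and dropAndRecreate are hypothetical hooks.
func dropLoop(ctx context.Context, ttl time.Duration, overLimit func() bool, dropAndRecreate func() error) error {
	ticker := time.NewTicker(min(time.Minute, ttl))
	defer ticker.Stop()
	var firstExceeded time.Time // zero while under the limit
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			if !overLimit() {
				firstExceeded = time.Time{} // limit cleared; reset the streak
				continue
			}
			if firstExceeded.IsZero() {
				firstExceeded = time.Now()
				continue
			}
			if time.Since(firstExceeded) >= ttl {
				// Entries written before the limit was hit should have
				// expired by now, so dropping the DB loses little.
				if err := dropAndRecreate(); err != nil {
					return err
				}
				firstExceeded = time.Time{}
			}
		}
	}
}
```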
Getting the DB size should be a relatively cheap operation, so we could consider lowering the ticker interval to check more frequently. This should prevent a configuration with a low, sub-minute TTL from swapping the DB frequently after just 2 limit hits in a row.
Not sure if I understand here.

> This should prevent a configuration with a low, sub-minute TTL from swapping the DB frequently after just 2 limit hits in a row.

If TTL < 1m, the DB will be swapped after 2*TTL. (I actually consider this pathological, as badger DB size reporting comes with a 1m delay.)
If TTL >= 1m, the DB will be swapped after at most TTL+1m.
Do you think we should fix TTL < 1m? It will increase test run time to 1m, but will fix the frequent DB swapping for sub-minute TTLs (which I don't think has much value in reality). I'm inclined to fix it.
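Spelling out the bound being discussed (an illustrative worked example, not code from the PR): with a check interval of min(1m, TTL), the worst-case delay before a swap is one interval to first observe the exceeded state, plus TTL for that state to persist.

```go
package main

import (
	"fmt"
	"time"
)

// maxSwapDelay is one check interval (to first observe the exceeded
// state) plus TTL (for the state to persist before the swap).
func maxSwapDelay(ttl time.Duration) time.Duration {
	return min(time.Minute, ttl) + ttl
}

func main() {
	fmt.Println(maxSwapDelay(5 * time.Second))  // 10s   -> 2*TTL when TTL < 1m
	fmt.Println(maxSwapDelay(30 * time.Minute)) // 31m0s -> TTL+1m when TTL >= 1m
}
```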
Oh I see, I didn't realize the size is only recalculated every 1m. Then I think the current implementation is fine.
I was mostly afraid of the edge case where TTL is set to a low value like 5s and the DB is running close to the max disk limit: even if compaction is making progress, we might perform a DB swap every other minute if we are unlucky and check twice in a row while the DB is full.
I'm actually afraid of the situation where TTL is way lower than a minute, e.g. 5s, causing the storage limit check to run repeatedly against the same 1m-cached reported DB size. I have the min() so that the test doesn't take a minute to run, but this edge case worries me a bit.
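For context, a sketch of what such a limit check looks like against the badger API; `storageLimit` is an assumed parameter, and `DB.Size()` is the source of the cached, ~1m-lagged figures being discussed:

```go
package storage

import badger "github.com/dgraph-io/badger/v2"

// overLimit reports whether badger's reported on-disk size exceeds the
// limit. DB.Size() returns cached values that badger refreshes roughly
// once a minute, which is the 1m lag discussed above; a sub-minute TTL
// therefore re-checks the same cached number several times.
func overLimit(db *badger.DB, storageLimit int64) bool {
	lsm, vlog := db.Size()
	return lsm+vlog >= storageLimit
}
```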
Fixed in 0b3e742, but TestDropLoop tests will take 1m to run.
Overall the changes look good to me, thank you @carsonip!
Force-pushed from 8e65516 to 9605745, then from 9605745 to 1e4d5b3.
LGTM! Thanks for the changes!
Backport commits (each repeats the summary below; cherry picked from commit a902d3c):
- TBS: drop and recreate badger db after exceeding storage limit for TTL time (#15106) (#15168) Co-authored-by: Carson Ip <[email protected]>
- TBS: drop and recreate badger db after exceeding storage limit for TTL time (#15106) (#15167) Co-authored-by: Carson Ip <[email protected]> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
- TBS: drop and recreate badger db after exceeding storage limit for TTL time (#15106) (#15169) Co-authored-by: Carson Ip <[email protected]> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Already tested for 8.17.1 and 8.16.3
Motivation/summary
Workaround for cases where apm-server is stuck in the "storage limit exceeded" state indefinitely because badger DB compaction conditions are not satisfied. This PR implements a goroutine that detects this state and, if the state persists for at least TTL time (by which point the entries in badger DB would have expired), drops and recreates the DB to get out of this state.
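A minimal sketch of what the "drop and recreate" step could look like; `dir` and `opts` are assumptions, and the PR's actual implementation may differ (e.g. it could use badger's DropAll API rather than removing the directory):

```go
package storage

import (
	"os"

	badger "github.com/dgraph-io/badger/v2"
)

// dropAndRecreate closes the DB, deletes its files on disk, and opens a
// fresh one in their place.
func dropAndRecreate(db *badger.DB, dir string, opts badger.Options) (*badger.DB, error) {
	if err := db.Close(); err != nil {
		return nil, err
	}
	if err := os.RemoveAll(dir); err != nil {
		return nil, err
	}
	return badger.Open(opts)
}
```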
Checklist
- [ ] Update CHANGELOG.asciidoc (change will be backported; changelog should be added on docs release)
How to test these changes
2 ways to test:
Related issues
Alternative to #15081
Fixes #14923