
TBS: drop and recreate badger db after exceeding storage limit for TTL time #15106

Merged

Conversation

Member

@carsonip carsonip commented Jan 2, 2025

Motivation/summary

This is a workaround for cases where apm-server is stuck indefinitely in the storage-limit-exceeded state because badger DB compaction conditions are never satisfied. This PR adds a goroutine that detects this state; if the state persists for at least the TTL, all entries in the badger DB will have expired anyway, so the DB is simply dropped and recreated to get out of the stuck state.
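
For illustration, here is a minimal sketch of that loop, assuming hypothetical names (dropLoop, limitExceeded, dropAndRecreate) rather than the PR's actual code:

package eventstorage // hypothetical package name, for illustration only

import (
	"context"
	"time"
)

// dropLoop periodically checks whether the storage limit is still exceeded and,
// once that state has persisted for at least ttl, drops and recreates the
// badger DB via the supplied callback.
func dropLoop(ctx context.Context, ttl time.Duration, limitExceeded func() bool, dropAndRecreate func() error) {
	// Evaluate the DB size roughly once a minute (badger reports it with ~1m
	// lag); min() keeps tests with a very short TTL from waiting a full minute.
	ticker := time.NewTicker(min(time.Minute, ttl))
	defer ticker.Stop()

	var firstExceeded time.Time
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if !limitExceeded() {
				firstExceeded = time.Time{} // limit no longer exceeded: reset
				continue
			}
			if firstExceeded.IsZero() {
				firstExceeded = time.Now()
			}
			// Entries written before the limit was first hit have expired by
			// now, so dropping the DB loses nothing TBS still needs.
			if time.Since(firstExceeded) >= ttl {
				_ = dropAndRecreate() // error handling omitted in this sketch
				firstExceeded = time.Time{}
			}
		}
	}
}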

Checklist

- [ ] Update CHANGELOG.asciidoc (change will be backported; changelog should be added on docs release)

  • Documentation has been updated

For functional changes, consider:

  • Is it observable through the addition of either logging or metrics?
  • Is its use being published in telemetry to enable product improvement?
  • Have system tests been added to avoid regression?

How to test these changes

2 ways to test:

  1. Start apm-server with TBS enabled, use apmsoak to send many small events (<1 KB each), and confirm it is affected by #14923 (TBS: apm-server never recovers from storage limit exceeded in rare cases). Wait for the TTL and ensure that the DB is deleted and recreated and that writes to badger resume.
  2. Manually bloat the badger DB with either APM events or other irrelevant data, start apm-server, and ensure that the DB is deleted and recreated and that writes to badger resume (a sketch for bloating the DB follows below).
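
For option 2, a rough sketch of how the DB could be bloated offline, assuming apm-server's TBS storage uses badger v2 and that apm-server is stopped while this runs; the path, key scheme, and sizes below are made-up placeholders:

package main

import (
	"crypto/rand"
	"encoding/binary"
	"log"

	badger "github.com/dgraph-io/badger/v2"
)

func main() {
	const dbPath = "./data/tail_sampling" // placeholder: point at the TBS badger directory
	db, err := badger.Open(badger.DefaultOptions(dbPath))
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	value := make([]byte, 1<<20) // 1 MiB of random (incompressible) junk per entry
	for i := 0; i < 5000; i++ {  // ~5 GiB total; size this above the configured storage limit
		if _, err := rand.Read(value); err != nil {
			log.Fatal(err)
		}
		key := make([]byte, 4)
		binary.BigEndian.PutUint32(key, uint32(i))
		if err := db.Update(func(txn *badger.Txn) error {
			return txn.Set(key, value)
		}); err != nil {
			log.Fatal(err)
		}
	}
}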

Related issues

Alternative to #15081

Fixes #14923

Contributor

mergify bot commented Jan 2, 2025

This pull request does not have a backport label. Could you fix it @carsonip? 🙏
To fix up this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-7.17 is the label to automatically backport to the 7.17 branch.
  • backport-8.\d is the label to automatically backport to the 8.\d branch (\d is the minor version digit).
  • backport-8.x is the label to automatically backport to the 8.x branch.

Contributor

mergify bot commented Jan 2, 2025

backport-8.x has been added to help with the transition to the new branch 8.x.
If you don't need it, please use the backport-skip label.

@mergify mergify bot added the backport-8.x (Automated backport to the 8.x branch with mergify) label Jan 2, 2025
@carsonip carsonip added the backport-8.16 and backport-8.17 (Automated backport with mergify) labels Jan 2, 2025
@carsonip carsonip changed the title from "TBS: drop badger db if storage limit is reached for at least TTL" to "TBS: drop and recreate badger db if storage limit exceeded for at least TTL time" Jan 2, 2025
Member Author

@carsonip carsonip left a comment


When the code is mostly ready, I plan to split this PR into two: a refactoring PR with no behavior changes, and a bugfix PR with the actual "drop and recreate" logic.

@carsonip carsonip requested a review from 1pkg January 2, 2025 23:48
return nil
}

timer := time.NewTicker(min(time.Minute, ttl)) // Eval DB size every minute as badger reports it with a 1m lag, but use min to facilitate testing
Member


Getting the DB size should be a relatively cheap operation, so we could consider making the ticker more frequent. This would prevent a configuration with a low, sub-minute TTL from swapping the DB frequently after just two limit hits in a row.

Member Author


I'm not sure I understand here.

This would prevent a configuration with a low, sub-minute TTL from swapping the DB frequently after just two limit hits in a row.

If TTL < 1m, the DB will be swapped after 2*TTL. (I actually consider this pathological, as badger DB size reporting comes with a 1m delay.)
If TTL >= 1m, the DB will be swapped after at most TTL + 1m.

Do you think we should fix the TTL < 1m case? It would increase test run time to 1m, but it would fix the frequent DB swapping for sub-minute TTLs (which I don't think have much value in reality). I'm inclined to fix it.
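
To spell out those bounds, assuming the loop ticks at min(1m, TTL) and drops the DB once the exceeded state has been observed for a full TTL (hypothetical helper, not code from the PR):

// Worst case: the state must persist for a full TTL after it is first observed,
// plus up to one tick interval before that first observation happens.
func worstCaseSwapDelay(ttl, tick time.Duration) time.Duration {
	return ttl + tick
}

// With tick = min(1m, TTL):
//   TTL = 30s -> 30s + 30s = 1m   (i.e. 2*TTL)
//   TTL = 10m -> 10m + 1m  = 11m  (i.e. TTL + 1m)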

Member


Oh I see, I didn't realize the size is only recalculated every 1m. Then I think the current implementation is fine.
I was mostly afraid of the edge case where the TTL is set to a low value like 5s and the DB is running close to the max disk limit; then, even if compaction is making progress, we might swap the DB every other minute if we are unlucky and check twice in a row while the DB is full.

Member Author


I'm actually afraid of the situation where the TTL is way lower than a minute, e.g. 5s, causing the storage limit check to be done repeatedly against the same 1m-cached reported DB size. I added the min() so that the test doesn't take a minute to run, but this edge case worries me a bit.

Member Author


Fixed in 0b3e742, but the TestDropLoop tests will now take 1m to run.

1pkg previously approved these changes Jan 3, 2025
Member

@1pkg 1pkg left a comment


Overall the changes look good to me, thank you @carsonip!

Member Author

carsonip commented Jan 3, 2025

Will rebase this PR after refactoring PR #15112 is merged. (Done.)

@carsonip carsonip force-pushed the tbs-recover-from-storage-limit-exceeded-rm-db branch from 8e65516 to 9605745 on January 3, 2025 19:54
@carsonip carsonip changed the title from "TBS: drop and recreate badger db if storage limit exceeded for at least TTL time" to "TBS: drop and recreate badger db after exceeding storage limit for TTL time" Jan 3, 2025
@carsonip carsonip force-pushed the tbs-recover-from-storage-limit-exceeded-rm-db branch from 9605745 to 1e4d5b3 on January 3, 2025 19:56
@carsonip carsonip enabled auto-merge (squash) January 7, 2025 18:01
lahsivjar previously approved these changes Jan 7, 2025
Contributor

@lahsivjar lahsivjar left a comment


LGTM! Thanks for the changes!

1pkg previously approved these changes Jan 7, 2025
@carsonip carsonip disabled auto-merge January 7, 2025 21:05
@carsonip carsonip dismissed stale reviews from lahsivjar and 1pkg via 9d15913 January 7, 2025 21:21
@carsonip carsonip requested a review from 1pkg January 7, 2025 21:21
@carsonip carsonip enabled auto-merge (squash) January 7, 2025 21:43
@carsonip carsonip merged commit a902d3c into elastic:main Jan 7, 2025
11 checks passed
mergify bot pushed and added commits referencing this pull request on Jan 7, 2025, cherry-picking a902d3c into the backport pull requests #15167, #15168, and #15169.
Member

endorama commented Feb 3, 2025

Already tested on 8.17.1 and 8.16.3.

Labels

  • backport-8.x: Automated backport to the 8.x branch with mergify
  • backport-8.16: Automated backport with mergify
  • backport-8.17: Automated backport with mergify