
TBS: drop and recreate badger db after exceeding storage limit for TTL time #15106

Merged

Conversation

Member

@carsonip carsonip commented Jan 2, 2025

Motivation/summary

This is a workaround for cases where apm-server is stuck indefinitely in the storage-limit-exceeded state because badger DB compaction conditions are never satisfied. This PR adds a goroutine that detects this state; if the state persists for at least the TTL, all entries in the badger DB will have expired anyway, so the DB is simply dropped and recreated to get out of the stuck state.
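
For illustration, here is a minimal sketch of that loop, assuming hypothetical names (dropLoop, limitExceeded, dropAndRecreate) rather than the PR's actual code:

package eventstorage // hypothetical package name, for illustration only

import (
	"context"
	"time"
)

// dropLoop periodically checks whether the storage limit is still exceeded and,
// once that state has persisted for at least ttl, drops and recreates the
// badger DB via the supplied callback.
func dropLoop(ctx context.Context, ttl time.Duration, limitExceeded func() bool, dropAndRecreate func() error) {
	// Evaluate the DB size roughly once a minute (badger reports it with ~1m
	// lag); min() keeps tests with a very short TTL from waiting a full minute.
	ticker := time.NewTicker(min(time.Minute, ttl))
	defer ticker.Stop()

	var firstExceeded time.Time
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if !limitExceeded() {
				firstExceeded = time.Time{} // limit no longer exceeded: reset
				continue
			}
			if firstExceeded.IsZero() {
				firstExceeded = time.Now()
			}
			// Entries written before the limit was first hit have expired by
			// now, so dropping the DB loses nothing TBS still needs.
			if time.Since(firstExceeded) >= ttl {
				_ = dropAndRecreate() // error handling omitted in this sketch
				firstExceeded = time.Time{}
			}
		}
	}
}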

Checklist

- [ ] Update CHANGELOG.asciidoc (change will be backported; changelog should be added on docs release)

  • Documentation has been updated

For functional changes, consider:

  • Is it observable through the addition of either logging or metrics?
  • Is its use being published in telemetry to enable product improvement?
  • Have system tests been added to avoid regression?

How to test these changes

2 ways to test:

  1. Start apm-server with TBS enabled, use apmsoak to send many small events (<1 KB each), and confirm it is affected by #14923 (TBS: apm-server never recovers from storage limit exceeded in rare cases). Wait for the TTL and ensure that the DB is deleted and recreated and that writes to badger resume.
  2. Manually bloat the badger DB with either APM events or other irrelevant data, start apm-server, and ensure that the DB is deleted and recreated and that writes to badger resume (a sketch for bloating the DB follows below).
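
For option 2, a rough sketch of how the DB could be bloated offline, assuming apm-server's TBS storage uses badger v2 and that apm-server is stopped while this runs; the path, key scheme, and sizes below are made-up placeholders:

package main

import (
	"crypto/rand"
	"encoding/binary"
	"log"

	badger "github.com/dgraph-io/badger/v2"
)

func main() {
	const dbPath = "./data/tail_sampling" // placeholder: point at the TBS badger directory
	db, err := badger.Open(badger.DefaultOptions(dbPath))
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	value := make([]byte, 1<<20) // 1 MiB of random (incompressible) junk per entry
	for i := 0; i < 5000; i++ {  // ~5 GiB total; size this above the configured storage limit
		if _, err := rand.Read(value); err != nil {
			log.Fatal(err)
		}
		key := make([]byte, 4)
		binary.BigEndian.PutUint32(key, uint32(i))
		if err := db.Update(func(txn *badger.Txn) error {
			return txn.Set(key, value)
		}); err != nil {
			log.Fatal(err)
		}
	}
}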

Related issues

Alternative to #15081

Fixes #14923

Contributor

mergify bot commented Jan 2, 2025

This pull request does not have a backport label. Could you fix it @carsonip? 🙏
To fix up this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-7.17 is the label to automatically backport to the 7.17 branch.
  • backport-8.\d is the label to automatically backport to the 8.\d branch (\d is the minor version digit).
  • backport-8.x is the label to automatically backport to the 8.x branch.

Contributor

mergify bot commented Jan 2, 2025

backport-8.x has been added to help with the transition to the new branch 8.x.
If you don't need it, please use the backport-skip label.

@mergify mergify bot added the backport-8.x (Automated backport to the 8.x branch with mergify) label Jan 2, 2025
@carsonip carsonip added the backport-8.16 and backport-8.17 (Automated backport with mergify) labels Jan 2, 2025
@carsonip carsonip changed the title from "TBS: drop badger db if storage limit is reached for at least TTL" to "TBS: drop and recreate badger db if storage limit exceeded for at least TTL time" Jan 2, 2025
Member Author

@carsonip carsonip left a comment


When the code is mostly ready, I plan to split this PR into two: a refactoring PR with no behavior changes, and a bugfix PR with the actual "drop and recreate" logic.

@carsonip carsonip requested a review from 1pkg January 2, 2025 23:48
return nil
}

timer := time.NewTicker(min(time.Minute, ttl)) // Eval DB size every minute as badger reports it with a 1m lag, but use min to facilitate testing
Member


Getting the DB size should be a relatively cheap operation, so we could consider making the ticker more frequent. This would prevent a configuration with a low, sub-minute TTL from swapping the DB frequently after just two limit hits in a row.

Member Author


I'm not sure I understand here.

This would prevent a configuration with a low, sub-minute TTL from swapping the DB frequently after just two limit hits in a row.

If TTL < 1m, the DB will be swapped after 2*TTL. (I actually consider this pathological, as badger DB size reporting comes with a 1m delay.)
If TTL >= 1m, the DB will be swapped after at most TTL + 1m.

Do you think we should fix the TTL < 1m case? It would increase test run time to 1m, but it would fix the frequent DB swapping for sub-minute TTLs (which I don't think have much value in reality). I'm inclined to fix it.
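
To spell out those bounds, assuming the loop ticks at min(1m, TTL) and drops the DB once the exceeded state has been observed for a full TTL (hypothetical helper, not code from the PR):

// Worst case: the state must persist for a full TTL after it is first observed,
// plus up to one tick interval before that first observation happens.
func worstCaseSwapDelay(ttl, tick time.Duration) time.Duration {
	return ttl + tick
}

// With tick = min(1m, TTL):
//   TTL = 30s -> 30s + 30s = 1m   (i.e. 2*TTL)
//   TTL = 10m -> 10m + 1m  = 11m  (i.e. TTL + 1m)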

Member


Oh I see, I didn't realize the size is only recalculated every 1m. Then I think the current implementation is fine.
I was mostly afraid of the edge case where the TTL is set to a low value like 5s and the DB is running close to the max disk limit; then, even if compaction is making progress, we might swap the DB every other minute if we are unlucky and check twice in a row while the DB is full.

Member Author


I'm actually afraid of the situation where the TTL is way lower than a minute, e.g. 5s, causing the storage limit check to be done repeatedly against the same 1m-cached reported DB size. I added the min() so that the test doesn't take a minute to run, but this edge case worries me a bit.

Member Author


Fixed in 0b3e742, but the TestDropLoop tests will now take 1m to run.

1pkg previously approved these changes Jan 3, 2025
Member

@1pkg 1pkg left a comment


Overall the changes look good to me, thank you @carsonip!

Member Author

carsonip commented Jan 3, 2025

Will rebase this PR after refactoring PR #15112 is merged. (Done.)

@carsonip carsonip force-pushed the tbs-recover-from-storage-limit-exceeded-rm-db branch from 8e65516 to 9605745 on January 3, 2025 19:54
@carsonip carsonip changed the title from "TBS: drop and recreate badger db if storage limit exceeded for at least TTL time" to "TBS: drop and recreate badger db after exceeding storage limit for TTL time" Jan 3, 2025
@carsonip carsonip force-pushed the tbs-recover-from-storage-limit-exceeded-rm-db branch from 9605745 to 1e4d5b3 on January 3, 2025 19:56
@carsonip carsonip enabled auto-merge (squash) January 7, 2025 18:01
lahsivjar previously approved these changes Jan 7, 2025
Contributor

@lahsivjar lahsivjar left a comment


LGTM! Thanks for the changes!

1pkg previously approved these changes Jan 7, 2025
@carsonip carsonip disabled auto-merge January 7, 2025 21:05
@carsonip carsonip dismissed stale reviews from lahsivjar and 1pkg via 9d15913 January 7, 2025 21:21
@carsonip carsonip requested a review from 1pkg January 7, 2025 21:21
@carsonip carsonip enabled auto-merge (squash) January 7, 2025 21:43
@carsonip carsonip merged commit a902d3c into elastic:main Jan 7, 2025
11 checks passed
mergify bot pushed and added commits referencing this pull request on Jan 7, 2025, cherry-picking a902d3c into the backport pull requests #15167, #15168, and #15169.
Member

endorama commented Feb 3, 2025

Already tested on 8.17.1 and 8.16.3.

Labels

  • backport-8.x: Automated backport to the 8.x branch with mergify
  • backport-8.16: Automated backport with mergify
  • backport-8.17: Automated backport with mergify