-
Notifications
You must be signed in to change notification settings - Fork 529
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[8.x] TBS: Replace badger with pebble (backport #15235) #15452
base: 8.x
Are you sure you want to change the base?
Conversation
This PR replaces badger with pebble as the database for tail-based sampling. Significant performance gains. The database of choice is Pebble, which does not have TTL handling built-in, and we implement our own TTL handling on top of the database: - TTL is divided up into N parts, where N is partitionsPerTTL. - A database holds N + 1 + 1 partitions. - Every TTL/N we will discard the oldest partition, so we keep a rolling window of N+1 partitions. - Writes will go to the most recent partition, and we'll read across N+1 partitions (cherry picked from commit 0ca58b8) # Conflicts: # go.mod # go.sum # internal/beater/monitoringtest/opentelemetry.go # x-pack/apm-server/main.go # x-pack/apm-server/main_test.go # x-pack/apm-server/sampling/processor.go # x-pack/apm-server/sampling/processor_bench_test.go # x-pack/apm-server/sampling/processor_test.go
Cherry-pick of 0ca58b8 has failed:
To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally |
There's a lot of conflicts as #15094 was not backported to 8.x, but they should be fixed now. |
run docs-build |
This pull request is now in conflicts. Could you fix it @mergify[bot]? 🙏
|
This pull request has not been merged yet. Could you please review and merge it @carsonip? 🙏 |
Motivation/summary
This PR replaces badger with pebble as the database for tail-based sampling.
Benchmarks
TLDR: +3000% in indexed events/s, +76% intakev2 event rate, while on -75% memory usage and -44% disk usage
See comment for details.
Major design changes
2*TTL
. In fact, there's a knobpartitionsPerTTL
to adjust the available prefixes to trade between storage overhead and read amplification. e.g. partitionsPerTTL=1 keeps2*TTL
entries with 2 partition reads per key read, while partitionsPerTTL=2 keeps1.5*TTL
entries with 3 partition reads per key read.Other implied changes
sampling.tail.storage.lsm_size
, whilesampling.tail.storage.vlog_size
is always 0. The change is not decided yet.TODO:
sampling.tail.storage_limit
and storage limit handling #14933Check for memory leak by running it for a long time, as pebble does manual memory managementevent loss on upgrade:see Migration path from badger to pebble #15423Useful but not necessary, out of scope of this PR:
testing/infra/terraform/modules/standalone_apm_server
for reproducible benchmark with moxy and without using ESS: terraform: support TBS in standalone_apm_server #15337Checklist
For functional changes, consider:
How to test these changes
Enable TBS, try various sampling policies, send events, keep it running for over 2 * TTL, ensure that disk usage is bounded, and memory usage is expected.
Related issues
Fixes #15246
This is an automatic backport of pull request #15235 done by [Mergify](https://mergify.com).