Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TBS: set default sampling.tail.storage_limit to 0 but limit disk usage to 90% #15467

Merged
merged 51 commits into from
Jan 31, 2025

Conversation

carsonip
Copy link
Member

@carsonip carsonip commented Jan 30, 2025

Motivation/summary

This is a breaking change to the default storage_limit to enable a more user-friendly TBS disk usage handling. This new default will automatically scale with a larger disk.

Change sampling.tail.storage_limit default to 0.
While 0 means unlimited local tail-sampling database size,
it now enforces a max 90% disk usage on the disk where the data directory is located.
Any tail sampling writes after this threshold will be rejected,
similar to what happens when tail-sampling database size exceeds a non-0 storage limit.
Setting sampling.tail.storage_limit to non-0 maintains the existing behavior
which limits the tail-sampling database size to sampling.tail.storage_limit
and does not have the new disk usage threshold check.

Checklist

For functional changes, consider:

  • Is it observable through the addition of either logging or metrics?
  • Is its use being published in telemetry to enable product improvement?
  • Have system tests been added to avoid regression?

How to test these changes

Create a tmpfs with various sizes, check for logs as disk threshold is hit.

Related issues

Part of #15450
EA-managed apm-server needs elastic/integrations#12543 to change the default.

Copy link
Contributor

mergify bot commented Jan 30, 2025

This pull request does not have a backport label. Could you fix it @carsonip? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-7.17 is the label to automatically backport to the 7.17 branch.
  • backport-8./d is the label to automatically backport to the 8./d branch. /d is the digit.
  • backport-8.x is the label to automatically backport to the 8.x branch.

Copy link
Contributor

mergify bot commented Jan 30, 2025

backport-8.x has been added to help with the transition to the new branch 8.x.
If you don't need it please use backport-skip label.

@mergify mergify bot added the backport-8.x Automated backport to the 8.x branch with mergify label Jan 30, 2025
@carsonip carsonip added backport-9.0 Automated backport to the 9.0 branch and removed backport-8.x Automated backport to the 8.x branch with mergify labels Jan 30, 2025
Copy link
Contributor

mergify bot commented Jan 30, 2025

backport-8.x has been added to help with the transition to the new branch 8.x.
If you don't need it please use backport-skip label.

@mergify mergify bot added the backport-8.x Automated backport to the 8.x branch with mergify label Jan 30, 2025
@carsonip carsonip changed the title [WIP] TBS: disk threshold TBS: stop writing after disk_threshold and remove default storage_limit Jan 30, 2025
@carsonip carsonip marked this pull request as ready for review January 30, 2025 16:19
@carsonip carsonip requested a review from a team as a code owner January 30, 2025 16:19
@carsonip carsonip removed the backport-8.x Automated backport to the 8.x branch with mergify label Jan 30, 2025
Copy link
Contributor

mergify bot commented Jan 30, 2025

backport-8.x has been added to help with the transition to the new branch 8.x.
If you don't need it please use backport-skip label.

@mergify mergify bot added the backport-8.x Automated backport to the 8.x branch with mergify label Jan 30, 2025

usage, err := vfs.Default.GetDiskUsage(sm.storageDir)
if err != nil {
sm.rateLimitedLogger.With(logp.Error(err)).Warn("failed to get disk usage")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the fallback in this case?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the current implementation it effectively becomes unlimited. It is by design because of how they are 2 separate checkers and RW. But I think it is possible to overwrite dbStorageLimit to a fallback value if we get an error, so that storage_limit checker takes over.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added fallback handling. A bit complex, but should cover all cases. All of it is written with the assumption that if GetDiskUsage ever returns an error, it will always return an error.

@carsonip carsonip removed the backport-8.x Automated backport to the 8.x branch with mergify label Jan 30, 2025
Copy link
Contributor

mergify bot commented Jan 30, 2025

backport-8.x has been added to help with the transition to the new branch 8.x.
If you don't need it please use backport-skip label.

@mergify mergify bot added the backport-8.x Automated backport to the 8.x branch with mergify label Jan 30, 2025
carsonip added a commit that referenced this pull request Jan 31, 2025
Fix a missing colon in logs (typo from #15235 ), and remove "storage" in "configured storage limit reached" message to make way for #15467 to avoid confusion
mergify bot pushed a commit that referenced this pull request Jan 31, 2025
Fix a missing colon in logs (typo from #15235 ), and remove "storage" in "configured storage limit reached" message to make way for #15467 to avoid confusion

(cherry picked from commit 28068bd)
mergify bot pushed a commit that referenced this pull request Jan 31, 2025
Fix a missing colon in logs (typo from #15235 ), and remove "storage" in "configured storage limit reached" message to make way for #15467 to avoid confusion

(cherry picked from commit 28068bd)

# Conflicts:
#	x-pack/apm-server/sampling/eventstorage/rw.go
simitt
simitt previously approved these changes Jan 31, 2025
lahsivjar
lahsivjar previously approved these changes Jan 31, 2025
Copy link
Contributor

@lahsivjar lahsivjar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

usage, err := sm.getDiskUsage()
if err != nil {
sm.logger.With(logp.Error(err)).Warn("failed to get disk usage")
sm.getDiskUsageFailed.Store(true)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: While I can't think of what could lead to transient failures, I am not sure about always failing on error bit - something to think about in a future PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This gives me a headache as well. What should existing disk usage threshold checks perform when getDiskUsage has transient failures, should it use a stale number, or should it become unlimited? With this assumption of always failing, it simplifies the implementation.

@carsonip carsonip dismissed stale reviews from lahsivjar and simitt via 7bcfc60 January 31, 2025 11:46
@carsonip carsonip requested review from simitt and lahsivjar January 31, 2025 11:53
lahsivjar
lahsivjar previously approved these changes Jan 31, 2025
@carsonip carsonip requested a review from lahsivjar January 31, 2025 12:39
@carsonip carsonip changed the title TBS: stop writing after disk_threshold_ratio and remove default storage_limit TBS: set default sampling.tail.storage_limit to 0 but limit disk usage to 90% Jan 31, 2025
@carsonip carsonip enabled auto-merge (squash) January 31, 2025 12:54
@carsonip carsonip merged commit d019277 into elastic:main Jan 31, 2025
16 checks passed
mergify bot pushed a commit that referenced this pull request Jan 31, 2025
…age to 90% (#15467)

This is a breaking change to the default storage_limit to enable a more user-friendly TBS disk usage handling. This new default will automatically scale with a larger disk.

Change sampling.tail.storage_limit default to 0.
While 0 means unlimited local tail-sampling database size,
it now enforces a max 90% disk usage on the disk where the data directory is located.
Any tail sampling writes after this threshold will be rejected,
similar to what happens when tail-sampling database size exceeds a non-0 storage limit.
Setting sampling.tail.storage_limit to non-0 maintains the existing behavior
which limits the tail-sampling database size to sampling.tail.storage_limit
and does not have the new disk usage threshold check.

(cherry picked from commit d019277)

# Conflicts:
#	changelogs/9.0.asciidoc
mergify bot added a commit that referenced this pull request Jan 31, 2025
…isk usage to 90% (backport #15467) (#15501)

* TBS: set default `sampling.tail.storage_limit` to 0 but limit disk usage to 90% (#15467)

This is a breaking change to the default storage_limit to enable a more user-friendly TBS disk usage handling. This new default will automatically scale with a larger disk.

Change sampling.tail.storage_limit default to 0.
While 0 means unlimited local tail-sampling database size,
it now enforces a max 90% disk usage on the disk where the data directory is located.
Any tail sampling writes after this threshold will be rejected,
similar to what happens when tail-sampling database size exceeds a non-0 storage limit.
Setting sampling.tail.storage_limit to non-0 maintains the existing behavior
which limits the tail-sampling database size to sampling.tail.storage_limit
and does not have the new disk usage threshold check.

(cherry picked from commit d019277)

# Conflicts:
#	changelogs/9.0.asciidoc

* Fix changelog

---------

Co-authored-by: Carson Ip <[email protected]>
Co-authored-by: Carson Ip <[email protected]>
mergify bot added a commit that referenced this pull request Jan 31, 2025
Fix a missing colon in logs (typo from #15235 ), and remove "storage" in "configured storage limit reached" message to make way for #15467 to avoid confusion

(cherry picked from commit 28068bd)

Co-authored-by: Carson Ip <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
carsonip added a commit to elastic/integrations that referenced this pull request Feb 3, 2025
sampling.tail.storage_limit is 0 by default in 9.0. See elastic/apm-server#15467 .
As UI validation requires unit (e.g. GB), set apm integration default storage limit to 0GB which carries the same meaning.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-9.0 Automated backport to the 9.0 branch
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants