Handle custom histogram buckets #126

Open
keturiosakys opened this issue Aug 24, 2023 · 3 comments
@keturiosakys
Member

Some of the Autometrics libraries (Rust, Go) already support customizing the histogram buckets, and the others are likely to follow suit. One implication is that if a library produces metrics with non-default histogram buckets, the Autometrics alerting ruleset won't work. To make these cases easy to handle, am should be able to generate and apply custom rulesets.

Ideally am would be able to infer from the code, or detect at runtime, whether custom histogram buckets are used and suggest applying them to the Prometheus config running under the hood. Alternatively, an easier path to implement would be a flag plus an am.toml field that allows setting these manually.
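For illustration, this is roughly what non-default buckets look like when a service registers a histogram through the Rust prometheus client directly (a minimal sketch; the metric name and bucket values here are made up, not what any Autometrics library emits). am has no way of learning these boundaries unless it either inspects the code/runtime or is told about them via a flag or am.toml:

```rust
use prometheus::{Histogram, HistogramOpts, Registry};

fn register_custom_histogram(registry: &Registry) -> prometheus::Result<Histogram> {
    // Non-default bucket boundaries, in seconds. The generated alerting
    // rules would need to know about these to stay meaningful.
    let opts = HistogramOpts::new(
        "handler_duration_seconds", // illustrative name, not the Autometrics metric
        "Duration of handler calls",
    )
    .buckets(vec![0.05, 0.1, 0.15, 0.3, 0.6, 1.2]);

    let histogram = Histogram::with_opts(opts)?;
    registry.register(Box::new(histogram.clone()))?;
    Ok(histogram)
}
```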

@hatchan
Contributor

hatchan commented Aug 24, 2023

This can be picked up together with #22

@gagbo
Member

gagbo commented Aug 24, 2023

Ideally am would be able to infer from the code, or detect at runtime, whether custom histogram buckets are used and suggest applying them to the Prometheus config running under the hood.

I think runtime detection is safer, because sometimes you won't be able to infer the buckets from the code (if an environment variable controls the buckets).

Also, technically the alerting ruleset will still work if you only change the buckets of the histogram; it will just be less accurate (meaning the alert might not trigger at exactly 90% or whatever the SLO success rate is). What will break the alerting rules completely (as in "no alert would trigger at all") is changing the percentile objective to something unsupported, i.e. anything outside 90%, 95%, 99%, and 99.9% iirc. It's still something we want to tackle for sure anyway!
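For reference, the Rust library already pins the objective percentile (and the latency threshold) to the supported values at the type level. This is a sketch from memory of the autometrics-rs objectives API, so the exact names and variants may differ:

```rust
use autometrics::autometrics;
use autometrics::objectives::{Objective, ObjectiveLatency, ObjectivePercentile};

// ObjectivePercentile is an enum (P90 / P95 / P99 / P99_9), so an
// unsupported objective such as 97% cannot even be expressed, and
// ObjectiveLatency only offers thresholds that sit on the default
// histogram bucket boundaries.
const API_SLO: Objective = Objective::new("api")
    .success_rate(ObjectivePercentile::P99)
    .latency(ObjectiveLatency::Ms250, ObjectivePercentile::P95);

#[autometrics(objective = API_SLO)]
fn handle_request() {
    // handler body elided
}
```

Once the buckets themselves become configurable, that alignment between latency thresholds and bucket boundaries is no longer guaranteed, which is part of what makes this issue tricky.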

@IvanMerrill

I would strongly suggest adding a way to enforce that a bucket boundary exists at the SLO latency target. The loss of accuracy that @gagbo describes could be a big deal. For example:

SLO target is 95% at <= 150ms
A bucket exists for >100ms & <200ms
All responses take 110ms
The 95th percentile would be calculated as 195ms, and an alert would trigger.

Without enforcing the boundaries to match the SLO target it's easy for people to create a situation where their well-behaved code would permanently trigger an alert.
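To make the arithmetic behind that 195ms explicit: Prometheus's histogram_quantile() interpolates linearly inside the bucket the quantile falls into, so with every observation landing in the (100ms, 200ms] bucket the p95 estimate ends up 95% of the way across it. A minimal sketch of that calculation (the request count is made up for illustration):

```rust
// Linear interpolation as done by histogram_quantile(): within a bucket,
// the quantile is assumed to fall proportionally between its bounds.
fn interpolated_quantile(lower_ms: f64, upper_ms: f64, rank: f64, count_in_bucket: f64) -> f64 {
    lower_ms + (upper_ms - lower_ms) * (rank / count_in_bucket)
}

fn main() {
    // All 1000 observations took ~110ms, so they all land in the (100, 200] ms bucket.
    let count_in_bucket = 1000.0;
    let rank = 0.95 * count_in_bucket; // rank of the p95 observation within that bucket
    let p95 = interpolated_quantile(100.0, 200.0, rank, count_in_bucket);
    println!("estimated p95 = {p95} ms"); // prints 195, even though no request exceeded 110ms
}
```

With a boundary at 150ms, those same observations would all fall into a (100, 150] bucket and the estimate could never exceed the SLO target, so the alert wouldn't fire.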
