Handle custom histogram buckets #126

Open
keturiosakys opened this issue Aug 24, 2023 · 3 comments
@keturiosakys
Member

Some of the Autometrics libraries (Rust, Go) already support customizing the histogram buckets, and the others are likely to follow suit. One implication is that if a library produces metrics with non-default histogram buckets, the Autometrics alerting ruleset won't work. To make these cases easy to handle, am should be able to generate and apply custom rulesets.

Ideally am would be able to infer from the code, or detect at runtime, whether custom histogram buckets are used and suggest applying them to the Prometheus config running under the hood. Alternatively, an easier path to implement would be a flag plus an am.toml field that allows setting these manually.
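For illustration, this is roughly what non-default buckets look like when a service registers a histogram through the Rust prometheus client directly (a minimal sketch; the metric name and bucket values here are made up, not what any Autometrics library emits). am has no way of learning these boundaries unless it either inspects the code/runtime or is told about them via a flag or am.toml:

```rust
use prometheus::{Histogram, HistogramOpts, Registry};

fn register_custom_histogram(registry: &Registry) -> prometheus::Result<Histogram> {
    // Non-default bucket boundaries, in seconds. The generated alerting
    // rules would need to know about these to stay meaningful.
    let opts = HistogramOpts::new(
        "handler_duration_seconds", // illustrative name, not the Autometrics metric
        "Duration of handler calls",
    )
    .buckets(vec![0.05, 0.1, 0.15, 0.3, 0.6, 1.2]);

    let histogram = Histogram::with_opts(opts)?;
    registry.register(Box::new(histogram.clone()))?;
    Ok(histogram)
}
```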

@hatchan
Contributor

hatchan commented Aug 24, 2023

This can be picked up together with #22

@gagbo
Member

gagbo commented Aug 24, 2023

Ideally am would be able to infer from the code, or detect at runtime, whether custom histogram buckets are used and suggest applying them to the Prometheus config running under the hood.

I think runtime detection is safer, because sometimes you won't be able to infer the buckets from the code (if an environment variable controls the buckets).

Also, technically the alerting ruleset will still work if you only change the buckets of the histogram; it will just be less accurate (meaning the alert might not trigger at exactly 90% or whatever the SLO success rate is). What will break the alerting rules completely (as in "no alert would trigger at all") is changing the percentile objective to something unsupported, i.e. anything outside 90%, 95%, 99%, and 99.9% iirc. It's still something we want to tackle for sure anyway!
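For reference, the Rust library already pins the objective percentile (and the latency threshold) to the supported values at the type level. This is a sketch from memory of the autometrics-rs objectives API, so the exact names and variants may differ:

```rust
use autometrics::autometrics;
use autometrics::objectives::{Objective, ObjectiveLatency, ObjectivePercentile};

// ObjectivePercentile is an enum (P90 / P95 / P99 / P99_9), so an
// unsupported objective such as 97% cannot even be expressed, and
// ObjectiveLatency only offers thresholds that sit on the default
// histogram bucket boundaries.
const API_SLO: Objective = Objective::new("api")
    .success_rate(ObjectivePercentile::P99)
    .latency(ObjectiveLatency::Ms250, ObjectivePercentile::P95);

#[autometrics(objective = API_SLO)]
fn handle_request() {
    // handler body elided
}
```

Once the buckets themselves become configurable, that alignment between latency thresholds and bucket boundaries is no longer guaranteed, which is part of what makes this issue tricky.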

@IvanMerrill

I would strongly suggest adding a way to enforce that a bucket boundary exists at the SLO latency target. The loss of accuracy that @gagbo describes could be a big deal. For example:

SLO target is 95% at <= 150ms
A bucket exists for >100ms & <200ms
All responses take 110ms
The 95th percentile would be calculated as 195ms, and an alert would trigger.

Without enforcing the boundaries to match the SLO target it's easy for people to create a situation where their well-behaved code would permanently trigger an alert.
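To make the arithmetic behind that 195ms explicit: Prometheus's histogram_quantile() interpolates linearly inside the bucket the quantile falls into, so with every observation landing in the (100ms, 200ms] bucket the p95 estimate ends up 95% of the way across it. A minimal sketch of that calculation (the request count is made up for illustration):

```rust
// Linear interpolation as done by histogram_quantile(): within a bucket,
// the quantile is assumed to fall proportionally between its bounds.
fn interpolated_quantile(lower_ms: f64, upper_ms: f64, rank: f64, count_in_bucket: f64) -> f64 {
    lower_ms + (upper_ms - lower_ms) * (rank / count_in_bucket)
}

fn main() {
    // All 1000 observations took ~110ms, so they all land in the (100, 200] ms bucket.
    let count_in_bucket = 1000.0;
    let rank = 0.95 * count_in_bucket; // rank of the p95 observation within that bucket
    let p95 = interpolated_quantile(100.0, 200.0, rank, count_in_bucket);
    println!("estimated p95 = {p95} ms"); // prints 195, even though no request exceeded 110ms
}
```

With a boundary at 150ms, those same observations would all fall into a (100, 150] bucket and the estimate could never exceed the SLO target, so the alert wouldn't fire.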
