Add active alerts for alternator latencies #2402

Closed
GeoffMontee opened this issue Sep 25, 2024 · 7 comments
Labels
enhancement New feature or request

Comments

@GeoffMontee

Please make sure that this is a feature request.

System information

  • Scylla version (you are using): Any
  • Are you willing to contribute it (Yes/No): Yes

Describe the feature and the current behavior/state.

Scylla Monitoring provides great panels showing Alternator latencies, but it does not currently provide alerts when latencies deviate from the average.

Who will benefit from this feature?

Users who use Alternator and depend on low latencies.

Any Other info.

Here is my alert configuration:

-  name: AlternatorByOpLatencyAboveAverage
   description: alternator operation latency has deviated from average
   severity: 1
   trigger-after: 10m
   expression: group((avg_over_time(scylla_alternator_op_latency_bucket[10m])-avg_over_time(scylla_alternator_op_latency_bucket[1d]))/stddev_over_time(scylla_alternator_op_latency_bucket[1d]) > 25) by (cluster, op)
   team: support
@GeoffMontee added the enhancement label on Sep 25, 2024
@amnonh
Collaborator

amnonh commented Oct 13, 2024

There are a few issues with the suggested alert, the main one being that looking at a bucket like that most likely doesn't do what you expect it to do.

I suggest using a P99 or P95 and setting some hardcoded limit, similar to what we do with CQL.
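
To illustrate why a quantile is more meaningful than averaging a raw bucket series, here is a minimal Python sketch of how a percentile can be estimated from cumulative histogram buckets, roughly following the linear interpolation that PromQL's histogram_quantile applies to classic histograms. The bucket bounds, the rates, and the LIMIT value are made-up numbers rather than data from a real cluster, and the limit is assumed to be in the same unit as the bucket bounds.

```python
import math

def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) pairs."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    if not buckets:
        return math.nan
    buckets = sorted(buckets)          # ascending upper bound; math.inf sorts last
    total = buckets[-1][1]             # the +Inf bucket holds the total count
    if total <= 0:
        return math.nan                # no observations in the window
    rank = q * total
    prev_le, prev_count = 0.0, 0.0     # lowest bucket is assumed to start at 0
    for le, count in buckets:
        if count >= rank:
            if math.isinf(le):
                return prev_le         # quantile lies beyond the last finite bound
            # Linear interpolation inside the bucket that contains the rank.
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count
    return math.nan

# Hypothetical per-second bucket rates for one (cluster, op) pair.
sample = [(1000, 100), (2000, 160), (4000, 180), (8000, 200), (math.inf, 200)]
LIMIT = 8000  # hypothetical hardcoded limit, same unit as the bucket bounds

p95 = histogram_quantile(0.95, sample)
firing = p95 > LIMIT
```

Each bucket value is a cumulative count of requests at or below its bound, so averaging one bucket series over time says little about latency on its own. The quantile only emerges when all buckets of an operation are combined, which is why the alert would compare the estimated P95 or P99 against the limit.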

@GeoffMontee
Author

Hi @amnonh,

Thanks for the reply. I have some follow-up questions:

I inspected the "P95 Latencies" panel on the "Alternator" dashboard, and I see the following expression:

      "expr": "histogram_quantile(0.95, sum(rate(scylla_alternator_op_latency_bucket{instance=~\"[[node]]\",cluster=~\"$cluster|$^\", dc=~\"$dc\", shard=~\"[[shard]]|$^\", op=~\"$ops\"}[$__rate_interval])>0) by (op, le))",

Would a similar calculation be used in an alert like this? Or would there be a better way to do this?

@GeoffMontee
Author

Hi @wpaven @ruthea @ManjotS @pdbossman,

We should probably be specific about what we want to alert on. Something like this?

Alert if the p95 latency for BatchGetItem is greater than 8 ms

Or would someone define it differently?

Or do we want to define alerts for each operation, not just BatchGetItem?

@ManjotS

ManjotS commented Nov 1, 2024

I would rather look for patterns because it will differ per customer and cluster.

Here is my latest iteration

(histogram_quantile(0.95, sum(rate(scylla_alternator_op_latency_bucket{}[10m])>0) by (cluster, op, le))-histogram_quantile(0.95, sum(rate(scylla_alternator_op_latency_bucket{}[1d])>0) by (cluster, op, le)))/histogram_quantile(0.95, sum(rate(scylla_alternator_op_latency_bucket{}[1d])>0) by (cluster, op, le)) > 8
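
The arithmetic of this expression can be sketched in Python as follows. It is only an illustration of the comparison, not the alert itself: the function name and the sample P95 values are hypothetical, and the two inputs stand for the 10-minute and 1-day P95 of a single (cluster, op) pair.

```python
def relative_deviation(recent_p95, baseline_p95):
    """Return (recent - baseline) / baseline, or None without a usable baseline."""
    if baseline_p95 <= 0:
        return None  # the sketch skips pairs with no baseline instead of dividing by zero
    return (recent_p95 - baseline_p95) / baseline_p95

THRESHOLD = 8  # same constant as in the expression above

# Hypothetical P95 values: (10-minute window, 1-day window).
spike = relative_deviation(19000, 2000)    # 8.5
mild = relative_deviation(10000, 2000)     # 4.0

spike_fires = spike is not None and spike > THRESHOLD
mild_fires = mild is not None and mild > THRESHOLD
```

Because the deviation is relative to the baseline, a threshold of 8 means the 10-minute P95 has to exceed nine times the 1-day P95 before the condition holds.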

@ManjotS

ManjotS commented Nov 5, 2024

latest:

((avg(histogram_quantile(0.95, sum(rate(scylla_alternator_op_latency_bucket{}[10m])>0) by (cluster, op, le))) by (cluster))-(avg(histogram_quantile(0.95, sum(rate(scylla_alternator_op_latency_bucket{}[1d])>0) by (cluster, op, le))) by (cluster)))/(stddev(histogram_quantile(0.95, sum(rate(scylla_alternator_op_latency_bucket{}[1d])>0) by (cluster, op, le))) by (cluster)) > 2
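
A rough Python sketch of what this expression computes for one cluster is shown below. The operation names and P95 values are hypothetical. It uses the population standard deviation, which is what PromQL's stddev aggregation computes, and it returns None where the PromQL division would produce a non-finite result.

```python
import math
import statistics

def cluster_score(recent_by_op, baseline_by_op):
    """(mean recent P95 - mean baseline P95) / stddev of baseline P95 across ops."""
    spread = statistics.pstdev(baseline_by_op.values())  # population stddev across ops
    if spread == 0:
        return None  # a single op, or identical baselines: no spread to divide by
    recent_mean = statistics.fmean(recent_by_op.values())
    baseline_mean = statistics.fmean(baseline_by_op.values())
    return (recent_mean - baseline_mean) / spread

# Hypothetical per-operation P95 values for one cluster.
baseline = {"GetItem": 1000, "PutItem": 2000, "Query": 3000}   # 1-day window
recent = {"GetItem": 3000, "PutItem": 4000, "Query": 5000}     # 10-minute window

score = cluster_score(recent, baseline)
fires = score is not None and score > 2
```

Note that the standard deviation here is taken across operations at a single evaluation time, not across time, so the score measures the shift in the cluster-wide mean relative to how much the operations differ from each other.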

@ManjotS

ManjotS commented Nov 5, 2024

We don't want this one widely distributed. This should be an SRE issue.

@ManjotS

ManjotS commented Nov 7, 2024

Please close this issue

@mykaul closed this as not planned on Nov 7, 2024
4 participants