SLO performance issue #1440

@tappof

Description

Hello,

I'm reaching out on behalf of the Wikimedia Foundation as we are currently setting up Pyrra for our SLOs.

We are struggling with an SLO over a 12-week window that uses a metric exported by Istio, which has very high cardinality (8,807 resulting series).

We would like to share some of our configuration with you to ask for advice and suggestions.

Below is our current SLO configuration:

---
apiVersion: pyrra.dev/v1alpha1
kind: ServiceLevelObjective
metadata:
  name: liftwing-revscoring-availability
  namespace: pyrra-o11y
  labels:
    pyrra.dev/team: ml
    pyrra.dev/service: liftwing
    pyrra.dev/site: "eqiad"
spec:
  target: '98'
  window: 12w
  indicator:
    ratio:
      errors:
        metric: istio_requests_total{response_code!~"(2|3|4)..", site=~"eqiad", destination_service_namespace=~"revscoring.*"}
      total:
        metric: istio_requests_total{response_code=~"...", site=~"eqiad", destination_service_namespace=~"revscoring.*"}

This configuration generates recording rules in the form of:

3.817    sum(rate(istio_requests_total{destination_service_namespace=~"revscoring.*",response_code!~"(2|3|4)..",site=~"eqiad"}[3d])) / sum(rate(istio_requests_total{destination_service_namespace=~"revscoring.*",response_code=~"...",site=~"eqiad"}[3d]))

12.694    sum(rate(istio_requests_total{destination_service_namespace=~"revscoring.*",response_code!~"(2|3|4)..",site=~"eqiad"}[12d])) / sum(rate(istio_requests_total{destination_service_namespace=~"revscoring.*",response_code=~"...",site=~"eqiad"}[12d]))

12.972    (sum(rate(istio_request_duration_milliseconds_count{destination_service_namespace=~"revscoring.*",response_code=~"2..",site=~"eqiad"}[12d])) - sum(rate(istio_request_duration_milliseconds_bucket{destination_service_namespace=~"revscoring.*",le="5000",response_code=~"2..",site=~"eqiad"}[12d]))) / sum(rate(istio_request_duration_milliseconds_count{destination_service_namespace=~"revscoring.*",response_code=~"2..",site=~"eqiad"}[12d]))

48.040    sum by (destination_service_namespace, response_code, site) (increase(istio_request_duration_milliseconds_count{destination_service_namespace=~"revscoring.*",response_code=~"2..",site=~"eqiad"}[12w]))

63.559    sum by (destination_service_namespace, response_code, site) (increase(istio_request_duration_milliseconds_bucket{destination_service_namespace=~"revscoring.*",le="5000",response_code=~"2..",site=~"eqiad"}[12w]))

94.796    sum by (destination_service_namespace, response_code, site) (increase(istio_requests_total{destination_service_namespace=~"revscoring.*",response_code=~"...",site=~"eqiad"}[12w]))

We use Thanos as a long-term storage solution. The first two weeks of data are retrieved from sidecars, while the remaining data is processed through the Thanos store, with blocks residing in a Swift object store.

Every recording rule generated by the Pyrra configuration listed above is preceded by the time, in seconds, it takes to compute the query. You can see that the 12-week query takes about 95 seconds to complete, which is close to our 120-second timeout. Moreover, the evaluation interval in the generated configuration we submit to Thanos Rule is 30 seconds. If I'm not mistaken, this means our SLO is undersampled: with a ~95-second evaluation time against a 30-second interval, only about one out of every four evaluation points is actually computed by Thanos Rule.
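
For reference, the rule group we submit to Thanos Rule has roughly this shape (group and record names here are illustrative; only the 30-second interval and the 12w range are taken from our actual configuration):

groups:
  - name: liftwing-revscoring-availability      # illustrative group name
    interval: 30s                                # evaluation interval in the generated configuration
    rules:
      - record: istio_requests:increase12w       # illustrative record name
        expr: sum by (destination_service_namespace, response_code, site) (increase(istio_requests_total{destination_service_namespace=~"revscoring.*",response_code=~"...",site=~"eqiad"}[12w]))

The group's interval field is what determines how often each of the expensive 12w queries above is re-evaluated.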

Finally, the computation is also resource-intensive. For this SLO alone, we've observed a 30% increase in CPU usage on the Thanos querier node and a 3x increase in interface bandwidth usage due to communication between Thanos and Swift.

Given that, we tried to set up a recording rule that pre-aggregates the Istio metrics:

- record: istio_sli_latency_request_duration_milliseconds_bucket:increase5m
  expr: sum by (destination_canonical_service, destination_service_namespace, le, response_code, site, prometheus) (increase(istio_request_duration_milliseconds_bucket{kubernetes_namespace="istio-system", le=~"(50|100|250|500|1000|2500|5000|10000|30000|\\+Inf)"}[5m]))

However, pointing Pyrra at recording rules of the form "sum by (a, b, c) (istio_requests_total)" means that the Pyrra-generated recording rules end up computing rate() over a pre-aggregated sum(). Since a sum of counters is no longer a well-behaved counter, a reset in any single underlying series can show up as an unwanted spike in the result.
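
To make this concrete, here is the contrast as we understand it (queries are only a sketch; istio_requests:sum_by_dims is a hypothetical record name for the sum-by aggregation described above):

# Hypothetical pre-aggregation recording rule that sums the raw counters:
# - record: istio_requests:sum_by_dims
#   expr: sum by (destination_service_namespace, response_code, site) (istio_requests_total)

# Safe pattern: rate() is computed per series before summing, so a counter reset
# only affects that one series' contribution:
sum by (destination_service_namespace, response_code, site) (rate(istio_requests_total[5m]))

# Problematic pattern: the pre-aggregated sum is no longer a well-behaved counter,
# so wrapping it in rate() (which is what Pyrra's generated rules would end up
# doing on top of our recording rule) can produce large spurious spikes whenever
# any underlying counter resets:
rate(istio_requests:sum_by_dims[5m])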

If you would like more information on how we managed this situation, you can find it on our Phabricator instance:

Thank you and have a nice day
