Hello,
I'm reaching out on behalf of the Wikimedia Foundation as we are currently setting up Pyrra for our SLOs.
We are struggling with an SLO over a 12-week window that uses a metric exported by Istio, which has very high cardinality (8,807 resulting series).
We would like to share some of our configuration with you and ask for advice or suggestions.
Below is our current SLO configuration:
---
apiVersion: pyrra.dev/v1alpha1
kind: ServiceLevelObjective
metadata:
  name: liftwing-revscoring-availability
  namespace: pyrra-o11y
  labels:
    pyrra.dev/team: ml
    pyrra.dev/service: liftwing
    pyrra.dev/site: "eqiad"
spec:
  target: '98'
  window: 12w
  indicator:
    ratio:
      errors:
        metric: istio_requests_total{response_code!~"(2|3|4)..", site=~"eqiad", destination_service_namespace=~"revscoring.*"}
      total:
        metric: istio_requests_total{response_code=~"...", site=~"eqiad", destination_service_namespace=~"revscoring.*"}
This configuration generates recording rules in the form of:
3.817 sum(rate(istio_requests_total{destination_service_namespace=~"revscoring.*",response_code!~"(2|3|4)..",site=~"eqiad"}[3d])) / sum(rate(istio_requests_total{destination_service_namespace=~"revscoring.*",response_code=~"...",site=~"eqiad"}[3d]))
12.694 sum(rate(istio_requests_total{destination_service_namespace=~"revscoring.*",response_code!~"(2|3|4)..",site=~"eqiad"}[12d])) / sum(rate(istio_requests_total{destination_service_namespace=~"revscoring.*",response_code=~"...",site=~"eqiad"}[12d]))
12.972 (sum(rate(istio_request_duration_milliseconds_count{destination_service_namespace=~"revscoring.*",response_code=~"2..",site=~"eqiad"}[12d])) - sum(rate(istio_request_duration_milliseconds_bucket{destination_service_namespace=~"revscoring.*",le="5000",response_code=~"2..",site=~"eqiad"}[12d]))) / sum(rate(istio_request_duration_milliseconds_count{destination_service_namespace=~"revscoring.*",response_code=~"2..",site=~"eqiad"}[12d]))
48.040 sum by (destination_service_namespace, response_code, site) (increase(istio_request_duration_milliseconds_count{destination_service_namespace=~"revscoring.*",response_code=~"2..",site=~"eqiad"}[12w]))
63.559 sum by (destination_service_namespace, response_code, site) (increase(istio_request_duration_milliseconds_bucket{destination_service_namespace=~"revscoring.*",le="5000",response_code=~"2..",site=~"eqiad"}[12w]))
94.796 sum by (destination_service_namespace, response_code, site) (increase(istio_requests_total{destination_service_namespace=~"revscoring.*",response_code=~"...",site=~"eqiad"}[12w]))
We use Thanos as a long-term storage solution. The first two weeks of data are retrieved from sidecars, while the remaining data is processed through the Thanos store, with blocks residing in a Swift object store.
Each recording-rule expression listed above is prefixed with the time (in seconds) it takes to evaluate the query. As you can see, the 12-week query takes about 95 seconds to complete, which is close to our 120-second timeout. Moreover, the evaluation interval in the generated configuration we submit to Thanos Rule is 30 seconds. If I'm not mistaken, this means our SLO is undersampled: with a ~95-second evaluation against a 30-second interval, only about one out of every four points is actually computed by Thanos Rule.
Finally, the computation is also resource-intensive. For this SLO alone, we've observed a 30% increase in CPU usage on the Thanos querier node and a 3x increase in interface bandwidth usage due to communication between Thanos and Swift.
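To check whether evaluations are actually being dropped, one option (a sketch, assuming your Thanos Rule exposes the standard Prometheus rule-manager metrics, which it should since it reuses the Prometheus rules engine) is to compare each rule group's last evaluation duration against its configured interval and to watch for missed iterations:

    # Rule groups whose last evaluation took longer than their interval
    prometheus_rule_group_last_duration_seconds > prometheus_rule_group_interval_seconds

    # Evaluations skipped because the previous one was still running
    rate(prometheus_rule_group_iterations_missed_total[1h]) > 0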
Given this, we tried to set up a recording rule that pre-aggregates the Istio metrics:
- record: istio_sli_latency_request_duration_milliseconds_bucket:increase5m
  expr: sum by (destination_canonical_service, destination_service_namespace, le, response_code, site, prometheus) (increase(istio_request_duration_milliseconds_bucket{kubernetes_namespace="istio-system", le=~"(50|100|250|500|1000|2500|5000|10000|30000|\\+Inf)"}[5m]))
However, using recording rules of the form sum by (a, b, c) (istio_requests_total) means the Pyrra-generated recording rules end up computing rate(sum(...)): the sum hides counter resets of the underlying series, so the resulting rate is at risk of producing large unwanted spikes whenever a counter resets.
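For comparison, a pre-aggregation that takes the per-series rate before summing handles counter resets on the raw series first, for example (a sketch only; the rule name and the 5m window are placeholders):

    - record: destination_service_namespace:istio_requests_total:rate5m
      expr: sum by (destination_service_namespace, response_code, site) (rate(istio_requests_total{kubernetes_namespace="istio-system"}[5m]))

The trade-off is that the recorded series is a gauge rather than a counter, so it cannot simply be plugged back into the ratio indicator above, since Pyrra would again wrap it in rate()/increase(), as the generated rules show.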
If you would like more information on how we managed this situation, you can find it on our Phabricator instance:
- https://phabricator.wikimedia.org/T387350
- https://phabricator.wikimedia.org/T302995
Thank you, and have a nice day!