Ingester performance and reliability issues on high scale cluster #8810
-
Hi, I took a look at what you sent.
This is very far into "not keeping up" territory. An ingester can only reasonably do as many things at once as it has CPU cores; locking can spike that number occasionally, but 65k in-flight requests is a bad symptom. I wonder if the Go scheduler just can't figure out what to do fast enough, and that's why you see lower CPU usage.

There is also quite a high level of overhead from gRPC; perhaps you could try to make push requests bigger? (E.g. via

Do you think you could add a heap profile? About 25% of the on-CPU time is down to garbage collection, so there might be something anomalous in the allocation patterns. I see you have
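A minimal sketch of one way to make push requests bigger on the sender side, assuming the writers are Prometheus-style remote-write clients (an assumption; the thread does not say what is writing to the ingesters):

```yaml
# Hypothetical remote_write tuning on the sender side (assumption: the
# senders are Prometheus-style remote-write clients). Bigger batches mean
# fewer, larger push requests and less per-request gRPC overhead.
remote_write:
  - url: https://mimir.example.com/api/v1/push   # placeholder URL
    queue_config:
      max_samples_per_send: 10000   # raise samples per push above the default
      batch_send_deadline: 5s       # wait up to this long to fill a batch
      capacity: 20000               # per-shard buffer; keep >= max_samples_per_send
```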
-
Hi team,
we are experiencing performance and reliability issues with the ingesters in our biggest cluster (smaller clusters are running fine).
Some general numbers for this cluster are:
(running in an AWS EKS k8s cluster)
The normal % of ingestion errors is very low, 0.00X%. When we start to restart some ingesters, that ingestion error % increases. We cannot restart more than 20-30 ingesters at a time if we don't want the ingestion errors to go above 1%.
At the moment, we cannot say we could sustain an entire zone going down, because 2 zones wouldn't sustain the traffic by themselves. I suspect this is the case because, when ingesters are not performing well, finding the quorum (2 ingesters) out of 3 ingesters is more likely than finding 2 out of 2.
I suspect there are several issues going on, but nothing major has come up while investigating.
The ingesters have a total rejection rate (the sum across all of them) of between 5k and 10k requests per second because the max inflight push requests limit (set to 65k) is reached. They reach this limit independently of each other, somewhat randomly and without any apparent cause: one ingester hits the limit one minute, another ingester the next minute, and so on.
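For reference, this is where that limit would typically be set, assuming it is Mimir's ingester instance limit (a sketch; the 65000 value comes from the description above):

```yaml
# Sketch of the in-flight push limit being hit, assuming Mimir's ingester
# instance_limits block (CLI flag: -ingester.instance-limits.max-inflight-push-requests).
ingester:
  instance_limits:
    max_inflight_push_requests: 65000   # pushes above this are rejected; 0 disables the limit
```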
I attached the manifests for all the components in case you can spot something unusual or something that could be configured better for this scale:
all_manifests.zip
For convenience, the configmap with the overrides is as follows:
This is the ingester's statefulset definition:
Note how the CPU request is 10 and GOMAXPROCS is set to 10 as well. However, their CPU utilization is always between 2.5 and 6 cores between compactions; during compactions it goes a bit higher, up to almost 7. So there are plenty of cores available that are never fully utilised.
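The fields relevant to this point look roughly like the following (an abridged sketch reconstructed from the numbers above; the real definition is in all_manifests.zip):

```yaml
# Abridged sketch of the ingester statefulset, reconstructed from the
# description above; not the actual manifest.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: ingester
spec:
  template:
    spec:
      containers:
        - name: ingester
          env:
            - name: GOMAXPROCS
              value: "10"        # matches the CPU request below
          resources:
            requests:
              cpu: "10"          # observed utilisation: ~2.5-6 cores, ~7 during compactions
```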
I also attach 2 flamegraphs (plus the raw profiles in the zip) obtained from 30s on-CPU and off-CPU profiling, respectively, of one of the ingesters (/pprof and /fgprof):
profiles_and_flamegraphs.zip
Can you give me some guidance on how to find what the performance bottleneck is, or how to make progress in the debugging?
Let me know what more information is needed to give a better picture of what is happening.
Thank you very much!