Ingester performance and reliability issues on high scale cluster #8810
-
Hi, I took a look at what you sent.
This is very far into "not keeping up" territory. An ingester can only reasonably do as many things at once as it has CPU cores; locking can spike that number occasionally, but 65k in-flight requests is a bad symptom. I wonder if the Go scheduler just can't figure out what to do fast enough, and that's why you see lower CPU usage.

There is also quite a high level of overhead from gRPC; perhaps you could try to make push requests bigger? (E.g. via

Do you think you could add a heap profile? About 25% of the on-CPU time is down to garbage collection, so there might be something anomalous in the allocation patterns. I see you have
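A minimal sketch of one way to make push requests bigger on the sender side, assuming the writers are Prometheus-style remote-write clients (an assumption; the thread does not say what is writing to the ingesters):

```yaml
# Hypothetical remote_write tuning on the sender side (assumption: the
# senders are Prometheus-style remote-write clients). Bigger batches mean
# fewer, larger push requests and less per-request gRPC overhead.
remote_write:
  - url: https://mimir.example.com/api/v1/push   # placeholder URL
    queue_config:
      max_samples_per_send: 10000   # raise samples per push above the default
      batch_send_deadline: 5s       # wait up to this long to fill a batch
      capacity: 20000               # per-shard buffer; keep >= max_samples_per_send
```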
-
Hi team,
we are experiencing performance and reliability issues with the ingesters in our biggest cluster (smaller clusters are running fine).
Some general numbers for this cluster are:
(running in an AWS EKS k8s cluster)
The normal % of ingestion errors is very low, 0.00X%. When we start to restart some ingesters, that ingestion error % increases. We cannot restart more than 20-30 ingesters at a time if we don't want the ingestion errors to go above 1%.
At the moment, we cannot say we could sustain an entire zone going down, because 2 zones wouldn't sustain the traffic by themselves. I suspect this is the case because, when ingesters are not performing well, finding the quorum (2 ingesters) out of 3 ingesters is more likely than finding 2 out of 2.
I suspect there are several issues going on, but nothing major has come up while investigating.
The ingesters have a total rejection rate (the sum across all of them) of between 5k and 10k requests per second because the max inflight push requests limit (set to 65k) is reached. They reach this limit independently of each other, somewhat randomly and without any apparent cause: one ingester hits the limit one minute, another ingester the next minute, and so on.
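For reference, this is where that limit would typically be set, assuming it is Mimir's ingester instance limit (a sketch; the 65000 value comes from the description above):

```yaml
# Sketch of the in-flight push limit being hit, assuming Mimir's ingester
# instance_limits block (CLI flag: -ingester.instance-limits.max-inflight-push-requests).
ingester:
  instance_limits:
    max_inflight_push_requests: 65000   # pushes above this are rejected; 0 disables the limit
```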
I attached the manifests for all the components in case you can spot something unusual or something that could be configured better for this scale:
all_manifests.zip
For convenience, the configmap with the overrides is as follows:
This is the ingester's statefulset definition:
Note how the CPU request is 10 and GOMAXPROCS is set to 10 as well. However, their CPU utilization is always between 2.5 and 6 cores between compactions; during compactions it goes a bit higher, up to almost 7. So there are plenty of cores available that are never fully utilised.
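The fields relevant to this point look roughly like the following (an abridged sketch reconstructed from the numbers above; the real definition is in all_manifests.zip):

```yaml
# Abridged sketch of the ingester statefulset, reconstructed from the
# description above; not the actual manifest.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: ingester
spec:
  template:
    spec:
      containers:
        - name: ingester
          env:
            - name: GOMAXPROCS
              value: "10"        # matches the CPU request below
          resources:
            requests:
              cpu: "10"          # observed utilisation: ~2.5-6 cores, ~7 during compactions
```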
I also attach 2 flamegraphs (plus the raw profiles in the zip) obtained from 30s on-CPU and off-CPU profiling, respectively, of one of the ingesters (/pprof and /fgprof):
profiles_and_flamegraphs.zip
Can you give me some guidance on how to find what the performance bottleneck is, or how to make progress in the debugging?
Let me know what more information is needed to give a better picture of what is happening.
Thank you very much!