expanding series: too many unhealthy instances in the ring #5158
-
Describe the bug
Turned Mimir on for the first time last night. Woke up today to all my dashboards throwing this error: "expanding series: too many unhealthy instances in the ring". The full message is:
internal: rpc error: code = Code(500) desc = {"status":"error","errorType":"internal","error":"expanding series: too many unhealthy instances in the ring"}

To Reproduce
Steps to reproduce the behavior:

Expected behavior
Expected Mimir to continue running normally. Buckets, PVs, and node health are all fine.

Environment
Client Version: v1.24.14
Kustomize Version: v4.5.4
Server Version: v1.24.0
Additional Context
All of our monitoring goes through a single anonymous user on a Prometheus instance deployed elsewhere in our k8s cluster.

remoteWrite:
  - url: "http://mimir-nginx.observability.svc:80/api/v1/push"
    queueConfig:
      capacity: 5000
      maxShards: 100
      maxSamplesPerSend: 1000

$ helm show chart grafana/mimir-distributed
apiVersion: v2
appVersion: 2.8.0
dependencies:
- alias: minio
  condition: minio.enabled
  name: minio
  repository: https://charts.min.io/
  version: 5.0.7
- alias: grafana-agent-operator
  condition: metaMonitoring.grafanaAgent.installOperator
  name: grafana-agent-operator
  repository: https://grafana.github.io/helm-charts
  version: 0.2.8
- alias: rollout_operator
  condition: rollout_operator.enabled
  name: rollout-operator
  repository: https://grafana.github.io/helm-charts
  version: 0.4.2
description: Grafana Mimir
home: https://grafana.com/docs/mimir/v2.8.x/
icon: https://grafana.com/static/img/logos/logo-mimir.svg
kubeVersion: ^1.20.0-0
name: mimir-distributed
version: 4.4.1

Relevant logs from ingesters:
ts=2023-06-03T21:37:21.568191115Z caller=log.go:194 level=warn msg="Got ping for unexpected node 'mimir-ingester-zone-a-0-2a08c024' from=172.16.86.215:7946"
ts=2023-06-03T21:37:21.568827254Z caller=log.go:194 level=warn msg="Got ping for unexpected node 'mimir-ingester-zone-a-0-2a08c024' from=172.16.119.135:7946"
ts=2023-06-03T21:37:23.108260172Z caller=log.go:194 level=info msg="Suspect mimir-ingester-zone-a-0-2a08c024 has failed, no acks received"
ts=2023-06-03T21:37:24.923539679Z caller=head.go:728 level=info user=anonymous msg="WAL checkpoint loaded"
ts=2023-06-03T21:37:25.603902136Z caller=head.go:763 level=info user=anonymous msg="WAL segment loaded" segment=166 maxSegment=246
ts=2023-06-03T21:37:26.287417665Z caller=head.go:763 level=info user=anonymous msg="WAL segment loaded" segment=167 maxSegment=246
ts=2023-06-03T21:37:27.021362334Z caller=head.go:763 level=info user=anonymous msg="WAL segment loaded" segment=168 maxSegment=246
ts=2023-06-03T21:37:27.716951701Z caller=head.go:763 level=info user=anonymous msg="WAL segment loaded" segment=169 maxSegment=246
ts=2023-06-03T21:37:28.480874939Z caller=head.go:763 level=info user=anonymous msg="WAL segment loaded" segment=170 maxSegment=246
ts=2023-06-03T21:37:29.184559413Z caller=head.go:763 level=info user=anonymous msg="WAL segment loaded" segment=171 maxSegment=246
ts=2023-06-03T21:37:30.104357757Z caller=head.go:763 level=info user=anonymous msg="WAL segment loaded" segment=172 maxSegment=246
ts=2023-06-03T21:37:31.631196134Z caller=head.go:763 level=info user=anonymous msg="WAL segment loaded" segment=173 maxSegment=246
ts=2023-06-03T21:37:33.097860242Z caller=head.go:763 level=info user=anonymous msg="WAL segment loaded" segment=174 maxSegment=246
ts=2023-06-03T21:37:33.602815194Z caller=head.go:763 level=info user=anonymous msg="WAL segment loaded" segment=175 maxSegment=246
ts=2023-06-03T21:37:34.260580977Z caller=head.go:763 level=info user=anonymous msg="WAL segment loaded" segment=176 maxSegment=246
ts=2023-06-03T21:37:34.910500285Z caller=head.go:763 level=info user=anonymous msg="WAL segment loaded" segment=177 maxSegment=246
ts=2023-06-03T21:37:35.546847321Z caller=head.go:763 level=info user=anonymous msg="WAL segment loaded" segment=178 maxSegment=246
ts=2023-06-03T21:37:35.55304165Z caller=log.go:194 level=info msg="Marking mimir-ingester-zone-a-0-2a08c024 as failed, suspect timeout reached (2 peer confirmations)"
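When reading logs like the ones above, it can help to count how often memberlist declares each node failed. The following is only a sketch: `failed_nodes` is a hypothetical helper name, and it assumes the log lines keep the `msg="Marking <node> as failed..."` shape shown in the excerpt.

```shell
# Sketch: read ingester logs on stdin and print "<count> <node>" for every
# node that memberlist marked as failed. Assumes the message format seen in
# the excerpt above; other Mimir versions may word it differently.
failed_nodes() {
  sed -n 's/.*msg="Marking \([^ ]*\) as failed.*/\1/p' \
    | sort \
    | uniq -c \
    | awk '{print $1, $2}'
}
```

A possible invocation, where the pod name and namespace are assumptions based on the node name and service URL in the original post: `kubectl logs -n observability mimir-ingester-zone-a-0 | failed_nodes`.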
Replies: 7 comments 9 replies
-
Turning off zone replication doesn't seem to help either.
-
Converting this into a discussion, as there's no evidence of an actual bug yet.
-
I get the same issue.

What I've tried

Tracking down the error
The error happens in Cortex:
Here are related tests: https://github.com/cortexproject/cortex/blob/master/pkg/ring/ring_test.go#L965-L1421

Fixing the issue by resetting the internal ring
The memberlist ring is stored in memory. Consequently, deleting all pods will delete the faulty ring.

Deleting all pods
Important: if some part of your infrastructure recreates the pods while you delete them (e.g. ArgoCD auto-sync), this procedure won't work, as the faulty ring will be propagated again.

kubectl get deploy -l app.kubernetes.io/instance=mimir -n grafana | awk '{print $1}' | tail -n +2 | xargs kubectl scale --replicas=0 -n grafana deploy
kubectl get statefulset -l app.kubernetes.io/instance=mimir -n grafana | awk '{print $1}' | tail -n +2 | xargs kubectl scale --replicas=0 -n grafana statefulset

Recreating all pods
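For the scale-down step in this comment, a slightly more compact variant is sketched below. It relies on `kubectl get -o name`, which prints one `kind/name` per line, so the awk/tail header stripping is not needed. The function name `scale_mimir_to_zero` is hypothetical, and the namespace and label selector defaults are simply the ones used in the commands above; adjust them for your install.

```shell
# Sketch only: scale every Mimir Deployment and StatefulSet to zero replicas
# so that no pod keeps the in-memory memberlist ring alive.
# NAMESPACE and SELECTOR default to the values from the comment above.
NAMESPACE="${NAMESPACE:-grafana}"
SELECTOR="${SELECTOR:-app.kubernetes.io/instance=mimir}"

scale_mimir_to_zero() {
  local workloads
  # -o name yields lines such as "deployment.apps/<name>".
  workloads=$(kubectl get deploy,statefulset -l "$SELECTOR" -n "$NAMESPACE" -o name) || return 1
  if [ -z "$workloads" ]; then
    echo "no workloads matched selector $SELECTOR in namespace $NAMESPACE" >&2
    return 1
  fi
  # Word splitting is intentional here: each kind/name becomes one argument.
  # shellcheck disable=SC2086
  kubectl scale --replicas=0 -n "$NAMESPACE" $workloads
}
```

Scaling to zero discards the previous replica counts, so it is worth noting them down (or letting Helm/ArgoCD restore them) before calling `scale_mimir_to_zero`.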
-
Having the same problem as well. The only thing able to fix it is deleting all pods to reset the memberlist ring, as @clouedoc also states. However, within an hour, the ring is corrupted again.
-
@sinthetix @clouedoc @abhinavDhulipala have you tried checking the hash ring page for the ingesters? That page is exposed by the distributor pods (but not proxied by the nginx) on /ingester/ring and shows the state of the ring (API reference: https://grafana.com/docs/mimir/latest/references/http-api/#ingesters-ring-status). It would be helpful to see that and see why the queriers think that there are too many unhealthy ingesters.
-
I didn't know about this, I should've known that Mimir had a wide API like this 👀
I'll try it next time I get the issue 👍
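Since the /ingester/ring page is served by the distributor and not proxied by nginx, one way to reach it is a port-forward. This is a sketch under assumptions: the Service name `mimir-distributor`, the namespace `observability` (taken from the remote-write URL in the original post), and HTTP port 8080 are guesses that depend on the Helm release, and the function names are hypothetical.

```shell
# Sketch only: reach the ingester ring page through a port-forward.
# All three defaults are assumptions; override them to match your release.
NAMESPACE="${NAMESPACE:-observability}"
DISTRIBUTOR_SVC="${DISTRIBUTOR_SVC:-mimir-distributor}"
LOCAL_PORT="${LOCAL_PORT:-8080}"

# Blocks while forwarding; run it in a separate terminal.
forward_distributor() {
  kubectl port-forward -n "$NAMESPACE" "svc/$DISTRIBUTOR_SVC" "$LOCAL_PORT:8080"
}

# URL of the ring status page once the port-forward is up.
ring_url() {
  printf 'http://localhost:%s/ingester/ring\n' "$LOCAL_PORT"
}

# Fetch the page; -f makes curl fail on HTTP errors instead of printing them.
fetch_ring() {
  curl -fsS "$(ring_url)"
}
```

With `forward_distributor` running in one terminal, `fetch_ring` in another prints the page, or the URL from `ring_url` can be opened in a browser to see each ingester's state.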
-
Memberlist clusters can end up joining each other because of IP reuse. One cluster will try to gossip to the old IP, but that IP will already be in use by the Loki/Tempo cluster. The Loki team has a blog post on how we've encountered it; see "Why memberlist labels matter" in https://grafana.com/blog/2022/09/28/inside-the-migration-from-consul-to-memberlist-at-grafana-labs/
It also gives a brief overview of how to do the migration so that Loki/Tempo/Mimir each have their own memberlist labels and "know" not to join the same memberlist cluster. The section is "Migration steps for using labels." The respective config options for Mimir are
memberlist.cluster_label: "mimir"
and memberlist.cluster_label_…