help request: Apisix ETCD going into Crash loop back off #11338
Comments
Do you have a strong requirement to use |
I think you need to try |
The most recent version is 2.8.0 (released on Jun 04, 2024). So I was using the version prior to that, which was released in April. May I know which version of the Helm chart you're using?
If it were the cluster's etcd, we would log into the node and execute the commands. Since here it is running as a pod, I'm not sure where to execute the etcdctl commands, and as the pods are in CrashLoopBackOff, I can't even exec into them.
There is one running etcd pod; run the commands there. BTW, I think it's more likely an etcd problem.
Hello, from the running pod (`I have no name!@apisix-etcd-2:/opt/bitnami/etcd$`) I ran `etcdctl member list`. This is the member id I found in the logs of the crashing pods: `local-member-id: "2c16fb63879f0d98"`. I also tried disabling the apisix etcd and using an external etcd, but it was not able to integrate with the running etcd pod. I'm trying to fix that. Please share anything you know that could help.
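Given the "rejected stream from remote peer because it was removed" log from the healthy pod, one possible recovery path is to drop the stale membership entry and let the crashed pods rejoin with fresh state. This is only a sketch: the member id comes from this thread, and the pod, namespace, and PVC names are assumptions that may differ in your cluster.

```shell
# From the one healthy etcd pod (names are examples from this thread):
kubectl exec -it apisix-etcd-2 -n apisix -- etcdctl member list

# If the crashing pod's member id (here 2c16fb63879f0d98) still appears,
# remove the stale registration so the pod can rejoin as a new member:
kubectl exec -it apisix-etcd-2 -n apisix -- etcdctl member remove 2c16fb63879f0d98

# Then delete the crashed pod's PVC and the pod so it starts with empty state:
kubectl delete pvc data-apisix-etcd-1 -n apisix
kubectl delete pod apisix-etcd-1 -n apisix
```

If your etcd has TLS or authentication enabled, the etcdctl calls will additionally need the usual `--endpoints`, `--cacert`, `--cert`, and `--key` flags.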
Any solutions for this issue? I am also facing the same issue for the last 3 days.
I have changed the etcd version in Chart.yaml to "10.1.0", and now all pods are in a running state. I'm checking a few things in the UI to make sure everything is working fine. If you are using the Helm chart for deploying apisix, try this.
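For reference, pinning the etcd sub-chart version in the apisix Helm chart looks roughly like this. This is a sketch of the Chart.yaml dependencies section; the repository URL and surrounding fields are assumptions and may differ in your copy of the chart.

```yaml
# Chart.yaml (apisix chart) -- pin the etcd dependency, version per this thread
dependencies:
  - name: etcd
    version: "10.1.0"    # downgraded from the bundled version
    repository: https://charts.bitnami.com/bitnami
    condition: etcd.enabled
```

After editing the pin, rebuild the chart's lock and local charts/ directory with `helm dependency update` before installing.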
Thanks, will try and update, ma'am.
Apisix is working fine after upgrading the etcd version in Chart.yaml to "10.1.0", so I'm closing this issue.
Hi @Lakshmi2k1, are you still facing the same issue? I need some suggestions on it. We have upgraded to 10.2.6 but are still facing the same issue.
Still having the same issue. We downloaded and added the entire chart dir, setting the etcd version in Chart.yaml to "10.1.0" as suggested by @Lakshmi2k1. Are there any plans to have this fixed?
Hi @BadTorro, try enabling the disaster recovery cron job.
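For context, "disaster recovery" here refers to the Bitnami etcd sub-chart's periodic snapshot CronJob. Enabling it via Helm values looks roughly like this; the key names follow the Bitnami etcd chart, but the schedule and the ReadWriteMany-capable storage class (e.g. an NFS-backed one) are assumptions.

```yaml
etcd:
  disasterRecovery:
    enabled: true
    cronjob:
      schedule: "*/30 * * * *"       # take a snapshot every 30 minutes
      historyLimit: 1
    pvc:
      size: 2Gi
      storageClassName: nfs-client   # must support ReadWriteMany
```

The snapshot PVC is shared by all etcd replicas, which is why it needs a ReadWriteMany storage class.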
How do you mean? Do you have some more specifics on that?
We're currently using it in a local development environment; etcd boots with 3 nodes, but 2 always keep failing. From time to time I need to shut down the entire environment and restart it to get it working again. We are using it within https://tilt.dev/.
Thanks.
Hi @BadTorro, I have found two solutions for it so far.
@sudhir649 thanks for the tip, I need to verify it. It seems like I need an NFS storage provider to get the snapshot image to work.
@BadTorro yes, tilt runs on a local machine, so you need to deploy an NFS storage class. By the way, Lakshmi's solution won't work for me.
@sudhir649, we are facing one more error in apisix. We use the openid-connect plugin for authentication and authorization in the ApisixPluginConfig. When we try to hit the ingress of the application, it gives a 431 (Request Header Fields Too Large) error. We tried removing a few headers, but that was breaking the UI of the application. Is there a way to solve this? Have you come across a similar issue before?
Did you use nginx? If you use nginx, add this to your nginx config: just increase the client header size and check, it will work.
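A hedged sketch of the kind of nginx tuning meant above: a 431 means the request headers exceed the configured buffers, so the usual knobs are the header buffer directives. The sizes below are examples, not recommendations.

```nginx
# http {} or server {} context
large_client_header_buffers 4 32k;  # number and size of buffers for large headers
client_header_buffer_size 8k;       # initial buffer for the request line + headers
```

With APISIX these directives would have to be injected through its own configuration (e.g. the `nginx_config` section of config.yaml); with ingress-nginx the equivalents are the ConfigMap options `large-client-header-buffers` and `client-header-buffer-size`.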
Regarding that, I managed to get it working by basically:
It currently keeps running and has not crashed since.
I enabled disaster recovery and deployed the Helm chart, but this time not just etcd was crashing: the apisix pod was stuck in its init container, the apisix ingress controller was crashing, and the snapshot pod was also in an error state. So I rolled back to the previous revision after observing that the pod status didn't change for a long time.
Hi @BadTorro, how was the experience after deploying the disaster recovery? For us it's working fine, so we replicated it in all the envs. Regards,
@Lakshmi2k1 The problem you encountered has nothing to do with disaster recovery. I have not experienced this problem |
We had a very similar issue.
We created apisix and etcd through the Helm chart, and for us the issue was that, even though we re-created the StatefulSet and deleted the PVCs for a fresh start, the | Changed it to |
The problem is the same as here: bitnami/charts#16069
JFYI : |
neither in GKE |
Apologies for the late reply @sudhir649, but we ended up using the bitnami chart and customized it to our needs.
Any news about this problem? |
Any solutions for this issue? I am also facing the same issue.
I had this issue running on GKE. Uninstalling and deleting the etcd PVCs, then reinstalling fixed the issue. |
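The uninstall/reinstall described above can be sketched as follows; the release name, namespace, PVC label, and values file are all assumptions to adapt to your setup.

```shell
helm uninstall apisix -n apisix
# etcd PVCs survive a helm uninstall; delete them so etcd bootstraps fresh
kubectl delete pvc -n apisix -l app.kubernetes.io/name=etcd
helm install apisix apisix/apisix -n apisix -f values.yaml
```

Note that deleting the PVCs wipes all configuration stored in etcd (routes, upstreams, etc.), which the next comments discuss.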
Hi @Joeydelarago, when you delete the etcd PVCs, what data is gone? Are all the routes gone, or do they still exist?
I actually encountered this when setting up a new environment, so losing data was not a concern for me. Also, I have done my configuration via YAML files instead of the API. If you need the data, you can always mount the PVCs to a dummy deployment and copy the files to local storage with kubectl cp. Then delete the PVC and apply apisix, and use kubectl cp to copy the important files to the newly created PVC. I can't guarantee it will work, though.
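The copy-out/copy-back idea above, sketched with a throwaway pod. All names (pod, PVC, namespace defaults) are hypothetical, and as the author says, there is no guarantee a fresh cluster will accept the restored files.

```shell
# Mount the old PVC in a dummy busybox pod and copy its contents out
kubectl run pvc-reader --image=busybox --restart=Never \
  --overrides='{"spec":{"containers":[{"name":"pvc-reader","image":"busybox","command":["sleep","3600"],"volumeMounts":[{"name":"data","mountPath":"/data"}]}],"volumes":[{"name":"data","persistentVolumeClaim":{"claimName":"data-apisix-etcd-0"}}]}}'
kubectl cp pvc-reader:/data ./etcd-backup

# After reinstalling apisix/etcd, mount the new PVC the same way
# and copy the saved files back:
kubectl cp ./etcd-backup pvc-reader:/data
```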
This issue started occurring again for me, so I ended up installing etcd separately. I only updated the etcd values.yaml to increase the replica count from 1 to 3.
Then I updated the apisix values.yaml and did a helm upgrade on apisix.
My solution for the issue is still the same as I mentioned above: helm uninstall, delete the etcd PVCs, helm reinstall. However, with etcd separated, this can be done without taking down apisix. Edit: the experimental composite architecture simulates etcd instead. Perhaps by the time you are reading this it is stable: https://apisix.apache.org/docs/ingress-controller/composite/
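Installing etcd as its own release and pointing apisix at it can be sketched like this. The release names, namespace, and service DNS name are assumptions; the `etcd.enabled`/`etcd.host` keys follow the apisix Helm chart's values layout.

```shell
# Standalone etcd with 3 replicas (Bitnami chart)
helm install etcd oci://registry-1.docker.io/bitnamicharts/etcd -n apisix \
  --set replicaCount=3 \
  --set auth.rbac.create=false

# Point apisix at the external etcd and disable the bundled one
helm upgrade apisix apisix/apisix -n apisix \
  --set etcd.enabled=false \
  --set 'etcd.host[0]=http://etcd.apisix.svc.cluster.local:2379'
```

With auth disabled this is only suitable for development; for production you would keep RBAC on and supply credentials in the apisix values.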
Description
Hello,
I have deployed the apisix 2.7.0 Helm chart, and out of three etcd pods, two are going into CrashLoopBackOff, which affects the ingresses created for other deployments.
The logs show the following details,
Master (etcd pod in running state)
"msg":"rejected stream from remote peer because it was removed","local-member-id"
Other pods (etcd pods in crash loop back off state)
"failed to publish local member to cluster through raft","local-member-id":"2c16fb63879f0d98","local-member-attributes":"{Name:apisix-etcd-1 ClientURLs:[http://apisix-etcd-1.apisix-etcd-headless.apisix.svc.cluster.local:2379/ http://apisix-etcd.apisix.svc.cluster.local:2379]}","request-path":"/0/members/2c16fb63879f0d98/attributes","publish-timeout":"7s","error":"etcdserver: request cancelled"
I'm currently stuck on this; let me know if anyone has faced this and has a fix for it.
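For anyone triaging the same symptom: a pod in CrashLoopBackOff cannot be exec'd into, but the logs of the previous container run and the pod events are still available. Pod names and namespace below are examples from this thread.

```shell
kubectl get pods -n apisix -l app.kubernetes.io/name=etcd   # check restart counts
kubectl logs apisix-etcd-1 -n apisix --previous             # logs from the crashed run
kubectl describe pod apisix-etcd-1 -n apisix                # events and exit codes
```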
Environment
apisix version: 2.7.0