help request: Apisix ETCD going into Crash loop back off #11338

Open
Lakshmi2k1 opened this issue Jun 6, 2024 · 39 comments

Comments

@Lakshmi2k1

Description

Hello,
I have deployed the APISIX Helm chart 2.7.0, and two of the three etcd pods are going into CrashLoopBackOff, which affects the ingresses created for other deployments.

image

The logs show the following details:

Master (etcd pod in running state)
"msg":"rejected stream from remote peer because it was removed","local-member-id"

Other pods (etcd pods in crash loop back off state)
"failed to publish local member to cluster through raft","local-member-id":"2c16fb63879f0d98","local-member-attributes":"{Name:apisix-etcd-1 ClientURLs:[http://apisix-etcd-1.apisix-etcd-headless.apisix.svc.cluster.local:2379/ http://apisix-etcd.apisix.svc.cluster.local:2379]}","request-path":"/0/members/2c16fb63879f0d98/attributes","publish-timeout":"7s","error":"etcdserver: request cancelled"

I'm currently stuck on this. Let me know if anyone has faced it and has a fix.

Environment

  • APISIX version (run apisix version): 2.7.0
@kayx23
Member

kayx23 commented Jun 6, 2024

Do you have a strong requirement to use 2.7.0? I'm using the latest and pods are starting normally.

@flearc
Contributor

flearc commented Jun 6, 2024

I think you need to try etcdctl member list first. This will help you verify whether the member IDs of the crashing pods match the IDs reported by etcdctl.

@Lakshmi2k1
Author

> Do you have a strong requirement to use 2.7.0? I'm using the latest and pods are starting normally.

The most recent version is 2.8.0 (released on Jun 04, 2024), so I was using the version prior to that, which was released in April. May I know which version of the Helm chart you're using?

@Lakshmi2k1
Author

> I think you need to try etcdctl member list first. This will help you verify whether the member IDs of the crashing pods match the IDs reported by etcdctl.

If it were the cluster's own etcd, we would log in to the node and run the commands there. Since etcd here runs as a pod, I'm not sure where to run etcdctl commands, and because the pods are in CrashLoopBackOff, I can't even exec into them.

@flearc
Contributor

flearc commented Jun 7, 2024

There is one running etcd pod; exec into it and run etcdctl member list. Also check the logs of the crashed etcd pods, which normally show the member ID each one used.

BTW, I think it's more likely an etcd problem.

@Lakshmi2k1
Author

> There is one running etcd pod; exec into it and run etcdctl member list. Also check the logs of the crashed etcd pods, which normally show the member ID each one used.
>
> BTW, I think it's more likely an etcd problem.

Hello,
I have tried it; this is what I got:

I have no name!@apisix-etcd-2:/opt/bitnami/etcd$ etcdctl member list
3ff1b5cd453a87df, started, apisix-etcd-2, http://apisix-etcd-2.apisix-etcd-headless.apisix.svc.cluster.local:2380, http://apisix-etcd-2.apisix-etcd-headless.apisix.svc.cluster.local:2379,http://apisix-etcd.apisix.svc.cluster.local:2379, false

And this is the member ID I found in the logs of the crashing pods: "local-member-id":"2c16fb63879f0d98". I also tried disabling the APISIX etcd and using an external etcd, but it could not integrate with the running etcd pod. I'm trying to fix that. Please share anything you know that could help.
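
For readers following along, the comparison suggested earlier in the thread can be sketched in shell. This is only a minimal sketch, not an official procedure: the namespace `apisix` and pod name `apisix-etcd-2` are taken from the hostnames quoted in this thread, it assumes `etcdctl` needs no credentials, and `member_is_registered` is a hypothetical helper name.

```shell
# Hypothetical helper: succeeds if the given member ID appears as the first
# comma-separated field of `etcdctl member list` output read from stdin.
member_is_registered() {
  awk -F', *' -v id="$1" '$1 == id { found = 1 } END { exit !found }'
}

# Usage against the one pod that is still running (names are assumptions):
#   kubectl -n apisix exec apisix-etcd-2 -- etcdctl member list \
#     | member_is_registered 2c16fb63879f0d98 \
#     || echo "member ID is not registered in the cluster"
```

With the output quoted in this comment, the check fails for 2c16fb63879f0d98, which is consistent with the "rejected stream from remote peer because it was removed" message in the running pod's log.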

@Thilip707

Any solutions for this issue? I have also been facing the same issue for the last 3 days.

@Lakshmi2k1
Author

> Any solutions for this issue? I have also been facing the same issue for the last 3 days.

I have changed the etcd version in Chart.yaml to "10.1.0". Now all pods are in the Running state. I'm checking a few things in the UI to make sure everything is working fine. If you are using the Helm chart to deploy APISIX, try this.

@Thilip707

Thanks, I will try it and update, ma'am.

@Lakshmi2k1
Author

APISIX is working fine after upgrading the etcd version in Chart.yaml to "10.1.0", so I'm closing this issue.

@github-project-automation github-project-automation bot moved this from 📋 Backlog to ✅ Done in Apache APISIX backlog Jun 11, 2024
@sudhir649

sudhir649 commented Jul 15, 2024

Hi @Lakshmi2k1, are you still facing the same issue? We need some suggestions on it. We have upgraded to 10.2.6 but are still facing the same issue.

@BadTorro

Still having the same issue. We downloaded the entire chart directory and added it, setting the etcd version in Chart.yaml to "10.1.0" as suggested by @Lakshmi2k1.

Are there any plans to have this fixed?

@sudhir649

Hi @BadTorro, try enabling the disaster recovery cron job.

@BadTorro

BadTorro commented Jul 25, 2024 via email

@sudhir649

Hi @BadTorro,

So far I have found two solutions for it.

  1. Interim solution: delete all three PVCs and restart the pods.

  2. Check the README.md in the bitnami/etcd folder, which explains how to enable the disaster recovery cron job.
    With disaster recovery enabled, a cron job takes backups of the PVC; if more than (n-1)/2 pods are failing, the pods automatically come back to Running status with the help of the backup PVC.
    I have implemented disaster recovery in my environment. I then saw that while 2 pods were failing they kept trying to come back to Running status, and the logs changed, but unfortunately they still could not recover.
    But once the third pod failed, all three pods automatically came back to Running status with the help of the backup PVC.
    So extract the etcd chart archive, enable the cron job in values.yaml, repackage it, and redeploy.
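
As a rough sketch of the second option, the same setting could be passed as Helm overrides rather than by editing the unpacked chart, which is a different route from the one described above. Everything specific here is an assumption to verify against your chart version (for example with `helm show values bitnami/etcd`): the value keys `disasterRecovery.enabled` and `disasterRecovery.pvc.storageClassName`, the `etcd.` sub-chart prefix, the release name `apisix`, the chart reference `apisix/apisix`, the namespace `apisix`, and the default storage class `nfs`.

```shell
# Print the command by default; set APPLY=1 to actually run it.
run() {
  if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}

# Assumed value keys; check them against the etcd sub-chart you deploy.
enable_etcd_disaster_recovery() {
  storage_class="${1:-nfs}"   # assumption: a storage class usable for snapshots
  run helm upgrade apisix apisix/apisix \
    --namespace apisix \
    --reuse-values \
    --set etcd.disasterRecovery.enabled=true \
    --set etcd.disasterRecovery.pvc.storageClassName="$storage_class"
}
```

Calling `enable_etcd_disaster_recovery` without `APPLY=1` only prints the command, so it can be reviewed before anything changes in the cluster.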

@BadTorro

@sudhir649 thanks for the tip; I need to verify it. It seems I need an NFS storage provider to get the snapshot image to work.

@sudhir649

sudhir649 commented Jul 26, 2024

@BadTorro yes, Tilt runs on the local machine, so you need to deploy an NFS storage class.

By the way, Lakshmi's solution does not work for me.

@Lakshmi2k1
Author

@sudhir649 @BadTorro
Deploying the new version of etcd worked for me initially, but whenever the node is scaled down and back up, one of the etcd pods goes into CrashLoopBackOff. Since the other two etcd pods kept running, this did not affect route and upstream creation. Still, we are about to use it in a production environment, so I wish there were a permanent fix. After reading the solutions you pointed out, @sudhir649, I have a few questions.
1. Won't deleting the PVCs cause loss of data that APISIX needs?
2. The second solution seems worth a try. Regarding (n-1)/2: in my case the number of etcd replicas is 3, so by this rule the disaster recovery cron job should run and back up the PVC even if only one pod is crashing. But as you mentioned, nothing changed when two pods were crashing, and all three pods came back to the Running state only when the third pod also crashed. In my case only one, or rarely two, pods crash. If you have any input, let me know. Thanks in advance!

@Lakshmi2k1 Lakshmi2k1 reopened this Jul 27, 2024
@github-project-automation github-project-automation bot moved this from ✅ Done to 📋 Backlog in Apache APISIX backlog Jul 27, 2024
@sudhir649

@Lakshmi2k1

  1. Whether deleting the PVC is safe depends on what data you are storing in it. In my case, and generally, we store only the routes, so if I delete the PVC, they are restored once a new PVC is created.

  2. The documentation says "more than (n-1)/2". It means that when more than 1 pod fails (at least 2 if you have 3 etcd pods), the pods automatically try to recover. Recently all the pods were down in our QA environment, so it's better to implement disaster recovery.
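
The (n-1)/2 arithmetic discussed here can be checked with a few lines of shell. This is a minimal sketch of the general majority-quorum rule, not code from the chart; `quorum` and `fault_tolerance` are hypothetical helper names.

```shell
# quorum N: smallest majority of an N-member cluster, floor(N/2) + 1.
quorum() { echo $(( $1 / 2 + 1 )); }

# fault_tolerance N: members that may fail while a majority survives,
# floor((N-1)/2) using integer division.
fault_tolerance() { echo $(( ($1 - 1) / 2 )); }

for n in 1 3 5; do
  echo "members=$n quorum=$(quorum "$n") tolerated_failures=$(fault_tolerance "$n")"
done
```

For 3 replicas this gives a quorum of 2 and 1 tolerated failure, so "more than (n-1)/2" means at least 2 failed pods, matching the explanation above.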

@Lakshmi2k1
Author

@sudhir649
Thanks Sudhir, I'll try the same from my end.

@Lakshmi2k1
Author

@sudhir649, we are facing one more error in APISIX. We use the openid-connect plugin for authentication and authorization in the ApisixPluginConfig. When we hit the application's ingress, it returns a 431 (Request Header Fields Too Large) error. We tried removing a few headers, but that broke the application's UI. Is there a way to solve this? Have you come across a similar issue before?

@Thilip707

> @sudhir649, we are facing one more error in APISIX. We use the openid-connect plugin for authentication and authorization in the ApisixPluginConfig. When we hit the application's ingress, it returns a 431 (Request Header Fields Too Large) error. We tried removing a few headers, but that broke the application's UI. Is there a way to solve this? Have you come across a similar issue before?

Did you use nginx? If you do, add this to the nginx configuration:
client_max_body_size 2G;

@Lakshmi2k1
Author

@Thilip707

We are using the configuration below in the APISIX ConfigMap, as mentioned in the docs:
image

@Thilip707

Just increase the client size and check; it will work.

@BadTorro

BadTorro commented Jul 30, 2024

> @sudhir649 @BadTorro
> Deploying the new version of etcd worked for me initially, but whenever the node is scaled down and back up, one of the etcd pods goes into CrashLoopBackOff. Since the other two etcd pods kept running, this did not affect route and upstream creation. Still, we are about to use it in a production environment, so I wish there were a permanent fix. After reading the solutions you pointed out, @sudhir649, I have a few questions.
> 1. Won't deleting the PVCs cause loss of data that APISIX needs?
> 2. The second solution seems worth a try. Regarding (n-1)/2: in my case the number of etcd replicas is 3, so by this rule the disaster recovery cron job should run and back up the PVC even if only one pod is crashing. But as you mentioned, nothing changed when two pods were crashing, and all three pods came back to the Running state only when the third pod also crashed. In my case only one, or rarely two, pods crash. If you have any input, let me know. Thanks in advance!

Regarding that, I managed to get it to work by basically:

  • Deploying the Longhorn storage solution to the cluster
  • Configuring Rancher Desktop based on this guide to have open-iscsi in place and usable
  • Changing the storageClass in the dedicated etcd sub-chart's values.yaml file to "longhorn":
persistence:
  enabled: true
  storageClass: "longhorn"
  • Starting everything with "tilt up"

It currently keeps running and has not crashed since.
However, we are now also checking whether the Bitnami chart runs out of the box...

@Lakshmi2k1
Author

> @Lakshmi2k1
>
>   1. Whether deleting the PVC is safe depends on what data you are storing in it. In my case, and generally, we store only the routes, so if I delete the PVC, they are restored once a new PVC is created.
>   2. The documentation says "more than (n-1)/2". It means that when more than 1 pod fails (at least 2 if you have 3 etcd pods), the pods automatically try to recover. Recently all the pods were down in our QA environment, so it's better to implement disaster recovery.

I have enabled disaster recovery and deployed the Helm chart, but this time not only was etcd crashing: the APISIX pod was stuck in its init container, the APISIX ingress controller was crashing, and the snapshot pod was in an error state. After observing that the pod status did not change for a long time, I rolled back to the previous revision.

@sudhir649

> > @sudhir649 @BadTorro
> > Deploying the new version of etcd worked for me initially, but whenever the node is scaled down and back up, one of the etcd pods goes into CrashLoopBackOff. Since the other two etcd pods kept running, this did not affect route and upstream creation. Still, we are about to use it in a production environment, so I wish there were a permanent fix. After reading the solutions you pointed out, @sudhir649, I have a few questions.
> > 1. Won't deleting the PVCs cause loss of data that APISIX needs?
> > 2. The second solution seems worth a try. Regarding (n-1)/2: in my case the number of etcd replicas is 3, so by this rule the disaster recovery cron job should run and back up the PVC even if only one pod is crashing. But as you mentioned, nothing changed when two pods were crashing, and all three pods came back to the Running state only when the third pod also crashed. In my case only one, or rarely two, pods crash. If you have any input, let me know. Thanks in advance!
>
> Regarding that, I managed to get it to work by basically:
>
>   • Deploying the Longhorn storage solution to the cluster
>   • Configuring Rancher Desktop based on this guide to have open-iscsi in place and usable
>   • Changing the storageClass in the dedicated etcd sub-chart's values.yaml file to "longhorn":
> persistence:
>   enabled: true
>   storageClass: "longhorn"
>   • Starting everything with "tilt up"
>
> It currently keeps running and has not crashed since. However, we are now also checking whether the Bitnami chart runs out of the box...

Hi @BadTorro, how was your experience after deploying disaster recovery? For us it's working fine, so we are replicating it in all the envs.

Regards,
Sudhir

@sudhir649

> > @Lakshmi2k1
> >
> >   1. Whether deleting the PVC is safe depends on what data you are storing in it. In my case, and generally, we store only the routes, so if I delete the PVC, they are restored once a new PVC is created.
> >   2. The documentation says "more than (n-1)/2". It means that when more than 1 pod fails (at least 2 if you have 3 etcd pods), the pods automatically try to recover. Recently all the pods were down in our QA environment, so it's better to implement disaster recovery.
>
> I have enabled disaster recovery and deployed the Helm chart, but this time not only was etcd crashing: the APISIX pod was stuck in its init container, the APISIX ingress controller was crashing, and the snapshot pod was in an error state. After observing that the pod status did not change for a long time, I rolled back to the previous revision.

@Lakshmi2k1 The problem you encountered has nothing to do with disaster recovery. I have not experienced this problem.

@minedetector

We had a very similar issue.
All of our etcd pods were going into CrashLoopBackOff and logged only this warning:

Cluster not healthy, not adding self to cluster for now, keeping trying...

We created APISIX and etcd through the Helm chart. For us the issue was that, even though we re-created the StatefulSet and deleted the PVCs for a fresh start, the ETCD_INITIAL_CLUSTER_STATE environment variable was still set to existing.

We changed it to new, scaled the StatefulSet to 0 and then back to 3, and it started working for us.
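
The steps described here might look roughly like the sketch below. It assumes the namespace `apisix` and StatefulSet name `apisix-etcd` (inferred from the pod hostnames earlier in this thread) and 3 replicas; `run` and `reset_etcd_cluster_state` are hypothetical helper names. A change made with `kubectl set env` is likely to be overwritten by the next Helm upgrade, so setting the chart's own value for the initial cluster state, if your chart version exposes one, would be the more durable route.

```shell
NS="${NS:-apisix}"          # assumption: namespace of the release
STS="${STS:-apisix-etcd}"   # assumption: name of the etcd StatefulSet

# Print each command by default; set APPLY=1 to actually run it.
run() {
  if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}

reset_etcd_cluster_state() {
  # Bootstrap as a new cluster instead of trying to join an existing one.
  run kubectl -n "$NS" set env "statefulset/$STS" ETCD_INITIAL_CLUSTER_STATE=new
  # Scale down; wait until all etcd pods are gone before scaling back up.
  run kubectl -n "$NS" scale "statefulset/$STS" --replicas=0
  run kubectl -n "$NS" scale "statefulset/$STS" --replicas=3
}
```

Running `reset_etcd_cluster_state` without `APPLY=1` prints the three commands for review. Note that the comment above also mentions deleting the PVCs first, so this applies to a cluster whose data can be discarded.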

@pietrogoddibit2win

The problem is the same as in bitnami/charts#16069.
When a pod listed in ETCD_INITIAL_CLUSTER is scheduled on a new node, it starts with an empty PVC, so the pod can no longer join the cluster.

@serhiikucherenko

JFYI: removeMemberOnContainerTermination: false didn't help me on an AKS cluster, v1.28.9
(docker.io/bitnami/etcd:3.5.10-debian-11-r2):
{"level":"warn","ts":"2024-09-03T11:26:04.835351Z","caller":"etcdserver/server.go:1127","msg":"server error","error":"the member has been permanently removed from the cluster"}
Any ideas? :-)

@pietrogoddibit2win

> JFYI: removeMemberOnContainerTermination: false didn't help me on an AKS cluster, v1.28.9
> (docker.io/bitnami/etcd:3.5.10-debian-11-r2):
> {"level":"warn","ts":"2024-09-03T11:26:04.835351Z","caller":"etcdserver/server.go:1127","msg":"server error","error":"the member has been permanently removed from the cluster"}
> Any ideas? :-)

It didn't help on GKE either.

@BadTorro

BadTorro commented Sep 3, 2024

> Hi @BadTorro, how was your experience after deploying disaster recovery? For us it's working fine, so we are replicating it in all the envs.
>
> Regards, Sudhir

Apologies for the late reply, @sudhir649. We ended up using the Bitnami chart and customized it to our needs.

@pietrogoddibit2win

Any news about this problem?

@gustysap

gustysap commented Nov 5, 2024

Any solutions for this issue? I am also facing the same issue.

@Joeydelarago

I had this issue running on GKE. Uninstalling and deleting the etcd PVCs, then reinstalling fixed the issue.

@gustysap

gustysap commented Nov 7, 2024

Hi @Joeydelarago, when you delete etcd, what data is lost? Are all the routes gone, or do they still exist?

@Joeydelarago

> Hi @Joeydelarago, when you delete etcd, what data is lost? Are all the routes gone, or do they still exist?

I actually encountered this when setting up a new environment, so it was not a concern for me. Also, I have done my configuration via YAML files instead of the API.

If you need the data, you can always mount the PVCs to a dummy deployment and copy the files to your local machine with kubectl cp. Then delete the PVCs and apply APISIX again, and use kubectl cp to copy the important files to the newly created PVCs. I can't guarantee it will work, though.

@Joeydelarago

Joeydelarago commented Dec 19, 2024

This issue started occurring again for me, so I ended up installing etcd separately.

I installed etcd on its own. The only change I made to the etcd values.yaml was increasing the replica count from 1 to 3.

helm install etcd-apisix bitnami/etcd \
  --namespace <NAMESPACE> \
  --values etcd-values.yaml

Then I updated the apisix values.yaml and did a helm upgrade apisix.

externalEtcd:
  host:
    - http://etcd-apisix.<NAMESPACE>.svc.cluster.local:2379
  user: root
  existingSecret: "etcd-apisix"
  secretPasswordKey: "etcd-root-password"
...

etcd:
  enabled: false

My solution for the issue is still the same as I mentioned above: helm uninstall, delete the etcd PVCs, helm reinstall. However, with etcd separated, this can be done without taking down APISIX.
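
The uninstall, delete, reinstall cycle for the separate etcd release could be scripted roughly as follows. This is a sketch under assumptions: the release name `etcd-apisix` and values file `etcd-values.yaml` come from the command above, while the PVC names `data-etcd-apisix-<ordinal>` and the replica count of 3 are guesses to confirm with `kubectl get pvc` first; `run` and `reinstall_external_etcd` are hypothetical helper names. Deleting the PVCs discards all data stored in etcd.

```shell
# Print each command by default; set APPLY=1 to actually run it.
run() {
  if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}

# reinstall_external_etcd NAMESPACE
reinstall_external_etcd() {
  ns="$1"
  run helm uninstall etcd-apisix --namespace "$ns"
  # Assumed PVC naming; this step destroys the etcd data.
  for i in 0 1 2; do
    run kubectl -n "$ns" delete pvc "data-etcd-apisix-$i"
  done
  run helm install etcd-apisix bitnami/etcd --namespace "$ns" --values etcd-values.yaml
}
```

Without `APPLY=1` the function only prints the five commands, which makes it easy to check the PVC names before anything is deleted.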

Edit: The experimental composite architecture simulates etcd instead. Perhaps by the time you are reading this it is stable: https://apisix.apache.org/docs/ingress-controller/composite/
