Zero-downtime upgrades of Zitadel appear broken: the ingress returns "502 Bad Gateway" for a couple of seconds while a rolling update is in progress
#282
Preflight Checklist
Describe your problem
PROBLEM: When a Zitadel pod is restarted, stopped, or updated to a new version (strategy: RollingUpdate), Kubernetes sends a SIGTERM signal to gracefully stop the main Zitadel container, and Zitadel exits immediately. Because the exit is instantaneous, the Kubernetes endpoints controller can (and usually does) update the endpoints BEFORE the ingress controller detects that change, so for a couple of seconds a small amount of traffic is still routed to the already-terminated pod. (The pod lifecycle controller and the endpoints controller are two separate control loops in Kubernetes; since they operate independently, a slight lag between them is possible.) As a side effect, you see multiple "502 Bad Gateway" errors when reaching the Zitadel endpoints during that window, so a fully zero-downtime stop/restart/update is currently not possible.
HOW TO REPRODUCE: Run k6 performance tests (VUs > 15) and, at the same time, restart or upgrade the whole Zitadel deployment (pods are replaced one by one). You will see a few "502 Bad Gateway" errors caused by the race described above.
Describe your ideal solution
Implement a configurable exit delay for the Zitadel container: once the graceful SIGTERM signal is received, keep the process alive long enough for the ingress and endpoints controllers to finish their work independently before the container finally exits. This exact approach was tested and confirmed to resolve the problem.
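A minimal sketch of the requested behaviour, not Zitadel's actual startup code; the `ZITADEL_SHUTDOWN_DELAY` variable name is hypothetical and only illustrates how a configurable delay between SIGTERM and shutdown could look:

```go
package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	// Hypothetical config knob; Zitadel would expose this via its own configuration.
	delay := 10 * time.Second
	if v := os.Getenv("ZITADEL_SHUTDOWN_DELAY"); v != "" {
		if d, err := time.ParseDuration(v); err == nil {
			delay = d
		}
	}

	srv := &http.Server{Addr: ":8080"}
	go func() {
		if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			log.Fatal(err)
		}
	}()

	// Wait for SIGTERM from the kubelet (or Ctrl+C locally).
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGTERM, os.Interrupt)
	<-stop

	// Keep serving during the delay so the endpoints controller and the ingress
	// controller both have time to drop this pod before it stops accepting traffic.
	log.Printf("SIGTERM received, delaying shutdown for %s", delay)
	time.Sleep(delay)

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	_ = srv.Shutdown(ctx)
}
```

Until such an option exists, the same effect can usually be approximated on the Kubernetes side with a preStop sleep hook and a correspondingly increased terminationGracePeriodSeconds, but a built-in, configurable delay would make the chart work out of the box.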
DISCUSSION with final solution confirmation: https://discord.com/channels/927474939156643850/1287680005869928490
Version
v2.55.0
Environment
Self-hosted
Additional Context
No response