
Zero-downtime upgrades of Zitadel appear broken: the ingress returns "502 Bad Gateway" errors for a couple of seconds while rolling updates are in progress #282

Open
roman-aleksejuk-telia opened this issue Oct 14, 2024 · 0 comments
roman-aleksejuk-telia commented Oct 14, 2024

Preflight Checklist

  • I could not find a solution in the existing issues, docs, nor discussions
  • I have joined the ZITADEL chat

Describe your problem

PROBLEM: When a Zitadel pod is restarted, stopped, or its version is updated (strategy: rollingUpdate), Kubernetes sends a SIGTERM signal to gracefully stop the main Zitadel container, and Zitadel exits immediately. Because this happens instantly, the Kubernetes endpoints controller can (and usually does) update the endpoints BEFORE the ingress controller detects that change, so for a couple of seconds a small amount of traffic is still sent to the already-exited, non-existent pod. (The pod lifecycle controller and the endpoints controller are two separate control loops in Kubernetes; since they operate independently, a slight lag between them is possible.) As a side effect, you will see multiple "502 Bad Gateway" errors when trying to reach the Zitadel endpoints during that window. So a truly zero-downtime stop/restart/update is currently not possible.
HOW TO REPRODUCE: Start k6 performance tests (VUs > 15) and, at the same time, restart or update the whole Zitadel deployment (pods are replaced one by one). You will see a few "502 Bad Gateway" errors caused by the race described above; see the reproduction sketch below.
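For illustration, here is a minimal Go poller that can stand in for the k6 test: it repeatedly requests a Zitadel endpoint and counts errors/5xx responses while you run a rolling restart (e.g. `kubectl rollout restart deployment/<zitadel>`) in another terminal. The URL, health path, and timing values are assumptions, not from the original report; adjust them to your ingress.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Assumed ingress host and health path; replace with your own Zitadel URL.
	const url = "https://zitadel.example.com/debug/healthz"
	client := &http.Client{Timeout: 2 * time.Second}
	var total, failures int

	// Poll for two minutes while a rolling restart of the Zitadel deployment runs elsewhere.
	for start := time.Now(); time.Since(start) < 2*time.Minute; {
		resp, err := client.Get(url)
		total++
		if err != nil || resp.StatusCode >= 500 {
			failures++ // 502s from the ingress show up here during the race window
		}
		if resp != nil {
			resp.Body.Close()
		}
		time.Sleep(100 * time.Millisecond)
	}
	fmt.Printf("requests: %d, errors/5xx: %d\n", total, failures)
}
```

With a graceful rollout, `failures` should stay at 0; with the behavior described above, a handful of errors appear around each pod replacement.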

Describe your ideal solution

Implement a configurable exit delay for the Zitadel container once the graceful SIGTERM signal is received, so that the ingress and endpoint controllers can finish their work independently before the Zitadel container finally exits. This exact approach was tested and confirmed to resolve the issue.
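A minimal sketch of what such a delay could look like in Go, assuming a hypothetical ZITADEL_SHUTDOWN_DELAY environment variable (this is not an existing Zitadel option, and the code below is not Zitadel's actual shutdown path):

```go
package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	srv := &http.Server{Addr: ":8080"}
	go func() { _ = srv.ListenAndServe() }()

	// Wait for the kubelet's SIGTERM.
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGTERM, os.Interrupt)
	<-stop

	// Keep serving during a configurable delay so the endpoints controller and
	// the ingress controller both have time to remove this pod from rotation.
	// ZITADEL_SHUTDOWN_DELAY is a hypothetical name used only for this sketch.
	delay := 10 * time.Second
	if v := os.Getenv("ZITADEL_SHUTDOWN_DELAY"); v != "" {
		if d, err := time.ParseDuration(v); err == nil {
			delay = d
		}
	}
	log.Printf("SIGTERM received, delaying shutdown by %s", delay)
	time.Sleep(delay)

	// Drain in-flight requests, then exit.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	_ = srv.Shutdown(ctx)
}
```

A commonly used Kubernetes-side workaround is a preStop hook that sleeps before SIGTERM is delivered, which achieves a similar effect from the chart side; a built-in delay as described above would make the behavior configurable in Zitadel itself.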

DISCUSSION with final solution confirmation: https://discord.com/channels/927474939156643850/1287680005869928490

Version

v2.55.0

Environment

Self-hosted

Additional Context

No response

@eliobischof eliobischof transferred this issue from zitadel/zitadel Dec 3, 2024
@hifabienne hifabienne moved this to 📨 Product Backlog in Product Management Dec 3, 2024
@hifabienne hifabienne moved this from 📨 Product Backlog to 🧐 Investigating in Product Management Dec 3, 2024