Zero-downtime upgrades of Zitadel appear broken: the ingress returns "502 Bad Gateway" for a couple of seconds while a rolling update is in progress
#282
Preflight Checklist
Describe your problem
PROBLEM: When a Zitadel pod is restarted, stopped, or updated to a new version (strategy: RollingUpdate), Kubernetes sends a SIGTERM signal to gracefully stop the main Zitadel container, and Zitadel exits immediately. Because the exit is instantaneous, the Kubernetes endpoints controller can (and usually does) update the endpoints BEFORE the ingress controller detects that change, so for a couple of seconds a small amount of traffic is still routed to the already-terminated pod. (The pod lifecycle controller and the endpoints controller are two separate control loops in Kubernetes; since they operate independently, a slight lag between them is possible.) As a side effect, you see multiple "502 Bad Gateway" errors when reaching the Zitadel endpoints during that window, so a fully zero-downtime stop/restart/update is currently not possible.
HOW TO REPRODUCE: Run k6 performance tests (VUs > 15) and, at the same time, restart or upgrade the whole Zitadel deployment (pods are replaced one by one). You will see a few "502 Bad Gateway" errors caused by the race described above.
Describe your ideal solution
Implement a configurable exit delay for the Zitadel container: once the graceful SIGTERM signal is received, keep the process alive long enough for the ingress and endpoints controllers to finish their work independently before the container finally exits. This exact approach was tested and confirmed to resolve the problem.
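A minimal sketch of the requested behaviour, not Zitadel's actual startup code; the `ZITADEL_SHUTDOWN_DELAY` variable name is hypothetical and only illustrates how a configurable delay between SIGTERM and shutdown could look:

```go
package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	// Hypothetical config knob; Zitadel would expose this via its own configuration.
	delay := 10 * time.Second
	if v := os.Getenv("ZITADEL_SHUTDOWN_DELAY"); v != "" {
		if d, err := time.ParseDuration(v); err == nil {
			delay = d
		}
	}

	srv := &http.Server{Addr: ":8080"}
	go func() {
		if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			log.Fatal(err)
		}
	}()

	// Wait for SIGTERM from the kubelet (or Ctrl+C locally).
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGTERM, os.Interrupt)
	<-stop

	// Keep serving during the delay so the endpoints controller and the ingress
	// controller both have time to drop this pod before it stops accepting traffic.
	log.Printf("SIGTERM received, delaying shutdown for %s", delay)
	time.Sleep(delay)

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	_ = srv.Shutdown(ctx)
}
```

Until such an option exists, the same effect can usually be approximated on the Kubernetes side with a preStop sleep hook and a correspondingly increased terminationGracePeriodSeconds, but a built-in, configurable delay would make the chart work out of the box.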
DISCUSSION with final solution confirmation: https://discord.com/channels/927474939156643850/1287680005869928490
Version
v2.55.0
Environment
Self-hosted
Additional Context
No response