You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
As a cluster-admin I want to onboard and eventually terminate tenants in our cluster by creating and eventually deleting a root namespace for the tenant. The onboarding process works like a charm, but I am not too happy about the termination process - which we had an opportunity to try out this week. Let me explain:
We use a GitOps process where tenants are onboarded via a pull request to our "tenants" project. A tenant is in this project defined by some simple resources:
An Accurate root namespace with some templated resources defined in our tenant-template Accurate template namespace.
A Flux gitops-reconciler service account granted admin permissions in the tenant root namespace. Both the SA and admin role binding are configured with propagation to sub-namespaces. This allows the tenant to use Flux to provision most resources using a GitOps process in their namespace tree.
Flux GitRepository and Kustomization resources pointing to a Git project controlled by the tenant - allowing the tenant to bootstrap their resources.
So far, so good. But this week we received a request for a tenant termination. Using a modern GitOps tool like Flux, with pruning enabled, we thought that it was as simple as reverting the onboarding PR in Git. So we did that, after getting the PR approved by the tenant responsible. What we forgot to do, was to check if there were sub-namespaces defined under the tenant root namespace. After merging the tenant termination PR, Flux immediately reported an error: It got (correctly) blocked by the Accurate namespace webhook:
But this error was reported only once, which kind of surprised me - as Flux is usually constantly reconciling until the actual state equals the desired state. And as I suspected, the tenant namespace was still present - including the child namespaces that we did not think about. However the Flux controller resource had no knowledge of the resources it used to control anymore, which left the tenant root namespace (including children) as orphans in our cluster. This is something we are trying hard to avoid.
After cleaning up manually, I reached out to the Flux maintainers on Slack, and you can read all the details in the CNCF Slack thread - if you are interested. TL;DR: Flux maintainers think this is an issue with Accurate, and I tend to agree now.
To fix this for future tenant terminations in our clusters, I suggest adding an opt-in allowing us to configure the Accurate namespace webhook allowing cascade namespace DELETE requests.
How
I have some ideas after looking at the code, but please let me know what you think first! I will update this description once we agree on a solution. I'll be happy to submit a PR fixing this.
Checklist
Finish implementation of the issue
Test all functions
Have enough logs to trace activities
Notify developers of necessary actions
The text was updated successfully, but these errors were encountered:
The easiest way to implement this would be to configure an Accurate installation to allow cascading deletes. It would be better if it could be expressed in some declarative way per namespace/sub-namespace, but I fear the complexities in an implementation of this. Looking at the docs for HNC in this area, it's not clear to me how this works - even after reading it multiple times. 😄
My ideal solution would be to allow cascading deletes of namespaces in general, which is admin area anyway. And block end-users from deleting SubNamespace with children by default. WDYT? Would it be possible to implement without adding a lot of complexity?
My ideal solution would be to allow cascading deletes of namespaces in general, which is admin area anyway. And block end-users from deleting SubNamespace with children by default
I don't think it is a good idea to cascade deletions by default. Because it is breaking changes and very dangerous.
It would not be complicated if the implementation were to allow cascade deletion only for the cluster administrator.
What
As a cluster-admin I want to onboard and eventually terminate tenants in our cluster by creating and eventually deleting a root namespace for the tenant. The onboarding process works like a charm, but I am not too happy about the termination process - which we had an opportunity to try out this week. Let me explain:
We use a GitOps process where tenants are onboarded via a pull request to our "tenants" project. A tenant is in this project defined by some simple resources:
tenant-template
Accurate template namespace.So far, so good. But this week we received a request for a tenant termination. Using a modern GitOps tool like Flux, with pruning enabled, we thought that it was as simple as reverting the onboarding PR in Git. So we did that, after getting the PR approved by the tenant responsible. What we forgot to do, was to check if there were sub-namespaces defined under the tenant root namespace. After merging the tenant termination PR, Flux immediately reported an error: It got (correctly) blocked by the Accurate namespace webhook:
But this error was reported only once, which kind of surprised me - as Flux is usually constantly reconciling until the actual state equals the desired state. And as I suspected, the tenant namespace was still present - including the child namespaces that we did not think about. However the Flux controller resource had no knowledge of the resources it used to control anymore, which left the tenant root namespace (including children) as orphans in our cluster. This is something we are trying hard to avoid.
After cleaning up manually, I reached out to the Flux maintainers on Slack, and you can read all the details in the CNCF Slack thread - if you are interested. TL;DR: Flux maintainers think this is an issue with Accurate, and I tend to agree now.
To fix this for future tenant terminations in our clusters, I suggest adding an opt-in allowing us to configure the Accurate namespace webhook allowing cascade namespace
DELETE
requests.How
I have some ideas after looking at the code, but please let me know what you think first! I will update this description once we agree on a solution. I'll be happy to submit a PR fixing this.
Checklist
The text was updated successfully, but these errors were encountered: