diff --git a/keps/prod-readiness/sig-api-machinery/5647.yaml b/keps/prod-readiness/sig-api-machinery/5647.yaml
new file mode 100644
index 00000000000..833cfba116c
--- /dev/null
+++ b/keps/prod-readiness/sig-api-machinery/5647.yaml
@@ -0,0 +1,3 @@
kep-number: 5647
alpha:
  approver: "@jpbetz"
\ No newline at end of file
diff --git a/keps/sig-api-machinery/5647-stale-controller-handling/README.md b/keps/sig-api-machinery/5647-stale-controller-handling/README.md
new file mode 100644
index 00000000000..eabf5fb7ee0
--- /dev/null
+++ b/keps/sig-api-machinery/5647-stale-controller-handling/README.md
@@ -0,0 +1,827 @@
# KEP-5647: Stale Controller Detection and Mitigation

- [Release Signoff Checklist](#release-signoff-checklist)
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [User Stories (Optional)](#user-stories-optional)
    - [Story 1](#story-1)
    - [Story 2](#story-2)
  - [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
  - [Informer and Cache Update](#informer-and-cache-update)
  - [Staleness Mitigation in Controllers](#staleness-mitigation-in-controllers)
  - [Test Plan](#test-plan)
    - [Prerequisite testing updates](#prerequisite-testing-updates)
    - [Unit tests](#unit-tests)
    - [Integration tests](#integration-tests)
    - [e2e tests](#e2e-tests)
  - [Graduation Criteria](#graduation-criteria)
    - [Alpha](#alpha)
    - [Beta](#beta)
  - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
  - [Version Skew Strategy](#version-skew-strategy)
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
  - [Feature Enablement and Rollback](#feature-enablement-and-rollback)
  - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
  - [Monitoring Requirements](#monitoring-requirements)
  - [Dependencies](#dependencies)
  - [Scalability](#scalability)
  - [Troubleshooting](#troubleshooting)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
- [Infrastructure Needed (Optional)](#infrastructure-needed-optional)

## Release Signoff Checklist

Items marked with (R) are required *prior to targeting to a milestone / release*.

- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
- [ ] (R) KEP approvers have approved the KEP status as `implementable`
- [ ] (R) Design details are appropriately documented
- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
  - [ ] e2e Tests for all Beta API Operations (endpoints)
  - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
  - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
- [ ] (R) Graduation criteria is in place
  - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
- [ ] (R) Production readiness review completed
- [ ] (R) Production readiness review approved
- [ ] "Implementation History" section is up-to-date for milestone
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

[kubernetes.io]: https://kubernetes.io/
[kubernetes/enhancements]: https://git.k8s.io/enhancements
[kubernetes/kubernetes]: https://git.k8s.io/kubernetes
[kubernetes/website]: https://git.k8s.io/website

## Summary

We propose a way to mitigate staleness within Kubernetes controllers,
particularly those located in `kube-controller-manager` (KCM). When operating
at large scale, controllers can fall behind the state of the apiserver and make
decisions based on outdated information. This proposal introduces a generalized
mechanism for controllers to detect how stale their local cache is and to hold
off reconciliation until the cache catches up, restoring consistency.

## Motivation

Every controller operates on a local cache that is populated by watching for
changes from the apiserver. By its nature, this watch stream is eventually
consistent and provides no guarantee of how far behind the "live" state of the
apiserver it is. A change event might arrive within milliseconds or, under other
circumstances, be delayed by seconds or even minutes. This means every
controller, at all times, is operating on a potentially outdated view of the
cluster's state.

The issue this KEP addresses is that operators currently have no visibility into
this lag. It is impossible to distinguish between normal delays and a controller
that has fallen dangerously out of sync. A controller may keep trying to enact
its desired state based on an outdated view of the world, causing spurious
reconciles that do not reflect the user's intent.

This proposal aims to mitigate the issues that can arise when a controller falls
behind. With the ability to [compare resource
versions](https://github.com/kubernetes/enhancements/pull/5505), we can now
reason about the "age" of a cached resource and run operations with that in
mind.

### Goals

The goal of this KEP is to document the stale-controller problem and propose
solutions to help detect and mitigate it.
We propose enforcing stricter cache semantics at certain points in a
controller's reconcile cycle.

### Non-Goals

This is not intended to enforce consistency guarantees on all controllers. The
changes described here will be opt-in for each controller and will require some
changes to controller logic. It is not a goal to require that every controller
observe a fully consistent state of the world on each reconcile.

## Proposal

We propose giving controllers the ability to see whether their own writes have
made it back from the apiserver into their caches. This allows a controller to
skip reconciling an object until it knows whether the previous write has been
propagated to its cache. We will do this by adding plumbing to the underlying
cached informer and giving controllers a way to know the resource versions of
the objects they subscribe to.

Once this is in place, we will onboard selected controllers to use the newly
exposed resource versions, skipping and requeueing reconciles until the objects
they wrote to are reflected in their caches.

### User Stories (Optional)

#### Story 1

I am a Kubernetes cluster administrator. I want to be sure that my controllers
do not write too frequently. I enable this feature so that I know my controllers
only write when they are up to date.

#### Story 2

I am a controller author. I want objects I wrote during a previous reconcile to
be guaranteed present in my cache before I reconcile again, so that my reads are
up to date. I use the newly provided frameworks to ensure those objects are up
to date, mitigating the risk that I am operating on stale data.

### Risks and Mitigations

The main risk comes from skipping reconciles: if the bookkeeping is incorrect
and we do not resume properly, a controller could stop reconciling an object
entirely. We will feature-gate this and add a set of rigorous tests prior to
Beta/GA to ensure the feature is well tested and not missing edge cases. Any
time we optimize the reconcile loop, issues like this can arise.

There are also edge cases to account for, such as controller restarts forcing
the cache to be resynced. We will need to handle scenarios like this and ensure
that the controller regains a consistent view of the world afterwards. While
these changes will not make existing behavior worse, anyone implementing and
depending on the ability to read their own writes will need to ensure that such
situations remain consistent. Likely, we will need solutions to
https://github.com/kubernetes/kubernetes/issues/59848#issuecomment-2842495398
before controllers can fully rely on their watch cache on restart. The same
ideas discussed here can likely be applied to provide a consistent cache.

## Design Details

Our proposal consists of two parts. The first is the change that lets
controllers be fully informed of the current resource versions of objects in
their caches. The second is the use of that new ability to enforce
read-after-write guarantees on the objects the controller cares about.

### Informer and Cache Update

We will update the `ResourceEventHandlerFuncs` in
`staging/src/k8s.io/client-go/tools/cache/controller.go` to have one more
function.

```
type ResourceEventHandlerFuncs struct {
	AddFunc      func(obj interface{})
	UpdateFunc   func(oldObj, newObj interface{})
	DeleteFunc   func(obj interface{})
	BookmarkFunc func(resourceVersion string) // <--- NEW
}
```

The Add, Update, and Delete functions are already nearly enough to track the
lifecycle of objects, but certain tracking edge cases make the Bookmark function
necessary. Controllers will use all of these functions to track when they are
able to reconcile again after their prior writes.

The Bookmark function's only purpose is to inform listeners that the watch has
progressed to the given resource version. This is necessary to fully track the
lifecycle of resources: without it, a controller may not be able to tell whether
an object has not yet been added to the cache or whether the object was added
and deleted without the cache tracking it.

### Staleness Mitigation in Controllers

With the informer changes in place, controllers need mechanisms to actively
prevent issues caused by stale data. The most critical failures occur when a
controller makes a significant decision, such as deleting a Pod or scaling a
resource, based on outdated information from its local cache. To address this,
we will introduce read-after-write guarantees at these critical decision points.

We will begin by implementing targeted fixes directly within the controllers
that are most sensitive to staleness, modifying their core processing loops. The
new approach tracks the resourceVersion of key resources after each successful
write operation and stores this last-written resourceVersion in memory. We will
effectively store a mapping from the object being reconciled to a tuple of
resourceVersions for each resource type we wish to track. Once the cache reads
have progressed past the latest write for all tracked resources, we will allow
the reconcile for that object to proceed. This only skips and requeues
reconciles for the specific objects that were written to in a previous
reconcile; it is not a global lock on all reconciles. By doing so, we ensure
that we make as much progress as possible until we actually need an updated
cache.

The DaemonSet controller provides an example. Any time the DaemonSet controller
writes to Pods, we will store the mapping from DaemonSet to Pod resource
version. At the same time, we will add a tracker to the controller that records
the latest resource version seen from the Pod cache: on any informer event, we
update the seen RV if it is greater than what is currently stored. We will
provide a helper function `isReady(dsKey string)` that queries the tracker and
returns true only if the resource version of that DaemonSet's latest Pod write
is at or behind the latest observed read. We can optimize other parts in
similar ways, such as writes to DaemonSet status, but they all follow the same
pattern.

Lastly, inside the controller's `processNextWorkItem` function, we will check
whether the DaemonSet is ready using the helper function and requeue without
running the reconciliation routine if the DaemonSet is not yet ready to be
worked on because the cache still needs to catch up. In that case, we will
requeue the object the same way as if an error had occurred.
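
Below is a minimal Go sketch of the tracking described above. It is
illustrative only: `writeRVTracker`, `RecordWrite`, `ObserveRV`, and
`rvNewerOrEqual` are hypothetical names rather than an existing client-go or
kube-controller-manager API, `isReady` mirrors the helper mentioned above, and
the numeric resource-version comparison is a stand-in for the separately
proposed comparison semantics (resource versions are opaque strings today).

```go
// Hypothetical sketch of the per-DaemonSet staleness tracker described above.
package main

import (
	"fmt"
	"strconv"
	"sync"
)

// writeRVTracker remembers, per DaemonSet key, the Pod resourceVersion
// returned by the controller's most recent Pod write, alongside the latest
// resourceVersion observed from the Pod informer (events or bookmarks).
type writeRVTracker struct {
	mu           sync.Mutex
	lastWrite    map[string]string // dsKey -> RV returned by the last Pod write
	lastObserved string            // newest Pod RV the cache has seen
}

func newWriteRVTracker() *writeRVTracker {
	return &writeRVTracker{lastWrite: map[string]string{}}
}

// RecordWrite is called after a successful Pod create/update/delete issued on
// behalf of the given DaemonSet.
func (t *writeRVTracker) RecordWrite(dsKey, rv string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.lastWrite[dsKey] = rv
}

// ObserveRV is called from the Pod informer's Add/Update/Delete handlers and
// from the proposed BookmarkFunc, advancing the newest RV the cache has seen.
func (t *writeRVTracker) ObserveRV(rv string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if rvNewerOrEqual(rv, t.lastObserved) {
		t.lastObserved = rv
	}
}

// isReady reports whether the Pod cache has caught up to the last write made
// for this DaemonSet, i.e. whether it is safe to reconcile it again.
func (t *writeRVTracker) isReady(dsKey string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	w, ok := t.lastWrite[dsKey]
	if !ok {
		return true // no outstanding write to wait for
	}
	return rvNewerOrEqual(t.lastObserved, w)
}

// rvNewerOrEqual is a placeholder: this sketch simply assumes numeric etcd
// revisions for illustration rather than a supported comparison API.
func rvNewerOrEqual(a, b string) bool {
	ai, _ := strconv.ParseUint(a, 10, 64)
	bi, _ := strconv.ParseUint(b, 10, 64)
	return ai >= bi
}

func main() {
	t := newWriteRVTracker()
	t.RecordWrite("kube-system/fluentd", "105")  // Pod write returned RV 105
	t.ObserveRV("100")                           // cache is still behind
	fmt.Println(t.isReady("kube-system/fluentd")) // false -> requeue with backoff
	t.ObserveRV("107")                            // event or bookmark catches up
	fmt.Println(t.isReady("kube-system/fluentd")) // true  -> safe to sync
}
```

In `processNextWorkItem`, the controller would consult `isReady(dsKey)` before
calling its sync function and, when it returns false, requeue the key with the
workqueue's rate-limited add so the usual backoff applies; `ObserveRV` would be
wired into the Pod informer's event handlers and the proposed `BookmarkFunc`.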

Requeueing in this way carries the same exponential backoff semantics as an
error, so after a few attempts in which the cache has not caught up, the requeue
delay grows longer and longer until the cache has had enough time to actually
catch up to the writes.

### Test Plan

[x] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.

##### Prerequisite testing updates

##### Unit tests

All the new code will be unit tested and have reasonable levels of coverage
prior to merging.

The changes will mostly land in the `pkg/controller` package, which will contain
the bulk of the logic along with additions to the Pod informer. We will add unit
tests to ensure that if the cache is out of sync, we skip reconciliation and
requeue ourselves.

For the informer and cache update, the changes will mostly live in the
`staging/src/k8s.io/client-go/tools/cache` package. We will add tests for the
newly exposed function and ensure that it works as expected.

##### Integration tests

- [test name](https://github.com/kubernetes/kubernetes/blob/2334b8469e1983c525c0c6382125710093a25883/test/integration/...): [integration master](https://testgrid.k8s.io/sig-release-master-blocking#integration-master?include-filter-by-regex=MyCoolFeature), [triage search](https://storage.googleapis.com/k8s-triage/index.html?test=MyCoolFeature)

##### e2e tests

- [test name](https://github.com/kubernetes/kubernetes/blob/2334b8469e1983c525c0c6382125710093a25883/test/e2e/...): [SIG ...](https://testgrid.k8s.io/sig-...?include-filter-by-regex=MyCoolFeature), [triage search](https://storage.googleapis.com/k8s-triage/index.html?test=MyCoolFeature)

### Graduation Criteria

#### Alpha

- Addition of the new bookmark function for informers
- Feature implemented behind a feature flag for one or more controllers
- Unit tests for the controllers and the bookmark function
- Addition of e2e tests for the controllers with the feature gate enabled

#### Beta

- Analysis of onboarded controllers and addition of others that may have the same staleness issues
- Additional e2e tests and stress tests to ensure edge cases are fully covered

### Upgrade / Downgrade Strategy

### Version Skew Strategy

This is all internal behavior in the controllers' sync loops. Version skew
should not affect it.

## Production Readiness Review Questionnaire

### Feature Enablement and Rollback

###### How can this feature be enabled / disabled in a live cluster?

- [x] Feature gate (also fill in values in `kep.yaml`)
  - Feature gate name: StaleControllerConsistency
  - Components depending on the feature gate: kube-controller-manager controllers that are determined to be high scale

###### Does enabling the feature change any default behavior?

There is no change to the default behavior of reconcilers; however, reconcilers
may skip reconciling for some time until their caches catch up. This should be
invisible, but it may look like the controllers are stuck when they are really
just waiting for stale caches to catch up.

###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes, the feature flag does not introduce any irreversible changes. The only
change is in how frequently the reconciliation of key controllers occurs; no
actual effect on any API objects should occur that a rollback would break.
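
For illustration, the sketch below shows the standard pattern Kubernetes
components use to register and consult a feature gate, applied to the
`StaleControllerConsistency` name proposed by this KEP. It is a hedged example:
real registration would live alongside the other gates in the components'
shared feature list rather than in a standalone program.

```go
// Illustrative sketch only: registering and consulting the proposed
// StaleControllerConsistency gate with the helpers Kubernetes components
// already use for feature gating.
package main

import (
	"fmt"

	utilfeature "k8s.io/apiserver/pkg/util/feature"
	"k8s.io/component-base/featuregate"
)

// StaleControllerConsistency is the gate name proposed by this KEP.
const StaleControllerConsistency featuregate.Feature = "StaleControllerConsistency"

func main() {
	// Alpha gates default to off; operators opt in via
	// --feature-gates=StaleControllerConsistency=true on kube-controller-manager.
	if err := utilfeature.DefaultMutableFeatureGate.Add(map[featuregate.Feature]featuregate.FeatureSpec{
		StaleControllerConsistency: {Default: false, PreRelease: featuregate.Alpha},
	}); err != nil {
		panic(err)
	}

	// Controllers would branch on the gate before enabling staleness tracking.
	if utilfeature.DefaultFeatureGate.Enabled(StaleControllerConsistency) {
		fmt.Println("stale-cache tracking enabled")
	} else {
		fmt.Println("stale-cache tracking disabled (alpha default)")
	}
}
```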

###### What happens if we reenable the feature if it was previously rolled back?

The enabled controllers should start skipping reconciles for affected objects
until their caches catch up again.

###### Are there any tests for feature enablement/disablement?

There are no APIs to enable; we will test whether the feature gate properly
enables and disables the consistency guarantees.

### Rollout, Upgrade and Rollback Planning

###### How can a rollout or rollback fail? Can it impact already running workloads?

###### What specific metrics should inform a rollback?

Controllers not updating objects as expected while informer metrics look
healthy. We plan to add staleness metrics that can help inform this decision.

###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

No

### Monitoring Requirements

###### How can an operator determine if the feature is in use by workloads?

It automatically goes into use once the gate is enabled.

###### How can someone using this feature know that it is working for their instance?

- [ ] Events
  - Event Reason:
- [ ] API .status
  - Condition name:
  - Other field:
- [ ] Other (treat as last resort)
  - Details:

###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?

###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

- [ ] Metrics
  - Metric name:
  - [Optional] Aggregation method:
  - Components exposing the metric:
- [ ] Other (treat as last resort)
  - Details:

###### Are there any missing metrics that would be useful to have to improve observability of this feature?

### Dependencies

###### Does this feature depend on any specific services running in the cluster?

### Scalability

###### Will enabling / using this feature result in any new API calls?

###### Will enabling / using this feature result in introducing new API types?

###### Will enabling / using this feature result in any new calls to the cloud provider?

###### Will enabling / using this feature result in increasing size or count of the existing API objects?

###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?

###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

### Troubleshooting

###### How does this feature react if the API server and/or etcd is unavailable?

###### What are other known failure modes?

###### What steps should be taken if SLOs are not being met to determine the problem?

## Implementation History

## Drawbacks

The most significant drawback is the possibility of breaking controllers by
causing them not to reconcile when they in fact should.
To mitigate this as much as possible, we will initially add these semantics only
to controllers where we have observed issues due to stale reads. If we see
success with those controllers, we will look into creating a framework around
this so it is easier to take advantage of, but fixing just the controllers with
documented staleness issues would already be a success.

## Alternatives

There is not much of an alternative: since we want to ensure that reconciles do
not run against stale caches, we have to skip reconciliation somehow. We could
query the cache directly and try to figure out the current version that way, but
that is far more race-prone than implementing a function in the informer.

## Infrastructure Needed (Optional)

diff --git a/keps/sig-api-machinery/5647-stale-controller-handling/kep.yaml b/keps/sig-api-machinery/5647-stale-controller-handling/kep.yaml
new file mode 100644
index 00000000000..115671d7747
--- /dev/null
+++ b/keps/sig-api-machinery/5647-stale-controller-handling/kep.yaml
@@ -0,0 +1,40 @@
title: Stale Controller Handling
kep-number: 5647
authors:
  - "@michaelasp"
owning-sig: sig-api-machinery
status: implementable
creation-date: 2025-10-09
reviewers:
  - "@serathius"
  - "@jpbetz"

approvers:
  - "@liggitt"
  - "@deads2k"

# The target maturity stage in the current dev cycle for this KEP.
# If the purpose of this KEP is to deprecate a user-visible feature
# and Deprecated feature gates are added, they should be deprecated|disabled|removed.
stage: alpha

# The most recent milestone for which work toward delivery of this KEP has been
# done. This can be the current (upcoming) milestone, if it is being actively
# worked on.
latest-milestone: "v1.35"

# The milestone at which this feature was, or is targeted to be, at each stage.
milestone:
  alpha: "v1.35"

# The following PRR answers are required at alpha release
# List the feature gate name and the components for which it must be enabled
feature-gates:
  - name: StaleControllerConsistency
    components:
      - kube-controller-manager
disable-supported: true

# The following PRR answers are required at beta release
# metrics:
# - my_feature_metric