KEP-5647: Initial KEP for Stale Controller Detection #5649
Conversation
michaelasp commented on Oct 9, 2025
- One-line PR description: Add initial KEP for stale controller metrics
- Issue link: Stale Controller Detection and Mitigation #5647
- Other comments:
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: michaelasp. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing `/approve` in a comment.

/cc @serathius
across the entire end-to-end test run, with rules in place to minimize the risk
of "tragedy of the commons" flakes across CI.

We propose the introduction of instrumentation to figure out the staleness of
While a metric would be a good first step, I would formulate the KEP to tackle the overall problem of controllers acting on stale data. Metrics would be a good first step to better track and understand the problem, but I would not box the proposal into just adding a metric. For Alpha we could add a metric, but for Beta we could add a mechanism to act on it.
Sounds good, let me make the overall topic stale controller detection and mitigation.
a principled approach with rules and guardrails.

See:
As Kubernetes increases in scale, greater pressure is being put on all
While staleness usually happens at scale, it's not exclusive to it. It's a consequence of building reconciliation on the watch protocol, which is eventually consistent. There is no guarantee of how far behind the watch is; the next event might arrive in a second, or in an hour. The problem is that we currently don't have any way to measure it.
components. One issue that arises is controllers falling out of sync with the
apiserver as they cannot keep up. When this happens, a controller may act on
information that is old without realizing it and get stuck in an irrecoverable
state where it keeps trying to act on the world and does not see its own writes
I would expand this more: the problem is a controller acting on outdated information, which in the best case causes conflicts on writes, increasing error rates, and in the worst case causes invalid behaviors like duplicating objects that overwhelm the control plane.
- Prevent flaking, unreliable tests
- Ensure result reporting is structured
- Must not impact the conformance test suite
The goal of this KEP is to add a set of metrics that we define for a certain set
Let's expand the goals: be able to measure the staleness of controller reconciliation so that administrators and controllers can take action if a threshold is reached.
- Enable completely arbitrary checks
- Targeting integration tests.
- We are specifically aiming for end to end tests for this purpose.
We also will focus on metrics in this KEP and not propose solutions to the
I don't think we should have a KEP for just a metric; when I suggested cutting the scope to metrics, I meant that it's a good first step and we can expand on it.
#### Story 1

How will UX be reviewed, and by whom?
I am a cluster administrator, I want to be able to check my metrics and see if
Again, overloading is not the only cause of staleness, and increasing resources is not the only mitigation. I would skip describing a specific mitigation and just focus on being able to monitor and alert on the issue so an on-call can take action.
If implemented poorly, this could result in tests flaking in any number of e2e
test CI jobs that are now running these tests.
I am a user and am trying to optimize my workloads, I look at my usage patterns
Not sure I understand the story. What can the user do here?
What are the risks of this proposal, and how do we mitigate? Think broadly.
For example, consider both security and how this will impact the larger
Kubernetes ecosystem.
### User Stories (Optional)
Please add a story for a controller developer who wants to ensure reliable behavior of their controller regardless of watch delay.
to flaking invariant tests in a timely fashion will result in demoting or
removing them.
```
&compbasemetrics.HistogramOpts{
```
I don't think the exact metric definition is needed here; the important parts are describing the thresholds we want to detect and the sampling resolution (buckets) we need to detect them.
When working on kubernetes/kubernetes#123448 I used the following watch latency thresholds to distinguish the state of the watch:
- < 100ms - GREAT
- < 1s - GOOD
- < 10s - SLOW, but acceptable for large clusters
- > 10s - STALE
They might need to be adjusted for controller reconciliation, which is further up the stack, and those values can then be used to decide how often we should sample the RV and what metric buckets we should pick.
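As an illustration only (not part of the KEP or the referenced PR), those thresholds could be encoded along these lines; the helper name and category labels are hypothetical:

```go
package staleness

import "time"

// classifyWatchDelay maps an observed watch/reconciliation delay onto the
// categories suggested in the comment above. The names are illustrative.
func classifyWatchDelay(d time.Duration) string {
	switch {
	case d < 100*time.Millisecond:
		return "GREAT"
	case d < time.Second:
		return "GOOD"
	case d < 10*time.Second:
		return "SLOW" // acceptable for large clusters
	default:
		return "STALE"
	}
}
```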
Name: "{my_controller_name}_pod_watch_delay_total_seconds",
Help: "Watch delay seconds",
StabilityLevel: compbasemetrics.ALPHA,
Buckets: compbasemetrics.LinearBuckets(1, 1, 300),
Having a bucket for every delay between 1 and 300 seconds is a little too much. We should have around 10-20 buckets. If we want to cover the range between 1s and 300s, then we might want to switch to exponential buckets to reduce their number.
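As a hedged sketch of that suggestion (the metric name below is a placeholder, not the one proposed in the KEP), roughly 15 exponential buckets can still cover the 1s-300s range:

```go
package staleness

import compbasemetrics "k8s.io/component-base/metrics"

// Hypothetical bucket layout: 15 exponential buckets starting at 1s with a
// factor of 1.5 reach ~292s, covering 1s-300s with far fewer buckets than
// LinearBuckets(1, 1, 300).
var podWatchDelay = compbasemetrics.NewHistogram(
	&compbasemetrics.HistogramOpts{
		Name:           "controller_pod_watch_delay_seconds", // placeholder name
		Help:           "Delay between the RV observed by the informer and the current apiserver RV.",
		StabilityLevel: compbasemetrics.ALPHA,
		Buckets:        compbasemetrics.ExponentialBuckets(1, 1.5, 15),
	},
)
```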
If not run in many CI jobs, there will be limited benefit to the signal.
We will aim to generally introduce these as default selected tests.

By adding a new prober like this, we introduce more APIServer requests to the
Risk: Measuring latency on the watch requires knowing the current RV in etcd, which means periodically polling the RV from the apiserver. Making a LIST request every X seconds for each controller/resource is too much load.
Mitigation: Start with only pods for the KCM controllers, as the most problematic case.
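One possible way to keep the polling cheap (an assumption on my part, not something stated in the thread) is to LIST with Limit=1, which returns the collection's current resourceVersion without transferring the full object set:

```go
package staleness

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// sampleCurrentPodRV is a hypothetical sketch: it fetches the apiserver's
// current resourceVersion for the pods collection by limiting the LIST to a
// single item and reading the returned ListMeta.ResourceVersion.
func sampleCurrentPodRV(ctx context.Context, client kubernetes.Interface) (string, error) {
	pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{Limit: 1})
	if err != nil {
		return "", err
	}
	return pods.ResourceVersion, nil
}
```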
KCM controllers and report when they have not been updated for a certain amount
of time. We do this by adding probers into common codepaths that run into
scalability limits, such as statefulsets, and expose metrics that give an idea of
how out of sync a controller is with the apiserver itself.
Maybe mention your KEP for comparing RVs, which finally gives us a way to measure the delay experienced by controllers.
A shared system will be introduced to the e2e framework to enable this form of
testing.
We propose the exposure of several key metrics, a histogram metric for every
When trying to address an open issue like this, it's good to structure the proposal in 3 themes: Prevent, Detect, Mitigate.
Preventing controller staleness issues would require performance improvements, which would be outside the scope of this KEP. So it would be good to discuss Detect and Mitigate:
- Detect: Monitor controller reconciliation delay to allow administrators to configure alerting and act when staleness appears. Solved by adding a metric.
- Mitigate: Improve controller resilience to staleness by preventing reconciliation on stale data. Here, the idea proposed by @liggitt is to not sync when the actions from the previous attempt have not been observed.
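A minimal sketch of what that Mitigate idea could look like, assuming the controller records the RV of its last write per object key and that RVs are comparable as etcd revisions; all names here are hypothetical:

```go
package staleness

// controller is a hypothetical stand-in for a real controller; it tracks the
// resourceVersion of its last write per object key and the newest RV the
// informer has delivered for that key.
type controller struct {
	lastWriteRV map[string]uint64
	informerRV  map[string]uint64
}

// shouldSync skips reconciliation for a key until the informer has observed
// the controller's own previous write, so the controller never acts on a
// cache that predates its own changes.
func (c *controller) shouldSync(key string) bool {
	lastWrite, pending := c.lastWriteRV[key]
	if !pending {
		return true // nothing outstanding from a previous attempt
	}
	// Treating RVs as ordered integers (etcd revisions) is an assumption.
	return c.informerRV[key] >= lastWrite
}
```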
Idea for Prevent: use the metric to measure the latency in K8s scalability tests, define thresholds to detect regressions, and define it as an SLO for the K8s project.
We don't need to design everything in detail for Alpha, just specify these as the steps needed to address the issue comprehensively. Let's propose a plan that makes sense at a high level and tackle the first step.
We will then identify controllers (DaemonSet, StatefulSet, Deployment, ...)
that are the highest churn and most at risk of running stale and will compare
the current latest read resource version to the probers. We will run through the
probers list of resource versions until we find the first object that is older
I don't understand which "first object" you mean.
change are understandable. This may include API specs (though not always
required) or even code snippets. If there's any ambiguity about HOW your
proposal will be implemented, this is the place to discuss them.
-->
Please structure the proposal into the high-level changes you want to make, and only go into details when needed. For me there are 3 changes:
- Sample an RV from the apiserver: a periodic loop that requests LIST on a resource to get the current RV and stores it in a queue.
- Update the informer code to store the latest RV.
- For a set of controllers, measure the latency by comparing the RV returned by the pod informer to the RV sampled from the apiserver.
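A hedged sketch of the third bullet, assuming RV samples are kept as (revision, time) pairs and RVs compare as etcd revisions; the types and names are illustrative, not from the KEP:

```go
package staleness

import "time"

// rvSample is a hypothetical record of one apiserver poll: the collection's
// resourceVersion and when it was sampled.
type rvSample struct {
	rv        uint64
	sampledAt time.Time
}

// informerDelay estimates how far behind the informer is: it finds the newest
// sample whose RV the informer has already observed and reports that sample's
// age as an upper bound on the delay. samples must be ordered oldest to newest.
func informerDelay(samples []rvSample, informerRV uint64, now time.Time) (time.Duration, bool) {
	for i := len(samples) - 1; i >= 0; i-- {
		if samples[i].rv <= informerRV {
			return now.Sub(samples[i].sampledAt), true
		}
	}
	// The informer has not caught up to even the oldest retained sample; the
	// delay is at least the age of that sample.
	return 0, false
}
```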
title: Remove gogo protobuf dependency
kep-number: 5589
title: Stale Controller Handling
kep-number: 5647
I think the KEP number should come from the number of the issue used for tracking progress across releases, not the PR number. Please open an enhancement tracking issue.
milestone:
  stable: "v1.35"
  alpha: "v1.35"
  beta: "v1.36"
Let's not make guesses.
- sig-architecture
status: implementable
creation-date: 2025-09-29
status: provisional
Status provisional is not enough to start work on it in the release. We could quickly merge it as provisional, but it would need another iteration to get it to "implementable".