KEP-5647: Initial KEP for Stale Controller Detection #5649
Conversation
michaelasp commented on Oct 9, 2025
- One-line PR description: Add initial KEP for stale controller metrics
- Issue link: Stale Controller Detection and Mitigation #5647
- Other comments:
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: michaelasp. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing `/approve` in a comment.

/cc @serathius
across the entire end-to-end test run, with rules in place to minimize the risk
of "tragedy of the commons" flakes across CI.

We propose the introduction of instrumentation to figure out the staleness of
While a metric would be a good first step, I would formulate the KEP to tackle the overall problem of controllers acting on stale data. Metrics would be a good first step to better track and understand the problem, but I would not box the proposal into just adding a metric. For Alpha we could add a metric, but for Beta we could add a mechanism to act on it.
Sounds good, let me make the overall topic stale controller detection and mitigation.
a principled approach with rules and guardrails.

See:
As Kubernetes increases in scale, greater pressure is being put on all
While staleness usually happens at scale, it's not exclusive to it. It's a consequence of building reconciliation on the watch protocol, which is eventually consistent. There is no guarantee of how far behind the watch is; the next event might arrive in a second, or in an hour. The problem is that we currently don't have any way to measure it.
components. One issue that arises is controllers falling out of sync with the
apiserver as they cannot keep up. When this happens, a controller may act on
information that is old without realizing it and get stuck in an irrecoverable
state where it keeps trying to act on the world and does not see its own writes
I would expand this more: the problem is a controller acting on outdated information, which in the best case causes conflicts on writes, increasing error rates, and in the worst case causes invalid behaviors like duplicating objects that overwhelm the control plane.
- Prevent flaking, unreliable tests
- Ensure result reporting is structured
- Must not impact the conformance test suite
The goal of this KEP is to add a set of metrics that we define for a certain set
Let's expand the goals: be able to measure the staleness of controller reconciliation so that administrators and controllers can take action if a threshold is reached.
- Enable completely arbitrary checks
- Targeting integration tests.
- We are specifically aiming for end to end tests for this purpose.
We also will focus on metrics in this KEP and not propose solutions to the
I don't think we should have a KEP for just a metric; when I suggested cutting the scope to metrics, I meant that it's a good first step and we can expand on it.
#### Story 1

How will UX be reviewed, and by whom?
I am a cluster administrator, I want to be able to check my metrics and see if
Again, overloading is not the only cause of staleness, and increasing resources is not the only mitigation. I would skip describing a specific mitigation and just focus on being able to monitor and alert on the issue so an on-call can take action.
If implemented poorly, this could result in tests flaking in any number of e2e
test CI jobs that are now running these tests.
I am a user and am trying to optimize my workloads, I look at my usage patterns
Not sure I understand the story. What can the user do here?
What are the risks of this proposal, and how do we mitigate? Think broadly.
For example, consider both security and how this will impact the larger
Kubernetes ecosystem.
### User Stories (Optional)
Please add a story for a controller developer who wants to ensure reliable behavior of their controller regardless of watch delay.
to flaking invariant tests in a timely fashion will result in demoting or
removing them.
```
&compbasemetrics.HistogramOpts{
```
I don't think the exact metric definition is needed here; the important parts are describing the thresholds we want to detect and the sampling resolution (buckets) we need to detect them.
When working on kubernetes/kubernetes#123448 I used the following watch latency thresholds to distinguish the state of the watch:
- < 100ms - GREAT
- < 1s - GOOD
- < 10s - SLOW, but acceptable for large clusters
- > 10s - STALE
They might need to be adjusted for controller reconciliation, which is further up the stack, and those values can then be used to decide how often we should sample the RV and what metric buckets we should pick.
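As an illustration only (not part of the KEP or the referenced PR), those thresholds could be encoded along these lines; the helper name and category labels are hypothetical:

```go
package staleness

import "time"

// classifyWatchDelay maps an observed watch/reconciliation delay onto the
// categories suggested in the comment above. The names are illustrative.
func classifyWatchDelay(d time.Duration) string {
	switch {
	case d < 100*time.Millisecond:
		return "GREAT"
	case d < time.Second:
		return "GOOD"
	case d < 10*time.Second:
		return "SLOW" // acceptable for large clusters
	default:
		return "STALE"
	}
}
```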
Name: "{my_controller_name}_pod_watch_delay_total_seconds",
Help: "Watch delay seconds",
StabilityLevel: compbasemetrics.ALPHA,
Buckets: compbasemetrics.LinearBuckets(1, 1, 300),
Having a bucket for every delay between 1 and 300 seconds is a little too much. We should have around 10-20 buckets. If we want to cover the range between 1s and 300s, then we might want to switch to exponential buckets to reduce their number.
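As a hedged sketch of that suggestion (the metric name below is a placeholder, not the one proposed in the KEP), roughly 15 exponential buckets can still cover the 1s-300s range:

```go
package staleness

import compbasemetrics "k8s.io/component-base/metrics"

// Hypothetical bucket layout: 15 exponential buckets starting at 1s with a
// factor of 1.5 reach ~292s, covering 1s-300s with far fewer buckets than
// LinearBuckets(1, 1, 300).
var podWatchDelay = compbasemetrics.NewHistogram(
	&compbasemetrics.HistogramOpts{
		Name:           "controller_pod_watch_delay_seconds", // placeholder name
		Help:           "Delay between the RV observed by the informer and the current apiserver RV.",
		StabilityLevel: compbasemetrics.ALPHA,
		Buckets:        compbasemetrics.ExponentialBuckets(1, 1.5, 15),
	},
)
```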
If not run in many CI jobs, there will be limited benefit to the signal.
We will aim to generally introduce these as default selected tests.

By adding a new prober like this, we introduce more APIServer requests to the
Risk: Measuring latency on the watch requires knowing the current RV in etcd, which means periodically polling the RV from the apiserver. Making a LIST request every X seconds for each controller/resource is too much load.
Mitigation: Start with only pods for the KCM controllers, as the most problematic case.
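One possible way to keep the polling cheap (an assumption on my part, not something stated in the thread) is to LIST with Limit=1, which returns the collection's current resourceVersion without transferring the full object set:

```go
package staleness

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// sampleCurrentPodRV is a hypothetical sketch: it fetches the apiserver's
// current resourceVersion for the pods collection by limiting the LIST to a
// single item and reading the returned ListMeta.ResourceVersion.
func sampleCurrentPodRV(ctx context.Context, client kubernetes.Interface) (string, error) {
	pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{Limit: 1})
	if err != nil {
		return "", err
	}
	return pods.ResourceVersion, nil
}
```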
KCM controllers and report when they have not been updated for a certain amount
of time. We do this by adding probers into common codepaths that run into
scalability limits, such as statefulsets, and expose metrics that give an idea of
how out of sync a controller is with the apiserver itself.
Maybe mention your KEP for comparing RVs, which finally gives us a way to measure the delay experienced by controllers.
A shared system will be introduced to the e2e framework to enable this form of
testing.
We propose the exposure of several key metrics, a histogram metric for every
When trying to address an open issue like this, it's good to structure the proposal in 3 themes: Prevent, Detect, Mitigate.
Preventing controller staleness issues would require performance improvements, which would be outside the scope of this KEP. So it would be good to discuss Detect and Mitigate:
- Detect: Monitor controller reconciliation delay to allow administrators to configure alerting and act when staleness appears. Solved by adding a metric.
- Mitigate: Improve controller resilience to staleness by preventing reconciliation on stale data. Here, the idea proposed by @liggitt is to not sync when the actions from the previous attempt have not been observed.
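A minimal sketch of what that Mitigate idea could look like, assuming the controller records the RV of its last write per object key and that RVs are comparable as etcd revisions; all names here are hypothetical:

```go
package staleness

// controller is a hypothetical stand-in for a real controller; it tracks the
// resourceVersion of its last write per object key and the newest RV the
// informer has delivered for that key.
type controller struct {
	lastWriteRV map[string]uint64
	informerRV  map[string]uint64
}

// shouldSync skips reconciliation for a key until the informer has observed
// the controller's own previous write, so the controller never acts on a
// cache that predates its own changes.
func (c *controller) shouldSync(key string) bool {
	lastWrite, pending := c.lastWriteRV[key]
	if !pending {
		return true // nothing outstanding from a previous attempt
	}
	// Treating RVs as ordered integers (etcd revisions) is an assumption.
	return c.informerRV[key] >= lastWrite
}
```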
Idea for Prevent: use the metric to measure the latency in K8s scalability tests, define thresholds to detect regressions, and define it as an SLO for the K8s project.
We don't need to design everything in detail for Alpha, just specify these as the steps needed to address the issue comprehensively. Let's propose a plan that makes sense at a high level and tackle the first step.
We will then identify controllers (DaemonSet, StatefulSet, Deployment, ...)
that are the highest churn and most at risk of running stale and will compare
the current latest read resource version to the probers. We will run through the
probers list of resource versions until we find the first object that is older
I don't understand which "first object" you mean.
change are understandable. This may include API specs (though not always
required) or even code snippets. If there's any ambiguity about HOW your
proposal will be implemented, this is the place to discuss them.
-->
Please structure the proposal into the high-level changes you want to make, and only go into details when needed. For me there are 3 changes:
- Sample an RV from the apiserver: a periodic loop that requests LIST on a resource to get the current RV and stores it in a queue.
- Update the informer code to store the latest RV.
- For a set of controllers, measure the latency by comparing the RV returned by the pod informer to the RV sampled from the apiserver.
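A hedged sketch of the third bullet, assuming RV samples are kept as (revision, time) pairs and RVs compare as etcd revisions; the types and names are illustrative, not from the KEP:

```go
package staleness

import "time"

// rvSample is a hypothetical record of one apiserver poll: the collection's
// resourceVersion and when it was sampled.
type rvSample struct {
	rv        uint64
	sampledAt time.Time
}

// informerDelay estimates how far behind the informer is: it finds the newest
// sample whose RV the informer has already observed and reports that sample's
// age as an upper bound on the delay. samples must be ordered oldest to newest.
func informerDelay(samples []rvSample, informerRV uint64, now time.Time) (time.Duration, bool) {
	for i := len(samples) - 1; i >= 0; i-- {
		if samples[i].rv <= informerRV {
			return now.Sub(samples[i].sampledAt), true
		}
	}
	// The informer has not caught up to even the oldest retained sample; the
	// delay is at least the age of that sample.
	return 0, false
}
```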
title: Remove gogo protobuf dependency
kep-number: 5589
title: Stale Controller Handling
kep-number: 5647
I think the KEP number should come from the number of the issue used for tracking progress across releases, not the PR number. Please open an enhancement tracking issue.
milestone:
  stable: "v1.35"
  alpha: "v1.35"
  beta: "v1.36"
Let's not make guesses.
- sig-architecture
status: implementable
creation-date: 2025-09-29
status: provisional
Status provisional is not enough to start work on it in the release. We could quickly merge it as provisional, but it would need another iteration to get it to "implementable".