
Conversation

michaelasp
Contributor

  • One-line PR description: Add initial KEP for stale controller metrics
  • Other comments:

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Oct 9, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: michaelasp
Once this PR has been reviewed and has the lgtm label, please assign deads2k for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. labels Oct 9, 2025
@michaelasp
Contributor Author

/cc @serathius

@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Oct 9, 2025
@k8s-ci-robot k8s-ci-robot requested a review from serathius October 9, 2025 23:38
across the entire end-to-end test run, with rules in place to minimize the risk
of "tragedy of the commons" flakes across CI.

We propose the introduction of instrumentation to figure out the staleness of
Contributor
@serathius serathius Oct 10, 2025

While a metric would be a good first step, I would formulate the KEP to tackle the overall problem of controllers acting on stale data. Metrics would be a good first step to better track and understand the problem, but I would not box the proposal into just adding a metric. For Alpha we could add a metric, and for Beta we could add a mechanism to act on it.

Contributor Author

Sounds good, let me make the overall topic stale controller detection and mitigation.

a principled approach with rules and guardrails.

See:
As Kubernetes increases in scale, greater pressure is being put on all
Contributor

While staleness usually happens at scale, it's not exclusive to it. It's a consequence of building reconciliation on the watch protocol, which is eventually consistent. There is no guarantee of how far behind the watch is; the next event might arrive in a second, or in an hour. The problem is that we currently don't have any way to measure it.

components. One issue that arises are controllers falling out of sync with the
apiserver as they cannot keep up. When this happens, a controller may act on
information that is old without realizing and get stuck in an irrecoverable
state where it keeps trying to act on the world and does not see its own writes
Contributor

I would expand this more: the problem is a controller acting on outdated information, which in the best case can cause conflicts on writes, increasing error rates, and in the worst case can cause invalid behaviors like duplicating objects that overwhelm the control plane.

- Prevent flaking, unreliable tests
- Ensure result reporting is structured
- Must not impact the conformance test suite
The goal of this kep is to add a set of metrics that we define for a certain set
Contributor

Let's expand the goals to being able to measure the staleness of controller reconciliation, so that administrators and controllers can take action if a threshold is reached.

- Enable completely arbitrary checks
- Targeting integration tests.
- We are specifically aiming for end to end tests for this purpose.
We also will focus on metrics in this KEP and not propose solutions to the
Contributor

I don't think we should have a KEP for just a metric; when I suggested cutting scope to metrics, I meant that it's a good first step and we can expand it.

#### Story 1

How will UX be reviewed, and by whom?
I am a cluster administrator, I want to be able to check my metrics and see if
Contributor

Again, overloading is not the only cause of staleness, and increasing resources is not the only mitigation. I would skip describing a specific mitigation and just focus on being able to monitor and alert on the issue so an on-call can take action.


If implemented poorly, this could result in tests flaking in any number of e2e
test CI jobs that are now running these tests.
I am a user and am trying to optimize my workloads, I look at my usage patterns
Contributor

Not sure I understand the story. What can the user do here?

What are the risks of this proposal, and how do we mitigate? Think broadly.
For example, consider both security and how this will impact the larger
Kubernetes ecosystem.
### User Stories (Optional)
Contributor

Please add a story for a controller developer who wants to ensure reliable behavior of their controller regardless of watch delay.

to flaking invariant tests in a timeline fashion will result in demoting or
removing them.
```
&compbasemetrics.HistogramOpts{
Contributor
@serathius serathius Oct 10, 2025

I don't think the exact metric definition is needed here; the important parts are describing the thresholds we want to detect and what sampling resolution (buckets) we need to detect them.

When working on kubernetes/kubernetes#123448 I used the following watch latency thresholds to distinguish the state of the watch:

  • < 100ms - GREAT
  • < 1s - GOOD
  • < 10s - SLOW, but acceptable for large clusters
  • > 10s - STALE

They might need to be adjusted for controller reconciliation, which is further up the stack, and those values should drive how often we sample the RV and what metric buckets we pick.

Name: "{my_controller_name}_pod_watch_delay_total_seconds",
Help: "Watch delay seconds",
StabilityLevel: compbasemetrics.ALPHA,
Buckets: compbasemetrics.LinearBuckets(1, 1, 300),
Contributor

Having a bucket for every delay between 1 and 300 seconds is a little too much; we should have around 10-20 buckets. If we want to cover the range between 1s and 300s, we might want to switch to exponential buckets to reduce their number.
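For illustration only (not part of the PR), a rough sketch of what such a histogram could look like with exponential buckets; the metric name, bucket start, factor, and count are assumptions that would need tuning against the thresholds discussed above.

```go
// Sketch only: ~12 exponential buckets covering 100ms to ~205s, which still
// resolves the <100ms / <1s / <10s / >10s thresholds discussed above.
// The metric name and exact bucket parameters are illustrative assumptions.
package metrics

import (
	compbasemetrics "k8s.io/component-base/metrics"
)

var podWatchDelaySeconds = compbasemetrics.NewHistogram(
	&compbasemetrics.HistogramOpts{
		Name:           "statefulset_controller_pod_watch_delay_seconds",
		Help:           "Delay between the apiserver's current state and the state observed by the controller, in seconds.",
		StabilityLevel: compbasemetrics.ALPHA,
		// 0.1, 0.2, 0.4, ... 204.8 — 12 buckets instead of 300 linear ones.
		Buckets: compbasemetrics.ExponentialBuckets(0.1, 2, 12),
	},
)
```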


If not run in many CI jobs, there will be limited benefit to the signal.
We will aim to generally introduce these as default selected tests.

By adding a new prober like this, we introduce more APIServer requests to the
Contributor

Risk: Measuring latency on the watch requires knowing the current RV in etcd, which means periodically polling the RV from the apiserver. Making a LIST request every X seconds for each controller/resource is too much load.
Mitigation: Start with only pods for the KCM controllers, as the most problematic case.

KCM controllers and report when they have not been updated for a certain amount
of time. We do this by adding probers into common codepaths that run into
scalability limits, such as statefulsets and expose metrics that give an idea of
how out of sync a controller is with the apiserver itself.
Contributor

Maybe mention your KEP for comparing RVs, which now finally gives us a way to measure the delay experienced by controllers.


A shared system will be introduced to the e2e framework to enable this form of
testing.
We propose the exposure of several key metrics, a histogram metric for every
Contributor

When trying to address an open issue like this, it's good to structure the proposal around 3 themes: Prevent, Detect, Mitigate.

Preventing controller staleness issues would require performance improvements, which would be outside the scope of this KEP. So it would be good to discuss Detect and Mitigate:

  • Detect: Monitor controller reconciliation delay to allow administrators to configure alerting and act when staleness appears. Solved by adding a metric.
  • Mitigate: Improve controller resilience to staleness by preventing reconciliation on stale data. Here the idea proposed by @liggitt is to not sync when the actions from the previous attempt have not been observed (a rough sketch follows below).
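To make the Mitigate idea concrete, here is a minimal sketch, assuming resource versions can be compared numerically (true for etcd-backed storage today, though RVs are formally opaque); the helper names and queue plumbing are illustrative, not a settled design.

```go
package stalecheck

import (
	"strconv"
	"time"

	"k8s.io/client-go/util/workqueue"
)

// shouldDeferSync reports whether reconciliation should wait because the
// cached object's resourceVersion (cachedRV) does not yet include the
// controller's last write (lastWrittenRV). Comparing RVs numerically is an
// assumption that holds for etcd-backed clusters today.
func shouldDeferSync(cachedRV, lastWrittenRV string) bool {
	if lastWrittenRV == "" {
		return false // no prior write to wait for
	}
	cached, err1 := strconv.ParseUint(cachedRV, 10, 64)
	written, err2 := strconv.ParseUint(lastWrittenRV, 10, 64)
	if err1 != nil || err2 != nil {
		return false // unparsable RVs: fall back to reconciling as today
	}
	return cached < written
}

// syncOne shows where such a check could sit in a controller's sync loop.
// The queue and bookkeeping are illustrative.
func syncOne(queue workqueue.RateLimitingInterface, key, cachedRV, lastWrittenRV string) bool {
	if shouldDeferSync(cachedRV, lastWrittenRV) {
		// Our previous write is not visible in the informer cache yet; acting now
		// could duplicate or conflict with it, so requeue and try again later.
		queue.AddAfter(key, time.Second)
		return false
	}
	// Safe to reconcile; record the resourceVersion of any write made here so
	// the next sync can repeat the check.
	return true
}
```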

Contributor
@serathius serathius Oct 10, 2025

Idea for Prevent: use the metric to measure the latency in K8s scalability tests, define thresholds to detect regressions, and define it as an SLO for the K8s project.

We don't need to design everything in detail for Alpha, just specify these as the steps needed to address the issue comprehensively. Let's propose a plan that makes sense at a high level and tackle the first step.

We will then identify controllers, (Daemonset, StatefulSet, Deployment, ...)
that are the highest churn and most at risk of running stale and will compare
the current latest read resource version to the probers. We will run through the
probers list of resource versions until we find the first object that is older
Contributor

I don't understand which "first object" you mean.

change are understandable. This may include API specs (though not always
required) or even code snippets. If there's any ambiguity about HOW your
proposal will be implemented, this is the place to discuss them.
-->
Contributor

Please structure the proposal around the high-level changes you want to make and only go into details when needed. For me there are 3 changes (see the sketch after this list):

  • Sample an RV from the apiserver: a periodic loop that requests a LIST on a resource to get the current RV and stores it in a queue.
  • Update the informer code to store the latest RV it has observed.
  • For a set of controllers, measure the latency by comparing the RV returned by the pod informer to the RV sampled from the apiserver.
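A minimal sketch of how those three pieces could fit together, again assuming numerically comparable resource versions; all names here are illustrative, not a proposed API. The delay returned below could then be recorded into the staleness histogram.

```go
package staleness

import (
	"context"
	"strconv"
	"sync"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// rvSample records the apiserver's resource version at a given wall-clock time.
type rvSample struct {
	rv        uint64
	sampledAt time.Time
}

// RVSampler periodically LISTs pods with Limit=1 just to read the collection's
// current resourceVersion, and keeps the recent samples in order.
type RVSampler struct {
	client  kubernetes.Interface
	mu      sync.Mutex
	samples []rvSample
}

func (s *RVSampler) Run(ctx context.Context, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			// Limit=1 keeps the request cheap; only list.ResourceVersion is used.
			list, err := s.client.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{Limit: 1})
			if err != nil {
				continue // sketch: real code would record the error
			}
			rv, err := strconv.ParseUint(list.ResourceVersion, 10, 64)
			if err != nil {
				continue
			}
			s.mu.Lock()
			s.samples = append(s.samples, rvSample{rv: rv, sampledAt: time.Now()})
			s.mu.Unlock()
		}
	}
}

// Delay estimates how stale the controller's view is. observedRV would be the
// resourceVersion of the newest event the pod informer has delivered (the
// informer-side change described above). The oldest sample the informer has
// not yet caught up to gives a lower bound on the delay.
func (s *RVSampler) Delay(observedRV uint64, now time.Time) time.Duration {
	s.mu.Lock()
	defer s.mu.Unlock()
	for _, sample := range s.samples {
		if observedRV < sample.rv {
			return now.Sub(sample.sampledAt)
		}
	}
	return 0 // caught up with every sample taken so far
}
```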

title: Remove gogo protobuf dependency
kep-number: 5589
title: Stale Controller Handling
kep-number: 5647
Contributor

I think the KEP number should come from the number of the issue used for tracking progress across releases, not the PR number. Please open an enhancement tracking issue.

milestone:
stable: "v1.35"
alpha: "v1.35"
beta: "v1.36"
Contributor

Let's not make guesses.

- sig-architecture
status: implementable
creation-date: 2025-09-29
status: provisional
Contributor

Status provisional is not enough to start work on this in the release. We could quickly merge it as provisional, but it would need another iteration to get it to "implementable".

