KEP-2371: update to beta #5632

haircommander · 2025-10-07T20:28:45Z

One-line PR description:

Issue link: cAdvisor-less, CRI-full Container and Pod Stats #2371

Other comments:

keps/sig-node/2371-cri-pod-container-stats/README.md

kannon92

out of diff but is this still an open question?

https://github.com/kubernetes/enhancements/blob/f04c6991969c25f117686f065bd761493e404d08/keps/sig-node/2371-cri-pod-container-stats/README.md#open-questions

keps/sig-node/2371-cri-pod-container-stats/README.md

kannon92 · 2025-10-09T15:19:09Z

keps/sig-node/2371-cri-pod-container-stats/README.md

-* **What specific metrics should inform a rollback?**
+###### What specific metrics should inform a rollback?

 The lack of any metrics reported for pods and containers is the worst case scenerio here, and would require either a rollback or for the feature gate to be disabled.


Commented above but if kubelet provider is not working, should we expose a metric or something?

If Kubelet is unable to post metrics on a node, it seems difficult to find this out currently.

I think if the admin attempted to roll out the feature and it failed, the metric saying provider is 'cadvisor' unexpectedly would be the signal that the fallback happened

That makes sense and the metric is exposed per node?

IMO this is still a very difficult thing for someone to detect.

The lack of any metrics reported for pods and containers is the worst case scenerio here, and would require either a rollback or for the feature gate to be disabled.

So the only way someone would find this out is if a kubelet on a node stopped posting metrics and the pods/containers on that node were not found in Prometheus.

That seems very hard to tell if I had 5000 nodes.

It's worth calling out that the fallback on a failed rollout would be cadvisor, but if the metrics are not being posted at all, what is the best way to find that out? How does one find the bad node via metrics or monitoring?

This is still not addressed.

I believe the answer here is above in the GA section where there is:

- Likely, the `cri_losing_support` metric will be used to report that users on versions lower than 2.2 will lose support by a targeted GA version.

Would be good to ensure this is updated here as well.
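
For the question raised earlier in this thread (how to find the bad node among thousands), here is a minimal sketch of one possible check. It assumes the monitoring system can already provide two inputs: the list of nodes expected to report, and a per-node count of pod/container metric samples seen in the last scrape window. The function name and both inputs are hypothetical and are not part of the KEP or the kubelet.

```python
from typing import Iterable, List, Mapping


def nodes_missing_container_metrics(
    expected_nodes: Iterable[str],
    samples_per_node: Mapping[str, int],
) -> List[str]:
    """Return the nodes that reported no pod/container metric samples.

    Both arguments are hypothetical inputs: `expected_nodes` would come from
    the cluster's node list, `samples_per_node` from the monitoring backend.
    A node absent from the mapping is treated the same as a node with zero
    samples.
    """
    return sorted(
        node for node in expected_nodes
        if samples_per_node.get(node, 0) == 0
    )


if __name__ == "__main__":
    expected = ["node-a", "node-b", "node-c"]
    scraped = {"node-a": 412, "node-c": 0}
    # node-b never reported, node-c reported zero samples.
    print(nodes_missing_container_metrics(expected, scraped))
```

This only flags nodes with no samples at all; it would not catch a node that reports a partial set of metrics.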

keps/sig-node/2371-cri-pod-container-stats/README.md

kannon92 · 2025-10-09T15:21:41Z

Please address the verify job failure.

PRR shadow:

I left some comments but overall I think it is close.

haircommander · 2025-10-09T17:22:58Z

thanks @kannon92 updated!

keps/sig-node/2371-cri-pod-container-stats/README.md

SergeyKanzhelev · 2025-10-13T18:31:45Z

+1 to @kannon92's comments. We need clear transition docs that explain how to migrate. There should also be a grace period during which both are supported equally well, maybe even simultaneously. This was the point of the transition documentation goal for beta: make sure it is reviewed and that we understand whether we need to announce a deprecation or can just suggest an easy migration for each metric.

haircommander · 2025-10-14T16:56:43Z

I have updated based on comments. I have made a couple of things explicit:

- Windows support is out of scope.
- GA of this feature will drop support for the partial CRI and cadvisor stats providers.
  - There's some included motivation on this, but basically there isn't a good way to configure it today, and it was never meant to be configured. Instead, the kubelet is opinionated on what stats provider to use.
  - As such, GA will be blocked on containerd 2.2 or above being the only supported containerd, and there being at least 3 releases of delay.
- Sometime in the alpha, we changed the approach of /metrics/cadvisor, now having the kubelet translate CRI information to that endpoint. I have tried to update the KEP to reflect this, but this means we're not moving away from the /metrics/cadvisor endpoint, only changing its source.

haircommander · 2025-10-14T18:12:40Z

Note: we chatted about this in SIG Node and agreed that the stats provider (cadvisor or partial CRI) is an implementation detail, and doesn't currently have any configuration. Introducing configuration to allow an admin to toggle whether we turn on full cri stats just to remove it in a handful of releases doesn't seem worth it. We decided to announce deprecation in beta, and move forward with it in GA when we decide to drop support for containerd < 2.2.

We also chatted with @marosset and agreed that windows support wouldn't block this KEP going to beta but we'd best-effort try to include it because this has been a sore spot for windows for a long time.

SergeyKanzhelev · 2025-10-14T18:27:02Z

keps/sig-node/2371-cri-pod-container-stats/README.md

-#### Alpha -> Beta Graduation
+#### Beta

- Conformance tests for the fields in `/metrics/cadvisor` should be created.


Why not have critest cover the metrics we want to be collected? We need some way of confirming that the users' transition will be seamless.

keps/sig-node/2371-cri-pod-container-stats/README.md

SergeyKanzhelev · 2025-10-14T18:35:55Z

keps/sig-node/2371-cri-pod-container-stats/README.md

- cAdvisor stats provider will likely be marked as deprecated (depending on dockershim deprecation).
+- cAdvisor stats provider support will be dropped, as well as support for partial cri stats provider.
 - Feature gate removed and the CRI stats provider will no longer rely on cAdvisor for container/pod level metrics.
+- Conformance tests for stats and metrics being present as expected from the new sources, and performance/scale testing should show comparable performance.


Does conformance include specific set of metrics?

not currently AFAIU

keps/sig-node/2371-cri-pod-container-stats/README.md

keps/sig-node/2371-cri-pod-container-stats/kep.yaml

Signed-off-by: Peter Hunt <[email protected]>

keps/prod-readiness/sig-node/2371.yaml

kannon92 · 2025-10-14T20:23:13Z

keps/sig-node/2371-cri-pod-container-stats/README.md

 ### /metrics/cadvisor

-1. Expose the metric fields provided in `/metrics/cadvisor` in an analogous Prometheus endpoint directly from the CRI implementation.
+1. Expose the metric fields provided in `/metrics/cadvisor` in the same Prometheus endpoint, gathered by Kubelet from from the CRI implementation and reported through the Kubelet.


Suggested change

1. Expose the metric fields provided in `/metrics/cadvisor` in the same Prometheus endpoint, gathered by Kubelet from from the CRI implementation and reported through the Kubelet.

1. Expose the metric fields provided in `/metrics/cadvisor` in the same Prometheus endpoint, gathered by Kubelet from the CRI implementation and reported through the Kubelet.

kannon92 · 2025-10-14T20:28:07Z

keps/sig-node/2371-cri-pod-container-stats/README.md

-  automations, so be extremely careful here.
-  Enabling this behavior means some stats endpoints will not be filled:
+###### Does enabling the feature change any default behavior?
+:


Suggested change

:

kannon92 · 2025-10-14T20:31:50Z

/hold

Based on #5632 (comment), waiting on @SergeyKanzhelev lgtm.

PRR shadow:

Just one item I still think needs more detail on it. #5632 (comment)

I think the answer is that in order to GA this feature we cannot have a node fail to publish metrics via CRI. Once this feature is GA there is no alternative, so a node would miss kubelet metrics, and there is no way to recover from that once we lock the gate to true and drop the kubelet stats provider code.

SergeyKanzhelev

/lgtm
/approve

SergeyKanzhelev · 2025-10-15T16:42:08Z

/unhold

SergeyKanzhelev · 2025-10-15T16:43:01Z

/assign @deads2k
/assign @kannon92

kannon92 · 2025-10-15T16:57:51Z

Based on #5632 (comment),

/assign @soltysh

soltysh

I'm going to conditionally approve the PRR, but here are the things that need to be ensured for this work:

- This functionality SHOULD NOT be enabled by default if containerd doesn't release v2.2 before the 1.35 dev cycle ends. Given that v2.2.0-beta.1 was released 5 days ago, you should be ok.
- Update the missing bits I've mentioned in the doc (mostly the rollback metric, ideally also the template, but that's a minor problem).

/approve
the PRR section

keps/prod-readiness/sig-node/2371.yaml

soltysh · 2025-10-15T14:18:14Z

keps/sig-node/2371-cri-pod-container-stats/README.md

+  - [Infrastructure Needed (Optional)](#infrastructure-needed-optional)
 <!-- /toc -->

 # cAdvisor-less, CRI-full Container and Pod Stats


Nit: it seems this KEP didn't get the updated template.

do you see a section I missed? I thought I updated the template and I'm not seeing anything missing

soltysh · 2025-10-15T18:39:06Z

keps/sig-node/2371-cri-pod-container-stats/README.md

-* **What specific metrics should inform a rollback?**
+###### What specific metrics should inform a rollback?

 The lack of any metrics reported for pods and containers is the worst case scenerio here, and would require either a rollback or for the feature gate to be disabled.


I believe the answer here is above in the GA section where there is:

- Likely, the `cri_losing_support` metric will be used to report that users on versions lower than 2.2 will lose support by a targeted GA version.

Would be good to ensure this is updated here as well.

k8s-ci-robot · 2025-10-15T18:42:09Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dchen1107, haircommander, SergeyKanzhelev, soltysh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~keps/prod-readiness/OWNERS~~ [soltysh]
~~keps/sig-node/OWNERS~~ [dchen1107]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

haircommander · 2025-10-15T19:49:06Z

This functionality SHOULD NOT be enabled by default if containerd doesn't release v2.2 before the 1.35 dev cycle ends. Given that v2.2.0-beta.1 was released 5 days ago, you should be ok.

I'm not even sure we should do on by default beta until 1.36 anyway, because 2.2 will barely be anywhere when 1.35 releases

soltysh · 2025-10-16T13:09:00Z

I'm not even sure we should do on by default beta until 1.36 anyway, because 2.2 will barely be anywhere when 1.35 releases

That's a very reasonable approach, I'm definitely supportive.

k8s-ci-robot added the cncf-cla: yes label Oct 7, 2025

k8s-ci-robot requested review from dchen1107 and palnabarun October 7, 2025 20:28

k8s-ci-robot added the kind/kep, sig/node, and size/L labels Oct 7, 2025

haircommander mentioned this pull request Oct 7, 2025

cAdvisor-less, CRI-full Container and Pod Stats #2371

Open

16 tasks

kannon92 reviewed Oct 8, 2025

keps/sig-node/2371-cri-pod-container-stats/README.md

kannon92 reviewed Oct 9, 2025

keps/sig-node/2371-cri-pod-container-stats/README.md

keps/sig-node/2371-cri-pod-container-stats/README.md

kannon92 reviewed Oct 9, 2025

keps/sig-node/2371-cri-pod-container-stats/README.md

kannon92 reviewed Oct 9, 2025

keps/sig-node/2371-cri-pod-container-stats/README.md (Outdated)

kannon92 reviewed Oct 9, 2025

keps/sig-node/2371-cri-pod-container-stats/README.md

kannon92 reviewed Oct 9, 2025

keps/sig-node/2371-cri-pod-container-stats/README.md

kannon92 reviewed Oct 9, 2025

keps/sig-node/2371-cri-pod-container-stats/README.md

haircommander force-pushed the 2371-beta-2 branch from f04c699 to 0c5a6c4 Compare October 9, 2025 17:20

kannon92 reviewed Oct 10, 2025

keps/sig-node/2371-cri-pod-container-stats/README.md

kannon92 reviewed Oct 10, 2025

keps/sig-node/2371-cri-pod-container-stats/README.md (Outdated)

kannon92 reviewed Oct 10, 2025

keps/sig-node/2371-cri-pod-container-stats/README.md

kannon92 reviewed Oct 10, 2025

keps/sig-node/2371-cri-pod-container-stats/README.md

SergeyKanzhelev reviewed Oct 13, 2025

keps/sig-node/2371-cri-pod-container-stats/README.md

SergeyKanzhelev reviewed Oct 13, 2025

keps/sig-node/2371-cri-pod-container-stats/README.md

haircommander force-pushed the 2371-beta-2 branch from 0c5a6c4 to 1b9b218 Compare October 14, 2025 16:54

SergeyKanzhelev reviewed Oct 14, 2025

keps/sig-node/2371-cri-pod-container-stats/README.md (Outdated)

SergeyKanzhelev reviewed Oct 14, 2025

keps/sig-node/2371-cri-pod-container-stats/README.md

SergeyKanzhelev reviewed Oct 14, 2025

keps/sig-node/2371-cri-pod-container-stats/kep.yaml

haircommander force-pushed the 2371-beta-2 branch from 1b9b218 to 5845ad0 Compare October 14, 2025 18:57

KEP-2371: update to beta

b901bf4

Signed-off-by: Peter Hunt <[email protected]>

haircommander force-pushed the 2371-beta-2 branch from 5845ad0 to b901bf4 Compare October 14, 2025 19:01

kannon92 reviewed Oct 14, 2025

keps/prod-readiness/sig-node/2371.yaml

kannon92 reviewed Oct 14, 2025

k8s-ci-robot added the do-not-merge/hold label Oct 14, 2025

SergeyKanzhelev approved these changes Oct 14, 2025

k8s-ci-robot assigned SergeyKanzhelev Oct 14, 2025

k8s-ci-robot added the lgtm label Oct 14, 2025

k8s-ci-robot removed the do-not-merge/hold label Oct 15, 2025

k8s-ci-robot assigned deads2k and kannon92 Oct 15, 2025

k8s-ci-robot assigned soltysh Oct 15, 2025

soltysh approved these changes Oct 15, 2025

k8s-ci-robot added the approved label Oct 15, 2025

k8s-ci-robot merged commit 000e2fb into kubernetes:master Oct 15, 2025
4 checks passed

k8s-ci-robot added this to the v1.35 milestone Oct 15, 2025

kannon92 mentioned this pull request Oct 20, 2025

self nominate kannon92 for production readiness approval #5662

Merged
