KEP-2371: update to beta #5632
haircommander commented Oct 7, 2025
- One-line PR description:
- Issue link: cAdvisor-less, CRI-full Container and Pod Stats #2371
- Other comments:
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: haircommander
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
- Conduct research to find the set of metrics from `/metrics/cadvisor` that compliant CRI implementations must expose.

#### Alpha -> Beta Graduation
#### Beta
Will these be done before the feature gate is turned on?
The requirements for PRR have changed, so ideally most of the work is complete when this is promoted to beta.
yeah that's my hope. The containerd side of the verification may need to be done on a prerelease version, but since a lot of the testing will be manual I think that will be okay cc @akhilerm
Yes @haircommander . Will have to manually verify and I am trying to get it merged in before the next beta of containerd 2.2
> - Conformance tests for the fields in `/metrics/cadvisor` should be created.
That sounds like the tests will be automated.
> - Conformance tests for the fields in `/metrics/cadvisor` should be created.
> - Validate performance impact of this feature is within allowable margin (or non-existent, ideally).
> - The CRI stats implementation should perform better than they did with CRI+cAdvisor.
> - cAdvisor stats provider will be marked as deprecated, as well as the cAdvisor providing the metrics endpoint `/metrics/cadvisor`.
> - Write migration documentation for entities relying on metrics from `/metrics/cadvisor`.
> - Windows stats and metrics will be added.
This seems like a lot to do for beta promotion. Is the plan to do all of this in the 1.35 cycle?
It sounds like the performance impact of this is also container-runtime dependent, so I'd expect performance numbers for CRI-O / Containerd.
btw we do have e2e_node test for it now https://github.com/kubernetes/kubernetes/blob/b393d87d16f225f873f72a79734b3409323b4a05/test/e2e_node/container_metrics_test.go#L39 so we'd have to find a way to transfer to conformance
I'm dropping "Write migration documentation for entities relying on metrics from /metrics/cadvisor." as we have changed the implementation to use /metrics/cadvisor still
I am also dropping windows piece, SIG windows can do a follow-up KEP for that
I guess it's not clear to me what this actually means.
This test is marked as "NodeConformance" in test/e2e_node. Are you proposing that we move this test to test/e2e/ and make it part of the conformance tests for a k8s distribution?
out of diff but is this still an open question?
Ideally all components will rely on summary API thereby alleviating need for cAdvisor for container and pod level stats.
This is also a requirement to be able to disable cAdvisor container metrics collection.

To make clear to cluster admins when metrics are coming from CRI, rather than cadvisor, a new metric `kubelet_metrics_provider` will be used, with `provider` label either `cri` or `cadvisor`.
If there are issues with the kubelet metrics provider, do you think it's worth exposing this in the metric?
I would advocate the specific providers should report their own error metrics
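For illustration, a minimal Python sketch of how an admin-side script might read the proposed `kubelet_metrics_provider` metric from a kubelet's Prometheus text output. The KEP text only names the metric and its `provider` label; the sample shape used here (a gauge with value `1` for the active provider), the help text, and the function name `active_provider` are assumptions, not something this PR specifies.

```python
import re

# Assumed sample shape: kubelet_metrics_provider{provider="cri"} 1
# Only the metric name and the `provider` label come from the KEP text;
# the value convention (non-zero means active) is a guess.
_SAMPLE = re.compile(r'^kubelet_metrics_provider\{([^}]*)\}\s+(\S+)')
_LABEL = re.compile(r'(\w+)="([^"]*)"')


def active_provider(exposition_text):
    """Return the provider label of the first non-zero sample, or None."""
    for line in exposition_text.splitlines():
        match = _SAMPLE.match(line.strip())
        if not match:
            continue  # skips "# HELP" / "# TYPE" comments and other metrics
        labels = dict(_LABEL.findall(match.group(1)))
        if float(match.group(2)) != 0 and "provider" in labels:
            return labels["provider"]
    return None


# Hypothetical scrape output, written by hand for this example.
sample = '''# HELP kubelet_metrics_provider hypothetical help text
# TYPE kubelet_metrics_provider gauge
kubelet_metrics_provider{provider="cri"} 1
'''
print(active_provider(sample))
```

A `None` result here only means the metric was absent from the scraped text; it says nothing about whether the provider itself is erroring, which is the gap the comments above discuss.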
* **What specific metrics should inform a rollback?**
###### What specific metrics should inform a rollback?

The lack of any metrics reported for pods and containers is the worst case scenario here, and would require either a rollback or for the feature gate to be disabled.
Commented above but if kubelet provider is not working, should we expose a metric or something?
If Kubelet is unable to post metrics on a node, it seems difficult to find this out currently.
I think if the admin attempted to roll out the feature and it failed, the metric saying provider is 'cadvisor' unexpectedly would be the signal that the fallback happened
That makes sense and the metric is exposed per node?
yup!
IMO this is still a very difficult thing for someone to detect.
> The lack of any metrics reported for pods and containers is the worst case scenario here, and would require either a rollback or for the feature gate to be disabled.
So the only way someone would find this out is if a kubelet on a node stopped posting metrics and a pod/container on that node was not found in Prometheus.
That seems very complicated to tell if I had 5000 nodes.
It's worth calling out that the failure case would fall back to cadvisor, but if the metrics are not being posted, what is the best way to find that out? How does one find the bad node via metrics or monitoring?
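One way to picture the two failure signals discussed in this thread (an unexpected `cadvisor` provider versus a node with no pod/container samples at all) is the Python sketch below. It assumes the monitoring system can already supply, per node, the reported provider label and whether any container samples were seen; the function name `classify_nodes`, the bucket names, and the node names are hypothetical.

```python
def classify_nodes(expected_provider, provider_by_node, nodes_with_container_samples):
    """Split nodes into 'silent', 'fallback', and 'ok' buckets.

    provider_by_node: node name -> reported provider label, or None when
        the provider metric was not scraped for that node.
    nodes_with_container_samples: set of node names for which at least one
        pod/container sample reached the monitoring system.
    """
    report = {"silent": [], "fallback": [], "ok": []}
    for node in sorted(provider_by_node):
        if node not in nodes_with_container_samples:
            # Worst case from the KEP text: no pod/container metrics at all.
            report["silent"].append(node)
        elif provider_by_node[node] != expected_provider:
            # Metrics are flowing, but not from the provider that was rolled out.
            report["fallback"].append(node)
        else:
            report["ok"].append(node)
    return report


# Hand-written inputs for illustration only.
providers = {"node-a": "cri", "node-b": "cadvisor", "node-c": "cri"}
with_samples = {"node-a", "node-b"}
report = classify_nodes("cri", providers, with_samples)
print(report)
```

The "silent" bucket depends on the monitoring system knowing the full node inventory independently of the kubelet's own metrics, which is the crux of the 5000-node concern.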
Please address the verify job failure. PRR shadow: I left some comments but overall I think it is close.
Force-pushed: f04c699 to 0c5a6c4
thanks @kannon92 updated!
+1 to @kannon92's comments. We need clear transition docs that explain how to migrate. There should also be a grace period in which both are supported equally well, maybe even simultaneously. This was the goal of the transition documentation for beta: make sure it is reviewed, and that we understand whether we need to announce a deprecation or can just suggest an easy migration for each metric.
Signed-off-by: Peter Hunt <[email protected]>
Force-pushed: 0c5a6c4 to 1b9b218
I have updated based on comments. I have made a couple of things explicit:
|
Note: we chatted about this in SIG Node and agreed that the stats provider (cadvisor or partial CRI) is an implementation detail, and doesn't currently have any configuration. Introducing configuration to allow an admin to toggle whether we turn on full cri stats just to remove it in a handful of releases doesn't seem worth it. We decided to announce deprecation in beta, and move forward with it in GA when we decide to drop support for containerd < 2.2. We also chatted with @marosset and agreed that windows support wouldn't block this KEP going to beta but we'd best-effort try to include it because this has been a sore spot for windows for a long time.
#### Alpha -> Beta Graduation
#### Beta

- Conformance tests for the fields in `/metrics/cadvisor` should be created.
Why not have critest cover the metrics we want to be collected? We need some way of confirming that users' transition will be seamless.
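Whichever harness ends up hosting it (e2e_node, conformance, or critest), the core of such a check is comparing the metric names a runtime exposes against a required set. A rough Python sketch follows; the `REQUIRED` set is illustrative only, since the KEP still lists determining the required `/metrics/cadvisor` metrics as research to be done, and `missing_metrics` is a hypothetical helper, not part of any existing test suite.

```python
import re

# Illustrative only: two well-known cAdvisor metric names. The actual
# required set is still an open item in the KEP.
REQUIRED = {
    "container_cpu_usage_seconds_total",
    "container_memory_working_set_bytes",
}

# A Prometheus metric name at the start of a sample line.
_NAME = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)')


def missing_metrics(exposition_text, required=REQUIRED):
    """Return the sorted required metric names absent from the text."""
    seen = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and "# HELP" / "# TYPE" comments
        match = _NAME.match(line)
        if match:
            seen.add(match.group(1))
    return sorted(set(required) - seen)


# Hand-written scrape output for this example.
scrape = '''# TYPE container_cpu_usage_seconds_total counter
container_cpu_usage_seconds_total{container="app",pod="demo"} 12.5
'''
print(missing_metrics(scrape))
```

A real test would likely also assert on labels and value types, not just metric names, so that dashboards built on the cAdvisor output keep working.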