Dynamic Power Consumption of Single Namespace is Higher than Whole Platform's #1833

Open
marvin-steinke opened this issue Nov 4, 2024 · 0 comments
Labels
kind/bug (report bug issue)

Comments

@marvin-steinke

What happened?

All values are in watts (rate of the joules counters over a 60s window).

Dynamic Namespace:
sum(rate(kepler_container_joules_total{container_namespace='test', mode='dynamic'}[60s]))
CPU workload: 50.322, GPU workload: 379.493

Idle Namespace:
sum(rate(kepler_container_joules_total{container_namespace='test', mode='idle'}[60s]))
CPU workload: 128.672, GPU workload: 69.03

Dynamic Node:
sum(rate(kepler_container_platform_joules_total{mode='dynamic'}[60s]))
CPU workload: 43.013, GPU workload: 229.015

Idle Node:
sum(rate(kepler_container_platform_joules_total{mode='idle'}[60s]))
CPU workload: 242.012, GPU workload: 242.0

Socket meter measurements:
CPU workload: 292.611, GPU workload: 489.543

The sum of the node approximations is pretty close to the socket-meter measurements. However, the approximated values for a whole namespace are off by a lot. When the GPU workload pauses for a few minutes, the socket meter measures about 260 W, meaning the actual consumption of the GPU workload should not be much higher than roughly 230 W (~490 W measured at the socket under GPU load minus ~260 W measured at idle). Should I be using another PromQL query to obtain the power consumption of a whole namespace?
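
As a sanity check, the same metric can also be broken down per namespace. The queries below only reuse the metric names and labels already shown in the table above, so they are a sketch of how I am aggregating, not necessarily the intended way to query Kepler:

# dynamic power (W) per namespace
sum by (container_namespace) (rate(kepler_container_joules_total{mode='dynamic'}[60s]))

# dynamic power (W) summed over all containers, for comparison with the node-level estimate
sum(rate(kepler_container_joules_total{mode='dynamic'}[60s]))

# node-level dynamic estimate used in the table above
sum(rate(kepler_container_platform_joules_total{mode='dynamic'}[60s]))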

I think the idle power consumption of the GPU workload is a lot lower because the idle power is currently divided by the number of processes rather than by the amount of resources they use (which is acknowledged and planned as a future work item, line 179), and the CPU workload is a microservice benchmark that runs a lot of pods (see the rough check below).
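
If the even per-process split is the cause, namespaces with more containers should receive a proportionally larger share of the idle power. A rough way to check the container count per namespace from the same metric (assuming one time series per container and mode, which may not hold exactly):

# approximate number of containers per namespace
count by (container_namespace) (kepler_container_joules_total{mode='idle'})

For my setup this should confirm that the CPU benchmark namespace contains many more containers than the GPU one, which would explain the skewed idle split.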

What did you expect to happen?

Kepler reports the power consumption per namespace as accurately as it does for the entire node (which it does very well, props to the devs).

How can we reproduce it (as minimally and precisely as possible)?

Run any kind of workload (preferably a GPU workload, where the discrepancy is more dramatic) and execute the Prometheus queries above.

Anything else we need to know?

No response

Kepler image tag

latest

Kubernetes version

$ kubectl version
Client Version: v1.30.4+k3s1
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.30.4+k3s1

Cloud provider or bare metal

bare metal

OS version

# On Linux:
$ cat /etc/os-release
PRETTY_NAME="Ubuntu 24.04.1 LTS"
NAME="Ubuntu"
VERSION_ID="24.04"
VERSION="24.04.1 LTS (Noble Numbat)"
VERSION_CODENAME=noble
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=noble
LOGO=ubuntu-logo

$ uname -a
Linux gpu01 6.8.0-44-generic #44-Ubuntu SMP PREEMPT_DYNAMIC Tue Aug 13 13:35:26 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Install tools

helm

Kepler deployment config

No response

Container runtime (CRI) and version (if applicable)

Containerd v1.7.20-k3s1

Related plugins (CNI, CSI, ...) and versions (if applicable)

No response
