Dynamic Power Consumption of Single Namespace is Higher than Whole Platform's #1833

Open
marvin-steinke opened this issue Nov 4, 2024 · 0 comments
Labels
kind/bug (report bug issue)

Comments

@marvin-steinke

What happened?

All values are in watts (rate of the joules counters over a 60s window).

Dynamic Namespace:
sum(rate(kepler_container_joules_total{container_namespace='test', mode='dynamic'}[60s]))
CPU workload: 50.322, GPU workload: 379.493

Idle Namespace:
sum(rate(kepler_container_joules_total{container_namespace='test', mode='idle'}[60s]))
CPU workload: 128.672, GPU workload: 69.03

Dynamic Node:
sum(rate(kepler_container_platform_joules_total{mode='dynamic'}[60s]))
CPU workload: 43.013, GPU workload: 229.015

Idle Node:
sum(rate(kepler_container_platform_joules_total{mode='idle'}[60s]))
CPU workload: 242.012, GPU workload: 242.0

Socket meter measurements:
CPU workload: 292.611, GPU workload: 489.543

The sum of the node approximations is pretty close to the socket-meter measurements. However, the approximated values for a whole namespace are off by a lot. When the GPU workload pauses for a few minutes, the socket meter measures about 260 W, meaning the actual consumption of the GPU workload should not be much higher than roughly 230 W (~490 W measured at the socket under GPU load minus ~260 W measured at idle). Should I be using another PromQL query to obtain the power consumption of a whole namespace?
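
As a sanity check, the same metric can also be broken down per namespace. The queries below only reuse the metric names and labels already shown in the table above, so they are a sketch of how I am aggregating, not necessarily the intended way to query Kepler:

# dynamic power (W) per namespace
sum by (container_namespace) (rate(kepler_container_joules_total{mode='dynamic'}[60s]))

# dynamic power (W) summed over all containers, for comparison with the node-level estimate
sum(rate(kepler_container_joules_total{mode='dynamic'}[60s]))

# node-level dynamic estimate used in the table above
sum(rate(kepler_container_platform_joules_total{mode='dynamic'}[60s]))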

I think the idle power consumption of the GPU workload is a lot lower because the idle power is currently divided by the number of processes rather than by the amount of resources they use (which is acknowledged and planned as a future work item, line 179), and the CPU workload is a microservice benchmark that runs a lot of pods (see the rough check below).
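
If the even per-process split is the cause, namespaces with more containers should receive a proportionally larger share of the idle power. A rough way to check the container count per namespace from the same metric (assuming one time series per container and mode, which may not hold exactly):

# approximate number of containers per namespace
count by (container_namespace) (kepler_container_joules_total{mode='idle'})

For my setup this should confirm that the CPU benchmark namespace contains many more containers than the GPU one, which would explain the skewed idle split.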

What did you expect to happen?

Kepler reports the power consumption per namespace as accurately as it does for the entire node (which it does very well, props to the devs).

How can we reproduce it (as minimally and precisely as possible)?

Run any kind of workload (preferably a GPU workload, where the discrepancy is more dramatic) and execute the Prometheus queries above.

Anything else we need to know?

No response

Kepler image tag

latest

Kubernetes version

$ kubectl version
Client Version: v1.30.4+k3s1
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.30.4+k3s1

Cloud provider or bare metal

bare metal

OS version

# On Linux:
$ cat /etc/os-release
PRETTY_NAME="Ubuntu 24.04.1 LTS"
NAME="Ubuntu"
VERSION_ID="24.04"
VERSION="24.04.1 LTS (Noble Numbat)"
VERSION_CODENAME=noble
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=noble
LOGO=ubuntu-logo

$ uname -a
Linux gpu01 6.8.0-44-generic #44-Ubuntu SMP PREEMPT_DYNAMIC Tue Aug 13 13:35:26 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Install tools

helm

Kepler deployment config

No response

Container runtime (CRI) and version (if applicable)

Containerd v1.7.20-k3s1

Related plugins (CNI, CSI, ...) and versions (if applicable)

No response
