Kubernetes Scaphandre Deployment reporting 0 W #353

Open
eduardogomescampos1 opened this issue Jan 23, 2024 · 6 comments
Labels
bug Something isn't working

Comments

@eduardogomescampos1

eduardogomescampos1 commented Jan 23, 2024

Bug description

First of all, I would like to thank the whole Scaphandre team for this tool. It has been extremely helpful so far! The bug is that some nodes in my local k8s cluster report 0 W of consumption. To illustrate the issue, there is a screenshot of the official Scaphandre Grafana dashboard in the Screenshots section.
Each color represents a node and, as you can see, 3 of them report 0 W. The most intriguing part is that if I run Scaphandre locally, I get actual values. There is also a screenshot of the logs from a local execution of Scaphandre on one of the nodes that report 0 W in the k8s version.
So Scaphandre is able to obtain those metrics locally, but the pods in the k8s cluster cannot. Running "kubectl logs 'scaphandre pod'" has been of no help, since it just returns:
" Scaphandre prometheus exporter
Sending ⚡ metrics
Press CTRL-C to stop scaphandre "
And describing the pods does not return anything worth mentioning either.
It is relevant to note that the firewall is disabled on all cluster machines.
Could you give any insights on solving this, please?

To Reproduce

  1. Create a k8s cluster using Calico CNI following its documentation
  2. Create a deployment for Grafana and Prometheus (following these tutorials: https://devopscube.com/setup-grafana-kubernetes/ and https://devopscube.com/setup-prometheus-monitoring-on-kubernetes/)
  3. Deploy Scaphandre from its Helm Chart
  4. Open Scaphandre Grafana dashboard and verify that some nodes report 0W

Expected behavior

The Grafana dashboard should report the same values obtained from the local execution rather than 0W

Screenshots

  • [image] Scaphandre Grafana dashboard
  • [image] Local execution of Scaphandre is able to get values other than 0 W.

Environment

  • Linux distribution version on all machines: Ubuntu 22.04.3
  • Kernel version on all machines: 5.15.0-91-generic

Additional context

One interesting aspect is that all of the malfunctioning machines were reformatted quite recently, so I'm guessing there might be a misconfiguration somewhere.

@eduardogomescampos1 eduardogomescampos1 added the bug Something isn't working label Jan 23, 2024
@mmadoo
Contributor

mmadoo commented Jan 23, 2024

Which Docker tag are you using, and what is the value of the scaph_self_version metric?

I am using the dev tag and get version 0.5. My metrics for scaph_process_power_consumption_microwatts are fine.
image
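
For anyone who wants to run the same check without the Grafana dashboard, here is a minimal Python sketch that parses Prometheus text-format output and lists the power series that report exactly 0. It assumes the text was already fetched from the exporter's metrics endpoint, and that samples carry no trailing timestamp. The `EXAMPLE` payload and the helper names `parse_samples` and `zero_power_series` are hypothetical, made up for illustration; only the metric names come from this thread and from Scaphandre.

```python
from typing import List, Tuple

Sample = Tuple[str, str, float]  # (metric name, raw label block, value)


def parse_samples(text: str) -> List[Sample]:
    """Parse Prometheus text-format lines, skipping comments and blanks.

    Assumption: no trailing timestamps, so the value is the last field.
    """
    samples: List[Sample] = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        series, _, value = line.rpartition(" ")
        name, brace, labels = series.partition("{")
        samples.append((name, brace + labels, float(value)))
    return samples


def zero_power_series(text: str) -> List[str]:
    """Return every *_microwatts series whose value is exactly 0."""
    return [
        name + labels
        for name, labels, value in parse_samples(text)
        if name.endswith("_microwatts") and value == 0.0
    ]


# Hypothetical payload, NOT real output from any node in this issue.
EXAMPLE = """\
# TYPE scaph_self_version gauge
scaph_self_version 0.5
scaph_host_power_microwatts 0
scaph_process_power_consumption_microwatts{pid="1"} 1500000
"""

if __name__ == "__main__":
    print(zero_power_series(EXAMPLE))
```

Running it against each node's metrics output would show whether the zeros come from the exporter itself or appear later, in Prometheus or Grafana.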

@eduardogomescampos1
Author

> Which Docker tag are you using, and what is the value of the scaph_self_version metric?
>
> I am using the dev tag and get version 0.5. My metrics for scaph_process_power_consumption_microwatts are fine.

All nodes return 0.5 for this metric. I installed the Helm chart from the dev branch, using the dev tag as well. I also noticed that whenever I run the quick Docker version (as in https://hubblo-org.github.io/scaphandre-documentation/tutorials/installation-linux), it also reports 0 W on one of the malfunctioning nodes. I suspect the container is not allowed to access the proper files, even though I have disabled all firewalls and run chmod 777 on both /sys/class/powercap and /proc (for testing purposes). I wonder why only one node is able to get the measurements correctly.

[image] Docker quick version output

@eduardogomescampos1
Author

I have now tried running the dev image locally, and there is a warning:
"scaphandre::sensors: Could'nt read record from /sys/class/powercap/intel-rapl:0/energy_uj, error was: Os { code: 2, kind: NotFound, message: "No such file or directory" }"
However, as I have stated before, I have used the chmod -R 777 command on this folder and disabled the firewall. What could be causing this?

@eduardogomescampos1
Author

It is indeed a permission issue. When I came back to the office and ran "kubectl logs 'scaphandre pod'", this time I got a warning message stating:
"scaphandre::sensors: Could'nt read record from /sys/class/powercap/intel-rapl:0/energy_uj, error was: Os { code: 13, kind: PermissionDenied, message: "Permission denied" }"
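
The two warnings seen so far differ only in the OS error: code 2 (NotFound) in the local run and code 13 (PermissionDenied) in the pod. A minimal Python sketch, assuming it is run inside the same container or on the same host as Scaphandre, can tell the two cases apart for the file named in the warning. The function name `probe_energy_file` and its return strings are hypothetical and are not part of Scaphandre.

```python
from pathlib import Path


def probe_energy_file(path: str) -> str:
    """Try to read a powercap energy counter and classify the outcome."""
    try:
        raw = Path(path).read_text()
    except FileNotFoundError:
        # OS error code 2: the file is absent from this mount namespace.
        return "not_found"
    except PermissionError:
        # OS error code 13: the file exists but this user may not read it.
        return "permission_denied"
    except OSError as exc:
        return f"os_error_{exc.errno}"
    # A healthy counter is a plain non-negative integer (microjoules).
    return "ok" if raw.strip().isdigit() else "unparseable"


if __name__ == "__main__":
    # Path taken from the warning message above.
    print(probe_energy_file("/sys/class/powercap/intel-rapl:0/energy_uj"))
```

If the result differs between the host and the pod, the cause is more likely in how the container is started (mounts, user, runtime restrictions) than in the host configuration.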

@eduardogomescampos1
Author

I think it had something to do with the containerd container runtime. In the project I'm taking part in, we decided to change from containerd to CRI-O, and the problem was solved afterwards. All nodes report sensible values now.
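
To correlate the 0 W readings with the runtime, it helps to know which container runtime each node uses. A minimal Python sketch, assuming the output of `kubectl get nodes -o json` has been saved to a string, reads the `status.nodeInfo.containerRuntimeVersion` field of each node. The function name `runtimes_by_node` and the node names and versions in `EXAMPLE` are hypothetical.

```python
import json
from typing import Dict


def runtimes_by_node(nodes_json: str) -> Dict[str, str]:
    """Map node name -> container runtime string from a Kubernetes NodeList."""
    doc = json.loads(nodes_json)
    return {
        item["metadata"]["name"]: item["status"]["nodeInfo"]["containerRuntimeVersion"]
        for item in doc.get("items", [])
    }


# Hypothetical NodeList excerpt, NOT taken from the cluster in this issue.
EXAMPLE = json.dumps({
    "items": [
        {"metadata": {"name": "node-a"},
         "status": {"nodeInfo": {"containerRuntimeVersion": "containerd://1.7.2"}}},
        {"metadata": {"name": "node-b"},
         "status": {"nodeInfo": {"containerRuntimeVersion": "cri-o://1.28.1"}}},
    ]
})

if __name__ == "__main__":
    for node, runtime in runtimes_by_node(EXAMPLE).items():
        print(node, runtime)
```

Comparing this mapping with the list of nodes that report 0 W would show whether the problem really follows the runtime.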

@bpetit bpetit added this to General Jun 19, 2024
@bpetit bpetit moved this to Triage in General Jun 19, 2024
@bpetit
Contributor

bpetit commented Oct 17, 2024

Hi, this seems related to #391, which was merged into dev a few days ago.

If anyone wants to give it a try with a containerd runtime, that would be interesting.

> I have now tried running the dev image locally, and there is a warning: "scaphandre::sensors: Could'nt read record from /sys/class/powercap/intel-rapl:0/energy_uj, error was: Os { code: 2, kind: NotFound, message: "No such file or directory" }" However, as I have stated before, I have used the chmod -R 777 command on this folder and disabled the firewall. What could be causing this?

This would be related to an intel-rapl module issue, not to Scaphandre itself.

> It is indeed a permission issue. When I came back to the office and ran "kubectl logs 'scaphandre pod'", this time I got a warning message stating:
> "scaphandre::sensors: Could'nt read record from /sys/class/powercap/intel-rapl:0/energy_uj, error was: Os { code: 13, kind: PermissionDenied, message: "Permission denied" }"

This would be related (probably) to #391

Projects
Status: Triage
Development

No branches or pull requests

3 participants