Exporter SIGSEGV on Ubuntu 20 hosts (release-0.7.10+) #1866

Robbie558 · 2024-11-28T17:29:00Z

What happened?

Kepler fails on all Ubuntu 20 hosts in my K8s cluster, producing the following logs:

$ kubectl logs -n monitoring kepler-6hrms
WARNING: failed to read int from file: open /sys/devices/system/cpu/cpu0/online: no such file or directory
I1128 17:15:12.843579       1 exporter.go:103] Kepler running on version: v0.7.12-dirty
I1128 17:15:12.844340       1 config.go:293] using gCgroup ID in the BPF program: true
I1128 17:15:12.844406       1 config.go:295] kernel version: 5.4
I1128 17:15:12.844693       1 power.go:78] Unable to obtain power, use estimate method
I1128 17:15:12.844720       1 redfish.go:169] failed to get redfish credential file path
I1128 17:15:12.853436       1 acpi.go:71] Could not find any ACPI power meter path. Is it a VM?
I1128 17:15:12.853459       1 power.go:79] using none to obtain power
E1128 17:15:12.853478       1 accelerator.go:154] [DUMMY] doesn't contain GPU
E1128 17:15:12.853507       1 exporter.go:154] failed to init GPU accelerators: no devices found
WARNING: failed to read int from file: open /sys/devices/system/cpu/cpu0/online: no such file or directory
I1128 17:15:12.854860       1 exporter.go:84] Number of CPUs: 2
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x87b273]

goroutine 1 [running]:
github.com/sustainable-computing-io/kepler/pkg/bpf.(*hardwarePerfEvents).close(0x0)
	/workspace/pkg/bpf/exporter.go:274 +0x13
github.com/sustainable-computing-io/kepler/pkg/bpf.(*exporter).Detach(0xc0001a4000)
	/workspace/pkg/bpf/exporter.go:195 +0x15a
github.com/sustainable-computing-io/kepler/pkg/bpf.NewExporter()
	/workspace/pkg/bpf/exporter.go:58 +0x13e
main.main()
	/workspace/cmd/exporter/exporter.go:159 +0x86b

Pods run as expected on Ubuntu 22 hosts in the same cluster.
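
The trace above shows `close` being called with a nil receiver (`(*hardwarePerfEvents).close(0x0)`) from `Detach`, which in turn is called from `NewExporter`. A minimal sketch of that failure shape follows, assuming `NewExporter` runs its cleanup before the perf events were ever opened. The names `perfEvents`, `exporter`, `newExporter`, `detach`, `closeUnguarded`, and `closeGuarded` are hypothetical stand-ins, not Kepler's actual code, and the nil check is one possible mitigation rather than the project's fix.

```go
package main

import (
	"errors"
	"fmt"
)

// perfEvents is a hypothetical stand-in for a set of opened perf event handles.
type perfEvents struct {
	fds []int
}

// closeUnguarded reads a field through the receiver without a nil check,
// so it panics when the receiver is a nil *perfEvents.
func (p *perfEvents) closeUnguarded() int {
	return len(p.fds)
}

// closeGuarded returns early on a nil receiver, so cleanup is safe even
// when the perf events were never opened.
func (p *perfEvents) closeGuarded() int {
	if p == nil {
		return 0
	}
	return len(p.fds)
}

// exporter is a hypothetical stand-in; hw stays nil until perf events are opened.
type exporter struct {
	hw *perfEvents
}

// detach releases whatever was set up, using the guarded or unguarded close.
func (e *exporter) detach(guarded bool) int {
	if guarded {
		return e.hw.closeGuarded()
	}
	return e.hw.closeUnguarded()
}

// newExporter simulates a constructor whose attach step fails early and
// which then cleans up a partially initialised exporter.
func newExporter(guarded bool) (e *exporter, err error) {
	// Convert a panic during cleanup into an error so both cases can be compared.
	defer func() {
		if r := recover(); r != nil {
			e, err = nil, fmt.Errorf("cleanup panicked: %v", r)
		}
	}()
	e = &exporter{} // hw is still nil at this point
	attachErr := errors.New("attach failed") // assumed early failure
	e.detach(guarded)
	return nil, attachErr
}

func main() {
	_, err := newExporter(false)
	fmt.Println("unguarded:", err)
	_, err = newExporter(true)
	fmt.Println("guarded:", err)
}
```

With the guard in place, the original attach error would reach the logs instead of being masked by a SIGSEGV during cleanup.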

What did you expect to happen?

Kepler runs on Ubuntu 20 hosts

How can we reproduce it (as minimally and precisely as possible)?

Install the latest version via Helm on a cluster with virtualised Ubuntu 20 nodes.

Anything else we need to know?

Virtualised hosts running on Hyper-V

Kepler image tag

quay.io/sustainable_computing_io/kepler:release-0.7.12

Kubernetes version

Server Version: v1.31.2

Cloud provider or bare metal

OS version

$ cat /etc/os-release
NAME="Ubuntu"
VERSION="20.04.4 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.4 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal

$ uname -a
Linux fh1-kubet01 5.4.0-200-generic #220-Ubuntu SMP Fri Sep 27 13:19:16 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux


### Install tools

<details>
helm
</details>


### Kepler deployment config

<details>

On Kubernetes:
```console
$ KEPLER_NAMESPACE=monitoring

$ kubectl describe ds -n monitoring kepler
Name:           kepler
Selector:       app.kubernetes.io/component=exporter,app.kubernetes.io/name=kepler
Node-Selector:  kubernetes.io/os=linux
Labels:         app.kubernetes.io/component=exporter
                app.kubernetes.io/managed-by=Helm
                app.kubernetes.io/name=kepler
                app.kubernetes.io/version=release-0.7.12
                helm.sh/chart=kepler-0.5.11
Annotations:    deprecated.daemonset.template.generation: 1
                meta.helm.sh/release-name: kepler
                meta.helm.sh/release-namespace: monitoring
Desired Number of Nodes Scheduled: 7
Current Number of Nodes Scheduled: 7
Number of Nodes Scheduled with Up-to-date Pods: 7
Number of Nodes Scheduled with Available Pods: 1
Number of Nodes Misscheduled: 0
Pods Status:  7 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:           app.kubernetes.io/component=exporter
                    app.kubernetes.io/name=kepler
  Service Account:  kepler
  Containers:
   kepler-exporter:
    Image:      quay.io/sustainable_computing_io/kepler:release-0.7.12
    Port:       9102/TCP
    Host Port:  9102/TCP
    Args:
      -v=$(KEPLER_LOG_LEVEL)
    Liveness:  http-get http://:9102/healthz delay=10s timeout=10s period=60s #success=1 #failure=5
    Environment:
      NODE_IP:                      (v1:status.hostIP)
      NODE_NAME:                    (v1:spec.nodeName)
      METRIC_PATH:                 /metrics
      BIND_ADDRESS:                0.0.0.0:9102
      CGROUP_METRICS:              *
      CPU_ARCH_OVERRIDE:           
      ENABLE_EBPF_CGROUPID:        true
      ENABLE_GPU:                  true
      ENABLE_PROCESS_METRICS:      false
      ENABLE_QAT:                  false
      EXPOSE_CGROUP_METRICS:       false
      EXPOSE_HW_COUNTER_METRICS:   true
      EXPOSE_IRQ_COUNTER_METRICS:  true
      KEPLER_LOG_LEVEL:            1
    Mounts:
      /lib/modules from lib-modules (rw)
      /proc from proc (rw)
      /sys from tracing (rw)
      /usr/src from usr-src (rw)
  Volumes:
   lib-modules:
    Type:          HostPath (bare host directory volume)
    Path:          /lib/modules
    HostPathType:  DirectoryOrCreate
   tracing:
    Type:          HostPath (bare host directory volume)
    Path:          /sys
    HostPathType:  Directory
   proc:
    Type:          HostPath (bare host directory volume)
    Path:          /proc
    HostPathType:  Directory
   usr-src:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/src
    HostPathType:  Directory
```

</details>


### Container runtime (CRI) and version (if applicable)

<details>
containerd://1.7.12
</details>


### Related plugins (CNI, CSI, ...) and versions (if applicable)

<details>
CNI - Flannel
</details>


Robbie558 · 2024-11-29T16:22:41Z

The issue appears to have been introduced in release-0.7.10: I can work around it by downgrading the Kepler image to release-0.7.8.

Robbie558 · 2024-11-29T16:23:32Z

Possibly similar to issue #636

Robbie558 added the kind/bug label Nov 28, 2024

Robbie558 changed the title ~~Exporter SIGSEGV on Ubuntu 20~~ Exporter SIGSEGV on Ubuntu 20 hosts (release-0.7.10+) Nov 29, 2024
