Exporter SIGSEGV on Ubuntu 20 hosts (release-0.7.10+) #1866

Open
Robbie558 opened this issue Nov 28, 2024 · 2 comments
Labels
kind/bug report bug issue

Comments

@Robbie558

What happened?

Kepler fails on all Ubuntu 20 hosts in my K8s cluster, producing the following logs:

$ kubectl logs -n monitoring kepler-6hrms
WARNING: failed to read int from file: open /sys/devices/system/cpu/cpu0/online: no such file or directory
I1128 17:15:12.843579       1 exporter.go:103] Kepler running on version: v0.7.12-dirty
I1128 17:15:12.844340       1 config.go:293] using gCgroup ID in the BPF program: true
I1128 17:15:12.844406       1 config.go:295] kernel version: 5.4
I1128 17:15:12.844693       1 power.go:78] Unable to obtain power, use estimate method
I1128 17:15:12.844720       1 redfish.go:169] failed to get redfish credential file path
I1128 17:15:12.853436       1 acpi.go:71] Could not find any ACPI power meter path. Is it a VM?
I1128 17:15:12.853459       1 power.go:79] using none to obtain power
E1128 17:15:12.853478       1 accelerator.go:154] [DUMMY] doesn't contain GPU
E1128 17:15:12.853507       1 exporter.go:154] failed to init GPU accelerators: no devices found
WARNING: failed to read int from file: open /sys/devices/system/cpu/cpu0/online: no such file or directory
I1128 17:15:12.854860       1 exporter.go:84] Number of CPUs: 2
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x87b273]

goroutine 1 [running]:
github.com/sustainable-computing-io/kepler/pkg/bpf.(*hardwarePerfEvents).close(0x0)
	/workspace/pkg/bpf/exporter.go:274 +0x13
github.com/sustainable-computing-io/kepler/pkg/bpf.(*exporter).Detach(0xc0001a4000)
	/workspace/pkg/bpf/exporter.go:195 +0x15a
github.com/sustainable-computing-io/kepler/pkg/bpf.NewExporter()
	/workspace/pkg/bpf/exporter.go:58 +0x13e
main.main()
	/workspace/cmd/exporter/exporter.go:159 +0x86b

Pods run as expected on Ubuntu 22 hosts in the same cluster.
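
Reading the trace above: `(*hardwarePerfEvents).close` is called with a nil receiver (the `0x0` argument), reached via `(*exporter).Detach` from `NewExporter`. The sketch below shows how a nil-receiver guard avoids this class of panic. It is not Kepler's code: the struct fields are hypothetical, and the idea that `Detach` runs on a cleanup path before the perf events were ever opened is an assumption drawn only from the call order in the trace.

```go
package main

import "fmt"

// hardwarePerfEvents is a hypothetical stand-in for the type named in the
// stack trace; the fds field is illustrative and not taken from Kepler.
type hardwarePerfEvents struct {
	fds []int
}

// close tolerates a nil receiver, so callers on cleanup paths do not need
// to know whether the perf events were ever opened.
func (h *hardwarePerfEvents) close() {
	if h == nil {
		return
	}
	h.fds = nil
}

// exporter is likewise a hypothetical stand-in. hwEvents stays nil when
// hardware perf events could not be set up (assumed to be the case on the
// affected hosts).
type exporter struct {
	hwEvents *hardwarePerfEvents
}

// Detach releases resources; with the guard in close, a nil hwEvents is fine.
func (e *exporter) Detach() {
	e.hwEvents.close()
	e.hwEvents = nil
}

func main() {
	e := &exporter{} // hwEvents was never initialised
	e.Detach()       // does not panic thanks to the nil check
	fmt.Println("detached without panic")
}
```

In Go, calling a method on a nil pointer is legal; the panic only occurs once the method dereferences the receiver, which is why a guard at the top of `close` is enough in this sketch.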

What did you expect to happen?

Kepler runs on Ubuntu 20 hosts

How can we reproduce it (as minimally and precisely as possible)?

Install the latest version via Helm on a cluster with virtualised Ubuntu 20 nodes.

Anything else we need to know?

Virtualised hosts running on Hyper-V

Kepler image tag

```console
quay.io/sustainable_computing_io/kepler:release-0.7.12
```

Kubernetes version

Server Version: v1.31.2

Cloud provider or bare metal

OS version

$ cat /etc/os-release
NAME="Ubuntu"
VERSION="20.04.4 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.4 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal

$ uname -a
Linux fh1-kubet01 5.4.0-200-generic #220-Ubuntu SMP Fri Sep 27 13:19:16 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux


### Install tools

<details>
helm
</details>


### Kepler deployment config

<details>

On Kubernetes:
```console
$ KEPLER_NAMESPACE=monitoring

$ kubectl describe ds -n monitoring kepler
Name:           kepler
Selector:       app.kubernetes.io/component=exporter,app.kubernetes.io/name=kepler
Node-Selector:  kubernetes.io/os=linux
Labels:         app.kubernetes.io/component=exporter
                app.kubernetes.io/managed-by=Helm
                app.kubernetes.io/name=kepler
                app.kubernetes.io/version=release-0.7.12
                helm.sh/chart=kepler-0.5.11
Annotations:    deprecated.daemonset.template.generation: 1
                meta.helm.sh/release-name: kepler
                meta.helm.sh/release-namespace: monitoring
Desired Number of Nodes Scheduled: 7
Current Number of Nodes Scheduled: 7
Number of Nodes Scheduled with Up-to-date Pods: 7
Number of Nodes Scheduled with Available Pods: 1
Number of Nodes Misscheduled: 0
Pods Status:  7 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:           app.kubernetes.io/component=exporter
                    app.kubernetes.io/name=kepler
  Service Account:  kepler
  Containers:
   kepler-exporter:
    Image:      quay.io/sustainable_computing_io/kepler:release-0.7.12
    Port:       9102/TCP
    Host Port:  9102/TCP
    Args:
      -v=$(KEPLER_LOG_LEVEL)
    Liveness:  http-get http://:9102/healthz delay=10s timeout=10s period=60s #success=1 #failure=5
    Environment:
      NODE_IP:                      (v1:status.hostIP)
      NODE_NAME:                    (v1:spec.nodeName)
      METRIC_PATH:                 /metrics
      BIND_ADDRESS:                0.0.0.0:9102
      CGROUP_METRICS:              *
      CPU_ARCH_OVERRIDE:           
      ENABLE_EBPF_CGROUPID:        true
      ENABLE_GPU:                  true
      ENABLE_PROCESS_METRICS:      false
      ENABLE_QAT:                  false
      EXPOSE_CGROUP_METRICS:       false
      EXPOSE_HW_COUNTER_METRICS:   true
      EXPOSE_IRQ_COUNTER_METRICS:  true
      KEPLER_LOG_LEVEL:            1
    Mounts:
      /lib/modules from lib-modules (rw)
      /proc from proc (rw)
      /sys from tracing (rw)
      /usr/src from usr-src (rw)
  Volumes:
   lib-modules:
    Type:          HostPath (bare host directory volume)
    Path:          /lib/modules
    HostPathType:  DirectoryOrCreate
   tracing:
    Type:          HostPath (bare host directory volume)
    Path:          /sys
    HostPathType:  Directory
   proc:
    Type:          HostPath (bare host directory volume)
    Path:          /proc
    HostPathType:  Directory
   usr-src:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/src
    HostPathType:  Directory
```

</details>


### Container runtime (CRI) and version (if applicable)

<details>
containerd://1.7.12
</details>


### Related plugins (CNI, CSI, ...) and versions (if applicable)

<details>
CNI - Flannel
</details>
@Robbie558 Robbie558 added the kind/bug report bug issue label Nov 28, 2024
@Robbie558
Author

The issue appears to have been introduced in release-0.7.10: I can work around it by downgrading the Kepler image to release-0.7.8.

@Robbie558
Author

Robbie558 commented Nov 29, 2024

Possibly similar to issue #636

@Robbie558 Robbie558 changed the title Exporter SIGSEGV on Ubuntu 20 Exporter SIGSEGV on Ubuntu 20 hosts (release-0.7.10+) Nov 29, 2024