
ARM support for AWS Graviton v2, v3 #1556

Closed
jjo opened this issue Jun 18, 2024 · 12 comments
Labels: kind/bug

Comments


jjo commented Jun 18, 2024

What would you like to be added?

ARM support for AWS Graviton v2, v3

Why is this needed?

Expanding Kepler support to more ARM architectures would be very beneficial; otherwise we'd be limiting our energy observability features to x86-64 only, leaving out architectures that are commonly used in some cloud providers. For AWS these are Graviton2 in the x6g instance families (c6g, m6g, etc.) and Graviton3 in the x7g families (c7g, m7g, etc.).

Worth noting the previous discussion at #482 (comment).

I tried deploying kepler-0.7.10 on an m6g instance; details below:

[root@ip-10-60-5-69 ~]# lscpu 
Architecture:        aarch64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              32
On-line CPU(s) list: 0-31
Thread(s) per core:  1
Core(s) per socket:  32
Socket(s):           1
NUMA node(s):        1
Vendor ID:           ARM
Model:               1
Model name:          Neoverse-N1
Stepping:            r3p1
BogoMIPS:            243.75
L1d cache:           64K
L1i cache:           64K
L2 cache:            1024K
L3 cache:            32768K
NUMA node0 CPU(s):   0-31
Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
[root@ip-10-60-5-69 ~]# uname -r
5.10.215-203.850.amzn2.aarch64

It crashed with the log tail below (full log at https://0x0.st/XTMr.txt):

[...]
libbpf: failed to open '/sys/kernel/debug/tracing/events/writeback/writeback_dirty_folio/id': No such file or directory
libbpf: failed to determine tracepoint 'writeback/writeback_dirty_folio' perf event ID: No such file or directory
libbpf: prog 'kepler_write_page_trace': failed to create tracepoint 'writeback/writeback_dirty_folio' perf event: No such file or directory
W0618 20:16:47.934582       1 exporter.go:215] failed to attach tp/writeback/writeback_dirty_folio: failed to attach tracepoint writeback_dirty_folio to program kepler_write_page_trace: no such file or directory. Kepler will not collect page cache write events. This will affect the DRAM power model estimation on VMs.
libbpf: prog 'kepler_read_page_trace': failed to attach: ERROR: strerror_r(-524)=22
W0618 20:16:47.934648       1 exporter.go:227] failed to attach fentry/mark_page_accessed: failed to attach program: errno 524. Kepler will not collect page cache read events. This will affect the DRAM power model estimation on VMs.
I0618 20:16:48.149175       1 exporter.go:270] Successfully load eBPF module from libbpf object
I0618 20:16:48.149206       1 exporter.go:116] Initializing the GPU collector
I0618 20:16:48.149517       1 watcher.go:67] Using in cluster k8s config
I0618 20:16:48.650362       1 watcher.go:138] k8s APIserver watcher was started
I0618 20:16:48.650507       1 prometheus_collector.go:92] Registered Container Prometheus metrics
I0618 20:16:48.650548       1 prometheus_collector.go:97] Registered VM Prometheus metrics
I0618 20:16:48.650569       1 prometheus_collector.go:101] Registered Node Prometheus metrics
panic: runtime error: invalid memory address or nil pointer dereference

/cc @nikimanoledaki

jjo added the kind/feature label on Jun 18, 2024
dave-tucker (Collaborator) commented

@jjo if you still have the environment available, could you test with Kepler v0.7.11?
I'm hopeful that this has fixed a couple of issues in ARM64 environments.

petewall commented

I had replicated this with a Graviton node. Let me see if I can do it again with 0.7.11.


jjo commented Jul 11, 2024

> @jjo if you still have the environment available, could you test with Kepler v0.7.11? I'm hopeful that this has fixed a couple of issues in ARM64 environments.

awesome, lemme check and report back

petewall commented

It feels like it gets further, but still crashes:

libbpf: map 'cpu_freq_array': at sec_idx 13, offset 320.
libbpf: map 'cpu_freq_array': found type = 2.
libbpf: map 'cpu_freq_array': found key [6], sz = 4.
libbpf: map 'cpu_freq_array': found value [6], sz = 4.
libbpf: map 'cpu_freq_array': found max_entries = 128.
libbpf: map '.rodata.config' (global data): at sec_idx 11, offset 0, flags 80.
libbpf: map 11 is ".rodata.config"
libbpf: map 'kepler.bss' (global data): at sec_idx 12, offset 0, flags 400.
libbpf: map 12 is "kepler.bss"
libbpf: sec '.reltp/sched/sched_switch': collecting relocation for section(3) 'tp/sched/sched_switch'
libbpf: sec '.reltp/sched/sched_switch': relo #0: insn #0 against '.rodata.config'
libbpf: prog 'kepler_sched_switch_trace': found data map 11 (.rodata.config, sec 11, off 0) for insn 0
libbpf: sec '.reltp/sched/sched_switch': relo #1: insn #7 against 'counter_sched_switch'
libbpf: prog 'kepler_sched_switch_trace': found data map 12 (kepler.bss, sec 12, off 0) for insn 7
libbpf: sec '.reltp/sched/sched_switch': relo #2: insn #16 against '.rodata.config'
libbpf: prog 'kepler_sched_switch_trace': found data map 11 (.rodata.config, sec 11, off 0) for insn 16
libbpf: sec '.reltp/sched/sched_switch': relo #3: insn #41 against 'cpu_cycles_event_reader'
libbpf: prog 'kepler_sched_switch_trace': found map 2 (cpu_cycles_event_reader, sec 13, off 64) for insn #41
libbpf: sec '.reltp/sched/sched_switch': relo #4: insn #53 against 'cpu_cycles'
libbpf: prog 'kepler_sched_switch_trace': found map 3 (cpu_cycles, sec 13, off 96) for insn #53
libbpf: sec '.reltp/sched/sched_switch': relo #5: insn #67 against 'cpu_cycles'
libbpf: prog 'kepler_sched_switch_trace': found map 3 (cpu_cycles, sec 13, off 96) for insn #67
libbpf: sec '.reltp/sched/sched_switch': relo #6: insn #77 against 'cpu_instructions_event_reader'
libbpf: prog 'kepler_sched_switch_trace': found map 4 (cpu_instructions_event_reader, sec 13, off 128) for insn #77
libbpf: sec '.reltp/sched/sched_switch': relo #7: insn #87 against 'cpu_instructions'
libbpf: prog 'kepler_sched_switch_trace': found map 5 (cpu_instructions, sec 13, off 160) for insn #87
libbpf: sec '.reltp/sched/sched_switch': relo #8: insn #99 against 'cpu_instructions'
libbpf: prog 'kepler_sched_switch_trace': found map 5 (cpu_instructions, sec 13, off 160) for insn #99
libbpf: sec '.reltp/sched/sched_switch': relo #9: insn #111 against 'cache_miss_event_reader'
libbpf: prog 'kepler_sched_switch_trace': found map 6 (cache_miss_event_reader, sec 13, off 192) for insn #111
libbpf: sec '.reltp/sched/sched_switch': relo #10: insn #122 against 'cache_miss'
libbpf: prog 'kepler_sched_switch_trace': found map 7 (cache_miss, sec 13, off 224) for insn #122
libbpf: sec '.reltp/sched/sched_switch': relo #11: insn #136 against 'cache_miss'
libbpf: prog 'kepler_sched_switch_trace': found map 7 (cache_miss, sec 13, off 224) for insn #136
libbpf: sec '.reltp/sched/sched_switch': relo #12: insn #145 against 'pid_time'
libbpf: prog 'kepler_sched_switch_trace': found map 1 (pid_time, sec 13, off 32) for insn #145
libbpf: sec '.reltp/sched/sched_switch': relo #13: insn #153 against 'pid_time'
libbpf: prog 'kepler_sched_switch_trace': found map 1 (pid_time, sec 13, off 32) for insn #153
libbpf: sec '.reltp/sched/sched_switch': relo #14: insn #165 against 'pid_time'
libbpf: prog 'kepler_sched_switch_trace': found map 1 (pid_time, sec 13, off 32) for insn #165
libbpf: sec '.reltp/sched/sched_switch': relo #15: insn #171 against 'processes'
libbpf: prog 'kepler_sched_switch_trace': found map 0 (processes, sec 13, off 0) for insn #171
libbpf: sec '.reltp/sched/sched_switch': relo #16: insn #195 against 'processes'
libbpf: prog 'kepler_sched_switch_trace': found map 0 (processes, sec 13, off 0) for insn #195
libbpf: sec '.reltp/sched/sched_switch': relo #17: insn #227 against 'processes'
libbpf: prog 'kepler_sched_switch_trace': found map 0 (processes, sec 13, off 0) for insn #227
libbpf: sec '.reltp/irq/softirq_entry': collecting relocation for section(5) 'tp/irq/softirq_entry'
libbpf: sec '.reltp/irq/softirq_entry': relo #0: insn #6 against 'processes'
libbpf: prog 'kepler_irq_trace': found map 0 (processes, sec 13, off 0) for insn #6
libbpf: sec '.relfexit/mark_page_accessed': collecting relocation for section(7) 'fexit/mark_page_accessed'
libbpf: sec '.relfexit/mark_page_accessed': relo #0: insn #4 against 'processes'
libbpf: prog 'kepler_read_page_trace': found map 0 (processes, sec 13, off 0) for insn #4
libbpf: sec '.reltp/writeback/writeback_dirty_folio': collecting relocation for section(9) 'tp/writeback/writeback_dirty_folio'
libbpf: sec '.reltp/writeback/writeback_dirty_folio': relo #0: insn #4 against 'processes'
libbpf: prog 'kepler_write_page_trace': found map 0 (processes, sec 13, off 0) for insn #4
I0711 13:26:45.390256       1 exporter.go:158] 1 CPU cores detected. Resizing eBPF Perf Event Arrays
libbpf: loading kernel BTF '/sys/kernel/btf/vmlinux': 0
libbpf: map 'processes': created successfully, fd=8
libbpf: map 'pid_time': created successfully, fd=9
libbpf: map 'cpu_cycles_event_reader': created successfully, fd=10
libbpf: map 'cpu_cycles': created successfully, fd=11
libbpf: map 'cpu_instructions_event_reader': created successfully, fd=12
libbpf: map 'cpu_instructions': created successfully, fd=13
libbpf: map 'cache_miss_event_reader': created successfully, fd=14
libbpf: map 'cache_miss': created successfully, fd=15
libbpf: map 'task_clock_ms_event_reader': created successfully, fd=16
libbpf: map 'task_clock': created successfully, fd=17
libbpf: map 'cpu_freq_array': created successfully, fd=18
libbpf: map '.rodata.config': created successfully, fd=19
libbpf: map 'kepler.bss': created successfully, fd=20
libbpf: failed to open '/sys/kernel/debug/tracing/events/writeback/writeback_dirty_folio/id': No such file or directory
libbpf: failed to determine tracepoint 'writeback/writeback_dirty_folio' perf event ID: No such file or directory
libbpf: prog 'kepler_write_page_trace': failed to create tracepoint 'writeback/writeback_dirty_folio' perf event: No such file or directory
W0711 13:26:45.440663       1 exporter.go:215] failed to attach tp/writeback/writeback_dirty_folio: failed to attach tracepoint writeback_dirty_folio to program kepler_write_page_trace: no such file or directory. Kepler will not collect page cache write events. This will affect the DRAM power model estimation on VMs.
libbpf: prog 'kepler_read_page_trace': failed to attach: ERROR: strerror_r(-524)=22
W0711 13:26:45.440798       1 exporter.go:227] failed to attach fentry/mark_page_accessed: failed to attach program: errno 524. Kepler will not collect page cache read events. This will affect the DRAM power model estimation on VMs.
I0711 13:26:45.441110       1 exporter.go:270] Successfully load eBPF module from libbpf object
I0711 13:26:45.441125       1 exporter.go:116] Initializing the GPU collector
I0711 13:26:45.441461       1 watcher.go:67] Using in cluster k8s config
I0711 13:26:45.557488       1 watcher.go:138] k8s APIserver watcher was started
I0711 13:26:45.557760       1 prometheus_collector.go:92] Registered Container Prometheus metrics
I0711 13:26:45.557960       1 prometheus_collector.go:97] Registered VM Prometheus metrics
I0711 13:26:45.558094       1 prometheus_collector.go:101] Registered Node Prometheus metrics
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x7fe29c]

goroutine 34 [running]:
github.com/sustainable-computing-io/kepler/pkg/collector/stats/types.(*UInt64StatCollection).SumAllAggrValues(0x15f5ec0?)
    /workspace/pkg/collector/stats/types/types.go:144 +0x1c
github.com/sustainable-computing-io/kepler/pkg/metrics/utils.CollectResUtil(0x15f8c80?, {0x16c58e0?, 0x40003d7810?}, {0x1822498, 0x12}, {0x1a89e18, 0x4000390298})
    /workspace/pkg/metrics/utils/utils.go:141 +0x578
github.com/sustainable-computing-io/kepler/pkg/metrics/utils.CollectResUtilizationMetrics(0x15f8c80?, {0x16c58e0, 0x40003d7810}, 0x4000270370?)
    /workspace/pkg/metrics/utils/utils.go:48 +0x94
github.com/sustainable-computing-io/kepler/pkg/metrics/container.(*collector).Collect(0x40003d0100, 0x40004f7f58?)
    /workspace/pkg/metrics/container/metrics.go:102 +0xf4
github.com/prometheus/client_golang/prometheus.(*Registry).Gather.func1()
    /workspace/vendor/github.com/prometheus/client_golang/prometheus/registry.go:455 +0xd8
created by github.com/prometheus/client_golang/prometheus.(*Registry).Gather in goroutine 1
    /workspace/vendor/github.com/prometheus/client_golang/prometheus/registry.go:547 +0x974
Stream closed EOF for default/kepler-8x9f8 (kepler-exporter)

This is using a c6g.medium machine type, so AWS Graviton2.
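
For what it's worth, the trace is fairly specific: SumAllAggrValues at types.go:144 dereferences a stats entry that apparently was never initialized, plausibly because the hardware-counter setup failed earlier on this kernel. Below is a minimal Go sketch of that failure mode; the type and field names are illustrative stand-ins, not Kepler's actual code:

package main

import "fmt"

// Illustrative stand-ins for the stat collection named in the trace;
// not Kepler's real types.
type uint64Stat struct{ aggr uint64 }

type uint64StatCollection struct {
	stats map[string]*uint64Stat
}

// sumAllAggrValues dereferences every entry, so a nil *uint64Stat left
// behind by a failed counter setup panics exactly like the log above.
func (c *uint64StatCollection) sumAllAggrValues() uint64 {
	var sum uint64
	for _, s := range c.stats {
		sum += s.aggr // nil pointer dereference if s == nil
	}
	return sum
}

func main() {
	c := &uint64StatCollection{stats: map[string]*uint64Stat{"cpu_cycles": nil}}
	defer func() { fmt.Println("recovered:", recover()) }()
	fmt.Println(c.sumAllAggrValues())
}

Whatever v0.7.11 actually changed, a guard along the lines of "if s == nil { continue }" (or initializing every entry up front) would avoid this particular panic.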

petewall commented

Running again with debug logs...

petewall commented

Logs again with log level 5:

I0711 13:36:58.394721       1 gpu.go:38] Trying to initialize GPU collector using dcgm
W0711 13:36:58.395340       1 gpu_dcgm.go:104] There is no DCGM daemon running in the host: libdcgm.so not Found
W0711 13:36:58.395447       1 gpu_dcgm.go:108] Could not start DCGM. Error: libdcgm.so not Found
I0711 13:36:58.395455       1 gpu.go:45] Error initializing dcgm: not able to connect to DCGM: libdcgm.so not Found
I0711 13:36:58.395460       1 gpu.go:38] Trying to initialize GPU collector using nvidia-nvml
I0711 13:36:58.395623       1 gpu.go:45] Error initializing nvidia-nvml: failed to init nvml. ERROR_LIBRARY_NOT_FOUND
I0711 13:36:58.395720       1 gpu.go:38] Trying to initialize GPU collector using dummy
I0711 13:36:58.395780       1 gpu.go:42] Using dummy to obtain gpu power
E0711 13:36:58.399807       1 utils.go:139] getCPUArch failure: open /sys/devices/cpu/caps/pmu_name: no such file or directory
I0711 13:36:58.399909       1 exporter.go:85] Kepler running on version: release-0.7.10
I0711 13:36:58.399921       1 config.go:283] using gCgroup ID in the BPF program: true
I0711 13:36:58.399944       1 config.go:285] kernel version: 5.1
I0711 13:36:58.399978       1 config.go:310] The Idle power will be exposed. Are you running on Baremetal or using single VM per node?
I0711 13:36:58.399987       1 config.go:151] ENABLE_EBPF_CGROUPID: false
I0711 13:36:58.399990       1 config.go:152] ENABLE_GPU: true
I0711 13:36:58.399994       1 config.go:153] ENABLE_QAT: false
I0711 13:36:58.399998       1 config.go:154] ENABLE_PROCESS_METRICS: false
I0711 13:36:58.400002       1 config.go:155] EXPOSE_HW_COUNTER_METRICS: true
I0711 13:36:58.400007       1 config.go:156] EXPOSE_CGROUP_METRICS: false
I0711 13:36:58.400011       1 config.go:157] EXPOSE_IRQ_COUNTER_METRICS: true
I0711 13:36:58.400016       1 config.go:158] EXPOSE_ESTIMATED_IDLE_POWER_METRICS: true. This only impacts when the power is estimated using pre-prained models. Estimated idle power is meaningful only when Kepler is running on bare-metal or with a single virtual machine (VM) on the node.
I0711 13:36:58.400023       1 config.go:159] EXPERIMENTAL_BPF_SAMPLE_RATE: 0
I0711 13:36:58.400051       1 rapl_msr_util.go:129] failed to open path /dev/cpu/0/msr: no such file or directory
I0711 13:36:58.400080       1 power.go:72] Unable to obtain power, use estimate method
I0711 13:36:58.400086       1 redfish.go:169] failed to get redfish credential file path
I0711 13:36:58.400686       1 acpi.go:71] Could not find any ACPI power meter path. Is it a VM?
I0711 13:36:58.400700       1 power.go:73] using none to obtain power
libbpf: loading /var/lib/kepler/bpfassets/kepler.bpfel.o
libbpf: elf: section(3) tp/sched/sched_switch, size 1864, link 0, flags 6, type=1
libbpf: sec 'tp/sched/sched_switch': found program 'kepler_sched_switch_trace' at insn offset 0 (0 bytes), code size 233 insns (1864 bytes)
libbpf: elf: section(4) .reltp/sched/sched_switch, size 288, link 32, flags 40, type=9
libbpf: elf: section(5) tp/irq/softirq_entry, size 144, link 0, flags 6, type=1
libbpf: sec 'tp/irq/softirq_entry': found program 'kepler_irq_trace' at insn offset 0 (0 bytes), code size 18 insns (144 bytes)
libbpf: elf: section(6) .reltp/irq/softirq_entry, size 16, link 32, flags 40, type=9
libbpf: elf: section(7) fexit/mark_page_accessed, size 104, link 0, flags 6, type=1
libbpf: sec 'fexit/mark_page_accessed': found program 'kepler_read_page_trace' at insn offset 0 (0 bytes), code size 13 insns (104 bytes)
libbpf: elf: section(8) .relfexit/mark_page_accessed, size 16, link 32, flags 40, type=9
libbpf: elf: section(9) tp/writeback/writeback_dirty_folio, size 104, link 0, flags 6, type=1
libbpf: sec 'tp/writeback/writeback_dirty_folio': found program 'kepler_write_page_trace' at insn offset 0 (0 bytes), code size 13 insns (104 bytes)
libbpf: elf: section(10) .reltp/writeback/writeback_dirty_folio, size 16, link 32, flags 40, type=9
libbpf: elf: section(11) .rodata.config, size 12, link 0, flags 2, type=1
libbpf: elf: section(12) .bss, size 4, link 0, flags 3, type=8
libbpf: elf: section(13) .maps, size 352, link 0, flags 3, type=1
libbpf: elf: section(14) license, size 13, link 0, flags 3, type=1
libbpf: license of /var/lib/kepler/bpfassets/kepler.bpfel.o is Dual BSD/GPL
libbpf: elf: section(23) .BTF, size 5659, link 0, flags 0, type=1
libbpf: elf: section(25) .BTF.ext, size 2072, link 0, flags 0, type=1
libbpf: elf: section(32) .symtab, size 1128, link 1, flags 0, type=2
libbpf: looking for externs among 47 symbols...
libbpf: collected 0 externs total
libbpf: map 'processes': at sec_idx 13, offset 0.
libbpf: map 'processes': found type = 1.
libbpf: map 'processes': found key [6], sz = 4.
libbpf: map 'processes': found value [10], sz = 112.
libbpf: map 'processes': found max_entries = 32768.
libbpf: map 'pid_time': at sec_idx 13, offset 32.
libbpf: map 'pid_time': found type = 1.
libbpf: map 'pid_time': found key [6], sz = 4.
libbpf: map 'pid_time': found value [12], sz = 8.
libbpf: map 'pid_time': found max_entries = 32768.
libbpf: map 'cpu_cycles_event_reader': at sec_idx 13, offset 64.
libbpf: map 'cpu_cycles_event_reader': found type = 4.
libbpf: map 'cpu_cycles_event_reader': found key [2], sz = 4.
libbpf: map 'cpu_cycles_event_reader': found value [6], sz = 4.
libbpf: map 'cpu_cycles_event_reader': found max_entries = 128.
libbpf: map 'cpu_cycles': at sec_idx 13, offset 96.
libbpf: map 'cpu_cycles': found type = 2.
libbpf: map 'cpu_cycles': found key [6], sz = 4.
libbpf: map 'cpu_cycles': found value [12], sz = 8.
libbpf: map 'cpu_cycles': found max_entries = 128.
libbpf: map 'cpu_instructions_event_reader': at sec_idx 13, offset 128.
libbpf: map 'cpu_instructions_event_reader': found type = 4.
libbpf: map 'cpu_instructions_event_reader': found key [2], sz = 4.
libbpf: map 'cpu_instructions_event_reader': found value [6], sz = 4.
libbpf: map 'cpu_instructions_event_reader': found max_entries = 128.
libbpf: map 'cpu_instructions': at sec_idx 13, offset 160.
libbpf: map 'cpu_instructions': found type = 2.
libbpf: map 'cpu_instructions': found key [6], sz = 4.
libbpf: map 'cpu_instructions': found value [12], sz = 8.
libbpf: map 'cpu_instructions': found max_entries = 128.
libbpf: map 'cache_miss_event_reader': at sec_idx 13, offset 192.
libbpf: map 'cache_miss_event_reader': found type = 4.
libbpf: map 'cache_miss_event_reader': found key [2], sz = 4.
libbpf: map 'cache_miss_event_reader': found value [6], sz = 4.
libbpf: map 'cache_miss_event_reader': found max_entries = 128.
libbpf: map 'cache_miss': at sec_idx 13, offset 224.
libbpf: map 'cache_miss': found type = 2.
libbpf: map 'cache_miss': found key [6], sz = 4.
libbpf: map 'cache_miss': found value [12], sz = 8.
libbpf: map 'cache_miss': found max_entries = 128.
libbpf: map 'task_clock_ms_event_reader': at sec_idx 13, offset 256.
libbpf: map 'task_clock_ms_event_reader': found type = 4.
libbpf: map 'task_clock_ms_event_reader': found key [2], sz = 4.
libbpf: map 'task_clock_ms_event_reader': found value [6], sz = 4.
libbpf: map 'task_clock_ms_event_reader': found max_entries = 128.
libbpf: map 'task_clock': at sec_idx 13, offset 288.
libbpf: map 'task_clock': found type = 2.
libbpf: map 'task_clock': found key [6], sz = 4.
libbpf: map 'task_clock': found value [12], sz = 8.
libbpf: map 'task_clock': found max_entries = 128.
libbpf: map 'cpu_freq_array': at sec_idx 13, offset 320.
libbpf: map 'cpu_freq_array': found type = 2.
libbpf: map 'cpu_freq_array': found key [6], sz = 4.
libbpf: map 'cpu_freq_array': found value [6], sz = 4.
libbpf: map 'cpu_freq_array': found max_entries = 128.
libbpf: map '.rodata.config' (global data): at sec_idx 11, offset 0, flags 80.
libbpf: map 11 is ".rodata.config"
libbpf: map 'kepler.bss' (global data): at sec_idx 12, offset 0, flags 400.
libbpf: map 12 is "kepler.bss"
libbpf: sec '.reltp/sched/sched_switch': collecting relocation for section(3) 'tp/sched/sched_switch'
libbpf: sec '.reltp/sched/sched_switch': relo #0: insn #0 against '.rodata.config'
libbpf: prog 'kepler_sched_switch_trace': found data map 11 (.rodata.config, sec 11, off 0) for insn 0
libbpf: sec '.reltp/sched/sched_switch': relo #1: insn #7 against 'counter_sched_switch'
libbpf: prog 'kepler_sched_switch_trace': found data map 12 (kepler.bss, sec 12, off 0) for insn 7
libbpf: sec '.reltp/sched/sched_switch': relo #2: insn #16 against '.rodata.config'
libbpf: prog 'kepler_sched_switch_trace': found data map 11 (.rodata.config, sec 11, off 0) for insn 16
libbpf: sec '.reltp/sched/sched_switch': relo #3: insn #41 against 'cpu_cycles_event_reader'
libbpf: prog 'kepler_sched_switch_trace': found map 2 (cpu_cycles_event_reader, sec 13, off 64) for insn #41
libbpf: sec '.reltp/sched/sched_switch': relo #4: insn #53 against 'cpu_cycles'
libbpf: prog 'kepler_sched_switch_trace': found map 3 (cpu_cycles, sec 13, off 96) for insn #53
libbpf: sec '.reltp/sched/sched_switch': relo #5: insn #67 against 'cpu_cycles'
libbpf: prog 'kepler_sched_switch_trace': found map 3 (cpu_cycles, sec 13, off 96) for insn #67
libbpf: sec '.reltp/sched/sched_switch': relo #6: insn #77 against 'cpu_instructions_event_reader'
libbpf: prog 'kepler_sched_switch_trace': found map 4 (cpu_instructions_event_reader, sec 13, off 128) for insn #77
libbpf: sec '.reltp/sched/sched_switch': relo #7: insn #87 against 'cpu_instructions'
libbpf: prog 'kepler_sched_switch_trace': found map 5 (cpu_instructions, sec 13, off 160) for insn #87
libbpf: sec '.reltp/sched/sched_switch': relo #8: insn #99 against 'cpu_instructions'
libbpf: prog 'kepler_sched_switch_trace': found map 5 (cpu_instructions, sec 13, off 160) for insn #99
libbpf: sec '.reltp/sched/sched_switch': relo #9: insn #111 against 'cache_miss_event_reader'
libbpf: prog 'kepler_sched_switch_trace': found map 6 (cache_miss_event_reader, sec 13, off 192) for insn #111
libbpf: sec '.reltp/sched/sched_switch': relo #10: insn #122 against 'cache_miss'
libbpf: prog 'kepler_sched_switch_trace': found map 7 (cache_miss, sec 13, off 224) for insn #122
libbpf: sec '.reltp/sched/sched_switch': relo #11: insn #136 against 'cache_miss'
libbpf: prog 'kepler_sched_switch_trace': found map 7 (cache_miss, sec 13, off 224) for insn #136
libbpf: sec '.reltp/sched/sched_switch': relo #12: insn #145 against 'pid_time'
libbpf: prog 'kepler_sched_switch_trace': found map 1 (pid_time, sec 13, off 32) for insn #145
libbpf: sec '.reltp/sched/sched_switch': relo #13: insn #153 against 'pid_time'
libbpf: prog 'kepler_sched_switch_trace': found map 1 (pid_time, sec 13, off 32) for insn #153
libbpf: sec '.reltp/sched/sched_switch': relo #14: insn #165 against 'pid_time'
libbpf: prog 'kepler_sched_switch_trace': found map 1 (pid_time, sec 13, off 32) for insn #165
libbpf: sec '.reltp/sched/sched_switch': relo #15: insn #171 against 'processes'
libbpf: prog 'kepler_sched_switch_trace': found map 0 (processes, sec 13, off 0) for insn #171
libbpf: sec '.reltp/sched/sched_switch': relo #16: insn #195 against 'processes'
libbpf: prog 'kepler_sched_switch_trace': found map 0 (processes, sec 13, off 0) for insn #195
libbpf: sec '.reltp/sched/sched_switch': relo #17: insn #227 against 'processes'
libbpf: prog 'kepler_sched_switch_trace': found map 0 (processes, sec 13, off 0) for insn #227
libbpf: sec '.reltp/irq/softirq_entry': collecting relocation for section(5) 'tp/irq/softirq_entry'
libbpf: sec '.reltp/irq/softirq_entry': relo #0: insn #6 against 'processes'
libbpf: prog 'kepler_irq_trace': found map 0 (processes, sec 13, off 0) for insn #6
libbpf: sec '.relfexit/mark_page_accessed': collecting relocation for section(7) 'fexit/mark_page_accessed'
libbpf: sec '.relfexit/mark_page_accessed': relo #0: insn #4 against 'processes'
libbpf: prog 'kepler_read_page_trace': found map 0 (processes, sec 13, off 0) for insn #4
libbpf: sec '.reltp/writeback/writeback_dirty_folio': collecting relocation for section(9) 'tp/writeback/writeback_dirty_folio'
libbpf: sec '.reltp/writeback/writeback_dirty_folio': relo #0: insn #4 against 'processes'
libbpf: prog 'kepler_write_page_trace': found map 0 (processes, sec 13, off 0) for insn #4
I0711 13:36:58.401576       1 exporter.go:158] 1 CPU cores detected. Resizing eBPF Perf Event Arrays
libbpf: loading kernel BTF '/sys/kernel/btf/vmlinux': 0
libbpf: map 'processes': created successfully, fd=8
libbpf: map 'pid_time': created successfully, fd=9
libbpf: map 'cpu_cycles_event_reader': created successfully, fd=10
libbpf: map 'cpu_cycles': created successfully, fd=11
libbpf: map 'cpu_instructions_event_reader': created successfully, fd=12
libbpf: map 'cpu_instructions': created successfully, fd=13
libbpf: map 'cache_miss_event_reader': created successfully, fd=14
libbpf: map 'cache_miss': created successfully, fd=15
libbpf: map 'task_clock_ms_event_reader': created successfully, fd=16
libbpf: map 'task_clock': created successfully, fd=17
libbpf: map 'cpu_freq_array': created successfully, fd=18
libbpf: map '.rodata.config': created successfully, fd=19
libbpf: map 'kepler.bss': created successfully, fd=20
libbpf: failed to open '/sys/kernel/debug/tracing/events/writeback/writeback_dirty_folio/id': No such file or directory
libbpf: failed to determine tracepoint 'writeback/writeback_dirty_folio' perf event ID: No such file or directory
libbpf: prog 'kepler_write_page_trace': failed to create tracepoint 'writeback/writeback_dirty_folio' perf event: No such file or directory
W0711 13:36:58.432355       1 exporter.go:215] failed to attach tp/writeback/writeback_dirty_folio: failed to attach tracepoint writeback_dirty_folio to program kepler_write_page_trace: no such file or directory. Kepler will not collect page cache write events. This will affect the DRAM power model estimation on VMs.
libbpf: prog 'kepler_read_page_trace': failed to attach: ERROR: strerror_r(-524)=22
W0711 13:36:58.432439       1 exporter.go:227] failed to attach fentry/mark_page_accessed: failed to attach program: errno 524. Kepler will not collect page cache read events. This will affect the DRAM power model estimation on VMs.
I0711 13:36:58.432732       1 exporter.go:270] Successfully load eBPF module from libbpf object
I0711 13:36:58.432743       1 utils.go:85] Available ebpf software counters: [bpf_cpu_time_ms bpf_net_tx_irq bpf_net_rx_irq bpf_block_irq]
I0711 13:36:58.432754       1 utils.go:90] Available ebpf hardware counters: [cpu_cycles cpu_instructions cache_miss task_clock_ms]
I0711 13:36:58.432762       1 exporter.go:116] Initializing the GPU collector
I0711 13:36:58.433086       1 watcher.go:67] Using in cluster k8s config
I0711 13:36:58.442909       1 reflector.go:289] Starting reflector *v1.Pod (0s) from pkg/kubernetes/watcher.go:129
I0711 13:36:58.442928       1 reflector.go:325] Listing and watching *v1.Pod from pkg/kubernetes/watcher.go:129
I0711 13:36:58.464043       1 watcher.go:183] Pod datadog-agent-z8gbg default is ready with 3 container statuses, 2 init container status, 0 ephemeral statues
I0711 13:36:58.467485       1 watcher.go:211] receiving container agent datadog-agent-z8gbg default 51e39f910b11f1be4bb692ea71274e68a37ced9db8d3f8a3d21704e1cf7c4a84
I0711 13:36:58.467636       1 watcher.go:211] receiving container process-agent datadog-agent-z8gbg default 9edca06ac3feeee4224a81dfebb32f2280871cb1e9d45c053bc96df077ad0353
I0711 13:36:58.467735       1 watcher.go:211] receiving container trace-agent datadog-agent-z8gbg default 3d3a97fc9dc78d3aefda82561a37748efd0621dd70a9704a27cb743bfaaa6aa7
I0711 13:36:58.468781       1 watcher.go:211] receiving container init-volume datadog-agent-z8gbg default f26b9af07cedea452c6d2d63398d88a935568423265a368c3d06a7c14d390c0e
I0711 13:36:58.468912       1 watcher.go:211] receiving container init-config datadog-agent-z8gbg default dbdda7a34af5361a330e1505f87034405874977857e1bccda9305abb02faddc4
I0711 13:36:58.468992       1 watcher.go:190] parsing pod datadog-agent-z8gbg default status: <nil> <nil> <nil>
I0711 13:36:58.469080       1 watcher.go:183] Pod grafana-k8s-monitoring-prometheus-node-exporter-x57cs default is ready with 1 container statuses, 0 init container status, 0 ephemeral statues
I0711 13:36:58.469171       1 watcher.go:211] receiving container node-exporter grafana-k8s-monitoring-prometheus-node-exporter-x57cs default 487e932daa02bf86c726b6cb510d6e7420ab89a02682e3613158767bd08352b2
I0711 13:36:58.469258       1 watcher.go:190] parsing pod grafana-k8s-monitoring-prometheus-node-exporter-x57cs default status: <nil> <nil> <nil>
I0711 13:36:58.469350       1 watcher.go:183] Pod kepler-tjdm9 default is ready with 1 container statuses, 0 init container status, 0 ephemeral statues
I0711 13:36:58.469456       1 watcher.go:211] receiving container kepler-exporter kepler-tjdm9 default 4821309a845ef627c6370249ffbc22d62a0fdbca43ada2352d1b871ef0049cbb
I0711 13:36:58.469545       1 watcher.go:190] parsing pod kepler-tjdm9 default status: <nil> <nil> <nil>
I0711 13:36:58.469642       1 watcher.go:183] Pod aws-node-gg949 kube-system is ready with 1 container statuses, 1 init container status, 0 ephemeral statues
I0711 13:36:58.469775       1 watcher.go:211] receiving container aws-node aws-node-gg949 kube-system 50431be2ae2adc8d6d2ca06567dbb231c28eac1573677c50d2800fb2b8f3a6a7
I0711 13:36:58.469905       1 watcher.go:211] receiving container aws-vpc-cni-init aws-node-gg949 kube-system 36302514210ba899a9cc4fbafa63fb0573b8d9a2434f59966387f9e6fc969e3e
I0711 13:36:58.469984       1 watcher.go:190] parsing pod aws-node-gg949 kube-system status: <nil> <nil> <nil>
I0711 13:36:58.470082       1 watcher.go:183] Pod ebs-csi-node-jr78n kube-system is ready with 3 container statuses, 0 init container status, 0 ephemeral statues
I0711 13:36:58.470221       1 watcher.go:211] receiving container ebs-plugin ebs-csi-node-jr78n kube-system 64f2895f61a0aed5ef63153dab64042e2a7a57793706c70d176219aa652b114c
I0711 13:36:58.470352       1 watcher.go:211] receiving container liveness-probe ebs-csi-node-jr78n kube-system 68741d6793c77bb428a69403b4e1d74b721144600ebbd4e21c3f1fd40a4cfe6d
I0711 13:36:58.470478       1 watcher.go:211] receiving container node-driver-registrar ebs-csi-node-jr78n kube-system c869b9270c6ae738f85fb5cef9fe9b6d36668bf0f9078de14be83ae230daa2a5
I0711 13:36:58.470619       1 watcher.go:190] parsing pod ebs-csi-node-jr78n kube-system status: <nil> <nil> <nil>
I0711 13:36:58.470700       1 watcher.go:183] Pod kube-proxy-6cqdc kube-system is ready with 1 container statuses, 0 init container status, 0 ephemeral statues
I0711 13:36:58.470877       1 watcher.go:211] receiving container kube-proxy kube-proxy-6cqdc kube-system b1e17615429844acc0313098a84e55952559bb42b0a895e4e4f1233b5f6dcdc7
I0711 13:36:58.470963       1 watcher.go:190] parsing pod kube-proxy-6cqdc kube-system status: <nil> <nil> <nil>
I0711 13:36:58.543751       1 shared_informer.go:341] caches populated
I0711 13:36:58.543770       1 watcher.go:138] k8s APIserver watcher was started
I0711 13:36:58.543856       1 prometheus_collector.go:92] Registered Container Prometheus metrics
I0711 13:36:58.543883       1 prometheus_collector.go:97] Registered VM Prometheus metrics
I0711 13:36:58.543896       1 prometheus_collector.go:101] Registered Node Prometheus metrics
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x7fe29c]

goroutine 34 [running]:
github.com/sustainable-computing-io/kepler/pkg/collector/stats/types.(*UInt64StatCollection).SumAllAggrValues(0x15f5ec0?)
	/workspace/pkg/collector/stats/types/types.go:144 +0x1c
github.com/sustainable-computing-io/kepler/pkg/metrics/utils.CollectResUtil(0x15f8c80?, {0x16c58e0?, 0x40003db9d0?}, {0x1822498, 0x12}, {0x1a89e18, 0x40003942a8})
	/workspace/pkg/metrics/utils/utils.go:141 +0x578
github.com/sustainable-computing-io/kepler/pkg/metrics/utils.CollectResUtilizationMetrics(0x0?, {0x16c58e0, 0x40003db9d0}, 0x4000274850?)
	/workspace/pkg/metrics/utils/utils.go:48 +0x94
github.com/sustainable-computing-io/kepler/pkg/metrics/container.(*collector).Collect(0x40003d4100, 0x4000069f58?)
	/workspace/pkg/metrics/container/metrics.go:102 +0xf4
github.com/prometheus/client_golang/prometheus.(*Registry).Gather.func1()
	/workspace/vendor/github.com/prometheus/client_golang/prometheus/registry.go:455 +0xd8
created by github.com/prometheus/client_golang/prometheus.(*Registry).Gather in goroutine 1
	/workspace/vendor/github.com/prometheus/client_golang/prometheus/registry.go:547 +0x974
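
One observation on the two attach warnings, since they will show up on any similar setup: the writeback_dirty_folio tracepoint only exists on folio-era kernels (roughly 5.17+, if I remember right; treat the exact version as an assumption), and errno 524 is ENOTSUPP, which on arm64 usually means the kernel lacks the fentry/fexit trampoline support Kepler wants. Both warnings are environmental rather than ARM bugs per se, and Kepler degrades gracefully past them. A minimal sketch of a pre-attach probe using the same tracefs id file libbpf reports above (the helper name is mine, not Kepler's):

package main

import (
	"fmt"
	"os"
)

// tracepointExists reports whether tracefs exposes the given tracepoint,
// by stat-ing the same id file libbpf probes in the log above.
func tracepointExists(category, name string) bool {
	path := fmt.Sprintf("/sys/kernel/debug/tracing/events/%s/%s/id", category, name)
	_, err := os.Stat(path)
	return err == nil
}

func main() {
	if tracepointExists("writeback", "writeback_dirty_folio") {
		fmt.Println("writeback_dirty_folio present; attach should succeed")
	} else {
		fmt.Println("writeback_dirty_folio missing on this kernel; skip the attach")
	}
}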


jjo commented Jul 11, 2024

> @jjo if you still have the environment available, could you test with Kepler v0.7.11? I'm hopeful that this has fixed a couple of issues in ARM64 environments.

woOT! Running OK so far:

I0711 13:32:10.264592       1 gpu.go:38] Trying to initialize GPU collector using dcgm
W0711 13:32:10.264788       1 gpu_dcgm.go:104] There is no DCGM daemon running in the host: libdcgm.so not Found
W0711 13:32:10.264815       1 gpu_dcgm.go:108] Could not start DCGM. Error: libdcgm.so not Found
I0711 13:32:10.264826       1 gpu.go:45] Error initializing dcgm: not able to connect to DCGM: libdcgm.so not Found
I0711 13:32:10.264830       1 gpu.go:38] Trying to initialize GPU collector using nvidia-nvml
I0711 13:32:10.264880       1 gpu.go:45] Error initializing nvidia-nvml: failed to init nvml. ERROR_LIBRARY_NOT_FOUND
I0711 13:32:10.264889       1 gpu.go:38] Trying to initialize GPU collector using dummy
I0711 13:32:10.264893       1 gpu.go:42] Using dummy to obtain gpu power
E0711 13:32:10.265472       1 utils.go:110] getCPUArch failure: open /sys/devices/cpu/caps/pmu_name: no such file or directory
I0711 13:32:10.266033       1 exporter.go:100] Kepler running on version: v0.7.11
I0711 13:32:10.266089       1 config.go:284] using gCgroup ID in the BPF program: true
I0711 13:32:10.266129       1 config.go:286] kernel version: 5.1
I0711 13:32:10.266162       1 config.go:311] The Idle power will be exposed. Are you running on Baremetal or using single VM per node?
I0711 13:32:10.266199       1 rapl_msr_util.go:129] failed to open path /dev/cpu/0/msr: no such file or directory
I0711 13:32:10.266230       1 power.go:72] Unable to obtain power, use estimate method
I0711 13:32:10.266242       1 redfish.go:169] failed to get redfish credential file path
I0711 13:32:10.266638       1 acpi.go:71] Could not find any ACPI power meter path. Is it a VM?
I0711 13:32:10.266652       1 power.go:73] using none to obtain power
I0711 13:32:10.364233       1 exporter.go:89] Number of CPUs: 32
W0711 13:32:15.972242       1 exporter.go:150] failed to attach fentry/mark_page_accessed: create raw tracepoint: not supported. Kepler will not collect page cache read events. This will affect the DRAM power model estimation on VMs.
I0711 13:32:16.296235       1 exporter.go:147] Initializing the GPU collector
I0711 13:32:16.296566       1 watcher.go:68] Using in cluster k8s config
I0711 13:32:16.862345       1 watcher.go:140] k8s APIserver watcher was started
I0711 13:32:16.862410       1 prometheus_collector.go:95] Registered Container Prometheus metrics
I0711 13:32:16.862451       1 prometheus_collector.go:100] Registered VM Prometheus metrics
I0711 13:32:16.862476       1 prometheus_collector.go:104] Registered Node Prometheus metrics
I0711 13:32:16.965248       1 process_energy.go:114] Using the Ratio/DynPower Power Model to estimate Process Platform Power
I0711 13:32:16.965271       1 process_energy.go:115] Process feature names: [bpf_cpu_time_ms]
I0711 13:32:16.965294       1 process_energy.go:124] Using the Ratio/DynPower Power Model to estimate Process Component Power
I0711 13:32:16.965301       1 process_energy.go:125] Process feature names: [bpf_cpu_time_ms bpf_cpu_time_ms bpf_cpu_time_ms   gpu_compute_util]
I0711 13:32:16.965466       1 node_platform_energy.go:52] Using the Regressor/AbsPower Power Model to estimate Node Platform Power
I0711 13:32:16.965540       1 node_component_energy.go:56] Using the Regressor/AbsPower Power Model to estimate Node Component Power
I0711 13:32:16.965581       1 exporter.go:201] starting to listen on 0.0.0.0:9102
I0711 13:32:16.965696       1 exporter.go:215] Started Kepler in 6.699690316s

More details on the running instance:

# uname -a
Linux ip-10-60-8-147.us-east-2.compute.internal 5.10.219-208.866.amzn2.aarch64 #1 SMP Tue Jun 18 14:00:02 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux

# lscpu 
Architecture:        aarch64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              32
On-line CPU(s) list: 0-31
Thread(s) per core:  1
Core(s) per socket:  32
Socket(s):           1
NUMA node(s):        1
Vendor ID:           ARM
Model:               1
Model name:          Neoverse-N1
Stepping:            r3p1
BogoMIPS:            243.75
L1d cache:           64K
L1i cache:           64K
L2 cache:            1024K
L3 cache:            32768K
NUMA node0 CPU(s):   0-31
Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
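
For anyone wanting to sanity-check beyond the startup logs: the exporter listens on 0.0.0.0:9102 per the log above, so you can scrape it directly and look for kepler_-prefixed series. A minimal sketch, assuming you run it on the node itself or after a port-forward:

package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

func main() {
	// 9102 comes from the "starting to listen on 0.0.0.0:9102" line above.
	resp, err := http.Get("http://localhost:9102/metrics")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Print only Kepler's own series from the Prometheus text exposition.
	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		if line := scanner.Text(); strings.HasPrefix(line, "kepler_") {
			fmt.Println(line)
		}
	}
}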

petewall commented

Ahh geez... Just realized I'm still running 0.7.10... Sorry for the noise. Let me update and I'll let you know.

dave-tucker (Collaborator) commented

> Ahh geez... Just realized I'm still running 0.7.10... Sorry for the noise. Let me update and I'll let you know.

No worries, I was about to say "those look like 0.7.10 logs" 🤣
Hoping this at least doesn't crash this time around 😓

petewall commented

> Ahh geez... Just realized I'm still running 0.7.10... Sorry for the noise. Let me update and I'll let you know.

> No worries, I was about to say "those look like 0.7.10 logs" 🤣 Hoping this at least doesn't crash this time around 😓

Sorry for the short-term panic!

Good news! 0.7.11 is no longer crashing on my Graviton node either!

dave-tucker (Collaborator) commented

Given that things aren't crashing, I'm going to go ahead and close this one for now.

dave-tucker added the kind/bug label and removed the kind/feature label on Jul 11, 2024

Maarham commented Jul 12, 2024

Hey, just wanted to follow up on this thread. When you say no crashes, are y'all able to get information (valid sustainability metrics like power consumption, carbon emissions, etc.) from Kepler? Do the results seem valid, and what do they look like? Thanks!
