denki: derived metrics assigning power draw to individual processes #1827

Open
wants to merge 1 commit into
base: main
from

Conversation

natoscott
Member

Denki (power) derived metrics, based on discussion with Christian Horn over the last few weeks.

These derived metrics use RAPL hardware metrics to assign power usage to processes based on the processor and memory power draw metrics.

@christianhorn
Collaborator

Hm, with a build of PCP as of today (with commit 1d3a251 as HEAD), the [core] syntax for referring to an instance still does not seem to be recognized:

[root@kosmos ~]# pmrep proc.denki.cpu
[/var/lib/pcp/config/derived/denki.conf:21] Error: pmRegisterDerived(proc.denki.cpu, ...) syntax error
 rate(denki.rapl.raw[core]) * (rate(proc.psinfo.utime) + rate(proc.psinfo.stime)) / (kernel.all.uptime - proc.psinfo.start_time)
                    ^
Metric name expected to follow rate(
Invalid metric proc.denki.cpu (PM_ERR_NAME Unknown metric name).
[root@kosmos ~]# grep '^proc.denki.cpu ' /etc/pcp/derived/denki.conf
proc.denki.cpu = rate(denki.rapl.raw[core]) * (rate(proc.psinfo.utime) + rate(proc.psinfo.stime)) / (kernel.all.uptime - proc.psinfo.start_time)

@kmcdonell
Member

Hi @christianhorn

The latest commit fixes the libpcp issues around derived metrics, but it does not fix the denki derived metric issue. So far I've been concentrating on the "cpu" one, and I've run into these issues:

  1. intuitively the semantics for the denki.rapl.rate and denki.rapl.raw metrics seem reversed (I'd expect the former to be instantaneous, the latter to be counter)
  2. on my test system, there is no "core" instance, but there is a "package-0" instance for these metrics
  3. the semantics for denki.bat.energy_now_raw also seem reversed (counter, not instantaneous)
  4. when I change the semantics as in 1., the values seem reversed, as in "rate" is a counter and "raw" is the rate-converted value for the counter:
$ pmrep denki.rapl.rate,,package-0 denki.rapl.raw,,package-0
  d.r.rate  d.r.raw
  package-  package
                 /s
    218705      N/A
    218739   33.945
    218773   34.003

so maybe it is the names that are reversed. I am expecting "raw" to be the raw values from the source, and "rate" to be somehow modified by the PMDA before export.
  5. both the denki.rapl.rate and denki.rapl.raw metrics produce the same value with pminfo, which does not seem right
  6. it is unusual for a PMDA to export both a counter and a "rate" metric, as we get better data semantics if we simply export a counter and let each PMAPI client do the rate conversion at its own sample frequency if it wishes.

If I use the "rate" metric as though it was a counter, then the closest I can get to the desired expression for the per-process power (based on pro-rated CPU usage) is:

proc.denki.cpu = scalar((delta(denki.rapl.rate))[package-0]) * ((delta(proc.psinfo.utime) + delta(proc.psinfo.stime)) / (delta(kernel.all.cpu.user) + delta(kernel.all.cpu.sys)))

I think we need to sort out the data semantics issues with the PMDA first, then cycle back to the "core" vs "package-0" vs "???" issue.
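
To make the pro-rating in the expression above concrete, here is a minimal Python sketch of the same arithmetic outside PCP. It assumes the energy delta and both CPU time deltas were sampled over the same interval. The function name `prorate_energy` and the sample numbers are hypothetical and are not taken from the PMDA or from libpcp.

```python
def prorate_energy(energy_delta, proc_cpu_delta, total_cpu_delta):
    """Attribute a share of an energy delta to one process.

    energy_delta:    change in the RAPL counter over the interval
    proc_cpu_delta:  change in the process's utime + stime
    total_cpu_delta: change in system-wide user + sys CPU time
    All three are assumed to cover the same sampling interval.
    """
    if total_cpu_delta <= 0:
        # No CPU time was accounted system-wide, so nothing can be attributed.
        return 0.0
    return energy_delta * (proc_cpu_delta / total_cpu_delta)


# Hypothetical sample: the counter advanced by 34 units while the process
# used 250 ms of the 1000 ms of CPU time consumed system-wide.
share = prorate_energy(34, 250, 1000)
print(share)
```

The guard for a zero denominator is a choice made for this sketch only; how a derived metric expression behaves when the system-wide delta is zero is not covered in this thread.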

@christianhorn
Collaborator

christianhorn commented Oct 13, 2023

Hi Ken,

> intuitively the semantics for the denki.rapl.rate and denki.rapl.raw metrics seem reversed (I'd expect
> the former to be instantaneous, the latter to be counter)

Is this documented somewhere? For example, 'apropos instantaneous' is not giving me any hits.
I might simply have misunderstood the terms. denki.rapl.raw is meant to reflect the plain values as handed over by sysfs, hence the name 'raw'. From the kernel's point of view, it is not a gauge going up and down, but a counter. Example:

[chris@космос ~]$ pmrep -p -g denki.rapl
[ 1] - denki.rapl.rate["package-0"] - /s
[ 2] - denki.rapl.rate["core"] - /s
[ 3] - denki.rapl.rate["uncore"] - /s
[ 4] - denki.rapl.rate["dram"] - /s
[ 5] - denki.rapl.raw["package-0"] - none
[ 6] - denki.rapl.raw["core"] - none
[ 7] - denki.rapl.raw["uncore"] - none
[ 8] - denki.rapl.raw["dram"] - none

                 1         2         3         4         5         6         7         8
10:55:15       N/A       N/A       N/A       N/A     58799     32471      2472     11962
10:55:16     2.993     0.000     0.000     0.998     58802     32471      2472     11963
10:55:17     2.001     1.001     0.000     0.000     58804     32472      2472     11963

As a result, the rate cannot be computed for the first line of output, and we get N/A there. Which terms should be changed?
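
As a rough illustration of why the first sample has no rate, here is a minimal Python sketch of counter-to-rate conversion. It is not how pmrep or pmda-denki implement it; the function name `rates` is hypothetical, and the timestamps are assumed to be exactly one second apart, which the real samples above are evidently not (hence 2.993 rather than 3).

```python
def rates(samples):
    """Convert (timestamp_seconds, counter_value) samples into per-second rates.

    The first sample has no predecessor, so its rate is None (displayed as N/A).
    """
    if not samples:
        return []
    out = [None]
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        out.append((v1 - v0) / (t1 - t0))
    return out


# package-0 raw values from the output above; timestamps are an assumption.
samples = [(0, 58799), (1, 58802), (2, 58804)]
print(rates(samples))
```

A client that holds only the counter can run this kind of conversion at whatever sampling interval it chooses, which is the point made about exporting a counter rather than a precomputed rate.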

> on my test system, there is no "core" instance, but there is a "package-0" instance
> for these metrics

Until now, we have not paid much attention to the components of package-X. I cannot remember having hit an x86 system without core, though. Capturing a 'find /sys' from the system might be interesting; as a next step I would then ask for a tar archive with more details, in the style of what the qa/denki/*tgz files contain.

> the semantics for denki.bat.energy_now_raw also seem reversed (counter, not instantaneous)

Yes, consistent with the concept I used for rapl. denki.bat.energy_now_raw is read directly (but it is a gauge from the kernel's point of view), and denki.bat.energy_now_rate is computed by pmda-denki from that _raw value. denki.bat.power_now and denki.bat.capacity are raw values read from the kernel; these are gauges and go up and down.

> it is unusual for a PMDA to export both a counter and a "rate" metric, as we get better data
> semantics if we simply export a counter and let each PMAPI client do the rate conversion at
> their own sample frequency if they wish.

That could make sense, it would reduce the amount of data to be stored as 'direct metrics'. I should search for good examples where we do this.

@kmcdonell
Member

@christianhorn Let's move this discussion to email.

3 participants