Add an amdgpu pmda #1975
Looking good Fred!
Bunch of minor things described in individual comments. Beyond that, here's a list of other files that will need changes too:
- configure.ac (if libdrm is absent, we need to switch this PMDA off in the build - I see there's a libdrm.pc so I recommend using PKG_CHECK_MODULES like e.g. libsasl)
- src/include/builddefs.in (makefile macro(s) based on configure.ac mechanism)
- build/rpm/.spec (we'll need a sub-package for this new PMDA)
- qa/xxxx[.out] (regression test or two, see the apache PMDA test qa/755 as example)
- qa/admin/package-lists (list of rpm/deb packages that CI needs to install in order to build/test/release this).
If anything is unclear (for sure some will be) - let's chat on slack.
man/man1/pmdaamdgpu.1
Outdated
.\"
.TH PMDAAMDGPU 1 "PCP" "Performance Co-Pilot"
.SH NAME
\f3pmdaamdgpu\f1 \- amdgpu gpu metrics domain agent (PMDA)
-> "AMD GPU metrics domain agent"
man/man1/pmdaamdgpu.1
Outdated
.PP
The
.B amdgpu
PMDA exports metrics that measure gpu activity, memory utilization,
gpu -> GPU
man/man1/pmdaamdgpu.1
Outdated
.fi
.ft 1
.PP
If you want to establish access to the names, help text and values for the amdgpu
amdgpu -> AMD GPU
src/pmdas/amdgpu/GNUmakefile
Outdated
CFILES = localdrm.c amdgpu.c
HFILES = localdrm.h
DFILES = README
LLDLIBS = $(PCP_PMDALIB) $(LIB_FOR_DLOPEN) -ldrm -ldrm_amdgpu
Depending on how the configure.ac machinery ends up (discussed elsewhere), the libraries listed here will likely end up accessed as makefile macros.
src/pmdas/amdgpu/localdrm.c
Outdated
int dev_count = drmGetDevices(NULL, 0);

if (dev_count <= 0) {
    printf("No devices\n");
The diagnostic printf calls in this file should all become __pmNotifyErr calls, which interact with the logging subsystem a bit more nicely (with timestamp prefixes, etc.) and ensure we write into the log and not stdout (which may even have pmcd on the other end of it waiting for a PDU).
Oops, these are development artefacts, I'll rework/cleanup.
src/pmdas/amdgpu/localdrm.c
Outdated
memcpy(&p[amdgpu_count++], &temp[i], sizeof(drmDevicePtr));

/* Done with version */
drmFreeVersion(ver);
Do we need to close the fd (local variable on-stack)? Looks like it leaks an fd for each device here otherwise.
src/pmdas/amdgpu/localdrm.c
Outdated
return DRM_SUCCESS;
}
There's a lot of opening and closing of device files here, with the way these interfaces are set up (I think this is a bit based on nvidia again, and that's maybe made it more complex than necessary).
Going back to the earlier comment about need_refresh - currently we call all of these APIs on every fetch, for every GPU. If we are going to do that, we can collapse all of these library calls into a single function that opens and closes each GPU fd a maximum of once per fetch, and avoid system time overheads associated with doing this "for every GPU for every metric" (i.e. ~10x less work if we have 10 metrics).
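As a rough illustration of the "refresh once per fetch" idea, here is a minimal sketch. It is not the PMDA's actual code: `gpu_snapshot`, `refresh_gpu`, `fetch_begin`, `fetch_value`, `open_count` and the fixed `NUM_GPUS` are all hypothetical names, and the device access is simulated so the snippet stands alone.

```c
#include <assert.h>

#define NUM_GPUS 2      /* hypothetical fixed GPU count for this sketch */

/* Hypothetical per-GPU snapshot; a real agent would track more metrics. */
struct gpu_snapshot {
    unsigned long long memory_used;
    unsigned int       gpu_clock;
    unsigned int       temperature;   /* millidegrees Celsius */
};

static struct gpu_snapshot snapshots[NUM_GPUS];
static int need_refresh = 1;   /* set again at the start of every fetch */
static int open_count;         /* counts simulated device open/close pairs */

/* Stand-in for "open the device, query every metric, close it". */
static void
refresh_gpu(int gpu)
{
    open_count++;
    snapshots[gpu].memory_used = 1024ULL * (gpu + 1);
    snapshots[gpu].gpu_clock = 1500 + gpu;
    snapshots[gpu].temperature = 45000;
}

/* Called once when a new fetch begins. */
void
fetch_begin(void)
{
    need_refresh = 1;
}

/* Called per metric, per GPU; only the first call in a fetch refreshes. */
const struct gpu_snapshot *
fetch_value(int gpu)
{
    assert(gpu >= 0 && gpu < NUM_GPUS);
    if (need_refresh) {
        for (int i = 0; i < NUM_GPUS; i++)
            refresh_gpu(i);
        need_refresh = 0;
    }
    return &snapshots[gpu];
}
```

With this shape, ten metrics across two GPUs cost two simulated opens per fetch rather than twenty, since every later `fetch_value()` call in the same fetch reads the cached snapshot.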
src/pmdas/amdgpu/pmns
Outdated
mem_clock_max AMDGPU:0:8
gpu_clock AMDGPU:0:9
gpu_clock_max AMDGPU:0:10
temperature AMDGPU:0:11
With several of the above metrics, consider a "memory" subtree in the PMNS, e.g. amdgpu.memory.clock and so on (total, free, etc). Possibly same for amdgpu.gpu.clock and clock_max.
src/pmdas/amdgpu/pmns
Outdated
gpu_clock_max AMDGPU:0:10
temperature AMDGPU:0:11
load AMDGPU:0:12
avg_pwr AMDGPU:0:13
I'd be inclined to expand avg_pwr to 'average_power' here.
I believe I addressed most of the review findings, I'll need to go through the code at least one more time though.
Thanks Fred, good updates here. Re that QA test we talked about, check out:
- "cd qa && ./new" to create a new test
- qa/755 for a simple example (Apache PMDA)
build/rpm/pcp.spec.in
Outdated
#
%package pmda-amdgpu
License: GPL-2.0-or-later
Summary: Performance Co-Pilot (PCP) metrics from eBPF ELF modules
Should be more like AMD GPUs and less like eBPF ELF modules ;)
build/rpm/redhat.spec
Outdated
#
%package pmda-amdgpu
License: GPL-2.0-or-later
Summary: Performance Co-Pilot (PCP) metrics from eBPF ELF modules
Likewise here.
@@ -2269,6 +2272,23 @@ collecting metrics about web server logs.
# end pcp-pmda-weblog
# end C pmdas

%if !%{disable_amdgpu}
This macro (disable_amdgpu) isn't defined AFAICT. Not 100% but I'm guessing it will be similar to the disable_resctrl definition which makes that package x86_64 only.
This package contains the PCP Performance Metrics Domain Agent (PMDA) for
extracting performance metrics from AMDGPU devices.
# end pcp-pmda-amdgpu
%endif
There are several other things that need to be done in the spec files to create a sub-package (e.g. a %files section). Simplest way to find 'em is to look at a similar sub-package - the closest to this new one may be pcp-pmda-resctrl (search on "resctrl" occurrences in each spec and mimic each section).
config.guess
Outdated
@@ -1,12 +1,14 @@
#! /bin/sh
#!/usr/bin/sh
Ah, thanks for updating these files Fred, well overdue. Is /usr/bin/sh guaranteed to exist on all platforms though? We tend to use /bin/sh everywhere else anyway.
That actually came with the configure update.
In theory nowadays everything is in /usr. There is a trend to remove /bin and /sbin.
@fberat yeah, I understand that's the trend (on Linux). The problem will come in on platforms that don't have such a trend, e.g. Mac OS ...
(base) nathans-mac:~ nathans$ uname -a
Darwin nathans-mac 23.5.0 Darwin Kernel Version 23.5.0: Wed May 1 20:12:58 PDT 2024; root:xnu-10063.121.3~5/RELEASE_ARM64_T6000 arm64
(base) nathans-mac:~ nathans$ ls -l /usr/bin/sh
ls: /usr/bin/sh: No such file or directory
(base) nathans-mac:~ nathans$
Ok, that's annoying, I may need to check with upstream config.{sub,guess} repo.
I'll investigate further, there is something odd, the file on my system doesn't match the one in the redhat-rpm-config repository.
Ok, found the reason, it looks like there is a systemic modification of the shebang in Fedora when RPMs are built. I'll revert the shebang back to the original value on my next update.
case AMDGPU_TEMPERATURE:
    if (pcp_amdgpuinfo.info[inst].failed[AMDGPU_TEMPERATURE])
        return PM_ERR_VALUE;
    /* In millidegrees Celsius */
I see the label callbacks now but should we add "units":"millidegrees Celsius" as a label for this one?
if (autorefresh > 0) {
    autorefresh = 0;
    for (int i = 0;i < AMDGPU_REFRESHER_COUNT;i++) {
        pmNotifyErr(LOG_ERR, "Refreshing %d", i);
Too verbose by default here, I think this is going to end up in a log file once every second?
I need to clean up at the end of the day, you were faster than me to review these changes :)
src/pmdas/amdgpu/drm.c
Outdated
#ifndef DSOSUFFIX
#define DSOSUFFIX "so"
#endif
Pretty sure this agent is Linux-only, so safe to hard-code this if you like.
if (strcmp(ver->name, "amdgpu")) {
    drmFreeVersion(ver);
    continue;
I think we may leak an open fd here?
/* Done with version */
drmFreeVersion(ver);
}
And also here at the last part of the loop? Could close it unconditionally right after drmGetVersion perhaps.
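For what it's worth, the "close it unconditionally" pattern could look roughly like the sketch below. This is an assumption-laden stand-in rather than the PMDA's code: `driver_matches()` is a hypothetical placeholder for the `drmGetVersion()` plus name check, `count_matching()` is an invented helper, and plain POSIX `open()`/`close()` are used so the snippet builds without libdrm.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical stand-in for drmGetVersion() plus the name comparison:
 * the fd is ignored and the caller-supplied name is checked instead.
 */
static int
driver_matches(int fd, const char *name)
{
    (void)fd;
    return strcmp(name, "amdgpu") == 0;
}

/*
 * Count matching devices among `paths`. Each fd is closed as soon as the
 * version query is done, so neither the skip path nor the end of the loop
 * can leak it.
 */
int
count_matching(const char *paths[], const char *names[], int n)
{
    int matches = 0;

    assert(paths != NULL && names != NULL);
    for (int i = 0; i < n; i++) {
        int fd = open(paths[i], O_RDONLY);
        if (fd < 0)
            continue;               /* open failed, nothing to close */

        int ok = driver_matches(fd, names[i]);
        close(fd);                  /* unconditional: done with the fd */

        if (!ok)
            continue;               /* skip path, fd already closed */
        matches++;
    }
    return matches;
}
```

Because the `close()` sits before any branch, adding further early exits later in the loop body cannot reintroduce the leak.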
src/pmdas/amdgpu/amdgpu.c
Outdated
switch (item) {
case AMDGPU_MEMORY_USED:
    atom->ull = pcp_amdgpuinfo.info[inst].memory.used;
    pmNotifyErr(LOG_ERR, "Getting used memory %lu", atom->ull);
Removed.
Force-pushed from 89152be to 8e6c36e
Last handful of small things @fberat ...
- amdgpu_fetch(): We have a check that 'cluster' is within range, does 'item' need
- setup_gcard_indom(): INFO level might be more appropriate for this one?
- -> leftover temp diagnostic? (or add pmDebugOptions.appl2 guard)
- PMDA README file has the word Readme at the start - intentional?
- qa/1772 has template comment still. [who are you?] -> Red Hat or
Agreed.
Agreed.
Yes, removing.
Likely not intentional, removing. Equal character added.
Red Hat added.
Done. Thanks for the review, I'll push the update.
This pmda retrieves data using the libdrm and libdrm-amdgpu libraries. It only retrieves general information, no process-specific data.
Data retrieved includes memory usage, memory speed, GPU speed, temperature, etc.
Old Radeon GPUs (pre-GCN 1.1) aren't supported.
Signed-off-by: Frédéric Bérat <[email protected]>