Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add an amdgpu pmda #1975

Merged
merged 1 commit into from
Jul 5, 2024
Merged

Conversation

fberat
Copy link
Contributor

@fberat fberat commented May 3, 2024

This pmda retrieves data using the libdrm and libdrm-amdgpu libraries. It only retrieves general information, no process specific data.

Data retrieved includes memory usage, memory speed, GPU speed, temperature, etc ...

Old Radeon (Pre GCN 1.1) aren't supported.

src/pmdas/amdgpu/GNUmakefile Outdated Show resolved Hide resolved
src/pmdas/amdgpu/help Outdated Show resolved Hide resolved
Copy link
Member

@natoscott natoscott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good Fred!

Bunch of minor things described in individual comments. Beyond that, here's a list of other files that will need changes too:

  • configure.ac (if libdrm is absent, we need to switch this PMDA off in the build - I see there'a a libdrm.pc so I recommend using PKG_CHECK_MODULES like e.g. libsasl)
  • src/include/builddefs.in (makefile macro(s) based on configure.ac mechanism)
  • build/rpm/.spec (we'll need a sub-package for this new PMDA)
  • qa/xxxx[.out] (regression test or two, see the apache PMDA test qa/755 as example)
  • qa/admin/package-lists (list of rpm/deb packages that CI needs to install in order to build/test/release this).

If anything unclear (for sure some will be) - let's chat on slack.

.\"
.TH PMDAAMDGPU 1 "PCP" "Performance Co-Pilot"
.SH NAME
\f3pmdaamdgpu\f1 \- amdgpu gpu metrics domain agent (PMDA)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-> "AMD GPU metrics domain agent"

man/man1/pmdaamdgpu.1 Show resolved Hide resolved
.PP
The
.B amdgpu
PMDA exports metrics that measure gpu activity, memory utilization,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gpu -> GPU

.fi
.ft 1
.PP
If you want to establish access to the names, help text and values for the amdgpu
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

amdgpu -> AMD GPU

CFILES = localdrm.c amdgpu.c
HFILES = localdrm.h
DFILES = README
LLDLIBS = $(PCP_PMDALIB) $(LIB_FOR_DLOPEN) -ldrm -ldrm_amdgpu
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Depending on how the configure.ac machinery ends up (discussed elsewhere), the libraries listed here will likely end up accessed as makefile macros.

int dev_count = drmGetDevices(NULL, 0);

if (dev_count <= 0) {
printf("No devices\n");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The diagnostic printf calls in this file should all become __pmNotifyErr calls which interacts with the logging subsystem a bit more nicely (with timestamp prefixes, etc) - and ensures we write into the log and not stdout (which may even have pmcd on the other end of it waiting for a PDU).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, these are development artefacts, I'll rework/cleanup.

memcpy(&p[amdgpu_count++], &temp[i], sizeof(drmDevicePtr));

/* Done with version */
drmFreeVersion(ver);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to close the fd (local variable on-stack)? Looks like it leaks an fd for each device here otherwise.


return DRM_SUCCESS;
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's alot of opening and closing of device files here, with the way these interfaces are setup (I think this is a bit based on nvidia again, and thats maybe made it more complex than necessary).

Going back to the earlier comment about need_refresh - currently we call all of these APIs on every fetch, for every GPU. If we are going to do that, we can collapse all of these library calls into a single function that opens and closes each GPU fd a maximum of once per fetch, and avoid system time overheads associated with doing this "for every GPU for every metric" (i.e. ~10x less work if we have 10 metrics).

mem_clock_max AMDGPU:0:8
gpu_clock AMDGPU:0:9
gpu_clock_max AMDGPU:0:10
temperature AMDGPU:0:11
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With several of the above metrics, consider a "memory" subtree in the PMNS, e.g. amdgpu.memory.clock and so on (total, free, etc). Possibly same for amdgpu.gpu.clock and clock_max.

gpu_clock_max AMDGPU:0:10
temperature AMDGPU:0:11
load AMDGPU:0:12
avg_pwr AMDGPU:0:13
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be inclined to expand avg_pwr to 'average_power' here.

@fberat
Copy link
Contributor Author

fberat commented May 28, 2024

I believe I addressed most of the review finidings, I'll need to go through the code at least one more time though.

Copy link
Member

@natoscott natoscott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Fred, good updates here. Re that QA test we talked about, check out:

  • "cd qa && ./new" to create a new test
  • qa/755 for a simple example (Apache PMDA)

#
%package pmda-amdgpu
License: GPL-2.0-or-later
Summary: Performance Co-Pilot (PCP) metrics from eBPF ELF modules
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be more like AMD GPUs and less like eBPF ELF modules ;)

#
%package pmda-amdgpu
License: GPL-2.0-or-later
Summary: Performance Co-Pilot (PCP) metrics from eBPF ELF modules
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Likewise here.

@@ -2269,6 +2272,23 @@ collecting metrics about web server logs.
# end pcp-pmda-weblog
# end C pmdas

%if !%{disable_amdgpu}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This macro (disable_amdgpu) isn't defined AFAICT. Not 100% but I'm guessing it will be similar to the disable_resctrl definition which makes that package x86_64 only.

This package contains the PCP Performance Metrics Domain Agent (PMDA) for
extracting performance metrics from AMDGPU devices.
# end pcp-pmda-amdgpu
%endif
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's several other things that need to be done in the spec files to create a sub-package (e.g. a %files section). Simplest way to find 'em is to look at a similar sub-package - the closest to this new one may be pcp-pmda-resctrl (search on "resctrl" occurrences in each spec and mimic each section).

config.guess Outdated
@@ -1,12 +1,14 @@
#! /bin/sh
#!/usr/bin/sh
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, thanks for updating these files Fred, well overdue. Is /usr/bin/sh guaranteed to exist on all platforms though? We tend to use /bin/sh everywhere else anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That actually came with the configure update.
In theory nowadays everything is in /usr. There is a trend to remove /bin and /sbin.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fberat yeah, I understand that's the trend (on Linux). The problem will come in on platforms that don't have such a trend, e.g. Mac OS ...

(base) nathans-mac:~ nathans$ uname -a
Darwin nathans-mac 23.5.0 Darwin Kernel Version 23.5.0: Wed May  1 20:12:58 PDT 2024; root:xnu-10063.121.3~5/RELEASE_ARM64_T6000 arm64
(base) nathans-mac:~ nathans$ ls -l /usr/bin/sh
ls: /usr/bin/sh: No such file or directory
(base) nathans-mac:~ nathans$ 

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, that's annoying, I may need to check with upstream config.{sub,guess} repo.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll investigate further, there is something odd, the file on my system doesn't match the one in the redhat-rpm-config repository.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, found the reason, it looks like there is a systemic modification of shebang in fedora when rpm are built. I'll revert the shebang back to the original value on my next update.

case AMDGPU_TEMPERATURE:
if (pcp_amdgpuinfo.info[inst].failed[AMDGPU_TEMPERATURE])
return PM_ERR_VALUE;
/* In millidegrees Celsius */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see the label callbacks now but should we add "units":"millidegrees celcius" as a label for this one?

if (autorefresh > 0) {
autorefresh = 0;
for (int i = 0;i < AMDGPU_REFRESHER_COUNT;i++) {
pmNotifyErr(LOG_ERR, "Refreshing %d", i);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Too verbose by default here, I think this is going to end up in a log file once every second?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to cleanup, at the end of the day, you were faster than me to review these changes :)


#ifndef DSOSUFFIX
#define DSOSUFFIX "so"
#endif
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pretty sure this agent is Linux-only, so safe to hard-code this if you like.


if (strcmp(ver->name, "amdgpu")) {
drmFreeVersion(ver);
continue;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we may leak an open fd here?


/* Done with version */
drmFreeVersion(ver);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And also here at the last part of the loop? Could close it unconditionally right after drmGetVersion perhaps.

switch (item) {
case AMDGPU_MEMORY_USED:
atom->ull = pcp_amdgpuinfo.info[inst].memory.used;
pmNotifyErr(LOG_ERR, "Getting used memory %lu", atom->ull);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

@fberat fberat force-pushed the devel/amdgpu branch 2 times, most recently from 89152be to 8e6c36e Compare June 19, 2024 13:57
@natoscott
Copy link
Member

Last handful of small things @fberat ...

amdgpu_fetch()

     if (refresher_list[cluster][item] == NULL)
       continue;

We have a check that 'cluster' is within range, does 'item' need
a similar check to prevent possible out-of-bounds access? (i.e.
via a maliciously crafted PDU with pmID item >= MAX_ITEM_COUNT).

setup_gcard_indom()

pmNotifyErr(LOG_WARNING, "setup_gcard_indom: got %d cards" ...

INFO level might be more appropriate for this one?

pmNotifyErr(LOG_ERR, "Refreshed memory (%lx)", memory.used);

-> leftover temp diagnostic? (or add pmDebugOptions.appl2 guard)

PMDA README file has the word Readme at the start - intentional?
Just looks a bit odd. The heading underline could use one more
equals character as well. :)

qa/1772 has template comment still. [who are you?] -> Red Hat or
Fred (yourself). Also need to change this line, e.g.:

# test for-some-thing || _notrun No support for some-thing

to

test -d $PCP_PMDAS_DIR/amdgpu || _notrun No support for AMD GPU metrics

@fberat
Copy link
Contributor Author

fberat commented Jul 4, 2024

Last handful of small things @fberat ...

amdgpu_fetch()

     if (refresher_list[cluster][item] == NULL)
       continue;

We have a check that 'cluster' is within range, does 'item' need a similar check to prevent possible out-of-bounds access? (i.e. via a maliciously crafted PDU with pmID item >= MAX_ITEM_COUNT).

Agreed.

setup_gcard_indom()

pmNotifyErr(LOG_WARNING, "setup_gcard_indom: got %d cards" ...

INFO level might be more appropriate for this one?

Agreed.

pmNotifyErr(LOG_ERR, "Refreshed memory (%lx)", memory.used);

-> leftover temp diagnostic? (or add pmDebugOptions.appl2 guard)

Yes, removing.

PMDA README file has the word Readme at the start - intentional? Just looks a bit odd. The heading underline could use one more equals character as well. :)

Likely not intentional, removing. Equal character added.

qa/1772 has template comment still. [who are you?] -> Red Hat or Fred (yourself). Also need to change this line, e.g.:

Red Hat added.

# test for-some-thing || _notrun No support for some-thing

to

test -d $PCP_PMDAS_DIR/amdgpu || _notrun No support for AMD GPU metrics

Done.

Thanks for the review, I'll push the update.

This pmda retrieves data using the libdrm and libdrm-amdgpu libraries.
It only retrieves general information, no process specific data.

Data retrieved includes memory usage, memory speed, GPU speed,
temperature, etc ...

Old Radeon (Pre GCN 1.1) aren't supported.

Signed-off-by: Frédéric Bérat <[email protected]>
@natoscott natoscott merged commit e92fe2e into performancecopilot:main Jul 5, 2024
17 of 22 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants