I have been working on a tool that gives an overview of the performance metrics for all GPUs associated with a SLURM job, similar to what the `top` command does for processes. It reads the list of hosts associated with a job from SLURM and establishes a remote connection to the `nv-hostengine` daemons running on those hosts.
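For context, the host discovery step can be sketched roughly as below. This is a simplified illustration, not the actual tool: the function names are hypothetical, it assumes `squeue` and `scontrol` are on the `PATH`, and it assumes `nv-hostengine` is listening on port 5555 (please correct me if that is not the default in your setup). It only checks that the TCP port is reachable and does not go through the DCGM API at all.

```python
from __future__ import annotations

import socket
import subprocess
import sys

# Assumption: nv-hostengine is listening on port 5555 on every node.
HOSTENGINE_PORT = 5555


def job_nodelist(job_id: str) -> str:
    """Return the compact node list (e.g. 'node[01-04]') of a SLURM job."""
    result = subprocess.run(
        ["squeue", "-h", "-j", job_id, "-o", "%N"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def parse_hostnames(output: str) -> list[str]:
    """Turn one-hostname-per-line output into a list, skipping blank lines."""
    return [line.strip() for line in output.splitlines() if line.strip()]


def expand_nodelist(nodelist: str) -> list[str]:
    """Expand a compact node list into individual hostnames via scontrol."""
    result = subprocess.run(
        ["scontrol", "show", "hostnames", nodelist],
        capture_output=True, text=True, check=True,
    )
    return parse_hostnames(result.stdout)


def is_reachable(host: str, port: int = HOSTENGINE_PORT, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Usage: python check_hostengines.py <slurm_job_id>
    for host in expand_nodelist(job_nodelist(sys.argv[1])):
        state = "reachable" if is_reachable(host) else "NOT reachable"
        print(f"{host}:{HOSTENGINE_PORT} {state}")
```

Without the `-b ALL` change described below, every remote host shows up as not reachable in a check like this.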
According to this GitHub issue, for the `nv-hostengine` process to listen on remote IPs, one must specify the `-b` flag when starting the hostengine. I modified the `systemd` script for `nv-hostengine` to include the `-b ALL` flag, and it works great. However, I would like a more permanent and robust solution, so I have a few questions:
1. Is there any documentation on how to set up `nv-hostengine` to support telemetry? I searched for quite a while but could not find anything except the previously mentioned issue.
2. Is there any DCGM config file where I can enable this feature, other than the `systemd` script?
3. Is there a way to set up permissions to limit which users can connect remotely to a hostengine? One potential problem with simply accepting any connection is that users could read the performance metrics from nodes allocated to other users. Is there a way to ensure that users can only connect to hostengines running on hosts allocated to them as part of their SLURM jobs?
Many thanks in advance,
Marcel