Commit c4f19e4

CI: DCGM strikes again, it appears to be running on the nv-gha-runners.
1 parent 85868fd commit c4f19e4

1 file changed: +14 -0 lines changed

.github/workflows/test-brev-tutorial-docker-images.yml

Lines changed: 14 additions & 0 deletions
@@ -63,6 +63,20 @@ jobs:
           username: ${{ github.actor }}
           password: ${{ secrets.GITHUB_TOKEN }}
 
+      - name: Stop DCGM to allow NCU profiling
+        run: |
+          # DCGM (Data Center GPU Manager) locks the GPU and prevents NCU from profiling.
+          # Stop it before running the container tests.
+          echo "Stopping DCGM services..."
+          sudo systemctl stop nvidia-dcgm || echo "nvidia-dcgm service not found or already stopped"
+          sudo systemctl stop dcgm || echo "dcgm service not found or already stopped"
+          # Also try nv-hostengine which DCGM uses
+          sudo systemctl stop nv-hostengine || echo "nv-hostengine service not found or already stopped"
+          # Kill any remaining dcgm processes
+          sudo pkill -9 nv-hostengine || echo "No nv-hostengine processes found"
+          sudo pkill -9 dcgm || echo "No dcgm processes found"
+          echo "DCGM services stopped."
+
       - name: Test Docker Compose
         id: test
         run: |
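
The added step prints "DCGM services stopped." whether or not anything was actually stopped. If confirmation is wanted, one possible variation is sketched below. It is not part of this commit: the helper names stop_unit_if_active and wait_for_exit are hypothetical, and it assumes the runner has systemctl and pgrep available. The unit and process names are the ones the step above already uses.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical and not part of the commit.

# Stop a systemd unit only when it is currently active.
stop_unit_if_active() {
  local unit="$1"
  # `systemctl is-active --quiet` exits 0 only when the unit is active.
  if systemctl is-active --quiet "$unit" 2>/dev/null; then
    echo "Stopping $unit"
    sudo systemctl stop "$unit"
  else
    echo "$unit is not active; skipping"
  fi
}

# Poll until no process with exactly this name remains.
# Returns 0 once the process is gone, 1 if it is still present after
# the given number of one-second attempts (default 5).
wait_for_exit() {
  local name="$1"
  local tries="${2:-5}"
  local i=0
  while [ "$i" -lt "$tries" ]; do
    # `pgrep -x` matches the exact process name; non-zero means no match.
    if ! pgrep -x "$name" >/dev/null 2>&1; then
      return 0
    fi
    sleep 1
    i=$((i + 1))
  done
  return 1
}
```

Inside the workflow step these could be called as `stop_unit_if_active nvidia-dcgm` followed by `wait_for_exit nv-hostengine || { echo "nv-hostengine still running"; exit 1; }`. Whether the job should fail at that point, rather than continue as the committed step does, is a judgment call for the workflow owner.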
