Add MI300X Runners to AMD workflow. #149

saienduri · 2025-01-22T06:50:37Z

Description

This commit adds MI300 nodes as an option to the current AMD workflow.

I tested this a couple of times with the AMD workflow; see this run: https://github.com/gpu-mode/discord-cluster-manager/actions/runs/12902691574

Checklist

Before submitting this PR, ensure the following steps have been completed:

Run the slash command /verifyruns on your own server.
- Run the cluster bot on your server:
```
python discord-bot.py
```
- Start training runs with the slash command /verifyruns.
- Verify that the bot eventually responds with:
```
✅ All runs completed successfully!
```
  (It may take a few minutes for all runs to finish. In particular, the GitHub
  runs may take a little longer. The Modal run is typically quick.)
  For more information on running a cluster bot on your own server, see
  README.md.

ngc92 · 2025-01-22T09:52:37Z

.github/workflows/amd_workflow.yml

@@ -13,10 +13,18 @@ on:

 jobs:
  run:
-    runs-on: [amdgpu-mi250-x86-64]
+    runs-on: ${{ matrix.runs-on }}
+    strategy:


I don't think this is the way we want to handle multiple runners; afaics, this will always run the given code sample on both types of AMD cards. Instead, the card should be selectable by an input argument that is given to the job.

Both runners should be registered in src/discord-cluster-manager/consts.py, and
https://github.com/gpu-mode/discord-cluster-manager/blob/main/src/discord-cluster-manager/cogs/github_cog.py#L35-L40
needs to be updated
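As a rough illustration of the input-based selection suggested above, here is a minimal Python sketch of how the bot side could map a requested card to a single runner label and pass it to the workflow as a dispatch input. It is not the code in github_cog.py or consts.py. The `AMDRunner` enum, the `dispatch_payload` helper, the `runner` input name, and the MI300 label are all hypothetical; only the MI250 label comes from the diff. The workflow would also have to declare a matching `workflow_dispatch` input and read it in `runs-on`, instead of expanding a matrix over both cards.

```python
from enum import Enum


class AMDRunner(Enum):
    """Runner labels keyed by card. Only the MI250 label appears in the diff."""

    MI250 = "amdgpu-mi250-x86-64"
    MI300 = "amdgpu-mi300-x86-64"  # hypothetical label, not confirmed by this PR


def dispatch_payload(gpu: str, ref: str = "main") -> dict:
    """Build a workflow_dispatch request body that targets exactly one card."""
    try:
        runner = AMDRunner[gpu.upper()]
    except KeyError:
        valid = ", ".join(r.name for r in AMDRunner)
        raise ValueError(f"unknown AMD GPU {gpu!r}; expected one of: {valid}") from None
    # "runner" is a hypothetical input name that the workflow would need to declare.
    return {"ref": ref, "inputs": {"runner": runner.value}}


if __name__ == "__main__":
    # One request selects one runner, so the sample runs on a single card.
    print(dispatch_payload("mi250"))
```

With this shape, a request for one card produces one job on that runner, whereas a matrix over both labels schedules a job per card on every trigger.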

Makes sense, thanks for the info! I updated github_cog, but I don't think we need to touch consts.py.


msaroufim · 2025-02-15T20:41:08Z

To merge this PR faster I'm thinking what might be easiest is just moving to only using MI300 - wdyt @saienduri

saienduri · 2025-02-16T02:08:54Z

Thanks for the signal @msaroufim! I updated the PR, so that it should work with card selection. Does it look good to y'all?
Here's an example run I triggered in GitHub: https://github.com/gpu-mode/discord-cluster-manager/actions/runs/13350719364

saienduri requested review from ngc92 and msaroufim January 22, 2025 06:50

ngc92 reviewed Jan 22, 2025


saienduri force-pushed the mi300 branch from 53369a5 to 8d17921 Compare February 16, 2025 01:31

saienduri added 9 commits February 15, 2025 17:35

- mi300 changes (017a68e)
- use preset env git variable (e93a91f)
- add venv activate to script step (0dd623a)
- temp pip install amd torch (69aff9a)
- remove temp pip install (8d17921)
- add multiple gh runners (3921dc4)
- fix expression (258054c)
- runner description (0f11bdb)
- output file name (901d081)

saienduri requested a review from ngc92 February 16, 2025 02:09

msaroufim approved these changes Feb 16, 2025


msaroufim merged commit 00a6331 into main Feb 16, 2025
4 checks passed

saienduri deleted the mi300 branch February 16, 2025 10:01
