Add MI300X Runners to AMD workflow. #149
Conversation
```diff
@@ -13,10 +13,18 @@ on:

 jobs:
   run:
-    runs-on: [amdgpu-mi250-x86-64]
+    runs-on: ${{ matrix.runs-on }}
+    strategy:
```
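For context, the `strategy:` line above is truncated in the diff; it presumably expands into a matrix over both runner labels. A minimal sketch of what that block might look like, assuming a hypothetical MI300X label (only `amdgpu-mi250-x86-64` is visible in the original workflow):

```yaml
jobs:
  run:
    runs-on: ${{ matrix.runs-on }}
    strategy:
      matrix:
        # amdgpu-mi300-x86-64 is a placeholder label; only the MI250 label
        # appears in the original diff.
        runs-on: [amdgpu-mi250-x86-64, amdgpu-mi300-x86-64]
```

With this shape, every triggered run fans out to both card types, which is the behavior the review below pushes back on.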
I don't think this is the way we want to handle multiple runners; afaics, this will always run the given code sample on both types of AMD cards. Instead, the card should be selectable by an input argument that is given to the job.
Both runners should be registered in src/discord-cluster-manager/consts.py, and
https://github.com/gpu-mode/discord-cluster-manager/blob/main/src/discord-cluster-manager/cogs/github_cog.py#L35-L40
needs to be updated.
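One way to make the card selectable, as suggested here, is to thread a workflow_dispatch input straight into runs-on. A minimal sketch under the assumption that the workflow is dispatched by the bot; the input name `runner` and the MI300X label are placeholders, not the actual values from the PR:

```yaml
on:
  workflow_dispatch:
    inputs:
      runner:  # hypothetical input name, for illustration only
        description: "Self-hosted runner label to run the sample on"
        required: true
        default: amdgpu-mi250-x86-64

jobs:
  run:
    # Only the selected card picks up the job; no matrix fan-out.
    runs-on: ${{ github.event.inputs.runner }}
```

github_cog.py would then pass the chosen label as an input when it dispatches the workflow, and consts.py would hold the allowed labels.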
Makes sense, thanks for the info! I updated github_cog, but I don't think we need to touch consts.py.
To merge this PR faster, I'm thinking the easiest route is just moving to only using MI300 - wdyt @saienduri
Thanks for the signal @msaroufim! I updated the PR so that it should work with card selection. Does it look good to y'all?
Description
This commit adds MI300 nodes as an option to the current AMD workflow.
I tested a couple of times on the AMD workflow here (https://github.com/gpu-mode/discord-cluster-manager/actions/runs/12902691574).
Checklist
Before submitting this PR, ensure the following steps have been completed:
- Run /verifyruns on your own server.
- Check the /verifyruns results. (The GitHub runs may take a little longer; the Modal run is typically quick.)

For more information on running a cluster bot on your own server, see README.md.