Add draft gpu troubles #290
base: main
Conversation
```bash
scancel [JOB_ID]
```

1. Reset the GPUs
Add a link to the reset option for `nvidia-smi`.
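For reference, the reset step would typically use `nvidia-smi`'s GPU reset option, which requires root and no processes using the GPU. A minimal sketch (the dry-run wrapper is illustrative, not part of the doc):

```shell
#!/usr/bin/env bash
# Illustrative helper: build the nvidia-smi reset command for one GPU index.
# With DRY_RUN=1 it only prints the command; otherwise it would execute it.
reset_gpu() {
  local gpu_index="$1"
  local cmd="nvidia-smi --gpu-reset -i ${gpu_index}"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "sudo ${cmd}"
  else
    sudo ${cmd}
  fi
}

# Print the command that would reset GPU 0 on the affected node.
DRY_RUN=1 reset_gpu 0
```

Omitting `-i` would reset all GPUs on the node; either way the node should already be drained so no job is touching them.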
```bash
sudo /opt/slurm/bin/scontrol create res starttime=now duration=infinite flags=ignore_jobs user=root nodes=[NODE_TO_TERMINATE]
```

1. Cancel
Cancel the job in Slurm.
The node will have a **DRAIN** status. The instance will then be terminated and replaced.

1. Delete the reservation
What is `RES_NUMBER`?
```bash
scancel [JOB_ID]
```

1. Place the node in **DRAIN**.
Is `NODE_TO_TERMINATE` an IP address or a node name?
1. Create a reservation to prevent the node from being used by any jobs.
```bash
sudo /opt/slurm/bin/scontrol create res starttime=now duration=infinite flags=ignore_jobs user=root nodes=[NODE_TO_TERMINATE]
```
What should `NODE_TO_TERMINATE` be?
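For what it's worth: `NODE_TO_TERMINATE` would be the Slurm node name (the `NodeName` column of `sinfo -N`), not an IP address. A sketch of picking out the drained node, using a hard-coded sample of `sinfo -N -h -o "%N %t"` output (the node names are made up):

```shell
#!/usr/bin/env bash
# Sample output of `sinfo -N -h -o "%N %t"`: one "NodeName state" pair per line.
sample_sinfo='queue1-dy-compute-1 drain
queue1-dy-compute-2 idle'

# NODE_TO_TERMINATE is the Slurm NodeName (first column), not an IP address.
node_to_terminate="$(printf '%s\n' "$sample_sinfo" | awk '$2 == "drain" {print $1}')"
echo "$node_to_terminate"   # queue1-dy-compute-1
```

On a live cluster, `scontrol show node [NODE_NAME]` also reports the matching `NodeAddr` if the IP is needed for the EC2 console.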
| 95 | Uncontained ECC error | Reset GPUs | [AWS ParallelCluster](#reset-gpus) |

# AWS ParallelCluster
Add a reference to the ParallelCluster documentation?
While running High-Performance Computing or Machine Learning workloads, GPUs may fail for various reasons that are captured by Xid messages. These messages are written to `/var/log/messages` on Amazon Linux, and to `/var/log/syslog` and `/var/log/kern.log` on Ubuntu.
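To illustrate, a sketch of pulling the Xid code out of such a log line; the sample line is representative, not from a real system (on a live node one would `grep -i xid` through the logs above instead):

```shell
#!/usr/bin/env bash
# Representative NVRM Xid line as the kernel driver writes it to the system log.
sample_line='NVRM: Xid (PCI:0000:10:1c): 95, pid=1234, Uncontained ECC error'

# The Xid code is the number right after the "(PCI:...):" prefix.
xid="$(printf '%s\n' "$sample_line" | sed -n 's/.*Xid ([^)]*): \([0-9]*\),.*/\1/p')"
echo "$xid"   # 95
```

The extracted code can then be looked up in the table below to decide on a resolution.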
| Xid | Failure | Resolution | Orchestrator |
| --- | --- | --- | --- |
Consider naming the column "Scheduler/Orchestrator".
## Reset GPUs
Say what resetting does and what `NODE_TO_TERMINATE` represents (or how to get it).
1. Delete the reservation
```bash
sudo /opt/slurm/bin/scontrol delete res root_[RES_NUMBER]
```
What is `RES_NUMBER`?
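For background (an assumption about Slurm's naming, worth confirming in the doc): reservations created by root without an explicit name are auto-named `root_1`, `root_2`, and so on, so `RES_NUMBER` is the numeric suffix shown by `scontrol show reservations`. A sketch of extracting the full name from a sample output line (the values are made up):

```shell
#!/usr/bin/env bash
# Sample line of `scontrol show reservations` output (values are made up).
sample_res='ReservationName=root_1 StartTime=2024-01-01T00:00:00 Duration=infinite Nodes=queue1-dy-compute-1'

# The reservation name (root_[RES_NUMBER]) is what `scontrol delete res` expects.
res_name="$(printf '%s\n' "$sample_res" | grep -o 'ReservationName=[^ ]*' | cut -d= -f2)"
echo "$res_name"   # root_1
```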
# Amazon SageMaker HyperPod

TBD
TBA

looks good - plan to create a new PR for HyperPod instructions after p-cluster is merged.
Signed-off-by: AWS ParallelCluster user <[email protected]>
Issue #, if available:
Description of changes:
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.