
Add draft gpu troubles #290

Draft · wants to merge 7 commits into main

Conversation

mhuguesaws (Contributor)

Issue #, if available:

Description of changes:

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

```bash
scancel [JOB_ID]
```

1. Reset the GPUs

Contributor: Add a link to the reset option for `nvidia-smi`.
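For context, the reset option the comment refers to is `nvidia-smi`'s GPU reset. A minimal sketch, assuming the drained node is idle (the command requires root and fails if any process is still using the GPUs; the device index is illustrative):

```shell
# Reset every GPU on the node (all jobs must be cancelled first)
sudo nvidia-smi --gpu-reset

# Or reset a single GPU by index (0 is illustrative; list indices with `nvidia-smi -L`)
sudo nvidia-smi --gpu-reset -i 0
```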

```bash
sudo /opt/slurm/bin/scontrol create res starttime=now duration=infinite flags=ignore_jobs user=root nodes=[NODE_TO_TERMINATE]
```

1. Cancel

Contributor: Cancel the job in Slurm.

The node will have a **DRAIN** status. Then the instance will be terminated and replaced.

1. Delete the reservation

Contributor: What is RES_NUMBER?

```bash
scancel [JOB_ID]
```

1. Place the node in **DRAIN**.

Contributor: Is the node to terminate an IP address or a node name?
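To the question above: Slurm addresses nodes by their NodeName, not by IP. A hedged sketch of listing node names and draining one (the node name and reason string are illustrative):

```shell
# Show node names and states as Slurm knows them (NodeName column, not an IP)
sinfo -N -h -o '%N %t'

# Place a node in DRAIN so no new jobs are scheduled onto it
sudo scontrol update nodename=gpu-queue-dy-p4d24xlarge-1 state=drain reason="Xid 95: uncontained ECC error"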


1. Create a reservation to isolate the node from being used by any jobs.
```bash
sudo /opt/slurm/bin/scontrol create res starttime=now duration=infinite flags=ignore_jobs user=root nodes=[NODE_TO_TERMINATE]
```

Contributor: What should NODE_TO_TERMINATE be?
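One plausible answer (an assumption, not stated in the PR): NODE_TO_TERMINATE is the Slurm node name of the failing instance, which can be looked up from the affected job (the job ID 1234 is illustrative):

```shell
# Node(s) allocated to a running job
squeue -j 1234 -h -o '%N'

# For a job that has already finished, query Slurm accounting instead
sacct -j 1234 -n -o NodeList
```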

| 95 | Uncontained ECC error | Reset GPUs | [AWS ParallelCluster](#reset-gpus) |

# AWS ParallelCluster

Contributor: Reference to the ParallelCluster documentation?

While running High-Performance Computing or Machine Learning workloads, GPUs may fail for various reasons, which are captured by Xid messages. These messages are written to `/var/log/messages` on Amazon Linux, and to `/var/log/syslog` and `/var/log/kern.log` on Ubuntu.

| Xid | Failure | Resolution | Orchestrator |

Contributor: Scheduler/Orchestrator
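The log check described in the quoted text can be sketched as follows, using the paths the guide names (the sample line in the test mirrors the NVIDIA driver's `NVRM: Xid` message format):

```shell
# Amazon Linux: Xid events land in /var/log/messages
sudo grep -i 'NVRM: Xid' /var/log/messages

# Ubuntu: check both syslog and the kernel log
sudo grep -i 'NVRM: Xid' /var/log/syslog /var/log/kern.log
```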

## Reset GPUs

Contributor: Say what resetting does and what NODE_TO_TERMINATE represents (or how to get it).


1. Delete the reservation
```bash
sudo /opt/slurm/bin/scontrol delete res root_[RES_NUMBER]
```

Contributor: What is RES_NUMBER?
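A possible answer (an assumption, not confirmed in the PR): when a reservation is created without an explicit name, Slurm generates one from the creating user plus a counter, so reservations created as root are typically named `root_1`, `root_2`, and so on. Existing names can be listed before deleting:

```shell
# List reservations to find the generated name
sudo /opt/slurm/bin/scontrol show reservations

# Delete a reservation by its full name (root_1 is illustrative)
sudo /opt/slurm/bin/scontrol delete ReservationName=root_1
```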


# Amazon SageMaker HyperPod

TBD

Contributor: TBA

@nghtm (Collaborator) commented May 3, 2024

looks good - plan to create a new PR for HyperPod instructions after p-cluster is merged.

@KeitaW force-pushed the feature/#289_gpu_failure branch 2 times, most recently from c9e0f5e to 30e6592 on June 4, 2024 02:26
@KeitaW force-pushed the main branch 3 times, most recently from 44e448e to 1209815 on June 4, 2024 02:30
@mhuguesaws force-pushed the feature/#289_gpu_failure branch 2 times, most recently from 90549b2 to 84f6f79 on June 11, 2024 21:00
@mhuguesaws force-pushed the feature/#289_gpu_failure branch from 84f6f79 to 7a102a3 on June 11, 2024 21:34