Fault injection in my pytorch training job #192

hjx620 · 2024-09-22T02:29:48Z

I am building a low-cost recovery mechanism for deep learning training failures, so I need to simulate errors in my PyTorch training script to test it.

I found that DCGM supports fault injection for errors such as:

PCIe Replay Errors
ECC Errors (single double-bit error or multiple co-located single-bit errors)
Power Excursions
Thermal Excursions
XID Errors
NVLink Errors
Reference: dcgm-error-injection.html

How can I use these from my PyTorch script? Thanks.

glowkey · 2024-09-25T14:35:25Z

Injecting errors into DCGM does not inject errors into the driver, NVML, or any other layer below DCGM itself. If your PyTorch code integrates with DCGM to determine GPU health (using the DCGM health API, for example), then injecting an error will trigger a callback. But if your PyTorch code relies on anything lower than DCGM in the software stack, error injection will have no effect.
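To illustrate the point above, here is a minimal sketch of a training loop that only reacts to injected faults because it polls a health layer on every step. Everything in it is an assumption made for illustration: `FakeHealthSource`, `inject_once_at`, and `train` are hypothetical names, the class is a stand-in for a DCGM-backed health check rather than the real DCGM API, and the training step is a placeholder instead of actual PyTorch code.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class FakeHealthSource:
    """Hypothetical stand-in for a DCGM-backed health check (not the DCGM API)."""
    pending: List[str] = field(default_factory=list)

    def inject(self, description: str) -> None:
        # Plays the role of DCGM error injection: only this layer sees the fault.
        self.pending.append(description)

    def check(self) -> Optional[str]:
        # Return the oldest unreported fault, or None if the GPU looks healthy.
        return self.pending.pop(0) if self.pending else None


def inject_once_at(target_step: int, health: FakeHealthSource,
                   description: str) -> Callable[[int], None]:
    """Build a test hook that injects a single fault the first time target_step is seen."""
    fired = False

    def hook(step: int) -> None:
        nonlocal fired
        if step == target_step and not fired:
            fired = True
            health.inject(description)

    return hook


def train(num_steps: int, health: FakeHealthSource,
          on_step: Optional[Callable[[int], None]] = None,
          checkpoint_every: int = 2) -> Tuple[int, List[Tuple[int, str]]]:
    """Run a placeholder training loop that rolls back to the last checkpoint on a fault."""
    step = 0
    last_checkpoint = 0
    recoveries: List[Tuple[int, str]] = []
    while step < num_steps:
        if on_step is not None:
            on_step(step)  # test hook; may inject a fault
        fault = health.check()
        if fault is not None:
            # The loop notices the fault only because it polls the health layer.
            recoveries.append((step, fault))
            step = last_checkpoint  # roll back to the last checkpoint
            continue
        step += 1  # placeholder for forward/backward/optimizer step
        if step % checkpoint_every == 0:
            last_checkpoint = step  # placeholder for saving a checkpoint
    return step, recoveries


health = FakeHealthSource()
hook = inject_once_at(3, health, "simulated XID error")
final_step, recoveries = train(5, health, on_step=hook)
print(final_step, recoveries)
```

If the `health.check()` call is removed from the loop, the injected fault is never observed and training runs straight through, which mirrors the situation where the training code does not consult DCGM at all.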
