Fault injection in my pytorch training job #192

Open
hjx620 opened this issue Sep 22, 2024 · 1 comment

Comments


hjx620 commented Sep 22, 2024

I want to implement low-cost recovery from deep learning training failures, so I need to simulate errors in my PyTorch training script to test my system.

I found that DCGM supports fault injection, such as:

How can I use this from my PyTorch script? Thanks.

Collaborator

glowkey commented Sep 25, 2024

Injecting errors into DCGM does not inject errors into the driver, NVML, or any other layer below DCGM itself. If your PyTorch code integrates with DCGM to determine GPU health (using the DCGM health API, for example), then injecting an error will trigger a callback. But if your PyTorch code relies on anything lower than DCGM in the software stack, error injection will have no effect.
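
To illustrate the point above, here is a rough sketch of the pattern in which an injected error can actually reach a training job: the loop itself polls a health probe and rolls back to its last checkpoint when the probe reports a fault. Everything in it is hypothetical. `InjectableHealthProbe`, `GpuFault`, and `train` are made-up names, the probe only stands in for wherever a real DCGM health query would go, and the arithmetic in the loop is a placeholder for a PyTorch forward/backward pass. It does not call DCGM or PyTorch.

```python
# Hypothetical sketch: none of these names come from DCGM or PyTorch.


class GpuFault(RuntimeError):
    """Raised when the health probe reports a problem."""


class InjectableHealthProbe:
    """Stand-in for a real health query (assumed to sit where a DCGM
    health check would be called). A test can arm a fault for a step."""

    def __init__(self):
        self._fail_at = set()

    def inject_fault(self, step):
        # Arm a one-shot fault that fires when training reaches `step`.
        self._fail_at.add(step)

    def check(self, step):
        if step in self._fail_at:
            self._fail_at.discard(step)  # fire once, then clear
            raise GpuFault(f"injected fault at step {step}")


def train(total_steps, probe, checkpoint_every=5):
    """Run a toy loop that polls the probe before every step."""
    state = {"step": 0, "loss": 100.0}
    checkpoint = dict(state)  # in-memory stand-in for a saved checkpoint
    recoveries = 0
    while state["step"] < total_steps:
        try:
            probe.check(state["step"])
            # Placeholder for forward / backward / optimizer step.
            state["loss"] *= 0.9
            state["step"] += 1
            if state["step"] % checkpoint_every == 0:
                checkpoint = dict(state)
        except GpuFault:
            # Recovery path: roll back to the last checkpoint and retry.
            state = dict(checkpoint)
            recoveries += 1
    return state, recoveries


if __name__ == "__main__":
    probe = InjectableHealthProbe()
    probe.inject_fault(7)
    final_state, recoveries = train(10, probe)
    print(final_state["step"], recoveries)
```

With this structure, the recovery logic can be tested by arming the probe directly, regardless of which monitoring layer eventually backs `check()`.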
