@swswsw1 commented Sep 5, 2025

This PR adds GPU health monitoring capabilities with XID detection to the resiliency framework.

  • XID detection using NVML and DCGM
  • Health check API with fault injection
  • Integration with existing health check infrastructure
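To make the fault-injection idea concrete, here is a minimal sketch of how a health check with an injectable XID could look. It is not the code in this PR: `XidHealthCheck`, `HealthResult`, `inject_fault`, `clear_fault`, and the `read_xids` callback are hypothetical names, and the set of XIDs treated as fatal is an assumed policy. A real implementation would read XID events from NVML or DCGM; the sketch takes a plain callable instead so it runs without a GPU.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional

# Assumed policy for illustration: which XIDs mark a GPU as unhealthy.
# The descriptions follow NVIDIA's XID documentation; the choice of
# which ones are fatal is an assumption, not taken from this PR.
FATAL_XIDS = {
    48: "double-bit ECC error",
    74: "NVLink error",
    79: "GPU has fallen off the bus",
}


@dataclass
class HealthResult:
    healthy: bool
    reason: str = ""


class XidHealthCheck:
    """Hypothetical health check that fails when a fatal XID is seen."""

    def __init__(self, read_xids: Callable[[], Iterable[int]]):
        # read_xids stands in for the NVML/DCGM query (assumption).
        self._read_xids = read_xids
        self._injected: Optional[int] = None

    def inject_fault(self, xid: int) -> None:
        """Pretend the given XID was reported, for testing recovery paths."""
        self._injected = xid

    def clear_fault(self) -> None:
        """Remove a previously injected XID."""
        self._injected = None

    def check(self) -> HealthResult:
        xids = list(self._read_xids())
        if self._injected is not None:
            xids.append(self._injected)
        for xid in xids:
            if xid in FATAL_XIDS:
                return HealthResult(False, f"XID {xid}: {FATAL_XIDS[xid]}")
        return HealthResult(True)
```

With no reported XIDs, `XidHealthCheck(lambda: []).check()` reports healthy. After `inject_fault(79)` the same check reports unhealthy until `clear_fault()` is called, which lets the recovery path be exercised without a real hardware fault.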

Work in progress - API and implementation details may change.

Wei Shen added 2 commits September 4, 2025 17:27
- Initial implementation of XID checks using NVML and DCGM
- Health check API with fault injection capabilities
- Use correct GPU rank mapping and error detection
- Clean up some detection methods, and stop the watchdog from rethrowing CUDA and NCCL errors
- Merged with existing ChainedGPUHealthCheck, ChainedNVLHealthCheck, and ChainedNicHealthCheck classes
- Resolved merge conflict to combine both health check systems