Add GPU monitoring example for ML training workflows #2518
Add GPU Monitoring Example for ML Training Workflows
Description
This PR adds a comprehensive example demonstrating GPU monitoring in Metaflow, which is essential for ML engineers building training infrastructure.
Motivation
Many users in the community have asked about GPU usage patterns in Metaflow. This example provides a clear, working demonstration of:
- @resources decorator
This is particularly relevant given the increasing focus on large-scale ML training and the need for efficient GPU utilization.
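The contents of gpu_flow.py are not shown in this description, so the following is only a rough sketch of what a GPU monitoring helper might look like, not the code in this PR. It assumes the `nvidia-smi` CLI is on the PATH of the machine running the step. The names `parse_gpu_stats`, `query_gpu_stats`, and `QUERY_FIELDS` are hypothetical and chosen for illustration.

```python
import shutil
import subprocess

# Hypothetical helper, not taken from gpu_flow.py.
# Fields requested from nvidia-smi; memory values are reported in MiB.
QUERY_FIELDS = "utilization.gpu,memory.used,memory.total"


def parse_gpu_stats(csv_text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output.

    Returns one dict per GPU; rows that cannot be parsed are skipped.
    """
    stats = []
    for line in csv_text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 3:
            continue
        try:
            util, used, total = (float(p) for p in parts)
        except ValueError:
            # Some devices report "[N/A]" for unsupported fields.
            continue
        stats.append(
            {
                "utilization_pct": util,
                "memory_used_mib": used,
                "memory_total_mib": total,
            }
        )
    return stats


def query_gpu_stats():
    """Return per-GPU stats, or an empty list if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    return parse_gpu_stats(result.stdout)


if __name__ == "__main__":
    for index, gpu in enumerate(query_gpu_stats()):
        print(f"GPU {index}: {gpu}")
```

Inside a Metaflow step decorated with `@resources(gpu=1)`, a helper like this could be called before and after the training code to log utilization. Returning an empty list when no GPU tooling is present keeps the same flow runnable on a CPU-only laptop.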
Changes
examples/tutorials/gpu_monitoring/gpu_flow.py - Complete example flow demonstrating GPU monitoring
examples/tutorials/gpu_monitoring/README.md - Comprehensive documentation with usage instructions
Testing
python gpu_flow.py run
Type of Change
Additional Notes
First-time contributor excited to help improve Metaflow's documentation! This example addresses a common need in the ML community for understanding GPU resource management in Metaflow. Happy to make any adjustments based on feedback.