JonSnow1807
Add GPU Monitoring Example for ML Training Workflows

Description

This PR adds a comprehensive example demonstrating GPU monitoring in Metaflow, which is essential for ML engineers building training infrastructure.

Motivation

Many users in the community have asked about GPU usage patterns in Metaflow. This example provides a clear, working demonstration of:

  • GPU resource allocation using the @resources decorator
  • GPU availability detection with graceful fallback
  • Memory usage monitoring during training
  • Best practices for GPU-enabled flows

This is particularly relevant given the increasing focus on large-scale ML training and the need for efficient GPU utilization.
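The availability-detection pattern listed above can be sketched roughly as follows. The helper name and the use of PyTorch for probing are assumptions for illustration, not necessarily the PR's actual code:

```python
def detect_gpu():
    """Detect GPU availability with a graceful CPU-only fallback (illustrative sketch).

    Returns an (available, device_count) tuple. If PyTorch is not installed
    or no CUDA device is visible, report a CPU-only environment instead of
    raising, so the flow can still run end to end.
    """
    try:
        import torch  # optional dependency; absence means CPU-only
    except ImportError:
        return False, 0
    if torch.cuda.is_available():
        return True, torch.cuda.device_count()
    return False, 0
```

Returning a tuple rather than raising keeps the same step code runnable on laptops, CI machines, and GPU nodes alike.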

Changes

  • Added examples/tutorials/gpu_monitoring/gpu_flow.py - Complete example flow demonstrating GPU monitoring
  • Added examples/tutorials/gpu_monitoring/README.md - Comprehensive documentation with usage instructions
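For readers without the diff at hand, a `gpu_flow.py` of this shape might look like the skeleton below. The class and step names are hypothetical, and the import shim exists only so the sketch can be read on machines without Metaflow installed; the real example would import `metaflow` directly:

```python
try:
    from metaflow import FlowSpec, step, resources
except ImportError:
    # Shim so this sketch parses without Metaflow installed;
    # the actual example assumes Metaflow is present.
    FlowSpec = object
    step = lambda f: f
    resources = lambda **kwargs: (lambda f: f)

class GPUMonitorFlow(FlowSpec):
    """Hypothetical skeleton of a GPU-monitoring training flow."""

    @resources(gpu=1, memory=16000)
    @step
    def start(self):
        # Record whether a GPU is actually visible inside the task,
        # falling back gracefully in CPU-only environments.
        try:
            import torch
            self.gpu_available = torch.cuda.is_available()
        except ImportError:
            self.gpu_available = False
        self.next(self.end)

    @step
    def end(self):
        print(f"GPU available: {self.gpu_available}")

# In the real file:
# if __name__ == "__main__":
#     GPUMonitorFlow()
```

Probing inside the step matters because @resources only requests a GPU from the scheduler; the task itself should verify one is actually attached.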

Testing

  • Example runs successfully with python gpu_flow.py run
  • Example handles both GPU and CPU-only environments gracefully
  • Code follows Metaflow conventions and best practices
  • Documentation is clear and includes practical use cases
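The graceful-degradation behavior checked above can be illustrated with a small memory-probe helper; the function name and torch-based probing are assumptions for this sketch, not necessarily what the example uses:

```python
def gpu_memory_allocated_mb():
    """Return currently allocated GPU memory in MB, or None on CPU-only hosts.

    Returning None rather than raising lets the same monitoring code run
    unchanged whether or not a GPU (or PyTorch) is present.
    """
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.memory_allocated() / (1024 ** 2)
```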

Type of Change

  • [x] Documentation/Example
  • [ ] Bug fix
  • [ ] New feature
  • [ ] Breaking change

Additional Notes

As a first-time contributor, I'm excited to help improve Metaflow's documentation! This example addresses a common need in the ML community for understanding GPU resource management in Metaflow. Happy to make any adjustments based on feedback.
