JonSnow1807
Add GPU Monitoring Example for ML Training Workflows

Description

This PR adds a comprehensive example demonstrating GPU monitoring in Metaflow, which is essential for ML engineers building training infrastructure.

Motivation

Many users in the community have asked about GPU usage patterns in Metaflow. This example provides a clear, working demonstration of:

  • GPU resource allocation using the @resources decorator
  • GPU availability detection with graceful fallback
  • Memory usage monitoring during training
  • Best practices for GPU-enabled flows

This is particularly relevant given the increasing focus on large-scale ML training and the need for efficient GPU utilization.
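The availability-detection pattern listed above can be sketched roughly as follows. The helper name and the use of PyTorch for probing are assumptions for illustration, not necessarily the PR's actual code:

```python
def detect_gpu():
    """Detect GPU availability with a graceful CPU-only fallback (illustrative sketch).

    Returns an (available, device_count) tuple. If PyTorch is not installed
    or no CUDA device is visible, report a CPU-only environment instead of
    raising, so the flow can still run end to end.
    """
    try:
        import torch  # optional dependency; absence means CPU-only
    except ImportError:
        return False, 0
    if torch.cuda.is_available():
        return True, torch.cuda.device_count()
    return False, 0
```

Returning a tuple rather than raising keeps the same step code runnable on laptops, CI machines, and GPU nodes alike.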

Changes

  • Added examples/tutorials/gpu_monitoring/gpu_flow.py - Complete example flow demonstrating GPU monitoring
  • Added examples/tutorials/gpu_monitoring/README.md - Comprehensive documentation with usage instructions
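For readers without the diff at hand, a `gpu_flow.py` of this shape might look like the skeleton below. The class and step names are hypothetical, and the import shim exists only so the sketch can be read on machines without Metaflow installed; the real example would import `metaflow` directly:

```python
try:
    from metaflow import FlowSpec, step, resources
except ImportError:
    # Shim so this sketch parses without Metaflow installed;
    # the actual example assumes Metaflow is present.
    FlowSpec = object
    step = lambda f: f
    resources = lambda **kwargs: (lambda f: f)

class GPUMonitorFlow(FlowSpec):
    """Hypothetical skeleton of a GPU-monitoring training flow."""

    @resources(gpu=1, memory=16000)
    @step
    def start(self):
        # Record whether a GPU is actually visible inside the task,
        # falling back gracefully in CPU-only environments.
        try:
            import torch
            self.gpu_available = torch.cuda.is_available()
        except ImportError:
            self.gpu_available = False
        self.next(self.end)

    @step
    def end(self):
        print(f"GPU available: {self.gpu_available}")

# In the real file:
# if __name__ == "__main__":
#     GPUMonitorFlow()
```

Probing inside the step matters because @resources only requests a GPU from the scheduler; the task itself should verify one is actually attached.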

Testing

  • Example runs successfully with python gpu_flow.py run
  • Example handles both GPU and CPU-only environments gracefully
  • Code follows Metaflow conventions and best practices
  • Documentation is clear and includes practical use cases
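The graceful-degradation behavior checked above can be illustrated with a small memory-probe helper; the function name and torch-based probing are assumptions for this sketch, not necessarily what the example uses:

```python
def gpu_memory_allocated_mb():
    """Return currently allocated GPU memory in MB, or None on CPU-only hosts.

    Returning None rather than raising lets the same monitoring code run
    unchanged whether or not a GPU (or PyTorch) is present.
    """
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.memory_allocated() / (1024 ** 2)
```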

Type of Change

  • [x] Documentation/Example
  • [ ] Bug fix
  • [ ] New feature
  • [ ] Breaking change

Additional Notes

As a first-time contributor, I'm excited to help improve Metaflow's documentation! This example addresses a common need in the ML community for understanding GPU resource management in Metaflow. Happy to make any adjustments based on feedback.
