feat: Add Injob time profiling metrics #185
hexinw-nvidia
commented
Sep 14, 2025
- Add new profiling.py module with PyTorch metrics integration
- Implement ProfilingEvent enum for tracking key InJob events:
  * FAILURE_DETECTED, WORKER_TERMINATED
  * RENDEZVOUS_STARTED, RENDEZVOUS_COMPLETED
  * WORKER_START_STARTED, WORKER_START_COMPLETED
- Add FaultToleranceProfiler class with thread-safe event recording
- Integrate profiling events throughout fault tolerance lifecycle:
  * Record rendezvous timing in FtRendezvousHandler
  * Track worker lifecycle events in LocalElasticAgent
  * Log comprehensive timing metrics summary on shutdown
- Calculate and report timing metrics by restart cycles:
  * Failure to termination time
  * Rendezvous duration
  * Worker start time
  * Total cycle time (startup vs restart)
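For readers following the description, here is a minimal sketch of how thread-safe event recording and per-cycle timing could fit together. It is not the code from this PR: `SketchProfiler`, its methods, the enum string values, and the rule that `FAILURE_DETECTED` opens a new restart cycle are all assumptions made for illustration. Only the event names and the four reported metrics come from the description above.

```python
import threading
import time
from enum import Enum
from typing import Dict, List, Optional, Tuple


class ProfilingEvent(Enum):
    # Event names come from the PR description; the string values are assumed.
    FAILURE_DETECTED = "failure_detected"
    WORKER_TERMINATED = "worker_terminated"
    RENDEZVOUS_STARTED = "rendezvous_started"
    RENDEZVOUS_COMPLETED = "rendezvous_completed"
    WORKER_START_STARTED = "worker_start_started"
    WORKER_START_COMPLETED = "worker_start_completed"


class SketchProfiler:
    """Hypothetical stand-in for FaultToleranceProfiler (not the PR's class)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        # Each entry is (restart cycle, event, timestamp in seconds).
        self._events: List[Tuple[int, ProfilingEvent, float]] = []
        self._cycle = 0  # cycle 0 is the initial startup

    def record(self, event: ProfilingEvent, timestamp: Optional[float] = None) -> None:
        ts = time.monotonic() if timestamp is None else timestamp
        with self._lock:
            # Assumption: a detected failure starts a new restart cycle.
            if event is ProfilingEvent.FAILURE_DETECTED:
                self._cycle += 1
            self._events.append((self._cycle, event, ts))

    def cycle_metrics(self, cycle: int) -> Dict[str, Optional[float]]:
        # Copy the relevant timestamps under the lock, compute outside it.
        with self._lock:
            stamps = {ev: ts for c, ev, ts in self._events if c == cycle}

        def delta(start: ProfilingEvent, end: ProfilingEvent) -> Optional[float]:
            if start in stamps and end in stamps:
                return stamps[end] - stamps[start]
            return None

        total = max(stamps.values()) - min(stamps.values()) if stamps else None
        return {
            "failure_to_termination": delta(
                ProfilingEvent.FAILURE_DETECTED, ProfilingEvent.WORKER_TERMINATED
            ),
            "rendezvous_duration": delta(
                ProfilingEvent.RENDEZVOUS_STARTED, ProfilingEvent.RENDEZVOUS_COMPLETED
            ),
            "worker_start_time": delta(
                ProfilingEvent.WORKER_START_STARTED, ProfilingEvent.WORKER_START_COMPLETED
            ),
            "total_cycle_time": total,
        }
```

In this sketch the startup cycle (cycle 0) has no failure event, so its failure-to-termination value is `None`, which mirrors the "startup vs restart" distinction in the last bullet.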
from ..shared_utils.log_manager import LogConfig


class ProfilingEvent(Enum):
Any reason we are creating a custom profiler and not using something standard or OneLogger?
We are using the PyTorch metrics system for the profiling. NV OneLogger seems to duplicate functionality that PyTorch metrics already provides. There is no particular reason we went with PyTorch metrics; mainly, NVRx is tightly integrated with PyTorch.
Since Megatron-LM and NeMo both use NV OneLogger, and it seems better integrated with other services (log/event/metrics collection), wouldn't NV OneLogger make integration easier when we have to create a dashboard in FACT?
Not opposed to using TorchMetrics, just looking at the ecosystem for integration
from enum import Enum
from typing import Optional

from nv_one_logger.api.one_logger_provider import OneLoggerProvider
@hexinw-nvidia does this mean you are relying on the application for instantiation and configuration?
# Create and record the event
event_obj = Event.create(f"ft.{event.value}", attributes)
OneLoggerProvider.instance().recorder.event(None, event_obj)
@hexinw-nvidia Note that in the nv-one-logger API design, an event needs to belong to a span, e.g., the application span or any custom span. Recorder.start can create one, and the recorder will record its start and end timestamps.
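To illustrate the span/event relationship described in this comment, here is a small self-contained sketch. It does not use nv-one-logger and does not reproduce its API: `SketchSpan`, `SketchEvent`, `SketchRecorder`, and their method signatures are hypothetical, and rejecting a missing span with `ValueError` is a choice made for this sketch only. The idea shown is that a span carries start and end timestamps, and every event is attached to a span rather than recorded on its own.

```python
import time
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SketchEvent:
    name: str
    timestamp: float


@dataclass
class SketchSpan:
    name: str
    start_time: float
    end_time: Optional[float] = None
    events: List[SketchEvent] = field(default_factory=list)


class SketchRecorder:
    """Hypothetical recorder; not the nv-one-logger Recorder."""

    def start(self, span_name: str) -> SketchSpan:
        # Opening a span records its start timestamp.
        return SketchSpan(name=span_name, start_time=time.monotonic())

    def event(self, span: Optional[SketchSpan], name: str) -> SketchEvent:
        # In this sketch an event without an owning span is rejected.
        if span is None:
            raise ValueError("event must belong to a span")
        if span.end_time is not None:
            raise ValueError("span already stopped")
        ev = SketchEvent(name=name, timestamp=time.monotonic())
        span.events.append(ev)
        return ev

    def stop(self, span: SketchSpan) -> None:
        # Closing a span records its end timestamp.
        span.end_time = time.monotonic()


recorder = SketchRecorder()
rendezvous_span = recorder.start("ft.rendezvous")
recorder.event(rendezvous_span, "ft.rendezvous_started")
recorder.event(rendezvous_span, "ft.rendezvous_completed")
recorder.stop(rendezvous_span)
```

Under this model, the rendezvous duration falls out of the span's own start and end timestamps, so the profiler would not need to pair up separate "started" and "completed" events itself.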
…g multi-participants on one node.