hexinw-nvidia
Contributor

feat: Add Injob time profiling metrics

- Add new profiling.py module with PyTorch metrics integration
- Implement ProfilingEvent enum for tracking key InJob events:
  * FAILURE_DETECTED, WORKER_TERMINATED
  * RENDEZVOUS_STARTED, RENDEZVOUS_COMPLETED
  * WORKER_START_STARTED, WORKER_START_COMPLETED
- Add FaultToleranceProfiler class with thread-safe event recording
- Integrate profiling events throughout fault tolerance lifecycle:
  * Record rendezvous timing in FtRendezvousHandler
  * Track worker lifecycle events in LocalElasticAgent
  * Log comprehensive timing metrics summary on shutdown
- Calculate and report timing metrics by restart cycles:
  * Failure to termination time
  * Rendezvous duration
  * Worker start time
  * Total cycle time (startup vs restart)
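
The event set and per-cycle timings listed above can be sketched roughly as follows. This is a minimal, standard-library-only illustration and not the code in this PR: the enum string values, the `record_event` and `cycle_metrics` method names, the rule that `FAILURE_DETECTED` opens a new restart cycle, and the choice of start event for "total cycle time" are all assumptions made for the example. The PyTorch metrics integration is left out.

```python
import threading
import time
from enum import Enum
from typing import Dict, Optional


class ProfilingEvent(Enum):
    # Member names come from the PR description; the string values are assumed.
    FAILURE_DETECTED = "failure_detected"
    WORKER_TERMINATED = "worker_terminated"
    RENDEZVOUS_STARTED = "rendezvous_started"
    RENDEZVOUS_COMPLETED = "rendezvous_completed"
    WORKER_START_STARTED = "worker_start_started"
    WORKER_START_COMPLETED = "worker_start_completed"


class FaultToleranceProfiler:
    """Hypothetical sketch: stores one timestamp per event per restart cycle."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._cycle = 0  # cycle 0 is the initial startup
        self._events: Dict[int, Dict[ProfilingEvent, float]] = {}

    def record_event(
        self, event: ProfilingEvent, timestamp: Optional[float] = None
    ) -> float:
        ts = time.monotonic() if timestamp is None else timestamp
        with self._lock:
            # Assumption: a detected failure opens a new restart cycle.
            if event is ProfilingEvent.FAILURE_DETECTED:
                self._cycle += 1
            self._events.setdefault(self._cycle, {})[event] = ts
        return ts

    def cycle_metrics(self, cycle: int) -> Dict[str, float]:
        # Copy under the lock so the arithmetic runs on a stable snapshot.
        with self._lock:
            ev = dict(self._events.get(cycle, {}))

        e = ProfilingEvent
        # Assumption: startup cycles are measured from the first rendezvous,
        # restart cycles from the detected failure.
        cycle_start = e.FAILURE_DETECTED if cycle > 0 else e.RENDEZVOUS_STARTED
        pairs = {
            "failure_to_termination": (e.FAILURE_DETECTED, e.WORKER_TERMINATED),
            "rendezvous_duration": (e.RENDEZVOUS_STARTED, e.RENDEZVOUS_COMPLETED),
            "worker_start_time": (e.WORKER_START_STARTED, e.WORKER_START_COMPLETED),
            "total_cycle_time": (cycle_start, e.WORKER_START_COMPLETED),
        }
        # Only report a metric when both of its endpoints were recorded.
        return {
            name: ev[end] - ev[start]
            for name, (start, end) in pairs.items()
            if start in ev and end in ev
        }
```

Passing explicit timestamps, as the optional `timestamp` argument allows, keeps the arithmetic deterministic in tests; in a running agent the default `time.monotonic()` avoids jumps from wall-clock adjustments.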

@hexinw-nvidia
Contributor Author

output_example.txt

@hexinw-nvidia
Contributor Author

rhewett-nv self-requested a review September 15, 2025 18:25
rhewett-nv previously approved these changes Sep 15, 2025
from ..shared_utils.log_manager import LogConfig


class ProfilingEvent(Enum):
Contributor

Any reason we are creating a custom profiler and not using something standard or OneLogger?

Contributor Author

We are using the PyTorch metrics system for the profiling. NV OneLogger seems to duplicate functionality that PyTorch metrics already provides. There is no particular reason we went with PyTorch metrics, other than that NVRx is tightly integrated with PyTorch.

Contributor

apaithankar Sep 18, 2025

Since Megatron-LM and NeMo both use NV OneLogger, and it seems to be better integrated with other services (log/event/metrics collection), wouldn't using NV OneLogger make it easier to integrate when we have to create a dashboard in FACT?

Not opposed to using PyTorch metrics; just looking at the ecosystem for integration.

from enum import Enum
from typing import Optional

from nv_one_logger.api.one_logger_provider import OneLoggerProvider

@hexinw-nvidia Does this mean you are relying on the application for instantiation and configuration?


# Create and record the event
event_obj = Event.create(f"ft.{event.value}", attributes)
OneLoggerProvider.instance().recorder.event(None, event_obj)

@hexinw-nvidia Note that in the nv-one-logger API design, an event needs to belong to a span, e.g., the application span or any custom span. `Recorder.start` can create one, and the recorder will record its start and end timestamps.
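
To illustrate the span-ownership idea in isolation, here is a minimal sketch built from hypothetical stand-in classes (`SketchSpan`, `SketchEvent`, `SketchRecorder`). It does not use or reproduce the real nv-one-logger API; in particular, rejecting a `None` span is a choice made for this sketch, not a statement about how the library behaves.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SketchEvent:
    # Hypothetical stand-in for an event: a name, attributes, and a timestamp.
    name: str
    attributes: Dict[str, object]
    timestamp: float


@dataclass
class SketchSpan:
    # Hypothetical stand-in for a span: a timed interval that owns events.
    name: str
    start_time: float
    end_time: Optional[float] = None
    events: List[SketchEvent] = field(default_factory=list)


class SketchRecorder:
    """Hypothetical recorder: every event is attached to an open span."""

    def start(self, span_name: str) -> SketchSpan:
        # Starting a span records its start timestamp.
        return SketchSpan(name=span_name, start_time=time.monotonic())

    def event(
        self,
        span: Optional[SketchSpan],
        name: str,
        attributes: Optional[Dict[str, object]] = None,
    ) -> SketchEvent:
        # Sketch-only rule: refuse events that have no owning span.
        if span is None:
            raise ValueError("event must belong to a span")
        evt = SketchEvent(name, dict(attributes or {}), time.monotonic())
        span.events.append(evt)
        return evt

    def stop(self, span: SketchSpan) -> None:
        # Stopping a span records its end timestamp.
        span.end_time = time.monotonic()


recorder = SketchRecorder()
restart_span = recorder.start("ft.restart_cycle")
recorder.event(restart_span, "ft.rendezvous_started", {"cycle": 1})
recorder.event(restart_span, "ft.rendezvous_completed", {"cycle": 1})
recorder.stop(restart_span)
```

With this shape, each fault tolerance event carries its cycle context through the owning span, and the span's own start and end timestamps give the overall cycle duration.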

hexinw-nvidia merged commit 4dce778 into NVIDIA:main Sep 30, 2025
9 checks passed
Labels
ci-approved Approved to run CI
5 participants