feat: Add Injob time profiling metrics #185

hexinw-nvidia · 2025-09-14T17:48:21Z

feat: Add Injob time profiling metrics

- Add new profiling.py module with PyTorch metrics integration
- Implement ProfilingEvent enum for tracking key InJob events:
  * FAILURE_DETECTED, WORKER_TERMINATED
  * RENDEZVOUS_STARTED, RENDEZVOUS_COMPLETED
  * WORKER_START_STARTED, WORKER_START_COMPLETED
- Add FaultToleranceProfiler class with thread-safe event recording
- Integrate profiling events throughout fault tolerance lifecycle:
  * Record rendezvous timing in FtRendezvousHandler
  * Track worker lifecycle events in LocalElasticAgent
  * Log comprehensive timing metrics summary on shutdown
- Calculate and report timing metrics by restart cycles:
  * Failure to termination time
  * Rendezvous duration
  * Worker start time
  * Total cycle time (startup vs restart)
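
The recording and per-cycle calculation described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the event names come from the list above, but the enum string values, the `SketchProfiler` class, its method names, and the metric keys are all hypothetical. It also assumes that a restart cycle is measured from failure detection and the initial startup cycle from rendezvous start.

```python
import threading
import time
from enum import Enum
from typing import Dict, List, Optional, Tuple


class ProfilingEvent(Enum):
    # Member names come from the PR description; the string values are assumed.
    FAILURE_DETECTED = "failure_detected"
    WORKER_TERMINATED = "worker_terminated"
    RENDEZVOUS_STARTED = "rendezvous_started"
    RENDEZVOUS_COMPLETED = "rendezvous_completed"
    WORKER_START_STARTED = "worker_start_started"
    WORKER_START_COMPLETED = "worker_start_completed"


class SketchProfiler:
    """Hypothetical stand-in for FaultToleranceProfiler, not the PR's code."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        # Each entry is (cycle number, event, timestamp in seconds).
        self._events: List[Tuple[int, ProfilingEvent, float]] = []

    def record(self, cycle: int, event: ProfilingEvent,
               timestamp: Optional[float] = None) -> None:
        """Append one event; the lock makes concurrent calls safe."""
        ts = time.monotonic() if timestamp is None else timestamp
        with self._lock:
            self._events.append((cycle, event, ts))

    def cycle_metrics(self, cycle: int) -> Dict[str, Optional[float]]:
        """Return durations for one cycle; None where an endpoint is missing."""
        with self._lock:
            # Keep the most recently recorded timestamp per event in this cycle.
            stamps = {ev: ts for c, ev, ts in self._events if c == cycle}

        def elapsed(start: ProfilingEvent, end: ProfilingEvent) -> Optional[float]:
            if start in stamps and end in stamps:
                return stamps[end] - stamps[start]
            return None

        E = ProfilingEvent
        # Assumption: a restart cycle begins at failure detection, while the
        # initial startup cycle (no failure) begins at rendezvous start.
        origin = E.FAILURE_DETECTED if E.FAILURE_DETECTED in stamps else E.RENDEZVOUS_STARTED
        return {
            "failure_to_termination": elapsed(E.FAILURE_DETECTED, E.WORKER_TERMINATED),
            "rendezvous": elapsed(E.RENDEZVOUS_STARTED, E.RENDEZVOUS_COMPLETED),
            "worker_start": elapsed(E.WORKER_START_STARTED, E.WORKER_START_COMPLETED),
            "total_cycle": elapsed(origin, E.WORKER_START_COMPLETED),
        }


# Usage with explicit timestamps so the arithmetic is easy to follow.
profiler = SketchProfiler()
profiler.record(1, ProfilingEvent.FAILURE_DETECTED, 100.0)
profiler.record(1, ProfilingEvent.WORKER_TERMINATED, 102.0)
profiler.record(1, ProfilingEvent.RENDEZVOUS_STARTED, 102.5)
profiler.record(1, ProfilingEvent.RENDEZVOUS_COMPLETED, 105.5)
profiler.record(1, ProfilingEvent.WORKER_START_STARTED, 105.5)
profiler.record(1, ProfilingEvent.WORKER_START_COMPLETED, 110.0)
metrics = profiler.cycle_metrics(1)
```

Keying every event by cycle number keeps the startup cycle and later restart cycles separate, so each one can be summarized on its own.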

src/nvidia_resiliency_ext/fault_tolerance/profiling.py

hexinw-nvidia · 2025-09-15T17:42:57Z

output_example.txt

hexinw-nvidia · 2025-09-15T18:14:13Z

parser_friendly_output.txt

apaithankar · 2025-09-18T00:01:36Z

src/nvidia_resiliency_ext/shared_utils/profiling.py

+from ..shared_utils.log_manager import LogConfig
+
+
+class ProfilingEvent(Enum):


Any reason we are creating a custom profiler instead of using something standard or OneLogger?

hexinw-nvidia · 2025-09-18

We are using the PyTorch metrics system for profiling. NV OneLogger seems to duplicate the functionality of PyTorch metrics. There is no particular reason we went with PyTorch metrics, other than that NVRx is tightly integrated with PyTorch.

apaithankar · 2025-09-18 (edited)

Since Megatron-LM and NeMo both use NV OneLogger, and it seems to be better integrated with other services (log, event, and metrics collection), wouldn't using NV OneLogger make it easier to integrate when we have to create a dashboard in FACT?

Not opposed to using TorchMetrics, just looking at the ecosystem for integration.

PytLab · 2025-09-24T08:57:34Z

src/nvidia_resiliency_ext/shared_utils/profiling.py

+from enum import Enum
+from typing import Optional
+
+from nv_one_logger.api.one_logger_provider import OneLoggerProvider


@hexinw-nvidia does this mean you are relying on the application for instantiation and configuration?

PytLab · 2025-09-24T11:18:37Z

src/nvidia_resiliency_ext/shared_utils/profiling.py

+
+                # Create and record the event
+                event_obj = Event.create(f"ft.{event.value}", attributes)
+                OneLoggerProvider.instance().recorder.event(None, event_obj)


@hexinw-nvidia Note that in the nv-one-logger API design, an event needs to belong to a span, e.g., the application span or a custom span. Recorder.start can create one, and the recorder will record its start and end timestamps.
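
The span and event relationship described in this comment can be illustrated with a toy model in plain Python. It does not use or mirror the nv-one-logger API: `ToyRecorder`, `ToySpan`, `ToyEvent`, and the span and event names are hypothetical, and rejecting a missing span with `ValueError` is a choice made for this sketch only.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class ToyEvent:
    name: str
    attributes: Dict[str, object]
    timestamp: float


@dataclass
class ToySpan:
    name: str
    start_time: float
    end_time: Optional[float] = None
    events: List[ToyEvent] = field(default_factory=list)


class ToyRecorder:
    """Toy recorder: every event is attached to an open span."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self.spans: List[ToySpan] = []

    def start(self, name: str) -> ToySpan:
        # Creating the span records its start timestamp.
        span = ToySpan(name, self._clock())
        self.spans.append(span)
        return span

    def event(self, span: Optional[ToySpan], name: str,
              attributes: Optional[Dict[str, object]] = None) -> ToyEvent:
        # In this toy model an event without a span is rejected outright.
        if span is None:
            raise ValueError("an event must belong to a span")
        if span.end_time is not None:
            raise ValueError("cannot add an event to a finished span")
        ev = ToyEvent(name, dict(attributes or {}), self._clock())
        span.events.append(ev)
        return ev

    def stop(self, span: ToySpan) -> None:
        # Stopping the span records its end timestamp.
        span.end_time = self._clock()


# Usage with a fake clock so the timestamps are deterministic.
ticks = iter([0.0, 1.0, 3.0])
recorder = ToyRecorder(clock=lambda: next(ticks))
span = recorder.start("ft.restart_cycle")
recorder.event(span, "ft.rendezvous_started", {"cycle": 1})
recorder.stop(span)
```

With this shape, the span carries the start and end timestamps, and each profiling event only needs its own timestamp and attributes.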

hexinw-nvidia added 2 commits September 13, 2025 20:55

Print cycle number in event.

ce63ab5

hexinw-nvidia requested review from namitdhameja, rhewett-nv, apaithankar and sbak5 September 14, 2025 17:48

hexinw-nvidia added the ci-approved Approved to run CI label Sep 14, 2025

rhewett-nv reviewed Sep 15, 2025

src/nvidia_resiliency_ext/fault_tolerance/profiling.py (outdated, resolved)

hexinw-nvidia and others added 3 commits September 15, 2025 10:47

Moved profiling.py to shared_utils/.

6eca263

Format the profiling output.

9e4efc0

Merge branch 'main' into profiling

c50b600

rhewett-nv self-requested a review September 15, 2025 18:25

rhewett-nv previously approved these changes Sep 15, 2025

Run log_profiling_summary as soon as we received the signal.

c6294f7

hexinw-nvidia dismissed rhewett-nv’s stale review via c6294f7 September 16, 2025 21:16

hexinw-nvidia and others added 5 commits September 16, 2025 14:42

Dump profiling on every restart cycle.

9d89999

Added cycle number in the event.

04168a8

Merge branch 'main' into profiling

4dcb8fe

Refactor.

a17991d

.

3b97756

apaithankar reviewed Sep 18, 2025

hexinw-nvidia and others added 4 commits September 22, 2025 17:11

OneLogger integration.

9bdd85d

Added nv one logger dependency

cd191b6

Lint fix.

5949dac

Merge branch 'main' into profiling

60a08d2

PytLab reviewed Sep 24, 2025

Removed unused variable.

6c017c2

Used rendezvous node as node_id in profiling event, enabling profiling multi-participants on one node.

b137a4e

rhewett-nv approved these changes Sep 30, 2025

sbak5 approved these changes Sep 30, 2025

Merge branch 'main' into profiling

119235b

hexinw-nvidia merged commit 4dce778 into NVIDIA:main Sep 30, 2025
9 checks passed
