Conversation

@cms42 cms42 commented Aug 19, 2025

Problem

The straggler detector's Etpt (estimated throughput) metric grew linearly with iteration count, reaching physically impossible values such as 147,000+ TF/s (implying 147,000%+ MFU).

Before (broken):

MxEtpt/Rnk: 1004.74TF/10
MxEtpt/Rnk: 1517.64TF/2
// ...
MxEtpt/Rnk: 145698.82TF/8
MxEtpt/Rnk: 147815.35TF/5

Root Cause

In post_training_step_callbacks(), the code set the FLOPs counter to 0.0 after reporting, but Python floats are immutable, so assigning to the parameter only rebound the local name: the reset never reached the caller, and the training loop kept feeding the ever-growing accumulator into each report.

def post_training_step_callbacks(
    model,
    optimizer,
    opt_param_scheduler,
    iteration,
    prof,
    num_floating_point_operations_since_last_log_event,
):
    """Run all post-training-step functions (e.g., FT heartbeats, GC)."""
    args = get_args()
    # Bring CPU and GPU back in sync if on right iteration.
    if args.train_sync_interval and iteration % args.train_sync_interval == 0:
        torch.cuda.synchronize()
    # Straggler detector.
    if iteration % args.log_interval == 0 and args.log_straggler:
        stimer.report(num_floating_point_operations_since_last_log_event, args.log_interval)
        # BUG: floats are immutable, so this only rebinds the local name;
        # the caller's accumulator is never reset.
        num_floating_point_operations_since_last_log_event = 0.0
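
The same effect can be reproduced in isolation. A minimal standalone sketch (hypothetical names, not from the PR) of the Python semantics at play:

def consume_and_reset(counter):
    """Assigning to a float parameter rebinds only the local name;
    the caller's variable is untouched."""
    print(f"reporting {counter} FLOPs")
    counter = 0.0  # local rebinding only; lost on return

flops = 1.0e12
consume_and_reset(flops)
print(flops)  # still 1e12: the "reset" never reached the caller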

Solution

  • Modified post_training_step_callbacks() to return the updated FLOPs counter
  • Updated the call site to capture and use the returned (reset) counter
  • Added comments explaining the reset behavior (see the sketch after this list)
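
A minimal sketch of the shape of the change (not the exact merged diff; surrounding training-loop code is abbreviated):

def post_training_step_callbacks(
    model,
    optimizer,
    opt_param_scheduler,
    iteration,
    prof,
    num_floating_point_operations_since_last_log_event,
):
    """Run all post-training-step functions (e.g., FT heartbeats, GC)."""
    args = get_args()
    if args.train_sync_interval and iteration % args.train_sync_interval == 0:
        torch.cuda.synchronize()
    if iteration % args.log_interval == 0 and args.log_straggler:
        stimer.report(num_floating_point_operations_since_last_log_event, args.log_interval)
        # Reset the window so Etpt reflects only the last log_interval
        # iterations; the caller must adopt the returned value for the
        # reset to take effect.
        num_floating_point_operations_since_last_log_event = 0.0
    # Return the (possibly reset) counter so the caller can persist it.
    return num_floating_point_operations_since_last_log_event

and at the call site in the training loop, the caller captures the returned value:

num_floating_point_operations_since_last_log_event = post_training_step_callbacks(
    model,
    optimizer,
    opt_param_scheduler,
    iteration,
    prof,
    num_floating_point_operations_since_last_log_event,
)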

Verification

After:

MnEtpt/Rnk: 449.14TF/14 | MxEtpt/Rnk: 489.77TF/0
MnEtpt/Rnk: 458.21TF/12 | MxEtpt/Rnk: 478.66TF/14
MnEtpt/Rnk: 451.58TF/3 | MxEtpt/Rnk: 481.39TF/12


@Li-dongyang commented

Hi @sbhavani, could you please take a look at this PR when you get a chance? Thank you!

cms42 commented Aug 25, 2025

ping @jaredcasper @jon-barker

This would take just a few minutes 🙏

sbhavani (Collaborator) commented

Thanks @cms42! I've created an MR internally and will get back to you soon.

sbhavani added the "bug" label Sep 9, 2025