Conversation

@cms42 cms42 commented Aug 19, 2025

Problem

The straggler detector's Etpt (estimated throughput) metric grew linearly with iteration count, reaching physically impossible values such as 147,000+ TF/s (implying 147,000%+ MFU).

Before (broken):

MxEtpt/Rnk: 1004.74TF/10
MxEtpt/Rnk: 1517.64TF/2
// ...
MxEtpt/Rnk: 145698.82TF/8
MxEtpt/Rnk: 147815.35TF/5

Root Cause

In post_training_step_callbacks(), the code set the FLOPs counter to 0.0 after reporting, but Python floats are immutable, so assigning to the parameter only rebound the local name: the reset never reached the caller, and the training loop kept feeding the ever-growing accumulator into each report.

def post_training_step_callbacks(
    model,
    optimizer,
    opt_param_scheduler,
    iteration,
    prof,
    num_floating_point_operations_since_last_log_event,
):
    """Run all post-training-step functions (e.g., FT heartbeats, GC)."""
    args = get_args()
    # Bring CPU and GPU back in sync if on right iteration.
    if args.train_sync_interval and iteration % args.train_sync_interval == 0:
        torch.cuda.synchronize()
    # Straggler detector.
    if iteration % args.log_interval == 0 and args.log_straggler:
        stimer.report(num_floating_point_operations_since_last_log_event, args.log_interval)
        # BUG: floats are immutable, so this only rebinds the local name;
        # the caller's accumulator is never reset.
        num_floating_point_operations_since_last_log_event = 0.0
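
The same effect can be reproduced in isolation. A minimal standalone sketch (hypothetical names, not from the PR) of the Python semantics at play:

def consume_and_reset(counter):
    """Assigning to a float parameter rebinds only the local name;
    the caller's variable is untouched."""
    print(f"reporting {counter} FLOPs")
    counter = 0.0  # local rebinding only; lost on return

flops = 1.0e12
consume_and_reset(flops)
print(flops)  # still 1e12: the "reset" never reached the caller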

Solution

  • Modified post_training_step_callbacks() to return the updated FLOPs counter
  • Updated the call site to capture and use the returned (reset) counter
  • Added comments explaining the reset behavior (see the sketch after this list)
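
A minimal sketch of the shape of the change (not the exact merged diff; surrounding training-loop code is abbreviated):

def post_training_step_callbacks(
    model,
    optimizer,
    opt_param_scheduler,
    iteration,
    prof,
    num_floating_point_operations_since_last_log_event,
):
    """Run all post-training-step functions (e.g., FT heartbeats, GC)."""
    args = get_args()
    if args.train_sync_interval and iteration % args.train_sync_interval == 0:
        torch.cuda.synchronize()
    if iteration % args.log_interval == 0 and args.log_straggler:
        stimer.report(num_floating_point_operations_since_last_log_event, args.log_interval)
        # Reset the window so Etpt reflects only the last log_interval
        # iterations; the caller must adopt the returned value for the
        # reset to take effect.
        num_floating_point_operations_since_last_log_event = 0.0
    # Return the (possibly reset) counter so the caller can persist it.
    return num_floating_point_operations_since_last_log_event

and at the call site in the training loop, the caller captures the returned value:

num_floating_point_operations_since_last_log_event = post_training_step_callbacks(
    model,
    optimizer,
    opt_param_scheduler,
    iteration,
    prof,
    num_floating_point_operations_since_last_log_event,
)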

Verification

After:

MnEtpt/Rnk: 449.14TF/14 | MxEtpt/Rnk: 489.77TF/0
MnEtpt/Rnk: 458.21TF/12 | MxEtpt/Rnk: 478.66TF/14
MnEtpt/Rnk: 451.58TF/3 | MxEtpt/Rnk: 481.39TF/12


@Li-dongyang commented

Hi @sbhavani, could you please take a look at this PR when you get a chance? Thank you!

cms42 commented Aug 25, 2025

ping @jaredcasper @jon-barker

This would take just a few minutes 🙏

sbhavani (Collaborator) commented

Thanks @cms42! I've created an MR internally and will get back to you soon.

sbhavani added the "bug" label Sep 9, 2025