fix a corner-case bug in memory snapshot uploading #3504

TroyGarden · 2025-11-02T16:27:04Z

Summary:
Fixed two corner case issues in the TorchRec benchmark utilities:

1. **Memory snapshot handling**: Added rank filtering for memory snapshot operations so that they run only on rank 0 or when `all_rank_traces` is enabled. This prevents redundant memory snapshots from being taken on all ranks, reducing overhead and storage requirements while still capturing the necessary profiling data.
2. **Shell script robustness**: Added file existence checks before loop iterations in the trace upload script. Previously, if no trace files or memory snapshot files were found, the script would fail silently or produce errors. Now it checks with `ls` first and proceeds with the loop only if files exist, preventing issues when the trace directory is empty or files don't match the expected patterns.
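The rank filter described in the first item can be sketched roughly as follows. This is a minimal illustration, not the code from the diff: the helper names `should_capture_memory_snapshot` and `maybe_dump_snapshot`, and the `dump_fn` callback, are hypothetical; only the `all_rank_traces` flag and the "rank 0 or all ranks" rule come from the summary above.

```python
from typing import Callable


def should_capture_memory_snapshot(rank: int, all_rank_traces: bool) -> bool:
    """Return True when this rank should take a memory snapshot.

    Rule taken from the summary: snapshot on rank 0 only, unless
    all_rank_traces is enabled, in which case every rank snapshots.
    """
    return all_rank_traces or rank == 0


def maybe_dump_snapshot(
    rank: int,
    all_rank_traces: bool,
    dump_fn: Callable[[int], None],
) -> bool:
    """Call dump_fn(rank) only on ranks that pass the filter.

    dump_fn is a hypothetical callback standing in for whatever the
    benchmark utility actually does to write the snapshot to disk.
    Returns True if a snapshot was taken, False if it was skipped.
    """
    if not should_capture_memory_snapshot(rank, all_rank_traces):
        return False
    dump_fn(rank)
    return True
```

With `all_rank_traces=False`, only the rank 0 process invokes `dump_fn`, so an N-rank run writes one snapshot file instead of N.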

Differential Revision: D86051540

meta-codesync · 2025-11-02T16:27:14Z

@TroyGarden has exported this pull request. If you are a Meta employee, you can view the originating Diff in D86051540.

Summary:
# context
* loglevel is not set correctly in the train pipeline benchmark because of the multiprocess setup
* the log level is set only in the main process, not in the forked/spawned processes
* this diff adds the `loglevel` argument to the RunConfig so that every runner function can call `set_logger_level`
* it also passes the error message through directly on YAML or JSON parser failure; previously the failure only produced a warning, which was buried in lengthy logs
* with loglevel=info we can now see the planner info: P2014482201

Differential Revision: D85829837
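A rough sketch of the pattern this commit message describes: the log level travels inside the config object, and each runner applies it on entry, so the setting also takes effect in spawned worker processes. The names `RunConfig`, `loglevel`, and `set_logger_level` come from the message above, but the fields, signatures, and bodies shown here are assumptions for illustration and not TorchRec's actual definitions.

```python
import logging
from dataclasses import dataclass


@dataclass
class RunConfig:
    # Hypothetical stand-in for the benchmark's RunConfig; only the
    # loglevel field is taken from the commit message.
    world_size: int = 2
    loglevel: str = "info"


def set_logger_level(level: str) -> None:
    """Set the root logger level from a name such as 'info' or 'debug'."""
    numeric = getattr(logging, level.upper(), None)
    if not isinstance(numeric, int):
        raise ValueError(f"unknown log level: {level!r}")
    logging.getLogger().setLevel(numeric)


def runner(rank: int, config: RunConfig) -> int:
    """Per-process entry point (assumed shape).

    A spawned process starts with default logging state, so the level
    is applied here, inside the worker, rather than only in the parent.
    Returns the effective root level so the effect can be checked.
    """
    set_logger_level(config.loglevel)
    logging.getLogger(__name__).info("rank %d started", rank)
    return logging.getLogger().getEffectiveLevel()
```

Because `RunConfig` is a plain dataclass, it can be pickled and passed as an argument to each worker, which is what lets every process see the same `loglevel`.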


meta-cla bot added the CLA Signed label Nov 2, 2025

meta-codesync bot added the fb-exported and meta-exported labels Nov 2, 2025

TroyGarden added 2 commits November 2, 2025 11:44

TroyGarden force-pushed the export-D86051540 branch from 11c75e7 to d2ce78f November 2, 2025 19:44

meta-codesync bot closed this in 2f4b794 Nov 3, 2025

TroyGarden deleted the export-D86051540 branch November 3, 2025 00:37
