@TroyGarden
Contributor

Summary:
Fixed two corner case issues in the TorchRec benchmark utilities:

1. **Memory snapshot handling**: Added rank filtering for memory snapshot operations so they run only on rank 0 or when `all_rank_traces` is enabled. This prevents redundant memory snapshots from being taken on all ranks, reducing overhead and storage requirements while still capturing the necessary profiling data.

2. **Shell script robustness**: Added file existence checks before loop iterations in the trace upload script. Previously, if no trace files or memory snapshot files were found, the script would fail silently or produce errors. Now it checks with `ls` first and only proceeds with the loop if files exist, preventing issues when the trace directory is empty or files don't match the expected patterns.
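
For reviewers who want the rank filter from item 1 spelled out, here is a minimal sketch of the condition. It is not the code from this diff: `should_capture_snapshot` and `ranks_capturing` are hypothetical helper names, and only the `all_rank_traces` flag and the rank 0 rule come from the summary above.

```python
from typing import List


def should_capture_snapshot(rank: int, all_rank_traces: bool) -> bool:
    """Hypothetical helper: True when this rank should record a memory snapshot."""
    # Rank 0 always captures; other ranks capture only when all_rank_traces is set.
    return all_rank_traces or rank == 0


def ranks_capturing(world_size: int, all_rank_traces: bool) -> List[int]:
    """List the ranks that would capture a snapshot for a given world size."""
    return [r for r in range(world_size) if should_capture_snapshot(r, all_rank_traces)]
```

With the flag off, only rank 0 passes the check, so an N-rank run writes one snapshot instead of N.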

Differential Revision: D86051540

@meta-cla meta-cla bot added the CLA Signed label Nov 2, 2025
@meta-codesync
Contributor

meta-codesync bot commented Nov 2, 2025

@TroyGarden has exported this pull request. If you are a Meta employee, you can view the originating Diff in D86051540.

Summary:

# context
* `loglevel` is not set correctly in the train pipeline benchmark because of the multiprocess setup
* the log level is set only in the main process, not in the forked/spawned processes
* this diff adds the `loglevel` argument to `RunConfig` so that every runner function can call `set_logger_level`
* it also passes the error message through directly on YAML or JSON parser failure; previously the failure only produced a warning that was buried in lengthy logs
* with loglevel=info we can now see the planner info: P2014482201
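
A rough sketch of the pattern described in the bullets above, assuming a simplified setup: the `RunConfig` here is a stand-in dataclass with made-up fields, and this `set_logger_level` is a hypothetical stdlib-only version, not the TorchRec helper. The point is that the level travels inside the config and is applied inside each runner, because a spawned process starts with default logging state.

```python
import logging
import multiprocessing as mp
from dataclasses import dataclass


@dataclass
class RunConfig:
    """Stand-in config; the real RunConfig has other fields (assumption)."""
    world_size: int = 2
    loglevel: str = "warning"


def set_logger_level(level_name: str) -> int:
    """Hypothetical helper: set the root logger level from a name such as 'info'."""
    level = getattr(logging, level_name.upper(), None)
    if not isinstance(level, int):
        raise ValueError(f"unknown log level: {level_name!r}")
    logging.getLogger().setLevel(level)
    return level


def runner(rank: int, config: RunConfig) -> None:
    # Executed inside each worker process, so the level is applied there too.
    set_logger_level(config.loglevel)
    logging.getLogger("benchmark").info("rank %d started", rank)


def launch(config: RunConfig) -> None:
    # Spawned workers do not inherit the parent's logger configuration,
    # so the level is carried in the pickled config instead.
    ctx = mp.get_context("spawn")
    procs = [ctx.Process(target=runner, args=(rank, config)) for rank in range(config.world_size)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```

Calling `launch(RunConfig(loglevel="info"))` from a script's `if __name__ == "__main__":` guard would start one worker per rank, each setting its own level before logging.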

Differential Revision: D85829837
@meta-codesync meta-codesync bot closed this in 2f4b794 Nov 3, 2025
@TroyGarden TroyGarden deleted the export-D86051540 branch November 3, 2025 00:37
Labels

CLA Signed, fb-exported, meta-exported
