Commit 204fbb3

c-p-i-o authored and pobin6 committed
[c10d][fr] wait counter for dump function (pytorch#140823)
Summary: Add a wait counter for the dump function. This is useful for seeing whether a particular job gets stuck in the dump function and never returns.

Test Plan: Tested locally and saw `pytorch.wait_counter.NCCLTraceBuffer__dump.busy_time_us.sum.60` in ODS.

Differential Revision: D65823433

Pull Request resolved: pytorch#140823
Approved by: https://github.com/fduwjj
1 parent 9ea2293 commit 204fbb3

1 file changed: +2 -0 lines changed

torch/csrc/distributed/c10d/NCCLUtils.cpp

Lines changed: 2 additions & 0 deletions
@@ -2,6 +2,7 @@
 #include <torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp>
 #include <torch/csrc/distributed/c10d/control_plane/Handlers.hpp>
 
+#include <c10/util/WaitCounter.h>
 #include <c10/util/env.h>
 #include <fstream>
 
@@ -843,6 +844,7 @@ std::string NCCLTraceBuffer::dump(
     bool includeCollectives,
     bool includeStackTraces,
     bool onlyActive) {
+  STATIC_SCOPED_WAIT_COUNTER(pytorch.wait_counter.NCCLTraceBuffer__dump);
   auto result = new_dict();
   // common values
   result.insert(version_key, version_val);
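
For readers unfamiliar with the pattern, the idea of a scoped wait counter can be sketched as a small RAII guard that starts timing when a function is entered and records the elapsed time when the scope exits. The sketch below is a standalone illustration only: `BusyCounter`, `ScopedBusyTimer`, and `slow_dump` are hypothetical names, and none of this is the actual `c10/util/WaitCounter.h` implementation or the expansion of `STATIC_SCOPED_WAIT_COUNTER`.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <thread>

// Hypothetical stand-in for a process-wide counter. This is an assumption
// made for illustration, not the c10 WaitCounter API.
struct BusyCounter {
  std::atomic<std::uint64_t> busy_time_us{0}; // total time spent in guarded scopes
  std::atomic<std::uint64_t> in_flight{0};    // scopes entered but not yet exited
};

// RAII guard: the constructor marks entry, the destructor records elapsed time.
class ScopedBusyTimer {
 public:
  explicit ScopedBusyTimer(BusyCounter& counter)
      : counter_(counter), start_(std::chrono::steady_clock::now()) {
    counter_.in_flight.fetch_add(1, std::memory_order_relaxed);
  }

  ~ScopedBusyTimer() {
    const auto elapsed = std::chrono::duration_cast<std::chrono::microseconds>(
        std::chrono::steady_clock::now() - start_);
    counter_.busy_time_us.fetch_add(
        static_cast<std::uint64_t>(elapsed.count()), std::memory_order_relaxed);
    counter_.in_flight.fetch_sub(1, std::memory_order_relaxed);
  }

  ScopedBusyTimer(const ScopedBusyTimer&) = delete;
  ScopedBusyTimer& operator=(const ScopedBusyTimer&) = delete;

 private:
  BusyCounter& counter_;
  std::chrono::steady_clock::time_point start_;
};

// Hypothetical function standing in for a potentially slow dump routine.
// Returns the in-flight count observed while the guard is alive.
std::uint64_t slow_dump(BusyCounter& counter) {
  ScopedBusyTimer guard(counter);
  std::this_thread::sleep_for(std::chrono::milliseconds(2));
  return counter.in_flight.load(std::memory_order_relaxed);
}
```

In this sketch, a call that never returns would leave `in_flight` above zero and never add to `busy_time_us`, which is the kind of signal that makes a stuck function visible from the outside. How the real counter reports a call that is still in progress is not shown in this commit.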
