Skip to content

Conversation

AtlantaPepsi
Copy link
Contributor

Details

Do not mention proprietary info or link to internal work items in this PR.

Work item: "Internal", or link to GitHub issue (if applicable).

What were the changes?
One sentence describing the work done.

Why were the changes made?
Explain the motivation behind the work. Provide any publicly-available historical context.

How was the outcome achieved?
Technical details behind the work. Explain any publicly-available hardware peculiarities.

Additional Documentation:
What else should the reviewer know?

Approval Checklist

Do not approve until these items are satisfied.

  • Verify the CHANGELOG has been updated, if
    • there are any NCCL API version changes,
    • any changes impact library users, and/or
    • any changes impact any other ROCm library.

@AtlantaPepsi AtlantaPepsi requested a review from a team as a code owner August 20, 2025 01:10
@nileshnegi nileshnegi added the ci:docs-only Skip most non-docs CI checks for this PR label Aug 20, 2025
Copy link
Contributor

@amd-jnovotny amd-jnovotny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realize this is buried fairly deep in the tree and is not part of the doc set, so I didn't do a full copy edit. But I did want to point out a couple of fairly important things that I noticed upon a quick skim. Also, we don't cover this topic in the RCCL docs right now. Should it be included?

Depending on the MPI library used and your installation path, you may need to set the MPI_DIR path accordingly.

# Structured Logging
As part of the efforts to enhance RCCL Replayer functionality, and per Meta's request, RCCL now provides detailed logging of API calls.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't mention Meta (or another non-AMD company) in the documentation.


## Usage
* Structured logging is a built-in module of RCCL source. For RCCL library in ROCm release, it's estimated to be present starting from ROCm 7.0. To enable structured logging, point LD_LIBRARY_PATH to supporting RCCL library, then run with environment variable `RCCL_REPLAY_FILE="${filename}"`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's best to give a precise release where support begins. If not known, don't mention it.

```

Replace <numProcesses> with the number of MPI processes you want to run during the replay, </path/to/logfile> with the path to the collective log file generated during your RCCL runs, and <numGpusPerMpiRank> with the number of GPUs per MPI rank used in your application.
<!---We try to register and flush logging information at the beginning of a function, lest it never completes before termination/hanging of the program. **However**, many RCCL routines, such as communicator creation, user buffer registration, etc. will have pointers for returned handles. We record those value as well, but at the end of the routine, therefore these calls may not be logged in face of deadlock or error.--->
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lest - typo? it never completes...


Depending on the MPI library you use, you may need to modify the mpirun command accordingly.
<!---Please interpret the parameters with a grain of salt. They are logged exactly as they are used, by user or by NCCL internal implementations. For instance, `ncclSend` entries will always have a null sendbuff but a valid "recfbuff" in the log, as `ncclSend` under the hood always fills the send buffer into the recv buffer field of `ncclInfo` that is enqueued.--->
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo recfbuff -> ... but a valid "recvbuff" in the log...?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ci:docs-only Skip most non-docs CI checks for this PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants