Update RCCL Replayer README.md #1870
base: develop
I realize this is buried fairly deep in the tree and is not part of the doc set, so I didn't do a full copy edit. But I did want to point out a couple of fairly important things that I noticed upon a quick skim. Also, we don't cover this topic in the RCCL docs right now. Should it be included?
Depending on the MPI library used and your installation path, you may need to set the MPI_DIR path accordingly.

# Structured Logging
As part of the efforts to enhance RCCL Replayer functionality, and per Meta's request, RCCL now provides detailed logging of API calls.
Don't mention Meta (or another non-AMD company) in the documentation.

## Usage
* Structured logging is a built-in module of RCCL source. For RCCL library in ROCm release, it's estimated to be present starting from ROCm 7.0. To enable structured logging, point LD_LIBRARY_PATH to supporting RCCL library, then run with environment variable `RCCL_REPLAY_FILE="${filename}"`.
It's best to give a precise release where support begins. If not known, don't mention it.
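For readers following the usage bullet quoted above, here is a minimal sketch of how a launcher script might prepare the environment it describes. Only `LD_LIBRARY_PATH` and `RCCL_REPLAY_FILE` come from the diff; the helper name `structured_logging_env`, its parameters, and the example paths are hypothetical and chosen for illustration only.

```python
import os


def structured_logging_env(rccl_lib_dir, replay_file, base_env=None):
    """Return a copy of the environment with structured logging enabled.

    rccl_lib_dir: directory holding an RCCL build that supports structured
                  logging (hypothetical parameter name).
    replay_file:  value to place in RCCL_REPLAY_FILE, as named in the diff.
    base_env:     environment to start from; defaults to os.environ.
    """
    env = dict(os.environ if base_env is None else base_env)

    # Prepend the RCCL directory so the loader finds it before other copies.
    existing = env.get("LD_LIBRARY_PATH", "")
    env["LD_LIBRARY_PATH"] = (
        rccl_lib_dir + os.pathsep + existing if existing else rccl_lib_dir
    )

    # The diff states that this variable turns on structured logging.
    env["RCCL_REPLAY_FILE"] = replay_file
    return env
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when launching the application. The caller's own environment is left unmodified.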
``` | ||

Replace <numProcesses> with the number of MPI processes you want to run during the replay, </path/to/logfile> with the path to the collective log file generated during your RCCL runs, and <numGpusPerMpiRank> with the number of GPUs per MPI rank used in your application.
<!---We try to register and flush logging information at the beginning of a function, lest it never completes before termination/hanging of the program. **However**, many RCCL routines, such as communicator creation, user buffer registration, etc. will have pointers for returned handles. We record those value as well, but at the end of the routine, therefore these calls may not be logged in face of deadlock or error.--->
"lest" - typo? "it never completes..."

Depending on the MPI library you use, you may need to modify the mpirun command accordingly.
<!---Please interpret the parameters with a grain of salt. They are logged exactly as they are used, by user or by NCCL internal implementations. For instance, `ncclSend` entries will always have a null sendbuff but a valid "recfbuff" in the log, as `ncclSend` under the hood always fills the send buffer into the recv buffer field of `ncclInfo` that is enqueued.--->
Typo: "recfbuff" -> ... but a valid "recvbuff" in the log...?
Details
Do not mention proprietary info or link to internal work items in this PR.
Work item: "Internal", or link to GitHub issue (if applicable).
What were the changes?
One sentence describing the work done.
Why were the changes made?
Explain the motivation behind the work. Provide any publicly-available historical context.
How was the outcome achieved?
Technical details behind the work. Explain any publicly-available hardware peculiarities.
Additional Documentation:
What else should the reviewer know?
Approval Checklist
Do not approve until these items are satisfied.