libfabric: fix asymmetrical rails for heterogeneous node types. #908
👋 Hi fengjica! Thank you for contributing to ai-dynamo/nixl. Your PR reviewers will review your contribution and then trigger the CI to test your changes. 🚀
The logic looks good to me. I just have some cosmetic comments.
This change:
- passes the number of rail endpoints to peers via serialization,
- adds all remote endpoints for each local rail to the addresses,
- picks local rails and remote endpoints round-robin for data transfer.

Signed-off-by: Feng Ji <[email protected]>
What?
Fix the code's current assumption that the rail configuration is identical across nodes.
Why?
Users intend to use AWS p5en and p6 nodes together.
How?
See changes described above.
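For readers who want a concrete picture, here is a minimal sketch of how round-robin selection might behave when the two sides have different rail counts. It is not the code from this PR: `RailPair`, `pick_round_robin`, and `plan_transfers` are hypothetical names, and the counts in the usage note are made up for illustration. The sketch assumes each local rail already holds addresses for every remote endpoint, so any (local, remote) pairing is valid.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical types and names for illustration only; this is not the
// NIXL libfabric plugin's API.
struct RailPair {
    std::size_t local_rail;       // index into this node's rails
    std::size_t remote_endpoint;  // index into the peer's advertised endpoints
};

// Pick the pair for the n-th transfer. Each index wraps around its own
// count, so the local and remote sides may have different sizes.
inline RailPair pick_round_robin(std::size_t n,
                                 std::size_t num_local,
                                 std::size_t num_remote) {
    assert(num_local > 0 && num_remote > 0);
    return RailPair{n % num_local, n % num_remote};
}

// Build the pairing for the first `count` transfers.
inline std::vector<RailPair> plan_transfers(std::size_t count,
                                            std::size_t num_local,
                                            std::size_t num_remote) {
    std::vector<RailPair> plan;
    plan.reserve(count);
    for (std::size_t n = 0; n < count; ++n) {
        plan.push_back(pick_round_robin(n, num_local, num_remote));
    }
    return plan;
}
```

With a hypothetical 4 local rails and 2 remote endpoints, `plan_transfers(4, 4, 2)` visits every local rail once and alternates between the two remote endpoints. The remote count is the value the peer would advertise through serialization, as the first bullet of the commit message describes.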