ULFM: Support for intercomm [i]agree and [i]shrink #12384
@bosilca @abouteiller I need your review here. I did my best to write self-contained commits to ease review. The topmost commit is a log of consistency fixes for error branches. I have doubts about how If things are done wrong, the comm request freelist may end up with duplicated entries, and that would explain why I was getting double-release errors in debug mode from
Yep, that's the culprit.
fa03ac7 to e5235ee
mpi4py failure... this one is unrelated to this PR, and already reported in #12367
The failure I got is not reproducible; everything is fine after rerunning. I am pulling this PR out of the draft state.
@abouteiller I've implemented your suggestion where you asked and ported it to another place. Could you please quickly re-review so that I can move this PR forward?
6c13268 to 627970a
I got the following error running the ulfm-testing test harness; I'll need a couple more days to investigate this.
Oh boy... that looks like the intercomm create issue that has been popping up elsewhere.
627970a to 3c436cb
@bosilca @jsquyres If a barrier-like synchronization helps fix the intercomm issues, why not just add the workaround and investigate it separately? Otherwise the problem will keep haunting everyone ad aeternum. Or is someone actively working on fixing this issue? Would a PR that keeps testing working for others be accepted and merged quickly?
Yes, that error is also present in main, not specific to this PR. I'll try to dig into it a bit to see what the root cause is. Has the intercomm creation error also happened with --with-ft=no and without fault injection?
When you ask about
191ad44 to b120f0c
@abouteiller I fixed up your commit to keep history clean, then I rebased onto the latest main. My CI testing still fails[link], but IMHO these failures are unrelated to this PR. @bosilca I'm not sure what the status of the intercommunicator deadlock issues is. I'm thinking of manually adding
b120f0c to b630434
After adding a
b630434 to 570c8df
@dalcinl you may want to rebase this PR on main now. I tested it (using a modified test_ulfm.py) and it seems to work consistently now.
I also checked current main without the commits in this PR, and test_ulfm.py still fails. It passes for me with these commits added.
For intercommunicators, passing MPI_IN_PLACE to allreduce is invalid. Signed-off-by: Lisandro Dalcin <[email protected]>
Co-authored-by: Aurelien Bouteiller <[email protected]> Signed-off-by: Lisandro Dalcin <[email protected]>
570c8df to db35fad
@hppritcha Thanks. Indeed, after updating this PR and enabling all ULFM tests in mpi4py, everything is now OK. Full testsuite run here: https://github.com/mpi4py/mpi4py-testing/actions/runs/8734090801 @abouteiller @bosilca This PR is now ready. Can you please give it a final look and eventually merge?
Signed-off-by: Lisandro Dalcin <[email protected]>
db35fad to b01e156
@bosilca I've updated this PR as per your recommendations. Note however that I've added a few more Please double check I didn't mess things up.
larger scale test failed, need further study
Thanks @dalcinl for sticking with us on this.
@bosilca Unfortunately, I'll have to keep sticking on this... The changes you objected to about the calling order of If you look at the implementation of
I've opened #12480 as a follow-up. BTW, how do we get all these fixes into branch v5.0.x?
This PR is a bit of cheating, as it implements ULFM for intercomm [i]agree and [i]shrink using the non-ft implementation of allreduce. There was already a precedent for such an approach, and it is arguably better for the thing to work in the non-failure path rather than segfaulting badly, as currently happens.
Refs. #12260 and mpi4py tests.
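For readers less familiar with intercommunicator collectives, here is a toy model in plain Python (no MPI involved, and not Open MPI code) of what an allreduce with a bitwise AND does on an intercommunicator according to the MPI standard: every process receives the reduction of the values contributed by the remote group. The function name `intercomm_allreduce_band` and the list-based representation of the two groups are hypothetical and only serve the illustration. Since the send data comes from the local group and the result comes from the remote group, the sketch also hints at why `MPI_IN_PLACE` makes no sense for this call.

```python
from functools import reduce
from operator import and_


def intercomm_allreduce_band(group_a, group_b):
    """Toy model of an allreduce with bitwise AND on an intercommunicator.

    group_a and group_b are lists of integer flags, one entry per process
    of each group. Returns the values received by the processes of group A
    and by the processes of group B, in that order.
    """
    if not group_a or not group_b:
        raise ValueError("both groups of an intercommunicator must be non-empty")
    band_a = reduce(and_, group_a)  # reduction over group A's contributions
    band_b = reduce(and_, group_b)  # reduction over group B's contributions
    # Each group receives the reduction of the REMOTE group's values.
    return [band_b] * len(group_a), [band_a] * len(group_b)


# Group A has two processes, group B has three.
recv_a, recv_b = intercomm_allreduce_band([0b111, 0b101], [0b110, 0b011, 0b111])
print(recv_a, recv_b)
```

Note that this models only the non-failure path, which is the path this PR relies on.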