
ULFM: Support for intercomm [i]agree and [i]shrink #12384

Merged · 3 commits · Apr 19, 2024

Conversation

@dalcinl (Contributor) commented Feb 29, 2024

This PR is a bit of a cheat: it implements ULFM intercomm [i]agree and [i]shrink using the non-FT implementation of allreduce. There is already precedent for this approach, and it is arguably better for these calls to work on the failure-free path than to segfault badly, as currently happens.

Refs. #12260 and mpi4py tests.

@dalcinl (Contributor, Author) commented Feb 29, 2024

@bosilca @abouteiller I need your review here. I did my best to write self-contained commits to ease review. The topmost commit is a batch of consistency fixes for error branches.

I have doubts about how ompi_comm_request_t should be used, in particular whether ompi_comm_request_return() should be called in case of errors. The FT code had a lot of these request-return calls, but some were missing, so I added them. HOWEVER, in the comm_cid code, ompi_comm_request_return() is generally not used after a request has been started and put in the progress loop. The asymmetry is disturbing, but I'm afraid I don't know the OMPI progress engine well enough to figure out the right pattern to follow.

If things are done wrong, the comm request freelist may end up with duplicate entries, and that would explain why I was getting double-release errors from OBJ_RELEASE in debug mode.
Something weird is still going on when using MPI_Comm_shrink (only with the blocking variant) on an intercomm and OMPI_MCA_mpi_ft_enable=1. Sometimes I get a deadlock at the nextcid machinery. Was this the function giving us trouble with high-stress spawn?

@rhc54 (Contributor) commented Feb 29, 2024

Sometimes I get a deadlock at the nextcid machinery. Was this the function giving us trouble with high-stress spawn?

Yep, that's the culprit.

@dalcinl (Contributor, Author) commented Mar 2, 2024

mpi4py failure... this one is unrelated to this PR and was already reported in #12367.

testAlltoallInter (test_util_pkl5.TestMPIWorld.testAlltoallInter) ... python: pml_ob1_comm.c:51: mca_pml_ob1_comm_proc_destruct: Assertion `NULL == proc->frags_cant_match' failed.

More disturbing: I re-ran the testsuite with the ULFM tests re-enabled, and now I get a failure. I'm not sure whether this is a consequence of applying @abouteiller's suggestion or something else. I'll have to test this locally before proceeding, so I'm marking this PR as a draft.

The failure I got is not reproducible; everything is fine after rerunning. I'm pulling this PR out of the draft state.

@dalcinl dalcinl marked this pull request as draft March 2, 2024 12:07
@dalcinl dalcinl marked this pull request as ready for review March 2, 2024 16:49
@dalcinl (Contributor, Author) commented Mar 4, 2024

@abouteiller I've implemented your suggestion where you asked and ported it to another place as well. Could you please quickly re-review so that I can move this PR forward?

@dalcinl dalcinl force-pushed the bugfix/ulfm-intercomm branch 2 times, most recently from 6c13268 to 627970a on March 12, 2024 16:44
@abouteiller (Member) commented

I got the following error running the ulfm-testing test harness; I'll need a couple more days to investigate this:

 4  /home/bouteill/ompi/master.debug/lib/libmpi.so.0(mca_pml_ob1_revoke_comm+0x409) [0x7ff4e61d7d30]
 5  /home/bouteill/ompi/master.debug/lib/libmpi.so.0(+0x7df76) [0x7ff4e5eb8f76]
 6  /home/bouteill/ompi/master.debug/lib/libmpi.so.0(+0x7a60e) [0x7ff4e5eb560e]
 7  /home/bouteill/ompi/master.debug/lib/libopen-pal.so.0(+0xda43f) [0x7ff4e5baa43f]
 8  /lib64/libevent_core-2.1.so.7(+0x21b98) [0x7ff4e55a4b98]
 9  /lib64/libevent_core-2.1.so.7(event_base_loop+0x577) [0x7ff4e55a67b7]
10  /home/bouteill/ompi/master.debug/lib/libopen-pal.so.0(+0x263be) [0x7ff4e5af63be]
11  /home/bouteill/ompi/master.debug/lib/libopen-pal.so.0(opal_progress+0xa7) [0x7ff4e5af6477]
12  /home/bouteill/ompi/master.debug/lib/libmpi.so.0(+0x6f471) [0x7ff4e5eaa471]
13  /home/bouteill/ompi/master.debug/lib/libmpi.so.0(ompi_comm_nextcid+0x7a) [0x7ff4e5eabeb2]
14  /home/bouteill/ompi/master.debug/lib/libmpi.so.0(ompi_intercomm_create+0x3b6) [0x7ff4e5ea6de6]
15  /home/bouteill/ompi/master.debug/lib/libmpi.so.0(PMPI_Intercomm_create+0x13c) [0x7ff4e5f50428]

@dalcinl (Contributor, Author) commented Mar 13, 2024

Oh boy... that looks like the intercomm create issue that has been popping up elsewhere.

@dalcinl (Contributor, Author) commented Mar 13, 2024

@bosilca @jsquyres If a barrier-like synchronization helps fix the intercomm issues, why not just add the workaround and investigate it separately? Otherwise the problem will keep haunting everyone ad aeternum. Or is someone actively working on fixing this issue? Would a PR that keeps testing working for others be accepted and merged quickly?

Review comments on ompi/communicator/comm_cid.c (two threads, both outdated and resolved)
@abouteiller (Member) commented Mar 13, 2024

Oh boy... that looks like the intercomm create issue that have been popping up elsewhere.

Yes, that error is also present in main; it is not specific to this PR. I'll try to dig into it a bit to find the root cause.

Has the intercomm creation error also happened with --with-ft=no and without fault injection?

@dalcinl (Contributor, Author) commented Mar 14, 2024

Has the intercomm creation error also happened with --with-ft=no and without fault injection?

When you ask about --with-ft=no, do you mean the configure option? In all of my CI testing, I never specify --with-ft=... to configure, so the default should be in place. Regarding the second part of your question, all my testing so far is WITHOUT fault injection. The main purpose of this PR is to enable testing of the ULFM APIs when no fault occurs.

@dalcinl dalcinl force-pushed the bugfix/ulfm-intercomm branch 2 times, most recently from 191ad44 to b120f0c on March 21, 2024 07:05
@dalcinl (Contributor, Author) commented Mar 21, 2024

@abouteiller I fixed up your commit to keep the history clean, then rebased onto the latest main. My CI testing still fails [link], but IMHO these failures are unrelated to this PR.

@bosilca I'm not sure what the status of the intercommunicator deadlock issues is. I'm thinking of manually adding intercomm.Barrier() after every creation in the mpi4py testsuite; otherwise that issue will keep us from making progress on other things like this PR.

@dalcinl (Contributor, Author) commented Mar 28, 2024

After adding a Barrier() right after creating the intercomm, all mpi4py tests now pass with the changes in this PR:
https://github.com/mpi4py/mpi4py-testing/actions/runs/8466320394

@hppritcha (Member) commented

@dalcinl you may want to rebase this PR on main now. I tested it (using modified test_ulfm.py) and it seems to work consistently now.

@hppritcha (Member) commented

I also checked current main without the commits in this PR, and test_ulfm.py does still fail. It passes for me with these commits added.

dalcinl and others added 2 commits April 18, 2024 10:14
For intercommunicators, passing MPI_IN_PLACE to allreduce is invalid.

Signed-off-by: Lisandro Dalcin <[email protected]>
Co-authored-by: Aurelien Bouteiller <[email protected]>
Signed-off-by: Lisandro Dalcin <[email protected]>
@dalcinl (Contributor, Author) commented Apr 18, 2024

@hppritcha Thanks. Indeed, after updating this PR, and after enabling all ULFM tests in mpi4py, everything is now OK. Full testsuite run here: https://github.com/mpi4py/mpi4py-testing/actions/runs/8734090801

@abouteiller @bosilca This PR is now ready. Could you please give it a final look and merge?

Review comments on ompi/communicator/ft/comm_ft.c (three threads; two outdated and resolved, one resolved)
@dalcinl (Contributor, Author) commented Apr 19, 2024

@bosilca I've updated this PR as per your recommendations. Note however that I've added a few more ompi_comm_request_return(request) in places where I believe these cleanups were missing.

Please double check I didn't mess things up.

@abouteiller previously approved these changes Apr 19, 2024
@abouteiller abouteiller self-requested a review April 19, 2024 20:29
@abouteiller abouteiller dismissed their stale review April 19, 2024 20:30

larger scale test failed, need further study

@bosilca (Member) left a comment

Thanks @dalcinl for sticking with us on this.

@bosilca bosilca merged commit 0261a03 into open-mpi:main Apr 19, 2024
14 checks passed
@dalcinl dalcinl deleted the bugfix/ulfm-intercomm branch April 20, 2024 08:40
@dalcinl (Contributor, Author) commented Apr 20, 2024

@bosilca Unfortunately, I'll have to keep sticking on this...

The changes you objected to regarding the calling order of OBJ_RELEASE(...) and ompi_comm_request_return(...) were actually all good and needed. The problem is that, two months after doing all that work, I had forgotten the gory details.

If you look at the implementation of ompi_comm_request_return(), you will notice the line OBJ_RELEASE(request->context). Therefore, any use of the internal request context after ompi_comm_request_return() is a use-after-free bug. That was the reason I originally incorporated all those fixes in this PR. Most of the error paths are therefore still broken with the use-after-free bug.

@dalcinl (Contributor, Author) commented Apr 20, 2024

I've opened #12480 as a follow-up.

BTW, how do we get all these fixes into the v5.0.x branch?
