GNI provider not thread safe #1312
Comments
I can confirm that setting the memory registration cache to NONE solves most of the issues we've been having with race conditions.
Also, adding a lock in our code around the poll_cq calls fixes the other issues. So we are able to run (so far) error free. |
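For readers unfamiliar with the workaround described above, the following is a minimal sketch of serializing completion-queue polling behind a single application-level lock. It does not use libfabric: `fake_cq`, `poll_once`, `locked_poll`, and `drain_with_threads` are hypothetical names invented for illustration, and `poll_once` merely stands in for whatever the real `poll_cq` routine does (in a libfabric application that would typically involve `fi_cq_read`). It is not the code used in the HPX parcelport.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for a completion queue. In a real application this
 * would be the provider's CQ handle, and poll_once() would read from it. */
struct fake_cq {
    long pending; /* completions still waiting to be reaped */
    long reaped;  /* completions handed out so far */
};

/* Deliberately NOT thread safe: an unsynchronized read-modify-write,
 * standing in for a polling routine that races under concurrent callers. */
static int poll_once(struct fake_cq *cq)
{
    if (cq->pending == 0)
        return 0;
    cq->pending--;
    cq->reaped++;
    return 1;
}

/* One application-level lock that every polling thread must take. */
static pthread_mutex_t cq_lock = PTHREAD_MUTEX_INITIALIZER;

/* The workaround: serialize all polling through cq_lock. */
static int locked_poll(struct fake_cq *cq)
{
    pthread_mutex_lock(&cq_lock);
    int n = poll_once(cq);
    pthread_mutex_unlock(&cq_lock);
    return n;
}

/* Thread body: keep polling until the queue reports no more completions. */
static void *poller(void *arg)
{
    struct fake_cq *cq = arg;
    while (locked_poll(cq) == 1)
        ;
    return NULL;
}

/* Drain `events` completions using `nthreads` concurrent pollers and
 * return how many completions were reaped in total. */
static long drain_with_threads(long events, int nthreads)
{
    struct fake_cq cq = { events, 0 };
    pthread_t tid[64];

    assert(nthreads > 0 && nthreads <= 64);
    for (int i = 0; i < nthreads; i++) {
        int rc = pthread_create(&tid[i], NULL, poller, &cq);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < nthreads; i++)
        pthread_join(tid[i], NULL);
    return cq.reaped;
}
```

With the lock in place, `drain_with_threads(100000, 16)` reaps each completion exactly once regardless of how the 16 pollers interleave. The cost is that polling no longer scales with thread count, so this is a diagnostic workaround rather than a fix for the provider.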
That is interesting. What version are you using? |
Also, what threading model are you using? FI_THREAD_COMPLETION, or FI_THREAD... ? |
We use FI_THREAD_SAFE |
So what version are you using? Can you provide a git tag or hash? |
I've been tracking the head of the master branch on the Cray fork of libfabric. I realize that using an unstable version is asking for trouble, but we had no luck with the stable ones. Currently we are using commit d05d486,
and we have lockups in our code that we suspect are caused by a problem in libfabric (though we cannot be certain). The latest one happened just a few minutes ago on Cori at NERSC using 6K nodes. I have had more luck with tests on the Cray at CSCS (Daint), where my tests pass on 2K nodes with 12 threads per node; on Cori, however, we are using 68 threads per node, with potentially up to 16 at a time in the network layer checks. @sithhell may want to comment further. |
It might be useful to get a core file if you can generate them reliably. |
Changed the title back. I just noticed the part about it crashing in poll_cq. |
Can you provide a test case? |
@biddisco Can you provide a test case? |
Unfortunately, I do not have a simple test case; the libfabric support for our HPX parcelport is rather difficult to separate out into a small test. We have disabled the registration cache, so we no longer have any problems with that part of the library. This issue could be closed as "cannot be reproduced" and "won't fix", since we do not use that code and cannot be certain that the bug exists or that it has not been fixed in subsequent commits. |
Can you provide a simple script that could be used to run this under the conditions that you experienced the issue? If the test isn't in excess of 1-2hrs in runtime, it would be relatively easy to reproduce if we can run it locally. Additionally, if the source for your app is open-source, we can likely do the modification we need to turn on the cache and test it. If not, I can work around it. |
I have a number of deadlines coming up and won't be able to devote any time to this, but in July I could in principle retry some tests with memory registration enabled and see if bugs appear. I could then supply a script that would build most of what's needed to reproduce any errors should they reappear. Please feel free to remind me around that time if I have not shown any signs of testing this further. |
@biddisco Any luck with reproducing the issue? |
John? Any chance you have been able to reproduce this? |
Segfaults with stack traces like these are common with the GNI provider when using multiple threads. This one appears to emanate from the memory registration cache. Others are frequent in the poll_cq code.