GNI provider not thread safe #1312

Open
biddisco opened this issue Apr 4, 2017 · 16 comments

biddisco commented Apr 4, 2017

Segfaults with stack traces like the one below are common with the GNI provider when using multiple threads. This one appears to emanate from the memory registration cache; others frequently occur in the poll_cq code.

__restore_rt
__find_overlapping_addr
rbtFindLeftmost
__mr_cache_search_stale
_gnix_mr_cache_register
__cache_reg_mr
__mr_reg
gnix_mr_regattr
gnix_mr_reg
hpx::parcelset::policies::libfabric::sender::async_write_impl()
std::enable_if<hpx::parcelset::connection_handler_traits<hpx::parcelset::policies::libfabric::parcelport>::send_immediate_parcels::value, void>::type hpx::parcelset::parcelport_impl<hpx::parcelset::policies::libfabric::parcelport>::send_immediate_impl<hpx::parcelset::policies::libfabric::parcelport>(hpx::parcelset::parcelport_impl<hpx::parcelset::policies::libfabric::parcelport>&, hpx::parcelset::locality const&, hpx::util::function<void (boost::system::error_code const&, hpx::parcelset::parcel const&), false>&&, hpx::parcelset::parcel&&)
void hpx::util::detail::callable_vtable<void (hpx::parcelset::parcel&&)>::_invoke<hpx::parcelset::parcelport_impl<hpx::parcelset::policies::libfabric::parcelport>::parcel_await_handler>(void**, hpx::parcelset::parcel&&)
hpx::parcelset::detail::parcel_await::apply()
hpx::parcelset::parcelport_impl<hpx::parcelset::policies::libfabric::parcelport>::put_parcel(hpx::parcelset::locality const&, hpx::parcelset::parcel, hpx::util::function<void (boost::system::error_code const&, hpx::parcelset::parcel const&), false>)
hpx::parcelset::parcelhandler::put_parcel(hpx::parcelset::parcel, hpx::util::function<void (boost::system::error_code const&, hpx::parcelset::parcel const&), false>)
hpx::parcelset::detail::put_parcel_handler::operator()(hpx::parcelset::parcel&&) const.constprop.3840
void hpx::parcelset::detail::put_parcel_impl<hpx::parcelset::detail::put_parcel_handler, node_server::send_gravity_boundary_action&, hpx::threads::thread_priority&, gravity_boundary_type, geo::direction const&, bool&, unsigned long&>(hpx::parcelset::detail::put_parcel_handler&&, hpx::naming::id_type, hpx::naming::address&&, node_server::send_gravity_boundary_action&, hpx::threads::thread_priority&, gravity_boundary_type&&, geo::direction const&, bool&, unsigned long&)
bool hpx::detail::apply_impl<node_server::send_gravity_boundary_action, gravity_boundary_type, geo::direction const&, bool&, unsigned long&>(hpx::naming::id_type const&, hpx::threads::thread_priority, gravity_boundary_type&&, geo::direction const&, bool&, unsigned long&)
node_client::send_gravity_boundary(gravity_boundary_type&&, geo::direction const&, bool, unsigned long) const
node_server::compute_fmm(gsolve_type, bool)
node_server::refined_step()
...

biddisco commented Apr 4, 2017

I can confirm that setting the memory registration cache to "none" solves most of the race-condition issues we have been seeing:

            _set_check_domain_op_value(GNI_MR_CACHE, "none");

Also, adding a lock in our code around the poll_cq calls fixes the other issues, so we are (so far) able to run error-free.
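For reference, a minimal sketch of both workarounds in plain libfabric C, assuming the GNI domain-ops extension from rdma/fi_ext_gni.h (the `_set_check_domain_op_value` call above is our HPX wrapper around the same interface); `disable_gni_mr_cache` and `locked_cq_read` are hypothetical helper names:

```c
#include <pthread.h>
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_eq.h>
#include <rdma/fi_ext_gni.h>    /* GNI provider extension ops */

/* Workaround 1: disable the GNI memory registration cache on an
 * already-opened domain, via the provider's extension interface. */
static int disable_gni_mr_cache(struct fid_domain *domain)
{
    struct fi_gni_ops_domain *gni_ops;
    char *cache = "none";
    int ret;

    ret = fi_open_ops(&domain->fid, FI_GNI_DOMAIN_OPS_1, 0,
                      (void **)&gni_ops, NULL);
    if (ret)
        return ret;
    return gni_ops->set_val(&domain->fid, GNI_MR_CACHE, &cache);
}

/* Workaround 2: serialize completion-queue polling with a single
 * lock shared by every thread that reads the CQ. */
static pthread_mutex_t cq_lock = PTHREAD_MUTEX_INITIALIZER;

static ssize_t locked_cq_read(struct fid_cq *cq, void *buf, size_t count)
{
    pthread_mutex_lock(&cq_lock);
    ssize_t rc = fi_cq_read(cq, buf, count);
    pthread_mutex_unlock(&cq_lock);
    return rc;
}
```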


jswaro commented Apr 5, 2017

That is interesting. What version are you using?


jswaro commented Apr 5, 2017

Also, what threading model are you using? FI_THREAD_COMPLETION, or FI_THREAD...?

biddisco commented:

We use FI_THREAD_SAFE.
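(For context: the threading model is requested through the domain attributes passed to fi_getinfo(). A minimal sketch against a 1.4-era libfabric, with `get_thread_safe_info` as a hypothetical helper name:)

```c
#include <string.h>
#include <rdma/fabric.h>

/* Ask the GNI provider for FI_THREAD_SAFE: the provider is then
 * responsible for serializing access to its internal state, so
 * multiple threads may drive endpoints and CQs concurrently. */
static struct fi_info *get_thread_safe_info(void)
{
    struct fi_info *hints = fi_allocinfo();
    struct fi_info *info = NULL;

    hints->fabric_attr->prov_name = strdup("gni");
    hints->domain_attr->threading = FI_THREAD_SAFE;
    hints->caps = FI_MSG | FI_RMA;   /* whatever the app needs */

    if (fi_getinfo(FI_VERSION(1, 4), NULL, NULL, 0, hints, &info) != 0)
        info = NULL;
    fi_freeinfo(hints);
    return info;
}
```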


jswaro commented Apr 12, 2017

So what version are you using? Can you provide a git tag or hash?

biddisco commented:

I've been tracking the head of the master branch on the Cray fork of libfabric. I realize that using an unstable version is asking for trouble, but we had no luck with the stable ones.

Currently we are using this commit:

    commit d05d486
    Merge: 8e25361 d5c3f33
    Author: libfabric test [email protected]
    Date:   Tue Apr 11 02:27:19 2017 -0500

        Merge branch 'master' of https://github.com/ofiwg/libfabric

We also have lockups in our code that we suspect are caused by a problem in libfabric (though we cannot be certain); the latest one happened just a few minutes ago on Cori at NERSC using 6K nodes.

I have had more luck with tests on the Cray at CSCS (Daint), where my tests pass on 2K nodes with 12 threads per node; on Cori, however, we are using 68 threads per node, with potentially up to 16 threads at a time in the network layer.

@sithhell may want to comment further.

jswaro changed the title from "GNI provider not thread safe" to "thread safety issue with memory registration" on Apr 13, 2017

jswaro commented Apr 13, 2017

It might be useful to get a core file, if you can generate one reliably.

jswaro changed the title from "thread safety issue with memory registration" back to "GNI provider not thread safe" on Apr 13, 2017

jswaro commented Apr 13, 2017

Changed the title back. I just noticed the part about it crashing in poll_cq.


jswaro commented Apr 17, 2017

Can you provide a test case?


jswaro commented May 31, 2017

@biddisco Can you provide a test case?


biddisco commented Jun 1, 2017

Unfortunately, I do not have a simple test case; the libfabric support for our HPX parcelport is rather difficult to separate out into a small test. We have disabled the registration cache, and we therefore no longer have any problems with that part of the library.

This issue could be closed on the grounds of "cannot reproduce" or "won't fix": since we no longer use that code, we cannot be certain that the bug still exists and has not been fixed in subsequent commits.


jswaro commented Jun 1, 2017

> Unfortunately, I do not have a simple test case, the libfabric support for our HPX parcelport

Can you provide a simple script that could be used to run this under the conditions where you experienced the issue? If the test doesn't take more than 1-2 hours of runtime, it would be relatively easy to reproduce if we can run it locally. Additionally, if the source for your app is open source, we can likely make the modifications needed to turn the cache back on and test it. If not, I can work around it.


biddisco commented Jun 8, 2017

I have a number of deadlines coming up and won't be able to devote any time to this, but in July I could in principle retry some tests with the memory registration cache enabled and see if the bugs reappear. Then I could supply a script that builds most of what's needed to reproduce any errors, should they reappear. Please feel free to remind me around that time if I have not shown any signs of testing this further.


jswaro commented Jun 13, 2017

Related: #1376, #1377


jswaro commented Jul 7, 2017

@biddisco Any luck with reproducing the issue?


jswaro commented Jul 25, 2017

@biddisco Any luck with reproducing the issue?

John? Any chance you have been able to reproduce this?
