-
Notifications
You must be signed in to change notification settings - Fork 5
DAOS-17899 mercury: Update ep close to address memory issue in UCX. #132
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Signed-off-by: Joseph Moore <[email protected]>
Signed-off-by: Joseph Moore <[email protected]>
| + if (na_ucx_addr != NULL) { | ||
| + NA_LOG_SUBSYS_WARNING(addr, | ||
| + "An entry is already present for this address"); | ||
| + na_ucx_addr_release(na_ucx_addr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Joe, can we just release the address without checking refcount? what if anther place is still taking refcount on the address?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change was put in to ensure we free the resource. It was needed to work around an limitation with UCX. The UCX layer is supposed to give us a disconnect event on the the server-end for client-side ep closes, but these events don't occur on rare occasions. The clients are able to reuse addresses (ipaddr:port) for new clients, and in practice this often occurs. In the case where the disconnect for the prior usage of this address wasn't sent, we execute this line of code to enable the accept of the new connection request. The internal data structures in the Mercury UCX layer are not able to support multiple address structures sharing the same addr_key (sockaddr). Hence we have to release. Should always be the case in these instances that refcount is 1, so release would be called from decrement function if we used this call instead. I will look into your concern further.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR should have been marked as draft, since it was created with initial purpose of producing RPMs for testing Jerome's changes (at LRZ). I will create an updated commit of this pr, with a separate patch file that adds Jerome's code changes as an incremental patch. This will make it clear that the "already present" handling was added to the DAOS-modified version of Mercury 2.4 in a separate, previously-landed fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with that, please try to keep patches separate and organized by ticket numbers so that we can reference the initial issue/ticket that they were fixing otherwise we have no way of finding it.
Signed-off-by: Joseph Moore <[email protected]>
No description provided.