Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ompi/oshmem/spml/ucx: Fix cleanup when disconnected: avoid double free #12297

Merged
merged 1 commit into from
Feb 7, 2024

Conversation

iyastreb
Copy link
Contributor

Fixes #12249

Problem:
Crash occurs when running hello_oshmem_c.c example, which happens due to double free in oshmem_shmem_finalize function.

Root cause analysis:
oshmem_shmem_finalize is called twice, first time it's called explicitly from the app, and the second time it's called from shmem_onexit handler. It's perfectly fine to call this function multiple times, as it's protected by oshmem_shmem_initialized variable, which prevents double free on second invocation.
The problem is that this flag is not being reset, due to the error in _shmem_finalize:

oshmem_shmem_finalize
    ret = _shmem_finalize();
        mca_spml_base_finalize
            mca_spml_ucx_component_fini
                ret = opal_common_ucx_mca_pmix_fence_nb <<< returns PMIX_ERR_UNREACH when disconnected
                if (ret != OPAL_SUCCESS)
                    return ret;
            
    if (OSHMEM_SUCCESS == ret) {
        oshmem_shmem_initialized = false;
    }

When we are disconnected, then opal_common_ucx_mca_pmix_fence_nb returns PMIX_ERR_UNREACH error, and we interrupt the cleanup process in the middle. Then subsequent cleanup call tries to free the freed memory (requests) => segfault.

Solution
Allow clean up to proceed when we are in disconnected state. If we are not connected to the server, that's ok, we can probably successfully release the resources. IMO being connected to the server should not be prerequisite for destroy.

Copy link

Hello! The Git Commit Checker CI bot found a few problems with this PR:

3372d13: ompi/oshmem/spml/ucx: Fix clean up when disconnect...

  • check_signed_off: does not contain a valid Signed-off-by line

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

@iyastreb iyastreb force-pushed the oshmem/mca/ucx/fix-double-cleanup branch from 3372d13 to 40da3bd Compare January 31, 2024 17:04
@brminich
Copy link
Member

brminich commented Feb 1, 2024

@tvegas1, can you pls review?

fenced = 1;
break;

default: return ret;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it also make sense to set fenced = 1 also for default case, maybe with a log?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I don't think it makes sense.
fenced is just a local variable on stack used below to safely clean up the resources.
If we return an error, no need to set it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i meant, setting it to 1 and continuing with cleanup anyways, wdyt?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I see your point now.
Basically it means that if we don't manage to contact the server, we just go on with clean up.
To me it looks reasonable, because if we return an error here and interrupt the cleanup, then we remain in semi-destroyed state, and subsequent cleanup call will crash anyway.

On the other side, existing code in _shmem_finalize follows the pattern: check status of each cleanup stage and return an error:
https://github.com/open-mpi/ompi/blob/main/oshmem/runtime/oshmem_shmem_finalize.c#L90

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok we could keep it like this then

Copy link
Member

@brminich brminich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls note that branch is out of date, so needs to be synced

oshmem/mca/spml/ucx/spml_ucx_component.c Outdated Show resolved Hide resolved
oshmem/mca/spml/ucx/spml_ucx_component.c Show resolved Hide resolved
@iyastreb iyastreb force-pushed the oshmem/mca/ucx/fix-double-cleanup branch from 5d35c73 to 0f16fbe Compare February 5, 2024 07:31
Copy link

github-actions bot commented Feb 5, 2024

Hello! The Git Commit Checker CI bot found a few problems with this PR:

0f16fbe: ompi/oshmem/spml/ucx: Fix cleanup when disconnecte...

  • check_signed_off: does not contain a valid Signed-off-by line

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

@iyastreb iyastreb force-pushed the oshmem/mca/ucx/fix-double-cleanup branch from 0f16fbe to f23365e Compare February 5, 2024 07:38
@@ -491,7 +492,10 @@ static int mca_spml_ucx_component_fini(void)


ret = opal_common_ucx_mca_pmix_fence_nb(&fenced);
if (OPAL_SUCCESS != ret) {
if (ret == PMIX_ERR_UNREACH) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

print a warning if fence fails

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

if (ret == PMIX_ERR_UNREACH) {
/* It's ok if we are disconnected, continue cleanup */
if (ret != PMIX_SUCCESS) {
SPML_UCX_WARN("pmix fence failed: %d", ret);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a function to also print string name of the error code?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PMIx_Error_string(ret)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, changed

@jsquyres
Copy link
Member

jsquyres commented Feb 6, 2024

@iyastreb Please don't add merge commits into PR's; if you want the latest from main, please rebase.

@iyastreb iyastreb force-pushed the oshmem/mca/ucx/fix-double-cleanup branch from 304228c to 3071892 Compare February 6, 2024 15:29
@iyastreb
Copy link
Contributor Author

iyastreb commented Feb 6, 2024

@iyastreb Please don't add merge commits into PR's; if you want the latest from main, please rebase.

@jsquyres Ok, sure.
In this case I just noticed that my PR is marked as out-of-date, because of some recent changes in main.
So I just pressed the button "Update branch" in GH webUI, which resulted in this merge

@jsquyres
Copy link
Member

jsquyres commented Feb 6, 2024

@jsquyres Ok, sure. In this case I just noticed that my PR is marked as out-of-date, because of some recent changes in main. So I just pressed the button "Update branch" in GH webUI, which resulted in this merge

Fair enough. I wish there was a way we could disable Github from showing that button.

@yosefe yosefe merged commit 92b82f6 into open-mpi:main Feb 7, 2024
11 checks passed
@iyastreb iyastreb deleted the oshmem/mca/ucx/fix-double-cleanup branch February 16, 2024 07:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

OSHMEM requests potentially destroyed more than once
6 participants