Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Utilize PMIx type checking for PML check #12481

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

hppritcha
Copy link
Member

It appears that sometimes we get a bogus value back for the PML selected by a remote proc. Unfortunately, the current code for exchanging that info removes all type information from the exchange.

Replace the use of those macros with native PMIx
calls so we can check that the data type being
returned is the one we expected. True, it should not have changed - but this will help detect and debug any errors.

It appears that sometimes we get a bogus value back
for the PML selected by a remote proc. Unfortunately,
the current code for exchanging that info removes all
type information from the exchange.

Replace the use of those macros with native PMIx
calls so we can check that the data type being
returned is the one we expected. True, it should not
have changed - but this will help detect and debug
any errors.

Signed-off-by: Ralph Castain <[email protected]>
Copy link
Member

@bosilca bosilca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a solution to anything, especially not to PMIX returning incorrect values. It was helpful once, because we got lucky and the returned values had a different type, but that was a random event, not worth a patch that only fix one single instance of retrieving data from PMIX.

@rhc54
Copy link
Contributor

rhc54 commented Apr 22, 2024

So...would it be better for the user to blindly segfault when there is an error, instead of detecting that something is wrong, reporting the error, and cleanly exiting? Remember, in the case in question, you were just being lucky - you were treating a blind object as a char* and running strcmp on it, when in fact it was NOT a string. Is that the desired programming style in OMPI?

It isn't just a bug in PMIx that could cause the situation - could happen between two OMPI procs in different jobs, each running a different version. Where do you guarantee someone will always post the same type of data for a given key?

@hppritcha
Copy link
Member Author

This PR is just adding some protection. Someone else when they have time can go through the entire ompi code base and add similar protection. If this passes CI i''m merging.

@bosilca
Copy link
Member

bosilca commented Apr 22, 2024

  1. Then we should provide a similar patch for all instances of string storage in the entire OMPI code base, not fix just one that happen to have highlighted a bug in another library.
  2. Assuming PMIX had returned the correct value corresponding to the requested key or NULL, this patch would be of no need.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants