Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

accelerator: build components as dso's by default #12055

Merged
merged 2 commits into from
Dec 13, 2023

Conversation

hppritcha
Copy link
Member

related to #12036

@hppritcha hppritcha marked this pull request as draft November 8, 2023 19:13
@hppritcha
Copy link
Member Author

bot:ompi:retest

strange failure of make check shouldn't be impacted by this PR. rerunning

@hppritcha
Copy link
Member Author

bot:aws:retest

@jsquyres
Copy link
Member

bot:aws:retest

config/opal_mca.m4 Outdated Show resolved Hide resolved
config/opal_mca.m4 Outdated Show resolved Hide resolved
config/opal_mca.m4 Outdated Show resolved Hide resolved
@hppritcha
Copy link
Member Author

okay, i've added smcuda, accelerator (so all components), and rcache-gpusm rcache-rgpusm but still see libcuda linked in to opal and ompi libs and hence to commands like ompi_info. digging...

@hppritcha
Copy link
Member Author

oh i had a typo in the config file. ignore last comment.

@hppritcha hppritcha force-pushed the dso_accelerators branch 2 times, most recently from 7100bed to 88d46f2 Compare November 21, 2023 20:37
@hppritcha hppritcha marked this pull request as ready for review November 21, 2023 20:56
@hppritcha
Copy link
Member Author

i still have no idea why the make check opal_condition would fail with this PR.

@hppritcha
Copy link
Member Author

@jsquyres @edgargabriel please recheck

Copy link
Member

@edgargabriel edgargabriel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can confirm that these components are now built as dso's by default.

You need to rebase your branch however, it is missing the Makefile.am fixes to btl/smcuda and the two rcache components, and hence doesn't compile.

config/opal_mca.m4 Outdated Show resolved Hide resolved
config/opal_mca.m4 Outdated Show resolved Hide resolved
@hppritcha
Copy link
Member Author

i'm going to push a temporary commit to undo the default dso module thing to see if that's somehow causing the make check to fail.

@hppritcha hppritcha force-pushed the dso_accelerators branch 5 times, most recently from 5c937d8 to e6ed21b Compare November 30, 2023 15:47
@hppritcha
Copy link
Member Author

okay the problem is that opal_init call in the opal_conditional test fails in the make distcheck build. can't find an accelerator component.

@hppritcha
Copy link
Member Author

This is rather annoying. I've narrowed the problem down to the fact that when we try to select an accelerator component, if none are found, opal_init returns a fatal error. Now, when make check is run but make install has not been previously run, then the framework can't find a valid accelerator component and thus returns error. so any make check program that calls opal_init will error out unless make install had been previously run. doesn't have to do with the dist tarball itself.

@jsquyres
@edgargabriel

thoughts?

@hppritcha
Copy link
Member Author

i forgot that we can specify specific accelerator components to build as DSO's so trying that route.

@edgargabriel
Copy link
Member

In theory there is no reason to have the null component as a dso, that could be part statically compiled, and that might resolve your problem

@hppritcha
Copy link
Member Author

that's what I'm trying.

note this behavior in the accel base select component may need a better fix. For example, if someone configures with --enable-mca-dso then does make check, they will see the same failures as this PR was exhibiting. kind of an edge case but still...

also need to switch rcache/gpsum and rcache/rgpusum

to DSO by default.

Fix a problem in opal_mca.m4 where the enable-mca-dso list wasn't being
processed correctly starting with 5.0.0.

related to open-mpi#12036

Signed-off-by: Howard Pritchard <[email protected]>
@hppritcha
Copy link
Member Author

@jsquyres please review again

@hppritcha
Copy link
Member Author

@jsquyres ping, i'd like to pr over to 5.0.x for the 5.0.1 release

config/opal_mca.m4 Show resolved Hide resolved
config/opal_mca.m4 Show resolved Hide resolved
config/opal_mca.m4 Outdated Show resolved Hide resolved
@hppritcha
Copy link
Member Author

@jsquyres check now

Signed-off-by: Howard Pritchard <[email protected]>
@hppritcha
Copy link
Member Author

bot:nvidia:retest

@hppritcha
Copy link
Member Author

@open-mpi/cuda could someone see what's going on with ompi_NVIDIA CI?

@hppritcha
Copy link
Member Author

@jsquyres ping

@hppritcha hppritcha merged commit 454afb1 into open-mpi:main Dec 13, 2023
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants