Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ompi/coll/accelerator: implement reduce_local #12758

Merged

Conversation

Akshay-Venkatesh
Copy link
Contributor

@Akshay-Venkatesh Akshay-Venkatesh commented Aug 14, 2024

reduce_local implementation

@jsquyres
Copy link
Member

@Akshay-Venkatesh Why are you putting bot:notacherrypick on all your PRs? That is not appropriate here.

@jsquyres
Copy link
Member

Why does the commit message say

ompi/coll/accelerator/cuda:

As far as I can tell, this commit/PR is not specific to CUDA.

@janjust
Copy link
Contributor

janjust commented Aug 14, 2024

Probably just old terminology misuse, since in the past anything accelerator was synonymous to cuda.
@Akshay-Venkatesh can we just drop the cuda part from the commit message since this is accelerator component specific?

@janjust
Copy link
Contributor

janjust commented Aug 14, 2024

@Akshay-Venkatesh please drop the bot:notacherrypick from the commit message, anything in main doesn't need it since the CI is looking for commits in main for cherry-picks in other branches.

@Akshay-Venkatesh
Copy link
Contributor Author

@jsquyres @janjust I'll make the changes to commit message and force push soon.

Copy link
Member

@edgargabriel edgargabriel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Akshay-Venkatesh Akshay-Venkatesh changed the title ompi/coll/accelerator/cuda: implement reduce_local ompi/coll/accelerator: implement reduce_local Aug 14, 2024
dtype, op, root, comm,
s->c_coll.coll_reduce_module);

if ((comm == NULL) && (root == -1)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a horrible highjack of the reduction API ! A collective module is always attached to a communicator (also that's where the module is coming from). How can you end up here with a comm equal to NULL ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't disagree that the usage is is hacky. Do you have an alternative option that doesn't involve NULL comm and -1 root that also doesn't involve duplicated code? Maybe there are other constants for comm and root that are more appropriate here?

cc @jsquyres

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, check my other comment. You are trying to do a local_reduce in the reduce collective by using a NULL communicator as a signal. There is an API for local reduce, in this same PR, why don't you use it instead of redirecting it into the reduce ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But that would effectively involve copying the code in mca_coll_accelerator_reduce into mca_coll_accelerator_reduce_local. I was hoping there was a way to avoid duplicated code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

absolutely. Put it into a separate function and invoke it in all places where it is needed.

struct ompi_op_t *op,
mca_coll_base_module_t *module)
{
return mca_coll_accelerator_reduce(sbuf, rbuf, count, dtype, op, -1, NULL,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of the solution proposed here, copy the code from above in this function. Cleaner and more readable from my perspective, as well as more into the OMPI approach to collective modules.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jsquyres / @bosilca Sorry this PR got de-prioritized last week. I've duplicated the code in the main reduce function now. Let me know if the changes look good.

@janjust
Copy link
Contributor

janjust commented Aug 27, 2024

@bosilca can we get another look at this PR

@bosilca bosilca merged commit e25e897 into open-mpi:main Sep 5, 2024
17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants