Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

btl/smcuda: add delayed stream initialization #12354

Merged

Conversation

edgargabriel
Copy link
Member

@edgargabriel edgargabriel commented Feb 20, 2024

introduce two new mca parameterse to the smcuda component:

  • allow for delayed initialization of the internal ipc stream and the array of events. This allows to handle situations where the user code did not set the device before MPI_Init AND the internal stream and/or event structures have some dependence on the device id used during creation.
  • add a parameter to control how many events are created during initialization by the smcuda component.

@edgargabriel
Copy link
Member Author

I was wondering whether somebody would have time to review this pr, I would like to have it merged before the weekend if possible.

@edgargabriel
Copy link
Member Author

bot:retest

@wenduwan
Copy link
Contributor

Running AWS CI. We test on both x86 and arm.

@edgargabriel
Copy link
Member Author

@wenduwan ok, thank you!

@wenduwan
Copy link
Contributor

FYI our CI passed.

@edgargabriel
Copy link
Member Author

bot:retest

@edgargabriel edgargabriel force-pushed the topic/smcuda-delayed-sae-init branch from ccba7e8 to 71b404b Compare March 6, 2024 13:46
@edgargabriel
Copy link
Member Author

Hm, not sure whether the mpi4py failure is truly because of something in this pr

Error: The action 'Test mpi4py (np=4)' has timed out after 10 minutes.

opal/mca/btl/smcuda/btl_smcuda.h Outdated Show resolved Hide resolved
opal/mca/btl/smcuda/btl_smcuda_accelerator.c Outdated Show resolved Hide resolved

/* Create the events since they can be reused. */
for (i = 0; i < accelerator_event_max; i++) {
rc = opal_accelerator.create_event(MCA_ACCELERATOR_NO_DEVICE_ID, &accelerator_event_ipc_array[i], opal_accelerator_use_sync_memops ? false : true);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm slightly confused here about what this PR really does. You mentioned it improves the handling of the process GPU/stream, but I see here that the events are created without a device association (MCA_ACCELERATOR_NO_DEVICE_ID) as if they were expected to be generic (which is not the case, at least not with CUDA)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moreover, the CUDA accelerator simply ignores the device id, which either suggest that we don't need it or that the CUDA accelerator framework is careless on the handling of the devices.

Copy link
Member Author

@edgargabriel edgargabriel Mar 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So its a tricky question: neither hipEventCreateWithFlags() nor cuEventCreate() take a device_id as an argument, the device_id argument to the API was added for ze devices. However, it seems like the device_id does play internally a role. Specifically, in some ROCm releases, if the code did something like:

hipSetDevice(device_id)
MPI_Init()

everything worked, since the events were created using the device that was used subsequently. If the sequence was however the other way around (i.e. the code set the device using hipSetDevice() after MPI_Init()) we had the events (and streams) being created with the default device (0), but afterwards we tried to use it for operations on a different device, and that caused an error. The delayed init is fixing this.

I agree however that I can make this a bit cleaner, by first retrieving the current device id , and using that id as an argument in the event_create and stream_create functions. Would that be ok?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that would be awesome, and will make the code much more readable.

@edgargabriel edgargabriel force-pushed the topic/smcuda-delayed-sae-init branch from 71b404b to 4f83e45 Compare March 6, 2024 17:16
@edgargabriel
Copy link
Member Author

@bosilca thank you for your review, I tried address your comments and updated the code accordingly, please let me know whether this looks ok now. Thanks again!

introduce two new mca parameterse to the smcuda component:

- allow for delayed initialization of the internal ipc stream and the
  array of events. This allows to handle situations where the user
  code did not set the device before MPI_Init AND the internal stream
  and/or event structures have some dependence on the device id used
  during creation.
- add a parameter to control how many events are created during
  initialization.

Signed-off-by: Edgar Gabriel <[email protected]>
@edgargabriel edgargabriel force-pushed the topic/smcuda-delayed-sae-init branch from 4f83e45 to 835eef5 Compare March 6, 2024 17:45
@edgargabriel edgargabriel merged commit a25dd5f into open-mpi:main Mar 6, 2024
7 checks passed
@edgargabriel edgargabriel deleted the topic/smcuda-delayed-sae-init branch July 12, 2024 12:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants