MPI tags not general enough #282

Open
s-mayani opened this issue Apr 24, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@s-mayani
Collaborator

The tags defined in src/Communicate/Tags.h for MPI communication start to overlap once a run exceeds a certain size (i.e. number of MPI processes), which causes correctness issues in any code that communicates using these tags. For example, the issue became visible when running test/solver/TestGaussian.cpp on more than 512 Perlmutter nodes, each with 4 GPUs.

This has been temporarily fixed in commit b27fa15ed95a322873500d70a57e5df58e32a04f, which increases the absolute distance between tags that may be in use at the same time.

However, larger runs can still reach the overlap limit, so a permanent solution is needed.
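To illustrate the overlap as first diagnosed here, the following is a minimal sketch that assumes tags are formed as a fixed base plus an offset that grows with the run size. The constants and the names `makeTag`, `rangesOverlap`, and `checkedTag` are hypothetical and are not taken from src/Communicate/Tags.h.

```cpp
#include <cassert>

// Hypothetical tag layout (an assumption, not the real Tags.h contents):
// each kind of message owns a base tag, and an offset is added per message.
constexpr int kHaloBaseTag   = 10000;  // assumed base for one message kind
constexpr int kSolverBaseTag = 11000;  // assumed base for another, 1000 above

// Tag passed to MPI: base plus an offset that grows with the run size.
constexpr int makeTag(int baseTag, int offset) { return baseTag + offset; }

// Two tag ranges [base, base + numOffsets) collide if they intersect.
constexpr bool rangesOverlap(int baseA, int baseB, int numOffsets) {
    const int lo = baseA < baseB ? baseA : baseB;
    const int hi = baseA < baseB ? baseB : baseA;
    return lo + numOffsets > hi;
}

// Guarded variant: fails loudly (in debug builds) instead of silently
// producing a tag that belongs to the next range.
inline int checkedTag(int baseTag, int offset, int spacing) {
    assert(offset >= 0 && offset < spacing);
    return baseTag + offset;
}

// With a spacing of 1000, up to 1000 offsets per base are safe...
static_assert(!rangesOverlap(kHaloBaseTag, kSolverBaseTag, 1000), "no overlap");
// ...but one more offset makes two different message kinds share a tag.
static_assert(rangesOverlap(kHaloBaseTag, kSolverBaseTag, 1001), "overlap");
```

Under this assumption, widening the spacing between bases only raises the run size at which the ranges collide. It does not remove the limit.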

@s-mayani
Collaborator Author

Digging further into the issue, the problem is not the MPI tags but the buffer factory. The send and receive buffers appear to be "stepping on each other", i.e. the same memory is handed out to both. This happens because the buffer IDs used to request a buffer from the buffer factory overlap between the send and receive operations. One solution would be to lock memory that is in use, i.e. not hand out the memory associated with a buffer ID while another rank has not finished using it.
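The locking idea could look roughly like the sketch below. It is only an illustration: `LockingBufferFactory`, `tryAcquire`, and `release` are hypothetical names, not the project's actual buffer factory interface, and the sketch is single-threaded and keeps one buffer per ID.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical buffer factory that refuses to hand out memory that is
// still in use (an assumption-based sketch, not the project's class).
class LockingBufferFactory {
public:
    using Buffer = std::vector<char>;

    // Returns the buffer registered under `id`, grown to at least `size`
    // bytes, or nullptr if that buffer has not been released yet.
    Buffer* tryAcquire(int id, std::size_t size) {
        Entry& e = entries_[id];  // creates an unlocked entry on first use
        if (e.inUse) {
            return nullptr;       // same ID requested while still locked
        }
        if (e.data.size() < size) {
            e.data.resize(size);
        }
        e.inUse = true;
        return &e.data;
    }

    // Marks the buffer as free again. In an MPI code this would be called
    // only after the matching send or receive has completed, for example
    // after MPI_Wait on the corresponding request.
    void release(int id) {
        auto it = entries_.find(id);
        assert(it != entries_.end() && it->second.inUse);
        it->second.inUse = false;
    }

private:
    struct Entry {
        Buffer data;
        bool inUse = false;
    };
    std::map<int, Entry> entries_;  // std::map keeps element addresses stable
};
```

With this design, a send and a receive that accidentally use the same buffer ID no longer share memory silently: the second request gets `nullptr` and the caller has to wait, retry, or pick a different ID. Whether the caller should block or fall back to a fresh allocation is a design choice this sketch leaves open.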

More details are in this presentation.
MPI_Tags_issue.pdf

@s-mayani s-mayani added the bug Something isn't working label Sep 10, 2024