Cuda aware MPI is optional. #213
Conversation
Force-pushed from e653d7a to 7865164
cmake/dca_testing.cmake
Outdated
      target_link_libraries(${name} ${MPI_C_LIBRARIES})
    else()
      if (TEST_RUNNER)
        add_test(NAME ${name}
                 COMMAND ${TEST_RUNNER} ${MPIEXEC_NUMPROC_FLAG} 1
-                        ${MPIEXEC_PREFLAGS} ${SMPIARGS_FLAG_NOMPI} "$<TARGET_FILE:${name}>")
+                        ${MPIEXEC_PREFLAGS} "$<TARGET_FILE:${name}>")
@PDoakORNL: I am not really aware of what the whole *_CVD flags were supposed to do or when they were introduced, so if I am missing something by removing them, please let me know.
This indicates a test that needs the cuda visible device (CVD) wrapper so that each MPI rank sees only one GPU.
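A CVD wrapper of this kind is typically a small script that masks all but one GPU per local rank before launching the test binary. The sketch below is a hypothetical illustration, not the repository's cvdlauncher: `OMPI_COMM_WORLD_LOCAL_RANK` is Open MPI's local-rank variable, and other MPI implementations export a different name.

```shell
#!/bin/bash
# Hypothetical CVD launcher sketch (not the repository's cvdlauncher):
# pin each MPI rank to a single GPU by restricting CUDA_VISIBLE_DEVICES.
# OMPI_COMM_WORLD_LOCAL_RANK is an Open MPI convention; other MPI
# implementations use a different variable name (an assumption here).
rank=${OMPI_COMM_WORLD_LOCAL_RANK:-0}
# nvidia-smi -L prints one "GPU N: <name> (UUID: ...)" line per device.
ngpus=$(nvidia-smi -L 2>/dev/null | grep -c '^GPU')
[ "$ngpus" -gt 0 ] || ngpus=1   # degrade gracefully when no GPU is found
export CUDA_VISIBLE_DEVICES=$(( rank % ngpus ))
exec "$@"
```

With this wrapper as the test runner, each of N local ranks sees exactly one device, cycling through the GPUs on the node.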
Re-added the cvdlauncher and a cache flag that can point to it (to be set from the .cmake file). I kept the CUDA_CVD option out of the test function, as all the tests have the same requirements besides requiring CUDA and/or MPI.
Separate the CMake and source code modifications as much as possible and make separate PRs. We need to leave the CVD flags in to support systems that do not use jsrun but do have multiple GPUs. Currently the ringG test is the only example, but more are coming.
The manual flag to enable it is OK, but I'd rather see it just test for the capability.
The source modifications look pretty much ready to go.
set(SMPIARGS_FLAG_MPI "" CACHE STRING "Spectrum MPI argument list flag for MPI tests.")

# When we want to use a cuda visible devices restriction we need this flag
set(SMPIARGS_FLAG_MPI_CVD "--smpiargs=-gpu" CACHE STRING
Change the name to _MGPU or something and don't remove it. I have to build and test the code on more systems than a laptop and Summit, and it is useful to be able to partition the multi-GPU tests. This is a useful distinction as long as our tests target one node.
The additional flags can easily be added to SMPIARGS_FLAG_MPI, as they don't need to differ between tests, even if the MPI implementation is not Spectrum (sloppy naming on our side).
if (DCA_HAVE_CUDA)
  EXECUTE_PROCESS(COMMAND bash -c "nvidia-smi -L | awk 'BEGIN { num_gpu=0;} /GPU/ { num_gpu++;} END { printf(\"%d\", num_gpu) }'"
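The awk pipeline above counts the device lines printed by `nvidia-smi -L`, which emits one `GPU N: <name> (UUID: ...)` line per device. A hedged, roughly equivalent sketch in plain shell (assuming `nvidia-smi` is on the PATH; it prints 0 when it is not):

```shell
#!/bin/sh
# Count CUDA devices by counting the "GPU N: ..." lines nvidia-smi prints.
# Prints 0 when nvidia-smi is absent or reports no devices.
num_gpu=$(nvidia-smi -L 2>/dev/null | grep -c '^GPU')
echo "$num_gpu"
```

The result can then be cached in a CMake variable and compared against each test's GPU requirement when deciding which tests to register.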
This is quite useful to have to make decisions about which tests to add.
No test depends on the number of GPUs.
@PDoakORNL the ringG test does not use multiple GPUs: each rank has its own GPU. Currently DCA does not support multi-GPU.
I'm making some additional changes to this PR based on testing on non-Summit multi-GPU-per-node systems, which I will PR back to you @gbalduzz. This should hopefully mean we can meet our MPI/GPU testing needs without more changes, at least for a couple of months.
I will take a look maybe next week, if that fits your schedule...
Merged master.
The ringG test jsrun arguments are not correct on Summit.
The jsrun command line generated from ctest (which causes a hang and does not pick up GPUDirect on Summit) is:
The correct command line (which passed) should be:
So two command options are missing from the ctest-generated commands: can you add these two to the CMake-related files? Or am I missing any settings before running the distG4-related tests? Otherwise, the rest of the changes look good to me. Thanks.
The cvd launcher is not necessary on Summit: there is already one GPU per rank, and the test passes as it is.
OK, I re-ran it and it works now (not sure why the previous run failed). LGTM, thanks!
Edit: fixes #212.
Solves #210 manually.
Depends on #206 or #208. It seems that using the cvdlauncher script is not really necessary; just using the launch flag
--smpiargs="-gpu"
does the job on Summit.
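For reference, the flag would appear on a jsrun command line roughly as below. Only `--smpiargs="-gpu"` comes from this discussion; the resource flags (`-n` resource sets, `-a` tasks per set, `-g` GPUs per set) and the test binary name are illustrative assumptions, not taken from the PR.

```shell
#!/bin/sh
# Illustrative only: resource flags and binary name are assumptions;
# --smpiargs="-gpu" is the Spectrum MPI flag discussed in this PR.
cmd='jsrun -n 2 -a 1 -g 1 --smpiargs="-gpu" ./ringG_test'
echo "$cmd"
```

Passing the flag directly this way enables CUDA-aware (GPUDirect) communication in Spectrum MPI without wrapping the binary in a separate launcher script.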