
Cuda aware MPI is optional. #213

Merged: 9 commits, Aug 28, 2020

Conversation

@gbalduzz (Contributor) commented Aug 3, 2020:

Edit: fixes #212.
Manually resolves #210.
Depends on #206 or #208.

  • Reverts the detection of multiple GPUs and instead uses a CMake flag to enable CUDA-aware MPI.
  • Implements a (not really optimized) fallback for the ring algorithm that should make testing easier.

It seems that using the cvdlauncher script is not really necessary; just using the launch flag --smpiargs="-gpu" does the job on Summit.
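
For context, the gist of the CMake side is an opt-in flag that forwards a preprocessor definition to the source code. A minimal sketch, with option and definition names that are assumptions rather than the PR's exact spelling:

option(DCA_WITH_CUDA_AWARE_MPI "Assume the MPI implementation is CUDA-aware." OFF)
if (DCA_WITH_CUDA_AWARE_MPI)
  # Lets the source select the direct GPU-buffer communication path at compile time.
  add_compile_definitions(DCA_HAVE_CUDA_AWARE_MPI)
endif()

With the flag OFF, the (not really optimized) fallback mentioned above is used instead.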

@gbalduzz added the labels "enhancement" (New feature or request) and "merge later" (This PR depends on something else) on Aug 3, 2020.
target_link_libraries(${name} ${MPI_C_LIBRARIES})
else()
if (TEST_RUNNER)
add_test(NAME ${name}
COMMAND ${TEST_RUNNER} ${MPIEXEC_NUMPROC_FLAG} 1
${MPIEXEC_PREFLAGS} ${SMPIARGS_FLAG_NOMPI} "$<TARGET_FILE:${name}>")
${MPIEXEC_PREFLAGS} "$<TARGET_FILE:${name}>")
@gbalduzz (Contributor, Author):

@PDoakORNL: I am not really aware of what the whole *_CVD flags were supposed to do or when they were introduced, so if I am missing something by removing them, please let me know.

@PDoakORNL (Contributor):

This indicates a test that needs the CUDA visible device (CVD) wrapper so that each MPI rank sees only one GPU.
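
Such a wrapper typically maps each local MPI rank to one device through CUDA_VISIBLE_DEVICES before exec'ing the test binary. A minimal sketch, assuming an Open MPI/Spectrum MPI style local-rank variable (the repository's actual cvdlauncher.sh may differ):

#!/bin/bash
# Bind this MPI rank to a single GPU so each rank sees exactly one device.
# OMPI_COMM_WORLD_LOCAL_RANK is exported by Open MPI and Spectrum MPI;
# other launchers provide equivalents (e.g. SLURM_LOCALID).
local_rank=${OMPI_COMM_WORLD_LOCAL_RANK:-0}
export CUDA_VISIBLE_DEVICES=${local_rank}
exec "$@"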

@gbalduzz (Contributor, Author):

Re-added the cvdlauncher and a cache flag that can point to it (to be set from the .cmake file). I kept the CUDA_CVD option out of the test function, as all the tests have the same requirements besides requiring CUDA and/or MPI.
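
A sketch of how such a cache flag could slot into the test helper shown in the diff above (CVD_LAUNCHER is a placeholder name, not necessarily the one used in the PR):

set(CVD_LAUNCHER "" CACHE FILEPATH
    "Optional wrapper that restricts each MPI rank to a single GPU.")

add_test(NAME ${name}
         COMMAND ${TEST_RUNNER} ${MPIEXEC_NUMPROC_FLAG} 3
                 ${MPIEXEC_PREFLAGS} ${SMPIARGS_FLAG_MPI} ${CVD_LAUNCHER}
                 "$<TARGET_FILE:${name}>")

Since an empty cache variable expands to nothing, systems that do not need the wrapper are unaffected.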

@gbalduzz added the label "bug" (Something isn't working) and removed "merge later" (This PR depends on something else) on Aug 4, 2020.
@PDoakORNL (Contributor) left a review:

Separate the CMake and source code modifications as much as possible and make separate PRs. We need to leave the CVD flags in to support systems that do not use jsrun but do have multiple GPUs. Currently the ringG test is the only example, but more are coming.

The manual flag to enable it is okay, but I'd rather see it just test for the capability.

The source modifications look pretty much ready to go.


set(SMPIARGS_FLAG_MPI "" CACHE STRING "Spectrum MPI argument list flag for MPI tests.")

# When we want to use a cuda visible devices restriction we need this flag
set(SMPIARGS_FLAG_MPI_CVD "--smpiargs=-gpu" CACHE STRING
@PDoakORNL (Contributor):

Change the name to _MGPU or something and don't remove it. I have to build and test the code on more systems than a laptop and Summit, and it is useful to be able to partition the multi-GPU tests. This is a useful distinction as long as our tests target one node.

@gbalduzz (Contributor, Author):

The additional flags can easily be added to SMPIARGS_FLAG_MPI, as they don't need to differ between tests, even if the MPI implementation is not Spectrum MPI (sloppy naming on our side).
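
Concretely, a machine file preloaded with cmake -C (such as build-aux/summit.cmake) could then set the flag once for all MPI tests; a sketch:

set(SMPIARGS_FLAG_MPI "--smpiargs=-gpu" CACHE STRING
    "Launcher argument list for MPI tests.")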

if (DCA_HAVE_CUDA)
EXECUTE_PROCESS(COMMAND bash -c "nvidia-smi -L | awk 'BEGIN { num_gpu=0;} /GPU/ { num_gpu++;} END { printf(\"%d\", num_gpu) }'"
@PDoakORNL (Contributor):

This is quite useful to have when making decisions about which tests to add.

@gbalduzz (Contributor, Author):

No test depends on the number of GPUs.
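
For reference, the reverted detection read the device count into a variable roughly as follows (a completed version of the truncated snippet above; the output variable name is an assumption):

if (DCA_HAVE_CUDA)
  # Count the GPUs visible on the build machine via nvidia-smi.
  execute_process(
    COMMAND bash -c "nvidia-smi -L | awk 'BEGIN { num_gpu=0;} /GPU/ { num_gpu++;} END { printf(\"%d\", num_gpu) }'"
    OUTPUT_VARIABLE DCA_NUM_GPUS)
endif()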

@gbalduzz (Contributor, Author) commented Aug 6, 2020:

@PDoakORNL the ringG test does not use multiple GPUs per rank: each rank has its own GPU. Currently DCA does not support multi-GPU.

@PDoakORNL (Contributor) commented:

I'm making some additional changes to this PR based on testing on non-Summit systems with multiple GPUs per node, which I will PR back to you, @gbalduzz. This should hopefully mean we can meet our MPI/GPU testing needs without more changes, at least for a couple of months.

@weilewei (Contributor) commented:

I will take a look maybe next week, if that fits your schedule...

@gbalduzz (Contributor, Author) commented:

Merged master.
Note that the ringG test is an integration test, so it requires the DCA_WITH_TESTS_EXTENSIVE=ON flag to be built. I would have preferred it as a unit test, but it does not really matter, and I did not want to change the original placement of the test.

@weilewei (Contributor) commented:

The ringG test's jsrun arguments are not correct on Summit:

cmake -C ../build-aux/summit.cmake -DDCA_WITH_TESTS_EXTENSIVE=On ..
make ringG_tp_accumulator_gpu_test
cd test/integration/cluster_solver/shared_tools/accumulation/tp/

The jsrun command line generated by ctest (which hangs and does not pick up GPUDirect on Summit) is:

bash-4.2$ ctest -V -N
....
1: Test command: /sw/summit/xalt/1.2.0/bin/jsrun "-n" "3" "-a" "1" "-g" "1" "-c" "5" "/gpfs/alpine/proj-shared/cph102/weile/dev/src/dca_giovanni/DCA/build_test/test/integration/cluster_solver/shared_tools/accumulation/tp/ringG_tp_accumulator_gpu_test"
  Test #1: ringG_tp_accumulator_gpu_test

The correct command line should be (which passed):

jsrun -n3 -a1 -c7 -g1 -b rs --smpiargs="-gpu" ./cvdlauncher.sh ./ringG_tp_accumulator_gpu_test

So two options are missing from the ctest-generated command: --smpiargs="-gpu" and ./cvdlauncher.sh.

Can you add these two to the CMake-related files? Or am I missing some setting before running the distG4-related test? Otherwise, the rest of the changes look good to me. Thanks.

@gbalduzz (Contributor, Author) commented:

The CVD launcher is not necessary on Summit: there is already one GPU per rank, and the test passes as it is:

bash-4.2$ ctest -V   
UpdateCTestConfiguration  from :/gpfs/alpine/proj-shared/cph102/gbalduzz/DCA/build2/test/integration/cluster_solver/shared_tools/accumulation/tp/DartConfiguration.tcl
UpdateCTestConfiguration  from :/gpfs/alpine/proj-shared/cph102/gbalduzz/DCA/build2/test/integration/cluster_solver/shared_tools/accumulation/tp/DartConfiguration.tcl
Test project /gpfs/alpine/proj-shared/cph102/gbalduzz/DCA/build2/test/integration/cluster_solver/shared_tools/accumulation/tp
Constructing a list of tests
Done constructing a list of tests
Updating test list for fixtures
Added 0 tests to meet fixture requirements
Checking test dependency graph...
Checking test dependency graph end
test 1
    Start 1: ringG_tp_accumulator_gpu_test

1: Test command: /sw/summit/xalt/1.2.0/bin/jsrun "-n" "3" "-a" "1" "-g" "1" "-c" "5" "//gpfs/alpine/proj-shared/cph102/gbalduzz/DCA/build2/test/integration/cluster_solver/shared_tools/accumulation/tp/ringG_tp_accumulator_gpu_test"
1: Test timeout computed to be: 10000000
1: Running main() from gtest_main.cc
1: [==========] Running 1 test from 1 test case.
1: [----------] Global test environment set-up.
1: [----------] 1 test from DistributedTpAccumulatorGpuTest
1: [ RUN      ] DistributedTpAccumulatorGpuTest.Accumulate
1: Running main() from gtest_main.cc
1: [==========] Running 1 test from 1 test case.
1: [----------] Global test environment set-up.
1: [----------] 1 test from DistributedTpAccumulatorGpuTest
1: [ RUN      ] DistributedTpAccumulatorGpuTest.Accumulate
1: Running main() from gtest_main.cc
1: [==========] Running 1 test from 1 test case.
1: [----------] Global test environment set-up.
1: [----------] 1 test from DistributedTpAccumulatorGpuTest
1: [ RUN      ] DistributedTpAccumulatorGpuTest.Accumulate
1: 
1: 
1: 	opening file : //gpfs/alpine/proj-shared/cph102/gbalduzz/DCA/test/integration/cluster_solver/shared_tools/accumulation/tp/input_4x4_multitransfer.json
1: 
1: 
1: 	 Parsing completed! read 1103 characters and 66 lines.
1: 	 name        : CLUSTER MOMENTUM_SPACE BRILLOUIN_ZONE (DIMENSION : 2)
1: 	 name (dual) : CLUSTER REAL_SPACE BRILLOUIN_ZONE (DIMENSION : 2)
1: 
1: 	 size        : 4
1: 
1: 			MOMENTUM_SPACE			|	REAL_SPACE
1: 	 origin-index : 0				|	0
1: 	 volume       : 3.947842e+01			|	4.000000e+00
1: 
1: 	 basis : 
1: 			3.141593e+00	-0.000000e+00	|	1.000000e+00	0.000000e+00	
1: 			-0.000000e+00	3.141593e+00	|	0.000000e+00	1.000000e+00	
1: 
1: 	 super-basis : 
1: 			6.283185e+00	-0.000000e+00	|	2.000000e+00	0.000000e+00	
1: 			-0.000000e+00	6.283185e+00	|	0.000000e+00	2.000000e+00	
1: 
1: 	 inverse-basis : 
1: 			3.183099e-01	0.000000e+00	|	1.000000e+00	-0.000000e+00	
1: 			0.000000e+00	3.183099e-01	|	-0.000000e+00	1.000000e+00	
1: 
1: 	 inverse-super-basis : 
1: 			1.591549e-01	0.000000e+00	|	5.000000e-01	-0.000000e+00	
1: 			0.000000e+00	1.591549e-01	|	-0.000000e+00	5.000000e-01	
1: 
1: 
1: 	0	|	0.000000e+00	0.000000e+00		0.000000e+00	0.000000e+00	
1: 	1	|	0.000000e+00	3.141593e+00		0.000000e+00	1.000000e+00	
1: 	2	|	3.141593e+00	0.000000e+00		1.000000e+00	0.000000e+00	
1: 	3	|	3.141593e+00	3.141593e+00		1.000000e+00	1.000000e+00	
1: 
1: 
1: 	MOMENTUM_SPACE k-space symmetries : 
1: 
1: 	0, 0	|		0, 0	0, 0	0, 0	0, 0	0, 0	0, 0	0, 0	0, 0
1: 	0, 1	|		0, 1	0, 1	0, 1	0, 1	0, 1	0, 1	0, 1	0, 1
1: 	1, 0	|		1, 0	2, 0	1, 0	2, 0	2, 0	1, 0	2, 0	1, 0
1: 	1, 1	|		1, 1	2, 1	1, 1	2, 1	2, 1	1, 1	2, 1	1, 1
1: 	2, 0	|		2, 0	1, 0	2, 0	1, 0	1, 0	2, 0	1, 0	2, 0
1: 	2, 1	|		2, 1	1, 1	2, 1	1, 1	1, 1	2, 1	1, 1	2, 1
1: 	3, 0	|		3, 0	3, 0	3, 0	3, 0	3, 0	3, 0	3, 0	3, 0
1: 	3, 1	|		3, 1	3, 1	3, 1	3, 1	3, 1	3, 1	3, 1	3, 1
1: 
1: 
1: 
1: 	REAL_SPACE symmetries : 
1: 
1: 	0, 0	|		0, 0	0, 0	0, 0	0, 0	0, 0	0, 0	0, 0	0, 0
1: 	0, 1	|		0, 1	0, 1	0, 1	0, 1	0, 1	0, 1	0, 1	0, 1
1: 	1, 0	|		1, 0	2, 0	1, 0	2, 0	2, 0	1, 0	2, 0	1, 0
1: 	1, 1	|		1, 1	2, 1	1, 1	2, 1	2, 1	1, 1	2, 1	1, 1
1: 	2, 0	|		2, 0	1, 0	2, 0	1, 0	1, 0	2, 0	1, 0	2, 0
1: 	2, 1	|		2, 1	1, 1	2, 1	1, 1	1, 1	2, 1	1, 1	2, 1
1: 	3, 0	|		3, 0	3, 0	3, 0	3, 0	3, 0	3, 0	3, 0	3, 0
1: 	3, 1	|		3, 1	3, 1	3, 1	3, 1	3, 1	3, 1	3, 1	3, 1
1: 
1: 	 name        : LATTICE_SP MOMENTUM_SPACE BRILLOUIN_ZONE (DIMENSION : 2)
1: 	 name (dual) : LATTICE_SP REAL_SPACE BRILLOUIN_ZONE (DIMENSION : 2)
1: 
1: 	 size        : 1
1: 
1: 			MOMENTUM_SPACE			|	REAL_SPACE
1: 	 origin-index : 0				|	0
1: 	 volume       : 3.947842e+01			|	1.000000e+00
1: 
1: 	 basis : 
1: 			6.283185e+00	-0.000000e+00	|	1.000000e+00	0.000000e+00	
1: 			-0.000000e+00	6.283185e+00	|	0.000000e+00	1.000000e+00	
1: 
1: 	 super-basis : 
1: 			6.283185e+00	-0.000000e+00	|	1.000000e+00	0.000000e+00	
1: 			-0.000000e+00	6.283185e+00	|	0.000000e+00	1.000000e+00	
1: 
1: 	 inverse-basis : 
1: 			1.591549e-01	0.000000e+00	|	1.000000e+00	-0.000000e+00	
1: 			0.000000e+00	1.591549e-01	|	-0.000000e+00	1.000000e+00	
1: 
1: 	 inverse-super-basis : 
1: 			1.591549e-01	0.000000e+00	|	1.000000e+00	-0.000000e+00	
1: 			0.000000e+00	1.591549e-01	|	-0.000000e+00	1.000000e+00	
1: 
1: 
1: 	 name        : LATTICE_TP MOMENTUM_SPACE BRILLOUIN_ZONE (DIMENSION : 2)
1: 	 name (dual) : LATTICE_TP REAL_SPACE BRILLOUIN_ZONE (DIMENSION : 2)
1: 
1: 	 size        : 4
1: 
1: 			MOMENTUM_SPACE			|	REAL_SPACE
1: 	 origin-index : 0				|	0
1: 	 volume       : 3.947842e+01			|	4.000000e+00
1: 
1: 	 basis : 
1: 			3.141593e+00	-0.000000e+00	|	1.000000e+00	0.000000e+00	
1: 			-0.000000e+00	3.141593e+00	|	0.000000e+00	1.000000e+00	
1: 
1: 	 super-basis : 
1: 			6.283185e+00	-0.000000e+00	|	2.000000e+00	0.000000e+00	
1: 			-0.000000e+00	6.283185e+00	|	0.000000e+00	2.000000e+00	
1: 
1: 	 inverse-basis : 
1: 			3.183099e-01	0.000000e+00	|	1.000000e+00	-0.000000e+00	
1: 			0.000000e+00	3.183099e-01	|	-0.000000e+00	1.000000e+00	
1: 
1: 	 inverse-super-basis : 
1: 			1.591549e-01	0.000000e+00	|	5.000000e-01	-0.000000e+00	
1: 			0.000000e+00	1.591549e-01	|	-0.000000e+00	5.000000e-01	
1: 
1: 
1: H_0 and H_int initialization start:    26-08-2020 14:39:47
1: H_0 and H_int initialization end:      26-08-2020 14:39:47
1: H_0 and H_int initialization duration: 7.881000e-05 s
1: 
1: G_0 initialization start:    26-08-2020 14:39:47
1: G_0 initialization end:      26-08-2020 14:39:47
1: G_0 initialization duration: 1.495300e-05 s
1: 
1: [       OK ] DistributedTpAccumulatorGpuTest.Accumulate (2644 ms)
1: [----------] 1 test from DistributedTpAccumulatorGpuTest (2644 ms total)
1: 
1: [----------] Global test environment tear-down
1: [==========] 1 test from 1 test case ran. (2644 ms total)
1: [  PASSED  ] 1 test.
1: [       OK ] DistributedTpAccumulatorGpuTest.Accumulate (2644 ms)
1: [----------] 1 test from DistributedTpAccumulatorGpuTest (2644 ms total)
1: 
1: [----------] Global test environment tear-down
1: [==========] 1 test from 1 test case ran. (2644 ms total)
1: [  PASSED  ] 1 test.
1: [       OK ] DistributedTpAccumulatorGpuTest.Accumulate (2644 ms)
1: [----------] 1 test from DistributedTpAccumulatorGpuTest (2644 ms total)
1: 
1: [----------] Global test environment tear-down
1: [==========] 1 test from 1 test case ran. (2644 ms total)
1: [  PASSED  ] 1 test.
1/1 Test #1: ringG_tp_accumulator_gpu_test ....   Passed    4.63 sec

@weilewei (Contributor) commented:

OK, I re-ran it and it works now (not sure why the previous run failed). LGTM, thanks!

@PDoakORNL merged commit 8c95662 into CompFUSE:master on Aug 28, 2020.
@gbalduzz deleted the optional_cuda_aware_mpi branch on September 18, 2020.
Linked issue: G4 ring test is broken (#212).