
Cuda aware MPI is optional. #213

Merged: 9 commits, Aug 28, 2020

Conversation

@gbalduzz (Contributor) commented Aug 3, 2020:

Edit: fixes #212.
Manually resolves #210.
Depends on #206 or #208.

  • Reverts the detection of multiple GPUs and instead uses a CMake flag to enable CUDA-aware MPI.
  • Implements a (not really optimized) fallback for the ring algorithm that should make testing easier.

It seems that using the cvdlauncher script is not really necessary; just using the launch flag --smpiargs="-gpu" does the job on Summit.
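
For context, the gist of the CMake side is an opt-in flag that forwards a preprocessor definition to the source code. A minimal sketch, with option and definition names that are assumptions rather than the PR's exact spelling:

option(DCA_WITH_CUDA_AWARE_MPI "Assume the MPI implementation is CUDA-aware." OFF)
if (DCA_WITH_CUDA_AWARE_MPI)
  # Lets the source select the direct GPU-buffer communication path at compile time.
  add_compile_definitions(DCA_HAVE_CUDA_AWARE_MPI)
endif()

With the flag OFF, the (not really optimized) fallback mentioned above is used instead.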

@gbalduzz added the labels "enhancement" (New feature or request) and "merge later" (This PR depends on something else) on Aug 3, 2020.
target_link_libraries(${name} ${MPI_C_LIBRARIES})
else()
if (TEST_RUNNER)
add_test(NAME ${name}
COMMAND ${TEST_RUNNER} ${MPIEXEC_NUMPROC_FLAG} 1
${MPIEXEC_PREFLAGS} ${SMPIARGS_FLAG_NOMPI} "$<TARGET_FILE:${name}>")
${MPIEXEC_PREFLAGS} "$<TARGET_FILE:${name}>")
@gbalduzz (Contributor, Author):

@PDoakORNL: I am not really aware of what the whole *_CVD flags were supposed to do or when they were introduced, so if I am missing something by removing them, please let me know.

@PDoakORNL (Contributor):

This indicates a test that needs the CUDA visible device (CVD) wrapper so that each MPI rank sees only one GPU.
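
Such a wrapper typically maps each local MPI rank to one device through CUDA_VISIBLE_DEVICES before exec'ing the test binary. A minimal sketch, assuming an Open MPI/Spectrum MPI style local-rank variable (the repository's actual cvdlauncher.sh may differ):

#!/bin/bash
# Bind this MPI rank to a single GPU so each rank sees exactly one device.
# OMPI_COMM_WORLD_LOCAL_RANK is exported by Open MPI and Spectrum MPI;
# other launchers provide equivalents (e.g. SLURM_LOCALID).
local_rank=${OMPI_COMM_WORLD_LOCAL_RANK:-0}
export CUDA_VISIBLE_DEVICES=${local_rank}
exec "$@"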

@gbalduzz (Contributor, Author):

Re-added the cvdlauncher and a cache flag that can point to it (to be set from the .cmake file). I kept the CUDA_CVD option out of the test function, as all the tests have the same requirements besides requiring CUDA and/or MPI.
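
A sketch of how such a cache flag could slot into the test helper shown in the diff above (CVD_LAUNCHER is a placeholder name, not necessarily the one used in the PR):

set(CVD_LAUNCHER "" CACHE FILEPATH
    "Optional wrapper that restricts each MPI rank to a single GPU.")

add_test(NAME ${name}
         COMMAND ${TEST_RUNNER} ${MPIEXEC_NUMPROC_FLAG} 3
                 ${MPIEXEC_PREFLAGS} ${SMPIARGS_FLAG_MPI} ${CVD_LAUNCHER}
                 "$<TARGET_FILE:${name}>")

Since an empty cache variable expands to nothing, systems that do not need the wrapper are unaffected.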

@gbalduzz added the label "bug" (Something isn't working) and removed "merge later" (This PR depends on something else) on Aug 4, 2020.
@PDoakORNL (Contributor) left a review:

Separate the CMake and source code modifications as much as possible and make separate PRs. We need to leave the CVD flags in to support systems that do not use jsrun but do have multiple GPUs. Currently the ringG test is the only example, but more are coming.

The manual flag to enable it is okay, but I'd rather see it just test for the capability.

The source modifications look pretty much ready to go.


set(SMPIARGS_FLAG_MPI "" CACHE STRING "Spectrum MPI argument list flag for MPI tests.")

# When we want to use a cuda visible devices restriction we need this flag
set(SMPIARGS_FLAG_MPI_CVD "--smpiargs=-gpu" CACHE STRING
@PDoakORNL (Contributor):

Change the name to _MGPU or something and don't remove it. I have to build and test the code on more systems than a laptop and Summit, and it is useful to be able to partition the multi-GPU tests. This is a useful distinction as long as our tests target one node.

@gbalduzz (Contributor, Author):

The additional flags can easily be added to SMPIARGS_FLAG_MPI, as they don't need to differ between tests, even if the MPI implementation is not Spectrum MPI (sloppy naming on our side).
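
Concretely, a machine file preloaded with cmake -C (such as build-aux/summit.cmake) could then set the flag once for all MPI tests; a sketch:

set(SMPIARGS_FLAG_MPI "--smpiargs=-gpu" CACHE STRING
    "Launcher argument list for MPI tests.")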

if (DCA_HAVE_CUDA)
EXECUTE_PROCESS(COMMAND bash -c "nvidia-smi -L | awk 'BEGIN { num_gpu=0;} /GPU/ { num_gpu++;} END { printf(\"%d\", num_gpu) }'"
@PDoakORNL (Contributor):

This is quite useful to have when making decisions about which tests to add.

@gbalduzz (Contributor, Author):

No test depends on the number of GPUs.
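
For reference, the reverted detection read the device count into a variable roughly as follows (a completed version of the truncated snippet above; the output variable name is an assumption):

if (DCA_HAVE_CUDA)
  # Count the GPUs visible on the build machine via nvidia-smi.
  execute_process(
    COMMAND bash -c "nvidia-smi -L | awk 'BEGIN { num_gpu=0;} /GPU/ { num_gpu++;} END { printf(\"%d\", num_gpu) }'"
    OUTPUT_VARIABLE DCA_NUM_GPUS)
endif()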

@gbalduzz (Contributor, Author) commented Aug 6, 2020:

@PDoakORNL the ringG test does not use multiple GPUs per rank: each rank has its own GPU. Currently DCA does not support multi-GPU.

@PDoakORNL (Contributor) commented:

I'm making some additional changes to this PR based on testing on non-Summit systems with multiple GPUs per node, which I will PR back to you, @gbalduzz. This should hopefully mean we can meet our MPI/GPU testing needs without more changes, at least for a couple of months.

@weilewei (Contributor) commented:

I will take a look maybe next week, if that fits your schedule...

@gbalduzz (Contributor, Author) commented:

Merged master.
Note that the ringG test is an integration test, so it requires the DCA_WITH_TESTS_EXTENSIVE=ON flag to be built. I would have preferred it as a unit test, but it does not really matter, and I did not want to change the original placement of the test.

@weilewei (Contributor) commented:

The ringG test's jsrun arguments are not correct on Summit:

cmake -C ../build-aux/summit.cmake -DDCA_WITH_TESTS_EXTENSIVE=On ..
make ringG_tp_accumulator_gpu_test
cd test/integration/cluster_solver/shared_tools/accumulation/tp/

The jsrun command line generated by ctest (which hangs and does not pick up GPUDirect on Summit) is:

bash-4.2$ ctest -V -N
....
1: Test command: /sw/summit/xalt/1.2.0/bin/jsrun "-n" "3" "-a" "1" "-g" "1" "-c" "5" "/gpfs/alpine/proj-shared/cph102/weile/dev/src/dca_giovanni/DCA/build_test/test/integration/cluster_solver/shared_tools/accumulation/tp/ringG_tp_accumulator_gpu_test"
  Test #1: ringG_tp_accumulator_gpu_test

The correct command line should be (which passed):

jsrun -n3 -a1 -c7 -g1 -b rs --smpiargs="-gpu" ./cvdlauncher.sh ./ringG_tp_accumulator_gpu_test

So two options are missing from the ctest-generated command: --smpiargs="-gpu" and ./cvdlauncher.sh.

Can you add these two to the CMake-related files? Or am I missing some setting before running the distG4-related test? Otherwise, the rest of the changes look good to me. Thanks.

@gbalduzz (Contributor, Author) commented:

The CVD launcher is not necessary on Summit: there is already one GPU per rank, and the test passes as it is:

bash-4.2$ ctest -V   
UpdateCTestConfiguration  from :/gpfs/alpine/proj-shared/cph102/gbalduzz/DCA/build2/test/integration/cluster_solver/shared_tools/accumulation/tp/DartConfiguration.tcl
UpdateCTestConfiguration  from :/gpfs/alpine/proj-shared/cph102/gbalduzz/DCA/build2/test/integration/cluster_solver/shared_tools/accumulation/tp/DartConfiguration.tcl
Test project /gpfs/alpine/proj-shared/cph102/gbalduzz/DCA/build2/test/integration/cluster_solver/shared_tools/accumulation/tp
Constructing a list of tests
Done constructing a list of tests
Updating test list for fixtures
Added 0 tests to meet fixture requirements
Checking test dependency graph...
Checking test dependency graph end
test 1
    Start 1: ringG_tp_accumulator_gpu_test

1: Test command: /sw/summit/xalt/1.2.0/bin/jsrun "-n" "3" "-a" "1" "-g" "1" "-c" "5" "//gpfs/alpine/proj-shared/cph102/gbalduzz/DCA/build2/test/integration/cluster_solver/shared_tools/accumulation/tp/ringG_tp_accumulator_gpu_test"
1: Test timeout computed to be: 10000000
1: Running main() from gtest_main.cc
1: [==========] Running 1 test from 1 test case.
1: [----------] Global test environment set-up.
1: [----------] 1 test from DistributedTpAccumulatorGpuTest
1: [ RUN      ] DistributedTpAccumulatorGpuTest.Accumulate
1: Running main() from gtest_main.cc
1: [==========] Running 1 test from 1 test case.
1: [----------] Global test environment set-up.
1: [----------] 1 test from DistributedTpAccumulatorGpuTest
1: [ RUN      ] DistributedTpAccumulatorGpuTest.Accumulate
1: Running main() from gtest_main.cc
1: [==========] Running 1 test from 1 test case.
1: [----------] Global test environment set-up.
1: [----------] 1 test from DistributedTpAccumulatorGpuTest
1: [ RUN      ] DistributedTpAccumulatorGpuTest.Accumulate
1: 
1: 
1: 	opening file : //gpfs/alpine/proj-shared/cph102/gbalduzz/DCA/test/integration/cluster_solver/shared_tools/accumulation/tp/input_4x4_multitransfer.json
1: 
1: 
1: 	 Parsing completed! read 1103 characters and 66 lines.
1: 	 name        : CLUSTER MOMENTUM_SPACE BRILLOUIN_ZONE (DIMENSION : 2)
1: 	 name (dual) : CLUSTER REAL_SPACE BRILLOUIN_ZONE (DIMENSION : 2)
1: 
1: 	 size        : 4
1: 
1: 			MOMENTUM_SPACE			|	REAL_SPACE
1: 	 origin-index : 0				|	0
1: 	 volume       : 3.947842e+01			|	4.000000e+00
1: 
1: 	 basis : 
1: 			3.141593e+00	-0.000000e+00	|	1.000000e+00	0.000000e+00	
1: 			-0.000000e+00	3.141593e+00	|	0.000000e+00	1.000000e+00	
1: 
1: 	 super-basis : 
1: 			6.283185e+00	-0.000000e+00	|	2.000000e+00	0.000000e+00	
1: 			-0.000000e+00	6.283185e+00	|	0.000000e+00	2.000000e+00	
1: 
1: 	 inverse-basis : 
1: 			3.183099e-01	0.000000e+00	|	1.000000e+00	-0.000000e+00	
1: 			0.000000e+00	3.183099e-01	|	-0.000000e+00	1.000000e+00	
1: 
1: 	 inverse-super-basis : 
1: 			1.591549e-01	0.000000e+00	|	5.000000e-01	-0.000000e+00	
1: 			0.000000e+00	1.591549e-01	|	-0.000000e+00	5.000000e-01	
1: 
1: 
1: 	0	|	0.000000e+00	0.000000e+00		0.000000e+00	0.000000e+00	
1: 	1	|	0.000000e+00	3.141593e+00		0.000000e+00	1.000000e+00	
1: 	2	|	3.141593e+00	0.000000e+00		1.000000e+00	0.000000e+00	
1: 	3	|	3.141593e+00	3.141593e+00		1.000000e+00	1.000000e+00	
1: 
1: 
1: 	MOMENTUM_SPACE k-space symmetries : 
1: 
1: 	0, 0	|		0, 0	0, 0	0, 0	0, 0	0, 0	0, 0	0, 0	0, 0
1: 	0, 1	|		0, 1	0, 1	0, 1	0, 1	0, 1	0, 1	0, 1	0, 1
1: 	1, 0	|		1, 0	2, 0	1, 0	2, 0	2, 0	1, 0	2, 0	1, 0
1: 	1, 1	|		1, 1	2, 1	1, 1	2, 1	2, 1	1, 1	2, 1	1, 1
1: 	2, 0	|		2, 0	1, 0	2, 0	1, 0	1, 0	2, 0	1, 0	2, 0
1: 	2, 1	|		2, 1	1, 1	2, 1	1, 1	1, 1	2, 1	1, 1	2, 1
1: 	3, 0	|		3, 0	3, 0	3, 0	3, 0	3, 0	3, 0	3, 0	3, 0
1: 	3, 1	|		3, 1	3, 1	3, 1	3, 1	3, 1	3, 1	3, 1	3, 1
1: 
1: 
1: 
1: 	REAL_SPACE symmetries : 
1: 
1: 	0, 0	|		0, 0	0, 0	0, 0	0, 0	0, 0	0, 0	0, 0	0, 0
1: 	0, 1	|		0, 1	0, 1	0, 1	0, 1	0, 1	0, 1	0, 1	0, 1
1: 	1, 0	|		1, 0	2, 0	1, 0	2, 0	2, 0	1, 0	2, 0	1, 0
1: 	1, 1	|		1, 1	2, 1	1, 1	2, 1	2, 1	1, 1	2, 1	1, 1
1: 	2, 0	|		2, 0	1, 0	2, 0	1, 0	1, 0	2, 0	1, 0	2, 0
1: 	2, 1	|		2, 1	1, 1	2, 1	1, 1	1, 1	2, 1	1, 1	2, 1
1: 	3, 0	|		3, 0	3, 0	3, 0	3, 0	3, 0	3, 0	3, 0	3, 0
1: 	3, 1	|		3, 1	3, 1	3, 1	3, 1	3, 1	3, 1	3, 1	3, 1
1: 
1: 	 name        : LATTICE_SP MOMENTUM_SPACE BRILLOUIN_ZONE (DIMENSION : 2)
1: 	 name (dual) : LATTICE_SP REAL_SPACE BRILLOUIN_ZONE (DIMENSION : 2)
1: 
1: 	 size        : 1
1: 
1: 			MOMENTUM_SPACE			|	REAL_SPACE
1: 	 origin-index : 0				|	0
1: 	 volume       : 3.947842e+01			|	1.000000e+00
1: 
1: 	 basis : 
1: 			6.283185e+00	-0.000000e+00	|	1.000000e+00	0.000000e+00	
1: 			-0.000000e+00	6.283185e+00	|	0.000000e+00	1.000000e+00	
1: 
1: 	 super-basis : 
1: 			6.283185e+00	-0.000000e+00	|	1.000000e+00	0.000000e+00	
1: 			-0.000000e+00	6.283185e+00	|	0.000000e+00	1.000000e+00	
1: 
1: 	 inverse-basis : 
1: 			1.591549e-01	0.000000e+00	|	1.000000e+00	-0.000000e+00	
1: 			0.000000e+00	1.591549e-01	|	-0.000000e+00	1.000000e+00	
1: 
1: 	 inverse-super-basis : 
1: 			1.591549e-01	0.000000e+00	|	1.000000e+00	-0.000000e+00	
1: 			0.000000e+00	1.591549e-01	|	-0.000000e+00	1.000000e+00	
1: 
1: 
1: 	 name        : LATTICE_TP MOMENTUM_SPACE BRILLOUIN_ZONE (DIMENSION : 2)
1: 	 name (dual) : LATTICE_TP REAL_SPACE BRILLOUIN_ZONE (DIMENSION : 2)
1: 
1: 	 size        : 4
1: 
1: 			MOMENTUM_SPACE			|	REAL_SPACE
1: 	 origin-index : 0				|	0
1: 	 volume       : 3.947842e+01			|	4.000000e+00
1: 
1: 	 basis : 
1: 			3.141593e+00	-0.000000e+00	|	1.000000e+00	0.000000e+00	
1: 			-0.000000e+00	3.141593e+00	|	0.000000e+00	1.000000e+00	
1: 
1: 	 super-basis : 
1: 			6.283185e+00	-0.000000e+00	|	2.000000e+00	0.000000e+00	
1: 			-0.000000e+00	6.283185e+00	|	0.000000e+00	2.000000e+00	
1: 
1: 	 inverse-basis : 
1: 			3.183099e-01	0.000000e+00	|	1.000000e+00	-0.000000e+00	
1: 			0.000000e+00	3.183099e-01	|	-0.000000e+00	1.000000e+00	
1: 
1: 	 inverse-super-basis : 
1: 			1.591549e-01	0.000000e+00	|	5.000000e-01	-0.000000e+00	
1: 			0.000000e+00	1.591549e-01	|	-0.000000e+00	5.000000e-01	
1: 
1: 
1: H_0 and H_int initialization start:    26-08-2020 14:39:47
1: H_0 and H_int initialization end:      26-08-2020 14:39:47
1: H_0 and H_int initialization duration: 7.881000e-05 s
1: 
1: G_0 initialization start:    26-08-2020 14:39:47
1: G_0 initialization end:      26-08-2020 14:39:47
1: G_0 initialization duration: 1.495300e-05 s
1: 
1: [       OK ] DistributedTpAccumulatorGpuTest.Accumulate (2644 ms)
1: [----------] 1 test from DistributedTpAccumulatorGpuTest (2644 ms total)
1: 
1: [----------] Global test environment tear-down
1: [==========] 1 test from 1 test case ran. (2644 ms total)
1: [  PASSED  ] 1 test.
1: [       OK ] DistributedTpAccumulatorGpuTest.Accumulate (2644 ms)
1: [----------] 1 test from DistributedTpAccumulatorGpuTest (2644 ms total)
1: 
1: [----------] Global test environment tear-down
1: [==========] 1 test from 1 test case ran. (2644 ms total)
1: [  PASSED  ] 1 test.
1: [       OK ] DistributedTpAccumulatorGpuTest.Accumulate (2644 ms)
1: [----------] 1 test from DistributedTpAccumulatorGpuTest (2644 ms total)
1: 
1: [----------] Global test environment tear-down
1: [==========] 1 test from 1 test case ran. (2644 ms total)
1: [  PASSED  ] 1 test.
1/1 Test #1: ringG_tp_accumulator_gpu_test ....   Passed    4.63 sec

@weilewei (Contributor) commented:

OK, I re-ran it and it works now (not sure why the previous run failed). LGTM, thanks!

@PDoakORNL merged commit 8c95662 into CompFUSE:master on Aug 28, 2020.
@gbalduzz deleted the optional_cuda_aware_mpi branch on September 18, 2020.
Linked issue: G4 ring test is broken (#212).