
Mpas testing #2

Open · wants to merge 6 commits into base: initial-testing-framework_simplified
Conversation

@abishekg7 (Owner) commented Mar 5, 2025

This PR builds on MPAS-Dev#1284 to add runtime tests for MPAS.

To launch tests from the root directory:

```shell
.ci/hpc-workflows/.ci/runner.py .ci/mpas_tests.jsonc -t <toolchain> -a <account>
```

The following toolchains/builds are proposed:

  • GNU: gnu, gnu-double, gpu-openmp, gnu-debug
  • NVHPC: nvhpc-cpu, nvhpc-gpu, nvhpc-gpu-double
  • Intel: intel, intel-double

The following test scenarios are proposed:

  • Jablonowski-Williamson baroclinic wave (dycore)
  • Limited area over CONUS
  • Aquaplanet
  • Squall line

The following test types are proposed, orthogonal to the test scenarios above:

  • Base: The simplest test configuration for a given scenario. The output restart file is checked for bit-identical results against a stored reference.
  • Restart: Tests the ability to proceed from an intermediate restart file (provided by the base run) and produce results bit-identical to the base run.
  • Parallel: Tests various parallel implementations/configurations. The resulting restart files are checked for bit-identical results against the base runs. This is further classified into:
    • MPI
    • OMP
    • Multinode CPU
    • Multi-GPU
    • Multinode GPU
  • Performance: Five instances of a shorter run are performed and the timing statistics are compared against references stored in a database. The idea would be to run a nightly build plus performance runs for the develop branch and store the statistics in an SQLite database. When the performance check is run for a PR, the average of the last 5-10 days could then be used to decide whether the PR's performance has regressed.
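The performance comparison described above could be sketched roughly as follows. The `timings` table schema, the `test_id` naming, and the 10% tolerance are illustrative assumptions, not the actual database layout:

```python
import sqlite3
from datetime import date, timedelta

TOLERANCE = 1.10  # illustrative threshold: flag runs >10% slower than the rolling mean


def check_regression(db_path, test_id, new_walltime, window_days=10):
    """Compare a PR's timing against the rolling mean of recent develop runs.

    Assumes a hypothetical timings(test_id TEXT, run_date TEXT, walltime REAL)
    table populated by the nightly develop runs. Returns True if the new timing
    is within tolerance, False if it regressed, None if no reference data exists.
    """
    cutoff = (date.today() - timedelta(days=window_days)).isoformat()
    with sqlite3.connect(db_path) as conn:
        row = conn.execute(
            "SELECT AVG(walltime) FROM timings WHERE test_id = ? AND run_date >= ?",
            (test_id, cutoff),
        ).fetchone()
    baseline = row[0]
    if baseline is None:
        return None  # no nightly data in the window; cannot judge
    return new_walltime <= baseline * TOLERANCE
```

Averaging over a multi-day window, rather than comparing against a single nightly run, damps the node-to-node timing noise typical of shared HPC systems.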

Test matrix:
The proposal is to avoid running all possible combinations of (toolchain × scenario × test) and instead carefully map out a minimal set of tests in this space that provides a reasonable guarantee of correctness while keeping an eye on performance regressions.
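The bit-for-bit checks that the Base, Restart, and Parallel tests rely on can be sketched as below. `bit_identical` is a hypothetical helper, not the harness's actual comparator: byte-wise comparison assumes the restart format embeds no timestamps, and the real check may instead compare selected NetCDF variables.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large restart files are never read whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def bit_identical(candidate, reference):
    """True if the candidate restart file matches the reference byte for byte."""
    candidate, reference = Path(candidate), Path(reference)
    if candidate.stat().st_size != reference.stat().st_size:
        return False  # cheap early exit before hashing anything
    return sha256_of(candidate) == sha256_of(reference)
```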

GNU (OPTIMIZED)

| Scenario | Base | Restart | Parallel | Performance |
| --- | --- | --- | --- | --- |
| J-W | | | MPI + OPENMP + Multinode | |
| CONUS | | | MPI | |
| Aquaplanet | | | | |
| Squall line | | | | |

GNU (DEBUG) - Build only

GNU - DOUBLE PRECISION (OPTIMIZED)

| Scenario | Base | Restart | Parallel | Performance |
| --- | --- | --- | --- | --- |
| CONUS | | | | |

NVHPC-CPU (OPTIMIZED)

| Scenario | Base | Restart | Parallel | Performance |
| --- | --- | --- | --- | --- |
| J-W | | | MPI + Multinode | |
| CONUS | | | MPI | |

NVHPC-GPU (OPTIMIZED)

| Scenario | Base | Restart | Parallel | Performance |
| --- | --- | --- | --- | --- |
| J-W | | | MultiGPU + Multinode | |
| CONUS | | | MultiGPU | |

NVHPC-GPU DOUBLE (OPTIMIZED)

| Scenario | Base | Restart | Parallel | Performance |
| --- | --- | --- | --- | --- |
| CONUS | | | | |

INTEL (OPTIMIZED)

| Scenario | Base | Restart | Parallel | Performance |
| --- | --- | --- | --- | --- |
| CONUS | | | MPI + OPENMP + Multinode | |

INTEL (DEBUG) - Build only
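For driving such a selection programmatically, the toolchain-to-scenario mapping in the tables above can be encoded as plain data. This is an illustrative sketch only; it is not the harness's actual configuration format (which lives in .ci/mpas_tests.jsonc), and the build-only toolchain names are assumptions:

```python
# Illustrative encoding of the proposed test matrix: toolchain -> scenario -> parallel modes.
# Scenario and mode names are taken from the tables above.
MATRIX = {
    "gnu": {
        "J-W": ["MPI", "OPENMP", "Multinode"],
        "CONUS": ["MPI"],
        "Aquaplanet": [],
        "Squall line": [],
    },
    "gnu-double": {"CONUS": []},
    "nvhpc-cpu": {"J-W": ["MPI", "Multinode"], "CONUS": ["MPI"]},
    "nvhpc-gpu": {"J-W": ["MultiGPU", "Multinode"], "CONUS": ["MultiGPU"]},
    "nvhpc-gpu-double": {"CONUS": []},
    "intel": {"CONUS": ["MPI", "OPENMP", "Multinode"]},
}

# Debug toolchains are compiled but never run (names assumed for illustration).
BUILD_ONLY = ["gnu-debug", "intel-debug"]


def scenarios_for(toolchain):
    """Scenarios exercised for a toolchain; build-only toolchains run nothing."""
    return sorted(MATRIX.get(toolchain, {}))
```

Keeping the matrix as data makes the coverage trade-off explicit and reviewable: adding a cell is a one-line diff rather than a new script.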


Reference generation

To periodically update the reference restart files against which the base runs are compared, we could use another JSON file, .ci/mpas_refgen.jsonc, to generate these references from the base branch (develop). This would be triggered manually as required, when either the base branch introduces substantial changes or the GPU porting of a routine produces results that are not bit-for-bit (BFB) identical but have been otherwise verified.

To launch the reference generation from the root directory:

```shell
.ci/hpc-workflows/.ci/runner.py .ci/mpas_refgen.jsonc -t <toolchain> -a <account>
```

Commits

  • some changes to tests config
  • some scripts to support performance benchmarking
  • adding nvhpc-cpu
  • some changes to exit codes
  • Adding reference generation capability and a separate jsonc