
Mpas testing #2

Open · wants to merge 6 commits into base: initial-testing-framework_simplified
Conversation

@abishekg7 (Owner) commented Mar 5, 2025

This PR builds on MPAS-Dev#1284 to add runtime tests for MPAS.

To launch tests from the root directory:

```shell
.ci/hpc-workflows/.ci/runner.py .ci/mpas_tests.jsonc -t <toolchain> -a <account>
```

The following toolchains/builds are proposed:

  • GNU: gnu, gnu-double, gpu-openmp, gnu-debug
  • NVHPC: nvhpc-cpu, nvhpc-gpu, nvhpc-gpu-double
  • Intel: intel, intel-double

The following test scenarios are proposed:

  • Jablonowski-Williamson baroclinic wave (dycore)
  • Limited area over CONUS
  • Aquaplanet
  • Squall line

The following test types are proposed, orthogonal to the test scenarios above:

  • Base: The simplest test configuration for a given scenario. The output restart file is checked for bit-identical results against a stored reference.
  • Restart: Tests the ability to proceed from an intermediate restart file (provided by the base run) and produce results bit-identical to the base run.
  • Parallel: Tests various parallel implementations/configurations. The resulting restart files are checked for bit-identical results against the base runs. This is further classified into:
    • MPI
    • OMP
    • Multinode CPU
    • Multi-GPU
    • Multinode GPU
  • Performance: Five instances of a shorter run are performed and the timing statistics are compared against references stored in a database. The idea would be to run a nightly build plus performance runs for the develop branch and store the statistics in an SQLite database. When the performance check is run for a PR, the average of the last 5-10 days could then be used to decide whether the PR's performance has regressed.
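The performance comparison described above could be sketched roughly as follows. The `timings` table schema, the `test_id` naming, and the 10% tolerance are illustrative assumptions, not the actual database layout:

```python
import sqlite3
from datetime import date, timedelta

TOLERANCE = 1.10  # illustrative threshold: flag runs >10% slower than the rolling mean


def check_regression(db_path, test_id, new_walltime, window_days=10):
    """Compare a PR's timing against the rolling mean of recent develop runs.

    Assumes a hypothetical timings(test_id TEXT, run_date TEXT, walltime REAL)
    table populated by the nightly develop runs. Returns True if the new timing
    is within tolerance, False if it regressed, None if no reference data exists.
    """
    cutoff = (date.today() - timedelta(days=window_days)).isoformat()
    with sqlite3.connect(db_path) as conn:
        row = conn.execute(
            "SELECT AVG(walltime) FROM timings WHERE test_id = ? AND run_date >= ?",
            (test_id, cutoff),
        ).fetchone()
    baseline = row[0]
    if baseline is None:
        return None  # no nightly data in the window; cannot judge
    return new_walltime <= baseline * TOLERANCE
```

Averaging over a multi-day window, rather than comparing against a single nightly run, damps the node-to-node timing noise typical of shared HPC systems.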

Test matrix:
The proposal is to avoid running all possible combinations of (toolchain × scenario × test) and instead carefully map out a minimal set of tests in this space that provides a reasonable guarantee of correctness while keeping an eye on performance regressions.
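The bit-for-bit checks that the Base, Restart, and Parallel tests rely on can be sketched as below. `bit_identical` is a hypothetical helper, not the harness's actual comparator: byte-wise comparison assumes the restart format embeds no timestamps, and the real check may instead compare selected NetCDF variables.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large restart files are never read whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def bit_identical(candidate, reference):
    """True if the candidate restart file matches the reference byte for byte."""
    candidate, reference = Path(candidate), Path(reference)
    if candidate.stat().st_size != reference.stat().st_size:
        return False  # cheap early exit before hashing anything
    return sha256_of(candidate) == sha256_of(reference)
```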

GNU (OPTIMIZED)

| Scenario | Base | Restart | Parallel | Performance |
| --- | --- | --- | --- | --- |
| J-W | | | MPI + OPENMP + Multinode | |
| CONUS | | | MPI | |
| Aquaplanet | | | | |
| Squall line | | | | |

GNU (DEBUG) - Build only

GNU - DOUBLE PRECISION (OPTIMIZED)

| Scenario | Base | Restart | Parallel | Performance |
| --- | --- | --- | --- | --- |
| CONUS | | | | |

NVHPC-CPU (OPTIMIZED)

| Scenario | Base | Restart | Parallel | Performance |
| --- | --- | --- | --- | --- |
| J-W | | | MPI + Multinode | |
| CONUS | | | MPI | |

NVHPC-GPU (OPTIMIZED)

| Scenario | Base | Restart | Parallel | Performance |
| --- | --- | --- | --- | --- |
| J-W | | | MultiGPU + Multinode | |
| CONUS | | | MultiGPU | |

NVHPC-GPU DOUBLE (OPTIMIZED)

| Scenario | Base | Restart | Parallel | Performance |
| --- | --- | --- | --- | --- |
| CONUS | | | | |

INTEL (OPTIMIZED)

| Scenario | Base | Restart | Parallel | Performance |
| --- | --- | --- | --- | --- |
| CONUS | | | MPI + OPENMP + Multinode | |

INTEL (DEBUG) - Build only
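For driving such a selection programmatically, the toolchain-to-scenario mapping in the tables above can be encoded as plain data. This is an illustrative sketch only; it is not the harness's actual configuration format (which lives in .ci/mpas_tests.jsonc), and the build-only toolchain names are assumptions:

```python
# Illustrative encoding of the proposed test matrix: toolchain -> scenario -> parallel modes.
# Scenario and mode names are taken from the tables above.
MATRIX = {
    "gnu": {
        "J-W": ["MPI", "OPENMP", "Multinode"],
        "CONUS": ["MPI"],
        "Aquaplanet": [],
        "Squall line": [],
    },
    "gnu-double": {"CONUS": []},
    "nvhpc-cpu": {"J-W": ["MPI", "Multinode"], "CONUS": ["MPI"]},
    "nvhpc-gpu": {"J-W": ["MultiGPU", "Multinode"], "CONUS": ["MultiGPU"]},
    "nvhpc-gpu-double": {"CONUS": []},
    "intel": {"CONUS": ["MPI", "OPENMP", "Multinode"]},
}

# Debug toolchains are compiled but never run (names assumed for illustration).
BUILD_ONLY = ["gnu-debug", "intel-debug"]


def scenarios_for(toolchain):
    """Scenarios exercised for a toolchain; build-only toolchains run nothing."""
    return sorted(MATRIX.get(toolchain, {}))
```

Keeping the matrix as data makes the coverage trade-off explicit and reviewable: adding a cell is a one-line diff rather than a new script.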


Reference generation

To periodically update the reference restart files against which the base runs are compared, we could use another JSON file, .ci/mpas_refgen.jsonc, to generate these references from the base branch (develop). This would be triggered manually as required, when either the base branch introduces substantial changes or the GPU porting of a routine produces results that are not bit-for-bit (BFB) identical but have been otherwise verified.

To launch the reference generation from the root directory:

```shell
.ci/hpc-workflows/.ci/runner.py .ci/mpas_refgen.jsonc -t <toolchain> -a <account>
```

Commits

  • some changes to tests config
  • some scripts to support performance benchmarking
  • adding nvhpc-cpu
  • some changes to exit codes
  • Adding reference generation capability and a separate jsonc