167 commits
f657419
adapting to restructured dataset
Feb 11, 2025
6a07f45
pre-commit
Feb 11, 2025
9b8822b
Merge branch 'main' into serialization_data_update
Feb 12, 2025
19b5c36
rename solve non hydro savepoints
Feb 12, 2025
c87fc59
fix loading for solve_nonhydro savepoints
Feb 12, 2025
730f133
split substep fixture into substep_init and substep_exit, add leastsq…
Feb 13, 2025
41a7bb2
fixing nonhydro tests
Feb 13, 2025
645d6f8
remove dummy test
Feb 13, 2025
c887f68
fix wrong field names in IconSolveNonHydroExit
Feb 13, 2025
7627426
first changes on advection tests
Feb 14, 2025
5590447
remove jstep fixtures -> use substep_exit/init instead
Feb 14, 2025
5b6e851
adapt advection datatest
Feb 14, 2025
b796b3d
pre-commit
Feb 14, 2025
f90523d
remove dummy test
Feb 14, 2025
b070f03
remove advection uris
Feb 14, 2025
a5457ee
trade jstep fixture for substep in test_timeloop.py
Feb 14, 2025
32aff7e
remove unused fixtures
Feb 14, 2025
2bf16f9
rename fixture for IconNonHydroFinalSavepoint
Feb 18, 2025
7ad86ff
rename Level TypeAlias
Feb 18, 2025
6c1d0d3
Merge branch 'main' into serialization_data_update
Feb 18, 2025
bcc3faa
pre-commit
Feb 18, 2025
dbf0354
fix data loading
Feb 19, 2025
259795b
add vn_only fixture to test_timeloop.py
Feb 19, 2025
c4116fa
add diagnostics init savepoint
Feb 25, 2025
850a318
merge main
Feb 25, 2025
767eb52
remove unused field
Feb 25, 2025
c31303f
adapt jabw and gauss3d tests
Feb 26, 2025
b237cad
adapt driver tests
Feb 27, 2025
31b94ad
change mount path for serialized data - path with new data
Feb 27, 2025
67168dc
adapt tools (1) substep fixtures instead of jstep
Feb 27, 2025
6f23e3d
fix naming, fixtures and access patterns
Feb 27, 2025
41b7db9
fix fixtures import
Feb 27, 2025
9f82f42
add directory listing for debugging
Feb 27, 2025
9309a30
add debug printout for loaded test path
Feb 27, 2025
ee0a39b
add TODO for microphysics savepoints, add warning with datapath
Feb 27, 2025
5786f8b
fix wrapper tests
Feb 27, 2025
1b4cc62
remove debugging output
Feb 27, 2025
af71a6c
xfail the start_index test in grid_manager for edge/END. icon data ha…
Feb 27, 2025
d649bb3
xfail the start_index test in grid_manager for edge/END. icon data ha…
Feb 27, 2025
45a25a2
add module level skips to parallel tests.
Feb 28, 2025
8ed39aa
fix: trade jstep for substep in driver initialization_utils.py
Feb 28, 2025
d789718
add bool cast (following review)
Feb 28, 2025
78fff2d
update data URI
Feb 28, 2025
17367be
remove comments that should not have been committed
Feb 28, 2025
489ea31
merge main
Mar 4, 2025
6e33a4e
WIP
Mar 4, 2025
2d26a4a
remove skips from parallel tests
Mar 4, 2025
20a1221
move mpi dependent decomposition tests to mpi_test subfolder for cons…
Mar 6, 2025
8ed7553
noxfile add selection for mpi tests
Mar 6, 2025
878fe1d
pre-commit
Mar 6, 2025
e925d0c
Merge branch 'main' into parallel_tests_on_ci
Mar 6, 2025
0340eb6
WIP add distributed.yml
Mar 7, 2025
d249430
remove skips in parallel tests
Mar 7, 2025
8c258b7
run mpirun inside nox session (local usage!)
Mar 7, 2025
db5d3be
add env variables for mpi4py build in distributed.yml
Mar 7, 2025
553b132
use scikit build core for mpi4py build
Mar 11, 2025
e033f92
debug: log some environment variables
Mar 11, 2025
df0ef59
add libnuma-dev to test container
Mar 11, 2025
9f88610
set SLURM_NTASKS=1
Mar 11, 2025
2c998f2
try SLURM_JOB_NUM_NODES: 4
Mar 11, 2025
754e153
add build stage in base.yml
Mar 12, 2025
debb64a
fix stages
Mar 12, 2025
5f01f4a
try to fix pwd
Mar 12, 2025
c5f121f
fix wrong extends key
Mar 12, 2025
590c6d5
artifacts path
Mar 12, 2025
275a78f
artifacts path
Mar 12, 2025
3b8cad9
artifacts path
Mar 12, 2025
a5c1611
remove artifacts
Mar 13, 2025
1199c1d
try out docker build
Mar 13, 2025
ff73fe4
avoid building single node container
Mar 13, 2025
c9af158
fix Docker file
Mar 13, 2025
c8d504b
fix Docker file
Mar 13, 2025
b90de30
fix VENV name (hopefully)
Mar 13, 2025
1e282be
add -k flag in test run, remove no-dev for build (we need test group)
Mar 13, 2025
01aba06
fix fixture loading for parallel procs, add printout for logging (TBR)
Mar 17, 2025
7145c17
set CC and MPICC vars
Mar 17, 2025
4a0b996
debug output in uv sync
Mar 17, 2025
07ac04a
debug output in uv sync
Mar 17, 2025
bd1ef86
fix component path
Mar 17, 2025
47f7392
use base image without nvidia tools
Mar 17, 2025
2d00fac
remove include base.yml
Mar 17, 2025
0067ca3
remove include base.yml
Mar 17, 2025
1846990
move "model" from docker tag
Mar 17, 2025
50d7d10
update main
Mar 18, 2025
34aae3f
clean up configurations
Mar 19, 2025
63bec54
split from base.yml
Mar 19, 2025
d10176c
update base image
Mar 19, 2025
7b0abb0
update base image
Mar 19, 2025
8cc7930
add back libreadline
Mar 20, 2025
0687695
check glibc version
Mar 20, 2025
d36c93c
merge main
May 6, 2025
4971a8d
remove old yaml file
May 6, 2025
da1d47d
fix parallel solve nonhydro tests
May 7, 2025
fcde592
add data uris for APE
May 7, 2025
b151248
remove debugging output in test stage
May 7, 2025
581de44
add cd /icon4py+-
May 7, 2025
45507d4
move python env activation to before_script
May 8, 2025
73e746e
remove workdir from docker file
May 8, 2025
13fe916
use docker file workdir
May 8, 2025
c24a0e1
fix work directory
May 8, 2025
0b079db
clean base docker image (uv installation)
May 8, 2025
f7318e9
debug cd error (WIP1)
May 8, 2025
0157c05
debug cd error (WIP1)
May 8, 2025
5d4e76d
debug cd error (WIP1)
May 8, 2025
39a4d35
debug cd error (WIP1)
May 8, 2025
015bc07
clean up debug output, use openmpi
May 8, 2025
8d51142
SLURM_NTASKS: 4
May 8, 2025
1161709
use --no-cache for uv
May 8, 2025
78bccc6
run pytest directly
May 8, 2025
965d144
try USE_MPI:yes with SLURM_NTASKS:1
May 9, 2025
96f1995
add dace to distributed_venv.Dockerfile
May 9, 2025
d2d1fc8
reset NTASKS (for openmpi), limit pytest to n=1 procs
May 9, 2025
ce4e91d
use --host option to mpirun
May 9, 2025
724abb2
use --host option to mpirun (2)
May 9, 2025
20033bb
use slurm only
May 9, 2025
56b90c1
try --host config again
May 9, 2025
9b2f093
add dace backend
May 9, 2025
855f850
Merge branch 'main' into parallel_tests_on_ci
May 9, 2025
b72b0b8
disable orchestration test in test_parallel_diffusion.py
May 9, 2025
e266f04
revert unused changes in testing infrastructure
May 9, 2025
b0470f6
clean out base_mpi.Dockerfile
May 9, 2025
dbabaff
update ubuntu base image, set to fixed uv version
May 9, 2025
bcf791c
try setting use_mpi without slurm tasks
May 9, 2025
e249dfe
revert last try
May 9, 2025
d83e721
set SLURM_TASKS with env for
May 13, 2025
c5694bf
formatting changes
May 14, 2025
6f06598
reset SLURM_NTASKS from the script command
May 14, 2025
52183a1
try host parameter
May 14, 2025
a5f1ba5
add comment for INVALID_INDEX handling in compute_cells_aw_verts
May 15, 2025
8084cc3
Merge branch 'main' into parallel_tests_on_ci
May 28, 2025
2d12297
fix comparison of rho_ic (wrong field)
May 28, 2025
05e6508
explicitly set USE_MPI with NTASKS=1
May 28, 2025
7d4e065
Merge branch 'main' into parallel_tests_on_ci
Jul 3, 2025
ff69a90
explicitly set use_mpi to false and SLURM_CPU_BIND
Jul 3, 2025
f400ad8
reset SLURM_NTASK in before script
Jul 4, 2025
2e9d084
fix syntax error
Jul 4, 2025
d503d76
fix data path
Jul 4, 2025
34bbf17
run single node tests
Jul 4, 2025
08b4494
Merge branch 'main' into parallel_tests_on_ci
Dec 3, 2025
b36857f
delete duplicate fixture, fix simple tests
Dec 3, 2025
6c4db31
delete duplicate fixture, fix simple tests
Dec 3, 2025
ad45595
delete change in base.Dockerfile
Dec 4, 2025
68fd858
Merge remote-tracking branch 'origin/main' into parallel_tests_on_ci
msimberg Jan 9, 2026
78759ab
Don't inject MPI libraries and use PMIx for distributed tests
msimberg Jan 9, 2026
f2c2e51
Try enabling multiple ranks in distributed tests again
msimberg Jan 9, 2026
f7562d9
dace is no longer optional
msimberg Jan 9, 2026
44d0b88
Run tests with --no-sync
msimberg Jan 9, 2026
16b8a5a
Unbuffered and labeled output
msimberg Jan 9, 2026
c76ec06
PMIx options
msimberg Jan 9, 2026
1442600
Run MPI tests
msimberg Jan 9, 2026
aecb35c
Update distributed CI base image
msimberg Jan 9, 2026
2d1fd91
Clean up mpi base image
msimberg Jan 12, 2026
99c3d6e
Small refactoring
msimberg Jan 12, 2026
26a50e5
Rename mpi dockerfile
msimberg Jan 12, 2026
65550b0
Clean up mpi ci setup
msimberg Jan 12, 2026
fe3818a
Add cscs-ci run distributed to CI reminder action
msimberg Jan 12, 2026
c29fbaa
Fix filename
msimberg Jan 12, 2026
029b5d6
Set working directory for mpi tests (WORKDIR is not used)
msimberg Jan 12, 2026
5df3273
Change working directory again
msimberg Jan 12, 2026
72c53f3
Set timelimits for distributed tests
msimberg Jan 12, 2026
7d40ad0
Mark MPI tests xfail
msimberg Jan 12, 2026
76ef82c
Fix working directory again
msimberg Jan 12, 2026
c763dc4
Undo some xfail, add some new xfail
msimberg Jan 12, 2026
f1294f8
Change logging directory
msimberg Jan 12, 2026
93444d4
More xfail
msimberg Jan 12, 2026
4ab8ea3
Make ci-mpi-wrapper.sh a bit more generic for local use
msimberg Jan 15, 2026
639f4ae
Disable distributed grg test with dace for now
msimberg Jan 15, 2026
4 changes: 4 additions & 0 deletions .github/workflows/mandatory_and_optional_test_reminder.yml
@@ -28,6 +28,10 @@ jobs:

* `cscs-ci run dace`

To run tests with MPI you can use:

* `cscs-ci run distributed`

To run test levels ignored by the default test suite (mostly simple datatest for static fields computations) you can use:
* `cscs-ci run extra`

103 changes: 103 additions & 0 deletions ci/distributed.yml
@@ -0,0 +1,103 @@
include:
- remote: 'https://gitlab.com/cscs-ci/recipes/-/raw/master/templates/v2/.ci-ext.yml'

stages:
- baseimage
- image
- build
- test
- benchmark

variables:
PYVERSION_PREFIX: py310
PYVERSION: 3.10.9

# Base image build step with SHA256 checksum for caching
.build_distributed_baseimage:
stage: baseimage
before_script:
# include build arguments in hash since we use a parameterized Docker file
- DOCKER_TAG=`echo "$(cat $DOCKERFILE) $DOCKER_BUILD_ARGS" | sha256sum | head -c 16`
- export PERSIST_IMAGE_NAME=$CSCS_REGISTRY_PATH/public/$ARCH/base/icon4py:$DOCKER_TAG-$PYVERSION-mpi
- echo "BASE_IMAGE_${PYVERSION_PREFIX}=$PERSIST_IMAGE_NAME" >> build.env
artifacts:
reports:
dotenv: build.env
variables:
DOCKERFILE: ci/docker/base_mpi.Dockerfile
# change to 'always' if you want to rebuild, even if target tag exists already (if-not-exists is the default, i.e. we could also skip the variable)
CSCS_REBUILD_POLICY: if-not-exists

build_distributed_baseimage_aarch64:
extends: [.container-builder-cscs-gh200, .build_distributed_baseimage]
variables:
DOCKER_BUILD_ARGS: '["ARCH=$ARCH", "PYVERSION=$PYVERSION"]'

.build_distributed_template:
variables:
DOCKERFILE: ci/docker/checkout_mpi.Dockerfile
# Unique image name based on the commit SHA.
DOCKER_BUILD_ARGS: '["PYVERSION=$PYVERSION", "BASE_IMAGE=${BASE_IMAGE_${PYVERSION_PREFIX}}", "VENV=${UV_PROJECT_ENVIRONMENT}"]'
PERSIST_IMAGE_NAME: $CSCS_REGISTRY_PATH/public/$ARCH/icon4py/icon4py-ci:$CI_COMMIT_SHA-$UV_PROJECT_ENVIRONMENT-$PYVERSION-mpi
USE_MPI: NO
SLURM_MPI_TYPE: pmix
PMIX_MCA_psec: native
PMIX_MCA_gds: "^shmem2"

.build_distributed_cpu:
extends: [.build_distributed_template]
variables:
UV_PROJECT_ENVIRONMENT: venv_dist

build_distributed_cpu:
stage: image
extends: [.container-builder-cscs-gh200, .build_distributed_cpu]
needs: [build_distributed_baseimage_aarch64]

.test_template_distributed:
timeout: 8h
image: $CSCS_REGISTRY_PATH/public/$ARCH/icon4py/icon4py-ci:$CI_COMMIT_SHA-$UV_PROJECT_ENVIRONMENT-$PYVERSION-mpi
extends: [.container-runner-santis-gh200, .build_distributed_cpu]
needs: [build_distributed_cpu]
variables:
SLURM_JOB_NUM_NODES: 1
SLURM_CPU_BIND: 'verbose'
SLURM_NTASKS: 4
TEST_DATA_PATH: "/icon4py/testdata"
ICON4PY_ENABLE_GRID_DOWNLOAD: false
ICON4PY_ENABLE_TESTDATA_DOWNLOAD: false
CSCS_ADDITIONAL_MOUNTS: '["/capstor/store/cscs/userlab/d126/icon4py/ci/testdata_003:$TEST_DATA_PATH"]'

.test_distributed_aarch64:
stage: test
extends: [.test_template_distributed]
before_script:
- cd /icon4py
- echo "using virtual environment at ${UV_PROJECT_ENVIRONMENT}"
- source ${UV_PROJECT_ENVIRONMENT}/bin/activate
- echo "running with $(python --version)"
script:
- scripts/ci-mpi-wrapper.sh pytest -sv -k mpi_tests --with-mpi --backend=$BACKEND model/$COMPONENT
parallel:
matrix:
- COMPONENT: [atmosphere/diffusion, atmosphere/dycore, common]
BACKEND: [embedded, gtfn_cpu, dace_cpu]
rules:
- if: $COMPONENT == 'atmosphere/diffusion'
variables:
SLURM_TIMELIMIT: '00:05:00'
- if: $COMPONENT == 'atmosphere/dycore' && $BACKEND == 'dace_cpu'
variables:
SLURM_TIMELIMIT: '00:20:00'
- if: $COMPONENT == 'atmosphere/dycore'
variables:
SLURM_TIMELIMIT: '00:15:00'
- when: on_success
variables:
SLURM_TIMELIMIT: '00:30:00'
artifacts:
paths:
- pytest-log-rank-*.txt

test_model_distributed:
extends: [.test_distributed_aarch64]
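
The before_script above folds the Dockerfile contents and the build arguments into one hash, so changing either invalidates the cached base image. A standalone sketch of the same computation, with example values substituted for the CI variables:

```bash
# Example values; in CI, DOCKERFILE and DOCKER_BUILD_ARGS are job variables.
DOCKERFILE=ci/docker/base_mpi.Dockerfile
DOCKER_BUILD_ARGS='["ARCH=aarch64", "PYVERSION=3.10.9"]'
# The first 16 hex characters of the SHA-256 of "contents + args" become the tag.
DOCKER_TAG=$(echo "$(cat "$DOCKERFILE") $DOCKER_BUILD_ARGS" | sha256sum | head -c 16)
echo "icon4py base tag: ${DOCKER_TAG}-3.10.9-mpi"
```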
27 changes: 27 additions & 0 deletions ci/docker/base_mpi.Dockerfile
@@ -0,0 +1,27 @@
FROM ubuntu:25.04

ENV LANG C.UTF-8
ENV LC_ALL C.UTF-8

ARG DEBIAN_FRONTEND=noninteractive
RUN apt-get update -qq && apt-get install -qq -y --no-install-recommends \
strace \
build-essential \
tar \
wget \
curl \
libboost-dev \
libnuma-dev \
libopenmpi-dev \
ca-certificates \
libssl-dev \
autoconf \
automake \
libtool \
pkg-config \
libreadline-dev \
git && \
rm -rf /var/lib/apt/lists/*

# Install uv: https://docs.astral.sh/uv/guides/integration/docker
COPY --from=ghcr.io/astral-sh/uv:0.9.24@sha256:816fdce3387ed2142e37d2e56e1b1b97ccc1ea87731ba199dc8a25c04e4997c5 /uv /uvx /bin/
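
A minimal local build of this base image, assuming Docker is available (in CI the container-builder job does this and pushes to the CSCS registry; the local tag here is made up):

```bash
docker build -f ci/docker/base_mpi.Dockerfile -t icon4py-base-mpi:local .
```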
11 changes: 11 additions & 0 deletions ci/docker/checkout_mpi.Dockerfile
Comment from the PR author:
The single-node CI runs tests through nox sessions, where each session installs its own Python venv. That is not possible for the MPI tests: you cannot call "mpirun" from within a nox session, because nox does not like an external tool being accessed, and the other way round, running mpirun -np x nox -s "session" means each rank uses its own mpi4py / ghex installation.

So I install the venv into the container by running uv sync in the Dockerfile.

Reply from a contributor:
> So I install the venv into the container by running uv sync in the Dockerfile.

This sounds like a good thing even for different test jobs (not distributed, just different components/backends etc.).

Follow-up reply from a contributor:
It may be possible to get mpirun/srun plus nox working by specifying some options to nox about reusing venvs (https://nox.thea.codes/en/stable/usage.html#reusing-virtualenvs), and possibly only installing dependencies on rank 0. Alternatively, calling mpirun from outside the venv should still be possible as well (https://nox.thea.codes/en/stable/usage.html#disallowing-external-programs).

However, I would keep this first version simple and look into the above as a follow-up if needed.
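
A rough sketch of that reuse idea, assuming a hypothetical session name `test_session` and a shared filesystem so the venv is created once before the parallel launch; `-R` is nox's shorthand for reusing existing virtualenvs and skipping install commands:

```bash
# Create the session venv and install dependencies once, outside MPI.
nox -s test_session --install-only
# Launch under MPI; every rank reuses the existing venv instead of reinstalling.
mpirun -np 2 nox -R -s test_session
```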

@@ -0,0 +1,11 @@
ARG BASE_IMAGE
FROM $BASE_IMAGE

COPY . /icon4py
WORKDIR /icon4py

ARG PYVERSION
ARG VENV
ENV UV_PROJECT_ENVIRONMENT=$VENV
ENV MPI4PY_BUILD_BACKEND="scikit-build-core"
RUN uv sync --extra distributed --python=$PYVERSION
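
For reference, a hypothetical manual build of this test image; in CI the BASE_IMAGE, PYVERSION and VENV build arguments are supplied via DOCKER_BUILD_ARGS in `.build_distributed_template`, and the image names below are placeholders:

```bash
docker build -f ci/docker/checkout_mpi.Dockerfile \
  --build-arg BASE_IMAGE=icon4py-base-mpi:local \
  --build-arg PYVERSION=3.10.9 \
  --build-arg VENV=venv_dist \
  -t icon4py-ci:local .
```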
@@ -22,6 +22,7 @@


@pytest.mark.mpi
@pytest.mark.uses_concat_where
@pytest.mark.parametrize(
"experiment, step_date_init, step_date_exit",
[
@@ -147,6 +148,7 @@ def test_parallel_diffusion(
)


@pytest.mark.skip("SKIP: orchestration is currently broken on CI")
@pytest.mark.mpi
@pytest.mark.parametrize(
"experiment, step_date_init, step_date_exit",
@@ -62,6 +62,9 @@ def test_run_solve_nonhydro_single_step(
decomposition_info: definitions.DecompositionInfo, # : F811 fixture
backend: gtx_typing.Backend | None,
) -> None:
if test_utils.is_embedded(backend):
pytest.xfail("ValueError: axes don't match array")

parallel_helpers.check_comm_size(processor_props)
print(
f"rank={processor_props.rank}/{processor_props.comm_size}: inializing dycore for experiment 'mch_ch_r04_b09_dsl"
@@ -28,7 +28,7 @@
from icon4py.model.common import dimension as dims
from icon4py.model.common.decomposition import definitions, mpi_decomposition
from icon4py.model.testing import definitions as test_defs, serialbox
from icon4py.model.testing.parallel_helpers import check_comm_size, processor_props
from icon4py.model.testing.parallel_helpers import check_comm_size

from ...fixtures import (
backend,
@@ -40,16 +40,17 @@
icon_grid,
interpolation_savepoint,
metrics_savepoint,
processor_props,
ranked_data_path,
)


"""
running tests with mpi:

mpirun -np 2 python -m pytest -v --with-mpi tests/mpi_tests/test_parallel_setup.py
mpirun -np 2 python -m pytest -v --with-mpi tests/mpi_tests/test_mpi_decomposition.py

mpirun -np 2 pytest -v --with-mpi tests/mpi_tests/
mpirun -np 2 pytest -v --with-mpi -k mpi_tests/


"""
Expand All @@ -58,6 +59,7 @@
@pytest.mark.parametrize("processor_props", [True], indirect=True)
def test_props(processor_props: definitions.ProcessProperties) -> None:
assert processor_props.comm
assert processor_props.comm_size > 1


@pytest.mark.mpi(min_size=2)
@@ -257,7 +259,7 @@ def test_exchange_on_dummy_data(
exchange = definitions.create_exchange(processor_props, decomposition_info)
grid = grid_savepoint.construct_icon_grid()

number = processor_props.rank + 10.0
number = processor_props.rank + 10
input_field = data_alloc.constant_field(
grid,
number,
@@ -99,6 +99,7 @@ def test_distributed_geometry_attrs_for_inverse(
grid_name: str,
lb_domain: h_grid.Domain,
) -> None:
pytest.xfail()
parallel_helpers.check_comm_size(processor_props)
parallel_helpers.log_process_properties(processor_props)
parallel_helpers.log_local_field_size(decomposition_info)
@@ -14,6 +14,7 @@

import icon4py.model.common.dimension as dims
import icon4py.model.common.grid.horizontal as h_grid
from icon4py.model.common.decomposition import definitions as decomp_defs
from icon4py.model.testing import definitions as test_defs, parallel_helpers

from ...fixtures import (
@@ -31,12 +32,13 @@
if TYPE_CHECKING:
import gt4py.next as gtx

from icon4py.model.common.decomposition import definitions as decomp_defs
from icon4py.model.common.grid import base as base_grid


try:
import mpi4py # type: ignore[import-not-found] # F401: import mpi4py to check for optional mpi dependency

from icon4py.model.common.decomposition import mpi_decomposition
except ImportError:
pytest.skip("Skipping parallel on single node installation", allow_module_level=True)

@@ -131,6 +131,9 @@ def test_distributed_interpolation_grg(
decomposition_info: decomposition.DecompositionInfo,
interpolation_factory_from_savepoint: interpolation_factory.InterpolationFieldsFactory,
) -> None:
if test_utils.is_dace(backend):
pytest.xfail("Segmentation fault with dace backend")

parallel_helpers.check_comm_size(processor_props)
intp_factory = interpolation_factory_from_savepoint
field_ref = interpolation_savepoint.geofac_grg()
@@ -204,6 +207,7 @@ def test_distributed_interpolation_rbf(
intrp_name: str,
atol: int,
) -> None:
pytest.xfail()
parallel_helpers.check_comm_size(processor_props)
parallel_helpers.log_process_properties(processor_props)
parallel_helpers.log_local_field_size(decomposition_info)
@@ -42,6 +42,7 @@

@pytest.mark.datatest
@pytest.mark.mpi
@pytest.mark.uses_concat_where
@pytest.mark.parametrize("processor_props", [True], indirect=True)
@pytest.mark.parametrize(
"attrs_name, metrics_name",
@@ -68,6 +69,9 @@ def test_distributed_metrics_attrs(
metrics_name: str,
experiment: test_defs.Experiment,
) -> None:
if attrs_name == attrs.COEFF_GRADEKIN:
pytest.xfail()

parallel_helpers.check_comm_size(processor_props)
parallel_helpers.log_process_properties(processor_props)
parallel_helpers.log_local_field_size(decomposition_info)
Expand All @@ -80,6 +84,7 @@ def test_distributed_metrics_attrs(

@pytest.mark.datatest
@pytest.mark.mpi
@pytest.mark.uses_concat_where
@pytest.mark.parametrize("processor_props", [True], indirect=True)
@pytest.mark.parametrize(
"attrs_name, metrics_name",
@@ -151,6 +156,8 @@ def test_distributed_metrics_attrs_no_halo_regional(
metrics_name: str,
experiment: test_defs.Experiment,
) -> None:
if test_utils.is_embedded(backend):
pytest.xfail("ValueError: axes don't match array")
if experiment == test_defs.Experiments.EXCLAIM_APE:
pytest.skip(f"Fields not computed for {experiment}")
parallel_helpers.check_comm_size(processor_props)
5 changes: 5 additions & 0 deletions model/testing/src/icon4py/model/testing/fixtures/datatest.py
@@ -164,6 +164,11 @@ def download_ser_data(
if "not datatest" in request.config.getoption("-k", ""):
return

with_mpi = request.config.getoption("with_mpi", False)
if with_mpi and experiment == definitions.Experiments.GAUSS3D:
# TODO(msimberg): Fix? Need serialized data.
pytest.skip("GAUSS3D experiment does not support MPI tests")

_download_ser_data(processor_props.comm_size, ranked_data_path, experiment)


@@ -5,6 +5,7 @@
#
# Please, refer to the LICENSE file in the root directory.
# SPDX-License-Identifier: BSD-3-Clause

import logging
from collections.abc import Iterable

Expand Down
28 changes: 28 additions & 0 deletions scripts/ci-mpi-wrapper.sh
@@ -0,0 +1,28 @@
#!/usr/bin/env bash

# Log all output to separate logfiles, stored as artifacts in gitlab. Output to
# stdout only from rank 0.

set -euo pipefail

# Check a few different possibilities for the rank.
if [[ ! -z "${PMI_RANK:-}" ]]; then
rank="${PMI_RANK}"
elif [[ ! -z "${OMPI_COMM_WORLD_RANK:-}" ]]; then
rank="${OMPI_COMM_WORLD_RANK}"
elif [[ ! -z "${SLURM_PROCID:-}" ]]; then
rank="${SLURM_PROCID}"
else
echo "Could not determine MPI rank. Set PMI_RANK, OMPI_COMM_WORLD_RANK, or SLURM_PROCID."
exit 1
fi

log_file="${CI_PROJECT_DIR:+${CI_PROJECT_DIR}/}pytest-log-rank-${rank}.txt"

if [[ "${rank}" -eq 0 ]]; then
echo "Starting pytest on rank ${rank}, logging to stdout and ${log_file}"
"$@" |& tee "${log_file}"
else
echo "Starting pytest on rank ${rank}, logging to ${log_file}"
"$@" >& "${log_file}"
fi
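
The wrapper was made generic enough for local use; an invocation along the lines of the CI script stage, with example backend and component values taken from the matrix above, might look like:

```bash
# OMPI_COMM_WORLD_RANK (OpenMPI) or SLURM_PROCID (srun) satisfies the rank check.
mpirun -np 2 scripts/ci-mpi-wrapper.sh \
  pytest -sv -k mpi_tests --with-mpi --backend=gtfn_cpu model/common
```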