Run MPI tests in CI with CPU backends #692
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,103 @@ | ||
| include: | ||
| - remote: 'https://gitlab.com/cscs-ci/recipes/-/raw/master/templates/v2/.ci-ext.yml' | ||
|
|
||
| stages: | ||
| - baseimage | ||
| - image | ||
| - build | ||
| - test | ||
| - benchmark | ||
|
|
||
| variables: | ||
| PYVERSION_PREFIX: py310 | ||
| PYVERSION: 3.10.9 | ||
|
|
||
| # Base image build step with SHA256 checksum for caching | ||
| .build_distributed_baseimage: | ||
| stage: baseimage | ||
| before_script: | ||
| # include build arguments in hash since we use a parameterized Docker file | ||
| - DOCKER_TAG=$(echo "$(cat $DOCKERFILE) $DOCKER_BUILD_ARGS" | sha256sum | head -c 16) | ||
| - export PERSIST_IMAGE_NAME=$CSCS_REGISTRY_PATH/public/$ARCH/base/icon4py:$DOCKER_TAG-$PYVERSION-mpi | ||
| - echo "BASE_IMAGE_${PYVERSION_PREFIX}=$PERSIST_IMAGE_NAME" >> build.env | ||
| artifacts: | ||
| reports: | ||
| dotenv: build.env | ||
| variables: | ||
| DOCKERFILE: ci/docker/base_mpi.Dockerfile | ||
| # Set to 'always' to rebuild even if the target tag already exists. 'if-not-exists' is the default, so this variable could also be omitted. | ||
| CSCS_REBUILD_POLICY: if-not-exists | ||
|
|
||
| build_distributed_baseimage_aarch64: | ||
| extends: [.container-builder-cscs-gh200, .build_distributed_baseimage] | ||
| variables: | ||
| DOCKER_BUILD_ARGS: '["ARCH=$ARCH", "PYVERSION=$PYVERSION"]' | ||
|
|
||
| .build_distributed_template: | ||
| variables: | ||
| DOCKERFILE: ci/docker/checkout_mpi.Dockerfile | ||
| # Unique image name based on commit SHA, virtual environment name, and Python version | ||
| DOCKER_BUILD_ARGS: '["PYVERSION=$PYVERSION", "BASE_IMAGE=${BASE_IMAGE_${PYVERSION_PREFIX}}", "VENV=${UV_PROJECT_ENVIRONMENT}"]' | ||
| PERSIST_IMAGE_NAME: $CSCS_REGISTRY_PATH/public/$ARCH/icon4py/icon4py-ci:$CI_COMMIT_SHA-$UV_PROJECT_ENVIRONMENT-$PYVERSION-mpi | ||
| USE_MPI: NO | ||
| SLURM_MPI_TYPE: pmix | ||
| PMIX_MCA_psec: native | ||
| PMIX_MCA_gds: "^shmem2" | ||
|
|
||
| .build_distributed_cpu: | ||
| extends: [.build_distributed_template] | ||
| variables: | ||
| UV_PROJECT_ENVIRONMENT: venv_dist | ||
|
|
||
| build_distributed_cpu: | ||
| stage: image | ||
| extends: [.container-builder-cscs-gh200, .build_distributed_cpu] | ||
| needs: [build_distributed_baseimage_aarch64] | ||
|
|
||
| .test_template_distributed: | ||
| timeout: 8h | ||
| image: $CSCS_REGISTRY_PATH/public/$ARCH/icon4py/icon4py-ci:$CI_COMMIT_SHA-$UV_PROJECT_ENVIRONMENT-$PYVERSION-mpi | ||
| extends: [.container-runner-santis-gh200, .build_distributed_cpu] | ||
| needs: [build_distributed_cpu] | ||
| variables: | ||
| SLURM_JOB_NUM_NODES: 1 | ||
| SLURM_CPU_BIND: 'verbose' | ||
| SLURM_NTASKS: 4 | ||
| TEST_DATA_PATH: "/icon4py/testdata" | ||
| ICON4PY_ENABLE_GRID_DOWNLOAD: false | ||
| ICON4PY_ENABLE_TESTDATA_DOWNLOAD: false | ||
| CSCS_ADDITIONAL_MOUNTS: '["/capstor/store/cscs/userlab/d126/icon4py/ci/testdata_003:$TEST_DATA_PATH"]' | ||
|
|
||
| .test_distributed_aarch64: | ||
| stage: test | ||
| extends: [.test_template_distributed] | ||
| before_script: | ||
| - cd /icon4py | ||
| - echo "using virtual environment at ${UV_PROJECT_ENVIRONMENT}" | ||
| - source ${UV_PROJECT_ENVIRONMENT}/bin/activate | ||
| - echo "running with $(python --version)" | ||
| script: | ||
| - scripts/ci-mpi-wrapper.sh pytest -sv -k mpi_tests --with-mpi --backend=$BACKEND model/$COMPONENT | ||
| parallel: | ||
| matrix: | ||
| - COMPONENT: [atmosphere/diffusion, atmosphere/dycore, common] | ||
| BACKEND: [embedded, gtfn_cpu, dace_cpu] | ||
| rules: | ||
| - if: $COMPONENT == 'atmosphere/diffusion' | ||
| variables: | ||
| SLURM_TIMELIMIT: '00:05:00' | ||
| - if: $COMPONENT == 'atmosphere/dycore' && $BACKEND == 'dace_cpu' | ||
| variables: | ||
| SLURM_TIMELIMIT: '00:20:00' | ||
| - if: $COMPONENT == 'atmosphere/dycore' | ||
| variables: | ||
| SLURM_TIMELIMIT: '00:15:00' | ||
| - when: on_success | ||
| variables: | ||
| SLURM_TIMELIMIT: '00:30:00' | ||
| artifacts: | ||
| paths: | ||
| - pytest-log-rank-*.txt | ||
|
|
||
| test_model_distributed: | ||
| extends: [.test_distributed_aarch64] | ||
msimberg marked this conversation as resolved.
|
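
The base-image job derives its tag from a hash of the Dockerfile contents plus the build arguments, so the cached image is reused until either of them changes. A minimal sketch of that cache key, runnable outside CI. The function name `compute_docker_tag` and the temporary Dockerfile are illustrative assumptions and not part of this PR:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper (not part of the PR): mirrors the before_script line
# that hashes the Dockerfile contents together with the build arguments.
compute_docker_tag() {
    local dockerfile="$1"
    local build_args="$2"
    echo "$(cat "${dockerfile}") ${build_args}" | sha256sum | head -c 16
}

# Illustrative input only; in CI the file is ci/docker/base_mpi.Dockerfile.
tmp_dockerfile="$(mktemp)"
printf 'FROM ubuntu:25.04\n' > "${tmp_dockerfile}"

# Same Dockerfile, different build arguments (the second version is made up).
tag_a="$(compute_docker_tag "${tmp_dockerfile}" '["ARCH=aarch64", "PYVERSION=3.10.9"]')"
tag_b="$(compute_docker_tag "${tmp_dockerfile}" '["ARCH=aarch64", "PYVERSION=3.11.0"]')"
rm -f "${tmp_dockerfile}"

echo "tag for first argument set:  ${tag_a}"
echo "tag for second argument set: ${tag_b}"
```

Because the build arguments are part of the hashed string, changing `PYVERSION` yields a different tag, and with `CSCS_REBUILD_POLICY: if-not-exists` that should trigger a fresh base-image build.
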
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,27 @@ | ||
| FROM ubuntu:25.04 | ||
|
|
||
| ENV LANG=C.UTF-8 | ||
| ENV LC_ALL=C.UTF-8 | ||
|
|
||
| ARG DEBIAN_FRONTEND=noninteractive | ||
| RUN apt-get update -qq && apt-get install -qq -y --no-install-recommends \ | ||
| strace \ | ||
| build-essential \ | ||
| tar \ | ||
| wget \ | ||
| curl \ | ||
| libboost-dev \ | ||
| libnuma-dev \ | ||
| libopenmpi-dev \ | ||
| ca-certificates \ | ||
| libssl-dev \ | ||
| autoconf \ | ||
| automake \ | ||
| libtool \ | ||
| pkg-config \ | ||
| libreadline-dev \ | ||
| git && \ | ||
| rm -rf /var/lib/apt/lists/* | ||
|
|
||
| # Install uv: https://docs.astral.sh/uv/guides/integration/docker | ||
| COPY --from=ghcr.io/astral-sh/uv:0.9.24@sha256:816fdce3387ed2142e37d2e56e1b1b97ccc1ea87731ba199dc8a25c04e4997c5 /uv /uvx /bin/ |
|
Contributor
Author
The single ci runs tests through So I install the
Contributor
This sounds like a good thing even for different test jobs (not distributed, just different components/backends etc.).
Contributor
It may be possible to get However, I would keep this first version simple and look into the above as a follow up if needed. |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,11 @@ | ||
| ARG BASE_IMAGE | ||
| FROM $BASE_IMAGE | ||
|
|
||
| COPY . /icon4py | ||
| WORKDIR /icon4py | ||
|
|
||
| ARG PYVERSION | ||
| ARG VENV | ||
| ENV UV_PROJECT_ENVIRONMENT=$VENV | ||
| ENV MPI4PY_BUILD_BACKEND="scikit-build-core" | ||
| RUN uv sync --extra distributed --python=$PYVERSION |
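
The checkout Dockerfile takes three build arguments (`BASE_IMAGE`, `PYVERSION`, `VENV`), which the CI job passes through `DOCKER_BUILD_ARGS`. A rough sketch of an equivalent local `docker build` invocation, printed as a dry run. The base image name and the local tag are placeholder assumptions, not values used by the pipeline:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Placeholder values (assumptions); in CI these come from pipeline variables.
base_image="registry.example/base/icon4py:local-mpi"
pyversion="3.10.9"
venv="venv_dist"

# Assemble the invocation as an array so each argument stays intact.
build_cmd=(
    docker build
    --file ci/docker/checkout_mpi.Dockerfile
    --build-arg "BASE_IMAGE=${base_image}"
    --build-arg "PYVERSION=${pyversion}"
    --build-arg "VENV=${venv}"
    --tag "icon4py-ci:local-${venv}-${pyversion}-mpi"
    .
)

# Dry run: print the command rather than executing it.
# Replace the printf lines with "${build_cmd[@]}" to actually build.
printf '%q ' "${build_cmd[@]}"
printf '\n'
```

The build context is the repository root (`.`) because the Dockerfile copies the whole checkout into `/icon4py`.
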
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,28 @@ | ||
| #!/usr/bin/env bash | ||
|
|
||
| # Log all output to separate logfiles, stored as artifacts in gitlab. Output to | ||
| # stdout only from rank 0. | ||
|
|
||
| set -euo pipefail | ||
|
|
||
| # Check a few different possibilities for the rank. | ||
| if [[ -n "${PMI_RANK:-}" ]]; then | ||
| rank="${PMI_RANK}" | ||
| elif [[ -n "${OMPI_COMM_WORLD_RANK:-}" ]]; then | ||
| rank="${OMPI_COMM_WORLD_RANK}" | ||
| elif [[ -n "${SLURM_PROCID:-}" ]]; then | ||
| rank="${SLURM_PROCID}" | ||
| else | ||
| echo "Could not determine MPI rank. Set PMI_RANK, OMPI_COMM_WORLD_RANK, or SLURM_PROCID." >&2 | ||
| exit 1 | ||
| fi | ||
|
|
||
| log_file="${CI_PROJECT_DIR:+${CI_PROJECT_DIR}/}pytest-log-rank-${rank}.txt" | ||
|
|
||
| if [[ "${rank}" -eq 0 ]]; then | ||
| echo "Starting pytest on rank ${rank}, logging to stdout and ${log_file}" | ||
| # Quote "$@" so arguments containing spaces are passed through unchanged. | ||
| "$@" |& tee "${log_file}" | ||
| else | ||
| echo "Starting pytest on rank ${rank}, logging to ${log_file}" | ||
| "$@" &> "${log_file}" | ||
| fi |
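
The wrapper's rank lookup and log-file naming can be tried without an MPI launcher by setting the environment variables by hand. A small sketch under that assumption; the helper names `detect_rank` and `log_file_for_rank` and the path `/builds/icon4py` are hypothetical and only restate the logic of the script above:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: same precedence as the wrapper script,
# PMI_RANK first, then OMPI_COMM_WORLD_RANK, then SLURM_PROCID.
detect_rank() {
    if [[ -n "${PMI_RANK:-}" ]]; then
        echo "${PMI_RANK}"
    elif [[ -n "${OMPI_COMM_WORLD_RANK:-}" ]]; then
        echo "${OMPI_COMM_WORLD_RANK}"
    elif [[ -n "${SLURM_PROCID:-}" ]]; then
        echo "${SLURM_PROCID}"
    else
        echo "Could not determine MPI rank." >&2
        return 1
    fi
}

# Hypothetical helper: prefix the log name with CI_PROJECT_DIR only if it is set.
log_file_for_rank() {
    echo "${CI_PROJECT_DIR:+${CI_PROJECT_DIR}/}pytest-log-rank-$1.txt"
}

# Simulate different launchers by setting one variable at a time in a subshell.
rank_pmi="$(unset OMPI_COMM_WORLD_RANK SLURM_PROCID; PMI_RANK=2 detect_rank)"
rank_ompi="$(unset PMI_RANK SLURM_PROCID; OMPI_COMM_WORLD_RANK=1 detect_rank)"
rank_slurm="$(unset PMI_RANK OMPI_COMM_WORLD_RANK; SLURM_PROCID=3 detect_rank)"
# When several variables are set, PMI_RANK takes precedence.
rank_both="$(PMI_RANK=0 SLURM_PROCID=5 detect_rank)"

log_local="$(unset CI_PROJECT_DIR; log_file_for_rank 0)"
log_ci="$(CI_PROJECT_DIR=/builds/icon4py log_file_for_rank 1)"

echo "ranks: ${rank_pmi} ${rank_ompi} ${rank_slurm} ${rank_both}"
echo "logs:  ${log_local} ${log_ci}"
```

The resulting file names match the `pytest-log-rank-*.txt` artifact pattern in the CI configuration, so each rank's log is collected separately.
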