Interface ectrans with GPU backend #252

wdeconinck · 2024-12-19T21:12:17Z

When the feature "ECTRANS_GPU" is enabled, atlas now offloads all possible spectral transforms to ectrans with the GPU backend.
Note that not all functionality is implemented yet; unimplemented features throw a not-implemented exception.
By default, the unit tests ignore unimplemented features when this exception is triggered.
The exception handling depends on an ectrans pull request: ecmwf-ifs/ectrans#193
Without that ectrans pull request, the tests will compile but abort/crash at run-time.
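
As a rough illustration of the "ignore not-implemented features" behaviour described above, here is a minimal sketch. The exception type `NotImplemented`, the `Outcome` enum and the helper `run_ignoring_not_implemented` are hypothetical names invented for this example; they are not the actual atlas or ectrans API.

```cpp
#include <functional>
#include <stdexcept>

// Hypothetical stand-in for the not-implemented exception raised by the GPU backend.
struct NotImplemented : std::runtime_error {
    using std::runtime_error::runtime_error;
};

enum class Outcome { Passed, Skipped, Failed };

// Runs a test body. A NotImplemented exception marks the test as skipped
// rather than failed; any other exception still counts as a failure.
inline Outcome run_ignoring_not_implemented(const std::function<void()>& body) {
    try {
        body();
        return Outcome::Passed;
    }
    catch (const NotImplemented&) {
        return Outcome::Skipped;
    }
    catch (const std::exception&) {
        return Outcome::Failed;
    }
}
```

A test runner built this way would report `Outcome::Skipped` for transforms the GPU backend does not support yet, while genuine errors still surface as failures.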

MarekWlasak · 2024-12-20T07:18:34Z

Tagging @fmahebert FYI

l90lpa · 2025-04-09T20:11:31Z

Hi @wdeconinck, what is the current status of this PR? I ask because I've tried to build and run it locally with a GPU-enabled build of ecTrans, but I'm getting errors when running some of the Atlas trans tests. For example, in the test_nomesh case of atlas_test_trans, the comparison of spf fails (see atlas/src/tests/trans/test_trans.cc, line 377 in f988397):

EXPECT(int(sp(real)) == +m * spectral.truncation() + n);

This check comes after the scatter call, which internally uses ecTrans's dist_spec.

I've built ecTrans using the NVHPC/25.1 compilers and the HPC-X MPI implementation (OpenMPI 4.1.7) that ships with the SDK, and all the ecTrans tests pass (CPU and GPU). However, when I link Atlas to ecTrans and run the Atlas tests, I get the failures mentioned above. I'm starting to wonder whether I'm building things incorrectly or missing some runtime flag. Would you be able to share how you've built this branch of Atlas and ecTrans?
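
For readers unfamiliar with the failing check, the expression above suggests that each spectral coefficient is filled with a value encoding its wavenumbers (m, n), so that the value can be verified after redistribution. Below is a standalone sketch of that idea. The storage layout (m-major ordering, interleaved real/imaginary parts) and both function names are assumptions made for illustration; they do not reproduce the atlas test or the ecTrans layout.

```cpp
#include <cstddef>
#include <vector>

// Fill a triangular-truncation spectrum so that the real part of the (m, n)
// coefficient holds m * truncation + n. Layout is hypothetical: m-major
// ordering with interleaved (real, imag) pairs.
std::vector<double> make_encoded_spectrum(int truncation) {
    std::vector<double> sp;
    for (int m = 0; m <= truncation; ++m) {
        for (int n = m; n <= truncation; ++n) {
            sp.push_back(static_cast<double>(m * truncation + n));  // real part
            sp.push_back(0.0);                                      // imaginary part, unused here
        }
    }
    return sp;
}

// Returns true if every real part still carries the value encoding its (m, n).
bool encoded_spectrum_is_intact(const std::vector<double>& sp, int truncation) {
    std::size_t real = 0;
    for (int m = 0; m <= truncation; ++m) {
        for (int n = m; n <= truncation; ++n) {
            if (real >= sp.size() || int(sp[real]) != m * truncation + n) {
                return false;
            }
            real += 2;  // skip over the imaginary part
        }
    }
    return real == sp.size();
}
```

If a scatter or gather step misplaces or corrupts coefficients, a check of this kind fails at the first mismatching (m, n) pair, which is the symptom reported above.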

wdeconinck · 2025-04-11T09:10:48Z

Hi @l90lpa, I have just tested this with NVHPC 22.11 and saw no such issues.

My loaded modules:

1) cmake/3.28.3 2) prgenv/nvidia 3) gcc/11.2.0 4) nvidia/22.11 5) hpcx-openmpi/2.14.0-cuda 6) eigen/3.4.0 7) fftw/3.3.10 8) ninja/1.11.1

Note that I am not using the OpenMPI that came with the SDK here.

I built the following projects with these cmake options:
fiat : -DENABLE_MPI=ON
ectrans: -DENABLE_ACC=ON -DENABLE_GPU=ON
atlas: -DENABLE_ACC=ON -DENABLE_CUDA=ON -DENABLE_ECTRANS=ON -DENABLE_ECTRANS_GPU=ON

wdeconinck · 2025-04-11T09:12:01Z

Now rebased on the latest release.

github-actions · 2025-04-11T10:42:25Z

Private downstream CI failed.
Workflow name: private-downstream-ci
View the logs at https://github.com/ecmwf/private-downstream-ci/actions/runs/14400765285.

l90lpa · 2025-04-11T12:14:57Z

Hi @wdeconinck, thanks for getting back to me and sharing your build set-up! I'll try to recreate a similar environment and see if I have better luck.

l90lpa · 2025-04-22T14:12:29Z

Hi @wdeconinck, thanks again for sharing your build environment. I was able to get Atlas+ecTrans working using NVHPC 22.11. However, I've been having trouble building some of our code (and its dependencies) with the NVHPC 22.11 compilers, so I was wondering whether you have a build environment with a recent version of NVHPC that you know works. I ask because I get the test failures mentioned above when I move to newer versions of NVHPC.

wdeconinck · 2025-04-29T21:17:21Z

I could reproduce some issues with nvidia/24.5. They do not seem to stem from using ectrans-gpu.
I will try to fix or work around them separately from this PR, and then rebase this branch on develop once that is merged.

wdeconinck · 2025-05-06T11:53:28Z

I have managed to compile atlas with nvidia/24.5 and nvidia/24.11 using #278. I have rebased this branch to include these changes. It should now work.

One more thing: by default, all atlas tests run with floating-point-exception trapping enabled.
For nvidia versions later than 22.11, it seems that some intrinsic functions such as atan2(y,x) are replaced by AVX2-optimised versions (depending on the optimization level) that still signal FE_DIVBYZERO, even when the call is guarded with

if(x!=0) atan2(y,x)

because in the vectorised AVX2 code the masking is applied only after the signal has been raised. For this reason it may be necessary to turn off floating-point-exception trapping (only for running the tests). You can do this in the environment with

export ATLAS_FPE=0
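
To narrow down which call raises the flag, one option is to inspect the floating-point status flags instead of trapping. The following is a minimal sketch using only the standard `<cfenv>` facilities. The helpers `raises_divbyzero` and `guarded_atan2` are hypothetical names for this example and not part of atlas, and the fallback value of 0.0 for x == 0 is an arbitrary assumption.

```cpp
#include <cfenv>
#include <cmath>

// Hypothetical helper: runs `f` and reports whether FE_DIVBYZERO was raised
// while it executed. It reads the status flag and does not rely on trapping.
template <typename F>
bool raises_divbyzero(F&& f) {
    std::feclearexcept(FE_ALL_EXCEPT);
    f();
    return std::fetestexcept(FE_DIVBYZERO) != 0;
}

// The guarded call pattern from the comment above. The fallback of 0.0 when
// x == 0 is an assumption made for this sketch.
inline double guarded_atan2(double y, double x) {
    return (x != 0.0) ? std::atan2(y, x) : 0.0;
}
```

Wrapping a suspect loop in `raises_divbyzero` under different compilers and optimization levels would show whether a vectorised build sets the flag where a scalar build does not. Note that compilers may reorder floating-point operations around flag accesses when optimising, so results from such a probe should be treated as indicative only.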

wdeconinck mentioned this pull request Dec 19, 2024

Information request - using GPU-offloaded ecTrans via Atlas ecmwf-ifs/ectrans#178

Open

wdeconinck force-pushed the feature/ectrans-gpu branch from cd69b7e to 537cc8c Compare April 11, 2025 09:11

wdeconinck added 2 commits May 6, 2025 11:57

Link and run tests with transi_gpu_dp

721a758

Improved error handling for maybe uninmplemented ectrans GPU features

01b66a4

wdeconinck force-pushed the feature/ectrans-gpu branch from 537cc8c to 01b66a4 Compare May 6, 2025 11:57

ecmwf deleted a comment from github-actions bot Sep 30, 2025
