@wdeconinck
Member

When the feature "ECTRANS_GPU" is enabled, atlas now offloads all possible spectral transforms to ectrans with the GPU backend.
Note that not all functionality is implemented yet; calling unimplemented functionality throws a not-implemented exception.
By default, the unit tests ignore features that are not implemented, as signalled by this exception.
The exception handling depends on an ectrans pull request: ecmwf-ifs/ectrans#193
Without that ectrans pull request, the tests will compile but abort/crash at run time.
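
As a rough illustration of the "ignore not-implemented features" behaviour described above, the following is a minimal sketch, not the actual Atlas test harness. The names `NotImplementedError`, `Outcome`, `run_case` and `unsupported_transform` are hypothetical stand-ins introduced only for this example; the real exception type and test macros in Atlas and ectrans may differ.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>

// Hypothetical stand-in for the not-implemented exception;
// this is NOT the real Atlas/ectrans exception type.
struct NotImplementedError : std::runtime_error {
    using std::runtime_error::runtime_error;
};

enum class Outcome { Passed, Skipped, Failed };

// Runs one test body. A NotImplementedError marks the case as skipped
// rather than failed; any other std::exception counts as a failure.
inline Outcome run_case(const std::function<void()>& body) {
    assert(body);  // an empty std::function cannot be called
    try {
        body();
        return Outcome::Passed;
    }
    catch (const NotImplementedError&) {
        return Outcome::Skipped;
    }
    catch (const std::exception&) {
        return Outcome::Failed;
    }
}

// Hypothetical transform entry point that the GPU backend does not support yet.
inline void unsupported_transform() {
    throw NotImplementedError("transform not implemented for GPU backend");
}
```

With this sketch, `run_case(unsupported_transform)` yields `Outcome::Skipped`, while a body that throws any other exception yields `Outcome::Failed`. This only works if the exception actually propagates to the caller, which is presumably why the ectrans pull request is needed to avoid a run-time abort.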

@MarekWlasak
Contributor

Tagging @fmahebert FYI

@l90lpa
Contributor

l90lpa commented Apr 9, 2025

Hi @wdeconinck, I'm wondering what the current status of this PR is. The reason I ask is that I've tried to build and run it locally with a GPU-enabled build of ecTrans, but I'm getting errors running some of the Atlas trans tests. For example, in the test_nomesh case of atlas_test_trans, the following comparison of spf fails (it comes after the scatter call that internally uses ecTrans's dist_spec):

EXPECT(int(sp(real)) == +m * spectral.truncation() + n);

I've built ecTrans using the NVHPC/25.1 compilers and the HPC-X MPI implementation (OpenMPI 4.1.7) that the SDK comes with, and all the ecTrans tests pass (CPU and GPU). However, when I link Atlas to ecTrans and run the tests, I get the failures mentioned above. I'm starting to wonder whether I'm not building things correctly or I'm missing some runtime flag. Would you be able to share how you've built this branch of Atlas and ecTrans?

@wdeconinck
Member Author

Hi @l90lpa, I have just tested this with NVHPC 22.11 and saw no issues like that.

My loaded modules:

  1) cmake/3.28.3   2) prgenv/nvidia   3) gcc/11.2.0   4) nvidia/22.11   5) hpcx-openmpi/2.14.0-cuda   6) eigen/3.4.0   7) fftw/3.3.10   8) ninja/1.11.1

Note I am not using the openmpi that came with the SDK here.

I built the following projects with these CMake options:
fiat: -DENABLE_MPI=ON
ectrans: -DENABLE_ACC=ON -DENABLE_GPU=ON
atlas: -DENABLE_ACC=ON -DENABLE_CUDA=ON -DENABLE_ECTRANS=ON -DENABLE_ECTRANS_GPU=ON

@wdeconinck wdeconinck force-pushed the feature/ectrans-gpu branch from cd69b7e to 537cc8c Compare April 11, 2025 09:11
@wdeconinck
Member Author

Now rebased on latest release.

@github-actions

Private downstream CI failed.
Workflow name: private-downstream-ci
View the logs at https://github.com/ecmwf/private-downstream-ci/actions/runs/14400765285.

@l90lpa
Contributor

l90lpa commented Apr 11, 2025

Hi @wdeconinck, thanks for getting back to me and sharing your build set-up! I'll try to recreate a similar environment and see if I have better luck.

@l90lpa
Contributor

l90lpa commented Apr 22, 2025

Hi @wdeconinck, thanks again for sharing your build environment. I was able to get Atlas+ecTrans working using NVHPC 22.11. However, I've been having trouble building some of our code (and dependencies) with NVHPC 22.11 compilers, and so I was wondering if you have a build environment with a recent version of NVHPC that you know works? The reason I ask is because I seem to get test failures when I move to newer versions of NVHPC as mentioned above.

@wdeconinck
Member Author

I could reproduce some issues with nvidia/24.5. The issues seem not to stem from using ectrans-gpu.
I will try to fix or work around them separately from this PR, and then rebase this PR on develop once that fix is merged.

@wdeconinck
Member Author

wdeconinck commented May 6, 2025

I have managed to compile atlas with nvidia/24.5 and nvidia/24.11 using #278. I have rebased this branch including these changes. It should now work.

Another thing... By default all atlas tests are run with floating-point-exception trapping enabled.
For nvidia versions later than 22.11, it seems that some intrinsic functions like atan2(y,x) are compiled to AVX2-optimised versions (depending on optimization level) that still signal FE_DIVBYZERO, even when the call is protected with

if(x!=0) atan2(y,x)

because the masking in vectorised code comes after the signal has been sent with AVX2. For this reason it may be required to turn off floating-point-exception trapping (only for running the tests). You can do this in the environment with

export ATLAS_FPE=0
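
To check whether a given build is affected, one option is to inspect the floating-point status flags around the guarded call. The following is a minimal sketch using the standard `<cfenv>` facilities; `guarded_atan2` and `raises_divbyzero` are hypothetical helper names introduced for this example and are not part of Atlas. It only reads the FE_DIVBYZERO status flag and does not enable trapping, and whether the flag is set for the guarded call is expected to depend on compiler, version and optimization level.

```cpp
#include <cassert>
#include <cfenv>
#include <cmath>

// Guarded call in the style of `if(x!=0) atan2(y,x)`.
// `fallback` is a hypothetical parameter for the value used when x == 0.
inline double guarded_atan2(double y, double x, double fallback) {
    return (x != 0.0) ? std::atan2(y, x) : fallback;
}

// Returns true if calling `f` leaves the FE_DIVBYZERO status flag set.
// This inspects the flag only; it does not enable trapping.
template <typename F>
bool raises_divbyzero(F&& f) {
    const int cleared = std::feclearexcept(FE_DIVBYZERO);
    assert(cleared == 0);  // 0 means the flag was cleared successfully
    (void)cleared;         // silence unused-variable warnings under NDEBUG
    f();
    return std::fetestexcept(FE_DIVBYZERO) != 0;
}
```

Wrapping a loop over `guarded_atan2` in `raises_divbyzero` and compiling it with the same optimization flags as the tests would show whether the vectorised code path sets the flag despite the `x != 0` guard. If it does, running with ATLAS_FPE=0 as described above avoids the trap.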

@wdeconinck wdeconinck force-pushed the feature/ectrans-gpu branch from 537cc8c to 01b66a4 Compare May 6, 2025 11:57
@ecmwf ecmwf deleted a comment from github-actions bot Sep 30, 2025