Interface ectrans with GPU backend #252
base: develop
Conversation
Tagging @fmahebert FYI
Hi @wdeconinck, I'm wondering what the current status of this PR is? The reason I ask is that I've tried to build and run it locally with a GPU-enabled build of ecTrans, but I'm getting errors running some of the Atlas trans tests. For example, the failure occurs at atlas/src/tests/trans/test_trans.cc line 377 (commit f988397), on the line ending with `dist_spec)`.
I've built ecTrans using the NVHPC 25.1 compilers and the HPC-X MPI implementation (OpenMPI 4.1.7) that the SDK comes with, and all the ecTrans tests pass (CPU and GPU). However, when I link Atlas to ecTrans and run the tests, I get the failures mentioned above. I'm starting to wonder whether I might not be building things correctly, or whether I'm missing some runtime flag. Would you be able to share how you've built this branch of Atlas and ecTrans?
Hi @l90lpa, I have just tested this with NVHPC 22.11 and saw no issues like that. My loaded modules:
Note I am not using the OpenMPI that came with the SDK here. I built the following projects with these cmake options:
Force-pushed from cd69b7e to 537cc8c.
Now rebased on the latest release.
Private downstream CI failed.
Hi @wdeconinck, thanks for getting back to me and sharing your build setup! I'll try to recreate a similar environment and see if I have better luck.
Hi @wdeconinck, thanks again for sharing your build environment. I was able to get Atlas + ecTrans working using NVHPC 22.11. However, I've been having trouble building some of our code (and its dependencies) with the NVHPC 22.11 compilers, so I was wondering whether you have a build environment with a more recent version of NVHPC that you know works? As mentioned above, I seem to get test failures when I move to newer versions of NVHPC.
I could reproduce some issues with nvidia/24.5. The issues do not seem to stem from using ectrans-gpu.
I have managed to compile atlas with nvidia/24.5 and nvidia/24.11 using #278, and I have rebased this branch to include those changes. It should now work. One more thing: by default, all atlas tests run with floating-point-exception trapping enabled. This can raise false positives for guarded code such as `if (x != 0) atan2(y, x)`, because in vectorised AVX2 code the masking is applied only after the signal has already been sent. For this reason it may be required to turn off floating-point-exception trapping (only for running the tests). You can do this in the environment with `export ATLAS_FPE=0`.
Force-pushed from 537cc8c to 01b66a4.
When the feature "ECTRANS_GPU" is enabled, atlas will now offload all possible spectral transforms to ectrans with the GPU backend.
Note that, as of now, not all functionality is implemented; a not-implemented exception will be thrown for unsupported features.
By default, the unit tests ignore the not-implemented features, which are detected via this exception.
The exception handling depends on an ectrans pull request: ecmwf-ifs/ectrans#193
Without that ectrans pull request, the tests will compile but abort/crash at run time.