[ROCM_X] Multiple RelVals failing in ROCM_X IB #46624
A new Issue was created by @iarspider. @Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here
RelVal
assign heterogeneous
This is actually the expected behaviour, well, kind of. The test causes an error on the GPU and tries to report it. However, with the HIP/ROCm runtime, the GPU-side error or "hardware exception" results in a crash/abort of the CPU-side application. So we do want the test to fail, just maybe not this badly?
This is the same as #46624 (comment).
I'm investigating this together with @AuroraPerego . |
This is currently implemented as a CUDA-only workflow (@AdrianoDee do you know if this uses the alpaka version of the modules (then it could be changed to use
In fact, I think all … workflows are CUDA-only and should not be run for the AMD GPU tests.
When possible I'll start looking at the
Yes, all these are data RelVals using the old CUDA setup. |
All the
Yeah, turning the error into an exception (that could be checked in the test itself) would be highly desirable. |
As far as I have been able to find out, that would require making changes to the HSA and ROCm runtime. |
I'm trying to run this workflow by hand on a LUMI node with runTheMatrix.py -w gpu -l 29834.404 but it fails already during step 1, with
Do I need some other options to run it?
And, on a local machine at CERN, it passes step1 but fails step2 with
Suggestions?
Seems
Thanks, re-running cmsDriver after the right grid setup seems to have worked for this.
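For reference, the environment and grid setup meant here is roughly the standard CMS one on a CVMFS-equipped machine (a sketch; the exact release/IB name below is assumed from this issue, not verified):

```shell
# Standard CMS environment from CVMFS
source /cvmfs/cms.cern.ch/cmsset_default.sh
cmsrel CMSSW_14_2_ROCM_X_2024-11-06-2300   # IB name assumed from this issue
cd CMSSW_14_2_ROCM_X_2024-11-06-2300/src
cmsenv

# Grid proxy, needed by workflow steps that read input data from the grid
voms-proxy-init --voms cms

runTheMatrix.py -w gpu -l 29834.404
```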
After setting up CVMFS and the Grid tools on LUMI this works for me with pre3:
Edit: no, wait, it's running CPU-only...
OK, it does work:
We'll have to double check the |
In CMSSW_14_2_ROCM_X_2024-11-06-2300 we observe multiple Unit test and RelVal failures:

- HSA_STATUS_ERROR_EXCEPTION: An HSAIL operation resulted in a hardware exception
- HSA_STATUS_ERROR_EXCEPTION: An HSAIL operation resulted in a hardware exception
- Device-side assertion '0 == blockDimension % warpSize' failed., followed by HSA_STATUS_ERROR_EXCEPTION
- ModuleTypeResolverAlpaka had no backends available because of the combination of the job configuration and accelerator availability of on the machine. The job sees accelerators
- HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources
- ModuleTypeResolverAlpaka had no backends available because of the combination of the job configuration and accelerator availability of on the machine. The job sees accelerators
- ModuleTypeResolverAlpaka had no backends available because of the combination of the job configuration and accelerator availability of on the machine. The job sees accelerators
- ModuleTypeResolverAlpaka had no backends available because of the combination of the job configuration and accelerator availability of on the machine. The job sees accelerators
- roc::DmaBlitManager::hsaCopyStaged
- roc::DmaBlitManager::hsaCopyStaged
- roc::DmaBlitManager::hsaCopyStaged
- SIGABRTs, either HSA_STATUS_ERROR_EXCEPTION: An HSAIL operation resulted in a hardware exception or HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources