[ROCM_X] Multiple RelVals failing in ROCM_X IB #46624

iarspider · 2024-11-07T14:58:32Z

In CMSSW_14_2_ROCM_X_2024-11-06-2300 we observe multiple Unit test and RelVal failures:

| What failed | Description |
|---|---|
| DataFormats/SoATemplate/testRocmSoALayoutAndView_t | `HSA_STATUS_ERROR_EXCEPTION: An HSAIL operation resulted in a hardware exception` |
| HeterogeneousCore/AlpakaInterface/alpakaTestBufferROCmAsync | `HSA_STATUS_ERROR_EXCEPTION: An HSAIL operation resulted in a hardware exception` |
| HeterogeneousCore/AlpakaInterface/alpakaTestPrefixScanROCmAsync | Many `Device-side assertion '0 == blockDimension % warpSize' failed.` followed by `HSA_STATUS_ERROR_EXCEPTION` |
| Relval 141.008583 step 2 | `ModuleTypeResolverAlpaka had no backends available because of the combination of the job configuration and accelerator availability of on the machine. The job sees accelerators` |
| Relval 29834.403 step 2 | `HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources` |
| Relval 29834.404 step 2 | StdException |
| Relval 141.008507 step 3 | `ModuleTypeResolverAlpaka had no backends available because of the combination of the job configuration and accelerator availability of on the machine. The job sees accelerators` |
| Relval 141.008508 step 3 | Fatal exception: Unable to choose current device because CUDAService is not preset or disabled. If CUDAService was not explicitly disabled in the configuration, the probable cause is that there is no GPU or there is some problem in the CUDA runtime or drivers. |
| Relval 141.008513 step 3 | `ModuleTypeResolverAlpaka had no backends available because of the combination of the job configuration and accelerator availability of on the machine. The job sees accelerators` |
| Relval 141.008514 step 3 | BadAlloc |
| Relval 141.008523 step 3 | `ModuleTypeResolverAlpaka had no backends available because of the combination of the job configuration and accelerator availability of on the machine. The job sees accelerators` |
| Relval 141.008524 step 3 | BadAlloc |
| Relval 12834.402 step 3 | SIGSEGV in `roc::DmaBlitManager::hsaCopyStaged` |
| Relval 13034.402 step 3 | SIGABRT |
| Relval 13034.404 step 3 | SIGABRT |
| Relval 13034.406 step 3 | SIGABRT |
| Relval 13034.408 step 3 | SIGABRT |
| Relval 13050.402 step 3 | SIGABRT |
| Relval 13050.404 step 3 | SIGABRT |
| Relval 13050.406 step 3 | SIGSEGV in `roc::DmaBlitManager::hsaCopyStaged` |
| Relval 13050.408 step 3 | SIGABRT |
| Relval 13061.402 step 3 | SIGSEGV in `roc::DmaBlitManager::hsaCopyStaged` |
| Relval 29634.402 step 3 | SIGABRT |
| Relval 29834.402 step 3 | SIGABRT |
| Relval 160.03502 step 4 | BadAlloc |

(The SIGABRTs are either `HSA_STATUS_ERROR_EXCEPTION: An HSAIL operation resulted in a hardware exception` or `HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources`.)


cmsbuild · 2024-11-07T14:58:52Z

cms-bot internal usage

cmsbuild · 2024-11-07T14:58:53Z

A new Issue was created by @iarspider.

@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

aandvalenzuela · 2024-11-07T15:03:07Z

RelVal 160.03502 should be disabled for ROCM IBs since it is a CUDA-only workflow.

makortel · 2024-11-07T15:04:27Z

assign heterogeneous

cmsbuild · 2024-11-07T15:04:47Z

New categories assigned: heterogeneous

@fwyzard,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

fwyzard · 2024-11-07T16:35:03Z

DataFormats/SoATemplate/testRocmSoALayoutAndView_t

This is actually the expected behaviour, well, kind of.

The test causes an error on the GPU and tries to report that. However for the HIP/ROCm runtime the GPU-side error or "hardware exception" results in a crash/abort of the CPU-side application.

So, we want the test to fail, maybe not this badly ?

fwyzard · 2024-11-07T16:35:25Z

HeterogeneousCore/AlpakaInterface/alpakaTestBufferROCmAsync

This is the same as #46624 (comment).

fwyzard · 2024-11-07T16:36:18Z

HeterogeneousCore/AlpakaInterface/alpakaTestPrefixScanROCmAsync

I'm investigating this together with @AuroraPerego.
It seems to be a problem with the test itself rather than with the functionality being tested; I should have a fix soon.

fwyzard · 2024-11-07T16:41:47Z

Relval 141.008583 step 2

This is currently implemented as a CUDA-only workflow ('--accelerators': 'gpu-nvidia').

@AdrianoDee do you know if this uses the alpaka version of the modules (in which case it could be changed to use '--accelerators': 'gpu-*') or the CUDA version (in which case it should be disabled for the AMD tests)?
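
The "no backends available" failures are consistent with this selection: a job restricted to 'gpu-nvidia' has nothing to run on when the machine only offers an AMD GPU, while a 'gpu-*' pattern would accept it as well. A minimal Python sketch of that matching, assuming shell-style wildcards and a 'gpu-amd' label for the AMD backend (both are assumptions for illustration; `select_accelerators` is a hypothetical helper, not the CMSSW implementation):

```python
from fnmatch import fnmatchcase


def select_accelerators(requested, available):
    """Return the available accelerator labels matched by any requested pattern.

    Hypothetical helper: it only illustrates wildcard matching of labels.
    """
    return [
        label
        for label in available
        if any(fnmatchcase(label, pattern) for pattern in requested)
    ]


# Assumed labels for a node with a CPU and one AMD GPU.
available = ["cpu", "gpu-amd"]

nvidia_only = select_accelerators(["gpu-nvidia"], available)
any_gpu = select_accelerators(["gpu-*"], available)

print(nvidia_only)  # [] : nothing matches, so no backend can be chosen
print(any_gpu)      # ['gpu-amd']
```

With the first selection the list of usable accelerators is empty, which is the situation the ModuleTypeResolverAlpaka message describes.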

fwyzard · 2024-11-07T16:43:01Z

In fact, I think all

Relval 141.0085xx step 3

workflows are CUDA-only and should not be run for the AMD GPU tests.

fwyzard · 2024-11-07T16:43:39Z

When possible I'll start looking at the *.40x workflows.

AdrianoDee · 2024-11-07T17:54:46Z

Yes, all these are data RelVals using the old CUDA setup.

AdrianoDee · 2024-11-07T17:56:25Z

All the 141.* + 160.*
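
Taking the 141.* and 160.* prefixes as the CUDA-only set, the RelVal failures listed in the issue description can be split as in the sketch below. The prefix rule is only a reading of this thread, not an official workflow list, and the workflow IDs are kept as strings so that no digits are lost.

```python
# Prefixes of the workflows described in this thread as CUDA-only.
CUDA_ONLY_PREFIXES = ("141.", "160.")

# Failing RelVal workflow IDs, copied from the table in the issue description.
failed = [
    "141.008583", "29834.403", "29834.404",
    "141.008507", "141.008508", "141.008513", "141.008514",
    "141.008523", "141.008524",
    "12834.402", "13034.402", "13034.404", "13034.406", "13034.408",
    "13050.402", "13050.404", "13050.406", "13050.408",
    "13061.402", "29634.402", "29834.402", "160.03502",
]

# str.startswith accepts a tuple of prefixes.
cuda_only = [wf for wf in failed if wf.startswith(CUDA_ONLY_PREFIXES)]
to_investigate = [wf for wf in failed if not wf.startswith(CUDA_ONLY_PREFIXES)]

print(len(cuda_only), "CUDA-only workflows to disable for the AMD tests")
print(len(to_investigate), "workflows left to investigate:", to_investigate)
```

Under that rule 8 of the 22 failing RelVals are CUDA-only, leaving the 14 *.40x workflows for further investigation.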

makortel · 2024-11-07T18:34:00Z

> DataFormats/SoATemplate/testRocmSoALayoutAndView_t
>
> This is actually the expected behaviour, well, kind of.
>
> The test causes an error on the GPU and tries to report that. However for the HIP/ROCm runtime the GPU-side error or "hardware exception" results in a crash/abort of the CPU-side application.
>
> So, we want the test to fail, maybe not this badly ?

Yeah, turning the error into an exception (that could be checked in the test itself) would be highly desirable.

fwyzard · 2024-11-07T20:30:30Z

> Yeah, turning the error into an exception (that could be checked in the test itself) would be highly desirable.

As far as I have been able to find out, that would require making changes to the HSA and ROCm runtime.
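
As long as the abort cannot be caught in-process, one possible workaround at the test-harness level is to run the failing code in a child process and treat termination by SIGABRT as the expected outcome. This is not something proposed in the thread, only a sketch of the idea in Python on a POSIX system; `run_expecting_abort` is a hypothetical helper, and the child here simply calls `os.abort()` as a stand-in for a test binary that hits a GPU-side hardware exception.

```python
import signal
import subprocess
import sys


def run_expecting_abort(cmd):
    """Run cmd in a child process; return (aborted, stderr).

    On POSIX, subprocess reports a child killed by signal N with
    returncode == -N, so SIGABRT shows up as -signal.SIGABRT.
    """
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode == -signal.SIGABRT, proc.stderr


# Stand-in for the real test executable: a child that aborts immediately.
aborting_child = [sys.executable, "-c", "import os; os.abort()"]
aborted, _ = run_expecting_abort(aborting_child)
print("child aborted as expected:", aborted)
```

The parent process survives the abort and can still decide whether the failure was the intended one, for example by also inspecting the captured stderr for the HSA error string.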

fwyzard · 2024-11-07T22:38:11Z

HeterogeneousCore/AlpakaInterface/alpakaTestPrefixScanROCmAsync

Fixed by #46629.

fwyzard · 2024-11-08T08:05:21Z

Relval 29834.404 step 2

I'm trying to run this workflow by hand on a LUMI node with

runTheMatrix.py -w gpu -l 29834.404

but it fails already during step 1, with

----- Begin Fatal Exception 08-Nov-2024 09:58:07 EET-----------------------
An exception of category 'Configuration' occurred while
   [0] Processing global begin LuminosityBlock run: 1 luminosityBlock: 1
   [1] Calling method for module Pythia8ConcurrentGeneratorFilter/'generator'
Exception Message:
Failed to initialize hadronizer Pythia8Hadronizer for internal parton generation
----- End Fatal Exception -------------------------------------------------

Do I need some other options to run ?

fwyzard · 2024-11-08T08:10:43Z

And, on a local machine at CERN, it passes step1 but fails step2 with

----- Begin Fatal Exception 08-Nov-2024 09:04:28 CET-----------------------
An exception of category 'NoSecondaryFiles' occurred while
   [0] Constructing the EventProcessor
   [1] Constructing module: class=MixingModule label='mix'
Exception Message:
RootEmbeddedFileSequence no input files specified for secondary input source.
----- End Fatal Exception -------------------------------------------------

Suggestions ?

AdrianoDee · 2024-11-08T08:30:12Z

> And, on a local machine at CERN, it passes step1 but fails step2 with
>
> ----- Begin Fatal Exception 08-Nov-2024 09:04:28 CET-----------------------
> An exception of category 'NoSecondaryFiles' occurred while
> [0] Constructing the EventProcessor
> [1] Constructing module: class=MixingModule label='mix'
> Exception Message:
> RootEmbeddedFileSequence no input files specified for secondary input source.
> ----- End Fatal Exception -------------------------------------------------
>
> Suggestions ?

It seems cmsDriver can't get the MinBias input files. Could it be a certificate problem?
(I've tested it locally and step2 runs.)

fwyzard · 2024-11-08T08:41:40Z

Thanks, re-running cmsDriver after the right grid setup seems to have worked for this.

fwyzard · 2024-11-10T08:41:08Z

Relval 29834.404 step 2

After setting up CVMFS and the Grid tools on LUMI this works for me with pre3:

29834.404_TTbar_14TeV+2026D110PU_Patatrack_PixelOnlyAlpaka_Profiling Step0-PASSED Step1-PASSED Step2-PASSED - time date Sun Nov 10 07:34:10 2024-date Sun Nov 10 07:05:15 2024; exit: 0 0 0
1 1 1 tests passed, 0 0 0 failed

Edit: no, wait, it's running CPU-only...

fwyzard · 2024-11-10T09:02:50Z

OK, it does work:

$ cmsRun step2_DIGI_L1TrackTrigger_L1_L1P2GT_DIGI2RAW_HLT_PU.py |& tee step2.log
%MSG-i AlpakaService:  (NoModuleName) 10-Nov-2024 10:46:38 EET pre-events
AlpakaServiceSerialSync succesfully initialised.
Found 1 device:
  - AMD EPYC 7A53 64-Core Processor
%MSG
%MSG-i ROCmService:  (NoModuleName) 10-Nov-2024 10:46:38 EET pre-events
ROCm runtime version 5.6.31062, driver version 5.6.31062, AMD driver version 6.3.6
ROCm device 0: AMD Instinct MI250X (gfx90a:sramecc+:xnack-)
%MSG
%MSG-i AlpakaService:  (NoModuleName) 10-Nov-2024 10:46:39 EET pre-events
AlpakaServiceROCmAsync succesfully initialised.
Found 1 device:
  - AMD Instinct MI250X
%MSG
10-Nov-2024 10:47:20 EET  Initiating request to open file file:step1.root
10-Nov-2024 10:47:21 EET  Successfully opened file file:step1.root
10-Nov-2024 10:47:40 EET  Initiating request to open file root://hip-cms-se.csc.fi:1094//store/relval/CMSSW_14_1_0_pre5/RelValMinBias_14TeV/GEN-SIM/140X_mcRun4_realistic_v4_RegeneratedGS_2026D110_noPU-v1/2580000/0dfa017f-9854-4ba1-a780-e8cb02f9cac1.root
...
Begin processing the 1st record. Run 1, Event 1, LumiSection 1 on stream 0 at 10-Nov-2024 10:49:52.693 EET
...
Begin processing the 10th record. Run 1, Event 10, LumiSection 1 on stream 0 at 10-Nov-2024 10:59:12.452 EET
...
10-Nov-2024 11:00:06 EET  Closed file file:step1.root
10-Nov-2024 11:00:07 EET  Closed file root://xrootd-cms.infn.it//store/relval/CMSSW_14_1_0_pre5/RelValMinBias_14TeV/GEN-SIM/140X_mcRun4_realistic_v4_RegeneratedGS_2026D110_noPU-v1/2580000/0dfa017f-9854-4ba1-a780-e8cb02f9cac1.root

$ echo $?
0

We'll have to double check the cmsbuild environment.
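
To tell a CPU-only run from one that also initialised the GPU backend, the AlpakaService lines in step2.log can be scanned. A small Python sketch, assuming the log wording shown above (including its "succesfully" spelling, which the pattern tolerates alongside the correct one); `initialised_backends` is a hypothetical helper, and an initialised backend does not by itself prove that modules were offloaded to it.

```python
import re

# Matches e.g. "AlpakaServiceROCmAsync succesfully initialised."
# "succes+fully" accepts both the log's spelling and "successfully".
BACKEND_RE = re.compile(r"AlpakaService(\w+) succes+fully initialised")


def initialised_backends(log_text):
    """Return the alpaka backend names reported as initialised in the log."""
    return BACKEND_RE.findall(log_text)


# Excerpt of the log above; in practice one would read step2.log instead.
sample_log = """\
%MSG-i AlpakaService:  (NoModuleName) 10-Nov-2024 10:46:38 EET pre-events
AlpakaServiceSerialSync succesfully initialised.
%MSG-i AlpakaService:  (NoModuleName) 10-Nov-2024 10:46:39 EET pre-events
AlpakaServiceROCmAsync succesfully initialised.
"""

backends = initialised_backends(sample_log)
rocm_seen = any(name.startswith("ROCm") for name in backends)

print(backends)   # ['SerialSync', 'ROCmAsync']
print("ROCm backend initialised:", rocm_seen)
```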

cmsbuild added the pending-assignment label Nov 7, 2024

cmsbuild added pending-signatures heterogeneous-pending and removed pending-assignment labels Nov 7, 2024
