[WIP] Frontier: Additional post-maintenance updates. #2918

dqwu · 2024-07-23T16:47:54Z

This PR is a follow-up to PR #2915, upgrading ROCm and CCE on the
frontier-scream-gpu machine to prevent occasional segmentation
faults during MPI_Init (refer to OLCFDEV-1655).

Upgrade ROCm from 5.4.0 to 5.7.0 to avoid occasional segmentation faults during MPI_Init (see OLCFDEV-1655). Downgrade libunwind from 1.6.2 to 1.5.0 to prevent ums002/default (required to load libunwind 1.6.2) from resetting ROCm to 5.3.0. Upgrade CCE from 15.0.1 to 17.0.0 to ensure compatibility with ROCm 5.7.0. Update HDF5, NetCDF, PnetCDF, and ADIOS libraries accordingly.

The -hzero flag causes an Internal Compiler Error (ICE) on PhenologyMod.F90 when using cce/17.0.0. This workaround was provided by Andrew Bradley.

This workaround was provided by Andrew Bradley. According to Gautam Bisht, the code commented out by Andrew is only used if BGC is on in the land model.

ambrad · 2024-07-23T16:50:31Z

@jgfouca, from the perspective of downstream merges (scream -> e3sm) is it OK if we modify an ELM file in a way that should not be merged to the upstream repo? Or should we keep this kind of change on a branch? Thanks.

The specific change is this commit: 19a53ec

ambrad · 2024-07-23T16:58:26Z

IIRC, we'll see some performance degradation with rocm/5.7, perhaps 10-20%.

dqwu · 2024-07-23T17:02:11Z

@ambrad This PR is almost identical to your cce/17.0.0 approach branch, except that rocm/5.7.0 is being used. The only remaining issue, which is reproducible with some ne256 decadal runs, is the post-condition property check failure:

Error! Failed post-condition property check (cannot be repaired).
  - Atmosphere process name: p3

P.S. I abandoned the "cce/16.0.1 + rocm/5.5.1" configuration because it produces error messages from ROCm like the one below:
Memory access fault by GPU node-5 (Agent handle: 0x119c3000) on address 0x7ffb1a018000. Reason: Write access to a read-only page.

jgfouca · 2024-07-23T17:48:07Z

@ambrad , that's a good question. The risk (very high) is that I'll forgot to remove this from e3sm when I do the next downstream merge. Is there any way to ensure this change is deactivated for e3sm but not for eamxx? Preprocessor or cmake setting, something like that?

ambrad · 2024-07-23T18:00:32Z

This might require some discussion. In the past I've suggested a machines branch limited to just commits like this. Others have suggested and perhaps preferred other approaches. @brhillman @PeterCaldwell, thoughts?

PeterCaldwell · 2024-07-23T19:18:52Z

Yeah, I don't think we've figured out a satisfactory way of dealing with changes we need for a single machine but don't want for others yet. My recollection is that our previous "frontier branch" attempt ended up diverging from master and leading to mistakes (but maybe I'm remembering wrong). I recall that @ndkeen in particular was opposed to machine branches. My mild preference is to use ifdefs rather than machine branches for this use case, but I'm happy to be overruled.

ambrad · 2024-07-24T04:09:44Z

@jgfouca you know the build system best, since you wrote it. Do you have time to implement the ifdef approach so that it's safe to merge to e3sm eventually? If so, you can push a commit to this PR. Thanks.

ambrad

Switching back to unapproved until all threads are resolved.

dqwu · 2024-07-24T17:23:33Z

@ambrad
In "OLCFDEV-1655: Occasional seg-fault during MPI_Init", there is a workaround for rocm/5.4.0 or lower versions:

{}Workaround:{*}*

Add #include <hip/hip_runtime.h> to the top the source file containing main, and run hipInit(0); prior to MPI_Init(&argc, &argv);.

If scream code is able to apply this workaround, we might still be able to use current "cce/15.0.1 + rocm/5.4.0" configuration.

jgfouca · 2024-07-24T17:58:35Z

@ambrad , yes, I am happy to help. It looks like we just want to ifdef out some code in that one F90 file? It looks like the code is currently ifdef'd via CCE_17_COMPILER_BUG_FIXED. So CCE_17_COMPILER_BUG_FIXED needs to be defined in order to get the original behavior. We should probably do the opposite so that the original behavior is the default and we can turn off the problematic code in eamxx.

ambrad · 2024-07-24T18:35:18Z

@jgfouca Yes, you have all of that right.

What do you think of adding a more generic flag that we can use in all future instances for this same purpose? I'm thinking something like SCREAM_SYSTEM_WORKAROUND. To make this approach maximally flexible (example in a moment), we'd want to permit setting integer values. Then any place in general E3SM code that requires a workaround would have something like

#if !defined SCREAM_SYSTEM_WORKAROUND || SCREAM_SYSTEM_WORKAROUND != 1
... this code should run in general ...
#endif

Once that flag is available, we would be able to use it as needed without messing with the build system again.

The purpose for the integer complication is as follows. In the future one can imagine needing separate workarounds for Aurora and Frontier. We might then say that Frontier is assigned 1 and Aurora is assigned 2.

jgfouca · 2024-07-24T19:02:16Z

@ambrad , that sounds good. I will push to this branch shortly.

jgfouca · 2024-07-24T20:29:28Z

OK, I pushed. I'm checking it on frontier now.

jgfouca · 2024-07-24T20:50:43Z

Looks like it works:

PASS SMS_Ln300.ne30pg2_ne30pg2.F2010-SCREAMv1.frontier-scream-gpu_crayclang-scream.scream-perf_test--scream-output-preset-1 MODEL_BUILD
    Case dir: /lustre/orion/cli115/proj-shared/acmetest/e3sm_scratch/SMS_Ln300.ne30pg2_ne30pg2.F2010-SCREAMv1.frontier-scream-gpu_crayclang-scream.scream-perf_test--scream-output-preset-1.20240724_162911_xkivu3

ambrad · 2024-07-24T21:08:39Z

The other thing we need to resolve before merging this is the P3 issue. It seems to happen fairly quickly in multiple run configurations, so I suspect there is a real issue. Danqing's point about rocm/5.4 with the hipInit workaround might be pertinent. Here again we can use the SCREAM_SYSTEM_WORKAROUND guard.

Update: Looks like the P3 failure is associated with nondeterminism.

dqwu · 2024-07-24T21:15:59Z

The other thing we need to resolve before merging this is the P3 issue. It seems to happen fairly quickly in multiple run configurations, so I suspect there is a real issue. Danqing's point about rocm/5.4 with the hipInit workaround might be pertinent. Here again we can use the SCREAM_SYSTEM_WORKAROUND guard.

Update: Looks like the P3 failure is associated with nondeterminism.

@jayeshkrishna suspects that the post-condition check failures could be due to changes in compiler flags (e.g., removing the -hzero flag to get the code to build with CCE 17).

ambrad · 2024-07-24T21:38:49Z

suspects that the post-condition check failures could be due to changes in compiler flags (e.g., removing the -hzero flag to get the code to build with CCE 17).

I understand, but two things are relevant here. First, more than just -hzero changed; compiler versions did. Thus, we can expect different answers for a number of reasons. Second, I've used the BFB hash capability in EAMxx to identify a clear problem, likely an application-side bug, in P3 exposed by the ROCm version change. The post-condition check fails immediately or soon after the first expression of this bug.

dqwu · 2024-07-27T00:56:41Z

Closing in favor of PR #2923. The commits in this PR might be reused when we switch to CCE 17 or higher in the future (need fix the P3 post-condition check failure).

dqwu added 3 commits July 22, 2024 15:58

Remove -hzero flag from CMAKE_Fortran_FLAGS

42baef2

The -hzero flag causes an Internal Compiler Error (ICE) on PhenologyMod.F90 when using cce/17.0.0. This workaround was provided by Andrew Bradley.

Fix Cray Fortran 17.0.0 build errors in FireDataBaseType.F90

19a53ec

This workaround was provided by Andrew Bradley. According to Gautam Bisht, the code commented out by Andrew is only used if BGC is on in the land model.

dqwu added cmake Concerns CMake related files or logic Machine File Frontier labels Jul 23, 2024

dqwu requested a review from ambrad July 23, 2024 16:47

ambrad previously approved these changes Jul 23, 2024

View reviewed changes

ambrad reviewed Jul 24, 2024

View reviewed changes

ambrad self-requested a review July 24, 2024 17:12

jgfouca dismissed ambrad’s stale review via fcf8c82 July 24, 2024 20:28

Add scream system workaround

ccf8f1e

dqwu force-pushed the dqwu/frontier-072224 branch from fcf8c82 to ccf8f1e Compare July 27, 2024 00:49

dqwu closed this Jul 27, 2024

dqwu mentioned this pull request Jul 27, 2024

Frontier: Additional post-maintenance updates #2923

Merged

dqwu deleted the dqwu/frontier-072224 branch July 30, 2024 02:41

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP] Frontier: Additional post-maintenance updates. #2918

[WIP] Frontier: Additional post-maintenance updates. #2918

dqwu commented Jul 23, 2024

ambrad commented Jul 23, 2024 •

edited

Loading

ambrad commented Jul 23, 2024

dqwu commented Jul 23, 2024 •

edited

Loading

jgfouca commented Jul 23, 2024

ambrad commented Jul 23, 2024

PeterCaldwell commented Jul 23, 2024

ambrad commented Jul 24, 2024

ambrad left a comment

dqwu commented Jul 24, 2024

jgfouca commented Jul 24, 2024

ambrad commented Jul 24, 2024 •

edited

Loading

jgfouca commented Jul 24, 2024

jgfouca commented Jul 24, 2024

jgfouca commented Jul 24, 2024

ambrad commented Jul 24, 2024 •

edited

Loading

dqwu commented Jul 24, 2024

ambrad commented Jul 24, 2024

dqwu commented Jul 27, 2024

[WIP] Frontier: Additional post-maintenance updates. #2918

[WIP] Frontier: Additional post-maintenance updates. #2918

Conversation

dqwu commented Jul 23, 2024

ambrad commented Jul 23, 2024 • edited Loading

ambrad commented Jul 23, 2024

dqwu commented Jul 23, 2024 • edited Loading

jgfouca commented Jul 23, 2024

ambrad commented Jul 23, 2024

PeterCaldwell commented Jul 23, 2024

ambrad commented Jul 24, 2024

ambrad left a comment

Choose a reason for hiding this comment

dqwu commented Jul 24, 2024

jgfouca commented Jul 24, 2024

ambrad commented Jul 24, 2024 • edited Loading

jgfouca commented Jul 24, 2024

jgfouca commented Jul 24, 2024

jgfouca commented Jul 24, 2024

ambrad commented Jul 24, 2024 • edited Loading

dqwu commented Jul 24, 2024

ambrad commented Jul 24, 2024

dqwu commented Jul 27, 2024

ambrad commented Jul 23, 2024 •

edited

Loading

dqwu commented Jul 23, 2024 •

edited

Loading

ambrad commented Jul 24, 2024 •

edited

Loading

ambrad commented Jul 24, 2024 •

edited

Loading