Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Frontier: Additional post-maintenance updates. #2918

Closed
wants to merge 4 commits into from

Conversation

dqwu
Copy link
Contributor

@dqwu dqwu commented Jul 23, 2024

This PR is a follow-up to PR #2915, upgrading ROCm and CCE on the
frontier-scream-gpu machine to prevent occasional segmentation
faults during MPI_Init (refer to OLCFDEV-1655).

dqwu added 3 commits July 22, 2024 15:58
Upgrade ROCm from 5.4.0 to 5.7.0 to avoid occasional segmentation
faults during MPI_Init (see OLCFDEV-1655).

Downgrade libunwind from 1.6.2 to 1.5.0 to prevent ums002/default
(required to load libunwind 1.6.2) from resetting ROCm to 5.3.0.

Upgrade CCE from 15.0.1 to 17.0.0 to ensure compatibility with ROCm
5.7.0.

Update HDF5, NetCDF, PnetCDF, and ADIOS libraries accordingly.
The -hzero flag causes an Internal Compiler Error (ICE) on
PhenologyMod.F90 when using cce/17.0.0.

This workaround was provided by Andrew Bradley.
This workaround was provided by Andrew Bradley.

According to Gautam Bisht, the code commented out by Andrew is
only used if BGC is on in the land model.
@dqwu dqwu added cmake Concerns CMake related files or logic Machine File Frontier labels Jul 23, 2024
@dqwu dqwu requested a review from ambrad July 23, 2024 16:47
@ambrad
Copy link
Member

ambrad commented Jul 23, 2024

@jgfouca, from the perspective of downstream merges (scream -> e3sm) is it OK if we modify an ELM file in a way that should not be merged to the upstream repo? Or should we keep this kind of change on a branch? Thanks.

The specific change is this commit: 19a53ec

ambrad
ambrad previously approved these changes Jul 23, 2024
@ambrad
Copy link
Member

ambrad commented Jul 23, 2024

IIRC, we'll see some performance degradation with rocm/5.7, perhaps 10-20%.

@dqwu
Copy link
Contributor Author

dqwu commented Jul 23, 2024

@ambrad This PR is almost identical to your cce/17.0.0 approach branch, except that rocm/5.7.0 is being used. The only remaining issue, which is reproducible with some ne256 decadal runs, is the post-condition property check failure:

Error! Failed post-condition property check (cannot be repaired).
  - Atmosphere process name: p3

P.S. I abandoned the "cce/16.0.1 + rocm/5.5.1" configuration because it produces error messages from ROCm like the one below:
Memory access fault by GPU node-5 (Agent handle: 0x119c3000) on address 0x7ffb1a018000. Reason: Write access to a read-only page.

@jgfouca
Copy link
Member

jgfouca commented Jul 23, 2024

@ambrad , that's a good question. The risk (very high) is that I'll forgot to remove this from e3sm when I do the next downstream merge. Is there any way to ensure this change is deactivated for e3sm but not for eamxx? Preprocessor or cmake setting, something like that?

@ambrad
Copy link
Member

ambrad commented Jul 23, 2024

This might require some discussion. In the past I've suggested a machines branch limited to just commits like this. Others have suggested and perhaps preferred other approaches. @brhillman @PeterCaldwell, thoughts?

@PeterCaldwell
Copy link
Contributor

Yeah, I don't think we've figured out a satisfactory way of dealing with changes we need for a single machine but don't want for others yet. My recollection is that our previous "frontier branch" attempt ended up diverging from master and leading to mistakes (but maybe I'm remembering wrong). I recall that @ndkeen in particular was opposed to machine branches. My mild preference is to use ifdefs rather than machine branches for this use case, but I'm happy to be overruled.

@ambrad
Copy link
Member

ambrad commented Jul 24, 2024

@jgfouca you know the build system best, since you wrote it. Do you have time to implement the ifdef approach so that it's safe to merge to e3sm eventually? If so, you can push a commit to this PR. Thanks.

Copy link
Member

@ambrad ambrad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switching back to unapproved until all threads are resolved.

@ambrad ambrad self-requested a review July 24, 2024 17:12
@dqwu
Copy link
Contributor Author

dqwu commented Jul 24, 2024

@ambrad
In "OLCFDEV-1655: Occasional seg-fault during MPI_Init", there is a workaround for rocm/5.4.0 or lower versions:

{}Workaround:{*}*

Add #include <hip/hip_runtime.h> to the top the source file containing main, and run hipInit(0); prior to MPI_Init(&argc, &argv);.

If scream code is able to apply this workaround, we might still be able to use current "cce/15.0.1 + rocm/5.4.0" configuration.

@jgfouca
Copy link
Member

jgfouca commented Jul 24, 2024

@ambrad , yes, I am happy to help. It looks like we just want to ifdef out some code in that one F90 file? It looks like the code is currently ifdef'd via CCE_17_COMPILER_BUG_FIXED. So CCE_17_COMPILER_BUG_FIXED needs to be defined in order to get the original behavior. We should probably do the opposite so that the original behavior is the default and we can turn off the problematic code in eamxx.

@ambrad
Copy link
Member

ambrad commented Jul 24, 2024

@jgfouca Yes, you have all of that right.

What do you think of adding a more generic flag that we can use in all future instances for this same purpose? I'm thinking something like SCREAM_SYSTEM_WORKAROUND. To make this approach maximally flexible (example in a moment), we'd want to permit setting integer values. Then any place in general E3SM code that requires a workaround would have something like

#if !defined SCREAM_SYSTEM_WORKAROUND || SCREAM_SYSTEM_WORKAROUND != 1
... this code should run in general ...
#endif

Once that flag is available, we would be able to use it as needed without messing with the build system again.

The purpose for the integer complication is as follows. In the future one can imagine needing separate workarounds for Aurora and Frontier. We might then say that Frontier is assigned 1 and Aurora is assigned 2.

@jgfouca
Copy link
Member

jgfouca commented Jul 24, 2024

@ambrad , that sounds good. I will push to this branch shortly.

@jgfouca
Copy link
Member

jgfouca commented Jul 24, 2024

OK, I pushed. I'm checking it on frontier now.

@jgfouca
Copy link
Member

jgfouca commented Jul 24, 2024

Looks like it works:

PASS SMS_Ln300.ne30pg2_ne30pg2.F2010-SCREAMv1.frontier-scream-gpu_crayclang-scream.scream-perf_test--scream-output-preset-1 MODEL_BUILD
    Case dir: /lustre/orion/cli115/proj-shared/acmetest/e3sm_scratch/SMS_Ln300.ne30pg2_ne30pg2.F2010-SCREAMv1.frontier-scream-gpu_crayclang-scream.scream-perf_test--scream-output-preset-1.20240724_162911_xkivu3

@ambrad
Copy link
Member

ambrad commented Jul 24, 2024

The other thing we need to resolve before merging this is the P3 issue. It seems to happen fairly quickly in multiple run configurations, so I suspect there is a real issue. Danqing's point about rocm/5.4 with the hipInit workaround might be pertinent. Here again we can use the SCREAM_SYSTEM_WORKAROUND guard.

Update: Looks like the P3 failure is associated with nondeterminism.

@dqwu
Copy link
Contributor Author

dqwu commented Jul 24, 2024

The other thing we need to resolve before merging this is the P3 issue. It seems to happen fairly quickly in multiple run configurations, so I suspect there is a real issue. Danqing's point about rocm/5.4 with the hipInit workaround might be pertinent. Here again we can use the SCREAM_SYSTEM_WORKAROUND guard.

Update: Looks like the P3 failure is associated with nondeterminism.

@jayeshkrishna suspects that the post-condition check failures could be due to changes in compiler flags (e.g., removing the -hzero flag to get the code to build with CCE 17).

@ambrad
Copy link
Member

ambrad commented Jul 24, 2024

suspects that the post-condition check failures could be due to changes in compiler flags (e.g., removing the -hzero flag to get the code to build with CCE 17).

I understand, but two things are relevant here. First, more than just -hzero changed; compiler versions did. Thus, we can expect different answers for a number of reasons. Second, I've used the BFB hash capability in EAMxx to identify a clear problem, likely an application-side bug, in P3 exposed by the ROCm version change. The post-condition check fails immediately or soon after the first expression of this bug.

@dqwu
Copy link
Contributor Author

dqwu commented Jul 27, 2024

Closing in favor of PR #2923. The commits in this PR might be reused when we switch to CCE 17 or higher in the future (need fix the P3 post-condition check failure).

@dqwu dqwu closed this Jul 27, 2024
@dqwu dqwu deleted the dqwu/frontier-072224 branch July 30, 2024 02:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cmake Concerns CMake related files or logic Frontier Machine File
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants