[WIP] Frontier: Additional post-maintenance updates. #2918
Upgrade ROCm from 5.4.0 to 5.7.0 to avoid occasional segmentation faults during MPI_Init (see OLCFDEV-1655). Downgrade libunwind from 1.6.2 to 1.5.0 to prevent ums002/default (required to load libunwind 1.6.2) from resetting ROCm to 5.3.0. Upgrade CCE from 15.0.1 to 17.0.0 to ensure compatibility with ROCm 5.7.0. Update HDF5, NetCDF, PnetCDF, and ADIOS libraries accordingly.
The -hzero flag causes an Internal Compiler Error (ICE) on PhenologyMod.F90 when using cce/17.0.0. This workaround was provided by Andrew Bradley.
This workaround was provided by Andrew Bradley. According to Gautam Bisht, the code commented out by Andrew is only used if BGC is on in the land model.
IIRC, we'll see some performance degradation with rocm/5.7, perhaps 10-20%.
@ambrad This PR is almost identical to your cce/17.0.0 approach branch, except that rocm/5.7.0 is being used. The only remaining issue, which is reproducible with some ne256 decadal runs, is the post-condition property check failure:
P.S. I abandoned the "cce/16.0.1 + rocm/5.5.1" configuration because it produces error messages from ROCm like the one below:
@ambrad, that's a good question. The risk (very high) is that I'll forget to remove this from e3sm when I do the next downstream merge. Is there any way to ensure this change is deactivated for e3sm but not for eamxx? Preprocessor or cmake setting, something like that?
This might require some discussion. In the past I've suggested a machines branch limited to just commits like this. Others have suggested and perhaps preferred other approaches. @brhillman @PeterCaldwell, thoughts?
Yeah, I don't think we've figured out a satisfactory way of dealing with changes we need for a single machine but don't want for others yet. My recollection is that our previous "frontier branch" attempt ended up diverging from master and leading to mistakes (but maybe I'm remembering wrong). I recall that @ndkeen in particular was opposed to machine branches. My mild preference is to use ifdefs rather than machine branches for this use case, but I'm happy to be overruled.
@jgfouca you know the build system best, since you wrote it. Do you have time to implement the ifdef approach so that it's safe to merge to e3sm eventually? If so, you can push a commit to this PR. Thanks.
Switching back to unapproved until all threads are resolved.
@ambrad
If scream code is able to apply this workaround, we might still be able to use the current "cce/15.0.1 + rocm/5.4.0" configuration.
@ambrad, yes, I am happy to help. It looks like we just want to ifdef out some code in that one F90 file? It looks like the code is currently ifdef'd via CCE_17_COMPILER_BUG_FIXED. So CCE_17_COMPILER_BUG_FIXED needs to be defined in order to get the original behavior. We should probably do the opposite so that the original behavior is the default and we can turn off the problematic code in eamxx.
@jgfouca Yes, you have all of that right. What do you think of adding a more generic flag that we can use in all future instances for this same purpose? I'm thinking something like SCREAM_SYSTEM_WORKAROUND. To make this approach maximally flexible (example in a moment), we'd want to permit setting integer values. Then any place in general E3SM code that requires a workaround would have something like
Once that flag is available, we would be able to use it as needed without messing with the build system again. The purpose for the integer complication is as follows. In the future one can imagine needing separate workarounds for Aurora and Frontier. We might then say that Frontier is assigned 1 and Aurora is assigned 2.
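For readers following along, the integer-valued guard described above could look roughly like the sketch below. This is not the snippet from the original comment. The function names and the default value of 0 are made up for illustration; only the flag name and the proposed Frontier = 1 / Aurora = 2 assignment come from this thread. It is written in C because the guard is an ordinary preprocessor conditional; the same `#if` test would presumably be used in a preprocessed `.F90` file.

```c
/* Hypothetical default: the build system would normally pass
   -DSCREAM_SYSTEM_WORKAROUND=<n>; 0 here means "no workaround". */
#ifndef SCREAM_SYSTEM_WORKAROUND
#define SCREAM_SYSTEM_WORKAROUND 0
#endif

/* Returns 1 when the original (unmodified) code path is compiled in,
   0 when a machine-specific workaround has replaced it. */
int original_path_enabled(void) {
#if SCREAM_SYSTEM_WORKAROUND == 1
  /* Value 1 (Frontier in the proposal): skip the problematic code. */
  return 0;
#else
  /* Any other value, including 2 (Aurora): keep the original code. */
  return 1;
#endif
}

/* Returns the workaround id this translation unit was built with. */
int workaround_id(void) {
  return SCREAM_SYSTEM_WORKAROUND;
}
```

With no definition supplied, the original behavior is the default, which matches the intent stated earlier in the thread; a machine-specific build would opt in by defining the macro to its assigned integer.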
@ambrad, that sounds good. I will push to this branch shortly.
OK, I pushed. I'm checking it on frontier now.
Looks like it works:
The other thing we need to resolve before merging this is the P3 issue. It seems to happen fairly quickly in multiple run configurations, so I suspect there is a real issue. Danqing's point about rocm/5.4 with the hipInit workaround might be pertinent. Here again we can use the SCREAM_SYSTEM_WORKAROUND guard. Update: Looks like the P3 failure is associated with nondeterminism.
@jayeshkrishna suspects that the post-condition check failures could be due to changes in compiler flags (e.g., removing the -hzero flag to get the code to build with CCE 17).
I understand, but two things are relevant here. First, more than just -hzero changed; compiler versions did. Thus, we can expect different answers for a number of reasons. Second, I've used the BFB hash capability in EAMxx to identify a clear problem, likely an application-side bug, in P3 exposed by the ROCm version change. The post-condition check fails immediately or soon after the first expression of this bug.
Closing in favor of PR #2923. The commits in this PR might be reused when we switch to CCE 17 or higher in the future (we still need to fix the P3 post-condition check failure).
This PR is a follow-up to PR #2915, upgrading ROCm and CCE on the
frontier-scream-gpu machine to prevent occasional segmentation
faults during MPI_Init (refer to OLCFDEV-1655).