
Conversation

@climbfuji (Collaborator) commented Jul 25, 2025

Bug fixes for ccpp_prebuild to work with partially case-insensitive capgen parser

These updates are needed to make ccpp_prebuild.py work with the recent, partially complete case-insensitive capgen parser. I tested this with NEPTUNE in a rather complicated way: pulling develop into the branch NEPTUNE uses (which is based on main), creating the bug fixes there, then cherry-picking them so that we can merge them into develop here. Hopefully, by the time this all comes back to NEPTUNE it will still work :-)
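To illustrate what case-insensitive means here (a minimal sketch only, not the actual ccpp_prebuild.py code; the function and table names below are made up), standard-name comparisons and lookups have to fold case on both sides:

def standard_names_match(name_a, name_b):
    """Compare two CCPP standard names case-insensitively."""
    return name_a.lower() == name_b.lower()

def find_variable(requested_name, metadata_table):
    """Look up a variable record by standard name, ignoring case.
    metadata_table maps standard names (in any case) to records."""
    lowered = {name.lower(): record for name, record in metadata_table.items()}
    return lowered.get(requested_name.lower())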

This PR needs to be merged into develop, then #668 must be updated before it can be merged into main.

User interface changes?: no - but prebuild is now case-insensitive

Fixes: no separate issue created, see discussion in #668

Testing:
tests removed: none
unit tests: all pass
system tests: all pass
manual testing: full regression testing with NEPTUNE underway; would like to see UFS testing, too (@dustinswales?)

@climbfuji climbfuji marked this pull request as ready for review July 25, 2025 19:50
@climbfuji climbfuji changed the title from "DRAFT Bugfix/prebuild case insensitive capgen parser" to "DRAFT Bug fixes for ccpp_prebuild to work with partially case-insensitive capgen parser" Jul 25, 2025
@climbfuji climbfuji self-assigned this Jul 25, 2025
@climbfuji climbfuji added the bugfix label Jul 25, 2025
@climbfuji climbfuji added this to the capgen unification milestone Jul 25, 2025
@dustinswales (Member)

@climbfuji I can't test this since #668 is not working in the UFS :(.

@mwaxmonsky (Collaborator) left a comment

Just a few design/Python questions.

@gold2718 (Collaborator) left a comment

Mostly okay, just a minor Fortran nit.

@climbfuji (Collaborator, Author)

> @climbfuji I can't test this since #668 is not working in the UFS :(.

This is the whole point of this PR. Use #668 as a base and pull in the changes from here (#669).

@climbfuji climbfuji changed the title DRAFT Bug fixes for ccpp_prebuild to work with partially case-insensitive capgen parser Bug fixes for ccpp_prebuild to work with partially case-insensitive capgen parser Jul 28, 2025
@climbfuji climbfuji requested a review from gold2718 July 28, 2025 13:04
@dustinswales (Member)

> @climbfuji I can't test this since #668 is not working in the UFS :(.

> This is the whole point of this PR. Use #668 as a base and pull in the changes from here (#669).

Facepalm
Testing now

@climbfuji (Collaborator, Author)

@dustinswales I found an interesting problem with the capgen/prebuild updates. We need to look at how the CCPP_interstitial DDTs are allocated. The implementation in NEPTUNE, at least, still relies in part on the old blocked data structures, and this breaks when more than one thread is used, now that horizontal_loop_extent is no longer allowed in the host model.

@dustinswales (Member) commented Jul 28, 2025

> @dustinswales I found an interesting problem with the capgen/prebuild updates. We need to look at how the CCPP_interstitial DDTs are allocated. The implementation in NEPTUNE, at least, still relies in part on the old blocked data structures, and this breaks when more than one thread is used, now that horizontal_loop_extent is no longer allowed in the host model.

@climbfuji That's interesting. Not sure if I understand the details completely. I'm stuck on something else...

I (think) I got through all of the metadata changes I needed, but I'm running into a new error when building:
ld: physics/libccpp_physics.a(ccpp_fv3_gfs_v17_coupled_p8_phys_ps_cap.F90.o): relocation R_X86_64_32S against symbol `ccpp_fv3_gfs_v17_coupled_p8_phys_ps_cap_mp_initialized_' can not be used when making a shared object; recompile with -fPIC

I think this has something to do with case sensitivity, but I haven't figured out all the details yet.

@climbfuji (Collaborator, Author)

> @dustinswales I found an interesting problem with the capgen/prebuild updates. We need to look at how the CCPP_interstitial DDTs are allocated. The implementation in NEPTUNE, at least, still relies in part on the old blocked data structures, and this breaks when more than one thread is used, now that horizontal_loop_extent is no longer allowed in the host model.

> @climbfuji That's interesting. Not sure if I understand the details completely. I'm stuck on something else...

> I (think) I got through all of the metadata changes I needed, but I'm running into a new error when building: ld: physics/libccpp_physics.a(ccpp_fv3_gfs_v17_coupled_p8_phys_ps_cap.F90.o): relocation R_X86_64_32S against symbol `ccpp_fv3_gfs_v17_coupled_p8_phys_ps_cap_mp_initialized_' can not be used when making a shared object; recompile with -fPIC

> I think this has something to do with case sensitivity, but I haven't figured out all the details yet.

You need to update the calling CMakeLists.txt that includes ccpp-framework so that it builds a static library. I am doing this in NEPTUNE:

# Force a static build of the CCPP framework
set(BUILD_SHARED_LIBS OFF)
add_subdirectory(ccpp-framework)

I think this is safe for the UFS, too. But if you have something else setting this variable higher up, then you need this:

# Save the caller's setting, force a static build for the framework, then restore it
set(BUILD_SHARED_LIBS_SAVE ${BUILD_SHARED_LIBS})
set(BUILD_SHARED_LIBS OFF)
add_subdirectory(ccpp-framework)
set(BUILD_SHARED_LIBS ${BUILD_SHARED_LIBS_SAVE})
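(The save/restore matters because a variable set in the parent CMakeLists.txt is seen by every add_subdirectory call that follows, so without restoring it, BUILD_SHARED_LIBS would stay OFF for any components added after ccpp-framework.)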

@climbfuji (Collaborator, Author)

Another update: I got the UFS code to run with the updated GFS_interstitial DDT; it is b4b between omp=1 and omp=2. I still need to check on memory footprint and performance, but at least I have a working solution now that the capgen parser refuses horizontal_loop_extent in the host model metadata's horizontal dimensions.

@climbfuji (Collaborator, Author)

> Another update: I got the UFS code to run with the updated GFS_interstitial DDT; it is b4b between omp=1 and omp=2. I still need to check on memory footprint and performance, but at least I have a working solution now that the capgen parser refuses horizontal_loop_extent in the host model metadata's horizontal dimensions.

Good news is that so far, no further changes are required to ccpp-framework (i.e., this PR).

@mwaxmonsky (Collaborator) commented Jul 28, 2025

@climbfuji Would it make sense to change the framework's BUILD_SHARED_LIBS option to CCPP_BUILD_SHARED_LIBS?

option(CCPP_BUILD_SHARED_LIBS "Build using shared libraries" ON)

set(BUILD_SHARED_LIBS ${CCPP_BUILD_SHARED_LIBS})

Then in the parent level CMake, we can do:

set(CCPP_BUILD_SHARED_LIBS OFF) # Or comment out if default for the framework is fine 
add_subdirectory(ccpp-framework)

If I understand CMake correctly, this way there shouldn't be a need to save the parent project's variable.

@climbfuji (Collaborator, Author)

> @climbfuji Would it make sense to change the framework's BUILD_SHARED_LIBS option to CCPP_BUILD_SHARED_LIBS?
>
> option(CCPP_BUILD_SHARED_LIBS "Build using shared libraries" ON)
>
> set(BUILD_SHARED_LIBS ${CCPP_BUILD_SHARED_LIBS})
>
> Then in the parent level CMake, we can do:
>
> set(CCPP_BUILD_SHARED_LIBS OFF) # Or comment out if default for the framework is fine
> add_subdirectory(ccpp-framework)
>
> If I understand CMake correctly, this way there shouldn't be a need to save the parent project's variable.

It's no problem at all to make the code work with the current name. I think it is actually better not to prefix every variable with CCPP_. If we were to do this for every component that goes into a model, we'd be dealing with dozens of variables to set. For instance, if we were to also do this for OPENMP, then, in order to turn on OpenMP, we would have to set UFS_OPENMP, CCPP_OPENMP, MOM6_OPENMP, ... instead of just OPENMP.

HDF5, for example, also does not prefix every CMake variable with HDF5_ ...

@mkavulich (Collaborator) left a comment

@climbfuji I'll be on PTO until next Tuesday, so once Dustin gives his approval feel free to merge this yourself if I'm not back yet.

@climbfuji (Collaborator, Author)

@dustinswales I feel like we should merge this given that all but one of your UFS tests pass with b4b identical results and that the one remaining test may differ because of a compiler optimization or something else UFS-related. Not merging this PR is holding back the update of main from develop, which in turn holds back updating NEPTUNE and other models. If really needed, we can always apply another bug fix to the develop branch of ccpp-framework?

@dustinswales (Member)

@climbfuji Merging this in is fine with me.
I still haven't found the cause of the one UFS RT that is failing...

@climbfuji (Collaborator, Author)

> @climbfuji Merging this in is fine with me. I still haven't found the cause of the one UFS RT that is failing...

Can you remind us again what "failing" means (we didn't quite remember yesterday in the meeting). Is it that the tests are b4b different? If so, for release builds only, or also for debug builds? Or is the code crashing?

@dustinswales (Member)

@climbfuji The RRTMGP test is not b4b; release build only, there is no debug test for GP.
The test runs to completion, but answers change after the first time step. I spent the better part of last week on this and cannot find a reason.

@climbfuji (Collaborator, Author)

> @climbfuji The RRTMGP test is not b4b; release build only, there is no debug test for GP. The test runs to completion, but answers change after the first time step. I spent the better part of last week on this and cannot find a reason.

Ok, I recall suggesting last week to run the RRTMGP test in DEBUG mode for the current develop branch and for your up-to-date branch. If the results match between the two, then I think you can be fairly certain that this is because of the compiler optimization.

@climbfuji (Collaborator, Author)

And I am very certain that the changes here that deal with case sensitivity / case insensitivity have nothing to do with the RRTMGP b4b differences ...

@dustinswales (Member)

> Ok, I recall suggesting last week to run the RRTMGP test in DEBUG mode for the current develop branch and for your up-to-date branch. If the results match between the two, then I think you can be fairly certain that this is because of the compiler optimization.

I'm looking into this now.

@dustinswales (Member)

> And I am very certain that the changes here that deal with case sensitivity / case insensitivity have nothing to do with the RRTMGP b4b differences ...

This is true.
The allocation/resetting/cleanup of the interstitial type on the host side is where the problem arises. If I keep all the host-side changes and revert the framework hash, I get the same results.
Something about the allocation/resetting/cleanup of the interstitial is not behaving the same as before?

@climbfuji (Collaborator, Author)

> And I am very certain that the changes here that deal with case sensitivity / case insensitivity have nothing to do with the RRTMGP b4b differences ...

> This is true. The allocation/resetting/cleanup of the interstitial type on the host side is where the problem arises. If I keep all the host-side changes and revert the framework hash, I get the same results. Something about the allocation/resetting/cleanup of the interstitial is not behaving the same as before?

Oh wow. Can you diff the auto-generated files? You'd probably have to convert everything to lowercase using tr or similar to be able to do that.
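For example (a sketch of the same idea in Python rather than tr; the file paths are hypothetical):

import difflib

def diff_ignoring_case(path_old, path_new):
    """Diff two auto-generated caps after folding both to lowercase,
    so that differences which are purely letter case disappear."""
    with open(path_old) as f_old, open(path_new) as f_new:
        old_lines = [line.lower() for line in f_old]
        new_lines = [line.lower() for line in f_new]
    return "".join(difflib.unified_diff(old_lines, new_lines,
                                        fromfile=path_old, tofile=path_new))

# Hypothetical usage:
# print(diff_ignoring_case("develop/ccpp_cap.F90", "branch/ccpp_cap.F90"))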

@climbfuji (Collaborator, Author) commented Aug 20, 2025

> And I am very certain that the changes here that deal with case sensitivity / case insensitivity have nothing to do with the RRTMGP b4b differences ...

> This is true. The allocation/resetting/cleanup of the interstitial type on the host side is where the problem arises. If I keep all the host-side changes and revert the framework hash, I get the same results. Something about the allocation/resetting/cleanup of the interstitial is not behaving the same as before?

> Oh wow. Can you diff the auto-generated files? You'd probably have to convert everything to lowercase using tr or similar to be able to do that.

Maybe those RRTMGP DDTs inside the interstitial DDT don't get cleaned up correctly? Scratch that; then reverting the ccpp-framework hash wouldn't help.

@dustinswales (Member)

@climbfuji I diff'd the Caps and the only change was from 1:IM -> ixs:ixe.
Also, we don't use the RRTMGP DDTs as interstitials anymore in the UWM. I know CAM-SIMA wants to (@peverwhee and #674).

@dustinswales (Member)

> Ok, I recall suggesting last week to run the RRTMGP test in DEBUG mode for the current develop branch and for your up-to-date branch. If the results match between the two, then I think you can be fairly certain that this is because of the compiler optimization.

> I'm looking into this now.

@climbfuji Differences occur in DEBUG mode. Snap.

@climbfuji (Collaborator, Author) commented Aug 20, 2025

> @climbfuji I diff'd the Caps and the only change was from 1:IM -> ixs:ixe. Also, we don't use the RRTMGP DDTs as interstitials anymore in the UWM. I know CAM-SIMA wants to (@peverwhee and #674).

For all phases except the run phase, the correct indices for the UFS are 1:GFS_control%ncols. Can you confirm that ixs=1 and ixe=IM=GFS_control%ncols in these cases? For the run phase, you would expect ixs:ixe to be

Model%chunk_begin(ib):Model%chunk_end(ib)

for the different ib (1 to nblocks). It is probably worth printing out these indices and also checking whether there is anything in the RRTMGP CCPP scheme entry points that could cause an inconsistency with respect to the ranges.
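As a quick sanity check of that expectation (an illustrative Python sketch with made-up block sizes; in the UFS these arrays live in the Fortran Model/GFS_control DDT), the per-block ranges chunk_begin(ib):chunk_end(ib) should tile 1:ncols exactly, with no gaps or overlaps:

def chunks_tile_domain(chunk_begin, chunk_end, ncols):
    """Return True if the per-block ranges cover 1..ncols exactly, in order."""
    covered = []
    for begin, end in zip(chunk_begin, chunk_end):
        covered.extend(range(begin, end + 1))
    return covered == list(range(1, ncols + 1))

# Example: 8 columns split into blocks of 3, 3, and 2.
assert chunks_tile_domain([1, 4, 7], [3, 6, 8], 8)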

@dustinswales (Member)

> @climbfuji I diff'd the Caps and the only change was from 1:IM -> ixs:ixe. Also, we don't use the RRTMGP DDTs as interstitials anymore in the UWM. I know CAM-SIMA wants to (@peverwhee and #674).

> For all phases except the run phase, the correct indices for the UFS are 1:GFS_control%ncols. Can you confirm that ixs=1 and ixe=IM=GFS_control%ncols in these cases? For the run phase, you would expect ixs:ixe to be
>
> Model%chunk_begin(ib):Model%chunk_end(ib)
>
> for the different ib (1 to nblocks). It is probably worth printing out these indices and also checking whether there is anything in the RRTMGP CCPP scheme entry points that could cause an inconsistency with respect to the ranges.

Yeah, I've checked all of these things without any success.
I've been all over the GP interface and it doesn't do anything different from any of the other schemes wrt indexing.
The same chunked data as before is passing through the Caps, and all the schemes are fine with it except for GP?
I have no clue what's going on.

@climbfuji (Collaborator, Author)

Then I suggest merging and continuing the investigation with the (soon to be) updated PR #668.

@climbfuji climbfuji merged commit 4ae528b into NCAR:develop Aug 21, 2025
19 checks passed
@climbfuji climbfuji deleted the bugfix/prebuild_case_insensitive_capgen_parser branch August 21, 2025 18:06