@SamuelDegelia-NOAA
Contributor

@SamuelDegelia-NOAA SamuelDegelia-NOAA commented Jan 28, 2026

Description

This PR modifies the gsibec workarounds to force zero on the outer analysis grids in the linear variable change section. With this approach, the analysis grids holding missing values no longer need to be filled with background values.

These modifications are needed to prevent NaNs in the cost function when running 3dvar on the na3km domain. Note that we are still seeing some NaNs, which can be avoided by limiting 3dvar to a single outer loop. More changes to the gsibec code will likely be needed to resolve this. For now, this PR allows us to at least run one outer loop and start getting results.

Huge thanks to @Masanori-NOAA for debugging this problem and finding an (at least partial) solution.

Issue(s) addressed

None

Dependencies (if applicable)

None

Checklist

  • I have performed a self-review of my own code.
  • I have run rrfs tests before creating the PR (if applicable).
  • Unit tests added/updated (if applicable).

@rrfsbot
Collaborator

rrfsbot commented Jan 28, 2026

FAILED on hera

started build_and_test on hera at UTC time: Wed Jan 28 02:27:11 UTC 2026
finished at UTC time: Wed Jan 28 02:58:43 UTC 2026

Test project /scratch3/NCEPDEV/fv3-cam/rrfsbot/PRs_RDASApp/527/build/rrfs-test
      Start  6: rrfs_fv3jedi_2024052700_getkf_observer
      Start 15: rrfs_mpasjedi_2024052700_getkf_observer
      Start  1: rrfs_fv3jedi_2024052700_3dvar
      Start  2: rrfs_fv3jedi_2024052700_3denvar
      Start  3: rrfs_fv3jedi_2024052700_3denvar_mgbf
      Start  4: rrfs_fv3jedi_2024052700_hybrid3denvar
      Start  5: rrfs_fv3jedi_2024052700_hybrid3denvar_mgbf
      Start  8: rrfs_fv3jedi_2024052700_3dvar_conv_surface
 1/18 Test  #1: rrfs_fv3jedi_2024052700_3dvar .................   Passed   38.65 sec
      Start  9: rrfs_fv3jedi_2024052700_3dvar_conv_upperair
 2/18 Test  #8: rrfs_fv3jedi_2024052700_3dvar_conv_surface ....***Failed   76.86 sec
      Start 10: rrfs_fv3jedi_2024052700_3dvar_remote
 3/18 Test  #6: rrfs_fv3jedi_2024052700_getkf_observer ........   Passed   98.96 sec
      Start  7: rrfs_fv3jedi_2024052700_getkf_solver
 4/18 Test #10: rrfs_fv3jedi_2024052700_3dvar_remote ..........***Failed   44.29 sec
      Start 11: rrfs_fv3jedi_2024052700_3dvar_satrad
 5/18 Test  #9: rrfs_fv3jedi_2024052700_3dvar_conv_upperair ...***Failed  100.57 sec
      Start 12: rrfs_fv3jedi_2024052700_3denvar_refl
 6/18 Test  #2: rrfs_fv3jedi_2024052700_3denvar ...............   Passed  185.11 sec
      Start 13: rrfs_mpasjedi_2024052700_bumploc
 7/18 Test #11: rrfs_fv3jedi_2024052700_3dvar_satrad ..........***Failed   74.29 sec
      Start 14: rrfs_mpasjedi_2024052700_3denvar
 8/18 Test  #4: rrfs_fv3jedi_2024052700_hybrid3denvar .........***Failed  210.08 sec
      Start 17: rrfs_mpasjedi_2024052700_3dvar
 9/18 Test  #3: rrfs_fv3jedi_2024052700_3denvar_mgbf ..........   Passed  228.73 sec
      Start 18: rrfs_bufr2ioda_msonet
10/18 Test #18: rrfs_bufr2ioda_msonet .........................   Passed   26.71 sec
11/18 Test #17: rrfs_mpasjedi_2024052700_3dvar ................   Passed   57.35 sec
12/18 Test #15: rrfs_mpasjedi_2024052700_getkf_observer .......   Passed  268.66 sec
      Start 16: rrfs_mpasjedi_2024052700_getkf_solver
13/18 Test  #7: rrfs_fv3jedi_2024052700_getkf_solver ..........   Passed  192.44 sec
14/18 Test  #5: rrfs_fv3jedi_2024052700_hybrid3denvar_mgbf ....***Failed  307.39 sec
15/18 Test #16: rrfs_mpasjedi_2024052700_getkf_solver .........   Passed  191.12 sec
16/18 Test #13: rrfs_mpasjedi_2024052700_bumploc ..............   Passed  310.41 sec
17/18 Test #14: rrfs_mpasjedi_2024052700_3denvar ..............   Passed  334.52 sec
18/18 Test #12: rrfs_fv3jedi_2024052700_3denvar_refl ..........   Passed  530.79 sec

67% tests passed, 6 tests failed out of 18

Label Time Summary:
mpi            = 3276.92 sec*proc (18 tests)
rdas-bundle    = 3276.92 sec*proc (18 tests)
script         = 3276.92 sec*proc (18 tests)

Total Test time (real) = 670.07 sec

The following tests FAILED:
	  4 - rrfs_fv3jedi_2024052700_hybrid3denvar (Failed)
	  5 - rrfs_fv3jedi_2024052700_hybrid3denvar_mgbf (Failed)
	  8 - rrfs_fv3jedi_2024052700_3dvar_conv_surface (Failed)
	  9 - rrfs_fv3jedi_2024052700_3dvar_conv_upperair (Failed)
	 10 - rrfs_fv3jedi_2024052700_3dvar_remote (Failed)
	 11 - rrfs_fv3jedi_2024052700_3dvar_satrad (Failed)
Errors while running CTest
Output from these tests are in: /scratch3/NCEPDEV/fv3-cam/rrfsbot/PRs_RDASApp/527/build/rrfs-test/Testing/Temporary/LastTest.log
Use "--rerun-failed --output-on-failure" to re-run the failed cases verbosely.

workdir: /scratch3/NCEPDEV/fv3-cam/rrfsbot/PRs_RDASApp/527

@SamuelDegelia-NOAA
Contributor Author

PASSED on wcoss2

started build_and_test on wcoss2 at UTC time: Wed Jan 28 02:22:36 UTC 2026
finished at UTC time: Wed Jan 28 03:18:52 UTC 2026

Test project /lfs/h2/emc/da/noscrub/samuel.degelia/rrfsbot/PRs_RDASApp/527/build/rrfs-test
      Start  6: rrfs_fv3jedi_2024052700_getkf_observer
      Start 15: rrfs_mpasjedi_2024052700_getkf_observer
      Start  1: rrfs_fv3jedi_2024052700_3dvar
      Start  2: rrfs_fv3jedi_2024052700_3denvar
      Start  3: rrfs_fv3jedi_2024052700_3denvar_mgbf
      Start  4: rrfs_fv3jedi_2024052700_hybrid3denvar
      Start  5: rrfs_fv3jedi_2024052700_hybrid3denvar_mgbf
      Start  8: rrfs_fv3jedi_2024052700_3dvar_conv_surface
      Start  9: rrfs_fv3jedi_2024052700_3dvar_conv_upperair
      Start 10: rrfs_fv3jedi_2024052700_3dvar_remote
 1/18 Test  #8: rrfs_fv3jedi_2024052700_3dvar_conv_surface ....   Passed  180.90 sec
      Start 11: rrfs_fv3jedi_2024052700_3dvar_satrad
 2/18 Test  #1: rrfs_fv3jedi_2024052700_3dvar .................   Passed  230.07 sec
      Start 12: rrfs_fv3jedi_2024052700_3denvar_refl
 3/18 Test  #9: rrfs_fv3jedi_2024052700_3dvar_conv_upperair ...   Passed  230.05 sec
      Start 13: rrfs_mpasjedi_2024052700_bumploc
 4/18 Test #10: rrfs_fv3jedi_2024052700_3dvar_remote ..........   Passed  278.70 sec
      Start 14: rrfs_mpasjedi_2024052700_3denvar
 5/18 Test  #6: rrfs_fv3jedi_2024052700_getkf_observer ........   Passed  355.93 sec
      Start  7: rrfs_fv3jedi_2024052700_getkf_solver
 6/18 Test #11: rrfs_fv3jedi_2024052700_3dvar_satrad ..........   Passed  186.04 sec
      Start 17: rrfs_mpasjedi_2024052700_3dvar
 7/18 Test  #2: rrfs_fv3jedi_2024052700_3denvar ...............   Passed  461.93 sec
      Start 18: rrfs_bufr2ioda_msonet
 8/18 Test  #4: rrfs_fv3jedi_2024052700_hybrid3denvar .........   Passed  465.92 sec
 9/18 Test #17: rrfs_mpasjedi_2024052700_3dvar ................   Passed  116.11 sec
10/18 Test #18: rrfs_bufr2ioda_msonet .........................   Passed   33.96 sec
11/18 Test  #5: rrfs_fv3jedi_2024052700_hybrid3denvar_mgbf ....   Passed  545.91 sec
12/18 Test  #3: rrfs_fv3jedi_2024052700_3denvar_mgbf ..........   Passed  555.91 sec
13/18 Test #13: rrfs_mpasjedi_2024052700_bumploc ..............   Passed  347.00 sec
14/18 Test  #7: rrfs_fv3jedi_2024052700_getkf_solver ..........   Passed  224.99 sec
15/18 Test #15: rrfs_mpasjedi_2024052700_getkf_observer .......   Passed  727.19 sec
      Start 16: rrfs_mpasjedi_2024052700_getkf_solver
16/18 Test #14: rrfs_mpasjedi_2024052700_3denvar ..............   Passed  489.27 sec
17/18 Test #12: rrfs_fv3jedi_2024052700_3denvar_refl ..........   Passed  718.87 sec
18/18 Test #16: rrfs_mpasjedi_2024052700_getkf_solver .........   Passed  354.58 sec

100% tests passed, 0 tests failed out of 18

Label Time Summary:
rdas-bundle    = 6503.32 sec*proc (18 tests)
script         = 6503.32 sec*proc (18 tests)

Total Test time (real) = 1081.98 sec

workdir: /lfs/h2/emc/da/noscrub/samuel.degelia/rrfsbot/PRs_RDASApp/527

@SamuelDegelia-NOAA SamuelDegelia-NOAA marked this pull request as draft January 28, 2026 03:22
@SamuelDegelia-NOAA
Contributor Author

Converting to draft. For some reason it looks like this change is actually causing NaNs in the cost function on Hera but resolving them on WCOSS2...

@ShunLiu-NOAA ShunLiu-NOAA left a comment

I realized some ctests did not pass, so I am reverting my approval.

@SamuelDegelia-NOAA
Contributor Author

SamuelDegelia-NOAA commented Jan 28, 2026

It looks like the Hera failures occur due to NaNs now appearing on the second outer loop. So these changes are resolving NaNs during the first outer loop for the na3km case, but sometimes cause NaNs during the second outer loop for both conus13km and na3km (machine dependent). Will probably need help from @TingLei-NOAA and @Masanori-NOAA to figure this one out.

@TingLei-NOAA
Contributor

TingLei-NOAA commented Jan 28, 2026

Yes, the NaN values in the cost function could come from NaN values in the background fields, or from values produced by undefined behavior (when compiler optimization is turned on), on filtering grids outside of the model domain. That could occur earlier in the regional gsibec because of the lateral boundary points, even earlier than @Masanori-NOAA once suspected; one example is the ges_prsl calculation in guess_grids.f90 of gsibec. I am still trying to figure out a simple way to deal with this situation. I will focus on this issue, in collaboration with Masanori and colleagues, after I finish the current optimization of the MGBF code.

@SamuelDegelia-NOAA
Contributor Author

Thanks, @TingLei-NOAA!

@SamuelDegelia-NOAA
Contributor Author

I added some debug prints to track down the source of the NaNs in the minimizer. Tracing through various layers, I found that the NaNs originate in normal_rh_to_q.f90 within gsibec. At certain grid points, the derivative dqdrh is already non-finite because t and p are zero. This leads to NaNs when q is computed.
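
For illustration only, the kind of scan that locates such points can be sketched in Python. This is not the gsibec debug code: the helper name `find_nonfinite`, the nested-list field layout, and the toy values are all assumptions made for the example.

```python
import math

def find_nonfinite(field):
    """Return (i, j, k) indices of NaN or Inf entries in a nested 3-D list.

    Hypothetical helper for illustration; not part of gsibec.
    """
    bad = []
    for i, plane in enumerate(field):
        for j, column in enumerate(plane):
            for k, value in enumerate(column):
                if not math.isfinite(value):
                    bad.append((i, j, k))
    return bad

# Toy 2x2x2 field with one NaN and one Inf planted in it (made-up values).
dqdrh = [[[0.1, 0.2], [float("nan"), 0.3]],
         [[0.4, float("inf")], [0.5, 0.6]]]

for idx in find_nonfinite(dqdrh):
    print("non-finite dqdrh at", idx)

# One NaN is enough to turn any sum that includes it into NaN, which is
# how a few bad grid points can surface later in a summed cost function.
total = sum(v for plane in dqdrh for column in plane for v in column)
print("sum is NaN:", math.isnan(total))
```

Reporting indices rather than just a yes/no flag makes it easier to see whether the bad points cluster somewhere specific, such as near the lateral boundaries.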

As a simple hardening step, I added additional checks in normal_rh_to_q (and the adjoint) to treat these points as invalid and skip them, rather than relying only on the existing ges_tsen < rmiss_th condition. Here is the general idea:

! Requires: use, intrinsic :: ieee_arithmetic, only: ieee_is_finite
real(r_kind), parameter :: pmin = 1.0e-6_r_kind   ! p at or below this is treated as invalid
real(r_kind), parameter :: tmin = 1.0_r_kind      ! t at or below this is treated as invalid

! Old check: if(regional .and. ges_tsen(i,j,k,ntguessig) < rmiss_th) then
if (regional .and. ( &
    (ges_tsen(i,j,k,ntguessig) < rmiss_th) .or. &
    (.not. ieee_is_finite(t(i,j,k)))   .or. (t(i,j,k)   <= tmin) .or. &
    (.not. ieee_is_finite(p(i,j,k)))   .or. (p(i,j,k)   <= pmin) .or. &
    (.not. ieee_is_finite(p(i,j,k+1))) .or. (p(i,j,k+1) <= pmin) )) then
  ! Invalid point: zero the output and skip the rest of this iteration
  q(i,j,k) = zero
  cycle
endif

After changing this if-block, 3dvar now runs to completion on Hera. The minimization results are slightly different after this change, though (e.g., a different reduction of the residual norm). I am going to make some plots to see how similar the analyses are and whether this fix is okay.
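
As a rough sketch of the logic in the hardened condition above, here is the same test written as a small Python predicate. It is not gsibec code: the function name `is_invalid_point` and its argument names are hypothetical, and `rmiss_th` is passed in as a parameter because its value is not stated in this thread (the `-900.0` used below is only a placeholder).

```python
import math

def is_invalid_point(tsen, t, p_lower, p_upper, rmiss_th,
                     tmin=1.0, pmin=1.0e-6):
    """Return True if the grid point should be zeroed and skipped.

    Mirrors the Fortran condition: the original missing-value test on tsen,
    plus finiteness and lower-bound tests on t and the two pressure levels.
    """
    if tsen < rmiss_th:
        return True
    for value, floor in ((t, tmin), (p_lower, pmin), (p_upper, pmin)):
        # NaN compares False against any number, so "value <= floor" alone
        # would let NaN through; the explicit finiteness test catches it.
        if not math.isfinite(value) or value <= floor:
            return True
    return False

RMISS_TH = -900.0  # placeholder threshold, assumed for this example only

# Toy values, chosen only to exercise each branch.
print(is_invalid_point(280.0, 280.0, 90.0, 85.0, RMISS_TH))         # valid point
print(is_invalid_point(280.0, 0.0, 0.0, 85.0, RMISS_TH))            # zero t and p
print(is_invalid_point(280.0, float("nan"), 90.0, 85.0, RMISS_TH))  # NaN t
print(is_invalid_point(-999.0, 280.0, 90.0, 85.0, RMISS_TH))        # missing tsen
```

Both tests are needed for each variable: the lower bound catches the zero t and p values that make dqdrh non-finite, and the finiteness test catches values that are already NaN or Inf.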

@SamuelDegelia-NOAA
Contributor Author

3dvar runs through after the above changes, but the analyses are very different. Going to continue debugging.
