Error Handling and Logging within Kernels #233

Open
mjs271 opened this issue Aug 23, 2023 · 3 comments
Comments

@mjs271
Contributor

mjs271 commented Aug 23, 2023

We will eventually need to decide how to handle and log errors within mam4xx. The sticky bit is that essentially all of the work here happens inside kernels, where direct logging will be inefficient, at best, on GPU. Possible solutions include:

  • Storing char arrays in a buffer that is sent to the host "every so often"
  • Creating some sort of persistent error struct
  • Recording only process/function/line-number "location" data plus an error code that is periodically synced to the host, where it could be decoded into something informative to humans (a rough sketch of this approach follows the list)
  • Suggestions?
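
For concreteness, here is a minimal sketch of the third option using plain Kokkos. Every name in it (ErrorRecord, ErrorLog, flush_errors, ERR_NO_CONVERGENCE) is hypothetical, invented for illustration rather than taken from mam4xx or EKAT:

// A minimal sketch of the "error code + location" idea, using plain
// Kokkos. ErrorRecord, ErrorLog, flush_errors, and ERR_NO_CONVERGENCE
// are hypothetical names for illustration, not mam4xx or EKAT API.
#include <Kokkos_Core.hpp>
#include <cstdio>

constexpr int ERR_NO_CONVERGENCE = 1; // decoded on the host

struct ErrorRecord {
  int code; // numeric error code
  int line; // __LINE__ at the recording site
  int col;  // e.g. column index, to locate the failure
};

struct ErrorLog {
  Kokkos::View<ErrorRecord *> records; // preallocated device buffer
  Kokkos::View<int> count;             // next free slot (rank-0 view)

  // Device-callable: record an error with no I/O on the device.
  KOKKOS_INLINE_FUNCTION
  void record(int code, int line, int col) const {
    const int slot = Kokkos::atomic_fetch_add(&count(), 1);
    if (slot < static_cast<int>(records.extent(0))) {
      records(slot) = ErrorRecord{code, line, col};
    }
  }
};

// Host side: sync "every so often" and decode for humans.
inline void flush_errors(const ErrorLog &log) {
  auto h_count =
      Kokkos::create_mirror_view_and_copy(Kokkos::HostSpace{}, log.count);
  auto h_recs =
      Kokkos::create_mirror_view_and_copy(Kokkos::HostSpace{}, log.records);
  const int n = h_count() < static_cast<int>(h_recs.extent(0))
                    ? h_count()
                    : static_cast<int>(h_recs.extent(0));
  for (int i = 0; i < n; ++i) {
    std::printf("mam4xx error %d at line %d, col %d\n", h_recs(i).code,
                h_recs(i).line, h_recs(i).col);
  }
}

Inside a kernel, a failure site would then call something like log.record(ERR_NO_CONVERGENCE, __LINE__, icol), and the host would call flush_errors at whatever cadence "every so often" ends up meaning.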

The specific case that led to a preliminary discussion in a recent meeting is this convergence check in the newton_raphson_iter() function of gas_chem.hpp.
The mam4 code records instances of non-convergence to a log file; I have not traced it any further up the call chain than this.

if( .not. convergence ) then
   !-----------------------------------------------------------------------
   ! ... non-convergence
   !-----------------------------------------------------------------------
   fail_cnt = fail_cnt + 1
   nstep = get_nstep()
   write(iulog,'('' imp_sol: Time step '',1p,e21.13,'' failed to converge @ (lchnk,lev,col,nstep) = '',4i6)') &
        dt,lchnk,lev,icol,nstep
   stp_con_cnt = 0
   if( cut_cnt < cut_limit ) then
      cut_cnt = cut_cnt + 1
      if( cut_cnt < cut_limit ) then
         dt = .5_r8 * dt
      else
         dt = .1_r8 * dt
      endif
      cycle time_step_loop
   else
      write(iulog,'('' imp_sol: Failed to converge @ (lchnk,lev,col,nstep,dt,time) = '',4i6,1p,2e21.13)') &
           lchnk,lev,icol,nstep,dt,interval_done+dt
      do mm = 1,clscnt4
         if( .not. converged(mm) ) then
            write(iulog,'(1x,a8,1x,1pe10.3)') solsym(clsmap(mm,4)), max_delta(mm)
         endif
      enddo
   endif
endif ! if( .not. convergence )
!-----------------------------------------------------------------------
! ... check for interval done
!-----------------------------------------------------------------------
interval_done = interval_done + dt
if( abs( delt - interval_done ) <= .0001_r8 ) then
   if( fail_cnt > 0 ) then
      write(iulog,*) 'imp_sol : @ (lchnk,lev,col) = ',lchnk,lev,icol,' failed ',fail_cnt,' times'
   endif
   exit time_step_loop
else

[...]
@pbosler
Contributor

pbosler commented Aug 24, 2023

I suggest using either EKAT_KERNEL_ASSERT_MSG or EKAT_KERNEL_REQUIRE_MSG; the difference is that the ASSERT version is compiled only in debug builds.

To format the message you can use sprintf.
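
A hedged sketch of what this could look like in a kernel, assuming both macros take a (condition, message) pair and that the include path is ekat/ekat_assert.hpp; the function and its arguments are illustrative, not mam4xx code:

// Sketch of the suggestion above: the REQUIRE flavor aborts (via
// Kokkos::abort) when the condition is false in all build types, while
// the ASSERT flavor compiles away outside debug builds. Signature and
// include path are assumptions, checked against the EKAT source.
#include <Kokkos_Core.hpp>
#include <ekat/ekat_assert.hpp>

KOKKOS_INLINE_FUNCTION
void check_convergence(bool converged, int iter, int max_iter) {
  // Debug-only sanity check on the inputs:
  EKAT_KERNEL_ASSERT_MSG(max_iter > 0, "max_iter must be positive");
  // Hard failure in any build type:
  EKAT_KERNEL_REQUIRE_MSG(converged || iter < max_iter,
                          "newton_raphson_iter: failed to converge");
}

One caveat: in CUDA device code the message effectively needs to be a string literal, since sprintf is not available on device, so any sprintf formatting would have to happen where host code runs.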

@jeff-cohere
Collaborator

jeff-cohere commented Aug 24, 2023

I suppose my question is: what would we like mam4xx to do when it encounters an error or failure in a given situation?

If a nonlinear solver fails to converge, do we stop the simulation with an informative error message? Is there any other reasonable course of action? In some cases, there may be, depending on the specific situation. If we can decide what we want the simulator to do, it becomes easier to discuss the possible solutions.

It seems like the code above just writes "the solver failed to converge LOL" and keeps on going, which isn't great.

EDIT: if we do want to halt the simulation on a convergence failure, I second @pbosler's suggestion to use EKAT_KERNEL_REQUIRE_MSG.

@mjs271
Contributor Author

mjs271 commented Aug 24, 2023

I agree with the suggestions above. Sounds like the EKAT_KERNEL_[...]_MSG path is the right way forward. ✅

As to what we do upon failure to converge, I would say that question should be answered with input from domain scientists and/or E3SM users. For instance, a standalone chemical solver should certainly exit with an error on non-convergence. However, as a chemistry module within the atmosphere component of E3SM, there may be compelling reasons to tolerate such a failure, perhaps within some tolerance. For instance, a user may not care about cloud chemistry/microphysics at all, or for long-term runs may only want to be informed of major failures in the chem solver.

Seems to me like one option is a flag that controls the behavior; another, related to Pete's comments, is tying the behavior to debug vs. release builds (a rough sketch of the flag idea is below). I do think we will need to ask around to determine what those behaviors should be, though.
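
A rough sketch of the flag idea, under the same EKAT assumptions as above; the macro name, function, and arguments are all hypothetical, not existing mam4xx code:

// Hypothetical compile-time switch that either halts on non-convergence
// or merely counts it so the host can decide what to report later.
#include <Kokkos_Core.hpp>
#include <ekat/ekat_assert.hpp>

#ifndef MAM4XX_ABORT_ON_NONCONVERGENCE
#define MAM4XX_ABORT_ON_NONCONVERGENCE 1
#endif

KOKKOS_INLINE_FUNCTION
void handle_nonconvergence(bool converged, int &fail_cnt) {
  if (!converged) {
#if MAM4XX_ABORT_ON_NONCONVERGENCE
    // Strict mode: stop the simulation with an informative message.
    EKAT_KERNEL_REQUIRE_MSG(false, "gas chem: solver failed to converge");
#else
    // Tolerant mode: count the failure, mirroring mam4's fail_cnt, and
    // let the host report it (e.g., only past some tolerance).
    ++fail_cnt;
#endif
  }
}

A runtime flag would work similarly, trading the compile-time branch for an ordinary if on a configuration value passed into the kernel.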
