Error Handling and Logging within Kernels #233

Open
mjs271 opened this issue Aug 23, 2023 · 3 comments
Comments

@mjs271
Contributor

mjs271 commented Aug 23, 2023

We will eventually need to decide how to handle and log errors within mam4xx. The sticky bit is that essentially all of the work here happens inside kernels, where direct logging will be inefficient, at best, on GPU. Possible solutions include:

  • Storing char arrays in a buffer that is sent to the host "every so often"
  • Creating some sort of persistent error struct
  • Recording only process/function/line-number "location" data plus an error code that is periodically synced to the host, where it could be decoded into something informative to humans (a rough sketch of this approach follows the list)
  • Suggestions?
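
For concreteness, here is a minimal sketch of the third option using plain Kokkos. Every name in it (ErrorRecord, ErrorLog, flush_errors, ERR_NO_CONVERGENCE) is hypothetical, invented for illustration rather than taken from mam4xx or EKAT:

// A minimal sketch of the "error code + location" idea, using plain
// Kokkos. ErrorRecord, ErrorLog, flush_errors, and ERR_NO_CONVERGENCE
// are hypothetical names for illustration, not mam4xx or EKAT API.
#include <Kokkos_Core.hpp>
#include <cstdio>

constexpr int ERR_NO_CONVERGENCE = 1; // decoded on the host

struct ErrorRecord {
  int code; // numeric error code
  int line; // __LINE__ at the recording site
  int col;  // e.g. column index, to locate the failure
};

struct ErrorLog {
  Kokkos::View<ErrorRecord *> records; // preallocated device buffer
  Kokkos::View<int> count;             // next free slot (rank-0 view)

  // Device-callable: record an error with no I/O on the device.
  KOKKOS_INLINE_FUNCTION
  void record(int code, int line, int col) const {
    const int slot = Kokkos::atomic_fetch_add(&count(), 1);
    if (slot < static_cast<int>(records.extent(0))) {
      records(slot) = ErrorRecord{code, line, col};
    }
  }
};

// Host side: sync "every so often" and decode for humans.
inline void flush_errors(const ErrorLog &log) {
  auto h_count =
      Kokkos::create_mirror_view_and_copy(Kokkos::HostSpace{}, log.count);
  auto h_recs =
      Kokkos::create_mirror_view_and_copy(Kokkos::HostSpace{}, log.records);
  const int n = h_count() < static_cast<int>(h_recs.extent(0))
                    ? h_count()
                    : static_cast<int>(h_recs.extent(0));
  for (int i = 0; i < n; ++i) {
    std::printf("mam4xx error %d at line %d, col %d\n", h_recs(i).code,
                h_recs(i).line, h_recs(i).col);
  }
}

Inside a kernel, a failure site would then call something like log.record(ERR_NO_CONVERGENCE, __LINE__, icol), and the host would call flush_errors at whatever cadence "every so often" ends up meaning.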

The specific case that led to a preliminary discussion in a recent meeting is this convergence check in the newton_raphson_iter() function of gas_chem.hpp.
The mam4 code records instances of non-convergence to a log file; I have not traced it any further up the call chain than this.

if( .not. convergence ) then
   !-----------------------------------------------------------------------
   ! ... non-convergence
   !-----------------------------------------------------------------------
   fail_cnt = fail_cnt + 1
   nstep = get_nstep()
   write(iulog,'('' imp_sol: Time step '',1p,e21.13,'' failed to converge @ (lchnk,lev,col,nstep) = '',4i6)') &
        dt,lchnk,lev,icol,nstep
   stp_con_cnt = 0
   if( cut_cnt < cut_limit ) then
      cut_cnt = cut_cnt + 1
      if( cut_cnt < cut_limit ) then
         dt = .5_r8 * dt
      else
         dt = .1_r8 * dt
      endif
      cycle time_step_loop
   else
      write(iulog,'('' imp_sol: Failed to converge @ (lchnk,lev,col,nstep,dt,time) = '',4i6,1p,2e21.13)') &
           lchnk,lev,icol,nstep,dt,interval_done+dt
      do mm = 1,clscnt4
         if( .not. converged(mm) ) then
            write(iulog,'(1x,a8,1x,1pe10.3)') solsym(clsmap(mm,4)), max_delta(mm)
         endif
      enddo
   endif
endif ! if( .not. convergence )
!-----------------------------------------------------------------------
! ... check for interval done
!-----------------------------------------------------------------------
interval_done = interval_done + dt
if( abs( delt - interval_done ) <= .0001_r8 ) then
   if( fail_cnt > 0 ) then
      write(iulog,*) 'imp_sol : @ (lchnk,lev,col) = ',lchnk,lev,icol,' failed ',fail_cnt,' times'
   endif
   exit time_step_loop
else

[...]
@pbosler
Contributor

pbosler commented Aug 24, 2023

I suggest using either EKAT_KERNEL_ASSERT_MSG or EKAT_KERNEL_REQUIRE_MSG; the difference is that the ASSERT version is compiled only in debug builds.

To format the message you can use sprintf.
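
A hedged sketch of what this could look like in a kernel, assuming both macros take a (condition, message) pair and that the include path is ekat/ekat_assert.hpp; the function and its arguments are illustrative, not mam4xx code:

// Sketch of the suggestion above: the REQUIRE flavor aborts (via
// Kokkos::abort) when the condition is false in all build types, while
// the ASSERT flavor compiles away outside debug builds. Signature and
// include path are assumptions, checked against the EKAT source.
#include <Kokkos_Core.hpp>
#include <ekat/ekat_assert.hpp>

KOKKOS_INLINE_FUNCTION
void check_convergence(bool converged, int iter, int max_iter) {
  // Debug-only sanity check on the inputs:
  EKAT_KERNEL_ASSERT_MSG(max_iter > 0, "max_iter must be positive");
  // Hard failure in any build type:
  EKAT_KERNEL_REQUIRE_MSG(converged || iter < max_iter,
                          "newton_raphson_iter: failed to converge");
}

One caveat: in CUDA device code the message effectively needs to be a string literal, since sprintf is not available on device, so any sprintf formatting would have to happen where host code runs.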

@jeff-cohere
Collaborator

jeff-cohere commented Aug 24, 2023

I suppose my question is: what would we like mam4xx to do when it encounters an error or failure in a given situation?

If a nonlinear solver fails to converge, do we stop the simulation with an informative error message? Is there any other reasonable course of action? In some cases, there may be, depending on the specific situation. If we can decide what we want the simulator to do, it becomes easier to discuss the possible solutions.

It seems like the code above just writes "the solver failed to converge LOL" and keeps on going, which isn't great.

EDIT: if we do want to halt the simulation on a convergence failure, I second @pbosler's suggestion to use EKAT_KERNEL_REQUIRE_MSG.

@mjs271
Contributor Author

mjs271 commented Aug 24, 2023

I agree with the suggestions above. Sounds like the EKAT_KERNEL_[...]_MSG path is the right way forward. ✅

As to what we do upon failure to converge, I would say that question should be answered with input from domain scientists and/or E3SM users. For instance, a standalone chemical solver should certainly exit with an error on non-convergence. However, as a chemistry module within the atmosphere component of E3SM, there may be compelling reasons to tolerate such a failure, perhaps within some tolerance. For instance, a user may not care about cloud chemistry/microphysics at all, or for long-term runs may only want to be informed of major failures in the chem solver.

Seems to me like one option is a flag that controls the behavior; another, related to Pete's comments, is tying the behavior to debug vs. release builds (a rough sketch of the flag idea is below). I do think we will need to ask around to determine what those behaviors should be, though.
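
A rough sketch of the flag idea, under the same EKAT assumptions as above; the macro name, function, and arguments are all hypothetical, not existing mam4xx code:

// Hypothetical compile-time switch that either halts on non-convergence
// or merely counts it so the host can decide what to report later.
#include <Kokkos_Core.hpp>
#include <ekat/ekat_assert.hpp>

#ifndef MAM4XX_ABORT_ON_NONCONVERGENCE
#define MAM4XX_ABORT_ON_NONCONVERGENCE 1
#endif

KOKKOS_INLINE_FUNCTION
void handle_nonconvergence(bool converged, int &fail_cnt) {
  if (!converged) {
#if MAM4XX_ABORT_ON_NONCONVERGENCE
    // Strict mode: stop the simulation with an informative message.
    EKAT_KERNEL_REQUIRE_MSG(false, "gas chem: solver failed to converge");
#else
    // Tolerant mode: count the failure, mirroring mam4's fail_cnt, and
    // let the host report it (e.g., only past some tolerance).
    ++fail_cnt;
#endif
  }
}

A runtime flag would work similarly, trading the compile-time branch for an ordinary if on a configuration value passed into the kernel.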
