Error Handling and Logging within Kernels #233
Comments
I suggest using either To write the message you can use
I suppose my question is: what would we like mam4xx to do when it encounters an error or failure in a given situation? If a nonlinear solver fails to converge, do we stop the simulation with an informative error message? Is there any other reasonable course of action? In some cases there may be, depending on the specific situation. If we can decide what we want the simulator to do, it becomes easier to discuss the possible solutions. It seems like the code above just writes "the solver failed to converge LOL" and keeps on going, which isn't great. EDIT: if we do want to halt the simulation on a failed convergence, I second @pbosler's suggestion to use
I agree with the suggestions above. Sounds like the As to what we do upon failure to converge, I would say that question should be answered by input from domain scientists and/or E3SM users. For instance, a chemical solver code should certainly exit with an error on non-convergence. However, as a chemistry module in the atmosphere component of E3SM, there may be compelling reasons to tolerate such a failure, perhaps within a tolerance. For instance, a user may not care about cloud chemistry/microphysics at all, or for long-term runs may only want to be informed of major failures in the chem solver. It seems to me that a flag controlling the behavior could be one option, and another would be tying the behavior to debug vs. release builds, related to Pete's comments. I do think we will need to ask around to determine what those behaviors might be, though.
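To make the "flag that controls the behavior" idea concrete, here is a minimal sketch in plain C++. Everything in it is an assumption for illustration: `NonConvergencePolicy`, `handle_nonconvergence`, and the `tolerated_fraction` parameter are hypothetical names, not part of mam4xx or E3SM. It assumes the number of failed solves has already been collected and is available on the host, so the decision happens outside any kernel.

```cpp
#include <cassert>
#include <cstdio>
#include <stdexcept>
#include <string>

// Hypothetical policy flag; none of these names come from mam4xx or E3SM.
enum class NonConvergencePolicy { Abort, Warn, Ignore };

// Host-side decision, meant to be called after a kernel has finished and the
// number of non-converged solves is known on the host.
// Returns true if the run may continue; throws if the policy says to stop.
inline bool handle_nonconvergence(NonConvergencePolicy policy, int num_failed,
                                  int num_solves,
                                  double tolerated_fraction = 0.0) {
  assert(num_solves > 0 && num_failed >= 0 && num_failed <= num_solves);
  if (num_failed == 0 || policy == NonConvergencePolicy::Ignore) {
    return true;
  }
  const double fraction = static_cast<double>(num_failed) / num_solves;
  if (policy == NonConvergencePolicy::Abort && fraction > tolerated_fraction) {
    throw std::runtime_error("solver failed to converge in " +
                             std::to_string(num_failed) + " of " +
                             std::to_string(num_solves) + " solves");
  }
  // Either the policy is Warn, or Abort with failures inside the tolerance.
  std::fprintf(stderr, "warning: %d of %d solves failed to converge\n",
               num_failed, num_solves);
  return true;
}
```

With this shape, a strict standalone chemistry run could pass `Abort` with the default tolerance of zero, while a long-term run that only cares about major failures could pass `Abort` with a nonzero `tolerated_fraction`, or `Warn`. Tying the default to debug vs. release builds would be a separate choice on top of this.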
We will eventually need to address the issue of how to handle/log errors within mam4xx. The sticky bit that complicates this is that essentially all of the work done here is within kernels, and direct logging will be inefficient, at best, on GPU. Possible solutions include:
The specific case that led to a preliminary discussion in a recent meeting is this convergence check in the newton_raphson_iter() function of gas_chem.hpp. The mam4 code records instances of non-convergence to a log file, and I have not followed it upward any further than this.
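One pattern that avoids logging inside the kernel is to have the solve return a convergence flag and count the failures afterwards. The sketch below is plain C++ and only illustrates that pattern: `newton_sqrt` is a hypothetical toy solve for x*x - a = 0 (assuming a >= 0), not the mam4xx newton_raphson_iter(), and `count_nonconverged` is likewise a made-up helper. The serial loop stands in for what would presumably be a parallel reduction in a Kokkos code.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a per-column solve: Newton's method on
// f(x) = x*x - a, assuming a >= 0. Returns true on convergence.
// No I/O happens here, so the same pattern could be used in device code.
inline bool newton_sqrt(double a, int max_iter, double tol, double& x) {
  x = a > 1.0 ? a : 1.0;  // initial guess, never below 1
  for (int it = 0; it < max_iter; ++it) {
    const double f = x * x - a;
    if (std::fabs(f) <= tol) return true;
    x -= f / (2.0 * x);  // Newton update with f'(x) = 2x
  }
  // Check the result of the final update before reporting failure.
  return std::fabs(x * x - a) <= tol;
}

// Runs one solve per input and returns how many did not converge.
// The caller (on the host) decides whether to log, warn, or stop.
inline int count_nonconverged(const std::vector<double>& inputs, int max_iter,
                              double tol, std::vector<double>& roots) {
  assert(max_iter > 0);
  roots.resize(inputs.size());
  int num_failed = 0;
  for (std::size_t i = 0; i < inputs.size(); ++i) {
    if (!newton_sqrt(inputs[i], max_iter, tol, roots[i])) ++num_failed;
  }
  return num_failed;
}
```

The only thing that has to leave the kernel in this sketch is a single integer, which the host can then write to a log file much as the mam4 code does. If per-column detail is needed, the flags could be stored in an array instead of being summed, at the cost of extra memory traffic.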