When the total loss becomes NaN, we can throw an error to stop training. Otherwise, time is wasted training a model whose parameters are NaN, as seen in deepmodeling/dpgen#1460.
Detailed Description
To avoid extra cost, check for NaN only when the total loss is already on the CPU (rather than on the GPU), for example when it is written to lcurve.out.
Perform the check before a checkpoint is written, so that no checkpoint containing NaN parameters is saved.
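A minimal sketch of what such a check might look like; the names `LossNaNError`, `check_loss_and_save`, and `save_checkpoint` are hypothetical and only illustrate checking the CPU-side scalar that is already fetched for lcurve.out:

```python
import math


class LossNaNError(RuntimeError):
    """Raised when the total training loss becomes NaN (hypothetical name)."""


def check_loss_and_save(step: int, loss_value: float, save_checkpoint) -> None:
    # loss_value is the scalar already transferred to the CPU for lcurve.out,
    # so checking it here adds no extra device-to-host synchronization.
    if math.isnan(loss_value):
        raise LossNaNError(
            f"Total loss is NaN at step {step}; stopping training "
            "before a checkpoint with NaN parameters is written."
        )
    # Only reached when the loss is finite, so the saved checkpoint is NaN-free.
    save_checkpoint(step)
```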
Implement the feature for TensorFlow, PyTorch, and PaddlePaddle backends.
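For a graph-mode TensorFlow backend, one possible approach is to check the value returned by `session.run`, which is already on the host; this fragment is only a sketch, and `total_loss_t` and `run_feed` are assumed placeholders for the backend's actual loss tensor and feed dict:

```python
import numpy as np
import tensorflow.compat.v1 as tf


def fetch_and_check_loss(sess: tf.Session, total_loss_t, run_feed) -> float:
    # sess.run already returns the loss on the host (as done for lcurve.out),
    # so an np.isnan check on the fetched value adds no extra GPU work.
    loss_value = sess.run(total_loss_t, feed_dict=run_feed)
    if np.isnan(loss_value):
        raise RuntimeError("Total loss is NaN; aborting training.")
    return loss_value
```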