[Feature Request] throw an error when the total loss is NaN #4985

@njzjz

Description

Summary

When the total loss is NaN, we can throw an error to stop the training. Otherwise, time is wasted training a model whose parameters have become NaN, as seen in deepmodeling/dpgen#1460.

Detailed Description

  1. Check for NaN when the total loss is already on the CPU (not the GPU), so the check adds no extra device-to-host transfer. For example, when writing to lcurve.out, the result is already on the CPU.
  2. Check the results before the checkpoint is written, so no checkpoint with NaN is written.
  3. Implement the feature for TensorFlow, PyTorch, and PaddlePaddle backends.
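A minimal sketch of the proposed check, usable by any backend once the total loss has been moved to the CPU as a plain Python float (the function name and message are illustrative, not an existing DeePMD-kit API):

```python
import math


def assert_total_loss_finite(total_loss: float, step: int) -> None:
    """Raise if the total loss is NaN, so training stops immediately.

    Intended to be called at the point where the loss is already on the
    CPU (e.g. just before writing lcurve.out) and again before a
    checkpoint is written, so no checkpoint with NaN parameters is saved.
    """
    if math.isnan(total_loss):
        raise RuntimeError(
            f"Total loss is NaN at training step {step}; "
            "stopping training to avoid wasting time on a NaN model."
        )
```

In a training loop this would be invoked right after the loss is fetched for logging, and once more before saving a checkpoint, covering points 1 and 2 above.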

Further Information, Files, and Links

No response
