[train v2] Improve TrainingFailedError message #51199

Open
wants to merge 3 commits into
base: master


justinvyu
Contributor

Summary

The previous error message told users to look for error logs elsewhere. It also referenced a worker_failures attribute on the error object itself, which is inaccessible to users who did not wrap the call to trainer.fit() in a try/except block. This PR improves the error message by including the worker error strings directly, so it's easy to see at a glance exactly what happened on the workers.

Here's what the message looks like now:

Training failed due to worker errors:
[Rank 0]
Traceback (most recent call last):
  File "/Users/justin/Developer/ray/python/ray/train/v2/tests/test_data_parallel_trainer.py", line 145, in _error_func_rank_0
    raise ValueError("error")
ValueError: error
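
To illustrate the idea, here is a minimal sketch of how a message in this shape could be assembled from per-rank exceptions. It is not the PR's actual implementation: the `RayTrainError` stand-in, the constructor signature, the `_format_message` helper, and `simulate_fit` are all assumptions made so the example runs without Ray installed. Only the class name, the `worker_failures` attribute, and the message layout come from the description above.

```python
import traceback
from typing import Dict


class RayTrainError(Exception):
    """Stand-in base class (assumption) so the sketch runs without Ray."""


class TrainingFailedError(RayTrainError):
    """Hypothetical sketch: embeds per-rank worker errors in the message."""

    def __init__(self, worker_failures: Dict[int, BaseException]):
        # Keep the raw exceptions available for callers that do catch the error.
        self.worker_failures = worker_failures
        super().__init__(self._format_message(worker_failures))

    @staticmethod
    def _format_message(worker_failures: Dict[int, BaseException]) -> str:
        # Hypothetical helper: one "[Rank N]" header per failed worker,
        # followed by that worker's formatted traceback.
        lines = ["Training failed due to worker errors:"]
        for rank in sorted(worker_failures):
            exc = worker_failures[rank]
            tb = "".join(
                traceback.format_exception(type(exc), exc, exc.__traceback__)
            )
            lines.append(f"[Rank {rank}]\n{tb.rstrip()}")
        return "\n".join(lines)


def _error_func_rank_0() -> None:
    raise ValueError("error")


def simulate_fit() -> None:
    """Hypothetical stand-in for trainer.fit() with a failing rank-0 worker."""
    try:
        _error_func_rank_0()
    except ValueError as exc:
        raise TrainingFailedError({0: exc}) from None
```

With this shape, a user who never catches the exception still sees the worker traceback in the uncaught error output, while a user who does catch it can continue to inspect `worker_failures` programmatically.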

Follow-up work

  • If many workers fail at once, this error message could get very long. The controller warnings, which include the same error string, have the same problem.
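
One possible way to bound the message length, sketched here purely as an illustration and not as the approach this PR or its follow-up will take, is to show only the first few ranks and summarize the rest. The function name `format_worker_errors` and the `max_ranks` parameter are hypothetical.

```python
from typing import Dict


def format_worker_errors(error_strings: Dict[int, str], max_ranks: int = 3) -> str:
    """Hypothetical helper: show at most `max_ranks` worker errors in full."""
    ranks = sorted(error_strings)
    shown = ranks[:max_ranks]

    lines = ["Training failed due to worker errors:"]
    for rank in shown:
        lines.append(f"[Rank {rank}]\n{error_strings[rank].rstrip()}")

    # Summarize the remaining failures instead of printing every traceback.
    hidden = len(ranks) - len(shown)
    if hidden > 0:
        lines.append(f"... and {hidden} more failed worker(s) omitted.")
    return "\n".join(lines)
```

A cap like this trades completeness for readability, so the full set of errors would still need to be reachable some other way, for example through the attribute on the exception.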

Comment on lines +7 to +9
@DeveloperAPI
class TrainingFailedError(RayTrainError):
    """Exception raised by `<Framework>Trainer.fit()` when training fails."""
Contributor Author

Should this be PublicAPI(stability="beta") instead of DeveloperAPI?

Signed-off-by: Justin Yu <[email protected]>

2 participants