[UX] Add infeasibility reasons to the exception message #3986
sky/backends/cloud_vm_ray_backend.py
Outdated
table = log_utils.create_table(['Resource', 'Reason'])
for (resource, exception) in resource_exceptions.items():
    table.add_row([
        resource,
        _EXCEPTION_SUMMARY_MESSAGE[exception.__class__]
    ])
raise exceptions.ResourcesUnavailableError(
    _RESOURCES_UNAVAILABLE_LOG + '\n' + table.get_string(),
    failover_history=failover_history)
Instead of parsing the exceptions here, should we rely directly on failover_history to generate the reason table at the caller? Or is there a reason we have to do it here?
It might be good to test with, e.g., sky launch --gpus H100:8 to see how the output looks when failing over through many regions.
Instead of parsing the exceptions here, should we directly rely on the failover_history to generate reason table at the caller? Or, is there a reason we have to do it here?
Yes, I tried this design, but I don't think failover_history gives users enough information to identify the problem. For example, when I run sky launch --gpus H100:8, the (partial) failover history is:
[ResourcesUnavailableError('Failed to acquire resources in us-central1-a. Try changing resource requirements or use another zone.'), ResourcesUnavailableError('Failed to acquire resources in us-west1-a. Try changing resource requirements or use another zone.'),
As you can see, each entry contains only the location of the failed provision (e.g. us-central1-a); it does not include the cloud provider or the resource information. So I think constructing the mapping from each resource to its exception here is more user-friendly.
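To make the idea of a resource-to-exception mapping concrete, here is a minimal sketch in plain Python. It is not the code in this PR: the exception classes, the `SUMMARY_BY_EXCEPTION` dictionary, the second summary string, and `summarize_failures` are hypothetical stand-ins for `_EXCEPTION_SUMMARY_MESSAGE` and the table built with `log_utils.create_table`, and the formatting uses only string padding.

```python
from typing import Dict, Type


class ResourcesUnavailableError(Exception):
    """Hypothetical stand-in for a provisioning failure."""


class NotSupportedError(Exception):
    """Hypothetical stand-in for an unsupported-feature failure."""


# Hypothetical one-line summaries keyed by exception class. This plays the
# role that _EXCEPTION_SUMMARY_MESSAGE plays in the snippet under review.
SUMMARY_BY_EXCEPTION: Dict[Type[Exception], str] = {
    ResourcesUnavailableError:
        'Requested resources cannot be satisfied on this cloud.',
    NotSupportedError:
        'Requested features are not supported on this cloud.',
}


def summarize_failures(resource_exceptions: Dict[str, Exception]) -> str:
    """Render a two-column 'Resource / Reason' table as plain text."""
    rows = [('Resource', 'Reason')]
    for resource, exc in resource_exceptions.items():
        # Fall back to a generic message for classes without a summary.
        reason = SUMMARY_BY_EXCEPTION.get(
            type(exc), 'Unknown failure; see the log above.')
        rows.append((resource, reason))
    # Pad the first column so the reasons line up.
    width = max(len(resource) for resource, _ in rows)
    return '\n'.join(
        f'{resource.ljust(width)}  {reason}' for resource, reason in rows)
```

Because the dictionary is keyed by resource, each row carries the cloud and instance type that the raw failover_history entries lack. A lookup with a default (rather than direct indexing) avoids a KeyError for exception classes that have no summary yet; whether the PR needs that fallback is an open question.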
It might be good to test with, e.g.
sky launch --gpus H100:8
to see how the output for failover through many regions look like.
Sure here is the final output:
$ sky launch --gpus H100:8
sky.exceptions.ResourcesUnavailableError: Failed to provision all possible launchable resources. Relax the task's resource requirements: 1x <Cloud>({'H100': 8})
To keep retrying until the cluster is up, use the `--retry-until-up` flag.
The reasons for the infeasibility of each resource are summarized below. For detailed explanations, please refer to the log above.
Resource                         Reason
GCP(a3-highgpu-8g, {'H100': 8})  Requested resources cannot be satisfied on this cloud.
GCP(a3-highgpu-8g, {'H100': 8})  Requested resources cannot be satisfied on this cloud.
GCP(a3-highgpu-8g, {'H100': 8})  Requested resources cannot be satisfied on this cloud.
GCP(a3-highgpu-8g, {'H100': 8})  Requested resources cannot be satisfied on this cloud.
AWS(p5.48xlarge, {'H100': 8})    Requested resources cannot be satisfied on this cloud.
AWS(p5.48xlarge, {'H100': 8})    Requested resources cannot be satisfied on this cloud.
AWS(p5.48xlarge, {'H100': 8})    Requested resources cannot be satisfied on this cloud.
GCP(a3-highgpu-8g, {'H100': 8})  Requested resources cannot be satisfied on this cloud.
GCP(a3-highgpu-8g, {'H100': 8})  Requested resources cannot be satisfied on this cloud.
GCP(a3-highgpu-8g, {'H100': 8})  Requested resources cannot be satisfied on this cloud.
AWS(p5.48xlarge, {'H100': 8})    Requested resources cannot be satisfied on this cloud.
GCP(a3-highgpu-8g, {'H100': 8})  Requested resources cannot be satisfied on this cloud.
GCP(a3-highgpu-8g, {'H100': 8})  Requested resources cannot be satisfied on this cloud.
GCP(a3-highgpu-8g, {'H100': 8})  Requested resources cannot be satisfied on this cloud.
GCP(a3-highgpu-8g, {'H100': 8})  Requested resources cannot be satisfied on this cloud.
It seems to work as expected.
A special case occurs when a resource has too many requirements: the 'Resource' column becomes very long, which hurts the display in the terminal.
Hi @Michaelvll! I've just pushed a revised version of the PR, which changes the format of the output table to fit the width of the terminal and provides more details for users. The new output is in the updated PR description.
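One way to fit a two-column table to the terminal width is sketched below using only the standard library (`shutil.get_terminal_size` and `textwrap.wrap`). This is an illustration under assumptions, not the approach the revised PR takes: `render_table`, its `total_width` parameter, and the rule of capping the first column at half the width are all hypothetical.

```python
import shutil
import textwrap
from typing import List, Tuple


def render_table(rows: List[Tuple[str, str]], total_width: int = 0) -> str:
    """Wrap both columns so that no output line exceeds total_width."""
    if total_width <= 0:
        # Query the terminal; fall back to 80 columns when it is unknown.
        total_width = shutil.get_terminal_size(fallback=(80, 24)).columns
    gap = 2
    sep = ' ' * gap
    # Give the first column what it needs, capped at half of the width.
    longest = max(len(resource) for resource, _ in rows)
    first = min(longest, max(1, (total_width - gap) // 2))
    second = max(1, total_width - gap - first)
    lines = []
    for resource, reason in rows:
        left = textwrap.wrap(resource, first) or ['']
        right = textwrap.wrap(reason, second) or ['']
        # Emit as many physical lines as the taller of the two cells.
        for i in range(max(len(left), len(right))):
            left_cell = left[i] if i < len(left) else ''
            right_cell = right[i] if i < len(right) else ''
            lines.append(f'{left_cell.ljust(first)}{sep}{right_cell}'.rstrip())
    return '\n'.join(lines)
```

Wrapping keeps the full resource description visible at the cost of taller rows; truncating with an ellipsis is the alternative when a compact table matters more than completeness.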
Thanks @Conless!
Nit: this line is too verbose: "The reasons for the infeasibility of each resource are summarized below. For detailed explanations, please refer to the log above."
Suggestion: "Reasons for provision failures (for details, please check the log above):"
Thanks for your suggestion, @yika-luo! I've just updated the message as you suggested.
This PR fixes #3911 by summarizing the infeasibility reasons for each resource in a table and appending it to the end of the final exception message.
Here is a minimal example.
The output table is sized to fit the width of the terminal. This is an example with a narrow terminal (output is truncated).
Tested (run the relevant ones):
bash format.sh
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
conda deactivate; bash -i tests/backward_compatibility_tests.sh