[k8s] Fix AssertionError in show-gpus when all GPUs were allocated #4558

Merged
romilbhardwaj merged 2 commits into master from k8s-showgpus-assertionfix
Jan 14, 2025

romilbhardwaj
Collaborator

Fixes #4556.

Before this PR:

(base) ➜  ~ sky show-gpus --cloud kubernetes
Traceback (most recent call last):
  File "/Users/romilb/tools/anaconda3/bin/sky", line 8, in <module>
    sys.exit(cli())
  File "/Users/romilb/tools/anaconda3/lib/python3.9/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/Users/romilb/tools/anaconda3/lib/python3.9/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/utils/common_utils.py", line 366, in _record
    return f(*args, **kwargs)
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/cli.py", line 838, in invoke
    return super().invoke(ctx)
  File "/Users/romilb/tools/anaconda3/lib/python3.9/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/romilb/tools/anaconda3/lib/python3.9/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/romilb/tools/anaconda3/lib/python3.9/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/clouds/service_catalog/config.py", line 51, in wrapper
    return func(*args, **kwargs)
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/utils/common_utils.py", line 386, in _record
    return f(*args, **kwargs)
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/cli.py", line 3468, in show_gpus
    for out in _output():
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/cli.py", line 3222, in _output
    k8s_realtime_table = _get_kubernetes_realtime_gpu_table(
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/cli.py", line 3141, in _get_kubernetes_realtime_gpu_table
    assert (set(counts.keys()) == set(capacity.keys()) == set(
AssertionError: Keys of counts (['H100']), capacity (['H100']), and available ([]) must be same.

After this PR:

(base) ➜  ~ sky show-gpus --cloud kubernetes
Kubernetes GPUs (context: kind-skypilot)
GPU   REQUESTABLE_QTY_PER_NODE  TOTAL_GPUS  TOTAL_FREE_GPUS
H100  1, 2, 4, 8                8           0

Kubernetes per node accelerator availability
NODE_NAME               GPU_NAME  TOTAL_GPUS  FREE_GPUS
skypilot-control-plane  H100      8           0
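
For reviewers, here is a minimal sketch of the failure mode and the fix. It is not the actual SkyPilot code: `check_keys` and `backfill_missing` are hypothetical helpers, and the dict contents are illustrative values modeled on the output above. The assumption is that the availability dict only receives an entry when a node has at least one free GPU, so a fully allocated cluster leaves it empty.

```python
def check_keys(counts, capacity, available):
    # Chained equality: all three key sets must be identical.
    assert (set(counts.keys()) == set(capacity.keys()) == set(
        available.keys())), (
            f'Keys of counts ({list(counts.keys())}), '
            f'capacity ({list(capacity.keys())}), '
            f'and available ({list(available.keys())}) must be same.')


def backfill_missing(capacity, available):
    # Sketch of the fix: every accelerator present in the cluster gets an
    # explicit entry, where 0 means "present but fully allocated".
    filled = dict(available)
    for accelerator_name in capacity:
        if accelerator_name not in filled:
            filled[accelerator_name] = 0
    return filled


# Illustrative values only (assumed shapes, not SkyPilot's real structures).
counts = {'H100': [1, 2, 4, 8]}
capacity = {'H100': 8}
available = {}  # all 8 GPUs allocated, so no entry was ever written

try:
    check_keys(counts, capacity, available)
    raised = False
except AssertionError:
    raised = True  # this is the error from the traceback above

fixed = backfill_missing(capacity, available)
check_keys(counts, capacity, fixed)  # passes once 'H100' maps to 0
```

With the zero entry in place, the key sets match and the table can report `TOTAL_FREE_GPUS` as 0 instead of crashing.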

Collaborator

@Michaelvll Michaelvll left a comment

LGTM! Thanks for the fix @romilbhardwaj!

Comment on lines +257 to +258
if accelerator_name not in total_accelerators_available:
    total_accelerators_available[accelerator_name] = 0
Collaborator

minor: we can also use collections.defaultdict(int) if needed.

Collaborator Author

Hmmm, it might be better to use a regular dict here to reduce bugs when downstream methods check for GPUs that don't exist in the cluster (e.g., accessing total_accelerators_available['A100'] would return 0, giving the impression that it is currently in use, when it might not exist in the cluster at all).
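
A small sketch of the difference, using made-up dict contents rather than the real cluster state: with `collections.defaultdict(int)`, looking up an absent GPU silently returns 0 and also inserts the key, while a regular dict raises `KeyError` and keeps "fully allocated" distinct from "not in the cluster".

```python
import collections

# Hypothetical availability tables for a cluster that only has H100s,
# all of which are currently allocated.
regular = {'H100': 0}
auto = collections.defaultdict(int, {'H100': 0})

# defaultdict: the lookup succeeds, returns 0, and adds 'A100' as a side effect.
a100_free_auto = auto['A100']
a100_now_listed = 'A100' in auto

# Regular dict: the lookup fails loudly, so the caller has to handle the
# "GPU type does not exist here" case explicitly.
try:
    regular['A100']
    missing_detected = False
except KeyError:
    missing_detected = True
```

After the `defaultdict` lookup, an A100 row with 0 free GPUs would be indistinguishable from a real but busy accelerator, which is the confusion described above.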

@romilbhardwaj romilbhardwaj merged commit 837d51b into master Jan 14, 2025
18 checks passed
@romilbhardwaj romilbhardwaj deleted the k8s-showgpus-assertionfix branch January 14, 2025 19:34
Successfully merging this pull request may close these issues.

Small bug with the command sky show-gpus