[jobs][consolidation] sky jobs launch is stuck, sky jobs logs --controller is not streaming logs #7286

@romilbhardwaj

Description

I'm trying to use managed jobs in consolidation mode, but the launch gets stuck at "Waiting for task to start":

$ sky jobs launch -c nemo nemorl.sky.yaml --secret HF_TOKEN
YAML to run: nemorl.sky.yaml
Managed job 'nemo' will be launched on (estimated):
Considered resources (2 nodes):
----------------------------------------------------------------------------------------
 INFRA                        INSTANCE   vCPUs   Mem(GB)   GPUS     COST ($)   CHOSEN
----------------------------------------------------------------------------------------
 Kubernetes (xx)              -          32      64        H200:1   0.00          ✔
----------------------------------------------------------------------------------------
Launching a managed job 'nemo'. Proceed? [Y/n]:
Launching managed job 'nemo' from jobs controller...
⠧ Waiting for task to start (status: PENDING). It may take a few minutes.
(py310) ➜  ~ sky jobs logs --controller 14
<stays stuck here>

Things I tried that did not work:

  • Restarting the API server with sky api stop; sky api start.
  • Trying both export SKYPILOT_ENABLE_GRPC=0 and export SKYPILOT_ENABLE_GRPC=1.
  • Running sky jobs cancel -ay and trying again.

Commit: a201b22dc2361fe9be379ba9e5d9aef132272e44
Running on local API server.
