Open
Description
While load testing managed jobs, I ran into this error:
E 09-18 00:33:21 controller.py:741] Traceback (most recent call last):
E 09-18 00:33:21 controller.py:741] File "/home/ubuntu/sky_workdir/sky/backends/cloud_vm_ray_backend.py", line 2765, in get_grpc_channel
E 09-18 00:33:21 controller.py:741] s.connect(('localhost', tunnel.port))
E 09-18 00:33:21 controller.py:741] ConnectionRefusedError: [Errno 111] Connection refused
E 09-18 00:33:21 controller.py:741]
E 09-18 00:33:21 controller.py:741] During handling of the above exception, another exception occurred:
E 09-18 00:33:21 controller.py:741]
E 09-18 00:33:21 controller.py:741] Traceback (most recent call last):
E 09-18 00:33:21 controller.py:741] File "/home/ubuntu/sky_workdir/sky/jobs/controller.py", line 698, in run
E 09-18 00:33:21 controller.py:741] succeeded = await self._run_one_task(task_id, task)
E 09-18 00:33:21 controller.py:741] File "/home/ubuntu/sky_workdir/sky/jobs/controller.py", line 442, in _run_one_task
E 09-18 00:33:21 controller.py:741] job_status = await managed_job_utils.get_job_status(
E 09-18 00:33:21 controller.py:741] File "/home/ubuntu/sky_workdir/sky/jobs/utils.py", line 230, in get_job_status
E 09-18 00:33:21 controller.py:741] statuses = await context_utils.to_thread(backend.get_job_status,
E 09-18 00:33:21 controller.py:741] File "/home/ubuntu/miniconda3/lib/python3.10/concurrent/futures/thread.py", line 58, in run
E 09-18 00:33:21 controller.py:741] result = self.fn(*self.args, **self.kwargs)
E 09-18 00:33:21 controller.py:741] File "/home/ubuntu/sky_workdir/sky/backends/cloud_vm_ray_backend.py", line 4458, in get_job_status
E 09-18 00:33:21 controller.py:741] response = backend_utils.invoke_skylet_with_retries(
E 09-18 00:33:21 controller.py:741] File "/home/ubuntu/sky_workdir/sky/backends/backend_utils.py", line 3725, in invoke_skylet_with_retries
E 09-18 00:33:21 controller.py:741] return func()
E 09-18 00:33:21 controller.py:741] File "/home/ubuntu/sky_workdir/sky/backends/cloud_vm_ray_backend.py", line 4459, in <lambda>
E 09-18 00:33:21 controller.py:741] lambda: SkyletClient(handle.get_grpc_channel()
E 09-18 00:33:21 controller.py:741] File "/home/ubuntu/sky_workdir/sky/backends/cloud_vm_ray_backend.py", line 2772, in get_grpc_channel
E 09-18 00:33:21 controller.py:741] tunnel = self._open_and_update_skylet_tunnel()
E 09-18 00:33:21 controller.py:741] File "/home/ubuntu/sky_workdir/sky/backends/cloud_vm_ray_backend.py", line 2808, in _open_and_update_skylet_tunnel
E 09-18 00:33:21 controller.py:741] runners = self.get_command_runners()
E 09-18 00:33:21 controller.py:741] File "/home/ubuntu/sky_workdir/sky/utils/context_utils.py", line 173, in wrapper
E 09-18 00:33:21 controller.py:741] return func(*args, **kwargs)
E 09-18 00:33:21 controller.py:741] File "/home/ubuntu/sky_workdir/sky/utils/common_utils.py", line 618, in _record
E 09-18 00:33:21 controller.py:741] return f(*args, **kwargs)
E 09-18 00:33:21 controller.py:741] File "/home/ubuntu/sky_workdir/sky/backends/cloud_vm_ray_backend.py", line 2647, in get_command_runners
E 09-18 00:33:21 controller.py:741] runners = provision_lib.get_command_runners(
E 09-18 00:33:21 controller.py:741] File "/home/ubuntu/sky_workdir/sky/provision/__init__.py", line 65, in _wrapper
E 09-18 00:33:21 controller.py:741] return func(provider_name, *args, **kwargs)
E 09-18 00:33:21 controller.py:741] File "/home/ubuntu/sky_workdir/sky/provision/__init__.py", line 282, in get_command_runners
E 09-18 00:33:21 controller.py:741] return command_runner.SSHCommandRunner.make_runner_list(
E 09-18 00:33:21 controller.py:741] File "/home/ubuntu/sky_workdir/sky/utils/command_runner.py", line 472, in make_runner_list
E 09-18 00:33:21 controller.py:741] return [cls(node, **kwargs) for node in node_list]
E 09-18 00:33:21 controller.py:741] File "/home/ubuntu/sky_workdir/sky/utils/command_runner.py", line 472, in <listcomp>
E 09-18 00:33:21 controller.py:741] return [cls(node, **kwargs) for node in node_list]
E 09-18 00:33:21 controller.py:741] TypeError: SSHCommandRunner.__init__() missing 2 required positional arguments: 'ssh_user' and 'ssh_private_key'
I also see two warning messages in the log:
W 09-18 00:33:21 cloud_vm_ray_backend.py:2744] Failed to connect to SSH tunnel for cluster 'job-214-1248' on port 11612 ([Errno 111] Connection refused), acquiring lock
W 09-18 00:33:21 cloud_vm_ray_backend.py:2768] Failed to connect to SSH tunnel for cluster 'job-214-1248' on port 11612 ([Errno 111] Connection refused), opening new tunnel
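These warnings were logged separately from the stack trace and carry the same timestamp (00:33:21), so I can't tell for certain whether they came before or after it. I assume before: the second warning is emitted at cloud_vm_ray_backend.py:2768, which sits between the failed connect (line 2765) and the tunnel reopen (line 2772) shown in the traceback.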
The underlying cluster was running on spot instances, so it's possible it was preempted and then cleaned up by the skypilot-status-refresh-daemon.
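For reference, here is a minimal sketch of how the final `TypeError` can arise from the `cls(node, **kwargs)` pattern in the traceback. It assumes the credential kwargs are empty by the time the runner list is built (for example, because the cluster's record was already cleaned up); I have not confirmed that this is what happened. `FakeSSHCommandRunner` and `try_make_runners` are hypothetical stand-ins, not SkyPilot code, and only mirror the two required arguments named in the error.

```python
class FakeSSHCommandRunner:
    """Hypothetical stand-in for a runner whose __init__ requires SSH credentials."""

    def __init__(self, node, ssh_user, ssh_private_key):
        self.node = node
        self.ssh_user = ssh_user
        self.ssh_private_key = ssh_private_key

    @classmethod
    def make_runner_list(cls, node_list, **kwargs):
        # Same shape as the list comprehension in the traceback:
        # every node gets the same keyword arguments.
        return [cls(node, **kwargs) for node in node_list]


def try_make_runners(node_list, credentials):
    """Return (runners, None) on success or (None, error message) on TypeError."""
    try:
        runners = FakeSSHCommandRunner.make_runner_list(node_list, **credentials)
        return runners, None
    except TypeError as e:
        return None, str(e)


# Assumption: the credentials dict is empty, so no ssh_user or
# ssh_private_key is forwarded to __init__.
runners, error = try_make_runners([("10.0.0.1", 22)], {})
print(error)

# With both credentials present, construction succeeds.
ok_runners, ok_error = try_make_runners(
    [("10.0.0.1", 22)],
    {"ssh_user": "ubuntu", "ssh_private_key": "~/.ssh/example-key"},
)
```

If this is the cause, the `TypeError` only masks the real condition (credentials or cluster info no longer available), so a clearer error raised before constructing the runners might be more useful here.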