[grpc] TypeError: SSHCommandRunner.__init__() missing 2 required positional arguments: 'ssh_user' and 'ssh_private_key' #7239

@cg505

While load testing managed jobs, I ran into this error.

E 09-18 00:33:21 controller.py:741] Traceback (most recent call last):
E 09-18 00:33:21 controller.py:741]   File "/home/ubuntu/sky_workdir/sky/backends/cloud_vm_ray_backend.py", line 2765, in get_grpc_channel
E 09-18 00:33:21 controller.py:741]     s.connect(('localhost', tunnel.port))
E 09-18 00:33:21 controller.py:741] ConnectionRefusedError: [Errno 111] Connection refused
E 09-18 00:33:21 controller.py:741]
E 09-18 00:33:21 controller.py:741] During handling of the above exception, another exception occurred:
E 09-18 00:33:21 controller.py:741]
E 09-18 00:33:21 controller.py:741] Traceback (most recent call last):
E 09-18 00:33:21 controller.py:741]   File "/home/ubuntu/sky_workdir/sky/jobs/controller.py", line 698, in run
E 09-18 00:33:21 controller.py:741]     succeeded = await self._run_one_task(task_id, task)
E 09-18 00:33:21 controller.py:741]   File "/home/ubuntu/sky_workdir/sky/jobs/controller.py", line 442, in _run_one_task
E 09-18 00:33:21 controller.py:741]     job_status = await managed_job_utils.get_job_status(
E 09-18 00:33:21 controller.py:741]   File "/home/ubuntu/sky_workdir/sky/jobs/utils.py", line 230, in get_job_status
E 09-18 00:33:21 controller.py:741]     statuses = await context_utils.to_thread(backend.get_job_status,
E 09-18 00:33:21 controller.py:741]   File "/home/ubuntu/miniconda3/lib/python3.10/concurrent/futures/thread.py", line 58, in run
E 09-18 00:33:21 controller.py:741]     result = self.fn(*self.args, **self.kwargs)
E 09-18 00:33:21 controller.py:741]   File "/home/ubuntu/sky_workdir/sky/backends/cloud_vm_ray_backend.py", line 4458, in get_job_status
E 09-18 00:33:21 controller.py:741]     response = backend_utils.invoke_skylet_with_retries(
E 09-18 00:33:21 controller.py:741]   File "/home/ubuntu/sky_workdir/sky/backends/backend_utils.py", line 3725, in invoke_skylet_with_retries
E 09-18 00:33:21 controller.py:741]     return func()
E 09-18 00:33:21 controller.py:741]   File "/home/ubuntu/sky_workdir/sky/backends/cloud_vm_ray_backend.py", line 4459, in <lambda>
E 09-18 00:33:21 controller.py:741]     lambda: SkyletClient(handle.get_grpc_channel()
E 09-18 00:33:21 controller.py:741]   File "/home/ubuntu/sky_workdir/sky/backends/cloud_vm_ray_backend.py", line 2772, in get_grpc_channel
E 09-18 00:33:21 controller.py:741]     tunnel = self._open_and_update_skylet_tunnel()
E 09-18 00:33:21 controller.py:741]   File "/home/ubuntu/sky_workdir/sky/backends/cloud_vm_ray_backend.py", line 2808, in _open_and_update_skylet_tunnel
E 09-18 00:33:21 controller.py:741]     runners = self.get_command_runners()
E 09-18 00:33:21 controller.py:741]   File "/home/ubuntu/sky_workdir/sky/utils/context_utils.py", line 173, in wrapper
E 09-18 00:33:21 controller.py:741]     return func(*args, **kwargs)
E 09-18 00:33:21 controller.py:741]   File "/home/ubuntu/sky_workdir/sky/utils/common_utils.py", line 618, in _record
E 09-18 00:33:21 controller.py:741]     return f(*args, **kwargs)
E 09-18 00:33:21 controller.py:741]   File "/home/ubuntu/sky_workdir/sky/backends/cloud_vm_ray_backend.py", line 2647, in get_command_runners
E 09-18 00:33:21 controller.py:741]     runners = provision_lib.get_command_runners(
E 09-18 00:33:21 controller.py:741]   File "/home/ubuntu/sky_workdir/sky/provision/__init__.py", line 65, in _wrapper
E 09-18 00:33:21 controller.py:741]     return func(provider_name, *args, **kwargs)
E 09-18 00:33:21 controller.py:741]   File "/home/ubuntu/sky_workdir/sky/provision/__init__.py", line 282, in get_command_runners
E 09-18 00:33:21 controller.py:741]     return command_runner.SSHCommandRunner.make_runner_list(
E 09-18 00:33:21 controller.py:741]   File "/home/ubuntu/sky_workdir/sky/utils/command_runner.py", line 472, in make_runner_list
E 09-18 00:33:21 controller.py:741]     return [cls(node, **kwargs) for node in node_list]
E 09-18 00:33:21 controller.py:741]   File "/home/ubuntu/sky_workdir/sky/utils/command_runner.py", line 472, in <listcomp>
E 09-18 00:33:21 controller.py:741]     return [cls(node, **kwargs) for node in node_list]
E 09-18 00:33:21 controller.py:741] TypeError: SSHCommandRunner.__init__() missing 2 required positional arguments: 'ssh_user' and 'ssh_private_key'
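
For reference, the final TypeError is what Python raises when `cls(node, **kwargs)` is called and `kwargs` contains neither `ssh_user` nor `ssh_private_key`. The following is a minimal sketch of that failure mode only. `StandInRunner` and `try_make_runners` are hypothetical names for illustration, not SkyPilot's actual classes, and the sketch does not explain why the credentials were missing from the kwargs in the first place.

```python
class StandInRunner:
    """Hypothetical stand-in for illustration; NOT SkyPilot's SSHCommandRunner."""

    def __init__(self, node, ssh_user, ssh_private_key):
        self.node = node
        self.ssh_user = ssh_user
        self.ssh_private_key = ssh_private_key

    @classmethod
    def make_runner_list(cls, node_list, **kwargs):
        # Same shape as the list comprehension shown in the traceback.
        return [cls(node, **kwargs) for node in node_list]


def try_make_runners(node_list, **kwargs):
    """Return (runners, None) on success or (None, error message) on TypeError."""
    try:
        return StandInRunner.make_runner_list(node_list, **kwargs), None
    except TypeError as e:
        return None, str(e)


# No credentials in kwargs: reproduces the "missing 2 required positional
# arguments" error as soon as the node list is non-empty.
runners, error = try_make_runners([("10.0.0.1", 22)])
print(error)
```

Note that the error can only surface when the node list is non-empty, since the constructor is never called for an empty list.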

I also see two warning log messages:

W 09-18 00:33:21 cloud_vm_ray_backend.py:2744] Failed to connect to SSH tunnel for cluster 'job-214-1248' on port 11612 ([Errno 111] Connection refused), acquiring lock
W 09-18 00:33:21 cloud_vm_ray_backend.py:2768] Failed to connect to SSH tunnel for cluster 'job-214-1248' on port 11612 ([Errno 111] Connection refused), opening new tunnel

Because those messages were logged separately from the stack trace, I can't tell whether they came before or after it, but I assume before.
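
Both warnings and the first exception in the trace come from a plain TCP connect to `localhost` on the tunnel port being refused. A minimal sketch of that kind of liveness probe, assuming only what the traceback shows (`s.connect(('localhost', tunnel.port))`); the function name and timeout are hypothetical, and this is not the code in `get_grpc_channel`:

```python
import socket


def local_port_accepts_connections(port, timeout=1.0):
    """Return True if something is listening on localhost:port, else False.

    Hypothetical helper for illustration only.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(timeout)
        try:
            s.connect(("localhost", port))
        except (ConnectionRefusedError, socket.timeout):
            # [Errno 111] Connection refused lands here on Linux.
            return False
        return True
```

A `False` result here corresponds to the "Failed to connect to SSH tunnel ... Connection refused" warnings, after which the backend tries to open a new tunnel and needs command runners to do so.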

The underlying cluster was on spot, so it's possible it was preempted and cleaned up by the skypilot-status-refresh-daemon.
