-
Notifications
You must be signed in to change notification settings - Fork 192
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add the SshFabricTransport
plugin
#6154
Conversation
Quick note to self when testing this: also try an SSH configuration that sets the following:
This was the case for a certain user, and I'm curious to see if |
@sphuber finally got around to testing this. Setup/configuration went buttery smooth, really nice that users can just specify the However, the test for getting the number of jobs fails for some reason: * Getting number of jobs from scheduler... [Failed]: TimeoutError:
Full traceback:
Traceback (most recent call last):
File "/Users/mbercx/.aiida_venvs/sirius/lib/python3.10/site-packages/paramiko/channel.py", line 699, in recv
out = self.in_buffer.read(nbytes, self.timeout)
File "/Users/mbercx/.aiida_venvs/sirius/lib/python3.10/site-packages/paramiko/buffered_pipe.py", line 164, in read
raise PipeTimeout()
paramiko.buffered_pipe.PipeTimeout
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/mbercx/project/sirius/git/aiida-core/aiida/cmdline/commands/cmd_computer.py", line 559, in computer_test
success, message = test(
File "/Users/mbercx/project/sirius/git/aiida-core/aiida/cmdline/commands/cmd_computer.py", line 58, in _computer_test_get_jobs
found_jobs = scheduler.get_jobs(as_dict=True)
File "/Users/mbercx/project/sirius/git/aiida-core/aiida/schedulers/scheduler.py", line 361, in get_jobs
retval, stdout, stderr = self.transport.exec_command_wait(self._get_joblist_command(jobs=jobs, user=user))
File "/Users/mbercx/project/sirius/git/aiida-core/aiida/transports/transport.py", line 438, in exec_command_wait
retval, stdout_bytes, stderr_bytes = self.exec_command_wait_bytes(command=command, stdin=stdin, **kwargs)
File "/Users/mbercx/project/sirius/git/aiida-core/aiida/transports/plugins/ssh.py", line 1472, in exec_command_wait_bytes
stdout_bytes.append(stdout.read())
File "/Users/mbercx/.aiida_venvs/sirius/lib/python3.10/site-packages/paramiko/file.py", line 200, in read
new_data = self._read(self._DEFAULT_BUFSIZE)
File "/Users/mbercx/.aiida_venvs/sirius/lib/python3.10/site-packages/paramiko/channel.py", line 1361, in _read
return self.channel.recv(size)
File "/Users/mbercx/.aiida_venvs/sirius/lib/python3.10/site-packages/paramiko/channel.py", line 701, in recv
raise socket.timeout()
TimeoutError I tried setting up the same computer with EDIT: Seems to work just fine on Piz Daint. Strange... |
When trying to run on Piz Daint, I encountered the following error: | Traceback (most recent call last):
| File "/Users/mbercx/project/sirius/git/aiida-core/aiida/engine/utils.py", line 187, in exponential_backoff_retry
| result = await coro()
| File "/Users/mbercx/project/sirius/git/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 146, in do_submit
| return execmanager.submit_calculation(node, transport)
| File "/Users/mbercx/project/sirius/git/aiida-core/aiida/engine/daemon/execmanager.py", line 382, in submit_calculation
| result = scheduler.submit_from_script(workdir, submit_script_filename)
| File "/Users/mbercx/project/sirius/git/aiida-core/aiida/schedulers/scheduler.py", line 412, in submit_from_script
| return self._parse_submit_output(*result)
| File "/Users/mbercx/project/sirius/git/aiida-core/aiida/schedulers/plugins/slurm.py", line 433, in _parse_submit_output
| transport_string = f' for {self.transport}'
| File "/Users/mbercx/project/sirius/git/aiida-core/aiida/transports/plugins/ssh.py", line 594, in __str__
| conn_info = self._machine
| AttributeError: 'SshFabricTransport' object has no attribute '_machine' EDIT: After fixing this | Traceback (most recent call last):
| File "/Users/mbercx/project/sirius/git/aiida-core/aiida/engine/utils.py", line 187, in exponential_backoff_retry
| result = await coro()
| File "/Users/mbercx/project/sirius/git/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 146, in do_submit
| return execmanager.submit_calculation(node, transport)
| File "/Users/mbercx/project/sirius/git/aiida-core/aiida/engine/daemon/execmanager.py", line 382, in submit_calculation
| result = scheduler.submit_from_script(workdir, submit_script_filename)
| File "/Users/mbercx/project/sirius/git/aiida-core/aiida/schedulers/scheduler.py", line 412, in submit_from_script
| return self._parse_submit_output(*result)
| File "/Users/mbercx/project/sirius/git/aiida-core/aiida/schedulers/plugins/slurm.py", line 433, in _parse_submit_output
| transport_string = f' for {self.transport}'
| File "/Users/mbercx/project/sirius/git/aiida-core/aiida/transports/plugins/ssh.py", line 596, in __str__
| conn_info = f"{self._connect_args['username']}@{conn_info}"
| AttributeError: 'SshFabricTransport' object has no attribute '_connect_args' |
15e31ca
to
aa31605
Compare
0b02b48
to
64106c0
Compare
64106c0
to
55da99b
Compare
Update note: |
> File "/Users/mbercx/project/sirius/git/aiida-core/aiida/transports/plugins/ssh.py", line 1472, in exec_command_wait_bytes
> stdout_bytes.append(stdout.read())
> File "/Users/mbercx/.aiida_venvs/sirius/lib/python3.10/site-packages/paramiko/file.py", line 200, in read
> new_data = self._read(self._DEFAULT_BUFSIZE)
> File "/Users/mbercx/.aiida_venvs/sirius/lib/python3.10/site-packages/paramiko/channel.py", line 1361, in _read
> return self.channel.recv(size)
> File "/Users/mbercx/.aiida_venvs/sirius/lib/python3.10/site-packages/paramiko/channel.py", line 701, in recv
> raise socket.timeout()
> TimeoutError
>
I'm confused why this was happening, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright I went through this and tried it out, a few comments:
- If
hostname
doesn't exist or not configured in~/.ssh/config
, no message is communicated to users duringverdi -p <profile> computer configure
. I would even add a click callback, to actually try withfabric
if connection is possible. - If the
~/.ssh/config
is deleted after configuring your computer, the connection always fails.
These two, makes core.ssh_fabric
a "silent" plugin, basically works only if user can do $ ssh hostname
.
We should be aware if that fails for whatever reason after configuration, aiida
connectivity also fails.
This calls for adding more checks wherever relevant and raising to user.
Would be perfect to have this clearly stated at the time of configuring a computer.
Also useful to define def get_description(cls)
to give these info.
|
||
def open(self): | ||
"""Open the connection.""" | ||
if not self._connection.is_connected: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Catch paramiko.ssh_exception.AuthenticationException
:
- check if
hostname
cannot be found in~/.ssh/config
- make user aware of this somehow.
- kill immediately, no reason to retry
MAX_ATTEMPTS_OPTION
times in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would indeed be nice to get some more clear feedback, for example if I change my Host
in the .ssh/config
and then try to test the computer, I get:
❯ verdi computer test todi --print-traceback
Report: Testing computer<todi> for user<lala@lolo>...
* Opening connection... [FAILED]: Error while trying to connect to the computer
Full traceback:
Traceback (most recent call last):
File "/Users/mbercx/project/core/git/aiida-core/src/aiida/cmdline/commands/cmd_computer.py", line 541, in computer_test
with transport:
File "/Users/mbercx/project/core/git/aiida-core/src/aiida/transports/transport.py", line 128, in __enter__
self.open()
File "/Users/mbercx/project/core/git/aiida-core/src/aiida/transports/plugins/ssh_fabric.py", line 47, in open
self._sftp = self._connection.sftp() # This opens the fabric connection ``self._connection`` as well
File "/Users/mbercx/.aiida_venvs/core/lib/python3.10/site-packages/decorator.py", line 232, in fun
return caller(func, *(extras + args), **kw)
File "/Users/mbercx/.aiida_venvs/core/lib/python3.10/site-packages/fabric/connection.py", line 22, in opens
self.open()
File "/Users/mbercx/.aiida_venvs/core/lib/python3.10/site-packages/fabric/connection.py", line 665, in open
result = self.client.connect(**kwargs)
File "/Users/mbercx/.aiida_venvs/core/lib/python3.10/site-packages/paramiko/client.py", line 349, in connect
to_try = list(self._families_and_addresses(hostname, port))
File "/Users/mbercx/.aiida_venvs/core/lib/python3.10/site-packages/paramiko/client.py", line 203, in _families_and_addresses
addrinfos = socket.getaddrinfo(
File "/opt/homebrew/Cellar/[email protected]/3.10.14/Frameworks/Python.framework/Versions/3.10/lib/python3.10/socket.py", line 955, in getaddrinfo
for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno 8] nodename nor servname provided, or not known
Warning: 1 out of 0 tests failed
Same in case I change the name of ~/.ssh/config
.
Just gave it another spin, and everything seems to be running smooth as butter on all machines I've tried. |
I like the idea of raising a warning in case the provided configuration fails, but would never dare to introduce such scope creep on a PR of @sphuber. 😇 That said, this should probably be done for all transport configure commands then. I'd open an issue for this and deal with it separately. |
I agree that there not being any warning or error whatsoever is problematic. I am not sure I would want to add a check to So really what I am thinking is that this is a problem specific with the |
Seems that we won't be able to rely much on In [16]: from fabric import Config
In [17]: from fabric import Connection
In [18]: conn = Connection('nonexisting-domain')
In [19]: conn.open()
---------------------------------------------------------------------------
gaierror Traceback (most recent call last)
Cell In[19], line 1
----> 1 conn.open()
File ~/.mambaforge/envs/aiida-py311/lib/python3.11/site-packages/fabric/connection.py:665, in Connection.open(self)
659 kwargs["auth_strategy"] = auth_strategy_class(
660 ssh_config=self.ssh_config,
661 fabric_config=self.config,
662 username=self.user,
663 )
664 # Actually connect!
--> 665 result = self.client.connect(**kwargs)
666 self.transport = self.client.get_transport()
667 return result
File ~/.mambaforge/envs/aiida-py311/lib/python3.11/site-packages/paramiko/client.py:349, in SSHClient.connect(self, hostname, port, username, password, pkey, key_filename, timeout, allow_agent, look_for_keys, compress, sock, gss_auth, gss_kex, gss_deleg_creds, gss_host, banner_timeout, auth_timeout, gss_trust_dns, passphrase, disabled_algorithms, transport_factory)
347 errors = {}
348 # Try multiple possible address families (e.g. IPv4 vs IPv6)
--> 349 to_try = list(self._families_and_addresses(hostname, port))
350 for af, addr in to_try:
351 try:
File ~/.mambaforge/envs/aiida-py311/lib/python3.11/site-packages/paramiko/client.py:203, in SSHClient._families_and_addresses(self, hostname, port)
195 """
196 Yield pairs of address families and addresses to try for connecting.
197
(...)
200 :returns: Yields an iterable of ``(family, address)`` tuples
201 """
202 guess = True
--> 203 addrinfos = socket.getaddrinfo(
204 hostname, port, socket.AF_UNSPEC, socket.SOCK_STREAM
205 )
206 for (family, socktype, proto, canonname, sockaddr) in addrinfos:
207 if socktype == socket.SOCK_STREAM:
File ~/.mambaforge/envs/aiida-py311/lib/python3.11/socket.py:962, in getaddrinfo(host, port, family, type, proto, flags)
959 # We override this function since we want to translate the numeric family
960 # and socket type values to enum constants.
961 addrlist = []
--> 962 for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
963 af, socktype, proto, canonname, sa = res
964 addrlist.append((_intenum_converter(af, AddressFamily),
965 _intenum_converter(socktype, SocketKind),
966 proto, canonname, sa))
gaierror: [Errno -3] Temporary failure in name resolution |
Some quick notes during our discussion:
|
55da99b
to
f48a1f4
Compare
@mbercx I have been reading the implementation of I have pushed a commit that adds the |
740e7ca
to
22a6e69
Compare
Thanks @sphuber! I'll give it a spin. :)
Quick first question here: Does this happen at configuration time, or connection time? |
At connection time. It works exactly as the fabric version, just it only uses |
Tested three different clusters I have access to, all seems to work as expected! 👌 Will now do a proper review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just two comments, main one being the name of the plugin.
To keep my scope-creep reputation alive: for this plugin it would be especially nice if the user didn't have to configure the connection after setting up the computer. But we can discuss if/how to make that happen another day.
__all__ = ('SshAutoTransport',) | ||
|
||
|
||
class SshAutoTransport(SshTransport): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As mentioned in one of my comments above, I'm wondering if we should name the plugin SshConfigTransport
(entry point ssh_config
), since it takes the information from the ~/.ssh/config
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see your point of it corresponding to the ~/.ssh/config
file, but I don't think it is as obvious to a user. If a user knows nothing about the way these plugins work, but they have the option between core.ssh
and core.ssh_auto
, I think it is safe to say that they will pick the latter if they want the easy experience. In contrast, if the options are core.ssh
and core.ssh_config
I think they would guess that core.ssh_config
would require more configuration. That is why I would vote for core.ssh_auto
and just have the docstring explain that it automatically parses the config file. This also makes it less tied to exactly what it is parsing automatically if in the future we want to add more sources of information.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, fair enough! Happy to keep auto then. :)
with (pathlib.Path('~').expanduser() / '.ssh' / 'config').open() as handle: | ||
config_ssh.parse(handle) | ||
|
||
if self._machine not in config_ssh.get_hostnames(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any case where the connection would still work when the host is not in the ~/.ssh/config
? Is there a reason to not outright raise
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My thinking: it might. And since it will raise in that case anyway, why not give it a shot?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One consideration might be that the user might focus more on the (somewhat unclear) error and miss the very clear warning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough. One more reason why I decided to use a warning is that the ~/.ssh/config
is not the only place that SSH connections can be configured. There are also system wide configs like etc/sshd/*
and others. These are also checked by fabric
which is why it failed silently if the specified hostname is not part of the ~/.ssh/config
file, because it cannot conclude that there is a problem if that is the case. We are adding this because for our purposes in most cases the user's config file would be the main place and we want to warn explicitly if we don't find the host there, but it is not necessarily a problem. Anyway, if others think it is better to make this an exception for now, happy to change it. We can always remove it later and turn it into a warning if there are cases where the plugin should rely on other configuration files as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's also not critical for me, and I see your point. In fact, now that I think of it, I might already know of a user that had a setup I wasn't familiar with, which might work fine even though the remote isn't specified in the ~/.ssh/config
. So let's keep the warning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mbercx did you already tried the core.ssh_auto
with your various real-life test cases? Did it work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, yes! I tried 3 different clusters, with and without proxy jump. Also messed with the configuration to make it fail and then fixed it. All seems to work as expected.
So I think once we remove ssh_fabric
we can go ahead and merge this, let it into the wild and see what happens. 😁
22a6e69
to
8fd664f
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #6154 +/- ##
==========================================
+ Coverage 77.51% 77.82% +0.32%
==========================================
Files 560 566 +6
Lines 41444 41961 +517
==========================================
+ Hits 32120 32654 +534
+ Misses 9324 9307 -17 ☔ View full report in Codecov by Sentry. |
This transport plugin subclasses the `SshTransport` plugin in order to remove all the configuration options. Instead, it parses the user's SSH config file using `paramiko.SSHConfig` when the transport is instantiated to determine the connection parameters automatically. The advantage of this approach is that when configuring a `Computer` using this plugin, the user is not prompted with a bunch of options. Rather, if they can connect to the target machine using `ssh` directly, the transport will also work. What's more, when the user updates their SSH config, the transport automatically uses these changes the next time it is instantiated as opposed to the `SshTransport` plugin which stores the configuration in an `AuthInfo` in the database and is therefore static. The original implementation of this plugin looked into the `fabric` library. This library builds on top of `paramiko` and aims to make configuration SSH connections easier, just as this new plugin was aiming to. However, after a closer look, it seems that fabric was not adding a lot of clever code when it comes to parsing the user's SSH config. It does implement some clever code for dealing with proxy jumps and commands but the `SshTransport` also already implements this. Therefore, it is not really justified to add `fabric` as a dependency but instead we opt to use `paramiko` to parse the config ourselves.
Keeping the original implementation using ###########################################################################
# Copyright (c), The AiiDA team. All rights reserved. #
# This file is part of the AiiDA code. #
# #
# The code is hosted on GitHub at https://github.com/aiidateam/aiida-core #
# For further information on the license, see the LICENSE.txt file #
# For further information please visit http://www.aiida.net #
###########################################################################
"""Transport plugin for SSH connection using ``fabric``.
This subclasses the ``SshTransport`` plugin to replace the connection configuration using the ``fabric`` package. This
makes the configuration significantly easier for the user since ``fabric`` automatically guesses connection parameters
by parsing available SSH configuration files on the system. Since ``fabric`` uses ``paramiko`` under the hood, which is
the library used by the ``SshTransport``, most of the implementation can be reused. The ``paramiko.SshClient`` is
exposed on the ``fabric.Connection`` instance and is proxied to the ``_client`` attribute of the class where the
``SshTransport`` base class expects to find it.
"""
import fabric
from aiida.common.exceptions import InvalidOperation
from ..transport import Transport
from .ssh import SshTransport
__all__ = ('SshFabricTransport',)
class SshFabricTransport(SshTransport):
"""Transport plugin for SSH connections with highly-automated connection configuration.
The connection is made using the ``fabric`` package which will try to use available SSH configuration files on the
system to guess the correct configuration for the connection to the relevant hostname.
"""
_valid_auth_options = []
def __init__(self, *args, **kwargs):
"""Construct a new instance."""
Transport.__init__(self, *args, **kwargs) # Skip the ``SshTransport`` constructor
self._connection = fabric.Connection(self.hostname)
self._client = self._connection.client
self._sftp = None
def open(self):
"""Open the connection."""
if not self._connection.is_connected:
self._sftp = self._connection.sftp() # This opens the fabric connection ``self._connection`` as well
self._sftp.chdir('.') # Needed to make sure sftp.getcwd() returns a valid path
self._is_open = True
def close(self):
"""Close the connection.
This will close the SFTP channel and the ``fabric`` connection.
:raise aiida.common.InvalidOperation: If the channel is already open.
"""
if not self._is_open:
raise InvalidOperation('Cannot close the transport: it is already closed')
if self._sftp:
self._sftp.close()
self._connection.close()
self._is_open = False
def __str__(self):
"""Return a string representation of the transport instance."""
status = 'OPEN' if self._is_open else 'CLOSED'
return f'{self._connection.user}@{self.hostname}:{self._connection.port} [{status}]'
def gotocomputer_command(self, remotedir):
"""Return the command string to connect to the remote and change directory to ``remotedir``."""
return f'ssh -t {self.hostname} {self._gotocomputer_string(remotedir)}' |
8fd664f
to
7624505
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot @sphuber! Looking forward to seeing this get released. Let's encourage everyone to start using it on main
ASAP so we can catch any possible issues before that though.
This transport plugin subclasses the `SshTransport` plugin in order to remove all the configuration options. Instead, it parses the user's SSH config file using `paramiko.SSHConfig` when the transport is instantiated to determine the connection parameters automatically. The advantage of this approach is that when configuring a `Computer` using this plugin, the user is not prompted with a bunch of options. Rather, if they can connect to the target machine using `ssh` directly, the transport will also work. What's more, when the user updates their SSH config, the transport automatically uses these changes the next time it is instantiated as opposed to the `SshTransport` plugin which stores the configuration in an `AuthInfo` in the database and is therefore static. The original implementation of this plugin looked into the `fabric` library. This library builds on top of `paramiko` and aims to make configuration SSH connections easier, just as this new plugin was aiming to. However, after a closer look, it seems that fabric was not adding a lot of clever code when it comes to parsing the user's SSH config. It does implement some clever code for dealing with proxy jumps and commands but the `SshTransport` also already implements this. Therefore, it is not really justified to add `fabric` as a dependency but instead we opt to use `paramiko` to parse the config ourselves.
Fixes #6082
This transport plugin subclasses the
SshTransport
plugin but replaces the managing of the connection with thefabric
library. Under the hood this library still usesparamiko
, however, it is a lot clevered in automatically determining the correct connection configuration. For example, it will automatically search for SSH configurations on the machine. This makes it easier for the user to setup the transport to connect to a particular host if they can connect to it through a directssh
invocation. Sincefabric
still usesparamiko
under the hood, just like theSshTransport
, most of the implementation can be reused.