Transport & Engine: AsyncTransport plugin #6626

Open: khsrali wants to merge 45 commits into main

Contributor

@khsrali khsrali commented Nov 21, 2024

This PR proposes a set of changes that make transport tasks asynchronous, as requested by @giovannipizzi. This ensures that the daemon is not blocked by time-consuming tasks such as uploads, downloads, and similar operations.

Here’s a summary of the main updates:

  • New Transport Plugin: Introduces AsyncSshTransport with the entry point core.ssh_async.
  • Enhanced Authentication: AsyncSshTransport supports executing custom scripts before connections, which is particularly useful for authentication. 🥇
  • Engine Updates: Modifies the engine to consistently call asynchronous transport methods.
  • Deprecated Methods: Deprecates the use of transport.chdir() and transport.getcwd() (merged in Transport & Engine: factor out getcwd() & chdir() for compatibility with upcoming async transport #6594).
  • Backward Compatibility: Provides synchronous counterparts for all asynchronous methods in AsyncSshTransport.
  • Transport Class Overhaul: Deprecates the previous Transport class. Introduces _BaseTransport, Transport, and AsyncTransport as replacements.
  • Improved Documentation: Adds more docstrings and comments to guide plugin developers. Blocking plugins should inherit from Transport, while asynchronous ones should inherit from AsyncTransport.
  • Updated Tests: Revises test_all_plugins.py to reflect these changes. Unfortunately, existing tests for transport plugins remain minimal and need improvement in a separate PR (TODO).
  • New Path Type: Defines a TransportPath type and upgrades transport plugins to work with Union[str, Path, PurePosixPath].
  • New Feature: Introduces copy_from_remote_to_remote_async, addressing a previous issue where such tasks blocked the entire daemon.
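
The backward-compatibility bullet can be pictured with a minimal sketch. It is not the PR's code: `DemoAsyncTransport` is a hypothetical class, the `whoami_async` body is a placeholder, and the wrapper assumes that no event loop is already running in the calling thread.

```python
import asyncio


class DemoAsyncTransport:
    """Hypothetical stand-in for an async transport plugin (illustration only)."""

    async def whoami_async(self) -> str:
        # Placeholder for a real network round trip.
        await asyncio.sleep(0)
        return 'demo_user'

    def whoami(self) -> str:
        """Synchronous counterpart: drive the coroutine to completion."""
        # Assumption: no event loop is already running in this thread.
        return asyncio.run(self.whoami_async())
```

Calling `DemoAsyncTransport().whoami()` from blocking code then behaves like an ordinary method call, while async callers can still `await` the `_async` variant.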

Dependencies: This PR relies on PR 272 in plumpy.

Note: The initial commits by Chris were pulled from #6079 (closed).


Test Results: Performance Comparisons

When core.ssh_async Outperforms

In scenarios where the daemon is blocked by heavy transfer tasks (uploading/downloading/copying large files), core.ssh_async shows significant improvement.

For example, I submitted two WorkGraphs:

  1. The first handles heavy transfers:
    • Upload 10 MB
    • Remote copy 1 GB
    • Retrieve 1 GB
  2. The second performs a simple shell command: touch file.

The time taken until the submit command is processed (with one daemon running):

  • core.ssh_async: Only 4 seconds! 🚀🚀🚀🚀 A major improvement!
  • core.ssh: 108 seconds (WorkGraph 1 fully completes before processing the second).

When core.ssh_async and core.ssh Are Comparable

For tasks involving both (and many!) uploads and downloads (a common scenario), performance varies slightly depending on the case.

  • Large Files (~1 GB):

    • core.ssh_async performs better due to simultaneous uploads and downloads. In some networks, this can almost double the bandwidth, as demonstrated in the graph below. My bandwidth is 11.8 MB/s but increased to nearly double under favorable conditions:
      [Figure: Bandwidth Boost Example]

    • However, under heavy network load, bandwidth may revert to its base level (e.g., 11.8 MB/s):
      [Figure: Bandwidth Under Load]

      Test Case: Two WorkGraphs: one uploads 1 GB, the other retrieves 1 GB using RemoteData.

      • core.ssh_async: 120 seconds
      • core.ssh: 204 seconds
  • Small Files (Many Small Transfers):

    • Test Case: 25 WorkGraphs each transferring a few 1 MB files.
      • core.ssh_async: 105 seconds
      • core.ssh: 65 seconds

    In this scenario, the overhead of asynchronous calls seems to outweigh the benefits. We need to discuss the trade-offs and explore possible optimizations. As @agoscinski mentioned, this might be expected because of async overheads.
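
To make the large-file result more concrete, here is a toy sketch of why overlapping transfers can finish sooner than sequential ones. It only simulates waiting with `asyncio.sleep`; the function names are hypothetical, no real SSH traffic is involved, and the per-call overhead that hurts the many-small-files case is not modelled.

```python
import asyncio
import time


async def fake_transfer(seconds: float) -> float:
    """Stand-in for an upload or download that waits on the network."""
    await asyncio.sleep(seconds)
    return seconds


async def run_sequential(durations):
    # One transfer at a time, as a blocking transport would behave.
    return [await fake_transfer(d) for d in durations]


async def run_concurrent(durations):
    # All transfers are in flight at once; total time is roughly the longest one.
    return await asyncio.gather(*(fake_transfer(d) for d in durations))


def timed(coro_factory, durations) -> float:
    """Return the wall-clock time needed to run the given coroutine factory."""
    start = time.perf_counter()
    asyncio.run(coro_factory(durations))
    return time.perf_counter() - start
```

With three simulated transfers of equal length, `timed(run_concurrent, ...)` should come out well below `timed(run_sequential, ...)`, which mirrors the direction of the 1 GB measurements above, not their magnitude.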

codecov bot commented Nov 21, 2024

Codecov Report

Attention: Patch coverage is 81.85745% with 168 lines in your changes missing coverage. Please review.

Project coverage is 78.03%. Comparing base (5e8bbe1) to head (2c2272a).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/aiida/transports/plugins/ssh_async.py | 76.16% | 103 Missing ⚠️ |
| src/aiida/transports/transport.py | 86.44% | 35 Missing ⚠️ |
| src/aiida/engine/daemon/execmanager.py | 76.20% | 10 Missing ⚠️ |
| src/aiida/transports/plugins/ssh.py | 88.10% | 10 Missing ⚠️ |
| src/aiida/transports/plugins/local.py | 92.31% | 6 Missing ⚠️ |
| src/aiida/engine/processes/calcjobs/tasks.py | 75.00% | 1 Missing ⚠️ |
| src/aiida/orm/computers.py | 50.00% | 1 Missing ⚠️ |
| src/aiida/plugins/factories.py | 83.34% | 1 Missing ⚠️ |
| src/aiida/transports/util.py | 50.00% | 1 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6626      +/-   ##
==========================================
+ Coverage   77.99%   78.03%   +0.05%     
==========================================
  Files         563      564       +1     
  Lines       41761    42501     +740     
==========================================
+ Hits        32567    33162     +595     
- Misses       9194     9339     +145     

@khsrali khsrali marked this pull request as ready for review November 21, 2024 09:11
@khsrali khsrali requested a review from agoscinski November 21, 2024 17:28
Contributor

@agoscinski agoscinski left a comment

Thanks! Looks good. Just to reiterate the most important comments:


Why don't you just use Transport instead of BlockingTransport, since you set one to the other? Now you have redundancy. I feel like this API is clear to me:

_BaseTransport -> Transport -> SshTransport
_BaseTransport -> AsyncTransport -> AsyncSshTransport

Will you make a PR in plumpy so we can do a new release?


I will review the tests in the separate PR.

@@ -119,7 +120,7 @@ pillow==10.1.0
 platformdirs==3.11.0
 plotly==5.17.0
 pluggy==1.3.0
-plumpy==0.22.3
+plumpy@git+https://github.com/khsrali/plumpy.git@allow-async-upload-download#egg=plumpy
Contributor

Will you make a PR there so we can do a new release?

Contributor Author

yes! Please review here: aiidateam/plumpy#272

if (
    canonicalize_name(requirement_abstract.name) == canonicalize_name(requirement_concrete.name)
    and abstract_contains
):
Contributor

Do we remove this before merging? Otherwise it would be good to add a comment on what the new if-else does. It is hard to understand without context.

Contributor Author

I plan to keep it, as it is very useful for passing CI when we make PRs like this one, which are hooked to another PR or to a branch of another repo with @.

The problem is that @ is not listed as a valid specifier in class Specifier.
This small change accepts @ as a valid specifier and checks whether a hooked dependency points to the same "version" across all files (requirement-xx, environment.yml, etc.).

This way, apart from this nice check, the dependency test fails and it still triggers the main unit tests test-presto, test-3.xx for such PRs (otherwise it won't).
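
The consistency check described here can be sketched as follows. This is not the code in utils/dependency_management.py (which builds on the `packaging` library); it is a deliberately simplified, standard-library-only illustration, and both function names and the example URL used to exercise it are hypothetical.

```python
def parse_requirement(line: str):
    """Split a requirement line into (name, direct_reference_or_None).

    Simplified on purpose: only 'name@url' and 'name<specifier>' forms are handled.
    """
    line = line.strip()
    if '@' in line:
        # Split at the first '@' only; the URL itself may contain '@branch'.
        name, reference = line.split('@', 1)
        return name.strip().lower(), reference.strip()
    # Cut the name at the first version-specifier character.
    for index, char in enumerate(line):
        if char in '=<>~!':
            return line[:index].strip().lower(), None
    return line.lower(), None


def direct_references_consistent(files: dict, package: str) -> bool:
    """Return True if every file points ``package`` at the same reference."""
    references = set()
    for lines in files.values():
        for line in lines:
            name, reference = parse_requirement(line)
            if name == package:
                references.add(reference)
    return len(references) <= 1
```

A file that still pins a plain version while another uses an `@` reference would make `direct_references_consistent` return `False`.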

Contributor Author

I added a few lines of comment to clarify this

Contributor

Thanks!

Collaborator

This is nice; perhaps it would be better to separate it into a standalone PR for visibility.

btw: I started looking into using a uv lockfile in #6640, which seems like a better strategy than having to wrangle 4 different requirements files. :-)

Contributor Author

As we discussed, this feature is already covered in the new PR #6640.
So I will keep the changes temporarily for this PR only, and will revert 'utils/dependency_management.py' before any merge.

src/aiida/transports/transport.py (outdated)
return str(path)


class _BaseTransport:
Contributor

Isn't this part of the public API? Should I use it if I create a new transport plugin, or should I use Transport?

Contributor Author

No, this is private. No one should inherit from it except 'AsyncTransport' and 'BlockingTransport'.
Only 'AsyncTransport' and 'BlockingTransport' are public, to be used to create a new plugin.

Member

I think it is problematic to have a class like this. Take the method get_safe_open_interval as an example.

    def get_safe_open_interval(self):
        """Get an interval (in seconds) that suggests how long the user should wait
        between consecutive calls to open the transport.  This can be used as
        a way to get the user to not swamp a limited number of connections, etc.
        However it is just advisory.
        If returns 0, it is taken that there are no reasons to limit the
        frequency of open calls.
        In the main class, it returns a default value (>0 for safety), set in
        the _DEFAULT_SAFE_OPEN_INTERVAL attribute of the class. Plugins should override it.
        :return: The safe interval between calling open, in seconds
        :rtype: float
        """
        return self._safe_open_interval

It says "Plugins should override it", so what is the point of defining the method?



# This is here for backwards compatibility
Transport = BlockingTransport
Contributor

I don't know if this makes sense to make blocking the default one, especially if you expose both of them in the API. Shouldn't there be a public class for Blocking and Nonblocking transport which one should use to inherit from?

Contributor Author

This was just for backward compatibility, as Giovanni suggested renaming the former blocking Transport to BlockingTransport.

tests/engine/daemon/test_execmanager.py (outdated)
@@ -164,7 +167,8 @@ def test_upload_local_copy_list(
     calc_info.local_copy_list = [[folder.uuid] + local_copy_list]

     with node.computer.get_transport() as transport:
-        execmanager.upload_calculation(node, transport, calc_info, fixture_sandbox)
+        runner = get_manager().get_runner()
+        runner.loop.run_until_complete(execmanager.upload_calculation(node, transport, calc_info, fixture_sandbox))
Contributor

why is this needed now?

Contributor Author

Because execmanager.upload_calculation is now an async function. This way we can call it in a sync test.

Contributor

What happens if you use the old way? The test just passes and continues before finishing the command?

Member

I think it is very tricky to mix async programming with sync functions; it is in general a very hard problem. It looks to me like runner.loop.run_until_complete will block until the task completes, so it gives no benefit after making these methods async. Is create_task the correct thing to use?

Member

Okay, I just asked Ali offline. This is only for tests, and only to check that the implementation is functionally correct. The async behavior of the four operations working together is not the purpose here.
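
The pattern discussed in this thread, driving a coroutine to completion from synchronous test code, can be sketched in isolation. The names below are hypothetical, and unlike the test above, which reuses the runner's loop, this sketch assumes a fresh event loop.

```python
import asyncio


async def upload_calculation_demo(payload: str) -> str:
    """Hypothetical stand-in for an async engine function."""
    await asyncio.sleep(0)
    return f'uploaded:{payload}'


def call_from_sync_code(payload: str) -> str:
    """Run the coroutine from blocking code and return its result."""
    loop = asyncio.new_event_loop()
    try:
        # Blocks the caller until the coroutine finishes, which is what a
        # synchronous test wants.
        return loop.run_until_complete(upload_calculation_demo(payload))
    finally:
        loop.close()
```

As for "the old way": calling `upload_calculation_demo('x')` without awaiting or scheduling it only creates a coroutine object; its body does not run, and Python typically emits a "never awaited" warning when the object is discarded.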

src/aiida/transports/transport.py (outdated)
@@ -86,3 +86,24 @@ def copy_from_remote_to_remote(transportsource, transportdestination, remotesour
     .. note:: it uses the method transportsource.copy_from_remote_to_remote
     """
     transportsource.copy_from_remote_to_remote(transportdestination, remotesource, remotedestination, **kwargs)
+
+
+async def copy_from_remote_to_remote_async(
Contributor

Is this required in the utils? I don't find any usage

Contributor Author

Not sure how it's used, tbh, probably by external plugins? So far I just provide functionality similar to copy_from_remote_to_remote.

Contributor

Okay, this is something that might be cleaned up in the future, but for this PR it does not make much sense.
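
For readers who have not seen the full diff, a helper of this kind could plausibly mirror the synchronous one by delegating to the source transport. The sketch below is an assumption, not the PR's implementation: the transport method name `copy_from_remote_to_remote_async` and the `DemoAsyncSource` class are hypothetical.

```python
import asyncio


class DemoAsyncSource:
    """Hypothetical transport exposing an awaitable remote-to-remote copy."""

    def __init__(self):
        self.calls = []

    async def copy_from_remote_to_remote_async(self, destination, remotesource, remotedestination, **kwargs):
        await asyncio.sleep(0)  # placeholder for the actual transfer
        self.calls.append((destination, remotesource, remotedestination, kwargs))


async def copy_from_remote_to_remote_async(
    transportsource, transportdestination, remotesource, remotedestination, **kwargs
):
    """Delegate to the source transport, mirroring the synchronous helper."""
    await transportsource.copy_from_remote_to_remote_async(
        transportdestination, remotesource, remotedestination, **kwargs
    )
```

Because the helper awaits the copy, the event loop is free to run other tasks while the transfer is pending, which is the behavior the PR description says was missing before.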

Member

unkcpz commented Nov 24, 2024

I am about to finish #6627, which I think can benefit the tests here as well. Please hold on a bit for that. I'll try my best to get that one merged by Wednesday.

Contributor Author

khsrali commented Nov 25, 2024

> Why don't you just use Transport instead of BlockingTransport, since you set one to the other? Now you have redundancy. I feel like this API is clear to me.
>
> _BaseTransport -> Transport -> SshTransport
> _BaseTransport -> AsyncTransport -> AsyncSshTransport

I just followed what @giovannipizzi suggested. But I agree this makes more sense, so I'm going to apply these changes.

> Will you make a PR in plumpy so we can do a new release?

Will do once my performance tests are ready.

Contributor Author

khsrali commented Nov 25, 2024

Note to myself:
@danielhollas suggested we apply the changes directly to core.ssh rather than creating a new plugin core.ssh_async.
I should investigate this.

Contributor

@agoscinski agoscinski left a comment

some minor changes

@khsrali khsrali requested review from unkcpz and agoscinski December 11, 2024 17:07
Contributor Author

khsrali commented Dec 16, 2024

@agoscinski
I'd appreciate it if you could give this PR another round of review (I also asked @unkcpz in the office).

It would be nice to have it merged by the end of this week, because when I come back from holidays,
I'll lose half of my memory :-)))

Member

@unkcpz unkcpz left a comment

I gave the implementation a first go. Last time I only checked test_all_plugins.py, where I also made changes.
TBH, I think the PR still requires some changes.
I personally think the huge inheritance pattern is the root of a lot of our headaches, and this adds more of it. Would you mind having a read on

The protocol can fit both sync and async functions, which means AsyncTransport can use the function names without the "_async" suffix. Then, inside "daemon/execmanager.py", if the transport is sync, the call runs in a blocking manner in the coroutine; if it is async, it is scheduled on the event loop.

For example:

remote_user = await transport.whoami() # instead of await transport.whoami_async()

In aiidateam/plumpy#272, the post https://textual.textualize.io/blog/2023/03/15/no-async-async-with-python/ was mentioned. For the transport, I think that idea can work well: use async under the hood while still allowing sync calls.

But anyway, these are more stylistic requests of mine. I think the PR is a great effort to improve performance with async ssh. I think @khsrali already did the most difficult part: understanding the async behavior and benchmarking workflows to prove that the changes are correct. We can do a pair-coding session next year to settle the interface and the stylistic disagreements.
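
The "same method name for sync and async" idea from this review can be sketched as a small dispatcher that awaits a result only when it is awaitable. This is an illustration of the suggestion, not code from the PR; `call_transport` and both demo classes are hypothetical.

```python
import asyncio
import inspect


class SyncDemoTransport:
    def whoami(self) -> str:
        return 'sync_user'


class AsyncDemoTransport:
    async def whoami(self) -> str:
        await asyncio.sleep(0)
        return 'async_user'


async def call_transport(method, *args, **kwargs):
    """Call ``method``; await the result only if it is awaitable."""
    result = method(*args, **kwargs)
    if inspect.isawaitable(result):
        # Async plugin: the event loop can run other tasks meanwhile.
        return await result
    # Sync plugin: the call has already run to completion (blocking).
    return result
```

Engine code could then write `await call_transport(transport.whoami)` for either kind of plugin, at the cost of a sync plugin still blocking the loop for the duration of its call.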



-def always_kill(node: CalcJobNode, transport: Transport) -> str | None:
+def always_kill(node: CalcJobNode, transport: Union['Transport', 'AsyncTransport']) -> str | None:
Member

Suggested change
-def always_kill(node: CalcJobNode, transport: Union['Transport', 'AsyncTransport']) -> str | None:
+def always_kill(node: CalcJobNode, transport: Transport | AsyncTransport) -> str | None:

Member

The same applies in other places, but it is fine to leave it as is; I think we will find time to change them all. So please ignore my comment above.

Contributor Author

yes, at first I also had them as Transport | AsyncTransport
But then Union was suggested and "imposed" by whatever is installed on pre-commit.

     node: CalcJobNode,
-    transport: Transport,
+    transport: Union['Transport', 'AsyncTransport'],
Member

Suggested change
-    transport: Union['Transport', 'AsyncTransport'],
+    transport: 'Transport | AsyncTransport',

Member

If I see a type annotation like this, it usually means defining a protocol as interface is a better solution ;) But I won't bother in this PR.
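
What "defining a protocol as interface" could look like is sketched below with `typing.Protocol`. The protocol name `SupportsWhoami` and the demo classes are hypothetical; nothing here is taken from the PR.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class SupportsWhoami(Protocol):
    """Hypothetical structural interface: anything with a ``whoami`` method."""

    def whoami(self): ...


class DemoSsh:
    def whoami(self) -> str:
        return 'demo_user'


class DemoAsyncSsh:
    async def whoami(self) -> str:
        return 'demo_user'


def describe(transport: SupportsWhoami) -> str:
    # One annotation covers both plugins; no Union of base classes is needed.
    return type(transport).__name__
```

Note that the runtime `isinstance` check of a `runtime_checkable` protocol only verifies that the method exists, not whether it is sync or async; static type checkers are stricter about signatures.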

@@ -20,6 +20,7 @@ classifiers = [
 dependencies = [
   'alembic~=1.2',
   'archive-path~=0.4.2',
+  "asyncssh~=2.19.0",
Member

Is this the version that contains the change you mentioned from asyncssh? I remember you mentioned that asyncssh made some changes to solve the issue with the 4 spikes in network bandwidth usage?

Contributor Author

Yes, this version includes changes on copy behavior.
We discussed it here: ronf/asyncssh#724

@@ -192,7 +192,7 @@ def _get_submit_command(self, submit_script):
         directory.
         IMPORTANT: submit_script should be already escaped.
         """
-        submit_command = f'bash {submit_script} > /dev/null 2>&1 & echo $!'
+        submit_command = f'(bash {submit_script} > /dev/null 2>&1 & echo $!) &'
Member

I don't think this change is related, can you move it to another PR?

Contributor Author

Command execution through the asyncssh library requires this form; otherwise the command will not be awaited. Therefore this change is related to this PR.
I've checked this change, and it has no effect on the expected behavior of command execution with paramiko, so everything is safe.

-__all__ = ('Transport',)
+__all__ = ('AsyncTransport', 'Transport', 'TransportPath')
+
+TransportPath = Union[str, Path, PurePosixPath]
Member

To deal with generic path typing, I think it is better to cover more:

PathLike = Union[AnyStr, os.PathLike]

Inside the functions, I'd rather use pathlib.Path everywhere instead of str. The reason is that we are moving to pathlib.Path in the other modules of the code base.

Contributor Author

Thanks for your suggestion.
str still has to be supported, because there are plugins that call transport methods directly with str paths. For example, in QE there are one or two such calls. I have not checked other plugins.

As for covering more types, I'd suggest we do it when a concrete use case shows up.
AnyStr also includes bytes, which I believe we don't need.
os.PathLike is very inclusive and allows for custom paths. I agree it's nice, but I don't see why we would need that right now.

I defined it that way to be very specific about which paths we support.
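
A short sketch of how such a narrow path type can be normalized at the boundary of a transport method. The `TransportPath` alias is the one from the diff; the helper name `path_to_str` and its explicit type check are assumptions made for illustration.

```python
from pathlib import Path, PurePosixPath
from typing import Union

TransportPath = Union[str, Path, PurePosixPath]


def path_to_str(path: TransportPath) -> str:
    """Convert any supported path type to a plain string."""
    # Hypothetical guard: reject types outside the alias, such as bytes.
    if not isinstance(path, (str, Path, PurePosixPath)):
        raise TypeError(f'unsupported path type: {type(path).__name__}')
    return str(path)
```

This keeps existing plugin calls with `str` paths working while letting new code pass `pathlib` objects, which is the trade-off argued for above.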

     # This will be used for ``Computer.get_minimum_job_poll_interval``
     DEFAULT_MINIMUM_JOB_POLL_INTERVAL = 10

     # This is used as a global default in case subclasses don't redefine this,
     # but this should be redefined in plugins where appropriate
-    _DEFAULT_SAFE_OPEN_INTERVAL = 30.0
+    _DEFAULT_SAFE_OPEN_INTERVAL = 3.0
Member

If this class is going to be the base class for the transport classes, which inherit from it to get the default method implementations and avoid code duplication, will these class attributes be redefined (are these default attributes supposed to be redefined)?

Contributor Author

Ok, that's a good point. Maybe we should take it out of here and put it back in the two public transport classes.

Anyway, I just remembered that I shouldn't have changed that default value here, thanks for reminding me. I'll set it back to 30.0 :)
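
The question about redefining class-level defaults can be illustrated with a small sketch. `BaseDemo` and `LocalDemo` are hypothetical classes; only the attribute and method names come from the snippets quoted in this thread.

```python
class BaseDemo:
    # Global default, used when a subclass does not redefine it.
    _DEFAULT_SAFE_OPEN_INTERVAL = 30.0

    def __init__(self, safe_interval=None):
        # Fall back to the class-level default of the *concrete* class.
        self._safe_open_interval = (
            self._DEFAULT_SAFE_OPEN_INTERVAL if safe_interval is None else safe_interval
        )

    def get_safe_open_interval(self) -> float:
        return self._safe_open_interval


class LocalDemo(BaseDemo):
    # A plugin with cheap connections can lower the default.
    _DEFAULT_SAFE_OPEN_INTERVAL = 0.0
```

In this arrangement a subclass overrides the attribute rather than the method, so the shared getter in the base class keeps working for every plugin.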

Comment on lines +117 to +125
    @abc.abstractmethod
    def open(self):
        """Opens a transport channel
        :raises InvalidOperation: if the transport is already open.
        """

    @abc.abstractmethod
    def close(self):
Member

If this class does not use an ABC metaclass (i.e. it does not serve as the interface), then it makes no sense to use @abc.abstractmethod.
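
The point is easy to check in isolation: `@abc.abstractmethod` is only enforced when the class is built with `abc.ABCMeta` (for example by inheriting from `abc.ABC`). The class names below are hypothetical.

```python
import abc


class PlainBase:
    # No ABCMeta: the decorator only sets a flag and enforces nothing.
    @abc.abstractmethod
    def open(self): ...


class EnforcedBase(abc.ABC):
    # With ABCMeta, instantiation fails until ``open`` is overridden.
    @abc.abstractmethod
    def open(self): ...


def can_instantiate(cls) -> bool:
    """Return True if ``cls()`` succeeds, False if it raises TypeError."""
    try:
        cls()
    except TypeError:
        return False
    return True
```

Here `can_instantiate(PlainBase)` is true while `can_instantiate(EnforcedBase)` is false, so a base class that wants its abstract methods enforced needs the metaclass.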

Member

unkcpz commented Jan 10, 2025

Please be aware that the failed py3.10 test may be caused by the changes in this PR.

aiida_code_installed = <function aiida_code_installed.<locals>.factory at 0x7f624c74edd0>

    def test_get_builder_restart(aiida_code_installed):
        """Test :meth:`aiida.orm.nodes.process.process.ProcessNode.get_builder_restart`."""
        inputs = {
            'code': aiida_code_installed(default_calc_job_plugin='core.arithmetic.add', filepath_executable='/bin/bash'),
            'x': Int(1),
            'y': Int(1),
            'metadata': {'options': {'resources': {'num_machines': 1, 'num_mpiprocs_per_machine': 1}}},
        }
>       _, node = launch.run_get_node(ArithmeticAddCalculation, inputs)

tests/orm/nodes/process/test_process.py:88: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
src/aiida/engine/launch.py:65: in run_get_node
    return runner.run_get_node(process, inputs, **kwargs)
src/aiida/engine/runners.py:291: in run_get_node
    result, node = self._run(process, inputs, **kwargs)
src/aiida/engine/runners.py:261: in _run
    process_inited.execute()
.venv/lib/python3.10/site-packages/plumpy/processes.py:88: in func_wrapper
    return func(self, *args, **kwargs)
.venv/lib/python3.10/site-packages/plumpy/processes.py:1200: in execute
    self.loop.run_until_complete(self.step_until_terminated())
.venv/lib/python3.10/site-packages/nest_asyncio.py:92: in run_until_complete
    self._run_once()
.venv/lib/python3.10/site-packages/nest_asyncio.py:115: in _run_once
    event_list = self._selector.select(timeout)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <selectors.EpollSelector object at 0x7f6244131de0>, timeout = 100.0

    def select(self, timeout=None):
        if timeout is None:
            timeout = -1
        elif timeout <= 0:
            timeout = 0
        else:
            # epoll_wait() has a resolution of 1 millisecond, round away
            # from zero to wait *at least* timeout seconds.
            timeout = math.ceil(timeout * 1e3) * 1e-3
    
        # epoll_wait() expects `maxevents` to be greater than zero;
        # we want to make sure that `select()` can be called when no
        # FD is registered.
        max_ev = max(len(self._fd_to_key), 1)
    
        ready = []
        try:
>           fd_event_list = self._selector.poll(timeout, max_ev)
E           Failed: Timeout >240.0s

/opt/hostedtoolcache/Python/3.10.16/x64/lib/python3.10/selectors.py:469: Failed

It says the event loop is closed. This might be caused by using loop.get_event_loop() in the transport and running coroutines with run_until_complete, which may then close even the main event loop. It requires more investigation.
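
For reference, this is what reusing a closed loop looks like in plain asyncio. The sketch does not reproduce the CI failure and makes no claim about its cause; it only shows the error that `run_until_complete` raises once a loop has been closed. All names are hypothetical.

```python
import asyncio


async def ping() -> str:
    return 'pong'


def closed_loop_error() -> str:
    """Close a loop, try to reuse it, and return the resulting error message."""
    loop = asyncio.new_event_loop()
    assert loop.run_until_complete(ping()) == 'pong'  # works while the loop is open
    loop.close()
    coro = ping()
    try:
        loop.run_until_complete(coro)
    except RuntimeError as exc:
        return str(exc)
    finally:
        coro.close()  # avoid a "never awaited" warning
    return ''
```

If some component closes a loop that other code still holds a reference to, later calls on that loop fail in this way.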

Contributor Author

khsrali commented Jan 10, 2025

Hi @unkcpz , yes they seem to be flaky. aiida_profile_clean would be nice, but which ones were the ones failing?

Edit: haha, now they pass, lol

@agoscinski
Contributor

It might work because the timeout is beyond the total testing time (set to 40 minutes now), so no test is killed by the timeout. It might be because the signal method of pytest-timeout is used.

> If the system supports the SIGALRM signal the signal method will be used by default [...]
> The main issue to look out for with this method is that it may interfere with the code under test. If the code under test uses SIGALRM itself things will go wrong and you will have to choose the thread method.

https://pypi.org/project/pytest-timeout/

Furthermore, a chatbot suggested the following:

| Issue | Description |
|---|---|
| Signal handler not being called | If the event loop is not actively yielding, SIGALRM might not be handled. |
| Interrupted system calls | SIGALRM interrupts blocking system calls, causing asyncio tasks to fail with OSError. |
| Conflict with asyncio.run() | asyncio.run() closes the loop automatically, which makes it tricky to use custom signal handlers. |

I was not able to verify these statements, but they are easy to test by using thread as the timeout method, that is, by setting timeout_method = thread in pyproject.toml.
