Fix a race condition in TransportQueue, and set semaphore on `exec_command_wait_async` #7144

khsrali · 2025-12-09T15:28:01Z

Fix #7119
This PR:

Fixes an important race condition in TransportQueue.
In addition, adds semaphore control to exec_command_wait_async to limit failures.

codecov · 2025-12-09T15:29:46Z

Codecov Report

❌ Patch coverage is 53.84615% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 79.60%. Comparing base (cc0bb48) to head (bf24acc).

Files with missing lines	Patch %	Lines
src/aiida/transports/plugins/ssh_async.py	54.55%	5 Missing ⚠️
src/aiida/schedulers/scheduler.py	0.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #7144      +/-   ##
==========================================
- Coverage   79.61%   79.60%   -0.00%     
==========================================
  Files         566      566              
  Lines       43572    43580       +8     
==========================================
+ Hits        34684    34687       +3     
- Misses       8888     8893       +5

khsrali · 2025-12-09T16:14:11Z

FAILED tests/engine/test_transport.py::TestTransportQueue::test_new_request_during_close_gets_fresh_transport - AssertionError: Each sequential request should get a fresh transport

might be flaky

EDIT:
The failure was caused by Python reusing memory. It's fixed now.
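The comment does not say how the test compared transports, so the following is only a guess at the mechanism. A common way memory reuse produces a flaky "fresh object" assertion is comparing `id()` values of objects that are no longer alive. The `Transport` class and both helper functions below are hypothetical and are not taken from the AiiDA test suite.

```python
class Transport:
    """Hypothetical placeholder object; any Python class behaves the same way."""


def ids_without_references(n):
    # Each object becomes garbage right after id() is taken, so the
    # interpreter is free to give the same address to the next object.
    # The returned ids MAY therefore contain repeats.
    return [id(Transport()) for _ in range(n)]


def objects_with_references(n):
    # Holding the objects keeps them alive at the same time, and objects
    # with overlapping lifetimes are guaranteed to have distinct ids.
    return [Transport() for _ in range(n)]


kept = objects_with_references(3)
all_distinct = len({id(obj) for obj in kept}) == len(kept)
```

Under this assumption, a test that wants to check freshness is more robust when it keeps references to the objects (or compares them with `is not`) instead of comparing stored `id()` values.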

Copilot

Pull request overview

This PR fixes a critical race condition in TransportQueue and adds semaphore control to exec_command_wait_async so that the SSH connection is not overwhelmed. The race condition occurred when closing an async transport yielded to the event loop (via nest_asyncio in plumpy), allowing new tasks to receive references to transports that were being closed. The fix removes the transport request from the dictionary before calling close(), which prevents this interleaving.

Moved _transport_requests.pop() to execute before transport.close() in TransportQueue to fix race condition
Added semaphore wrapping to exec_command_wait_async to limit concurrent SSH subchannels
Added comprehensive regression tests for both the race condition and semaphore behavior
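The ordering change can be illustrated with a minimal sketch. This is not the AiiDA implementation: `FakeTransport`, `TinyQueue`, and `newcomer_gets_closing_transport` are hypothetical names. A plain `await` stands in for the yield that happens through nest_asyncio in the real code.

```python
import asyncio


class FakeTransport:
    """Hypothetical stand-in for an async transport; not AiiDA's class."""

    def __init__(self):
        self.closed = False

    async def close(self):
        # Closing yields to the event loop; other tasks can run here.
        await asyncio.sleep(0)
        self.closed = True


class TinyQueue:
    """Hypothetical, heavily simplified queue keyed by a computer label."""

    def __init__(self):
        self._transport_requests = {}

    def request(self, key):
        # Reuse the registered transport if there is one, else create one.
        if key not in self._transport_requests:
            self._transport_requests[key] = FakeTransport()
        return self._transport_requests[key]

    async def release(self, key, pop_first):
        transport = self._transport_requests[key]
        if pop_first:
            # Fixed ordering: unregister first, then close.
            self._transport_requests.pop(key, None)
            await transport.close()
        else:
            # Buggy ordering: the entry stays visible while close() runs.
            await transport.close()
            self._transport_requests.pop(key, None)


async def newcomer_gets_closing_transport(pop_first):
    queue = TinyQueue()
    old = queue.request("computer-1")
    closer = asyncio.create_task(queue.release("computer-1", pop_first))
    await asyncio.sleep(0)  # let release() start and suspend inside close()
    new = queue.request("computer-1")  # a second request arrives mid-close
    await closer
    return new is old
```

With `pop_first=False` the second request receives the transport that is being closed. With `pop_first=True` it receives a fresh one.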

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

File	Description
src/aiida/engine/transports.py	Moved transport request removal before close() call to fix race condition where new tasks could get closing transports
src/aiida/transports/plugins/ssh_async.py	Added semaphore control to exec_command_wait_async and error handling to close_async
tests/engine/test_transport.py	Added regression tests verifying transport request removal ordering and fresh transport creation
tests/transports/test_asyncssh_plugin.py	Restructured into test class with tests for semaphore release after errors and concurrent operation limiting
src/aiida/schedulers/scheduler.py	Removed redundant transport context manager (transport already opened per docstring)
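The semaphore idea described for `ssh_async.py` can be sketched as follows. This is a toy model and not the plugin's code: `FakeAsyncSsh`, `_run`, the `max_concurrent` parameter, and the returned tuple are assumptions made for illustration. Only the method name `exec_command_wait_async` comes from the PR.

```python
import asyncio


class FakeAsyncSsh:
    """Hypothetical stand-in; AiiDA's real plugin is not reproduced here."""

    def __init__(self, max_concurrent):
        self._semaphore = asyncio.Semaphore(max_concurrent)
        self.active = 0  # commands currently running
        self.peak = 0    # highest number of commands seen running at once

    async def _run(self, command):
        # Placeholder for the real remote call.
        self.active += 1
        self.peak = max(self.peak, self.active)
        try:
            await asyncio.sleep(0.01)
            if command == "fail":
                raise RuntimeError("simulated remote failure")
            return 0, f"ran {command}", ""
        finally:
            self.active -= 1

    async def exec_command_wait_async(self, command):
        # `async with` releases the slot even when _run raises.
        async with self._semaphore:
            return await self._run(command)


async def demo():
    ssh = FakeAsyncSsh(max_concurrent=2)
    commands = ["a", "b", "fail", "c", "d"]
    results = await asyncio.gather(
        *(ssh.exec_command_wait_async(c) for c in commands),
        return_exceptions=True,
    )
    return ssh.peak, results, ssh._semaphore.locked()
```

Using the semaphore as an async context manager means a failing command cannot leak a slot. Releasing the semaphore after errors is the behavior the new tests in `tests/transports/test_asyncssh_plugin.py` are described as checking.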

agoscinski

While the first bug could also happen without nest_asyncio, using nest_asyncio makes such bugs more likely, since we now have async calls inside functions that are completely synchronous, so nothing marks them as places where concurrent execution can occur.

I think the PR and tests are really good, just minor comments to improve the understanding of the bug. As discussed, the PR will be split into two commits since two bugs are fixed.

agoscinski · 2025-12-10T14:02:53Z

src/aiida/engine/transports.py

+                # 1. close() is called, which for AsyncTransport uses run_until_complete()
+                # 2. With nest_asyncio (used by plumpy), this can yield to the event loop
+                # 3. Another task might enter and get the same transport_request
+                # 4. That task tries to use the transport that's being closed -> error


I tried to improve the message by being more verbose. I'm not sure about it, so adapt it as you wish.

Suggested change

-                # 1. close() is called, which for AsyncTransport uses run_until_complete()
-                # 2. With nest_asyncio (used by plumpy), this can yield to the event loop
-                # 3. Another task might enter and get the same transport_request
-                # 4. That task tries to use the transport that's being closed -> error
+                # 1. close() is called, which for AsyncTransport uses run_until_complete(close_async)
+                # 2. With nest_asyncio (used by plumpy), this call yields back to the event loop
+                # 3. The event loop schedules close_async, then continues running other tasks, for example one that requests the transport which is scheduled to be closed
+                # 4. The task now using the transport awaits during some operation; the close_async task then closes the transport while it is still in use -> error

khsrali requested a review from GeigerJ2 December 9, 2025 15:28

khsrali requested a review from Copilot December 9, 2025 16:14

Copilot started reviewing on behalf of khsrali December 9, 2025 16:14

Copilot AI reviewed Dec 9, 2025

khsrali requested review from danielhollas and removed request for GeigerJ2 December 10, 2025 08:09

khsrali added 2 commits December 10, 2025 09:30

efa1aee	FIX
bf24acc	fix flaky test

khsrali force-pushed the asynctransport_open branch from 6eeae1b to bf24acc Compare December 10, 2025 08:30

agoscinski reviewed Dec 10, 2025
