Implement dynamic safe_interval for TransportQueue to reduce wait times #7145

Copilot · 2025-12-09T15:36:08Z

Quick calculations (e.g., 10s runtime) experienced 120s overhead due to fixed 30s safe_interval waits before each transport operation (upload, submit, retrieve, final retrieve). This resulted in 4×30s = 120s of unnecessary waiting.

Changes

Track last close time per authinfo in TransportQueue._last_close_times dict
Calculate dynamic wait intervals:
- First request: immediate (wait = 0)
- Subsequent requests: wait = max(0, safe_interval - time_since_close)
Record close time when transport is closed to enable next calculation

Implementation

# In TransportQueue.request_transport():
last_close_time = self._last_close_times.get(authinfo.pk, None)
current_time = time.time()

if last_close_time is None:
    wait_interval = 0  # First request: immediate
else:
    time_since_last_close = current_time - last_close_time
    if time_since_last_close >= safe_open_interval:
        wait_interval = 0  # Enough time passed: immediate
    else:
        wait_interval = safe_open_interval - time_since_last_close  # Partial wait

Impact

For a 10s calculation with default safe_interval=30:

Before: 130s total (10s execution + 4×30s waits)
After: ~90s total (10s execution + reduced waits)
Improvement: 30% faster, ~40s saved

State is maintained per daemon worker (not in database) to avoid multi-worker synchronization overhead. Backward compatible: existing behavior preserved when requests occur within safe_interval.

Original prompt

This section details on the original issue you should resolve

<issue_title>Make safe_interval more dynamic for quick transport tasks</issue_title>
<issue_description>As realized together with @giovannipizzi while debugging things for our new cluster at PSI: When submitting a simple calculation (execution takes about 10s) for testing purposes, with the default safe_interval=30 in the Computer configuration, one has to wait an additional 90s until the job is done (30s for the upload, submit, and retrieve tasks, each). This is to be expected, of course, and one could just reduce the safe_interval (albeit increasing the risk of SSH overloads).

However, the upload task in that case is truly the first Transport task that is being executed by the daemon worker, so it could, in principle, enter immediately (the same if jobs were run previously, but longer ago than the safe_interval). I locally implemented a first version (thanks to @giovannipizzi's input) that does this, by adding a last_close_time attribute (currently added to the authinfo metadata for a first PoC). In the request_transport method of the TransportQueue, the time difference between the current time and the last_close_time is then checked, and if it is larger than safe_interval, the Transport is opened immediately via:

open_callback_handle = self._loop.call_later(0, do_open, context=contextvars.Context())  # or use 1 for safety?

bypassing the safe_interval (or safe_open_interval as it is called in transports.py).

In addition, the waiting times for the submit and retrieve tasks could also be reduced. It seems like currently, the safe_interval is imposed on all of them, even if they finish very quickly (I assume as all open a transport connection via SSH). So we were thinking if it's possible to make this a bit more sophisticated, e.g. by adding special transport requests, that could make use of the open transport, and keep a transport of which the task has finished open for a short time longer (also quickly discussed with @mbercx). Of course, one would still need to make sure SSH doesn't get overloaded, the implementation works with heavy loads (not just individual testing calculations), and one would also have to consider how this all works with multiple daemon workers. Again with @giovannipizzi, I had a quick look, but it seems like the implementation would be a bit more involved. So wondering what the others think, if this is feasible and worth investigating more time into. Pinging @khsrali who has looked a bit more into transports.</issue_description>

Comments on the Issue (you are @copilot in this section)

@giovannipizzi Thanks for the nice write-up @GeigerJ2 ! Just some minor additional comments/clarifications

actually even for a millisecond run, the time to wait is 120s (or generally 4 times the safe interval), rather than 3x (90s):
- [30s] initial wait to upload
- [30s] time to wait to submit
- [30s] time for the first check of the queue (that, for jobs do not get queued, running immediately and run for < 30s, would already return that the job finished)
- [30s] to retrieve the calculations and set the calculation as finished
(and I guess one adds another 30s if there is also stashing involved)
The initial implementation keeps the time in the metadata of the authinfo, but already while discussing with Julian, we realized it's better not to put it there, as this is shared by all daemon workers, and could lead to wrong results, collisions and exceptions when multiple write the same DB row, etc. - better to just keep in another local attribute self.last_close_time, parallel to self._transport_requests. On the other hand, I just realize that if you are running from a local interpreter, and maybe submitting run() from a bash "for" loop (e.g. of verdi run commands), this might bypass the limit as all of them will think that nothing was submitted before. But probably this is OK with the current implementation? Fixing it properly would require making the whole concept of a safe_interval not specific to a worker, but global to a AiiDA profile.
In the implementation discussed above, in addition to setting the first parameter of call_later to zero if more than safe_interval seconds passed from the last call, I would also set the waiting time to the difference current_time - last_close_time, so e.g. you only wait 10 seconds if you closed the transport 20 seconds ago.
The points above solve the waiting of the first 30 seconds. For the other 3x30 seconds, the idea is that probably in this case the connection was just closed less than a second before, i.e. the time for AiiDA to change state. If we could keep the connection open for a configurable time after the last command (say with a default of 5 or 10 seconds), a full single submission could go down to probably just < ...

Fixes Make safe_interval more dynamic for quick transport tasks #6544

💬 We'd love your input! Share your thoughts on Copilot coding agent in our 2 minute survey.

- Add last_close_times tracking dict to TransportQueue - Calculate wait_interval based on time since last transport close - Open immediately if no previous close or safe_interval has passed - Wait only remaining time if some time has already passed - Update last_close_time when transport is closed - Add test for dynamic safe interval behavior Co-authored-by: khsrali <[email protected]>

- Remove trailing whitespace from docstring - Move time import to top level - Remove duplicate import statements in tests - Update class docstring to explain dynamic wait behavior Co-authored-by: khsrali <[email protected]>

codecov · 2025-12-09T16:45:26Z

Codecov Report

❌ Patch coverage is 16.66667% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 29.18%. Comparing base (a770d7a) to head (c2e43e1).

Files with missing lines	Patch %	Lines
src/aiida/engine/transports.py	16.67%	10 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff             @@
##             main    #7145       +/-   ##
===========================================
- Coverage   79.61%   29.18%   -50.42%     
===========================================
  Files         566      566               
  Lines       43572    43537       -35     
===========================================
- Hits        34684    12703    -21981     
- Misses       8888    30834    +21946

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copilot

Pull request overview

This PR implements dynamic safe_interval calculation for the TransportQueue to reduce unnecessary wait times before transport operations. The enhancement tracks when each transport connection was last closed and calculates wait intervals dynamically, opening connections immediately when sufficient time has passed since the last close, rather than imposing a fixed wait on every request.

Key Changes:

Added per-authinfo tracking of last close times to enable dynamic wait calculation
Modified wait interval logic to open immediately for first requests or when safe_interval has elapsed
Added comprehensive test coverage for the new dynamic behavior

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File	Description
src/aiida/engine/transports.py	Implements dynamic safe_interval by tracking last close times in `_last_close_times` dict and calculating wait intervals based on time elapsed since last close
tests/engine/test_transport.py	Adds `test_dynamic_safe_interval` to verify immediate opening on first request and after safe_interval has elapsed; moves `import time` to module level

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2025-12-12T10:42:58Z

src/aiida/engine/transports.py

        """:param loop: An asyncio event, will use `asyncio.get_event_loop()` if not supplied"""
        self._loop = loop if loop is not None else asyncio.get_event_loop()
        self._transport_requests: Dict[Hashable, TransportRequest] = {}
+        self._last_close_times: Dict[Hashable, float] = {}


The _last_close_times dictionary will grow unbounded as new authinfo instances are used over time. This could lead to a memory leak in long-running daemon processes that work with many different computers/authinfo objects. Consider implementing a cleanup mechanism, such as removing entries after they become stale (e.g., older than a configurable threshold like 24 hours), or using an LRU cache with a maximum size to automatically evict old entries.

Copilot · 2025-12-12T10:42:59Z

src/aiida/engine/transports.py

+            # Calculate the actual wait time based on when the transport was last closed
+            last_close_time = self._last_close_times.get(authinfo.pk, None)
+            current_time = time.time()
+
+            if last_close_time is None:
+                # Never opened before, open immediately
+                wait_interval = 0
+            else:
+                time_since_last_close = current_time - last_close_time
+                if time_since_last_close >= safe_open_interval:
+                    # Enough time has passed, open immediately
+                    wait_interval = 0
+                else:
+                    # Not enough time has passed, wait for the remaining time
+                    wait_interval = safe_open_interval - time_since_last_close


The dynamic wait interval calculation causes the existing test_safe_interval test to fail. The old test expects each successive transport request to wait for the full safe_interval, but with this implementation, the first request opens immediately (wait_interval = 0). This breaks the test's assumption that iteration 0 should take at least safe_interval time. The test_safe_interval test was designed to verify that the safe_interval is respected between successive requests, and this change fundamentally alters that behavior. Consider whether the test should be updated to reflect the new behavior, or if there's a way to make both behaviors work correctly.

Copilot · 2025-12-12T10:42:59Z

tests/engine/test_transport.py

+                    trans = await request
+                    time_elapsed = time.time() - time_start
+                    # Should open immediately or very quickly
+                    assert time_elapsed < 0.1, f'First transport took too long to open: {time_elapsed}s'


The hard-coded time threshold of 0.1 seconds for asserting immediate opening could be flaky on heavily loaded systems or slow CI runners. Consider using a more generous threshold (e.g., 0.2 seconds) or making it configurable to improve test reliability across different environments.

Copilot · 2025-12-12T10:42:59Z

tests/engine/test_transport.py

+                    trans = await request
+                    time_elapsed = time.time() - time_start
+                    # Should wait approximately the safe interval since not enough time has passed
+                    assert time_elapsed >= 0.4, f'Second transport opened too quickly: {time_elapsed}s'


The time threshold of 0.4 seconds (for a 0.5s safe interval) provides only a 0.1s margin, which could lead to flaky test failures on slow systems. Consider using a more generous lower bound (e.g., 0.35 seconds) to improve test reliability.

Suggested change

assert time_elapsed >= 0.4, f'Second transport opened too quickly: {time_elapsed}s'

assert time_elapsed >= 0.35, f'Second transport opened too quickly: {time_elapsed}s'

Copilot · 2025-12-12T10:43:00Z

tests/engine/test_transport.py

+                    trans = await request
+                    time_elapsed = time.time() - time_start
+                    # Should open immediately since safe interval has passed
+                    assert time_elapsed < 0.1, f'Third transport took too long to open: {time_elapsed}s'


The hard-coded time threshold of 0.1 seconds for asserting immediate opening could be flaky on heavily loaded systems or slow CI runners. Consider using a more generous threshold or making it configurable to improve test reliability.

Initial plan

797ea8d

Copilot AI assigned Copilot and khsrali Dec 9, 2025

Copilot started work on behalf of khsrali December 9, 2025 15:36 View session

Copilot AI and others added 2 commits December 9, 2025 15:43

Fix linting issues and improve documentation

c2e43e1

- Remove trailing whitespace from docstring - Move time import to top level - Remove duplicate import statements in tests - Update class docstring to explain dynamic wait behavior Co-authored-by: khsrali <[email protected]>

Copilot AI changed the title ~~[WIP] Make safe_interval more dynamic for quick transport tasks~~ Implement dynamic safe_interval for TransportQueue to reduce wait times Dec 9, 2025

Copilot AI requested a review from khsrali December 9, 2025 15:55

Copilot finished work on behalf of khsrali December 9, 2025 15:55

khsrali requested review from GeigerJ2 and Copilot December 12, 2025 10:38

Copilot started reviewing on behalf of khsrali December 12, 2025 10:38 View session

Copilot AI reviewed Dec 12, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Implement dynamic safe_interval for TransportQueue to reduce wait times #7145

Implement dynamic safe_interval for TransportQueue to reduce wait times #7145

Copilot AI commented Dec 9, 2025 •

edited

Loading

Uh oh!

codecov bot commented Dec 9, 2025

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Dec 12, 2025

Uh oh!

Copilot AI Dec 12, 2025

Uh oh!

Copilot AI Dec 12, 2025

Uh oh!

Copilot AI Dec 12, 2025

Uh oh!

Copilot AI Dec 12, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	assert time_elapsed >= 0.4, f'Second transport opened too quickly: {time_elapsed}s'
	assert time_elapsed >= 0.35, f'Second transport opened too quickly: {time_elapsed}s'

Implement dynamic safe_interval for TransportQueue to reduce wait times #7145

Are you sure you want to change the base?

Implement dynamic safe_interval for TransportQueue to reduce wait times #7145

Conversation

Copilot AI commented Dec 9, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Changes

Implementation

Impact

Comments on the Issue (you are @copilot in this section)

Uh oh!

codecov bot commented Dec 9, 2025

Codecov Report

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Dec 12, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 12, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 12, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 12, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 12, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Copilot AI commented Dec 9, 2025 •

edited

Loading