fix: add retry logic for tunnel reconnection in jmp shell proxy #679
ambient-code[bot] wants to merge 3 commits into
When the tunnel between the local proxy and the jumpstarter-router drops during a jmp shell session, subsequent j commands would time out with SETTINGS frame timeout errors because:

1. The Dial call to the controller could fail with transient UNAVAILABLE errors, but there was no retry logic for these (only FAILED_PRECONDITION was retried).
2. The connect_router_stream could hang indefinitely trying to establish an HTTP/2 connection to an unreachable router endpoint, with no timeout on the channel readiness check.

This commit fixes both issues:

- Adds retry with exponential backoff for transient gRPC errors (UNAVAILABLE, RESOURCE_EXHAUSTED, ABORTED, INTERNAL) in the Dial call within handle_async.
- Adds retry with exponential backoff for the router connection establishment, including re-dialing to get fresh router tokens when retrying.
- Adds a channel_ready() timeout (10s) in connect_router_stream so that connections to unreachable routers fail fast instead of hanging on the HTTP/2 SETTINGS frame exchange.

Fixes #638

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
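The retry behaviour described in this commit message can be sketched roughly as follows. This is a minimal, stdlib-only illustration and not the code from the PR: `Code`, `RpcError`, `retry_transient`, and the delay parameters are hypothetical stand-ins (the real code would inspect gRPC status codes), and the pre-existing FAILED_PRECONDITION retry is left out for brevity.

```python
import asyncio
from enum import Enum


class Code(Enum):
    """Hypothetical stand-in for gRPC status codes."""
    UNAVAILABLE = "unavailable"
    RESOURCE_EXHAUSTED = "resource_exhausted"
    ABORTED = "aborted"
    INTERNAL = "internal"
    PERMISSION_DENIED = "permission_denied"


class RpcError(Exception):
    """Hypothetical stand-in for a gRPC error carrying a status code."""

    def __init__(self, code):
        super().__init__(code.name)
        self.code = code


# The transient codes listed in the commit message.
TRANSIENT = {Code.UNAVAILABLE, Code.RESOURCE_EXHAUSTED, Code.ABORTED, Code.INTERNAL}


async def retry_transient(call, *, attempts=5, base_delay=0.1, max_delay=2.0):
    """Await call() until it succeeds, retrying only transient errors."""
    for attempt in range(attempts):
        try:
            return await call()
        except RpcError as exc:
            last_attempt = attempt == attempts - 1
            if exc.code not in TRANSIENT or last_attempt:
                raise
            # Delay doubles on each attempt (base, 2*base, 4*base, ...), capped.
            await asyncio.sleep(min(base_delay * 2 ** attempt, max_delay))
```

Non-transient errors propagate immediately, so a permission problem is not masked by several seconds of pointless retries.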
CI is failing due to diff coverage being at 8% (threshold is 80%). The new retry logic in Adding tests for:
This brings diff coverage to 100%.
Add unit tests covering the new retry logic in handle_async (lease.py) and the channel_ready timeout in connect_router_stream (streams.py) to satisfy the 80% diff coverage requirement.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
    router_attempt = 0
    while True:
        try:
            async with connect_router_stream(
This code is very entangled: we are repeating the Dial call again down here, duplicating a lot of code. This needs a refactor.
Come on, Claude, you can do a much better job! :)
Agreed, the duplicated Dial in the router retry block is messy. Refactoring into a single unified retry loop now.
Agreed with the review feedback. The duplicated Dial call in the router retry block is messy and needs refactoring. Fix plan:
Working on it now.
Extract _dial_and_connect() to perform Dial + router connection as a single atomic operation. This eliminates the duplicated Dial call that was in the separate router retry block, addressing the code review feedback about entangled and repeated code.

The single retry loop in handle_async now retries the full _dial_and_connect() on transient errors, which naturally handles both Dial failures and router connection failures with the same backoff logic and always gets a fresh router token on retry.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
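A rough sketch of the shape this refactor describes, assuming simplified signatures: `dial` and `connect` are hypothetical callables standing in for the controller Dial RPC and the router connection, `TransientError` is a hypothetical marker for retryable failures, and the deadline mirrors the dial_timeout bound mentioned in the PR summary. The real code works with an async context manager (connect_router_stream), which is flattened here for brevity.

```python
import asyncio
import time


class TransientError(Exception):
    """Hypothetical marker for a retryable failure (e.g. UNAVAILABLE)."""


async def dial_and_connect(dial, connect):
    """Dial for a fresh router token, then connect with it, as one unit."""
    token = await dial()
    return await connect(token)


async def connect_with_retry(dial, connect, *, dial_timeout=30.0, base_delay=0.5):
    """Retry the whole dial-and-connect step until the deadline passes."""
    deadline = time.monotonic() + dial_timeout
    delay = base_delay
    while True:
        try:
            # One loop covers both failure modes: a failed Dial and a
            # failed router connection each trigger a full re-dial.
            return await dial_and_connect(dial, connect)
        except TransientError:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                raise
            # Never sleep past the deadline; double the delay each round.
            await asyncio.sleep(min(delay, remaining))
            delay *= 2
```

Because the token is obtained inside the retried unit, every attempt connects with a fresh token, which is what removes the need for a second Dial call in a separate router retry block.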
Review Status Update

The refactoring requested by @mangelajo has been completed in commit 534f7b6. The duplicated Dial code has been eliminated by:
CI Status

All checks are passing except for a flaky e2e test on ARM64:
The flaky test is in the hooks execution path, while this PR modifies the tunnel connection/retry logic. These are separate concerns. Would you like me to:
Summary
- Add retry with exponential backoff for transient gRPC errors in Lease.handle_async, so that when the router tunnel drops, new j commands retry connecting instead of failing immediately.
- Extract _dial_and_connect() to perform the Dial and router connection as a single atomic operation, keeping the retry logic in one unified loop (no duplicated Dial calls).
- Add a channel_ready() timeout (10s default) in connect_router_stream so connections to unreachable routers fail fast with UNAVAILABLE instead of hanging indefinitely on the HTTP/2 SETTINGS frame exchange.

All retries are bounded by the existing dial_timeout (default 30s).

Fixes #638
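The readiness timeout can be illustrated with a small sketch. It assumes only that the channel object exposes an awaitable `channel_ready()` method, as `grpc.aio.Channel` does; `wait_until_ready`, `RouterUnavailable`, and `NeverReadyChannel` are hypothetical names used here so the example runs without a gRPC installation, and the PR's own code may structure this differently.

```python
import asyncio


class RouterUnavailable(Exception):
    """Hypothetical stand-in for a gRPC UNAVAILABLE error."""


async def wait_until_ready(channel, timeout=10.0):
    """Wait for channel.channel_ready(), failing fast after `timeout` seconds."""
    try:
        await asyncio.wait_for(channel.channel_ready(), timeout=timeout)
    except asyncio.TimeoutError:
        # Convert the timeout into an error the retry loop treats as transient.
        raise RouterUnavailable(f"router not ready after {timeout}s") from None


class NeverReadyChannel:
    """Fake channel whose readiness check never completes (unreachable router)."""

    async def channel_ready(self):
        await asyncio.Event().wait()
```

Calling `asyncio.run(wait_until_ready(NeverReadyChannel(), timeout=0.01))` raises `RouterUnavailable` after the timeout instead of hanging, which lets an outer retry loop decide whether to try again.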
Test plan
- make lint
- make pkg-test-jumpstarter, make pkg-test-jumpstarter-cli
- jmp shell, kill the router/network, verify j commands retry and recover when the network comes back

🤖 Generated with Claude Code