fix: retry Dial and StatusMonitor poll on transient UNAVAILABLE #606
raballew wants to merge 6 commits into
Conversation
📝 Walkthrough

This PR addresses premature lease termination when exporters briefly restart by introducing a consecutive-retry threshold for transient gRPC UNAVAILABLE errors. Instead of immediately marking the connection as lost on the first failure, the client now tolerates up to 10 consecutive UNAVAILABLE responses before declaring the connection dead.

Changes: transient UNAVAILABLE retry handling.
Sequence diagram:

```mermaid
sequenceDiagram
    participant PollLoop as StatusMonitor._poll_loop
    participant GetStatus as GetStatus RPC
    participant Error as UNAVAILABLE Error
    participant ConnLoss as connection_lost Flag
    PollLoop->>GetStatus: call GetStatus()
    GetStatus->>Error: raises UNAVAILABLE
    Error-->>PollLoop: error response
    PollLoop->>PollLoop: increment unavailable_retries
    alt retries < max (10)
        PollLoop->>PollLoop: log warning/debug
        PollLoop->>PollLoop: retry with backoff
    else retries >= max
        PollLoop->>ConnLoss: set connection_lost = True
        PollLoop->>PollLoop: stop polling loop
    end
    Note over PollLoop: On next successful GetStatus,<br/>reset unavailable_retries counter
```
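The retry behavior described in the walkthrough and diagram above can be sketched as a small, dependency-free poll loop. All names here (`run_poll_loop`, `Unavailable`, `scripted`) are illustrative stand-ins for this sketch, not the actual jumpstarter API:

```python
MAX_UNAVAILABLE_RETRIES = 10  # threshold described in this PR


class Unavailable(Exception):
    """Stand-in for a gRPC UNAVAILABLE status error in this sketch."""


def run_poll_loop(get_status, sleep=lambda s: None, poll_interval=1.0):
    """Poll until done, tolerating brief exporter restarts.

    The counter tracks only *consecutive* UNAVAILABLE results: a
    success, or any other kind of error, resets it to zero.
    """
    unavailable_retries = 0
    while True:
        try:
            status = get_status()
        except Unavailable:
            unavailable_retries += 1
            if unavailable_retries >= MAX_UNAVAILABLE_RETRIES:
                return "connection_lost"  # only sustained failure gives up
        except Exception:
            unavailable_retries = 0  # non-UNAVAILABLE errors do not count
        else:
            unavailable_retries = 0  # success resets the counter
            if status == "done":
                return "done"
        sleep(poll_interval)  # delay between attempts


def scripted(responses):
    """Test helper: turn a list of results/exceptions into a callable."""
    it = iter(responses)

    def get_status():
        r = next(it)
        if isinstance(r, Exception):
            raise r
        return r

    return get_status
```

Nine failures followed by a success keep polling alive; only ten UNAVAILABLE results in a row mark the connection as lost.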
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 4 passed, ❌ 1 failed (warning)
Reviewer comment on the timeout log level in the Dial retry path:

```python
if e.code() == grpc.StatusCode.UNAVAILABLE:
    remaining = deadline - time.monotonic()
    if remaining <= 0:
        logger.debug(
```

Suggested change:

```diff
-        logger.debug(
+        logger.warning(
```

Maybe even a warning?
Force-pushed: 5cd25c3 → 519bcf2
Instead of immediately marking the connection as permanently lost on a single gRPC UNAVAILABLE error, the poll loop now retries up to 10 times (mirroring the existing DEADLINE_EXCEEDED retry pattern). This prevents premature lease termination when an exporter briefly restarts. The retry counter resets on any successful GetStatus response. Only sustained failures (10+ consecutive UNAVAILABLE) mark connection_lost.

Fixes jumpstarter-dev#242

Generated-By: Forge/20260416_202053_681470_11575359_i242
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When the exporter briefly restarts, the Dial RPC may fail with UNAVAILABLE. Instead of immediately giving up, retry with exponential backoff bounded by the existing dial_timeout parameter. This mirrors the existing FAILED_PRECONDITION retry logic.

Fixes jumpstarter-dev#242

Generated-By: Forge/20260416_202053_681470_11575359_i242
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
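The dial-side pattern in the commit above, exponential backoff bounded by an overall deadline, can be sketched roughly as follows. `retry_with_backoff`, its parameters, and the injected `clock`/`sleep` hooks are illustrative for this sketch, not the actual jumpstarter Dial code:

```python
import time


def retry_with_backoff(call, dial_timeout=10.0, base_delay=0.1, max_delay=2.0,
                       retryable=lambda e: True,
                       clock=time.monotonic, sleep=time.sleep):
    """Retry `call` on transient errors until an overall deadline passes.

    Backoff between attempts grows exponentially (capped at max_delay),
    but the whole retry budget is bounded by dial_timeout.
    """
    deadline = clock() + dial_timeout
    delay = base_delay
    while True:
        try:
            return call()
        except Exception as e:
            if not retryable(e):
                raise  # non-transient errors surface immediately
            remaining = deadline - clock()
            if remaining <= 0:
                raise  # retry budget exhausted: propagate the last error
            sleep(min(delay, remaining))  # never sleep past the deadline
            delay = min(delay * 2, max_delay)  # exponential backoff, capped
```

Injecting `clock` and `sleep` keeps the sketch testable without real waiting; the production code would use the defaults.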
…ction in test

Make the UNAVAILABLE timeout in handle_async raise instead of returning silently, matching the FAILED_PRECONDITION timeout behavior. Add an assertion that connect_router_stream is called after a successful UNAVAILABLE retry.

Generated-By: Forge/20260416_202053_681470_11575359_i242
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ll loop

Remove the `continue` statement from the UNAVAILABLE handler in _poll_loop so it falls through to the standard sleep block. Previously, UNAVAILABLE retries had no delay between attempts, so 10 retries could be exhausted in under 1 ms, far too fast to tolerate an exporter restart that takes several seconds. Now retries use the poll_interval sleep, making the 10-retry threshold span a meaningful duration.

Generated-By: Forge/20260416_202053_681470_11575359_i242
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
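The timing argument in that commit is easy to check with back-of-envelope arithmetic. The per-attempt cost and poll interval below are assumed round numbers for illustration, not values read from the StatusMonitor code:

```python
# Rough retry-window arithmetic for the commit message above.
max_retries = 10
failed_rpc_cost_s = 0.0001   # a refused connection fails almost instantly
poll_interval_s = 1.0        # assumed polling interval

# With `continue` skipping the sleep, the whole budget burns in ~1 ms:
window_without_sleep_s = max_retries * failed_rpc_cost_s

# Falling through to the standard sleep stretches it to ~10 s:
window_with_sleep_s = max_retries * poll_interval_s
```

Under these assumptions the no-sleep window is about 1 ms, far shorter than an exporter restart, while sleeping between attempts makes the threshold span roughly 10 seconds.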
Revert unrelated formatting changes to minimize backport conflicts. Change the UNAVAILABLE timeout log from debug to warning per reviewer request. Restore the removed comment for context.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Force-pushed: 519bcf2 → cbd3efc
Co-authored-by: Miguel Angel Ajo Pelayo <majopela@redhat.com>
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@python/packages/jumpstarter/jumpstarter/client/status_monitor.py`:
- Around line 325-327: The unavailable_retries counter must only count
consecutive UNAVAILABLE errors: in the status polling function (where
unavailable_retries and unavailable_max_retries are defined and where
unavailable_retries is incremented at the UNAVAILABLE branch) change the logic
so that unavailable_retries is incremented only when the RPC/status is
UNAVAILABLE and is explicitly reset to 0 for any other outcome (successful poll,
DEADLINE_EXCEEDED, other RPC errors, or exceptions). Locate the block that
inspects the RPC status (the place that currently increments unavailable_retries
at UNAVAILABLE and only resets on success) and add a branch to set
unavailable_retries = 0 whenever the status is not UNAVAILABLE so the threshold
truly requires consecutive UNAVAILABLEs.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 44132cd1-b33f-4797-be45-700e258ab151
📒 Files selected for processing (4)

- python/packages/jumpstarter/jumpstarter/client/lease.py
- python/packages/jumpstarter/jumpstarter/client/lease_test.py
- python/packages/jumpstarter/jumpstarter/client/status_monitor.py
- python/packages/jumpstarter/jumpstarter/client/status_monitor_test.py
```python
unavailable_retries = 0
unavailable_max_retries = 10
```
Reset unavailable_retries on non-UNAVAILABLE errors to keep the threshold truly consecutive.
At Line 394, unavailable_retries increments correctly, but it is only reset on successful polls (Line 348). If DEADLINE_EXCEEDED (or another RPC error) occurs between UNAVAILABLEs, the counter still carries over, so connection loss can be triggered without 10 consecutive UNAVAILABLEs.
Suggested patch:

```diff
             except AioRpcError as e:
                 if e.code() == StatusCode.UNIMPLEMENTED:
                     logger.debug("GetStatus not implemented (server), assuming LEASE_READY")
                     self._signal_unsupported()
                     break
                 elif e.code() == StatusCode.UNAVAILABLE:
                     unavailable_retries += 1
                     if unavailable_retries >= unavailable_max_retries:
                         logger.warning(
                             "GetStatus UNAVAILABLE %d times consecutively, marking connection as lost",
                             unavailable_retries,
                         )
                         self._connection_lost = True
                         self._running = False
                         self._any_change_event.set()
                         self._any_change_event = Event()
                         break
                     elif unavailable_retries % 5 == 0:
                         logger.warning("GetStatus UNAVAILABLE %d times consecutively", unavailable_retries)
                     else:
                         logger.debug("GetStatus UNAVAILABLE (attempt %d), retrying...", unavailable_retries)
                 elif e.code() == StatusCode.DEADLINE_EXCEEDED:
+                    unavailable_retries = 0
                     # DEADLINE_EXCEEDED is a transient error (RPC timed out), not a
                     # permanent connection loss. Keep polling - the shell's own timeout
                     # on wait_for_any_of is the real deadline. Only UNAVAILABLE indicates
                     # a true connection loss (server down/disconnected).
                     deadline_retries += 1
                     if deadline_retries >= 20:
@@
                     else:
                         logger.debug("GetStatus timed out (attempt %d), retrying...", deadline_retries)
                     continue
+                else:
+                    unavailable_retries = 0
                     logger.debug(f"GetStatus poll error: {e.code()}")
```

Also applies to: 348-348, 394-409
Summary

- Retry the Dial RPC on transient UNAVAILABLE with exponential backoff bounded by dial_timeout, mirroring the existing FAILED_PRECONDITION retry logic
- Remove the `continue` in the poll loop's UNAVAILABLE handler so retries fall through to the standard sleep

Closes #242
Test plan

make pkg-test-jumpstarter

🤖 Generated with Claude Code