fix: retry Dial and StatusMonitor poll on transient UNAVAILABLE by raballew · Pull Request #606 · jumpstarter-dev/jumpstarter

raballew · 2026-04-17T06:22:26Z

Summary

Retry Dial on transient UNAVAILABLE with exponential backoff bounded by dial_timeout, mirroring existing FAILED_PRECONDITION retry logic
StatusMonitor poll loop now retries up to 10 times on UNAVAILABLE (matching DEADLINE_EXCEEDED pattern) instead of immediately marking connection lost
Add inter-retry delay for UNAVAILABLE errors by removing premature continue
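
A minimal sketch of the Dial retry described above, for illustration only. It is not the code in `lease.py`: `TransientUnavailable`, `dial_with_backoff`, `initial_delay`, and `max_delay` are hypothetical names and values, and the exception stands in for a gRPC error with code UNAVAILABLE. Only the overall shape (exponential backoff, bounded by `dial_timeout`, raising once the deadline passes) comes from the summary and commit messages.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientUnavailable(Exception):
    """Hypothetical stand-in for a gRPC error whose code is UNAVAILABLE."""


def dial_with_backoff(
    dial: Callable[[], T],
    dial_timeout: float,
    initial_delay: float = 0.3,  # assumed value, not taken from the PR
    max_delay: float = 5.0,  # assumed value, not taken from the PR
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call dial() until it succeeds or dial_timeout seconds have elapsed."""
    deadline = clock() + dial_timeout
    delay = initial_delay
    while True:
        try:
            return dial()
        except TransientUnavailable:
            remaining = deadline - clock()
            if remaining <= 0:
                # Out of time: surface the error instead of returning silently.
                raise
            # Never sleep past the deadline.
            sleep(min(delay, remaining))
            # Double the delay for the next attempt, up to a cap.
            delay = min(delay * 2, max_delay)
```

The `clock` and `sleep` parameters are injectable so that a test can drive the loop with a fake clock instead of waiting in real time.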

Closes #242

Test plan

Verify Dial retries on UNAVAILABLE and succeeds after exporter restart
Verify StatusMonitor tolerates transient UNAVAILABLE without terminating lease
Run make pkg-test-jumpstarter

🤖 Generated with Claude Code

coderabbitai · 2026-04-17T06:22:33Z

Walkthrough

This PR addresses premature lease termination when exporters briefly restart by introducing a consecutive-retry threshold for transient gRPC UNAVAILABLE errors. Instead of immediately marking the connection as lost on the first failure, the client now tolerates up to 10 consecutive UNAVAILABLE responses before declaring the connection dead.

Changes

Transient UNAVAILABLE Retry Handling

| Layer | File(s) | Summary |
| --- | --- | --- |
| Lease dial retry logging improvements | `python/packages/jumpstarter/jumpstarter/client/lease.py` | Reformats condition checks in `Lease._acquire` and improves log message formatting in `Lease.handle_async` retry loops to show attempt counts and remaining time more clearly. |
| Lease dial retry test infrastructure | `python/packages/jumpstarter/jumpstarter/client/lease_test.py` | Adds `MockAioRpcError` helper class and `TestHandleAsyncUnavailableRetry` test cases that validate retry behavior on UNAVAILABLE errors during dial operations. |
| StatusMonitor consecutive-error threshold implementation | `python/packages/jumpstarter/jumpstarter/client/status_monitor.py` | Introduces `unavailable_retries` counter and `unavailable_max_retries` threshold (10) to `_poll_loop`, only setting `_connection_lost` after reaching the threshold, with counter reset on successful polls. |
| StatusMonitor transient-error test coverage | `python/packages/jumpstarter/jumpstarter/client/status_monitor_test.py` | Replaces single-UNAVAILABLE test with comprehensive coverage for transient isolation, threshold behavior, counter reset, inter-retry delay, and updates integration/regression tests to match the new threshold semantics. |

Sequence Diagram(s)

sequenceDiagram
  participant PollLoop as StatusMonitor._poll_loop
  participant GetStatus as GetStatus RPC
  participant Error as UNAVAILABLE Error
  participant ConnLoss as connection_lost Flag
  PollLoop->>GetStatus: call GetStatus()
  GetStatus->>Error: raises UNAVAILABLE
  Error-->>PollLoop: error response
  PollLoop->>PollLoop: increment unavailable_retries
  alt retries < max (10)
    PollLoop->>PollLoop: log warning/debug
    PollLoop->>PollLoop: retry with backoff
  else retries >= max
    PollLoop->>ConnLoss: set connection_lost = True
    PollLoop->>PollLoop: stop polling loop
  end
  Note over PollLoop: On next successful GetStatus,<br/>reset unavailable_retries counter

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

jumpstarter-dev/jumpstarter#293: Both PRs modify Lease implementation and error-handling behavior in lease.py.
jumpstarter-dev/jumpstarter#671: Both PRs change Lease acquisition/retry logic including handle_async retry behavior.
jumpstarter-dev/jumpstarter#761: Both PRs address Lease._acquire and transient gRPC failure handling during lease acquisition.

Suggested labels

backport release-0.7

Suggested reviewers

evakhoni
kirkbrauer
mangelajo

Poem

🐰 A rabbit hops through network blips,
No more hasty disconnection slips—
Count ten UNAVAILABLE calls before we flee,
Transient troubles now handled gracefully!
(Exporter restarts won't kill thy lease.)

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 70.83%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'fix: retry Dial and StatusMonitor poll on transient UNAVAILABLE' accurately summarizes the main changes: retrying Dial and StatusMonitor polling on transient UNAVAILABLE errors. |
| Description check | ✅ Passed | The description clearly explains the two main objectives: retrying Dial on UNAVAILABLE with exponential backoff and making StatusMonitor retry up to 10 times instead of immediately marking connection lost. |
| Linked Issues check | ✅ Passed | The changes directly address issue `#242` by implementing retry logic for transient UNAVAILABLE in both Dial (with exponential backoff bounded by dial_timeout) and StatusMonitor polling (up to 10 retries), preventing premature lease termination during exporter restarts. |
| Out of Scope Changes check | ✅ Passed | All code changes are scoped to handling transient UNAVAILABLE errors in Lease.handle_async and StatusMonitor._poll_loop, with corresponding test coverage; no unrelated modifications detected. |

mangelajo · 2026-04-17T08:06:13Z

+                if e.code() == grpc.StatusCode.UNAVAILABLE:
+                    remaining = deadline - time.monotonic()
+                    if remaining <= 0:
+                        logger.debug(


Suggested change

-                        logger.debug(
+                        logger.warning(

Maybe even a warning?

mangelajo

just some nits

Instead of immediately marking the connection as permanently lost on a single gRPC UNAVAILABLE error, the poll loop now retries up to 10 times (mirroring the existing DEADLINE_EXCEEDED retry pattern). This prevents premature lease termination when an exporter briefly restarts. The retry counter resets on any successful GetStatus response. Only sustained failures (10+ consecutive UNAVAILABLE) mark connection_lost.

Fixes jumpstarter-dev#242
Generated-By: Forge/20260416_202053_681470_11575359_i242
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

When the exporter briefly restarts, the Dial RPC may fail with UNAVAILABLE. Instead of immediately giving up, retry with exponential backoff bounded by the existing dial_timeout parameter. This mirrors the existing FAILED_PRECONDITION retry logic.

Fixes jumpstarter-dev#242
Generated-By: Forge/20260416_202053_681470_11575359_i242
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…ction in test

Make UNAVAILABLE timeout in handle_async raise instead of returning silently, matching the FAILED_PRECONDITION timeout behavior. Add assertion that connect_router_stream is called after successful UNAVAILABLE retry.

Generated-By: Forge/20260416_202053_681470_11575359_i242
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…ll loop

Remove the `continue` statement from the UNAVAILABLE handler in _poll_loop so it falls through to the standard sleep block. Previously, UNAVAILABLE retries had no delay between attempts, so 10 retries could be exhausted in under 1ms, far too fast to tolerate an exporter restart that takes several seconds. Now retries use the poll_interval sleep, making the 10-retry threshold span a meaningful duration.

Generated-By: Forge/20260416_202053_681470_11575359_i242
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
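
The effect of removing the `continue` can be sketched with a small simulation. This is an illustration under stated assumptions, not the code in `status_monitor.py`: `simulate_outage` is a hypothetical helper, every poll is assumed to fail with UNAVAILABLE, RPC latency is ignored, and the 5-second `poll_interval` in the usage lines is an assumed value (the PR does not state it).

```python
def simulate_outage(
    poll_interval: float,
    max_retries: int = 10,
    skip_sleep_on_error: bool = False,
) -> float:
    """Return simulated seconds slept before the connection is declared lost.

    Every poll in this simulation fails with UNAVAILABLE.
    """
    elapsed = 0.0
    retries = 0
    while True:
        retries += 1  # the poll failed with UNAVAILABLE
        if retries >= max_retries:
            # Threshold reached: the loop stops and the connection is marked lost.
            return elapsed
        if skip_sleep_on_error:
            continue  # old behavior: jump straight to the next poll
        elapsed += poll_interval  # new behavior: fall through to the sleep


# With the premature `continue`, all retries are spent without any delay.
print(simulate_outage(5.0, skip_sleep_on_error=True))
# Without it, nine sleeps separate the ten failed polls.
print(simulate_outage(5.0))
```

Under these assumptions the tolerated outage is roughly `(max_retries - 1) * poll_interval`, since the final failed poll exits the loop without sleeping.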

Revert unrelated formatting changes to minimize backport conflicts. Change UNAVAILABLE timeout log from debug to warning per reviewer request. Restore removed comment for context.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-authored-by: Miguel Angel Ajo Pelayo <majopela@redhat.com>

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@python/packages/jumpstarter/jumpstarter/client/status_monitor.py`:
- Around line 325-327: The unavailable_retries counter must only count
consecutive UNAVAILABLE errors: in the status polling function (where
unavailable_retries and unavailable_max_retries are defined and where
unavailable_retries is incremented at the UNAVAILABLE branch) change the logic
so that unavailable_retries is incremented only when the RPC/status is
UNAVAILABLE and is explicitly reset to 0 for any other outcome (successful poll,
DEADLINE_EXCEEDED, other RPC errors, or exceptions). Locate the block that
inspects the RPC status (the place that currently increments unavailable_retries
at UNAVAILABLE and only resets on success) and add a branch to set
unavailable_retries = 0 whenever the status is not UNAVAILABLE so the threshold
truly requires consecutive UNAVAILABLEs.

📥 Commits

Reviewing files that changed from the base of the PR and between 7a174e7 and a2e0252.

📒 Files selected for processing (4)

python/packages/jumpstarter/jumpstarter/client/lease.py
python/packages/jumpstarter/jumpstarter/client/lease_test.py
python/packages/jumpstarter/jumpstarter/client/status_monitor.py
python/packages/jumpstarter/jumpstarter/client/status_monitor_test.py

coderabbitai · 2026-05-12T12:28:34Z

+        unavailable_retries = 0
+        unavailable_max_retries = 10



⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Reset unavailable_retries on non-UNAVAILABLE errors to keep the threshold truly consecutive.

At Line 394, unavailable_retries increments correctly, but it is only reset on successful polls (Line 348). If DEADLINE_EXCEEDED (or another RPC error) occurs between UNAVAILABLEs, the counter still carries over, so connection loss can be triggered without 10 consecutive UNAVAILABLEs.

Suggested patch

 except AioRpcError as e:
     if e.code() == StatusCode.UNIMPLEMENTED:
         logger.debug("GetStatus not implemented (server), assuming LEASE_READY")
         self._signal_unsupported()
         break
     elif e.code() == StatusCode.UNAVAILABLE:
         unavailable_retries += 1
         if unavailable_retries >= unavailable_max_retries:
             logger.warning(
                 "GetStatus UNAVAILABLE %d times consecutively, marking connection as lost",
                 unavailable_retries,
             )
             self._connection_lost = True
             self._running = False
             self._any_change_event.set()
             self._any_change_event = Event()
             break
         elif unavailable_retries % 5 == 0:
             logger.warning("GetStatus UNAVAILABLE %d times consecutively", unavailable_retries)
         else:
             logger.debug("GetStatus UNAVAILABLE (attempt %d), retrying...", unavailable_retries)
     elif e.code() == StatusCode.DEADLINE_EXCEEDED:
+        unavailable_retries = 0
         # DEADLINE_EXCEEDED is a transient error (RPC timed out), not a
         # permanent connection loss. Keep polling - the shell's own timeout
         # on wait_for_any_of is the real deadline. Only UNAVAILABLE indicates
         # a true connection loss (server down/disconnected).
         deadline_retries += 1
         if deadline_retries >= 20:
@@
         else:
             logger.debug("GetStatus timed out (attempt %d), retrying...", deadline_retries)
         continue
+    else:
+        unavailable_retries = 0
     logger.debug(f"GetStatus poll error: {e.code()}")

Also applies to: 348-348, 394-409
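
The difference between the two reset policies can be sketched as a small simulation. This is a hypothetical model, not the code in `status_monitor.py`: `Outcome`, `polls_until_lost`, and the `reset_on_other_errors` flag are names invented here. Only the counter, the threshold of 10, and the reset-on-success behavior come from the review comment.

```python
from enum import Enum
from typing import Iterable, Optional


class Outcome(Enum):
    OK = "ok"
    UNAVAILABLE = "unavailable"
    DEADLINE_EXCEEDED = "deadline_exceeded"


def polls_until_lost(
    outcomes: Iterable[Outcome],
    max_retries: int = 10,
    reset_on_other_errors: bool = True,
) -> Optional[int]:
    """Return the 1-based poll number at which the connection is marked lost.

    Returns None if the threshold is never reached.
    """
    unavailable_retries = 0
    for n, outcome in enumerate(outcomes, start=1):
        if outcome is Outcome.UNAVAILABLE:
            unavailable_retries += 1
            if unavailable_retries >= max_retries:
                return n
        elif outcome is Outcome.OK or reset_on_other_errors:
            # A successful poll always resets the counter; other errors reset
            # it only under the policy suggested in the review comment.
            unavailable_retries = 0
    return None


# Nine UNAVAILABLEs, one DEADLINE_EXCEEDED, then one more UNAVAILABLE.
sequence = [Outcome.UNAVAILABLE] * 9 + [Outcome.DEADLINE_EXCEEDED, Outcome.UNAVAILABLE]

# Reset only on success: the counter carries over and the 11th poll trips the threshold.
print(polls_until_lost(sequence, reset_on_other_errors=False))
# Reset on any non-UNAVAILABLE outcome: the threshold is not reached.
print(polls_until_lost(sequence, reset_on_other_errors=True))
```

In the first call the connection is declared lost even though no run of 10 consecutive UNAVAILABLE responses occurred, which is the case the comment flags.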

mangelajo reviewed Apr 17, 2026

Comment thread python/packages/jumpstarter/jumpstarter/client/lease.py Outdated

mangelajo reviewed Apr 17, 2026

raballew force-pushed the 242-unavailable-retry-resilience branch from 5cd25c3 to 519bcf2 Compare April 17, 2026 10:45

raballew and others added 5 commits April 28, 2026 10:05

raballew force-pushed the 242-unavailable-retry-resilience branch from 519bcf2 to cbd3efc Compare April 28, 2026 08:05

Apply suggestions from code review

a2e0252

Co-authored-by: Miguel Angel Ajo Pelayo <majopela@redhat.com>

raballew marked this pull request as ready for review May 12, 2026 12:25

coderabbitai Bot reviewed May 12, 2026

raballew requested a review from mangelajo May 12, 2026 12:52

raballew enabled auto-merge (squash) May 12, 2026 20:25

fix: retry Dial and StatusMonitor poll on transient UNAVAILABLE #606
raballew wants to merge 6 commits into jumpstarter-dev:main from raballew:242-unavailable-retry-resilience
