Skip to content

fix: retry Dial and StatusMonitor poll on transient UNAVAILABLE#606

Open
raballew wants to merge 6 commits into
jumpstarter-dev:mainfrom
raballew:242-unavailable-retry-resilience
Open

fix: retry Dial and StatusMonitor poll on transient UNAVAILABLE#606
raballew wants to merge 6 commits into
jumpstarter-dev:mainfrom
raballew:242-unavailable-retry-resilience

Conversation

@raballew
Copy link
Copy Markdown
Member

Summary

  • Retry Dial on transient UNAVAILABLE with exponential backoff bounded by dial_timeout, mirroring existing FAILED_PRECONDITION retry logic
  • StatusMonitor poll loop now retries up to 10 times on UNAVAILABLE (matching DEADLINE_EXCEEDED pattern) instead of immediately marking connection lost
  • Add inter-retry delay for UNAVAILABLE errors by removing premature continue

Closes #242

Test plan

  • Verify Dial retries on UNAVAILABLE and succeeds after exporter restart
  • Verify StatusMonitor tolerates transient UNAVAILABLE without terminating lease
  • Run make pkg-test-jumpstarter

🤖 Generated with Claude Code

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 17, 2026

Review Change Stack

📝 Walkthrough

Walkthrough

This PR addresses premature lease termination when exporters briefly restart by introducing a consecutive-retry threshold for transient gRPC UNAVAILABLE errors. Instead of immediately marking the connection as lost on the first failure, the client now tolerates up to 10 consecutive UNAVAILABLE responses before declaring the connection dead.

Changes

Transient UNAVAILABLE Retry Handling

Layer / File(s) Summary
Lease dial retry logging improvements
python/packages/jumpstarter/jumpstarter/client/lease.py
Reformats condition checks in Lease._acquire and improves log message formatting in Lease.handle_async retry loops to show attempt counts and remaining time more clearly.
Lease dial retry test infrastructure
python/packages/jumpstarter/jumpstarter/client/lease_test.py
Adds MockAioRpcError helper class and TestHandleAsyncUnavailableRetry test cases that validate retry behavior on UNAVAILABLE errors during dial operations.
StatusMonitor consecutive-error threshold implementation
python/packages/jumpstarter/jumpstarter/client/status_monitor.py
Introduces unavailable_retries counter and unavailable_max_retries threshold (10) to _poll_loop, only setting _connection_lost after reaching the threshold, with counter reset on successful polls.
StatusMonitor transient-error test coverage
python/packages/jumpstarter/jumpstarter/client/status_monitor_test.py
Replaces single-UNAVAILABLE test with comprehensive coverage for transient isolation, threshold behavior, counter reset, inter-retry delay, and updates integration/regression tests to match the new threshold semantics.

Sequence Diagram(s)

sequenceDiagram
  participant PollLoop as StatusMonitor._poll_loop
  participant GetStatus as GetStatus RPC
  participant Error as UNAVAILABLE Error
  participant ConnLoss as connection_lost Flag
  PollLoop->>GetStatus: call GetStatus()
  GetStatus->>Error: raises UNAVAILABLE
  Error-->>PollLoop: error response
  PollLoop->>PollLoop: increment unavailable_retries
  alt retries < max (10)
    PollLoop->>PollLoop: log warning/debug
    PollLoop->>PollLoop: retry with backoff
  else retries >= max
    PollLoop->>ConnLoss: set connection_lost = True
    PollLoop->>PollLoop: stop polling loop
  end
  Note over PollLoop: On next successful GetStatus,<br/>reset unavailable_retries counter
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested labels

backport release-0.7

Suggested reviewers

  • evakhoni
  • kirkbrauer
  • mangelajo

Poem

🐰 A rabbit hops through network blips,
No more hasty disconnection slips—
Count ten UNAVAILABLE calls before we flee,
Transient troubles now handled gracefully!
(Exporter restarts won't kill thy lease.)

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 70.83% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'fix: retry Dial and StatusMonitor poll on transient UNAVAILABLE' accurately summarizes the main changes: retrying Dial and StatusMonitor polling on transient UNAVAILABLE errors.
Description check ✅ Passed The description clearly explains the two main objectives: retrying Dial on UNAVAILABLE with exponential backoff and making StatusMonitor retry up to 10 times instead of immediately marking connection lost.
Linked Issues check ✅ Passed The changes directly address issue #242 by implementing retry logic for transient UNAVAILABLE in both Dial (with exponential backoff bounded by dial_timeout) and StatusMonitor polling (up to 10 retries), preventing premature lease termination during exporter restarts.
Out of Scope Changes check ✅ Passed All code changes are scoped to handling transient UNAVAILABLE errors in Lease.handle_async and StatusMonitor._poll_loop, with corresponding test coverage; no unrelated modifications detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

if e.code() == grpc.StatusCode.UNAVAILABLE:
remaining = deadline - time.monotonic()
if remaining <= 0:
logger.debug(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
logger.debug(
logger.warning(

May be even a warning?

Comment thread python/packages/jumpstarter/jumpstarter/client/lease.py Outdated
Copy link
Copy Markdown
Member

@mangelajo mangelajo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just some nits

Comment thread python/packages/jumpstarter/jumpstarter/client/lease.py Outdated
Comment thread python/packages/jumpstarter/jumpstarter/client/lease.py Outdated
Comment thread python/packages/jumpstarter/jumpstarter/client/lease.py Outdated
Comment thread python/packages/jumpstarter/jumpstarter/client/lease.py Outdated
Comment thread python/packages/jumpstarter/jumpstarter/client/lease.py Outdated
@raballew raballew force-pushed the 242-unavailable-retry-resilience branch from 5cd25c3 to 519bcf2 Compare April 17, 2026 10:45
raballew and others added 5 commits April 28, 2026 10:05
Instead of immediately marking the connection as permanently lost on a
single gRPC UNAVAILABLE error, the poll loop now retries up to 10 times
(mirroring the existing DEADLINE_EXCEEDED retry pattern). This prevents
premature lease termination when an exporter briefly restarts.

The retry counter resets on any successful GetStatus response. Only
sustained failures (10+ consecutive UNAVAILABLE) mark connection_lost.

Fixes jumpstarter-dev#242

Generated-By: Forge/20260416_202053_681470_11575359_i242
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When the exporter briefly restarts, the Dial RPC may fail with
UNAVAILABLE. Instead of immediately giving up, retry with exponential
backoff bounded by the existing dial_timeout parameter. This mirrors
the existing FAILED_PRECONDITION retry logic.

Fixes jumpstarter-dev#242

Generated-By: Forge/20260416_202053_681470_11575359_i242
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ction in test

Make UNAVAILABLE timeout in handle_async raise instead of returning silently,
matching the FAILED_PRECONDITION timeout behavior. Add assertion that
connect_router_stream is called after successful UNAVAILABLE retry.

Generated-By: Forge/20260416_202053_681470_11575359_i242
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ll loop

Remove the `continue` statement from the UNAVAILABLE handler in
_poll_loop so it falls through to the standard sleep block. Previously,
UNAVAILABLE retries had no delay between attempts, so 10 retries could
be exhausted in under 1ms -- far too fast to tolerate an exporter
restart that takes several seconds. Now retries use the poll_interval
sleep, making the 10-retry threshold span a meaningful duration.

Generated-By: Forge/20260416_202053_681470_11575359_i242
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Revert unrelated formatting changes to minimize backport conflicts.
Change UNAVAILABLE timeout log from debug to warning per reviewer request.
Restore removed comment for context.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@raballew raballew force-pushed the 242-unavailable-retry-resilience branch from 519bcf2 to cbd3efc Compare April 28, 2026 08:05
Co-authored-by: Miguel Angel Ajo Pelayo <majopela@redhat.com>
@raballew raballew marked this pull request as ready for review May 12, 2026 12:25
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@python/packages/jumpstarter/jumpstarter/client/status_monitor.py`:
- Around line 325-327: The unavailable_retries counter must only count
consecutive UNAVAILABLE errors: in the status polling function (where
unavailable_retries and unavailable_max_retries are defined and where
unavailable_retries is incremented at the UNAVAILABLE branch) change the logic
so that unavailable_retries is incremented only when the RPC/status is
UNAVAILABLE and is explicitly reset to 0 for any other outcome (successful poll,
DEADLINE_EXCEEDED, other RPC errors, or exceptions). Locate the block that
inspects the RPC status (the place that currently increments unavailable_retries
at UNAVAILABLE and only resets on success) and add a branch to set
unavailable_retries = 0 whenever the status is not UNAVAILABLE so the threshold
truly requires consecutive UNAVAILABLEs.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 44132cd1-b33f-4797-be45-700e258ab151

📥 Commits

Reviewing files that changed from the base of the PR and between 7a174e7 and a2e0252.

📒 Files selected for processing (4)
  • python/packages/jumpstarter/jumpstarter/client/lease.py
  • python/packages/jumpstarter/jumpstarter/client/lease_test.py
  • python/packages/jumpstarter/jumpstarter/client/status_monitor.py
  • python/packages/jumpstarter/jumpstarter/client/status_monitor_test.py

Comment on lines +325 to 327
unavailable_retries = 0
unavailable_max_retries = 10

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Reset unavailable_retries on non-UNAVAILABLE errors to keep the threshold truly consecutive.

At Line 394, unavailable_retries increments correctly, but it is only reset on successful polls (Line 348). If DEADLINE_EXCEEDED (or another RPC error) occurs between UNAVAILABLEs, the counter still carries over, so connection loss can be triggered without 10 consecutive UNAVAILABLEs.

Suggested patch
             except AioRpcError as e:
                 if e.code() == StatusCode.UNIMPLEMENTED:
                     logger.debug("GetStatus not implemented (server), assuming LEASE_READY")
                     self._signal_unsupported()
                     break
                 elif e.code() == StatusCode.UNAVAILABLE:
                     unavailable_retries += 1
                     if unavailable_retries >= unavailable_max_retries:
                         logger.warning(
                             "GetStatus UNAVAILABLE %d times consecutively, marking connection as lost",
                             unavailable_retries,
                         )
                         self._connection_lost = True
                         self._running = False
                         self._any_change_event.set()
                         self._any_change_event = Event()
                         break
                     elif unavailable_retries % 5 == 0:
                         logger.warning("GetStatus UNAVAILABLE %d times consecutively", unavailable_retries)
                     else:
                         logger.debug("GetStatus UNAVAILABLE (attempt %d), retrying...", unavailable_retries)
                 elif e.code() == StatusCode.DEADLINE_EXCEEDED:
+                    unavailable_retries = 0
                     # DEADLINE_EXCEEDED is a transient error (RPC timed out), not a
                     # permanent connection loss. Keep polling - the shell's own timeout
                     # on wait_for_any_of is the real deadline. Only UNAVAILABLE indicates
                     # a true connection loss (server down/disconnected).
                     deadline_retries += 1
                     if deadline_retries >= 20:
@@
                     else:
                         logger.debug("GetStatus timed out (attempt %d), retrying...", deadline_retries)
                     continue
+                else:
+                    unavailable_retries = 0
                 logger.debug(f"GetStatus poll error: {e.code()}")

Also applies to: 348-348, 394-409

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@python/packages/jumpstarter/jumpstarter/client/status_monitor.py` around
lines 325 - 327, The unavailable_retries counter must only count consecutive
UNAVAILABLE errors: in the status polling function (where unavailable_retries
and unavailable_max_retries are defined and where unavailable_retries is
incremented at the UNAVAILABLE branch) change the logic so that
unavailable_retries is incremented only when the RPC/status is UNAVAILABLE and
is explicitly reset to 0 for any other outcome (successful poll,
DEADLINE_EXCEEDED, other RPC errors, or exceptions). Locate the block that
inspects the RPC status (the place that currently increments unavailable_retries
at UNAVAILABLE and only resets on success) and add a branch to set
unavailable_retries = 0 whenever the status is not UNAVAILABLE so the threshold
truly requires consecutive UNAVAILABLEs.

@raballew raballew requested a review from mangelajo May 12, 2026 12:52
@raballew raballew enabled auto-merge (squash) May 12, 2026 20:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

hooks enabled: restarting an exporter is causing the client to kill the lease prematurely

2 participants