@MikeeBuilds
Collaborator

@MikeeBuilds MikeeBuilds commented Jan 29, 2026

Summary

Fixes #1546

  • Supersedes #1565 (fix: agent retry loop and multi-account authentication (#1546)), which had 152 merge conflicts with develop
  • Rebased cleanly onto latest develop with all review feedback addressed
  • Fixes agent getting stuck in infinite retry loop when hitting Claude API's tool concurrency limit (400 errors)
  • Implements exponential backoff retry logic (2s, 4s, 8s, 16s, 32s) with max 5 retries
  • Adds error context to agent prompt instructing it to use one tool at a time after concurrency errors
  • Preserves planning mode on planner-side concurrency errors (prevents skipping planner)
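
The delay schedule quoted above can be sketched as a small generator. This is only an illustration: the constant names come from the PR description, but the values assigned here (and the `backoff_delays` helper itself) are assumptions inferred from the "2s, 4s, 8s, 16s, 32s" figures, not code copied from agents/base.py or agents/coder.py.

```python
from typing import Iterator

# Assumed values, inferred from the schedule quoted in the summary.
MAX_CONCURRENCY_RETRIES = 5
INITIAL_RETRY_DELAY_SECONDS = 2
MAX_RETRY_DELAY_SECONDS = 32


def backoff_delays(
    initial: int = INITIAL_RETRY_DELAY_SECONDS,
    maximum: int = MAX_RETRY_DELAY_SECONDS,
    retries: int = MAX_CONCURRENCY_RETRIES,
) -> Iterator[int]:
    """Yield one delay per retry, doubling each time and capping at `maximum`."""
    delay = initial
    for _ in range(retries):
        yield delay
        delay = min(delay * 2, maximum)


print(list(backoff_delays()))  # [2, 4, 8, 16, 32]
```

Capping at a maximum keeps the wait bounded if the retry limit is ever raised.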

Changes

From original PR #1565

  • agents/session.py: Add is_tool_concurrency_error() detection, return 3-tuple (status, response, error_info) with structured error categorization
  • agents/coder.py: Add concurrency-aware retry loop with exponential backoff, error context injection into agent prompts, planning-phase retry preservation
  • agents/base.py: Add retry configuration constants (MAX_CONCURRENCY_RETRIES, INITIAL_RETRY_DELAY_SECONDS, MAX_RETRY_DELAY_SECONDS)
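
A rough sketch of the detection and structured error shape described above. The matching heuristic, the `ErrorInfo` type, and `build_error_info` are hypothetical; the PR only states that `is_tool_concurrency_error()` exists and that `error_info` carries a type ('tool_concurrency' or 'other'), a message, and an exception type name.

```python
from typing import TypedDict


class ErrorInfo(TypedDict):
    type: str            # "tool_concurrency" or "other"
    message: str
    exception_type: str  # class name only, so the dict stays JSON-serializable


def is_tool_concurrency_error(error: Exception) -> bool:
    """Assumed heuristic: an HTTP 400 whose message mentions tool concurrency."""
    text = str(error).lower()
    return "400" in text and "tool" in text and "concurrency" in text


def build_error_info(error: Exception) -> ErrorInfo:
    """Hypothetical helper that categorizes an exception for the caller."""
    kind = "tool_concurrency" if is_tool_concurrency_error(error) else "other"
    return {
        "type": kind,
        "message": str(error),
        "exception_type": type(error).__name__,
    }


# A session wrapper could then return (status, response, error_info).
result = ("error", "", build_error_info(RuntimeError("400: tool use concurrency issue")))
print(result[2]["type"])
```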

Review feedback addressed (this PR)

  • Extract duplicated concurrency error state reset into _reset_concurrency_state() helper (was duplicated in 3 places)
  • Fix stale docstring referencing "exception" key → now correctly documents "exception_type"
  • Move import os from function-level to module-level in simple_client.py
  • Use exception_type: type(e).__name__ instead of raw exception object (JSON-serializable, no internal leaks)
  • Off-by-one backoff already fixed (> not >=, allows all 5 retries)
  • Empty f-strings already fixed (plain strings, no Ruff F541)
  • Test mocks already return 3-tuples matching new signature
  • Planning concurrency guidance already implemented via planning_retry_context
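
To illustrate the off-by-one item, the sketch below counts how many retries each comparison permits when every attempt fails. It assumes the error counter is incremented before the limit check, which is a guess about the loop's structure; `retries_allowed` is a hypothetical helper written for this comparison only.

```python
from typing import Callable

MAX_CONCURRENCY_RETRIES = 5  # assumed value, matching "max 5 retries"


def retries_allowed(give_up: Callable[[int], bool]) -> int:
    """Count the retries scheduled before `give_up` fires, if every attempt fails."""
    consecutive_errors = 0
    retries = 0
    while True:
        consecutive_errors += 1  # an attempt just failed
        if give_up(consecutive_errors):
            return retries
        retries += 1  # schedule another attempt


with_gt = retries_allowed(lambda n: n > MAX_CONCURRENCY_RETRIES)
with_ge = retries_allowed(lambda n: n >= MAX_CONCURRENCY_RETRIES)
print(with_gt, with_ge)  # 5 4
```

Under that assumption, `>=` would stop one retry early, which matches the review note that `>` "allows all 5 retries".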

Items from #1565 not carried over (already in develop)

  • auth.py multi-account keychain changes — already merged into develop via other PRs
  • AuthStatusIndicator.tsx — debug console.logs already removed in develop
  • App.tsx onboarding race condition — separate concern, not related to retry loop fix

Test plan

  • Verify agent handles 400 tool concurrency errors with exponential backoff
  • Verify agent prompt includes concurrency guidance after retry
  • Verify planning phase retries stay in planning mode on concurrency error
  • Verify agent gives up after 5 consecutive concurrency errors
  • Run existing test suite: pytest tests/test_issue_884_plan_schema.py -v

🤖 Generated with Claude Code

MikeeBuilds and others added 4 commits January 29, 2026 13:08
…0#1546)

- Add exponential backoff retry logic (2s, 4s, 8s, 16s, 32s) for 400 tool concurrency errors
- Add is_tool_concurrency_error() detection in session.py
- Update run_agent_session to return 3-tuple (status, response, error_info)
- Track consecutive concurrency errors (max 5 retries before marking subtask as stuck)
- Add error context to agent prompt instructing it to use one tool at a time
- Fix: Reset first_run=True on planning concurrency errors to retry planning
- Update test mocks for run_agent_session 3-tuple return type
- Update test mocks for get_token_from_keychain config_dir parameter

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Avoids JSON serialization issues and potential internal leaks when
error_info is logged or sent via IPC.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Extract duplicated concurrency error reset logic into _reset_concurrency_state() helper
- Fix stale docstring referencing "exception" key (now "exception_type")
- Move `import os` to module level in simple_client.py

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@MikeeBuilds
Collaborator Author

Hey @AndyMik90 — this PR supersedes #1565.

Why a new PR: PR #1565 had diverged so far from develop that it had 152 merge conflicts across virtually the entire codebase (frontend, backend, CI, i18n, tests — everything). The original branch fix/1546-agent-retry-loop was based on a much older develop and the auth.py changes from that branch had already been merged separately.

What we did:

  1. Reviewed all comments on #1565 (fix: agent retry loop and multi-account authentication (#1546)) from CodeRabbit, Gemini, Sentry, and GitHub Advanced Security
  2. Verified which issues were already fixed in the existing commits vs still needed work
  3. Applied the remaining fixes (duplicated reset logic → helper function, stale docstring, module-level import)
  4. Rebased cleanly onto latest develop — resolved the single actual conflict in auth.py by keeping develop's version (which already had the keychain improvements)
  5. Opened this fresh PR with a clean 4-commit history

Review comment status:

  • ✅ Raw exception object → exception_type string (already in commit 9dca7845)
  • ✅ Planning retry skipping planner → already fixed (lines 660-667 reset first_run = True)
  • ✅ Off-by-one backoff → already fixed (uses > not >=)
  • ✅ Empty f-strings → already plain strings
  • ✅ Test mock 3-tuple mismatch → already returning 3 values
  • ✅ Planner concurrency guidance → already wired through planning_retry_context
  • ✅ Duplicated reset logic → new fix: extracted _reset_concurrency_state() helper
  • ✅ Stale docstring → new fix: updated to reference exception_type
  • ✅ Function-level import os → new fix: moved to module level

The auth.py multi-account keychain changes from #1565 were not carried over because they're already in develop from other merged PRs. Same for the AuthStatusIndicator.tsx debug logs — already cleaned up in develop.

#1565 can be closed once this is merged.

@coderabbitai
Contributor

coderabbitai bot commented Jan 29, 2026

📝 Walkthrough

Adds per-profile OAuth token lookup, propagates error context from sessions, and implements exponential-backoff retry handling for tool-concurrency (HTTP 400) errors; also integrates Claude profile loading into the frontend and updates related tests and translations.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Retry/Concurrency Constants (apps/backend/agents/base.py) | Added constants: MAX_CONCURRENCY_RETRIES, INITIAL_RETRY_DELAY_SECONDS, MAX_RETRY_DELAY_SECONDS. |
| Concurrency/Error Flow (apps/backend/agents/coder.py, apps/backend/agents/planner.py) | Introduced per-run concurrency state and exponential-backoff retry logic in coder; planner updated to unpack new error_info return from sessions. |
| Session Error Reporting (apps/backend/agents/session.py) | Added is_tool_concurrency_error() and changed run_agent_session to return (status, response, error_info) with richer error categorization and logging. |
| Per-Profile Auth Token (apps/backend/core/client.py, apps/backend/core/simple_client.py) | Token retrieval now uses config_dir/sdk_env and passes config_dir to require_auth_token(), enabling per-profile Keychain lookup, and sets CLAUDE_CONFIG_DIR for the SDK. |
| Frontend: Claude Profiles & Auth UI (apps/frontend/src/renderer/App.tsx, apps/frontend/src/renderer/components/AuthStatusIndicator.tsx) | Loaded Claude profiles on startup and integrated Claude profile store + usage data into auth status calculation and OAuth UI rendering. |
| Locales (apps/frontend/src/shared/i18n/locales/en/common.json, apps/frontend/src/shared/i18n/locales/fr/common.json) | Added usage.account translation key ("Account" / "Compte"). |
| Tests (tests/test_auth.py, tests/test_issue_884_plan_schema.py) | Updated mocks and helpers to accept the new optional config_dir parameter and return the new (status, response, error_info) tuple, respectively. |

Sequence Diagram(s)

sequenceDiagram
    participant Coder as Coder Agent
    participant Session as run_agent_session
    participant Detector as is_tool_concurrency_error
    participant Backoff as Backoff Manager

    Coder->>Session: run_agent_session(task)
    Session->>Coder: (status, response, error_info)
    alt error_info.type == "tool_concurrency"
        Coder->>Detector: check error_info
        Detector-->>Coder: true
        Coder->>Backoff: check/increment retry count
        Backoff-->>Coder: delay (2s→4s→8s...)
        Coder->>Coder: inject error_info into next prompt
        Coder->>Session: run_agent_session(task) [retry after delay]
    else success
        Coder->>Coder: _reset_concurrency_state()
    end
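
The first diagram's flow, written out as a synchronous sketch. Everything here is illustrative: `run_with_concurrency_retries`, its parameters, the "stuck" status string, and the injected context text are assumptions, and the real loop in agents/coder.py may be asynchronous and structured differently.

```python
from typing import Callable, List, Optional, Tuple

SessionResult = Tuple[str, str, Optional[dict]]  # (status, response, error_info)


def run_with_concurrency_retries(
    run_session: Callable[[Optional[str]], SessionResult],
    sleep: Callable[[float], None],
    max_retries: int = 5,
    initial_delay: float = 2.0,
    max_delay: float = 32.0,
) -> Tuple[str, List[float]]:
    """Retry a session on tool-concurrency errors, doubling the delay each time."""
    consecutive_errors = 0
    delay = initial_delay
    error_context: Optional[str] = None
    delays_used: List[float] = []
    while True:
        status, _response, error_info = run_session(error_context)
        if not (error_info and error_info.get("type") == "tool_concurrency"):
            return status, delays_used  # success or unrelated error: stop here
        consecutive_errors += 1
        if consecutive_errors > max_retries:
            return "stuck", delays_used
        # Guidance passed into the next prompt (wording is a placeholder).
        error_context = (
            f"Tool concurrency error, retry {consecutive_errors}/{max_retries}: "
            "use ONE tool at a time."
        )
        sleep(delay)
        delays_used.append(delay)
        delay = min(delay * 2, max_delay)


calls: List[Optional[str]] = []


def fake_session(context: Optional[str]) -> SessionResult:
    """Fails twice with a concurrency error, then succeeds."""
    calls.append(context)
    if len(calls) <= 2:
        return "error", "", {"type": "tool_concurrency"}
    return "complete", "done", None


status, delays = run_with_concurrency_retries(fake_session, sleep=lambda s: None)
print(status, delays)  # complete [2.0, 4.0]
```

Passing `sleep` in as a parameter keeps the sketch testable without real waiting.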
sequenceDiagram
    participant App as Frontend App
    participant Profiles as Profile Store
    participant ClaudeStore as Claude Profile Store
    participant Backend as Backend/SDK

    App->>Profiles: loadProfiles()
    Profiles-->>App: profiles loaded
    App->>ClaudeStore: loadClaudeProfiles()
    ClaudeStore->>Backend: fetch profiles
    Backend-->>ClaudeStore: profile list + active
    ClaudeStore-->>App: profile updates (subscribed)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

bug, area/backend, size/M

Suggested reviewers

  • AndyMik90

Poem

🐰 Hopping through logs with a retrying heart,
When tools clash and HTTPs start,
I nudge prompts with context and wait,
Backoff doubles—patient, never late.
Profiles bloom; tokens find their part.

🚥 Pre-merge checks | ✅ 3 passed

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: implementing a retry loop with exponential backoff for tool concurrency errors, which is the core fix across agents/base.py, agents/coder.py, and agents/session.py. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 82.61%, which is sufficient. The required threshold is 80.00%. |


@gemini-code-assist
Contributor

Summary of Changes

Hello @MikeeBuilds, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the agent's resilience and reliability when interacting with the Claude API, particularly concerning tool concurrency limits. By introducing an intelligent retry mechanism with exponential backoff and providing contextual guidance to the agent after encountering concurrency errors, it ensures smoother operation and prevents the agent from stalling. Additionally, it refines error reporting and preserves the agent's operational phase during retries, contributing to a more stable and predictable agent workflow.

Highlights

  • Tool Concurrency Error Handling: The agent now robustly handles Claude API's 400 tool concurrency errors, preventing it from getting stuck in an infinite retry loop.
  • Exponential Backoff Retry Logic: An exponential backoff strategy has been implemented for tool concurrency errors, with delays of 2s, 4s, 8s, 16s, and 32s, allowing for a maximum of 5 retries.
  • Agent Prompt Context Injection: After a tool concurrency error, the agent's prompt is augmented with specific guidance, instructing it to use one tool at a time to avoid further concurrency issues.
  • Planning Mode Preservation: If a concurrency error occurs during the planning phase, the system now ensures that the agent remains in planning mode for retry, preventing unintended skips.
  • Refactored Error Reporting: The run_agent_session function now returns structured error information, including the error type ('tool_concurrency' or 'other'), message, and exception type, facilitating more precise error handling.

@sentry

sentry bot commented Jan 29, 2026

Codecov Report

❌ Patch coverage is 37.34940% with 52 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| apps/backend/agents/coder.py | 33.92% | 37 Missing ⚠️ |
| apps/backend/agents/session.py | 7.14% | 13 Missing ⚠️ |
| apps/backend/agents/planner.py | 0.00% | 1 Missing ⚠️ |
| apps/backend/core/client.py | 80.00% | 1 Missing ⚠️ |


Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request effectively addresses an infinite retry loop caused by tool concurrency errors from the Claude API. The implementation of exponential backoff, a maximum retry limit, and contextual error prompts for the agent are well-executed. The code is clean and the refactoring to avoid duplication is a good improvement. I have one suggestion to further improve maintainability by extracting a large prompt template from the main agent logic.

Comment on lines +650 to +661
error_context_message = (
    "## CRITICAL: TOOL CONCURRENCY ERROR\n\n"
    f"Your previous session hit Claude API's tool concurrency limit (HTTP 400).\n"
    f"This is retry {consecutive_concurrency_errors}/{MAX_CONCURRENCY_RETRIES}.\n\n"
    "**IMPORTANT: You MUST adjust your approach:**\n"
    "1. Use ONE tool at a time - do NOT call multiple tools in parallel\n"
    "2. Wait for each tool result before calling the next tool\n"
    "3. Avoid starting with `pwd` or multiple Read calls at once\n"
    "4. If you need to read multiple files, read them one by one\n"
    "5. Take a more incremental, step-by-step approach\n\n"
    "Start by focusing on ONE specific action for this subtask."
)
Contributor

medium

This large, multi-line string for the agent's error prompt is defined directly within the main agent loop. For better maintainability and separation of concerns, it's a good practice to extract prompt templates from the core logic.

Consider moving this template to a dedicated prompts module (e.g., prompts.py) and generating it with a helper function. This would make the core agent logic cleaner and centralize prompt management.

For example, in a prompts module:

def generate_concurrency_error_prompt(retry_count: int, max_retries: int) -> str:
    return f'''## CRITICAL: TOOL CONCURRENCY ERROR

Your previous session hit Claude API's tool concurrency limit (HTTP 400).
This is retry {retry_count}/{max_retries}.

**IMPORTANT: You MUST adjust your approach:**
1. Use ONE tool at a time - do NOT call multiple tools in parallel
2. Wait for each tool result before calling the next tool
3. Avoid starting with `pwd` or multiple Read calls at once
4. If you need to read multiple files, read them one by one
5. Take a more incremental, step-by-step approach

Start by focusing on ONE specific action for this subtask.'''

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@apps/backend/agents/coder.py`:
- Around line 224-235: The helper _reset_concurrency_state and the surrounding
variable declarations (consecutive_concurrency_errors, current_retry_delay,
concurrency_error_context, and INITIAL_RETRY_DELAY_SECONDS) are mis-formatted
causing ruff to fail; reformat the block so the inline comment and the
optional-typed assignment for concurrency_error_context use conventional
single-line or properly wrapped multi-line syntax (remove awkward
parentheses/newlines around the comment), then run ruff format (e.g., ruff
format apps/backend/agents/coder.py) to apply the correct formatting.

Comment on lines +230 to +238
def _reset_concurrency_state() -> None:
    """Reset concurrency error tracking state after a successful session or non-concurrency error."""
    nonlocal \
        consecutive_concurrency_errors, \
        current_retry_delay, \
        concurrency_error_context
    consecutive_concurrency_errors = 0
    current_retry_delay = INITIAL_RETRY_DELAY_SECONDS
    concurrency_error_context = None

Bug: The _reset_concurrency_state function fails to reset planning_retry_context, which can cause stale error context to be used in subsequent planning attempts after a specific error sequence.
Severity: MEDIUM

Suggested Fix

Add planning_retry_context to the nonlocal statement within the _reset_concurrency_state function and set it to None along with the other state variables being reset.

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.

Location: apps/backend/agents/coder.py#L230-L238

Potential issue: The function `_reset_concurrency_state` is designed to clear
error-related state but fails to reset the `planning_retry_context` variable. This can
lead to a bug under a specific sequence of events: if a concurrency error occurs during
the planning phase, `planning_retry_context` is set. If a subsequent planning attempt
then fails with a different, non-concurrency error, `_reset_concurrency_state` is called
but does not clear `planning_retry_context`. As a result, the next planning attempt will
incorrectly use the stale error context from the initial concurrency error, potentially
misleading the agent with outdated guidance.
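
One way the suggested fix could look, shown as a self-contained closure so it can run on its own. The `make_retry_state` wrapper, `record_planning_concurrency_error`, and `snapshot` are hypothetical scaffolding standing in for the enclosing agent function; only the variable names and the idea of also clearing `planning_retry_context` come from the review comment. Whether the maintainers applied this change is not stated in the thread.

```python
from typing import Callable, Dict, Optional

INITIAL_RETRY_DELAY_SECONDS = 2  # assumed value, from the 2s first delay


def make_retry_state() -> Dict[str, Callable]:
    """Hypothetical stand-in for the enclosing function's local state."""
    consecutive_concurrency_errors = 0
    current_retry_delay = INITIAL_RETRY_DELAY_SECONDS
    concurrency_error_context: Optional[str] = None
    planning_retry_context: Optional[str] = None

    def record_planning_concurrency_error(message: str) -> None:
        nonlocal consecutive_concurrency_errors, current_retry_delay
        nonlocal concurrency_error_context, planning_retry_context
        consecutive_concurrency_errors += 1
        current_retry_delay *= 2
        concurrency_error_context = message
        planning_retry_context = message

    def _reset_concurrency_state() -> None:
        nonlocal consecutive_concurrency_errors, current_retry_delay
        nonlocal concurrency_error_context, planning_retry_context
        consecutive_concurrency_errors = 0
        current_retry_delay = INITIAL_RETRY_DELAY_SECONDS
        concurrency_error_context = None
        planning_retry_context = None  # the addition the review suggests

    def snapshot() -> dict:
        return {
            "consecutive_concurrency_errors": consecutive_concurrency_errors,
            "current_retry_delay": current_retry_delay,
            "concurrency_error_context": concurrency_error_context,
            "planning_retry_context": planning_retry_context,
        }

    return {
        "record": record_planning_concurrency_error,
        "reset": _reset_concurrency_state,
        "snapshot": snapshot,
    }


state = make_retry_state()
state["record"]("use one tool at a time")
state["reset"]()
print(state["snapshot"]()["planning_retry_context"])  # None
```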


@MikeeBuilds MikeeBuilds added the labels bug (Something isn't working), priority/high (Important, fix this week), area/fullstack (This is Frontend + Backend), stable-roadmap, and v2.7.6 on Jan 29, 2026
@AndyMik90 AndyMik90 self-assigned this Jan 29, 2026
Owner

@AndyMik90 AndyMik90 left a comment

✅ Auto Claude Review - APPROVED

Status: Ready to Merge

Merge Verdict: ✅ READY TO MERGE

✅ Ready to merge - All checks passing, no blocking issues found.

No blocking issues found

Risk Assessment

| Factor | Level | Notes |
|---|---|---|
| Complexity | Medium | Based on lines changed |
| Security Impact | None | Based on security findings |
| Scope Coherence | Good | Based on structural review |

Generated by Auto Claude PR Review


This automated review found no blocking issues. The PR can be safely merged.

Generated by Auto Claude

@AndyMik90 AndyMik90 merged commit 0aea4fb into AndyMik90:develop Jan 29, 2026
21 checks passed

Development

Successfully merging this pull request may close these issues.

Agent stuck in retry loop on tool concurrency errors (400), burning tokens without progress

2 participants