Skip to content

chore: promote staging to staging-promote/806d4028-23330265305 (2026-03-20 06:16 UTC)#1456

Open
ironclaw-ci[bot] wants to merge 3 commits intostaging-promote/806d4028-23330265305from
staging-promote/b952d229-23331469361
Open

chore: promote staging to staging-promote/806d4028-23330265305 (2026-03-20 06:16 UTC)#1456
ironclaw-ci[bot] wants to merge 3 commits intostaging-promote/806d4028-23330265305from
staging-promote/b952d229-23331469361

Conversation

@ironclaw-ci
Copy link
Contributor

@ironclaw-ci ironclaw-ci bot commented Mar 20, 2026

Auto-promotion from staging CI

Batch range: c4ab382522c86e7e19d55fee760b125fb1970518..b952d229f941298af5748d421edca6513382f7f5
Promotion branch: staging-promote/b952d229-23331469361
Base: staging-promote/806d4028-23330265305
Triggered by: Staging CI batch at 2026-03-20 06:16 UTC

Commits in this batch (10):

Current commits in this promotion (3)

Current base: staging-promote/806d4028-23330265305
Current head: staging-promote/b952d229-23331469361
Current range: origin/staging-promote/806d4028-23330265305..origin/staging-promote/b952d229-23331469361

Auto-updated by staging promotion metadata workflow

Waiting for gates:

  • Tests: pending
  • E2E: pending
  • Claude Code review: pending (will post comments on this PR)

Auto-created by staging-ci workflow

zmanian and others added 3 commits March 19, 2026 22:36
…ion (#1234)

* feat(agent): activate stuck_threshold for time-based stuck job detection (#1223)

The stuck_threshold field on DefaultSelfRepair was defined but never used
(marked #[allow(dead_code)]). Jobs that got stuck in InProgress without
transitioning to Stuck state (e.g., deadlock, unhandled timeout) were
never detected by self-repair.

Changes:
- Add find_stuck_jobs_with_threshold() to ContextManager that detects
  InProgress jobs running longer than the threshold
- Wire stuck_threshold into detect_stuck_jobs() so it uses threshold-based
  detection alongside explicit Stuck state detection
- Remove dead_code annotation from stuck_threshold
- Accept InProgress jobs in the stuck job detection filter

Configurable via AGENT_STUCK_THRESHOLD_SECS (default: 300s).

Closes #1223

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(agent): address PR #1234 review feedback for stuck_threshold

- Transition InProgress jobs to Stuck before returning them from
  detect_stuck_jobs(), so attempt_recovery() (which requires Stuck
  state) works correctly on threshold-detected jobs
- Add detect-and-repair E2E test covering the full InProgress ->
  Stuck -> recovery -> InProgress cycle
- Rename idle_threshold -> elapsed_threshold in find_stuck_jobs_with_threshold
  for clarity
- Add `use std::time::Duration` import and remove fully qualified paths
- Update CLAUDE.md to reflect that stuck_threshold is now actively used

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: measure stuck_duration from Stuck transition, handle InProgress→Stuck in repair

- Fix stuck_duration computation to use the most recent Stuck transition
  timestamp instead of started_at, preventing jobs that ran for hours
  before becoming stuck from immediately exceeding the threshold
- Fix last_activity to also use the Stuck transition timestamp
- Transition InProgress jobs to Stuck before calling attempt_recovery()
  in repair_stuck_job(), since attempt_recovery() requires JobState::Stuck
- Add regression test verifying a recently-stuck job with old started_at
  is not misdetected as exceeding a 5-minute threshold

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(agent): address Copilot review comments on PR #1234

- Add comment in find_stuck_jobs_with_threshold() noting that started_at
  is not reset on Stuck->InProgress recovery, which may cause false
  positives for recovered jobs. Suggests tracking in_progress_since or
  using the most recent StateTransition as a future improvement.

- Fix misleading test comment in stuck_duration_measured_from_stuck_transition
  test: explicitly Stuck jobs are always returned regardless of threshold.
  The test verifies stuck_duration is near-zero, not that the job is excluded.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: ilblackdragon@gmail.com <ilblackdragon@gmail.com>
* fix(security): validate embedding base URLs to prevent SSRF (#1103)

User-configurable base URLs (OLLAMA_BASE_URL, EMBEDDING_BASE_URL) were
passed directly to reqwest with no validation, allowing SSRF attacks
against cloud metadata endpoints, internal services, or file:// URIs.

Adds validate_base_url() that rejects:
- Non-HTTP(S) schemes (file://, ftp://)
- HTTP to non-localhost destinations (prevents credential leakage)
- HTTPS to private/loopback/link-local/metadata IPs (169.254.169.254,
  10.x, 192.168.x, 172.16-31.x, CGN 100.64/10)
- IPv4-mapped IPv6 bypass attempts

Validation runs at config resolution time so bad URLs fail at startup.

Closes #1103

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(security): add DNS resolution check, ULA blocking, and NEARAI_BASE_URL validation

Address review feedback:
- Resolve hostnames to IPs and check all resolved addresses against the
  blocklist (prevents DNS-based SSRF bypass where attacker uses a domain
  pointing to 169.254.169.254)
- Add IPv6 Unique Local Address (fc00::/7) to the blocklist
- Validate NEARAI_BASE_URL in llm config (was missing — especially
  dangerous since bearer tokens are forwarded to the configured URL)
- Allow DNS resolution failure gracefully (don't block startup when DNS
  is temporarily unavailable)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* style: fix formatting

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(security): add SSRF validation to all base URL chokepoints

- Add validate_base_url() in resolve_registry_provider() covering all
  LLM providers (OpenAI, Anthropic, Ollama, openai_compatible, etc.)
- Add validate_base_url() for NEARAI_AUTH_URL in LlmConfig::resolve()
- Add validate_base_url() for TRANSCRIPTION_BASE_URL in TranscriptionConfig
- Add missing SSRF test cases: CGN range, IPv4-mapped IPv6, ULA IPv6,
  URLs with credentials, empty/invalid URLs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* ci: re-trigger CI with latest changes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* ci: trigger new run with skip-regression-check label

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(security): validate embedding base URLs to prevent SSRF (#1103)

User-configurable base URLs (OLLAMA_BASE_URL, EMBEDDING_BASE_URL) were
passed directly to reqwest with no validation, allowing SSRF attacks
against cloud metadata endpoints, internal services, or file:// URIs.

Adds validate_base_url() that rejects:
- Non-HTTP(S) schemes (file://, ftp://)
- HTTP to non-localhost destinations (prevents credential leakage)
- HTTPS to private/loopback/link-local/metadata IPs (169.254.169.254,
  10.x, 192.168.x, 172.16-31.x, CGN 100.64/10)
- IPv4-mapped IPv6 bypass attempts

Validation runs at config resolution time so bad URLs fail at startup.

Closes #1103

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(security): add DNS resolution check, ULA blocking, and NEARAI_BASE_URL validation

Address review feedback:
- Resolve hostnames to IPs and check all resolved addresses against the
  blocklist (prevents DNS-based SSRF bypass where attacker uses a domain
  pointing to 169.254.169.254)
- Add IPv6 Unique Local Address (fc00::/7) to the blocklist
- Validate NEARAI_BASE_URL in llm config (was missing — especially
  dangerous since bearer tokens are forwarded to the configured URL)
- Allow DNS resolution failure gracefully (don't block startup when DNS
  is temporarily unavailable)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* style: fix formatting

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(security): add SSRF validation to all base URL chokepoints

- Add validate_base_url() in resolve_registry_provider() covering all
  LLM providers (OpenAI, Anthropic, Ollama, openai_compatible, etc.)
- Add validate_base_url() for NEARAI_AUTH_URL in LlmConfig::resolve()
- Add validate_base_url() for TRANSCRIPTION_BASE_URL in TranscriptionConfig
- Add missing SSRF test cases: CGN range, IPv4-mapped IPv6, ULA IPv6,
  URLs with credentials, empty/invalid URLs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* ci: re-trigger CI with latest changes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* ci: trigger new run with skip-regression-check label

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: ilblackdragon@gmail.com <ilblackdragon@gmail.com>
* fix: prefer execution-local message routing metadata

* test: cover message routing fallback metadata

* refactor: simplify message target resolution

* fix: ignore stale channel defaults for notify user metadata
@github-actions github-actions bot added scope: agent Agent core (agent loop, router, scheduler) scope: tool/builtin Built-in tools scope: docs Documentation size: XL 500+ changed lines risk: medium Business logic, config, or moderate-risk modules contributor: core 20+ merged PRs labels Mar 20, 2026
@claude
Copy link

claude bot commented Mar 20, 2026

Code review

Found 7 issues:

  1. [HIGH:HIGH] N+1 database queries in stuck job detection loop

https://github.com/anthropics/ironclaw/blob/5c94138a0794948a601aa704664d7e8439a09c4c/src/agent/self_repair.rs#L111-L120

The detect_stuck_jobs() function calls get_context(job_id) twice per stuck job: first at line 114 to check state, then again at line 159 after potential state transition. This creates an N+1 query pattern that scales with the number of stuck jobs. Consider keeping the context from the first fetch or combining the operations.

  1. [HIGH:HIGH] Blocking DNS resolution during config load with no timeout

https://github.com/anthropics/ironclaw/blob/5c94138a0794948a601aa704664d7e8439a09c4c/src/config/helpers.rs#L285-L292

The validate_base_url() function performs blocking DNS resolution via to_socket_addrs() without a timeout. Called during startup from multiple locations (embeddings.rs:382, llm.rs:685/712/725, transcription.rs:749), a slow or misconfigured DNS lookup can block application startup indefinitely. Add a timeout or defer validation to async runtime.

  1. [MEDIUM:HIGH] Stringly-typed routing metadata with duplicated keys

https://github.com/anthropics/ironclaw/blob/5c94138a0794948a601aa704664d7e8439a09c4c/src/agent/agent_loop.rs#L123-L132

The chat_tool_execution_metadata() function builds JSON with hardcoded string keys ("notify_channel", "notify_user", "notify_thread_id", "notify_metadata"). Per CLAUDE.md's "prefer strong types over strings" guidance, these should be a strongly-typed struct (e.g., ExecutionRoutingMetadata) serialized to JSON, enabling compile-time verification of all callers and preventing key name divergence.

  1. [MEDIUM:HIGH] Potential infinite stuck job re-detection loop

https://github.com/anthropics/ironclaw/blob/5c94138a0794948a601aa704664d7e8439a09c4c/src/context/manager.rs#L797-L809

When a recovered job transitions from Stuck back to InProgress, its original started_at timestamp is not reset. The next repair scan will re-detect it as stuck (elapsed time still > threshold), creating a re-detection loop. The code acknowledges this (lines 798-801) but marks it as "future improvement." Track in_progress_since separately or use the most recent StateTransition with to == InProgress to avoid false positives.

  1. [MEDIUM:MEDIUM] Repeated string allocations in message tool helpers

https://github.com/anthropics/ironclaw/blob/5c94138a0794948a601aa704664d7e8439a09c4c/src/tools/builtin/message.rs#L872-L910

Helper functions metadata_string(), metadata_notify_user(), and routing resolution create multiple String clones and allocations on every message send. While not in a hot loop, optimize by borrowing where possible or reducing intermediate allocations.

  1. [LOW:HIGH] Silent TOCTOU on InProgress→Stuck state transition

https://github.com/anthropics/ironclaw/blob/5c94138a0794948a601aa704664d7e8439a09c4c/src/agent/self_repair.rs#L159-L177

After transitioning a job to Stuck at line 134, the code re-fetches context at line 159 to capture the new transition timestamp. If this re-fetch fails, the job silently disappears from stuck_jobs output but the transition already occurred, making the stuck job invisible to repair. Log at error! level (not warn!) when re-fetch fails after successful transition.

  1. [LOW:MEDIUM] Non-obvious magic string filtering in metadata lookup

https://github.com/anthropics/ironclaw/blob/5c94138a0794948a601aa704664d7e8439a09c4c/src/tools/builtin/message.rs#L79-L81

The metadata_notify_user() helper filters out the string value "default" as a special case. This filtering is not documented in the function signature and violates the CLAUDE.md principle of "comments for non-obvious logic only." Use a named constant or enum variant to make the magic value explicit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

contributor: core 20+ merged PRs risk: medium Business logic, config, or moderate-risk modules scope: agent Agent core (agent loop, router, scheduler) scope: docs Documentation scope: tool/builtin Built-in tools size: XL 500+ changed lines staging-promotion

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants