feat(self-repair): wire stuck_threshold, store, and builder by zmanian · Pull Request #712 · nearai/ironclaw

zmanian · 2026-03-08T03:03:17Z

Summary

Closes #647

stuck_threshold: detect_stuck_jobs() now filters by duration — only jobs stuck longer than the configured threshold are reported, preventing premature repair attempts on recently-stuck jobs
with_store(): Wired in agent_loop.rs from AgentDeps.store so detect_broken_tools() can query the database for tool failure records
with_builder(): register_builder_tool() now returns the Arc<dyn SoftwareBuilder>, propagated through AppComponents and AgentDeps so repair_broken_tool() can rebuild broken WASM tools
tools: Passed alongside builder for hot-reload logging after successful repair
All #[allow(dead_code)] annotations removed; with_store()/with_builder() made fully public
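
The wiring above follows a builder-style pattern for optional dependencies. A minimal sketch of that pattern, assuming simplified stand-ins: the `Store` and `Builder` traits, the `SelfRepair` struct, and the `can_repair_tools()` helper below are hypothetical placeholders for illustration, not the actual `Database`, `SoftwareBuilder`, or `DefaultSelfRepair` definitions.

```rust
use std::sync::Arc;
use std::time::Duration;

// Hypothetical stand-ins for the real store and builder traits.
pub trait Store: Send + Sync {}
pub trait Builder: Send + Sync {}

pub struct SelfRepair {
    stuck_threshold: Duration,
    store: Option<Arc<dyn Store>>,
    builder: Option<Arc<dyn Builder>>,
}

impl SelfRepair {
    pub fn new(stuck_threshold: Duration) -> Self {
        Self { stuck_threshold, store: None, builder: None }
    }

    /// Attach a store so failure records can be queried.
    pub fn with_store(mut self, store: Arc<dyn Store>) -> Self {
        self.store = Some(store);
        self
    }

    /// Attach a builder so broken tools can be rebuilt.
    pub fn with_builder(mut self, builder: Arc<dyn Builder>) -> Self {
        self.builder = Some(builder);
        self
    }

    /// Hypothetical helper: tool repair needs both optional dependencies.
    pub fn can_repair_tools(&self) -> bool {
        self.store.is_some() && self.builder.is_some()
    }

    pub fn stuck_threshold(&self) -> Duration {
        self.stuck_threshold
    }
}

// No-op implementations, only so the sketch can be exercised.
pub struct NullStore;
impl Store for NullStore {}
pub struct NullBuilder;
impl Builder for NullBuilder {}
```

Because each `with_*` method takes and returns `self`, callers that lack a store or builder can simply skip the corresponding call and the field stays `None`.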

Test plan

New test: detect_stuck_jobs_filters_by_threshold — verifies that a job stuck for <1s is filtered out by a 1h threshold
New test: detect_stuck_jobs_includes_when_over_threshold — verifies that a zero threshold includes all stuck jobs
Existing test updated: detect_stuck_job_finds_stuck_state uses zero threshold to match new behavior
All 9 self_repair tests pass
Zero clippy warnings (--all-features)
Both feature flags compile (postgres, libsql)
cargo fmt --check clean

henrypark133 · 2026-03-13T04:00:25Z

Bug: the newly wired stuck_threshold is measured from ctx.started_at, not from when the job entered JobState::Stuck. A job that ran for two hours and only just became stuck will now look over threshold immediately, so self-repair triggers right away even if the configured stuck threshold is much smaller. If this threshold is meant to debounce repair attempts, it should be computed from the timestamp of the transition to Stuck (for example, the last matching entry in ctx.transitions).

ilblackdragon · 2026-03-15T23:25:04Z

Let's add a test that actually uses it and at least mocks the build step and the loading of the new tool to unstick the job.

ilblackdragon

Let's add an e2e test for the full stuck > self repair > builder > new tool > unstuck

zmanian · 2026-03-16T06:00:46Z

Added the E2E test in a8db274: e2e_stuck_job_repair_and_tool_rebuild

Tests the full cycle:

Job transitions Pending -> InProgress -> Stuck
detect_stuck_jobs() finds it (zero threshold)
repair_stuck_job() recovers it back to InProgress
A broken tool is submitted to repair_broken_tool()
MockBuilder (impl SoftwareBuilder) builds successfully
Verifies builder was invoked and repair returned Success

Uses a real libsql test database for the store path (increment_repair_attempts, mark_tool_repaired). The MockBuilder returns a successful BuildResult without requiring an LLM or filesystem.
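
The counting-mock pattern used by the test can be sketched as follows. This is an illustration under stated assumptions, not the code from a8db274: the `Builder` trait here is a synchronous, hypothetical stand-in for `SoftwareBuilder`, `BuildOutcome` is a hypothetical two-field stand-in for `BuildResult`, and `repair_broken_tool` is reduced to a free function.

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Arc;

// Hypothetical, simplified result type; the real BuildResult may differ.
#[derive(Debug, Clone, PartialEq)]
pub struct BuildOutcome {
    pub success: bool,
    pub registered: bool,
}

// Hypothetical synchronous stand-in for the SoftwareBuilder trait.
pub trait Builder: Send + Sync {
    fn build(&self, tool_name: &str) -> Result<BuildOutcome, String>;
}

/// Test double that always succeeds and counts how often it was invoked.
#[derive(Default)]
pub struct MockBuilder {
    build_count: AtomicU32,
}

impl MockBuilder {
    pub fn builds(&self) -> u32 {
        self.build_count.load(Ordering::SeqCst)
    }
}

impl Builder for MockBuilder {
    fn build(&self, _tool_name: &str) -> Result<BuildOutcome, String> {
        self.build_count.fetch_add(1, Ordering::SeqCst);
        Ok(BuildOutcome { success: true, registered: true })
    }
}

/// Simplified repair step: delegate to the builder and report success.
pub fn repair_broken_tool(builder: &Arc<dyn Builder>, tool_name: &str) -> bool {
    matches!(builder.build(tool_name), Ok(outcome) if outcome.success)
}
```

Keeping a second `Arc<MockBuilder>` handle in the test lets it read the counter after the trait object has been handed to the code under test.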

All 10 self_repair tests pass, zero clippy warnings.

ilblackdragon

Review: feat(self-repair): wire stuck_threshold, store, and builder

Good progress wiring up the previously-dead fields. The builder integration and test harness (MockBuilder with AtomicU32) are well-designed, and the overall approach of threading store and builder through AppComponents → AgentDeps → DefaultSelfRepair is clean. A few issues need attention before merging.

1. Critical: `stuck_duration` is measured from `started_at`, not from the stuck transition

In src/agent/self_repair.rs, detect_stuck_jobs() computes stuck_duration as:

let stuck_duration = ctx
    .started_at
    .map(|start| {
        let now = Utc::now();
        let duration = now.signed_duration_since(start);
        Duration::from_secs(duration.num_seconds().max(0) as u64)
    })
    .unwrap_or_default();

started_at is set once — when the job first transitions to InProgress (see src/context/state.rs:261, guarded by if self.started_at.is_none()). It is never updated on subsequent Stuck → InProgress recovery cycles.

Impact: A job that ran successfully for 2 hours before becoming stuck will have stuck_duration ≈ 2h, immediately exceeding any reasonable threshold (e.g., 5 minutes). The threshold filter added by this PR (if stuck_duration < self.stuck_threshold { continue; }) becomes useless — every long-running job that becomes stuck will be reported immediately.

Fix: Use the timestamp of the most recent Stuck transition from ctx.transitions:

let stuck_since = ctx.transitions.iter().rev()
    .find(|t| t.to == JobState::Stuck)
    .map(|t| t.timestamp);

let stuck_duration = stuck_since
    .map(|ts| {
        let duration = Utc::now().signed_duration_since(ts);
        Duration::from_secs(duration.num_seconds().max(0) as u64)
    })
    .unwrap_or_default();
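
To make the difference concrete, here is a self-contained sketch of the same idea using only the standard library. The `JobContext`, `Transition`, and `JobState` types are reduced, hypothetical versions of the real ones, `SystemTime` stands in for the chrono timestamps, and `now` is passed in explicitly so the behavior is deterministic.

```rust
use std::time::{Duration, SystemTime};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum JobState {
    Pending,
    InProgress,
    Stuck,
}

// Reduced, hypothetical versions of the real transition and context types.
pub struct Transition {
    pub to: JobState,
    pub timestamp: SystemTime,
}

pub struct JobContext {
    pub started_at: Option<SystemTime>,
    pub transitions: Vec<Transition>,
}

/// Time spent in the current Stuck state, measured from the most recent
/// transition into Stuck. Returns zero if no such transition exists.
pub fn stuck_duration(ctx: &JobContext, now: SystemTime) -> Duration {
    ctx.transitions
        .iter()
        .rev()
        .find(|t| t.to == JobState::Stuck)
        .and_then(|t| now.duration_since(t.timestamp).ok())
        .unwrap_or_default()
}

/// Mirrors the filter in the PR: jobs below the threshold are skipped.
pub fn exceeds_threshold(ctx: &JobContext, now: SystemTime, threshold: Duration) -> bool {
    stuck_duration(ctx, now) >= threshold
}
```

With a job started two hours ago that became stuck one second ago, `stuck_duration` yields one second, so a 5-minute threshold filters the job out while a zero threshold still reports it.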

2. Same issue in `StuckJob.last_activity`

last_activity: ctx.started_at.unwrap_or(ctx.created_at),

This field is named and documented as "last activity", but it reflects when the job first started, not when it became stuck. If it is meant to communicate when the job was last doing useful work (before it got stuck), it should use the timestamp of the most recent InProgress → Stuck transition. If it is intentionally the job start time, it should be renamed to avoid confusion.
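
One way to express the first option is a fallback chain: prefer the stuck-transition timestamp, then `started_at`, then `created_at`. The sketch below assumes reduced, hypothetical versions of the context types and uses `SystemTime` in place of the chrono timestamps; the fallback order is an illustration, not something the PR specifies.

```rust
use std::time::SystemTime;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum JobState {
    InProgress,
    Stuck,
}

// Reduced, hypothetical versions of the real transition and context types.
pub struct Transition {
    pub to: JobState,
    pub timestamp: SystemTime,
}

pub struct JobContext {
    pub created_at: SystemTime,
    pub started_at: Option<SystemTime>,
    pub transitions: Vec<Transition>,
}

/// Most recent moment the job was known to be active: the latest transition
/// into Stuck if present, otherwise the start time, otherwise creation time.
pub fn last_activity(ctx: &JobContext) -> SystemTime {
    ctx.transitions
        .iter()
        .rev()
        .find(|t| t.to == JobState::Stuck)
        .map(|t| t.timestamp)
        .or(ctx.started_at)
        .unwrap_or(ctx.created_at)
}
```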

3. Test gap: all tests create jobs that become stuck immediately

The new tests (detect_stuck_jobs_filters_by_threshold and detect_stuck_jobs_includes_when_over_threshold) and the existing detect_stuck_job_finds_stuck_state test all create a job and immediately transition it to Stuck. This means started_at ≈ stuck_timestamp ≈ now, so the bug in (1) is invisible — stuck_duration is near-zero whether measured from started_at or the stuck transition.

A test that exposes the bug:

#[tokio::test]
async fn stuck_duration_measured_from_stuck_transition_not_started_at() {
    let cm = Arc::new(ContextManager::new(10));
    let job_id = cm.create_job("Long runner", "desc").await.unwrap();

    // Transition to InProgress (sets started_at to now).
    cm.update_context(job_id, |ctx| ctx.transition_to(JobState::InProgress, None))
        .await.unwrap().unwrap();

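    // Manually backdate started_at to simulate a job that ran for 2 hours.
    // The closure returns (), so a single unwrap suffices; a bare Ok(()) here
    // would leave the error type uninferable (E0282).
    cm.update_context(job_id, |ctx| {
        ctx.started_at = Some(Utc::now() - chrono::Duration::hours(2));
    }).await.unwrap();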

    // Now transition to Stuck (stuck transition timestamp is ~now).
    cm.update_context(job_id, |ctx| {
        ctx.transition_to(JobState::Stuck, Some("wedged".into()))
    }).await.unwrap().unwrap();

    // With a 5-minute threshold, the job JUST became stuck — should NOT be detected.
    let repair = DefaultSelfRepair::new(cm, Duration::from_secs(300), 3);
    let stuck = repair.detect_stuck_jobs().await;
    assert!(stuck.is_empty(),
        "Job stuck for <1s should not exceed 5min threshold, \
         but stuck_duration was computed from started_at (2h ago)");
}

This test will fail on the current implementation and pass once (1) is fixed.

4. Stray artifact: `// ci fix` in `src/tools/builtin/memory.rs`

The diff adds a bare // ci fix comment at the end of src/tools/builtin/memory.rs. This appears to be a leftover from a CI troubleshooting session and should be removed.

5. Misleading log: "hot-reloaded into registry"

if result.registered {
    if self.tools.is_some() {
        tracing::info!("Repaired tool '{}' hot-reloaded into registry", tool.name);
    } else {
        tracing::info!("Repaired tool '{}' auto-registered", tool.name);
    }
}

The self.tools field (Option<Arc<ToolRegistry>>) is checked via is_some() but never actually used — no code reads from or writes to the registry here. The result.registered flag comes from SoftwareBuilder::build(), meaning the builder registered the tool (the LlmSoftwareBuilder already holds its own Arc<ToolRegistry>). Logging "hot-reloaded into registry" when this code path did nothing with the registry is misleading. Either:

Actually use self.tools to perform a reload/refresh (if that's the intent), or
Drop the distinction and keep the existing "auto-registered" message until real hot-reload logic is implemented.
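
The second option amounts to choosing the message from the registered flag alone. A minimal sketch, where `repair_log_message` is a hypothetical helper introduced only to make the decision testable; the actual code would pass the string to `tracing::info!` directly.

```rust
/// Message to log after a repair build, or None if the builder did not
/// register the tool. Whether a tool registry handle is held plays no part.
pub fn repair_log_message(tool_name: &str, registered: bool) -> Option<String> {
    registered.then(|| format!("Repaired tool '{}' auto-registered", tool_name))
}
```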

6. Positive notes

The wiring through AppComponents.builder → AgentDeps.builder → DefaultSelfRepair is straightforward and follows the existing pattern for other optional deps.
Removing the #[allow(dead_code)] annotations and pub(crate) → pub visibility changes are appropriate now that the fields are used.
MockBuilder with AtomicU32 for build counting is a clean test pattern.
The register_builder_tool returning Arc<dyn SoftwareBuilder> instead of () is a nice refactor that avoids reconstructing the builder elsewhere.
The threshold filter in detect_stuck_jobs is the right addition — it just needs to measure from the correct timestamp.

zmanian · 2026-03-16T08:22:06Z

Thanks for the thorough review, @ilblackdragon. All points are valid — here's the plan:

E2E test: Will add a full stuck → self-repair → builder → new tool → unstuck end-to-end test.
Critical: stuck_duration using started_at: Good catch — a job that ran for hours before becoming stuck would immediately exceed the threshold. Will fix to derive stuck_duration from the most recent Stuck transition timestamp in ctx.transitions.
StuckJob.last_activity same issue: Will fix similarly to use the stuck transition timestamp rather than started_at.
Test gap: Will add the suggested test with a job that runs for a while before becoming stuck, so the bug would be caught by the test suite.
Stray // ci fix comment in memory.rs: Will remove.
Misleading "hot-reloaded into registry" log: Will drop the "hot-reloaded" distinction until real hot-reload is implemented.

Wire the previously dead-code fields in DefaultSelfRepair:
- stuck_threshold: detect_stuck_jobs() now filters by duration, only reporting jobs stuck longer than the configured threshold
- with_store(): wired in agent_loop.rs from AgentDeps.store for tool failure tracking via Database trait
- with_builder(): wired from register_builder_tool() return value through AppComponents and AgentDeps for automatic tool rebuilding
- tools: passed alongside builder for hot-reload logging

Remove all #[allow(dead_code)] annotations. Add regression tests for threshold-based filtering (both above and below threshold).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…ness

After rebase onto staging, AgentDeps gained a `builder` field for self-repair tool rebuilding. The gateway workflow test harness was missing this field, causing CI compilation failure.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Tests the full self-repair flow requested in review:
1. Job transitions Pending -> InProgress -> Stuck
2. detect_stuck_jobs() finds it (zero threshold)
3. repair_stuck_job() recovers it back to InProgress
4. A broken tool is repaired via MockBuilder
5. Verify builder was invoked and repair succeeded

Uses a MockBuilder (impl SoftwareBuilder) that returns successful BuildResult without requiring an LLM or filesystem. Uses libsql test database for the store (increment_repair_attempts, mark_tool_repaired).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…tarted_at

- Use ctx.transitions to find the most recent Stuck transition timestamp instead of ctx.started_at (which reflects job start, not stuck time)
- Fix StuckJob.last_activity to use stuck transition timestamp
- Remove misleading "hot-reloaded into registry" log
- Remove stray "// ci fix" comment in memory.rs
- Add regression test: backdated started_at must not inflate stuck_duration

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

github-actions bot added the size: M (50-199 changed lines), scope: agent (agent core: agent loop, router, scheduler), scope: tool (tool infrastructure), risk: medium (business logic, config, or moderate-risk modules), and contributor: core (20+ merged PRs) labels and removed the size: M (50-199 changed lines) label Mar 8, 2026

henrypark133 changed the base branch from main to staging March 10, 2026 02:19

zmanian force-pushed the feat/647-wire-self-repair branch from 69be66a to cf92503 March 12, 2026 21:52

github-actions bot added the size: M (50-199 changed lines) and scope: tool/builtin (built-in tools) labels Mar 12, 2026

zmanian closed this Mar 12, 2026

zmanian reopened this Mar 12, 2026

zmanian force-pushed the feat/647-wire-self-repair branch from 52ff273 to 3a12dc8 March 12, 2026 22:42

zmanian enabled auto-merge (squash) March 12, 2026 23:14

zmanian requested review from henrypark133 and ilblackdragon March 13, 2026 03:40

ilblackdragon requested changes Mar 16, 2026

github-actions bot removed the size: M 50-199 changed lines label Mar 16, 2026

github-actions bot added the size: L 200-499 changed lines label Mar 16, 2026

zmanian requested a review from ilblackdragon March 16, 2026 07:44

ilblackdragon requested changes Mar 16, 2026

zmanian and others added 4 commits March 17, 2026 03:00

ci: retrigger CI

bd06d76

fix: force CI refresh after path_routing_tests dedup

ecd8313

ci: re-trigger CI with latest changes

d08294b

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fix: add type annotation to Ok(()) in test to resolve E0282

fdd1257

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

zmanian force-pushed the feat/647-wire-self-repair branch from 9829d95 to fdd1257 March 17, 2026 03:05

zmanian wants to merge 8 commits into staging from feat/647-wire-self-repair

3 participants
