
feat(self-repair): wire stuck_threshold, store, and builder#712

Open
zmanian wants to merge 8 commits into staging from feat/647-wire-self-repair

Conversation

@zmanian
Collaborator

@zmanian zmanian commented Mar 8, 2026

Summary

Closes #647

  • stuck_threshold: detect_stuck_jobs() now filters by duration — only jobs stuck longer than the configured threshold are reported, preventing premature repair attempts on recently-stuck jobs
  • with_store(): Wired in agent_loop.rs from AgentDeps.store so detect_broken_tools() can query the database for tool failure records
  • with_builder(): register_builder_tool() now returns the Arc<dyn SoftwareBuilder>, propagated through AppComponents and AgentDeps so repair_broken_tool() can rebuild broken WASM tools
  • tools: Passed alongside builder for hot-reload logging after successful repair
  • All #[allow(dead_code)] annotations removed; with_store()/with_builder() made fully public
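
The optional-dependency wiring described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the crate's code: `Store`, `Builder`, `MemoryStore`, `NoopBuilder`, `SelfRepair`, `can_repair_tools`, and `wire` are hypothetical stand-ins, and only the `with_store()`/`with_builder()` names and the idea of optional `Arc<dyn ...>` dependencies come from the PR description.

```rust
use std::sync::Arc;
use std::time::Duration;

// Hypothetical marker traits standing in for the crate's database and
// `SoftwareBuilder` abstractions.
trait Store {}
trait Builder {}

struct MemoryStore;
impl Store for MemoryStore {}

struct NoopBuilder;
impl Builder for NoopBuilder {}

/// Simplified self-repair component; optional dependencies start out unset.
struct SelfRepair {
    stuck_threshold: Duration,
    store: Option<Arc<dyn Store>>,
    builder: Option<Arc<dyn Builder>>,
}

impl SelfRepair {
    fn new(stuck_threshold: Duration) -> Self {
        Self { stuck_threshold, store: None, builder: None }
    }

    /// Attach the store used to look up tool failure records.
    fn with_store(mut self, store: Arc<dyn Store>) -> Self {
        self.store = Some(store);
        self
    }

    /// Attach the builder used to rebuild broken tools.
    fn with_builder(mut self, builder: Arc<dyn Builder>) -> Self {
        self.builder = Some(builder);
        self
    }

    /// In this sketch, tool repair needs both the records and a builder.
    fn can_repair_tools(&self) -> bool {
        self.store.is_some() && self.builder.is_some()
    }
}

/// Wires in whichever optional dependencies the caller actually has.
fn wire(store: Option<Arc<dyn Store>>, builder: Option<Arc<dyn Builder>>) -> SelfRepair {
    let mut repair = SelfRepair::new(Duration::from_secs(300));
    if let Some(s) = store {
        repair = repair.with_store(s);
    }
    if let Some(b) = builder {
        repair = repair.with_builder(b);
    }
    repair
}
```

With this shape, a deployment without a builder still gets stuck-job detection, and only the tool-rebuild path is disabled.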

Test plan

  • New test: detect_stuck_jobs_filters_by_threshold — verifies job stuck <1s is filtered by 1h threshold
  • New test: detect_stuck_jobs_includes_when_over_threshold — verifies zero threshold includes all stuck jobs
  • Existing test updated: detect_stuck_job_finds_stuck_state uses zero threshold to match new behavior
  • All 9 self_repair tests pass
  • Zero clippy warnings (--all-features)
  • Both feature flags compile (postgres, libsql)
  • cargo fmt --check clean


@github-actions github-actions bot added size: M 50-199 changed lines scope: agent Agent core (agent loop, router, scheduler) scope: tool Tool infrastructure risk: medium Business logic, config, or moderate-risk modules contributor: core 20+ merged PRs and removed size: M 50-199 changed lines labels Mar 8, 2026
@henrypark133 henrypark133 changed the base branch from main to staging March 10, 2026 02:19
@zmanian zmanian force-pushed the feat/647-wire-self-repair branch from 69be66a to cf92503 Compare March 12, 2026 21:52
@github-actions github-actions bot added size: M 50-199 changed lines scope: tool/builtin Built-in tools labels Mar 12, 2026
@zmanian zmanian closed this Mar 12, 2026
@zmanian zmanian reopened this Mar 12, 2026
@zmanian zmanian force-pushed the feat/647-wire-self-repair branch from 52ff273 to 3a12dc8 Compare March 12, 2026 22:42
@zmanian zmanian enabled auto-merge (squash) March 12, 2026 23:14
@henrypark133
Collaborator

Bug: the newly wired stuck_threshold is measured from ctx.started_at, not from when the job entered JobState::Stuck. A job that ran for two hours and only just became stuck will now look over threshold immediately, so self-repair triggers right away even if the configured stuck threshold is much smaller. If this threshold is meant to debounce repair attempts, it should be computed from the timestamp of the transition to Stuck (for example, the last matching entry in ctx.transitions).

@ilblackdragon
Member

Let's add a test that actually exercises it, and at least mocks the build step and the loading of the new tool that unsticks the job.

Member

@ilblackdragon ilblackdragon left a comment

Let's add an e2e test for the full flow: stuck > self repair > builder > new tool > unstuck.

@github-actions github-actions bot removed the size: M 50-199 changed lines label Mar 16, 2026
@zmanian
Collaborator Author

zmanian commented Mar 16, 2026

Added the E2E test in a8db274: e2e_stuck_job_repair_and_tool_rebuild

Tests the full cycle:

  1. Job transitions Pending -> InProgress -> Stuck
  2. detect_stuck_jobs() finds it (zero threshold)
  3. repair_stuck_job() recovers it back to InProgress
  4. A broken tool is submitted to repair_broken_tool()
  5. MockBuilder (impl SoftwareBuilder) builds successfully
  6. Verifies builder was invoked and repair returned Success

Uses a real libsql test database for the store path (increment_repair_attempts, mark_tool_repaired). The MockBuilder returns a successful BuildResult without requiring an LLM or filesystem.
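
For readers unfamiliar with the pattern, a counting mock of this kind might look roughly like the sketch below. It is an assumption-laden simplification: `ToolBuilder`, `BuildOutcome`, `CountingBuilder`, and `repair_tool` are hypothetical names, and the trait here is synchronous, which may not match the real `SoftwareBuilder` or `BuildResult`.

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Arc;

/// Hypothetical, simplified result type (not the crate's `BuildResult`).
struct BuildOutcome {
    success: bool,
}

/// Hypothetical synchronous stand-in for the builder trait.
trait ToolBuilder: Send + Sync {
    fn build(&self, tool_name: &str) -> BuildOutcome;
}

/// Mock that always succeeds and counts how often it was invoked.
#[derive(Default)]
struct CountingBuilder {
    build_count: AtomicU32,
}

impl ToolBuilder for CountingBuilder {
    fn build(&self, _tool_name: &str) -> BuildOutcome {
        // Atomic counter so the mock can be shared behind an `Arc`.
        self.build_count.fetch_add(1, Ordering::SeqCst);
        BuildOutcome { success: true }
    }
}

/// Code under test only sees the trait object, never the concrete mock.
fn repair_tool(builder: &Arc<dyn ToolBuilder>, tool_name: &str) -> bool {
    builder.build(tool_name).success
}
```

The test keeps its own `Arc` to the concrete mock, hands a trait-object clone to the code under test, and reads `build_count` afterwards to confirm the builder was actually invoked.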

All 10 self_repair tests pass, zero clippy warnings.

@github-actions github-actions bot added the size: L 200-499 changed lines label Mar 16, 2026
@zmanian zmanian requested a review from ilblackdragon March 16, 2026 07:44
Member

@ilblackdragon ilblackdragon left a comment

Review: feat(self-repair): wire stuck_threshold, store, and builder

Good progress wiring up the previously-dead fields. The builder integration and test harness (MockBuilder with AtomicU32) are well-designed, and the overall approach of threading store and builder through AppComponents → AgentDeps → DefaultSelfRepair is clean. A few issues need attention before merging.


1. Critical: stuck_duration is measured from started_at, not from the stuck transition

In src/agent/self_repair.rs, detect_stuck_jobs() computes stuck_duration as:

let stuck_duration = ctx
    .started_at
    .map(|start| {
        let now = Utc::now();
        let duration = now.signed_duration_since(start);
        Duration::from_secs(duration.num_seconds().max(0) as u64)
    })
    .unwrap_or_default();

started_at is set once — when the job first transitions to InProgress (see src/context/state.rs:261, guarded by if self.started_at.is_none()). It is never updated on subsequent Stuck → InProgress recovery cycles.

Impact: A job that ran successfully for 2 hours before becoming stuck will have stuck_duration ≈ 2h, immediately exceeding any reasonable threshold (e.g., 5 minutes). The threshold filter added by this PR (if stuck_duration < self.stuck_threshold { continue; }) becomes useless — every long-running job that becomes stuck will be reported immediately.

Fix: Use the timestamp of the most recent Stuck transition from ctx.transitions:

let stuck_since = ctx.transitions.iter().rev()
    .find(|t| t.to == JobState::Stuck)
    .map(|t| t.timestamp);

let stuck_duration = stuck_since
    .map(|ts| {
        let duration = Utc::now().signed_duration_since(ts);
        Duration::from_secs(duration.num_seconds().max(0) as u64)
    })
    .unwrap_or_default();
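
The same idea can be checked in isolation with a standard-library-only sketch. The types below are simplified stand-ins, not the crate's: `Transition`, the two-variant `JobState`, `stuck_duration`, and `should_report` are hypothetical, and `SystemTime` replaces the chrono timestamps so the example has no dependencies.

```rust
use std::time::{Duration, SystemTime};

/// Simplified stand-in for the crate's job state enum.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum JobState {
    InProgress,
    Stuck,
}

/// Simplified stand-in for one entry of `ctx.transitions`.
struct Transition {
    to: JobState,
    timestamp: SystemTime,
}

/// Time spent stuck as of `now`, measured from the most recent transition
/// into `Stuck`. Returns `None` if no such transition exists.
fn stuck_duration(transitions: &[Transition], now: SystemTime) -> Option<Duration> {
    transitions
        .iter()
        .rev()
        .find(|t| t.to == JobState::Stuck)
        // A timestamp later than `now` (clock skew) clamps to zero.
        .map(|t| now.duration_since(t.timestamp).unwrap_or(Duration::ZERO))
}

/// Mirrors the PR's filter: skip the job while it is under the threshold.
/// A missing stuck timestamp counts as zero duration, as in the review snippet.
fn should_report(transitions: &[Transition], now: SystemTime, threshold: Duration) -> bool {
    stuck_duration(transitions, now).unwrap_or_default() >= threshold
}
```

Passing `now` in as a parameter, rather than reading the clock inside the function, is a choice made for this sketch so that a two-hour job that became stuck one second ago can be tested deterministically against a five-minute threshold.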

2. Same issue in StuckJob.last_activity

last_activity: ctx.started_at.unwrap_or(ctx.created_at),

This field is named and documented as "last activity", but it reflects when the job first started, not when it became stuck. If it is meant to communicate when the job was last doing useful work (just before it got stuck), it should use the timestamp of the most recent InProgress → Stuck transition. If it is intentionally the job start time, it should be renamed to avoid confusion.

3. Test gap: all tests create jobs that become stuck immediately

The new tests (detect_stuck_jobs_filters_by_threshold and detect_stuck_jobs_includes_when_over_threshold) and the existing detect_stuck_job_finds_stuck_state test all create a job and immediately transition it to Stuck. This means started_at ≈ stuck_timestamp ≈ now, so the bug in (1) is invisible — stuck_duration is near-zero whether measured from started_at or the stuck transition.

A test that exposes the bug:

#[tokio::test]
async fn stuck_duration_measured_from_stuck_transition_not_started_at() {
    let cm = Arc::new(ContextManager::new(10));
    let job_id = cm.create_job("Long runner", "desc").await.unwrap();

    // Transition to InProgress (sets started_at to now).
    cm.update_context(job_id, |ctx| ctx.transition_to(JobState::InProgress, None))
        .await.unwrap().unwrap();

    // Manually backdate started_at to simulate a job that ran for 2 hours.
    cm.update_context(job_id, |ctx| {
        ctx.started_at = Some(Utc::now() - chrono::Duration::hours(2));
        Ok(())
    }).await.unwrap().unwrap();

    // Now transition to Stuck (stuck transition timestamp is ~now).
    cm.update_context(job_id, |ctx| {
        ctx.transition_to(JobState::Stuck, Some("wedged".into()))
    }).await.unwrap().unwrap();

    // With a 5-minute threshold, the job JUST became stuck — should NOT be detected.
    let repair = DefaultSelfRepair::new(cm, Duration::from_secs(300), 3);
    let stuck = repair.detect_stuck_jobs().await;
    assert!(stuck.is_empty(),
        "Job stuck for <1s should not exceed 5min threshold, \
         but stuck_duration was computed from started_at (2h ago)");
}

This test will fail on the current implementation and pass once (1) is fixed.

4. Stray artifact: // ci fix in src/tools/builtin/memory.rs

The diff adds a bare // ci fix comment at the end of src/tools/builtin/memory.rs. This appears to be a leftover from a CI troubleshooting session and should be removed.

5. Misleading log: "hot-reloaded into registry"

if result.registered {
    if self.tools.is_some() {
        tracing::info!("Repaired tool '{}' hot-reloaded into registry", tool.name);
    } else {
        tracing::info!("Repaired tool '{}' auto-registered", tool.name);
    }
}

The self.tools field (Option<Arc<ToolRegistry>>) is checked via is_some() but never actually used — no code reads from or writes to the registry here. The result.registered flag comes from SoftwareBuilder::build(), meaning the builder registered the tool (the LlmSoftwareBuilder already holds its own Arc<ToolRegistry>). Logging "hot-reloaded into registry" when this code path did nothing with the registry is misleading. Either:

  • Actually use self.tools to perform a reload/refresh (if that's the intent), or
  • Drop the distinction and keep the existing "auto-registered" message until real hot-reload logic is implemented.
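
The second option could be as small as the sketch below. `repair_log_message` is a hypothetical helper introduced only for illustration; it returns the message text instead of calling `tracing::info!` so the example stays dependency-free.

```rust
/// Hypothetical helper: the single log line to emit after a repair, or `None`
/// when the builder did not register the tool. There is no "hot-reloaded"
/// branch because nothing in this path touches the registry.
fn repair_log_message(tool_name: &str, registered: bool) -> Option<String> {
    registered.then(|| format!("Repaired tool '{tool_name}' auto-registered"))
}
```

The caller would pass `result.registered` and log the returned string, so the wording can no longer depend on whether `self.tools` happens to be set.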

6. Positive notes

  • The wiring through AppComponents.builder → AgentDeps.builder → DefaultSelfRepair is straightforward and follows the existing pattern for other optional deps.
  • Removing the #[allow(dead_code)] annotations and the pub(crate) → pub visibility changes are appropriate now that the fields are used.
  • MockBuilder with AtomicU32 for build counting is a clean test pattern.
  • The register_builder_tool returning Arc<dyn SoftwareBuilder> instead of () is a nice refactor that avoids reconstructing the builder elsewhere.
  • The threshold filter in detect_stuck_jobs is the right addition — it just needs to measure from the correct timestamp.

@zmanian
Collaborator Author

zmanian commented Mar 16, 2026

Thanks for the thorough review, @ilblackdragon. All points are valid — here's the plan:

  1. E2E test: Will add a full stuck → self-repair → builder → new tool → unstuck end-to-end test.

  2. Critical: stuck_duration using started_at: Good catch — a job that ran for hours before becoming stuck would immediately exceed the threshold. Will fix to derive stuck_duration from the most recent Stuck transition timestamp in ctx.transitions.

  3. StuckJob.last_activity same issue: Will fix similarly to use the stuck transition timestamp rather than started_at.

  4. Test gap: Will add the suggested test with a job that runs for a while before becoming stuck, so the bug would be caught by the test suite.

  5. Stray // ci fix comment in memory.rs: Will remove.

  6. Misleading "hot-reloaded into registry" log: Will drop the "hot-reloaded" distinction until real hot-reload is implemented.

zmanian and others added 4 commits March 17, 2026 03:00
Wire the previously dead-code fields in DefaultSelfRepair:

- stuck_threshold: detect_stuck_jobs() now filters by duration, only
  reporting jobs stuck longer than the configured threshold
- with_store(): wired in agent_loop.rs from AgentDeps.store for
  tool failure tracking via Database trait
- with_builder(): wired from register_builder_tool() return value
  through AppComponents and AgentDeps for automatic tool rebuilding
- tools: passed alongside builder for hot-reload logging

Remove all #[allow(dead_code)] annotations. Add regression tests for
threshold-based filtering (both above and below threshold).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ness

After rebase onto staging, AgentDeps gained a `builder` field for
self-repair tool rebuilding. The gateway workflow test harness was
missing this field, causing CI compilation failure.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Tests the full self-repair flow requested in review:
1. Job transitions Pending -> InProgress -> Stuck
2. detect_stuck_jobs() finds it (zero threshold)
3. repair_stuck_job() recovers it back to InProgress
4. A broken tool is repaired via MockBuilder
5. Verify builder was invoked and repair succeeded

Uses a MockBuilder (impl SoftwareBuilder) that returns successful
BuildResult without requiring an LLM or filesystem. Uses libsql
test database for the store (increment_repair_attempts, mark_tool_repaired).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…tarted_at

- Use ctx.transitions to find the most recent Stuck transition timestamp
  instead of ctx.started_at (which reflects job start, not stuck time)
- Fix StuckJob.last_activity to use stuck transition timestamp
- Remove misleading "hot-reloaded into registry" log
- Remove stray "// ci fix" comment in memory.rs
- Add regression test: backdated started_at must not inflate stuck_duration

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@zmanian zmanian force-pushed the feat/647-wire-self-repair branch from 9829d95 to fdd1257 Compare March 17, 2026 03:05

Labels

contributor: core 20+ merged PRs risk: medium Business logic, config, or moderate-risk modules scope: agent Agent core (agent loop, router, scheduler) scope: tool/builtin Built-in tools scope: tool Tool infrastructure size: L 200-499 changed lines

Development

Successfully merging this pull request may close these issues.

Wire self-repair system: stuck_threshold, tool hot-reload, and persistence

3 participants