feat(self-repair): wire stuck_threshold, store, and builder#712
Conversation
Bug: the newly wired
Let's add a test that actually uses it, and at least mocks the building part and loading new tool to unstuck the job. |
ilblackdragon
left a comment
Let's add an e2e test for the full stuck > self repair > builder > new tool > unstuck
Added the E2E test in a8db274. It tests the full cycle and uses a real libsql test database for the store path. All 10 self_repair tests pass, zero clippy warnings.
ilblackdragon
left a comment
Review: feat(self-repair): wire stuck_threshold, store, and builder
Good progress wiring up the previously-dead fields. The builder integration and test harness (MockBuilder with AtomicU32) are well-designed, and the overall approach of threading store and builder through AppComponents → AgentDeps → DefaultSelfRepair is clean. A few issues need attention before merging.
1. Critical: stuck_duration is measured from started_at, not from the stuck transition
In src/agent/self_repair.rs, detect_stuck_jobs() computes stuck_duration as:

```rust
let stuck_duration = ctx
    .started_at
    .map(|start| {
        let now = Utc::now();
        let duration = now.signed_duration_since(start);
        Duration::from_secs(duration.num_seconds().max(0) as u64)
    })
    .unwrap_or_default();
```

started_at is set once — when the job first transitions to InProgress (see src/context/state.rs:261, guarded by `if self.started_at.is_none()`). It is never updated on subsequent Stuck → InProgress recovery cycles.

Impact: A job that ran successfully for 2 hours before becoming stuck will have stuck_duration ≈ 2h, immediately exceeding any reasonable threshold (e.g., 5 minutes). The threshold filter added by this PR (`if stuck_duration < self.stuck_threshold { continue; }`) becomes useless — every long-running job that becomes stuck will be reported immediately.
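The failure mode can be sketched with plain second counts standing in for the chrono timestamps (all numbers illustrative, not from the real code):

```rust
fn main() {
    // Stand-ins for DateTime<Utc> values, as seconds on a shared clock.
    let started_at = 0u64;        // job first entered InProgress 2h ago
    let stuck_transition = 7_190; // job entered Stuck ~10s ago
    let now = 7_200;
    let threshold = 300;          // 5-minute stuck_threshold

    // Current code: measures from started_at, so the job looks stuck for 2h.
    let from_start = now - started_at;
    // Measuring from the Stuck transition gives the actual stuck time.
    let from_stuck = now - stuck_transition;

    assert!(from_start >= threshold); // reported immediately (wrong)
    assert!(from_stuck < threshold);  // correctly filtered out
    println!("from_start={from_start}s, from_stuck={from_stuck}s");
}
```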
Fix: Use the timestamp of the most recent Stuck transition from ctx.transitions:

```rust
let stuck_since = ctx.transitions.iter().rev()
    .find(|t| t.to == JobState::Stuck)
    .map(|t| t.timestamp);
let stuck_duration = stuck_since
    .map(|ts| {
        let duration = Utc::now().signed_duration_since(ts);
        Duration::from_secs(duration.num_seconds().max(0) as u64)
    })
    .unwrap_or_default();
```

2. Same issue in StuckJob.last_activity

```rust
last_activity: ctx.started_at.unwrap_or(ctx.created_at),
```

This field is documented/named as "last activity" but it reflects when the job was first started, not when it became stuck. If this field is meant to communicate when the job was last doing something useful (before it got stuck), it should use the stuck-transition timestamp or the timestamp of the last InProgress → Stuck transition. If it's intentionally the job start time, it should be renamed to avoid confusion.
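A minimal sketch of the suggested derivation — the variant and field names mirror the review, but the types here are stand-ins (plain seconds instead of DateTime<Utc>), not the crate's real definitions:

```rust
#[derive(Clone, Copy, PartialEq)]
enum JobState { InProgress, Stuck }

// Stand-in for the real transition record; timestamp is seconds here.
struct Transition { to: JobState, timestamp: u64 }

// last_activity = timestamp of the most recent Stuck transition,
// falling back to job creation time if none was recorded.
fn last_activity(transitions: &[Transition], created_at: u64) -> u64 {
    transitions.iter().rev()
        .find(|t| t.to == JobState::Stuck)
        .map(|t| t.timestamp)
        .unwrap_or(created_at)
}

fn main() {
    let ts = [
        Transition { to: JobState::InProgress, timestamp: 100 },
        Transition { to: JobState::Stuck,      timestamp: 7_190 },
    ];
    assert_eq!(last_activity(&ts, 0), 7_190); // stuck transition, not start
    assert_eq!(last_activity(&[], 42), 42);   // fallback to created_at
    println!("ok");
}
```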
3. Test gap: all tests create jobs that become stuck immediately
The new tests (detect_stuck_jobs_filters_by_threshold and detect_stuck_jobs_includes_when_over_threshold) and the existing detect_stuck_job_finds_stuck_state test all create a job and immediately transition it to Stuck. This means started_at ≈ stuck_timestamp ≈ now, so the bug in (1) is invisible — stuck_duration is near-zero whether measured from started_at or the stuck transition.
A test that exposes the bug:

```rust
#[tokio::test]
async fn stuck_duration_measured_from_stuck_transition_not_started_at() {
    let cm = Arc::new(ContextManager::new(10));
    let job_id = cm.create_job("Long runner", "desc").await.unwrap();
    // Transition to InProgress (sets started_at to now).
    cm.update_context(job_id, |ctx| ctx.transition_to(JobState::InProgress, None))
        .await.unwrap().unwrap();
    // Manually backdate started_at to simulate a job that ran for 2 hours.
    cm.update_context(job_id, |ctx| {
        ctx.started_at = Some(Utc::now() - chrono::Duration::hours(2));
        Ok(())
    }).await.unwrap().unwrap();
    // Now transition to Stuck (stuck transition timestamp is ~now).
    cm.update_context(job_id, |ctx| {
        ctx.transition_to(JobState::Stuck, Some("wedged".into()))
    }).await.unwrap().unwrap();
    // With a 5-minute threshold, the job JUST became stuck — should NOT be detected.
    let repair = DefaultSelfRepair::new(cm, Duration::from_secs(300), 3);
    let stuck = repair.detect_stuck_jobs().await;
    assert!(stuck.is_empty(),
        "Job stuck for <1s should not exceed 5min threshold, \
         but stuck_duration was computed from started_at (2h ago)");
}
```

This test will fail on the current implementation and pass once (1) is fixed.
4. Stray artifact: // ci fix in src/tools/builtin/memory.rs
The diff adds a bare // ci fix comment at the end of src/tools/builtin/memory.rs. This appears to be a leftover from a CI troubleshooting session and should be removed.
5. Misleading log: "hot-reloaded into registry"
```rust
if result.registered {
    if self.tools.is_some() {
        tracing::info!("Repaired tool '{}' hot-reloaded into registry", tool.name);
    } else {
        tracing::info!("Repaired tool '{}' auto-registered", tool.name);
    }
}
```

The self.tools field (Option<Arc<ToolRegistry>>) is checked via is_some() but never actually used — no code reads from or writes to the registry here. The result.registered flag comes from SoftwareBuilder::build(), meaning the builder registered the tool (the LlmSoftwareBuilder already holds its own Arc<ToolRegistry>). Logging "hot-reloaded into registry" when this code path did nothing with the registry is misleading. Either:
- Actually use self.tools to perform a reload/refresh (if that's the intent), or
- Drop the distinction and keep the existing "auto-registered" message until real hot-reload logic is implemented.
6. Positive notes
- The wiring through AppComponents.builder → AgentDeps.builder → DefaultSelfRepair is straightforward and follows the existing pattern for other optional deps.
- Removing the #[allow(dead_code)] annotations and the pub(crate) → pub visibility changes are appropriate now that the fields are used.
- MockBuilder with AtomicU32 for build counting is a clean test pattern.
- register_builder_tool returning Arc<dyn SoftwareBuilder> instead of () is a nice refactor that avoids reconstructing the builder elsewhere.
- The threshold filter in detect_stuck_jobs is the right addition — it just needs to measure from the correct timestamp.
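The MockBuilder pattern praised above can be sketched roughly as follows; the trait and BuildResult here are simplified stand-ins, not the crate's real (likely async) signatures:

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Arc;

// Stand-in for the real BuildResult.
struct BuildResult { registered: bool }

// Stand-in for the real SoftwareBuilder trait.
trait SoftwareBuilder: Send + Sync {
    fn build(&self, tool_name: &str) -> BuildResult;
}

// Counts invocations so tests can assert the builder was actually called,
// without requiring an LLM or filesystem.
struct MockBuilder { builds: AtomicU32 }

impl SoftwareBuilder for MockBuilder {
    fn build(&self, _tool_name: &str) -> BuildResult {
        self.builds.fetch_add(1, Ordering::SeqCst);
        BuildResult { registered: true }
    }
}

fn main() {
    let builder = Arc::new(MockBuilder { builds: AtomicU32::new(0) });
    let as_trait: Arc<dyn SoftwareBuilder> = builder.clone();
    let result = as_trait.build("broken_tool");
    assert!(result.registered);
    assert_eq!(builder.builds.load(Ordering::SeqCst), 1);
    println!("builds=1");
}
```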
Thanks for the thorough review, @ilblackdragon. All points are valid — here's the plan:
Wire the previously dead-code fields in DefaultSelfRepair:
- stuck_threshold: detect_stuck_jobs() now filters by duration, only reporting jobs stuck longer than the configured threshold
- with_store(): wired in agent_loop.rs from AgentDeps.store for tool failure tracking via the Database trait
- with_builder(): wired from the register_builder_tool() return value through AppComponents and AgentDeps for automatic tool rebuilding
- tools: passed alongside builder for hot-reload logging

Remove all #[allow(dead_code)] annotations. Add regression tests for threshold-based filtering (both above and below threshold).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
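The builder wiring described in the commit can be sketched with stand-in types; the names (register_builder_tool, AppComponents, DefaultSelfRepair, with_builder) follow the commit message, everything else is hypothetical:

```rust
use std::sync::Arc;

// Stand-ins for the real builder trait and its LLM-backed implementation.
trait SoftwareBuilder: Send + Sync {}
struct LlmSoftwareBuilder;
impl SoftwareBuilder for LlmSoftwareBuilder {}

// register_builder_tool now returns the builder so callers can reuse it
// instead of reconstructing it elsewhere.
fn register_builder_tool() -> Arc<dyn SoftwareBuilder> {
    Arc::new(LlmSoftwareBuilder)
}

struct AppComponents { builder: Arc<dyn SoftwareBuilder> }

struct DefaultSelfRepair { builder: Option<Arc<dyn SoftwareBuilder>> }

impl DefaultSelfRepair {
    fn new() -> Self { Self { builder: None } }
    // Builder-style setter threading the shared builder into self-repair.
    fn with_builder(mut self, b: Arc<dyn SoftwareBuilder>) -> Self {
        self.builder = Some(b);
        self
    }
}

fn main() {
    let components = AppComponents { builder: register_builder_tool() };
    let repair = DefaultSelfRepair::new().with_builder(components.builder.clone());
    assert!(repair.builder.is_some());
    println!("wired");
}
```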
…ness

After rebase onto staging, AgentDeps gained a `builder` field for self-repair tool rebuilding. The gateway workflow test harness was missing this field, causing CI compilation failure.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Tests the full self-repair flow requested in review:
1. Job transitions Pending -> InProgress -> Stuck
2. detect_stuck_jobs() finds it (zero threshold)
3. repair_stuck_job() recovers it back to InProgress
4. A broken tool is repaired via MockBuilder
5. Verify builder was invoked and repair succeeded

Uses a MockBuilder (impl SoftwareBuilder) that returns a successful BuildResult without requiring an LLM or filesystem. Uses a libsql test database for the store (increment_repair_attempts, mark_tool_repaired).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…tarted_at

- Use ctx.transitions to find the most recent Stuck transition timestamp instead of ctx.started_at (which reflects job start, not stuck time)
- Fix StuckJob.last_activity to use the stuck transition timestamp
- Remove misleading "hot-reloaded into registry" log
- Remove stray "// ci fix" comment in memory.rs
- Add regression test: backdated started_at must not inflate stuck_duration

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
Closes #647
stuck_threshold:detect_stuck_jobs()now filters by duration — only jobs stuck longer than the configured threshold are reported, preventing premature repair attempts on recently-stuck jobswith_store(): Wired inagent_loop.rsfromAgentDeps.storesodetect_broken_tools()can query the database for tool failure recordswith_builder():register_builder_tool()now returns theArc<dyn SoftwareBuilder>, propagated throughAppComponentsandAgentDepssorepair_broken_tool()can rebuild broken WASM toolstools: Passed alongside builder for hot-reload logging after successful repair#[allow(dead_code)]annotations removed;with_store()/with_builder()made fully publicTest plan
- detect_stuck_jobs_filters_by_threshold: verifies a job stuck <1s is filtered by a 1h threshold
- detect_stuck_jobs_includes_when_over_threshold: verifies zero threshold includes all stuck jobs
- detect_stuck_job_finds_stuck_state uses zero threshold to match the new behavior
- (--all-features)
- (postgres, libsql)
- cargo fmt --check clean