Conversation
|
Bosun CI signal: Bosun-created PR currently has failing checks.
|
There was a problem hiding this comment.
Pull request overview
Adds a “delegation watchdog” mechanism intended to detect and recover stalled delegated workflow runs (especially non-task delegations), wiring defaults through node schema/template variables and adding regression tests.
Changes:
- Add delegation watchdog config knobs (
delegationWatchdogTimeoutMs,delegationWatchdogMaxRecoveries) toaction.run_agentand attempt automatic re-execution when a delegated run appears stuck. - Add interrupted-run recovery logic in
WorkflowEngine.resumeInterruptedRuns()to auto-retry stalled delegations based on persisted watchdog metadata. - Propagate watchdog defaults into workflow templates and add tests for the new behavior.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 10 comments.
Show a summary per file
| File | Description |
|---|---|
| workflow/workflow-nodes/actions.mjs | Adds watchdog config + retry loop around delegated engine.execute() for action.run_agent. |
| workflow/workflow-engine.mjs | Adds watchdog-based retry decision during interrupted-run recovery. |
| workflow-templates/task-lifecycle.mjs | Adds template variables for watchdog defaults. |
| workflow-templates/github.mjs | Adds watchdog defaults to PR watchdog template; wires templated watchdog fields into SDK conflict resolver node. |
| tests/workflow-task-lifecycle.test.mjs | Adds a test asserting watchdog defaults are installed. |
| tests/workflow-engine.test.mjs | Adds tests for watchdog retry/exhaustion in interrupted-run recovery and for action.run_agent delegation stall handling. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| delegationWatchdogTimeoutMs: "{{delegationWatchdogTimeoutMs}}", | ||
| delegationWatchdogMaxRecoveries: "{{delegationWatchdogMaxRecoveries}}", |
There was a problem hiding this comment.
SDK_CONFLICT_RESOLVER_TEMPLATE sets delegationWatchdogTimeoutMs: "{{delegationWatchdogTimeoutMs}}" and delegationWatchdogMaxRecoveries: "{{delegationWatchdogMaxRecoveries}}", but those variables aren’t defined in this template’s variables block. That makes overrides impossible and can leave unresolved template strings flowing into numeric parsing. Either add these to variables (with defaults) or set numeric defaults directly in the node config.
| delegationWatchdogTimeoutMs: "{{delegationWatchdogTimeoutMs}}", | |
| delegationWatchdogMaxRecoveries: "{{delegationWatchdogMaxRecoveries}}", | |
| delegationWatchdogTimeoutMs: 900000, | |
| delegationWatchdogMaxRecoveries: 3, |
| suspiciousDeletionRatio: 3, // e.g. deletes 3× more lines than it adds | ||
| minDestructiveDeletions: 500, // absolute floor — small PRs are fine even if net negative | ||
| autoApplySuggestions: true, // auto-commit review suggestions before merge | ||
| delegationWatchdogTimeoutMs: 300000, | ||
| delegationWatchdogMaxRecoveries: 1, | ||
| }, |
There was a problem hiding this comment.
This file modifies workflow templates; remember to regenerate and commit ui/demo-defaults.js and site/ui/demo-defaults.js by running node tools/generate-demo-defaults.mjs (enforced by tests/demo-defaults-sync.test.mjs). Without that, the demo-defaults sync test will fail in CI.
| function buildDelegationWatchdogDecision(detail = {}) { | ||
| const watchdog = detail?.data?._delegationWatchdog; | ||
| if (!watchdog || typeof watchdog !== "object") return null; | ||
|
|
||
| const delegationType = String(watchdog.delegationType || "").trim().toLowerCase(); | ||
| const taskScoped = watchdog.taskScoped === true; | ||
| if (taskScoped || delegationType === "task") return null; | ||
|
|
There was a problem hiding this comment.
The interrupted-run watchdog logic depends on detail.data._delegationWatchdog, but a repo-wide search shows no production code that ever sets/persists _delegationWatchdog (only the new helper + tests reference it). As a result, buildDelegationWatchdogDecision() will always return null in real runs and the watchdog won’t actually recover anything. Please add the code path that writes _delegationWatchdog into run detail when a workflow delegates, or base the watchdog decision on existing persisted fields.
| return { | ||
| type: "retry", | ||
| mode: "from_failed", | ||
| reason: `${reasonBase}:retryable`, | ||
| nodeId: watchdog.nodeId || null, | ||
| elapsedMs, | ||
| timeoutMs, | ||
| recoveryAttempts, | ||
| maxRecoveries, | ||
| }; |
There was a problem hiding this comment.
buildDelegationWatchdogDecision() returns mode: "from_failed" for watchdog retries, but resumeInterruptedRuns() expects the decision to provide the retry mode and the added tests assert from_scratch. For a stalled delegation, from_scratch also seems safer than resuming from the failed node. Update the watchdog decision to use from_scratch (and keep the decision shape consistent with _chooseRetryModeFromDetail).
| @@ -520,6 +579,11 @@ registerNodeType("action.run_agent", { | |||
| subWorkflowName: candidate.name || candidate.id, | |||
| subStatus, | |||
| subRun, | |||
| watchdogRecovered, | |||
| recoveredFromStall: watchdogRecovered, | |||
| watchdogRetryCount, | |||
| failureKind: stalledDelegation ? "stalled_delegation" : undefined, | |||
| retryable: stalledDelegation ? true : undefined, | |||
| }; | |||
There was a problem hiding this comment.
The new delegation-watchdog logic was added to workflow/workflow-nodes/actions.mjs, but this node type is also implemented in the monolithic workflow/workflow-nodes.mjs (see registerBuiltinNodeType("action.run_agent", ...)). Per repo convention these copies must stay in sync; otherwise some runtime paths won’t get the watchdog behavior/schema. Please port the same schema fields and execution logic changes to workflow/workflow-nodes.mjs as well.
| taskTimeoutMs: 21600000, | ||
| delegationWatchdogTimeoutMs: 300000, | ||
| delegationWatchdogMaxRecoveries: 1, // 6 hours |
There was a problem hiding this comment.
This template variable block has an incorrect inline comment: delegationWatchdogMaxRecoveries: 1, // 6 hours (the “6 hours” comment belonged to taskTimeoutMs). Please fix/remove the comment so it accurately describes the setting, and keep the taskTimeoutMs annotation if desired for readability.
| taskTimeoutMs: 21600000, | |
| delegationWatchdogTimeoutMs: 300000, | |
| delegationWatchdogMaxRecoveries: 1, // 6 hours | |
| taskTimeoutMs: 21600000, // 6 hours | |
| delegationWatchdogTimeoutMs: 300000, | |
| delegationWatchdogMaxRecoveries: 1, // max recovery attempts for delegation watchdog |
| claimRenewIntervalMs: 60000, | ||
| defaultSdk: "auto", | ||
| defaultTargetBranch: "origin/main", | ||
| taskTimeoutMs: 21600000, // 6 hours | ||
| taskTimeoutMs: 21600000, | ||
| delegationWatchdogTimeoutMs: 300000, | ||
| delegationWatchdogMaxRecoveries: 1, // 6 hours | ||
| prePrValidationEnabled: true, | ||
| prePrValidationCommand: "auto", | ||
| autoMergeOnCreate: false, |
There was a problem hiding this comment.
This file changes workflow templates. The repo has a sync test (tests/demo-defaults-sync.test.mjs) that requires regenerating ui/demo-defaults.js and site/ui/demo-defaults.js via node tools/generate-demo-defaults.mjs after template updates. Please run the generator and commit the updated demo-defaults outputs, otherwise CI will fail.
| it("installs delegation watchdog defaults for non-task recovery", () => { | ||
| const result = installTemplate("template-task-lifecycle", engine); | ||
| expect(result.variables.delegationWatchdogTimeoutMs).toBeGreaterThan(0); | ||
| expect(result.variables.delegationWatchdogMaxRecoveries).toBe(1); | ||
| }); expect(result.variables.maxParallel).toBe(5); | ||
| expect(result.variables.taskTimeoutMs).toBe(3600000); | ||
| expect(result.variables.defaultSdk).toBe("auto"); // unchanged |
There was a problem hiding this comment.
The new it("installs delegation watchdog defaults…") block was inserted in the middle of the existing it("installs with variable overrides") test, and expect(result.variables.maxParallel)... now appears after the new test’s closing brace on the same line. This breaks the test structure (and likely syntax). Move the new it(...) to its own top-level block and keep the original assertions inside the override test.
| it("installs delegation watchdog defaults for non-task recovery", () => { | |
| const result = installTemplate("template-task-lifecycle", engine); | |
| expect(result.variables.delegationWatchdogTimeoutMs).toBeGreaterThan(0); | |
| expect(result.variables.delegationWatchdogMaxRecoveries).toBe(1); | |
| }); expect(result.variables.maxParallel).toBe(5); | |
| expect(result.variables.taskTimeoutMs).toBe(3600000); | |
| expect(result.variables.defaultSdk).toBe("auto"); // unchanged | |
| expect(result.variables.maxParallel).toBe(5); | |
| expect(result.variables.taskTimeoutMs).toBe(3600000); | |
| expect(result.variables.defaultSdk).toBe("auto"); // unchanged | |
| }); | |
| it("installs delegation watchdog defaults for non-task recovery", () => { | |
| const result = installTemplate("template-task-lifecycle", engine); | |
| expect(result.variables.delegationWatchdogTimeoutMs).toBeGreaterThan(0); | |
| expect(result.variables.delegationWatchdogMaxRecoveries).toBe(1); |
| it("records delegated runs in session tracker for task visibility", async () => { | ||
| const handler = getNodeType("action.run_agent"); | ||
| expect(handler).toBeDefined(); | ||
|
|
||
| const ctx = new WorkflowContext({ | ||
| taskId: "TASK-DELEGATE-SESSION", | ||
| taskTitle: "Backend migration", | ||
| workspaceId: "virtengine-gh", | ||
| task: { | ||
| id: "TASK-DELEGATE-SESSION", | ||
| title: "Backend migration", | ||
| tags: ["backend"], | ||
| branchName: "feat/backend-migration", | ||
| }, | ||
| }); | ||
|
|
||
| it("marks stalled delegated non-task workflows retryable and recovers only once", async () => { | ||
| const handler = getNodeType("action.run_agent"); | ||
| expect(handler).toBeDefined(); | ||
|
|
There was a problem hiding this comment.
New it(...) blocks for the delegation watchdog were added inside the existing it("records delegated runs…") test (note the missing closing braces before the next it(). This will cause invalid test nesting / syntax errors, and the existing test’s body is now truncated. Close the original test before starting new it(...) blocks, and ensure each test has its own ctx/mockEngine setup.
| const ctx = new WorkflowContext({ | ||
| workspaceId: "virtengine-gh", | ||
| ticketId: "GH-123", | ||
| delegationWatchdogTimeoutMs: 25, | ||
| }); | ||
|
|
||
| const stalledRun = { | ||
| runId: "delegated-stalled-run", | ||
| status: WorkflowStatus.RUNNING, |
There was a problem hiding this comment.
Even after fixing the test nesting, these new watchdog tests set ctx without any task context (taskId/taskTitle), but action.run_agent only enters the workflow-delegation path when hasTaskContext is true. As written, the handler will fall back to engine.services.agentPool.* and never call engine.execute/getRunHistory, so the assertions about delegated runs won’t hold. Provide a minimal task payload (e.g., taskId + taskTitle) or adjust the test to target the non-delegated agent path.
|
Bosun PR classification: Bosun-created.
|
b40c07f to
9c7a2a9
Compare
9c7a2a9 to
9014dd8
Compare
Co-authored-by: bosun-ve[bot] <262908237+bosun-ve[bot]@users.noreply.github.com>
…tion-watchdog-to-auto-
Task-ID: 76cc7f25-b5f2-47c3-a959-8aff1eb8edd5\n\nAutomated PR for task 76cc7f25-b5f2-47c3-a959-8aff1eb8edd5