Skip to content

[m] feat(workflow): add delegation watchdog to auto-recover stalled non-task nodes#437

Open
jaeko44 wants to merge 7 commits intomainfrom
task/76cc7f25b5f2-m-feat-workflow-add-delegation-watchdog-to-auto-
Open

[m] feat(workflow): add delegation watchdog to auto-recover stalled non-task nodes#437
jaeko44 wants to merge 7 commits intomainfrom
task/76cc7f25b5f2-m-feat-workflow-add-delegation-watchdog-to-auto-

Conversation

@jaeko44
Copy link
Copy Markdown
Member

@jaeko44 jaeko44 commented Mar 25, 2026

Task-ID: 76cc7f25-b5f2-47c3-a959-8aff1eb8edd5\n\nAutomated PR for task 76cc7f25-b5f2-47c3-a959-8aff1eb8edd5

@jaeko44 jaeko44 added the bosun-attached Bosun PR attachment marker label Mar 25, 2026
Copilot AI review requested due to automatic review settings March 25, 2026 17:08
@github-actions
Copy link
Copy Markdown

github-actions bot commented Mar 25, 2026

Bosun CI signal: Bosun-created PR currently has failing checks.

@github-actions github-actions bot added the bosun-needs-fix Attached PR with failing CI that Bosun should pick up for repair label Mar 25, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a “delegation watchdog” mechanism intended to detect and recover stalled delegated workflow runs (especially non-task delegations), wiring defaults through node schema/template variables and adding regression tests.

Changes:

  • Add delegation watchdog config knobs (delegationWatchdogTimeoutMs, delegationWatchdogMaxRecoveries) to action.run_agent and attempt automatic re-execution when a delegated run appears stuck.
  • Add interrupted-run recovery logic in WorkflowEngine.resumeInterruptedRuns() to auto-retry stalled delegations based on persisted watchdog metadata.
  • Propagate watchdog defaults into workflow templates and add tests for the new behavior.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 10 comments.

Show a summary per file
File Description
workflow/workflow-nodes/actions.mjs Adds watchdog config + retry loop around delegated engine.execute() for action.run_agent.
workflow/workflow-engine.mjs Adds watchdog-based retry decision during interrupted-run recovery.
workflow-templates/task-lifecycle.mjs Adds template variables for watchdog defaults.
workflow-templates/github.mjs Adds watchdog defaults to PR watchdog template; wires templated watchdog fields into SDK conflict resolver node.
tests/workflow-task-lifecycle.test.mjs Adds a test asserting watchdog defaults are installed.
tests/workflow-engine.test.mjs Adds tests for watchdog retry/exhaustion in interrupted-run recovery and for action.run_agent delegation stall handling.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +2116 to +2117
delegationWatchdogTimeoutMs: "{{delegationWatchdogTimeoutMs}}",
delegationWatchdogMaxRecoveries: "{{delegationWatchdogMaxRecoveries}}",
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SDK_CONFLICT_RESOLVER_TEMPLATE sets delegationWatchdogTimeoutMs: "{{delegationWatchdogTimeoutMs}}" and delegationWatchdogMaxRecoveries: "{{delegationWatchdogMaxRecoveries}}", but those variables aren’t defined in this template’s variables block. That makes overrides impossible and can leave unresolved template strings flowing into numeric parsing. Either add these to variables (with defaults) or set numeric defaults directly in the node config.

Suggested change
delegationWatchdogTimeoutMs: "{{delegationWatchdogTimeoutMs}}",
delegationWatchdogMaxRecoveries: "{{delegationWatchdogMaxRecoveries}}",
delegationWatchdogTimeoutMs: 900000,
delegationWatchdogMaxRecoveries: 3,

Copilot uses AI. Check for mistakes.
Comment on lines 1071 to 1076
suspiciousDeletionRatio: 3, // e.g. deletes 3× more lines than it adds
minDestructiveDeletions: 500, // absolute floor — small PRs are fine even if net negative
autoApplySuggestions: true, // auto-commit review suggestions before merge
delegationWatchdogTimeoutMs: 300000,
delegationWatchdogMaxRecoveries: 1,
},
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file modifies workflow templates; remember to regenerate and commit ui/demo-defaults.js and site/ui/demo-defaults.js by running node tools/generate-demo-defaults.mjs (enforced by tests/demo-defaults-sync.test.mjs). Without that, the demo-defaults sync test will fail in CI.

Copilot uses AI. Check for mistakes.
Comment on lines +165 to +172
function buildDelegationWatchdogDecision(detail = {}) {
const watchdog = detail?.data?._delegationWatchdog;
if (!watchdog || typeof watchdog !== "object") return null;

const delegationType = String(watchdog.delegationType || "").trim().toLowerCase();
const taskScoped = watchdog.taskScoped === true;
if (taskScoped || delegationType === "task") return null;

Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The interrupted-run watchdog logic depends on detail.data._delegationWatchdog, but a repo-wide search shows no production code that ever sets/persists _delegationWatchdog (only the new helper + tests reference it). As a result, buildDelegationWatchdogDecision() will always return null in real runs and the watchdog won’t actually recover anything. Please add the code path that writes _delegationWatchdog into run detail when a workflow delegates, or base the watchdog decision on existing persisted fields.

Copilot uses AI. Check for mistakes.
Comment on lines +222 to +231
return {
type: "retry",
mode: "from_failed",
reason: `${reasonBase}:retryable`,
nodeId: watchdog.nodeId || null,
elapsedMs,
timeoutMs,
recoveryAttempts,
maxRecoveries,
};
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

buildDelegationWatchdogDecision() returns mode: "from_failed" for watchdog retries, but resumeInterruptedRuns() expects the decision to provide the retry mode and the added tests assert from_scratch. For a stalled delegation, from_scratch also seems safer than resuming from the failed node. Update the watchdog decision to use from_scratch (and keep the decision shape consistent with _chooseRetryModeFromDetail).

Copilot uses AI. Check for mistakes.
Comment on lines 500 to 587
@@ -520,6 +579,11 @@ registerNodeType("action.run_agent", {
subWorkflowName: candidate.name || candidate.id,
subStatus,
subRun,
watchdogRecovered,
recoveredFromStall: watchdogRecovered,
watchdogRetryCount,
failureKind: stalledDelegation ? "stalled_delegation" : undefined,
retryable: stalledDelegation ? true : undefined,
};
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new delegation-watchdog logic was added to workflow/workflow-nodes/actions.mjs, but this node type is also implemented in the monolithic workflow/workflow-nodes.mjs (see registerBuiltinNodeType("action.run_agent", ...)). Per repo convention these copies must stay in sync; otherwise some runtime paths won’t get the watchdog behavior/schema. Please port the same schema fields and execution logic changes to workflow/workflow-nodes.mjs as well.

Copilot uses AI. Check for mistakes.
Comment on lines +71 to +73
taskTimeoutMs: 21600000,
delegationWatchdogTimeoutMs: 300000,
delegationWatchdogMaxRecoveries: 1, // 6 hours
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This template variable block has an incorrect inline comment: delegationWatchdogMaxRecoveries: 1, // 6 hours (the “6 hours” comment belonged to taskTimeoutMs). Please fix/remove the comment so it accurately describes the setting, and keep the taskTimeoutMs annotation if desired for readability.

Suggested change
taskTimeoutMs: 21600000,
delegationWatchdogTimeoutMs: 300000,
delegationWatchdogMaxRecoveries: 1, // 6 hours
taskTimeoutMs: 21600000, // 6 hours
delegationWatchdogTimeoutMs: 300000,
delegationWatchdogMaxRecoveries: 1, // max recovery attempts for delegation watchdog

Copilot uses AI. Check for mistakes.
Comment on lines 68 to 76
claimRenewIntervalMs: 60000,
defaultSdk: "auto",
defaultTargetBranch: "origin/main",
taskTimeoutMs: 21600000, // 6 hours
taskTimeoutMs: 21600000,
delegationWatchdogTimeoutMs: 300000,
delegationWatchdogMaxRecoveries: 1, // 6 hours
prePrValidationEnabled: true,
prePrValidationCommand: "auto",
autoMergeOnCreate: false,
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file changes workflow templates. The repo has a sync test (tests/demo-defaults-sync.test.mjs) that requires regenerating ui/demo-defaults.js and site/ui/demo-defaults.js via node tools/generate-demo-defaults.mjs after template updates. Please run the generator and commit the updated demo-defaults outputs, otherwise CI will fail.

Copilot uses AI. Check for mistakes.
Comment on lines +3311 to 3317
it("installs delegation watchdog defaults for non-task recovery", () => {
const result = installTemplate("template-task-lifecycle", engine);
expect(result.variables.delegationWatchdogTimeoutMs).toBeGreaterThan(0);
expect(result.variables.delegationWatchdogMaxRecoveries).toBe(1);
}); expect(result.variables.maxParallel).toBe(5);
expect(result.variables.taskTimeoutMs).toBe(3600000);
expect(result.variables.defaultSdk).toBe("auto"); // unchanged
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new it("installs delegation watchdog defaults…") block was inserted in the middle of the existing it("installs with variable overrides") test, and expect(result.variables.maxParallel)... now appears after the new test’s closing brace on the same line. This breaks the test structure (and likely syntax). Move the new it(...) to its own top-level block and keep the original assertions inside the override test.

Suggested change
it("installs delegation watchdog defaults for non-task recovery", () => {
const result = installTemplate("template-task-lifecycle", engine);
expect(result.variables.delegationWatchdogTimeoutMs).toBeGreaterThan(0);
expect(result.variables.delegationWatchdogMaxRecoveries).toBe(1);
}); expect(result.variables.maxParallel).toBe(5);
expect(result.variables.taskTimeoutMs).toBe(3600000);
expect(result.variables.defaultSdk).toBe("auto"); // unchanged
expect(result.variables.maxParallel).toBe(5);
expect(result.variables.taskTimeoutMs).toBe(3600000);
expect(result.variables.defaultSdk).toBe("auto"); // unchanged
});
it("installs delegation watchdog defaults for non-task recovery", () => {
const result = installTemplate("template-task-lifecycle", engine);
expect(result.variables.delegationWatchdogTimeoutMs).toBeGreaterThan(0);
expect(result.variables.delegationWatchdogMaxRecoveries).toBe(1);

Copilot uses AI. Check for mistakes.
Comment on lines 4247 to +4266
it("records delegated runs in session tracker for task visibility", async () => {
const handler = getNodeType("action.run_agent");
expect(handler).toBeDefined();

const ctx = new WorkflowContext({
taskId: "TASK-DELEGATE-SESSION",
taskTitle: "Backend migration",
workspaceId: "virtengine-gh",
task: {
id: "TASK-DELEGATE-SESSION",
title: "Backend migration",
tags: ["backend"],
branchName: "feat/backend-migration",
},
});

it("marks stalled delegated non-task workflows retryable and recovers only once", async () => {
const handler = getNodeType("action.run_agent");
expect(handler).toBeDefined();

Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New it(...) blocks for the delegation watchdog were added inside the existing it("records delegated runs…") test (note the missing closing braces before the next it(). This will cause invalid test nesting / syntax errors, and the existing test’s body is now truncated. Close the original test before starting new it(...) blocks, and ensure each test has its own ctx/mockEngine setup.

Copilot uses AI. Check for mistakes.
Comment on lines +4267 to +4275
const ctx = new WorkflowContext({
workspaceId: "virtengine-gh",
ticketId: "GH-123",
delegationWatchdogTimeoutMs: 25,
});

const stalledRun = {
runId: "delegated-stalled-run",
status: WorkflowStatus.RUNNING,
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even after fixing the test nesting, these new watchdog tests set ctx without any task context (taskId/taskTitle), but action.run_agent only enters the workflow-delegation path when hasTaskContext is true. As written, the handler will fall back to engine.services.agentPool.* and never call engine.execute/getRunHistory, so the assertions about delegated runs won’t hold. Provide a minimal task payload (e.g., taskId + taskTitle) or adjust the test to target the non-delegated agent path.

Copilot uses AI. Check for mistakes.
@jaeko44 jaeko44 removed the bosun-needs-fix Attached PR with failing CI that Bosun should pick up for repair label Mar 25, 2026
@github-actions
Copy link
Copy Markdown

github-actions bot commented Mar 25, 2026

Bosun PR classification: Bosun-created.
This PR is tracked by Bosun attachment automation.

  • PR class label: bosun-pr-bosun-created
  • Attach label: bosun-attached (yes, because Bosun-created PRs remain attached regardless of human PR attach mode)
  • Attach mode policy: all
  • Trusted author: no
  • Bosun-created label present: yes
  • Automation scope: Eligible for Bosun repair and merge automation (Bosun-created PR).
  • CI failure signal label: bosun-needs-fix
  • CI failure marker:
  • Trigger: pull_request_target / synchronize

@github-actions github-actions bot added the bosun-needs-fix Attached PR with failing CI that Bosun should pick up for repair label Mar 25, 2026
@jaeko44 jaeko44 removed the bosun-needs-fix Attached PR with failing CI that Bosun should pick up for repair label Mar 25, 2026
@github-actions github-actions bot added the bosun-needs-fix Attached PR with failing CI that Bosun should pick up for repair label Mar 25, 2026
@jaeko44 jaeko44 removed the bosun-needs-fix Attached PR with failing CI that Bosun should pick up for repair label Mar 25, 2026
@github-actions github-actions bot added the bosun-needs-fix Attached PR with failing CI that Bosun should pick up for repair label Mar 25, 2026
@jaeko44 jaeko44 removed the bosun-needs-fix Attached PR with failing CI that Bosun should pick up for repair label Mar 25, 2026
@github-actions github-actions bot added the bosun-needs-fix Attached PR with failing CI that Bosun should pick up for repair label Mar 25, 2026
@jaeko44 jaeko44 removed the bosun-needs-fix Attached PR with failing CI that Bosun should pick up for repair label Mar 25, 2026
@github-actions github-actions bot added the bosun-needs-fix Attached PR with failing CI that Bosun should pick up for repair label Mar 25, 2026
@jaeko44 jaeko44 force-pushed the task/76cc7f25b5f2-m-feat-workflow-add-delegation-watchdog-to-auto- branch from b40c07f to 9c7a2a9 Compare March 25, 2026 22:00
@jaeko44 jaeko44 removed the bosun-needs-fix Attached PR with failing CI that Bosun should pick up for repair label Mar 25, 2026
@github-actions github-actions bot added the bosun-pr-public PR observed by Bosun but not trusted for high-risk automation label Mar 25, 2026
@jaeko44 jaeko44 added bosun-needs-fix Attached PR with failing CI that Bosun should pick up for repair and removed bosun-needs-fix Attached PR with failing CI that Bosun should pick up for repair labels Mar 25, 2026
@jaeko44 jaeko44 force-pushed the task/76cc7f25b5f2-m-feat-workflow-add-delegation-watchdog-to-auto- branch from 9c7a2a9 to 9014dd8 Compare March 25, 2026 22:54
@jaeko44 jaeko44 added bosun-needs-fix Attached PR with failing CI that Bosun should pick up for repair and removed bosun-needs-fix Attached PR with failing CI that Bosun should pick up for repair labels Mar 25, 2026
Co-authored-by: bosun-ve[bot] <262908237+bosun-ve[bot]@users.noreply.github.com>
@jaeko44 jaeko44 added bosun-needs-fix Attached PR with failing CI that Bosun should pick up for repair and removed bosun-needs-fix Attached PR with failing CI that Bosun should pick up for repair labels Mar 26, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bosun-attached Bosun PR attachment marker bosun-needs-fix Attached PR with failing CI that Bosun should pick up for repair bosun-pr-bosun-created PR created by Bosun and eligible for Bosun automation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants