lionrooter · lionrooter · Mar 12, 2026 · Mar 13, 2026 · Mar 13, 2026 · Mar 14, 2026
diff --git a/.workflow/archive/gateway-agent-1006-fix-2026-03-22/plan.md b/.workflow/archive/gateway-agent-1006-fix-2026-03-22/plan.md
@@ -0,0 +1,69 @@
+# Technical Plan: Cron Agent Routing Repair for Self-Learning Jobs
+
+**Status:** Approved
+**Date:** 2026-03-12
+**Flow/Context Builder Output:** Repo-grounded planning identified `clawdbot/src/gateway/server-cron.ts` as the primary fault point. The live runtime config is also missing `maclern`, so a complete fix needs both gateway hardening and a config repair.
+
+## Architecture
+
+- `src/gateway/server-cron.ts` is the cron gateway wiring layer. It currently resolves requested cron agents through `resolveCronAgent`, which falls back to the default agent when the requested agent is absent from `agents.list`.
+- `src/cron/service/timer.ts` passes `job.agentId` and `job.sessionKey` into the gateway callbacks, so the gateway layer is the final authority for cron wake/session routing.
+- `src/cron/isolated-agent/run.ts` already supports requested agent IDs for isolated runs, but workspace selection still depends on runtime config when an explicit per-agent workspace is needed.
+- The live config in `~/.openclaw/openclaw.json` must include `maclern` so the runtime resolves the intended workspace `/Users/lionheart/clawd/agents/maclern` and agent directory `~/.openclaw/agents/maclern/agent`.
+
+## Files to Modify
+
+- `clawdbot/src/gateway/server-cron.ts` — preserve requested isolated cron agent IDs instead of silently falling back to `main` when the agent is missing from config.
+- `clawdbot/src/gateway/server-cron.test.ts` — add regression coverage for a cron agent that is missing from `agents.list`.
+- `.openclaw/openclaw.json` — add the missing `maclern` agent entry following the existing per-agent workspace/agentDir convention.
+
+## New Files
+
+- None.
+
+## Tasks
+
+1. [ ] Patch `server-cron.ts` so `resolveCronAgent` preserves an explicit requested `agentId` for isolated cron routing.
+2. [ ] Add regression tests that cover the missing-agent-in-config case and verify the requested agent is passed into isolated cron runs.
+3. [ ] Patch `~/.openclaw/openclaw.json` to add a `maclern` agent entry with the correct workspace and agent directory.
+4. [ ] Verify there are no other current cron agent IDs missing from `agents.list`.
+5. [ ] Run targeted tests for the gateway cron path.
+6. [ ] Run a review pass over the code and config changes before closing the task.
+
+## Detailed File Plan
+
+### 1) `clawdbot/src/gateway/server-cron.ts`
+
+- Update `resolveCronAgent` to normalize and preserve a non-empty requested `agentId` instead of requiring membership in `agents.list`.
+- Keep the default-agent fallback only for empty or missing `agentId`.
+- Leave main-session constraints enforced by existing cron job validation in `src/cron/service/jobs.ts`.
+
+### 2) `clawdbot/src/gateway/server-cron.test.ts`
+
+- Mock or observe the isolated cron run path so the test can assert that a job with `agentId: "maclern"` is passed through as `maclern` even when runtime config only lists `main`.
+- Add coverage for the summary/wake path if needed to prove the gateway callbacks stay agent-scoped.
+
+### 3) `.openclaw/openclaw.json`
+
+- Add a `maclern` entry under `agents.list` using the existing convention:
+  - `workspace`: `/Users/lionheart/clawd/agents/maclern`
+  - `agentDir`: `/Users/lionheart/.openclaw/agents/maclern/agent`
+- Do not disturb the ordering or semantics of existing agent entries beyond what is necessary.
+
+## Testing Strategy
+
+- Run targeted gateway cron tests covering the new regression.
+- Re-check the current cron agent IDs against `agents.list` after the config patch.
+- If practical, inspect the resolved Maclern routing via a focused local invocation or config readback.
+
+## Rollback Plan
+
+- Revert the `server-cron.ts` change to restore the previous fallback behavior.
+- Remove the `maclern` config entry if it proves incorrect.
+- Existing misrouted historical runs remain untouched.
+
+## Risks and Validation Gates
+
+- The code hardening should not change behavior for default-agent cron jobs.
+- The config patch must not duplicate an existing `maclern` entry or break JSON formatting.
+- Main-session jobs must remain restricted to the default agent through current validation.
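Review note: the routing rule described in section 1 of the plan above can be sketched as follows. This is a minimal illustration, not the actual `server-cron.ts` code. The config shape (`CronConfigSketch`), the `defaultAgentId` field, and the trim/lowercase normalization are assumptions made for the example; only the rule itself (preserve a non-empty requested `agentId`, fall back to the default only when it is empty or missing) comes from the plan.

```typescript
// Hypothetical config shape; the real runtime config type is not shown in this diff.
type CronConfigSketch = {
  agents: { defaultAgentId: string; list: { id: string }[] };
};

// Assumed normalization (trim + lowercase); the real normalizer may differ.
function normalizeAgentId(raw: string | undefined | null): string {
  return (raw ?? "").trim().toLowerCase();
}

// Preserve an explicit agentId even when it is missing from agents.list.
// Fall back to the default agent only for an empty or missing agentId.
function resolveCronAgentSketch(
  cfg: CronConfigSketch,
  requested: string | undefined | null,
): { agentId: string; inConfig: boolean } {
  const normalized = normalizeAgentId(requested);
  const agentId = normalized === "" ? cfg.agents.defaultAgentId : normalized;
  // inConfig lets the caller log or warn when workspace mapping will be missing.
  const inConfig = cfg.agents.list.some(
    (entry) => normalizeAgentId(entry.id) === agentId,
  );
  return { agentId, inConfig };
}
```

Returning `inConfig` alongside the ID is one possible design: routing stays agent-scoped, while the caller can still surface a warning that the workspace mapping depends on the config repair.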
diff --git a/.workflow/archive/gateway-agent-1006-fix-2026-03-22/prd.md b/.workflow/archive/gateway-agent-1006-fix-2026-03-22/prd.md
@@ -0,0 +1,35 @@
+# PRD: Cron Agent Routing Repair for Self-Learning Jobs
+
+**Status:** Approved
+**Date:** 2026-03-12
+**Provenance:** See `.workflow/inputs/original-request.md` and `.workflow/inputs/references.md`.
+
+## Summary
+
+Repair isolated cron routing so Maclern self-learning jobs run as Maclern instead of silently falling back to the default `main` agent. Harden the cron runtime as well, so that future non-default cron agents do not route their self-learning jobs into the wrong agent/session namespace.
+
+## User Stories
+
+- As Bryan, I want Maclern overnight self-learning jobs to run with Maclern’s own agent identity, workspace, and session store so the digest reflects actual training context.
+- As an operator, I want cron jobs for non-default agents to fail or route correctly instead of silently downgrading to `main`.
+- As a future agent owner, I want isolated cron jobs to remain agent-scoped even when the runtime config is incomplete.
+
+## Acceptance Criteria
+
+- [ ] `maclern-nightly-queue-review` and `maclern-overnight-digest` run as `maclern` and no longer resolve to `main` (the fallback previously triggered because `maclern` was absent from `agents.list`).
+- [ ] Isolated cron jobs preserve the requested non-default `agentId` through gateway cron routing, session key selection, wake routing, and isolated run execution.
+- [ ] Maclern resolves to its intended workspace and agent directory for self-learning runs instead of using the default main workspace.
+- [ ] Regression tests cover the missing-agent-in-config case for isolated cron jobs.
+- [ ] The live runtime config includes the Maclern agent entry needed for correct workspace mapping.
+
+## Out of Scope
+
+- Rewriting Maclern’s prompts, training artifacts, or Training Ops product surfaces.
+- Changing main-session cron semantics for the default agent.
+- Backfilling old misrouted Maclern session history.
+
+## Technical Notes
+
+- The root cause is twofold: `server-cron.ts` falls back to `main` when a requested cron agent is not present in runtime config, and the live config currently omits `maclern`.
+- The code change should be minimal and defensible, with regression tests in gateway cron coverage.
+- The runtime config patch should follow the existing per-agent convention already used for agents like `cody`, `storie`, and `grove`.
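Review note: the check that no other cron agent IDs are missing from `agents.list` could be done with a small helper like the one below. This is a sketch under assumptions: the `CronJobSketch` shape and the function name are hypothetical, and jobs without an `agentId` are treated as default-agent jobs and skipped.

```typescript
// Hypothetical job shape; only `agentId` on cron jobs is confirmed by the plan.
type CronJobSketch = { id: string; agentId?: string };

// Return every job whose explicit agentId has no matching entry in the
// configured agent IDs. Jobs without an agentId use the default agent.
function findMissingCronAgents(
  jobs: CronJobSketch[],
  configuredAgentIds: string[],
): { jobId: string; agentId: string }[] {
  const known = new Set(configuredAgentIds.map((id) => id.trim().toLowerCase()));
  const missing: { jobId: string; agentId: string }[] = [];
  jobs.forEach((job) => {
    const agentId = (job.agentId ?? "").trim().toLowerCase();
    if (agentId !== "" && !known.has(agentId)) {
      missing.push({ jobId: job.id, agentId });
    }
  });
  return missing;
}
```

An empty result after the config patch would support the acceptance criterion that the live runtime config covers every agent referenced by a cron job.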
diff --git a/.workflow/inputs/original-request.md b/.workflow/inputs/original-request.md
@@ -1,19 +1,16 @@
 # Original Request
 
-**Date:** 2026-03-08
+**Date:** 2026-03-22
 **Source:** Codex session
 **From:** Bryan
 
 ## The Ask
 
-> help me update openclaw with the newest features -
->
-> and fix this inability to read large posts (that have to go to .txt files)
->
-> Bryan Fisher: PastedText.txt Archie Bot: We have a group chat message from Bryan Fisher in group chat #13:fixing: infrastructure-loop > Infrastructure Expansion (Nanochat). There's a pasted file PastedText.txt. The user didn't write any explicit request, but presumably the pasted text contains a request. We need to open the file to see its contents. Archie Bot: Hey Bryan! I see you attached a text file, but I can’t open it directly from here. Could you let me know what you’d like to do with its contents? Feel free to paste the relevant part or describe the task, and I’ll help you out.
+> yes, put it in a new worktree and fix it
 
 ## Initial Context
 
-- The core OpenClaw media-understanding pipeline can already extract text-like attachments once they enter the media pipeline.
-- Lionroot-specific intake paths appear to drop or underutilize large `.txt` attachments before they reach that extractor.
-- The local clawdbot fork is substantially divergent from upstream and has a dirty worktree, so upstream sync must be handled as a separate, careful track.
+- The resolved-model telemetry bug was fixed separately.
+- A distinct bug remains in the gateway transport path for `openclaw agent`.
+- CLI gateway calls for `agent` can fail with `gateway closed (1006 abnormal closure (no close frame))` and then fall back to embedded local execution.
+- That fallback is unsafe if the gateway had already accepted the run, because it can duplicate execution.
diff --git a/.workflow/plan.md b/.workflow/plan.md
@@ -1,42 +1,51 @@
-# Technical Plan — Repo-wide Lint Cleanup
+# Technical Plan: Gateway Agent 1006 Final-Response Transport Fix
 
-**Status:** Approved
-**Date:** 2026-03-10
-**Flow/Context Builder Output:** Lint backlog profiling shows 410 remaining diagnostics after the initial pass, led by `no-explicit-any`, `curly`, `no-unsafe-optional-chaining`, `no-unused-vars`, and `no-unnecessary-type-assertion`. The first safe tranche targets mechanical fixes only.
+**Status:** Draft
+**Date:** 2026-03-22
+**Source:** RepoPrompt context_builder + plan chat `gateway-1006-agent-57FB48`
 
 ## Architecture
 
-- Lint cleanup should be done in narrow, rule-driven batches.
-- Mechanical rule families (`curly`, `no-unused-vars`) should be preferred first because they are low-risk and often auto-fixable.
-- Higher-cost categories such as `no-explicit-any` should be deferred until after easy wins reduce noise.
+The `openclaw agent` CLI uses `agentViaGatewayCommand()` to call the gateway `agent` RPC with `expectFinal:true`. The gateway method sends two `res` frames on the same request ID: an immediate `accepted` ack and a terminal `ok`/`error` response. The client transport currently waits through the ack, but if the websocket closes before the terminal response arrives, the close is surfaced as a generic error and `agentCliCommand()` unconditionally falls back to embedded local execution. That can duplicate work when the gateway already accepted the run.
 
 ## Files to Modify
 
-### First safe tranche
-
-- `extensions/feishu/src/bot.ts`
-- `extensions/feishu/src/outbound.ts`
-- `extensions/zalo/src/accounts.ts`
-- `extensions/zalo/src/channel.ts`
-
-### Deferred until separately cleaned
-
-- Synology Chat, Mattermost, and Feishu docx helper files touched by exploratory autofix but not yet brought clean.
-
-## Tasks
-
-1. [x] Profile lint backlog by rule and file.
-2. [x] Fix a first mechanical tranche (`curly`, `no-unused-vars`, safe typing cleanup) in the four selected files.
-3. [ ] Commit the first clean tranche.
-4. [ ] Re-profile the backlog and select the next tranche.
-
-## Testing Strategy
-
-- Re-run `oxlint --type-aware` on the selected tranche files.
-- Re-run formatting checks on those files.
-- Preserve previously passing targeted Zulip/typecheck tests.
-
-## Rollback Plan
-
-- Revert the tranche commit if any behavior regresses.
-- Keep future lint cleanup isolated in separate commits by tranche.
+- `src/gateway/client.ts` — add ack-tracking state for `expectFinal:true`, a typed accepted-then-closed error, and disable reconnects for one-shot clients.
+- `src/gateway/call.ts` — configure one-shot gateway clients with `reconnect: false` and surface accepted-then-closed as a distinct error.
+- `src/commands/agent-via-gateway.ts` — stop local fallback when the gateway had already accepted the run; keep fallback for genuine pre-accept failures.
+- `src/gateway/client.test.ts` — cover ack tracking, accepted-then-close, and reconnect-disabled behavior.
+- `src/gateway/call.test.ts` — cover propagation of accepted-then-close for one-shot `callGateway()` usage.
+- `src/commands/agent-via-gateway.test.ts` — cover no-fallback-after-accept and continued fallback for pre-accept failures.
+
+## New Types / Errors
+
+- `GatewayRequestAcceptedError` in `src/gateway/client.ts`
+  - subclass of `Error`
+  - used when a connection closes after an `accepted` ack was already observed for a pending `expectFinal:true` request.
+
+## Implementation Steps
+
+1. Add `ackReceived` tracking to pending requests in `src/gateway/client.ts` and expose whether the client has seen an accepted ack.
+2. Add `GatewayRequestAcceptedError` and emit it when a pending accepted request is interrupted by close.
+3. Add `reconnect?: boolean` to `GatewayClientOptions`; set `reconnect: false` for one-shot `callGateway()` clients.
+4. Update `src/gateway/call.ts` to use the new error type in `onClose` when the request was already accepted.
+5. Update `src/commands/agent-via-gateway.ts` so `agentCliCommand()` does not fall back to embedded execution on `GatewayRequestAcceptedError`.
+6. Add focused tests for `client`, `call`, and CLI fallback behavior.
+7. Verify with a live `openclaw agent` run that the CLI no longer double-executes accepted requests.
+
+## Risks
+
+- `expectFinal:true` behavior is convention-based (`status: "accepted"`), not protocol-versioned. The fix must stay scoped to that existing contract.
+- If other callers rely on generic close handling, typed accepted-close errors must remain backward compatible as `Error` subclasses.
+- Changing the fallback alters user-visible CLI behavior; the replacement error message must state explicitly that the gateway may still be running the request.
+
+## Validation
+
+- Targeted tests:
+  - `bunx vitest run src/gateway/client.test.ts`
+  - `bunx vitest run src/gateway/call.test.ts`
+  - `bunx vitest run src/commands/agent-via-gateway.test.ts`
+- Live verification:
+  - reproduce the previous `1006` close path if possible
+  - confirm no embedded fallback occurs after accept
+  - confirm normal gateway RPCs still work
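Review note: implementation steps 1 and 2 of the plan above (ack tracking plus a typed accepted-then-closed error) might look roughly like this. It is a sketch, not the real `src/gateway/client.ts`. The names `GatewayRequestAcceptedError`, `ackReceived`, and the `status: "accepted"` convention come from the plan; the frame shape, the `PendingRequestsSketch` class, and the constructor arguments are assumptions.

```typescript
// Error name is from the plan; the fields and message are assumed.
class GatewayRequestAcceptedError extends Error {
  readonly requestId: string;
  readonly closeCode: number;
  constructor(requestId: string, closeCode: number) {
    super(`gateway closed (${closeCode}) after accepting request ${requestId}; the run may still be in progress`);
    this.name = "GatewayRequestAcceptedError";
    this.requestId = requestId;
    this.closeCode = closeCode;
    Object.setPrototypeOf(this, GatewayRequestAcceptedError.prototype);
  }
}

// Assumed shape of a `res` frame; ack and final frames share one request ID.
type ResFrameSketch = { id: string; ok: boolean; payload?: { status?: string }; error?: string };

type PendingSketch = {
  expectFinal: boolean;
  ackReceived: boolean;
  resolve: (payload: unknown) => void;
  reject: (err: Error) => void;
};

class PendingRequestsSketch {
  private pending = new Map<string, PendingSketch>();

  request(id: string, expectFinal: boolean): Promise<unknown> {
    return new Promise((resolve, reject) => {
      this.pending.set(id, { expectFinal, ackReceived: false, resolve, reject });
    });
  }

  hasAcceptedAck(id: string): boolean {
    return this.pending.get(id)?.ackReceived ?? false;
  }

  onResponse(frame: ResFrameSketch): void {
    const entry = this.pending.get(frame.id);
    if (!entry) return;
    if (entry.expectFinal && frame.ok && frame.payload?.status === "accepted") {
      entry.ackReceived = true; // record the ack and keep waiting for the terminal frame
      return;
    }
    this.pending.delete(frame.id);
    if (frame.ok) entry.resolve(frame.payload);
    else entry.reject(new Error(frame.error ?? "gateway error"));
  }

  // On close, accepted requests get the typed error; others get a generic one.
  onClose(code: number): void {
    this.pending.forEach((entry, id) => {
      entry.reject(
        entry.ackReceived
          ? new GatewayRequestAcceptedError(id, code)
          : new Error(`gateway closed (${code})`),
      );
    });
    this.pending.clear();
  }
}
```

Because the typed error subclasses `Error`, callers that only check for a generic error should keep working, which matches the backward-compatibility risk listed above.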
diff --git a/.workflow/prd.md b/.workflow/prd.md
@@ -1,28 +1,35 @@
-# PRD — Repo-wide Lint Cleanup
+# PRD: Gateway Agent 1006 Final-Response Transport Fix
 
-**Status:** Approved
-**Date:** 2026-03-10
-**Provenance:** Bryan explicitly asked to continue into the repo-wide lint failures after the typecheck cleanup was committed and pushed on 2026-03-10.
+**Status:** Draft
+**Date:** 2026-03-22
+**Provenance:** See `.workflow/inputs/original-request.md`
 
 ## Summary
 
-Reduce the current repo-wide lint backlog in `clawdbot` with a pragmatic, category-driven cleanup plan aimed at making `pnpm check` materially healthier and eventually green.
+Fix the `openclaw agent` gateway transport path so that when a CLI agent run using `expectFinal:true` hits websocket close code `1006` after the gateway has already accepted the run, the CLI does not incorrectly fall back to embedded local execution.
 
 ## User Stories
 
-- As a maintainer, I want lint failures grouped and attacked by highest-yield categories instead of random file hopping.
-- As an engineer, I want fixes to be mostly mechanical and low-risk.
-- As an operator, I do not want the already-verified Zulip or typecheck fixes to regress during lint cleanup.
+- As a CLI user, I want `openclaw agent` to reliably wait for the gateway's final response so that I do not get silent fallback behavior.
+- As an operator, I want the CLI to distinguish pre-accept gateway failures from post-accept disconnects so that accepted runs are not executed twice.
+- As a maintainer, I want the gateway request transport to preserve normal RPC behavior while making the `agent` ack/final sequence robust.
 
 ## Acceptance Criteria
 
-- [x] Workflow docs exist for this scope.
-- [ ] Lint backlog is profiled by rule and high-impact files.
-- [ ] A first safe tranche of lint fixes lands with targeted verification.
-- [ ] Previously-fixed Zulip and typecheck areas still pass focused verification.
+- [ ] `openclaw agent` no longer falls back to embedded local execution when the gateway already accepted the run.
+- [ ] Gateway close-after-accept is surfaced as a distinct error path, not treated like pre-connect/unreachable failures.
+- [ ] One-shot gateway clients do not perform unnecessary reconnect attempts during `expectFinal:true` agent calls.
+- [ ] Existing non-agent gateway RPC calls continue to work normally.
+- [ ] Tests cover ack-then-close sequencing, no-fallback-after-accept behavior, and continued fallback for genuine pre-accept failures.
 
 ## Out of Scope
 
-- Feature work.
-- Broad refactors that are not needed for lint compliance.
-- Changing lint policy/config unless strictly required.
+- Protocol redesign for all response frames.
+- General websocket keepalive/ping redesign.
+- Dashboard/routing telemetry changes already fixed under separate tasks.
+
+## Technical Notes
+
+- Likely files: `src/gateway/client.ts`, `src/gateway/call.ts`, `src/commands/agent-via-gateway.ts`, and related tests.
+- The current `agent` server method intentionally sends ack and final responses as separate `res` frames with the same request ID.
+- The bug is in the client transport / fallback semantics, not in model routing.
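Review note: the fallback semantics in the acceptance criteria above can be sketched as a single decision point. This is an illustration under assumptions, not the real `agentCliCommand()`: the function name, the callback parameters, and the `RunResultSketch` type are hypothetical, and the error class is a minimal stand-in for the one named in the plan.

```typescript
// Minimal stand-in for the error class named in the plan.
class GatewayRequestAcceptedError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "GatewayRequestAcceptedError";
    Object.setPrototypeOf(this, GatewayRequestAcceptedError.prototype);
  }
}

type RunResultSketch = { via: "gateway" | "embedded"; output: string };

// Try the gateway first. Fall back to embedded execution only when the
// failure happened before the gateway accepted the run.
async function runAgentWithFallbackSketch(
  viaGateway: () => Promise<string>,
  embedded: () => Promise<string>,
): Promise<RunResultSketch> {
  try {
    return { via: "gateway", output: await viaGateway() };
  } catch (err) {
    if (err instanceof GatewayRequestAcceptedError) {
      // The gateway may still be running the request; a local run could duplicate it.
      throw err;
    }
    // Genuine pre-accept failure (unreachable, refused, etc.): fallback is safe.
    return { via: "embedded", output: await embedded() };
  }
}
```

Rethrowing rather than swallowing the typed error keeps the decision about user-facing messaging with the caller.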
diff --git a/docs/contributor/ai-tooling.md b/docs/contributor/ai-tooling.md
@@ -51,6 +51,7 @@ For non-trivial coding work:
 - create and approve `.workflow/plan.md`
 - use `rp-cli` / RepoPrompt for ANCHOR and REVIEW when available
 - follow `ANCHOR -> EXECUTE -> REVIEW -> TEST -> GATE`
+- runtime workflow-lane enforcement is narrower than the procedural `.workflow/` gate: it enforces stage ordering around mutation and finalization, but it does not replace PRD approval, plan approval, or deploy authority
 
 Canonical workflow source: `/Users/lionheart/clawd/workflows/cody/cody_Workflow-SKILL.md`