docs: add Paude backend feasibility & design spec #57
tiwillia wants to merge 1 commit into jsell-rh:main
Evaluate paude (github.com/bbrowning/paude) as a SessionBackend that runs Claude Code in isolated Podman containers or OpenShift pods. All 13 interface methods map via container exec + tmux commands. Podman viable for up to ~10 agents; OpenShift marginal (exec latency exceeds liveness budget at scale). Calibrated against real 11-agent deployment data from sdk-backend-replacement.
jsell-rh
left a comment
Review — PaudeSessionBackend Feasibility Spec
Overall: Strong analysis. The interface mapping is complete, the latency calibration against real sdk-backend-replacement data is compelling, and the phase ordering recommendation is sound. A few technical concerns to address before merge.
Issues to fix (required)
1. execInSession lacks context — goroutines can hang
execInSession uses plain exec.Command (no context). On OpenShift, oc exec can hang indefinitely on network issues (connection dropped mid-exec, API server unresponsive). With liveness polling at 5s intervals and 10+ agents, a single hung exec blocks that goroutine indefinitely. The helper must accept a context.Context with a per-call deadline (e.g., 10s for Podman, 30s for OpenShift).
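A minimal sketch of what a context-aware helper could look like, assuming the helper receives the full argv for the exec call. The signature, the `execTimeout` and `pollOnce` helpers, and the backend labels `"podman"` and `"openshift"` are illustrative assumptions, not taken from the spec; the timeouts are the values suggested above.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"os/exec"
	"time"
)

// Per-call deadlines suggested in this review; tune against measured latency.
const (
	podmanExecTimeout    = 10 * time.Second
	openshiftExecTimeout = 30 * time.Second
)

// execTimeout picks the per-call deadline for a backend.
// The backend labels are hypothetical, not identifiers from the spec.
func execTimeout(backend string) time.Duration {
	if backend == "openshift" {
		return openshiftExecTimeout
	}
	return podmanExecTimeout
}

// execInSession runs argv and returns its stdout. The child process is
// killed when ctx is cancelled or the per-call deadline elapses, so a hung
// exec cannot block the calling goroutine indefinitely.
func execInSession(ctx context.Context, backend string, argv ...string) (string, error) {
	if len(argv) == 0 {
		return "", fmt.Errorf("execInSession: empty command")
	}
	ctx, cancel := context.WithTimeout(ctx, execTimeout(backend))
	defer cancel()

	var stdout, stderr bytes.Buffer
	cmd := exec.CommandContext(ctx, argv[0], argv[1:]...)
	cmd.Stdout = &stdout
	cmd.Stderr = &stderr
	if err := cmd.Run(); err != nil {
		// Report deadline/cancellation explicitly so callers can tell a
		// timeout apart from a non-zero exit status.
		if ctxErr := ctx.Err(); ctxErr != nil {
			return "", fmt.Errorf("exec %q: %w", argv, ctxErr)
		}
		return "", fmt.Errorf("exec %q: %w (stderr: %s)", argv, err, stderr.String())
	}
	return stdout.String(), nil
}

// pollOnce shows how a single probe might call the helper. A real poller
// would pass its own context so that shutdown cancels in-flight execs.
func pollOnce(backend string, argv ...string) (string, error) {
	return execInSession(context.Background(), backend, argv...)
}
```

Because the timeout error wraps `ctx.Err()`, a caller can check `errors.Is(err, context.DeadlineExceeded)` to tell a timed-out probe from a command that exited with a non-zero status.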
2. shellQuote undefined and security-critical
SendInput calls shellQuote(text) without defining it. This is the primary shell-injection risk: text from the coordinator (which could include $(...), backticks, or single quotes) is interpolated into a bash -c command string. The spec must define the quoting strategy. Suggested: wrap the argument in single quotes and replace each embedded single quote with '\'' (close the quoted string, emit an escaped quote, reopen). This must be in the spec before implementation.
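The suggested strategy can be sketched as follows. This is one possible definition of `shellQuote`, assuming the result is parsed by exactly one POSIX-style shell (the `bash -c` layer); it is not the spec's definition, since the spec does not provide one.

```go
package main

import "strings"

// shellQuote wraps s in single quotes for a POSIX shell. Inside single
// quotes nothing is expanded, so $(...), backticks, and backslashes are
// inert. An embedded single quote is rewritten as '\'' : close the quoted
// string, emit an escaped quote, then reopen the quoted string.
//
// Note: this protects one level of shell parsing only. If the command
// string passes through a second shell, it must be quoted again for it.
func shellQuote(s string) string {
	return "'" + strings.ReplaceAll(s, "'", `'\''`) + "'"
}
```

For example, the input `it's $(whoami)` becomes `'it'\''s $(whoami)'`, which bash reads back as the original literal text without running the substitution.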
3. podman start workaround for missing --no-attach is unsafe
The workaround proposes podman start paude-{name} instead of paude start. But paude start does more than start the container — it syncs credentials, resets the credential TTL, and for OpenShift handles StatefulSet scale-up with init container coordination. Bypassing the CLI could leave credentials stale or skip initialization. Recommend treating option 1 (upstream --no-attach flag) as the only safe path. This is a blocker for Phase 3a without the upstream fix.
4. Podman host networking note is incomplete
The spec says "host networking or explicit port forwarding" for boss API access from Podman containers. Standard Podman provides host.containers.internal as the magic hostname that resolves to the host. Agents would configure BOSS_URL=http://host.containers.internal:8899. This should be explicit — it affects the ignition flow.
Recommendations
5. Batch exec should be mandatory, not optional
For OpenShift, even with --yolo, 5 agents × 1s exec = 5s per liveness tick — already over the 3s polling budget. Batching should be the default implementation strategy, not a mitigation.
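One way to read "batching" here is to fold several tmux queries for a session into a single exec, so each agent costs one round trip per tick regardless of how many values are polled. The sketch below works under that assumption; `buildBatchScript`, `splitBatchOutput`, and the marker value are hypothetical names introduced for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// batchMarker separates the output of consecutive commands. Hypothetical
// value; it must be something that cannot occur in real pane output.
const batchMarker = "---paude-batch-7f3a---"

// buildBatchScript joins cmds into one script intended for a single
// `bash -c` invocation. After each command it prints the marker on its own
// line, so one exec returns every result. Commands are joined with ';', so
// a failing command does not prevent the later ones from running.
func buildBatchScript(cmds []string) string {
	var b strings.Builder
	for _, c := range cmds {
		b.WriteString(c)
		b.WriteString("; printf '\\n%s\\n' '" + batchMarker + "'\n")
	}
	return b.String()
}

// splitBatchOutput cuts the combined stdout back into n per-command
// sections, in the order the commands were given.
func splitBatchOutput(raw string, n int) ([]string, error) {
	parts := strings.Split(raw, "\n"+batchMarker+"\n")
	// Every command is followed by a marker, so a complete run yields n
	// sections plus one empty trailing element.
	if len(parts) != n+1 || parts[n] != "" {
		return nil, fmt.Errorf("batch output: want %d sections, got %d", n, len(parts)-1)
	}
	return parts[:n], nil
}
```

A liveness tick could, for instance, batch `tmux capture-pane -p -t <target>` with `tmux display-message -p -t <target> '#{window_activity}'` and pay for one exec instead of two. A missing trailing marker signals a truncated or timed-out run, which the caller can treat as a failed probe.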
6. Option B idle detection adds a hidden exec call
Prompt heuristics (Option B) require capturing the terminal pane — this actually increases exec calls vs. Option A (window activity timestamp). Option B only makes sense when batched with CaptureOutput. The recommendation should be explicit about this interaction.
7. paude list parsing fragility needs a documented example
The spec acknowledges paude list outputs a human-readable table but does not show what it looks like or sketch a parser. Since ListSessions and SessionExists both depend on this parse, the spec should include a sample output and a parsing sketch, or explicitly classify --json upstream as a Phase 3 blocker.
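As a starting point for the parsing sketch, a header-driven parser is less brittle than fixed column offsets. The code below assumes only that the output is a whitespace-aligned table whose first non-empty line is a header row; the actual `paude list` layout and column names are not shown in the spec, so the header name passed in is an assumption to verify against real output.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseTableColumn extracts one column from a whitespace-aligned table.
// The first non-empty line is treated as the header row and the column is
// located by name (case-insensitive), not by position.
//
// Limitation: cells are split on whitespace, so a value containing spaces
// in this or an earlier column shifts the fields and yields wrong results.
func parseTableColumn(output, header string) ([]string, error) {
	sc := bufio.NewScanner(strings.NewReader(output))
	col := -1
	var values []string
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) == 0 {
			continue // skip blank lines
		}
		if col < 0 {
			// First non-empty line: find the requested header.
			for i, f := range fields {
				if strings.EqualFold(f, header) {
					col = i
					break
				}
			}
			if col < 0 {
				return nil, fmt.Errorf("column %q not found in header %q", header, sc.Text())
			}
			continue
		}
		if col >= len(fields) {
			return nil, fmt.Errorf("row %q has no column %d", sc.Text(), col)
		}
		values = append(values, fields[col])
	}
	return values, sc.Err()
}

// contains reports whether name is among values; a SessionExists
// implementation could be built on it.
func contains(values []string, name string) bool {
	for _, v := range values {
		if v == name {
			return true
		}
	}
	return false
}
```

Returning an error when the header is missing makes a format change upstream fail loudly instead of silently reporting zero sessions, which matters because both ListSessions and SessionExists would depend on this parse.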
Verdict
Approve with required changes for issues 1–3. Issues 5–7 can be addressed in the implementation phase or as follow-up tasks. The core analysis is accurate — paude-podman is viable for up to ~10 agents (validated against sdk-backend-replacement 11-agent data), paude-openshift is limited to 1–3 agents. Phase ordering is correct.