Skip to content

Commit f55da53

Browse files
caohy1988claude
andcommitted
docs: Align roadmap with Option A, unify recovery policy, fix fallback gaps
- Roadmap Phase 2 now targets Option A (persistent process) directly instead of cumulative replay (Option C) first - Unify recovery policy: timeout and crash both return error indicating state loss, no automatic replay (consistent across §4.2.3 and §9) - Add Windows NotImplementedError gate to LocalSandboxCodeExecutor - Guard resource import in 3.10 preexec_fn fallback path - Document that PermissionError → container.restart() is the primary timeout path in practice (container runs as root, ADK as non-root) - Add explicit test cases for PermissionError fallback and kill paths - Update open question #1 to focus on I/O boundary protocol design Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent d8692ba commit f55da53

File tree

1 file changed

+79
-43
lines changed

1 file changed

+79
-43
lines changed

docs/design/code_executor_enhancements.md

Lines changed: 79 additions & 43 deletions
Original file line numberDiff line numberDiff line change
@@ -351,18 +351,31 @@ def execute_code(self, invocation_context, code_execution_input):
351351
container. This is the most reliable fallback but destroys
352352
in-container state.
353353

354-
**Permissions:** `os.kill()` on the host PID requires the ADK process
355-
to run as the same user that started the container (or as root). This
356-
is the normal case for local Docker usage. For rootless Docker or
357-
restricted environments, the container-restart fallback applies.
354+
**Permissions and user mismatch risk:** `os.kill()` on the host PID
355+
requires the ADK process to run as the same user that owns the
356+
container's exec'd process on the host. The current
357+
`ContainerCodeExecutor` does not set `user=` on `containers.run()`
358+
(`container_code_executor.py:182`), so the container process runs as
359+
root. If the ADK process runs as a non-root user (common in
360+
development), `os.kill(host_pid, SIGKILL)` will raise
361+
`PermissionError` and the container-restart fallback activates.
362+
363+
This user mismatch is the expected case for default Docker usage,
364+
so the `PermissionError → container.restart()` path is the **primary
365+
timeout mechanism in practice**, not an edge case. The `os.kill` path
366+
becomes primary only when ADK runs as root or the container is
367+
configured with `user=` matching the host user.
358368

359369
**Recovery after timeout:**
360370
- **Stateless mode:** No recovery needed. The killed process is gone;
361371
the next `exec_run` starts a fresh process in the same container.
362-
- **Stateful mode:** The timed-out block is NOT appended to history
363-
(append-after-success invariant). If the container was restarted
364-
as fallback, the executor must detect this and replay the
365-
accumulated history on the next call.
372+
- **Stateful mode (Option A / persistent process):** The timed-out
373+
block is NOT appended to history (append-after-success invariant).
374+
If the persistent REPL was killed or the container was restarted,
375+
the executor returns an error with `stderr` indicating state was
376+
lost. The caller (LLM flow or skill tool) must handle this —
377+
typically by starting a new session. Automatic replay is not
378+
attempted because it may re-execute side effects.
366379

367380
#### 4.2.4 `GkeCodeExecutor` — Already Implemented
368381

@@ -844,9 +857,18 @@ class LocalSandboxCodeExecutor(BaseCodeExecutor):
844857
allowed_env_vars: list[str] = []
845858

846859
def execute_code(self, invocation_context, code_execution_input):
860+
import platform
847861
import subprocess
862+
import sys
848863
import tempfile
849864

865+
# Windows is out of scope (§2 Non-Goals).
866+
if platform.system() == 'Windows':
867+
raise NotImplementedError(
868+
'LocalSandboxCodeExecutor is not supported on '
869+
'Windows. Use ContainerCodeExecutor instead.'
870+
)
871+
850872
with tempfile.NamedTemporaryFile(
851873
mode='w', suffix='.py', delete=True
852874
) as f:
@@ -866,28 +888,32 @@ class LocalSandboxCodeExecutor(BaseCodeExecutor):
866888
if timeout is None:
867889
timeout = self.max_cpu_seconds
868890

869-
import sys
870891
# Prefer process_group (3.11+) over preexec_fn
871892
# (not fork-safe with threads).
872893
spawn_kwargs = {}
873894
if sys.version_info >= (3, 11):
874895
spawn_kwargs['process_group'] = 0
875896
else:
876-
# Fallback for 3.10; caveat: not fork-safe
897+
# Fallback for 3.10; caveat: not fork-safe.
898+
# Guard resource import for platforms where
899+
# the module is unavailable.
877900
def _set_limits():
878-
import resource
879-
resource.setrlimit(
880-
resource.RLIMIT_CPU,
881-
(self.max_cpu_seconds,) * 2,
882-
)
883-
mem = self.max_memory_mb * 1024 * 1024
884-
resource.setrlimit(
885-
resource.RLIMIT_AS, (mem, mem),
886-
)
901+
try:
902+
import resource
903+
resource.setrlimit(
904+
resource.RLIMIT_CPU,
905+
(self.max_cpu_seconds,) * 2,
906+
)
907+
mem = self.max_memory_mb * 1024 * 1024
908+
resource.setrlimit(
909+
resource.RLIMIT_AS, (mem, mem),
910+
)
911+
except (ImportError, OSError):
912+
pass # timeout-only enforcement
887913
spawn_kwargs['preexec_fn'] = _set_limits
888914

889-
# Guard resource import for platforms where
890-
# it is unavailable (falls back to timeout-only).
915+
# Inline wrapper sets resource limits in the child
916+
# process. Guarded for missing resource module.
891917
limit_code = (
892918
f'try:\n'
893919
f' import resource\n'
@@ -1122,8 +1148,11 @@ is new and has no tests yet.
11221148
| Unit tests | Mock-based tests per executor | **Add `test_container_code_executor.py`**, add `test_local_sandbox_code_executor.py` |
11231149
| Integration tests | Real executor tests (like `ExecuteSkillScriptTool` integration tests) | Add Docker-based container tests (CI-gated) |
11241150
| Timeout tests | Scripts with `time.sleep()` to verify enforcement | Per-executor timeout tests |
1151+
| Timeout kill fallback | Verify `PermissionError` from `os.kill` triggers container restart | Mock `os.kill` to raise `PermissionError`, assert `container.restart()` called and `CodeExecutionResult.stderr` contains timeout message |
1152+
| Timeout kill success | Verify `os.kill(host_pid)` path when permitted | Mock `exec_inspect` to return PID, assert `os.kill` called with correct signal |
11251153
| Security tests | Scripts attempting blocked operations | `restrict_builtins` bypass attempts, env var leakage |
11261154
| Stateful tests | Multi-call sequences verifying variable persistence | Append-after-success, failure-does-not-poison, `execution_id` isolation |
1155+
| Stateful crash recovery | Verify error returned on REPL/container crash | Kill REPL mid-execution, assert error indicates state loss |
11271156

11281157
---
11291158

@@ -1141,17 +1170,23 @@ is new and has no tests yet.
11411170
7. Update `ExecuteSkillScriptTool` to set per-invocation timeout via
11421171
`CodeExecutionInput.timeout_seconds`
11431172

1144-
### Phase 2: Stateful Container (3-5 days)
1173+
### Phase 2: Stateful Container (5-8 days)
1174+
1175+
Implement Option A (persistent process) directly, as recommended in
1176+
§5.2.2. This avoids the side-effect replay problems of Option C.
11451177

11461178
1. Unfreeze `stateful` on `ContainerCodeExecutor`
1147-
2. Implement cumulative code history with stdout suppression
1148-
(append-after-success invariant)
1149-
3. Add `execution_id`-based history isolation
1150-
4. Wire `execution_id` in `ExecuteSkillScriptTool`
1151-
5. Add `reset_state()` method
1152-
6. Add stateful execution tests (including failure-does-not-poison test)
1153-
7. Update samples and documentation
1154-
8. Evaluate persistent-process approach (Option A) for Phase 2b
1179+
2. Design persistent-process protocol: sentinel-delimited I/O for
1180+
output boundaries, error detection, and process health checks
1181+
3. Implement persistent Python REPL management (start, send code,
1182+
read output, detect crash/restart)
1183+
4. Add `execution_id`-based session isolation (one REPL per
1184+
`execution_id`)
1185+
5. Wire `execution_id` in `ExecuteSkillScriptTool`
1186+
6. Add `reset_state()` method (kills and restarts the REPL)
1187+
7. Add stateful execution tests (variable persistence, crash recovery,
1188+
`execution_id` isolation)
1189+
8. Update samples and documentation
11551190

11561191
### Phase 3: Security Hardening (5-7 days)
11571192

@@ -1171,13 +1206,12 @@ is new and has no tests yet.
11711206

11721207
## 9. Open Questions
11731208

1174-
1. **Should we skip Phase 1 (cumulative replay) and go straight to
1175-
Phase 2 (persistent process) for stateful execution?**
1176-
The side-effect replay problem is fundamental to Option C. If the
1177-
persistent-process I/O boundary protocol can be solved with
1178-
reasonable complexity (e.g., sentinel-delimited output), the MVP
1179-
phase may not be worth the tech debt. Decision: Evaluate during
1180-
Phase 2 planning.
1209+
1. **What I/O boundary protocol should the persistent REPL use?**
1210+
The roadmap targets Option A (persistent process) directly. The
1211+
key design question is how to delimit output for each code block:
1212+
sentinel strings in stdout, JSON-envelope protocol, or a side
1213+
channel (e.g., file-based result). Sentinel strings are simplest
1214+
but can collide with user output. Decision: spike during Phase 2.
11811215

11821216
2. **Should `LocalSandboxCodeExecutor` support stateful execution?**
11831217
Subprocess-based execution is inherently stateless. Stateful support
@@ -1189,12 +1223,14 @@ is new and has no tests yet.
11891223
`SecurityWarning` and documentation should steer users toward safer
11901224
alternatives for anything beyond local prototyping.
11911225

1192-
4. **How should `ContainerCodeExecutor` handle container crashes in
1193-
stateful mode?**
1194-
If the container crashes (OOM, segfault), the code history is lost.
1195-
Options: (a) re-create container and replay history, (b) return error
1196-
and let user restart, (c) persist history to host volume. Recommend
1197-
(b) for simplicity.
1226+
4. **How should `ContainerCodeExecutor` handle container/REPL crashes
1227+
in stateful mode?**
1228+
If the container crashes (OOM, segfault) or the persistent REPL
1229+
dies, in-process state is lost. The executor returns an error
1230+
indicating state loss and lets the caller handle recovery (e.g.,
1231+
start a new session). Automatic replay is not attempted because
1232+
prior code blocks may have had side effects that should not be
1233+
re-executed (consistent with §4.2.3 recovery policy).
11981234

11991235
---
12001236

0 commit comments

Comments
 (0)