Skip to content

Commit 4bb83a0

Browse files
caohy1988claude
andcommitted
docs: Surface cleanup failure as unhealthy state, add post-kill thread join
- If both os.kill and container.restart() fail, set _healthy=False and return distinct "cleanup failed" error instead of silently returning a normal timeout result - Add thread.join(timeout=2) after kill/restart to prevent daemon thread leaks on repeated timeout failures - Log warning if worker thread is still alive after post-kill join - Add test cases for total cleanup failure and thread leak scenarios Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent f4fd794 commit 4bb83a0

File tree

1 file changed

+44
-3
lines changed

1 file changed

+44
-3
lines changed

docs/design/code_executor_enhancements.md

Lines changed: 44 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -309,6 +309,7 @@ def execute_code(self, invocation_context, code_execution_input):
309309
# is the correct PID for os.kill() on the host.
310310
# (Killing from inside the container would require
311311
# the container-namespace PID, which is different.)
312+
cleanup_failed = False
312313
try:
313314
info = self._client.api.exec_inspect(exec_id)
314315
host_pid = info.get('Pid', 0)
@@ -323,9 +324,34 @@ def execute_code(self, invocation_context, code_execution_input):
323324
# process is terminated.
324325
try:
325326
self._container.restart(timeout=1)
326-
except Exception:
327-
pass
327+
except Exception as restart_err:
328+
cleanup_failed = True
329+
logger.error(
330+
'Timeout cleanup failed: could not kill '
331+
'process or restart container: %s',
332+
restart_err,
333+
)
334+
self._healthy = False
335+
336+
# Give the worker thread a short window to finish
337+
# after kill/restart, so it doesn't leak indefinitely.
338+
thread.join(timeout=2)
339+
if thread.is_alive():
340+
logger.warning(
341+
'Worker thread still alive after timeout '
342+
'cleanup; daemon thread will linger until '
343+
'process exit.'
344+
)
328345

346+
if cleanup_failed:
347+
return CodeExecutionResult(
348+
stderr=(
349+
f'Execution timed out after {timeout}s '
350+
f'and cleanup failed — executor is '
351+
f'unhealthy. Reinitialize the executor '
352+
f'before further use.'
353+
)
354+
)
329355
return CodeExecutionResult(
330356
stderr=f'Execution timed out after {timeout}s'
331357
)
@@ -349,7 +375,20 @@ def execute_code(self, invocation_context, code_execution_input):
349375
concurrent execs or unrelated Python processes)
350376
- Dependency on `procps`/`pkill` being installed in the image
351377

352-
3. **Container restart as last resort** — If `os.kill` fails (e.g.,
378+
3. **Post-kill thread join** — After kill/restart, a short
379+
`thread.join(timeout=2)` gives the worker thread time to exit
380+
cleanly. If it's still alive, a warning is logged. The thread is
381+
a daemon, so it will not prevent process exit, but repeated
382+
timeout failures without this join could accumulate leaked threads.
383+
384+
4. **Unhealthy state on total cleanup failure** — If both `os.kill`
385+
and `container.restart()` fail, the executor sets `self._healthy
386+
= False` and returns a distinct error message. Subsequent calls
387+
should check `self._healthy` and raise early rather than queueing
388+
work against a broken container. Reinitialization (stop + start)
389+
is required to recover.
390+
391+
5. **Container restart as last resort** — If `os.kill` fails (e.g.,
353392
insufficient permissions when Docker runs rootless), restart the
354393
container. This is the most reliable fallback but destroys
355394
in-container state.
@@ -1153,6 +1192,8 @@ is new and has no tests yet.
11531192
| Timeout tests | Scripts with `time.sleep()` to verify enforcement | Per-executor timeout tests |
11541193
| Timeout kill fallback | Verify `PermissionError` from `os.kill` triggers container restart | Mock `os.kill` to raise `PermissionError`, assert `container.restart()` called and `CodeExecutionResult.stderr` contains timeout message |
11551194
| Timeout kill success | Verify `os.kill(host_pid)` path when permitted | Mock `exec_inspect` to return PID, assert `os.kill` called with correct signal |
1195+
| Timeout total failure | Verify both `os.kill` and `container.restart()` fail → unhealthy | Mock both to raise, assert `_healthy` is `False` and `stderr` contains "cleanup failed" |
1196+
| Timeout thread leak | Verify post-kill `join(2)` is called and warning logged if thread lingers | Mock thread to stay alive after kill, assert warning logged |
11561197
| Security tests | Scripts attempting blocked operations | `restrict_builtins` bypass attempts, env var leakage |
11571198
| Stateful tests | Multi-call sequences verifying variable persistence | Append-after-success, failure-does-not-poison, `execution_id` isolation |
11581199
| Stateful crash recovery | Verify error returned on REPL/container crash | Kill REPL mid-execution, assert error indicates state loss |

0 commit comments

Comments
 (0)