Stale PID lock file causes cascading Gateway failure: unhandled rejections + session deadlock #1088
Environment
- OpenViking: 0.2.12 (Python 3.12.3, local mode)
- OpenClaw Gateway: latest (Node.js 22.22.1)
- OS: Linux 6.6.114 (WSL2)
Problem Description
When a stale PID lock file (~/.openviking/data/.openviking.pid) exists from a previous crashed/killed OpenViking process, the entire OpenClaw Gateway becomes progressively unresponsive:
- First message after restart: processed normally (auto-recall/capture fail silently)
- Every subsequent message is silently dropped — the user's Telegram message disappears with no response
- The session is effectively dead until /new or /reset
This is not just "OpenViking fails to start" — the failure cascades and kills the messaging pipeline.
Root Cause Analysis
The failure chain involves three layers:
Layer 1: PID lock + PID recycling race condition (Python side)
openviking/utils/process_lock.py:
def _is_pid_alive(pid: int) -> bool:
    try:
        os.kill(pid, 0)
        return True
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # ← assumes process exists

- os.kill(pid, 0) only checks if any process with that PID exists
- On Linux, PIDs are recycled: PID 1200 from a dead OpenViking process may be reused by an unrelated system process
- _is_pid_alive() then returns True for a completely unrelated process
- Result: DataDirectoryLocked is raised, Python process exits with non-zero code
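The false positive can be shown in isolation. The sketch below copies the existence-only check under a different name (`is_pid_alive`, so it is not mistaken for the real module) and points it at the current interpreter, which stands in for whatever unrelated process inherited the recycled PID. It assumes a POSIX system; on Windows, `os.kill` with signal 0 does not act as a harmless probe.

```python
import os


def is_pid_alive(pid: int) -> bool:
    """Existence-only check: signal 0 probes the PID without delivering a signal."""
    try:
        os.kill(pid, 0)
        return True
    except ProcessLookupError:
        return False
    except PermissionError:
        return True


# The current interpreter plays the role of the process that reused the PID.
# It is not OpenViking, yet the check still reports the PID as alive.
recycled_pid = os.getpid()
alive = is_pid_alive(recycled_pid)
print(f"PID {recycled_pid} reported alive: {alive}")
```

Nothing in the check ties the PID back to the process that wrote the lock, so any live process with that number keeps the lock "held".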
Layer 2: Unhandled promise rejection in plugin start() (Node.js side)
openclaw/extensions/openviking/index.ts, service.start():
try {
  await waitForHealth(baseUrl, timeoutMs, intervalMs);
  // ... success path
} catch (err) {
  localProcess = null;
  child.kill("SIGTERM");
  markLocalUnavailable("startup failed", err);
  // ...
  throw err; // ← NOT caught by Gateway's service manager
}

When waitForHealth times out (60s), throw err propagates as an unhandled promise rejection. The logs confirm:
[openclaw] Unhandled promise rejection: Error: OpenViking health check timeout at http://127.0.0.1:1933
at Timeout.tick [as _onTimeout] (process-manager.ts:14:16)
This occurs repeatedly (10+ times in today's logs) because the Gateway retries service startup.
Layer 3: Session deadlock from cascading rejections
The unhandled rejections destabilize the Node.js event loop:
- before_prompt_build hook waits up to 5s for getClient() → fails → returns (this is handled)
- But the concurrent service restart attempts create competing promise rejections
- The Gateway's session handler gets into an inconsistent state where new incoming messages are queued but never processed
- Result: messages "disappear" — received by Telegram but never reach the agent
Steps to Reproduce
- Start OpenViking in local mode (it creates ~/.openviking/data/.openviking.pid)
- Kill the OpenViking process ungracefully (e.g., kill -9, system crash, OOM)
- Ensure the PID in the lock file gets recycled to another process (or simulate by writing the PID of a long-running system process like systemd to the lock file)
- Restart the OpenClaw Gateway
- Send a message via Telegram — first message works, subsequent messages are dropped
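The "simulate" variant of the recycling step can be scripted. This is a minimal sketch that assumes the lock file holds a plain decimal PID, which this report does not confirm. The helper name `write_fake_stale_lock` is hypothetical, and the example writes to a scratch directory rather than the real data directory.

```python
import os
import tempfile
from pathlib import Path


def write_fake_stale_lock(lock_path: Path, live_pid: int) -> None:
    """Write a lock file naming a live process that is unrelated to OpenViking.

    Assumption: the lock file format is a single decimal PID.
    """
    lock_path.parent.mkdir(parents=True, exist_ok=True)
    lock_path.write_text(f"{live_pid}\n")


# Scratch location for a dry run. Pointing lock_path at
# ~/.openviking/data/.openviking.pid would set up the actual reproduction.
scratch = Path(tempfile.mkdtemp())
lock_path = scratch / "data" / ".openviking.pid"

# The parent of this script (usually a shell) serves as the long-running,
# unrelated process that "inherited" the PID.
write_fake_stale_lock(lock_path, os.getppid())
print(lock_path, lock_path.read_text().strip())
```

Using the parent process avoids hard-coding a PID such as 1, which may belong to a different program inside containers or WSL2.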
Suggested Fixes
1. PID lock: verify process identity, not just PID existence
import os
import sys

def _is_pid_alive(pid: int) -> bool:
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # PID exists but belongs to another user (e.g. a recycled PID now
        # owned by root). Fall through and verify identity instead of
        # returning True here.
        pass
    except OSError:
        if sys.platform == "win32":
            return False
        raise
    # NEW: verify this is actually an OpenViking process.
    # /proc is Linux-specific; elsewhere open() fails and the lock is treated as stale.
    try:
        with open(f"/proc/{pid}/cmdline", "rb") as f:
            cmdline = f.read().decode("utf-8", errors="replace")
        return "openviking" in cmdline.lower()
    except OSError:  # includes FileNotFoundError
        return False  # can't verify → assume stale

2. Plugin start(): don't throw, mark unavailable gracefully
} catch (err) {
  localProcess = null;
  child.kill("SIGTERM");
  markLocalUnavailable("startup failed", err);
  if (stderrChunks.length) {
    api.logger.warn(
      `openviking: startup failed (health check timeout or error).${formatStderrOutput()}`,
    );
  }
  // DON'T throw: let the Gateway continue operating without OpenViking
  // throw err; // ← remove this
}

3. Stale lock auto-cleanup with age threshold
Add a timestamp-based check: if the lock file is older than N seconds (e.g., 300s), treat it as stale regardless of PID status:
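import os
import time

LOCK_MAX_AGE_SECONDS = 300  # 5 minutes

def _is_lock_stale(lock_path: str, existing_pid: int) -> bool:
    if not _is_pid_alive(existing_pid):
        return True
    # Also check lock file age as a safety net.
    # NOTE: this is only safe if the running process refreshes the lock
    # file's mtime periodically (a heartbeat); otherwise a healthy instance
    # that has been up longer than the threshold would look stale.
    try:
        stat = os.stat(lock_path)
        return (time.time() - stat.st_mtime) > LOCK_MAX_AGE_SECONDS
    except OSError:
        return True

Workaround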
rm -f ~/.openviking/data/.openviking.pid
(sleep 1 && kill -SIGUSR1 $(pgrep -f "openclaw.*gateway")) &
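A more cautious version of the manual cleanup removes the lock only when no OpenViking process owns it. This sketch again assumes the lock file holds a plain decimal PID and relies on the Linux-specific /proc filesystem. The names `pid_matches` and `remove_lock_if_stale` are hypothetical helpers, not part of OpenViking.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def pid_matches(pid: int, marker: str) -> bool:
    """True only if /proc/<pid>/cmdline is readable and contains marker."""
    try:
        raw = Path(f"/proc/{pid}/cmdline").read_bytes()
    except OSError:
        return False  # process is gone, or this platform has no /proc
    return marker.lower() in raw.decode("utf-8", errors="replace").lower()


def remove_lock_if_stale(lock_path: Path, marker: str = "openviking") -> bool:
    """Delete lock_path unless it names a live process whose command line contains marker."""
    try:
        pid = int(lock_path.read_text().strip())
    except (OSError, ValueError):
        pid = -1  # missing or unparsable lock: nothing can own it
    if pid > 0 and pid_matches(pid, marker):
        return False  # looks like a genuine owner, so leave the lock alone
    lock_path.unlink(missing_ok=True)
    return True


# Usage: a lock left behind by a process that has already exited.
child = subprocess.Popen([sys.executable, "-c", "pass"])
child.wait()
lock = Path(tempfile.mkdtemp()) / ".openviking.pid"
lock.write_text(str(child.pid))
removed = remove_lock_if_stale(lock)
print(removed, lock.exists())
```

The Gateway still has to be restarted or signalled afterwards, as in the shell workaround above, because removing the file does not by itself restart the failed service.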