Stale PID lock file causes cascading Gateway failure: unhandled rejections + session deadlock #1088

@easonlao

Environment

  • OpenViking: 0.2.12 (Python 3.12.3, local mode)
  • OpenClaw Gateway: latest (Node.js 22.22.1)
  • OS: Linux 6.6.114 (WSL2)

Problem Description

When a stale PID lock file (~/.openviking/data/.openviking.pid) exists from a previous crashed/killed OpenViking process, the entire OpenClaw Gateway becomes progressively unresponsive:

  1. First message after restart: processed normally (auto-recall/capture fail silently)
  2. Every subsequent message is silently dropped — the user's Telegram message disappears with no response
  3. The session is effectively dead until /new or /reset

This is not just "OpenViking fails to start" — the failure cascades and kills the messaging pipeline.

Root Cause Analysis

The failure chain involves three layers:

Layer 1: PID lock + PID recycling race condition (Python side)

openviking/utils/process_lock.py:

def _is_pid_alive(pid: int) -> bool:
    try:
        os.kill(pid, 0)
        return True
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # ← assumes process exists

  • os.kill(pid, 0) only checks if any process with that PID exists
  • On Linux, PIDs are recycled — PID 1200 from a dead OpenViking process may be reused by an unrelated system process
  • _is_pid_alive() then returns True for a completely unrelated process
  • Result: DataDirectoryLocked is raised, Python process exits with non-zero code
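
The existence-only behavior described above can be seen in isolation. The following is a minimal sketch for POSIX systems, not the OpenViking code itself: `pid_exists` is a hypothetical stand-in that mirrors the logic of `_is_pid_alive`, and a short-lived child process plays the role of the crashed OpenViking instance.

```python
import os
import subprocess
import sys


def pid_exists(pid: int) -> bool:
    """Existence-only check (hypothetical name), mirroring _is_pid_alive."""
    try:
        os.kill(pid, 0)  # signal 0: no signal is sent, only error checking
        return True
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # a process exists, but it belongs to another user


# A short-lived child stands in for the crashed OpenViking process.
child = subprocess.Popen([sys.executable, "-c", "pass"])
child.wait()  # reaped here, so its PID is free for reuse
dead_pid = child.pid

# The current interpreter is alive but is not OpenViking; the check
# cannot tell the difference and reports it as a live lock owner.
print("unrelated live process:", pid_exists(os.getpid()))
print("reaped child process:", pid_exists(dead_pid))
```

The check answers "is some process using this PID?", which is why a recycled PID in the lock file is indistinguishable from a running OpenViking instance.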

Layer 2: Unhandled promise rejection in plugin start() (Node.js side)

openclaw/extensions/openviking/index.ts, service.start():

try {
    await waitForHealth(baseUrl, timeoutMs, intervalMs);
    // ... success path
} catch (err) {
    localProcess = null;
    child.kill("SIGTERM");
    markLocalUnavailable("startup failed", err);
    // ...
    throw err;  // ← NOT caught by Gateway's service manager
}

When waitForHealth times out (60s), throw err propagates as an unhandled promise rejection. The logs confirm:

[openclaw] Unhandled promise rejection: Error: OpenViking health check timeout at http://127.0.0.1:1933
    at Timeout.tick [as _onTimeout] (process-manager.ts:14:16)

This occurs repeatedly (10+ times in today's logs) because the Gateway retries service startup.

Layer 3: Session deadlock from cascading rejections

The repeated unhandled rejections then appear to destabilize the Gateway's session handling:

  • before_prompt_build hook waits up to 5s for getClient() → fails → returns (this is handled)
  • But the concurrent service restart attempts create competing promise rejections
  • The Gateway's session handler gets into an inconsistent state where new incoming messages are queued but never processed
  • Result: messages "disappear" — received by Telegram but never reach the agent

Steps to Reproduce

  1. Start OpenViking in local mode (it creates ~/.openviking/data/.openviking.pid)
  2. Kill the OpenViking process ungracefully (e.g., kill -9, system crash, OOM)
  3. Ensure the PID in the lock file gets recycled to another process (or simulate by writing a PID of a long-running system process like systemd to the lock file)
  4. Restart the OpenClaw Gateway
  5. Send a message via Telegram — first message works, subsequent messages are dropped
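
Step 3 can be simulated without waiting for real PID reuse. This is a minimal sketch that assumes the lock file holds only the PID as plain text; that format is an assumption and is not confirmed by the source. `write_fake_stale_lock` is a hypothetical helper name.

```python
from pathlib import Path


def write_fake_stale_lock(lock_path: Path, pid: int) -> None:
    """Write `pid` into the lock file as if a previous run had left it behind.

    Assumption: the lock file contains just the PID as text.
    """
    lock_path.parent.mkdir(parents=True, exist_ok=True)
    lock_path.write_text(f"{pid}\n")
```

For example, calling `write_fake_stale_lock(Path.home() / ".openviking" / "data" / ".openviking.pid", 1)` with OpenViking stopped points the lock at PID 1 (systemd or init on most Linux systems), which is always present.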

Suggested Fixes

1. PID lock: verify process identity, not just PID existence

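import os
import sys

def _is_pid_alive(pid: int) -> bool:
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # A process exists but belongs to another user (e.g. systemd).
        # Don't return True here; fall through to the identity check.
        pass
    except OSError:
        if sys.platform == "win32":
            return False
        raise
    # NEW: verify this is actually an OpenViking process
    if not sys.platform.startswith("linux"):
        return True  # no /proc available → keep the existence-only answer
    try:
        with open(f"/proc/{pid}/cmdline", "rb") as f:
            # cmdline arguments are NUL-separated
            cmdline = f.read().replace(b"\0", b" ").decode("utf-8", errors="replace")
    except OSError:
        return False  # can't verify → assume stale
    return "openviking" in cmdline.lower()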
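
A substring match on the command line can still give a false positive for any process that happens to have "openviking" in its arguments. A stricter variant is sketched below under one assumption: that the lock file could be extended to record the owner's start time next to its PID, which the current lock file is not known to do. `read_start_ticks` and `lock_owner_matches` are hypothetical names, and the approach is Linux-only because it reads `/proc/<pid>/stat`.

```python
from typing import Optional


def read_start_ticks(pid: int) -> Optional[int]:
    """Return the process start time in clock ticks since boot, or None.

    Reads field 22 (starttime) of /proc/<pid>/stat.
    """
    try:
        with open(f"/proc/{pid}/stat", "rb") as f:
            data = f.read().decode("utf-8", errors="replace")
    except OSError:
        return None
    # Field 2 (comm) is wrapped in parentheses and may itself contain spaces
    # or ")", so split on the last ")" and count fields from field 3 onward.
    fields = data.rsplit(")", 1)[-1].split()
    try:
        return int(fields[19])  # field 22 overall = index 19 after field 3
    except (IndexError, ValueError):
        return None


def lock_owner_matches(pid: int, recorded_start_ticks: int) -> bool:
    """True only if `pid` is alive AND started at the recorded time."""
    current = read_start_ticks(pid)
    return current is not None and current == recorded_start_ticks
```

A recycled PID belongs to a process with a different start time, so the pair (PID, start time) identifies the original owner far more reliably than the PID alone.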

2. Plugin start(): don't throw — mark unavailable gracefully

} catch (err) {
    localProcess = null;
    child.kill("SIGTERM");
    markLocalUnavailable("startup failed", err);
    if (stderrChunks.length) {
        api.logger.warn(
            `openviking: startup failed (health check timeout or error).${formatStderrOutput()}`,
        );
    }
    // DON'T throw — let the Gateway continue operating without OpenViking
    // throw err;  // ← remove this
}

3. Stale lock auto-cleanup with age threshold

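Add a timestamp-based check: if the lock file is older than N seconds (e.g., 300s), treat it as stale regardless of PID status. This is only safe if the running process refreshes the lock file's mtime periodically; otherwise a healthy instance that has been up longer than N seconds would lose its lock: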

import os
import time

LOCK_MAX_AGE_SECONDS = 300  # 5 minutes

def _is_lock_stale(lock_path: str, existing_pid: int) -> bool:
    if not _is_pid_alive(existing_pid):
        return True
    # Also check lock file age as a safety net
    try:
        stat = os.stat(lock_path)
        return (time.time() - stat.st_mtime) > LOCK_MAX_AGE_SECONDS
    except OSError:
        return True
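
For the age check to be safe, the live process would need to keep the lock file's mtime fresh. One way to do that is a small heartbeat thread. The following is a minimal sketch, not existing OpenViking code: `start_lock_heartbeat` and `lock_age_seconds` are hypothetical names, and the refresh interval is an assumption that should stay well below `LOCK_MAX_AGE_SECONDS`.

```python
import os
import threading
import time

LOCK_MAX_AGE_SECONDS = 300  # same threshold as the staleness check


def lock_age_seconds(lock_path: str) -> float:
    """Seconds since the lock file was last modified."""
    return time.time() - os.stat(lock_path).st_mtime


def start_lock_heartbeat(
    lock_path: str, interval: float, stop_event: threading.Event
) -> threading.Thread:
    """Refresh the lock file's mtime every `interval` seconds until stopped."""

    def _beat() -> None:
        # Event.wait returns False on timeout and True once the event is set.
        while not stop_event.wait(interval):
            try:
                os.utime(lock_path, None)  # set atime/mtime to the current time
            except OSError:
                return  # lock file is gone; nothing left to refresh

    thread = threading.Thread(target=_beat, daemon=True)
    thread.start()
    return thread
```

The lock owner would start the heartbeat right after acquiring the lock and set `stop_event` on shutdown. A process killed with `kill -9` stops refreshing automatically, so its lock ages out after the threshold.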

Workaround

rm -f ~/.openviking/data/.openviking.pid
(sleep 1 && kill -SIGUSR1 $(pgrep -f "openclaw.*gateway")) &
