Stale PID lock file causes cascading Gateway failure: unhandled rejections + session deadlock

### Environment

- OpenViking: 0.2.12 (Python 3.12.3, local mode)
- OpenClaw Gateway: latest (Node.js 22.22.1)
- OS: Linux 6.6.114 (WSL2)

### Problem Description

When a stale PID lock file (`~/.openviking/data/.openviking.pid`) exists from a previous crashed/killed OpenViking process, the entire OpenClaw Gateway becomes **progressively unresponsive**:

1. First message after restart: processed normally (auto-recall/capture fail silently)
2. **Every subsequent message is silently dropped** — the user's Telegram message disappears with no response
3. The session is effectively dead until `/new` or `/reset`

This is not just "OpenViking fails to start" — the failure **cascades** and kills the messaging pipeline.

### Root Cause Analysis

The failure chain involves **three layers**:

#### Layer 1: PID lock + PID recycling race condition (Python side)

`openviking/utils/process_lock.py`:

```python
def _is_pid_alive(pid: int) -> bool:
    try:
        os.kill(pid, 0)
        return True
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # ← assumes process exists
```

- `os.kill(pid, 0)` only checks if **any** process with that PID exists
- On Linux, PIDs are recycled — PID 1200 from a dead OpenViking process may be reused by an unrelated system process
- `_is_pid_alive()` then returns `True` for a completely unrelated process
- Result: `DataDirectoryLocked` is raised and the Python process exits with a non-zero code
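
The false positive is easy to reproduce in isolation on a POSIX system. The following is a minimal sketch, not OpenViking code: `pid_exists` is a hypothetical stand-in that mirrors the existence-only check above. It reports the current Python interpreter as "alive", even though that process is clearly not an OpenViking server.

```python
import os


def pid_exists(pid: int) -> bool:
    """Hypothetical stand-in mirroring the existence-only check above."""
    try:
        os.kill(pid, 0)  # signal 0: error checking only, nothing is delivered
        return True
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # PID exists but is owned by another user


# Any live PID passes, regardless of which program owns it.
print(pid_exists(os.getpid()))
```

The same holds for any PID that happens to be in use when the lock file is read, which is why a recycled PID keeps the data directory locked.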

#### Layer 2: Unhandled promise rejection in plugin `start()` (Node.js side)

`openclaw/extensions/openviking/index.ts`, `service.start()`:

```typescript
try {
    await waitForHealth(baseUrl, timeoutMs, intervalMs);
    // ... success path
} catch (err) {
    localProcess = null;
    child.kill("SIGTERM");
    markLocalUnavailable("startup failed", err);
    // ...
    throw err;  // ← NOT caught by Gateway's service manager
}
```

When `waitForHealth` times out (60s), `throw err` propagates as an **unhandled promise rejection**. The logs confirm:

```
[openclaw] Unhandled promise rejection: Error: OpenViking health check timeout at http://127.0.0.1:1933
    at Timeout.tick [as _onTimeout] (process-manager.ts:14:16)
```

This occurs **repeatedly** (10+ times in today's logs) because the Gateway retries service startup.

#### Layer 3: Session deadlock from cascading rejections

The repeated unhandled rejections then appear to disrupt message handling in the Gateway:
- `before_prompt_build` hook waits up to 5s for `getClient()` → fails → returns (this is handled)
- But the concurrent service restart attempts create competing promise rejections
- The Gateway's session handler gets into an inconsistent state where new incoming messages are queued but never processed
- Result: messages "disappear" — received by Telegram but never reach the agent

### Steps to Reproduce

1. Start OpenViking in local mode (it creates `~/.openviking/data/.openviking.pid`)
2. Kill the OpenViking process ungracefully (e.g., `kill -9`, system crash, OOM)
3. Ensure the PID in the lock file gets recycled to another process (or simulate by writing a PID of a long-running system process like `systemd` to the lock file)
4. Restart the OpenClaw Gateway
5. Send a message via Telegram — first message works, subsequent messages are dropped
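
Step 3 can be simulated without waiting for real PID reuse. The sketch below assumes the lock file stores the PID as plain decimal text, which this report does not confirm; `write_fake_lock` is a hypothetical helper, not part of OpenViking.

```python
import os
from pathlib import Path


def write_fake_lock(lock_path: Path, pid: int) -> None:
    """Write `pid` into the lock file as decimal text (assumed format)."""
    lock_path.parent.mkdir(parents=True, exist_ok=True)
    lock_path.write_text(f"{pid}\n")


# Example call (run after killing OpenViking, before restarting the Gateway).
# The parent shell's PID is alive and is not an OpenViking process:
#
#   write_fake_lock(Path.home() / ".openviking" / "data" / ".openviking.pid",
#                   os.getppid())
```

If the real lock format differs, adjust the written content accordingly before relying on this for reproduction.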

### Suggested Fixes

#### 1. PID lock: verify process identity, not just PID existence

```python
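import os
import sys


def _is_pid_alive(pid: int) -> bool:
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The PID exists but belongs to another user (e.g. a recycled PID now
        # held by a root-owned process such as systemd). Fall through to the
        # identity check instead of assuming it is OpenViking.
        pass
    except OSError:
        if sys.platform == "win32":
            return False
        raise
    # NEW: verify this is actually an OpenViking process (requires Linux /proc)
    try:
        with open(f"/proc/{pid}/cmdline", "rb") as f:
            # Arguments are NUL-separated; join them for a substring match.
            cmdline = f.read().replace(b"\0", b" ").decode("utf-8", errors="replace")
    except OSError:  # includes FileNotFoundError
        return False  # can't verify → assume stale
    return "openviking" in cmdline.lower()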

#### 2. Plugin `start()`: don't throw — mark unavailable gracefully

```typescript
} catch (err) {
    localProcess = null;
    child.kill("SIGTERM");
    markLocalUnavailable("startup failed", err);
    if (stderrChunks.length) {
        api.logger.warn(
            `openviking: startup failed (health check timeout or error).${formatStderrOutput()}`,
        );
    }
    // DON'T throw — let the Gateway continue operating without OpenViking
    // throw err;  // ← remove this
}
```

#### 3. Stale lock auto-cleanup with age threshold

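Add a timestamp-based check: if the lock file has not been modified for more than N seconds (e.g., 300s), treat it as stale regardless of PID status. This is only safe if the running OpenViking process refreshes the lock file's mtime periodically; otherwise a healthy instance that has been up longer than N seconds would lose its lock: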

```python
import os
import time

LOCK_MAX_AGE_SECONDS = 300  # 5 minutes

def _is_lock_stale(lock_path: str, existing_pid: int) -> bool:
    if not _is_pid_alive(existing_pid):
        return True
    # Also check lock file age as a safety net
    try:
        stat = os.stat(lock_path)
        return (time.time() - stat.st_mtime) > LOCK_MAX_AGE_SECONDS
    except OSError:
        return True
```

### Workaround

Delete the stale lock file, then send `SIGUSR1` to the Gateway process after a one-second delay:

```bash
rm -f ~/.openviking/data/.openviking.pid
(sleep 1 && kill -SIGUSR1 $(pgrep -f "openclaw.*gateway")) &
```


Stale PID lock file causes cascading Gateway failure: unhandled rejections + session deadlock #1088