Server resilience — auto-start WSL2, Tailscale, and services on boot #93

@joelteply

Problem

Windows reboot = everything dead. WSL2 doesn't auto-start, Tailscale doesn't reconnect, SSH is unreachable. Requires someone physically at the machine to open a terminal and type wsl.

This just cost us a session — OOM crashed the system, remote reboot brought Windows back but WSL2 stayed down. Nobody could reach it remotely.

What Needs to Happen on Every Boot

  1. WSL2 auto-starts — Windows scheduled task or startup script
  2. Tailscale reconnects: tailscale up inside WSL2
  3. SSH daemon starts: service ssh start
  4. GPU is accessible — verify CUDA/nvidia-smi works
  5. Health endpoint — something we can ping from outside to verify
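The checklist above could be verified after each boot with a small script run inside WSL2. This is a minimal sketch, not an agreed design: the helper names run_check and boot_checks are hypothetical, and it assumes that `tailscale status` exits non-zero when the node is not connected.

```shell
#!/usr/bin/env bash
# Post-boot checklist sketch. Function names and the choice of probe
# commands are assumptions for illustration, not part of the issue.

# run_check NAME CMD...: run CMD quietly and report "ok: NAME" or "FAIL: NAME".
run_check() {
    local name=$1
    shift
    if "$@" >/dev/null 2>&1; then
        echo "ok: $name"
        return 0
    fi
    echo "FAIL: $name"
    return 1
}

# boot_checks: run every check even if an earlier one fails,
# and return non-zero if any of them failed.
boot_checks() {
    local failed=0
    run_check "sshd running"  pgrep -x sshd    || failed=1
    run_check "tailscale up"  tailscale status || failed=1
    run_check "gpu visible"   nvidia-smi       || failed=1
    return "$failed"
}
```

Calling boot_checks at the end of the boot command would leave a one-line-per-item record of which step did not come up.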

Implementation

Windows Scheduled Task (runs at login or boot)

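# Create a scheduled task that starts WSL at boot (run from an elevated PowerShell prompt).
# WSL distros are registered per user, so run the task as the account that owns "Ubuntu", not SYSTEM.
# -u root avoids a sudo password prompt in the non-interactive session.
$action = New-ScheduledTaskAction -Execute "wsl.exe" -Argument "-d Ubuntu -u root -- bash -c 'service ssh start && tailscale up && echo ready'"
$trigger = New-ScheduledTaskTrigger -AtStartup
$cred = Get-Credential  # stored with the task so it can run before anyone logs in
Register-ScheduledTask -TaskName "StartWSL" -Action $action -Trigger $trigger -RunLevel Highest -User $cred.UserName -Password $cred.GetNetworkCredential().Password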

WSL2 /etc/wsl.conf

[boot]
command = "service ssh start && service cron start"

Tailscale auto-start inside WSL

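Add to /etc/rc.local (or run the equivalent from a systemd unit):

# Start the daemon in the background and give it a moment to come up
tailscaled &
sleep 2
# Bring the node online with Tailscale SSH enabled
tailscale up --ssh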
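A fixed sleep can race with a slow daemon start. One possible alternative is to poll until the daemon's control socket appears. This is a sketch: wait_until and start_tailscale are hypothetical helpers, and the socket path is assumed to be the Linux default, /var/run/tailscale/tailscaled.sock.

```shell
#!/usr/bin/env bash
# wait_until TIMEOUT CMD...: retry CMD once per second until it succeeds
# or roughly TIMEOUT seconds have passed. Returns 1 on timeout.
wait_until() {
    local timeout=$1 waited=0
    shift
    until "$@" >/dev/null 2>&1; do
        if ((waited >= timeout)); then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# start_tailscale: launch the daemon, wait for its socket, then connect.
# Assumes the default socket path; must run as root.
start_tailscale() {
    local sock=/var/run/tailscale/tailscaled.sock
    tailscaled >/var/log/tailscaled.log 2>&1 &
    if ! wait_until 10 test -S "$sock"; then
        echo "tailscaled did not start" >&2
        return 1
    fi
    tailscale up --ssh
}
```

Calling start_tailscale from the boot script fails loudly if the daemon never comes up, instead of letting tailscale up fail silently.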

Health check (from any machine on the grid)

# Quick check: is the tower alive?
ssh -o ConnectTimeout=5 joel@100.124.122.107 "nvidia-smi --query-gpu=name --format=csv,noheader" 2>/dev/null || echo "TOWER DOWN"
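A single failed probe during a reboot can be a false alarm, so the check could retry a few times before declaring the node down. This is a sketch only: probe, check_node, and the RETRY_DELAY variable are hypothetical names. BatchMode=yes is added so the probe fails instead of waiting at a password prompt.

```shell
#!/usr/bin/env bash
# probe HOST: one SSH attempt that prints the GPU name on success.
probe() {
    ssh -o ConnectTimeout=5 -o BatchMode=yes "$1" \
        "nvidia-smi --query-gpu=name --format=csv,noheader" 2>/dev/null
}

# check_node HOST [ATTEMPTS]: retry the probe up to ATTEMPTS times
# (default 3), pausing RETRY_DELAY seconds (default 2) between tries.
# Prints "TOWER DOWN" and returns 1 if every attempt fails.
check_node() {
    local host=$1 attempts=${2:-3} delay=${RETRY_DELAY:-2} i
    for ((i = 1; i <= attempts; i++)); do
        if probe "$host"; then
            return 0
        fi
        if ((i < attempts)); then
            sleep "$delay"
        fi
    done
    echo "TOWER DOWN"
    return 1
}
```

Usage would look like `check_node joel@100.124.122.107`, with the exit status usable from cron or a monitoring script.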

OOM prevention

The crash that caused this was an OOM kill that took down sshd. Add:

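# Protect SSH from the OOM killer (run as root).
# oom_adj is deprecated; oom_score_adj = -1000 is the current equivalent of oom_adj = -17.
# pgrep -o -x picks the oldest process named exactly sshd, i.e. the listener.
echo -1000 > /proc/$(pgrep -o -x sshd)/oom_score_adj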
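The adjustment applies only to the running process and is lost when sshd restarts, so it would need to be reapplied on every boot. The helper below is a sketch that writes the value through oom_score_adj (the current kernel interface, where -1000 exempts a process the way -17 did under oom_adj) and confirms it took effect. The name protect_from_oom and the PROC_ROOT override are hypothetical; the override exists only so the logic can be exercised without root.

```shell
#!/usr/bin/env bash
# protect_from_oom PID: exempt PID from the OOM killer and verify the write.
# PROC_ROOT defaults to /proc; overriding it is a testing convenience only.
protect_from_oom() {
    local pid=$1 proc_root=${PROC_ROOT:-/proc}
    local file="$proc_root/$pid/oom_score_adj"
    if [ ! -e "$file" ]; then
        echo "no such process: $pid" >&2
        return 1
    fi
    # Lowering the score requires root (CAP_SYS_RESOURCE) on a real /proc.
    echo -1000 > "$file" || return 1
    [ "$(cat "$file")" = "-1000" ]
}
```

In a boot script this could be called as `protect_from_oom "$(pgrep -o -x sshd)"` right after the SSH daemon starts.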

This Is a Grid Problem Too

Every node on the grid needs to survive reboots. If a tower goes offline during a forge job, the grid scheduler needs to:

  1. Detect the node is down
  2. Reassign the job
  3. When the node comes back, it re-registers automatically
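The issue does not describe how the scheduler tracks nodes, so the following is only a sketch of step 1 under an assumed design: each node periodically reports a heartbeat, and the scheduler treats any node whose last heartbeat is older than a timeout as down. The function name stale_nodes and the "node last_seen_epoch" input format are hypothetical.

```shell
#!/usr/bin/env bash
# stale_nodes NOW TIMEOUT: read "node last_seen_epoch" lines on stdin and
# print the name of every node whose last heartbeat is more than TIMEOUT
# seconds older than NOW (both in epoch seconds).
stale_nodes() {
    local now=$1 timeout=$2 node seen
    while read -r node seen; do
        [ -n "$node" ] || continue
        if ((now - seen > timeout)); then
            echo "$node"
        fi
    done
}
```

A scheduler loop could feed its heartbeat table through `stale_nodes "$(date +%s)" 30` and reassign the jobs of whatever names come back. A returning node would rejoin simply by resuming its heartbeats.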

Dependencies
