Server resilience: auto-start WSL2, Tailscale, and services on boot #93

## Problem

After a Windows reboot, everything is dead: WSL2 doesn't auto-start, Tailscale doesn't reconnect, and SSH is unreachable. Recovery requires someone physically at the machine to open a terminal and type `wsl`.

This just cost us a session: an OOM condition crashed the system, and a remote reboot brought Windows back but left WSL2 down, so nobody could reach it remotely.

## What Needs to Happen on Every Boot

1. **WSL2 auto-starts**: Windows scheduled task or startup script
2. **Tailscale reconnects**: `tailscaled` running, then `tailscale up` inside WSL2
3. **SSH daemon starts**: `service ssh start`
4. **GPU is accessible**: verify that `nvidia-smi` works inside WSL2
5. **Health endpoint**: something we can probe from outside to verify the node is up
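
The checklist above could be verified by a small script run inside WSL2 after boot. This is a minimal sketch rather than a tested implementation: the `check` and `boot_report` helpers are hypothetical names, and it assumes `pgrep`, `tailscale`, and `nvidia-smi` are on the PATH and that `tailscale status` exits non-zero when the node is not connected.

```bash
#!/usr/bin/env bash
# Hypothetical boot self-check: run one probe per checklist item.

# check LABEL CMD...: run CMD quietly, print a status line, return CMD's result.
check() {
    local label=$1
    shift
    if "$@" >/dev/null 2>&1; then
        echo "OK: $label"
    else
        echo "FAIL: $label"
        return 1
    fi
}

# boot_report: probe each service; returns non-zero if any probe failed.
boot_report() {
    local failed=0
    check "sshd running" pgrep -x sshd || failed=1
    check "tailscale connected" tailscale status || failed=1  # assumed exit-code behaviour
    check "GPU visible" nvidia-smi || failed=1
    return "$failed"
}
```

Calling `boot_report` at the end of the boot sequence prints one line per step, and its exit status could feed whatever health endpoint is chosen later.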

## Implementation

### Windows Scheduled Task (runs at login or boot)
```powershell
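# Create a scheduled task that starts WSL at boot, before anyone logs in.
# WSL distros are per-user, so run as the account that owns "Ubuntu"; -u root avoids a sudo prompt.
# `sleep infinity` keeps a wsl.exe session open so the distro is not stopped as idle.
$cred = Get-Credential
$action = New-ScheduledTaskAction -Execute "wsl.exe" -Argument "-d Ubuntu -u root -- bash -c 'service ssh start && tailscale up && exec sleep infinity'"
$trigger = New-ScheduledTaskTrigger -AtStartup
Register-ScheduledTask -TaskName "StartWSL" -Action $action -Trigger $trigger -RunLevel Highest -User $cred.UserName -Password ($cred.GetNetworkCredential().Password)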
```

### WSL2 /etc/wsl.conf
```ini
[boot]
# Runs as root each time the distro starts (requires Windows 11 or Server 2022)
command = "service ssh start && service cron start"
```

### Tailscale auto-start inside WSL
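Add to `/etc/rc.local` (runs as root), or wrap in a systemd unit. The node must already be authenticated, since `tailscale up` cannot complete an interactive login at boot:
```bash
# Start the daemon in the background, then bring the node up.
tailscaled &
sleep 2  # crude wait for tailscaled to create its socket
tailscale up --ssh
```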
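
A fixed `sleep 2` can race with a slow `tailscaled` start. One option is to poll until the daemon's socket appears. The sketch below is illustrative only: `wait_for` is a hypothetical helper, and the socket path in the usage comment is assumed to be the Linux default.

```bash
#!/usr/bin/env bash
# wait_for TRIES CMD...: retry CMD once per second until it succeeds
# or TRIES attempts have been used up.
wait_for() {
    local tries=$1
    shift
    local i
    for ((i = 0; i < tries; i++)); do
        "$@" >/dev/null 2>&1 && return 0
        sleep 1
    done
    return 1
}

# Usage sketch (as root); the socket path is an assumption:
#   tailscaled &
#   wait_for 30 test -S /var/run/tailscale/tailscaled.sock && tailscale up --ssh
```

Polling bounds the wait at 30 seconds while still returning as soon as the daemon is ready.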

### Health check (from any machine on the grid)
```bash
# Quick check: is the tower alive?
# BatchMode=yes makes ssh fail instead of hanging on a password prompt.
ssh -o BatchMode=yes -o ConnectTimeout=5 joel@100.124.122.107 "nvidia-smi --query-gpu=name --format=csv,noheader" 2>/dev/null || echo "TOWER DOWN"
```
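
To run the same probe against several nodes or from cron, the one-liner could be wrapped in a function. This is a sketch under assumptions: `node_status` and the `SSH_BIN` override are hypothetical names introduced here so the probe can be stubbed out, not part of any existing tooling.

```bash
#!/usr/bin/env bash
# node_status HOST: print "HOST UP <gpu name>" or "HOST DOWN".
# SSH_BIN (hypothetical) lets a caller substitute another command for ssh.
node_status() {
    local host=$1 gpu
    if gpu=$("${SSH_BIN:-ssh}" -o BatchMode=yes -o ConnectTimeout=5 "$host" \
            "nvidia-smi --query-gpu=name --format=csv,noheader" 2>/dev/null); then
        echo "$host UP $gpu"
    else
        echo "$host DOWN"
        return 1
    fi
}
```

For example, `node_status joel@100.124.122.107` prints a single status line and returns non-zero when the tower is unreachable or the GPU query fails.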

### OOM prevention
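The crash that caused this was an OOM kill that took down sshd. Exempt the sshd listener from the OOM killer (as root):
```bash
# Protect SSH from the OOM killer.
# oom_adj is deprecated; oom_score_adj=-1000 exempts the process entirely.
# pgrep -o picks the oldest sshd, which is normally the listening daemon.
echo -1000 > /proc/$(pgrep -o sshd)/oom_score_adj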
```
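
Writing to `/proc` only affects the running process, so the setting is lost when sshd restarts. If systemd is enabled inside WSL, a drop-in with `OOMScoreAdjust=` would make it persistent. The sketch below assumes the unit is named `ssh.service` (as on Ubuntu); `write_oom_dropin` and the `oom.conf` file name are hypothetical.

```bash
#!/usr/bin/env bash
# write_oom_dropin DIR: write a systemd drop-in that sets OOMScoreAdjust.
# DIR would be /etc/systemd/system/ssh.service.d (assumed unit name).
write_oom_dropin() {
    local dir=$1
    mkdir -p "$dir" || return 1
    cat > "$dir/oom.conf" <<'EOF'
[Service]
OOMScoreAdjust=-1000
EOF
}

# Usage sketch (as root, with systemd enabled in WSL):
#   write_oom_dropin /etc/systemd/system/ssh.service.d
#   systemctl daemon-reload && systemctl restart ssh
```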

## This Is a Grid Problem Too
Every node on the grid needs to survive reboots. If a tower goes offline during a forge job, the grid scheduler needs to:
1. Detect that the node is down
2. Reassign the job
3. Let the node re-register automatically when it comes back
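
Step 1 could be approximated with heartbeat timestamps. The sketch below is purely illustrative: the issue does not specify how the grid scheduler tracks nodes, so the `stale_nodes` name, the "node last_seen_epoch" input format, and the timeout are all assumptions.

```bash
#!/usr/bin/env bash
# stale_nodes NOW TIMEOUT: read "node last_seen_epoch" lines on stdin and
# print every node whose last heartbeat is older than TIMEOUT seconds.
stale_nodes() {
    local now=$1 timeout=$2 node seen
    while read -r node seen; do
        [ -n "$node" ] || continue
        if (( now - seen > timeout )); then
            echo "$node"
        fi
    done
}

# Usage sketch, with a hypothetical heartbeat file:
#   stale_nodes "$(date +%s)" 60 < heartbeats.txt
```

Each node printed would be a candidate for job reassignment; re-registration would amount to the node writing a fresh timestamp when it comes back.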

## Dependencies
- Grid infrastructure (reticulum)
- #92 — Adapter registry (nodes need to stay online to serve)
