Server resilience — auto-start WSL2, Tailscale, and services on boot #93
Open
Description
Problem
Windows reboot = everything dead. WSL2 doesn't auto-start, Tailscale doesn't reconnect, SSH is unreachable. Requires someone physically at the machine to open a terminal and type wsl.
This just cost us a session — OOM crashed the system, remote reboot brought Windows back but WSL2 stayed down. Nobody could reach it remotely.
What Needs to Happen on Every Boot
- WSL2 auto-starts: Windows scheduled task or startup script
- Tailscale reconnects: tailscale up inside WSL2
- SSH daemon starts: service ssh start
- GPU is accessible: verify CUDA/nvidia-smi works
- Health endpoint: something we can ping from outside to verify
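The checklist above could be verified by a small self-check that runs inside WSL2 after boot. This is only a sketch: the function names check and boot_report are hypothetical, and it assumes that pgrep, tailscale and nvidia-smi are on the PATH and that tailscale status exits non-zero when the node is not connected.

```shell
#!/usr/bin/env bash
# Post-boot self-check sketch (hypothetical helper names).
# Assumes it runs inside the WSL2 distro.

# check NAME CMD...: run CMD quietly, print "ok: NAME" or "FAIL: NAME",
# and return non-zero when CMD fails so callers can react.
check() {
    local name=$1
    shift
    if "$@" >/dev/null 2>&1; then
        echo "ok: $name"
    else
        echo "FAIL: $name"
        return 1
    fi
}

# boot_report: run every check, return non-zero if any of them failed.
boot_report() {
    local failed=0
    check "sshd running" pgrep -x sshd || failed=1
    # Assumption: tailscale status exits non-zero when disconnected.
    check "tailscale connected" tailscale status || failed=1
    check "gpu visible" nvidia-smi || failed=1
    return "$failed"
}
```

Calling boot_report at the end of the boot command (or over SSH from another node) gives one line per item and a single exit status that a monitor can act on.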
Implementation
Windows Scheduled Task (runs at login or boot)
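# Create a scheduled task that starts WSL at boot.
# -u root avoids a sudo password prompt in a non-interactive task.
# Caveat: registered without -User/-Password, the task runs as the current user,
# typically only while that user is logged on; -AtLogOn may be the safer trigger.
$action = New-ScheduledTaskAction -Execute "wsl.exe" -Argument "-d Ubuntu -u root -- bash -c 'service ssh start && tailscale up && echo ready'"
$trigger = New-ScheduledTaskTrigger -AtStartup
Register-ScheduledTask -TaskName "StartWSL" -Action $action -Trigger $trigger -RunLevel Highest
WSL2 /etc/wsl.conf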
[boot]
command = "service ssh start && service cron start"
Tailscale auto-start inside WSL
Add to /etc/rc.local or systemd:
tailscaled &
sleep 2
tailscale up --ssh
Health check (from any machine on the grid)
# Quick check: is the tower alive?
ssh -o ConnectTimeout=5 joel@100.124.122.107 "nvidia-smi --query-gpu=name --format=csv,noheader" 2>/dev/null || echo "TOWER DOWN"
OOM prevention
The crash that caused this was an OOM kill that took down sshd. Add:
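# Protect SSH from the OOM killer (run as root).
# oom_adj is deprecated; oom_score_adj = -1000 exempts the process entirely.
# pgrep -o picks the oldest sshd, i.e. the listener rather than a session child.
echo -1000 > /proc/$(pgrep -o -x sshd)/oom_score_adj
This Is a Grid Problem Too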
Every node on the grid needs to survive reboots. If a tower goes offline during a forge job, the grid scheduler needs to:
- Detect the node is down
- Reassign the job
- Let the node re-register automatically when it comes back
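The detection step above can be sketched as a heartbeat timeout. Everything here is an assumption made for illustration: the issue does not say how the grid scheduler tracks nodes, the function names node_is_down and stale_nodes are hypothetical, and the input format (one "node last_seen" pair per line, timestamps in Unix seconds) is invented.

```shell
#!/usr/bin/env bash
# Hypothetical heartbeat bookkeeping for a grid scheduler.
# All timestamps are Unix seconds.

# node_is_down LAST_SEEN NOW TIMEOUT
# Succeeds (exit 0) when the last heartbeat is older than TIMEOUT seconds.
node_is_down() {
    local last_seen=$1 now=$2 timeout=$3
    [ $((now - last_seen)) -gt "$timeout" ]
}

# stale_nodes NOW TIMEOUT
# Reads "node last_seen" lines on stdin and prints the names of nodes
# whose heartbeat has expired, i.e. whose jobs should be reassigned.
stale_nodes() {
    local now=$1 timeout=$2 node last_seen
    while read -r node last_seen; do
        if node_is_down "$last_seen" "$now" "$timeout"; then
            echo "$node"
        fi
    done
}
```

Under this scheme re-registration needs no special case: a node that comes back simply reports a fresh heartbeat and drops off the stale list on the next pass.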
Dependencies
- Grid infrastructure (reticulum)
- Adapter registry (#92): semantic search, auto-forge, publish (npm for intelligence); nodes need to stay online to serve