Commit a9fbc4f

more robust debug_sleep_and_shutdown
1 parent 9c7bc5d commit a9fbc4f

3 files changed, 23 additions and 7 deletions


CLAUDE.md

Lines changed: 2 additions & 0 deletions
@@ -124,12 +124,14 @@ The runner uses a polling-based approach to determine when to terminate:
 
 3. **Robustness Features**:
 - **Process Monitoring**: Distinguishes between idle Listener and active Worker
+- **Fallback Termination**: Multiple shutdown methods with increasing force
 - **Hook Script Separation**: Scripts fetched from GitHub for maintainability
 
 4. **Clean Shutdown Sequence**:
 - Stop runner processes gracefully (SIGINT with timeout)
 - Deregister all runners from GitHub
 - Flush CloudWatch logs (if configured)
+- Execute shutdown with fallbacks (`systemctl poweroff`, `shutdown -h now`, `halt -f`)
 
 ### AWS Resource Tagging
 By default, launched EC2 instances are Tagged with:

README.md

Lines changed: 1 addition & 0 deletions
@@ -224,6 +224,7 @@ The systemd timer checks every `runner_poll_interval` seconds (default: 10s) and
 #### Robustness Features
 - **Stale Job Detection**: Removes job files older than 3× poll interval (likely disk full)
 - **Worker Process Detection**: Distinguishes between idle runners and active jobs
+- **Multiple Shutdown Methods**: Uses robust termination with fallback to `shutdown -h now`
 
 #### Clean Shutdown Sequence
 1. Stop runner processes gracefully (SIGINT)
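
The stale-job rule listed under Robustness Features (remove job files older than 3× the poll interval) could be sketched roughly as below. This is an illustration only, not the project's code: the function name `remove_stale_jobs`, the idea of passing a jobs directory as an argument, and the use of GNU `stat -c %Y` are all assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: the function name and directory argument are
# assumptions for illustration; they do not come from the commit.
remove_stale_jobs() {
  local jobs_dir="$1"
  local poll_interval="${2:-10}"        # README default: 10s
  local max_age=$((poll_interval * 3))  # "older than 3x poll interval"
  local now file mtime
  now=$(date +%s)

  for file in "$jobs_dir"/*; do
    # Skip anything that is not a regular file (including an unmatched glob)
    [ -f "$file" ] || continue
    # GNU stat: %Y is the modification time in seconds since the epoch
    mtime=$(stat -c %Y "$file" 2>/dev/null) || continue
    if [ $((now - mtime)) -gt "$max_age" ]; then
      rm -f "$file" && echo "removed stale job file: $file"
    fi
  done
}
```

Comparing ages in whole seconds avoids the one-minute granularity of `find -mmin`, which would be too coarse for a 30-second threshold.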

src/ec2_gha/templates/shared-functions.sh

Lines changed: 20 additions & 7 deletions
@@ -66,21 +66,34 @@ debug_sleep_and_shutdown() {
   if [[ "$debug" =~ ^[0-9]+$ ]]; then
     local sleep_minutes="$debug"
     local sleep_seconds=$((sleep_minutes * 60))
-    log "Debug: Sleeping ${sleep_minutes} minutes before shutdown..."
+    log "Debug: Sleeping ${sleep_minutes} minutes before shutdown..." || true
     # Detect the SSH user from the home directory
     local ssh_user=$(basename "$homedir" 2>$dn || echo "ec2-user")
     local public_ip=$(curl -s http://169.254.169.254/latest/meta-data/public-ipv4)
-    log "SSH into instance with: ssh ${ssh_user}@${public_ip}"
-    log "Then check: /var/log/runner-setup.log and /var/log/runner-debug.log"
+    log "SSH into instance with: ssh ${ssh_user}@${public_ip}" || true
+    log "Then check: /var/log/runner-setup.log and /var/log/runner-debug.log" || true
     sleep "$sleep_seconds"
-    log "Debug period ended, shutting down"
+    log "Debug period ended, shutting down" || true
   elif [ "$debug" = "true" ] || [ "$debug" = "True" ] || [ "$debug" = "trace" ]; then
     # Just tracing enabled, no sleep
-    log "Shutting down immediately (debug tracing enabled but no sleep requested)"
+    log "Shutting down immediately (debug tracing enabled but no sleep requested)" || true
   else
-    log "Shutting down immediately (debug mode not enabled)"
+    log "Shutting down immediately (debug mode not enabled)" || true
   fi
-  shutdown -h now
+
+  # Try multiple shutdown methods as fallbacks (important when disk is full)
+  shutdown -h now 2>/dev/null || {
+    # If shutdown fails, try halt
+    halt -f 2>/dev/null || {
+      # If halt fails, try sysrq if available (Linux only)
+      if [ -w /proc/sysrq-trigger ]; then
+        echo 1 > /proc/sys/kernel/sysrq 2>/dev/null
+        echo o > /proc/sysrq-trigger 2>/dev/null
+      fi
+      # Last resort: force immediate reboot
+      reboot -f 2>/dev/null || true
+    }
+  }
 }
 
 # Function to handle fatal errors and terminate the instance
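
The nested `||` chain added to `debug_sleep_and_shutdown` follows a "try each method until one succeeds" pattern. A flattened version of that pattern might look like the sketch below. The helper `try_in_order` is hypothetical and not part of this commit; it takes the commands as arguments so the logic can be exercised without powering off a machine.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the commit): run each command line in
# order and stop at the first one that exits successfully.
try_in_order() {
  local cmd
  for cmd in "$@"; do
    # Each argument is one full command line; it is left unquoted on
    # purpose so the shell splits it into a command and its arguments.
    if $cmd 2>/dev/null; then
      echo "succeeded: $cmd"
      return 0
    fi
    echo "failed: $cmd"
  done
  # Every method failed
  return 1
}

# On a real instance the call might look like this (left commented out
# so the sketch is safe to run):
# try_in_order "shutdown -h now" "halt -f" "reboot -f"
```

Unlike the committed code, this sketch omits the sysrq step, since writing to `/proc/sysrq-trigger` needs a redirection and is not a plain command line.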
