Add graceful VM shutdown #182
Someone will have to correct me if I'm wrong, but each compute node should be able to query the controller daemon via slurm.conf and squeue. So in theory:

Coupled with https://cloud.google.com/compute/docs/instances/create-use-spot#handle-preemption, this could work.
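As a minimal sketch of how those two pieces could be combined in a GCE shutdown script (the metadata check follows the linked GCP docs; the signal choice and the exact squeue/scancel flags here are only illustrative assumptions):

```bash
#!/bin/bash
# Ask the metadata server whether this VM is being preempted.
PREEMPTED=$(curl -s -H "Metadata-Flavor: Google" \
  "http://metadata.google.internal/computeMetadata/v1/instance/preempted")
if [ "$PREEMPTED" = "TRUE" ]; then
  # Signal every job currently running on this node so it can clean up.
  squeue -w "$(hostname)" -h -o "%i" | xargs -r -I {} scancel --signal=SIGINT --full {}
fi
```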
@bliklabs yeah, that would definitely work. It does however need to set
Good point. Also, the scancel brace might be an issue for line length, so it's likely better to:

Also, maybe something to consider is finding the respective step PIDs locally via the cgroup slice and:

It might also be good to check the behavior of slurmd during this type of process; it could still be polling for work if it's not in drain. Curious what would happen if you masked slurmd and then sent SIGINT to slurmd.
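On the cgroup idea, here is a rough sketch of what locating and signalling step PIDs locally could look like (the cgroup layout differs across Slurm versions and cgroup v1/v2 setups, so the path pattern is an assumption):

```bash
#!/bin/bash
# Walk Slurm-related cgroups and send SIGTERM to every PID listed in their cgroup.procs.
find /sys/fs/cgroup -type f -name cgroup.procs -path '*slurm*' 2>/dev/null |
while read -r procs; do
    while read -r pid; do
        kill -TERM "$pid" 2>/dev/null || true
    done < "$procs"
done
```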
So after some experimentation on our staging cluster, I think I have something which may work well for now:

#!/bin/bash
set -euxo pipefail
# Send SIGTERM to all jobs, both children and the batch script.
# We use SIGTERM as it's the same signal Slurm sends when a job is preempted.
# https://slurm.schedmd.com/scancel.html
# https://slurm.schedmd.com/preempt.html
echo "Shutting down Slurm jobs on $(hostname), sending SIGUSR2/SIGTERM to all jobs..."
squeue -w "$(hostname)" -h -o "%.18i" | xargs -I {} scancel --signal=SIGTERM --full {}
# We also send SIGUSR2 to make sure submitit jobs are handled well.
# https://github.com/facebookincubator/submitit/blob/07f21fa1234e34151874c00d80c345e215af4967/submitit/core/job_environment.py#L152
squeue -w "$(hostname)" -h -o "%.18i" | xargs -I {} scancel --signal=SIGUSR2 --full {}
# Mark node for power down as soon as possible
echo "Marking node $(hostname) for power down to avoid slurm not seeing it..."
scontrol update nodename="$(hostname)" state=power_down reason="Node is shutting down/preempted"
# We wait here for slurmd to ideally shut down gracefully (jobs on the node exit + Slurm shuts the node down),
# holding off the spot instance stop for as long as possible.
SLURMD_PID="$(pgrep -n slurmd)"
while kill -0 "$SLURMD_PID" 2>/dev/null; do
sleep 1
done

This, together with the following fragment in the initialization script:

echo "Installing shutdown script..."
chmod +x /opt/local/slurm/shutdown_slurm.sh
# Based on Google's shutdown script service and https://github.com/GoogleCloudPlatform/slurm-gcp/issues/182
cat <<EOF > /lib/systemd/system/slurm-shutdown.service
[Unit]
Description=Slurm Shutdown Service
Wants=network-online.target rsyslog.service
After=network-online.target rsyslog.service
[Service]
Type=oneshot
ExecStart=/bin/true
RemainAfterExit=true
# This service does nothing on start, and runs shutdown scripts on stop.
ExecStop=/opt/local/slurm/shutdown_slurm.sh
TimeoutStopSec=0
KillMode=process
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl --no-reload --now enable /lib/systemd/system/slurm-shutdown.service

Doing this seems to be okay, as we mark the node to be drained (update node status) and send the TERM signal. I ended up using SIGTERM to match the default behaviour of Slurm when preempting a job due to priority (https://slurm.schedmd.com/preempt.html). The slurm.conf part which is needed:
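As a purely hypothetical illustration of the kind of slurm.conf settings involved (exact names, values, and paths depend on the slurm-gcp deployment, so treat every line here as an assumption rather than the actual config):

```
# Illustrative only -- not the actual config from this cluster.
# Requeue batch jobs by default when they are terminated/preempted.
JobRequeue=1
# Power-saving hooks so "scontrol update ... state=power_down" actually powers the node down
# (slurm-gcp normally points these at its own suspend/resume scripts).
SuspendProgram=/path/to/suspend.py
ResumeProgram=/path/to/resume.py
SuspendTimeout=300
ResumeTimeout=300
```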
I decided on SIGTERM vs SIGINT as SIGINT denotes a user sending it, whereas SIGTERM is a bit more automation-related. This can be used in an sbatch script as follows:

#!/bin/bash
#SBATCH --requeue
#SBATCH --cpus-per-task 1
#SBATCH --ntasks-per-node=1
#SBATCH --nodes=1
set -euxo pipefail
CHDIR="$(pwd)"
TMP_DIR=/tmp
JOB_ID="${SLURM_JOBID:-0}"
sig_handler()
{
echo "Got SIGTERM, saving state"
wait # wait for all children, this is important!
mv "$TMP_DIR/times-$JOB_ID.txt" "$CHDIR"
# Exit code 143 is the default for SIGTERM. This ensures we get rescheduled even if we handled it well.
# Slurm will always reschedule jobs which have been terminated with 143.
exit 143
}
# trap SIGTERM
trap 'sig_handler' SIGTERM
# create file if it doesn't exist
if [ ! -f "./times-$JOB_ID.txt" ]; then
touch "times-$JOB_ID.txt"
fi
cd "$TMP_DIR"
cp "$CHDIR/times-$JOB_ID.txt" .
date >> "./times-$JOB_ID.txt"
srun --jobid "$SLURM_JOBID" bash -c 'sleep 300'
echo "All done!" |
Great solution. The only suggestion I have is a nit: I'm thinking some type of explicit error handling instead of
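A minimal sketch of what explicit error handling in the shutdown script could look like, assuming the idea is to check and report each step rather than rely on set -e aborting the whole script (variable names here are illustrative):

```bash
# Query jobs on this node; warn instead of silently aborting if squeue fails.
if ! jobs_on_node=$(squeue -w "$(hostname)" -h -o "%i"); then
    echo "WARNING: could not query squeue; is the controller reachable?" >&2
    jobs_on_node=""
fi
for job in $jobs_on_node; do
    scancel --signal=SIGTERM --full "$job" \
        || echo "WARNING: failed to signal job $job" >&2
done
```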
Alright, so after some weird debugging and issues reproducing my existing working test, I found a slight issue which may be useful to document for future users (and maybe potentially add to the slurm-gcp repo?). The issue was that the above service definition meant slurm-shutdown.service could be stopped after slurmd, in which case scancel can't be sent to the tasks. Separately, it seems slurmd shuts down without notifying the controller node, which leads to the controller not knowing whether jobs need to be preempted until the node fully disappears (which slurm_sync.py sees as the hostname being non-reachable). So, modifications are needed.
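Since systemd stops units in the reverse of their start ordering, adding After=slurmd.service makes the shutdown unit stop before slurmd does. As a side suggestion (not part of the original setup), the configured ordering can be sanity-checked on a node with:

```bash
# Print the ordering dependencies systemd has recorded for the shutdown unit;
# slurmd.service should show up in its After= list.
systemctl show slurm-shutdown.service --property=After,Before
```

The updated shutdown script: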
#!/bin/bash
# Send SIGTERM to all jobs, both children and the batch script.
# We use SIGTERM as it's the same signal Slurm sends when a job is preempted.
# https://slurm.schedmd.com/scancel.html
# https://slurm.schedmd.com/preempt.html
echo "Shutting down Slurm jobs on $(hostname), sending SIGUSR2/SIGTERM to all jobs..."
squeue -w "$(hostname)" -h -o "%.18i" | xargs -I {} scancel --signal=SIGTERM --full {}
# We also send SIGUSR2 to make sure submitit jobs are handled well.
# https://github.com/facebookincubator/submitit/blob/07f21fa1234e34151874c00d80c345e215af4967/submitit/core/job_environment.py#L152
squeue -w "$(hostname)" -h -o "%.18i" | xargs -I {} scancel --signal=SIGUSR2 --full {}
# Mark node for power down as soon as possible. Note that it's okay to do this here as Slurm
# will still allow jobs to finish, but will not schedule new jobs on this node.
echo "Marking node $(hostname) for power down to avoid slurm not seeing it..."
scontrol update nodename="$(hostname)" state=power_down reason="Node is shutting down/preempted"
# We wait here for slurmd to ideally shut down gracefully (jobs on the node exit + Slurm shuts the node down),
# holding off the spot instance stop for as long as possible.
echo "Waiting for slurmstepd to stop: $(pgrep 'slurmstepd')"
while pkill -0 "slurmstepd"; do
sleep 1
done

if [ -f /opt/local/slurm/shutdown_slurm.sh ]; then
echo "Setting up shutdown service..."
chmod +x /opt/local/slurm/shutdown_slurm.sh
# Based on Google's shutdown script service and https://github.com/GoogleCloudPlatform/slurm-gcp/issues/182
cat <<EOF > /lib/systemd/system/slurm-shutdown.service
[Unit]
Description=Slurm Shutdown Service
# we need to run before slurmd is stopped
After=slurmd.service network-online.target
Wants=slurmd.service network-online.target
[Service]
Type=oneshot
ExecStart=/bin/true
RemainAfterExit=true
# This service does nothing on start, and runs shutdown scripts on stop.
ExecStop=/opt/local/slurm/shutdown_slurm.sh
TimeoutStopSec=0
KillMode=process
[Install]
WantedBy=multi-user.target
EOF
# Force services to stop quickly
mkdir -p /etc/systemd/system.conf.d
cat <<EOF > /etc/systemd/system.conf.d/10-timeout-stop.conf
[Manager]
DefaultTimeoutStopSec=2s
EOF
systemctl daemon-reload
systemctl --now enable /lib/systemd/system/slurm-shutdown.service
echo "Shutdown service set up."
fi

This adds a dependency so the service runs after (and therefore stops before) slurmd.service, which ensures the shutdown script can still run while slurmd is up. I also added a shorter default stop timeout, as the default (90s) is literally the whole shutdown period allowed by GCP. This may not be needed, but I added it while trying to debug why it didn't work before (I assumed some service was slowing down the rest of the stopping services). It was hard to debug because logs don't show up on the GCP logging console, so I had to take a guess, and it seems to work now. Also modified the test script:

#!/bin/bash
#SBATCH --requeue # This will requeue the job if preempted.
#SBATCH --cpus-per-task 1 # Only run with one CPU
#SBATCH --ntasks-per-node=1 # 1 task only
#SBATCH --nodes=1 # 1 node only
#SBATCH --time=6:00 # timeout after 6 minutes
set -uxo pipefail
CHDIR="$(pwd)"
TMP_DIR=/tmp
JOB_ID="${SLURM_JOBID:-0}"
### PREEMPTION HANDLING ###
sig_handler()
{
echo "Got SIGTERM, saving state"
mv "$TMP_DIR/times-$JOB_ID.txt" "$CHDIR"
# Exit code 143 is the default for SIGTERM.
# Slurm will reschedule jobs with --requeue and exit code 143.
exit 143
}
# trap SIGTERM and call sig_handler
trap 'sig_handler' SIGTERM
### JOB SETUP ###
# SLURM_RESTART_COUNT is the number of times the job has been restarted.
# You can use it to avoid redoing some operations, like reloading checkpoints, when the job is restarted.
RESTART_COUNT="${SLURM_RESTART_COUNT:-0}"
echo "Running job $JOB_ID. Restart count: $RESTART_COUNT"
# create file if it doesn't exist
if [ ! -f "$CHDIR/times-$JOB_ID.txt" ]; then
touch "$CHDIR/times-$JOB_ID.txt"
fi
cd "$TMP_DIR"
cp "$CHDIR/times-$JOB_ID.txt" .
### JOB LOGIC ###
date >> "./times-$JOB_ID.txt"
# Note that we run this in the background, so that our signal handler can be called.
srun --jobid "$SLURM_JOBID" bash -c 'sleep 300' &
# Let's wait for signals or end of all background commands
wait
### JOB CLEANUP ###
# This runs if the job didn't get preempted.
echo "All done!"
mv "$TMP_DIR/times-$JOB_ID.txt" "$CHDIR" |
Currently, when a machine is deleted, the Slurm step is interrupted without warning. It would be great to send a SIGINT to all Slurm steps on the machine, so that they can run code to clean up (copy state into GCS, for example).
This is especially significant for Spot VMs.
I have not been able to find out whether Slurm currently handles this well.
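For illustration, a minimal sketch of the kind of job-side cleanup this would enable, assuming a hypothetical checkpoint file and GCS bucket (none of these names come from the project):

```bash
#!/bin/bash
#SBATCH --requeue

# Hypothetical paths/bucket, for illustration only.
CKPT="/tmp/checkpoint-${SLURM_JOBID}.pt"
BUCKET="gs://my-bucket/checkpoints"

cleanup() {
    echo "Got SIGINT, copying state to GCS"
    gsutil cp "$CKPT" "$BUCKET/" || echo "WARNING: checkpoint upload failed" >&2
    exit 130  # 128 + SIGINT(2)
}
trap cleanup SIGINT

# Run the actual work in the background so the trap can fire while we wait.
srun bash -c 'sleep 3600' &
wait
```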