sinfo
# shows the status of nodes
scontrol show node mynode132
- Shows the status of mynode132
- If it is down, the Reason field shows why it is down
- ssh into the node, or use pdsh (HP CMU), to check the status of the slurm daemon
sudo pdsh -w mynode231 'ps -ef |grep slurmd'
- If slurmd is not running, start it with
sudo pdsh -w mynode231 '/opt/slurm/sbin/slurmd'
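- If slurmd was installed as a systemd service (see the rpmbuild note further down), the same check and start can also go through systemctl; the node name here is the same example as above:
sudo pdsh -w mynode231 'systemctl status slurmd'
sudo pdsh -w mynode231 'systemctl start slurmd'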
- Bring a downed node back up
sudo /opt/slurm/bin/scontrol update nodename=mynode113 state=resume
- Take a working node down
sudo /opt/slurm/bin/scontrol update nodename=brcnode123 state=down
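- Two hedged notes on the scontrol update commands above: scontrol requires a reason when setting a node DOWN (or DRAIN), and nodename accepts hostlist ranges so several nodes can be updated at once. The node names and reason text below are only examples:
sudo /opt/slurm/bin/scontrol update nodename=brcnode123 state=down reason="hardware check"
sudo /opt/slurm/bin/scontrol update nodename=mynode[110-115] state=resume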
- When a job is stuck in the CG (completing) state
- Restart the daemon on the head node (slurmctld; restart commands sketched after this list)
- Restart the daemon on the compute node (slurmd)
- Remove the node from SLURM and mount it again
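- A minimal sketch of the two restarts, assuming the systemd units are installed (otherwise run the /opt/slurm/sbin binaries directly, as above); the pdsh target is an example:
sudo systemctl restart slurmctld
sudo pdsh -w mynode231 'systemctl restart slurmd'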
- When sacctmgr yields messages like
sacctmgr: error: slurm_persist_conn_open_without_init: failed to open persistent connection to localhost:6819: Connection refused
sacctmgr: error: slurmdbd: Sending PersistInit msg: Connection refused
sacctmgr: error: Problem talking to the database: Connection refused
- Start or restart slurmdbd
- For RHEL6,
service slurmdbd restart
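- On systemd-based releases (RHEL7 and later), the equivalent should be (assuming the slurmdbd unit was installed):
systemctl restart slurmdbd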
- Runaway jobs: scancel will not be able to remove them, and they are not found in squeue
- Restarting slurmd will not resolve this
sacctmgr show RunawayJobs
- It will ask whether to fix them or not. Enter
y
to fix
- Check the time between the slurm head node and the compute nodes; their clocks must be synchronized
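- One quick way to compare clocks, reusing the pdsh pattern from above (the node range is an example):
date; sudo pdsh -w mynode[101-132] date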
- Check the install guide and build the packages with rpmbuild; the daemons can then be managed with systemctl
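- For reference, the install guide builds the RPMs from the release tarball roughly like this (run in the directory holding the tarball):
rpmbuild -ta slurm-*.tar.bz2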
- When slurmctld yields the message of
not a valid controller
- In slurm.conf, use
SlurmctldHost=hostName(IP.Number.here)
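- For example, with a hypothetical hostname and address:
SlurmctldHost=headnode(192.168.1.10)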
- When slurmd yields the message of
unable to determine this slurmd's NodeName
- In slurm.conf, use the node name given by
hostname -s
at
ControlMachine=nodeName
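- A hedged example with hypothetical names: if hostname -s returns headnode on the head node and mynode101 on a compute node, slurm.conf should carry matching entries such as
ControlMachine=headnode
NodeName=mynode[101-132] ...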
- When a compute node shows as invalid in
sinfo
- Make sure that the node details in slurm.conf match the output of
slurmd -C
on each node - number of cores, threads per core, memory size, ...
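- slurmd -C prints a ready-made NodeName line to compare against (or paste into) slurm.conf; the values below are purely illustrative:
slurmd -C
# NodeName=mynode132 CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=128000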