Take a configuration such as:
openhpc_nodegroups:
  - name: cpu
    node_params:
      CPUSpecList: 92-95
Initial deployment works fine. However, if CPUSpecList is changed (e.g. to 90-95), deploying the new configuration puts the affected nodes into an invalid state with Reason=CoreSpec differ. This happens as soon as slurmctld is restarted, probably due to a mismatch between slurmctld and slurmd.
This can be fixed by forcing another configuration update elsewhere in the Slurm configuration.
Is there a safe way to roll out this change? Maybe stop the slurmd services first, then restart slurmctld, and finally start all slurmd services?
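
The ordering suggested above could be sketched as an Ansible playbook like the one below. This is an untested illustration of the proposed sequence, not a confirmed fix. The inventory group names `compute` and `control` are assumptions, and the sketch assumes the updated slurm.conf has already been written to all hosts before the first play runs.

```yaml
# Hypothetical rollout sketch. The group names "compute" and "control" are
# assumptions; adjust them to match the actual inventory.
# Assumes the new slurm.conf (with the changed CPUSpecList) is already in place.

- name: Stop slurmd on the affected compute nodes
  hosts: compute
  become: true
  tasks:
    - name: Stop slurmd
      ansible.builtin.systemd:
        name: slurmd
        state: stopped

- name: Restart slurmctld so it reads the new configuration
  hosts: control
  become: true
  tasks:
    - name: Restart slurmctld
      ansible.builtin.systemd:
        name: slurmctld
        state: restarted

- name: Start slurmd again with the new configuration
  hosts: compute
  become: true
  tasks:
    - name: Start slurmd
      ansible.builtin.systemd:
        name: slurmd
        state: started
```

Because plays run in order and each play finishes on all of its hosts before the next one begins, every slurmd is down before slurmctld restarts. Whether this actually avoids the Reason=CoreSpec differ state still needs to be verified on a test cluster.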