This guide focuses on setting up a hybrid Slurm cluster. A hybrid deployment brings its own challenges and considerations; this guide covers them and their recommended solutions.
There is a clear separation between how on-prem and cloud resources are managed within your hybrid cluster, which means you can modify either side without disrupting the other: you manage your on-prem resources, and our Slurm cluster module manages the cloud.
See Cloud Scheduling Guide for additional information.
NOTE: Manual configuration steps are required to finish the hybrid setup.
Terraform is used to set up and manage most cloud resources for your hybrid cluster, ensuring that the cloud contains the resources described in your terraform project.
We provide terraform modules that support the hybrid cluster use case.
Specifically, `slurm_controller_hybrid` is responsible for generating slurm configuration files based upon your configurations, along with our cloud scripts (e.g. `ResumeProgram`, `SuspendProgram`), for your on-premise controller to use.
There is a set of scripts and files that support creating and terminating nodes in the cloud:

- `cloud_gres.conf`
  - Contains Slurm GRES configuration lines for cloud compute GRES resources.
  - To be included in your `gres.conf`.
- `cloud.conf`
  - Contains Slurm configuration lines to support a hybrid/cloud environment.
  - To be included in your `slurm.conf`.
  - WARNING: Certain lines may need reconciliation with your `slurm.conf` (e.g. `SlurmctldParameters`).
- `config.yaml`
  - Encodes information about your configuration and compute resources for `resume.py` and `suspend.py`.
- `resume.py`
  - The `ResumeProgram` in `slurm.conf`.
  - Creates compute node resources based upon Slurm job allocation and configured compute resources.
- `slurmsync.py`
  - Synchronizes the Slurm state and the GCP state, reducing discrepancies from manual admin activity or other edge cases.
  - May update Slurm node states, and create or destroy GCP compute resources or other script-managed GCP resources.
  - To be run under `crontab` or `systemd` on an interval.
- `startup.sh`
  - Compute node startup script.
- `suspend.py`
  - The `SuspendProgram` in `slurm.conf`.
  - Destroys compute node resources based upon Slurm job deallocation.
- `util.py`
  - Contains utility functions for the other Python scripts.
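Since `slurmsync.py` is meant to run on an interval, one option is a crontab entry for the SlurmUser on the controller. A minimal sketch, assuming the generated files were installed to `/etc/slurm` (adjust the path, interpreter, and interval to your deployment):

```sh
# Hypothetical crontab entry (edit with `crontab -e` as the SlurmUser):
# run slurmsync.py once a minute; /etc/slurm is an assumed install_dir.
* * * * * /usr/bin/python3 /etc/slurm/slurmsync.py
```

A `systemd` timer works equally well and is what the generated `slurmcmd` units provide; use one mechanism, not both.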
By default, the compute resources in GCP use configless mode to manage their `slurm.conf`.
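In configless mode, compute nodes do not need a local copy of `slurm.conf`; `slurmd` fetches the configuration from the controller at startup (this requires `enable_configless` in the controller's `SlurmctldParameters`). A minimal sketch of what this looks like on a node, where `controller` is a placeholder hostname for your slurmctld host:

```sh
# With configless mode, slurmd pulls its configuration from slurmctld
# at startup instead of reading a local /etc/slurm/slurm.conf.
# "controller" is a placeholder for your actual slurmctld hostname.
slurmd --conf-server controller
```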
Terraform is used to manage the cloud resources within your hybrid cluster. The slurm_cluster module, when in hybrid mode, creates the required files to support an on-premise controller capable of cloud bursting.
See the Slurm cluster module for details.
If you are unfamiliar with terraform, please check out the documentation and starter guide to get familiar.
See the test cluster example for an extensible and robust example. It can be configured to handle creation of all supporting resources (e.g. network, service accounts) or leave that to you. Slurm can be configured with partitions and nodesets as desired.
NOTE: It is recommended to use the slurm_cluster module in your own terraform project. It may be useful to copy and modify one of the provided examples.
Alternatively, see HPC Blueprints for HPC Toolkit examples.
- Communication between on-premise and GCP. This is commonly accomplished with a VPN of some kind, either software or hardware; the kind of VPN usually depends on throughput needs.
- Bidirectional DNS between on-premise and GCP
- Open ports and firewall rules:
  - Slurm communication (`slurmctld`, `slurmdbd`, `SrunPortRange`)
  - NFS and network mounts
There are two options:
- set up DNS between the on-premise network and the GCP network
- configure Slurm to use `NodeAddr` to communicate with cloud compute nodes
In the end, the slurmctld and any login nodes should be able to communicate with cloud compute nodes, and the cloud compute nodes should be able to communicate with the controller.
- Configure DNS peering
  - GCP instances need to be resolvable by name from the controller and any login nodes.
  - The controller needs to be resolvable by name from GCP instances, or the controller IP address needs to be added to `/etc/hosts`. See peering zones for details.
- Use IP addresses with `NodeAddr`
  - Disable `cloud_dns` in `slurm.conf`.
  - Add `cloud_reg_addrs` to `slurm.conf`:

    ```sh
    # slurm.conf
    SlurmctldParameters=cloud_reg_addrs
    ```

  - Disable hierarchical communication in `slurm.conf`:

    ```sh
    # slurm.conf
    TreeWidth=65533
    ```

  - Add the controller's IP address to `/etc/hosts` on the custom image.
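Whichever option you choose, it is worth verifying name resolution and reachability in both directions before moving on. A sketch, where `cloudnode-0` and `controller` are placeholder hostnames:

```sh
# From the controller or a login node: resolve and reach a cloud node.
# "cloudnode-0" is a placeholder for one of your cloud compute node names.
getent hosts cloudnode-0
ping -c1 cloudnode-0

# From a cloud compute node: resolve and reach the controller.
getent hosts controller
ping -c1 controller
```

If `getent` fails but you chose the `NodeAddr` option, that can be expected; the `ping` against the registered IP address is the meaningful check in that case.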
The simplest way to handle user synchronization in a hybrid cluster is to use nss_slurm. This permits `passwd` and `group` resolution for a job on the compute node to be serviced by the local `slurmstepd` process rather than some other network-based service. User information is sent from the controller for each job and served by the `slurmstepd`.
nss_slurm is installed and configured on all SchedMD public images.
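If you build your own images, you can confirm nss_slurm is active by checking the name-service switch configuration on a compute node; a quick sketch:

```sh
# nss_slurm should appear in the passwd and group lines of nsswitch.conf,
# e.g. "passwd: slurm files ...". Run this on a compute node image.
grep -E '^(passwd|group):' /etc/nsswitch.conf
```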
Once you have successfully configured a hybrid slurm_cluster and applied the terraform infrastructure, the necessary files will be generated at `$output_dir`. Should another machine be the TerraformHost, or a non-SlurmUser be the TerraformUser, then set `$install_dir` to the intended directory where the generated files will be deployed on the Slurm controller (e.g. `var.install_dir = "/etc/slurm"`).
Follow the steps below to configure your on-prem controller to burst into the cloud.
- Configure terraform modules (e.g. slurm_cluster; slurm_controller_hybrid) with desired configurations.
- Apply the terraform project and its configuration.

  ```sh
  terraform init
  terraform apply
  ```

- The `$output_dir` and its contents should be owned by the `SlurmUser`, e.g.

  ```sh
  chown -R slurm:slurm $output_dir
  ```

- Move files from `$output_dir` on the TerraformHost to `$install_dir` on the SlurmctldHost and make sure the SlurmUser owns the files.

  ```sh
  scp ${output_dir}/* ${SLURMCTLD_HOST}:${install_dir}/
  ssh $SLURMCTLD_HOST sudo chown -R ${SLURM_USER}:${SLURM_USER} $install_dir
  ```

- In your `slurm.conf`, include the generated `cloud.conf`:

  ```sh
  # slurm.conf
  include $install_dir/cloud.conf
  ```

- In your `gres.conf`, include the generated `cloud_gres.conf`:

  ```sh
  # gres.conf
  include $install_dir/cloud_gres.conf
  ```

- Install the slurmcmd systemd files.

  ```sh
  cp ${install_dir}/slurmcmd.* /etc/systemd/system/
  systemctl daemon-reload
  ```

- Restart slurmctld and resolve include conflicts.
- Enable and start slurmcmd.

  ```sh
  systemctl enable --now slurmcmd.timer
  ```

- Test cloud bursting.

  ```sh
  scontrol update nodename=$NODENAME state=power_up reason=test
  scontrol update nodename=$NODENAME state=power_down reason=test
  ```
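While testing, you can watch the power-up/power-down cycle from the controller; a sketch (log file locations vary by installation, so the paths below are assumptions):

```sh
# Watch node state transitions: CLOUD nodes should move through
# POWERING_UP, IDLE, then POWERING_DOWN states.
sinfo
scontrol show node $NODENAME | grep -i state

# Follow the resume/suspend script logs (assumed log directory).
tail -f /var/log/slurm/resume.log /var/log/slurm/suspend.log
```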
Additionally, MUNGE secrets must be consistent across the cluster. There are a few safe ways to deal with `munge.key` distribution:

- Use NFS to mount `/etc/munge` from the controller (default behavior).
- Create a custom image that contains the `munge.key` for your cluster.
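If you bake the key into a custom image instead of mounting it over NFS, the copy must keep MUNGE's strict ownership and permissions or `munged` will refuse to start. A sketch of placing the key during image creation (the source path is a placeholder):

```sh
# Install the cluster's munge.key with the ownership and mode munged
# requires (owned by the munge user, readable only by it).
# ./munge.key is a placeholder path for your securely transferred key.
install -o munge -g munge -m 0400 ./munge.key /etc/munge/munge.key
systemctl restart munge
```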
Regardless of chosen secret delivery system, tight access control is required to maintain the security of your cluster.
Should NFS or another shared filesystem method be used, then controlling connections to the munge NFS is critical.
- Isolate the cloud compute nodes of the cluster into their own project, VPC, and subnetworks. Use project or network peering to enable access to other cloud infrastructure in a controlled manner.
- Set up firewall rules to control ingress and egress to the controller such that only trusted machines or networks use its NFS.
- Only allow trusted private address ranges for communication to the controller.
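As an illustration of the firewall guidance above, a single ingress rule restricted to your on-premise range might look like the following; the network name, CIDR, and port range are assumptions (confirm your actual slurmctld/slurmd/slurmdbd ports):

```sh
# Allow Slurm daemon traffic (default ports: slurmctld 6817, slurmd 6818,
# slurmdbd 6819) only from the trusted on-premise range.
# "my-cluster-network" and 10.0.0.0/8 are placeholders.
gcloud compute firewall-rules create allow-slurm-onprem \
  --network=my-cluster-network \
  --direction=INGRESS \
  --action=ALLOW \
  --rules=tcp:6817-6819 \
  --source-ranges=10.0.0.0/8
```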
Should secrets be 'baked' into an image, then controlling deployment of images is critical.
- Only cluster admins or sudoers should be allowed to deploy those images.
- Never allow regular users to gain sudo privileges.
- Never allow export/download of the image.