Module: Slurm Cluster

FAQ | Troubleshooting | Glossary

Overview

This module creates a Slurm cluster on GCP. There are two modes of operation: cloud and hybrid. Cloud mode creates a controller VM. Hybrid mode generates cloud.conf and cloud_gres.conf files to be included in the on-prem configuration files, and manages a config.yaml file for internal module use.
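As a minimal sketch of selecting a mode, the block below sets the enable_hybrid input that appears in the dependency list further down. The `<MODULE_SOURCE>` placeholder stands for a pinned source line (see Usage), and the assumption that cloud mode is the default when enable_hybrid is unset is not confirmed by this document.

```hcl
# Sketch only: selecting hybrid mode via the enable_hybrid input.
module "slurm_cluster" {
  # Placeholder: use a source line pinned to a fixed revision (see Usage).
  source = "<MODULE_SOURCE>"

  project_id         = "<PROJECT_ID>"
  slurm_cluster_name = "<SLURM_CLUSTER_NAME>"

  # true  = hybrid mode: generate cloud.conf and cloud_gres.conf for an
  #         on-prem Slurm configuration.
  # false = cloud mode: create a controller VM (assumed to be the default).
  enable_hybrid = true

  # ... other inputs omitted ...
}
```

Hybrid mode likely needs further inputs that describe the on-prem environment; consult README_TF.md for the actual variable list.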

Partitions define which compute resources are available to the controller for allocating jobs. Slurm resumes (creates) compute instances as needed to run allocated jobs and suspends (terminates) them once they are no longer needed (e.g. after being IDLE for the SuspendTime duration). Static nodes are persistent; they are exempt from being suspended/terminated under normal conditions. Dynamic nodes are burstable; they scale up and down with the workload.

WARNING: Destroying the controller before it has suspended/terminated all static and dynamic node instances and supporting resources (e.g. placement groups, subscriptions) will leave those resources orphaned unless cleanup options are enabled (e.g. enable_cleanup_compute, enable_cleanup_subscriptions).
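To illustrate the warning above, here is a hedged sketch that turns on both cleanup options named in it. Treating enable_cleanup_subscriptions as a boolean input is an assumption based on how enable_cleanup_compute is used elsewhere in this document, and `<MODULE_SOURCE>` is a placeholder for a pinned source line.

```hcl
# Sketch only: enable cleanup so that destroying the cluster does not
# leave node instances or supporting resources orphaned.
module "slurm_cluster" {
  # Placeholder: use a source line pinned to a fixed revision (see Usage).
  source = "<MODULE_SOURCE>"

  project_id         = "<PROJECT_ID>"
  slurm_cluster_name = "<SLURM_CLUSTER_NAME>"

  # Remove compute node instances and related resources on destroy.
  # Requires Python and the pip packages listed under Dependencies.
  enable_cleanup_compute = true

  # Remove subscriptions on destroy (assumed to be a boolean input).
  enable_cleanup_subscriptions = true

  # ... other inputs omitted ...
}
```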

Usage

See the examples directory for sample usage.

See below for a simple inclusion within your own Terraform project.

module "slurm_cluster" {
  source = "git@github.com:SchedMD/slurm-gcp.git//terraform/slurm_cluster?ref=v5.0.0"

  project_id = "<PROJECT_ID>"

  slurm_cluster_name = "<SLURM_CLUSTER_NAME>"

  # ... omitted ...
}

NOTE: Because this module is not hosted on the Terraform Registry, the version must be pinned explicitly with the ?ref= revision syntax on the source line.

Dependencies

Software

For TerraformUser to use this module, certain software must be installed on the local machine and certain APIs must be enabled in GCP.

Required

Optional

  • Python is installed.
    • Required Version: >= 3.6.0, < 4.0.0
    • Required when any of:
      • enable_hybrid=true
      • enable_cleanup_compute=true
  • Pip packages are installed.
    • Required when any of:
      • enable_hybrid=true
      • enable_cleanup_compute=true
    • pip3 install -r ../../scripts/requirements.txt --user
  • Private Google Access is enabled.
    • Required when any instances only have internal IPs.
  • Secret Manager API is enabled.
    • Required when cloudsql != null.
  • BigQuery API is enabled.
    • Required when enable_bigquery_load=true.
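The optional GCP-side items above can also be managed from Terraform. The following is a sketch using the google provider's google_project_service and google_compute_subnetwork resources; the resource names, region, network, and CIDR range are hypothetical values chosen for illustration.

```hcl
# Enable the Secret Manager API (needed when cloudsql != null).
resource "google_project_service" "secretmanager" {
  project            = "<PROJECT_ID>"
  service            = "secretmanager.googleapis.com"
  disable_on_destroy = false # keep the API enabled if this resource is removed
}

# Enable the BigQuery API (needed when enable_bigquery_load=true).
resource "google_project_service" "bigquery" {
  project            = "<PROJECT_ID>"
  service            = "bigquery.googleapis.com"
  disable_on_destroy = false
}

# Subnetwork with Private Google Access, so instances that only have
# internal IPs can still reach Google APIs. All values are hypothetical.
resource "google_compute_subnetwork" "slurm" {
  project                  = "<PROJECT_ID>"
  name                     = "slurm-subnet"
  region                   = "us-central1"
  network                  = "default"
  ip_cidr_range            = "10.0.0.0/24"
  private_ip_google_access = true
}
```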

TerraformUser

TerraformUser authenticates to Google Cloud with credentials. It is recommended to create an IAM principal for this user and grant the roles below to it. Optionally, the TerraformUser can operate through a service account.

Required

  • Compute Instance Admin (v1) (roles/compute.instanceAdmin.v1)

Optional

  • Secret Manager Admin (roles/secretmanager.admin)
    • Required when cloudsql != null.
  • Service Account User (roles/iam.serviceAccountUser)
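The roles above could be granted with the google provider's google_project_iam_member resource, as sketched below. The member address is a hypothetical example, and the optional roles are included only for illustration; drop the ones you do not need.

```hcl
locals {
  # Roles listed for TerraformUser in this section.
  terraform_user_roles = [
    "roles/compute.instanceAdmin.v1", # required
    "roles/iam.serviceAccountUser",   # optional
    "roles/secretmanager.admin",      # optional; needed when cloudsql != null
  ]
}

# One non-authoritative binding per role, so other members keep their roles.
resource "google_project_iam_member" "terraform_user" {
  for_each = toset(local.terraform_user_roles)

  project = "<PROJECT_ID>"
  role    = each.value
  member  = "user:terraform-user@example.com" # hypothetical principal
}
```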

Controller SA

Service account intended to be associated with the controller instance template for slurm_controller_instance.

Required

  • Compute Instance Admin (v1) (roles/compute.instanceAdmin.v1)
  • Compute Instance Admin (beta) (roles/compute.instanceAdmin)
  • Service Account User (roles/iam.serviceAccountUser)

Optional

  • BigQuery Data Editor (roles/bigquery.dataEditor)
    • Required when enable_bigquery_load=true.
  • Cloud SQL Editor (roles/cloudsql.editor)
    • Required when all of:
      • cloudsql != null
      • Communicating with the CloudSQL instance
  • Logs Writer (roles/logging.logWriter)
    • Recommended.
  • Monitoring Metric Writer (roles/monitoring.metricWriter)
    • Recommended.
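A controller service account with the required and recommended roles above might be set up as follows. This is a sketch, not the module's own method: the account_id and resource names are hypothetical, and the BigQuery and Cloud SQL roles are left out because they depend on optional features.

```hcl
# Hypothetical service account for the controller instance template.
resource "google_service_account" "controller" {
  project      = "<PROJECT_ID>"
  account_id   = "slurm-controller"
  display_name = "Slurm controller"
}

# Grant the required roles plus the two recommended logging/monitoring roles.
resource "google_project_iam_member" "controller" {
  for_each = toset([
    "roles/compute.instanceAdmin.v1",
    "roles/compute.instanceAdmin",
    "roles/iam.serviceAccountUser",
    "roles/logging.logWriter",
    "roles/monitoring.metricWriter",
  ])

  project = "<PROJECT_ID>"
  role    = each.value
  member  = "serviceAccount:${google_service_account.controller.email}"
}
```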

Compute SA

Service account intended to be associated with the compute instance templates created by slurm_partition.

Optional

  • Logs Writer (roles/logging.logWriter)
    • Recommended.
  • Monitoring Metric Writer (roles/monitoring.metricWriter)
    • Recommended.

Login SA

Service account intended to be associated with the login instance templates created by slurm_partition.

Optional

  • Logs Writer (roles/logging.logWriter)
    • Recommended.
  • Monitoring Metric Writer (roles/monitoring.metricWriter)
    • Recommended.
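Since the compute and login service accounts share the same recommended roles, both can be created from one definition. The sketch below uses setproduct to pair each account with each role; the account IDs and local names are hypothetical.

```hcl
locals {
  # Hypothetical account IDs, keyed by node type.
  node_service_accounts = {
    compute = "slurm-compute"
    login   = "slurm-login"
  }

  # Roles recommended for both account types in this document.
  node_roles = [
    "roles/logging.logWriter",
    "roles/monitoring.metricWriter",
  ]

  # One entry per (account, role) pair, keyed by "<account>:<role>".
  node_role_bindings = {
    for pair in setproduct(keys(local.node_service_accounts), local.node_roles) :
    "${pair[0]}:${pair[1]}" => { sa = pair[0], role = pair[1] }
  }
}

resource "google_service_account" "node" {
  for_each = local.node_service_accounts

  project      = "<PROJECT_ID>"
  account_id   = each.value
  display_name = "Slurm ${each.key}"
}

resource "google_project_iam_member" "node" {
  for_each = local.node_role_bindings

  project = "<PROJECT_ID>"
  role    = each.value.role
  member  = "serviceAccount:${google_service_account.node[each.value.sa].email}"
}
```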

Module API

For the terraform module API reference, please see README_TF.md.