
Possible bug - nodes repeatedly created and deleted if invalid quota chosen #3225

Open
scott-nag opened this issue Nov 6, 2024 · 2 comments

@scott-nag
Contributor

scott-nag commented Nov 6, 2024

Describe the bug

This might be expected behaviour but I figured it is worth getting clarification on...

Compute nodes are continuously deleted and re-created when a job is run that:

  • fits within the Slurm partition
  • does not fit within the "CPUs per VM family" quota

In my example I have requested 2 x c4-standard-8 machines (16 vCPUs in total), and have set the quota to 12 CPUs per family.

The VMs are created up until the quota is reached, at which point this error appears in slurmctld.log: "GCP Error: Quota CPUS_PER_VM_FAMILY exceeded. Limit: 12.0 in region europe-west4."

Steps to reproduce

Run a Slurm job that requests more nodes/CPUs than the CPUS_PER_VM_FAMILY quota allows.
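
For illustration, a minimal submission against the c4 partition defined in the blueprint below is enough to trip the limit: two c4-standard-8 nodes request 16 vCPUs against a 12-CPU family quota. The --wrap payload is just a placeholder command.

# Hypothetical reproduction: request 2 whole nodes in the c4 partition (2 x 8 = 16 vCPUs > 12)
sbatch --partition=c4 --nodes=2 --wrap "srun hostname"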

Expected behavior

  1. Quota is checked(?) and, if there is no capacity, the job does not attempt to run (a possible way to inspect the quota is sketched after this list).
  2. An error message relating to the quota not having capacity?
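
As a sketch of the kind of pre-flight check meant in item 1 (not something this report shows the toolkit doing today), the regional quota limits and current usage can be listed with gcloud; picking out the relevant per-VM-family CPU metric from that list is left as an assumption, since the exact metric name for C4 is not shown in the logs.

# List regional quota metrics with their limits and usage, including per-family CPU quotas
gcloud compute regions describe europe-west4 --format="yaml(quotas)"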

Actual behavior

  1. Nodes are created up to the point the quota is met
  2. The above error appears in /var/log/slurm/slurmctld.log
  3. Node(s) sit idle until powered down
  4. Repeat... (the cycle can be watched as sketched below)
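
The create/fail/power-down cycle can be watched from the controller with standard Slurm tooling; job ID 2 here matches the log excerpt further down and is otherwise arbitrary.

# Show the down/drain reasons recorded against c4 nodes (the GCP quota error appears here)
sinfo --partition=c4 --list-reasons
# Show the requeued job, including the AdminComment set from the GCP error
scontrol show job 2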

Version (gcluster --version)

scott@MacBookPro-ScottG cluster-toolkit % ./gcluster --version
gcluster version v1.41.0
Built from 'detached HEAD' branch.
Commit info: v1.41.0-0-g26fafe0d
Terraform version: 1.9.8

Blueprint

---

blueprint_name: hpc-slurm

vars:
  project_id: projectname
  deployment_name: hpc-slurm
  region: europe-west4
  zone: europe-west4-c

deployment_groups:
- group: primary
  modules:
  - id: network
    source: modules/network/vpc

  - id: homefs
    source: modules/file-system/filestore
    use: [network]
    settings:
      local_mount: /home

  - id: debug_nodeset
    source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
    use: [network]
    settings:
      node_count_dynamic_max: 2
      machine_type: n2-standard-2
      enable_placement: false # the default is: true
      allow_automatic_updates: false

  - id: debug_partition
    source: community/modules/compute/schedmd-slurm-gcp-v6-partition
    use:
    - debug_nodeset
    settings:
      partition_name: debug
      exclusive: false # allows nodes to stay up after jobs are done
      is_default: true

  - id: compute_nodeset
    source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
    use: [network]
    settings:
      node_count_dynamic_max: 20
      bandwidth_tier: gvnic_enabled
      allow_automatic_updates: false

  - id: compute_partition
    source: community/modules/compute/schedmd-slurm-gcp-v6-partition
    use:
    - compute_nodeset
    settings:
      partition_name: compute

  - id: c4_nodeset
    source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
    use: [network]
    settings:
      node_count_dynamic_max: 2
      machine_type: c4-standard-8
      disk_type: hyperdisk-balanced
      bandwidth_tier: gvnic_enabled
      allow_automatic_updates: false
      enable_placement: false

  - id: c4_partition
    source: community/modules/compute/schedmd-slurm-gcp-v6-partition
    use:
    - c4_nodeset
    settings:
      partition_name: c4
      exclusive: false

  - id: slurm_login
    source: community/modules/scheduler/schedmd-slurm-gcp-v6-login
    use: [network]
    settings:
      machine_type: n2-standard-4
      enable_login_public_ips: true

  - id: slurm_controller
    source: community/modules/scheduler/schedmd-slurm-gcp-v6-controller
    use:
    - network
    - debug_partition
    - compute_partition
    - c4_partition
    - homefs
    - slurm_login
    settings:
      enable_controller_public_ips: true

Expanded Blueprint

ghpc_version: v1.41.0-0-g26fafe0d
vars:
  deployment_name: hpc-slurm
  labels:
    ghpc_blueprint: hpc-slurm
    ghpc_deployment: ((var.deployment_name))
  project_id: projectname
  region: europe-west4
  zone: europe-west4-c
deployment_groups:
  - group: primary
    terraform_providers:
      google:
        source: hashicorp/google
        version: '>= 4.84.0, < 6.8.0'
        configuration:
          project: ((var.project_id))
          region: ((var.region))
          zone: ((var.zone))
      google-beta:
        source: hashicorp/google-beta
        version: '>= 4.84.0, < 6.8.0'
        configuration:
          project: ((var.project_id))
          region: ((var.region))
          zone: ((var.zone))
    modules:
      - source: modules/network/vpc
        kind: terraform
        id: network
        settings:
          deployment_name: ((var.deployment_name))
          project_id: ((var.project_id))
          region: ((var.region))
      - source: modules/file-system/filestore
        kind: terraform
        id: homefs
        use:
          - network
        settings:
          deployment_name: ((var.deployment_name))
          labels: ((var.labels))
          local_mount: /home
          network_id: ((module.network.network_id))
          project_id: ((var.project_id))
          region: ((var.region))
          zone: ((var.zone))
      - source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
        kind: terraform
        id: debug_nodeset
        use:
          - network
        settings:
          allow_automatic_updates: false
          enable_placement: false
          labels: ((var.labels))
          machine_type: n2-standard-2
          name: debug_nodeset
          node_count_dynamic_max: 2
          project_id: ((var.project_id))
          region: ((var.region))
          subnetwork_self_link: ((module.network.subnetwork_self_link))
          zone: ((var.zone))
      - source: community/modules/compute/schedmd-slurm-gcp-v6-partition
        kind: terraform
        id: debug_partition
        use:
          - debug_nodeset
        settings:
          exclusive: false
          is_default: true
          nodeset: ((flatten([module.debug_nodeset.nodeset])))
          partition_name: debug
      - source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
        kind: terraform
        id: compute_nodeset
        use:
          - network
        settings:
          allow_automatic_updates: false
          bandwidth_tier: gvnic_enabled
          labels: ((var.labels))
          name: compute_nodeset
          node_count_dynamic_max: 20
          project_id: ((var.project_id))
          region: ((var.region))
          subnetwork_self_link: ((module.network.subnetwork_self_link))
          zone: ((var.zone))
      - source: community/modules/compute/schedmd-slurm-gcp-v6-partition
        kind: terraform
        id: compute_partition
        use:
          - compute_nodeset
        settings:
          nodeset: ((flatten([module.compute_nodeset.nodeset])))
          partition_name: compute
      - source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
        kind: terraform
        id: c4_nodeset
        use:
          - network
        settings:
          allow_automatic_updates: false
          bandwidth_tier: gvnic_enabled
          disk_type: hyperdisk-balanced
          enable_placement: false
          labels: ((var.labels))
          machine_type: c4-standard-8
          name: c4_nodeset
          node_count_dynamic_max: 2
          project_id: ((var.project_id))
          region: ((var.region))
          subnetwork_self_link: ((module.network.subnetwork_self_link))
          zone: ((var.zone))
      - source: community/modules/compute/schedmd-slurm-gcp-v6-partition
        kind: terraform
        id: c4_partition
        use:
          - c4_nodeset
        settings:
          exclusive: false
          nodeset: ((flatten([module.c4_nodeset.nodeset])))
          partition_name: c4
      - source: community/modules/scheduler/schedmd-slurm-gcp-v6-login
        kind: terraform
        id: slurm_login
        use:
          - network
        settings:
          enable_login_public_ips: true
          labels: ((var.labels))
          machine_type: n2-standard-4
          name_prefix: slurm_login
          project_id: ((var.project_id))
          region: ((var.region))
          subnetwork_self_link: ((module.network.subnetwork_self_link))
          zone: ((var.zone))
      - source: community/modules/scheduler/schedmd-slurm-gcp-v6-controller
        kind: terraform
        id: slurm_controller
        use:
          - network
          - debug_partition
          - compute_partition
          - c4_partition
          - homefs
          - slurm_login
        settings:
          deployment_name: ((var.deployment_name))
          enable_controller_public_ips: true
          labels: ((var.labels))
          login_nodes: ((flatten([module.slurm_login.login_nodes])))
          network_storage: ((flatten([module.homefs.network_storage])))
          nodeset: ((flatten([module.c4_partition.nodeset, flatten([module.compute_partition.nodeset, flatten([module.debug_partition.nodeset])])])))
          nodeset_dyn: ((flatten([module.c4_partition.nodeset_dyn, flatten([module.compute_partition.nodeset_dyn, flatten([module.debug_partition.nodeset_dyn])])])))
          nodeset_tpu: ((flatten([module.c4_partition.nodeset_tpu, flatten([module.compute_partition.nodeset_tpu, flatten([module.debug_partition.nodeset_tpu])])])))
          partitions: ((flatten([module.c4_partition.partitions, flatten([module.compute_partition.partitions, flatten([module.debug_partition.partitions])])])))
          project_id: ((var.project_id))
          region: ((var.region))
          subnetwork_self_link: ((module.network.subnetwork_self_link))
          zone: ((var.zone))

Output and logs

[2024-11-06T11:23:41.970] _slurm_rpc_submit_batch_job: JobId=2 InitPrio=1 usec=1945
[2024-11-06T11:23:42.003] sched: Allocate JobId=2 NodeList=hpcslurm-c4nodeset-[0-1] #CPUs=2 Partition=c4
[2024-11-06T11:23:46.189] _update_job: setting admin_comment to GCP Error: Quota CPUS_PER_VM_FAMILY exceeded. Limit: 12.0 in region europe-west4. for JobId=2
[2024-11-06T11:23:46.189] _slurm_rpc_update_job: complete JobId=2 uid=981 usec=98
[2024-11-06T11:23:46.199] update_node: node hpcslurm-c4nodeset-1 reason set to: GCP Error: Quota CPUS_PER_VM_FAMILY exceeded. Limit: 12.0 in region europe-west4.
[2024-11-06T11:23:46.199] requeue job JobId=2 due to failure of node hpcslurm-c4nodeset-1
[2024-11-06T11:23:46.199] update_node: node hpcslurm-c4nodeset-1 state set to DOWN
[2024-11-06T11:23:46.223] Requeuing JobId=2
[2024-11-06T11:28:45.594] node hpcslurm-c4nodeset-1 not resumed by ResumeTimeout(300), setting DOWN and POWERED_DOWN
[2024-11-06T11:29:39.969] update_node: node hpcslurm-c4nodeset-0 state set to IDLE
[2024-11-06T11:29:39.970] update_node: node hpcslurm-c4nodeset-1 state set to IDLE
[2024-11-06T11:29:41.003] sched: Allocate JobId=2 NodeList=hpcslurm-c4nodeset-[0-1] #CPUs=2 Partition=c4
[2024-11-06T11:30:11.565] _update_job: setting admin_comment to GCP Error: QUOTA_EXCEEDED: Quota CPUS_PER_VM_FAMILY exceeded.  Limit: 12.0 in region europe-west4. for JobId=2
[2024-11-06T11:30:11.565] _slurm_rpc_update_job: complete JobId=2 uid=981 usec=118
[2024-11-06T11:30:11.575] update_node: node hpcslurm-c4nodeset-0 reason set to: GCP Error: QUOTA_EXCEEDED: Quota CPUS_PER_VM_FAMILY exceeded.  Limit: 12.0 in region europe-west4.
[2024-11-06T11:30:11.575] requeue job JobId=2 due to failure of node hpcslurm-c4nodeset-0
[2024-11-06T11:30:11.579] Requeuing JobId=2
[2024-11-06T11:30:11.580] update_node: node hpcslurm-c4nodeset-0 state set to DOWN
[2024-11-06T11:30:11.593] error: xgetaddrinfo: getaddrinfo(hpcslurm-c4nodeset-0:6818) failed: Name or service not known
[2024-11-06T11:30:11.594] error: slurm_set_addr: Unable to resolve "hpcslurm-c4nodeset-0"
[2024-11-06T11:30:11.594] error: _thread_per_group_rpc: can't find address for host hpcslurm-c4nodeset-0, check slurm.conf
[2024-11-06T11:30:30.136] Node hpcslurm-c4nodeset-1 now responding
[2024-11-06T11:34:49.621] node hpcslurm-c4nodeset-0 not resumed by ResumeTimeout(300), setting DOWN and POWERED_DOWN
[2024-11-06T11:35:39.930] update_node: node hpcslurm-c4nodeset-0 state set to IDLE
[2024-11-06T11:36:40.047] update_node: node hpcslurm-c4nodeset-1 state set to IDLE
[2024-11-06T11:36:41.002] sched: Allocate JobId=2 NodeList=hpcslurm-c4nodeset-[0-1] #CPUs=2 Partition=c4
[2024-11-06T11:37:11.659] _update_job: setting admin_comment to GCP Error: QUOTA_EXCEEDED: Quota CPUS_PER_VM_FAMILY exceeded.  Limit: 12.0 in region europe-west4. for JobId=2
[2024-11-06T11:37:11.660] _slurm_rpc_update_job: complete JobId=2 uid=981 usec=1569
[2024-11-06T11:37:11.674] update_node: node hpcslurm-c4nodeset-0 reason set to: GCP Error: QUOTA_EXCEEDED: Quota CPUS_PER_VM_FAMILY exceeded.  Limit: 12.0 in region europe-west4.
[2024-11-06T11:37:11.674] requeue job JobId=2 due to failure of node hpcslurm-c4nodeset-0
[2024-11-06T11:37:11.675] Requeuing JobId=2
[2024-11-06T11:37:11.675] update_node: node hpcslurm-c4nodeset-0 state set to DOWN
[2024-11-06T11:37:11.694] error: xgetaddrinfo: getaddrinfo(hpcslurm-c4nodeset-0:6818) failed: Name or service not known
[2024-11-06T11:37:11.694] error: slurm_set_addr: Unable to resolve "hpcslurm-c4nodeset-0"
[2024-11-06T11:37:11.694] error: _thread_per_group_rpc: can't find address for host hpcslurm-c4nodeset-0, check slurm.conf
[2024-11-06T11:37:31.409] Node hpcslurm-c4nodeset-1 now responding
[2024-11-06T11:41:48.652] node hpcslurm-c4nodeset-0 not resumed by ResumeTimeout(300), setting DOWN and POWERED_DOWN
[2024-11-06T11:42:39.867] update_node: node hpcslurm-c4nodeset-0 state set to IDLE
[2024-11-06T11:43:40.011] update_node: node hpcslurm-c4nodeset-1 state set to IDLE
[2024-11-06T11:43:41.003] sched: Allocate JobId=2 NodeList=hpcslurm-c4nodeset-[0-1] #CPUs=2 Partition=c4

Execution environment

  • OS: any
  • Shell (To find this, run ps -p $$): zsh
  • go version: go version go1.21.5 darwin/arm64
@scott-nag scott-nag added the bug Something isn't working label Nov 6, 2024
@rohitramu rohitramu self-assigned this Nov 11, 2024
@rohitramu rohitramu removed their assignment Nov 22, 2024
@harshthakkar01
Contributor

This behavior was working as intended. In Slurm-GCP, minCount was set to 1 in the bulk API request, so the Bulk API would create instances until the quota was reached, and as long as at least minCount instances were created, the overall operation was considered successful.
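
For context, the same count vs. minimum-count semantics are visible in the public bulk instance creation surface, e.g. via gcloud; this is only an illustration of the minCount idea, not the exact request Slurm-GCP issues, and the instance names are made up.

# Request 2 instances but treat the operation as successful if at least 1 is created,
# mirroring the old minCount = 1 behaviour described above
gcloud compute instances bulk create \
  --name-pattern="c4-node-#" \
  --count=2 \
  --min-count=1 \
  --machine-type=c4-standard-8 \
  --zone=europe-west4-c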

This was addressed in 4b8af75

So, if you try a newer version of the gcluster binary, it should be fixed. Let me know if you run into any issues.

@scott-nag
Contributor Author

Amazing, thank you for the update! We'll test it out and let you know if there are still any issues.
