
Possible bug - nodes repeatedly created and deleted if invalid quota chosen #3225

Open
scott-nag opened this issue Nov 6, 2024 · 2 comments

@scott-nag
Contributor

scott-nag commented Nov 6, 2024

Describe the bug

This might be expected behaviour but I figured it is worth getting clarification on...

Compute nodes are continuously deleted and re-created when a job is run that:

  • fits within the Slurm partition
  • does not fit within the "CPUs per VM family" quota

In my example I have requested 2 x c4-standard-8 machines (16 vCPUs in total), and have set the quota to 12 CPUs per family.

The VMs are created up until the quota is reached, at which point this error appears in slurmctld.log: "GCP Error: Quota CPUS_PER_VM_FAMILY exceeded. Limit: 12.0 in region europe-west4."

Steps to reproduce

Run a Slurm job that requests more nodes/CPUs than the CPUS_PER_VM_FAMILY quota allows.
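
For illustration, a minimal submission against the c4 partition defined in the blueprint below is enough to trip the limit: two c4-standard-8 nodes request 16 vCPUs against a 12-CPU family quota. The --wrap payload is just a placeholder command.

# Hypothetical reproduction: request 2 whole nodes in the c4 partition (2 x 8 = 16 vCPUs > 12)
sbatch --partition=c4 --nodes=2 --wrap "srun hostname"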

Expected behavior

  1. Quota is checked(?) and, if there is no capacity, the job does not attempt to run (a possible way to inspect the quota is sketched after this list).
  2. An error message relating to the quota not having capacity?
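
As a sketch of the kind of pre-flight check meant in item 1 (not something this report shows the toolkit doing today), the regional quota limits and current usage can be listed with gcloud; picking out the relevant per-VM-family CPU metric from that list is left as an assumption, since the exact metric name for C4 is not shown in the logs.

# List regional quota metrics with their limits and usage, including per-family CPU quotas
gcloud compute regions describe europe-west4 --format="yaml(quotas)"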

Actual behavior

  1. Nodes are created up to the point the quota is met
  2. The above error appears in /var/log/slurm/slurmctld.log
  3. Node(s) sit idle until powered down
  4. Repeat... (the cycle can be watched as sketched below)
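
The create/fail/power-down cycle can be watched from the controller with standard Slurm tooling; job ID 2 here matches the log excerpt further down and is otherwise arbitrary.

# Show the down/drain reasons recorded against c4 nodes (the GCP quota error appears here)
sinfo --partition=c4 --list-reasons
# Show the requeued job, including the AdminComment set from the GCP error
scontrol show job 2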

Version (gcluster --version)

scott@MacBookPro-ScottG cluster-toolkit % ./gcluster --version
gcluster version v1.41.0
Built from 'detached HEAD' branch.
Commit info: v1.41.0-0-g26fafe0d
Terraform version: 1.9.8

Blueprint

---

blueprint_name: hpc-slurm

vars:
  project_id: projectname
  deployment_name: hpc-slurm
  region: europe-west4
  zone: europe-west4-c

deployment_groups:
- group: primary
  modules:
  - id: network
    source: modules/network/vpc

  - id: homefs
    source: modules/file-system/filestore
    use: [network]
    settings:
      local_mount: /home

  - id: debug_nodeset
    source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
    use: [network]
    settings:
      node_count_dynamic_max: 2
      machine_type: n2-standard-2
      enable_placement: false # the default is: true
      allow_automatic_updates: false

  - id: debug_partition
    source: community/modules/compute/schedmd-slurm-gcp-v6-partition
    use:
    - debug_nodeset
    settings:
      partition_name: debug
      exclusive: false # allows nodes to stay up after jobs are done
      is_default: true

  - id: compute_nodeset
    source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
    use: [network]
    settings:
      node_count_dynamic_max: 20
      bandwidth_tier: gvnic_enabled
      allow_automatic_updates: false

  - id: compute_partition
    source: community/modules/compute/schedmd-slurm-gcp-v6-partition
    use:
    - compute_nodeset
    settings:
      partition_name: compute

  - id: c4_nodeset
    source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
    use: [network]
    settings:
      node_count_dynamic_max: 2
      machine_type: c4-standard-8
      disk_type: hyperdisk-balanced
      bandwidth_tier: gvnic_enabled
      allow_automatic_updates: false
      enable_placement: false

  - id: c4_partition
    source: community/modules/compute/schedmd-slurm-gcp-v6-partition
    use:
    - c4_nodeset
    settings:
      partition_name: c4
      exclusive: false

  - id: slurm_login
    source: community/modules/scheduler/schedmd-slurm-gcp-v6-login
    use: [network]
    settings:
      machine_type: n2-standard-4
      enable_login_public_ips: true

  - id: slurm_controller
    source: community/modules/scheduler/schedmd-slurm-gcp-v6-controller
    use:
    - network
    - debug_partition
    - compute_partition
    - c4_partition
    - homefs
    - slurm_login
    settings:
      enable_controller_public_ips: true

Expanded Blueprint

ghpc_version: v1.41.0-0-g26fafe0d
vars:
  deployment_name: hpc-slurm
  labels:
    ghpc_blueprint: hpc-slurm
    ghpc_deployment: ((var.deployment_name))
  project_id: projectname
  region: europe-west4
  zone: europe-west4-c
deployment_groups:
  - group: primary
    terraform_providers:
      google:
        source: hashicorp/google
        version: '>= 4.84.0, < 6.8.0'
        configuration:
          project: ((var.project_id))
          region: ((var.region))
          zone: ((var.zone))
      google-beta:
        source: hashicorp/google-beta
        version: '>= 4.84.0, < 6.8.0'
        configuration:
          project: ((var.project_id))
          region: ((var.region))
          zone: ((var.zone))
    modules:
      - source: modules/network/vpc
        kind: terraform
        id: network
        settings:
          deployment_name: ((var.deployment_name))
          project_id: ((var.project_id))
          region: ((var.region))
      - source: modules/file-system/filestore
        kind: terraform
        id: homefs
        use:
          - network
        settings:
          deployment_name: ((var.deployment_name))
          labels: ((var.labels))
          local_mount: /home
          network_id: ((module.network.network_id))
          project_id: ((var.project_id))
          region: ((var.region))
          zone: ((var.zone))
      - source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
        kind: terraform
        id: debug_nodeset
        use:
          - network
        settings:
          allow_automatic_updates: false
          enable_placement: false
          labels: ((var.labels))
          machine_type: n2-standard-2
          name: debug_nodeset
          node_count_dynamic_max: 2
          project_id: ((var.project_id))
          region: ((var.region))
          subnetwork_self_link: ((module.network.subnetwork_self_link))
          zone: ((var.zone))
      - source: community/modules/compute/schedmd-slurm-gcp-v6-partition
        kind: terraform
        id: debug_partition
        use:
          - debug_nodeset
        settings:
          exclusive: false
          is_default: true
          nodeset: ((flatten([module.debug_nodeset.nodeset])))
          partition_name: debug
      - source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
        kind: terraform
        id: compute_nodeset
        use:
          - network
        settings:
          allow_automatic_updates: false
          bandwidth_tier: gvnic_enabled
          labels: ((var.labels))
          name: compute_nodeset
          node_count_dynamic_max: 20
          project_id: ((var.project_id))
          region: ((var.region))
          subnetwork_self_link: ((module.network.subnetwork_self_link))
          zone: ((var.zone))
      - source: community/modules/compute/schedmd-slurm-gcp-v6-partition
        kind: terraform
        id: compute_partition
        use:
          - compute_nodeset
        settings:
          nodeset: ((flatten([module.compute_nodeset.nodeset])))
          partition_name: compute
      - source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
        kind: terraform
        id: c4_nodeset
        use:
          - network
        settings:
          allow_automatic_updates: false
          bandwidth_tier: gvnic_enabled
          disk_type: hyperdisk-balanced
          enable_placement: false
          labels: ((var.labels))
          machine_type: c4-standard-8
          name: c4_nodeset
          node_count_dynamic_max: 2
          project_id: ((var.project_id))
          region: ((var.region))
          subnetwork_self_link: ((module.network.subnetwork_self_link))
          zone: ((var.zone))
      - source: community/modules/compute/schedmd-slurm-gcp-v6-partition
        kind: terraform
        id: c4_partition
        use:
          - c4_nodeset
        settings:
          exclusive: false
          nodeset: ((flatten([module.c4_nodeset.nodeset])))
          partition_name: c4
      - source: community/modules/scheduler/schedmd-slurm-gcp-v6-login
        kind: terraform
        id: slurm_login
        use:
          - network
        settings:
          enable_login_public_ips: true
          labels: ((var.labels))
          machine_type: n2-standard-4
          name_prefix: slurm_login
          project_id: ((var.project_id))
          region: ((var.region))
          subnetwork_self_link: ((module.network.subnetwork_self_link))
          zone: ((var.zone))
      - source: community/modules/scheduler/schedmd-slurm-gcp-v6-controller
        kind: terraform
        id: slurm_controller
        use:
          - network
          - debug_partition
          - compute_partition
          - c4_partition
          - homefs
          - slurm_login
        settings:
          deployment_name: ((var.deployment_name))
          enable_controller_public_ips: true
          labels: ((var.labels))
          login_nodes: ((flatten([module.slurm_login.login_nodes])))
          network_storage: ((flatten([module.homefs.network_storage])))
          nodeset: ((flatten([module.c4_partition.nodeset, flatten([module.compute_partition.nodeset, flatten([module.debug_partition.nodeset])])])))
          nodeset_dyn: ((flatten([module.c4_partition.nodeset_dyn, flatten([module.compute_partition.nodeset_dyn, flatten([module.debug_partition.nodeset_dyn])])])))
          nodeset_tpu: ((flatten([module.c4_partition.nodeset_tpu, flatten([module.compute_partition.nodeset_tpu, flatten([module.debug_partition.nodeset_tpu])])])))
          partitions: ((flatten([module.c4_partition.partitions, flatten([module.compute_partition.partitions, flatten([module.debug_partition.partitions])])])))
          project_id: ((var.project_id))
          region: ((var.region))
          subnetwork_self_link: ((module.network.subnetwork_self_link))
          zone: ((var.zone))

Output and logs

[2024-11-06T11:23:41.970] _slurm_rpc_submit_batch_job: JobId=2 InitPrio=1 usec=1945
[2024-11-06T11:23:42.003] sched: Allocate JobId=2 NodeList=hpcslurm-c4nodeset-[0-1] #CPUs=2 Partition=c4
[2024-11-06T11:23:46.189] _update_job: setting admin_comment to GCP Error: Quota CPUS_PER_VM_FAMILY exceeded. Limit: 12.0 in region europe-west4. for JobId=2
[2024-11-06T11:23:46.189] _slurm_rpc_update_job: complete JobId=2 uid=981 usec=98
[2024-11-06T11:23:46.199] update_node: node hpcslurm-c4nodeset-1 reason set to: GCP Error: Quota CPUS_PER_VM_FAMILY exceeded. Limit: 12.0 in region europe-west4.
[2024-11-06T11:23:46.199] requeue job JobId=2 due to failure of node hpcslurm-c4nodeset-1
[2024-11-06T11:23:46.199] update_node: node hpcslurm-c4nodeset-1 state set to DOWN
[2024-11-06T11:23:46.223] Requeuing JobId=2
[2024-11-06T11:28:45.594] node hpcslurm-c4nodeset-1 not resumed by ResumeTimeout(300), setting DOWN and POWERED_DOWN
[2024-11-06T11:29:39.969] update_node: node hpcslurm-c4nodeset-0 state set to IDLE
[2024-11-06T11:29:39.970] update_node: node hpcslurm-c4nodeset-1 state set to IDLE
[2024-11-06T11:29:41.003] sched: Allocate JobId=2 NodeList=hpcslurm-c4nodeset-[0-1] #CPUs=2 Partition=c4
[2024-11-06T11:30:11.565] _update_job: setting admin_comment to GCP Error: QUOTA_EXCEEDED: Quota CPUS_PER_VM_FAMILY exceeded.  Limit: 12.0 in region europe-west4. for JobId=2
[2024-11-06T11:30:11.565] _slurm_rpc_update_job: complete JobId=2 uid=981 usec=118
[2024-11-06T11:30:11.575] update_node: node hpcslurm-c4nodeset-0 reason set to: GCP Error: QUOTA_EXCEEDED: Quota CPUS_PER_VM_FAMILY exceeded.  Limit: 12.0 in region europe-west4.
[2024-11-06T11:30:11.575] requeue job JobId=2 due to failure of node hpcslurm-c4nodeset-0
[2024-11-06T11:30:11.579] Requeuing JobId=2
[2024-11-06T11:30:11.580] update_node: node hpcslurm-c4nodeset-0 state set to DOWN
[2024-11-06T11:30:11.593] error: xgetaddrinfo: getaddrinfo(hpcslurm-c4nodeset-0:6818) failed: Name or service not known
[2024-11-06T11:30:11.594] error: slurm_set_addr: Unable to resolve "hpcslurm-c4nodeset-0"
[2024-11-06T11:30:11.594] error: _thread_per_group_rpc: can't find address for host hpcslurm-c4nodeset-0, check slurm.conf
[2024-11-06T11:30:30.136] Node hpcslurm-c4nodeset-1 now responding
[2024-11-06T11:34:49.621] node hpcslurm-c4nodeset-0 not resumed by ResumeTimeout(300), setting DOWN and POWERED_DOWN
[2024-11-06T11:35:39.930] update_node: node hpcslurm-c4nodeset-0 state set to IDLE
[2024-11-06T11:36:40.047] update_node: node hpcslurm-c4nodeset-1 state set to IDLE
[2024-11-06T11:36:41.002] sched: Allocate JobId=2 NodeList=hpcslurm-c4nodeset-[0-1] #CPUs=2 Partition=c4
[2024-11-06T11:37:11.659] _update_job: setting admin_comment to GCP Error: QUOTA_EXCEEDED: Quota CPUS_PER_VM_FAMILY exceeded.  Limit: 12.0 in region europe-west4. for JobId=2
[2024-11-06T11:37:11.660] _slurm_rpc_update_job: complete JobId=2 uid=981 usec=1569
[2024-11-06T11:37:11.674] update_node: node hpcslurm-c4nodeset-0 reason set to: GCP Error: QUOTA_EXCEEDED: Quota CPUS_PER_VM_FAMILY exceeded.  Limit: 12.0 in region europe-west4.
[2024-11-06T11:37:11.674] requeue job JobId=2 due to failure of node hpcslurm-c4nodeset-0
[2024-11-06T11:37:11.675] Requeuing JobId=2
[2024-11-06T11:37:11.675] update_node: node hpcslurm-c4nodeset-0 state set to DOWN
[2024-11-06T11:37:11.694] error: xgetaddrinfo: getaddrinfo(hpcslurm-c4nodeset-0:6818) failed: Name or service not known
[2024-11-06T11:37:11.694] error: slurm_set_addr: Unable to resolve "hpcslurm-c4nodeset-0"
[2024-11-06T11:37:11.694] error: _thread_per_group_rpc: can't find address for host hpcslurm-c4nodeset-0, check slurm.conf
[2024-11-06T11:37:31.409] Node hpcslurm-c4nodeset-1 now responding
[2024-11-06T11:41:48.652] node hpcslurm-c4nodeset-0 not resumed by ResumeTimeout(300), setting DOWN and POWERED_DOWN
[2024-11-06T11:42:39.867] update_node: node hpcslurm-c4nodeset-0 state set to IDLE
[2024-11-06T11:43:40.011] update_node: node hpcslurm-c4nodeset-1 state set to IDLE
[2024-11-06T11:43:41.003] sched: Allocate JobId=2 NodeList=hpcslurm-c4nodeset-[0-1] #CPUs=2 Partition=c4

Execution environment

  • OS: any
  • Shell (To find this, run ps -p $$): zsh
  • go version: go version go1.21.5 darwin/arm64
@scott-nag scott-nag added the bug Something isn't working label Nov 6, 2024
@rohitramu rohitramu self-assigned this Nov 11, 2024
@rohitramu rohitramu removed their assignment Nov 22, 2024
@harshthakkar01
Contributor

This behavior was working as intended. In Slurm-GCP, minCount was set to 1 in the bulk API request, so the Bulk API would create instances until the quota was reached, and as long as at least minCount instances were created, the overall operation was considered successful.
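
For context, the same count vs. minimum-count semantics are visible in the public bulk instance creation surface, e.g. via gcloud; this is only an illustration of the minCount idea, not the exact request Slurm-GCP issues, and the instance names are made up.

# Request 2 instances but treat the operation as successful if at least 1 is created,
# mirroring the old minCount = 1 behaviour described above
gcloud compute instances bulk create \
  --name-pattern="c4-node-#" \
  --count=2 \
  --min-count=1 \
  --machine-type=c4-standard-8 \
  --zone=europe-west4-c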

This was addressed in 4b8af75

So, if you try a newer version of the gcluster binary, it should be fixed. Let me know if you run into any issues.

@scott-nag
Contributor Author

Amazing, thank you for the update! We'll test it out and let you know if there are still any issues.
