Describe the bug
This might be expected behaviour, but I figured it is worth getting clarification on.
Compute nodes are continuously deleted and re-created when a job is run that:
fits within the Slurm partition
does not fit within the "CPUs per VM family" quota
In my example I have requested 2 x c4-standard-8 machines (2 x 8 = 16 vCPUs) and have set the quota to 12 CPUs per family, so only the first node's 8 vCPUs fit within the limit. The VMs are created up until the point that the quota is met, at which point this error appears in slurmctld.log: "GCP Error: Quota CPUS_PER_VM_FAMILY exceeded. Limit: 12.0 in region europe-west4."
Steps to reproduce
Run a Slurm job that requests more nodes/CPUs than the CPUS_PER_VM_FAMILY quota allows, for example with the command sketched below.
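A hypothetical reproduction command, assuming the c4 partition from this report; any job that forces two c4-standard-8 nodes (16 vCPUs against a 12-CPU family quota) should trigger the loop:

# Request 2 nodes in the c4 partition: 2 x 8 vCPUs = 16 > 12-CPU family quota
sbatch --partition=c4 --nodes=2 --wrap="hostname"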
Expected behavior
Quota is checked(?) and, if there is no capacity, the job does not attempt to run (a manual version of such a check is sketched below).
An error message relating to the quota not having capacity?
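For reference, a rough sketch of the kind of pre-flight check that could gate the job, using gcloud and jq. The exact per-family metric name (e.g. C4_CPUS) is an assumption and may not match the CPUS_PER_VM_FAMILY label shown in the error message:

# List per-family CPU quotas for the region; the C4 metric name is assumed
gcloud compute regions describe europe-west4 --format=json \
  | jq '.quotas[] | select(.metric | test("C4"))'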
Actual behavior
Nodes are created up until the point the quota is met
The above error appears in /var/log/slurm/slurmctld.log
Node(s) sit idle until they are powered down
The cycle then repeats...
Version (gcluster --version)
scott@MacBookPro-ScottG cluster-toolkit % ./gcluster --version
gcluster version v1.41.0
Built from 'detached HEAD' branch.
Commit info: v1.41.0-0-g26fafe0d
Terraform version: 1.9.8
Output and logs
[2024-11-06T11:23:41.970] _slurm_rpc_submit_batch_job: JobId=2 InitPrio=1 usec=1945
[2024-11-06T11:23:42.003] sched: Allocate JobId=2 NodeList=hpcslurm-c4nodeset-[0-1] #CPUs=2 Partition=c4
[2024-11-06T11:23:46.189] _update_job: setting admin_comment to GCP Error: Quota CPUS_PER_VM_FAMILY exceeded. Limit: 12.0 in region europe-west4. for JobId=2
[2024-11-06T11:23:46.189] _slurm_rpc_update_job: complete JobId=2 uid=981 usec=98
[2024-11-06T11:23:46.199] update_node: node hpcslurm-c4nodeset-1 reason set to: GCP Error: Quota CPUS_PER_VM_FAMILY exceeded. Limit: 12.0 in region europe-west4.
[2024-11-06T11:23:46.199] requeue job JobId=2 due to failure of node hpcslurm-c4nodeset-1
[2024-11-06T11:23:46.199] update_node: node hpcslurm-c4nodeset-1 state set to DOWN
[2024-11-06T11:23:46.223] Requeuing JobId=2
[2024-11-06T11:28:45.594] node hpcslurm-c4nodeset-1 not resumed by ResumeTimeout(300), setting DOWN and POWERED_DOWN
[2024-11-06T11:29:39.969] update_node: node hpcslurm-c4nodeset-0 state set to IDLE
[2024-11-06T11:29:39.970] update_node: node hpcslurm-c4nodeset-1 state set to IDLE
[2024-11-06T11:29:41.003] sched: Allocate JobId=2 NodeList=hpcslurm-c4nodeset-[0-1] #CPUs=2 Partition=c4
[2024-11-06T11:30:11.565] _update_job: setting admin_comment to GCP Error: QUOTA_EXCEEDED: Quota CPUS_PER_VM_FAMILY exceeded. Limit: 12.0 in region europe-west4. for JobId=2
[2024-11-06T11:30:11.565] _slurm_rpc_update_job: complete JobId=2 uid=981 usec=118
[2024-11-06T11:30:11.575] update_node: node hpcslurm-c4nodeset-0 reason set to: GCP Error: QUOTA_EXCEEDED: Quota CPUS_PER_VM_FAMILY exceeded. Limit: 12.0 in region europe-west4.
[2024-11-06T11:30:11.575] requeue job JobId=2 due to failure of node hpcslurm-c4nodeset-0
[2024-11-06T11:30:11.579] Requeuing JobId=2
[2024-11-06T11:30:11.580] update_node: node hpcslurm-c4nodeset-0 state set to DOWN
[2024-11-06T11:30:11.593] error: xgetaddrinfo: getaddrinfo(hpcslurm-c4nodeset-0:6818) failed: Name or service not known
[2024-11-06T11:30:11.594] error: slurm_set_addr: Unable to resolve "hpcslurm-c4nodeset-0"
[2024-11-06T11:30:11.594] error: _thread_per_group_rpc: can't find address for host hpcslurm-c4nodeset-0, check slurm.conf
[2024-11-06T11:30:30.136] Node hpcslurm-c4nodeset-1 now responding
[2024-11-06T11:34:49.621] node hpcslurm-c4nodeset-0 not resumed by ResumeTimeout(300), setting DOWN and POWERED_DOWN
[2024-11-06T11:35:39.930] update_node: node hpcslurm-c4nodeset-0 state set to IDLE
[2024-11-06T11:36:40.047] update_node: node hpcslurm-c4nodeset-1 state set to IDLE
[2024-11-06T11:36:41.002] sched: Allocate JobId=2 NodeList=hpcslurm-c4nodeset-[0-1] #CPUs=2 Partition=c4
[2024-11-06T11:37:11.659] _update_job: setting admin_comment to GCP Error: QUOTA_EXCEEDED: Quota CPUS_PER_VM_FAMILY exceeded. Limit: 12.0 in region europe-west4. for JobId=2
[2024-11-06T11:37:11.660] _slurm_rpc_update_job: complete JobId=2 uid=981 usec=1569
[2024-11-06T11:37:11.674] update_node: node hpcslurm-c4nodeset-0 reason set to: GCP Error: QUOTA_EXCEEDED: Quota CPUS_PER_VM_FAMILY exceeded. Limit: 12.0 in region europe-west4.
[2024-11-06T11:37:11.674] requeue job JobId=2 due to failure of node hpcslurm-c4nodeset-0
[2024-11-06T11:37:11.675] Requeuing JobId=2
[2024-11-06T11:37:11.675] update_node: node hpcslurm-c4nodeset-0 state set to DOWN
[2024-11-06T11:37:11.694] error: xgetaddrinfo: getaddrinfo(hpcslurm-c4nodeset-0:6818) failed: Name or service not known
[2024-11-06T11:37:11.694] error: slurm_set_addr: Unable to resolve "hpcslurm-c4nodeset-0"
[2024-11-06T11:37:11.694] error: _thread_per_group_rpc: can't find address for host hpcslurm-c4nodeset-0, check slurm.conf
[2024-11-06T11:37:31.409] Node hpcslurm-c4nodeset-1 now responding
[2024-11-06T11:41:48.652] node hpcslurm-c4nodeset-0 not resumed by ResumeTimeout(300), setting DOWN and POWERED_DOWN
[2024-11-06T11:42:39.867] update_node: node hpcslurm-c4nodeset-0 state set to IDLE
[2024-11-06T11:43:40.011] update_node: node hpcslurm-c4nodeset-1 state set to IDLE
[2024-11-06T11:43:41.003] sched: Allocate JobId=2 NodeList=hpcslurm-c4nodeset-[0-1] #CPUs=2 Partition=c4
Execution environment
OS: any
Shell (To find this, run ps -p $$): zsh
go version: go version go1.21.5 darwin/arm64
This behavior was working as intended. In Slurm-GCP we had set minCount = 1 in the bulk API request, so the Bulk API would try to provision resources up until the quota was reached, and as long as the number of provisioned resources was at least minCount, the overall operation was considered successful.
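As a rough illustration of those semantics (a sketch, not the actual Slurm-GCP code; the name pattern and zone below are hypothetical), a bulk request with --count=2 and --min-count=1 is reported as successful as long as at least one instance is created, even though the second node never materialises:

# Bulk request for 2 instances, treated as successful if at least 1 is created
gcloud compute instances bulk create \
  --name-pattern="hpcslurm-c4nodeset-#" \
  --count=2 \
  --min-count=1 \
  --machine-type=c4-standard-8 \
  --zone=europe-west4-a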