FirstFitRouting Allocator only tries the first cluster it finds then fails before attempting others #10598

Open
loth opened this issue Mar 20, 2025 · 6 comments · May be fixed by #10603

loth commented Mar 20, 2025

Our CloudStack environment has two clusters, one Windows and one Linux. We use host tags to control where VMs are placed. When we put a host into maintenance, it sometimes just falls into an error state. Checking the logs, I found that the allocator checked the first cluster (ID 19, Windows) but never tried any hosts in the Linux cluster (ID 16).

Here is the log from the management server. As we can see, the allocator tried to find resources in cluster 19, failed to find any with the Linux tag, then failed out before trying the other cluster.

2025-03-20 20:04:13,017 DEBUG [c.c.v.VmWorkJobDispatcher] (Work-Job-Executor-145:[ctx-2ec22277, job-3900296/job-3901388]) (logid:30d64f64) Run VM work job: com.cloud.vm.VmWorkMigrateAway for VM 8127, job origin: 3900296
2025-03-20 20:04:13,018 DEBUG [c.c.v.VmWorkJobHandlerProxy] (Work-Job-Executor-145:[ctx-2ec22277, job-3900296/job-3901388, ctx-e309f975]) (logid:30d64f64) Execute VM work job: com.cloud.vm.VmWorkMigrateAway{"srcHostId":982,"userId":1,"accountId":1,"vmId":8127,"handlerName":"VirtualMachineManagerImpl"}
2025-03-20 20:04:13,031 INFO  [c.c.v.ClusteredVirtualMachineManagerImpl] (Work-Job-Executor-145:[ctx-2ec22277, job-3900296/job-3901388, ctx-e309f975]) (logid:30d64f64) Searching for hosts in the zone for vm migration
2025-03-20 20:04:13,034 DEBUG [c.c.d.DeploymentPlanningManagerImpl] (Work-Job-Executor-145:[ctx-2ec22277, job-3900296/job-3901388, ctx-e309f975]) (logid:30d64f64) Trying to deploy VM [error decoding VM instance {"id":8127,"instanceName":"i-10989-8127-VM","type":"User","uuid":"ce0c1fb6-65ef-4942-97de-05b255fdcd7d"}] and details: Plan [{"_dcId":14,"_poolId":637,"_recreateDisks":false,"preferredHostIds":[],"migrationPlan":true,"hostPriorities":{}}]; avoid list [{"_clusterIds":[24],"_hostIds":[982]}] and planner: [].
2025-03-20 20:04:13,037 DEBUG [c.c.d.DeploymentPlanningManagerImpl] (Work-Job-Executor-145:[ctx-2ec22277, job-3900296/job-3901388, ctx-e309f975]) (logid:30d64f64) Checking non dedicated resources to deploy VM [{"instanceName":"i-10989-8127-VM","type":"User","uuid":"ce0c1fb6-65ef-4942-97de-05b255fdcd7d"}].
2025-03-20 20:04:13,046 DEBUG [c.c.d.DeploymentPlanningManagerImpl] (Work-Job-Executor-145:[ctx-2ec22277, job-3900296/job-3901388, ctx-e309f975]) (logid:30d64f64) Adding pods [[]], clusters [[]] and hosts [[]] to the avoid list in the deploy process of user VM ["ce0c1fb6-65ef-4942-97de-05b255fdcd7d"], because this VM is not explicitly dedicated to these components.
2025-03-20 20:04:13,048 DEBUG [o.a.c.u.r.ReflectionToStringBuilderUtils] (Work-Job-Executor-145:[ctx-2ec22277, job-3900296/job-3901388, ctx-e309f975]) (logid:30d64f64) Object is null, not reflecting it.
2025-03-20 20:04:13,049 DEBUG [o.a.c.u.r.ReflectionToStringBuilderUtils] (Work-Job-Executor-145:[ctx-2ec22277, job-3900296/job-3901388, ctx-e309f975]) (logid:30d64f64) Object is null, not reflecting it.
2025-03-20 20:04:13,049 DEBUG [c.c.d.DeploymentPlanningManagerImpl] (Work-Job-Executor-145:[ctx-2ec22277, job-3900296/job-3901388, ctx-e309f975]) (logid:30d64f64) Trying to allocate a host and storage pools from datacenter [{"name":"CS12","uuid":"81e07cf1-e395-4f5e-aa7e-4e9d80aa3cfb"}], pod [null], cluster [null], to deploy VM [{"instanceName":"i-10989-8127-VM","type":"User","uuid":"ce0c1fb6-65ef-4942-97de-05b255fdcd7d"}] with requested CPU [24000] and requested RAM [(32.00 GB) 34359738368].
2025-03-20 20:04:13,050 DEBUG [c.c.d.DeploymentPlanningManagerImpl] (Work-Job-Executor-145:[ctx-2ec22277, job-3900296/job-3901388, ctx-e309f975]) (logid:30d64f64) ROOT volume [69d114a8-5673-4ad7-a387-2ac0d11cf3a1] is ready to deploy VM [ce0c1fb6-65ef-4942-97de-05b255fdcd7d].
2025-03-20 20:04:13,054 DEBUG [c.c.d.DeploymentPlanningManagerImpl] (Work-Job-Executor-145:[ctx-2ec22277, job-3900296/job-3901388, ctx-e309f975]) (logid:30d64f64) Adding pods [] to the avoid set because these pods are in the Disabled state.
2025-03-20 20:04:13,057 DEBUG [c.c.d.DeploymentPlanningManagerImpl] (Work-Job-Executor-145:[ctx-2ec22277, job-3900296/job-3901388, ctx-e309f975]) (logid:30d64f64) Adding clusters [] of pod [5] to the void set because these clusters are in the Disabled state.
2025-03-20 20:04:13,060 DEBUG [c.c.d.DeploymentPlanningManagerImpl] (Work-Job-Executor-145:[ctx-2ec22277, job-3900296/job-3901388, ctx-e309f975]) (logid:30d64f64) Adding hosts [] of datacenter [81e07cf1-e395-4f5e-aa7e-4e9d80aa3cfb] to the avoid set, because these hosts are in the Disabled state.
2025-03-20 20:04:13,065 DEBUG [c.c.d.DeploymentPlanningManagerImpl] (Work-Job-Executor-145:[ctx-2ec22277, job-3900296/job-3901388, ctx-e309f975]) (logid:30d64f64) DeploymentPlan [DataCenterDeployment] has not specified host. Trying to find another destination to deploy VM [ce0c1fb6-65ef-4942-97de-05b255fdcd7d], avoiding pods [], clusters [24] and hosts [982].
2025-03-20 20:04:13,065 DEBUG [c.c.d.DeploymentPlanningManagerImpl] (Work-Job-Executor-145:[ctx-2ec22277, job-3900296/job-3901388, ctx-e309f975]) (logid:30d64f64) Deploy avoids pods: [], clusters: [24], hosts: [982].
2025-03-20 20:04:13,065 DEBUG [c.c.d.DeploymentPlanningManagerImpl] (Work-Job-Executor-145:[ctx-2ec22277, job-3900296/job-3901388, ctx-e309f975]) (logid:30d64f64) Deploy hosts with priorities {}, hosts have NORMAL priority by default
2025-03-20 20:04:13,066 DEBUG [c.c.d.DeploymentPlanningManagerImpl] (Work-Job-Executor-145:[ctx-2ec22277, job-3900296/job-3901388, ctx-e309f975]) (logid:30d64f64) This VM has last host_id specified, trying to choose the same host: 982
2025-03-20 20:04:13,069 WARN  [c.c.d.DeploymentPlanningManagerImpl] (Work-Job-Executor-145:[ctx-2ec22277, job-3900296/job-3901388, ctx-e309f975]) (logid:30d64f64) The last host [bf145201-bcab-4634-8750-36b064574bc7] of VM [ce0c1fb6-65ef-4942-97de-05b255fdcd7d] is in the avoid set. Skipping this and trying other available hosts.
2025-03-20 20:04:13,069 DEBUG [c.c.d.DeploymentPlanningManagerImpl] (Work-Job-Executor-145:[ctx-2ec22277, job-3900296/job-3901388, ctx-e309f975]) (logid:30d64f64) Cannot choose the last host to deploy this VM VM instance {"id":8127,"instanceName":"i-10989-8127-VM","type":"User","uuid":"ce0c1fb6-65ef-4942-97de-05b255fdcd7d"}.
2025-03-20 20:04:13,072 DEBUG [c.c.d.FirstFitPlanner] (Work-Job-Executor-145:[ctx-2ec22277, job-3900296/job-3901388, ctx-e309f975]) (logid:30d64f64) Searching all possible resources under this Zone: 14
2025-03-20 20:04:13,076 DEBUG [c.c.d.FirstFitPlanner] (Work-Job-Executor-145:[ctx-2ec22277, job-3900296/job-3901388, ctx-e309f975]) (logid:30d64f64) Listing clusters in order of aggregate capacity, that have (at least one host with) enough CPU and RAM capacity under this Zone: 14
2025-03-20 20:04:13,084 DEBUG [c.c.d.FirstFitPlanner] (Work-Job-Executor-145:[ctx-2ec22277, job-3900296/job-3901388, ctx-e309f975]) (logid:30d64f64) Removing from the clusterId list these clusters from avoid set: [24]
2025-03-20 20:04:13,102 DEBUG [c.c.d.FirstFitPlanner] (Work-Job-Executor-145:[ctx-2ec22277, job-3900296/job-3901388, ctx-e309f975]) (logid:30d64f64) The clusterId list for the given offering tag: [19, 16]
2025-03-20 20:04:13,106 DEBUG [c.c.d.DeploymentPlanningManagerImpl] (Work-Job-Executor-145:[ctx-2ec22277, job-3900296/job-3901388, ctx-e309f975]) (logid:30d64f64) Checking resources in Cluster: 19 under Pod: 5
2025-03-20 20:04:13,111 INFO  [c.c.a.m.a.i.FirstFitRoutingAllocator] (Work-Job-Executor-145:[ctx-2ec22277, job-3900296/job-3901388, ctx-e309f975, FirstFitRoutingAllocator]) (logid:30d64f64)  Guest VM is requested with Custom[UEFI] Boot Type false
2025-03-20 20:04:13,111 DEBUG [c.c.a.m.a.i.FirstFitRoutingAllocator] (Work-Job-Executor-145:[ctx-2ec22277, job-3900296/job-3901388, ctx-e309f975, FirstFitRoutingAllocator]) (logid:30d64f64) Looking for hosts in zone [14], pod [5], cluster [19]
2025-03-20 20:04:13,111 DEBUG [c.c.a.m.a.i.FirstFitRoutingAllocator] (Work-Job-Executor-145:[ctx-2ec22277, job-3900296/job-3901388, ctx-e309f975, FirstFitRoutingAllocator]) (logid:30d64f64) Looking for hosts having tag specified on SvcOffering:generic
2025-03-20 20:04:13,118 DEBUG [c.c.a.m.a.i.FirstFitRoutingAllocator] (Work-Job-Executor-145:[ctx-2ec22277, job-3900296/job-3901388, ctx-e309f975, FirstFitRoutingAllocator]) (logid:30d64f64) Hosts with tag 'generic' are:[Host {"id":919,"name":"node87","type":"Routing","uuid":"e5668da2-862f-4cea-bda8-20707525ecda"}, Host {"id":1126,"name":"node51","type":"Routing","uuid":"be3eb6a2-29b3-4a1e-a125-af8f9135d3a6"}, Host {"id":1129,"name":"node52","type":"Routing","uuid":"a803eaea-82a1-4c10-a4d2-b0db6c8c15c6"}, Host {"id":1350,"name":"node84","type":"Routing","uuid":"1a006517-3a4b-4a98-85a2-f693fa4f4501"}, Host {"id":1368,"name":"node50","type":"Routing","uuid":"bc93e3f6-3751-46fd-8dee-c24f6e8fc922"}]
2025-03-20 20:04:13,118 DEBUG [c.c.a.m.a.i.FirstFitRoutingAllocator] (Work-Job-Executor-145:[ctx-2ec22277, job-3900296/job-3901388, ctx-e309f975, FirstFitRoutingAllocator]) (logid:30d64f64) Looking for hosts having tag specified on Template:linux
2025-03-20 20:04:13,123 DEBUG [c.c.a.m.a.i.FirstFitRoutingAllocator] (Work-Job-Executor-145:[ctx-2ec22277, job-3900296/job-3901388, ctx-e309f975, FirstFitRoutingAllocator]) (logid:30d64f64) Hosts with tag 'linux' are:[]
2025-03-20 20:04:13,124 DEBUG [c.c.a.m.a.i.FirstFitRoutingAllocator] (Work-Job-Executor-145:[ctx-2ec22277, job-3900296/job-3901388, ctx-e309f975, FirstFitRoutingAllocator]) (logid:30d64f64) Found 0 Hosts satisfying both tags, host ids are:[]
2025-03-20 20:04:13,126 ERROR [c.c.a.m.a.i.FirstFitRoutingAllocator] (Work-Job-Executor-145:[ctx-2ec22277, job-3900296/job-3901388, ctx-e309f975, FirstFitRoutingAllocator]) (logid:30d64f64) No suitable host found for vm [VM instance {"id":8127,"instanceName":"i-10989-8127-VM","type":"User","uuid":"ce0c1fb6-65ef-4942-97de-05b255fdcd7d"}] with tags [generic].
2025-03-20 20:04:13,127 ERROR [c.c.v.VmWorkJobHandlerProxy] (Work-Job-Executor-145:[ctx-2ec22277, job-3900296/job-3901388, ctx-e309f975]) (logid:30d64f64) Invocation exception, caused by: com.cloud.utils.exception.CloudRuntimeException: No suitable host found for vm [VM instance {"id":8127,"instanceName":"i-10989-8127-VM","type":"User","uuid":"ce0c1fb6-65ef-4942-97de-05b255fdcd7d"}].
2025-03-20 20:04:13,127 INFO  [c.c.v.VmWorkJobHandlerProxy] (Work-Job-Executor-145:[ctx-2ec22277, job-3900296/job-3901388, ctx-e309f975]) (logid:30d64f64) Rethrow exception com.cloud.utils.exception.CloudRuntimeException: No suitable host found for vm [VM instance {"id":8127,"instanceName":"i-10989-8127-VM","type":"User","uuid":"ce0c1fb6-65ef-4942-97de-05b255fdcd7d"}].
2025-03-20 20:04:13,127 DEBUG [c.c.v.VmWorkJobDispatcher] (Work-Job-Executor-145:[ctx-2ec22277, job-3900296/job-3901388]) (logid:30d64f64) Done with run of VM work job: com.cloud.vm.VmWorkMigrateAway for VM 8127, job origin: 3900296
2025-03-20 20:04:13,127 ERROR [c.c.v.VmWorkJobDispatcher] (Work-Job-Executor-145:[ctx-2ec22277, job-3900296/job-3901388]) (logid:30d64f64) Unable to complete AsyncJobVO: {id:3901388, userId: 1, accountId: 1, instanceType: null, instanceId: null, cmd: com.cloud.vm.VmWorkMigrateAway, cmdInfo: rO0ABXNyAB5jb20uY2xvdWQudm0uVm1Xb3JrTWlncmF0ZUF3YXmt4MX4jtcEmwIAAUoACXNyY0hvc3RJZHhyABNjb20uY2xvdWQudm0uVm1Xb3Jrn5m2VvAlZ2sCAARKAAlhY2NvdW50SWRKAAZ1c2VySWRKAAR2bUlkTAALaGFuZGxlck5hbWV0ABJMamF2YS9sYW5nL1N0cmluZzt4cAAAAAAAAAABAAAAAAAAAAEAAAAAAAAfv3QAGVZpcnR1YWxNYWNoaW5lTWFuYWdlckltcGwAAAAAAAAD1g, cmdVersion: 0, status: IN_PROGRESS, processStatus: 0, resultCode: 0, result: null, initMsid: 95539959339, completeMsid: null, lastUpdated: null, lastPolled: null, created: Thu Mar 20 20:04:12 UTC 2025, removed: null}, job origin: 3900296 com.cloud.utils.exception.CloudRuntimeException: No suitable host found for vm [VM instance {"id":8127,"instanceName":"i-10989-8127-VM","type":"User","uuid":"ce0c1fb6-65ef-4942-97de-05b255fdcd7d"}].
2025-03-20 20:04:13,128 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] (Work-Job-Executor-145:[ctx-2ec22277, job-3900296/job-3901388]) (logid:30d64f64) Complete async job-3901388, jobStatus: FAILED, resultCode: 0

I also checked the UI: when doing a manual migration, all Windows hosts are listed as invalid destinations in the host selection dialog (which I believe uses findHostsForMigration, so that part should be working fine).

versions

ACS 4.20, KVM with NFS Primary/Secondary storage

The steps to reproduce the bug

Set up two clusters, with host tags to identify where VMs should go (windows/linux).
Put one of the hosts into maintenance mode.
Check the log to see whether any cluster other than the first one is attempted.
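
The log check in the last step can be automated. The following is a minimal sketch, assuming the two message formats seen in the log above ("The clusterId list for the given offering tag: [...]" and "Checking resources in Cluster: N under Pod"). The class and method names are hypothetical and are not part of CloudStack.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of CloudStack: reports candidate clusters
// that the planner listed but never actually checked.
class ClusterLogCheck {
    // Message formats are taken from the management server log in this issue.
    private static final Pattern CANDIDATES =
            Pattern.compile("The clusterId list for the given offering tag: \\[([^\\]]*)\\]");
    private static final Pattern CHECKED =
            Pattern.compile("Checking resources in Cluster: (\\d+) under Pod");

    /** Returns candidate cluster ids that never appear in a "Checking resources" line. */
    static List<Long> uncheckedClusters(List<String> logLines) {
        List<Long> candidates = new ArrayList<>();
        List<Long> checked = new ArrayList<>();
        for (String line : logLines) {
            Matcher c = CANDIDATES.matcher(line);
            if (c.find()) {
                for (String id : c.group(1).split(",")) {
                    if (!id.trim().isEmpty()) {
                        candidates.add(Long.parseLong(id.trim()));
                    }
                }
            }
            Matcher k = CHECKED.matcher(line);
            if (k.find()) {
                checked.add(Long.parseLong(k.group(1)));
            }
        }
        candidates.removeAll(checked);
        return candidates;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "DEBUG [c.c.d.FirstFitPlanner] The clusterId list for the given offering tag: [19, 16]",
                "DEBUG [c.c.d.DeploymentPlanningManagerImpl] Checking resources in Cluster: 19 under Pod: 5");
        System.out.println(uncheckedClusters(lines)); // prints [16]
    }
}
```

For the log in this report, cluster 16 is listed as a candidate but never checked.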

What to do about it?

Correct the code to try all clusters instead of just the first.
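
The expected behaviour can be sketched as follows. This is an illustration only, not the CloudStack implementation: `HostAllocator`, `findHost`, and host id 2001 are hypothetical. It assumes, based on the ERROR and "Invocation exception" lines in the log, that the failure comes from an exception raised for the first cluster rather than an empty result being returned to the cluster loop.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch; none of these names are CloudStack APIs.
class ClusterFallbackSketch {

    /** Stand-in for a host allocator: returns the suitable host ids in one cluster. */
    interface HostAllocator {
        List<Long> allocate(long clusterId);
    }

    /** Walks the ordered cluster list and returns the first suitable host, if any. */
    static Optional<Long> findHost(List<Long> clusterIds, HostAllocator allocator, List<Long> visited) {
        for (long clusterId : clusterIds) {
            visited.add(clusterId);
            List<Long> hosts = allocator.allocate(clusterId);
            if (hosts.isEmpty()) {
                continue; // nothing suitable in this cluster, try the next one
            }
            return Optional.of(hosts.get(0));
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Cluster 19 has no host with both tags; cluster 16 has one (2001 is a made-up id).
        Map<Long, List<Long>> suitable = Map.of(19L, List.<Long>of(), 16L, List.of(2001L));

        // Allocator that returns an empty list: the loop moves on to cluster 16.
        List<Long> visited = new ArrayList<>();
        Optional<Long> host = findHost(List.of(19L, 16L), suitable::get, visited);
        System.out.println(host + " after visiting " + visited);

        // Allocator that throws on an empty result: the loop never reaches cluster 16.
        HostAllocator throwing = id -> {
            List<Long> hosts = suitable.get(id);
            if (hosts.isEmpty()) {
                throw new IllegalStateException("No suitable host found");
            }
            return hosts;
        };
        List<Long> visitedWhenThrowing = new ArrayList<>();
        try {
            findHost(List.of(19L, 16L), throwing, visitedWhenThrowing);
        } catch (IllegalStateException e) {
            System.out.println("aborted after visiting " + visitedWhenThrowing);
        }
    }
}
```

The point of the sketch is the contract between the two layers: if "no host in this cluster" is reported as an empty list, the cluster loop can keep going; if it is reported as an exception, the whole search stops at the first cluster.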

@Pearl1594
Contributor

As per my understanding, when the host a VM is running on is placed in maintenance mode, only clusters with the same hypervisor type as that host are considered when migrating its VMs. Do both clusters have the same hypervisor type?

@loth
Author

loth commented Mar 20, 2025

> As per my understanding, when the host a VM is running on is placed in maintenance mode, only clusters with the same hypervisor type as that host are considered when migrating its VMs. Do both clusters have the same hypervisor type?

The original host is node99 on the Linux cluster (ID 16) with tags: generic,linux,node_other,os_other,unix

Both clusters are KVM type

@weizhouapache
Member

It looks like both the service offering and the template have host tags.
Would that be a problem?

Looking for hosts having tag specified on SvcOffering:generic
Hosts with tag 'generic' are:[Host {"id":919,"name":"node87", ........
Looking for hosts having tag specified on Template:linux
Hosts with tag 'linux' are:[]
Found 0 Hosts satisfying both tags, host ids are:[]
No suitable host found for vm

By the way, is the global setting migrate.vm.across.clusters set to true?
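
The tag matching shown in the quoted log lines amounts to a set intersection per cluster. Here is a minimal sketch, assuming hosts are identified by id. The class and method names are hypothetical, and the host ids are the ones logged for cluster 19.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of "hosts satisfying both tags"; not CloudStack code.
class TagIntersectionSketch {

    /** Returns the ids present in both lists, preserving the order of the first list. */
    static List<Long> hostsWithBothTags(List<Long> offeringTagHosts, List<Long> templateTagHosts) {
        List<Long> result = new ArrayList<>(offeringTagHosts);
        result.retainAll(templateTagHosts);
        return result;
    }

    public static void main(String[] args) {
        // Cluster 19, from the log: five hosts carry 'generic', none carries 'linux'.
        List<Long> generic = List.of(919L, 1126L, 1129L, 1350L, 1368L);
        List<Long> linux = List.of();
        System.out.println(hostsWithBothTags(generic, linux)); // prints []
    }
}
```

An empty intersection for cluster 19 is the expected result here. It only says that this cluster has no match, not that no cluster has one.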

@loth
Author

loth commented Mar 20, 2025

> It looks like both the service offering and the template have host tags. Would that be a problem?
>
> Looking for hosts having tag specified on SvcOffering:generic
> Hosts with tag 'generic' are:[Host {"id":919,"name":"node87", ........
> Looking for hosts having tag specified on Template:linux
> Hosts with tag 'linux' are:[]
> Found 0 Hosts satisfying both tags, host ids are:[]
> No suitable host found for vm
>
> By the way, is the global setting migrate.vm.across.clusters set to true?

I'm not sure. Is it unsupported to use both? Even if only the service offering tag were used, both the Windows and the Linux hypervisors have the 'generic' tag, so it should still try to migrate to the other cluster.

Yes, migrate.vm.across.clusters is set to true. However, the VM was originally on the Linux cluster, so it wouldn't have needed to move clusters anyway.

EDIT: I set migrate.vm.across.clusters to false on another environment where I was having this issue, and the host went into maintenance without problems, so it's definitely related to the code around this option.

@weizhouapache weizhouapache modified the milestones: 4.19.3, 4.20.1 Mar 21, 2025
@weizhouapache
Member

Hi @loth,

@Pearl1594 has found some clues about what caused the issue. We will update you later.

@Pearl1594
Contributor

Pearl1594 commented Mar 21, 2025

Thanks for reporting this, @loth. It seems you have hit a bug. Investigating the details you shared, we identified that the problem (other clusters not being considered when the first cluster has no suitable hosts matching the constraints) is caused by:
https://github.com/apache/cloudstack/pull/7489/files#diff-fccb35a57d5f36b17a7ca93e2dde52d4f57839420f955b0780af7488cf9f690dR208-R212

This issue is most likely going to be encountered even with the RandomAllocator: https://github.com/apache/cloudstack/pull/7489/files#diff-039cdf205f1edfd0c7c6ebb0cb97e2caef1d08a8dd7dbcbd30813a68da5dae1bR100-R103

We'll look into it. Thanks

Edit: It seems part of the issue was fixed in #8611, which is available in 4.19.1 but missing in 4.20.0.

@Pearl1594 Pearl1594 linked a pull request Mar 21, 2025 that will close this issue
14 tasks
@Pearl1594 Pearl1594 moved this to In Progress in ACS 4.20.1 Mar 21, 2025
@Pearl1594 Pearl1594 removed this from ACS 4.20.1 Mar 21, 2025
@DaanHoogland DaanHoogland modified the milestones: 4.19.3, 4.20.1 Mar 24, 2025