Issue with the fence_gce agent when there are more than 500 servers in a zone #617

Open
casierjean opened this issue Feb 26, 2025 · 4 comments

@casierjean

Hello,

We are implementing fence_gce as our stonith agent for our clusters on a cloud provider.

We are using a Corosync/Pacemaker cluster with 2 nodes, each configured with a stonith resource.

Example of a stonith configuration for 1 node:

primitive STONITH-primary stonith:fence_gce \
	op monitor interval=300s timeout=300s on-fail=restart \
	op start interval=0 timeout=60s on-fail=restart \
	params port=<hostname> plugzonemap="<gcp_region>-a,<gcp_region>-b,<gcp_region>-c" pcmk_monitor_retries=4 pcmk_reboot_timeout=300 pcmk_delay_max=30 \
	meta is-managed=true

We detected an issue when both nodes are stopped. If we start one node, the other node keeps the UNCLEAN status in the cluster.

Checking further, it seems that the stonith configured for the node in UNCLEAN status is not registered by the cluster for that node.
We checked this with the command stonith_admin --list=<hostname>, but it is not able to find any stonith device attached to the node.

According to the stonith documentation for Pacemaker, it seems that if a hostname is specified in the stonith configuration, Pacemaker tries to fetch the node through a dynamic list built by running the list operation of the stonith agent.

I tried to reproduce the list operation with the command fence_gce -o list -n <hostname>, but the result of the command is limited to 500 entries.
We have more than 500 instances on our cloud provider.

I think this 500-entry limit on the list operation prevents the cluster from attaching the stonith agent to the node, which in turn blocks fencing of the node in UNCLEAN status, so it never reaches the OFFLINE status and the cluster resources do not start.

Is it possible to increase the number of entries returned by the Python function used for the list operation in fence_gce?

thanks

@oalbrigt
Collaborator

oalbrigt commented Feb 26, 2025

The error might be specific to the list-action.

Does it work if you run fence_gce -n <hostname> -o status? If it does, you can try the on/off/reboot actions as well.

In addition, you can use pcmk_host_map="n1:p1;n2:p2,p3" to map hostnames to VM names when using it with Pacemaker.
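
For example, added to a primitive like the one in the issue description (placeholder names; the other params elided):

primitive STONITH-primary stonith:fence_gce \
	params pcmk_host_map="<hostname1>:<vm_name1>;<hostname2>:<vm_name2>" ...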

@casierjean
Author

casierjean commented Feb 26, 2025

Yes, the fence_gce command is working fine. I was able to start, stop, and get the status of both nodes of the cluster.

I did a test by adding the following parameters: pcmk_host_list=<node2> pcmk_host_check=static-list. Now the cluster is able to change the status from UNCLEAN to OFFLINE.
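
For reference, the relevant part of the secondary stonith primitive now looks something like this (placeholders kept, other params unchanged):

primitive STONITH-secondary stonith:fence_gce \
	params port=<node2> pcmk_host_list=<node2> pcmk_host_check=static-list ...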

Result from stonith_admin --list=NODE2

STONITH-secondary
1 fence device found

Log from the stonith history:

Failed Fencing Actions:
  * turning off of NODE2 failed: client=pacemaker-controld.2168, origin=NODE2, last-failed='2025-02-26 14:23:58 +01:00' (a later attempt from NODE2 succeeded)

To add some information: I deployed this new agent (fence_gce) on other clusters without any issue. It may be a coincidence, but all the nodes/clusters that are working were among the 500 instances returned by the list operation.

@casierjean
Author

In fact, according to the Google API documentation, it seems that results from the GCP API are limited to 500 per page: https://cloud.google.com/compute/docs/reference/rest/v1/instances/list

I see that someone on Stack Overflow hit the same GCP API limitation in another context: https://stackoverflow.com/questions/72805673/how-to-manage-gcp-google-api-client-calls-that-have-over-500-results

We would need to update the fence_gce code to make it possible to list instances from all the pages.
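
For illustration, the kind of paging loop that would be needed, as a rough sketch against the public google-api-python-client rather than the actual fence_gce code (project and zone are placeholders, and credentials are assumed to come from the default environment):

import googleapiclient.discovery

def list_all_instances(project, zone):
    # Build the Compute Engine API client; auth comes from the environment.
    conn = googleapiclient.discovery.build("compute", "v1")
    request = conn.instances().list(project=project, zone=zone)
    names = []
    while request is not None:
        response = request.execute()
        for instance in response.get("items", []):
            names.append(instance["name"])
        # list_next() returns None after the last page, so every page
        # (beyond the 500-result default) gets collected.
        request = conn.instances().list_next(
            previous_request=request, previous_response=response)
    return names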

I think it would be easier to use the Pacemaker pcmk_host_list parameter for the moment.

thanks

@oalbrigt
Collaborator

Nice to hear you found a solution. I will try to add a filter parameter in the future to avoid these issues.

@oalbrigt oalbrigt self-assigned this Feb 27, 2025