Issue with the fence_gce agent when there are more than 500 servers in a zone #617

Open
casierjean opened this issue Feb 26, 2025 · 4 comments

@casierjean

Hello,

We are implementing fence_gce as our stonith agent for our clusters on a cloud provider.

We are using a Corosync/Pacemaker cluster with 2 nodes, each configured with a stonith resource.

Example of a stonith configuration for 1 node:

primitive STONITH-primary stonith:fence_gce \
	op monitor interval=300s timeout=300s on-fail=restart \
	op start interval=0 timeout=60s on-fail=restart \
	params port=<hostname> plugzonemap="<gcp_region>-a,<gcp_region>-b,<gcp_region>-c" pcmk_monitor_retries=4 pcmk_reboot_timeout=300 pcmk_delay_max=30 \
	meta is-managed=true

We detected an issue when both nodes are stopped. If we start one node, the other node keeps the UNCLEAN status in the cluster.

Checking further, it seems that the stonith configured for the node in UNCLEAN status is not registered by the cluster for that node.
We checked this with the command stonith_admin --list=<hostname>, but it is not able to find any stonith device attached to the node.

According to the stonith documentation for Pacemaker, it seems that if a hostname is specified in the stonith configuration, Pacemaker tries to fetch the node through a dynamic list built by running the list operation of the stonith agent.

I tried to reproduce the list operation with the command fence_gce -o list -n <hostname>, but the result of the command is limited to 500 entries.
We have more than 500 instances on our cloud provider.

I think this 500-entry limit on the list operation prevents the cluster from attaching the stonith agent to the node, which in turn blocks fencing of the node in UNCLEAN status, so it never reaches the OFFLINE status and the cluster resources do not start.

Is it possible to increase the number of entries returned by the Python function used for the list operation in fence_gce?

thanks

@oalbrigt
Collaborator

oalbrigt commented Feb 26, 2025

The error might be specific to the list-action.

Does it work if you run fence_gce -n <hostname> -o status? If it does, you can try the on/off/reboot actions as well.

In addition, you can use pcmk_host_map="n1:p1;n2:p2,p3" to map hostnames to VM names when using it with Pacemaker.
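
For example, added to a primitive like the one in the issue description (placeholder names; the other params elided):

primitive STONITH-primary stonith:fence_gce \
	params pcmk_host_map="<hostname1>:<vm_name1>;<hostname2>:<vm_name2>" ...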

@casierjean
Author

casierjean commented Feb 26, 2025

Yes, the fence_gce command is working fine. I was able to start, stop, and get the status of both nodes of the cluster.

I did a test by adding the following parameters: pcmk_host_list=<node2> pcmk_host_check=static-list. Now the cluster is able to change the status from UNCLEAN to OFFLINE.
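
For reference, the relevant part of the secondary stonith primitive now looks something like this (placeholders kept, other params unchanged):

primitive STONITH-secondary stonith:fence_gce \
	params port=<node2> pcmk_host_list=<node2> pcmk_host_check=static-list ...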

Result from stonith_admin --list=NODE2

STONITH-secondary
1 fence device found

Log from the stonith history:

Failed Fencing Actions:
  * turning off of NODE2 failed: client=pacemaker-controld.2168, origin=NODE2, last-failed='2025-02-26 14:23:58 +01:00' (a later attempt from NODE2 succeeded)

To add some information: I deployed this new agent (fence_gce) on other clusters without any issue. It may be a coincidence, but all the nodes/clusters that are working were among the 500 instances returned by the list operation.

@casierjean
Author

In fact, according to the Google API documentation, it seems that results from the GCP API are limited to 500 per page: https://cloud.google.com/compute/docs/reference/rest/v1/instances/list

I see that someone on Stack Overflow hit the same GCP API limitation in another context: https://stackoverflow.com/questions/72805673/how-to-manage-gcp-google-api-client-calls-that-have-over-500-results

We would need to update the fence_gce code to make it possible to list instances from all the pages.
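
For illustration, the kind of paging loop that would be needed, as a rough sketch against the public google-api-python-client rather than the actual fence_gce code (project and zone are placeholders, and credentials are assumed to come from the default environment):

import googleapiclient.discovery

def list_all_instances(project, zone):
    # Build the Compute Engine API client; auth comes from the environment.
    conn = googleapiclient.discovery.build("compute", "v1")
    request = conn.instances().list(project=project, zone=zone)
    names = []
    while request is not None:
        response = request.execute()
        for instance in response.get("items", []):
            names.append(instance["name"])
        # list_next() returns None after the last page, so every page
        # (beyond the 500-result default) gets collected.
        request = conn.instances().list_next(
            previous_request=request, previous_response=response)
    return names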

I think it would be easier to use the Pacemaker pcmk_host_list parameter for the moment.

thanks

@oalbrigt
Collaborator

Nice to hear you found a solution. I will try to add a filter parameter in the future to avoid these issues.

@oalbrigt oalbrigt self-assigned this Feb 27, 2025