Issue with fence_gce agent when there are more than 500 servers in a zone #617
Comments
The error might be specific to the list action. Does it work if you run fence_gce -n -o status? You can try the on/off/reboot actions as well. In addition, you can use pcmk_host_map="n1:p1;n2:p2,p3" to map hostnames to VM names when using it with Pacemaker.
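To make the pcmk_host_map format above concrete, here is a minimal Python sketch of how such a string reads: entries are separated by semicolons, and each entry maps a cluster node name to one or more comma-separated VM names. The function name `parse_host_map` is hypothetical and this is only an illustration of the format, not Pacemaker's own parser.

```python
def parse_host_map(host_map):
    """Turn 'n1:p1;n2:p2,p3' into {'n1': ['p1'], 'n2': ['p2', 'p3']}.

    Hypothetical helper for illustration; Pacemaker does its own parsing.
    """
    mapping = {}
    for entry in host_map.split(";"):
        entry = entry.strip()
        if not entry:
            continue  # tolerate a trailing semicolon
        # Left of the first colon: cluster node name; right: VM name(s).
        node, _, ports = entry.partition(":")
        mapping[node.strip()] = [p.strip() for p in ports.split(",") if p.strip()]
    return mapping


print(parse_host_map("n1:p1;n2:p2,p3"))
```

With a static map like this, the cluster already knows which VM belongs to which node, so it should not need to discover the node through the agent's list output.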
Yes, the fence_gce command works fine. I was able to start, stop, and get the status of both nodes of the cluster. I did the test by adding the following parameters Result from
Log from stonith history:
To add some information: I deployed this new agent (fence_gce) on other clusters without any issue. It may be a coincidence, but all the nodes of the clusters that work were among the 500 instances returned by the list operation.
In fact, according to the Google API documentation, it seems that results from the GCP API are limited to 500 per request: https://cloud.google.com/compute/docs/reference/rest/v1/instances/list I see that someone on Stack Overflow hit the same GCP API limit on another topic: https://stackoverflow.com/questions/72805673/how-to-manage-gcp-google-api-client-calls-that-have-over-500-results We would need to update the fence_gce code so that it can list instances from all the pages. I think it is easier to use the Pacemaker pcmk_host_list parameter for the moment. Thanks
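Listing instances from all the pages could look roughly like the sketch below. It assumes the agent reaches the Compute Engine API through google-api-python-client, whose resource objects provide `list()` and `list_next()`; the function name `list_all_instances` and its parameters are hypothetical, and this is not the actual fence_gce code.

```python
def list_all_instances(compute, project, zone):
    """Collect instances from every result page, not only the first one.

    `compute` is assumed to be a client such as
    googleapiclient.discovery.build("compute", "v1").
    """
    instances = []
    request = compute.instances().list(project=project, zone=zone)
    while request is not None:
        response = request.execute()
        # "items" is absent when a page (or the whole zone) is empty.
        instances.extend(response.get("items", []))
        # list_next() returns None once the response carries no nextPageToken.
        request = compute.instances().list_next(
            previous_request=request, previous_response=response
        )
    return instances
```

The loop issues one request per page, so a zone with more than 500 instances costs a few extra API calls on every list operation. A static pcmk_host_list avoids those calls entirely, which is why it remains a reasonable workaround.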
Nice to hear you found a solution. I will try to add a filter parameter in the future to avoid these issues.
Hello,
We are implementing fence_gce as the stonith agent for our clusters on a cloud provider.
We are using a two-node Corosync/Pacemaker cluster configured with stonith.
Example stonith configuration for one node:
We detected an issue when both nodes are stopped. If we start one node, the other node keeps the UNCLEAN status in the cluster.
On further investigation, it seems that the stonith device configured for the node in UNCLEAN status is not registered by the cluster for that node.
We checked this with the command
stonith_admin --list=<hostname>
but it is not able to find any stonith device attached. According to the stonith section of the Pacemaker documentation, it seems that if a hostname is specified in the stonith configuration, Pacemaker tries to find the node through a dynamic list, by running the list operation of the stonith agent.
I tried to reproduce the list operation with the command
fence_gce -o list -n <hostname>
but the result of the command is limited to 500 entries. We have more than 500 instances on our cloud provider.
I think this limit of 500 entries returned by the list operation prevents the cluster from attaching the stonith agent to the node. This in turn blocks fencing of the node in UNCLEAN status, so it never gets an OFFLINE status and the cluster resources cannot start.
Is it possible to increase the number of entries returned by the Python function used for the list operation in fence_gce?
thanks