We have configured two database nodes with automatic failover using repmgr, on two appliances set up as database-only appliances.

The other appliances are connected to the primary and have the evm-failover-monitor service enabled and started.

The expected result with this setup is that if we stop PostgreSQL on the primary database node, repmgr fails over to the node that was acting as standby, and the application nodes become aware of that change and start pointing to the new primary, so there is only a short period of degraded service.

The result we actually get is that the replication manager works properly: it marks the old primary as failed and promotes the node that was replicating as standby to be the new primary. The application nodes are aware of the change, because evm.log indicates that the old primary is no longer reachable on port 5432 and that it is going to point to a new primary. But then everything gets stuck and starts failing: neither the UI nor the workers work, yet the application logs show no error or problem.
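As a side note, the database-side part of the failover can be confirmed with the standard repmgr CLI (run on a surviving database node; the node names in the output are site-specific):

```shell
# As the postgres user on the surviving database appliance:
repmgr cluster show
# Expected: the old primary listed as failed/unreachable and the
# former standby shown with role "primary".
```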
As far as I can see, evmserverd.service has its [Unit] section specified with a dependency on the manageiq-db-ready service, and its [Service] section with Restart=on-failure.
But if we follow the procedure described above, what I can see is that the manageiq-db-ready service gets stuck pointing to the old primary, which is no longer working.

This prevents the application from successfully recovering in this scenario: because manageiq-db-ready is stuck, the restart setting on evmserverd doesn't matter, and evmserverd ends up stuck in a restart loop.
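For anyone reproducing this, a few diagnostic commands to confirm the stuck state (standard systemd tooling; unit names as above):

```shell
# On an application appliance after the failover:
systemctl status manageiq-db-ready.service     # shows the unit hanging on the old primary
journalctl -u manageiq-db-ready.service -n 50  # recent log lines from the readiness check
systemctl list-dependencies evmserverd.service # confirms the dependency chain
```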
Checking manageiq-db-ready.service, it specifies Type=oneshot and has no directive to restart upon failure.
I've commented out the Type=oneshot line and specified that I expect the service to restart on failure with:
Restart=on-failure
RestartSec=30s
After a daemon-reload and a service restart, if I try to reproduce the same scenario, the application keeps working after the primary database is stopped: everything restarts and starts pointing to the new primary. So in the end, it works as expected.
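For reference, the same change can be expressed as a systemd drop-in instead of editing the shipped unit file. This is a sketch assuming the standard override location, with Type=simple standing in for the commented-out Type=oneshot (simple is the default when Type= is unset and ExecStart= is set):

```ini
# /etc/systemd/system/manageiq-db-ready.service.d/override.conf
[Service]
Type=simple
Restart=on-failure
RestartSec=30s
```

Applied with `systemctl daemon-reload` followed by `systemctl restart manageiq-db-ready.service`, matching the procedure above.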
My questions are: is the Type= directive mandatory? Would a restart-on-failure setting on that service cause any problems? Could this be reviewed for the next appliance versions, please?
Thanks in advance.