We have configured two database nodes with automatic failover using repmgr, on two appliances set up as database-only appliances.

The other appliances are connected to the primary and have the evm-failover-monitor service enabled and started.

The expected result with this setup is that if we stop PostgreSQL on the primary database node, repmgr fails over to the node that was acting as standby, and the application nodes become aware of that change and start pointing to the new primary, so there is only a short period of degraded service.

The result we actually get is that the replication manager works properly: it marks the old primary as failed and promotes the node that was replicating as standby to be the new primary. The application nodes are aware of the change, because evm.log indicates that the old primary is no longer reachable on port 5432 and that it is going to point to a new primary. But then everything gets stuck and starts failing: neither the UI nor the workers work, yet the application logs show no error or problem.
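As a side note, the database-side part of the failover can be confirmed with the standard repmgr CLI (run on a surviving database node; the node names in the output are site-specific):

```shell
# As the postgres user on the surviving database appliance:
repmgr cluster show
# Expected: the old primary listed as failed/unreachable and the
# former standby shown with role "primary".
```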
As far as I can see, evmserverd.service has its [Unit] section specified with a dependency on the manageiq-db-ready service, and its [Service] section with Restart=on-failure.
But if we follow the procedure described above, what I can see is that the manageiq-db-ready service gets stuck pointing to the old primary, which is no longer working.

This prevents the application from successfully recovering in this scenario: because manageiq-db-ready is stuck, the restart setting on evmserverd doesn't matter, and evmserverd ends up stuck in a restart loop.
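For anyone reproducing this, a few diagnostic commands to confirm the stuck state (standard systemd tooling; unit names as above):

```shell
# On an application appliance after the failover:
systemctl status manageiq-db-ready.service     # shows the unit hanging on the old primary
journalctl -u manageiq-db-ready.service -n 50  # recent log lines from the readiness check
systemctl list-dependencies evmserverd.service # confirms the dependency chain
```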
Checking manageiq-db-ready.service, it specifies Type=oneshot and has no directive to restart upon failure.
I've commented out the Type=oneshot line and specified that I expect the service to restart on failure with:
Restart=on-failure
RestartSec=30s
After a daemon-reload and a service restart, if I try to reproduce the same scenario, the application keeps working after the primary database is stopped: everything restarts and starts pointing to the new primary. So in the end, it works as expected.
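For reference, the same change can be expressed as a systemd drop-in instead of editing the shipped unit file. This is a sketch assuming the standard override location, with Type=simple standing in for the commented-out Type=oneshot (simple is the default when Type= is unset and ExecStart= is set):

```ini
# /etc/systemd/system/manageiq-db-ready.service.d/override.conf
[Service]
Type=simple
Restart=on-failure
RestartSec=30s
```

Applied with `systemctl daemon-reload` followed by `systemctl restart manageiq-db-ready.service`, matching the procedure above.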
My questions are: is the Type= directive mandatory? Would a restart-on-failure setting on that service cause any problems? Could this be reviewed for the next appliance versions, please?
Thanks in advance.