graph-node: Probe doesn't detect provider failure #417

josedev-union · 2023-07-12T06:35:56Z

Issue

Graph node went down after the Graph node database server was upgraded. During the upgrade, graph node lost its db connection for about a minute.
I was able to fetch the following error logs from graph node.

Jul 12 05:58:20.416 CRIT Error receiving message, error: db error: FATAL: terminating connection due to administrator command, channel: chain_head_updates, component: ChainHeadUpdateListener > NotificationListener
Jul 12 05:58:20.441 CRIT Error receiving message, error: db error: FATAL: terminating connection due to administrator command, channel: store_events, component: NotificationListener
Jul 12 05:58:20.623 ERRO Failed to connect notification listener: db error: FATAL: the database system is shutting down, retry_delay_s: 1, attempt: 0, channel: store_events, component: NotificationListener
Jul 12 05:58:20.639 ERRO Failed to connect notification listener: db error: FATAL: the database system is shutting down, retry_delay_s: 1, attempt: 0, channel: chain_head_updates, component: ChainHeadUpdateListener > NotificationListener
Jul 12 05:58:21.634 ERRO Failed to connect notification listener: error connecting to server: Connection refused (os error 111), retry_delay_s: 2, attempt: 1, channel: store_events, component: NotificationListener
Jul 12 05:58:21.647 ERRO Failed to connect notification listener: error connecting to server: Connection refused (os error 111), retry_delay_s: 2, attempt: 1, channel: chain_head_updates, component: ChainHeadUpdateListener > NotificationListener
Jul 12 05:59:02.699 ERRO Postgres connection error, error: terminating connection due to administrator command, pool: main, shard: primary, component: ConnectionPool

The problem is that it didn't retry connecting to the db and remained in a failed state for over 10 mins. So we had to reboot graph node manually.

Expectation

We have an alerting channel which fires an alert if the eth_rpc_status{provider="xxx"} metric is not equal to 0. So we received the alerts immediately, but expected them to resolve automatically a few minutes after the db upgrade finished.
If the liveness probe and readiness probe could detect this kind of issue properly so that the pod is recreated automatically, that would be perfect.
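
To illustrate the kind of check I have in mind, here is a minimal sketch of a script that an exec-style liveness probe could run. It is not how the chart's probes currently work. It assumes graph node exposes Prometheus text-format metrics at a URL passed on the command line, and that a non-zero eth_rpc_status value reliably marks a failed provider, as our alert does. The function names `failing_providers` and `probe` are hypothetical.

```python
import re
import sys
import urllib.request

METRIC = "eth_rpc_status"  # metric name taken from our alert rule


def failing_providers(metrics_text, metric=METRIC):
    """Return the provider labels whose metric value is not 0."""
    # Simplified parser: assumes label values contain no '}' character.
    pattern = re.compile(r"^" + re.escape(metric) + r"\{([^}]*)\}\s+(\S+)")
    failing = []
    for line in metrics_text.splitlines():
        match = pattern.match(line)
        if not match:
            continue  # comment line or a different metric
        labels, value = match.groups()
        provider = re.search(r'provider="([^"]*)"', labels)
        if float(value) != 0:
            failing.append(provider.group(1) if provider else labels)
    return failing


def probe(url):
    """Return 0 if every provider is healthy, 1 otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            text = resp.read().decode("utf-8")
    except OSError as err:
        print(f"metrics endpoint unreachable: {err}", file=sys.stderr)
        return 1
    bad = failing_providers(text)
    if bad:
        print(f"unhealthy providers: {bad}", file=sys.stderr)
        return 1
    return 0


# Usage (hypothetical): python probe.py <metrics-url>
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(probe(sys.argv[1]))
```

A probe like this would need a generous failure threshold, otherwise a short provider outage would restart the pod as well.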
