Issue
Graph Node went down after the Graph Node database server was upgraded. During the upgrade, the database connection was lost for about a minute.
I was able to fetch the following error logs from the graph node:
```
Jul 12 05:58:20.416 CRIT Error receiving message, error: db error: FATAL: terminating connection due to administrator command, channel: chain_head_updates, component: ChainHeadUpdateListener > NotificationListener
Jul 12 05:58:20.441 CRIT Error receiving message, error: db error: FATAL: terminating connection due to administrator command, channel: store_events, component: NotificationListener
Jul 12 05:58:20.623 ERRO Failed to connect notification listener: db error: FATAL: the database system is shutting down, retry_delay_s: 1, attempt: 0, channel: store_events, component: NotificationListener
Jul 12 05:58:20.639 ERRO Failed to connect notification listener: db error: FATAL: the database system is shutting down, retry_delay_s: 1, attempt: 0, channel: chain_head_updates, component: ChainHeadUpdateListener > NotificationListener
Jul 12 05:58:21.634 ERRO Failed to connect notification listener: error connecting to server: Connection refused (os error 111), retry_delay_s: 2, attempt: 1, channel: store_events, component: NotificationListener
Jul 12 05:58:21.647 ERRO Failed to connect notification listener: error connecting to server: Connection refused (os error 111), retry_delay_s: 2, attempt: 1, channel: chain_head_updates, component: ChainHeadUpdateListener > NotificationListener
Jul 12 05:59:02.699 ERRO Postgres connection error, error: terminating connection due to administrator command, pool: main, shard: primary, component: ConnectionPool
```
The problem is that the node never reconnected to the database and remained in a failed state for over 10 minutes, so we had to reboot the graph node manually.
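For reference, the reconnect behavior we expected looks roughly like the following minimal sketch (assuming `tokio` and `tokio-postgres`; `connect_with_retry`, the connection string, and the backoff parameters are all illustrative, not graph-node's actual internals):

```rust
// Sketch of the expected behavior: keep retrying the Postgres connection
// with capped exponential backoff instead of giving up after a few attempts.
use std::time::Duration;

use tokio_postgres::NoTls;

async fn connect_with_retry(conn_str: &str) -> tokio_postgres::Client {
    let mut delay = Duration::from_secs(1);
    let max_delay = Duration::from_secs(30);

    loop {
        match tokio_postgres::connect(conn_str, NoTls).await {
            Ok((client, connection)) => {
                // Drive the connection on a background task; if it later
                // drops (e.g. during a server upgrade), the caller should
                // re-enter this loop rather than stay in a failed state.
                tokio::spawn(async move {
                    if let Err(e) = connection.await {
                        eprintln!("connection error: {e}");
                    }
                });
                return client;
            }
            Err(e) => {
                eprintln!("connect failed: {e}; retrying in {delay:?}");
                tokio::time::sleep(delay).await;
                delay = (delay * 2).min(max_delay);
            }
        }
    }
}

#[tokio::main]
async fn main() {
    // Hypothetical connection string, for illustration only.
    let client = connect_with_retry("host=localhost user=graph dbname=graph").await;
    let _ = client; // ... use `client` for queries ...
}
```

The point of the sketch is that only the delay is capped, never the number of attempts, so a one-minute database outage would resolve itself without operator intervention.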
Expectation
We have an alerting channel that fires when the `eth_rpc_status{provider="xxx"}` metric is not equal to 0, so we received the alerts immediately, but we expected them to resolve automatically a few minutes after the database upgrade finished.
It would be ideal if the liveness and readiness probes could detect this kind of issue so the pod is recreated automatically.
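To illustrate the probe idea, here is a minimal sketch of a DB-aware liveness endpoint (assuming `axum` and `tokio-postgres`; the `/health` path, port, and connection string are hypothetical, and this is not graph-node's existing health endpoint):

```rust
// Hypothetical liveness endpoint: returns 200 only when a trivial query
// against Postgres succeeds, and 503 once the database connection is gone.
use std::sync::Arc;

use axum::{extract::State, http::StatusCode, routing::get, Router};
use tokio_postgres::{Client, NoTls};

async fn health(State(client): State<Arc<Client>>) -> StatusCode {
    match client.simple_query("SELECT 1").await {
        Ok(_) => StatusCode::OK,
        Err(_) => StatusCode::SERVICE_UNAVAILABLE,
    }
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let (client, connection) =
        tokio_postgres::connect("host=localhost user=graph dbname=graph", NoTls).await?;
    tokio::spawn(async move {
        if let Err(e) = connection.await {
            eprintln!("connection error: {e}");
        }
    });

    let app = Router::new()
        .route("/health", get(health))
        .with_state(Arc::new(client));

    let listener = tokio::net::TcpListener::bind("0.0.0.0:8080").await?;
    axum::serve(listener, app).await?;
    Ok(())
}
```

A Kubernetes livenessProbe pointed at this endpoint would then restart the pod once the node is stuck without a working database connection.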