graph-node: Probe doesn't detect provider failure #417

Open
josedev-union opened this issue Jul 12, 2023 · 0 comments


Issue

Graph node went down after its database server was upgraded. During the upgrade, graph node lost its database connection for about a minute.
I was able to collect the following error logs from graph node.

Jul 12 05:58:20.416 CRIT Error receiving message, error: db error: FATAL: terminating connection due to administrator command, channel: chain_head_updates, component: ChainHeadUpdateListener > NotificationListener
Jul 12 05:58:20.441 CRIT Error receiving message, error: db error: FATAL: terminating connection due to administrator command, channel: store_events, component: NotificationListener
Jul 12 05:58:20.623 ERRO Failed to connect notification listener: db error: FATAL: the database system is shutting down, retry_delay_s: 1, attempt: 0, channel: store_events, component: NotificationListener
Jul 12 05:58:20.639 ERRO Failed to connect notification listener: db error: FATAL: the database system is shutting down, retry_delay_s: 1, attempt: 0, channel: chain_head_updates, component: ChainHeadUpdateListener > NotificationListener
Jul 12 05:58:21.634 ERRO Failed to connect notification listener: error connecting to server: Connection refused (os error 111), retry_delay_s: 2, attempt: 1, channel: store_events, component: NotificationListener
Jul 12 05:58:21.647 ERRO Failed to connect notification listener: error connecting to server: Connection refused (os error 111), retry_delay_s: 2, attempt: 1, channel: chain_head_updates, component: ChainHeadUpdateListener > NotificationListener
Jul 12 05:59:02.699 ERRO Postgres connection error, error: terminating connection due to administrator command, pool: main, shard: primary, component: ConnectionPool

The problem is that graph node did not reconnect to the database and remained in a failed state for over 10 minutes, so we had to restart it manually.

Expectation

We have an alerting channel that fires an alert when the eth_rpc_status{provider="xxx"} metric is not equal to 0. We received the alerts immediately, but expected them to resolve automatically within a few minutes of the database upgrade finishing.
It would be ideal if the liveness probe and readiness probe could detect this kind of failure so that the pod is recreated automatically.
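
To make the request concrete, here is a rough sketch of the check a probe could run, reusing the same eth_rpc_status condition our alert uses. It is only an illustration, not how the chart or graph node works today: the function names `failing_providers` and `probe_exit_code` are hypothetical, and it assumes the probe has already fetched the Prometheus metrics text from graph node.

```python
import re

# Matches one sample line such as: eth_rpc_status{provider="xxx"} 1
_SAMPLE = re.compile(r'^eth_rpc_status\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)')
_PROVIDER = re.compile(r'provider="(?P<name>[^"]*)"')


def failing_providers(metrics_text):
    """Return the providers whose eth_rpc_status sample is not 0.

    Hypothetical helper: takes Prometheus text-format metrics as a string.
    """
    failing = []
    for line in metrics_text.splitlines():
        match = _SAMPLE.match(line.strip())
        if match is None:
            continue  # comments, blank lines, and other metrics
        if float(match.group("value")) != 0:
            provider = _PROVIDER.search(match.group("labels"))
            failing.append(provider.group("name") if provider else "<unknown>")
    return failing


def probe_exit_code(metrics_text):
    """Exit code for an exec-style probe: 0 when healthy, 1 otherwise.

    Assumption: any non-zero eth_rpc_status means the pod should be
    treated as unhealthy, the same condition our alert fires on.
    Text with no eth_rpc_status samples at all is treated as healthy here.
    """
    return 1 if failing_providers(metrics_text) else 0
```

An exec probe would fetch the metrics text, pass it to `probe_exit_code`, and exit with the result. A liveness probe should probably require several consecutive failures before restarting the pod, so that a short outage on the provider side does not trigger a restart by itself.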
