Reconsider health-check handling for the relayer #602

iansuvak · 2024-12-18T16:55:38Z

Context and scope
Currently the health-check endpoint fatals when the service is unhealthy. Since this is an external endpoint intended to be used by Kubernetes or another monitor, we should let the caller decide what to do and how long to wait when the service is unhealthy.

As part of this, we should make sure that every place where we set the state to unhealthy is actually recoverable. Where it is not, we can fatal through a different mechanism than the external health-check endpoint.
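To illustrate the proposal, here is a minimal sketch of a health-check handler that reports an unhealthy state with an HTTP 503 instead of fataling, so that the monitor decides what to do. The `healthState` type, the `/health` path, and the `probe` helper are assumptions made for this example and are not taken from the relayer codebase.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// healthState is a hypothetical holder for the service's health flag.
// Its zero value reports unhealthy.
type healthState struct {
	healthy atomic.Bool
}

// handler reports the current state and never terminates the process:
// 200 when healthy, 503 when unhealthy. The caller chooses how to react.
func (h *healthState) handler(w http.ResponseWriter, _ *http.Request) {
	if h.healthy.Load() {
		w.WriteHeader(http.StatusOK)
		fmt.Fprintln(w, "ok")
		return
	}
	w.WriteHeader(http.StatusServiceUnavailable)
	fmt.Fprintln(w, "unhealthy")
}

// probe sends an in-memory GET to the handler and returns the status code.
func probe(h *healthState) int {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/health", nil)
	h.handler(rec, req)
	return rec.Code
}

func main() {
	state := &healthState{}

	state.healthy.Store(true)
	fmt.Println(probe(state)) // healthy

	state.healthy.Store(false)
	fmt.Println(probe(state)) // unhealthy, but the process keeps running

	state.healthy.Store(true)
	fmt.Println(probe(state)) // recovered
}
```

With this shape, a truly unrecoverable error would be handled somewhere other than the endpoint, for example by exiting from the component that detected it.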

Discussion and alternatives

Because of the fatal behavior, #579 changed the handling of network exceptions to attempt reconnecting up to a maximum number of tries before the service marks itself unhealthy. If we go ahead with removing fatals, we should revert this so that the service marks itself unhealthy as soon as the connection fails and keeps attempting to reconnect with backoff. The caller can then decide when to kill or restart the service.
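The behavior described above could look roughly like the following sketch: the health flag flips to unhealthy on the first failed attempt, retries continue with a capped exponential backoff, and the flag flips back once a connection succeeds. All names here (`healthFlag`, `reconnectLoop`, `errConnRefused`) and the delay values are hypothetical, and this is not the code from #579.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
	"time"
)

// errConnRefused is a stand-in error for a failed network connection.
var errConnRefused = errors.New("connection refused")

// healthFlag is a hypothetical health indicator read by the health endpoint.
type healthFlag struct {
	ok atomic.Bool
}

func (h *healthFlag) set(v bool)      { h.ok.Store(v) }
func (h *healthFlag) isHealthy() bool { return h.ok.Load() }

// reconnectLoop retries connect until it succeeds. It marks the service
// unhealthy immediately after the first failure instead of waiting for a
// maximum number of tries, and doubles the delay between attempts up to
// maxDelay. It returns the number of attempts made.
func reconnectLoop(connect func() error, h *healthFlag, baseDelay, maxDelay time.Duration) int {
	delay := baseDelay
	attempts := 0
	for {
		attempts++
		if err := connect(); err == nil {
			h.set(true)
			return attempts
		}
		h.set(false) // visible to the monitor right away
		time.Sleep(delay)
		delay *= 2
		if delay > maxDelay {
			delay = maxDelay
		}
	}
}

func main() {
	h := &healthFlag{}
	h.set(true)

	// Simulated connection that fails three times, then succeeds.
	calls := 0
	connect := func() error {
		calls++
		if calls < 4 {
			return errConnRefused
		}
		return nil
	}

	n := reconnectLoop(connect, h, time.Millisecond, 8*time.Millisecond)
	fmt.Println(n, h.isHealthy())
}
```

The loop never gives up on its own. How long an unhealthy service is tolerated is left to the caller, for example through the failure threshold of a Kubernetes liveness probe.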

Open questions

iansuvak added the enhancement (New feature or request) label on Dec 18, 2024
