
Reconsider health-check handling for the relayer #602

Open
iansuvak opened this issue Dec 18, 2024 · 0 comments
Labels
enhancement New feature or request


@iansuvak
Contributor

Context and scope
Currently, the health-check endpoint fatals (terminates the process) when the service is unhealthy. Since this is an external endpoint intended for Kubernetes or another monitor, we should let the caller decide what to do, and how long to wait, when the service is unhealthy.
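A minimal, language-agnostic sketch of the proposed behavior, written in Python: the endpoint only reports state (200 when healthy, 503 when not) and never exits the process. `HealthState`, `health_status`, `make_handler`, `serve_health`, the `/health` path, and port 8080 are all hypothetical names chosen for illustration, not the relayer's actual code.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class HealthState:
    """Thread-safe health flag shared by workers and the health endpoint (hypothetical)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._healthy = True

    def set_healthy(self, healthy: bool) -> None:
        with self._lock:
            self._healthy = healthy

    def is_healthy(self) -> bool:
        with self._lock:
            return self._healthy


def health_status(state: HealthState) -> tuple[int, bytes]:
    # Report the current state only; never terminate the process from here.
    if state.is_healthy():
        return 200, b"ok\n"
    return 503, b"unhealthy\n"


def make_handler(state: HealthState):
    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self) -> None:
            if self.path != "/health":
                self.send_error(404)
                return
            code, body = health_status(state)
            self.send_response(code)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return HealthHandler


def serve_health(state: HealthState, port: int = 8080) -> ThreadingHTTPServer:
    """Serve GET /health in a background thread; the process keeps running
    regardless of what the endpoint reports."""
    server = ThreadingHTTPServer(("", port), make_handler(state))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

With this shape, the monitor owns the restart policy: for example, a Kubernetes liveness probe treats a 503 as a failure, and its `periodSeconds` and `failureThreshold` settings decide how long to tolerate failures before restarting the container.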

As part of this, we should make sure that every place where we set the state to unhealthy is actually recoverable. If one is not, we can fatal through a different mechanism than the external health-check endpoint.
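One possible way to keep the two paths separate, sketched under assumptions: recoverable errors only flip the health flag, while errors explicitly marked unrecoverable trigger an internal shutdown signal that the main routine waits on. `UnrecoverableError`, `ErrorRouter`, and `run_until_fatal` are hypothetical, and `state` is assumed to be any object exposing a `set_healthy(bool)` method.

```python
import sys
import threading
from typing import Optional


class UnrecoverableError(Exception):
    """Hypothetical marker for failures the service cannot recover from."""


class ErrorRouter:
    def __init__(self, state) -> None:
        self.state = state  # assumed: any object with set_healthy(bool)
        self.fatal = threading.Event()
        self.fatal_error: Optional[BaseException] = None

    def report(self, err: BaseException) -> None:
        if isinstance(err, UnrecoverableError):
            # Internal fatal path: independent of the external health endpoint.
            self.fatal_error = err
            self.fatal.set()
        else:
            # Recoverable: just report unhealthy and let the caller decide.
            self.state.set_healthy(False)

    def recovered(self) -> None:
        self.state.set_healthy(True)


def run_until_fatal(router: ErrorRouter) -> int:
    # Block until an unrecoverable error is reported, then return an exit code.
    router.fatal.wait()
    print(f"fatal: {router.fatal_error}", file=sys.stderr)
    return 1
```

A `main` function could pass the result of `run_until_fatal` to `sys.exit`, so the process only exits on its own for errors that were deliberately classified as unrecoverable.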

Discussion and alternatives

Because of the fatal behavior, #579 changed the handling of network exceptions to attempt reconnecting up to a maximum number of tries before marking the service unhealthy. If we go ahead with removing the fatals, we should revert this so that the service marks itself unhealthy as soon as it becomes unhealthy and keeps attempting to reconnect with backoff. The caller can then decide when to kill or restart the service.
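A minimal sketch of the proposed reconnect behavior, assuming network failures surface as `OSError` (which includes `ConnectionError`): the service is marked unhealthy immediately, then reconnection is retried with capped exponential backoff and full jitter, with no maximum retry count. `reconnect_with_backoff` and its parameters (`base_delay`, `max_delay`, `should_stop`) are hypothetical; `sleep` is injectable only so the logic can be exercised without real delays.

```python
import random
import time
from typing import Callable


def reconnect_with_backoff(
    connect: Callable[[], None],
    set_healthy: Callable[[bool], None],
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    should_stop: Callable[[], bool] = lambda: False,
) -> bool:
    """Mark unhealthy right away, then retry connect() until it succeeds
    or should_stop() returns True. There is no retry cap: the external
    monitor decides when to restart the service."""
    set_healthy(False)
    delay = base_delay
    while not should_stop():
        try:
            connect()
        except OSError:
            # Full jitter: sleep a random time up to the current delay,
            # then double the delay, capped at max_delay.
            sleep(random.uniform(0, delay))
            delay = min(max_delay, delay * 2)
            continue
        set_healthy(True)
        return True
    return False
```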

Open questions

iansuvak added the enhancement (New feature or request) label on Dec 18, 2024