Retry renewal of SSL certificates #338

Open
jdno opened this issue Aug 7, 2023 · 1 comment

Comments


jdno commented Aug 7, 2023

The renew-ssl-certs.service fails periodically, which triggers the following alert in Grafana:

The systemd unit renew-ssl-certs.service on the <server>:9100 instance failed to execute.

The fix for this alert is to restart the service manually:

sudo systemctl restart renew-ssl-certs.service

We should investigate why the service fails in the first place and make it retry automatically.

jdno added this to infra-team Aug 7, 2023
github-project-automation bot moved this to Backlog in infra-team Aug 7, 2023
jdno moved this from Backlog to Ready in infra-team Aug 9, 2023
Mark-Simulacrum commented

Relevant logs:

May 23 00:00:01 monitoring.infra.rust-lang.org systemd[1]: Starting Renew SSL certificates...
May 23 00:00:03 monitoring.infra.rust-lang.org renew-ssl-certs[2866]: 2022/05/23 00:00:03 [INFO] [grafana.rust-lang.org] acme: Trying renewal with 647 hours remaining
May 23 00:00:03 monitoring.infra.rust-lang.org renew-ssl-certs[2866]: 2022/05/23 00:00:03 [INFO] [grafana.rust-lang.org] acme: Obtaining bundled SAN certificate
May 23 00:00:09 monitoring.infra.rust-lang.org renew-ssl-certs[2866]: 2022/05/23 00:00:09 acme: error: 500 :: POST :: https://acme-v02.api.letsencrypt.org/acme/new-order :: urn:ietf:params:acme:error:serverInternal :: Error creating new order, url:
May 23 00:00:09 monitoring.infra.rust-lang.org systemd[1]: renew-ssl-certs.service: Main process exited, code=exited, status=1/FAILURE
May 23 00:00:09 monitoring.infra.rust-lang.org systemd[1]: renew-ssl-certs.service: Failed with result 'exit-code'.
May 23 00:00:09 monitoring.infra.rust-lang.org systemd[1]: Failed to start Renew SSL certificates.

My guess is this is an upstream spurious failure of some kind -- presumably one-off and resolvable by retrying. Let's Encrypt doesn't note any problems on their status page at this time:

[Image: screenshot of the Let's Encrypt status page]

So my best guess is a low level of flakiness in their availability; we can likely fix this by retrying on our side, either at the systemd layer or within the executed script.
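For the "within the executed script" option, a minimal sketch of a retry helper with exponential backoff might look like the following. The function name `retry`, its argument order, and the attempt and delay values are all assumptions made for illustration; nothing here is taken from the actual renewal script, whose contents are not shown in this issue.

```shell
#!/usr/bin/env bash
# Hypothetical retry helper; the name, arguments, and defaults are
# illustrative assumptions, not part of the existing renewal script.

# Usage: retry MAX_ATTEMPTS INITIAL_DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached, doubling the
# delay after each failed attempt. Returns the exit status of the last attempt.
retry() {
    local max_attempts=$1
    local delay=$2
    shift 2
    local attempt=1
    local status=0

    while true; do
        # On success, stop immediately; on failure, $? holds the command's status.
        "$@" && return 0
        status=$?

        if (( attempt >= max_attempts )); then
            echo "retry: giving up after ${attempt} attempts (exit ${status})" >&2
            return "$status"
        fi

        echo "retry: attempt ${attempt} failed (exit ${status}), sleeping ${delay}s" >&2
        sleep "$delay"
        attempt=$(( attempt + 1 ))
        delay=$(( delay * 2 ))
    done
}
```

The script would then wrap whatever command performs the renewal, for example `retry 5 30 <renewal command>`, so that a one-off 500 response from the ACME server is retried a few times before the unit is marked as failed. If every attempt fails, the helper still exits non-zero, so the existing Grafana alert would keep firing for persistent problems.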
