Retry renewal of SSL certificates #338

jdno · 2023-08-07T10:37:06Z

The renew-ssl-certs.service fails periodically, which triggers the following alert in Grafana:

The systemd unit renew-ssl-certs.service on the <server>:9100 instance failed to execute.

The fix for this alert is to restart the service manually:

sudo systemctl restart renew-ssl-certs.service

We should investigate the reason why the service fails in the first place and retry automatically.

The text was updated successfully, but these errors were encountered:

Mark-Simulacrum · 2023-09-04T22:37:19Z

Relevant logs:

May 23 00:00:01 monitoring.infra.rust-lang.org systemd[1]: Starting Renew SSL certificates...
May 23 00:00:03 monitoring.infra.rust-lang.org renew-ssl-certs[2866]: 2022/05/23 00:00:03 [INFO] [grafana.rust-lang.org] acme: Trying renewal with 647 hours remaining
May 23 00:00:03 monitoring.infra.rust-lang.org renew-ssl-certs[2866]: 2022/05/23 00:00:03 [INFO] [grafana.rust-lang.org] acme: Obtaining bundled SAN certificate
May 23 00:00:09 monitoring.infra.rust-lang.org renew-ssl-certs[2866]: 2022/05/23 00:00:09 acme: error: 500 :: POST :: https://acme-v02.api.letsencrypt.org/acme/new-order :: urn:ietf:params:acme:error:serverInternal :: Error creating new order, url:
May 23 00:00:09 monitoring.infra.rust-lang.org systemd[1]: renew-ssl-certs.service: Main process exited, code=exited, status=1/FAILURE
May 23 00:00:09 monitoring.infra.rust-lang.org systemd[1]: renew-ssl-certs.service: Failed with result 'exit-code'.
May 23 00:00:09 monitoring.infra.rust-lang.org systemd[1]: Failed to start Renew SSL certificates.

My guess is this is an upstream spurious failure of some kind -- presumably one-off and resolvable by retrying. Let's Encrypt doesn't note any problems on their status page at this time:

So my best guess is low levels of fuzz in their availability; we likely can fix this by retrying on our side at the systemd layer or within the executed script.

jdno added this to infra-team Aug 7, 2023

github-project-automation bot moved this to Backlog in infra-team Aug 7, 2023

jdno moved this from Backlog to Ready in infra-team Aug 9, 2023

meysam81 mentioned this issue Sep 23, 2023

fix: retry failed ssl renewal with jittered backoff 🐛 #349

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Retry renewal of SSL certificates #338

Retry renewal of SSL certificates #338

jdno commented Aug 7, 2023

Mark-Simulacrum commented Sep 4, 2023

Retry renewal of SSL certificates #338

Retry renewal of SSL certificates #338

Comments

jdno commented Aug 7, 2023

Mark-Simulacrum commented Sep 4, 2023