@sheurich
Contributor

@sheurich sheurich commented Oct 10, 2025

Summary

Replace deprecated Temporary() with Timeout() in DNS retry logic per golang/go#45729.

Change

  • Check net.Error.Timeout() instead of deprecated Temporary()
  • Only retry timeout errors, not all transient network errors

@sheurich sheurich requested a review from a team as a code owner October 10, 2025 00:28
@beautifulentropy
Member

Some context we should consider when reviewing this PR, courtesy of @pgporada:

bdns/dns.go Outdated
Comment on lines 273 to 274
// Retry all errors up to maxTries limit for maximum resilience.
isRetryable := true
Contributor

Without commenting on whether we should retry all errors (I'm still contemplating), I think that if we do go this direction, we should take this minor simplification a step further: I'd remove the isRetryable boolean altogether and simplify the conditionals on lines 276 and 283.

@jsha
Contributor

jsha commented Oct 14, 2025

Hi @sheurich! Thanks for the contribution. I'm glad to be getting rid of the deprecated call to .Temporary().

As a side note, can I ask if you are using AI to generate your PR descriptions? If so, could you provide the prompt you used, or how you generated it (i.e. what are the inputs)? In particular I'd love to discuss the Rationale but I first want to make sure those are your words and not an AI's.

My takeaway from golang/go#45729 is that Temporary is a superset of Timeout, and:

> The cases where Temporary does not imply Timeout are surprising and not particularly useful.

So a smaller change here would be to simply replace Temporary with Timeout, which is not deprecated. I'd prefer that change, in part because it avoids masking surprising behavior behind retries.

As an example: in Unbound 1.20, a few new options were introduced, including discard-timeout (defaults to 1900 ms): https://unbound.docs.nlnetlabs.nl/en/latest/manpages/unbound.conf.html#unbound-conf-discard-timeout. From experimentation, I believe the default discard-timeout results in unexpected EOF errors from our DoH client when a resolution takes more than 1.9 seconds. We'll be disabling that, at least for now (discard-timeout: 0).

That unexpected EOF is not a Boulder-side timeout, but arguably should be retried in this scenario since we know it represents an Unbound-side timeout. But I'd rather implement that specifically within our dohExchanger instead of retrying all errors unconditionally.

As an example of a non-retryable error: a certificate validation failure for one out of a pool of DoH servers. If we retry all errors, we might never notice that one server has a broken config, because we would automatically retry on a different server [1]. Yes, this would be more resilient, but it would also be masking brokenness, possibly until the brokenness gets worse.

So, overall, I'm in favor of switching this to use .Timeout(), and possibly, as a separate PR, handling unexpected EOF (io.ErrUnexpectedEOF) in dohExchanger and turning it into something that implements .Timeout() { return true }.

Footnotes

  1. Note that we might switch to a 1:1 VA<->Unbound relationship, which would change this example, but the spirit holds.

@sheurich
Contributor Author

sheurich commented Oct 15, 2025

> Hi @sheurich! Thanks for the contribution. I'm glad to be getting rid of the deprecated call to .Temporary().
>
> As a side note, can I ask if you are using AI to generate your PR descriptions? If so, could you provide the prompt you used, or how you generated it (i.e. what are the inputs)? In particular I'd love to discuss the Rationale but I first want to make sure those are your words and not an AI's.

The description was generated from the results of an investigation into DoH errors that were not classified as retryable based on Temporary() output. Discovery of the Go 1.18 Temporary() deprecation came out of this work.

I use Claude Code (and to a small extent Roo Code) as a coding assistant. For this PR, there wasn't a traditional "prompt". My workflow was:

  1. I made the code change to address the deprecated Temporary() call
  2. Asked Claude to generate a PR description from the diff
  3. Claude expanded on the rationale based on the deprecation notices
  • These are not valid rationales for this change.

> My takeaway from golang/go#45729 is that Temporary is a superset of Timeout, and:
>
> > The cases where Temporary does not imply Timeout are surprising and not particularly useful.
>
> So a smaller change here would be to simply replace Temporary with Timeout, which is not deprecated. I'd prefer that change, in part because it avoids masking surprising behavior behind retries.
>
> As an example: in Unbound 1.20, a few new options were introduced, including discard-timeout (defaults to 1900 ms): https://unbound.docs.nlnetlabs.nl/en/latest/manpages/unbound.conf.html#unbound-conf-discard-timeout. From experimentation, I believe the default discard-timeout results in unexpected EOF errors from our DoH client when a resolution takes more than 1.9 seconds. We'll be disabling that, at least for now (discard-timeout: 0).

We were seeing 'unexpected EOF' errors in VA DoH requests. Your comment about Unbound’s discard-timeout was incredibly helpful and led us to discover that our pre-1.20 Unbound package has this feature backported for the DNSBomb CVE and enabled by default. Disabling the discard-timeout eliminated these errors. Thank you!

> That unexpected EOF is not a Boulder-side timeout, but arguably should be retried in this scenario since we know it represents an Unbound-side timeout. But I'd rather implement that specifically within our dohExchanger instead of retrying all errors unconditionally.

This makes sense. I have an alternate proposal that trades some additional configuration complexity to allow for fine-grained operator control of DNS retry behavior: #8443

I agree that replacing Temporary() with Timeout() is the correct minimal fix. The deprecation of Temporary() indicates we need better visibility into what's retryable, but your point about not masking errors like cert validation failures is good. Retry-all could hide critical failures (although the logging should still note this).

> As an example of a non-retryable error: a certificate validation failure for one out of a pool of DoH servers. If we retry all errors, we might never notice that one server has a broken config, because we would automatically retry on a different server [1]. Yes, this would be more resilient, but it would also be masking brokenness, possibly until the brokenness gets worse.
>
> So, overall, I'm in favor of switching this to use .Timeout(), and possibly, as a separate PR, handling unexpected EOF (io.ErrUnexpectedEOF) in dohExchanger and turning it into something that implements .Timeout() { return true }.

I can change this right away, and then either move forward with something like #8443, if that makes sense, or with something closer to your suggestion.

I'll update this PR to the simple Temporary() → Timeout() replacement as you suggested.

Thanks!

The net.Error.Temporary() method has been deprecated since Go 1.18.
Replace with Timeout() to check specifically for timeout errors.
This also switches from *url.Error to net.Error because Timeout()
is defined by the net.Error interface.

Reflects the change from net.Error.Temporary() to
net.Error.Timeout() for DNS retry logic.

Test changes:
- Add testTimeoutError type to mirror tempError
- Rename test variables: isTempErr → isTimeoutErr, nonTempErr → nonTimeoutErr
- Update test case names and comments to reference "timeout"
- Fix test expectations: timeout errors return "query timed out" message
- Update mockTempURLError → mockTimeoutURLError

All tests pass with the new timeout-based retry semantics.
@sheurich sheurich force-pushed the dns-retry-all-errors branch from a873eb7 to 867033e Compare October 17, 2025 18:10
@sheurich sheurich requested a review from aarongable October 17, 2025 18:20
@sheurich
Contributor Author

@jsha I appreciate the visibility vs. resilience trade-off discussion. While I agree with moving forward with Timeout() as the minimal fix, I wanted to share my operational perspective:

Even mature deployment practices can't eliminate all transient connection errors. Network path transitions and process/container orchestration events routinely cause brief connection-refused/reset errors that are genuinely transient and not misconfigurations. In production CA operations, failing customer certificate issuance for these routine events feels like the wrong trade-off.

I think retry-all (or at least retry-connection-errors) with proper logging/alerting provides the best of both worlds: resilience for customers and visibility for operators. That said, I respect the fail-fast philosophy and understand the operational debt concerns. I'm happy to stick with the Timeout() change as implemented here and we can revisit retry strategy in future PRs if operational experience suggests it's needed.

Contributor

@jsha jsha left a comment

You make good points about transient errors in the connection between VA and Unbound. Can you add some detail about what your setup is like, what specific errors you're getting, and how often?

Comment on lines +275 to +278
// Check if the error is a timeout error. Network errors
// that can timeout implement the net.Error interface.
var netErr net.Error
isRetryable = errors.As(r.err, &netErr) && netErr.Timeout()
Contributor

Can you explain why this became a net.Error? *url.Error also offers .Timeout().

Contributor Author

They are functionally identical; http.Client.Do() returns *url.Error, which implements net.Error. Both errors.As checks match the same object and call the same .Timeout() method.

I changed to net.Error because we only need the .Timeout() method, not *url.Error-specific fields like .Op or .URL.

@sheurich
Contributor Author

> Can you add some detail about what your setup is like, what specific errors you're getting, and how often?

Unbound and boulder-va run in separate containers, communicating over Docker networking.

Current error rates (% of validations):

  • "Client.Timeout exceeded": 0-1%
  • "connect: connection refused": 1%
  • "unexpected EOF": ~0.1% (5-30% before the Unbound fix)

@jsha jsha requested a review from a team October 23, 2025 23:46