Remove deprecated Temporary() usage in DNS retry logic #8441

sheurich · 2025-10-10T00:28:24Z

Summary

Replace deprecated Temporary() with Timeout() in DNS retry logic per golang/go#45729.

Change

Check net.Error.Timeout() instead of deprecated Temporary()
Only retry timeout errors, not all transient network errors

beautifulentropy · 2025-10-10T18:05:14Z

Some context we should consider when reviewing this PR, courtesy of @pgporada:

proposal: net: add ErrRetryableAcceptError golang/go#66252
net/url: (*Error).Timeout and (*Error).Temporary methods should use errors.As instead of type assertion golang/go#60578
Remove use of deprecated net.Error.Temporary miekg/dns#1589
- and the reversion in Revert "Remove use of deprecated net.Error.Temporary (#1589)" miekg/dns#1594

aarongable · 2025-10-10T22:27:17Z

bdns/dns.go

+				// Retry all errors up to maxTries limit for maximum resilience.
+				isRetryable := true


Without commenting on whether we should retry all errors (I'm still contemplating), I think that if we do go this direction, we should take this minor simplification a step further: I'd remove the isRetryable boolean altogether and simplify the conditionals on lines 276 and 283.

jsha · 2025-10-14T23:47:53Z

Hi @sheurich! Thanks for the contribution. I'm glad to be getting rid of the deprecated call to .Temporary().

As a side note, can I ask if you are using AI to generate your PR descriptions? If so, could you provide the prompt you used, or how you generated it (i.e. what are the inputs)? In particular I'd love to discuss the Rationale but I first want to make sure those are your words and not an AI's.

My takeaway from golang/go#45729 is that Temporary is a superset of Timeout, and:

The cases where Temporary does not imply Timeout are surprising and not particularly useful.

So a smaller change here would be to simply replace Temporary with Timeout, which is not deprecated. I'd prefer that change, in part because it avoids masking surprising behavior behind retries.

As an example: in Unbound 1.20, a few new options were introduced, including discard-timeout (defaults to 1900 ms): https://unbound.docs.nlnetlabs.nl/en/latest/manpages/unbound.conf.html#unbound-conf-discard-timeout. From experimentation, I believe the default discard-timeout results in unexpected EOF errors from our DoH client when a resolution takes more than 1.9 seconds. We'll be disabling that, at least for now (discard-timeout: 0).

That unexpected EOF is not a Boulder-side timeout, but arguably should be retried in this scenario since we know it represents an Unbound-side timeout. But I'd rather implement that specifically within our dohExchanger instead of retrying all errors unconditionally.

As an example of a non-retryable error: a certificate validation failure for one out of a pool of DoH servers. If we retry all errors, we might never notice that one server has a broken config, because we would automatically retry on a different server¹. Yes, this would be more resilient, but it would also be masking brokenness, possibly until the brokenness gets worse.

So, overall, I'm in favor of switching this to use .Timeout(), and possibly, as a separate PR, handling unexpected EOF (io.ErrUnexpectedEOF) in dohExchanger and turning it into something that implements .Timeout() { return true }.

¹ Note that we might switch to a 1:1 VA<->Unbound relationship, which would change this example, but the spirit holds.

sheurich · 2025-10-15T00:29:40Z

Hi @sheurich! Thanks for the contribution. I'm glad to be getting rid of the deprecated call to .Temporary().

As a side note, can I ask if you are using AI to generate your PR descriptions? If so, could you provide the prompt you used, or how you generated it (i.e. what are the inputs)? In particular I'd love to discuss the Rationale but I first want to make sure those are your words and not an AI's.

The description was generated from the results of investigation into DoH errors that were not classified as retryable based on Temporary() output. Discovery of the go 1.18 Temporary() deprecation came out of this work.

I use Claude Code (and to a small extent Roo Code) as a coding assistant. For this PR, there wasn't a traditional "prompt". My workflow was:

I made the code change to address the deprecated Temporary() call
Asked Claude to generate a PR description from the diff
Claude expanded on the rationale based on the deprecation notices

these are not valid rationale for this change

My takeaway from golang/go#45729 is that Temporary is a superset of Timeout, and:

The cases where Temporary does not imply Timeout are surprising and not particularly useful.

So a smaller change here would be to simply replace Temporary with Timeout, which is not deprecated. I'd prefer that change, in part because it avoids masking surprising behavior behind retries.

As an example: in Unbound 1.20, a few new options were introduced, including discard-timeout (defaults to 1900 ms): unbound.docs.nlnetlabs.nl/en/latest/manpages/unbound.conf.html#unbound-conf-discard-timeout. From experimentation, I believe the default discard-timeout results in unexpected EOF errors from our DoH client when a resolution takes more than 1.9 seconds. We'll be disabling that, at least for now (discard-timeout: 0).

We were seeing 'unexpected EOF' errors in VA DoH requests. Your comment about Unbound’s discard-timeout was incredibly helpful and led us to discover that our pre-1.20 Unbound package has this feature backported for the DNSBomb CVE and enabled by default. Disabling the discard-timeout eliminated these errors. Thank you!

That unexpected EOF is not a Boulder-side timeout, but arguably should be retried in this scenario since we know it represents an Unbound-side timeout. But I'd rather implement that specifically within our dohExchanger instead of retrying all errors unconditionally.

This makes sense. I have an alternate proposal that trades some additional configuration complexity to allow for fine-grained operator control of DNS retry behavior: #8443

I agree that replacing Temporary() with Timeout() is the correct minimal fix. The deprecation of Temporary() indicates we need better visibility into what's retryable, but your point about not masking errors like cert validation failures is good. Retry-all could hide critical failures (although the logging should still note this).

As an example of a non-retryable error: a certificate validation failure for one out of a pool of DoH servers. If we retry all errors, we might never notice that one server has a broken config, because we would automatically retry on a different server¹. Yes, this would be more resilient, but it would also be masking brokenness, possibly until the brokenness gets worse.

So, overall, I'm in favor of switching this to use .Timeout(), and possibly, as a separate PR, handling unexpected EOF (io.ErrUnexpectedEOF) in dohExchanger and turning it into something that implements .Timeout() { return true }.

I can change this right away, and then either move forward with something like #8443, if that makes sense, or with something closer to your suggestion.

I'll update this PR to the simple Temporary() → Timeout() replacement as you suggested.

Thanks!

The net.Error.Temporary() method has been deprecated since Go 1.18. Replace it with Timeout() to check specifically for timeout errors. This also switches from *url.Error to net.Error because Timeout() is defined by the net.Error interface.

Reflects the change from net.Error.Temporary() to net.Error.Timeout() for DNS retry logic.

Test changes:
- Add testTimeoutError type to mirror tempError
- Rename test variables: isTempErr → isTimeoutErr, nonTempErr → nonTimeoutErr
- Update test case names and comments to reference "timeout"
- Fix test expectations: timeout errors return "query timed out" message
- Update mockTempURLError → mockTimeoutURLError

All tests pass with the new timeout-based retry semantics.

sheurich · 2025-10-20T17:41:43Z

@jsha I appreciate the visibility vs. resilience trade-off discussion. While I agree with moving forward with Timeout() as the minimal fix, I wanted to share my operational perspective:

Even mature deployment practices can't eliminate all transient connection errors. Network path transitions and process/container orchestration events routinely cause brief connection-refused/reset errors that are genuinely transient and not misconfigurations. In production CA operations, failing customer certificate issuance for these routine events feels like the wrong trade-off.

I think retry-all (or at least retry-connection-errors) with proper logging/alerting provides the best of both worlds: resilience for customers and visibility for operators. That said, I respect the fail-fast philosophy and understand the operational debt concerns. I'm happy to stick with the Timeout() change as implemented here and we can revisit retry strategy in future PRs if operational experience suggests it's needed.

jsha

You make good points about transient errors in the connection between VA and Unbound. Can you add some detail about what your setup is like, what specific errors you're getting, and how often?

jsha · 2025-10-21T22:50:48Z

bdns/dns.go

+				// Check if the error is a timeout error. Network errors
+				// that can timeout implement the net.Error interface.
+				var netErr net.Error
+				isRetryable = errors.As(r.err, &netErr) && netErr.Timeout()


Can you explain why this became a net.Error? *url.Error also offers .Timeout().

sheurich

They are functionally identical; http.Client.Do() returns *url.Error, which implements net.Error. Both errors.As checks match the same object and call the same .Timeout() method.

I changed to net.Error because we only need the .Timeout() method, not *url.Error-specific fields like .Op or .URL.

sheurich · 2025-10-23T23:18:02Z

Can you add some detail about what your setup is like, what specific errors you're getting, and how often?

Unbound and boulder-va run in separate containers, communicating over Docker networking.

Current error rates (% of validations):

"Client.Timeout exceeded": 0-1%
"connect: connection refused": 1%
"unexpected EOF": ~0.1% (5-30% before the Unbound fix)

sheurich requested a review from a team as a code owner October 10, 2025 00:28

sheurich requested a review from beautifulentropy October 10, 2025 00:28

aarongable reviewed Oct 10, 2025

jsha mentioned this pull request Oct 16, 2025

Add configurable retry policy for DoH transport errors #8443

Closed

sheurich added 2 commits October 17, 2025 10:15

sheurich force-pushed the dns-retry-all-errors branch from a873eb7 to 867033e Compare October 17, 2025 18:10

sheurich requested a review from aarongable October 17, 2025 18:20

jsha requested changes Oct 21, 2025

jsha approved these changes Oct 23, 2025

jsha requested a review from a team October 23, 2025 23:46

beautifulentropy approved these changes Oct 24, 2025
