Skip to content

Conversation

DylanRussell
Copy link
Contributor

@DylanRussell DylanRussell commented May 28, 2025

Fixes open-telemetry/opentelemetry-specification#4513

I've tried to change the wording such that it's acceptable for SDKs to use gRPC retry config, which only respects the grpc-retry-pushback-ms header to handle all backoff / retry logic.

The advantage of this is that SDKs don't need to implement / maintain any backoff/retry logic and can hand it off to the official gRPC library.

@DylanRussell DylanRussell requested a review from a team May 28, 2025 19:15
Copy link
Member

@pellared pellared left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, LGTM

@pellared
Copy link
Member

pellared commented Jul 1, 2025

@DylanRussell, given #669 is merged I think you can update this PR 😉

@DylanRussell
Copy link
Contributor Author

Ack.. I've merged and updated the wording to be closer to what you recommended.. PTAL when you have a chance

@pellared pellared changed the title Propose updated language around gRPC error retry handling Support grpc-retry-pushback-ms Jul 16, 2025
pellared
pellared previously approved these changes Jul 16, 2025
@DylanRussell
Copy link
Contributor Author

Linter passing now

@pellared
Copy link
Member

@open-telemetry/spec-sponsors, @open-telemetry/technical-committee, PTAL

[RetryInfo](https://github.com/googleapis/googleapis/blob/6a8c7914d1b79bd832b5157a09a9332e8cbd16d4/google/rpc/error_details.proto#L40).
Here is a snippet of sample Go code to illustrate:
[RetryInfo](https://github.com/googleapis/googleapis/blob/6a8c7914d1b79bd832b5157a09a9332e8cbd16d4/google/rpc/error_details.proto#L40)
or via Trailer [with the gRPC metadata key `grpc-retry-pushback-ms`](https://github.com/grpc/proposal/blob/master/A6-client-retries.md#pushback).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please link to a permanent link instead of one that can break.


The client SHOULD honor at least one of these backpressure mechanisms.

In order to have the backpressure respected, the server SHOULD implement both mechanisms.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean the server should be responding with both mechanisms at the same time? The example below only has one mechanism being used at a time.

This seems odd to me. It is usually the client that is recommended to handle multiple versions of a response and the server is expected to support at least one.

Is there a preference for which mechanism the server should support?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there is a preference. About half the languages are using gRPC retry config which supports only the pushback-ms header, and the other half is rolling their own retry logic and only support the RetryInfo..

If a client uses gRPC retry config, it is offloading all the retry logic to gRPC internal library, so it can't also support RetryInfo.

@pellared pellared dismissed their stale review August 4, 2025 15:21

Dismissing because of

It is usually the client that is recommended to handle multiple versions of a response and the server is expected to support at least one.

More: https://en.wikipedia.org/wiki/Robustness_principle

Copy link
Member

@tigrannajaryan tigrannajaryan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@DylanRussell can you please also attach a prototype implementation that shows how this works? You can take existing implementation in one of the languages or in the Collector and fork/modify it to demonstrate the change. Collectors otlp receiver/exporter fork would be ideal since it shows both the client and server side.


```go
// Do this on the server side.
trailer := metadata.Pairs("grpc-retry-pushback-ms", "5000")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this expected to be done only if the server wants to pushback? It is not clear from the sample if this is some sort of permanent option that the server must always set or only set when the server is overloaded and needs to pushback the clients.

[RetryInfo](https://github.com/googleapis/googleapis/blob/6a8c7914d1b79bd832b5157a09a9332e8cbd16d4/google/rpc/error_details.proto#L40)
or via Trailer [with the gRPC metadata key `grpc-retry-pushback-ms`](https://github.com/grpc/proposal/blob/master/A6-client-retries.md#pushback).

The client SHOULD honor at least one of these backpressure mechanisms.
Copy link
Member

@tigrannajaryan tigrannajaryan Aug 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens with the clients that honor the new mechanism that attempt to work with a server that honors only the old mechanism (existing implementations)? They would be incompatible, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Below it's recommended below for the server to implement both mechanisms, then the client is guaranteed to respect it..

Right now clients are split between the 2 mechanisms, that's the current state, the server needs to do both if it wants a guarantee.

Alternatively lets just recommend the retry - config mechanism in all cases for everyone ? Since the clients are already split, it's likely that server's timeouts aren't being respected in all cases. If a retry-timeout isn't respected it's not that bad anyway, as all the clients have their own back off timeout logic already.

@tigrannajaryan
Copy link
Member

I think we need a description of how the new and old implementations should interoperate. Explain the operation in the following 3 scenarios:

  • Old client implementing the spec before this change connects to a new server implementing the spec after this change.
  • New client connects to old server.
  • New client connects to new server.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Question about gRPC error handling specification
4 participants