GRPCRoute timeout - GEP-3139 #3219

xtineskim · 2024-07-26T13:42:01Z

What type of PR is this?

/kind gep

What this PR does / why we need it:

Staying consistent with the HTTPRoute timeout feature, opening a GEP to allow for GRPCRoute timeouts

Which issue(s) this PR fixes:

Fixes #3139

Does this PR introduce a user-facing change?:

k8s-ci-robot · 2024-07-26T13:42:04Z

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot · 2024-07-26T13:42:07Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: xtineskim
Once this PR has been reviewed and has the lgtm label, please assign danwinship for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

geps/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

xtineskim · 2024-07-26T13:43:35Z

geps/gep-3139/index.md

+The timeout for a single request from the gateway to upstream. This field is optional Extended support.
+
+Disabling streaming RPC
+- `timeout.streamingRequest`


open to suggestions on this; I have received feedback that this sits weird

mlavacca · 2024-07-30

I think that having this field only set to zero for disabling streaming is a strange user experience. This is very related to the GEP goals: I think we should address bidirectional streaming as well so that such a field becomes meaningful.

xtineskim · 2024-07-30

agreed - darn, will likely not be in for the next release. But it makes sense, trying to define this field felt bizarre

mikemorris · 2024-07-30

Following this for any implications for timeouts/retries on HTTP streaming, which likewise I feel like we don't really have a good grasp on yet...

xtineskim · 2024-07-26T14:55:52Z

cc @robscott @arkodg

gnossen

Thanks for getting the conversation started on this, @xtineskim ! I know timeouts are a deeply desired feature for the gRPC community, so it will be great to see support for them roll out to the Gateway API.

Broadly speaking, I want to make sure that we focus on ways in which gRPC timeout semantics differ from REST and deliver a reasonable user experience based on those differences. GRPCRoute was introduced alongside HTTPRoute instead of building on top of it because there are meaningful ways in which these two protocols differ and I think timeout is one of those ways.

A major point is that the timeout semantics across all gRPC implementations are the following:

"The maximum duration for the peer to respond to a gRPC request. This timeout is relative to when the client application initiates the RPC or, in the case of a proxy, when the proxy first receives the stream. If the stream has not entered the closed state this long after the timer has started, the RPC MUST be terminated with gRPC status 4 (DEADLINE_EXCEEDED)."

If we deviate from this, we would need to change gRPC itself in order to support that semantic, including going through the gRFC process and implementing across the full matrix of supported languages. And all this for unclear user value.

What's more, if we use the semantic I suggest above, it will apply across all arities and the problem of specializing unary RPCs disappears.

Besides the semantics of how we time a timeout, I think we need to make sure that we take into account that gRPC itself will often be acting as the data plane for users of GRPCRoute, without any proxy at all. I'm happy to support proxy-based implementations as well, but we have to make sure that the proposal supports both.

Finally, I want to make sure that we lean into areas where the gRPC protocol has already differentiated itself from REST on timeouts. The "grpc-timeout" header and timeout propagation more generally is a great example of that.

geps/gep-3139/index.md

gnossen · 2024-07-26T18:13:34Z

geps/gep-3139/index.md

+    // Support: Extended
+    //
+    // +optional
+    StreamingRequest *Duration `json:"request,omitempty"`


How is the determination that an RPC is streaming vs unary made? I don't think this is possible either for gRPC library implementations or for proxies. From the perspective of the data plane, all RPCs of any arity are just streams. The only difference is that unary RPCs enter half-close after the client sends a single message and fully closed after the server sends a single response. This scenario may also happen under any of the three other arities. The only way for a data plane to make this determination would be by having access to the schema (specifically this portion) or some processed form of the schema (such as a FileDescriptorSet). I can see three ways that this would be delivered, but none of them are currently implemented, all of them require significant effort, and all of them result in a diminished UX for users of the Gateway API:

Plumbed through the gRPC Library

gRPC library implementations do track an RPC method's arity at the highest layer of generated code, but the call quickly hits a generic streaming layer that throws away the arity information, because all four arities are just special cases of bidirectional streaming.

So, in general, gRPC implementations simply do not have access to this information at runtime in the places in code that count and neither do proxies unless they are pre-loaded with the protobuf schema.

Delivered to a Proxy via Bundled DescriptorSets

You could bundle the schema information with the proxy and have the proxy look up the arity of individual URIs from the DescriptorSet. But this only works for a certain set of RPCs which must be determined ahead of time.

You would also need to orchestrate mounting the DescriptorSet into your proxy container. Depending on the Gateway API implementation, this could be quite hard.
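As a rough illustration of the bundled approach, the sketch below shows the kind of lookup a proxy would need. It assumes the DescriptorSet has already been loaded and flattened into a table keyed by request path. The `methodTable` type and the example paths are hypothetical; the two booleans correspond to the `client_streaming` and `server_streaming` flags on a protobuf method descriptor.

```go
package main

import "fmt"

// Arity is the shape of an RPC method as declared in the schema.
type Arity int

const (
	Unknown Arity = iota // the method is not in the bundled schema
	Unary
	ClientStreaming
	ServerStreaming
	BidiStreaming
)

// arityFromFlags derives the arity from the two streaming flags that a
// protobuf method descriptor carries.
func arityFromFlags(clientStreaming, serverStreaming bool) Arity {
	switch {
	case clientStreaming && serverStreaming:
		return BidiStreaming
	case clientStreaming:
		return ClientStreaming
	case serverStreaming:
		return ServerStreaming
	default:
		return Unary
	}
}

// methodTable is a hypothetical pre-computed index from the HTTP/2 :path
// ("/package.Service/Method") to the arity found in the DescriptorSet.
type methodTable map[string]Arity

// lookup returns Unknown for any path that was not bundled ahead of time.
func (t methodTable) lookup(path string) Arity { return t[path] }

func main() {
	table := methodTable{
		"/echo.Echo/Say":    arityFromFlags(false, false),
		"/echo.Echo/Stream": arityFromFlags(true, true),
	}
	fmt.Println(table.lookup("/echo.Echo/Say") == Unary)
	fmt.Println(table.lookup("/other.Svc/Method") == Unknown)
}
```

The `Unknown` case is the limitation described above: any method that was not bundled ahead of time cannot be classified at all.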

Delivered to a Proxy via gRPC Reflection

The gRPC reflection API offers a better mechanism for delivering the structured type information than bundling a processed form. The proxy would make a networked call to a reflection server. However, this injects additional latency (though this could be reduced by caching results). This would require that all RPCs that would possibly be routed have type information stored on a single network-accessible reflection server.

The proxy would of course have to be augmented with this functionality.

geps/gep-3139/index.md

gnossen · 2024-07-26T21:45:42Z

geps/gep-3139/index.md

+    // Support: Extended
+    //
+    // +optional
+    Request *Duration `json:"request,omitempty"`


gRPC propagates timeouts from the client to the server, and onward to further servers. This is used on the server side to cancel RPCs that surpass their timeout. Since the client will not be awaiting the result any longer, it doesn't make sense for the server to continue processing the request past the timeout.

This is communicated from the client to the server via the "grpc-timeout" metadata key. If a gateway or service mesh implementation is enforcing a stricter timeout than the client itself, it makes sense to rewrite this metadata element with the shorter of the two timeouts. For example, Envoy already provides knobs to do this.

I think it would be good to add this as an optional feature, perhaps with a boolean that, if set to true on an implementation that does not support it, will fail validation.

xtineskim · 2024-08-07

makes sense, thank you for the links! 👍

gnossen · 2024-07-26T22:24:55Z

geps/gep-3139/index.md

+    BackendRequest *Duration `json:"backendRequest,omitempty"`
+
+    // StreamingRequest specifies the ability for disabling bidirectional streaming. 
+    // The only supported settings are `0s`, so users can disable timeouts for streaming


Why would only an infinite timeout be allowed? It certainly makes sense to limit the max duration of a streaming RPC.

geps/gep-3139/index.md

gnossen · 2024-07-26T22:38:47Z

geps/gep-3139/index.md

+    // Support: Extended
+    //
+    // +optional
+    BackendRequest *Duration `json:"backendRequest,omitempty"`


Continuing on the theme of support for the gRPC library as a data plane, this field doesn't seem to make sense in that context. I think I'm fine with having a field that only applies for implementations with proxies, but we need to specify what happens when a Gateway API implementation that does not support this field (because there is no gateway) receives this field.

geps/gep-3139/index.md

arkodg · 2024-08-01T00:14:04Z

thanks for authoring this GEP @xtineskim and @gnossen for reviewing this in depth !

thinking out loud for gRPC timeouts, thoughts on the below semantics for GRPCRoute ?

- If no timeout section is defined, rely on the grpc-timeout header for deciding a per-request timeout https://github.com/grpc/grpc/blob/master/doc/PROTOCOL-HTTP2.md#requests
- timeouts.maxStreamDuration, which overrides the grpc-timeout header timeout and instead enforces an HTTP/2 stream duration timeout

xtineskim · 2024-08-01T15:34:55Z

Thanks @arkodg 😄 !
for your point here:

timeouts.maxStreamDuration which overrides grpc-timeout header timeout and instead enforces a HTTP/2 stream duration timeout

I wonder if this should be the opposite - if a request were to propagate to another service, could it just continually be growing in duration 🤔

arkodg · 2024-08-07T14:34:43Z

Thanks @arkodg 😄 ! for your point here:

timeouts.maxStreamDuration which overrides grpc-timeout header timeout and instead enforces a HTTP/2 stream duration timeout

I wonder if this should be the opposite - if a request were to propagate to another service, could it just continually be growing in duration 🤔

i meant the timeouts.maxStreamDuration would override the timeout value defined in the header, but not overwrite the grpc-timeout header itself

k8s-ci-robot · 2024-09-10T20:30:14Z

@xtineskim: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
pull-gateway-api-crds-validation-4	`193adc3`	link	true	`/test pull-gateway-api-crds-validation-4`

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Initial commit of GEP-3139

95d2e3d

k8s-ci-robot added kind/gep PRs related to Gateway Enhancement Proposal(GEP) do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Jul 26, 2024

k8s-ci-robot requested review from mikemorris and mlavacca July 26, 2024 13:42

k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 26, 2024

k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jul 26, 2024

xtineskim commented Jul 26, 2024

Render sequence diagram, small edits

bbff733

gnossen suggested changes Jul 26, 2024

xtineskim added 2 commits August 8, 2024 11:55

Merge branch 'kubernetes-sigs:main' into gep3139-grpc-timeouts

c6fc4d1

Update GEP3139 per feedback

193adc3

xtineskim requested a review from gnossen August 19, 2024 22:12

shaneutt requested review from shaneutt and kflynn and removed request for gnossen August 27, 2024 15:40

shaneutt added the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label Sep 18, 2024

shaneutt added this to the v1.3.0 milestone Sep 18, 2024

shadialtarsha mentioned this pull request Nov 8, 2024

gep: GEP-3440 - Gateway API Support for gRPC Retries #3441

Open
