Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pass through an RpcFinished(None) when grpc write streams are terminated early #7422

Conversation

cosmicexplorer
Copy link
Contributor

@cosmicexplorer cosmicexplorer commented Mar 22, 2019

Problem

When using remotely executed scala compiles by setting --execution-strategy=hermetic with zinc compiles and pointing the --remote-execution-server and --remote-store-servers at an internal instance of https://github.com/twitter/scoot to compile scala in https://github.com/twitter/util, we are seeing some RpcFinished(None) errors causing upload digest requests to fail after a few retries.

In #6344 we see a case of a spurious RpcFinished(None) which caused us to fork grpcio. We believe this is similarly a spurious failure caused by a non-erroneous race condition, as in logs from Scoot we are seeing that the digests reporting a failure have successfully been uploaded to the remote store. In testing, we have found that this occurs with files which are large and are used by multiple process requests (e.g. scala-reflect.jar in zinc compilations). We believe the RpcFinished(None) we are converting into an Ok(None) here occurs when pants has multiple concurrent write requests against the remote store, and the remote store cancels the others after it successfully receives the first write request.

Solution

Result

Pants no longer considers it a failure to receive an RpcFinished(None) response. The discussion is ongoing as to whether this should be an accepted part of the remote execution API at https://groups.google.com/forum/#!topic/remote-execution-apis/NXUe3ItCw68/discussion.

@cosmicexplorer cosmicexplorer force-pushed the maybe-fix-unnecessary-rpcfinished branch from 0f2ec4d to fec857f Compare March 24, 2019 11:15
Copy link
Contributor

@illicitonion illicitonion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty uncomfortable with blindly ignoring errors here without understanding why they're happening... Do you know when RpcFinished(None) is happening and why? Is it a server-side issue we're working around? Is it an issue in grpcio we're working around? In particular, this seems to be common, because there are already retries, so if this is happening often enough that requests are failing despite multiple attempts, it would be really good to address the root cause.

src/rust/engine/fs/src/store.rs Outdated Show resolved Hide resolved
src/rust/engine/serverset/src/retry.rs Outdated Show resolved Hide resolved
@cosmicexplorer cosmicexplorer force-pushed the maybe-fix-unnecessary-rpcfinished branch from fec857f to ed9db30 Compare March 31, 2019 01:19
@cosmicexplorer cosmicexplorer force-pushed the maybe-fix-unnecessary-rpcfinished branch 2 times, most recently from 88b5729 to be8d5e9 Compare April 9, 2019 18:29
@cosmicexplorer cosmicexplorer changed the title pass through an RpcFinished(None) in store.rs and add debug logging to Retry [WIP] pass through an RpcFinished(None) in store.rs and add debug logging to Retry Apr 9, 2019
@cosmicexplorer
Copy link
Contributor Author

Marked WIP as per the above comment. This PR currently just has logging changes that are being used to diagnose this.

@cosmicexplorer cosmicexplorer force-pushed the maybe-fix-unnecessary-rpcfinished branch from 391a11d to 784df66 Compare April 11, 2019 03:08
@cosmicexplorer cosmicexplorer changed the title [WIP] pass through an RpcFinished(None) in store.rs and add debug logging to Retry pass through an RpcFinished(None) in store.rs and add debug logging to Retry Apr 11, 2019
@cosmicexplorer
Copy link
Contributor Author

cosmicexplorer commented Apr 11, 2019

As per discussion with @illicitonion "offline", I'm going to be splitting out the changes adding logging to a separate PR and focus this one on specifically the RpcFinished(None) => Ok(()) change in order to make it easier to revert. The scoot implementation terminating early isn't exactly part of the remote execution API specification, so this response may or may not continue to occur: see the discussion upstream at https://groups.google.com/forum/#!topic/remote-execution-apis/NXUe3ItCw68/discussion.

@cosmicexplorer cosmicexplorer force-pushed the maybe-fix-unnecessary-rpcfinished branch from 784df66 to 40a2c42 Compare April 11, 2019 23:42
@cosmicexplorer cosmicexplorer changed the title pass through an RpcFinished(None) in store.rs and add debug logging to Retry pass through an RpcFinished(None) when grpc write streams are terminated early Apr 12, 2019
@cosmicexplorer cosmicexplorer force-pushed the maybe-fix-unnecessary-rpcfinished branch from 40a2c42 to 3644a17 Compare April 12, 2019 06:02
Copy link
Contributor

@illicitonion illicitonion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

src/rust/engine/fs/src/store.rs Outdated Show resolved Hide resolved
@cosmicexplorer cosmicexplorer merged commit e1b7fed into pantsbuild:master Apr 12, 2019
illicitonion pushed a commit that referenced this pull request May 13, 2019
…ted early (#7422)

### Problem

When using remotely executed scala compiles by setting `--execution-strategy=hermetic` with zinc compiles and pointing the `--remote-execution-server` and `--remote-store-server`s at an internal instance of https://github.com/twitter/scoot to compile scala in https://github.com/twitter/util, we are seeing some [`RpcFinished(None)` errors](https://docs.rs/grpcio/0.4.2/grpcio/enum.Error.html) causing upload digest requests to fail after a few retries.

In #6344 we see a case of a spurious `RpcFinished(None)` which caused us to fork grpcio. We believe this is similarly a spurious failure caused by a non-erroneous race condition, as in logs from Scoot we are seeing that the digests reporting a failure have successfully been uploaded to the remote store. In testing, we have found that this occurs with files which are large and are used by multiple process requests (e.g. `scala-reflect.jar` in zinc compilations). We believe the `RpcFinished(None)` we are converting into an `Ok(None)` here occurs when pants has multiple concurrent write requests against the remote store, and the remote store cancels the others after it successfully receives the first write request.

### Solution

- Pass through an [`RpcFinished(None)`](https://docs.rs/grpcio/0.4.2/grpcio/enum.Error.html) in `store.rs`.

### Result

Pants no longer considers it a failure to receive an `RpcFinished(None)` response. The discussion is ongoing as to whether this should be an accepted part of the remote execution API at https://groups.google.com/forum/#!topic/remote-execution-apis/NXUe3ItCw68/discussion.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants