Add an extra timeout waiting for RVAs #8434
Once we have enough perspectives, we want to wait a little bit more to see if the last perspective replies, but we don't want to hold up all of issuance on a perspective that's hanging or being slow.
This seems like a good idea. My first instinct was that, rather than sleeping for a fixed time after hitting quorum, we should allow a certain amount of time measured from the start of validation. E.g. "We'll wait up to 10 seconds to see all perspectives; after that we'll return early if we hit quorum." The idea being that its impact on our latencies would be somewhat more obvious. But this is probably better, since slow hosts are generally slow from all perspectives, so a host that takes 15 seconds to reply should get an additional (e.g.) 2 seconds to reply from the last perspective.
Yeah, I like this idea. I like that it can mitigate one perspective behaving poorly: if all the other perspectives completed in 500ms, we don't want to wait a full 60s for the last one. Give it a chance, sure, but no need to wait forever if we have what we need. I think this is an elegant compromise between our previous strategy (cancel immediately upon getting enough results) and our current strategy (never cancel). I think with a config and a default value I'm happy to LGTM this.
@mcpherrinm, this PR appears to contain configuration and/or SQL schema changes. Please ensure that a corresponding deployment ticket has been filed with the new values.
Requesting review from others because Matt asked me to take this PR over from him.
Thanks!
I've filed IN-11910 to track the corresponding production config changes.
Once we have enough RVA results, set a tighter timeout on the remaining ones so we don't hold up issuance on a single slow perspective. This strikes a balance between our previous strategy (cancelling all dangling remotes immediately upon getting enough results) and our current strategy (don't cancel any remotes unless we hit our global timeout).
Add a new VA config field to control how long we wait for the lagging perspectives. If this config field is not set, default to not cancelling early, which is our current behavior.