Buffered protocol changes #24839
Conversation
looks good from docs
CI test results on build #60851
When the request is buffered in the buffered protocol queue we do not
want to account that time for the overall timeout
why?
It does more harm than good. In a scenario with an overloaded cluster, the buffered requests immediately time out when sent through the RPC layer; this makes the cluster unstable, as leaders have more work to do while followers are unable to receive requests.
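The difference under discussion can be sketched as follows. This is illustrative only (`buffered_request` and both deadline helpers are hypothetical names, not Redpanda's actual types): the change computes the RPC deadline when a request is dequeued for dispatch, so time spent waiting in the buffered protocol queue no longer consumes the timeout budget meant for the network round trip.

```cpp
#include <cassert>
#include <chrono>

using clock_type = std::chrono::steady_clock;
using std::chrono::milliseconds;

struct buffered_request {
    milliseconds timeout; // budget intended to bound the network round trip
};

// Old accounting: deadline is fixed when the request enters the buffer,
// so time spent buffered counts against the timeout.
clock_type::time_point deadline_at_enqueue(const buffered_request& r,
                                           clock_type::time_point enqueued_at) {
    return enqueued_at + r.timeout;
}

// New accounting: deadline is computed when the request leaves the buffer,
// so buffer wait time is excluded from the timeout budget.
clock_type::time_point deadline_at_dispatch(const buffered_request& r,
                                            clock_type::time_point dispatched_at) {
    return dispatched_at + r.timeout;
}
```

With the old accounting, a request that waits in the buffer longer than its own timeout is already expired the moment it is dispatched, which is exactly the instability described above.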
Should the timeout be longer then? I mean, generally it sounds fragile to just ignore real time that a request is alive, whether it is in a buffer or not. But maybe I don't fully understand the issue.
Otherwise LGTM
When the request is buffered in the buffered protocol queue we do not want to account that time for the overall timeout Signed-off-by: Michał Maślanka <[email protected]>
The default buffer size of 5 MiB made the buffer grow very large. Changed the default to minimize the buffering impact on producer latency in saturated clusters. Signed-off-by: Michał Maślanka <[email protected]>
9c2b2d4 to 7cc196d
It indeed is a buffer. I guess I can adjust the timeout at the caller; what I observed is that this is very fragile. The timeout is there to bound the network latency, which is why I decided to adjust it. There are also situations in which the dispatch loop sends a request that has 5 ms left before it times out. This case is the worst, as the request will error out at the requester but will still be sent, forcing the leader to resend the message.
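The nearly-expired-request case mentioned above could be handled by a caller-side guard like the sketch below. This is a hypothetical illustration (`pending_request`, `next_dispatchable`, and the `floor` parameter are invented names, not Redpanda's API): before dispatching a buffered request, check how much of its timeout remains, and fail it locally if the remaining budget is below a floor, rather than sending a request that will time out at the requester while still loading the leader.

```cpp
#include <cassert>
#include <chrono>
#include <deque>
#include <optional>

using clock_type = std::chrono::steady_clock;
using std::chrono::milliseconds;

struct pending_request {
    int id;
    clock_type::time_point deadline;
};

// Pop the next request worth sending. Requests whose remaining budget is
// below `floor` are dropped locally: the requester gets a timeout error
// without the leader doing wasted work and being forced to resend.
std::optional<pending_request> next_dispatchable(std::deque<pending_request>& q,
                                                 clock_type::time_point now,
                                                 milliseconds floor) {
    while (!q.empty()) {
        pending_request p = q.front();
        q.pop_front();
        if (p.deadline - now >= floor) {
            return p; // enough budget left to be worth sending
        }
        // else: nearly expired; fail locally instead of dispatching
    }
    return std::nullopt;
}
```

The trade-off is choosing the floor: too low and nearly-dead requests still get sent; too high and requests that could have succeeded are failed preemptively.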
The offline discussion I had with Michal made it clearer what this change is doing. The timeout in question is not something chosen as part of initiating a raft replicate request, in which case, yes, the buffering time should be accounted for in processing. Rather, this timeout is intended to detect problems with followers, network partitions, etc., so it makes sense that buffering time should be excluded from this particular timeout use case.
Backports Required
Release Notes