WIP: first attempt at #776 / #941 #942
Conversation
Force-pushed from 350d7b4 to 03ad88e
```ruby
  end

  def server_pipelined_get(server, request)
    buffer_size = server.socket_sndbuf
```
I had been thinking about doing it by batching the number of keys, but I think this approach of looking at it from the perspective of the number of bytes is a better way to address the issue. It also allows us to give the end user direction about how to control the behavior (by tuning sndbuf).
I started out with batching the keys, but I got the idea to batch by bytes instead when I struggled to come up with a good default value for the batch size; it seemed to me that the original problem was caused by the response size rather than by the number of keys requested.
I don't know if sndbuf is a good value, though; maybe it is more efficient to write batches larger than the buffer.
It will also not allow users of the library to tweak the send buffer and the chunk size independently (I do not know if that makes sense).
It might not matter at all, as this uses nonblocking IO for the writes, which could mean that there would just be partial writes (which are already handled correctly, AFAIK) if we try to write everything at once. Maybe I will do some experiments, but I think there are other areas I should polish first (basic correctness being the most important one).
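If I'm reading the draft correctly, the byte-based approach boils down to something like this sketch (illustrative only: the real code uses nonblocking writes with partial-write handling, which a plain `IO#write` loop glosses over):

```ruby
# Slice the pipelined request into sndbuf-sized chunks instead of
# batching by key count. Illustrative sketch, not the actual diff.
def write_in_chunks(socket, request, buffer_size)
  offset = 0
  while offset < request.bytesize
    chunk = request.byteslice(offset, buffer_size)
    offset += socket.write(chunk) # IO#write returns the bytes written
  end
end
```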
I think we might do something like this (sketched below):
- Add an explicit chunking parameter that can be set on the client
- Have that chunking parameter fall back to the sndbuf size if it isn't explicitly set
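A minimal sketch, assuming a hypothetical `:pipelined_get_chunk_size` client option (the name is invented for illustration):

```ruby
# Fall back to the socket send buffer size when the client does not set
# an explicit chunk size. :pipelined_get_chunk_size is a hypothetical option.
def pipelined_get_chunk_size(server)
  @options[:pipelined_get_chunk_size] || server.socket_sndbuf
end
```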
```ruby
    start_time = Time.now
    servers = requests.keys

    # FIXME: this was executed before the finish request was sent. Why?
```
I have to look at this in a little detail. The behavior in the case of no servers available is supposed to be returning []
with no error. But if I recall correctly, there are cases where a server is not connected that can trigger raising a Dalli::Error
without this check. I'll find some time to look.
Any hints on the intended original behavior of the multi-server code are highly appreciated. I haven't been able to fully wrap my head around it yet, and I guess it is hardest to get right when things fail in between reading/writing chunks. It is hard to test correctly as well.
lib/dalli/protocol/base.rb (outdated)
```
@@ -143,6 +145,14 @@ def quiet?
  end
  alias multi? quiet?

  def pipelined_get_request(keys)
    req = +''
```
I know this line was just copied and moved from pre-existing code, but looking at it, we can almost certainly reduce allocations by pre-allocating req
to a reasonable size. Something like `keys.size * 24 + keys.map(&:size).sum`
for binary? And something like `keys.size * 18 + keys.map(&:size).sum`
for meta.
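A sketch of what that could look like, using `String.new(capacity:)`; the per-key overhead is the estimate above (not a verified constant), and `quiet_get_request` stands in for the real per-key encoder:

```ruby
# Pre-allocate the request buffer so appending each key's request does
# not trigger repeated reallocations as the string grows.
def pipelined_get_request(keys)
  per_key_overhead = 24 # estimate for binary; ~18 (or 17?) for meta
  req = String.new(capacity: keys.size * per_key_overhead + keys.sum(&:size))
  keys.each { |key| req << quiet_get_request(key) } # hypothetical encoder
  req
end
```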
Will do. But it seems that it should be 17 for meta; at least it was in my tests (non-base64, which will completely break this calculation).
It does not seem to make a significant difference in speed, though.
I have another idea for a performance optimization: start off by generating only as many bytes as are needed for the first chunk and send it, then pre-calculate upcoming bytes whenever the select blocks (or the socket is writable and no pre-calculated data is available).
This means the server can start generating the response earlier, Dalli would use less memory, and time that is "lost" waiting for I/O could be used for computation.
But it will be a bit harder to do elegantly without a library like async or eventmachine, and I don't know if it is worth the added complexity. Anyway, I would put this in an extra PR once this one is finished. A rough sketch of the idea follows.
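Very roughly, and only to make the idea concrete (`encode_get` is a hypothetical per-key encoder; the real implementation would live inside the existing nonblocking write loop):

```ruby
require 'io/wait'

# Sketch: encode only enough bytes for the next chunk, and use the time
# spent waiting for a writable socket to pre-encode the following one.
def write_request_lazily(socket, keys, chunk_size)
  queue   = keys.dup
  pending = +'' # encoded bytes not yet written
  until queue.empty? && pending.empty?
    # Top up the buffer to one chunk before attempting a write.
    pending << encode_get(queue.shift) while pending.bytesize < chunk_size && !queue.empty?
    written = socket.write_nonblock(pending, exception: false)
    if written == :wait_writable
      # Socket is busy: pre-encode the next chunk while we wait.
      pending << encode_get(queue.shift) while pending.bytesize < 2 * chunk_size && !queue.empty?
      socket.wait_writable
    else
      pending = pending.byteslice(written, pending.bytesize - written)
    end
  end
end
```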
It's very possible I got the number wrong for meta (I did the calculation quickly), but just make sure you're including the separator.
I think the benefit of that optimization is unlikely to be worth the tradeoff, but we can discuss.
I think that's a neat idea for an optimization, but yeah, it feels like it could be an improvement for a future PR.
The `.map(&:size).sum` is faster in my little benchmark, but only by about 10 nanoseconds per key.
> The `.map(&:size).sum` is faster in my little benchmark, but only by about 10 nanoseconds per key.

I had the (naive) assumption that the cost of allocating an array should be avoided, not knowing that it is outweighed by doing the sum in C instead of in Ruby. I should have remembered Knuth.
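For anyone who wants to repeat the measurement, a quick stdlib comparison along these lines (numbers will vary with Ruby version and key sizes; note that `keys.sum(&:size)` avoids the intermediate array entirely):

```ruby
require 'benchmark'

keys = Array.new(1_000) { |i| "user:#{i}" }
N = 1_000

Benchmark.bm(10) do |x|
  x.report('map + sum') { N.times { keys.map(&:size).sum } }
  x.report('sum(&blk)') { N.times { keys.sum(&:size) } }
  x.report('each loop') do
    N.times do
      total = 0
      keys.each { |k| total += k.size }
    end
  end
end
```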
There's a bit of "devil in the details", but directionally this looks correct. Thanks for drafting, @marvinthepa. With a little back and forth we can almost certainly get this in.
Happy to give back and spend some company time on a library we have been using for a few years now.
Force-pushed from 03ad88e to dac656d
WIP, as it contains a few FIXMEs that need to be addressed (especially error handling).
Uploaded anyway as a basis for discussion.