rpc: sink_impl: batch sending and deletion of snd_buf:s #2980
Conversation
Pull Request Overview
This PR fixes a backpressure issue in the RPC sink implementation where semaphore units were released prematurely, allowing the remote shard to accumulate too many unfreed buffers. The solution extends the semaphore units' lifetime to match the foreign_ptr by storing them in the snd_buf structure (sketched below, after the change summary).
- Adds semaphore_units field to snd_buf structure to extend its lifetime
- Moves semaphore units assignment before the remote execution submission
- Removes premature semaphore units release from the completion handler
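As a rough illustration of the idea (not the actual Seastar definitions; the simplified `snd_buf` and `send_one` below are assumptions), the units acquired on the sender's shard travel inside the buffer and are released only when the buffer is destroyed back on its home shard:

```cpp
#include <seastar/core/future.hh>
#include <seastar/core/semaphore.hh>
#include <seastar/core/sharded.hh>
#include <seastar/core/smp.hh>
#include <seastar/core/temporary_buffer.hh>
#include <memory>

// Hypothetical, simplified snd_buf: the semaphore units live in the buffer,
// so backpressure is held until the buffer itself is destroyed.
struct snd_buf {
    seastar::temporary_buffer<char> data;
    seastar::semaphore_units<> su;
};

seastar::future<> send_one(seastar::semaphore& sem, unsigned conn_shard,
                           seastar::temporary_buffer<char> data) {
    auto size = data.size();
    return seastar::get_units(sem, size).then(
            [conn_shard, data = std::move(data)] (seastar::semaphore_units<> su) mutable {
        // Store the units in the buffer *before* submitting to the remote shard.
        auto buf = std::make_unique<snd_buf>(snd_buf{std::move(data), std::move(su)});
        return seastar::smp::submit_to(conn_shard,
                [fbuf = seastar::make_foreign(std::move(buf))] () mutable {
            // ... write fbuf->data to the connection ...
            return seastar::make_ready_future<>();
            // fbuf's destruction is forwarded back to the owner shard,
            // releasing the units only at that point.
        });
    });
}
```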
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| include/seastar/rpc/rpc_types.hh | Adds semaphore.hh include and semaphore_units field to snd_buf struct |
| include/seastar/rpc/rpc_impl.hh | Moves semaphore units to snd_buf and removes premature release |
Force-pushed 0d42b97 to c8f1c1c (added comment documenting new snd_buf semaphore_units member).
Looks fine to me, but it would have been nice to reproduce the problem and see that there is an improvement.
What about a test?

I am working on a test that reproduces the issue without this backpressure extension.

Update: I was able to reproduce it. Note also the long queue of waiters for semaphore units.
Force-pushed c8f1c1c to 48452c0.

In 48452c0:
        throw std::runtime_error(msg);
    }
    break;
}
This infinite loop is not easy to understand.
I'd expect something like
try {
while (auto msg = source().get()) {
...
}
} catch ...
Can do.
the special handling is for rpc::stream_closed, and since it's a final state we can catch it outside the infinite loop.
Moved the try/catch around the infinite for loop in 1fb710f.
I have a feeling this is the wrong approach. For every message we send, we create a cross-shard set of tasks. We have to account for out-of-order messages due to task reordering (which @gleb-cloudius hates). But messages can be tiny (tombstone-only mutation fragments in ScyllaDB). The producer (the caller of sink_impl::operator()) will have no problem producing messages back-to-back. Suggest this:

The goal here is to have the local producer produce into a local buffer, and send the local buffer batch over with a concurrency of 1. There is no reordering.
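A rough sketch of that shape, under assumptions (the batching_sender type and the produce/send_loop names are made up, and error handling is omitted): the producer appends locally, and a single send fiber ships whole batches with a concurrency of 1, so ordering is preserved by construction.

```cpp
#include <seastar/core/future.hh>
#include <seastar/core/loop.hh>
#include <seastar/core/smp.hh>
#include <seastar/core/temporary_buffer.hh>
#include <utility>
#include <vector>

class batching_sender {
    unsigned _conn_shard;
    std::vector<seastar::temporary_buffer<char>> _pending;         // filled locally
    seastar::future<> _send_fut = seastar::make_ready_future<>();  // at most one send in flight
public:
    explicit batching_sender(unsigned conn_shard) : _conn_shard(conn_shard) {}

    void produce(seastar::temporary_buffer<char> msg) {
        _pending.push_back(std::move(msg));
        if (_send_fut.available()) {     // no send in progress: start the loop
            _send_fut = send_loop();
        }
    }

    seastar::future<> send_loop() {
        return seastar::do_until([this] { return _pending.empty(); }, [this] {
            auto batch = std::exchange(_pending, {});
            // One cross-shard call per batch instead of one per message.
            return seastar::smp::submit_to(_conn_shard,
                    [batch = std::move(batch)] () mutable {
                // ... write every buffer in `batch` to the connection, in order ...
                return seastar::make_ready_future<>();
            });
        });
    }
};
```

Where the batch's memory is ultimately freed, and how backpressure is applied, is exactly what the foreign_ptr discussion below is about.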
It may work, but it will not fix the problem @bhalevy tries to fix here. The problem is that the work submitted to another shard is not fully completed when the cross-smp call returns, since snd_buf:s are freed asynchronously.

The high-level problem here is the long queue of cross-shard tasks. What Avi suggests is managing the send queue using a single cross-shard task that would preserve message order.

Maybe I am wrong, but the reports we saw were all about foreign_ptr destruction, not about tasks submitted from sink::op(). The latter is limited by the semaphore while the former is not. But I see how a lot of very small messages can create a lot of submit_to calls.

The work would be reduced by a large amount since cross-smp calls would happen for entire batches (in both directions).

I do not see what will batch
We could wrap the vector with a foreign_ptr rather than individual snd_bufs. Though it may be hard to keep them in the vector. Alternatively, detach them from the vector, then collect them again after use. Note that the smp call that sends the vector blocks until it's processed, in order to let the TCP listener accumulate a new batch.
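As a hedged sketch of that idea (the `send_batch` name and the simplified `snd_buf` are assumptions): one foreign_ptr wraps the whole vector, so the batch is returned to its owner shard for deletion in a single hop.

```cpp
#include <seastar/core/future.hh>
#include <seastar/core/sharded.hh>
#include <seastar/core/smp.hh>
#include <seastar/core/temporary_buffer.hh>
#include <memory>
#include <vector>

struct snd_buf {                          // simplified placeholder for the real buffer
    seastar::temporary_buffer<char> data;
};

using batch_t = std::vector<std::unique_ptr<snd_buf>>;

seastar::future<> send_batch(unsigned conn_shard, batch_t batch) {
    auto fbatch = seastar::make_foreign(std::make_unique<batch_t>(std::move(batch)));
    // The smp call resolves only after the connection shard has consumed the
    // batch, so the caller naturally accumulates the next batch in the meantime.
    return seastar::smp::submit_to(conn_shard, [fbatch = std::move(fbatch)] () mutable {
        // ... write every snd_buf in *fbatch to the connection, in order ...
        return seastar::make_ready_future<>();
        // When this lambda is destroyed, fbatch's deletion is forwarded back
        // to the owner shard as a single cross-shard task for the whole batch.
    });
}
```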
Why do we suddenly care so much about cross-shard streaming, which should not be happening in normal circumstances? Is this because of tablet brokenness that does not preserve shard locality? It will kill our performance in many other places as well.
It's not only tablets, it's also the mixed-node case that sometimes happens. Eventually we need to go to a full (shard-to-shard) mesh, but we can't do that with TCP. I'd like to see RPC over QUIC.
auto ret_fut = con->send(std::move(local_data), {}, nullptr);

This becomes a loop over the vector, no? We can make ret_fut return local_data.
While complicated, it looks good. Worth testing with ScyllaDB mutation streaming and tiny mutations, with cross-shard streams, and observing the reduction in cross-shard calls.
Force-pushed 1d9dc73 to 7bd9b0b.
@gleb-cloudius please review again
include/seastar/rpc/rpc_impl.hh (outdated)
auto size = std::min(size_t(data.size), max_stream_buffers_memory);
const auto seq_num = _next_seq_num++;
return get_units(this->_sem, size).then([this, data = make_foreign(std::make_unique<snd_buf>(std::move(data))), seq_num] (semaphore_units<> su) mutable {
return get_units(this->_sem, size).then([this, data = std::make_unique<snd_buf>(std::move(data))] (semaphore_units<> su) mutable {
Why do you need to move data into a unique_ptr now?
It was to ensure that it gets destroyed on failure (including, say, a broken semaphore).
That said, below should change to the following to achieve that in the continuation:
_send_queue.enqueue(data.get());
data.release();
using an intrusive list would also remove the need to potentially allocate here...
But data is not a pointer, so it will be destroyed if the lambda is destroyed.
With intrusive list I created data initially as unique_ptr and released the pointer when pushing to the queue in the .then continuation, after getting the semaphore units.
From that reply I cannot figure out if it is still relevant to the current code.
Moved back to using boost::intrusive::slist in 1fb710f
    {}

    virtual ~snd_buf_deleter_impl() override {
        _delete_queue.enqueue(_obj_ptr);
This may throw, no?
Yes, it might.
We can reserve the space ahead of time when processing the send queue, up to the maximum number of outstanding snd_buf:s, but that may be overkill.
That's one of the reasons I prefer the intrusive list approach.
@avikivity what do you think regarding using intrusive lists for the queues vs. a vector?
Intrusive lists save allocations and make the buffers more easily transferable across shards.
I think it's okay.
intrusive lists are expensive because each item will take a cache miss and stall the machine. With a vector, each miss brings in multiple items and can initiate a prefetch for even more, so the stall is better amortized.
Still, we can allow the list here.
An alternative is boost::static_vector (or std::inplace_vector). It does not allocate. The downside is that if the capacity is exhausted, the connection shard has to stop processing.
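A minimal illustration of that alternative, with an assumed capacity and hypothetical names; the point is that enqueuing never allocates, and the caller must stop and drain when the capacity is reached.

```cpp
#include <boost/container/static_vector.hpp>
#include <cstddef>

struct snd_buf;                            // whatever the real buffer type is

constexpr std::size_t max_batch = 128;     // assumed capacity

// Fixed-capacity queue of pending buffers; no heap allocation ever happens.
boost::container::static_vector<snd_buf*, max_batch> pending;

bool try_enqueue(snd_buf* buf) noexcept {
    if (pending.size() == pending.capacity()) {
        return false;                      // exhausted: stop processing until drained
    }
    pending.push_back(buf);                // cannot allocate; capacity was checked above
    return true;
}
```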
> I think it's okay.
> intrusive lists are expensive because each item will take a cache miss and stall the machine. With a vector, each miss brings in multiple items and can initiate a prefetch for even more, so the stall is better amortized.
> Still, we can allow the list here.
Here the vector contains pointers to the snd_buf:s so the random access to memory happens anyway.
I'm not sure how much prefetching the next pointers in the queue would help. We can prefetch manually using the boost intrusive slist as well, but again, I'm not sure how much it will buy us.
@avikivity after playing around a bit with the idea of a static vector / array, it will require another semaphore and managing the units throughout the snd_buf lifecycle which adds another complication.
All in all, I'll go back to the boost intrusive list which is the simplest overall.
Used boost::intrusive::slist in 1fb710f
Its push_back method doesn't throw
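For reference, a sketch of the kind of hookup this implies (field and type names are illustrative, not the PR's actual code):

```cpp
#include <boost/intrusive/slist.hpp>

struct snd_buf {
    // Hook embedded in the buffer itself, so linking never allocates.
    boost::intrusive::slist_member_hook<> queue_hook;
    // ... payload ...
};

using snd_buf_list = boost::intrusive::slist<
    snd_buf,
    boost::intrusive::member_hook<snd_buf,
                                  boost::intrusive::slist_member_hook<>,
                                  &snd_buf::queue_hook>,
    boost::intrusive::cache_last<true>>;   // needed for O(1) push_back

// push_back only links the node: no allocation, no exception.
void enqueue(snd_buf_list& q, snd_buf& buf) noexcept {
    q.push_back(buf);
}
```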
The hardware will prefetch pointers from the array automatically.
You cannot prefetch across a linked list because you don't know what address to prefetch. You have to wait until you have data in the next node.
> @avikivity after playing around a bit with the idea of a static vector / array, it will require another semaphore and managing the units throughout the snd_buf lifecycle which adds another complication. All in all, I'll go back to the boost intrusive list which is the simplest overall.
ok
Gleb wrote:
> The backpressure mechanism in sink_impl<Serializer, Out...>::operator()
> does not work as expected. It uses semaphore _sem to limit the
> amount of work on the other shard but units are released before
> foreign_ptr is freed, so another shard may accumulate a lot of them.
> The solution as I see it is to make the semaphore_units lifetime
> the same as foreign_ptr (by holding it in the snd_buf for instance).

Fixes scylladb#2979
Refs scylladb/scylladb#24818

Signed-off-by: Benny Halevy <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>
Define batched_queue, used for batch processing of snd_buf:s on a remote shard. It is used first to queue up buffers on the send path, where a send loop is invoked on the connection shard to send queued batches of snd_buf:s. Then, on the completion path, the exhausted buffers are queued for deletion on the delete queue, where the processing loop is invoked back on the sink shard to delete the buffers. Both changes avoid overly long task queues that may be caused by sending small messages across shards. Note that batched_queue ensures processing of the buffers in FIFO order, also across shards, so the sequence_number mechanism previously used to reorder out-of-order continuations was dropped.

Signed-off-by: Benny Halevy <[email protected]>
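A hedged sketch of the shape this describes, not the actual implementation (the callback-based constructor and member names are assumptions); it reuses the intrusive-hook idea from above and shows how enqueue kicks a single FIFO processing loop.

```cpp
#include <seastar/core/future.hh>
#include <seastar/core/loop.hh>
#include <seastar/util/noncopyable_function.hh>
#include <boost/intrusive/slist.hpp>

template <typename T>  // T must embed a boost::intrusive::slist_member_hook<> named queue_hook
class batched_queue {
public:
    using list_t = boost::intrusive::slist<T,
        boost::intrusive::member_hook<T, boost::intrusive::slist_member_hook<>, &T::queue_hook>,
        boost::intrusive::cache_last<true>>;
private:
    list_t _queue;
    seastar::noncopyable_function<seastar::future<>(list_t)> _process_batch;
    seastar::future<> _process_fut = seastar::make_ready_future<>();
public:
    explicit batched_queue(seastar::noncopyable_function<seastar::future<>(list_t)> process_batch)
        : _process_batch(std::move(process_batch)) {}

    // Link the item (no allocation) and kick the loop if it is idle.
    void enqueue(T& item) {
        _queue.push_back(item);
        if (_process_fut.available()) {
            _process_fut = process_loop();
        }
    }

    // Drains whole batches in FIFO order. On the send path the callback would
    // submit the batch to the connection shard; on the delete path it would
    // free the buffers on their home shard.
    seastar::future<> process_loop() {
        return seastar::do_until([this] { return _queue.empty(); }, [this] {
            auto batch = std::move(_queue);   // the moved-from list is left empty
            return _process_batch(std::move(batch));
        });
    }
};
```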
Make sure any errors are returned as an exceptional future rather than thrown as exceptions. With that, close can easily be used to auto-close the sink using deferred_close.

Signed-off-by: Benny Halevy <[email protected]>
Abort using on_fatal_internal_error when the task queue grows too long (over the configured max_task_backlog, which is 1000 by default). This is useful mostly for tests that may trigger overly long queues and want to fail when that happens.

Signed-off-by: Benny Halevy <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>
Force-pushed 7bd9b0b to 1fb710f.
In v5 (1fb710f):
Why?

In case we'd want to use it for rcv_buf as a followup.

Would it be needed for rcv_buf?

Given the code in
    batched_queue<snd_buf> _send_queue;
    batched_queue<snd_buf> _delete_queue;

public:
> Make sure any errors are returned as an exceptional

The patch asserts that, not "makes sure" :)
    _queue.push_back(*buf);
    if (_process_fut.available()) {
        _process_fut = process_loop();
    }
Hmm, this depends on do_until (in process_loop) atomically checking the termination condition and making the future ready if needed. Which does hold.
        buf.bufs = std::move(newbufs);
    }

    return buf;
We could share the deleter for an entire batch rather than making one for each snd_buf. That would remove a lot of allocations. Since it's a batch we don't gain anything by having fine-grained deleters.
We can do that in a follow up.
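A rough sketch of what that follow-up could look like (all names hypothetical; the deletion callback stands in for pushing onto the delete queue): one heap-allocated deleter per batch instead of one per snd_buf.

```cpp
#include <functional>
#include <memory>
#include <vector>

struct snd_buf;   // the real buffer type; only used via pointer here

// One deleter shared by the whole batch. Each piece of data handed to the
// connection keeps a shared_ptr<batch_deleter> alive; when the last reference
// is dropped, all of the batch's buffers are queued for deletion together,
// instead of allocating and running one deleter per snd_buf.
struct batch_deleter {
    std::vector<snd_buf*> bufs;                          // buffers owned by this batch
    std::function<void(snd_buf*)> enqueue_for_deletion;  // e.g. pushes onto the delete queue

    ~batch_deleter() {
        for (auto* buf : bufs) {
            enqueue_for_deletion(buf);
        }
    }
};

// Usage sketch: one shared deleter per batch, attached to every buffer sent
// from that batch.
// auto deleter = std::make_shared<batch_deleter>();
// deleter->bufs = std::move(batch_pointers);
// deleter->enqueue_for_deletion = enqueue_fn;
```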
This series first extends the rpc sink_impl backpressure until snd_buf destruction,
so that callers block until enough memory held by outstanding snd_buf:s is freed.
In addition, batching mechanisms are added to queue up snd_buf:s
while a send loop is busy sending the previous batch, possibly on a remote shard.
When done sending, the original buffers are queued again for batched
destruction on their original shard.
The batching mechanisms avoid overly long task queues that were
caused by small messages being sent and destroyed individually
across shards.
The single send loop also ensures in-order sending of the messages,
which simplifies the sink implementation: it no longer needs
to sequence messages and reorder them in the submit_to task.
Fixes #2979
Refs scylladb/scylladb#24818