[DRAFT, don't review yet] Rewrite the IO scheduling algorithm to ensure cross-shard fairness of tokens #2596
base: master
Conversation
@@ -160,6 +160,10 @@ public:
        _rovers.release(tokens);
    }

    void refund(T tokens) noexcept {
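For orientation, here is a minimal toy token bucket illustrating what a `refund()` next to the existing `release()` is presumably for. This is only a sketch under my own assumptions, not Seastar's actual `shared_token_bucket` (which uses per-shard rovers and time-based replenish):

```cpp
// Illustrative only -- shows the conceptual difference between release()
// (replenish on I/O completion, the path that already existed) and the new
// refund() (undo a grab whose tokens were never spent on a dispatch).
#include <atomic>
#include <cstdint>

class toy_token_bucket {
    std::atomic<int64_t> _available;
public:
    explicit toy_token_bucket(int64_t limit) : _available(limit) {}

    // Take tokens out of the bucket; false if not enough are currently available.
    bool try_grab(int64_t t) noexcept {
        int64_t cur = _available.load(std::memory_order_relaxed);
        while (cur >= t) {
            if (_available.compare_exchange_weak(cur, cur - t, std::memory_order_relaxed)) {
                return true;
            }
        }
        return false;
    }

    // Completion-side replenish (rate limiting omitted in this toy version).
    void release(int64_t t) noexcept {
        _available.fetch_add(t, std::memory_order_relaxed);
    }

    // Return tokens that were grabbed but not used, e.g. the unused remainder
    // of a reservation that got preempted by a smaller request.
    void refund(int64_t t) noexcept {
        _available.fetch_add(t, std::memory_order_relaxed);
    }
};
```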
Reposting a comment by @xemul which he wrote in michoecho@b0ec97d#r150664671. (I created this PR just to anchor his comment to something. Comments attached to commits are hard to find).
We had some experience with returning tokens to the bucket. It didn't work well: mixing time-based replenish with token-based replenish had a weird effect. Mind reading #1766, specifically the #1766 (comment) and #1766 (comment) comments, for details. If we're going to go with this fix, we need some justification of why we won't step on the same problem again.
Why cross-shard fairness? Not dropping the preempted capacity on the floor sounds like "fix the shard-local preemption" to me (#2591).
It fixes both. The main aim of this patch is to add cross-shard fairness by grabbing tokens for many requests at once. The local preemption problem is handled as a byproduct.
Preemption is handled like this: conceptually, there is no dedicated "pending request". Rather, there is a pending token reservation, and when we finally get some tokens out of it, we just spend them on the highest-priority request at that moment. If, due to this, we are left with a bunch of tokens we can't use immediately (e.g. because a request with 100k tokens butted into a reservation made earlier for a 1.5M request, so we are left with 1.4M after dispatching the 100k), we "roll them over" to the next reservation, essentially by grabbing […].

For example, if we do a pending reservation of cap=1.5M at wanthead=10M, and we call […].

(Note that this also means that requests bigger than […].)

Note that this change means that the worst-case I/O latency effectively increases by one io-latency-goal (because each shard can now allocate up to […]).

Also note that I didn't give any thought to the issues you mention in #1766 when I was writing this patch. I only glanced at #1766 and haven't had the time to think about how this patch interacts with them yet.
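A rough sketch of the scheme as described above. All names here (`shard_queue`, `reserve()`, `refund()`, `reservation_ready`) are invented for the illustration; this is not the patch's actual code, just my reading of it:

```cpp
#include <cstdint>
#include <queue>
#include <vector>

// Hypothetical request and shard-local queue types used only for this sketch.
struct request { uint64_t cost; int priority; };
struct by_priority {
    bool operator()(const request& a, const request& b) const { return a.priority < b.priority; }
};

struct shard_queue {
    std::priority_queue<request, std::vector<request>, by_priority> queue;
    uint64_t reservation_ready = 0;   // tokens from our pending reservation that matured

    // One poll step: dispatch whatever is highest-priority *right now*, then
    // roll any leftover tokens into the next reservation instead of keeping them.
    template <typename Bucket, typename Dispatch>
    void poll(Bucket& bucket, Dispatch dispatch) {
        uint64_t got = reservation_ready;
        reservation_ready = 0;

        // Spend matured tokens on the current head(s) of the queue, which may
        // not be the request the reservation was originally sized for.
        while (!queue.empty() && got >= queue.top().cost) {
            got -= queue.top().cost;
            dispatch(queue.top());
            queue.pop();
        }

        // Leftover tokens (e.g. 1.4M left after a 100k request preempted a
        // 1.5M reservation) are not hoarded: hand them back and make a fresh
        // reservation for the new head request.
        if (got > 0) {
            bucket.refund(got);
        }
        if (!queue.empty()) {
            bucket.reserve(queue.top().cost);
        }
    }
};
```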
Mind the following. All shards together cannot grab more than the full token-bucket limit, so there's a natural limit on the amount of tokens a shard can get. E.g. here's how the token bucket is configured for the io-properties that I have: […]

And request costs can grow as large as: […]

So you can charge at most 9.6 128k reads against that limit for the whole node. It's not that much.
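(The config and cost dumps above are not reproduced in this excerpt; the arithmetic behind the "9.6 reads" figure is presumably just the bucket limit divided by the cost of one 128k read. The values below are placeholders chosen only to match that ratio:)

```cpp
// Placeholder numbers only -- the real values come from the io-properties dump
// referenced above.
constexpr double token_bucket_limit     = 12'000'000;   // hypothetical bucket limit, in tokens
constexpr double cost_of_one_128k_read  =  1'250'000;   // hypothetical cost of a 128k read, in tokens
constexpr double node_wide_128k_reads   = token_bucket_limit / cost_of_one_128k_read;  // ~9.6
```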
OK, let's consider a simple io-tester job: […]

What would the result be with this PR?
So you reserve capacity for a large request in several grabs. There's one thing that bothers me. Below is a very simplified example that tries to demonstrate it. There are 2 shards, a 10-token limit, and requests cost 6 tokens each. Here's how it will move: […]

Here, the disk gets one request in the middle of the timeline and another request at the end of it. Now let's grab tokens in per-tick-threshold batches of 5 tokens: […]

Here, the disk is idle up until the end of the timeline, then gets two requests in one go. Effectively we re-distributed the load by first making it smaller (down to idle) and then compensating for the idleness after the next replenish took place (2x). It's not necessarily a problem, as shards don't always line up as in the former example. However, in the current implementation the rovers serve two purposes -- accounting for the available and consumed tokens, and acting as a queue of requests […].

The same thing, btw, happens with the patch from #2591 :(
@xemul Not quite. The patch anticipates this and does something different. The important part is that we avoid "hoarding" tokens: if we successfully grabbed some tokens but they aren't enough to fulfill the highest-priority request, we don't keep them and wait until we grab the necessary remainder; instead we "roll over" the tokens by releasing them back to the bucket and immediately making a bigger combined reservation. So in your example, if shard 0 calls […].

So a shard never "hoards" allocated-but-not-dispatched tokens for more than one poll period. If it sees that it can't dispatch in this grab cycle, it immediately hands the tokens over to the next person in the queue, and makes up for it in the next cycle. So the first actually dispatchable request will be dispatched as soon as there are enough tokens in the bucket and all shards have done one poll cycle to hand over their tokens to the dispatchable request. So dispatching isn't delayed to the end of the timeline; it's delayed by at most one poll cycle from the optimal dispatch point.
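A compressed sketch of that rule for the short-grab case in the example (a 6-token head request, only 5 tokens obtained so far). The identifiers are again hypothetical, not the patch's code:

```cpp
#include <cstdint>

// Hypothetical decision taken on a poll when a grab comes back short:
// head_cost = 6 and got = 5 in the example above.
template <typename Bucket>
void on_short_grab(Bucket& bucket, uint64_t got, uint64_t head_cost) {
    // Keeping the 5 tokens while waiting for the missing 1 would pin them to
    // this shard ("hoarding") even though another shard's dispatchable request
    // could use them right now.
    bucket.refund(got);           // hand the 5 tokens back to the shared bucket...
    bucket.reserve(head_cost);    // ...and reserve all 6 at the current tail.
    // Net effect: tokens never sit undispatched on a shard for more than one
    // poll period, so dispatch is delayed by at most one poll cycle.
}
```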
I need to check the final version of the patch; this explanation is not clear. First, please clarify what "not enough" means. Assume in my example shard 0 first tries to grab 5 tokens. That's not enough, right? But why does it grab 5 tokens if it knows that it will need 6? Or does it grab 6 from the very beginning?
Refs #1083. This is a dirty attempt to fix the lack of cross-shard fairness.
This draft was only created as an anchor for some comments posted in a different thread. Please don't review it (at least yet).