
[DRAFT, don't review yet] Rewrite the IO scheduling algorithm to ensure cross-shard fairness of tokens #2596

Draft · wants to merge 4 commits into base: master

Conversation

@michoecho (Contributor) commented Dec 23, 2024

Refs #1083. This is a dirty attempt to fix the lack of cross-shard fairness.

This draft was only created as an anchor for some comments posted in a different thread. Please don't review it (at least not yet).

@@ -160,6 +160,10 @@ public:
_rovers.release(tokens);
}

void refund(T tokens) noexcept {
@michoecho (Contributor, Author) Dec 23, 2024

Reposting a comment by @xemul, which he wrote at michoecho@b0ec97d#r150664671. (I created this PR just to anchor his comment to something; comments attached to commits are hard to find.)

We had some experience with returning tokens to the bucket. It didn't work well: mixing time-based replenishment with token-based replenishment had a weird effect. Mind reading #1766, specifically the #1766 (comment) and #1766 (comment) comments, for details. If we're going to go with this fix, we need some justification of why we won't step on the same problem again.

@michoecho changed the title from "Rewrite the IO scheduling algorithm to ensure cross-shard fairness of tokens" to "[DRAFT, don't review yet] Rewrite the IO scheduling algorithm to ensure cross-shard fairness of tokens" on Dec 23, 2024
@xemul (Contributor) commented Dec 23, 2024

> This is a dirty attempt to fix the lack of cross-shard fairness.

Why cross-shard fairness? Not dropping the preempted capacity on the floor sounds like "fix the shard-local preemption" to me (#2591)

@michoecho (Contributor, Author)

> This is a dirty attempt to fix the lack of cross-shard fairness.

> Why cross-shard fairness? Not dropping the preempted capacity on the floor sounds like "fix the shard-local preemption" to me

It fixes both. The main aim of this patch is to add cross-shard fairness by grabbing tokens for many requests at once. The local preemption problem is handled as a byproduct.

_queued_cap is the sum of the capacities of all queued (i.e. not dispatched and not cancelled yet) requests. When we grab tokens, we grab min(per_tick_grab_threshold, _queued_cap) tokens. Grabbing a fixed amount of tokens at a time, rather than a fixed number of requests (i.e. one request) at a time, is the key part that gives us fairness of tokens instead of fairness of IOPS.

Preemption is handled like this: conceptually, there is no dedicated "the pending request". Rather, there is a pending token reservation, and when we finally get some tokens out of it, we just spend them on the highest-priority request at the moment. If, due to this, we are left with a bunch of tokens we can't use immediately (e.g. because a request with 100k tokens butted into a reservation made earlier for a 1.5M request, so we are left with 1.4M after dispatching the 100k), we "roll them over" to the next reservation, essentially by grabbing min(per_tick_grab_threshold + leftovers, _queued_cap) (and then immediately refunding the unused tokens to the group head so that they are used by others rather than wasted).
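A minimal sketch of that grab rule, for illustration only. The names shard_sketch, tokens_to_grab and _leftovers are mine; only per_tick_grab_threshold, _queued_cap and the min() rule come from the description above:

```cpp
#include <algorithm>
#include <cstdint>

using capacity_t = uint64_t;

struct shard_sketch {
    capacity_t per_tick_grab_threshold;  // fixed per-tick token budget
    capacity_t _queued_cap = 0;          // sum of capacities of all queued requests
    capacity_t _leftovers = 0;           // tokens grabbed earlier but not dispatched yet

    // How many tokens to reserve on this dispatch_requests() tick: the fixed
    // budget plus anything rolled over, capped by what is actually queued.
    capacity_t tokens_to_grab() const noexcept {
        return std::min(per_tick_grab_threshold + _leftovers, _queued_cap);
    }
};
```

With _leftovers == 0 this reduces to the plain min(per_tick_grab_threshold, _queued_cap) rule from the previous paragraph.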

For example, if we do a pending reservation of cap=1.5M at wanthead=10M, and we call dispatch_requests after grouphead=11M, and the highest-priority request at the moment is 100k tokens, then we dispatch 100k, grab another pending allocation with cap=2.9M at wanthead=13.9M, and atomically advance the grouphead by 1.4M (so to 12.4M assuming no interference).
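For what it's worth, the numbers in this example line up if per_tick_grab_threshold is 1.5M and the new reservation is measured from the current group head (both are my reading of the example, not stated explicitly):

```cpp
#include <cassert>

int main() {
    long threshold  =  1'500'000;  // per_tick_grab_threshold, assumed to be 1.5M here
    long dispatched =    100'000;  // the 100k request dispatched out of the 1.5M reservation
    long grouphead  = 11'000'000;  // group head when dispatch_requests runs (given above)

    long leftovers = threshold - dispatched;   // tokens grabbed but not used
    long next_grab = threshold + leftovers;    // size of the next combined reservation
    long wanthead  = grouphead + next_grab;    // where the new reservation is fulfilled
    long new_head  = grouphead + leftovers;    // head advanced past the refunded leftovers

    assert(leftovers == 1'400'000 && next_grab == 2'900'000);
    assert(wanthead == 13'900'000 && new_head == 12'400'000);
}
```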

(Note that this also means that requests bigger than per_tick_grab_threshold can take several calls to dispatch_requests to finally accumulate enough tokens to execute.)

Note that this change means that the worst-case I/O latency effectively increases by one io-latency-goal (because each shard can now allocate up to max_request_size + io-latency-goal / smp::count per dispatch_requests due to the leftovers, instead of the old max_request_size), but I don't really see a way around this. Every algorithm I could think of that solves the local-preemption waste problem does that.

Also note that I didn't give any thought to the issues you mention in #1766 when I was writing this patch. I only glanced at #1766 and didn't have the time to think how this patch interacts with them yet.

@xemul (Contributor) commented Dec 23, 2024

> The main aim of this patch is to add cross-shard fairness by grabbing tokens for many requests at once

Mind the following. All shards together cannot grab more than the full token-bucket limit, so there's a natural limit on the number of tokens a shard can get. E.g. here's how the token bucket is configured for the io-properties that I have:

  token_bucket:
    limit: 12582912
    rate: 16777216
    threshold: 80612

And request costs can grow as large as:

    131072: // request size in bytes
      read: 1312849
      write: 3038283

so you can charge at most ~9.6 128k reads against that limit for the whole node. It's not that much.
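Just to spell out where the 9.6 comes from, using the quoted io-properties values:

```cpp
#include <cstdio>

int main() {
    double limit = 12582912;      // token_bucket limit
    double read_cost = 1312849;   // cost of one 128k read
    std::printf("%.1f\n", limit / read_cost);  // prints 9.6
}
```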

@xemul (Contributor) commented Dec 23, 2024

> It fixes both.

OK, let's consider some simple io-tester job:

- name: shard_0
  shard_info:
    parallelism: 32
    reqsize: 4kB
    shares: 100
  shards:
  - '0'
  type: randread
- name: shard_1
  shard_info:
    parallelism: 32
    reqsize: 4kB
    shares: 800
  shards:
  - '1'
  type: randread

What would the result be with this PR?

@xemul (Contributor) commented Dec 23, 2024

> (Note that this also means that requests bigger than per_tick_grab_threshold can take several calls to dispatch_requests to finally accumulate enough tokens to execute.)

So you reserve capacity for a large request in several grabs. There's one thing that bothers me. Below is a very simplified example that tries to demonstrate it.

There are 2 shards, a 10-token limit, and requests cost 6 tokens each. Here's how it will move:

shard0
shard1
       ------------------------------------------------------------------
       |         |
       tail      head

// shard0 grabs 6 tokens and dispatches

shard0 |------|
shard1
       ------------------------------------------------------------------
              |  |
              t  head

// shard1 grabs 6 tokens and gets "pending"

shard0 |------|
shard1        |------|
       ------------------------------------------------------------------
                 |   |
                 h   t

// time to replenish 4 more tokens elapses, shard1 dispatches

shard0 |------|
shard1        |------|
       ------------------------------------------------------------------
                     |
                     t,h

Here, the disk gets one request in the middle of the timeline and another request at the end of it. Now let's grab tokens in batches with a per-tick threshold of 5 tokens:

shard0
shard1
       ------------------------------------------------------------------
       |         |
       tail      head

// shard0 grabs 5 tokens and waits

shard0 |-----|
shard1
       ------------------------------------------------------------------
             |   |
             t   head

// shard1 grabs 5 tokens and waits

shard0 |-----|
shard1       |-----|
       ------------------------------------------------------------------
                 | |
                 h t

// time to replenish 4 more tokens elapses

shard0 |-----|
shard1       |-----|
       ------------------------------------------------------------------
                   |   |
                   t   h

// both shards get 1 token each and dispatch

shard0 |-----|     |-|
shard1       |-----| |-|
       ------------------------------------------------------------------
                   |   |
                   t   h

Here, the disk idles up until the end of the timeline, then gets two requests in one go. Effectively we re-distributed the load by making it smaller (down to idle) and then compensating for the idleness after the next replenish took place (2x).

It's not necessarily a problem, as shards don't always line up as in the former example. However, in the current implementation the rovers serve two purposes: they account for the available and consumed tokens, and they act as a queue of requests (group.grab_capacity(cap) returns the "position" in this queue). Batched grabbing, with one request requiring several "grabs", breaks the queuing facility. And while it's OK to overdispatch a disk with short requests, this will cause overdispatching with long requests, which is worse.
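For readers less familiar with the rover scheme, here is a rough, non-authoritative sketch of those two roles (token accounting plus an implicit queue of grabs); the names are mine, not the actual shared-token-bucket API:

```cpp
#include <atomic>
#include <cstdint>

using capacity_t = uint64_t;

struct rovers_sketch {
    std::atomic<capacity_t> tail{0};  // advanced when a shard grabs capacity
    std::atomic<capacity_t> head{0};  // advanced as tokens get replenished/released

    // Grabbing both accounts for the tokens and enqueues the caller: the value
    // returned is the "position" the caller must wait for, and concurrent grabs
    // are serialized by the atomic bump of the tail.
    capacity_t grab_capacity(capacity_t cap) noexcept {
        return tail.fetch_add(cap, std::memory_order_relaxed) + cap;
    }

    // A grab is fulfilled once the head catches up with its position.
    bool fulfilled(capacity_t want_head) const noexcept {
        return head.load(std::memory_order_relaxed) >= want_head;
    }
};
```

With batched grabbing and roll-over, a single large request no longer maps onto a single contiguous position in this implicit queue, which is the queuing facility being broken.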

The same thing, btw, happens with the patch from #2591 :(

@michoecho (Contributor, Author)

> There's one thing that bothers me.
> ...
> There are 2 shards, a 10-token limit, and requests cost 6 tokens each. Here's how it will move:

@xemul Not quite. The patch anticipates this, and does something different. The important part is: we avoid "hoarding" tokens — if we successfully grabbed some tokens, but they aren't enough to fulfill the highest-priority request, we don't keep the tokens and wait until we grab the necessary remainder, but we "roll over" the tokens by releasing them back to the bucket and immediately make a bigger combined reservation.

So in your example, if shard 0 calls dispatch_requests again after shard 1 made its reservation (even before any replenishment happens), and sees that it has successfully grabbed 5 tokens, it won't allocate the remaining 1 token right after shard 1's allocation, but it will instead immediately return its 5 tokens to the bucket and request a combined reservation of 6 tokens right after shard 1's. And then shard 1 on the next dispatch_requests call will see that it can't fulfill the request with the ongoing allocation of 5 tokens, so it will cancel the allocation by returning the 3 tokens it already has and moving the group head after the 2 tokens which are right after the head, and it will make a combined reservation of 6 tokens after shard 0's allocation. And then shard 0 on its next dispatch_requests call will see that its allocation is fulfilled, and will dispatch the request.

So a shard never "hoards" allocated-but-not-dispatched tokens for more than one poll period. If it sees that it can't dispatch in this grab cycle, it immediately hands over the tokens to the next shard in the queue, and makes up for it in the next cycle. So the first actually dispatchable request will be dispatched as soon as there are enough tokens in the bucket and all shards have done one poll cycle to hand over their tokens to it. So dispatching isn't delayed to the end of the timeline; it's delayed by at most one poll cycle from the optimal dispatch point.
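If I read the rule correctly, the per-poll decision reduces to something like the sketch below. Every name here is illustrative, and the exact cancellation of the not-yet-replenished part of a reservation is only hinted at in a comment:

```cpp
#include <algorithm>
#include <cstdint>

using capacity_t = uint64_t;

// What one shard dispatches, hands back, and re-reserves on a single poll.
struct poll_outcome {
    bool dispatch;               // grabbed tokens cover the head-of-line request
    capacity_t handed_back;      // grabbed tokens refunded to the group head
    capacity_t next_reservation; // size of the new combined reservation, if any
};

poll_outcome on_poll(capacity_t grabbed,      // part of our reservation under the head
                     capacity_t top_request,  // cost of the highest-priority request
                     capacity_t per_tick_threshold,
                     capacity_t queued_cap) {
    if (grabbed >= top_request) {
        return {true, 0, 0};  // dispatch now; any leftover rolls into the next grab
    }
    // Not enough for the head-of-line request: do not hoard. Refund what we
    // grabbed (and cancel the rest of the reservation), then queue a bigger
    // combined reservation behind the other shards.
    return {false, grabbed, std::min(per_tick_threshold + grabbed, queued_cap)};
}
```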

@xemul (Contributor) commented Dec 23, 2024

> The important part is: we avoid "hoarding" tokens — if we successfully grabbed some tokens, but they aren't enough to fulfill the highest-priority request, we don't keep the tokens and wait until we grab the necessary remainder, but we "roll over" the tokens by releasing them back to the bucket and immediately make a bigger combined reservation.

I need to check the final version of the patch; this explanation is not clear.

First, please clarify what "not enough" means. Assume in my example shard-0 first tries to grab 5 tokens. That's not enough, right? But why does it grab 5 tokens if it knows that it will need 6? Or does it grab 6 from the very beginning?
Next, what if a shard grabs the 6 tokens that are needed for the request, but the tail overruns the head? Is it still/also "not enough"? Or is it something else?
