
[DRAFT, don't review yet] Rewrite the IO scheduling algorithm to ensure cross-shard fairness of tokens #2596

Draft · wants to merge 4 commits into base: master

Conversation

@michoecho (Contributor) commented Dec 23, 2024

Refs #1083. This is a dirty attempt to fix the lack of cross-shard fairness.

This draft was only created as an anchor for some comments posted in a different thread. Please don't review it (at least not yet).

@@ -160,6 +160,10 @@ public:
_rovers.release(tokens);
}

void refund(T tokens) noexcept {
@michoecho (Contributor, Author) Dec 23, 2024

Reposting a comment by @xemul, which he wrote at michoecho@b0ec97d#r150664671. (I created this PR just to anchor his comment to something; comments attached to commits are hard to find.)

We had some experience with returning tokens to the bucket. It didn't work well: mixing time-based replenishment with token-based replenishment had a weird effect. Mind reading #1766, specifically the #1766 (comment) and #1766 (comment) comments, for details. If we're going to go with this fix, we need some justification of why we won't step on the same problem again.

@michoecho changed the title from "Rewrite the IO scheduling algorithm to ensure cross-shard fairness of tokens" to "[DRAFT, don't review yet] Rewrite the IO scheduling algorithm to ensure cross-shard fairness of tokens" on Dec 23, 2024
@xemul (Contributor) commented Dec 23, 2024

> This is a dirty attempt to fix the lack of cross-shard fairness.

Why cross-shard fairness? Not dropping the preempted capacity on the floor sounds like "fix the shard-local preemption" to me (#2591)

@michoecho (Contributor, Author)

> This is a dirty attempt to fix the lack of cross-shard fairness.

> Why cross-shard fairness? Not dropping the preempted capacity on the floor sounds like "fix the shard-local preemption" to me

It fixes both. The main aim of this patch is to add cross-shard fairness by grabbing tokens for many requests at once. The local preemption problem is handled as a byproduct.

_queued_cap is the sum of the capacities of all queued (i.e. not dispatched and not cancelled yet) requests. When we grab tokens, we grab min(per_tick_grab_threshold, _queued_cap) tokens. Grabbing a fixed amount of tokens at a time, rather than a fixed number of requests (i.e. one request) at a time, is the key part that gives us fairness of tokens instead of fairness of IOPS.

Preemption is handled like this: conceptually, there is no dedicated "the pending request". Rather, there is a pending token reservation, and when we finally get some tokens out of it, we just spend them on the highest-priority request at the moment. If, due to this, we are left with a bunch of tokens we can't use immediately (e.g. because a request with 100k tokens butted into a reservation made earlier for a 1.5M request, so we are left with 1.4M after dispatching the 100k), we "roll them over" to the next reservation, essentially by grabbing min(per_tick_grab_threshold + leftovers, _queued_cap) (and then immediately refunding the unused tokens to the group head so that they are used by others rather than wasted).
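A minimal sketch of that grab rule, for illustration only. The names shard_sketch, tokens_to_grab and _leftovers are mine; only per_tick_grab_threshold, _queued_cap and the min() rule come from the description above:

```cpp
#include <algorithm>
#include <cstdint>

using capacity_t = uint64_t;

struct shard_sketch {
    capacity_t per_tick_grab_threshold;  // fixed per-tick token budget
    capacity_t _queued_cap = 0;          // sum of capacities of all queued requests
    capacity_t _leftovers = 0;           // tokens grabbed earlier but not dispatched yet

    // How many tokens to reserve on this dispatch_requests() tick: the fixed
    // budget plus anything rolled over, capped by what is actually queued.
    capacity_t tokens_to_grab() const noexcept {
        return std::min(per_tick_grab_threshold + _leftovers, _queued_cap);
    }
};
```

With _leftovers == 0 this reduces to the plain min(per_tick_grab_threshold, _queued_cap) rule from the previous paragraph.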

For example, if we do a pending reservation of cap=1.5M at wanthead=10M, and we call dispatch_requests after grouphead=11M, and the highest-priority request at the moment is 100k tokens, then we dispatch 100k, grab another pending allocation with cap=2.9M at wanthead=13.9M, and atomically advance the grouphead by 1.4M (so to 12.4M assuming no interference).
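For what it's worth, the numbers in this example line up if per_tick_grab_threshold is 1.5M and the new reservation is measured from the current group head (both are my reading of the example, not stated explicitly):

```cpp
#include <cassert>

int main() {
    long threshold  =  1'500'000;  // per_tick_grab_threshold, assumed to be 1.5M here
    long dispatched =    100'000;  // the 100k request dispatched out of the 1.5M reservation
    long grouphead  = 11'000'000;  // group head when dispatch_requests runs (given above)

    long leftovers = threshold - dispatched;   // tokens grabbed but not used
    long next_grab = threshold + leftovers;    // size of the next combined reservation
    long wanthead  = grouphead + next_grab;    // where the new reservation is fulfilled
    long new_head  = grouphead + leftovers;    // head advanced past the refunded leftovers

    assert(leftovers == 1'400'000 && next_grab == 2'900'000);
    assert(wanthead == 13'900'000 && new_head == 12'400'000);
}
```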

(Note that this also means that requests bigger than per_tick_grab_threshold can take several calls to dispatch_requests to finally accumulate enough tokens to execute.)

Note that this change means that the worst-case I/O latency effectively increases by one io-latency-goal (because each shard can now allocate up to max_request_size + io-latency-goal / smp::count per dispatch_requests due to the leftovers, instead of the old max_request_size), but I don't really see a way around this. Every algorithm I could think of that solves the local-preemption waste problem does that.

Also note that I didn't give any thought to the issues you mention in #1766 when I was writing this patch. I only glanced at #1766 and didn't have the time to think how this patch interacts with them yet.

@xemul (Contributor) commented Dec 23, 2024

> The main aim of this patch is to add cross-shard fairness by grabbing tokens for many requests at once

Mind the following. All shards together cannot grab more than the full token-bucket limit, so there's a natural limit on the number of tokens a shard can get. E.g. here's how the token bucket is configured for the io-properties that I have:

  token_bucket:
    limit: 12582912
    rate: 16777216
    threshold: 80612

And request costs can grow as large as:

    131072: // request size in bytes
      read: 1312849
      write: 3038283

so you can charge at most ~9.6 128k reads against that limit for the whole node. It's not that much.
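Just to spell out where the 9.6 comes from, using the quoted io-properties values:

```cpp
#include <cstdio>

int main() {
    double limit = 12582912;      // token_bucket limit
    double read_cost = 1312849;   // cost of one 128k read
    std::printf("%.1f\n", limit / read_cost);  // prints 9.6
}
```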

@xemul (Contributor) commented Dec 23, 2024

> It fixes both.

OK, let's consider some simple io-tester job:

- name: shard_0
  shard_info:
    parallelism: 32
    reqsize: 4kB
    shares: 100
  shards:
  - '0'
  type: randread
- name: shard_1
  shard_info:
    parallelism: 32
    reqsize: 4kB
    shares: 800
  shards:
  - '1'
  type: randread

What would the result be with this PR?

@xemul (Contributor) commented Dec 23, 2024

> (Note that this also means that requests bigger than per_tick_grab_threshold can take several calls to dispatch_requests to finally accumulate enough tokens to execute.)

So you reserve capacity for a large request in several grabs. There's one thing that bothers me. Below is a very simplified example that tries to demonstrate it.

There are 2 shards, a 10-token limit, and requests cost 6 tokens each. Here's how it will move:

shard0
shard1
       ------------------------------------------------------------------
       |         |
       tail      head

// shard0 grabs 6 tokens and dispatches

shard0 |------|
shard1
       ------------------------------------------------------------------
              |  |
              t  head

// shard1 grabs 6 tokens and gets "pending"

shard0 |------|
shard1        |------|
       ------------------------------------------------------------------
                 |   |
                 h   t

// time to replenish 4 more tokens elapses, shard1 dispatches

shard0 |------|
shard1        |------|
       ------------------------------------------------------------------
                     |
                     t,h

Here, the disk gets one request in the middle of the timeline and another request at the end of it. Now let's grab tokens in batches with a per-tick threshold of 5 tokens:

shard0
shard1
       ------------------------------------------------------------------
       |         |
       tail      head

// shard0 grabs 5 tokens and waits

shard0 |-----|
shard1
       ------------------------------------------------------------------
             |   |
             t   head

// shard1 grabs 5 tokens and waits

shard0 |-----|
shard1       |-----|
       ------------------------------------------------------------------
                 | |
                 h t

// time to replenish 4 more tokens elapses

shard0 |-----|
shard1       |-----|
       ------------------------------------------------------------------
                   |   |
                   t   h

// both shards get 1 token each and dispatch

shard0 |-----|     |-|
shard1       |-----| |-|
       ------------------------------------------------------------------
                   |   |
                   t   h

Here, the disk idles up until the end of the timeline, then gets two requests in one go. Effectively we re-distributed the load by making it smaller (down to idle) and then compensating for the idleness after the next replenish took place (2x).

It's not necessarily a problem, as shards don't always line up as in the former example. However, in the current implementation the rovers serve two purposes: they account for the available and consumed tokens, and they act as a queue of requests (group.grab_capacity(cap) returns the "position" in this queue). Batched grabbing, with one request requiring several "grabs", breaks the queuing facility. And while it's OK to overdispatch a disk with short requests, this will cause overdispatching with long requests, which is worse.
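For readers less familiar with the rover scheme, here is a rough, non-authoritative sketch of those two roles (token accounting plus an implicit queue of grabs); the names are mine, not the actual shared-token-bucket API:

```cpp
#include <atomic>
#include <cstdint>

using capacity_t = uint64_t;

struct rovers_sketch {
    std::atomic<capacity_t> tail{0};  // advanced when a shard grabs capacity
    std::atomic<capacity_t> head{0};  // advanced as tokens get replenished/released

    // Grabbing both accounts for the tokens and enqueues the caller: the value
    // returned is the "position" the caller must wait for, and concurrent grabs
    // are serialized by the atomic bump of the tail.
    capacity_t grab_capacity(capacity_t cap) noexcept {
        return tail.fetch_add(cap, std::memory_order_relaxed) + cap;
    }

    // A grab is fulfilled once the head catches up with its position.
    bool fulfilled(capacity_t want_head) const noexcept {
        return head.load(std::memory_order_relaxed) >= want_head;
    }
};
```

With batched grabbing and roll-over, a single large request no longer maps onto a single contiguous position in this implicit queue, which is the queuing facility being broken.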

The same thing, btw, happens with the patch from #2591 :(

@michoecho (Contributor, Author)

> There's one thing that bothers me.
> ...
> There are 2 shards, a 10-token limit, and requests cost 6 tokens each. Here's how it will move:

@xemul Not quite. The patch anticipates this, and does something different. The important part is: we avoid "hoarding" tokens — if we successfully grabbed some tokens, but they aren't enough to fulfill the highest-priority request, we don't keep the tokens and wait until we grab the necessary remainder, but we "roll over" the tokens by releasing them back to the bucket and immediately make a bigger combined reservation.

So in your example, if shard 0 calls dispatch_requests again after shard 1 made its reservation (even before any replenishment happens), and sees that it has successfully grabbed 5 tokens, it won't allocate the remaining 1 token right after shard 1's allocation, but it will instead immediately return its 5 tokens to the bucket and request a combined reservation of 6 tokens right after shard 1's. And then shard 1 on the next dispatch_requests call will see that it can't fulfill the request with the ongoing allocation of 5 tokens, so it will cancel the allocation by returning the 3 tokens it already has and moving the group head after the 2 tokens which are right after the head, and it will make a combined reservation of 6 tokens after shard 0's allocation. And then shard 0 on its next dispatch_requests call will see that its allocation is fulfilled, and will dispatch the request.

So a shard never "hoards" allocated-but-not-dispatched tokens for more than one poll period. If it sees that it can't dispatch in this grab cycle, it immediately hands over the tokens to the next shard in the queue, and makes up for it in the next cycle. So the first actually dispatchable request will be dispatched as soon as there are enough tokens in the bucket and all shards have done one poll cycle to hand over their tokens to it. So dispatching isn't delayed to the end of the timeline; it's delayed by at most one poll cycle from the optimal dispatch point.
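If I read the rule correctly, the per-poll decision reduces to something like the sketch below. Every name here is illustrative, and the exact cancellation of the not-yet-replenished part of a reservation is only hinted at in a comment:

```cpp
#include <algorithm>
#include <cstdint>

using capacity_t = uint64_t;

// What one shard dispatches, hands back, and re-reserves on a single poll.
struct poll_outcome {
    bool dispatch;               // grabbed tokens cover the head-of-line request
    capacity_t handed_back;      // grabbed tokens refunded to the group head
    capacity_t next_reservation; // size of the new combined reservation, if any
};

poll_outcome on_poll(capacity_t grabbed,      // part of our reservation under the head
                     capacity_t top_request,  // cost of the highest-priority request
                     capacity_t per_tick_threshold,
                     capacity_t queued_cap) {
    if (grabbed >= top_request) {
        return {true, 0, 0};  // dispatch now; any leftover rolls into the next grab
    }
    // Not enough for the head-of-line request: do not hoard. Refund what we
    // grabbed (and cancel the rest of the reservation), then queue a bigger
    // combined reservation behind the other shards.
    return {false, grabbed, std::min(per_tick_threshold + grabbed, queued_cap)};
}
```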

@xemul (Contributor) commented Dec 23, 2024

> The important part is: we avoid "hoarding" tokens — if we successfully grabbed some tokens, but they aren't enough to fulfill the highest-priority request, we don't keep the tokens and wait until we grab the necessary remainder, but we "roll over" the tokens by releasing them back to the bucket and immediately make a bigger combined reservation.

I need to check the final version of the patch; this explanation is not clear.

First, please clarify what "not enough" means. Assume in my example shard-0 first tries to grab 5 tokens. That's not enough, right? But why does it grab 5 tokens if it knows that it will need 6? Or does it grab 6 from the very beginning?
Next, what if a shard grabs the 6 tokens that are needed for the request, but the tail overruns the head? Is it still/also "not enough"? Or is it something else?
