Commit ae5d86b

Update background scheduling overhead
1 parent 1d57e12 commit ae5d86b

File tree

1 file changed (+1, -1 lines)


content/posts/scheduling_overhead.md

+1 -1
@@ -22,7 +22,7 @@ To understand the tradeoffs of iterative scheduling in today's environment, we p

LLM inference today performs batched model forwarding: a batch of requests is sent to the GPU at a time. Prior LLM inference systems schedule the next batch only after every request in the current batch finishes its generation, which wastes GPU resources because requests that finish early sit idle waiting for the others. [Iterative LLM inference scheduling](https://www.usenix.org/conference/osdi22/presentation/yu) mitigates this issue by constructing a batch after each model forwarding iteration, where each iteration executes a prompt prefill and/or the generation of one decoding token per request. With the chance to add new requests to the batch at any iteration, iterative scheduling greatly improves GPU utilization.
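To make the contrast concrete, here is a minimal Python sketch of the two scheduling styles. It is not the API of any real engine: `model_forward`, the request objects with a `finished` flag, and the batch-size cap of 8 are all assumed for illustration.

```python
from collections import deque

# Hypothetical request queues; only a `finished` flag is assumed on requests.
waiting: deque = deque()   # requests that have not started yet
running: list = []         # requests currently in the batch

MAX_BATCH = 8              # assumed batch-size cap, purely illustrative

def step_batch_level(model_forward):
    """Old style: pick a batch, then run it until *every* request finishes."""
    batch = [waiting.popleft() for _ in range(min(MAX_BATCH, len(waiting)))]
    while any(not r.finished for r in batch):
        model_forward(batch)   # early finishers idle until the last one is done

def step_iteration_level(model_forward):
    """Iterative scheduling: rebuild the batch before every forward pass."""
    running[:] = [r for r in running if not r.finished]    # retire finished requests
    while waiting and len(running) < MAX_BATCH:             # admit new ones right away
        running.append(waiting.popleft())
    model_forward(running)     # one prefill and/or one decode token per request
```

In the iteration-level version the scheduler runs before every forward pass, which is exactly why its own latency starts to matter.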

- Typically, LLM scheduling involves post-processing requests (sampling and detokenization) from the previous batch, selecting the requests to include in the next batch, and preparing the new batch for model forwarding. Note that pre- and post-processing can be treated as tasks independent of the actual scheduling algorithm; we include them in the “scheduling overhead” in this work because they run if and only if the scheduler is invoked.
+ Typically, LLM scheduling involves post-processing requests (detokenization) from the previous batch, selecting the requests to include in the next batch, and preparing the new batch for model forwarding. Note that pre- and post-processing can be treated as tasks independent of the actual scheduling algorithm; we include them in the “scheduling overhead” in this work because they run if and only if the scheduler is invoked.

A key assumption iterative scheduling makes is that the scheduling delay (including the other tasks invoked together with the scheduler) is much smaller than the model forwarding time of one iteration, so scheduling at every iteration is acceptable. Two recent developments in LLM inference are challenging this assumption. First, model forwarding has become much faster with new inference kernels like [FlashInfer](https://flashinfer.ai/), so the relative time spent on scheduling is more significant. Second, today’s scheduling systems often take on more tasks and more considerations. For example, a technique called [chunked prefill](https://arxiv.org/abs/2308.16369) splits a prompt into multiple chunks, each executed in one iteration alongside other decoding requests, thereby improving GPU utilization. Supporting chunked prefill adds to the scheduler's per-iteration work, and such added tasks inevitably increase the scheduling delay.
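For a sense of what chunked prefill asks of the scheduler at every iteration, here is a minimal Python sketch of a token-budget planner; the 512-token budget, the request fields, and the decode-first policy are assumptions for illustration rather than the method of any particular system.

```python
# A minimal sketch of chunked prefill under an assumed per-iteration token
# budget. The point is that splitting prompts into chunks is extra work the
# scheduler must redo at every iteration.

TOKEN_BUDGET = 512   # max tokens processed in one forward pass (assumed)

def plan_iteration(decoding_reqs, prefilling_reqs):
    """Fill the budget with decode tokens first, then chunks of pending prompts."""
    plan, budget = [], TOKEN_BUDGET

    for req in decoding_reqs:                  # each decoding request costs one token
        if budget == 0:
            break
        plan.append((req, 1))
        budget -= 1

    for req in prefilling_reqs:                # chunk whatever prompt tokens remain
        if budget == 0:
            break
        chunk = min(budget, req.remaining_prompt_tokens)
        plan.append((req, chunk))
        req.remaining_prompt_tokens -= chunk
        budget -= chunk

    return plan                                # (request, num_tokens) pairs for this pass
```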
