Commit ae5d86b

Update background scheduling overhead
1 parent 1d57e12 commit ae5d86b

File tree

1 file changed (+1, -1 lines)


content/posts/scheduling_overhead.md

+1 -1
@@ -22,7 +22,7 @@ To understand the tradeoffs of iterative scheduling in today's environment, we p

LLM inference today performs batched model forwarding: a batch of requests is sent to the GPU at a time. Prior LLM inference systems schedule the next batch only after every request in the current batch finishes its generation, which wastes GPU resources because requests that finish early sit idle waiting for the others. [Iterative LLM inference scheduling](https://www.usenix.org/conference/osdi22/presentation/yu) mitigates this issue by constructing a batch after each model forwarding iteration, where each iteration executes a prompt prefill and/or the generation of one decoding token per request. With the chance to add new requests to the batch at any iteration, iterative scheduling greatly improves GPU utilization.
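To make the contrast concrete, here is a minimal Python sketch of the two scheduling styles. It is not the API of any real engine: `model_forward`, the request objects with a `finished` flag, and the batch-size cap of 8 are all assumed for illustration.

```python
from collections import deque

# Hypothetical request queues; only a `finished` flag is assumed on requests.
waiting: deque = deque()   # requests that have not started yet
running: list = []         # requests currently in the batch

MAX_BATCH = 8              # assumed batch-size cap, purely illustrative

def step_batch_level(model_forward):
    """Old style: pick a batch, then run it until *every* request finishes."""
    batch = [waiting.popleft() for _ in range(min(MAX_BATCH, len(waiting)))]
    while any(not r.finished for r in batch):
        model_forward(batch)   # early finishers idle until the last one is done

def step_iteration_level(model_forward):
    """Iterative scheduling: rebuild the batch before every forward pass."""
    running[:] = [r for r in running if not r.finished]    # retire finished requests
    while waiting and len(running) < MAX_BATCH:             # admit new ones right away
        running.append(waiting.popleft())
    model_forward(running)     # one prefill and/or one decode token per request
```

In the iteration-level version the scheduler runs before every forward pass, which is exactly why its own latency starts to matter.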

- Typically, LLM scheduling involves post-processing requests (sampling and detokenization) from the previous batch, selecting the requests to include in the next batch, and preparing the new batch for model forwarding. Note that pre- and post-processing can be treated as tasks independent of the actual scheduling algorithm; we include them in the “scheduling overhead” in this work because they run if and only if the scheduler is invoked.
+ Typically, LLM scheduling involves post-processing requests (detokenization) from the previous batch, selecting the requests to include in the next batch, and preparing the new batch for model forwarding. Note that pre- and post-processing can be treated as tasks independent of the actual scheduling algorithm; we include them in the “scheduling overhead” in this work because they run if and only if the scheduler is invoked.

A key assumption iterative scheduling makes is that the scheduling delay (including the other tasks invoked together with the scheduler) is much smaller than the model forwarding time of one iteration, so scheduling at every iteration is acceptable. Two recent developments in LLM inference are challenging this assumption. First, model forwarding has become much faster with new inference kernels like [FlashInfer](https://flashinfer.ai/), so the relative time spent on scheduling is more significant. Second, today’s scheduling systems often take on more tasks and more considerations. For example, a technique called [chunked prefill](https://arxiv.org/abs/2308.16369) splits a prompt into multiple chunks, each executed in one iteration alongside other decoding requests, thereby improving GPU utilization. Supporting chunked prefill adds to the scheduler's per-iteration work, and such added tasks inevitably increase the scheduling delay.
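For a sense of what chunked prefill asks of the scheduler at every iteration, here is a minimal Python sketch of a token-budget planner; the 512-token budget, the request fields, and the decode-first policy are assumptions for illustration rather than the method of any particular system.

```python
# A minimal sketch of chunked prefill under an assumed per-iteration token
# budget. The point is that splitting prompts into chunks is extra work the
# scheduler must redo at every iteration.

TOKEN_BUDGET = 512   # max tokens processed in one forward pass (assumed)

def plan_iteration(decoding_reqs, prefilling_reqs):
    """Fill the budget with decode tokens first, then chunks of pending prompts."""
    plan, budget = [], TOKEN_BUDGET

    for req in decoding_reqs:                  # each decoding request costs one token
        if budget == 0:
            break
        plan.append((req, 1))
        budget -= 1

    for req in prefilling_reqs:                # chunk whatever prompt tokens remain
        if budget == 0:
            break
        chunk = min(budget, req.remaining_prompt_tokens)
        plan.append((req, chunk))
        req.remaining_prompt_tokens -= chunk
        budget -= chunk

    return plan                                # (request, num_tokens) pairs for this pass
```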
