A key assumption behind iterative scheduling is that the scheduling delay (including other tasks invoked alongside the scheduler) is much smaller than the model forward time of one iteration, so scheduling at every iteration is acceptable. Two recent developments in LLM inference are challenging this assumption. First, model forwarding has become much faster with new inference kernels such as [FlashInfer](https://flashinfer.ai/), so the relative time spent on scheduling is more significant. Second, today’s schedulers take on more tasks and more considerations. For example, a technique called [chunked prefill](https://arxiv.org/abs/2308.16369) splits a prompt into multiple chunks, each executed in one iteration together with other decoding requests, thereby improving GPU utilization; a minimal sketch of this bookkeeping is shown below. Supporting chunked prefill adds to the work a scheduler must do at every iteration, and such added tasks inevitably increase the scheduling delay.
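To make the per-iteration bookkeeping concrete, here is a minimal sketch of how an iteration-level scheduler might implement chunked prefill: each iteration it spends a token budget on decode requests first, then on the next chunk of a pending prompt. The names (`Request`, `schedule_iteration`, `token_budget`) are illustrative assumptions, not the API of any particular engine; the point is that this logic sits on the critical path of every iteration.

```python
# Illustrative sketch of chunked-prefill scheduling (hypothetical names,
# not any specific engine's API).
from dataclasses import dataclass

@dataclass
class Request:
    prompt_tokens: int   # total prompt length
    prefilled: int = 0   # prompt tokens processed so far

    @property
    def prefill_done(self) -> bool:
        return self.prefilled >= self.prompt_tokens

def schedule_iteration(prefill_queue, decode_queue, token_budget=512):
    """Build one iteration's batch: decode tokens plus one prefill chunk."""
    batch = []
    budget = token_budget

    # Each decode request consumes one token of the budget this iteration.
    for req in decode_queue:
        if budget == 0:
            break
        batch.append((req, 1))
        budget -= 1

    # Spend the remaining budget on a chunk of the next prompt, so a long
    # prefill is spread across iterations instead of stalling decodes.
    if prefill_queue and budget > 0:
        req = prefill_queue[0]
        chunk = min(budget, req.prompt_tokens - req.prefilled)
        req.prefilled += chunk
        batch.append((req, chunk))
        if req.prefill_done:
            decode_queue.append(prefill_queue.pop(0))

    return batch  # list of (request, num_tokens) pairs to run this iteration
```

Even in this stripped-down form, the scheduler must track per-request prefill progress and rebuild the batch every iteration; each additional policy layered on top (priorities, preemption, prefix caching) adds more such work to the same per-iteration path.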