llama: Ensure KV cache is fully defragmented. #10873

jessegross · 2024-12-17T20:46:35Z

Sometimes the KV cache requires defragmentation even without triggering the threshold heuristic. In this case, decoding will not be able to find a KV cache slot. This is particularly difficult for the caller to handle if it happens in between ubatches. To avoid this, we should immediately trigger a defrag when no slot can be found.
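As a rough illustration of the retry idea, here is a minimal sketch on a toy cache. `ToyCache`, `find_slot`, `defrag`, and `find_slot_or_defrag` are hypothetical names invented for this example; they are not the llama.cpp structures or functions, and the real cache tracks positions and sequence ids per cell.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Toy cache: true marks an occupied cell. Hypothetical type, not a llama.cpp structure.
struct ToyCache {
    std::vector<bool> used;
};

// Return the start index of the first run of n contiguous free cells, or -1 if none exists.
int find_slot(const ToyCache & cache, std::size_t n) {
    assert(n > 0); // a zero-length request is treated as a caller bug in this sketch
    std::size_t run = 0;
    for (std::size_t i = 0; i < cache.used.size(); ++i) {
        run = cache.used[i] ? 0 : run + 1;
        if (run == n) {
            return static_cast<int>(i + 1 - n);
        }
    }
    return -1;
}

// Compact all occupied cells to the front so the free cells form one contiguous run.
void defrag(ToyCache & cache) {
    const std::size_t n_used =
        static_cast<std::size_t>(std::count(cache.used.begin(), cache.used.end(), true));
    for (std::size_t i = 0; i < cache.used.size(); ++i) {
        cache.used[i] = i < n_used;
    }
}

// Try to find a slot; on failure, defragment immediately and retry once
// instead of returning the failure to the caller.
int find_slot_or_defrag(ToyCache & cache, std::size_t n) {
    int slot = find_slot(cache, n);
    if (slot < 0) {
        defrag(cache);
        slot = find_slot(cache, n);
    }
    return slot;
}
```

With cells alternating used/free, a request for two contiguous cells fails at first but succeeds after the forced defrag. A request larger than the total free space still fails, so the caller only sees an error when the cache is genuinely full.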

In addition, a heavily fragmented cache can require more than max_moves moves to defragment. Currently, we stop when we hit the limit, but this can leave a cache that still does not have adequate space even after defragmentation is triggered. Instead, we should process the moves in multiple batches until defragmentation is complete.
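The batching idea can be sketched on a toy model as follows. This is an assumption-laden illustration, not the llama.cpp implementation: `FREE`, `defrag_pass`, and `defrag_fully` are hypothetical names, cells are plain integers, and `max_moves` is passed in as a parameter rather than derived from any real limit.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int FREE = -1; // hypothetical marker for an empty cell

// One bounded pass: fill holes near the front with occupied cells taken from
// the back, performing at most max_moves moves. Returns the number of moves made.
std::size_t defrag_pass(std::vector<int> & cells, std::size_t max_moves) {
    assert(max_moves > 0);
    std::size_t moves = 0;
    std::size_t lo = 0;            // scans forward for the next hole
    std::size_t hi = cells.size(); // one past the last occupied cell
    while (moves < max_moves) {
        while (lo < hi && cells[lo] != FREE)     ++lo;
        while (hi > lo && cells[hi - 1] == FREE) --hi;
        if (lo >= hi) {
            break; // no hole is left in front of an occupied cell
        }
        cells[lo]     = cells[hi - 1];
        cells[hi - 1] = FREE;
        ++moves;
    }
    return moves;
}

// Repeat bounded passes until a pass has nothing left to move.
// Returns the number of passes that moved at least one cell.
std::size_t defrag_fully(std::vector<int> & cells, std::size_t max_moves) {
    std::size_t passes = 0;
    while (defrag_pass(cells, max_moves) > 0) {
        ++passes;
    }
    return passes;
}
```

A single pass with a small `max_moves` leaves holes behind, which is the situation described above; looping until a pass makes no moves guarantees that the occupied cells end up contiguous regardless of the per-pass limit.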

