Inquiry Regarding "Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving" #50
Comments
Hello @VegetaPn, Thank you for your interest in Mooncake and for sharing these questions. Below are the detailed answers:
I hope this helps clarify your questions. Please let me know if there's anything else I can assist you with.
@chestnut-Q Thank you very much for your reply! I have a few more questions I'd like to ask:
Your response has been very helpful to me, thank you again!
Hello @VegetaPn, here are the answers:
If you have any more questions, please feel free to ask.
@chestnut-Q Thank you very much! I have a few more questions I'd like to ask:
@VegetaPn, here is a straightforward and easy-to-implement approach: Conductor does not manage the KVCache of the decoding instance, which means the decoding node does not cache the KVCache. There is no layer-by-layer KVCache transfer between the prefill and decoding nodes. These optimizations are not discussed in the paper but can be implemented on top of it. Your understanding of point 2 is correct.
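A minimal sketch of what such a policy could look like (illustrative only, not the actual Conductor code; all class and function names are made up): the conductor tracks prefix-cache contents only for prefill instances, picks the decode instance by load alone, and the full KVCache is shipped to the decode node once, after prefill finishes.

```python
# Illustrative sketch only -- not the actual Mooncake Conductor implementation.
# Assumption: prefix-cache contents are tracked per prefill instance, decode
# instances are chosen by load alone, and the decode side keeps no KVCache.
from dataclasses import dataclass, field

@dataclass
class PrefillInstance:
    name: str
    load: int = 0
    cached_block_hashes: set = field(default_factory=set)  # prefix-cache state

@dataclass
class DecodeInstance:
    name: str
    load: int = 0  # no KVCache bookkeeping on the decode side

class Conductor:
    def __init__(self, prefills, decodes):
        self.prefills, self.decodes = prefills, decodes

    def schedule(self, prompt_block_hashes):
        def reusable_prefix(p):
            # Count how many leading blocks of the prompt hit p's prefix cache.
            n = 0
            for h in prompt_block_hashes:
                if h not in p.cached_block_hashes:
                    break
                n += 1
            return n

        # Prefer the prefill instance with the longest reusable prefix,
        # breaking ties by the lighter load.
        prefill = max(self.prefills, key=lambda p: (reusable_prefix(p), -p.load))
        # The decode instance is chosen purely by load; its KVCache is not
        # tracked, and the full KVCache is transferred once after prefill.
        decode = min(self.decodes, key=lambda d: d.load)
        return prefill, decode
```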
@chestnut-Q, if it's not a layer-by-layer transfer, then the entire KVCache can only be transferred after prefill execution has completed, which would likely be inefficient. How do you address the load and efficiency issues of this transfer? Have any related optimizations been implemented in Kimi?
@VegetaPn Yes, but this asynchronous transfer does not interfere with model inference on the GPUs.
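As a rough illustration of why an asynchronous copy does not have to stall inference (this is my own PyTorch sketch, not Mooncake code), the offload can be issued on a dedicated CUDA stream while the default stream stays free for compute:

```python
# Sketch: overlap a KVCache offload (device -> pinned host) with ongoing
# compute by issuing the copy on a dedicated CUDA stream. Illustrative only;
# assumes PyTorch with a CUDA device available.
import torch

compute_stream = torch.cuda.default_stream()
copy_stream = torch.cuda.Stream()  # side stream used only for KVCache copies

kv_block_gpu = torch.randn(8, 1024, 128, device="cuda", dtype=torch.float16)
kv_block_host = torch.empty(kv_block_gpu.shape, dtype=kv_block_gpu.dtype,
                            device="cpu", pin_memory=True)

# Start the copy only after the kernels that produced kv_block_gpu have run.
copy_stream.wait_stream(compute_stream)
with torch.cuda.stream(copy_stream):
    kv_block_host.copy_(kv_block_gpu, non_blocking=True)  # async D2H over PCIe

# The default stream is free to keep issuing inference kernels here; the DMA
# engine services the copy. Synchronize only when the host copy is needed.
copy_stream.synchronize()
```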
We believe that using CPU-based two-hop transfers is easier to implement and scale, as it requires fewer modifications to the existing inference framework. Moreover, KVCache transfers will not affect the inference of other requests, which is advantageous for systems where throughput is a priority. The choice of transfer method depends on the storage medium (e.g., HBM, DRAM, etc.) and the hardware setup. You can refer to our
I noticed that the mooncake-transfer-engine supports multiple types of media. However, I'm confused about the actual choice made in Mooncake, as the paper mentions both GPU-to-CPU and CPU-to-CPU transfers. @chestnut-Q
@woodyji Mooncake supports multiple types of transfers to accommodate different hardware conditions. However, you can think of it as GPU -> CPU -> GPU because this modification is the simplest, as I mentioned above. |
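For readers trying to picture the two-hop GPU -> CPU -> GPU path, here is a hypothetical sketch in PyTorch. It is not the Mooncake implementation; `send_over_network` and `receive_from_prefill` are placeholders for whatever transport is actually used (e.g., RDMA through a transfer engine, or a plain socket/RPC).

```python
# Sketch of a two-hop KVCache transfer: GPU (prefill) -> CPU DRAM -> network
# -> CPU DRAM (decode) -> GPU. Illustrative only; send_over_network and
# receive_from_prefill are placeholders for the real transport.
import torch

def offload_and_send(kv_gpu: torch.Tensor, send_over_network) -> None:
    """Prefill node: hop 1 is device -> pinned host, then ship the bytes."""
    kv_host = torch.empty(kv_gpu.shape, dtype=kv_gpu.dtype,
                          device="cpu", pin_memory=True)
    kv_host.copy_(kv_gpu, non_blocking=True)   # PCIe D2H copy
    torch.cuda.current_stream().synchronize()  # make sure the copy finished
    send_over_network(kv_host.numpy().tobytes())

def receive_and_upload(receive_from_prefill, shape, dtype=torch.float16):
    """Decode node: receive into DRAM, then hop 2 is host -> device."""
    raw = receive_from_prefill()
    kv_host = torch.frombuffer(bytearray(raw), dtype=dtype).reshape(shape)
    return kv_host.to("cuda")  # PCIe H2D copy (pin the buffer for an async copy)
```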
Thank you for your response. I have two additional questions regarding KVCache transfer:
Yes.
You can use shared memory to implement this.
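Since the original questions are not shown above, the following is only a generic illustration of the shared-memory idea: a CPU-side KVCache pool placed in POSIX shared memory so that, for example, the inference process and a separate transfer process can see the same blocks without an extra copy. The pool name and block sizes are made up for the example.

```python
# Generic illustration of a CPU-side KVCache pool placed in shared memory so
# that two processes (e.g. the inference engine and a transfer process) can
# access the same blocks without copying. Pool name and sizes are made up.
import numpy as np
from multiprocessing import shared_memory

BLOCK_BYTES = 4096   # toy block size, just for the example
NUM_BLOCKS = 64

# Process A (e.g. the inference engine) creates the pool and fills a block.
pool = shared_memory.SharedMemory(name="kvcache_pool_demo", create=True,
                                  size=BLOCK_BYTES * NUM_BLOCKS)
blocks = np.ndarray((NUM_BLOCKS, BLOCK_BYTES), dtype=np.uint8, buffer=pool.buf)
blocks[3, :] = 0x7F  # pretend this is an offloaded KV block

# Process B (e.g. a transfer/messenger process) attaches by name and reads the
# very same memory, no copy involved.
peer = shared_memory.SharedMemory(name="kvcache_pool_demo")
peer_blocks = np.ndarray((NUM_BLOCKS, BLOCK_BYTES), dtype=np.uint8, buffer=peer.buf)
assert peer_blocks[3, 0] == 0x7F

peer.close()
pool.close()
pool.unlink()  # remove the segment once everyone is done
```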
The workflow makes sense. However, will the decode instance transfer the incremental KVCache to the prefill instance for better KVCache reuse? The paper does not seem to clearly state this. @chestnut-Q |
Please refer to my previous reply. @woodyji |
In the GPU -> CPU -> GPU mode, will offloading and data transfer running simultaneously cause PCIe bandwidth contention?
Hello:
I recently read your insightful paper titled "Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving" and found it extremely enlightening.
However, there are a few parts that I find difficult to understand, and I would appreciate your help with the following questions:
Block IDs Clarification: In the paper, it is mentioned that the selected prefill node receives a request including the raw input, the block IDs of the prefix cache that can be reused, and the block IDs of the full cache allocated to the request. Could you please clarify whether these block IDs refer to the IDs of the memory blocks allocated during service startup, or to the hash keys of the tokens within the blocks?
Conductor's Load Awareness: How does the Conductor component of the system perceive the load status of inference instances in real-time? Is there any latency involved in this process?
State Management of Conductor: If the Conductor operates as a stateless multi-instance service, how is data consistency maintained across different instances of the Conductor?
Concurrency Issues in KV Cache Management: During the selection of a Prefill node, what happens if the KV Cache on that particular node is evicted? How does the system handle such scenarios to ensure smooth operation?
Your insights into these questions would greatly enhance my understanding of the system's architecture and its operational intricacies. I appreciate your time and assistance, and I look forward to your response.
Thank you for your groundbreaking work and for contributing to the field with such innovative solutions.
Best regards.