Inquiry Regarding "Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving" #50

Open · VegetaPn opened this issue Dec 25, 2024 · 16 comments

Comments

@VegetaPn

Hello:

I recently read your insightful paper, "Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving," and found it extremely enlightening.

However, I have some questions about a few parts that I'm finding difficult to understand and would appreciate your help with:

  1. Block IDs Clarification: The paper mentions that the selected prefill node receives a request including the raw input, the block IDs of the prefix cache that can be reused, and the block IDs of the full cache allocated to the request. Could you clarify whether these block IDs refer to the IDs of the memory blocks allocated at service startup, or whether they are the hash keys of the tokens within the blocks?

  2. Conductor's Load Awareness: How does the Conductor component of the system perceive the load status of inference instances in real-time? Is there any latency involved in this process?

  3. State Management of Conductor: If the Conductor operates as a stateless multi-instance service, how is data consistency maintained across different instances of the Conductor?

  4. Concurrency Issues in KV Cache Management: During the selection of a Prefill node, what happens if the KV Cache on that particular node is evicted? How does the system handle such scenarios to ensure smooth operation?

Your insights into these questions would greatly enhance my understanding of the system's architecture and its operational intricacies. I appreciate your time and assistance, and I look forward to your response.

Thank you for your groundbreaking work and for contributing to the field with such innovative solutions.

Best regards.

@chestnut-Q
Collaborator

Hello @VegetaPn,

Thank you for your interest in Mooncake and for sharing these questions. Below are the detailed answers:

  1. These are the logical IDs of the memory blocks. For example, if a machine allocates 10,000 blocks at startup, the block IDs range from 0 to 9999.
  2. Conductor estimates processing time based on the requests currently being processed or queued on each instance (e.g., prompt_len, prefix_prompt_len, batch_size). For prefill instances, Conductor estimates the TTFT based on the last request in the queue; for decoding instances, it estimates the TBT for the requests in a batch (a rough sketch follows this list). Conductor handles multiple concurrent requests efficiently, so any additional latency is negligible in practice.
  3. Currently, there is only one Conductor in each Mooncake cluster, running on a dedicated node.
  4. Conductor adopts an LRU eviction strategy for KVCache, with carefully designed concurrency and locking mechanisms to ensure that eviction does not disrupt the selection of a prefill node. Specifically, there is no KVCache eviction during prefix hash matching; moreover, once a block is matched, it will not be evicted until its ref_count drops to 0.
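
To make point 2 above concrete, here is a rough Python sketch of a conductor-style load estimate. The linear cost model and all names are illustrative assumptions, not Mooncake's actual implementation:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Request:
    prompt_len: int
    prefix_prompt_len: int = 0   # tokens already covered by a reused prefix cache

@dataclass
class PrefillInstance:
    queue: List[Request] = field(default_factory=list)
    secs_per_token: float = 0.0005   # assumed linear cost per uncached prompt token

    def estimated_ttft(self) -> float:
        # TTFT of the last request in the queue: it waits for everything ahead of it,
        # and tokens hit in the prefix cache cost (almost) nothing to recompute.
        return sum((r.prompt_len - r.prefix_prompt_len) * self.secs_per_token
                   for r in self.queue)

@dataclass
class DecodingInstance:
    batch_size: int = 0
    base_step_secs: float = 0.03     # assumed fixed per-step cost
    secs_per_seq: float = 0.0005     # assumed marginal cost per sequence in the batch

    def estimated_tbt(self) -> float:
        # Time-between-tokens for the requests currently in the batch.
        return self.base_step_secs + self.batch_size * self.secs_per_seq

# Pick the prefill instance whose queue gives the lowest estimated TTFT.
prefills = [PrefillInstance(queue=[Request(4096, 1024), Request(2048)]),
            PrefillInstance(queue=[Request(512)])]
best = min(prefills, key=PrefillInstance.estimated_ttft)
print([round(p.estimated_ttft(), 3) for p in prefills], round(best.estimated_ttft(), 3))
```

The point is only that TTFT/TBT can be predicted from metadata the scheduler already holds, so no extra probing of the instances is required.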

I hope this helps clarify your questions. Please let me know if there's anything else I can assist you with.

@VegetaPn
Author

@chestnut-Q Thank you very much for your reply!

I have a few more questions I'd like to ask:

  1. Regarding the second point: data such as queue_duration and batch_size belong to the inference instances. How are they synchronized to Conductor, by carrying them in the inference results or by reporting them at a fixed frequency?

  2. Regarding the third point: since Conductor is a single-instance service, is there currently a fallback mechanism for when Conductor becomes unavailable or goes down? If it evolves into a multi-instance service, are there any good solutions in place? What I can think of is creating replicas of Conductor and, if the primary node fails, promoting one of the replicas to primary. However, this would require maintaining very strong consistency between the primary and secondary nodes.

  3. What is the call chain between Conductor and the inference instances? 1) Does Conductor first call Prefill, get the first token, and then call Decoding; 2) does it call Prefill and Decoding simultaneously; or 3) does Conductor only call Decoding, with Decoding then initiating a call to Prefill? Although it seems to be the first method in vLLM, I would like to understand the calling relationship in Kimi, because the latter two methods seem more friendly to KVCache transmission.

  4. Looking at the current open-source code of TransferEngine, if two nodes are to perform a KVCache transfer, the initiating node needs to know the virtual address on the target node and the rkey of the MemoryRegion. Are these address details also managed by Conductor? If Conductor does not manage the address information of the blocks, is it then necessary to pre-communicate the block address information between the two nodes? Does this imply that there is communication between P and D?

  5. If there is communication between P and D, is the metadata management service of TransferEngine unnecessary? Wouldn't it be simpler to convey the metadata needed for RDMA directly in the communication between P and D?

Your response has been very helpful to me, thank you again!

@chestnut-Q
Collaborator

Hello @VegetaPn, here are the answers:

  1. Conductor already has information on all of an instance's requests when scheduling a new one, and it maintains this information based on the responses returned by the instances (a rough sketch follows this list).
  2. Because Conductor has a very low chance of being unavailable, we generally treat it as though there is only a single instance.
  3. We choose (1).
  4 & 5. There is no communication between P and D. Instead, an additional metadata server is set up outside of Conductor, aimed at simplifying the interface and facilitating scalability.
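
For point 1, a toy sketch of how a scheduler can keep its per-instance view up to date purely from the requests it dispatches and the responses it receives; the names and structure are illustrative, not the actual Conductor:

```python
from collections import defaultdict

class InstanceView:
    """Conductor-side bookkeeping of what each instance is currently working on.
    Updated only when a request is dispatched or a response arrives, so no extra
    synchronization traffic is needed (illustrative sketch)."""

    def __init__(self):
        # instance_id -> {request_id: prompt_len}
        self.outstanding = defaultdict(dict)

    def on_dispatch(self, instance_id: str, request_id: str, prompt_len: int) -> None:
        self.outstanding[instance_id][request_id] = prompt_len

    def on_response(self, instance_id: str, request_id: str) -> None:
        # The instance finished this request; drop it from the view.
        self.outstanding[instance_id].pop(request_id, None)

    def queued_tokens(self, instance_id: str) -> int:
        # A simple load signal derived from the tracked state.
        return sum(self.outstanding[instance_id].values())

view = InstanceView()
view.on_dispatch("prefill-0", "req-1", prompt_len=4096)
view.on_dispatch("prefill-0", "req-2", prompt_len=1024)
view.on_response("prefill-0", "req-1")
print(view.queued_tokens("prefill-0"))  # 1024
```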

If you have any more questions, please feel free to ask.

@VegetaPn
Author

@chestnut-Q Thank you very much! I have a few more questions I'd like to ask:

  1. Based on your response, Conductor manages the memory allocation for Prefill instances; does Conductor also manage the memory allocation for Decoding? The length of the response generated during the decode phase is uncertain, which would make it challenging for Conductor to manage the memory allocation for D.
  2. Regarding the metadata server, can it be understood as follows: the metadata server stores static information, such as the address of each block, rather than frequently changing dynamic information. During KVCache transmission, the block IDs have already been obtained from the request, so it is only necessary to look up the actual virtual address/rkey in the metadata server, and that data can even be cached locally.
  3. If Conductor requests the Prefill instance first, will the Prefill instance transmit the KVCache to the Decoding instance layer by layer while it computes, or will it wait until Conductor requests the Decoding instance before performing the KVCache transfer?
  4. In a multi-round dialogue scenario, when the second-round request arrives for Prefill, the longest reusable KVCache is on the Decoding instance (which holds the KVCache generated during the first round). In that case, will the Prefill instance pull the KVCache from the Decoding instance?

@chestnut-Q
Collaborator

chestnut-Q commented Dec 28, 2024

@VegetaPn, here is a straightforward and easy-to-implement approach: Conductor does not manage the KVCache of the decoding instance, which means the decoding node does not cache the KVCache. There is no layer-by-layer KVCache transfer between the prefill and decoding nodes. These optimizations are not discussed in the paper but can be implemented on top of it. Your understanding of point 2 is correct.
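
To illustrate the confirmed understanding of point 2: the block IDs already arrive with the request, and only the static (base address, rkey) of the remote segment needs to be fetched from the metadata service, after which it can be cached locally. The segment layout and every name below are assumptions for illustration, not the TransferEngine API:

```python
from dataclasses import dataclass
from functools import lru_cache
from typing import Tuple

@dataclass(frozen=True)
class SegmentInfo:
    base_addr: int    # remote virtual address of the registered memory region
    rkey: int         # RDMA remote key for that region
    block_size: int   # bytes per KVCache block

# Stand-in for the metadata service: static per-node registration info.
_METADATA = {
    "prefill-node-0": SegmentInfo(base_addr=0x7F0000000000, rkey=0x1234,
                                  block_size=2 << 20),
}

@lru_cache(maxsize=None)          # the data is static, so caching it locally is safe
def lookup_segment(node: str) -> SegmentInfo:
    return _METADATA[node]        # in reality, an RPC to the metadata server

def block_remote_addr(node: str, block_id: int) -> Tuple[int, int]:
    # Translate a logical block ID (already known from the request) into the
    # (remote address, rkey) pair needed to issue the RDMA read/write.
    seg = lookup_segment(node)
    return seg.base_addr + block_id * seg.block_size, seg.rkey

addr, rkey = block_remote_addr("prefill-node-0", block_id=42)
print(hex(addr), hex(rkey))
```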

@VegetaPn
Author

@chestnut-Q, if it's not a layer-by-layer transfer, then the entire KVCache can only be transferred after the prefill execution has completed, which would likely be inefficient. How do you handle the load and efficiency concerns of this transfer? Have any related optimizations been implemented in Kimi?

@chestnut-Q
Collaborator

@VegetaPn Yes, but this asynchronous transfer does not interfere with the model inference on GPUs.
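
As a generic illustration of what such an asynchronous, off-the-critical-path transfer can look like (made-up names, not the Messenger implementation): the prefill loop hands the finished KVCache to a background worker and immediately moves on, so GPU inference is never blocked by the transfer.

```python
import queue
import threading
import time

transfer_queue = queue.Queue()   # (request_id, kvcache_blob) items

def messenger_worker() -> None:
    # Runs off the critical path; the GPU keeps serving other requests meanwhile.
    while True:
        request_id, kvcache_blob = transfer_queue.get()
        time.sleep(0.01)  # stand-in for the actual RDMA transfer of the KVCache
        print(f"transferred KVCache for {request_id} ({len(kvcache_blob)} bytes)")
        transfer_queue.task_done()

threading.Thread(target=messenger_worker, daemon=True).start()

# Prefill loop: enqueue the finished KVCache and immediately move on to the next request.
for i in range(3):
    transfer_queue.put((f"req-{i}", b"\x00" * 1024))

transfer_queue.join()  # only so the demo waits; the real loop never blocks here
```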

@woodyji

woodyji commented Dec 29, 2024

The transfer of these KVCache blocks across CPUs and GPUs is handled by a separate (GPUDirect) RDMA-based component called Messenger. This architecture also enables us to provide the context caching API to outside users for higher reuse of KVCache.

  1. Based on your previous response and your paper, does the prefill instance write KVCache to the decode instance using a GPUDirect-RDMA WRITE operation? Is the source address of the prefill instance in GPU space while the target address of the decode instance is in CPU space?

  2. The prefill instance transfers the KVCache layer by layer. Once the prefill instance has completed all computations and transfers, Conductor begins scheduling the decode instance. Therefore, does the decode instance need to copy the KVCache from CPU to GPU without any overlap during the request's life cycle?

@VegetaPn

@chestnut-Q
Collaborator

We believe that using CPU-based two-hop transfers is easier to implement and scale, as it requires fewer modifications to the existing inference framework. Moreover, KVCache transfers will not affect the inference of other requests, which is advantageous for systems where throughput is a priority. The choice of transfer method depends on the storage medium (e.g., HBM, DRAM, etc.) and the hardware setup. You can refer to our mooncake-transfer-engine implementation for more details.
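
To make the two-hop idea concrete (GPU -> CPU on the prefill node, CPU -> CPU over RDMA by Messenger, CPU -> GPU on the decode node), here is what the first and last hops could look like with PyTorch pinned-memory staging buffers. The middle hop and all names are placeholders, not the mooncake-transfer-engine API:

```python
import torch

def offload_to_cpu(kv_gpu: torch.Tensor) -> torch.Tensor:
    # Prefill side: copy a KVCache block from HBM into pinned DRAM so the
    # RDMA NIC can later read it without involving the GPU again.
    staging = torch.empty(kv_gpu.shape, dtype=kv_gpu.dtype, device="cpu",
                          pin_memory=torch.cuda.is_available())
    staging.copy_(kv_gpu, non_blocking=True)  # can overlap with other GPU work
    return staging

def load_to_gpu(kv_cpu: torch.Tensor, device: str = "cuda") -> torch.Tensor:
    # Decode side: after Messenger has landed the block in local DRAM,
    # copy it into HBM before the decode batch needs it.
    return kv_cpu.to(device, non_blocking=True)

kv_block = torch.randn(2, 8, 16, 128,
                       device="cuda" if torch.cuda.is_available() else "cpu")
cpu_copy = offload_to_cpu(kv_block)
# ... Messenger would RDMA-write cpu_copy's buffer into the decode node's DRAM here ...
if torch.cuda.is_available():
    kv_on_decode_gpu = load_to_gpu(cpu_copy)
```

Because the GPU-side copies run on their own stream and the RDMA hop only touches DRAM, the transfer stays out of the way of other requests' inference, which is the throughput advantage mentioned above.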

@woodyji

@woodyji

woodyji commented Dec 29, 2024

I noticed that the mooncake-transfer-engine supports multiple types of mediums. However, I'm confused about the actual choices made by mooncake, as the paper mentions both GPU-to-CPU and CPU-to-CPU transfers. @chestnut-Q

@chestnut-Q
Collaborator

@woodyji Mooncake supports multiple types of transfers to accommodate different hardware conditions. However, you can think of it as GPU -> CPU -> GPU because this modification is the simplest, as I mentioned above.

@woodyji

woodyji commented Dec 30, 2024

Thank you for your response. I have two additional questions regarding KVCache transfer:

  1. In the GPU -> CPU -> GPU mode, should the prefill process first offload the KVCache from GPU to CPU, followed by the Messenger handling the CPU-to-CPU transfer via RDMA? Once the transfer is complete, does the decode process perform an async load from CPU to GPU?

  2. As mentioned in your paper, the Messenger operates as an independent process. Does it share the same CPU address space with the inference process? How do you implement the sharing of such a large address space across multiple processes?

@chestnut-Q

@chestnut-Q
Collaborator

@woodyji

In the GPU -> CPU -> GPU mode, should the prefill process first offload the KVCache from GPU to CPU, followed by the Messenger handling the CPU-to-CPU transfer via RDMA? Once the transfer is complete, does the decode process perform an async load from CPU to GPU?

Yes.

As mentioned in your paper, the Messenger operates as an independent process. Does it share the same CPU address space with the inference process? How do you implement the sharing of such a large address space across multiple processes?

You can use shared memory to implement this.
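
A minimal sketch of the shared-memory idea, using Python's multiprocessing.shared_memory purely for illustration (Mooncake's Messenger is a separate native component, and a real deployment would also register this region with the RDMA NIC; the pool name and sizes below are made up):

```python
import numpy as np
from multiprocessing import shared_memory

BLOCK_BYTES = 1 << 20   # assumed KVCache block size
NUM_BLOCKS = 16         # assumed pool size

# Inference process: create the DRAM KVCache pool once; Messenger attaches by name.
pool = shared_memory.SharedMemory(create=True, size=BLOCK_BYTES * NUM_BLOCKS,
                                  name="kvcache_pool_demo")
blocks = np.ndarray((NUM_BLOCKS, BLOCK_BYTES), dtype=np.uint8, buffer=pool.buf)
blocks[3, :8] = list(b"KV-DEMO\0")   # the inference process fills block 3

# "Messenger" (simulated here in the same process): attach to the same region by
# name and read block 3 directly, with no copy between the two processes.
peer = shared_memory.SharedMemory(name="kvcache_pool_demo")
peer_blocks = np.ndarray((NUM_BLOCKS, BLOCK_BYTES), dtype=np.uint8, buffer=peer.buf)
print(bytes(peer_blocks[3, :8]))

peer.close()
pool.close()
pool.unlink()
```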

@woodyji

woodyji commented Dec 30, 2024

The workflow makes sense. However, will the decode instance transfer the incremental KVCache to the prefill instance for better KVCache reuse? The paper does not seem to clearly state this. @chestnut-Q

@chestnut-Q
Collaborator

The workflow makes sense. However, will the decode instance transfer the incremental KVCache to the prefill instance for better KVCache reuse? The paper does not seem to clearly state this. @chestnut-Q

Please refer to my previous reply. @woodyji

@woodyji

woodyji commented Jan 2, 2025

In the GPU -> CPU -> GPU mode, will simultaneous offloading and data transfer lead to PCIe bandwidth contention?
@chestnut-Q
