[MooncakeStore] Introduce MooncakeStore: A new Distributed Object Store for XpYd PD disaggregation (#126)

Mooncake Store is a high-performance **distributed key-value (KV) cache storage engine** designed specifically for LLM inference scenarios.

Unlike traditional caching systems such as Redis or Memcached, Mooncake Store is positioned as **a distributed KV cache rather than a generic caching system**. The key difference is that in a KV cache the key is derived from the value through hashing, so the value is immutable after insertion (although the key/value pair may be garbage collected).

Mooncake Store provides low-level object storage and management capabilities, while specific caching strategies (e.g., eviction policies) are left to upper-layer frameworks (like vLLM) or users for implementation, offering higher flexibility and customizability.

Key features of Mooncake Store include:

- **Object-level storage operations**: Mooncake Store provides simple and easy-to-use object-level APIs, including `Put`, `Get`, and `Remove` operations.
- **Multi-replica support**: Mooncake Store supports storing multiple data replicas for the same object, effectively alleviating hotspots in access pressure.
- **Eventual consistency**: Mooncake Store ensures that `Get` operations read complete and correct data, but does not guarantee that they return the latest written data. This eventual consistency model ensures high performance while simplifying system design.
- **High bandwidth utilization**: Mooncake Store supports striping and parallel I/O transfer of large objects, fully utilizing multi-NIC aggregated bandwidth for high-speed data reads and writes.
- **Dynamic resource scaling**: Mooncake Store supports dynamically adding and removing nodes to flexibly handle changes in system load, achieving elastic resource management (to be refined in future versions).
---------

Co-authored-by: Feng Ren
Co-authored-by: Teng Ma
Co-authored-by: Shangming Cai
Co-authored-by: susun
Co-authored-by: liusy58
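
The commit message describes a store whose keys are derived from values by hashing, with `Put`, `Get`, and `Remove` operations and multiple replicas per object. The following is a minimal in-memory sketch of that idea only. It is not the Mooncake Store API: the class name `ToyContentAddressedStore`, the use of SHA-256, the method signatures, and the consecutive-node placement rule are all assumptions made for illustration.

```python
import hashlib


class ToyContentAddressedStore:
    """Toy in-memory model of a content-addressed KV cache.

    Hypothetical illustration only; this is NOT the Mooncake Store API.
    """

    def __init__(self, num_nodes=3, replicas=2):
        if not 1 <= replicas <= num_nodes:
            raise ValueError("replicas must be between 1 and num_nodes")
        # Each "node" is a plain dict standing in for a storage server.
        self.nodes = [dict() for _ in range(num_nodes)]
        self.replicas = replicas

    def _placement(self, key):
        # Assumed placement rule: `replicas` consecutive nodes, starting
        # at an index derived from the key.
        start = int(key, 16) % len(self.nodes)
        return [(start + i) % len(self.nodes) for i in range(self.replicas)]

    def put(self, value: bytes) -> str:
        # The key is the hash of the value, so a given key can only ever
        # map to one value: stored objects are effectively immutable.
        key = hashlib.sha256(value).hexdigest()
        for idx in self._placement(key):
            self.nodes[idx].setdefault(key, value)
        return key

    def get(self, key, replica=0):
        # Any replica can serve a read, which spreads load for hot objects.
        idx = self._placement(key)[replica % self.replicas]
        return self.nodes[idx].get(key)

    def remove(self, key) -> bool:
        # Drop every replica; report whether anything was actually stored.
        removed = False
        for idx in self._placement(key):
            removed = self.nodes[idx].pop(key, None) is not None or removed
        return removed


store = ToyContentAddressedStore(num_nodes=4, replicas=2)
key = store.put(b"kv-block")
assert store.get(key, replica=0) == store.get(key, replica=1) == b"kv-block"
```

Because the key is a function of the value, putting the same bytes twice yields the same key and never overwrites different data, which is why eviction or garbage collection is the only way an entry changes state in this model.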
kvcache-ai · Mar 7, 2025 · ac98ea9
1 parent 6cf9381
commit ac98ea9
Showing 89 changed files with 25,715 additions and 5 deletions.
diff --git a/.gitignore b/.gitignore
@@ -175,4 +175,17 @@ cython_debug/
 
 *.etcd
 
+# Ignore generated protobuf files
+*.pb.cc
+*.pb.cpp
+*.pb.h
+
 thirdparties
+
+# JetBrains IDEs
+.idea
+cmake-build-debug
+cmake-build-release
+cmake-build-relwithdebinfo
+cmake-build-minsizerel
+cmake-build
diff --git a/CMakeLists.txt b/CMakeLists.txt
@@ -33,9 +33,22 @@ option(USE_ETCD "option for enable etcd as metadata server" ON)
 option(USE_REDIS "option for enable redis as metadata server" OFF)
 option(USE_HTTP "option for enable http as metadata server" OFF)
 
+option(WITH_STORE "build mooncake store library and sample code" ON)
 option(WITH_P2P_STORE "build p2p store library and sample code" OFF)
 option(WITH_RUST_EXAMPLE "build the Rust interface and sample code for the transfer engine" OFF)
 
+find_package(Python3 REQUIRED Interpreter Development)
+find_package(pybind11 REQUIRED)
+set(PYTHON_EXECUTABLE "python3")
+execute_process(
+    COMMAND ${PYTHON_EXECUTABLE} -c "import sys; print(sys.path[-1])"
+    OUTPUT_VARIABLE PYTHON_SYS_PATH
+)
+string(STRIP ${PYTHON_SYS_PATH} PYTHON_SYS_PATH)
+
+set(PYBIND11_FINDPYTHON ON)
+find_package(pybind11 CONFIG REQUIRED) 
+
 if (USE_CUDA)
   add_compile_definitions(USE_CUDA)
   message(STATUS "CUDA support is enabled")
@@ -74,6 +87,13 @@ add_subdirectory(mooncake-transfer-engine)
 
 include_directories(mooncake-transfer-engine/include)
 
+if (WITH_STORE)
+  message(STATUS "Mooncake Store will be built")
+  find_package(pybind11 CONFIG REQUIRED)
+  add_subdirectory(mooncake-store)
+  include_directories(mooncake-store/include)
+endif()
+
 add_subdirectory(mooncake-integration)
 
 if (WITH_P2P_STORE AND USE_CUDA)

diff --git a/README.md b/README.md
@@ -14,11 +14,12 @@
 
 
 Mooncake is the serving platform for  <a href="https://kimi.ai/"><img src="image/kimi.png" alt="icon" style="height: 16px; vertical-align: middle;"> Kimi</a>, a leading LLM service provided by <a href="https://www.moonshot.cn/"><img src="image/moonshot.jpg" alt="icon" style="height: 16px; vertical-align: middle;"> Moonshot AI</a>.
-Now the core of Mooncake - Transfer Engine is open-sourced!
+Now both the Transfer Engine and Mooncake Store are open-sourced!
 This repository also hosts its technical report and the open sourced traces. 
 
 <h2 id="updates">🔄 Updates</h2>
 
+ - **Mar 7, 2025**: We open sourced the Mooncake Store, a distributed KVCache based on Transfer Engine. vLLM's xPyD disaggregated prefilling & decoding based on Mooncake Store will be released soon.
  - **Feb 25, 2025**: Mooncake receives the **Best Paper Award** at **FAST 2025**!
  - **Feb 21, 2025**: The updated <a href="FAST25-release/traces" target="_blank">traces</a> used in our FAST'25 paper have been released.
  - **Dec 16, 2024**: vLLM officially supports Mooncake Transfer Engine for disaggregated prefilling and KV cache transfer.
@@ -44,7 +45,7 @@ The core of Mooncake is its KVCache-centric scheduler, which balances maximizing
 - The bottom part of Mooncake is **Transfer Engine**, which supports rapid, reliable and flexible data transfer over TCP, RDMA, NVIDIA GPUDirect-based RDMA and NVMe over Fabric (NVMe-of) protocols. Compared with gloo (used by Distributed PyTorch) and TCP, Mooncake Transfer Engine has the lowest I/O latency.
 - Based on **Transfer Engine**, we implemented the **P2P Store** library, which supports sharing temporary objects (e.g., checkpoint files) among nodes in a cluster. It avoids bandwidth saturation on a single machine.
 - Additionally, we modified vLLM so that **Transfer Engine** is integrated. It makes prefill-decode disaggregation more efficient by utilizing RDMA devices. 
-- In the future, we plan to build **Mooncake Store** on the basis of **Transfer Engine**, which supports pooled KVCache for more flexible P/D disaggregation.
+- **Mooncake Store** is built on **Transfer Engine** and provides a distributed, pooled KVCache for vLLM's xPyD disaggregation. 
 
 <h2 id="show-cases">🔥 Show Cases</h2>
 
@@ -69,6 +70,9 @@ With 40 GB of data (equivalent to the size of the KVCache generated by 128k toke
 P2P Store is built on the Transfer Engine and supports sharing temporary objects between peer nodes in a cluster. P2P Store is ideal for scenarios like checkpoint transfer, where data needs to be rapidly and efficiently shared across a cluster. 
 **P2P Store has been used in the checkpoint transfer service of Moonshot AI.**
 
+### Mooncake Store ([Guide](doc/en/mooncake-store-preview.md))
+Mooncake Store is a distributed KVCache storage engine specialized for LLM inference. It offers object-level APIs (`Put`, `Get` and `Remove`), and we will soon release a new vLLM integration to demonstrate xPyD disaggregation. Mooncake Store is the central component of the KVCache-centric disaggregated architecture.
+
 #### Highlights
 - **Decentralized architecture.** P2P Store leverages a pure client-side architecture with global metadata managed by the etcd service.
 
@@ -82,7 +86,9 @@ Thanks to the high performance of Transfer Engine, P2P Stores can also distribut
 ### vLLM Integration ([Guide v0.2](doc/en/vllm-integration-v0.2.md))
 To optimize LLM inference, the vLLM community is working on supporting [disaggregated prefilling (PR 10502)](https://github.com/vllm-project/vllm/pull/10502). This feature allows separating the **prefill** phase from the **decode** phase in different processes. vLLM uses `nccl` and `gloo` as the transport layer by default, but currently it cannot efficiently decouple the two phases across different machines.
 
-We have implemented vLLM integration, which uses Transfer Engine as the network layer instead of `nccl` and `gloo`, to support **inter-node KVCache transfer** [(PR 10884)](https://github.com/vllm-project/vllm/pull/10884). Transfer Engine provides simpler interfaces and more efficient use of RDMA devices. In the future, we plan to build Mooncake Store on the basis of Transfer Engine, which supports pooled prefill/decode disaggregation.
+We have implemented vLLM integration, which uses Transfer Engine as the network layer instead of `nccl` and `gloo`, to support **inter-node KVCache transfer** [(PR 10884)](https://github.com/vllm-project/vllm/pull/10884). Transfer Engine provides simpler interfaces and more efficient use of RDMA devices. 
+
+We will soon release the new vLLM integration based on Mooncake Store, which supports xPyD prefill/decode disaggregation.
 
 **_Update[Dec 16, 2024]: Here is the latest vLLM Integration ([Guide v0.2](doc/en/vllm-integration-v0.2.md)) that is based on vLLM's main branch._**
 

diff --git a/doc/en/build.md b/doc/en/build.md
@@ -146,6 +146,7 @@ This document describes how to build Mooncake.
 Mooncake supports the following advanced compile options:
 - `-DUSE_CUDA=[ON|OFF]`: Enable GPU Direct RDMA & NVMe-of support. 
 - `-DUSE_CXL=[ON|OFF]`: Enable CXL protocols. 
+- `-DWITH_STORE=[ON|OFF]`: Build Mooncake Store.
 - `-DWITH_P2P_STORE=[ON|OFF]`: Enable Golang support and build P2P Store. 
 - `-DWITH_RUST_EXAMPLE=[ON|OFF]`: Enable Rust language support.
 - `-DUSE_REDIS=[ON|OFF]`: Enable Redis as metadata server in Mooncake (`hiredis` required).