[MooncakeStore] Introduce MooncakeStore: A new Distributed Object Store for XpYd PD disaggregation (#126)

Mooncake Store is a high-performance **distributed key-value (KV) cache storage engine** designed specifically for LLM inference scenarios.

Unlike traditional caching systems such as Redis or Memcached, Mooncake Store is positioned as **a distributed KV cache rather than a generic caching system**. The key difference is that in a KV cache the key is derived from the value through hashing, so the value is immutable once inserted (although the key/value pair may still be garbage collected).
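
As a rough illustration of what "the key is derived from the value" implies, here is a minimal in-memory sketch. The class name, the choice of SHA-256, and the method names are assumptions made for this example only; the commit message does not say which hash is used or which layer computes it, and this is not Mooncake Store's actual API.

```python
import hashlib


class ContentAddressedStore:
    """Toy in-memory store whose keys are hashes of the stored values (hypothetical)."""

    def __init__(self):
        self._objects = {}

    def put(self, value: bytes) -> str:
        # The key is a pure function of the value (SHA-256 is an assumption).
        key = hashlib.sha256(value).hexdigest()
        # Identical bytes always map to the same key, so an existing entry
        # is never overwritten with different content.
        self._objects.setdefault(key, value)
        return key

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def remove(self, key: str) -> None:
        # Removal (e.g. garbage collection) is still allowed.
        self._objects.pop(key, None)


store = ContentAddressedStore()
key = store.put(b"kv-cache block")
same_key = store.put(b"kv-cache block")
other_key = store.put(b"another block")
```

Because a different value necessarily produces a different key, "updating" an object in this model means inserting a new object rather than mutating the old one.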

Mooncake Store provides low-level object storage and management capabilities, while specific caching strategies (e.g., eviction policies) are left to upper-layer frameworks (like vLLM) or users for implementation, offering higher flexibility and customizability.
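
To make this division of labor concrete, the sketch below layers an LRU eviction policy on top of a store that only offers put/get/remove. `DictStore` and `LRUPolicy` are hypothetical names invented for this example; they stand in for the storage layer and for an upper-layer framework respectively, and do not reflect Mooncake Store's or vLLM's real interfaces.

```python
from collections import OrderedDict


class DictStore:
    """Hypothetical stand-in for an object store exposing put/get/remove."""

    def __init__(self):
        self.data = {}

    def put(self, key, value):
        self.data[key] = value

    def get(self, key):
        return self.data[key]

    def remove(self, key):
        self.data.pop(key, None)


class LRUPolicy:
    """Upper-layer eviction policy; the store itself never decides what to evict."""

    def __init__(self, store, capacity):
        self.store = store
        self.capacity = capacity
        self._order = OrderedDict()  # keys, least recently used first

    def put(self, key, value):
        if key not in self._order:
            self.store.put(key, value)
        # Mark the key as most recently used.
        self._order[key] = None
        self._order.move_to_end(key)
        # Evict the least recently used keys once over capacity.
        while len(self._order) > self.capacity:
            victim, _ = self._order.popitem(last=False)
            self.store.remove(victim)

    def get(self, key):
        value = self.store.get(key)  # raises KeyError if the key is absent
        self._order.move_to_end(key)
        return value


backend = DictStore()
cache = LRUPolicy(backend, capacity=2)
cache.put("a", b"1")
cache.put("b", b"2")
cache.get("a")        # "a" becomes most recently used
cache.put("c", b"3")  # evicts "b", the least recently used key
```

Swapping in a different policy only requires replacing `LRUPolicy`; the storage layer stays unchanged.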

Key features of Mooncake Store include:
- **Object-level storage operations**: Mooncake Store provides simple and easy-to-use object-level APIs, including `Put`, `Get`, and `Remove` operations.
- **Multi-replica support**: Mooncake Store can store multiple replicas of the same object, which relieves access hotspots.
- **Eventual consistency**: Mooncake Store ensures that `Get` operations read complete and correct data, but does not guarantee that they return the most recently written data. This eventual consistency model keeps performance high while simplifying system design.
- **High bandwidth utilization**: Mooncake Store supports striping and parallel I/O transfer of large objects, fully utilizing multi-NIC aggregated bandwidth for high-speed data reads and writes.
- **Dynamic resource scaling**: Mooncake Store supports dynamically adding and removing nodes to flexibly handle changes in system load, achieving elastic resource management (to be refined in future versions).
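
The striping idea in the list above can be sketched as follows: a large object is cut into fixed-size stripes that are assigned round-robin to several "lanes" (standing in for NICs), so the lanes can be transferred in parallel and the object reassembled from the stripe indices. The function names, the round-robin assignment, and the lane abstraction are assumptions for illustration; the commit message does not describe the actual striping scheme.

```python
def stripe(data: bytes, stripe_size: int, num_lanes: int):
    """Split data into stripes and assign them round-robin to lanes."""
    if stripe_size <= 0 or num_lanes <= 0:
        raise ValueError("stripe_size and num_lanes must be positive")
    lanes = [[] for _ in range(num_lanes)]
    for index, offset in enumerate(range(0, len(data), stripe_size)):
        payload = data[offset:offset + stripe_size]
        # Keep the stripe index so the receiver can restore the order.
        lanes[index % num_lanes].append((index, payload))
    return lanes


def reassemble(lanes) -> bytes:
    """Merge stripes from all lanes back into the original byte string."""
    chunks = sorted(
        (chunk for lane in lanes for chunk in lane),
        key=lambda chunk: chunk[0],
    )
    return b"".join(payload for _, payload in chunks)


data = bytes(range(10))
lanes = stripe(data, stripe_size=3, num_lanes=2)
restored = reassemble(lanes)
```

In this toy version each lane is just a list; in a real system each lane's stripes would be handed to a separate transfer path.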

---------

Co-authored-by: Feng Ren
Co-authored-by: Teng Ma
Co-authored-by: Shangming Cai
Co-authored-by: susun
Co-authored-by: liusy58
6 people authored Mar 7, 2025
1 parent 6cf9381 commit ac98ea9
Showing 89 changed files with 25,715 additions and 5 deletions.
13 changes: 13 additions & 0 deletions .gitignore
@@ -175,4 +175,17 @@ cython_debug/

*.etcd

# Ignore generated protobuf files
*.pb.cc
*.pb.cpp
*.pb.h

thirdparties

# JetBrains IDEs
.idea
cmake-build-debug
cmake-build-release
cmake-build-relwithdebinfo
cmake-build-minsizerel
cmake-build
20 changes: 20 additions & 0 deletions CMakeLists.txt
@@ -33,9 +33,22 @@ option(USE_ETCD "option for enable etcd as metadata server" ON)
option(USE_REDIS "option for enable redis as metadata server" OFF)
option(USE_HTTP "option for enable http as metadata server" OFF)

option(WITH_STORE "build mooncake store library and sample code" ON)
option(WITH_P2P_STORE "build p2p store library and sample code" OFF)
option(WITH_RUST_EXAMPLE "build the Rust interface and sample code for the transfer engine" OFF)

find_package(Python3 REQUIRED Interpreter Development)
find_package(pybind11 REQUIRED)
set(PYTHON_EXECUTABLE "python3")
execute_process(
COMMAND ${PYTHON_EXECUTABLE} -c "import sys; print(sys.path[-1])"
OUTPUT_VARIABLE PYTHON_SYS_PATH
)
string(STRIP ${PYTHON_SYS_PATH} PYTHON_SYS_PATH)

set(PYBIND11_FINDPYTHON ON)
find_package(pybind11 CONFIG REQUIRED)

if (USE_CUDA)
add_compile_definitions(USE_CUDA)
message(STATUS "CUDA support is enabled")
@@ -74,6 +87,13 @@ add_subdirectory(mooncake-transfer-engine)

include_directories(mooncake-transfer-engine/include)

if (WITH_STORE)
message(STATUS "Mooncake Store will be built")
find_package(pybind11 CONFIG REQUIRED)
add_subdirectory(mooncake-store)
include_directories(mooncake-store/include)
endif()

add_subdirectory(mooncake-integration)

if (WITH_P2P_STORE AND USE_CUDA)
12 changes: 9 additions & 3 deletions README.md
@@ -14,11 +14,12 @@


Mooncake is the serving platform for <a href="https://kimi.ai/"><img src="image/kimi.png" alt="icon" style="height: 16px; vertical-align: middle;"> Kimi</a>, a leading LLM service provided by <a href="https://www.moonshot.cn/"><img src="image/moonshot.jpg" alt="icon" style="height: 16px; vertical-align: middle;"> Moonshot AI</a>.
Now the core of Mooncake - Transfer Engine is open-sourced!
Now both the Transfer Engine and Mooncake Store are open-sourced!
This repository also hosts its technical report and the open-sourced traces.

<h2 id="updates">🔄 Updates</h2>

- **Mar 7, 2025**: We open-sourced Mooncake Store, a distributed KVCache built on Transfer Engine. vLLM's xPyD disaggregated prefilling & decoding based on Mooncake Store will be released soon.
- **Feb 25, 2025**: Mooncake receives the **Best Paper Award** at **FAST 2025**!
- **Feb 21, 2025**: The updated <a href="FAST25-release/traces" target="_blank">traces</a> used in our FAST'25 paper have been released.
- **Dec 16, 2024**: vLLM officially supports Mooncake Transfer Engine for disaggregated prefilling and KV cache transfer.
@@ -44,7 +45,7 @@ The core of Mooncake is its KVCache-centric scheduler, which balances maximizing
- The bottom part of Mooncake is **Transfer Engine**, which supports rapid, reliable and flexible data transfer over TCP, RDMA, NVIDIA GPUDirect-based RDMA and NVMe over Fabric (NVMe-of) protocols. Compared with gloo (used by Distributed PyTorch) and TCP, Mooncake Transfer Engine has the lowest I/O latency.
- Based on **Transfer Engine**, we implemented the **P2P Store** library, which supports sharing temporary objects (e.g., checkpoint files) among nodes in a cluster. It avoids bandwidth saturation on a single machine.
- Additionally, we modified vLLM so that **Transfer Engine** is integrated. It makes prefill-decode disaggregation more efficient by utilizing RDMA devices.
- In the future, we plan to build **Mooncake Store** on the basis of **Transfer Engine**, which supports pooled KVCache for more flexible P/D disaggregation.
- **Mooncake Store** is built on **Transfer Engine** and supports distributed pooled KVCache for vLLM's xPyD disaggregation.

<h2 id="show-cases">🔥 Show Cases</h2>

@@ -69,6 +70,9 @@ With 40 GB of data (equivalent to the size of the KVCache generated by 128k toke
P2P Store is built on the Transfer Engine and supports sharing temporary objects between peer nodes in a cluster. P2P Store is ideal for scenarios like checkpoint transfer, where data needs to be rapidly and efficiently shared across a cluster.
**P2P Store has been used in the checkpoint transfer service of Moonshot AI.**

### Mooncake Store ([Guide](doc/en/mooncake-store-preview.md))
Mooncake Store is a distributed KVCache storage engine specialized for LLM inference. It offers object-level APIs (`Put`, `Get` and `Remove`), and we will soon release a new vLLM integration to demonstrate xPyD disaggregation. Mooncake Store is the central component of the KVCache-centric disaggregated architecture.

#### Highlights
- **Decentralized architecture.** P2P Store leverages a pure client-side architecture with global metadata managed by the etcd service.

@@ -82,7 +86,9 @@ Thanks to the high performance of Transfer Engine, P2P Stores can also distribut
### vLLM Integration ([Guide v0.2](doc/en/vllm-integration-v0.2.md))
To optimize LLM inference, the vLLM community is working on supporting [disaggregated prefilling (PR 10502)](https://github.com/vllm-project/vllm/pull/10502). This feature allows separating the **prefill** phase from the **decode** phase in different processes. vLLM uses `nccl` and `gloo` as the transport layer by default, but currently it cannot efficiently decouple the two phases across different machines.

We have implemented vLLM integration, which uses Transfer Engine as the network layer instead of `nccl` and `gloo`, to support **inter-node KVCache transfer** [(PR 10884)](https://github.com/vllm-project/vllm/pull/10884). Transfer Engine provides simpler interfaces and more efficient use of RDMA devices. In the future, we plan to build Mooncake Store on the basis of Transfer Engine, which supports pooled prefill/decode disaggregation.
We have implemented vLLM integration, which uses Transfer Engine as the network layer instead of `nccl` and `gloo`, to support **inter-node KVCache transfer** [(PR 10884)](https://github.com/vllm-project/vllm/pull/10884). Transfer Engine provides simpler interfaces and more efficient use of RDMA devices.

We will soon release the new vLLM integration based on Mooncake Store, which supports xPyD prefill/decode disaggregation.

**_Update[Dec 16, 2024]: Here is the latest vLLM Integration ([Guide v0.2](doc/en/vllm-integration-v0.2.md)) that is based on vLLM's main branch._**

1 change: 1 addition & 0 deletions doc/en/build.md
@@ -146,6 +146,7 @@ This document describes how to build Mooncake.
Mooncake supports the following advanced compile options:
- `-DUSE_CUDA=[ON|OFF]`: Enable GPU Direct RDMA & NVMe-of support.
- `-DUSE_CXL=[ON|OFF]`: Enable CXL protocols.
- `-DWITH_STORE=[ON|OFF]`: Build Mooncake Store.
- `-DWITH_P2P_STORE=[ON|OFF]`: Enable Golang support and build P2P Store.
- `-DWITH_RUST_EXAMPLE=[ON|OFF]`: Enable Rust language support.
- `-DUSE_REDIS=[ON|OFF]`: Enable Redis as metadata server in Mooncake (`hiredis` required).