Releases: ModelEngine-Group/unified-cache-management
v0.5.0rc1
Highlight
- UCM adopts a more advanced core model architecture, with expanded support for GLM-4.x, GLM-5, and Minimax 2.5, available on both CUDA and Ascend platforms (See Latest Feature and Model Support Matrix)
- Added garbage collection support for POSIX store, improving storage lifecycle management and resource utilization. (#777)
- Optimized GSA on-device execution by fusing operators and resolving multiple performance-related issues, leading to better runtime efficiency. (#861, #862)
- Store now supports configurable CPU affinity, enabling more efficient KVCache dump and load operations. (#852)
- Improved layer-wise KV loading by introducing sequential per-layer scheduling, allowing better overlap between data loading and forward execution for enhanced throughput. (#783)
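The layer-wise loading item above describes submitting KV loads one layer at a time so that data loading overlaps with forward execution. The following is a minimal sketch of that overlap pattern, not UCM's implementation: it assumes a single I/O worker, and `load_layer_kv`, `forward_layer`, and `NUM_LAYERS` are hypothetical stand-ins introduced only for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor

NUM_LAYERS = 4  # hypothetical model depth


def load_layer_kv(layer):
    """Stand-in for reading one layer's KV blocks from a store."""
    time.sleep(0.01)  # simulated I/O latency
    return f"kv[{layer}]"


def forward_layer(layer, kv):
    """Stand-in for running one layer's forward pass with its KV."""
    time.sleep(0.01)  # simulated compute time
    return f"out[{layer}] using {kv}"


def run_layerwise(num_layers):
    outputs = []
    # One I/O worker: loads are submitted one layer at a time, in order.
    with ThreadPoolExecutor(max_workers=1) as io_pool:
        pending = io_pool.submit(load_layer_kv, 0)
        for layer in range(num_layers):
            kv = pending.result()  # wait only for the layer needed now
            if layer + 1 < num_layers:
                # Prefetch the next layer while this one computes.
                pending = io_pool.submit(load_layer_kv, layer + 1)
            outputs.append(forward_layer(layer, kv))
    return outputs


outputs = run_layerwise(NUM_LAYERS)
```

In this sketch the load of layer i+1 runs in the background while layer i computes, so the forward pass only blocks when the next layer's data is not yet ready.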
What's Changed
- [Feat] Adding an environment variable input by @Menglths in #837
- [opt] Add parallel size check in CI by @flesher0813 in #844
- [fix]Add timeout in PosixStore working thread by @qyh111 in #849
- [feat] Ascend: add mmap-based Host memory for O_DIRECT support by @mag1c-h in #850
- [opt] Remove redundant step (Install GTest) from the workflow. by @mag1c-h in #851
- [opt] Cancel the Load/Dump task proactively after it times out. by @mag1c-h in #853
- [Feat] Add mindie-llm support by @nrj868 in #848
- [opt] Cancel all timed-out tasks at once. by @mag1c-h in #859
- [bugfix] Update seq_lens_list only on NPU path by @wangwenxin0312 in #860
- [feat] Store provides the ability to configure CPU affinity. by @mag1c-h in #852
- [opt] Optimize GSA by fusing operators by @Fengli5355 in #861
- [feat] Support kvcsstore by @ayaka836 in #855
- [feat] Add NUMA-aware CPU core split for vllm worker and store threads by @wangwenxin0312 in #854
- [Bugfix] fix offline patch by @Infinite666 in #864
- [fix] Update patch to track mindie changes by @nrj868 in #863
- [CI] Add gsa online test by @dante159753 in #857
- [opt] stack-protector on EmptyStore by @mag1c-h in #866
- [opt] Enhance NPU CPU affinity resolution with NUMA fallback by @wangwenxin0312 in #865
- [bugfix] Get all CPUs for the device's local socket by @wangwenxin0312 in #869
- add timeout for PR gate pipeline by @dante159753 in #862
- [refactor] Remove NVML-based CPU affinity setting by @wangwenxin0312 in #870
- [feat] Add tensor parallel size support and update GPU memory utilization for online inference tests by @hmy98213 in #874
- [fix]Submit layerwise KV load tasks one layer at a time by @qyh111 in #783
- [feat] log with rate limit by @Lijiachen1018 in #821
- [Feat] adapt for DSA model on CUDA platform by @sumingZero in #871
- [Build] Update UCM Dockerfiles for vLLM/vLLM-Ascend v0.17.0 by @yuanzhg078 in #876
- [Usage] Move use layerwise and hit ratio into config file by @harrisonyhq in #784
- [test] multi-processor test on AIO and Shm by @mag1c-h in #873
- [Feature] Garbage Collection by @UESTC-AHao in #777
- [CI] add test set config by @dante159753 in #877
- [bugfix] Fix MLA block_table row mapping by @wangwenxin0312 in #882
- [doc] Update support matrix. by @yuanzhg078 in #880
- release 0.5.0rc1 by @yuanzhg078 in #881
- [doc] Update framework version compatibility notes by @yuanzhg078 in #885
Full Changelog: v0.4.0...v0.5.0rc1
v0.4.0
Highlight
Support SGLang
- UCM is now integrated with SGLang, enabling prefix cache offloading to Posix Store to reduce redundant computation and lower TTFT
(see Quickstart-SGLang: https://ucm.readthedocs.io/en/latest/getting-started/quickstart_sglang.html)
(#757)
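Prefix cache offloading, as described above, relies on recognizing that a new request shares a leading run of tokens with KV blocks already in the store. One common way to do this is to chain-hash fixed-size token blocks. The sketch below illustrates that idea under stated assumptions: `BLOCK_SIZE`, `block_hashes`, `lookup_prefix`, and the in-memory `set` standing in for a store are all hypothetical, and nothing here reflects the actual UCM or SGLang interfaces.

```python
import hashlib

BLOCK_SIZE = 4  # hypothetical number of tokens per KV block


def block_hashes(token_ids, block_size=BLOCK_SIZE):
    """Chain-hash full blocks so each hash identifies the whole prefix so far."""
    hashes = []
    parent = b""
    full = len(token_ids) - len(token_ids) % block_size  # ignore a partial tail
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        digest = hashlib.sha256(parent + repr(block).encode()).digest()
        hashes.append(digest.hex())
        parent = digest
    return hashes


def lookup_prefix(store, token_ids):
    """Count the leading blocks already present in the store (a set of hashes)."""
    hit = 0
    for h in block_hashes(token_ids):
        if h not in store:
            break
        hit += 1
    return hit


# Blocks dumped by a first request, then two follow-up lookups.
store = set(block_hashes([1, 2, 3, 4, 5, 6, 7, 8]))
hits = lookup_prefix(store, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
misses = lookup_prefix(store, [9, 2, 3, 4, 5, 6, 7, 8])
```

Because each hash includes its parent, a change in an early block invalidates every later block, so a lookup can stop at the first miss. Blocks that hit can be loaded instead of recomputed, which is where the TTFT reduction comes from.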
Refactor PipelineStore for Scalability and Performance
- Refactor PipelineStore into a modular, plugin-based architecture with automatic registration and runtime loading (#689)
- Improve overall performance through optimized Store implementations (e.g., cache store, posix store) and execution flow (#722, #744, #787)
UCM Connector
- UCM now additionally supports advanced parallel paradigms, including PCP / DCP and PP, enabling more flexible and scalable distributed execution (#750)
- Improve UCM connector performance by introducing optional event synchronization control (#768)
Inference Enhancement Features
- GSAOnDevice sparse attention algorithm has been upgraded with improved performance and accuracy, now fully supporting vLLM / vLLM-Ascend 0.11.0 (#659, #746, #729)
- Add support for Rerope in vLLM version 0.11.0 (#686)
- Enhance UCM logger compatibility (#760)
Document
- Add feature and model support matrix
- Extend UCM Store with scalable storage, persistence, and efficient data handling
(see: https://ucm.readthedocs.io/en/latest/developer-guide/extending_store.html)
What's Changed
- [fix] Adapt ESA to the LayerWiseConnector by @wangwenxin0312 in #681
- [doc] Add Code of Conduct by @yuanzhg078 in #684
- [Opt] GsaOnDevice cuda bugfix & optimization by @wangwenxin0312 in #659
- [CI] Modify pull request template by @yuanzhg078 in #687
- [Feature] rewrite logger module by @Lijiachen1018 in #608
- [refactor] Rename global rank and remove broadcast function by @harrisonyhq in #685
- [fix] clean code by @Lijiachen1018 in #688
- [opt] Refactor PipelineStore for Enhanced Scalability by @mag1c-h in #689
- [fix] Fix logger by @Lijiachen1018 in #690
- [doc] How to extend UCM Store by @mag1c-h in #692
- [CI] logger use zlibstatic by @Lijiachen1018 in #698
- [Bugfix] Cherry-pick modify worker_id to distinguish diff workers(#691) by @flesher0813 in #701
- [bugfix] rm unavailable lib and fix doc and update patch by @wuhuxiao in #699
- [feat] rerope feature for vllm0.11.0 by @xinSky00 in #686
- [perf] Reduce directory lock conflicts during batch dumps in PosixStore by @mag1c-h in #707
- [bugfix] fix debug log printing by @Lijiachen1018 in #706
- [bugfix] Fixed the issue of invalid LocalBuffer pointers in PCStore by @mag1c-h in #715
- [bugfix] rerope feature for vllm0.9.2 and git apply merging by @xinSky00 in #708
- [CICD] run e2e test in docker by @dante159753 in #712
- [Feature] Add readme and dataset in performance and evaluation test by @zzycode1005 in #721
- [bugfix] Adaptive modification of llmperf by @Menglths in #719
- [Feat] sparse patch for vllm-ascend v0.11.0 by @Infinite666 in #718
- [Bugfix]Fix garbled output when tp > 1 by @qyh111 in #716
- [perf] Copy Bandwidth Optimize: Multi-Stream parallelism supported in CacheStore by @mag1c-h in #722
- [Feat] sparse patch for gsa on device(GQA) va0.11.0rc1 by @Infinite666 in #726
- [Feat]Add layerwise and log_path config in run.sh by @qyh111 in #724
- [opt] Default depth of the waiting queue needs to be increased by nShard times for layer-wise by @mag1c-h in #731
- [Feat] Reuse-aware layer skipping under dynamic KV sparsification by @tedi20 in #725
- [opt] Increase the default running queue depth to support greater concurrent requests. by @qyh111 in #733
- [Feat]: Monkey patch framework for vllm 0.11.0, fix graph mode + UCM bugs by @NaganooMei in #735
- [Feat] Add csrc/ascend NPU custom ops for GSA by @leideng in #729
- [feat] Variable length IO supported in CacheStore by @mag1c-h in #734
- [Opt] Enable concurrent prefix lookup for posixstore by @sumingZero in #739
- [CI] refine docker file to use in yellow field by @dante159753 in #741
- [Feat]: Implement load failure recovery via monkey patch by @NaganooMei in #738
- [Opt]Split the thread pool into separate load and dump pools to prevent them from interfering with each other. by @qyh111 in #744
- [opt] Print TaskId in the CacheStore Error Log by @mag1c-h in #742
- [opt]Adapt variable io size by @qyh111 in #745
- [Opt] Add log timestamp in run_vllm.sh by @qyh111 in #747
- [bugfix & opt] gsaOnDevice for CUDA Graph mode by @wangwenxin0312 in #732
- [test & bugfix] fix low dump performance in posixstore e2e test by @NaganooMei in #751
- [Fix] Modify the config files of gsaondevice. by @AooooooA-C in #749
- [Test] Remove memory manager abstraction in PosixStore e2e test by @NaganooMei in #753
- [opt] CUDA Hamming Distance Kernel Optimization for GQA by @wangwenxin0312 in #755
- [fix] fix zlib gitcode url by @Lijiachen1018 in #758
- [Feature] Integrate UnifiedCache (UCM) into SGLang for Multi-Level Caching System by @pyxyzc in #757
- chore(test): Ensure that unnecessary import failures do not affect test execution by @Potterluo in #754
- [feat] GSAOnDevice for MLA Models Like DeepSeek V2/V3 in Ascend NPU by @leideng in #746
- [Feat] sparse patch for gsa on device(MLA) va0.11.0 by @Infinite666 in #761
- [Fix] fix save_speed core dump and loaded blocks num when task failed by @flesher0813 in #763
- Fix batch_size_for_hamming bug when slice is disabled (vllm-ascend 0.11.0) by @leideng in #765
- [Feat] adapt dcp&pcp by @flesher0813 in #750
- [Fix] Add init.py for rerope. by @AooooooA-C in #769
- [Refactor]monkey patch sparse feature in v0.11.0 by @ayaka836 in #743
- [Opt] update deepseek r1 config by @leideng in #770
- [feat] Introduce platform-specific sparse trigger thresholds for GPU and NPU by @wangwenxin0312 in #762
- [opt] Define UCM_ROOT_DIR to ensure safety when used UCM as a sub-repository by @mag1c-h in #772
- [opt] enable Ascend register pin optimization by @mag1c-h in #775
- [fix] remove imports that specific to platform by @dante159753 in #771
- [opt] supports lo...
v0.3.0
Highlights
- Refinement of PipelineStore Architecture and Enhancement of Core Capabilities #653 #711
- Now supports 3FS for scalable and efficient storage backends #622
- Features the new GSAOnDevice sparse attention algorithm, enabling high-performance HBM utilization across both CUDA and Ascend platforms. #647 #638
- Aligned CacheBlend with the new UCM storage and sparse engine updates to support vLLM 0.9.2. #664
Known Issues
- Layerwise is not supported when using vllm 0.11.0
  - Currently, installing with pip install uc-manager does not support using vllm 0.11.0.
  - If you need to use vLLM 0.11.0+ with UCM layerwise, please refer to vllm-project/vllm#26675 for modifications.
What's Changed
- [bugfix]cherry-pick from 0.2.0release Fix KeyError by @qyh111 in #573
- [bugfix] cherry-pick from 0.2.0release patch update by @wangwenxin0312 in #574
- [fix]cherry pick from 0.2.0-release fix monitor issue (#572) by @qyh111 in #575
- [bugfix] build hamming dist by @wangwenxin0312 in #577
- [feat]Update data file layout to adapt to garbage collection by @qyh111 in #579
- [bugfix]cherry pick from 0.2.0-release sparse patch & cmake by @wangwenxin0312 in #581
- [bugfix] kvcomp config by @wangwenxin0312 in #584
- [feat] KvCompOnDevice: per-KV-head Top-K for Qwen by @wangwenxin0312 in #589
- feature for triton rerope by @xinSky00 in #497
- [bugfix] kvcomp for qwen by @wangwenxin0312 in #594
- [bugfix] share buffer used out (cherry-picked from #592) by @mag1c-h in #598
- [fix]cherry-pick clean code and set local_rank_size to tp_size (#596) by @qyh111 in #600
- [misc] split dependency preparation logic into individual dependency files for enhanced configuration flexibility by @mag1c-h in #597
- [fix]fix clean code (#601) by @qyh111 in #602
- Modify blend and rerope docs by @xinSky00 in #593
- [docs] Modify blend introduction by @wuhuxiao in #605
- add qiongwu as codeownner by @Infinite666 in #610
- KVComp in NPU -- HBM version by @leideng in #599
- [bugfix] bugfix in PCStore, cherry-pick from release by @mag1c-h in #609
- [docs]Add doc for pipeline store by @qyh111 in #607
- [fix] remove request_succeed_dumped_blocks() in monkey patch by @xinSky00 in #613
- [fix]Sync changes from the release branch to develop, including docs, version and dockerfile by @qyh111 in #621
- [feat] Cherry-pick updates from 0.2.0-release to develop (patches and docs) by @wangwenxin0312 in #623
- [bugfix] Cherry-pick updates from 0.2.0-release (hamming compile) by @wangwenxin0312 in #625
- [doc]rename pipline_store to pipeline_store by @qyh111 in #626
- [bugfix] fix register_kv_caches patch by @Clarence-1103 in #629
- Unify xSA name as GSA by @leideng in #631
- [Feature] 3FS Store by @UESTC-AHao in #622
- [optimize]Optimized LLMPerf Test Cases by @Potterluo in #634
- [Doc] 3FS Document by @UESTC-AHao in #637
- [Feat] Basic scripts for deployment best practices by @sumingZero in #556
- [feature]Add LLM connection base components and OpenAI connector by @Potterluo in #636
- [Bugfix] Fix 3FS by @UESTC-AHao in #650
- [feat] PipelineStore Architecture Refresh and Capability Enhancement by @mag1c-h in #653
- [doc] Add contributing guide by @yuanzhg078 in #648
- [doc]Implement the function of a kv cache calculator html in User Guide by @Potterluo in #652
- [Opt] New gsa config by @leideng in #646
- [Feat] Support C++/Python to use same metrics singleton within a process by @flesher0813 in #654
- [feat]Add Layerwise Connector by @qyh111 in #656
- [Fix] Modify ucm_connector to adapt metrics by @flesher0813 in #658
- [doc] Update quickstart section in README_zh by @yuanzhg078 in #663
- [Feat] Update sparse method patches for vllm 0.11.0 by @AooooooA-C in #638
- [CI] add pr gate workflow by @dante159753 in #662
- [Opt] Gsa npu performance optimize by @leideng in #647
- [misc] Reduce gpu utilization to 6GB in test for 1.5B model by @dante159753 in #665
- [feat] add monkey patch for gsa on device v0.9.2 by @Clarence-1103 in #618
- [Fix] coredump if add new c++ metrics by @flesher0813 in #666
- [opt] adapt cache blend for store and sparse's new version by @wuhuxiao in #664
- [Doc] Update documents related to sparse. by @AooooooA-C in #672
- [CI] use requirements file to prepare test env by @dante159753 in #673
- [test]Evaluate model performance and accuracy with UCM by @ayaka836 in #642
- [Fix] Failed to start vLLM service using multi-node launch scripts under CUDA data parallelism by @sumingZero in #670
- [CI] remove logger, check branch up-to-date, fast fail e2e test by @dante159753 in #674
- release 0.3.0 by @flesher0813 in #677
- [bugfix] Fix compilation error due to missing atomic include by @harrisonyhq in #693
- [Bugfix] Modify worker_id set to separate different worker by @flesher0813 in #691
- [bugfix] rm unavailable lib and fix doc and update patch by @wuhuxiao in #700
- [perf] Reduce directory lock conflicts during batch dumps in PosixStore by @mag1c-h in #711
New Contributors
- @Infinite666 made their first contribution in #610
- @dante159753 made their first contribution in #662
- @ayaka836 made their first contribution in #642
Full Changelog: v0.2.0...v0.3.0
v0.2.0
Highlights
- Support Model Window Extrapolation: Rectified Rotary Position Embeddings (ReRoPE). (#497)
- Support sparse attention algorithms in HBM on both CUDA GPUs and Ascend NPUs. It sparsifies attention by hashing KV states and using Hamming distance Top-K selection. (#559)
- Add Pipeline Store composed of Cache Store and POSIX Store. (#553)
- Improved KV cache transfer performance for NfsStore. (#393)
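The sparse attention highlight above mentions hashing KV states and selecting the Top-K by Hamming distance. The sketch below shows the general shape of that technique in plain Python. It is illustrative only: the random-hyperplane sign hash, the constants `DIM`, `NUM_BITS`, and `TOP_K`, and all function names are assumptions made for this example, not the hashing scheme or kernels used in UCM.

```python
import heapq
import random

DIM = 16       # hypothetical head dimension
NUM_BITS = 32  # hypothetical hash width
TOP_K = 2      # hypothetical number of keys kept per query

rng = random.Random(0)
# Random hyperplanes for sign-based hashing (an assumed hashing scheme).
PLANES = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(NUM_BITS)]


def hash_vector(vec):
    """Pack the signs of NUM_BITS random projections into one integer."""
    code = 0
    for plane in PLANES:
        dot = sum(p * v for p, v in zip(plane, vec))
        code = (code << 1) | (1 if dot >= 0 else 0)
    return code


def hamming(a, b):
    """Number of differing bits between two hash codes."""
    return bin(a ^ b).count("1")


def select_top_k(query, key_codes, k):
    """Return indices of the k keys whose codes are closest to the query's."""
    q_code = hash_vector(query)
    return heapq.nsmallest(
        k, range(len(key_codes)), key=lambda i: hamming(q_code, key_codes[i])
    )


keys = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(8)]
key_codes = [hash_vector(k) for k in keys]  # hashed once, when KV is cached
selected = select_top_k(keys[3], key_codes, TOP_K)
```

Comparing compact bit codes is much cheaper than computing full attention scores, so only the selected keys need to take part in the exact attention computation.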
Known Issues
- Sparse is not supported when installing via pip
  - Currently, installing with pip install uc-manager does not support Sparse.
  - Before installing via pip, please make sure to set the platform explicitly: export PLATFORM=xxx
  - To use Sparse, please install via the Docker image or build from source.
What's Changed
- [Feature] Add performance and evaluation testing tools using the pytest framework by @zzycode1005 in #462
- [Feature] Added environment pre-check by @Menglths in #498
- [docs] fix links in docs and add clarifications (#499) by @Lijiachen1018 in #502
- [build] rewrite setup.py by @ygwpz in #501
- [bugfix] Adapt the patch to support YAML sections. by @wangwenxin0312 in #480
- [bugfix] fix pip install -e no so by @ygwpz in #508
- [Feature] Cache Blend by @wuhuxiao in #467
- merge Feature_store_next to develop by @qyh111 in #518
- [bugfix]fix setup.py by @qyh111 in #520
- [bugfix]fix setup.py (#520) by @qyh111 in #521
- feat(test): Add PostgreSQL support and optimize database write logic by @Potterluo in #507
- [fix] move init to intergration/vllm directory by @Lijiachen1018 in #535
- [Fix]Add PLATFORM reminder by @zhou-haitao in #526
- cherry-pick from 0.1.0-release by @Lijiachen1018 in #552
- [Feat] New Store Impl: CacheStore - PosixStore - PipelineStore by @mag1c-h in #553
- [Perf] parallel block-existence checks + timeout exception by @mag1c-h in #550
- [feat] Shard block files into subdirs by hash prefix, with opt-out switch by @mag1c-h in #561
- [feat]use numpy to calculate addrs by @qyh111 in #564
- [Bugfix] use-after-free in LookupBatch by @mag1c-h in #565
- [Bugfix] skip fresh shm files to avoid race between multiple instances by @mag1c-h in #566
- [Bugfix] Fix incorrect fallback in GetHostBuffer: use MakeHostBuffer instead of MakeDeviceBuffer by @mag1c-h in #568
- [feat] kvcomp on device by @wangwenxin0312 in #559
- [fix]Add exception handling by @qyh111 in #569
- [bugfix]Fix KeyError when VLLM_HASH_ATTENTION environment variable is not set by @qyh111 in #570
- [bugfix] patch update by @wangwenxin0312 in #571
- [fix]fix monitor issue by @qyh111 in #572
- [bugfix] build hamming dist by @wangwenxin0312 in #578
- [feat] Update data file layout to adapt to garbage collection by @qyh111 in #576
- [bugfix] sparse patch & cmake by @wangwenxin0312 in #580
- [build]fix spdlog use ext fmt by @Lijiachen1018 in #585
- [bugfix] kvcomp fix by @wangwenxin0312 in #586
- [feat] KvCompOnDevice: per-KV-head Top-K for Qwen by @wangwenxin0312 in #588
- [bugfix] share buffer used out by @mag1c-h in #592
- [bugfix] kvcomp for qwen by @wangwenxin0312 in #595
- [fix]clean code and set local_rank_size to tp_size by @qyh111 in #596
- [fix]fix clean code by @qyh111 in #601
- [Bugfix] update block dir permission & double-free fix by @mag1c-h in #603
- [bugfix] double-release shared-block while make reader failed by @mag1c-h in #604
- [docs]add doc for pipeline store by @qyh111 in #612
- [feat] cherry-pick to 0.2.0-release to add rerope by @xinSky00 in #614
- fix ascend patch and change version by @qyh111 in #615
- add patch in dokerfile-npu by @qyh111 in #617
- [feat] cherry-pick KVComp in NPU -- HBM version into the 0.2.0-release branch by @wangwenxin0312 in #619
- [feat] update all patch and docs by @wangwenxin0312 in #620
- [bugfix] hamming compile by @wangwenxin0312 in #624
New Contributors
- @zzycode1005 made their first contribution in #462
Full Changelog: v0.1.2...v0.2.0
v0.2.0rc1
Highlights
- Improved Prefix Cache offload/load performance.
- Support Cache Blend.
Known Issues
- When using the Ascend platform:
  - Broadcasting is not supported.
  - load_only_first_rank must be set to false in the configuration.
- When compiling from source, make sure to set the PLATFORM environment variable.
What's Changed
- [Feature] Add performance and evaluation testing tools using the pytest framework by @zzycode1005 in #462
- [Feature] Added environment pre-check by @Menglths in #498
- [docs] fix links in docs and add clarifications (#499) by @Lijiachen1018 in #502
- [build] rewrite setup.py by @ygwpz in #501
- [bugfix] Adapt the patch to support YAML sections. by @wangwenxin0312 in #480
- [bugfix] fix pip install -e no so by @ygwpz in #508
- [Feature] Cache Blend by @wuhuxiao in #467
- merge Feature_store_next to develop by @qyh111 in #518
- [bugfix]fix setup.py by @qyh111 in #520
New Contributors
- @zzycode1005 made their first contribution in #462
- @wuhuxiao made their first contribution in #467
Full Changelog: v0.1.2...v0.2.0rc1
v0.1.2
Some small fixes in this release.
- [Docs] Documents are now easier to read.
- [Docs] PD disaggregation documentation update: the --enforce-eager argument is no longer passed when starting the vllm service, so graph mode is enabled by default at startup.
- [Feat] Completely remove UCconnector; please use UCMConnector from now on.
- [Feat] UCM supports recovery from load failure: implement the get_block_ids_with_load_errors interface in the KVConnectorBase_V1 class, enabling vLLM to re-execute inference for requests whose KV cache failed to load from UCM.
- [Build] Use pip install uc-manager==0.1.2 and the install will build from source for both vllm and vllm-ascend.
- [Build] The Sparse module is now built and used only if the environment variable is set: export ENABLE_SPARSE=TRUE.
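The load-failure recovery item above can be pictured with a toy model: the connector records which block IDs failed to load, and the caller re-executes any request that depends on them. In the sketch below only the method name get_block_ids_with_load_errors comes from the release note. `ToyConnector`, `load_blocks`, `requests_to_recompute`, and the clear-on-read behavior are hypothetical simplifications and do not reproduce the vLLM or UCM classes.

```python
class ToyConnector:
    """Toy stand-in for a KV connector; not the vLLM or UCM class."""

    def __init__(self, failing_blocks):
        self._failing = set(failing_blocks)  # blocks whose load will fail
        self._load_errors = set()

    def load_blocks(self, block_ids):
        """Pretend to load blocks, remembering the ones that fail."""
        for block_id in block_ids:
            if block_id in self._failing:
                self._load_errors.add(block_id)

    def get_block_ids_with_load_errors(self):
        """Return the block IDs that failed to load, then reset the record."""
        errors, self._load_errors = self._load_errors, set()
        return errors


def requests_to_recompute(requests, failed_blocks):
    """Pick the requests that reference at least one failed block."""
    return sorted(
        req for req, blocks in requests.items() if failed_blocks & set(blocks)
    )


requests = {"req-a": [0, 1], "req-b": [2, 3], "req-c": [4]}
connector = ToyConnector(failing_blocks=[3])
for blocks in requests.values():
    connector.load_blocks(blocks)

failed = connector.get_block_ids_with_load_errors()
redo = requests_to_recompute(requests, failed)
```

Here only req-b touches the failed block, so it alone would be scheduled for recomputation while the other requests keep their loaded KV cache.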
What's Changed
- [cleancode]rm video by @Lijiachen1018 in #459
- [fix] pick fixes from Release to develop by @Lijiachen1018 in #465
- [cleancode]remove uc connector by @Lijiachen1018 in #460
- [build] project docs for pypi by @Lijiachen1018 in #466
- [build]build sparse only if enabled by @Lijiachen1018 in #470
- [Misc] fetch dependence from gitcode as backup by @mag1c-h in #469
- [docs] renew docs by @Lijiachen1018 in #476
- release v0.1.1 by @Lijiachen1018 in #478
- feat: add MetaX MACA device support for PC by @simshi in #387
- [Docs] PD disaggregation documentation update by @sumingZero in #479
- [Feat] UCM supports recovery form load failure by @sumingZero in #477
- [feat]Add configurable scattergatter by @qyh111 in #483
- [bugfix]add synchronize on ascend platform by @qyh111 in #485
- [build] fix build by source distribution by @Lijiachen1018 in #484
- release v0.1.2 by @Lijiachen1018 in #491
- develop merge into main by @ygwpz in #492
- [docs] fix links in docs and add clarifications by @Lijiachen1018 in #499
Full Changelog: v0.1.0...v0.1.2
v0.1.0
We are excited to announce the first official release of Unified Cache Manager.
Highlights
- Offload Prefix Cache to storage.
- Homogeneous/Heterogeneous PD disaggregation.
- Training-free sparsity for accelerating inference (vllm==0.9.2, vllm-ascend==0.9.2rc1) in #199, #335, #190, #451
Core:
- Garbage collection for store in #315 and #312
- Adapt to vllm and vllm-ascend in #13, #292, #415 and #362
- UCM supports online metrics display via Grafana and Prometheus in #414 and docs in #416
Known Issues
If using the Ascend platform, please be mindful of the following:
- Not compatible with broadcast
- Set load_only_first_rank: false in config
Others
- Update documents
- Tools for performance tuning, hyperparameter optimization in #418
What's Changed
- [opt] Share Infra implementation and unify status codes by @mag1c-h in #399
- [bugfix] Fix ESA to be compatible with the latest NFSStore. by @wangwenxin0312 in #401
- release v0.1.0rc4 by @Lijiachen1018 in #402
- [opt] Remove unused cc impl of dramstore by @mag1c-h in #406
- [Fix]remove dram docs and modify quick-start doc by @hero0307 in #411
- [Feature] Added performance testing tool based on the PyTest testing framework by @Menglths in #295
- [Misc] Add cpp-linter.yml by @mag1c-h in #422
- [docs]add metrics doc by @hero0307 in #416
- [perf] Modify CUDA SIMD and add Triton hash encoder by @Clarence-1103 in #408
- [bugfix] batch trans on cuda with SM return 700 error by @mag1c-h in #434
- [Misc] set default logger backend to spdlog by @mag1c-h in #440
- [rebase]Dev-ucm-v1 rebase to develop by @Lijiachen1018 in #453
- [cleancode] remove dramstore by @Lijiachen1018 in #455
- Fix metrics by @Lijiachen1018 in #456
Full Changelog: v0.1.0rc4...v0.1.0
v0.1.0rc4
What's Changed
- [feat] ucmtrans: Unify API for Device-Host Memory Transfers by @mag1c-h in #379
- [feat] Add support for Ascend device memory transfers by @mag1c-h in #382
- [Fix] fix build, fix no save kv layer by @Lijiachen1018 in #390
- [feat] Add pcstore for enhanced PrefixCache performance by @FangRun2 in #393
- [fix] fix ascend attention by @Lijiachen1018 in #394
- release v0.1.0rc3 by @Lijiachen1018 in #395
- [fix] fix sparse attention by @Lijiachen1018 in #397
Full Changelog: v0.1.0rc2...v0.1.0rc4
v0.1.0rc2
What's Changed
- [docs] update docs for v0.1.0rc1 by @Lijiachen1018 in #365
- [bug fix] Dev patch fix for sparse by @Lijiachen1018 in #371
- [build] auto patch for ascend by @Lijiachen1018 in #372
- feat: add Mthreads MUSA device support -stage 1 by @superleo in #370
- release v0.1.0rc2 by @Lijiachen1018 in #373
- prefetch bug by @zbb200819 in #360
- [Feat]Adapt to vllm-ascend0.9.1 and vllm-ascend0.11.0 by @hero0307 in #362
- [bugfix] add cmake option to bypass NUMA binding by @Clarence-1103 in #368
- [Feat] Update the data items saved by trace replay by @sumingZero in #366
Full Changelog: v0.1.0rc1...v0.1.0rc2
v0.1.0rc1
Support Features
- Prefix Cache
- Sparse Attention
- Sparse Attention Offload
- PD Disaggregation
What's Changed
- remove impl by @flesher0813 in #11
- adapt vllm v0.9.2 by @flesher0813 in #13
- [Doc] Outline of the document by @ygwpz in #15
- remove impl test and add uc connector test by @flesher0813 in #14
- [Doc] Installation of ucm by @flesher0813 in #17
- [Feature] Add DRAM Connector for uc_connector by @harrisonyhq in #18
- [doc] add readme and license by @ygwpz in #24
- [Feature] Add Dockerfiles by @flesher0813 in #20
- [Feature]Nfsstore by @propanone1006 in #23
- [doc] change docs outline by @ygwpz in #32
- [Feature] Add Cmake build command in setup.py by @harrisonyhq in #34
- [fixbug] fix issue#25 issue#31 and issue#33 by @flesher0813 in #30
- [Fix][Docs] Make example runnable and add performance data (closes #37 #29 #42) by @harrisonyhq in #41
- [Feat] Move kv_block_size to config by @harrisonyhq in #43
- [feature][docs]finish nfs store and add docs by @qyh111 in #44
- [doc] Add export of device type in installation;[Fix] fix version invalid#45 #46 by @harrisonyhq in #47
- add perf data in readme by @ygwpz in #49
- [Feat] Merge 0.0.1 back into develop by @flesher0813 in #50
- [bugfix] fix issue#26 and issue#36 by @ygwpz in #55
- [Doc] Add vllm institution by @flesher0813 in #61
- [CI][Fix] update issue and pr template, fix issue #57, cherry-pick main by @flesher0813 in #65
- [Doc] update install doc using patch to build from source code by @flesher0813 in #68
- [Feat] Merge 0.0.1 back into develop by @ygwpz in #72
- [Style] Fix codestyle problems and typo in develop by @harrisonyhq in #75
- [Feature] add ucm_sparse v1.0: unified sparse attention algorithm framework by @hek14 in #79
- [Fix] Fix cant find cmake error when using pip install -e . by @harrisonyhq in #80
- Revert "[Feature] add ucm_sparse v1.0: unified sparse attention algorithm framework " by @ygwpz in #82
- [Feature] add Mooncake Store by @propanone1006 in #86
- [Fix bug] Simplify docker build and installation.md by @flesher0813 in #87
- [BUG]adapt deepseek by @qyh111 in #89
- [Feature][P/D] add example for disaggregated prefill by @flesher0813 in #90
- [Perf] Pipelined ucmnfsstore by @mag1c-h in #97
- Revert "[Feature] add Mooncake Store" by @ygwpz in #98
- [Fix bug] fix uc_connector ut and change hash generation method by @hero0307 in #101
- [Fix] Fix .so build error by @harrisonyhq in #104
- [Fix] Fix ascend compile error by @mag1c-h in #106
- [Perf]Modify start_load_kv by @qyh111 in #103
- [Fix] Fix duplicate create/commit errors upon preemption by @flesher0813 in #109
- [Feat] Adapt for vllm 0.9.1 by @sumingZero in #113
- [Feature] [Doc] UCMSparse framework by @hek14 in #112
- [fix] remove redundant code and files/rename file names by @NaganooMei in #118
- [Fix] Fix spelling issues with PR templates by @propanone1006 in #119
- remove load_tasks by @NaganooMei in #121
- [bugfix] bugfix in ucmnfsstore by @mag1c-h in #123
- [doc]Add config parameter by @UESTC-AHao in #130
- [bugfix]fix rank handing in multi-node pp setup by @qyh111 in #129
- [Feat]Support UCM Sparse on cuda by @harrisonyhq in #126
- [Feature] Add mooncake store by @hufumans in #117
- [bugfix]modify mla dump by @zhou-haitao in #128
- [feature] non-blocking interfaces are provided to check whether the transmission task is completed by @mag1c-h in #139
- [feature] return error if block exists while batch creation. by @mag1c-h in #138
- [feature]modify create interface by @hufumans in #145
- [Doc] change logo and rearange docs by @flesher0813 in #156
- 0.0.2 release merge develop by @ygwpz in #158
- [doc][feature] change code directory by @ygwpz in #161
- [fix] modify patch and workflow by @NaganooMei in #163
- [Feat] Support load async by @flesher0813 in #166
- [Feat]Support load async and load failure by @flesher0813 in #165
- [Feature]refactor ucconnector by @qyh111 in #167
- [feature] upload retake codes by @truthstriver in #172
- [bugfix]Resolve the issue of the first-round commit failure under dsv2 by @zhou-haitao in #186
- [Feat] Add KVComp sparse attention implementation in UCM by @leideng in #182
- [perf]prepare offset in advance by @qyh111 in #188
- [feature] GSA by @HaoLi980405 in #190
- [bugfix]fix pp problem and remove err logs when duplicate create by @qyh111 in #191
- [Fix] Fix bug: check task returns -50005 during async load by @sumingZero in #192
- [bugfix]gsa fix reslotmapping bug by @HaoLi980405 in #194
- [bugfix]gsa fix running reqs exceed 30 bug by @HaoLi980405 in #195
- [doc] design doc directory by @ygwpz in #197
- [Perf]kv_block_size as well as transferIoSize are calculated rather than configured by @UESTC-AHao in #196
- [Feat] add cuda topk and gsa descriptions by @HaoLi980405 in #198
- [Fix] Fix workflow image space error in action by @harrisonyhq in #203
- [bugfix]roll back dataoffset by @qyh111 in #201
- [bugfix] fix whl install gsa error and gsa kpre reslotmapping out of range by @HaoLi980405 in #204
- [Fix][Doc] Modify sparse docs by @flesher0813 in https://gi...