Highlights
Support SGLang
- UCM is now integrated with SGLang, enabling prefix cache offloading to Posix Store to reduce redundant computation and lower TTFT (#757). See Quickstart-SGLang: https://ucm.readthedocs.io/en/latest/getting-started/quickstart_sglang.html
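To make the prefix cache offloading idea concrete, here is a minimal sketch, not the UCM or SGLang implementation. It assumes KV blocks are keyed by a chained hash of their tokens and stored one file per block in a POSIX directory. `BLOCK_SIZE`, `block_hashes`, and `ToyPosixStore` are hypothetical names introduced only for illustration.

```python
import hashlib
import os
import tempfile
from typing import List, Sequence

BLOCK_SIZE = 4  # tokens per block; illustrative value, not a UCM default


def block_hashes(tokens: Sequence[int], block_size: int = BLOCK_SIZE) -> List[str]:
    """Chain-hash each full block so a hash identifies the whole prefix up to it."""
    hashes: List[str] = []
    parent = b""
    full = len(tokens) - len(tokens) % block_size  # ignore the trailing partial block
    for start in range(0, full, block_size):
        block = list(tokens[start:start + block_size])
        digest = hashlib.sha256(parent + repr(block).encode()).hexdigest()
        hashes.append(digest)
        parent = digest.encode()
    return hashes


class ToyPosixStore:
    """Hypothetical file-per-block store; this is NOT the UCM Posix Store API."""

    def __init__(self, root: str) -> None:
        self.root = root
        os.makedirs(root, exist_ok=True)

    def _path(self, key: str) -> str:
        return os.path.join(self.root, key)

    def dump(self, key: str, payload: bytes) -> None:
        """Write to a temporary file, then rename so readers never see partial data."""
        tmp = self._path(key) + ".tmp"
        with open(tmp, "wb") as f:
            f.write(payload)
        os.replace(tmp, self._path(key))

    def lookup(self, keys: Sequence[str]) -> int:
        """Return how many leading blocks are already stored (the prefix hit length)."""
        hits = 0
        for key in keys:
            if not os.path.exists(self._path(key)):
                break
            hits += 1
        return hits


store = ToyPosixStore(tempfile.mkdtemp())
for key in block_hashes([1, 2, 3, 4, 5, 6, 7, 8, 9]):
    store.dump(key, b"kv-bytes")  # placeholder payload standing in for a KV block

# A later request sharing the first eight tokens can reuse two stored blocks.
reusable = store.lookup(block_hashes([1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13]))
```

Under these assumptions, only the blocks past the hit length need to be recomputed during prefill, which is where the TTFT saving would come from.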
Refactor PipelineStore for Scalability and Performance
- Refactor PipelineStore into a modular, plugin-based architecture with automatic registration and runtime loading (#689)
- Improve overall performance through optimized Store implementations (e.g., cache store, posix store) and execution flow (#722, #744, #787)
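The plugin-based design with automatic registration and runtime loading can be pictured with a short sketch. This is an assumption-laden illustration, not the PipelineStore code. The `Stage` base class, the `"cache|posix"` spec string, and `build_pipeline` are hypothetical names, and the real implementation may differ substantially.

```python
import importlib
from typing import Dict, Optional, Sequence, Set, Type


class Stage:
    """Hypothetical base class; subclasses register themselves under `name`."""

    registry: Dict[str, Type["Stage"]] = {}
    name: str = ""

    def __init_subclass__(cls, **kwargs) -> None:
        super().__init_subclass__(**kwargs)
        if cls.name:  # automatic registration at class-definition time
            Stage.registry[cls.name] = cls

    def __init__(self, backend: Optional["Stage"] = None) -> None:
        self.backend = backend
        self.keys: Set[str] = set()

    def lookup(self, key: str) -> bool:
        raise NotImplementedError


class CacheStage(Stage):
    name = "cache"

    def lookup(self, key: str) -> bool:
        # Check the local tier first, then fall through to the next stage.
        if key in self.keys:
            return True
        return self.backend.lookup(key) if self.backend else False


class PosixStage(Stage):
    name = "posix"

    def lookup(self, key: str) -> bool:
        return key in self.keys


def build_pipeline(spec: str, plugin_modules: Sequence[str] = ()) -> Stage:
    """Build a stack such as "cache|posix"; the leftmost stage is the front."""
    for module in plugin_modules:
        importlib.import_module(module)  # importing a plugin triggers registration
    stage: Optional[Stage] = None
    for name in reversed(spec.split("|")):
        stage = Stage.registry[name.strip()](backend=stage)
    if stage is None:
        raise ValueError("empty pipeline spec")
    return stage


front = build_pipeline("cache|posix")
front.backend.keys.add("block-a")   # pretend the block was dumped to the back tier
found = front.lookup("block-a")     # the front stage misses, the backend hits
```

In a layout like this, a new store type only needs to subclass the base class in a module that gets imported. The pipeline builder itself does not change.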
UCM Connector
- UCM now additionally supports advanced parallel paradigms, including PCP / DCP and PP, enabling more flexible and scalable distributed execution (#750)
- Improve UCM connector performance by introducing optional event synchronization control (#768)
Inference Enhancement Features
- GSAOnDevice sparse attention algorithm has been upgraded with improved performance and accuracy, now fully supporting vLLM / vLLM-Ascend 0.11.0 (#659, #746, #729)
- Add support for Rerope in vLLM version 0.11.0 (#686)
- Enhance UCM logger compatibility (#760)
Documentation
- Add feature and model support matrix
- Add a developer guide on extending UCM Store with scalable storage, persistence, and efficient data handling (see: https://ucm.readthedocs.io/en/latest/developer-guide/extending_store.html)
What's Changed
- [fix] Adapt ESA to the LayerWiseConnector by @wangwenxin0312 in #681
- [doc] Add Code of Conduct by @yuanzhg078 in #684
- [Opt] GsaOnDevice cuda bugfix & optimization by @wangwenxin0312 in #659
- [CI] Modify pull request template by @yuanzhg078 in #687
- [Feature] rewrite logger module by @Lijiachen1018 in #608
- [refactor] Rename global rank and remove broadcast function by @harrisonyhq in #685
- [fix] clean code by @Lijiachen1018 in #688
- [opt] Refactor PipelineStore for Enhanced Scalability by @mag1c-h in #689
- [fix] Fix logger by @Lijiachen1018 in #690
- [doc] How to extend UCM Store by @mag1c-h in #692
- [CI] logger use zlibstatic by @Lijiachen1018 in #698
- [Bugfix] Cherry-pick modify worker_id to distinguish diff workers(#691) by @flesher0813 in #701
- [bugfix] rm unavailable lib and fix doc and update patch by @wuhuxiao in #699
- [feat] rerope feature for vllm0.11.0 by @xinSky00 in #686
- [perf] Reduce directory lock conflicts during batch dumps in PosixStore by @mag1c-h in #707
- [bugfix] fix debug log printing by @Lijiachen1018 in #706
- [bugfix] Fixed the issue of invalid LocalBuffer pointers in PCStore by @mag1c-h in #715
- [bugfix] rerope feature for vllm0.9.2 and git apply merging by @xinSky00 in #708
- [CICD] run e2e test in docker by @dante159753 in #712
- [Feature] Add readme and dataset in performance and evaluation test by @zzycode1005 in #721
- [bugfix] Adaptive modification of llmperf by @Menglths in #719
- [Feat] sparse patch for vllm-ascend v0.11.0 by @Infinite666 in #718
- [Bugfix]Fix garbled output when tp > 1 by @qyh111 in #716
- [perf] Copy Bandwidth Optimize: Multi-Stream parallelism supported in CacheStore by @mag1c-h in #722
- [Feat] sparse patch for gsa on device(GQA) va0.11.0rc1 by @Infinite666 in #726
- [Feat]Add layerwise and log_path config in run.sh by @qyh111 in #724
- [opt] Default depth of the waiting queue needs to be increased by nShard times for layer-wise by @mag1c-h in #731
- [Feat] Reuse-aware layer skipping under dynamic KV sparsification by @tedi20 in #725
- [opt] Increase the default running queue depth to support greater concurrent requests. by @qyh111 in #733
- [Feat]: Monkey patch framework for vllm 0.11.0, fix graph mode + UCM bugs by @NaganooMei in #735
- [Feat] Add csrc/ascend NPU custom ops for GSA by @leideng in #729
- [feat] Variable length IO supported in CacheStore by @mag1c-h in #734
- [Opt] Enable concurrent prefix lookup for posixstore by @sumingZero in #739
- [CI] refine docker file to use in yellow field by @dante159753 in #741
- [Feat]: Implement load failure recovery via monkey patch by @NaganooMei in #738
- [Opt]Split the thread pool into separate load and dump pools to prevent them from interfering with each other. by @qyh111 in #744
- [opt] Print TaskId in the CacheStore Error Log by @mag1c-h in #742
- [opt]Adapt variable io size by @qyh111 in #745
- [Opt] Add log timestamp in run_vllm.sh by @qyh111 in #747
- [bugfix & opt] gsaOnDevice for CUDA Graph mode by @wangwenxin0312 in #732
- [test & bugfix] fix low dump performance in posixstore e2e test by @NaganooMei in #751
- [Fix] Modify the config files of gsaondevice. by @AooooooA-C in #749
- [Test] Remove memory manager abstraction in PosixStore e2e test by @NaganooMei in #753
- [opt] CUDA Hamming Distance Kernel Optimization for GQA by @wangwenxin0312 in #755
- [fix] fix zlib gitcode url by @Lijiachen1018 in #758
- [Feature] Integrate UnifiedCache (UCM) into SGLang for Multi-Level Caching System by @pyxyzc in #757
- chore(test): Ensure that unnecessary import failures do not affect test execution by @Potterluo in #754
- [feat] GSAOnDevice for MLA Models Like DeepSeek V2/V3 in Ascend NPU by @leideng in #746
- [Feat] sparse patch for gsa on device(MLA) va0.11.0 by @Infinite666 in #761
- [Fix] fix save_speed core dump and loaded blocks num when task failed by @flesher0813 in #763
- Fix batch_size_for_hamming bug when slice is disabled (vllm-ascend 0.11.0) by @leideng in #765
- [Feat] adapt dcp&pcp by @flesher0813 in #750
- [Fix] Add __init__.py for rerope. by @AooooooA-C in #769
- [Refactor]monkey patch sparse feature in v0.11.0 by @ayaka836 in #743
- [Opt] update deepseek r1 config by @leideng in #770
- [feat] Introduce platform-specific sparse trigger thresholds for GPU and NPU by @wangwenxin0312 in #762
- [opt] Define UCM_ROOT_DIR to ensure safety when used UCM as a sub-repository by @mag1c-h in #772
- [opt] enable Ascend register pin optimization by @mag1c-h in #775
- [fix] remove imports that specific to platform by @dante159753 in #771
- [opt] supports loading non-built-in connectors by @mag1c-h in #776
- [feat] enhance logger compatibility by @Lijiachen1018 in #760
- Fix padded query start loc bug for vllm-ascend v0.11.0 by @leideng in #774
- [Bugfix] Correct MLA Threshold Gating for GSA by @wangwenxin0312 in #778
- [bugfix] Fix padded block tables in graph mode for vllm-ascend v0.11.0 by @wangwenxin0312 in #779
- [fix] logger add debug_once by @Lijiachen1018 in #781
- [opt]Add event to sync by @qyh111 in #768
- [Feat] Adapt new monkey patch framework for vllm & vllm_ascend 0.11.0, supporting KV cache load recovery and graph compilation modes fix for UCM layerwise connector by @yuanzhg078 in #780
- [opt]Add patch config and dev_mode config in config.properties by @qyh111 in #790
- [Fix] Fix the excessive TPOT latency by modifying update_attn_params by @flesher0813 in #788
- [Fix] change package 'wrapt' version by @ayaka836 in #792
- [fix]Add __init__.py in patch by @qyh111 in #793
- [Build] update gsa build scripts by @Infinite666 in #789
- [Build] add execute permission for .run by @Infinite666 in #794
- [opt] PosixStore supports the selection and switching of different IO engines. by @mag1c-h in #787
- [feat] enable external PC hit for MLA on CUDA and NPU by @wangwenxin0312 in #785
- [refactor] eliminate redundant Pimpl idiom in dynamically-loaded component by @mag1c-h in #797
- [bugfix] Fix CUDA full graph for DeepSeek implementation by @wangwenxin0312 in #799
- [CI] replacing offline e2e test with online e2e test by @dante159753 in #782
- [Build] add sparse configs by @Infinite666 in #805
- [Fix] Set sink_token of cuda_hamming_topk equal to block size. by @AooooooA-C in #796
- [opt]Add UcmPipelineStore and layerwise in CI by @qyh111 in #786
- [fix] log exception by @Lijiachen1018 in #808
- [Fix] fix perf test bug by @dante159753 in #809
- [Docs] Add DeepWiki reference to UCM user guide. by @yuanzhg078 in #815
- feat(test): Support result saving functionality for multiple storage by @Potterluo in #764
- [bugfix] NPU GQA Accuracy under Graph Mode by @leideng in #810
- [Feat] Add monkey patch for sparse on ascend by @Fengli5355 in #791
- [fix] Add mutex/lock to the local mode cache. by @mag1c-h in #819
- [Fix] Fix hashset deletion probing and optimize TopNHeap insertion path by @Tarrei in #820
- [fix] Resolving circular dependencies between stacked Stages in PipelineStore by @mag1c-h in #817
- [Fix] fix online test conflict with vllm server on same host by @dante159753 in #814
- [Refactor] Update monkey patch framework and remove deprecated implementation by @yuanzhg078 in #806
- [Feature] Adapt prefix cache for GQA/MLA to support vLLM >= 0.14.0 by @sumingZero in #818
- [feat] log compatible with vllm v0.16.0 by @Lijiachen1018 in #825
- [Docs] Add documentation for monkey patch v0.11.0 by @Fengli5355 in #822
- [doc] Add model support and feature compatibility document by @yuanzhg078 in #826
- Fix skipping interface auto-detection by @SuperMarioYL in #811
- [doc] Update feature support matrix and readme. by @yuanzhg078 in #830
- [CI][Build] Update Dockerfile for vllm (vllm_ascend) 0.11.0 by @yuanzhg078 in #813
- [Build] Adjust wrapt dependency version by @ayaka836 in #835
- [opt] Add strip and stack-protector compile options in release build mode by @mag1c-h in #832
- release 0.4.0 by @sumingZero in #834
- [feat] adapt PP for use_layerwise=True by @Lyj1007 in #816
- [Bugfix] The script failed to start CUDA multi-node service by @sumingZero in #839
- [CI] fix the gsaondevice offline test ci by @AooooooA-C in #824
- [doc] Update version in support matrix by @yuanzhg078 in #841
- [doc] update support version by @yuanzhg078 in #847
New Contributors
- @tedi20 made their first contribution in #725
- @Fengli5355 made their first contribution in #791
- @Tarrei made their first contribution in #820
- @SuperMarioYL made their first contribution in #811
- @Lyj1007 made their first contribution in #816
Full Changelog: v0.3.0...v0.4.0