
[Feature] Garbage Collection #777

Merged
mag1c-h merged 2 commits into ModelEngine-Group:develop from UESTC-AHao:develop on Mar 31, 2026
@UESTC-AHao (Contributor) commented Mar 4, 2026

Purpose

Support Garbage Collection:

Hotness Updates: Each Scheduler records the blocks hit by a Lookup and asynchronously refreshes the timestamp of the corresponding files to record their last access.

GC Process Enablement: Executed only on Schedulers. In Data Parallel (DP) scenarios, it runs exclusively on the DP0 Scheduler, controlled by the configuration item posix_gc_enable.

GC Initial Capacity: The total storage capacity is derived from the user configuration posix_capacity_gb, and the maximum number of storable files is calculated based on individual file sizes.

GC Trigger Conditions: Triggered if either condition is met: a) Timer expiration; b) The file count still exceeds the threshold after the previous GC round.

GC Execution Detection: Randomly samples 10% of subdirectories and uses multi-threading to count files in parallel, checking whether the average file count per subdirectory exceeds the threshold.

GC Execution Method: Processes all subdirectories in parallel using multiple threads, reclaiming the 10% oldest (least recently accessed) files within each subdirectory.
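
The detection and execution steps above can be sketched in Python as follows. This is only an illustration, not the PR's C++ implementation: the function names are hypothetical, each shard is assumed to be a plain subdirectory of regular files, and the file modification time is assumed to be the hotness timestamp (the description only says "last time").

```python
import os
import random
from concurrent.futures import ThreadPoolExecutor


def count_files(shard_dir):
    """Count regular files directly inside one shard subdirectory."""
    with os.scandir(shard_dir) as entries:
        return sum(1 for e in entries if e.is_file())


def should_run_gc(shards, max_files_per_shard, sample_ratio=0.1, workers=16):
    """Sample a fraction of shards and compare their mean file count to the limit."""
    if not shards:
        return False
    k = max(1, int(len(shards) * sample_ratio))
    sample = random.sample(shards, k)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        counts = list(pool.map(count_files, sample))
    return sum(counts) / len(counts) > max_files_per_shard


def recycle_shard(shard_dir, recycle_percent=0.1, max_recycle=1000):
    """Delete the oldest files in one shard and return how many were removed."""
    with os.scandir(shard_dir) as entries:
        # Assumption: st_mtime carries the hotness timestamp.
        files = [(e.stat().st_mtime, e.path) for e in entries if e.is_file()]
    files.sort()  # oldest first
    n = min(int(len(files) * recycle_percent), max_recycle)
    for _, path in files[:n]:
        os.remove(path)
    return n


def run_gc(shards, workers=16, **kwargs):
    """Recycle every shard in parallel and return the total number of files removed."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(lambda d: recycle_shard(d, **kwargs), shards))
```

Sampling keeps the check cheap when there are many subdirectories, while the reclaim pass visits every shard so that no single directory keeps growing unnoticed.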

Modifications

ucm_connectors:
  - ucm_connector_name: "UcmPipelineStore"
    ucm_connector_config:
      store_pipeline: "Cache|Posix"
      storage_backends: "./data"

      # [REQUIRED] Total storage capacity in GB (must be set when GC is enabled)
      posix_capacity_gb:

      # [CONFIGURABLE] Enable garbage collection (default: false)
      posix_gc_enable: true

      # [CONFIGURABLE] GC trigger threshold ratio (default: 0.7)
      posix_gc_trigger_threshold_ratio: 0.7

      # [CONFIGURABLE] Fraction of files to recycle per GC run (default: 0.1)
      posix_gc_recycle_percent: 0.1

      # [CONFIGURABLE] Number of GC worker threads (default: 16)
      posix_gc_concurrency: 16

      # [CONFIGURABLE] GC check interval in seconds (default: 30)
      posix_gc_check_interval_sec: 30

      # [CONFIGURABLE] Maximum number of files that can be processed per shard (default: 1000)
      posix_gc_max_recycle_count_per_shard: 1000

      # [CONFIGURABLE] Shard sample ratio (default: 0.1)
      posix_gc_shard_sample_ratio: 0.1
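
To make the capacity settings concrete, here is a rough Python sketch of how a file-count limit might be derived from posix_capacity_gb and posix_gc_trigger_threshold_ratio. The helper names, the fixed per-file size, the shard count, and the choice of 1 GB = 2^30 bytes are all assumptions made for illustration; the PR does not spell out the exact formula.

```python
def max_storable_files(capacity_gb, file_size_bytes):
    """Upper bound on file count, assuming every file has the same size."""
    # Assumption: 1 GB is treated as 2**30 bytes.
    return (capacity_gb * 1024**3) // file_size_bytes


def per_shard_threshold(capacity_gb, file_size_bytes, shard_count, trigger_ratio=0.7):
    """Average files per shard above which GC would trigger (hypothetical formula)."""
    limit = max_storable_files(capacity_gb, file_size_bytes)
    return limit * trigger_ratio / shard_count
```

For example, under these assumptions a 100 GB capacity with 4 MiB files allows 25600 files in total, and with 256 shards and the default ratio of 0.7 the trigger point is about 70 files per shard.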

Test

[Image attachment: test results]

@UESTC-AHao UESTC-AHao closed this Mar 4, 2026
@UESTC-AHao UESTC-AHao reopened this Mar 4, 2026
@UESTC-AHao UESTC-AHao changed the title [Freture]Garbage Collection [Feature] Garbage Collection Mar 4, 2026
@UESTC-AHao UESTC-AHao force-pushed the develop branch 5 times, most recently from 5847b36 to f3e11bb Compare March 10, 2026 11:37
@UESTC-AHao UESTC-AHao force-pushed the develop branch 2 times, most recently from af9c8f2 to 86b0500 Compare March 23, 2026 08:49
mag1c-h previously approved these changes Mar 23, 2026
@mag1c-h mag1c-h self-requested a review March 23, 2026 15:07
@mag1c-h mag1c-h dismissed their stale review March 23, 2026 15:09

Conversations need to be resolved.

@UESTC-AHao UESTC-AHao force-pushed the develop branch 4 times, most recently from eae7cec to 78e5ec5 Compare March 25, 2026 03:36

Copilot AI left a comment

Pull request overview

This PR introduces garbage collection support for the Posix store backend, including “hotness” tracking on scheduler lookups and a background GC loop to reclaim least-recently-used files based on configurable thresholds.

Changes:

  • Add scheduler-side hotness tracking and a background shard GC manager for the Posix store.
  • Extend Posix store config parsing/validation/logging with GC-related settings (capacity, thresholds, concurrency, sampling).
  • Add an e2e script for exercising Posix GC and update vLLM connector + example YAML to account for GC capacity/config.
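
As a rough illustration of the scheduler-side hotness tracking listed above, the Python sketch below queues the paths hit by a lookup and refreshes their timestamps on a background thread. The class here is a hypothetical stand-in that borrows the HotnessTracker name; the real one is C++ and its interface is not shown in this conversation. Using `os.utime` to set both access and modification time to "now" is likewise an assumption.

```python
import os
import queue
import threading


class HotnessTracker:
    """Hypothetical sketch: defer timestamp updates to a background worker."""

    def __init__(self):
        self._queue = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def touch(self, paths):
        """Called on the lookup path; only enqueues, never touches the disk."""
        for path in paths:
            self._queue.put(path)

    def _run(self):
        while True:
            path = self._queue.get()
            try:
                if path is None:  # sentinel pushed by close()
                    return
                os.utime(path)  # set atime and mtime to the current time
            except FileNotFoundError:
                pass  # the file may already have been recycled by GC
            finally:
                self._queue.task_done()

    def flush(self):
        """Block until all queued updates have been applied."""
        self._queue.join()

    def close(self):
        """Stop the worker after it drains the queue."""
        self._queue.put(None)
        self._worker.join()
```

Keeping the update off the lookup path means a slow filesystem call does not delay scheduling, at the cost of timestamps that lag slightly behind the actual hits.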

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| ucm/store/test/e2e/posixstore_gc_test.py | New e2e script to exercise GC-enabled Posix store behavior. |
| ucm/store/posix/cc/space_manager.h | Adds HotnessTracker/ShardGarbageCollector members and GC enable flag. |
| ucm/store/posix/cc/space_manager.cc | Wires GC setup and hotness touch into prefix lookup flow. |
| ucm/store/posix/cc/space_layout.h | Exposes shard sampling/counting and oldest-file selection helpers for GC. |
| ucm/store/posix/cc/space_layout.cc | Implements shard sampling, file counting, and “oldest file” selection logic. |
| ucm/store/posix/cc/shard_gc.h | Introduces ShardGarbageCollector interface and task context. |
| ucm/store/posix/cc/shard_gc.cc | Implements background GC trigger loop and per-shard GC tasks. |
| ucm/store/posix/cc/posix_store.cc | Parses, validates, and logs new GC configuration knobs. |
| ucm/store/posix/cc/hotness_tracker.h | Introduces HotnessTracker to update file timestamps asynchronously. |
| ucm/store/posix/cc/hotness_tracker.cc | Implements async timestamp updates for “hot” blocks. |
| ucm/store/posix/cc/global_config.h | Adds GC-related configuration fields (enable, capacity, thresholds, etc.). |
| ucm/integration/vllm/ucm_connector.py | Ensures scheduler config includes derived block_size and disables GC except DP0 scheduler. |
| examples/ucm_config_example.yaml | Documents posix_capacity_gb as an example config field. |


@mag1c-h mag1c-h merged commit 7a3e408 into ModelEngine-Group:develop Mar 31, 2026
22 of 27 checks passed

4 participants