
[Core] Autoscaler for virtual clusters #506

Open · Chong-Li opened this issue Feb 18, 2025 · 1 comment
Labels: enhancement (New feature or request)

Chong-Li (Collaborator) commented on Feb 18, 2025

Description

Currently, when the autoscaler (v2) is periodically triggered, it goes through the following steps (a simplified sketch follows the list):

  1. Reconciler._sync_from: get the latest Ray cluster resource data (including pending resource demands) from GCS.
  2. Reconciler._step_next: make autoscaling decisions. It calls ResourceDemandScheduler.schedule, which includes:
    a. ResourceDemandScheduler._enforce_min/max_workers: determine the nodes to add or terminate to enforce the cluster's min/max node limits.
    b. ResourceDemandScheduler._sched_(gang)_resource_requests: determine the nodes to add to fulfill the pending placement groups (PGs), actors, and tasks (resource demands).
    c. ResourceDemandScheduler._enforce_idle_termination: determine which nodes to terminate because they have been idle for too long.
  3. KubeRayProvider aggregates the autoscaling decisions and sends a patch to k8s.
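
For orientation, here is a minimal Python sketch of that loop. The class and method names follow the text above, but gcs_client, provider.apply, and all the method bodies are illustrative stand-ins, not Ray's actual implementation:

```python
from dataclasses import dataclass, field
from typing import Dict, List


# Illustrative stand-in for the data pulled from GCS; the field names are
# assumptions of this sketch, not Ray's actual schema.
@dataclass
class ClusterState:
    nodes: List[dict] = field(default_factory=list)            # live Ray nodes
    pending_demands: List[dict] = field(default_factory=list)  # unscheduled PGs/actors/tasks


class ResourceDemandScheduler:
    def schedule(self, state: ClusterState) -> Dict[str, list]:
        decision = {"launch": [], "terminate": []}
        # 2a. Enforce the cluster-wide min/max node limits.
        decision["launch"] += self._enforce_min_max_workers(state)
        # 2b. Add nodes to fulfill pending PGs, actors, and tasks.
        decision["launch"] += self._sched_resource_requests(state.pending_demands)
        # 2c. Terminate nodes that have been idle for too long.
        decision["terminate"] += self._enforce_idle_termination(state.nodes)
        return decision

    # Stubbed-out decision steps; the real logic lives in Ray's autoscaler v2.
    def _enforce_min_max_workers(self, state):
        return []

    def _sched_resource_requests(self, demands):
        return []

    def _enforce_idle_termination(self, nodes):
        return []


class Reconciler:
    def __init__(self, gcs_client, provider):
        self._gcs = gcs_client      # source of cluster state
        self._provider = provider   # e.g. KubeRayProvider
        self._scheduler = ResourceDemandScheduler()

    def reconcile_once(self):
        # 1. Pull the latest resource data (and pending demands) from GCS.
        state = self._sync_from()
        # 2. Make autoscaling decisions.
        decision = self._step_next(state)
        # 3. The provider aggregates the decisions and patches k8s.
        self._provider.apply(decision)

    def _sync_from(self) -> ClusterState:
        return self._gcs.get_cluster_resource_state()

    def _step_next(self, state: ClusterState) -> Dict[str, list]:
        return self._scheduler.schedule(state)
```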

To make the autoscaler support virtual clusters, we have to make the following changes:

  • Reconciler._sync_from() must additionally fetch the latest metadata of virtual clusters (including their pending demands) from GCS.

  • Do steps 2a, 2b, and 2c for each virtual cluster (see the sketch after this list). To make this happen, we additionally have to support:
    a. User APIs to configure each virtual cluster's min/max node limits.
    b. Making ResourceDemandScheduler._sched_(gang)_resource_requests simulate the scheduling of pending demands within the corresponding virtual cluster.
    c. Shrinking specific Ray nodes out of a virtual cluster, which involves node draining.

  • KubeRayProvider aggregates the autoscaling decisions made for each virtual cluster. Before sending the patch to k8s, it should look for rebalancing opportunities (nodes shrunk from one virtual cluster can be handed over to another), which reduces the cost of pod creation and termination.
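
A rough sketch of how the per-virtual-cluster pass (2a-2c) and the rebalancing step could compose. VirtualClusterState, its min/max fields (which would be set through the new user APIs), and the rebalance helper are all assumptions of this sketch, not existing Ray APIs:

```python
from dataclasses import dataclass, field
from typing import Dict, List


# Hypothetical per-virtual-cluster view; min/max limits would come from
# the new user APIs proposed above.
@dataclass
class VirtualClusterState:
    name: str
    min_workers: int
    max_workers: int
    nodes: List[str] = field(default_factory=list)
    idle_nodes: List[str] = field(default_factory=list)
    pending_demands: List[dict] = field(default_factory=list)


def schedule_virtual_cluster(vc: VirtualClusterState) -> Dict[str, list]:
    """Steps 2a-2c scoped to one virtual cluster (illustrative only)."""
    decision: Dict[str, list] = {"launch": [], "terminate": []}
    # 2a. Enforce this virtual cluster's own min/max limits.
    if len(vc.nodes) < vc.min_workers:
        decision["launch"] += ["new-node"] * (vc.min_workers - len(vc.nodes))
    # 2b. Simulate scheduling of pending demands *within* this virtual
    # cluster only; here, one new node per unmet demand (the real
    # scheduler would bin-pack against existing capacity).
    decision["launch"] += ["new-node"] * len(vc.pending_demands)
    # 2c. Shrink long-idle nodes; each one must be drained before the
    # underlying pod is terminated or handed to another virtual cluster.
    decision["terminate"] += vc.idle_nodes
    return decision


def rebalance(decisions: Dict[str, Dict[str, list]]) -> Dict[str, Dict[str, list]]:
    """Pair one virtual cluster's shrunk nodes with another's pending
    launches, so pods are moved instead of deleted and recreated."""
    spare = [(vc, node) for vc, d in decisions.items() for node in d["terminate"]]
    for d in decisions.values():
        while d["launch"] and spare:
            src, node = spare.pop()
            decisions[src]["terminate"].remove(node)   # keep the pod alive
            d["launch"].pop()                          # skip one pod creation
            d.setdefault("reassign", []).append(node)  # hand the node over
    return decisions


# Per-cluster decisions, then a rebalancing pass before the single k8s patch.
virtual_clusters = [
    VirtualClusterState("vc-a", min_workers=0, max_workers=10,
                        nodes=["a-1"], idle_nodes=["a-1"]),
    VirtualClusterState("vc-b", min_workers=1, max_workers=10),
]
decisions = {vc.name: schedule_virtual_cluster(vc) for vc in virtual_clusters}
print(rebalance(decisions))  # vc-a's idle node gets reassigned to vc-b
```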

Besides the changes to the autoscaler, we have to adapt GcsAutoscalerStateManager so that it provides the virtual cluster metadata to the autoscaler. If that requires too many changes to the existing RPC protocol, we should do it in GcsVirtualClusterManager instead.
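
A sketch of what that adaptation could look like, assuming the virtual cluster metadata can simply be attached to the state the autoscaler already fetches in Reconciler._sync_from. Every name below (VirtualClusterMetadata, get_all_metadata, the virtual_clusters field) is hypothetical; if extending the existing reply proves too invasive, the autoscaler would instead issue a second request against GcsVirtualClusterManager:

```python
from dataclasses import dataclass, field
from typing import List


# Hypothetical shape of the per-virtual-cluster metadata the autoscaler
# needs; none of these names exist in Ray today.
@dataclass
class VirtualClusterMetadata:
    min_workers: int
    max_workers: int
    node_ids: List[str] = field(default_factory=list)
    pending_demands: List[dict] = field(default_factory=list)


class AutoscalerStateAdapter:
    """Illustrative wrapper that piggybacks virtual cluster metadata onto
    the cluster resource state returned to Reconciler._sync_from."""

    def __init__(self, state_manager, virtual_cluster_manager):
        self._state_manager = state_manager  # stands in for GcsAutoscalerStateManager
        self._vc_manager = virtual_cluster_manager  # stands in for GcsVirtualClusterManager

    def get_cluster_resource_state(self):
        reply = self._state_manager.get_cluster_resource_state()
        # Attach per-virtual-cluster metadata. If extending this reply
        # bloats the existing RPC protocol, serve the same data from
        # GcsVirtualClusterManager through its own endpoint instead.
        reply.virtual_clusters = self._vc_manager.get_all_metadata()
        return reply
```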

Use case

No response

Chong-Li added the enhancement (New feature or request) label on Feb 18, 2025
Chong-Li changed the title from "[<Ray component: Core>] Autoscaler for virtual clusters" to "[Core] Autoscaler for virtual clusters" on Feb 18, 2025
Chong-Li self-assigned this on Feb 18, 2025
Chong-Li (Collaborator, Author) commented:

@wumuzi520 any suggestions?
