Currently, when the autoscaler (v2) is periodically triggered, it basically goes through the following steps (a rough sketch of the loop follows the list):
1. Reconciler._sync_from: get the latest Ray cluster resource data (including pending resource demands) from GCS.
2. Reconciler._step_next: make autoscaling decisions. It calls ResourceDemandScheduler.schedule, which includes:
a. ResourceDemandScheduler._enforce_min/max_workers: determine the nodes to add or terminate to enforce the cluster's min/max node limits.
b. ResourceDemandScheduler._sched_(gang)_resource_requests: determine the nodes to add to fulfill the pending PGs, actors, and tasks (resource demands).
c. ResourceDemandScheduler._enforce_idle_termination: determine which nodes to terminate because they have been idle for too long.
3. KubeRayProvider aggregates the autoscaling decisions and sends a patch to k8s.
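A rough sketch of that loop is below. The interfaces shown (gcs_client.get_node_states, gcs_client.get_pending_demands, scheduler.schedule, provider.patch, and the ClusterState container) are simplified placeholders for illustration, not the actual autoscaler v2 signatures:

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ClusterState:
    """Placeholder for the data Reconciler._sync_from pulls from GCS."""
    nodes: Dict[str, dict] = field(default_factory=dict)       # node_id -> status/resources
    pending_demands: List[dict] = field(default_factory=list)  # pending tasks/actors/PGs


def reconcile_once(gcs_client, scheduler, provider, config):
    """One autoscaler (v2) iteration, mirroring the three steps above."""
    # 1. Reconciler._sync_from: pull the latest cluster resource state from GCS.
    state = ClusterState(
        nodes=gcs_client.get_node_states(),
        pending_demands=gcs_client.get_pending_demands(),
    )

    # 2. Reconciler._step_next -> ResourceDemandScheduler.schedule:
    #    a. enforce the cluster's min/max worker limits,
    #    b. add nodes to fit pending PGs/actors/tasks,
    #    c. terminate nodes that have been idle past the timeout.
    to_launch, to_terminate = scheduler.schedule(state, config)

    # 3. KubeRayProvider: aggregate decisions into a single patch to k8s.
    provider.patch(launch=to_launch, terminate=to_terminate)
```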
To make the autoscaler support virtual clusters, we have to make the following changes:
1. Reconciler._sync_from() has to additionally get the latest metadata (including pending demands) of virtual clusters from GCS.
2. Do 2a, 2b, 2c for each virtual cluster. To make this happen, we additionally have to support:
a. User APIs to configure each virtual cluster's min/max limits.
b. ResourceDemandScheduler._sched_(gang)_resource_requests simulating the scheduling of pending demands within the corresponding virtual cluster.
c. Shrinking specific Ray nodes from a virtual cluster, which involves node draining.
3. KubeRayProvider aggregates the autoscaling decisions made by each virtual cluster. Before sending a patch to k8s, it should look for rebalancing opportunities (nodes shrunk from one virtual cluster can replenish another), which reduces the cost of pod creation and termination; a sketch of this per-virtual-cluster pass and the rebalancing step follows.
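Here is a rough sketch of how the per-virtual-cluster scheduling pass and the rebalancing step could fit together. All the names (VirtualClusterDecision, scheduler.sched_resource_requests, scheduler.enforce_idle_termination, limits) are hypothetical placeholders rather than existing interfaces, and the rebalancing below ignores the node-type matching a real pass would need:

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class VirtualClusterDecision:
    """Per-virtual-cluster scheduling output (placeholder shape)."""
    vc_id: str
    nodes_to_add: List[dict] = field(default_factory=list)    # node-type specs to launch
    nodes_to_remove: List[str] = field(default_factory=list)  # node ids to drain and terminate


def schedule_virtual_clusters(scheduler, virtual_clusters, limits) -> List[VirtualClusterDecision]:
    """Run steps 2a/2b/2c once per virtual cluster instead of once for the whole cluster."""
    decisions = []
    for vc in virtual_clusters:
        vc_limits = limits[vc.id]  # per-virtual-cluster min/max (change 2a)
        decisions.append(VirtualClusterDecision(
            vc_id=vc.id,
            nodes_to_add=scheduler.sched_resource_requests(vc, vc_limits),      # change 2b
            nodes_to_remove=scheduler.enforce_idle_termination(vc, vc_limits),  # change 2c, drained first
        ))
    return decisions


def rebalance(decisions: List[VirtualClusterDecision]):
    """Pair nodes shrunk from one virtual cluster with additions requested by another,
    so the provider can re-label pods instead of deleting and recreating them."""
    freed = [n for d in decisions for n in d.nodes_to_remove]
    moves: List[Tuple[str, str]] = []  # (node_id, target_vc_id)
    for d in decisions:
        while d.nodes_to_add and freed:
            d.nodes_to_add.pop()                   # one fewer pod to create
            moves.append((freed.pop(), d.vc_id))   # one fewer pod to delete
    moved = {node_id for node_id, _ in moves}
    for d in decisions:
        d.nodes_to_remove = [n for n in d.nodes_to_remove if n not in moved]
    return decisions, moves
```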
Besides the changes to the autoscaler, we have to adapt GcsAutoscalerStateManager so that it provides the virtual cluster metadata to the autoscaler. But if that requires too many changes to the existing RPC protocol, then we should do it in GcsVirtualClusterManager instead.
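Whichever GCS component ends up serving it, the per-virtual-cluster payload that Reconciler._sync_from needs is roughly the following. The field names are illustrative only and do not correspond to an existing proto or API:

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class VirtualClusterMetadata:
    """Illustrative shape of the per-virtual-cluster metadata the autoscaler would consume."""
    vc_id: str
    node_ids: List[str] = field(default_factory=list)          # nodes currently assigned to this VC
    pending_demands: List[dict] = field(default_factory=list)  # tasks/actors/PGs waiting in this VC
    min_workers: Dict[str, int] = field(default_factory=dict)  # node type -> lower bound
    max_workers: Dict[str, int] = field(default_factory=dict)  # node type -> upper bound
```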
Use case
No response