Physical resource management for hbo #2731

chaokunyang · 2022-02-21T07:23:01Z

Is your feature request related to a problem? Please describe.

Currently Mars uses slot-based resource management for subtask scheduling. For every subtask, it checks whether the corresponding worker has one CPU free for execution. If the worker does have an idle CPU, the subtask is allowed to run on that worker, and the SubtaskManager submits the subtask to it. After the subtask has been submitted, the worker checks whether it has enough resources for subtask execution. If not, the subtask is held until enough resources become available. This kind of slot-based execution has the following issues:

Most subtasks do not use exactly one CPU, so resources are either wasted or oversold.
The scheduling does not account for memory: if the worker doesn't have enough memory, the subtask waits on that worker, which increases scheduling and execution latency.
The scheduling only takes CPU into account. It does not use a multi-dimensional resource scheduling model, so it can't achieve more optimal resource utilization for the whole cluster.
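
To make the difference concrete, here is a minimal sketch contrasting a slot-style check with a multi-dimensional one. `Resource`, `fits_in` and `slot_based_fits` are hypothetical names used only for illustration; they are not Mars classes or APIs, and the numbers are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """Hypothetical resource vector for illustration; not a Mars class."""
    cpu: float = 0.0
    memory: int = 0  # bytes
    gpu: float = 0.0

    def fits_in(self, available: "Resource") -> bool:
        # Multi-dimensional check: every dimension must fit.
        return (
            self.cpu <= available.cpu
            and self.memory <= available.memory
            and self.gpu <= available.gpu
        )


def slot_based_fits(idle_cpus: int) -> bool:
    # Slot-style check: only asks whether one CPU slot is idle.
    return idle_cpus >= 1


GIB = 1024 ** 3
worker_free = Resource(cpu=4, memory=2 * GIB)      # assumed worker state
subtask_need = Resource(cpu=0.5, memory=8 * GIB)   # assumed subtask demand

slot_ok = slot_based_fits(idle_cpus=4)
resource_ok = subtask_need.fits_in(worker_free)
print(slot_ok, resource_ok)
```

In this made-up case the slot check admits the subtask because a CPU is idle, while the multi-dimensional check rejects it because the worker is short on memory, which is the situation where the subtask would end up waiting on the worker.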

Describe the solution you'd like
We can use a more accurate resource management model for Mars.

In the supervisor, we can bookkeep the CPU, memory and GPU usage of every subtask.
For every subtask, we can get a resource recommendation plan based on the execution history of the subtask graph.
When we acquire resources for a subtask from the global resource manager, we can submit the accurate resource requirements to the resource manager. If it returns true, we can be sure that the subtask can be executed on the specified worker, using roughly the resources specified, without any delay, and there won't be any resource waste or overselling. For example, if we have 1000 subtasks to schedule, each needing 0.5 CPU for 30 minutes and using little memory, the slot-based model needs 1000 CPUs for 30 minutes. With accurate resource management we only need 500 CPUs, which gives about a 2x improvement in resource utilization.
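
The arithmetic in the example can be checked with a small sketch of the acquire step. `WorkerResources`, `try_acquire` and `admitted` are hypothetical helpers, not the GlobalSlotManagerActor API; the cluster is modeled as a single pool of 500 CPUs with ample memory, which is an assumption made to keep the example short.

```python
class WorkerResources:
    """Hypothetical bookkeeping of free resources; not a Mars class."""

    def __init__(self, cpu: float, memory: int):
        self.free_cpu = cpu
        self.free_memory = memory

    def try_acquire(self, cpu: float, memory: int) -> bool:
        # Grant the request only if every dimension fits, then reserve it.
        if cpu <= self.free_cpu and memory <= self.free_memory:
            self.free_cpu -= cpu
            self.free_memory -= memory
            return True
        return False

    def release(self, cpu: float, memory: int) -> None:
        # Return resources once the subtask finishes.
        self.free_cpu += cpu
        self.free_memory += memory


def admitted(total_cpu: float, cpu_per_subtask: float, n_subtasks: int) -> int:
    """Count how many subtasks can be admitted at the same time."""
    pool = WorkerResources(cpu=total_cpu, memory=10 ** 12)
    # Each subtask is assumed to need very little memory (1 MB here).
    return sum(pool.try_acquire(cpu_per_subtask, 10 ** 6) for _ in range(n_subtasks))


# Slot-style accounting reserves a whole CPU per subtask.
slot_admitted = admitted(total_cpu=500, cpu_per_subtask=1.0, n_subtasks=1000)
# Accurate accounting reserves the 0.5 CPU each subtask actually needs.
accurate_admitted = admitted(total_cpu=500, cpu_per_subtask=0.5, n_subtasks=1000)
print(slot_admitted, accurate_admitted)
```

With 500 CPUs, reserving one CPU per subtask admits 500 subtasks at once, while reserving 0.5 CPU admits all 1000, matching the roughly 2x utilization figure above.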


wjsi · 2022-02-21T07:28:04Z

Some of this design conflicts with #2631. More consideration is needed on whether to add more load to GlobalSlotManagerActor.

chaokunyang mentioned this issue Feb 21, 2022

[scheduling] accurate resource management for global slot manager #2732

Merged

zhongchun mentioned this issue Mar 4, 2022

[Proposal] Expand slot management to resource management #2787

Closed

qinxuye closed this as completed in #2732 Mar 22, 2022
