Is your feature request related to a problem? Please describe.
Currently, Mars uses slot-based resource management for subtask scheduling. For every subtask, it checks whether the corresponding worker has one CPU available for execution. If the worker has an idle CPU, the subtask is allowed to run on that worker, and the SubtaskManager submits the subtask to it. After the subtask has been submitted, the worker checks whether it has enough resources to execute it. If not, the subtask is held until enough resources become available. This kind of slot-based execution has the following issues:
Most subtasks do not use exactly one CPU, so resources are either wasted or oversold.
Scheduling does not account for memory. If the worker doesn't have enough memory, the subtask waits on that worker, which increases scheduling and execution latency.
Scheduling takes only CPU into account. Because it doesn't use a multi-dimensional resource scheduling model, it can't achieve better resource utilization across the whole cluster.
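To make the first issue concrete, here is a minimal sketch of how much CPU is wasted or oversold when every subtask is charged exactly one slot. The function name `slot_accounting` and the demand figures are hypothetical and only for illustration; this is not Mars code.

```python
from typing import List, Tuple


def slot_accounting(cpu_demands: List[float]) -> Tuple[float, float]:
    """Return (wasted, oversold) CPU when each subtask is charged one 1-CPU slot.

    cpu_demands holds the CPU each subtask really uses (illustrative numbers).
    """
    # A subtask using less than one CPU leaves part of its slot idle.
    wasted = sum(max(0.0, 1.0 - d) for d in cpu_demands)
    # A subtask using more than one CPU consumes CPU nobody reserved for it.
    oversold = sum(max(0.0, d - 1.0) for d in cpu_demands)
    return wasted, oversold


light = [0.5] * 4   # four subtasks that each use half a CPU
heavy = [2.0] * 4   # four subtasks that each use two CPUs

print(slot_accounting(light))  # (2.0, 0.0)
print(slot_accounting(heavy))  # (0.0, 4.0)
```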
Describe the solution you'd like
We can use a more accurate resource management model for Mars.
In the supervisor, we can keep track of the CPU, memory, and GPU usage of every subtask.
For every subtask, we can derive a resource recommendation plan based on the execution history of the subtask graph.
When we acquire resources for a subtask from the global resource manager, we can submit the accurate resource requirements to the resource manager. If it returns true, we can be confident that the subtask can run on the specified worker with roughly the requested resources and without delay. This also avoids wasted or oversold resources. For example, suppose we have 1000 subtasks to schedule, each needing 0.5 CPU for 30 minutes and little memory. With slot-based scheduling, we need 1000 CPUs for 30 minutes. With accurate resource accounting, we need only 500 CPUs, which is about a 2x improvement in resource utilization.
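A rough sketch of what such a multi-dimensional resource manager could look like is shown below. The names `Resource`, `GlobalResourceManager`, `try_acquire`, and `release`, as well as the worker sizes and the per-subtask memory figure, are assumptions made for illustration; they are not existing Mars APIs. The example replays the scenario above on a hypothetical cluster of 10 workers with 50 CPUs each (500 CPUs in total).

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Resource:
    """Hypothetical multi-dimensional resource vector (not a Mars class)."""
    num_cpus: float = 0.0
    num_gpus: float = 0.0
    mem_bytes: int = 0

    def fits_in(self, other: "Resource") -> bool:
        # A request fits only if every dimension fits.
        return (self.num_cpus <= other.num_cpus
                and self.num_gpus <= other.num_gpus
                and self.mem_bytes <= other.mem_bytes)

    def __add__(self, other: "Resource") -> "Resource":
        return Resource(self.num_cpus + other.num_cpus,
                        self.num_gpus + other.num_gpus,
                        self.mem_bytes + other.mem_bytes)

    def __sub__(self, other: "Resource") -> "Resource":
        return Resource(self.num_cpus - other.num_cpus,
                        self.num_gpus - other.num_gpus,
                        self.mem_bytes - other.mem_bytes)


class GlobalResourceManager:
    """Tracks the available resources of each worker (illustrative only)."""

    def __init__(self, capacities: Dict[str, Resource]):
        self._available = dict(capacities)

    def try_acquire(self, worker: str, request: Resource) -> bool:
        """Reserve the request on the worker; return False if it does not fit."""
        available = self._available[worker]
        if not request.fits_in(available):
            return False
        self._available[worker] = available - request
        return True

    def release(self, worker: str, request: Resource) -> None:
        """Give back resources once the subtask has finished."""
        self._available[worker] = self._available[worker] + request


# Assumed cluster: 10 workers, 50 CPUs and 8 GiB each (500 CPUs in total).
workers = {f"worker-{i}": Resource(num_cpus=50, mem_bytes=8 * 2**30)
           for i in range(10)}
manager = GlobalResourceManager(workers)

# Each subtask asks for 0.5 CPU and an assumed 16 MiB of memory.
request = Resource(num_cpus=0.5, mem_bytes=16 * 2**20)

admitted = 0
for _ in range(1000):
    # First-fit placement: take the first worker that accepts the request.
    if any(manager.try_acquire(w, request) for w in workers):
        admitted += 1

print(admitted)  # 1000
```

With one-CPU slots, the same 500 CPUs would admit only 500 of these subtasks at a time. First-fit placement is used here only to keep the sketch short; a real scheduler would likely also consider data locality and load balance.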