Motivation
Currently, Mars uses slots for resource management and band allocation, which consider only CPU/GPU and not memory. Mars always allocates one slot, representing one CPU core or one GPU, to each subtask. This works well most of the time, but it has some shortcomings:
Subtasks that need less CPU are still assigned more, which results in low CPU utilization and long execution time
Subtasks that need more memory but less CPU can cause the node to run out of memory (OOM)
So we could develop more granular resource management and allocation to increase resource utilization, improve scheduling efficiency, and avoid OOM.
Design
We propose a more general form of resource management that covers not only CPU/GPU but also memory, and eventually the estimated execution time of a subtask.
By default, a Mars subtask needs one slot and no other resource. We could add more types of resources to be managed.
The obvious first step is to add memory, as follows:
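from dataclasses import dataclass

@dataclass
class Resource:
    num_cpus: float  # CPU cores requested
    num_gpus: float  # GPUs requested
    num_mem_bytes: float  # memory requested, in bytes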
With this, we can expand slot management into resource management, and band allocation needs to consider both CPU/GPU and memory.
So we should extend resource management from a single resource type (CPU/GPU) to multiple resource types.
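A minimal sketch of what band allocation over multiple resources could look like. The three fields come from the proposal above. Everything else (the default values, `fits_in`, `pick_band`, the first-fit policy, and the band names) is a hypothetical illustration, not the actual Mars implementation.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Resource:
    # Same fields as in the proposal; the 0.0 defaults are an assumption.
    num_cpus: float = 0.0
    num_gpus: float = 0.0
    num_mem_bytes: float = 0.0

    def fits_in(self, available: "Resource") -> bool:
        # A request fits only if every resource dimension is satisfied.
        return (
            self.num_cpus <= available.num_cpus
            and self.num_gpus <= available.num_gpus
            and self.num_mem_bytes <= available.num_mem_bytes
        )

    def __sub__(self, other: "Resource") -> "Resource":
        # Element-wise subtraction, used to track what is left on a band.
        return Resource(
            self.num_cpus - other.num_cpus,
            self.num_gpus - other.num_gpus,
            self.num_mem_bytes - other.num_mem_bytes,
        )


def pick_band(required: Resource, free: Dict[str, Resource]) -> Optional[str]:
    """Return the first band with enough free resources (first fit), or None."""
    for band, available in free.items():
        if required.fits_in(available):
            return band
    return None


GIB = 1024 ** 3
# Hypothetical free resources per band.
free = {
    "numa-0": Resource(num_cpus=4, num_mem_bytes=2 * GIB),
    "numa-1": Resource(num_cpus=1, num_mem_bytes=16 * GIB),
}
need = Resource(num_cpus=1, num_mem_bytes=8 * GIB)

band = pick_band(need, free)
free[band] = free[band] - need
print(band, free[band].num_mem_bytes // GIB)  # prints: numa-1 8
```

In this example a CPU-only check would pick numa-0, which has free cores but too little memory; checking memory as well sends the subtask to numa-1.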
In addition, we can easily implement HBO (history-based optimization) if we have an external system that can recommend resources for subtasks based on historical information.
If there is no external system, we can set the memory resource to 0, which degenerates to the original slot scheduler, or set a value through configuration to avoid OOM.
Later, we can also take the estimated execution time of subtasks into account if the external HBO system can recommend it.
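The fallback described above could be sketched as follows. The function `resolve_subtask_resource` and its parameters `recommended`, `default_mem_bytes`, and `use_gpu` are hypothetical names chosen for illustration; the proposal does not define how an external HBO system would be queried or how the configured value would be read.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Resource:
    # Same fields as in the proposal; the 0.0 defaults are an assumption.
    num_cpus: float = 0.0
    num_gpus: float = 0.0
    num_mem_bytes: float = 0.0


def resolve_subtask_resource(
    recommended: Optional[Resource],
    default_mem_bytes: float = 0.0,
    use_gpu: bool = False,
) -> Resource:
    """Choose the resource request for one subtask.

    recommended: what an external HBO system suggested, or None if there is
        no such system or it has no history for this subtask.
    default_mem_bytes: a configured memory value; 0 reproduces the original
        slot behavior (one CPU core or one GPU, memory ignored).
    """
    if recommended is not None:
        return recommended
    if use_gpu:
        return Resource(num_gpus=1.0, num_mem_bytes=default_mem_bytes)
    return Resource(num_cpus=1.0, num_mem_bytes=default_mem_bytes)


# No HBO system and no configured memory: equivalent to one slot.
slot_like = resolve_subtask_resource(None)

# No HBO system, but memory configured to guard against OOM.
guarded = resolve_subtask_resource(None, default_mem_bytes=2 * 1024 ** 3)

# An HBO recommendation takes precedence over the defaults.
tuned = resolve_subtask_resource(Resource(num_cpus=0.5, num_mem_bytes=512 * 1024 ** 2))
```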
Plan
In order to implement this proposal, we plan to do:
Since slots and CPU/memory combinations are both resources, if we unify the slot and resource strategies (for instance, by always specifying the memory requirement as 0), do we really need the strategy itself?
Agree with you, we can unify them. I'll optimize the workflow.
zhongchun changed the title from "[Proposal] Support more bands allocation strategies" to "[Proposal] Expand slot scheduler to resoure scheduler" on Mar 15, 2022
zhongchun changed the title from "[Proposal] Expand slot scheduler to resoure scheduler" to "[Proposal] Expand slot management to resource management" on Mar 15, 2022