[Proposal] Expand slot management to resource management #2787

zhongchun · 2022-03-04T07:42:11Z

Motivation

Mars currently uses slots for resource management and band allocation, which considers only CPU/GPU and not memory. Mars always allocates one slot, representing one CPU core or one GPU, to each subtask. This works well most of the time, but it has some shortcomings:

Some subtasks need less CPU than they are assigned, which results in low CPU utilization and long execution time
Some subtasks need more memory and less CPU, which can cause the node to run out of memory (OOM)

So we could develop more granular resource management and allocation to increase resource utilization, improve scheduling efficiency, and avoid OOM.

Design

We propose a more general resource management that covers not only CPU/GPU but also memory, and eventually the estimated execution time of a subtask.
By default, a Mars subtask requires one slot and no other resource. We could add more types of resources to be managed.
Obviously, memory can be added first, as follows:

class Resource:
    num_cpus: float
    num_gpus: float
    num_mem_bytes: float

With this, we can expand slot management into resource management, and band allocation needs to consider both CPU/GPU and memory.
So we should extend resource management from a single resource (CPU/GPU) to multiple resources.
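
As a rough illustration of what multi-resource band allocation could look like, here is a minimal sketch. Only the three field names come from the proposal; the `fits_in` method, the `allocate` function, the first-fit policy, and the string band names are assumptions made for this example and are not Mars APIs.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Resource:
    # Field names follow the proposal; everything else here is illustrative.
    num_cpus: float = 0.0
    num_gpus: float = 0.0
    num_mem_bytes: float = 0.0

    def __sub__(self, other: "Resource") -> "Resource":
        return Resource(
            self.num_cpus - other.num_cpus,
            self.num_gpus - other.num_gpus,
            self.num_mem_bytes - other.num_mem_bytes,
        )

    def fits_in(self, available: "Resource") -> bool:
        # A request fits only if every dimension is satisfied.
        return (
            self.num_cpus <= available.num_cpus
            and self.num_gpus <= available.num_gpus
            and self.num_mem_bytes <= available.num_mem_bytes
        )


def allocate(request: Resource, bands: Dict[str, Resource]) -> Optional[str]:
    """Pick the first band that can hold the request and deduct it.

    `bands` maps a hypothetical band name to its remaining resources.
    Returns the chosen band name, or None if no band has room.
    """
    for name, available in bands.items():
        if request.fits_in(available):
            bands[name] = available - request
            return name
    return None


GiB = 1024 ** 3
bands = {
    "worker-0": Resource(num_cpus=4, num_mem_bytes=2 * GiB),
    "worker-1": Resource(num_cpus=1, num_mem_bytes=16 * GiB),
}
# Memory-heavy subtask: worker-0 has spare CPUs but not enough memory.
chosen = allocate(Resource(num_cpus=1, num_mem_bytes=8 * GiB), bands)
```

In this example a slot-only allocator would look only at CPU counts and could place the subtask on `worker-0`, which has just 2 GiB of memory; checking every dimension sends it to `worker-1` instead.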

In addition, we can easily implement history-based optimization (HBO) if we have an external system that recommends resources for subtasks based on historical information.
If there is no external system, we can set the memory resource to 0, which degenerates to the original slot scheduler, or set a value through configuration to avoid OOM.
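
The fallback order described above could be sketched roughly as follows. The function name `resolve_mem_bytes`, the `MemoryRecommender` callable type, and the use of a logic key string as the lookup key are all hypothetical; the proposal does not define the interface of the external system.

```python
from typing import Callable, Optional

# Hypothetical signature: maps a subtask's logic key to a recommended
# memory size in bytes, or None when there is no history for it.
MemoryRecommender = Callable[[str], Optional[float]]


def resolve_mem_bytes(
    logic_key: str,
    recommender: Optional[MemoryRecommender] = None,
    default_mem_bytes: float = 0.0,
) -> float:
    """Decide how much memory to request for a subtask.

    Order of preference: HBO recommendation, then the configured default.
    With no recommender and a default of 0, memory never constrains
    allocation, which matches the original slot-only behaviour.
    """
    if recommender is not None:
        recommended = recommender(logic_key)
        if recommended is not None:
            return recommended
    return default_mem_bytes


# Stand-in for an external HBO system: a plain dict of past observations.
history = {"subtask-a": 4 * 1024 ** 3}
mem_from_history = resolve_mem_bytes("subtask-a", history.get)
mem_without_hbo = resolve_mem_bytes("subtask-a")
```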

Later, we can also estimate the execution time of subtasks if the external HBO system can recommend it.

Plan

To implement this proposal, we plan to:

Add physical resource management, which has been done in Physical resource management for hbo #2731
Add a logic id for subtasks that uniquely identifies a subtask, so that the same subtask has the same logic id across different submissions; this has been done in [Proposal] A speculative execution based task scheduler #2575
Add a logic key for the tileable graph, similar to the subtask logic key, for use by HBO, in Add logic key for tileable graph #2961
Introduce resource management and band allocation in [scheduling] Expand slot scheduler to resource scheduler #2846
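
To make the "same subtask, same logic id across submissions" idea concrete, here is one possible way such a key could be derived. This is only a sketch under assumed inputs (an operator name, a parameter dict, and the logic keys of the inputs); it is not how the referenced PRs compute the key, and the `logic_key` function is hypothetical.

```python
import hashlib
from typing import Iterable


def logic_key(op_name: str, params: dict, input_keys: Iterable[str]) -> str:
    """Hypothetical logic key: depends only on what the subtask computes
    (operator name, parameters, and the logic keys of its inputs), never
    on per-submission state such as object ids or timestamps."""
    h = hashlib.sha256()
    h.update(op_name.encode())
    # Sort parameter names so dict ordering does not change the key.
    for name in sorted(params):
        h.update(f"{name}={params[name]!r}".encode())
    for key in input_keys:
        h.update(key.encode())
    return h.hexdigest()


# Two "submissions" of the same computation yield the same key.
first = logic_key("TensorAdd", {"axis": 0, "dtype": "f8"}, ["in-1", "in-2"])
second = logic_key("TensorAdd", {"dtype": "f8", "axis": 0}, ["in-1", "in-2"])
different = logic_key("TensorAdd", {"axis": 1, "dtype": "f8"}, ["in-1", "in-2"])
```

A key that is stable across submissions is what lets an HBO system match a new subtask against the history recorded for earlier runs.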


wjsi · 2022-03-07T08:38:01Z

As slots and CPU / memory combination are both resources, if we unify slot and resource strategy (for instance, always specify mem requirement as 0), do we really need strategy itself?

zhongchun · 2022-03-07T10:17:38Z

As slots and CPU / memory combination are both resources, if we unify slot and resource strategy (for instance, always specify mem requirement as 0), do we really need strategy itself?

Agree with you, we can unify them. I'll optimize the workflow.
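
A minimal sketch of what unifying the two could mean in practice: a slot request is expressed as a resource request with zero memory, so a single check covers both cases and no separate strategy is needed. The `slots_to_resource` and `can_allocate` helpers are hypothetical names for illustration, not part of Mars.

```python
from typing import NamedTuple


class Resource(NamedTuple):
    num_cpus: float = 0.0
    num_gpus: float = 0.0
    num_mem_bytes: float = 0.0


def slots_to_resource(n_slots: int = 1, gpu: bool = False) -> Resource:
    """Express a legacy slot request as a Resource with no memory demand."""
    if gpu:
        return Resource(num_gpus=float(n_slots))
    return Resource(num_cpus=float(n_slots))


def can_allocate(request: Resource, available: Resource) -> bool:
    # Single code path: compare every dimension, slot-only or not.
    return all(r <= a for r, a in zip(request, available))


cpu_band = Resource(num_cpus=2.0)
one_slot_ok = can_allocate(slots_to_resource(1), cpu_band)
three_slots_ok = can_allocate(slots_to_resource(3), cpu_band)
```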

zhongchun assigned qinxuye, fyrestone, wjsi, Catch-Bull, chaokunyang and hekaisheng Mar 4, 2022

qinxuye added the proposal label Mar 6, 2022

qinxuye mentioned this issue Mar 7, 2022

Mars roadmaps and enhancement proposals #537

Open

zhongchun changed the title ~~[Proposal] Support more bands allocation strategies~~ [Proposal] Expand slot scheduler to resoure scheduler Mar 15, 2022

zhongchun changed the title ~~[Proposal] Expand slot scheduler to resoure scheduler~~ [Proposal] Expand slot management to resource management Mar 15, 2022

This was referenced Mar 15, 2022

[scheduling] Expand slot management to resource management #2825

Closed

[scheduling] Expand slot scheduler to resource scheduler #2846

Merged

qinxuye added this to the v0.9.0rc2 milestone Mar 24, 2022

qinxuye closed this as completed in #2846 Mar 28, 2022

zhongchun mentioned this issue Apr 25, 2022

Add logic key for tileable graph #2961

Merged

2 tasks
