[Proposal] Expand slot management to resource management #2787

Closed
zhongchun opened this issue Mar 4, 2022 · 2 comments · Fixed by #2846

@zhongchun
Contributor

zhongchun commented Mar 4, 2022

Motivation

Currently, Mars uses slots for resource management and band allocation, which considers only CPU/GPU and ignores memory. Mars always allocates one slot, representing one CPU core or one GPU, to each subtask. This works well most of the time, but it has some shortcomings:

  • A subtask that needs less CPU is assigned more than it needs, which results in low CPU utilization and long execution time
  • A subtask that needs more memory but less CPU can cause the node to run out of memory (OOM)

So we could develop more granular resource management and allocation to increase resource utilization, improve scheduling efficiency, and avoid OOM.

Design

We propose a more general resource management scheme that covers not only CPU/GPU but also memory, and eventually even the estimated execution time of a subtask.
By default, a Mars subtask needs one slot and no other resource. We could add more resource types to be managed.
Memory is the obvious first candidate, as follows:

class Resource:
    num_cpus: float
    num_gpus: float
    num_mem_bytes: float

With this, we can expand slot management into resource management, and band allocation needs to consider both CPU/GPU and memory.
So we should extend resource management from a single resource type (CPU/GPU) to multiple resource types.
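
As a rough illustration of what multi-resource band allocation could look like, here is a minimal sketch built on the `Resource` fields above. The `@dataclass` decorator, the zero defaults, the `__sub__` and `fits_in` helpers, the first-fit `pick_band` function, and the string band labels are all hypothetical and are not existing Mars APIs.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Resource:
    # Field names follow the proposal; the zero defaults are an assumption.
    num_cpus: float = 0.0
    num_gpus: float = 0.0
    num_mem_bytes: float = 0.0

    def __sub__(self, other: "Resource") -> "Resource":
        # Subtract a request from the available amount, dimension by dimension.
        return Resource(
            self.num_cpus - other.num_cpus,
            self.num_gpus - other.num_gpus,
            self.num_mem_bytes - other.num_mem_bytes,
        )

    def fits_in(self, available: "Resource") -> bool:
        # A request fits only if every dimension is satisfied.
        return (
            self.num_cpus <= available.num_cpus
            and self.num_gpus <= available.num_gpus
            and self.num_mem_bytes <= available.num_mem_bytes
        )


def pick_band(request: Resource, free: Dict[str, Resource]) -> Optional[str]:
    """Hypothetical first-fit allocation: reserve `request` on the first band that can hold it."""
    for band, available in free.items():
        if request.fits_in(available):
            free[band] = available - request
            return band
    return None


GiB = 1024 ** 3
# Band labels here are made-up strings used only for illustration.
free_bands = {
    "worker-1/numa-0": Resource(num_cpus=1, num_mem_bytes=2 * GiB),
    "worker-2/numa-0": Resource(num_cpus=4, num_mem_bytes=16 * GiB),
}
# A memory-heavy subtask: one CPU but 8 GiB of memory.
chosen = pick_band(Resource(num_cpus=1, num_mem_bytes=8 * GiB), free_bands)
```

With CPU-only slot accounting, the first band would look free for this subtask even though it has only 2 GiB of memory. Checking every dimension skips it and picks the second band instead.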

In addition, we can easily implement HBO (history-based optimization) if we have an external system that can recommend resources for subtasks based on historical information.
If there is no external system, we can set the memory resource to 0, which degenerates to the original slot scheduler, or set a value through configuration to avoid OOM.
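
The fallback described here could be sketched as follows. The `resolve_resource` function, its `recommended` and `default_mem_bytes` parameters, and the dataclass form of `Resource` are assumptions made for illustration and do not come from the Mars code base.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Resource:
    # Field names follow the proposal; the zero defaults are an assumption.
    num_cpus: float = 0.0
    num_gpus: float = 0.0
    num_mem_bytes: float = 0.0


def resolve_resource(
    recommended: Optional[Resource],
    default_mem_bytes: float = 0.0,
    use_gpu: bool = False,
) -> Resource:
    """Hypothetical helper that decides the resource request for one subtask."""
    if recommended is not None:
        # An external HBO system supplied a recommendation: use it as-is.
        return recommended
    # No recommendation: fall back to one slot (one CPU core or one GPU).
    # With default_mem_bytes == 0 this matches the original slot behaviour;
    # a configured non-zero value reserves memory to reduce the risk of OOM.
    if use_gpu:
        return Resource(num_gpus=1.0, num_mem_bytes=default_mem_bytes)
    return Resource(num_cpus=1.0, num_mem_bytes=default_mem_bytes)


slot_like = resolve_resource(None)
configured = resolve_resource(None, default_mem_bytes=512 * 1024 ** 2)
from_hbo = resolve_resource(Resource(num_cpus=0.5, num_mem_bytes=4 * 1024 ** 3))
```

Treating a slot as a `Resource` with zero memory lets a single code path serve both the old and new behaviour.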

Later, we can also estimate the execution time of subtasks if the external HBO system can recommend it.

Plan

In order to implement this proposal, we plan to do:

@wjsi
Member

wjsi commented Mar 7, 2022

Since slots and CPU/memory combinations are both resources, if we unify the slot and resource strategies (for instance, by always specifying the memory requirement as 0), do we really need the strategy itself?

@zhongchun
Contributor Author

Since slots and CPU/memory combinations are both resources, if we unify the slot and resource strategies (for instance, by always specifying the memory requirement as 0), do we really need the strategy itself?

Agree with you, we can unify them. I'll optimize the workflow.

@zhongchun zhongchun changed the title [Proposal] Support more bands allocation strategies [Proposal] Expand slot scheduler to resoure scheduler Mar 15, 2022
@zhongchun zhongchun changed the title [Proposal] Expand slot scheduler to resoure scheduler [Proposal] Expand slot management to resource management Mar 15, 2022
@qinxuye qinxuye added this to the v0.9.0rc2 milestone Mar 24, 2022