Physical resource management for hbo #2731

Closed
chaokunyang opened this issue Feb 21, 2022 · 1 comment · Fixed by #2732

Comments

@chaokunyang
Contributor

Is your feature request related to a problem? Please describe.

Currently Mars uses slot-based resource management for subtask scheduling. For every subtask, it checks whether the corresponding worker has one CPU available for execution. If the worker does have one idle CPU, the subtask is allowed to run on that worker, and the SubtaskManager submits the subtask to it. After the subtask has been submitted, the worker checks whether it has enough resources to execute the subtask. If not, the subtask is held until enough resources are available. This kind of slot-based execution has the following issues:

  • Most subtasks do not use exactly one CPU, so resources are either wasted or oversold.
  • The scheduling doesn't account for memory. If the worker doesn't have enough memory, the subtask waits on that worker, which increases scheduling and execution latency.
  • The scheduling only takes CPU into account. It doesn't use a multi-dimensional resource scheduling model, so it can't achieve more optimal resource utilization for the whole cluster.
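
The first issue can be made concrete with a small sketch. This is not Mars code: `SlotWorker` and `cpu_mismatch` are hypothetical names, and the only assumption is that each admitted subtask reserves exactly one CPU slot regardless of what it really uses.

```python
from dataclasses import dataclass


@dataclass
class SlotWorker:
    """Hypothetical worker that only tracks CPU slots (not a Mars class)."""

    total_slots: int
    used_slots: int = 0

    def try_acquire(self) -> bool:
        # One slot per subtask, whatever the subtask really needs.
        # Memory and GPU are not consulted at all.
        if self.used_slots < self.total_slots:
            self.used_slots += 1
            return True
        return False


def cpu_mismatch(actual_cpu_per_subtask: float, admitted: int) -> float:
    """Reserved CPUs minus CPUs really used.

    A positive result is wasted capacity, a negative result is oversold capacity.
    """
    return admitted * 1 - admitted * actual_cpu_per_subtask


worker = SlotWorker(total_slots=4)
# Six subtasks ask for a slot, but only four slots exist.
admitted = sum(worker.try_acquire() for _ in range(6))

print(admitted)                     # 4
print(cpu_mismatch(0.5, admitted))  # 2.0  (two CPUs sit idle)
print(cpu_mismatch(2, admitted))    # -4   (four CPUs oversold)
```

In both cases the scheduler believes the worker is exactly full, which is why the mismatch never shows up at scheduling time.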

Describe the solution you'd like
We can use a more accurate resource management model for Mars.

  • In the supervisor, we can keep track of the CPU, memory, and GPU usage of every subtask.
  • For every subtask, we can get a resource recommendation plan based on the subtask graph execution history.
  • When we acquire resources for a subtask from the global resource manager, we can submit the accurate resource requirements to the resource manager. If it returns true, we can be sure that the subtask can be executed on the specified worker, using about the specified resources, without any delay, and there won't be any resource waste or overselling. For example, if we have 1000 subtasks to schedule, each needing 0.5 CPU for 30 minutes and little memory, the current slot-based model needs 1000 CPUs for 30 minutes. With accurate requirements we only need 500 CPUs, which is about a 2x improvement in resource utilization.
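
A minimal sketch of what such multi-dimensional bookkeeping could look like is shown here. It is illustrative only: `Resource`, `ResourceManager`, `acquire`, and `release` are hypothetical names chosen for this example, not the existing Mars API, and the worker capacity is made up.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Resource:
    """Hypothetical multi-dimensional resource vector."""

    cpu: float = 0.0
    memory: float = 0.0  # bytes
    gpu: float = 0.0

    def __add__(self, other: "Resource") -> "Resource":
        return Resource(self.cpu + other.cpu,
                        self.memory + other.memory,
                        self.gpu + other.gpu)

    def __sub__(self, other: "Resource") -> "Resource":
        return Resource(self.cpu - other.cpu,
                        self.memory - other.memory,
                        self.gpu - other.gpu)

    def fits_in(self, capacity: "Resource") -> bool:
        # Every dimension must fit, not only CPU.
        return (self.cpu <= capacity.cpu
                and self.memory <= capacity.memory
                and self.gpu <= capacity.gpu)


class ResourceManager:
    """Hypothetical global bookkeeping of per-worker usage."""

    def __init__(self, capacities: Dict[str, Resource]):
        self._capacity = dict(capacities)
        self._used = {worker: Resource() for worker in capacities}

    def acquire(self, worker: str, request: Resource) -> bool:
        wanted = self._used[worker] + request
        if not wanted.fits_in(self._capacity[worker]):
            return False  # caller should try another worker or wait
        self._used[worker] = wanted
        return True

    def release(self, worker: str, request: Resource) -> None:
        self._used[worker] = self._used[worker] - request


# Assumed capacity: one worker with 4 CPUs and 8 GiB of memory.
manager = ResourceManager({"worker-0": Resource(cpu=4, memory=8 * 1024**3)})
# Assumed per-subtask requirement: 0.5 CPU and 64 MiB.
request = Resource(cpu=0.5, memory=64 * 1024**2)

granted = sum(manager.acquire("worker-0", request) for _ in range(10))
print(granted)  # 8
```

With one slot per subtask the same 4-CPU worker would admit only 4 subtasks, so the 8 admitted here mirror the 1000-versus-500 CPU example above: 1000 subtasks at 0.5 CPU each need 1000 × 0.5 = 500 CPUs.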
@wjsi
Member

wjsi commented Feb 21, 2022

Some of this design conflicts with #2631. We need to consider further whether to add more load to GlobalSlotManagerActor.
