enable-gang-scheduling not work #1387


Closed
chenwenjun-github opened this issue Mar 27, 2021 · 6 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@chenwenjun-github

chenwenjun-github commented Mar 27, 2021

What happened:
I use kubeflow/tf-operator with --enable-gang-scheduling=true, running on Kubernetes with several namespaces.
When I submit two identical TFJobs at the same time and the namespace ResourceQuota is only large enough for one of them, both TFJobs still start occupying resources, which causes a deadlock.

I also tried submitting a single TFJob whose resource requests exceed the namespace ResourceQuota.
[screenshot of the PodGroup status omitted]
Although the PodGroup shows that resources are insufficient for all pods of the TFJob, several pods are still created and sit in Pending status.
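For context, a minimal setup of the kind described above might look like the sketch below; the namespace, names, image, and resource sizes are illustrative assumptions, not values taken from this report.

```yaml
# Illustrative reproduction setup (all names and sizes are assumptions):
# a namespace quota that is only large enough for one TFJob.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota          # hypothetical name
  namespace: team-a         # hypothetical namespace
spec:
  hard:
    requests.cpu: "8"
    requests.memory: 16Gi
---
# A TFJob whose workers together consume roughly the whole quota; submitting
# two of these at once (or one sized above the quota) reproduces the
# behaviour described above.
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: tfjob-demo          # hypothetical name
  namespace: team-a
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 4
      template:
        spec:
          containers:
            - name: tensorflow
              image: tensorflow/tensorflow:2.4.1   # illustrative image
              resources:
                requests:
                  cpu: "2"
                  memory: 4Gi
```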

What you expected to happen:
When resources are insufficient for all pods of a TFJob, none of the TFJob's pods should be created,
because a pod in Pending status still has resources reserved for it (its requests count against the quota).
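For reference, when --enable-gang-scheduling=true is set, the tf-operator creates a Volcano PodGroup for the job; a rough sketch of such a PodGroup is shown below (names and numbers are assumptions). The expectation stated above is that, while the group's minimum cannot be satisfied, no worker pods should be created at all.

```yaml
# Rough, illustrative sketch (not copied from this report) of the PodGroup
# that gang scheduling relies on. Volcano keeps the group Pending while
# minResources cannot be satisfied; the bug reported here is that worker
# pods get created anyway and hold quota while Pending.
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: tfjob-demo          # assumed to follow the owning TFJob's name
  namespace: team-a         # hypothetical namespace
spec:
  minMember: 4              # all replicas must be schedulable together
  minResources:
    cpu: "8"
    memory: 16Gi
status:
  phase: Pending            # insufficient resources, so the group never runs
```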

How to reproduce it (as minimally and precisely as possible):
As described above.

Anything else we need to know?:
The tf-operator version is also a recent one.

Environment:

  • Volcano Version: 1.2.0
  • Kubernetes version (use kubectl version): 1.13
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
@chenwenjun-github added the kind/bug label on Mar 27, 2021
@Thor-wl
Contributor

Thor-wl commented Mar 29, 2021

/assign @william-wang Please help take a look

@shinytang6
Member

xref: kubeflow/trainer#1264

@william-wang
Member

OK, let me check and investigate in my environment.

@Thor-wl
Contributor

Thor-wl commented Apr 2, 2021

According to the discussion at the community weekly meeting, the advice is as follows:

  • Switch from the tf-operator to the Volcano Job API; this problem does not occur there (see the sketch below).
  • Submit an issue to tf-operator so that it takes gang scheduling into account when creating pods.
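A minimal sketch of the Volcano Job alternative mentioned in the first point, assuming a plain TensorFlow worker task; the names, image, and sizes are illustrative. With minAvailable set, the whole gang is admitted (or not) as a unit, which is the reason given above that the deadlock does not occur with this API.

```yaml
# Minimal, illustrative Volcano Job (field values are assumptions).
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: tf-gang-demo        # hypothetical name
  namespace: team-a         # hypothetical namespace
spec:
  schedulerName: volcano
  minAvailable: 4           # gang constraint: all 4 workers or none
  tasks:
    - name: worker
      replicas: 4
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: tensorflow
              image: tensorflow/tensorflow:2.4.1   # illustrative image
              resources:
                requests:
                  cpu: "2"
                  memory: 4Gi
```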

@stale

stale bot commented Jul 1, 2021

Hello 👋 Looks like there was no activity on this issue for the last 90 days.
Do you mind updating us on the status? Is this still reproducible or needed? If so, just comment on this issue or push a commit. Thanks! 🤗
If there is no activity for another 60 days, this issue will be closed (we can always reopen an issue if we need!).

@stale stale bot added the lifecycle/stale label on Jul 1, 2021
@stale

stale bot commented Aug 30, 2021

Closing for now, as there was no activity for 60 days after this was marked as stale; let us know if you need this to be reopened! 🤗

@stale stale bot closed this as completed Aug 30, 2021