enable-gang-scheduling does not work #1264
Comments
Are you using volcano? |
Yes, I do. |
OK, I will have a look ASAP. /cc @Thor-wl @shinytang6 |
Yes, we've noticed this issue and are analyzing it. |
I'm sorry for not reviewing the code. Let me take a look. |
If you only use the default queue and …
P.S.: even though your queue has declared a capability, the podgroup created by tf-operator is still incorrect at the moment. Similar issue: kubeflow/common#112 |
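For reference, a volcano queue with an explicitly declared capability looks roughly like this (a minimal sketch; the cpu/memory values are placeholders, not taken from this issue):

```yaml
# Minimal sketch of a volcano Queue with a declared capability.
# The resource values are illustrative placeholders.
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: default
spec:
  weight: 1
  capability:
    cpu: "8"
    memory: 16Gi
```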
@shinytang6 Thank you. I also find that the status of the podgroup is always Inqueue, even when the ResourceQuota of the namespace is not enough for the TFJob, when I test the gang scheduler. |
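For context, the podgroup that tf-operator creates for gang scheduling has roughly the following shape (a hedged sketch; all names and values are illustrative, and kubeflow/common#112 is precisely about these fields being populated incorrectly):

```yaml
# Hedged sketch of the PodGroup shape tf-operator creates.
# Names and values are placeholders, not taken from this issue.
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: my-tfjob          # hypothetical name
  namespace: my-namespace # hypothetical namespace
spec:
  minMember: 3            # how many pods the gang needs before any may run
  minResources:           # aggregate resources the gang needs (see kubeflow/common#112)
    cpu: "6"
    memory: 12Gi
status:
  phase: Inqueue          # the phase observed above, even when quota is short
```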
Emm, I don't think so. The tasks scheduled by volcano are generated based on the pod with … |
@shinytang6 You are right. volcano doesn't create the pods; the pods and the podgroup are created by tf-operator and then scheduled by volcano. And I think I now know what my problem is. |
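To make that division of labor concrete, a pod created by tf-operator under gang scheduling is handed to volcano roughly like this (a sketch; the names are placeholders, and the schedulerName/group-name wiring is the conventional pattern rather than a quote of this repo's exact output):

```yaml
# Sketch of how a tf-operator pod is handed to volcano for scheduling.
# Names are hypothetical; the key parts are schedulerName and the
# group-name annotation tying the pod to its PodGroup.
apiVersion: v1
kind: Pod
metadata:
  name: my-tfjob-worker-0                  # hypothetical
  annotations:
    scheduling.k8s.io/group-name: my-tfjob # links the pod to the gang's PodGroup
spec:
  schedulerName: volcano                   # scheduled by volcano, not default-scheduler
  containers:
  - name: tensorflow
    image: tensorflow/tensorflow:2.4.0     # placeholder image
```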
@shinytang6 and @jiangkaihua discussed this issue last night, and they are working on it. |
@Thor-wl: GitHub didn't allow me to assign the following users: jiangkaihua. Note that only kubeflow members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
What happened:
I use kubeflow/tf-operator with --enable-gang-scheduling=true, and I have installed volcano.
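For reference, enabling this flag on the operator deployment looks roughly like the following (a sketch; the deployment, namespace, and image names are placeholders for however tf-operator is installed):

```yaml
# Sketch: enabling gang scheduling on the tf-operator deployment.
# Deployment/container/image names are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tf-job-operator
  namespace: kubeflow
spec:
  selector:
    matchLabels:
      name: tf-job-operator
  template:
    metadata:
      labels:
        name: tf-job-operator
    spec:
      containers:
      - name: tf-job-operator
        image: kubeflow/tf-operator:latest # placeholder tag
        args:
        - --enable-gang-scheduling=true    # the flag under discussion
```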
When I submit two identical TFJobs at the same time and the namespace's ResourceQuota is only enough for one TFJob, both TFJobs still start to occupy resources and cause a deadlock.
I also tried submitting a single TFJob that requests more resources than the namespace's ResourceQuota allows.
Although the podgroup shows that the resources are not enough for all pods of the TFJob, several pods are still created and sit in Pending status. A sketch of the kind of quota involved follows below.
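The namespace quota in play is of this general shape (a minimal sketch; the name and limits are placeholders, sized so that one TFJob fits but two do not):

```yaml
# Minimal sketch of a namespace ResourceQuota of the kind described above.
# Name and limits are placeholders: sized so one TFJob fits, two do not.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tfjob-quota       # hypothetical name
  namespace: my-namespace # hypothetical namespace
spec:
  hard:
    requests.cpu: "8"
    requests.memory: 16Gi
```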
What you expected to happen:
When the resources are not enough for all pods of the TFJob, none of the TFJob's pods should be created,
because pods in Pending status are still allocated resources (they count against the namespace quota).
How to reproduce it (as minimally and precisely as possible):
As described above.
Anything else we need to know?:
The tf-operator version is also a recent one.
I use the default queue, and I have many namespaces; does this have an impact?
Environment:
- Kubernetes version (kubectl version): 1.13
- OS (uname -a):