Flag to not advertise NUMA information #320
Comments
@blackgold This is really interesting - have you got numbers for how long the TM calculation is taking? Does it impact your container startup time? It would be really helpful to understand the impact of the Topology calculation.
@blackgold I assume you also have other device plugin instances running in the same cluster that require NUMA advertising, correct? So disabling the NUMA policy in kubelet is not an option here.
It takes more than 20 minutes. The job controller kills jobs that stay in the pending state for more than 20 minutes after binding to a node.
Ack. We have a GPU device plugin advertising topology information, so we cannot disable it in kubelet.
@blackgold Is this an 8 NUMA zone node? I didn't realize the Topology Manager calculation could take so long - any extra information on the setup and config would be great. @zshi-redhat this seems like a must-have for these sorts of situations. Do you think it would work as a cmd flag, i.e. daemonset-wide (but not necessarily cluster-wide), or would it be better to have it as a per-pool config (which would allow TM to be active for SRIOV on some pools but not on others)?
Yup, it's an 8 NUMA zone node with 8 GPUs, 8 RDMA devices and 255 CPUs. https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/topologymanager/policy.go#L142 When I timed it, it took 220 seconds to generate permutations from [8,0] to [8,96], roughly 22944 function calls. 220 seconds seems like a lot for that many function calls. Need to debug more.
I think having a per-pool config would allow more flexibility and ultimately solve any relevant issues. For example, one device plugin instance could serve several resource pools, with NUMA enabled for some pools but not for others. For this particular case, my understanding is that the GPU is advertised by a different device plugin (maybe not SRIOV), so having a CLI option would be enough.
Was an issue filed against the Topology Manager? Maybe the algorithm can be improved.
If you guys think it's reasonable to control this using a CLI option, I can send out an MR.
I'm fine with using a CLI option. This is aligned with the discussion we had in #320 and the resource mgmt meeting: to have a featureGate for features that may need to be enabled/disabled. I think NUMA could be one example of such. /cc @killianmuldoon @ahalim-intel @adrianchiris @martinkennelly
@zshi-redhat I think a feature gate is a good idea here for sure, but we should think about implementing per-pool NUMA awareness (default on, opt out for a specific pool) for advanced cases where SRIOV topology may not be important (one NIC per node, multi-resource NUMA constraints).
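To make the per-pool idea above concrete, here is a minimal Go sketch of "default on, opt out for a specific pool". Everything in it is an assumption for illustration: `Device`, `PoolConfig`, `DisableNUMA` and `buildDevice` are hypothetical stand-ins, not the plugin's actual types or config keys, and `Device` only mimics the shape of the device plugin API's device message.

```go
package main

import "fmt"

// Device is a local stand-in for the device message a plugin reports
// to kubelet. A nil NUMANodes slice means "no topology advertised".
type Device struct {
	ID        string
	NUMANodes []int
}

// PoolConfig is a hypothetical per-pool configuration. The zero value
// keeps NUMA advertising on, so existing pools behave as before.
type PoolConfig struct {
	Name        string
	DisableNUMA bool // hypothetical opt-out switch for this pool only
}

// buildDevice attaches the NUMA node to the device unless the pool
// opted out or the node is unknown (negative).
func buildDevice(id string, numaNode int, pool PoolConfig) Device {
	d := Device{ID: id}
	if !pool.DisableNUMA && numaNode >= 0 {
		d.NUMANodes = []int{numaNode}
	}
	return d
}

func main() {
	pools := []PoolConfig{
		{Name: "sriov_default"},                 // NUMA advertised
		{Name: "rdma_pool", DisableNUMA: true}, // NUMA suppressed
	}
	for _, p := range pools {
		d := buildDevice("0000:3b:00.1", 3, p)
		fmt.Printf("pool=%s device=%s numa=%v\n", p.Name, d.ID, d.NUMANodes)
	}
}
```

A daemonset-wide CLI flag would amount to applying the same switch to every pool at once; the per-pool form only moves the decision into each pool's configuration.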
What would you like to be added?
Flag to not advertise NUMA information
What is the use case for this feature / enhancement?
The logic that generates placement hints in the Topology Manager is exponential in the number of NUMA nodes.
With 8 NUMA nodes it takes a really long time.
We are not using NUMA information for RDMA, so it would be helpful if the device plugin could skip sending it (configurable by a flag).
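As a rough illustration of why the cost grows so quickly, the Go sketch below counts the candidate NUMA bitmasks and the combinations that arise when one hint is taken from each hint provider. It is a worst-case model under stated assumptions, not the kubelet's code: it assumes every provider returns a hint for every non-empty mask, which real providers may not do, and `nonEmptyMasks` and `countCombinations` are names made up for this example.

```go
package main

import "fmt"

// nonEmptyMasks returns every non-empty bitmask over n NUMA nodes.
// There are 2^n - 1 of them, so the list doubles with each added node.
func nonEmptyMasks(n int) []uint64 {
	limit := uint64(1) << uint(n)
	masks := make([]uint64, 0, limit-1)
	for m := uint64(1); m < limit; m++ {
		masks = append(masks, m)
	}
	return masks
}

// countCombinations returns how many tuples exist when exactly one
// hint is picked from each provider (the cross product of the lists).
func countCombinations(hintsPerProvider []int) int {
	total := 1
	for _, h := range hintsPerProvider {
		total *= h
	}
	return total
}

func main() {
	// Three providers stand in for CPU, GPU and RDMA in this example.
	for _, n := range []int{2, 4, 8} {
		per := len(nonEmptyMasks(n))
		all := countCombinations([]int{per, per, per})
		fmt.Printf("numa=%d masks/provider=%d combinations=%d\n", n, per, all)
	}
}
```

In this model, a provider that advertises no NUMA information contributes no per-mask hints, which is the reduction the requested flag is after.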