抢占的第二步是从可以实施抢占动作的节点中找出一个最佳目标节点。通过 SelectCandidate()
来实现。
func SelectCandidate(candidates []Candidate) Candidate {
if len(candidates) == 0 {
return nil
}
if len(candidates) == 1 {
return candidates[0]
}
victimsMap := candidatesToVictimsMap(candidates)
...
}
func candidatesToVictimsMap(candidates []Candidate) map[string]*extenderv1.Victims {
m := make(map[string]*extenderv1.Victims)
for _, c := range candidates {
m[c.Name()] = c.Victims()
}
return m
}
首先,使用 candidatesToVictimsMap()
将数组转化成 Map,其中的 key 是节点名称,value 是此节点上需要被抢占的 Pod 列表。
candidateNode := pickOneNodeForPreemption(victimsMap)
// Same as candidatesToVictimsMap, this logic is not applicable for out-of-tree
// preemption plugins that exercise different candidates on the same nominated node.
if victims := victimsMap[candidateNode]; victims != nil {
return &candidate{
victims: victims,
name: candidateNode,
}
}
// We shouldn't reach here.
klog.Errorf("should not reach here, no candidate selected from %v.", candidates)
// To not break the whole flow, return the first candidate.
return candidates[0]
}
然后调用 pickOneNodeForPreemption()
从刚才的 Map 中找出一个最佳节点,然后将结果返回。这个函数的执行包括很多步骤,如果在其中的某一步过滤后只剩一个节点那么就直接返回,否则会执行下一个步骤进行再一次的筛选。
-
针对每个节点,如果把这些被抢占的 Pod 删除后,会造成多少个干扰(violations of PodDisruptionBudget)受到影响。这一步找出这个值最小的节点,如果只有一个节点这个值最小,那么直接返回,如果有多个节点的这个值都是相等的最小值,则执行下一步筛选。
从这一步筛选逻辑可以看出,抢占调度会尽可能保证干扰(PodDisruptionBudget)的正常工作。
代码如下:
func pickOneNodeForPreemption(nodesToVictims map[string]*extenderv1.Victims) string { if len(nodesToVictims) == 0 { return "" } minNumPDBViolatingPods := int64(math.MaxInt32) var minNodes1 []string lenNodes1 := 0 for node, victims := range nodesToVictims { numPDBViolatingPods := victims.NumPDBViolations if numPDBViolatingPods < minNumPDBViolatingPods { minNumPDBViolatingPods = numPDBViolatingPods minNodes1 = nil lenNodes1 = 0 } if numPDBViolatingPods == minNumPDBViolatingPods { minNodes1 = append(minNodes1, node) lenNodes1++ } } if lenNodes1 == 1 { return minNodes1[0] }
-
针对每个节点,在被抢占的 Pod 列表中,索引为 0 的 Pod 的优先级最高。这个步骤会遍历每个节点,找出索引为 0 的 Pod 的优先级最低的那个节点。如果只有一个节点,则直接返回;如果有多个节点的这个优先级都为最低值,则进行下一步的过滤。
代码如下:
// There are more than one node with minimum number PDB violating pods. Find // the one with minimum highest priority victim. minHighestPriority := int32(math.MaxInt32) var minNodes2 = make([]string, lenNodes1) lenNodes2 := 0 for i := 0; i < lenNodes1; i++ { node := minNodes1[i] victims := nodesToVictims[node] // highestPodPriority is the highest priority among the victims on this node. highestPodPriority := corev1helpers.PodPriority(victims.Pods[0]) if highestPodPriority < minHighestPriority { minHighestPriority = highestPodPriority lenNodes2 = 0 } if highestPodPriority == minHighestPriority { minNodes2[lenNodes2] = node lenNodes2++ } } if lenNodes2 == 1 { return minNodes2[0] }
-
针对每个节点,计算所有被抢占的 Pod 的优先级的总和,挑选中总和最小的那个节点。如果有一个节点,则直接返回;否则,如果有多个节点的 Pod 优先级总和都一样小,则进行下一步的筛选。
从这一步和上一步中可以看出抢占的时候会优先选择那些 Pod 优先级低的节点。
代码如下:
// There are a few nodes with minimum highest priority victim. Find the
// smallest sum of priorities.
minSumPriorities := int64(math.MaxInt64)
lenNodes1 = 0
for i := 0; i < lenNodes2; i++ {
var sumPriorities int64
node := minNodes2[i]
for _, pod := range nodesToVictims[node].Pods {
// We add MaxInt32+1 to all priorities to make all of them >= 0. This is
// needed so that a node with a few pods with negative priority is not
// picked over a node with a smaller number of pods with the same negative
// priority (and similar scenarios).
sumPriorities += int64(corev1helpers.PodPriority(pod)) + int64(math.MaxInt32+1)
}
if sumPriorities < minSumPriorities {
minSumPriorities = sumPriorities
lenNodes1 = 0
}
if sumPriorities == minSumPriorities {
minNodes1[lenNodes1] = node
lenNodes1++
}
}
if lenNodes1 == 1 {
return minNodes1[0]
}
-
找出被抢占的 Pod 数量最小的那个节点。如果有一个节点,则直接返回;否则,如果有多个节点的 Pod 数量都一样,则进行下一步的筛选。
// There are a few nodes with minimum highest priority victim and sum of priorities. // Find one with the minimum number of pods. minNumPods := math.MaxInt32 lenNodes2 = 0 for i := 0; i < lenNodes1; i++ { node := minNodes1[i] numPods := len(nodesToVictims[node].Pods) if numPods < minNumPods { minNumPods = numPods lenNodes2 = 0 } if numPods == minNumPods { minNodes2[lenNodes2] = node lenNodes2++ } } if lenNodes2 == 1 { return minNodes2[0] }
-
如果到了这一步仍然有多个备选节点,那么会按照被抢占 Pod 的时间顺序来选择节点。
代码如下:
latestStartTime := util.GetEarliestPodStartTime(nodesToVictims[minNodes2[0]]) ... nodeToReturn := minNodes2[0] for i := 1; i < lenNodes2; i++ { node := minNodes2[i] earliestStartTimeOnNode := util.GetEarliestPodStartTime(nodesToVictims[node]) ... if earliestStartTimeOnNode.After(latestStartTime.Time) { latestStartTime = earliestStartTimeOnNode nodeToReturn = node } } return nodeToReturn } func GetEarliestPodStartTime(victims *extenderv1.Victims) *metav1.Time { ... earliestPodStartTime := GetPodStartTime(victims.Pods[0]) maxPriority := corev1helpers.PodPriority(victims.Pods[0]) for _, pod := range victims.Pods { if corev1helpers.PodPriority(pod) == maxPriority { if GetPodStartTime(pod).Before(earliestPodStartTime) { earliestPodStartTime = GetPodStartTime(pod) } } else if corev1helpers.PodPriority(pod) > maxPriority { maxPriority = corev1helpers.PodPriority(pod) earliestPodStartTime = GetPodStartTime(pod) } } return earliestPodStartTime }
会遍历每一个候选节点,使用
GetEarliestPodStartTime()
找出当前节点上优先级最高的 Pod 列表,然后找出创建时间最早的那个 Pod 的创建时间并返回。接着会比较所有候选节点上的这个返回的时间,找出最早的那个,然后将此节点返回。
至此,经过以上一系列的步骤,会选出一个用于抢占的最佳的目标节点,下一步会实施真正的抢占动作,下一节会对此进行分析。