
Add wildcard tolerations to ComputeDomain pods (issue #305) #306

Merged
jgehrcke merged 1 commit into kubernetes-sigs:main from jgehrcke:jp/cdpod-tolerations on Mar 26, 2025


jgehrcke (Contributor)

For #305. This is not yet tested. Will report back.


Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
copy-pr-bot Bot commented Mar 26, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

jgehrcke (Contributor, Author)

OK, I tested this on Luna. Confirming that the expected tolerations are set:

$ kubectl get pod -n nvidia-dra-driver-gpu   imex-channel-injection-c9s9r-5pzxk -o yaml | grep -C15 tolerations
<snip>
  nodeSelector:
    resource.nvidia.com/computeDomain: 25c467f6-b235-477d-a306-09c8e88311a7
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  resourceClaims:
  - name: compute-domain-daemon
    resourceClaimTemplateName: imex-channel-injection-daemon-claim-template-w6qnx
<snip>
  tolerations:
  - effect: NoSchedule
    operator: Exists
  - effect: NoExecute
    operator: Exists
  - effect: PreferNoSchedule
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/disk-pressure
    operator: Exists

The first three tolerations are the new catch-alls. The other three are added by Kubernetes by default.
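To illustrate why the three keyless entries act as catch-alls, here is a minimal Go sketch of the toleration-matching rule as described in the Kubernetes documentation: a toleration with an empty key and operator Exists matches any taint key and value, and a non-empty effect must equal the taint's effect. The Taint and Toleration structs below are simplified, hypothetical stand-ins (not the k8s.io/api/core/v1 types, and not code from this driver), and the example.com taint keys are made up.

```go
package main

import "fmt"

// Taint and Toleration are simplified, hypothetical stand-ins for the
// Kubernetes corev1.Taint and corev1.Toleration types (only the fields
// needed for matching are modeled).
type Taint struct {
	Key    string
	Value  string
	Effect string // "NoSchedule", "PreferNoSchedule" or "NoExecute"
}

type Toleration struct {
	Key      string // "" (with Exists) matches any taint key
	Operator string // "Exists" or "Equal"; "" is treated as "Equal"
	Value    string
	Effect   string // "" matches any effect
}

// toleratesTaint approximates the documented Kubernetes matching rule.
func toleratesTaint(t Toleration, taint Taint) bool {
	// A non-empty effect must match the taint's effect exactly.
	if t.Effect != "" && t.Effect != taint.Effect {
		return false
	}
	// A non-empty key must match the taint's key exactly.
	if t.Key != "" && t.Key != taint.Key {
		return false
	}
	switch t.Operator {
	case "Exists":
		// Exists ignores the taint's value.
		return true
	case "", "Equal":
		return t.Value == taint.Value
	default:
		return false
	}
}

// toleratesAny reports whether at least one toleration matches the taint.
func toleratesAny(tols []Toleration, taint Taint) bool {
	for _, t := range tols {
		if toleratesTaint(t, taint) {
			return true
		}
	}
	return false
}

// wildcardTolerations mirrors the three catch-all entries in the kubectl
// output: no key, operator Exists, one entry per taint effect.
var wildcardTolerations = []Toleration{
	{Operator: "Exists", Effect: "NoSchedule"},
	{Operator: "Exists", Effect: "NoExecute"},
	{Operator: "Exists", Effect: "PreferNoSchedule"},
}

func main() {
	// Example taints; the example.com keys are hypothetical.
	taints := []Taint{
		{Key: "example.com/gpu-maintenance", Value: "true", Effect: "NoSchedule"},
		{Key: "node.kubernetes.io/unreachable", Effect: "NoExecute"},
		{Key: "example.com/draining", Value: "soon", Effect: "PreferNoSchedule"},
	}
	for _, taint := range taints {
		fmt.Printf("%s (%s): tolerated=%v\n",
			taint.Key, taint.Effect, toleratesAny(wildcardTolerations, taint))
	}

	// A list holding only the NoSchedule entry does not cover NoExecute taints.
	onlyNoSchedule := wildcardTolerations[:1]
	fmt.Println("NoExecute taint with NoSchedule-only list:",
		toleratesAny(onlyNoSchedule, taints[1]))
}
```

Under this simplified model every example taint is reported as tolerated by the three-entry list, while the NoSchedule-only list leaves the NoExecute taint untolerated. This illustrates why, when effects are listed explicitly, one keyless entry per effect is needed to cover all taints.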

Copilot AI left a comment

Pull Request Overview

This PR adds wildcard tolerations for ComputeDomain pods to address issue #305.

  • Adds three wildcard tolerations with effects "NoSchedule", "NoExecute", and "PreferNoSchedule".
  • Updates the ComputeDomain daemon template to include these tolerations.

jgehrcke merged commit 50703b0 into kubernetes-sigs:main on Mar 26, 2025
7 checks passed
klueska added this to the v25.3.0 milestone on Aug 13, 2025

Labels

None yet

Projects

None yet


5 participants