ZenML update staging and update Karpenter to enable training on GPU resources #130

Open
dakotabenjamin wants to merge 4 commits into main from zenml-update-staging

@dakotabenjamin
Member

Describe this PR

  1. Updates the OpenTofu config to match what is currently deployed on ZenML.
  2. Adds the label requirements the orchestrator needs to deploy GPU and CPU nodes for training and inference.

Screenshots

We were seeing this when attempting to deploy the training pipeline:
[image]

@dakotabenjamin requested a review from spwoodcock, May 6, 2026 18:56
@dakotabenjamin
Member Author

@spwoodcock I didn't realize Karpenter was already in use! I'll revise.

@spwoodcock
Member

Sorry, I fixed up the Karpenter CPU deploy / policy attachment, which put this out of sync 😅 Needs a conflict resolve.

@spwoodcock
Member

Oh ffs it failed: https://github.com/hotosm/k8s-infra/actions/runs/25458268272/job/74693442224
I hate this flaky workflow-based terraform apply.

Too late for me - will come back to it later

@spwoodcock
Member

Alright, round two - here we go!

@spwoodcock force-pushed the zenml-update-staging branch from 28d5f56 to 0d3beea, May 6, 2026 21:22
@spwoodcock
Member

I fixed the merge conflict 👍

Member

@spwoodcock spwoodcock left a comment

Looks pretty good to me!

@spwoodcock force-pushed the zenml-update-staging branch from 6118be9 to 0d3beea, May 6, 2026 21:46
@spwoodcock force-pushed the zenml-update-staging branch from 0d3beea to d280cfa, May 6, 2026 22:22
@spwoodcock
Member

Made some tweaks to the AMIs (plus CPU done in previous commit), ensuring all the current Kubernetes versions match 🤞

I also added the NVIDIA Device Plugin, which is required if we add labels to pods and expect them to run on GPU nodes.

2 participants