[Kubeflow 1.10] Distributions and Kubeflow #765

Closed
rimolive opened this issue Oct 16, 2024 · 20 comments

@rimolive
Member

rimolive commented Oct 16, 2024

This issue will be used to track the progress of and coordinate with distributions along the 1.10 release.

While we hope all distros will manage to be ready when the KF 1.10 release is out, this is sometimes difficult to achieve. In this issue, we want to both keep track of the progress of distributions towards the KF 1.10 release and also know which of the distros will be working on KF 1.10 (testing during the distribution testing cycle) even if they can't meet the KF 1.10 deadline.

Tagging distribution owners identified from previous releases (Any new or missed distro owners, please comment on this issue)

| Distribution | Representative(s) | State |
| --- | --- | --- |
| Charmed Kubeflow | @mvlassis | Will participate in 1.10 |
| Google Cloud | @zijianjoy, @chensun | |
| IBM IKS | @yhwang | |
| Microsoft | | |
| Nutanix | @johnugeorge, @saileshd1402, @nagar-ajay | Will participate in 1.10 |
| Red Hat OpenShift AI | @rimolive | Will participate in 1.10 |
| Oracle Cloud Infrastructure | @julioo | |
| DeployKF | @thesuperzapper | Will participate in 1.10 |
| VMWare | @liuqi, @xujinheng | |
| QBO | @alexeadem | Will participate in 1.10 |

Please let us know if you'll be participating in the 1.10 release by answering the following questions:

  • Are you planning on having your distro ready in sync with the KF 1.10 release?
  • Will you participate by testing your distro during the distribution testing phase and providing feedback (reporting any issues to the release team)?
  • If you cannot participate, when can the community expect your distro to be ready for release 1.10?

Please note the release timelines are being discussed in #761.

cc @tarilabs @juliusvonkohout @varodrig @diegolovison @tombuuz @dpoulopoulos @saileshd1402 @mvlassis @tarekabouzeid @hbelmiro @milosjava @jbottum

@rimolive rimolive converted this from a draft issue Oct 16, 2024
@alexeadem

Are you planning on having your distro ready in sync with the KF 1.10 release?
yes
Will you participate by testing your distro during the distribution testing phase and providing feedback (reporting any issues to the release team)?
yes
If you cannot participate, when can the community expect your distro to be ready for release 1.10?
n/a

@saileshd1402

saileshd1402 commented Oct 21, 2024

Hi @rimolive,
Could you please add @nagar-ajay and me as Nutanix distribution owners alongside @johnugeorge?

Answers to the distribution participation questions:

  • Are you planning on having your distro ready in sync with the KF 1.10 release? Yes
  • Will you participate by testing your distro during the distribution testing phase and providing feedback (reporting any issues to the release team)? Yes
  • If you cannot participate, when can the community expect your distro to be ready for release 1.10? N/A

@rimolive
Member Author

As the current distribution owner for Red Hat OpenShift AI, I will add the answer to the questions:

  • Are you planning on having your distro ready in sync with the KF 1.10 release? Yes
  • Will you participate by testing your distro during the distribution testing phase and providing feedback (reporting any issues to the release team)? Yes
  • If you cannot participate, when can the community expect your distro to be ready for release 1.10? N/A

@thesuperzapper
Member

  1. deployKF plans to release a GA version that includes the 1.10 versions within a reasonable timeframe of the manifest release.

    • There may also be a deployKF RC version released before the final 1.10.0 is cut, depending on how stable everything is.
  2. As usual, I will also give feedback on the manifests RCs.

  3. See above

@mvlassis

Regarding the Charmed Kubeflow distribution:

  • Are you planning on having your distro ready in sync with the KF 1.10 release?
    • Yes
  • Will you participate by testing your distro during the distribution testing phase and providing feedback (reporting any issues to the release team)?
    • Yes
  • If you cannot participate, when can the community expect your distro to be ready for release 1.10?
    • N/A

Also, if it's possible, keep only myself and not @DnPlas as a point of contact, since I communicate everything with the team :)

@rimolive
Member Author

rimolive commented Nov 4, 2024

/assign @rimolive

@varodrig
Contributor

varodrig commented Nov 4, 2024

@rimolive and @jbottum to follow up on this. let's follow up next week.

@varodrig
Contributor

@rimolive and @jbottum to follow up on this. Let's sync up this week and feel free to add any comments here.

@tarilabs
Member

Update from Ricardo at the release meeting: progressing.

@varodrig
Contributor

varodrig commented Dec 3, 2024

No updates so far from @rimolive

@varodrig
Contributor

@rimolive any news on this?

@varodrig
Contributor

varodrig commented Feb 3, 2025

@rimolive @jbottum I'm following up on the distributions - any news on this?

@rimolive
Member Author

Calling all Distribution owners! We are planning to release rc.2 next Monday March 3rd, and we'll officially begin the distribution testing. One concern raised is that our schedule to release GA is March 31st, and the deadline for distribution testing is very tight.

We'd like to gather more feedback about this concern from the other distributions so we can plan a new release date. I'd really appreciate any feedback so we can decide whether to keep the original schedule or delay the release date.

@juliusvonkohout
Member

juliusvonkohout commented Feb 26, 2025

You can test on the 1.10 branch and https://github.com/kubeflow/manifests/milestone/1 is the milestone with current issues. kubeflow/pipelines#11669 is also quite relevant.

@alexeadem

> Calling all Distribution owners! We are planning to release rc.2 next Monday March 3rd, and we'll officially begin the distribution testing. One concern raised is that our schedule to release GA is March 31st, and the deadline for distribution testing is very tight.
>
> We'd like to gather more feedback about this concern from the other distributions so we can plan a new release date. I really appreciate any feedback so we can decide on keep the original schedule or delay the release date.

I'm OK with your timing. As long as we don't run into issues, testing should be done within those timelines.

@rimolive rimolive moved this from Todo to In Progress in 1.10 Release Mar 10, 2025
@rimolive
Member Author

Calling all Distribution owners. With the rc.2 release last week, we are good to go with distribution testing. We need your feedback on whether testing is running fine, and we need it as soon as possible. Distribution owners who have not yet confirmed participation in distribution testing, please let me know if you can run the tests.

cc @mvlassis @zijianjoy @chensun @yhwang @johnugeorge @saileshd1402 @nagar-ajay @julioo @thesuperzapper @liuqi @xujinheng @alexeadem

@juliusvonkohout
Member

https://github.com/kubeflow/manifests/tree/v1.10-branch is the branch to test, because it will always be ahead of the RCs.

@mvlassis

Hi @rimolive, thank you for reaching out and keeping us in the loop!

On our side (Charmed Kubeflow distribution), because of the delays in the RC release for some of the components (e.g. Notebooks, Katib), we are still currently wrapping up the updates across all distribution artifacts to align them with the latest RC versions. We expect this to be done by the end of this week, such that we can start testing the full bundle on Monday, March 17th, across all our use cases and product integrations to investigate whether regressions are present.

Currently, the plan is 1 week behind schedule, so delaying the release by an additional week would be beneficial and very much appreciated. This extra time would allow us to test the full deployment thoroughly and flag any issues to you, based on both integration testing and our Solution QA extensive testing.

@mvlassis

Providing an update for Charmed Kubeflow:

The QA team tested the bundle and found no issues. This means that we're going to release a beta version of the distribution later today.

@rimolive rimolive moved this from In Progress to Done in 1.10 Release Mar 25, 2025
@rimolive rimolive closed this as completed by moving to Done in 1.10 Release Mar 25, 2025
@alexeadem

alexeadem commented Mar 26, 2025

Hi, sorry for the delay. I've tested QBO with Kubeflow version v1.10.0-rc.2. Everything works as expected now, but I had to make a few changes:

DEX/JWT

Clearing site data or opening an incognito window was necessary to get past this error.

Jwks doesn't have key to match kid or alg from Jwt
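
The thread does not say why clearing site data helped. One plausible reading, stated here as an assumption rather than a confirmed diagnosis, is that the browser still held a token whose `kid` refers to a signing key that the current JWKS no longer lists. The sketch below only illustrates that kind of `kid`/`alg` lookup; the helper names and the sample key IDs are hypothetical, and the token is unsigned test data.

```python
import base64
import json


def decode_jwt_header(token: str) -> dict:
    """Decode the (unverified) header segment of a compact JWT."""
    header_b64 = token.split(".")[0]
    # base64url segments omit padding; restore it before decoding.
    padded = header_b64 + "=" * (-len(header_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


def jwks_has_matching_key(token: str, jwks: dict) -> bool:
    """Return True if some JWKS entry matches the token's kid and alg."""
    header = decode_jwt_header(token)
    for key in jwks.get("keys", []):
        if key.get("kid") != header.get("kid"):
            continue
        # A JWK's "alg" member is optional; compare only when present.
        if "alg" in key and key["alg"] != header.get("alg"):
            continue
        return True
    return False


def make_fake_token(kid: str, alg: str = "RS256") -> str:
    """Build an unsigned, illustrative token; only the header matters here."""
    header = json.dumps({"alg": alg, "kid": kid}).encode()
    encoded = base64.urlsafe_b64encode(header).rstrip(b"=").decode()
    return f"{encoded}.e30.signature"


# Hypothetical data: the provider now publishes only "new-key".
current_jwks = {"keys": [{"kid": "new-key", "alg": "RS256", "kty": "RSA"}]}
stale_token = make_fake_token("old-key")
fresh_token = make_fake_token("new-key")
print(jwks_has_matching_key(stale_token, current_jwks))
print(jwks_has_matching_key(fresh_token, current_jwks))
```

If that reading applies, a stale cookie would keep failing until it is discarded, which would be consistent with the incognito window working.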


RBAC

cat istio-ingressgateway-sds-binding.yaml 
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: istio-ingressgateway-sds
  namespace: istio-system
subjects:
- kind: ServiceAccount
  name: istio-ingressgateway-service-account
  namespace: istio-system
roleRef:
  kind: Role
  name: istio-ingressgateway-sds
  apiGroup: rbac.authorization.k8s.io

and

cat istio-ingressgateway-sds.yaml 
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: istio-ingressgateway-sds
  namespace: istio-system
rules:
- apiGroups: [""]
  resources: ["secrets"]
  verbs: ["get", "watch", "list"]

were necessary to access the Kubeflow Web UI
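
To show why both objects are needed together, here is a deliberately simplified sketch of how a Role and a RoleBinding combine. It copies the two manifests above into Python dicts; the `is_allowed` helper is hypothetical and ignores wildcards, `resourceNames`, ClusterRoles, and every other binding in the cluster, so it is not how the API server actually authorizes requests.

```python
ROLE = {
    "kind": "Role",
    "metadata": {"name": "istio-ingressgateway-sds", "namespace": "istio-system"},
    "rules": [
        {"apiGroups": [""], "resources": ["secrets"], "verbs": ["get", "watch", "list"]}
    ],
}

BINDING = {
    "kind": "RoleBinding",
    "metadata": {"name": "istio-ingressgateway-sds", "namespace": "istio-system"},
    "subjects": [
        {
            "kind": "ServiceAccount",
            "name": "istio-ingressgateway-service-account",
            "namespace": "istio-system",
        }
    ],
    "roleRef": {
        "kind": "Role",
        "name": "istio-ingressgateway-sds",
        "apiGroup": "rbac.authorization.k8s.io",
    },
}


def is_allowed(role, binding, sa_name, sa_namespace, resource, verb):
    """Simplified check: does this binding give the service account the verb?"""
    ref = binding["roleRef"]
    # The binding must point at this Role, in the same namespace.
    same_role = (
        ref["kind"] == role["kind"]
        and ref["name"] == role["metadata"]["name"]
        and binding["metadata"]["namespace"] == role["metadata"]["namespace"]
    )
    # The service account must be listed as a subject.
    is_subject = any(
        s["kind"] == "ServiceAccount"
        and s["name"] == sa_name
        and s.get("namespace") == sa_namespace
        for s in binding["subjects"]
    )
    # Some rule must cover the core-group resource and the verb.
    rule_matches = any(
        "" in rule["apiGroups"]
        and resource in rule["resources"]
        and verb in rule["verbs"]
        for rule in role["rules"]
    )
    return same_role and is_subject and rule_matches


print(is_allowed(ROLE, BINDING, "istio-ingressgateway-service-account",
                 "istio-system", "secrets", "get"))
```

On a live cluster, `kubectl auth can-i get secrets -n istio-system --as=system:serviceaccount:istio-system:istio-ingressgateway-service-account` is the authoritative way to ask the same question.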

kustomize

The options --server-side --force-conflicts are necessary, or I'll get the following errors when running this command. I see you added them here as well:

https://github.com/kubeflow/manifests/blob/0016e6b8c24c4ee34342c76b7a738ade5e494682/README.md?plain=1#L159

Error from server (Invalid): error when creating "STDIN": CustomResourceDefinition.apiextensions.k8s.io "inferenceservices.serving.kserve.io" is invalid: metadata.annotations: Too long: may not be more than 262144 bytes
Error from server (Invalid): error when creating "STDIN": CustomResourceDefinition.apiextensions.k8s.io "paddlejobs.kubeflow.org" is invalid: metadata.annotations: Too long: may not be more than 262144 bytes
Error from server (Invalid): error when creating "STDIN": CustomResourceDefinition.apiextensions.k8s.io "pytorchjobs.kubeflow.org" is invalid: metadata.annotations: Too long: may not be more than 262144 bytes
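
For context on these errors: client-side `kubectl apply` stores a copy of the whole object in the `kubectl.kubernetes.io/last-applied-configuration` annotation, and the API server caps the total size of annotations at 262144 bytes, so very large CRDs can exceed the cap. Server-side apply tracks field ownership in `managedFields` and does not write that annotation. The sketch below is a rough estimate only, since kubectl's exact serialization may differ; the function names and the sample manifests are hypothetical.

```python
import json

# Limit quoted in the error messages above (256 KiB).
MAX_ANNOTATIONS_BYTES = 262144
LAST_APPLIED = "kubectl.kubernetes.io/last-applied-configuration"


def annotations_size_after_client_apply(manifest: dict) -> int:
    """Estimate total annotation bytes once client-side apply adds its copy."""
    annotations = dict(manifest.get("metadata", {}).get("annotations", {}))
    # Client-side apply serializes the whole object into one annotation value.
    annotations[LAST_APPLIED] = json.dumps(manifest, separators=(",", ":"))
    # The limit counts key lengths plus value lengths across all annotations.
    return sum(len(k.encode()) + len(v.encode()) for k, v in annotations.items())


def fits_client_side_apply(manifest: dict) -> bool:
    """True if the estimated annotation size stays within the limit."""
    return annotations_size_after_client_apply(manifest) <= MAX_ANNOTATIONS_BYTES


# Hypothetical manifests: a tiny object and one with a very large schema text.
small = {"apiVersion": "v1", "kind": "ConfigMap", "metadata": {"name": "small"}}
large = {
    "apiVersion": "apiextensions.k8s.io/v1",
    "kind": "CustomResourceDefinition",
    "metadata": {"name": "example.large.crd"},
    "spec": {"description": "x" * 300000},
}
print(fits_client_side_apply(small), fits_client_side_apply(large))
```

Running a check like this over the rendered kustomize output would flag which CRDs need `--server-side` before any apply is attempted.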

After those changes, Kubeflow is working as expected with the NVIDIA GPU Operator and the following components:

NAME                    CHART VERSION   APP VERSION     DESCRIPTION                                       
nvidia/gpu-operator     v24.9.2         v24.9.2         NVIDIA GPU Operator creates/configures/manages ...
NVIDIA-SMI 570.124.06 
kubectl version
Client Version: v1.31.0
Kustomize Version: v5.4.2
Server Version: v1.32.3
kubectl get pods --all-namespaces -o jsonpath="{..image}" | sed 's/ /\n/g' | sort | uniq
docker.io/istio/pilot:1.24.2
docker.io/istio/proxyv2:1.24.2
docker.io/kindest/kindnetd:v20220726-ed811e41
docker.io/kindest/local-path-provisioner:v0.0.22-kind.0
docker.io/kserve/kserve-controller:v0.14.1
docker.io/kserve/kserve-localmodel-controller:v0.14.1
docker.io/kserve/models-web-app:v0.14.0-rc.0
docker.io/kubeflow/training-operator:v1-5170a36
docker.io/kubeflowkatib/katib-controller:v0.18.0-rc.0
docker.io/kubeflowkatib/katib-db-manager:v0.18.0-rc.0
docker.io/kubeflowkatib/katib-ui:v0.18.0-rc.0
docker.io/kubeflownotebookswg/centraldashboard:v1.10.0-rc.1
docker.io/kubeflownotebookswg/jupyter-scipy:v1.10.0-rc.1
docker.io/kubeflownotebookswg/jupyter-web-app:v1.10.0-rc.1
docker.io/kubeflownotebookswg/kfam:v1.10.0-rc.1
docker.io/kubeflownotebookswg/notebook-controller:v1.10.0-rc.1
docker.io/kubeflownotebookswg/poddefaults-webhook:v1.10.0-rc.1
docker.io/kubeflownotebookswg/profile-controller:v1.10.0-rc.1
docker.io/kubeflownotebookswg/pvcviewer-controller:v1.10.0-rc.1
docker.io/kubeflownotebookswg/tensorboard-controller:v1.10.0-rc.1
docker.io/kubeflownotebookswg/tensorboards-web-app:v1.10.0-rc.1
docker.io/kubeflownotebookswg/volumes-web-app:v1.10.0-rc.1
docker.io/library/mysql:8.0.29
docker.io/library/python:3.9
gcr.io/knative-releases/knative.dev/net-istio/cmd/controller@sha256:e70bc675f97778da144157f125b3001124ba7a5903b85dab9e77776352fea1c7
gcr.io/knative-releases/knative.dev/net-istio/cmd/webhook@sha256:7d76a6d42d139ed53aae3ca2dfd600b1c776eb85a17af64dd1b604176a4b132a
gcr.io/knative-releases/knative.dev/serving/cmd/activator@sha256:cc39d40985f7b37ba384a857d194a24ac5eae7e204aac4ed9bf4ebfd8d62e721
gcr.io/knative-releases/knative.dev/serving/cmd/autoscaler@sha256:59c2e7ad52cea17bedfc2aca9b9e33060bb34f04d35fd71fe61147bcbdb881e4
gcr.io/knative-releases/knative.dev/serving/cmd/controller@sha256:0e47362d044f8eac84595ed0a9fdf22e5dd5a07cc7a5df74e93eb5ad17ad4827
gcr.io/knative-releases/knative.dev/serving/cmd/webhook@sha256:d42e2f83c9018779465860fdc67ce6ada3eac8ba8c47c5c2127c0bb45f9b328a
gcr.io/ml-pipeline/minio:RELEASE.2019-08-14T20-37-41Z-license-compliance
gcr.io/ml-pipeline/mysql:8.0.26
gcr.io/ml-pipeline/workflow-controller:v3.4.17-license-compliance
gcr.io/tfx-oss-public/ml_metadata_store_server:1.14.0
ghcr.io/dexidp/dex:v2.41.1
ghcr.io/kubeflow/kfp-api-server:2.4.1
ghcr.io/kubeflow/kfp-cache-deployer:2.4.1
ghcr.io/kubeflow/kfp-cache-server:2.4.1
ghcr.io/kubeflow/kfp-frontend:2.4.1
ghcr.io/kubeflow/kfp-metadata-envoy:2.4.1
ghcr.io/kubeflow/kfp-metadata-writer:2.4.1
ghcr.io/kubeflow/kfp-persistence-agent:2.4.1
ghcr.io/kubeflow/kfp-scheduled-workflow-controller:2.4.1
ghcr.io/kubeflow/kfp-viewer-crd-controller:2.4.1
ghcr.io/kubeflow/kfp-visualization-server:2.4.1
ghcr.io/metacontroller/metacontroller:v4.11.22
kserve/kserve-controller:v0.14.1
kserve/kserve-localmodel-controller:v0.14.1
kserve/models-web-app:v0.14.0-rc.0
kubeflow/training-operator:v1-5170a36
kubeflownotebookswg/jupyter-scipy:v1.10.0-rc.1
mysql:8.0.29
nvcr.io/nvidia/cloud-native/gpu-operator-validator:v24.9.2
nvcr.io/nvidia/gpu-operator:v24.9.2
nvcr.io/nvidia/k8s-device-plugin:v0.17.0
nvcr.io/nvidia/k8s/container-toolkit:v1.17.4-ubuntu20.04
nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04
nvcr.io/nvidia/k8s/dcgm-exporter:3.3.9-3.6.1-ubuntu22.04
python:3.9
quay.io/brancz/kube-rbac-proxy:v0.13.1
quay.io/brancz/kube-rbac-proxy:v0.18.0
quay.io/brancz/kube-rbac-proxy:v0.8.0
quay.io/jetstack/cert-manager-cainjector:v1.16.1
quay.io/jetstack/cert-manager-controller:v1.16.1
quay.io/jetstack/cert-manager-webhook:v1.16.1
quay.io/oauth2-proxy/oauth2-proxy:v7.7.1
registry.k8s.io/coredns/coredns:v1.11.3
registry.k8s.io/etcd:3.5.16-0
registry.k8s.io/kube-apiserver-amd64:v1.32.3
registry.k8s.io/kube-apiserver:v1.32.3
registry.k8s.io/kube-controller-manager-amd64:v1.32.3
registry.k8s.io/kube-controller-manager:v1.32.3
registry.k8s.io/kube-proxy-amd64:v1.32.3
registry.k8s.io/kube-proxy:v1.32.3
registry.k8s.io/kube-scheduler-amd64:v1.32.3
registry.k8s.io/kube-scheduler:v1.32.3
registry.k8s.io/nfd/node-feature-discovery:v0.16.6
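
Several entries in the list above appear twice, once with an explicit `docker.io/` prefix and once in Docker Hub shorthand (for example `mysql:8.0.29` and `docker.io/library/mysql:8.0.29`), because `sort | uniq` compares the raw strings. A minimal sketch of normalizing references before deduplicating follows; it assumes the usual convention that a first path component without a dot or colon (and not `localhost`) is a Docker Hub namespace, and the function names are hypothetical.

```python
def normalize_image(ref: str) -> str:
    """Expand Docker Hub shorthand so equivalent references compare equal."""
    first, sep, _ = ref.partition("/")
    has_registry = bool(sep) and ("." in first or ":" in first or first == "localhost")
    if has_registry:
        return ref
    if not sep:
        # Bare names such as "mysql:8.0.29" live under the "library" namespace.
        return f"docker.io/library/{ref}"
    return f"docker.io/{ref}"


def unique_images(refs):
    """Return the sorted set of normalized image references."""
    return sorted({normalize_image(r) for r in refs})


# A few references copied from the list above.
sample = [
    "docker.io/kserve/kserve-controller:v0.14.1",
    "kserve/kserve-controller:v0.14.1",
    "docker.io/library/mysql:8.0.29",
    "mysql:8.0.29",
    "python:3.9",
    "ghcr.io/dexidp/dex:v2.41.1",
]
for image in unique_images(sample):
    print(image)
```

The input could be the output of the `kubectl get pods ... -o jsonpath` command split on whitespace, which makes version comparisons across distributions a little easier to read.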
