ray.io clusterrole missing for codeflare-operator.v1.0.0-rc.1 #302

Closed
jbusche opened this issue Sep 21, 2023 · 3 comments

@jbusche
Collaborator

jbusche commented Sep 21, 2023

Describe the Bug

Noticed that AppWrappers are failing because the operator's ClusterRole has no rule for the ray.io apiGroup.

The following errors are logged when cluster.up() is run:

. Attempting to re-enqueque...
W0921 20:57:19.009638       1 queuejob_controller_ex.go:1936] [worker] Item re-enqueued.
I0921 20:57:20.797788       1 request.go:690] Waited for 1.777425005s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/kfdef.apps.kubeflow.org/v1
E0921 20:57:22.741529       1 queuejob_controller_ex.go:2194] [Cleanup] Error deleting generic item raytest, from app wrapper='default/raytest' err=rayclusters.ray.io is forbidden: User "system:serviceaccount:openshift-operators:codeflare-operator-controller-manager" cannot list resource "rayclusters" in API group "ray.io" at the cluster scope.
E0921 20:57:26.513252       1 queuejob_controller_ex.go:1874] [worker] Failed to delete resources for AppWrapper Job 'default/raytest', err=1 error occurred:
	* rayclusters.ray.io is forbidden: User "system:serviceaccount:openshift-operators:codeflare-operator-controller-manager" cannot list resource "rayclusters" in API group "ray.io" at the cluster scope

W0921 20:57:26.513314       1 queuejob_controller_ex.go:1932] [worker] Fail to process item from eventQueue, err 1 error occurred:
	* rayclusters.ray.io is forbidden: User "system:serviceaccount:openshift-operators:codeflare-operator-controller-manager" cannot list resource "rayclusters" in API group "ray.io" at the cluster scope
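
One way to reproduce the denial without running an AppWrapper is to submit a SubjectAccessReview for the same user, verb, and resource that appear in the log. This is only a sketch: the user string and resource are copied from the error above, and the review deliberately omits group memberships, so it only reflects permissions bound directly to the service account.

```yaml
# Asks the API server whether the operator's service account may
# list rayclusters.ray.io across all namespaces (no namespace set).
apiVersion: authorization.k8s.io/v1
kind: SubjectAccessReview
spec:
  user: system:serviceaccount:openshift-operators:codeflare-operator-controller-manager
  resourceAttributes:
    group: ray.io
    resource: rayclusters
    verb: list
```

Creating this object with `oc create -f <file> -o yaml` should return it with a `status.allowed` field, which is expected to be `false` while the rule is missing.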

My temporary fix was to edit the clusterrole:

oc edit clusterrole codeflare-operator.v1.0.0-rc.1-59f4fb8598
and add the following rule:

- apiGroups:
  - ray.io
  resources:
  - rayclusters
  - rayjobs
  - rayservices
  verbs:
  - create
  - delete
  - get
  - list
  - patch
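
As an alternative to editing the generated ClusterRole in place, the same rule could be granted through a separate ClusterRole and ClusterRoleBinding. This is a hedged sketch: it assumes that manual edits to the operator-installed ClusterRole may be overwritten when the operator is reinstalled or upgraded (not verified here), and the name `codeflare-operator-ray-access` is hypothetical. The service account name and namespace are taken from the error message above.

```yaml
# Hypothetical standalone ClusterRole carrying the same ray.io rule
# as the temporary fix above.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: codeflare-operator-ray-access  # illustrative name, not from the operator
rules:
- apiGroups:
  - ray.io
  resources:
  - rayclusters
  - rayjobs
  - rayservices
  verbs:
  - create
  - delete
  - get
  - list
  - patch
---
# Binds the role to the service account named in the "forbidden" error.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: codeflare-operator-ray-access  # illustrative name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: codeflare-operator-ray-access
subjects:
- kind: ServiceAccount
  name: codeflare-operator-controller-manager
  namespace: openshift-operators
```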

I suspect that the real fix should be in here:
https://github.com/project-codeflare/codeflare-operator/blob/main/config/rbac/role.yaml

Codeflare Stack Component Versions

Please specify the component versions in which you have encountered this bug.

Codeflare SDK: 0.7.1
MCAD: Now built into the CodeFlare operator
InstaScale: Now built into the CodeFlare operator
Codeflare Operator: codeflare-operator.v1.0.0-rc.1
Other:

Steps to Reproduce the Bug

Deploy the current CodeFlare operator v1.0.0-rc.1 using the QuickStart guide on an OpenShift 4.12 cluster. Everything comes up, but when you run cluster.up(), the AppWrapper starts and then fails because the operator is not permitted to access the ray.io resources it needs.

What Have You Already Tried to Debug the Issue?

Described above

Expected Behavior

I expected the appwrapper to start the Ray head and worker nodes

Affected Releases

v1.0.0-rc.1 and main

@jbusche
Collaborator Author

jbusche commented Sep 21, 2023

Note: before MCAD was built into the operator, its ClusterRole had this rule for the ray.io apiGroup:

- apiGroups:
  - ray.io
  resources:
  - rayclusters
  - rayclusters/finalizers
  - rayclusters/status
  verbs:
  - get
  - list
  - watch
  - create
  - update
  - patch
  - delete

@astefanutti
Contributor

It's been addressed downstream in ODH in opendatahub-io/distributed-workloads#127.

The medium-term solution is to move to an aggregated ClusterRole, as described in project-codeflare/multi-cluster-app-dispatcher#635.
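
For readers unfamiliar with the mechanism, a Kubernetes aggregated ClusterRole looks roughly like the sketch below. This is not the design from the linked issue: the role names and the label `example.com/aggregate-to-mcad` are hypothetical, and the rule simply reuses the ray.io permissions MCAD had before.

```yaml
# Aggregating role: the control plane fills in its rules from every
# ClusterRole that carries the matching label.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: example-mcad-aggregate  # hypothetical name
aggregationRule:
  clusterRoleSelectors:
  - matchLabels:
      example.com/aggregate-to-mcad: "true"  # hypothetical label
rules: []  # managed automatically, left empty here
---
# Contributing role: adding the label is enough for its rules to be
# merged into the aggregating role above.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: example-mcad-ray  # hypothetical name
  labels:
    example.com/aggregate-to-mcad: "true"
rules:
- apiGroups:
  - ray.io
  resources:
  - rayclusters
  - rayclusters/finalizers
  - rayclusters/status
  verbs:
  - get
  - list
  - watch
  - create
  - update
  - patch
  - delete
```

With this pattern, permissions for a new resource type can be added by shipping another labeled ClusterRole, without editing the role bound to the operator.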

@jbusche
Collaborator Author

jbusche commented Oct 16, 2023

Great - Closing issue

jbusche closed this as completed Oct 16, 2023