If your users are experiencing errors in {productname-long} relating to distributed workloads, read this section to understand what could be causing the problem and how to resolve it.
If the problem is not documented here or in the release notes, contact {org-name} Support.
The resource quota specified in the cluster queue configuration might be insufficient, or the resource flavor might not yet be created.
The user’s Ray cluster head pod or worker pods remain in a suspended state.
Check the status of the Workloads resource that is created with the RayCluster resource.
The status.conditions.message field provides the reason for the suspended state, as shown in the following example:
status:
  conditions:
    - lastTransitionTime: '2024-05-29T13:05:09Z'
      message: 'couldn''t assign flavors to pod set small-group-jobtest12: insufficient quota for nvidia.com/gpu in flavor default-flavor in ClusterQueue'
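The same check can be scripted once the Workload resource has been exported, for example as JSON. The following is a minimal Python sketch, assuming the resource has already been loaded into a dictionary. The function name `condition_messages` is hypothetical and is not part of any SDK.

```python
def condition_messages(workload: dict) -> list[str]:
    """Return every non-empty status.conditions[].message in a Workload dict."""
    conditions = workload.get("status", {}).get("conditions", [])
    return [c["message"] for c in conditions if c.get("message")]


# Example input that mirrors the status shown above.
workload = {
    "status": {
        "conditions": [
            {
                "lastTransitionTime": "2024-05-29T13:05:09Z",
                "message": (
                    "couldn't assign flavors to pod set small-group-jobtest12: "
                    "insufficient quota for nvidia.com/gpu in flavor "
                    "default-flavor in ClusterQueue"
                ),
            }
        ]
    }
}

for message in condition_messages(workload):
    print(message)
```

A message that mentions insufficient quota points to the cluster queue limits. A message about a missing flavor points to the resource flavor.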
- Check whether the resource flavor is created, as follows:
  - Click Home → Search, and from the Resources list, select ResourceFlavor.
  - If necessary, create the resource flavor.
- Check the cluster queue configuration in the user’s code, to ensure that the resources that they requested are within the limits defined for the project.
- If necessary, increase the resource quota.

For information about configuring resource flavors and quotas, see Configuring quota management for distributed workloads.
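Comparing a request against a quota is simple arithmetic, and it can help to write it down before you change the cluster queue. The sketch below assumes that both the requested amounts and the quota are already available as plain integers. Real Kubernetes quantities such as `8Gi` would need to be converted first. The function name and the numbers are hypothetical.

```python
def quota_shortfalls(requested: dict[str, int], quota: dict[str, int]) -> dict[str, int]:
    """Return how far each requested resource exceeds its quota.

    Resources that fit within the quota are omitted. A resource with no
    quota entry is treated as having a quota of 0.
    """
    shortfalls = {}
    for name, amount in requested.items():
        missing = amount - quota.get(name, 0)
        if missing > 0:
            shortfalls[name] = missing
    return shortfalls


# Hypothetical numbers: the user asks for 2 GPUs but the quota allows 1.
requested = {"cpu": 4, "nvidia.com/gpu": 2}
quota = {"cpu": 8, "nvidia.com/gpu": 1}
print(quota_shortfalls(requested, quota))
```

An empty result means that the request fits. Any entry in the result is the amount by which the quota would need to increase, or by which the request would need to shrink.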
The user might have insufficient resources.
The user’s Ray cluster head pod or worker pods are not running.
When a Ray cluster is created, it initially enters a failed state.
This failed state usually resolves after the reconciliation process completes and the Ray cluster pods are running.
If the failed state persists, complete the following steps:
- Click Workloads → Pods.
- Click the user’s pod name to open the pod details page.
- Click the Events tab, and review the pod events to identify the cause of the problem.
- Check the status of the Workloads resource that is created with the RayCluster resource. The status.conditions.message field provides the reason for the failed state.
After the user runs the cluster.up() command, the following error is shown:
ApiException: (500)
Reason: Internal Server Error
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Internal error occurred: failed calling webhook \"mraycluster.ray.openshift.ai\": failed to call webhook: Post \"https://codeflare-operator-webhook-service.redhat-ods-applications.svc:443/mutate-ray-io-v1-raycluster?timeout=10s\": no endpoints available for service \"codeflare-operator-webhook-service\"","reason":"InternalError","details":{"causes":[{"message":"failed calling webhook \"mraycluster.ray.openshift.ai\": failed to call webhook: Post \"https://codeflare-operator-webhook-service.redhat-ods-applications.svc:443/mutate-ray-io-v1-raycluster?timeout=10s\": no endpoints available for service \"codeflare-operator-webhook-service\""}]},"code":500}
The CodeFlare Operator pod might not be running.
- Click Workloads → Pods.
- Verify that the CodeFlare Operator pod is running. If necessary, restart the CodeFlare Operator pod.
- Review the logs for the CodeFlare Operator pod to verify that the webhook server is serving, as shown in the following example:
INFO controller-runtime.webhook Serving webhook server {"host": "", "port": 9443}
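When users report a long 500 error, it can be quicker to extract the failing service name than to read the whole response. The following is a minimal sketch, assuming the HTTP response body has been captured as a string. The function name `service_without_endpoints` is hypothetical, and the example message is shortened from the error shown above.

```python
import json
import re

NO_ENDPOINTS = re.compile(r'no endpoints available for service "([^"]+)"')


def service_without_endpoints(response_body: str):
    """Return the service name from a 'no endpoints available' error, or None."""
    try:
        data = json.loads(response_body)
    except json.JSONDecodeError:
        data = None
    # Fall back to searching the raw text if the body is not a JSON object.
    message = data.get("message", "") if isinstance(data, dict) else response_body
    match = NO_ENDPOINTS.search(message)
    return match.group(1) if match else None


# Shortened version of the error body shown above.
body = json.dumps({
    "kind": "Status",
    "status": "Failure",
    "message": (
        'Internal error occurred: failed calling webhook '
        '"mraycluster.ray.openshift.ai": failed to call webhook: '
        'no endpoints available for service '
        '"codeflare-operator-webhook-service"'
    ),
    "code": 500,
})
print(service_without_endpoints(body))
```

The returned service name indicates which operator pod to check.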
After the user runs the cluster.up() command, the following error is shown:
ApiException: (500)
Reason: Internal Server Error
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Internal error occurred: failed calling webhook \"mraycluster.kb.io\": failed to call webhook: Post \"https://kueue-webhook-service.redhat-ods-applications.svc:443/mutate-ray-io-v1-raycluster?timeout=10s\": no endpoints available for service \"kueue-webhook-service\"","reason":"InternalError","details":{"causes":[{"message":"failed calling webhook \"mraycluster.kb.io\": failed to call webhook: Post \"https://kueue-webhook-service.redhat-ods-applications.svc:443/mutate-ray-io-v1-raycluster?timeout=10s\": no endpoints available for service \"kueue-webhook-service\""}]},"code":500}
The Kueue pod might not be running.
- Click Workloads → Pods.
- Verify that the Kueue pod is running. If necessary, restart the Kueue pod.
- Review the logs for the Kueue pod to verify that the webhook server is serving, as shown in the following example:
{"level":"info","ts":"2024-06-24T14:36:24.255137871Z","logger":"controller-runtime.webhook","caller":"webhook/server.go:242","msg":"Serving webhook server","host":"","port":9443}
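The two operators log the "Serving webhook server" message in different formats: the CodeFlare Operator example is plain text with a trailing JSON object, and the Kueue example is a single JSON object. The sketch below searches saved log lines for that message in either format and returns the port. It assumes that the logs have already been saved as a list of strings, and the function name is hypothetical.

```python
import json


def webhook_serving_port(log_lines):
    """Return the port from the first 'Serving webhook server' line, or None.

    Handles lines that are entirely JSON and lines with a trailing JSON object.
    Lines that mention the message but carry no parseable JSON are skipped.
    """
    for line in log_lines:
        if "Serving webhook server" not in line:
            continue
        start = line.find("{")
        if start == -1:
            continue
        try:
            fields = json.loads(line[start:])
        except json.JSONDecodeError:
            continue
        return fields.get("port")
    return None


# The two log lines shown in this section.
codeflare_log = [
    'INFO controller-runtime.webhook Serving webhook server {"host": "", "port": 9443}',
]
kueue_log = [
    '{"level":"info","ts":"2024-06-24T14:36:24.255137871Z",'
    '"logger":"controller-runtime.webhook","caller":"webhook/server.go:242",'
    '"msg":"Serving webhook server","host":"","port":9443}',
]
print(webhook_serving_port(codeflare_log), webhook_serving_port(kueue_log))
```

A result of `None` means that no such line was found in the supplied logs, which suggests that the webhook server has not started.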
After the user runs the cluster.up() command, when they run either the cluster.details() command or the cluster.status() command, the Ray cluster status remains as Starting instead of changing to Ready.
No pods are created.
Check the status of the Workloads resource that is created with the RayCluster resource.
The status.conditions.message field provides the reason for remaining in the Starting state.
Similarly, check the status.conditions.message field for the RayCluster resource.
- Click Workloads → Pods.
- Verify that the KubeRay pod is running. If necessary, restart the KubeRay pod.
- Review the logs for the KubeRay pod to identify errors.
After the user runs the cluster.up() command, the following error is shown:
Default Local Queue with kueue.x-k8s.io/default-queue: true annotation not found please create a default Local Queue or provide the local_queue name in Cluster Configuration.
No default local queue is defined, and a local queue is not specified in the cluster configuration.
- Check whether a local queue exists in the user’s project, as follows:
  - Click Home → Search, and from the Resources list, select LocalQueue.
  - If no local queues are found, create a local queue.
  - Provide the user with the details of the local queues in their project, and advise them to add a local queue to their cluster configuration.
- Define a default local queue.
  For information about creating a local queue and defining a default local queue, see Configuring quota management for distributed workloads.
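The error message names the annotation that marks a default local queue, `kueue.x-k8s.io/default-queue: true`. As an illustration, the following sketch looks for that annotation in a list of LocalQueue resources that has already been exported and loaded as dictionaries. The function name, the queue names, and the project names are hypothetical.

```python
DEFAULT_QUEUE_ANNOTATION = "kueue.x-k8s.io/default-queue"


def default_local_queues(local_queues: list[dict], namespace: str) -> list[str]:
    """Return the names of LocalQueues in `namespace` annotated as the default."""
    names = []
    for queue in local_queues:
        metadata = queue.get("metadata", {})
        annotations = metadata.get("annotations") or {}
        if (
            metadata.get("namespace") == namespace
            and annotations.get(DEFAULT_QUEUE_ANNOTATION) == "true"
        ):
            names.append(metadata.get("name"))
    return names


# Hypothetical exported resources: only one queue carries the annotation.
local_queues = [
    {"metadata": {"name": "lq-default", "namespace": "team-a",
                  "annotations": {DEFAULT_QUEUE_ANNOTATION: "true"}}},
    {"metadata": {"name": "lq-batch", "namespace": "team-a"}},
    {"metadata": {"name": "lq-other", "namespace": "team-b",
                  "annotations": {DEFAULT_QUEUE_ANNOTATION: "true"}}},
]
print(default_local_queues(local_queues, "team-a"))
```

An empty list for the user's project matches the situation described by the error: no default local queue is defined there.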
After the user runs the cluster.up() command, the following error is shown:
local_queue provided does not exist or is not in this namespace. Please provide the correct local_queue name in Cluster Configuration.
An incorrect value is specified for the local queue in the cluster configuration, or an incorrect default local queue is defined. The specified local queue either does not exist, or exists in a different namespace.
- Click Home → Search, and from the Resources list, select LocalQueue.
- Resolve the problem in one of the following ways:
  - If no local queues are found, create a local queue.
  - If one or more local queues are found, provide the user with the details of the local queues in their project. Advise the user to ensure that they spelled the local queue name correctly in their cluster configuration, and that the namespace value in the cluster configuration matches their project name. If the user does not specify a namespace value in the cluster configuration, the Ray cluster is created in the current project.
- Define a default local queue.
  For information about creating a local queue and defining a default local queue, see Configuring quota management for distributed workloads.
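The cases described above (no queues, a misspelled name, or a queue in a different namespace) can be told apart mechanically. The following is a small sketch that assumes the existing local queues are available as (namespace, name) pairs. The function name and all queue and project names are hypothetical.

```python
def diagnose_local_queue(name: str, namespace: str, queues: list[tuple[str, str]]) -> str:
    """Explain why a local queue lookup would fail.

    `queues` is a list of (namespace, name) pairs for existing LocalQueues.
    """
    if (namespace, name) in queues:
        return "ok"
    # The name exists, but only in other namespaces.
    other_namespaces = sorted(ns for ns, n in queues if n == name)
    if other_namespaces:
        return f"'{name}' exists only in: {', '.join(other_namespaces)}"
    # The name does not exist anywhere; list what the namespace does offer.
    available = sorted(n for ns, n in queues if ns == namespace)
    if available:
        return f"'{name}' not found; queues in '{namespace}': {', '.join(available)}"
    return f"no local queues exist in '{namespace}'"


queues = [("team-a", "lq-small"), ("team-b", "lq-large")]
print(diagnose_local_queue("lq-smal", "team-a", queues))
```

The returned text can be passed to the user as a hint about which value to correct in the cluster configuration.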
After the user runs the cluster.up() command, an error similar to the following text is shown:
RuntimeError: Failed to get RayCluster CustomResourceDefinition: (403)
Reason: Forbidden
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"rayclusters.ray.io is forbidden: User \"system:serviceaccount:regularuser-project:regularuser-workbench\" cannot list resource \"rayclusters\" in API group \"ray.io\" in the namespace \"regularuser-project\"","reason":"Forbidden","details":{"group":"ray.io","kind":"rayclusters"},"code":403}
The correct OpenShift login credentials are not specified in the TokenAuthentication section of the user’s notebook code.
- Advise the user to identify and specify the correct OpenShift login credentials as follows:
  - In the new tab that opens, log in as the user whose credentials you want to use.
  - Click Display Token.
  - From the Log in with this token section, copy the token and server values.
  - Specify the copied token and server values in your notebook code as follows:

        auth = TokenAuthentication(
            token="<token>",
            server="<server>",
            skip_tls=False
        )
        auth.login()

- Verify that the user has the correct permissions and is part of the rhoai-users group.
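The 403 message states which identity made the request. In the example error, that identity is a service account rather than a user, which is consistent with the notebook not authenticating with the user's own token. The following sketch extracts the identity from the message text. The function name `parse_forbidden` is hypothetical, and the pattern is written only against the message format shown in this section.

```python
import re

FORBIDDEN = re.compile(
    r'User "system:serviceaccount:(?P<namespace>[^:"]+):(?P<account>[^"]+)" '
    r'cannot (?P<verb>\w+) resource "(?P<resource>[^"]+)"'
)


def parse_forbidden(message: str):
    """Return the service account, namespace, verb, and resource, or None."""
    match = FORBIDDEN.search(message)
    return match.groupdict() if match else None


# The message field from the error shown in this section.
message = (
    'rayclusters.ray.io is forbidden: User '
    '"system:serviceaccount:regularuser-project:regularuser-workbench" '
    'cannot list resource "rayclusters" in API group "ray.io" '
    'in the namespace "regularuser-project"'
)
print(parse_forbidden(message))
```

A result of `None` means that the request was not made by a service account, or that the message uses a different format. In that case, check the user's group membership and permissions instead.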
Kueue waits for a period of time before marking a workload as ready, to enable all of the workload pods to become provisioned and running. By default, Kueue waits for 5 minutes. If the pod image is very large and is still being pulled after the 5-minute waiting period elapses, Kueue fails the workload and terminates the related pods.
- Click Workloads → Pods.
- Click the user’s pod name to open the pod details page.
- Click the Events tab, and review the pod events to check whether the image pull completed successfully.
If the pod takes more than 5 minutes to pull the image, resolve the problem in one of the following ways:
- Add an OnFailure restart policy for resources that are managed by Kueue.
- In the redhat-ods-applications namespace, edit the kueue-manager-config ConfigMap to set a custom timeout for the waitForPodsReady property. For more information about this configuration option, see Enabling waitForPodsReady in the Kueue documentation.
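To decide whether a custom timeout is needed, compare the image pull duration from the pod events with the 5-minute default. The sketch below assumes that the events have been copied into a list of dictionaries with `reason` and `time` keys, which is a hypothetical shape chosen for illustration. `Pulling` and `Pulled` are the event reasons that the kubelet records for image pulls, and the timestamps are invented.

```python
from datetime import datetime, timedelta

WAIT_FOR_PODS_READY = timedelta(minutes=5)  # default stated in the text above


def parse_time(value: str) -> datetime:
    """Parse a timestamp such as 2024-06-24T14:30:00Z."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")


def image_pull_duration(events: list[dict]):
    """Return the time between the first 'Pulling' and the last 'Pulled' event.

    Returns None if either event is missing, for example because the pull
    never finished.
    """
    pulling = [e["time"] for e in events if e["reason"] == "Pulling"]
    pulled = [e["time"] for e in events if e["reason"] == "Pulled"]
    if not pulling or not pulled:
        return None
    return max(pulled) - min(pulling)


# Hypothetical events: the pull takes 7 minutes 15 seconds.
events = [
    {"reason": "Scheduled", "time": parse_time("2024-06-24T14:30:00Z")},
    {"reason": "Pulling", "time": parse_time("2024-06-24T14:30:05Z")},
    {"reason": "Pulled", "time": parse_time("2024-06-24T14:37:20Z")},
]
duration = image_pull_duration(events)
print(duration, duration > WAIT_FOR_PODS_READY)
```

If the measured duration exceeds the default, a custom waitForPodsReady timeout should be set comfortably above the longest pull that you observe.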