Dask-Kubernetes Example Internal Server Error #174

Open
isVoid opened this issue Aug 9, 2022 · 6 comments
@isVoid
Contributor

isVoid commented Aug 9, 2022

When running the dask_cuML_Exploration notebook, I ran into the following error:

kubernetes_asyncio.client.exceptions.ApiException: (500)
Reason: Internal Server Error
HTTP response headers: <CIMultiDictProxy('Audit-Id': '5ed8d1bf-dd87-4a29-ae0d-fa38c6dc254f', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': 'db4c51d8-2442-4906-93b6-7ed03022eb2e', 'X-Kubernetes-Pf-Prioritylevel-Uid': 'aa178aa3-5472-418c-8f79-a301b839fda3', 'Date': 'Tue, 09 Aug 2022 01:17:22 GMT', 'Content-Length': '242')>
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"The POST operation against Pod could not be completed at this time, please try again.","reason":"ServerTimeout","details":{"name":"POST","kind":"Pod"},"code":500}

The client environment is set up with this Dockerfile:
https://github.com/isVoid/cloud-ml-examples/blob/f2683711b7ff5d1c3ae15ba41b4e828eccf8b2a3/dask/kubernetes/Dockerfile

Pod specs for worker and scheduler.

The cluster is set up on GCP and controlled from within the client's container.

@jacobtomlinson
Member

The ServerTimeout makes me wonder if this is one of those times where the GKE control plane becomes unavailable due to resizing. What happens if you try again later?

@isVoid
Contributor Author

isVoid commented Aug 9, 2022

Tried again 7 hours later and ran into the same issue.

@isVoid
Contributor Author

isVoid commented Aug 9, 2022

In a local cluster, I tried setting up the pods with the same specs described above. This time the error is different: when dask-kubernetes scales up a worker pod that has a name defined, the control plane fails to scale up because it attempts to create more worker pods using the same name defined in worker-spec.yaml. Not sure if this is the same 500 error I ran into above, but hopefully it gives some insight.

>>> import dask_kubernetes
>>> dask_kubernetes.__version__
'2022.7.0'

Environment from: https://jacobtomlinson.dev/posts/2022/running-kubeflow-inside-kind-with-gpu-support/

@jacobtomlinson
Member

I think this line should be name: dask-cuda-worker-{uuid} to ensure the name is unique.
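
For illustration, here is a minimal sketch of what a unique-per-pod name looks like once the placeholder is filled in. The helper `unique_pod_name` is hypothetical and not part of dask-kubernetes; whether dask-kubernetes substitutes `{uuid}` in any given field of the spec is an assumption to check against its docs.

```python
import uuid


def unique_pod_name(template: str = "dask-cuda-worker-{uuid}") -> str:
    """Fill the {uuid} placeholder with a short random hex suffix.

    Hypothetical helper for illustration only; the 10-character
    suffix length is an arbitrary choice.
    """
    return template.format(uuid=uuid.uuid4().hex[:10])


if __name__ == "__main__":
    # Each call yields a different name, so repeated pod creation
    # would not collide on metadata.name.
    print(unique_pod_name())
    print(unique_pod_name())
```

If the placeholder is never substituted, the literal string `dask-cuda-worker-{uuid}` reaches the API server unchanged.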

@isVoid
Contributor Author

isVoid commented Aug 11, 2022

Doesn't seem to parameterize?

kubernetes_asyncio.client.exceptions.ApiException: (422)
Reason: Unprocessable Entity
HTTP response headers: <CIMultiDictProxy('Audit-Id': '01188928-aca4-4d94-9e8d-fc89df88e03f', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': '7d3b09b3-048f-4edd-b335-46a87cfce5a2', 'X-Kubernetes-Pf-Prioritylevel-Uid': '13223eb9-a260-4d6f-b264-e6a6cede27c6', 'Date': 'Thu, 11 Aug 2022 22:00:17 GMT', 'Content-Length': '1537')>
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Pod \"gpu_worker\" is invalid: [metadata.name: Invalid value: \"gpu_worker\": a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*'), spec.containers[0].name: Invalid value: \"dask-cuda-worker-{uuid}\": a lowercase RFC 1123 label must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name',  or '123-abc', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?')]","reason":"Invalid","details":{"name":"gpu_worker","kind":"Pod","causes":[{"reason":"FieldValueInvalid","message":"Invalid value: \"gpu_worker\": a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*')","field":"metadata.name"},{"reason":"FieldValueInvalid","message":"Invalid value: \"dask-cuda-worker-{uuid}\": a lowercase RFC 1123 label must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name',  or '123-abc', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?')","field":"spec.containers[0].name"}]},"code":422}

@jacobtomlinson
Member

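Looks like you have an underscore in the pod name (gpu_worker), which is not a valid character. The same error also rejects the container name dask-cuda-worker-{uuid}, since the braces are not valid characters either.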
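
A quick way to catch this before submitting the spec is to check names locally against the two regexes quoted in the 422 response. This is only a sketch: the function names are hypothetical, and the length limits (253 for a subdomain, 63 for a label) come from the Kubernetes naming rules rather than from the error message above.

```python
import re

# Regexes copied from the 422 error body above.
RFC1123_LABEL = r"[a-z0-9]([-a-z0-9]*[a-z0-9])?"
RFC1123_SUBDOMAIN = RFC1123_LABEL + r"(\." + RFC1123_LABEL + r")*"


def is_valid_pod_name(name: str) -> bool:
    """metadata.name must be a lowercase RFC 1123 subdomain."""
    return len(name) <= 253 and re.fullmatch(RFC1123_SUBDOMAIN, name) is not None


def is_valid_container_name(name: str) -> bool:
    """spec.containers[].name must be a lowercase RFC 1123 label."""
    return len(name) <= 63 and re.fullmatch(RFC1123_LABEL, name) is not None


if __name__ == "__main__":
    print(is_valid_pod_name("gpu_worker"))                     # False: underscore
    print(is_valid_pod_name("gpu-worker"))                     # True
    print(is_valid_container_name("dask-cuda-worker-{uuid}"))  # False: braces
    print(is_valid_container_name("dask-cuda-worker"))         # True
```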
