[core] Get cloud provider with ray on kubernetes #51793
edoakes merged 17 commits into ray-project:master from
Conversation
Signed-off-by: dayshah <dhyey2019@gmail.com>
```
max_workers: Optional[int] = None
head_node_instance_type: Optional[str] = None
worker_node_instance_types: Optional[List[str]] = None
cloud_provider_alt: Optional[str] = None
```
we cannot change the schema here without changing the server, since the server does the schema validation. Let's discuss offline how to change the schema.
updated with an added field in UsageStatsToReport. Is that all?
and updated the schema in the test
updated based on discussion
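The resolution discussed in this thread (adding an optional field to the reported payload rather than changing the request schema the server validates) might look roughly like the sketch below. The class shape and field names here are illustrative, not Ray's actual `UsageStatsToReport` definition:

```python
# Hypothetical sketch: an optional field added to the usage-stats payload.
# Because the field defaults to None, older servers that validate the
# existing schema can simply ignore or omit it.
from dataclasses import dataclass
from typing import Optional


@dataclass
class UsageStatsToReport:
    schema_version: str
    # New optional field carrying the detected cloud provider.
    cloud_provider: Optional[str] = None
```

Constructing the payload without the new field still works, which is the backward-compatibility property the thread is after.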
Signed-off-by: dayshah <dhyey2019@gmail.com>
```
import requests

def cloud_metadata_request(
```
Can we check if we are in GKE, EKS, or AKS specifically, since theoretically a user can also install k8s on their EC2 machines?
I couldn't find any way of checking that from inside the Ray container. @kevin85421 do you know if the kuberay-operator has any context?
talked offline, don't know of any way
If users install k8s on EC2 machines, they might use tools like kind. And I think the kind cluster environment is isolated from the host machine, so we can't detect which cloud provider we're in from inside the kind cluster.
I think you can look for some EKS metadata tags like:
```
curl -s http://169.254.169.254/latest/meta-data/tags/instance/eks:cluster-name
```
unlikely to be perfect, though.
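The suggestion above could be sketched in Python as below. The function name `looks_like_eks` and the injectable `get` callable are illustrative, and as noted the check is not reliable: the instance-tags metadata path only exists when tags-in-metadata is enabled on the EC2 instance.

```python
# Hedged sketch: probe the EC2 instance-tags metadata path for an
# eks:cluster-name tag to distinguish EKS from plain k8s-on-EC2.
# `get` is any callable returning an object with a .status_code attribute.
def looks_like_eks(get):
    url = ("http://169.254.169.254/latest/meta-data/"
           "tags/instance/eks:cluster-name")
    try:
        return get(url).status_code == 200
    except Exception:
        # Endpoint unreachable (e.g. not on EC2 at all).
        return False
```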
```
if res.status_code != 404:
    if result.cloud_provider is None:
        result.cloud_provider = cloud_provider
    else:
        result.cloud_provider += f"_{cloud_provider}"
    return True
```
Is it true for those endpoints that as long as it's reachable, you are running on it?
yes, the endpoint isn't reachable otherwise
then we don't need to check 404? 404 means it's still reachable?
Misunderstanding on my side: we can get a 404 when requesting from other clouds, but you'll never get a 404 if the request is made from that cloud. See the example in the description.
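The behavior settled in this thread can be sketched as below: probe each cloud's metadata endpoint, treat a connection error as "not on this cloud", and treat any reachable non-404 response as a match. The endpoint URLs come from the PR description; the function name `detect_cloud_provider` and the injectable `get` callable are illustrative, not Ray's actual helper.

```python
# Hedged sketch of the detection rule: a 404 can come back when probing
# *another* cloud's endpoint, but a provider's own endpoint never 404s,
# so "reachable and not 404" means we are on that provider.
from typing import Callable, Optional

METADATA_PROBES = [
    # (provider, url, extra request headers)
    ("gcp", "http://metadata.google.internal/computeMetadata/v1",
     {"Metadata-Flavor": "Google"}),
    ("aws", "http://169.254.169.254/latest/meta-data/", {}),
    ("azure",
     "http://169.254.169.254/metadata/instance?api-version=2021-02-01",
     {"Metadata": "true"}),
]


def detect_cloud_provider(get: Callable) -> Optional[str]:
    """`get(url, headers)` returns a status code, or raises if unreachable."""
    for provider, url, headers in METADATA_PROBES:
        try:
            status = get(url, headers)
        except Exception:
            continue  # endpoint unreachable -> not on this cloud
        if status != 404:
            return provider
    return None
```

With this rule, the GKE transcript in the description (Google endpoint 200, the other two 404) resolves to GCP, and the EKS transcript (Google endpoint connection error, EC2 endpoint 200) resolves to AWS.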
Signed-off-by: dayshah <dhyey2019@gmail.com>
edoakes left a comment
please add tests by faking the requests library
```
else:
    return "unknown"
```
we should wrap all the above checks in an unexpected-exception handler and return "unknown" in those cases (and log the unexpected exception)
or catch all exceptions inside cloud_metadata_request
done, catching in cloud_metadata_request and logging at info level
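The resolution here (catch everything inside the helper so an unexpected failure never breaks usage-stats collection) could look roughly like this sketch. The signature, in particular the injectable `get` callable, is illustrative rather than the PR's actual code:

```python
# Hedged sketch: swallow *all* exceptions inside the metadata helper,
# log at info level, and report "not detected" on any failure.
import logging

logger = logging.getLogger(__name__)


def cloud_metadata_request(url, headers, get):
    try:
        res = get(url, headers=headers)
        # Per the 404 discussion: a 404 means some other cloud's endpoint
        # answered; any other status means we are on this provider.
        return res.status_code != 404
    except Exception as e:
        logger.info("Cloud metadata request to %s failed: %s", url, e)
        return False
```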
```
else:
    result.cloud_provider += f"_${get_cloud_from_metadata_requests()}"
```
what's this intending to do?
added a comment above. It's for when kubernetes was already set; then you want the cloud provider to be kubernetes_aws.
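The combination rule described here can be sketched as a small helper (the name `combine_cloud_provider` is illustrative, not Ray's actual function):

```python
# Hedged sketch: if a provider (e.g. "kubernetes") was already recorded,
# append the newly detected cloud so the reported value becomes
# e.g. "kubernetes_aws"; otherwise just use the detected cloud.
from typing import Optional


def combine_cloud_provider(existing: Optional[str], detected: str) -> str:
    if existing is None:
        return detected
    return f"{existing}_{detected}"
```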
This pull request has been automatically marked as stale because it has not had recent activity. You can always ask for help on our discussion forum or Ray's public Slack channel. If you'd like to keep this open, just leave any comment, and the stale label will be removed.
Signed-off-by: dayshah <dhyey2019@gmail.com>
vibecoded some tests like the cool kids 😎
Signed-off-by: dayshah <dhyey2019@gmail.com>
fixed @edoakes
On GKE
```
dhyey@cloudshell:~ (dhyey-dev)$ kubectl exec -it $HEAD_POD -- python -c "import requests; print(requests.get('http://metadata.google.internal/computeMetadata/v1',headers={'Metadata-Flavor': 'Google'}))"
Defaulted container "ray-head" out of: ray-head, autoscaler
<Response [200]>
dhyey@cloudshell:~ (dhyey-dev)$ kubectl exec -it $HEAD_POD -- python -c "import requests; print(requests.get('http://169.254.169.254/latest/meta-data/'))"
Defaulted container "ray-head" out of: ray-head, autoscaler
<Response [404]>
dhyey@cloudshell:~ (dhyey-dev)$ kubectl exec -it $HEAD_POD -- python -c "import requests; print(requests.get('http://169.254.169.254/metadata/instance?api-version=2021-02-01'))"
Defaulted container "ray-head" out of: ray-head, autoscaler
<Response [404]>
```
On Anyscale on EKS (the Google metadata request results in a ConnectionError)
```
>>> print(requests.get('http://169.254.169.254/latest/meta-data/'))
<Response [200]>
>>> print(requests.get('http://169.254.169.254/metadata/instance?api-version=2021-02-01'))
<Response [404]>
```
Note: Untested on Azure
---------
Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: ChanChan Mao <chanchanmao1130@gmail.com>
Signed-off-by: Douglas Strodtman <douglas@anyscale.com>
Why are these changes needed?
Related issue number
Checks
- I've signed off every commit (`git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I've introduced a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.