Skip to content

[core] Get cloud provider with ray on kubernetes#51793

Merged
edoakes merged 17 commits intoray-project:masterfrom
dayshah:usage-cloud-provider
Jul 7, 2025
Merged

[core] Get cloud provider with ray on kubernetes#51793
edoakes merged 17 commits intoray-project:masterfrom
dayshah:usage-cloud-provider

Conversation

@dayshah
Copy link
Copy Markdown
Contributor

@dayshah dayshah commented Mar 28, 2025

Why are these changes needed?

On GKE

dhyey@cloudshell:~ (dhyey-dev)$ kubectl exec -it $HEAD_POD -- python -c "import requests; print(requests.get('http://metadata.google.internal/computeMetadata/v1',headers={'Metadata-Flavor': 'Google'}))"
Defaulted container "ray-head" out of: ray-head, autoscaler
<Response [200]>
dhyey@cloudshell:~ (dhyey-dev)$ kubectl exec -it $HEAD_POD -- python -c "import requests; print(requests.get('http://169.254.169.254/latest/meta-data/'))"                                                                                       
Defaulted container "ray-head" out of: ray-head, autoscaler
<Response [404]>
dhyey@cloudshell:~ (dhyey-dev)$ kubectl exec -it $HEAD_POD -- python -c "import requests; print(requests.get('http://169.254.169.254/metadata/instance?api-version=2021-02-01'))"                                                                
Defaulted container "ray-head" out of: ray-head, autoscaler
<Response [404]>

On anyscale on eks (google metadata req results in ConnectionError)

>>> print(requests.get('http://169.254.169.254/latest/meta-data/'))
<Response [200]>
>>> print(requests.get('http://169.254.169.254/metadata/instance?api-version=2021-02-01'))
<Response [404]>

Note: Untested on azure

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: dayshah <dhyey2019@gmail.com>
@dayshah dayshah added go add ONLY when ready to merge, run all tests and removed go add ONLY when ready to merge, run all tests labels Mar 28, 2025
Signed-off-by: dayshah <dhyey2019@gmail.com>
@dayshah dayshah added the go add ONLY when ready to merge, run all tests label Mar 28, 2025
@dayshah dayshah marked this pull request as ready for review March 28, 2025 21:07
Comment thread python/ray/_private/usage/usage_lib.py Outdated
max_workers: Optional[int] = None
head_node_instance_type: Optional[str] = None
worker_node_instance_types: Optional[List[str]] = None
cloud_provider_alt: Optional[str] = None
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we cannot change the schema here without changing the server since server does the schema validation. Lets discuss offline how to change the schema.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated with added field to UsageStatsToReport is that all?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and updated schema in test

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated based on discussion

Comment thread python/ray/_private/usage/usage_lib.py Outdated
dayshah added 4 commits March 28, 2025 14:14
Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: dayshah <dhyey2019@gmail.com>
dayshah added 3 commits April 1, 2025 12:59
Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: dayshah <dhyey2019@gmail.com>
Comment thread python/ray/_private/usage/usage_lib.py Outdated

import requests

def cloud_metadata_request(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we check if we are in GKE, EKS, AKS specifically since theoretically user can also install k8s on their ec2 machines.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't find any way of checking that from inside the ray container, @kevin85421 do you know if the kuberay-operator has any context?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

talked offline, don't know of any way

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If users install k8s on EC2 machines, they might use tools like kind. And I think the kind cluster environment is isolated from the host machine, so we can't detect which cloud provider we're in from inside the kind cluster.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can look for some eks metadata tags like:

curl -s http://169.254.169.254/latest/meta-data/tags/instance/eks:cluster-name

unlikely to be perfect though.

Signed-off-by: dayshah <dhyey2019@gmail.com>
@dayshah dayshah requested a review from jjyao April 4, 2025 20:38
Comment thread python/ray/_private/usage/usage_lib.py Outdated
Comment thread python/ray/_private/usage/usage_lib.py Outdated
Comment on lines +792 to +797
if res.status_code != 404:
if result.cloud_provider is None:
result.cloud_provider = cloud_provider
else:
result.cloud_provider += f"_{cloud_provider}"
return True
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it true for those endpoints that as long as it's reachable, you are running on it?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, the endpoint isn't reachable otherwise

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then we don't need to check 404? 404 means it's still reachable?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

misunderstanding on my side, we can get a 404 when requesting from other clouds. You'll never get a 404 if the request is from that cloud. Can see example in description

Signed-off-by: dayshah <dhyey2019@gmail.com>
@dayshah dayshah requested a review from jjyao April 15, 2025 05:39
Copy link
Copy Markdown
Collaborator

@edoakes edoakes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add tests by faking requests library

Comment on lines +730 to +731
else:
return "unknown"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should wrap all the above checks in an unexpected exception handlers and return "unknown" in those cases (and log the unexpected exception)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or catch all exceptions inside cloud_metadata_request

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done, catching in cloud_metadata_request and logging info

Comment on lines +816 to +817
else:
result.cloud_provider += f"_${get_cloud_from_metadata_requests()}"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's this intending to do?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added a comment above
it's for if kubernetes was set already
then you want the cloud provider to be kubernetes_aws

Comment thread python/ray/_private/usage/usage_lib.py
Comment thread python/ray/_private/usage/usage_lib.py Outdated
@mascharkh mascharkh added the community-contribution Contributed by the community label May 15, 2025
@mascharkh mascharkh removed the community-contribution Contributed by the community label May 15, 2025
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Jun 7, 2025

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions Bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Jun 7, 2025
@dayshah dayshah removed the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Jun 7, 2025
@github-actions
Copy link
Copy Markdown

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions Bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Jun 23, 2025
@dayshah dayshah removed the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Jun 23, 2025
dayshah added 4 commits July 2, 2025 22:56
Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: dayshah <dhyey2019@gmail.com>
@dayshah
Copy link
Copy Markdown
Contributor Author

dayshah commented Jul 3, 2025

please add tests by faking requests library

vibcoded some tests like the cool kids 😎

@dayshah dayshah requested a review from edoakes July 3, 2025 07:31
@edoakes
Copy link
Copy Markdown
Collaborator

edoakes commented Jul 3, 2025

Signed-off-by: dayshah <dhyey2019@gmail.com>
@dayshah
Copy link
Copy Markdown
Contributor Author

dayshah commented Jul 3, 2025

@edoakes edoakes merged commit 348063f into ray-project:master Jul 7, 2025
5 checks passed
@dayshah dayshah deleted the usage-cloud-provider branch July 7, 2025 16:47
ccmao1130 pushed a commit to ccmao1130/ray that referenced this pull request Jul 29, 2025
On GKE
```
dhyey@cloudshell:~ (dhyey-dev)$ kubectl exec -it $HEAD_POD -- python -c "import requests; print(requests.get('http://metadata.google.internal/computeMetadata/v1',headers={'Metadata-Flavor': 'Google'}))"
Defaulted container "ray-head" out of: ray-head, autoscaler
<Response [200]>
dhyey@cloudshell:~ (dhyey-dev)$ kubectl exec -it $HEAD_POD -- python -c "import requests; print(requests.get('http://169.254.169.254/latest/meta-data/'))"
Defaulted container "ray-head" out of: ray-head, autoscaler
<Response [404]>
dhyey@cloudshell:~ (dhyey-dev)$ kubectl exec -it $HEAD_POD -- python -c "import requests; print(requests.get('http://169.254.169.254/metadata/instance?api-version=2021-02-01'))"
Defaulted container "ray-head" out of: ray-head, autoscaler
<Response [404]>
```

On anyscale on eks (google metadata req results in ConnectionError)
```
>>> print(requests.get('http://169.254.169.254/latest/meta-data/'))
<Response [200]>
>>> print(requests.get('http://169.254.169.254/metadata/instance?api-version=2021-02-01'))
<Response [404]>
```

Note: Untested on azure

---------

Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: ChanChan Mao <chanchanmao1130@gmail.com>
dstrodtman pushed a commit to dstrodtman/ray that referenced this pull request Oct 6, 2025
On GKE
```
dhyey@cloudshell:~ (dhyey-dev)$ kubectl exec -it $HEAD_POD -- python -c "import requests; print(requests.get('http://metadata.google.internal/computeMetadata/v1',headers={'Metadata-Flavor': 'Google'}))"
Defaulted container "ray-head" out of: ray-head, autoscaler
<Response [200]>
dhyey@cloudshell:~ (dhyey-dev)$ kubectl exec -it $HEAD_POD -- python -c "import requests; print(requests.get('http://169.254.169.254/latest/meta-data/'))"
Defaulted container "ray-head" out of: ray-head, autoscaler
<Response [404]>
dhyey@cloudshell:~ (dhyey-dev)$ kubectl exec -it $HEAD_POD -- python -c "import requests; print(requests.get('http://169.254.169.254/metadata/instance?api-version=2021-02-01'))"
Defaulted container "ray-head" out of: ray-head, autoscaler
<Response [404]>
```

On anyscale on eks (google metadata req results in ConnectionError)
```
>>> print(requests.get('http://169.254.169.254/latest/meta-data/'))
<Response [200]>
>>> print(requests.get('http://169.254.169.254/metadata/instance?api-version=2021-02-01'))
<Response [404]>
```

Note: Untested on azure

---------

Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: Douglas Strodtman <douglas@anyscale.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants