
Preliminary Vast AI support #4365

Merged: 126 commits, Feb 4, 2025
Conversation

@kristopolous (Contributor) commented Nov 15, 2024

This is preliminary support for Vast. It currently works with an unreleased version of the SDK, which we will soon publish to PyPI.

The document at https://docs.google.com/document/d/1oWox3qb3Kz3wXXSGg9ZJWwijoa99a3PIQUHBR8UgEGs/edit?pli=1&tab=t.0 was followed and all of the testing passed.

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

I'm pretty sure there will need to be edits, and I'm fine with that. This is attempt 1. The outstanding work:

We need to

  • tidy up our Docker Hub and get a better image to launch,
  • release the updates to the SDK and come up with a pip name for it, and
  • get our catalog to update in the git hook flow as described (my goal is every 6 hours).

@Michaelvll requested a review from cblmemo on November 16, 2024 02:46
@cblmemo (Collaborator) left a comment


Thanks for contributing to this @kristopolous ! This is really exciting. Left some discussions. One main confusion I have: is Vast AI, like RunPod, a cloud that provides pods to users as their "VM"s? Asking because I'm seeing a lot of Docker-related code and just want to confirm :)

Review threads (resolved):
  • sky/adaptors/vast.py (outdated)
  • sky/clouds/vast.py
  • sky/clouds/vast.py (outdated)
  • sky/clouds/vast.py
  • sky/clouds/vast.py (outdated)
  • sky/provision/vast/utils.py
  • sky/provision/vast/utils.py
  • sky/provision/vast/instance.py (outdated)
  • sky/provision/vast/instance.py
  • sky/provision/vast/instance.py
@kristopolous (Contributor, Author)

Thanks for contributing to this @kristopolous ! This is really exciting. Left some discussions. One main confusion I have: is Vast AI, like RunPod, a cloud that provides pods to users as their "VM"s? Asking because I'm seeing a lot of Docker-related code and just want to confirm :)

Historically, RunPod was a clone of Vast. We currently offer Docker-style containers and will be providing VMs soonish (probably before the end of the year).

@kristopolous force-pushed the vast.ai-support branch 3 times, most recently from e9e922a to 4c9aff9 on November 21, 2024 22:28
@kristopolous (Contributor, Author)

Passing these tests is blocked by https://github.com/skypilot-org/skypilot-catalog/pull/100/commits

@kristopolous (Contributor, Author)

Also I'm going to switch the catalog from ISO 3166-1 to UN M.49 for a bit of consolidation.
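Roughly, the kind of consolidation that switch implies looks like the following. This is an illustrative sketch only: the mapping is a tiny hand-picked subset of ISO 3166-1 country codes folded into UN M.49 region codes, not the actual catalog code.

# Illustration only: fold per-country ISO 3166-1 codes into coarser UN M.49
# region codes so that many per-country catalog entries consolidate into one
# region entry.
ISO_TO_M49_REGION = {
    'US': '021',  # Northern America
    'CA': '021',  # Northern America
    'DE': '155',  # Western Europe
    'FR': '155',  # Western Europe
    'JP': '030',  # Eastern Asia
}

def m49_region(iso_country: str) -> str:
    # Fall back to '001' (World) for countries not in this illustrative subset.
    return ISO_TO_M49_REGION.get(iso_country, '001')

print(m49_region('US'))  # '021'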

Seems like tests/test_optimizer_dryruns.py is failing. Could you help fix that?
https://github.com/skypilot-org/skypilot/actions/runs/12971661591/job/36178313189?pr=4365

So you're requesting 256 GB of disk in the test, but the catalog format doesn't have a disk-space column. It's an additional constraint that isn't communicated in the catalog - this is why things are not working.
To take it to a ludicrous extreme: if I were to ask for, say, 100 yottabytes, the system would return everything in the catalog and then totally fail to allocate 100 yottabytes.
So in order for the search constraint to match the catalog listing, it has to do at least one of the following:

  1. ignore disk_size requirements (this is the only thing I can personally do within the scope of this PR)
  2. have a new v7 version of the catalog with disk size claims
  3. permit some graceful "loosening of requirements" or failure state so a cloud can say "I don't have 100 yottabyte instances"

Could you remind me why other clouds do not have this issue? And why would Vast affect the CI tests using other clouds and make them fail?

Also, it seems like tests/test_optimizer_random_dag.py is failing as well. Is it due to the same reason?

https://github.com/skypilot-org/skypilot/actions/runs/13002637917/job/36264021477?pr=4365

Depending on the datacenter, provider, and infrastructure, every cloud not only has its own limit (on Cudo, for instance, it's 1 TB), but also has different storage costs that aren't included in the catalog price (along, sometimes, with networking costs which also aren't accounted for).

Other providers appear to be using things like Ceph or Gluster with autogrow (Lambda does something like this), so storage acts like an S3 bucket.

Regardless, there are costs and limits that aren't accounted for. If you add a couple of zeros to your disk-size request, you'll start breaking many clouds. DigitalOcean throws an error that is silently swallowed, and RunPod has no error handling at all. I see some clouds raise an exception, which I'll be happy to do as well.

The fluffy cloud demo encourages returning None (https://github.com/skypilot-org/new-cloud-skeleton/blob/94409beacc6a934840076399f5f01a9059945663/fluffycloud/fluffycloud_api.py#L18)

For an unsatisfiable request (i.e., too much disk space), this just ends up in an infinite loop, which is at least what I am seeing.

Nearly all of our users end up using under 100 GB for their inference work, and most of our providers offer conventional on-system NVMe through M.2 controllers.

Anyway, we don't require our providers to have petabyte-sized arrays on distributed file systems with a high-speed interconnect, which really is the only way to be indifferent to storage requests, since you would have to accommodate lots of instances.

So what to do? I'll add an exception for the launch and I have an outstanding commit to do more disk checks.
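Roughly, the "raise instead of return None" idea looks like this. It is an illustrative sketch only, assuming a hypothetical per-cloud ceiling; the names (MAX_DISK_SIZE_GB, DiskSizeUnavailableError, check_disk_size) are not the actual SkyPilot or Vast SDK API.

MAX_DISK_SIZE_GB = 1000  # hypothetical provider-wide ceiling, e.g. 1 TB

class DiskSizeUnavailableError(Exception):
    """Raised when no offer can satisfy the requested disk size."""

def check_disk_size(requested_gb: int) -> None:
    # Fail fast with an explicit error instead of returning None, which the
    # provision loop would otherwise retry forever.
    if requested_gb > MAX_DISK_SIZE_GB:
        raise DiskSizeUnavailableError(
            f'Requested {requested_gb} GB of disk, but the largest available '
            f'offer is {MAX_DISK_SIZE_GB} GB.')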

@cblmemo (Collaborator) commented Jan 29, 2025

I revisited the detailed log of the failed CI tests and I was a little bit confused about why the error is related to disk size.

  • For example, the following test failure (from here) seems to suggest there are multiple prices available, which I guess is because Vast has multiple offers that show up as the same instance type. I thought we already discussed this in this comment, but it seems like this bug is still breaking the test. Can you help confirm why this bug still persists?
[screenshot of test failure]
  • Also, the following test failure (from here) seems to suggest that failing to find a corresponding instance for a specific resource config in VastAI will break a provision loop with an exception. From the log it seems like this is the config that results in an error. However, the expected behaviour is that if VastAI does not have such an offering, we just skip it and the optimizer will try to use another cloud (instead of raising an exception). Could you help take a look at why this is happening?
[screenshot of test failure]
  • At the same time, could you remind me which test output is related to the disk size issue you mentioned?

@kristopolous (Contributor, Author) commented Jan 29, 2025

That's not the current instance type. That's the old format. You're using an old catalog. This was changed in 81d3bdb two weeks ago.

Here's how you do a "proper" test reset. I have a standalone version of sky/clouds/service_catalog/data_fetchers/fetch_vast.py that lives outside of the repository. I put it over here in a personal gist: https://gist.github.com/kristopolous/6e05f7a09ffb021f050fac39edcab1a5

This has the new georegion=true feature flag we implemented to pass your unit test. It's not in the pushed version of fetch_vast.py yet because, even with this new consolidation, it's apparently still not enough to reliably pass all of the required tests.

Now using this I do the following:

sqlite3 ~/.sky/state.db 'delete from clusters; delete from cluster_history;' # so old cluster definitions don't break things
./vast_catalog-v6.py > ~/.sky/catalogs/v6/vast/vms.csv  # to make sure that the current catalog matches the test
rm /tmp/*.log               # so I don't accidentally look at an old run

Then I go to the website and manually remove any instances that didn't stop due to poor test cleanup.

Then I run the test

pytest -n -v ...

Also, please make sure you have an up-to-date version of the vastai_sdk; depending on how your environment is set up,

pip install -e .

may also be needed.
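Roughly, the georegion consolidation mentioned above works like the sketch below: multiple Vast offers that map to the same GPU type, GPU count, and region collapse into a single catalog row, keeping the cheapest one. The offer dicts and field names here are illustrative, not the real fetch_vast.py schema.

def consolidate(offers):
    # Keep one offer per (gpu_name, num_gpus, georegion) key: the cheapest.
    best = {}
    for offer in offers:
        key = (offer['gpu_name'], offer['num_gpus'], offer['georegion'])
        if key not in best or offer['price'] < best[key]['price']:
            best[key] = offer
    return list(best.values())

offers = [
    {'gpu_name': 'RTX_4090', 'num_gpus': 1, 'georegion': 'Europe', 'price': 0.42},
    {'gpu_name': 'RTX_4090', 'num_gpus': 1, 'georegion': 'Europe', 'price': 0.35},
]
print(consolidate(offers))  # one row per key, at the lower price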

@kristopolous (Contributor, Author) commented Jan 29, 2025

  • At the same time, could you remind me which test output is related to the disk size issue you mentioned?

disk_size: 256

assert resources.disk_size is 256

If you want my recommendation, set it at something modest like 40. This covers Stable Diffusion, ComfyUI, OpenWebUI, 70B inference models, and most of the stuff on Civitai and Hugging Face, along with most distributed scientific computing applications, since they use things like HDF5 on S3 buckets and don't transfer their tens of terabytes of data to each node.

It also keeps the test runs cheaper and maps closer to real world use cases. People aren't firing up a cluster of 8x H200s for their data storage needs.
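For illustration, a smaller request along those lines might look something like this in SkyPilot's Python API. This is a sketch only: the accelerator name is an arbitrary example, and the exact Resources arguments can vary between SkyPilot versions.

import sky

# Request a modest 40 GB disk instead of the 256 GB used by the failing test.
resources = sky.Resources(accelerators='RTX4090:1', disk_size=40)
task = sky.Task(run='nvidia-smi').set_resources(resources)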

@cblmemo (Collaborator) commented Jan 29, 2025

That's not the current instance type. That's the old format. You're using an old catalog. This was changed in 81d3bdb two weeks ago. [...]

If that is the case, can you push a new version of the VastAI catalog to the catalog repo?

@cblmemo (Collaborator) commented Jan 29, 2025

  • At the same time, could you remind me which test output is related to the disk size issue you mentioned?

disk_size: 256

If you want my recommendation, set it at something modest like 40. [...]

Just want to make sure, will this cause any test failure?

cc @Michaelvll for a look here for changing the default disk size. Maybe we can have a default disk size for Vast specifically?

@kristopolous (Contributor, Author) commented Jan 30, 2025

So I had some talks with management here. We decided to change how consolidation into instance types works for you guys. Essentially, we're excluding a bunch of our more obscure machines, as SkyPilot was just picking things that happened to be cheap but kind of obscure. Hopefully this makes things more reliable.

The 0.8 -> 0.5 change in the last commit is because we are offering a more limited array of machines; we can sacrifice our depth of offering for a lower price ceiling.

@cblmemo (Collaborator) commented Jan 30, 2025

So I had some talks with management here. We decided to change how consolidation into instance types works for you guys. [...]

Got it. LGTM. Seems like there is still a failing CI test: https://github.com/skypilot-org/skypilot/actions/runs/13043991734/job/36391357825?pr=4365

Could you help fix this?

@kristopolous (Contributor, Author)

Got it. LGTM. Seems like there is still a failing CI test: https://github.com/skypilot-org/skypilot/actions/runs/13043991734/job/36391357825?pr=4365
Could you help fix this?

I guess you are using this: skypilot-org/skypilot-catalog#102? What do you want me to do? Upload another static file? Have some kind of automated hook? Something else?

I can also just comment out that test.

@cblmemo (Collaborator) commented Feb 1, 2025

I guess you are using this: skypilot-org/skypilot-catalog#102? What do you want me to do? Upload another static file? Have some kind of automated hook? Something else?

I can also just comment out that test.

Yes, for now manually pushing a newer version of the catalog LGTM. Just merged that PR.

I'll take a look at the auto fetcher after this PR.

@cblmemo (Collaborator) commented Feb 1, 2025

Regarding the disk size, is it possible to add a Vast-specific default disk size (e.g. 80 GB, as you proposed)? Ideally we still want the other clouds to keep the same default.
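As a purely hypothetical sketch of what such a Vast-specific default could look like (the class names and hook below are illustrative, not the actual SkyPilot cloud interface):

from typing import Optional

_GLOBAL_DEFAULT_DISK_SIZE_GB = 256

class Cloud:
    # Clouds inherit the global default unless they override it.
    @classmethod
    def default_disk_size_gb(cls) -> int:
        return _GLOBAL_DEFAULT_DISK_SIZE_GB

class Vast(Cloud):
    @classmethod
    def default_disk_size_gb(cls) -> int:
        # Smaller default that matches typical Vast offers.
        return 80

def resolve_disk_size(cloud: Cloud, user_disk_size: Optional[int]) -> int:
    # An explicit user request wins; otherwise use the cloud's default.
    return user_disk_size if user_disk_size is not None else cloud.default_disk_size_gb()

print(resolve_disk_size(Vast(), None))  # 80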

@cblmemo (Collaborator) commented Feb 3, 2025

Regarding the disk size, is it possible to add a Vast-specific default disk size (e.g. 80 GB, as you proposed)? Ideally we still want the other clouds to keep the same default.

@kristopolous, for this, do you think we can separate it into another PR? If so I think this PR is in good shape and should be ready to go after smoke tests ;)

Will trigger smoke test after your confirmation.

@kristopolous (Contributor, Author)

Regarding the disk size, is it possible to add a Vast-specific default disk size (e.g. 80 GB, as you proposed)? Ideally we still want the other clouds to keep the same default.

@kristopolous, for this, do you think we can separate it into another PR? If so I think this PR is in good shape and should be ready to go after smoke tests ;)

Will trigger smoke test after your confirmation.

Don't worry about it. I've disabled those tests. We can integrate it at a later date. We intend to expand our offerings, and this should be less of a problem in the future.

@cblmemo (Collaborator) commented Feb 4, 2025

/smoke-test --aws

@cblmemo (Collaborator) commented Feb 4, 2025

Thanks @kristopolous for this amazing contribution! Looks good to me. Merging now!

@cblmemo merged commit 5e3e430 into skypilot-org:master on Feb 4, 2025
19 checks passed