
Preliminary Vast AI support #4365

Merged: 126 commits, Feb 4, 2025
Conversation

@kristopolous (Contributor) commented Nov 15, 2024

This is preliminary support for Vast. It currently works with an unreleased version of the SDK, which we will soon publish to PyPI.

The document at https://docs.google.com/document/d/1oWox3qb3Kz3wXXSGg9ZJWwijoa99a3PIQUHBR8UgEGs/edit?pli=1&tab=t.0 was followed and all of the testing passed.

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

I'm pretty sure there will need to be edits, and I'm fine with that. This is attempt 1. The outstanding work:

We need to

  • tidy up our Docker Hub and get a better image to launch,
  • release the updates to the SDK and come up with a pip name for it, and
  • get our catalog to update in the git hook flow as described (my goal is every 6 hours).

@Michaelvll requested a review from cblmemo on November 16, 2024 02:46
@cblmemo (Collaborator) left a comment


Thanks for contributing to this @kristopolous ! This is really exciting. Left some discussions. One main confusion I have: is Vast AI, like RunPod, a cloud that provides pods to users as their "VM"s? Asking because I'm seeing a lot of Docker-related code and just want to confirm :)

Review threads (resolved):
  • sky/adaptors/vast.py (outdated)
  • sky/clouds/vast.py
  • sky/clouds/vast.py (outdated)
  • sky/clouds/vast.py
  • sky/clouds/vast.py (outdated)
  • sky/provision/vast/utils.py
  • sky/provision/vast/utils.py
  • sky/provision/vast/instance.py (outdated)
  • sky/provision/vast/instance.py
  • sky/provision/vast/instance.py
@kristopolous (Contributor, Author)

Thanks for contributing to this @kristopolous ! This is really exciting. Left some discussions. One main confusion I have: is Vast AI, like RunPod, a cloud that provides pods to users as their "VM"s? Asking because I'm seeing a lot of Docker-related code and just want to confirm :)

Historically, RunPod was a clone of Vast. We currently offer Docker-style containers and will be providing VMs soonish (probably before the end of the year).

@kristopolous force-pushed the vast.ai-support branch 3 times, most recently from e9e922a to 4c9aff9 on November 21, 2024 22:28
@kristopolous (Contributor, Author)

Passing these tests is blocked by https://github.com/skypilot-org/skypilot-catalog/pull/100/commits

@kristopolous (Contributor, Author)

Also I'm going to switch the catalog from ISO 3166-1 to UN M.49 for a bit of consolidation.
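Roughly, the kind of consolidation that switch implies looks like the following. This is an illustrative sketch only: the mapping is a tiny hand-picked subset of ISO 3166-1 country codes folded into UN M.49 region codes, not the actual catalog code.

# Illustration only: fold per-country ISO 3166-1 codes into coarser UN M.49
# region codes so that many per-country catalog entries consolidate into one
# region entry.
ISO_TO_M49_REGION = {
    'US': '021',  # Northern America
    'CA': '021',  # Northern America
    'DE': '155',  # Western Europe
    'FR': '155',  # Western Europe
    'JP': '030',  # Eastern Asia
}

def m49_region(iso_country: str) -> str:
    # Fall back to '001' (World) for countries not in this illustrative subset.
    return ISO_TO_M49_REGION.get(iso_country, '001')

print(m49_region('US'))  # '021'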

Seems like tests/test_optimizer_dryruns.py is failing. Could you help fix that?
https://github.com/skypilot-org/skypilot/actions/runs/12971661591/job/36178313189?pr=4365

So you're requesting 256 GB of disk in the test, but the catalog format doesn't have a disk-space column. It's an additional constraint that isn't communicated in the catalog - this is why things are not working.
To take it to a ludicrous extreme: if I were to ask for, say, 100 yottabytes, the system would return everything in the catalog and then totally fail to allocate 100 yottabytes.
So in order for the search constraint to match the catalog listing, it has to do at least one of the following:

  1. ignore disk_size requirements (this is the only thing I can personally do within the scope of this PR)
  2. have a new v7 version of the catalog with disk size claims
  3. permit some graceful "loosening of requirements" or failure state so a cloud can say "I don't have 100 yottabyte instances"

Could you remind me why other clouds do not have this issue? And why would Vast affect the CI tests using other clouds and make them fail?

Also, it seems like tests/test_optimizer_random_dag.py is failing as well. Is it due to the same reason?

https://github.com/skypilot-org/skypilot/actions/runs/13002637917/job/36264021477?pr=4365

Depending on the datacenter, provider, and infrastructure, every cloud not only has its own limit (on Cudo, for instance, it's 1 TB), but also has different storage costs that aren't included in the catalog price (along, sometimes, with networking costs which also aren't accounted for).

Other providers appear to be using things like Ceph or Gluster with autogrow (Lambda does something like this), so storage acts like an S3 bucket.

Regardless, there are costs and limits that aren't accounted for. If you add a couple of zeros to your disk-size request, you'll start breaking many clouds. DigitalOcean throws an error that is silently swallowed, and RunPod has no error handling at all. I see some clouds raise an exception, which I'll be happy to do as well.

The fluffy cloud demo encourages returning None (https://github.com/skypilot-org/new-cloud-skeleton/blob/94409beacc6a934840076399f5f01a9059945663/fluffycloud/fluffycloud_api.py#L18)

For an unsatisfiable request (i.e., too much disk space), this just ends up in an infinite loop, which is at least what I am seeing.

Nearly all of our users end up using under 100 GB for their inference work, and most of our providers offer conventional on-system NVMe through M.2 controllers.

Anyway, we don't require our providers to have petabyte-sized arrays on distributed file systems with a high-speed interconnect, which really is the only way to be indifferent to storage requests, since you would have to accommodate lots of instances.

So what to do? I'll add an exception for the launch and I have an outstanding commit to do more disk checks.
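Roughly, the "raise instead of return None" idea looks like this. It is an illustrative sketch only, assuming a hypothetical per-cloud ceiling; the names (MAX_DISK_SIZE_GB, DiskSizeUnavailableError, check_disk_size) are not the actual SkyPilot or Vast SDK API.

MAX_DISK_SIZE_GB = 1000  # hypothetical provider-wide ceiling, e.g. 1 TB

class DiskSizeUnavailableError(Exception):
    """Raised when no offer can satisfy the requested disk size."""

def check_disk_size(requested_gb: int) -> None:
    # Fail fast with an explicit error instead of returning None, which the
    # provision loop would otherwise retry forever.
    if requested_gb > MAX_DISK_SIZE_GB:
        raise DiskSizeUnavailableError(
            f'Requested {requested_gb} GB of disk, but the largest available '
            f'offer is {MAX_DISK_SIZE_GB} GB.')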

@cblmemo (Collaborator) commented Jan 29, 2025

I revisited the detailed log of the failed CI tests and I was a little bit confused about why the error is related to disk size.

  • For example, the following test failure (from here) seems to suggest there are multiple prices available, which I guess is because Vast has multiple offers that show up as the same instance type. I thought we already discussed this in this comment, but it seems like this bug is still breaking the test. Can you help confirm why this bug still persists?
[screenshot of test failure]
  • Also, the following test failure (from here) seems to suggest that failing to find a corresponding instance for a specific resource config in VastAI will break a provision loop with an exception. From the log it seems like this is the config that results in an error. However, the expected behaviour is that if VastAI does not have such an offering, we just skip it and the optimizer will try to use another cloud (instead of raising an exception). Could you help take a look at why this is happening?
[screenshot of test failure]
  • At the same time, could you remind me which test output is related to the disk size issue you mentioned?

@kristopolous (Contributor, Author) commented Jan 29, 2025

That's not the current instance type. That's the old format. You're using an old catalog. This was changed in 81d3bdb two weeks ago.

Here's how you do a "proper" test reset. I have a standalone version of sky/clouds/service_catalog/data_fetchers/fetch_vast.py that lives outside of the repository. I put it over here in a personal gist: https://gist.github.com/kristopolous/6e05f7a09ffb021f050fac39edcab1a5

This has the new georegion=true feature flag we implemented to pass your unit test. It's not in the pushed version of fetch_vast.py yet because, even with this new consolidation, it's apparently still not enough to reliably pass all of the required tests.

Now using this I do the following:

sqlite3 ~/.sky/state.db 'delete from clusters; delete from cluster_history;' # so old cluster definitions don't break things
./vast_catalog-v6.py > ~/.sky/catalogs/v6/vast/vms.csv  # to make sure that the current catalog matches the test
rm /tmp/*.log               # so I don't accidentally look at an old run

Then I go to the website and manually remove any instances that didn't stop due to poor test cleanup.

Then I run the test

pytest -n -v ...

Also, please make sure you have an up-to-date version of the vastai_sdk; depending on how your environment is set up,

pip install -e .

may also be needed.
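Roughly, the georegion consolidation mentioned above works like the sketch below: multiple Vast offers that map to the same GPU type, GPU count, and region collapse into a single catalog row, keeping the cheapest one. The offer dicts and field names here are illustrative, not the real fetch_vast.py schema.

def consolidate(offers):
    # Keep one offer per (gpu_name, num_gpus, georegion) key: the cheapest.
    best = {}
    for offer in offers:
        key = (offer['gpu_name'], offer['num_gpus'], offer['georegion'])
        if key not in best or offer['price'] < best[key]['price']:
            best[key] = offer
    return list(best.values())

offers = [
    {'gpu_name': 'RTX_4090', 'num_gpus': 1, 'georegion': 'Europe', 'price': 0.42},
    {'gpu_name': 'RTX_4090', 'num_gpus': 1, 'georegion': 'Europe', 'price': 0.35},
]
print(consolidate(offers))  # one row per key, at the lower price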

@kristopolous (Contributor, Author) commented Jan 29, 2025

  • At the same time, could you remind me which test output is related to the disk size issue you mentioned?

disk_size: 256

assert resources.disk_size is 256

If you want my recommendation, set it at something modest like 40. This covers Stable Diffusion, ComfyUI, OpenWebUI, 70B inference models, and most of the stuff on Civitai and Hugging Face, along with most distributed scientific computing applications, since they use things like HDF5 on S3 buckets and don't transfer their tens of terabytes of data to each node.

It also keeps the test runs cheaper and maps closer to real world use cases. People aren't firing up a cluster of 8x H200s for their data storage needs.
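For illustration, a smaller request along those lines might look something like this in SkyPilot's Python API. This is a sketch only: the accelerator name is an arbitrary example, and the exact Resources arguments can vary between SkyPilot versions.

import sky

# Request a modest 40 GB disk instead of the 256 GB used by the failing test.
resources = sky.Resources(accelerators='RTX4090:1', disk_size=40)
task = sky.Task(run='nvidia-smi').set_resources(resources)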

@cblmemo (Collaborator) commented Jan 29, 2025

That's not the current instance type. That's the old format. You're using an old catalog. This was changed in 81d3bdb two weeks ago. [...]

If that is the case, can you push a new version of the VastAI catalog to the catalog repo?

@cblmemo (Collaborator) commented Jan 29, 2025

  • At the same time, could you remind me which test output is related to the disk size issue you mentioned?

disk_size: 256

If you want my recommendation, set it at something modest like 40. [...]

Just want to make sure, will this cause any test failure?

cc @Michaelvll for a look here for changing the default disk size. Maybe we can have a default disk size for Vast specifically?

@kristopolous (Contributor, Author) commented Jan 30, 2025

So I had some talks with management here. We decided to change how consolidation into instance types works for you guys. Essentially, we're excluding a bunch of our more obscure machines, as SkyPilot was just picking things that happened to be cheap but kind of obscure. Hopefully this makes things more reliable.

The 0.8 -> 0.5 change in the last commit is because we are offering a more limited array of machines; we can sacrifice our depth of offering for a lower price ceiling.

@cblmemo (Collaborator) commented Jan 30, 2025

So I had some talks with management here. We decided to change how consolidation into instance types works for you guys. [...]

Got it. LGTM. Seems like there is still a failing CI test: https://github.com/skypilot-org/skypilot/actions/runs/13043991734/job/36391357825?pr=4365

Could you help fix this?

@kristopolous (Contributor, Author)

Got it. LGTM. Seems like there is still a failing CI test: https://github.com/skypilot-org/skypilot/actions/runs/13043991734/job/36391357825?pr=4365
Could you help fix this?

I guess you are using this: skypilot-org/skypilot-catalog#102? What do you want me to do? Upload another static file? Have some kind of automated hook? Something else?

I can also just comment out that test.

@cblmemo (Collaborator) commented Feb 1, 2025

I guess you are using this: skypilot-org/skypilot-catalog#102? What do you want me to do? Upload another static file? Have some kind of automated hook? Something else?

I can also just comment out that test.

Yes, for now manually pushing a newer version of the catalog LGTM. Just merged that PR.

I'll take a look at the auto fetcher after this PR.

@cblmemo (Collaborator) commented Feb 1, 2025

Regarding the disk size, is it possible to add a Vast-specific default disk size (e.g. 80 GB, as you proposed)? Ideally we still want the other clouds to keep the same default.
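As a purely hypothetical sketch of what such a Vast-specific default could look like (the class names and hook below are illustrative, not the actual SkyPilot cloud interface):

from typing import Optional

_GLOBAL_DEFAULT_DISK_SIZE_GB = 256

class Cloud:
    # Clouds inherit the global default unless they override it.
    @classmethod
    def default_disk_size_gb(cls) -> int:
        return _GLOBAL_DEFAULT_DISK_SIZE_GB

class Vast(Cloud):
    @classmethod
    def default_disk_size_gb(cls) -> int:
        # Smaller default that matches typical Vast offers.
        return 80

def resolve_disk_size(cloud: Cloud, user_disk_size: Optional[int]) -> int:
    # An explicit user request wins; otherwise use the cloud's default.
    return user_disk_size if user_disk_size is not None else cloud.default_disk_size_gb()

print(resolve_disk_size(Vast(), None))  # 80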

@cblmemo (Collaborator) commented Feb 3, 2025

Regarding the disk size, is it possible to add a Vast-specific default disk size (e.g. 80 GB, as you proposed)? Ideally we still want the other clouds to keep the same default.

@kristopolous, for this, do you think we can separate it into another PR? If so I think this PR is in good shape and should be ready to go after smoke tests ;)

Will trigger smoke test after your confirmation.

@kristopolous (Contributor, Author)

Regarding the disk size, is it possible to add a Vast-specific default disk size (e.g. 80 GB, as you proposed)? Ideally we still want the other clouds to keep the same default.

@kristopolous, for this, do you think we can separate it into another PR? If so I think this PR is in good shape and should be ready to go after smoke tests ;)

Will trigger smoke test after your confirmation.

Don't worry about it. I've disabled those tests. We can integrate it at a later date. We intend to expand our offerings, and this should be less of a problem in the future.

@cblmemo (Collaborator) commented Feb 4, 2025

/smoke-test --aws

@cblmemo (Collaborator) commented Feb 4, 2025

Thanks @kristopolous for this amazing contribution! Looks good to me. Merging now!

@cblmemo merged commit 5e3e430 into skypilot-org:master on Feb 4, 2025
19 checks passed