Preliminary Vast AI support #4365
Conversation
Thanks for contributing this, @kristopolous! This is really exciting. I left some discussion comments. One main confusion I have: is Vast AI like RunPod, a cloud that provides pods to users as their "VM"s? Asking because I'm seeing a lot of Docker-related code, and just want to confirm :)
Historically, RunPod was a clone of Vast. We currently offer Docker-style containers and will be providing VMs soonish (probably before the end of the year).
These tests are blocked from passing by https://github.com/skypilot-org/skypilot-catalog/pull/100/commits
Co-authored-by: Tian Xia <[email protected]>
Also, I'm going to switch the catalog from ISO 3166-1 to UN M.49 for a bit of consolidation.
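To illustrate the kind of consolidation that switch enables, here is a minimal sketch (the mapping table and function name are illustrative, not the actual catalog code): several ISO 3166-1 alpha-2 country codes collapse into a single UN M.49 numeric region.

```python
# Illustrative only: a few ISO 3166-1 alpha-2 codes mapped to the UN M.49
# numeric region code that would consolidate them in the catalog.
ISO_TO_M49_REGION = {
    'US': '021',  # Northern America
    'CA': '021',
    'DE': '155',  # Western Europe
    'FR': '155',
    'JP': '030',  # Eastern Asia
}

def catalog_region(iso_code: str) -> str:
    """Map a country code to its M.49 region; '001' (World) as fallback."""
    return ISO_TO_M49_REGION.get(iso_code.upper(), '001')
```

The point of M.49 here is exactly this many-to-one collapse: the US and Canada become one region row instead of two.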
Depending on the datacenter, provider, and infrastructure, every cloud not only has its own limit (on Cudo, for instance, it's 1 TB), but also different storage costs that aren't included in the catalog price (along, sometimes, with networking costs which also aren't accounted for). Other providers appear to be using things like Ceph or Gluster and autogrow (Lambda does something like this), so storage acts like an S3 bucket. Regardless, there are costs and limits that aren't accounted for.

If you add a couple of zeros to your disk size request, you'll start breaking many clouds. DigitalOcean throws an error that just silently passes; RunPod has no error handling at all. I see some clouds raise an exception, which I'll be happy to do. The fluffy-cloud demo encourages returning rather than raising; with an unsatisfiable request (i.e., too much disk space), that just ends up in an infinite loop, which is what I'm seeing at least.

Nearly all of our users end up using under 100 GB for their inference work, and most of our providers offer conventional on-system NVMe through M.2 controllers. Anyway, we don't require our providers to have petabyte-sized arrays on distributed file systems with a high-speed interconnect, which really is the only way to be indifferent to storage requests, since you'd have to accommodate lots of instances.

So what to do? I'll add an exception for the launch, and I have an outstanding commit to do more disk checks.
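A minimal sketch of the "raise instead of loop" approach described above. The limit table and helper names are hypothetical (real limits live in each provider's catalog), but the idea is to fail fast on an unsatisfiable disk request instead of retrying forever:

```python
class DiskSizeError(ValueError):
    """Raised when a requested disk size exceeds what a provider can offer."""

# Hypothetical per-provider limits in GB; real values belong in the catalog.
_MAX_DISK_GB = {
    'cudo': 1024,  # ~1 TB, per the discussion above
    'vast': 100,   # illustrative only
}

def check_disk_request(provider: str, requested_gb: int) -> None:
    """Fail fast on an unsatisfiable disk request instead of looping."""
    limit = _MAX_DISK_GB.get(provider)
    if limit is not None and requested_gb > limit:
        raise DiskSizeError(
            f'{provider}: requested {requested_gb} GB exceeds the '
            f'{limit} GB limit; pick a smaller disk size.')
```

Raising a typed exception lets the provisioner surface the failure to the user immediately, rather than treating it as a transient capacity miss and retrying.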
I revisited the detailed log of the failed CI tests and was a little confused about why the error is related to disk size.
That's not the current instance type; that's the old format, so you're using an old catalog. This was changed in 81d3bdb two weeks ago.

Here's how you do a "proper" test reset. I have a standalone version of this that has the new catalog format. Using it, I do the following:

```shell
# So old cluster definitions don't break things:
sqlite3 ~/.sky/state.db 'delete from clusters; delete from cluster_history;'
# To make sure the current catalog matches the test:
./vast_catalog-v6.py > ~/.sky/catalogs/v6/vast/vms.csv
# So I don't accidentally look at an old run:
rm /tmp/*.log
```

Then I go to the website and manually remove any instances that didn't stop due to poor test cleanup. Then I run the test:

```shell
pytest -n -v ...
```

Also, please make sure you have an up-to-date version of the SDK; that may also be needed.
(Referenced code: line 174 in a80208f; skypilot/tests/test_yaml_parser.py, line 67 in a80208f.)
If you want my recommendation, set it at something modest like 40. This covers Stable Diffusion, ComfyUI, Open WebUI, 70B inference models, and most of the stuff on Civitai and Hugging Face, along with most distributed scientific-computing applications, since they use things like HDF5 on S3 buckets and don't transfer their tens of terabytes of data to each node. It also keeps the test runs cheaper and maps closer to real-world use cases. People aren't firing up a cluster of 8x H200s for their data-storage needs.
If that is the case, can you push a new version of the Vast AI catalog to the catalog repo?
Just want to make sure: will this cause any test failures? cc @Michaelvll for a look here at changing the default disk size. Maybe we can have a default disk size for Vast specifically?
So I had some talks with management here. We decided to change how consolidation into instance types works for you guys. Essentially, we're excluding a bunch of our more obscure machines, since SkyPilot was just picking a bunch of things that happened to be cheap but kind of obscure. Hopefully this makes things more reliable. The 0.8 -> 0.5 change in the last commit is because we're offering a more limited array of machines: we can sacrifice our depth of offering for a lower price ceiling.
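The exact semantics of the 0.8 -> 0.5 constant aren't spelled out in this thread; the sketch below assumes it's a price quantile taken over the matching offers during consolidation, and that "obscure" means too few identical machines behind an instance type. All names are hypothetical, not the actual catalog-fetcher code:

```python
import math

def consolidate(offers, min_count=5, price_quantile=0.5):
    """Collapse raw offers into instance types, dropping obscure ones.

    offers: list of {'instance_type': str, 'price': float} dicts.
    Types backed by fewer than min_count machines are excluded; the
    advertised price is the price_quantile-th quantile of the group
    (0.5 here, down from a hypothetical 0.8, lowering the price ceiling).
    """
    by_type = {}
    for o in offers:
        by_type.setdefault(o['instance_type'], []).append(o['price'])
    result = []
    for itype, prices in by_type.items():
        if len(prices) < min_count:
            continue  # too obscure: don't advertise thin capacity
        prices.sort()
        idx = min(len(prices) - 1, math.floor(price_quantile * len(prices)))
        result.append({'instance_type': itype, 'price': prices[idx]})
    return sorted(result, key=lambda r: r['instance_type'])
```

The trade-off matches the comment above: fewer advertised types, but the price each type is listed at sits lower in the observed distribution.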
Got it. LGTM. Seems like there is still a failing CI test: https://github.com/skypilot-org/skypilot/actions/runs/13043991734/job/36391357825?pr=4365 Could you help fix this?
I guess you are using this? skypilot-org/skypilot-catalog#102 What do you want me to do? Upload another static file? Have some kind of automated hook? I can also just comment out that test.
Yes, for now manually pushing a newer version of the catalog LGTM. Just merged the PR. I'll take a look at the auto-fetcher after this PR.
Regarding the disk size: is it possible to add a Vast-specific default disk size (e.g., 80 GB as you proposed)? Ideally, we still want the other clouds to keep their defaults the same.
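One way to express such a per-cloud default without touching the others is a simple override table consulted before a global fallback. This is a hypothetical sketch (the names and the 256 GB global value are assumptions, not SkyPilot's actual resolution logic):

```python
# Hypothetical defaults; SkyPilot's real default-resolution logic differs.
_GLOBAL_DEFAULT_DISK_GB = 256
_PROVIDER_DEFAULT_DISK_GB = {
    'vast': 80,  # the Vast-specific default proposed above
}

def default_disk_size(provider: str) -> int:
    """Per-provider default disk size, falling back to the global default."""
    return _PROVIDER_DEFAULT_DISK_GB.get(provider, _GLOBAL_DEFAULT_DISK_GB)
```

Because only the override table mentions Vast, every other cloud keeps the global default untouched, which is the property requested above.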
@kristopolous, for this, do you think we can separate it into another PR? If so, I think this PR is in good shape and should be ready to go after smoke tests ;) Will trigger the smoke test after your confirmation.
Don't worry about it; I've disabled those tests. We can integrate it at a later date. We intend to expand our offerings, so this should be less of a problem in the future.
/smoke-test --aws
Thanks @kristopolous for this amazing contribution! Looks good to me. Merging now! |
This is preliminary support for Vast. It currently works on an unreleased version of the SDK, which we will soon publish to PyPI.
The document https://docs.google.com/document/d/1oWox3qb3Kz3wXXSGg9ZJWwijoa99a3PIQUHBR8UgEGs/edit?pli=1&tab=t.0 was followed, and all the testing passed.
Tested (run the relevant ones):
bash format.sh
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
conda deactivate; bash -i tests/backward_compatibility_tests.sh
I'm pretty sure there will need to be edits; I'm fine with that. This is attempt 1. The outstanding work:
We need to