Fix issue 3744: Regions are not respected for buckets created with sky launch #3789
base: master
Conversation
Thanks @Shrinandan-N! Left a comment about an alternative implementation.
Rclone.generate_rclone_bucket_profile_name(
    self.name, Rclone.RcloneClouds.IBM)
Is it possible to revert these cosmetic changes to make the PR easier to read?
We rely on our linter (`./format.sh`) to enforce a consistent style and make changes if necessary.
sky/task.py (outdated)
resources = config.get('resources', None)
region = resources.get('region', None) if resources else None
storage_obj = storage_lib.Storage.from_yaml_config(
    storage[1], region)
While this solution works, it breaks the abstractions we've drawn between `Storage` and `Store`. `Storage` is intended to be agnostic to the underlying `Store` (and consequently, the region the `Store` is in). As a result, I don't think `Storage` should have a `region` field.
As you note, the bug occurs only when a `store` is specified in the YAML spec. This happens because we try to add the store at YAML parse time (lines 1046 to 1047 in e794971):
if store is not None:
    storage_obj.add_store(StoreType(store.upper()))
Instead, do you think we can delay this `add_store` from YAML parsing to after the optimization step (see `execution.py::_execute` for the sequencing), so that it gets handled in the `sync_storage_mounts` step? At that step SkyPilot knows the region the task will be scheduled in, so we can accurately specify the region to create the bucket in.
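For concreteness, a minimal sketch of the parse-time side of that change (the `requested_store_type` attribute is a made-up name used here to carry the YAML's `store:` value forward; it is not from the PR):

# At YAML parse time: remember which store the user asked for, but stop
# creating it here, so no bucket exists before the optimizer has chosen
# a cloud/region for the task.
storage_obj = storage_lib.Storage.from_yaml_config(storage[1])
if store is not None:
    storage_obj.requested_store_type = StoreType(store.upper())

The actual `add_store` call would then move into `sync_storage_mounts`, as sketched further below in this thread.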
@romilbhardwaj I removed the `add_store` from YAML parsing and noticed that the buckets were still being created, since the init function inside of `Storage` was adding to `sky_stores` (previously at storage.py, line 527):
if handle is not None:
    self.handle = handle
    # Reconstruct the Storage object from the global_user_state
    logger.debug('Detected existing storage object, '
                 f'loading Storage: {self.name}')
    self._add_store_from_metadata(self.handle.sky_stores)
The init function is triggered by the following in `from_yaml_config` (storage.py, lines 1043-1046):
storage_obj = cls(name=name,
                  source=source,
                  persistent=persistent,
                  mode=mode)
After removing `self._add_store_from_metadata` from the `Storage` init function, the bucket creation process happens after the optimization step and creates the bucket in the specified region.
Great! Can you also verify that `TestStorageWithCredentials`, `test_file_mounts`, `test_gcp_storage_mounts_with_stop`, and `test_managed_jobs_storage` also pass?
Hi @Shrinandan-N, I was wondering if the PR is still WIP? Please feel free to ping me on this thread when you are ready so I can run some tests on my end as well. Thanks!
Hi @landscapepainter, sorry, I've been communicating with Romil over Slack on this; it's WIP :) Would love it if you could test.
@Shrinandan-N My bad, I should have clarified. I'll be running manual tests to make sure this PR does not miss any edge cases. Let me know when you are ready with all the other tests :)
Hi @Shrinandan-N, I ran some manual tests and found two edge cases before running smoke tests on Azure.
- Store type does not follow the type given in the task YAML.
Reproduce:
# task.yaml
file_mounts:
  /my_data:
    name: my-sky-bucket-new-unique
    store: gcs  # Optional: either of s3, gcs, azure, r2, ibm
sky launch task.yaml --cloud aws
In this case, we would want the storage `my-sky-bucket-new-unique` to be created on GCS, but it gets created on S3. This decision is made when `store_type, store_region = self._get_preferred_store()` is run from `task.py::sync_storage_mounts()`. We need to make sure the store type specified in the task YAML is respected.
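One possible shape for that precedence check inside `task.py::sync_storage_mounts()` (a sketch only; `requested_store_type` continues the hypothetical attribute from the earlier sketch, and the `region=` parameter on `add_store` is an assumption about how the region gets threaded through):

for storage_obj in self.storage_mounts.values():
    requested = getattr(storage_obj, 'requested_store_type', None)
    if requested is not None:
        # An explicit `store:` in the task YAML wins, even when the task
        # itself is launched on a different cloud.
        store_type, store_region = requested, None
    else:
        # No explicit store: follow the cloud/region the optimizer chose.
        store_type, store_region = self._get_preferred_store()
    storage_obj.add_store(store_type, region=store_region)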
- Cannot create storages with identical names but different store types.
Reproduce:
# task_gcs.yaml
file_mounts:
  /my_data:
    name: my-sky-bucket-new-unique
    store: gcs  # Optional: either of s3, gcs, azure, r2, ibm
sky launch task_gcs.yaml --cloud gcp
After creating a GCS storage named `my-sky-bucket-new-unique`, we attempt to create another S3 storage with an identical name, so we run the following in order:
# task_s3.yaml
file_mounts:
  /my_data:
    name: my-sky-bucket-new-unique
    store: s3  # Optional: either of s3, gcs, azure, r2, ibm
sky launch task_s3.yaml --cloud aws
The two `sky launch` commands above should result in creating two sky-managed storages like below:
# result of running two sky launch above from master branch.
$ sky storage ls
NAME                      UPDATED         STORE    COMMAND                        STATUS
my-sky-bucket-new-unique  a few secs ago  S3, GCS  /home/gcpuser/skypilot/sk...   READY
But this branch instead raises an error as you can see below:
Task from YAML spec: /home/gcpuser/test2.yaml
google.api_core.exceptions.Conflict: 409 POST https://storage.googleapis.com/storage/v1/b?project=skypilot-375900&prettyPrint=false: Your previous request to create the named bucket succeeded and you already own it.
The above exception was the direct cause of the following exception:
sky.exceptions.StorageBucketCreateError: Attempted to create a bucket my-sky-bucket-new-unique but failed.
Hi @landscapepainter, the new implementation resolves these edge cases while respecting regions.
Thanks @Shrinandan-N! This update seems to be in the right direction by separating the bucket creation logic from instantiating the `AbstractStore` object.
I tested a simple `sky launch` with the GCS store type and it currently fails with `sky.exceptions.StorageUploadError`. This is because some parts of our code expect creation of an `AbstractStore` object to imply creation of the bucket. For example, `Storage.add_store()` creates an `AbstractStore` object and uploads the source to the bucket. But since we separate the logic of creating the bucket from the `AbstractStore` object, this fails. We want to make sure that none of the parts that create an `AbstractStore` object fail due to this expectation of the bucket's existence. For those parts where the `AbstractStore` object is created and the bucket's existence is expected, we need to make a call to `sync_bucket` as well after creating the object. One of the places you may also want to take a look at is `_add_store_from_metadata`.
Please take a thorough look to check whether there are other parts with this expectation and make the corresponding updates.
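For instance, a pattern like the following would keep that expectation intact inside `Storage.add_store()` (a sketch only: `sync_bucket()` is the separated bucket-creation step mentioned above, and the surrounding helper names are illustrative guesses, not the actual code):

def add_store(self, store_type: StoreType, region: Optional[str] = None):
    store_cls = _get_store_class(store_type)  # hypothetical helper
    store = store_cls(name=self.name, source=self.source, region=region)
    # The constructor no longer creates the bucket, so create/verify it
    # explicitly before anything tries to upload to it.
    store.sync_bucket()
    self._add_store(store)
    # Upload the source only after the bucket is guaranteed to exist.
    if self.source is not None:
        self._sync_store(store)
    return store

A similar explicit call may be needed in `_add_store_from_metadata`, as noted above, if callers of reconstructed storages expect the bucket to exist.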
Repro for my test that failed:
sky launch test.yaml --cloud gcp

test.yaml:
file_mounts:
  /mydata:
    store: gcs
    source: ~/mydata
    name: test-dy-mydata
    mode: MOUNT
After doing so, let's also make sure the following smoke tests pass along with the edge cases discussed at #3789 (review).
pytest tests/test_smoke.py::TestStorageWithCredentials --gcp
pytest tests/test_smoke.py::TestStorageWithCredentials --aws
pytest tests/test_smoke.py::test_managed_jobs_storage --gcp
pytest tests/test_smoke.py::test_managed_jobs_storage --aws
Thanks @Shrinandan-N! Confirmed all the smoke tests are passing.
I was running manual tests and found a failing edge case, as follows:
test.yaml:
file_mounts:
  /test:
    store: gcs
    name: my-gcs-storage-test-dy
    mode: COPY
    source: ~/test
$ sky launch temp.yaml --cloud aws -y -c test-storage
Task from YAML spec: temp.yaml
I 11-04 03:46:57 optimizer.py:691] == Optimizer ==
I 11-04 03:46:57 optimizer.py:702] Target: minimizing cost
I 11-04 03:46:57 optimizer.py:714] Estimated cost: $0.4 / hour
I 11-04 03:46:57 optimizer.py:714]
I 11-04 03:46:57 optimizer.py:839] Considered resources (1 node):
I 11-04 03:46:57 optimizer.py:909] ------------------------------------------------------------------------------------------
I 11-04 03:46:57 optimizer.py:909] CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN
I 11-04 03:46:57 optimizer.py:909] ------------------------------------------------------------------------------------------
I 11-04 03:46:57 optimizer.py:909] AWS m6i.2xlarge 8 32 - us-east-1 0.38 ✔
I 11-04 03:46:57 optimizer.py:909] ------------------------------------------------------------------------------------------
I 11-04 03:46:57 optimizer.py:909]
Running task on cluster test-storage...
I 11-04 03:46:58 storage.py:1461] Created S3 bucket 'my-gcs-storage-test-dy' in us-east-1
I 11-04 03:46:59 cloud_vm_ray_backend.py:4354] Creating a new cluster: 'test-storage' [1x AWS(m6i.2xlarge)].
As you can see, `sky launch` creates the storage on S3 while the specified store is GCS. Let's fix this and add an additional test under `TestStorageWithCredentials` to make sure the bucket is created on the correct cloud when the cloud specified for the cluster differs from the storage's store type.
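Roughly, such a test could follow the command-list pattern of the other smoke tests (a sketch only: the `Test`/`run_one_test` helpers mirror what tests/test_smoke.py already uses, while the pytest marker, fixture YAML path, and env-var plumbing here are assumptions):

# Assumes the usual test_smoke.py context: `import time`, `import pytest`,
# and the Test/run_one_test helpers defined in that file.
@pytest.mark.aws
def test_storage_store_cloud_differs_from_cluster_cloud():
    # Hypothetical smoke test: the task YAML pins `store: gcs` while the
    # cluster is launched on AWS; the bucket must end up on GCS, not S3.
    storage_name = f'sky-test-cross-cloud-{int(time.time())}'
    name = 'storage-store-cloud-differs'
    test = Test(
        name,
        [
            # Assumes a fixture YAML that reads the bucket name from an
            # env var and declares `store: gcs`.
            f'sky launch -y -c {name} --cloud aws '
            f'tests/test_yamls/gcs_store_mount.yaml '
            f'--env BUCKET_NAME={storage_name}',
            # The bucket should exist on GCS...
            f'gsutil ls gs://{storage_name}',
            # ...and not on S3.
            f'! aws s3api head-bucket --bucket {storage_name}',
        ],
        f'sky down -y {name}; sky storage delete -y {storage_name}',
    )
    run_one_test(test)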
Also, let's merge with the master branch; I see a smoke test, `test_private_bucket`, failing due to outdated code.
Please also run `./format.sh` for the linter.
Added a region variable to the Storage class and modified `self.handle.sky_stores` to change the region value from None to the region variable. Modified task.py to pass the region value into the `Storage.from_yaml_config` function so it can be accessed down the line. The main issue was that the bucket region was None when creating the bucket, resulting in the fallback us-central1.
Tested:
bash format.sh