
Fix issue 3744: Regions are not respected for buckets created with sky launch #3789

Open: wants to merge 26 commits into base: master
Conversation

Shrinandan-N
Contributor

Added a region variable to the Storage class and modified self.handle.sky_stores to replace the region value of None with that variable. Modified task.py to pass the region value into Storage.from_yaml_config so it can be accessed down the line. The root cause was that the bucket region was None at bucket-creation time, so creation fell back to us-central1.
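The failure mode described above can be boiled down to a small sketch (this is illustrative code, not actual SkyPilot source): when no region survives to the point where the bucket is created, a hard-coded fallback wins.

```python
# Minimal sketch of the bug described above; names are stand-ins.
GCS_FALLBACK_REGION = 'us-central1'  # the fallback named in the issue


def resolve_bucket_region(requested_region):
    """Return the region a new GCS bucket would be created in."""
    if requested_region is None:
        # The bug: the region from the YAML was dropped upstream,
        # so this branch was always taken.
        return GCS_FALLBACK_REGION
    return requested_region
```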

Tested:

  • YAML with and without region specified for GCP in the repro format as shown in the issue.
  • Code formatting: bash format.sh
  • [x] pytest tests/test_smoke.py::test_spot_storage --gcp
  • [x] pytest tests/test_smoke.py::test_spot_storage --aws

@romilbhardwaj (Collaborator) left a comment

Thanks @Shrinandan-N! Left a comment about an alternative implementation.

Comment on lines +3091 to +3092
Rclone.generate_rclone_bucket_profile_name(
self.name, Rclone.RcloneClouds.IBM)
Collaborator

Is it possible to revert these cosmetic changes to make the PR easier to read?

We rely on our linter (./format.py) to enforce a consistent style and make changes if necessary.

sky/task.py Outdated
Comment on lines 434 to 437
resources = config.get('resources', None)
region = resources.get('region', None) if resources else None
storage_obj = storage_lib.Storage.from_yaml_config(
storage[1], region)
Collaborator

While this solution works, it breaks the abstraction we've drawn between Storage and Store. Storage is intended to be agnostic to the underlying Store (and consequently, to the region the Store is in). As a result, I don't think Storage should have a region field.

As you note, the bug occurs only when a store is specified in the YAML spec. This happens because we try to add the store at YAML parse time:

skypilot/sky/data/storage.py

Lines 1046 to 1047 in e794971

if store is not None:
storage_obj.add_store(StoreType(store.upper()))

Instead, do you think we can delay this add_store from the YAML parsing to happen after the optimization step (see execution.py::_execute for the sequencing) so that it gets handled in the sync_storage_mounts step? At that step SkyPilot knows the region that the task will be scheduled in, so we can accurately specify the region to create the bucket in.
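The sequencing change suggested above can be sketched with simplified stand-in classes (not SkyPilot's real objects): defer add_store until after the optimizer has chosen a region, rather than calling it at YAML-parse time with no region known.

```python
# Illustrative sketch of deferring add_store until after optimization.
class Storage:
    def __init__(self, name):
        self.name = name
        self.stores = {}  # store type -> region

    def add_store(self, store_type, region=None):
        self.stores[store_type] = region


def execute(storage, pending_store_type, optimize):
    # Previously: storage.add_store(pending_store_type) ran during YAML
    # parsing, before any region was known (region=None -> fallback).
    chosen_region = optimize()  # e.g. the optimizer picks 'us-east1'
    # Now the store is added with the region the task will actually run in.
    storage.add_store(pending_store_type, region=chosen_region)
    return storage
```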

@Shrinandan-N (Contributor, Author) commented on Jul 29, 2024

@romilbhardwaj
I removed the add_store call from YAML parsing and noticed that buckets were still being created, since Storage's __init__ was adding stores from sky_stores (previously at storage.py, line 527):

        if handle is not None:
            self.handle = handle
            # Reconstruct the Storage object from the global_user_state
            logger.debug('Detected existing storage object, '
                         f'loading Storage: {self.name}')
            self._add_store_from_metadata(self.handle.sky_stores)

The init function is triggered by the following in from_yaml_config (storage.py, 1043-1046):

        storage_obj = cls(name=name,
                          source=source,
                          persistent=persistent,
                          mode=mode)

After removing self._add_store_from_metadata from Storage's __init__, bucket creation happens after the optimization step and creates the bucket in the specified region.
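The change described above can be sketched with simplified stand-ins (not the real storage.py classes): the constructor keeps the handle but no longer eagerly re-adds stores, so nothing triggers bucket creation at parse time.

```python
# Hedged sketch of deferring store reconstruction out of __init__.
class StorageHandle:
    def __init__(self, sky_stores):
        self.sky_stores = sky_stores  # store type -> metadata


class Storage:
    def __init__(self, name, handle=None):
        self.name = name
        self.handle = handle
        self.stores = {}
        # Previously: self._add_store_from_metadata(handle.sky_stores) ran
        # here, creating buckets during __init__. Now it is deferred.

    def add_stores_from_metadata(self):
        if self.handle is not None:
            self.stores.update(self.handle.sky_stores)
```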

Collaborator

Great! Can you also verify that TestStorageWithCredentials, test_file_mounts, test_gcp_storage_mounts_with_stop, and test_managed_jobs_storage also pass?

@landscapepainter (Collaborator)

Hi @Shrinandan-N, I was wondering if this PR is still WIP? Please feel free to ping me on this thread when you are ready, so I can run some tests on my end as well. Thanks!

@Shrinandan-N (Contributor, Author)

Hi @landscapepainter, sorry, I've been communicating with Romil over Slack on this; it's WIP :) Would love it if you could test test_managed_jobs_storage and TestStorageWithCredentials. The other two tests are passing on my end; these two are failing due to a permission error and a Microsoft package error. Thank you!

@landscapepainter (Collaborator)

@Shrinandan-N My bad, I should have clarified this. I'll be running manual tests to make sure this PR does not miss out on any edge cases. Let me know when you are ready with all the other tests :)

@landscapepainter (Collaborator) left a comment

Hi @Shrinandan-N, I ran some manual tests and found 2 edge cases before running smoke tests on Azure.

  1. Store type does not follow the type given in task yaml:
    Reproduce:
# task.yaml
file_mounts:
  /my_data:
    name: my-sky-bucket-new-unique
    store: gcs  # Optional: either of s3, gcs, azure, r2, ibm

sky launch -c task.yaml --cloud aws
In this case, we would want the storage my-sky-bucket-new-unique to be created at gcs, but it gets created at s3. This decision is made when store_type, store_region = self._get_preferred_store() is run from task.py::sync_storage_mounts(). We need to make sure the store type specified in the task YAML is respected.

  2. Cannot create storages with identical names but different store types.
    Reproduce:
# task_gcs.yaml
file_mounts:
  /my_data:
    name: my-sky-bucket-new-unique
    store: gcs  # Optional: either of s3, gcs, azure, r2, ibm

sky launch -c task_gcs.yaml --cloud gcp

After creating a GCS storage named my-sky-bucket-new-unique, we attempt to create another S3 storage with an identical name, by running the following next:

# task_s3.yaml
file_mounts:
  /my_data:
    name: my-sky-bucket-new-unique
    store: s3  # Optional: either of s3, gcs, azure, r2, ibm

sky launch -c task_s3.yaml --cloud aws

The two sky launch commands above should result in one sky-managed storage with two store types, like below:

# result of running two sky launch above from master branch.
$ sky storage ls
NAME                      UPDATED         STORE    COMMAND                       STATUS
my-sky-bucket-new-unique  a few secs ago  S3, GCS  /home/gcpuser/skypilot/sk...  READY

But this branch instead raises an error as you can see below:

Task from YAML spec: /home/gcpuser/test2.yaml
google.api_core.exceptions.Conflict: 409 POST https://storage.googleapis.com/storage/v1/b?project=skypilot-375900&prettyPrint=false: Your previous request to create the named bucket succeeded and you already own it.

The above exception was the direct cause of the following exception:

sky.exceptions.StorageBucketCreateError: Attempted to create a bucket my-sky-bucket-new-unique but failed.
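The behavior expected in edge case 2 can be sketched as follows (an illustrative stand-in, not SkyPilot's actual Storage class): one storage name should accumulate multiple store types (S3, GCS, ...) rather than failing with a bucket-creation conflict on the second launch, and re-adding an existing store type should be a no-op.

```python
# Illustrative sketch: one storage name, multiple store types, idempotent.
class Storage:
    def __init__(self, name):
        self.name = name
        self.store_types = set()

    def add_store(self, store_type):
        if store_type in self.store_types:
            # Bucket already exists for this store type; skip instead of
            # re-creating and hitting a 409 Conflict.
            return
        self.store_types.add(store_type)
```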

@Shrinandan-N (Contributor, Author)

Hi @landscapepainter, the new implementation resolves these edge cases while respecting regions.

@landscapepainter (Collaborator) left a comment

Thanks @Shrinandan-N! This update seems to be in the right direction, separating the bucket-creation logic from instantiating the AbstractStore object.

I tested with a simple sky launch with the GCS store type, and it currently fails with sky.exceptions.StorageUploadError. This is because some parts of our code expect creation of an AbstractStore object to imply creation of the bucket. For example, Storage.add_store() creates an AbstractStore object and uploads the source to the bucket, but since we separated bucket creation from the AbstractStore object, this fails. We want to make sure that none of the places creating an AbstractStore object fail due to this expectation of the bucket's existence. Wherever an AbstractStore object is created and the bucket's existence is expected, we need to call sync_bucket after creating the object. One place you may also want to look at is _add_store_from_metadata. Please take a thorough look to check if there are other parts with this expectation and make the corresponding updates.
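The invariant discussed above can be sketched with stand-in names (not SkyPilot's real classes): once bucket creation is decoupled from AbstractStore construction, any caller that expects the bucket to exist must sync it explicitly.

```python
# Sketch: construction no longer creates a bucket; callers must sync.
class AbstractStore:
    def __init__(self, name):
        self.name = name
        self.bucket_exists = False  # no bucket is created on construction

    def sync_bucket(self):
        # In SkyPilot this would create the cloud bucket; modeled as a flag.
        self.bucket_exists = True


def add_store(name):
    store = AbstractStore(name)
    # Required now: without this call, a later upload to the bucket would
    # fail, as the StorageUploadError above shows.
    store.sync_bucket()
    return store
```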

Repro for my test that failed:
sky launch test.yaml --cloud gcp

test.yaml:

file_mounts:
  /mydata:
    store: gcs
    source: ~/mydata
    name: test-dy-mydata
    mode: MOUNT

After doing so, let's also make sure the following smoke tests pass along with the edge cases discussed at #3789 (review).

  • pytest tests/test_smoke.py::TestStorageWithCredentials --gcp
  • pytest tests/test_smoke.py::TestStorageWithCredentials --aws
  • pytest tests/test_smoke.py::test_managed_jobs_storage --gcp
  • pytest tests/test_smoke.py::test_managed_jobs_storage --aws

@landscapepainter (Collaborator) left a comment

Thanks @Shrinandan-N! Confirmed all the smoke tests are passing.
I was running manual tests and found a failing edge case, as follows:

temp.yaml

file_mounts:
  /test:
    store: gcs
    name: my-gcs-storage-test-dy
    mode: COPY
    source: ~/test
$ sky launch temp.yaml --cloud aws -y -c test-storage
Task from YAML spec: temp.yaml
I 11-04 03:46:57 optimizer.py:691] == Optimizer ==
I 11-04 03:46:57 optimizer.py:702] Target: minimizing cost
I 11-04 03:46:57 optimizer.py:714] Estimated cost: $0.4 / hour
I 11-04 03:46:57 optimizer.py:714]
I 11-04 03:46:57 optimizer.py:839] Considered resources (1 node):
I 11-04 03:46:57 optimizer.py:909] ------------------------------------------------------------------------------------------
I 11-04 03:46:57 optimizer.py:909]  CLOUD   INSTANCE      vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE   COST ($)   CHOSEN
I 11-04 03:46:57 optimizer.py:909] ------------------------------------------------------------------------------------------
I 11-04 03:46:57 optimizer.py:909]  AWS     m6i.2xlarge   8       32        -              us-east-1     0.38          ✔
I 11-04 03:46:57 optimizer.py:909] ------------------------------------------------------------------------------------------
I 11-04 03:46:57 optimizer.py:909]
Running task on cluster test-storage...
I 11-04 03:46:58 storage.py:1461] Created S3 bucket 'my-gcs-storage-test-dy' in us-east-1
I 11-04 03:46:59 cloud_vm_ray_backend.py:4354] Creating a new cluster: 'test-storage' [1x AWS(m6i.2xlarge)].

As you can see, sky launch creates the storage on S3 while the specified store is gcs. Let's fix this and add an additional test under TestStorageWithCredentials to make sure the bucket is created under the correct cloud when the cluster's cloud and the storage's store type differ.
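The check such a test would make can be sketched as follows (the mapping and function names are illustrative, not SkyPilot APIs): when the cluster cloud and the mount's `store:` differ, the `store:` field must win.

```python
# Illustrative precedence check: YAML `store:` overrides the cluster cloud.
CLOUD_DEFAULT_STORE = {'aws': 's3', 'gcp': 'gcs', 'azure': 'azure'}


def expected_store(cluster_cloud, yaml_store):
    """Return the store type the bucket should be created under."""
    if yaml_store is not None:
        return yaml_store  # the mount's `store:` field takes precedence
    return CLOUD_DEFAULT_STORE[cluster_cloud]
```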

Also, let's merge with the master branch; I see a smoke test, test_private_bucket, failing due to outdated code. Please also run ./format.sh to run the linter.
