Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adjust file system storage class and storage put functions #73

Open
wants to merge 11 commits into
base: main
Choose a base branch
from

Conversation

convexquad
Copy link
Collaborator

@convexquad convexquad commented Oct 5, 2024

Currently, it is only possible to write Wicker datasets that have column bytes files to S3. Let's make a very light refactoring to make it just a little bit easier to test writing Wicker datasets (with column bytes files) to local filesystems by changing just a couple of the functions to be non-S3 specific.

  • In the refactoring, we will add put_file and put_object functions to the storage interface and update the S3DataStorage class so that the S3-specific put_file_s3 and put_object_s3 functions just call put_file and put_object (update: we decided to call them persist_file and persist_content instead).
  • Let's also rename an internal class for testing (so that it is more clear that it is just for tests).

In addition, let's remove one overly complex things about the current FileSystemDataStorage class that works with local filesystems and that might damage local filesystem performance when used together with GCSFuse or mountpoint-S3.

@convexquad convexquad self-assigned this Oct 5, 2024
@convexquad
Copy link
Collaborator Author

@zhenyu let me get a review for this PR that does two refactoring changes:

  1. Change confusing LocalDataStorage test class name.
  2. Add put_file and put_object functions to AbstractDataStorage interface and add these functions to S3DataStorage class (and have the S3-specific versions of these functions call the new functions).

Actually, both of these refactoring changes are optional - I don't absolutely need them. But, I am thinking that they would be helpful to you and anyone else using Wicker. But it would be ok with me if you didn't want to do one or both of them.

I also made a change to the FileSystemDataStorage class to remove the download_with_retries function and this is actually the one important change I want to make in order to remove the @retry annotation that we do not actually need for GCSFuse or mountpoint-S3.

@convexquad convexquad added the enhancement New feature or request label Oct 8, 2024
@convexquad convexquad requested a review from zhenyu October 8, 2024 00:48
wicker/core/storage.py Outdated Show resolved Hide resolved
wicker/core/storage.py Outdated Show resolved Hide resolved
@convexquad
Copy link
Collaborator Author

convexquad commented Oct 9, 2024

@zhenyu do you see any other things on this PR that we should fix or skip? I can definitely make the name changes i.e. put_file -> persist_file and put_object -> persist_content (or maybe write_file and write_content???)

@@ -34,7 +35,7 @@ class AbstractDataStorage(ABC):

@abstractmethod
def fetch_file(self, input_path: str, local_prefix: str, timeout_seconds: int) -> str:
"""Fetch file from chosen data storage method.
"""Fetch file from data storage into the local path.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still have the concept of fetch for the mounted Fuse DataStorage? I am assuming the end user just care about the content, Do we need make this ensure path method as a public interface?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

input_path-> storage_path, add please add doc string for what is the input path. Wether relative path to each DataStorage, or abs path. I am assuming the relative path since the DataStorage class would be a holder for all the interactive with file based interactive

Copy link
Collaborator Author

@convexquad convexquad Oct 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zhenyu the way that S3DataStorage was created is that unfortunately input_path represents an absolute path within that type of storage, for example the code looks like this:

data_storage = S3DataStorage()
input_path = "s3://foo/bar/baz/dummy"
local_prefix = "/tmp/data"
data_storage.fetch_file(input_path, local_prefix)

For backwards compatibility I think it is too late to change this for S3DataStorage.

I have marked in the docstrings that the first parameter (now renamed from input_path -> storage_path) represents an absolute path within that storage type and I have updated all the parameters with examples.

wicker/core/storage.py Outdated Show resolved Hide resolved
wicker/core/storage.py Outdated Show resolved Hide resolved
:type target_path: str
"""
pass


class FileSystemDataStorage(AbstractDataStorage):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we have a root prefix attr for this FileSystemDataStorage?

wicker/core/storage.py Outdated Show resolved Hide resolved
@@ -206,27 +234,35 @@ def fetch_obj_s3(self, input_path: str) -> bytes:
self.client.download_fileobj(bucket, key, bio)
return bio.getvalue()

def put_object_s3(self, object_bytes: bytes, s3_path: str) -> None:
def put_object(self, object_bytes: bytes, target_path: str) -> None:
Copy link
Collaborator

@zhenyu zhenyu Oct 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like make the target_path a relative path for better abstraction. Or at least a relative option and default to false for the compatible reason

@zhenyu
Copy link
Collaborator

zhenyu commented Oct 9, 2024

@zhenyu let me get a review for this PR that does two refactoring changes:

  1. Change confusing LocalDataStorage test class name.
  2. Add put_file and put_object functions to AbstractDataStorage interface and add these functions to S3DataStorage class (and have the S3-specific versions of these functions call the new functions).

Actually, both of these refactoring changes are optional - I don't absolutely need them. But, I am thinking that they would be helpful to you and anyone else using Wicker. But it would be ok with me if you didn't want to do one or both of them.

I also made a change to the FileSystemDataStorage class to remove the download_with_retries function and this is actually the one important change I want to make in order to remove the @retry annotation that we do not actually need for GCSFuse or mountpoint-S3.

@convexquad Thanks for the PR. I roughly got your ideas and good job.
One thing I want to point out is. Seems like the Datastorage should be a container for all the files about the wicker, and each storage has their config with what is the root (for example, s3 root bucket in the wicker.json). So I would like to introduce the relative storage path in the put/fetch etc functions. Let me know WDYT. Thanks again

storage: S3DataStorage = S3DataStorage(),
s3_path_factory: S3PathFactory = S3PathFactory(),
storage: AbstractDataStorage = S3DataStorage(),
s3_path_factory: WickerPathFactory = S3PathFactory(), # S3-specific naming kept for backwards compatibility.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is out of the scope of this PR, but I think the PathFactory should be part of the DataStorage class. Let us see whether we can improve it even a little

@convexquad convexquad force-pushed the dev/abain/update_data_storage branch from aae388a to 21335a3 Compare October 10, 2024 06:16
@convexquad convexquad force-pushed the dev/abain/update_data_storage branch from f566b6b to 4adfa69 Compare October 19, 2024 00:05
@convexquad
Copy link
Collaborator Author

convexquad commented Oct 19, 2024

@zhenyu I made the change you requested on the AbstractDataStorage functions to change the target_path parameter to storage_path and update the parameter docstrings with examples!

However, at least right now I do not think we can safely make the change to storage classes so that they are based around relative paths (they current expect absolute paths). If you make a search like https://github.tri-ad.tech/level5/avjadoo/search?q=S3DataStorage, there are many instances of code that create an S3DataStorage object and expect to pass it absolute paths.

  • The way that the storage classes were originally designed, they were thought of as a wrapper API abstraction of the underlying storage class. So for example S3DataStorage was intended as a way to use the S3 API. It was not actually intended to represent a particular bucket (and starting path within a bucket).
  • The path factory is used to actually determine what bucket / storage system to use and the parent path in the bucket / storage system. The storage class and path factory classes were intended to be used as clients of a particular Dataset class that understands how to use them together (rather than the path factory being a private member of the storage instance as you suggested).

I think your idea to use relative paths and encode the root path / root bucket path into the storage instance is better! But if we make that change right now, I think we will break compatibility with the existing usage of Wicker pretty hard.

Would it be ok to keep this PR about this light refactor? I think we could actually work on fundamental improvements to Wicker in a new series of classes (i.e. new dataset, storage, etc. classes) in other PRs so there is no chance of breaking old Wicker. What do you think?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants