Adjust file system storage class and storage put functions #73
base: main
@zhenyu let me get a review for this PR that does two refactoring changes:
Actually, both of these refactoring changes are optional; I don't absolutely need them. But I think they would be helpful to you and anyone else using Wicker, and it would be ok with me if you didn't want to do one or both of them. I also made a change to the
@zhenyu do you see any other things on this PR that we should fix or skip? I can definitely make the name changes i.e.
@@ -34,7 +35,7 @@ class AbstractDataStorage(ABC):

     @abstractmethod
     def fetch_file(self, input_path: str, local_prefix: str, timeout_seconds: int) -> str:
-        """Fetch file from chosen data storage method.
+        """Fetch file from data storage into the local path.
Do we still have the concept of fetch for the mounted Fuse DataStorage? I am assuming the end user just cares about the content. Do we need to make this ensure-path method a public interface?
input_path -> storage_path, and please add a docstring for what the input path is: whether it is a path relative to each DataStorage, or an absolute path. I am assuming a relative path, since the DataStorage class would be the holder for all file-based interactions.
@zhenyu the way that S3DataStorage was created is that unfortunately input_path represents an absolute path within that type of storage, for example the code looks like this:
data_storage = S3DataStorage()
input_path = "s3://foo/bar/baz/dummy"
local_prefix = "/tmp/data"
data_storage.fetch_file(input_path, local_prefix)
For backwards compatibility I think it is too late to change this for S3DataStorage.
I have marked in the docstrings that the first parameter (now renamed from input_path -> storage_path) represents an absolute path within that storage type and I have updated all the parameters with examples.
        :type target_path: str
        """
        pass


class FileSystemDataStorage(AbstractDataStorage):
Should we have a root prefix attr for this FileSystemDataStorage?
wicker/core/storage.py
Outdated
@@ -206,27 +234,35 @@ def fetch_obj_s3(self, input_path: str) -> bytes:
         self.client.download_fileobj(bucket, key, bio)
         return bio.getvalue()

-    def put_object_s3(self, object_bytes: bytes, s3_path: str) -> None:
+    def put_object(self, object_bytes: bytes, target_path: str) -> None:
I would like to make target_path a relative path for better abstraction. Or at least add a relative option that defaults to false for compatibility reasons.
@convexquad Thanks for the PR. I roughly got your ideas, and good job.
-        storage: S3DataStorage = S3DataStorage(),
-        s3_path_factory: S3PathFactory = S3PathFactory(),
+        storage: AbstractDataStorage = S3DataStorage(),
+        s3_path_factory: WickerPathFactory = S3PathFactory(),  # S3-specific naming kept for backwards compatibility.
I know this is out of the scope of this PR, but I think the PathFactory should be part of the DataStorage class. Let us see whether we can improve it even a little
aae388a to 21335a3
f566b6b to 4adfa69
@zhenyu I made the change you requested on the However, at least right now I do not think we can safely change the storage classes so that they are based around relative paths (they currently expect absolute paths). If you run a search like https://github.tri-ad.tech/level5/avjadoo/search?q=S3DataStorage, there are many instances of code that create an
I think your idea to use relative paths and encode the root path / root bucket path into the storage instance is better! But if we make that change right now, I think we will break compatibility with the existing usage of Wicker pretty hard. Would it be ok to keep this PR about this light refactor? I think we could actually work on fundamental improvements to Wicker in a new series of classes (i.e. new dataset, storage, etc. classes) in other PRs so there is no chance of breaking old Wicker. What do you think?
Currently, it is only possible to write Wicker datasets that have column bytes files to S3. Let's make a very light refactoring to make it just a little bit easier to test writing Wicker datasets (with column bytes files) to local filesystems by changing just a couple of the functions to be non-S3 specific.
Add put_file and put_object functions to the storage interface and update the S3DataStorage class so that the S3-specific put_file_s3 and put_object_s3 functions just call put_file and put_object (update: we decided to call them persist_file and persist_content instead).

In addition, let's remove one overly complex thing about the current FileSystemDataStorage class that works with local filesystems and that might damage local filesystem performance when used together with GCSFuse or mountpoint-S3.