Abstractify the Path Factory #55

pickles-bread-and-butter · 2024-05-07T23:59:02Z

Overview

Abstracts the file pathing class in Wicker. Wicker has one common path structure, and that structure is not specific to S3. This abstraction allows different file pathers to be added on top of it, if needed, while keeping the structure consistent.

Motivation

This set of PRs, including #56 and #57, aims to fix two problems/incorrect assumptions. The context for each is given below:

Wicker currently limits itself heavily to S3. It only implements S3 and largely lacks the infrastructure and abstractions needed to add more data sources. This is evidenced by classes like S3Dataset and S3PathFactory, which read as implementations of a base class, yet no base class exists. These are the only writing and access methods, and there is no common class to interface with.
There is no common core and no common API that users can infer from the different classes, so each implementation can easily diverge from the common access pattern. This should be rectified by adding a common core.
Wicker has a very strict file structure, and it shouldn't deviate from one data platform to another. What should vary is where that file structure is stored and how Wicker accesses it, not the path structure itself.
There is no way to have a local or mounted file structure. Wicker currently does not support a dataset on a local file system, so you can't create a dataset locally and test it. That is a major problem for functional testing: we have to mock out S3 just to test locally.

What this implements

This PR is narrowly targeted at splitting S3PathFactory into a base class, WickerPathFactory, and an S3 implementation class. The new base class defines the path structure generically, while S3PathFactory only adds the S3-specific pieces. This gives us a common class to build on, keeping the access pattern for paths identical across both so that user code can swap between them easily.
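
A minimal sketch of the split described above, not the actual Wicker code. The class names WickerPathFactory and S3PathFactory and the root_path attribute come from this PR; the DatasetID fields, the get_dataset_path method, and the path layout are hypothetical and only illustrate the shape of the abstraction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetID:
    # Hypothetical stand-in for Wicker's DatasetID; the real fields may differ.
    name: str
    version: str


class WickerPathFactory:
    """Base class: owns the storage-agnostic path structure."""

    def __init__(self, root_path: str) -> None:
        self.root_path = root_path.rstrip("/")

    def get_dataset_path(self, dataset_id: DatasetID) -> str:
        # Hypothetical layout; the real Wicker layout is not shown in this PR text.
        return f"{self.root_path}/{dataset_id.name}/{dataset_id.version}"


class S3PathFactory(WickerPathFactory):
    """S3 implementation: same structure, rooted at an s3:// location."""

    def __init__(self, s3_root_path: str) -> None:
        if not s3_root_path.startswith("s3://"):
            raise ValueError("expected an s3:// root path")
        super().__init__(s3_root_path)


dataset = DatasetID(name="demo", version="0.0.1")
local_path = WickerPathFactory("/mnt/wicker").get_dataset_path(dataset)
s3_path = S3PathFactory("s3://bucket/wicker/").get_dataset_path(dataset)
```

Under these assumptions, both factories produce the same relative structure and differ only in the root, which is the property the PR is after.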

Testing

Run CI
Run your workflow
Verify outputs

Compatibility

This PR is entirely backward compatible because it does not change any access patterns. The function signatures, names, and outputs map 1-1, so user code does not need to change when upgrading.

convexquad · 2024-05-20T17:18:12Z

@isaak-willett great job! Let me get a few minor changes on this PR before we can approve it. Just FYI, this will also be necessary to support the new AWS S3 FUSE driver.

convexquad · 2024-05-21T21:31:42Z

wicker/core/storage.py

+        """
+        s3_config = get_config().aws_s3_config
+        store_concatenated_bytes_files_in_dataset = s3_config.store_concatenated_bytes_files_in_dataset
+        if s3_root_path is None:


nit for: s3_root_path = s3_root_path or s3_config.s3_datasets_path
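
A small sketch comparing the two forms, assuming the `if s3_root_path is None:` branch in the diff assigns the config default (the branch body is not shown above). FakeS3Config is a hypothetical stand-in for `get_config().aws_s3_config`, and the bucket path is made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeS3Config:
    # Hypothetical stand-in for get_config().aws_s3_config.
    s3_datasets_path: str = "s3://default-bucket/datasets"


def resolve_explicit(s3_root_path: Optional[str], s3_config: FakeS3Config) -> str:
    # Form in the diff: fall back only when the argument is None.
    if s3_root_path is None:
        s3_root_path = s3_config.s3_datasets_path
    return s3_root_path


def resolve_with_or(s3_root_path: Optional[str], s3_config: FakeS3Config) -> str:
    # Form suggested in the nit: falls back on any falsy value, including "".
    return s3_root_path or s3_config.s3_datasets_path
```

The two forms agree for None and for any non-empty string. They differ only for an empty string, which the `or` form replaces with the default.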

convexquad

👍 Thanks, Isaak! I added one small nit, but it is ok to skip it.

marccarre

Thanks a lot for the PR @isaak-willett! 🙇🏻‍♂️
From my side:

Just a few nits on the docstrings, that you can hopefully just quickly apply. 😌
Would test_datasets::TestS3Dataset cover enough to catch potential regressions caused by this refactoring?

aalavian

Thanks @isaak-willett, code LGTM in general other than a few minor comments. Mostly requesting changes for the test artifacts (please feel free to share on doc)

aalavian · 2024-05-22T06:33:34Z

wicker/core/storage.py


    def __eq__(self, other: Any) -> bool:
        return super().__eq__(other) and type(self) == type(other) and self.root_path == other.root_path

-    def get_dataset_assets_path(self, dataset_id: DatasetID, s3_prefix: bool = True) -> str:
+    def _get_dataset_assets_path(self, dataset_id: DatasetID, prefix: Optional[str] = None) -> str:


The name prefix here suggests that it will be added rather than removed. Should we rename it to something like prefix_to_drop? A similar comment applies elsewhere.

How about I do it in a later PR? The naming is not good on the s3_prefix param either and I want to rename them all at once in a different PR.

Since we have just introduced the prefix name in this PR, and it is minimal effort to change that also in this PR, I suggest to tackle it here and not introduce a new debt.

The s3_prefix is external to this repo and I agree that we might need to do more work to change it elsewhere and it is better to be a separate PR.

@isaak-willett I also got a little bit confused, maybe we should call it prefix_remove or prefix_trim.

Changed to prefix_to_trim
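
To illustrate why the rename helps, here is a hedged sketch of the semantics the thread implies: the argument names a prefix that is removed from the returned path. The trim_prefix helper and the example path are hypothetical and are not taken from wicker/core/storage.py.

```python
from typing import Optional


def trim_prefix(path: str, prefix_to_trim: Optional[str] = None) -> str:
    # Hypothetical helper: drops prefix_to_trim from the front of path if present.
    if prefix_to_trim and path.startswith(prefix_to_trim):
        return path[len(prefix_to_trim):]
    return path


full_path = "s3://bucket/wicker/demo/assets"
trimmed = trim_prefix(full_path, prefix_to_trim="s3://")
untouched = trim_prefix(full_path)
```

With the name prefix_to_trim, a call site reads as removing something, which matches the behavior sketched here.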

aalavian · 2024-05-22T06:38:47Z

wicker/core/storage.py

+
+        Args:.
+            root_path (str): File system loc of the root of the wicker file structure.
+            store_concatenated_bytes_files_in_dataset (bool, optional): Whether to assume concat bytes files are stored.


hmm, do we mean "Whether to assume concatenated bytes files are stored in dataset folder"?

Co-authored-by: Marc Carré <[email protected]>

aalavian

Thanks @isaak-willett , Let's address the comment on prefix naming for the newly introduced function _get_dataset_assets_path before merging.

abstract the path factory

227ee07

pickles-bread-and-butter requested review from aalavian, anantsimran and chrisochoatri as code owners May 7, 2024 23:59

Isaak Willett added 11 commits May 7, 2024 17:01

remove print

90c256f

merge

323ee13

update doc strings

102405d

more doc strings

c17b229

lint

fe5034d

lints

c944461

fix typing

9646656

fix

923352e

add more docs

0cb935b

update path for linting

91ae474

fix ci

721e559

pickles-bread-and-butter mentioned this pull request May 20, 2024

Abstracts DataStorage and Makes Local DataStorage #56

Merged

convexquad reviewed May 20, 2024

convexquad reviewed May 20, 2024

convexquad reviewed May 20, 2024

convexquad reviewed May 20, 2024

Isaak Willett added 3 commits May 20, 2024 14:50

changes

d1b0203

fix

eb0f404

lint

a8627f4

pickles-bread-and-butter requested a review from convexquad May 20, 2024 22:10

convexquad reviewed May 21, 2024

convexquad approved these changes May 21, 2024

change

7b18832

marccarre approved these changes May 22, 2024

aalavian requested changes May 22, 2024

pickles-bread-and-butter and others added 12 commits May 22, 2024 07:57

Update wicker/core/storage.py

4ad0367

Co-authored-by: Marc Carré <[email protected]>

Update wicker/core/storage.py

1c107db

Co-authored-by: Marc Carré <[email protected]>

Update wicker/core/storage.py

6e881f2

Co-authored-by: Marc Carré <[email protected]>

Update wicker/core/storage.py

93de8ce

Co-authored-by: Marc Carré <[email protected]>

Update wicker/core/storage.py

46775a4

Co-authored-by: Marc Carré <[email protected]>

Update wicker/core/storage.py

e3bffaf

Co-authored-by: Marc Carré <[email protected]>

Update wicker/core/storage.py

07286be

Co-authored-by: Marc Carré <[email protected]>

Update wicker/core/storage.py

244fd03

Co-authored-by: Marc Carré <[email protected]>

Update wicker/core/storage.py

36b59b6

Co-authored-by: Marc Carré <[email protected]>

Update wicker/core/storage.py

adb019f

Co-authored-by: Marc Carré <[email protected]>

doc string

2f06043

Merge branch 'main' into feature/isaak/local_fs

b4b84e6

pickles-bread-and-butter requested a review from aalavian May 22, 2024 19:51

aalavian approved these changes May 22, 2024

marccarre approved these changes May 22, 2024

Isaak Willett added 2 commits May 22, 2024 16:20

rename

3039f52

update docstring

e7edaeb

pickles-bread-and-butter merged commit b4d2343 into main May 22, 2024
2 checks passed

pickles-bread-and-butter deleted the feature/isaak/local_fs branch May 22, 2024 23:26
