Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Abstractify the Path Factory #55

Merged
merged 30 commits into from
May 22, 2024
Merged

Conversation

pickles-bread-and-butter
Copy link
Collaborator

@pickles-bread-and-butter pickles-bread-and-butter commented May 7, 2024

Overview

Abstractifies the file pathing class in Wicker. Wicker has one common pathing structure and it doesn't constrain itself to S3. This abstraction allows for adding different file pathers on top of this if needed while keeping the consistent structure.

Motivation

This set of PRs, #56, #57 aim to fix two problems/incorrect assumptions. Below is the context on the problems bulleted:

  1. Wicker right now limits itself very heavily to S3 usage. It only implements S3 and more or less lacks the infrastructure/correct abstractions to add more data sources. This is evidenced by classes like S3Dataset and S3PathFactory which can be seen as implementations of a base class but lack the base class. These are the only writing and access methods and there is no common class for interfacing.
  2. There is no common core, there is no common API that users can infer from the different classes. It's very easy with each implementation to diverge from the common access pattern. This should be rectified and a common core added.
  3. Wicker has a very tight file structure, it shouldn't deviate from data platform to data platform. What should deviate is the location in which that file structure is stored and the way wicker accesses it but not the actual path structure.
  4. There is no way to have a local/mounted file structure. Currently Wicker doesn't even support having a file system locally with a dataset. You can't create a dataset locally and test it, that's a huge problem for functional testing that we have to rely on mocking out S3 just to test locally.

What this implements

  1. This PR is very targeted to abstracting away the S3PathFactory into an implementation class for S3 and a base class WickerPathFactory. This new path defines the path structure generically while the S3PathFactory only appends on the relevant S3 pieces. This gives us a common class to build off where we can keep the access pattern for paths identical across both for easy swapping of user code.

Testing

  • Run CI
  • Run your workflow
  • Verify outputs

Compatibility

This PR is entirely backward compatible because it does not change any access patterns. The function signatures, names, and outputs are 1-1 so there is no requirement to change anything for user code if the upgraded is undertaken.

@convexquad
Copy link
Collaborator

@isaak-willett great job! Let me get a few minor changes on this PR before we can approve it. Just FYI, this will also be necessary to support the new AWS S3 FUSE driver.

Isaak Willett added 3 commits May 20, 2024 14:50
"""
s3_config = get_config().aws_s3_config
store_concatenated_bytes_files_in_dataset = s3_config.store_concatenated_bytes_files_in_dataset
if s3_root_path is None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit for: s3_root_path = s3_root_path or s3_config.s3_datasets_path

Copy link
Collaborator

@convexquad convexquad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Thanks, Isaak! I added one small nit, but it is ok to skip it.

Copy link
Collaborator

@marccarre marccarre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for the PR @isaak-willett! 🙇🏻‍♂️
From my side:

  • Just a few nits on the docstrings, that you can hopefully just quickly apply. 😌
  • Would test_datasets::TestS3Dataset cover enough to catch potential regressions caused by this refactoring?

Copy link
Collaborator

@aalavian aalavian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @isaak-willett , code LGTM in general other than few minor comments. Mostly requesting change for the test artifacts (please feel free to share on doc)


def __eq__(self, other: Any) -> bool:
return super().__eq__(other) and type(self) == type(other) and self.root_path == other.root_path

def get_dataset_assets_path(self, dataset_id: DatasetID, s3_prefix: bool = True) -> str:
def _get_dataset_assets_path(self, dataset_id: DatasetID, prefix: Optional[str] = None) -> str:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The naming prefix here suggests that it is about to be added rather remove, should we rename to something like prefix_to_drop? Similar comment applies elsewhere

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about I do it in a later PR? The naming is not good on the s3_prefix param either and I want to rename them all at once in a different PR.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we have just introduced the prefix name in this PR, and it is minimal effort to change that also in this PR, I suggest to tackle it here and not introduce a new debt.

The s3_prefix is external to this repo and I agree that we might need to do more work to change it elsewhere and it is better to be a separate PR.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@isaak-willett I also got a little bit confused, maybe we should call it prefix_remove or prefix_trim.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to prefix_to_trim


Args:.
root_path (str): File system loc of the root of the wicker file structure.
store_concatenated_bytes_files_in_dataset (bool, optional): Whether to assume concat bytes files are stored.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, do we mean "Whether to assume concatenated bytes files are stored in dataset folder"?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Collaborator

@aalavian aalavian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @isaak-willett , Let's address the comment on prefix naming for the newly introduced function _get_dataset_assets_path before merging.


def __eq__(self, other: Any) -> bool:
return super().__eq__(other) and type(self) == type(other) and self.root_path == other.root_path

def get_dataset_assets_path(self, dataset_id: DatasetID, s3_prefix: bool = True) -> str:
def _get_dataset_assets_path(self, dataset_id: DatasetID, prefix: Optional[str] = None) -> str:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we have just introduced the prefix name in this PR, and it is minimal effort to change that also in this PR, I suggest to tackle it here and not introduce a new debt.

The s3_prefix is external to this repo and I agree that we might need to do more work to change it elsewhere and it is better to be a separate PR.

Isaak Willett added 2 commits May 22, 2024 16:20
@pickles-bread-and-butter pickles-bread-and-butter merged commit b4d2343 into main May 22, 2024
2 checks passed
@pickles-bread-and-butter pickles-bread-and-butter deleted the feature/isaak/local_fs branch May 22, 2024 23:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants