Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Text and PDFs uploads #819

Merged
merged 2 commits into from
Dec 11, 2024
Merged

Text and PDFs uploads #819

merged 2 commits into from
Dec 11, 2024

Conversation

alexey-cord-tech
Copy link
Contributor

@alexey-cord-tech alexey-cord-tech commented Dec 10, 2024

Introduction and Explanation

Adding ability to upload local text and PDF files, and also to specify "cloud uploads" of the same with the structured types.

Documentation

Docstrings in place, "human docs" TBD

Tests

Coming with the BE PR here: https://github.com/encord-team/cord-backend/pull/4429

Known issues

We should consolidate the local upload logic in a "strategy" of sorts, too much C&P there now.

Copy link

github-actions bot commented Dec 10, 2024

Unit test report ((Pydantic 2.x)

189 tests   189 ✅  6s ⏱️
  1 suites    0 💤
  1 files      0 ❌

Results for commit 0eb3673.

♻️ This comment has been updated with latest results.

Copy link

github-actions bot commented Dec 10, 2024

Unit test report (Pydantic 1.x)

189 tests   189 ✅  5s ⏱️
  1 suites    0 💤
  1 files      0 ❌

Results for commit 0eb3673.

♻️ This comment has been updated with latest results.

Comment on lines +1242 to +1247
text_mime = mimetypes.guess_type(str(file_path))[0]
if text_mime and (
text_mime.startswith("text/")
or text_mime.startswith("application/json")
or text_mime.startswith("application/xml")
):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not one for this PR, but in general:

Given the increasing number of modalities (and hence mime types), Would it be worth creating a MIME_TYPE String enum? With all the different mime types we sort of expect?

  • This would reduce the possibility of silly typo's
  • Serves as nice documentation of the mimetypes we support.

Although I also appreciate there might be sooooo many mimetypes, that it might become unwieldy to maintain this enum.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking of something like this:

class EncordMimeType(StringEnum):
    VIDEO_MP4 = auto()
    AUDIO_MP3 = auto()
    TEXT_JSON = auto()
    TEXT_HTML = auto()
    APPLICATION_JSON = auto()

    @classmethod
    def is_plain_text(cls, mime_type: str) -> bool:
        return (mime_type.startswith(cls.TEXT_HTML)
                or mime_type.startswith(cls.TEXT_JSON))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potentially. We don't necessarily need one client side here given that this code is relatively static.
Also your plain text method doesn't handle the newly introduced cases of supporting all textual files.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to think about how we handle this (current way it not great), but in actual reality there are way too many text MIMEs for that (e.g. basically every programming language is text/x-{langname})

encord/storage.py Outdated Show resolved Hide resolved
Copy link
Contributor

@clinton-encord clinton-encord left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some small comments/questions, but LGTM!

Copy link
Contributor

@Jim-Encord Jim-Encord left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

People expressed an interest in providing the mime_type (Not saying that I love it). Are we deciding against that?

title: Optional[str] = None,
client_metadata: Optional[Dict[str, Any]] = None,
cloud_upload_settings: CloudUploadSettings = CloudUploadSettings(),
) -> UUID: # TODO this should return an item?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should A) follow convention
B) Outside of convention, returning a UUID is reasonable given that the method is long-polling to a degree.

Comment on lines +1242 to +1247
text_mime = mimetypes.guess_type(str(file_path))[0]
if text_mime and (
text_mime.startswith("text/")
or text_mime.startswith("application/json")
or text_mime.startswith("application/xml")
):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potentially. We don't necessarily need one client side here given that this code is relatively static.
Also your plain text method doesn't handle the newly introduced cases of supporting all textual files.

@alexey-cord-tech alexey-cord-tech merged commit 3d3b419 into master Dec 11, 2024
5 of 6 checks passed
@alexey-cord-tech alexey-cord-tech deleted the ak/text-modern-uploads branch December 11, 2024 11:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants