-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Text and PDFs uploads #819
Conversation
Unit test report ((Pydantic 2.x)189 tests 189 ✅ 6s ⏱️ Results for commit 0eb3673. ♻️ This comment has been updated with latest results. |
Unit test report (Pydantic 1.x)189 tests 189 ✅ 5s ⏱️ Results for commit 0eb3673. ♻️ This comment has been updated with latest results. |
text_mime = mimetypes.guess_type(str(file_path))[0] | ||
if text_mime and ( | ||
text_mime.startswith("text/") | ||
or text_mime.startswith("application/json") | ||
or text_mime.startswith("application/xml") | ||
): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not one for this PR, but in general:
Given the increasing number of modalities (and hence mime types), Would it be worth creating a MIME_TYPE
String enum? With all the different mime types we sort of expect?
- This would reduce the possibility of silly typo's
- Serves as nice documentation of the mimetypes we support.
Although I also appreciate there might be sooooo many mimetypes, that it might become unwieldy to maintain this enum.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking of something like this:
class EncordMimeType(StringEnum):
VIDEO_MP4 = auto()
AUDIO_MP3 = auto()
TEXT_JSON = auto()
TEXT_HTML = auto()
APPLICATION_JSON = auto()
@classmethod
def is_plain_text(cls, mime_type: str) -> bool:
return (mime_type.startswith(cls.TEXT_HTML)
or mime_type.startswith(cls.TEXT_JSON))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potentially. We don't necessarily need one client side here given that this code is relatively static.
Also your plain text method doesn't handle the newly introduced cases of supporting all textual files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to think about how we handle this (current way it not great), but in actual reality there are way too many text MIMEs for that (e.g. basically every programming language is text/x-{langname}
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some small comments/questions, but LGTM!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
People expressed an interest in providing the mime_type (Not saying that I love it). Are we deciding against that?
title: Optional[str] = None, | ||
client_metadata: Optional[Dict[str, Any]] = None, | ||
cloud_upload_settings: CloudUploadSettings = CloudUploadSettings(), | ||
) -> UUID: # TODO this should return an item? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should A) follow convention
B) Outside of convention, returning a UUID is reasonable given that the method is long-polling to a degree.
text_mime = mimetypes.guess_type(str(file_path))[0] | ||
if text_mime and ( | ||
text_mime.startswith("text/") | ||
or text_mime.startswith("application/json") | ||
or text_mime.startswith("application/xml") | ||
): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potentially. We don't necessarily need one client side here given that this code is relatively static.
Also your plain text method doesn't handle the newly introduced cases of supporting all textual files.
Introduction and Explanation
Adding ability to upload local text and PDF files, and also to specify "cloud uploads" of the same with the structured types.
Documentation
Docstrings in place, "human docs" TBD
Tests
Coming with the BE PR here: https://github.com/encord-team/cord-backend/pull/4429
Known issues
We should consolidate the local upload logic in a "strategy" of sorts, too much C&P there now.