-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Added Document.from_gcs_multi()
method to get multiple wrapped documents from a GCS directory.
#223
Conversation
… documents from a GCS directory. Fixes #214 - Note: `from_gcs()` takes in a GCS directory, but it only works for a single sharded document from a single input document source. - In a GA release, it would be a better practice to have `from_gcs()` take in any GCS directory and output a list of Wrapped Documents. But this would be a backwards-incompatible change now. - Not sure if it's possible/advisable to have two possible return types for `from_gcs()` and just have it return a list when there are multiple Wrapped Documents?
- Input data doesn't match what the actual `storage.list_blobs()` method outputs. - Updated output in `print_gcs_document_tree()` to match the test cases, as this was the intended structure.
List[Document]: | ||
A List of documents from gcs. | ||
""" | ||
return [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is assuming that every sub-directory in the parent (bucket+prefix) is acceptable by Document.from_gcs
, which I think has some validation checks.
What is the desired behavior here if some of those sub-directories causes Document.from_gcs
to fail? Should the whole list_from_gcs
call fail, or should it return just whatever is successfully loaded (maybe along with some metadata explaining the failed items)?
(Let's include unit tests for the case where one of the sub-directories contains wrong files.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
list_document_tree()
only outputs directories that contain JSON files. (However, it doesn't directly validate if they are Document
JSON format.)
.from_gcs()
will throw a ValueError
if a file is invalid. Or the documentai.Document.from_json()
method will throw an exception.
It would probably make sense for this batch case to return whatever was successfully loaded and just skip files that throw exceptions. (And printing out or returning the failed files.)
Document.list_from_gcs()
method to get multiple wrapped documents from a GCS directory.Document.from_gcs_multi()
method to get multiple wrapped documents from a GCS directory.
Closing out this PR. I don't think it makes sense to add this method because only the end user will know how exactly the GCS directories are set up. For example, if the documents in each directory are shards of the same document or should be treated as separate documents. Unless the documents are being imported from the Batch Process output, in that case, the user can use |
Fixes #214 🦕
from_gcs()
takes in a GCS directory, but it only works for a single sharded document from a single input document source.from_gcs()
take in any GCS directory and output a list of Wrapped Documents. But this would be a backwards-incompatible change now.from_gcs()
and just have it return a list when there are multiple Wrapped Documents?