Abstract download file extraction and conversion #1685
Merged
Related to https://github.com/elastic/enterprise-search-team/issues/5857
This PR moves the logic for downloading files to temporary files, and for extracting their content or converting it to base64, into BaseDataSource.
The new tool is very generic, so it can be easily added to other connectors in future PRs.
If this is approved, I will also generalise/abstract the file size and file extension check.
This PR uses two connectors as examples for how this tool would be used:
This is a "standard" download and extraction case. The download func uses response.content.iter_chunked, which is the same pattern that most connectors use. This download func is now abstracted onto the BaseDataSource class as generic_chunked_download_func.
ABS has an irregular download method (it uses a library and has its own chunking method). As a result, its download func can't be abstracted. Instead, the download func is defined in the ABS connector and is passed as a partial arg.
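The two cases above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: BaseDataSourceSketch, download_and_encode, fake_chunks, StandardSource and AbsLikeSource are hypothetical names, and the signature given to generic_chunked_download_func here is an assumption. The point is only that the shared helper accepts any zero-argument download func, so a connector can pass either the generic chunked one or its own via functools.partial.

```python
import asyncio
import base64
import os
import tempfile
from functools import partial


async def fake_chunks(data, chunk_size):
    # Hypothetical stand-in for a network response that yields bytes in chunks.
    for i in range(0, len(data), chunk_size):
        yield data[i : i + chunk_size]


class BaseDataSourceSketch:
    """Hypothetical stand-in for BaseDataSource; the real class differs."""

    async def generic_chunked_download_func(self, chunk_source):
        # Assumed shape: re-yield chunks from any callable returning an async iterator,
        # mirroring the response.content.iter_chunked pattern.
        async for chunk in chunk_source():
            yield chunk

    async def download_and_encode(self, download_func):
        # download_func: zero-argument callable returning an async iterator of bytes.
        with tempfile.NamedTemporaryFile(mode="wb", delete=False) as tmp:
            path = tmp.name
            async for chunk in download_func():
                tmp.write(chunk)
        try:
            # Read the temp file back and convert its content to base64.
            with open(path, "rb") as f:
                return base64.b64encode(f.read()).decode("utf-8")
        finally:
            os.remove(path)


class StandardSource(BaseDataSourceSketch):
    # "Standard" case: reuse the generic chunked download func from the base class.
    async def get_content(self, data):
        source = partial(fake_chunks, data, 4)
        return await self.download_and_encode(
            partial(self.generic_chunked_download_func, source)
        )


class AbsLikeSource(BaseDataSourceSketch):
    # Irregular case: the connector defines its own download func.
    async def _blob_download(self, blob_name):
        async for chunk in fake_chunks(f"blob:{blob_name}".encode(), 3):
            yield chunk

    async def get_content(self, blob_name):
        # The connector-specific func is passed in as a partial.
        return await self.download_and_encode(partial(self._blob_download, blob_name))


if __name__ == "__main__":
    print(asyncio.run(StandardSource().get_content(b"hello world")))
    print(asyncio.run(AbsLikeSource().get_content("a.txt")))
```

Passing the download func as a callable keeps the temp-file and base64 handling in one place while leaving each connector free to decide how bytes are fetched.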