Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Abstract download file extraction and conversion #1685

Merged

Conversation

navarone-feekery
Copy link
Contributor

@navarone-feekery navarone-feekery commented Sep 26, 2023

Related to https://github.com/elastic/enterprise-search-team/issues/5857

This PR moves the logic for downloading files to temp files, and their extraction/conversion to b64 to BaseDataSource.
The new tool is very generic, so it can be easily added to other connectors in future PRs.
If this is approved, I will also generalise/abstract the file size and file extension check.

This PR uses two connectors as examples for how this tool would be used:

  1. Confluence

This is a "standard" download and extraction case. The download func uses response.content.iter_chunked, which is the same pattern that most connectors use. This download func is now abstracted onto the BaseDataSource class as generic_chunked_download_func.

  1. Azure Blob Storage

ABS has an irregular download method (it uses a library and has its own chunking method). As a result, its download func can't be abstracted. Instead the download func is defined in the ABS connector and is passed as a partial arg.

Checklists

Pre-Review Checklist

  • this PR has a meaningful title
  • this PR links to all relevant github issues that it fixes or partially addresses
  • if there is no GH issue, please create it. Each PR should have a link to an issue
  • this PR has a thorough description
  • Covered the changes with automated tests
  • Tested the changes locally
  • Added a label for each target release version (example: v7.13.2, v7.14.0, v8.0.0)
  • Considered corresponding documentation changes
  • Contributed any configuration settings changes to the configuration reference

@navarone-feekery navarone-feekery marked this pull request as ready for review September 26, 2023 16:00
@navarone-feekery navarone-feekery requested a review from a team September 26, 2023 16:00
Copy link
Member

@artem-shelkovnikov artem-shelkovnikov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good stuff! Will you follow-up with using it in other sources, or we create an issue and address later?

@navarone-feekery
Copy link
Contributor Author

@artem-shelkovnikov Thanks! I'm intending to use this in all sources with file downloads, I'll be making some batch PRs for that now :)

@navarone-feekery navarone-feekery enabled auto-merge (squash) September 27, 2023 09:48
@navarone-feekery navarone-feekery merged commit dc57d9e into main Sep 27, 2023
@navarone-feekery navarone-feekery deleted the navarone/5857-generalise-extraction-service-calls branch September 27, 2023 09:59
@github-actions
Copy link

💔 Failed to create backport PR(s)

The backport operation could not be completed due to the following error:
There are no branches to backport to. Aborting.

The backport PRs will be merged automatically after passing CI.

To backport manually run:
backport --pr 1685 --autoMerge --autoMergeMethod squash

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants