Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expand extraction service to more connectors #2 #1693

Merged
merged 12 commits into from
Sep 28, 2023

Conversation

navarone-feekery
Copy link
Contributor

Related to https://github.com/elastic/enterprise-search-team/issues/5857

  • Expand extraction service to more connectors:
    • Jira
    • Dropbox
    • S3
  • Create utils func for retrieving file extension
  • Abstract file size and file extension pre-download check
  • Fix some tests

Extraction service will be expanded to more connectors in further PRs

Checklists

Pre-Review Checklist

  • this PR has a meaningful title
  • this PR links to all relevant github issues that it fixes or partially addresses
  • if there is no GH issue, please create it. Each PR should have a link to an issue
  • this PR has a thorough description
  • Covered the changes with automated tests
  • Tested the changes locally
  • Added a label for each target release version (example: v7.13.2, v7.14.0, v8.0.0)
  • Considered corresponding documentation changes
  • Contributed any configuration settings changes to the configuration reference

Related Pull Requests

@navarone-feekery navarone-feekery requested a review from a team September 27, 2023 12:53
@navarone-feekery navarone-feekery changed the title Expand extraction service (1) Expand extraction service to more connectors #1 Sep 27, 2023
@navarone-feekery navarone-feekery changed the title Expand extraction service to more connectors #1 Expand extraction service to more connectors #2 Sep 27, 2023
Comment on lines +692 to +693
def get_file_extension(self, filename):
return get_file_extension(filename)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks redundant to me, why not use utils method directly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't want to call the import on every source. AFAIU I would need to import it multiple times. This way it's imported once on BaseDataSource and I can just call back up to get to the utils func.

I agree it looks redundant. Do you think it's better coding standards to just import and call on each source directly?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm on the fence here, I think it's okay to keep it and then inline if not needed.

I just feel that a method that pipes arguments to another method without any modifications is a code-smell, but IMO not a big one and okay to keep.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think an alternative is to not have the file extension tool in utils. It's only used in BaseDataSource anyway. In my next PR I'll just move the code to this function so it's not piping anymore.

def get_file_extension(self, filename):
return get_file_extension(filename)

def can_file_be_downloaded(self, file_extension, filename, file_size):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be better to pass the file name (or path?) and calculate extension from it?

I see that now filename is used only for debug lines, so method signature feels a bit weird from this perspective.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The file extension is needed for other operations (specifically creating a tempfile with the correct suffix), so I extract earlier and pass it on to this function. Also, some connectors get their file extensions from other sources. For example Salesforce responses have a separate field just for the file extension, and it isn't included in the file name.

I see that now filename is used only for debug lines, so method signature feels a bit weird from this perspective.

Is there something else we can do to fix this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it help aesthetically if filename was the last arg?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could be that the method returns an object instead of just True/False, e.g.:

if file_extension == "":
    return FileDownloadValidationResult(False, SomeEnum.NoExtension) 

Names of files and choice of enum/string with reason is up to discussion here, but not relying on self._logger here is a nice bonus IMO.

However, it might be a change that can be done only in future when almost the whole download function is in the base source, so my comment can be ignored for now.

)
return False

if file_size > FILE_SIZE_LIMIT and not self.configuration.get(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In some cases (Sharepoint REST API) we don't know file sizes, it would be nice to have default handling for it (e.g. not passing file_size always downloads it or something similar)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually have a draft PR that will allow that! A connector will be able to call the file type or the file size check individually if needed.
#1694

It's dependent on this PR though so I haven't opened it for review yet.

connectors/sources/azure_blob_storage.py Outdated Show resolved Hide resolved
@navarone-feekery navarone-feekery enabled auto-merge (squash) September 28, 2023 10:29
@navarone-feekery navarone-feekery merged commit 8cd842c into main Sep 28, 2023
@navarone-feekery navarone-feekery deleted the navarone/5857-expand-extraction-service-2 branch September 28, 2023 10:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants