Expand extraction service to more connectors #3 #1694

Merged
merged 13 commits into from
Sep 29, 2023

Conversation

navarone-feekery
Contributor

@navarone-feekery navarone-feekery commented Sep 27, 2023

Related to https://github.com/elastic/enterprise-search-team/issues/5857

  • Expand extraction service to more connectors:
    • Google Cloud Storage
    • Google Drive
    • OneDrive
    • Salesforce
    • ServiceNow

Checklists

Pre-Review Checklist

  • this PR has a meaningful title
  • this PR links to all relevant GitHub issues that it fixes or partially addresses
  • if there is no GH issue, please create it. Each PR should have a link to an issue
  • this PR has a thorough description
  • Covered the changes with automated tests
  • Tested the changes locally
  • Added a label for each target release version (example: v7.13.2, v7.14.0, v8.0.0)
  • Considered corresponding documentation changes
  • Contributed any configuration settings changes to the configuration reference

Related Pull Requests

@navarone-feekery navarone-feekery changed the base branch from main to navarone/5857-expand-extraction-service-2 September 27, 2023 15:12
Base automatically changed from navarone/5857-expand-extraction-service-2 to main September 28, 2023 10:40
@navarone-feekery navarone-feekery force-pushed the navarone/5857-expand-extraction-service-3 branch from 9cca6a3 to 439efa2 Compare September 28, 2023 11:24
@navarone-feekery navarone-feekery marked this pull request as ready for review September 28, 2023 11:24
@navarone-feekery navarone-feekery requested a review from a team September 28, 2023 11:24
connectors/sources/google_cloud_storage.py
Comment on lines +392 to +393
# gcs has a unique download method so we can't utilize
# the generic download_and_extract_file func
Member

Useful comment 👍

Is there anything we could generify, so that we don't have usages of create_temp_file in multiple places? Like would it help if the base function could take a proc as an optional arg or something? Not for this PR, but something we can think about.
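One way the "optional proc" idea could look, as a minimal sketch only: a single helper owns the temp file, and a connector with a non-standard download passes a callable that writes straight to the path. The names `download_to_temp_file`, `chunks`, and `write_to_file` are hypothetical and are not taken from `connectors/source.py`; reading the file back stands in for handing it to the extraction service.

```python
import asyncio
import os
import tempfile
from typing import AsyncIterator, Awaitable, Callable, Optional


async def download_to_temp_file(
    chunks: Optional[AsyncIterator[bytes]] = None,
    write_to_file: Optional[Callable[[str], Awaitable[None]]] = None,
) -> bytes:
    """Hypothetical helper: the only place that creates and deletes the temp file."""
    if (chunks is None) == (write_to_file is None):
        raise ValueError("pass exactly one of chunks or write_to_file")

    fd, path = tempfile.mkstemp()
    os.close(fd)  # reopen by path below so both modes share the same code
    try:
        if write_to_file is not None:
            # Non-standard download (e.g. a client that pipes directly to a file).
            await write_to_file(path)
        else:
            # Standard download: stream chunks into the file.
            with open(path, "wb") as f:
                async for chunk in chunks:
                    f.write(chunk)
        with open(path, "rb") as f:
            # Stand-in for sending the file to the extraction service.
            return f.read()
    finally:
        # Cleanup lives here, so individual connectors cannot forget it.
        os.remove(path)
```

With this shape, a connector never calls the temp-file creation code itself; it only supplies either an async chunk iterator or a writer callable.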

Contributor Author

What is the issue with calling create_temp_file here?
I think we can look into generifying some of the non-standard downloads. Mostly the issue is they pipe directly to a file, but my generic download func doesn't support that. I felt strapped for time to make two different versions of it so for now I've not generified downloads that pipe directly to files.

Member

My concern with calling create_temp_file is that every time it's called, there's a chance that the author does it in such a way that the temp file won't be cleaned up. We've had this issue before, where like 8/10 connectors were cleaning up their tempfiles appropriately, but due to copy-paste errors, occasionally some wouldn't. These types of bugs can be hard to catch, and it's easier to keep them from propagating if you just don't have numerous usages of the risky code.

But again, not necessary to solve right now.

Contributor Author

@seanstory I was also concerned about that. It may alleviate your concerns but I think I have this covered with the way tempfiles are being created now. If a connector uses create_temp_file it will clean itself up after everything is done, including deleting the file and outputting an error if the file deletion failed.

The code in question: https://github.com/elastic/connectors-python/blob/b9fe37744bd9724b3b4b82104f0c124d70bf3b02/connectors/source.py#L771-L783

Of course we should properly check to see if this is actually the case.
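For readers without the linked code at hand, the behavior described above (delete the file once the caller is done, and report an error if deletion fails) can be sketched roughly as follows. This is an illustration under assumptions, not the actual `create_temp_file` implementation: the name `self_cleaning_temp_file` and the logger setup are hypothetical.

```python
import logging
import os
import tempfile
from contextlib import contextmanager

logger = logging.getLogger(__name__)


@contextmanager
def self_cleaning_temp_file(suffix=""):
    """Hypothetical sketch: yield a temp file path and always try to delete it."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    os.close(fd)
    try:
        yield path
    finally:
        # Runs on normal exit and when the body raises.
        try:
            os.remove(path)
        except OSError as e:
            # Report the failed deletion instead of masking the original error.
            logger.error("Could not remove temp file %s: %s", path, e)
```

A quick check of the claim is to enter the context manager, raise inside the body, and confirm afterwards that the path no longer exists.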

@navarone-feekery navarone-feekery merged commit 32f236a into main Sep 29, 2023
@navarone-feekery navarone-feekery deleted the navarone/5857-expand-extraction-service-3 branch September 29, 2023 07:54
2 participants