-
Notifications
You must be signed in to change notification settings - Fork 143
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Expand extraction service to more connectors #2 #1693
Expand extraction service to more connectors #2 #1693
Conversation
def get_file_extension(self, filename): | ||
return get_file_extension(filename) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks redundant to me, why not use utils method directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't want to call the import on every source. AFAIU I would need to import it multiple times. This way it's imported once on BaseDataSource
and I can just call back up to get to the utils func.
I agree it looks redundant. Do you think it's better coding standards to just import and call on each source directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm on the fence here, I think it's okay to keep it and then inline if not needed.
I just feel that a method that pipes arguments to another method without any modifications is a code-smell, but IMO not a big one and okay to keep.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think an alternative is to not have the file extension tool in utils. It's only used in BaseDataSource anyway. In my next PR I'll just move the code to this function so it's not piping anymore.
def get_file_extension(self, filename): | ||
return get_file_extension(filename) | ||
|
||
def can_file_be_downloaded(self, file_extension, filename, file_size): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be better to pass the file name (or path?) and calculate extension from it?
I see that now filename
is used only for debug lines, so method signature feels a bit weird from this perspective.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The file extension is needed for other operations (specifically creating a tempfile with the correct suffix), so I extract earlier and pass it on to this function. Also, some connectors get their file extensions from other sources. For example Salesforce responses have a separate field just for the file extension, and it isn't included in the file name.
I see that now filename is used only for debug lines, so method signature feels a bit weird from this perspective.
Is there something else we can do to fix this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it help aesthetically if filename
was the last arg?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It could be that the method returns an object instead of just True/False, e.g.:
if file_extension == "":
return FileDownloadValidationResult(False, SomeEnum.NoExtension)
Names of files and choice of enum/string with reason is up to discussion here, but not relying on self._logger
here is a nice bonus IMO.
However, it might be a change that can be done only in future when almost the whole download function is in the base source, so my comment can be ignored for now.
) | ||
return False | ||
|
||
if file_size > FILE_SIZE_LIMIT and not self.configuration.get( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In some cases (Sharepoint REST API) we don't know file sizes, it would be nice to have default handling for it (e.g. not passing file_size always downloads it or something similar)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually have a draft PR that will allow that! A connector will be able to call the file type or the file size check individually if needed.
#1694
It's dependent on this PR though so I haven't opened it for review yet.
Related to https://github.com/elastic/enterprise-search-team/issues/5857
Extraction service will be expanded to more connectors in further PRs
Checklists
Pre-Review Checklist
v7.13.2
,v7.14.0
,v8.0.0
)Related Pull Requests