-
Notifications
You must be signed in to change notification settings - Fork 84
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Download files in fetch.txt #119
base: master
Are you sure you want to change the base?
Conversation
Thanks for diving in to add the fetch functionality @kba. I wonder if it might be a bit more readable to rename |
Also, I haven't worked with fetch.txt files much before. But I'm kind of surprised that the test suite considers a bag valid if it has a fetch.txt containing an item that is not present in the payload directory. See: https://github.com/LibraryOfCongress/bagit-python/blob/master/test.py#L1023 |
Fine with me, I wanted to avoid confusion with
Sure.
These tests break the "Every file listed in the fetch file MUST be listed in every payload manifest" rule and it isn't validated. Since the manifests determines the number and size of files, it could make sense to allow "bag with holes" validation against only the files not mentioned in fetch.txt with a special parameter though, if you don't want to fetch the whole thing. By default,
should not be valid, you're right. |
I guess we could consider validation as a separate issue from this PR though. |
By default only allow HTTP(S) and (S)FTP URL
@@ -296,6 +301,7 @@ def __init__(self, path=None): | |||
self.normalized_manifest_names = {} | |||
|
|||
self.algorithms = [] | |||
self.fetch_url_whitelist = DEFAULT_FETCH_URL_WHITELIST if fetch_url_whitelist is None else fetch_url_whitelist |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be simplified:
self.fetch_url_whitelist = DEFAULT_FETCH_URL_WHITELIST if fetch_url_whitelist is None else fetch_url_whitelist | |
self.fetch_url_whitelist = fetch_url_whitelist or DEFAULT_FETCH_URL_WHITELIST |
for url, expected_size, filename in self.fetch_entries(): | ||
if not fnmatch_any(url, self.fetch_url_whitelist): | ||
raise BagError(_("Malformed URL in fetch.txt: %s, matches none of the whitelisted URL patterns %s") % (url, self.fetch_url_whitelist)) | ||
expected_size = -1 if expected_size == '-' else int(expected_size) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm thinking this would be more idiomatic as:
expected_size = -1 if expected_size == '-' else int(expected_size) | |
expected_size = None if expected_size == '-' else int(expected_size) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also wanted to check whether fetch_entries()
is correctly validating that these values are either “-” or digits.
yield filename | ||
|
||
def fetch(self, force=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering whether we should support force()
or simply have the user handle file operations outside of scope. The main thing which that does is open up the need to think about potential security concerns so perhaps that should just be confirming that we correctly validate file paths not being outside of the bag scope.
else: | ||
content_length = int(headers['content-length']) | ||
if content_length != expected_size: | ||
raise BagError(_("Inconsistent size of %s: Expected %s but Content-Length is %s") % (filename, expected_size, content_length)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was wondering whether this should have the URL as well:
raise BagError(_("Inconsistent size of %s: Expected %s but Content-Length is %s") % (filename, expected_size, content_length)) | |
raise BagError(_("Inconsistent size for %s: expected %s bytes but %s had a Content-Length of %s") % (filename, expected_size, url, content_length)) |
resp = opener.open(req) | ||
headers = resp.info() | ||
if expected_size >= 0: | ||
if "content-length" not in headers: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would only apply to HTTP requests but the intention of this code is to support other protocols, which suggests that this would need to be a conditional warning since I don't believe the stdlib sets a header like that for other protocols.
parsed_url = urlparse(url) | ||
|
||
if not all((parsed_url.scheme, parsed_url.netloc)): | ||
raise BagError(_("Malformed URL in fetch.txt: %s") % url) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should still have this validation step since it reports some malformed URLs earlier. This would also allow us to remove both of the fnmatch
checks by changing the whitelist to be a list of URL schemes and simply adding a second if parsed_url.scheme not in self.fetch_url_schema_whitelist: …
check here.
Adds a new method
Bag.fetch_files_to_be_fetched()
that fetches files listed infetch.txt
, c.f. #118.If this is useful for someone, can be further refined (CLI, overrideable fetch implementation, anti-hammering interval).