Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Download files in fetch.txt #119

Open
wants to merge 7 commits into
base: master
Choose a base branch
from
Open

Conversation

kba
Copy link
Contributor

@kba kba commented Nov 27, 2018

Adds a new method Bag.fetch_files_to_be_fetched() that fetches files listed in fetch.txt, c.f. #118.

If this is useful for someone, can be further refined (CLI, overrideable fetch implementation, anti-hammering interval).

@coveralls
Copy link

Coverage Status

Coverage decreased (-1.7%) to 81.842% when pulling 1de38c4 on kba:fetch-files into c39b650 on LibraryOfCongress:master.

@coveralls
Copy link

coveralls commented Nov 27, 2018

Coverage Status

Coverage increased (+0.09%) to 83.595% when pulling 6fda25c on kba:fetch-files into 8a8263e on LibraryOfCongress:master.

.gitignore Outdated Show resolved Hide resolved
bagit.py Outdated Show resolved Hide resolved
bagit.py Outdated Show resolved Hide resolved
test.py Outdated Show resolved Hide resolved
@edsu
Copy link
Contributor

edsu commented Dec 6, 2018

Thanks for diving in to add the fetch functionality @kba. I wonder if it might be a bit more readable to rename fetch_files_to_be_fetched() to fetch() and have it take an optional force parameter that would re-download things that are already present?

@edsu
Copy link
Contributor

edsu commented Dec 6, 2018

Also, I haven't worked with fetch.txt files much before. But I'm kind of surprised that the test suite considers a bag valid if it has a fetch.txt containing an item that is not present in the payload directory.

See: https://github.com/LibraryOfCongress/bagit-python/blob/master/test.py#L1023

@kba
Copy link
Contributor Author

kba commented Dec 7, 2018

I wonder if it might be a bit more readable to rename fetch_files_to_be_fetched() to fetch()

Fine with me, I wanted to avoid confusion with fetch_entries.

and have it take an optional force parameter that would re-download things that are already present?

Sure.

I'm kind of surprised that the test suite considers a bag valid if it has a fetch.txt containing an item that is not present in the payload directory.

These tests break the "Every file listed in the fetch file MUST be listed in every payload manifest" rule and it isn't validated. fetch_entries should not just check for unsafe filenames but ensure files is also listed in payload_entries. The validation only checks data on disk and manifest entries. That is a bug.

Since the manifests determines the number and size of files, it could make sense to allow "bag with holes" validation against only the files not mentioned in fetch.txt with a special parameter though, if you don't want to fetch the whole thing. By default,

if it has a fetch.txt containing an item that is not present in the payload directory

should not be valid, you're right.

@edsu
Copy link
Contributor

edsu commented Dec 7, 2018

I guess we could consider validation as a separate issue from this PR though.

@@ -296,6 +301,7 @@ def __init__(self, path=None):
self.normalized_manifest_names = {}

self.algorithms = []
self.fetch_url_whitelist = DEFAULT_FETCH_URL_WHITELIST if fetch_url_whitelist is None else fetch_url_whitelist
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be simplified:

Suggested change
self.fetch_url_whitelist = DEFAULT_FETCH_URL_WHITELIST if fetch_url_whitelist is None else fetch_url_whitelist
self.fetch_url_whitelist = fetch_url_whitelist or DEFAULT_FETCH_URL_WHITELIST

for url, expected_size, filename in self.fetch_entries():
if not fnmatch_any(url, self.fetch_url_whitelist):
raise BagError(_("Malformed URL in fetch.txt: %s, matches none of the whitelisted URL patterns %s") % (url, self.fetch_url_whitelist))
expected_size = -1 if expected_size == '-' else int(expected_size)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking this would be more idiomatic as:

Suggested change
expected_size = -1 if expected_size == '-' else int(expected_size)
expected_size = None if expected_size == '-' else int(expected_size)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also wanted to check whether fetch_entries() is correctly validating that these values are either “-” or digits.

yield filename

def fetch(self, force=False):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering whether we should support force() or simply have the user handle file operations outside of scope. The main thing which that does is open up the need to think about potential security concerns so perhaps that should just be confirming that we correctly validate file paths not being outside of the bag scope.

else:
content_length = int(headers['content-length'])
if content_length != expected_size:
raise BagError(_("Inconsistent size of %s: Expected %s but Content-Length is %s") % (filename, expected_size, content_length))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering whether this should have the URL as well:

Suggested change
raise BagError(_("Inconsistent size of %s: Expected %s but Content-Length is %s") % (filename, expected_size, content_length))
raise BagError(_("Inconsistent size for %s: expected %s bytes but %s had a Content-Length of %s") % (filename, expected_size, url, content_length))

resp = opener.open(req)
headers = resp.info()
if expected_size >= 0:
if "content-length" not in headers:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would only apply to HTTP requests but the intention of this code is to support other protocols, which suggests that this would need to be a conditional warning since I don't believe the stdlib sets a header like that for other protocols.

parsed_url = urlparse(url)

if not all((parsed_url.scheme, parsed_url.netloc)):
raise BagError(_("Malformed URL in fetch.txt: %s") % url)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should still have this validation step since it reports some malformed URLs earlier. This would also allow us to remove both of the fnmatch checks by changing the whitelist to be a list of URL schemes and simply adding a second if parsed_url.scheme not in self.fetch_url_schema_whitelist: … check here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants