Add Storage Service DIP integration #153
Conversation
Codecov Report
@@ Coverage Diff @@
## master #153 +/- ##
=========================================
+ Coverage 85.54% 85.7% +0.15%
=========================================
Files 26 30 +4
Lines 1086 1203 +117
=========================================
+ Hits 929 1031 +102
- Misses 157 172 +15
Force-pushed from 04f5bdf to 326ab5c
Force-pushed from b96dbfa to aabe2e4
Hi @cole, this is ready for review when you have some time. First of all, sorry for mixing contents and for some back and forth in this PR. I tried to detail it all in the description and in each commit, but if you think it's better to split this work into several PRs, I can try to clean it up a little. A couple of important notes about these changes:
Thanks!
A few more things @cole: I have implemented tests for almost everything except the METS parsing process. Also, this will require adding some documentation (for now to the README file), but I think it's better to wait until the integration is "finished" for both DIP types.
This PR looks good. I didn't review the tests in depth but the ones I looked at looked good. I'm really happy about the level of testing here! I have made a lot of small comments, but none of them are serious issues, and I leave it to you which ones you want to address, if any. Let me know if any are unclear.
Initially, I decided to use django_celery_results to track task errors in the import process, which has turned out to be inconvenient now that we're using task chains. I think you also had a conversation with Sevein about that not being a great idea, so we're not using them anymore. Nevertheless, I have not fully removed the django_celery_results requirement from the application, as that involves migrations. I mentioned to CCA the unification of the core apps and the change in the project and migration names (from #150 (comment)), which may happen soon and will ease the removal of django_celery_results.
To be clear, I think you still need django_celery_results (and it's fine to use); my point before was that it's better to avoid referencing the models it creates directly, as it might make sense for the project to switch to a different task result backend in future. If you use the celery error handlers instead (I think you've already switched to them), then switching out the result backend should be just a settings change.
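For reference, a minimal sketch of the error-handler approach, using the errback signature from the Celery docs. `download_mets` and `save_import_error` are names from this PR's description; everything else (model import path, the `import_status`/`import_error` fields, how the DIP pk is recovered) is illustrative:

from celery import shared_task

from dips.models import DIP  # assuming the models live in dips.models


@shared_task(ignore_result=True)
def save_import_error(request, exc, traceback):
    # Record the failure on the DIP itself instead of querying TaskResult.
    # Assumes the failed task received the DIP pk as its first positional
    # argument and that the model has these fields (illustrative names).
    DIP.objects.filter(pk=request.args[0]).update(
        import_status=DIP.IMPORT_FAILURE, import_error=str(exc)
    )


# Attached when launching the import, e.g.:
# download_mets.apply_async((dip.pk,), link_error=save_import_error.s())

With this in place, switching the result backend really is just a change to CELERY_RESULT_BACKEND.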
dips/api_views.py
Outdated
            import_status=DIP.IMPORT_PENDING,
        )
    else:
        # Should we use 409 Conflict or 422 Unprocessable Entity?
Yeah, this is where REST kind of sucks 🙂
I think either is appropriate; there's also a case to be made for 303 with a link to the other:
If the result of processing a POST would be equivalent to a representation of an existing resource, an origin server MAY redirect the user agent to that resource by sending a 303 (See Other) response with the existing resource's identifier in the Location field
Source: https://tools.ietf.org/html/rfc7231#section-4.3.3
But, 303 may be problematic as a lot of clients don't consider that an error. I'd probably go with 422. I think any of 400, 409, or 422 is better than 403.
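For illustration, a rough sketch of what the 422 option could look like in a Django view; the view name and error payload below are made up, not taken from this PR:

from django.http import JsonResponse

from dips.models import DIP  # assuming the models live in dips.models


def dip_stored_webhook(request, dip_uuid):
    # Reject duplicate notifications with 422 Unprocessable Entity instead of 403.
    if DIP.objects.filter(ss_uuid=dip_uuid).exists():
        return JsonResponse(
            {"message": "A DIP already exists for UUID %s." % dip_uuid},
            status=422,
        )
    # ... otherwise create the DIP and launch the import chain.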
dips/api_views.py
Outdated
            dc=DublinCore.objects.create(identifier=dip_uuid),
            import_status=DIP.IMPORT_PENDING,
        )
    else:
I think this is a case for get_or_create:
# Create DIP or fail if already exists
dip, created = DIP.objects.get_or_create(ss_uuid=dip_uuid, defaults=dict(
    ss_uuid=dip_uuid,
    ss_host_url=ss_host["url"],
    ss_download_url=download_url,
    dc=DublinCore.objects.create(identifier=dip_uuid),
    import_status=DIP.IMPORT_PENDING,
))
if not created:
Oh, I didn't know about defaults, thanks!
dips/tasks.py
Outdated
# Check the SS host is still configured in the settings
dip = DIP.objects.get(pk=dip_id)
ss_host = next((x for x in settings.SS_HOSTS if x["url"] == dip.ss_host_url), None)
if not ss_host:
if dip.ss_host_url not in [host["url"] for host in settings.SS_HOSTS]:
?
It might be easier to make settings.SS_HOSTS a dict of urls -> host info, or cache the set of URLs as a different setting.
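Something like the following, roughly; the contents of each host entry aren't shown in this PR, so the keys below are illustrative:

# settings.py: key the hosts by URL instead of keeping a list of dicts.
SS_HOSTS = {
    "https://ss.example.org:8000": {"user": "dips", "secret": "..."},
}

# dips/tasks.py: the lookup then becomes a plain dict access.
ss_host = settings.SS_HOSTS.get(dip.ss_host_url)
if ss_host is None:
    raise ValueError("Configuration not found for SS host: %s" % dip.ss_host_url)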
dips/tasks.py
Outdated
dip = DIP.objects.get(pk=dip_id)
ss_host = next((x for x in settings.SS_HOSTS if x["url"] == dip.ss_host_url), None)
if not ss_host:
    raise Exception("Configuration not found for SS host: %s" % dip.ss_host_url)
raise Exception("Configuration not found for SS host: %s" % dip.ss_host_url) | |
raise ValueError("Configuration not found for SS host: %s" % dip.ss_host_url) |
You could also view it as the error coming from incorrect settings, in which case maybe RuntimeError?
dips/tasks.py
Outdated
mets.parse_mets()
metsfile = None
with zipfile.ZipFile(zip_path) as zip_:
    mets_re = re.compile(r".*METS.[0-9a-f\-]{36}.*$")
This can be a module level constant (saves recompiling on each task run).
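For instance, a small fragment showing the constant moved to module level; the `_find_mets_name` helper is only there to make the fragment self-contained and isn't from this PR:

import re
import zipfile

# Compiled once at import time instead of on every task run.
METS_RE = re.compile(r".*METS.[0-9a-f\-]{36}.*$")


def _find_mets_name(zip_path):
    with zipfile.ZipFile(zip_path) as zip_:
        for name in zip_.namelist():
            if METS_RE.match(name):
                return name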
    default_retry_delay=30,
    ignore_result=True,
)
def update_es_descendants(class_name, pk):
I would split these into two tasks. You can keep the shared logic if that's easier, so that you have something like:
@shared_task(...)
def update_collection_es_descendants(pk):
    update_es_descendants("Collection", pk)


@shared_task(...)
def update_dip_es_descendants(pk):
    update_es_descendants("DIP", pk)


def update_es_descendants(class_name, pk):
    ...
You could also use a separate task that takes a script and a series of pks to do the actual file update and chain them.
All of that might not be something for this PR, but it will allow easier understanding of what a task is doing just from the name, and potential for processing them in different queues, etc. in future.
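A very loose sketch of that chained variant; every name below is illustrative rather than taken from this PR, and the models are assumed to live in dips.models:

from celery import chain, shared_task

from dips.models import Collection, DigitalFile


@shared_task
def build_collection_update(pk):
    # Build the painless script and collect the ids of the affected files.
    collection = Collection.objects.get(pk=pk)
    script = {
        "source": "...",  # the existing update script
        "lang": "painless",
        "params": {"collection": collection.get_es_data_for_files()},
    }
    ids = list(
        DigitalFile.objects.filter(dip__collection__pk=pk).values_list("pk", flat=True)
    )
    return script, ids


@shared_task
def bulk_update_files(script_and_ids):
    # Apply the script to all descendant documents in a single bulk request.
    script, ids = script_and_ids
    ...


# Usage: chain(build_collection_update.s(collection.pk), bulk_update_files.s()).delay()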
I had mixed feelings about this too when I was doing it. Currently, the search tasks are called in the save and delete methods of the parent AbstractEsModel, which should make it easier to add a new model with the same requirements. I should have moved the script declaration and other model/doc dependencies from those tasks to the actual model/doc to make it better, but I didn't expect that many differences and it all ended up there ;(
About a single task to update each file (I'm not sure if that's what you're suggesting): I thought that, even in an async context, it would be better to hit Elasticsearch as little as possible, using bulk requests.
search/tasks.py
Outdated
With the partial data from the ancestor Collection or DIP.
"""
if class_name not in ["Collection", "DIP"]:
    raise Exception("Can not update descendants of %s." % class_name)
raise ValueError("Can not update descendants of %s." % class_name)
raise Exception("Can not update descendants of %s." % class_name) | |
raise ValueError("Can not update descendants of %s." % class_name) |
search/tasks.py
Outdated
"lang": "painless", | ||
"params": {"collection": collection.get_es_data_for_files()}, | ||
} | ||
files = DigitalFile.objects.filter(dip__collection__pk=pk).all() |
I think you can use .values_list('id', flat=True) here to save loading unnecessary fields.
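A minimal sketch of how those ids could feed the bulk update; the function name, index name, and client variable are illustrative, and `script` is assumed to be the dict built in the snippet above:

from elasticsearch.helpers import bulk


def bulk_update_collection_files(es_client, pk, script):
    # Load only the primary keys; the model instances themselves are not needed.
    file_ids = DigitalFile.objects.filter(dip__collection__pk=pk).values_list("pk", flat=True)
    actions = (
        {
            "_op_type": "update",
            "_index": "digitalfiles",  # illustrative index name
            "_id": str(file_id),
            "script": script,
        }
        for file_id in file_ids
    )
    bulk(es_client, actions)  # es_client: an elasticsearch.Elasticsearch instance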
search/tasks.py
Outdated
"lang": "painless", | ||
"params": data_params, | ||
} | ||
files = DigitalFile.objects.filter(dip__pk=pk).all() |
values_list should also apply here.
search/tasks.py
Outdated
def delete_es_descendants(class_name, pk):
    """Deletes the related documents in ES based on the ancestor id."""
    if class_name not in ["Collection", "DIP"]:
        raise Exception("Can not delete descendants of %s." % class_name)
raise Exception("Can not delete descendants of %s." % class_name) | |
raise ValueError("Can not delete descendants of %s." % class_name) |
Allow receiving notifications when DIPs are stored in the SS to fetch their METS files and create a Folder to represent each of them in the application.
- Add SS_HOSTS setting allowing multiple RFC-1738 SS URLs.
- Modify DIP model to include SS information.
- Create DIP from SS notification in DIPStoredWebhook:
  - Check Origin header against SS_HOSTS.
  - Get SS data, create DIP and launch async task chain.
- Split existing `extract_and_parse_mets` task and add `download_mets` and `save_import_error` tasks to ease both DIP import workflows.
- Add import error to DIP model instead of tracking the TaskResult.
- Try to create the DIP - Collection relation during the METS parsing process.
- Proxy DIP from SS in download DIP view, prioritizing the local file if it exists.

Due to this integration, a Folder (DIP) may now end up not related to a Collection, which required a few changes:
- Allow null values in the DIP model's collection field.
- Check that a relation with a Collection exists where needed.
- Do not require the collection field on the DIP creation form.
- Allow modifying/clearing the collection field on the DIP edit form.
- Update Collection data in the Elasticsearch documents of related DigitalFiles when a DIP is modified.
- Add orphan folders page to show Folders not related to a Collection.
- Only allow editors and administrators to see those Folders.
- Filter files from orphan folders in the search page for non editor or administrator users.

Other changes:
- Add `requests` requirement to connect with the SS and `vcrpy` to fake those requests in tests.
- Change Elasticsearch image in Docker environment.
- Ignore .tox folder in Docker Compose `watchmedo` command.
- Move search related tasks to the search app.
- Remove some obvious comments and logs from tasks.
Force-pushed from b64c07b to f340461
Related to #2 and #154.