Add Storage Service DIP integration #153

Merged
jraddaoui merged 1 commit into master from dev/issue-2-automated-import on May 3, 2019

Conversation

@jraddaoui (Collaborator) commented Apr 20, 2019

Allow receiving notifications when DIPs are stored in the SS, to fetch
their METS files and create a Folder representing each DIP in the application.

  • Add SS_HOSTS setting allowing multiple RFC-1738 SS URLs (see the
    sketch after this list).
  • Modify DIP model to include SS information.
  • Create DIP from SS notification in DIPStoredWebhook:
    • Check Origin header against SS_HOSTS.
    • Get SS data, create the DIP and launch an async task chain.
    • Split existing extract_and_parse_mets task and add download_mets
      and save_import_error tasks to ease both DIP import workflows.
    • Add import error to DIP model instead of tracking the TaskResult.
  • Try to create the DIP - Collection relation during the METS parsing
    process.
  • Proxy the DIP from the SS in the download DIP view, prioritizing the
    local file if it exists.
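
A minimal sketch of the SS_HOSTS setting (only the "url" key is relied on
by the host lookup in the tasks; any other fields are illustrative):

    # Hypothetical example; the URLs and credentials are made up.
    SS_HOSTS = [
        {"url": "http://user:secret@ss.example.com:62081"},
        {"url": "https://other-ss.example.com:8000"},
    ]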

Due to this integration, a Folder (DIP) may now end up not related to a
Collection, which required a few changes:

  • Allow null values in the DIP model's collection field.
  • Check that a relation with a Collection exists where needed.
  • Do not require the collection field on the DIP creation form.
  • Allow modifying/clearing the collection field on the DIP edit form.
  • Update Collection data in the Elasticsearch documents of related
    DigitalFiles when a DIP is modified.
  • Add orphan folders page to show Folders not related to a Collection.
  • Only allow editors and administrators to see those Folders.
  • Filter out files from orphan folders on the search page for users who
    are not editors or administrators.

Other changes:

  • Add requests requirement to connect with the SS and vcrpy to fake
    those requests in tests.
  • Change Elasticsearch image in Docker environment.
  • Ignore .tox folder in Docker Compose watchmedo command.
  • Move search related tasks to search app.
  • Remove some obvious comments and logs from tasks.

Related to #2 and #154.

@jraddaoui self-assigned this Apr 20, 2019
codecov bot commented Apr 20, 2019

Codecov Report

Merging #153 into master will increase coverage by 0.15%.
The diff coverage is 91.59%.

@@            Coverage Diff            @@
##           master    #153      +/-   ##
=========================================
+ Coverage   85.54%   85.7%   +0.15%     
=========================================
  Files          26      30       +4     
  Lines        1086    1203     +117     
=========================================
+ Hits          929    1031     +102     
- Misses        157     172      +15
Impacted Files Coverage Δ
scope/urls.py 100% <ø> (ø) ⬆️
search/documents.py 100% <ø> (ø) ⬆️
dips/api_views.py 100% <100%> (ø) ⬆️
dips/models.py 98.6% <100%> (-0.04%) ⬇️
search/tasks.py 100% <100%> (ø)
dips/migrations/0006_alter_dip.py 100% <100%> (ø)
scope/helpers.py 100% <100%> (ø)
dips/migrations/0007_delete_taskresult.py 100% <100%> (ø)
scope/settings.py 100% <100%> (ø) ⬆️
dips/tasks.py 100% <100%> (ø) ⬆️
... and 6 more

@jraddaoui force-pushed the dev/issue-2-automated-import branch 3 times, most recently from 04f5bdf to 326ab5c on April 22, 2019 12:03
@jraddaoui force-pushed the dev/issue-2-automated-import branch 2 times, most recently from b96dbfa to aabe2e4 on May 1, 2019 09:31
@jraddaoui changed the title from "Extend DIP import process" to "Add Storage Service DIP integration" on May 2, 2019
@jraddaoui requested a review from cole on May 2, 2019 11:56
@jraddaoui (Collaborator, Author) commented:

Hi @cole, this is ready for review when you have some time.

First of all, sorry for mixing content and a bit of back and forth in this PR. I tried to detail it all in the description and in each commit but, if you think it's better to split this work into several PRs, I can try to clean it up a little.

A couple of important notes about these changes:

  • Without going into too much detail, two types of DIPs can be imported into the application. With these changes, the normal AM DIPs will work through the SS integration and the custom CCA DIPs will work through the Folder (DIP) creation form. Both DIP types should work in either workflow, but I want to check with CCA and the team first to see how much we can normalize the DIPs before making that work. That's why you'll see a few to-do comments in the tasks code.

  • Initially, I decided to use django_celery_results to track task errors in the import process, which has turned out to be an inconvenience now that we're using task chains. I think you also had a conversation with Sevein about that not being a great idea, so we're not using them anymore. Nevertheless, I have not fully removed the django_celery_results requirement from the application, as that involves migrations. I mentioned to CCA the unification of the core apps and the change in the project and migration names (from #150 (comment)), which may happen soon and will ease the removal of django_celery_results.

Thanks!

@jraddaoui (Collaborator, Author) commented:

A few more things, @cole:

I have implemented tests for almost everything except the METS parsing process (parsemets.py); we have an issue to address that, as there are no tests for that entire module.

Also, this will require adding some documentation (for now to the README file), but I think it's better to wait until the integration is "finished" for both DIP types.

@cole (Contributor) left a comment:

This PR looks good. I didn't review the tests in depth but the ones I looked at looked good. I'm really happy about the level of testing here! I have made a lot of small comments, but none of them are serious issues, and I leave it to you which ones you want to address, if any. Let me know if any are unclear.

Initially, I decided to use django_celery_results to track task errors in the import process, which has turned out to be an inconvenience now that we're using task chains. I think you also had a conversation with Sevein about that not being a great idea, so we're not using them anymore. Nevertheless, I have not fully removed the django_celery_results requirement from the application, as that involves migrations. I mentioned to CCA the unification of the core apps and the change in the project and migration names (from #150 (comment)), which may happen soon and will ease the removal of django_celery_results.

To be clear, I think you still need django_celery_results (and it's fine to use); my point before was that it's better to avoid referencing the models it creates directly, as it might make sense for the project to switch to a different task result backend in the future. If you use the celery error handlers instead (I think you've already switched to them), then switching out the result backend should be just a settings change.
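
A minimal sketch of what I mean, using the task names from this PR's description (exact signatures and the dip_id variable are assumptions):

    from celery import chain, shared_task

    @shared_task
    def save_import_error(request, exc, traceback):
        # In Celery >= 4, errbacks linked with link_error receive
        # (request, exc, traceback); persist the failure on the DIP here
        # instead of querying TaskResult rows directly.
        ...

    # Linking the error handler to the chain keeps the result backend
    # swappable via settings:
    chain(
        download_mets.si(dip_id),  # argument passing simplified here
        extract_and_parse_mets.si(dip_id),
    ).apply_async(link_error=save_import_error.s())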

import_status=DIP.IMPORT_PENDING,
)
else:
# Should we use 409 Conflict or 422 Unprocessable Entity?
Contributor

Yeah, this is where REST kind of sucks 🙂

I think either is appropriate; there's also a case to be made for 303 with a link to the other:

If the result of processing a POST would be equivalent to a representation of an existing resource, an origin server MAY redirect the user agent to that resource by sending a 303 (See Other) response with the existing resource's identifier in the Location field

Source: https://tools.ietf.org/html/rfc7231#section-4.3.3

But, 303 may be problematic as a lot of clients don't consider that an error. I'd probably go with 422. I think any of 400, 409, or 422 is better than 403.
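
For illustration, assuming the webhook is a Django REST Framework view, the already-imported case could return something like this (the detail message is made up):

    from rest_framework import status
    from rest_framework.response import Response

    return Response(
        {"detail": "A DIP with this UUID has already been imported."},
        status=status.HTTP_422_UNPROCESSABLE_ENTITY,
    )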

dc=DublinCore.objects.create(identifier=dip_uuid),
import_status=DIP.IMPORT_PENDING,
)
else:
Contributor

I think this is a case for get_or_create:

        # Create DIP or fail if it already exists
        dip, created = DIP.objects.get_or_create(
            ss_uuid=dip_uuid,
            defaults=dict(
                ss_host_url=ss_host["url"],
                ss_download_url=download_url,
                dc=DublinCore.objects.create(identifier=dip_uuid),
                import_status=DIP.IMPORT_PENDING,
            ),
        )
        if not created:

Collaborator Author

Oh, I didn't know about defaults, thanks!

dips/tasks.py Outdated
# Check the SS host is still configured in the settings
dip = DIP.objects.get(pk=dip_id)
ss_host = next((x for x in settings.SS_HOSTS if x["url"] == dip.ss_host_url), None)
if not ss_host:
Contributor

if dip.ss_host_url not in [host["url"] for host in settings.SS_HOSTS]:?

It might be easier to make settings.SS_HOSTS a dict of urls -> host info, or cache the set of URLs as a different setting.
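
A rough sketch of the dict-keyed alternative (host fields other than the URL key are invented for illustration):

    # settings.py: keyed by URL, so host lookups become a dict access.
    SS_HOSTS = {
        "http://ss.example.com:62081": {"user": "test", "secret": "test"},
    }

    # dips/tasks.py: the check above then reduces to:
    ss_host = settings.SS_HOSTS.get(dip.ss_host_url)
    if ss_host is None:
        raise ValueError("Configuration not found for SS host: %s" % dip.ss_host_url)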

dips/tasks.py Outdated
dip = DIP.objects.get(pk=dip_id)
ss_host = next((x for x in settings.SS_HOSTS if x["url"] == dip.ss_host_url), None)
if not ss_host:
raise Exception("Configuration not found for SS host: %s" % dip.ss_host_url)
Contributor

Suggested change
raise Exception("Configuration not found for SS host: %s" % dip.ss_host_url)
raise ValueError("Configuration not found for SS host: %s" % dip.ss_host_url)

You could also view it as the error coming from the settings being incorrect, in which case maybe RuntimeError?

dips/tasks.py Outdated
mets.parse_mets()
metsfile = None
with zipfile.ZipFile(zip_path) as zip_:
mets_re = re.compile(r".*METS.[0-9a-f\-]{36}.*$")
Contributor

This can be a module-level constant (saves recompiling it on each task run).
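
That is, hoisting the compiled pattern to module scope, roughly (the helper function is illustrative):

    import re

    # Compiled once at import time instead of on every task run.
    METS_RE = re.compile(r".*METS.[0-9a-f\-]{36}.*$")

    def find_mets_entry(zip_):
        # Illustrative helper; returns the first member matching the pattern.
        for entry in zip_.namelist():
            if METS_RE.match(entry):
                return entry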

default_retry_delay=30,
ignore_result=True,
)
def update_es_descendants(class_name, pk):
Contributor

I would split these into two tasks. You can keep the shared logic if that's easier, so that you have something like:

@shared_task(...)
def update_collection_es_descendants(pk):
    update_es_descendants("Collection", pk)

@shared_task(...)
def update_dip_es_descendants(pk):
    update_es_descendants("DIP", pk)

def update_es_descendants(class_name, pk):
    ...

You could also use a separate task that takes a script and a series of pks to do the actual file update and chain them.

All of that might not be something for this PR, but it will allow easier understanding of what a task is doing just from the name, and potential for processing them in different queues, etc. in future.

Collaborator Author

I had some doubts about this too when I was doing it. Currently, the search tasks are called in the save and delete methods of the parent AbstractEsModel, which should make adding a new model with the same requirements easier. I should have moved the script declaration and other model/doc dependencies from those tasks to the actual model/doc to make it better, but I didn't expect that many differences and it all ended up there ;(

Regarding a single task to update each file (I'm not sure if that's what you're suggesting), I thought that, even in an async context, it would be better to hit Elasticsearch as little as possible, using bulk requests, as sketched below.
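
Roughly this pattern, with a single bulk request per ancestor (the index name and client variable are placeholders):

    from elasticsearch.helpers import bulk

    # One bulk call applying the same painless script to every descendant
    # document; "digital_files" and es_client are illustrative names.
    actions = (
        {"_op_type": "update", "_index": "digital_files", "_id": file_pk, "script": script}
        for file_pk in file_pks
    )
    bulk(es_client, actions)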

search/tasks.py Outdated
With the partial data from the ancestor Collection or DIP.
"""
if class_name not in ["Collection", "DIP"]:
raise Exception("Can not update descendants of %s." % class_name)
Contributor

Suggested change
raise Exception("Can not update descendants of %s." % class_name)
raise ValueError("Can not update descendants of %s." % class_name)

search/tasks.py Outdated
"lang": "painless",
"params": {"collection": collection.get_es_data_for_files()},
}
files = DigitalFile.objects.filter(dip__collection__pk=pk).all()
Contributor

I think you can use .values_list('id', flat=True) here to save loading unnecessary fields.
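
That is, roughly:

    file_pks = DigitalFile.objects.filter(dip__collection__pk=pk).values_list("id", flat=True)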

search/tasks.py Outdated
"lang": "painless",
"params": data_params,
}
files = DigitalFile.objects.filter(dip__pk=pk).all()
Contributor

values_list should also apply here.

search/tasks.py Outdated
def delete_es_descendants(class_name, pk):
"""Deletes the related documents in ES based on the ancestor id."""
if class_name not in ["Collection", "DIP"]:
raise Exception("Can not delete descendants of %s." % class_name)
Contributor

Suggested change
raise Exception("Can not delete descendants of %s." % class_name)
raise ValueError("Can not delete descendants of %s." % class_name)

@jraddaoui (Collaborator, Author) commented:

Thanks a lot @cole!

I think I have addressed everything except the permissions (for which I created #155) and the update-descendants task (I added some comments above).

@jraddaoui force-pushed the dev/issue-2-automated-import branch from b64c07b to f340461 on May 3, 2019 18:31
@jraddaoui merged commit f340461 into master on May 3, 2019
@jraddaoui deleted the dev/issue-2-automated-import branch on May 3, 2019 18:36