Add Storage Service DIP integration #153

Merged
jraddaoui merged 1 commit into master from dev/issue-2-automated-import on May 3, 2019

Conversation

@jraddaoui (Collaborator) commented Apr 20, 2019

Allow receiving notifications when DIPs are stored in the SS, to fetch
their METS files and create a Folder representing each DIP in the application.

  • Add SS_HOSTS setting allowing multiple RFC-1738 SS URLs (see the
    sketch after this list).
  • Modify DIP model to include SS information.
  • Create DIP from SS notification in DIPStoredWebhook:
    • Check Origin header against SS_HOSTS.
    • Get SS data, create the DIP and launch an async task chain.
    • Split existing extract_and_parse_mets task and add download_mets
      and save_import_error tasks to ease both DIP import workflows.
    • Add import error to DIP model instead of tracking the TaskResult.
  • Try to create the DIP - Collection relation during the METS parsing
    process.
  • Proxy the DIP from the SS in the download DIP view, prioritizing the
    local file if it exists.
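
A minimal sketch of the SS_HOSTS setting (only the "url" key is relied on
by the host lookup in the tasks; any other fields are illustrative):

    # Hypothetical example; the URLs and credentials are made up.
    SS_HOSTS = [
        {"url": "http://user:secret@ss.example.com:62081"},
        {"url": "https://other-ss.example.com:8000"},
    ]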

Due to this integration, a Folder (DIP) may now end up not related to a
Collection, which required a few changes:

  • Allow null values in the DIP model's collection field.
  • Check that a relation with a Collection exists where needed.
  • Do not require the collection field on the DIP creation form.
  • Allow modifying/clearing the collection field on the DIP edit form.
  • Update Collection data in the Elasticsearch documents of related
    DigitalFiles when a DIP is modified.
  • Add orphan folders page to show Folders not related to a Collection.
  • Only allow editors and administrators to see those Folders.
  • Filter out files from orphan folders on the search page for users who
    are not editors or administrators.

Other changes:

  • Add requests requirement to connect with the SS and vcrpy to fake
    those requests in tests.
  • Change Elasticsearch image in Docker environment.
  • Ignore .tox folder in Docker Compose watchmedo command.
  • Move search related tasks to search app.
  • Remove some obvious comments and logs from tasks.

Related to #2 and #154.

@jraddaoui self-assigned this Apr 20, 2019
codecov bot commented Apr 20, 2019

Codecov Report

Merging #153 into master will increase coverage by 0.15%.
The diff coverage is 91.59%.

@@            Coverage Diff            @@
##           master    #153      +/-   ##
=========================================
+ Coverage   85.54%   85.7%   +0.15%     
=========================================
  Files          26      30       +4     
  Lines        1086    1203     +117     
=========================================
+ Hits          929    1031     +102     
- Misses        157     172      +15
Impacted Files Coverage Δ
scope/urls.py 100% <ø> (ø) ⬆️
search/documents.py 100% <ø> (ø) ⬆️
dips/api_views.py 100% <100%> (ø) ⬆️
dips/models.py 98.6% <100%> (-0.04%) ⬇️
search/tasks.py 100% <100%> (ø)
dips/migrations/0006_alter_dip.py 100% <100%> (ø)
scope/helpers.py 100% <100%> (ø)
dips/migrations/0007_delete_taskresult.py 100% <100%> (ø)
scope/settings.py 100% <100%> (ø) ⬆️
dips/tasks.py 100% <100%> (ø) ⬆️
... and 6 more

@jraddaoui force-pushed the dev/issue-2-automated-import branch 3 times, most recently from 04f5bdf to 326ab5c on April 22, 2019 12:03
@jraddaoui force-pushed the dev/issue-2-automated-import branch 2 times, most recently from b96dbfa to aabe2e4 on May 1, 2019 09:31
@jraddaoui changed the title from "Extend DIP import process" to "Add Storage Service DIP integration" on May 2, 2019
@jraddaoui requested a review from cole on May 2, 2019 11:56
@jraddaoui (Collaborator, Author) commented:

Hi @cole, this is ready for review when you have some time.

First of all, sorry for mixing content and a bit of back and forth in this PR. I tried to detail it all in the description and in each commit but, if you think it's better to split this work into several PRs, I can try to clean it up a little.

A couple of important notes about these changes:

  • Without going into too much detail, two types of DIPs can be imported into the application. With these changes, the normal AM DIPs will work through the SS integration and the custom CCA DIPs will work through the Folder (DIP) creation form. Both DIP types should work in either workflow, but I want to check with CCA and the team first to see how much we can normalize the DIPs before making that work. That's why you'll see a few to-do comments in the tasks code.

  • Initially, I decided to use django_celery_results to track task errors in the import process, which has turned out to be an inconvenience now that we're using task chains. I think you also had a conversation with Sevein about that not being a great idea, so we're not using them anymore. Nevertheless, I have not fully removed the django_celery_results requirement from the application, as that involves migrations. I mentioned to CCA the unification of the core apps and the change in the project and migration names (from #150 (comment)), which may happen soon and will ease the removal of django_celery_results.

Thanks!

@jraddaoui (Collaborator, Author) commented:

A few more things, @cole:

I have implemented tests for almost everything except the METS parsing process (parsemets.py); we have an issue to address that, as there are no tests for that entire module.

Also, this will require adding some documentation (for now to the README file), but I think it's better to wait until the integration is "finished" for both DIP types.

@cole (Contributor) left a comment:

This PR looks good. I didn't review the tests in depth but the ones I looked at looked good. I'm really happy about the level of testing here! I have made a lot of small comments, but none of them are serious issues, and I leave it to you which ones you want to address, if any. Let me know if any are unclear.

Initially, I decided to use django_celery_results to track task errors in the import process, which has turned out to be an inconvenience now that we're using task chains. I think you also had a conversation with Sevein about that not being a great idea, so we're not using them anymore. Nevertheless, I have not fully removed the django_celery_results requirement from the application, as that involves migrations. I mentioned to CCA the unification of the core apps and the change in the project and migration names (from #150 (comment)), which may happen soon and will ease the removal of django_celery_results.

To be clear, I think you still need django_celery_results (and it's fine to use); my point before was that it's better to avoid referencing the models it creates directly, as it might make sense for the project to switch to a different task result backend in the future. If you use the celery error handlers instead (I think you've already switched to them), then switching out the result backend should be just a settings change.
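
A minimal sketch of what I mean, using the task names from this PR's description (exact signatures and the dip_id variable are assumptions):

    from celery import chain, shared_task

    @shared_task
    def save_import_error(request, exc, traceback):
        # In Celery >= 4, errbacks linked with link_error receive
        # (request, exc, traceback); persist the failure on the DIP here
        # instead of querying TaskResult rows directly.
        ...

    # Linking the error handler to the chain keeps the result backend
    # swappable via settings:
    chain(
        download_mets.si(dip_id),  # argument passing simplified here
        extract_and_parse_mets.si(dip_id),
    ).apply_async(link_error=save_import_error.s())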

import_status=DIP.IMPORT_PENDING,
)
else:
# Should we use 409 Conflict or 422 Unprocessable Entity?
Contributor

Yeah, this is where REST kind of sucks 🙂

I think either is appropriate; there's also a case to be made for 303 with a link to the other:

If the result of processing a POST would be equivalent to a representation of an existing resource, an origin server MAY redirect the user agent to that resource by sending a 303 (See Other) response with the existing resource's identifier in the Location field

Source: https://tools.ietf.org/html/rfc7231#section-4.3.3

But, 303 may be problematic as a lot of clients don't consider that an error. I'd probably go with 422. I think any of 400, 409, or 422 is better than 403.
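
For illustration, assuming the webhook is a Django REST Framework view, the already-imported case could return something like this (the detail message is made up):

    from rest_framework import status
    from rest_framework.response import Response

    return Response(
        {"detail": "A DIP with this UUID has already been imported."},
        status=status.HTTP_422_UNPROCESSABLE_ENTITY,
    )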

dc=DublinCore.objects.create(identifier=dip_uuid),
import_status=DIP.IMPORT_PENDING,
)
else:
Contributor

I think this is a case for get_or_create:

        # Create DIP or fail if it already exists
        dip, created = DIP.objects.get_or_create(
            ss_uuid=dip_uuid,
            defaults=dict(
                ss_host_url=ss_host["url"],
                ss_download_url=download_url,
                dc=DublinCore.objects.create(identifier=dip_uuid),
                import_status=DIP.IMPORT_PENDING,
            ),
        )
        if not created:

Collaborator Author

Oh, I didn't know about defaults, thanks!

dips/tasks.py Outdated
# Check the SS host is still configured in the settings
dip = DIP.objects.get(pk=dip_id)
ss_host = next((x for x in settings.SS_HOSTS if x["url"] == dip.ss_host_url), None)
if not ss_host:
Contributor

if dip.ss_host_url not in [host["url"] for host in settings.SS_HOSTS]:?

It might be easier to make settings.SS_HOSTS a dict of urls -> host info, or cache the set of URLs as a different setting.
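
A rough sketch of the dict-keyed alternative (host fields other than the URL key are invented for illustration):

    # settings.py: keyed by URL, so host lookups become a dict access.
    SS_HOSTS = {
        "http://ss.example.com:62081": {"user": "test", "secret": "test"},
    }

    # dips/tasks.py: the check above then reduces to:
    ss_host = settings.SS_HOSTS.get(dip.ss_host_url)
    if ss_host is None:
        raise ValueError("Configuration not found for SS host: %s" % dip.ss_host_url)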

dips/tasks.py Outdated
dip = DIP.objects.get(pk=dip_id)
ss_host = next((x for x in settings.SS_HOSTS if x["url"] == dip.ss_host_url), None)
if not ss_host:
raise Exception("Configuration not found for SS host: %s" % dip.ss_host_url)
Contributor

Suggested change
raise Exception("Configuration not found for SS host: %s" % dip.ss_host_url)
raise ValueError("Configuration not found for SS host: %s" % dip.ss_host_url)

You could also view it as the error coming from the settings being incorrect, in which case maybe RuntimeError?

dips/tasks.py Outdated
mets.parse_mets()
metsfile = None
with zipfile.ZipFile(zip_path) as zip_:
mets_re = re.compile(r".*METS.[0-9a-f\-]{36}.*$")
Contributor

This can be a module-level constant (saves recompiling it on each task run).
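
That is, hoisting the compiled pattern to module scope, roughly (the helper function is illustrative):

    import re

    # Compiled once at import time instead of on every task run.
    METS_RE = re.compile(r".*METS.[0-9a-f\-]{36}.*$")

    def find_mets_entry(zip_):
        # Illustrative helper; returns the first member matching the pattern.
        for entry in zip_.namelist():
            if METS_RE.match(entry):
                return entry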

default_retry_delay=30,
ignore_result=True,
)
def update_es_descendants(class_name, pk):
Contributor

I would split these into two tasks. You can keep the shared logic if that's easier, so that you have something like:

@shared_task(...)
def update_collection_es_descendants(pk):
    update_es_descendants("Collection", pk)

@shared_task(...)
def update_dip_es_descendants(pk):
    update_es_descendants("DIP", pk)

def update_es_descendants(class_name, pk):
    ...

You could also use a separate task that takes a script and a series of pks to do the actual file update and chain them.

All of that might not be something for this PR, but it will allow easier understanding of what a task is doing just from the name, and potential for processing them in different queues, etc. in future.

Collaborator Author

I had some doubts about this too when I was doing it. Currently, the search tasks are called in the save and delete methods of the parent AbstractEsModel, which should make adding a new model with the same requirements easier. I should have moved the script declaration and other model/doc dependencies from those tasks to the actual model/doc to make it better, but I didn't expect that many differences and it all ended up there ;(

Regarding a single task to update each file (I'm not sure if that's what you're suggesting), I thought that, even in an async context, it would be better to hit Elasticsearch as little as possible, using bulk requests, as sketched below.
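
Roughly this pattern, with a single bulk request per ancestor (the index name and client variable are placeholders):

    from elasticsearch.helpers import bulk

    # One bulk call applying the same painless script to every descendant
    # document; "digital_files" and es_client are illustrative names.
    actions = (
        {"_op_type": "update", "_index": "digital_files", "_id": file_pk, "script": script}
        for file_pk in file_pks
    )
    bulk(es_client, actions)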

search/tasks.py Outdated
With the partial data from the ancestor Collection or DIP.
"""
if class_name not in ["Collection", "DIP"]:
raise Exception("Can not update descendants of %s." % class_name)
Contributor

Suggested change
raise Exception("Can not update descendants of %s." % class_name)
raise ValueError("Can not update descendants of %s." % class_name)

search/tasks.py Outdated
"lang": "painless",
"params": {"collection": collection.get_es_data_for_files()},
}
files = DigitalFile.objects.filter(dip__collection__pk=pk).all()
Contributor

I think you can use .values_list('id', flat=True) here to save loading unnecessary fields.
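
That is, roughly:

    file_pks = DigitalFile.objects.filter(dip__collection__pk=pk).values_list("id", flat=True)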

search/tasks.py Outdated
"lang": "painless",
"params": data_params,
}
files = DigitalFile.objects.filter(dip__pk=pk).all()
Contributor

values_list should also apply here.

search/tasks.py Outdated
def delete_es_descendants(class_name, pk):
"""Deletes the related documents in ES based on the ancestor id."""
if class_name not in ["Collection", "DIP"]:
raise Exception("Can not delete descendants of %s." % class_name)
Contributor

Suggested change
raise Exception("Can not delete descendants of %s." % class_name)
raise ValueError("Can not delete descendants of %s." % class_name)

@jraddaoui (Collaborator, Author) commented:

Thanks a lot @cole!

I think I have addressed everything except the permissions (for which I created #155) and the update-descendants task (I added some comments above).

@jraddaoui force-pushed the dev/issue-2-automated-import branch from b64c07b to f340461 on May 3, 2019 18:31
@jraddaoui merged commit f340461 into master on May 3, 2019
@jraddaoui deleted the dev/issue-2-automated-import branch on May 3, 2019 18:36