Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

oac fetcher / mapper #418

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from
Draft

oac fetcher / mapper #418

wants to merge 2 commits into from

Conversation

lthurston
Copy link
Contributor

@lthurston lthurston commented May 10, 2023

This is a combo fetcher / mapper PR.

@lthurston lthurston force-pushed the oac-fetcher branch 6 times, most recently from 326ada7 to e922e75 Compare May 23, 2023 22:12
@lthurston lthurston marked this pull request as ready for review June 5, 2023 19:51
@lthurston lthurston changed the title Implement OAC mappers 2. oac fetcher / mapper Jun 5, 2023
@lthurston lthurston force-pushed the oac-fetcher branch 2 times, most recently from bd99980 to 1a9c016 Compare June 7, 2023 20:34
@lthurston lthurston marked this pull request as draft June 7, 2023 22:49
@lthurston lthurston force-pushed the oac-fetcher branch 2 times, most recently from 5915a65 to 686652f Compare June 8, 2023 15:46
@lthurston lthurston marked this pull request as ready for review June 8, 2023 15:56
Copy link
Collaborator

@amywieliczka amywieliczka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This mapper uses models of thinking from the dpla-ingestion codebase. We would like to strip out all use of exists, getprop, iterify, and get_vals. These functions accomodate unknown data structures, where we would like to enforce known data structures. Mapping should be a process of moving from one expected data structures to our normalized well-defined data structure. The gymnastics done in these functions obfuscate where data came from, and where it goes to.

I know that I had previously written an OAC fetcher/mapper pair, and this work all seems to build on what I had previously written. I do think it might be worthwhile to start from scratch and to write more specific mapping functions.

Also, we'd love to get rid of the enrichment select_oac_id - defined in mapper.py - could you move the mapping logic done there into this mapper, and make the select_oac_id enrichment a no-op? (The function definition still needs to exist, since we cannot remove it from our enrichment chain in the collection registry due to legacy harvester operations.)

@lthurston
Copy link
Contributor Author

I hear what you're saying here @amywieliczka. Since the source_metadata is in many cases a list of parsed XML nodes (that look like this [{'attrib': {'q': 'created'}, 'text': '1938'}, {'attrib': {'q': 'created'}, 'text': '1938-00-00'}], for example), many mapped values in UCLDC_map() require iteration to extract text. Do you prefer seeing a simple list comprehension in that method like this:

            ...
            "contributor":
                [n["text"] for n in self.source_metadata.get("contributor", [])],
            .... 

or would you rather see a map_contributor() method in UCLDC_map()?

I assume that because that the text node extraction logic is ok to be duplicated in several places in order to keep the mapper denormalized and simple. Is that a reasonable assumption?

The replacements for those fields that currently use get_vals with suppressed attributes, and those that rely on map_specific() will certainly have their own map method.

@lthurston lthurston marked this pull request as draft August 10, 2023 21:12
@lthurston
Copy link
Contributor Author

@amywieliczka I threw some spaghetti at the wall here. UCLDC_map is pretty easy to make sense of at a glance, but there's some redundancy that comes as a result of explicitness. Let me know what you think about where this is going. Thanks!

@lthurston lthurston changed the title 2. oac fetcher / mapper oac fetcher / mapper Aug 11, 2023
@amywieliczka amywieliczka self-requested a review August 14, 2023 21:52
@amywieliczka
Copy link
Collaborator

You know, I really don't mind the hyper-locality of implementing "include" inside each of these functions. It makes a data stack trace for any single one of them really straightforward, which is a huge boon. I think I'm a-okay with this approach. @barbarahui what do you think?

I am still seeing one instance of exists and two instances of iterify. Both seem to protect against unexpected inputs, but I'm curious if either is actually necessary or if both were added "out of an abundance of caution".

@barbarahui
Copy link
Collaborator

barbarahui commented Aug 14, 2023

I like this approach! Very readable (and traceable, as Amy points out), even if a little redundant. This is a big improvement over what was there before.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants