Allow missing journal names when importing publications #472

jonavellecuerdo · 2024-05-09T15:07:53Z

Purpose and background context

Allow imports for publications that are missing fields required to generate a minimal citation.

How can a reviewer manually see the effects of these changes?

A couple unit tests were updated.
Test imports were performed via the Solenoid app in staging.
Prerequisites: I manually deployed the staging app from this branch and updated some config vars to point to the Prod instance of Solenoid (DJANGO_ELEMENTS_ENDPOINT, DJANGO_ELEMENTS_PASSWORD, DJANGO_ELEMENTS_USER) since the records that could not be imported due to missing citation fields can be found in Prod.

Note: Solenoid performs read-only operations from Symplectic Elements, so reading from Prod shouldn't cause any issues.

To check that the updates work as expected, we can try importing publications for the authors listed in the spreadsheet shared by stakeholders. The following authors' Elements IDs can be used for testing:
- 12338 (this was used for PR review)
- 11259 (this was used for PR review)
- 12130 (this was used for PR review, see comment)
- 11969 (this was used for PR review, see comment)
- 12127
IMPORTANT NOTE: You can run test imports using the IDs above, but only use Solenoid in staging.
1. Running the import for author #12338 was a success ✅
  
  See log. Note: The logged message has been fixed to remove the unintended spacing.
2. Running the import for author #11259 was a success ✅
  
  See log

Includes new or updated dependencies?

YES but only to update dependencies with make update

Changes expectations for external applications?

NO

What are the relevant tickets?

Include links to Jira Software and/or Jira Service Management tickets here.

Developer

All new ENV is documented in README
All new ENV has been added to staging and production environments
All related Jira tickets are linked in commit message(s)
Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

The commit message is clear and follows our guidelines (not just this PR message)
There are appropriate tests covering any new functionality
The provided documentation is sufficient for understanding any new functionality introduced
Any manual tests have been performed or provided examples verified
New dependencies are appropriate or there were no changes

ghukill

Overall, looking good. Given the considerable moving parts of any one section of Solenoid, I think the approach makes a lot of sense.

I have two questions: one about a possible (minor) bug, and one about whether a test is feasible (without knowing how much of a lift that would be).

solenoid/records/models.py

ghukill · 2024-05-09T17:41:04Z

solenoid/records/models.py

+        citation_fields = {
+            field: paper_data.get(
+                field, f"<{field.replace('-', ' ').upper()} UNIDENTIFIED>"
+            )
+            for field in Fields.CITATION_DATA
+        }
        citation = "{last}, {first_init}. ".format(
-            last=paper_data[Fields.LAST_NAME], first_init=paper_data[Fields.FIRST_NAME][0]
+            last=citation_fields[Fields.LAST_NAME],
+            first_init=citation_fields[Fields.FIRST_NAME][0],


I think I see the thinking behind citation_fields dictionary, where it loops through all fields and if not present in the data, sets a string like <Field Name UNIDENTIFIED>.

But on line 92, what happens if Fields.FIRST_NAME was not present in the data? Wouldn't citation_fields have a string value "<First Name UNIDENTIFIED>" and therefore this index retrieval of [0] would return the first character of that string "<"?

Additionally, I think it could be easy to miss that citation_fields and paper_data are similar but not identical, yet both are used throughout the rest of the method.

I'm not sure what an alternative would be offhand, as other methods in this class are using kind of similar logic of looping through Fields.* and doing things.

If the issue outlined above is a real bug but is the only real blocker, it might just be worth figuring out a workaround for it and moving on!

Good point! Please see the changes introduced here.

@jonavellecuerdo - I dig it. This kind of string building, coupled with logic to account for the pre-existing enums and structures, is definitely kind of finicky. But this new approach, I find it much easier to just scan and understand.
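The indexing concern raised in this thread can be illustrated with a minimal sketch. The key names and the helper functions below are hypothetical stand-ins, not Solenoid's actual code; the placeholder format is copied from the diff above. Note that `dict.get(field, default)` only substitutes when the key is absent, so this sketch uses `or` to also cover empty values, which is an assumption about the desired behavior.

```python
# Hypothetical field keys; the real values live in Solenoid's Fields class.
FIRST_NAME = "first-name"
LAST_NAME = "last-name"


def placeholder(field):
    """Build the placeholder text used in the diff above for a missing field."""
    return f"<{field.replace('-', ' ').upper()} UNIDENTIFIED>"


def author_prefix(paper_data):
    """Format "Last, F. " without indexing into a placeholder string."""
    last = paper_data.get(LAST_NAME) or placeholder(LAST_NAME)
    first = paper_data.get(FIRST_NAME)
    # Only take the initial when a real first name exists.
    first_init = first[0] if first else placeholder(FIRST_NAME)
    return f"{last}, {first_init}. "


# The concern from the review: indexing the placeholder yields "<".
assert placeholder(FIRST_NAME)[0] == "<"
```

With this shape, a record that has a first name yields the usual initial, while a record without one carries the whole placeholder instead of a stray "<".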

ghukill · 2024-05-09T17:47:19Z

solenoid/records/models.py

+    def missing_citation_fields(paper_data):
+        """Check whether record is citable.
+
+        If a citation is provided, it will be used; otherwise a minimal citation
+        is generated using the author's first name and last name and the publication's
+        title and journal (to which it was published).
+        """
+        # check if record is citable
+        missing_citation_fields = [
+            field
+            for field in [Fields.CITATION, *Fields.CITATION_DATA]
+            if not paper_data[field]
+        ]
+        if missing_citation_fields:
+            return (
+                "Citation was not provided and/or "
+                f"missing required fields to generate minimal citation: {missing_citation_fields}."
+            )


Am I understanding correctly that this method will identify what minimal citation fields are missing, but that in _run_checks_on_paper(), it won't bail on processing the paper, it will just log that?

I ask, because the work above in create_citation() seems to suggest that we're somewhat okay with missing citation fields, where we just use <Field Name UNIDENTIFIED> and carry on.

Yes, that is correct! Previously, when any of the checks performed in _run_checks_on_paper() failed, the record was skipped, so Solenoid did not import any record that was missing a citation field.
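A rough sketch of the behavior described in this exchange: missing ID fields still cause a skip, while missing citation fields are only logged. The function name, the logger name, and all key names here are hypothetical; this is not the body of `_run_checks_on_paper()`.

```python
import logging

logger = logging.getLogger("solenoid-sketch")

# Hypothetical key names, for illustration only.
REQUIRED_ID_FIELDS = ["paper-id", "author-id"]
CITATION_FIELDS = ["title", "journal-name"]


def should_import(paper_data):
    """Return True unless a required ID field is missing."""
    missing_ids = [f for f in REQUIRED_ID_FIELDS if not paper_data.get(f)]
    if missing_ids:
        logger.warning("Skipping record, missing ID fields: %s", missing_ids)
        return False
    missing_citation = [f for f in CITATION_FIELDS if not paper_data.get(f)]
    if missing_citation:
        # Logged for visibility only; the record is still imported.
        logger.info(
            "Importing record with missing citation fields: %s", missing_citation
        )
    return True
```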

ghukill · 2024-05-09T17:50:40Z

solenoid/records/tests/test_models.py

Would it be difficult to add a test that shows what a citation will look like if, say, Fields.JOURNAL is missing from the paper_data?

Hmm, it would not be difficult. However, from what I'm seeing, it would raise a KeyError if the field were somehow missing (though how cleanly Solenoid + Django would inform the user is unclear). That said, author_data is constructed explicitly as a dictionary through this method. Given that author_data will always have these fields by the time this method is called, it doesn't seem like it would be a helpful test to include. What do you think? 🤔

Sorry, let me rephrase. And maybe be a bit more general to avoid being inaccurate.

Given that part of the new code will handle a missing field like Fields.JOURNAL when constructing a citation with "<Journal UNIDENTIFIED>" instead of the journal name, can we have a test that shows that? Very pseudo code-y:

def test_unidentified_text_inserted_in_citation():
    mocked_data = {...}
    citation_string = Record.create_citation(mocked_data)
    assert citation_string == """Smith, A.B. (1980). Elephants love water.. <Journal UNIDENTIFIED>, 14(2), 10-23 doi:XX.XXXXX."""

I think a test like this would a) confirm that missing paper_data will result in that component getting replaced by the "UNIDENTIFIED" block, and b) would serve as a visual example of that to people reviewing code.

Ah! Thank you for clarifying. For the time being, I added a test to the existing set of Record.create_citation tests. See here.

'Twas a good idea, as there were changes that needed to be made. 🤓
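For readers without access to the linked test, here is a self-contained sketch of the kind of test discussed above. The `create_citation` function is a simplified stand-in written for this example, not `Record.create_citation`, and the key names and citation layout are assumptions; only the placeholder text and the sample title come from this PR.

```python
import unittest

# Hypothetical keys standing in for members of Solenoid's Fields class.
LAST_NAME, FIRST_NAME = "last-name", "first-name"
TITLE, JOURNAL, YEAR = "title", "journal-name", "year"


def create_citation(paper_data):
    """Simplified stand-in for Record.create_citation (not the real method)."""
    journal = paper_data.get(JOURNAL) or "<JOURNAL-NAME UNIDENTIFIED>"
    return (
        f"{paper_data[LAST_NAME]}, {paper_data[FIRST_NAME][0]}. "
        f"({paper_data[YEAR]}). {paper_data[TITLE]}. {journal}."
    )


class CreateCitationTest(unittest.TestCase):
    def test_create_citation_missing_journal_inserts_placeholder(self):
        paper_data = {
            LAST_NAME: "Wilczek",
            FIRST_NAME: "Frank",
            YEAR: "1973",
            TITLE: "Ultraviolet behavior of non-abelian gauge theories",
        }
        self.assertEqual(
            create_citation(paper_data),
            "Wilczek, F. (1973). Ultraviolet behavior of "
            "non-abelian gauge theories. <JOURNAL-NAME UNIDENTIFIED>.",
        )
```

Saved to a file, this can be run with `python -m unittest`. The expected string doubles as a visual example of what a citation with a missing journal looks like.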

ehanson8

Good fix, some questions and suggestions for cleaning up the docstrings and comments

solenoid/records/models.py

ehanson8 · 2024-05-09T17:44:55Z

solenoid/records/models.py

+
+    @staticmethod
+    def missing_citation_fields(paper_data):
+        """Check whether record is citable.


Wouldn't "Returns missing citation fields" be more accurate? The description also doesn't seem to refer to this method's functionality; was there some shifting that happened later?

Good observation about naming, agreed! Or maybe get_missing_citation_fields?

Ah, good catch! No shifting, just improper docstring placement. 😅

Moved the extra info to Record.create_citation()

ehanson8 · 2024-05-09T17:53:16Z

solenoid/records/tests/test_models.py

@@ -41,27 +41,43 @@ def setUp(self):
            Fields.TITLE: "Ultraviolet behavior of non-abelian gauge theories",
        }

+    def test_metadata_not_missing_any_citation_fields(self):


Optional: for clarity and consistency, I would name tests like so: test_<method name>_<scenario>_<result>, so here test_missing_citation_fields_full_citation_returns_none. It's long but clear

Thank you for this template!

ehanson8 · 2024-05-09T17:54:12Z

solenoid/records/tests/test_models.py

    # need to actually test create_citation
-    def test_is_metadata_valid_yes_citation_no_citation_data(self):
+    def test_is_cited_metadata_missing_minimal_citation_fields(self):


This could be test_missing_citation_fields_incomplete_citation_returns_fields

ehanson8 · 2024-05-09T17:59:01Z

solenoid/records/tests/test_models.py

+        assert Record.missing_citation_fields(metadata) == (
+            "Citation was not provided and/or "
+            f"missing required fields to generate minimal citation: ['{Fields.TITLE}', '{Fields.JOURNAL}']."


Just clarifying the logic (which I am probably misunderstanding 🙃 ), if there is a Fields.CITATION, does it matter that Fields.TITLE and Fields.JOURNAL aren't set? Do they serve different purposes in the app than Fields.CITATION?

That is a good question and actually a very important point!

If Fields.CITATION exists, it shouldn't matter whether Fields.TITLE and Fields.JOURNAL exist as the latter two fields are only required when generating the minimal citation; and a minimal citation is only generated when Fields.CITATION does not exist (i.e., is not populated in the data retrieved from Elements).

Please see the updated logic in Record.get_missing_citation_fields().

If Fields.CITATION -> return None (i.e., no missing fields)

If not Fields.CITATION -> return message indicating missing fields needed to create a minimal citation or None.

@ehanson8 @ghukill

FWIW, I find the updated get_missing_citation_fields() much easier to parse as well. Makes sense to me now that the existence of Fields.CITATION in paper_data is enough to immediately return None.
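The two-branch logic described in this thread might look roughly like the sketch below. The constants are hypothetical stand-ins for `Fields.CITATION` and `Fields.CITATION_DATA`, and the message wording is borrowed from the earlier diff; the merged implementation may differ.

```python
# Hypothetical stand-ins for Fields.CITATION and Fields.CITATION_DATA.
CITATION = "citation"
CITATION_DATA = ["first-name", "last-name", "title", "journal-name"]


def get_missing_citation_fields(paper_data):
    """Return a message naming missing minimal-citation fields, or None."""
    # A provided citation is used as-is, so nothing else is required.
    if paper_data.get(CITATION):
        return None
    missing = [field for field in CITATION_DATA if not paper_data.get(field)]
    if missing:
        return (
            "Citation was not provided and/or "
            f"missing required fields to generate minimal citation: {missing}."
        )
    return None
```

The early return keeps the "citation exists" case visually separate from the minimal-citation checks, which is the readability gain noted above.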

ehanson8 · 2024-05-09T18:01:12Z

solenoid/records/tests/test_models.py

+        assert Record.missing_citation_fields(metadata) == (
+            "Citation was not provided and/or "
+            f"missing required fields to generate minimal citation: ['{Fields.CITATION}', '{Fields.TITLE}', '{Fields.JOURNAL}']."
+        )

    def test_is_record_creatable(self):


This test should be multiple tests given the "1 reason to fail" principle

See renamed and split tests here.
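As an illustration of the naming template and the "1 reason to fail" principle from these comments, a split test class could be sketched as follows. The `missing_citation_fields` function is a toy stand-in that returns a list, and the key names are hypothetical; the actual renamed tests are in the linked commit.

```python
import unittest

# Hypothetical keys standing in for Fields.TITLE and Fields.JOURNAL.
TITLE, JOURNAL = "title", "journal-name"


def missing_citation_fields(metadata):
    """Toy stand-in: list the citation fields that are empty or absent."""
    return [field for field in (TITLE, JOURNAL) if not metadata.get(field)]


class MissingCitationFieldsTest(unittest.TestCase):
    # Names follow test_<method name>_<scenario>_<result>.
    def test_missing_citation_fields_full_citation_returns_empty(self):
        metadata = {TITLE: "A title", JOURNAL: "A journal"}
        self.assertEqual(missing_citation_fields(metadata), [])

    def test_missing_citation_fields_no_journal_returns_journal(self):
        self.assertEqual(missing_citation_fields({TITLE: "A title"}), [JOURNAL])

    def test_missing_citation_fields_no_title_returns_title(self):
        self.assertEqual(missing_citation_fields({JOURNAL: "A journal"}), [TITLE])
```

Each test makes a single assertion, so a failure points at exactly one scenario.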

jonavellecuerdo · 2024-05-09T21:42:07Z

Redeployed the app in staging and ran a test import.

Running the import for #12130 was successful ✅
Running this test import pointed out that the characters [<, >] could not be rendered in the email draft created by Solenoid. In the citations below, the blank spaces after the publication title indicate where the placeholder text should've been placed.

Sarma, S. (2024). How Thesis Driven Innovation Radars Could Benefit the Sports Industry. . [doi:10.1007/978-3-031-38981-8_3](https://doi.org/10.1007/978-3-031-38981-8_3)

Sarma, S. (2020). “Wake Neutrality” of Artificial Intelligence Devices. . [doi:10.1017/9781108347846.010](https://doi.org/10.1017/9781108347846.010)

Sarma, S. (2020). How Thesis Driven Innovation Radars Could Benefit the Sports Industry. . [doi:10.1007/978-3-030-50801-2_3](https://doi.org/10.1007/978-3-030-50801-2_3)

Note: This also pointed out that Solenoid currently does not appear to programmatically rebuild the citation or the email (or both) once imported, but fixing this is beyond the scope of this PR.
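The thread does not say why the angle-bracketed placeholder vanished from the email draft. If the draft is rendered as HTML (an assumption), unescaped `<...>` text would be parsed as an unknown tag and not displayed. Under that assumption, the standard library's `html.escape` is one way to keep such a placeholder visible; the helper name and sample citation below are made up for this sketch.

```python
import html


def escape_for_html(text):
    """Escape <, >, and & so the text displays literally in an HTML body."""
    return html.escape(text)


# Sample citation invented for illustration; the placeholder is from this PR.
citation = "Sarma, S. (2024). A title. <JOURNAL-NAME UNIDENTIFIED>."
escaped = escape_for_html(citation)
```

Another option, avoiding escaping altogether, would be a placeholder without angle brackets.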

Running the import for #11969 was successful ✅
See generated email with "JOURNAL-NAME UNIDENTIFIED" included in the minimal citation.

See log.

ghukill

Looks good! Seems like a good solution to ensure these records continue to process, without much of an overhaul of that processing logic. Thanks for the additional test that shows an example outputted citation.

ghukill · 2024-05-10T12:41:27Z

solenoid/records/tests/test_models.py

+        self.assertEqual(
+            citation,
+            "Wilczek, F. (1973). Ultraviolet behavior of "
+            "non-abelian gauge theories. <JOURNAL-NAME UNIDENTIFIED>, "


Why these changes are being introduced:
* Publications should be imported even when Solenoid can't generate a "minimal" citation since users of Solenoid can manually enter this missing information when editing the email request.

How this addresses that need:
* Differentiate methods for identifying missing citation fields vs. required id fields
* Improve logging to communicate what the missing fields are (if any)
* Update method for generating citations to denote missing citation fields as "UNIDENTIFIED"
* Allow addition of records missing citation fields to database

Side effects of this change:
* Imports that previously failed due to missing citation fields should now be successfully imported to Solenoid.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/INFRA-438

jonavellecuerdo self-assigned this May 9, 2024

mitlib temporarily deployed to mitlibraries-infra-438--f9swrx May 9, 2024 15:11 Inactive

mitlib temporarily deployed to mitlibraries-infra-438--vwtqbl May 9, 2024 15:53 Inactive

jonavellecuerdo temporarily deployed to mitlibraries-solenoid-staging May 9, 2024 15:55 Inactive

jonavellecuerdo force-pushed the INFRA-438-allow-missing-journal-names branch from 0333d7b to 9a9908e Compare May 9, 2024 16:15

jonavellecuerdo temporarily deployed to mitlibraries-infra-438--vwtqbl May 9, 2024 16:16 Inactive

jonavellecuerdo temporarily deployed to mitlibraries-solenoid-staging May 9, 2024 16:18 Inactive

jonavellecuerdo force-pushed the INFRA-438-allow-missing-journal-names branch from 9a9908e to 203d307 Compare May 9, 2024 16:23

jonavellecuerdo temporarily deployed to mitlibraries-infra-438--vwtqbl May 9, 2024 16:23 Inactive

jonavellecuerdo temporarily deployed to mitlibraries-solenoid-staging May 9, 2024 16:27 Inactive

jonavellecuerdo marked this pull request as ready for review May 9, 2024 16:34

jonavellecuerdo requested review from ghukill and ehanson8 May 9, 2024 16:34

ghukill reviewed May 9, 2024

ehanson8 reviewed May 9, 2024

jonavellecuerdo temporarily deployed to mitlibraries-infra-438--vwtqbl May 9, 2024 20:10 Inactive

jonavellecuerdo force-pushed the INFRA-438-allow-missing-journal-names branch from 33d7813 to cf9c2a5 Compare May 9, 2024 20:50

jonavellecuerdo temporarily deployed to mitlibraries-infra-438--vwtqbl May 9, 2024 20:50 Inactive

jonavellecuerdo temporarily deployed to mitlibraries-infra-438--vwtqbl May 9, 2024 21:12 Inactive

jonavellecuerdo force-pushed the INFRA-438-allow-missing-journal-names branch from 135a344 to 53628b5 Compare May 9, 2024 21:13

jonavellecuerdo temporarily deployed to mitlibraries-infra-438--vwtqbl May 9, 2024 21:13 Inactive

jonavellecuerdo requested review from ghukill and ehanson8 May 9, 2024 21:15

jonavellecuerdo temporarily deployed to mitlibraries-solenoid-staging May 9, 2024 21:16 Inactive

jonavellecuerdo temporarily deployed to mitlibraries-infra-438--vwtqbl May 9, 2024 21:29 Inactive

jonavellecuerdo temporarily deployed to mitlibraries-solenoid-staging May 9, 2024 21:30 Inactive

ehanson8 approved these changes May 9, 2024

ghukill approved these changes May 10, 2024

jonavellecuerdo force-pushed the INFRA-438-allow-missing-journal-names branch from 4273374 to 8d67c10 Compare May 10, 2024 13:25

jonavellecuerdo temporarily deployed to mitlibraries-infra-438--vwtqbl May 10, 2024 13:26 Inactive

jonavellecuerdo merged commit 266db61 into main May 10, 2024
2 checks passed

jonavellecuerdo deleted the INFRA-438-allow-missing-journal-names branch May 10, 2024 13:31
