Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow missing journal names when importing publications #472

Merged
merged 1 commit into from
May 10, 2024

Conversation

jonavellecuerdo
Copy link
Contributor

@jonavellecuerdo jonavellecuerdo commented May 9, 2024

Purpose and background context

Allow imports for publications that are missing fields required to generate a minimal citation.

How can a reviewer manually see the effects of these changes?

  • A couple unit tests were updated.

  • Test imports were performed via Solenoid app in staging
    Prerequisites: I manually deployed the staging app from this branch and updated some config vars to point to the Prod instance of Solenoid (DJANGO_ELEMENTS_ENDPOINT, DJANGO_ELEMENTS_PASSWORD, DJANGO_ELEMENTS_USER) since the records that could not be imported due to missing citation fields can be found in Prod.

    Note: Solenoid performs read-only operations from Symplectic Elements, so reading from Prod shouldn't cause any issues.

    To check that the updates work as expected, we can try importing publications for the authors listed in the spreadsheet shared by stakeholders. The following authors' Elements IDs can be used for testing:

    • 12338 (this was used for PR review)
    • 11259 (this was used for PR review)
    • 12130 (this was used for PR review, see comment)
    • 11969 (this was used for PR review, see comment)
    • 12127

    IMPORTANT NOTE: You can run test imports using the IDs above, but only use Solenoid in staging.

    1. Running the import for author #12338 was a success ✅
      image
      See log. Note: The logged message has been fixed to remove the unintended spacing.

    2. Running the import for author #11259 was a success ✅
      image
      See log

Includes new or updated dependencies?

YES but only to update dependencies with make update

Changes expectations for external applications?

NO

What are the relevant tickets?

  • Include links to Jira Software and/or Jira Service Management tickets here.

Developer

  • All new ENV is documented in README
  • All new ENV has been added to staging and production environments
  • All related Jira tickets are linked in commit message(s)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

  • The commit message is clear and follows our guidelines (not just this PR message)
  • There are appropriate tests covering any new functionality
  • The provided documentation is sufficient for understanding any new functionality introduced
  • Any manual tests have been performed or provided examples verified
  • New dependencies are appropriate or there were no changes

@jonavellecuerdo jonavellecuerdo self-assigned this May 9, 2024
@mitlib mitlib temporarily deployed to mitlibraries-infra-438--f9swrx May 9, 2024 15:11 Inactive
@mitlib mitlib temporarily deployed to mitlibraries-infra-438--vwtqbl May 9, 2024 15:53 Inactive
@jonavellecuerdo jonavellecuerdo temporarily deployed to mitlibraries-solenoid-staging May 9, 2024 15:55 Inactive
@jonavellecuerdo jonavellecuerdo force-pushed the INFRA-438-allow-missing-journal-names branch from 0333d7b to 9a9908e Compare May 9, 2024 16:15
@jonavellecuerdo jonavellecuerdo temporarily deployed to mitlibraries-infra-438--vwtqbl May 9, 2024 16:16 Inactive
@jonavellecuerdo jonavellecuerdo temporarily deployed to mitlibraries-solenoid-staging May 9, 2024 16:18 Inactive
@jonavellecuerdo jonavellecuerdo force-pushed the INFRA-438-allow-missing-journal-names branch from 9a9908e to 203d307 Compare May 9, 2024 16:23
@jonavellecuerdo jonavellecuerdo temporarily deployed to mitlibraries-infra-438--vwtqbl May 9, 2024 16:23 Inactive
@jonavellecuerdo jonavellecuerdo temporarily deployed to mitlibraries-solenoid-staging May 9, 2024 16:27 Inactive
@jonavellecuerdo jonavellecuerdo marked this pull request as ready for review May 9, 2024 16:34
Copy link
Contributor

@ghukill ghukill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, looking good. Given the considerable moving parts of any one section of Solenoid, I think the approach makes a lot of sense.

My questions are a possible (minor) bug, and a question if a test is possible (without knowing how much of a lift that would be).

solenoid/records/models.py Show resolved Hide resolved
Comment on lines 84 to 92
citation_fields = {
field: paper_data.get(
field, f"<{field.replace('-', ' ').upper()} UNIDENTIFIED>"
)
for field in Fields.CITATION_DATA
}
citation = "{last}, {first_init}. ".format(
last=paper_data[Fields.LAST_NAME], first_init=paper_data[Fields.FIRST_NAME][0]
last=citation_fields[Fields.LAST_NAME],
first_init=citation_fields[Fields.FIRST_NAME][0],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I see the thinking behind citation_fields dictionary, where it loops through all fields and if not present in the data, sets a string like <Field Name UNIDENTIFIED>.

But on line 92, what happens if Fields.FIRST_NAME was not present in the data? Wouldn't citation_fields have a string value "<First Name UNIDENTIFIED>" and therefore this index retrieval of [0] would return the first character of that string "<"?

Additionally, I think it could get easy to miss that citation_fields and paper_data are similar, but not identical, but both are used throught the rest of the method.

I'm not sure what an alternative would be offhand, as other methods in this class are using kind of similar logic of looping through Fields.* and doing things.

If that is a possible bug outlined above, but that's the only real blocker, might just be worth figuring out a workaround for that and moving on!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! Please see the changes introduced here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jonavellecuerdo - I dig it. This kind of string building, coupled with logic to account for the pre-existing enums and structures, is definitely kind of finicky. But this new approach, I find it much easier to just scan and understand.

Comment on lines 222 to 240
def missing_citation_fields(paper_data):
"""Check whether record is citable.

If a citation is provided, it will be used; otherwise a minimal citation
is generated using the author's first name and last name and the publication's
title and journal (to which it was published).
"""
# check if record is citable
missing_citation_fields = [
field
for field in [Fields.CITATION, *Fields.CITATION_DATA]
if not paper_data[field]
]
if missing_citation_fields:
return (
"Citation was not provided and/or "
f"missing required fields to generate minimal citation: {missing_citation_fields}."
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am I understanding correctly that this method will identify what minimal citation fields are missing, but that in _run_checks_on_paper(), it won't bail on processing the paper, it will just log that?

I ask, because the work above in create_citation() seems to suggest that we're somewhat okay with missing citation fields, where we just use <Field Name UNIDENTIFIED> and carry on.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that is correct! Previously, when any of the checks performed in _run_checks_on_paper() failed, it would proceed to skip the record, which resulted in Solenoid skipping the record for import when any of the citation fields were missing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be difficult to add a test that shows what a citation will look like if, say, Fields.JOURNAL is missing from the paper_data?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, it would not, however, from what I'm seeing it would raise a KeyError if it was somehow missing (though how cleanly Solenoid + Django would inform the user is unclear. That said: author_data is constructed explicitly as a dictionary through this method. Given that author_data will always have these fields by the time this method is called, it doesn't seem like it would be a helpful test to include. What do you think? 🤔

Copy link
Contributor

@ghukill ghukill May 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, let me rephrase. And maybe be a bit more general to avoid being inaccurate.

Given that part of the new code will handle a missing field like Fields.JOURNAL when constructing a citation with "<Journal UNIDENTIFIED>" instead of the journal name, can we have a test that shows that? Very pseudo code-y:

def test_unidentified_text_inserted_in_citation():
   mocked_data = {...}
   citation_string = Record.create_citation(mocked_data)
   assert citation_string == """Smith, A.B. (1980). Elephants love water.. <Journal UNIDENTIFIED>, 14(2), 10-23 doi:XX.XXXXX."""

I think a test like this would a) confirm that missing paper_data will result in that component getting replaced by the "UNIDENTIFIED" block, and b) would serve as a visual example of that to people reviewing code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah! Thank you for clarifying. For the time being, I added a test to the existing set of Record.create_citation tests. See here.

T'was a good idea as there were changes that needed to be made. 🤓

Copy link
Contributor

@ehanson8 ehanson8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good fix, some questions and suggestions for cleaning up the docstrings and comments

solenoid/records/models.py Outdated Show resolved Hide resolved
solenoid/records/models.py Outdated Show resolved Hide resolved
solenoid/records/models.py Outdated Show resolved Hide resolved

@staticmethod
def missing_citation_fields(paper_data):
"""Check whether record is citable.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't be Returns missing citation fields be more accurate? The description also doesn't seem to be referring to this method's functionality, was there some shifting that happened later?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good observation about naming, agreed! Or maybe get_missing_citation_fields?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, good catch! No shifting, just improper docstring placement. 😅

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved the extra info to Record.create_citation()

@@ -41,27 +41,43 @@ def setUp(self):
Fields.TITLE: "Ultraviolet behavior of non-abelian gauge theories",
}

def test_metadata_not_missing_any_citation_fields(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional: for clarity and consistency, I would name tests like so: test_<method name>_<scenario>_<result>, so here test_missing_citation_fields_full_citation_returns_none. It's long but clear

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this template!

# need to actually test create_citation
def test_is_metadata_valid_yes_citation_no_citation_data(self):
def test_is_cited_metadata_missing_minimal_citation_fields(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be test_missing_citation_fields_incomplete_citation_returns_fields

Comment on lines 57 to 59
assert Record.missing_citation_fields(metadata) == (
"Citation was not provided and/or "
f"missing required fields to generate minimal citation: ['{Fields.TITLE}', '{Fields.JOURNAL}']."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just clarifying the logic (which I am probably misunderstanding 🙃 ), if there is a Fields.CITATION, does it matter that Fields.TITLE and Fields.JOURNAL aren't set? Do they serve different purposes in the app than Fields.CITATION?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a good question and actually a very important point!

  • If Fields.CITATION exists, it shouldn't matter whether Fields.TITLE and Fields.JOURNAL exist as the latter two fields are only required when generating the minimal citation; and a minimal citation is only generated when Fields.CITATION does not exist (i.e., is not populated in the data retrieved from Elements).

Please see the updated logic in Record.get_missing_citation_fields().

  • If Fields.CITATION -> return None (i.e., no missing fields)
  • If not Fields.CITATION -> return message indicating missing fields needed to create a minimal citation or None.

@ehanson8 @ghukill

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, I find the udpated get_missing_citation_fields() much easier to parse as well. Makes sense to me now that the existence of Fields.CITATION in paper_data is enough to immediately return None.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great fix!

assert Record.missing_citation_fields(metadata) == (
"Citation was not provided and/or "
f"missing required fields to generate minimal citation: ['{Fields.CITATION}', '{Fields.TITLE}', '{Fields.JOURNAL}']."
)

def test_is_record_creatable(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test should be multiple tests given the "1 reason to fail" principle

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See renamed and split tests here.

@jonavellecuerdo jonavellecuerdo temporarily deployed to mitlibraries-infra-438--vwtqbl May 9, 2024 20:10 Inactive
@jonavellecuerdo jonavellecuerdo force-pushed the INFRA-438-allow-missing-journal-names branch from 33d7813 to cf9c2a5 Compare May 9, 2024 20:50
@jonavellecuerdo jonavellecuerdo temporarily deployed to mitlibraries-infra-438--vwtqbl May 9, 2024 20:50 Inactive
@jonavellecuerdo jonavellecuerdo temporarily deployed to mitlibraries-infra-438--vwtqbl May 9, 2024 21:12 Inactive
@jonavellecuerdo jonavellecuerdo force-pushed the INFRA-438-allow-missing-journal-names branch from 135a344 to 53628b5 Compare May 9, 2024 21:13
@jonavellecuerdo jonavellecuerdo temporarily deployed to mitlibraries-infra-438--vwtqbl May 9, 2024 21:13 Inactive
@jonavellecuerdo jonavellecuerdo temporarily deployed to mitlibraries-solenoid-staging May 9, 2024 21:16 Inactive
@jonavellecuerdo jonavellecuerdo temporarily deployed to mitlibraries-infra-438--vwtqbl May 9, 2024 21:29 Inactive
@jonavellecuerdo jonavellecuerdo temporarily deployed to mitlibraries-solenoid-staging May 9, 2024 21:30 Inactive
@jonavellecuerdo
Copy link
Contributor Author

jonavellecuerdo commented May 9, 2024

Redeployed the app in staging and ran a test import.

  1. Running the import for #12130 was successful ✅
    Running this test import pointed out that the characters [<, >] could not be rendered in the email draft created by Solenoid. In the citations below, the blank spaces after the publication title indicate where the placeholder text should've been placed.

    Sarma, S. (2024). How Thesis Driven Innovation Radars Could Benefit the Sports Industry. . [doi:10.1007/978-3-031-38981-8_3](https://doi.org/10.1007/978-3-031-38981-8_3)
    
    Sarma, S. (2020). “Wake Neutrality” of Artificial Intelligence Devices. . [doi:10.1017/9781108347846.010](https://doi.org/10.1017/9781108347846.010)
    
    Sarma, S. (2020). How Thesis Driven Innovation Radars Could Benefit the Sports Industry. . [doi:10.1007/978-3-030-50801-2_3](https://doi.org/10.1007/978-3-030-50801-2_3)
    

    Note: This also pointed out that Solenoid currently does not appear to programmatically rebuild the citation or the email (or both) once imported, but fixing this is beyond the scope of this PR.

  2. Running the import for #11969 was successful ✅
    See generated email with "JOURNAL-NAME UNIDENTIFIED" included in the minimal citation.

    image

    See log.

Copy link
Contributor

@ghukill ghukill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Seems like a good solution to ensure these records continue to process, without much of an overhaul of that processing logic. Thanks for the additional test that shows an example outputted citation.

self.assertEqual(
citation,
"Wilczek, F. (1973). Ultraviolet behavior of "
"non-abelian gauge theories. <JOURNAL-NAME UNIDENTIFIED>, "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Why these changes are being introduced:
* Publications should be imported even when Solenoid can't
generate a "minimal" citation since users of Solenoid can
manually enter this missing information when editing the email
request.

How this addresses that need:
* Differentiate methods for identifying missing citation fields vs. required id fields
* Improve logging to communicate what the missing fields are (if any)
* Update method for generating citations to denote missing citation fields as "UNIDENTIFIED"
* Allow addition of records missing citation fields to database

Side effects of this change:
* Imports that previously failed due to missing citation fields
should now be successfully imported to Solenoid.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/INFRA-438
@jonavellecuerdo jonavellecuerdo force-pushed the INFRA-438-allow-missing-journal-names branch from 4273374 to 8d67c10 Compare May 10, 2024 13:25
@jonavellecuerdo jonavellecuerdo temporarily deployed to mitlibraries-infra-438--vwtqbl May 10, 2024 13:26 Inactive
@jonavellecuerdo jonavellecuerdo merged commit 266db61 into main May 10, 2024
2 checks passed
@jonavellecuerdo jonavellecuerdo deleted the INFRA-438-allow-missing-journal-names branch May 10, 2024 13:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants