-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow missing journal names when importing publications #472
Conversation
0333d7b
to
9a9908e
Compare
9a9908e
to
203d307
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, looking good. Given the considerable moving parts of any one section of Solenoid, I think the approach makes a lot of sense.
My questions are a possible (minor) bug, and a question if a test is possible (without knowing how much of a lift that would be).
solenoid/records/models.py
Outdated
citation_fields = { | ||
field: paper_data.get( | ||
field, f"<{field.replace('-', ' ').upper()} UNIDENTIFIED>" | ||
) | ||
for field in Fields.CITATION_DATA | ||
} | ||
citation = "{last}, {first_init}. ".format( | ||
last=paper_data[Fields.LAST_NAME], first_init=paper_data[Fields.FIRST_NAME][0] | ||
last=citation_fields[Fields.LAST_NAME], | ||
first_init=citation_fields[Fields.FIRST_NAME][0], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I see the thinking behind citation_fields
dictionary, where it loops through all fields and if not present in the data, sets a string like <Field Name UNIDENTIFIED>
.
But on line 92, what happens if Fields.FIRST_NAME
was not present in the data? Wouldn't citation_fields
have a string value "<First Name UNIDENTIFIED>"
and therefore this index retrieval of [0]
would return the first character of that string "<"
?
Additionally, I think it could get easy to miss that citation_fields
and paper_data
are similar, but not identical, but both are used throught the rest of the method.
I'm not sure what an alternative would be offhand, as other methods in this class are using kind of similar logic of looping through Fields.*
and doing things.
If that is a possible bug outlined above, but that's the only real blocker, might just be worth figuring out a workaround for that and moving on!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point! Please see the changes introduced here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jonavellecuerdo - I dig it. This kind of string building, coupled with logic to account for the pre-existing enums and structures, is definitely kind of finicky. But this new approach, I find it much easier to just scan and understand.
solenoid/records/models.py
Outdated
def missing_citation_fields(paper_data): | ||
"""Check whether record is citable. | ||
|
||
If a citation is provided, it will be used; otherwise a minimal citation | ||
is generated using the author's first name and last name and the publication's | ||
title and journal (to which it was published). | ||
""" | ||
# check if record is citable | ||
missing_citation_fields = [ | ||
field | ||
for field in [Fields.CITATION, *Fields.CITATION_DATA] | ||
if not paper_data[field] | ||
] | ||
if missing_citation_fields: | ||
return ( | ||
"Citation was not provided and/or " | ||
f"missing required fields to generate minimal citation: {missing_citation_fields}." | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Am I understanding correctly that this method will identify what minimal citation fields are missing, but that in _run_checks_on_paper()
, it won't bail on processing the paper, it will just log that?
I ask, because the work above in create_citation()
seems to suggest that we're somewhat okay with missing citation fields, where we just use <Field Name UNIDENTIFIED>
and carry on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that is correct! Previously, when any of the checks performed in _run_checks_on_paper()
failed, it would proceed to skip the record, which resulted in Solenoid skipping the record for import when any of the citation fields were missing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be difficult to add a test that shows what a citation will look like if, say, Fields.JOURNAL
is missing from the paper_data
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, it would not, however, from what I'm seeing it would raise a KeyError
if it was somehow missing (though how cleanly Solenoid + Django would inform the user is unclear. That said: author_data
is constructed explicitly as a dictionary through this method. Given that author_data
will always have these fields by the time this method is called, it doesn't seem like it would be a helpful test to include. What do you think? 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, let me rephrase. And maybe be a bit more general to avoid being inaccurate.
Given that part of the new code will handle a missing field like Fields.JOURNAL
when constructing a citation with "<Journal UNIDENTIFIED>"
instead of the journal name, can we have a test that shows that? Very pseudo code-y:
def test_unidentified_text_inserted_in_citation():
mocked_data = {...}
citation_string = Record.create_citation(mocked_data)
assert citation_string == """Smith, A.B. (1980). Elephants love water.. <Journal UNIDENTIFIED>, 14(2), 10-23 doi:XX.XXXXX."""
I think a test like this would a) confirm that missing paper_data
will result in that component getting replaced by the "UNIDENTIFIED" block, and b) would serve as a visual example of that to people reviewing code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah! Thank you for clarifying. For the time being, I added a test to the existing set of Record.create_citation
tests. See here.
T'was a good idea as there were changes that needed to be made. 🤓
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good fix, some questions and suggestions for cleaning up the docstrings and comments
solenoid/records/models.py
Outdated
|
||
@staticmethod | ||
def missing_citation_fields(paper_data): | ||
"""Check whether record is citable. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't be Returns missing citation fields
be more accurate? The description also doesn't seem to be referring to this method's functionality, was there some shifting that happened later?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good observation about naming, agreed! Or maybe get_missing_citation_fields
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, good catch! No shifting, just improper docstring placement. 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved the extra info to Record.create_citation()
@@ -41,27 +41,43 @@ def setUp(self): | |||
Fields.TITLE: "Ultraviolet behavior of non-abelian gauge theories", | |||
} | |||
|
|||
def test_metadata_not_missing_any_citation_fields(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Optional: for clarity and consistency, I would name tests like so: test_<method name>_<scenario>_<result>
, so here test_missing_citation_fields_full_citation_returns_none
. It's long but clear
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this template!
# need to actually test create_citation | ||
def test_is_metadata_valid_yes_citation_no_citation_data(self): | ||
def test_is_cited_metadata_missing_minimal_citation_fields(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be test_missing_citation_fields_incomplete_citation_returns_fields
assert Record.missing_citation_fields(metadata) == ( | ||
"Citation was not provided and/or " | ||
f"missing required fields to generate minimal citation: ['{Fields.TITLE}', '{Fields.JOURNAL}']." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just clarifying the logic (which I am probably misunderstanding 🙃 ), if there is a Fields.CITATION
, does it matter that Fields.TITLE
and Fields.JOURNAL
aren't set? Do they serve different purposes in the app than Fields.CITATION
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is a good question and actually a very important point!
- If
Fields.CITATION
exists, it shouldn't matter whetherFields.TITLE
andFields.JOURNAL
exist as the latter two fields are only required when generating the minimal citation; and a minimal citation is only generated whenFields.CITATION
does not exist (i.e., is not populated in the data retrieved from Elements).
Please see the updated logic in Record.get_missing_citation_fields()
.
- If
Fields.CITATION
-> return None (i.e., no missing fields) - If not
Fields.CITATION
-> return message indicating missing fields needed to create a minimal citation or None.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, I find the udpated get_missing_citation_fields()
much easier to parse as well. Makes sense to me now that the existence of Fields.CITATION
in paper_data
is enough to immediately return None
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great fix!
assert Record.missing_citation_fields(metadata) == ( | ||
"Citation was not provided and/or " | ||
f"missing required fields to generate minimal citation: ['{Fields.CITATION}', '{Fields.TITLE}', '{Fields.JOURNAL}']." | ||
) | ||
|
||
def test_is_record_creatable(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test should be multiple tests given the "1 reason to fail" principle
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See renamed and split tests here.
33d7813
to
cf9c2a5
Compare
135a344
to
53628b5
Compare
Redeployed the app in staging and ran a test import.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! Seems like a good solution to ensure these records continue to process, without much of an overhaul of that processing logic. Thanks for the additional test that shows an example outputted citation.
self.assertEqual( | ||
citation, | ||
"Wilczek, F. (1973). Ultraviolet behavior of " | ||
"non-abelian gauge theories. <JOURNAL-NAME UNIDENTIFIED>, " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Why these changes are being introduced: * Publications should be imported even when Solenoid can't generate a "minimal" citation since users of Solenoid can manually enter this missing information when editing the email request. How this addresses that need: * Differentiate methods for identifying missing citation fields vs. required id fields * Improve logging to communicate what the missing fields are (if any) * Update method for generating citations to denote missing citation fields as "UNIDENTIFIED" * Allow addition of records missing citation fields to database Side effects of this change: * Imports that previously failed due to missing citation fields should now be successfully imported to Solenoid. Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/INFRA-438
4273374
to
8d67c10
Compare
Purpose and background context
Allow imports for publications that are missing fields required to generate a minimal citation.
How can a reviewer manually see the effects of these changes?
A couple unit tests were updated.
Test imports were performed via Solenoid app in staging
Prerequisites: I manually deployed the staging app from this branch and updated some config vars to point to the Prod instance of Solenoid (
DJANGO_ELEMENTS_ENDPOINT
,DJANGO_ELEMENTS_PASSWORD
,DJANGO_ELEMENTS_USER
) since the records that could not be imported due to missing citation fields can be found in Prod.Note: Solenoid performs read-only operations from Symplectic Elements, so reading from Prod shouldn't cause any issues.
To check that the updates work as expected, we can try importing publications for the authors listed in the spreadsheet shared by stakeholders. The following authors' Elements IDs can be used for testing:
IMPORTANT NOTE: You can run test imports using the IDs above, but only use Solenoid in staging.
Running the import for author #12338 was a success ✅
See log. Note: The logged message has been fixed to remove the unintended spacing.
Running the import for author #11259 was a success ✅
See log
Includes new or updated dependencies?
YES but only to update dependencies with
make update
Changes expectations for external applications?
NO
What are the relevant tickets?
Developer
Code Reviewer(s)