Empty or invalid publication dates not caught by importer, leaving gaps in sources content block #763

Open
tlongers opened this issue Jun 9, 2021 · 13 comments

Comments

@tlongers
Member

tlongers commented Jun 9, 2021

Some values in the "Publication Date" column are empty:

image

(Snipped from here)

These values are drawn from source:publication_timestamp, a field that (largely) contains dates in YYYY-MM-DD format. However, what these particular sources have in common is that their source:publication_timestamp values fall into one of two cases:

  • they're empty: likely they should not be empty and we need to fix that in the import sheets; or,
  • they contain a partial date with a trailing double hyphen: 2020-10--. The double hyphen is a safeguard against spreadsheet autodatatype "help" on dates, but we have not yet decided to use this hack across our data, so it's an invalid form.

In both cases WWIC needs to fail these at the validation stage and log the error so we can fix it.
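
A minimal sketch of the check described above, assuming a hypothetical `check_publication_timestamp` helper rather than the importer's actual code. It flags the two cases listed (empty values, and partial dates ending in a double hyphen) and accepts full YYYY-MM-DD dates. How other partial forms should be treated isn't settled here, so the sketch simply reports them as unrecognised.

```python
import re
from datetime import datetime
from typing import Optional


def check_publication_timestamp(value: Optional[str]) -> Optional[str]:
    """Return an error message, or None if the value is acceptable.

    Hypothetical helper for illustration; not the importer's real API.
    """
    text = (value or "").strip()
    if not text:
        return "publication date is empty"
    if text.endswith("--"):
        return f"publication date {text!r} ends in a double hyphen"
    # Require the zero-padded YYYY-MM-DD shape, then confirm it is a real date.
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", text):
        return f"publication date {text!r} is not in YYYY-MM-DD format"
    try:
        datetime.strptime(text, "%Y-%m-%d")
    except ValueError:
        return f"publication date {text!r} is not a valid calendar date"
    return None
```

A caller could log whatever message comes back against the offending row, so the sheet can be corrected before the next import.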

@hancush

hancush commented Jun 10, 2021

@tlongers Roger that, I can add some checks in the import. What, if anything, would you like me to do with the sources sans dates that have already been imported?

@tlongers
Member Author

I think we will need to re-run the importer completely; until then, let's leave it be.

@hancush

hancush commented Jun 11, 2021

@tlongers Should we log errors for any of the date fields on the source (published, created, uploaded), or just the publication date?

@tlongers
Member Author

tlongers commented Jun 11, 2021

For now, the importer should report an error if a value in source:published_timestamp fails a fuzzy check.

Down the line it will need to be stricter and flag empties. To make that workable we may also need, in our sheets, to implement proper datatype-specific empty values ("typed-missing"?) so we can differentiate between an empty value that is erroneous and one that is deliberate. Something like NA_date_?
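
To illustrate the "typed-missing" idea, here is a rough sketch assuming the sentinel is literally the string `NA_date_` suggested above. The function name and the three category labels are hypothetical, and nothing like this is confirmed to exist in the importer.

```python
from typing import Optional

# Hypothetical sentinel, taken from the suggestion in this comment.
NA_DATE = "NA_date_"


def classify_date_cell(value: Optional[str]) -> str:
    """Separate a deliberate gap from an accidental one (illustrative only)."""
    text = (value or "").strip()
    if text == NA_DATE:
        # The sheet author explicitly recorded that no date is available.
        return "deliberately_missing"
    if not text:
        # Nothing was entered at all, so this is probably a data-entry error.
        return "erroneously_empty"
    return "present"
```

With a marker like this, a stricter importer could fail only the `erroneously_empty` rows and let the deliberate gaps through.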

@hancush

hancush commented Jun 14, 2021

Gotcha – I've added logging for this in #766. Let's try it out with the next data import, which I'll follow up on via email.

@hancush

hancush commented Aug 31, 2021

We weren't consistently receiving error messages for publication date on the last import. Two reasons:

  1. We only ran the check when creating a new access point, so we only ever got these messages the first time we saw a source, but not again on subsequent imports.
  2. There was a bug in the validation where publication date could incorrectly register as present.

I revised the source import method to create or retrieve an existing access point, then update it with the relevant source information, in a2906e2. I also fixed the bug in point 2. Now we're consistently getting publication date logs.
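
The shape of that change can be sketched roughly as follows. This is not the code from a2906e2: `import_source`, the dict-backed `store`, and the `id` / `publication_date` keys are all stand-ins for illustration. The point is that validation runs on every import, whether the record is new or already exists.

```python
from typing import Dict, List


def import_source(store: Dict[str, dict], row: dict, errors: List[str]) -> dict:
    """Create or retrieve a record, validate it, then update it (sketch)."""
    # Retrieve the existing record, or create an empty one if it is new.
    access_point = store.setdefault(row["id"], {})
    # Validate on every import, not only when the record is first created.
    if not (row.get("publication_date") or "").strip():
        errors.append(f"source {row['id']}: missing publication date")
    # Update the record with the latest source information.
    access_point.update(row)
    return access_point
```

Calling this twice with the same undated row leaves one record in `store` and two entries in `errors`, which is the behaviour the earlier create-only check was missing.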

Would like to iterate on this with you, @tlongers, once it's ok for the data to be in a state of light flux while we test.

@tlongers
Member Author

tlongers commented Sep 3, 2021

Having those errors in the log will be great. Does the failing test cover everything we need it to?

@hancush

hancush commented Sep 7, 2021

@tlongers Great question! The fixture has an undated source, but I will update the test to run the import twice to confirm we get errors each time we see an undated source, not just the first time as was the case here.

@tlongers
Member Author

tlongers commented Sep 9, 2021

Thanks. Please close this when that's confirmed.

@tlongers
Member Author

tlongers commented Feb 2, 2022

Looking at the error log from a recent import, the importer doesn't accept the following values from source:published_timestamp as valid fuzzy dates yet:

1988-
2013-02-
2017-06-

Just reading through the original issue, this may be expected behaviour if the importer is looking for a trailing hyphen in partial dates. The error condition should be where there is a double hyphen: the single hyphen is acceptable.
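
As a sketch of the rule proposed here (single trailing hyphen accepted, double hyphen rejected), one option is a regular expression. The pattern and the `is_valid_fuzzy_date` name are assumptions, not the importer's existing parser. It checks only the shape of the value, not whether the month or day is in range, and it treats a bare year without a hyphen as invalid, which may or may not be what is wanted.

```python
import re

# A year and hyphen, optionally followed by a month and hyphen, optionally
# followed by a day. Partial dates therefore end in exactly one hyphen.
FUZZY_DATE = re.compile(r"\d{4}-(?:\d{2}-(?:\d{2})?)?")


def is_valid_fuzzy_date(value: str) -> bool:
    """Return True for YYYY-, YYYY-MM- and YYYY-MM-DD (shape check only)."""
    return FUZZY_DATE.fullmatch(value.strip()) is not None
```

Under this pattern the three values listed above pass, while 2020-10-- from the original report fails.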

@hancush

hancush commented Feb 2, 2022

@tlongers IIRC, we did not move forward with fuzzy dates last go-round? #703 (comment)

@tlongers
Member Author

tlongers commented Feb 2, 2022

I think it's partially implemented though: date values in fields like unit:first_cited_date pass if they have a single trailing hyphen.

@hancush

hancush commented Feb 3, 2022

Oh, I see, @tlongers. Quite possible the source date parsing behaves differently than citation date parsing. I'll take a look today and report back.
