Empty or invalid publication dates not caught by importer, leaving gaps in sources content block #763
Comments
@tlongers Roger that, I can add some checks in the import. What, if anything, would you like me to do with the sources sans dates that have already been imported?
I think we will need to re-run the importer completely; until then, let's leave it be.
@tlongers Should we log errors for any of the date fields on the source (published, created, uploaded), or just the publication date?
For now, the importer should report an error if a value in Down the line it will need to be stricter and flag empties. To make that workable we may need, in our sheets, to implement proper datatype-specific empty values ("typed-missing"?) as well, so we can differentiate between an empty value that is erroneous and one that is deliberate. Something like
Gotcha – I've added logging for this in #766. Let's try it out with the next data import, which I'll follow up on via email.
We weren't consistently receiving error messages for publication date on the last import. Two reasons:
I revised the source import method to create or retrieve an existing access point, then update it with the relevant source information, in a2906e2. I also fixed the bug in point 2. Now we're consistently getting publication date logs. Would like to iterate on this with you, @tlongers, once it's ok for the data to be in a state of light flux while we test.
Having those errors in the log will be great. Does the failing test cover everything we need it to?
@tlongers Great question! The fixture has an undated source, but I will update the test to run the import twice to confirm we get errors each time we see an undated source, not just the first time as was the case here.
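For anyone following along, here is a rough sketch of the "import twice" check described above. It is not the project's importer or its test suite: the in-memory `store`, the `error_log` list, and the `id` and `publication_date` field names are all hypothetical stand-ins. The point is only that validating after the create-or-retrieve-then-update step makes the error show up on every run, not just the first.

```python
def import_sources(rows, store, error_log):
    """Import rows into `store`, logging every undated source on every run."""
    for row in rows:
        # Create the record on first sight, or retrieve the existing one.
        record = store.setdefault(row["id"], {})
        # Update it with whatever the sheet currently says.
        record.update(row)
        # Validate after the update, so reruns report the problem again.
        if not (record.get("publication_date") or "").strip():
            error_log.append(f"source {row['id']}: missing publication date")


def undated_errors_after_two_imports():
    """Run the same (hypothetical) fixture through the import twice."""
    rows = [
        {"id": "s1", "publication_date": "2020-10-05"},
        {"id": "s2", "publication_date": ""},  # the undated source
    ]
    store, error_log = {}, []
    import_sources(rows, store, error_log)
    import_sources(rows, store, error_log)
    return error_log
```

If the check only ran when a record was first created, the second pass would add nothing to `error_log`, which is the behaviour the updated test is meant to rule out.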
Thanks. Please close this when that's confirmed.
Looking at the error log from a recent import, the importer doesn't accept the following values from
Just reading through the original issue, this may be expected behaviour if the importer is looking for a trailing hyphen in partial dates. The error condition should be where there is a double hyphen: the single hyphen is acceptable.
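To make the single-versus-double hyphen rule concrete, here is a minimal sketch. It assumes, and this is only an assumption, that a partial date looks like a year or year-month followed by one trailing hyphen (for example `2020-10-`), and that complete or partial dates without a trailing hyphen are also fine. The function name and the returned labels are made up for illustration and are not the importer's actual API.

```python
import re

# Assumed acceptable shapes (hypothetical, not confirmed by the importer):
#   "2020", "2020-10", "2020-10-05"   plain full or partial dates
#   "2020-", "2020-10-"               partial dates with ONE trailing hyphen
PLAIN_DATE = re.compile(r"[0-9]{4}(-[0-9]{2}(-[0-9]{2})?)?")
TRAILING_HYPHEN_PARTIAL = re.compile(r"[0-9]{4}(-[0-9]{2})?-")


def classify_date_value(value: str) -> str:
    """Label a raw date string as acceptable or as one of two error kinds."""
    text = value.strip()
    # A double hyphen anywhere is the error condition named in the thread.
    if "--" in text:
        return "error: double hyphen"
    if PLAIN_DATE.fullmatch(text) or TRAILING_HYPHEN_PARTIAL.fullmatch(text):
        return "ok"
    return "error: unrecognised format"
```

Under these assumptions `2020-10-` passes while `2020-10--` is reported, so the log would separate the deliberate partial dates from the spreadsheet workaround.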
@tlongers IIRC, we did not move forward with fuzzy dates last go-round? #703 (comment)
I think it's partially implemented though: date values in fields like
Oh, I see, @tlongers. Quite possible the source date parsing behaves differently than citation date parsing. I'll take a look today and report back.
Some values in the "Publication Date" column are empty:
(Snipped from here)
These values are drawn from source:publication_timestamp, a field that (largely) contains dates in YYYY-MM-DD format. However, I can see that what these particular sources have in common are source:publication_timestamp values that have these cases: 2020-10--. The double hyphen is a safeguard against spreadsheet autodatatype "help" on dates, but we have not yet decided to use this hack across our data, so it's an invalid form.
In both cases WWIC needs to fail these at the validation stage and log the error so we can fix it.
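A minimal sketch of what "fail at the validation stage and log the error" could look like, assuming rows arrive as dictionaries keyed by column name. Only the source:publication_timestamp column name comes from this issue; the `id` key, the logger name, and both function names are hypothetical, and this is not WWIC's actual importer code. It treats empty values and anything other than a real YYYY-MM-DD date, including 2020-10--, as errors.

```python
import logging
import re
from datetime import date
from typing import Optional

logger = logging.getLogger("importer.sketch")  # hypothetical logger name

FULL_DATE = re.compile(r"([0-9]{4})-([0-9]{2})-([0-9]{2})")


def publication_date_error(value: Optional[str]) -> Optional[str]:
    """Return an error message for a bad value, or None if it is valid."""
    text = (value or "").strip()
    if not text:
        return "publication date is empty"
    match = FULL_DATE.fullmatch(text)
    if match is None:
        return f"publication date {text!r} is not in YYYY-MM-DD format"
    year, month, day = (int(part) for part in match.groups())
    try:
        date(year, month, day)  # rejects impossible dates such as 2020-02-30
    except ValueError:
        return f"publication date {text!r} is not a real calendar date"
    return None


def validate_sources(rows):
    """Split rows into valid rows and (row id, error) pairs, logging each error."""
    valid, errors = [], []
    for row in rows:
        problem = publication_date_error(row.get("source:publication_timestamp"))
        if problem:
            logger.error("source %s: %s", row.get("id"), problem)
            errors.append((row.get("id"), problem))
        else:
            valid.append(row)
    return valid, errors
```

Returning the errors as well as logging them keeps the failed rows out of the sources content block while still leaving a list that can be worked through in the sheets.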