add edition field to lwmdb #120

Open
2 tasks
kmcdono2 opened this issue Jun 26, 2023 · 11 comments
Labels
enhancement New feature or request

Comments

@kmcdono2
Member

Summary

Problem: existing combination of publication_code-issue_code-item_code is NOT unique.

Why?
issue_code is based on date, e.g. 18881204 (Dec 4, 1888).

But, there can (sometimes) be multiple editions on the same day.

Currently there is no edition field in the newspaper db, which would solve this problem.

Solution:
Add edition_code to lwmdb at issue level.

Then, adding this to publication_code, issue_code, and item_code would ensure that we have human-understandable unique ids for all items.

Ordering edition_code is not important at this stage, since multiple editions are infrequent and their number is limited (1-3 max?).

Actions

  • create edition_code in issue table
  • use edition_code to create unique ids for items in samples going forward
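
A minimal sketch of how such an identifier could be assembled, for illustration only. The helper name `make_item_id`, the hyphen separator, the position of the edition code, and the sample code values are all assumptions; the thread does not settle on a format.

```python
def make_item_id(publication_code: str, issue_code: str, item_code: str,
                 edition_code: str = "") -> str:
    """Join the codes into one human-readable identifier.

    Hypothetical helper: the separator and the position of edition_code
    are assumptions, not an agreed lwmdb format.
    """
    parts = [publication_code, issue_code]
    if edition_code:
        # Only include the edition when one is recorded.
        parts.append(edition_code)
    parts.append(item_code)
    return "-".join(parts)


# Two items that would collide without an edition code (made-up values).
morning = make_item_id("0000001", "18881204", "art0001", edition_code="e1")
evening = make_item_id("0000001", "18881204", "art0001", edition_code="e2")
```

Leaving the edition out when it is empty keeps identifiers for single-edition days unchanged, which may or may not be the desired behaviour.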

Related Issues and Pull Requests

Updates

  • DATE OF UPDATE:
@DavidBeavan
Member

Tricky, and nice find... I think a deep-dive from the source mets/alto is a good starting point. Can you find an occurrence from HMD or LwM papers (i.e. public) and point us to the files that came from our partners, to see how it's been handled there?

@kmcdono2
Member Author

@mcollardanuy I think you had an example of this from one of the collections? Could you share here?

@mcollardanuy
Collaborator

Hi @kmcdono2, no, I don't have an example: it was just an observation that we thought was worth investigating at some point.

So I think we need to understand whether this is really a problem (or could it be that morning and evening editions had different newspaper codes, for example?), and, if it is, whether this comes from the original data or from us, and how this is handled in the DB (i.e. are there duplicate item codes in the DB or were they removed?).

@DavidBeavan
Member

Right then, @griff-rees has some ideas on how to test that hypothesis

@griff-rees
Collaborator

My approach is twofold:

  • Database implementation (for storing whether we have this data or not, useful as it may well be in the future)
    • edition_name: an optional string (morning, evening etc.)
    • edition_order: a required PositiveSmallIntegerField, likely validated via django_int defaulting to 0.
    • Document this
  • Querying data that might have this property
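
The two proposed fields could look roughly like the sketch below. It is deliberately framework-independent rather than an actual Django model: the class name `Edition` and the validation are assumptions. The 0 to 32767 bound mirrors the range Django documents as safe for `PositiveSmallIntegerField` on all database backends.

```python
from dataclasses import dataclass
from typing import Optional

# Range Django documents as safe for PositiveSmallIntegerField everywhere.
MAX_SMALL_INT = 32767


@dataclass(frozen=True)
class Edition:
    """Hypothetical stand-in for the two proposed issue-level fields."""

    edition_order: int = 0              # required, defaults to 0
    edition_name: Optional[str] = None  # optional, e.g. "morning"

    def __post_init__(self) -> None:
        # Reject values a positive small integer column could not hold.
        if not 0 <= self.edition_order <= MAX_SMALL_INT:
            raise ValueError(f"edition_order out of range: {self.edition_order}")
```

With a default of 0, issues that have no known edition information need no special handling.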

@griff-rees griff-rees added the enhancement New feature or request label Jul 7, 2023
@kmcdono2
Member Author

kmcdono2 commented Jul 7, 2023

Quick comments:

  • The Atlas (digitisednewspapers.net) indicates that there is no edition metadata in any of the UK newspaper corpora they examined. However, they did not use BNA or HMD (and LwM digitization follows HMD protocols), so we need to examine the metadata for those datasets to confirm they are not different from JISC (e.g. what the Atlas reports as the BL/Gale datasets).
  • While in an ideal world I would like to have edition_name and edition_order, if multiple editions do exist in BNA or HMD, we will have to conform to the way these are expressed, which we don't currently have an example of. If there is no accounting for edition in these datasets, we will likely have to resort to simply an un-ordered, un-named edition field because we will not be able to automatically infer this from the existing metadata.

@DavidBeavan
Member

> which we don't currently have an example of

@griff-rees - do we have an example of this? You give us a potential solution, but it's unclear if it's actually a problem we are seeing

@griff-rees
Collaborator

  • @DavidBeavan: no, I have never seen an example of this in the data we have; it came up in conversation, hence this ticket
  • @kmcdono2: if this were to be added and documented, and some sort of warning raised for those cases if someone were to come across this, perhaps that's a generic solution for the future, and at least acknowledges that risk? I wonder if that's at least safer than the ambiguity of duplicate first/second pages etc.
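
One way the suggested warning might work, as a sketch only: the function name `warn_on_duplicate_ids` and the message wording are assumptions, and it operates on a plain list of identifier strings rather than on lwmdb records.

```python
import warnings
from collections import Counter


def warn_on_duplicate_ids(item_ids):
    """Warn about ids that occur more than once and return them sorted."""
    duplicates = sorted(i for i, n in Counter(item_ids).items() if n > 1)
    for item_id in duplicates:
        warnings.warn(
            f"Ambiguous item id (possible multiple editions): {item_id}"
        )
    return duplicates
```

A warning rather than an exception lets an import or export continue while still flagging the ambiguous records.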

@kmcdono2
Member Author

@griff-rees can we look for any publication_code-issue_code-item_codes that are not unique? Is there a query to do that easily? We just need to know if this exists at all.
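
A check of this kind can be expressed as a `GROUP BY ... HAVING COUNT(*) > 1` query. The sketch below runs it against an in-memory SQLite table with made-up rows. The table name `item` and the flat three-column layout are assumptions; in the real database the codes may live on related tables, in which case joins would be needed first.

```python
import sqlite3

# Hypothetical flat table holding the three codes for each item.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE item (publication_code TEXT, issue_code TEXT, item_code TEXT)"
)
conn.executemany(
    "INSERT INTO item VALUES (?, ?, ?)",
    [
        ("0000001", "18881204", "art0001"),
        ("0000001", "18881204", "art0001"),  # deliberate duplicate
        ("0000001", "18881204", "art0002"),
    ],
)

# Any row returned is a code combination shared by more than one item.
DUPLICATES_SQL = """
    SELECT publication_code, issue_code, item_code, COUNT(*) AS n
    FROM item
    GROUP BY publication_code, issue_code, item_code
    HAVING COUNT(*) > 1
"""
duplicates = conn.execute(DUPLICATES_SQL).fetchall()
```

An empty result would suggest the combination is already unique in the data at hand.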

@mialondon
Contributor

I noticed this on Slack and thought I'd chime in - FMP don't digitise more than one edition per day, as it's just not worth it for them. Newspaper scholars would prefer that they did, of course, but I can see why they don't.

@mialondon
Contributor

If you do have any examples of multiple digitised editions for the same day, I can ask about how they're distinguished in the BNA / BL catalogue (also how they came to exist).

5 participants