bibliographic post-processing improvement #49

Open
6 tasks
alliyya opened this issue Feb 19, 2023 · 1 comment
Labels
Conversion: CWRC This is related to the conversion process using the CWRC ontologies. (Classic Branch) project:bibliography extraction related to extraction of bibliography entries type:idea Idea that should be discussed

alliyya commented Feb 19, 2023

Suggestion from Susan to reduce the number of duplicate Works/Expressions:

  • If after processing we end up with
    • Work Title A and Author B with Date C and URI D, plus associated Expression (Record 1)
    • Work Title A and Author B with Date D, which is later than Date C, and URI E, plus associated Expression (Record 2)
  • Then replace URI E with URI D in all triples
  • And delete the Work with URI E
  • And so on for any additional Works whose author/title match those of URI D
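A rough sketch of the steps above, to make the proposal concrete. It works on plain Python tuples rather than on the real triple store, and everything in it is an assumption for illustration: the `Work` record, the function names, matching on lower-cased title and author, and the idea that triples describing a Work have that Work as subject while links from Expressions have it as object.

```python
from collections import defaultdict
from typing import NamedTuple


class Work(NamedTuple):
    """Hypothetical flattened view of a Work; not the actual CWRC model."""
    uri: str
    title: str
    author: str
    date: str  # assumed ISO 8601, so string order matches date order


def build_uri_map(works):
    """Map each later duplicate Work URI to the earliest Work with the same title/author."""
    groups = defaultdict(list)
    for work in works:
        key = (work.title.strip().lower(), work.author.strip().lower())
        groups[key].append(work)

    uri_map = {}
    for group in groups.values():
        # On identical dates min() keeps the first record seen; that case is
        # still an open question in this issue.
        canonical = min(group, key=lambda w: w.date)
        for work in group:
            if work.uri != canonical.uri:
                uri_map[work.uri] = canonical.uri
    return uri_map


def merge_triples(triples, uri_map):
    """Delete triples describing a duplicate Work and repoint references to it."""
    merged = set()
    for s, p, o in triples:
        if s in uri_map:
            continue  # "delete the Work with URI E": drop its own title/date triples
        merged.add((s, p, uri_map.get(o, o)))  # "replace URI E with URI D"
    return merged
```

With Record 1 as `Work("D", ...)` and a later Record 2 as `Work("E", ...)`, `build_uri_map` yields `{"E": "D"}`, and `merge_triples` leaves Record 2's Expression pointing at D.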

This will likely cause too few Works to be created in some cases (e.g. poets who repeatedly published volumes titled Poems, which can only be distinguished by year of publication). So we might want to exclude certain titles, such as Poems, Collected Poems, Works, Complete Works, Collected Works, Essays, Collected Essays, Prose Works, Collected Prose. Grabbing the most frequently recurring words in titles may help us decide on additional ones; this list is just off the top of my head.
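One way the exclusion list and the "most frequently recurring words" idea might look, as a sketch only. `GENERIC_TITLES` is the list from this comment; `STOPWORDS`, `is_generic`, and `frequent_title_words` are hypothetical names, and the input is assumed to be a plain list of title strings.

```python
import re
from collections import Counter

# Titles named in this issue as too generic to merge on.
GENERIC_TITLES = {
    "poems", "collected poems", "works", "complete works", "collected works",
    "essays", "collected essays", "prose works", "collected prose",
}

# Assumed minimal stopword list so that "the" and "of" do not dominate.
STOPWORDS = {"a", "an", "and", "in", "of", "on", "the", "to"}


def is_generic(title):
    """True if the whole title matches an excluded generic title."""
    return title.strip().lower() in GENERIC_TITLES


def frequent_title_words(titles, top_n=10):
    """Return the top_n most frequent non-stopword words across all titles."""
    counts = Counter()
    for title in titles:
        words = re.findall(r"[a-z]+", title.lower())
        counts.update(w for w in words if w not in STOPWORDS)
    return counts.most_common(top_n)
```

The word counts are only a prompt for discussion; deciding which titles actually go on the exclusion list would still be a manual call.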

Does this seem feasible? (I have to admit that one thing I can't get my head around is how we deal with diffs as the files for these things change.)

Looks more feasible than altering the conversion process since the current scripts are a placeholder until CWRC has its new schema in place instead of MODS.

This will likely be a slow process: we have many records, and every Work/Expression has title and timespan triples associated with it, so there would be quite a few triples to delete.

Potentially: I think we could keep the expression from record 2 and have that realize the work from record 1. But would that be too much duplication as you'd still have fairly similar expressions? Expressions might have different edition information associated that might make a difference.

Related questions:

  • What if the date is identical?
  • Multiple authors listed in one vs the other?
  • What happens to genres attached to Record 2? Will they get merged?
  • How will this impact Writing extraction, when an entry references a merged record? Doing lookups with every mention of a work would be expensive.

Next steps:

  • sample queries to get at similar works (to determine how many records this could reduce)
  • further discussion about the above questions and results.
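As a stand-in for the sample queries, a minimal sketch of how the potential reduction could be estimated once title/author pairs have been pulled out of the data. The function name, the `(title, author)` tuple input, and the optional exclusion set are all assumptions; the real queries would run against the triple store.

```python
from collections import Counter


def estimate_reduction(records, excluded_titles=frozenset()):
    """Count Works that would be removed if title/author duplicates were merged.

    records: iterable of (title, author) pairs, one per Work.
    excluded_titles: lower-cased titles that should never be merged.
    """
    keys = []
    for title, author in records:
        norm_title = title.strip().lower()
        if norm_title in excluded_titles:
            continue  # generic titles stay as separate Works
        keys.append((norm_title, author.strip().lower()))
    groups = Counter(keys)
    # Each group of n matching Works collapses to one, removing n - 1.
    return sum(n - 1 for n in groups.values())
```

Running it with and without the exclusion set would show how much of the reduction comes from generic titles such as Poems.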
@alliyya alliyya self-assigned this Feb 19, 2023
@SusanBrown

  • If the date and the publisher are identical then I think we can safely create a single Work and a single Expression for both Record 1 and Record 2.
  • If there are multiple authors then it is probably safest to have multiple Works and multiple Expressions.
    • We should test my assumptions against some examples of this before deciding, but I expect the most likely case here is that the additional author(s) will be in an editor role, or have written an introduction to the other work
    • If that is the case, then it would be best to create a Work for both and to create a link between them to indicate the relationship through the FRBRoo R2 derivative relationship
    • The newer work should have the genres of the older work plus the genres of the newer work, but the genres of the older work should be left as is.
  • We probably need to discuss the impact on writing extraction. If it will be too costly to do this at the extraction phase, then perhaps this could be better handled with cleanup in RS, based on a report of similar entities. Or by having a phase in which we use VERSD on the subset of bibliographic records that are similar, and then do a find and replace of URIs related to merged entities across the entire dataset.
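To check whether these rules are being read correctly, here they are as a small decision function. This is one interpretation, not an agreed design: the `Record` fields, the outcome labels, and the third outcome (same authors but different date or publisher, taken from the earlier suggestion of one Work with separate Expressions) are assumptions, as is comparing author sets for exact equality.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Hypothetical summary of a bibliographic record for comparison."""
    title: str
    authors: frozenset
    date: str
    publisher: str
    genres: frozenset = frozenset()


def decide(older, newer):
    """Pick a merge outcome for two records that already share a title."""
    if older.authors != newer.authors:
        # Differing author lists: keep both Works and link them (FRBRoo R2).
        return "separate_works_linked_by_r2"
    if older.date == newer.date and older.publisher == newer.publisher:
        return "single_work_single_expression"
    # Assumed fallback: merge the Works but keep both Expressions.
    return "single_work_separate_expressions"


def genres_for_newer(older, newer):
    """Newer Work gets older + newer genres; the older Work is left untouched."""
    return older.genres | newer.genres
```

The differing-authors branch should be tested against real examples first, as noted above, since the extra names may turn out to be editors or introduction writers.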

A question of timing: is it best to do this sooner or to wait on the firming up of the CWRC 2.0 biblio schema?
