TITLEs need to be reconciled with Bibliographic records where possible #26

Open
1 task
alliyya opened this issue Feb 23, 2022 · 1 comment
Labels: project:bibliography extraction (related to extraction of bibliography entries); project:writing extraction (related to extraction of writing entries)

Comments


alliyya commented Feb 23, 2022

alliyya added the project:writing extraction and project:bibliography extraction labels on Feb 23, 2022
alliyya self-assigned this on Feb 23, 2022

alliyya commented Oct 30, 2024

Note:

  • Always use the REF value of a TITLE tag as its URI when one is available (rarely available).
  • Use the REG value of a TITLE tag to check the maps (sometimes available).
  • Generic titles are excluded from all title mappings except the entry's individual textscope map (ex. "Songs", "Selected Stories", "Essays", "Autobiography", "A Book", "").
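
The notes above could be sketched roughly as follows. This is only an illustration, not the project's actual code: `GENERIC_TITLES`, `is_generic`, `lookup_key`, and `build_shared_mapping` are hypothetical names, the set holds only the examples listed above, and exact matching after trimming whitespace is an assumption.

```python
from typing import Dict, Iterable, Optional, Tuple

# Only the examples named in the note; the real list is presumably longer.
GENERIC_TITLES = frozenset(
    {"Songs", "Selected Stories", "Essays", "Autobiography", "A Book", ""}
)


def is_generic(title: str) -> bool:
    """Return True for titles too generic to map across entries."""
    # Assumption: comparison is exact (case-sensitive) after trimming whitespace.
    return title.strip() in GENERIC_TITLES


def lookup_key(text: str, reg: Optional[str] = None) -> str:
    """Prefer the REG value of a TITLE tag over its text when checking maps."""
    return (reg or text).strip()


def build_shared_mapping(rows: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Build a title -> URI mapping shared across entries, skipping generic titles."""
    return {title.strip(): uri for title, uri in rows if not is_generic(title)}
```

An entry's own textscope map would be built without the `is_generic` filter, since generic titles are still allowed there.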

Updated strategy as of October 30th:

  1. Create a textscope map: when extracting data from an entry, create a map of all textscopes for that particular entry. These texts are guaranteed to be written by the entry's subject.
    • All TITLE tags that match a textscope in the entry are treated as the same entity as that textscope.
    • EXCEPTION: a TITLE tag is not matched if it is prefaced with <NAME>'s and that name's STANDARD value doesn't match the entry's standard name.
      • Ex. <NAME STANDARD="Brontë, Emily" REF="https://commons.cwrc.ca/orlando:8e599794-3bb5-47d6-87dc-713ce02c2a7f">Emily Brontë</NAME>'s <TITLE TITLETYPE="MONOGRAPHIC">Wuthering Heights</TITLE> appears within Jessie Fothergill's entry, which has the textscope <TEXTSCOPE DBREF="34165" PLACEHOLDER="JF, Wuthering Heights, 1887" REF="https://commons.cwrc.ca/orlando:e7558af8-6a97-4f17-960f-b601585a4ade"/>. These two Wuthering Heights will have different identifiers.
  2. Use title mapping lvl 1, which is built on the above assumption that textscopes are always by the entry's subject. It lists the titles and URIs of all textscope texts across all entries. Rows with duplicate titles and/or URIs are eliminated, at least until they can be fully reviewed. ~7740 titles are available to map to.

  3. Use title mapping lvl 2, which is built by summarizing and then simplifying all the bibliographic records and eliminating rows that have duplicate titles. ~27,000+ titles.

  4. Create a temporary URI based on the title text. Ex. <TITLE>Fake Title</TITLE> will become data:Fake_Title_TITLE
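
Taken together, the strategy reads as a fallback chain, which might look something like the sketch below. It is not the actual extraction code: `resolve_title`, `temporary_uri`, and every parameter name are hypothetical, the maps are assumed to be plain title-to-URI dictionaries, and the REG value is assumed to be used as the lookup key for all three maps.

```python
from typing import Mapping, Optional


def temporary_uri(title_text: str) -> str:
    """Build a placeholder URI, e.g. 'Fake Title' -> 'data:Fake_Title_TITLE'."""
    # Assumption: only whitespace is replaced; other characters are kept as-is.
    return "data:" + "_".join(title_text.split()) + "_TITLE"


def resolve_title(
    text: str,
    entry_standard_name: str,
    textscope_map: Mapping[str, str],
    level1_map: Mapping[str, str],
    level2_map: Mapping[str, str],
    generic_titles: frozenset = frozenset(),
    ref: Optional[str] = None,
    reg: Optional[str] = None,
    preceding_name: Optional[str] = None,
) -> str:
    """Return a URI for one TITLE tag found while extracting an entry."""
    # A REF on the TITLE tag always wins.
    if ref:
        return ref

    key = (reg or text).strip()

    # Step 1: the entry's own textscope map, unless the title is prefaced
    # with "<NAME>'s" and that name is not the entry's subject.
    attributed_elsewhere = (
        preceding_name is not None and preceding_name != entry_standard_name
    )
    if not attributed_elsewhere and key in textscope_map:
        return textscope_map[key]

    # Steps 2 and 3: the shared mappings, which never contain generic titles.
    if key not in generic_titles:
        for mapping in (level1_map, level2_map):
            if key in mapping:
                return mapping[key]

    # Step 4: fall back to a temporary URI built from the title text.
    return temporary_uri(text)
```

With Jessie Fothergill's textscope map holding "Wuthering Heights", a plain TITLE in her entry resolves to the textscope URI, while the same call with `preceding_name="Brontë, Emily"` skips that map and, if neither shared mapping has the title, ends up with a temporary URI instead.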

Next steps:

  • Evaluate, analyze matches and tweak process as needed
    • Possibly leverage <NAME>'s <TITLE> pattern to create another mapping step
    • We may also want to break up mapping files further for time efficiency.
    • Add a list of titles of interest, to ensure these titles are always mapped
    • Add a list of alternative titles
    • Discuss approach with team
  • Streamline the process for creating and cleaning these mapping files
    • The clean-up has so far been done mostly via spreadsheet; this process needs to be fully scripted
  • Review duplicate records
  • Add reconciled entities to bibliographic records
  • ~130 titles from this sheet are reconciled to VIAF and ~200 to WorldCat, but the handful of WorldCat links I clicked on seemed to be broken URLs, so those will need further cleaning.
  • Create a query to list the most popular titles, and reconcile this list against external authorities.
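
For the scripting item above, the duplicate-elimination part of the mapping clean-up could start from something like this sketch. `split_unique_rows` is a hypothetical helper, and it assumes that every row sharing a title or a URI with another row is set aside for review, rather than keeping the first occurrence.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Row = Tuple[str, str]  # (title, uri)


def split_unique_rows(rows: Iterable[Row]) -> Tuple[List[Row], List[Row]]:
    """Split mapping rows into (kept, for_review).

    A row is kept only if both its title and its URI occur exactly once;
    everything else is held back until it can be reviewed.
    """
    rows = list(rows)
    title_counts = Counter(title for title, _ in rows)
    uri_counts = Counter(uri for _, uri in rows)

    kept: List[Row] = []
    for_review: List[Row] = []
    for title, uri in rows:
        if title_counts[title] > 1 or uri_counts[uri] > 1:
            for_review.append((title, uri))
        else:
            kept.append((title, uri))
    return kept, for_review
```

Writing the `for_review` rows to their own file would keep them available for the duplicate-record review listed above.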
