parse-kangyur-data.py is a script designed to parse the 84000 XML file kangyur-data.xml and incorporate outside data on texts and their translators that fall under the purview of the 84000 translation project. Currently, it compares 84000 data with the BDRC data found in the spreadsheet ATII - Tensative template.xlsx, specifically in the DergeKangyur sheet.
In the process of matching up the data, the script uses "WD_identified_person_matches.xlsx" and "WD_language_attributions.xlsx", where I have matched up persons and texts found in both data sets.
parse-kangyur-data.py parses the XML file kangyur-data.xml, looking through all the elements of the XML document and matching them up by tohoku number with the BDRC spreadsheet. If there are already attributions on the 84000 side for that work, it creates a list of possible individuals from the spreadsheet (based on the BDRC IDs listed in the spreadsheet). It then matches up each attribution by name, if possible. At this point, the script records and outputs spreadsheet files indicating where persons, works, and roles match or do not match between the XML file and the spreadsheet. It also adds the BDRC ID to the spreadsheet.
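The matching pass described above can be illustrated with a minimal sketch. It is not the script's actual code: the function names, the dictionary layouts, and the loose name normalization are assumptions made for illustration, and the IDs used in any example call would be placeholders.

```python
import unicodedata


def normalize(name):
    """Loosely normalize a romanized name: drop diacritics, spaces, punctuation, and case."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if ch.isalnum()).lower()


def match_attributions(xml_attributions, bdrc_rows):
    """Match 84000 attributions to BDRC persons, work by work.

    xml_attributions: {tohoku number: [(84000 person ID, name), ...]}  (hypothetical layout)
    bdrc_rows:        {tohoku number: [(BDRC person ID, name), ...]}   (hypothetical layout)
    """
    matched, unmatched_persons, unmatched_works = [], [], []
    for toh, persons in xml_attributions.items():
        candidates = bdrc_rows.get(toh)
        if candidates is None:
            # No spreadsheet row carries this tohoku number.
            unmatched_works.append(toh)
            continue
        by_name = {normalize(name): bdrc_id for bdrc_id, name in candidates}
        for eft_id, name in persons:
            bdrc_id = by_name.get(normalize(name))
            if bdrc_id is not None:
                matched.append((toh, eft_id, bdrc_id))
            else:
                # Keep the candidates so the match can be resolved by hand later.
                unmatched_persons.append((toh, eft_id, name, dict(candidates)))
    return matched, unmatched_persons, unmatched_works
```

Exact comparison after normalization catches differences in diacritics and capitalization, but not a phonetic Tibetan name set against its transliteration, which is why a hand-compiled match file is still needed.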
After going through the XML data once, it can create a list of all matched persons, adding new entries for unmatched works where a match could be identified. Using this list, it groups all matches of each BDRC ID with 84000 IDs in order to identify duplicates. It then adds the 84000 IDs to the spreadsheet. The updated spreadsheet is then extended with missing entries (tohoku numbers that are duplicates and for which BDRC has information on the duplicate entry), which I identified from previous data output.
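The grouping step that surfaces duplicates can be sketched as follows. This is an illustrative sketch only: the function names and the (84000 ID, BDRC ID) pair layout are assumptions, not the script's own data structures.

```python
from collections import defaultdict


def group_by_bdrc_id(person_matches):
    """Group 84000 person IDs under the BDRC ID they were matched to.

    person_matches: iterable of (84000 person ID, BDRC person ID) pairs (hypothetical layout).
    """
    grouped = defaultdict(set)
    for eft_id, bdrc_id in person_matches:
        grouped[bdrc_id].add(eft_id)
    return {bdrc_id: sorted(ids) for bdrc_id, ids in grouped.items()}


def find_duplicates(grouped):
    """Return only the BDRC IDs matched by more than one distinct 84000 ID."""
    return {bdrc_id: ids for bdrc_id, ids in grouped.items() if len(ids) > 1}
```

Using a set per BDRC ID means that the same person matched on several works is counted once, so only genuinely distinct 84000 IDs are reported as duplicates.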
Finally, the script uses the updated spreadsheet to update the XML file with additional data. If an attribution (person or place) is already in the XML file, it is updated so that the information matches BDRC. If it is not present, attribution elements are added with the role attribute, the 84000 ID as the resource attribute, the name (under the label element), the language (under the lang element), and the matching BDRC work (owl:sameAs with a resource attribute). As it does so, it outputs a spreadsheet of newly added attributions, and finally it outputs the new XML file.
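A minimal sketch of this update-or-add step, using the standard library's xml.etree.ElementTree, is shown here. The element layout follows the description above, but the helper name, the un-namespaced attribution, label, and lang tags, and the reading of owl:sameAs as a child element are assumptions; the real kangyur-data.xml may use namespaces and nesting that differ.

```python
import xml.etree.ElementTree as ET

OWL_NS = "http://www.w3.org/2002/07/owl#"
ET.register_namespace("owl", OWL_NS)


def upsert_attribution(work, role, eft_id, name, lang, bdrc_id):
    """Update the attribution for eft_id under `work`, or add it if missing (hypothetical helper)."""
    for attribution in work.findall("attribution"):
        if attribution.get("resource") == eft_id:
            break
    else:
        # No existing attribution for this 84000 ID: create one.
        attribution = ET.SubElement(work, "attribution", {"resource": eft_id})
    attribution.set("role", role)
    for tag, text in (("label", name), ("lang", lang)):
        child = attribution.find(tag)
        if child is None:
            child = ET.SubElement(attribution, tag)
        child.text = text
    same_as = attribution.find(f"{{{OWL_NS}}}sameAs")
    if same_as is None:
        same_as = ET.SubElement(attribution, f"{{{OWL_NS}}}sameAs")
    same_as.set("resource", bdrc_id)
    return attribution
```

Looking up an existing element before creating one keeps repeated runs from adding the same attribution twice.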
The following files are used as input:
- ATII - Tensative template.xlsx is BDRC's spreadsheet file, downloaded from Google Docs.
- kangyur-data.xml is the original 84000 data file.
- WD_identified_person_matches.xlsx contains matches that I identified by hand and that the script could not identify (these are often transliterated Sanskrit or phonetic Tibetan names).
- WD_language_attributions.xlsx contains languages for the names of persons.
- WD_BDRC_data_with_langs.xlsx was used when I manually compiled the above list.
The script outputs several CSV and spreadsheet files, as follows:
- person_matches.csv contains the 84000 IDs and BDRC IDs of persons that match.
- Discrepancies.xlsx notes where 84000 and BDRC do not match up, as well as possible resolutions. It compiles the following files:
  - unmatched_persons.csv contains the 84000 names and IDs for persons that did not match, as well as a dictionary of possible individuals from the BDRC side.
  - unmatched_works.csv contains a list of works by tohoku number that were not matched to a BDRC work.
  - unattributed_works.csv contains a list of works (by ID) for which no attributions could be found. It overlaps with the above file.
  - matchable_works.csv and attributable_works.csv contain lists of 84000 works that did not match a BDRC work in the spreadsheet, or were missing attributions found in BDRC.
  - discprepant_roles.csv notes instances where a person is assigned different roles (e.g. translatorTib vs. translatorPandita) in 84000 and BDRC for the same text.
- all_person_matches.xlsx is a spreadsheet that combines the data in person_matches.csv and WD_identified_person_matches.xlsx. The sheet 'grouped matches' also indicates duplicate 84000 IDs that match the same BDRC ID.
- WD_BDRC_data.xlsx is a copy of the BDRC spreadsheet (DergeKangyur sheet) with the data I added to it: BDRC IDs for works, 84000 IDs for persons, language attributions, and missing entries for texts that can be filled in based on 84000 data.
- notes_on_discrepant_roles.csv contains my notes on the discrepant roles, based on the colophons translated at 84000.
- new_kangyur_data.xml is the XML data with updated attributions.
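
The role comparison recorded in the discrepant roles output can be sketched roughly as below. The function name, the dictionary keys, and the output column names are assumptions for illustration only; the role values translatorTib and translatorPandita are the ones mentioned above.

```python
def find_discrepant_roles(eft_roles, bdrc_roles):
    """List persons whose role on a given text differs between 84000 and BDRC.

    Both arguments map (tohoku number, BDRC person ID) to a role string (hypothetical layout).
    """
    rows = []
    # Only persons present on both sides for the same text can disagree on a role.
    for key in sorted(eft_roles.keys() & bdrc_roles.keys()):
        if eft_roles[key] != bdrc_roles[key]:
            toh, person_id = key
            rows.append({
                "toh": toh,
                "person": person_id,
                "role_84000": eft_roles[key],
                "role_bdrc": bdrc_roles[key],
            })
    return rows
```

Rows in this shape could be written out with csv.DictWriter to produce a file like the one described in the list.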