Skip to content

Data migration from the old cantus to the new cantus

Jacob deGroot-Maggetti edited this page Sep 9, 2022 · 31 revisions

Overview

Scripts for this data migration process are written as Django management commands. The general process is

  1. Collect all the object IDs of a given model and write them into a text file.
  2. Run the corresponding migration script.

Memoized values

CantusDB models include several fields that store calculated values, which serve to optimize page load times by pre-computing expensive calculations:

  • Source.number_of_chants
  • Source.number_of_melodies
  • Chant.next_chant
  • Chant.is_last_chant_in_feast
  • Chant.volpiano_notes
  • Chant.volpiano_intervals

These fields are automatically updated as the database changes (the code that updates these is found in main_app/signals.py), but these fields must be populated with initial values when external data is added directly to the database. There exist several management commands (found in main_app/management/commands) that populate these fields for all items in the database. The necessary management commands are outlined in the Details for each model section below.

Details for each model

Indexer

Information about each indexer can be exported from the old Cantus website using its json_export feature, like this: https://cantus.uwaterloo.ca/json-node/124108. The last part of the url is the indexer object ID. To obtain a list of all indexers' IDs, run SELECT nid FROM node WHERE `type`='indexer’; in mysql on the old Cantus server. Save the list of IDs into a text file and then run python manage.py sync_indexers --remove_extra. The remove_extra flag deletes the extra indexer objects, which are no longer present in the old Cantus DB, that we may have in the new Cantus.

Source

Similar to Indexer, information about each source can be exported using the json_export, like this: https://cantus.uwaterloo.ca/json-node/123931 for the source whose ID is 123931.

To obtain a list of all sources' IDs, runSELECT nid FROM node WHERE `type`='source’; in mysql on the old Cantus server. Save the list of IDs into a text file and then run python manage.py sync_sources <source_id> --remove_extra. It takes an extra argument <source_id>, which allows you to specify one specific source to update. To update all sources, use "all" in place of the <source_id>.

Sources' .number_of_chants and .number_of_melodies fields must be populated by running python manage.py populate_chant_and_melody_counts

Genre

The json_export is not available for all model types. For genres, the migration script directly scrapes the genre list page. There is thus no need to obtain all the IDs beforehand. Just run python manage.py sync_genres and it's done.

Feast

Similar to Genre, there is no json export for feasts. We need to

  1. scrape the feast list page(s) to obtain a list of all feast IDs and save it to a text file.
  2. Using each ID in the ID list, scrape the corresponding feast detail page for other information.

Since there is no designated table for feast objects in the old Cantus database, feast IDs are directly scraped from the feast list page. Both steps are done automatically by running one single command python manage.py sync_feasts all --remove_extra.

After data is initially imported from OldCantus, Feasts' prefix property must populated by running python manage.py add_prefix

Chant

Similar to Indexer and Source, all information about chant objects can be obtained from the json export: https://cantus.uwaterloo.ca/json-node/243157 for the chant with an ID of 243157.

To obtain a list of all chants' IDs, runSELECT nid FROM node WHERE `type`='chant’; in mysql on the old Cantus server. Save the list of IDs into a text file and then run python manage.py sync_chants <chant_id> --remove_extra. To update all chants, use "all" in place of the <chant_id>. There are over 500,000 chants in CantusDB, so migrating the chants is very time-consuming. Expect 30 hours to get it all done.

Chants' .next_chant, .is_last_chant_in_feast, .volpiano_notes and .volpiano_intervals must then be populated. These fields can be populated using the following commands:

  • .next_chant: python manage.py populate_next_chant_fields
  • .is_last_chant_in_feast: python manage.py populate_is_last_chant_in_feast (this command depends on Chants' .next_chant fields already being populated, so be sure the previous step has been completed before beginning this step)
  • .volpiano_notes and .volpiano_intervals: python manage.py populate_volpiano_new_fields

Sequence

Sequence is very similar to Chant. The json export: https://cantus.uwaterloo.ca/json-node/628347 for the sequence with an ID of 628347. To obtain a list of all sequences' IDs, runSELECT nid FROM node WHERE `type`='sequence’;. Save the list of IDs into a text file and then run python manage.py sync_sequences <sequence_id> --remove_extra. To update all sequences, use "all" in place of the <sequence_id>.

After syncing all models

On local

The above process is run on the local machine. After syncing all the data, export the database into json fixtures:

python manage.py dumpdata main_app.indexer -o main_app/fixtures/indexer_fixture.json --indent 2 --verbosity 1

(do this for every model). Note that the chant_fixtures.json is going to be too big for django to handle. Therefore, after generating the chant fixture, run break_json.py to break it into smaller segments, so that django can later load them into staging more easily. Check the paths in break_json.py and make sure they point to where you want before running it.

Finally, scp the fixtures to staging.

scp fixtures.zip cantuscc.dev:/home/ubuntu/fixtures_0803.zip

On staging

If zipped, unzip the folder to main_app/fixtures

unzip fixtures_0803.zip -d code/CantusDB/django/cantusdb_project/main_app

On staging machine, remove all existing data

python manage.py flush

and then load all the fixtures with

bash load_fixtures.sh