Data migration from the old cantus to the new cantus

Lucas edited this page Jul 17, 2023 · 31 revisions

Overview

Scripts for this data migration process are written as Django management commands. The general process is:

  1. Collect all the object IDs of a given model and write them into a text file.
  2. Run the corresponding migration script.
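
The steps above can be sketched in plain Python. This is only an illustration of the pattern the sync commands share, not the actual command code; `sync_object()` and the file name are hypothetical stand-ins:

```python
# Sketch of the general pattern behind the sync_* management commands.
# The real commands are Django management commands; sync_object() here is
# a hypothetical stand-in for the per-object migration logic.

def read_ids(path):
    """Read one object ID per line from a text file, skipping blank lines."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

def sync_object(object_id):
    # In the real commands, this fetches
    # https://cantus.uwaterloo.ca/json-node/<object_id>
    # and creates or updates the corresponding Django model instance.
    print(f"syncing {object_id}")

# Usage: save the IDs (one per line) to a text file, then:
# for object_id in read_ids("indexer_ids.txt"):
#     sync_object(object_id)
```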

Memoized values

CantusDB models include several fields that store pre-computed values, which serve to optimize website responsiveness:

  • Source.number_of_chants
  • Source.number_of_melodies
  • Chant.volpiano_notes
  • Chant.volpiano_intervals
  • Chant.search_vector

These fields are automatically updated as the database changes (the code that updates these is found in main_app/signals.py), but they must be populated with initial values when external data is added directly to the database. To do this, run python manage.py populate_chant_fields, which opens each chant in the database and saves it, thus triggering the necessary receivers in signals.py.
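
Conceptually, the command just re-saves every chant so the post-save receivers recompute the cached fields. A toy stand-in (not the real models or signals; the derivation of `volpiano_notes` shown here is invented for illustration):

```python
# Toy illustration of "open each chant and save it" to refresh cached fields.
# In CantusDB the recomputation happens in main_app/signals.py receivers;
# here save() plays that role, with a made-up derivation of volpiano_notes.

class Chant:
    def __init__(self, volpiano):
        self.volpiano = volpiano
        self.volpiano_notes = None  # cached field, initially unpopulated

    def save(self):
        # Stand-in for the signal receiver: recompute the cached value.
        # (The actual derivation of volpiano_notes is more involved.)
        self.volpiano_notes = "".join(c for c in self.volpiano if c.isalpha())

chants = [Chant("1---f--g--h"), Chant("1---a--b")]
for chant in chants:  # what populate_chant_fields does, in essence
    chant.save()
```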

Two other fields, Chant.next_chant and Chant.is_last_chant_in_feast, similarly store pre-computed values, but they are expensive to calculate. Since they are mostly used to provide rough suggestions for indexers as they add new chants to the database, it is less important that they reflect the current state of the database moment-to-moment. These fields are instead populated by dedicated management commands, outlined in the Details section for the Chant model below. (Eventually, we hope to automate these commands so that they run on their own once, or several times, per week.)

Details for each model

Indexer

Information about each indexer can be exported from the old Cantus website using its json_export feature, like this: https://cantus.uwaterloo.ca/json-node/124108. The last part of the URL is the indexer object ID. To obtain a list of all indexers' IDs, run SELECT nid FROM node WHERE `type`='indexer'; in MySQL on the old Cantus server. Save the list of IDs into a text file and then run python manage.py sync_indexers.
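
The json-node export returns one JSON document per object, which is easy to parse. The payload below is a trimmed, hypothetical example; check the actual export for the real field names:

```python
import json

# Hypothetical, trimmed json-node payload for an indexer; the real export
# from https://cantus.uwaterloo.ca/json-node/<id> contains more fields.
payload = '{"nid": "124108", "title": "Example Indexer", "type": "indexer"}'

record = json.loads(payload)
assert record["type"] == "indexer"   # sanity check on the node type
indexer_id = int(record["nid"])      # the object ID from the URL
```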

Source

Similar to Indexer, information about each source can be exported using the json_export, like this: https://cantus.uwaterloo.ca/json-node/123931 for the source whose ID is 123931.

To obtain a list of all sources' IDs, run SELECT nid FROM node WHERE `type`='source'; in MySQL on the old Cantus server. Save the list of IDs into a text file and then run python manage.py sync_sources <source_id> --remove_extra. The command takes an argument <source_id>, which allows you to specify one specific source to update. To update all sources, use "all" in place of <source_id>.

Sources' .number_of_chants and .number_of_melodies fields must be populated by running python manage.py populate_chant_and_melody_counts.

Genre

The json_export is not available for all model types. For genres, the migration script directly scrapes the genre list page. There is thus no need to obtain all the IDs beforehand. Just run python manage.py sync_genres and it's done.
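
A scraping sketch using only the standard library. The markup below is a stand-in; the real genre list page's HTML is more complex, and sync_genres handles it accordingly:

```python
from html.parser import HTMLParser

# Minimal table-cell scraper: collects the text of every <td> on a page.
# This only illustrates the scraping approach used by sync_genres; the
# actual genre list page's markup differs.
class CellCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_td = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_td = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False

    def handle_data(self, data):
        if self.in_td and data.strip():
            self.cells.append(data.strip())

sample = "<table><tr><td>A</td><td>Antiphon</td></tr></table>"
parser = CellCollector()
parser.feed(sample)
```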

Feast

Similar to Genre, there is no json export for feasts. We need to

  1. Scrape the feast list page(s) to obtain a list of all feast IDs and save it to a text file.
  2. Using each ID in the list, scrape the corresponding feast detail page for the remaining information.

Since there is no designated table for feast objects in the old Cantus database, feast IDs are directly scraped from the feast list page. Both steps are done automatically by running a single command: python manage.py sync_feasts all --remove_extra.

After data is initially imported from OldCantus, Feasts' prefix property must be populated by running python manage.py add_prefix.
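
A minimal sketch of the add_prefix idea, assuming the prefix is derived from the leading characters of the feast code; the authoritative rule lives in the add_prefix command itself:

```python
# Sketch only: assumes a feast's prefix is the leading characters of its
# feast code. Check the add_prefix management command for the actual rule.

def derive_prefix(feast_code, length=2):
    """Take the first `length` characters of the feast code as the prefix."""
    code = str(feast_code)
    return code[:length] if code else ""
```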

Chant

Similar to Indexer and Source, all information about chant objects can be obtained from the json export: https://cantus.uwaterloo.ca/json-node/243157 for the chant with an ID of 243157.

To obtain a list of all chants' IDs, run SELECT nid FROM node WHERE `type`='chant'; in MySQL on the old Cantus server. Save the list of IDs into a text file and then run python manage.py sync_chants <chant_id> --remove_extra. To update all chants, use "all" in place of <chant_id>. There are over 500,000 chants in CantusDB, so migrating the chants is very time-consuming; expect about 30 hours to get it all done.

Chants' .next_chant and .is_last_chant_in_feast fields must then be populated. This can be done using the following commands:

  • .next_chant: python manage.py populate_next_chant_fields
  • .is_last_chant_in_feast: python manage.py populate_is_last_chant_in_feast (this command depends on Chants' .next_chant fields already being populated, so be sure the previous step has been completed before beginning this step)
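
The dependency between the two commands follows from how the fields relate: a chant is the last in its feast when its next chant belongs to a different feast, or doesn't exist. A simplified in-memory sketch (the real commands operate on Django querysets and use more ordering criteria than shown here):

```python
# Simplified sketch: given the chants of one source already in order,
# link each chant to its successor, then mark feast boundaries.
# Real ordering in CantusDB involves more fields; this is illustrative only.

class Chant:
    def __init__(self, cid, feast):
        self.id = cid
        self.feast = feast
        self.next_chant = None
        self.is_last_chant_in_feast = False

chants = [Chant(1, "Nativitas"), Chant(2, "Nativitas"), Chant(3, "Epiphania")]

# populate_next_chant_fields, in essence:
for current, nxt in zip(chants, chants[1:]):
    current.next_chant = nxt

# populate_is_last_chant_in_feast depends on next_chant being set:
for chant in chants:
    nxt = chant.next_chant
    chant.is_last_chant_in_feast = nxt is None or nxt.feast != chant.feast
```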

Sequence

Sequence is very similar to Chant. The json export works the same way: https://cantus.uwaterloo.ca/json-node/628347 for the sequence with an ID of 628347. To obtain a list of all sequences' IDs, run SELECT nid FROM node WHERE `type`='sequence'; in MySQL on the old Cantus server. Save the list of IDs into a text file and then run python manage.py sync_sequences <sequence_id> --remove_extra. To update all sequences, use "all" in place of <sequence_id>.

After syncing all models

On local

The above process is run on the local machine. After syncing all the data, create a database dump (.sql file) and scp it to the staging server:

  • This file can be created by running docker-compose exec postgres pg_dump cantusdb -U cantusdb > ~/cantus_dump.sql or docker exec -i <postgres container id> pg_dump cantusdb -U cantusdb > ~/cantus_dump.sql
  • scp the file into staging: scp <path to sql file>.sql <user>@cantusdb_staging:<new path to sql file>.sql

On staging

Dump data onto staging:

  • Enter the Postgres container: sudo docker-compose exec postgres bash.
  • Access psql: psql postgres cantusdb. The first argument is the database name, the second is the username.
  • If you already have a cantusdb database in your Postgres, run DROP DATABASE cantusdb;
  • Create a database named cantusdb: CREATE DATABASE cantusdb;
  • Quit psql, exit the Postgres container, and run sudo docker-compose exec -T postgres psql -U cantusdb cantusdb < /path/to/cantus_dump.sql to load the dump into your newly created database.
  • Delete the migrations folder in every app: articles, main_app, users.
  • Enter the Django container (sudo docker-compose exec django bash) and make migrations: python manage.py makemigrations users (make sure you do users first, because the two other apps depend on users), python manage.py makemigrations articles, python manage.py makemigrations main_app. This should re-create migrations folders in the apps if you deleted them in the previous step.
  • Migrate: python manage.py migrate.

Change the site domain name in the Django Admin interface:

  • Go to the taxonomy manager on the Admin Navigation menu
  • Go to "Sites" and find the single Site on the page. Ensure its domain is the staging IP address, 206.12.88.113. (Since you just copied the database from your local setup, the domain might be incorrectly set to localhost:8888 or something similar.)
  • Once you've ensured the Sites are configured correctly, restart the docker container: on the staging server, in the same directory as docker-compose.yml, run docker-compose down; docker-compose up -d.
    • Without this step, the request-a-new-password feature sometimes behaves erratically, generating links that lead to other domains. Configuring the domain properly and then restarting the container seems to make it work reliably.

At this point, everything should be set up. Visit the app in a browser and try navigating to various sources, chants, etc. to ensure that everything has loaded correctly. Try initiating the password reset process, and ensure the URL in the email you receive takes you back to a page on 206.12.88.113.

Duplicate Primary Keys

(this section adapted from issue #287)

For newly created objects (across all models) in the database, we use auto-incrementing primary keys: we don't specify a primary key and let Django pick the next available one. For objects that already exist in OldCantus, we migrate the same primary key over, so that their IDs don't change.

After loading the fixtures (or after data migration between old and new Cantus), a duplicate-primary-key error may occur when we try to create new objects in the database. This happens because Postgres's auto-increment sequences still point at values lower than the migrated primary keys, so Django tries to reuse an ID that already exists.

Solution:

  • Run python manage.py sqlsequencereset main_app articles users to generate the necessary SQL statements for resetting the sequences. (The command takes one or more app labels and prints the SQL without executing it.)
  • Go inside the Postgres container and run the SQL statements:
    • docker-compose exec postgres bash
    • psql postgres cantusdb
    • Connect to the cantus database: \c cantusdb
    • Paste the SQL statements and run them.
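
For reference, sqlsequencereset emits statements of this shape for each table, one per model. The table name below follows Django's `<app>_<model>` naming convention; run the command to get the exact statements for your schema:

```sql
-- Example of the kind of statement sqlsequencereset generates for Postgres:
-- point the id sequence at the current maximum primary key.
SELECT setval(pg_get_serial_sequence('"main_app_chant"','id'),
              coalesce(max("id"), 1),
              max("id") IS NOT null)
FROM "main_app_chant";
```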

https://docs.djangoproject.com/en/4.0/ref/django-admin/#sqlsequencereset