-
Notifications
You must be signed in to change notification settings - Fork 6
Data migration from the old cantus to the new cantus
Scripts for this data migration process are written as Django management commands. The general process is
- Collect all the object IDs of a given model and write them into a text file.
- Run the corresponding migration script.
CantusDB models include several fields that store calculated values, which serve to optimize page load times by pre-computing expensive calculations:
Source.number_of_chants
Source.number_of_melodies
Chant.next_chant
Chant.is_last_chant_in_feast
Chant.volpiano_notes
Chant.volpiano_intervals
These fields are automatically updated as the database changes (the code that updates these is found in main_app/signals.py
), but these fields must be populated with initial values when external data is added directly to the database. There exist several management commands (found in main_app/management/commands
) that populate these fields for all items in the database. The necessary management commands are outlined in the Details for each model section below.
Information about each indexer can be exported from the old Cantus website using its json_export
feature, like this: https://cantus.uwaterloo.ca/json-node/124108. The last part of the url is the indexer object ID.
To obtain a list of all indexers' IDs, run SELECT nid FROM node WHERE `type`='indexer’;
in mysql on the old Cantus server. Save the list of IDs into a text file and then run python manage.py sync_indexers --remove_extra
. The remove_extra
flag deletes the extra indexer objects, which are no longer present in the old Cantus DB, that we may have in the new Cantus.
Similar to Indexer
, information about each source can be exported using the json_export, like this: https://cantus.uwaterloo.ca/json-node/123931 for the source whose ID is 123931
.
To obtain a list of all sources' IDs, runSELECT nid FROM node WHERE `type`='source’;
in mysql on the old Cantus server. Save the list of IDs into a text file and then run python manage.py sync_sources <source_id> --remove_extra
. It takes an extra argument <source_id>
, which allows you to specify one specific source to update. To update all sources, use "all" in place of the <source_id>
.
Sources' .number_of_chants
and .number_of_melodies
fields must be populated by running python manage.py populate_chant_and_melody_counts
The json_export
is not available for all model types. For genres, the migration script directly scrapes the genre list page. There is thus no need to obtain all the IDs beforehand. Just run python manage.py sync_genres
and it's done.
Similar to Genre
, there is no json export for feasts. We need to
- scrape the feast list page(s) to obtain a list of all feast IDs and save it to a text file.
- Using each ID in the ID list, scrape the corresponding feast detail page for other information.
Since there is no designated table for feast objects in the old Cantus database, feast IDs are directly scraped from the feast list page. Both steps are done automatically by running one single command python manage.py sync_feasts all --remove_extra
.
After data is initially imported from OldCantus, Feasts' prefix
property must populated by running python manage.py add_prefix
Similar to Indexer
and Source
, all information about chant objects can be obtained from the json export: https://cantus.uwaterloo.ca/json-node/243157 for the chant with an ID of 243157
.
To obtain a list of all chants' IDs, runSELECT nid FROM node WHERE `type`='chant’;
in mysql on the old Cantus server. Save the list of IDs into a text file and then run python manage.py sync_chants <chant_id> --remove_extra
. To update all chants, use "all" in place of the <chant_id>
. There are over 500,000 chants in CantusDB, so migrating the chants is very time-consuming. Expect 30 hours to get it all done.
Chants' .next_chant
, .is_last_chant_in_feast
, .volpiano_notes
and .volpiano_intervals
must then be populated. These fields can be populated using the following commands:
-
.next_chant
:python manage.py populate_next_chant_fields
-
.is_last_chant_in_feast
:python manage.py populate_is_last_chant_in_feast
(this command depends on Chants'.next_chant
fields already being populated, so be sure the previous step has been completed before beginning this step) -
.volpiano_notes
and.volpiano_intervals
:python manage.py populate_volpiano_new_fields
Sequence
is very similar to Chant
. The json export: https://cantus.uwaterloo.ca/json-node/628347 for the sequence with an ID of 628347
. To obtain a list of all sequences' IDs, runSELECT nid FROM node WHERE `type`='sequence’;
. Save the list of IDs into a text file and then run python manage.py sync_sequences <sequence_id> --remove_extra
. To update all sequences, use "all" in place of the <sequence_id>
.
The above process is run on the local machine. After syncing all the data, export the database into json fixtures:
python manage.py dumpdata main_app.indexer -o main_app/fixtures/indexer_fixture.json --indent 2 --verbosity 1
(do this for every model). Note that the chant_fixtures.json
is going to be too big for django to handle. Therefore, after generating the chant fixture, run break_json.py
to break it into smaller segments, so that django can later load them into staging more easily. Check the paths in break_json.py
and make sure they point to where you want before running it.
Finally, scp
the fixtures to staging.
scp fixtures.zip cantuscc.dev:/home/ubuntu/fixtures_0803.zip
If zipped, unzip the folder to main_app/fixtures
unzip fixtures_0803.zip -d code/CantusDB/django/cantusdb_project/main_app
On staging machine, remove all existing data
python manage.py flush
and then load all the fixtures with
bash load_fixtures.sh