
Migration Guide: CONTENTdm

Mark Jordan edited this page Apr 26, 2017 · 55 revisions

Overview

In the spring of 2016, Simon Fraser University Library migrated from CONTENTdm to Islandora. In total, we migrated approximately 110 collections containing 1.3 million objects (using CONTENTdm's count of objects). This migration guide describes the processes and tools we used during our migration.

The "Migration workflow" section below describes migrating a single collection. When planning your migration, keep in mind that the workflow will need to be applied to every source collection in your CONTENTdm instance. The more prep work you can do prior to the migration, the faster it will go. If the expiry of your CONTENTdm license is a hard deadline, you should be planning and practising migrations well in advance of that date. Some things you can do to prepare for the migration include:

  • Use the CONTENTdm Collection Inspector or another tool of your choice to get a sense of how consistent your metadata is in terms of date formats, controlled vocabularies, etc. You may want to consider using CONTENTdm's search and replace feature to clean up your metadata, or you can use MIK's metadata manipulators and mappings files to perform cleanup and normalization tasks during the migration. In either case, you will need to know how consistent (or inconsistent) the metadata in your CONTENTdm instance is.
  • Audit your preservation master files to make sure they are named consistently and are organized in ways that MIK can access them. MIK's CONTENTdm toolchains can combine the data extracted from CONTENTdm with your preservation masters, but only if the masters are organized in specific ways.
  • Migrating very large collections can take a long time. Whenever possible, plan your migration so that you reuse the derivatives provided by CONTENTdm, in particular the JP2 and OCR derivatives, rather than having Islandora regenerate them. You may also choose to migrate very large collections in subsets.

Differences between CONTENTdm and Islandora

There are a number of differences between CONTENTdm and Islandora that you should take into account while planning your migration. Some impact the migration directly; others do not, but may inform decisions that you need to make in preparation for the migration:

| CONTENTdm | Islandora | Impact for migration |
| --- | --- | --- |
| Uses a flat metadata structure in which local sites can easily create their own fields. | Stores metadata in MODS XML, although it can be configured to store discovery metadata in any datastream. | Significant because you will need to determine how to handle fields from CONTENTdm that do not map cleanly to standard MODS elements. |
| Only stores files intended for consumption by end users. | Can store all files associated with an object. | Significant because you will need to decide if you are going to store your master files in Islandora or not. |
| Supports hierarchical objects, including books. | Also supports hierarchical objects, but (currently) can't batch ingest them. | Significant because all hierarchical objects will be flattened by MIK. Islandora currently lacks a standard tool for batch ingesting hierarchical compound objects. The Islandora Compound Batch module can only ingest flat compound objects. |

The two platforms also differ in some ways that are not as significant during the migration itself, but that you should consider when planning the information architecture and user experience of your Islandora site:

| CONTENTdm | Islandora | Impact for information architecture/UX |
| --- | --- | --- |
| An object in CONTENTdm can only be in one collection. | Objects in Islandora can be in many collections. | When you ingest an object into Islandora, you ingest it into a single collection. After an object has been ingested, you can use Islandora's management tools to share objects across collections. |
| Provides a fixed set of content types. | Provides a standard set of "content models", but it is possible to create new ones via custom Solution Packs. | All of the most commonly used CONTENTdm content types have equivalent Islandora Solution Packs. |
| Provides advanced search forms at both the site and collection level. | Provides a single site-wide advanced search form. | Having only one advanced search form may influence your CONTENTdm-to-MODS field mappings. If your users have come to rely on collection-level advanced search forms that expose the collection's custom metadata fields, you will need to decide how many collection-specific fields to include in your single Islandora site-wide advanced search form. |

Some CONTENTdm terminology used within MIK

CONTENTdm uses the terms "alias" and "pointer" to refer to a collection's unique identifier and an object's unique identifier, respectively. These two bits of data are visible in object-level URLs, e.g., in the URL

http://content.lib.sfu.ca/cdm/ref/collection/km/id/12895

km is the collection alias and 12895 is the object's pointer. Because an object in CONTENTdm can only be in one collection, the combination of an alias and a pointer uniquely identifies an object.
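
If you need to pull aliases and pointers out of a list of object URLs (for example, to assemble a test set), the split can be sketched in Python. This is only an illustration: it assumes URLs follow the /cdm/ref/collection/<alias>/id/<pointer> pattern shown above, and the function name is hypothetical rather than part of MIK.

```python
from urllib.parse import urlparse


def parse_cdm_reference_url(url):
    """Return (alias, pointer) from a CONTENTdm object-level URL."""
    parts = [part for part in urlparse(url).path.split("/") if part]
    try:
        # Expect .../collection/<alias>/id/<pointer> somewhere in the path.
        start = parts.index("collection")
        alias, marker, pointer = parts[start + 1], parts[start + 2], parts[start + 3]
    except (ValueError, IndexError):
        raise ValueError(f"Not a recognizable CONTENTdm object URL: {url}") from None
    if marker != "id" or not pointer.isdigit():
        raise ValueError(f"Not a recognizable CONTENTdm object URL: {url}")
    return alias, int(pointer)


print(parse_cdm_reference_url("http://content.lib.sfu.ca/cdm/ref/collection/km/id/12895"))
# prints ('km', 12895)
```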

"Alias" and to a lesser extent "pointer" are used within MIK configuration files and in some other places in the MIK documentation.

A third term that is used is "nickname". Nicknames are CONTENTdm's internal names for metadata fields. They usually take the form of abbreviated versions of the labels the CONTENTdm administrator assigns to fields. Examples of a collection's field nicknames are shown below.

Migration workflow

MIK imposes two constraints that directly determine the number of jobs necessary to complete your migration. First, MIK migrates one CONTENTdm collection at a time. This is consistent with Islandora, which ingests content into a single collection. Second, MIK can only migrate one content model at a time. So, for CONTENTdm collections that contain objects that are still images, PDFs, books, etc., you will need to configure and run MIK once for each of those content models.

These two workflow constraints mean that for every source CONTENTdm collection, you will have at least one MIK job to run, and at least one Islandora batch ingest process. For CONTENTdm collections with multiple content models (some images, some videos, some audio), you will need to run MIK once for each type of content.

The steps in migrating a collection using MIK are:

  1. Analyze the CONTENTdm collection
  2. Configure MIK
  3. Test your configuration
  4. Reconfigure MIK if necessary
  5. Retest until happy
  6. Run MIK to perform the full migration
  7. Ingest your packages into Islandora

Even though you will need to configure, test and run MIK for each content model in a CONTENTdm collection, you can reuse much of the configuration across content models.

Note that MIK only migrates objects within collections. It doesn't create collection objects in Islandora. To migrate objects into an Islandora collection, the target collection object must already exist. If you have a small number of collections to migrate, creating each one manually is not onerous, but if you have a very large number of CONTENTdm collections, you may want to check out the [Islandora CONTENTdm Collection Migrator](https://github.com/mjordan/islandora_migrate_cdm_collections) module.

Step 1: Analyzing the source CONTENTdm collection

In preparation for migrating each CONTENTdm collection using MIK, you need to assemble the following information:

  1. how the metadata for the collection's objects should map to MODS, and
  2. how objects in the collection fit into Islandora content models

The CONTENTdm Collection Inspector is a tool that can help you with both of these questions. It lets you query a CONTENTdm collection to get information that will assist in configuring MIK.

Determining your CONTENTdm collection's field mappings

In addition to determining which content types your CONTENTdm source objects have, you will need to decide how to map your source collection's metadata to MODS. The CONTENTdm Collection Inspector can help you with this. MIK uses the human-readable labels in its mappings files, not the nicknames. However, some metadata manipulators, for example NormalizeDate and InsertXmlFromTemplate, use nicknames.

For example, running the following command will show you the field configuration for the collection with the alias 'km':

php cdminspect --inspect=nicknames --alias=km

Field nicknames for Komagata Maru - Continuing the Journey

Field label => field nickname
=============================
Title => title
ਸਿਰਲੇਖ => titlep
Subject => subjec
ਵਿਸ਼ਾ (ਸੰਕੇਤ ਸ਼ਬਦ) => subjea
Description => descri
ਵਰਣਨ => descra
Creator => creato
ਰਚਣਹਾਰ => creata
Publisher => publis
ਪ੍ਰਕਾਸ਼ਕ => publia
Contributors => contri
Date => dateso
Display date => date
ਜ਼ਾਹਰ ਤਾਰੀਖ਼ => displa
Type => type
Format => format
Identifier => identi
Repository => source
Language => langua
Duration => durati
Rights => rights
ਹੱਕ => righta
Level1 => level1
Level2 => level2
Level3 => level3
Full text => full
Archival file => fullrs
OCLC number => dmoclcno
Date created => dmcreated
Date modified => dmmodified
CONTENTdm number => dmrecord
CONTENTdm file name => find

Done.
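
Because mappings files use labels while some manipulators and the --nickname option use nicknames, it can be handy to have the report as a lookup table. The following Python sketch parses a saved copy of the report; it assumes the "label => nickname" layout shown above, and the function name is hypothetical (it is not part of MIK or the Collection Inspector).

```python
def parse_nickname_report(report_text):
    """Return {field label: nickname} from 'label => nickname' lines."""
    nicknames = {}
    for line in report_text.splitlines():
        # Split on the last ' => ' in case a label itself contains one.
        label, separator, nickname = line.rpartition(" => ")
        if not separator:
            continue  # title, rule, blank, or 'Done.' line
        if (label, nickname) == ("Field label", "field nickname"):
            continue  # column header
        nicknames[label.strip()] = nickname.strip()
    return nicknames


sample = """Field label => field nickname
=============================
Title => title
Repository => source
Date => dateso
Display date => date

Done.
"""
print(parse_nickname_report(sample)["Repository"])  # prints source
```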

The MIK Cookbook provides two entries that will get you going with the CONTENTdm Collection Inspector:

Determining what content types are in your CONTENTdm collection

As stated above, MIK can only migrate one content model at a time. For CONTENTdm collections that contain objects of a single content type (all PDFs, or all books, for example), you will only need to configure and run a single MIK job. For CONTENTdm collections that contain objects of multiple content types (e.g., some images, some videos, some audio), you will need to configure and run one MIK job per content type.

In either case, you need to confirm exactly which content types your source objects have so you can map them to the corresponding Islandora content models. The following table illustrates the correspondences between CONTENTdm content types and Islandora content models:

| CONTENTdm content type | Islandora Solution Pack / content model |
| --- | --- |
| JPEG2000 | Large Image Solution Pack |
| Other image formats | Basic Image |
| Compound (Monograph) | Book Solution Pack |
| Newspapers | Newspaper Solution Pack |
| Compound (Document) | Compound Solution Pack |
| Compound PDF | PDF Solution Pack |

CONTENTdm provides a summary of content types under each collection's Admin > Reports > Item types menu, but you can also use the CONTENTdm Collection Inspector to generate an object-level list of which content types are in use. From this list, you can then determine which Islandora content models you will need to enable. For example, to generate a list of content types used in a collection with the alias 'km' and save the list to a file named 'km_types.txt', you would run the following command:

php cdminspect --inspect=object_type --alias=km --output_file=km_types.txt

The output will look something like this:

# cdminspect output for the '/km' collection.
15693,compound,Document
15720,compound,Document
15740,compound,Document
15840,compound,Document
15843,compound,Document
15846,compound,Document
15847,simple,jp2
15848,simple,jp2
15849,simple,jp2
15850,simple,pdf
15851,simple,jp2
15852,simple,pdf
15856,compound,Document
15859,compound,Document
15864,compound,Document
15866,compound,Document
15875,compound,Document
15876,simple,mp4
16030,simple,mp4
16031,simple,pdf
9213,compound,Monograph
513,simple,jp2
514,simple,jp2
515,simple,jp2
516,simple,jp2
517,simple,jp2
518,simple,jp2
519,simple,jp2
520,simple,jp2

From this report we see that there are compound documents, JPEG2000s, PDFs, MP4s, and a monograph in this collection. The group of objects in the CONTENTdm source collection adhering to each of these content types will need to be migrated in a separate MIK job, configured according to the documentation for the CONTENTdm Generic Compound, CONTENTdm Single File (for the JPEG2000 and MP4 objects), and CONTENTdm Books (for the monographs) toolchains.
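
For large collections it is easier to tally the report than to read it. The Python sketch below groups pointers by content type, assuming the three-column "pointer,kind,subtype" layout shown above with "#" comment lines; the function name is hypothetical.

```python
from collections import defaultdict


def group_pointers_by_type(report_text):
    """Return {(kind, subtype): [pointers]} from an object_type report."""
    groups = defaultdict(list)
    for line in report_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the report header
        pointer, kind, subtype = line.split(",", 2)
        groups[(kind, subtype)].append(int(pointer))
    return dict(groups)


sample = """# cdminspect output for the '/km' collection.
15693,compound,Document
15847,simple,jp2
15850,simple,pdf
15876,simple,mp4
9213,compound,Monograph
513,simple,jp2
"""
for (kind, subtype), pointers in sorted(group_pointers_by_type(sample).items()):
    print(kind, subtype, len(pointers))
```

Each group in the result corresponds to one MIK job you will need to configure.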

Step 2: Configure MIK

Now that you know which fields your CONTENTdm source collection uses, and the content types of the objects in your source collection, you are ready to create an MIK configuration file. To do this, you will first need to create a mappings file.

Again, it is important to remember that you may need to create more than one MIK configuration file per source collection; specifically, you will need one configuration file per target Islandora content model. However, you can generally reuse the same mappings file across the jobs for each content model.

Mappings files

Mappings files for CONTENTdm migrations contain the human-readable source field label in the first column and the top-level MODS XML snippet in the second column:

Title,"<titleInfo type=""translated"" lang=""eng""><title>%value%</title></titleInfo>"
ਸਿਰਲੇਖ,"<titleInfo lang=""pan"" script=""Guru""><title>%value%</title></titleInfo>"
Subject,<subject><topic>%value%</topic></subject>
ਵਿਸ਼ਾ (ਸੰਕੇਤ ਸ਼ਬਦ),"<subject lang=""pan"" script=""Guru""><topic>%value%</topic></subject>"
Description,<abstract>%value%</abstract>
ਵਰਣਨ,"<abstract lang=""pan"">%value%</abstract>"
Creator,"<name><namePart>%value%</namePart><role><roleTerm type=""text"" authority=""marcrelator"">creator</roleTerm><roleTerm type=""code"" authority=""marcrelator"">cre</roleTerm></role></name>"
ਰਚਣਹਾਰ,"<name lang=""pan"" script=""Guru""><namePart>%value%</namePart><role><roleTerm type=""text"" authority=""marcrelator"">creator</roleTerm><roleTerm type=""code"" authority=""marcrelator"">cre</roleTerm></role></name>"
Publisher,<originInfo><publisher>%value%</publisher></originInfo>
ਪ੍ਰਕਾਸ਼ਕ,"<originInfo lang=""pan"" script=""Guru""><publisher>%value%</publisher></originInfo>"
Contributors,"<name><namePart>%value%</namePart><role><roleTerm type=""text"" authority=""marcrelator"">contributor</roleTerm><roleTerm type=""code"" authority=""marcrelator"">ctb</roleTerm></role></name>"
Date,"<originInfo><dateIssued encoding=""w3cdtf"" keyDate=""yes"">%value%</dateIssued></originInfo>"
ਜ਼ਾਹਰ ਤਾਰੀਖ਼,"<originInfo lang=""pan"" script=""Guru""><dateIssued>%value%</dateIssued></originInfo>"
Type,<genre>%value%</genre>
Identifier,"<identifier>%value%</identifier>"
Repository,<location><physicalLocation>%value%</physicalLocation></location>
Language,"<language><languageTerm type=""text"">%value%</languageTerm></language>"
Duration,<physicalDescription><extent>%value%</extent></physicalDescription>
Rights,"<accessCondition type=""use and reproduction"">%value%</accessCondition>"
ਹੱਕ,"<accessCondition lang=""pan"" script=""Guru"" type=""use and reproduction"">%value%</accessCondition>"
Level1,"<extension><level_1 type=""SFU custom metadata for the Komagata Maru Continuing the Journal Collection"">%value%</level_1></extension>"
Level2,"<extension><level_2 type=""SFU custom metadata for the Komagata Maru Continuing the Journal Collection"">%value%</level_2></extension>"
Level3,"<extension><level_3 type=""SFU custom metadata for the Komagata Maru Continuing the Journal Collection"">%value%</level_3></extension>"
"null1","<extension><CONTENTdmData></CONTENTdmData></extension>",
"null2","<identifier type='uuid'/>"

While creating your mappings file, you will likely want to get a sense of how consistent your source metadata is, since in some cases you may want to take advantage of two MIK features to improve your metadata during the migration:

  1. to replace object-level values in your source metadata with a single target value
       • See the "null mappings" section of the mappings files documentation.
  2. to apply one or more MIK metadata manipulators to your source data. Manipulators used elsewhere in this guide include SplitRepeatedValues (in the sample .ini file) and SimpleReplace (under "Tips and tricks").
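
Mappings files are easy to break with a stray comma or an unclosed tag, so a quick check before running MIK can save a test cycle. The following Python sketch is not part of MIK; it assumes the two-column layout described above (label, then MODS snippet) and only tests that each snippet is present and well-formed XML. The function name is hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET


def check_mappings(csv_text):
    """Return a list of (row_number, problem) tuples for a mappings file."""
    problems = []
    for row_number, row in enumerate(csv.reader(io.StringIO(csv_text)), start=1):
        if not row:
            continue  # blank line
        snippet = row[1].strip() if len(row) > 1 else ""
        if not snippet:
            problems.append((row_number, "second column is empty"))
            continue
        try:
            # Wrap in a dummy root so the snippet is parsed as a fragment.
            ET.fromstring(f"<root>{snippet}</root>")
        except ET.ParseError as error:
            problems.append((row_number, f"snippet is not well-formed XML: {error}"))
    return problems


sample = '''Title,"<titleInfo type=""translated""><title>%value%</title></titleInfo>"
Subject,<subject><topic>%value%</topic></subject>
Display date,,"<originInfo><dateIssued>%value%</dateIssued></originInfo>"
Type,<genre>%value%</genre
'''
for row_number, problem in check_mappings(sample):
    print(row_number, problem)
```

In this sample, row 3 is flagged because a doubled comma pushes the snippet into a third column, and row 4 because the closing tag is incomplete. A well-formed snippet can still be invalid MODS (a misspelled attribute name, for example), so this complements rather than replaces MODS validation.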

The CONTENTdm Collection Inspector provides a way of getting a list of unique values in each field for a collection. For example, running

php cdminspect --inspect=field_values --nickname=source --alias=km --output_file=kmsources.txt

will result in the unique values for the 'source' field being written to the kmsources.txt file:

# cdminspect output for the '/km' collection.
BC Archives
British Library
City of Vancouver Archives
Library and Archives Canada
National Archives and Records Administration
Nehru Memorial Museum and Library, New Delhi
SFU Library
SFU Library Special Collections and Rare Books
Simon Faser University Library
Simon Fraser University Library
The British Library
The National Archives
University of British Columbia
Unknown
Vancouver Public Library
[blank]
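
Lists like this often contain near-duplicates ("Simon Faser University Library", "The British Library") that are easy to miss by eye. As a rough aid, the Python sketch below flags pairs of values that are suspiciously similar, using the standard library's difflib. The 0.85 threshold and the function name are assumptions chosen for illustration; tune the threshold to your data, and expect some false positives.

```python
from difflib import SequenceMatcher
from itertools import combinations


def find_near_duplicates(values, threshold=0.85):
    """Return (value_a, value_b, similarity) for pairs at or above threshold."""
    pairs = []
    for a, b in combinations(sorted(set(values)), 2):
        similarity = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if similarity >= threshold:
            pairs.append((a, b, round(similarity, 2)))
    return pairs


values = [
    "British Library",
    "Simon Faser University Library",
    "Simon Fraser University Library",
    "The British Library",
    "Unknown",
]
for a, b, similarity in find_near_duplicates(values):
    print(f"{similarity}: {a!r} vs {b!r}")
```

Pairs reported this way are candidates for cleanup, either in CONTENTdm before the migration or with a metadata manipulator during it.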

Create your .ini file

Get derivatives if you can, especially for newspapers.

You will always use the same fetcher and metadata parser, but the filegetter and writers will depend on the content type of the objects you are migrating:

| CONTENTdm content type | Islandora Solution Pack | MIK filegetter | MIK writer |
| --- | --- | --- | --- |
| JPEG2000 | Large Image Solution Pack | CdmSingleFile | CdmSingleFile |
| Other image formats | Basic Image | CdmSingleFile | CdmSingleFile |
| Compound (Monograph) | Book Solution Pack | CdmBooks | CdmBooks |
| Newspapers | Newspaper Solution Pack | CdmNewspapers | CdmNewspapers |
| Compound (Document) | Compound Solution Pack | CdmCompound | CdmCompound |
| Compound PDF | PDF Solution Pack | CdmPhpDocuments | CdmPhpDocuments |

Note that some CONTENTdm toolchains allow you to combine the files retrieved from CONTENTdm with master files stored outside of CONTENTdm and add them to your Islandora objects as OBJ datastreams. [Possible filegetter manipulators to mention here???] Also, in some cases you can generate OBJ datastreams for book and newspaper pages if you have no master files.

[Also, mention the following fetcher manipulators here:

Sample .ini file:

[CONFIG]
config_id = km jp2
last_updated_on = "2016-04-16"
last_update_by = "mj"

[FETCHER]
class = Cdm
; The alias of the CONTENTdm collection.
alias = km
ws_url = "http://content.lib.sfu.ca:81/dmwebservices/index.php?q="
temp_directory = "m:\production_loads\km_2_jp2_mods\temp"
; 'record_key' should always be 'pointer' for CONTENTdm fetchers.
record_key = pointer

[METADATA_PARSER]
class = mods\CdmToMods
alias = km
ws_url = "http://content.lib.sfu.ca:81/dmwebservices/index.php?q="
mapping_csv_path = 'extras/sfu/mappings_files/km_2_mappings.csv'
include_migrated_from_uri = TRUE

[FILE_GETTER]
class = CdmSingleFile
alias = km
input_directories[] = "t:\filestore\km\tiffs"
; input_directories[] = 
ws_url = "http://content.lib.sfu.ca:81/dmwebservices/index.php?q="
utils_url = "http://content.lib.sfu.ca/utils/"
temp_directory = "m:\production_loads\km_2_jp2_mods\temp"

[WRITER]
class = CdmSingleFile
alias = km
ws_url = "http://content.lib.sfu.ca:81/dmwebservices/index.php?q="
output_directory = "m:\production_loads\km_jp2_mods"
; Leave blank for Cdm single file objects (the MIK writer assigns the filename).
metadata_filename =
datastreams[] = MODS
datastreams[] = OBJ

[MANIPULATORS]
; fetchermanipulators[] = "RandomSet|10"
; fetchermanipulators[] = "SpecificSet|kn_jp2_test.pointers"
fetchermanipulators[] = "CdmNoParent"
fetchermanipulators[] = "CdmSingleFileByExtension|jp2"
metadatamanipulators[] = "SplitRepeatedValues|Subject|/subject/topic|;"
metadatamanipulators[] = "AddUuidToMods"
metadatamanipulators[] = "AddContentdmData"

[LOGGING]
; Full path to log file for general mik log file.
path_to_log = "m:\production_loads\km_jp2_mods\mik.log"
; Full path to log file for manipulators.
path_to_manipulator_log = "m:\production_loads\km_jp2_mods\manipulator.log"
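
Several settings, such as alias and ws_url, are repeated across sections, and a copy-and-paste slip in one of them can be hard to spot. The Python sketch below looks for such mismatches. It is an illustration only: MIK itself reads its .ini files with PHP, Python's configparser is a different parser that happens to cope with this layout, and the function name is hypothetical.

```python
import configparser


def find_mismatched_settings(ini_text, keys=("alias", "ws_url")):
    """Return {key: {section: value}} for keys whose values differ between sections."""
    # strict=False tolerates repeated keys such as datastreams[]; only the last
    # value of a repeated key is kept, which is fine for this check.
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.optionxform = str  # keep key names exactly as written
    parser.read_string(ini_text)
    mismatches = {}
    for key in keys:
        found = {
            section: parser[section][key]
            for section in parser.sections()
            if key in parser[section]
        }
        if len(set(found.values())) > 1:
            mismatches[key] = found
    return mismatches


sample = """
[FETCHER]
class = Cdm
alias = km

[WRITER]
class = CdmSingleFile
alias = kn
datastreams[] = MODS
datastreams[] = OBJ
"""
print(find_mismatched_settings(sample))
```

An empty result means every section that sets a given key uses the same value.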

Useful post-write hooks

Step 3: Testing your configuration

Validate MODS frequently.
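
As a lightweight first pass before full schema validation, you can at least confirm that every MODS file in a test run parses and has the expected root element. The Python sketch below assumes that MIK's output directory contains the MODS as .xml files and that each file's root is a mods element in the standard MODS namespace; both are assumptions to check against your own output, and the function name is hypothetical. It does not validate against the MODS schema.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

MODS_NAMESPACE = "http://www.loc.gov/mods/v3"


def find_suspect_mods(directory):
    """Return {file path: problem} for .xml files that fail the basic checks."""
    problems = {}
    for path in sorted(Path(directory).rglob("*.xml")):
        try:
            root = ET.parse(str(path)).getroot()
        except ET.ParseError as error:
            problems[str(path)] = f"not well-formed: {error}"
            continue
        if root.tag != f"{{{MODS_NAMESPACE}}}mods":
            problems[str(path)] = f"unexpected root element: {root.tag}"
    return problems
```

Calling find_suspect_mods() on your output directory returns an empty dictionary when every file passes.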

Steps 4 and 5: Reconfigure and retest

Reconfiguring and retesting is an iterative process. Metadata is the hardest part to get right (datastreams[] = MODS).

https://github.com/MarcusBarnes/mik/wiki/Cookbook:-Verifying-that-your-Islandora-ingest-packages-contain-all-expected-files

Step 6: Performing the migration with MIK

Drush is best. May be necessary to split up into smaller jobs.

Step 7: Ingesting into Islandora

Once MIK has finished running, your ingest packages are ready to load into your Islandora site, either via the web interface for small sets of data, or via Drush for larger sets.

You can use MIK to generate smaller sets from large collections using the SpecificSet fetcher manipulator.
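
Splitting a collection's pointers into fixed-size batches can be sketched in Python as follows. The function names are hypothetical, and the rendered text assumes that the SpecificSet input file lists one pointer per line; confirm the expected file format in the MIK documentation before relying on it.

```python
def split_into_batches(pointers, batch_size):
    """Return consecutive slices of pointers, each at most batch_size long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [pointers[i:i + batch_size] for i in range(0, len(pointers), batch_size)]


def render_pointer_file(batch):
    """Render one batch as text, one pointer per line (assumed file format)."""
    return "\n".join(str(pointer) for pointer in batch) + "\n"


batches = split_into_batches([513, 514, 515, 516, 517], 2)
print(batches)  # prints [[513, 514], [515, 516], [517]]
```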

Tips and tricks

  • Use the SpecificSet fetcher manipulator to rerun failed objects, whose pointers you can get from MIK's problem_records.log.
  • Use the SimpleReplace metadata manipulator to replace misspellings and other errors in your metadata during migration.
  • The Islandora Book and Newspaper Batch modules can load existing page-level derivatives, such as OCR, JP2, and TN. Reusing these derivatives instead of having Islandora generate them on ingest speeds up batch ingest substantially.