* Add detailed description
* Add templates for 'Required' and 'Optional' environment variables
* Update 'Development' section to provide more details on testing
* Add new section: CLI commands
* Update CLI function 'help' descriptions
* Use noun phrases for command arguments (excl. date and boolean command args)
1 parent 15abb5e, commit 6cca678
Showing 2 changed files with 141 additions and 56 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -2,65 +2,149 @@ | |
|
||
# oai-pmh-harvester | ||
|
||
CLI app for harvesting from repositories using OAI-PMH. | ||
OAI-PMH-Harvester is a Python CLI application for harvesting metadata from repositories (also known as "Data Providers") available via the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). | ||
|
||
## Harvesting | ||
## Development | ||
- To preview a list of available Makefile commands: `make help` | ||
- To install with dev dependencies: `make install` | ||
- To update dependencies: `make update` | ||
- To run unit tests: `make test` | ||
- To lint the repo: `make lint` | ||
- To run the app: `pipenv run oai --help` | ||
|
||
To install and run tests: | ||
### Running the application on your local machine | ||
|
||
- `make install` | ||
- `make test` | ||
Create a virtual environment and install dev dependencies: `make install`. | ||
|
||
To view available commands and main options: | ||
Additional notes: | ||
|
||
- `pipenv run oai --help` | ||
1. To execute the steps below, you can use the following sample URL for an OAI-PMH repo: `https://aspace-staff-dev.mit.edu/oai`. | |
|
||
To run a harvest: | ||
2. To write the output file to an S3 bucket, pass an S3 URI as the `-o/--output-file` argument. | |
* With AWS credentials: | ||
``` | ||
-o s3://<AWS_KEY>:<AWS_SECRET_KEY>@<BUCKET_NAME>/<output-filename>.xml | ||
``` | ||
* Without AWS credentials (if you have your credentials stored locally): | |
``` | ||
-o s3://<BUCKET_NAME>/<output-filename>.xml | ||
``` | ||
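
The two `-o/--output-file` forms above differ only in whether credentials are embedded in the URI. As a rough illustration (not the application's actual code), the sketch below splits such a value into its parts with the standard library. The function name `parse_output_location` and the returned dictionary keys are hypothetical.

```python
from urllib.parse import urlparse


def parse_output_location(output_file: str) -> dict:
    """Split an --output-file value into its parts (hypothetical helper)."""
    parsed = urlparse(output_file)
    if parsed.scheme != "s3":
        # Anything without an s3:// scheme is treated as a local filepath.
        return {"type": "local", "path": output_file}
    return {
        "type": "s3",
        "bucket": parsed.hostname,
        "key": parsed.path.lstrip("/"),
        # Both values are None when credentials are omitted from the URI.
        "access_key": parsed.username,
        "secret_key": parsed.password,
    }
```

Note that this sketch does not handle secret keys containing `/`, which would be cut off at the first slash; how the application itself deals with that case is not stated here.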
- `pipenv run oai -h [host repo oai-pmh url] -o [path to output file] harvest [any additional desired options]` | ||
#### With Docker | ||
## Development | ||
1. Run `make dist-dev` to build the Docker container image. | ||
Clone the repo and install the dependencies using [Pipenv](https://docs.pipenv.org/): | ||
2. To run a harvest, execute the following command in your terminal: | ||
``` | ||
docker run -it --volume <local-file-path>:<docker-file-path> oai-pmh-harvester-dev -h <url-to-oai-pmh-repo> -o <docker-file-path>/<output-filename>.xml harvest <optional-command-args> | |
``` | ||
```bash | ||
git clone git@github.com:MITLibraries/oai-pmh-harvester.git | |
cd oai-pmh-harvester | ||
make install | ||
``` | ||
**Note:** The `-v/--volume` argument mounts the \<local-file-path> in the current directory into the container at \<docker-file-path>, which allows us to view the generated output file in \<local-file-path>. | ||
#### Without Docker | ||
## Docker | ||
1. To run a harvest, execute the following command in your terminal: | ||
To build and run in docker: | ||
``` | ||
pipenv run oai -h <url-to-oai-pmh-repo> -o <output-filename>.xml harvest <optional-command-args> | ||
``` | ||
```bash | ||
make dist-dev | ||
docker run -it oaiharvester | ||
## Environment variables | ||
### Required | ||
```shell | ||
# Set to dev for local development. Terraform sets this to 'stage' and 'prod' in those environments. | |
WORKSPACE=dev | ||
``` | ||
|
||
To run this locally in Docker while maintaining the ability to see the output file, you can do something like: | ||
### Optional | ||
|
||
```shell | ||
# Required only if a source has records that cause errors during a harvest and --method=get. The value provided must be a space-separated list of OAI-PMH record identifiers to skip during harvest. | ||
RECORD_SKIP_LIST=<oai-pmh-id1> <oai-pmh-id2> | ||
|
||
# Sets the interval for logging status updates as records are written to the output file. Defaults to 1000, which will log a status update for every thousandth record. | ||
STATUS_UPDATE_INTERVAL=1000 | |
|
||
```bash | ||
docker run -it --volume '/FULL/PATH/TO/WHERE/YOU/WANT/FILES/tmp:/app/tmp' oaiharvester -h https://aspace-staff-dev.mit.edu/oai -o tmp/out.xml harvest -m oai_ead | ||
# If set to a valid Sentry DSN, enables Sentry exception monitoring. This is not needed for local development. | |
SENTRY_DSN=<sentry-dsn-for-oai-pmh-harvester> | |
``` | ||
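
To make the required/optional split concrete, here is a minimal sketch of how these variables could be read in Python. It follows the descriptions above (space-separated skip list, default interval of 1000), but the function `load_config` and its return shape are hypothetical and are not taken from the application.

```python
import os


def load_config(environ=None) -> dict:
    """Read the documented env vars into a dict (hypothetical helper)."""
    env = os.environ if environ is None else environ
    workspace = env.get("WORKSPACE")
    if not workspace:
        # WORKSPACE is the only variable documented as required.
        raise RuntimeError("WORKSPACE is required (e.g. WORKSPACE=dev)")
    return {
        "workspace": workspace,
        # Space-separated identifiers; an unset variable yields an empty list.
        "record_skip_list": env.get("RECORD_SKIP_LIST", "").split(),
        # Falls back to the documented default of 1000.
        "status_update_interval": int(env.get("STATUS_UPDATE_INTERVAL", "1000")),
        # Treat an empty string the same as unset.
        "sentry_dsn": env.get("SENTRY_DSN") or None,
    }
```

Passing a plain dictionary instead of relying on `os.environ` keeps the sketch easy to exercise in a test.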
|
||
## S3 Output | ||
## CLI commands | ||
|
||
You can save to s3 by passing an s3 url as the --output-file (-o) in a format like: | ||
All CLI commands can be run with `pipenv run <COMMAND>`. | |
|
||
```bash | ||
-o s3://AWS_KEY:AWS_SECRET_KEY@BUCKET_NAME/FILENAME.xml | ||
### `oai` | ||
|
||
```text | ||
Usage: -c [OPTIONS] COMMAND [ARGS]... | ||
Options: | ||
-h, --host TEXT Hostname of server for an OAI-PMH compliant source. | ||
[required] | ||
-o, --output-file TEXT Filepath for generated output (either an XML file | ||
with harvested metadata or a JSON file describing | ||
set structure of an OAI-PMH compliant source). This | ||
value can be a local filepath or an S3 URI. | ||
[required] | ||
-v, --verbose Pass to log at debug level instead of info | ||
--help Show this message and exit. | ||
Commands: | ||
harvest Harvest command to retrieve records from an OAI-PMH compliant source. | ||
setlist Create a JSON file describing the set structure of an OAI-PMH compliant source. | ||
``` | ||
|
||
If you have your credentials stored locally, you can omit the passed params like: | ||
### `oai harvest` | ||
|
||
```text | ||
Usage: -c harvest [OPTIONS] | ||
Harvest command to retrieve records from an OAI-PMH compliant source. | ||
Options: | ||
--method [get|list] Method for record retrieval. The 'list' method | ||
is faster and should be used in most cases; | ||
'get' method should be used for ArchivesSpace | ||
due to errors retrieving a full record set with | ||
the 'list' method. [default: list] | ||
-m, --metadata-format TEXT Alternate metadata format for harvested records. | ||
A record should only be returned if the format | ||
specified can be disseminated from the item | ||
identified by the value of the identifier | ||
argument. [default: oai_dc] | ||
-f, --from-date TEXT Filter for files modified on or after this date; | ||
format YYYY-MM-DD. | ||
-u, --until-date TEXT Filter for files modified before this date; | ||
format YYYY-MM-DD. | ||
-s, --set-spec TEXT SetSpec of set to be harvested. Limits harvest | ||
to records in the provided set. | ||
-sr, --skip-record TEXT Set of OAI-PMH identifiers for records to skip | ||
during a harvest. Only works when --method=get. | ||
Multiple identifiers can be provided using the | ||
syntax: '-sr oai:12345 -sr oai:67890'. Values | ||
can also be retrieved through the | ||
RECORD_SKIP_LIST env var (see README for more | ||
details). | ||
--exclude-deleted Pass to exclude deleted records from harvest. | ||
--help Show this message and exit. | ||
``` | ||
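
For orientation, the harvest options map naturally onto OAI-PMH request arguments. The sketch below builds a `ListRecords` request URL from them, assuming the `list` method corresponds to OAI-PMH `ListRecords` requests (the help text does not say so explicitly). The function `build_list_records_url` is hypothetical and is not the application's code. The query parameter names (`verb`, `metadataPrefix`, `from`, `until`, `set`) are those defined by the OAI-PMH protocol.

```python
from datetime import datetime
from urllib.parse import urlencode


def build_list_records_url(host, metadata_format="oai_dc",
                           from_date=None, until_date=None, set_spec=None):
    """Build an OAI-PMH ListRecords URL from harvest options (hypothetical helper)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_format}
    for name, value in (("from", from_date), ("until", until_date)):
        if value is not None:
            # Raises ValueError if the value is not a valid year-month-day date.
            datetime.strptime(value, "%Y-%m-%d")
            params[name] = value
    if set_spec is not None:
        params["set"] = set_spec
    return f"{host}?{urlencode(params)}"
```

For example, `build_list_records_url("https://aspace-staff-dev.mit.edu/oai", metadata_format="oai_ead")` would produce a request for EAD records from the sample repository mentioned earlier.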
|
||
```bash | ||
-o s3://BUCKET_NAME/FILENAME.xml | ||
### `oai setlist` | ||
``` | ||
Usage: -c setlist [OPTIONS] | ||
Create a JSON file describing the set structure of an OAI-PMH compliant | ||
source. | ||
Uses the OAI-PMH ListSets verbs to retrieve all sets from a repository, and | ||
writes the set names and specs to a JSON output file. | ||
Options: | ||
--help Show this message and exit. | ||
``` | ||
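
As an illustration of what `setlist` describes, the following sketch converts a single OAI-PMH `ListSets` response into JSON containing set names and specs. It is a simplified stand-in, not the application's implementation: the function name `sets_to_json` and the output keys `setName` and `setSpec` are assumptions, and the sketch ignores resumption tokens, so it only covers one page of results.

```python
import json
import xml.etree.ElementTree as ET

# XML namespace used by OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def sets_to_json(list_sets_xml: str) -> str:
    """Convert one ListSets response into a JSON string (hypothetical helper)."""
    root = ET.fromstring(list_sets_xml)
    sets = [
        {
            "setName": element.findtext("oai:setName", namespaces=OAI_NS),
            "setSpec": element.findtext("oai:setSpec", namespaces=OAI_NS),
        }
        for element in root.iterfind(".//oai:set", OAI_NS)
    ]
    return json.dumps(sets, indent=2)
```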
|
||
|
||
## ENV variables | ||
|
||
- `RECORD_SKIP_LIST` = Required if a source has records that cause errors during harvest, otherwise those records will cause the harvest process to crash. Space-separated list of OAI-PMH record identifiers to skip during harvest, e.g. `RECORD_SKIP_LIST=record1 record2`. Note: this only works if the harvest method used is "get". | ||
- `SENTRY_DSN` = Optional in dev. If set to a valid Sentry DSN, enables Sentry exception monitoring. This is not needed for local development. | ||
- `STATUS_UPDATE_INTERVAL` = Optional. The transform process logs the # of records transformed every nth record (1000 by default). Set this env variable to any integer to change the frequency of logging status updates. Can be useful for development/debugging. | ||
- `WORKSPACE` = Required. Set to `dev` for local development, this will be set to `stage` and `prod` in those environments by Terraform. |