
Commit 6bdc6b3

Update CLI commands

Update 'reconcile' command
* Add 'bitstreams' to context object

Update 'additems' command
* Deprecate 'field_map' and 'file_type' args
* Remove unused S3 client
* Add conditions for retrieving 'bitstream_file_paths' from context or provided metadata_csv
* Pull out DSpaceCollection.post_items into CLI command
1 parent 7eb7492 commit 6bdc6b3
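The "Pull out DSpaceCollection.post_items into CLI command" change amounts to the CLI looping over items itself: post each item, record the returned identifiers, then post that item's bitstreams. A minimal runnable sketch of that loop, using simplified stand-in `Item` and client classes rather than the real `dsaps.dspace` ones (the identifiers and file keys below are made up):

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    # Simplified stand-in for dsaps.dspace.Item
    metadata: dict
    bitstreams: list = field(default_factory=list)
    uuid: str = ""
    handle: str = ""


class FakeDSpaceClient:
    # Stand-in for dsaps.dspace.DSpaceClient; returns canned identifiers
    # instead of calling a real DSpace REST API.
    def __init__(self):
        self.posted_bitstreams = []

    def post_item_to_collection(self, collection_uuid, item):
        return ("uuid-1", "handle/1")

    def post_bitstream(self, item_uuid, bitstream):
        self.posted_bitstreams.append((item_uuid, bitstream))


client = FakeDSpaceClient()
items = [Item(metadata={"dc.title": "Test"}, bitstreams=["objects/file-01.pdf"])]

# The loop this commit moves into the 'additems' command: post each item,
# record its new identifiers, then post its bitstreams to the created item.
for item in items:
    item.uuid, item.handle = client.post_item_to_collection("coll-uuid", item)
    for bitstream in item.bitstreams:
        client.post_bitstream(item.uuid, bitstream)

assert items[0].uuid == "uuid-1"
assert client.posted_bitstreams == [("uuid-1", "objects/file-01.pdf")]
```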

9 files changed: +435 −326 lines

Pipfile.lock

+167-99
Some generated files are not rendered by default.

README.md

+25-20
````diff
@@ -16,27 +16,34 @@ Note: Previously, the repository comprised of self-contained scripts that could
 ### Reconciling files with metadata CSV
 
 ```bash
-pipenv run dsaps --url $DSPACE_URL -e $DSPACE_EMAIL -p $DSPACE_PASSWORD reconcile -m <metadata-csv> -o /output -d <content-directory> -t <file-type>
+pipenv run dsaps --config-file $CONFIG_FILE --url $DSPACE_URL -e $DSPACE_EMAIL -p $DSPACE_PASSWORD reconcile -m <metadata-csv> -o /output -d <content-directory>
 ```
 
 ### Creating a new collection within a DSpace community
 
 ```bash
-pipenv run dsaps --url $DSPACE_URL -e $DSPACE_EMAIL -p $DSPACE_PASSWORD newcollection -c <community-handle> -n <collection-name>
+pipenv run dsaps --config-file $CONFIG_FILE --url $DSPACE_URL -e $DSPACE_EMAIL -p $DSPACE_PASSWORD newcollection -c <community-handle> -n <collection-name>
 ```
 
 ### Adding items to a DSpace collection
 
 The command below shows `newcollection` and `additems` being run in conjunction with each other. Note that the invocation must call `newcollection` first. In practice, this is the command that is usually run:
 
 ```bash
-pipenv run dsaps --url $DSPACE_URL -e $DSPACE_EMAIL -p $DSPACE_PASSWORD newcollection -c <community-handle> -n <collection-name> additems -m <metadata-csv> -f config/<field-mapping>.json -d <s3-bucket-name> -t <file-type>
+pipenv run dsaps --config-file $CONFIG_FILE --url $DSPACE_URL -e $DSPACE_EMAIL -p $DSPACE_PASSWORD newcollection -c <community-handle> -n <collection-name> additems -m <metadata-csv> -d <s3-bucket-path>
 ```
 
 ## Environment
 
 ### Required
 
+```shell
+# The file path to the source configuration JSON with settings for bitstream retrieval and field mappings.
+CONFIG_FILE=
+```
+
+### Optional
+
 ```shell
 # The url for the DSpace REST API
 DSPACE_URL=
@@ -58,14 +65,16 @@ All CLI commands can be run with `pipenv run <COMMAND>`.
 Usage: -c [OPTIONS] COMMAND1 [ARGS]... [COMMAND2 [ARGS]...]...
 
 Options:
+  --config-file TEXT   File path to source configuration JSON with settings
+                       for bitstream retrieval and field mappings.  [required]
   --url TEXT           The url for the DSpace REST API. Defaults to env var
-                       DSPACE_URL if not set.  [required]
+                       DSPACE_URL if not set.
   -e, --email TEXT     The email associated with the DSpace user account used
                        for authentication. Defaults to env var DSPACE_EMAIL if
-                       not set.  [required]
+                       not set.
   -p, --password TEXT  The password associated with the DSpace user account
                        used for authentication. Defaults to env var
-                       DSPACE_PASSWORD if not set.  [required]
+                       DSPACE_PASSWORD if not set.
   --help               Show this message and exit.
 
 Commands:
@@ -87,10 +96,10 @@ Usage: -c reconcile [OPTIONS]
     file with a corresponding file in the content directory.
 
   * no_files.csv: File identifiers for entries in metadata CSV file
-  without a corresponding file in the content directory.
+    without a corresponding file in the content directory.
 
   * no_metadata.csv: File identifiers for files in the content directory
-  without a corresponding entry in the metadata CSV file.
+    without a corresponding entry in the metadata CSV file.
 
   * updated-<metadata-csv>.csv: Entries from the metadata CSV file with a
     corresponding file in the content directory.
@@ -101,8 +110,6 @@ Options:
   -o, --output-directory TEXT   The filepath where output files are written.
   -d, --content-directory TEXT  The name of the S3 bucket containing files for
                                 DSpace uploads.  [required]
-  -t, --file-type TEXT          The file type for DSpace uploads (i.e., the
-                                file extension, excluding the dot).
   --help                        Show this message and exit.
 ```
@@ -127,20 +134,18 @@ Usage: -c additems [OPTIONS]
 
   Add items to a DSpace collection.
 
-  The method relies on a CSV file with metadata for uploads, a JSON document
-  that maps metadata to a DSpace schema, and a directory containing the files
-  to be uploaded.
+  The updated metadata CSV file from running 'reconcile' is used for this
+  process. The method will first add an item to the specified DSpace
+  collection. The bitstreams (i.e., files) associated with the item are read
+  from the metadata CSV file, and uploaded to the newly created item on
+  DSpace.
 
 Options:
-  -m, --metadata-csv FILE       The filepath to a CSV file containing metadata
-                                for Dspace uploads.  [required]
-  -f, --field-map FILE          The filepath to a JSON document that maps
-                                columns in the metadata CSV file to a DSpace
-                                schema.  [required]
+  -m, --metadata-csv FILE       File path to a CSV file describing the
+                                metadata and bitstreams for DSpace uploads.
+                                [required]
   -d, --content-directory TEXT  The name of the S3 bucket containing files for
                                 DSpace uploads.  [required]
-  -t, --file-type TEXT          The file type for DSpace uploads (i.e., the
-                                file extension, excluding the dot).
   -r, --ingest-report           Create ingest report for updating other
                                 systems.
   -c, --collection-handle TEXT  The handle identifying a DSpace collection
````
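The new required `CONFIG_FILE` variable points at a source configuration JSON. Its shape can be inferred from config/aspace.json and the `source_config["settings"]` / `source_config["mapping"]` lookups in dsaps/cli.py; a trimmed sketch (the `item_identifier` mapping body is omitted here, not actually empty):

```json
{
  "settings": {
    "bitstream_folders": ["objects"],
    "id_regex": ".*-(\\d*?-\\d*).*$"
  },
  "mapping": {
    "item_identifier": {}
  }
}
```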

config/aspace.json

+1-1
```diff
@@ -3,7 +3,7 @@
     "bitstream_folders": [
       "objects"
     ],
-    "id_regex": ".*-(.*?-.*)\\..*$"
+    "id_regex": ".*-(\\d*?-\\d*).*$"
   },
   "mapping": {
     "item_identifier": {
```
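The updated `id_regex` switches from requiring a literal dot (i.e., a file extension) after the captured identifier to matching digit-based identifiers anywhere in the key. A quick check of both patterns against hypothetical S3 keys (the filenames below are illustrative, not from the repository):

```python
import re

OLD_ID_REGEX = r".*-(.*?-.*)\..*$"   # previous pattern from config/aspace.json
NEW_ID_REGEX = r".*-(\d*?-\d*).*$"   # pattern introduced by this commit

# Hypothetical bitstream keys; the real S3 key layout may differ.
with_extension = "objects/mc-0012-001.pdf"
without_extension = "objects/mc-0012-001"

# Both patterns extract the identifier when an extension is present.
assert re.match(OLD_ID_REGEX, with_extension).group(1) == "0012-001"
assert re.match(NEW_ID_REGEX, with_extension).group(1) == "0012-001"

# The old pattern required a literal dot, so extension-less keys failed;
# the new digit-based pattern still matches.
assert re.match(OLD_ID_REGEX, without_extension) is None
assert re.match(NEW_ID_REGEX, without_extension).group(1) == "0012-001"
```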

dsaps/cli.py

+46-46
```diff
@@ -1,6 +1,5 @@
 import csv
 import datetime
-import json
 import logging
 import os
@@ -10,9 +9,8 @@
 import click
 import structlog
 
-from dsaps import helpers
+from dsaps import dspace, helpers
 from dsaps.s3 import S3Client
-from dsaps.dspace import DSpaceClient, DSpaceCollection
 
 
 logger = structlog.get_logger()
@@ -28,7 +26,10 @@ def validate_path(ctx, param, value):
 
 @click.group(chain=True)
 @click.option(
-    "--config-file", required=True, help="File path to source configuration JSON."
+    "--config-file",
+    envvar="CONFIG_FILE",
+    required=True,
+    help="File path to source configuration JSON with settings for bitstream retrieval and field mappings.",
 )
 @click.option(
     "--url",
@@ -83,10 +84,11 @@ def main(ctx, config_file, url, email, password):
     logger.info("Running process")
     source_config = helpers.load_source_config(config_file)
     if url:
-        dspace_client = DSpaceClient(url)
+        dspace_client = dspace.DSpaceClient(url)
         dspace_client.authenticate(email, password)
         ctx.obj["dspace_client"] = dspace_client
-    ctx.obj["config"] = source_config
+    ctx.obj["source_config"] = source_config
+    logger.info("Initializing S3 client")
     ctx.obj["s3_client"] = S3Client.get_client()
     ctx.obj["start_time"] = perf_counter()
@@ -97,27 +99,14 @@ def main(ctx, config_file, url, email, password):
     "--metadata-csv",
     required=True,
     type=click.Path(exists=True, file_okay=True, dir_okay=False),
-    help="The filepath to a CSV file containing metadata for Dspace uploads.",
-)
-@click.option(
-    "-f",
-    "--field-map",
-    required=True,
-    type=click.Path(exists=True, file_okay=True, dir_okay=False),
-    help="The filepath to a JSON document that maps columns in the metadata CSV file to a DSpace schema.",
+    help="File path to a CSV file describing the metadata and bitstreams for DSpace uploads.",
 )
 @click.option(
     "-d",
     "--content-directory",
     required=True,
     help="The name of the S3 bucket containing files for DSpace uploads.",
 )
-@click.option(
-    "-t",
-    "--file-type",
-    help="The file type for DSpace uploads (i.e., the file extension, excluding the dot).",
-    default="*",
-)
 @click.option(
     "-r",
     "--ingest-report",
@@ -134,41 +123,51 @@ def main(ctx, config_file, url, email, password):
 def additems(
     ctx,
     metadata_csv,
-    field_map,
     content_directory,
-    file_type,
     ingest_report,
     collection_handle,
 ):
     """Add items to a DSpace collection.
 
-    The method relies on a CSV file with metadata for uploads, a JSON document that maps
-    metadata to a DSpace schema, and a directory containing the files to be uploaded.
+    The updated metadata CSV file from running 'reconcile' is used for this process.
+    The method will first add an item to the specified DSpace collection. The bitstreams
+    (i.e., files) associated with the item are read from the metadata CSV file, and
+    uploaded to the newly created item on DSpace.
     """
-    s3_client = ctx.obj["s3_client"]
+    mapping = ctx.obj["source_config"]["mapping"]
     dspace_client = ctx.obj["dspace_client"]
 
     if "collection_uuid" not in ctx.obj and collection_handle is None:
         raise click.UsageError(
-            "collection_handle option must be used or "
-            "additems must be run after newcollection "
-            "command."
+            "Option '--collection-handle' must be used or "
+            "run 'additems' after 'newcollection' command."
         )
     elif "collection_uuid" in ctx.obj:
         collection_uuid = ctx.obj["collection_uuid"]
     else:
         collection_uuid = dspace_client.get_uuid_from_handle(collection_handle)
-    with open(metadata_csv, "r") as csvfile, open(field_map, "r") as jsonfile:
+
+    if metadata_csv is None:
+        raise click.UsageError("Option '--metadata-csv' must be used.")
+
+    dspace_collection = dspace.Collection(uuid=collection_uuid)
+
+    with open(metadata_csv, "r") as csvfile:
         metadata = csv.DictReader(csvfile)
-        mapping = json.load(jsonfile)
-        collection = DSpaceCollection.create_metadata_for_items_from_csv(
-            metadata, mapping
+        dspace_collection = dspace_collection.add_items(metadata, mapping)
+
+    for item in dspace_collection.items:
+        logger.info(f"Posting item: {item}")
+        item_uuid, item_handle = dspace_client.post_item_to_collection(
+            collection_uuid, item
         )
-    for item in collection.items:
-        item.bitstreams_in_directory(content_directory, s3_client, file_type)
-    collection.uuid = collection_uuid
-    for item in collection.post_items(dspace_client):
-        logger.info(item.file_identifier)
+        item.uuid = item_uuid
+        item.handle = item_handle
+        logger.info(f"Item posted: {item_uuid}")
+        for bitstream in item.bitstreams:
+            logger.info(f"Posting bitstream: {bitstream}")
+            dspace_client.post_bitstream(item.uuid, bitstream)
+
     logger.info(
         "Total elapsed: %s",
         str(timedelta(seconds=perf_counter() - ctx.obj["start_time"])),
@@ -192,7 +191,9 @@ def additems(
 def newcollection(ctx, community_handle, collection_name):
     """Create a new DSpace collection within a community."""
     dspace_client = ctx.obj["dspace_client"]
-    collection_uuid = dspace_client.post_coll_to_comm(community_handle, collection_name)
+    collection_uuid = dspace_client.post_collection_to_community(
+        community_handle, collection_name
+    )
     ctx.obj["collection_uuid"] = collection_uuid
 
 
@@ -235,22 +236,21 @@ def reconcile(ctx, metadata_csv, output_directory, content_directory):
     * updated-<metadata-csv>.csv: Entries from the metadata CSV file with a
       corresponding file in the content directory.
     """
-    source_settings = ctx.obj["config"]["settings"]
-    s3_client = ctx.obj["s3_client"]
-    files_dict = helpers.get_files_from_s3(
+    source_settings = ctx.obj["source_config"]["settings"]
+    bitstreams = helpers.get_files_from_s3(
         s3_path=content_directory,
-        s3_client=s3_client,
+        s3_client=ctx.obj["s3_client"],
         bitstream_folders=source_settings.get("bitstream_folders"),
         id_regex=source_settings["id_regex"],
     )
     metadata_ids = helpers.create_metadata_id_list(metadata_csv)
-    metadata_matches = helpers.match_metadata_to_files(files_dict.keys(), metadata_ids)
-    file_matches = helpers.match_files_to_metadata(files_dict.keys(), metadata_ids)
+    metadata_matches = helpers.match_metadata_to_files(bitstreams.keys(), metadata_ids)
+    file_matches = helpers.match_files_to_metadata(bitstreams.keys(), metadata_ids)
     no_files = set(metadata_ids) - set(metadata_matches)
-    no_metadata = set(files_dict.keys()) - set(file_matches)
+    no_metadata = set(bitstreams.keys()) - set(file_matches)
     helpers.create_csv_from_list(no_metadata, f"{output_directory}no_metadata")
     helpers.create_csv_from_list(no_files, f"{output_directory}no_files")
     helpers.create_csv_from_list(metadata_matches, f"{output_directory}metadata_matches")
     helpers.update_metadata_csv(
-        metadata_csv, output_directory, metadata_matches, files_dict
+        metadata_csv, output_directory, metadata_matches, bitstreams
     )
```
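The reconcile command's matching reduces to set arithmetic over file identifiers. A small sketch with made-up identifiers, using plain comprehensions in place of the `helpers.match_*` functions (whose behavior is assumed here to be simple intersection):

```python
# Keys are file identifiers extracted from S3 keys via id_regex;
# values are the matching bitstream file paths (identifiers are illustrative).
bitstreams = {
    "0012-001": ["objects/mc-0012-001.pdf"],
    "0012-002": ["objects/mc-0012-002.pdf"],
}
metadata_ids = ["0012-001", "0012-003"]

# Stand-ins for helpers.match_metadata_to_files / match_files_to_metadata.
metadata_matches = [i for i in metadata_ids if i in bitstreams]
file_matches = [k for k in bitstreams if k in metadata_ids]

no_files = set(metadata_ids) - set(metadata_matches)      # -> no_files.csv
no_metadata = set(bitstreams.keys()) - set(file_matches)  # -> no_metadata.csv

assert no_files == {"0012-003"}
assert no_metadata == {"0012-002"}
```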

0 commit comments