Commit

Added a new option `strict` to the `make_bag` API function, along with a corresponding CLI argument. If `strict` is enabled, `make_bag` automatically validates a newly created or updated bag for structural validity and fails if the resulting bag is invalid. This can be used to ensure that a bag is not persisted without payload file manifests. If the created or updated bag is not structurally valid, it is reverted to a normal directory and a `BagValidationError` exception is raised.
Updated unit tests and docs.
mikedarcy committed May 10, 2024
1 parent 5ac6f4b commit f6342e0
Showing 8 changed files with 78 additions and 17 deletions.
6 changes: 5 additions & 1 deletion CHANGELOG.md
@@ -5,7 +5,11 @@
* Fix erroneous encoding of `%` char in URL field of `fetch.txt` which could break already properly encoded URLs.
This was due to a misinterpretation of the spec which states that `%` (along with `CR` and `LF`) should _only_ be URL
encoded for the `filename` field and that whitespace (` ` and `\t`) should _only_ be encoded in the URL field.
* NOTE: As a best practice, applications should always pre-encode URLs that are added to `fetch.txt` and not rely on `bdbag` to do so, since only whitespace will be encoded.
* Added a new option `strict` to the `make_bag` API function, along with a corresponding CLI argument. If `strict` is enabled,
`make_bag` will automatically validate a newly created or updated bag for structural validity and fail if the resultant bag is invalid.
This can be used to ensure that a bag is not persisted without payload file manifests. Additionally, if the created or
updated output bag is not structurally valid, the bag will subsequently be reverted back to a normal directory and a BagValidationError exception will be thrown.

## 1.7.2

14 changes: 13 additions & 1 deletion bdbag/bdbag_api.py
@@ -242,7 +242,8 @@ def make_bag(bag_path,
              config_file=None,
              ro_metadata=None,
              ro_metadata_file=None,
-             idempotent=None):
+             idempotent=None,
+             strict=False):
     bag = None
     try:
         bag = bdbagit.BDBag(bag_path)
@@ -333,6 +334,17 @@ def make_bag(bag_path,
if bag_ro_metadata:
bdbro.serialize_bag_ro_metadata(bag_ro_metadata, bag_path)
bag.save(bag_processes)

+    if strict:
+        try:
+            bag._validate_structure()
+        except bdbagit.BagValidationError as e:
+            error = ("The newly created/updated bag is not structurally valid and strict checking has been requested. "
+                     "The bag will be reverted back to a normal directory. Exception: %s\n") % get_typed_exception(e)
+            logger.error(error)
+            revert_bag(bag_path)
+            raise bdbagit.BagValidationError(error)
+
     return bag
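
The validate-then-revert flow added above can be modelled without bdbag at all. The following is only a toy sketch using the standard library: `BagValidationError`, `is_structurally_valid`, `revert_to_directory`, and `make_bag_sketch` are hypothetical stand-ins written for illustration, and the structural check is a deliberately simplified assumption (a `bagit.txt`, a `data` directory, and at least one payload manifest), not the check that `bag._validate_structure()` actually performs.

```python
import os
import shutil
import tempfile


class BagValidationError(Exception):
    """Hypothetical stand-in for bdbagit.BagValidationError (toy model only)."""


def is_structurally_valid(bag_dir):
    """Simplified check: bagit.txt, a data/ directory, and a payload manifest."""
    has_declaration = os.path.isfile(os.path.join(bag_dir, "bagit.txt"))
    has_data = os.path.isdir(os.path.join(bag_dir, "data"))
    has_manifest = any(
        name.startswith("manifest-") and name.endswith(".txt")
        for name in os.listdir(bag_dir)
    )
    return has_declaration and has_data and has_manifest


def revert_to_directory(bag_dir):
    """Delete top-level tag files, then move payload files out of data/."""
    for name in os.listdir(bag_dir):
        path = os.path.join(bag_dir, name)
        if os.path.isfile(path):
            os.remove(path)
    data_dir = os.path.join(bag_dir, "data")
    if os.path.isdir(data_dir):
        for name in os.listdir(data_dir):
            shutil.move(os.path.join(data_dir, name), os.path.join(bag_dir, name))
        os.rmdir(data_dir)


def make_bag_sketch(bag_dir, write_manifest=True, strict=False):
    """Turn bag_dir into a toy bag; with strict=True, validate and roll back."""
    payload = os.listdir(bag_dir)
    data_dir = os.path.join(bag_dir, "data")
    os.mkdir(data_dir)
    for name in payload:
        shutil.move(os.path.join(bag_dir, name), os.path.join(data_dir, name))
    with open(os.path.join(bag_dir, "bagit.txt"), "w") as f:
        f.write("BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n")
    if write_manifest:
        # A real manifest lists one checksum per payload file; left empty here.
        open(os.path.join(bag_dir, "manifest-md5.txt"), "w").close()
    if strict and not is_structurally_valid(bag_dir):
        revert_to_directory(bag_dir)
        raise BagValidationError("bag is not structurally valid; directory reverted")
    return bag_dir


def demo():
    """Create a manifest-less bag in strict mode and report the outcome."""
    with tempfile.TemporaryDirectory() as tmp:
        open(os.path.join(tmp, "a.txt"), "w").close()
        try:
            make_bag_sketch(tmp, write_manifest=False, strict=True)
            failed = False
        except BagValidationError:
            failed = True
        return failed, sorted(os.listdir(tmp))
```

In this model, a strict run that produces no payload manifest raises the error and leaves the directory as it started, which is the guarantee the commit message describes.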


11 changes: 10 additions & 1 deletion bdbag/bdbag_cli.py
@@ -82,6 +82,14 @@ def parse_cli():
         update_arg, action="store_true",
         help="Update an existing bag dir, regenerating manifests and fetch.txt if necessary.")
 
+    strict_arg = "--strict"
+    standard_args.add_argument(
+        strict_arg, action="store_true",
+        help="Automatically validate a newly created or updated bag for structural validity and fail if the resultant "
+             "bag is invalid. This can be used to ensure that a bag is not persisted without payload file manifests. "
+             "If this flag is set and the created or updated output bag is not structurally valid, the bag will "
+             "subsequently be reverted back to a normal directory and an error returned.")
+
     revert_arg = "--revert"
     standard_args.add_argument(
         revert_arg, action="store_true",
@@ -370,7 +378,8 @@ def main():
 remote_file_manifest=args.remote_file_manifest,
 config_file=args.config_file,
 ro_metadata_file=args.ro_metadata_file,
-idempotent=args.idempotent)
+idempotent=args.idempotent,
+strict=args.strict)

 # otherwise just extract the bag if it is an archive and no other conflicting options specified
 elif not (args.validate or args.validate_profile or args.resolve_fetch):
30 changes: 16 additions & 14 deletions doc/api.md
@@ -189,25 +189,27 @@ make_bag(bag_path,
          config_file=None,
          ro_metadata=None,
          ro_metadata_file=None,
-         idempotent=None)
+         idempotent=None,
+         strict=False)
```
Creates or updates the bag denoted by the `bag_path` argument.

##### Parameters
| Param | Type | Description |
|----------------------|-----------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| bag_path | `string` | A normalized, absolute path to a bag directory. |
| algs | `list` | A list of checksum algorithms to use for calculating file fixities. When creating a bag, only the checksums present in this variable will be used. When updating a bag, this function will take the union of any existing bag algorithms and what is specified by this parameter, ***except*** when the `prune_manifests` parameter is specified, in which case only the algorithms specified by this parameter will be used. |
| update | `boolean` | If `bag_path` represents an existing bag, update it. If this parameter is not specified when invoking this function on an existing bag, the function is essentially a NOOP and will emit a logging message to that effect. |
| save_manifests | `boolean` | Defaults to `True`. If true, saves all manifests, recalculating all checksums and regenerating `fetch.txt`. If false, only tagfile manifest checksums are recalculated. Use this flag as an optimization (to avoid recalculating payload file checksums) when only the bag metadata has been changed. This parameter is only meaningful during update operations, otherwise it is ignored. |
| prune_manifests | `boolean` | Removes any file and tagfile manifests for checksums that are not listed in the `algs` variable. This parameter is only meaningful during update operations, otherwise it is ignored. |
| metadata | `dict` | A dictionary of key-value pairs that will be written directly to the bag's 'bag-info.txt' file. |
| metadata_file | `string` | A JSON file representation of metadata that will be written directly to the bag's 'bag-info.txt' file. The format of this metadata is described [here](./config.md#metadata). |
| remote_file_manifest | `string` | A path to a JSON file representation of remote file entries that will be used to add remote files to the bag file manifest(s) and used to create the bag's `fetch.txt`. The format of this file is described [here](./config.md/#remote-file-manifest). |
| config_file | `string` | A JSON file representation of configuration data that is used during bag creation and update. The format of this file is described [here](./config.md#bdbag.json). |
| ro_metadata | `dict` | A dictionary that will be used to serialize data into one or more JSON files into the bag's `metadata` directory. The format of this metadata is described [here](./config.md#ro_metadata). |
| ro_metadata_file | `string` | A path to a JSON file representation of RO metadata that will be used to serialize data into one or more JSON files into the bag's `metadata` directory. The format of this metadata is described [here](./config.md#ro_metadata). |
| idempotent | `boolean` | If `True`, date and time specific metadata such as `Bagging-Date` and `Bagging-Time` will be _removed_ (if present) from `bag-info.txt`. This value defaults to `False` if not passed via argument. However, a global override default value of `True` can be enabled in the [config file](./config.md). NOTE: use of `ro_metadata` and `ro_metadata_file` in conjunction with `idempotent` is not recommended at this time due to the generated RO Metadata not being compatible with bag idempotency. |
| strict | `boolean` | If `True`, automatically validate a newly created or updated bag for structural validity and fail if the resultant bag is invalid. This can be used to ensure that a bag is not persisted without payload file manifests. Furthermore, if this argument is `True` and the created or updated output bag is not structurally valid, the bag will subsequently be reverted back to a normal directory and a BagValidationError exception is thrown. |

**Returns**: `bag` - An instantiated [bagit-python](https://github.com/LibraryOfCongress/bagit-python/blob/master/bagit.py) `bag` compatible class object.
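
A caller that opts into `strict` will usually want to handle the validation failure explicitly. The helper below is a minimal sketch under stated assumptions: `make_bag_strictly` is a hypothetical name, and the `make_bag` callable and exception type are passed in as arguments so the example stays runnable without bdbag installed.

```python
import logging

logger = logging.getLogger(__name__)


def make_bag_strictly(bag_path, make_bag, validation_error, **kwargs):
    """Call a make_bag-style function with strict=True.

    make_bag and validation_error are injected (an assumption made for this
    sketch). Returns the bag object, or None if strict validation failed, in
    which case make_bag is expected to have already reverted the directory.
    """
    try:
        return make_bag(bag_path, strict=True, **kwargs)
    except validation_error as exc:
        logger.error("Bag at %s failed strict validation: %s", bag_path, exc)
        return None
```

With bdbag installed, the intended call would be along the lines of `make_bag_strictly(path, bdbag_api.make_bag, bdbagit.BagValidationError, update=True)`, using the function and exception named in this commit.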

8 changes: 8 additions & 0 deletions doc/cli.md
@@ -18,6 +18,7 @@ The only mandatory argument is a valid path to a local bag directory, a local ba
usage: bdbag
[--version]
[--update]
[--strict]
[--revert]
[--archiver {zip,tar,tgz,bz2,xz}]
[--idempotent]
@@ -84,6 +85,13 @@ operating on bag _archive_ files) can be configured to write such output to the
#### `--update`
Update an existing bag dir, recalculating tag-manifest checksums and regenerating manifests and fetch.txt if necessary.

----
#### `--strict`
Automatically validate a newly created or updated bag for structural validity and fail if the resultant bag is invalid.
This can be used to ensure that a bag is not persisted without payload file manifests. If this flag is set and the
created or updated output bag is not structurally valid, the bag will subsequently be reverted back to a normal directory
and an error returned.
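
Because `--strict` is a `store_true` flag, it is `False` unless given on the command line and is simply forwarded to the API call. The sketch below shows that wiring with `argparse` alone; `build_parser`, `make_bag_kwargs`, and the `bdbag-sketch` program name are hypothetical and cover only a small subset of the real CLI.

```python
import argparse


def build_parser():
    """Build a reduced parser with only the arguments needed for the sketch."""
    parser = argparse.ArgumentParser(prog="bdbag-sketch")
    parser.add_argument("path", help="Path to a bag directory.")
    parser.add_argument("--update", action="store_true",
                        help="Update an existing bag directory.")
    parser.add_argument("--strict", action="store_true",
                        help="Validate the resulting bag structure and fail if invalid.")
    return parser


def make_bag_kwargs(argv):
    """Translate CLI arguments into keyword arguments for a make_bag-style call."""
    args = build_parser().parse_args(argv)
    return {"update": args.update, "strict": args.strict}
```

Here `make_bag_kwargs(["./bag", "--strict"])` yields `strict=True`, while omitting the flag leaves strict checking off, matching the `strict=False` default of the API function.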

----
#### `--revert`
Revert an existing bag directory back to a normal directory, deleting all bag metadata files. Payload files in the `data` directory will be moved back to the directory root, and the `data` directory will be deleted.