Creating & updating dataset mirrors of Dandisets and their Zarrs for the dandisets and dandizarrs organizations is done with the `backups2datalad` command in this repository.

Before running `backups2datalad`, the following setup must be performed:
- `backups2datalad` must be installed in a Python environment using either `pip install .` (run from a clone of this repository) or `pip install git+https://github.com/dandi/backups2datalad`. At least Python 3.10 is required.
- git-annex must be installed. At least version 10.20240430 is required, though you should endeavor to obtain the latest version.
- An API token needs to be obtained for the DANDI instance that is being mirrored. When invoking `backups2datalad`, the environment variable `DANDI_API_KEY` must be set to the token.
- A configuration file should be written. This is a YAML file containing a mapping with the following keys:
    - `dandi_instance` — The name of the DANDI instance whose Dandisets should be mirrored. Defaults to `"dandi"`.
    - `s3bucket` — The name of the S3 bucket on which the assets for the DANDI instance are stored. Currently, only buckets in the `us-east-1` region are supported. Defaults to `"dandiarchive"`.
        - When `dandi_instance` is `"dandi"`, this should be `"dandiarchive"`.
        - When `dandi_instance` is `"dandi-staging"`, this should be `"dandi-api-staging-dandisets"`.
    - `s3endpoint` — The base endpoint URL of the S3 instance on which the bucket is located. If this is set, the base bucket URL will be calculated as `{s3endpoint}/{s3bucket}`; otherwise, it will be `https://{s3bucket}.s3.amazonaws.com`. This option is intended primarily for use in testing.
    - `content_url_regex` — A regular expression used to identify which of an asset's `contentUrl`s is its S3 URL. Defaults to `"amazonaws.com/.*blobs/"`.
    - `dandisets` — A mapping containing configuration specific to the mirroring of Dandisets. If not given, it will default to a mapping in which `path` is set to `"dandisets"` and all other fields are unset.
        - `path` (required) — The path to the local directory in which dataset mirrors of Dandisets will be placed, relative to `backup_root`. The directory need not already exist.
            - This directory will be made into a DataLad dataset.
        - `github_org` — The name of the GitHub organization (which must already exist) to which the mirror repositories will be pushed. If not set, mirrors will not be pushed to GitHub.
            - `dandisets.github_org` and `zarrs.github_org` must be either both set or both unset.
        - `remote` — Description of a git-annex special remote to create in new mirror repositories and for the `populate` subcommand to copy data to. If not set, `populate` cannot be run.

          When present, `remote` is a mapping with the following keys:
            - `name` (required) — The name of the remote
            - `type` (required) — The type of the remote
            - `options` (required) — A string-valued mapping specifying parameters to pass to `git-annex initremote`
    - `zarrs` — A mapping containing configuration specific to the mirroring of Zarrs. If not given, `backups2datalad` will error upon trying to back up a Dandiset containing a Zarr. The mapping has the same schema as for `dandisets`.
        - `zarrs.path` will not be made into a DataLad dataset.
        - `dandisets.github_org` and `zarrs.github_org` must be either both set or both unset.
        - `zarrs.remote` is a prerequisite for the `populate-zarrs` subcommand.
    - `backup_root` — The path to the local directory in which the Dandiset and Zarr mirror directories will be placed. Defaults to the current directory.
        - This option can also be set via the `--backup-root` global CLI option, which overrides any value given in the configuration file.
    - `asset_filter` — A regular expression; if given, only assets whose paths match the regex will be processed.
        - This option can also be set via the `--asset-filter` option of the `update-from-backup` and `release` subcommands, which overrides any value given in the configuration file.
    - `jobs` (integer) — The number of parallel git-annex jobs to use when downloading & pushing assets. Defaults to 10.
        - This option can also be set via the `--jobs` global CLI option, which overrides any value given in the configuration file.
    - `workers` (integer) — The number of asynchronous worker tasks to run concurrently. Defaults to 5.
        - This option can also be set via the `--workers` option of the `update-from-backup`, `backup-zarrs`, `populate`, and `populate-zarrs` subcommands, which overrides any value given in the configuration file.
    - `force` — If set to `"assets-update"`, all assets are forcibly updated, even those whose metadata hasn't changed.
        - This option can also be set via the `--force` option of the `update-from-backup` and `release` subcommands, which overrides any value given in the configuration file.
    - `enable_tags` (boolean) — Whether to enable creation of tags for releases. Defaults to `true`.
        - This option can also be set via the `--tags`/`--no-tags` options of the `update-from-backup` subcommand, which override any value given in the configuration file.
    - `gc_assets` (boolean) — If set and `assets.json` contains any assets neither on the server nor in the backup, delete the extra assets instead of erroring. Defaults to `false`.
        - This option can also be set via the `--gc-assets` option of the `update-from-backup` subcommand, which overrides any value given in the configuration file.
    - `mode` — Specify how to decide whether to back up a Dandiset. Possible values are:
        - `"timestamp"` (default) — only back up if the timestamp of the last backup is older than the "modified" timestamp on the server
        - `"force"` — always back up
        - `"verify"` — always back up, but error if there are any changes without a change to the "modified" timestamp

      This option can also be set via the `--mode` option of the `update-from-backup` subcommand, which overrides any value given in the configuration file.
    - `zarr_mode` — Specify how to decide whether to back up a Zarr. Possible values are:
        - `"timestamp"` (default) — only back up if the timestamp of the last backup is older than some Zarr entry in S3
        - `"checksum"` — only back up if the Zarr checksum is out of date or doesn't match the expected value
        - `"asset-checksum"` — only back up if the Zarr asset's "modified" timestamp is later than that in `assets.json` and the checksum is out of date or doesn't match the expected value
        - `"force"` — always back up

      This option can also be set via the `--zarr-mode` option of the `update-from-backup` subcommand, which overrides any value given in the configuration file.
- If pushing mirror repositories to GitHub, a GitHub access token with appropriate permissions must be stored in the `hub.oauthtoken` key of your `~/.gitconfig`, and an SSH key that has been registered with a GitHub account must be in use as well.
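A few of the configuration rules above can be expressed as a short Python sketch. This is not code from `backups2datalad`; the function names (`base_bucket_url`, `pick_s3_url`, `check_github_orgs`) and the example URLs are hypothetical, and the use of `re.search` (matching anywhere in the URL) is an assumption suggested by the default pattern, which does not start with a URL scheme.

```python
import re

# Default taken from the content_url_regex description above.
DEFAULT_CONTENT_URL_REGEX = r"amazonaws.com/.*blobs/"


def base_bucket_url(s3bucket="dandiarchive", s3endpoint=None):
    """Derive the base bucket URL from s3bucket and the optional s3endpoint."""
    if s3endpoint:
        return f"{s3endpoint}/{s3bucket}"
    return f"https://{s3bucket}.s3.amazonaws.com"


def pick_s3_url(content_urls, content_url_regex=DEFAULT_CONTENT_URL_REGEX):
    """Return the first contentUrl that matches the regex, or None."""
    pattern = re.compile(content_url_regex)
    for url in content_urls:
        # Assumption: the pattern may match anywhere in the URL.
        if pattern.search(url):
            return url
    return None


def check_github_orgs(config):
    """Raise ValueError unless github_org is set in both sections or in neither."""
    dandisets_org = (config.get("dandisets") or {}).get("github_org")
    zarrs_org = (config.get("zarrs") or {}).get("github_org")
    if (dandisets_org is None) != (zarrs_org is None):
        raise ValueError(
            "dandisets.github_org and zarrs.github_org must be either "
            "both set or both unset"
        )


if __name__ == "__main__":
    # Hypothetical contentUrl values, for illustration only.
    urls = [
        "https://api.example.org/assets/1234/download/",
        "https://dandiarchive.s3.amazonaws.com/blobs/aaa/bbb/aaabbb-0000",
    ]
    print(base_bucket_url())
    print(base_bucket_url("test-bucket", "http://localhost:9000"))
    print(pick_s3_url(urls))
    check_github_orgs({"dandisets": {"path": "dandisets"}})
```

The `s3endpoint` branch only matters for test setups; with the defaults, the bucket URL is always the `amazonaws.com` form.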
Run `backups2datalad` with:

    backups2datalad --config path/to/config/file <subcommand> ...

The environment variable `DANDI_API_KEY` must be set to an API token for the DANDI instance being mirrored.

Run `backups2datalad --help` for details on the global options and summaries of the subcommands.
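When the command is launched from a script, it can help to check for `DANDI_API_KEY` before starting. The following is a minimal sketch of such a wrapper, not part of `backups2datalad` itself: `build_command` and `run_backups2datalad` are hypothetical helper names, and the only options used are the `--config`, `--backup-root`, and `--jobs` global options described above.

```python
import os
import subprocess


def build_command(config_path, subcommand, *args, backup_root=None, jobs=None):
    """Assemble the argument list; global options go before the subcommand."""
    cmd = ["backups2datalad", "--config", str(config_path)]
    if backup_root is not None:
        cmd += ["--backup-root", str(backup_root)]
    if jobs is not None:
        cmd += ["--jobs", str(jobs)]
    cmd.append(subcommand)
    cmd.extend(str(a) for a in args)
    return cmd


def run_backups2datalad(config_path, subcommand, *args, **global_opts):
    """Run the command, refusing to start if DANDI_API_KEY is not set."""
    if not os.environ.get("DANDI_API_KEY"):
        raise RuntimeError("DANDI_API_KEY must be set to a DANDI API token")
    cmd = build_command(config_path, subcommand, *args, **global_opts)
    # check=True raises CalledProcessError on a non-zero exit status.
    return subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Only print the command here; nothing is executed.
    print(build_command("path/to/config/file", "update-from-backup", jobs=4))
```

Passing the arguments as a list avoids shell quoting issues for paths that contain spaces.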
`backups2datalad` subcommands:

- `update-from-backup` — Create & update local mirrors of Dandisets and the Zarrs within them
- `backup-zarrs` — Create (but do not update) local mirrors of Zarrs for a single Dandiset
- `update-github-metadata` — Update homepages and descriptions for mirrors pushed to GitHub
- `release` — Create a tag (and a GitHub release, if pushing to GitHub) in a Dandiset mirror for a given published version
- `populate` — Copy assets from local Dandiset mirrors to the git-annex special remote
- `populate-zarrs` — Copy assets from local Zarr mirrors to the git-annex special remote
- `zarr-checksum` — Compute the Zarr checksum for a given Zarr mirror
- `register-s3urls` — Ensure that all blob assets in the backup of the given Dandiset have their S3 URLs registered with git-annex
Run `backups2datalad <subcommand> --help` for further details on each subcommand.

The primary mirroring subcommands are `update-from-backup`, `populate`, and `populate-zarrs`; the other subcommands are for minor/maintenance tasks and usually do not need to be run.
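As a rough illustration, the three primary subcommands could be chained from a script as sketched below. The ordering (mirror first, then populate) is an assumption based on the descriptions above, since `populate` and `populate-zarrs` copy from local mirrors that must already exist. `primary_commands` is a hypothetical helper, and the sketch only prints the command lines; any further arguments are left to each subcommand's `--help`.

```python
import shlex

# The primary mirroring subcommands, in an assumed order of execution.
PRIMARY_SUBCOMMANDS = ("update-from-backup", "populate", "populate-zarrs")


def primary_commands(config_path, workers=None):
    """Build one argument list per primary subcommand."""
    commands = []
    for subcommand in PRIMARY_SUBCOMMANDS:
        cmd = ["backups2datalad", "--config", str(config_path), subcommand]
        if workers is not None:
            # --workers is a subcommand option, so it follows the subcommand.
            cmd += ["--workers", str(workers)]
        commands.append(cmd)
    return commands


if __name__ == "__main__":
    for cmd in primary_commands("path/to/config/file", workers=5):
        # shlex.join renders the list as a shell-safe command line.
        print(shlex.join(cmd))
```

Note that `populate` requires `dandisets.remote` and `populate-zarrs` requires `zarrs.remote` to be present in the configuration file.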