Create transfer module to package spool files into tar and upload to DANDI #51
Conversation
Below are benchmarking results comparing spool file packaging into tar vs tar.gz. For ~1 GB of spool files, it takes ~300 seconds to package and compress to tar.gz. Based on these numbers, I would suggest that we just package into a tarball without gzip compression and upload. In total, it would take 3.4 days to package the entire 216 TB, plus the upload time of ~25 days (assuming 100 MB/s).
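A quick sanity check of that arithmetic (a hedged sketch; the 216 TB, 3.4-day, and 100 MB/s figures are taken from the comment above):

```python
# Back-of-the-envelope check of the transfer estimates quoted above.
total_bytes = 216e12                      # 216 TB of spool files
upload_rate = 100e6                       # assumed sustained 100 MB/s
print(total_bytes / upload_rate / 86400)  # ≈ 25 days of upload time

# 3.4 days to package 216 TB implies roughly 1.4 s/GB for tar-only
# packaging, vs ~300 s/GB measured for tar.gz (≈ 200x faster).
print(3.4 * 86400 / 216_000)              # ≈ 1.36 s/GB
```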
Hi @calvinchai, have you done similar benchmarking for the OME-Zarr conversion (and/or stitching)? If so, could you point me to those findings? Thanks.
Tests now pass. See latest run. Not sure why they are failing in this pull request.
Pull Request Overview
This PR introduces a new transfer module to batch .dat files into tar archives and optionally upload them to DANDI, updates dependencies and exports, and adds CI/test support.
- Adds linc_convert.modalities.lsm.transfer for archiving and uploading spool files.
- Extends pyproject.toml extras to include dandi and updates __init__.py.
- Introduces test_transfer in tests/test_lsm.py and configures DANDI_API_KEY in CI.
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
File | Description
---|---
tests/test_lsm.py | Add test_transfer to verify archive creation and file integrity
pyproject.toml | Add dandi to lsm and all extras
linc_convert/modalities/lsm/transfer.py | New transfer logic for batching, archiving, and uploading
linc_convert/modalities/lsm/__init__.py | Export transfer in __all__
.github/workflows/ci.yml | Trigger CI on push, PR, dispatch, and set DANDI_API_KEY env
Comments suppressed due to low confidence (2)
linc_convert/modalities/lsm/transfer.py:94
- There's no test covering the upload=True branch. Consider adding a test with a mocked dandi.upload.upload call to verify upload logic and cleanup.
if upload:
tests/test_lsm.py:37
- The test invokes dandi.download.download, which performs network I/O. Mock dandi.download.download to avoid external dependencies and make tests deterministic.
transfer.dandi_transfer(input_dir=input_dir,
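A hedged pytest-style sketch combining both suggestions; dandi_transfer's exact signature is assumed from the snippets above, and the patches keep all network I/O out of the test:

```python
# Hypothetical test sketch: patch the dandi calls so the upload=True branch
# runs without touching the network. Signature details are assumptions.
from unittest import mock

from linc_convert.modalities.lsm import transfer

def test_transfer_upload_mocked(tmp_path):
    (tmp_path / "scan_0001.dat").write_bytes(b"\x00" * 1024)
    with mock.patch("dandi.upload.upload") as up, \
         mock.patch("dandi.download.download") as down:
        transfer.dandi_transfer(input_dir=str(tmp_path), upload=True)
    up.assert_called_once()     # upload branch exercised
    down.assert_not_called()    # verification no longer hits the network
```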
The PR title was changed from "transfer module to package spool files into tar.gz and upload to DANDI" to "transfer module to package spool files into tar and upload to DANDI".
Hi @balbasty @calvinchai, following up to see if you have feedback on this pull request. Thanks.
I may have that on paper. Let me get back to you on that tomorrow.
My apologies, I missed this notification. Taking a look now.
Parameters
----------
input_dir : str
    Directory containing .dat files to upload
An alternative could be for the input to be a list of files, rather than doing the glob ourselves in the function. Dat files can still be filtered using a command-line glob: linc-convert lsm transfer path/to/dir/*.dat. And it would allow tarring non-dat files if needed.
But I understand that the directory-based interface might be easier to use for Emin et al.
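A minimal sketch of that alternative interface (names and defaults are assumptions, not the PR's actual API):

```python
# Hypothetical file-list interface: the shell expands path/to/dir/*.dat,
# so the function receives explicit paths and never globs on its own.
from pathlib import Path

def dandi_transfer(files: list[str], dandi_instance: str = "linc") -> None:
    paths = [Path(f) for f in files]   # may include non-dat files
    for p in paths:
        if not p.is_file():
            raise FileNotFoundError(p)
    # ...batch `paths` into tar archives and upload, as elsewhere in the PR
```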
dandi_instance : str
    DANDI server (e.g. linc, dandi)
output_dir : str, optional
    Directory to save the Dandiset directory (default: '.')
We delete everything in the end, right? Should we just use a tempfile.TemporaryDirectory instead of asking the user for a location?
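For illustration, a sketch of that suggestion (the directory layout is an assumption):

```python
# Stage the Dandiset under a temporary directory that is removed
# automatically on exit, instead of asking the user for output_dir.
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    dandiset_directory = Path(tmp) / "dandiset"
    dandiset_directory.mkdir()
    # ...write tar archives here and upload; everything is cleaned up on exit
```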
print(f"Uploading {archive_path}.") | ||
dandi.upload.upload([dandiset_directory], | ||
dandi_instance=dandi_instance, | ||
) |
Is this robust enough? Would it be useful to wrap this into a loop with try/except and allow for multiple tries?
Also, maybe we should have an option to resume a transfer where it left off (say the person stops it with Ctrl+C). Not sure how to implement this.
> Is this robust enough? Would it be useful to wrap this into a loop with try/except and allow for multiple tries?

Good idea. I have added a loop to continually attempt the upload until successful.
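For reference, a minimal sketch of such a loop (the delay is an illustrative assumption, not the PR's actual value):

```python
# Retry the upload until it succeeds, as described above.
import time

import dandi.upload

def upload_with_retries(dandiset_directory, dandi_instance, wait_s=60):
    attempt = 0
    while True:
        attempt += 1
        try:
            dandi.upload.upload([dandiset_directory],
                                dandi_instance=dandi_instance)
            return
        except Exception as exc:
            print(f"Upload attempt {attempt} failed ({exc}); retrying...")
            time.sleep(wait_s)
```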
> Also, maybe we should have an option to resume a transfer where it left off (say the person stops it with Ctrl+C). Not sure how to implement this.

I could create a manifest file that is stored locally to track the files that have been added to a tar and the tars that have been uploaded.
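A hypothetical sketch of such a manifest (the file name and JSON format are assumptions):

```python
# Track uploaded archives locally so a restarted run can skip finished work.
import json
from pathlib import Path

MANIFEST = Path("transfer_manifest.json")

def uploaded_archives() -> set[str]:
    return set(json.loads(MANIFEST.read_text())) if MANIFEST.exists() else set()

def record_upload(archive_name: str) -> None:
    done = uploaded_archives()
    done.add(archive_name)
    MANIFEST.write_text(json.dumps(sorted(done)))
```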
I was wondering if it is possible to do something like …
I just checked using …
Context
Based on a discussion with Satra, Anastasia, and Aaron (and previously with Emine, Elizabeth, and Wenze), we have decided to combine the spool files into tar.gz files and upload them to https://dandiarchive.org. From there, the data can be downloaded, unpackaged, and converted into an OME-Zarr archive.
Updates
- Created a transfer module that combines a few gigabytes of spool dat files into a tar file, uploads the tar file to DANDI, deletes the tar file, and repeats until all dat files have been uploaded. This process is employed so that not much local storage is consumed (see the sketch after this list).
- Added tests for the transfer.py module.
- Added the DANDI_API_key secret to the GitHub repository.
secret to GitHub repositorycc @satra @ayendiki