
Conversation

@kabilar (Member) commented May 13, 2025

Context

Based on a discussion with Satra, Anastasia, and Aaron (and previously with Emine, Elizabeth, and Wenze), we have decided to combine the spool files into tar.gz files and upload them to https://dandiarchive.org. From there, the data can be downloaded, unpacked, and converted into an OME-Zarr archive.

Updates

  • Create a transfer module that combines a few gigabytes of spool .dat files into a tar file, uploads the tar file to DANDI, deletes the tar file, and repeats until all .dat files have been uploaded. This keeps local storage usage low throughout the transfer (see the sketch after this list).
  • Test upload to the LINC staging server. See example command:
    linc-convert lsm transfer --input-dir './multicolor_section_run11__y06_z01_HR' --dandiset-url 'https://staging.lincbrain.org/dandiset/000012' --dandi-instance 'linc-staging' --subject 'm1' --output-dir '.' --max-size-gb 1
    
  • Benchmark transfer.py module
  • Add tests
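
To make the batching loop concrete, here is a minimal sketch of the idea, assuming a size-based grouping helper; apart from the dandi.upload.upload call (which appears in the diff reviewed below), the function and file names are illustrative rather than the module's actual code:

```python
import tarfile
from pathlib import Path

import dandi.upload


def batch_by_size(files, max_bytes):
    """Yield lists of files whose combined size stays under max_bytes."""
    batch, size = [], 0
    for f in files:
        fsize = f.stat().st_size
        if batch and size + fsize > max_bytes:
            yield batch
            batch, size = [], 0
        batch.append(f)
        size += fsize
    if batch:
        yield batch


def transfer_spool_files(input_dir, dandiset_directory, dandi_instance, max_size_gb=1.0):
    """Package .dat files into small tar archives, upload each, then delete it."""
    files = sorted(Path(input_dir).glob("*.dat"))
    for i, batch in enumerate(batch_by_size(files, int(max_size_gb * 1e9))):
        archive_path = Path(dandiset_directory) / f"spool_{i:04d}.tar"
        with tarfile.open(archive_path, "w") as tar:   # plain tar, no gzip
            for f in batch:
                tar.add(f, arcname=f.name)
        # Upload the staged dandiset directory, then free the local storage
        # before packaging the next batch.
        dandi.upload.upload([str(dandiset_directory)], dandi_instance=dandi_instance)
        archive_path.unlink()
```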

cc @satra @ayendiki

@kabilar marked this pull request as draft May 13, 2025 23:03
@kabilar (Member, Author) commented May 15, 2025

Below are benchmarking results comparing spool .dat conversion to .tar vs .tar.gz. Tested a few times on my laptop (M2 Pro, 32 GB RAM, with many other processes running).

For ~1 GB, it takes ~300 seconds to package and compress to .tar.gz vs <2 seconds to package to .tar. The compression ratio is ~2.4 (1030 MB to 434 MB).

Based on these numbers, I would suggest that we package into a tarball without gzip compression and upload. In total, it would take about 3.4 days to package the entire 216 TB, plus roughly 25 days of upload time (assuming 100 MB/s).
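
For reference, a rough sanity check of those estimates (assuming 216 TB total, roughly 1.4 s/GB to tar based on the benchmark above, and 100 MB/s sustained upload):

```python
total_gb = 216_000                               # 216 TB of spool data
tar_days = total_gb * 1.4 / 86_400               # ≈ 3.5 days to package into tar
upload_days = total_gb * 1_000 / 100 / 86_400    # ≈ 25 days at 100 MB/s
```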

@kabilar (Member, Author) commented May 15, 2025

Hi @calvinchai, have you done similar benchmarking for the OME-Zarr conversion (and/or stitching)? If so, could you point me to those findings? Thanks.

@kabilar (Member, Author) commented May 17, 2025

Tests now pass. See latest run. Not sure why they are failing in this pull request.

@kabilar marked this pull request as ready for review May 17, 2025 17:31
@Copilot (Copilot AI) left a comment

Pull Request Overview

This PR introduces a new transfer module to batch .dat files into tar archives and optionally upload them to DANDI, updates dependencies and exports, and adds CI/test support.

  • Adds linc_convert.modalities.lsm.transfer for archiving and uploading spool files.
  • Extends pyproject.toml extras to include dandi and updates __init__.py.
  • Introduces test_transfer in tests/test_lsm.py and configures DANDI_API_KEY in CI.

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

Summary per file:

  • tests/test_lsm.py: Add test_transfer to verify archive creation and file integrity
  • pyproject.toml: Add dandi to the lsm and all extras
  • linc_convert/modalities/lsm/transfer.py: New transfer logic for batching, archiving, and uploading
  • linc_convert/modalities/lsm/__init__.py: Export transfer in __all__
  • .github/workflows/ci.yml: Trigger CI on push, PR, dispatch, and set the DANDI_API_KEY env
Comments suppressed due to low confidence (2)

linc_convert/modalities/lsm/transfer.py:94

  • There's no test covering the upload=True branch. Consider adding a test with a mocked dandi.upload.upload call to verify upload logic and cleanup.
if upload:

tests/test_lsm.py:37

  • The test invokes dandi.download.download which performs network I/O. Mock dandi.download.download to avoid external dependencies and make tests deterministic.
transfer.dandi_transfer(input_dir=input_dir, 
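
One way to address both points is to stub out the dandi calls with pytest's monkeypatch fixture. A sketch, in which the test name and the dandi_transfer keyword arguments are guesses based on the CLI options shown above rather than the PR's actual code:

```python
import dandi.download
import dandi.upload
from linc_convert.modalities.lsm import transfer


def test_transfer_offline(tmp_path, monkeypatch):
    # Replace the network-facing dandi calls with no-op stubs so the test
    # runs without DANDI access and is deterministic.
    monkeypatch.setattr(dandi.download, "download", lambda *args, **kwargs: None)
    monkeypatch.setattr(dandi.upload, "upload", lambda *args, **kwargs: None)

    # A few dummy spool files to archive.
    input_dir = tmp_path / "spool"
    input_dir.mkdir()
    for i in range(3):
        (input_dir / f"spool_{i}.dat").write_bytes(b"\x00" * 1024)

    # Keyword names mirror the CLI flags and may differ from the real signature.
    transfer.dandi_transfer(
        input_dir=str(input_dir),
        dandiset_url="https://staging.lincbrain.org/dandiset/000012",
        dandi_instance="linc-staging",
        subject="m1",
        output_dir=str(tmp_path),
        max_size_gb=1,
    )
```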

@kabilar changed the title from "Create transfer module to package spool files into tar.gz and upload to DANDI" to "Create transfer module to package spool files into tar and upload to DANDI" on May 17, 2025
@kabilar (Member, Author) commented May 27, 2025

Hi @balbasty @calvinchai, following up to see if you have feedback on this pull request. Thanks.

@calvinchai (Contributor) commented:

> Hi @calvinchai, have you done similar benchmarking for the OME-Zarr conversion (and/or stitching)? If so, could you point me to those findings? Thanks.

I may have that on paper. Let me get back to you on that tomorrow.

@calvinchai (Contributor) commented:

> Hi @balbasty @calvinchai, following up to see if you have feedback on this pull request. Thanks.

My apologies, I missed this notification. Taking a look now.

Parameters
----------
input_dir : str
Directory containing .dat files to upload
A Collaborator commented:

An alternative could be for the input to be a list of files, rather than doing the glob ourselves in the function. The .dat files can still be filtered with a command-line glob: linc-convert lsm transfer path/to/dir/*.dat. And it would allow tarring non-dat files if needed.

But I understand that the directory-based interface might be easier to use for Emin et al.

dandi_instance : str
DANDI server (e.g. linc, dandi)
output_dir : str, optional
Directory to save the Dandiset directory (default: '.')
A Collaborator commented:

We delete everything in the end, right? Should we just use a tempfile.TemporaryDirectory instead of asking the user for a location?
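
For reference, a minimal sketch of that suggestion; the helper name is illustrative only:

```python
import tempfile
from pathlib import Path


def stage_and_upload(files):
    # Stage archives in a throwaway directory that is removed automatically,
    # instead of asking the user for an output_dir.
    with tempfile.TemporaryDirectory() as tmp_dir:
        dandiset_directory = Path(tmp_dir) / "dandiset"
        dandiset_directory.mkdir()
        # ... build the tar archives under dandiset_directory and upload them;
        # everything is cleaned up when the context manager exits.
```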

print(f"Uploading {archive_path}.")
dandi.upload.upload([dandiset_directory],
dandi_instance=dandi_instance,
)
A Collaborator commented:

Is this robust enough? Would it be useful to wrap this into a loop with try/except and allow for multiple tries?

A Collaborator commented:

Also, maybe we should have an option to restart a transfer where it was left off (say that the person stops it with ctrl+c). Not sure how to implement this.

@kabilar (Member, Author) commented:

> Is this robust enough? Would it be useful to wrap this into a loop with try/except and allow for multiple tries?

Good idea. I have added a loop to continually attempt the upload until successful.
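
For reference, a minimal sketch of such a retry wrapper (the delay value and function name are illustrative, not necessarily what was committed):

```python
import time

import dandi.upload


def upload_with_retries(dandiset_directory, dandi_instance, delay_seconds=60):
    """Keep retrying the upload until it succeeds."""
    while True:
        try:
            dandi.upload.upload([str(dandiset_directory)], dandi_instance=dandi_instance)
            return
        except Exception as exc:
            print(f"Upload failed ({exc}); retrying in {delay_seconds} s.")
            time.sleep(delay_seconds)
```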

@kabilar (Member, Author) commented:

> Also, maybe we should have an option to restart a transfer where it was left off (say that the person stops it with ctrl+c). Not sure how to implement this.

I could create a manifest file that is stored locally to track the files that have been added to a tar and the tars that have been uploaded.
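
A possible shape for such a manifest, kept locally and updated after each step; the file name and schema here are made up for illustration:

```python
import json
from pathlib import Path

MANIFEST = Path("transfer_manifest.json")


def load_manifest():
    """Return the saved progress, or a fresh record if none exists."""
    if MANIFEST.exists():
        return json.loads(MANIFEST.read_text())
    return {"archived": [], "uploaded": []}


def record(manifest, key, name):
    """Mark a file as archived or a tar as uploaded, and persist immediately."""
    manifest[key].append(name)
    MANIFEST.write_text(json.dumps(manifest, indent=2))
```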

@calvinchai (Contributor) commented:

I was wondering if it is possible to do something like linc-convert ...... --output dandi:// so we have a unified output format.

@calvinchai (Contributor) commented:

I just checked using dandi-cli in the zarr 3 branch, and no luck, because hdmf-zarr requires zarr-python < 3.0.
