DANDI sync to MIT Engaging #189

Open
kabilar opened this issue Oct 18, 2024 · 15 comments

Comments

@kabilar
Member

kabilar commented Oct 18, 2024

@puja-trivedi @aaronkanzer and I are using this issue to track our work to sync the DANDI public bucket to MIT Engaging.

Requirements

Possible solutions

  • Globus collections
  • scp
  • rsync
  • s5cmd
  • dandi-cli

Open questions

  1. How will we handle MFA for programmatic access?
  2. Basic Globus endpoints will transfer data unencrypted
  3. Globus AWS S3 connector would need to be added to our subscription to access data on S3. Is this add-on currently included in the MIT Globus subscription?

Resources

@satra
Member

satra commented Oct 18, 2024

s5cmd and the dandi cli should also be solutions, no? they both offer parallelism that scp and rsync don't.
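e.g., something along these lines (the destination path and worker/job counts here are just placeholders):

# s5cmd: anonymous read of the public bucket with many parallel workers
$ s5cmd --no-sign-request --numworkers 64 sync 's3://dandiarchive/blobs/*' /path/to/backup/blobs/

# dandi-cli: parallel download of a single dandiset
$ dandi download --jobs 8 DANDI:000108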

@kabilar
Member Author

kabilar commented Oct 18, 2024

Thanks, Satra. Yes, definitely. Exploring s5cmd next.

@kabilar
Member Author

kabilar commented Oct 18, 2024

@puja-trivedi @aaronkanzer s5cmd installation on Engaging got stuck again. Will need to try a different mechanism.

@satra
Member

satra commented Oct 21, 2024

@kabilar - s5cmd has prebuilt binaries and also conda. how are you installing on engaging?

@kabilar
Member Author

kabilar commented Oct 21, 2024

Hi @satra, we tried conda but it wasn't able to resolve the download. Will try the binaries.
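For reference, a rough sketch of both routes (version and filename should be taken from whatever is current on the releases page):

# conda-forge (the default channels don't carry s5cmd)
$ conda install -c conda-forge s5cmd

# or a prebuilt Linux binary from https://github.com/peak/s5cmd/releases
$ curl -LO https://github.com/peak/s5cmd/releases/download/v2.2.2/s5cmd_2.2.2_Linux-64bit.tar.gz
$ tar xzf s5cmd_2.2.2_Linux-64bit.tar.gz s5cmd
$ ./s5cmd version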

@yarikoptic
Member

yarikoptic commented Oct 21, 2024

Didn't look into the Globus one, but given the size of our bucket (in number of keys), the others AFAIK wouldn't be sufficient for efficient incremental backups. FTR

We would need a tool which makes use of that extra service @satra mentioned (can't recall the name) that we have enabled and which tracks changes to our S3. Before initiating a full backup, it might be worth first deciding how the incrementals will be done, so that the initial backup is done with future incrementals in mind (e.g., maybe by capturing the state/position in that extra service).

@kabilar
Member Author

kabilar commented Oct 23, 2024

Thanks, Yarik. That sounds good.

@kabilar
Member Author

kabilar commented Oct 23, 2024

From MIT ORCD team:

We do have an S3 license included in our Globus subscription and are looking into how we might set it up. It may require some admin things on our end. I'll keep you posted with updates.

@kabilar
Member Author

kabilar commented Oct 23, 2024

@puja-trivedi For reference, DANDI design docs.

@yarikoptic
Member

FTR (in case someone inquires about "scales"): 000108 alone, through the zarrs in it, points to 332,739,854 keys on S3 according to its description on https://github.com/dandisets/000108 . If MIT has a Globus subscription, could someone ask Globus whether they have any factual data (benchmarks, use cases) on the S3 connector being used for incremental backup of a bucket with hundreds of millions of keys?

  • note: not sure whether we are already there or not, but even the "built-in" AWS Backup service for S3 seems to have a limitation: "The AWS Backup can be used only for buckets with less than 3 billion objects" ref

Meanwhile, @satra, who has / where do we have access to the S3 inventory associated with our sponsored bucket? (FWIW, I have so far failed to find a pre-cooked tool/script that would make use of the inventory for backups, which is odd.)
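To illustrate the kind of thing I was hoping already exists: a very rough sketch, assuming we can dump plain key listings from two consecutive inventory snapshots (the file names and backup path are hypothetical, and this naive version flattens the key hierarchy and ignores deletions):

# keys present in the newer snapshot but not the older one
$ comm -13 <(sort keys-old.txt) <(sort keys-new.txt) > new-keys.txt
# turn them into s5cmd copy commands and run them in parallel
$ sed 's|^|cp s3://dandiarchive/|; s|$| /backup/dandiarchive/|' new-keys.txt > incremental.s5cmd
$ s5cmd --no-sign-request run incremental.s5cmd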

@satra
Member

satra commented Oct 24, 2024

who has / where do we have access to the S3 inventory associated with our sponsored bucket?

it's in the sponsored bucket, so whoever has keys to that (it's dumped into a specific directory there that is only readable using the appropriate access keys).

@yarikoptic
Member

I guess I might not have an appropriate access key since I see only

$ s3cmd ls -l s3://dandiarchive/
                          DIR                                                    s3://dandiarchive/blobs/
                          DIR                                                    s3://dandiarchive/dandiarchive/
                          DIR                                                    s3://dandiarchive/dandisets/
                          DIR                                                    s3://dandiarchive/zarr-checksums/
                          DIR                                                    s3://dandiarchive/zarr/
2021-09-22 22:20         2137  99d1fd07269359b636b34bd402c58fbc     STANDARD     s3://dandiarchive/README.md
2021-09-22 22:20         3094  1b484c3b547a89efd67da353397556a4     STANDARD     s3://dandiarchive/index.html
2021-01-29 22:07         4008  ef4867d3c21a0034a98cd9453f14efe3     STANDARD     s3://dandiarchive/ros3test.hdf5
2021-08-12 00:48       177728  35574be1cdfe3ae4c4235d34d7348f99     STANDARD     s3://dandiarchive/ros3test.nwb

?

@satra
Member

satra commented Oct 24, 2024

should be inside: s3://dandiarchive/dandiarchive/

@yarikoptic
Member

interesting! So s3cmd gives me an "empty directory" (although there are no directories on S3), which is different from 'non-existing':

$ s3cmd ls -l s3://dandiarchive/dandiarchive/
                          DIR                                                    s3://dandiarchive/dandiarchive/dandiarchive/
$ s3cmd ls -l s3://dandiarchive/dandiarchives/
$ 

I guess I do not have access to the keys under it, but they are there.

@satra
Member

satra commented Oct 24, 2024

perhaps check through the web account, i.e. log in with the credentials for that account.
