DANDI sync to MIT Engaging #189

Open
kabilar opened this issue Oct 18, 2024 · 15 comments

Comments

@kabilar
Member

kabilar commented Oct 18, 2024

@puja-trivedi @aaronkanzer and I are using this issue to track our work to sync the DANDI public bucket to MIT Engaging.

Requirements

Possible solutions

  • Globus collections
  • scp
  • rsync
  • s5cmd
  • dandi-cli

Open questions

  1. How will we handle MFA for programmatic access?
  2. Basic Globus endpoints will transfer data unencrypted
  3. Globus AWS S3 connector would need to be added to our subscription to access data on S3. Is this add-on currently included in the MIT Globus subscription?

Resources

@satra
Member

satra commented Oct 18, 2024

s5cmd and the dandi cli should also be solutions, no? they both offer parallelism that scp and rsync don't.
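e.g., something along these lines (the destination path and worker/job counts here are just placeholders):

# s5cmd: anonymous read of the public bucket with many parallel workers
$ s5cmd --no-sign-request --numworkers 64 sync 's3://dandiarchive/blobs/*' /path/to/backup/blobs/

# dandi-cli: parallel download of a single dandiset
$ dandi download --jobs 8 DANDI:000108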

@kabilar
Member Author

kabilar commented Oct 18, 2024

Thanks, Satra. Yes, definitely. Exploring s5cmd next.

@kabilar
Member Author

kabilar commented Oct 18, 2024

@puja-trivedi @aaronkanzer s5cmd installation on Engaging got stuck again. Will need to try a different mechanism.

@satra
Member

satra commented Oct 21, 2024

@kabilar - s5cmd has prebuilt binaries and also conda. how are you installing on engaging?

@kabilar
Member Author

kabilar commented Oct 21, 2024

Hi @satra, we tried conda but it wasn't able to resolve the download. Will try the binaries.
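For reference, a rough sketch of both routes (version and filename should be taken from whatever is current on the releases page):

# conda-forge (the default channels don't carry s5cmd)
$ conda install -c conda-forge s5cmd

# or a prebuilt Linux binary from https://github.com/peak/s5cmd/releases
$ curl -LO https://github.com/peak/s5cmd/releases/download/v2.2.2/s5cmd_2.2.2_Linux-64bit.tar.gz
$ tar xzf s5cmd_2.2.2_Linux-64bit.tar.gz s5cmd
$ ./s5cmd version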

@yarikoptic
Member

yarikoptic commented Oct 21, 2024

Didn't look into the Globus one, but given the size of our bucket (in number of keys), the others AFAIK wouldn't be sufficient for efficient incremental backups. FTR

We would need a tool which makes use of that extra service @satra mentioned (can't recall the name) that we have enabled and which tracks changes to our S3. Before initiating a full backup, it might be worth first deciding how the incrementals will be done, so that the initial backup is done with future incrementals in mind (e.g., maybe by capturing the state/position in that extra service).

@kabilar
Member Author

kabilar commented Oct 23, 2024

Thanks, Yarik. That sounds good.

@kabilar
Member Author

kabilar commented Oct 23, 2024

From MIT ORCD team:

We do have an S3 license included in our Globus subscription and are looking into how we might set it up. It may require some admin things on our end. I'll keep you posted with updates.

@kabilar
Member Author

kabilar commented Oct 23, 2024

@puja-trivedi For reference, DANDI design docs.

@yarikoptic
Member

FTR (in case someone inquires about "scales"): 000108 alone, through the zarrs in it, points to 332,739,854 keys on S3 according to its description on https://github.com/dandisets/000108 . If MIT has a Globus subscription, could someone ask Globus whether they have any factual data (benchmarks, use cases) on the S3 connector being used for incremental backup of a bucket with hundreds of millions of keys?

  • note: not sure whether we are already there or not, but even the "built-in" AWS Backup service for S3 seems to have a limitation: "The AWS Backup can be used only for buckets with less than 3 billion objects" ref

Meanwhile, @satra, who has / where do we have access to the S3 inventory associated with our sponsored bucket? (FWIW, I have so far failed to find a pre-cooked tool/script that would make use of the inventory for backups, which is odd.)
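To illustrate the kind of thing I was hoping already exists: a very rough sketch, assuming we can dump plain key listings from two consecutive inventory snapshots (the file names and backup path are hypothetical, and this naive version flattens the key hierarchy and ignores deletions):

# keys present in the newer snapshot but not the older one
$ comm -13 <(sort keys-old.txt) <(sort keys-new.txt) > new-keys.txt
# turn them into s5cmd copy commands and run them in parallel
$ sed 's|^|cp s3://dandiarchive/|; s|$| /backup/dandiarchive/|' new-keys.txt > incremental.s5cmd
$ s5cmd --no-sign-request run incremental.s5cmd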

@satra
Member

satra commented Oct 24, 2024

who has / where do we have access to the S3 inventory associated with our sponsored bucket?

it's in the sponsored bucket, so whoever has keys to that (it's dumped into a specific directory there that is only readable using the appropriate access keys).

@yarikoptic
Member

I guess I might not have an appropriate access key since I see only

$ s3cmd ls -l s3://dandiarchive/
                          DIR                                                    s3://dandiarchive/blobs/
                          DIR                                                    s3://dandiarchive/dandiarchive/
                          DIR                                                    s3://dandiarchive/dandisets/
                          DIR                                                    s3://dandiarchive/zarr-checksums/
                          DIR                                                    s3://dandiarchive/zarr/
2021-09-22 22:20         2137  99d1fd07269359b636b34bd402c58fbc     STANDARD     s3://dandiarchive/README.md
2021-09-22 22:20         3094  1b484c3b547a89efd67da353397556a4     STANDARD     s3://dandiarchive/index.html
2021-01-29 22:07         4008  ef4867d3c21a0034a98cd9453f14efe3     STANDARD     s3://dandiarchive/ros3test.hdf5
2021-08-12 00:48       177728  35574be1cdfe3ae4c4235d34d7348f99     STANDARD     s3://dandiarchive/ros3test.nwb

?

@satra
Member

satra commented Oct 24, 2024

should be inside: s3://dandiarchive/dandiarchive/

@yarikoptic
Member

interesting! So s3cmd gives me an "empty directory" (although there are no directories on S3), which is different from 'non-existing':

$ s3cmd ls -l s3://dandiarchive/dandiarchive/
                          DIR                                                    s3://dandiarchive/dandiarchive/dandiarchive/
$ s3cmd ls -l s3://dandiarchive/dandiarchives/
$ 

I guess I do not have access to the keys under it, but they are there.

@satra
Member

satra commented Oct 24, 2024

perhaps check through the web account, i.e. log in with the credentials for that account.
