FOI: minimalistic test case #107
@yarikoptic This new bucket requires credentials to access, and my Dandi AWS credentials don't seem to be accepted.
@satra would you be so kind, whenever you get a minute, to generate an "IAM for backup" with access to the new dandiarchive-inventory bucket AND the dandiarchive bucket (fully, not just the public portions, since this is for backup), ideally read-only. @jwodder, I also wonder how it could/should work in the scope of s3invsync if there were a need for two separate IAMs for the two different buckets/locations (one for the inventory and one for the actual storage). Would that be easy to support? (I think we can avoid it for now, just curious)
@yarikoptic I believe that's what AWS profiles are for. EDIT: I misinterpreted your question; I thought you were asking about running
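For what it's worth, a minimal sketch of the two-credential setup being discussed, assuming two named AWS profiles (the profile names here are made up); whether s3invsync itself would accept different profiles for the inventory and data buckets is a separate question:

```python
import boto3

# Hypothetical ~/.aws/credentials with two named profiles, e.g.:
#   [dandi-inventory]
#   aws_access_key_id = ...
#   aws_secret_access_key = ...
#   [dandi-backup]
#   aws_access_key_id = ...
#   aws_secret_access_key = ...
#
# One boto3 Session per profile, each yielding its own S3 client, so the
# inventory bucket and the data bucket could be read with different
# credentials if that ever became necessary.
inventory_s3 = boto3.Session(profile_name="dandi-inventory").client("s3")
data_s3 = boto3.Session(profile_name="dandi-backup").client("s3")

# e.g. list the top of each bucket with its own credentials
print(inventory_s3.list_objects_v2(Bucket="dandiarchive-inventory", MaxKeys=5))
print(data_s3.list_objects_v2(Bucket="dandiarchive", MaxKeys=5))
```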
@yarikoptic That value of 1.5 GB is not correct. The backup's been running for less than a minute, and it's already downloaded 4.7 GB.
Right, most likely it is because 1.5 GB covers only the current version(s), and many of those files have multiple (actually -- many) versions for the "draft"s. With that in mind I asked ChatGPT again, and it gave the following recipe, which produced the following answer(s):

```
dandi@drogon:~$ aws s3api list-object-versions --bucket dandiarchive --prefix dandisets/ --output json --query "[sum(Versions[].Size), length(Versions[])]"
[
    1470886105794,
    490337
]
dandi@drogon:~$ python3
Python 3.11.2 (main, Aug 26 2024, 07:20:54) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import humanize
>>> humanize.naturalsize(1470886105794)
'1.5 TB'
```

So -- indeed, it is notably larger! Actually, GC on those was introduced 4 days ago by @mvandenburgh for the staging bucket and will eventually, upon verification of correct operation there, be applied to the production one. I would expect that size to shrink considerably then! We would also have a nice example of older versions being pruned from S3 while we still keep them in the backup (we might eventually want to add a similar pruning policy for prior versions in the backup, I guess).
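For reference, a minimal boto3 sketch doing the same computation as the CLI query above (it assumes credentials with permission to list object versions in dandiarchive); it paginates because there are roughly half a million versions:

```python
import boto3
import humanize

# Sum the sizes of all object versions under the dandisets/ prefix,
# mirroring the `aws s3api list-object-versions ... --query` call above.
s3 = boto3.client("s3")
paginator = s3.get_paginator("list_object_versions")

total_bytes = 0
n_versions = 0
for page in paginator.paginate(Bucket="dandiarchive", Prefix="dandisets/"):
    for v in page.get("Versions", []):
        total_bytes += v["Size"]
        n_versions += 1

print(n_versions, humanize.naturalsize(total_bytes))  # ~490337, ~1.5 TB
```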
I guess, for now, it might be worth testing with some match regex to limit it to some subset of dandisets.
@yarikoptic I already started a backup without a regex filter, and 1.5 TB should fit on the disk. Based on the number of files downloaded so far, I calculate that the process will take about 11 hours to finish. |
Well, that was the wrong way to estimate the runtime. It actually took almost exactly two hours.
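As a rough illustration of the arithmetic (using only the figures quoted earlier in this thread), extrapolating from bytes rather than file count would have given yet another number; neither extrapolation is reliable that early in a run:

```python
import humanize

# Figures from this thread: ~1.47 TB across all versions, and ~4.7 GB
# downloaded in roughly the first minute of the backup.
total_bytes = 1_470_886_105_794
bytes_in_first_minute = 4.7e9

eta_hours = total_bytes / bytes_in_first_minute * 60 / 3600
print(f"naive bytes-based ETA: {eta_hours:.1f} h")  # ~5.2 h
# The file-count-based estimate was ~11 h; the actual run took ~2 h,
# presumably because throughput varied over the run.
print(humanize.naturalsize(total_bytes))            # '1.5 TB'
```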
Now we have a bucket dedicated to inventories (the main one still goes into dandiarchive for now), where we have one for the dandisets/ prefix of the dandiarchive bucket:

```
dandi@drogon:~$ aws s3 ls s3://dandiarchive-inventory/dandiarchive/dandiset-manifest/
                           PRE 2024-12-20T01-00Z/
                           PRE 2024-12-21T01-00Z/
                           PRE data/
                           PRE hive/
dandi@drogon:~$ aws s3 ls s3://dandiarchive-inventory/dandiarchive/dandiset-manifest/data/
2024-12-21 10:27:16          0
2024-12-20 16:21:18   28711737 aeee7c6c-e8e8-4caf-adac-b6ee3bced18d.csv.gz
2024-12-21 13:23:46   28713840 c15823ab-9099-41b8-97b5-9cfb5fd198e6.csv.gz
```

which should be (didn't check) identical to the main inventory in terms of the columns collected, but residing in a non-public bucket (accessible from dandi at drogon)
(note the 0-size empty key -- the one @jwodder observed, which is now ignored since #96).
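A minimal sketch of how this layout could be consumed (assuming credentials that can read the non-public inventory bucket, and assuming the dated prefixes follow the standard S3 Inventory convention of containing a manifest.json that enumerates the data/*.csv.gz chunks):

```python
import json
import boto3

s3 = boto3.client("s3")
bucket = "dandiarchive-inventory"
base = "dandiarchive/dandiset-manifest/"

# Find the dated prefixes (skipping data/ and hive/) and take the newest one.
resp = s3.list_objects_v2(Bucket=bucket, Prefix=base, Delimiter="/")
dated = sorted(
    p["Prefix"]
    for p in resp.get("CommonPrefixes", [])
    if p["Prefix"][len(base)].isdigit()
)
latest = dated[-1]  # e.g. ".../2024-12-21T01-00Z/"

# The manifest lists the gzipped CSV files that make up that inventory run.
obj = s3.get_object(Bucket=bucket, Key=latest + "manifest.json")
manifest = json.loads(obj["Body"].read())
for f in manifest["files"]:
    print(f["key"], f["size"])
```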