FOI: minimalistic test case #107

Closed
yarikoptic opened this issue Dec 22, 2024 · 10 comments
Labels
informational For Our Information

Comments

@yarikoptic
Member

We now have a bucket dedicated to inventories (the main inventory still goes into dandiarchive for now), with one inventory for the dandisets/ prefix of the dandiarchive bucket:

dandi@drogon:~$ aws s3 ls s3://dandiarchive-inventory/dandiarchive/dandiset-manifest/
                           PRE 2024-12-20T01-00Z/
                           PRE 2024-12-21T01-00Z/
                           PRE data/
                           PRE hive/
dandi@drogon:~$ aws s3 ls s3://dandiarchive-inventory/dandiarchive/dandiset-manifest/data/
2024-12-21 10:27:16          0 
2024-12-20 16:21:18   28711737 aeee7c6c-e8e8-4caf-adac-b6ee3bced18d.csv.gz
2024-12-21 13:23:46   28713840 c15823ab-9099-41b8-97b5-9cfb5fd198e6.csv.gz

which should be (I didn't check) identical to the main inventory in terms of the columns collected, but it resides in a non-public bucket (accessible from dandi at drogon).

(Note the 0-byte empty key in the data/ listing -- the one @jwodder observed, which is now ignored since #96.)
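
For checking that, a minimal boto3 sketch (assuming the standard S3 Inventory layout, where each dated prefix holds a manifest.json listing the fileSchema columns and the data/*.csv.gz files, and assuming credentials that can read this non-public bucket):

# Sketch: fetch the manifest.json for one of the dated prefixes listed above
# and print the inventory column schema plus the data files it points to.
import json

import boto3  # needs credentials with read access to dandiarchive-inventory

s3 = boto3.client("s3")
manifest_key = "dandiarchive/dandiset-manifest/2024-12-21T01-00Z/manifest.json"
obj = s3.get_object(Bucket="dandiarchive-inventory", Key=manifest_key)
manifest = json.load(obj["Body"])

print(manifest["fileSchema"])   # comma-separated list of inventory columns
for f in manifest["files"]:     # the csv.gz files under data/
    print(f["key"], f["size"])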

@jwodder
Member

jwodder commented Jan 6, 2025

@yarikoptic

@yarikoptic
Member Author

  • let's assume that
  • yes
dandi@drogon:~$ aws s3api list-objects --bucket dandiarchive --prefix dandisets/ --output json --query "[sum(Contents[].Size), length(Contents[])]" 
[
    1475663939,
    5462
]
>>> humanize.naturalsize(1475663939)
'1.5 GB'
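
The same numbers can be reproduced with boto3 (a sketch; the paginator keeps going past the 1000-keys-per-response limit):

# Sketch: sum sizes and count keys for the current versions under dandisets/,
# equivalent to the aws s3api query above.
import boto3

s3 = boto3.client("s3")
total_size = 0
n_keys = 0
for page in s3.get_paginator("list_objects_v2").paginate(
    Bucket="dandiarchive", Prefix="dandisets/"
):
    for obj in page.get("Contents", []):
        total_size += obj["Size"]
        n_keys += 1
print(total_size, n_keys)  # ~1475663939 bytes across ~5462 keys, i.e. ~1.5 GB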

@jwodder
Member

jwodder commented Jan 6, 2025

@yarikoptic This new bucket requires credentials to access, and my Dandi AWS credentials don't seem to be accepted.

@yarikoptic
Member Author

@satra, would you be so kind, whenever you get a minute, to generate an "IAM for backup" with access to the new dandiarchive-inventory bucket AND the dandiarchive bucket (fully, not just the public portions, since it is for backup), ideally read-only?

@jwodder, I also wonder how it could/should work in the scope of s3invsync if there were a need for two separate IAMs for the two different buckets/locations (one for the inventory and one for the actual storage). Would that be easy to support? (I think we can avoid it ATM, just curious.)

@jwodder
Member

jwodder commented Jan 6, 2025

@yarikoptic I believe that's what AWS profiles are for. s3invsync should already honor the AWS_PROFILE environment variable, but support for a --profile CLI option could be added.

EDIT: I misinterpreted your question; I thought you were asking about running s3invsync against two different buckets in separate invocations which each needed their own credentials. I think situations in which the inventory bucket and data bucket require different credentials should be avoided in the first place.
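
For reference, a sketch of how per-bucket credentials could be selected via named AWS profiles in boto3 (the "backup" profile name is hypothetical):

# Sketch: use separate named profiles from ~/.aws/credentials for the two buckets
# ("backup" is a hypothetical profile name, not an existing one).
import boto3

inventory_s3 = boto3.Session(profile_name="backup").client("s3")  # inventory bucket credentials
data_s3 = boto3.Session().client("s3")  # default credentials (or whatever AWS_PROFILE points at)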

@jwodder
Member

jwodder commented Jan 6, 2025

@yarikoptic That value of 1.5 GB is not correct. The backup's been running for less than a minute, and it's already downloaded 4.7 GB.

@yarikoptic
Member Author

Right, most likely that is because 1.5 GB covers only the current version(s), and many of those files have multiple (actually -- many) versions for "draft"s. With that in mind I asked chatgpt again, and it gave the following recipe, which produced the following answer(s):

dandi@drogon:~$ aws s3api list-object-versions --bucket dandiarchive --prefix dandisets/ --output json --query "[sum(Versions[].Size), length(Versions[])]"
[
    1470886105794,
    490337
]

dandi@drogon:~$ python3
Python 3.11.2 (main, Aug 26 2024, 07:20:54) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import humanize
>>> humanize.naturalsize(1470886105794,)
'1.5 TB'

so -- indeed it is notably larger! Actually, GC of those older versions was introduced 4 days ago by @mvandenburgh in

for the staging bucket and eventually, upon verification of correct operation there, for the production one. I would expect that size to shrink considerably then! And we would have a nice example of older versions being pruned from S3 while we still keep them in the backup (we might eventually want to add a similar pruning policy for prior versions in the backup, I guess).
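
For reference, the boto3 equivalent of that recipe, summing across all object versions rather than only the current ones (a sketch; delete markers carry no size and are not counted here):

# Sketch: total size and count across *all* versions under dandisets/.
import boto3

s3 = boto3.client("s3")
total_size = 0
n_versions = 0
for page in s3.get_paginator("list_object_versions").paginate(
    Bucket="dandiarchive", Prefix="dandisets/"
):
    for v in page.get("Versions", []):  # delete markers are under page["DeleteMarkers"]
        total_size += v["Size"]
        n_versions += 1
print(total_size, n_versions)  # ~1470886105794 bytes across ~490337 versions, i.e. ~1.5 TB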

@yarikoptic
Member Author

I guess, for now, it might be worth testing with some match regex to limit the run to some subset of dandisets.
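
A sketch of how the footprint of such a subset could be estimated up front (the regex below, matching a couple of dandiset IDs, is purely illustrative):

# Sketch: estimate the total size of all versions whose keys match a subset regex
# (the pattern is a hypothetical example, not the one to actually use).
import re

import boto3

pattern = re.compile(r"^dandisets/00000[35]/")  # hypothetical subset of dandisets
s3 = boto3.client("s3")
subset_size = sum(
    v["Size"]
    for page in s3.get_paginator("list_object_versions").paginate(
        Bucket="dandiarchive", Prefix="dandisets/"
    )
    for v in page.get("Versions", [])
    if pattern.match(v["Key"])
)
print(subset_size)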

@jwodder
Member

jwodder commented Jan 6, 2025

@yarikoptic I already started a backup without a regex filter, and 1.5 TB should fit on the disk. Based on the number of files downloaded so far, I calculate that the process will take about 11 hours to finish.
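
For the record, the kind of extrapolation behind that estimate, as a sketch with illustrative numbers (none of these counts are the actual ones):

# Sketch: naive file-count extrapolation; it assumes a uniform per-file rate,
# which is why the estimate can be far off when file sizes vary a lot.
elapsed_s = 60         # hypothetical: about a minute into the run
files_done = 750       # hypothetical number of files downloaded so far
files_total = 490_337  # total object versions, from the listing above
rate = files_done / elapsed_s
eta_hours = (files_total - files_done) / rate / 3600
print(f"~{eta_hours:.0f} h")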

@jwodder
Member

jwodder commented Jan 6, 2025

Well, that was the wrong way to estimate the runtime. It actually took almost exactly two hours.

@jwodder jwodder added the informational For Our Information label Jan 8, 2025
@jwodder jwodder closed this as completed Jan 8, 2025