FOI: minimalistic test case #107

Closed
yarikoptic opened this issue Dec 22, 2024 · 10 comments
Labels
informational For Our Information

Comments

@yarikoptic
Member

We now have a bucket dedicated to inventories (the main inventory still goes into dandiarchive for now), with one inventory for the dandisets/ prefix of the dandiarchive bucket:

dandi@drogon:~$ aws s3 ls s3://dandiarchive-inventory/dandiarchive/dandiset-manifest/
                           PRE 2024-12-20T01-00Z/
                           PRE 2024-12-21T01-00Z/
                           PRE data/
                           PRE hive/
dandi@drogon:~$ aws s3 ls s3://dandiarchive-inventory/dandiarchive/dandiset-manifest/data/
2024-12-21 10:27:16          0 
2024-12-20 16:21:18   28711737 aeee7c6c-e8e8-4caf-adac-b6ee3bced18d.csv.gz
2024-12-21 13:23:46   28713840 c15823ab-9099-41b8-97b5-9cfb5fd198e6.csv.gz

which should be (I didn't check) identical to the main inventory in terms of the columns collected, but it resides in a non-public bucket (accessible from dandi at drogon).

(Note the 0-byte empty key in the data/ listing -- the one @jwodder observed, which is now ignored since #96.)
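
For checking that, a minimal boto3 sketch (assuming the standard S3 Inventory layout, where each dated prefix holds a manifest.json listing the fileSchema columns and the data/*.csv.gz files, and assuming credentials that can read this non-public bucket):

# Sketch: fetch the manifest.json for one of the dated prefixes listed above
# and print the inventory column schema plus the data files it points to.
import json

import boto3  # needs credentials with read access to dandiarchive-inventory

s3 = boto3.client("s3")
manifest_key = "dandiarchive/dandiset-manifest/2024-12-21T01-00Z/manifest.json"
obj = s3.get_object(Bucket="dandiarchive-inventory", Key=manifest_key)
manifest = json.load(obj["Body"])

print(manifest["fileSchema"])   # comma-separated list of inventory columns
for f in manifest["files"]:     # the csv.gz files under data/
    print(f["key"], f["size"])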

@jwodder
Member

jwodder commented Jan 6, 2025

@yarikoptic

@yarikoptic
Member Author

  • let's assume that
  • yes
dandi@drogon:~$ aws s3api list-objects --bucket dandiarchive --prefix dandisets/ --output json --query "[sum(Contents[].Size), length(Contents[])]" 
[
    1475663939,
    5462
]
>>> humanize.naturalsize(1475663939)
'1.5 GB'
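
The same numbers can be reproduced with boto3 (a sketch; the paginator keeps going past the 1000-keys-per-response limit):

# Sketch: sum sizes and count keys for the current versions under dandisets/,
# equivalent to the aws s3api query above.
import boto3

s3 = boto3.client("s3")
total_size = 0
n_keys = 0
for page in s3.get_paginator("list_objects_v2").paginate(
    Bucket="dandiarchive", Prefix="dandisets/"
):
    for obj in page.get("Contents", []):
        total_size += obj["Size"]
        n_keys += 1
print(total_size, n_keys)  # ~1475663939 bytes across ~5462 keys, i.e. ~1.5 GB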

@jwodder
Member

jwodder commented Jan 6, 2025

@yarikoptic This new bucket requires credentials to access, and my Dandi AWS credentials don't seem to be accepted.

@yarikoptic
Member Author

@satra, would you be so kind, whenever you get a minute, to generate an "IAM for backup" with access to the new dandiarchive-inventory bucket AND the dandiarchive bucket (fully, not just the public portions, since it is for backup), ideally read-only?

@jwodder, I also wonder how it could/should work in the scope of s3invsync if there were a need for two separate IAMs for the two different buckets/locations (one for the inventory and one for the actual storage). Would that be easy to support? (I think we can avoid it ATM, just curious.)

@jwodder
Member

jwodder commented Jan 6, 2025

@yarikoptic I believe that's what AWS profiles are for. s3invsync should already honor the AWS_PROFILE environment variable, but support for a --profile CLI option could be added.

EDIT: I misinterpreted your question; I thought you were asking about running s3invsync against two different buckets in separate invocations which each needed their own credentials. I think situations in which the inventory bucket and data bucket require different credentials should be avoided in the first place.
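
For reference, a sketch of how per-bucket credentials could be selected via named AWS profiles in boto3 (the "backup" profile name is hypothetical):

# Sketch: use separate named profiles from ~/.aws/credentials for the two buckets
# ("backup" is a hypothetical profile name, not an existing one).
import boto3

inventory_s3 = boto3.Session(profile_name="backup").client("s3")  # inventory bucket credentials
data_s3 = boto3.Session().client("s3")  # default credentials (or whatever AWS_PROFILE points at)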

@jwodder
Member

jwodder commented Jan 6, 2025

@yarikoptic That value of 1.5 GB is not correct. The backup's been running for less than a minute, and it's already downloaded 4.7 GB.

@yarikoptic
Member Author

Right, most likely that is because 1.5 GB covers only the current version(s), and many of those files have multiple (actually -- many) versions for "draft"s. With that in mind I asked chatgpt again, and it gave the following recipe, which produced the following answer(s):

dandi@drogon:~$ aws s3api list-object-versions --bucket dandiarchive --prefix dandisets/ --output json --query "[sum(Versions[].Size), length(Versions[])]"
[
    1470886105794,
    490337
]

dandi@drogon:~$ python3
Python 3.11.2 (main, Aug 26 2024, 07:20:54) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import humanize
>>> humanize.naturalsize(1470886105794,)
'1.5 TB'

so -- indeed it is notably larger! Actually, GC of those older versions was introduced 4 days ago by @mvandenburgh in

for the staging bucket and eventually, upon verification of correct operation there, for the production one. I would expect that size to shrink considerably then! And we would have a nice example of older versions being pruned from S3 while we still keep them in the backup (we might eventually want to add a similar pruning policy for prior versions in the backup, I guess).
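
For reference, the boto3 equivalent of that recipe, summing across all object versions rather than only the current ones (a sketch; delete markers carry no size and are not counted here):

# Sketch: total size and count across *all* versions under dandisets/.
import boto3

s3 = boto3.client("s3")
total_size = 0
n_versions = 0
for page in s3.get_paginator("list_object_versions").paginate(
    Bucket="dandiarchive", Prefix="dandisets/"
):
    for v in page.get("Versions", []):  # delete markers are under page["DeleteMarkers"]
        total_size += v["Size"]
        n_versions += 1
print(total_size, n_versions)  # ~1470886105794 bytes across ~490337 versions, i.e. ~1.5 TB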

@yarikoptic
Member Author

I guess, for now, it might be worth testing with some match regex to limit the run to some subset of dandisets.
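
A sketch of how the footprint of such a subset could be estimated up front (the regex below, matching a couple of dandiset IDs, is purely illustrative):

# Sketch: estimate the total size of all versions whose keys match a subset regex
# (the pattern is a hypothetical example, not the one to actually use).
import re

import boto3

pattern = re.compile(r"^dandisets/00000[35]/")  # hypothetical subset of dandisets
s3 = boto3.client("s3")
subset_size = sum(
    v["Size"]
    for page in s3.get_paginator("list_object_versions").paginate(
        Bucket="dandiarchive", Prefix="dandisets/"
    )
    for v in page.get("Versions", [])
    if pattern.match(v["Key"])
)
print(subset_size)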

@jwodder
Member

jwodder commented Jan 6, 2025

@yarikoptic I already started a backup without a regex filter, and 1.5 TB should fit on the disk. Based on the number of files downloaded so far, I calculate that the process will take about 11 hours to finish.
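
For the record, the kind of extrapolation behind that estimate, as a sketch with illustrative numbers (none of these counts are the actual ones):

# Sketch: naive file-count extrapolation; it assumes a uniform per-file rate,
# which is why the estimate can be far off when file sizes vary a lot.
elapsed_s = 60         # hypothetical: about a minute into the run
files_done = 750       # hypothetical number of files downloaded so far
files_total = 490_337  # total object versions, from the listing above
rate = files_done / elapsed_s
eta_hours = (files_total - files_done) / rate / 3600
print(f"~{eta_hours:.0f} h")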

@jwodder
Member

jwodder commented Jan 6, 2025

Well, that was the wrong way to estimate the runtime. It actually took almost exactly two hours.

@jwodder jwodder added the informational For Our Information label Jan 8, 2025
@jwodder jwodder closed this as completed Jan 8, 2025