Check Goobi's working storage and clear it out #407

Open · aray-wellcome opened this issue Aug 6, 2021 · 3 comments
Labels: bug (Something isn't working)
aray-wellcome commented Aug 6, 2021

Looking at our AWS bills, it seems we may have far too much stored in Goobi's working storage.

I suspect that since the migration our image cleanup steps haven't been working correctly, which is why I've labeled this a bug.

For example, searching "stepinwork:Image removal" gives me 1100+ items that have been in progress on this step for months.

Can we check what's in our working storage, and how much is in there?

Can we make sure the image removal steps are actually working?

And we'll need to clear out anything that finished beyond the DDS API call more than 90 days ago, as 90 days is how long we'd usually keep the working images.

Alex Chan is a good person to ask about this while I'm away.

aray-wellcome added the bug label Aug 6, 2021
rsehr self-assigned this Aug 9, 2021
mgeerdsen (Contributor) commented

I'll move this back to TODO.
We should also check "stepopen:Image removal" and others.

alexwlchan (Contributor) commented

I've done some clearing out of Goobi's working storage this week. I removed ~10TB of ALTO files and JP2 images which I could match to content in the storage service (and which were therefore definitely redundant). For more details, see the individual tickets (wellcomecollection/platform#5399, wellcomecollection/platform#5404).

The next step in reducing costs is #423, which is quick and easy.

If somebody wanted to clean this up further, here's where I'd start:

  • Look at the JPEG images which appeared alongside JP2s. I found lots of processes that had both JPEG and JP2 images, but only the JP2 images were uploaded to the storage service. Because I couldn't match the JPEG images to the storage service, I didn't delete them on this pass, but they can probably be safely removed.

    I'd do something like this (see the first sketch after this list):

    • Get all the JPEG files in a given process ID prefix
    • Work out what the equivalent JP2 files would have been
    • If all of the equivalent JP2 files were deleted (see attached list for convenience), delete the JPEGs. Otherwise, do nothing.
  • Look at ALTO files which weren't a perfect match. I found some ALTO files which have a file with the same name in the storage service, but the version in Goobi working storage is much bigger. It looks like the ones in working storage are from ABBYY, whereas the ones in the storage service have been through Goobi. Something like:

    • Get all the ALTO files in a given process ID prefix
    • Look for objects with the same filenames in the storage service
    • If all of the ALTO files have an object with the same filename, delete the ALTOs. Otherwise, do nothing.
  • Look at gzip/zip bundles. This is an opportunity for a cheap cleanup – some information is duplicated in compressed/uncompressed form, e.g. https://console.aws.amazon.com/s3/buckets/wellcomedigitalworkflow-workflow-data?region=eu-west-1&showversions=false&prefix=101377%2Fimages%2Fsource%2F

    [Screenshot of the S3 console listing for the 101377/images/source/ prefix, showing the same data in compressed and uncompressed form]

    You can probably delete:

    • the uncompressed XML file – if it's equivalent to what's in the gzip
    • the compressed ZIP of JP2 files – if we deleted all the uncompressed JP2s already, see list below
  • Delete any objects whose filename starts with a '.', e.g. .DS_Store. There are a few thousand of them hanging around in the bucket – enough to be worth deleting if you're already clearing up the bucket, but not worth tackling on their own. (See the second sketch below.)
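Here's a minimal sketch of that first JPEG check, in Python with boto3. The bucket name comes from the console link above; the process prefix (101377/images/) and the .jpg → .jp2 key mapping are assumptions, and the deleted-keys file is the attachment at the end of this comment:

```python
# A rough sketch, not a tested tool.  The prefix and the .jpg -> .jp2
# key mapping are guesses; check a few processes by hand first.
import gzip

import boto3

BUCKET = "wellcomedigitalworkflow-workflow-data"


def load_deleted_jp2_keys(path="deleted_jp2_keys.txt.gz"):
    """Read the attached list of already-deleted JP2 keys, one per line."""
    with gzip.open(path, "rt") as f:
        return {line.strip() for line in f if line.strip()}


def equivalent_jp2(jpeg_key):
    """Guess the JP2 key that corresponds to a JPEG key (same name, .jp2)."""
    return jpeg_key.rsplit(".", 1)[0] + ".jp2"


def jpegs_safe_to_delete(s3, prefix, deleted_jp2_keys):
    """Return the JPEG keys under a prefix, but only if *every* equivalent
    JP2 was already deleted; otherwise return nothing and do nothing."""
    paginator = s3.get_paginator("list_objects_v2")
    jpeg_keys = [
        obj["Key"]
        for page in paginator.paginate(Bucket=BUCKET, Prefix=prefix)
        for obj in page.get("Contents", [])
        if obj["Key"].lower().endswith((".jpg", ".jpeg"))
    ]

    if jpeg_keys and all(equivalent_jp2(k) in deleted_jp2_keys for k in jpeg_keys):
        return jpeg_keys
    return []


if __name__ == "__main__":
    s3 = boto3.client("s3")
    deleted = load_deleted_jp2_keys()

    for key in jpegs_safe_to_delete(s3, "101377/images/", deleted):
        print("would delete:", key)  # swap for a real delete once happy
```

The ALTO check in the second bullet follows the same pattern, except you'd compare filenames against a listing of the storage service rather than against the deleted-keys list.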

deleted_jp2_keys.txt.gz
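And a sketch of the dotfile cleanup from the last bullet – same caveats apply, and listing the whole bucket this way is slow with millions of objects, so an S3 Inventory report may be a better starting point:

```python
# Sketch: delete objects whose *filename* (not the whole key) starts with
# a dot, e.g. "101377/images/.DS_Store".  Untested; run the print first.
import posixpath

import boto3

BUCKET = "wellcomedigitalworkflow-workflow-data"

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

dotfile_keys = [
    obj["Key"]
    for page in paginator.paginate(Bucket=BUCKET)
    for obj in page.get("Contents", [])
    if posixpath.basename(obj["Key"]).startswith(".")
]

print(f"found {len(dotfile_keys)} dotfiles, e.g. {dotfile_keys[:5]}")

# DeleteObjects takes at most 1000 keys per call, so batch the deletions:
for i in range(0, len(dotfile_keys), 1000):
    batch = dotfile_keys[i : i + 1000]
    s3.delete_objects(
        Bucket=BUCKET,
        Delete={"Objects": [{"Key": k} for k in batch]},
    )
```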

alexwlchan self-assigned this Oct 31, 2022
alexwlchan (Contributor) commented Oct 31, 2022

I did another pass of deletions last week, which has had a big effect on the size of the bucket:

3,513,501 objects deleted, total 6.9 TB, saving $80.5285 per month ($966.34 per year)

This was backfilling deletions that are applied to new Goobi processes, but hadn't been applied to stuff from 2021 and earlier:

  • Do we have a folder full of JP2s that are all in the storage service? Duplicates, bin.
  • Do we have a folder full of JPEGs that have similarly-named JP2s in the storage service? Duplicates, bin.
  • Do we have a folder with both a compressed and uncompressed copy of the JP2s or OCR data? Duplicates, bin. (See the sketch after this list.)
  • Do we still have the jpylyzer validation results? No long-term value, bin.
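A minimal sketch of that compressed/uncompressed check, assuming the convention from the 101377 example above, where the gzip sits alongside the original as key + '.gz' (the example key is hypothetical):

```python
# Sketch: only treat an uncompressed object as redundant if a .gz sibling
# exists *and* its decompressed bytes match the original exactly.  Untested.
import gzip

import boto3

BUCKET = "wellcomedigitalworkflow-workflow-data"


def is_redundant_uncompressed_copy(s3, key):
    """True if `key` has a `key + '.gz'` sibling with identical contents."""
    try:
        gz_body = s3.get_object(Bucket=BUCKET, Key=key + ".gz")["Body"].read()
    except s3.exceptions.NoSuchKey:
        return False

    plain_body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
    return gzip.decompress(gz_body) == plain_body


if __name__ == "__main__":
    s3 = boto3.client("s3")

    key = "101377/images/source/alto.xml"  # hypothetical key, for illustration
    if is_redundant_uncompressed_copy(s3, key):
        print("safe to delete:", key)
```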

I have another idea for a deletion pass, but it's going to take a while to pull down all the files to evaluate that.

[Screenshot of S3 bucket metrics showing the drop in bucket size]

alexwlchan removed their assignment Oct 6, 2023
github-project-automation bot moved this to "To do" in 2024, Aug 29, 2024