Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SET-483 Set up job to generate image statistics page on a schedule #202

Conversation

dancoates
Copy link
Contributor

@dancoates dancoates commented Mar 21, 2025

This PR adds a new script and github workflow to generate a page containing statistics for our usage of images.

To generate the page first install dependencies then run:

python scripts/get_image_statistics.py

Then in the scripts/image_statistics/ folder run:

npm install
npm run dev

This should open up a browser with the page running

@dancoates dancoates changed the title Set 483 set up job to generate image statistics page on a schedule SET-483 Set up job to generate image statistics page on a schedule Mar 21, 2025
@dancoates dancoates force-pushed the SET-483-set-up-job-to-generate-image-statistics-page-on-a-schedule branch from 0b566a4 to 075b3be Compare March 21, 2025 04:07
@violetbrina
Copy link
Collaborator

Let me know the best way to functionally test this code and I'm happy to review once it's no longer a draft 😊

@dancoates dancoates force-pushed the upgrade-from-black-to-ruff branch from 7489545 to 2058a91 Compare March 24, 2025 21:50
@dancoates dancoates force-pushed the SET-483-set-up-job-to-generate-image-statistics-page-on-a-schedule branch from 3729e67 to 4fb08e7 Compare March 24, 2025 22:06
@dancoates dancoates force-pushed the upgrade-from-black-to-ruff branch from 2058a91 to f88c793 Compare March 24, 2025 22:10
Base automatically changed from upgrade-from-black-to-ruff to main March 24, 2025 22:12
@dancoates dancoates force-pushed the SET-483-set-up-job-to-generate-image-statistics-page-on-a-schedule branch from 4fb08e7 to 54f7f52 Compare March 24, 2025 22:13
@dancoates dancoates marked this pull request as ready for review March 24, 2025 22:14
@vivbak vivbak requested review from milo-hyben and nevoodoo March 24, 2025 22:55
@dancoates dancoates removed the request for review from jmarshall March 24, 2025 22:55
@dancoates dancoates force-pushed the SET-483-set-up-job-to-generate-image-statistics-page-on-a-schedule branch from e98792b to ae15d7c Compare March 24, 2025 23:14
Copy link
Contributor

@nevoodoo nevoodoo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! Would still recommend an additional review from @milo-hyben before merging however :)

Copy link
Contributor

@milo-hyben milo-hyben left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking amazing, have not used Observable Framework before, so I've learned a lot from this PR.
I had just one small comment regarding BQ table, but it can be ignored.

protoPayload.authenticationInfo.principalEmail as principal_subject,
protoPayload.requestMetadata.callerIp as request_ip,
protoPayload.requestMetadata.callerSuppliedUserAgent as request_user_agent
from cpg-common.image_logs.historical_logs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding BQ table 'cpg-common.image_logs.historical_logs', this table has no partition, timestamp as DAY would be a good candidate, then this query wouldn't do a full scan. This is not a big deal as the table is relatively small.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, definitely a good call out, as in the general case could easily incur large costs without partitioning. In this case though, the table is pretty tiny, only 100k rows, and will never grow in size as any new logs go into the other partitioned table, so hopefully will be all good!

Makefile is just for easier running of commands like the
compile-requirements one which is best run via docker to avoid
architecture conflicts. dev/non-dev requirements are a bit of a funny
distinction in this repo as there isn't really a "production" version, but as a
rough guide intent is to have dev requirements be devops stuff and
requirements be the actual libs needed to run the code
This helper accepts an image repository name as an argument and returns
a list of Image dataclasses. The Image dataclass has some helpful
methods to better parse the image data and determine whether the image
is an active or archived image
The script to get the images stats pulls data from the artifact registry
and from the bigquery table which is a sink for image audit logs.

This data is output as parquet files, the page to analyze the data is
built using observable framework which uses a series of duckdb queries
to drill down into the image data and display the results in interactive
tables and charts
this way is a bit cleaner, no need to run with `-m`, and the imports
don't have to be relative either
"pre_archived" makes less sense than archived/active seeing as there is
a need to transform both ways
currently set to run every day in the middle of the night.
having it above breaks rendering
images-tmp isn't used any more
this works better to delete existing files at destination, as doing a
separate rm command fails if there are no files to delete
rather than writing to a different folder each day. This way you don't
have to put the current date in the URL. There isn't really any need to
save old versions of the page
@dancoates dancoates force-pushed the SET-483-set-up-job-to-generate-image-statistics-page-on-a-schedule branch from 4ac1522 to 8e29a50 Compare March 27, 2025 22:44
@dancoates dancoates merged commit 1b65c62 into main Mar 27, 2025
2 checks passed
@dancoates dancoates deleted the SET-483-set-up-job-to-generate-image-statistics-page-on-a-schedule branch March 27, 2025 22:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants