-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SET-483 Set up job to generate image statistics page on a schedule #202
SET-483 Set up job to generate image statistics page on a schedule #202
Conversation
0b566a4
to
075b3be
Compare
Let me know the best way to functionally test this code and I'm happy to review once it's no longer a draft 😊 |
7489545
to
2058a91
Compare
3729e67
to
4fb08e7
Compare
2058a91
to
f88c793
Compare
4fb08e7
to
54f7f52
Compare
e98792b
to
ae15d7c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me! Would still recommend an additional review from @milo-hyben before merging however :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking amazing, have not used Observable Framework before, so I've learned a lot from this PR.
I had just one small comment regarding BQ table, but it can be ignored.
protoPayload.authenticationInfo.principalEmail as principal_subject, | ||
protoPayload.requestMetadata.callerIp as request_ip, | ||
protoPayload.requestMetadata.callerSuppliedUserAgent as request_user_agent | ||
from cpg-common.image_logs.historical_logs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding BQ table 'cpg-common.image_logs.historical_logs', this table has no partition, timestamp as DAY would be a good candidate, then this query wouldn't do a full scan. This is not a big deal as the table is relatively small.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, definitely a good call out, as in the general case could easily incur large costs without partitioning. In this case though, the table is pretty tiny, only 100k rows, and will never grow in size as any new logs go into the other partitioned table, so hopefully will be all good!
Makefile is just for easier running of commands like the compile-requirements one which is best run via docker to avoid architecture conflicts. dev/non-dev requirements are a bit of a funny distinction in this repo as there isn't really a "production" version, but as a rough guide intent is to have dev requirements be devops stuff and requirements be the actual libs needed to run the code
This helper accepts an image repository name as an argument and returns a list of Image dataclasses. The Image dataclass has some helpful methods to better parse the image data and determine whether the image is an active or archived image
The script to get the images stats pulls data from the artifact registry and from the bigquery table which is a sink for image audit logs. This data is output as parquet files, the page to analyze the data is built using observable framework which uses a series of duckdb queries to drill down into the image data and display the results in interactive tables and charts
this way is a bit cleaner, no need to run with `-m`, and the imports don't have to be relative either
"pre_archived" makes less sense than archived/active seeing as there is a need to transform both ways
currently set to run every day in the middle of the night.
having it above breaks rendering
images-tmp isn't used any more
this works better to delete existing files at destination, as doing a separate rm command fails if there are no files to delete
rather than writing to a different folder each day. This way you don't have to put the current date in the URL. There isn't really any need to save old versions of the page
4ac1522
to
8e29a50
Compare
This PR adds a new script and github workflow to generate a page containing statistics for our usage of images.
To generate the page first install dependencies then run:
Then in the
scripts/image_statistics/
folder run:This should open up a browser with the page running