De-duplicate API images that are identical #2525

Open
Eric-Arellano opened this issue Dec 23, 2024 · 2 comments

Comments

@Eric-Arellano
Collaborator

Our Git repo size is very large. This is mostly from blobs, like our images and videos.

It appears some of the images are identical, such as https://github.com/Qiskit/documentation/blob/main/public/images/api/qiskit/depth.gif. Storing the same image ~25 times is very inefficient.

An original idea was for historical API docs to use the asset from latest if the blob is bit-for-bit identical. However, there is an edge case: if a new version of latest removes the blob, all the historical docs would point to an asset that no longer exists.

Instead, we could use a folder public/images/api/qiskit/common. If a blob appears in >1 version, we store the blob in /common.

  • There is a risk that the same filename refers to different blobs over time, e.g. version A is in Qiskit 0.19-1.1, then version B is in Qiskit 1.2-1.3+. So, we should probably add a suffix to the file name, like the number of bytes or a hash.
  • Be careful that the algorithm doesn't slow down gen-api too much. To determine whether an image has a duplicate, we need to inspect every other API version, including the new version we're currently generating.
    • Ideally we can do the de-duplication as part of gen-api, rather than as a standalone post-processing step that we sometimes run manually. To keep the Git repo size down, we need to avoid introducing the duplicate binary at all, because once a blob is saved to Git, it is there forever unless we force push.
    • If we set up a new de-duplication, we need to remember to rewrite the link in the historical API version that now has a common blob.
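
To make the /common idea concrete, here is a minimal Python sketch of the planning step. It is not part of gen-api: the function name `plan_common_images`, the assumption that every image under the API image root belongs to some version, and the 8-character SHA-256 suffix are all hypothetical choices for illustration. It only computes which files would move and where; it does not rewrite links or touch Git.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def plan_common_images(api_root: Path) -> dict[Path, Path]:
    """Map each duplicated image to a hash-suffixed path under `common/`.

    Hypothetical helper: `api_root` stands in for a folder like
    public/images/api/qiskit, and anything already in `common/` is skipped.
    """
    # Group files by (filename, content hash) so that two different images
    # that happen to share a filename never collapse into one entry.
    groups: dict[tuple[str, str], list[Path]] = defaultdict(list)
    for path in sorted(api_root.rglob("*")):
        if not path.is_file():
            continue
        if "common" in path.relative_to(api_root).parts:
            continue
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        groups[(path.name, digest)].append(path)

    plan: dict[Path, Path] = {}
    for (name, digest), paths in groups.items():
        if len(paths) < 2:
            continue  # Appears in only one place, so leave it where it is.
        # The hash suffix keeps "version A" and "version B" of the same
        # filename apart inside common/.
        stem, suffix = Path(name).stem, Path(name).suffix
        target = api_root / "common" / f"{stem}-{digest[:8]}{suffix}"
        for path in paths:
            plan[path] = target
    return plan
```

Each key in the returned plan is a file whose links would need rewriting to the shared target. Hashing every file on every run may be too slow for gen-api; comparing file sizes first and hashing only on a size match is one possible shortcut.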
@frankharkins
Member

frankharkins commented Dec 23, 2024

Another approach to the folders is to have newer versions reference images in older versions of Qiskit. E.g., if we add v2.0 and an image is unchanged from v1.4, then v2.0 just points to the v1.4 image, otherwise it adds the new image to the v2.0 folder.
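
A rough Python sketch of this approach, under some assumptions: the function `resolve_image_links`, the in-memory `index` of the most recent copy of each filename, and the `/images/api/qiskit` URL prefix are all hypothetical and only illustrate the decision of "point at the older image or add a new one".

```python
import hashlib


def resolve_image_links(
    index: dict[str, tuple[str, str]],
    new_version: str,
    images: dict[str, bytes],
    base: str = "/images/api/qiskit",  # Assumed URL prefix, for illustration.
) -> tuple[dict[str, str], dict[str, bytes]]:
    """Decide, per image, whether the new version reuses an older copy.

    `index` maps filename -> (sha256 hex digest, path of the most recent
    stored copy) and is updated in place. Returns `links` (filename -> path
    the new version's docs should reference) and `to_write` (path -> bytes
    that actually need to be added to the repo).
    """
    links: dict[str, str] = {}
    to_write: dict[str, bytes] = {}
    for name, data in images.items():
        digest = hashlib.sha256(data).hexdigest()
        known = index.get(name)
        if known is not None and known[0] == digest:
            # Unchanged since an older version: reference that copy.
            links[name] = known[1]
        else:
            # New or changed image: store it under the new version's folder.
            path = f"{base}/{new_version}/{name}"
            to_write[path] = data
            index[name] = (digest, path)
            links[name] = path
    return links, to_write
```

For example, calling this for v1.4 and then for v2.0 with an unchanged `depth.gif` would leave v2.0 linking to the v1.4 path, and only changed images would appear in `to_write` for v2.0. Because older versions are never modified, no historical links need rewriting.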

@Eric-Arellano
Collaborator Author

Eric-Arellano commented Dec 23, 2024

> Another approach to the folders is to have newer versions reference images in older versions of Qiskit. E.g., if we add v2.0 and an image is unchanged from v1.4, then v2.0 just points to the v1.4 image, otherwise it adds the new image to the v2.0 folder.

That's a really good idea because we wouldn't expect images to change in historical API versions. Great suggestion!

When implementing, I suspect Frank's suggestion will be simplest. However, we should evaluate both options and use whichever is simplest and most maintainable.


Update Jan 10, 2025: research whether this actually helps, per Jake's suggestion in #2533 (comment)
