Git Blame Cache #1848

Open
holodorum opened this issue Feb 18, 2025 · 5 comments
Labels
acknowledged (an issue is accepted as shortcoming to be fixed), enhancement (New feature or request), help wanted (Extra attention is needed)

Comments

@holodorum

Summary 💡

Would there be any interest in a git blame cache? I came across this thread, where the idea is discussed along with some links to existing implementations.
This thesis also implemented something similar for a company.
With the recent speed improvements achieved by @cruessler, I think this might be a nice addition to beat Git, and I would be interested in giving the implementation a shot.

Motivation 🔦

No response

@holodorum added the enhancement (New feature or request) label on Feb 18, 2025
@jtwaleson

Adding a bit of context here, as I worked on this with @holodorum .

I'm working on a product that tracks code ownership / code age for all files in many repos. For that, we'd like to have very fast git blame statistics. The solution we came up with is to create checkpoints at specific commits and store the full blame coverage for those. When we want to calculate the blame for a newer commit, we only have to go back as far as the latest checkpoint. Each checkpoint stores the blame as chunks, that is, (commit_id, start_line, end_line) tuples.
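To make the chunk layout concrete, here is a minimal sketch of what such a checkpoint could look like in Rust. The type names (`BlameChunk`, `FileCheckpoint`), the string commit ids, and the 1-based inclusive line ranges are all assumptions made for illustration; they are not the actual storage format and not part of gix-blame.

```rust
use std::collections::BTreeMap;

/// One contiguous run of lines attributed to a single commit.
/// Hypothetical type: field names mirror the (commit_id, start_line, end_line) tuple.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct BlameChunk {
    /// Hex commit id; a real implementation would likely use a binary object id.
    pub commit_id: String,
    /// First line of the chunk (assumed 1-based, inclusive).
    pub start_line: u32,
    /// Last line of the chunk (assumed inclusive).
    pub end_line: u32,
}

/// Full blame coverage of one file at a checkpoint commit (hypothetical type).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct FileCheckpoint {
    pub checkpoint_commit: String,
    pub chunks: Vec<BlameChunk>,
}

impl FileCheckpoint {
    /// True if the chunks are sorted, non-overlapping, and cover
    /// lines 1..=total_lines without gaps.
    pub fn covers(&self, total_lines: u32) -> bool {
        let mut next: u64 = 1;
        for chunk in &self.chunks {
            if u64::from(chunk.start_line) != next || chunk.end_line < chunk.start_line {
                return false;
            }
            next = u64::from(chunk.end_line) + 1;
        }
        next == u64::from(total_lines) + 1
    }

    /// Number of lines attributed to each commit, the kind of statistic an
    /// ownership or code-age report needs. Assumes `covers` has been checked,
    /// so every chunk satisfies start_line <= end_line.
    pub fn lines_per_commit(&self) -> BTreeMap<String, u32> {
        let mut stats = BTreeMap::new();
        for chunk in &self.chunks {
            *stats.entry(chunk.commit_id.clone()).or_insert(0) +=
                chunk.end_line - chunk.start_line + 1;
        }
        stats
    }
}
```

A checkpoint that passes `covers` assigns every line of the file to exactly one commit, which is what makes it usable as a starting point for blaming a newer commit.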

We're storing the chunks in RocksDB, so I think it's too specific to include in gitoxide itself, but we were wondering whether there is some appetite for restructuring the gix-blame code to make it easy to add a cache.

@Byron added the help wanted (Extra attention is needed) and acknowledged (an issue is accepted as shortcoming to be fixed) labels on Feb 20, 2025
@Byron
Member

Byron commented Feb 20, 2025

Thanks a lot for sharing - I am very interested in this line of work!
Right now, without a cache and from my imperfect memory, we are easily slower by a factor of 2. This is usually due to doing more work than Git and not being able to use certain optimizations. But even in plain blames without the use of the commit-graph, we are typically significantly slower than Git.

Making the code more cache-friendly and generally having more eyes concerned with performance on it is very much in my interest, so please feel free to join the efforts in making gix-blame as fast as (and faster than) Git :D.

As for the implementation of a cache, there already is `ein t query`, which builds up a SQLite database with a lot of expensive-to-compute derived data, some of which is also useful for blame. In theory, gix-blame could be abstracted more to allow using such caches, and possibly also to generate checkpoints more easily.
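One way to picture such an abstraction is a small trait that a blame routine consults before doing any work. This is only a sketch under assumptions: `BlameCache`, `InMemoryCache`, and `blame_with_cache` are hypothetical names that do not exist in gix-blame, and the in-memory map stands in for whatever backend (SQLite, RocksDB, ...) a tool would actually use.

```rust
use std::collections::HashMap;

/// (commit_id, start_line, end_line), as in the chunk tuples discussed earlier.
pub type Chunk = (String, u32, u32);

/// Hypothetical storage interface; not part of gix-blame.
pub trait BlameCache {
    /// Cached blame of `path` as of `commit`, if present.
    fn get(&self, commit: &str, path: &str) -> Option<Vec<Chunk>>;
    /// Stores the blame of `path` as of `commit`.
    fn put(&mut self, commit: &str, path: &str, chunks: Vec<Chunk>);
}

/// Simplest possible backend: a map keyed by (commit, path).
#[derive(Default)]
pub struct InMemoryCache {
    entries: HashMap<(String, String), Vec<Chunk>>,
}

impl BlameCache for InMemoryCache {
    fn get(&self, commit: &str, path: &str) -> Option<Vec<Chunk>> {
        self.entries
            .get(&(commit.to_owned(), path.to_owned()))
            .cloned()
    }

    fn put(&mut self, commit: &str, path: &str, chunks: Vec<Chunk>) {
        self.entries
            .insert((commit.to_owned(), path.to_owned()), chunks);
    }
}

/// Returns the cached blame if there is one; otherwise runs `compute`
/// (a stand-in for the real blame algorithm) and stores its result.
pub fn blame_with_cache<C: BlameCache>(
    cache: &mut C,
    commit: &str,
    path: &str,
    compute: impl FnOnce() -> Vec<Chunk>,
) -> Vec<Chunk> {
    if let Some(hit) = cache.get(commit, path) {
        return hit;
    }
    let chunks = compute();
    cache.put(commit, path, chunks.clone());
    chunks
}
```

Keeping the backend behind a trait would let the storage choice stay outside the library while the core algorithm only needs to know how to ask for and hand over chunks.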

Please note that making such changes should serve an actual product or tool so they serve a purpose/satisfy a demand, while having stakeholders that are interested in keeping it functioning.

@holodorum
Author

Thanks for your response! I've submitted an initial pull request that could speed up the blame process by using an existing blame as a starting point.

In the full implementation, we perform a tree diff between the commit for which we have a cache and the new commit. The files that were modified or added can then be blamed using the updated blame::function::file().
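The file-selection step can be sketched roughly as follows. This is a simplified stand-in, not the gix tree-diff API: each tree is assumed to be flattened into a path-to-blob-id map, and `Tree`, `ReblamePlan`, and `plan_reblame` are hypothetical names introduced for illustration.

```rust
use std::collections::BTreeMap;

/// A tree flattened to path -> blob id (assumed representation).
pub type Tree = BTreeMap<String, String>;

/// Which files need work when moving from the cached commit to the new one.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct ReblamePlan {
    /// Added or modified files: these must be blamed again.
    pub reblame: Vec<String>,
    /// Files with an identical blob id: the cached blame can be reused as-is.
    pub reuse: Vec<String>,
    /// Files that no longer exist: their cached blame can be discarded.
    pub dropped: Vec<String>,
}

/// Compares the tree of the cached commit with the tree of the new commit.
/// Renames are not detected here; they show up as one dropped and one added path.
pub fn plan_reblame(cached: &Tree, new: &Tree) -> ReblamePlan {
    let mut plan = ReblamePlan::default();
    for (path, blob) in new {
        match cached.get(path) {
            Some(old_blob) if old_blob == blob => plan.reuse.push(path.clone()),
            _ => plan.reblame.push(path.clone()),
        }
    }
    for path in cached.keys() {
        if !new.contains_key(path) {
            plan.dropped.push(path.clone());
        }
    }
    plan
}
```

Only the paths in `reblame` would then be passed on to the blame function, with the cached blame for the old version of each modified file available as a starting point.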

I wasn't aware of `ein t query`; it looks quite powerful. In our implementation, we used resource_cache to accelerate diffing. If I understand correctly, the resource cache optimizes the diffing of blobs, while `ein t query` only helps to speed up tree diffs.

@jtwaleson

> Please note that making such changes should serve an actual product or tool so they serve a purpose/satisfy a demand, while having stakeholders that are interested in keeping it functioning.

@Byron It's definitely solving an actual problem for my product (in stealth), but it's a tiny bootstrapped startup, so I can't guarantee any long-term commitment yet :) Please be aware of that when reviewing / accepting changes from our side.

@Byron
Member

Byron commented Feb 22, 2025

Thanks for being so upfront about it, much appreciated!

My hope is that accepting cache support in some shape or form also means that you can submit fixes and improvements to the core algorithm in the future, without having to fork and maintain your own.
If this resonates in some way, it should work out fine.
