Why vitality is computed via a cloned repo? #367

Closed
claudiu-cristea opened this issue Oct 31, 2023 · 3 comments
Labels
question Further information is requested

Comments

@claudiu-cristea

Both the GitHub and GitLab APIs seem to be able to retrieve commit data via HTTP calls. I wonder what the reason was for fetching such data from a cloned repo rather than making API calls?
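For reference, something like this minimal Go sketch is what I had in mind, using GitHub's documented "list commits" endpoint (OWNER/REPO is a placeholder; a real client would also authenticate and paginate):

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// Minimal fields from GitHub's "list commits" endpoint
// (GET /repos/{owner}/{repo}/commits).
type commit struct {
	SHA    string `json:"sha"`
	Commit struct {
		Author struct {
			Name string `json:"name"`
			Date string `json:"date"`
		} `json:"author"`
	} `json:"commit"`
}

func main() {
	// OWNER/REPO is a placeholder; a real client would also authenticate
	// and follow the Link header for pagination.
	resp, err := http.Get("https://api.github.com/repos/OWNER/REPO/commits")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var commits []commit
	if err := json.NewDecoder(resp.Body).Decode(&commits); err != nil {
		panic(err)
	}
	for _, c := range commits {
		fmt.Println(c.SHA[:7], c.Commit.Author.Date, c.Commit.Author.Name)
	}
}
```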

@bfabio bfabio added the question Further information is requested label Oct 31, 2023
@bfabio
Member

bfabio commented Nov 1, 2023

Hi @claudiu-cristea, good question.

There are multiple reasons. The first one is simply historical: "it has always been like that". When vitality was first introduced, it was generic repo-activity code (2e93c65), and I believe the direction was not yet clear about what it would include.

I presume using the repo was the quickest way to iterate while remaining flexible enough to add metrics not returned by the code forge. Another likely reason was being able to use the same generic git code irrespective of the original source location, rather than implementing it for each forge's API.
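To illustrate what I mean by generic git code, here's a rough sketch using go-git, a pure-Go git library (not necessarily the exact code paths the crawler uses) - the same loop works whether the URL points at GitHub, GitLab, or a plain git server:

```go
package main

import (
	"fmt"
	"os"

	git "github.com/go-git/go-git/v5"
	"github.com/go-git/go-git/v5/plumbing/object"
)

func main() {
	dir, err := os.MkdirTemp("", "vitality-")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	// Bare-clone the repo; the URL can point at GitHub, GitLab, or any
	// plain git server - the code below doesn't care.
	repo, err := git.PlainClone(dir, true, &git.CloneOptions{
		URL: "https://example.com/any/repo.git",
	})
	if err != nil {
		panic(err)
	}

	// Walk the commit history from HEAD; no forge API involved.
	iter, err := repo.Log(&git.LogOptions{})
	if err != nil {
		panic(err)
	}
	_ = iter.ForEach(func(c *object.Commit) error {
		fmt.Println(c.Hash.String()[:7], c.Author.When.Format("2006-01-02"), c.Author.Name)
		return nil
	})
}
```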

However, as time went on, relying solely on the API proved problematic because GitHub enforces an API rate limit, even with an authenticated token. I would argue that this limit is the scarcest resource the crawler has - its disk/CPU footprint is already minimal - and we had to recognize and treat it as such, however convenient the API was.
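To make the constraint concrete: authenticated requests to GitHub's REST API are capped at 5,000 per hour, and every response reports the remaining budget in headers. A quick sketch for inspecting it (the /rate_limit endpoint itself doesn't count against the quota):

```go
package main

import (
	"fmt"
	"net/http"
)

func main() {
	// GitHub reports the rate-limit state in the headers of every API
	// response; /rate_limit is a cheap way to check without spending quota.
	resp, err := http.Get("https://api.github.com/rate_limit")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	fmt.Println("limit:    ", resp.Header.Get("X-RateLimit-Limit"))
	fmt.Println("remaining:", resp.Header.Get("X-RateLimit-Remaining"))
	fmt.Println("reset at: ", resp.Header.Get("X-RateLimit-Reset")) // Unix epoch seconds
}
```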

A while ago, the crawler employed many goroutines (#189), which created the need for multiple GitHub tokens; we even had an awkward dedicated configuration file for them (removed in #302). That approach brought its own complexity: managing and tracking multiple tokens, maintaining the code for it, handling parallelism bugs, synchronizing the goroutines, picking a token that hadn't hit its limit - and occasionally we'd still hit some limits anyway.

The crawler wasn't even more efficient for it, because our de facto synchronization primitive turned out to be GitHub itself, with its backoff limits!

So, in the refactor, we simplified things, resulting in faster crawling and a more straightforward, robust design. Now, if we want to scale, we do so with more processes rather than more threads.

Additionally, cloning the repo allowed us to:

That's the background, but moving forward, I believe it would be logical to generalize metadata retrieval in the scanners and introduce a new git-only scanner consistent with the others' interfaces (#196, #132).
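Purely as an illustration of that direction - the names and signature below are made up, not actual crawler code (see #196 and #132 for the real discussion) - the shared interface could look something like:

```go
package scanner

import "time"

// RepoMetadata is an illustrative container for what the vitality index
// needs; the actual fields would come out of the generalization work.
type RepoMetadata struct {
	CommitDates []time.Time
	Authors     map[string]int // author -> commit count
}

// Scanner is a hypothetical common interface that the forge-specific
// scanners and a new git-only scanner could all implement.
type Scanner interface {
	ScanRepo(url string) (*RepoMetadata, error)
}
```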

The primary question is: does bare-cloning a repo and scanning it locally consume fewer or more resources (I think it consumes more) with respect to GitHub limits? And what trade-offs are we willing to make?
We always have to keep an eye on the API calls and try to be efficient; maybe we can minimize them by caching something, if possible, especially when we scale up.
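One concrete caching avenue: GitHub supports conditional requests, and a 304 Not Modified reply doesn't count against the rate limit. A rough sketch (the in-memory map stands in for a real persistent cache; OWNER/REPO is a placeholder):

```go
package main

import (
	"fmt"
	"net/http"
)

// etags remembers the ETag of the last response per URL; a real crawler
// would persist this (and the cached body) between runs.
var etags = map[string]string{}

func fetch(url string) (*http.Response, error) {
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		return nil, err
	}
	if etag, ok := etags[url]; ok {
		// If nothing changed, GitHub answers 304 Not Modified, which
		// doesn't count against the rate limit.
		req.Header.Set("If-None-Match", etag)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	if etag := resp.Header.Get("ETag"); etag != "" {
		etags[url] = etag
	}
	return resp, nil
}

func main() {
	for i := 0; i < 2; i++ {
		resp, err := fetch("https://api.github.com/repos/OWNER/REPO")
		if err != nil {
			panic(err)
		}
		resp.Body.Close()
		fmt.Println("status:", resp.Status) // second call should be 304 if unchanged
	}
}
```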

Although I've used GitHub as an example, the same considerations could apply to other code forges.

@claudiu-cristea
Author

@bfabio,

Thank you so much for your detailed response and apologies for the delay.

As I reverse-engineered the vitality index to fully understand the algorithm, I think I now understand why a clone has been used. The API probably can't deliver such a large volume of data without hitting limits. I can also see the benefits of having the repos locally. And, true, I was looking at repos through the lens of "code hosting platforms" (GitHub, GitLab, etc.). What about plain repos hosted on a server? Good point!

@claudiu-cristea
Author

I will close this issue, as the initial question has been answered.
