Why vitality is computed via a cloned repo? #367

Closed
claudiu-cristea opened this issue Oct 31, 2023 · 3 comments
Labels
question Further information is requested

Comments

@claudiu-cristea

Both the GitHub and GitLab APIs seem to be able to retrieve commit data via HTTP calls. I wonder what the reason was for fetching such data from a cloned repo rather than making API calls?
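For reference, something like this minimal Go sketch is what I had in mind, using GitHub's documented "list commits" endpoint (OWNER/REPO is a placeholder; a real client would also authenticate and paginate):

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// Minimal fields from GitHub's "list commits" endpoint
// (GET /repos/{owner}/{repo}/commits).
type commit struct {
	SHA    string `json:"sha"`
	Commit struct {
		Author struct {
			Name string `json:"name"`
			Date string `json:"date"`
		} `json:"author"`
	} `json:"commit"`
}

func main() {
	// OWNER/REPO is a placeholder; a real client would also authenticate
	// and follow the Link header for pagination.
	resp, err := http.Get("https://api.github.com/repos/OWNER/REPO/commits")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var commits []commit
	if err := json.NewDecoder(resp.Body).Decode(&commits); err != nil {
		panic(err)
	}
	for _, c := range commits {
		fmt.Println(c.SHA[:7], c.Commit.Author.Date, c.Commit.Author.Name)
	}
}
```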

@bfabio bfabio added the question Further information is requested label Oct 31, 2023
@bfabio
Member

bfabio commented Nov 1, 2023

Hi @claudiu-cristea, good question.

There are multiple reasons. The first one is simply historical: "it has always been like that". When vitality was first introduced, it was generic repo-activity code (2e93c65), and I believe the direction was not yet clear about what it would include.

I presume using the repo was the quickest way to iterate while remaining flexible enough to add metrics not returned by the code forge. Another likely reason was being able to use the same generic git code irrespective of the original source location, rather than implementing it for each forge's API.
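To illustrate what I mean by generic git code, here's a rough sketch using go-git, a pure-Go git library (not necessarily the exact code paths the crawler uses) - the same loop works whether the URL points at GitHub, GitLab, or a plain git server:

```go
package main

import (
	"fmt"
	"os"

	git "github.com/go-git/go-git/v5"
	"github.com/go-git/go-git/v5/plumbing/object"
)

func main() {
	dir, err := os.MkdirTemp("", "vitality-")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	// Bare-clone the repo; the URL can point at GitHub, GitLab, or any
	// plain git server - the code below doesn't care.
	repo, err := git.PlainClone(dir, true, &git.CloneOptions{
		URL: "https://example.com/any/repo.git",
	})
	if err != nil {
		panic(err)
	}

	// Walk the commit history from HEAD; no forge API involved.
	iter, err := repo.Log(&git.LogOptions{})
	if err != nil {
		panic(err)
	}
	_ = iter.ForEach(func(c *object.Commit) error {
		fmt.Println(c.Hash.String()[:7], c.Author.When.Format("2006-01-02"), c.Author.Name)
		return nil
	})
}
```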

However, as time went on, relying solely on the API proved problematic because GitHub enforces an API rate limit, even with an authenticated token. I would argue that this limit is the scarcest resource the crawler has - its disk/CPU footprint is already minimal - and we had to recognize and treat it as such, however convenient the API was.
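To make the constraint concrete: authenticated requests to GitHub's REST API are capped at 5,000 per hour, and every response reports the remaining budget in headers. A quick sketch for inspecting it (the /rate_limit endpoint itself doesn't count against the quota):

```go
package main

import (
	"fmt"
	"net/http"
)

func main() {
	// GitHub reports the rate-limit state in the headers of every API
	// response; /rate_limit is a cheap way to check without spending quota.
	resp, err := http.Get("https://api.github.com/rate_limit")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	fmt.Println("limit:    ", resp.Header.Get("X-RateLimit-Limit"))
	fmt.Println("remaining:", resp.Header.Get("X-RateLimit-Remaining"))
	fmt.Println("reset at: ", resp.Header.Get("X-RateLimit-Reset")) // Unix epoch seconds
}
```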

A while ago, the crawler employed many goroutines (#189), which created the need for multiple GitHub tokens; we even had an awkward dedicated configuration file for them (removed in #302). That approach brought its own complexity: managing and tracking multiple tokens, maintaining the code for it, handling parallelism bugs, synchronizing the goroutines, picking a token that hadn't hit its limit - and occasionally we'd still hit some limits anyway.

The crawler wasn't even more efficient for it, because our de facto synchronization primitive turned out to be GitHub itself, with its backoff limits!

So, in the refactor, we simplified things, resulting in faster crawling and a more straightforward, robust design. Now, if we want to scale, we do so with more processes rather than more threads.

Additionally, cloning the repo allowed us to:

That's the background, but moving forward, I believe it would be logical to generalize metadata retrieval in the scanners and introduce a new git-only scanner consistent with the others' interfaces (#196, #132).
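Purely as an illustration of that direction - the names and signature below are made up, not actual crawler code (see #196 and #132 for the real discussion) - the shared interface could look something like:

```go
package scanner

import "time"

// RepoMetadata is an illustrative container for what the vitality index
// needs; the actual fields would come out of the generalization work.
type RepoMetadata struct {
	CommitDates []time.Time
	Authors     map[string]int // author -> commit count
}

// Scanner is a hypothetical common interface that the forge-specific
// scanners and a new git-only scanner could all implement.
type Scanner interface {
	ScanRepo(url string) (*RepoMetadata, error)
}
```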

The primary question is: does bare-cloning a repo and scanning it locally consume fewer or more resources (I think it consumes more) with respect to GitHub limits? And what trade-offs are we willing to make?
We always have to keep an eye on the API calls and try to be efficient; maybe we can minimize them by caching something, if possible, especially when we scale up.
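One concrete caching avenue: GitHub supports conditional requests, and a 304 Not Modified reply doesn't count against the rate limit. A rough sketch (the in-memory map stands in for a real persistent cache; OWNER/REPO is a placeholder):

```go
package main

import (
	"fmt"
	"net/http"
)

// etags remembers the ETag of the last response per URL; a real crawler
// would persist this (and the cached body) between runs.
var etags = map[string]string{}

func fetch(url string) (*http.Response, error) {
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		return nil, err
	}
	if etag, ok := etags[url]; ok {
		// If nothing changed, GitHub answers 304 Not Modified, which
		// doesn't count against the rate limit.
		req.Header.Set("If-None-Match", etag)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	if etag := resp.Header.Get("ETag"); etag != "" {
		etags[url] = etag
	}
	return resp, nil
}

func main() {
	for i := 0; i < 2; i++ {
		resp, err := fetch("https://api.github.com/repos/OWNER/REPO")
		if err != nil {
			panic(err)
		}
		resp.Body.Close()
		fmt.Println("status:", resp.Status) // second call should be 304 if unchanged
	}
}
```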

Although I've used GitHub as an example, the same considerations could apply to other code forges.

@claudiu-cristea
Author

@bfabio,

Thank you so much for your detailed response and apologies for the delay.

As I reverse-engineered the vitality index to fully understand the algorithm, I think I now understand why a clone has been used. The API probably can't deliver such a large volume of data without hitting limits. I can also see the benefits of having the repos locally. And, true, I was looking at repos through the lens of "code hosting platforms" (GitHub, GitLab, etc.). What about plain repos hosted on a server? Good point!

@claudiu-cristea
Author

I will close this issue, as the initial question has been answered.
