All the git information can be found inside commits, which are located inside git repositories. Our main steps are the following:
- Collect all repository URLs from an object (org, user, group).
- Clone them with the appropriate authentication.
- Run git commands to extract the information we need on each repository.
- Gather the data and store it in a JSON file.
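The "run git commands" step above can be sketched as follows. This is a minimal illustration, not the project's actual code: the GitObject type and the parseRevList and listObjects helpers are hypothetical names, and `git rev-list --objects --all` is only one possible command for listing the objects of a repository.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"os/exec"
	"strings"
)

// GitObject is a hypothetical record for one object listed by git.
type GitObject struct {
	SHA  string
	Path string
}

// parseRevList parses the output of `git rev-list --objects --all`.
// Each line is "<sha>" or "<sha> <path>". Lines without a path
// (commits, root trees) are skipped. Sub-trees are not filtered out
// in this sketch.
func parseRevList(out string) []GitObject {
	var objects []GitObject
	scanner := bufio.NewScanner(strings.NewReader(out))
	for scanner.Scan() {
		parts := strings.SplitN(scanner.Text(), " ", 2)
		if len(parts) < 2 || parts[1] == "" {
			continue
		}
		objects = append(objects, GitObject{SHA: parts[0], Path: parts[1]})
	}
	return objects
}

// listObjects runs the native git executable inside repoDir and
// returns the parsed objects.
func listObjects(repoDir string) ([]GitObject, error) {
	cmd := exec.Command("git", "-C", repoDir, "rev-list", "--objects", "--all")
	out, err := cmd.Output()
	if err != nil {
		return nil, fmt.Errorf("git rev-list in %s: %w", repoDir, err)
	}
	return parseRevList(string(out)), nil
}

func main() {
	objects, err := listObjects(".")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, o := range objects {
		fmt.Printf("%s\t%s\n", o.SHA, o.Path)
	}
}
```

Keeping the parsing separate from the command execution makes the parser easy to test without a real repository.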
The root package is the abstract implementation of the extractor.
It contains a Pipeline that extracts git information for every git artifact (currently a git file, but commits could be supported too) of every repository of an organization.
The cmd/src-fingerprint package contains the binary code. It reads the configuration from the CLI and the environment, then runs the Pipeline on an organization.
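Reading the configuration from both the CLI and the environment could look like the sketch below. It uses only the standard library flag package, which is an assumption: the real binary may rely on a different CLI library. The Config type and the parseConfig function are hypothetical names. The flag names and the VCS_TOKEN variable are the ones used elsewhere in this document.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"io"
	"os"
)

// Config is a hypothetical container for the settings of one run.
type Config struct {
	Provider    string
	ProviderURL string
	Object      string
	Token       string
}

// parseConfig reads flags from args and the token from the environment.
// getenv is injected so the function can be tested without touching
// the real environment.
func parseConfig(args []string, getenv func(string) string) (Config, error) {
	var cfg Config
	fs := flag.NewFlagSet("collect", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // errors are returned, not printed
	fs.StringVar(&cfg.Provider, "provider", "", "VCS provider, for example github or gitlab")
	fs.StringVar(&cfg.ProviderURL, "provider-url", "", "base URL of a self-hosted provider")
	fs.StringVar(&cfg.Object, "object", "", "organization, user or group to process")
	if err := fs.Parse(args); err != nil {
		return Config{}, err
	}
	cfg.Token = getenv("VCS_TOKEN")
	if cfg.Provider == "" || cfg.Object == "" {
		return Config{}, errors.New("both --provider and --object are required")
	}
	return cfg, nil
}

func main() {
	args := os.Args[1:]
	// Skip the subcommand name if present, since the flag package
	// stops parsing at the first non-flag argument.
	if len(args) > 0 && args[0] == "collect" {
		args = args[1:]
	}
	cfg, err := parseConfig(args, os.Getenv)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	fmt.Printf("provider=%s object=%s token set=%t\n",
		cfg.Provider, cfg.Object, cfg.Token != "")
}
```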
- Build the binary:
  go build ./cmd/src-fingerprint
- Set the VCS_TOKEN environment variable to the GitHub or GitLab token:
  export VCS_TOKEN="<token>"
- Run the binary to read the documentation:
  ./src-fingerprint
- Run on a given user or group:
  ./src-fingerprint collect --provider github --object Uber
  ./src-fingerprint collect --provider-url http://gitlab.example.com --provider gitlab --object Groupe
Streaming is preferred in this scenario to avoid accumulating objects in memory.
What we have done so far to improve performance:
- Write object by object to the output or file, using the JSONL format by default.
- Clone using the native git executable. Libraries written natively in Go tend to clone in memory at some point.
- Limit the number of Go channels.
- GitHub wrapper: "github.com/google/go-github/v36/github"
- Gitlab go wrapper: "github.com/xanzy/go-gitlab"
- Bitbucket wrapper: "github.com/suhaibmujahid/go-bitbucket-server/bitbucket"
- Repository: none (natively wrapped git command)
Using go-git resulted in in-memory cloning (stream to memory, then to a directory). This caused memory peaks that were too high for small VMs.
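Wrapping the native git executable can be as small as the sketch below, which shells out to `git clone` so that the data is written to disk by git itself. The cloneCommand helper is a hypothetical name, and authentication is left out on purpose since the way the token is passed is not described here.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// cloneCommand builds a `git clone` invocation that writes the
// repository at url into dir using the native git executable.
func cloneCommand(url, dir string) *exec.Cmd {
	cmd := exec.Command("git", "clone", "--quiet", url, dir)
	// Forward git's own output so that errors stay visible.
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd
}

func main() {
	if len(os.Args) != 3 {
		fmt.Fprintln(os.Stderr, "usage: clone <url> <dir>")
		os.Exit(2)
	}
	if err := cloneCommand(os.Args[1], os.Args[2]).Run(); err != nil {
		fmt.Fprintln(os.Stderr, "clone failed:", err)
		os.Exit(1)
	}
}
```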
Packaging is done using GoReleaser and nFPM.
You can test packaging using make dist.