git2bitext: a tool to derive a "git patches" to "commit messages" bitext from a git repo as "source" to "target" language mapping

Utility to bulk-download GitHub repositories and parse their history logs into a dataset for machine translation.

Getting started

Downloading repo data

export CREDENTIALS=<github-user>:<github-api-key>
export LANGUAGE=java
./make_top_github_repos.sh
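For readers who want to see what such a listing step might involve, here is a minimal sketch in Python. It is not taken from make_top_github_repos.sh, whose internals are not shown here. It assumes the list is built from the GitHub repository search API, sorted by stars, and that $CREDENTIALS is usable for HTTP Basic authentication. The function names search_urls and fetch_clone_urls are hypothetical.

```python
import base64
import json
import urllib.request
from urllib.parse import urlencode

# Public GitHub search endpoint; it serves at most 1000 results per query.
API = "https://api.github.com/search/repositories"


def search_urls(language, total=1000, per_page=100):
    """Build one search URL per result page (hypothetical helper)."""
    pages = (total + per_page - 1) // per_page
    return [
        API + "?" + urlencode({
            "q": f"language:{language}",
            "sort": "stars",
            "order": "desc",
            "per_page": per_page,
            "page": page,
        })
        for page in range(1, pages + 1)
    ]


def fetch_clone_urls(url, credentials):
    """Fetch one result page and return the clone URL of each repository.

    `credentials` is assumed to have the form "<github-user>:<github-api-key>",
    as in the CREDENTIALS variable above.
    """
    token = base64.b64encode(credentials.encode()).decode()
    request = urllib.request.Request(url, headers={
        "Authorization": f"Basic {token}",
        "Accept": "application/vnd.github+json",
    })
    with urllib.request.urlopen(request) as response:
        data = json.load(response)
    return [item["clone_url"] for item in data["items"]]
```

Calling fetch_clone_urls on each URL from search_urls(os.environ["LANGUAGE"]) and writing the results one per line would give a file of the same general shape as the top-*-github.txt lists.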

spaCy initialization

Download the spaCy POS tagger data for English (small version):

python -m spacy download en_core_web_sm

Running the parser

The parser expects a repository path and an output file name prefix. For example:

./git2bitext.py ~/librosa.git current -b main

parses the local repository librosa.git, scanning the branch main, and writes two files, current.msg and current.diff, containing the commit messages and the corresponding diffs respectively. Omitting the -b flag triggers autodetection of the main branch.
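To illustrate the idea of pairing each commit message with its diff, here is a rough sketch that shells out to the git command line. It is not the code of git2bitext.py. The helper names list_commits, commit_pair and one_line are hypothetical, and skipping merge commits and flattening each message onto one line are assumptions made for the example.

```python
import subprocess


def _git(repo, *args):
    """Run a git command inside `repo` and return its standard output."""
    result = subprocess.run(
        ["git", "-C", repo, *args],
        check=True, capture_output=True,
        encoding="utf-8", errors="replace",  # diffs may contain non-UTF-8 bytes
    )
    return result.stdout


def list_commits(repo, branch):
    """Return the commit hashes reachable from `branch`, oldest first.

    Merge commits are skipped (an assumption of this sketch).
    """
    return _git(repo, "rev-list", "--reverse", "--no-merges", branch).split()


def commit_pair(repo, sha):
    """Return (message, diff) for one commit."""
    message = _git(repo, "show", "-s", "--format=%B", sha).strip()
    diff = _git(repo, "show", "--format=", "--patch", sha)
    return message, diff


def one_line(text):
    """Collapse all whitespace so a record fits on a single line.

    One record per line is an assumption here, not the documented format.
    """
    return " ".join(text.split())
```

Looping over list_commits and writing one_line(message) and one_line(diff) to two files in the same order keeps line i of one file aligned with line i of the other, which is the property a bitext needs.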

Help is available with:

./git2bitext.py -h

Splitting the files

The utility ./split_test_train_valid.py can assist in generating the splits (80% train, 10% validation, 10% test) from the bitexts.
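As an illustration of an 80/10/10 split that keeps messages and diffs aligned, here is a minimal sketch. It is not the implementation of split_test_train_valid.py. The function name split_pairs, the shuffling, and the fixed seed are assumptions of the example.

```python
import random


def split_pairs(pairs, seed=0, train=0.8, valid=0.1):
    """Shuffle (message, diff) pairs and cut them into train/valid/test.

    Splitting pairs rather than the two files separately keeps each
    message next to its own diff. Whatever remains after the train and
    validation cuts becomes the test set.
    """
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducible splits
    n_train = int(len(shuffled) * train)
    n_valid = int(len(shuffled) * valid)
    train_set = shuffled[:n_train]
    valid_set = shuffled[n_train:n_train + n_valid]
    test_set = shuffled[n_train + n_valid:]
    return train_set, valid_set, test_set


# Example with toy data standing in for the contents of the two files.
pairs = [(f"message {i}", f"diff {i}") for i in range(10)]
train_set, valid_set, test_set = split_pairs(pairs)
print(len(train_set), len(valid_set), len(test_set))
```

With 10 pairs this yields 8 training pairs, 1 validation pair and 1 test pair.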

Crawling the data from online repo

massive_clone.sh reads a text file with one repo per line and downloads the repos in parallel.
make_top_github_repos.sh builds the text file with the top 1000 repositories for a given programming language. It needs GitHub credentials in the $CREDENTIALS environment variable and a language name in the $LANGUAGE environment variable.
parallel_generation.sh launches git2bitext.py in parallel. parallel_split.sh splits the bitexts in parallel.
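The parallel download step can be pictured with the following sketch, which is not the content of massive_clone.sh. It assumes each line of the text file is a clone URL and that destination directories are named after the repository with a .git suffix. The names read_repo_list, clone_command and clone_all are hypothetical.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def read_repo_list(text):
    """Return the non-empty lines of the repo list, stripped of whitespace."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def clone_command(url, dest_root):
    """Build the git command that clones `url` under `dest_root`."""
    name = url.rstrip("/").rsplit("/", 1)[-1]
    if not name.endswith(".git"):
        name += ".git"  # naming convention assumed for this sketch
    return ["git", "clone", "--quiet", url, str(Path(dest_root) / name)]


def clone_all(urls, dest_root, workers=4):
    """Clone every URL using a small thread pool.

    Returns a dict mapping each URL to the exit code of its git process,
    so failed clones can be retried later.
    """
    def run(url):
        return subprocess.run(clone_command(url, dest_root)).returncode

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(urls, pool.map(run, urls)))
```

Threads are sufficient here because the work is done by external git processes; the pool only limits how many run at once.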
