Replication package for our work on "Taxing Collaborative Software Engineering"
This replication package requires Python 3.10 or higher. Install the dependencies via:
python3 -m pip install -r requirements.txt
For faster loading, we recommend optionally installing orjson via pip:
python3 -m pip install orjson
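An optional dependency like this is commonly wired in through an import fallback. The following is a minimal sketch of that pattern, not the package's actual code; `load_json` is a hypothetical helper introduced only for illustration.

```python
import json

try:
    import orjson  # optional, faster JSON parser
except ImportError:
    orjson = None  # fall back to the standard library


def load_json(raw: bytes):
    """Parse JSON bytes with orjson if it is installed, else with json.

    Hypothetical helper: the replication package may load data differently.
    """
    if orjson is not None:
        return orjson.loads(raw)
    return json.loads(raw)


print(load_json(b'{"number": 1, "events": []}'))
```

Either way, the parsed result is the same Python object; only the parsing speed differs.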
First, we collect the timelines of all pull requests on a GitHub instance. crawl.py requires an <api_token> for your GitHub instance and an <out_dir> in which the results are stored:
python3 crawl.py <api_token> <out_dir>
crawl.py also provides the following optional command line arguments:
- --api_url for the GitHub instance URL (default: https://api.github.com)
- --disable_cache for disabling caching (not recommended for larger instances)
- --num_workers for the number of parallel processes (default: 1)
- --organization for limiting the crawl to one organization (helpful for organizations hosted on github.com)
To list all options in detail, run:
python3 crawl.py --h
For this step, you will need:
- The directory of the previously collected data; and,
- A mapping of users to countries. This can be either a dict for a static mapping (does not capture changes in the users' location over time) or a data frame, sampled monthly, for a time-dependent mapping (captures changes in the users' location over time).
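The two mapping variants could look roughly as follows. This is a sketch under assumptions: the user names and country codes are made up, pandas is assumed to be available, and the layout of the data frame (one row per month, one column per user) is a guess. Check the inline comments in the notebook for the layout it actually expects.

```python
import pandas as pd

# Static mapping: one country per user (names and codes are made up).
static_mapping = {"alice": "SE", "bob": "DE"}

# Time-dependent mapping, sampled monthly.
# Assumed layout: one row per month, one column per user.
months = pd.date_range("2023-01-01", periods=3, freq="MS")
monthly_mapping = pd.DataFrame(
    {"alice": ["SE", "SE", "US"], "bob": ["DE", "DE", "DE"]},
    index=months,
)

# Look up where a user was located in a given month.
print(monthly_mapping.loc[pd.Timestamp("2023-03-01"), "alice"])
```

The static variant assigns "alice" to a single country for the whole observation period, whereas the monthly data frame records that she changed location in March.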
Run notebook.ipynb and follow the instructions given as inline comments.
Copyright © 2023 Michael Dorner.
This work is licensed under the MIT license.