This forked repository has been archived and is no longer actively maintained (original project: LLNL/scraper). Please visit DOE CODE for more information, or for any questions, please contact [email protected].
Scraper is a tool for scraping and visualizing open source data from various code hosting platforms, such as GitHub.com, GitHub Enterprise, GitLab.com, hosted GitLab, and Bitbucket Server.
Code.gov is a newly launched website of the US Federal Government that allows the People to access metadata about the government's custom-developed software. The site requires metadata to function, and this Python library can help with that!
To get started, you will need a GitHub personal access token to make requests to the GitHub API. Set it in your environment or shell rc file under the name GITHUB_API_TOKEN:
$ echo "export GITHUB_API_TOKEN=XYZ" >> ~/.bashrc
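Before running a long scrape, it can help to confirm that the variable is actually visible to Python. The following is a minimal sketch, not part of Scraper itself; the helper name `get_github_token` is hypothetical and only the variable name GITHUB_API_TOKEN comes from this README.

```python
import os


def get_github_token(env=os.environ):
    """Return the GitHub token from the environment, or raise a clear error.

    `env` is any mapping; it defaults to the process environment so the
    function can be exercised with a plain dict.
    """
    token = env.get("GITHUB_API_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "GITHUB_API_TOKEN is not set; export it in your shell or rc file."
        )
    return token


if __name__ == "__main__":
    try:
        get_github_token()
        print("GITHUB_API_TOKEN is set")
    except RuntimeError as err:
        print(err)
```

Remember to open a new shell (or source your rc file) after adding the export line, otherwise the variable will not be present in the current session.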
Additionally, to perform the labor hours estimation, you will need to install cloc into your environment. This is typically done with a package manager such as npm or Homebrew.
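A quick way to check that the external tool is reachable is to look it up on PATH. The sketch below assumes the required tool is the `cloc` line counter and that it is installed under that executable name; the helper `find_cloc` is hypothetical and is not how Scraper itself locates the tool.

```python
import shutil


def find_cloc(which=shutil.which):
    """Return the path to the cloc executable, or None if it is not on PATH.

    `which` is injectable so the lookup can be tested without cloc installed.
    """
    return which("cloc")


if __name__ == "__main__":
    path = find_cloc()
    if path:
        print(f"cloc found at {path}")
    else:
        print("cloc not found; install it before estimating labor hours")
```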
Then, to generate a code.json file for your agency, you will need a config.json file to coordinate the platforms you will connect to and scrape data from. An example config file can be found in demo.json. Once you have your config file, you are ready to install and run the scraper!
# Install Scraper from a local copy of this repository
$ pip install -e .
# OR
# Install Scraper from PyPI
$ pip install llnl-scraper
# Run Scraper with your config file ``config.json``
$ scraper --config config.json
A full example of the resulting code.json file can be found in the project repository.
The configuration file is a JSON file that specifies which repository platforms to pull projects from, along with settings that can override incomplete or inaccurate data returned by the scraping.
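Because the file is JSON, it can also be generated from a script rather than written by hand. The sketch below builds a minimal configuration with a single GitHub instance, using the field names from the basic structure documented in this section. The helper `build_config`, the placeholder organization name, and the `https://github.com` URL are assumptions for illustration; a real configuration may need additional settings, so compare the result against demo.json.

```python
import json


def build_config(github_orgs, github_token=None):
    """Build a minimal config dict with a single GitHub instance.

    Only the "GitHub" platform key is filled in; other platforms
    ("GitLab", "Bitbucket", "TFS") would be added the same way.
    """
    return {
        "GitHub": [
            {
                "url": "https://github.com",  # assumed public GitHub URL
                "token": github_token,        # None serializes to JSON null
                "public_only": True,
                "orgs": list(github_orgs),
                "repos": [],
                "exclude": [],
            }
        ]
    }


if __name__ == "__main__":
    # "example-org" is a placeholder, not a real organization to inventory.
    config = build_config(["example-org"])
    print(json.dumps(config, indent=4))
```

Redirecting the printed output to config.json gives a file that can be passed to `scraper --config config.json`, subject to the caveat above about additional settings.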
The basic structure is:
{
    "GitHub": [
        {
            "url": "https://github.com",  // GitHub.com or GitHub Enterprise URL to inventory
            "token": null,  // Private token for accessing this GitHub instance
            "public_only": true,  // Only inventory public repositories
            "orgs": [ ... ],  // List of organizations to inventory
            "repos": [ ... ],  // List of single repositories to inventory
            "exclude": [ ... ]  // List of organizations / repositories to exclude from inventory
        }
    ],
    "GitLab": [
        {
            "url": "https://gitlab.com",  // GitLab.com or hosted GitLab instance URL to inventory
            "token": null,  // Private token for accessing this GitLab instance
            "fetch_languages": false,  // Make individual API calls for language metadata. Very slow, so defaults to false (e.g., for 191 projects on an internal server: 5 seconds with false, 12 minutes 38 seconds with true)
            "orgs": [ ... ],  // List of organizations to inventory
            "repos": [ ... ],  // List of single repositories to inventory
            "exclude": [ ... ]  // List of groups / repositories to exclude from inventory
        }
    ],
    "Bitbucket": [
        {
            "url": "https://bitbucket.internal",  // Base URL for a Bitbucket Server instance
            "username": "",  // Username to authenticate with
            "password": "",  // Password to authenticate with
            "token": "",  // Token to authenticate with; if supplied, username and password are ignored
            "exclude": [ ... ]  // List of projects / repositories to exclude from inventory
        }
    ],
    "TFS": [
        {
            "url": "https://tfs.internal",  // Base URL for a Team Foundation Server (TFS), Visual Studio Team Services (VSTS), or Azure DevOps instance
            "token": null,  // Private token for accessing this TFS instance
            "exclude": [ ... ]  // List of projects / repositories to exclude from inventory
        }
    ]
}
Scraper is released under an MIT license. For more details see the LICENSE file.