GitHub-dataset-samples

A sample dataset of 1001 GitHub repositories

A sample GitHub dataset of more than 1,000 repositories. The dataset was extracted using the Bright Data API.

Some of the data points that are included in the dataset:

  • url: Repository web address
  • id: Unique repository ID
  • code_language: Main programming language
  • code: Repository source code
  • num_lines: Total lines of code
  • user_name: Repository owner's username
  • user_url: Owner's profile URL
  • size: Repository size
  • size_unit: Repository size units
  • size_num: Repository size number
  • breadcrumbs: Repository navigation path
  • num_issues: Total issues count
  • num_pull_requests: Total pull requests count
  • num_projects: Number of associated projects
  • num_fork: Fork count
  • num_stared: Star count
  • last_feature: Latest feature change
  • latest_update: Date of last update

The dataset includes more fields than those listed here.
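As a rough illustration of how a few of these fields might be consumed, the sketch below parses CSV rows into a small typed record. The column types (integers for the counts, strings elsewhere), the `RepoRecord` and `parse_rows` names, and the sample row are all assumptions made for illustration; check the actual file before relying on them.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class RepoRecord:
    """A subset of the listed fields; the types are assumed, not documented."""
    url: str
    code_language: str
    user_name: str
    num_fork: int
    num_stared: int


def to_int(value):
    """Convert a CSV cell to int, treating empty or malformed cells as 0."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return 0


def parse_rows(text):
    """Parse CSV text into RepoRecord objects, ignoring columns not used here."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        RepoRecord(
            url=row.get("url") or "",
            code_language=row.get("code_language") or "",
            user_name=row.get("user_name") or "",
            num_fork=to_int(row.get("num_fork")),
            num_stared=to_int(row.get("num_stared")),
        )
        for row in reader
    ]


# Made-up row for illustration; not taken from the dataset.
SAMPLE_CSV = (
    "url,code_language,user_name,num_fork,num_stared\n"
    "https://github.com/example/repo,Python,example,12,340\n"
)

records = parse_rows(SAMPLE_CSV)
```

Treating unparseable counts as 0 is a convenience for exploration; a stricter pipeline might prefer to reject such rows instead.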

This sample is a subset of the "GitHub Repositories (public data)" dataset, which includes more than 2,200,000 repositories.

Available dataset file formats: JSON, NDJSON, JSON Lines, CSV, or Parquet. Optionally, files can be compressed to .gz.
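For the NDJSON / JSON Lines formats, a minimal reader might look like the sketch below. It assumes one JSON object per line and detects gzip compression from a `.gz` file extension; the `read_json_lines` helper name is hypothetical, and the actual file names delivered with the dataset may differ.

```python
import gzip
import json


def read_json_lines(path):
    """Yield one dict per non-empty line; handles plain and .gz files."""
    # Choose the opener from the file extension (an assumption about naming).
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

Because the function is a generator, records are decoded one at a time, which keeps memory use low for large files.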

Dataset delivery type options: Email, API download, Webhook, Amazon S3, Google Cloud Storage, Google Cloud Pub/Sub, Microsoft Azure, Snowflake, SFTP.

Update frequency: Once, Daily, Weekly, Monthly, Quarterly, or Custom basis.

Data enrichment, in addition to the extracted data points: available on request.

Get the full GitHub dataset.

What are the GitHub dataset use cases?

1. Developer Engagement

Gain insights into the activity and health of open-source projects by tracking data points like commit histories, pull requests, and issue discussions. This data can help businesses identify high-impact projects, monitor trends, and discover collaboration opportunities in the open-source community.

2. Community Engagement

Evaluate the popularity and community backing of open-source projects by analyzing metrics such as star and fork counts. This information enables businesses to understand which projects are gaining traction, make informed adoption decisions, and identify technology trends.
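A star-and-fork analysis of this kind could be sketched as follows, using the `num_stared`, `num_fork`, and `code_language` fields. The sample records are made up for illustration, the helper names are hypothetical, and the code assumes the counts can be read as integers.

```python
from collections import Counter

# Made-up records for illustration; not taken from the dataset.
SAMPLE = [
    {"url": "https://github.com/example/a", "code_language": "Python",
     "num_stared": 120, "num_fork": 8},
    {"url": "https://github.com/example/b", "code_language": "Go",
     "num_stared": 950, "num_fork": 40},
    {"url": "https://github.com/example/c", "code_language": "Python",
     "num_stared": 950, "num_fork": 75},
]


def top_by_popularity(records, n=2):
    """Return the n records with the most stars, breaking ties by fork count."""
    ranked = sorted(
        records,
        key=lambda r: (int(r.get("num_stared") or 0), int(r.get("num_fork") or 0)),
        reverse=True,
    )
    return ranked[:n]


def stars_by_language(records):
    """Sum star counts per main programming language."""
    totals = Counter()
    for r in records:
        totals[r.get("code_language") or "unknown"] += int(r.get("num_stared") or 0)
    return totals


top = top_by_popularity(SAMPLE)
language_totals = stars_by_language(SAMPLE)
```

Sorting on a `(stars, forks)` tuple is one simple ranking choice; a weighted score would work equally well if forks should count for more.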

3. Community Advocacy

Utilize public GitHub profile data to foster engagement and advocacy within the open-source community. Identify active users who star, fork, and contribute to repositories in your field to create a network of advocates who can amplify your projects and fuel collaborative innovation.

Free access to web scraping tools and datasets for academic researchers and NGOs

The Bright Initiative offers access to Bright Data's Web Scraper APIs and ready-to-use datasets to leading academic faculties and researchers, as well as NGOs and NPOs promoting various environmental and social causes. You can submit an application here.