A scalable GitHub repository scraper that analyzes commit history and generates a leaderboard of contributors. The project is built with a modern stack for both backend and frontend, featuring asynchronous processing with a task queue.
- Fastify Server:
  - Implements multiple endpoints:
    - `/health`: Check server status.
    - `/leaderboard` (GET): Retrieve the leaderboard for a processed repository.
    - `/leaderboard` (POST): Submit a repository for processing.
    - `/repositories`: List all repositories in the database.
  - Handles repository states (`pending`, `in_progress`, `failed`, `completed`) dynamically.
- Efficient Repository Management:
  - Bare cloning and incremental updates using `simple-git`.
  - Normalizes repository URLs for consistent processing.
- Task Queue:
  - Asynchronous repository processing with Bull and Redis.
- Database Integration:
  - PostgreSQL for persistent caching of repositories and contributors.
  - Prisma ORM for structured and efficient database queries.
- Error Handling:
  - Graceful handling of invalid repository URLs, missing data, and processing failures.
- Modern UI: Built with Next.js and styled with Tailwind CSS.
- Leaderboard Display: Interactive table showing contributor rankings and commit counts.
- Repository Management: Add and monitor GitHub repositories through a responsive interface.
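The URL normalization listed under Efficient Repository Management is not shown in this README. As a rough illustration only, the sketch below reduces a few common spellings of a GitHub URL to one canonical form. The function name `normalizeRepoUrl`, the accepted input forms, and the choice to lowercase the owner and repository name are all assumptions, not the project's actual implementation.

```typescript
// Hypothetical helper: not taken from the project source.
// Reduces common GitHub URL spellings to https://github.com/<owner>/<repo>.
function normalizeRepoUrl(input: string): string {
  const trimmed = input.trim();
  // Accepts https://, http://, SSH (git@github.com:) and bare github.com/ forms,
  // with an optional ".git" suffix and an optional trailing slash.
  const match = trimmed.match(
    /^(?:https?:\/\/|git@)?(?:www\.)?github\.com[/:]([^/\s]+)\/([^/\s]+?)(?:\.git)?\/?$/i
  );
  if (!match) {
    throw new Error(`Not a recognizable GitHub repository URL: ${input}`);
  }
  const [, owner, repo] = match;
  // Assumption: GitHub treats owner/repo names case-insensitively,
  // so lowercasing gives one cache key per repository.
  return `https://github.com/${owner.toLowerCase()}/${repo.toLowerCase()}`;
}
```

With a helper like this, `git@github.com:aalexmrt/github-scraper.git` and `https://github.com/aalexmrt/github-scraper/` would map to the same string, so a repository is stored and processed once regardless of how its URL was typed.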
Ensure the following tools are installed on your machine:
- Docker and Docker Compose
- Git
- Clone the Repository:

      git clone https://github.com/aalexmrt/github-scraper
      cd github-scraper

- Set Up Environment Variables:
  - A sample `.env.example` file is provided in the `backend` folder. Copy it to create your `.env` file:

        cp backend/.env.example backend/.env

  - Open the newly created `backend/.env` file and replace `<your_github_personal_access_token>` with your GitHub Personal Access Token. Example `backend/.env` file:

        # Database connection string
        DATABASE_URL=postgresql://user:password@db:5432/github_scraper
        # Redis connection settings
        REDIS_HOST=redis
        REDIS_PORT=6379
        # GitHub API Personal Access Token
        GITHUB_TOKEN=<your_github_personal_access_token>

  - Note: The `backend/.env.example` file includes placeholder values to guide you. Ensure the actual `.env` file is not shared or committed to version control, to keep sensitive data secure.
  - If you don't have a GitHub Personal Access Token yet, you can create one:
    - Go to GitHub Developer Settings.
    - Click "Generate new token" (classic).
    - Select the necessary scopes (`read:user`, and `repo` for private repository access if required).
    - Copy the token and add it to the `GITHUB_TOKEN` variable in your `backend/.env` file.
Start Services:
- Run the following command to build and start all services using Docker Compose:
docker-compose up --build
- This will start the backend, frontend, PostgreSQL database, Redis, and the worker service.
- Run the following command to build and start all services using Docker Compose:
- Access the Application:
  - Backend API: Accessible at `http://localhost:3000`
    - To verify that the backend is running, use the `/health` endpoint:

          curl -X GET "http://localhost:3000/health"

    - Expected Response:

          { "message": "Server is running." }

  - Frontend UI: Accessible at `http://localhost:4000`
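If the services fail to start, one likely cause is a `backend/.env` file that still contains a placeholder or is missing a variable. The sketch below shows one way to check for that before running Docker Compose. It is illustrative only: `parseEnv` and `findUnsetKeys` are hypothetical helpers, the list of required keys is taken from the example `backend/.env` file in the setup steps, and the parser handles only plain `KEY=VALUE` lines (no quoting or multi-line values).

```typescript
// Variable names taken from the example backend/.env file in the setup steps.
const REQUIRED_KEYS = ["DATABASE_URL", "REDIS_HOST", "REDIS_PORT", "GITHUB_TOKEN"];

// Parses KEY=VALUE lines, skipping blank lines and "#" comments.
function parseEnv(text: string): Record<string, string> {
  const result: Record<string, string> = {};
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === "" || line.startsWith("#")) continue;
    const eq = line.indexOf("=");
    if (eq <= 0) continue; // not a KEY=VALUE line
    result[line.slice(0, eq).trim()] = line.slice(eq + 1).trim();
  }
  return result;
}

// Returns required variables that are missing, empty, or still a <placeholder>.
function findUnsetKeys(env: Record<string, string>): string[] {
  return REQUIRED_KEYS.filter((key) => {
    const value = env[key];
    return value === undefined || value === "" || /^<.*>$/.test(value);
  });
}
```

In a Node script, the file contents could be read with `fs.readFileSync("backend/.env", "utf8")` and passed to `parseEnv`. An empty array from `findUnsetKeys` means every variable from the example file has a real value.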
Endpoint: /leaderboard
Method: POST
Parameter | Type | Description | Required |
---|---|---|---|
repoUrl | string | The GitHub repository URL to process | Yes |
Header | Type | Description | Required |
---|---|---|---|
Authorization | string | Bearer token for private repositories | No |
curl -X POST "http://localhost:3000/leaderboard?repoUrl=https://github.com/aalexmrt/github-scraper"
{ "message": "Repository is being processed." }
{ "message": "Repository still processing." }
{
"message": "Repository processed successfully.",
"lastProcessedAt": "2024-11-28T12:00:00Z"
}
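The same submission can be made from code. The sketch below only assembles the request (URL, method, optional `Authorization` header) from the parameters in the tables above, so it can be handed to any HTTP client. `buildSubmitRequest`, `SubmitRequest`, and `API_BASE` are hypothetical names introduced for this example. The base URL assumes the default local setup.

```typescript
// Base URL from the setup section; adjust if the backend runs elsewhere.
const API_BASE = "http://localhost:3000";

interface SubmitRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
}

// Builds the POST /leaderboard request described above.
// `token` is optional and only relevant for private repositories.
function buildSubmitRequest(repoUrl: string, token?: string): SubmitRequest {
  const headers: Record<string, string> = {};
  if (token) {
    headers["Authorization"] = `Bearer ${token}`;
  }
  return {
    // encodeURIComponent keeps the repository URL intact as a single query value.
    url: `${API_BASE}/leaderboard?repoUrl=${encodeURIComponent(repoUrl)}`,
    method: "POST",
    headers,
  };
}
```

The result can be passed to `fetch(req.url, { method: req.method, headers: req.headers })`. Percent-encoding the `repoUrl` value is a defensive choice for repository URLs that contain characters such as `&` or `#`; the `curl` example above works without it because that URL contains none.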
Endpoint: /leaderboard
Method: GET
URL: http://localhost:3000/leaderboard
Query Parameters
Parameter | Type | Description | Required |
---|---|---|---|
repoUrl | string | The GitHub repository URL to process | Yes |
curl -X GET "http://localhost:3000/leaderboard?repoUrl=https://github.com/aalexmrt/github-scraper"
{
"error": "Repository not found, remember to submit for processing first."
}
{
"repository": "https://github.com/aalexmrt/github-scraper",
"top_contributors": [
{
"identifier": "aalexmrt",
"username": "aalexmrt",
"email": "[email protected]",
"profileUrl": "https://github.com/aalexmrt",
"commitCount": 23
}
]
}
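For clients that consume this response, the shape can be written down as types. The interfaces below are inferred from the single example response above, so treat optionality and field types as assumptions: the API may omit or null some fields in practice. `formatLeaderboard` is a hypothetical helper showing how the data might be ranked for display.

```typescript
// Field names mirror the example response above; the types are inferred from it.
interface Contributor {
  identifier: string;
  username: string;
  email: string;
  profileUrl: string;
  commitCount: number;
}

interface LeaderboardResponse {
  repository: string;
  top_contributors: Contributor[];
}

// Returns "rank. username (N commits)" lines, highest commit count first.
// Sorting a copy avoids mutating the response object.
function formatLeaderboard(response: LeaderboardResponse): string[] {
  return [...response.top_contributors]
    .sort((a, b) => b.commitCount - a.commitCount)
    .map((c, i) => `${i + 1}. ${c.username} (${c.commitCount} commits)`);
}
```

The explicit sort is there because this README does not state whether `top_contributors` is already ordered by commit count.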
The application frontend provides an interface to interact with the backend, making it easier to process repositories and view leaderboards.
- Add a Repository
  - Open the application frontend at `http://localhost:4000`.
  - Use the Add Repository form to submit a GitHub repository URL for processing.
- Monitor Repository Processing
  - Navigate to the Processed Repositories section to view the status of your repositories:
    - Processing: The repository is currently being analyzed.
    - On Queue: The repository is waiting for processing.
    - Completed: The repository has been successfully processed.
- View Contributor Leaderboard
  - For completed repositories, click the Leaderboard button to view a detailed contributor leaderboard.
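The status labels above appear to correspond to the backend repository states (`pending`, `in_progress`, `failed`, `completed`). The sketch below shows one plausible mapping. It is a guess rather than the frontend's actual code: the pairing of states to labels is inferred from their descriptions, and the "Failed" label is hypothetical since the list above does not name one.

```typescript
// Backend states as listed in the features section.
type RepositoryState = "pending" | "in_progress" | "failed" | "completed";

// Assumed mapping: the UI labels come from the list above, but which backend
// state produces which label is inferred. "Failed" is a hypothetical label.
const STATUS_LABELS: Record<RepositoryState, string> = {
  pending: "On Queue",
  in_progress: "Processing",
  completed: "Completed",
  failed: "Failed",
};

// The Leaderboard button is described as available for completed repositories only.
function canViewLeaderboard(state: RepositoryState): boolean {
  return state === "completed";
}
```

Typing the map as `Record<RepositoryState, string>` means the compiler flags any state added later that has no label.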
- Add support for private repositories with GitHub token validation in the `/leaderboard` endpoint.
- Split the responsibilities of the `/leaderboard` endpoint by creating a new endpoint, so that processing and leaderboard retrieval are handled separately, and include the repository URL in the response.
- Improve handling of API rate-limit errors and optimize the current flow.
- Add retries for repositories that failed processing.
- Continue improving general optimization and performance.
- Scale horizontally with multiple workers and smarter queue management.
- Add a form to input a repository URL and optional GitHub token.
- Improve UI...