Develop #1

Merged
merged 15 commits into from
Nov 28, 2024
166 changes: 119 additions & 47 deletions README.md
@@ -5,12 +5,27 @@ A scalable GitHub repository scraper that analyzes commit history and generates
## Features

### Backend
- **Fastify Server**: Single endpoint for repository processing and leaderboard generation.
- **Asynchronous Processing**: Uses Bull and Redis for task queue management.
- **Efficient Cloning**: Bare cloning and incremental updates with `simple-git`.
- **Caching**: PostgreSQL and Prisma for caching contributor data and reducing redundant API calls.

- **Fastify Server**:
- Implements multiple endpoints:
- `/health`: Check server status.
- `/leaderboard` (GET): Retrieve the leaderboard for a processed repository.
- `/leaderboard` (POST): Submit a repository for processing.
- `/repositories`: List all repositories in the database.
- Handles repository states (`pending`, `in_progress`, `failed`, `completed`) dynamically.
- **Efficient Repository Management**:
- Bare cloning and incremental updates using `simple-git`.
- Normalizes repository URLs for consistent processing.
- **Task Queue**:
- Asynchronous repository processing with Bull and Redis.
- **Database Integration**:
- PostgreSQL for persistent caching of repositories and contributors.
- Prisma ORM for structured and efficient database queries.
- **Error Handling**:
- Graceful handling of invalid repository URLs, missing data, and processing failures.
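
The URL normalization mentioned above is not shown in this diff. As a minimal sketch of what it could look like, the hypothetical helper `normalizeRepoUrl` below maps common spellings of a GitHub URL (HTTPS or SSH form, optional `.git` suffix, trailing slash, mixed case) to one canonical string. The function name, the accepted forms, and the choice to lowercase owner and repository names are all assumptions, not the project's actual rules.

```typescript
// Hypothetical helper: the project's real normalization rules are not shown in this diff.
function normalizeRepoUrl(input: string): string {
  // Accept https://github.com/owner/repo and git@github.com:owner/repo,
  // each with an optional ".git" suffix and an optional trailing slash.
  const pattern =
    /^(?:https?:\/\/github\.com\/|git@github\.com:)([^\/\s]+)\/([^\/\s]+?)(?:\.git)?\/?$/i;
  const match = input.trim().match(pattern);
  if (!match) {
    throw new Error(`Not a recognizable GitHub repository URL: ${input}`);
  }
  const [, owner, repo] = match;
  // Assumption: lowercasing is safe because GitHub resolves owner and
  // repository names case-insensitively.
  return `https://github.com/${owner.toLowerCase()}/${repo.toLowerCase()}`;
}
```

Storing only the canonical form would let a unique constraint on the repository URL catch resubmissions of the same repository under a different spelling.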

### Frontend

- **Modern UI**: Built with Next.js and styled with Tailwind CSS.
- **Leaderboard Display**: Interactive table showing contributor rankings and commit counts.
- **Repository Management**: Add and monitor GitHub repositories through a responsive interface.
@@ -22,6 +37,7 @@ A scalable GitHub repository scraper that analyzes commit history and generates
### Prerequisites

Ensure the following tools are installed on your machine:

- [Docker](https://www.docker.com/) and [Docker Compose](https://docs.docker.com/compose/)
- [Git](https://git-scm.com/)

@@ -31,13 +47,16 @@ Ensure the following tools are installed on your machine:
```bash
git clone https://github.com/aalexmrt/github-scraper
cd github-scraper
```
2. **Set Up Environment Variables**:

- A sample `.env.example` file is provided in the `backend` folder. You can copy this file to create your `.env` file.
```bash
cp backend/.env.example backend/.env
```
- Open the newly created `backend/.env` file and replace `<your_github_personal_access_token>` with your GitHub Personal Access Token.
Example `backend/.env` file:

```env
# Database connection string
DATABASE_URL=postgresql://user:password@db:5432/github_scraper
@@ -53,15 +72,18 @@ Ensure the following tools are installed on your machine:
- **Note**: The `backend/.env.example` file includes placeholder values to guide you. Ensure the actual `.env` file is not shared or committed to version control to keep sensitive data secure.

- If you don't have a GitHub Personal Access Token yet, you can create one:

1. Go to [GitHub Developer Settings](https://github.com/settings/tokens).
2. Click "Generate new token" (classic).
3. Select the necessary scopes: `read:user`, plus `repo` if you need access to private repositories.
4. Copy the token and add it to the `GITHUB_TOKEN` variable in your `backend/.env` file.

```

```

3. **Start Services**:

- Run the following command to build and start all services using Docker Compose:
```bash
docker-compose up --build
@@ -84,86 +106,136 @@ Ensure the following tools are installed on your machine:

## Usage

### Backend: Accessing the `/leaderboard` Endpoint
### Submit a Repository for Processing

You can use the `/leaderboard` endpoint to process a GitHub repository and retrieve the leaderboard of contributors.
**Endpoint**: `/leaderboard`

#### Endpoint Details
- **Method**: `GET`
- **URL**: `http://localhost:3000/leaderboard`
**Method**: `POST`

#### Query Parameters
| Parameter | Type | Description | Required |
|-----------|--------|-------------------------------------|----------|
| `repoUrl` | string | The URL of the GitHub repository. | Yes |

| Parameter | Type | Description | Required |
| --------- | ------ | ------------------------------------ | -------- |
| `repoUrl` | string | The GitHub repository URL to process | Yes |

#### Headers

| Header | Type | Description | Required |
| --------------- | ------ | ------------------------------------- | -------- |
| `Authorization` | string | Bearer token for private repositories | No |

#### Example Request
Using `curl`:

```bash
curl -X GET "http://localhost:3000/leaderboard?repoUrl=https://github.com/aalexmrt/github-scraper"
curl -X POST "http://localhost:3000/leaderboard?repoUrl=https://github.com/aalexmrt/github-scraper"
```
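
For callers that prefer code over `curl`, the same submission can be sketched in TypeScript. The helper `buildSubmitRequest` is hypothetical; only the `/leaderboard` path, the `repoUrl` query parameter, the `POST` method, and the optional `Authorization: Bearer` header come from the README. Percent-encoding the repository URL is an assumption about good practice; it decodes to the same value the `curl` example sends.

```typescript
interface SubmitRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
}

// Hypothetical helper: builds the request described in the tables above.
function buildSubmitRequest(baseUrl: string, repoUrl: string, token?: string): SubmitRequest {
  // Encode the repository URL so its own "/" and ":" cannot be misread as part of the query.
  const query = `repoUrl=${encodeURIComponent(repoUrl)}`;
  const headers: Record<string, string> = {};
  // The Authorization header is optional and only needed for private repositories.
  if (token) {
    headers["Authorization"] = `Bearer ${token}`;
  }
  return { url: `${baseUrl}/leaderboard?${query}`, method: "POST", headers };
}

const submitRequest = buildSubmitRequest(
  "http://localhost:3000",
  "https://github.com/aalexmrt/github-scraper"
);
```

The resulting object can be passed to any HTTP client, for example `fetch(submitRequest.url, { method: submitRequest.method, headers: submitRequest.headers })`.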

#### Example Responses
##### Processing in Progress
### Responses

#### Repository Added for Processing

```json
{ "message": "Repository is being processed." }
```

#### Repository Already Processing

```json
{ "message": "Repository still processing." }
```

#### Processing Completed

```json
{
"message": "Repository is being processed."
"message": "Repository processed successfully.",
"lastProcessedAt": "2024-11-28T12:00:00Z"
}
```
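
One way to read the three responses above is as a function of the repository's stored state. The sketch below is an assumption about that mapping, not the server's actual code: `submitResponse` and the record shape are hypothetical, and treating a `failed` repository like a new one (queueing it again) is a guess. Only the state names and the message strings are taken from the README.

```typescript
type RepositoryState = "pending" | "in_progress" | "failed" | "completed";

interface RepositoryRecord {
  state: RepositoryState;
  lastProcessedAt: string | null; // ISO 8601 timestamp, null until a run succeeds
}

interface SubmitResponse {
  message: string;
  lastProcessedAt?: string;
}

// Hypothetical decision function; `existing` is null when the URL is not in the database yet.
function submitResponse(existing: RepositoryRecord | null): SubmitResponse {
  if (existing === null || existing.state === "failed") {
    // Assumption: new and previously failed repositories are both queued for processing.
    return { message: "Repository is being processed." };
  }
  if (existing.state === "completed") {
    const response: SubmitResponse = { message: "Repository processed successfully." };
    if (existing.lastProcessedAt !== null) {
      response.lastProcessedAt = existing.lastProcessedAt;
    }
    return response;
  }
  // Remaining states: "pending" and "in_progress".
  return { message: "Repository still processing." };
}
```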

##### Processing Completed
### Retrieve Leaderboard for a Processed Repository

**Endpoint**: `/leaderboard`

**Method**: `GET`

**URL**: `http://localhost:3000/leaderboard`

**Query Parameters**

| Parameter | Type   | Description                                                 | Required |
| --------- | ------ | ----------------------------------------------------------- | -------- |
| `repoUrl` | string | URL of a GitHub repository already submitted for processing | Yes      |

#### Example Request

```bash
curl -X GET "http://localhost:3000/leaderboard?repoUrl=https://github.com/aalexmrt/github-scraper"
```

#### Example Responses

##### Repository Not Found

```json
{
"leaderboard": [
{
"commitCount": 43,
"username": null,
"email": "[email protected]",
"profileUrl": null
},
{
"commitCount": 2,
"username": "aalexmrt",
"email": "[email protected]",
"profileUrl": "https://github.com/aalexmrt"
}
]
"error": "Repository not found, remember to submit for processing first."
}
```

##### Error
##### Leaderboard Response

```json
{
"error": "Failed to process the leaderboard request."
"repository": "https://github.com/aalexmrt/github-scraper",
"top_contributors": [
{
"identifier": "aalexmrt",
"username": "aalexmrt",
"email": "[email protected]",
"profileUrl": "https://github.com/aalexmrt",
"commitCount": 23
}
]
}
```
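
A client consuming this response might type it as follows. This is a sketch: the field names come from the example above, the nullable fields are an assumption based on the schema marking `username`, `email`, and `profileUrl` as optional, and `rankContributors` is a hypothetical helper for clients that want to guarantee descending order themselves. The sample values are made up.

```typescript
interface LeaderboardEntry {
  identifier: string;
  username: string | null;
  email: string | null;
  profileUrl: string | null;
  commitCount: number;
}

interface LeaderboardResponse {
  repository: string;
  top_contributors: LeaderboardEntry[];
}

// Hypothetical helper: orders entries by commit count, highest first,
// breaking ties by identifier so the result is deterministic.
function rankContributors(entries: LeaderboardEntry[]): LeaderboardEntry[] {
  // Copy before sorting so the caller's array is left untouched.
  return [...entries].sort(
    (a, b) => b.commitCount - a.commitCount || a.identifier.localeCompare(b.identifier)
  );
}

// Made-up sample data for illustration only.
const sampleLeaderboard: LeaderboardResponse = {
  repository: "https://github.com/aalexmrt/github-scraper",
  top_contributors: rankContributors([
    { identifier: "contributor-b", username: null, email: null, profileUrl: null, commitCount: 2 },
    { identifier: "contributor-a", username: null, email: null, profileUrl: null, commitCount: 23 },
  ]),
};
```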

### Frontend: Using the Application

The application frontend provides an interface to interact with the backend, making it easier to process repositories and view leaderboards.

1. **Add a Repository**
- Open the application frontend at `http://localhost:4000`.
1. **Add a Repository**

- Open the application frontend at `http://localhost:4000`.
- Use the **Add Repository** form to submit a GitHub repository URL for processing.

2. **Monitor Repository Processing**
- Navigate to the **Processed Repositories** section to view the status of your repositories:
- **Processing**: The repository is currently being analyzed.
- **On Queue**: The repository is waiting for processing.
- **Completed**: The repository has been successfully processed.
2. **Monitor Repository Processing**

- Navigate to the **Processed Repositories** section to view the status of your repositories:
- **Processing**: The repository is currently being analyzed.
- **On Queue**: The repository is waiting for processing.
- **Completed**: The repository has been successfully processed.

3. **View Contributor Leaderboard**
3. **View Contributor Leaderboard**
- For completed repositories, click the **Leaderboard** button to view a detailed contributor leaderboard.

<img width="1728" alt="Screenshot 2024-11-24 at 5 17 49 PM" src="https://github.com/user-attachments/assets/97ac4397-2556-44c5-89df-011133f6b455">
<img width="1728" alt="Screenshot 2024-11-24 at 5 18 03 PM" src="https://github.com/user-attachments/assets/200d80e1-59ed-4821-8a60-9a0f8807096f">
<img width="1728" alt="Screenshot 2024-11-28 at 2 54 21 PM" src="https://github.com/user-attachments/assets/e75a9997-405f-4b2c-83e6-6f24a28c1a20">
<img width="1728" alt="Screenshot 2024-11-28 at 2 54 28 PM" src="https://github.com/user-attachments/assets/7ef773a2-39bc-4906-879d-c32f045090e9">
<img width="1728" alt="Screenshot 2024-11-28 at 2 54 38 PM" src="https://github.com/user-attachments/assets/49417023-aab0-4edf-8158-91afdecbd138">


## **Next Steps**

### **Backend**
- [ ] Add support for private repositories with GitHub token validation in the `/leaderboard` endpoint.
- [ ] Update the /leaderboard endpoint to split responsibilities by creating a new endpoint for processing and retrieving the leaderboard, and include the repository URL in the response.

- [X] Add support for private repositories with GitHub token validation in the `/leaderboard` endpoint.
- [X] Update the /leaderboard endpoint to split responsibilities by creating a new endpoint for processing and retrieving the leaderboard, and include the repository URL in the response.
- [ ] Improve handling of API limit errors and optimize the current flow.
- [ ] Add retries for repositories that failed processing.
- [ ] Continue improving general optimization and performance.
- [ ] Scale horizontally with multiple workers and smart queue management.

### **Frontend**
- [ ] Add a form to input a repository URL and optional GitHub token.

- [X] Add a form to input a repository URL and optional GitHub token.
- [ ] Improve UI...
@@ -0,0 +1,34 @@
-- CreateTable
CREATE TABLE "Repository" (
"id" SERIAL NOT NULL,
"url" TEXT NOT NULL,
"state" TEXT NOT NULL DEFAULT 'pending',
"lastAttempt" TIMESTAMP(3),
"lastProcessedAt" TIMESTAMP(3),
"createdAt" TIMESTAMP(3) NOT NULL DEFAULT CURRENT_TIMESTAMP,
"updatedAt" TIMESTAMP(3) NOT NULL,

CONSTRAINT "Repository_pkey" PRIMARY KEY ("id")
);

-- CreateTable
CREATE TABLE "RepositoryContributor" (
"id" SERIAL NOT NULL,
"repositoryId" INTEGER NOT NULL,
"contributorId" INTEGER NOT NULL,
"commitCount" INTEGER NOT NULL DEFAULT 0,

CONSTRAINT "RepositoryContributor_pkey" PRIMARY KEY ("id")
);

-- CreateIndex
CREATE UNIQUE INDEX "Repository_url_key" ON "Repository"("url");

-- CreateIndex
CREATE UNIQUE INDEX "RepositoryContributor_repositoryId_contributorId_key" ON "RepositoryContributor"("repositoryId", "contributorId");

-- AddForeignKey
ALTER TABLE "RepositoryContributor" ADD CONSTRAINT "RepositoryContributor_repositoryId_fkey" FOREIGN KEY ("repositoryId") REFERENCES "Repository"("id") ON DELETE RESTRICT ON UPDATE CASCADE;

-- AddForeignKey
ALTER TABLE "RepositoryContributor" ADD CONSTRAINT "RepositoryContributor_contributorId_fkey" FOREIGN KEY ("contributorId") REFERENCES "Contributor"("id") ON DELETE RESTRICT ON UPDATE CASCADE;
@@ -0,0 +1,19 @@
/*
Warnings:

- You are about to drop the column `identifier` on the `Contributor` table. All the data in the column will be lost.
- Added the required column `pathName` to the `Repository` table without a default value. This is not possible if the table is not empty.

*/
-- DropIndex
DROP INDEX "Contributor_email_key";

-- DropIndex
DROP INDEX "Contributor_identifier_key";

-- AlterTable
ALTER TABLE "Contributor" DROP COLUMN "identifier",
ALTER COLUMN "email" DROP NOT NULL;

-- AlterTable
ALTER TABLE "Repository" ADD COLUMN "pathName" TEXT NOT NULL;
12 changes: 12 additions & 0 deletions backend/prisma/migrations/20241128160132_/migration.sql
@@ -0,0 +1,12 @@
/*
Warnings:

- A unique constraint covering the columns `[username]` on the table `Contributor` will be added. If there are existing duplicate values, this will fail.
- A unique constraint covering the columns `[email]` on the table `Contributor` will be added. If there are existing duplicate values, this will fail.

*/
-- CreateIndex
CREATE UNIQUE INDEX "Contributor_username_key" ON "Contributor"("username");

-- CreateIndex
CREATE UNIQUE INDEX "Contributor_email_key" ON "Contributor"("email");
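
The warning in this migration means it fails if two existing contributors share a `username` or an `email`. A pre-migration check could look like the sketch below. `ContributorRow` and `findDuplicates` are hypothetical names, and the rows are assumed to have been loaded from the `Contributor` table beforehand. Rows with `NULL` in the column are skipped because PostgreSQL unique indexes treat `NULL` values as distinct by default.

```typescript
interface ContributorRow {
  id: number;
  username: string | null;
  email: string | null;
}

// Hypothetical helper: returns each duplicated value mapped to the ids of the rows that share it.
function findDuplicates(
  rows: ContributorRow[],
  column: "username" | "email"
): Map<string, number[]> {
  const idsByValue = new Map<string, number[]>();
  for (const row of rows) {
    const value = row[column];
    if (value === null) continue; // NULLs never conflict under a default unique index
    const ids = idsByValue.get(value) ?? [];
    ids.push(row.id);
    idsByValue.set(value, ids);
  }
  // Keep only values that occur on more than one row.
  const duplicates = new Map<string, number[]>();
  idsByValue.forEach((ids, value) => {
    if (ids.length > 1) duplicates.set(value, ids);
  });
  return duplicates;
}
```

An empty result for both `"username"` and `"email"` suggests the two unique indexes can be created without conflicts.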
38 changes: 30 additions & 8 deletions backend/prisma/schema.prisma
@@ -13,13 +13,35 @@ datasource db {
url = env("DATABASE_URL")
}

model Repository {
id Int @id @default(autoincrement())
url String @unique
pathName String
state String @default("pending") // "pending", "in_progress", "completed", "failed"
lastAttempt DateTime? // Nullable for repositories that haven’t been processed
lastProcessedAt DateTime? // Nullable for repositories that haven’t been successfully processed
createdAt DateTime @default(now())
updatedAt DateTime @updatedAt
contributors RepositoryContributor[] // Relation to RepositoryContributor
}

model Contributor {
id Int @id @default(autoincrement())
identifier String @unique
username String?
email String @unique
profileUrl String?
createdAt DateTime @default(now())
updatedAt DateTime @updatedAt
}
id Int @id @default(autoincrement())
username String? @unique
email String? @unique
profileUrl String?
createdAt DateTime @default(now())
updatedAt DateTime @updatedAt
repositories RepositoryContributor[] // Relation to RepositoryContributor
}

model RepositoryContributor {
id Int @id @default(autoincrement())
repository Repository @relation(fields: [repositoryId], references: [id])
repositoryId Int
contributor Contributor @relation(fields: [contributorId], references: [id])
contributorId Int
commitCount Int @default(0)

@@unique([repositoryId, contributorId]) // Composite unique constraint
}
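
The composite unique constraint on `RepositoryContributor` means each (repository, contributor) pair has at most one row, so writing a commit count is naturally an upsert. The in-memory sketch below illustrates that behaviour without a database. `CommitCountStore` is a hypothetical stand-in, not code from this pull request.

```typescript
interface CommitCountRow {
  repositoryId: number;
  contributorId: number;
  commitCount: number;
}

// Hypothetical in-memory stand-in for the RepositoryContributor table.
class CommitCountStore {
  private rows = new Map<string, CommitCountRow>();

  // Insert a row, or overwrite the count if the (repositoryId, contributorId) pair already exists.
  upsert(repositoryId: number, contributorId: number, commitCount: number): CommitCountRow {
    // The composite key mirrors @@unique([repositoryId, contributorId]).
    const key = `${repositoryId}:${contributorId}`;
    const row: CommitCountRow = { repositoryId, contributorId, commitCount };
    this.rows.set(key, row);
    return row;
  }

  count(): number {
    return this.rows.size;
  }
}

const store = new CommitCountStore();
store.upsert(1, 7, 10);
store.upsert(1, 7, 12); // same pair: replaces the earlier count instead of adding a row
store.upsert(2, 7, 3); // same contributor, different repository: a new row
```

With Prisma Client, a compound `@@unique` like this one is addressable in `upsert` through a `where` selector whose default name joins the field names (here `repositoryId_contributorId`). Whether the project's worker writes counts that way is not visible in this diff.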