pandastackDev/scraper_posting

Web Scrape & Generate API (FastAPI)

Scrape a web page (title, images, description, full text, extras) and generate a refined blog-style post using OpenAI or HuggingFace. The service exposes a single endpoint, POST /scrape, and serves Swagger and ReDoc documentation.

Features

  • Scraping: httpx + selectolax for fast HTML parsing
  • AI Generation: OpenAI (preferred), HuggingFace (fallback), or local deterministic fallback (no keys required)
  • API Docs: Swagger UI (/docs) and ReDoc (/redoc)
  • Health Check: /health
  • Tests: Minimal unit test with network mocked
  • Docker: Optional containerized run

Requirements

  • Python 3.9+

Quickstart (Local)

  1. Create and activate a virtual environment (Windows PowerShell):
     python -m venv .venv
     .\.venv\Scripts\Activate.ps1
  2. Install dependencies:
     pip install -r requirements.txt
  3. Configure environment (optional, for AI APIs):
     • Copy .env.example to .env and set values.
     • If you skip this step, a local deterministic generator is used.
  4. Run the server:
     uvicorn app.main:app --reload --port 8000
  5. Open the docs:
     • Swagger: http://localhost:8000/docs
     • ReDoc: http://localhost:8000/redoc

Endpoint

POST /scrape

Request body:

{
  "url": "https://example.com/article"
}

Response body (shape):

{
  "scraped_data": {
    "title": "...",
    "description": "...",
    "images": ["https://..."],
    "full_text": "...",
    "extras": {
      "author": "...",
      "publish_date": "...",
      "tags": ["..."],
      "tables": [{"rows": [["h1","h2"],["v1","v2"]]}]
    }
  },
  "generated_post": {
    "title": "...",
    "intro": "...",
    "body": "...",
    "highlights": ["..."],
    "conclusion": "..."
  }
}
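With the server running locally, the endpoint can be exercised from a small client script. The sketch below uses only the standard library and assumes the default address http://localhost:8000; function names are illustrative, not part of the project.

```python
import json
from urllib import request

def build_scrape_payload(url: str) -> bytes:
    """Encode the JSON body expected by POST /scrape."""
    return json.dumps({"url": url}).encode("utf-8")

def scrape(api_base: str, target_url: str) -> dict:
    """Call POST /scrape and return the decoded JSON response."""
    req = request.Request(
        f"{api_base}/scrape",
        data=build_scrape_payload(target_url),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())

# Usage (with the server running):
#   result = scrape("http://localhost:8000", "https://example.com/article")
#   print(result["generated_post"]["title"])
```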

AI Configuration

Set keys in .env (copy from .env.example):

  • OPENAI_API_KEY and optional OPENAI_MODEL (defaults to gpt-4o-mini)
  • or HUGGINGFACE_API_KEY
  • If neither is set, a local deterministic generator is used.
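A minimal .env might look like the following (values are placeholders, not real keys; only the variable names above are defined by the project):

```
# .env — placeholder values for illustration
OPENAI_API_KEY=sk-your-key-here
OPENAI_MODEL=gpt-4o-mini
# or, alternatively:
HUGGINGFACE_API_KEY=hf-your-key-here
```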

Run Tests

pytest -q
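The repository's test mocks the network so no real request is made. As a standalone illustration of that pattern (the function names here are hypothetical, not the app's actual helpers), an async fetch step can be tested against an AsyncMock client:

```python
import asyncio
from unittest.mock import AsyncMock

# Hypothetical helper standing in for the scraper's download step.
async def fetch_html(client, url: str) -> str:
    resp = await client.get(url)
    return resp.text

async def run_mocked() -> str:
    # Replace the HTTP client with an AsyncMock so no network call happens.
    client = AsyncMock()
    client.get.return_value.text = "<title>Stub Page</title>"
    return await fetch_html(client, "https://example.com")

stub_html = asyncio.run(run_mocked())
```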

Docker

Build and run:

docker build -t scrape-api .
docker run --rm -p 8000:8000 --env-file .env scrape-api

Notes & Design

  • Scraper extracts: title, description, images (absolute URLs), full text, extras (author, date, tags, up to 3 tables).
  • Selectors and heuristics are conservative to work across many sites; further tuning can be added.
  • Network and AI calls are async; the endpoint composes both steps.
  • No DB is required; you can persist results by adding a repository layer (SQLite/Postgres) if desired.
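The extraction heuristics above (title, meta description, images resolved to absolute URLs) can be sketched as follows. The app itself uses selectolax for speed; this standard-library version only illustrates the approach.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class PageExtractor(HTMLParser):
    """Illustration of the extraction heuristics (not the app's actual parser)."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.title = ""
        self.description = ""
        self.images = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") == "description":
            self.description = attrs.get("content", "")
        elif tag == "img" and attrs.get("src"):
            # Resolve relative src values against the page URL.
            self.images.append(urljoin(self.base_url, attrs["src"]))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

sample_html = """<html><head><title>Demo</title>
<meta name="description" content="A sample page.">
</head><body><img src="/img/logo.png"></body></html>"""
extractor = PageExtractor("https://example.com/post")
extractor.feed(sample_html)
```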

License

MIT
