Scrape a web page (title, images, description, full text, extras) and generate a refined blog-style post using OpenAI or HuggingFace. Exposes a single endpoint, `POST /scrape`, and provides Swagger and ReDoc documentation.
- Scraping: `httpx` + `selectolax` for fast HTML parsing (see the sketch after this list)
- AI Generation: OpenAI (preferred), HuggingFace (fallback), or a local deterministic fallback (no keys required)
- API Docs: Swagger UI (`/docs`) and ReDoc (`/redoc`)
- Health Check: `/health`
- Tests: Minimal unit test with network mocked
- Docker: Optional containerized run
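To illustrate the `httpx` + `selectolax` combination, here is a minimal sketch of an async scraper (simplified selectors and field handling; not the project's actual code):

```python
import asyncio
from urllib.parse import urljoin

import httpx
from selectolax.parser import HTMLParser


async def scrape(url: str) -> dict:
    # Fetch the page asynchronously, following redirects.
    async with httpx.AsyncClient(follow_redirects=True, timeout=15) as client:
        resp = await client.get(url)
        resp.raise_for_status()

    tree = HTMLParser(resp.text)

    # Title: prefer <title>, fall back to an empty string.
    title_node = tree.css_first("title")
    title = title_node.text(strip=True) if title_node else ""

    # Description: the standard meta tag, if present.
    desc_node = tree.css_first('meta[name="description"]')
    description = (desc_node.attributes.get("content") or "") if desc_node else ""

    # Images: resolve relative src values to absolute URLs.
    images = [
        urljoin(url, img.attributes["src"])
        for img in tree.css("img")
        if img.attributes.get("src")
    ]

    return {"title": title, "description": description, "images": images}


if __name__ == "__main__":
    print(asyncio.run(scrape("https://example.com")))
```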
- Python 3.9+
- Create and activate a virtual environment (Windows PowerShell):

  ```
  python -m venv .venv
  .\.venv\Scripts\Activate.ps1
  ```

- Install dependencies:

  ```
  pip install -r requirements.txt
  ```

- Configure environment (optional, for AI APIs):
  - Copy `.env.example` to `.env` and set values.
  - If you skip this, a local deterministic generator is used.
- Run the server (a quick smoke test from Python follows this list):

  ```
  uvicorn app.main:app --reload --port 8000
  ```

- Open docs:
  - Swagger: http://localhost:8000/docs
  - ReDoc: http://localhost:8000/redoc
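Once the server is running, a quick smoke test from Python (assumes the default port; the exact `/health` payload depends on the implementation):

```python
import httpx

# Hit the health endpoint; a 200 response means the app is serving requests.
resp = httpx.get("http://localhost:8000/health")
print(resp.status_code, resp.json())
```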
Request body:

```json
{
  "url": "https://example.com/article"
}
```

Response body (shape):

```json
{
  "scraped_data": {
    "title": "...",
    "description": "...",
    "images": ["https://..."],
    "full_text": "...",
    "extras": {
      "author": "...",
      "publish_date": "...",
      "tags": ["..."],
      "tables": [{"rows": [["h1","h2"],["v1","v2"]]}]
    }
  },
  "generated_post": {
    "title": "...",
    "intro": "...",
    "body": "...",
    "highlights": ["..."],
    "conclusion": "..."
  }
}
```
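For example, calling the endpoint from Python with `httpx` (assumes the server is running locally on port 8000):

```python
import httpx

# Ask the API to scrape a page and generate a blog-style post from it.
resp = httpx.post(
    "http://localhost:8000/scrape",
    json={"url": "https://example.com/article"},
    timeout=60,  # scraping plus AI generation can take a while
)
resp.raise_for_status()
data = resp.json()

print(data["scraped_data"]["title"])
print(data["generated_post"]["title"])
```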
Set keys in `.env` (copy from `.env.example`):

- `OPENAI_API_KEY` and optional `OPENAI_MODEL` (defaults to `gpt-4o-mini`)
- or `HUGGINGFACE_API_KEY`
- If neither is set, a local deterministic generator is used (see the sketch below).
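A sketch of how this precedence might be wired (illustrative only, not the project's actual generator; the OpenAI branch uses the 1.x `openai` client, and the HuggingFace branch is stubbed):

```python
import os

from openai import OpenAI  # assumed available when OPENAI_API_KEY is set


def _local_fallback(prompt: str) -> str:
    # Deterministic, key-free fallback: derive a draft from the prompt itself.
    return "Draft post based on: " + prompt[:200]


def generate_post(prompt: str) -> str:
    """Backend precedence: OpenAI if configured, else HuggingFace, else local."""
    if os.getenv("OPENAI_API_KEY"):
        client = OpenAI()  # picks up OPENAI_API_KEY from the environment
        model = os.getenv("OPENAI_MODEL", "gpt-4o-mini")
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content or ""
    if os.getenv("HUGGINGFACE_API_KEY"):
        # A HuggingFace-backed call would go here; omitted in this sketch.
        return _local_fallback(prompt)
    return _local_fallback(prompt)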
Run tests:

```
pytest -q
```
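A sketch of what such a test might look like, using FastAPI's `TestClient` and `unittest.mock` (the patch target `app.main.scrape_page` is a hypothetical name; the real suite may mock the network at a different level). With no API keys set, generation falls back to the local deterministic generator, so only the scrape call needs mocking:

```python
from unittest.mock import AsyncMock, patch

from fastapi.testclient import TestClient

from app.main import app  # matches the uvicorn target app.main:app

client = TestClient(app)


# "app.main.scrape_page" is a hypothetical import path used for illustration;
# point the patch at whatever function actually performs the network call.
@patch("app.main.scrape_page", new_callable=AsyncMock)
def test_scrape_returns_post(mock_scrape):
    mock_scrape.return_value = {
        "title": "Example",
        "description": "",
        "images": [],
        "full_text": "Some article text.",
        "extras": {},
    }
    resp = client.post("/scrape", json={"url": "https://example.com/article"})
    assert resp.status_code == 200
    assert "generated_post" in resp.json()
```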
Build and run:

```
docker build -t scrape-api .
docker run --rm -p 8000:8000 --env-file .env scrape-api
```

- Scraper extracts: title, description, images (absolute URLs), full text, extras (author, date, tags, up to 3 tables).
- Selectors and heuristics are conservative to work across many sites; further tuning can be added.
- Network and AI calls are async; the endpoint composes both steps (see the sketch below).
- No DB is required; you can persist results by adding a repository layer (SQLite/Postgres) if desired.
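For reference, the composition pattern looks roughly like the sketch below; the helper names are stand-ins, not the project's actual modules:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class ScrapeRequest(BaseModel):
    url: str


async def scrape_page(url: str) -> dict:
    # Stand-in for the real scraper (httpx + selectolax).
    return {"title": "Example", "description": "", "images": [], "full_text": "...", "extras": {}}


async def generate_post(scraped: dict) -> dict:
    # Stand-in for the real generator (OpenAI / HuggingFace / local fallback).
    return {"title": scraped["title"], "intro": "", "body": scraped["full_text"],
            "highlights": [], "conclusion": ""}


@app.post("/scrape")
async def scrape(req: ScrapeRequest):
    # Both steps are awaited in sequence: scrape first, then generate from the result.
    scraped = await scrape_page(req.url)
    post = await generate_post(scraped)
    return {"scraped_data": scraped, "generated_post": post}
```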
MIT