pandastackDev/scraper_posting

Web Scrape & Generate API (FastAPI)

Scrape a web page (title, images, description, full text, extras) and generate a refined blog-style post using OpenAI or HuggingFace. The service exposes a single endpoint, POST /scrape, and serves Swagger and ReDoc documentation.

Features

  • Scraping: httpx + selectolax for fast HTML parsing
  • AI Generation: OpenAI (preferred), HuggingFace (fallback), or local deterministic fallback (no keys required)
  • API Docs: Swagger UI (/docs) and ReDoc (/redoc)
  • Health Check: /health
  • Tests: Minimal unit test with network mocked
  • Docker: Optional containerized run

Requirements

  • Python 3.9+

Quickstart (Local)

  1. Create and activate a virtual environment (Windows PowerShell):
     python -m venv .venv
     .\.venv\Scripts\Activate.ps1
  2. Install dependencies:
     pip install -r requirements.txt
  3. Configure environment (optional, for AI APIs):
     • Copy .env.example to .env and set values.
     • If you skip this step, a local deterministic generator is used.
  4. Run the server:
     uvicorn app.main:app --reload --port 8000
  5. Open the docs:
     • Swagger: http://localhost:8000/docs
     • ReDoc: http://localhost:8000/redoc

Endpoint

POST /scrape

Request body:

{
  "url": "https://example.com/article"
}

Response body (shape):

{
  "scraped_data": {
    "title": "...",
    "description": "...",
    "images": ["https://..."],
    "full_text": "...",
    "extras": {
      "author": "...",
      "publish_date": "...",
      "tags": ["..."],
      "tables": [{"rows": [["h1","h2"],["v1","v2"]]}]
    }
  },
  "generated_post": {
    "title": "...",
    "intro": "...",
    "body": "...",
    "highlights": ["..."],
    "conclusion": "..."
  }
}
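With the server running locally, the endpoint can be exercised from a small client script. The sketch below uses only the standard library and assumes the default address http://localhost:8000; function names are illustrative, not part of the project.

```python
import json
from urllib import request

def build_scrape_payload(url: str) -> bytes:
    """Encode the JSON body expected by POST /scrape."""
    return json.dumps({"url": url}).encode("utf-8")

def scrape(api_base: str, target_url: str) -> dict:
    """Call POST /scrape and return the decoded JSON response."""
    req = request.Request(
        f"{api_base}/scrape",
        data=build_scrape_payload(target_url),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())

# Usage (with the server running):
#   result = scrape("http://localhost:8000", "https://example.com/article")
#   print(result["generated_post"]["title"])
```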

AI Configuration

Set keys in .env (copy from .env.example):

  • OPENAI_API_KEY and optional OPENAI_MODEL (defaults to gpt-4o-mini)
  • or HUGGINGFACE_API_KEY
  • If neither is set, a local deterministic generator is used.
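A minimal .env might look like the following (values are placeholders, not real keys; only the variable names above are defined by the project):

```
# .env — placeholder values for illustration
OPENAI_API_KEY=sk-your-key-here
OPENAI_MODEL=gpt-4o-mini
# or, alternatively:
HUGGINGFACE_API_KEY=hf-your-key-here
```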

Run Tests

pytest -q
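The repository's test mocks the network so no real request is made. As a standalone illustration of that pattern (the function names here are hypothetical, not the app's actual helpers), an async fetch step can be tested against an AsyncMock client:

```python
import asyncio
from unittest.mock import AsyncMock

# Hypothetical helper standing in for the scraper's download step.
async def fetch_html(client, url: str) -> str:
    resp = await client.get(url)
    return resp.text

async def run_mocked() -> str:
    # Replace the HTTP client with an AsyncMock so no network call happens.
    client = AsyncMock()
    client.get.return_value.text = "<title>Stub Page</title>"
    return await fetch_html(client, "https://example.com")

stub_html = asyncio.run(run_mocked())
```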

Docker

Build and run:

docker build -t scrape-api .
docker run --rm -p 8000:8000 --env-file .env scrape-api

Notes & Design

  • Scraper extracts: title, description, images (absolute URLs), full text, extras (author, date, tags, up to 3 tables).
  • Selectors and heuristics are conservative to work across many sites; further tuning can be added.
  • Network and AI calls are async; the endpoint composes both steps.
  • No DB is required; you can persist results by adding a repository layer (SQLite/Postgres) if desired.
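The extraction heuristics above (title, meta description, images resolved to absolute URLs) can be sketched as follows. The app itself uses selectolax for speed; this standard-library version only illustrates the approach.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class PageExtractor(HTMLParser):
    """Illustration of the extraction heuristics (not the app's actual parser)."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.title = ""
        self.description = ""
        self.images = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") == "description":
            self.description = attrs.get("content", "")
        elif tag == "img" and attrs.get("src"):
            # Resolve relative src values against the page URL.
            self.images.append(urljoin(self.base_url, attrs["src"]))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

sample_html = """<html><head><title>Demo</title>
<meta name="description" content="A sample page.">
</head><body><img src="/img/logo.png"></body></html>"""
extractor = PageExtractor("https://example.com/post")
extractor.feed(sample_html)
```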

License

MIT
