A flexible, configuration-driven web scraper designed to extract article content and feed it into a Retrieval-Augmented Generation (RAG) pipeline.
This project uses a professional, scalable architecture with domain-specific parser configurations to accurately extract structured data from web pages.
- Selector Guide - Creating parser configurations with CSS and XPath
- XPath Feature Guide - Advanced XPath selector usage
- Article API - Article model reference
- Quick examples and tutorials
- Dual Selector System: Support for both CSS and XPath selectors with automatic detection
- 3-Layer Cleanup Architecture: Global, per-field, and safety cleanup for pristine content
- Markdown Output: Clean, structured markdown with preserved links and formatting
- Config-Driven Parsing: Define how to scrape any site using simple JSON configuration files
- Flexible Selectors: Support for fallback chains, parent scoping, and per-selector attributes
- Rich Metadata: Extracts OpenGraph, Schema.org, authors, dates, tags, and topics
- RAG-Ready:
  - Automatically chunks content with token estimation for LLM context windows.
  - Features a modular vector store engine with an adapter pattern for different databases (AstraDB) and embedding models (OpenAI).
- Production-Ready: Pydantic v2 validation, lazy-loaded clients for robust startup, error handling, and deterministic UUIDs.
- XPath Support: Use powerful XPath expressions for precise element selection
- Automatic Type Detection: Mix CSS and XPath selectors; the system detects the type automatically (see the sketch after this list)
- Attribute Extraction: Direct attribute access via XPath (e.g., `//time[@datetime]/@datetime`)
- 3-Layer Cleanup:
  - Global cleanup (script, style, noscript, iframe)
  - Per-field cleanup (ads, sponsors, related posts)
  - Safety cleanup with preset selectors
- Markdown Output: HTML → Clean HTML → Markdown workflow preserves structure
- 90+ Migrated Configs: All parser configs updated to the new architecture
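A minimal sketch of how the selector auto-detection and XPath attribute extraction could work. This is an illustrative approximation under assumed heuristics, not the project's actual implementation, and the helper names are hypothetical:

```python
# Hypothetical illustration of CSS/XPath auto-detection with a fallback chain.
from lxml import html


def looks_like_xpath(selector: str) -> bool:
    """Heuristic guess: XPath selectors usually start with '/', './' or '('."""
    return selector.startswith(("/", "./", "("))


def select_first(doc: html.HtmlElement, selectors: list[str]) -> str | None:
    """Try each selector in order (fallback chain) and return the first match as text."""
    for selector in selectors:
        if looks_like_xpath(selector):
            results = doc.xpath(selector)       # may return elements or attribute strings
        else:
            results = doc.cssselect(selector)   # requires the `cssselect` package
        if results:
            first = results[0]
            # XPath attribute selections (e.g. //time/@datetime) come back as plain strings.
            return first if isinstance(first, str) else first.text_content().strip()
    return None


page = html.fromstring("<article><time datetime='2024-01-01'>Jan 1</time></article>")
print(select_first(page, ["//time[@datetime]/@datetime", "time[datetime]"]))  # -> 2024-01-01
```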
```
.
├── Procfile           # Defines processes for Honcho (api, worker, beat)
├── api.py             # FastAPI application, the user-facing entrypoint
├── celery_app.py      # Celery application instance configuration
├── pyproject.toml     # Project metadata and dependencies
├── src/
│   └── llm_scraper/
│       ├── __init__.py
│       ├── articles.py    # Core Article data model and chunking logic
│       ├── meta.py        # Metadata extraction logic
│       ├── parsers/       # Site-specific parser configurations
│       ├── schema.py      # Pydantic models for configuration and data
│       ├── settings.py    # Application settings management (from .env)
│       ├── utils/         # Utility functions
│       └── vectors/       # Modular vector store engine and adapters
│           ├── abc.py         # Abstract base classes for adapters
│           ├── engine.py      # The main VectorStoreEngine
│           ├── dbs/           # Vector database adapters (e.g., AstraDB)
│           └── embeddings/    # Embedding model adapters (e.g., OpenAI)
└── worker.py          # Celery worker and scheduler (Celery Beat) definitions
```
- Install Dependencies: This project uses `uv` for package management.

  ```bash
  uv pip install -r requirements.txt
  ```

- Environment Variables: Create a `.env` file in the root directory and add your credentials (a sketch of how these might be loaded follows these steps):

  ```bash
  # .env
  OPENAI_API_KEY="sk-..."
  ASTRA_DB_APPLICATION_TOKEN="AstraCS:..."
  ASTRA_DB_API_ENDPOINT="https://..."
  ASTRA_DB_COLLECTION_NAME="your_collection_name"
  REDIS_URL="redis://localhost:6379/0"
  ```

- Run Redis: Ensure you have a Redis server running locally. You can use Docker for this:

  ```bash
  docker run -d -p 6379:6379 redis
  ```
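These credentials are consumed by `settings.py` (application settings management from `.env`). Purely as a hypothetical sketch of what such a Pydantic v2 settings model could look like (field names mirror the variables above; the real class may differ):

```python
# Hypothetical sketch of .env-backed settings with pydantic-settings (Pydantic v2).
from pydantic_settings import BaseSettings, SettingsConfigDict


class ScraperSettings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env", extra="ignore")

    openai_api_key: str
    astra_db_application_token: str
    astra_db_api_endpoint: str
    astra_db_collection_name: str
    redis_url: str = "redis://localhost:6379/0"


settings = ScraperSettings()  # reads the .env file and validates the fields
print(settings.redis_url)
```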
Validate article extraction from a fixture:

```bash
# Test with markdown output (default)
python scripts/validate_article_fixture.py fixtures/en/c/cryptoslate.com.json

# Test with HTML output
python scripts/validate_article_fixture.py fixtures/en/c/crypto.news.json --format html
```

Fetch HTML and create a test fixture:

```bash
python scripts/fetch_and_create_fixture.py https://crypto.news/article-slug/
```

Process multiple URLs at once:

```bash
# From a file (one URL per line)
python scripts/batch_create_fixtures.py urls.txt

# From command line
python scripts/batch_create_fixtures.py --urls https://site1.com/article https://site2.com/article
```

Analyze HTML structure using preset selectors:

```bash
python scripts/debug_site_structure.py fixtures/en/c/domain.json
```

You can run the system in two ways: locally using honcho or with Docker.
This is the easiest way to run the entire system, including the Redis database.
Prerequisites:
- Docker and Docker Compose installed.
- A `.env` file with your credentials (see Setup section).
To start the entire system, run:
```bash
docker-compose up --build
```

This command will:

- Build the Docker image for the application based on the `Dockerfile`.
- Start containers for the `api`, `worker`, `beat`, and `redis` services.
- Display all logs in your terminal.
To stop the services, press Ctrl+C.
Use this method if you prefer not to use Docker.
Prerequisites:
- Python and `uv` installed.
- A running Redis server (e.g., `docker run -d -p 6379:6379 redis`).
- Dependencies installed (`uv pip install -r requirements.txt`).
- A `.env` file with your credentials.
To start the entire system, run:
```bash
honcho start
```

The API provides two main functions:
- Scraping: Extract article content from URLs (single pages, sitemaps, or RSS feeds)
- Querying: Search the RAG vector database
Notes:
- User scraping via API does NOT automatically store articles in the vector database. Vector storage is handled by system-scheduled tasks to optimize cost and ensure parsing accuracy.
- Bulk scraping modes (sitemap, rss) are protected by a system secret header when configured.
Scrape content based on the specified mode.
- single_page: Scrapes a single article URL and returns the Article object inline.
- sitemap or rss: Starts a background task and returns a task_id (results are retrieved via the /tasks and /scrapes endpoints).
Headers (required for sitemap/rss when SYSTEM_SCRAPE_SECRET is set):
- X-System-Key: must match the configured SYSTEM_SCRAPE_SECRET
Request Body:
- url: string (article/sitemap/feed URL)
- mode: one of single_page | sitemap | rss
- output_format: markdown | html (default markdown)
Responses:
- single_page: Article object (includes content).
- sitemap/rss: { "task_id": "...", "status_endpoint": "/tasks/{id}" }
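As an illustration, client calls for both modes might look like the following. The base URL, the `/scrape` route path, and the example URLs are assumptions made for this sketch; only the request and response shapes come from the description above:

```python
# Hypothetical client calls; the base URL and the /scrape route path are assumptions.
import requests

BASE = "http://localhost:8000"

# single_page: the Article object (including content) is returned inline.
article = requests.post(
    f"{BASE}/scrape",
    json={
        "url": "https://crypto.news/article-slug/",
        "mode": "single_page",
        "output_format": "markdown",
    },
    timeout=60,
).json()

# sitemap/rss: a background task is started; X-System-Key is required
# when SYSTEM_SCRAPE_SECRET is configured.
task = requests.post(
    f"{BASE}/scrape",
    json={"url": "https://example.com/sitemap.xml", "mode": "sitemap"},
    headers={"X-System-Key": "your-system-secret"},
    timeout=60,
).json()
print(task["task_id"], task["status_endpoint"])
```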
Check background task status. Heavy payloads are stripped.
Response fields:
- task_id, status (PENDING|PROGRESS|SUCCESS|FAILURE), result (lightweight meta), article_ids (if any)
Fetch paginated scrape results for a task.
Query params:
- include: ids | compact | full (full currently behaves like compact; article bodies must be fetched via /article/{id})
- offset: integer >= 0
- limit: integer 1..50 (enforced)
Responses:
- include=ids: { ids: [id, ...], total, offset, limit, next_offset }
- include=compact|full: { articles: [{ id, title, source_url, domain, word_count, format }], total, offset, limit, next_offset, note }
Fetch a single persisted article, including its body. Use this to retrieve content for an id listed in the /scrapes results.
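A sketch of paginating a finished task's results with include=ids and then pulling full bodies. The base URL and the exact route paths (`/scrapes/{task_id}`, `/article/{id}`) are assumptions inferred from the descriptions above:

```python
# Hypothetical pagination loop; base URL and route paths are assumptions.
import requests

BASE = "http://localhost:8000"
task_id = "your-task-id"

offset, limit = 0, 50  # limit is enforced to the 1..50 range
while True:
    page = requests.get(
        f"{BASE}/scrapes/{task_id}",
        params={"include": "ids", "offset": offset, "limit": limit},
        timeout=30,
    ).json()
    for article_id in page["ids"]:
        article = requests.get(f"{BASE}/article/{article_id}", timeout=30).json()
        print(article_id, article.get("title"))
    if page.get("next_offset") is None:  # no more pages
        break
    offset = page["next_offset"]
```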
Paginated list of discovered URLs for the task (helpful for large sitemap/rss runs).
Query params: offset, limit (1..50)
Diagnostic stats for the task: totals, success/failure counts, and failed URL list.
Delete cached results, stats, and the per-task URL queue.
Perform a similarity search on the vectorized data in AstraDB.
Request Body:
```json
{
  "query": "What is blockchain?",
  "limit": 5
}
```

Parser configs support both CSS and XPath selectors with automatic type detection:
```json
{
  "domain": "example.com",
  "lang": "en",
  "type": "article",
  "cleanup": ["script", "style", "noscript", "iframe"],
  "title": {
    "selector": ["h1.article-title", "h1"]
  },
  "content": {
    "selector": [
      "//article[@id='main']/div[3]",
      ".article-content",
      "article"
    ],
    "cleanup": [
      ".ads",
      ".related-posts",
      "[class*='sponsor']"
    ]
  },
  "authors": {
    "selector": ["//a[@rel='author']", ".author-name"],
    "all": true
  },
  "date_published": {
    "selector": ["//time[@datetime]/@datetime", "time[datetime]"],
    "attribute": "datetime"
  },
  "tags": {
    "selector": ["//a[@rel='tag']", ".tags a"],
    "all": true
  }
}
```

Key Features:
- Selector fallback chains: Try XPath first, fall back to CSS
- Global cleanup: Remove script/style/iframe from entire page
- Per-field cleanup: Remove ads/sponsors from specific fields
- Attribute extraction: Get attribute values directly with the XPath `@attr` syntax
- Multi-value extraction: Use `"all": true` to extract all matches
See XPATH_FEATURE.md for detailed examples and best practices.
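To make the config-driven flow concrete, here is a rough, hypothetical sketch of how a config like the one above could drive field extraction (fallback chain, per-field cleanup, multi-value fields). It uses CSS selectors only for brevity and is not the project's actual parser code:

```python
# Hypothetical sketch of config-driven extraction with per-field cleanup (lxml + cssselect).
from lxml import html

config = {
    "content": {"selector": [".article-content", "article"], "cleanup": [".ads", ".related-posts"]},
    "tags": {"selector": [".tags a"], "all": True},
}

doc = html.fromstring(
    "<div>"
    "<article class='article-content'><p>Body</p><div class='ads'>Buy now!</div></article>"
    "<div class='tags'><a>BTC</a> <a>ETH</a></div>"
    "</div>"
)


def extract(doc: html.HtmlElement, field_cfg: dict):
    for selector in field_cfg["selector"]:  # fallback chain: the first selector that matches wins
        nodes = doc.cssselect(selector)
        if not nodes:
            continue
        for node in nodes:
            for junk_selector in field_cfg.get("cleanup", []):  # per-field cleanup (ads, related posts, ...)
                for junk in node.cssselect(junk_selector):
                    junk.drop_tree()
        if field_cfg.get("all"):            # "all": true -> return every match
            return [node.text_content().strip() for node in nodes]
        return nodes[0].text_content().strip()
    return None


print(extract(doc, config["content"]))  # -> "Body"
print(extract(doc, config["tags"]))     # -> ["BTC", "ETH"]
```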
Core RAG/vector:
- OPENAI_API_KEY
- ASTRA_DB_APPLICATION_TOKEN
- ASTRA_DB_API_ENDPOINT
- ASTRA_DB_COLLECTION_NAME
Async/background & caching:
- REDIS_URL: Redis broker for Celery
- SCRAPE_RESULT_TTL_DAYS: Days to keep cached results (default 7)
- SCRAPE_RESULT_MAX_FULL: Max articles to store as a full list per task (beyond this, only per-article docs are saved)
- MAX_CONCURRENT_SCRAPES: Limit concurrent fetches in bulk modes (default 8)
- SCRAPE_TIMEOUT_SECONDS: Per-request timeout (default 20)
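A minimal sketch of how MAX_CONCURRENT_SCRAPES and SCRAPE_TIMEOUT_SECONDS could be enforced during a bulk run (illustrative only; the project's worker code may differ):

```python
# Hypothetical sketch: bound concurrency and per-request timeout for bulk fetching.
import asyncio

import httpx

MAX_CONCURRENT_SCRAPES = 8    # default per the settings above
SCRAPE_TIMEOUT_SECONDS = 20   # default per the settings above


async def fetch_all(urls: list[str]) -> list[str | None]:
    semaphore = asyncio.Semaphore(MAX_CONCURRENT_SCRAPES)

    async def fetch(client: httpx.AsyncClient, url: str) -> str | None:
        async with semaphore:  # at most MAX_CONCURRENT_SCRAPES requests in flight
            try:
                response = await client.get(url)
                response.raise_for_status()
                return response.text
            except httpx.HTTPError:
                return None    # failed URLs would surface in the task's diagnostic stats

    async with httpx.AsyncClient(timeout=SCRAPE_TIMEOUT_SECONDS, follow_redirects=True) as client:
        return await asyncio.gather(*(fetch(client, url) for url in urls))


# asyncio.run(fetch_all(["https://example.com/article-1", "https://example.com/article-2"]))
```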
Security:
- SYSTEM_SCRAPE_SECRET: If set, sitemap/rss modes require X-System-Key header to match
Hashing (advanced):
- LLM_SCRAPER_HASH_ALGO: md5 | sha1 | sha256 | hmac-sha256 (default md5 for backward compatibility)
- LLM_SCRAPER_HASH_SECRET: required when using hmac-sha256
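As an illustration of how these settings might combine with the deterministic UUIDs mentioned in the feature list, here is a hypothetical sketch (a guess at the mechanism, not the project's actual code):

```python
# Hypothetical sketch of deterministic IDs driven by LLM_SCRAPER_HASH_ALGO / LLM_SCRAPER_HASH_SECRET.
import hashlib
import hmac
import os
import uuid


def deterministic_id(source_url: str) -> uuid.UUID:
    algo = os.getenv("LLM_SCRAPER_HASH_ALGO", "md5")
    data = source_url.encode("utf-8")

    if algo == "hmac-sha256":
        secret = os.environ["LLM_SCRAPER_HASH_SECRET"]   # required for hmac-sha256
        digest = hmac.new(secret.encode("utf-8"), data, hashlib.sha256).digest()
    else:
        digest = hashlib.new(algo, data).digest()        # md5 | sha1 | sha256

    # Fold the first 16 bytes of the digest into a UUID so the same URL always maps to the same ID.
    return uuid.UUID(bytes=digest[:16])


print(deterministic_id("https://example.com/article-slug/"))
```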