AI Data Retriever
An AI-powered data collection, semantic search, and analysis tool built with FastAPI, SQLite, and FAISS.
Collect, organize, and explore web content intelligently, locally and privately.
AI Data Retriever is an end-to-end Python application that:
- Scrapes content from the URLs you add (JavaScript-rendered pages need the optional Playwright setup)
- Stores and tracks data locally in a SQLite database
- Converts text into vector embeddings for semantic search using FAISS
- Provides a clean, fast web UI (HTML/CSS/JS served by FastAPI)
Think of it as your personal AI-powered web intelligence dashboard, suited to research, knowledge management, or building datasets for AI/ML projects.
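The extraction step can be illustrated with a small sketch. The project itself uses Requests and BeautifulSoup; this example swaps in Python's standard-library `html.parser` so it runs with no dependencies, and it parses an inline HTML string instead of fetching a live URL. `PageExtractor` and `extract_page` are hypothetical names, not part of the project's code.

```python
from html.parser import HTMLParser


class PageExtractor(HTMLParser):
    """Collects the <title> text and the visible text of an HTML document."""

    SKIP_TAGS = {"script", "style"}  # contents of these tags are not page text

    def __init__(self):
        super().__init__()
        self.title = ""
        self.chunks = []
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._in_title:
            self.title += text
        elif self._skip_depth == 0:
            self.chunks.append(text)


def extract_page(html: str) -> dict:
    """Return the page title and its visible text joined by single spaces."""
    parser = PageExtractor()
    parser.feed(html)
    parser.close()
    return {"title": parser.title, "text": " ".join(parser.chunks)}


# Inline sample document (hypothetical content).
sample = """<html><head><title>Hello</title><style>p{color:red}</style></head>
<body><h1>Heading</h1><p>First paragraph.</p><script>var x = 1;</script></body></html>"""
page = extract_page(sample)
print(page)
```

A real scraper would also need to handle network errors, encodings, and boilerplate such as navigation menus, which this sketch ignores.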
| Feature | Description |
|---|---|
| Web Scraper | Extracts article titles, text, and metadata using BeautifulSoup. |
| Local Database (SQLite) | Stores and tracks all retrieved data in a local SQLite database file. |
| Semantic Search Engine (FAISS) | Search by meaning, not keywords. |
| AI Embeddings | Uses SentenceTransformers (all-MiniLM-L6-v2) for vector representations. |
| Built-in Web UI | HTML + CSS interface to add URLs, view pages, and perform AI searches. |
| Offline-Ready | No external APIs required; runs completely on your machine. |
| Modular Codebase | Cleanly separated backend, scraper, embedder, and templates. |
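"Search by meaning" comes down to comparing embedding vectors. The sketch below shows the idea with plain-Python cosine similarity, which stands in for the similarity search that FAISS performs at scale. The three-dimensional vectors are hand-made placeholders, not output from all-MiniLM-L6-v2, and `normalize`, `search`, and the document ids are hypothetical names.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def search(query_vec, doc_vecs, k=2):
    """Return the k (doc_id, score) pairs with the highest cosine similarity."""
    q = normalize(query_vec)
    scored = []
    for doc_id, vec in doc_vecs.items():
        d = normalize(vec)
        # Dot product of unit vectors equals cosine similarity.
        scored.append((doc_id, sum(a * b for a, b in zip(q, d))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Placeholder vectors standing in for real embeddings (hypothetical values).
docs = {
    "cats-article": [0.9, 0.1, 0.0],
    "dogs-article": [0.8, 0.3, 0.1],
    "tax-guide": [0.0, 0.1, 0.95],
}
results = search([1.0, 0.2, 0.0], docs, k=2)
for doc_id, score in results:
    print(f"{doc_id}: {score:.3f}")
```

With real embeddings, the query text would be encoded by the same model as the stored pages, so that related wording lands close together even when no keywords match.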
| Layer | Technology | Purpose |
|---|---|---|
| Backend | FastAPI | API + template rendering |
| Database | SQLite (SQLModel) | Local structured storage |
| Scraper | Requests + BeautifulSoup | Web content extraction |
| Embeddings | SentenceTransformers | Text vectorization |
| Vector Search | FAISS | Semantic similarity search |
| Frontend | HTML + CSS + JS | Interactive dashboard |
| Optional | Playwright | Dynamic site scraping (JS pages) |
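As a rough illustration of the storage layer, the sketch below saves scraped pages with Python's built-in `sqlite3` module. The project uses SQLModel on top of SQLite, so this is a simplified stand-in; the `pages` table, its columns, and the `open_db` and `save_page` helpers are assumptions made for the example.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open (or create) the database and make sure the pages table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS pages (
               id INTEGER PRIMARY KEY,
               url TEXT UNIQUE NOT NULL,
               title TEXT,
               content TEXT
           )"""
    )
    return conn


def save_page(conn, url, title, content):
    """Insert a page, or update it if the URL was stored before.

    The ON CONFLICT upsert syntax requires SQLite 3.24 or newer.
    """
    conn.execute(
        """INSERT INTO pages (url, title, content) VALUES (?, ?, ?)
           ON CONFLICT(url) DO UPDATE SET title = excluded.title,
                                          content = excluded.content""",
        (url, title, content),
    )
    conn.commit()


conn = open_db()  # in-memory database for the demo; pass a file path to persist
save_page(conn, "https://example.com", "Example", "First version")
save_page(conn, "https://example.com", "Example", "Second version")
rows = conn.execute("SELECT url, title, content FROM pages").fetchall()
print(rows)
```

Keying rows on the URL means re-scraping a page refreshes its stored content instead of creating a duplicate.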
- Python 3.9+
- pip (Python package manager)
- (Optional) Playwright if you want dynamic page scraping
1. Clone the repo:

       git clone https://github.com/yourusername/ai-data-retriever.git
       cd ai-data-retriever

2. Create and activate a virtual environment (on Windows, run `.venv\Scripts\activate` instead of the `source` line):

       python -m venv .venv
       source .venv/bin/activate

3. Install dependencies:

       pip install -r backend/requirements.txt

4. Run the backend:

       uvicorn backend.app.main:app --reload

5. Open the app in your browser:

       http://127.0.0.1:8000
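If the page does not load, a quick script can confirm whether the server is answering at all. This is a minimal sketch using only the standard library; `is_reachable` is a hypothetical helper, and it assumes the root URL returns a successful response once the backend is running.

```python
import urllib.error
import urllib.request


def is_reachable(url, timeout=2.0):
    """Return True if the URL answers with a 2xx or 3xx status, else False."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        # Covers refused connections, timeouts, and HTTP error statuses.
        return False


if __name__ == "__main__":
    url = "http://127.0.0.1:8000"
    print(f"{url} reachable: {is_reachable(url)}")
```

A `False` result usually means the `uvicorn` process is not running or is listening on a different port.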