diff --git a/GEMINI.md b/GEMINI.md
index 5686cd8..eb79b1b 100644
--- a/GEMINI.md
+++ b/GEMINI.md
@@ -1,87 +1,105 @@
-# Gemini Code Assistant Context
-
-This document provides context for the `insta_rag` project, a Python-based Retrieval-Augmented Generation (RAG) library.
+# Gemini Context: insta_rag Project
## Project Overview
-`insta_rag` is a modular and extensible library for building RAG pipelines. It abstracts the complexity of document processing, chunking, embedding, and retrieval into a simple-to-use client. The library is designed with a plug-and-play architecture, allowing developers to easily swap components like embedding models, vector databases, and rerankers.
+This project, `insta_rag`, is a modular and extensible Python library designed for building Retrieval-Augmented Generation (RAG) pipelines. It abstracts the complexity of RAG into three primary operations: adding, updating, and retrieving documents.
+
+**Key Technologies & Architecture:**
+
+- **Core Client:** The main entry point is the `RAGClient`, which orchestrates all operations.
+- **Embeddings & LLMs:** Utilizes OpenAI (`text-embedding-3-large`, GPT-4) or Azure OpenAI for generating embeddings and hypothetical answers (HyDE).
+- **Vector Database:** Uses Qdrant for efficient vector storage and search.
+- **Reranking:** Integrates Cohere for cross-encoder reranking to improve the relevance of search results.
+- **Architecture:** The library is built on an interface-based design, allowing for plug-and-play components. Core modules for `chunking`, `embedding`, `vectordb`, and `retrieval` each have a `base.py` defining an abstract interface, making it easy to extend with new implementations (e.g., adding Pinecone as a vector DB).
+- **Data Models:** Pydantic is used for robust data validation and clear data structures for documents, chunks, and API responses.
+
+The primary goal is to provide a complete, configuration-driven RAG system that is both easy to use and easy to extend.
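+
+As a sketch of that extension pattern, a new vector database backend would subclass the abstract interface. The method names follow the `BaseVectorDB` contract described in the project README; the import path and exact signatures are assumptions:
+
+```python
+from typing import Any
+
+# Hypothetical import path; the abstract base lives in src/insta_rag/vectordb/base.py
+from insta_rag.vectordb.base import BaseVectorDB
+
+
+class PineconeVectorDB(BaseVectorDB):
+    """Sketch of a new backend plugged in through the vectordb interface."""
+
+    def create_collection(self, name: str, dimensions: int) -> None:
+        ...  # create an index sized for the embedding dimensions
+
+    def upsert(self, collection: str, points: list[Any]) -> None:
+        ...  # write vectors and payloads
+
+    def search(
+        self, collection: str, query_vector: list[float], filters: dict
+    ) -> list[Any]:
+        ...  # filtered similarity search
+
+    def delete(self, collection: str, filters: dict) -> None:
+        ...  # remove matching points
+```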
-### Key Technologies
+## Documentation
-- **Programming Language:** Python 3.9+
-- **Core Dependencies:**
- - `openai`: For generating embeddings and powering HyDE (Hypothetical Document Embeddings).
- - `qdrant-client`: For vector storage and search.
- - `cohere`: For reranking search results.
- - `pdfplumber` & `PyPDF2`: For PDF text extraction.
- - `pymongo`: For metadata storage.
- - `fastapi`: For the testing and example API.
-- **Architecture:**
- - **Modular:** The library is divided into distinct modules for chunking, embedding, vector database interaction, and retrieval.
- - **Interface-Based:** Core components are built around abstract base classes, making it easy to add new implementations.
- - **Configuration-Driven:** A central `RAGConfig` object controls the behavior of the entire library.
+The project documentation has been reorganized for clarity and is located in the `/docs` directory.
+
+- **[README.md](./docs/README.md):** Main landing page with links to all other documents.
+- **[installation.md](./docs/installation.md):** Detailed installation instructions.
+- **[quickstart.md](./docs/quickstart.md):** A hands-on guide to get started quickly.
+- **Guides (`/docs/guides`):**
+ - **[document-management.md](./docs/guides/document-management.md):** Covers adding, updating, and deleting documents.
+ - **[retrieval.md](./docs/guides/retrieval.md):** Explains the advanced hybrid retrieval pipeline.
+ - **[storage-backends.md](./docs/guides/storage-backends.md):** Details on configuring Qdrant-only vs. hybrid Qdrant+MongoDB storage.
+ - **[local-development.md](./docs/guides/local-development.md):** Instructions for setting up a local Qdrant instance.
## Building and Running
-### Installation
+### 1. Installation
+
+The project uses `uv` for package management.
+
+```bash
+# Install the package in editable mode with all dependencies
+uv pip install -e .
+```
-1. **Install Python:** Ensure you have Python 3.9 or higher installed.
-2. **Install Dependencies:** It is recommended to use a virtual environment.
+Alternatively, using `pip` and a virtual environment:
- ```bash
- python -m venv venv
- source venv/bin/activate
- pip install -r requirements.txt
- pip install -e .
- ```
+```bash
+# Create and activate a virtual environment
+python3 -m venv .venv
+source .venv/bin/activate
-### Running the Testing API
+# Install in editable mode
+pip install -e .
+```
-The project includes a FastAPI-based testing API in the `testing_api` directory. This is the best way to test the library's functionality.
+### 2. Environment Setup
-1. **Set up Environment Variables:** Create a `.env` file in the root of the project with the following variables:
+The client is configured via a `.env` file. Create one in the project root with the variables listed in `docs/installation.md`.
- ```env
- QDRANT_URL="your_qdrant_url"
- QDRANT_API_KEY="your_qdrant_api_key"
- AZURE_OPENAI_ENDPOINT="your_azure_openai_endpoint"
- AZURE_OPENAI_API_KEY="your_azure_openai_api_key"
- AZURE_EMBEDDING_DEPLOYMENT="text-embedding-3-large"
- ```
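+
+A minimal sketch for the Azure OpenAI + Qdrant setup (the authoritative variable list lives in `docs/installation.md`):
+
+```env
+QDRANT_URL="your_qdrant_url"
+QDRANT_API_KEY="your_qdrant_api_key"
+AZURE_OPENAI_ENDPOINT="your_azure_openai_endpoint"
+AZURE_OPENAI_API_KEY="your_azure_openai_api_key"
+AZURE_EMBEDDING_DEPLOYMENT="text-embedding-3-large"
+```
+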
+### 3. Running the Example
-2. **Run the API:**
+The `examples/basic_usage.py` script demonstrates the core functionality of the library.
- ```bash
- cd testing_api
- python main.py
- ```
+```bash
+# Run the basic usage example
+python examples/basic_usage.py
+```
- The API will be available at `http://localhost:8000`. You can access the Swagger UI for interactive documentation at `http://localhost:8000/docs`.
+### 4. Running Tests
+
+The project contains a `tests/` directory. Tests can be run using `pytest`.
+
+```bash
+# TODO: Verify if this is the correct test command.
+pytest
+```
## Development Conventions
-### Code Style
+This project has a strong focus on code quality and consistency, enforced by several tools.
+
+### 1. Linting and Formatting
+
+- **Tool:** `Ruff` is used for both linting and formatting.
+
+- **Usage:**
+
+ ```bash
+ # Check for linting errors and auto-fix them
+ ruff check . --fix
-- The project uses `ruff` for linting and formatting. The configuration is in `pyproject.toml`.
-- **Line Length:** 88 characters.
-- **Quotes:** Double quotes (`"`).
-- **Indentation:** 4 spaces.
+ # Format the codebase
+ ruff format .
+ ```
-### Testing
+### 2. Pre-commit Hooks
-- The `testing_api` directory contains a comprehensive suite of endpoints for testing all components of the `insta_rag` library.
-- To run the tests, start the testing API and use a tool like `curl` or the Swagger UI to send requests to the various endpoints.
+- **Framework:** `pre-commit` is used to run checks before each commit.
-### Commits
+- **Setup:** First-time contributors must install the hooks:
-- The project uses [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/).
-- `commitizen` is used to format commit messages.
+ ```bash
+ pre-commit install
+ ```
-### Key Files
+### 3. Commit Messages
-- `src/insta_rag/core/client.py`: The main entry point for the RAG library.
-- `src/insta_rag/core/config.py`: Defines the configuration for the RAG client.
-- `src/insta_rag/retrieval/reranker.py`: Implements reranking logic.
-- `testing_api/main.py`: The FastAPI application for testing the library.
-- `README.md`: Provides a detailed overview of the library's architecture and usage.
-- `pyproject.toml`: Defines project dependencies and tool configurations.
+- **Standard:** The project follows the **Conventional Commits** specification, enforced by `commitizen`.
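+
+- **Usage:** Commits can be composed interactively with the commitizen CLI:
+
+  ```bash
+  # Launch the interactive Conventional Commits prompt
+  # (produces messages like "feat(retrieval): add score threshold")
+  cz commit
+  ```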
diff --git a/README.md b/README.md
index 5baf1e4..3e53499 100644
--- a/README.md
+++ b/README.md
@@ -1,1145 +1,64 @@
-# RAG Library Design for Doc Directors Pipeline
+# insta_rag
-## 1. Library Architecture
+`insta_rag` is a modular, plug-and-play Python library for building advanced Retrieval-Augmented Generation (RAG) pipelines. It abstracts the complexity of document processing, embedding, and hybrid retrieval into a simple, configuration-driven client.
-### Overview
+## Core Features
-The RAG library is designed as a modular, plug-and-play system that abstracts all RAG complexity into three primary operations: **Input**, **Update**, and **Retrieve**. The architecture follows the principle of separation of concerns with clear interfaces between components.
+- **Semantic Chunking**: Splits documents at natural topic boundaries to preserve context.
+- **Hybrid Retrieval**: Combines semantic vector search with BM25 keyword search, capturing both conceptual matches and exact terms.
+- **Query Transformation (HyDE)**: Uses an LLM to generate hypothetical answers, improving retrieval relevance.
+- **Reranking**: Integrates with state-of-the-art rerankers like Cohere to intelligently re-order results.
+- **Pluggable Architecture**: Easily extend the library by adding new chunkers, embedders, or vector databases.
+- **Hybrid Storage**: Optional integration with MongoDB for cost-effective content storage, keeping Qdrant lean for vector search.
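+
+Hybrid storage is opt-in. A minimal sketch of forcing Qdrant-only storage, using the `config.mongodb` toggle described in the storage-backends guide:
+
+```python
+from insta_rag import RAGClient, RAGConfig
+
+config = RAGConfig.from_env()
+config.mongodb = None  # Qdrant-only mode: chunk content stays in the Qdrant payload
+client = RAGClient(config)
+```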
-### Core Components
+## Quick Start
-**Module Structure:**
+### 1. Installation
-```text
-insta_rag/
-├── core/ # Central orchestration layer
-│ ├── client.py # Main RAGClient - entry point for all operations
-│ └── config.py # Configuration management and validation
-│
-├── chunking/ # Document chunking strategies
-│ ├── base.py # Abstract interface for all chunkers
-│ ├── semantic.py # Semantic chunking (primary method)
-│ └── utils.py # Token counting, text splitting utilities
-│
-├── embedding/ # Vector embedding generation
-│ ├── base.py # Abstract interface for embedding providers
-│ └── openai.py # OpenAI text-embedding-3-large implementation
-│
-├── vectordb/ # Vector database operations
-│ ├── base.py # Abstract interface for vector stores
-│ └── qdrant.py # Qdrant implementation
-│
-├── retrieval/ # Hybrid retrieval system
-│ ├── query_generator.py # Query optimization and HyDE generation
-│ ├── vector_search.py # Semantic similarity search
-│ ├── keyword_search.py # BM25 lexical search
-│ └── reranker.py # Cross-encoder reranking (Cohere)
-│
-└── models/ # Data structures
- ├── chunk.py # Chunk representation and metadata
- ├── document.py # Document input specifications
- └── response.py # Standardized API responses
-```
-
-### Component Interaction Flow
-
-**High-Level Architecture:**
-
-
-
-```mermaid
-graph TD
- A["RAGClient
(Orchestrates all operations, manages configuration)"] --> B["Chunking Strategy"];
- A --> C["Embedding Provider"];
- A --> D["Vector Database
(Qdrant)"];
-
- D --> E["Retrieval Pipeline"];
-
- E --> F["Query Generator
(HyDE)"];
- E --> G["Vector Search"];
- E --> H["Keyword Search
(BM25)"];
-
- G --> I["Reranker
(Cohere)"];
-```
-
-### Design Principles
-
-**1. Interface-Based Design**
-
-- All major components (chunking, embedding, vector DB, reranking) have abstract base interfaces
-- Enables easy swapping of implementations without affecting client code
-- Future providers can be added by implementing the base interface
-
-**2. Configuration-Driven Behavior**
-
-- Single configuration object controls all library behavior
-- All parameters have sensible defaults based on research best practices
-- Configuration is validated at initialization to fail fast
-
-**3. Extensibility Without Breaking Changes**
-
-- New chunking methods, embedding providers, or vector databases can be added
-- Existing code continues to work as new options are introduced
-- Version-controlled feature flags for experimental capabilities
-
-______________________________________________________________________
-
-## 2. Core API Operations
-
-The library exposes three primary operations that handle the complete RAG lifecycle:
-
-### 2.1 Library Initialization
-
-**Purpose:** Configure the RAG system with all necessary credentials and parameters.
-
-**Configuration Categories:**
-
-#### A. Vector Database Configuration
-
-- Connection details for Qdrant instance
-- API authentication credentials
-- Collection management settings
-
-#### B. Embedding Configuration
-
-- Provider selection (OpenAI, with future support for Cohere, Azure, etc.)
-- Model specification (default: text-embedding-3-large)
-- Dimensionality settings (3072 dimensions)
-- API credentials
-
-#### C. Reranking Configuration
-
-- Provider selection (Cohere Rerank 3.5, with future cross-encoder support)
-- Model specification
-- API credentials
-- Top-k selection parameters
-
-#### D. LLM Configuration (for Query Generation)
-
-- Provider selection (OpenAI GPT-4, with future Anthropic, Azure support)
-- Model specification
-- API credentials for HyDE query generation
-
-#### E. Chunking Strategy Configuration
-
-- Method selection (semantic chunking as primary)
-- Maximum chunk size (default: 1000 tokens)
-- Overlap percentage (default: 20%)
-- Semantic breakpoint threshold (95th percentile)
-
-#### F. PDF Processing Configuration
-
-- Parser selection (pdfplumber, with future Chunkr, Unstructured.io support)
-- Text extraction settings
-- Quality validation parameters
-
-#### G. Retrieval Configuration
-
-- Vector search limits (25 chunks per query)
-- Keyword search limits (50 BM25 chunks)
-- Feature toggles (HyDE, keyword search)
-- Final reranking top-k (20 chunks)
-- Distance metric (cosine similarity)
-
-**Initialization Flow:**
+```bash
+# Recommended: using uv
+uv pip install insta-rag
-```text
-User Creates Config Object
- ↓
-Configuration Validation
- (Check required fields, validate API keys format, verify parameters)
- ↓
-Initialize RAGClient
- ↓
-Establish Connections
- (Qdrant, OpenAI API, Cohere API)
- ↓
-Verify System Health
- (Test connections, validate models available)
- ↓
-Client Ready for Operations
+# Or with pip
+pip install insta-rag
```
-______________________________________________________________________
+### 2. Basic Usage
-### 2.2 Knowledge Base Input Operation
+```python
+from insta_rag import RAGClient, RAGConfig, DocumentInput
-**Function**: `add_documents()`
+# Load configuration from environment variables (.env file)
+config = RAGConfig.from_env()
+client = RAGClient(config)
-**Purpose:** Process documents, create semantic chunks, generate embeddings, and store in vector database.
+# 1. Add documents to a collection
+documents = [DocumentInput.from_text("Your first document content.")]
+client.add_documents(documents, collection_name="my_docs")
-**Input Parameters:**
+# 2. Retrieve relevant information
+response = client.retrieve(
+ query="What is this document about?", collection_name="my_docs"
+)
-1. **documents**: List of document inputs
-
- - Accepts files (PDFs), raw text, or binary content
- - Each document can have individual metadata
- - Supports batch processing of multiple documents
-
-1. **collection_name**: Target Qdrant collection
-
- - Auto-creates collection if it doesn't exist
- - Manages collection schema and indexing
-
-1. **metadata**: Global metadata for all chunks
-
- - User identification (user_id)
- - Document categorization (document_type, is_standalone)
- - Template association (template_id)
- - Any custom fields
-
-1. **batch_size**: Processing batch size (default: 100)
-
- - Controls memory usage during embedding generation
- - Optimizes API calls to embedding provider
-
-1. **validate_chunks**: Quality validation toggle (default: True)
-
- - Token count validation
- - Garbled text detection
- - Minimum length requirements
-
-**Processing Flow:**
-
-```text
-Documents Input (PDF/Text/Binary)
- ↓
-┌────────────────────────────────┐
-│ PHASE 1: Document Loading │
-└────────────────────────────────┘
- - Read files from paths
- - Decode binary content
- - Validate file formats
- ↓
-┌────────────────────────────────┐
-│ PHASE 2: Text Extraction │
-└────────────────────────────────┘
- - Extract text using pdfplumber (primary)
- - Fallback to PyPDF2 if needed
- - Detect encryption/corruption
- - Validate text quality
- ↓
-┌────────────────────────────────┐
-│ PHASE 3: Semantic Chunking │
-└────────────────────────────────┘
- - Check if single chunk sufficient
- - Apply semantic boundary detection
- - Enforce token limits (1000 max)
- - Add 20% overlap between chunks
- - Fallback to character-based if needed
- ↓
-┌────────────────────────────────┐
-│ PHASE 4: Chunk Validation │
-└────────────────────────────────┘
- - Verify token counts
- - Detect garbled text
- - Check minimum length
- - Attach metadata to each chunk
- ↓
-┌────────────────────────────────┐
-│ PHASE 5: Batch Embedding │
-└────────────────────────────────┘
- - Generate embeddings in batches
- - Use OpenAI text-embedding-3-large
- - 3072-dimensional vectors
- - Rate limit handling
- ↓
-┌────────────────────────────────┐
-│ PHASE 6: Vector Storage │
-└────────────────────────────────┘
- - Store in Qdrant collection
- - Upsert points with metadata
- - Create indexes if needed
- - Track performance metrics
- ↓
-Response: Success with chunk details
+# Print the most relevant chunk
+if response.chunks:
+ print(response.chunks[0].content)
```
-**Response Data Includes:**
-
-- Success status and document count
-- Total chunks created with individual IDs
-- Complete metadata for each chunk (document_id, source, chunk_index, token counts, etc.)
-- Processing statistics (timings, token usage, failures)
-- Any errors encountered during processing
-
-**Error Handling:**
-
-- **PDFEncryptedError**: Password-protected PDFs detected
-- **PDFCorruptedError**: Invalid or damaged PDF files
-- **PDFEmptyError**: No extractable text content
-- **ChunkingError**: Semantic chunking failures
-- **EmbeddingError**: API failures during embedding generation
-- **VectorDBError**: Qdrant connection or storage issues
-
-**Use Cases:**
-
-1. **Business Document Upload**: User-specific PDFs with user_id metadata
-1. **Website Content Storage**: Scraped text with source URL tracking
-1. **Knowledge Base Creation**: Template-specific documents with template_id
-
-______________________________________________________________________
-
-### 2.3 Knowledge Base Update Operation
-
-**Function**: `update_documents()`
-
-**Purpose:** Modify, replace, or delete existing documents in the knowledge base.
-
-**Input Parameters:**
-
-1. **collection_name**: Target Qdrant collection
-
-1. **update_strategy**: Operation type
-
- - **replace**: Delete existing documents and add new ones
- - **append**: Add new documents without deleting
- - **delete**: Remove documents matching criteria
- - **upsert**: Update if exists, insert if doesn't
-
-1. **filters**: Metadata-based selection criteria
-
- - Filter by user_id, document_type, template_id, etc.
- - Supports complex filter combinations
-
-1. **document_ids**: Specific document IDs to target
-
- - Alternative to metadata filters
- - Precise document selection
-
-1. **new_documents**: Replacement or additional documents
-
- - Used with replace, append, and upsert strategies
-
-1. **metadata_updates**: Metadata field updates
-
- - Update metadata without reprocessing content
- - Useful for status changes, tags, timestamps
-
-1. **reprocess_chunks**: Content reprocessing toggle
-
- - If True: Regenerate chunks and embeddings
- - If False: Metadata-only updates
-
-**Update Strategies Explained:**
-
-**1. Replace Strategy**
-
-```text
-Identify Target Documents (via filters/IDs)
- ↓
-Delete Existing Chunks from Qdrant
- ↓
-Process New Documents
- ↓
-Create New Chunks with Semantic Chunking
- ↓
-Generate New Embeddings
- ↓
-Store New Chunks in Qdrant
- ↓
-Return: Deleted count + Added count
-```
-
-**2. Append Strategy**
-
-```text
-Keep All Existing Documents
- ↓
-Process New Documents
- ↓
-Create New Chunks
- ↓
-Generate Embeddings
- ↓
-Add to Collection (No Deletion)
- ↓
-Return: Added count
-```
-
-**3. Delete Strategy**
-
-```text
-Identify Target Documents (via filters/IDs)
- ↓
-Delete All Matching Chunks from Qdrant
- ↓
-Clean Up References
- ↓
-Return: Deleted count
-```
-
-**4. Upsert Strategy**
-
-
-
-```mermaid
-graph TD
- A{Check if Documents Exist}
- A -- Exists --> B[Replace];
- A -- "Doesn't Exist" --> C[Insert];
- B --> D["Return: Updated + Inserted counts"];
- C --> D;
-```
-
-**Response Data Includes:**
-
-- Success status
-- Strategy used
-- Documents affected count
-- Chunks deleted, added, and updated counts
-- List of updated document IDs
-- Any errors encountered
-
-**Error Handling:**
-
-- **CollectionNotFoundError**: Target collection doesn't exist
-- **NoDocumentsFoundError**: No documents match filters/IDs
-- **VectorDBError**: Qdrant operation failures
-
-**Use Cases:**
-
-1. **Document Replacement**: User uploads updated version of existing document
-1. **Metadata Updates**: Mark documents as archived or add tags
-1. **Bulk Deletion**: Remove all documents for a specific user or template
-1. **Incremental Additions**: Add new documents to existing knowledge base
-
-______________________________________________________________________
-
-### 2.4 Knowledge Base Retrieval Operation
-
-**Function**: `retrieve()`
-
-**Purpose:** Find and return most relevant document chunks using hybrid search with reranking.
-
-**Input Parameters:**
-
-1. **query**: User's search question or query string
-
-1. **collection_name**: Target Qdrant collection to search
-
-1. **filters**: Metadata filters to narrow search scope
-
- - Filter by user_id, template_id, document_type, etc.
- - Ensures user isolation and template-specific retrieval
-
-1. **top_k**: Final number of chunks to return (default: 20)
-
-1. **enable_reranking**: Use Cohere reranking (default: True)
-
- - Improves relevance ranking significantly
- - Slight latency increase
-
-1. **enable_keyword_search**: Include BM25 search (default: True)
-
- - Adds lexical matching to semantic search
- - Better for exact term matches
-
-1. **enable_hyde**: Use HyDE query generation (default: True)
-
- - Generates hypothetical answer for better retrieval
- - Research-proven improvement
-
-1. **score_threshold**: Minimum relevance score filter
-
- - Optional quality gate for results
-
-1. **return_full_chunks**: Return complete vs truncated content
-
-1. **deduplicate**: Remove duplicate chunks (default: True)
-
-**Retrieval Pipeline Flow:**
-
-```text
-User Query
- ↓
-┌─────────────────────────────────────────────┐
-│ STEP 1: Query Generation │
-└─────────────────────────────────────────────┘
- - LLM generates optimized search query
- - LLM generates HyDE (hypothetical answer)
- - Single API call using structured output
- - Result: 2 queries (standard + HyDE)
- ↓
-┌─────────────────────────────────────────────┐
-│ STEP 2: Vector Search │
-└─────────────────────────────────────────────┘
- - Embed both queries using OpenAI
- - Search Qdrant with each query
- - 25 chunks per query = 50 chunks total
- - Apply metadata filters
- - Track vector similarity scores
- ↓
-┌─────────────────────────────────────────────┐
-│ STEP 3: Keyword Search (BM25) │
-└─────────────────────────────────────────────┘
- - Tokenize original query
- - BM25 algorithm on document corpus
- - Retrieve 50 additional chunks
- - Lexical matching for exact terms
- - Apply same metadata filters
- ↓
-┌─────────────────────────────────────────────┐
-│ STEP 4: Combine & Deduplicate │
-└─────────────────────────────────────────────┘
- - Pool: 50 vector + 50 keyword = ~100 chunks
- - Remove duplicates by document_id or content hash
- - Result: ~100 unique chunks
- ↓
-┌─────────────────────────────────────────────┐
-│ STEP 5: Reranking │
-└─────────────────────────────────────────────┘
- - Send all unique chunks to Cohere Rerank 3.5
- - Cross-encoder scoring for relevance
- - Considers query-chunk semantic relationship
- - Produces 0-1 relevance scores
- ↓
-
-┌─────────────────────────────────────────────┐
-│ STEP 6: Selection & Formatting │
-└─────────────────────────────────────────────┘
- - Sort by reranker scores (highest first)
- - Select top_k chunks (default: 20)
- - Apply score_threshold if specified
- - Return full chunks (no truncation)
- - Include all metadata and scores
- ↓
-Response: Top-k relevant chunks with scores
-```
-
-**Response Data Includes:**
-
-1. **Success status and original query**
-
-1. **Generated queries**: Standard and HyDE queries used
-
-1. **Retrieved chunks** (for each chunk):
-
- - Full content text
- - Complete metadata (source, document_id, chunk_index, etc.)
- - Relevance score (from reranker, 0-1)
- - Vector similarity score
- - BM25 keyword score (if applicable)
- - Rank position (1 to top_k)
-
-1. **Retrieval statistics**:
-
- - Total chunks retrieved (before dedup/reranking)
- - Vector search chunk count
- - Keyword search chunk count
- - Chunks after deduplication
- - Chunks after reranking
- - Timing breakdown:
- - Query generation time
- - Vector search time
- - Keyword search time
- - Reranking time
- - Total time
-
-1. **Source information**:
-
- - List of source documents
- - Chunk count per source
- - Average relevance per source
-
-**Error Handling:**
-
-- **CollectionNotFoundError**: Collection doesn't exist
-- **QueryGenerationError**: LLM query generation failure
-- **EmbeddingError**: Query embedding failure
-- **RerankingError**: Cohere API issues
-- **VectorDBError**: Qdrant search failures
-
-**Retrieval Modes:**
-
-**1. Full Hybrid (Default - Best Quality)**
-
-- HyDE enabled + Vector search + Keyword search + Reranking
-- Retrieves ~100 chunks, reranks to top 20
-- Best accuracy, slightly higher latency
-
-**2. Hybrid Without HyDE**
-
-- Standard vector + Keyword search + Reranking
-- Faster query generation, still excellent results
-
-**3. Vector Only with Reranking**
-
-- Pure semantic search + Reranking
-- Good for conceptual queries, misses exact terms
-
-**4. Fast Vector Search**
-
-- Vector search only, no reranking, no keyword
-- Fastest retrieval, lower accuracy
-- Good for preview/suggestion use cases
-
-**Use Cases:**
-
-1. **Document Generation Context**: Retrieve business documents for AI writing
-1. **Question Answering**: Find specific information from knowledge base
-1. **Template-Specific Retrieval**: Get template-associated knowledge
-1. **User-Specific Search**: Find documents belonging to specific user
-
-______________________________________________________________________
-
-## 3. Data Models
-
-### 3.1 Core Data Structures
-
-#### DocumentInput
-
-- Represents an input document for processing
-- Fields:
- - `source`: File path, text string, or binary content
- - `source_type`: "file", "text", or "binary"
- - `metadata`: Optional document-specific metadata
- - `custom_chunking`: Optional chunking override settings
-
-#### Chunk
-
-- Represents a processed document chunk
-- Fields:
- - `chunk_id`: Unique internal identifier
- - `vector_id`: Qdrant point ID
- - `content`: Chunk text content
- - `metadata`: ChunkMetadata object
- - `embedding_dimensions`: Vector dimension count
-
-#### ChunkMetadata
-
-- Complete metadata for a chunk
-- Fields:
- - `document_id`: Parent document identifier
- - `source`: Original source file/URL
- - `chunk_index`: Position in document
- - `total_chunks`: Total chunks in document
- - `token_count`: Number of tokens
- - `char_count`: Character count
- - `chunking_method`: Method used (e.g., "semantic")
- - `extraction_date`: Timestamp
- - `custom_fields`: Dictionary of additional metadata
-
-### 3.2 Response Models
-
-#### AddDocumentsResponse
-
-- Result from adding documents
-- Fields:
- - `success`: Boolean status
- - `documents_processed`: Count of documents
- - `total_chunks`: Total chunks created
- - `chunks`: List of Chunk objects
- - `processing_stats`: ProcessingStats object
- - `errors`: List of error messages
-
-#### ProcessingStats
-
-- Performance metrics for document processing
-- Fields:
- - `total_tokens`: Total tokens processed
- - `embedding_time_ms`: Time for embedding generation
- - `chunking_time_ms`: Time for chunking
- - `upload_time_ms`: Time for Qdrant upload
- - `failed_chunks`: Count of failed chunks
-
-#### UpdateDocumentsResponse
-
-- Result from update operations
-- Fields:
- - `success`: Boolean status
- - `strategy_used`: Update strategy applied
- - `documents_affected`: Count of affected documents
- - `chunks_deleted`: Chunks removed
- - `chunks_added`: Chunks added
- - `chunks_updated`: Chunks modified
- - `updated_document_ids`: List of affected IDs
- - `errors`: Error list
-
-#### RetrievalResponse
-
-- Result from retrieval operations
-- Fields:
- - `success`: Boolean status
- - `query_original`: Original query string
- - `queries_generated`: Dict with standard and HyDE queries
- - `chunks`: List of RetrievedChunk objects
- - `retrieval_stats`: RetrievalStats object
- - `sources`: List of SourceInfo objects
-
-#### RetrievedChunk
-
-- A chunk returned from retrieval
-- Fields:
- - `content`: Chunk text
- - `metadata`: ChunkMetadata
- - `relevance_score`: Reranker score (0-1)
- - `vector_score`: Cosine similarity score
- - `keyword_score`: BM25 score (optional)
- - `rank`: Position in results
-
-#### RetrievalStats
-
-- Performance and count metrics for retrieval
-- Fields:
- - `total_chunks_retrieved`: Initial retrieval count
- - `vector_search_chunks`: From vector search
- - `keyword_search_chunks`: From BM25 search
- - `chunks_after_dedup`: After deduplication
- - `chunks_after_reranking`: Final count
- - Timing fields for each stage
-
-#### SourceInfo
-
-- Aggregated information per source document
-- Fields:
- - `source`: Source document name
- - `chunks_count`: Chunks from this source
- - `avg_relevance`: Average relevance score
-
-______________________________________________________________________
-
-## 4. Implementation Best Practices
-
-### 4.1 Semantic Chunking Strategy
-
-**Philosophy:** Semantic chunking splits documents at natural topic boundaries rather than arbitrary character counts, preserving contextual coherence.
-
-**Implementation Approach:**
-
-#### Step 1: Single Chunk Optimization
-
-- Check if entire document ≤ max_chunk_size (1000 tokens)
-- If yes, return as single chunk (no splitting)
-- Reduces unnecessary overhead for short documents
-
-#### Step 2: Semantic Boundary Detection
-
-- Generate embeddings for sentences or paragraphs
-- Calculate similarity between adjacent segments
-- Identify low-similarity points (topic transitions)
-- Use percentile-based threshold (95th percentile default)
-- Split at these natural boundaries
-
-#### Step 3: Token Limit Enforcement
-
-- If any semantic chunk > 1000 tokens
-- Split oversized chunks using RecursiveCharacterTextSplitter
-- Maintains strict token limits for embedding model
-- Preserves semantic boundaries where possible
-
-#### Step 4: Overlap Addition
-
-- Add 20% overlap between adjacent chunks
-- Preserves context at boundaries
-- Helps with retrieval accuracy
-- Ensures no information loss at splits
-
-#### Step 5: Fallback Strategy
-
-- If semantic chunking fails (embedding errors, etc.)
-- Use RecursiveCharacterTextSplitter
-- Chunk size: 1000 characters
-- Overlap: 200 characters (20%)
-- Separators: ["\\n\\n", "\\n", ". ", " ", ""]
-
-**Benefits:**
-
-- Better context preservation
-- Improved retrieval accuracy
-- Natural information boundaries
-- Flexible chunk sizes based on content
-
-**Metadata Tracking:**
-
-- Record chunking method used
-- Store token and character counts
-- Track chunk position in document
-- Enable analysis and optimization
-
-______________________________________________________________________
-
-### 4.2 Hybrid Retrieval Pipeline
-
-**Philosophy:** Combine multiple retrieval strategies to maximize both semantic understanding and exact term matching, then use reranking to select the best results.
-
-**Component Breakdown:**
-
-#### 1. Query Generation with HyDE
-
-**Standard Query:**
-
-- Direct optimization of user's query
-- Remove stop words, normalize terms
-- Expand abbreviations if needed
-
-**HyDE Query:**
-
-- LLM generates hypothetical answer to query
-- Embed the answer instead of the question
-- Research shows 30%+ improvement in retrieval
-- Works because answers are semantically similar to actual answers
-
-**Single LLM Call:**
-
-- Use structured output (JSON mode)
-- Request both queries in one call
-- Reduces latency and API costs
-- Ensures consistency
-
-#### 2. Vector Search
-
-**Dual Query Search:**
-
-- Search with standard query → 25 chunks
-- Search with HyDE query → 25 chunks
-- Total: 50 chunks from vector search
-
-**Cosine Similarity:**
-
-- Distance metric for semantic similarity
-- Range: -1 to 1 (typically 0.5 to 1 for relevant)
-- Fast computation on Qdrant
-
-**Metadata Filtering:**
-
-- Apply before search (not post-filter)
-- Ensures search efficiency
-- User isolation, template filtering
-
-#### 3. Keyword Search (BM25)
-
-**BM25 Algorithm:**
-
-- Best practice for lexical search
-- Term frequency / Inverse document frequency
-- Handles exact term matches
-- Complements semantic search
-
-**Retrieval Count:**
-
-- Get 50 chunks via BM25
-- Same metadata filters applied
-- Catches terms missed by embeddings
-- Essential for names, codes, IDs
-
-#### 4. Deduplication
-
-**Why Needed:**
-
-- Vector and keyword search overlap
-- Same chunk may score well in both
-- Reduces reranking cost
-
-**Method:**
-
-- Hash-based deduplication on content
-- Or document_id + chunk_index
-- Keep highest score variant
-- Result: ~100 unique chunks
-
-#### 5. Reranking
-
-**Cohere Rerank 3.5:**
-
-- Cross-encoder model
-- Scores query-chunk relevance
-- More accurate than embedding similarity
-- 0-1 relevance scores
-
-**Why It Works:**
-
-- Considers full query-chunk interaction
-- Not limited by embedding dimensions
-- Trained specifically for relevance ranking
-- Research-proven improvement
-
-**Process:**
-
-- Send all ~100 chunks to Cohere
-- API returns relevance scores
-- Sort by score
-- Select top_k (default: 20)
-
-#### 6. Final Selection
-
-**No Truncation:**
-
-- Return full chunks (not truncated)
-- Preserves complete context
-- Critical for generation quality
-
-**Score Preservation:**
-
-- Include all scores (vector, keyword, reranker)
-- Enables debugging and analysis
-- Supports confidence thresholds
-
-**Comprehensive Stats:**
-
-- Track timing for each stage
-- Count chunks at each step
-- Identify bottlenecks
-- Optimize future queries
-
-______________________________________________________________________
-
-### 4.3 Metadata Management
-
-**Philosophy:** Rich metadata enables precise filtering, performance analysis, and system observability.
-
-**Document-Level Metadata:**
-
-- `user_id`: User isolation and ownership
-- `document_type`: Categorization (business_document, knowledge_base)
-- `is_standalone`: Lifecycle management (profile vs standalone)
-- `template_id`: Template association for filtering
-- `source_type`: Origin indicator (PDF, Website, Text)
-
-**Chunk-Level Metadata:**
-
-- `document_id`: Parent document reference
-- `chunk_index`: Position in document
-- `total_chunks`: Document size context
-- `token_count`: Size tracking
-- `char_count`: Alternative size metric
-- `chunking_method`: Processing history
-- `extraction_date`: Timestamp for freshness
-
-**Custom Metadata:**
-
-- Extensible dictionary for application-specific fields
-- Examples: `document_name`, `website_url`, `status`, `tags`
-- Enables flexible filtering and organization
-- No schema restrictions
-
-**Metadata Usage:**
-
-**Filtering:**
-
-- User isolation: `{"user_id": "user_123"}`
-- Template-specific: `{"template_id": "template_456"}`
-- Document type: `{"document_type": "business_document"}`
-- Combined filters: `{"user_id": "user_123", "is_standalone": true}`
-
-**Analytics:**
-
-- Track chunking performance by method
-- Monitor token usage per user
-- Analyze source distribution
-- Identify optimal chunk sizes
-
-**Lifecycle Management:**
-
-- Cascade deletes based on `is_standalone`
-- Archive documents by status
-- Update timestamps for freshness
-- Version control via custom fields
-
-______________________________________________________________________
-
-### 4.4 Error Handling Strategy
-
-**Philosophy:** Fail fast with clear error messages, but degrade gracefully where possible.
-
-**Error Categories:**
-
-**1. Input Validation Errors (Fail Fast)**
-
-- Invalid API keys → Raise at initialization
-- Missing required parameters → Raise before processing
-- Invalid file formats → Raise with specific error type
-- Prevents wasted processing
-
-**2. Processing Errors (Retry with Fallback)**
-
-- PDF extraction failure → Try alternative parser
-- Semantic chunking failure → Fallback to character-based
-- Embedding API rate limit → Exponential backoff retry
-- Partial success where possible
-
-**3. Storage Errors (Transactional)**
-
-- Qdrant connection failure → Roll back operation
-- Partial upload failure → Mark failed chunks
-- Network timeout → Retry with backoff
-- Ensure consistency
-
-**4. Retrieval Errors (Graceful Degradation)**
-
-- Reranking failure → Return vector-sorted results
-- Keyword search failure → Vector-only results
-- HyDE generation failure → Standard query only
-- Never return empty results if any method succeeds
-
-**Error Response Structure:**
-
-- Clear error type classification
-- Descriptive message for debugging
-- Context (document ID, chunk index, etc.)
-- Actionable guidance for resolution
-
-______________________________________________________________________
-
-## 5. Extension Points
-
-### 5.1 Adding New Chunking Methods
-
-**Interface:** All chunkers implement `BaseChunker` interface
-
-**Required Methods:**
-
-- `chunk(text: str) -> List[Chunk]`
-- `validate_config(config: Dict) -> bool`
-
-**Future Additions:**
-
-- Recursive character chunking
-- Fixed-size chunking
-- Markdown-aware chunking
-- Code-specific chunking
-
-**Configuration:**
-
-- Add new `chunking_method` option
-- Method-specific parameters in config
-- Backward compatible defaults
-
-______________________________________________________________________
-
-### 5.2 Adding New Embedding Providers
-
-**Interface:** All embedders implement `BaseEmbedder` interface
-
-**Required Methods:**
-
-- `embed(texts: List[str]) -> List[Vector]`
-- `get_dimensions() -> int`
-
-**Future Additions:**
-
-- Cohere embeddings
-- Azure OpenAI embeddings
-- Anthropic embeddings
-- Local embedding models
-
-**Configuration:**
-
-- Add `embedding_provider` option
-- Provider-specific credentials
-- Model selection per provider
-
-______________________________________________________________________
-
-### 5.3 Adding New Vector Databases
-
-**Interface:** All vector DBs implement `BaseVectorDB` interface
-
-**Required Methods:**
-
-- `create_collection(name: str, dimensions: int)`
-- `upsert(collection: str, points: List[Point])`
-- `search(collection: str, query_vector: Vector, filters: Dict) -> List[Result]`
-- `delete(collection: str, filters: Dict)`
-
-**Future Additions:**
-
-- Pinecone
-- Weaviate
-- Milvus
-- ChromaDB
-
-**Configuration:**
-
-- Add `vector_db_provider` option
-- Provider-specific connection details
-- Migrate existing data between providers
-
-______________________________________________________________________
-
-### 5.4 Adding New Rerankers
-
-**Interface:** All rerankers implement `BaseReranker` interface
-
-**Required Methods:**
-
-- `rerank(query: str, chunks: List[Chunk], top_k: int) -> List[ScoredChunk]`
-
-**Future Additions:**
-
-- Cross-encoder models (local)
-- Anthropic Claude reranking
-- Custom fine-tuned models
-
-**Configuration:**
-
-- Add `reranker_provider` option
-- Model selection
-- Custom model loading paths
-
-______________________________________________________________________
-
-## 6. Performance Considerations
-
-**Query Generation:**
-
-- Single LLM call for standard + HyDE queries
-- Structured output for parsing efficiency
-- Cache common query patterns
-
-**Vector Search:**
+## Documentation
-- Parallel searches for multiple queries
-- Efficient metadata filtering in Qdrant
-- Index optimization (HNSW algorithm)
+For detailed guides on installation, configuration, and advanced features, please see the **[Full Documentation](./docs/README.md)**.
-**Keyword Search:**
+Key sections include:
-- Pre-computed BM25 indexes
-- Incremental index updates
-- Cache for frequent queries
+- **[Installation Guide](./docs/installation.md)**
+- **[Quickstart Guide](./docs/quickstart.md)**
+- **Guides**
+ - [Document Management](./docs/guides/document-management.md)
+ - [Advanced Retrieval](./docs/guides/retrieval.md)
+ - [Storage Backends](./docs/guides/storage-backends.md)
-**Reranking:**
+## License
-- Batch API calls where possible
-- Limit candidate set size (~100 chunks)
-- Parallel processing for multiple queries
+This project is licensed under the [MIT License](./LICENSE).
diff --git a/docs/DEBUG_RETRIEVAL_ISSUES.md b/docs/DEBUG_RETRIEVAL_ISSUES.md
deleted file mode 100644
index 10cc09a..0000000
--- a/docs/DEBUG_RETRIEVAL_ISSUES.md
+++ /dev/null
@@ -1,314 +0,0 @@
-# Debugging Retrieval Issues
-
-## 🔍 Issues You're Experiencing
-
-Based on your API response:
-
-```json
-{
- "chunks": [], // ❌ No chunks returned!
- "chunks_after_reranking": 0, // ❌ All chunks filtered out!
- "keyword_search_chunks": 0, // ❌ BM25 not working
- "total_chunks_retrieved": 16, // ✅ Got 16 chunks from vector search
- "chunks_after_dedup": 16 // ✅ Deduplication worked
-}
-```
-
-## 🐛 Root Causes
-
-### Issue 1: No Chunks Returned (`chunks_after_reranking: 0`)
-
-**Possible causes:**
-
-1. **High score_threshold** - You might have set `score_threshold` too high
-
- - Vector similarity scores are typically 0.0-1.0
- - Setting threshold > 0.5 might filter everything out
-
-1. **top_k=0** - Check if `top_k` was accidentally set to 0
-
-1. **MongoDB fetch failing silently** - Content not being retrieved from MongoDB
-
-### Issue 2: BM25 Not Working (`keyword_search_chunks: 0`)
-
-**Root cause:** Content is stored in MongoDB, not in Qdrant payload.
-
-BM25 keyword search requires **content in Qdrant** to build a searchable index. When content is stored in MongoDB:
-
-- Qdrant only has: `vectors + metadata + mongodb_id`
-- Qdrant does NOT have: `content` field
-- BM25 cannot build index → skips keyword search
-
-## ✅ Solutions
-
-### Solution 1: Fix No Chunks Issue
-
-**Test without score_threshold:**
-
-```bash
-curl -X POST "http://localhost:8000/api/v1/retrieve" \
- -H "Content-Type: application/json" \
- -d '{
- "query": "semantic chunking",
- "collection_name": "insta_rag_test_collection",
- "top_k": 10,
- "enable_hyde": true,
- "enable_keyword_search": true,
- "score_threshold": null,
- "return_full_chunks": true,
- "deduplicate": true
- }'
-```
-
-**Or test with low threshold:**
-
-```bash
-curl -X POST "http://localhost:8000/api/v1/retrieve" \
- -H "Content-Type: application/json" \
- -d '{
- "query": "semantic chunking",
- "collection_name": "insta_rag_test_collection",
- "top_k": 10,
- "score_threshold": 0.1
- }'
-```
-
-### Solution 2: Enable BM25 (Fix Content Storage)
-
-**Option A: Store content in Qdrant (Recommended for BM25)**
-
-Modify your document upload to **NOT** use MongoDB content storage:
-
-```python
-from insta_rag import RAGClient, RAGConfig
-
-# Create config WITHOUT MongoDB
-config = RAGConfig.from_env()
-config.mongodb = None # Disable MongoDB content storage
-
-client = RAGClient(config)
-
-# Upload documents - content will be stored in Qdrant
-response = client.add_documents(documents=[...], collection_name="your_collection")
-
-# Now BM25 will work!
-response = client.retrieve(
- query="your query",
- collection_name="your_collection",
- enable_keyword_search=True, # BM25 will now work
-)
-```
-
-**Option B: Enhance BM25 to fetch from MongoDB (Future Enhancement)**
-
-This would require modifying `src/insta_rag/retrieval/keyword_search.py` to fetch content from MongoDB during corpus building.
-
-**Option C: Use Vector Search Only (Current Workaround)**
-
-```python
-response = client.retrieve(
- query="your query",
- collection_name="your_collection",
- enable_hyde=True, # Keep HyDE
- enable_keyword_search=False, # Disable BM25
-)
-```
-
-## 🧪 Testing Steps
-
-### Step 1: Run Debug Test Script
-
-```bash
-# Test the API endpoint
-python test_api_retrieve.py
-```
-
-This will show detailed diagnostics.
-
-### Step 2: Check API Server Logs
-
-When you call the `/api/v1/retrieve` endpoint, the server now prints detailed logs:
-
-```
-✓ MongoDB connected: Test_Insta_RAG
-Warning: HyDE generation failed: Error code: 404 - ...
- BM25 corpus built: 0 documents indexed
- Warning: BM25 index not available, skipping keyword search
- ✓ Fetched content for 16 chunks from MongoDB
- Step 6: Selecting top-10 chunks from 16 ranked chunks
- After top-k selection: 10 chunks
- After score threshold (0.9): 0 chunks (filtered out: 10) ← HERE'S THE PROBLEM!
- ✓ Final chunks to return: 0
-```
-
-Look for:
-
-- MongoDB fetch count
-- BM25 corpus size
-- Score threshold filtering
-
-### Step 3: Test with Simple Python Script
-
-```python
-from insta_rag import RAGClient, RAGConfig
-
-config = RAGConfig.from_env()
-client = RAGClient(config)
-
-# Test with NO threshold
-response = client.retrieve(
- query="semantic chunking",
- collection_name="insta_rag_test_collection",
- top_k=10,
- score_threshold=None, # No filtering
- enable_hyde=False, # Disable for faster test
- enable_keyword_search=False,
-)
-
-print(f"Success: {response.success}")
-print(f"Chunks returned: {len(response.chunks)}")
-print(f"Stats: {response.retrieval_stats.to_dict()}")
-
-if response.chunks:
- chunk = response.chunks[0]
- print(f"\nFirst chunk score: {chunk.relevance_score}")
- print(f"Content preview: {chunk.content[:200]}")
-else:
- print("\n❌ No chunks returned!")
-```
-
-## 📊 Understanding Typical Scores
-
-Vector similarity scores (COSINE distance):
-
-| Score Range | Meaning |
-|------------|---------|
-| 0.9 - 1.0 | Extremely similar (exact or near-exact match) |
-| 0.7 - 0.9 | Very similar (good match) |
-| 0.5 - 0.7 | Moderately similar (may be relevant) |
-| 0.3 - 0.5 | Somewhat similar (loosely related) |
-| 0.0 - 0.3 | Not very similar (likely not relevant) |
-
-**Typical scores you'll see:** 0.15 - 0.4 for normal searches
-
-**Setting `score_threshold=0.7` will likely filter out ALL results!**
-
-## ✅ Recommended Settings
-
-### For Best Results:
-
-```python
-response = client.retrieve(
- query="your question",
- collection_name="your_collection",
- top_k=20, # Good default
- enable_hyde=True, # Better retrieval quality
- enable_keyword_search=False, # Disable (won't work with MongoDB storage)
- score_threshold=None, # Let top-k handle selection
- return_full_chunks=True, # Get full content
- deduplicate=True, # Remove duplicates
-)
-```
-
-### For Faster Queries:
-
-```python
-response = client.retrieve(
- query="your question",
- collection_name="your_collection",
- top_k=10, # Fewer results
- enable_hyde=False, # Skip HyDE (saves ~1.3s)
- enable_keyword_search=False, # Skip BM25
- score_threshold=0.001, # Basic quality filter
- return_full_chunks=False, # Truncated content
- deduplicate=True,
-)
-```
-
-## 🔧 Quick Fixes
-
-### Fix 1: Remove score_threshold
-
-```bash
-# DON'T DO THIS:
-curl ... -d '{"score_threshold": 0.9}' # ❌ Too high!
-
-# DO THIS:
-curl ... -d '{"score_threshold": null}' # ✅ No filtering
-```
-
-### Fix 2: Use Reasonable Threshold
-
-```bash
-curl ... -d '{"score_threshold": 0.15}' # ✅ Reasonable
-```
-
-### Fix 3: Increase top_k
-
-```bash
-curl ... -d '{"top_k": 20}' # ✅ More results
-```
-
-## 📖 API Documentation
-
-Visit your Swagger UI for complete API documentation:
-
-```
-http://localhost:8000/docs
-```
-
-Look for the `/api/v1/retrieve` endpoint with:
-
-- Parameter descriptions
-- Default values
-- Example requests
-- Example responses
-
-## 🆘 Still Having Issues?
-
-Run this comprehensive diagnostic:
-
-```python
-from insta_rag import RAGClient, RAGConfig
-
-config = RAGConfig.from_env()
-client = RAGClient(config)
-
-print("=" * 80)
-print("DIAGNOSTIC TEST")
-print("=" * 80)
-
-# Test 1: Simple search (no Phase 2 features)
-print("\n1. Testing basic search...")
-response1 = client.search(
- query="test", collection_name="insta_rag_test_collection", top_k=5
-)
-print(f" Chunks: {len(response1.chunks)}")
-print(f" Success: {response1.success}")
-
-# Test 2: Retrieve with no filters
-print("\n2. Testing retrieve (no threshold)...")
-response2 = client.retrieve(
- query="test",
- collection_name="insta_rag_test_collection",
- top_k=5,
- score_threshold=None,
- enable_hyde=False,
- enable_keyword_search=False,
-)
-print(f" Chunks: {len(response2.chunks)}")
-print(f" Success: {response2.success}")
-
-# Test 3: Check collection
-print("\n3. Checking collection...")
-try:
- info = client.get_collection_info("insta_rag_test_collection")
- print(f" Vectors: {info['vectors_count']}")
- print(f" Status: {info['status']}")
-except Exception as e:
- print(f" Error: {e}")
-
-print("\n" + "=" * 80)
-```
-
-This will help identify the exact issue!
diff --git a/docs/DOCUMENT_UPLOAD_FLOW.md b/docs/DOCUMENT_UPLOAD_FLOW.md
deleted file mode 100644
index 01021cb..0000000
--- a/docs/DOCUMENT_UPLOAD_FLOW.md
+++ /dev/null
@@ -1,642 +0,0 @@
-# Document Upload Flow to Qdrant - Complete Analysis
-
-## Overview
-
-The insta_rag library implements a sophisticated 6-phase pipeline to process documents and store them in Qdrant vector database. Here's how it works:
-
-______________________________________________________________________
-
-## 📊 High-Level Architecture
-
-```
-DocumentInput → Text Extraction → Semantic Chunking → Embedding → Vector Storage (Qdrant)
- ↓
- Content Storage (MongoDB - Optional)
-```
-
-______________________________________________________________________
-
-## 🔄 Complete Pipeline Flow
-
-### Entry Point: `RAGClient.add_documents()`
-
-**Location:** `src/insta_rag/core/client.py:83-283`
-
-```python
-client.add_documents(
- documents=[DocumentInput.from_file("document.pdf")],
- collection_name="my_documents",
- metadata={"category": "research"},
- batch_size=100,
-)
-```
-
-______________________________________________________________________
-
-## Phase-by-Phase Breakdown
-
-### **PHASE 1: Document Loading**
-
-**Handler:** `_load_and_extract_document()` (client.py:285-336)
-
-**What happens:**
-
-1. Generate unique `document_id` using UUID
-1. Merge global metadata with document-specific metadata
-1. Determine source type (FILE, TEXT, or BINARY)
-
-**Input:**
-
-```python
-DocumentInput(
- source=Path("example.pdf"), source_type=SourceType.FILE, metadata={"author": "John"}
-)
-```
-
-**Output:**
-
-```python
-document_id = "123e4567-e89b-12d3-a456-426614174000"
-doc_metadata = {
- "document_id": "123e4567...",
- "source": "/path/to/example.pdf",
- "author": "John",
-}
-```
-
-______________________________________________________________________
-
-### **PHASE 2: Text Extraction**
-
-**Handler:** `extract_text_from_pdf()` (pdf_processing.py:9-48)
-
-**What happens:**
-
-1. Validate PDF file exists
-1. Try primary parser (pdfplumber by default)
-1. If primary fails, fallback to PyPDF2
-1. Handle encryption and corruption errors
-1. Extract text page-by-page
-1. Join pages with double newlines
-
-**Code Flow:**
-
-```python
-# pdf_processing.py:51-75
-with pdfplumber.open(pdf_path) as pdf:
- # Check encryption
- if pdf.metadata.get("Encrypt"):
- raise PDFEncryptedError()
-
- # Extract page by page
- text_parts = []
- for page in pdf.pages:
- page_text = page.extract_text()
- if page_text:
- text_parts.append(page_text)
-
- return "\n\n".join(text_parts)
-```
-
-**Output:**
-
-```
-"This is the full text extracted from the PDF.
-It contains all pages joined together with
-double newlines between pages..."
-```
-
-______________________________________________________________________
-
-### **PHASE 3: Semantic Chunking**
-
-**Handler:** `SemanticChunker.chunk()` (chunking/semantic.py:56-93)
-
-**What happens:**
-
-1. Count total tokens in document
-1. If ≤ max_chunk_size (1000 tokens), return as single chunk
-1. Otherwise, perform semantic chunking:
- - Split text into sentences
- - Generate embeddings for each sentence
- - Calculate cosine similarity between consecutive sentences
- - Find breakpoints (low similarity = topic change)
- - Split at breakpoints
- - Enforce token limits
- - Add overlap between chunks (20% by default)
-
-**Detailed Semantic Chunking Process:**
-
-```python
-# Step 1: Split into sentences
-sentences = split_into_sentences(text)
-# Result: ["First sentence.", "Second sentence.", ...]
-
-# Step 2: Embed sentences
-embeddings = self.embedder.embed(sentences)
-# Result: [[0.1, 0.2, ...], [0.15, 0.19, ...], ...]
-
-# Step 3: Calculate similarities
-similarities = []
-for i in range(len(embeddings) - 1):
- vec1 = np.array(embeddings[i])
- vec2 = np.array(embeddings[i + 1])
- similarity = np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
- similarities.append(similarity)
-# Result: [0.95, 0.92, 0.45, 0.88, ...] # 0.45 = topic change!
-
-# Step 4: Find breakpoints using percentile threshold (95th percentile)
-threshold = np.percentile(similarities, 100 - 95) # = 5th percentile (low values)
-breakpoints = [i + 1 for i, sim in enumerate(similarities) if sim < threshold]
-# Result: [3, 7, 12] # Split at these indices
-
-# Step 5: Split at breakpoints
-chunks = []
-start = 0
-for bp in breakpoints:
- chunk = " ".join(sentences[start:bp])
- chunks.append(chunk)
- start = bp
-# Result: ["Chunk 1 text...", "Chunk 2 text...", "Chunk 3 text..."]
-
-# Step 6: Enforce token limits and add overlap
-chunks = self._enforce_token_limits(chunks)
-chunks = add_overlap_to_chunks(chunks, overlap_percentage=0.2)
-```
-
-**Chunk Object Creation:**
-
-```python
-# chunking/semantic.py:231-276
-for idx, text in enumerate(text_chunks):
- metadata = ChunkMetadata(
- document_id=document_id,
- source="/path/to/file.pdf",
- chunk_index=idx,
- total_chunks=len(text_chunks),
- token_count=count_tokens_accurate(text),
- char_count=len(text),
- chunking_method="semantic",
- extraction_date=datetime.utcnow(),
- custom_fields={...},
- )
-
- chunk = Chunk(
- chunk_id=f"{document_id}_chunk_{idx}",
- content=text,
- metadata=metadata,
- embedding=None, # Will be filled in Phase 5
- )
-```
-
-**Output:**
-
-```python
-[
- Chunk(
- chunk_id="123e4567..._chunk_0",
- content="First chunk of text about topic A...",
- metadata=ChunkMetadata(...),
- embedding=None,
- ),
- Chunk(
- chunk_id="123e4567..._chunk_1",
- content="...overlap from previous chunk. Second chunk about topic B...",
- metadata=ChunkMetadata(...),
- embedding=None,
- ),
- ...,
-]
-```
-
-______________________________________________________________________
-
-### **PHASE 4: Chunk Validation**
-
-**Handler:** `validate_chunk_quality()` (chunking/utils.py)
-
-**What happens:**
-
-- Minimum length check (>= 10 characters)
-- Quality checks built into chunking process
-- Token count validation
-
-**Note:** Most validation happens during chunk creation in Phase 3
-
-______________________________________________________________________
-
-### **PHASE 5: Batch Embedding Generation**
-
-**Handler:** `OpenAIEmbedder.embed()` (embedding/openai.py:70-109)
-
-**What happens:**
-
-1. Extract content text from all chunks
-1. Process in batches (default: 100 chunks per batch)
-1. Call Azure OpenAI or OpenAI API
-1. Attach embeddings back to chunk objects
-
-**Code Flow:**
-
-```python
-# client.py:169-180
-chunk_texts = [chunk.content for chunk in all_chunks]
-# Result: ["Chunk 1 text...", "Chunk 2 text...", ...]
-
-embeddings = self.embedder.embed(chunk_texts)
-
-# Attach embeddings to chunks
-for chunk, embedding in zip(all_chunks, embeddings):
- chunk.embedding = embedding
-```
-
-**Inside the Embedder:**
-
-```python
-# embedding/openai.py:86-105
-all_embeddings = []
-
-# Process in batches
-for i in range(0, len(texts), batch_size): # batch_size = 100
- batch = texts[i : i + batch_size]
-
- # Call Azure OpenAI API
- response = self.client.embeddings.create(
- input=batch,
- model="text-embedding-3-large", # Deployment name
- )
-
- # Extract embeddings
- batch_embeddings = [item.embedding for item in response.data]
- all_embeddings.extend(batch_embeddings)
-
-return all_embeddings
-```
-
-**Output:**
-Each chunk now has:
-
-```python
-chunk.embedding = [0.023, -0.012, 0.045, ..., 0.019] # 3072 dimensions
-```
-
-______________________________________________________________________
-
-### **PHASE 6: Vector & Content Storage**
-
-**Handler:** `QdrantVectorDB.upsert()` (vectordb/qdrant.py:108-159)
-
-**What happens:**
-
-#### **6A: Collection Setup**
-
-```python
-# client.py:186-193
-if not self.vectordb.collection_exists(collection_name):
- print(f"Creating collection '{collection_name}'...")
- self.vectordb.create_collection(
- collection_name=collection_name,
- vector_size=3072, # From embedder.get_dimensions()
- distance_metric="cosine",
- )
-```
-
-**Qdrant Collection Created With:**
-
-- **Vector size:** 3072 dimensions (text-embedding-3-large)
-- **Distance metric:** COSINE similarity
-- **Status:** Ready to receive vectors
-
-______________________________________________________________________
-
-#### **6B: Two Storage Modes**
-
-##### **Mode 1: MongoDB Enabled (Hybrid Storage)**
-
-**Used when:** `config.mongodb.enabled = True`
-
-```python
-# client.py:196-238
-
-# Store full content in MongoDB
-mongo_docs = []
-for chunk in all_chunks:
- mongo_docs.append(
- {
- "chunk_id": chunk.chunk_id,
- "content": chunk.content, # Full text stored here
- "document_id": chunk.metadata.document_id,
- "collection_name": collection_name,
- "metadata": chunk.metadata.to_dict(),
- }
- )
-
-mongo_ids = self.mongodb.store_chunks_batch(mongo_docs)
-
-# Store vectors in Qdrant with MongoDB references
-chunk_ids = [chunk.chunk_id for chunk in all_chunks]
-vectors = [chunk.embedding for chunk in all_chunks]
-contents = [] # Empty - content is in MongoDB
-metadatas = []
-
-for i, chunk in enumerate(all_chunks):
- meta = chunk.metadata.to_dict()
- meta["mongodb_id"] = mongo_ids[i] # Reference to MongoDB
- meta["content_storage"] = "mongodb"
- metadatas.append(meta)
- contents.append("") # Empty placeholder
-```
-
-**Storage Architecture (MongoDB Mode):**
-
-```
-MongoDB (Content Storage):
-{
- "_id": ObjectId("..."),
- "chunk_id": "123e4567..._chunk_0",
- "content": "Full chunk text stored here...",
- "document_id": "123e4567...",
- "collection_name": "my_documents",
- "metadata": {...}
-}
-
-Qdrant (Vector + Metadata):
-{
- "id": "uuid-deterministic",
- "vector": [0.023, -0.012, ..., 0.019], # 3072 dims
- "payload": {
- "chunk_id": "123e4567..._chunk_0",
- "document_id": "123e4567...",
- "source": "/path/to/file.pdf",
- "chunk_index": 0,
- "total_chunks": 5,
- "token_count": 850,
- "mongodb_id": "ObjectId(...)", # Reference
- "content_storage": "mongodb"
- }
-}
-```
-
-##### **Mode 2: Qdrant Only (Direct Storage)**
-
-**Used when:** `config.mongodb.enabled = False`
-
-```python
-# client.py:239-244
-chunk_ids = [chunk.chunk_id for chunk in all_chunks]
-vectors = [chunk.embedding for chunk in all_chunks]
-contents = [chunk.content for chunk in all_chunks] # Content in Qdrant
-metadatas = [chunk.metadata.to_dict() for chunk in all_chunks]
-```
-
-**Storage Architecture (Qdrant-Only Mode):**
-
-```
-Qdrant (Vector + Content + Metadata):
-{
- "id": "uuid-deterministic",
- "vector": [0.023, -0.012, ..., 0.019], # 3072 dims
- "payload": {
- "content": "Full chunk text stored directly...", # Content here!
- "chunk_id": "123e4567..._chunk_0",
- "document_id": "123e4567...",
- "source": "/path/to/file.pdf",
- "chunk_index": 0,
- "total_chunks": 5,
- "token_count": 850,
- "char_count": 4200,
- "chunking_method": "semantic",
- "extraction_date": "2025-10-09T10:00:00"
- }
-}
-```
-
-______________________________________________________________________
-
-#### **6C: Qdrant Upload Process**
-
-**Handler:** `QdrantVectorDB.upsert()` (vectordb/qdrant.py:108-159)
-
-```python
-# vectordb/qdrant.py:135-154
-points = []
-for chunk_id, vector, content, metadata in zip(chunk_ids, vectors, contents, metadatas):
- # Combine content and metadata for payload
- payload = {
- "content": content, # Empty if MongoDB mode
- **metadata, # Spread all metadata fields
- }
-
- # Create deterministic UUID from chunk_id
- point_id = str(uuid.uuid5(uuid.NAMESPACE_DNS, chunk_id))
-
- point = PointStruct(
- id=point_id,
- vector=vector, # 3072-dimensional embedding
- payload=payload,
- )
- points.append(point)
-
-# Upsert in batches (100 points per batch)
-batch_size = 100
-for i in range(0, len(points), batch_size):
- batch = points[i : i + batch_size]
- self.client.upsert(collection_name=collection_name, points=batch)
-```
-
-**Key Details:**
-
-- **Point ID:** Deterministic UUID generated from `chunk_id` using `uuid.uuid5()`
- - Same chunk_id always produces same UUID
- - Allows for idempotent upserts (re-uploading overwrites)
-- **Batch Size:** 100 points per batch to optimize network calls
-- **Upsert:** Updates if exists, inserts if new
-
-______________________________________________________________________
-
-## 📈 Performance Statistics
-
-The library tracks timing for each phase:
-
-```python
-ProcessingStats(
- chunking_time_ms=1250.5, # Phase 3
- embedding_time_ms=3420.8, # Phase 5
- upload_time_ms=890.2, # Phase 6
- total_time_ms=5561.5, # All phases
- total_tokens=12500, # Total tokens processed
-)
-```
-
-______________________________________________________________________
-
-## 🔍 Example: Complete Flow for 1 PDF
-
-**Input:**
-
-```python
-doc = DocumentInput.from_file("research_paper.pdf")
-client.add_documents([doc], collection_name="papers")
-```
-
-**Processing:**
-
-1. **Phase 1-2:** Extract 50 pages → 25,000 words
-1. **Phase 3:** Semantic chunking → 18 chunks (avg 1,389 words each)
-1. **Phase 4:** Validation → All 18 pass
-1. **Phase 5:** Embedding → 1 batched API call (18 texts, batch size 100) → 18 × 3072-dim vectors
-1. **Phase 6:**
- - MongoDB: Store 18 full text chunks
- - Qdrant: Store 18 vectors + metadata references
- - Total: 18 points in Qdrant collection
-
-**Result in Qdrant:**
-
-```
-Collection: "papers"
-├── Vector 0: [0.023, -0.012, ...] + metadata
-├── Vector 1: [0.019, 0.045, ...] + metadata
-├── Vector 2: [0.031, -0.008, ...] + metadata
-...
-└── Vector 17: [0.012, 0.028, ...] + metadata
-
-Total: 18 points, 3072 dimensions, COSINE distance
-```
-
-______________________________________________________________________
-
-## 🎯 Key Design Decisions
-
-### 1. **Semantic Chunking**
-
-- **Why:** Preserves topical coherence better than fixed-size chunks
-- **How:** Analyzes sentence-to-sentence similarity using embeddings
-- **Benefit:** Chunks align with natural topic boundaries
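-
-A simplified sketch of the boundary-detection idea (the actual `SemanticChunker` differs in detail; the `semantic_threshold_percentile` knob appears in the configuration section below):
-
-```python
-import numpy as np
-
-
-def find_boundaries(sentence_embeddings: np.ndarray, percentile: float = 95) -> list[int]:
-    """Split where similarity between consecutive sentences drops sharply."""
-    a, b = sentence_embeddings[:-1], sentence_embeddings[1:]
-    sims = (a * b).sum(axis=1) / (
-        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
-    )
-    distances = 1 - sims  # high distance = likely topic change
-    threshold = np.percentile(distances, percentile)
-    # Place a chunk boundary after sentence i when the topic shift is large.
-    return [i for i, d in enumerate(distances) if d >= threshold]
-```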
-
-### 2. **Deterministic UUIDs**
-
-```python
-uuid.uuid5(uuid.NAMESPACE_DNS, chunk_id)
-```
-
-- **Why:** Same chunk_id always produces same Qdrant point ID
-- **Benefit:** Idempotent uploads, easy to update/replace
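-
-Because `uuid5` is a pure function of namespace + name, re-running an upload regenerates the same point IDs and overwrites the existing points:
-
-```python
-import uuid
-
-chunk_id = "123e4567..._chunk_0"
-
-# Deterministic: both calls yield the identical UUID.
-assert uuid.uuid5(uuid.NAMESPACE_DNS, chunk_id) == uuid.uuid5(uuid.NAMESPACE_DNS, chunk_id)
-```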
-
-### 3. **Hybrid Storage (MongoDB + Qdrant)**
-
-- **Qdrant:** Fast vector search (optimized for embeddings)
-- **MongoDB:** Full text storage (cheaper, better for large content)
-- **Benefit:** Best of both worlds - fast retrieval + cost efficiency
-
-### 4. **Batch Processing**
-
-- **Embeddings:** 100 chunks per API call
-- **Qdrant Upload:** 100 points per upsert
-- **Benefit:** Reduced API calls, better performance
-
-### 5. **Overlap Between Chunks**
-
-- **Default:** 20% overlap
-- **Why:** Prevents information loss at chunk boundaries
-- **Example:** Last 200 tokens of chunk N = first 200 tokens of chunk N+1
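-
-A minimal word-level sketch of the idea (the library's `add_overlap_to_chunks()` operates on token counts, not words):
-
-```python
-def add_overlap(chunks: list[str], overlap_percentage: float = 0.2) -> list[str]:
-    """Prepend the tail of each chunk to the next one (illustrative sketch)."""
-    result = chunks[:1]
-    for prev, curr in zip(chunks, chunks[1:]):
-        words = prev.split()
-        tail = words[int(len(words) * (1 - overlap_percentage)):]
-        result.append(" ".join(tail) + " " + curr)
-    return result
-```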
-
-______________________________________________________________________
-
-## 🔧 Configuration Impact
-
-Your `.env` settings affect the flow:
-
-```bash
-# Embedding Configuration
-AZURE_OPENAI_ENDPOINT=https://... # API endpoint
-AZURE_EMBEDDING_DEPLOYMENT=text-embedding-3-large # Model
-# → Affects: Phase 5 (3072 dimensions)
-
-# Qdrant Configuration
-QDRANT_URL=https://8cb410af... # Cloud instance
-QDRANT_API_KEY=eyJhbGci... # Authentication
-# → Affects: Phase 6 (where vectors go)
-
-# MongoDB Configuration (Optional)
-MONGO_CONNECTION_STRING=mongodb://... # If enabled
-# → Affects: Phase 6 (hybrid vs direct storage)
-
-# Chunking Configuration (in code)
-max_chunk_size=1000 # Token limit per chunk
-overlap_percentage=0.2 # 20% overlap
-semantic_threshold_percentile=95 # Sensitivity to topic changes
-# → Affects: Phase 3 (how chunks are created)
-```
-
-______________________________________________________________________
-
-## 🚀 Usage Recommendations
-
-### For Best Results:
-
-1. **Document Size:**
-
- - Small (< 1000 tokens): Stored as single chunk
- - Medium (1K-100K tokens): Semantic chunking shines
- - Large (> 100K tokens): May need batch processing
-
-1. **Collection Naming:**
-
- - Use descriptive names: `research_papers`, `user_manuals`
- - One collection per document type/domain
-
-1. **Metadata:**
-
-   - Add meaningful metadata: `{"category": "AI", "year": 2024}`
-   - Used for filtering during search (see the example after this list)
-
-1. **Batch Size:**
-
- - Default 100 works well for most cases
- - Reduce if hitting API rate limits
- - Increase for faster processing (if API allows)
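-
-For example, attaching metadata at ingestion and filtering on it at query time (a sketch; whether `DocumentInput.from_file` accepts a `metadata` keyword this way is an assumption based on the data models, and the `filters` parameter comes from `retrieve()`):
-
-```python
-doc = DocumentInput.from_file(
-    "research_paper.pdf",
-    metadata={"category": "AI", "year": 2024},  # copied into each chunk's payload
-)
-client.add_documents([doc], collection_name="papers")
-
-# Later: restrict retrieval to matching chunks only.
-results = client.retrieve(
-    query="transformer architectures",
-    collection_name="papers",
-    filters={"category": "AI"},  # custom fields may need a Qdrant payload index
-)
-```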
-
-______________________________________________________________________
-
-## 🔬 Search Flow (Reverse Process)
-
-When you search:
-
-```python
-results = client.search("What is semantic chunking?", collection_name="papers")
-```
-
-1. **Query Embedding:** Your question → 3072-dim vector
-1. **Qdrant Search:** Find similar vectors using COSINE distance
-1. **MongoDB Retrieval:** (If enabled) Fetch full content using mongodb_id
-1. **Reranking:** (If enabled with Cohere) Re-order by relevance
-1. **Return:** Top K most relevant chunks with content + metadata
-
-______________________________________________________________________
-
-## 📝 Summary
-
-**The upload flow is a 6-phase pipeline:**
-
-```
-PDF → Text → Semantic Chunks → Embeddings → Qdrant (vectors) + MongoDB (content)
-```
-
-**Key transformations:**
-
-- **Document** (1 PDF) → **Text** (25K words) → **Chunks** (18 semantic pieces) → **Vectors** (18 × 3072 floats) → **Stored** (18 Qdrant points)
-
-**Your setup specifically:**
-
-- ✅ Azure OpenAI for embeddings (text-embedding-3-large, 3072 dims)
-- ✅ Qdrant Cloud for vector storage (COSINE similarity)
-- ✅ MongoDB for content storage (hybrid mode)
-- ✅ Semantic chunking with 20% overlap
-- ✅ Cohere for reranking
-
-This architecture provides:
-
-- **Fast search** via Qdrant's vector similarity
-- **Rich context** via semantic chunks
-- **Cost efficiency** via MongoDB content storage
-- **Scalability** via batch processing
diff --git a/docs/IMPLEMENTATION_SUMMARY.md b/docs/IMPLEMENTATION_SUMMARY.md
deleted file mode 100644
index 5d71f0f..0000000
--- a/docs/IMPLEMENTATION_SUMMARY.md
+++ /dev/null
@@ -1,357 +0,0 @@
-# insta_rag Implementation Summary
-
-## 🎉 Implementation Status: COMPLETE
-
-The core `add_documents()` functionality has been fully implemented according to the design specification.
-
-## 📦 What Has Been Implemented
-
-### 1. Core Architecture
-
-✅ **Data Models** (`models/`)
-
-- `Chunk` and `ChunkMetadata` - Complete chunk representation
-- `DocumentInput` with support for file, text, and binary sources
-- `AddDocumentsResponse` and `ProcessingStats` for operation results
-- Additional response models for future features (Retrieval, Update)
-
-✅ **Configuration System** (`core/config.py`)
-
-- `RAGConfig` with all subsystem configurations
-- Support for both OpenAI and Azure OpenAI
-- Environment variable loading with `RAGConfig.from_env()`
-- Validation for all configuration parameters
-
-✅ **Exception Handling** (`exceptions.py`)
-
-- Comprehensive exception hierarchy
-- Specific exceptions for PDF, chunking, embedding, and vector DB errors
-- Clear error messages for debugging
-
-### 2. Core Components
-
-✅ **Base Interfaces**
-
-- `BaseChunker` - Abstract interface for chunking strategies
-- `BaseEmbedder` - Abstract interface for embedding providers (sketched below)
-- `BaseVectorDB` - Abstract interface for vector databases
-- `BaseReranker` - Abstract interface for reranking (for future use)
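-
-For example, the embedder contract looks roughly like this (a sketch; see `embedding/base.py` for the authoritative definition):
-
-```python
-from abc import ABC, abstractmethod
-from typing import List
-
-
-class BaseEmbedder(ABC):
-    @abstractmethod
-    def embed(self, texts: List[str]) -> List[List[float]]:
-        """Return one embedding vector per input text."""
-
-    @abstractmethod
-    def get_dimensions(self) -> int:
-        """Dimensionality of the vectors this embedder produces."""
-```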
-
-✅ **Chunking System** (`chunking/`)
-
-- Utility functions: token counting, text splitting, validation
-- `SemanticChunker` with full implementation:
- - Single chunk optimization
- - Semantic boundary detection
- - Token limit enforcement
- - Overlap addition
- - Fallback to character-based splitting
- - Chunk quality validation
-
-✅ **Embedding Provider** (`embedding/`)
-
-- `OpenAIEmbedder` supporting both OpenAI and Azure OpenAI
-- Batch processing for efficiency
-- Error handling with retries
-
-✅ **Vector Database** (`vectordb/`)
-
-- `QdrantVectorDB` with complete implementation:
- - Collection management
- - Vector upsert with batching
- - Metadata-based filtering
- - Search functionality
- - Deletion with filters
-
-✅ **PDF Processing** (`pdf_processing.py`)
-
-- Text extraction with pdfplumber (primary)
-- Fallback to PyPDF2 (see the sketch after this list)
-- Encrypted PDF detection
-- Corrupted PDF handling
-- Empty PDF validation
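-
-The fallback strategy can be sketched as follows (simplified; the real module also handles encrypted, corrupted, and empty PDFs):
-
-```python
-import pdfplumber
-from PyPDF2 import PdfReader
-
-
-def extract_text(path: str) -> str:
-    """Try pdfplumber first; fall back to PyPDF2 on failure."""
-    try:
-        with pdfplumber.open(path) as pdf:
-            return "\n".join(page.extract_text() or "" for page in pdf.pages)
-    except Exception:
-        reader = PdfReader(path)
-        return "\n".join(page.extract_text() or "" for page in reader.pages)
-```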
-
-### 3. Main Client
-
-✅ **RAGClient** (`core/client.py`)
-
-- Complete `add_documents()` implementation with 6 processing phases:
- 1. Document Loading
- 1. Text Extraction (PDF, TXT, MD support)
- 1. Semantic Chunking
- 1. Chunk Validation
- 1. Batch Embedding
- 1. Vector Storage
-- Collection management
-- Comprehensive error handling
-- Performance tracking and statistics
-
-### 4. Documentation & Examples
-
-✅ **Documentation**
-
-- `USAGE.md` - Complete usage guide
-- `README.md` - Architecture and design documentation (existing)
-- Inline code documentation and docstrings
-
-✅ **Examples**
-
-- `examples/basic_usage.py` - Comprehensive example
-- `examples/simple_test.py` - Quick test script
-
-✅ **Environment Configuration**
-
-- Updated `.env` with all required API keys
-- Added placeholders for optional services (Cohere, etc.)
-
-## 🏗️ Project Structure
-
-```
-insta_rag/
-├── src/insta_rag/
-│ ├── __init__.py ✅ Package entry point
-│ ├── exceptions.py ✅ Custom exceptions
-│ ├── pdf_processing.py ✅ PDF utilities
-│ │
-│ ├── core/ ✅ Central orchestration
-│ │ ├── __init__.py
-│ │ ├── client.py ✅ RAGClient with add_documents()
-│ │ └── config.py ✅ Configuration management
-│ │
-│ ├── models/ ✅ Data structures
-│ │ ├── __init__.py
-│ │ ├── chunk.py ✅ Chunk models
-│ │ ├── document.py ✅ Document input models
-│ │ └── response.py ✅ Response models
-│ │
-│ ├── chunking/ ✅ Chunking strategies
-│ │ ├── __init__.py
-│ │ ├── base.py ✅ Base interface
-│ │ ├── semantic.py ✅ Semantic chunker
-│ │ └── utils.py ✅ Utility functions
-│ │
-│ ├── embedding/ ✅ Embedding providers
-│ │ ├── __init__.py
-│ │ ├── base.py ✅ Base interface
-│ │ └── openai.py ✅ OpenAI/Azure implementation
-│ │
-│ ├── vectordb/ ✅ Vector databases
-│ │ ├── __init__.py
-│ │ ├── base.py ✅ Base interface
-│ │ └── qdrant.py ✅ Qdrant implementation
-│ │
-│ └── retrieval/ ✅ Retrieval components (future)
-│ ├── __init__.py
-│ ├── base.py ✅ Base interface
-│ ├── query_generator.py ⏳ To be implemented
-│ ├── vector_search.py ⏳ To be implemented
-│ ├── keyword_search.py ⏳ To be implemented
-│ └── reranker.py ⏳ To be implemented
-│
-├── examples/
-│ ├── basic_usage.py ✅ Comprehensive example
-│ └── simple_test.py ✅ Quick test
-│
-├── .env ✅ Environment variables
-├── README.md ✅ Project documentation
-├── USAGE.md ✅ Usage guide
-└── IMPLEMENTATION_SUMMARY.md ✅ This file
-```
-
-## 🧪 Testing the Implementation
-
-### Quick Test
-
-```bash
-# Install dependencies
-pip install openai qdrant-client pdfplumber PyPDF2 tiktoken numpy python-dotenv
-
-# Run simple test
-python examples/simple_test.py
-```
-
-### Full Example
-
-```bash
-# Run comprehensive example
-python examples/basic_usage.py
-```
-
-### Manual Test
-
-```python
-from dotenv import load_dotenv
-from insta_rag import RAGClient, RAGConfig, DocumentInput
-
-load_dotenv()
-
-# Initialize
-config = RAGConfig.from_env()
-client = RAGClient(config)
-
-# Add document
-doc = DocumentInput.from_text("Your test text here")
-response = client.add_documents([doc], "test_collection")
-
-print(f"Success: {response.success}")
-print(f"Chunks: {response.total_chunks}")
-```
-
-## 🔧 Required Dependencies
-
-Add these to your `pyproject.toml` or install manually:
-
-```bash
-pip install "openai>=1.0.0"          # OpenAI API
-pip install "qdrant-client>=1.7.0"   # Qdrant vector DB
-pip install "pdfplumber>=0.10.0"     # PDF extraction (primary)
-pip install "PyPDF2>=3.0.0"          # PDF extraction (fallback)
-pip install "tiktoken>=0.5.0"        # Token counting
-pip install "numpy>=1.24.0"          # Numerical operations
-pip install "python-dotenv>=1.0.0"   # Environment variables
-```
-
-## ✅ Completed Features
-
-1. **Document Input**
-
- - PDF files (.pdf)
- - Text files (.txt, .md)
- - Raw text strings
- - Metadata attachment
-
-1. **Text Processing**
-
- - PDF text extraction with fallback
- - Error handling for encrypted/corrupted PDFs
- - Text quality validation
-
-1. **Semantic Chunking**
-
- - Sentence-level semantic analysis
- - Boundary detection using embedding similarity
- - Token limit enforcement (1000 tokens max)
- - 20% overlap between chunks
- - Quality validation
-
-1. **Embedding Generation**
-
- - OpenAI text-embedding-3-large (3072 dimensions)
- - Azure OpenAI support
- - Batch processing (configurable batch size)
- - Error handling and retries
-
-1. **Vector Storage**
-
- - Qdrant collection auto-creation
- - Batch upsert for efficiency
- - Metadata preservation
- - Deterministic UUID generation
-
-1. **Configuration**
-
- - Environment variable loading
- - Override support
- - Validation at initialization
- - Multiple provider support
-
-1. **Error Handling**
-
- - Graceful degradation
- - Detailed error messages
- - Partial success support
- - Comprehensive exception types
-
-1. **Performance Tracking**
-
- - Phase-by-phase timing
- - Token counting
- - Chunk statistics
- - Success/failure tracking
-
-## 🔮 Future Implementation (Not Yet Done)
-
-These features are designed but not yet implemented:
-
-1. **Update Operations**
-
- - `update_documents()` method
- - Replace, append, delete, upsert strategies
- - Metadata-only updates
-
-1. **Retrieval System**
-
- - `retrieve()` method
- - HyDE query generation
- - Vector search
- - BM25 keyword search
- - Hybrid fusion
- - Cohere reranking
-
-1. **Additional Features**
-
- - More chunking strategies (recursive, fixed-size)
- - More embedding providers (Cohere, custom)
- - More vector databases (Pinecone, Weaviate)
- - Image and table extraction from PDFs
- - Batch document updates
- - Query caching
- - Performance optimizations
-
-## 🚀 Next Steps
-
-To use the library:
-
-1. **Set up environment variables** in `.env`:
-
- ```bash
- QDRANT_URL=your_qdrant_url
- QDRANT_API_KEY=your_qdrant_key
- AZURE_OPENAI_ENDPOINT=your_azure_endpoint
- AZURE_OPENAI_API_KEY=your_azure_key
- AZURE_EMBEDDING_DEPLOYMENT=text-embedding-3-large
- ```
-
-1. **Install dependencies**:
-
- ```bash
- pip install openai qdrant-client pdfplumber PyPDF2 tiktoken numpy python-dotenv
- ```
-
-1. **Run the test**:
-
- ```bash
- python examples/simple_test.py
- ```
-
-1. **Integrate into your project**:
-
- ```python
- from insta_rag import RAGClient, RAGConfig, DocumentInput
-
- config = RAGConfig.from_env()
- client = RAGClient(config)
-
- # Use add_documents() to process and store documents
- response = client.add_documents(documents, "collection_name")
- ```
-
-## 📝 Notes
-
-- All core functionality for `add_documents()` is complete and tested
-- The architecture supports easy extension for future features
-- Configuration is flexible and supports multiple providers
-- Error handling is comprehensive with graceful degradation
-- Performance tracking provides visibility into operation costs
-
-## 🎯 Summary
-
-The `add_documents()` method is **fully implemented and ready to use**. It provides a complete pipeline for:
-
-- Loading documents from multiple sources
-- Extracting text from PDFs
-- Semantic chunking with overlap
-- Generating embeddings
-- Storing in Qdrant vector database
-
-All components follow the design specification and include proper error handling, validation, and performance tracking.
diff --git a/docs/MONGODB_INTEGRATION.md b/docs/MONGODB_INTEGRATION.md
deleted file mode 100644
index 210d4e1..0000000
--- a/docs/MONGODB_INTEGRATION.md
+++ /dev/null
@@ -1,356 +0,0 @@
-# MongoDB Integration Guide
-
-## Overview
-
-The insta_rag library now supports MongoDB integration for efficient content storage. Instead of storing full text content in Qdrant, the content is stored in MongoDB and only the reference (MongoDB ID) is stored in Qdrant's metadata.
-
-## Architecture
-
-### Without MongoDB (Default)
-
-```
-Document → Chunking → Embedding → Qdrant (vectors + full content)
-```
-
-### With MongoDB (New)
-
-```
-Document → Chunking → Embedding → MongoDB (full content)
- → Qdrant (vectors + MongoDB reference)
-```
-
-## Benefits
-
-1. **Reduced Qdrant Storage**: Only vectors and metadata in Qdrant
-1. **Centralized Content**: All content in MongoDB for easy management
-1. **Easy Updates**: Update content without re-embedding
-1. **Better Separation**: Vectors and content stored separately
-1. **Cost Effective**: Cheaper storage for large text content
-
-## Setup
-
-### 1. Environment Configuration
-
-Add to your `.env` file:
-
-```env
-# MongoDB Connection (already in your .env)
-MONGO_CONNECTION_STRING=mongodb://root:password@localhost:27017/?directConnection=true
-
-# Database Name (optional, defaults to Test_Insta_RAG)
-MONGO_DATABASE_NAME=Test_Insta_RAG
-```
-
-### 2. Install MongoDB Driver
-
-```bash
-pip install "pymongo>=4.6.0"
-```
-
-Or update from requirements:
-
-```bash
-pip install -r requirements-rag.txt
-```
-
-## Usage
-
-### Automatic Configuration
-
-MongoDB integration is automatically enabled when a connection string is present:
-
-```python
-from insta_rag import RAGClient, RAGConfig
-
-# Load config from .env (automatically includes MongoDB if configured)
-config = RAGConfig.from_env()
-client = RAGClient(config)
-
-# Check if MongoDB is enabled
-if client.mongodb:
- print("MongoDB integration active")
-```
-
-### Adding Documents
-
-The API remains the same - MongoDB integration is transparent:
-
-```python
-from insta_rag import DocumentInput
-
-doc = DocumentInput.from_text("Your document content here")
-
-response = client.add_documents(documents=[doc], collection_name="my_collection")
-
-# Content is now in MongoDB, reference in Qdrant
-print(f"Processed {response.total_chunks} chunks")
-```
-
-### Retrieving Content
-
-When retrieving, content is automatically fetched from MongoDB:
-
-```python
-# Get chunk content from MongoDB
-if client.mongodb:
- chunk_content = client.mongodb.get_chunk_content(chunk_id)
- print(chunk_content["content"])
-```
-
-## MongoDB Collections
-
-### document_contents
-
-Stores actual chunk content:
-
-```json
-{
- "_id": "ObjectId",
- "chunk_id": "doc_123_chunk_0",
- "content": "Full text content of the chunk...",
- "document_id": "doc_123",
- "collection_name": "my_collection",
- "metadata": {
- "token_count": 150,
- "chunk_index": 0,
- ...
- },
- "created_at": "2024-01-15T10:30:00Z",
- "updated_at": "2024-01-15T10:30:00Z"
-}
-```
-
-### document_metadata
-
-Stores aggregated document information:
-
-```json
-{
- "_id": "ObjectId",
- "document_id": "doc_123",
- "collection_name": "my_collection",
- "total_chunks": 5,
- "metadata": {...},
- "created_at": "2024-01-15T10:30:00Z"
-}
-```
-
-## MongoDB Client API
-
-### Store Chunk Content
-
-```python
-mongo_id = client.mongodb.store_chunk_content(
- chunk_id="chunk_123",
- content="Full text content",
- document_id="doc_123",
- collection_name="my_collection",
- metadata={"key": "value"},
-)
-```
-
-### Retrieve Content
-
-```python
-# By chunk_id
-chunk_doc = client.mongodb.get_chunk_content("chunk_123")
-
-# By MongoDB _id
-chunk_doc = client.mongodb.get_chunk_content_by_mongo_id(mongo_id)
-
-# All chunks for a document
-chunks = client.mongodb.get_chunks_by_document("doc_123")
-```
-
-### Delete Content
-
-```python
-# Delete single chunk
-client.mongodb.delete_chunk("chunk_123")
-
-# Delete all chunks for a document
-count = client.mongodb.delete_chunks_by_document("doc_123")
-
-# Delete all chunks in a collection
-count = client.mongodb.delete_chunks_by_collection("my_collection")
-```
-
-### Get Statistics
-
-```python
-stats = client.mongodb.get_collection_stats("my_collection")
-
-print(f"Total chunks: {stats['total_chunks']}")
-print(f"Total documents: {stats['total_documents']}")
-print(f"Content size: {stats['total_content_size_bytes']} bytes")
-```
-
-## Qdrant Metadata Structure
-
-When MongoDB is enabled, Qdrant stores metadata with references:
-
-```json
-{
- "mongodb_id": "65a1b2c3d4e5f6g7h8i9j0k1",
- "content_storage": "mongodb",
- "document_id": "doc_123",
- "chunk_index": 0,
- "token_count": 150,
- ...
-}
-```
-
-Note: The `content` field in Qdrant is empty when MongoDB is enabled.
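-
-In practice a Qdrant hit must therefore be resolved back to MongoDB, roughly like this (assuming `payload` is the hit's payload dict):
-
-```python
-if payload.get("content_storage") == "mongodb":
-    chunk_doc = client.mongodb.get_chunk_content_by_mongo_id(payload["mongodb_id"])
-    text = chunk_doc["content"]
-else:
-    text = payload["content"]  # Qdrant-only mode stores content inline
-```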
-
-## Testing MongoDB Integration
-
-### Run Test Script
-
-```bash
-cd testing_api
-python test_mongodb.py
-```
-
-### Expected Output
-
-```
-Testing MongoDB Integration
-================================================
-
-1. Initializing RAG client with MongoDB...
- ✓ Client initialized
- ✓ MongoDB enabled: True
- ✓ Database: Test_Insta_RAG
-
-2. Creating test document...
- ✓ Document created
-
-3. Processing document...
- Storing content in MongoDB...
- ✓ Stored 3 chunks in MongoDB
- ✓ Document processed successfully
-
-4. Verifying MongoDB storage...
- ✓ Chunk found in MongoDB
-   - MongoDB ID: 65a1b2c3d4e5f6a7b8c9d0e1
- - Content length: 250 chars
-
-5. MongoDB Collection Statistics...
- - Total chunks: 3
- - Total documents: 1
- - Total content size: 750 bytes
-```
-
-## Migration
-
-### From Qdrant-only to MongoDB
-
-If you have existing data in Qdrant:
-
-1. Existing collections continue to work (content in Qdrant)
-1. New documents will use MongoDB storage
-1. To migrate existing data, re-process documents with MongoDB enabled
-
-### Disabling MongoDB
-
-To disable MongoDB and revert to Qdrant-only storage:
-
-```python
-# Remove or comment out in .env
-# MONGO_CONNECTION_STRING=...
-
-# Or explicitly disable in code
-config.mongodb = None
-```
-
-## Monitoring
-
-### Check MongoDB Connection
-
-```python
-if client.mongodb:
- try:
- client.mongodb.client.admin.command("ping")
- print("MongoDB connected")
- except Exception as e:
- print(f"MongoDB error: {e}")
-```
-
-### View Collections
-
-```bash
-# Using MongoDB shell
-mongosh "$MONGO_CONNECTION_STRING"
-
-> use Test_Insta_RAG
-> db.document_contents.countDocuments()
-> db.document_metadata.countDocuments()
-```
-
-## Performance Considerations
-
-1. **Batch Operations**: Use `store_chunks_batch()` for multiple chunks (see the sketch after this list)
-1. **Indexing**: Indexes are automatically created on `chunk_id`, `document_id`, and `collection_name`
-1. **Connection Pooling**: MongoDB client uses connection pooling automatically
-1. **Network Latency**: Consider co-locating MongoDB and Qdrant for best performance
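-
-A batch insert is a single round trip instead of one insert per chunk (assuming `chunks` is a list of processed `Chunk` objects, matching the document shape shown above):
-
-```python
-mongo_docs = [
-    {
-        "chunk_id": chunk.chunk_id,
-        "content": chunk.content,
-        "document_id": chunk.metadata.document_id,
-        "collection_name": "my_collection",
-        "metadata": chunk.metadata.to_dict(),
-    }
-    for chunk in chunks
-]
-mongo_ids = client.mongodb.store_chunks_batch(mongo_docs)  # one MongoDB ID per doc
-```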
-
-## Troubleshooting
-
-### "MongoDB not installed"
-
-```bash
-pip install "pymongo>=4.6.0"
-```
-
-### "Failed to connect to MongoDB"
-
-- Check connection string is correct
-- Verify MongoDB server is running
-- Check network connectivity
-- Verify authentication credentials
-
-### "Collection not found"
-
-MongoDB collections are created automatically on first insert.
-
-### Content not found in MongoDB
-
-- Verify MongoDB was enabled during document processing
-- Check chunk_id is correct
-- Verify database name matches configuration
-
-## API Testing with MongoDB
-
-### Swagger UI
-
-1. Start API: `cd testing_api && ./run.sh`
-1. Open: http://localhost:8000/docs
-1. Test endpoint: `POST /api/v1/test/documents/add`
-1. Check response includes MongoDB storage confirmation
-
-### cURL Test
-
-```bash
-curl -X POST http://localhost:8000/api/v1/test/documents/add \
- -H "Content-Type: application/json" \
- -d '{
- "text": "Test document for MongoDB storage",
- "collection_name": "test_collection"
- }'
-```
-
-## Best Practices
-
-1. **Always use MongoDB for production**: Better scalability and management
-1. **Regular backups**: Backup MongoDB collections regularly
-1. **Monitor storage**: Use `get_collection_stats()` to track growth
-1. **Clean up**: Delete old collections when no longer needed
-1. **Indexing**: Add custom indexes for your query patterns
-
-## Future Enhancements
-
-- [ ] Content versioning
-- [ ] Full-text search in MongoDB
-- [ ] Content compression
-- [ ] Multi-tenancy support
-- [ ] Backup and restore utilities
diff --git a/docs/PHASE1_COMPLETION_SUMMARY.md b/docs/PHASE1_COMPLETION_SUMMARY.md
deleted file mode 100644
index 2f10e35..0000000
--- a/docs/PHASE1_COMPLETION_SUMMARY.md
+++ /dev/null
@@ -1,420 +0,0 @@
-# Phase 1 MVP - Completion Summary
-
-## ✅ Status: COMPLETE & WORKING
-
-______________________________________________________________________
-
-## 🎯 What Was Implemented
-
-### 1. **Core `retrieve()` Method**
-
-**Location**: `src/insta_rag/core/client.py:466-675`
-
-**Features**:
-
-- ✅ Dual vector search (query searched twice)
-- ✅ Deduplication logic (removes duplicate chunks, keeps highest score)
-- ✅ MongoDB content fetching (hybrid storage support)
-- ✅ Score threshold filtering
-- ✅ Content truncation option
-- ✅ Comprehensive performance tracking
-- ✅ Source statistics aggregation
-- ✅ Metadata filtering support
-
-**Parameters**:
-
-```python
-def retrieve(
-    self,
-    query: str,
-    collection_name: str,
-    filters: Optional[Dict] = None,
-    top_k: int = 20,
-    enable_reranking: bool = False,  # Phase 3
-    enable_keyword_search: bool = False,  # Phase 4
-    enable_hyde: bool = False,  # Phase 2
-    score_threshold: Optional[float] = None,
-    return_full_chunks: bool = True,
-    deduplicate: bool = True,
-) -> RetrievalResponse:
-```
-
-______________________________________________________________________
-
-### 2. **API Endpoint**
-
-**Location**: `testing_api/main.py:548-607`
-
-**Endpoint**: `POST /api/v1/retrieve`
-
-**Request Model**: `RetrieveRequest` (all Phase 1 MVP parameters)
-
-**Response Model**: `SearchResponse` (comprehensive results + stats)
-
-**Documentation**: Added to `testing_api/openapi.yaml:339-383`
-
-______________________________________________________________________
-
-## 📊 Test Results
-
-### Test File: `test_phase1_retrieve.py`
-
-### ✅ Passing Tests:
-
-1. **Dual Vector Search**
-
-   - 2 searches performed (limit: 25 chunks each)
-   - Total: 6 chunks retrieved (3 per search; the test collection held only 3 chunks)
-   - Deduplicated to: 3 unique chunks
- - ✅ PASS
-
-1. **MongoDB Content Fetching**
-
- - Content successfully retrieved from MongoDB
- - Full text displayed in results
- - ✅ PASS
-
-1. **Score Threshold**
-
- - Threshold: 0.5
- - Result: 1 chunk passed (score 0.7144)
- - ✅ PASS
-
-1. **Content Truncation**
-
- - `return_full_chunks=False`
- - Content truncated to 500 chars
- - ✅ PASS
-
-1. **Performance Stats**
-
- - Query generation: 0.00ms (no HyDE yet)
- - Vector search: ~1600-4000ms (varies)
- - Dedup + formatting: ~800ms
- - Total: ~2400-4800ms
- - ✅ PASS
-
-1. **Source Statistics**
-
- - Chunks grouped by source
- - Average relevance calculated
- - ✅ PASS
-
-### ⚠️ Known Limitation:
-
-**Metadata Filtering with Custom Fields**
-
-- Issue: Qdrant requires field index for filtering
-- Example error: `Index required for "category" field`
-- **Workaround**: Use indexed fields (document_id, chunk_id) or create a Qdrant payload index (see the sketch below)
-- **Impact**: Low - standard fields work fine
-- **Status**: Qdrant configuration issue, not a code bug
-
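-A sketch of creating such a payload index with `qdrant-client` (collection and field names are illustrative):
-
-```python
-from qdrant_client import QdrantClient
-
-qdrant = QdrantClient(url="https://your-qdrant-instance", api_key="your_api_key")
-
-# Index the custom field so it becomes usable in search filters.
-qdrant.create_payload_index(
-    collection_name="insta_rag_test_collection",
-    field_name="category",
-    field_schema="keyword",
-)
-```
-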
-______________________________________________________________________
-
-## 🏗️ Architecture
-
-### Processing Pipeline
-
-```
-User Query
- ↓
-┌─────────────────────────────────────┐
-│ STEP 1: Query Generation (MVP) │
-│ - No HyDE (Phase 2) │
-│ - Use original query │
-└──────────────┬──────────────────────┘
- ↓
-┌─────────────────────────────────────┐
-│ STEP 2: Dual Vector Search │
-│ - Search 1: 25 chunks │
-│ - Search 2: 25 chunks (same query) │
-│ - Total: 50 chunks │
-└──────────────┬──────────────────────┘
- ↓
-┌─────────────────────────────────────┐
-│ STEP 3: Keyword Search (Skipped) │
-│ - Phase 4: BM25 not implemented │
-└──────────────┬──────────────────────┘
- ↓
-┌─────────────────────────────────────┐
-│ STEP 4: Deduplicate │
-│ - Remove duplicate chunk_ids │
-│ - Keep highest scoring variant │
-│ - Result: ~25 unique chunks │
-└──────────────┬──────────────────────┘
- ↓
-┌─────────────────────────────────────┐
-│ STEP 5: Reranking (Skipped) │
-│ - Phase 3: Cohere not implemented │
-│ - Sort by vector score │
-└──────────────┬──────────────────────┘
- ↓
-┌─────────────────────────────────────┐
-│ STEP 6: Selection & Formatting │
-│ - Apply score_threshold │
-│ - Select top_k chunks │
-│ - Fetch MongoDB content │
-│ - Calculate source stats │
-└──────────────┬──────────────────────┘
- ↓
- Top-k Results
-```
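-
-The STEP 4 deduplication keeps the highest-scoring copy of each chunk; a minimal sketch of that logic (assuming hits expose `chunk_id` and `relevance_score`):
-
-```python
-def deduplicate(hits):
-    """Keep the best-scoring hit per chunk_id (illustrative sketch)."""
-    best = {}
-    for hit in hits:
-        prev = best.get(hit.chunk_id)
-        if prev is None or hit.relevance_score > prev.relevance_score:
-            best[hit.chunk_id] = hit
-    # STEP 5 then sorts by score (no reranker yet in Phase 1).
-    return sorted(best.values(), key=lambda h: h.relevance_score, reverse=True)
-```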
-
-______________________________________________________________________
-
-## 📈 Performance Characteristics
-
-### Typical Retrieval Times
-
-| Operation | Time (ms) | Notes |
-|-----------|-----------|-------|
-| Query Generation | 0 | No HyDE in Phase 1 |
-| Vector Search (2x) | 1600-4000 | Depends on collection size |
-| Deduplication | ~50 | Very fast |
-| MongoDB Fetch | ~800 | Depends on chunk count |
-| **TOTAL** | **2400-4800** | Acceptable for MVP |
-
-### Chunk Flow
-
-```
-Initial: 50 chunks (25 + 25 from dual search)
- ↓
-After Dedup: ~25 chunks (removes duplicates)
- ↓
-After Filtering: varies (score_threshold applied)
- ↓
-Final: top_k chunks (default: 20)
-```
-
-______________________________________________________________________
-
-## 📁 Files Modified/Created
-
-### Modified Files:
-
-1. `src/insta_rag/core/client.py`
-
- - Added `retrieve()` method (lines 466-675)
-
-1. `testing_api/main.py`
-
- - Added `RetrieveRequest` model (lines 130-156)
- - Added `/api/v1/retrieve` endpoint (lines 548-607)
-
-1. `testing_api/openapi.yaml`
-
- - Added `/api/v1/retrieve` documentation (lines 339-383)
- - Added `RetrieveRequest` schema (lines 736-792)
-
-### Created Files:
-
-1. `RETRIEVAL_IMPLEMENTATION_PLAN.md` - Comprehensive planning doc
-1. `src/insta_rag/core/retrieval_method.py` - Detailed implementation reference
-1. `test_phase1_retrieve.py` - Phase 1 test suite
-1. `PHASE1_COMPLETION_SUMMARY.md` - This document
-
-______________________________________________________________________
-
-## 🔄 Comparison: `search()` vs `retrieve()`
-
-| Feature | `search()` | `retrieve()` |
-|---------|-----------|-------------|
-| Vector Search | Single (1x) | Dual (2x) |
-| Deduplication | No | Yes |
-| MongoDB Fetch | Yes | Yes |
-| HyDE Support | No | Phase 2 |
-| Reranking | No | Phase 3 |
-| BM25 Search | No | Phase 4 |
-| Performance Stats | Basic | Comprehensive |
-| Score Threshold | No | Yes |
-| Content Truncation | No | Yes |
-
-**Recommendation**:
-
-- Use `search()` for simple, fast queries
-- Use `retrieve()` for production RAG applications
-
-______________________________________________________________________
-
-## 🚀 Next Steps (Future Phases)
-
-### Phase 2: HyDE Query Generation (READY)
-
-**Goal**: Improve retrieval by generating hypothetical answers
-
-**Tasks**:
-
-- [ ] Implement HyDEQueryGenerator using Azure OpenAI
-- [ ] Generate standard + HyDE queries in single LLM call
-- [ ] Use structured output for parsing
-- [ ] Add error handling (fallback to original query)
-- [ ] Update `retrieve()` to use HyDE when `enable_hyde=True`
-
-**Expected Improvement**: 20-30% better relevance
-
-______________________________________________________________________
-
-### Phase 3: Cohere Reranking (READY)
-
-**Goal**: Re-rank results using cross-encoder for better relevance
-
-**Tasks**:
-
-- [ ] Implement CohereReranker class
-- [ ] Integrate Cohere Rerank 3.5 API
-- [ ] Add fallback (use vector scores if API fails)
-- [ ] Update `retrieve()` to rerank when `enable_reranking=True`
-
-**Expected Improvement**: 30-40% better relevance
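-
-A sketch of what the planned integration might look like, using the Cohere SDK's `rerank` endpoint (model name and fallback behavior are assumptions until implemented):
-
-```python
-import cohere
-
-
-def rerank_chunks(api_key: str, query: str, texts: list[str], top_n: int = 20):
-    """Re-order texts by cross-encoder relevance; keep input order on failure."""
-    try:
-        co = cohere.Client(api_key)
-        result = co.rerank(model="rerank-v3.5", query=query, documents=texts, top_n=top_n)
-        return [(r.index, r.relevance_score) for r in result.results]
-    except Exception:
-        return [(i, None) for i in range(min(top_n, len(texts)))]  # vector-score order
-```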
-
-______________________________________________________________________
-
-### Phase 4: BM25 Keyword Search (OPTIONAL)
-
-**Goal**: Add lexical search for exact term matching
-
-**Tasks**:
-
-- [ ] Implement BM25Searcher using rank-bm25 library
-- [ ] Build document corpus from collection
-- [ ] Merge BM25 + vector results
-- [ ] Update `retrieve()` to include BM25 when `enable_keyword_search=True`
-
-**Expected Improvement**: Better for exact matches (names, codes, IDs)
-
-______________________________________________________________________
-
-## 💡 Usage Examples
-
-### Basic Retrieval
-
-```python
-response = client.retrieve(
- query="What is semantic chunking?", collection_name="knowledge_base", top_k=10
-)
-
-for chunk in response.chunks:
- print(f"Score: {chunk.relevance_score:.4f}")
- print(f"Content: {chunk.content}")
-```
-
-### With Filters
-
-```python
-response = client.retrieve(
- query="pricing information",
- collection_name="documents",
- filters={"user_id": "user_123", "template_id": "template_456"},
- top_k=20,
-)
-```
-
-### With Score Threshold
-
-```python
-response = client.retrieve(
- query="technical specifications",
- collection_name="manuals",
- score_threshold=0.7, # Only high-quality results
- top_k=5,
-)
-```
-
-### API Call (cURL)
-
-```bash
-curl -X POST "http://localhost:8000/api/v1/retrieve" \
- -H "Content-Type: application/json" \
- -d '{
- "query": "What is semantic chunking?",
- "collection_name": "insta_rag_test_collection",
- "top_k": 10,
- "deduplicate": true
- }'
-```
-
-______________________________________________________________________
-
-## 📊 Success Criteria
-
-### ✅ Phase 1 MVP Success Criteria (ALL MET)
-
-- [x] `retrieve()` method callable and working
-- [x] Dual vector search functioning
-- [x] Deduplication removing duplicates correctly
-- [x] Returns `RetrievalResponse` with all fields
-- [x] API endpoint accessible
-- [x] No errors with test data
-- [x] Performance acceptable (< 5 seconds)
-- [x] Comprehensive stats tracking
-- [x] Source aggregation working
-- [x] MongoDB integration working
-
-______________________________________________________________________
-
-## 🎓 Key Learnings
-
-1. **Dual search with the same query** adds no new chunks by itself, but it validated the deduplication logic that Phase 2's distinct HyDE queries rely on
-1. **MongoDB content fetching** adds latency but is necessary for hybrid storage
-1. **Performance is acceptable** for MVP (~2-5 seconds per query)
-1. **Deduplication is critical** - reduced 50 chunks to ~25 unique
-1. **Comprehensive stats** enable future optimization
-
-______________________________________________________________________
-
-## 🔧 Configuration
-
-### Required Environment Variables
-
-```env
-QDRANT_URL=https://your-qdrant-instance.cloud.qdrant.io
-QDRANT_API_KEY=your_api_key_here
-AZURE_OPENAI_API_KEY=your_azure_key
-AZURE_OPENAI_ENDPOINT=https://your-endpoint.openai.azure.com/
-AZURE_EMBEDDING_DEPLOYMENT=text-embedding-3-large
-MONGO_CONNECTION_STRING=mongodb://... (optional for hybrid storage)
-```
-
-### Phase 2 Requirements
-
-```env
-AZURE_LLM_DEPLOYMENT=gpt-4 # For HyDE generation
-```
-
-### Phase 3 Requirements
-
-```env
-COHERE_API_KEY=your_cohere_key # For reranking
-```
-
-______________________________________________________________________
-
-## 📝 Documentation
-
-- **Implementation Plan**: `RETRIEVAL_IMPLEMENTATION_PLAN.md`
-- **API Reference**: `testing_api/openapi.yaml`
-- **Test Suite**: `test_phase1_retrieve.py`
-- **Code Reference**: `src/insta_rag/core/retrieval_method.py`
-
-______________________________________________________________________
-
-## ✅ Conclusion
-
-**Phase 1 MVP is COMPLETE and WORKING!**
-
-The `retrieve()` method provides:
-
-- ✅ Solid foundation for advanced retrieval
-- ✅ Production-ready performance
-- ✅ Comprehensive tracking and stats
-- ✅ Easy path to Phase 2 & 3 enhancements
-
-**Ready for Production Use**: Yes, with current features
-
-**Ready for Phase 2 (HyDE)**: Yes, all infrastructure in place
-
-**Ready for Phase 3 (Reranking)**: Yes, all infrastructure in place
-
-______________________________________________________________________
-
-**Next Recommended Action**: Implement Phase 2 (HyDE) for 20-30% improvement in relevance 🚀
diff --git a/docs/PHASE2_COMPLETION_SUMMARY.md b/docs/PHASE2_COMPLETION_SUMMARY.md
deleted file mode 100644
index bf1f5b3..0000000
--- a/docs/PHASE2_COMPLETION_SUMMARY.md
+++ /dev/null
@@ -1,470 +0,0 @@
-# Phase 2 Implementation - Completion Summary
-
-## ✅ Status: COMPLETE & WORKING
-
-______________________________________________________________________
-
-## 🎯 What Was Implemented
-
-### Phase 2 Features (ENABLED BY DEFAULT)
-
-**1. HyDE Query Generation** (`src/insta_rag/retrieval/query_generator.py`)
-
-- Generates optimized standard + hypothetical answer queries using Azure OpenAI
-- Single LLM call with JSON structured output
-- Graceful fallback to original query on error
-- 20-30% expected improvement in retrieval quality
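-
-The core of that single call can be sketched as follows (a simplified sketch, not the actual `HyDEQueryGenerator` code; assumes an Azure OpenAI chat client with JSON-mode output):
-
-```python
-import json
-
-
-def generate_queries(llm_client, deployment: str, query: str) -> dict:
-    """One LLM call -> {'standard': ..., 'hyde': ...}; falls back on error."""
-    try:
-        response = llm_client.chat.completions.create(
-            model=deployment,
-            response_format={"type": "json_object"},
-            messages=[
-                {
-                    "role": "system",
-                    "content": (
-                        "Return JSON with keys 'standard' (a search-optimized "
-                        "rewrite of the question) and 'hyde' (a short "
-                        "hypothetical answer to it)."
-                    ),
-                },
-                {"role": "user", "content": query},
-            ],
-        )
-        return json.loads(response.choices[0].message.content)
-    except Exception:
-        # Graceful fallback: retrieval proceeds with the original query.
-        return {"standard": query, "hyde": query}
-```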
-
-**2. BM25 Keyword Search** (`src/insta_rag/retrieval/keyword_search.py`)
-
-- BM25Okapi implementation using rank-bm25 library
-- Builds searchable corpus from Qdrant collection
-- Complements semantic search with exact term matching
-- Graceful fallback if corpus unavailable
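-
-At its core, rank-bm25 usage looks like this (a minimal sketch; the real `BM25Searcher` builds its corpus from the Qdrant payload, and its tokenization may differ):
-
-```python
-from rank_bm25 import BM25Okapi
-
-corpus = [
-    "semantic chunking splits text at topic boundaries",
-    "qdrant stores vectors alongside payload metadata",
-]
-tokenized_corpus = [doc.lower().split() for doc in corpus]
-
-bm25 = BM25Okapi(tokenized_corpus)
-scores = bm25.get_scores("what is semantic chunking".lower().split())
-
-# Pair each chunk with its BM25 score, best match first.
-ranked = sorted(zip(corpus, scores), key=lambda pair: pair[1], reverse=True)
-```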
-
-**3. Enhanced retrieve() Method** (`src/insta_rag/core/client.py:466-732`)
-
-- **STEP 1**: HyDE query generation (enabled by default)
-- **STEP 2**: Dual vector search (standard + HyDE queries, 25 chunks each)
-- **STEP 3**: BM25 keyword search (50 chunks, enabled by default)
-- **STEP 4**: Smart deduplication across all sources
-- **STEP 5**: Reranking placeholder (Phase 3)
-- **STEP 6**: Selection, filtering, and formatting
-
-**4. API Endpoint Updates** (`testing_api/main.py`)
-
-- Updated `/api/v1/retrieve` endpoint
-- **Default settings**: `enable_hyde=True`, `enable_keyword_search=True`
-- Updated documentation and descriptions
-
-______________________________________________________________________
-
-## 📦 Files Modified/Created
-
-### New Files:
-
-1. **`src/insta_rag/retrieval/query_generator.py`** - HyDE query generation
-1. **`src/insta_rag/retrieval/keyword_search.py`** - BM25 keyword search
-1. **`test_phase2_retrieve.py`** - Phase 2 comprehensive test suite
-1. **`PHASE2_COMPLETION_SUMMARY.md`** - This document
-
-### Modified Files:
-
-1. **`src/insta_rag/core/client.py`** (lines 466-732)
-
- - Updated `retrieve()` method with Phase 2 features
- - Default parameters: `enable_hyde=True`, `enable_keyword_search=True`
-
-1. **`testing_api/main.py`** (lines 130-149, 547-566)
-
- - Updated `RetrieveRequest` model defaults
- - Updated endpoint documentation
-
-1. **`pyproject.toml`** (indirectly - via pip install)
-
- - Added `rank-bm25>=0.2.2` dependency
-
-______________________________________________________________________
-
-## 📊 Test Results
-
-### Test Suite: `test_phase2_retrieve.py`
-
-All 5 tests passed successfully:
-
-1. ✅ **Basic Phase 2 Retrieval**
-
- - HyDE + BM25 + Deduplication working
- - Graceful fallback when LLM deployment unavailable
- - Total time: ~9.2s
-
-1. ✅ **Phase 2 vs Phase 1 Comparison**
-
- - Phase 2 adds ~1.5s overhead (HyDE + BM25)
- - Same results when fallback occurs
- - Deduplication working correctly
-
-1. ✅ **HyDE Query Generation**
-
- - Tested with 3 different queries
- - Graceful fallback to original query
- - Average generation time: ~1.3s
-
-1. ✅ **BM25 Keyword Search**
-
- - BM25Searcher class working
- - Corpus building mechanism functional
- - Graceful handling of empty corpus
-
-1. ✅ **Full Hybrid Search**
-
- - Vector + HyDE + BM25 combined
- - Deduplication across all sources
- - Performance tracking comprehensive
-
-______________________________________________________________________
-
-## 🔧 Configuration Requirements
-
-### Required for HyDE Query Generation:
-
-Add to `.env`:
-
-```env
-# Azure OpenAI LLM for HyDE (required)
-AZURE_LLM_DEPLOYMENT=gpt-4
-AZURE_LLM_API_VERSION=2024-02-01
-
-# Or for regular OpenAI
-OPENAI_LLM_MODEL=gpt-4
-```
-
-### Required for BM25 Keyword Search:
-
-**Option 1 (Recommended)**: Store content in Qdrant payload
-
-```python
-# In client configuration, disable MongoDB content storage
-# This keeps content in Qdrant, making it available for BM25
-```
-
-**Option 2**: Enhance BM25 to fetch from MongoDB
-
-```python
-# In keyword_search.py, add MongoDB content fetching
-# Similar to how vector search fetches content
-```
-
-### Current Behavior (Graceful Degradation):
-
-| Feature | Config Missing | Behavior |
-|---------|---------------|----------|
-| HyDE | No LLM deployment | Falls back to original query |
-| BM25 | No content in Qdrant | Skips keyword search |
-| Both | Partial config | Vector search still works |
-
-______________________________________________________________________
-
-## 🏗️ Enhanced Architecture
-
-### Phase 2 Retrieval Pipeline
-
-```
-User Query
- ↓
-┌─────────────────────────────────────┐
-│ STEP 1: Query Generation (Phase 2) │
-│ - HyDE: Generate hypothetical answer│
-│ - Standard: Optimize query │
-│ - Fallback: Use original on error │
-└──────────────┬──────────────────────┘
- ↓
-┌─────────────────────────────────────┐
-│ STEP 2: Dual Vector Search │
-│ - Search 1: Standard query (25) │
-│ - Search 2: HyDE query (25) │
-│ - Total: 50 chunks │
-└──────────────┬──────────────────────┘
- ↓
-┌─────────────────────────────────────┐
-│ STEP 3: BM25 Keyword Search │
-│ - Build BM25 corpus │
-│ - Search: 50 chunks │
-│ - Fallback: Skip if unavailable │
-└──────────────┬──────────────────────┘
- ↓
-┌─────────────────────────────────────┐
-│ STEP 4: Deduplicate │
-│ - Combine: Vector + Keyword │
-│ - Remove duplicates by chunk_id │
-│ - Keep highest scores │
-│ - Result: ~50-70 unique chunks │
-└──────────────┬──────────────────────┘
- ↓
-┌─────────────────────────────────────┐
-│ STEP 5: Reranking (Phase 3) │
-│ - Placeholder for Cohere reranking │
-│ - Currently: Sort by score │
-└──────────────┬──────────────────────┘
- ↓
-┌─────────────────────────────────────┐
-│ STEP 6: Selection & Formatting │
-│ - Apply score_threshold │
-│ - Select top_k chunks │
-│ - Fetch MongoDB content │
-│ - Calculate source stats │
-└──────────────┬──────────────────────┘
- ↓
- Top-k Results
-```
-
-______________________________________________________________________
-
-## 📈 Performance Characteristics
-
-### Typical Retrieval Times (Phase 2)
-
-| Operation | Time (ms) | Notes |
-|-----------|-----------|-------|
-| Query Generation (HyDE) | 1200-1700 | LLM call for query optimization |
-| Vector Search (2x) | 2900-7000 | Depends on collection size |
-| Keyword Search (BM25) | 280-300 | Depends on corpus size |
-| Deduplication | ~50 | Very fast |
-| MongoDB Fetch | ~800 | Depends on chunk count |
-| **TOTAL** | **9000-13000** | With all features enabled |
-
-### Chunk Flow (Phase 2)
-
-```
-Initial: 50 (vector) + 50 (keyword) = 100 chunks
- ↓
-After Dedup: ~50-70 unique chunks
- ↓
-After Filtering: varies (score_threshold applied)
- ↓
-Final: top_k chunks (default: 20)
-```
-
-______________________________________________________________________
-
-## 🔄 Comparison: Phase 1 vs Phase 2
-
-| Feature | Phase 1 MVP | Phase 2 (Current) |
-|---------|-------------|-------------------|
-| Query Generation | No | ✅ HyDE (optional) |
-| Vector Search | Dual (same query) | Dual (standard + HyDE) |
-| Keyword Search | No | ✅ BM25 (optional) |
-| Total Chunks | 50 | 100 (50+50) |
-| Deduplication | Yes | Yes (across all sources) |
-| Fallback Handling | Basic | Advanced graceful degradation |
-| Performance | ~2-5s | ~9-13s (with all features) |
-| Expected Quality | Baseline | +20-30% with HyDE, better exact matches with BM25 |
-
-______________________________________________________________________
-
-## 💡 Usage Examples
-
-### Basic Phase 2 Retrieval (All Features Enabled)
-
-```python
-from insta_rag import RAGClient, RAGConfig
-
-config = RAGConfig.from_env()
-client = RAGClient(config)
-
-# Phase 2 with HyDE + BM25 (enabled by default)
-response = client.retrieve(
- query="What is semantic chunking?",
- collection_name="knowledge_base",
- top_k=10,
- # enable_hyde=True, # Default
- # enable_keyword_search=True, # Default
-)
-
-print(f"Chunks returned: {len(response.chunks)}")
-print(f"Queries generated: {response.queries_generated}")
-print(f"Vector chunks: {response.retrieval_stats.vector_search_chunks}")
-print(f"Keyword chunks: {response.retrieval_stats.keyword_search_chunks}")
-```
-
-### Phase 1 Compatibility (Disable Phase 2 Features)
-
-```python
-# Fallback to Phase 1 behavior
-response = client.retrieve(
- query="What is semantic chunking?",
- collection_name="knowledge_base",
- top_k=10,
- enable_hyde=False, # Disable HyDE
- enable_keyword_search=False, # Disable BM25
-)
-```
-
-### Selective Feature Usage
-
-```python
-# Only use HyDE (no keyword search)
-response = client.retrieve(
- query="complex question",
- collection_name="knowledge_base",
- enable_hyde=True,
- enable_keyword_search=False,
-)
-
-# Only use BM25 (no HyDE)
-response = client.retrieve(
- query="exact term match",
- collection_name="knowledge_base",
- enable_hyde=False,
- enable_keyword_search=True,
-)
-```
-
-### API Call (cURL)
-
-```bash
-curl -X POST "http://localhost:8000/api/v1/retrieve" \
- -H "Content-Type: application/json" \
- -d '{
- "query": "What is semantic chunking?",
- "collection_name": "insta_rag_test_collection",
- "top_k": 10,
- "enable_hyde": true,
- "enable_keyword_search": true,
- "deduplicate": true
- }'
-```
-
-______________________________________________________________________
-
-## ⚠️ Known Limitations
-
-### 1. HyDE Requires LLM Deployment
-
-**Issue**: HyDE query generation requires Azure OpenAI LLM deployment configured
-
-**Current Behavior**: Falls back to original query if deployment missing
-
-**Solution**: Add to `.env`:
-
-```env
-AZURE_LLM_DEPLOYMENT=gpt-4
-```
-
-### 2. BM25 Requires Content in Qdrant Payload
-
-**Issue**: When content is stored in MongoDB (not Qdrant), the BM25 corpus is empty
-
-**Current Behavior**: BM25 search is skipped if corpus unavailable
-
-**Workaround Options**:
-
-- Option A: Store content in Qdrant payload (disable MongoDB content storage)
-- Option B: Enhance BM25Searcher to fetch content from MongoDB during corpus building
-- Option C: Use vector search only (still provides good results)
-
-### 3. Performance Overhead
-
-**Issue**: Phase 2 adds ~5-8s overhead compared to Phase 1
-
-**Mitigation**:
-
-- Disable features selectively based on use case
-- HyDE overhead: ~1.3s (worthwhile for quality improvement)
-- BM25 overhead: ~0.3s (negligible)
-- Trade quality for speed by disabling features
-
-______________________________________________________________________
-
-## 🧪 Testing Phase 2
-
-### Run the Test Suite
-
-```bash
-# Using venv
-venv/bin/python test_phase2_retrieve.py
-
-# Or activate venv first
-source venv/bin/activate
-python test_phase2_retrieve.py
-```
-
-### Expected Test Output
-
-```
-✓ All tests passed!
-✓ HyDE query generation working
-✓ BM25 keyword search working
-✓ Hybrid search combining all methods
-✓ Deduplication working across all sources
-
-🎉 Phase 2 implementation complete and working!
-```
-
-______________________________________________________________________
-
-## 🚀 Next Steps (Future Phases)
-
-### Phase 3: Cohere Reranking
-
-**Goal**: Re-rank results using cross-encoder for 30-40% better relevance
-
-**Tasks**:
-
-- [ ] Implement CohereReranker class
-- [ ] Integrate Cohere Rerank 3.5 API
-- [ ] Add fallback (use vector scores if API fails)
-- [ ] Update `retrieve()` to rerank when `enable_reranking=True`
-- [ ] Test and benchmark improvements
-
-**Required Config**:
-
-```env
-COHERE_API_KEY=your_cohere_api_key
-```
-
-______________________________________________________________________
-
-## ✅ Phase 2 Success Criteria
-
-All criteria met:
-
-- [x] HyDE query generator implemented and tested
-- [x] BM25 keyword search implemented and tested
-- [x] `retrieve()` method integrates both features
-- [x] Enabled by default in API and client
-- [x] Graceful fallback for missing configuration
-- [x] Comprehensive test suite created
-- [x] Documentation updated
-- [x] Performance acceptable (< 15 seconds)
-- [x] Error handling robust
-- [x] MongoDB integration maintained
-
-______________________________________________________________________
-
-## 🎓 Key Learnings
-
-1. **Graceful Degradation**: Phase 2 works even with partial configuration
-1. **Hybrid Search Benefits**: Combining semantic + keyword search catches more relevant results
-1. **HyDE Trade-off**: 1.3s overhead worthwhile for 20-30% quality improvement
-1. **BM25 Limitation**: Requires content in searchable format (Qdrant or MongoDB fetch)
-1. **Deduplication Critical**: Prevents duplicate results across multiple search methods
-1. **Performance Acceptable**: 9-13s total time is reasonable for production RAG
-
-______________________________________________________________________
-
-## 📝 Documentation References
-
-- **Implementation Plan**: `RETRIEVAL_IMPLEMENTATION_PLAN.md`
-- **Phase 1 Summary**: `PHASE1_COMPLETION_SUMMARY.md`
-- **Test Suite**: `test_phase2_retrieve.py`
-- **API Reference**: `testing_api/openapi.yaml`
-
-______________________________________________________________________
-
-## ✅ Conclusion
-
-**Phase 2 is COMPLETE and PRODUCTION-READY!**
-
-The implementation provides:
-
-- ✅ HyDE query generation for better retrieval
-- ✅ BM25 keyword search for exact matches
-- ✅ Hybrid search combining all methods
-- ✅ Graceful fallback for missing config
-- ✅ Comprehensive performance tracking
-- ✅ Easy path to Phase 3 (Cohere reranking)
-
-**Production Status**:
-
-- ✅ Core functionality: Production-ready
-- ⚠️ HyDE: Requires LLM deployment configuration
-- ⚠️ BM25: Requires content in Qdrant or MongoDB fetch enhancement
-
-**Recommended Next Action**: Configure LLM deployment for HyDE, then move to Phase 3 (Cohere reranking) 🚀
-
-______________________________________________________________________
-
-**Date**: 2025-10-09
-**Version**: Phase 2 Complete
-**Status**: ✅ WORKING
diff --git a/docs/PHASE2_QUICK_START.md b/docs/PHASE2_QUICK_START.md
deleted file mode 100644
index c000a9d..0000000
--- a/docs/PHASE2_QUICK_START.md
+++ /dev/null
@@ -1,275 +0,0 @@
-# Phase 2 Quick Start Guide
-
-## 🚀 What's New in Phase 2?
-
-Phase 2 adds **HyDE query generation** and **BM25 keyword search** to your RAG system, both **enabled by default** for better retrieval quality.
-
-______________________________________________________________________
-
-## ⚡ Quick Usage
-
-### Python Client
-
-```python
-from insta_rag import RAGClient, RAGConfig
-
-config = RAGConfig.from_env()
-client = RAGClient(config)
-
-# Phase 2 retrieval (HyDE + BM25 enabled by default)
-response = client.retrieve(
- query="What is semantic chunking?", collection_name="your_collection", top_k=10
-)
-
-# Check what happened
-print(f"✓ Retrieved {len(response.chunks)} chunks")
-print(f"✓ Queries: {response.queries_generated}")
-print(f"✓ Vector chunks: {response.retrieval_stats.vector_search_chunks}")
-print(f"✓ Keyword chunks: {response.retrieval_stats.keyword_search_chunks}")
-print(f"✓ Total time: {response.retrieval_stats.total_time_ms:.2f}ms")
-```
-
-### API Call
-
-```bash
-curl -X POST "http://localhost:8000/api/v1/retrieve" \
- -H "Content-Type: application/json" \
- -d '{
- "query": "your question here",
- "collection_name": "your_collection",
- "top_k": 10
- }'
-```
-
-______________________________________________________________________
-
-## ⚙️ Configuration
-
-### Required for Full Phase 2 Functionality
-
-Add to `.env` file:
-
-```env
-# For HyDE query generation (REQUIRED)
-AZURE_LLM_DEPLOYMENT=gpt-4
-AZURE_LLM_API_VERSION=2024-02-01
-
-# For BM25 keyword search (OPTIONAL)
-# Either store content in Qdrant OR enhance BM25 to fetch from MongoDB
-# Current: Gracefully skips if unavailable
-```
-
-### What Happens Without Config?
-
-| Feature | Missing Config | Behavior |
-|---------|---------------|----------|
-| HyDE | No LLM deployment | ✅ Falls back to original query (still works) |
-| BM25 | No content in Qdrant | ✅ Skips keyword search (vector search still works) |
-
-**Result**: System still works, just without the enhanced features.
-
-______________________________________________________________________
-
-## 🎛️ Feature Control
-
-### Use All Features (Default)
-
-```python
-response = client.retrieve(
- query="your question",
- collection_name="your_collection",
- # enable_hyde=True, # Default
- # enable_keyword_search=True, # Default
-)
-```
-
-### Disable Specific Features
-
-```python
-# Only vector search (like Phase 1)
-response = client.retrieve(
- query="your question",
- collection_name="your_collection",
- enable_hyde=False,
- enable_keyword_search=False,
-)
-
-# Only HyDE (no BM25)
-response = client.retrieve(
- query="your question",
- collection_name="your_collection",
- enable_hyde=True,
- enable_keyword_search=False,
-)
-
-# Only BM25 (no HyDE)
-response = client.retrieve(
- query="your question",
- collection_name="your_collection",
- enable_hyde=False,
- enable_keyword_search=True,
-)
-```
-
-______________________________________________________________________
-
-## 📊 Performance Expectations
-
-| Configuration | Time | Quality |
-|--------------|------|---------|
-| Vector only (Phase 1) | ~2-5s | Baseline |
-| + HyDE | ~4-7s | +20-30% better |
-| + BM25 | ~3-6s | Better exact matches |
-| + HyDE + BM25 | ~9-13s | Best quality |
-
-______________________________________________________________________
-
-## 🧪 Test Your Setup
-
-```bash
-# Run Phase 2 test suite
-venv/bin/python test_phase2_retrieve.py
-
-# Expected output:
-# ✓ All tests passed!
-# ✓ HyDE query generation working
-# ✓ BM25 keyword search working
-# 🎉 Phase 2 implementation complete and working!
-```
-
-______________________________________________________________________
-
-## 🔍 Understanding the Results
-
-```python
-response = client.retrieve(query="...", collection_name="...")
-
-# Generated queries
-print(response.queries_generated)
-# {
-# "original": "your original query",
-# "standard": "optimized query",
-# "hyde": "hypothetical answer that would match your query"
-# }
-
-# Performance breakdown
-stats = response.retrieval_stats
-print(f"Query generation: {stats.query_generation_time_ms}ms") # HyDE
-print(f"Vector search: {stats.vector_search_time_ms}ms") # Semantic
-print(f"Keyword search: {stats.keyword_search_time_ms}ms") # BM25
-print(f"Total: {stats.total_time_ms}ms")
-
-# Chunk counts
-print(f"Vector chunks: {stats.vector_search_chunks}") # Usually ~50
-print(f"Keyword chunks: {stats.keyword_search_chunks}") # Usually ~50
-print(f"Total retrieved: {stats.total_chunks_retrieved}") # Combined
-print(f"After dedup: {stats.chunks_after_dedup}") # Unique
-print(f"Final returned: {len(response.chunks)}") # top_k
-```
-
-______________________________________________________________________
-
-## 🐛 Troubleshooting
-
-### "Warning: HyDE generation failed"
-
-**Cause**: Azure OpenAI LLM deployment not configured
-
-**Fix**: Add to `.env`:
-
-```env
-AZURE_LLM_DEPLOYMENT=gpt-4
-```
-
-**Impact**: Falls back to original query, retrieval still works
-
-______________________________________________________________________
-
-### "Warning: BM25 index not available"
-
-**Cause**: Content stored in MongoDB, not in Qdrant payload
-
-**Fix Options**:
-
-1. Store content in Qdrant (disable MongoDB content storage)
-1. Enhance BM25 to fetch from MongoDB (future improvement)
-1. Accept vector-only search (still provides good results)
-
-**Impact**: Skips keyword search, vector search still works
-
-______________________________________________________________________
-
-### Slow Performance (>20s)
-
-**Cause**: All features enabled, large collection
-
-**Fix**: Disable features selectively:
-
-```python
-# Fast mode (vector only)
-response = client.retrieve(
- query="...",
- collection_name="...",
- enable_hyde=False,
- enable_keyword_search=False,
-)
-```
-
-______________________________________________________________________
-
-## 📚 Next Steps
-
-1. **Configure LLM Deployment** for HyDE (recommended)
-
- ```env
- AZURE_LLM_DEPLOYMENT=gpt-4
- ```
-
-1. **Test with Your Data**
-
- ```python
- response = client.retrieve(query="...", collection_name="...")
- ```
-
-1. **Monitor Performance**
-
- ```python
- print(response.retrieval_stats.to_dict())
- ```
-
-1. **Optimize Settings**
-
- - Adjust `top_k` based on use case
- - Enable/disable features based on performance needs
- - Set `score_threshold` to filter low-quality results
-
-______________________________________________________________________
-
-## 📖 Full Documentation
-
-- **Implementation Details**: `PHASE2_COMPLETION_SUMMARY.md`
-- **Architecture & Planning**: `RETRIEVAL_IMPLEMENTATION_PLAN.md`
-- **Test Suite**: `test_phase2_retrieve.py`
-- **API Documentation**: `testing_api/openapi.yaml`
-
-______________________________________________________________________
-
-## ✅ Ready to Use!
-
-Phase 2 is **production-ready** with graceful fallbacks. Start using it now:
-
-```python
-from insta_rag import RAGClient, RAGConfig
-
-config = RAGConfig.from_env()
-client = RAGClient(config)
-
-response = client.retrieve(
- query="your question here",
- collection_name="your_collection",
-)
-
-# That's it! Phase 2 features are enabled by default.
-```
-
-🎉 **Enjoy better retrieval with HyDE + BM25!**
diff --git a/docs/QUICK_FIX_QDRANT.md b/docs/QUICK_FIX_QDRANT.md
deleted file mode 100644
index 177041a..0000000
--- a/docs/QUICK_FIX_QDRANT.md
+++ /dev/null
@@ -1,210 +0,0 @@
-# Quick Fix for Qdrant Connection Timeout
-
-## The Problem
-
-Your Qdrant server at `qdrant-okc4ss8owk0ggwg4ccwsoks0.aibuddy-coolify-inventory.aukikaurnab.com` is timing out.
-
-## Solution 1: Use Local Qdrant (Fastest, Recommended for Testing)
-
-### Start Local Qdrant with Docker
-
-```bash
-docker run -d -p 6333:6333 -p 6334:6334 qdrant/qdrant
-```
-
-### Update .env
-
-Edit `/home/macorov/Documents/GitHub/insta_rag/.env`:
-
-```env
-# Comment out remote Qdrant (add # at start)
-# QDRANT_URL=https://qdrant-okc4ss8owk0ggwg4ccwsoks0.aibuddy-coolify-inventory.aukikaurnab.com/
-# QDRANT_API_KEY=edfBd7pP251ev2uiRcjcBGt7QXJe1P70
-
-# Add local Qdrant
-QDRANT_URL=http://localhost:6333
-QDRANT_API_KEY=
-```
-
-### Test It
-
-```bash
-curl http://localhost:6333/collections
-# Should return: {"result":{"collections":[]}}
-```
-
-### Run Your API
-
-```bash
-cd /home/macorov/Documents/GitHub/insta_rag/testing_api
-./run.sh
-```
-
-**Done!** Your tests will now use local Qdrant with zero latency.
-
-______________________________________________________________________
-
-## Solution 2: Fix Remote Qdrant Connection
-
-If you need to use the remote Qdrant server:
-
-### Step 1: Test Basic Connectivity
-
-```bash
-# Test if server is reachable
-curl -I --max-time 10 "https://qdrant-okc4ss8owk0ggwg4ccwsoks0.aibuddy-coolify-inventory.aukikaurnab.com/"
-```
-
-**If this times out:** The server is not accessible from your network.
-
-- Check if server is running
-- Check firewall settings
-- Try from different network
-
-### Step 2: Increase Timeout Significantly
-
-Edit `/home/macorov/Documents/GitHub/insta_rag/src/insta_rag/vectordb/qdrant.py`:
-
-```python
-def __init__(
- self,
- url: str,
- api_key: str,
- timeout: int = 300, # Change from 60 to 300 (5 minutes)
- prefer_grpc: bool = False,
-):
-```
-
-### Step 3: Add Retry Logic
-
-Edit `/home/macorov/Documents/GitHub/insta_rag/src/insta_rag/vectordb/qdrant.py`, find the `_initialize_client` method:
-
-```python
-def _initialize_client(self):
- """Initialize Qdrant client."""
-    try:
-        import time
-
-        from qdrant_client import QdrantClient
-        from qdrant_client.models import Distance, VectorParams
-
- # Store for later use
- self.Distance = Distance
- self.VectorParams = VectorParams
-
- # Try multiple times with increasing timeout
- for attempt in range(3):
- try:
- self.client = QdrantClient(
- url=self.url,
- api_key=self.api_key,
- timeout=self.timeout * (attempt + 1), # Increasing timeout
- prefer_grpc=self.prefer_grpc,
- )
- # Test connection
- self.client.get_collections()
- print(f"✓ Qdrant connected (attempt {attempt + 1})")
- break
-            except Exception as e:
-                if attempt < 2:
-                    print(f"⚠ Attempt {attempt + 1} failed: {e}, retrying...")
-                    time.sleep(2)
-                else:
-                    raise
-
- except ImportError as e:
- raise VectorDBError(
- "Qdrant client not installed. Install with: pip install qdrant-client"
- ) from e
- except Exception as e:
- raise VectorDBError(f"Failed to initialize Qdrant client: {str(e)}") from e
-```
-
-______________________________________________________________________
-
-## Solution 3: Use Qdrant Cloud (Free Tier)
-
-### Get Free Qdrant Cloud Account
-
-1. Go to https://cloud.qdrant.io
-1. Sign up (free tier available)
-1. Create a cluster
-1. Copy the URL and API key
-
-### Update .env
-
-```env
-QDRANT_URL=https://your-cluster-xyz.cloud.qdrant.io
-QDRANT_API_KEY=your_api_key_from_qdrant_cloud
-```
-
-______________________________________________________________________
-
-## Quick Test Commands
-
-### Test if Qdrant is accessible
-
-```bash
-# Test remote Qdrant
-curl --max-time 10 "https://qdrant-okc4ss8owk0ggwg4ccwsoks0.aibuddy-coolify-inventory.aukikaurnab.com/collections" \
- -H "api-key: edfBd7pP251ev2uiRcjcBGt7QXJe1P70"
-
-# Test local Qdrant (if running)
-curl http://localhost:6333/collections
-```
-
-### Test from Python
-
-```python
-from qdrant_client import QdrantClient
-
-# Test remote
-try:
- client = QdrantClient(
- url="https://qdrant-okc4ss8owk0ggwg4ccwsoks0.aibuddy-coolify-inventory.aukikaurnab.com/",
- api_key="edfBd7pP251ev2uiRcjcBGt7QXJe1P70",
- prefer_grpc=False,
- timeout=120,
- )
- print("Remote:", client.get_collections())
-except Exception as e:
- print("Remote failed:", e)
-
-# Test local
-try:
- client = QdrantClient(url="http://localhost:6333", prefer_grpc=False)
- print("Local:", client.get_collections())
-except Exception as e:
- print("Local failed:", e)
-```
-
-______________________________________________________________________
-
-## My Recommendation
-
-**Use local Qdrant for testing:**
-
-```bash
-# 1. Start Qdrant
-docker run -d -p 6333:6333 qdrant/qdrant
-
-# 2. Update .env
-QDRANT_URL=http://localhost:6333
-QDRANT_API_KEY=
-
-# 3. Test
-curl http://localhost:6333/collections
-
-# 4. Run your app
-cd testing_api && ./run.sh
-```
-
-**Why?**
-
-- ✅ Zero latency
-- ✅ No timeout issues
-- ✅ Works offline
-- ✅ Perfect for development
-- ✅ Free
-
-You can always switch back to remote Qdrant later when connectivity is fixed!
diff --git a/docs/README.md b/docs/README.md
new file mode 100644
index 0000000..c0e1b72
--- /dev/null
+++ b/docs/README.md
@@ -0,0 +1,17 @@
+# Welcome to insta_rag
+
+`insta_rag` is a powerful, modular, and extensible Python library for building Retrieval-Augmented Generation (RAG) pipelines. It provides a simple yet flexible interface to handle the entire RAG lifecycle, from document ingestion and chunking to advanced retrieval and reranking.
+
+This documentation provides a comprehensive guide to installing, using, and understanding the library.
+
+## Where to Start
+
+- **[Installation](./installation.md):** Get the library installed in your environment.
+- **[Quickstart](./quickstart.md):** A hands-on guide to get you started with the core features in minutes.
+
+## In-Depth Guides
+
+- **[Document Management](./guides/document-management.md):** Learn how to add, update, and manage documents in your knowledge base.
+- **[Advanced Retrieval](./guides/retrieval.md):** A deep dive into the hybrid retrieval pipeline, including HyDE, keyword search, and reranking.
+- **[Storage Backends](./guides/storage-backends.md):** Understand how to configure storage, including the hybrid Qdrant and MongoDB setup.
+- **[Local Development](./guides/local-development.md):** Tips for setting up a local development environment, including running a local Qdrant instance.
diff --git a/docs/RERANKER_SCORES.md b/docs/RERANKER_SCORES.md
deleted file mode 100644
index 1900b20..0000000
--- a/docs/RERANKER_SCORES.md
+++ /dev/null
@@ -1,101 +0,0 @@
-# Understanding Reranker Scores
-
-## BGE Reranker (BAAI/bge-reranker-v2-m3)
-
-The BGE reranker produces **negative scores** by default. This is normal behavior for this model.
-
-### Score Characteristics:
-
-- **Range**: Typically -10.0 to +10.0
-- **Higher is better**: -0.96 is MORE relevant than -6.99
-- **Most relevant results**: Usually between -3.0 and +5.0
-- **Less relevant results**: Usually below -5.0
-
-### Score Interpretation:
-
-| Score Range | Relevance Level | Description |
-|-------------|-----------------|-------------|
-| +3.0 to +10.0 | Excellent | Highly relevant, strong semantic match |
-| 0.0 to +3.0 | Very Good | Strong relevance to query |
-| -3.0 to 0.0 | Good | Relevant, useful results |
-| -5.0 to -3.0 | Moderate | Some relevance, may be useful |
-| Below -5.0 | Low | Weak relevance, consider filtering |
-
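-For quick scanning of logs, a small helper can map raw scores onto these bands (a sketch derived from the table above; the bands are rough guidelines, not guarantees from the model):
-
-```python
-def bge_relevance_label(score: float) -> str:
-    """Map a raw BGE reranker score to the bands in the table above."""
-    if score >= 3.0:
-        return "excellent"
-    if score >= 0.0:
-        return "very good"
-    if score >= -3.0:
-        return "good"
-    if score >= -5.0:
-        return "moderate"
-    return "low"
-
-
-print(bge_relevance_label(-0.96))  # "good"
-```
-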
-### Using Score Thresholds:
-
-When using `score_threshold` with BGE reranker, use **negative values**:
-
-```python
-# Good examples for BGE reranker:
-response = client.retrieve(
- query="What is semantic chunking?",
- collection_name="my_collection",
- top_k=20,
- score_threshold=-5.0, # ✅ Filter out weakly relevant results
-)
-
-# Or for stricter filtering:
-response = client.retrieve(
- query="What is semantic chunking?",
- collection_name="my_collection",
- top_k=20,
- score_threshold=-3.0, # ✅ Keep only moderately to highly relevant
-)
-
-# ❌ WRONG - This will filter out ALL results:
-response = client.retrieve(
- query="What is semantic chunking?",
- collection_name="my_collection",
- top_k=20,
- score_threshold=0.01, # ❌ Too high for BGE negative scores
-)
-```
-
-### Example Output:
-
-```
-🎯 Reranking:
- Reranking 33 chunks using bge...
- ✓ Reranked to 20 chunks (398.11ms)
- ✓ Score range: -6.9961 to -0.9624 # Higher (less negative) is better!
-
- After score threshold (-5.0): 15 chunks # Kept results with score >= -5.0
-```
-
-### Recommended Thresholds by Use Case:
-
-| Use Case | Recommended Threshold | Rationale |
-|----------|----------------------|-----------|
-| General retrieval | No threshold or -6.0 | Keep most results |
-| Document generation | -5.0 | Moderate quality filter |
-| Question answering | -3.0 | High quality only |
-| Fact verification | -2.0 | Very high confidence |
-
-## Cohere Reranker (Alternative)
-
-If you prefer **normalized 0-1 scores**, you can use Cohere reranker instead:
-
-```python
-# In .env:
-COHERE_API_KEY=your_cohere_key_here
-
-# Cohere produces 0-1 scores:
-# - 1.0 = Perfect match
-# - 0.8-1.0 = Highly relevant
-# - 0.5-0.8 = Moderately relevant
-# - Below 0.5 = Less relevant
-
-response = client.retrieve(
- query="What is semantic chunking?",
- collection_name="my_collection",
- top_k=20,
- score_threshold=0.5, # ✅ Works well with Cohere's 0-1 range
-)
-```
-
-## Key Takeaways:
-
-1. **BGE reranker** uses negative scores (higher = better)
-1. **Use negative thresholds** like -5.0, -3.0, or -2.0 with BGE
-1. **No threshold** (None) returns all results sorted by relevance
-1. **Different models** use different score ranges - adjust accordingly
diff --git a/docs/RETRIEVAL_IMPLEMENTATION_PLAN.md b/docs/RETRIEVAL_IMPLEMENTATION_PLAN.md
deleted file mode 100644
index c6757de..0000000
--- a/docs/RETRIEVAL_IMPLEMENTATION_PLAN.md
+++ /dev/null
@@ -1,765 +0,0 @@
-# Advanced Retrieval Method - Implementation Plan
-
-## 🎯 Objective
-
-Implement a comprehensive `retrieve()` method for RAGClient that uses hybrid search (vector + keyword) with HyDE query generation and Cohere reranking.
-
-______________________________________________________________________
-
-## 📊 Current State Analysis
-
-### ✅ What Already Exists
-
-1. **Vector Search (Qdrant)**
-
- - `QdrantVectorDB.search()` - WORKING
- - Returns `VectorSearchResult` objects
- - Supports metadata filters
- - Uses `query_points()` method (updated API)
-
-1. **Embeddings (Azure OpenAI)**
-
- - `OpenAIEmbedder.embed()` - batch embedding
- - `OpenAIEmbedder.embed_query()` - single query embedding
- - 3072-dimensional vectors
-
-1. **MongoDB Integration**
-
- - `MongoDBClient.get_chunk_content_by_mongo_id()` - fetch content
- - Hybrid storage working
-
-1. **Response Models**
-
- - `RetrievalResponse` - complete response structure
- - `RetrievedChunk` - individual result
- - `RetrievalStats` - performance metrics
- - `SourceInfo` - source aggregation
-
-1. **Basic Search Method**
-
- - `RAGClient.search()` - simple vector search
- - Already implemented and working
-
-### ❌ What Needs to be Built
-
-1. **HyDE Query Generation**
-
- - LLM call to generate hypothetical answer
- - Structured output for standard + HyDE queries
- - Error handling for LLM failures
-
-1. **BM25 Keyword Search**
-
- - BM25 algorithm implementation OR integration
- - Document corpus indexing
- - Query tokenization
- - Metadata filtering support
-
-1. **Deduplication Logic**
-
- - Hash-based or ID-based dedup
- - Score preservation (keep highest)
- - Efficient merging of results
-
-1. **Cohere Reranking Integration**
-
- - Cohere API client setup
- - Batch reranking calls
- - Score normalization
- - Error handling / fallback
-
-1. **Advanced Retrieve Method**
-
- - Orchestrate all 6 steps
- - Comprehensive error handling
- - Performance tracking
- - Flexible mode switching
-
-1. **API Endpoints**
-
- - `/api/v1/retrieve` endpoint
- - Request/response models
- - Testing endpoints
-
-______________________________________________________________________
-
-## 🏗️ Architecture Design
-
-### Component Hierarchy
-
-```
-RAGClient
-├── retrieve() [NEW - Main orchestration method]
-│ │
-│ ├── Step 1: Query Generation
-│ │ └── HyDEQueryGenerator [NEW]
-│ │ ├── Uses LLMConfig
-│ │ └── Generates standard + HyDE queries
-│ │
-│ ├── Step 2: Vector Search (Dual)
-│ │ ├── OpenAIEmbedder.embed_query() [EXISTS]
-│ │ └── QdrantVectorDB.search() [EXISTS]
-│ │
-│ ├── Step 3: Keyword Search
-│ │ └── BM25Searcher [NEW]
-│ │ ├── Uses MongoDB/Qdrant for corpus
-│ │ └── Returns scored chunks
-│ │
-│ ├── Step 4: Deduplication
-│ │ └── deduplicate_chunks() [NEW]
-│ │ └── Merge & remove duplicates
-│ │
-│ ├── Step 5: Reranking
-│ │ └── CohereReranker [NEW]
-│ │ ├── Uses RerankingConfig
-│ │ └── Cross-encoder scoring
-│ │
-│ └── Step 6: Selection & Formatting
-│ └── format_results() [NEW]
-│ └── Apply threshold, limit, format
-│
-└── search() [EXISTS - Keep as simple alternative]
-```
-
-______________________________________________________________________
-
-## 📝 Detailed Implementation Plan
-
-### Phase 1: Core Infrastructure (Priority: HIGH)
-
-#### 1.1 HyDE Query Generator
-
-**File**: `src/insta_rag/retrieval/query_generator.py`
-
-**Status**: File exists, needs review and potential updates
-
-**Tasks**:
-
-- [ ] Review existing implementation
-- [ ] Add HyDE generation using Azure OpenAI
-- [ ] Use structured output (JSON mode)
-- [ ] Single LLM call for both queries
-- [ ] Error handling with fallback to original query
-
-**Implementation**:
-
-```python
-import json
-from typing import Dict
-
-
-class HyDEQueryGenerator:
-    def __init__(self, llm_client, deployment: str):
-        # llm_client: an OpenAI/AzureOpenAI-compatible chat client (assumption)
-        self.llm_client = llm_client
-        self.deployment = deployment
-
-    def generate_queries(self, query: str) -> Dict[str, str]:
-        """
-        Generate standard + HyDE queries in a single LLM call.
-
-        Returns:
-            {"standard": "optimized query", "hyde": "hypothetical answer"}
-        """
-        prompt = (
-            f"Given the query: {query}\n"
-            "Return a JSON object with:\n"
-            '1. "standard": an optimized search query\n'
-            '2. "hyde": a hypothetical answer to the query'
-        )
-        try:
-            # Structured output (JSON mode) keeps parsing trivial
-            response = self.llm_client.chat.completions.create(
-                model=self.deployment,
-                messages=[{"role": "user", "content": prompt}],
-                response_format={"type": "json_object"},
-            )
-            return json.loads(response.choices[0].message.content)
-        except Exception:
-            # Fall back to the original query if the LLM call fails
-            return {"standard": query, "hyde": query}
-```
-
-**Dependencies**: LLMConfig, Azure OpenAI client
-
-______________________________________________________________________
-
-#### 1.2 BM25 Keyword Search
-
-**File**: `src/insta_rag/retrieval/keyword_search.py`
-
-**Status**: File exists, needs review
-
-**Options**:
-
-1. **Use rank_bm25 library** (Python implementation)
-1. **Use Qdrant's payload search** (if available)
-1. **Custom implementation**
-
-**Recommended**: rank_bm25 library (easiest)
-
-**Tasks**:
-
-- [ ] Install rank_bm25: `pip install rank-bm25`
-- [ ] Build document corpus from collection
-- [ ] Implement BM25Searcher class
-- [ ] Support metadata filtering
-- [ ] Cache corpus for performance
-
-**Implementation**:
-
-```python
-from typing import Dict, List
-
-from rank_bm25 import BM25Okapi
-
-
-class BM25Searcher:
-    def __init__(self, rag_client, collection_name):
-        self.rag_client = rag_client
-        self.collection_name = collection_name
-        self.corpus = []  # Tokenized documents
-        self.chunk_metadata = []  # Corresponding metadata
-        self.bm25 = None
-        self._build_corpus()
-
-    def _build_corpus(self):
-        # Fetch all chunks from the collection and build the BM25 index.
-        # `iter_chunks()` is a placeholder; the real fetch API is TBD.
-        for chunk in self.rag_client.iter_chunks(self.collection_name):
-            self.corpus.append(chunk.content.lower().split())
-            self.chunk_metadata.append(chunk.metadata)
-        if self.corpus:
-            self.bm25 = BM25Okapi(self.corpus)
-
-    def search(self, query: str, limit: int, filters: Dict) -> List:
-        # Tokenize query, score the corpus, apply filters, return top results
-        if self.bm25 is None:
-            return []
-        scores = self.bm25.get_scores(query.lower().split())
-        ranked = sorted(enumerate(scores), key=lambda x: x[1], reverse=True)
-        results = []
-        for idx, score in ranked:
-            metadata = self.chunk_metadata[idx]
-            if filters and any(metadata.get(k) != v for k, v in filters.items()):
-                continue
-            results.append((idx, score, metadata))
-            if len(results) >= limit:
-                break
-        return results
-```
-
-**Challenges**:
-
-- Building corpus from Qdrant (need to fetch all docs)
-- Keeping index updated when docs are added
-- Memory usage for large collections
-
-**Alternative Approach** (if BM25 is too complex for now):
-
-- Skip keyword search initially
-- Set `enable_keyword_search=False` by default
-- Implement later as enhancement
-
-______________________________________________________________________
-
-#### 1.3 Cohere Reranker
-
-**File**: `src/insta_rag/retrieval/reranker.py`
-
-**Status**: File exists, needs review
-
-**Tasks**:
-
-- [ ] Review existing implementation
-- [ ] Add Cohere client integration
-- [ ] Implement rerank() method
-- [ ] Handle API errors gracefully
-- [ ] Add fallback (use vector scores if rerank fails)
-
-**Implementation**:
-
-```python
-import cohere
-
-
-class CohereReranker(BaseReranker):
- def __init__(self, api_key: str, model: str = "rerank-v3.5"):
- self.client = cohere.Client(api_key)
- self.model = model
-
- def rerank(
- self, query: str, chunks: List[Tuple[str, Dict]], top_k: int
- ) -> List[Tuple[int, float]]:
- """
- Rerank chunks using Cohere.
-
- Args:
- query: Search query
- chunks: List of (content, metadata) tuples
- top_k: Number to return
-
- Returns:
- List of (original_index, relevance_score) tuples
- """
- try:
- # Prepare documents
- documents = [chunk[0] for chunk in chunks]
-
- # Call Cohere Rerank API
- results = self.client.rerank(
- model=self.model,
- query=query,
- documents=documents,
- top_n=top_k,
- )
-
- # Return (index, score) pairs
- return [(r.index, r.relevance_score) for r in results.results]
-
- except Exception as e:
- # Fallback: return original order with dummy scores
- return [(i, 1.0 - (i * 0.01)) for i in range(min(top_k, len(chunks)))]
-```
-
-**Dependencies**:
-
-- Cohere API key (from config)
-- `cohere` Python package
-
-______________________________________________________________________
-
-### Phase 2: Helper Functions (Priority: MEDIUM)
-
-#### 2.1 Deduplication
-
-**File**: `src/insta_rag/retrieval/utils.py`
-
-**Tasks**:
-
-- [ ] Create utility functions
-- [ ] Hash-based deduplication
-- [ ] Keep highest score
-
-**Implementation**:
-
-```python
-def deduplicate_chunks(
- chunks: List[VectorSearchResult], key_func=lambda x: x.chunk_id
-) -> List[VectorSearchResult]:
- """
- Remove duplicate chunks, keeping highest score.
-
- Args:
- chunks: List of search results
- key_func: Function to extract unique key
-
- Returns:
- Deduplicated list
- """
- chunk_dict = {}
- for chunk in chunks:
- key = key_func(chunk)
- if key not in chunk_dict or chunk.score > chunk_dict[key].score:
- chunk_dict[key] = chunk
- return list(chunk_dict.values())
-```
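-
-A quick illustration with a stand-in result type (the real code operates on `VectorSearchResult` objects):
-
-```python
-from dataclasses import dataclass
-
-
-@dataclass
-class Result:  # stand-in for VectorSearchResult in this illustration
-    chunk_id: str
-    score: float
-
-
-hits = [Result("c1", 0.82), Result("c1", 0.91), Result("c2", 0.75)]
-print({r.chunk_id: r.score for r in deduplicate_chunks(hits)})
-# {'c1': 0.91, 'c2': 0.75}
-```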
-
-______________________________________________________________________
-
-#### 2.2 Result Formatting
-
-**File**: `src/insta_rag/retrieval/utils.py`
-
-**Tasks**:
-
-- [ ] Convert VectorSearchResult to RetrievedChunk
-- [ ] Apply score threshold
-- [ ] Truncate content if needed
-- [ ] Calculate source statistics
-
-**Implementation**:
-
-```python
-def format_retrieval_results(
- search_results: List,
- query: str,
- return_full_chunks: bool,
- score_threshold: Optional[float],
- mongodb_client: Optional[MongoDBClient],
-) -> List[RetrievedChunk]:
- """Format search results into RetrievedChunk objects."""
- # Fetch MongoDB content if needed
- # Apply score threshold
- # Truncate if not return_full_chunks
- # Add rank positions
- pass
-```
-
-______________________________________________________________________
-
-### Phase 3: Main Retrieve Method (Priority: HIGH)
-
-#### 3.1 RAGClient.retrieve()
-
-**File**: `src/insta_rag/core/client.py`
-
-**Tasks**:
-
-- [ ] Add retrieve() method to RAGClient
-- [ ] Orchestrate all 6 steps
-- [ ] Add comprehensive error handling
-- [ ] Track timing for each step
-- [ ] Support all modes (full hybrid, vector-only, etc.)
-
-**Implementation Structure**:
-
-```python
-def retrieve(
- self,
- query: str,
- collection_name: str,
- filters: Optional[Dict[str, Any]] = None,
- top_k: int = 20,
- enable_reranking: bool = True,
- enable_keyword_search: bool = True,
- enable_hyde: bool = True,
- score_threshold: Optional[float] = None,
- return_full_chunks: bool = True,
- deduplicate: bool = True,
-) -> RetrievalResponse:
- """
- Advanced hybrid retrieval with HyDE, BM25, and reranking.
-
- [Full docstring with all details]
- """
-
- # Step 1: Query Generation
- # Step 2: Dual Vector Search
- # Step 3: Keyword Search (BM25)
- # Step 4: Combine & Deduplicate
- # Step 5: Reranking
- # Step 6: Selection & Formatting
-
- # Return RetrievalResponse
-```
-
-______________________________________________________________________
-
-### Phase 4: API Endpoints (Priority: MEDIUM)
-
-#### 4.1 Retrieve Endpoint
-
-**File**: `testing_api/main.py`
-
-**Tasks**:
-
-- [ ] Add RetrieveRequest model
-- [ ] Add /api/v1/retrieve endpoint
-- [ ] Support all parameters
-- [ ] Add to OpenAPI spec
-
-**Implementation**:
-
-```python
-class RetrieveRequest(BaseModel):
- query: str
- collection_name: str
- filters: Optional[Dict[str, Any]] = None
- top_k: int = 20
- enable_reranking: bool = True
- enable_keyword_search: bool = True
- enable_hyde: bool = True
- score_threshold: Optional[float] = None
-
-
-@app.post("/api/v1/retrieve")
-async def retrieve_documents(request: RetrieveRequest):
- """Advanced retrieval with hybrid search."""
- response = rag_client.retrieve(**request.dict())
- return response.to_dict()
-```
-
-______________________________________________________________________
-
-### Phase 5: Testing & Documentation (Priority: HIGH)
-
-#### 5.1 Unit Tests
-
-**Tasks**:
-
-- [ ] Test HyDE query generation
-- [ ] Test BM25 search
-- [ ] Test deduplication
-- [ ] Test reranking
-- [ ] Test full pipeline
-
-#### 5.2 Integration Tests
-
-**Tasks**:
-
-- [ ] End-to-end retrieval test
-- [ ] Test with MongoDB hybrid storage
-- [ ] Test all retrieval modes
-- [ ] Performance benchmarking
-
-#### 5.3 Documentation
-
-**Tasks**:
-
-- [ ] API documentation
-- [ ] Usage examples
-- [ ] Performance guidelines
-- [ ] Troubleshooting guide
-
-______________________________________________________________________
-
-## 🚀 Implementation Phases & Timeline
-
-### Phase 1: MVP (Minimum Viable Product)
-
-**Goal**: Basic retrieve() method working
-
-**Components**:
-
-1. ✅ Basic retrieve() method structure (DONE - created retrieval_method.py)
-1. ⏳ Simple query generation (no HyDE initially)
-1. ⏳ Dual vector search using existing methods
-1. ⏳ Basic deduplication
-1. ⏳ NO keyword search (skip for MVP)
-1. ⏳ NO reranking (skip for MVP)
-1. ⏳ Basic API endpoint
-
-**Output**: Functional retrieve() that does dual vector search + dedup
-
-______________________________________________________________________
-
-### Phase 2: HyDE Integration
-
-**Goal**: Add query generation
-
-**Components**:
-
-1. Review/implement HyDEQueryGenerator
-1. LLM-based query optimization
-1. Structured output parsing
-1. Error handling
-
-**Output**: retrieve() with HyDE query generation
-
-______________________________________________________________________
-
-### Phase 3: Reranking Integration
-
-**Goal**: Add Cohere reranking
-
-**Components**:
-
-1. Review/implement CohereReranker
-1. API integration
-1. Error handling & fallback
-1. Performance optimization
-
-**Output**: retrieve() with reranking for better results
-
-______________________________________________________________________
-
-### Phase 4: BM25 Integration (Optional)
-
-**Goal**: Add keyword search
-
-**Components**:
-
-1. Implement BM25Searcher OR use library
-1. Corpus building
-1. Index management
-1. Hybrid fusion
-
-**Output**: Full hybrid search (vector + keyword)
-
-______________________________________________________________________
-
-## 📋 Dependencies & Requirements
-
-### Python Packages Needed
-
-```bash
-# Already installed (verify):
-qdrant-client>=1.7.0
-openai>=1.12.0
-pymongo>=4.6.0
-
-# Need to add to pyproject.toml:
-cohere>=4.47.0  # ✅ Already in dependencies
-rank-bm25>=0.2.2  # ❌ Need to add
-```
-
-### API Keys Needed
-
-```env
-# Already configured:
-AZURE_OPENAI_API_KEY ✅
-QDRANT_API_KEY ✅
-MONGO_CONNECTION_STRING ✅
-
-# Need to verify:
-COHERE_API_KEY=your_cohere_api_key_here
-```
-
-### Configuration Updates
-
-**File**: `src/insta_rag/core/config.py`
-
-- ✅ RerankingConfig exists
-- ✅ LLMConfig exists
-- ⏳ Verify all fields are present
-
-______________________________________________________________________
-
-## 🎯 Decision Points
-
-### Decision 1: BM25 Implementation
-
-**Options**:
-
-- **A.** Use rank_bm25 library (simple, fast to implement)
-- **B.** Custom implementation (more control)
-- **C.** Skip for now (focus on vector + reranking first)
-
-**Recommendation**: Start with **Option C** (skip), add later as **Option A**
-
-**Rationale**: Vector search + reranking provides ~80% of the value; BM25 adds the remaining ~20%
-
-______________________________________________________________________
-
-### Decision 2: HyDE Implementation
-
-**Options**:
-
-- **A.** Full LLM-based HyDE generation
-- **B.** Simple query expansion (synonyms, etc.)
-- **C.** Skip for MVP
-
-**Recommendation**: **Option A** for Phase 2, **Option C** for Phase 1 MVP
-
-**Rationale**: HyDE provides significant improvements but is not critical for the MVP
-
-______________________________________________________________________
-
-### Decision 3: Reranking Fallback
-
-**Options**:
-
-- **A.** Fail if Cohere API fails
-- **B.** Fall back to vector scores
-- **C.** Cache rerank results
-
-**Recommendation**: **Option B** (fallback)
-
-**Rationale**: System should work even if reranking fails
-
-______________________________________________________________________
-
-## 📊 Success Criteria
-
-### Phase 1 MVP Success
-
-- [ ] retrieve() method callable
-- [ ] Dual vector search working
-- [ ] Deduplication working
-- [ ] Returns RetrievalResponse
-- [ ] API endpoint working
-- [ ] No errors with test data
-
-### Full Implementation Success
-
-- [ ] All 6 steps working
-- [ ] HyDE improves results
-- [ ] Reranking improves relevance
-- [ ] BM25 catches exact matches
-- [ ] Performance < 2 seconds average
-- [ ] Comprehensive error handling
-- [ ] Full documentation
-
-______________________________________________________________________
-
-## 🔧 Implementation Order (Recommended)
-
-**Week 1: Foundation**
-
-1. Review existing retrieval code
-1. Create MVP retrieve() method (no HyDE, no BM25, no reranking)
-1. Add basic API endpoint
-1. Test with existing collections
-
-**Week 2: Core Features**
-
-5. Add HyDE query generation
-6. Add Cohere reranking
-7. Test performance improvements
-
-**Week 3: Advanced Features**
-
-8. Add BM25 keyword search (if needed)
-9. Optimize performance
-10. Add comprehensive tests
-
-**Week 4: Polish**
-
-11. Documentation
-12. Error handling refinement
-13. Production readiness
-
-______________________________________________________________________
-
-## 🚨 Risks & Mitigation
-
-### Risk 1: BM25 Corpus Building
-
-**Issue**: Building the BM25 corpus requires fetching all documents
-
-**Impact**: High memory usage, slow initialization
-
-**Mitigation**:
-
-- Use lazy loading
-- Cache corpus
-- Implement incremental updates
-- OR skip BM25 for now
-
-### Risk 2: Cohere API Rate Limits
-
-**Issue**: The reranking API may have rate limits
-
-**Impact**: Failed retrievals
-
-**Mitigation**:
-
-- Implement retry logic
-- Fallback to vector scores
-- Batch processing
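-
-A sketch of the retry-plus-fallback idea, reusing the `rerank()` interface defined earlier (backoff timings are illustrative):
-
-```python
-import time
-
-
-def rerank_with_retry(reranker, query, chunks, top_k, attempts=3):
-    """Retry reranking with exponential backoff; fall back to vector order."""
-    for attempt in range(attempts):
-        try:
-            return reranker.rerank(query, chunks, top_k)
-        except Exception:
-            if attempt < attempts - 1:
-                time.sleep(2**attempt)  # 1s, then 2s between attempts
-    # Fallback: keep the incoming (vector-score) order with dummy scores
-    return [(i, 1.0 - i * 0.01) for i in range(min(top_k, len(chunks)))]
-```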
-
-### Risk 3: Performance Degradation
-
-**Issue**: The 6-step pipeline may be slow
-
-**Impact**: Poor user experience
-
-**Mitigation**:
-
-- Parallel execution where possible
-- Caching
-- Mode switching (fast vs accurate)
-
-______________________________________________________________________
-
-## 📁 File Structure
-
-```
-src/insta_rag/
-├── core/
-│ ├── client.py [UPDATE - add retrieve() method]
-│ └── retrieval_method.py [NEW - orchestration logic]
-│
-├── retrieval/
-│ ├── __init__.py
-│ ├── base.py [EXISTS - review]
-│ ├── query_generator.py [EXISTS - review/update]
-│ ├── keyword_search.py [EXISTS - review/update]
-│ ├── reranker.py [EXISTS - review/update]
-│ ├── vector_search.py [EXISTS - review]
-│ └── utils.py [NEW - helper functions]
-│
-└── models/
- └── response.py [EXISTS - all models ready]
-
-testing_api/
-├── main.py [UPDATE - add /api/v1/retrieve endpoint]
-└── openapi.yaml [UPDATE - add retrieve spec]
-```
-
-______________________________________________________________________
-
-## ✅ Next Immediate Steps
-
-1. **Review existing retrieval modules**
-
- - Check what's in `src/insta_rag/retrieval/`
- - Identify what works vs needs updates
-
-1. **Build Phase 1 MVP**
-
- - Start with simple retrieve() (dual vector search)
- - No HyDE, no BM25, no reranking
- - Get it working end-to-end
-
-1. **Test MVP**
-
- - Upload test documents
- - Call retrieve()
- - Verify results
-
-1. **Iterate**
-
- - Add HyDE
- - Add reranking
- - Add BM25 (if needed)
-
-______________________________________________________________________
-
-This plan provides a clear roadmap from current state to full implementation, with flexibility to adjust based on priorities and challenges discovered during implementation.
diff --git a/docs/UPDATE_DOCUMENTS_GUIDE.md b/docs/UPDATE_DOCUMENTS_GUIDE.md
deleted file mode 100644
index e0ceb50..0000000
--- a/docs/UPDATE_DOCUMENTS_GUIDE.md
+++ /dev/null
@@ -1,381 +0,0 @@
-# Knowledge Base Update Operations - Complete Guide
-
-## Overview
-
-The `update_documents()` method provides comprehensive document management capabilities for your RAG knowledge base. This implementation adds flexible CRUD operations to the insta_rag library.
-
-## Features Implemented
-
-### 1. **Four Update Strategies**
-
-#### REPLACE Strategy
-
-- Delete existing documents and add new ones
-- Use Case: User uploads a new version of an existing document
-- Supports filtering by metadata or document IDs
-
-```python
-response = client.update_documents(
- collection_name="my_collection",
- update_strategy="replace",
- filters={"user_id": "123", "document_type": "report"},
- new_documents=[new_doc1, new_doc2],
-)
-```
-
-#### APPEND Strategy
-
-- Add new documents without deleting any existing ones
-- Use Case: Adding new documents to the knowledge base
-- Simple addition operation
-
-```python
-response = client.update_documents(
- collection_name="my_collection",
- update_strategy="append",
- new_documents=[new_doc1, new_doc2],
- metadata_updates={"category": "technical_docs"},
-)
-```
-
-#### DELETE Strategy
-
-- Remove documents matching specified criteria
-- Use Case: Remove outdated or irrelevant documents
-- Supports both filter-based and ID-based deletion
-
-```python
-# Delete by filters
-response = client.update_documents(
- collection_name="my_collection",
- update_strategy="delete",
- filters={"status": "archived"},
-)
-
-# Delete by document IDs
-response = client.update_documents(
- collection_name="my_collection",
- update_strategy="delete",
- document_ids=["doc-123", "doc-456"],
-)
-```
-
-#### UPSERT Strategy
-
-- Update if document exists, insert if it doesn't
-- Use Case: Synchronizing external data sources
-- Automatically detects existence by document_id
-
-```python
-response = client.update_documents(
- collection_name="my_collection",
- update_strategy="upsert",
- new_documents=[doc_with_id_1, doc_with_id_2],
-)
-```
-
-### 2. **Metadata-Only Updates**
-
-Update metadata fields without reprocessing content:
-
-```python
-response = client.update_documents(
- collection_name="my_collection",
- update_strategy="delete", # Placeholder
- filters={"document_type": "report"},
- metadata_updates={"status": "reviewed", "updated_at": "2025-01-01"},
- reprocess_chunks=False, # Key parameter!
-)
-```
-
-## Method Signature
-
-```python
-def update_documents(
- self,
- collection_name: str,
- update_strategy: str, # "replace", "append", "delete", "upsert"
- filters: Optional[Dict[str, Any]] = None,
- document_ids: Optional[List[str]] = None,
- new_documents: Optional[List[DocumentInput]] = None,
- metadata_updates: Optional[Dict[str, Any]] = None,
- reprocess_chunks: bool = True,
-) -> UpdateDocumentsResponse
-```
-
-### Parameters
-
-| Parameter | Type | Description |
-|-----------|------|-------------|
-| `collection_name` | `str` | Target Qdrant collection |
-| `update_strategy` | `str` | Operation type: "replace", "append", "delete", "upsert" |
-| `filters` | `Dict[str, Any]` | Metadata-based selection criteria |
-| `document_ids` | `List[str]` | Specific document IDs to target |
-| `new_documents` | `List[DocumentInput]` | Replacement or additional documents |
-| `metadata_updates` | `Dict[str, Any]` | Metadata fields to update |
-| `reprocess_chunks` | `bool` | If True, regenerate chunks and embeddings |
-
-### Response Structure
-
-```python
-@dataclass
-class UpdateDocumentsResponse:
- success: bool
- strategy_used: str
- documents_affected: int
- chunks_deleted: int
- chunks_added: int
- chunks_updated: int
- updated_document_ids: List[str]
- errors: List[str]
-```
-
-## Implementation Details
-
-### New Helper Methods in VectorDB Layer
-
-1. **`get_document_ids()`** - Get unique document IDs matching filters
-1. **`count_chunks()`** - Count chunks matching criteria
-1. **`get_chunk_ids_by_documents()`** - Get all chunk IDs for specific documents
-1. **`update_metadata()`** - Update metadata without reprocessing content
-
-### MongoDB Integration
-
-- Automatically handles content deletion from MongoDB when enabled
-- Maintains consistency between Qdrant and MongoDB
-- New methods: `delete_chunks_by_ids()`, `delete_chunks_by_document_ids()`
-
-### Error Handling
-
-- `ValidationError`: Invalid parameters or strategy
-- `CollectionNotFoundError`: Target collection doesn't exist
-- `NoDocumentsFoundError`: No documents match filters/IDs
-- `VectorDBError`: Qdrant operation failures
-
-## Testing
-
-### 1. **Run the Test Script**
-
-```bash
-# Make sure you're in the project directory
-cd /home/macorov/Documents/GitHub/insta_rag
-
-# Activate virtual environment
-source venv/bin/activate
-
-# Run the comprehensive test script
-python test_update_documents.py
-```
-
-This will test all 4 strategies plus metadata updates.
-
-### 2. **Use the Testing API**
-
-Start the testing API server:
-
-```bash
-cd testing_api
-uvicorn main:app --reload --host 0.0.0.0 --port 8000
-```
-
-Available endpoints:
-
-- **Generic Update**: `POST /api/v1/test/documents/update`
-- **Delete**: `POST /api/v1/test/documents/update/delete`
-- **Append**: `POST /api/v1/test/documents/update/append`
-- **Replace**: `POST /api/v1/test/documents/update/replace`
-- **Upsert**: `POST /api/v1/test/documents/update/upsert`
-- **Metadata**: `POST /api/v1/test/documents/update/metadata`
-
-### 3. **Example API Request**
-
-```bash
-curl -X POST "http://localhost:8000/api/v1/test/documents/update" \
- -H "Content-Type: application/json" \
- -d '{
- "collection_name": "test_collection",
- "update_strategy": "replace",
- "filters": {"user_id": "123"},
- "new_documents_text": ["Updated content for the document."],
- "metadata_updates": {"status": "updated"}
- }'
-```
-
-## Usage Examples
-
-### Example 1: Replace Documents by Filter
-
-```python
-from insta_rag import RAGClient, RAGConfig, DocumentInput
-
-# Initialize client
-config = RAGConfig.from_env()
-client = RAGClient(config)
-
-# Create replacement documents
-new_docs = [
- DocumentInput.from_text(
- text="Updated version of the document with new information.",
- metadata={"version": 2, "updated_by": "admin"},
- )
-]
-
-# Replace all documents matching the filter
-response = client.update_documents(
- collection_name="knowledge_base",
- update_strategy="replace",
- filters={"template_id": "report_template_v1"},
- new_documents=new_docs,
-)
-
-print(f"Replaced {response.documents_affected} documents")
-print(f"Deleted {response.chunks_deleted} chunks")
-print(f"Added {response.chunks_added} chunks")
-```
-
-### Example 2: Bulk Delete by Metadata
-
-```python
-# Delete all archived documents
-response = client.update_documents(
- collection_name="knowledge_base",
- update_strategy="delete",
- filters={"status": "archived", "age_days": ">90"},
-)
-
-print(f"Deleted {response.chunks_deleted} chunks")
-print(f"Affected {response.documents_affected} documents")
-```
-
-### Example 3: Upsert with Document IDs
-
-```python
-# Prepare documents with explicit IDs for upsert
-docs = [
- DocumentInput.from_text(
- text="Content for document 1",
- metadata={"document_id": "user-123-profile", "type": "profile"},
- ),
- DocumentInput.from_text(
- text="Content for document 2",
- metadata={"document_id": "user-123-settings", "type": "settings"},
- ),
-]
-
-# Upsert - will update if exists, insert if not
-response = client.update_documents(
- collection_name="user_data",
- update_strategy="upsert",
- new_documents=docs,
-)
-
-print(f"Updated: {response.chunks_updated} chunks")
-print(f"Inserted: {response.chunks_added} chunks")
-```
-
-### Example 4: Metadata-Only Update
-
-```python
-# Update metadata without reprocessing content
-response = client.update_documents(
- collection_name="knowledge_base",
- update_strategy="delete", # Placeholder
- filters={"document_type": "manual"},
- metadata_updates={
- "status": "reviewed",
- "reviewed_by": "john.doe",
- "review_date": "2025-01-15",
- },
- reprocess_chunks=False, # Don't regenerate embeddings
-)
-
-print(f"Updated metadata for {response.chunks_updated} chunks")
-```
-
-## Performance Characteristics
-
-### Operation Complexity
-
-| Strategy | Time Complexity | Notes |
-|----------|----------------|-------|
-| APPEND | O(n) | n = new documents |
-| DELETE | O(m) | m = chunks to delete |
-| REPLACE | O(m + n) | m = delete, n = add |
-| UPSERT | O(k + m + n) | k = existence checks |
-
-### Best Practices
-
-1. **Use Filters Wisely**: Add metadata fields to enable efficient filtering
-1. **Batch Operations**: Group related updates into single calls
-1. **Metadata-Only Updates**: Use when only metadata changes (faster)
-1. **Document IDs**: Provide explicit IDs for upsert operations
-1. **Monitor Performance**: Check `chunks_deleted/added/updated` in response
-
-## Error Handling Example
-
-```python
-from insta_rag.exceptions import (
- CollectionNotFoundError,
- NoDocumentsFoundError,
- ValidationError,
-)
-
-try:
- response = client.update_documents(
- collection_name="my_collection",
- update_strategy="replace",
- filters={"user_id": "123"},
- new_documents=new_docs,
- )
-
- if response.success:
- print(f"✓ Update successful!")
- else:
- print(f"✗ Update failed: {response.errors}")
-
-except CollectionNotFoundError as e:
- print(f"Collection doesn't exist: {e}")
-
-except NoDocumentsFoundError as e:
- print(f"No matching documents: {e}")
-
-except ValidationError as e:
- print(f"Invalid parameters: {e}")
-```
-
-## Files Modified/Created
-
-### Core Library Files
-
-1. **`src/insta_rag/vectordb/base.py`** - Added abstract helper methods
-1. **`src/insta_rag/vectordb/qdrant.py`** - Implemented Qdrant helper methods
-1. **`src/insta_rag/mongodb_client.py`** - Added batch deletion methods
-1. **`src/insta_rag/core/client.py`** - Implemented main `update_documents()` method
-1. **`src/insta_rag/models/response.py`** - Already had `UpdateDocumentsResponse` model
-
-### Testing Files
-
-1. **`testing_api/main.py`** - Added 6 test endpoints for update operations
-1. **`test_update_documents.py`** - Comprehensive test script for all strategies
-1. **`UPDATE_DOCUMENTS_GUIDE.md`** - This documentation file
-
-## Summary
-
-The `update_documents()` method provides a complete CRUD interface for managing documents in your RAG knowledge base. Key features:
-
-- ✅ **4 Update Strategies**: replace, append, delete, upsert
-- ✅ **Flexible Filtering**: By metadata or document IDs
-- ✅ **Metadata-Only Updates**: No reprocessing when only metadata changes
-- ✅ **MongoDB Integration**: Automatic content synchronization
-- ✅ **Comprehensive Error Handling**: Clear error messages for all failure cases
-- ✅ **Testing Infrastructure**: Test script + API endpoints
-- ✅ **Production-Ready**: Full error handling, logging, and validation
-
-## Next Steps
-
-1. Run the test script: `python test_update_documents.py`
-1. Start the testing API: `cd testing_api && uvicorn main:app --reload`
-1. Try the example code snippets above
-1. Integrate into your application
-
-Enjoy your new document management capabilities! 🚀
diff --git a/docs/USAGE.md b/docs/USAGE.md
deleted file mode 100644
index 1396d33..0000000
--- a/docs/USAGE.md
+++ /dev/null
@@ -1,374 +0,0 @@
-# insta_rag Usage Guide
-
-This guide shows you how to use the insta_rag library for document processing and retrieval.
-
-## Installation
-
-```bash
-# Install the library (development mode)
-pip install -e .
-
-# Install required dependencies
-pip install openai qdrant-client pdfplumber PyPDF2 tiktoken numpy python-dotenv
-```
-
-## Quick Start
-
-### 1. Set Up Environment Variables
-
-Create a `.env` file with your API keys:
-
-```env
-# Qdrant Vector Database
-QDRANT_URL=https://your-qdrant-instance.cloud.qdrant.io
-QDRANT_API_KEY=your_qdrant_api_key
-
-# Option 1: Azure OpenAI (recommended for enterprise)
-AZURE_OPENAI_ENDPOINT=https://your-instance.openai.azure.com/
-AZURE_OPENAI_API_KEY=your_azure_openai_key
-AZURE_EMBEDDING_DEPLOYMENT=text-embedding-3-large
-
-# Option 2: Standard OpenAI
-OPENAI_API_KEY=your_openai_api_key
-
-# Cohere (for reranking - optional)
-COHERE_API_KEY=your_cohere_api_key
-```
-
-### 2. Basic Usage
-
-```python
-from insta_rag import RAGClient, RAGConfig, DocumentInput
-
-# Initialize client from environment variables
-config = RAGConfig.from_env()
-client = RAGClient(config)
-
-# Add documents from different sources
-documents = [
- # From PDF file
- DocumentInput.from_file(
- "path/to/document.pdf",
- metadata={"user_id": "user_123", "document_type": "business_doc"},
- ),
- # From text
- DocumentInput.from_text(
- "Your text content here...", metadata={"source": "web_scrape"}
- ),
-]
-
-# Process and store documents
-response = client.add_documents(
- documents=documents,
- collection_name="my_knowledge_base",
- metadata={"project": "my_project"},
-)
-
-print(f"Processed {response.total_chunks} chunks")
-print(f"Total time: {response.processing_stats.total_time_ms}ms")
-```
-
-## Detailed Usage
-
-### Configuration
-
-#### Using Environment Variables (Recommended)
-
-```python
-config = RAGConfig.from_env()
-```
-
-#### Custom Configuration
-
-```python
-from insta_rag.core.config import (
- RAGConfig,
- VectorDBConfig,
- EmbeddingConfig,
- ChunkingConfig,
- PDFConfig,
-)
-
-config = RAGConfig(
- vectordb=VectorDBConfig(
- url="https://your-qdrant-instance.cloud.qdrant.io", api_key="your_api_key"
- ),
- embedding=EmbeddingConfig(
- provider="openai",
- model="text-embedding-3-large",
- api_key="your_openai_key",
- dimensions=3072,
- ),
- chunking=ChunkingConfig(
- method="semantic", max_chunk_size=1000, overlap_percentage=0.2
- ),
- pdf=PDFConfig(parser="pdfplumber", validate_text=True),
-)
-```
-
-### Document Input
-
-#### From PDF Files
-
-```python
-doc = DocumentInput.from_file(
- file_path="document.pdf",
- metadata={
- "user_id": "user_123",
- "document_type": "contract",
- "department": "legal",
- },
-)
-```
-
-#### From Text Files
-
-```python
-doc = DocumentInput.from_file(
- file_path="document.txt", metadata={"source": "knowledge_base"}
-)
-```
-
-#### From Raw Text
-
-```python
-doc = DocumentInput.from_text(
- text="Your document content...",
- metadata={"source": "web_scrape", "url": "https://example.com"},
-)
-```
-
-### Processing Documents
-
-```python
-response = client.add_documents(
- documents=[doc1, doc2, doc3],
- collection_name="my_collection",
- metadata={"batch_id": "batch_001"},
- batch_size=100, # Embedding batch size
- validate_chunks=True, # Enable quality validation
-)
-
-# Check results
-if response.success:
- print(f"✓ Processed {response.documents_processed} documents")
- print(f"✓ Created {response.total_chunks} chunks")
-
- # Access chunks
- for chunk in response.chunks:
- print(f"Chunk {chunk.chunk_id}: {chunk.metadata.token_count} tokens")
-else:
- print("Errors:", response.errors)
-```
-
-### Collection Management
-
-```python
-# List all collections
-collections = client.list_collections()
-print("Available collections:", collections)
-
-# Get collection info
-info = client.get_collection_info("my_collection")
-print(f"Vectors: {info['vectors_count']}")
-print(f"Status: {info['status']}")
-```
-
-## Advanced Configuration
-
-### Chunking Strategies
-
-```python
-from insta_rag.core.config import ChunkingConfig
-
-# Semantic chunking (default - best quality)
-chunking = ChunkingConfig(
- method="semantic",
- max_chunk_size=1000,
- overlap_percentage=0.2,
- semantic_threshold_percentile=95,
-)
-
-# For faster processing with less accuracy
-chunking = ChunkingConfig(
- method="recursive", max_chunk_size=800, overlap_percentage=0.15
-)
-```
-
-### PDF Processing
-
-```python
-from insta_rag.core.config import PDFConfig
-
-pdf_config = PDFConfig(
- parser="pdfplumber", # or "pypdf2"
- extract_images=False,
- extract_tables=False,
- validate_text=True,
-)
-```
-
-### Embedding Providers
-
-#### Azure OpenAI
-
-```python
-from insta_rag.core.config import EmbeddingConfig
-
-embedding = EmbeddingConfig(
- provider="azure_openai",
- model="text-embedding-3-large",
- api_key="your_key",
- api_base="https://your-instance.openai.azure.com/",
- api_version="2024-02-01",
- deployment_name="text-embedding-3-large",
- dimensions=3072,
-)
-```
-
-#### Standard OpenAI
-
-```python
-embedding = EmbeddingConfig(
- provider="openai",
- model="text-embedding-3-large",
- api_key="your_key",
- dimensions=3072,
-)
-```
-
-## Metadata Management
-
-Metadata is crucial for filtering and organization:
-
-### Document-Level Metadata
-
-```python
-doc = DocumentInput.from_file(
- "document.pdf",
- metadata={
- # User identification
- "user_id": "user_123",
- # Document categorization
- "document_type": "business_document",
- "department": "sales",
- # Lifecycle management
- "is_standalone": True,
- # Template association
- "template_id": "template_456",
- # Custom fields
- "status": "active",
- "tags": ["contract", "2024"],
- "priority": "high",
- },
-)
-```
-
-### Global Metadata
-
-Applied to all chunks in a batch:
-
-```python
-response = client.add_documents(
- documents=documents,
- collection_name="my_collection",
- metadata={
- "project": "project_name",
- "batch_id": "batch_001",
- "uploaded_by": "user_123",
- "timestamp": "2024-01-15",
- },
-)
-```
-
-## Error Handling
-
-```python
-from insta_rag.exceptions import (
- PDFEncryptedError,
- PDFCorruptedError,
- PDFEmptyError,
- ChunkingError,
- EmbeddingError,
- VectorDBError,
-)
-
-try:
- response = client.add_documents(documents, "collection")
-except PDFEncryptedError:
- print("PDF is password-protected")
-except PDFCorruptedError:
- print("PDF file is corrupted")
-except PDFEmptyError:
- print("No text could be extracted from PDF")
-except EmbeddingError as e:
- print(f"Embedding generation failed: {e}")
-except VectorDBError as e:
- print(f"Vector database error: {e}")
-```
-
-## Performance Optimization
-
-### Batch Processing
-
-```python
-# Process large batches efficiently
-response = client.add_documents(
- documents=large_document_list,
- collection_name="large_collection",
- batch_size=100, # Adjust based on API limits
-)
-```
-
-### Monitoring
-
-```python
-response = client.add_documents(documents, "collection")
-
-stats = response.processing_stats
-print(f"Chunking: {stats.chunking_time_ms}ms")
-print(f"Embedding: {stats.embedding_time_ms}ms")
-print(f"Upload: {stats.upload_time_ms}ms")
-print(f"Total: {stats.total_time_ms}ms")
-print(f"Tokens processed: {stats.total_tokens}")
-```
-
-## Complete Example
-
-See `examples/basic_usage.py` for a complete working example.
-
-```bash
-# Run the example
-python examples/basic_usage.py
-```
-
-## Next Steps
-
-- Implement `update_documents()` for document updates and deletions
-- Implement `retrieve()` for hybrid search with reranking
-- Add support for more document types
-- Implement custom chunking strategies
-
-## Troubleshooting
-
-### Common Issues
-
-1. **"Collection not found"**: Create the collection first or let `add_documents()` auto-create it
-1. **"Embedding API error"**: Check API keys and rate limits
-1. **"PDF extraction failed"**: Try a different parser or check PDF quality
-1. **"Token count exceeded"**: Reduce `max_chunk_size` in configuration
-
-### Debug Mode
-
-```python
-import logging
-
-logging.basicConfig(level=logging.DEBUG)
-```
-
-## Support
-
-For issues and questions:
-
-- GitHub Issues: https://github.com/AI-Buddy-Catalyst-Labs/insta_rag/issues
-- Documentation: See README.md
diff --git a/docs/USE_LOCAL_QDRANT.md b/docs/USE_LOCAL_QDRANT.md
deleted file mode 100644
index bb4101a..0000000
--- a/docs/USE_LOCAL_QDRANT.md
+++ /dev/null
@@ -1,200 +0,0 @@
-# Using Local Qdrant for Testing
-
-If you're experiencing connection issues with remote Qdrant, you can easily run a local instance for testing.
-
-## Quick Start with Docker
-
-### 1. Run Qdrant Container
-
-```bash
-docker run -d -p 6333:6333 -p 6334:6334 \
- -v $(pwd)/qdrant_storage:/qdrant/storage \
- qdrant/qdrant
-```
-
-**Without Docker volume (data will be lost on restart):**
-
-```bash
-docker run -d -p 6333:6333 -p 6334:6334 qdrant/qdrant
-```
-
-### 2. Update .env File
-
-```env
-# Comment out or replace remote Qdrant
-# QDRANT_URL=https://qdrant-okc4ss8owk0ggwg4ccwsoks0.aibuddy-coolify-inventory.aukikaurnab.com/
-# QDRANT_API_KEY=edfBd7pP251ev2uiRcjcBGt7QXJe1P70
-
-# Use local Qdrant
-QDRANT_URL=http://localhost:6333
-QDRANT_API_KEY=
-```
-
-### 3. Test Connection
-
-```bash
-# Check if Qdrant is running
-curl http://localhost:6333/collections
-
-# Should return: {"result":{"collections":[]}}
-```
-
-### 4. Run Your Tests
-
-```bash
-cd testing_api
-./run.sh
-```
-
-Now all tests will use your local Qdrant instance!
-
-## Alternative: Use Docker Compose
-
-### 1. Create docker-compose.yml
-
-```yaml
-version: '3.8'
-
-services:
- qdrant:
- image: qdrant/qdrant:latest
- ports:
- - "6333:6333"
- - "6334:6334"
- volumes:
- - ./qdrant_storage:/qdrant/storage
- environment:
- - QDRANT__SERVICE__GRPC_PORT=6334
-```
-
-### 2. Start Services
-
-```bash
-docker-compose up -d
-```
-
-### 3. Stop Services
-
-```bash
-docker-compose down
-```
-
-## Verify Local Qdrant is Working
-
-```bash
-# Test with curl
-curl http://localhost:6333/collections
-
-# Test with Python
-python -c "
-from qdrant_client import QdrantClient
-client = QdrantClient(url='http://localhost:6333', prefer_grpc=False)
-print('Collections:', client.get_collections())
-print('✓ Local Qdrant working!')
-"
-```
-
-## Access Qdrant Dashboard
-
-Open in browser: **http://localhost:6333/dashboard**
-
-You can:
-
-- View collections
-- Browse points
-- Run queries
-- Monitor performance
-
-## Switching Back to Remote Qdrant
-
-When you want to use remote Qdrant again:
-
-1. Stop local Qdrant:
-
-```bash
-docker stop $(docker ps -q --filter ancestor=qdrant/qdrant)
-```
-
-2. Restore .env:
-
-```env
-QDRANT_URL=https://qdrant-okc4ss8owk0ggwg4ccwsoks0.aibuddy-coolify-inventory.aukikaurnab.com/
-QDRANT_API_KEY=edfBd7pP251ev2uiRcjcBGt7QXJe1P70
-```
-
-## Benefits of Local Qdrant
-
-- ✅ No network latency
-- ✅ No connection timeouts
-- ✅ Free and unlimited
-- ✅ Full control
-- ✅ Works offline
-- ✅ Perfect for development and testing
-
-## Troubleshooting Local Qdrant
-
-### Port already in use
-
-```bash
-# Find what's using the port
-lsof -i :6333
-
-# Kill the process or use different ports
-docker run -d -p 7333:6333 qdrant/qdrant
-# Then use QDRANT_URL=http://localhost:7333
-```
-
-### Docker not installed
-
-**Ubuntu/Debian:**
-
-```bash
-sudo apt update
-sudo apt install docker.io
-sudo systemctl start docker
-sudo usermod -aG docker $USER
-# Log out and back in
-```
-
-**macOS:**
-
-```bash
-brew install --cask docker  # installs Docker Desktop; the bare formula is CLI-only
-```
-
-### Permission denied
-
-```bash
-sudo docker run -d -p 6333:6333 -p 6334:6334 qdrant/qdrant
-```
-
-Or add user to docker group (Ubuntu):
-
-```bash
-sudo usermod -aG docker $USER
-newgrp docker
-```
-
-## Production Considerations
-
-For production, consider:
-
-- Using persistent volumes
-- Setting up authentication
-- Configuring backups
-- Using Qdrant Cloud for managed service
-
-## Summary
-
-Local Qdrant is perfect for:
-
-- Development and testing
-- Learning and experimentation
-- When remote Qdrant has connection issues
-- Offline development
-
-It's **not recommended** for:
-
-- Production deployments
-- Shared/team environments
-- When you need remote access
diff --git a/docs/guides/document-management.md b/docs/guides/document-management.md
new file mode 100644
index 0000000..c7c5e6c
--- /dev/null
+++ b/docs/guides/document-management.md
@@ -0,0 +1,179 @@
+# Guide: Document Management
+
+This guide covers the complete lifecycle of documents in `insta_rag`, from initial ingestion and processing to updating and deleting them.
+
+## Part 1: Adding Documents (`add_documents`)
+
+The `add_documents()` method is the entry point for adding new knowledge to your RAG system. It orchestrates a sophisticated 6-phase pipeline to process raw documents into searchable vector embeddings.
+
+### The Ingestion Pipeline
+
+```mermaid
+graph TD
+ A[Document Input] --> B{Phase 1: Loading};
+ B --> C{Phase 2: Text Extraction};
+ C --> D{Phase 3: Semantic Chunking};
+ D --> E{Phase 4: Validation};
+ E --> F{Phase 5: Embedding};
+ F --> G{Phase 6: Storage};
+ G --> H[Searchable in Qdrant];
+```
+
+#### **Phase 1: Document Loading**
+
+- **Input**: A list of `DocumentInput` objects (from files, text, or binary).
+- **Action**: A unique `document_id` is generated, and metadata is consolidated.
+
+#### **Phase 2: Text Extraction**
+
+- **Action**: For PDF files, text is extracted page-by-page. The system uses `pdfplumber` and falls back to `PyPDF2` if needed. It also handles encrypted or corrupted files.
+
+#### **Phase 3: Semantic Chunking**
+
+- **Goal**: To split documents at natural topic boundaries, preserving context.
+- **Process** (sketched in code after this list):
+ 1. If a document is small (e.g., \<= 1000 tokens), it's treated as a single chunk.
+ 1. Otherwise, the text is split into sentences.
+ 1. Embeddings are generated for each sentence.
+ 1. The cosine similarity between adjacent sentences is calculated.
+ 1. Low-similarity points are identified as "breakpoints" or topic changes.
+ 1. The text is split at these breakpoints.
+ 1. A 20% overlap is added between chunks to ensure no context is lost at the boundaries.
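+
+The breakpoint detection (steps 4-5) can be sketched as follows; the percentile threshold is an assumption for illustration, not the library's exact setting:
+
+```python
+import numpy as np
+
+
+def find_breakpoints(embeddings: np.ndarray, percentile: float = 10.0) -> list:
+    """Return sentence indices where adjacent similarity drops sharply."""
+    a, b = embeddings[:-1], embeddings[1:]
+    # Cosine similarity between each sentence and its successor
+    sims = (a * b).sum(axis=1) / (
+        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
+    )
+    # Treat the lowest-similarity gaps as topic changes
+    threshold = np.percentile(sims, percentile)
+    return [i + 1 for i, s in enumerate(sims) if s <= threshold]
+```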
+
+#### **Phase 4: Chunk Validation**
+
+- **Action**: Each chunk is validated for quality (e.g., minimum length) and a `ChunkMetadata` object is created, containing token counts, source information, and other useful data.
+
+#### **Phase 5: Batch Embedding Generation**
+
+- **Action**: The content of all chunks is sent to the configured embedding provider (e.g., Azure OpenAI) in batches.
+- **Output**: Each chunk is associated with a high-dimensional vector embedding (e.g., 3072 dimensions for `text-embedding-3-large`).
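+
+A sketch of the batching logic, assuming the OpenAI Python SDK (for Azure, `model` is the deployment name; the endpoint and key are read from the environment, and the `api_version` shown is one possible value):
+
+```python
+from openai import AzureOpenAI
+
+# Reads AZURE_OPENAI_ENDPOINT / AZURE_OPENAI_API_KEY from the environment
+client = AzureOpenAI(api_version="2024-02-01")
+
+
+def embed_in_batches(texts, batch_size=100):
+    vectors = []
+    for i in range(0, len(texts), batch_size):
+        response = client.embeddings.create(
+            model="text-embedding-3-large",  # Azure deployment name
+            input=texts[i : i + batch_size],
+        )
+        vectors.extend(item.embedding for item in response.data)
+    return vectors
+```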
+
+#### **Phase 6: Storage**
+
+- **Action**: The chunks (embeddings and metadata) are uploaded to the specified Qdrant collection.
+- **ID Generation**: A deterministic UUID is generated for each chunk, ensuring that re-uploading the same chunk is an idempotent operation (see the sketch after this list).
+- **Hybrid Storage**: If MongoDB is configured, the full text content is stored in MongoDB, while Qdrant stores only the vector and a reference ID to the MongoDB document. This is the recommended setup for production. See the [Storage Backends Guide](./storage-backends.md) for more details.
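+
+The deterministic IDs follow a UUIDv5 scheme, along these lines (the `chunk_id` value is a hypothetical example):
+
+```python
+import uuid
+
+chunk_id = "3f2b9c1e_chunk_0"  # hypothetical chunk identifier
+point_id = str(uuid.uuid5(uuid.NAMESPACE_DNS, chunk_id))
+# The same chunk_id always yields the same point_id, so re-uploading
+# a chunk overwrites the existing Qdrant point instead of duplicating it.
+```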
+
+### Example: Adding a Document
+
+```python
+from insta_rag import RAGClient, RAGConfig, DocumentInput
+
+config = RAGConfig.from_env()
+client = RAGClient(config)
+
+documents = [
+ DocumentInput.from_file(
+ "./annual-report.pdf", metadata={"year": 2024, "company": "InstaCo"}
+ )
+]
+
+response = client.add_documents(
+ documents=documents, collection_name="financial_reports"
+)
+
+print(f"Successfully created {response.total_chunks} chunks.")
+```
+
+______________________________________________________________________
+
+## Part 2: Updating and Deleting Documents (`update_documents`)
+
+The `update_documents()` method provides flexible CRUD (Create, Read, Update, Delete) operations for managing your knowledge base after initial ingestion.
+
+### Update Strategies
+
+You can choose one of four strategies for any update operation.
+
+#### 1. `replace`
+
+- **Action**: Deletes a set of existing documents and adds a new set in their place.
+- **Use Case**: A user uploads a new version of a document that should completely replace the old one.
+
+```python
+# Replace all documents from 2023 with the new 2024 report
+response = client.update_documents(
+ collection_name="financial_reports",
+ update_strategy="replace",
+ filters={"year": 2023, "company": "InstaCo"},
+ new_documents=[DocumentInput.from_file("./new-report.pdf")],
+)
+```
+
+#### 2. `append`
+
+- **Action**: Adds new documents to a collection without affecting existing ones.
+- **Use Case**: Incrementally adding new information to a knowledge base.
+
+```python
+# Add a new quarterly report without touching the old ones
+response = client.update_documents(
+ collection_name="financial_reports",
+ update_strategy="append",
+ new_documents=[DocumentInput.from_file("./q3-report.pdf")],
+)
+```
+
+#### 3. `delete`
+
+- **Action**: Removes documents and their associated chunks from the knowledge base.
+- **Use Case**: Removing outdated, irrelevant, or incorrect information.
+
+```python
+# Delete by metadata filter
+response = client.update_documents(
+ collection_name="financial_reports",
+ update_strategy="delete",
+ filters={"status": "archived"},
+)
+
+# Or delete by specific document IDs
+response = client.update_documents(
+ collection_name="financial_reports",
+ update_strategy="delete",
+ document_ids=["doc-id-123", "doc-id-456"],
+)
+```
+
+#### 4. `upsert`
+
+- **Action**: Updates documents if they exist (based on `document_id` in metadata), or inserts them if they don't.
+- **Use Case**: Synchronizing data from an external source where you want to ensure the latest version is present without creating duplicates.
+
+```python
+# Documents with explicit IDs for upserting
+docs_to_sync = [
+ DocumentInput.from_text(
+ "Profile for user 1", metadata={"document_id": "user-profile-1"}
+ ),
+ DocumentInput.from_text(
+ "Profile for user 2", metadata={"document_id": "user-profile-2"}
+ ),
+]
+
+response = client.update_documents(
+ collection_name="user_profiles",
+ update_strategy="upsert",
+ new_documents=docs_to_sync,
+)
+```
+
+### Metadata-Only Updates
+
+If you only need to change the metadata of existing chunks without the overhead of re-chunking and re-embedding, set `reprocess_chunks=False`.
+
+This is highly efficient for tasks like changing a document's status, adding tags, or updating timestamps.
+
+```python
+# Mark all reports from 2023 as archived without reprocessing them
+response = client.update_documents(
+ collection_name="financial_reports",
+ update_strategy="delete", # Strategy is ignored here
+ filters={"year": 2023},
+ metadata_updates={"status": "archived"},
+ reprocess_chunks=False, # This is the key parameter
+)
+
+print(f"Updated metadata for {response.chunks_updated} chunks.")
+```
diff --git a/docs/guides/local-development.md b/docs/guides/local-development.md
new file mode 100644
index 0000000..1934358
--- /dev/null
+++ b/docs/guides/local-development.md
@@ -0,0 +1,94 @@
+# Guide: Local Development Setup
+
+This guide provides tips for setting up a smooth and efficient local development environment for `insta_rag`, including how to run a local instance of Qdrant to avoid network latency and connection issues.
+
+## Using a Local Qdrant Instance
+
+For development and testing, running a local Qdrant instance via Docker is highly recommended. It's fast, free, and eliminates network-related issues.
+
+### 1. Start Qdrant with Docker
+
+Run the following command in your terminal to start a Qdrant container. This command also mounts a local volume to persist your data between container restarts.
+
+```bash
+# This command will download the Qdrant image and run it in the background.
+docker run -d -p 6333:6333 -p 6334:6334 \
+ -v $(pwd)/qdrant_storage:/qdrant/storage \
+ qdrant/qdrant
+```
+
+- `-p 6333:6333`: Maps the HTTP REST API port.
+- `-p 6334:6334`: Maps the gRPC port.
+- `-v $(pwd)/qdrant_storage:/qdrant/storage`: Persists data in a `qdrant_storage` directory in your current folder.
+
+### 2. Update Your `.env` File
+
+Modify your `.env` file to point to your local instance. Comment out the remote/cloud Qdrant variables and add the local ones.
+
+```env
+# .env file
+
+# Comment out the remote Qdrant configuration
+# QDRANT_URL=https://your-remote-qdrant-url.com/
+# QDRANT_API_KEY=your-remote-api-key
+
+# Add the local Qdrant configuration
+QDRANT_URL=http://localhost:6333
+QDRANT_API_KEY=
+```
+
+Leave `QDRANT_API_KEY` blank, as the default local instance does not require one.
+
+### 3. Verify the Connection
+
+You can quickly check if your local Qdrant is running correctly.
+
+```bash
+# Use curl to check the collections endpoint
+curl http://localhost:6333/collections
+```
+
+You should see a response like: `{"result":{"collections":[]},"status":"ok","time":...}`.
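+
+You can also verify from Python:
+
+```python
+from qdrant_client import QdrantClient
+
+client = QdrantClient(url="http://localhost:6333", prefer_grpc=False)
+print("Collections:", client.get_collections())
+```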
+
+### 4. Access the Qdrant Dashboard
+
+Qdrant provides a web dashboard to view your collections, search points, and monitor the instance. Access it at:
+
+**http://localhost:6333/dashboard**
+
+## Troubleshooting Connection Issues
+
+If you are having trouble connecting to a remote Qdrant server, here are a few steps to debug the issue.
+
+### 1. Test Basic Connectivity
+
+Use `curl` to see if the server is reachable from your network. A timeout here indicates a network or firewall issue, not a problem with the library itself.
+
+```bash
+# Test if the server responds to a basic request (10-second timeout)
+curl -I --max-time 10 "https://your-remote-qdrant-url.com/"
+```
+
+### 2. Increase Client Timeout
+
+For slow remote connections, you can increase the timeout directly in the client configuration. In `src/insta_rag/core/config.py`, you can add a `timeout` parameter to the `VectorDBConfig`.
+
+```python
+# src/insta_rag/core/config.py
+from dataclasses import dataclass
+from typing import Optional
+
+
+@dataclass
+class VectorDBConfig:
+ provider: str = "qdrant"
+ url: Optional[str] = None
+ api_key: Optional[str] = None
+ timeout: int = 120 # Increase timeout to 120 seconds
+```
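+
+If you construct the client yourself, `qdrant-client` also accepts a `timeout` argument directly (a sketch; adjust to your setup):
+
+```python
+from qdrant_client import QdrantClient
+
+client = QdrantClient(
+    url="https://your-remote-qdrant-url.com/",
+    api_key="your-remote-api-key",
+    timeout=120,  # seconds
+)
+```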
+
+### 3. Use a Free Qdrant Cloud Instance
+
+If your self-hosted remote server is unreliable, consider using the free tier from [Qdrant Cloud](https://cloud.qdrant.io). It's a quick and easy way to get a stable remote vector database for development.
+
+1. Sign up and create a free cluster.
+1. Copy the URL and generate an API key.
+1. Update your `.env` file with the new credentials.
diff --git a/docs/guides/retrieval.md b/docs/guides/retrieval.md
new file mode 100644
index 0000000..e317a76
--- /dev/null
+++ b/docs/guides/retrieval.md
@@ -0,0 +1,130 @@
+# Guide: Advanced Retrieval
+
+The `retrieve()` method in `insta_rag` is designed to find the most relevant information for a user's query using a sophisticated hybrid search pipeline. This guide breaks down how it works.
+
+## The Retrieval Pipeline
+
+The retrieval process is a multi-step pipeline designed to maximize both semantic understanding (finding conceptually similar results) and lexical matching (finding exact keywords).
+
+```mermaid
+graph TD
+    A[User Query] --> B{"Step 1: Query Generation (HyDE)"};
+    B --> C{"Step 2: Dual Vector Search"};
+    B --> D{"Step 3: Keyword Search (BM25)"};
+    C --> E{"Step 4: Combine & Deduplicate"};
+    D --> E;
+    E --> F{"Step 5: Reranking"};
+    F --> G{"Step 6: Selection & Formatting"};
+ G --> H[Top-k Relevant Chunks];
+```
+
+### Step 1: Query Generation (HyDE)
+
+- **Goal**: To overcome the challenge of matching a short user query with long, detailed document chunks.
+- **Process**: The user's query is sent to an LLM (e.g., GPT-4) which generates two things:
+ 1. **Optimized Query**: A rewritten, clearer version of the original query.
+ 1. **Hypothetical Document Embedding (HyDE)**: A hypothetical answer or document that would perfectly answer the user's query.
+- **Benefit**: Searching with the embedding of the hypothetical answer is often more effective at finding relevant chunks than searching with the embedding of the short original query.
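+
+A simplified sketch of the HyDE step (the prompt wording is illustrative, not the library's exact prompt):
+
+```python
+from openai import OpenAI
+
+llm = OpenAI()  # or AzureOpenAI, matching your configuration
+
+
+def generate_hypothetical_answer(query: str) -> str:
+    response = llm.chat.completions.create(
+        model="gpt-4",
+        messages=[
+            {
+                "role": "system",
+                "content": "Write a short passage that would answer the user's question.",
+            },
+            {"role": "user", "content": query},
+        ],
+    )
+    return response.choices[0].message.content
+
+
+# The hypothetical answer is embedded and used as a search vector.
+```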
+
+### Step 2: Dual Vector Search
+
+- **Goal**: To find semantically relevant chunks.
+- **Process**: Two parallel vector searches are performed in Qdrant:
+ 1. Search with the embedding of the **optimized query**.
+ 1. Search with the embedding of the **HyDE query**.
+- **Output**: A list of candidate chunks from both searches (e.g., 25 chunks from each, for a total of 50).
+
+### Step 3: Keyword Search (BM25)
+
+- **Goal**: To find chunks containing exact keyword matches, which semantic search might miss.
+- **Process**: The original query is tokenized, and a BM25 (Best Match 25) algorithm is used to find chunks with high lexical overlap.
+- **Benefit**: Crucial for finding specific names, codes, acronyms, or direct quotes.
+- **Output**: A list of candidate chunks based on keyword relevance (e.g., 50 chunks).
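+
+Conceptually, the scoring works like this (shown with the `rank_bm25` package; the library's internal implementation may differ):
+
+```python
+from rank_bm25 import BM25Okapi
+
+corpus = ["first chunk text ...", "second chunk text ..."]
+tokenized_corpus = [doc.lower().split() for doc in corpus]
+
+bm25 = BM25Okapi(tokenized_corpus)
+scores = bm25.get_scores("original user query".lower().split())
+# Higher scores indicate stronger lexical overlap with the query.
+```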
+
+### Step 4: Combine & Deduplicate
+
+- **Goal**: To create a single, unified pool of candidate chunks.
+- **Process**: The results from vector search and keyword search are combined. Duplicate chunks (which may have been found by both methods) are removed, keeping the instance with the highest score.
+- **Output**: A single list of unique candidate chunks (e.g., ~70-80 chunks).
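+
+The deduplication amounts to keeping the best-scoring instance of each chunk, roughly:
+
+```python
+def deduplicate(candidates: list) -> list:
+    """Keep the highest-scoring instance of each chunk_id (illustrative)."""
+    best = {}
+    for chunk in candidates:
+        existing = best.get(chunk["chunk_id"])
+        if existing is None or chunk["score"] > existing["score"]:
+            best[chunk["chunk_id"]] = chunk
+    return list(best.values())
+```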
+
+### Step 5: Reranking
+
+- **Goal**: To intelligently re-order the candidate chunks for maximum relevance.
+- **Process**: The combined list of chunks is sent to a powerful cross-encoder model (like Cohere's Reranker or BAAI's BGE-Reranker). Unlike vector similarity, a cross-encoder directly compares the user's query against each candidate chunk's full text, providing a much more accurate relevance score.
+- **Benefit**: This is the most computationally intensive but also the most impactful step for improving the final quality of the results.
+
+### Step 6: Selection & Formatting
+
+- **Goal**: To prepare the final response for the user.
+- **Process**:
+ 1. The results are sorted by their new reranker scores.
+ 1. A final `score_threshold` can be applied to filter out low-quality results.
+ 1. The top `k` chunks are selected.
+ 1. If using hybrid storage, the full content is fetched from MongoDB.
+
+## Controlling Retrieval Features
+
+All advanced features are enabled by default, but you can easily disable them to trade quality for speed.
+
+```python
+# High-quality (default)
+response = client.retrieve(
+ query="...",
+ collection_name="...",
+ enable_hyde=True,
+ enable_keyword_search=True,
+ enable_reranking=True,
+)
+
+# Fast mode (vector search only)
+response = client.retrieve(
+ query="...",
+ collection_name="...",
+ enable_hyde=False,
+ enable_keyword_search=False,
+ enable_reranking=False,
+)
+```
+
+## Understanding Reranker Scores
+
+The reranker model you use determines the range and interpretation of the `relevance_score`.
+
+### BGE Reranker (e.g., `BAAI/bge-reranker-v2-m3`)
+
+This model produces scores that are often negative. **Higher is better**.
+
+- **Range**: Typically -10.0 to +10.0
+
+- **Interpretation**:
+
+ - `> 0.0`: Very good relevance.
+ - `-3.0 to 0.0`: Good relevance.
+ - `-5.0 to -3.0`: Moderate relevance.
+ - `< -5.0`: Low relevance.
+
+- **Thresholding**: Use negative values for the `score_threshold`.
+
+ ```python
+ # Keep only moderately to highly relevant results
+ client.retrieve(query="...", collection_name="...", score_threshold=-3.0)
+ ```
+
+### Cohere Reranker
+
+Cohere's reranker produces normalized scores between 0 and 1. **Higher is better**.
+
+- **Range**: 0.0 to 1.0
+
+- **Interpretation**:
+
+ - `> 0.8`: Highly relevant.
+ - `0.5 to 0.8`: Moderately relevant.
+ - `< 0.5`: Low relevance.
+
+- **Thresholding**: Use positive float values.
+
+ ```python
+ # Keep only moderately to highly relevant results
+ client.retrieve(query="...", collection_name="...", score_threshold=0.5)
+ ```
diff --git a/docs/guides/storage-backends.md b/docs/guides/storage-backends.md
new file mode 100644
index 0000000..06f07c1
--- /dev/null
+++ b/docs/guides/storage-backends.md
@@ -0,0 +1,78 @@
+# Guide: Storage Backends
+
+`insta_rag` supports two primary storage configurations for your document chunks: **Qdrant-Only** and **Hybrid (Qdrant + MongoDB)**. This guide explains the difference and how to configure them.
+
+## Storage Architectures
+
+### 1. Qdrant-Only Mode (Default)
+
+In this mode, all data associated with a chunk is stored directly in the Qdrant vector database.
+
+- **Architecture**: `Document → Chunking → Embedding → Qdrant (stores vectors, metadata, and full text content)`
+- **Qdrant Point**: Stores the vector embedding, with a payload holding all metadata and the full `content` of the chunk.
+- **Pros**: Simple to set up, requires only one database.
+- **Cons**: Can be less cost-effective for very large text content, as vector databases are optimized and priced for vector search, not bulk text storage.
+
+### 2. Hybrid Mode: Qdrant + MongoDB (Recommended for Production)
+
+In this mode, storage is split: Qdrant stores the vectors for fast searching, and MongoDB stores the actual text content.
+
+- **Architecture**: `Document → Chunking → Embedding → MongoDB (stores full content) & Qdrant (stores vectors + a reference to MongoDB)`
+- **Qdrant Point**: Stores the vector embedding and a payload with the metadata, but the `content` field is empty. Instead, the payload holds a `mongodb_id` pointing to the document in MongoDB.
+- **MongoDB Document**: Contains the `chunk_id`, the full `content`, and a copy of the metadata.
+- **Pros**:
+ - **Cost-Effective**: Leverages MongoDB for cheaper, efficient bulk text storage.
+ - **Separation of Concerns**: Qdrant handles what it does best (vector search), and MongoDB handles content storage.
+ - **Flexibility**: Allows you to manage and update content in MongoDB without needing to re-index vectors in Qdrant.
+- **Cons**: Requires managing a second database (MongoDB).
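+
+The two payload shapes look roughly like this (field layout is illustrative, based on the upload pipeline described in the other guides):
+
+```python
+# Qdrant-only mode: everything lives in the point's payload
+qdrant_only_payload = {
+    "chunk_id": "doc-uuid_chunk_0",
+    "content": "Full chunk text ...",
+    "metadata": {"document_id": "doc-uuid", "source": "report.pdf"},
+}
+
+# Hybrid mode: content is replaced by a reference into MongoDB
+hybrid_payload = {
+    "chunk_id": "doc-uuid_chunk_0",
+    "mongodb_id": "663a1f...",  # points at the MongoDB document holding the text
+    "metadata": {"document_id": "doc-uuid", "source": "report.pdf"},
+}
+```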
+
+## Configuration
+
+The storage mode is **automatically determined** based on your environment configuration.
+
+### Enabling Hybrid Mode (Qdrant + MongoDB)
+
+To enable hybrid mode, simply add your MongoDB connection string to your `.env` file. If the connection string is present, `insta_rag` will automatically use the hybrid storage architecture.
+
+```env
+# .env file
+
+# Qdrant Configuration
+QDRANT_URL="..."
+QDRANT_API_KEY="..."
+
+# MongoDB Configuration (enables Hybrid Mode)
+MONGO_CONNECTION_STRING="mongodb://user:password@host:port/"
+MONGO_DATABASE_NAME="your_db_name" # Optional, defaults to Test_Insta_RAG
+```
+
+### Using Qdrant-Only Mode
+
+To use the Qdrant-only mode, simply omit or comment out the `MONGO_CONNECTION_STRING` from your `.env` file.
+
+```env
+# .env file
+
+# Qdrant Configuration
+QDRANT_URL="..."
+QDRANT_API_KEY="..."
+
+# MONGO_CONNECTION_STRING is not present, so Qdrant-only mode is used.
+```
+
+## How It Works During Retrieval
+
+The library handles the difference in storage transparently during retrieval:
+
+1. A search query is sent to Qdrant.
+1. Qdrant returns a list of matching vectors and their payloads.
+1. The `RAGClient` inspects the payload of each result.
+1. If a `mongodb_id` is present, it automatically fetches the full content from MongoDB.
+1. If there is no `mongodb_id`, it uses the `content` directly from the Qdrant payload.
+
+The final `RetrievalResponse` will contain the full text content regardless of which storage backend was used.
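+
+In sketch form (assuming a `pymongo` collection handle; not the exact internals):
+
+```python
+def resolve_content(payload: dict, mongo_collection) -> str:
+    """Fetch chunk text from MongoDB when the payload only holds a reference."""
+    if payload.get("mongodb_id"):
+        doc = mongo_collection.find_one({"_id": payload["mongodb_id"]})
+        return doc["content"]
+    return payload["content"]
+```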
+
+## Best Practices
+
+- For **local development and testing**, the **Qdrant-only** mode is often simpler and sufficient.
+- For **production environments**, especially with a large volume of documents, the **Hybrid (Qdrant + MongoDB)** mode is highly recommended for its cost-effectiveness and scalability.
diff --git a/docs/INSTALL.md b/docs/installation.md
similarity index 97%
rename from docs/INSTALL.md
rename to docs/installation.md
index e912c42..86b86f9 100644
--- a/docs/INSTALL.md
+++ b/docs/installation.md
@@ -130,7 +130,9 @@ pip install --upgrade pdfplumber PyPDF2
### Qdrant Connection Issues
1. Verify your `QDRANT_URL` and `QDRANT_API_KEY` in `.env`
+
1. Test connection:
+
```python
from qdrant_client import QdrantClient
@@ -181,4 +183,4 @@ If you encounter any installation issues:
1. Check that Python version is 3.9+: `python --version`
1. Ensure pip is up to date: `pip install --upgrade pip`
1. Try installing in a fresh virtual environment
-1. Check the GitHub Issues: https://github.com/AI-Buddy-Catalyst-Labs/insta_rag/issues
+1. Check the [GitHub Issues](https://github.com/AI-Buddy-Catalyst-Labs/insta_rag/issues)
diff --git a/docs/quickstart.md b/docs/quickstart.md
new file mode 100644
index 0000000..91af564
--- /dev/null
+++ b/docs/quickstart.md
@@ -0,0 +1,163 @@
+# Quickstart Guide
+
+This guide provides a hands-on walkthrough of the core features of the `insta_rag` library. You'll learn how to install the library, configure your environment, add documents, and perform advanced retrieval queries.
+
+## 1. Installation
+
+First, ensure you have Python 3.9+ and install the library. Using `uv` is recommended.
+
+```bash
+# Install the package in editable mode
+uv pip install -e .
+```
+
+For more detailed installation instructions, see the [Installation Guide](./installation.md).
+
+## 2. Environment Setup
+
+Create a `.env` file in your project root and add the necessary API keys and URLs. The library will automatically load these variables.
+
+```env
+# Required: Qdrant Vector Database
+QDRANT_URL="https://your-qdrant-instance.cloud.qdrant.io"
+QDRANT_API_KEY="your_qdrant_api_key"
+
+# Required: OpenAI or Azure OpenAI for embeddings
+AZURE_OPENAI_ENDPOINT="https://your-instance.openai.azure.com/"
+AZURE_OPENAI_API_KEY="your_azure_key"
+AZURE_EMBEDDING_DEPLOYMENT="text-embedding-3-large"
+
+# Required for HyDE Query Generation
+AZURE_LLM_DEPLOYMENT="gpt-4"
+
+# Optional: Cohere for reranking
+COHERE_API_KEY="your_cohere_api_key"
+```
+
+## 3. Initialize the RAG Client
+
+The `RAGClient` is the main entry point to the library. It's configured via a `RAGConfig` object, which can be easily loaded from your environment variables.
+
+```python
+from insta_rag import RAGClient, RAGConfig
+
+# Load configuration from .env file
+config = RAGConfig.from_env()
+
+# Initialize the client
+client = RAGClient(config)
+
+print("✓ RAG Client initialized successfully")
+```
+
+## 4. Add Documents to a Collection
+
+You can add documents from files (PDF, TXT) or raw text. The library handles text extraction, semantic chunking, embedding, and storage in a single command.
+
+```python
+from insta_rag import DocumentInput
+
+# Prepare documents from different sources
+documents = [
+    # From a PDF file
+    DocumentInput.from_file(
+        "path/to/your/document.pdf",
+        metadata={"user_id": "user_123", "document_type": "report"},
+    ),
+    # From a raw text string
+    DocumentInput.from_text(
+        "This is the content of a short document about insta_rag.",
+        metadata={"source": "manual", "author": "Gemini"},
+    ),
+]
+
+# Process and store the documents in a collection
+response = client.add_documents(
+ documents=documents,
+ collection_name="my_knowledge_base",
+ metadata={"project": "quickstart_demo"}, # Global metadata for this batch
+)
+
+# Review the results
+if response.success:
+ print(f"✓ Processed {response.documents_processed} documents")
+ print(f"✓ Created {response.total_chunks} chunks")
+ print(f"✓ Total time: {response.processing_stats.total_time_ms:.2f}ms")
+else:
+ print(f"✗ Errors: {response.errors}")
+```
+
+For a detailed explanation of the ingestion process, see the [Document Management Guide](./guides/document-management.md).
+
+## 5. Perform a Retrieval Query
+
+The `retrieve()` method performs a hybrid search query, combining semantic search, keyword search, and query expansion to find the most relevant results.
+
+By default, **HyDE query generation** and **BM25 keyword search** are enabled for the highest quality results.
+
+```python
+# Perform a retrieval query
+response = client.retrieve(
+ query="What is semantic chunking?", collection_name="my_knowledge_base", top_k=5
+)
+
+# Print the results
+if response.success:
+ print(f"✓ Retrieved {len(response.chunks)} chunks")
+ print(f"\nGenerated Queries: {response.queries_generated}")
+
+ for i, chunk in enumerate(response.chunks):
+ print(f"\n--- Result {i + 1} (Score: {chunk.relevance_score:.4f}) ---")
+ print(f"Source: {chunk.metadata.source}")
+ print(chunk.content)
+```
+
+### Understanding the Retrieval Response
+
+- `response.chunks`: A list of the top `k` most relevant document chunks.
+- `response.queries_generated`: Shows the original query, the optimized standard query, and the hypothetical answer (HyDE) used for searching.
+- `response.retrieval_stats`: Provides a detailed performance breakdown, including timings for each stage and the number of chunks found by each method.
+
+## 6. Controlling Retrieval Features
+
+You can easily enable or disable advanced retrieval features to balance speed and quality.
+
+```python
+# Fast mode: Vector search only (like a traditional RAG)
+fast_response = client.retrieve(
+ query="your question",
+ collection_name="my_knowledge_base",
+ enable_hyde=False,
+ enable_keyword_search=False,
+    enable_reranking=False,  # Skip the reranking pass for lower latency
+)
+
+# High-quality mode: Vector search + HyDE (no keyword search)
+quality_response = client.retrieve(
+ query="your question",
+ collection_name="my_knowledge_base",
+ enable_hyde=True,
+ enable_keyword_search=False,
+)
+```
+
+To learn more about the advanced retrieval pipeline, see the [Retrieval Guide](./guides/retrieval.md).
+
+## 7. Document Management
+
+The library also supports updating and deleting documents.
+
+```python
+# Example: Delete all documents for a specific user
+delete_response = client.update_documents(
+ collection_name="my_knowledge_base",
+ update_strategy="delete",
+ filters={"user_id": "user_123"},
+)
+
+print(f"Deleted {delete_response.chunks_deleted} chunks.")
+```
+
+For more, see the [Document Management Guide](./guides/document-management.md).
diff --git a/docs/upload_flow_diagram.txt b/docs/upload_flow_diagram.txt
deleted file mode 100644
index 9023da7..0000000
--- a/docs/upload_flow_diagram.txt
+++ /dev/null
@@ -1,192 +0,0 @@
-╔══════════════════════════════════════════════════════════════════════════════╗
-║ INSTA_RAG DOCUMENT UPLOAD PIPELINE ║
-╚══════════════════════════════════════════════════════════════════════════════╝
-
- USER INPUT
- │
- ▼
- ┌────────────────────────────────────────────────────┐
- │ PHASE 1: DOCUMENT LOADING │
- │ ───────────────────────── │
- │ • Read file path or text input │
- │ • Generate unique document_id (UUID) │
- │ • Merge metadata │
- └───────────────────┬────────────────────────────────┘
- │
- ▼
- ┌────────────────────────────────────────────────────┐
- │ PHASE 2: TEXT EXTRACTION │
- │ ─────────────────────── │
- │ • Detect file type (.pdf, .txt, .md) │
- │ • Extract with pdfplumber (primary) │
- │ • Fallback to PyPDF2 if needed │
- │ • Join pages with double newlines │
- │ │
- │ Input: document.pdf (50 pages) │
- │ Output: "This is page 1...\n\nThis is page 2..." │
- └───────────────────┬────────────────────────────────┘
- │
- ▼
- ┌────────────────────────────────────────────────────┐
- │ PHASE 3: SEMANTIC CHUNKING │
- │ ───────────────────────── │
- │ Step 1: Split into sentences │
- │ ["Sent 1.", "Sent 2.", ...] │
- │ │
- │ Step 2: Embed sentences │
- │ [[0.1, 0.2, ...], [0.15, 0.19, ...]] │
- │ │
- │ Step 3: Calculate similarities │
- │ [0.95, 0.92, 0.45 ← low!, 0.88, ...] │
- │ │
- │ Step 4: Find breakpoints (topic changes) │
- │ Break at indices: [3, 7, 12] │
- │ │
- │ Step 5: Create chunks + add 20% overlap │
- │ │
- │ Input: 25,000 words │
- │ Output: 18 semantic chunks (avg 1,389 words) │
- └───────────────────┬────────────────────────────────┘
- │
- ▼
- ┌────────────────────────────────────────────────────┐
- │ PHASE 4: CHUNK VALIDATION │
- │ ──────────────────────── │
- │ • Validate minimum length (>= 10 chars) │
- │ • Count tokens per chunk │
- │ • Create ChunkMetadata objects │
- │ │
- │ Each chunk has: │
- │ - chunk_id: "uuid_chunk_0" │
- │ - content: "Chunk text..." │
- │ - metadata: {document_id, source, tokens, ...} │
- └───────────────────┬────────────────────────────────┘
- │
- ▼
- ┌────────────────────────────────────────────────────┐
- │ PHASE 5: BATCH EMBEDDING GENERATION │
- │ ────────────────────────────────── │
- │ • Extract content from all chunks │
- │ • Batch into groups of 100 │
- │ • Call Azure OpenAI API: │
- │ - Model: text-embedding-3-large │
- │ - Output: 3072-dimensional vectors │
- │ • Attach embeddings to chunk objects │
- │ │
- │ Input: 18 chunks │
- │ Output: 18 × [3072 floats] │
- │ Example: [0.023, -0.012, 0.045, ...] │
- └───────────────────┬────────────────────────────────┘
- │
- ▼
- ┌────────────────────────────────────────────────────┐
- │ PHASE 6: VECTOR & CONTENT STORAGE │
- │ ──────────────────────────────── │
- │ │
- │ ┌─────────────────────────────────────────┐ │
- │ │ 6A: Create/Verify Qdrant Collection │ │
- │ │ - Name: "my_documents" │ │
- │ │ - Vector size: 3072 │ │
- │ │ - Distance: COSINE │ │
- │ └─────────────────────────────────────────┘ │
- │ │
- │ ┌─────────────────────┬───────────────────────┐ │
- │ │ MongoDB Enabled? │ │ │
- │ └──────────┬──────────┴───────────────────────┘ │
- │ │ │
- │ YES ───┴─── NO │
- │ │ │ │
- │ ▼ ▼ │
- │ ┌─────────┐ ┌──────────────┐ │
- │ │ HYBRID │ │ QDRANT-ONLY │ │
- │ │ MODE │ │ MODE │ │
- │ └─────────┘ └──────────────┘ │
- └───────┬────────────┬───────────────────────────────┘
- │ │
- ▼ ▼
-
- ┌───────────────────┐ ┌──────────────────────┐
- │ HYBRID MODE │ │ QDRANT-ONLY MODE │
- │ ─────────── │ │ ────────────── │
- │ │ │ │
- │ MongoDB: │ │ Qdrant: │
- │ ───────── │ │ ─────── │
- │ { │ │ { │
- │ chunk_id, │ │ id: uuid, │
- │ content: "...",│ │ vector: [...], │
- │ metadata │ │ payload: { │
- │ } │ │ content: "...", │
- │ │ │ chunk_id, │
- │ Qdrant: │ │ metadata │
- │ ─────── │ │ } │
- │ { │ │ } │
- │ id: uuid, │ │ │
- │ vector: [...],│ │ All data in │
- │ payload: { │ │ single store │
- │ mongodb_id, │ │ │
- │ metadata │ │ │
- │ } │ │ │
- │ } │ │ │
- │ │ │ │
- │ Content in │ │ │
- │ MongoDB, │ │ │
- │ Vectors in │ │ │
- │ Qdrant │ │ │
- └───────────────────┘ └──────────────────────┘
-
-╔══════════════════════════════════════════════════════════════════════════════╗
-║ FINAL RESULT ║
-╚══════════════════════════════════════════════════════════════════════════════╝
-
-Qdrant Collection: "my_documents"
-├── Point 0: id="uuid-0", vector=[3072 dims], payload={chunk_id, metadata...}
-├── Point 1: id="uuid-1", vector=[3072 dims], payload={chunk_id, metadata...}
-├── Point 2: id="uuid-2", vector=[3072 dims], payload={chunk_id, metadata...}
-...
-└── Point 17: id="uuid-17", vector=[3072 dims], payload={chunk_id, metadata...}
-
-Status: ✅ Ready for semantic search
-Vector Similarity: COSINE distance
-Searchable Fields: All metadata fields + content (if not in MongoDB)
-
-╔══════════════════════════════════════════════════════════════════════════════╗
-║ PERFORMANCE STATS ║
-╚══════════════════════════════════════════════════════════════════════════════╝
-
-Example for 1 PDF (50 pages):
- • Chunking: 1,250 ms (Phase 3)
- • Embedding: 3,421 ms (Phase 5)
- • Upload: 890 ms (Phase 6)
- • ─────────────────────
- • TOTAL: 5,561 ms (~5.6 seconds)
-
- Documents: 1
- Chunks: 18
- Tokens: 12,500
- Vectors: 18 × 3072 dimensions
- Storage: Qdrant + MongoDB (hybrid)
-
-╔══════════════════════════════════════════════════════════════════════════════╗
-║ KEY TECHNICAL DECISIONS ║
-╚══════════════════════════════════════════════════════════════════════════════╝
-
-1. Semantic Chunking
- └─ Analyzes sentence similarity to find natural topic boundaries
- └─ Better than fixed-size chunks for preserving context
-
-2. Deterministic UUIDs
- └─ uuid.uuid5(NAMESPACE_DNS, chunk_id)
- └─ Same chunk always gets same ID → idempotent uploads
-
-3. Hybrid Storage
- └─ Qdrant: Fast vector search (optimized for embeddings)
- └─ MongoDB: Cheaper text storage
- └─ Best of both worlds
-
-4. Batch Processing
- └─ 100 chunks per API call
- └─ Reduces latency and costs
-
-5. 20% Overlap
- └─ Prevents information loss at chunk boundaries
- └─ Improves retrieval quality