diff --git a/README.md b/README.md index efa5625..f634b8b 100644 --- a/README.md +++ b/README.md @@ -1,6 +1,140 @@ # Restaurant Recommendation System -Architecture documentation for a Restaurant Recommendation System. +A scalable, ML-powered restaurant recommendation system that delivers personalized dining suggestions based on user preferences, location, behavior history, and contextual signals. + +## Table of Contents + +- [Architecture Overview](#architecture-overview) +- [Core Modules](#core-modules) +- [Tech Stack](#tech-stack) +- [Documentation Index](#documentation-index) + +--- + +## Architecture Overview + +```mermaid +graph TB + subgraph Client["Client Layer"] + MobileApp["πŸ“± Mobile App"] + WebApp["🌐 Web App"] + ThirdParty["πŸ”— Third-party API"] + end + + subgraph Gateway["API Gateway Layer"] + LB["Load Balancer"] + APIGW["API Gateway\n(Auth Β· Rate Limit Β· Routing)"] + end + + subgraph Serving["Serving Layer"] + RecAPI["Recommendation\nService (FastAPI)"] + SearchAPI["Search\nRe-rank Service"] + end + + subgraph RecsEngine["Recommendation Engine"] + CandGen["Candidate\nGeneration"] + Ranking["Scoring &\nRanking"] + ReRank["Re-ranking &\nBusiness Logic"] + end + + subgraph DataLayer["Data Layer"] + Redis[("Redis Cache")] + Postgres[("PostgreSQL")] + Elasticsearch[("Elasticsearch")] + Milvus[("Milvus\nVector DB")] + end + + subgraph MLPlatform["ML Platform"] + FeatureStore["Feature Store"] + ModelRegistry["Model Registry\n(MLflow)"] + Training["Training\nPipeline"] + end + + subgraph Streaming["Streaming Layer"] + Kafka["Apache Kafka"] + Flink["Apache Flink"] + end + + subgraph Observability["Observability"] + Prometheus["Prometheus"] + Grafana["Grafana"] + ELK["ELK Stack"] + end + + Client --> LB --> APIGW --> Serving + Serving --> RecsEngine + RecsEngine --> DataLayer + RecsEngine --> FeatureStore + MLPlatform --> RecsEngine + Streaming --> FeatureStore + Serving --> Observability +``` + +--- + +## Core Modules + +| # | Module | Description | 
+|---|--------|-------------| +| 01 | [Data Ingestion](docs/modules/01-data-ingestion.md) | Clickstream tracking, restaurant catalog sync, user profile aggregation, review ingestion | +| 02 | [Recommendation Engine](docs/modules/02-recommendation-engine.md) | Candidate generation β†’ Scoring/Ranking β†’ Re-ranking with business rules | +| 03 | [Serving & Delivery](docs/modules/03-serving-delivery.md) | REST API endpoints, 4-layer caching, real-time personalization stream | +| 04 | [Feature Store](docs/modules/04-feature-store.md) | Online/Offline store, feature registry, <5ms latency SLA | +| 05 | [ML Platform](docs/modules/05-ml-platform.md) | Training pipelines, model lifecycle, canary/blue-green deployment | +| 06 | [Experimentation](docs/modules/06-experimentation.md) | A/B testing framework, statistical testing, experiment lifecycle | +| 07 | [Observability](docs/modules/07-observability.md) | System monitoring, ML drift detection, business metrics, alerting | +| 08 | [Data Pipeline](docs/modules/08-data-pipeline.md) | Airflow DAGs, Kafka+Flink streaming, SLAs, retry strategies | + +--- + +## Tech Stack + +| Layer | Technology | +|-------|-----------| +| API | FastAPI (Python), Go (high-throughput endpoints) | +| ML Framework | PyTorch, TensorFlow, LightGBM, XGBoost | +| Feature Store | Redis (online), S3 + BigQuery (offline) | +| Vector Search | Milvus (HNSW ANN), Elasticsearch | +| Streaming | Apache Kafka, Apache Flink | +| Batch Processing | Apache Spark, Apache Airflow | +| Relational DB | PostgreSQL | +| Cache | Redis Cluster | +| Model Serving | TensorFlow Serving, NVIDIA Triton | +| Model Registry | MLflow | +| Experiment Tracking | Evidently AI | +| Monitoring | Prometheus, Grafana, Metabase | +| Logging | ELK Stack / Grafana Loki | +| CI/CD | GitHub Actions | +| Container / Orchestration | Docker, Kubernetes | + +--- + +## Documentation Index + +### Architecture +- [System Overview](docs/architecture/system-overview.md) β€” System diagram, request 
lifecycle, latency budgets +- [Tech Stack](docs/architecture/tech-stack.md) β€” Technology choices and justification +- [Data Flow](docs/architecture/data-flow.md) β€” Real-time, batch, and streaming flow diagrams + +### Modules +- [01 Data Ingestion](docs/modules/01-data-ingestion.md) +- [02 Recommendation Engine](docs/modules/02-recommendation-engine.md) +- [03 Serving & Delivery](docs/modules/03-serving-delivery.md) +- [04 Feature Store](docs/modules/04-feature-store.md) +- [05 ML Platform](docs/modules/05-ml-platform.md) +- [06 Experimentation](docs/modules/06-experimentation.md) +- [07 Observability](docs/modules/07-observability.md) +- [08 Data Pipeline](docs/modules/08-data-pipeline.md) + +### Schemas +- [User Event Schema](docs/schemas/user-event.md) +- [Restaurant Catalog Schema](docs/schemas/restaurant-catalog.md) +- [User Profile Schema](docs/schemas/user-profile.md) +- [Review & Rating Schema](docs/schemas/review-rating.md) + +### API +- [API Specification](docs/api/api-specification.md) + +### Infrastructure +- [Deployment](docs/infrastructure/deployment.md) -## Status -🚧 Under Development diff --git a/docs/api/api-specification.md b/docs/api/api-specification.md new file mode 100644 index 0000000..e9df2c2 --- /dev/null +++ b/docs/api/api-specification.md @@ -0,0 +1,527 @@ +# API Specification + +## Table of Contents + +- [Overview](#overview) +- [Authentication](#authentication) +- [Rate Limiting](#rate-limiting) +- [Common Request Headers](#common-request-headers) +- [Common Response Format](#common-response-format) +- [Error Handling](#error-handling) +- [Endpoints](#endpoints) + - [GET /api/v1/recommendations/homepage](#get-apiv1recommendationshomepage) + - [GET /api/v1/recommendations/nearby](#get-apiv1recommendationsnearby) + - [GET /api/v1/recommendations/similar/{restaurant_id}](#get-apiv1recommendationssimilarrestaurant_id) + - [GET /api/v1/recommendations/complementary/{restaurant_id}](#get-apiv1recommendationscomplementaryrestaurant_id) + - 
[POST /api/v1/recommendations/search-rerank](#post-apiv1recommendationssearch-rerank) + - [POST /api/v1/recommendations/group](#post-apiv1recommendationsgroup) +- [SLA Summary](#sla-summary) + +--- + +## Overview + +| Property | Value | +|----------|-------| +| **Base URL** | `https://api.restaurantrec.example.com` | +| **API Version** | `v1` | +| **Protocol** | HTTPS only | +| **Format** | JSON (application/json) | +| **Encoding** | UTF-8 | +| **API Style** | RESTful | + +--- + +## Authentication + +All endpoints require a valid Bearer JWT token in the `Authorization` header. + +``` +Authorization: Bearer +``` + +JWT tokens are issued by the authentication service with: +- **Expiry**: 1 hour (access token), 30 days (refresh token) +- **Claims**: `user_id`, `exp`, `iat`, `scope` +- **Algorithm**: RS256 + +Server-to-server integrations use API keys with the `X-API-Key` header (rate limit: 1,000 req/min). + +--- + +## Rate Limiting + +| Client Type | Limit | Window | Header Returned | +|-------------|-------|--------|----------------| +| Authenticated user (JWT) | 100 requests | 1 minute | `X-RateLimit-Remaining` | +| Server API key | 1,000 requests | 1 minute | `X-RateLimit-Remaining` | +| Unauthenticated IP | 20 requests | 1 minute | `X-RateLimit-Remaining` | +| Group endpoint (per user) | 10 requests | 1 minute | `X-RateLimit-Remaining` | + +When rate limit is exceeded, the API returns `HTTP 429` with a `Retry-After` header. + +--- + +## Common Request Headers + +| Header | Required | Description | +|--------|----------|-------------| +| `Authorization` | βœ… | `Bearer ` | +| `Content-Type` | For POST | `application/json` | +| `Accept` | β€” | `application/json` (default) | +| `X-Request-ID` | Recommended | Client-generated UUID for request tracing | +| `X-API-Key` | For server auth | Alternative to Bearer token for server-to-server | + +--- + +## Common Response Format + +### Success Response + +```json +{ + "request_id": "req_01HXZ789ABC", + "data": { ... 
}, + "meta": { + "total": 20, + "model_version": "ranking_v2.4.1", + "latency_ms": 42, + "cache_hit": false, + "experiment_variants": { + "ranking_model": "treatment_a" + } + } +} +``` + +### Recommendation Item + +All recommendation endpoints return an array of `RecommendationItem`: + +```json +{ + "restaurant_id": "rst_3b7c9e1a2f4d5e6b", + "name": "Pho Saigon House", + "score": 0.92, + "rank": 1, + "distance_km": 1.2, + "primary_cuisine": "vietnamese", + "cuisine_types": ["vietnamese", "asian", "noodles"], + "price_tier": "$$", + "avg_rating": 4.3, + "total_reviews": 1842, + "is_open_now": true, + "photo_url": "https://cdn.example.com/restaurants/rst_3b7c9e/photo_1.jpg", + "address": "123 Nguyen Hue Boulevard, District 1", + "lat": 10.7769, + "lng": 106.7009, + "ambiance_tags": ["casual", "family-friendly"], + "dietary_highlights": ["halal", "vegetarian_options"], + "reason": "Based on your love of Vietnamese cuisine" +} +``` + +--- + +## Error Handling + +### Error Response Format + +```json +{ + "error": { + "code": "INVALID_PARAMETERS", + "message": "Query parameter 'radius_km' must be between 1 and 50.", + "details": { + "field": "radius_km", + "provided_value": 100, + "valid_range": [1, 50] + } + }, + "request_id": "req_01HXZ789ABC", + "timestamp": "2024-03-15T14:32:07.123Z" +} +``` + +### HTTP Status Codes + +| HTTP Status | Error Code | Description | +|-------------|-----------|-------------| +| `400 Bad Request` | `INVALID_PARAMETERS` | Missing required param or out-of-range value | +| `401 Unauthorized` | `UNAUTHORIZED` | Missing, invalid, or expired JWT token | +| `403 Forbidden` | `FORBIDDEN` | Valid token but insufficient permissions | +| `404 Not Found` | `RESTAURANT_NOT_FOUND` | Specified `restaurant_id` does not exist | +| `429 Too Many Requests` | `RATE_LIMIT_EXCEEDED` | Rate limit exceeded | +| `500 Internal Server Error` | `INTERNAL_ERROR` | Unexpected server error (with fallback) | +| `503 Service Unavailable` | `RECOMMENDATION_ENGINE_TIMEOUT` 
| Engine exceeded latency budget; fallback returned | +| `503 Service Unavailable` | `FEATURE_STORE_UNAVAILABLE` | Feature store unreachable; fallback returned | + +--- + +## Endpoints + +### GET /api/v1/recommendations/homepage + +Returns personalized recommendations for the user's home feed. + +**SLA**: p50: 50ms, p99: 200ms + +#### Request + +``` +GET /api/v1/recommendations/homepage?limit=20&lat=10.7769&lng=106.7009&radius_km=5 +Authorization: Bearer +``` + +#### Query Parameters + +| Parameter | Type | Default | Required | Description | +|-----------|------|---------|----------|-------------| +| `limit` | integer | 20 | β€” | Number of recommendations (1–50) | +| `offset` | integer | 0 | β€” | Pagination offset | +| `lat` | float | β€” | β€” | User latitude; falls back to profile location | +| `lng` | float | β€” | β€” | User longitude | +| `radius_km` | float | 5.0 | β€” | Search radius in km (1–50) | +| `open_now` | boolean | false | β€” | Filter to only open restaurants | +| `exclude_visited` | boolean | false | β€” | Exclude recently visited restaurants | + +#### Response (200 OK) + +```json +{ + "request_id": "req_01HXZ789ABC", + "data": { + "recommendations": [ + { + "restaurant_id": "rst_3b7c9e1a2f4d5e6b", + "name": "Pho Saigon House", + "score": 0.92, + "rank": 1, + "distance_km": 1.2, + "primary_cuisine": "vietnamese", + "cuisine_types": ["vietnamese", "asian", "noodles"], + "price_tier": "$$", + "avg_rating": 4.3, + "total_reviews": 1842, + "is_open_now": true, + "photo_url": "https://cdn.example.com/restaurants/rst_3b7c9e/photo_1.jpg", + "address": "123 Nguyen Hue Boulevard, District 1", + "lat": 10.7769, + "lng": 106.7009, + "ambiance_tags": ["casual", "family-friendly"], + "dietary_highlights": ["halal", "vegetarian_options"], + "reason": "Based on your love of Vietnamese cuisine" + } + ] + }, + "meta": { + "total": 20, + "model_version": "ranking_v2.4.1", + "latency_ms": 42, + "cache_hit": false, + "experiment_variants": { + "ranking_model": 
"treatment_a" + } + } +} +``` + +--- + +### GET /api/v1/recommendations/nearby + +Returns top-rated and personalized nearby restaurants based on the user's current location. + +**SLA**: p50: 40ms, p99: 150ms + +#### Request + +``` +GET /api/v1/recommendations/nearby?lat=10.7769&lng=106.7009&radius_km=3&open_now=true +Authorization: Bearer +``` + +#### Query Parameters + +| Parameter | Type | Default | Required | Description | +|-----------|------|---------|----------|-------------| +| `lat` | float | β€” | βœ… | User latitude | +| `lng` | float | β€” | βœ… | User longitude | +| `radius_km` | float | 3.0 | β€” | Search radius in km (1–50) | +| `limit` | integer | 20 | β€” | Max results (1–50) | +| `cuisine` | string | β€” | β€” | Filter by cuisine type | +| `price_tier` | string | β€” | β€” | Filter: `$`, `$$`, `$$$`, `$$$$` | +| `open_now` | boolean | false | β€” | Only return currently open restaurants | +| `dietary` | string | β€” | β€” | Comma-separated: `vegetarian`, `vegan`, `halal`, `gluten_free` | + +#### Response (200 OK) + +Same structure as homepage endpoint with `"reason": "Near your location"`. + +--- + +### GET /api/v1/recommendations/similar/{restaurant_id} + +Returns restaurants similar to a given restaurant. 
+ +**SLA**: p50: 30ms, p99: 100ms + +#### Request + +``` +GET /api/v1/recommendations/similar/rst_3b7c9e1a2f4d5e6b?limit=10&lat=10.7769&lng=106.7009 +Authorization: Bearer +``` + +#### Path Parameters + +| Parameter | Type | Required | Description | +|-----------|------|----------|-------------| +| `restaurant_id` | string | βœ… | Source restaurant ID | + +#### Query Parameters + +| Parameter | Type | Default | Description | +|-----------|------|---------|-------------| +| `limit` | integer | 10 | Number of similar restaurants (1–20) | +| `lat` | float | β€” | User location for geo-scoring | +| `lng` | float | β€” | User location for geo-scoring | + +#### Response (200 OK) + +```json +{ + "request_id": "req_02HXZ890BCD", + "data": { + "source_restaurant_id": "rst_3b7c9e1a2f4d5e6b", + "recommendations": [ + { + "restaurant_id": "rst_4c8d0f2b3e5f6a7c", + "name": "Bun Bo Nam Bo", + "score": 0.88, + "rank": 1, + "similarity_score": 0.91, + "primary_cuisine": "vietnamese", + "price_tier": "$$", + "avg_rating": 4.4, + "is_open_now": true, + "distance_km": 0.8, + "reason": "Similar Vietnamese cuisine" + } + ] + }, + "meta": { + "total": 10, + "model_version": "similarity_v1.2.0", + "latency_ms": 28, + "cache_hit": true + } +} +``` + +--- + +### GET /api/v1/recommendations/complementary/{restaurant_id} + +Returns "People also liked" recommendations for users who interacted with a given restaurant. 
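For illustration, a minimal client sketch for the GET endpoints in this spec, using only the Python standard library. The `build_request` helper is hypothetical (not part of any official SDK), and the token value is a placeholder:

```python
import urllib.parse
import urllib.request

BASE_URL = "https://api.restaurantrec.example.com"  # Base URL from the Overview table

def build_request(path: str, token: str, **params) -> urllib.request.Request:
    """Build an authenticated GET request for a recommendation endpoint (hypothetical helper)."""
    # Drop unset parameters so optional query params can be passed as None.
    query = urllib.parse.urlencode({k: v for k, v in params.items() if v is not None})
    url = f"{BASE_URL}{path}" + (f"?{query}" if query else "")
    return urllib.request.Request(url, headers={
        "Authorization": f"Bearer {token}",  # JWT issued by the auth service
        "Accept": "application/json",
    })

# Usage: send with urllib.request.urlopen(req) and parse the JSON body.
req = build_request(
    "/api/v1/recommendations/complementary/rst_3b7c9e1a2f4d5e6b",
    token="<jwt>",
    limit=10,
)
```

The POST endpoints differ only in attaching a JSON body and a `Content-Type: application/json` header.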
+ +**SLA**: p50: 30ms, p99: 100ms + +#### Request + +``` +GET /api/v1/recommendations/complementary/rst_3b7c9e1a2f4d5e6b?limit=10 +Authorization: Bearer +``` + +#### Path Parameters + +| Parameter | Type | Required | Description | +|-----------|------|----------|-------------| +| `restaurant_id` | string | βœ… | Source restaurant ID | + +#### Query Parameters + +| Parameter | Type | Default | Description | +|-----------|------|---------|-------------| +| `limit` | integer | 10 | Number of results (1–20) | +| `lat` | float | β€” | User location for geo-scoring | +| `lng` | float | β€” | User location for geo-scoring | + +--- + +### POST /api/v1/recommendations/search-rerank + +Re-ranks a set of search results using personalization signals. + +**SLA**: p50: 60ms, p99: 200ms + +#### Request + +``` +POST /api/v1/recommendations/search-rerank +Authorization: Bearer +Content-Type: application/json +``` + +#### Request Body + +```json +{ + "query": "sushi downtown", + "candidate_ids": [ + "rst_abc123", + "rst_def456", + "rst_ghi789", + "rst_jkl012" + ], + "context": { + "lat": 10.7769, + "lng": 106.7009, + "session_id": "sess_7e2f1a3b4c5d6e7f" + }, + "limit": 10 +} +``` + +| Field | Type | Required | Description | +|-------|------|----------|-------------| +| `query` | string | βœ… | Original search query | +| `candidate_ids` | string[] | βœ… | Restaurant IDs to re-rank (max 200) | +| `context.lat` | float | β€” | User location | +| `context.lng` | float | β€” | User location | +| `context.session_id` | string | β€” | Session ID for personalization context | +| `limit` | integer | β€” | Max results to return (default: all) | + +#### Response (200 OK) + +```json +{ + "request_id": "req_03HXZ901CDE", + "data": { + "query": "sushi downtown", + "recommendations": [ + { + "restaurant_id": "rst_def456", + "name": "Sakura Sushi Bar", + "score": 0.89, + "rank": 1, + "original_rank": 2, + "primary_cuisine": "japanese", + "avg_rating": 4.6, + "price_tier": "$$$", + "is_open_now": 
true, + "distance_km": 0.5 + } + ] + }, + "meta": { + "total": 4, + "model_version": "rerank_v1.1.0", + "latency_ms": 55, + "cache_hit": false + } +} +``` + +--- + +### POST /api/v1/recommendations/group + +Generates recommendations for group dining by aggregating preferences of multiple users. + +**SLA**: p50: 80ms, p99: 250ms + +#### Request + +``` +POST /api/v1/recommendations/group +Authorization: Bearer +Content-Type: application/json +``` + +#### Request Body + +```json +{ + "user_ids": [ + "usr_9f3a8b2c1d4e5f6a", + "usr_1a2b3c4d5e6f7a8b", + "usr_2b3c4d5e6f7a8b9c" + ], + "context": { + "lat": 10.7769, + "lng": 106.7009, + "occasion": "business", + "party_size": 3 + }, + "constraints": { + "dietary": ["halal"], + "price_tier_max": "$$$", + "radius_km": 5.0, + "open_now": true + }, + "limit": 10 +} +``` + +| Field | Type | Required | Description | +|-------|------|----------|-------------| +| `user_ids` | string[] | βœ… | User IDs in the group (max 10) | +| `context.lat` | float | β€” | Meeting location | +| `context.lng` | float | β€” | Meeting location | +| `context.occasion` | string | β€” | `business`, `casual`, `celebration`, `date` | +| `context.party_size` | integer | β€” | Number of people (overrides user_ids count) | +| `constraints.dietary` | string[] | β€” | Dietary requirements for the whole group | +| `constraints.price_tier_max` | string | β€” | Maximum price tier | +| `constraints.radius_km` | float | β€” | Search radius | +| `constraints.open_now` | boolean | β€” | Filter to open restaurants | +| `limit` | integer | β€” | Max results (default 10) | + +#### Response (200 OK) + +```json +{ + "request_id": "req_04HXZ012DEF", + "data": { + "group_size": 3, + "recommendations": [ + { + "restaurant_id": "rst_xyz789", + "name": "The Meeting Room Restaurant", + "score": 0.85, + "rank": 1, + "group_compatibility_score": 0.92, + "primary_cuisine": "international", + "price_tier": "$$$", + "avg_rating": 4.5, + "is_open_now": true, + "distance_km": 1.1, + 
"capacity": { "private_rooms": 2, "total_seats": 120 }, + "reason": "Matches cuisine preferences for all 3 members" + } + ] + }, + "meta": { + "total": 10, + "model_version": "group_v1.0.0", + "latency_ms": 76, + "cache_hit": false + } +} +``` + +--- + +## SLA Summary + +| Endpoint | Method | p50 | p99 | Cache TTL | Notes | +|----------|--------|-----|-----|-----------|-------| +| `/recommendations/homepage` | GET | 50ms | 200ms | Redis: 30min, CDN: 10min | High cache hit rate | +| `/recommendations/nearby` | GET | 40ms | 150ms | Redis: 15min, CDN: 5min | Location-sensitive | +| `/recommendations/similar/{id}` | GET | 30ms | 100ms | Redis: 2h, CDN: 30min | Pre-computable | +| `/recommendations/complementary/{id}` | GET | 30ms | 100ms | Redis: 2h, CDN: 30min | Pre-computable | +| `/recommendations/search-rerank` | POST | 60ms | 200ms | Not cached | Query-specific | +| `/recommendations/group` | POST | 80ms | 250ms | Redis: 15min | Multi-user aggregation | + +All SLAs measured at the API Gateway (excluding client network time). Monitored via Prometheus with Grafana alerting on p99 breaches. diff --git a/docs/architecture/data-flow.md b/docs/architecture/data-flow.md new file mode 100644 index 0000000..30a3be7 --- /dev/null +++ b/docs/architecture/data-flow.md @@ -0,0 +1,201 @@ +# Data Flow + +## Table of Contents + +- [Real-time Recommendation Flow](#real-time-recommendation-flow) +- [Batch Processing Flow](#batch-processing-flow) +- [Event Streaming Flow](#event-streaming-flow) + +--- + +## Real-time Recommendation Flow + +This flow describes how a user request is processed to produce personalized recommendations in real time (cache miss path). 
+ +```mermaid +flowchart TD + A["πŸ‘€ User Request\nGET /api/v1/recommendations/homepage"] --> B["API Gateway\nΒ· Auth Β· Rate Limit"] + B --> C{CDN/Redis\nCache Hit?} + C -- HIT --> Z["βœ… Return Cached Response\n~5ms"] + C -- MISS --> D["Recommendation Service"] + + D --> E["Fetch User Features\nFeature Store (Redis)\n~3ms"] + E --> F["Candidate Generation\n~10ms"] + + F --> F1["Collaborative Filtering\n(Two-Tower embedding ANN via Milvus)"] + F --> F2["Content-Based Filtering\n(Item embedding ANN via Milvus)"] + F --> F3["Geo / Location Filtering\n(Elasticsearch geo_distance)"] + F --> F4["Popularity / Trending\n(Redis Sorted Set)"] + F --> F5["Cuisine Matching\n(Postgres lookup)"] + + F1 & F2 & F3 & F4 & F5 --> G["Merge & Deduplicate\n500-2000 candidates"] + + G --> H["Batch Fetch Restaurant Features\nFeature Store (Redis)\n~5ms"] + H --> I["Scoring & Ranking\nLightGBM / DeepFM inference\n~15ms"] + + I --> J["Re-ranking & Business Rules\n~5ms"] + J --> J1["MMR Diversity Injection"] + J --> J2["Operating Hours Filter"] + J --> J3["Sponsored / Fairness Boost"] + J --> J4["Cold Start Handling"] + + J1 & J2 & J3 & J4 --> K["Final Top 10-50 Results"] + K --> L["Store in Redis Cache\nTTL: 30min"] + L --> M["βœ… Return Response\n~40-60ms total"] +``` + +### Flow Description + +1. **Request arrives** at the API Gateway, which validates the JWT token and checks rate limits +2. **Cache check**: the system checks CDN edge cache (TTL 5–15 min) and Redis cache (TTL 30 min) before executing the full pipeline +3. **User features** are fetched from the online feature store (Redis) β€” includes user embedding (128d), segment, cuisine preferences, price preference, recency signals +4. 
**Candidate generation** runs multiple retrieval strategies in parallel: + - *Collaborative filtering*: ANN search on Two-Tower user/item embeddings via Milvus + - *Content-based*: ANN search on item text/image embeddings + - *Geo filtering*: Elasticsearch geo_distance query within configurable radius (1–50km) + - *Popularity*: Top-N restaurants from Redis time-decayed sorted set + - *Cuisine matching*: Direct lookup of restaurants matching user's top cuisine preferences +5. **Candidates are merged** and deduplicated, yielding 500–2000 candidates +6. **Restaurant features** are batch-fetched from the online feature store +7. **Scoring and ranking** uses a pre-loaded LightGBM/DeepFM model to score each candidate with cross features (user Γ— restaurant Γ— context) +8. **Re-ranking** applies business rules, diversity constraints, and filters +9. **Result is cached** in Redis and returned to the client + +--- + +## Batch Processing Flow + +This flow describes the daily/weekly offline pipelines for model training, feature computation, and catalog enrichment. 
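One concrete step in these pipelines, training-data construction via positive/negative sampling, can be sketched with the standard library. The 1:4 default ratio, the row layout, and uniform sampling are illustrative assumptions, not details of the actual Spark job:

```python
import random

def build_training_pairs(positives, catalog, negatives_per_positive=4, seed=42):
    """Pair each observed (user, restaurant) interaction with sampled negatives.

    positives: iterable of (user_id, restaurant_id) engagement pairs (label 1).
    catalog:   list of all restaurant_ids eligible to serve as negatives.
    """
    positives = list(positives)
    rng = random.Random(seed)
    seen_by_user: dict[str, set] = {}
    for user_id, restaurant_id in positives:
        seen_by_user.setdefault(user_id, set()).add(restaurant_id)

    rows = []
    for user_id, restaurant_id in positives:
        rows.append((user_id, restaurant_id, 1))            # positive example
        # Never sample a restaurant the user actually engaged with as a negative.
        pool = [r for r in catalog if r not in seen_by_user[user_id]]
        for neg in rng.sample(pool, min(negatives_per_positive, len(pool))):
            rows.append((user_id, neg, 0))                  # sampled negative
    return rows
```

Production pipelines usually prefer exposure-based negatives (impressed but not clicked) over uniform sampling; the uniform sampler here is just the simplest instance of the idea.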
+ +```mermaid +flowchart LR + subgraph Sources["Data Sources"] + EventLog["Event Log\n(S3/GCS)"] + PG["PostgreSQL\n(Catalog, Users, Orders)"] + ReviewDB["Review DB\n(PostgreSQL)"] + end + + subgraph Spark["Spark Batch Jobs (Airflow-orchestrated)"] + FeatureEng["Feature Engineering\nΒ· User behavior aggregates\nΒ· Restaurant interaction stats\nΒ· Temporal features"] + EmbeddingJob["Embedding Generation\nΒ· User collaborative embeddings\nΒ· Restaurant text/image embeddings"] + TrainingDataJob["Training Data Construction\nΒ· Positive/negative sampling\nΒ· Label assignment\nΒ· Feature joining"] + end + + subgraph Training["ML Training"] + CFTrain["Collaborative Filtering Training\n(ALS / Two-Tower / Neural CF)"] + RankTrain["Ranking Model Training\n(LightGBM / XGBoost / DeepFM)"] + Evaluate["Model Evaluation\nΒ· AUC-ROC Β· NDCG@K\nΒ· MAP@K Β· Coverage Β· Diversity"] + end + + subgraph Registry["Model Registry"] + MLflowReg["MLflow Registry\nΒ· Version tagging\nΒ· Artifact storage\nΒ· Lineage tracking"] + end + + subgraph Deployment["Serving Update"] + Canary["Canary Deploy\n5% traffic, 2h window"] + Promote["Promote to 100%\n(if metrics pass)"] + Rollback["Rollback\n(if guardrail breached)"] + end + + subgraph OfflineStore["Offline Feature Store"] + S3Parquet["S3/GCS Parquet\n(daily snapshots)"] + BigQuery["BigQuery\n(analytics queries)"] + end + + subgraph OnlineStore["Online Feature Store"] + RedisCluster["Redis Cluster\n(warm-up from batch)"] + MilvusIdx["Milvus Index\n(embedding rebuild)"] + end + + Sources --> Spark + Spark --> OfflineStore + Spark --> Training + OfflineStore --> Training + Training --> Evaluate + Evaluate --> MLflowReg + MLflowReg --> Canary + Canary --> Promote + Canary --> Rollback + OfflineStore --> OnlineStore + EmbeddingJob --> MilvusIdx +``` + +### Batch Pipeline Schedule + +| Pipeline | Schedule | SLA | Description | +|----------|----------|-----|-------------| +| `user_profile_rebuild` | Daily 2:00 AM | 5:00 AM | Rebuild user behavioral 
aggregates and preference vectors | +| `restaurant_catalog_refresh` | Daily 1:00 AM | 4:00 AM | Sync restaurant catalog, compute NLP tags and image embeddings | +| `feature_engineering` | Daily 3:00 AM | 6:00 AM | Compute all offline features, write to S3 Parquet + BigQuery | +| `online_store_warmup` | Daily 6:30 AM | 7:30 AM | Populate Redis from latest offline feature snapshot | +| `model_retraining` | Weekly, Sunday 3:00 AM | 8:00 AM | Full model retrain (CF + ranking) with last 30 days of data | +| `embedding_index_rebuild` | Weekly, Sunday 6:00 AM | 10:00 AM | Rebuild Milvus HNSW index with updated embeddings | + +--- + +## Event Streaming Flow + +This flow describes how real-time user behavior events are ingested, processed, and used to update session features and trigger near-real-time re-scoring. + +```mermaid +flowchart TD + subgraph Clients["Client Applications"] + MobileSDK["πŸ“± Mobile SDK"] + WebSDK["🌐 Web SDK (JS)"] + ServerEvents["Server-side Events\n(Order, Review)"] + end + + subgraph Kafka["Apache Kafka Topics"] + RawEvents["Topic: user-events-raw\n(partitioned by user_id)"] + EnrichedEvents["Topic: user-events-enriched"] + SessionUpdates["Topic: session-feature-updates"] + RecTriggers["Topic: recs-refresh-triggers"] + end + + subgraph Flink["Apache Flink Jobs"] + Validate["Event Validation\nΒ· Schema check\nΒ· Dedup (event_id)\nΒ· Late event handling"] + Enrich["Event Enrichment\nΒ· Resolve restaurant_id metadata\nΒ· Add geo context\nΒ· Lookup user segment"] + SessionAgg["Session Aggregation\nΒ· Real-time session features\nΒ· Click sequences\nΒ· Dwell time"] + TriggerLogic["Trigger Logic\nΒ· ORDER_PLACED β†’ immediate refresh\nΒ· N clicks β†’ soft refresh\nΒ· Idle >30min β†’ session end"] + end + + subgraph FeatureUpdates["Feature Updates"] + RedisUpdate["Redis Cache Update\n(session features, <5s latency)"] + CDNInvalidate["CDN Cache Invalidation\n(on significant behavior change)"] + end + + subgraph Analytics["Analytics & Monitoring"] + 
ClickstreamAnalytics["Clickstream Analytics\n(BigQuery / Elasticsearch)"] + ABTracking["A/B Experiment Event Tracking"] + AnomalyDetect["Anomaly Detection\n(unusual event patterns)"] + end + + MobileSDK & WebSDK & ServerEvents --> RawEvents + RawEvents --> Validate + Validate --> Enrich + Enrich --> EnrichedEvents + EnrichedEvents --> SessionAgg + SessionAgg --> SessionUpdates + SessionUpdates --> RedisUpdate + SessionAgg --> TriggerLogic + TriggerLogic --> RecTriggers + RecTriggers --> CDNInvalidate + EnrichedEvents --> ClickstreamAnalytics + EnrichedEvents --> ABTracking + EnrichedEvents --> AnomalyDetect +``` + +### Event Processing SLAs + +| Stage | Latency Target | Description | +|-------|---------------|-------------| +| SDK β†’ Kafka | <500ms | Client SDK batches events and flushes every 500ms | +| Kafka β†’ Flink (ingestion lag) | <1s | Flink consumer lag monitoring; alert if >5s | +| Flink validation + enrichment | <1s | Per-event processing | +| Session feature update (Redis) | <5s end-to-end | From user action to updated feature in Redis | +| Recommendation refresh trigger | <10s | After ORDER_PLACED event triggers fresh recommendations | +| Analytics sink (BigQuery) | <5min | Micro-batch write to BigQuery for analytics | + +### Event Deduplication Strategy + +Events are deduplicated using the `event_id` (UUID v4 generated by the client SDK) within a **60-second deduplication window** in Flink. Late-arriving events (up to 5 minutes) are handled via Flink's watermark mechanism and routed to a separate late-events correction topic to avoid corrupting real-time session state. 
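A single-process sketch of that deduplication window; Flink would hold this as keyed state per `event_id`, but the bookkeeping is the same:

```python
class DedupWindow:
    """Drop events whose event_id was already seen within `window_s` seconds."""

    def __init__(self, window_s: float = 60.0):
        self.window_s = window_s
        self._seen: dict[str, float] = {}    # event_id -> first-seen timestamp

    def accept(self, event_id: str, now: float) -> bool:
        # Evict entries older than the window so state stays bounded.
        cutoff = now - self.window_s
        self._seen = {eid: ts for eid, ts in self._seen.items() if ts >= cutoff}
        if event_id in self._seen:
            return False                      # duplicate inside the window
        self._seen[event_id] = now
        return True
```

Note that duplicates do not refresh the stored timestamp here, so an id becomes acceptable again once its first sighting falls out of the window.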
diff --git a/docs/architecture/system-overview.md b/docs/architecture/system-overview.md new file mode 100644 index 0000000..af9bc38 --- /dev/null +++ b/docs/architecture/system-overview.md @@ -0,0 +1,259 @@ +# System Overview + +## Table of Contents + +- [System Architecture Diagram](#system-architecture-diagram) +- [Core Components](#core-components) +- [Request Lifecycle](#request-lifecycle) +- [Latency Budget](#latency-budget) +- [Deployment Topology](#deployment-topology) + +--- + +## System Architecture Diagram + +```mermaid +graph TB + subgraph Clients["Client Layer"] + Mobile["πŸ“± Mobile App"] + Web["🌐 Web App"] + Partner["πŸ”— Partner API"] + end + + subgraph Ingress["Ingress Layer"] + CDN["CDN / Edge Cache\n(CloudFront)"] + LB["Load Balancer\n(L7 β€” NGINX / ALB)"] + end + + subgraph Gateway["API Gateway"] + APIGW["API Gateway\nΒ· JWT Auth\nΒ· Rate Limiting\nΒ· Request Routing\nΒ· TLS Termination"] + end + + subgraph ServingLayer["Serving Layer"] + RecSvc["Recommendation Service\n(FastAPI)"] + SearchSvc["Search Re-rank Service\n(FastAPI)"] + CatalogSvc["Catalog Service\n(Go)"] + end + + subgraph Engine["Recommendation Engine"] + CandGen["Candidate Generation\n(500-2000 candidates)"] + Ranker["Scoring & Ranking\n(top 50-200)"] + ReRanker["Re-ranking & Business Rules\n(final top 10-50)"] + end + + subgraph DataStores["Data Layer"] + Redis[("Redis Cluster\n(Cache / Sessions)")] + Postgres[("PostgreSQL\n(User / Restaurant data)")] + ES[("Elasticsearch\n(Full-text / Geo search)")] + Milvus[("Milvus\n(Vector / ANN search)")] + S3[("S3 / GCS\n(Offline feature store)")] + end + + subgraph ML["ML Platform"] + FeatureStore["Feature Store\n(Online: Redis | Offline: BigQuery)"] + ModelServing["Model Serving\n(TF Serving / Triton)"] + MLflow["MLflow\n(Model Registry)"] + Training["Training Pipeline\n(Spark + Airflow)"] + end + + subgraph Streaming["Event Streaming"] + Kafka["Apache Kafka"] + Flink["Apache Flink"] + end + + subgraph Observability["Observability 
Stack"] + Prometheus["Prometheus"] + Grafana["Grafana"] + ELK["ELK / Loki"] + Evidently["Evidently AI\n(ML Monitoring)"] + end + + Clients --> CDN + CDN --> LB + LB --> APIGW + APIGW --> ServingLayer + RecSvc --> CandGen --> Ranker --> ReRanker + ReRanker --> RecSvc + CandGen --> Redis + CandGen --> Milvus + CandGen --> ES + Ranker --> FeatureStore + ModelServing --> Ranker + MLflow --> ModelServing + Training --> MLflow + DataStores --> Training + Kafka --> Flink --> FeatureStore + ServingLayer --> Kafka + ServingLayer --> Observability + Engine --> Observability +``` + +--- + +## Core Components + +### API Gateway +- **Role**: Single entry point for all client traffic +- **Responsibilities**: JWT authentication, rate limiting (100 req/min per user), request routing, TLS termination, request/response logging +- **Technology**: NGINX / AWS API Gateway + +### Recommendation Service +- **Role**: Orchestrates the full recommendation pipeline +- **Responsibilities**: Calls candidate generation, ranking, and re-ranking; manages caching; returns final recommendations +- **Technology**: FastAPI (Python) + +### Candidate Generation +- **Role**: Retrieves 500–2000 candidate restaurants from multiple retrieval sources +- **Sources**: Collaborative filtering, content-based (vector ANN), geo/location filtering, popularity/trending, cuisine matching, knowledge graph +- **Technology**: Milvus (HNSW), Elasticsearch, Redis + +### Scoring & Ranking +- **Role**: Scores and ranks the candidate set down to top 50–200 +- **Approach**: Multi-objective ranking model combining relevance, distance, price, and rating signals +- **Technology**: LightGBM, XGBoost, DeepFM / Wide&Deep + +### Re-ranking & Business Logic +- **Role**: Applies diversity, freshness, and business rules to produce the final top 10–50 +- **Key operations**: MMR diversity injection, operating hours filter, sponsored boost, fairness constraints + +### Feature Store +- **Role**: Serves pre-computed user and restaurant 
features with low latency
+- **Online store**: Redis Cluster (<5ms p99), **Offline store**: S3/GCS + BigQuery
+
+### ML Platform
+- **Role**: Manages the full model lifecycle β€” training, evaluation, registration, and serving
+- **Technology**: Apache Spark (training), MLflow (registry), TF Serving / Triton (inference)
+
+### Event Streaming
+- **Role**: Processes real-time user behavior events to update session features and trigger near-real-time re-scoring
+- **Technology**: Apache Kafka (transport), Apache Flink (stream processing)
+
+---
+
+## Request Lifecycle
+
+```mermaid
+sequenceDiagram
+    autonumber
+    participant C as Client
+    participant CDN as CDN/Edge
+    participant GW as API Gateway
+    participant Svc as Rec Service
+    participant Cache as Redis Cache
+    participant CG as Candidate Gen
+    participant FS as Feature Store
+    participant Rank as Ranker
+    participant RR as Re-ranker
+
+    C->>CDN: GET /api/v1/recommendations/homepage
+    alt Cache HIT (CDN)
+        CDN-->>C: 200 OK (cached, <5ms)
+    else Cache MISS
+        CDN->>GW: Forward request
+        GW->>GW: Auth + Rate limit check (~2ms)
+        GW->>Svc: Route request
+
+        Svc->>Cache: Lookup user recommendation cache
+        alt Redis Cache HIT
+            Cache-->>Svc: Cached recommendations (~2ms)
+            Svc-->>GW: 200 OK
+            GW-->>CDN: Response
+            CDN-->>C: 200 OK
+        else Cache MISS
+            Svc->>FS: Fetch user features (~2ms)
+            FS-->>Svc: user_embedding, segments, prefs
+
+            Svc->>CG: Generate candidates (~10ms)
+            CG-->>Svc: 500-2000 restaurant candidates
+
+            Svc->>FS: Batch fetch restaurant features (~3ms)
+            FS-->>Svc: restaurant features
+
+            Svc->>Rank: Score & rank candidates (~12ms)
+            Rank-->>Svc: top 50-200 ranked
+
+            Svc->>RR: Apply business rules (~3ms)
+            RR-->>Svc: final top 10-50
+
+            Svc->>Cache: Store result (TTL 30min)
+            Svc-->>GW: 200 OK
+            GW-->>CDN: Response
+            CDN-->>C: 200 OK
+        end
+    end
+```
+
+---
+
+## Latency Budget
+
+The system targets **p50 ≀ 60ms** and **p99 ≀ 150ms** for the homepage recommendation endpoint (cache miss path). 
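As a quick sanity check, the per-stage budgets in the table below sum to the stated end-to-end totals and sit comfortably inside the targets (a sketch with figures copied from the table; note that summing per-stage p99s is a conservative upper bound, since stages rarely all hit their p99 on the same request):

```python
# Per-stage latency budget in ms, copied from the table below: (p50, p99).
budget = {
    "network_tls":      (5, 15),
    "auth_rate_limit":  (2, 5),
    "user_features":    (2, 5),
    "candidate_gen":    (10, 25),
    "restaurant_feats": (3, 8),
    "scoring_ranking":  (12, 35),
    "re_ranking":       (3, 8),
    "serialization":    (2, 5),
}

total_p50 = sum(p50 for p50, _ in budget.values())
total_p99 = sum(p99 for _, p99 in budget.values())

assert total_p50 == 39 and total_p50 <= 60   # cache-miss p50 within the 60ms target
assert total_p99 == 106 and total_p99 <= 150  # cache-miss p99 within the 150ms target
```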
+ +| Step | Component | p50 | p99 | Notes | +|------|-----------|-----|-----|-------| +| Network + TLS | Client β†’ API Gateway | ~5ms | ~15ms | CDN proximity helps | +| Auth & Rate Limit | API Gateway | ~2ms | ~5ms | JWT verification | +| User Feature Fetch | Feature Store (Redis) | ~2ms | ~5ms | Single key lookup | +| Candidate Generation | Milvus ANN + ES Geo | ~10ms | ~25ms | Parallel retrieval | +| Restaurant Feature Fetch | Feature Store (Redis) | ~3ms | ~8ms | Batch pipeline key lookup | +| Scoring & Ranking | ML Inference (Triton) | ~12ms | ~35ms | Batched inference | +| Re-ranking | Rule engine | ~3ms | ~8ms | In-memory computation | +| Serialization + Response | Rec Service | ~2ms | ~5ms | JSON serialization | +| **Total (cache miss)** | **End-to-end** | **~39ms** | **~106ms** | Within budget | +| **Total (cache hit)** | **Redis cache** | **~5ms** | **~15ms** | Dominant path | + +> **Cache hit rate target**: β‰₯80% for homepage endpoint (30min TTL in Redis). + +--- + +## Deployment Topology + +The system runs as containerized microservices on Kubernetes. 
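A useful property of the two-zone layout in the diagram below is that losing an entire availability zone still satisfies the `minAvailable: 2` PodDisruptionBudget defined for the Recommendation Service in the deployment doc. A minimal sketch of that capacity check, assuming the per-zone replica counts shown in the diagram:

```python
# Rec Service replicas per AZ, as shown in the topology diagram below.
replicas = {"zone_a": 3, "zone_b": 3}
MIN_AVAILABLE = 2  # PodDisruptionBudget floor for the Recommendation Service

def survives_zone_loss(replicas_per_zone: dict[str, int], min_available: int) -> bool:
    """True if losing any single zone still leaves >= min_available pods serving."""
    total = sum(replicas_per_zone.values())
    return all(total - lost >= min_available for lost in replicas_per_zone.values())

assert survives_zone_loss(replicas, MIN_AVAILABLE)           # 3 pods remain after a one-zone outage
assert not survives_zone_loss({"only_zone": 3}, MIN_AVAILABLE)  # a single-zone layout fails the check
```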
+ +```mermaid +graph TB + subgraph Zone_A["Availability Zone A"] + APIPOD_A["API Pod (3 replicas)"] + RecPOD_A["Rec Service Pod (3 replicas)"] + FlinkPOD_A["Flink TaskManager"] + end + + subgraph Zone_B["Availability Zone B"] + APIPOD_B["API Pod (3 replicas)"] + RecPOD_B["Rec Service Pod (3 replicas)"] + FlinkPOD_B["Flink TaskManager"] + end + + subgraph DataTier["Data Tier (Multi-AZ)"] + RedisPrimary[("Redis Primary")] + RedisReplica[("Redis Replica")] + PGPrimary[("PostgreSQL Primary")] + PGReplica[("PostgreSQL Read Replica")] + KafkaCluster[("Kafka Cluster\n(3 brokers)")] + end + + subgraph MLTier["ML Tier (GPU nodes)"] + TritonA["Triton Server\n(GPU Node A)"] + TritonB["Triton Server\n(GPU Node B)"] + end + + LB["Load Balancer"] --> Zone_A + LB --> Zone_B + Zone_A --> DataTier + Zone_B --> DataTier + Zone_A --> MLTier + Zone_B --> MLTier + RedisPrimary <--> RedisReplica + PGPrimary <--> PGReplica +``` + +### Kubernetes Namespaces + +| Namespace | Services | +|-----------|----------| +| `rr-serving` | API Gateway, Recommendation Service, Search Service, Catalog Service | +| `rr-ml` | Model serving (Triton), Feature Store workers | +| `rr-streaming` | Kafka brokers, Flink Job/TaskManagers | +| `rr-pipeline` | Airflow webserver, scheduler, workers | +| `rr-observability` | Prometheus, Grafana, Loki, Evidently | +| `rr-data` | PostgreSQL, Redis, Elasticsearch, Milvus | diff --git a/docs/architecture/tech-stack.md b/docs/architecture/tech-stack.md new file mode 100644 index 0000000..e34d1df --- /dev/null +++ b/docs/architecture/tech-stack.md @@ -0,0 +1,160 @@ +# Tech Stack + +## Table of Contents + +- [Overview](#overview) +- [API & Serving Layer](#api--serving-layer) +- [ML & Recommendation Layer](#ml--recommendation-layer) +- [Data Storage Layer](#data-storage-layer) +- [Streaming & Batch Processing](#streaming--batch-processing) +- [ML Operations](#ml-operations) +- [Observability](#observability) +- [Infrastructure & CI/CD](#infrastructure--cicd) +- 
[Alternatives Considered](#alternatives-considered) + +--- + +## Overview + +The technology choices for the Restaurant Recommendation System prioritize: +1. **Low latency** β€” p50 <60ms, p99 <150ms for recommendation serving +2. **Scalability** β€” horizontally scalable stateless services +3. **ML Flexibility** β€” support rapid model iteration and A/B testing +4. **Operational Maturity** β€” proven technologies with strong community and tooling + +--- + +## API & Serving Layer + +| Component | Technology | Version | Justification | +|-----------|-----------|---------|---------------| +| Primary API | **FastAPI** (Python) | β‰₯0.100 | Async I/O, native Pydantic validation, auto OpenAPI docs, extensive ML ecosystem integration | +| High-throughput endpoints | **Go** (net/http) | β‰₯1.21 | Sub-millisecond overhead for catalog lookups and health checks; lower memory footprint per connection | +| API Gateway | **NGINX** / AWS API Gateway | β€” | Battle-tested reverse proxy; supports JWT validation, rate limiting, TLS termination | +| Load Balancer | **AWS ALB** / NGINX | β€” | L7 routing, health checks, sticky sessions for stateful A/B buckets | +| Service Mesh | **Istio** | β€” | mTLS between services, fine-grained traffic management, distributed tracing | + +### Justification: FastAPI vs. 
Flask/Django +FastAPI was chosen over Flask because: +- Native `async`/`await` support eliminates blocking I/O bottlenecks during feature-store and database calls +- Pydantic models enforce schema validation with zero extra code +- Auto-generated OpenAPI documentation reduces maintenance burden +- ~3Γ— higher throughput than synchronous Flask for I/O-bound workloads + +--- + +## ML & Recommendation Layer + +| Component | Technology | Version | Justification | +|-----------|-----------|---------|---------------| +| Deep Learning Framework | **PyTorch** | β‰₯2.0 | Dynamic computation graphs accelerate research; TorchScript for production export | +| Gradient Boosting | **LightGBM** | β‰₯4.0 | Fastest training on tabular ranking features; native ranking loss (LambdaRank) | +| Gradient Boosting (alt) | **XGBoost** | β‰₯2.0 | Complementary to LightGBM; better on sparse features | +| Deep Ranking Models | **TensorFlow** | β‰₯2.13 | Wide&Deep, DIN models production-deployed via TF Serving | +| ANN / Vector Search | **Milvus** | β‰₯2.3 | Purpose-built for billion-scale ANN; HNSW + IVF indexes; cloud-native | +| Full-text & Geo Search | **Elasticsearch** | β‰₯8.x | Geo-distance queries, BM25 text ranking, real-time index updates | +| Recommendation Frameworks | **RecBole** / custom | β€” | Rapid prototyping of CF, BPR, Two-Tower baselines | + +### Justification: PyTorch vs. 
TensorFlow +- PyTorch used for **research and Two-Tower / Neural CF** training due to flexible architecture design +- TensorFlow used for **production serving** (Wide&Deep, DIN) due to mature TF Serving ecosystem +- Both frameworks are maintained in the same ML platform; model conversion via ONNX where needed + +--- + +## Data Storage Layer + +| Component | Technology | Version | Justification | +|-----------|-----------|---------|---------------| +| Primary OLTP | **PostgreSQL** | β‰₯15 | ACID transactions for user accounts, restaurant catalog, orders; PostGIS extension for geo queries | +| Cache / Online Feature Store | **Redis Cluster** | β‰₯7.0 | Sub-millisecond key lookups; native data structures (Sorted Sets for rankings, Hashes for feature vectors) | +| Vector Database | **Milvus** | β‰₯2.3 | Persistent ANN search; supports 128–768d embeddings; horizontal sharding | +| Search Engine | **Elasticsearch** | β‰₯8.x | Restaurant geo search, full-text menu/description search, aggregations | +| Offline Feature Store | **S3 / GCS** (Parquet) | β€” | Cost-effective columnar storage for historical features; Spark-compatible | +| Analytics / BI | **BigQuery** | β€” | Serverless analytics on offline feature data; direct connector for Metabase | +| Object Storage | **S3 / GCS** | β€” | Model artifacts, restaurant photos, log archives | + +### Justification: Redis vs. Memcached +Redis was chosen because: +- Rich data structures (Sorted Sets, Hashes) map naturally to recommendation scores and feature vectors +- Persistence (AOF/RDB) provides warm-restart capability +- Cluster mode supports horizontal sharding for terabyte-scale feature data +- Pub/Sub used for real-time feature invalidation + +### Justification: Milvus vs. 
Pinecone / Weaviate +- Milvus is fully open-source and self-hosted (no vendor lock-in) +- Supports multiple index types (HNSW, IVF_PQ) allowing latency/recall trade-off tuning +- Native Kubernetes operator for deployment +- At 100M+ restaurant embeddings, self-hosted is significantly more cost-effective + +--- + +## Streaming & Batch Processing + +| Component | Technology | Version | Justification | +|-----------|-----------|---------|---------------| +| Message Broker | **Apache Kafka** | β‰₯3.5 | High-throughput, durable event log; partitioned for parallelism; compacted topics for CDC | +| Stream Processing | **Apache Flink** | β‰₯1.17 | Exactly-once semantics; stateful stream processing for session aggregation; native Kafka connector | +| Batch Processing | **Apache Spark** | β‰₯3.4 | Distributed large-scale feature engineering; PySpark for ML pipelines | +| Workflow Orchestration | **Apache Airflow** | β‰₯2.7 | DAG-based pipelines; extensive operator ecosystem; native Kubernetes executor | + +### Justification: Flink vs. 
Spark Streaming +- Flink provides **true streaming** (per-event) with low latency (~100ms), critical for <5s session feature updates +- Spark Streaming (micro-batch) introduces 1–5s batch delays, insufficient for real-time personalization +- Flink's stateful operators simplify session window aggregations +- Spark Structured Streaming used for batch-heavy ETL jobs where Flink is unnecessary overhead + +--- + +## ML Operations + +| Component | Technology | Justification | +|-----------|-----------|---------------| +| Model Registry | **MLflow** | Experiment tracking, model versioning, artifact store; integrates with Spark and PyTorch/TF | +| Model Serving | **NVIDIA Triton Inference Server** | GPU-optimized; supports TensorFlow, PyTorch, ONNX; dynamic batching; concurrent model execution | +| TF Model Serving | **TensorFlow Serving** | Native TF SavedModel serving; optimized for Wide&Deep / DIN models | +| ML Monitoring | **Evidently AI** | Open-source; data/prediction drift reports; integrates with Grafana dashboards | +| Feature Engineering | **Feast** (partial) / custom | Online/offline feature consistency; restaurant-domain custom features handled by bespoke pipelines | + +--- + +## Observability + +| Component | Technology | Justification | +|-----------|-----------|---------------| +| Metrics | **Prometheus** | De-facto standard; pull-based scraping; rich ecosystem of exporters | +| Dashboards | **Grafana** | Native Prometheus datasource; Loki log integration; alerting rules | +| Log Aggregation | **ELK Stack** (Elasticsearch + Logstash + Kibana) / **Grafana Loki** | ELK for complex log search; Loki for cost-effective label-indexed log storage | +| Distributed Tracing | **Jaeger** | OpenTelemetry-compatible; end-to-end request trace for latency debugging | +| Business Metrics | **Metabase** | Self-hosted BI; connects to BigQuery/PostgreSQL; used by product/business teams | +| Alerting | **PagerDuty** + Grafana Alerts | On-call routing with escalation policies; 
Grafana alert rules for thresholds | + +--- + +## Infrastructure & CI/CD + +| Component | Technology | Justification | +|-----------|-----------|---------------| +| Containerization | **Docker** (multi-stage builds) | Reproducible builds; minimized image sizes via multi-stage patterns | +| Orchestration | **Kubernetes** (EKS / GKE) | Industry standard; supports GPU nodes for ML inference; auto-scaling | +| Service Mesh | **Istio** | mTLS, traffic policies, circuit breaking | +| CI/CD | **GitHub Actions** | Native GitHub integration; matrix builds; OIDC-based cloud auth; cost-effective | +| Infrastructure as Code | **Terraform** | Declarative cloud resource management; state management with remote backends | +| Secret Management | **AWS Secrets Manager** / **HashiCorp Vault** | Centralized secret rotation; Kubernetes external-secrets operator | +| Container Registry | **AWS ECR** / **GCR** | Private registry with vulnerability scanning; integrated with EKS/GKE | + +--- + +## Alternatives Considered + +| Decision | Chosen | Alternative | Reason Not Chosen | +|----------|--------|-------------|------------------| +| API Framework | FastAPI | Flask, Django REST | Flask lacks native async; Django too heavyweight for microservice | +| ML Serving | Triton | BentoML, Seldon | Triton has superior GPU utilization and dynamic batching | +| Vector DB | Milvus | Pinecone, Weaviate | Pinecone: vendor lock-in, cost at scale; Weaviate: less mature HNSW performance | +| Stream Processing | Flink | Spark Streaming, Kinesis | Spark Streaming micro-batch too slow; Kinesis: AWS vendor lock-in | +| Workflow Orchestration | Airflow | Prefect, Dagster | Airflow: largest community, most operators; Prefect/Dagster less mature at enterprise scale | +| Message Broker | Kafka | RabbitMQ, AWS SQS | RabbitMQ not designed for high-throughput log streaming; SQS: vendor lock-in | +| Monitoring | Prometheus + Grafana | Datadog, New Relic | Datadog/New Relic: high cost at scale; Prometheus 
open-source with equivalent capability | +| Model Registry | MLflow | W&B, Neptune | W&B/Neptune: SaaS cost; MLflow self-hosted, fully open-source | +| Relational DB | PostgreSQL | MySQL, CockroachDB | PostgreSQL: PostGIS extension critical for geo queries; CockroachDB adds distributed overhead | diff --git a/docs/infrastructure/deployment.md b/docs/infrastructure/deployment.md new file mode 100644 index 0000000..e3f225e --- /dev/null +++ b/docs/infrastructure/deployment.md @@ -0,0 +1,357 @@ +# Deployment + +## Table of Contents + +- [Overview](#overview) +- [Deployment Architecture Diagram](#deployment-architecture-diagram) +- [Container Strategy](#container-strategy) +- [Kubernetes Configuration](#kubernetes-configuration) +- [CI/CD Pipeline](#cicd-pipeline) +- [Environments](#environments) +- [Scaling Strategy](#scaling-strategy) + +--- + +## Overview + +The Restaurant Recommendation system is deployed as containerized microservices on Kubernetes. The deployment targets AWS EKS (or GKE) with multi-availability-zone redundancy and GPU node pools for ML inference. 
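The canary gate referenced in the CI/CD pipeline below (`monitor-canary.sh --error-threshold 0.01`) boils down to comparing the canary's observed error rate against a 1% ceiling over the watch window. A minimal sketch of that check β€” the function and sample shapes here are illustrative, not the actual script:

```python
ERROR_THRESHOLD = 0.01  # mirrors --error-threshold 0.01 in the CI/CD excerpt

def canary_healthy(samples: list[tuple[int, int]], threshold: float = ERROR_THRESHOLD) -> bool:
    """samples: (error_count, request_count) pairs, one per scrape interval.

    Returns True when the aggregate error rate over the window is at or
    below the threshold, i.e. the canary may be promoted.
    """
    errors = sum(e for e, _ in samples)
    requests = sum(r for _, r in samples)
    return requests > 0 and errors / requests <= threshold

# ~0.2% error rate over three intervals: promote
assert canary_healthy([(1, 500), (0, 480), (2, 510)])
# ~5.6% error rate: fail the gate and roll back
assert not canary_healthy([(30, 500), (25, 480)])
```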
+ +--- + +## Deployment Architecture Diagram + +```mermaid +graph TB + subgraph Internet["Internet"] + Users["πŸ‘€ Users\n(Mobile / Web)"] + Partners["πŸ”— Partners\n(API clients)"] + end + + subgraph AWS["AWS / GCP Cloud"] + subgraph CDN["CDN Layer"] + CloudFront["CloudFront / Cloud CDN\n(Edge caching 5-15min TTL)"] + end + + subgraph Ingress["Ingress Layer"] + ALB["Application Load Balancer\n(L7, SSL termination)"] + NGINX["NGINX Ingress Controller\n(Kubernetes)"] + end + + subgraph K8s["Kubernetes Cluster (EKS / GKE)"] + subgraph NS_Serving["namespace: rr-serving"] + APIGW["API Gateway Pods\n(3+ replicas)"] + RecSvc["Recommendation Service\n(3+ replicas, FastAPI)"] + SearchSvc["Search Re-rank Service\n(2+ replicas)"] + CatalogSvc["Catalog Service\n(2+ replicas, Go)"] + end + + subgraph NS_ML["namespace: rr-ml"] + Triton["Triton Inference Server\n(2+ GPU replicas)"] + FeatureWorker["Feature Store Workers\n(2+ replicas)"] + end + + subgraph NS_Streaming["namespace: rr-streaming"] + KafkaBrokers["Kafka Brokers\n(3 replicas)"] + FlinkJM["Flink JobManager\n(1 replica, HA)"] + FlinkTM["Flink TaskManagers\n(3+ replicas)"] + end + + subgraph NS_Pipeline["namespace: rr-pipeline"] + AirflowWeb["Airflow Webserver\n(1 replica)"] + AirflowSched["Airflow Scheduler\n(1 replica)"] + AirflowWorkers["Airflow Workers\n(2+ replicas)"] + end + + subgraph NS_Observability["namespace: rr-observability"] + Prometheus["Prometheus"] + Grafana["Grafana"] + Loki["Loki"] + Jaeger["Jaeger"] + end + end + + subgraph DataTier["Managed Data Services"] + RDSPostgres[("RDS PostgreSQL\n(Multi-AZ)")] + ElastiCache[("ElastiCache Redis\n(Cluster mode, 6 shards)")] + MilvusCluster[("Milvus Cluster\n(3 replicas)")] + ES[("Amazon OpenSearch\n(3 nodes)")] + S3[("S3 / GCS\n(Feature store, artifacts)")] + BigQuery[("BigQuery\n(Analytics)")] + end + end + + Users --> CloudFront + Partners --> ALB + CloudFront --> ALB + ALB --> NGINX + NGINX --> NS_Serving + NS_Serving --> NS_ML + NS_Serving --> DataTier + 
NS_ML --> DataTier
+    NS_Streaming --> DataTier
+    NS_Pipeline --> DataTier
+    NS_Serving --> NS_Streaming
+```
+
+---
+
+## Container Strategy
+
+### Docker Multi-Stage Build Pattern
+
+All services use multi-stage Docker builds to minimize image size and attack surface:
+
+```dockerfile
+# Stage 1: Builder
+FROM python:3.11-slim AS builder
+WORKDIR /app
+COPY requirements.txt .
+# Install into an isolated prefix so the runtime stage can copy it wholesale
+RUN pip install --no-cache-dir --prefix=/install -r requirements.txt
+
+# Stage 2: Runtime
+FROM python:3.11-slim AS runtime
+WORKDIR /app
+
+# Copy only installed packages from builder into /usr/local
+# (world-readable β€” /root/.local would be unreadable once we drop to appuser)
+COPY --from=builder /install /usr/local
+
+# Copy application code
+COPY src/ ./src/
+COPY config/ ./config/
+
+ENV PYTHONDONTWRITEBYTECODE=1
+ENV PYTHONUNBUFFERED=1
+
+# Non-root user for security
+RUN groupadd -r appuser && useradd -r -g appuser appuser
+USER appuser
+
+EXPOSE 8000
+CMD ["uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8000"]
+```
+
+### Image Registry
+
+| Service | Registry | Image Tag Strategy |
+|---------|----------|-------------------|
+| All services | AWS ECR / GCR | `{service}:{git-sha}` for traceability |
+| ML models | ECR | `{service}:{model-version}` |
+| Base images | ECR (mirrored) | Pinned SHA for reproducibility |
+
+### Image Security
+
+- All images scanned with **Trivy** in CI before push
+- Base images updated weekly (automated PR)
+- No `latest` tag in production β€” all images pinned to SHA or semantic version
+
+---
+
+## Kubernetes Configuration
+
+### Namespace Strategy
+
+| Namespace | Services | Network Policy |
+|-----------|----------|---------------|
+| `rr-serving` | API Gateway, Recommendation, Search, Catalog | Ingress from load balancer only |
+| `rr-ml` | Triton, Feature Workers | Ingress from `rr-serving` only |
+| `rr-streaming` | Kafka, Flink | Ingress from `rr-serving`, `rr-pipeline` |
+| `rr-pipeline` | Airflow | No ingress from external |
+| `rr-observability` | Prometheus, Grafana, Loki, Jaeger | Ingress from 
admin network only | +| `rr-data` | Redis, PostgreSQL (if self-hosted) | Ingress from `rr-serving`, `rr-ml`, `rr-pipeline` | + +### Resource Requests & Limits + +| Service | CPU Request | CPU Limit | Memory Request | Memory Limit | +|---------|-------------|-----------|----------------|-------------| +| Recommendation Service | 500m | 2000m | 512Mi | 2Gi | +| API Gateway | 250m | 1000m | 256Mi | 512Mi | +| Search Service | 500m | 1500m | 512Mi | 1Gi | +| Catalog Service (Go) | 100m | 500m | 128Mi | 256Mi | +| Triton (GPU) | 2000m | 4000m | 4Gi | 8Gi | +| Flink TaskManager | 1000m | 2000m | 2Gi | 4Gi | +| Airflow Worker | 500m | 2000m | 1Gi | 4Gi | + +### Horizontal Pod Autoscaler (HPA) + +```yaml +# Recommendation Service HPA +apiVersion: autoscaling/v2 +kind: HorizontalPodAutoscaler +metadata: + name: recommendation-service-hpa + namespace: rr-serving +spec: + scaleTargetRef: + apiVersion: apps/v1 + kind: Deployment + name: recommendation-service + minReplicas: 3 + maxReplicas: 20 + metrics: + - type: Resource + resource: + name: cpu + target: + type: Utilization + averageUtilization: 70 + - type: Pods + pods: + metric: + name: http_requests_per_second + target: + type: AverageValue + averageValue: "500" +``` + +### Pod Disruption Budgets + +All production services have PDBs to ensure availability during rolling updates: + +```yaml +apiVersion: policy/v1 +kind: PodDisruptionBudget +metadata: + name: recommendation-service-pdb + namespace: rr-serving +spec: + minAvailable: 2 + selector: + matchLabels: + app: recommendation-service +``` + +--- + +## CI/CD Pipeline + +### GitHub Actions Workflow + +```mermaid +flowchart LR + PR["Pull Request"] --> Lint["Lint & Type Check\n(ruff, mypy)"] + Lint --> Test["Unit & Integration Tests\n(pytest)"] + Test --> Security["Security Scan\n(Trivy image scan\nCodeQL SAST)"] + Security --> Build["Docker Build\n(multi-stage)"] + Build --> Push["Push to ECR\n(tagged with git SHA)"] + Push --> DeployDev["Deploy to Dev\n(auto on merge to 
main)"]
+    DeployDev --> SmokeTest["Smoke Tests\n(API health checks)"]
+    SmokeTest --> DeployStaging["Deploy to Staging\n(auto after smoke pass)"]
+    DeployStaging --> IntegrationTest["Integration Tests\n(E2E test suite)"]
+    IntegrationTest --> ManualApproval["Manual Approval\n(required for production)"]
+    ManualApproval --> DeployProd["Deploy to Production\n(canary β†’ blue-green)"]
+    DeployProd --> ProdVerify["Production Verification\n(automated health checks)"]
+```
+
+### Workflow File Structure
+
+```
+.github/workflows/
+β”œβ”€β”€ ci.yml                 # Lint, test, security scan (on PR)
+β”œβ”€β”€ build-push.yml         # Docker build & push (on merge to main)
+β”œβ”€β”€ deploy-dev.yml         # Auto-deploy to dev (on merge to main)
+β”œβ”€β”€ deploy-staging.yml     # Auto-deploy to staging (after dev smoke tests)
+β”œβ”€β”€ deploy-production.yml  # Manual-approval production deploy
+└── model-deploy.yml       # ML model canary deployment workflow
+```
+
+### CI Pipeline (`ci.yml`) Steps
+
+| Step | Tool | Failure Mode |
+|------|------|-------------|
+| Lint (Python) | `ruff check` | Block merge |
+| Type check | `mypy` | Block merge |
+| Unit tests | `pytest` (>90% coverage required) | Block merge |
+| Integration tests | `pytest -m integration` | Block merge |
+| Image security scan | Trivy | Block merge (CRITICAL/HIGH CVEs) |
+| SAST | CodeQL | Advisory (non-blocking for now) |
+| Dependency check | `pip-audit` | Block merge |
+
+### Deployment Strategy in CI/CD
+
+```yaml
+# deploy-production.yml (excerpt)
+steps:
+  - name: Deploy Canary (5%)
+    run: |
+      # Roll the new image out to the dedicated canary Deployment only;
+      # the stable Deployment keeps serving the current version
+      kubectl set image deployment/recommendation-service-canary \
+        recommendation-service=$IMAGE_TAG \
+        --namespace rr-serving
+      # Route 5% of traffic to the canary subset via the Istio VirtualService
+      # (the Service selector is left untouched so it matches both versions)
+      kubectl patch virtualservice recommendation-service --type merge \
+        -p '{"spec":{"http":[{"route":[{"destination":{"host":"recommendation-service","subset":"stable"},"weight":95},{"destination":{"host":"recommendation-service","subset":"canary"},"weight":5}]}]}}' \
+        --namespace rr-serving
+
+  - name: Monitor Canary (2h)
+    run: ./scripts/monitor-canary.sh --duration 2h --error-threshold 0.01
+
+  - name: Promote to 100% (Blue-Green)
+    if: 
success()
+    run: |
+      kubectl patch virtualservice recommendation-service --type merge \
+        -p '{"spec":{"http":[{"route":[{"destination":{"host":"recommendation-service","subset":"canary"},"weight":100}]}]}}' \
+        --namespace rr-serving
+```
+
+---
+
+## Environments
+
+| Environment | Purpose | Cluster Size | Data | Auto-Deploy |
+|-------------|---------|-------------|------|-------------|
+| **Development** (`dev`) | Feature development, unit testing | Small (1 replica per service) | Synthetic data | Yes (on merge to `main`) |
+| **Staging** (`staging`) | Integration testing, QA, load testing | Medium (2 replicas per service) | Anonymized production copy | Yes (after dev smoke tests) |
+| **Production** (`prod`) | Live traffic | Full HA (3+ replicas, multi-AZ) | Real data | Manual approval required |
+
+### Environment-Specific Configuration
+
+| Configuration | Dev | Staging | Production |
+|--------------|-----|---------|-----------|
+| Redis | Single node | 3-shard cluster | 6-shard cluster |
+| PostgreSQL | RDS t3.medium | RDS r6g.large | RDS r6g.2xlarge (Multi-AZ) |
+| Kafka | 1 broker | 3 brokers | 3 brokers (Multi-AZ) |
+| GPU nodes | None (CPU-only inference) | 1Γ— GPU node | 2+ GPU nodes per AZ |
+| Log retention | 7 days | 30 days | 90 days hot / 1yr cold |
+| Monitoring | Basic | Full | Full + PagerDuty alerting |
+
+---
+
+## Scaling Strategy
+
+### API Service (CPU-Based HPA)
+
+- **Min replicas**: 3 (across 3 AZs)
+- **Max replicas**: 20
+- **Scale-up trigger**: CPU utilization >70% for 60s
+- **Scale-down trigger**: CPU utilization <40% for 300s
+- **Scale-down stabilization**: 5 minutes (prevent flapping)
+
+### ML Inference (GPU-Based HPA)
+
+- **Technology**: KEDA (Kubernetes Event-Driven Autoscaling) with Prometheus metrics
+- **Min replicas**: 2 (1 per AZ)
+- **Max replicas**: 8
+- **Scale-up trigger**: Triton queue depth >50 requests
+- **Scale-down trigger**: Queue depth <5 for 5 minutes
+- **Node provisioner**: Karpenter (AWS) / Cluster Autoscaler for GPU 
node pool
+
+### Redis (Cluster Mode)
+
+- **Sharding**: 6 shards, consistent hash slots
+- **Replicas**: 1 replica per shard (total 12 nodes)
+- **Scaling**: Manual shard addition (requires data rebalancing, planned maintenance)
+- **Memory target**: <70% utilization before scaling
+
+### Kafka (Partition-Based)
+
+- **Partitions**: `user-events-raw` β€” 24 partitions (matching a Flink consumer parallelism of 24)
+- **Retention**: 7 days (raw events), 24h (processed events)
+- **Scaling trigger**: Consumer lag >50,000 messages for >10 minutes
+- **Broker scaling**: Add broker + rebalance partitions (handled via Confluent Operator)
+
+### Flink (TaskManager Scaling)
+
+- **Min TaskManagers**: 3
+- **Max TaskManagers**: 12
+- **Scaling**: Based on Kafka consumer lag via KEDA
+- **Checkpoint interval**: 30 seconds (bounds event replay on recovery to ~30s)
diff --git a/docs/modules/01-data-ingestion.md b/docs/modules/01-data-ingestion.md
new file mode 100644
index 0000000..84d8cb1
--- /dev/null
+++ b/docs/modules/01-data-ingestion.md
@@ -0,0 +1,343 @@
+# Module 01 β€” Data Ingestion
+
+## Table of Contents
+
+- [Overview](#overview)
+- [User Behavior Tracking](#user-behavior-tracking)
+- [Restaurant Catalog Management](#restaurant-catalog-management)
+- [User Profile Aggregation](#user-profile-aggregation)
+- [Review & Rating Ingestion](#review--rating-ingestion)
+- [Data Schemas](#data-schemas)
+
+---
+
+## Overview
+
+The Data Ingestion module is responsible for collecting, validating, enriching, and storing all data that feeds the recommendation system. It covers four primary data domains:
+
+1. **User Behavior** β€” clickstream events, sessions, and interaction history
+2. **Restaurant Catalog** β€” structured restaurant data with domain-specific attributes
+3. **User Profiles** β€” aggregated behavioral and demographic profiles
+4. 
**Reviews & Ratings** β€” user-generated content with sentiment enrichment + +--- + +## User Behavior Tracking + +### Event Types + +| Event Type | Trigger | Key Signals | +|-----------|---------|-------------| +| `RESTAURANT_VIEW` | Restaurant detail page opened | dwell_time_ms, scroll_depth_pct | +| `RESTAURANT_CLICK` | Restaurant card clicked in listing | position, source_page, rank | +| `MENU_VIEW` | Menu page viewed | menu_section, dwell_time_ms | +| `ORDER_PLACED` | Order successfully submitted | order_value, items_count, payment_method | +| `REVIEW_SUBMITTED` | User submits review | rating_value, review_length | +| `SEARCH` | User performs search | query_text, results_count, selected_position | +| `FILTER_APPLIED` | User applies a search/listing filter | filter_type, filter_value | +| `FAVORITE_ADDED` | User adds restaurant to favorites | source_page | + +### Session Management + +- A **session** begins with the first event after a 30-minute idle gap +- Sessions are tracked by `session_id` (UUID), assigned server-side +- Session features computed in real-time via Flink: total events, unique restaurants viewed, cuisine diversity, avg dwell time, click-through items +- Maximum session duration: 4 hours (forced close) + +### Clickstream Pipeline + +``` +Mobile/Web SDK β†’ Kafka (user-events-raw) β†’ Flink Validation & Enrichment + β†’ Kafka (user-events-enriched) β†’ Session Aggregation β†’ Redis (session features) + β†’ BigQuery (analytics) + S3 (cold archive) +``` + +--- + +## Restaurant Catalog Management + +### Data Sources + +| Source | Sync Method | Frequency | +|--------|-------------|-----------| +| Partner POS systems | Webhook / REST push | Near-real-time (<1 min) | +| Google Places API | Batch pull | Daily | +| Internal manual entries | Admin API | On-demand | +| Review platforms | Batch scraping (licensed) | Daily | + +### Catalog Enrichment Pipeline + +After ingestion, each restaurant record is enriched: + +1. 
**NLP Tag Generation** β€” BERT-based model extracts cuisine tags, ambiance descriptors, and menu highlights from description text +2. **Text Embedding** β€” Sentence-BERT generates 768d text embedding from name + description + cuisine tags +3. **Image Embedding** β€” ResNet-50 generates 512d image embedding from the primary restaurant photo +4. **Geohash Encoding** β€” Latitude/longitude encoded to geohash precision 7 (~150m Γ— 150m cell) +5. **Operating Hours Normalization** β€” Hours stored in UTC, `is_open_now` computed at query time +6. **Price Tier Standardization** β€” Mapped to 4 tiers: `$` (<$15), `$$` ($15–$30), `$$$` ($30–$60), `$$$$` (>$60) per person + +### Restaurant Domain Attributes + +The restaurant domain requires specialized fields beyond generic catalog items: + +| Category | Fields | Notes | +|----------|--------|-------| +| **Location** | lat, lng, geohash, address, city, district, country | Geohash used for fast proximity lookups | +| **Cuisine** | cuisine_types (array), primary_cuisine | Taxonomy of 120+ cuisine types | +| **Operations** | operating_hours (per weekday), is_open_now, reservation_required | is_open_now is a computed field | +| **Pricing** | price_tier ($–$$$$), avg_meal_price | Tier updated quarterly from order data | +| **Dietary** | vegetarian_options, vegan_options, halal, gluten_free, allergen_info | Boolean flags + allergen list | +| **Ambiance** | tags (romantic, family-friendly, business, casual, fine-dining) | Multi-value, ML-tagged | +| **Capacity** | total_seats, private_rooms, outdoor_seating, parking | Operational capacity data | +| **Ratings** | avg_rating, total_reviews, rating_breakdown (food/service/ambiance/value) | Updated daily from review ingestion | +| **Media** | photos[], menu_url | Photos stored in S3/CDN | + +--- + +## User Profile Aggregation + +### Profile Components + +| Component | Source | Update Frequency | +|-----------|--------|-----------------| +| Demographic | Registration, onboarding quiz | 
On registration / user update | +| Behavioral aggregates | Event log (Spark batch) | Daily (2:00 AM pipeline) | +| Preference vectors | CF model output | Weekly (model retrain) | +| Dietary preferences | Explicit (settings) + implicit (order history) | Real-time (explicit) / Daily (implicit) | +| User segment | Segmentation model | Daily | + +### User Segments + +| Segment | Criteria | Recommendation Strategy | +|---------|----------|------------------------| +| `cold_start` | <10 interactions | Popularity by geo + demographic | +| `warm` | 10–30 interactions | Blended CF + content-based | +| `active` | >30 interactions, active <30 days | Full personalization | +| `power_user` | >100 interactions/month | Deep personalization, early access | +| `churning` | No activity 30–90 days | Re-engagement recommendations | +| `dormant` | No activity >90 days | Fallback to popularity | + +--- + +## Review & Rating Ingestion + +### Ingestion Pipeline + +``` +Review submitted (API/webhook) β†’ Kafka (reviews-raw) + β†’ Flink: Basic validation, dedup + β†’ Sentiment Analysis service (async, <5s) + β†’ PostgreSQL (review stored) + β†’ Kafka (reviews-enriched) + β†’ Daily batch: Update restaurant avg_rating, rating_breakdown +``` + +### Sentiment Analysis + +- Model: Fine-tuned DistilBERT on restaurant review corpus +- Output: `sentiment_score` ∈ [0.0, 1.0] (0 = very negative, 1 = very positive) +- Aspect-level sentiment: food, service, ambiance, value +- Processing latency: <3s per review (async, not blocking review submission) + +--- + +## Data Schemas + +### User Event Schema + +```json +{ + "event_id": "evt_01HXYZ123ABC456DEF", + "event_type": "RESTAURANT_CLICK", + "timestamp": "2024-03-15T14:32:07.123Z", + "user_id": "usr_9f3a8b2c1d4e5f6a", + "session_id": "sess_7e2f1a3b4c5d6e7f", + "restaurant_id": "rst_3b7c9e1a2f4d5e6b", + "context": { + "device": "ios", + "platform": "mobile_app", + "app_version": "4.2.1", + "os_version": "17.2", + "location": { + "lat": 10.7769, + "lng": 
106.7009, + "accuracy_meters": 15, + "geohash": "w3gv2c" + }, + "referrer": "homepage_recommendations", + "page_position": 3 + }, + "metadata": { + "dwell_time_ms": 4200, + "scroll_depth_pct": 75, + "rank_in_list": 3, + "rec_model_version": "ranking_v2.4.1", + "experiment_variants": { + "homepage_layout": "control", + "ranking_model": "treatment_a" + } + } +} +``` + +### Restaurant Catalog Schema + +```json +{ + "restaurant_id": "rst_3b7c9e1a2f4d5e6b", + "name": "Pho Saigon House", + "description": "Authentic Vietnamese pho and street food in a casual setting.", + "phone": "+84-28-3812-3456", + "website": "https://phosaigonhouse.vn", + "location": { + "address": "123 Nguyen Hue Boulevard", + "district": "District 1", + "city": "Ho Chi Minh City", + "country": "VN", + "lat": 10.7769, + "lng": 106.7009, + "geohash": "w3gv2c" + }, + "cuisine": { + "cuisine_types": ["vietnamese", "asian", "noodles", "soup"], + "primary_cuisine": "vietnamese" + }, + "operations": { + "operating_hours": { + "monday": { "open": "07:00", "close": "22:00" }, + "tuesday": { "open": "07:00", "close": "22:00" }, + "wednesday": { "open": "07:00", "close": "22:00" }, + "thursday": { "open": "07:00", "close": "22:00" }, + "friday": { "open": "07:00", "close": "23:00" }, + "saturday": { "open": "08:00", "close": "23:00" }, + "sunday": { "open": "08:00", "close": "21:00" } + }, + "is_open_now": true, + "reservation_required": false + }, + "pricing": { + "price_tier": "$$", + "avg_meal_price_usd": 12.50 + }, + "ratings": { + "avg_rating": 4.3, + "total_reviews": 1842, + "rating_breakdown": { + "food": 4.5, + "service": 4.1, + "ambiance": 4.0, + "value": 4.6 + } + }, + "dietary": { + "vegetarian_options": true, + "vegan_options": false, + "halal": true, + "gluten_free": false, + "allergen_info": ["peanuts", "shellfish"] + }, + "ambiance": { + "tags": ["casual", "family-friendly", "quick-bite"] + }, + "capacity": { + "total_seats": 80, + "private_rooms": 0, + "outdoor_seating": true, + "parking": 
false + }, + "media": { + "photos": [ + "https://cdn.example.com/restaurants/rst_3b7c9e/photo_1.jpg", + "https://cdn.example.com/restaurants/rst_3b7c9e/photo_2.jpg" + ], + "menu_url": "https://cdn.example.com/restaurants/rst_3b7c9e/menu.pdf" + }, + "embeddings": { + "text_embedding": [0.023, -0.187, 0.412, "... (768 dimensions total)"], + "image_embedding": [0.051, 0.334, -0.112, "... (512 dimensions total)"] + }, + "metadata": { + "popularity_score": 0.82, + "status": "active", + "created_at": "2022-06-01T08:00:00Z", + "updated_at": "2024-03-15T02:30:00Z" + } +} +``` + +### User Profile Schema + +```json +{ + "user_id": "usr_9f3a8b2c1d4e5f6a", + "demographic": { + "age_range": "25-34", + "gender": "female", + "location": { + "city": "Ho Chi Minh City", + "district": "District 2", + "country": "VN", + "lat": 10.7880, + "lng": 106.7320 + } + }, + "behavioral": { + "total_interactions": 247, + "favorite_cuisines": [ + { "cuisine": "japanese", "affinity_score": 0.87 }, + { "cuisine": "vietnamese", "affinity_score": 0.76 }, + { "cuisine": "korean", "affinity_score": 0.65 } + ], + "avg_session_duration_min": 8.4, + "order_frequency_per_month": 6.2, + "last_active_at": "2024-03-15T13:45:00Z" + }, + "dietary_preferences": { + "vegetarian": false, + "vegan": false, + "halal": false, + "allergies": ["shellfish"] + }, + "price_preference": { + "preferred_price_tier": "$$", + "avg_spend_per_meal_usd": 18.00 + }, + "preference_vector": [0.12, -0.34, 0.87, "... (128 dimensions total)"], + "segment": "active", + "metadata": { + "created_at": "2022-09-14T10:00:00Z", + "updated_at": "2024-03-15T02:00:00Z" + } +} +``` + +### Review Schema + +```json +{ + "review_id": "rev_01HXZ789GHI012JKL", + "user_id": "usr_9f3a8b2c1d4e5f6a", + "restaurant_id": "rst_3b7c9e1a2f4d5e6b", + "overall_rating": 4, + "dimension_ratings": { + "food": 5, + "service": 4, + "ambiance": 4, + "value": 5 + }, + "review_text": "Amazing pho! The broth is rich and the service was friendly. 
A bit noisy during lunch rush but great value for money.", + "sentiment_score": 0.84, + "aspect_sentiments": { + "food": 0.95, + "service": 0.78, + "ambiance": 0.62, + "value": 0.91 + }, + "helpful_votes": 23, + "photos": [ + "https://cdn.example.com/reviews/rev_01HXZ789/photo_1.jpg" + ], + "visit_date": "2024-03-10", + "created_at": "2024-03-11T09:22:00Z" +} +``` diff --git a/docs/modules/02-recommendation-engine.md b/docs/modules/02-recommendation-engine.md new file mode 100644 index 0000000..f945332 --- /dev/null +++ b/docs/modules/02-recommendation-engine.md @@ -0,0 +1,269 @@ +# Module 02 β€” Recommendation Engine + +## Table of Contents + +- [Overview](#overview) +- [Stage 1: Candidate Generation](#stage-1-candidate-generation) +- [Stage 2: Scoring & Ranking](#stage-2-scoring--ranking) +- [Stage 3: Re-ranking & Business Logic](#stage-3-re-ranking--business-logic) +- [Cold Start Handling](#cold-start-handling) +- [Pipeline Architecture](#pipeline-architecture) + +--- + +## Overview + +The Recommendation Engine operates as a three-stage funnel: + +``` +All Restaurants (~100k+) + ↓ +[Stage 1: Candidate Generation] β†’ 500–2000 candidates + ↓ +[Stage 2: Scoring & Ranking] β†’ top 50–200 + ↓ +[Stage 3: Re-ranking & Business Logic] β†’ final top 10–50 +``` + +Each stage narrows the candidate set while adding more expensive (but more precise) signals. + +--- + +## Stage 1: Candidate Generation + +**Goal**: Retrieve 500–2000 relevant restaurant candidates with latency <15ms using multiple lightweight retrieval strategies in parallel. + +### Retrieval Strategies + +#### 1.1 Collaborative Filtering + +| Model | Description | Output | +|-------|-------------|--------| +| **Two-Tower** | Dual encoder: user tower (demographics + behavior) and item tower (restaurant attributes). ANN search in Milvus using HNSW index. 
| Top-N by cosine similarity | +| **Neural CF** | Dot-product interaction between learned user and item embeddings | Top-N from pre-computed index | +| **ALS** (offline) | Matrix factorization; pre-computed user–item scores stored in Redis | Lookup by user_id | +| **BPR** | Bayesian Personalized Ranking; optimizes for pairwise ranking | Offline scoring | + +#### 1.2 Content-Based Filtering + +- **Text similarity**: Sentence-BERT embeddings for restaurant description + cuisine tags; ANN search via Milvus HNSW index (768d) +- **Image similarity**: ResNet-50 image embeddings; HNSW ANN search (512d) +- **Query**: User's cuisine preference vector used as query embedding + +#### 1.3 Popularity & Trending + +- **Global popularity**: Pre-computed `popularity_score` ∈ [0, 1] (order count Γ— recency-weighted) +- **Time-decay**: `score(t) = base_score Γ— e^(-Ξ»t)`, Ξ» = 0.1/day, stored in Redis Sorted Set +- **Local trending**: Top restaurants in the user's current geohash cluster (precision 6, ~1.2km Γ— 0.6km) + +#### 1.4 Location-Aware Filtering + +- **Geofencing**: Elasticsearch `geo_distance` query within user-configured radius +- **Configurable radius**: 1–50km (default 5km for mobile, 10km for web) +- **Geohash pre-filter**: Reduce Elasticsearch scan using geohash bounding cells +- **Distance score**: `distance_score = 1 / (1 + distance_km)` used as a ranking signal later + +#### 1.5 Cuisine Matching + +- User's top 5 cuisine preferences (from profile) are used to directly query restaurants matching those cuisines via PostgreSQL index on `primary_cuisine` +- Ensures strong cuisine relevance even when collaborative signals are weak + +#### 1.6 Knowledge Graph (GNN) + +- Graph Neural Network captures entity relationships: User β†’ visited β†’ Restaurant β†’ serves β†’ Cuisine β†’ similar β†’ Cuisine +- GraphSAGE model generates neighborhood-aggregated embeddings +- Provides candidates based on multi-hop entity relationships (e.g., "users who liked X also liked Y in same 
cuisine family")
+
+### Candidate Merging
+
+After parallel retrieval, candidates are:
+1. **Deduplicated** by `restaurant_id`
+2. **Geo-filtered**: restaurants outside configured radius are dropped
+3. **Status-filtered**: restaurants with `status != active` are dropped; `is_open_now == false` candidates may also be excluded here as a soft pre-filter for efficiency (the hard operating-hours filter runs in Stage 3)
+4. **Capped** at 2000 candidates (top-N by retrieval score per source, then merged)
+
+---
+
+## Stage 2: Scoring & Ranking
+
+**Goal**: Score and rank 500–2000 candidates using a rich feature set and ML ranking models, producing a top 50–200 ranked list in ~15ms.
+
+### Feature Engineering
+
+#### User Features
+
+| Feature | Description | Dimensionality |
+|---------|-------------|---------------|
+| `user_embedding` | Two-Tower user tower output | 128d |
+| `user_segment` | cold_start / warm / active / power_user / churning / dormant | 1 (categorical) |
+| `top_cuisines` | Top-3 cuisine preferences with affinity scores | 6 floats |
+| `price_preference` | Preferred price tier (1–4) | 1 |
+| `recency` | Days since last interaction | 1 |
+| `frequency` | Orders per month (last 90d) | 1 |
+| `monetary` | Avg spend per meal (USD) | 1 |
+| `dietary_flags` | vegetarian, vegan, halal, gluten_free | 4 booleans |
+| `session_clicks` | Number of restaurants clicked in current session | 1 |
+| `time_of_day` | Hour of day (0–23) | 1 |
+
+#### Restaurant Features
+
+| Feature | Description | Dimensionality |
+|---------|-------------|---------------|
+| `restaurant_embedding` | Two-Tower item tower output | 128d |
+| `cuisine_embedding` | Cuisine taxonomy embedding | 32d |
+| `price_tier` | $ to $$$$ (1–4) | 1 |
+| `avg_rating` | Overall avg rating (0–5) | 1 |
+| `rating_breakdown` | food, service, ambiance, value ratings | 4 floats |
+| `review_count` | Total review count (log-scaled) | 1 |
+| `distance_km` | Distance from user's location | 1 |
+| `popularity_score` | Time-decayed popularity score | 1 |
+| `is_new` | Restaurant created <30 days ago | 1 
boolean | +| `has_outdoor_seating` | Outdoor seating availability | 1 boolean | + +#### Context Features + +| Feature | Description | +|---------|-------------| +| `time_of_day` | Hour bucket: breakfast (6–10), lunch (11–14), afternoon (14–17), dinner (17–22), late (22–6) | +| `day_of_week` | Weekday vs. weekend | +| `device_type` | mobile / tablet / desktop | +| `meal_occasion` | Inferred from time + user history: solo / business / date / family | + +#### Cross Features + +| Feature | Description | +|---------|-------------| +| `user_cuisine_affinity` | User cuisine preference score Γ— restaurant primary cuisine match | +| `user_price_match` | Abs difference between user preferred_price_tier and restaurant price_tier | +| `user_restaurant_embedding_similarity` | Dot product of user_embedding and restaurant_embedding | +| `user_distance_affinity` | Historical median distance user visits Γ— current restaurant distance | + +### Ranking Models + +| Model | Type | Use Case | +|-------|------|---------| +| **LightGBM** | Gradient Boosting (LambdaRank) | Primary ranker; tabular features; fast inference | +| **XGBoost** | Gradient Boosting | Ensemble fallback; better on sparse feature sets | +| **DeepFM** | Deep learning (FM + DNN) | Captures high-order feature interactions | +| **Wide & Deep** | Deep learning | Wide: memorization; Deep: generalization | +| **DIN** | Attention-based deep learning | Dynamic interest network using user's click history sequence | + +**Production ranker**: LightGBM (primary) + DeepFM (secondary, async reranking for high-traffic users) + +### Multi-Objective Scoring + +The final ranking score combines multiple objectives: + +``` +final_score = w_relevance Γ— relevance_score + + w_distance Γ— distance_score + + w_price Γ— price_match_score + + w_rating Γ— rating_score +``` + +Default weights: `w_relevance=0.55, w_distance=0.20, w_price=0.10, w_rating=0.15` +Weights are tuned per user segment and updated quarterly via A/B experiments. 
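
As a rough illustration, the blend above can be sketched in a few lines of Python. The helper name and the normalization of the 0–5 rating by 5 are assumptions; `distance_score` follows the Stage 1 formula.

```python
# Sketch of the multi-objective blend described above. DEFAULT_WEIGHTS mirrors
# the documented defaults; normalizing avg_rating by 5 is an assumption.

DEFAULT_WEIGHTS = {"relevance": 0.55, "distance": 0.20, "price": 0.10, "rating": 0.15}

def final_score(relevance, distance_km, price_match, avg_rating, weights=DEFAULT_WEIGHTS):
    """Blend per-objective signals into a single ranking score."""
    distance_score = 1.0 / (1.0 + distance_km)  # Stage 1 distance signal
    rating_score = avg_rating / 5.0             # map 0-5 rating into [0, 1]
    return (weights["relevance"] * relevance
            + weights["distance"] * distance_score
            + weights["price"] * price_match
            + weights["rating"] * rating_score)
```

For example, a candidate with relevance 0.9 at 1.2 km, a perfect price match, and a 4.3 rating scores roughly 0.81 under the default weights.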
+
+---
+
+## Stage 3: Re-ranking & Business Logic
+
+**Goal**: Apply diversity, freshness, and business rules to the top 50–200 ranked candidates to produce the final top 10–50 shown to users.
+
+### Diversity Injection (MMR)
+
+Maximal Marginal Relevance (MMR) balances relevance and diversity:
+
+```
+MMR_score(r) = Ξ» Γ— relevance(r) - (1-Ξ») Γ— max_{s ∈ S} sim(r, s)
+```
+
+Where `S` is the set of already-selected restaurants and `sim` is cuisine/embedding similarity.
+
+- **Ξ»** = 0.7 (tunable; higher = more relevance-focused)
+- **category_max_ratio** = 0.4 (max 40% of results can share the same primary cuisine)
+- Prevents showing 10 Vietnamese restaurants when diverse options exist
+
+### Freshness Boost
+
+- New restaurants (created <30 days): `+0.05` score boost
+- New restaurants (30–90 days): `+0.02` score boost
+- Ensures new restaurant discovery and catalog health
+
+### Business Rules
+
+| Rule | Description | Priority |
+|------|-------------|----------|
+| **Exclude already visited** | Optionally hide restaurants the user visited in the past 7 days (configurable per user setting) | Medium |
+| **Sponsored boost** | Verified sponsored restaurants receive a `+0.10` score lift, capped at max 2 per page | Medium |
+| **High-margin boost** | Internal signal; `+0.03` for partners on premium tier | Low |
+| **Geo availability** | Respect delivery zone boundaries if applicable | High |
+| **Operating hours filter** | Hard filter: restaurants currently closed (is_open_now = false) are excluded from results, including lunch/dinner-time requests | High |
+| **Allergen safety filter** | Hard filter: restaurants with user's allergens are excluded | High |
+
+### Fairness Constraints
+
+- **Minimum exposure**: Small restaurants (review_count < 50) are guaranteed β‰₯1 slot per 10-result page
+- **Exploration slots**: 5–10% of result slots are reserved for exploration (non-greedy picks to combat popularity bias)
+- 
**Position bias calibration**: Model scores are calibrated to account for position bias in training labels + +--- + +## Cold Start Handling + +### New User Cold Start + +| Stage | Interactions | Strategy | +|-------|-------------|---------| +| Anonymous | 0 | Popularity by geo (city/district) + top-rated restaurants | +| Registered (no interactions) | 0 | Demographic matching + onboarding quiz (top 3 cuisines, price preference, dietary) | +| Early interactions | 1–10 | Content-based using stated cuisine prefs + onboarding signals | +| Blending period | 10–30 | Linear blend: CF weight ramps 0β†’1 as interactions increase | +| Full personalization | >30 | Full CF + ranking model | + +**Transition threshold**: cold β†’ warm at **10 interactions**, full personalization at **30 interactions** + +### New Restaurant Cold Start + +| Approach | Description | +|----------|-------------| +| **Content similarity** | Use text/image embeddings to find similar restaurants and proxy their interaction scores | +| **Category injection** | Inject new restaurant into results for users with matching cuisine preferences | +| **Exploration slots** | 5–10% of result slots reserved for new restaurant exploration | +| **Metadata priors** | Use rating_prior = 3.5 until β‰₯10 reviews accumulated | + +--- + +## Pipeline Architecture + +```mermaid +flowchart LR + Request["User Request"] --> ParallelRetrieval + + subgraph ParallelRetrieval["Stage 1: Candidate Generation (parallel)"] + CF["Collaborative Filtering\n(Two-Tower ANN, ALS)"] + CB["Content-Based\n(Text/Image ANN)"] + Geo["Geo/Location\n(Elasticsearch)"] + Pop["Popularity/Trending\n(Redis)"] + Cuisine["Cuisine Match\n(PostgreSQL)"] + KG["Knowledge Graph\n(GNN)"] + end + + ParallelRetrieval --> Merge["Merge & Deduplicate\n500-2000 candidates"] + Merge --> FeatureFetch["Batch Feature Fetch\n(Redis Feature Store)"] + + subgraph Ranking["Stage 2: Scoring & Ranking"] + FeatureFetch --> FeatureEng["Feature Engineering\n(user Γ— restaurant Γ— 
context Γ— cross)"] + FeatureEng --> Models["LightGBM / DeepFM / DIN\nMulti-objective scoring"] + Models --> TopN["Top 50-200"] + end + + subgraph Reranking["Stage 3: Re-ranking & Business Logic"] + TopN --> MMR["MMR Diversity"] + MMR --> Rules["Business Rules\n(filters, boosts, fairness)"] + Rules --> Final["Final Top 10-50"] + end + + Final --> Response["API Response"] +``` diff --git a/docs/modules/03-serving-delivery.md b/docs/modules/03-serving-delivery.md new file mode 100644 index 0000000..ba922c0 --- /dev/null +++ b/docs/modules/03-serving-delivery.md @@ -0,0 +1,343 @@ +# Module 03 β€” Serving & Delivery + +## Table of Contents + +- [Overview](#overview) +- [API Endpoints](#api-endpoints) +- [Caching Strategy](#caching-strategy) +- [Real-time Personalization Stream](#real-time-personalization-stream) +- [Error Handling](#error-handling) +- [Rate Limiting](#rate-limiting) + +--- + +## Overview + +The Serving & Delivery module exposes the recommendation system through a REST API, manages multi-layer caching to minimize latency, and maintains real-time personalization via the event streaming pipeline. + +--- + +## API Endpoints + +### GET `/api/v1/recommendations/homepage` + +Returns personalized recommendations for the user's home feed. 
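
Parameter handling for this endpoint could look like the following sketch. The function name and the error behavior are assumptions; the bounds (limit capped at 50, radius clamped to 1–50 km) come from the parameter table.

```python
# Illustrative clamping of homepage query parameters to the documented bounds.

def normalize_homepage_params(limit=20, offset=0, radius_km=5.0):
    """Clamp pagination and radius to the limits in the parameter table."""
    if limit < 1 or offset < 0:
        raise ValueError("limit must be >= 1 and offset must be >= 0")
    return {
        "limit": min(limit, 50),                      # max 50 per request
        "offset": offset,
        "radius_km": min(max(radius_km, 1.0), 50.0),  # radius bounded to 1-50 km
    }
```

Out-of-range values are clamped rather than rejected, e.g. `limit=80` becomes 50.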
+ +| Property | Value | +|----------|-------| +| **Latency SLA** | p50: 50ms, p99: 200ms | +| **Cache TTL** | Redis: 30min, CDN: 10min | +| **Auth** | Bearer token required | + +**Query Parameters** + +| Parameter | Type | Default | Description | +|-----------|------|---------|-------------| +| `limit` | int | 20 | Number of recommendations (max 50) | +| `offset` | int | 0 | Pagination offset | +| `lat` | float | β€” | User latitude (optional; uses profile location if absent) | +| `lng` | float | β€” | User longitude (optional) | +| `radius_km` | float | 5.0 | Search radius in km (1–50) | + +**Response** + +```json +{ + "request_id": "req_01HXZ789ABC", + "user_id": "usr_9f3a8b2c1d4e5f6a", + "recommendations": [ + { + "restaurant_id": "rst_3b7c9e1a2f4d5e6b", + "name": "Pho Saigon House", + "score": 0.92, + "distance_km": 1.2, + "primary_cuisine": "vietnamese", + "price_tier": "$$", + "avg_rating": 4.3, + "is_open_now": true, + "photo_url": "https://cdn.example.com/restaurants/rst_3b7c9e/photo_1.jpg", + "tags": ["casual", "family-friendly"], + "reason": "Based on your love of Vietnamese cuisine" + } + ], + "total": 20, + "model_version": "ranking_v2.4.1", + "latency_ms": 42 +} +``` + +--- + +### GET `/api/v1/recommendations/nearby` + +Returns top-rated and personalized nearby restaurants based on the user's current location. 
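
Proximity for nearby results ultimately reduces to great-circle distance. A self-contained haversine sketch follows; the production path uses Elasticsearch `geo_distance`, so this is purely illustrative.

```python
import math

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometres between two WGS84 coordinates."""
    earth_radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2.0 * earth_radius_km * math.asin(math.sqrt(a))
```

The sample user profile location (10.7880, 106.7320) is roughly 3.6 km from the sample restaurant (10.7769, 106.7009), so it would fall just outside this endpoint's default 3 km radius.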
+ +| Property | Value | +|----------|-------| +| **Latency SLA** | p50: 40ms, p99: 150ms | +| **Cache TTL** | Redis: 15min (location-sensitive), CDN: 5min | +| **Auth** | Bearer token required | + +**Query Parameters** + +| Parameter | Type | Required | Description | +|-----------|------|----------|-------------| +| `lat` | float | βœ… | User latitude | +| `lng` | float | βœ… | User longitude | +| `radius_km` | float | β€” | Search radius (default 3km, max 50km) | +| `limit` | int | β€” | Max results (default 20, max 50) | +| `cuisine` | string | β€” | Filter by cuisine type | +| `price_tier` | string | β€” | Filter: $, $$, $$$, $$$$ | +| `open_now` | bool | β€” | Only return currently open restaurants | + +--- + +### GET `/api/v1/recommendations/similar/{restaurant_id}` + +Returns restaurants similar to a given restaurant based on content and collaborative signals. + +| Property | Value | +|----------|-------| +| **Latency SLA** | p50: 30ms, p99: 100ms | +| **Cache TTL** | Redis: 2h, CDN: 30min | +| **Auth** | Bearer token required | + +**Path Parameters** + +| Parameter | Type | Description | +|-----------|------|-------------| +| `restaurant_id` | string | Target restaurant ID | + +**Query Parameters** + +| Parameter | Type | Default | Description | +|-----------|------|---------|-------------| +| `limit` | int | 10 | Number of similar restaurants (max 20) | +| `lat` | float | β€” | User location for geo-scoring | +| `lng` | float | β€” | User location for geo-scoring | + +--- + +### GET `/api/v1/recommendations/complementary/{restaurant_id}` + +Returns "People also liked" recommendations for users who interacted with a given restaurant. 
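
A toy sketch of the underlying idea is item co-occurrence over per-user interaction histories; the production system blends collaborative and content signals, so treat the data shape (`histories` as an iterable of restaurant-id sets) and the counting logic as assumptions.

```python
from collections import Counter

def also_liked(histories, restaurant_id, top_n=10):
    """Rank restaurants that co-occur with `restaurant_id` across user histories."""
    co = Counter()
    for visited in histories:  # one set of restaurant_ids per user
        if restaurant_id in visited:
            co.update(r for r in visited if r != restaurant_id)
    return [r for r, _ in co.most_common(top_n)]
```

In practice this relation would be pre-computed offline and cached, matching the 2h Redis TTL above.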
+ +| Property | Value | +|----------|-------| +| **Latency SLA** | p50: 30ms, p99: 100ms | +| **Cache TTL** | Redis: 2h, CDN: 30min | +| **Auth** | Bearer token required | + +**Path Parameters** + +| Parameter | Type | Description | +|-----------|------|-------------| +| `restaurant_id` | string | Source restaurant ID | + +--- + +### POST `/api/v1/recommendations/search-rerank` + +Re-ranks a set of search results using personalization signals. + +| Property | Value | +|----------|-------| +| **Latency SLA** | p50: 60ms, p99: 200ms | +| **Cache TTL** | Not cached (query-specific) | +| **Auth** | Bearer token required | + +**Request Body** + +```json +{ + "query": "sushi downtown", + "candidate_ids": [ + "rst_abc123", + "rst_def456", + "rst_ghi789" + ], + "context": { + "lat": 10.7769, + "lng": 106.7009, + "session_id": "sess_7e2f1a3b4c5d6e7f" + }, + "limit": 10 +} +``` + +--- + +### POST `/api/v1/recommendations/group` + +Generates recommendations for group dining by aggregating preferences of multiple users. + +| Property | Value | +|----------|-------| +| **Latency SLA** | p50: 80ms, p99: 250ms | +| **Cache TTL** | Redis: 15min | +| **Auth** | Bearer token required | + +**Request Body** + +```json +{ + "user_ids": [ + "usr_9f3a8b2c1d4e5f6a", + "usr_1a2b3c4d5e6f7a8b", + "usr_2b3c4d5e6f7a8b9c" + ], + "context": { + "lat": 10.7769, + "lng": 106.7009, + "occasion": "business", + "party_size": 3 + }, + "constraints": { + "dietary": ["halal"], + "price_tier_max": "$$$", + "radius_km": 5.0 + }, + "limit": 10 +} +``` + +--- + +## Caching Strategy + +The system uses a **4-layer caching architecture** to achieve high cache hit rates and minimize backend load. 
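
The Redis application-layer key pattern `rec:{endpoint}:{user_id}:{context_hash}` might be derived like this; the choice of SHA-256 and the 16-character truncation are assumptions.

```python
import hashlib
import json

def rec_cache_key(endpoint, user_id, context):
    """Build a deterministic Redis key for a recommendation response."""
    # Sort context keys so logically equal contexts hash identically.
    blob = json.dumps(context, sort_keys=True, separators=(",", ":"))
    context_hash = hashlib.sha256(blob.encode("utf-8")).hexdigest()[:16]
    return f"rec:{endpoint}:{user_id}:{context_hash}"
```

Sorting the context keys before hashing means two requests with the same parameters in a different order share one cache entry.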
+ +```mermaid +flowchart TD + Request["User Request"] --> L1 + + L1["Layer 1: CDN / Edge Cache\n(CloudFront)\nTTL: 5-15 min\nHit rate target: ~40%"] + L1 -- MISS --> L2 + + L2["Layer 2: Redis Application Cache\n(per user_id + context hash)\nTTL: 30 min - 2h\nHit rate target: ~80%"] + L2 -- MISS --> L3 + + L3["Layer 3: Feature Store Cache\n(per user_id, per restaurant_id)\nTTL: 1-6h\nPre-warmed daily"] + L3 -- MISS --> L4 + + L4["Layer 4: Model Prediction Cache\n(pre-computed batch scores)\nTTL: 15-30 min\nFor power users + popular restaurants"] + L4 -- MISS --> Engine["Full Recommendation Pipeline\n~40-60ms"] + + Engine --> L2 + Engine --> L4 +``` + +### Cache Configuration + +| Layer | Technology | TTL | Cache Key | Invalidation | +|-------|-----------|-----|-----------|-------------| +| **CDN/Edge** | CloudFront | 5–15 min | URL + user geo-bucket | Manual purge on major catalog updates | +| **Redis App** | Redis Cluster | 30min (nearby) – 2h (similar) | `rec:{endpoint}:{user_id}:{context_hash}` | Event-driven: ORDER_PLACED, profile update | +| **Feature Store** | Redis Cluster | 1–6h | `feat:user:{user_id}`, `feat:rst:{restaurant_id}` | Daily batch refresh | +| **Model Prediction** | Redis | 15–30 min | `pred:{model_version}:{user_segment}:{geo_bucket}` | Model deployment | + +### Cache Invalidation Events + +| Trigger Event | Cache Invalidated | Reason | +|---------------|-------------------|--------| +| `ORDER_PLACED` | Redis App (user) | Immediate preference signal | +| `FAVORITE_ADDED` | Redis App (user) | Explicit preference update | +| User profile update | Redis App + Feature Store | Profile change | +| Restaurant catalog update | Feature Store (restaurant) | Data freshness | +| Model deployment | Model Prediction Cache | New model scores | + +--- + +## Real-time Personalization Stream + +After a significant user interaction, session features are updated in near-real-time (<5s) to improve subsequent requests within the same session. 
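
Conceptually, the Flink job folds each enriched event into the session-feature hash that serving later reads from `sess:{session_id}`. A stripped-down, in-memory stand-in is sketched below; the field names follow the session-feature examples in these docs, while the fold logic itself is an assumption.

```python
# In-memory stand-in for the Flink session aggregation; the real job writes
# the resulting hash to Redis under sess:{session_id}.

def update_session_features(features, event):
    """Fold one enriched event into the running session-feature mapping."""
    features = dict(features)  # copy so the fold stays side-effect free
    if event["event_type"] == "RESTAURANT_CLICK":
        features["click_count"] = features.get("click_count", 0) + 1
        seen = features.get("viewed_cuisines", [])
        cuisine = event.get("primary_cuisine")
        if cuisine and cuisine not in seen:
            features["viewed_cuisines"] = seen + [cuisine]
    features["last_event_at"] = event["timestamp"]
    return features
```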
+ +```mermaid +sequenceDiagram + participant User + participant API as Recommendation API + participant Kafka + participant Flink + participant Redis as Redis Cache + + User->>API: Performs action (click, order, review) + API->>Kafka: Publish event to user-events-raw + API-->>User: Event ACK (async, non-blocking) + + Kafka->>Flink: Event consumed (<1s) + Flink->>Flink: Validate + Enrich + Aggregate session + Flink->>Redis: Update session features (<3s total) + + Note over Redis: Session features updated
within <5s of user action + + User->>API: Next recommendation request + API->>Redis: Fetch updated session features + API-->>User: Personalized recommendations
(reflecting recent behavior) +``` + +--- + +## Error Handling + +### Error Response Format + +```json +{ + "error": { + "code": "RECOMMENDATION_ENGINE_TIMEOUT", + "message": "Recommendation service timed out. Fallback response provided.", + "details": { + "timeout_ms": 200, + "fallback_strategy": "popularity_based" + }, + "request_id": "req_01HXZ789ABC", + "timestamp": "2024-03-15T14:32:07.123Z" + } +} +``` + +### Error Codes & HTTP Status + +| HTTP Status | Error Code | Description | Fallback Strategy | +|-------------|-----------|-------------|------------------| +| 400 | `INVALID_PARAMETERS` | Missing required params or out-of-range values | None (client error) | +| 401 | `UNAUTHORIZED` | Invalid or expired JWT token | None | +| 404 | `RESTAURANT_NOT_FOUND` | `restaurant_id` does not exist | None | +| 429 | `RATE_LIMIT_EXCEEDED` | Request rate limit exceeded | None | +| 503 | `RECOMMENDATION_ENGINE_TIMEOUT` | Engine exceeded latency budget | Popularity-based fallback | +| 503 | `FEATURE_STORE_UNAVAILABLE` | Redis feature store unreachable | Default feature fallback | +| 500 | `INTERNAL_ERROR` | Unexpected internal error | Popularity-based fallback | + +### Fallback Strategy + +When the recommendation engine fails or times out, the service falls back to a **popularity-based response**: +- Returns top-N restaurants in the user's geo-bucket sorted by `popularity_score` +- Fallback response is cached for 5 minutes +- Logged with `fallback=true` flag for monitoring + +--- + +## Rate Limiting + +| Scope | Limit | Window | +|-------|-------|--------| +| Per user (authenticated) | 100 requests | 1 minute | +| Per API key (server-to-server) | 1,000 requests | 1 minute | +| Per IP (unauthenticated) | 20 requests | 1 minute | +| Group recommendations endpoint | 10 requests | 1 minute per user | + +Rate limiting is enforced at the API Gateway layer using a sliding window algorithm. Exceeded limits return HTTP 429 with `Retry-After` header. 
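
A minimal in-memory sketch of the sliding-window check; the gateway's real implementation is distributed with shared state across instances, so treat this as illustrative only.

```python
import time
from collections import defaultdict, deque

class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window_s` seconds per key."""

    def __init__(self, limit=100, window_s=60):
        self.limit = limit
        self.window_s = window_s
        self.hits = defaultdict(deque)

    def allow(self, key, now=None):
        now = time.monotonic() if now is None else now
        window = self.hits[key]
        while window and window[0] <= now - self.window_s:
            window.popleft()          # drop hits that fell out of the window
        if len(window) >= self.limit:
            return False              # caller responds 429 with Retry-After
        window.append(now)
        return True
```

Denied requests are not recorded, so a client hammering the endpoint does not extend its own lockout.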
+ +```json +{ + "error": { + "code": "RATE_LIMIT_EXCEEDED", + "message": "Rate limit of 100 requests per minute exceeded.", + "details": { + "limit": 100, + "window": "1m", + "retry_after_seconds": 23 + } + } +} +``` diff --git a/docs/modules/04-feature-store.md b/docs/modules/04-feature-store.md new file mode 100644 index 0000000..b34b505 --- /dev/null +++ b/docs/modules/04-feature-store.md @@ -0,0 +1,258 @@ +# Module 04 β€” Feature Store + +## Table of Contents + +- [Overview](#overview) +- [Offline Store](#offline-store) +- [Online Store](#online-store) +- [Feature Registry](#feature-registry) +- [Restaurant-Specific Features](#restaurant-specific-features) +- [Feature Pipelines](#feature-pipelines) + +--- + +## Overview + +The Feature Store provides a unified layer for storing, serving, and managing ML features with consistency guarantees between offline training and online serving. + +```mermaid +graph TB + subgraph Sources["Data Sources"] + EventLog["Event Log (S3)"] + PG["PostgreSQL"] + Kafka["Kafka Streams"] + ExternalAPIs["External APIs"] + end + + subgraph OfflineStore["Offline Store"] + S3Parquet["S3 / GCS\n(Parquet files)\nDaily snapshots"] + BigQuery["BigQuery\n(Analytics queries\nTraining data joins)"] + end + + subgraph OnlineStore["Online Store"] + RedisCluster["Redis Cluster\n(<5ms latency SLA)\n(Hash, Sorted Set, String)"] + end + + subgraph Registry["Feature Registry"] + FeatureReg["Feature Registry\n(metadata, lineage, SLAs)"] + end + + subgraph Consumers["Consumers"] + Training["ML Training\n(Spark jobs)"] + Serving["Online Serving\n(Recommendation Service)"] + Analytics["Analytics\n(Metabase / BI)"] + end + + Sources --> OfflineStore + Kafka --> OnlineStore + OfflineStore --> OnlineStore + OfflineStore --> Training + OnlineStore --> Serving + OfflineStore --> Analytics + FeatureReg --> Training + FeatureReg --> Serving +``` + +--- + +## Offline Store + +### Storage Architecture + +| Component | Technology | Purpose | 
+|-----------|-----------|---------|
+| **Primary Storage** | S3 / GCS (Parquet) | Partitioned columnar storage for large-scale feature data |
+| **Analytics Layer** | BigQuery | Serverless SQL analytics, training data joins |
+| **Partitioning** | `dt=YYYY-MM-DD / entity_type=user\|restaurant` | Time-partitioned for efficient range scans |
+| **Format** | Apache Parquet | Columnar compression, Spark-compatible, schema evolution support |
+
+### Partition Strategy
+
+```
+s3://rr-feature-store/
+β”œβ”€β”€ features/
+β”‚   β”œβ”€β”€ user_features/
+β”‚   β”‚   └── dt=2024-03-15/
+β”‚   β”‚       └── part-00000.parquet
+β”‚   β”œβ”€β”€ restaurant_features/
+β”‚   β”‚   └── dt=2024-03-15/
+β”‚   β”‚       └── part-00000.parquet
+β”‚   └── interaction_features/
+β”‚       └── dt=2024-03-15/
+β”‚           └── part-00000.parquet
+β”œβ”€β”€ training_datasets/
+β”‚   └── ranking_model_v2/
+β”‚       └── dt=2024-03-15/
+└── embeddings/
+    β”œβ”€β”€ user_embeddings/
+    β”‚   └── dt=2024-03-15/
+    └── restaurant_embeddings/
+        └── dt=2024-03-15/
+```
+
+### Offline Feature Retention
+
+| Feature Group | Retention Period | Notes |
+|--------------|-----------------|-------|
+| User behavior features | 24 months | Rolling window for seasonal pattern detection |
+| Restaurant features | Indefinite | Full history; partitioned by date |
+| Interaction features | 12 months | Training data construction |
+| Embeddings | 6 months (last 3 versions) | Previous versions retained for model comparison |
+
+---
+
+## Online Store
+
+### Architecture
+
+- **Technology**: Redis Cluster (6 shards, 3 replicas each)
+- **Latency SLA**: <5ms p99 for feature fetch
+- **Capacity**: ~200GB for user features + ~50GB for restaurant features
+- **Persistence**: AOF (append-only file) enabled; RDB snapshots every 6h
+- **Eviction Policy**: `allkeys-lru` (LRU eviction under memory pressure)
+
+### Data Structures
+
+| Entity | Redis Key Pattern | Data Structure | Example Value |
+|--------|------------------|----------------|--------------|
+| User embedding | 
`feat:user:{user_id}:embedding` | Redis String (binary) | 128d float32 vector | +| User preferences | `feat:user:{user_id}:prefs` | Redis Hash | `cuisine_pref_1=japanese score=0.87` | +| User segment | `feat:user:{user_id}:segment` | Redis String | `active` | +| Restaurant features | `feat:rst:{restaurant_id}` | Redis Hash | `avg_rating=4.3 price_tier=2 ...` | +| Restaurant embedding | `feat:rst:{restaurant_id}:embedding` | Redis String (binary) | 128d float32 vector | +| Popularity sorted set | `pop:geo:{geohash_prefix6}` | Redis Sorted Set | `rst_abc123 = 0.87` | +| Session features | `sess:{session_id}` | Redis Hash | `click_count=5 viewed_cuisines=japanese,vietnamese` | + +### Feature Freshness + +| Feature Group | TTL in Redis | Refresh Mechanism | +|--------------|-------------|-------------------| +| User preferences | 6h | Daily batch + event-driven on significant action | +| User embedding | 12h | Weekly model retrain output | +| User segment | 6h | Daily batch | +| Restaurant features | 6h | Daily catalog refresh | +| Restaurant embedding | 24h | Weekly embedding rebuild | +| Session features | 30min (idle TTL) | Real-time via Flink | +| Popularity scores | 2h | Hourly batch update | + +--- + +## Feature Registry + +The Feature Registry is the single source of truth for all feature metadata, lineage, and SLAs. + +### Registry Schema + +```yaml +feature_name: user_cuisine_affinity +version: "2.1" +description: "Top-3 cuisine preferences with affinity scores for a user, derived from interaction history." +entity_type: user +data_type: array[{cuisine: string, score: float}] +dimensionality: 3 items Γ— 2 fields +source: + pipeline: user_profile_rebuild + tables: + - events.restaurant_views + - events.orders + - users.profiles +computation_logic: | + For each user, aggregate cuisine interactions (views Γ— 0.3 + clicks Γ— 0.6 + orders Γ— 1.0) + over last 90 days. Normalize by total interactions. Return top 3 cuisines. 
+owner: "ml-platform-team"
+freshness_sla:
+  offline: "daily (by 5:00 AM UTC)"
+  online: "6h TTL in Redis, refreshed by 7:30 AM UTC"
+monitoring:
+  null_rate_threshold: 0.02
+  drift_detection: PSI threshold 0.2
+tags: [user, preference, cuisine, behavioral]
+created_at: "2023-04-01"
+updated_at: "2024-01-15"
+```
+
+### Feature Groups
+
+| Feature Group | Entity | Count | Owner | Freshness |
+|--------------|--------|-------|-------|-----------|
+| `user_behavioral` | user | 24 features | ML Platform | Daily |
+| `user_preference` | user | 12 features | ML Platform | Daily + event-driven |
+| `user_embedding` | user | 1 (128d vector) | Recommendation Team | Weekly |
+| `restaurant_static` | restaurant | 18 features | Catalog Team | Daily |
+| `restaurant_dynamic` | restaurant | 10 features | ML Platform | Hourly |
+| `restaurant_embedding` | restaurant | 2 (text 768d + image 512d) | ML Platform | Weekly |
+| `geo_features` | restaurant | 6 features | Platform Team | Daily |
+| `context_features` | request | 8 features | Serving Team | Real-time (computed) |
+
+---
+
+## Restaurant-Specific Features
+
+The restaurant domain requires specialized features beyond generic item features:
+
+### Geo Features
+
+| Feature | Type | Description | Update Frequency |
+|---------|------|-------------|-----------------|
+| `lat` | float | Latitude (-90 to 90) | On catalog update |
+| `lng` | float | Longitude (-180 to 180) | On catalog update |
+| `geohash_7` | string | Geohash precision 7 (~153m × 153m) | On catalog update |
+| `geohash_6` | string | Geohash precision 6 (~1.2km × 0.6km) | On catalog update |
+| `city_id` | string | Normalized city identifier | On catalog update |
+| `district_id` | string | Normalized district identifier | On catalog update |
+
+### Cuisine Embeddings
+
+| Feature | Type | Description |
+|---------|------|-------------|
+| `cuisine_embedding_32d` | float[32] | Learned cuisine taxonomy embedding (captures cuisine similarity) |
+| `primary_cuisine_id` | int | Mapped integer ID from taxonomy (120+ cuisines) |
+| `cuisine_type_ids` | int[] | All cuisine type IDs (multi-label) |
+
+### Time-Based Features
+
+| Feature | Type | Description | Update |
+|---------|------|-------------|--------|
+| `is_open_now` | bool | Computed from operating_hours + current time (UTC) | Request-time |
+| `meal_type_open` | string | breakfast/lunch/dinner/late based on current hour | Request-time |
+| `is_peak_hour_now` | bool | Peak hours: Lunch 11-14, Dinner 18-21 | Request-time |
+| `avg_wait_time_min` | float | Estimated wait time from booking data | Hourly |
+
+### Sentiment & Quality Scores
+
+| Feature | Type | Description | Update |
+|---------|------|-------------|--------|
+| `sentiment_score` | float[0,1] | Aggregate sentiment from all reviews (DistilBERT) | Daily |
+| `food_sentiment` | float[0,1] | Aspect-level food sentiment | Daily |
+| `service_sentiment` | float[0,1] | Aspect-level service sentiment | Daily |
+| `photo_quality_score` | float[0,1] | CNN-based photo quality assessment | Weekly |
+| `review_recency_score` | float[0,1] | Weighted avg rating (recency-weighted, 90d) | Daily |
+
+---
+
+## Feature Pipelines
+
+### Daily Batch Pipeline (Airflow DAG)
+
+```mermaid
+flowchart LR
+    ExtractEvents["Extract Events\n(S3 Parquet)"] --> ComputeUserFeatures["Compute User Features\n(Spark)"]
+    ExtractCatalog["Extract Catalog\n(PostgreSQL)"] --> ComputeRestFeatures["Compute Restaurant Features\n(Spark)"]
+    ComputeUserFeatures --> WriteOffline["Write Offline Store\n(S3 Parquet + BigQuery)"]
+    ComputeRestFeatures --> WriteOffline
+    WriteOffline --> WarmOnline["Warm Online Store\n(Redis bulk upload)"]
+    WarmOnline --> ValidateFeatures["Validate Features\n(null rates, schema, drift)"]
+    ValidateFeatures --> Done["✅ Pipeline Complete"]
+```
+
+### Real-time Feature Update (Flink)
+
+```mermaid
+flowchart LR
+    Kafka["Kafka:\nuser-events-enriched"] --> SessionWindow["Flink Session Window\n(30min idle timeout)"]
+    SessionWindow --> UpdateSession["Update Session Features\n(Redis HSET)"]
+    SessionWindow --> CheckThreshold{"Significant\naction?"}
+    CheckThreshold -- "ORDER_PLACED\nFAVORITE_ADDED" --> UpdateUserPrefs["Update User Prefs\n(Redis HSET)"]
+    CheckThreshold -- "No" --> Skip["Skip user profile update"]
+    UpdateSession & UpdateUserPrefs --> Done["Feature Update Complete\n(<5s from event)"]
+```
diff --git a/docs/modules/05-ml-platform.md b/docs/modules/05-ml-platform.md
new file mode 100644
index 0000000..3b09d27
--- /dev/null
+++ b/docs/modules/05-ml-platform.md
@@ -0,0 +1,239 @@
+# Module 05 — ML Platform
+
+## Table of Contents
+
+- [Overview](#overview)
+- [Training Pipeline](#training-pipeline)
+- [Evaluation Metrics](#evaluation-metrics)
+- [Model Registry](#model-registry)
+- [Model Serving](#model-serving)
+- [Deployment Strategy](#deployment-strategy)
+
+---
+
+## Overview
+
+The ML Platform manages the complete lifecycle of recommendation models — from data preparation and training, through evaluation and registration, to production serving and monitoring. It supports both scheduled retraining and ad-hoc experimentation.
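The evaluation-and-registration gate in this lifecycle can be sketched as a simple threshold check. This is a minimal illustration, not the production pipeline: the function name is hypothetical, and the thresholds are the minimum offline-metric targets listed in the Evaluation Metrics section.

```python
# Hypothetical promotion gate: a trained model is registered only if its
# offline metrics clear the minimum targets (AUC-ROC >= 0.82,
# NDCG@10 >= 0.45, MAP@10 >= 0.38); otherwise the run is flagged for review.
THRESHOLDS = {"auc_roc": 0.82, "ndcg_10": 0.45, "map_10": 0.38}

def promotion_decision(metrics: dict) -> tuple:
    """Return (passed, failed_metric_names) for a candidate model run."""
    failed = [name for name, floor in THRESHOLDS.items()
              if metrics.get(name, 0.0) < floor]
    return (not failed, failed)

# A run that clears every floor becomes a deployment candidate.
passed, failed = promotion_decision(
    {"auc_roc": 0.831, "ndcg_10": 0.463, "map_10": 0.391})
print(passed, failed)  # True []
```

In practice this check runs inside the Airflow training DAG; a failing run triggers the Slack/PagerDuty alert path instead of registering to MLflow.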
+ +```mermaid +graph TB + subgraph Training["Training Pipeline (Airflow + Spark)"] + DataExtract["Data Extraction\n(S3 + BigQuery)"] + FeatEng["Feature Engineering\n(Spark)"] + ModelTrain["Model Training\n(PyTorch / LightGBM)"] + Evaluate["Evaluation\n(Offline metrics)"] + end + + subgraph Registry["Model Registry (MLflow)"] + MLflowReg["MLflow Registry\nΒ· Versioning\nΒ· Artifacts\nΒ· Lineage"] + end + + subgraph Serving["Model Serving"] + BatchInference["Batch Inference\n(every 6h, Spark)"] + RealtimeServing["Real-time Inference\n(Triton / TF Serving)"] + NearRealtime["Near-real-time\n(Kafka β†’ Flink β†’ Re-score, <5s)"] + end + + subgraph Deployment["Deployment"] + Canary["Canary (5% traffic, 2h)"] + Shadow["Shadow Testing"] + BlueGreen["Blue-Green Switch"] + end + + Training --> Registry + Registry --> Serving + Serving --> Deployment + Deployment --> Monitoring["Production Monitoring\n(Evidently + Grafana)"] + Monitoring --> Training +``` + +--- + +## Training Pipeline + +### Pipeline Steps + +```mermaid +flowchart LR + A["Data Extraction\n(S3 Parquet + BigQuery)"] --> B["Data Validation\n(Great Expectations\nschema + null checks)"] + B --> C["Feature Engineering\n(Spark)\nΒ· Interaction aggregates\nΒ· Negative sampling\nΒ· Cross feature construction"] + C --> D["Training\n(PyTorch / LightGBM)\nDistributed on Spark/GPU cluster"] + D --> E["Evaluation\n(Holdout + temporal split)"] + E --> F{Metrics\npass thresholds?} + F -- Yes --> G["Register to MLflow\n(version, artifacts, metrics, tags)"] + F -- No --> H["Alert & Investigate\n(Slack + PagerDuty)"] + G --> I["Candidate for Deployment\n(human or auto-approval)"] +``` + +### Training Data Construction + +| Component | Description | +|-----------|-------------| +| **Positive samples** | User interactions: clicks (weight 0.3), orders (weight 1.0), favorites (weight 0.7), reviews (weight 0.5) | +| **Negative samples** | Shown but not interacted with (weight -0.1); random unshown restaurants (weight -0.5) | +| 
**Temporal split** | Train: last 28 days, Validation: day 29–30, Test: last 2 days before retrain | +| **Sample ratio** | ~1:10 positive:negative | +| **Data volume** | ~50M samples per weekly retrain | +| **Debiasing** | Inverse propensity scoring (IPS) to correct for position bias in training labels | + +### Model Training Schedule + +| Model | Frequency | Compute | Duration | Trigger | +|-------|-----------|---------|---------|---------| +| Two-Tower (CF) | Weekly (Sunday 3AM) | 8Γ— A100 GPU | ~3h | Scheduled | +| LightGBM Ranker | Weekly (Sunday 3AM) | 32-core CPU | ~1h | Scheduled | +| DeepFM Ranker | Weekly (Sunday 3AM) | 4Γ— A100 GPU | ~2h | Scheduled | +| Embedding index rebuild | Weekly (Sunday 6AM) | 16-core CPU | ~2h | Post-training | +| Cold start model | Monthly | 4-core CPU | ~30min | Scheduled | +| Knowledge Graph (GNN) | Bi-weekly | 8Γ— A100 GPU | ~4h | Scheduled | + +--- + +## Evaluation Metrics + +### Offline Metrics + +| Metric | Description | Target | Threshold (Regression) | +|--------|-------------|--------|------------------------| +| **AUC-ROC** | Area under ROC curve for click prediction | β‰₯0.82 | Drop >2% triggers alert | +| **NDCG@10** | Normalized Discounted Cumulative Gain at 10 | β‰₯0.45 | Drop >5% blocks deploy | +| **NDCG@20** | NDCG at 20 | β‰₯0.42 | Drop >5% blocks deploy | +| **MAP@10** | Mean Average Precision at 10 | β‰₯0.38 | Drop >5% blocks deploy | +| **Coverage** | % of catalog appearing in recommendations | β‰₯30% | Drop >10% triggers alert | +| **Diversity** | Avg intra-list cuisine diversity (ILD) | β‰₯0.60 | Drop >10% triggers alert | +| **Location Relevance** | % recommendations within user's historical visit radius | β‰₯70% | Drop >10% triggers alert | + +### Online Metrics (A/B Test Acceptance) + +| Metric | Baseline | Minimum Improvement | +|--------|----------|---------------------| +| Click-through Rate (CTR) | ~8% | +0.5% absolute | +| Conversion Rate (order) | ~3.5% | +0.2% absolute | +| Revenue per User | $X | 
+1% relative | +| Session Duration | X min | No significant regression | + +--- + +## Model Registry + +### MLflow Registry Structure + +Each model version in MLflow contains: + +| Artifact | Description | +|---------|-------------| +| **Model binary** | Trained model weights (PyTorch `.pt`, LightGBM `.txt`, TF SavedModel) | +| **Feature schema** | Exact feature names, types, and preprocessing steps used | +| **Evaluation report** | All offline metrics on validation and test sets | +| **Training config** | Hyperparameters, data split dates, sample counts | +| **Data lineage** | Source datasets (S3 paths + versions) used for training | +| **Dependency manifest** | Python packages + versions | + +### Model States + +``` +Registered β†’ Staging β†’ Production β†’ Archived +``` + +| State | Description | Who Can Promote | +|-------|-------------|----------------| +| `Registered` | Just trained; awaiting evaluation | Automated pipeline | +| `Staging` | Passed offline eval; ready for shadow/canary | ML Engineer (manual review) | +| `Production` | Serving live traffic | Senior ML Engineer / auto-promotion after canary | +| `Archived` | No longer serving; retained for lineage | Auto-archival after 90d | + +### Model Tagging Convention + +```python +mlflow.set_tags({ + "model_type": "ranking", + "framework": "lightgbm", + "training_date": "2024-03-15", + "train_data_start": "2024-02-15", + "train_data_end": "2024-03-13", + "ndcg_10": 0.463, + "auc_roc": 0.831, + "approved_by": "ml-platform-team", + "experiment_id": "exp_ranking_v2" +}) +``` + +--- + +## Model Serving + +### Serving Modes + +#### Batch Inference (every 6 hours) + +- Pre-computes recommendation scores for active users (segment: active, power_user) +- Results stored in Redis with 6h TTL +- Triggered by Airflow DAG; uses Spark for distributed scoring +- Covers ~80% of traffic (high-cache-hit endpoint) + +#### Real-time Inference + +- Handles cold-path requests not covered by batch inference +- Model loaded in 
NVIDIA Triton Inference Server +- Supports dynamic batching (batch_size up to 512) +- GPU nodes: minimum 2Γ— A10G per environment +- Latency target: <20ms p99 for ranking model inference + +#### Near-real-time Re-scoring (<5s) + +- Kafka β†’ Flink β†’ Recommendation re-score +- Triggered by significant events: ORDER_PLACED, FAVORITE_ADDED +- Uses lightweight session-aware ranking (LightGBM only) +- Updates Redis cache with fresh scores + +### Triton Model Configuration + +```yaml +name: "lightgbm_ranker" +platform: "ensemble" +max_batch_size: 512 +dynamic_batching: + preferred_batch_size: [64, 128, 256] + max_queue_delay_microseconds: 5000 +instance_group: + - kind: KIND_GPU + count: 1 + gpus: [0] +``` + +--- + +## Deployment Strategy + +### Canary Deployment Flow + +```mermaid +flowchart LR + NewModel["New Model Version\n(Staging)"] --> ShadowTest["Shadow Testing\n(0% live traffic\nlog predictions only)"] + ShadowTest --> CanaryDeploy["Canary Deploy\n5% traffic\n2h window"] + CanaryDeploy --> MetricCheck{Guardrail\nMetrics OK?} + MetricCheck -- Yes --> Expand["Expand to 25% β†’ 50% β†’ 100%\nBlue-Green final switch"] + MetricCheck -- No --> Rollback["Automatic Rollback\nto previous version"] + Expand --> Archived["Archive Previous Version\n(retain 90 days)"] +``` + +### Rollback Triggers + +Automatic rollback is triggered if any of the following conditions are met during canary: + +| Trigger | Threshold | Window | +|---------|-----------|--------| +| API latency p99 | >300ms | 5-minute rolling window | +| API error rate | >1% | 5-minute rolling window | +| NDCG@10 (online proxy) | Drop >5% vs. baseline | 1-hour window | +| Model inference latency | >50ms p99 | 5-minute rolling window | +| Feature fetch errors | >0.5% | 5-minute rolling window | + +### Blue-Green Switch + +Once canary passes guardrails: +1. New version promoted to `production` state in MLflow +2. Kubernetes traffic split updated: new version receives 100% traffic +3. 
Old version demoted to `archived` state +4. Redis model prediction cache invalidated +5. Alert sent to ML team: deployment complete diff --git a/docs/modules/06-experimentation.md b/docs/modules/06-experimentation.md new file mode 100644 index 0000000..1016b9f --- /dev/null +++ b/docs/modules/06-experimentation.md @@ -0,0 +1,249 @@ +# Module 06 β€” Experimentation + +## Table of Contents + +- [Overview](#overview) +- [Experiment Design](#experiment-design) +- [Metrics Framework](#metrics-framework) +- [Statistical Testing](#statistical-testing) +- [Experiment Lifecycle](#experiment-lifecycle) +- [Infrastructure](#infrastructure) + +--- + +## Overview + +The Experimentation module provides a rigorous A/B testing framework to evaluate recommendation algorithm changes, UI treatments, and business rule modifications before full rollout. + +--- + +## Experiment Design + +### Traffic Splitting + +Traffic is split using **hash-based user assignment** to ensure: +- Deterministic assignment (same user always in same bucket) +- No leakage between variants +- Consistent experience across sessions + +``` +experiment_bucket = hash(user_id + experiment_id) % 100 +``` + +Assignment to variants based on configured bucket ranges: +- Control: buckets 0–49 (50%) +- Treatment A: buckets 50–74 (25%) +- Treatment B: buckets 75–99 (25%) + +### Sample Size Requirements + +| Parameter | Value | Notes | +|-----------|-------|-------| +| Minimum samples per variant | 10,000 unique users | Statistical power requirement | +| Minimum experiment duration | 7 days | Capture weekly behavioral patterns | +| Maximum experiment duration | 14 days | Avoid novelty effects confounding results | +| Holdback group | 5% of users | Always in control; used for long-term health tracking | + +### Experiment Layers + +To run multiple experiments simultaneously without interference: + +| Layer | Scope | Examples | +|-------|-------|---------| +| `algorithm` | Recommendation model / ranking logic | New ranker, CF 
model update | +| `ui` | Frontend presentation | Card layout, photo size, order of sections | +| `business_rules` | Re-ranking business rules | Diversity weights, sponsored ratio | +| `features` | Feature engineering changes | New feature, feature removal | + +Each layer uses independent hash seeds, preventing cross-layer interaction. + +--- + +## Metrics Framework + +### Primary Metrics + +| Metric | Definition | Minimum Detectable Effect (MDE) | +|--------|-----------|--------------------------------| +| **CTR** | Clicks / Impressions per session | Β±0.5% absolute | +| **Conversion Rate** | Orders / Sessions | Β±0.2% absolute | +| **Revenue per User** | Total order revenue / Active users | Β±1% relative | + +### Secondary Metrics + +| Metric | Definition | Direction | +|--------|-----------|-----------| +| `avg_restaurants_viewed` | Avg unique restaurants viewed per session | Higher is better (exploration) | +| `session_duration_min` | Avg session duration in minutes | Neutral (engagement proxy) | +| `bounce_rate` | Sessions with 0 restaurant clicks / All sessions | Lower is better | +| `search_rate` | Sessions with β‰₯1 search / All sessions | Lower = recs more satisfying | +| `diversity_score` | Avg intra-list cuisine diversity (ILD) | Track; don't optimize directly | +| `new_restaurant_discovery` | % recs showing restaurants user never visited | Track for catalog health | + +### Guardrail Metrics + +These metrics must not regress significantly; a breach blocks experiment promotion: + +| Guardrail | Threshold | Justification | +|-----------|-----------|--------------| +| `page_load_time_p99` | No increase >50ms | User experience | +| `error_rate` | No increase >0.5% absolute | Reliability | +| `user_complaints` | No increase >10% relative | User satisfaction | +| `recommendation_latency_p99` | No increase >20ms | Performance | + +--- + +## Statistical Testing + +### Methodology + +- **Primary test**: Sequential testing (allows early stopping without inflating 
Type I error) +- **Significance level**: Ξ± = 0.05 (5% false positive rate) +- **Statistical power**: 0.80 (80% chance of detecting true effect at MDE) +- **Multiple comparison correction**: Bonferroni correction when testing >3 metrics simultaneously + +### Sequential Testing + +Sequential testing allows monitoring results continuously and stopping early if: +- A significant result is detected (stop early β€” winner found) +- The experiment is clearly futile (stop early β€” no effect) + +This uses the **mSPRT** (mixture Sequential Probability Ratio Test) framework with Ξ±-spending to maintain the overall Type I error rate at 0.05. + +### Power Analysis + +For CTR (baseline 8%, MDE 0.5%): + +``` +Required sample size β‰ˆ 10,000 users per variant +(two-tailed, Ξ±=0.05, power=0.80) +``` + +### Variance Reduction + +**CUPED** (Controlled-experiment Using Pre-Experiment Data) is applied to reduce variance and improve sensitivity: + +``` +Y_cuped = Y_post - ΞΈ Γ— (Y_pre - E[Y_pre]) +ΞΈ = Cov(Y_post, Y_pre) / Var(Y_pre) +``` + +Pre-experiment metric: same user's CTR from the 7 days prior to experiment start. + +--- + +## Experiment Lifecycle + +```mermaid +flowchart LR + H["1. Hypothesis\n(What & Why)"] --> D["2. Design\n(Metrics, Power, Duration)"] + D --> I["3. Implementation\n(Feature flag, traffic split)"] + I --> M["4. Monitoring\n(Daily metric review)"] + M --> A["5. Analysis\n(Statistical tests)"] + A --> Dec{"6. Decision"} + Dec -- "Winner" --> Roll["7a. Rollout\n(100% traffic)"] + Dec -- "Inconclusive" --> Doc["7b. Document\n& Archive"] + Dec -- "Loser" --> Kill["7c. Kill\n& Post-mortem"] + Roll --> Docs2["8. Documentation\n(Impact report)"] + Kill --> Docs2 +``` + +### Stage 1: Hypothesis + +Document the hypothesis before any implementation: +- **Problem**: What user pain point or opportunity is being addressed? 
+- **Hypothesis**: "We believe that [change X] will [achieve outcome Y] for [user segment Z]" +- **Expected impact**: Estimated CTR lift, revenue impact +- **Risks**: Potential negative effects to guardrail metrics + +### Stage 2: Design + +| Decision | Options | +|----------|---------| +| Control vs. treatment split | 50/50 (default), 80/20 (for risky changes) | +| User segments targeted | All users, cold_start only, power_users only, etc. | +| Experiment layer | algorithm / ui / business_rules / features | +| Primary metric | CTR / Conversion Rate / Revenue per User | +| Duration | 7–14 days | + +### Stage 3: Implementation + +- **Feature flags**: All experiment branches behind a feature flag service +- **Traffic assignment**: Hash-based, server-side, logged with every event +- **Event tracking**: Every impression and interaction includes `experiment_variants` metadata +- **Instrumentation review**: QA sign-off on tracking before experiment starts + +### Stage 4: Monitoring + +Daily monitoring during experiment: +- Primary and secondary metrics per variant +- Guardrail metric alerts (auto-pause if guardrail breached) +- Sample ratio mismatch (SRM) check: Chi-squared test on traffic distribution +- Data quality checks: null rates, logging errors + +### Stage 5: Analysis + +At experiment conclusion: +1. Run statistical tests (mSPRT / standard t-test depending on stopping reason) +2. Apply Bonferroni correction for multiple metrics +3. Check heterogeneous treatment effects (by user segment, device, geo) +4. 
Calculate practical significance alongside statistical significance + +### Stage 6: Decision Framework + +| Outcome | Condition | Action | +|---------|-----------|--------| +| **Ship** | Statistically significant positive on primary metric AND no guardrail regression | Full rollout, then document | +| **Iterate** | Directional positive but insufficient power | Extend or redesign experiment | +| **Hold** | Significant on primary but guardrail regression | Investigate before decision | +| **Kill** | Negative or neutral on primary metric | Document learnings, archive | + +--- + +## Infrastructure + +### Feature Flag Service + +- All experiment branches controlled via feature flags +- Flags stored in Redis with real-time update propagation +- SDK available for Python (serving layer) and JavaScript (frontend) + +### Experiment Config Example + +```json +{ + "experiment_id": "exp_ranking_v3_diversity", + "name": "MMR Diversity Weight Tuning v3", + "layer": "algorithm", + "status": "running", + "start_date": "2024-03-10", + "end_date": "2024-03-24", + "variants": [ + { + "id": "control", + "name": "Current MMR (Ξ»=0.7)", + "traffic_pct": 50, + "config": { "mmr_lambda": 0.7, "category_max_ratio": 0.4 } + }, + { + "id": "treatment_a", + "name": "Higher Diversity (Ξ»=0.6)", + "traffic_pct": 50, + "config": { "mmr_lambda": 0.6, "category_max_ratio": 0.35 } + } + ], + "primary_metric": "conversion_rate", + "secondary_metrics": ["ctr", "avg_restaurants_viewed", "diversity_score"], + "guardrails": ["page_load_time_p99", "error_rate"], + "target_segments": ["active", "power_user"], + "min_samples_per_variant": 10000, + "owner": "ml-platform-team" +} +``` + +### Results Tracking + +Experiment results are stored in BigQuery and surfaced via: +- **Metabase dashboard**: Real-time metric visualization per variant +- **Automated report**: Generated at experiment end; includes statistical analysis, confidence intervals, and recommendation +- **Experiment registry**: All experiments 
documented with hypothesis, results, and decision rationale diff --git a/docs/modules/07-observability.md b/docs/modules/07-observability.md new file mode 100644 index 0000000..e99cf79 --- /dev/null +++ b/docs/modules/07-observability.md @@ -0,0 +1,289 @@ +# Module 07 β€” Observability + +## Table of Contents + +- [Overview](#overview) +- [System Monitoring](#system-monitoring) +- [ML Monitoring](#ml-monitoring) +- [Business Monitoring](#business-monitoring) +- [Logging](#logging) +- [Alerting Rules](#alerting-rules) +- [Dashboards](#dashboards) + +--- + +## Overview + +Observability covers three pillars: **metrics**, **logs**, and **traces**. The system is monitored at three layers: + +1. **System Monitoring** β€” infrastructure and service-level health (Prometheus + Grafana) +2. **ML Monitoring** β€” model quality, feature drift, and prediction distribution (Evidently AI) +3. **Business Monitoring** β€” revenue attribution, CTR, user satisfaction (Metabase) + +```mermaid +graph TB + subgraph Services["Services & Infrastructure"] + RecSvc["Recommendation Service"] + FeatureStore["Feature Store"] + ModelServing["Model Serving (Triton)"] + Kafka["Kafka / Flink"] + Redis["Redis"] + PG["PostgreSQL"] + end + + subgraph Collection["Collection Layer"] + PrometheusExp["Prometheus Exporters\n(/metrics endpoints)"] + OpenTelemetry["OpenTelemetry SDK\n(traces + logs)"] + EventSink["Event Sink\n(BigQuery / S3)"] + end + + subgraph Backends["Observability Backends"] + Prometheus["Prometheus\n(metrics storage)"] + Loki["Grafana Loki\n(log storage)"] + Jaeger["Jaeger\n(trace storage)"] + BigQuery["BigQuery\n(analytics)"] + Evidently["Evidently AI\n(ML drift)"] + end + + subgraph Visualization["Visualization & Alerting"] + Grafana["Grafana\n(dashboards + alerts)"] + Metabase["Metabase\n(business BI)"] + PagerDuty["PagerDuty\n(on-call alerting)"] + end + + Services --> Collection + Collection --> Backends + Backends --> Visualization + Grafana --> PagerDuty +``` + +--- + +## 
System Monitoring + +### Key Metrics + +#### API / Serving Layer + +| Metric | PromQL Key | Description | Alert Threshold | +|--------|-----------|-------------|----------------| +| Request rate | `rate(http_requests_total[5m])` | Requests per second by endpoint | β€” (informational) | +| Latency p50 | `histogram_quantile(0.50, http_duration_seconds)` | Median response time | >60ms | +| Latency p99 | `histogram_quantile(0.99, http_duration_seconds)` | 99th percentile response time | >200ms | +| Error rate | `rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])` | 5xx error rate | >1% | +| Cache hit ratio | `rate(cache_hits_total[5m]) / rate(cache_requests_total[5m])` | Redis cache hit rate | <70% | + +#### ML Inference + +| Metric | Description | Alert Threshold | +|--------|-------------|----------------| +| Model inference latency p99 | Triton inference time | >50ms | +| Model throughput (requests/s) | Triton requests per second | β€” | +| GPU utilization | Per GPU node utilization | >90% sustained | +| Inference error rate | Failed inference requests | >0.5% | +| Queue depth | Pending inference requests in Triton queue | >100 | + +#### Feature Store + +| Metric | Description | Alert Threshold | +|--------|-------------|----------------| +| Feature fetch latency p99 | Redis GET latency | >10ms | +| Feature cache miss rate | Redis keyspace misses / total | >20% | +| Redis memory utilization | Used memory / max memory | >85% | +| Feature staleness | Age of oldest feature in online store | >8h | + +#### Kafka / Flink + +| Metric | Description | Alert Threshold | +|--------|-------------|----------------| +| Consumer lag | Records behind latest offset | >10,000 events | +| Processing throughput | Events processed per second | β€” | +| Checkpoint duration | Flink checkpoint completion time | >30s | +| Kafka broker disk usage | Disk used per broker | >80% | + +--- + +## ML Monitoring + +Managed via **Evidently AI** with reports generated hourly 
and stored in S3.
+
+### Prediction Distribution Monitoring
+
+| Check | Metric | Alert Condition |
+|-------|--------|----------------|
+| Score distribution shift | Kullback-Leibler divergence | KL > 0.1 |
+| Mean score drift | Moving average of recommendation scores | >5% change in 24h window |
+| Rank position distribution | Distribution of top-1 restaurant positions | Significant shift |
+| Score range | Min/max recommendation score | Outside [0.05, 0.99] |
+
+### Feature Drift Detection (PSI)
+
+Population Stability Index (PSI) is computed for all features daily:
+
+```
+PSI = Σ (actual_pct - expected_pct) × ln(actual_pct / expected_pct)
+```
+
+| PSI Range | Interpretation | Action |
+|-----------|---------------|--------|
+| PSI < 0.1 | No significant change | No action |
+| 0.1 ≤ PSI < 0.2 | Minor change | Monitor closely |
+| PSI ≥ 0.2 | Significant drift | Alert + investigate retraining |
+
+### Online ML Metrics (Proxy)
+
+| Metric | Description | Computation | Target |
+|--------|-------------|-------------|--------|
+| **NDCG@10 (proxy)** | NDCG estimated from click positions | Daily batch compute | ≥0.43 |
+| **Coverage** | % of catalog appearing in top-10 recommendations | Daily batch | ≥28% |
+| **Popularity bias** | Ratio of recommendations for top-10% restaurants | Daily | <60% |
+| **Novelty score** | % recommendations user has never seen before | Daily | ≥40% |
+
+### Drift Dashboard (Grafana)
+
+The ML drift Grafana dashboard shows:
+- PSI heatmap per feature (feature × time)
+- Prediction score distribution over rolling 7 days
+- Online NDCG proxy trend
+- Feature importance change vs. previous week (SHAP values)
+
+---
+
+## Business Monitoring
+
+Managed via **Metabase** with dashboards refreshed every 4 hours from BigQuery.
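As an illustration of the alert logic in this layer, here is a minimal sketch of the "CTR drops >20% from 7d average" business alert. The function name and inputs are hypothetical; the production check runs as a scheduled query against BigQuery aggregates.

```python
def ctr_drop_alert(today_ctr, trailing_7d, max_drop=0.20):
    """Fire when today's CTR falls more than `max_drop` (relative) below
    the trailing 7-day average CTR."""
    baseline = sum(trailing_7d) / len(trailing_7d)
    return today_ctr < baseline * (1.0 - max_drop)

# Baseline ~8% CTR: a day at 6% is a 25% relative drop, so the alert fires.
print(ctr_drop_alert(0.06, [0.08, 0.081, 0.079, 0.08, 0.082, 0.078, 0.08]))  # True
```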
+ +### Business Metrics + +| Metric | Definition | Refresh | +|--------|-----------|---------| +| **CTR by placement** | Clicks / Impressions by page section (homepage, nearby, similar) | 4h | +| **Recommendation revenue attribution** | Orders originating from recommendation clicks | Daily | +| **Influenced conversion rate** | Orders / Users exposed to recommendations | Daily | +| **Avg session satisfaction score** | Avg post-session rating (voluntary feedback) | Daily | +| **Discovery rate** | % of orders at restaurants user never previously ordered from | Daily | +| **Repeat restaurant rate** | % of orders at previously visited restaurants | Daily | + +### Business Alerts + +| Alert | Threshold | Notification | +|-------|-----------|-------------| +| CTR drops >20% from 7d average | Daily check | Slack #product-alerts | +| Conversion rate drops >10% | Hourly check | Slack + PagerDuty (severity 2) | +| Revenue attribution anomaly | Statistical outlier detection | Slack #business-alerts | + +--- + +## Logging + +### Structured Log Format + +Every recommendation API request produces a structured log entry: + +```json +{ + "timestamp": "2024-03-15T14:32:07.456Z", + "level": "INFO", + "service": "recommendation-service", + "version": "2.4.1", + "request_id": "req_01HXZ789ABC", + "user_id": "usr_9f3a8b2c1d4e5f6a", + "session_id": "sess_7e2f1a3b4c5d6e7f", + "endpoint": "/api/v1/recommendations/homepage", + "http_status": 200, + "model_version": "ranking_v2.4.1", + "experiment_variants": { + "ranking_model": "treatment_a", + "diversity_weight": "control" + }, + "pipeline": { + "candidates_generated": 847, + "candidates_after_geo_filter": 312, + "candidates_ranked": 312, + "final_results": 20 + }, + "latency_breakdown_ms": { + "feature_fetch": 3, + "candidate_generation": 11, + "restaurant_feature_fetch": 4, + "scoring": 14, + "reranking": 4, + "serialization": 2, + "total": 38 + }, + "cache": { + "l1_cdn_hit": false, + "l2_redis_hit": false, + "fallback_used": false + } 
+} +``` + +### Log Retention Policy + +| Tier | Storage | Retention | Access | +|------|---------|-----------|--------| +| **Hot** | Elasticsearch / Loki | 90 days | Real-time query via Kibana/Grafana | +| **Warm** | S3 (Parquet, compressed) | 12 months | Query via Athena / BigQuery | +| **Cold** | S3 Glacier | 3 years | Restore on demand (regulatory) | + +### Log Levels + +| Level | Condition | Includes | +|-------|-----------|---------| +| `ERROR` | 5xx response, fallback used, feature store unavailable | Full request context, stack trace | +| `WARN` | Latency SLA miss, cache miss rate spike, partial feature fetch | Request context, metric values | +| `INFO` | Every successful recommendation request | Full structured log (see above) | +| `DEBUG` | Detailed candidate scoring (disabled in production) | Per-candidate scores, feature values | + +--- + +## Alerting Rules + +### Severity Levels + +| Level | Response Time | Channels | +|-------|--------------|----------| +| **P1 (Critical)** | Immediate (24/7) | PagerDuty + Slack + SMS | +| **P2 (High)** | 15 min (business hours), 30 min (off-hours) | PagerDuty + Slack | +| **P3 (Medium)** | Next business day | Slack #alerts | +| **P4 (Low)** | Weekly review | Slack #metrics-review | + +### Alert Rules + +| Alert | Condition | Severity | Runbook | +|-------|-----------|----------|---------| +| High API error rate | error_rate > 1% for 5min | P1 | runbook/api-errors | +| API latency SLA breach | p99 > 300ms for 5min | P1 | runbook/latency | +| Model serving down | inference_error_rate = 100% | P1 | runbook/model-serving | +| Redis memory critical | memory_utilization > 90% | P1 | runbook/redis | +| Kafka consumer lag | consumer_lag > 50,000 for 10min | P2 | runbook/kafka-lag | +| Feature staleness | feature_age_hours > 8 | P2 | runbook/feature-store | +| ML drift detected | PSI > 0.2 on any key feature | P2 | runbook/ml-drift | +| Low cache hit rate | cache_hit_ratio < 60% for 1h | P3 | runbook/caching | +| NDCG proxy 
decline | ndcg_proxy < 0.40 for 24h | P3 | runbook/model-quality | +| Airflow DAG failure | DAG past SLA deadline | P2 | runbook/airflow | + +--- + +## Dashboards + +### Grafana Dashboard Suite + +| Dashboard | Audience | Key Panels | +|-----------|---------|-----------| +| **Recommendation Service Overview** | Engineering | RPS, latency p50/p99/p999, error rate, cache hit rate | +| **ML Inference Performance** | ML Team | Triton throughput, GPU utilization, inference latency, queue depth | +| **Feature Store Health** | ML Team | Redis memory, fetch latency, staleness per feature group | +| **Kafka / Flink Pipeline** | Data Engineering | Consumer lag, throughput, checkpoint success rate | +| **ML Drift Monitor** | ML Team | PSI heatmap, prediction distribution, NDCG proxy trend | +| **Infrastructure Overview** | SRE | Pod CPU/memory, node health, disk usage, network I/O | + +### Metabase Business Dashboards + +| Dashboard | Audience | Refresh | +|-----------|---------|---------| +| **Recommendation Performance** | Product / Business | 4h | +| **A/B Experiment Results** | Product / ML | 4h | +| **User Engagement Funnel** | Product | Daily | +| **Revenue Attribution** | Business / Finance | Daily | diff --git a/docs/modules/08-data-pipeline.md b/docs/modules/08-data-pipeline.md new file mode 100644 index 0000000..6f1e1de --- /dev/null +++ b/docs/modules/08-data-pipeline.md @@ -0,0 +1,272 @@ +# Module 08 β€” Data Pipeline + +## Table of Contents + +- [Overview](#overview) +- [Daily Batch Pipelines](#daily-batch-pipelines) +- [Real-time Streaming Pipelines](#real-time-streaming-pipelines) +- [Pipeline DAGs](#pipeline-dags) +- [SLAs and Monitoring](#slas-and-monitoring) +- [Failure Handling & Retry Strategy](#failure-handling--retry-strategy) + +--- + +## Overview + +The Data Pipeline module orchestrates all data movement and transformation workflows across the system. It consists of two types of pipelines: + +1. 
**Batch Pipelines** β€” scheduled daily/weekly jobs orchestrated by Apache Airflow +2. **Real-time Streaming Pipelines** β€” continuously running jobs using Apache Kafka + Apache Flink + +```mermaid +graph TB + subgraph BatchLayer["Batch Layer (Apache Airflow)"] + UserRebuild["user_profile_rebuild\nDaily 2:00 AM"] + CatalogRefresh["restaurant_catalog_refresh\nDaily 1:00 AM"] + FeatureEng["feature_engineering\nDaily 3:00 AM"] + OnlineWarmup["online_store_warmup\nDaily 6:30 AM"] + ModelRetrain["model_retraining\nWeekly Sunday 3:00 AM"] + EmbeddingRebuild["embedding_index_rebuild\nWeekly Sunday 6:00 AM"] + end + + subgraph StreamingLayer["Streaming Layer (Kafka + Flink)"] + EventIngest["Event Ingestion\n(always-on)"] + SessionAgg["Session Aggregation\n(always-on)"] + FeatureUpdate["Real-time Feature Update\n(always-on)"] + RecTrigger["Recommendation Trigger\n(event-driven)"] + end + + subgraph Storage["Storage"] + S3["S3 / GCS\n(offline store)"] + BQ["BigQuery\n(analytics)"] + Redis["Redis\n(online store)"] + Milvus["Milvus\n(vector index)"] + end + + BatchLayer --> Storage + StreamingLayer --> Redis + Storage --> ModelRetrain + Storage --> EmbeddingRebuild +``` + +--- + +## Daily Batch Pipelines + +### Pipeline Schedule Summary + +| Pipeline | Schedule (UTC) | SLA Deadline | Owner | Dependencies | +|----------|---------------|-------------|-------|-------------| +| `restaurant_catalog_refresh` | Daily 1:00 AM | 4:00 AM | Catalog Team | External APIs, PostgreSQL | +| `user_profile_rebuild` | Daily 2:00 AM | 5:00 AM | ML Platform | Event log (S3), Orders (PG) | +| `feature_engineering` | Daily 3:00 AM | 6:00 AM | ML Platform | user_profile_rebuild | +| `online_store_warmup` | Daily 6:30 AM | 7:30 AM | Platform Team | feature_engineering | +| `model_retraining` | Weekly Sunday 3:00 AM | 8:00 AM | ML Platform | feature_engineering (latest) | +| `embedding_index_rebuild` | Weekly Sunday 6:00 AM | 10:00 AM | ML Platform | model_retraining | +| `analytics_export` | Daily 
7:00 AM | 9:00 AM | Data Engineering | feature_engineering | + +--- + +## Pipeline DAGs + +### `user_profile_rebuild` DAG + +**Schedule**: Daily 2:00 AM UTC +**SLA**: 5:00 AM UTC +**Duration**: ~2.5h + +```mermaid +flowchart LR + A["extract_raw_events\n(S3 Parquet read)"] --> B["validate_event_schema\n(Great Expectations)"] + A2["extract_orders\n(PostgreSQL)"] --> C["compute_interaction_features\n(Spark)"] + B --> C + A3["extract_user_registrations\n(PostgreSQL)"] --> D["compute_demographic_features\n(Spark)"] + C --> E["compute_cuisine_affinities\n(Spark)\nTop-5 cuisines with affinity score"] + C --> F["compute_rfm_features\n(Spark)\nRecency, Frequency, Monetary"] + D & E & F --> G["join_user_features\n(Spark)"] + G --> H["write_offline_store\n(S3 Parquet + BigQuery)"] + H --> I["validate_output\n(Row count, null rate checks)"] + I --> J["notify_downstream\n(Airflow sensor trigger)"] +``` + +**Key Tasks**: +- `extract_raw_events`: Reads last 90 days of events from S3, partitioned by `dt` +- `compute_interaction_features`: Aggregates view/click/order counts, weighted interaction score per restaurant-user pair +- `compute_cuisine_affinities`: Top-5 cuisines from interactions (weighted: orderΓ—1.0, clickΓ—0.6, viewΓ—0.3) +- `compute_rfm_features`: Recency (days since last order), Frequency (orders/month), Monetary (avg spend) +- `validate_output`: Checks null rates <2%, row counts within Β±5% of previous day + +--- + +### `restaurant_catalog_refresh` DAG + +**Schedule**: Daily 1:00 AM UTC +**SLA**: 4:00 AM UTC +**Duration**: ~2h + +```mermaid +flowchart LR + A["pull_partner_updates\n(REST API / webhooks)"] --> D["merge_catalog_changes\n(PostgreSQL UPSERT)"] + B["pull_google_places\n(Google Places API)"] --> C["normalize_external_data\n(Spark)"] + C --> D + D --> E["enrich_nlp_tags\n(BERT NLP service)"] + D --> F["compute_text_embeddings\n(Sentence-BERT)"] + D --> G["compute_image_embeddings\n(ResNet-50)"] + E & F & G --> H["update_restaurant_features\n(S3 Parquet)"] 
+ H --> I["update_sentiment_scores\n(DistilBERT reviews)"] + I --> J["write_catalog_snapshot\n(PostgreSQL + S3)"] + J --> K["validate_catalog\n(count, schema, geo checks)"] +``` + +--- + +### `feature_engineering` DAG + +**Schedule**: Daily 3:00 AM UTC (after `user_profile_rebuild`) +**SLA**: 6:00 AM UTC +**Duration**: ~2.5h + +```mermaid +flowchart LR + A["wait_user_profile_rebuild\n(Airflow ExternalTaskSensor)"] --> C + B["wait_catalog_refresh\n(Airflow ExternalTaskSensor)"] --> C + C["compute_cross_features\n(Spark)\nΒ· userΓ—cuisine affinity\nΒ· userΓ—price match\nΒ· geo distance features"] --> D["compute_context_features\n(Spark)\nΒ· time-of-day features\nΒ· day-of-week features"] + D --> E["compute_restaurant_dynamic\n(Spark)\nΒ· review_recency_score\nΒ· popularity decay\nΒ· trend score"] + C & D & E --> F["join_training_features\n(Spark)"] + F --> G["write_feature_store\n(S3 Parquet + BigQuery)"] + G --> H["validate_features\n(PSI vs previous day, null rates)"] +``` + +--- + +### `online_store_warmup` DAG + +**Schedule**: Daily 6:30 AM UTC (after `feature_engineering`) +**SLA**: 7:30 AM UTC +**Duration**: ~45min + +```mermaid +flowchart LR + A["wait_feature_engineering\n(ExternalTaskSensor)"] --> B["export_user_features\n(Spark β†’ Redis pipeline)"] + A --> C["export_restaurant_features\n(Spark β†’ Redis pipeline)"] + A --> D["export_popularity_scores\n(β†’ Redis Sorted Sets)"] + B & C & D --> E["validate_online_store\n(sample spot-check latency)"] + E --> F["notify_serving_team\n(Slack notification)"] +``` + +--- + +### `model_retraining` DAG (Weekly) + +**Schedule**: Weekly, Sunday 3:00 AM UTC +**SLA**: Sunday 8:00 AM UTC +**Duration**: ~4.5h + +```mermaid +flowchart LR + A["extract_training_data\n(last 28d, S3 + BigQuery)"] --> B["construct_training_samples\n(Spark)\n+/- sampling, label assignment, IPS weights"] + B --> C["train_two_tower_cf\n(PyTorch, 8Γ— GPU)\n~3h"] + B --> D["train_lightgbm_ranker\n(CPU cluster)\n~1h"] + B --> 
E["train_deepfm_ranker\n(TF, 4Γ— GPU)\n~2h"] + C & D & E --> F["evaluate_all_models\n(Holdout set)\nAUC-ROC, NDCG@10, MAP@10"] + F --> G{All metrics\npass thresholds?} + G -- Yes --> H["register_to_mlflow\n(version, artifacts, metrics)"] + G -- No --> I["alert_ml_team\n(PagerDuty P2)"] + H --> J["create_canary_deployment\n(5% traffic, 2h)"] +``` + +--- + +## Real-time Streaming Pipelines + +### Event Ingestion Pipeline + +**Technology**: Kafka (producer) + Flink (consumer) +**Throughput**: ~10,000 events/second (peak) +**Latency SLA**: End-to-end <5s from event to Redis update + +```mermaid +flowchart LR + SDK["Mobile/Web SDK\n(batch flush 500ms)"] --> KafkaRaw["Kafka:\nuser-events-raw"] + KafkaRaw --> FlinkValidate["Flink:\nEvent Validation\nΒ· Schema check\nΒ· Dedup by event_id\nΒ· Late event handling (5min)"] + FlinkValidate --> FlinkEnrich["Flink:\nEvent Enrichment\nΒ· Resolve restaurant metadata\nΒ· Add geohash context\nΒ· Lookup user segment"] + FlinkEnrich --> KafkaEnriched["Kafka:\nuser-events-enriched"] + KafkaEnriched --> AnalyticsSink["BigQuery\n(analytics, ~5min lag)"] + KafkaEnriched --> SessionFlink["Flink:\nSession Aggregation"] +``` + +### Session Aggregation Pipeline + +```mermaid +flowchart LR + KafkaEnriched["Kafka:\nuser-events-enriched"] --> SessionWindow["Flink Session Window\n(30min idle gap)"] + SessionWindow --> UpdateSession["Update Session State\n(Flink keyed state by user_id)"] + UpdateSession --> WriteRedis["Write to Redis\n(HSET sess:{session_id})\n<5s from event"] + UpdateSession --> CheckTrigger{"Trigger\ncondition?"} + CheckTrigger -- "ORDER_PLACED\nor FAVORITE_ADDED" --> UpdateProfile["Update User Prefs\n(Redis HSET)"] + CheckTrigger -- "N=5 clicks" --> InvalidateCache["Invalidate Rec Cache\n(Redis DEL rec:{user_id}:*)"] + CheckTrigger -- "Session end\n(idle 30min)" --> WriteSessionLog["Write Session Summary\n(S3 β†’ BigQuery)"] +``` + +--- + +## SLAs and Monitoring + +### Batch Pipeline SLA Dashboard + +| Pipeline | Scheduled | 
SLA | P50 Duration | P99 Duration | SLA Miss Alert | +|----------|-----------|-----|-------------|-------------|----------------| +| restaurant_catalog_refresh | 01:00 | 04:00 | 1h 45m | 2h 30m | PagerDuty P2 | +| user_profile_rebuild | 02:00 | 05:00 | 2h 20m | 2h 50m | PagerDuty P2 | +| feature_engineering | 03:00 | 06:00 | 2h 10m | 2h 45m | PagerDuty P2 | +| online_store_warmup | 06:30 | 07:30 | 40m | 55m | PagerDuty P2 | +| model_retraining (weekly) | 03:00 | 08:00 | 4h 20m | 5h 15m | PagerDuty P2 | + +### Streaming Pipeline SLAs + +| Pipeline | Latency SLA | Throughput SLA | Consumer Lag Alert | +|----------|-------------|---------------|-------------------| +| Event ingestion | <5s end-to-end | >8,000 ev/s | >10,000 records | +| Session aggregation | <3s to Redis update | >8,000 ev/s | >10,000 records | +| Analytics sink | <5min to BigQuery | β€” | >100,000 records | + +--- + +## Failure Handling & Retry Strategy + +### Batch Pipeline Retry Policy + +| Scenario | Retry Behavior | Escalation | +|----------|---------------|-----------| +| Task failure (transient) | Retry 3Γ— with exponential backoff (5min, 15min, 45min) | Alert after 2nd retry | +| External API unavailable (catalog) | Retry 5Γ— with 10min intervals; use cached data if all retries fail | PagerDuty P2 | +| SLA miss | Alert sent; pipeline continues (no auto-retry of full DAG) | PagerDuty P2 | +| Upstream DAG late | Downstream waits up to 30min past SLA before alerting | Slack notification | +| Data quality check failure | Pipeline halts; does not update downstream stores | PagerDuty P1 | + +### Streaming Pipeline Failure Handling + +| Failure Type | Behavior | Recovery | +|-------------|----------|---------| +| Flink job crash | Restart from last successful checkpoint (checkpoint interval: 30s) | Auto-restart by Kubernetes | +| Kafka broker unavailable | Consumer pauses; resumes when broker recovers; processes backlog | Auto-recovery | +| Redis write failure | Retry 3Γ—; log failure if all retries 
fail; feature staleness alert | Manual intervention |
+| Late events (>5min) | Route to `user-events-late` topic; processed in hourly correction job | Automated correction job |
+| Duplicate events | Deduplicated within 60s window using `event_id`; late deduplication in correction job | Automated |
+
+### Airflow DAG Configuration
+
+```python
+from datetime import timedelta
+
+# Assumed: pagerduty_alert / pagerduty_sla_alert are team-defined
+# callback functions (not shown here).
+default_args = {
+    "owner": "ml-platform-team",
+    "retries": 3,
+    "retry_delay": timedelta(minutes=5),
+    "retry_exponential_backoff": True,
+    "max_retry_delay": timedelta(minutes=60),
+    "email_on_failure": True,
+    "email_on_retry": False,
+    "sla": timedelta(hours=3),
+    "on_failure_callback": pagerduty_alert,
+    "on_sla_miss_callback": pagerduty_sla_alert,
+}
+```
diff --git a/docs/schemas/restaurant-catalog.md b/docs/schemas/restaurant-catalog.md
new file mode 100644
index 0000000..66ba1ab
--- /dev/null
+++ b/docs/schemas/restaurant-catalog.md
@@ -0,0 +1,455 @@
+# Restaurant Catalog Schema
+
+## Table of Contents
+
+- [Overview](#overview)
+- [Schema Definition](#schema-definition)
+- [Field Reference](#field-reference)
+- [JSON Example](#json-example)
+- [Cuisine Taxonomy](#cuisine-taxonomy)
+- [Price Tier Reference](#price-tier-reference)
+
+---
+
+## Overview
+
+The restaurant catalog is the core entity in the system. Each restaurant record combines structured operational data with ML-generated enrichments (embeddings, NLP tags, sentiment scores).
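The schema marks `operations.is_open_now` as a field computed at query time rather than stored. A minimal sketch of that computation — assuming the caller has already selected the `DayHours` entry for the current weekday and localized `now` to the restaurant's `timezone` (the production logic is not specified here):

```python
from datetime import datetime, time

def is_open_now(day_hours: dict, now: datetime) -> bool:
    """Compute `operations.is_open_now` at query time.

    `day_hours` is the DayHours object for `now`'s weekday, e.g.
    {"open": "07:00", "close": "22:00"} or {"closed": True}; `now` is
    assumed to already be localized to the restaurant's timezone.
    """
    if day_hours.get("closed"):
        return False
    open_t = time.fromisoformat(day_hours["open"])
    close_t = time.fromisoformat(day_hours["close"])
    t = now.time()
    if close_t <= open_t:
        # Overnight span, e.g. 18:00-02:00: open wraps past midnight.
        return t >= open_t or t < close_t
    return open_t <= t < close_t
```

Because the schema constrains hours to an `HH:MM` pattern, `time.fromisoformat` suffices for parsing; overnight spans (close before open) are treated as wrapping into the next day.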
+ +- **Storage**: PostgreSQL (primary), S3 Parquet (offline feature store), Redis (online feature store) +- **Update mechanism**: Daily batch refresh + real-time webhooks from partners +- **Unique identifier**: `restaurant_id` (stable UUID, never reused) + +--- + +## Schema Definition + +```yaml +Restaurant: + type: object + required: + - restaurant_id + - name + - location + - cuisine + - operations + - pricing + - status + properties: + + # --- Basic Info --- + restaurant_id: + type: string + description: Globally unique restaurant identifier + example: "rst_3b7c9e1a2f4d5e6b" + + name: + type: string + maxLength: 200 + example: "Pho Saigon House" + + description: + type: string + maxLength: 2000 + description: Free-text description; used for NLP tag generation and text embedding + + phone: + type: string + example: "+84-28-3812-3456" + + website: + type: string + format: uri + nullable: true + example: "https://phosaigonhouse.vn" + + # --- Location --- + location: + type: object + required: [address, city, country, lat, lng, geohash] + properties: + address: + type: string + example: "123 Nguyen Hue Boulevard" + district: + type: string + example: "District 1" + city: + type: string + example: "Ho Chi Minh City" + country: + type: string + format: iso-3166-1-alpha-2 + example: "VN" + lat: + type: number + minimum: -90 + maximum: 90 + example: 10.7769 + lng: + type: number + minimum: -180 + maximum: 180 + example: 106.7009 + geohash: + type: string + description: Geohash precision 7 (~153m Γ— 153m) + example: "w3gv2c5" + + # --- Cuisine --- + cuisine: + type: object + required: [primary_cuisine, cuisine_types] + properties: + cuisine_types: + type: array + items: + type: string + description: All applicable cuisine types (from taxonomy) + example: ["vietnamese", "asian", "noodles", "soup"] + primary_cuisine: + type: string + description: Main cuisine type (single value) + example: "vietnamese" + + # --- Operations --- + operations: + type: object + properties: + 
operating_hours: + type: object + description: Opening hours per weekday (24h format, local timezone) + properties: + monday: { $ref: "#/DayHours" } + tuesday: { $ref: "#/DayHours" } + wednesday: { $ref: "#/DayHours" } + thursday: { $ref: "#/DayHours" } + friday: { $ref: "#/DayHours" } + saturday: { $ref: "#/DayHours" } + sunday: { $ref: "#/DayHours" } + timezone: + type: string + example: "Asia/Ho_Chi_Minh" + is_open_now: + type: boolean + description: Computed field at query time; not stored persistently + reservation_required: + type: boolean + default: false + accepts_walk_ins: + type: boolean + default: true + + DayHours: + type: object + properties: + open: + type: string + pattern: "^([01][0-9]|2[0-3]):[0-5][0-9]$" + example: "07:00" + close: + type: string + pattern: "^([01][0-9]|2[0-3]):[0-5][0-9]$" + example: "22:00" + closed: + type: boolean + default: false + description: Set to true for weekly closed days + + # --- Pricing --- + pricing: + type: object + properties: + price_tier: + type: string + enum: ["$", "$$", "$$$", "$$$$"] + description: "$ <$15 | $$ $15-$30 | $$$ $30-$60 | $$$$ >$60 per person" + avg_meal_price_usd: + type: number + description: Average meal price per person in USD + example: 12.50 + + # --- Ratings --- + ratings: + type: object + properties: + avg_rating: + type: number + minimum: 0 + maximum: 5 + example: 4.3 + total_reviews: + type: integer + example: 1842 + rating_breakdown: + type: object + properties: + food: + type: number + minimum: 0 + maximum: 5 + service: + type: number + minimum: 0 + maximum: 5 + ambiance: + type: number + minimum: 0 + maximum: 5 + value: + type: number + minimum: 0 + maximum: 5 + + # --- Dietary --- + dietary: + type: object + properties: + vegetarian_options: + type: boolean + vegan_options: + type: boolean + halal: + type: boolean + gluten_free: + type: boolean + allergen_info: + type: array + items: + type: string + description: Known allergens present in menu items + example: ["peanuts", 
"shellfish", "gluten"] + + # --- Ambiance --- + ambiance: + type: object + properties: + tags: + type: array + items: + type: string + enum: + - romantic + - family-friendly + - business + - casual + - fine-dining + - quick-bite + - outdoor + - live-music + - rooftop + - pet-friendly + description: Ambiance descriptors (ML-tagged from reviews + manual) + + # --- Capacity --- + capacity: + type: object + properties: + total_seats: + type: integer + example: 80 + private_rooms: + type: integer + default: 0 + outdoor_seating: + type: boolean + default: false + parking: + type: boolean + default: false + + # --- Media --- + media: + type: object + properties: + photos: + type: array + items: + type: string + format: uri + description: CDN URLs for restaurant photos + menu_url: + type: string + format: uri + nullable: true + + # --- ML Embeddings (internal) --- + embeddings: + type: object + description: ML-generated embeddings; not returned in API responses + properties: + text_embedding: + type: array + items: + type: number + minItems: 768 + maxItems: 768 + description: Sentence-BERT embedding (768d) from name + description + cuisine tags + image_embedding: + type: array + items: + type: number + minItems: 512 + maxItems: 512 + description: ResNet-50 embedding (512d) from primary photo + + # --- Metadata --- + metadata: + type: object + properties: + popularity_score: + type: number + minimum: 0 + maximum: 1 + description: Time-decayed popularity score (updated hourly) + status: + type: string + enum: [active, temporarily_closed, permanently_closed, pending_review] + data_sources: + type: array + items: + type: string + example: ["partner_api", "google_places"] + created_at: + type: string + format: date-time + updated_at: + type: string + format: date-time +``` + +--- + +## Field Reference + +| Field | Type | Required | Description | +|-------|------|----------|-------------| +| `restaurant_id` | string | βœ… | Unique stable identifier | +| `name` | string | βœ… | 
Restaurant display name | +| `location.lat` | float | βœ… | Latitude (WGS84) | +| `location.lng` | float | βœ… | Longitude (WGS84) | +| `location.geohash` | string | βœ… | Geohash precision 7 for ANN proximity | +| `cuisine.primary_cuisine` | string | βœ… | Single primary cuisine from taxonomy | +| `cuisine.cuisine_types` | string[] | βœ… | All applicable cuisines | +| `operations.is_open_now` | bool | Computed | Computed at query time, not stored | +| `pricing.price_tier` | enum | βœ… | $, $$, $$$, $$$$ | +| `ratings.avg_rating` | float | β€” | Updated daily from review ingestion | +| `dietary.*` | bool | β€” | Explicit dietary flags | +| `embeddings.text_embedding` | float[768] | Internal | Not exposed in API responses | +| `metadata.popularity_score` | float | β€” | Updated hourly by batch job | + +--- + +## JSON Example + +```json +{ + "restaurant_id": "rst_3b7c9e1a2f4d5e6b", + "name": "Pho Saigon House", + "description": "Authentic Vietnamese pho and street food in a casual setting. 
Known for our rich, slow-cooked bone broth and fresh herbs.", + "phone": "+84-28-3812-3456", + "website": "https://phosaigonhouse.vn", + "location": { + "address": "123 Nguyen Hue Boulevard", + "district": "District 1", + "city": "Ho Chi Minh City", + "country": "VN", + "lat": 10.7769, + "lng": 106.7009, + "geohash": "w3gv2c5", + "timezone": "Asia/Ho_Chi_Minh" + }, + "cuisine": { + "cuisine_types": ["vietnamese", "asian", "noodles", "soup"], + "primary_cuisine": "vietnamese" + }, + "operations": { + "operating_hours": { + "monday": { "open": "07:00", "close": "22:00" }, + "tuesday": { "open": "07:00", "close": "22:00" }, + "wednesday": { "open": "07:00", "close": "22:00" }, + "thursday": { "open": "07:00", "close": "22:00" }, + "friday": { "open": "07:00", "close": "23:00" }, + "saturday": { "open": "08:00", "close": "23:00" }, + "sunday": { "open": "08:00", "close": "21:00" } + }, + "timezone": "Asia/Ho_Chi_Minh", + "is_open_now": true, + "reservation_required": false, + "accepts_walk_ins": true + }, + "pricing": { + "price_tier": "$$", + "avg_meal_price_usd": 12.50 + }, + "ratings": { + "avg_rating": 4.3, + "total_reviews": 1842, + "rating_breakdown": { + "food": 4.5, + "service": 4.1, + "ambiance": 4.0, + "value": 4.6 + } + }, + "dietary": { + "vegetarian_options": true, + "vegan_options": false, + "halal": true, + "gluten_free": false, + "allergen_info": ["peanuts", "shellfish"] + }, + "ambiance": { + "tags": ["casual", "family-friendly", "quick-bite"] + }, + "capacity": { + "total_seats": 80, + "private_rooms": 0, + "outdoor_seating": true, + "parking": false + }, + "media": { + "photos": [ + "https://cdn.example.com/restaurants/rst_3b7c9e/photo_1.jpg", + "https://cdn.example.com/restaurants/rst_3b7c9e/photo_2.jpg" + ], + "menu_url": "https://cdn.example.com/restaurants/rst_3b7c9e/menu.pdf" + }, + "metadata": { + "popularity_score": 0.82, + "status": "active", + "data_sources": ["partner_api", "google_places"], + "created_at": "2022-06-01T08:00:00Z", + 
"updated_at": "2024-03-15T02:30:00Z" + } +} +``` + +--- + +## Cuisine Taxonomy + +The system uses a hierarchical cuisine taxonomy with 120+ types. Top-level categories: + +| Category | Examples | +|----------|---------| +| Asian | Vietnamese, Japanese, Chinese, Korean, Thai, Indian, Indonesian | +| European | Italian, French, Spanish, Greek, German, Mediterranean | +| American | American, Mexican, Tex-Mex, BBQ, Cajun | +| Middle Eastern | Lebanese, Turkish, Persian, Israeli | +| African | Ethiopian, Moroccan, Nigerian | +| Fusion | Asian Fusion, Modern Australian, Pan-Asian | +| Dietary-specific | Vegan, Vegetarian, Raw Food, Gluten-Free | +| Meal type | Breakfast, Brunch, Desserts, Coffee & Tea, Seafood, Steakhouse | + +--- + +## Price Tier Reference + +| Tier | Symbol | Avg Price per Person | Target Segment | +|------|--------|---------------------|----------------| +| 1 | `$` | Under $15 | Budget, casual, fast food | +| 2 | `$$` | $15 – $30 | Mid-range, family dining | +| 3 | `$$$` | $30 – $60 | Upscale casual, business lunch | +| 4 | `$$$$` | Over $60 | Fine dining, special occasions | + +Price tier is updated quarterly based on aggregated order data per restaurant. diff --git a/docs/schemas/review-rating.md b/docs/schemas/review-rating.md new file mode 100644 index 0000000..dcea321 --- /dev/null +++ b/docs/schemas/review-rating.md @@ -0,0 +1,366 @@ +# Review & Rating Schema + +## Table of Contents + +- [Overview](#overview) +- [Schema Definition](#schema-definition) +- [Field Reference](#field-reference) +- [Sentiment Analysis](#sentiment-analysis) +- [JSON Examples](#json-examples) + +--- + +## Overview + +Reviews and ratings are a critical signal for the recommendation system. They provide: +1. **Explicit quality signals** (star ratings, dimension ratings) +2. **Textual sentiment** (DistilBERT-based aspect-level sentiment analysis) +3. 
**Social signals** (helpful votes, recency) + +- **Ingestion**: Via app API + partner webhooks β†’ Kafka (`reviews-raw`) β†’ PostgreSQL +- **Sentiment enrichment**: Asynchronous DistilBERT inference (<3s), written back to review record +- **Update to catalog**: Daily batch job updates `restaurant.ratings.avg_rating` and `rating_breakdown` +- **Retention**: Reviews retained indefinitely in PostgreSQL; sentiment scores recomputed on model update + +--- + +## Schema Definition + +```yaml +Review: + type: object + required: + - review_id + - user_id + - restaurant_id + - overall_rating + - created_at + properties: + + review_id: + type: string + description: Globally unique review identifier + example: "rev_01HXZ789GHI012JKL" + + user_id: + type: string + description: Anonymized user identifier + example: "usr_9f3a8b2c1d4e5f6a" + + restaurant_id: + type: string + description: Restaurant being reviewed + example: "rst_3b7c9e1a2f4d5e6b" + + overall_rating: + type: integer + minimum: 1 + maximum: 5 + description: Overall star rating (1–5) + example: 4 + + dimension_ratings: + type: object + description: Aspect-level ratings (optional; shown on detailed review UI) + properties: + food: + type: integer + minimum: 1 + maximum: 5 + nullable: true + service: + type: integer + minimum: 1 + maximum: 5 + nullable: true + ambiance: + type: integer + minimum: 1 + maximum: 5 + nullable: true + value: + type: integer + minimum: 1 + maximum: 5 + nullable: true + + review_text: + type: string + maxLength: 5000 + nullable: true + description: Free-text review content (optional) + example: "Amazing pho! The broth is rich and the service was friendly." + + # --- ML-Enriched Fields --- + sentiment_score: + type: number + minimum: 0.0 + maximum: 1.0 + nullable: true + description: > + Overall sentiment score computed by DistilBERT model. + 0.0 = very negative, 1.0 = very positive. + Null until async enrichment completes (<5s). 
+ example: 0.84 + + aspect_sentiments: + type: object + nullable: true + description: Aspect-level sentiment scores computed from review_text + properties: + food: + type: number + minimum: 0.0 + maximum: 1.0 + nullable: true + service: + type: number + minimum: 0.0 + maximum: 1.0 + nullable: true + ambiance: + type: number + minimum: 0.0 + maximum: 1.0 + nullable: true + value: + type: number + minimum: 0.0 + maximum: 1.0 + nullable: true + + # --- Social Signals --- + helpful_votes: + type: integer + minimum: 0 + default: 0 + description: Number of users who marked this review as helpful + example: 23 + + reported_count: + type: integer + minimum: 0 + default: 0 + description: Number of times this review was reported as inappropriate + + # --- Media --- + photos: + type: array + items: + type: string + format: uri + description: CDN URLs for photos attached to the review + maxItems: 10 + + # --- Context --- + visit_date: + type: string + format: date + nullable: true + description: Date of the dining visit (user-reported) + example: "2024-03-10" + + visit_type: + type: string + enum: [dine_in, takeaway, delivery] + nullable: true + description: How the user experienced the restaurant + + party_size: + type: integer + minimum: 1 + nullable: true + description: Number of people in the dining party + + # --- Metadata --- + language: + type: string + description: Detected language of review_text (ISO 639-1) + example: "en" + + is_verified_order: + type: boolean + description: Whether the reviewer placed a verified order via the platform + default: false + + moderation_status: + type: string + enum: [pending, approved, rejected, flagged] + default: pending + description: Content moderation status + + created_at: + type: string + format: date-time + description: Timestamp when review was submitted + example: "2024-03-11T09:22:00Z" + + updated_at: + type: string + format: date-time + description: Last update timestamp (e.g., after sentiment enrichment) +``` + +--- + +## 
Field Reference + +| Field | Type | Required | Notes | +|-------|------|----------|-------| +| `review_id` | string | βœ… | Server-generated UUID | +| `user_id` | string | βœ… | Anonymized user identifier | +| `restaurant_id` | string | βœ… | Target restaurant | +| `overall_rating` | int (1–5) | βœ… | Star rating | +| `dimension_ratings.food` | int (1–5) | β€” | Optional aspect rating | +| `dimension_ratings.service` | int (1–5) | β€” | Optional aspect rating | +| `dimension_ratings.ambiance` | int (1–5) | β€” | Optional aspect rating | +| `dimension_ratings.value` | int (1–5) | β€” | Optional aspect rating | +| `review_text` | string | β€” | Max 5000 chars | +| `sentiment_score` | float [0,1] | Computed | Async enrichment, <5s | +| `aspect_sentiments.*` | float [0,1] | Computed | Async enrichment, only if review_text present | +| `helpful_votes` | int | β€” | Incremented by user votes | +| `photos` | string[] | β€” | Max 10 photos | +| `visit_date` | date | β€” | User-reported visit date | +| `is_verified_order` | bool | β€” | True if user ordered via platform | +| `moderation_status` | enum | β€” | Default: pending; auto-moderation applied | + +--- + +## Sentiment Analysis + +Reviews are enriched with sentiment scores using a fine-tuned DistilBERT model. 
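The ranges in this section's Sentiment Score Interpretation table map onto a small bucketing helper. A sketch — the half-open boundary treatment (upper bound exclusive, with 1.0 folded into the top label) is an assumption, since the table's ranges share their endpoints:

```python
def sentiment_label(score):
    """Map a DistilBERT sentiment score in [0, 1] to its label.

    Returns None while the score is null (async enrichment pending).
    Boundary handling is an assumption: each range is [low, high).
    """
    if score is None:
        return None
    buckets = [
        (0.3, "negative"),
        (0.5, "mixed-negative"),
        (0.7, "neutral"),
        (0.9, "positive"),
    ]
    for upper, label in buckets:
        if score < upper:
            return label
    return "very positive"
```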
+ +### Model Details + +| Property | Value | +|----------|-------| +| **Base model** | DistilBERT (distilbert-base-uncased) | +| **Fine-tuning dataset** | ~500K restaurant reviews (Yelp + internal) | +| **Output** | Overall score + 4 aspect scores (food, service, ambiance, value) | +| **Inference latency** | <2s per review | +| **Model update frequency** | Quarterly | + +### Sentiment Score Interpretation + +| Range | Label | Description | +|-------|-------|-------------| +| 0.0 – 0.3 | Negative | Strongly negative sentiment | +| 0.3 – 0.5 | Mixed-negative | More negative than positive | +| 0.5 – 0.7 | Neutral | Balanced or unclear | +| 0.7 – 0.9 | Positive | Generally positive | +| 0.9 – 1.0 | Very positive | Highly enthusiastic | + +### Usage in Recommendations + +- `sentiment_score` contributes to `restaurant_dynamic.sentiment_score` (daily aggregate) +- Aspect sentiments (`food_sentiment`, `service_sentiment`) are features in the ranking model +- Reviews with `sentiment_score < 0.3` and `helpful_votes > 5` are flagged for quality review + +--- + +## JSON Examples + +### Complete Review (with Sentiment Enrichment) + +```json +{ + "review_id": "rev_01HXZ789GHI012JKL", + "user_id": "usr_9f3a8b2c1d4e5f6a", + "restaurant_id": "rst_3b7c9e1a2f4d5e6b", + "overall_rating": 4, + "dimension_ratings": { + "food": 5, + "service": 4, + "ambiance": 4, + "value": 5 + }, + "review_text": "Amazing pho! The broth is rich and the service was friendly. 
A bit noisy during lunch rush but great value for money.", + "sentiment_score": 0.84, + "aspect_sentiments": { + "food": 0.95, + "service": 0.78, + "ambiance": 0.62, + "value": 0.91 + }, + "helpful_votes": 23, + "reported_count": 0, + "photos": [ + "https://cdn.example.com/reviews/rev_01HXZ789/photo_1.jpg" + ], + "visit_date": "2024-03-10", + "visit_type": "dine_in", + "party_size": 2, + "language": "en", + "is_verified_order": true, + "moderation_status": "approved", + "created_at": "2024-03-11T09:22:00Z", + "updated_at": "2024-03-11T09:22:04Z" +} +``` + +### Rating-Only Review (No Text) + +```json +{ + "review_id": "rev_02HXZ890HIJ123KLM", + "user_id": "usr_2c3d4e5f6a7b8c9d", + "restaurant_id": "rst_3b7c9e1a2f4d5e6b", + "overall_rating": 5, + "dimension_ratings": { + "food": 5, + "service": 5, + "ambiance": null, + "value": 4 + }, + "review_text": null, + "sentiment_score": null, + "aspect_sentiments": null, + "helpful_votes": 0, + "reported_count": 0, + "photos": [], + "visit_date": "2024-03-12", + "visit_type": "delivery", + "party_size": null, + "language": null, + "is_verified_order": true, + "moderation_status": "approved", + "created_at": "2024-03-12T20:15:00Z", + "updated_at": "2024-03-12T20:15:00Z" +} +``` + +### Negative Review + +```json +{ + "review_id": "rev_03HXZ901IJK234LMN", + "user_id": "usr_3d4e5f6a7b8c9d0e", + "restaurant_id": "rst_5d9f1b3c4e6g7h8i", + "overall_rating": 2, + "dimension_ratings": { + "food": 2, + "service": 1, + "ambiance": 3, + "value": 2 + }, + "review_text": "Very disappointing experience. The food was cold when it arrived and the waiter was rude. 
The place looks nice but the quality doesn't match the price.", + "sentiment_score": 0.18, + "aspect_sentiments": { + "food": 0.12, + "service": 0.08, + "ambiance": 0.65, + "value": 0.22 + }, + "helpful_votes": 8, + "reported_count": 0, + "photos": [], + "visit_date": "2024-03-13", + "visit_type": "dine_in", + "party_size": 3, + "language": "en", + "is_verified_order": false, + "moderation_status": "approved", + "created_at": "2024-03-14T18:45:00Z", + "updated_at": "2024-03-14T18:45:02Z" +} +``` diff --git a/docs/schemas/user-event.md b/docs/schemas/user-event.md new file mode 100644 index 0000000..da5c981 --- /dev/null +++ b/docs/schemas/user-event.md @@ -0,0 +1,382 @@ +# User Event Schema + +## Table of Contents + +- [Overview](#overview) +- [Event Types](#event-types) +- [Schema Definition](#schema-definition) +- [Field Reference](#field-reference) +- [JSON Examples](#json-examples) + +--- + +## Overview + +User events are the primary behavioral signal for the recommendation system. Every user interaction with the platform is captured as a structured event and published to Kafka for downstream processing. 
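To make the event shape concrete, here is a minimal client-side sketch of assembling one such payload before it is published; `build_event` and its field values are illustrative only, not part of any SDK described here:

```python
import json
import uuid
from datetime import datetime, timezone

# The five fields the schema in this document marks as required.
REQUIRED_FIELDS = {"event_id", "event_type", "timestamp", "user_id", "session_id"}

def build_event(event_type, user_id, session_id, restaurant_id=None, metadata=None):
    """Assemble a minimal user event payload (hypothetical helper)."""
    return {
        "event_id": f"evt_{uuid.uuid4().hex[:16]}",  # client-generated, used for dedup
        "event_type": event_type,
        "timestamp": datetime.now(timezone.utc)
                             .isoformat(timespec="milliseconds")
                             .replace("+00:00", "Z"),
        "user_id": user_id,
        "session_id": session_id,
        "restaurant_id": restaurant_id,  # null for SEARCH / FILTER_APPLIED
        "metadata": metadata or {},
    }

event = build_event(
    "RESTAURANT_CLICK",
    user_id="usr_9f3a8b2c1d4e5f6a",
    session_id="sess_7e2f1a3b4c5d6e7f",
    restaurant_id="rst_3b7c9e1a2f4d5e6b",
    metadata={"rank_in_list": 3},
)
assert REQUIRED_FIELDS <= event.keys()
# Serialized value for the user-events-raw topic; the actual produce call
# (keyed by user_id for partitioning) is omitted here.
payload = json.dumps(event).encode("utf-8")
```

The hex suffix stands in for whatever unique ID scheme the client SDK actually uses; only the presence of the five required fields matters for the sketch.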
+ +- **Transport**: Kafka topic `user-events-raw` (partitioned by `user_id`) +- **Format**: JSON (UTF-8) +- **Deduplication**: `event_id` (UUID v4), 60-second dedup window in Flink +- **Late event handling**: Events arriving >5 minutes late routed to `user-events-late` +- **SDK**: Mobile (iOS/Android) and Web (JavaScript) SDKs batch events every 500ms + +--- + +## Event Types + +| Event Type | Trigger | Key Behavioral Signal | +|-----------|---------|----------------------| +| `RESTAURANT_VIEW` | Restaurant detail page opened | Intent to learn more about a restaurant | +| `RESTAURANT_CLICK` | Restaurant card clicked in listing/map | Direct interest signal | +| `MENU_VIEW` | Menu page viewed | Deeper engagement, pre-order intent | +| `ORDER_PLACED` | Order successfully submitted | Strongest positive signal | +| `REVIEW_SUBMITTED` | User submits a review | Post-visit satisfaction signal | +| `SEARCH` | User submits a search query | Intent + query context | +| `FILTER_APPLIED` | User applies a filter (cuisine, price, etc.) 
| Explicit preference signal | +| `FAVORITE_ADDED` | User adds restaurant to favorites | Explicit interest signal | + +--- + +## Schema Definition + +```yaml +UserEvent: + type: object + required: + - event_id + - event_type + - timestamp + - user_id + - session_id + properties: + event_id: + type: string + format: uuid-v4 + description: Globally unique event identifier (UUID v4, generated by client SDK) + example: "evt_01HXYZ123ABC456DEF" + + event_type: + type: string + enum: + - RESTAURANT_VIEW + - RESTAURANT_CLICK + - MENU_VIEW + - ORDER_PLACED + - REVIEW_SUBMITTED + - SEARCH + - FILTER_APPLIED + - FAVORITE_ADDED + description: Type of user interaction event + + timestamp: + type: string + format: date-time + description: ISO 8601 UTC timestamp of the event + example: "2024-03-15T14:32:07.123Z" + + user_id: + type: string + description: Hashed user identifier (SHA-256, anonymized) + example: "usr_9f3a8b2c1d4e5f6a" + + session_id: + type: string + description: Session identifier (server-assigned UUID) + example: "sess_7e2f1a3b4c5d6e7f" + + restaurant_id: + type: string + nullable: true + description: Target restaurant ID (null for SEARCH events) + example: "rst_3b7c9e1a2f4d5e6b" + + context: + type: object + properties: + device: + type: string + enum: [ios, android, web_mobile, web_desktop] + platform: + type: string + enum: [mobile_app, web_app, partner_api] + app_version: + type: string + example: "4.2.1" + os_version: + type: string + example: "17.2" + location: + type: object + properties: + lat: + type: number + minimum: -90 + maximum: 90 + lng: + type: number + minimum: -180 + maximum: 180 + accuracy_meters: + type: integer + geohash: + type: string + description: Geohash precision 6 + referrer: + type: string + description: Source page or feature that led to this event + example: "homepage_recommendations" + page_position: + type: integer + description: Position in the list/page (1-indexed) + + metadata: + type: object + description: Event-type-specific 
metadata + properties: + dwell_time_ms: + type: integer + description: Time spent on the page/card in milliseconds + scroll_depth_pct: + type: integer + minimum: 0 + maximum: 100 + description: Percentage of page scrolled + rank_in_list: + type: integer + description: Rank of this restaurant in the recommendation list + rec_model_version: + type: string + description: Model version that generated the recommendation + experiment_variants: + type: object + description: A/B experiment assignment for this event + additionalProperties: + type: string + rating_value: + type: integer + minimum: 1 + maximum: 5 + description: Rating given (for REVIEW_SUBMITTED events) + order_value_usd: + type: number + description: Order total in USD (for ORDER_PLACED events) + items_count: + type: integer + description: Number of items in order (for ORDER_PLACED events) + search_query: + type: string + description: Search query text (for SEARCH events) + search_results_count: + type: integer + description: Number of results returned (for SEARCH events) + filter_type: + type: string + description: Filter category (for FILTER_APPLIED events) + example: "cuisine" + filter_value: + type: string + description: Filter value applied + example: "japanese" +``` + +--- + +## Field Reference + +| Field | Required | Type | Notes | +|-------|----------|------|-------| +| `event_id` | βœ… | UUID string | Client-generated; used for deduplication | +| `event_type` | βœ… | enum | See Event Types table | +| `timestamp` | βœ… | ISO 8601 UTC | Client-side timestamp; server validates within Β±5min | +| `user_id` | βœ… | string | Anonymized user identifier | +| `session_id` | βœ… | string | Server-assigned session identifier | +| `restaurant_id` | For most types | string | Null for SEARCH, FILTER_APPLIED | +| `context.device` | βœ… | enum | Client platform | +| `context.location.lat/lng` | Recommended | float | Used for geo features; user may deny permission | +| `metadata.dwell_time_ms` | For VIEW events | 
integer | Client measures, flush on page leave | +| `metadata.rank_in_list` | For CLICK events | integer | Position of clicked restaurant | +| `metadata.experiment_variants` | βœ… | object | Always included for A/B analysis | + +--- + +## JSON Examples + +### `RESTAURANT_CLICK` + +```json +{ + "event_id": "evt_01HXYZ123ABC456DEF", + "event_type": "RESTAURANT_CLICK", + "timestamp": "2024-03-15T14:32:07.123Z", + "user_id": "usr_9f3a8b2c1d4e5f6a", + "session_id": "sess_7e2f1a3b4c5d6e7f", + "restaurant_id": "rst_3b7c9e1a2f4d5e6b", + "context": { + "device": "ios", + "platform": "mobile_app", + "app_version": "4.2.1", + "os_version": "17.2", + "location": { + "lat": 10.7769, + "lng": 106.7009, + "accuracy_meters": 15, + "geohash": "w3gv2c" + }, + "referrer": "homepage_recommendations", + "page_position": 3 + }, + "metadata": { + "rank_in_list": 3, + "rec_model_version": "ranking_v2.4.1", + "experiment_variants": { + "homepage_layout": "control", + "ranking_model": "treatment_a" + } + } +} +``` + +### `RESTAURANT_VIEW` + +```json +{ + "event_id": "evt_02HXYZ456DEF789GHI", + "event_type": "RESTAURANT_VIEW", + "timestamp": "2024-03-15T14:32:12.456Z", + "user_id": "usr_9f3a8b2c1d4e5f6a", + "session_id": "sess_7e2f1a3b4c5d6e7f", + "restaurant_id": "rst_3b7c9e1a2f4d5e6b", + "context": { + "device": "ios", + "platform": "mobile_app", + "app_version": "4.2.1", + "location": { + "lat": 10.7769, + "lng": 106.7009, + "accuracy_meters": 15, + "geohash": "w3gv2c" + }, + "referrer": "restaurant_click" + }, + "metadata": { + "dwell_time_ms": 4200, + "scroll_depth_pct": 75, + "rec_model_version": "ranking_v2.4.1", + "experiment_variants": { + "homepage_layout": "control", + "ranking_model": "treatment_a" + } + } +} +``` + +### `ORDER_PLACED` + +```json +{ + "event_id": "evt_03HXYZ789GHI012JKL", + "event_type": "ORDER_PLACED", + "timestamp": "2024-03-15T14:52:33.789Z", + "user_id": "usr_9f3a8b2c1d4e5f6a", + "session_id": "sess_7e2f1a3b4c5d6e7f", + "restaurant_id": 
"rst_3b7c9e1a2f4d5e6b", + "context": { + "device": "ios", + "platform": "mobile_app", + "app_version": "4.2.1", + "location": { + "lat": 10.7769, + "lng": 106.7009, + "geohash": "w3gv2c" + } + }, + "metadata": { + "order_value_usd": 24.50, + "items_count": 3, + "experiment_variants": { + "homepage_layout": "control", + "ranking_model": "treatment_a" + } + } +} +``` + +### `SEARCH` + +```json +{ + "event_id": "evt_04HXYZ012JKL345MNO", + "event_type": "SEARCH", + "timestamp": "2024-03-15T15:10:05.000Z", + "user_id": "usr_9f3a8b2c1d4e5f6a", + "session_id": "sess_8f3e2b4c5d6e7f8a", + "restaurant_id": null, + "context": { + "device": "ios", + "platform": "mobile_app", + "app_version": "4.2.1", + "location": { + "lat": 10.7769, + "lng": 106.7009, + "geohash": "w3gv2c" + }, + "referrer": "search_bar" + }, + "metadata": { + "search_query": "sushi near me", + "search_results_count": 42, + "experiment_variants": { + "search_ranking": "control" + } + } +} +``` + +### `FILTER_APPLIED` + +```json +{ + "event_id": "evt_05HXYZ345MNO678PQR", + "event_type": "FILTER_APPLIED", + "timestamp": "2024-03-15T15:10:45.123Z", + "user_id": "usr_9f3a8b2c1d4e5f6a", + "session_id": "sess_8f3e2b4c5d6e7f8a", + "restaurant_id": null, + "context": { + "device": "ios", + "platform": "mobile_app", + "app_version": "4.2.1", + "referrer": "search_results" + }, + "metadata": { + "filter_type": "price_tier", + "filter_value": "$$", + "experiment_variants": { + "search_ranking": "control" + } + } +} +``` + +### `REVIEW_SUBMITTED` + +```json +{ + "event_id": "evt_06HXYZ678PQR901STU", + "event_type": "REVIEW_SUBMITTED", + "timestamp": "2024-03-16T09:22:00.000Z", + "user_id": "usr_9f3a8b2c1d4e5f6a", + "session_id": "sess_9g4f3c5d6e7f8a9b", + "restaurant_id": "rst_3b7c9e1a2f4d5e6b", + "context": { + "device": "ios", + "platform": "mobile_app", + "app_version": "4.2.1", + "referrer": "post_order_prompt" + }, + "metadata": { + "rating_value": 4, + "experiment_variants": {} + } +} +``` diff --git 
a/docs/schemas/user-profile.md b/docs/schemas/user-profile.md new file mode 100644 index 0000000..eb5997e --- /dev/null +++ b/docs/schemas/user-profile.md @@ -0,0 +1,341 @@ +# User Profile Schema + +## Table of Contents + +- [Overview](#overview) +- [Schema Definition](#schema-definition) +- [Field Reference](#field-reference) +- [User Segments](#user-segments) +- [JSON Examples](#json-examples) + +--- + +## Overview + +The user profile aggregates demographic information, behavioral signals, and ML-derived preference representations for each user. It is the primary input to the recommendation engine's personalization logic. + +- **Primary storage**: PostgreSQL (source of truth) + S3 Parquet (offline feature store) +- **Online serving**: Redis (feature store cache, TTL 6h) +- **Update frequency**: Daily batch rebuild (2:00 AM UTC) + event-driven updates for significant actions +- **Privacy**: Profile contains only aggregated/derived signals; raw interaction history is not stored in the profile + +--- + +## Schema Definition + +```yaml +UserProfile: + type: object + required: + - user_id + - segment + - metadata + properties: + + user_id: + type: string + description: Anonymized user identifier + example: "usr_9f3a8b2c1d4e5f6a" + + # --- Demographic --- + demographic: + type: object + description: Self-reported demographic information (from registration/onboarding) + properties: + age_range: + type: string + enum: ["18-24", "25-34", "35-44", "45-54", "55-64", "65+"] + nullable: true + description: Age bucket (not exact age; privacy-preserving) + gender: + type: string + enum: ["male", "female", "non-binary", "prefer_not_to_say"] + nullable: true + location: + type: object + description: User's primary location (home or frequently visited area) + properties: + city: + type: string + example: "Ho Chi Minh City" + district: + type: string + example: "District 2" + country: + type: string + format: iso-3166-1-alpha-2 + example: "VN" + lat: + type: number + description: 
Approximate centroid of user's activity area + example: 10.7880 + lng: + type: number + example: 106.7320 + + # --- Behavioral Aggregates --- + behavioral: + type: object + description: Derived from interaction history (last 90 days) + properties: + total_interactions: + type: integer + description: Total weighted interactions (views + clicks + orders) all-time + example: 247 + favorite_cuisines: + type: array + maxItems: 5 + items: + type: object + properties: + cuisine: + type: string + example: "japanese" + affinity_score: + type: number + minimum: 0 + maximum: 1 + description: Normalized affinity score (0=no preference, 1=strong preference) + example: 0.87 + description: Top-5 cuisine preferences with affinity scores + avg_session_duration_min: + type: number + description: Average session duration in minutes (last 30 days) + example: 8.4 + order_frequency_per_month: + type: number + description: Average orders per month (last 90 days) + example: 6.2 + last_active_at: + type: string + format: date-time + description: Timestamp of most recent interaction + example: "2024-03-15T13:45:00Z" + visited_restaurant_ids: + type: array + items: + type: string + description: IDs of restaurants visited in the last 90 days (for exclusion logic) + maxItems: 200 + + # --- Dietary Preferences --- + dietary_preferences: + type: object + description: Dietary requirements and restrictions + properties: + vegetarian: + type: boolean + default: false + vegan: + type: boolean + default: false + halal: + type: boolean + default: false + kosher: + type: boolean + default: false + allergies: + type: array + items: + type: string + description: Food allergens to avoid + example: ["shellfish", "peanuts"] + + # --- Price Preference --- + price_preference: + type: object + properties: + preferred_price_tier: + type: string + enum: ["$", "$$", "$$$", "$$$$"] + nullable: true + description: Preferred price tier (from onboarding or inferred from order history) + example: "$$" + 
avg_spend_per_meal_usd: + type: number + description: Average spend per meal order in USD (last 90 days) + example: 18.00 + + # --- ML Representations --- + preference_vector: + type: array + items: + type: number + minItems: 128 + maxItems: 128 + description: Learned 128-dimensional preference embedding from Two-Tower CF model + + # --- Segment --- + segment: + type: string + enum: + - cold_start + - warm + - active + - power_user + - churning + - dormant + description: User lifecycle segment; determines recommendation strategy + + # --- Metadata --- + metadata: + type: object + properties: + profile_completeness_pct: + type: integer + minimum: 0 + maximum: 100 + description: Percentage of optional profile fields filled in + onboarding_completed: + type: boolean + description: Whether user completed onboarding quiz + created_at: + type: string + format: date-time + updated_at: + type: string + format: date-time +``` + +--- + +## Field Reference + +| Field | Type | Required | Update Frequency | Privacy | +|-------|------|----------|-----------------|---------| +| `user_id` | string | βœ… | Immutable | Anonymized (hashed) | +| `demographic.age_range` | enum | β€” | On profile update | Bucketed (not exact age) | +| `demographic.gender` | enum | β€” | On profile update | Optional, nullable | +| `demographic.location` | object | β€” | Daily (from activity centroid) | Approx. 
centroid, not GPS | +| `behavioral.total_interactions` | int | β€” | Daily | Aggregated count | +| `behavioral.favorite_cuisines` | array[5] | β€” | Daily | Derived signal | +| `behavioral.order_frequency_per_month` | float | β€” | Daily | Derived | +| `behavioral.last_active_at` | datetime | β€” | Event-driven | β€” | +| `behavioral.visited_restaurant_ids` | string[] | β€” | Daily | Last 90 days only | +| `dietary_preferences.*` | bool | β€” | On settings update + inference | User-controlled | +| `dietary_preferences.allergies` | string[] | β€” | On settings update | User-controlled | +| `price_preference.preferred_price_tier` | enum | β€” | Daily (inferred) | Derived | +| `preference_vector` | float[128] | β€” | Weekly (model retrain) | Internal; not in API | +| `segment` | enum | βœ… | Daily | β€” | + +--- + +## User Segments + +| Segment | Criteria | Recommendation Strategy | Feature Store TTL | +|---------|----------|------------------------|------------------| +| `cold_start` | <10 total interactions | Popularity by geo + demographic matching | 1h | +| `warm` | 10–30 interactions | Blended: CF (ramping) + content-based | 3h | +| `active` | >30 interactions, active <30 days | Full personalization (CF + ranking) | 6h | +| `power_user` | >100 interactions/month | Deep personalization + early access features | 6h | +| `churning` | Active 30–90 days ago | Re-engagement: trending, high-rated nearby | 6h | +| `dormant` | No activity >90 days | Default to geo-based popularity | 12h | + +--- + +## JSON Examples + +### Active User (Full Profile) + +```json +{ + "user_id": "usr_9f3a8b2c1d4e5f6a", + "demographic": { + "age_range": "25-34", + "gender": "female", + "location": { + "city": "Ho Chi Minh City", + "district": "District 2", + "country": "VN", + "lat": 10.7880, + "lng": 106.7320 + } + }, + "behavioral": { + "total_interactions": 247, + "favorite_cuisines": [ + { "cuisine": "japanese", "affinity_score": 0.87 }, + { "cuisine": "vietnamese", "affinity_score": 0.76 
}, + { "cuisine": "korean", "affinity_score": 0.65 }, + { "cuisine": "italian", "affinity_score": 0.42 }, + { "cuisine": "chinese", "affinity_score": 0.38 } + ], + "avg_session_duration_min": 8.4, + "order_frequency_per_month": 6.2, + "last_active_at": "2024-03-15T13:45:00Z", + "visited_restaurant_ids": [ + "rst_3b7c9e1a2f4d5e6b", + "rst_4c8d0f2b3e5f6a7c" + ] + }, + "dietary_preferences": { + "vegetarian": false, + "vegan": false, + "halal": false, + "kosher": false, + "allergies": ["shellfish"] + }, + "price_preference": { + "preferred_price_tier": "$$", + "avg_spend_per_meal_usd": 18.00 + }, + "preference_vector": [0.12, -0.34, 0.87, "... (128 dimensions total)"], + "segment": "active", + "metadata": { + "profile_completeness_pct": 85, + "onboarding_completed": true, + "created_at": "2022-09-14T10:00:00Z", + "updated_at": "2024-03-15T02:00:00Z" + } +} +``` + +### Cold Start User (New Registration) + +```json +{ + "user_id": "usr_1a2b3c4d5e6f7a8b", + "demographic": { + "age_range": "18-24", + "gender": "prefer_not_to_say", + "location": { + "city": "Hanoi", + "district": "Hoan Kiem", + "country": "VN", + "lat": 21.0285, + "lng": 105.8542 + } + }, + "behavioral": { + "total_interactions": 3, + "favorite_cuisines": [ + { "cuisine": "vietnamese", "affinity_score": 0.60 }, + { "cuisine": "japanese", "affinity_score": 0.40 } + ], + "avg_session_duration_min": 3.2, + "order_frequency_per_month": 0, + "last_active_at": "2024-03-15T11:10:00Z", + "visited_restaurant_ids": [] + }, + "dietary_preferences": { + "vegetarian": true, + "vegan": false, + "halal": false, + "kosher": false, + "allergies": [] + }, + "price_preference": { + "preferred_price_tier": "$", + "avg_spend_per_meal_usd": null + }, + "preference_vector": null, + "segment": "cold_start", + "metadata": { + "profile_completeness_pct": 40, + "onboarding_completed": true, + "created_at": "2024-03-15T10:55:00Z", + "updated_at": "2024-03-15T11:10:00Z" + } +} +```
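The segment criteria in the User Segments table above can be sketched as a small rule function. The thresholds follow the table, but the precedence between recency checks and interaction-volume checks is an assumption, since the table does not define tie-breaking:

```python
def assign_segment(total_interactions: int,
                   interactions_last_month: int,
                   days_since_last_active: float) -> str:
    """Map activity aggregates to a lifecycle segment per the segments table.

    Recency rules are evaluated first (assumed precedence), then
    interaction volume.
    """
    if days_since_last_active > 90:
        return "dormant"        # no activity >90 days
    if days_since_last_active >= 30:
        return "churning"       # active 30-90 days ago
    if total_interactions < 10:
        return "cold_start"     # <10 total interactions
    if interactions_last_month > 100:
        return "power_user"     # >100 interactions/month
    if total_interactions <= 30:
        return "warm"           # 10-30 interactions
    return "active"             # >30 interactions, active <30 days

# The "Active User" example above (247 interactions, active yesterday):
assert assign_segment(247, 40, 1) == "active"
```

A daily batch job would recompute this per user alongside the other behavioral aggregates; the function itself is deliberately stateless so the same rules can run in both the offline rebuild and event-driven updates.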