diff --git a/README.md b/README.md index efa5625..f634b8b 100644 --- a/README.md +++ b/README.md @@ -1,6 +1,140 @@ # Restaurant Recommendation System -Architecture documentation for a Restaurant Recommendation System. +A scalable, ML-powered restaurant recommendation system that delivers personalized dining suggestions based on user preferences, location, behavior history, and contextual signals. + +## Table of Contents + +- [Architecture Overview](#architecture-overview) +- [Core Modules](#core-modules) +- [Tech Stack](#tech-stack) +- [Documentation Index](#documentation-index) + +--- + +## Architecture Overview + +```mermaid +graph TB + subgraph Client["Client Layer"] + MobileApp["πŸ“± Mobile App"] + WebApp["🌐 Web App"] + ThirdParty["πŸ”— Third-party API"] + end + + subgraph Gateway["API Gateway Layer"] + LB["Load Balancer"] + APIGW["API Gateway\n(Auth Β· Rate Limit Β· Routing)"] + end + + subgraph Serving["Serving Layer"] + RecAPI["Recommendation\nService (FastAPI)"] + SearchAPI["Search\nRe-rank Service"] + end + + subgraph RecsEngine["Recommendation Engine"] + CandGen["Candidate\nGeneration"] + Ranking["Scoring &\nRanking"] + ReRank["Re-ranking &\nBusiness Logic"] + end + + subgraph DataLayer["Data Layer"] + Redis[("Redis Cache")] + Postgres[("PostgreSQL")] + Elasticsearch[("Elasticsearch")] + Milvus[("Milvus\nVector DB")] + end + + subgraph MLPlatform["ML Platform"] + FeatureStore["Feature Store"] + ModelRegistry["Model Registry\n(MLflow)"] + Training["Training\nPipeline"] + end + + subgraph Streaming["Streaming Layer"] + Kafka["Apache Kafka"] + Flink["Apache Flink"] + end + + subgraph Observability["Observability"] + Prometheus["Prometheus"] + Grafana["Grafana"] + ELK["ELK Stack"] + end + + Client --> LB --> APIGW --> Serving + Serving --> RecsEngine + RecsEngine --> DataLayer + RecsEngine --> FeatureStore + MLPlatform --> RecsEngine + Streaming --> FeatureStore + Serving --> Observability +``` + +--- + +## Core Modules + +| # | Module | Description | 
+|---|--------|-------------| +| 01 | [Data Ingestion](docs/modules/01-data-ingestion.md) | Clickstream tracking, restaurant catalog sync, user profile aggregation, review ingestion | +| 02 | [Recommendation Engine](docs/modules/02-recommendation-engine.md) | Candidate generation β†’ Scoring/Ranking β†’ Re-ranking with business rules | +| 03 | [Serving & Delivery](docs/modules/03-serving-delivery.md) | REST API endpoints, 4-layer caching, real-time personalization stream | +| 04 | [Feature Store](docs/modules/04-feature-store.md) | Online/Offline store, feature registry, <5ms latency SLA | +| 05 | [ML Platform](docs/modules/05-ml-platform.md) | Training pipelines, model lifecycle, canary/blue-green deployment | +| 06 | [Experimentation](docs/modules/06-experimentation.md) | A/B testing framework, statistical testing, experiment lifecycle | +| 07 | [Observability](docs/modules/07-observability.md) | System monitoring, ML drift detection, business metrics, alerting | +| 08 | [Data Pipeline](docs/modules/08-data-pipeline.md) | Airflow DAGs, Kafka+Flink streaming, SLAs, retry strategies | + +--- + +## Tech Stack + +| Layer | Technology | +|-------|-----------| +| API | FastAPI (Python), Go (high-throughput endpoints) | +| ML Framework | PyTorch, TensorFlow, LightGBM, XGBoost | +| Feature Store | Redis (online), S3 + BigQuery (offline) | +| Vector Search | Milvus (HNSW ANN), Elasticsearch | +| Streaming | Apache Kafka, Apache Flink | +| Batch Processing | Apache Spark, Apache Airflow | +| Relational DB | PostgreSQL | +| Cache | Redis Cluster | +| Model Serving | TensorFlow Serving, NVIDIA Triton | +| Model Registry | MLflow | +| Experiment Tracking | Evidently AI | +| Monitoring | Prometheus, Grafana, Metabase | +| Logging | ELK Stack / Grafana Loki | +| CI/CD | GitHub Actions | +| Container / Orchestration | Docker, Kubernetes | + +--- + +## Documentation Index + +### Architecture +- [System Overview](docs/architecture/system-overview.md) β€” System diagram, request 
lifecycle, latency budgets +- [Tech Stack](docs/architecture/tech-stack.md) β€” Technology choices and justification +- [Data Flow](docs/architecture/data-flow.md) β€” Real-time, batch, and streaming flow diagrams + +### Modules +- [01 Data Ingestion](docs/modules/01-data-ingestion.md) +- [02 Recommendation Engine](docs/modules/02-recommendation-engine.md) +- [03 Serving & Delivery](docs/modules/03-serving-delivery.md) +- [04 Feature Store](docs/modules/04-feature-store.md) +- [05 ML Platform](docs/modules/05-ml-platform.md) +- [06 Experimentation](docs/modules/06-experimentation.md) +- [07 Observability](docs/modules/07-observability.md) +- [08 Data Pipeline](docs/modules/08-data-pipeline.md) + +### Schemas +- [User Event Schema](docs/schemas/user-event.md) +- [Restaurant Catalog Schema](docs/schemas/restaurant-catalog.md) +- [User Profile Schema](docs/schemas/user-profile.md) +- [Review & Rating Schema](docs/schemas/review-rating.md) + +### API +- [API Specification](docs/api/api-specification.md) + +### Infrastructure +- [Deployment](docs/infrastructure/deployment.md) -## Status -🚧 Under Development diff --git a/docs/api/api-specification.md b/docs/api/api-specification.md new file mode 100644 index 0000000..e9df2c2 --- /dev/null +++ b/docs/api/api-specification.md @@ -0,0 +1,527 @@ +# API Specification + +## Table of Contents + +- [Overview](#overview) +- [Authentication](#authentication) +- [Rate Limiting](#rate-limiting) +- [Common Request Headers](#common-request-headers) +- [Common Response Format](#common-response-format) +- [Error Handling](#error-handling) +- [Endpoints](#endpoints) + - [GET /api/v1/recommendations/homepage](#get-apiv1recommendationshomepage) + - [GET /api/v1/recommendations/nearby](#get-apiv1recommendationsnearby) + - [GET /api/v1/recommendations/similar/{restaurant_id}](#get-apiv1recommendationssimilarrestaurant_id) + - [GET /api/v1/recommendations/complementary/{restaurant_id}](#get-apiv1recommendationscomplementaryrestaurant_id) + - 
[POST /api/v1/recommendations/search-rerank](#post-apiv1recommendationssearch-rerank) + - [POST /api/v1/recommendations/group](#post-apiv1recommendationsgroup) +- [SLA Summary](#sla-summary) + +--- + +## Overview + +| Property | Value | +|----------|-------| +| **Base URL** | `https://api.restaurantrec.example.com` | +| **API Version** | `v1` | +| **Protocol** | HTTPS only | +| **Format** | JSON (application/json) | +| **Encoding** | UTF-8 | +| **API Style** | RESTful | + +--- + +## Authentication + +All endpoints require a valid Bearer JWT token in the `Authorization` header. + +``` +Authorization: Bearer +``` + +JWT tokens are issued by the authentication service with: +- **Expiry**: 1 hour (access token), 30 days (refresh token) +- **Claims**: `user_id`, `exp`, `iat`, `scope` +- **Algorithm**: RS256 + +Server-to-server integrations use API keys with the `X-API-Key` header (rate limit: 1,000 req/min). + +--- + +## Rate Limiting + +| Client Type | Limit | Window | Header Returned | +|-------------|-------|--------|----------------| +| Authenticated user (JWT) | 100 requests | 1 minute | `X-RateLimit-Remaining` | +| Server API key | 1,000 requests | 1 minute | `X-RateLimit-Remaining` | +| Unauthenticated IP | 20 requests | 1 minute | `X-RateLimit-Remaining` | +| Group endpoint (per user) | 10 requests | 1 minute | `X-RateLimit-Remaining` | + +When rate limit is exceeded, the API returns `HTTP 429` with a `Retry-After` header. + +--- + +## Common Request Headers + +| Header | Required | Description | +|--------|----------|-------------| +| `Authorization` | βœ… | `Bearer ` | +| `Content-Type` | For POST | `application/json` | +| `Accept` | β€” | `application/json` (default) | +| `X-Request-ID` | Recommended | Client-generated UUID for request tracing | +| `X-API-Key` | For server auth | Alternative to Bearer token for server-to-server | + +--- + +## Common Response Format + +### Success Response + +```json +{ + "request_id": "req_01HXZ789ABC", + "data": { ... 
}, + "meta": { + "total": 20, + "model_version": "ranking_v2.4.1", + "latency_ms": 42, + "cache_hit": false, + "experiment_variants": { + "ranking_model": "treatment_a" + } + } +} +``` + +### Recommendation Item + +All recommendation endpoints return an array of `RecommendationItem`: + +```json +{ + "restaurant_id": "rst_3b7c9e1a2f4d5e6b", + "name": "Pho Saigon House", + "score": 0.92, + "rank": 1, + "distance_km": 1.2, + "primary_cuisine": "vietnamese", + "cuisine_types": ["vietnamese", "asian", "noodles"], + "price_tier": "$$", + "avg_rating": 4.3, + "total_reviews": 1842, + "is_open_now": true, + "photo_url": "https://cdn.example.com/restaurants/rst_3b7c9e/photo_1.jpg", + "address": "123 Nguyen Hue Boulevard, District 1", + "lat": 10.7769, + "lng": 106.7009, + "ambiance_tags": ["casual", "family-friendly"], + "dietary_highlights": ["halal", "vegetarian_options"], + "reason": "Based on your love of Vietnamese cuisine" +} +``` + +--- + +## Error Handling + +### Error Response Format + +```json +{ + "error": { + "code": "INVALID_PARAMETERS", + "message": "Query parameter 'radius_km' must be between 1 and 50.", + "details": { + "field": "radius_km", + "provided_value": 100, + "valid_range": [1, 50] + } + }, + "request_id": "req_01HXZ789ABC", + "timestamp": "2024-03-15T14:32:07.123Z" +} +``` + +### HTTP Status Codes + +| HTTP Status | Error Code | Description | +|-------------|-----------|-------------| +| `400 Bad Request` | `INVALID_PARAMETERS` | Missing required param or out-of-range value | +| `401 Unauthorized` | `UNAUTHORIZED` | Missing, invalid, or expired JWT token | +| `403 Forbidden` | `FORBIDDEN` | Valid token but insufficient permissions | +| `404 Not Found` | `RESTAURANT_NOT_FOUND` | Specified `restaurant_id` does not exist | +| `429 Too Many Requests` | `RATE_LIMIT_EXCEEDED` | Rate limit exceeded | +| `500 Internal Server Error` | `INTERNAL_ERROR` | Unexpected server error (with fallback) | +| `503 Service Unavailable` | `RECOMMENDATION_ENGINE_TIMEOUT` 
| Engine exceeded latency budget; fallback returned | +| `503 Service Unavailable` | `FEATURE_STORE_UNAVAILABLE` | Feature store unreachable; fallback returned | + +--- + +## Endpoints + +### GET /api/v1/recommendations/homepage + +Returns personalized recommendations for the user's home feed. + +**SLA**: p50: 50ms, p99: 200ms + +#### Request + +``` +GET /api/v1/recommendations/homepage?limit=20&lat=10.7769&lng=106.7009&radius_km=5 +Authorization: Bearer +``` + +#### Query Parameters + +| Parameter | Type | Default | Required | Description | +|-----------|------|---------|----------|-------------| +| `limit` | integer | 20 | β€” | Number of recommendations (1–50) | +| `offset` | integer | 0 | β€” | Pagination offset | +| `lat` | float | β€” | β€” | User latitude; falls back to profile location | +| `lng` | float | β€” | β€” | User longitude | +| `radius_km` | float | 5.0 | β€” | Search radius in km (1–50) | +| `open_now` | boolean | false | β€” | Filter to only open restaurants | +| `exclude_visited` | boolean | false | β€” | Exclude recently visited restaurants | + +#### Response (200 OK) + +```json +{ + "request_id": "req_01HXZ789ABC", + "data": { + "recommendations": [ + { + "restaurant_id": "rst_3b7c9e1a2f4d5e6b", + "name": "Pho Saigon House", + "score": 0.92, + "rank": 1, + "distance_km": 1.2, + "primary_cuisine": "vietnamese", + "cuisine_types": ["vietnamese", "asian", "noodles"], + "price_tier": "$$", + "avg_rating": 4.3, + "total_reviews": 1842, + "is_open_now": true, + "photo_url": "https://cdn.example.com/restaurants/rst_3b7c9e/photo_1.jpg", + "address": "123 Nguyen Hue Boulevard, District 1", + "lat": 10.7769, + "lng": 106.7009, + "ambiance_tags": ["casual", "family-friendly"], + "dietary_highlights": ["halal", "vegetarian_options"], + "reason": "Based on your love of Vietnamese cuisine" + } + ] + }, + "meta": { + "total": 20, + "model_version": "ranking_v2.4.1", + "latency_ms": 42, + "cache_hit": false, + "experiment_variants": { + "ranking_model": 
"treatment_a" + } + } +} +``` + +--- + +### GET /api/v1/recommendations/nearby + +Returns top-rated and personalized nearby restaurants based on the user's current location. + +**SLA**: p50: 40ms, p99: 150ms + +#### Request + +``` +GET /api/v1/recommendations/nearby?lat=10.7769&lng=106.7009&radius_km=3&open_now=true +Authorization: Bearer +``` + +#### Query Parameters + +| Parameter | Type | Default | Required | Description | +|-----------|------|---------|----------|-------------| +| `lat` | float | β€” | βœ… | User latitude | +| `lng` | float | β€” | βœ… | User longitude | +| `radius_km` | float | 3.0 | β€” | Search radius in km (1–50) | +| `limit` | integer | 20 | β€” | Max results (1–50) | +| `cuisine` | string | β€” | β€” | Filter by cuisine type | +| `price_tier` | string | β€” | β€” | Filter: `$`, `$$`, `$$$`, `$$$$` | +| `open_now` | boolean | false | β€” | Only return currently open restaurants | +| `dietary` | string | β€” | β€” | Comma-separated: `vegetarian`, `vegan`, `halal`, `gluten_free` | + +#### Response (200 OK) + +Same structure as homepage endpoint with `"reason": "Near your location"`. + +--- + +### GET /api/v1/recommendations/similar/{restaurant_id} + +Returns restaurants similar to a given restaurant. 
+ +**SLA**: p50: 30ms, p99: 100ms + +#### Request + +``` +GET /api/v1/recommendations/similar/rst_3b7c9e1a2f4d5e6b?limit=10&lat=10.7769&lng=106.7009 +Authorization: Bearer +``` + +#### Path Parameters + +| Parameter | Type | Required | Description | +|-----------|------|----------|-------------| +| `restaurant_id` | string | βœ… | Source restaurant ID | + +#### Query Parameters + +| Parameter | Type | Default | Description | +|-----------|------|---------|-------------| +| `limit` | integer | 10 | Number of similar restaurants (1–20) | +| `lat` | float | β€” | User location for geo-scoring | +| `lng` | float | β€” | User location for geo-scoring | + +#### Response (200 OK) + +```json +{ + "request_id": "req_02HXZ890BCD", + "data": { + "source_restaurant_id": "rst_3b7c9e1a2f4d5e6b", + "recommendations": [ + { + "restaurant_id": "rst_4c8d0f2b3e5f6a7c", + "name": "Bun Bo Nam Bo", + "score": 0.88, + "rank": 1, + "similarity_score": 0.91, + "primary_cuisine": "vietnamese", + "price_tier": "$$", + "avg_rating": 4.4, + "is_open_now": true, + "distance_km": 0.8, + "reason": "Similar Vietnamese cuisine" + } + ] + }, + "meta": { + "total": 10, + "model_version": "similarity_v1.2.0", + "latency_ms": 28, + "cache_hit": true + } +} +``` + +--- + +### GET /api/v1/recommendations/complementary/{restaurant_id} + +Returns "People also liked" recommendations for users who interacted with a given restaurant. 
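For illustration, a minimal client sketch for the GET endpoints in this spec, using only the Python standard library. The `build_request` helper is hypothetical (not part of any official SDK), and the token value is a placeholder:

```python
import urllib.parse
import urllib.request

BASE_URL = "https://api.restaurantrec.example.com"  # Base URL from the Overview table

def build_request(path: str, token: str, **params) -> urllib.request.Request:
    """Build an authenticated GET request for a recommendation endpoint (hypothetical helper)."""
    # Drop unset parameters so optional query params can be passed as None.
    query = urllib.parse.urlencode({k: v for k, v in params.items() if v is not None})
    url = f"{BASE_URL}{path}" + (f"?{query}" if query else "")
    return urllib.request.Request(url, headers={
        "Authorization": f"Bearer {token}",  # JWT issued by the auth service
        "Accept": "application/json",
    })

# Usage: send with urllib.request.urlopen(req) and parse the JSON body.
req = build_request(
    "/api/v1/recommendations/complementary/rst_3b7c9e1a2f4d5e6b",
    token="<jwt>",
    limit=10,
)
```

The POST endpoints differ only in attaching a JSON body and a `Content-Type: application/json` header.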
+ +**SLA**: p50: 30ms, p99: 100ms + +#### Request + +``` +GET /api/v1/recommendations/complementary/rst_3b7c9e1a2f4d5e6b?limit=10 +Authorization: Bearer +``` + +#### Path Parameters + +| Parameter | Type | Required | Description | +|-----------|------|----------|-------------| +| `restaurant_id` | string | βœ… | Source restaurant ID | + +#### Query Parameters + +| Parameter | Type | Default | Description | +|-----------|------|---------|-------------| +| `limit` | integer | 10 | Number of results (1–20) | +| `lat` | float | β€” | User location for geo-scoring | +| `lng` | float | β€” | User location for geo-scoring | + +--- + +### POST /api/v1/recommendations/search-rerank + +Re-ranks a set of search results using personalization signals. + +**SLA**: p50: 60ms, p99: 200ms + +#### Request + +``` +POST /api/v1/recommendations/search-rerank +Authorization: Bearer +Content-Type: application/json +``` + +#### Request Body + +```json +{ + "query": "sushi downtown", + "candidate_ids": [ + "rst_abc123", + "rst_def456", + "rst_ghi789", + "rst_jkl012" + ], + "context": { + "lat": 10.7769, + "lng": 106.7009, + "session_id": "sess_7e2f1a3b4c5d6e7f" + }, + "limit": 10 +} +``` + +| Field | Type | Required | Description | +|-------|------|----------|-------------| +| `query` | string | βœ… | Original search query | +| `candidate_ids` | string[] | βœ… | Restaurant IDs to re-rank (max 200) | +| `context.lat` | float | β€” | User location | +| `context.lng` | float | β€” | User location | +| `context.session_id` | string | β€” | Session ID for personalization context | +| `limit` | integer | β€” | Max results to return (default: all) | + +#### Response (200 OK) + +```json +{ + "request_id": "req_03HXZ901CDE", + "data": { + "query": "sushi downtown", + "recommendations": [ + { + "restaurant_id": "rst_def456", + "name": "Sakura Sushi Bar", + "score": 0.89, + "rank": 1, + "original_rank": 2, + "primary_cuisine": "japanese", + "avg_rating": 4.6, + "price_tier": "$$$", + "is_open_now": 
true, + "distance_km": 0.5 + } + ] + }, + "meta": { + "total": 4, + "model_version": "rerank_v1.1.0", + "latency_ms": 55, + "cache_hit": false + } +} +``` + +--- + +### POST /api/v1/recommendations/group + +Generates recommendations for group dining by aggregating preferences of multiple users. + +**SLA**: p50: 80ms, p99: 250ms + +#### Request + +``` +POST /api/v1/recommendations/group +Authorization: Bearer +Content-Type: application/json +``` + +#### Request Body + +```json +{ + "user_ids": [ + "usr_9f3a8b2c1d4e5f6a", + "usr_1a2b3c4d5e6f7a8b", + "usr_2b3c4d5e6f7a8b9c" + ], + "context": { + "lat": 10.7769, + "lng": 106.7009, + "occasion": "business", + "party_size": 3 + }, + "constraints": { + "dietary": ["halal"], + "price_tier_max": "$$$", + "radius_km": 5.0, + "open_now": true + }, + "limit": 10 +} +``` + +| Field | Type | Required | Description | +|-------|------|----------|-------------| +| `user_ids` | string[] | βœ… | User IDs in the group (max 10) | +| `context.lat` | float | β€” | Meeting location | +| `context.lng` | float | β€” | Meeting location | +| `context.occasion` | string | β€” | `business`, `casual`, `celebration`, `date` | +| `context.party_size` | integer | β€” | Number of people (overrides user_ids count) | +| `constraints.dietary` | string[] | β€” | Dietary requirements for the whole group | +| `constraints.price_tier_max` | string | β€” | Maximum price tier | +| `constraints.radius_km` | float | β€” | Search radius | +| `constraints.open_now` | boolean | β€” | Filter to open restaurants | +| `limit` | integer | β€” | Max results (default 10) | + +#### Response (200 OK) + +```json +{ + "request_id": "req_04HXZ012DEF", + "data": { + "group_size": 3, + "recommendations": [ + { + "restaurant_id": "rst_xyz789", + "name": "The Meeting Room Restaurant", + "score": 0.85, + "rank": 1, + "group_compatibility_score": 0.92, + "primary_cuisine": "international", + "price_tier": "$$$", + "avg_rating": 4.5, + "is_open_now": true, + "distance_km": 1.1, + 
"capacity": { "private_rooms": 2, "total_seats": 120 }, + "reason": "Matches cuisine preferences for all 3 members" + } + ] + }, + "meta": { + "total": 10, + "model_version": "group_v1.0.0", + "latency_ms": 76, + "cache_hit": false + } +} +``` + +--- + +## SLA Summary + +| Endpoint | Method | p50 | p99 | Cache TTL | Notes | +|----------|--------|-----|-----|-----------|-------| +| `/recommendations/homepage` | GET | 50ms | 200ms | Redis: 30min, CDN: 10min | High cache hit rate | +| `/recommendations/nearby` | GET | 40ms | 150ms | Redis: 15min, CDN: 5min | Location-sensitive | +| `/recommendations/similar/{id}` | GET | 30ms | 100ms | Redis: 2h, CDN: 30min | Pre-computable | +| `/recommendations/complementary/{id}` | GET | 30ms | 100ms | Redis: 2h, CDN: 30min | Pre-computable | +| `/recommendations/search-rerank` | POST | 60ms | 200ms | Not cached | Query-specific | +| `/recommendations/group` | POST | 80ms | 250ms | Redis: 15min | Multi-user aggregation | + +All SLAs measured at the API Gateway (excluding client network time). Monitored via Prometheus with Grafana alerting on p99 breaches. diff --git a/docs/architecture/data-flow.md b/docs/architecture/data-flow.md new file mode 100644 index 0000000..30a3be7 --- /dev/null +++ b/docs/architecture/data-flow.md @@ -0,0 +1,201 @@ +# Data Flow + +## Table of Contents + +- [Real-time Recommendation Flow](#real-time-recommendation-flow) +- [Batch Processing Flow](#batch-processing-flow) +- [Event Streaming Flow](#event-streaming-flow) + +--- + +## Real-time Recommendation Flow + +This flow describes how a user request is processed to produce personalized recommendations in real time (cache miss path). 
+ +```mermaid +flowchart TD + A["πŸ‘€ User Request\nGET /api/v1/recommendations/homepage"] --> B["API Gateway\nΒ· Auth Β· Rate Limit"] + B --> C{CDN/Redis\nCache Hit?} + C -- HIT --> Z["βœ… Return Cached Response\n~5ms"] + C -- MISS --> D["Recommendation Service"] + + D --> E["Fetch User Features\nFeature Store (Redis)\n~3ms"] + E --> F["Candidate Generation\n~10ms"] + + F --> F1["Collaborative Filtering\n(Two-Tower embedding ANN via Milvus)"] + F --> F2["Content-Based Filtering\n(Item embedding ANN via Milvus)"] + F --> F3["Geo / Location Filtering\n(Elasticsearch geo_distance)"] + F --> F4["Popularity / Trending\n(Redis Sorted Set)"] + F --> F5["Cuisine Matching\n(Postgres lookup)"] + + F1 & F2 & F3 & F4 & F5 --> G["Merge & Deduplicate\n500-2000 candidates"] + + G --> H["Batch Fetch Restaurant Features\nFeature Store (Redis)\n~5ms"] + H --> I["Scoring & Ranking\nLightGBM / DeepFM inference\n~15ms"] + + I --> J["Re-ranking & Business Rules\n~5ms"] + J --> J1["MMR Diversity Injection"] + J --> J2["Operating Hours Filter"] + J --> J3["Sponsored / Fairness Boost"] + J --> J4["Cold Start Handling"] + + J1 & J2 & J3 & J4 --> K["Final Top 10-50 Results"] + K --> L["Store in Redis Cache\nTTL: 30min"] + L --> M["βœ… Return Response\n~40-60ms total"] +``` + +### Flow Description + +1. **Request arrives** at the API Gateway, which validates the JWT token and checks rate limits +2. **Cache check**: the system checks CDN edge cache (TTL 5–15 min) and Redis cache (TTL 30 min) before executing the full pipeline +3. **User features** are fetched from the online feature store (Redis) β€” includes user embedding (128d), segment, cuisine preferences, price preference, recency signals +4. 
**Candidate generation** runs multiple retrieval strategies in parallel: + - *Collaborative filtering*: ANN search on Two-Tower user/item embeddings via Milvus + - *Content-based*: ANN search on item text/image embeddings + - *Geo filtering*: Elasticsearch geo_distance query within configurable radius (1–50km) + - *Popularity*: Top-N restaurants from Redis time-decayed sorted set + - *Cuisine matching*: Direct lookup of restaurants matching user's top cuisine preferences +5. **Candidates are merged** and deduplicated, yielding 500–2000 candidates +6. **Restaurant features** are batch-fetched from the online feature store +7. **Scoring and ranking** uses a pre-loaded LightGBM/DeepFM model to score each candidate with cross features (user Γ— restaurant Γ— context) +8. **Re-ranking** applies business rules, diversity constraints, and filters +9. **Result is cached** in Redis and returned to the client + +--- + +## Batch Processing Flow + +This flow describes the daily/weekly offline pipelines for model training, feature computation, and catalog enrichment. 
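One concrete step in these pipelines, training-data construction via positive/negative sampling, can be sketched with the standard library. The 1:4 default ratio, the row layout, and uniform sampling are illustrative assumptions, not details of the actual Spark job:

```python
import random

def build_training_pairs(positives, catalog, negatives_per_positive=4, seed=42):
    """Pair each observed (user, restaurant) interaction with sampled negatives.

    positives: iterable of (user_id, restaurant_id) engagement pairs (label 1).
    catalog:   list of all restaurant_ids eligible to serve as negatives.
    """
    positives = list(positives)
    rng = random.Random(seed)
    seen_by_user: dict[str, set] = {}
    for user_id, restaurant_id in positives:
        seen_by_user.setdefault(user_id, set()).add(restaurant_id)

    rows = []
    for user_id, restaurant_id in positives:
        rows.append((user_id, restaurant_id, 1))            # positive example
        # Never sample a restaurant the user actually engaged with as a negative.
        pool = [r for r in catalog if r not in seen_by_user[user_id]]
        for neg in rng.sample(pool, min(negatives_per_positive, len(pool))):
            rows.append((user_id, neg, 0))                  # sampled negative
    return rows
```

Production pipelines usually prefer exposure-based negatives (impressed but not clicked) over uniform sampling; the uniform sampler here is just the simplest instance of the idea.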
+ +```mermaid +flowchart LR + subgraph Sources["Data Sources"] + EventLog["Event Log\n(S3/GCS)"] + PG["PostgreSQL\n(Catalog, Users, Orders)"] + ReviewDB["Review DB\n(PostgreSQL)"] + end + + subgraph Spark["Spark Batch Jobs (Airflow-orchestrated)"] + FeatureEng["Feature Engineering\nΒ· User behavior aggregates\nΒ· Restaurant interaction stats\nΒ· Temporal features"] + EmbeddingJob["Embedding Generation\nΒ· User collaborative embeddings\nΒ· Restaurant text/image embeddings"] + TrainingDataJob["Training Data Construction\nΒ· Positive/negative sampling\nΒ· Label assignment\nΒ· Feature joining"] + end + + subgraph Training["ML Training"] + CFTrain["Collaborative Filtering Training\n(ALS / Two-Tower / Neural CF)"] + RankTrain["Ranking Model Training\n(LightGBM / XGBoost / DeepFM)"] + Evaluate["Model Evaluation\nΒ· AUC-ROC Β· NDCG@K\nΒ· MAP@K Β· Coverage Β· Diversity"] + end + + subgraph Registry["Model Registry"] + MLflowReg["MLflow Registry\nΒ· Version tagging\nΒ· Artifact storage\nΒ· Lineage tracking"] + end + + subgraph Deployment["Serving Update"] + Canary["Canary Deploy\n5% traffic, 2h window"] + Promote["Promote to 100%\n(if metrics pass)"] + Rollback["Rollback\n(if guardrail breached)"] + end + + subgraph OfflineStore["Offline Feature Store"] + S3Parquet["S3/GCS Parquet\n(daily snapshots)"] + BigQuery["BigQuery\n(analytics queries)"] + end + + subgraph OnlineStore["Online Feature Store"] + RedisCluster["Redis Cluster\n(warm-up from batch)"] + MilvusIdx["Milvus Index\n(embedding rebuild)"] + end + + Sources --> Spark + Spark --> OfflineStore + Spark --> Training + OfflineStore --> Training + Training --> Evaluate + Evaluate --> MLflowReg + MLflowReg --> Canary + Canary --> Promote + Canary --> Rollback + OfflineStore --> OnlineStore + EmbeddingJob --> MilvusIdx +``` + +### Batch Pipeline Schedule + +| Pipeline | Schedule | SLA | Description | +|----------|----------|-----|-------------| +| `user_profile_rebuild` | Daily 2:00 AM | 5:00 AM | Rebuild user behavioral 
aggregates and preference vectors | +| `restaurant_catalog_refresh` | Daily 1:00 AM | 4:00 AM | Sync restaurant catalog, compute NLP tags and image embeddings | +| `feature_engineering` | Daily 3:00 AM | 6:00 AM | Compute all offline features, write to S3 Parquet + BigQuery | +| `online_store_warmup` | Daily 6:30 AM | 7:30 AM | Populate Redis from latest offline feature snapshot | +| `model_retraining` | Weekly, Sunday 3:00 AM | 8:00 AM | Full model retrain (CF + ranking) with last 30 days of data | +| `embedding_index_rebuild` | Weekly, Sunday 6:00 AM | 10:00 AM | Rebuild Milvus HNSW index with updated embeddings | + +--- + +## Event Streaming Flow + +This flow describes how real-time user behavior events are ingested, processed, and used to update session features and trigger near-real-time re-scoring. + +```mermaid +flowchart TD + subgraph Clients["Client Applications"] + MobileSDK["πŸ“± Mobile SDK"] + WebSDK["🌐 Web SDK (JS)"] + ServerEvents["Server-side Events\n(Order, Review)"] + end + + subgraph Kafka["Apache Kafka Topics"] + RawEvents["Topic: user-events-raw\n(partitioned by user_id)"] + EnrichedEvents["Topic: user-events-enriched"] + SessionUpdates["Topic: session-feature-updates"] + RecTriggers["Topic: recs-refresh-triggers"] + end + + subgraph Flink["Apache Flink Jobs"] + Validate["Event Validation\nΒ· Schema check\nΒ· Dedup (event_id)\nΒ· Late event handling"] + Enrich["Event Enrichment\nΒ· Resolve restaurant_id metadata\nΒ· Add geo context\nΒ· Lookup user segment"] + SessionAgg["Session Aggregation\nΒ· Real-time session features\nΒ· Click sequences\nΒ· Dwell time"] + TriggerLogic["Trigger Logic\nΒ· ORDER_PLACED β†’ immediate refresh\nΒ· N clicks β†’ soft refresh\nΒ· Idle >30min β†’ session end"] + end + + subgraph FeatureUpdates["Feature Updates"] + RedisUpdate["Redis Cache Update\n(session features, <5s latency)"] + CDNInvalidate["CDN Cache Invalidation\n(on significant behavior change)"] + end + + subgraph Analytics["Analytics & Monitoring"] + 
ClickstreamAnalytics["Clickstream Analytics\n(BigQuery / Elasticsearch)"] + ABTracking["A/B Experiment Event Tracking"] + AnomalyDetect["Anomaly Detection\n(unusual event patterns)"] + end + + MobileSDK & WebSDK & ServerEvents --> RawEvents + RawEvents --> Validate + Validate --> Enrich + Enrich --> EnrichedEvents + EnrichedEvents --> SessionAgg + SessionAgg --> SessionUpdates + SessionUpdates --> RedisUpdate + SessionAgg --> TriggerLogic + TriggerLogic --> RecTriggers + RecTriggers --> CDNInvalidate + EnrichedEvents --> ClickstreamAnalytics + EnrichedEvents --> ABTracking + EnrichedEvents --> AnomalyDetect +``` + +### Event Processing SLAs + +| Stage | Latency Target | Description | +|-------|---------------|-------------| +| SDK β†’ Kafka | <500ms | Client SDK batches events and flushes every 500ms | +| Kafka β†’ Flink (ingestion lag) | <1s | Flink consumer lag monitoring; alert if >5s | +| Flink validation + enrichment | <1s | Per-event processing | +| Session feature update (Redis) | <5s end-to-end | From user action to updated feature in Redis | +| Recommendation refresh trigger | <10s | After ORDER_PLACED event triggers fresh recommendations | +| Analytics sink (BigQuery) | <5min | Micro-batch write to BigQuery for analytics | + +### Event Deduplication Strategy + +Events are deduplicated using the `event_id` (UUID v4 generated by the client SDK) within a **60-second deduplication window** in Flink. Late-arriving events (up to 5 minutes) are handled via Flink's watermark mechanism and routed to a separate late-events correction topic to avoid corrupting real-time session state. 
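A single-process sketch of that deduplication window; Flink would hold this as keyed state per `event_id`, but the bookkeeping is the same:

```python
class DedupWindow:
    """Drop events whose event_id was already seen within `window_s` seconds."""

    def __init__(self, window_s: float = 60.0):
        self.window_s = window_s
        self._seen: dict[str, float] = {}    # event_id -> first-seen timestamp

    def accept(self, event_id: str, now: float) -> bool:
        # Evict entries older than the window so state stays bounded.
        cutoff = now - self.window_s
        self._seen = {eid: ts for eid, ts in self._seen.items() if ts >= cutoff}
        if event_id in self._seen:
            return False                      # duplicate inside the window
        self._seen[event_id] = now
        return True
```

Note that duplicates do not refresh the stored timestamp here, so an id becomes acceptable again once its first sighting falls out of the window.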
diff --git a/docs/architecture/system-overview.md b/docs/architecture/system-overview.md new file mode 100644 index 0000000..af9bc38 --- /dev/null +++ b/docs/architecture/system-overview.md @@ -0,0 +1,259 @@ +# System Overview + +## Table of Contents + +- [System Architecture Diagram](#system-architecture-diagram) +- [Core Components](#core-components) +- [Request Lifecycle](#request-lifecycle) +- [Latency Budget](#latency-budget) +- [Deployment Topology](#deployment-topology) + +--- + +## System Architecture Diagram + +```mermaid +graph TB + subgraph Clients["Client Layer"] + Mobile["πŸ“± Mobile App"] + Web["🌐 Web App"] + Partner["πŸ”— Partner API"] + end + + subgraph Ingress["Ingress Layer"] + CDN["CDN / Edge Cache\n(CloudFront)"] + LB["Load Balancer\n(L7 β€” NGINX / ALB)"] + end + + subgraph Gateway["API Gateway"] + APIGW["API Gateway\nΒ· JWT Auth\nΒ· Rate Limiting\nΒ· Request Routing\nΒ· TLS Termination"] + end + + subgraph ServingLayer["Serving Layer"] + RecSvc["Recommendation Service\n(FastAPI)"] + SearchSvc["Search Re-rank Service\n(FastAPI)"] + CatalogSvc["Catalog Service\n(Go)"] + end + + subgraph Engine["Recommendation Engine"] + CandGen["Candidate Generation\n(500-2000 candidates)"] + Ranker["Scoring & Ranking\n(top 50-200)"] + ReRanker["Re-ranking & Business Rules\n(final top 10-50)"] + end + + subgraph DataStores["Data Layer"] + Redis[("Redis Cluster\n(Cache / Sessions)")] + Postgres[("PostgreSQL\n(User / Restaurant data)")] + ES[("Elasticsearch\n(Full-text / Geo search)")] + Milvus[("Milvus\n(Vector / ANN search)")] + S3[("S3 / GCS\n(Offline feature store)")] + end + + subgraph ML["ML Platform"] + FeatureStore["Feature Store\n(Online: Redis | Offline: BigQuery)"] + ModelServing["Model Serving\n(TF Serving / Triton)"] + MLflow["MLflow\n(Model Registry)"] + Training["Training Pipeline\n(Spark + Airflow)"] + end + + subgraph Streaming["Event Streaming"] + Kafka["Apache Kafka"] + Flink["Apache Flink"] + end + + subgraph Observability["Observability 
Stack"] + Prometheus["Prometheus"] + Grafana["Grafana"] + ELK["ELK / Loki"] + Evidently["Evidently AI\n(ML Monitoring)"] + end + + Clients --> CDN + CDN --> LB + LB --> APIGW + APIGW --> ServingLayer + RecSvc --> CandGen --> Ranker --> ReRanker + ReRanker --> RecSvc + CandGen --> Redis + CandGen --> Milvus + CandGen --> ES + Ranker --> FeatureStore + ModelServing --> Ranker + MLflow --> ModelServing + Training --> MLflow + DataStores --> Training + Kafka --> Flink --> FeatureStore + ServingLayer --> Kafka + ServingLayer --> Observability + Engine --> Observability +``` + +--- + +## Core Components + +### API Gateway +- **Role**: Single entry point for all client traffic +- **Responsibilities**: JWT authentication, rate limiting (100 req/min per user), request routing, TLS termination, request/response logging +- **Technology**: NGINX / AWS API Gateway + +### Recommendation Service +- **Role**: Orchestrates the full recommendation pipeline +- **Responsibilities**: Calls candidate generation, ranking, and re-ranking; manages caching; returns final recommendations +- **Technology**: FastAPI (Python) + +### Candidate Generation +- **Role**: Retrieves 500–2000 candidate restaurants from multiple retrieval sources +- **Sources**: Collaborative filtering, content-based (vector ANN), geo/location filtering, popularity/trending, cuisine matching, knowledge graph +- **Technology**: Milvus (HNSW), Elasticsearch, Redis + +### Scoring & Ranking +- **Role**: Scores and ranks the candidate set down to top 50–200 +- **Approach**: Multi-objective ranking model combining relevance, distance, price, and rating signals +- **Technology**: LightGBM, XGBoost, DeepFM / Wide&Deep + +### Re-ranking & Business Logic +- **Role**: Applies diversity, freshness, and business rules to produce the final top 10–50 +- **Key operations**: MMR diversity injection, operating hours filter, sponsored boost, fairness constraints + +### Feature Store +- **Role**: Serves pre-computed user and restaurant 
features with low latency
+- **Online store**: Redis Cluster (<5ms p99), **Offline store**: S3/GCS + BigQuery
+
+### ML Platform
+- **Role**: Manages the full model lifecycle β€” training, evaluation, registration, and serving
+- **Technology**: Apache Spark (training), MLflow (registry), TF Serving / Triton (inference)
+
+### Event Streaming
+- **Role**: Processes real-time user behavior events to update session features and trigger near-real-time re-scoring
+- **Technology**: Apache Kafka (transport), Apache Flink (stream processing)
+
+---
+
+## Request Lifecycle
+
+```mermaid
+sequenceDiagram
+    autonumber
+    participant C as Client
+    participant CDN as CDN/Edge
+    participant GW as API Gateway
+    participant Svc as Rec Service
+    participant Cache as Redis Cache
+    participant CG as Candidate Gen
+    participant FS as Feature Store
+    participant Rank as Ranker
+    participant RR as Re-ranker
+
+    C->>CDN: GET /api/v1/recommendations/homepage
+    alt Cache HIT (CDN)
+        CDN-->>C: 200 OK (cached, <5ms)
+    else Cache MISS
+        CDN->>GW: Forward request
+        GW->>GW: Auth + Rate limit check (~2ms)
+        GW->>Svc: Route request
+
+        Svc->>Cache: Lookup user recommendation cache
+        alt Redis Cache HIT
+            Cache-->>Svc: Cached recommendations (~2ms)
+            Svc-->>GW: 200 OK
+            GW-->>CDN: Response
+            CDN-->>C: 200 OK
+        else Cache MISS
+            Svc->>FS: Fetch user features (~2ms)
+            FS-->>Svc: user_embedding, segments, prefs
+
+            Svc->>CG: Generate candidates (~10ms)
+            CG-->>Svc: 500-2000 restaurant candidates
+
+            Svc->>FS: Batch fetch restaurant features (~3ms)
+            FS-->>Svc: restaurant features
+
+            Svc->>Rank: Score & rank candidates (~12ms)
+            Rank-->>Svc: top 50-200 ranked
+
+            Svc->>RR: Apply business rules (~3ms)
+            RR-->>Svc: final top 10-50
+
+            Svc->>Cache: Store result (TTL 30min)
+            Svc-->>GW: 200 OK
+            GW-->>CDN: Response
+            CDN-->>C: 200 OK
+        end
+    end
+```
+
+---
+
+## Latency Budget
+
+The system targets **p50 ≀ 60ms** and **p99 ≀ 150ms** for the homepage recommendation endpoint (cache miss path). 
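As a quick sanity check, the per-stage budgets in the table below sum to the stated end-to-end totals and sit comfortably inside the targets (a sketch with figures copied from the table; note that summing per-stage p99s is a conservative upper bound, since stages rarely all hit their p99 on the same request):

```python
# Per-stage latency budget in ms, copied from the table below: (p50, p99).
budget = {
    "network_tls":      (5, 15),
    "auth_rate_limit":  (2, 5),
    "user_features":    (2, 5),
    "candidate_gen":    (10, 25),
    "restaurant_feats": (3, 8),
    "scoring_ranking":  (12, 35),
    "re_ranking":       (3, 8),
    "serialization":    (2, 5),
}

total_p50 = sum(p50 for p50, _ in budget.values())
total_p99 = sum(p99 for _, p99 in budget.values())

assert total_p50 == 39 and total_p50 <= 60   # cache-miss p50 within the 60ms target
assert total_p99 == 106 and total_p99 <= 150  # cache-miss p99 within the 150ms target
```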
+ +| Step | Component | p50 | p99 | Notes | +|------|-----------|-----|-----|-------| +| Network + TLS | Client β†’ API Gateway | ~5ms | ~15ms | CDN proximity helps | +| Auth & Rate Limit | API Gateway | ~2ms | ~5ms | JWT verification | +| User Feature Fetch | Feature Store (Redis) | ~2ms | ~5ms | Single key lookup | +| Candidate Generation | Milvus ANN + ES Geo | ~10ms | ~25ms | Parallel retrieval | +| Restaurant Feature Fetch | Feature Store (Redis) | ~3ms | ~8ms | Batch pipeline key lookup | +| Scoring & Ranking | ML Inference (Triton) | ~12ms | ~35ms | Batched inference | +| Re-ranking | Rule engine | ~3ms | ~8ms | In-memory computation | +| Serialization + Response | Rec Service | ~2ms | ~5ms | JSON serialization | +| **Total (cache miss)** | **End-to-end** | **~39ms** | **~106ms** | Within budget | +| **Total (cache hit)** | **Redis cache** | **~5ms** | **~15ms** | Dominant path | + +> **Cache hit rate target**: β‰₯80% for homepage endpoint (30min TTL in Redis). + +--- + +## Deployment Topology + +The system runs as containerized microservices on Kubernetes. 
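A useful property of the two-zone layout in the diagram below is that losing an entire availability zone still satisfies the `minAvailable: 2` PodDisruptionBudget defined for the Recommendation Service in the deployment doc. A minimal sketch of that capacity check, assuming the per-zone replica counts shown in the diagram:

```python
# Rec Service replicas per AZ, as shown in the topology diagram below.
replicas = {"zone_a": 3, "zone_b": 3}
MIN_AVAILABLE = 2  # PodDisruptionBudget floor for the Recommendation Service

def survives_zone_loss(replicas_per_zone: dict[str, int], min_available: int) -> bool:
    """True if losing any single zone still leaves >= min_available pods serving."""
    total = sum(replicas_per_zone.values())
    return all(total - lost >= min_available for lost in replicas_per_zone.values())

assert survives_zone_loss(replicas, MIN_AVAILABLE)           # 3 pods remain after a one-zone outage
assert not survives_zone_loss({"only_zone": 3}, MIN_AVAILABLE)  # a single-zone layout fails the check
```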
+ +```mermaid +graph TB + subgraph Zone_A["Availability Zone A"] + APIPOD_A["API Pod (3 replicas)"] + RecPOD_A["Rec Service Pod (3 replicas)"] + FlinkPOD_A["Flink TaskManager"] + end + + subgraph Zone_B["Availability Zone B"] + APIPOD_B["API Pod (3 replicas)"] + RecPOD_B["Rec Service Pod (3 replicas)"] + FlinkPOD_B["Flink TaskManager"] + end + + subgraph DataTier["Data Tier (Multi-AZ)"] + RedisPrimary[("Redis Primary")] + RedisReplica[("Redis Replica")] + PGPrimary[("PostgreSQL Primary")] + PGReplica[("PostgreSQL Read Replica")] + KafkaCluster[("Kafka Cluster\n(3 brokers)")] + end + + subgraph MLTier["ML Tier (GPU nodes)"] + TritonA["Triton Server\n(GPU Node A)"] + TritonB["Triton Server\n(GPU Node B)"] + end + + LB["Load Balancer"] --> Zone_A + LB --> Zone_B + Zone_A --> DataTier + Zone_B --> DataTier + Zone_A --> MLTier + Zone_B --> MLTier + RedisPrimary <--> RedisReplica + PGPrimary <--> PGReplica +``` + +### Kubernetes Namespaces + +| Namespace | Services | +|-----------|----------| +| `rr-serving` | API Gateway, Recommendation Service, Search Service, Catalog Service | +| `rr-ml` | Model serving (Triton), Feature Store workers | +| `rr-streaming` | Kafka brokers, Flink Job/TaskManagers | +| `rr-pipeline` | Airflow webserver, scheduler, workers | +| `rr-observability` | Prometheus, Grafana, Loki, Evidently | +| `rr-data` | PostgreSQL, Redis, Elasticsearch, Milvus | diff --git a/docs/architecture/tech-stack.md b/docs/architecture/tech-stack.md new file mode 100644 index 0000000..e34d1df --- /dev/null +++ b/docs/architecture/tech-stack.md @@ -0,0 +1,160 @@ +# Tech Stack + +## Table of Contents + +- [Overview](#overview) +- [API & Serving Layer](#api--serving-layer) +- [ML & Recommendation Layer](#ml--recommendation-layer) +- [Data Storage Layer](#data-storage-layer) +- [Streaming & Batch Processing](#streaming--batch-processing) +- [ML Operations](#ml-operations) +- [Observability](#observability) +- [Infrastructure & CI/CD](#infrastructure--cicd) +- 
[Alternatives Considered](#alternatives-considered) + +--- + +## Overview + +The technology choices for the Restaurant Recommendation System prioritize: +1. **Low latency** β€” p50 <60ms, p99 <150ms for recommendation serving +2. **Scalability** β€” horizontally scalable stateless services +3. **ML Flexibility** β€” support rapid model iteration and A/B testing +4. **Operational Maturity** β€” proven technologies with strong community and tooling + +--- + +## API & Serving Layer + +| Component | Technology | Version | Justification | +|-----------|-----------|---------|---------------| +| Primary API | **FastAPI** (Python) | β‰₯0.100 | Async I/O, native Pydantic validation, auto OpenAPI docs, extensive ML ecosystem integration | +| High-throughput endpoints | **Go** (net/http) | β‰₯1.21 | Sub-millisecond overhead for catalog lookups and health checks; lower memory footprint per connection | +| API Gateway | **NGINX** / AWS API Gateway | β€” | Battle-tested reverse proxy; supports JWT validation, rate limiting, TLS termination | +| Load Balancer | **AWS ALB** / NGINX | β€” | L7 routing, health checks, sticky sessions for stateful A/B buckets | +| Service Mesh | **Istio** | β€” | mTLS between services, fine-grained traffic management, distributed tracing | + +### Justification: FastAPI vs. 
Flask/Django +FastAPI was chosen over Flask because: +- Native `async`/`await` support eliminates blocking I/O bottlenecks during feature-store and database calls +- Pydantic models enforce schema validation with zero extra code +- Auto-generated OpenAPI documentation reduces maintenance burden +- ~3Γ— higher throughput than synchronous Flask for I/O-bound workloads + +--- + +## ML & Recommendation Layer + +| Component | Technology | Version | Justification | +|-----------|-----------|---------|---------------| +| Deep Learning Framework | **PyTorch** | β‰₯2.0 | Dynamic computation graphs accelerate research; TorchScript for production export | +| Gradient Boosting | **LightGBM** | β‰₯4.0 | Fastest training on tabular ranking features; native ranking loss (LambdaRank) | +| Gradient Boosting (alt) | **XGBoost** | β‰₯2.0 | Complementary to LightGBM; better on sparse features | +| Deep Ranking Models | **TensorFlow** | β‰₯2.13 | Wide&Deep, DIN models production-deployed via TF Serving | +| ANN / Vector Search | **Milvus** | β‰₯2.3 | Purpose-built for billion-scale ANN; HNSW + IVF indexes; cloud-native | +| Full-text & Geo Search | **Elasticsearch** | β‰₯8.x | Geo-distance queries, BM25 text ranking, real-time index updates | +| Recommendation Frameworks | **RecBole** / custom | β€” | Rapid prototyping of CF, BPR, Two-Tower baselines | + +### Justification: PyTorch vs. 
TensorFlow +- PyTorch used for **research and Two-Tower / Neural CF** training due to flexible architecture design +- TensorFlow used for **production serving** (Wide&Deep, DIN) due to mature TF Serving ecosystem +- Both frameworks are maintained in the same ML platform; model conversion via ONNX where needed + +--- + +## Data Storage Layer + +| Component | Technology | Version | Justification | +|-----------|-----------|---------|---------------| +| Primary OLTP | **PostgreSQL** | β‰₯15 | ACID transactions for user accounts, restaurant catalog, orders; PostGIS extension for geo queries | +| Cache / Online Feature Store | **Redis Cluster** | β‰₯7.0 | Sub-millisecond key lookups; native data structures (Sorted Sets for rankings, Hashes for feature vectors) | +| Vector Database | **Milvus** | β‰₯2.3 | Persistent ANN search; supports 128–768d embeddings; horizontal sharding | +| Search Engine | **Elasticsearch** | β‰₯8.x | Restaurant geo search, full-text menu/description search, aggregations | +| Offline Feature Store | **S3 / GCS** (Parquet) | β€” | Cost-effective columnar storage for historical features; Spark-compatible | +| Analytics / BI | **BigQuery** | β€” | Serverless analytics on offline feature data; direct connector for Metabase | +| Object Storage | **S3 / GCS** | β€” | Model artifacts, restaurant photos, log archives | + +### Justification: Redis vs. Memcached +Redis was chosen because: +- Rich data structures (Sorted Sets, Hashes) map naturally to recommendation scores and feature vectors +- Persistence (AOF/RDB) provides warm-restart capability +- Cluster mode supports horizontal sharding for terabyte-scale feature data +- Pub/Sub used for real-time feature invalidation + +### Justification: Milvus vs. 
Pinecone / Weaviate +- Milvus is fully open-source and self-hosted (no vendor lock-in) +- Supports multiple index types (HNSW, IVF_PQ) allowing latency/recall trade-off tuning +- Native Kubernetes operator for deployment +- At 100M+ restaurant embeddings, self-hosted is significantly more cost-effective + +--- + +## Streaming & Batch Processing + +| Component | Technology | Version | Justification | +|-----------|-----------|---------|---------------| +| Message Broker | **Apache Kafka** | β‰₯3.5 | High-throughput, durable event log; partitioned for parallelism; compacted topics for CDC | +| Stream Processing | **Apache Flink** | β‰₯1.17 | Exactly-once semantics; stateful stream processing for session aggregation; native Kafka connector | +| Batch Processing | **Apache Spark** | β‰₯3.4 | Distributed large-scale feature engineering; PySpark for ML pipelines | +| Workflow Orchestration | **Apache Airflow** | β‰₯2.7 | DAG-based pipelines; extensive operator ecosystem; native Kubernetes executor | + +### Justification: Flink vs. 
Spark Streaming +- Flink provides **true streaming** (per-event) with low latency (~100ms), critical for <5s session feature updates +- Spark Streaming (micro-batch) introduces 1–5s batch delays, insufficient for real-time personalization +- Flink's stateful operators simplify session window aggregations +- Spark Structured Streaming used for batch-heavy ETL jobs where Flink is unnecessary overhead + +--- + +## ML Operations + +| Component | Technology | Justification | +|-----------|-----------|---------------| +| Model Registry | **MLflow** | Experiment tracking, model versioning, artifact store; integrates with Spark and PyTorch/TF | +| Model Serving | **NVIDIA Triton Inference Server** | GPU-optimized; supports TensorFlow, PyTorch, ONNX; dynamic batching; concurrent model execution | +| TF Model Serving | **TensorFlow Serving** | Native TF SavedModel serving; optimized for Wide&Deep / DIN models | +| ML Monitoring | **Evidently AI** | Open-source; data/prediction drift reports; integrates with Grafana dashboards | +| Feature Engineering | **Feast** (partial) / custom | Online/offline feature consistency; restaurant-domain custom features handled by bespoke pipelines | + +--- + +## Observability + +| Component | Technology | Justification | +|-----------|-----------|---------------| +| Metrics | **Prometheus** | De-facto standard; pull-based scraping; rich ecosystem of exporters | +| Dashboards | **Grafana** | Native Prometheus datasource; Loki log integration; alerting rules | +| Log Aggregation | **ELK Stack** (Elasticsearch + Logstash + Kibana) / **Grafana Loki** | ELK for complex log search; Loki for cost-effective label-indexed log storage | +| Distributed Tracing | **Jaeger** | OpenTelemetry-compatible; end-to-end request trace for latency debugging | +| Business Metrics | **Metabase** | Self-hosted BI; connects to BigQuery/PostgreSQL; used by product/business teams | +| Alerting | **PagerDuty** + Grafana Alerts | On-call routing with escalation policies; 
Grafana alert rules for thresholds | + +--- + +## Infrastructure & CI/CD + +| Component | Technology | Justification | +|-----------|-----------|---------------| +| Containerization | **Docker** (multi-stage builds) | Reproducible builds; minimized image sizes via multi-stage patterns | +| Orchestration | **Kubernetes** (EKS / GKE) | Industry standard; supports GPU nodes for ML inference; auto-scaling | +| Service Mesh | **Istio** | mTLS, traffic policies, circuit breaking | +| CI/CD | **GitHub Actions** | Native GitHub integration; matrix builds; OIDC-based cloud auth; cost-effective | +| Infrastructure as Code | **Terraform** | Declarative cloud resource management; state management with remote backends | +| Secret Management | **AWS Secrets Manager** / **HashiCorp Vault** | Centralized secret rotation; Kubernetes external-secrets operator | +| Container Registry | **AWS ECR** / **GCR** | Private registry with vulnerability scanning; integrated with EKS/GKE | + +--- + +## Alternatives Considered + +| Decision | Chosen | Alternative | Reason Not Chosen | +|----------|--------|-------------|------------------| +| API Framework | FastAPI | Flask, Django REST | Flask lacks native async; Django too heavyweight for microservice | +| ML Serving | Triton | BentoML, Seldon | Triton has superior GPU utilization and dynamic batching | +| Vector DB | Milvus | Pinecone, Weaviate | Pinecone: vendor lock-in, cost at scale; Weaviate: less mature HNSW performance | +| Stream Processing | Flink | Spark Streaming, Kinesis | Spark Streaming micro-batch too slow; Kinesis: AWS vendor lock-in | +| Workflow Orchestration | Airflow | Prefect, Dagster | Airflow: largest community, most operators; Prefect/Dagster less mature at enterprise scale | +| Message Broker | Kafka | RabbitMQ, AWS SQS | RabbitMQ not designed for high-throughput log streaming; SQS: vendor lock-in | +| Monitoring | Prometheus + Grafana | Datadog, New Relic | Datadog/New Relic: high cost at scale; Prometheus 
open-source with equivalent capability | +| Model Registry | MLflow | W&B, Neptune | W&B/Neptune: SaaS cost; MLflow self-hosted, fully open-source | +| Relational DB | PostgreSQL | MySQL, CockroachDB | PostgreSQL: PostGIS extension critical for geo queries; CockroachDB adds distributed overhead | diff --git a/docs/infrastructure/deployment.md b/docs/infrastructure/deployment.md new file mode 100644 index 0000000..e3f225e --- /dev/null +++ b/docs/infrastructure/deployment.md @@ -0,0 +1,357 @@ +# Deployment + +## Table of Contents + +- [Overview](#overview) +- [Deployment Architecture Diagram](#deployment-architecture-diagram) +- [Container Strategy](#container-strategy) +- [Kubernetes Configuration](#kubernetes-configuration) +- [CI/CD Pipeline](#cicd-pipeline) +- [Environments](#environments) +- [Scaling Strategy](#scaling-strategy) + +--- + +## Overview + +The Restaurant Recommendation system is deployed as containerized microservices on Kubernetes. The deployment targets AWS EKS (or GKE) with multi-availability-zone redundancy and GPU node pools for ML inference. 
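The canary gate referenced in the CI/CD pipeline below (`monitor-canary.sh --error-threshold 0.01`) boils down to comparing the canary's observed error rate against a 1% ceiling over the watch window. A minimal sketch of that check β€” the function and sample shapes here are illustrative, not the actual script:

```python
ERROR_THRESHOLD = 0.01  # mirrors --error-threshold 0.01 in the CI/CD excerpt

def canary_healthy(samples: list[tuple[int, int]], threshold: float = ERROR_THRESHOLD) -> bool:
    """samples: (error_count, request_count) pairs, one per scrape interval.

    Returns True when the aggregate error rate over the window is at or
    below the threshold, i.e. the canary may be promoted.
    """
    errors = sum(e for e, _ in samples)
    requests = sum(r for _, r in samples)
    return requests > 0 and errors / requests <= threshold

# ~0.2% error rate over three intervals: promote
assert canary_healthy([(1, 500), (0, 480), (2, 510)])
# ~5.6% error rate: fail the gate and roll back
assert not canary_healthy([(30, 500), (25, 480)])
```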
+ +--- + +## Deployment Architecture Diagram + +```mermaid +graph TB + subgraph Internet["Internet"] + Users["πŸ‘€ Users\n(Mobile / Web)"] + Partners["πŸ”— Partners\n(API clients)"] + end + + subgraph AWS["AWS / GCP Cloud"] + subgraph CDN["CDN Layer"] + CloudFront["CloudFront / Cloud CDN\n(Edge caching 5-15min TTL)"] + end + + subgraph Ingress["Ingress Layer"] + ALB["Application Load Balancer\n(L7, SSL termination)"] + NGINX["NGINX Ingress Controller\n(Kubernetes)"] + end + + subgraph K8s["Kubernetes Cluster (EKS / GKE)"] + subgraph NS_Serving["namespace: rr-serving"] + APIGW["API Gateway Pods\n(3+ replicas)"] + RecSvc["Recommendation Service\n(3+ replicas, FastAPI)"] + SearchSvc["Search Re-rank Service\n(2+ replicas)"] + CatalogSvc["Catalog Service\n(2+ replicas, Go)"] + end + + subgraph NS_ML["namespace: rr-ml"] + Triton["Triton Inference Server\n(2+ GPU replicas)"] + FeatureWorker["Feature Store Workers\n(2+ replicas)"] + end + + subgraph NS_Streaming["namespace: rr-streaming"] + KafkaBrokers["Kafka Brokers\n(3 replicas)"] + FlinkJM["Flink JobManager\n(1 replica, HA)"] + FlinkTM["Flink TaskManagers\n(3+ replicas)"] + end + + subgraph NS_Pipeline["namespace: rr-pipeline"] + AirflowWeb["Airflow Webserver\n(1 replica)"] + AirflowSched["Airflow Scheduler\n(1 replica)"] + AirflowWorkers["Airflow Workers\n(2+ replicas)"] + end + + subgraph NS_Observability["namespace: rr-observability"] + Prometheus["Prometheus"] + Grafana["Grafana"] + Loki["Loki"] + Jaeger["Jaeger"] + end + end + + subgraph DataTier["Managed Data Services"] + RDSPostgres[("RDS PostgreSQL\n(Multi-AZ)")] + ElastiCache[("ElastiCache Redis\n(Cluster mode, 6 shards)")] + MilvusCluster[("Milvus Cluster\n(3 replicas)")] + ES[("Amazon OpenSearch\n(3 nodes)")] + S3[("S3 / GCS\n(Feature store, artifacts)")] + BigQuery[("BigQuery\n(Analytics)")] + end + end + + Users --> CloudFront + Partners --> ALB + CloudFront --> ALB + ALB --> NGINX + NGINX --> NS_Serving + NS_Serving --> NS_ML + NS_Serving --> DataTier + 
NS_ML --> DataTier
+    NS_Streaming --> DataTier
+    NS_Pipeline --> DataTier
+    NS_Serving --> NS_Streaming
+```
+
+---
+
+## Container Strategy
+
+### Docker Multi-Stage Build Pattern
+
+All services use multi-stage Docker builds to minimize image size and attack surface:
+
+```dockerfile
+# Stage 1: Builder
+FROM python:3.11-slim AS builder
+WORKDIR /app
+COPY requirements.txt .
+# Install into an isolated prefix so the runtime stage can copy it wholesale
+RUN pip install --no-cache-dir --prefix=/install -r requirements.txt
+
+# Stage 2: Runtime
+FROM python:3.11-slim AS runtime
+WORKDIR /app
+
+# Copy only installed packages from builder into /usr/local
+# (world-readable β€” /root/.local would be unreadable once we drop to appuser)
+COPY --from=builder /install /usr/local
+
+# Copy application code
+COPY src/ ./src/
+COPY config/ ./config/
+
+ENV PYTHONDONTWRITEBYTECODE=1
+ENV PYTHONUNBUFFERED=1
+
+# Non-root user for security
+RUN groupadd -r appuser && useradd -r -g appuser appuser
+USER appuser
+
+EXPOSE 8000
+CMD ["uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8000"]
+```
+
+### Image Registry
+
+| Service | Registry | Image Tag Strategy |
+|---------|----------|-------------------|
+| All services | AWS ECR / GCR | `{service}:{git-sha}` for traceability |
+| ML models | ECR | `{service}:{model-version}` |
+| Base images | ECR (mirrored) | Pinned SHA for reproducibility |
+
+### Image Security
+
+- All images scanned with **Trivy** in CI before push
+- Base images updated weekly (automated PR)
+- No `latest` tag in production β€” all images pinned to SHA or semantic version
+
+---
+
+## Kubernetes Configuration
+
+### Namespace Strategy
+
+| Namespace | Services | Network Policy |
+|-----------|----------|---------------|
+| `rr-serving` | API Gateway, Recommendation, Search, Catalog | Ingress from load balancer only |
+| `rr-ml` | Triton, Feature Workers | Ingress from `rr-serving` only |
+| `rr-streaming` | Kafka, Flink | Ingress from `rr-serving`, `rr-pipeline` |
+| `rr-pipeline` | Airflow | No ingress from external |
+| `rr-observability` | Prometheus, Grafana, Loki, Jaeger | Ingress from 
admin network only | +| `rr-data` | Redis, PostgreSQL (if self-hosted) | Ingress from `rr-serving`, `rr-ml`, `rr-pipeline` | + +### Resource Requests & Limits + +| Service | CPU Request | CPU Limit | Memory Request | Memory Limit | +|---------|-------------|-----------|----------------|-------------| +| Recommendation Service | 500m | 2000m | 512Mi | 2Gi | +| API Gateway | 250m | 1000m | 256Mi | 512Mi | +| Search Service | 500m | 1500m | 512Mi | 1Gi | +| Catalog Service (Go) | 100m | 500m | 128Mi | 256Mi | +| Triton (GPU) | 2000m | 4000m | 4Gi | 8Gi | +| Flink TaskManager | 1000m | 2000m | 2Gi | 4Gi | +| Airflow Worker | 500m | 2000m | 1Gi | 4Gi | + +### Horizontal Pod Autoscaler (HPA) + +```yaml +# Recommendation Service HPA +apiVersion: autoscaling/v2 +kind: HorizontalPodAutoscaler +metadata: + name: recommendation-service-hpa + namespace: rr-serving +spec: + scaleTargetRef: + apiVersion: apps/v1 + kind: Deployment + name: recommendation-service + minReplicas: 3 + maxReplicas: 20 + metrics: + - type: Resource + resource: + name: cpu + target: + type: Utilization + averageUtilization: 70 + - type: Pods + pods: + metric: + name: http_requests_per_second + target: + type: AverageValue + averageValue: "500" +``` + +### Pod Disruption Budgets + +All production services have PDBs to ensure availability during rolling updates: + +```yaml +apiVersion: policy/v1 +kind: PodDisruptionBudget +metadata: + name: recommendation-service-pdb + namespace: rr-serving +spec: + minAvailable: 2 + selector: + matchLabels: + app: recommendation-service +``` + +--- + +## CI/CD Pipeline + +### GitHub Actions Workflow + +```mermaid +flowchart LR + PR["Pull Request"] --> Lint["Lint & Type Check\n(ruff, mypy)"] + Lint --> Test["Unit & Integration Tests\n(pytest)"] + Test --> Security["Security Scan\n(Trivy image scan\nCodeQL SAST)"] + Security --> Build["Docker Build\n(multi-stage)"] + Build --> Push["Push to ECR\n(tagged with git SHA)"] + Push --> DeployDev["Deploy to Dev\n(auto on merge to 
main)"]
+    DeployDev --> SmokeTest["Smoke Tests\n(API health checks)"]
+    SmokeTest --> DeployStaging["Deploy to Staging\n(auto after smoke pass)"]
+    DeployStaging --> IntegrationTest["Integration Tests\n(E2E test suite)"]
+    IntegrationTest --> ManualApproval["Manual Approval\n(required for production)"]
+    ManualApproval --> DeployProd["Deploy to Production\n(canary β†’ blue-green)"]
+    DeployProd --> ProdVerify["Production Verification\n(automated health checks)"]
+```
+
+### Workflow File Structure
+
+```
+.github/workflows/
+β”œβ”€β”€ ci.yml                 # Lint, test, security scan (on PR)
+β”œβ”€β”€ build-push.yml         # Docker build & push (on merge to main)
+β”œβ”€β”€ deploy-dev.yml         # Auto-deploy to dev (on merge to main)
+β”œβ”€β”€ deploy-staging.yml     # Auto-deploy to staging (after dev smoke tests)
+β”œβ”€β”€ deploy-production.yml  # Manual-approval production deploy
+└── model-deploy.yml       # ML model canary deployment workflow
+```
+
+### CI Pipeline (`ci.yml`) Steps
+
+| Step | Tool | Failure Mode |
+|------|------|-------------|
+| Lint (Python) | `ruff check` | Block merge |
+| Type check | `mypy` | Block merge |
+| Unit tests | `pytest` (>90% coverage required) | Block merge |
+| Integration tests | `pytest -m integration` | Block merge |
+| Image security scan | Trivy | Block merge (CRITICAL/HIGH CVEs) |
+| SAST | CodeQL | Advisory (non-blocking for now) |
+| Dependency check | `pip-audit` | Block merge |
+
+### Deployment Strategy in CI/CD
+
+```yaml
+# deploy-production.yml (excerpt)
+steps:
+  - name: Deploy Canary (5%)
+    run: |
+      # Roll the new image out to the dedicated canary Deployment only;
+      # the stable Deployment keeps serving the current version
+      kubectl set image deployment/recommendation-service-canary \
+        recommendation-service=$IMAGE_TAG \
+        --namespace rr-serving
+      # Route 5% of traffic to the canary subset via the Istio VirtualService
+      # (the Service selector is left untouched so it matches both versions)
+      kubectl patch virtualservice recommendation-service --type merge \
+        -p '{"spec":{"http":[{"route":[{"destination":{"host":"recommendation-service","subset":"stable"},"weight":95},{"destination":{"host":"recommendation-service","subset":"canary"},"weight":5}]}]}}' \
+        --namespace rr-serving
+
+  - name: Monitor Canary (2h)
+    run: ./scripts/monitor-canary.sh --duration 2h --error-threshold 0.01
+
+  - name: Promote to 100% (Blue-Green)
+    if: 
success()
+    run: |
+      kubectl patch virtualservice recommendation-service --type merge \
+        -p '{"spec":{"http":[{"route":[{"destination":{"host":"recommendation-service","subset":"canary"},"weight":100}]}]}}' \
+        --namespace rr-serving
+```
+
+---
+
+## Environments
+
+| Environment | Purpose | Cluster Size | Data | Auto-Deploy |
+|-------------|---------|-------------|------|-------------|
+| **Development** (`dev`) | Feature development, unit testing | Small (1 replica per service) | Synthetic data | Yes (on merge to `main`) |
+| **Staging** (`staging`) | Integration testing, QA, load testing | Medium (2 replicas per service) | Anonymized production copy | Yes (after dev smoke tests) |
+| **Production** (`prod`) | Live traffic | Full HA (3+ replicas, multi-AZ) | Real data | Manual approval required |
+
+### Environment-Specific Configuration
+
+| Configuration | Dev | Staging | Production |
+|--------------|-----|---------|-----------|
+| Redis | Single node | 3-shard cluster | 6-shard cluster |
+| PostgreSQL | RDS t3.medium | RDS r6g.large | RDS r6g.2xlarge (Multi-AZ) |
+| Kafka | 1 broker | 3 brokers | 3 brokers (Multi-AZ) |
+| GPU nodes | None (CPU-only inference) | 1Γ— GPU node | 2+ GPU nodes per AZ |
+| Log retention | 7 days | 30 days | 90 days hot / 1yr cold |
+| Monitoring | Basic | Full | Full + PagerDuty alerting |
+
+---
+
+## Scaling Strategy
+
+### API Service (CPU-Based HPA)
+
+- **Min replicas**: 3 (across 3 AZs)
+- **Max replicas**: 20
+- **Scale-up trigger**: CPU utilization >70% for 60s
+- **Scale-down trigger**: CPU utilization <40% for 300s
+- **Scale-down stabilization**: 5 minutes (prevent flapping)
+
+### ML Inference (GPU-Based HPA)
+
+- **Technology**: KEDA (Kubernetes Event-Driven Autoscaling) with Prometheus metrics
+- **Min replicas**: 2 (1 per AZ)
+- **Max replicas**: 8
+- **Scale-up trigger**: Triton queue depth >50 requests
+- **Scale-down trigger**: Queue depth <5 for 5 minutes
+- **Node provisioner**: Karpenter (AWS) / Cluster Autoscaler for GPU 
node pool
+
+### Redis (Cluster Mode)
+
+- **Sharding**: 6 shards, consistent hash slots
+- **Replicas**: 1 replica per shard (total 12 nodes)
+- **Scaling**: Manual shard addition (requires data rebalancing, planned maintenance)
+- **Memory target**: <70% utilization before scaling
+
+### Kafka (Partition-Based)
+
+- **Partitions**: `user-events-raw` β€” 24 partitions (matching a Flink consumer parallelism of 24)
+- **Retention**: 7 days (raw events), 24h (processed events)
+- **Scaling trigger**: Consumer lag >50,000 messages for >10 minutes
+- **Broker scaling**: Add broker + rebalance partitions (handled via Confluent Operator)
+
+### Flink (TaskManager Scaling)
+
+- **Min TaskManagers**: 3
+- **Max TaskManagers**: 12
+- **Scaling**: Based on Kafka consumer lag via KEDA
+- **Checkpoint interval**: 30 seconds (bounds event replay on recovery to ~30s)
diff --git a/docs/modules/01-data-ingestion.md b/docs/modules/01-data-ingestion.md
new file mode 100644
index 0000000..84d8cb1
--- /dev/null
+++ b/docs/modules/01-data-ingestion.md
@@ -0,0 +1,343 @@
+# Module 01 β€” Data Ingestion
+
+## Table of Contents
+
+- [Overview](#overview)
+- [User Behavior Tracking](#user-behavior-tracking)
+- [Restaurant Catalog Management](#restaurant-catalog-management)
+- [User Profile Aggregation](#user-profile-aggregation)
+- [Review & Rating Ingestion](#review--rating-ingestion)
+- [Data Schemas](#data-schemas)
+
+---
+
+## Overview
+
+The Data Ingestion module is responsible for collecting, validating, enriching, and storing all data that feeds the recommendation system. It covers four primary data domains:
+
+1. **User Behavior** β€” clickstream events, sessions, and interaction history
+2. **Restaurant Catalog** β€” structured restaurant data with domain-specific attributes
+3. **User Profiles** β€” aggregated behavioral and demographic profiles
+4. 
**Reviews & Ratings** β€” user-generated content with sentiment enrichment + +--- + +## User Behavior Tracking + +### Event Types + +| Event Type | Trigger | Key Signals | +|-----------|---------|-------------| +| `RESTAURANT_VIEW` | Restaurant detail page opened | dwell_time_ms, scroll_depth_pct | +| `RESTAURANT_CLICK` | Restaurant card clicked in listing | position, source_page, rank | +| `MENU_VIEW` | Menu page viewed | menu_section, dwell_time_ms | +| `ORDER_PLACED` | Order successfully submitted | order_value, items_count, payment_method | +| `REVIEW_SUBMITTED` | User submits review | rating_value, review_length | +| `SEARCH` | User performs search | query_text, results_count, selected_position | +| `FILTER_APPLIED` | User applies a search/listing filter | filter_type, filter_value | +| `FAVORITE_ADDED` | User adds restaurant to favorites | source_page | + +### Session Management + +- A **session** begins with the first event after a 30-minute idle gap +- Sessions are tracked by `session_id` (UUID), assigned server-side +- Session features computed in real-time via Flink: total events, unique restaurants viewed, cuisine diversity, avg dwell time, click-through items +- Maximum session duration: 4 hours (forced close) + +### Clickstream Pipeline + +``` +Mobile/Web SDK β†’ Kafka (user-events-raw) β†’ Flink Validation & Enrichment + β†’ Kafka (user-events-enriched) β†’ Session Aggregation β†’ Redis (session features) + β†’ BigQuery (analytics) + S3 (cold archive) +``` + +--- + +## Restaurant Catalog Management + +### Data Sources + +| Source | Sync Method | Frequency | +|--------|-------------|-----------| +| Partner POS systems | Webhook / REST push | Near-real-time (<1 min) | +| Google Places API | Batch pull | Daily | +| Internal manual entries | Admin API | On-demand | +| Review platforms | Batch scraping (licensed) | Daily | + +### Catalog Enrichment Pipeline + +After ingestion, each restaurant record is enriched: + +1. 
**NLP Tag Generation** β€” BERT-based model extracts cuisine tags, ambiance descriptors, and menu highlights from description text +2. **Text Embedding** β€” Sentence-BERT generates 768d text embedding from name + description + cuisine tags +3. **Image Embedding** β€” ResNet-50 generates 512d image embedding from the primary restaurant photo +4. **Geohash Encoding** β€” Latitude/longitude encoded to geohash precision 7 (~150m Γ— 150m cell) +5. **Operating Hours Normalization** β€” Hours stored in UTC, `is_open_now` computed at query time +6. **Price Tier Standardization** β€” Mapped to 4 tiers: `$` (<$15), `$$` ($15–$30), `$$$` ($30–$60), `$$$$` (>$60) per person + +### Restaurant Domain Attributes + +The restaurant domain requires specialized fields beyond generic catalog items: + +| Category | Fields | Notes | +|----------|--------|-------| +| **Location** | lat, lng, geohash, address, city, district, country | Geohash used for fast proximity lookups | +| **Cuisine** | cuisine_types (array), primary_cuisine | Taxonomy of 120+ cuisine types | +| **Operations** | operating_hours (per weekday), is_open_now, reservation_required | is_open_now is a computed field | +| **Pricing** | price_tier ($–$$$$), avg_meal_price | Tier updated quarterly from order data | +| **Dietary** | vegetarian_options, vegan_options, halal, gluten_free, allergen_info | Boolean flags + allergen list | +| **Ambiance** | tags (romantic, family-friendly, business, casual, fine-dining) | Multi-value, ML-tagged | +| **Capacity** | total_seats, private_rooms, outdoor_seating, parking | Operational capacity data | +| **Ratings** | avg_rating, total_reviews, rating_breakdown (food/service/ambiance/value) | Updated daily from review ingestion | +| **Media** | photos[], menu_url | Photos stored in S3/CDN | + +--- + +## User Profile Aggregation + +### Profile Components + +| Component | Source | Update Frequency | +|-----------|--------|-----------------| +| Demographic | Registration, onboarding quiz | 
On registration / user update | +| Behavioral aggregates | Event log (Spark batch) | Daily (2:00 AM pipeline) | +| Preference vectors | CF model output | Weekly (model retrain) | +| Dietary preferences | Explicit (settings) + implicit (order history) | Real-time (explicit) / Daily (implicit) | +| User segment | Segmentation model | Daily | + +### User Segments + +| Segment | Criteria | Recommendation Strategy | +|---------|----------|------------------------| +| `cold_start` | <10 interactions | Popularity by geo + demographic | +| `warm` | 10–30 interactions | Blended CF + content-based | +| `active` | >30 interactions, active <30 days | Full personalization | +| `power_user` | >100 interactions/month | Deep personalization, early access | +| `churning` | No activity 30–90 days | Re-engagement recommendations | +| `dormant` | No activity >90 days | Fallback to popularity | + +--- + +## Review & Rating Ingestion + +### Ingestion Pipeline + +``` +Review submitted (API/webhook) β†’ Kafka (reviews-raw) + β†’ Flink: Basic validation, dedup + β†’ Sentiment Analysis service (async, <5s) + β†’ PostgreSQL (review stored) + β†’ Kafka (reviews-enriched) + β†’ Daily batch: Update restaurant avg_rating, rating_breakdown +``` + +### Sentiment Analysis + +- Model: Fine-tuned DistilBERT on restaurant review corpus +- Output: `sentiment_score` ∈ [0.0, 1.0] (0 = very negative, 1 = very positive) +- Aspect-level sentiment: food, service, ambiance, value +- Processing latency: <3s per review (async, not blocking review submission) + +--- + +## Data Schemas + +### User Event Schema + +```json +{ + "event_id": "evt_01HXYZ123ABC456DEF", + "event_type": "RESTAURANT_CLICK", + "timestamp": "2024-03-15T14:32:07.123Z", + "user_id": "usr_9f3a8b2c1d4e5f6a", + "session_id": "sess_7e2f1a3b4c5d6e7f", + "restaurant_id": "rst_3b7c9e1a2f4d5e6b", + "context": { + "device": "ios", + "platform": "mobile_app", + "app_version": "4.2.1", + "os_version": "17.2", + "location": { + "lat": 10.7769, + "lng": 
106.7009, + "accuracy_meters": 15, + "geohash": "w3gv2c" + }, + "referrer": "homepage_recommendations", + "page_position": 3 + }, + "metadata": { + "dwell_time_ms": 4200, + "scroll_depth_pct": 75, + "rank_in_list": 3, + "rec_model_version": "ranking_v2.4.1", + "experiment_variants": { + "homepage_layout": "control", + "ranking_model": "treatment_a" + } + } +} +``` + +### Restaurant Catalog Schema + +```json +{ + "restaurant_id": "rst_3b7c9e1a2f4d5e6b", + "name": "Pho Saigon House", + "description": "Authentic Vietnamese pho and street food in a casual setting.", + "phone": "+84-28-3812-3456", + "website": "https://phosaigonhouse.vn", + "location": { + "address": "123 Nguyen Hue Boulevard", + "district": "District 1", + "city": "Ho Chi Minh City", + "country": "VN", + "lat": 10.7769, + "lng": 106.7009, + "geohash": "w3gv2c" + }, + "cuisine": { + "cuisine_types": ["vietnamese", "asian", "noodles", "soup"], + "primary_cuisine": "vietnamese" + }, + "operations": { + "operating_hours": { + "monday": { "open": "07:00", "close": "22:00" }, + "tuesday": { "open": "07:00", "close": "22:00" }, + "wednesday": { "open": "07:00", "close": "22:00" }, + "thursday": { "open": "07:00", "close": "22:00" }, + "friday": { "open": "07:00", "close": "23:00" }, + "saturday": { "open": "08:00", "close": "23:00" }, + "sunday": { "open": "08:00", "close": "21:00" } + }, + "is_open_now": true, + "reservation_required": false + }, + "pricing": { + "price_tier": "$$", + "avg_meal_price_usd": 12.50 + }, + "ratings": { + "avg_rating": 4.3, + "total_reviews": 1842, + "rating_breakdown": { + "food": 4.5, + "service": 4.1, + "ambiance": 4.0, + "value": 4.6 + } + }, + "dietary": { + "vegetarian_options": true, + "vegan_options": false, + "halal": true, + "gluten_free": false, + "allergen_info": ["peanuts", "shellfish"] + }, + "ambiance": { + "tags": ["casual", "family-friendly", "quick-bite"] + }, + "capacity": { + "total_seats": 80, + "private_rooms": 0, + "outdoor_seating": true, + "parking": 
false + }, + "media": { + "photos": [ + "https://cdn.example.com/restaurants/rst_3b7c9e/photo_1.jpg", + "https://cdn.example.com/restaurants/rst_3b7c9e/photo_2.jpg" + ], + "menu_url": "https://cdn.example.com/restaurants/rst_3b7c9e/menu.pdf" + }, + "embeddings": { + "text_embedding": [0.023, -0.187, 0.412, "... (768 dimensions total)"], + "image_embedding": [0.051, 0.334, -0.112, "... (512 dimensions total)"] + }, + "metadata": { + "popularity_score": 0.82, + "status": "active", + "created_at": "2022-06-01T08:00:00Z", + "updated_at": "2024-03-15T02:30:00Z" + } +} +``` + +### User Profile Schema + +```json +{ + "user_id": "usr_9f3a8b2c1d4e5f6a", + "demographic": { + "age_range": "25-34", + "gender": "female", + "location": { + "city": "Ho Chi Minh City", + "district": "District 2", + "country": "VN", + "lat": 10.7880, + "lng": 106.7320 + } + }, + "behavioral": { + "total_interactions": 247, + "favorite_cuisines": [ + { "cuisine": "japanese", "affinity_score": 0.87 }, + { "cuisine": "vietnamese", "affinity_score": 0.76 }, + { "cuisine": "korean", "affinity_score": 0.65 } + ], + "avg_session_duration_min": 8.4, + "order_frequency_per_month": 6.2, + "last_active_at": "2024-03-15T13:45:00Z" + }, + "dietary_preferences": { + "vegetarian": false, + "vegan": false, + "halal": false, + "allergies": ["shellfish"] + }, + "price_preference": { + "preferred_price_tier": "$$", + "avg_spend_per_meal_usd": 18.00 + }, + "preference_vector": [0.12, -0.34, 0.87, "... (128 dimensions total)"], + "segment": "active", + "metadata": { + "created_at": "2022-09-14T10:00:00Z", + "updated_at": "2024-03-15T02:00:00Z" + } +} +``` + +### Review Schema + +```json +{ + "review_id": "rev_01HXZ789GHI012JKL", + "user_id": "usr_9f3a8b2c1d4e5f6a", + "restaurant_id": "rst_3b7c9e1a2f4d5e6b", + "overall_rating": 4, + "dimension_ratings": { + "food": 5, + "service": 4, + "ambiance": 4, + "value": 5 + }, + "review_text": "Amazing pho! The broth is rich and the service was friendly. 
A bit noisy during lunch rush but great value for money.", + "sentiment_score": 0.84, + "aspect_sentiments": { + "food": 0.95, + "service": 0.78, + "ambiance": 0.62, + "value": 0.91 + }, + "helpful_votes": 23, + "photos": [ + "https://cdn.example.com/reviews/rev_01HXZ789/photo_1.jpg" + ], + "visit_date": "2024-03-10", + "created_at": "2024-03-11T09:22:00Z" +} +``` diff --git a/docs/modules/02-recommendation-engine.md b/docs/modules/02-recommendation-engine.md new file mode 100644 index 0000000..f945332 --- /dev/null +++ b/docs/modules/02-recommendation-engine.md @@ -0,0 +1,269 @@ +# Module 02 β€” Recommendation Engine + +## Table of Contents + +- [Overview](#overview) +- [Stage 1: Candidate Generation](#stage-1-candidate-generation) +- [Stage 2: Scoring & Ranking](#stage-2-scoring--ranking) +- [Stage 3: Re-ranking & Business Logic](#stage-3-re-ranking--business-logic) +- [Cold Start Handling](#cold-start-handling) +- [Pipeline Architecture](#pipeline-architecture) + +--- + +## Overview + +The Recommendation Engine operates as a three-stage funnel: + +``` +All Restaurants (~100k+) + ↓ +[Stage 1: Candidate Generation] β†’ 500–2000 candidates + ↓ +[Stage 2: Scoring & Ranking] β†’ top 50–200 + ↓ +[Stage 3: Re-ranking & Business Logic] β†’ final top 10–50 +``` + +Each stage narrows the candidate set while adding more expensive (but more precise) signals. + +--- + +## Stage 1: Candidate Generation + +**Goal**: Retrieve 500–2000 relevant restaurant candidates with latency <15ms using multiple lightweight retrieval strategies in parallel. + +### Retrieval Strategies + +#### 1.1 Collaborative Filtering + +| Model | Description | Output | +|-------|-------------|--------| +| **Two-Tower** | Dual encoder: user tower (demographics + behavior) and item tower (restaurant attributes). ANN search in Milvus using HNSW index. 
| Top-N by cosine similarity | +| **Neural CF** | Dot-product interaction between learned user and item embeddings | Top-N from pre-computed index | +| **ALS** (offline) | Matrix factorization; pre-computed user–item scores stored in Redis | Lookup by user_id | +| **BPR** | Bayesian Personalized Ranking; optimizes for pairwise ranking | Offline scoring | + +#### 1.2 Content-Based Filtering + +- **Text similarity**: Sentence-BERT embeddings for restaurant description + cuisine tags; ANN search via Milvus HNSW index (768d) +- **Image similarity**: ResNet-50 image embeddings; HNSW ANN search (512d) +- **Query**: User's cuisine preference vector used as query embedding + +#### 1.3 Popularity & Trending + +- **Global popularity**: Pre-computed `popularity_score` ∈ [0, 1] (order count Γ— recency-weighted) +- **Time-decay**: `score(t) = base_score Γ— e^(-Ξ»t)`, Ξ» = 0.1/day, stored in Redis Sorted Set +- **Local trending**: Top restaurants in the user's current geohash cluster (precision 6, ~1.2km Γ— 0.6km) + +#### 1.4 Location-Aware Filtering + +- **Geofencing**: Elasticsearch `geo_distance` query within user-configured radius +- **Configurable radius**: 1–50km (default 5km for mobile, 10km for web) +- **Geohash pre-filter**: Reduce Elasticsearch scan using geohash bounding cells +- **Distance score**: `distance_score = 1 / (1 + distance_km)` used as a ranking signal later + +#### 1.5 Cuisine Matching + +- User's top 5 cuisine preferences (from profile) are used to directly query restaurants matching those cuisines via PostgreSQL index on `primary_cuisine` +- Ensures strong cuisine relevance even when collaborative signals are weak + +#### 1.6 Knowledge Graph (GNN) + +- Graph Neural Network captures entity relationships: User β†’ visited β†’ Restaurant β†’ serves β†’ Cuisine β†’ similar β†’ Cuisine +- GraphSAGE model generates neighborhood-aggregated embeddings +- Provides candidates based on multi-hop entity relationships (e.g., "users who liked X also liked Y in same 
cuisine family")
+
+### Candidate Merging
+
+After parallel retrieval, candidates are:
+1. **Deduplicated** by `restaurant_id`
+2. **Geo-filtered**: restaurants outside configured radius are dropped
+3. **Status-filtered**: restaurants with `status != active` are dropped; `is_open_now == false` candidates may also be excluded here as a soft pre-filter for efficiency (the hard operating-hours filter runs in Stage 3)
+4. **Capped** at 2000 candidates (top-N by retrieval score per source, then merged)
+
+---
+
+## Stage 2: Scoring & Ranking
+
+**Goal**: Score and rank 500–2000 candidates using a rich feature set and ML ranking models, producing a top 50–200 ranked list in ~15ms.
+
+### Feature Engineering
+
+#### User Features
+
+| Feature | Description | Dimensionality |
+|---------|-------------|---------------|
+| `user_embedding` | Two-Tower user tower output | 128d |
+| `user_segment` | cold_start / warm / active / power_user / churning / dormant | 1 (categorical) |
+| `top_cuisines` | Top-3 cuisine preferences with affinity scores | 6 floats |
+| `price_preference` | Preferred price tier (1–4) | 1 |
+| `recency` | Days since last interaction | 1 |
+| `frequency` | Orders per month (last 90d) | 1 |
+| `monetary` | Avg spend per meal (USD) | 1 |
+| `dietary_flags` | vegetarian, vegan, halal, gluten_free | 4 booleans |
+| `session_clicks` | Number of restaurants clicked in current session | 1 |
+| `time_of_day` | Hour of day (0–23) | 1 |
+
+#### Restaurant Features
+
+| Feature | Description | Dimensionality |
+|---------|-------------|---------------|
+| `restaurant_embedding` | Two-Tower item tower output | 128d |
+| `cuisine_embedding` | Cuisine taxonomy embedding | 32d |
+| `price_tier` | $ to $$$$ (1–4) | 1 |
+| `avg_rating` | Overall avg rating (0–5) | 1 |
+| `rating_breakdown` | food, service, ambiance, value ratings | 4 floats |
+| `review_count` | Total review count (log-scaled) | 1 |
+| `distance_km` | Distance from user's location | 1 |
+| `popularity_score` | Time-decayed popularity score | 1 |
+| `is_new` | Restaurant created <30 days ago | 1 
boolean | +| `has_outdoor_seating` | Outdoor seating availability | 1 boolean | + +#### Context Features + +| Feature | Description | +|---------|-------------| +| `time_of_day` | Hour bucket: breakfast (6–10), lunch (11–14), afternoon (14–17), dinner (17–22), late (22–6) | +| `day_of_week` | Weekday vs. weekend | +| `device_type` | mobile / tablet / desktop | +| `meal_occasion` | Inferred from time + user history: solo / business / date / family | + +#### Cross Features + +| Feature | Description | +|---------|-------------| +| `user_cuisine_affinity` | User cuisine preference score Γ— restaurant primary cuisine match | +| `user_price_match` | Abs difference between user preferred_price_tier and restaurant price_tier | +| `user_restaurant_embedding_similarity` | Dot product of user_embedding and restaurant_embedding | +| `user_distance_affinity` | Historical median distance user visits Γ— current restaurant distance | + +### Ranking Models + +| Model | Type | Use Case | +|-------|------|---------| +| **LightGBM** | Gradient Boosting (LambdaRank) | Primary ranker; tabular features; fast inference | +| **XGBoost** | Gradient Boosting | Ensemble fallback; better on sparse feature sets | +| **DeepFM** | Deep learning (FM + DNN) | Captures high-order feature interactions | +| **Wide & Deep** | Deep learning | Wide: memorization; Deep: generalization | +| **DIN** | Attention-based deep learning | Dynamic interest network using user's click history sequence | + +**Production ranker**: LightGBM (primary) + DeepFM (secondary, async reranking for high-traffic users) + +### Multi-Objective Scoring + +The final ranking score combines multiple objectives: + +``` +final_score = w_relevance Γ— relevance_score + + w_distance Γ— distance_score + + w_price Γ— price_match_score + + w_rating Γ— rating_score +``` + +Default weights: `w_relevance=0.55, w_distance=0.20, w_price=0.10, w_rating=0.15` +Weights are tuned per user segment and updated quarterly via A/B experiments. 
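
As a rough illustration, the blend above can be sketched in a few lines of Python. The helper name and the normalization of the 0–5 rating by 5 are assumptions; `distance_score` follows the Stage 1 formula.

```python
# Sketch of the multi-objective blend described above. DEFAULT_WEIGHTS mirrors
# the documented defaults; normalizing avg_rating by 5 is an assumption.

DEFAULT_WEIGHTS = {"relevance": 0.55, "distance": 0.20, "price": 0.10, "rating": 0.15}

def final_score(relevance, distance_km, price_match, avg_rating, weights=DEFAULT_WEIGHTS):
    """Blend per-objective signals into a single ranking score."""
    distance_score = 1.0 / (1.0 + distance_km)  # Stage 1 distance signal
    rating_score = avg_rating / 5.0             # map 0-5 rating into [0, 1]
    return (weights["relevance"] * relevance
            + weights["distance"] * distance_score
            + weights["price"] * price_match
            + weights["rating"] * rating_score)
```

For example, a candidate with relevance 0.9 at 1.2 km, a perfect price match, and a 4.3 rating scores roughly 0.81 under the default weights.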
+
+---
+
+## Stage 3: Re-ranking & Business Logic
+
+**Goal**: Apply diversity, freshness, and business rules to the top 50–200 ranked candidates to produce the final top 10–50 shown to users.
+
+### Diversity Injection (MMR)
+
+Maximal Marginal Relevance (MMR) balances relevance and diversity:
+
+```
+MMR_score(r) = Ξ» Γ— relevance(r) - (1-Ξ») Γ— max_{s ∈ S} sim(r, s)
+```
+
+Where `S` is the set of already-selected restaurants and `sim` is cuisine/embedding similarity.
+
+- **Ξ»** = 0.7 (tunable; higher = more relevance-focused)
+- **category_max_ratio** = 0.4 (max 40% of results can share the same primary cuisine)
+- Prevents showing 10 Vietnamese restaurants when diverse options exist
+
+### Freshness Boost
+
+- New restaurants (created <30 days): `+0.05` score boost
+- New restaurants (30–90 days): `+0.02` score boost
+- Ensures new restaurant discovery and catalog health
+
+### Business Rules
+
+| Rule | Description | Priority |
+|------|-------------|----------|
+| **Exclude already visited** | Optionally hide restaurants the user visited in the past 7 days (configurable per user setting) | Medium |
+| **Sponsored boost** | Verified sponsored restaurants receive a `+0.10` score lift, capped at max 2 per page | Medium |
+| **High-margin boost** | Internal signal; `+0.03` for partners on premium tier | Low |
+| **Geo availability** | Respect delivery zone boundaries if applicable | High |
+| **Operating hours filter** | Hard filter: restaurants currently closed (is_open_now = false) are excluded from results, including lunch/dinner-time requests | High |
+| **Allergen safety filter** | Hard filter: restaurants with user's allergens are excluded | High |
+
+### Fairness Constraints
+
+- **Minimum exposure**: Small restaurants (review_count < 50) are guaranteed β‰₯1 slot per 10-result page
+- **Exploration slots**: 5–10% of result slots are reserved for exploration (non-greedy picks to combat popularity bias)
+- 
**Position bias calibration**: Model scores are calibrated to account for position bias in training labels + +--- + +## Cold Start Handling + +### New User Cold Start + +| Stage | Interactions | Strategy | +|-------|-------------|---------| +| Anonymous | 0 | Popularity by geo (city/district) + top-rated restaurants | +| Registered (no interactions) | 0 | Demographic matching + onboarding quiz (top 3 cuisines, price preference, dietary) | +| Early interactions | 1–10 | Content-based using stated cuisine prefs + onboarding signals | +| Blending period | 10–30 | Linear blend: CF weight ramps 0β†’1 as interactions increase | +| Full personalization | >30 | Full CF + ranking model | + +**Transition threshold**: cold β†’ warm at **10 interactions**, full personalization at **30 interactions** + +### New Restaurant Cold Start + +| Approach | Description | +|----------|-------------| +| **Content similarity** | Use text/image embeddings to find similar restaurants and proxy their interaction scores | +| **Category injection** | Inject new restaurant into results for users with matching cuisine preferences | +| **Exploration slots** | 5–10% of result slots reserved for new restaurant exploration | +| **Metadata priors** | Use rating_prior = 3.5 until β‰₯10 reviews accumulated | + +--- + +## Pipeline Architecture + +```mermaid +flowchart LR + Request["User Request"] --> ParallelRetrieval + + subgraph ParallelRetrieval["Stage 1: Candidate Generation (parallel)"] + CF["Collaborative Filtering\n(Two-Tower ANN, ALS)"] + CB["Content-Based\n(Text/Image ANN)"] + Geo["Geo/Location\n(Elasticsearch)"] + Pop["Popularity/Trending\n(Redis)"] + Cuisine["Cuisine Match\n(PostgreSQL)"] + KG["Knowledge Graph\n(GNN)"] + end + + ParallelRetrieval --> Merge["Merge & Deduplicate\n500-2000 candidates"] + Merge --> FeatureFetch["Batch Feature Fetch\n(Redis Feature Store)"] + + subgraph Ranking["Stage 2: Scoring & Ranking"] + FeatureFetch --> FeatureEng["Feature Engineering\n(user Γ— restaurant Γ— 
context Γ— cross)"] + FeatureEng --> Models["LightGBM / DeepFM / DIN\nMulti-objective scoring"] + Models --> TopN["Top 50-200"] + end + + subgraph Reranking["Stage 3: Re-ranking & Business Logic"] + TopN --> MMR["MMR Diversity"] + MMR --> Rules["Business Rules\n(filters, boosts, fairness)"] + Rules --> Final["Final Top 10-50"] + end + + Final --> Response["API Response"] +``` diff --git a/docs/modules/03-serving-delivery.md b/docs/modules/03-serving-delivery.md new file mode 100644 index 0000000..ba922c0 --- /dev/null +++ b/docs/modules/03-serving-delivery.md @@ -0,0 +1,343 @@ +# Module 03 β€” Serving & Delivery + +## Table of Contents + +- [Overview](#overview) +- [API Endpoints](#api-endpoints) +- [Caching Strategy](#caching-strategy) +- [Real-time Personalization Stream](#real-time-personalization-stream) +- [Error Handling](#error-handling) +- [Rate Limiting](#rate-limiting) + +--- + +## Overview + +The Serving & Delivery module exposes the recommendation system through a REST API, manages multi-layer caching to minimize latency, and maintains real-time personalization via the event streaming pipeline. + +--- + +## API Endpoints + +### GET `/api/v1/recommendations/homepage` + +Returns personalized recommendations for the user's home feed. 
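
Parameter handling for this endpoint could look like the following sketch. The function name and the error behavior are assumptions; the bounds (limit capped at 50, radius clamped to 1–50 km) come from the parameter table.

```python
# Illustrative clamping of homepage query parameters to the documented bounds.

def normalize_homepage_params(limit=20, offset=0, radius_km=5.0):
    """Clamp pagination and radius to the limits in the parameter table."""
    if limit < 1 or offset < 0:
        raise ValueError("limit must be >= 1 and offset must be >= 0")
    return {
        "limit": min(limit, 50),                      # max 50 per request
        "offset": offset,
        "radius_km": min(max(radius_km, 1.0), 50.0),  # radius bounded to 1-50 km
    }
```

Out-of-range values are clamped rather than rejected, e.g. `limit=80` becomes 50.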
+ +| Property | Value | +|----------|-------| +| **Latency SLA** | p50: 50ms, p99: 200ms | +| **Cache TTL** | Redis: 30min, CDN: 10min | +| **Auth** | Bearer token required | + +**Query Parameters** + +| Parameter | Type | Default | Description | +|-----------|------|---------|-------------| +| `limit` | int | 20 | Number of recommendations (max 50) | +| `offset` | int | 0 | Pagination offset | +| `lat` | float | β€” | User latitude (optional; uses profile location if absent) | +| `lng` | float | β€” | User longitude (optional) | +| `radius_km` | float | 5.0 | Search radius in km (1–50) | + +**Response** + +```json +{ + "request_id": "req_01HXZ789ABC", + "user_id": "usr_9f3a8b2c1d4e5f6a", + "recommendations": [ + { + "restaurant_id": "rst_3b7c9e1a2f4d5e6b", + "name": "Pho Saigon House", + "score": 0.92, + "distance_km": 1.2, + "primary_cuisine": "vietnamese", + "price_tier": "$$", + "avg_rating": 4.3, + "is_open_now": true, + "photo_url": "https://cdn.example.com/restaurants/rst_3b7c9e/photo_1.jpg", + "tags": ["casual", "family-friendly"], + "reason": "Based on your love of Vietnamese cuisine" + } + ], + "total": 20, + "model_version": "ranking_v2.4.1", + "latency_ms": 42 +} +``` + +--- + +### GET `/api/v1/recommendations/nearby` + +Returns top-rated and personalized nearby restaurants based on the user's current location. 
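
Proximity for nearby results ultimately reduces to great-circle distance. A self-contained haversine sketch follows; the production path uses Elasticsearch `geo_distance`, so this is purely illustrative.

```python
import math

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometres between two WGS84 coordinates."""
    earth_radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2.0 * earth_radius_km * math.asin(math.sqrt(a))
```

The sample user profile location (10.7880, 106.7320) is roughly 3.6 km from the sample restaurant (10.7769, 106.7009), so it would fall just outside this endpoint's default 3 km radius.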
+ +| Property | Value | +|----------|-------| +| **Latency SLA** | p50: 40ms, p99: 150ms | +| **Cache TTL** | Redis: 15min (location-sensitive), CDN: 5min | +| **Auth** | Bearer token required | + +**Query Parameters** + +| Parameter | Type | Required | Description | +|-----------|------|----------|-------------| +| `lat` | float | βœ… | User latitude | +| `lng` | float | βœ… | User longitude | +| `radius_km` | float | β€” | Search radius (default 3km, max 50km) | +| `limit` | int | β€” | Max results (default 20, max 50) | +| `cuisine` | string | β€” | Filter by cuisine type | +| `price_tier` | string | β€” | Filter: $, $$, $$$, $$$$ | +| `open_now` | bool | β€” | Only return currently open restaurants | + +--- + +### GET `/api/v1/recommendations/similar/{restaurant_id}` + +Returns restaurants similar to a given restaurant based on content and collaborative signals. + +| Property | Value | +|----------|-------| +| **Latency SLA** | p50: 30ms, p99: 100ms | +| **Cache TTL** | Redis: 2h, CDN: 30min | +| **Auth** | Bearer token required | + +**Path Parameters** + +| Parameter | Type | Description | +|-----------|------|-------------| +| `restaurant_id` | string | Target restaurant ID | + +**Query Parameters** + +| Parameter | Type | Default | Description | +|-----------|------|---------|-------------| +| `limit` | int | 10 | Number of similar restaurants (max 20) | +| `lat` | float | β€” | User location for geo-scoring | +| `lng` | float | β€” | User location for geo-scoring | + +--- + +### GET `/api/v1/recommendations/complementary/{restaurant_id}` + +Returns "People also liked" recommendations for users who interacted with a given restaurant. 
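
A toy sketch of the underlying idea is item co-occurrence over per-user interaction histories; the production system blends collaborative and content signals, so treat the data shape (`histories` as an iterable of restaurant-id sets) and the counting logic as assumptions.

```python
from collections import Counter

def also_liked(histories, restaurant_id, top_n=10):
    """Rank restaurants that co-occur with `restaurant_id` across user histories."""
    co = Counter()
    for visited in histories:  # one set of restaurant_ids per user
        if restaurant_id in visited:
            co.update(r for r in visited if r != restaurant_id)
    return [r for r, _ in co.most_common(top_n)]
```

In practice this relation would be pre-computed offline and cached, matching the 2h Redis TTL above.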
+ +| Property | Value | +|----------|-------| +| **Latency SLA** | p50: 30ms, p99: 100ms | +| **Cache TTL** | Redis: 2h, CDN: 30min | +| **Auth** | Bearer token required | + +**Path Parameters** + +| Parameter | Type | Description | +|-----------|------|-------------| +| `restaurant_id` | string | Source restaurant ID | + +--- + +### POST `/api/v1/recommendations/search-rerank` + +Re-ranks a set of search results using personalization signals. + +| Property | Value | +|----------|-------| +| **Latency SLA** | p50: 60ms, p99: 200ms | +| **Cache TTL** | Not cached (query-specific) | +| **Auth** | Bearer token required | + +**Request Body** + +```json +{ + "query": "sushi downtown", + "candidate_ids": [ + "rst_abc123", + "rst_def456", + "rst_ghi789" + ], + "context": { + "lat": 10.7769, + "lng": 106.7009, + "session_id": "sess_7e2f1a3b4c5d6e7f" + }, + "limit": 10 +} +``` + +--- + +### POST `/api/v1/recommendations/group` + +Generates recommendations for group dining by aggregating preferences of multiple users. + +| Property | Value | +|----------|-------| +| **Latency SLA** | p50: 80ms, p99: 250ms | +| **Cache TTL** | Redis: 15min | +| **Auth** | Bearer token required | + +**Request Body** + +```json +{ + "user_ids": [ + "usr_9f3a8b2c1d4e5f6a", + "usr_1a2b3c4d5e6f7a8b", + "usr_2b3c4d5e6f7a8b9c" + ], + "context": { + "lat": 10.7769, + "lng": 106.7009, + "occasion": "business", + "party_size": 3 + }, + "constraints": { + "dietary": ["halal"], + "price_tier_max": "$$$", + "radius_km": 5.0 + }, + "limit": 10 +} +``` + +--- + +## Caching Strategy + +The system uses a **4-layer caching architecture** to achieve high cache hit rates and minimize backend load. 
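
The Redis application-layer key pattern `rec:{endpoint}:{user_id}:{context_hash}` might be derived like this; the choice of SHA-256 and the 16-character truncation are assumptions.

```python
import hashlib
import json

def rec_cache_key(endpoint, user_id, context):
    """Build a deterministic Redis key for a recommendation response."""
    # Sort context keys so logically equal contexts hash identically.
    blob = json.dumps(context, sort_keys=True, separators=(",", ":"))
    context_hash = hashlib.sha256(blob.encode("utf-8")).hexdigest()[:16]
    return f"rec:{endpoint}:{user_id}:{context_hash}"
```

Sorting the context keys before hashing means two requests with the same parameters in a different order share one cache entry.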
+ +```mermaid +flowchart TD + Request["User Request"] --> L1 + + L1["Layer 1: CDN / Edge Cache\n(CloudFront)\nTTL: 5-15 min\nHit rate target: ~40%"] + L1 -- MISS --> L2 + + L2["Layer 2: Redis Application Cache\n(per user_id + context hash)\nTTL: 30 min - 2h\nHit rate target: ~80%"] + L2 -- MISS --> L3 + + L3["Layer 3: Feature Store Cache\n(per user_id, per restaurant_id)\nTTL: 1-6h\nPre-warmed daily"] + L3 -- MISS --> L4 + + L4["Layer 4: Model Prediction Cache\n(pre-computed batch scores)\nTTL: 15-30 min\nFor power users + popular restaurants"] + L4 -- MISS --> Engine["Full Recommendation Pipeline\n~40-60ms"] + + Engine --> L2 + Engine --> L4 +``` + +### Cache Configuration + +| Layer | Technology | TTL | Cache Key | Invalidation | +|-------|-----------|-----|-----------|-------------| +| **CDN/Edge** | CloudFront | 5–15 min | URL + user geo-bucket | Manual purge on major catalog updates | +| **Redis App** | Redis Cluster | 30min (nearby) – 2h (similar) | `rec:{endpoint}:{user_id}:{context_hash}` | Event-driven: ORDER_PLACED, profile update | +| **Feature Store** | Redis Cluster | 1–6h | `feat:user:{user_id}`, `feat:rst:{restaurant_id}` | Daily batch refresh | +| **Model Prediction** | Redis | 15–30 min | `pred:{model_version}:{user_segment}:{geo_bucket}` | Model deployment | + +### Cache Invalidation Events + +| Trigger Event | Cache Invalidated | Reason | +|---------------|-------------------|--------| +| `ORDER_PLACED` | Redis App (user) | Immediate preference signal | +| `FAVORITE_ADDED` | Redis App (user) | Explicit preference update | +| User profile update | Redis App + Feature Store | Profile change | +| Restaurant catalog update | Feature Store (restaurant) | Data freshness | +| Model deployment | Model Prediction Cache | New model scores | + +--- + +## Real-time Personalization Stream + +After a significant user interaction, session features are updated in near-real-time (<5s) to improve subsequent requests within the same session. 
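
Conceptually, the Flink job folds each enriched event into the session-feature hash that serving later reads from `sess:{session_id}`. A stripped-down, in-memory stand-in is sketched below; the field names follow the session-feature examples in these docs, while the fold logic itself is an assumption.

```python
# In-memory stand-in for the Flink session aggregation; the real job writes
# the resulting hash to Redis under sess:{session_id}.

def update_session_features(features, event):
    """Fold one enriched event into the running session-feature mapping."""
    features = dict(features)  # copy so the fold stays side-effect free
    if event["event_type"] == "RESTAURANT_CLICK":
        features["click_count"] = features.get("click_count", 0) + 1
        seen = features.get("viewed_cuisines", [])
        cuisine = event.get("primary_cuisine")
        if cuisine and cuisine not in seen:
            features["viewed_cuisines"] = seen + [cuisine]
    features["last_event_at"] = event["timestamp"]
    return features
```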
+ +```mermaid +sequenceDiagram + participant User + participant API as Recommendation API + participant Kafka + participant Flink + participant Redis as Redis Cache + + User->>API: Performs action (click, order, review) + API->>Kafka: Publish event to user-events-raw + API-->>User: Event ACK (async, non-blocking) + + Kafka->>Flink: Event consumed (<1s) + Flink->>Flink: Validate + Enrich + Aggregate session + Flink->>Redis: Update session features (<3s total) + + Note over Redis: Session features updated
within <5s of user action + + User->>API: Next recommendation request + API->>Redis: Fetch updated session features + API-->>User: Personalized recommendations
(reflecting recent behavior) +``` + +--- + +## Error Handling + +### Error Response Format + +```json +{ + "error": { + "code": "RECOMMENDATION_ENGINE_TIMEOUT", + "message": "Recommendation service timed out. Fallback response provided.", + "details": { + "timeout_ms": 200, + "fallback_strategy": "popularity_based" + }, + "request_id": "req_01HXZ789ABC", + "timestamp": "2024-03-15T14:32:07.123Z" + } +} +``` + +### Error Codes & HTTP Status + +| HTTP Status | Error Code | Description | Fallback Strategy | +|-------------|-----------|-------------|------------------| +| 400 | `INVALID_PARAMETERS` | Missing required params or out-of-range values | None (client error) | +| 401 | `UNAUTHORIZED` | Invalid or expired JWT token | None | +| 404 | `RESTAURANT_NOT_FOUND` | `restaurant_id` does not exist | None | +| 429 | `RATE_LIMIT_EXCEEDED` | Request rate limit exceeded | None | +| 503 | `RECOMMENDATION_ENGINE_TIMEOUT` | Engine exceeded latency budget | Popularity-based fallback | +| 503 | `FEATURE_STORE_UNAVAILABLE` | Redis feature store unreachable | Default feature fallback | +| 500 | `INTERNAL_ERROR` | Unexpected internal error | Popularity-based fallback | + +### Fallback Strategy + +When the recommendation engine fails or times out, the service falls back to a **popularity-based response**: +- Returns top-N restaurants in the user's geo-bucket sorted by `popularity_score` +- Fallback response is cached for 5 minutes +- Logged with `fallback=true` flag for monitoring + +--- + +## Rate Limiting + +| Scope | Limit | Window | +|-------|-------|--------| +| Per user (authenticated) | 100 requests | 1 minute | +| Per API key (server-to-server) | 1,000 requests | 1 minute | +| Per IP (unauthenticated) | 20 requests | 1 minute | +| Group recommendations endpoint | 10 requests | 1 minute per user | + +Rate limiting is enforced at the API Gateway layer using a sliding window algorithm. Exceeded limits return HTTP 429 with `Retry-After` header. 
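
A minimal in-memory sketch of the sliding-window check; the gateway's real implementation is distributed with shared state across instances, so treat this as illustrative only.

```python
import time
from collections import defaultdict, deque

class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window_s` seconds per key."""

    def __init__(self, limit=100, window_s=60):
        self.limit = limit
        self.window_s = window_s
        self.hits = defaultdict(deque)

    def allow(self, key, now=None):
        now = time.monotonic() if now is None else now
        window = self.hits[key]
        while window and window[0] <= now - self.window_s:
            window.popleft()          # drop hits that fell out of the window
        if len(window) >= self.limit:
            return False              # caller responds 429 with Retry-After
        window.append(now)
        return True
```

Denied requests are not recorded, so a client hammering the endpoint does not extend its own lockout.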
+ +```json +{ + "error": { + "code": "RATE_LIMIT_EXCEEDED", + "message": "Rate limit of 100 requests per minute exceeded.", + "details": { + "limit": 100, + "window": "1m", + "retry_after_seconds": 23 + } + } +} +``` diff --git a/docs/modules/04-feature-store.md b/docs/modules/04-feature-store.md new file mode 100644 index 0000000..b34b505 --- /dev/null +++ b/docs/modules/04-feature-store.md @@ -0,0 +1,258 @@ +# Module 04 β€” Feature Store + +## Table of Contents + +- [Overview](#overview) +- [Offline Store](#offline-store) +- [Online Store](#online-store) +- [Feature Registry](#feature-registry) +- [Restaurant-Specific Features](#restaurant-specific-features) +- [Feature Pipelines](#feature-pipelines) + +--- + +## Overview + +The Feature Store provides a unified layer for storing, serving, and managing ML features with consistency guarantees between offline training and online serving. + +```mermaid +graph TB + subgraph Sources["Data Sources"] + EventLog["Event Log (S3)"] + PG["PostgreSQL"] + Kafka["Kafka Streams"] + ExternalAPIs["External APIs"] + end + + subgraph OfflineStore["Offline Store"] + S3Parquet["S3 / GCS\n(Parquet files)\nDaily snapshots"] + BigQuery["BigQuery\n(Analytics queries\nTraining data joins)"] + end + + subgraph OnlineStore["Online Store"] + RedisCluster["Redis Cluster\n(<5ms latency SLA)\n(Hash, Sorted Set, String)"] + end + + subgraph Registry["Feature Registry"] + FeatureReg["Feature Registry\n(metadata, lineage, SLAs)"] + end + + subgraph Consumers["Consumers"] + Training["ML Training\n(Spark jobs)"] + Serving["Online Serving\n(Recommendation Service)"] + Analytics["Analytics\n(Metabase / BI)"] + end + + Sources --> OfflineStore + Kafka --> OnlineStore + OfflineStore --> OnlineStore + OfflineStore --> Training + OnlineStore --> Serving + OfflineStore --> Analytics + FeatureReg --> Training + FeatureReg --> Serving +``` + +--- + +## Offline Store + +### Storage Architecture + +| Component | Technology | Purpose | 
+|-----------|-----------|---------|
+| **Primary Storage** | S3 / GCS (Parquet) | Partitioned columnar storage for large-scale feature data |
+| **Analytics Layer** | BigQuery | Serverless SQL analytics, training data joins |
+| **Partitioning** | `dt=YYYY-MM-DD / entity_type=user\|restaurant` | Time-partitioned for efficient range scans |
+| **Format** | Apache Parquet | Columnar compression, Spark-compatible, schema evolution support |
+
+### Partition Strategy
+
+```
+s3://rr-feature-store/
+β”œβ”€β”€ features/
+β”‚   β”œβ”€β”€ user_features/
+β”‚   β”‚   └── dt=2024-03-15/
+β”‚   β”‚       └── part-00000.parquet
+β”‚   β”œβ”€β”€ restaurant_features/
+β”‚   β”‚   └── dt=2024-03-15/
+β”‚   β”‚       └── part-00000.parquet
+β”‚   └── interaction_features/
+β”‚       └── dt=2024-03-15/
+β”‚           └── part-00000.parquet
+β”œβ”€β”€ training_datasets/
+β”‚   └── ranking_model_v2/
+β”‚       └── dt=2024-03-15/
+└── embeddings/
+    β”œβ”€β”€ user_embeddings/
+    β”‚   └── dt=2024-03-15/
+    └── restaurant_embeddings/
+        └── dt=2024-03-15/
+```
+
+### Offline Feature Retention
+
+| Feature Group | Retention Period | Notes |
+|--------------|-----------------|-------|
+| User behavior features | 24 months | Rolling window for seasonal pattern detection |
+| Restaurant features | Indefinite | Full history; partitioned by date |
+| Interaction features | 12 months | Training data construction |
+| Embeddings | 6 months (last 3 versions) | Previous versions retained for model comparison |
+
+---
+
+## Online Store
+
+### Architecture
+
+- **Technology**: Redis Cluster (6 shards, 3 replicas each)
+- **Latency SLA**: <5ms p99 for feature fetch
+- **Capacity**: ~200GB for user features + ~50GB for restaurant features
+- **Persistence**: AOF (append-only file) enabled; RDB snapshots every 6h
+- **Eviction Policy**: `allkeys-lru` (LRU eviction under memory pressure)
+
+### Data Structures
+
+| Entity | Redis Key Pattern | Data Structure | Example Value |
+|--------|------------------|----------------|--------------|
+| User embedding | 
`feat:user:{user_id}:embedding` | Redis String (binary) | 128d float32 vector | +| User preferences | `feat:user:{user_id}:prefs` | Redis Hash | `cuisine_pref_1=japanese score=0.87` | +| User segment | `feat:user:{user_id}:segment` | Redis String | `active` | +| Restaurant features | `feat:rst:{restaurant_id}` | Redis Hash | `avg_rating=4.3 price_tier=2 ...` | +| Restaurant embedding | `feat:rst:{restaurant_id}:embedding` | Redis String (binary) | 128d float32 vector | +| Popularity sorted set | `pop:geo:{geohash_prefix6}` | Redis Sorted Set | `rst_abc123 = 0.87` | +| Session features | `sess:{session_id}` | Redis Hash | `click_count=5 viewed_cuisines=japanese,vietnamese` | + +### Feature Freshness + +| Feature Group | TTL in Redis | Refresh Mechanism | +|--------------|-------------|-------------------| +| User preferences | 6h | Daily batch + event-driven on significant action | +| User embedding | 12h | Weekly model retrain output | +| User segment | 6h | Daily batch | +| Restaurant features | 6h | Daily catalog refresh | +| Restaurant embedding | 24h | Weekly embedding rebuild | +| Session features | 30min (idle TTL) | Real-time via Flink | +| Popularity scores | 2h | Hourly batch update | + +--- + +## Feature Registry + +The Feature Registry is the single source of truth for all feature metadata, lineage, and SLAs. + +### Registry Schema + +```yaml +feature_name: user_cuisine_affinity +version: "2.1" +description: "Top-3 cuisine preferences with affinity scores for a user, derived from interaction history." +entity_type: user +data_type: array[{cuisine: string, score: float}] +dimensionality: 3 items Γ— 2 fields +source: + pipeline: user_profile_rebuild + tables: + - events.restaurant_views + - events.orders + - users.profiles +computation_logic: | + For each user, aggregate cuisine interactions (views Γ— 0.3 + clicks Γ— 0.6 + orders Γ— 1.0) + over last 90 days. Normalize by total interactions. Return top 3 cuisines. 
+owner: "ml-platform-team"
+freshness_sla:
+  offline: "daily (by 5:00 AM UTC)"
+  online: "6h TTL in Redis, refreshed by 7:30 AM UTC"
+monitoring:
+  null_rate_threshold: 0.02
+  drift_detection: PSI threshold 0.2
+tags: [user, preference, cuisine, behavioral]
+created_at: "2023-04-01"
+updated_at: "2024-01-15"
+```
+
+### Feature Groups
+
+| Feature Group | Entity | Count | Owner | Freshness |
+|--------------|--------|-------|-------|-----------|
+| `user_behavioral` | user | 24 features | ML Platform | Daily |
+| `user_preference` | user | 12 features | ML Platform | Daily + event-driven |
+| `user_embedding` | user | 1 (128d vector) | Recommendation Team | Weekly |
+| `restaurant_static` | restaurant | 18 features | Catalog Team | Daily |
+| `restaurant_dynamic` | restaurant | 10 features | ML Platform | Hourly |
+| `restaurant_embedding` | restaurant | 2 (text 768d + image 512d) | ML Platform | Weekly |
+| `geo_features` | restaurant | 6 features | Platform Team | Daily |
+| `context_features` | request | 8 features | Serving Team | Real-time (computed) |
+
+---
+
+## Restaurant-Specific Features
+
+The restaurant domain requires specialized features beyond generic item features:
+
+### Geo Features
+
+| Feature | Type | Description | Update Frequency |
+|---------|------|-------------|-----------------|
+| `lat` | float | Latitude (-90 to 90) | On catalog update |
+| `lng` | float | Longitude (-180 to 180) | On catalog update |
+| `geohash_7` | string | Geohash precision 7 (~153m × 153m) | On catalog update |
+| `geohash_6` | string | Geohash precision 6 (~1.2km × 0.6km) | On catalog update |
+| `city_id` | string | Normalized city identifier | On catalog update |
+| `district_id` | string | Normalized district identifier | On catalog update |
+
+### Cuisine Embeddings
+
+| Feature | Type | Description |
+|---------|------|-------------|
+| `cuisine_embedding_32d` | float[32] | Learned cuisine taxonomy embedding (captures cuisine similarity) |
+| `primary_cuisine_id` | int | Mapped integer ID from taxonomy (120+ cuisines) |
+| `cuisine_type_ids` | int[] | All cuisine type IDs (multi-label) |
+
+### Time-Based Features
+
+| Feature | Type | Description | Update |
+|---------|------|-------------|--------|
+| `is_open_now` | bool | Computed from operating_hours + current time (UTC) | Request-time |
+| `meal_type_open` | string | breakfast/lunch/dinner/late based on current hour | Request-time |
+| `is_peak_hour_now` | bool | Peak hours: Lunch 11-14, Dinner 18-21 | Request-time |
+| `avg_wait_time_min` | float | Estimated wait time from booking data | Hourly |
+
+### Sentiment & Quality Scores
+
+| Feature | Type | Description | Update |
+|---------|------|-------------|--------|
+| `sentiment_score` | float[0,1] | Aggregate sentiment from all reviews (DistilBERT) | Daily |
+| `food_sentiment` | float[0,1] | Aspect-level food sentiment | Daily |
+| `service_sentiment` | float[0,1] | Aspect-level service sentiment | Daily |
+| `photo_quality_score` | float[0,1] | CNN-based photo quality assessment | Weekly |
+| `review_recency_score` | float[0,1] | Weighted avg rating (recency-weighted, 90d) | Daily |
+
+---
+
+## Feature Pipelines
+
+### Daily Batch Pipeline (Airflow DAG)
+
+```mermaid
+flowchart LR
+    ExtractEvents["Extract Events\n(S3 Parquet)"] --> ComputeUserFeatures["Compute User Features\n(Spark)"]
+    ExtractCatalog["Extract Catalog\n(PostgreSQL)"] --> ComputeRestFeatures["Compute Restaurant Features\n(Spark)"]
+    ComputeUserFeatures --> WriteOffline["Write Offline Store\n(S3 Parquet + BigQuery)"]
+    ComputeRestFeatures --> WriteOffline
+    WriteOffline --> WarmOnline["Warm Online Store\n(Redis bulk upload)"]
+    WarmOnline --> ValidateFeatures["Validate Features\n(null rates, schema, drift)"]
+    ValidateFeatures --> Done["✅ Pipeline Complete"]
+```
+
+### Real-time Feature Update (Flink)
+
+```mermaid
+flowchart LR
+    Kafka["Kafka:\nuser-events-enriched"] --> SessionWindow["Flink Session Window\n(30min idle timeout)"]
+    SessionWindow --> UpdateSession["Update Session Features\n(Redis HSET)"]
+    SessionWindow --> CheckThreshold{"Significant\naction?"}
+    CheckThreshold -- "ORDER_PLACED\nFAVORITE_ADDED" --> UpdateUserPrefs["Update User Prefs\n(Redis HSET)"]
+    CheckThreshold -- "No" --> Skip["Skip user profile update"]
+    UpdateSession & UpdateUserPrefs --> Done["Feature Update Complete\n(<5s from event)"]
+```
diff --git a/docs/modules/05-ml-platform.md b/docs/modules/05-ml-platform.md
new file mode 100644
index 0000000..3b09d27
--- /dev/null
+++ b/docs/modules/05-ml-platform.md
@@ -0,0 +1,239 @@
+# Module 05 — ML Platform
+
+## Table of Contents
+
+- [Overview](#overview)
+- [Training Pipeline](#training-pipeline)
+- [Evaluation Metrics](#evaluation-metrics)
+- [Model Registry](#model-registry)
+- [Model Serving](#model-serving)
+- [Deployment Strategy](#deployment-strategy)
+
+---
+
+## Overview
+
+The ML Platform manages the complete lifecycle of recommendation models — from data preparation and training, through evaluation and registration, to production serving and monitoring. It supports both scheduled retraining and ad-hoc experimentation.
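The evaluation-and-registration gate in this lifecycle can be sketched as a simple threshold check. This is a minimal illustration, not the production pipeline: the function name is hypothetical, and the thresholds are the minimum offline-metric targets listed in the Evaluation Metrics section.

```python
# Hypothetical promotion gate: a trained model is registered only if its
# offline metrics clear the minimum targets (AUC-ROC >= 0.82,
# NDCG@10 >= 0.45, MAP@10 >= 0.38); otherwise the run is flagged for review.
THRESHOLDS = {"auc_roc": 0.82, "ndcg_10": 0.45, "map_10": 0.38}

def promotion_decision(metrics: dict) -> tuple:
    """Return (passed, failed_metric_names) for a candidate model run."""
    failed = [name for name, floor in THRESHOLDS.items()
              if metrics.get(name, 0.0) < floor]
    return (not failed, failed)

# A run that clears every floor becomes a deployment candidate.
passed, failed = promotion_decision(
    {"auc_roc": 0.831, "ndcg_10": 0.463, "map_10": 0.391})
print(passed, failed)  # True []
```

In practice this check runs inside the Airflow training DAG; a failing run triggers the Slack/PagerDuty alert path instead of registering to MLflow.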
+ +```mermaid +graph TB + subgraph Training["Training Pipeline (Airflow + Spark)"] + DataExtract["Data Extraction\n(S3 + BigQuery)"] + FeatEng["Feature Engineering\n(Spark)"] + ModelTrain["Model Training\n(PyTorch / LightGBM)"] + Evaluate["Evaluation\n(Offline metrics)"] + end + + subgraph Registry["Model Registry (MLflow)"] + MLflowReg["MLflow Registry\nΒ· Versioning\nΒ· Artifacts\nΒ· Lineage"] + end + + subgraph Serving["Model Serving"] + BatchInference["Batch Inference\n(every 6h, Spark)"] + RealtimeServing["Real-time Inference\n(Triton / TF Serving)"] + NearRealtime["Near-real-time\n(Kafka β†’ Flink β†’ Re-score, <5s)"] + end + + subgraph Deployment["Deployment"] + Canary["Canary (5% traffic, 2h)"] + Shadow["Shadow Testing"] + BlueGreen["Blue-Green Switch"] + end + + Training --> Registry + Registry --> Serving + Serving --> Deployment + Deployment --> Monitoring["Production Monitoring\n(Evidently + Grafana)"] + Monitoring --> Training +``` + +--- + +## Training Pipeline + +### Pipeline Steps + +```mermaid +flowchart LR + A["Data Extraction\n(S3 Parquet + BigQuery)"] --> B["Data Validation\n(Great Expectations\nschema + null checks)"] + B --> C["Feature Engineering\n(Spark)\nΒ· Interaction aggregates\nΒ· Negative sampling\nΒ· Cross feature construction"] + C --> D["Training\n(PyTorch / LightGBM)\nDistributed on Spark/GPU cluster"] + D --> E["Evaluation\n(Holdout + temporal split)"] + E --> F{Metrics\npass thresholds?} + F -- Yes --> G["Register to MLflow\n(version, artifacts, metrics, tags)"] + F -- No --> H["Alert & Investigate\n(Slack + PagerDuty)"] + G --> I["Candidate for Deployment\n(human or auto-approval)"] +``` + +### Training Data Construction + +| Component | Description | +|-----------|-------------| +| **Positive samples** | User interactions: clicks (weight 0.3), orders (weight 1.0), favorites (weight 0.7), reviews (weight 0.5) | +| **Negative samples** | Shown but not interacted with (weight -0.1); random unshown restaurants (weight -0.5) | +| 
**Temporal split** | Train: last 28 days, Validation: day 29–30, Test: last 2 days before retrain | +| **Sample ratio** | ~1:10 positive:negative | +| **Data volume** | ~50M samples per weekly retrain | +| **Debiasing** | Inverse propensity scoring (IPS) to correct for position bias in training labels | + +### Model Training Schedule + +| Model | Frequency | Compute | Duration | Trigger | +|-------|-----------|---------|---------|---------| +| Two-Tower (CF) | Weekly (Sunday 3AM) | 8Γ— A100 GPU | ~3h | Scheduled | +| LightGBM Ranker | Weekly (Sunday 3AM) | 32-core CPU | ~1h | Scheduled | +| DeepFM Ranker | Weekly (Sunday 3AM) | 4Γ— A100 GPU | ~2h | Scheduled | +| Embedding index rebuild | Weekly (Sunday 6AM) | 16-core CPU | ~2h | Post-training | +| Cold start model | Monthly | 4-core CPU | ~30min | Scheduled | +| Knowledge Graph (GNN) | Bi-weekly | 8Γ— A100 GPU | ~4h | Scheduled | + +--- + +## Evaluation Metrics + +### Offline Metrics + +| Metric | Description | Target | Threshold (Regression) | +|--------|-------------|--------|------------------------| +| **AUC-ROC** | Area under ROC curve for click prediction | β‰₯0.82 | Drop >2% triggers alert | +| **NDCG@10** | Normalized Discounted Cumulative Gain at 10 | β‰₯0.45 | Drop >5% blocks deploy | +| **NDCG@20** | NDCG at 20 | β‰₯0.42 | Drop >5% blocks deploy | +| **MAP@10** | Mean Average Precision at 10 | β‰₯0.38 | Drop >5% blocks deploy | +| **Coverage** | % of catalog appearing in recommendations | β‰₯30% | Drop >10% triggers alert | +| **Diversity** | Avg intra-list cuisine diversity (ILD) | β‰₯0.60 | Drop >10% triggers alert | +| **Location Relevance** | % recommendations within user's historical visit radius | β‰₯70% | Drop >10% triggers alert | + +### Online Metrics (A/B Test Acceptance) + +| Metric | Baseline | Minimum Improvement | +|--------|----------|---------------------| +| Click-through Rate (CTR) | ~8% | +0.5% absolute | +| Conversion Rate (order) | ~3.5% | +0.2% absolute | +| Revenue per User | $X | 
+1% relative | +| Session Duration | X min | No significant regression | + +--- + +## Model Registry + +### MLflow Registry Structure + +Each model version in MLflow contains: + +| Artifact | Description | +|---------|-------------| +| **Model binary** | Trained model weights (PyTorch `.pt`, LightGBM `.txt`, TF SavedModel) | +| **Feature schema** | Exact feature names, types, and preprocessing steps used | +| **Evaluation report** | All offline metrics on validation and test sets | +| **Training config** | Hyperparameters, data split dates, sample counts | +| **Data lineage** | Source datasets (S3 paths + versions) used for training | +| **Dependency manifest** | Python packages + versions | + +### Model States + +``` +Registered β†’ Staging β†’ Production β†’ Archived +``` + +| State | Description | Who Can Promote | +|-------|-------------|----------------| +| `Registered` | Just trained; awaiting evaluation | Automated pipeline | +| `Staging` | Passed offline eval; ready for shadow/canary | ML Engineer (manual review) | +| `Production` | Serving live traffic | Senior ML Engineer / auto-promotion after canary | +| `Archived` | No longer serving; retained for lineage | Auto-archival after 90d | + +### Model Tagging Convention + +```python +mlflow.set_tags({ + "model_type": "ranking", + "framework": "lightgbm", + "training_date": "2024-03-15", + "train_data_start": "2024-02-15", + "train_data_end": "2024-03-13", + "ndcg_10": 0.463, + "auc_roc": 0.831, + "approved_by": "ml-platform-team", + "experiment_id": "exp_ranking_v2" +}) +``` + +--- + +## Model Serving + +### Serving Modes + +#### Batch Inference (every 6 hours) + +- Pre-computes recommendation scores for active users (segment: active, power_user) +- Results stored in Redis with 6h TTL +- Triggered by Airflow DAG; uses Spark for distributed scoring +- Covers ~80% of traffic (high-cache-hit endpoint) + +#### Real-time Inference + +- Handles cold-path requests not covered by batch inference +- Model loaded in 
NVIDIA Triton Inference Server +- Supports dynamic batching (batch_size up to 512) +- GPU nodes: minimum 2Γ— A10G per environment +- Latency target: <20ms p99 for ranking model inference + +#### Near-real-time Re-scoring (<5s) + +- Kafka β†’ Flink β†’ Recommendation re-score +- Triggered by significant events: ORDER_PLACED, FAVORITE_ADDED +- Uses lightweight session-aware ranking (LightGBM only) +- Updates Redis cache with fresh scores + +### Triton Model Configuration + +```yaml +name: "lightgbm_ranker" +platform: "ensemble" +max_batch_size: 512 +dynamic_batching: + preferred_batch_size: [64, 128, 256] + max_queue_delay_microseconds: 5000 +instance_group: + - kind: KIND_GPU + count: 1 + gpus: [0] +``` + +--- + +## Deployment Strategy + +### Canary Deployment Flow + +```mermaid +flowchart LR + NewModel["New Model Version\n(Staging)"] --> ShadowTest["Shadow Testing\n(0% live traffic\nlog predictions only)"] + ShadowTest --> CanaryDeploy["Canary Deploy\n5% traffic\n2h window"] + CanaryDeploy --> MetricCheck{Guardrail\nMetrics OK?} + MetricCheck -- Yes --> Expand["Expand to 25% β†’ 50% β†’ 100%\nBlue-Green final switch"] + MetricCheck -- No --> Rollback["Automatic Rollback\nto previous version"] + Expand --> Archived["Archive Previous Version\n(retain 90 days)"] +``` + +### Rollback Triggers + +Automatic rollback is triggered if any of the following conditions are met during canary: + +| Trigger | Threshold | Window | +|---------|-----------|--------| +| API latency p99 | >300ms | 5-minute rolling window | +| API error rate | >1% | 5-minute rolling window | +| NDCG@10 (online proxy) | Drop >5% vs. baseline | 1-hour window | +| Model inference latency | >50ms p99 | 5-minute rolling window | +| Feature fetch errors | >0.5% | 5-minute rolling window | + +### Blue-Green Switch + +Once canary passes guardrails: +1. New version promoted to `production` state in MLflow +2. Kubernetes traffic split updated: new version receives 100% traffic +3. 
Old version demoted to `archived` state +4. Redis model prediction cache invalidated +5. Alert sent to ML team: deployment complete diff --git a/docs/modules/06-experimentation.md b/docs/modules/06-experimentation.md new file mode 100644 index 0000000..1016b9f --- /dev/null +++ b/docs/modules/06-experimentation.md @@ -0,0 +1,249 @@ +# Module 06 β€” Experimentation + +## Table of Contents + +- [Overview](#overview) +- [Experiment Design](#experiment-design) +- [Metrics Framework](#metrics-framework) +- [Statistical Testing](#statistical-testing) +- [Experiment Lifecycle](#experiment-lifecycle) +- [Infrastructure](#infrastructure) + +--- + +## Overview + +The Experimentation module provides a rigorous A/B testing framework to evaluate recommendation algorithm changes, UI treatments, and business rule modifications before full rollout. + +--- + +## Experiment Design + +### Traffic Splitting + +Traffic is split using **hash-based user assignment** to ensure: +- Deterministic assignment (same user always in same bucket) +- No leakage between variants +- Consistent experience across sessions + +``` +experiment_bucket = hash(user_id + experiment_id) % 100 +``` + +Assignment to variants based on configured bucket ranges: +- Control: buckets 0–49 (50%) +- Treatment A: buckets 50–74 (25%) +- Treatment B: buckets 75–99 (25%) + +### Sample Size Requirements + +| Parameter | Value | Notes | +|-----------|-------|-------| +| Minimum samples per variant | 10,000 unique users | Statistical power requirement | +| Minimum experiment duration | 7 days | Capture weekly behavioral patterns | +| Maximum experiment duration | 14 days | Avoid novelty effects confounding results | +| Holdback group | 5% of users | Always in control; used for long-term health tracking | + +### Experiment Layers + +To run multiple experiments simultaneously without interference: + +| Layer | Scope | Examples | +|-------|-------|---------| +| `algorithm` | Recommendation model / ranking logic | New ranker, CF 
model update | +| `ui` | Frontend presentation | Card layout, photo size, order of sections | +| `business_rules` | Re-ranking business rules | Diversity weights, sponsored ratio | +| `features` | Feature engineering changes | New feature, feature removal | + +Each layer uses independent hash seeds, preventing cross-layer interaction. + +--- + +## Metrics Framework + +### Primary Metrics + +| Metric | Definition | Minimum Detectable Effect (MDE) | +|--------|-----------|--------------------------------| +| **CTR** | Clicks / Impressions per session | Β±0.5% absolute | +| **Conversion Rate** | Orders / Sessions | Β±0.2% absolute | +| **Revenue per User** | Total order revenue / Active users | Β±1% relative | + +### Secondary Metrics + +| Metric | Definition | Direction | +|--------|-----------|-----------| +| `avg_restaurants_viewed` | Avg unique restaurants viewed per session | Higher is better (exploration) | +| `session_duration_min` | Avg session duration in minutes | Neutral (engagement proxy) | +| `bounce_rate` | Sessions with 0 restaurant clicks / All sessions | Lower is better | +| `search_rate` | Sessions with β‰₯1 search / All sessions | Lower = recs more satisfying | +| `diversity_score` | Avg intra-list cuisine diversity (ILD) | Track; don't optimize directly | +| `new_restaurant_discovery` | % recs showing restaurants user never visited | Track for catalog health | + +### Guardrail Metrics + +These metrics must not regress significantly; a breach blocks experiment promotion: + +| Guardrail | Threshold | Justification | +|-----------|-----------|--------------| +| `page_load_time_p99` | No increase >50ms | User experience | +| `error_rate` | No increase >0.5% absolute | Reliability | +| `user_complaints` | No increase >10% relative | User satisfaction | +| `recommendation_latency_p99` | No increase >20ms | Performance | + +--- + +## Statistical Testing + +### Methodology + +- **Primary test**: Sequential testing (allows early stopping without inflating 
Type I error) +- **Significance level**: Ξ± = 0.05 (5% false positive rate) +- **Statistical power**: 0.80 (80% chance of detecting true effect at MDE) +- **Multiple comparison correction**: Bonferroni correction when testing >3 metrics simultaneously + +### Sequential Testing + +Sequential testing allows monitoring results continuously and stopping early if: +- A significant result is detected (stop early β€” winner found) +- The experiment is clearly futile (stop early β€” no effect) + +This uses the **mSPRT** (mixture Sequential Probability Ratio Test) framework with Ξ±-spending to maintain the overall Type I error rate at 0.05. + +### Power Analysis + +For CTR (baseline 8%, MDE 0.5%): + +``` +Required sample size β‰ˆ 10,000 users per variant +(two-tailed, Ξ±=0.05, power=0.80) +``` + +### Variance Reduction + +**CUPED** (Controlled-experiment Using Pre-Experiment Data) is applied to reduce variance and improve sensitivity: + +``` +Y_cuped = Y_post - ΞΈ Γ— (Y_pre - E[Y_pre]) +ΞΈ = Cov(Y_post, Y_pre) / Var(Y_pre) +``` + +Pre-experiment metric: same user's CTR from the 7 days prior to experiment start. + +--- + +## Experiment Lifecycle + +```mermaid +flowchart LR + H["1. Hypothesis\n(What & Why)"] --> D["2. Design\n(Metrics, Power, Duration)"] + D --> I["3. Implementation\n(Feature flag, traffic split)"] + I --> M["4. Monitoring\n(Daily metric review)"] + M --> A["5. Analysis\n(Statistical tests)"] + A --> Dec{"6. Decision"} + Dec -- "Winner" --> Roll["7a. Rollout\n(100% traffic)"] + Dec -- "Inconclusive" --> Doc["7b. Document\n& Archive"] + Dec -- "Loser" --> Kill["7c. Kill\n& Post-mortem"] + Roll --> Docs2["8. Documentation\n(Impact report)"] + Kill --> Docs2 +``` + +### Stage 1: Hypothesis + +Document the hypothesis before any implementation: +- **Problem**: What user pain point or opportunity is being addressed? 
+- **Hypothesis**: "We believe that [change X] will [achieve outcome Y] for [user segment Z]" +- **Expected impact**: Estimated CTR lift, revenue impact +- **Risks**: Potential negative effects to guardrail metrics + +### Stage 2: Design + +| Decision | Options | +|----------|---------| +| Control vs. treatment split | 50/50 (default), 80/20 (for risky changes) | +| User segments targeted | All users, cold_start only, power_users only, etc. | +| Experiment layer | algorithm / ui / business_rules / features | +| Primary metric | CTR / Conversion Rate / Revenue per User | +| Duration | 7–14 days | + +### Stage 3: Implementation + +- **Feature flags**: All experiment branches behind a feature flag service +- **Traffic assignment**: Hash-based, server-side, logged with every event +- **Event tracking**: Every impression and interaction includes `experiment_variants` metadata +- **Instrumentation review**: QA sign-off on tracking before experiment starts + +### Stage 4: Monitoring + +Daily monitoring during experiment: +- Primary and secondary metrics per variant +- Guardrail metric alerts (auto-pause if guardrail breached) +- Sample ratio mismatch (SRM) check: Chi-squared test on traffic distribution +- Data quality checks: null rates, logging errors + +### Stage 5: Analysis + +At experiment conclusion: +1. Run statistical tests (mSPRT / standard t-test depending on stopping reason) +2. Apply Bonferroni correction for multiple metrics +3. Check heterogeneous treatment effects (by user segment, device, geo) +4. 
Calculate practical significance alongside statistical significance + +### Stage 6: Decision Framework + +| Outcome | Condition | Action | +|---------|-----------|--------| +| **Ship** | Statistically significant positive on primary metric AND no guardrail regression | Full rollout, then document | +| **Iterate** | Directional positive but insufficient power | Extend or redesign experiment | +| **Hold** | Significant on primary but guardrail regression | Investigate before decision | +| **Kill** | Negative or neutral on primary metric | Document learnings, archive | + +--- + +## Infrastructure + +### Feature Flag Service + +- All experiment branches controlled via feature flags +- Flags stored in Redis with real-time update propagation +- SDK available for Python (serving layer) and JavaScript (frontend) + +### Experiment Config Example + +```json +{ + "experiment_id": "exp_ranking_v3_diversity", + "name": "MMR Diversity Weight Tuning v3", + "layer": "algorithm", + "status": "running", + "start_date": "2024-03-10", + "end_date": "2024-03-24", + "variants": [ + { + "id": "control", + "name": "Current MMR (Ξ»=0.7)", + "traffic_pct": 50, + "config": { "mmr_lambda": 0.7, "category_max_ratio": 0.4 } + }, + { + "id": "treatment_a", + "name": "Higher Diversity (Ξ»=0.6)", + "traffic_pct": 50, + "config": { "mmr_lambda": 0.6, "category_max_ratio": 0.35 } + } + ], + "primary_metric": "conversion_rate", + "secondary_metrics": ["ctr", "avg_restaurants_viewed", "diversity_score"], + "guardrails": ["page_load_time_p99", "error_rate"], + "target_segments": ["active", "power_user"], + "min_samples_per_variant": 10000, + "owner": "ml-platform-team" +} +``` + +### Results Tracking + +Experiment results are stored in BigQuery and surfaced via: +- **Metabase dashboard**: Real-time metric visualization per variant +- **Automated report**: Generated at experiment end; includes statistical analysis, confidence intervals, and recommendation +- **Experiment registry**: All experiments 
documented with hypothesis, results, and decision rationale diff --git a/docs/modules/07-observability.md b/docs/modules/07-observability.md new file mode 100644 index 0000000..e99cf79 --- /dev/null +++ b/docs/modules/07-observability.md @@ -0,0 +1,289 @@ +# Module 07 β€” Observability + +## Table of Contents + +- [Overview](#overview) +- [System Monitoring](#system-monitoring) +- [ML Monitoring](#ml-monitoring) +- [Business Monitoring](#business-monitoring) +- [Logging](#logging) +- [Alerting Rules](#alerting-rules) +- [Dashboards](#dashboards) + +--- + +## Overview + +Observability covers three pillars: **metrics**, **logs**, and **traces**. The system is monitored at three layers: + +1. **System Monitoring** β€” infrastructure and service-level health (Prometheus + Grafana) +2. **ML Monitoring** β€” model quality, feature drift, and prediction distribution (Evidently AI) +3. **Business Monitoring** β€” revenue attribution, CTR, user satisfaction (Metabase) + +```mermaid +graph TB + subgraph Services["Services & Infrastructure"] + RecSvc["Recommendation Service"] + FeatureStore["Feature Store"] + ModelServing["Model Serving (Triton)"] + Kafka["Kafka / Flink"] + Redis["Redis"] + PG["PostgreSQL"] + end + + subgraph Collection["Collection Layer"] + PrometheusExp["Prometheus Exporters\n(/metrics endpoints)"] + OpenTelemetry["OpenTelemetry SDK\n(traces + logs)"] + EventSink["Event Sink\n(BigQuery / S3)"] + end + + subgraph Backends["Observability Backends"] + Prometheus["Prometheus\n(metrics storage)"] + Loki["Grafana Loki\n(log storage)"] + Jaeger["Jaeger\n(trace storage)"] + BigQuery["BigQuery\n(analytics)"] + Evidently["Evidently AI\n(ML drift)"] + end + + subgraph Visualization["Visualization & Alerting"] + Grafana["Grafana\n(dashboards + alerts)"] + Metabase["Metabase\n(business BI)"] + PagerDuty["PagerDuty\n(on-call alerting)"] + end + + Services --> Collection + Collection --> Backends + Backends --> Visualization + Grafana --> PagerDuty +``` + +--- + +## 
System Monitoring + +### Key Metrics + +#### API / Serving Layer + +| Metric | PromQL Key | Description | Alert Threshold | +|--------|-----------|-------------|----------------| +| Request rate | `rate(http_requests_total[5m])` | Requests per second by endpoint | β€” (informational) | +| Latency p50 | `histogram_quantile(0.50, http_duration_seconds)` | Median response time | >60ms | +| Latency p99 | `histogram_quantile(0.99, http_duration_seconds)` | 99th percentile response time | >200ms | +| Error rate | `rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])` | 5xx error rate | >1% | +| Cache hit ratio | `rate(cache_hits_total[5m]) / rate(cache_requests_total[5m])` | Redis cache hit rate | <70% | + +#### ML Inference + +| Metric | Description | Alert Threshold | +|--------|-------------|----------------| +| Model inference latency p99 | Triton inference time | >50ms | +| Model throughput (requests/s) | Triton requests per second | β€” | +| GPU utilization | Per GPU node utilization | >90% sustained | +| Inference error rate | Failed inference requests | >0.5% | +| Queue depth | Pending inference requests in Triton queue | >100 | + +#### Feature Store + +| Metric | Description | Alert Threshold | +|--------|-------------|----------------| +| Feature fetch latency p99 | Redis GET latency | >10ms | +| Feature cache miss rate | Redis keyspace misses / total | >20% | +| Redis memory utilization | Used memory / max memory | >85% | +| Feature staleness | Age of oldest feature in online store | >8h | + +#### Kafka / Flink + +| Metric | Description | Alert Threshold | +|--------|-------------|----------------| +| Consumer lag | Records behind latest offset | >10,000 events | +| Processing throughput | Events processed per second | β€” | +| Checkpoint duration | Flink checkpoint completion time | >30s | +| Kafka broker disk usage | Disk used per broker | >80% | + +--- + +## ML Monitoring + +Managed via **Evidently AI** with reports generated hourly 
and stored in S3.
+
+### Prediction Distribution Monitoring
+
+| Check | Metric | Alert Condition |
+|-------|--------|----------------|
+| Score distribution shift | Kullback-Leibler divergence | KL > 0.1 |
+| Mean score drift | Moving average of recommendation scores | >5% change in 24h window |
+| Rank position distribution | Distribution of top-1 restaurant positions | Significant shift |
+| Score range | Min/max recommendation score | Outside [0.05, 0.99] |
+
+### Feature Drift Detection (PSI)
+
+Population Stability Index (PSI) is computed for all features daily:
+
+```
+PSI = Σ (actual_pct - expected_pct) × ln(actual_pct / expected_pct)
+```
+
+| PSI Range | Interpretation | Action |
+|-----------|---------------|--------|
+| PSI < 0.1 | No significant change | No action |
+| 0.1 ≤ PSI < 0.2 | Minor change | Monitor closely |
+| PSI ≥ 0.2 | Significant drift | Alert + investigate retraining |
+
+### Online ML Metrics (Proxy)
+
+| Metric | Description | Computation | Target |
+|--------|-------------|-------------|--------|
+| **NDCG@10 (proxy)** | NDCG estimated from click positions | Daily batch compute | ≥0.43 |
+| **Coverage** | % of catalog appearing in top-10 recommendations | Daily batch | ≥28% |
+| **Popularity bias** | Ratio of recommendations for top-10% restaurants | Daily | <60% |
+| **Novelty score** | % recommendations user has never seen before | Daily | ≥40% |
+
+### Drift Dashboard (Grafana)
+
+The ML drift Grafana dashboard shows:
+- PSI heatmap per feature (feature × time)
+- Prediction score distribution over rolling 7 days
+- Online NDCG proxy trend
+- Feature importance change vs. previous week (SHAP values)
+
+---
+
+## Business Monitoring
+
+Managed via **Metabase** with dashboards refreshed every 4 hours from BigQuery.
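As an illustration of the alert logic in this layer, here is a minimal sketch of the "CTR drops >20% from 7d average" business alert. The function name and inputs are hypothetical; the production check runs as a scheduled query against BigQuery aggregates.

```python
def ctr_drop_alert(today_ctr, trailing_7d, max_drop=0.20):
    """Fire when today's CTR falls more than `max_drop` (relative) below
    the trailing 7-day average CTR."""
    baseline = sum(trailing_7d) / len(trailing_7d)
    return today_ctr < baseline * (1.0 - max_drop)

# Baseline ~8% CTR: a day at 6% is a 25% relative drop, so the alert fires.
print(ctr_drop_alert(0.06, [0.08, 0.081, 0.079, 0.08, 0.082, 0.078, 0.08]))  # True
```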
+ +### Business Metrics + +| Metric | Definition | Refresh | +|--------|-----------|---------| +| **CTR by placement** | Clicks / Impressions by page section (homepage, nearby, similar) | 4h | +| **Recommendation revenue attribution** | Orders originating from recommendation clicks | Daily | +| **Influenced conversion rate** | Orders / Users exposed to recommendations | Daily | +| **Avg session satisfaction score** | Avg post-session rating (voluntary feedback) | Daily | +| **Discovery rate** | % of orders at restaurants user never previously ordered from | Daily | +| **Repeat restaurant rate** | % of orders at previously visited restaurants | Daily | + +### Business Alerts + +| Alert | Threshold | Notification | +|-------|-----------|-------------| +| CTR drops >20% from 7d average | Daily check | Slack #product-alerts | +| Conversion rate drops >10% | Hourly check | Slack + PagerDuty (severity 2) | +| Revenue attribution anomaly | Statistical outlier detection | Slack #business-alerts | + +--- + +## Logging + +### Structured Log Format + +Every recommendation API request produces a structured log entry: + +```json +{ + "timestamp": "2024-03-15T14:32:07.456Z", + "level": "INFO", + "service": "recommendation-service", + "version": "2.4.1", + "request_id": "req_01HXZ789ABC", + "user_id": "usr_9f3a8b2c1d4e5f6a", + "session_id": "sess_7e2f1a3b4c5d6e7f", + "endpoint": "/api/v1/recommendations/homepage", + "http_status": 200, + "model_version": "ranking_v2.4.1", + "experiment_variants": { + "ranking_model": "treatment_a", + "diversity_weight": "control" + }, + "pipeline": { + "candidates_generated": 847, + "candidates_after_geo_filter": 312, + "candidates_ranked": 312, + "final_results": 20 + }, + "latency_breakdown_ms": { + "feature_fetch": 3, + "candidate_generation": 11, + "restaurant_feature_fetch": 4, + "scoring": 14, + "reranking": 4, + "serialization": 2, + "total": 38 + }, + "cache": { + "l1_cdn_hit": false, + "l2_redis_hit": false, + "fallback_used": false + } 
+} +``` + +### Log Retention Policy + +| Tier | Storage | Retention | Access | +|------|---------|-----------|--------| +| **Hot** | Elasticsearch / Loki | 90 days | Real-time query via Kibana/Grafana | +| **Warm** | S3 (Parquet, compressed) | 12 months | Query via Athena / BigQuery | +| **Cold** | S3 Glacier | 3 years | Restore on demand (regulatory) | + +### Log Levels + +| Level | Condition | Includes | +|-------|-----------|---------| +| `ERROR` | 5xx response, fallback used, feature store unavailable | Full request context, stack trace | +| `WARN` | Latency SLA miss, cache miss rate spike, partial feature fetch | Request context, metric values | +| `INFO` | Every successful recommendation request | Full structured log (see above) | +| `DEBUG` | Detailed candidate scoring (disabled in production) | Per-candidate scores, feature values | + +--- + +## Alerting Rules + +### Severity Levels + +| Level | Response Time | Channels | +|-------|--------------|----------| +| **P1 (Critical)** | Immediate (24/7) | PagerDuty + Slack + SMS | +| **P2 (High)** | 15 min (business hours), 30 min (off-hours) | PagerDuty + Slack | +| **P3 (Medium)** | Next business day | Slack #alerts | +| **P4 (Low)** | Weekly review | Slack #metrics-review | + +### Alert Rules + +| Alert | Condition | Severity | Runbook | +|-------|-----------|----------|---------| +| High API error rate | error_rate > 1% for 5min | P1 | runbook/api-errors | +| API latency SLA breach | p99 > 300ms for 5min | P1 | runbook/latency | +| Model serving down | inference_error_rate = 100% | P1 | runbook/model-serving | +| Redis memory critical | memory_utilization > 90% | P1 | runbook/redis | +| Kafka consumer lag | consumer_lag > 50,000 for 10min | P2 | runbook/kafka-lag | +| Feature staleness | feature_age_hours > 8 | P2 | runbook/feature-store | +| ML drift detected | PSI > 0.2 on any key feature | P2 | runbook/ml-drift | +| Low cache hit rate | cache_hit_ratio < 60% for 1h | P3 | runbook/caching | +| NDCG proxy 
decline | ndcg_proxy < 0.40 for 24h | P3 | runbook/model-quality | +| Airflow DAG failure | DAG past SLA deadline | P2 | runbook/airflow | + +--- + +## Dashboards + +### Grafana Dashboard Suite + +| Dashboard | Audience | Key Panels | +|-----------|---------|-----------| +| **Recommendation Service Overview** | Engineering | RPS, latency p50/p99/p999, error rate, cache hit rate | +| **ML Inference Performance** | ML Team | Triton throughput, GPU utilization, inference latency, queue depth | +| **Feature Store Health** | ML Team | Redis memory, fetch latency, staleness per feature group | +| **Kafka / Flink Pipeline** | Data Engineering | Consumer lag, throughput, checkpoint success rate | +| **ML Drift Monitor** | ML Team | PSI heatmap, prediction distribution, NDCG proxy trend | +| **Infrastructure Overview** | SRE | Pod CPU/memory, node health, disk usage, network I/O | + +### Metabase Business Dashboards + +| Dashboard | Audience | Refresh | +|-----------|---------|---------| +| **Recommendation Performance** | Product / Business | 4h | +| **A/B Experiment Results** | Product / ML | 4h | +| **User Engagement Funnel** | Product | Daily | +| **Revenue Attribution** | Business / Finance | Daily | diff --git a/docs/modules/08-data-pipeline.md b/docs/modules/08-data-pipeline.md new file mode 100644 index 0000000..6f1e1de --- /dev/null +++ b/docs/modules/08-data-pipeline.md @@ -0,0 +1,272 @@ +# Module 08 β€” Data Pipeline + +## Table of Contents + +- [Overview](#overview) +- [Daily Batch Pipelines](#daily-batch-pipelines) +- [Real-time Streaming Pipelines](#real-time-streaming-pipelines) +- [Pipeline DAGs](#pipeline-dags) +- [SLAs and Monitoring](#slas-and-monitoring) +- [Failure Handling & Retry Strategy](#failure-handling--retry-strategy) + +--- + +## Overview + +The Data Pipeline module orchestrates all data movement and transformation workflows across the system. It consists of two types of pipelines: + +1. 
**Batch Pipelines** β€” scheduled daily/weekly jobs orchestrated by Apache Airflow +2. **Real-time Streaming Pipelines** β€” continuously running jobs using Apache Kafka + Apache Flink + +```mermaid +graph TB + subgraph BatchLayer["Batch Layer (Apache Airflow)"] + UserRebuild["user_profile_rebuild\nDaily 2:00 AM"] + CatalogRefresh["restaurant_catalog_refresh\nDaily 1:00 AM"] + FeatureEng["feature_engineering\nDaily 3:00 AM"] + OnlineWarmup["online_store_warmup\nDaily 6:30 AM"] + ModelRetrain["model_retraining\nWeekly Sunday 3:00 AM"] + EmbeddingRebuild["embedding_index_rebuild\nWeekly Sunday 6:00 AM"] + end + + subgraph StreamingLayer["Streaming Layer (Kafka + Flink)"] + EventIngest["Event Ingestion\n(always-on)"] + SessionAgg["Session Aggregation\n(always-on)"] + FeatureUpdate["Real-time Feature Update\n(always-on)"] + RecTrigger["Recommendation Trigger\n(event-driven)"] + end + + subgraph Storage["Storage"] + S3["S3 / GCS\n(offline store)"] + BQ["BigQuery\n(analytics)"] + Redis["Redis\n(online store)"] + Milvus["Milvus\n(vector index)"] + end + + BatchLayer --> Storage + StreamingLayer --> Redis + Storage --> ModelRetrain + Storage --> EmbeddingRebuild +``` + +--- + +## Daily Batch Pipelines + +### Pipeline Schedule Summary + +| Pipeline | Schedule (UTC) | SLA Deadline | Owner | Dependencies | +|----------|---------------|-------------|-------|-------------| +| `restaurant_catalog_refresh` | Daily 1:00 AM | 4:00 AM | Catalog Team | External APIs, PostgreSQL | +| `user_profile_rebuild` | Daily 2:00 AM | 5:00 AM | ML Platform | Event log (S3), Orders (PG) | +| `feature_engineering` | Daily 3:00 AM | 6:00 AM | ML Platform | user_profile_rebuild | +| `online_store_warmup` | Daily 6:30 AM | 7:30 AM | Platform Team | feature_engineering | +| `model_retraining` | Weekly Sunday 3:00 AM | 8:00 AM | ML Platform | feature_engineering (latest) | +| `embedding_index_rebuild` | Weekly Sunday 6:00 AM | 10:00 AM | ML Platform | model_retraining | +| `analytics_export` | Daily 
7:00 AM | 9:00 AM | Data Engineering | feature_engineering | + +--- + +## Pipeline DAGs + +### `user_profile_rebuild` DAG + +**Schedule**: Daily 2:00 AM UTC +**SLA**: 5:00 AM UTC +**Duration**: ~2.5h + +```mermaid +flowchart LR + A["extract_raw_events\n(S3 Parquet read)"] --> B["validate_event_schema\n(Great Expectations)"] + A2["extract_orders\n(PostgreSQL)"] --> C["compute_interaction_features\n(Spark)"] + B --> C + A3["extract_user_registrations\n(PostgreSQL)"] --> D["compute_demographic_features\n(Spark)"] + C --> E["compute_cuisine_affinities\n(Spark)\nTop-5 cuisines with affinity score"] + C --> F["compute_rfm_features\n(Spark)\nRecency, Frequency, Monetary"] + D & E & F --> G["join_user_features\n(Spark)"] + G --> H["write_offline_store\n(S3 Parquet + BigQuery)"] + H --> I["validate_output\n(Row count, null rate checks)"] + I --> J["notify_downstream\n(Airflow sensor trigger)"] +``` + +**Key Tasks**: +- `extract_raw_events`: Reads last 90 days of events from S3, partitioned by `dt` +- `compute_interaction_features`: Aggregates view/click/order counts, weighted interaction score per restaurant-user pair +- `compute_cuisine_affinities`: Top-5 cuisines from interactions (weighted: orderΓ—1.0, clickΓ—0.6, viewΓ—0.3) +- `compute_rfm_features`: Recency (days since last order), Frequency (orders/month), Monetary (avg spend) +- `validate_output`: Checks null rates <2%, row counts within Β±5% of previous day + +--- + +### `restaurant_catalog_refresh` DAG + +**Schedule**: Daily 1:00 AM UTC +**SLA**: 4:00 AM UTC +**Duration**: ~2h + +```mermaid +flowchart LR + A["pull_partner_updates\n(REST API / webhooks)"] --> D["merge_catalog_changes\n(PostgreSQL UPSERT)"] + B["pull_google_places\n(Google Places API)"] --> C["normalize_external_data\n(Spark)"] + C --> D + D --> E["enrich_nlp_tags\n(BERT NLP service)"] + D --> F["compute_text_embeddings\n(Sentence-BERT)"] + D --> G["compute_image_embeddings\n(ResNet-50)"] + E & F & G --> H["update_restaurant_features\n(S3 Parquet)"] 
+ H --> I["update_sentiment_scores\n(DistilBERT reviews)"] + I --> J["write_catalog_snapshot\n(PostgreSQL + S3)"] + J --> K["validate_catalog\n(count, schema, geo checks)"] +``` + +--- + +### `feature_engineering` DAG + +**Schedule**: Daily 3:00 AM UTC (after `user_profile_rebuild`) +**SLA**: 6:00 AM UTC +**Duration**: ~2.5h + +```mermaid +flowchart LR + A["wait_user_profile_rebuild\n(Airflow ExternalTaskSensor)"] --> C + B["wait_catalog_refresh\n(Airflow ExternalTaskSensor)"] --> C + C["compute_cross_features\n(Spark)\nΒ· userΓ—cuisine affinity\nΒ· userΓ—price match\nΒ· geo distance features"] --> D["compute_context_features\n(Spark)\nΒ· time-of-day features\nΒ· day-of-week features"] + D --> E["compute_restaurant_dynamic\n(Spark)\nΒ· review_recency_score\nΒ· popularity decay\nΒ· trend score"] + C & D & E --> F["join_training_features\n(Spark)"] + F --> G["write_feature_store\n(S3 Parquet + BigQuery)"] + G --> H["validate_features\n(PSI vs previous day, null rates)"] +``` + +--- + +### `online_store_warmup` DAG + +**Schedule**: Daily 6:30 AM UTC (after `feature_engineering`) +**SLA**: 7:30 AM UTC +**Duration**: ~45min + +```mermaid +flowchart LR + A["wait_feature_engineering\n(ExternalTaskSensor)"] --> B["export_user_features\n(Spark β†’ Redis pipeline)"] + A --> C["export_restaurant_features\n(Spark β†’ Redis pipeline)"] + A --> D["export_popularity_scores\n(β†’ Redis Sorted Sets)"] + B & C & D --> E["validate_online_store\n(sample spot-check latency)"] + E --> F["notify_serving_team\n(Slack notification)"] +``` + +--- + +### `model_retraining` DAG (Weekly) + +**Schedule**: Weekly, Sunday 3:00 AM UTC +**SLA**: Sunday 8:00 AM UTC +**Duration**: ~4.5h + +```mermaid +flowchart LR + A["extract_training_data\n(last 28d, S3 + BigQuery)"] --> B["construct_training_samples\n(Spark)\n+/- sampling, label assignment, IPS weights"] + B --> C["train_two_tower_cf\n(PyTorch, 8Γ— GPU)\n~3h"] + B --> D["train_lightgbm_ranker\n(CPU cluster)\n~1h"] + B --> 
E["train_deepfm_ranker\n(TF, 4Γ— GPU)\n~2h"] + C & D & E --> F["evaluate_all_models\n(Holdout set)\nAUC-ROC, NDCG@10, MAP@10"] + F --> G{All metrics\npass thresholds?} + G -- Yes --> H["register_to_mlflow\n(version, artifacts, metrics)"] + G -- No --> I["alert_ml_team\n(PagerDuty P2)"] + H --> J["create_canary_deployment\n(5% traffic, 2h)"] +``` + +--- + +## Real-time Streaming Pipelines + +### Event Ingestion Pipeline + +**Technology**: Kafka (producer) + Flink (consumer) +**Throughput**: ~10,000 events/second (peak) +**Latency SLA**: End-to-end <5s from event to Redis update + +```mermaid +flowchart LR + SDK["Mobile/Web SDK\n(batch flush 500ms)"] --> KafkaRaw["Kafka:\nuser-events-raw"] + KafkaRaw --> FlinkValidate["Flink:\nEvent Validation\nΒ· Schema check\nΒ· Dedup by event_id\nΒ· Late event handling (5min)"] + FlinkValidate --> FlinkEnrich["Flink:\nEvent Enrichment\nΒ· Resolve restaurant metadata\nΒ· Add geohash context\nΒ· Lookup user segment"] + FlinkEnrich --> KafkaEnriched["Kafka:\nuser-events-enriched"] + KafkaEnriched --> AnalyticsSink["BigQuery\n(analytics, ~5min lag)"] + KafkaEnriched --> SessionFlink["Flink:\nSession Aggregation"] +``` + +### Session Aggregation Pipeline + +```mermaid +flowchart LR + KafkaEnriched["Kafka:\nuser-events-enriched"] --> SessionWindow["Flink Session Window\n(30min idle gap)"] + SessionWindow --> UpdateSession["Update Session State\n(Flink keyed state by user_id)"] + UpdateSession --> WriteRedis["Write to Redis\n(HSET sess:{session_id})\n<5s from event"] + UpdateSession --> CheckTrigger{"Trigger\ncondition?"} + CheckTrigger -- "ORDER_PLACED\nor FAVORITE_ADDED" --> UpdateProfile["Update User Prefs\n(Redis HSET)"] + CheckTrigger -- "N=5 clicks" --> InvalidateCache["Invalidate Rec Cache\n(Redis DEL rec:{user_id}:*)"] + CheckTrigger -- "Session end\n(idle 30min)" --> WriteSessionLog["Write Session Summary\n(S3 β†’ BigQuery)"] +``` + +--- + +## SLAs and Monitoring + +### Batch Pipeline SLA Dashboard + +| Pipeline | Scheduled | 
SLA | P50 Duration | P99 Duration | SLA Miss Alert | +|----------|-----------|-----|-------------|-------------|----------------| +| restaurant_catalog_refresh | 01:00 | 04:00 | 1h 45m | 2h 30m | PagerDuty P2 | +| user_profile_rebuild | 02:00 | 05:00 | 2h 20m | 2h 50m | PagerDuty P2 | +| feature_engineering | 03:00 | 06:00 | 2h 10m | 2h 45m | PagerDuty P2 | +| online_store_warmup | 06:30 | 07:30 | 40m | 55m | PagerDuty P2 | +| model_retraining (weekly) | 03:00 | 08:00 | 4h 20m | 5h 15m | PagerDuty P2 | + +### Streaming Pipeline SLAs + +| Pipeline | Latency SLA | Throughput SLA | Consumer Lag Alert | +|----------|-------------|---------------|-------------------| +| Event ingestion | <5s end-to-end | >8,000 ev/s | >10,000 records | +| Session aggregation | <3s to Redis update | >8,000 ev/s | >10,000 records | +| Analytics sink | <5min to BigQuery | β€” | >100,000 records | + +--- + +## Failure Handling & Retry Strategy + +### Batch Pipeline Retry Policy + +| Scenario | Retry Behavior | Escalation | +|----------|---------------|-----------| +| Task failure (transient) | Retry 3Γ— with exponential backoff (5min, 15min, 45min) | Alert after 2nd retry | +| External API unavailable (catalog) | Retry 5Γ— with 10min intervals; use cached data if all retries fail | PagerDuty P2 | +| SLA miss | Alert sent; pipeline continues (no auto-retry of full DAG) | PagerDuty P2 | +| Upstream DAG late | Downstream waits up to 30min past SLA before alerting | Slack notification | +| Data quality check failure | Pipeline halts; does not update downstream stores | PagerDuty P1 | + +### Streaming Pipeline Failure Handling + +| Failure Type | Behavior | Recovery | +|-------------|----------|---------| +| Flink job crash | Restart from last successful checkpoint (checkpoint interval: 30s) | Auto-restart by Kubernetes | +| Kafka broker unavailable | Consumer pauses; resumes when broker recovers; processes backlog | Auto-recovery | +| Redis write failure | Retry 3Γ—; log failure if all retries 
fail; feature staleness alert | Manual intervention |
+| Late events (>5min) | Route to `user-events-late` topic; processed in hourly correction job | Automated correction job |
+| Duplicate events | Deduplicated within 60s window using `event_id`; late deduplication in correction job | Automated |
+
+### Airflow DAG Configuration
+
+```python
+from datetime import timedelta
+
+# Assumed: pagerduty_alert / pagerduty_sla_alert are team-defined
+# callback functions (not shown here).
+default_args = {
+    "owner": "ml-platform-team",
+    "retries": 3,
+    "retry_delay": timedelta(minutes=5),
+    "retry_exponential_backoff": True,
+    "max_retry_delay": timedelta(minutes=60),
+    "email_on_failure": True,
+    "email_on_retry": False,
+    "sla": timedelta(hours=3),
+    "on_failure_callback": pagerduty_alert,
+    "on_sla_miss_callback": pagerduty_sla_alert,
+}
+```
diff --git a/docs/schemas/restaurant-catalog.md b/docs/schemas/restaurant-catalog.md
new file mode 100644
index 0000000..66ba1ab
--- /dev/null
+++ b/docs/schemas/restaurant-catalog.md
@@ -0,0 +1,455 @@
+# Restaurant Catalog Schema
+
+## Table of Contents
+
+- [Overview](#overview)
+- [Schema Definition](#schema-definition)
+- [Field Reference](#field-reference)
+- [JSON Example](#json-example)
+- [Cuisine Taxonomy](#cuisine-taxonomy)
+- [Price Tier Reference](#price-tier-reference)
+
+---
+
+## Overview
+
+The restaurant catalog is the core entity in the system. Each restaurant record combines structured operational data with ML-generated enrichments (embeddings, NLP tags, sentiment scores).
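The schema marks `operations.is_open_now` as a field computed at query time rather than stored. A minimal sketch of that computation — assuming the caller has already selected the `DayHours` entry for the current weekday and localized `now` to the restaurant's `timezone` (the production logic is not specified here):

```python
from datetime import datetime, time

def is_open_now(day_hours: dict, now: datetime) -> bool:
    """Compute `operations.is_open_now` at query time.

    `day_hours` is the DayHours object for `now`'s weekday, e.g.
    {"open": "07:00", "close": "22:00"} or {"closed": True}; `now` is
    assumed to already be localized to the restaurant's timezone.
    """
    if day_hours.get("closed"):
        return False
    open_t = time.fromisoformat(day_hours["open"])
    close_t = time.fromisoformat(day_hours["close"])
    t = now.time()
    if close_t <= open_t:
        # Overnight span, e.g. 18:00-02:00: open wraps past midnight.
        return t >= open_t or t < close_t
    return open_t <= t < close_t
```

Because the schema constrains hours to an `HH:MM` pattern, `time.fromisoformat` suffices for parsing; overnight spans (close before open) are treated as wrapping into the next day.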
+ +- **Storage**: PostgreSQL (primary), S3 Parquet (offline feature store), Redis (online feature store) +- **Update mechanism**: Daily batch refresh + real-time webhooks from partners +- **Unique identifier**: `restaurant_id` (stable UUID, never reused) + +--- + +## Schema Definition + +```yaml +Restaurant: + type: object + required: + - restaurant_id + - name + - location + - cuisine + - operations + - pricing + - status + properties: + + # --- Basic Info --- + restaurant_id: + type: string + description: Globally unique restaurant identifier + example: "rst_3b7c9e1a2f4d5e6b" + + name: + type: string + maxLength: 200 + example: "Pho Saigon House" + + description: + type: string + maxLength: 2000 + description: Free-text description; used for NLP tag generation and text embedding + + phone: + type: string + example: "+84-28-3812-3456" + + website: + type: string + format: uri + nullable: true + example: "https://phosaigonhouse.vn" + + # --- Location --- + location: + type: object + required: [address, city, country, lat, lng, geohash] + properties: + address: + type: string + example: "123 Nguyen Hue Boulevard" + district: + type: string + example: "District 1" + city: + type: string + example: "Ho Chi Minh City" + country: + type: string + format: iso-3166-1-alpha-2 + example: "VN" + lat: + type: number + minimum: -90 + maximum: 90 + example: 10.7769 + lng: + type: number + minimum: -180 + maximum: 180 + example: 106.7009 + geohash: + type: string + description: Geohash precision 7 (~153m Γ— 153m) + example: "w3gv2c5" + + # --- Cuisine --- + cuisine: + type: object + required: [primary_cuisine, cuisine_types] + properties: + cuisine_types: + type: array + items: + type: string + description: All applicable cuisine types (from taxonomy) + example: ["vietnamese", "asian", "noodles", "soup"] + primary_cuisine: + type: string + description: Main cuisine type (single value) + example: "vietnamese" + + # --- Operations --- + operations: + type: object + properties: + 
operating_hours: + type: object + description: Opening hours per weekday (24h format, local timezone) + properties: + monday: { $ref: "#/DayHours" } + tuesday: { $ref: "#/DayHours" } + wednesday: { $ref: "#/DayHours" } + thursday: { $ref: "#/DayHours" } + friday: { $ref: "#/DayHours" } + saturday: { $ref: "#/DayHours" } + sunday: { $ref: "#/DayHours" } + timezone: + type: string + example: "Asia/Ho_Chi_Minh" + is_open_now: + type: boolean + description: Computed field at query time; not stored persistently + reservation_required: + type: boolean + default: false + accepts_walk_ins: + type: boolean + default: true + + DayHours: + type: object + properties: + open: + type: string + pattern: "^([01][0-9]|2[0-3]):[0-5][0-9]$" + example: "07:00" + close: + type: string + pattern: "^([01][0-9]|2[0-3]):[0-5][0-9]$" + example: "22:00" + closed: + type: boolean + default: false + description: Set to true for weekly closed days + + # --- Pricing --- + pricing: + type: object + properties: + price_tier: + type: string + enum: ["$", "$$", "$$$", "$$$$"] + description: "$ <$15 | $$ $15-$30 | $$$ $30-$60 | $$$$ >$60 per person" + avg_meal_price_usd: + type: number + description: Average meal price per person in USD + example: 12.50 + + # --- Ratings --- + ratings: + type: object + properties: + avg_rating: + type: number + minimum: 0 + maximum: 5 + example: 4.3 + total_reviews: + type: integer + example: 1842 + rating_breakdown: + type: object + properties: + food: + type: number + minimum: 0 + maximum: 5 + service: + type: number + minimum: 0 + maximum: 5 + ambiance: + type: number + minimum: 0 + maximum: 5 + value: + type: number + minimum: 0 + maximum: 5 + + # --- Dietary --- + dietary: + type: object + properties: + vegetarian_options: + type: boolean + vegan_options: + type: boolean + halal: + type: boolean + gluten_free: + type: boolean + allergen_info: + type: array + items: + type: string + description: Known allergens present in menu items + example: ["peanuts", 
"shellfish", "gluten"] + + # --- Ambiance --- + ambiance: + type: object + properties: + tags: + type: array + items: + type: string + enum: + - romantic + - family-friendly + - business + - casual + - fine-dining + - quick-bite + - outdoor + - live-music + - rooftop + - pet-friendly + description: Ambiance descriptors (ML-tagged from reviews + manual) + + # --- Capacity --- + capacity: + type: object + properties: + total_seats: + type: integer + example: 80 + private_rooms: + type: integer + default: 0 + outdoor_seating: + type: boolean + default: false + parking: + type: boolean + default: false + + # --- Media --- + media: + type: object + properties: + photos: + type: array + items: + type: string + format: uri + description: CDN URLs for restaurant photos + menu_url: + type: string + format: uri + nullable: true + + # --- ML Embeddings (internal) --- + embeddings: + type: object + description: ML-generated embeddings; not returned in API responses + properties: + text_embedding: + type: array + items: + type: number + minItems: 768 + maxItems: 768 + description: Sentence-BERT embedding (768d) from name + description + cuisine tags + image_embedding: + type: array + items: + type: number + minItems: 512 + maxItems: 512 + description: ResNet-50 embedding (512d) from primary photo + + # --- Metadata --- + metadata: + type: object + properties: + popularity_score: + type: number + minimum: 0 + maximum: 1 + description: Time-decayed popularity score (updated hourly) + status: + type: string + enum: [active, temporarily_closed, permanently_closed, pending_review] + data_sources: + type: array + items: + type: string + example: ["partner_api", "google_places"] + created_at: + type: string + format: date-time + updated_at: + type: string + format: date-time +``` + +--- + +## Field Reference + +| Field | Type | Required | Description | +|-------|------|----------|-------------| +| `restaurant_id` | string | βœ… | Unique stable identifier | +| `name` | string | βœ… | 
Restaurant display name | +| `location.lat` | float | βœ… | Latitude (WGS84) | +| `location.lng` | float | βœ… | Longitude (WGS84) | +| `location.geohash` | string | βœ… | Geohash precision 7 for ANN proximity | +| `cuisine.primary_cuisine` | string | βœ… | Single primary cuisine from taxonomy | +| `cuisine.cuisine_types` | string[] | βœ… | All applicable cuisines | +| `operations.is_open_now` | bool | Computed | Computed at query time, not stored | +| `pricing.price_tier` | enum | βœ… | $, $$, $$$, $$$$ | +| `ratings.avg_rating` | float | β€” | Updated daily from review ingestion | +| `dietary.*` | bool | β€” | Explicit dietary flags | +| `embeddings.text_embedding` | float[768] | Internal | Not exposed in API responses | +| `metadata.popularity_score` | float | β€” | Updated hourly by batch job | + +--- + +## JSON Example + +```json +{ + "restaurant_id": "rst_3b7c9e1a2f4d5e6b", + "name": "Pho Saigon House", + "description": "Authentic Vietnamese pho and street food in a casual setting. 
Known for our rich, slow-cooked bone broth and fresh herbs.", + "phone": "+84-28-3812-3456", + "website": "https://phosaigonhouse.vn", + "location": { + "address": "123 Nguyen Hue Boulevard", + "district": "District 1", + "city": "Ho Chi Minh City", + "country": "VN", + "lat": 10.7769, + "lng": 106.7009, + "geohash": "w3gv2c5", + "timezone": "Asia/Ho_Chi_Minh" + }, + "cuisine": { + "cuisine_types": ["vietnamese", "asian", "noodles", "soup"], + "primary_cuisine": "vietnamese" + }, + "operations": { + "operating_hours": { + "monday": { "open": "07:00", "close": "22:00" }, + "tuesday": { "open": "07:00", "close": "22:00" }, + "wednesday": { "open": "07:00", "close": "22:00" }, + "thursday": { "open": "07:00", "close": "22:00" }, + "friday": { "open": "07:00", "close": "23:00" }, + "saturday": { "open": "08:00", "close": "23:00" }, + "sunday": { "open": "08:00", "close": "21:00" } + }, + "timezone": "Asia/Ho_Chi_Minh", + "is_open_now": true, + "reservation_required": false, + "accepts_walk_ins": true + }, + "pricing": { + "price_tier": "$$", + "avg_meal_price_usd": 12.50 + }, + "ratings": { + "avg_rating": 4.3, + "total_reviews": 1842, + "rating_breakdown": { + "food": 4.5, + "service": 4.1, + "ambiance": 4.0, + "value": 4.6 + } + }, + "dietary": { + "vegetarian_options": true, + "vegan_options": false, + "halal": true, + "gluten_free": false, + "allergen_info": ["peanuts", "shellfish"] + }, + "ambiance": { + "tags": ["casual", "family-friendly", "quick-bite"] + }, + "capacity": { + "total_seats": 80, + "private_rooms": 0, + "outdoor_seating": true, + "parking": false + }, + "media": { + "photos": [ + "https://cdn.example.com/restaurants/rst_3b7c9e/photo_1.jpg", + "https://cdn.example.com/restaurants/rst_3b7c9e/photo_2.jpg" + ], + "menu_url": "https://cdn.example.com/restaurants/rst_3b7c9e/menu.pdf" + }, + "metadata": { + "popularity_score": 0.82, + "status": "active", + "data_sources": ["partner_api", "google_places"], + "created_at": "2022-06-01T08:00:00Z", + 
"updated_at": "2024-03-15T02:30:00Z" + } +} +``` + +--- + +## Cuisine Taxonomy + +The system uses a hierarchical cuisine taxonomy with 120+ types. Top-level categories: + +| Category | Examples | +|----------|---------| +| Asian | Vietnamese, Japanese, Chinese, Korean, Thai, Indian, Indonesian | +| European | Italian, French, Spanish, Greek, German, Mediterranean | +| American | American, Mexican, Tex-Mex, BBQ, Cajun | +| Middle Eastern | Lebanese, Turkish, Persian, Israeli | +| African | Ethiopian, Moroccan, Nigerian | +| Fusion | Asian Fusion, Modern Australian, Pan-Asian | +| Dietary-specific | Vegan, Vegetarian, Raw Food, Gluten-Free | +| Meal type | Breakfast, Brunch, Desserts, Coffee & Tea, Seafood, Steakhouse | + +--- + +## Price Tier Reference + +| Tier | Symbol | Avg Price per Person | Target Segment | +|------|--------|---------------------|----------------| +| 1 | `$` | Under $15 | Budget, casual, fast food | +| 2 | `$$` | $15 – $30 | Mid-range, family dining | +| 3 | `$$$` | $30 – $60 | Upscale casual, business lunch | +| 4 | `$$$$` | Over $60 | Fine dining, special occasions | + +Price tier is updated quarterly based on aggregated order data per restaurant. diff --git a/docs/schemas/review-rating.md b/docs/schemas/review-rating.md new file mode 100644 index 0000000..dcea321 --- /dev/null +++ b/docs/schemas/review-rating.md @@ -0,0 +1,366 @@ +# Review & Rating Schema + +## Table of Contents + +- [Overview](#overview) +- [Schema Definition](#schema-definition) +- [Field Reference](#field-reference) +- [Sentiment Analysis](#sentiment-analysis) +- [JSON Examples](#json-examples) + +--- + +## Overview + +Reviews and ratings are a critical signal for the recommendation system. They provide: +1. **Explicit quality signals** (star ratings, dimension ratings) +2. **Textual sentiment** (DistilBERT-based aspect-level sentiment analysis) +3. 
**Social signals** (helpful votes, recency) + +- **Ingestion**: Via app API + partner webhooks β†’ Kafka (`reviews-raw`) β†’ PostgreSQL +- **Sentiment enrichment**: Asynchronous DistilBERT inference (<3s), written back to review record +- **Update to catalog**: Daily batch job updates `restaurant.ratings.avg_rating` and `rating_breakdown` +- **Retention**: Reviews retained indefinitely in PostgreSQL; sentiment scores recomputed on model update + +--- + +## Schema Definition + +```yaml +Review: + type: object + required: + - review_id + - user_id + - restaurant_id + - overall_rating + - created_at + properties: + + review_id: + type: string + description: Globally unique review identifier + example: "rev_01HXZ789GHI012JKL" + + user_id: + type: string + description: Anonymized user identifier + example: "usr_9f3a8b2c1d4e5f6a" + + restaurant_id: + type: string + description: Restaurant being reviewed + example: "rst_3b7c9e1a2f4d5e6b" + + overall_rating: + type: integer + minimum: 1 + maximum: 5 + description: Overall star rating (1–5) + example: 4 + + dimension_ratings: + type: object + description: Aspect-level ratings (optional; shown on detailed review UI) + properties: + food: + type: integer + minimum: 1 + maximum: 5 + nullable: true + service: + type: integer + minimum: 1 + maximum: 5 + nullable: true + ambiance: + type: integer + minimum: 1 + maximum: 5 + nullable: true + value: + type: integer + minimum: 1 + maximum: 5 + nullable: true + + review_text: + type: string + maxLength: 5000 + nullable: true + description: Free-text review content (optional) + example: "Amazing pho! The broth is rich and the service was friendly." + + # --- ML-Enriched Fields --- + sentiment_score: + type: number + minimum: 0.0 + maximum: 1.0 + nullable: true + description: > + Overall sentiment score computed by DistilBERT model. + 0.0 = very negative, 1.0 = very positive. + Null until async enrichment completes (<5s). 
+ example: 0.84 + + aspect_sentiments: + type: object + nullable: true + description: Aspect-level sentiment scores computed from review_text + properties: + food: + type: number + minimum: 0.0 + maximum: 1.0 + nullable: true + service: + type: number + minimum: 0.0 + maximum: 1.0 + nullable: true + ambiance: + type: number + minimum: 0.0 + maximum: 1.0 + nullable: true + value: + type: number + minimum: 0.0 + maximum: 1.0 + nullable: true + + # --- Social Signals --- + helpful_votes: + type: integer + minimum: 0 + default: 0 + description: Number of users who marked this review as helpful + example: 23 + + reported_count: + type: integer + minimum: 0 + default: 0 + description: Number of times this review was reported as inappropriate + + # --- Media --- + photos: + type: array + items: + type: string + format: uri + description: CDN URLs for photos attached to the review + maxItems: 10 + + # --- Context --- + visit_date: + type: string + format: date + nullable: true + description: Date of the dining visit (user-reported) + example: "2024-03-10" + + visit_type: + type: string + enum: [dine_in, takeaway, delivery] + nullable: true + description: How the user experienced the restaurant + + party_size: + type: integer + minimum: 1 + nullable: true + description: Number of people in the dining party + + # --- Metadata --- + language: + type: string + description: Detected language of review_text (ISO 639-1) + example: "en" + + is_verified_order: + type: boolean + description: Whether the reviewer placed a verified order via the platform + default: false + + moderation_status: + type: string + enum: [pending, approved, rejected, flagged] + default: pending + description: Content moderation status + + created_at: + type: string + format: date-time + description: Timestamp when review was submitted + example: "2024-03-11T09:22:00Z" + + updated_at: + type: string + format: date-time + description: Last update timestamp (e.g., after sentiment enrichment) +``` + +--- + +## 
Field Reference + +| Field | Type | Required | Notes | +|-------|------|----------|-------| +| `review_id` | string | βœ… | Server-generated UUID | +| `user_id` | string | βœ… | Anonymized user identifier | +| `restaurant_id` | string | βœ… | Target restaurant | +| `overall_rating` | int (1–5) | βœ… | Star rating | +| `dimension_ratings.food` | int (1–5) | β€” | Optional aspect rating | +| `dimension_ratings.service` | int (1–5) | β€” | Optional aspect rating | +| `dimension_ratings.ambiance` | int (1–5) | β€” | Optional aspect rating | +| `dimension_ratings.value` | int (1–5) | β€” | Optional aspect rating | +| `review_text` | string | β€” | Max 5000 chars | +| `sentiment_score` | float [0,1] | Computed | Async enrichment, <5s | +| `aspect_sentiments.*` | float [0,1] | Computed | Async enrichment, only if review_text present | +| `helpful_votes` | int | β€” | Incremented by user votes | +| `photos` | string[] | β€” | Max 10 photos | +| `visit_date` | date | β€” | User-reported visit date | +| `is_verified_order` | bool | β€” | True if user ordered via platform | +| `moderation_status` | enum | β€” | Default: pending; auto-moderation applied | + +--- + +## Sentiment Analysis + +Reviews are enriched with sentiment scores using a fine-tuned DistilBERT model. 
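The ranges in this section's Sentiment Score Interpretation table map onto a small bucketing helper. A sketch — the half-open boundary treatment (upper bound exclusive, with 1.0 folded into the top label) is an assumption, since the table's ranges share their endpoints:

```python
def sentiment_label(score):
    """Map a DistilBERT sentiment score in [0, 1] to its label.

    Returns None while the score is null (async enrichment pending).
    Boundary handling is an assumption: each range is [low, high).
    """
    if score is None:
        return None
    buckets = [
        (0.3, "negative"),
        (0.5, "mixed-negative"),
        (0.7, "neutral"),
        (0.9, "positive"),
    ]
    for upper, label in buckets:
        if score < upper:
            return label
    return "very positive"
```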
+ +### Model Details + +| Property | Value | +|----------|-------| +| **Base model** | DistilBERT (distilbert-base-uncased) | +| **Fine-tuning dataset** | ~500K restaurant reviews (Yelp + internal) | +| **Output** | Overall score + 4 aspect scores (food, service, ambiance, value) | +| **Inference latency** | <2s per review | +| **Model update frequency** | Quarterly | + +### Sentiment Score Interpretation + +| Range | Label | Description | +|-------|-------|-------------| +| 0.0 – 0.3 | Negative | Strongly negative sentiment | +| 0.3 – 0.5 | Mixed-negative | More negative than positive | +| 0.5 – 0.7 | Neutral | Balanced or unclear | +| 0.7 – 0.9 | Positive | Generally positive | +| 0.9 – 1.0 | Very positive | Highly enthusiastic | + +### Usage in Recommendations + +- `sentiment_score` contributes to `restaurant_dynamic.sentiment_score` (daily aggregate) +- Aspect sentiments (`food_sentiment`, `service_sentiment`) are features in the ranking model +- Reviews with `sentiment_score < 0.3` and `helpful_votes > 5` are flagged for quality review + +--- + +## JSON Examples + +### Complete Review (with Sentiment Enrichment) + +```json +{ + "review_id": "rev_01HXZ789GHI012JKL", + "user_id": "usr_9f3a8b2c1d4e5f6a", + "restaurant_id": "rst_3b7c9e1a2f4d5e6b", + "overall_rating": 4, + "dimension_ratings": { + "food": 5, + "service": 4, + "ambiance": 4, + "value": 5 + }, + "review_text": "Amazing pho! The broth is rich and the service was friendly. 
A bit noisy during lunch rush but great value for money.", + "sentiment_score": 0.84, + "aspect_sentiments": { + "food": 0.95, + "service": 0.78, + "ambiance": 0.62, + "value": 0.91 + }, + "helpful_votes": 23, + "reported_count": 0, + "photos": [ + "https://cdn.example.com/reviews/rev_01HXZ789/photo_1.jpg" + ], + "visit_date": "2024-03-10", + "visit_type": "dine_in", + "party_size": 2, + "language": "en", + "is_verified_order": true, + "moderation_status": "approved", + "created_at": "2024-03-11T09:22:00Z", + "updated_at": "2024-03-11T09:22:04Z" +} +``` + +### Rating-Only Review (No Text) + +```json +{ + "review_id": "rev_02HXZ890HIJ123KLM", + "user_id": "usr_2c3d4e5f6a7b8c9d", + "restaurant_id": "rst_3b7c9e1a2f4d5e6b", + "overall_rating": 5, + "dimension_ratings": { + "food": 5, + "service": 5, + "ambiance": null, + "value": 4 + }, + "review_text": null, + "sentiment_score": null, + "aspect_sentiments": null, + "helpful_votes": 0, + "reported_count": 0, + "photos": [], + "visit_date": "2024-03-12", + "visit_type": "delivery", + "party_size": null, + "language": null, + "is_verified_order": true, + "moderation_status": "approved", + "created_at": "2024-03-12T20:15:00Z", + "updated_at": "2024-03-12T20:15:00Z" +} +``` + +### Negative Review + +```json +{ + "review_id": "rev_03HXZ901IJK234LMN", + "user_id": "usr_3d4e5f6a7b8c9d0e", + "restaurant_id": "rst_5d9f1b3c4e6g7h8i", + "overall_rating": 2, + "dimension_ratings": { + "food": 2, + "service": 1, + "ambiance": 3, + "value": 2 + }, + "review_text": "Very disappointing experience. The food was cold when it arrived and the waiter was rude. 
The place looks nice but the quality doesn't match the price.", + "sentiment_score": 0.18, + "aspect_sentiments": { + "food": 0.12, + "service": 0.08, + "ambiance": 0.65, + "value": 0.22 + }, + "helpful_votes": 8, + "reported_count": 0, + "photos": [], + "visit_date": "2024-03-13", + "visit_type": "dine_in", + "party_size": 3, + "language": "en", + "is_verified_order": false, + "moderation_status": "approved", + "created_at": "2024-03-14T18:45:00Z", + "updated_at": "2024-03-14T18:45:02Z" +} +``` diff --git a/docs/schemas/user-event.md b/docs/schemas/user-event.md new file mode 100644 index 0000000..da5c981 --- /dev/null +++ b/docs/schemas/user-event.md @@ -0,0 +1,382 @@ +# User Event Schema + +## Table of Contents + +- [Overview](#overview) +- [Event Types](#event-types) +- [Schema Definition](#schema-definition) +- [Field Reference](#field-reference) +- [JSON Examples](#json-examples) + +--- + +## Overview + +User events are the primary behavioral signal for the recommendation system. Every user interaction with the platform is captured as a structured event and published to Kafka for downstream processing. 
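To make the event shape concrete, here is a minimal client-side sketch of assembling one such payload before it is published; `build_event` and its field values are illustrative only, not part of any SDK described here:

```python
import json
import uuid
from datetime import datetime, timezone

# The five fields the schema in this document marks as required.
REQUIRED_FIELDS = {"event_id", "event_type", "timestamp", "user_id", "session_id"}

def build_event(event_type, user_id, session_id, restaurant_id=None, metadata=None):
    """Assemble a minimal user event payload (hypothetical helper)."""
    return {
        "event_id": f"evt_{uuid.uuid4().hex[:16]}",  # client-generated, used for dedup
        "event_type": event_type,
        "timestamp": datetime.now(timezone.utc)
                             .isoformat(timespec="milliseconds")
                             .replace("+00:00", "Z"),
        "user_id": user_id,
        "session_id": session_id,
        "restaurant_id": restaurant_id,  # null for SEARCH / FILTER_APPLIED
        "metadata": metadata or {},
    }

event = build_event(
    "RESTAURANT_CLICK",
    user_id="usr_9f3a8b2c1d4e5f6a",
    session_id="sess_7e2f1a3b4c5d6e7f",
    restaurant_id="rst_3b7c9e1a2f4d5e6b",
    metadata={"rank_in_list": 3},
)
assert REQUIRED_FIELDS <= event.keys()
# Serialized value for the user-events-raw topic; the actual produce call
# (keyed by user_id for partitioning) is omitted here.
payload = json.dumps(event).encode("utf-8")
```

The hex suffix stands in for whatever unique ID scheme the client SDK actually uses; only the presence of the five required fields matters for the sketch.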
+ +- **Transport**: Kafka topic `user-events-raw` (partitioned by `user_id`) +- **Format**: JSON (UTF-8) +- **Deduplication**: `event_id` (UUID v4), 60-second dedup window in Flink +- **Late event handling**: Events arriving >5 minutes late routed to `user-events-late` +- **SDK**: Mobile (iOS/Android) and Web (JavaScript) SDKs batch events every 500ms + +--- + +## Event Types + +| Event Type | Trigger | Key Behavioral Signal | +|-----------|---------|----------------------| +| `RESTAURANT_VIEW` | Restaurant detail page opened | Intent to learn more about a restaurant | +| `RESTAURANT_CLICK` | Restaurant card clicked in listing/map | Direct interest signal | +| `MENU_VIEW` | Menu page viewed | Deeper engagement, pre-order intent | +| `ORDER_PLACED` | Order successfully submitted | Strongest positive signal | +| `REVIEW_SUBMITTED` | User submits a review | Post-visit satisfaction signal | +| `SEARCH` | User submits a search query | Intent + query context | +| `FILTER_APPLIED` | User applies a filter (cuisine, price, etc.) 
| Explicit preference signal | +| `FAVORITE_ADDED` | User adds restaurant to favorites | Explicit interest signal | + +--- + +## Schema Definition + +```yaml +UserEvent: + type: object + required: + - event_id + - event_type + - timestamp + - user_id + - session_id + properties: + event_id: + type: string + format: uuid-v4 + description: Globally unique event identifier (UUID v4, generated by client SDK) + example: "evt_01HXYZ123ABC456DEF" + + event_type: + type: string + enum: + - RESTAURANT_VIEW + - RESTAURANT_CLICK + - MENU_VIEW + - ORDER_PLACED + - REVIEW_SUBMITTED + - SEARCH + - FILTER_APPLIED + - FAVORITE_ADDED + description: Type of user interaction event + + timestamp: + type: string + format: date-time + description: ISO 8601 UTC timestamp of the event + example: "2024-03-15T14:32:07.123Z" + + user_id: + type: string + description: Hashed user identifier (SHA-256, anonymized) + example: "usr_9f3a8b2c1d4e5f6a" + + session_id: + type: string + description: Session identifier (server-assigned UUID) + example: "sess_7e2f1a3b4c5d6e7f" + + restaurant_id: + type: string + nullable: true + description: Target restaurant ID (null for SEARCH events) + example: "rst_3b7c9e1a2f4d5e6b" + + context: + type: object + properties: + device: + type: string + enum: [ios, android, web_mobile, web_desktop] + platform: + type: string + enum: [mobile_app, web_app, partner_api] + app_version: + type: string + example: "4.2.1" + os_version: + type: string + example: "17.2" + location: + type: object + properties: + lat: + type: number + minimum: -90 + maximum: 90 + lng: + type: number + minimum: -180 + maximum: 180 + accuracy_meters: + type: integer + geohash: + type: string + description: Geohash precision 6 + referrer: + type: string + description: Source page or feature that led to this event + example: "homepage_recommendations" + page_position: + type: integer + description: Position in the list/page (1-indexed) + + metadata: + type: object + description: Event-type-specific 
metadata + properties: + dwell_time_ms: + type: integer + description: Time spent on the page/card in milliseconds + scroll_depth_pct: + type: integer + minimum: 0 + maximum: 100 + description: Percentage of page scrolled + rank_in_list: + type: integer + description: Rank of this restaurant in the recommendation list + rec_model_version: + type: string + description: Model version that generated the recommendation + experiment_variants: + type: object + description: A/B experiment assignment for this event + additionalProperties: + type: string + rating_value: + type: integer + minimum: 1 + maximum: 5 + description: Rating given (for REVIEW_SUBMITTED events) + order_value_usd: + type: number + description: Order total in USD (for ORDER_PLACED events) + items_count: + type: integer + description: Number of items in order (for ORDER_PLACED events) + search_query: + type: string + description: Search query text (for SEARCH events) + search_results_count: + type: integer + description: Number of results returned (for SEARCH events) + filter_type: + type: string + description: Filter category (for FILTER_APPLIED events) + example: "cuisine" + filter_value: + type: string + description: Filter value applied + example: "japanese" +``` + +--- + +## Field Reference + +| Field | Required | Type | Notes | +|-------|----------|------|-------| +| `event_id` | βœ… | UUID string | Client-generated; used for deduplication | +| `event_type` | βœ… | enum | See Event Types table | +| `timestamp` | βœ… | ISO 8601 UTC | Client-side timestamp; server validates within Β±5min | +| `user_id` | βœ… | string | Anonymized user identifier | +| `session_id` | βœ… | string | Server-assigned session identifier | +| `restaurant_id` | For most types | string | Null for SEARCH, FILTER_APPLIED | +| `context.device` | βœ… | enum | Client platform | +| `context.location.lat/lng` | Recommended | float | Used for geo features; user may deny permission | +| `metadata.dwell_time_ms` | For VIEW events | 
integer | Client measures, flush on page leave | +| `metadata.rank_in_list` | For CLICK events | integer | Position of clicked restaurant | +| `metadata.experiment_variants` | βœ… | object | Always included for A/B analysis | + +--- + +## JSON Examples + +### `RESTAURANT_CLICK` + +```json +{ + "event_id": "evt_01HXYZ123ABC456DEF", + "event_type": "RESTAURANT_CLICK", + "timestamp": "2024-03-15T14:32:07.123Z", + "user_id": "usr_9f3a8b2c1d4e5f6a", + "session_id": "sess_7e2f1a3b4c5d6e7f", + "restaurant_id": "rst_3b7c9e1a2f4d5e6b", + "context": { + "device": "ios", + "platform": "mobile_app", + "app_version": "4.2.1", + "os_version": "17.2", + "location": { + "lat": 10.7769, + "lng": 106.7009, + "accuracy_meters": 15, + "geohash": "w3gv2c" + }, + "referrer": "homepage_recommendations", + "page_position": 3 + }, + "metadata": { + "rank_in_list": 3, + "rec_model_version": "ranking_v2.4.1", + "experiment_variants": { + "homepage_layout": "control", + "ranking_model": "treatment_a" + } + } +} +``` + +### `RESTAURANT_VIEW` + +```json +{ + "event_id": "evt_02HXYZ456DEF789GHI", + "event_type": "RESTAURANT_VIEW", + "timestamp": "2024-03-15T14:32:12.456Z", + "user_id": "usr_9f3a8b2c1d4e5f6a", + "session_id": "sess_7e2f1a3b4c5d6e7f", + "restaurant_id": "rst_3b7c9e1a2f4d5e6b", + "context": { + "device": "ios", + "platform": "mobile_app", + "app_version": "4.2.1", + "location": { + "lat": 10.7769, + "lng": 106.7009, + "accuracy_meters": 15, + "geohash": "w3gv2c" + }, + "referrer": "restaurant_click" + }, + "metadata": { + "dwell_time_ms": 4200, + "scroll_depth_pct": 75, + "rec_model_version": "ranking_v2.4.1", + "experiment_variants": { + "homepage_layout": "control", + "ranking_model": "treatment_a" + } + } +} +``` + +### `ORDER_PLACED` + +```json +{ + "event_id": "evt_03HXYZ789GHI012JKL", + "event_type": "ORDER_PLACED", + "timestamp": "2024-03-15T14:52:33.789Z", + "user_id": "usr_9f3a8b2c1d4e5f6a", + "session_id": "sess_7e2f1a3b4c5d6e7f", + "restaurant_id": 
"rst_3b7c9e1a2f4d5e6b", + "context": { + "device": "ios", + "platform": "mobile_app", + "app_version": "4.2.1", + "location": { + "lat": 10.7769, + "lng": 106.7009, + "geohash": "w3gv2c" + } + }, + "metadata": { + "order_value_usd": 24.50, + "items_count": 3, + "experiment_variants": { + "homepage_layout": "control", + "ranking_model": "treatment_a" + } + } +} +``` + +### `SEARCH` + +```json +{ + "event_id": "evt_04HXYZ012JKL345MNO", + "event_type": "SEARCH", + "timestamp": "2024-03-15T15:10:05.000Z", + "user_id": "usr_9f3a8b2c1d4e5f6a", + "session_id": "sess_8f3e2b4c5d6e7f8a", + "restaurant_id": null, + "context": { + "device": "ios", + "platform": "mobile_app", + "app_version": "4.2.1", + "location": { + "lat": 10.7769, + "lng": 106.7009, + "geohash": "w3gv2c" + }, + "referrer": "search_bar" + }, + "metadata": { + "search_query": "sushi near me", + "search_results_count": 42, + "experiment_variants": { + "search_ranking": "control" + } + } +} +``` + +### `FILTER_APPLIED` + +```json +{ + "event_id": "evt_05HXYZ345MNO678PQR", + "event_type": "FILTER_APPLIED", + "timestamp": "2024-03-15T15:10:45.123Z", + "user_id": "usr_9f3a8b2c1d4e5f6a", + "session_id": "sess_8f3e2b4c5d6e7f8a", + "restaurant_id": null, + "context": { + "device": "ios", + "platform": "mobile_app", + "app_version": "4.2.1", + "referrer": "search_results" + }, + "metadata": { + "filter_type": "price_tier", + "filter_value": "$$", + "experiment_variants": { + "search_ranking": "control" + } + } +} +``` + +### `REVIEW_SUBMITTED` + +```json +{ + "event_id": "evt_06HXYZ678PQR901STU", + "event_type": "REVIEW_SUBMITTED", + "timestamp": "2024-03-16T09:22:00.000Z", + "user_id": "usr_9f3a8b2c1d4e5f6a", + "session_id": "sess_9g4f3c5d6e7f8a9b", + "restaurant_id": "rst_3b7c9e1a2f4d5e6b", + "context": { + "device": "ios", + "platform": "mobile_app", + "app_version": "4.2.1", + "referrer": "post_order_prompt" + }, + "metadata": { + "rating_value": 4, + "experiment_variants": {} + } +} +``` diff --git 
a/docs/schemas/user-profile.md b/docs/schemas/user-profile.md new file mode 100644 index 0000000..eb5997e --- /dev/null +++ b/docs/schemas/user-profile.md @@ -0,0 +1,341 @@ +# User Profile Schema + +## Table of Contents + +- [Overview](#overview) +- [Schema Definition](#schema-definition) +- [Field Reference](#field-reference) +- [User Segments](#user-segments) +- [JSON Examples](#json-examples) + +--- + +## Overview + +The user profile aggregates demographic information, behavioral signals, and ML-derived preference representations for each user. It is the primary input to the recommendation engine's personalization logic. + +- **Primary storage**: PostgreSQL (source of truth) + S3 Parquet (offline feature store) +- **Online serving**: Redis (feature store cache, TTL 6h) +- **Update frequency**: Daily batch rebuild (2:00 AM UTC) + event-driven updates for significant actions +- **Privacy**: Profile contains only aggregated/derived signals; raw interaction history is not stored in the profile + +--- + +## Schema Definition + +```yaml +UserProfile: + type: object + required: + - user_id + - segment + - metadata + properties: + + user_id: + type: string + description: Anonymized user identifier + example: "usr_9f3a8b2c1d4e5f6a" + + # --- Demographic --- + demographic: + type: object + description: Self-reported demographic information (from registration/onboarding) + properties: + age_range: + type: string + enum: ["18-24", "25-34", "35-44", "45-54", "55-64", "65+"] + nullable: true + description: Age bucket (not exact age; privacy-preserving) + gender: + type: string + enum: ["male", "female", "non-binary", "prefer_not_to_say"] + nullable: true + location: + type: object + description: User's primary location (home or frequently visited area) + properties: + city: + type: string + example: "Ho Chi Minh City" + district: + type: string + example: "District 2" + country: + type: string + format: iso-3166-1-alpha-2 + example: "VN" + lat: + type: number + description: 
Approximate centroid of user's activity area + example: 10.7880 + lng: + type: number + example: 106.7320 + + # --- Behavioral Aggregates --- + behavioral: + type: object + description: Derived from interaction history (last 90 days) + properties: + total_interactions: + type: integer + description: Total weighted interactions (views + clicks + orders) all-time + example: 247 + favorite_cuisines: + type: array + maxItems: 5 + items: + type: object + properties: + cuisine: + type: string + example: "japanese" + affinity_score: + type: number + minimum: 0 + maximum: 1 + description: Normalized affinity score (0=no preference, 1=strong preference) + example: 0.87 + description: Top-5 cuisine preferences with affinity scores + avg_session_duration_min: + type: number + description: Average session duration in minutes (last 30 days) + example: 8.4 + order_frequency_per_month: + type: number + description: Average orders per month (last 90 days) + example: 6.2 + last_active_at: + type: string + format: date-time + description: Timestamp of most recent interaction + example: "2024-03-15T13:45:00Z" + visited_restaurant_ids: + type: array + items: + type: string + description: IDs of restaurants visited in the last 90 days (for exclusion logic) + maxItems: 200 + + # --- Dietary Preferences --- + dietary_preferences: + type: object + description: Dietary requirements and restrictions + properties: + vegetarian: + type: boolean + default: false + vegan: + type: boolean + default: false + halal: + type: boolean + default: false + kosher: + type: boolean + default: false + allergies: + type: array + items: + type: string + description: Food allergens to avoid + example: ["shellfish", "peanuts"] + + # --- Price Preference --- + price_preference: + type: object + properties: + preferred_price_tier: + type: string + enum: ["$", "$$", "$$$", "$$$$"] + nullable: true + description: Preferred price tier (from onboarding or inferred from order history) + example: "$$" + 
avg_spend_per_meal_usd: + type: number + description: Average spend per meal order in USD (last 90 days) + example: 18.00 + + # --- ML Representations --- + preference_vector: + type: array + items: + type: number + minItems: 128 + maxItems: 128 + description: Learned 128-dimensional preference embedding from Two-Tower CF model + + # --- Segment --- + segment: + type: string + enum: + - cold_start + - warm + - active + - power_user + - churning + - dormant + description: User lifecycle segment; determines recommendation strategy + + # --- Metadata --- + metadata: + type: object + properties: + profile_completeness_pct: + type: integer + minimum: 0 + maximum: 100 + description: Percentage of optional profile fields filled in + onboarding_completed: + type: boolean + description: Whether user completed onboarding quiz + created_at: + type: string + format: date-time + updated_at: + type: string + format: date-time +``` + +--- + +## Field Reference + +| Field | Type | Required | Update Frequency | Privacy | +|-------|------|----------|-----------------|---------| +| `user_id` | string | βœ… | Immutable | Anonymized (hashed) | +| `demographic.age_range` | enum | β€” | On profile update | Bucketed (not exact age) | +| `demographic.gender` | enum | β€” | On profile update | Optional, nullable | +| `demographic.location` | object | β€” | Daily (from activity centroid) | Approx. 
centroid, not GPS | +| `behavioral.total_interactions` | int | β€” | Daily | Aggregated count | +| `behavioral.favorite_cuisines` | array[5] | β€” | Daily | Derived signal | +| `behavioral.order_frequency_per_month` | float | β€” | Daily | Derived | +| `behavioral.last_active_at` | datetime | β€” | Event-driven | β€” | +| `behavioral.visited_restaurant_ids` | string[] | β€” | Daily | Last 90 days only | +| `dietary_preferences.*` | bool | β€” | On settings update + inference | User-controlled | +| `dietary_preferences.allergies` | string[] | β€” | On settings update | User-controlled | +| `price_preference.preferred_price_tier` | enum | β€” | Daily (inferred) | Derived | +| `preference_vector` | float[128] | β€” | Weekly (model retrain) | Internal; not in API | +| `segment` | enum | βœ… | Daily | β€” | + +--- + +## User Segments + +| Segment | Criteria | Recommendation Strategy | Feature Store TTL | +|---------|----------|------------------------|------------------| +| `cold_start` | <10 total interactions | Popularity by geo + demographic matching | 1h | +| `warm` | 10–30 interactions | Blended: CF (ramping) + content-based | 3h | +| `active` | >30 interactions, active <30 days | Full personalization (CF + ranking) | 6h | +| `power_user` | >100 interactions/month | Deep personalization + early access features | 6h | +| `churning` | Active 30–90 days ago | Re-engagement: trending, high-rated nearby | 6h | +| `dormant` | No activity >90 days | Default to geo-based popularity | 12h | + +--- + +## JSON Examples + +### Active User (Full Profile) + +```json +{ + "user_id": "usr_9f3a8b2c1d4e5f6a", + "demographic": { + "age_range": "25-34", + "gender": "female", + "location": { + "city": "Ho Chi Minh City", + "district": "District 2", + "country": "VN", + "lat": 10.7880, + "lng": 106.7320 + } + }, + "behavioral": { + "total_interactions": 247, + "favorite_cuisines": [ + { "cuisine": "japanese", "affinity_score": 0.87 }, + { "cuisine": "vietnamese", "affinity_score": 0.76 
}, + { "cuisine": "korean", "affinity_score": 0.65 }, + { "cuisine": "italian", "affinity_score": 0.42 }, + { "cuisine": "chinese", "affinity_score": 0.38 } + ], + "avg_session_duration_min": 8.4, + "order_frequency_per_month": 6.2, + "last_active_at": "2024-03-15T13:45:00Z", + "visited_restaurant_ids": [ + "rst_3b7c9e1a2f4d5e6b", + "rst_4c8d0f2b3e5f6a7c" + ] + }, + "dietary_preferences": { + "vegetarian": false, + "vegan": false, + "halal": false, + "kosher": false, + "allergies": ["shellfish"] + }, + "price_preference": { + "preferred_price_tier": "$$", + "avg_spend_per_meal_usd": 18.00 + }, + "preference_vector": [0.12, -0.34, 0.87, "... (128 dimensions total)"], + "segment": "active", + "metadata": { + "profile_completeness_pct": 85, + "onboarding_completed": true, + "created_at": "2022-09-14T10:00:00Z", + "updated_at": "2024-03-15T02:00:00Z" + } +} +``` + +### Cold Start User (New Registration) + +```json +{ + "user_id": "usr_1a2b3c4d5e6f7a8b", + "demographic": { + "age_range": "18-24", + "gender": "prefer_not_to_say", + "location": { + "city": "Hanoi", + "district": "Hoan Kiem", + "country": "VN", + "lat": 21.0285, + "lng": 105.8542 + } + }, + "behavioral": { + "total_interactions": 3, + "favorite_cuisines": [ + { "cuisine": "vietnamese", "affinity_score": 0.60 }, + { "cuisine": "japanese", "affinity_score": 0.40 } + ], + "avg_session_duration_min": 3.2, + "order_frequency_per_month": 0, + "last_active_at": "2024-03-15T11:10:00Z", + "visited_restaurant_ids": [] + }, + "dietary_preferences": { + "vegetarian": true, + "vegan": false, + "halal": false, + "kosher": false, + "allergies": [] + }, + "price_preference": { + "preferred_price_tier": "$", + "avg_spend_per_meal_usd": null + }, + "preference_vector": null, + "segment": "cold_start", + "metadata": { + "profile_completeness_pct": 40, + "onboarding_completed": true, + "created_at": "2024-03-15T10:55:00Z", + "updated_at": "2024-03-15T11:10:00Z" + } +} +```
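The segment criteria in the User Segments table above can be sketched as a small rule function. The thresholds follow the table, but the precedence between recency checks and interaction-volume checks is an assumption, since the table does not define tie-breaking:

```python
def assign_segment(total_interactions: int,
                   interactions_last_month: int,
                   days_since_last_active: float) -> str:
    """Map activity aggregates to a lifecycle segment per the segments table.

    Recency rules are evaluated first (assumed precedence), then
    interaction volume.
    """
    if days_since_last_active > 90:
        return "dormant"        # no activity >90 days
    if days_since_last_active >= 30:
        return "churning"       # active 30-90 days ago
    if total_interactions < 10:
        return "cold_start"     # <10 total interactions
    if interactions_last_month > 100:
        return "power_user"     # >100 interactions/month
    if total_interactions <= 30:
        return "warm"           # 10-30 interactions
    return "active"             # >30 interactions, active <30 days

# The "Active User" example above (247 interactions, active yesterday):
assert assign_segment(247, 40, 1) == "active"
```

A daily batch job would recompute this per user alongside the other behavioral aggregates; the function itself is deliberately stateless so the same rules can run in both the offline rebuild and event-driven updates.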