murcoder14/embedder-service

Document Embedder Service

A serverless microservice that automatically processes documents uploaded to Google Cloud Storage, generates vector embeddings, and stores them in PostgreSQL with pgvector for RAG applications.

Prerequisites

Before deploying or running this project, ensure you have the following installed:

Required Software

  • Java 25 (or Java 21+) - Download
  • Maven (included via mvnw wrapper, or install globally)
  • Google Cloud SDK (gcloud CLI) - Install Guide
  • Terraform (v1.0+) - Download
  • Cloud SQL Proxy (v2.15+) - Install in ~/tools/ or add to PATH (the example below fetches the Linux amd64 build)
    mkdir -p ~/tools
    curl -o ~/tools/cloud-sql-proxy https://storage.googleapis.com/cloud-sql-connectors/cloud-sql-proxy/v2.15.0/cloud-sql-proxy.linux.amd64
    chmod +x ~/tools/cloud-sql-proxy
  • PostgreSQL client tools (psql) - Install Guide

Additional Tools (verified by scripts)

  • jq - JSON processor
  • curl - HTTP client
  • nc - Netcat for port checking
  • lsof - List open files/ports
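To confirm these tools are on your PATH before running the scripts, a check along the following lines may help. This is a minimal sketch: `missing_tools` is a hypothetical helper written for illustration and is not part of the repository's scripts.

```shell
# Print any commands from the argument list that are not on PATH.
# Returns 0 when every command is found, 1 otherwise.
# (Hypothetical helper, not taken from the repository.)
missing_tools() {
  missing=""
  for tool in "$@"; do
    # command -v is the portable way to test whether a command resolves
    command -v "$tool" >/dev/null 2>&1 || missing="${missing:+$missing }$tool"
  done
  if [ -n "$missing" ]; then
    echo "$missing"
    return 1
  fi
  return 0
}

# Example usage:
#   missing_tools jq curl nc lsof psql || echo "Install the tools printed above first"
```

The function only reports what is missing. It does not check versions, so the Cloud SQL Proxy and Terraform version requirements above still need a manual look.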

GCP Setup

# Authenticate with Google Cloud
gcloud auth login
gcloud auth application-default login

# Set your project
gcloud config set project <your-project-id>

What does the Application do?

Event-driven document processing pipeline:

  1. Listens for file upload notifications via Pub/Sub push subscription
  2. Downloads documents from Cloud Storage (supports PDF, Office formats, text)
  3. Chunks documents into semantic segments using Apache OpenNLP
  4. Generates vector embeddings via Google Vertex AI
  5. Stores text chunks and embeddings in Cloud SQL PostgreSQL with pgvector extension
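When debugging step 1, it can help to see which object a push message refers to. The sketch below assumes the bucket notification uses the JSON payload format, where the base64-encoded `message.data` field of the Pub/Sub push envelope carries the object resource, including `bucket` and `name`. The `gcs_uri_from_push` function is a hypothetical helper for inspection only. The service itself handles this in Java.

```shell
# Read a Pub/Sub push envelope (JSON) on stdin and print the gs:// URI
# of the uploaded object.
# Assumption: message.data is the base64-encoded JSON object resource
# with "bucket" and "name" fields. Requires jq and base64.
gcs_uri_from_push() {
  jq -r '.message.data' \
    | base64 --decode \
    | jq -r '"gs://\(.bucket)/\(.name)"'
}

# Example usage with an envelope saved from the logs (file name is hypothetical):
#   gcs_uri_from_push < envelope.json
```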

Architecture

Stack: Quarkus 3.30.3 (Java 25), Cloud Run, Vertex AI, PostgreSQL 17

Key Components:

  • Compute: Cloud Run (serverless containers)
  • Messaging: Pub/Sub (push subscription with OIDC auth)
  • Storage: Cloud Storage + Cloud SQL PostgreSQL 17 (pgvector)
  • ML: Vertex AI Embedding API
  • Libraries: LangChain4j, Apache PDFBox, Apache POI, OpenNLP

Infrastructure: Terraform-managed VPC, Cloud SQL (private IP), Serverless VPC Connector, IAM service accounts

Local Setup

Prerequisites

  • Java 25 (or Java 21+)
  • Google Cloud SDK authenticated (gcloud auth application-default login)
  • Terraform, PostgreSQL client tools

Steps

# Clone repository
git clone <repository-url>
cd embedder-service

# Configure Terraform variables
cp terraform/terraform.tfvars.example terraform/terraform.tfvars
# Edit terraform.tfvars with your project_id and region

# Run in dev mode (requires GCP auth and an existing Cloud SQL instance)
./mvnw quarkus:dev

Deployment

1. Provision Infrastructure

Creates VPC, Cloud SQL (PostgreSQL 17), Storage bucket, Pub/Sub topic/subscription, service accounts, and IAM bindings.

./scripts/iac_create.sh

What it provisions:

  • VPC with private IP range for Cloud SQL
  • Cloud SQL PostgreSQL 17 instance (ENTERPRISE edition)
  • Serverless VPC Connector for Cloud Run
  • Storage bucket with Pub/Sub notifications
  • Pub/Sub topic and push subscription
  • 2 GCP service accounts:
    • embedder-service-sa (Cloud Run runtime with Storage, Pub/Sub, CloudSQL, Vertex AI permissions)
    • embedder-service-sa-pubsub (Pub/Sub push invoker with Cloud Run Invoker role)

2. Initialize Database

Enables the pgvector extension and grants privileges to the IAM database user. The script connects through Cloud SQL Proxy.

./scripts/init_database.sh

Actions performed:

  • Auto-generates postgres superuser password and sets it via gcloud
  • Creates 1 IAM database user: embedder-service-sa@<project-id>.iam (matches Cloud Run service account)
  • Enables vector extension
  • Grants schema and table privileges to IAM user
  • Verifies connectivity

3. Build and Deploy Application

Compiles Java code, builds container with Jib, pushes to GCR, and deploys to Cloud Run.

./scripts/cicd.sh

Build process:

  • Maven clean package (skips tests)
  • Jib containerization (Eclipse Temurin 25 JRE base image). Jib builds layered images without a Docker daemon and keeps dependencies in separate layers from application code, so a rebuild only updates the layers that changed
  • Push to gcr.io/<project-id>/document-embedder:latest
  • Deploy to Cloud Run with VPC connector and environment variables
  • Update Pub/Sub push endpoint to Cloud Run URL

Testing

Unit Tests

./mvnw test

Health Checks

Verify that the service is running and all dependencies are healthy:

# Get service URL from deployment output or:
SERVICE_URL=$(gcloud run services describe document-embedder --region=us-central1 --format='value(status.url)')

# Check all health probes
curl $SERVICE_URL/health        # Aggregate (liveness + readiness)
curl $SERVICE_URL/health/live   # Liveness only
curl $SERVICE_URL/health/ready  # Readiness (DB, Pub/Sub, Vertex AI)
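Right after a deployment the readiness probe may not pass on the first request. A small polling loop, sketched below, retries until the endpoint returns a success status. `wait_for_ready` is a hypothetical helper, and the attempt count and delay are arbitrary defaults.

```shell
# Poll a URL until it returns a 2xx status or the attempts run out.
# Usage: wait_for_ready URL [ATTEMPTS] [DELAY_SECONDS]
# (Hypothetical helper; defaults are illustrative.)
wait_for_ready() {
  url=$1
  attempts=${2:-10}
  delay=${3:-3}
  i=1
  while [ "$i" -le "$attempts" ]; do
    # --fail makes curl exit non-zero on HTTP errors (status >= 400)
    if curl --silent --fail --max-time 5 --output /dev/null "$url"; then
      return 0
    fi
    # Sleep only when another attempt is still to come
    [ "$i" -lt "$attempts" ] && sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Example usage:
#   wait_for_ready "$SERVICE_URL/health/ready" 10 3 && echo "service is ready"
```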

End-to-End Test

# Upload a test document
gsutil cp sample.pdf gs://<bucket-name>/

# Monitor processing
gcloud run services logs read document-embedder --region=us-central1 --limit=50

Connect to Database and Verify Embeddings

Prerequisites: Cloud SQL Proxy installed in ~/tools/cloud-sql-proxy or available in PATH

1. Start Cloud SQL Proxy with IAM Authentication

~/tools/cloud-sql-proxy --port 5433 \
  --auto-iam-authn \
  --impersonate-service-account=embedder-service-sa@<project-id>.iam.gserviceaccount.com \
  <project-id>:<region>:<instance-name> > /tmp/proxy.log 2>&1 &

PROXY_PID=$!
sleep 3  # Wait for proxy to start
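The fixed `sleep 3` may be too short if the proxy is slow to start. One alternative is to poll the port with `nc`, which is already listed under Additional Tools. This is a sketch, and `wait_for_port` is a hypothetical helper rather than something the repository provides.

```shell
# Wait until HOST:PORT accepts TCP connections, checking about once a second.
# Usage: wait_for_port HOST PORT [TIMEOUT_SECONDS]
# Returns 0 once the port is open, 1 if the timeout is reached first.
# (Hypothetical helper, not taken from the repository.)
wait_for_port() {
  host=$1
  port=$2
  timeout=${3:-15}
  elapsed=0
  # nc -z only tests whether the port is open; -w 1 caps each try at one second
  while ! nc -z -w 1 "$host" "$port" >/dev/null 2>&1; do
    if [ "$elapsed" -ge "$timeout" ]; then
      return 1
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  return 0
}

# Example usage in place of the fixed sleep:
#   wait_for_port localhost 5433 15 || { echo "proxy did not start"; tail /tmp/proxy.log; }
```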

2. Connect via psql (No Password Required)

psql "host=localhost port=5433 dbname=embeddings user=embedder-service-sa@<project-id>.iam"

3. Query Embeddings

-- Count stored embeddings
SELECT COUNT(*) as embedding_count FROM embeddings;

-- View sample records
SELECT embedding_id, 
       LEFT(text, 100) as text_preview, 
       metadata->>'source' as source 
FROM embeddings LIMIT 5;

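-- Check vector dimensions and stored size in bytes
-- (vector_dims is provided by pgvector; pg_column_size alone reports bytes, not dimensions)
SELECT vector_dims(embedding) as dimensions,
       pg_column_size(embedding) as vector_bytes 
FROM embeddings LIMIT 1;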

4. Similarity Search Example

-- Find documents similar to an existing embedding
WITH query_vector AS (
  SELECT embedding 
  FROM embeddings 
  WHERE text ILIKE '%your search term%' 
  LIMIT 1
)
SELECT e.embedding_id, 
       LEFT(e.text, 100) as text_preview,
       1 - (e.embedding <=> qv.embedding) as similarity
FROM embeddings e, query_vector qv
ORDER BY e.embedding <=> qv.embedding
LIMIT 10;

Note: The <=> operator computes cosine distance (smaller = more similar).
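To make the link between distance and the `similarity` column concrete, the sketch below computes cosine distance for two toy vectors outside the database. `cosine_distance` is a hypothetical helper for illustration. Real embeddings have far more dimensions, and pgvector does this calculation inside PostgreSQL.

```shell
# Compute cosine distance (1 - cosine similarity) between two vectors
# given as comma-separated numbers, e.g. "1,0,0".
# (Hypothetical helper for illustration only.)
cosine_distance() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    n = split(a, x, ",")
    m = split(b, y, ",")
    if (n != m || n == 0) { print "dimension mismatch" > "/dev/stderr"; exit 1 }
    for (i = 1; i <= n; i++) {
      dot += x[i] * y[i]   # dot product
      na  += x[i] * x[i]   # squared norm of a
      nb  += y[i] * y[i]   # squared norm of b
    }
    if (na == 0 || nb == 0) { print "zero-length vector" > "/dev/stderr"; exit 1 }
    printf "%.4f\n", 1 - dot / (sqrt(na) * sqrt(nb))
  }'
}

cosine_distance "1,0" "1,0"    # same direction: prints 0.0000
cosine_distance "1,0" "0,1"    # orthogonal: prints 1.0000
cosine_distance "1,0" "-1,0"   # opposite direction: prints 2.0000
```

Cosine distance ranges from 0 to 2. The query above reports `1 - distance`, so identical directions score 1 and opposite directions score -1.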

5. Cleanup

kill $PROXY_PID  # Stop proxy

Troubleshooting:

  • Connection refused: Check proxy logs (tail -f /tmp/proxy.log) and verify port 5433 is free (lsof -i :5433)
  • IAM auth failed: Verify you have roles/iam.serviceAccountTokenCreator on the service account
  • User does not exist: Run init_database.sh to create the IAM database user
