Document Embedder Service

A serverless microservice that automatically processes documents uploaded to Google Cloud Storage, generates vector embeddings, and stores them in PostgreSQL with pgvector for retrieval-augmented generation (RAG) applications.

Prerequisites

Before deploying or running this project, ensure you have the following installed:

Required Software

Java 25 (or Java 21+)
Maven (included via the mvnw wrapper, or installed globally)
Google Cloud SDK (gcloud CLI)
Terraform (v1.0+)

Cloud SQL Proxy (v2.15+) - Install in ~/tools/ or add to PATH

# Linux amd64 binary; choose the matching build for other platforms
mkdir -p ~/tools
curl -o ~/tools/cloud-sql-proxy https://storage.googleapis.com/cloud-sql-connectors/cloud-sql-proxy/v2.15.0/cloud-sql-proxy.linux.amd64
chmod +x ~/tools/cloud-sql-proxy

PostgreSQL client tools (psql)

Additional Tools (verified by scripts)

jq - JSON processor
curl - HTTP client
nc - Netcat for port checking
lsof - List open files/ports

GCP Setup

# Authenticate with Google Cloud
gcloud auth login
gcloud auth application-default login

# Set your project
gcloud config set project <your-project-id>

What does the Application do?

Event-driven document processing pipeline:

Listens for file upload notifications via Pub/Sub push subscription
Downloads documents from Cloud Storage (supports PDF, Office formats, text)
Chunks documents into semantic segments using Apache OpenNLP
Generates vector embeddings via Google Vertex AI
Stores text chunks and embeddings in Cloud SQL PostgreSQL with pgvector extension
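
The chunking step above can be pictured with a minimal sketch. It is not the service's code: it swaps in the JDK's java.text.BreakIterator for Apache OpenNLP's sentence detection so that it runs without dependencies, and the class name SentenceChunker and the maxChars budget are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative only: the real service uses Apache OpenNLP for segmentation.
final class SentenceChunker {

    // Greedily packs whole sentences into chunks of at most maxChars characters.
    // A single sentence longer than maxChars becomes its own oversized chunk.
    static List<String> chunk(String text, int maxChars) {
        List<String> chunks = new ArrayList<>();
        BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        sentences.setText(text);

        StringBuilder current = new StringBuilder();
        int start = sentences.first();
        for (int end = sentences.next(); end != BreakIterator.DONE; start = end, end = sentences.next()) {
            String sentence = text.substring(start, end).trim();
            if (sentence.isEmpty()) {
                continue;
            }
            // Flush the current chunk if adding this sentence would exceed the budget.
            if (current.length() > 0 && current.length() + 1 + sentence.length() > maxChars) {
                chunks.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(sentence);
        }
        if (current.length() > 0) {
            chunks.add(current.toString());
        }
        return chunks;
    }
}
```

Calling SentenceChunker.chunk(documentText, 1000) would return the chunks in document order, each ready to be sent to the embedding step. Packing whole sentences keeps each chunk readable on its own, at the cost of uneven chunk sizes.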

Architecture

Stack: Quarkus 3.30.3 (Java 25), Cloud Run, Vertex AI, PostgreSQL 17

Key Components:

Compute: Cloud Run (serverless containers)
Messaging: Pub/Sub (push subscription with OIDC auth)
Storage: Cloud Storage + Cloud SQL PostgreSQL 17 (pgvector)
ML: Vertex AI Embedding API
Libraries: LangChain4j, Apache PDFBox, Apache POI, OpenNLP

Infrastructure: Terraform-managed VPC, Cloud SQL (private IP), Serverless VPC Connector, IAM service accounts
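
For the messaging component, a Pub/Sub push subscription delivers each Cloud Storage notification as an HTTP POST whose JSON body wraps a message with an attributes map and a base64-encoded data field. The sketch below shows one way to read those two parts once the JSON has been parsed. It is an assumption about how such a handler could look, not the service's implementation: the class and record names are hypothetical, and it assumes the default JSON_API_V1 payload format. The attribute names bucketId, objectId, and eventType are the ones Cloud Storage sets on its notifications.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.Map;

// Illustrative only: not taken from the service's source.
final class StorageNotification {

    // The fields of a Cloud Storage notification that a pipeline like this needs.
    record UploadEvent(String bucket, String objectName, String eventType) {

        // OBJECT_FINALIZE is sent when an object is created or overwritten.
        boolean isNewObject() {
            return "OBJECT_FINALIZE".equals(eventType);
        }
    }

    // Builds an event from the "message.attributes" map of the push request body.
    static UploadEvent fromAttributes(Map<String, String> attributes) {
        String bucket = attributes.get("bucketId");
        String objectName = attributes.get("objectId");
        if (bucket == null || objectName == null) {
            throw new IllegalArgumentException("Missing bucketId or objectId attribute");
        }
        return new UploadEvent(bucket, objectName, attributes.get("eventType"));
    }

    // Decodes the base64 "message.data" field, which carries the object metadata as JSON.
    static String decodeData(String base64Data) {
        return new String(Base64.getDecoder().decode(base64Data), StandardCharsets.UTF_8);
    }
}
```

A handler built this way would typically ignore events where isNewObject() is false, so that deletions or metadata updates do not trigger re-embedding.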

Local Setup

Prerequisites

Java 25 (or Java 21+)
Google Cloud SDK authenticated (gcloud auth application-default login)
Terraform, PostgreSQL client tools

Steps

# Clone repository
git clone <repository-url>
cd embedder-service

# Configure Terraform variables
cp terraform/terraform.tfvars.example terraform/terraform.tfvars
# Edit terraform.tfvars with your project_id and region

# Run in dev mode (requires GCP auth and existing Cloud SQL instance)
./mvnw quarkus:dev

Deployment

1. Provision Infrastructure

Creates VPC, Cloud SQL (PostgreSQL 17), Storage bucket, Pub/Sub topic/subscription, service accounts, and IAM bindings.

./scripts/iac_create.sh

What it provisions:

VPC with private IP range for Cloud SQL
Cloud SQL PostgreSQL 17 instance (ENTERPRISE edition)
Serverless VPC Connector for Cloud Run
Storage bucket with Pub/Sub notifications
Pub/Sub topic and push subscription
2 GCP service accounts:
- embedder-service-sa (Cloud Run runtime with Storage, Pub/Sub, CloudSQL, Vertex AI permissions)
- embedder-service-sa-pubsub (Pub/Sub push invoker with Cloud Run Invoker role)

2. Initialize Database

Enables the pgvector extension and grants privileges to the IAM database user. Runs via Cloud SQL Proxy.

./scripts/init_database.sh

Actions performed:

Auto-generates postgres superuser password and sets it via gcloud
Creates 1 IAM database user: embedder-service-sa@<project-id>.iam (matches Cloud Run service account)
Enables vector extension
Grants schema and table privileges to IAM user
Verifies connectivity

3. Build and Deploy Application

Compiles Java code, builds container with Jib, pushes to GCR, and deploys to Cloud Run.

./scripts/cicd.sh

Build process:

Maven clean package (skips tests)
Jib containerization (Eclipse Temurin 25 JRE base image). Jib builds layered images without requiring a Docker daemon and keeps dependencies in separate layers from application code, so rebuilds only update the layers that changed.
Push to gcr.io/<project-id>/document-embedder:latest
Deploy to Cloud Run with VPC connector and environment variables
Update Pub/Sub push endpoint to Cloud Run URL

Testing

Unit Tests

./mvnw test

Health Checks

Verify service is running and all dependencies are healthy:

# Get service URL from deployment output or:
SERVICE_URL=$(gcloud run services describe document-embedder --region=us-central1 --format='value(status.url)')

# Check all health probes
curl "$SERVICE_URL/health"        # Aggregate (liveness + readiness)
curl "$SERVICE_URL/health/live"   # Liveness only
curl "$SERVICE_URL/health/ready"  # Readiness (DB, Pub/Sub, Vertex AI)

End-to-End Test

# Upload a test document
gsutil cp sample.pdf gs://<bucket-name>/

# Monitor processing
gcloud run services logs read document-embedder --region=us-central1 --limit=50

Connect to Database and Verify Embeddings

Prerequisites: Cloud SQL Proxy installed in ~/tools/cloud-sql-proxy or available in PATH

1. Start Cloud SQL Proxy with IAM Authentication

~/tools/cloud-sql-proxy --port 5433 \
  --auto-iam-authn \
  --impersonate-service-account=embedder-service-sa@<project-id>.iam.gserviceaccount.com \
  <project-id>:<region>:<instance-name> > /tmp/proxy.log 2>&1 &

PROXY_PID=$!
sleep 3  # Wait for proxy to start

2. Connect via psql (No Password Required)

psql "host=localhost port=5433 dbname=embeddings user=embedder-service-sa@<project-id>.iam"

3. Query Embeddings

-- Count stored embeddings
SELECT COUNT(*) as embedding_count FROM embeddings;

-- View sample records
SELECT embedding_id, 
       LEFT(text, 100) as text_preview, 
       metadata->>'source' as source 
FROM embeddings LIMIT 5;

-- Check vector dimensions and stored size
SELECT vector_dims(embedding) as dimensions,
       pg_column_size(embedding) as vector_bytes
FROM embeddings LIMIT 1;

4. Similarity Search Example

-- Find documents similar to an existing embedding
WITH query_vector AS (
  SELECT embedding 
  FROM embeddings 
  WHERE text ILIKE '%your search term%' 
  LIMIT 1
)
SELECT e.embedding_id, 
       LEFT(e.text, 100) as text_preview,
       1 - (e.embedding <=> qv.embedding) as similarity
FROM embeddings e, query_vector qv
ORDER BY e.embedding <=> qv.embedding
LIMIT 10;

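Note: The <=> operator computes cosine distance (smaller = more similar), so 1 minus the distance gives cosine similarity. The row that supplied the query vector matches itself and appears first with similarity 1.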
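
To make the cosine distance behind the <=> operator concrete, here is a minimal sketch of the same formula in Java. pgvector computes this inside PostgreSQL; the class and method names here are hypothetical, and throwing on a zero vector is a choice made for this sketch only.

```java
// Illustrative only: pgvector evaluates cosine distance inside PostgreSQL.
final class CosineDistance {

    // Returns 1 - cos(theta) for two equal-length, non-zero vectors.
    // 0 means same direction, 1 means orthogonal, 2 means opposite.
    static double cosineDistance(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same dimension");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            throw new IllegalArgumentException("Cosine distance is undefined for a zero vector");
        }
        return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

Because the formula divides by both vector lengths, scaling a vector does not change its distance to any other vector; only direction matters. That is why the query above can convert a distance to a similarity score with a simple 1 - distance.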

5. Cleanup

kill $PROXY_PID  # Stop proxy

Troubleshooting:

Connection refused: Check proxy logs (tail -f /tmp/proxy.log) and verify port 5433 is free (lsof -i :5433)
IAM auth failed: Verify you have roles/iam.serviceAccountTokenCreator on the service account
User does not exist: Run init_database.sh to create the IAM database user

Repository: murcoder14/embedder-service