A serverless microservice that automatically processes documents uploaded to Google Cloud Storage, generates vector embeddings, and stores them in PostgreSQL with pgvector for RAG applications.
Before deploying or running this project, ensure you have the following installed:
- Java 25 (or Java 21+)
- Maven (included via the mvnw wrapper, or install globally)
- Google Cloud SDK (gcloud CLI)
- Terraform (v1.0+)
- Cloud SQL Proxy (v2.15+), installed in ~/tools/ or added to PATH:
  curl -o ~/tools/cloud-sql-proxy https://storage.googleapis.com/cloud-sql-connectors/cloud-sql-proxy/v2.15.0/cloud-sql-proxy.linux.amd64
  chmod +x ~/tools/cloud-sql-proxy
- PostgreSQL client tools (psql)
- jq: JSON processor
- curl: HTTP client
- nc: Netcat for port checking
- lsof: List open files/ports
# Authenticate with Google Cloud
gcloud auth login
gcloud auth application-default login
# Set your project
gcloud config set project <your-project-id>

Event-driven document processing pipeline:
- Listens for file upload notifications via Pub/Sub push subscription
- Downloads documents from Cloud Storage (supports PDF, Office formats, text)
- Chunks documents into semantic segments using Apache OpenNLP
- Generates vector embeddings via Google Vertex AI
- Stores text chunks and embeddings in Cloud SQL PostgreSQL with pgvector extension
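The chunking step above can be illustrated with a minimal sketch. The service itself uses Apache OpenNLP; to stay self-contained, this example swaps in the JDK's BreakIterator for sentence detection, so it is a stand-in rather than the project's implementation. The class name SentenceChunker and the maxChars parameter are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: groups whole sentences into chunks of bounded size.
// Uses java.text.BreakIterator as a stand-in for OpenNLP sentence detection.
final class SentenceChunker {

    /** Packs consecutive sentences into chunks of at most maxChars characters. */
    static List<String> chunk(String text, int maxChars) {
        if (maxChars <= 0) {
            throw new IllegalArgumentException("maxChars must be positive");
        }
        List<String> chunks = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);

        StringBuilder current = new StringBuilder();
        for (int start = it.first(), end = it.next();
             end != BreakIterator.DONE;
             start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (sentence.isEmpty()) {
                continue;
            }
            // Flush the current chunk if adding this sentence would exceed the limit.
            if (current.length() > 0
                    && current.length() + 1 + sentence.length() > maxChars) {
                chunks.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) {
                current.append(' ');
            }
            // A single sentence longer than maxChars is kept whole as its own chunk.
            current.append(sentence);
        }
        if (current.length() > 0) {
            chunks.add(current.toString());
        }
        return chunks;
    }

    public static void main(String[] args) {
        for (String c : chunk("The cat sat. The dog ran. Birds fly.", 25)) {
            System.out.println("[" + c + "]");
        }
    }
}
```

Keeping sentences intact means a chunk never cuts a sentence in half, which is the property the embedding step benefits from. Overlap between chunks is omitted here for brevity.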
Stack: Quarkus 3.30.3 (Java 25), Cloud Run, Vertex AI, PostgreSQL 17
Key Components:
- Compute: Cloud Run (serverless containers)
- Messaging: Pub/Sub (push subscription with OIDC auth)
- Storage: Cloud Storage + Cloud SQL PostgreSQL 17 (pgvector)
- ML: Vertex AI Embedding API
- Libraries: LangChain4j, Apache PDFBox, Apache POI, OpenNLP
Infrastructure: Terraform-managed VPC, Cloud SQL (private IP), Serverless VPC Connector, IAM service accounts
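As a rough illustration of the messaging path, the sketch below decides whether an upload notification should be processed. It relies on the attributes that Cloud Storage attaches to Pub/Sub notifications (eventType, bucketId, objectId). The class name, the extension list, and the filtering rule are assumptions made for illustration; the README does not describe how the service itself filters events.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical filter for Cloud Storage notifications delivered via Pub/Sub.
final class UploadNotificationFilter {

    // Assumed extension list; the README only states "PDF, Office formats, text".
    private static final Set<String> SUPPORTED =
            Set.of("pdf", "docx", "xlsx", "pptx", "txt");

    /** Returns true for newly finalized objects with a supported extension. */
    static boolean shouldProcess(Map<String, String> attributes) {
        // OBJECT_FINALIZE is the event type emitted when an upload completes.
        if (!"OBJECT_FINALIZE".equals(attributes.get("eventType"))) {
            return false;
        }
        String bucketId = attributes.get("bucketId");
        String objectId = attributes.get("objectId");
        if (bucketId == null || objectId == null) {
            return false;
        }
        int dot = objectId.lastIndexOf('.');
        if (dot < 0 || dot == objectId.length() - 1) {
            return false; // no file extension
        }
        String extension = objectId.substring(dot + 1).toLowerCase(Locale.ROOT);
        return SUPPORTED.contains(extension);
    }

    /** Builds the gs:// URI of the uploaded object. */
    static String gcsUri(Map<String, String> attributes) {
        return "gs://" + attributes.get("bucketId") + "/" + attributes.get("objectId");
    }

    public static void main(String[] args) {
        Map<String, String> attrs = Map.of(
                "eventType", "OBJECT_FINALIZE",
                "bucketId", "example-bucket",
                "objectId", "docs/sample.pdf");
        System.out.println(shouldProcess(attrs) + " " + gcsUri(attrs));
    }
}
```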
Prerequisites:
- Java 25 (or Java 21+)
- Google Cloud SDK authenticated (gcloud auth application-default login)
- Terraform, PostgreSQL client tools
# Clone repository
git clone <repository-url>
cd embedder-service
# Configure Terraform variables
cp terraform/terraform.tfvars.example terraform/terraform.tfvars
# Edit terraform.tfvars with your project_id and region
# Run in dev mode (requires GCP auth and existing Cloud SQL instance)
./mvnw quarkus:dev

Creates VPC, Cloud SQL (PostgreSQL 17), Storage bucket, Pub/Sub topic/subscription, service accounts, and IAM bindings.
./scripts/iac_create.sh

What it provisions:
- VPC with private IP range for Cloud SQL
- Cloud SQL PostgreSQL 17 instance (ENTERPRISE edition)
- Serverless VPC Connector for Cloud Run
- Storage bucket with Pub/Sub notifications
- Pub/Sub topic and push subscription
- 2 GCP service accounts:
  - embedder-service-sa (Cloud Run runtime with Storage, Pub/Sub, Cloud SQL, Vertex AI permissions)
  - embedder-service-sa-pubsub (Pub/Sub push invoker with Cloud Run Invoker role)
Enables pgvector extension and grants IAM database user privileges. Runs via Cloud SQL Proxy.
./scripts/init_database.sh

Actions performed:
- Auto-generates postgres superuser password and sets it via gcloud
- Creates 1 IAM database user: embedder-service-sa@<project-id>.iam (matches Cloud Run service account)
- Enables vector extension
- Grants schema and table privileges to IAM user
- Verifies connectivity
Compiles Java code, builds container with Jib, pushes to GCR, and deploys to Cloud Run.
./scripts/cicd.sh

Build process:
- Maven clean package (skips tests)
- Jib containerization (Eclipse Temurin 25 JRE base image). Jib builds optimized, layered images without requiring a Docker daemon; it separates dependencies from application code, so rebuilds are faster and layer updates are smaller.
- Push to gcr.io/<project-id>/document-embedder:latest
- Deploy to Cloud Run with VPC connector and environment variables
- Update Pub/Sub push endpoint to Cloud Run URL
./mvnw test

Verify that the service is running and all dependencies are healthy:
# Get service URL from deployment output or:
SERVICE_URL=$(gcloud run services describe document-embedder --region=us-central1 --format='value(status.url)')
# Check all health probes
curl $SERVICE_URL/health # Aggregate (liveness + readiness)
curl $SERVICE_URL/health/live # Liveness only
curl $SERVICE_URL/health/ready # Readiness (DB, Pub/Sub, Vertex AI)

# Upload a test document
gsutil cp sample.pdf gs://<bucket-name>/
# Monitor processing
gcloud run services logs read document-embedder --region=us-central1 --limit=50

Prerequisites: Cloud SQL Proxy installed in ~/tools/cloud-sql-proxy or available in PATH
1. Start Cloud SQL Proxy with IAM Authentication
~/tools/cloud-sql-proxy --port 5433 \
--auto-iam-authn \
--impersonate-service-account=embedder-service-sa@<project-id>.iam.gserviceaccount.com \
<project-id>:<region>:<instance-name> > /tmp/proxy.log 2>&1 &
PROXY_PID=$!
sleep 3 # Wait for proxy to start

2. Connect via psql (No Password Required)
psql "host=localhost port=5433 dbname=embeddings user=embedder-service-sa@<project-id>.iam"

3. Query Embeddings
-- Count stored embeddings
SELECT COUNT(*) as embedding_count FROM embeddings;
-- View sample records
SELECT embedding_id,
LEFT(text, 100) as text_preview,
metadata->>'source' as source
FROM embeddings LIMIT 5;
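-- Check vector dimensions (vector_dims is provided by pgvector) and stored size in bytes
SELECT vector_dims(embedding) as dimensions,
       pg_column_size(embedding) as vector_bytes
FROM embeddings LIMIT 1;

4. Similarity Search Example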
-- Find documents similar to an existing embedding
WITH query_vector AS (
SELECT embedding
FROM embeddings
WHERE text ILIKE '%your search term%'
LIMIT 1
)
SELECT e.embedding_id,
LEFT(e.text, 100) as text_preview,
1 - (e.embedding <=> qv.embedding) as similarity
FROM embeddings e, query_vector qv
ORDER BY e.embedding <=> qv.embedding
LIMIT 10;

Note: The <=> operator computes cosine distance (smaller = more similar).
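To make the relationship between cosine distance and the similarity column in the query concrete, here is a small sketch of the arithmetic. It mirrors the definition of cosine distance (1 minus cosine similarity) and is not pgvector's implementation; the class and method names are hypothetical.

```java
// Hypothetical helper showing the arithmetic behind "1 - (a <=> b)".
final class CosineDistanceSketch {

    /** Cosine distance: 0 for same direction, 1 for orthogonal, 2 for opposite. */
    static double cosineDistance(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same dimension");
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        // Undefined (NaN) if either vector has zero length.
        return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Similarity as computed in the query: 1 minus the distance. */
    static double similarity(float[] a, float[] b) {
        return 1.0 - cosineDistance(a, b);
    }

    public static void main(String[] args) {
        float[] x = {1f, 0f};
        float[] y = {0f, 1f};
        System.out.println(cosineDistance(x, x));
        System.out.println(cosineDistance(x, y));
    }
}
```

Because the distance depends only on direction, scaling a vector does not change the result, which is why ordering by distance ascending and ordering by similarity descending return the same rows.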
5. Cleanup
kill $PROXY_PID # Stop proxy

Troubleshooting:
- Connection refused: Check proxy logs (tail -f /tmp/proxy.log) and verify port 5433 is free (lsof -i :5433)
- IAM auth failed: Verify you have roles/iam.serviceAccountTokenCreator on the service account
- User does not exist: Run init_database.sh to create the IAM database user
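A fixed "sleep 3" after starting the proxy can be too short on a slow start, which is one way to hit the "Connection refused" case. As an alternative, a client could poll the port until it accepts connections. The sketch below shows that idea with plain JDK sockets; the class name, timeout values, and retry interval are assumptions, not part of the project.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.time.Duration;

// Hypothetical helper: waits until a TCP port accepts connections.
final class PortWaiter {

    /** Returns true once host:port accepts a connection, false on timeout. */
    static boolean waitForPort(String host, int port, Duration timeout)
            throws InterruptedException {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (deadline - System.nanoTime() > 0) {
            try (Socket socket = new Socket()) {
                // Per-attempt connect timeout of 500 ms.
                socket.connect(new InetSocketAddress(host, port), 500);
                return true;
            } catch (IOException e) {
                // Not listening yet; retry after a short pause.
                Thread.sleep(200);
            }
        }
        return false;
    }

    public static void main(String[] args) throws InterruptedException {
        // 5433 is the local proxy port used in the steps above.
        boolean ready = waitForPort("localhost", 5433, Duration.ofSeconds(5));
        System.out.println(ready ? "proxy port is accepting connections"
                                 : "proxy port not reachable; check /tmp/proxy.log");
    }
}
```

A successful TCP connection only shows that the proxy is listening; IAM authentication problems still surface when psql connects.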