@mtalvi mtalvi commented Dec 16, 2025

Migrate RAG to Dedicated Microservice with PostgreSQL Storage

Summary

Migrates RAG from PVC-based storage to a dedicated RAG microservice using PostgreSQL for embedding storage. This removes ReadWriteOnce constraints, reduces memory duplication, and simplifies the architecture.

Changes

  • New RAG microservice (services/rag/): FastAPI service that loads embeddings from PostgreSQL and serves queries via HTTP (sketched below)
  • PostgreSQL storage: Embeddings stored in the ragembedding table using the pgvector extension
  • Backend integration: RAGHandler now communicates with the RAG service via HTTP instead of loading a local FAISS index
  • Init job updates: init_pipeline.py saves embeddings to PostgreSQL and waits for RAG service readiness
  • Non-blocking startup: RAG service starts immediately and polls PostgreSQL for embeddings in the background
  • Removed: PVC-based RAG storage and related Helm chart resources
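For reference, a minimal sketch of what the query path could look like, assuming a FastAPI app with a POST /query endpoint; the endpoint path, request shape, and the embed_query helper are illustrative assumptions, not the actual code in services/rag/:

```python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()

# Populated by a background loader once embeddings are read from PostgreSQL.
index = None               # FAISS index
documents: list[str] = []  # chunk texts aligned with the index rows


class QueryRequest(BaseModel):
    query: str
    top_k: int = 4


def embed_query(text: str):
    """Assumed helper: embed `text` with the same model the init job used,
    returning a (1, dim) float32 numpy array suitable for FAISS."""
    raise NotImplementedError


@app.post("/query")
def query(req: QueryRequest) -> dict:
    if index is None:
        # Non-blocking startup: the service is up before the index is loaded.
        raise HTTPException(status_code=503, detail="Embeddings not loaded yet")
    _, ids = index.search(embed_query(req.query), req.top_k)
    return {"chunks": [documents[i] for i in ids[0]]}
```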

Benefits

  • No node constraints: Backend pods can run on any node (no RWO PVC requirement)
  • Reduced memory usage: Single FAISS index in RAG service instead of N copies across backend pods
  • Simplified updates: Embeddings updated via PostgreSQL without pod restarts
  • Better scalability: Backend pods scale independently of RAG index

Architecture

Init Job → PostgreSQL (embeddings) → RAG Service (FAISS index) → Backend Pods (HTTP queries)

The init job and RAG service start in parallel; the RAG service polls PostgreSQL until embeddings are available, then loads the index and becomes ready.
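A minimal sketch of that readiness flow, continuing the service sketch above; fetch_embeddings and build_faiss_index are assumed helper names, not the PR's actual functions:

```python
import asyncio

POLL_INTERVAL_SECONDS = 5


async def load_index_when_ready() -> None:
    """Poll PostgreSQL until the init job has written rows to ragembedding,
    then build the FAISS index so the service can serve queries."""
    global index, documents
    while index is None:
        rows = await asyncio.to_thread(fetch_embeddings)  # assumed: SELECT ... FROM ragembedding
        if rows:
            index, documents = build_faiss_index(rows)    # assumed: wraps a faiss.IndexFlat*
            return
        await asyncio.sleep(POLL_INTERVAL_SECONDS)


@app.on_event("startup")
async def schedule_loader() -> None:
    # Returns immediately, so the pod serves HTTP while the index
    # loads in the background.
    asyncio.create_task(load_index_when_ready())
```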

Testing

  • ✅ Local deployment with Docker Compose
  • ✅ Kubernetes/OpenShift deployment via Helm
  • ✅ RAG service handles missing embeddings gracefully
  • ✅ Backend correctly queries RAG service for context retrieval

Migration Notes

  • Existing deployments will need to rebuild embeddings (init job handles this automatically)
  • No data migration needed (embeddings are regenerated from knowledge base PDFs)
  • RAG service must be deployed before running the init job (handled by Helm dependencies)

@mtalvi mtalvi marked this pull request as ready for review December 16, 2025 16:32
@mtalvi mtalvi requested a review from a team December 16, 2025 16:32
@mtalvi mtalvi requested a review from itay1551 as a code owner December 16, 2025 16:32
@itay1551 itay1551 marked this pull request as draft December 17, 2025 12:42

mtalvi commented Dec 17, 2025

@itay1551 - Following our meeting, I gave it some more thought and also discussed the second point with Yossi:

  1. I think we should keep pgvector in the root pyproject.toml.
    In src/alm/models.py I define the column type, which has to be Vector: the backend uses SQLModel to define the RAGEmbedding table with a Vector(768) column, and it both creates the table and saves embeddings through this model. We could avoid the pgvector dependency by using raw SQL everywhere, but that would lose type safety and the ORM benefits.
    Using PostgreSQL's Vector type also improves loading performance compared to alternatives like JSONB or arrays. Vector stores embeddings in a compact binary format, which reduces storage, and when the RAG service loads all embeddings on startup, the binary format parses faster than JSON (typically 3-5x), so the service becomes ready sooner. It also reduces memory overhead during loading. While we're not using PostgreSQL's native vector search operators or indexes, the Vector type still provides measurable benefits during the data loading phase, making it a better choice than storing embeddings as JSON, arrays, or strings (see the first sketch after this list).

  2. Regarding the wrapper: you're right that the wrapper in node.py (lines 24-34) is a thin delegation, but I still believe we should keep it. It maintains a clear separation of concerns between the agent interface and the service client implementation. The graph calls a simple function (get_cheat_sheet_context()) rather than accessing the RAGHandler singleton directly, which improves testability since we can mock the function without mocking HTTP calls. It also gives the graph a stable interface: if we change the RAG implementation or add validation/logging, we only modify node.py without touching the graph code. The wrapper acts as a facade, hiding the HTTP client management, error handling, and response formatting that live in rag_handler.py. While it's redundant as pure delegation, it improves maintainability and keeps the graph code focused on orchestration rather than service communication details (see the second sketch below).
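To make point 1 concrete, here is a minimal sketch of what the SQLModel definition could look like; the id and chunk_text fields are assumptions for illustration, and conveniently SQLModel's default table name (the lowercased class name) matches the ragembedding table mentioned above:

```python
from pgvector.sqlalchemy import Vector
from sqlmodel import Column, Field, SQLModel


class RAGEmbedding(SQLModel, table=True):
    """SQLModel's default table name is the lowercased class name,
    i.e. the `ragembedding` table the RAG service reads from."""

    id: int | None = Field(default=None, primary_key=True)
    chunk_text: str  # source chunk the embedding was computed from (assumed field)
    embedding: list[float] = Field(sa_column=Column(Vector(768)))
```

The init job would then persist rows through the ORM (e.g. session.add(RAGEmbedding(chunk_text=..., embedding=...))), which is exactly the type safety we'd lose with raw SQL.

And for point 2, a sketch of the facade shape being defended; the import path and the retrieve_context method name are assumed:

```python
from alm.rag_handler import RAGHandler  # module path assumed from the discussion

_rag_handler = RAGHandler()


def get_cheat_sheet_context(query: str) -> str:
    """Facade the graph calls: HTTP client management, error handling, and
    response formatting stay inside RAGHandler, and tests can monkeypatch
    this function without mocking any HTTP traffic."""
    return _rag_handler.retrieve_context(query)  # method name assumed
```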

Please let me know what you think!
