@mtalvi mtalvi commented Dec 16, 2025

Migrate RAG to Dedicated Microservice with PostgreSQL Storage

Summary

Migrates RAG from PVC-based storage to a dedicated RAG microservice using PostgreSQL for embedding storage. This removes ReadWriteOnce constraints, reduces memory duplication, and simplifies the architecture.

Changes

  • New RAG microservice (services/rag/): FastAPI service that loads embeddings from PostgreSQL and serves queries via HTTP (sketched below)
  • PostgreSQL storage: Embeddings stored in the ragembedding table using the pgvector extension
  • Backend integration: RAGHandler now communicates with the RAG service via HTTP instead of loading a local FAISS index
  • Init job updates: init_pipeline.py saves embeddings to PostgreSQL and waits for RAG service readiness
  • Non-blocking startup: RAG service starts immediately and polls PostgreSQL for embeddings in the background
  • Removed: PVC-based RAG storage and related Helm chart resources
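For reference, a minimal sketch of what the query path could look like, assuming a FastAPI app with a POST /query endpoint; the endpoint path, request shape, and the embed_query helper are illustrative assumptions, not the actual code in services/rag/:

```python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()

# Populated by a background loader once embeddings are read from PostgreSQL.
index = None               # FAISS index
documents: list[str] = []  # chunk texts aligned with the index rows


class QueryRequest(BaseModel):
    query: str
    top_k: int = 4


def embed_query(text: str):
    """Assumed helper: embed `text` with the same model the init job used,
    returning a (1, dim) float32 numpy array suitable for FAISS."""
    raise NotImplementedError


@app.post("/query")
def query(req: QueryRequest) -> dict:
    if index is None:
        # Non-blocking startup: the service is up before the index is loaded.
        raise HTTPException(status_code=503, detail="Embeddings not loaded yet")
    _, ids = index.search(embed_query(req.query), req.top_k)
    return {"chunks": [documents[i] for i in ids[0]]}
```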

Benefits

  • No node constraints: Backend pods can run on any node (no RWO PVC requirement)
  • Reduced memory usage: Single FAISS index in RAG service instead of N copies across backend pods
  • Simplified updates: Embeddings updated via PostgreSQL without pod restarts
  • Better scalability: Backend pods scale independently of RAG index

Architecture

Init Job → PostgreSQL (embeddings) → RAG Service (FAISS index) → Backend Pods (HTTP queries)

The init job and RAG service start in parallel; the RAG service polls PostgreSQL until embeddings are available, then loads the index and becomes ready.
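A minimal sketch of that readiness flow, continuing the service sketch above; fetch_embeddings and build_faiss_index are assumed helper names, not the PR's actual functions:

```python
import asyncio

POLL_INTERVAL_SECONDS = 5


async def load_index_when_ready() -> None:
    """Poll PostgreSQL until the init job has written rows to ragembedding,
    then build the FAISS index so the service can serve queries."""
    global index, documents
    while index is None:
        rows = await asyncio.to_thread(fetch_embeddings)  # assumed: SELECT ... FROM ragembedding
        if rows:
            index, documents = build_faiss_index(rows)    # assumed: wraps a faiss.IndexFlat*
            return
        await asyncio.sleep(POLL_INTERVAL_SECONDS)


@app.on_event("startup")
async def schedule_loader() -> None:
    # Returns immediately, so the pod serves HTTP while the index
    # loads in the background.
    asyncio.create_task(load_index_when_ready())
```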

Testing

  • ✅ Local deployment with Docker Compose
  • ✅ Kubernetes/OpenShift deployment via Helm
  • ✅ RAG service handles missing embeddings gracefully
  • ✅ Backend correctly queries RAG service for context retrieval

Migration Notes

  • Existing deployments will need to rebuild embeddings (init job handles this automatically)
  • No data migration needed (embeddings are regenerated from knowledge base PDFs)
  • RAG service must be deployed before running the init job (handled by Helm dependencies)

@mtalvi mtalvi marked this pull request as ready for review December 16, 2025 16:32
@mtalvi mtalvi requested a review from a team December 16, 2025 16:32
@mtalvi mtalvi requested a review from itay1551 as a code owner December 16, 2025 16:32
@itay1551 itay1551 marked this pull request as draft December 17, 2025 12:42

mtalvi commented Dec 17, 2025

@itay1551 - Following our meeting, I gave it some more thought and also discussed the second point with Yossi:

  1. I think we should keep pgvector in the root pyproject.toml.
    In src/alm/models.py I define the column type, which has to be Vector: the backend uses SQLModel to define the RAGEmbedding table with a Vector(768) column, and it both creates the table and saves embeddings through this model. We could avoid the pgvector dependency by using raw SQL everywhere, but that would lose type safety and the ORM benefits.
    Using PostgreSQL's Vector type also improves loading performance compared to alternatives like JSONB or arrays. Vector stores embeddings in a compact binary format, which reduces storage, and when the RAG service loads all embeddings on startup, the binary format parses faster than JSON (typically 3-5x), so the service becomes ready sooner. It also reduces memory overhead during loading. While we're not using PostgreSQL's native vector search operators or indexes, the Vector type still provides measurable benefits during the data loading phase, making it a better choice than storing embeddings as JSON, arrays, or strings (see the first sketch after this list).

  2. Regarding the wrapper: you're right that the wrapper in node.py (lines 24-34) is a thin delegation, but I still believe we should keep it. It maintains a clear separation of concerns between the agent interface and the service client implementation. The graph calls a simple function (get_cheat_sheet_context()) rather than accessing the RAGHandler singleton directly, which improves testability since we can mock the function without mocking HTTP calls. It also gives the graph a stable interface: if we change the RAG implementation or add validation/logging, we only modify node.py without touching the graph code. The wrapper acts as a facade, hiding the HTTP client management, error handling, and response formatting that live in rag_handler.py. While it's redundant as pure delegation, it improves maintainability and keeps the graph code focused on orchestration rather than service communication details (see the second sketch below).
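To make point 1 concrete, here is a minimal sketch of what the SQLModel definition could look like; the id and chunk_text fields are assumptions for illustration, and conveniently SQLModel's default table name (the lowercased class name) matches the ragembedding table mentioned above:

```python
from pgvector.sqlalchemy import Vector
from sqlmodel import Column, Field, SQLModel


class RAGEmbedding(SQLModel, table=True):
    """SQLModel's default table name is the lowercased class name,
    i.e. the `ragembedding` table the RAG service reads from."""

    id: int | None = Field(default=None, primary_key=True)
    chunk_text: str  # source chunk the embedding was computed from (assumed field)
    embedding: list[float] = Field(sa_column=Column(Vector(768)))
```

The init job would then persist rows through the ORM (e.g. session.add(RAGEmbedding(chunk_text=..., embedding=...))), which is exactly the type safety we'd lose with raw SQL.

And for point 2, a sketch of the facade shape being defended; the import path and the retrieve_context method name are assumed:

```python
from alm.rag_handler import RAGHandler  # module path assumed from the discussion

_rag_handler = RAGHandler()


def get_cheat_sheet_context(query: str) -> str:
    """Facade the graph calls: HTTP client management, error handling, and
    response formatting stay inside RAGHandler, and tests can monkeypatch
    this function without mocking any HTTP traffic."""
    return _rag_handler.retrieve_context(query)  # method name assumed
```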

Please let me know what you think!
