Minimal-RAG-Insurance-Agent

Educational Material: This project was developed as part of an AI/ML course curriculum. It demonstrates practical implementation of multi-agent RAG systems for insurance claim analysis.

A multi-agent RAG (Retrieval-Augmented Generation) system for analyzing and querying insurance claims using LlamaIndex, hierarchical indexing, and MCP tool integration.

Architecture Overview

                         ┌─────────────────────┐
                         │     User Query      │
                         └──────────┬──────────┘
                                    │
                         ┌──────────▼──────────┐
                         │   Manager Agent     │
                         │   (gpt-4o Router)   │
                         │                     │
                         │  • Claim ID check   │
                         │  • Query routing    │
                         └──────────┬──────────┘
                                    │
           ┌────────────────────────┼────────────────────────┐
           │                        │                        │
┌──────────▼──────────┐  ┌──────────▼──────────┐  ┌──────────▼──────────┐
│   Summary Expert    │  │   Needle Expert     │  │   MCP Policy Tool   │
│   (gpt-4o-mini)     │  │   (gpt-4o-mini)     │  │                     │
│                     │  │                     │  │  validate_policy_   │
│  High-level Q&A     │  │  Precise facts      │  │  limit()            │
│  Timeline overview  │  │  Dates, costs       │  │                     │
└──────────┬──────────┘  └──────────┬──────────┘  │  calculate_date_    │
           │                        │             │  difference()       │
┌──────────▼──────────┐  ┌──────────▼──────────┐  └─────────────────────┘
│   SummaryIndex      │  │  VectorStoreIndex   │
│   (MapReduce)       │  │  + AutoMerging      │
│                     │  │    Retriever        │
│  tree_summarize     │  │                     │
└─────────────────────┘  │  ChromaDB Backend   │
                         └─────────────────────┘

Component Flow

1. User Query → Manager Agent receives the query
2. Claim ID Validation → Manager ensures a claim ID is provided
3. Query Routing → Manager selects the appropriate tool:
   - Summary Expert for high-level questions
   - Needle Expert for specific facts
   - MCP Tool for policy limit validation
4. Response Synthesis → Expert tool processes query and returns answer
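
The flow above can be sketched in plain Python. This is an illustration only: the real Manager Agent is an LLM (gpt-4o) that picks tools from their descriptions, whereas this sketch substitutes a keyword heuristic. The `CLAIM_ID_PATTERN` regex, the `route_query` function, and the keyword lists are all assumptions, inferred from the claim IDs and tool names that appear in this README. It also does not cover lookups of a claim ID by policyholder name.

```python
import re

# Assumed ID shape, inferred from the examples in this README
# (CLM-89921, CLM-44217-PD, CLM-77182-CM); the real format may differ.
CLAIM_ID_PATTERN = re.compile(r"\bCLM-\d{5}(?:-[A-Z]{2})?\b")

# Hypothetical keyword hints standing in for the LLM's routing decision.
POLICY_HINTS = ("limit", "exceed")
SUMMARY_HINTS = ("summarize", "summary", "overview", "timeline", "red flags")


def route_query(query: str) -> dict:
    """Return the tool a query would be routed to, or a clarification request."""
    match = CLAIM_ID_PATTERN.search(query)
    if match is None:
        # Step 2: no claim ID, so ask for one instead of retrieving anything.
        return {
            "tool": None,
            "claim_id": None,
            "message": "Please provide a claim ID (for example, CLM-89921).",
        }

    lowered = query.lower()
    if any(hint in lowered for hint in POLICY_HINTS):
        tool = "validate_policy_limit"
    elif any(hint in lowered for hint in SUMMARY_HINTS):
        tool = "summary_expert"
    else:
        tool = "needle_expert"  # default route: specific facts
    return {"tool": tool, "claim_id": match.group(0), "message": None}
```

A keyword router like this is brittle, which is one reason the project pays for a stronger model at the routing step.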

Data Segmentation & Chunking Strategy

Hierarchical Structure

Claim Document
├── Level 1 (Root): 2048 tokens
│   └── Full sections with complete context
├── Level 2 (Intermediate): 512 tokens
│   └── Subsections for balanced reasoning
└── Level 3 (Leaf): 128 tokens
    └── Fine-grained facts for precision

Chunk Size Rationale

| Level | Size | Purpose | Use Case |
|---|---|---|---|
| Root | 2048 tokens | Broad context reconstruction | Understanding full incident narratives |
| Intermediate | 512 tokens | Balanced reasoning | Connecting related facts |
| Leaf | 128 tokens | Precision retrieval | Extracting specific dates, costs, codes |

Why These Sizes?

2048 tokens (Root): Matches typical section lengths in claim documents (Policy Details, Incident Report, etc.). Ensures complete context for summarization.
512 tokens (Intermediate): A middle ground for semantic similarity search. Specific enough to match a focused query, with enough surrounding context for the LLM to reason.
128 tokens (Leaf): Captures atomic facts (e.g., "Accident Time: October 14, 2023, at 11:30 PM"). Essential for "needle-in-a-haystack" queries.
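
To make the three levels concrete, here is a simplified sketch of nested chunking with parent/child links. It is not the LlamaIndex `HierarchicalNodeParser` implementation: it treats a pre-split token list as input and cuts at fixed offsets, whereas a real parser works with a tokenizer and text boundaries. The `Node` class and `build_hierarchy` function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Sequence

CHUNK_SIZES = (2048, 512, 128)  # root, intermediate, leaf


@dataclass
class Node:
    node_id: int
    level: int  # 0 = root, 1 = intermediate, 2 = leaf
    tokens: List[str]
    parent_id: Optional[int] = None
    child_ids: List[int] = field(default_factory=list)


def build_hierarchy(tokens: List[str],
                    chunk_sizes: Sequence[int] = CHUNK_SIZES) -> List[Node]:
    """Split tokens into nested fixed-size chunks and link parents to children."""
    nodes: List[Node] = []

    def split(span: List[str], level: int, parent: Optional[Node]) -> None:
        size = chunk_sizes[level]
        for start in range(0, len(span), size):
            node = Node(
                node_id=len(nodes),  # the ID doubles as the index into `nodes`
                level=level,
                tokens=span[start:start + size],
                parent_id=parent.node_id if parent is not None else None,
            )
            nodes.append(node)
            if parent is not None:
                parent.child_ids.append(node.node_id)
            if level + 1 < len(chunk_sizes):
                # Recurse: each chunk is re-split at the next, smaller size.
                split(node.tokens, level + 1, node)

    split(tokens, 0, None)
    return nodes
```

With these sizes each full root holds four intermediates and each intermediate holds four leaves, so every level is stored once per ancestor. That is the storage overhead the hierarchy trades for retrieval flexibility.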

Overlap Strategy

The HierarchicalNodeParser maintains parent-child relationships without explicit overlap. Instead, the AutoMergingRetriever replaces retrieved leaf nodes with their parent node when a large enough share of that parent's children is retrieved (a ratio threshold of 0.4, or 40%), providing contextual overlap on demand.
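
The merge rule can be illustrated with a few lines of plain Python. This is a single-level sketch, not the `AutoMergingRetriever` code: `auto_merge`, `parent_of`, and `children_of` are hypothetical names, and the strict "greater than" comparison against the threshold is an assumption about how the ratio is applied.

```python
from collections import defaultdict


def auto_merge(retrieved_leaf_ids, parent_of, children_of, ratio_thresh=0.4):
    """Replace retrieved leaves with their parent when enough siblings were hit."""
    # Group the retrieved leaves by the parent they belong to.
    hits_by_parent = defaultdict(list)
    for leaf_id in retrieved_leaf_ids:
        hits_by_parent[parent_of[leaf_id]].append(leaf_id)

    merged = []
    for parent_id, hits in hits_by_parent.items():
        ratio = len(hits) / len(children_of[parent_id])
        if ratio > ratio_thresh:
            merged.append(parent_id)  # enough siblings: return the parent chunk
        else:
            merged.extend(hits)  # too few: keep the individual leaves
    return merged
```

For example, if a parent has four leaves and two of them are retrieved, the ratio is 0.5, so the parent replaces both leaves. A parent with only one of four leaves retrieved (0.25) is left unmerged.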

Recall Improvement

Multi-level retrieval: Different chunk sizes capture different granularities
Auto-merging: Automatically expands context when needed
Metadata filtering: claim_id ensures retrieval from the correct document
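
The effect of metadata filtering can be shown with a small, library-free sketch: chunks are filtered by `claim_id` first and only the survivors are ranked by similarity. In the project this is handled by the vector store and retriever; the `retrieve` and `cosine` functions and the chunk dictionary layout below are assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, chunks, claim_id, top_k=6):
    """Filter chunks by claim_id first, then rank the survivors by similarity."""
    candidates = [c for c in chunks if c["metadata"]["claim_id"] == claim_id]
    ranked = sorted(
        candidates,
        key=lambda c: cosine(query_vec, c["embedding"]),
        reverse=True,
    )
    return ranked[:top_k]
```

Filtering before ranking means a chunk from another claim can never appear in the results, even when its text is a closer semantic match than anything in the requested claim.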

Index Schemas

1. Hierarchical Vector Index (Needle Index)

Backend: ChromaDB (persistent storage)

# Storage structure
./chroma_db/
└── insurance_claims/  # Collection name
    ├── embeddings
    ├── metadata
    └── documents

Metadata per chunk:

claim_id: Primary filter key
policy_holder: Insured party name
source_file: Original document filename

Retriever: AutoMergingRetriever

similarity_top_k=6: Retrieve top 6 leaf nodes
simple_ratio_thresh=0.4: Merge to parent if 40% of siblings retrieved

2. Summary Index (MapReduce)

Response Mode: tree_summarize

This implements the MapReduce pattern:

Map: Each retrieved chunk is summarized independently
Reduce: Summaries are combined into a final coherent response
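
The Map and Reduce steps can be sketched as follows. This is a simplified illustration, not the LlamaIndex `tree_summarize` implementation: the fixed `fan_in` grouping is an assumption (a real implementation would more likely group by how much text fits in the context window), and `summarize` is a placeholder for an LLM call supplied by the caller.

```python
from typing import Callable, List


def tree_summarize(chunks: List[str],
                   summarize: Callable[[str], str],
                   fan_in: int = 2) -> str:
    """Summarize chunks bottom-up until a single summary remains."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    if not chunks:
        return ""

    # Map: summarize every chunk independently.
    level = [summarize(chunk) for chunk in chunks]

    # Reduce: repeatedly merge groups of summaries into one parent summary.
    while len(level) > 1:
        level = [
            summarize("\n".join(level[i:i + fan_in]))
            for i in range(0, len(level), fan_in)
        ]
    return level[0]
```

Because every chunk passes through at least one summarization call, cost grows with claim size, which is why this path is reserved for broad questions rather than fact lookups.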

Ideal for:

Timeline overviews
Claim summaries
Red flag identification

Agent Design

Manager Agent (Router)

Model: gpt-4o (for sophisticated routing decisions)

System Prompt:

You are an Insurance Claim Manager Agent. Your role is to route user queries 
to specialized expert tools and provide accurate answers about insurance claims.

CRITICAL RULES:
1. You MUST obtain the claim_id from the user before retrieving any claim 
   information. If no claim_id is provided, ask the user to specify.
2. NEVER answer from memory - always use the appropriate tools.
3. For specific facts, dates, costs, or codes - use the needle_expert tool.
4. For summaries, timelines, or high-level overviews - use the summary_expert tool.
5. For validating if costs are within policy limits - use the validate_policy_limit tool.

Summary Expert

Model: gpt-4o-mini
Tool Name: summary_expert
Index: SummaryIndex with tree_summarize

Description: "Useful for high-level questions, summaries of whole claims, timeline overviews, or understanding the overall situation."

Needle Expert

Model: gpt-4o-mini
Tool Name: needle_expert
Index: VectorStoreIndex with AutoMergingRetriever

Description: "Useful for retrieving specific facts, dates, costs, codes, or fine-grained details from insurance claims."

MCP Integration

Overview

MCP (Model Context Protocol) extends the agent's capabilities beyond retrieval. The system includes an MCP server (mcp_server.py) with tools for:

Policy Limit Validation: Checks if claimed amounts exceed coverage limits
Date Difference Calculation: Computes time between claim events
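
As an illustration of the second tool, a date-difference helper might look like the sketch below. The signature, the ISO date format, and the integer return value are assumptions; only the tool name `calculate_date_difference` comes from this README, and the `@mcp.tool()` decorator is omitted so the function runs on its own.

```python
from datetime import date


def calculate_date_difference(start_date: str, end_date: str) -> int:
    """Return the number of days from start_date to end_date (ISO YYYY-MM-DD)."""
    start = date.fromisoformat(start_date)
    end = date.fromisoformat(end_date)
    # Negative when end_date is earlier than start_date.
    return (end - start).days
```

Delegating date arithmetic to a tool avoids asking the LLM to count days itself, which is a common source of small errors.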

MCP Server (`mcp_server.py`)

@mcp.tool()
def validate_policy_limit(claimed_amount: float, policy_limit: float) -> str:
    """
    Validates if a claimed amount is within policy limits.
    Returns structured analysis with risk assessment.
    """

Usage Example

Query: "For claim CLM-89921, is the repair cost of $9,766.90 within the property damage limit of $100,000?"

Tool Call:

validate_policy_limit(claimed_amount=9766.90, policy_limit=100000)

Response:

VALIDATION RESULT: Claim is WITHIN policy limits.
- Claimed Amount: $9,766.90
- Policy Limit: $100,000.00
- Coverage Used: 9.8%
- Remaining Coverage: $90,233.10
- Risk Level: Low
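
A minimal sketch of a function body that would produce the response shown above. The `@mcp.tool()` decorator is omitted so the function runs standalone. The risk bands and the wording for the over-limit case are assumptions, since this README only shows the within-limit output.

```python
def validate_policy_limit(claimed_amount: float, policy_limit: float) -> str:
    """Plain-Python sketch of the policy limit check (no MCP decorator)."""
    if policy_limit <= 0:
        raise ValueError("policy_limit must be positive")

    used_pct = claimed_amount / policy_limit * 100
    remaining = policy_limit - claimed_amount
    within = claimed_amount <= policy_limit

    # Assumed risk bands; the real thresholds are not given in this README.
    if not within:
        risk = "High"
    elif used_pct >= 80:
        risk = "Medium"
    else:
        risk = "Low"

    # The over-limit wording is assumed; only the WITHIN case is documented.
    headline = ("Claim is WITHIN policy limits." if within
                else "Claim EXCEEDS policy limits.")
    return "\n".join([
        f"VALIDATION RESULT: {headline}",
        f"- Claimed Amount: ${claimed_amount:,.2f}",
        f"- Policy Limit: ${policy_limit:,.2f}",
        f"- Coverage Used: {used_pct:.1f}%",
        f"- Remaining Coverage: ${max(remaining, 0):,.2f}",
        f"- Risk Level: {risk}",
    ])
```

Returning a formatted string matches the `-> str` signature of the tool; a structured return type would be easier to test, at the cost of an extra formatting step in the agent.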

Integration Pattern

The MCP tool is wrapped as a FunctionTool and provided to the Manager Agent alongside the expert query engines. This allows the agent to:

1. Retrieve cost information using the needle expert
2. Validate costs against policy limits using the MCP tool
3. Provide a comprehensive answer

Evaluation Methodology

This project implements an evaluation framework based on Anthropic's guide "Demystifying evals for AI agents".

Three-Tier Grading System

| Grader Type | File | Purpose | Speed | Cost |
|---|---|---|---|---|
| Code-Based | `tests/test_code_based.py` | Deterministic checks (tool calls, format, content) | Fast | Free |
| Model-Based | `tests/test_model_based.py` | LLM-as-judge evaluation (correctness, relevancy, faithfulness) | Slow | $$$ |
| Human-Based | `tests/test_human_based.py` | CLI framework for human evaluation | Manual | Time |

Code-Based Graders

Fast, cheap, and reproducible checks:

ToolCallGrader: Validates correct tools are called in correct sequence
FormatGrader: Verifies response contains expected patterns (dates, currency, claim IDs)
ResponseContentGrader: Checks key facts appear in response
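
The three checks can be approximated with short deterministic functions. This is a sketch of the idea, not the project's grader classes: the function names, the regex patterns, and the choice to allow extra tool calls between expected ones are all assumptions.

```python
import re

# Assumed patterns; the project's real graders may use different ones.
PATTERNS = {
    "claim_id": re.compile(r"\bCLM-\d{5}(?:-[A-Z]{2})?\b"),
    "currency": re.compile(r"\$\d{1,3}(?:,\d{3})*(?:\.\d{2})?"),
    "date": re.compile(
        r"\b(?:January|February|March|April|May|June|July|August|"
        r"September|October|November|December) \d{1,2}, \d{4}\b"
    ),
}


def grade_format(response: str, required: list) -> bool:
    """Pass when every required pattern appears somewhere in the response."""
    return all(PATTERNS[name].search(response) for name in required)


def grade_tool_calls(actual: list, expected: list) -> bool:
    """Pass when the expected tools appear in order (extra calls are allowed)."""
    remaining = iter(actual)
    # `tool in remaining` consumes the iterator, so order is enforced.
    return all(tool in remaining for tool in expected)


def grade_content(response: str, key_facts: list) -> bool:
    """Pass when every key fact appears in the response, ignoring case."""
    lowered = response.lower()
    return all(fact.lower() in lowered for fact in key_facts)
```

Checks like these run in milliseconds with no API calls, so they are a natural first gate before the slower model-based graders.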

Model-Based Graders (LLM-as-Judge)

Using gpt-4o as the judge model:

| Evaluator | Metric | Scale | Purpose |
|---|---|---|---|
| `CorrectnessGrader` | Answer Correctness | 0-5 | Does the answer match ground truth? |
| `RelevancyGrader` | Context Relevancy | 0-1 | Is the retrieved context relevant to the query? |
| `FaithfulnessGrader` | Answer Faithfulness | 0-1 | Is the answer grounded in the retrieved context? |
| `CustomRubricGrader` | User-defined | 0-5 | Domain-specific evaluation criteria |
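
Because the graders use different scales (0-5 and 0-1), comparing them requires normalization. The sketch below shows one way to wrap a judge call and map its score to the range 0 to 1. It is not the project's grader code: `judge`, `normalize`, the prompt wording, and the `ask_llm` callable (a stand-in for the gpt-4o call) are assumptions.

```python
from typing import Callable

# Scale maxima taken from the table above.
SCALE_MAX = {
    "correctness": 5.0,
    "relevancy": 1.0,
    "faithfulness": 1.0,
    "custom_rubric": 5.0,
}


def normalize(metric: str, raw_score: float) -> float:
    """Map a raw judge score onto 0-1 using the metric's own scale."""
    top = SCALE_MAX[metric]
    if not 0.0 <= raw_score <= top:
        raise ValueError(f"{metric} score {raw_score} is outside 0-{top:g}")
    return raw_score / top


def judge(metric: str, query: str, answer: str, reference: str,
          ask_llm: Callable[[str], str]) -> float:
    """Ask the judge model for a score and return it normalized to 0-1.

    `reference` is the ground truth for correctness, or the retrieved
    context for relevancy and faithfulness.
    """
    prompt = (
        f"Rate the {metric} of the answer on a 0-{SCALE_MAX[metric]:g} scale.\n"
        f"Query: {query}\nAnswer: {answer}\nReference: {reference}\n"
        "Reply with the number only."
    )
    return normalize(metric, float(ask_llm(prompt).strip()))
```

Passing the model call in as a function keeps the scoring logic testable with a stub, without spending API credits.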

Human-Based Graders

Gold standard for calibration and subjective evaluation:

Interactive CLI for scoring responses
Predefined rubrics for consistency
Session management and result persistence

Test Dataset

7 test cases covering all three claims and query types:

Needle Questions (Specific Facts):

1. Exact accident time from police report (CLM-89921)
2. Specific pipe type that ruptured (CLM-44217-PD)
3. Total equipment replacement cost (CLM-77182-CM)

Summary Questions (Broad Overview):

4. Multi-vehicle collision sequence (CLM-89921)
5. Water intrusion cause and affected areas (CLM-44217-PD)

MCP Tool Questions:

6. Repair cost within policy limit validation (CLM-89921)
7. Claim exceeding policy limit detection (CLM-89921)

Running Tests

# Install test dependencies
uv sync --extra test

# Run all tests
uv run pytest tests/ -v -s

# Run specific grader type
uv run pytest tests/test_code_based.py -v -s
uv run pytest tests/test_model_based.py -v -s

# Run human evaluation CLI
uv run python tests/test_human_based.py

# Run legacy evaluation script
uv run evaluation.py

Best Practices

Following Anthropic's recommendations:

Grade outcomes, not paths: Check what the agent produced, not exactly how it got there
Isolated judges: Use separate evaluators for each dimension
Start small: 20-50 test cases drawn from real failures
Calibrate regularly: Use human evaluation to calibrate model judges

Setup & Execution

Prerequisites

Python 3.10+
OpenAI API key
uv package manager

Installation

# Clone the repository
git clone https://github.com/AvivK5498/Minimal-RAG-Insurance-Agent.git
cd Minimal-RAG-Insurance-Agent

# Install dependencies
uv sync

# Set up environment
cp .env.example .env
# Edit .env and add your OPENAI_API_KEY

Running the System

# Start interactive chat with the agent
uv run main.py

# Run evaluation suite
uv run evaluation.py

# Run MCP server standalone (for testing)
uv run mcp_server.py

Example Queries

You: What is the claim ID for Sarah Connor's case?
Agent: Sarah Connor's case has claim ID CLM-89921.

You: For claim CLM-89921, what was the exact accident time?
Agent: According to the police report, the accident occurred on October 14, 2023, at 11:30 PM.

You: Summarize the red flags in claim CLM-89921.
Agent: [Provides summary of timeline inconsistencies, driver identity ambiguity, and commercial use concerns]

Limitations & Trade-offs

Current Limitations

Single Vector Store: All claims share one ChromaDB collection. For production, consider separate collections per claim type.
Metadata Extraction: LLM-based extraction may occasionally misparse unusual document formats.
MCP Integration: Currently uses a local function wrapper. Full MCP server integration via stdio requires separate process management.
Context Window: Very large claims may exceed context limits during summarization.

Trade-offs Made

| Decision | Trade-off |
|---|---|
| 3-level hierarchy | More storage overhead, but better retrieval flexibility |
| gpt-4o for routing | Higher cost, but more accurate tool selection |
| gpt-4o-mini for experts | Lower accuracy ceiling, but significantly reduced costs |
| ChromaDB local | Not production-scale, but portable for assignment submission |
| LLM metadata extraction | Slower initial load, but handles format variations |

Future Improvements

Streaming responses for better UX
Claim-specific collections for better isolation
Hybrid search (semantic + keyword) for better recall
Caching layer for repeated queries
Async batch processing for large document sets

File Structure

/Minimal-RAG-Insurance-Agent
├── pyproject.toml        # uv dependencies
├── .env                  # OPENAI_API_KEY (create from .env.example)
├── data/
│   ├── claim001.md       # Auto collision claim
│   ├── claim002.md       # Water damage claim
│   ├── claim003.md       # Theft claim
│   └── metadata_cache.json  # Extracted metadata cache
├── chroma_db/            # Persistent vector store
├── mcp_server.py         # FastMCP server with policy tools
├── main.py               # Manager Agent + Sub-Agents
├── evaluation.py         # Legacy evaluation script
├── tests/
│   ├── __init__.py           # Package marker
│   ├── conftest.py           # Shared fixtures and test cases
│   ├── test_code_based.py    # Code-based grader tests
│   ├── test_model_based.py   # Model-based grader tests
│   ├── test_human_based.py   # Human evaluation CLI
│   └── README.md             # Test suite documentation
├── .github/
│   └── workflows/
│       └── agent-evals.yml   # CI/CD workflow for tests
└── README.md             # This file
