Closed
69 commits
d0d0668
Add SparseEmbedder class and update get_embedder function to support …
spyrchat Apr 26, 2025
7c6bf9c
Refactor QdrantVectorDB to support dense and sparse vectors; update i…
spyrchat Apr 26, 2025
f283739
Enhance EmbeddingPipeline to support optional sparse embeddings; refa…
spyrchat Apr 26, 2025
3b84dc8
Enhance QdrantVectorDB to support dense and sparse embeddings; update…
spyrchat Apr 27, 2025
6cbb9e4
Refactor QdrantVectorDB and embedding factory to enhance collection i…
spyrchat Apr 28, 2025
c02e61a
Refactor init_collection method in QdrantVectorDB to remove sparse_ve…
spyrchat Apr 28, 2025
f1de843
Refactor QdrantVectorDB to inherit from BaseVectorDB; implement metho…
spyrchat Apr 28, 2025
14cbf21
Refactor BaseVectorDB to specify return types for methods and enhance…
spyrchat Apr 28, 2025
862134b
Merge branch 'development' of https://github.com/spyrchat/Thesis into…
spyrchat Apr 28, 2025
7f75ff9
Merge pull request #1 from spyrchat/hybrid-retriever
spyrchat Apr 28, 2025
b5d9c2b
Remove BaseEmbedder inheritance from HuggingFaceEmbedder for improved…
spyrchat Apr 28, 2025
4b51eda
Remove BaseEmbedder inheritance from TitanEmbedder for improved clarity.
spyrchat Apr 28, 2025
9cca60c
Merge pull request #2 from spyrchat/hybrid-retriever
spyrchat Apr 28, 2025
df599be
Add PostgresController and connection test script for PostgreSQL inte…
spyrchat May 20, 2025
9045ddf
Implement image and table asset insertion methods in PostgresControll…
spyrchat May 21, 2025
31a2a7e
Add table extraction and SQL uploading functionality; refactor import…
spyrchat May 29, 2025
bdb9037
Add PDF processing, table extraction, and text chunking functionality…
spyrchat May 29, 2025
bcc92da
Enhance embedding pipeline with dynamic embedding strategy; add PDF p…
spyrchat May 30, 2025
d400d85
Refactor import statements to use relative paths; update sandbox dire…
spyrchat May 30, 2025
40d43c9
Enhance Qdrant document insertion with error handling and logging; up…
spyrchat Jun 5, 2025
d6c07b5
Add table extraction functionality with logging; implement PDF proces…
spyrchat Jun 5, 2025
f75f74f
Implement modular RAG pipeline with query interpretation, SQL plannin…
spyrchat Jul 8, 2025
53d213b
Add Dockerfile, docker-compose.yml, and main application logic; imple…
spyrchat Jul 8, 2025
985440e
Refactor QdrantVectorDB: remove unused import and add spacing; update…
spyrchat Jul 9, 2025
20a99bd
Updated requirements.txt
spyrchat Jul 9, 2025
8c08f87
full pipeline is functional
spyrchat Jul 9, 2025
b6feff5
Added logging
spyrchat Jul 9, 2025
4187574
Added docstrings for clarity
spyrchat Jul 9, 2025
fa53084
Added Docstrings
spyrchat Jul 9, 2025
346f0d6
added config.yml
spyrchat Jul 9, 2025
268c15b
System Works with config.yml
spyrchat Jul 9, 2025
4e2d2c7
Feat Agent Works as intended
spyrchat Jul 9, 2025
4b71446
Add smoke tests, vector store uploader, and document validator
spyrchat Aug 20, 2025
2bd4a0c
feat: Add minimal SOSum ingestion test and standalone processor
spyrchat Aug 20, 2025
a3fd333
feat: Enhance data handling and validation in ingestion pipeline
spyrchat Aug 21, 2025
a23242b
Add Quick Start Guide for MLOps Pipeline and implement core components
spyrchat Aug 21, 2025
811b2c6
feat: Implement Stack Overflow adapter analysis and testing tools
spyrchat Aug 21, 2025
663dbbd
feat: Add answer metadata tests and enhance answer retrieval output
spyrchat Aug 21, 2025
00586f0
Add experimental and hybrid retrieval configurations, enhance testing…
spyrchat Aug 21, 2025
3add6e6
Add unit tests for retrieval pipeline and related components
spyrchat Aug 21, 2025
28d11ed
feat: Update dependencies in requirements.txt and add new packages
spyrchat Aug 30, 2025
db65791
feat: Enhance embedding strategy configuration and improve smoke test…
spyrchat Aug 30, 2025
c353fe2
Refactor retrieval pipeline to modern architecture
spyrchat Aug 30, 2025
439708a
Refactor configuration loading and retriever initialization
spyrchat Aug 30, 2025
32a3daf
feat: Consolidate configuration system and enhance benchmark function…
spyrchat Aug 30, 2025
8483973
feat: Enhance benchmark evaluation by implementing NaN handling for m…
spyrchat Aug 30, 2025
9acb29c
feat: Improve document ID handling and external ID preservation in Qd…
spyrchat Aug 30, 2025
3aeceee
Refactor benchmark scripts and retrievers for improved functionality …
spyrchat Aug 30, 2025
d18cdc4
Add dataset configurations for Natural Questions and SOSum Stack Over…
spyrchat Aug 30, 2025
22500db
Remove obsolete test files and add a new local end-to-end test setup …
spyrchat Aug 30, 2025
10b6620
chore: Update Python version to 3.13 in pipeline tests
spyrchat Aug 30, 2025
056f007
chore: Update testing dependencies and Python version in CI workflows
spyrchat Aug 30, 2025
7323f4b
refactor: Simplify dependency management by removing requirements-tes…
spyrchat Aug 31, 2025
a81237f
Remove outdated documentation and SQL components; reorganize configur…
spyrchat Aug 31, 2025
b39c51d
chore: Update requirements-minimal.txt to include missing dependencie…
spyrchat Aug 31, 2025
e1beb8b
chore: Add missing dependencies for boto3, botocore, and langchain-qd…
spyrchat Aug 31, 2025
149ab30
refactor: Enhance Qdrant connectivity tests and remove outdated requi…
spyrchat Aug 31, 2025
63d29af
chore: Remove outdated GitHub Actions CI configuration and local test…
spyrchat Aug 31, 2025
18903d8
chore: Remove outdated example scripts and sample data files for retr…
spyrchat Aug 31, 2025
3393ec5
Fix Google dependencies conflict in requirements.txt
spyrchat Aug 31, 2025
9415e07
fix: Remove unnecessary blank line in insert_documents method
spyrchat Sep 7, 2025
2748761
fix: Improve .env loading and add default values for Qdrant configura…
spyrchat Sep 7, 2025
8d6d16f
fix: Update Qdrant service configuration for improved health checks a…
spyrchat Sep 7, 2025
8163734
fix: Improve Qdrant health check commands and update logging messages…
spyrchat Sep 7, 2025
44baa40
fix: Update Qdrant health check commands to use the correct endpoint …
spyrchat Sep 7, 2025
c128036
fix: Update Qdrant health check commands for improved readiness verif…
spyrchat Sep 7, 2025
0068770
fix: Enhance Qdrant readiness check with retry logic and timeout hand…
spyrchat Sep 7, 2025
b2f8882
fix: Update Qdrant readiness check to use the correct endpoint
spyrchat Sep 7, 2025
eebb5d6
fix: Update Qdrant health check endpoints and enhance pipeline test c…
spyrchat Sep 7, 2025
418 changes: 418 additions & 0 deletions .github/workflows/pipeline-tests.yml

Large diffs are not rendered by default.

12 changes: 11 additions & 1 deletion .gitignore
@@ -19,4 +19,14 @@ climate-fever
*.log
__pycache__
sandbox/*
/__pycache__
/__pycache__
synthetic_dataset/text_dataset_template.json
extraction_output/
.idea/misc.xml
.idea/modules.xml
.idea/Thesis.iml
.idea/vcs.xml
.idea/inspectionProfiles/profiles_settings.xml
*.json
*.csv

18 changes: 18 additions & 0 deletions Dockerfile
@@ -0,0 +1,18 @@
# Use a slim Python base image
FROM python:3.11-slim

# Set working directory
WORKDIR /app

# Install system packages if needed
RUN apt-get update && apt-get install -y \
    build-essential \
    libpq-dev \
    && rm -rf /var/lib/apt/lists/*

# Install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the full source code
COPY . .
266 changes: 266 additions & 0 deletions README.md
@@ -0,0 +1,266 @@
# Advanced RAG Retrieval System with LangGraph Agent

A production-ready, modular RAG (Retrieval-Augmented Generation) system with configurable pipelines and LangGraph agent integration.

## Key Features

- **YAML-Configurable Pipelines**: Switch retrieval strategies without code changes
- **LangGraph Agent Integration**: Seamless agent workflows with rich metadata
- **Modular Components**: Easily extensible rerankers, filters, and retrievers
- **Multiple Retrieval Methods**: Dense, sparse, and hybrid retrieval
- **Production Ready**: Robust error handling, logging, and monitoring
- **A/B Testing Support**: Compare configurations easily
- **Rich Metadata**: Access scores, methods, and quality metrics

## Architecture Overview

```
      ┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
      │   LangGraph     │────│  Configurable    │────│   Retrieval     │
      │     Agent       │    │ Retriever Agent  │    │    Pipeline     │
      └─────────────────┘    └──────────────────┘    └─────────────────┘
      ┌────────────────────────────────┼────────────────────────────────┐
      │                                │                                │
┌─────▼─────┐                  ┌───────▼────────┐                 ┌─────▼─────┐
│ Retrievers│                  │   Rerankers    │                 │  Filters  │
│           │                  │                │                 │           │
│• Dense    │                  │• CrossEncoder  │                 │• Score    │
│• Sparse   │                  │• BGE Reranker  │                 │• Content  │
│• Hybrid   │                  │• Multi-stage   │                 │• Custom   │
└───────────┘                  └────────────────┘                 └───────────┘
```

## Quick Start

### 1. Install Dependencies

```bash
pip install -r requirements.txt
```

### 2. Configure Environment

```bash
# Copy example config
cp config.yml.example config.yml

# Set up your API keys and database connections in config.yml
```

### 3. Start Using the System

```python
# main.py - Chat with your agent
from agent.graph import graph

state = {"question": "How to handle Python exceptions?"}
result = graph.invoke(state)
print(result["answer"])
```

### 4. Switch Retrieval Configurations

```bash
# List available configurations
python bin/switch_agent_config.py --list

# Switch to advanced reranked pipeline
python bin/switch_agent_config.py advanced_reranked

# Test the configuration
python test_agent_retriever_node.py
```

## Available Configurations

| Configuration | Description | Components | Use Case |
|---------------|-------------|------------|----------|
| `basic_dense` | Simple dense retrieval | Dense retriever only | Development, testing |
| `advanced_reranked` | Production quality | Dense + CrossEncoder + filters | Production RAG |
| `hybrid_multistage` | Best performance | Hybrid + multi-stage reranking | High-quality results |
| `experimental` | Latest features | BGE reranker + custom filters | Experimentation |

## Configuration Example

```yaml
# pipelines/configs/retrieval/advanced_reranked.yml
retrieval_pipeline:
  retriever:
    type: dense
    top_k: 10

  stages:
    - type: reranker
      config:
        model_type: cross_encoder
        model_name: "ms-marco-MiniLM-L-6-v2"

    - type: filter
      config:
        type: score
        min_score: 0.5

    - type: answer_enhancer
      config:
        boost_factor: 2.0
```
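
A minimal sketch of how a stage list like the one above might be applied, assuming each stage is a function from a document list to a document list. `Doc`, `score_filter`, and `run_stages` are hypothetical names used for illustration, not the project's API. Only the `filter` stage type is modeled here, and unknown stage types are skipped.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document."""
    text: str
    metadata: Dict[str, float] = field(default_factory=dict)


def score_filter(docs: List[Doc], min_score: float) -> List[Doc]:
    """Keep documents whose metadata score is at least min_score."""
    return [d for d in docs if d.metadata.get("score", 0.0) >= min_score]


def run_stages(docs: List[Doc], stages: List[dict]) -> List[Doc]:
    """Apply each configured stage in order; unknown stage types are skipped."""
    handlers: Dict[str, Callable[[List[Doc], dict], List[Doc]]] = {
        "filter": lambda ds, cfg: score_filter(ds, cfg.get("min_score", 0.0)),
    }
    for stage in stages:
        handler = handlers.get(stage["type"])
        if handler is not None:
            docs = handler(docs, stage.get("config", {}))
    return docs


# Mirrors the `stages` list from the YAML, already parsed into Python objects.
stages = [
    {"type": "reranker", "config": {"model_type": "cross_encoder"}},  # skipped in this sketch
    {"type": "filter", "config": {"type": "score", "min_score": 0.5}},
]
docs = [Doc("a", {"score": 0.9}), Doc("b", {"score": 0.2})]
kept = run_stages(docs, stages)
print([d.text for d in kept])
```

Keeping the handlers in a dictionary keyed by stage `type` is one way to let a new stage be registered without touching the loop.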

## Project Structure

```
Thesis/
├── agent/ # LangGraph agent implementation
│ ├── graph.py # Main agent graph
│ ├── schema.py # Agent state schemas
│ └── nodes/ # Agent nodes (retriever, generator, etc.)
├── components/ # Modular retrieval components
│ ├── retrieval_pipeline.py # Main pipeline orchestrator
│ ├── rerankers.py # Reranking implementations
│ ├── filters.py # Filtering implementations
│ └── advanced_rerankers.py # Advanced reranking strategies
├── pipelines/ # Data processing and configuration
│ ├── configs/retrieval/ # Retrieval pipeline configurations
│ ├── adapters/ # Dataset adapters (BEIR, etc.)
│ └── ingest/ # Data ingestion pipeline
├── bin/ # Command-line utilities
│ ├── switch_agent_config.py # Configuration management
│ ├── agent_retriever.py # Configurable retriever agent
│ └── retrieval_pipeline.py # Direct pipeline usage
├── docs/ # Documentation
│ ├── SYSTEM_EXTENSION_GUIDE.md # Complete extension guide
│ ├── AGENT_INTEGRATION.md # Agent integration details
│ ├── CODE_CLEANUP_SUMMARY.md # Code cleanup documentation
│ └── EXTENSIBILITY.md # Quick extensibility overview
├── tests/ # Test suite
│ ├── retrieval/ # Retrieval pipeline tests
│ └── agent/ # Agent integration tests
├── deprecated/ # Legacy code (organized)
│ ├── old_processors/ # Superseded by new pipeline
│ ├── old_debug_scripts/ # Legacy debugging tools
│ └── old_playground/ # Legacy test scripts
├── database/ # Database controllers
├── embedding/ # Embedding utilities
├── retrievers/ # Base retrievers
├── examples/ # Usage examples
└── config/ # Configuration utilities
```

## Testing

```bash
# Test agent integration
python test_agent_retriever_node.py

# Run all tests
python tests/run_all_tests.py

# Test specific components
python -m pytest tests/retrieval/ -v
```

## Documentation

- **[System Extension Guide](docs/SYSTEM_EXTENSION_GUIDE.md)** - Complete guide to extending the system
- **[Agent Integration](docs/AGENT_INTEGRATION.md)** - How the agent uses configurable pipelines
- **[Code Cleanup Summary](docs/CODE_CLEANUP_SUMMARY.md)** - Professional code standards and cleanup details
- **[Extensibility Overview](docs/EXTENSIBILITY.md)** - Quick overview of extension capabilities
- **[Architecture](docs/MLOPS_PIPELINE_ARCHITECTURE.md)** - System architecture details

## Extending the System

### Add a Custom Reranker

```python
# components/my_reranker.py
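from typing import List

# Assumed Document type; swap in whichever Document class the project uses.
from langchain_core.documents import Document

from .rerankers import BaseReranker


class MyCustomReranker(BaseReranker):
    def rerank(self, query: str, documents: List[Document]) -> List[Document]:
        # Your custom reranking logic; calculate_score is a method you implement.
        for doc in documents:
            doc.metadata["score"] = self.calculate_score(query, doc.page_content)

        return sorted(documents, key=lambda x: x.metadata["score"], reverse=True)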
```
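
The `BaseReranker` interface is not shown in this README, so the following is a self-contained sketch of the same pattern rather than a drop-in component. `SimpleDoc` and `KeywordOverlapReranker` are hypothetical names, and the token-overlap score is only a placeholder for real reranking logic.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class SimpleDoc:
    """Hypothetical stand-in exposing the two attributes the README example uses."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


class KeywordOverlapReranker:
    """Scores each document by the fraction of query tokens it contains."""

    def calculate_score(self, query: str, text: str) -> float:
        query_tokens = set(query.lower().split())
        if not query_tokens:
            return 0.0
        text_tokens = set(text.lower().split())
        return len(query_tokens & text_tokens) / len(query_tokens)

    def rerank(self, query: str, documents: List[SimpleDoc]) -> List[SimpleDoc]:
        for doc in documents:
            doc.metadata["score"] = self.calculate_score(query, doc.page_content)
        # Highest score first; sorted() is stable, so ties keep their input order.
        return sorted(documents, key=lambda d: d.metadata["score"], reverse=True)


docs = [
    SimpleDoc("sorting lists in python"),
    SimpleDoc("python exceptions and how to handle them"),
]
ranked = KeywordOverlapReranker().rerank("handle python exceptions", docs)
print(ranked[0].page_content)
```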

### Create a New Configuration

```yaml
# pipelines/configs/retrieval/my_config.yml
retrieval_pipeline:
  retriever:
    type: hybrid
    top_k: 15

  stages:
    - type: reranker
      config:
        model_type: my_custom
        custom_param: "value"
```

### Switch and Test

```bash
python bin/switch_agent_config.py my_config
python test_agent_retriever_node.py
```

## Production Usage

The system is designed for production use with:

- **Robust Error Handling**: Graceful degradation when components fail
- **Comprehensive Logging**: Monitor retrieval performance and quality
- **Configuration Management**: Easy deployment of different strategies
- **Performance Optimization**: Efficient batching and caching support
- **Monitoring Ready**: Built-in metrics and health checks
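
As an illustration of the graceful-degradation idea, here is a minimal sketch, assuming a stage is a callable over a list of documents. `safe_stage` and `broken_reranker` are hypothetical names and do not come from the project's code.

```python
import logging
from typing import Callable, List

logger = logging.getLogger("retrieval_pipeline_sketch")

Stage = Callable[[List[str]], List[str]]


def safe_stage(stage: Stage, name: str) -> Stage:
    """Wrap a stage so a failure logs a warning and passes the input through unchanged."""
    def wrapped(docs: List[str]) -> List[str]:
        try:
            return stage(docs)
        except Exception:
            logger.warning("Stage %r failed; returning input unchanged", name, exc_info=True)
            return docs
    return wrapped


def broken_reranker(docs: List[str]) -> List[str]:
    """Simulates a component that fails at runtime."""
    raise RuntimeError("model not loaded")


result = safe_stage(broken_reranker, "reranker")(["doc-1", "doc-2"])
print(result)
```

With this wrapper a failed reranker costs ranking quality but not the whole request, and the logged traceback keeps the failure visible.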

## Use Cases

- **Document Q&A Systems**: High-quality retrieval for knowledge bases
- **Research Assistants**: Multi-modal retrieval for academic content
- **Customer Support**: Context-aware response generation
- **Code Search**: Semantic search over codebases
- **Legal Research**: Precise retrieval from legal documents

## Contributing

1. Fork the repository
2. Create a feature branch
3. Add your extension following the patterns in `docs/SYSTEM_EXTENSION_GUIDE.md`
4. Add tests for your components
5. Submit a pull request

## Performance

The system supports various performance optimization strategies:

- **Caching**: LRU caching for repeated queries
- **Batching**: Efficient batch processing for rerankers
- **Adaptive Top-K**: Dynamic result count based on query complexity
- **Multi-threading**: Parallel processing for pipeline stages
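
The caching strategy can be sketched with the standard library's `functools.lru_cache`. This is an assumption about how such a cache might look, not the project's implementation; `expensive_retrieve` is a hypothetical placeholder for a vector-store lookup.

```python
from functools import lru_cache
from typing import Tuple

calls = {"count": 0}


def expensive_retrieve(query: str) -> Tuple[str, ...]:
    """Hypothetical placeholder for a vector-store lookup."""
    calls["count"] += 1
    return (f"result for {query}",)


@lru_cache(maxsize=256)
def cached_retrieve(query: str) -> Tuple[str, ...]:
    # Return a tuple so callers cannot mutate the cached value.
    return expensive_retrieve(query)


cached_retrieve("python exceptions")
cached_retrieve("python exceptions")  # served from the cache
print(calls["count"], cached_retrieve.cache_info().hits)
```

`lru_cache` requires hashable arguments, which is why the sketch keys on the query string alone; filters or per-user settings would need to be folded into the key.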

## Migration from Legacy

If you have existing code using the deprecated `processors/` system:

1. Check `deprecated/old_processors/` for reference
2. Use the new pipeline configurations in `pipelines/configs/retrieval/`
3. Follow the migration patterns in `docs/AGENT_INTEGRATION.md`

## License

This project is licensed under the MIT License - see the LICENSE file for details.

---

**Ready to build amazing RAG systems?** Start with the [System Extension Guide](docs/SYSTEM_EXTENSION_GUIDE.md)!
File renamed without changes.
42 changes: 42 additions & 0 deletions agent/graph.py
@@ -0,0 +1,42 @@
from langgraph.graph import StateGraph
from agent.nodes.query_interpreter import make_query_interpreter
from agent.nodes.retriever import make_configurable_retriever
from agent.nodes.generator import make_generator
from agent.nodes.memory_updater import memory_updater
from agent.schema import AgentState
from config.config_loader import load_config
from langchain_openai import ChatOpenAI

# Load config
config = load_config("config.yml")

# Setup LLM
llm_cfg = config["llm"]
llm = ChatOpenAI(model=llm_cfg.get("model", "gpt-4.1-mini"),
                 temperature=llm_cfg.get("temperature", 0.0))

# Setup configurable retriever node
retrieval_config_path = config.get("agent_retrieval", {}).get(
    "config_path", "pipelines/configs/retrieval/modern_hybrid.yml")
retriever = make_configurable_retriever(config_path=retrieval_config_path)

# Setup other nodes
generator = make_generator(llm)
query_interpreter = make_query_interpreter(llm)

# Build the graph
builder = StateGraph(AgentState)
builder.add_node("query_interpreter", query_interpreter)
builder.add_node("retriever", retriever)
builder.add_node("generator", generator)
builder.add_node("memory_updater", memory_updater)
builder.set_entry_point("query_interpreter")

builder.add_conditional_edges("query_interpreter", lambda state: state["next_node"], {
    "retriever": "retriever",
    "generator": "generator",
})

builder.add_edge("retriever", "generator")
builder.add_edge("generator", "memory_updater")
graph = builder.compile()
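
The conditional edge in `agent/graph.py` routes on `state["next_node"]`. A dependency-free sketch of that lookup follows, assuming the query interpreter writes either `"retriever"` or `"generator"` into the state. `ROUTES` and `route` are hypothetical names for illustration, and the error on an unknown key is this sketch's own choice, not a statement about LangGraph's behavior.

```python
from typing import Callable, Dict

# The same mapping passed to add_conditional_edges in the diff above.
ROUTES: Dict[str, str] = {"retriever": "retriever", "generator": "generator"}


def route(state: dict, selector: Callable[[dict], str] = lambda s: s["next_node"]) -> str:
    """Resolve the next node name from the selector's return value."""
    key = selector(state)
    if key not in ROUTES:
        raise KeyError(f"query_interpreter returned unknown route: {key!r}")
    return ROUTES[key]


# Hypothetical interpreter outputs: one needs retrieval, one can be answered directly.
print(route({"question": "How to handle Python exceptions?", "next_node": "retriever"}))
print(route({"question": "hi", "next_node": "generator"}))
```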