Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
52 commits
Select commit Hold shift + click to select a range
d0d0668
Add SparseEmbedder class and update get_embedder function to support …
spyrchat Apr 26, 2025
7c6bf9c
Refactor QdrantVectorDB to support dense and sparse vectors; update i…
spyrchat Apr 26, 2025
f283739
Enhance EmbeddingPipeline to support optional sparse embeddings; refa…
spyrchat Apr 26, 2025
3b84dc8
Enhance QdrantVectorDB to support dense and sparse embeddings; update…
spyrchat Apr 27, 2025
6cbb9e4
Refactor QdrantVectorDB and embedding factory to enhance collection i…
spyrchat Apr 28, 2025
c02e61a
Refactor init_collection method in QdrantVectorDB to remove sparse_ve…
spyrchat Apr 28, 2025
f1de843
Refactor QdrantVectorDB to inherit from BaseVectorDB; implement metho…
spyrchat Apr 28, 2025
14cbf21
Refactor BaseVectorDB to specify return types for methods and enhance…
spyrchat Apr 28, 2025
862134b
Merge branch 'development' of https://github.com/spyrchat/Thesis into…
spyrchat Apr 28, 2025
7f75ff9
Merge pull request #1 from spyrchat/hybrid-retriever
spyrchat Apr 28, 2025
b5d9c2b
Remove BaseEmbedder inheritance from HuggingFaceEmbedder for improved…
spyrchat Apr 28, 2025
4b51eda
Remove BaseEmbedder inheritance from TitanEmbedder for improved clarity.
spyrchat Apr 28, 2025
9cca60c
Merge pull request #2 from spyrchat/hybrid-retriever
spyrchat Apr 28, 2025
df599be
Add PostgresController and connection test script for PostgreSQL inte…
spyrchat May 20, 2025
9045ddf
Implement image and table asset insertion methods in PostgresControll…
spyrchat May 21, 2025
31a2a7e
Add table extraction and SQL uploading functionality; refactor import…
spyrchat May 29, 2025
bdb9037
Add PDF processing, table extraction, and text chunking functionality…
spyrchat May 29, 2025
bcc92da
Enhance embedding pipeline with dynamic embedding strategy; add PDF p…
spyrchat May 30, 2025
d400d85
Refactor import statements to use relative paths; update sandbox dire…
spyrchat May 30, 2025
40d43c9
Enhance Qdrant document insertion with error handling and logging; up…
spyrchat Jun 5, 2025
d6c07b5
Add table extraction functionality with logging; implement PDF proces…
spyrchat Jun 5, 2025
f75f74f
Implement modular RAG pipeline with query interpretation, SQL plannin…
spyrchat Jul 8, 2025
53d213b
Add Dockerfile, docker-compose.yml, and main application logic; imple…
spyrchat Jul 8, 2025
985440e
Refactor QdrantVectorDB: remove unused import and add spacing; update…
spyrchat Jul 9, 2025
20a99bd
Updated requirements.txt
spyrchat Jul 9, 2025
8c08f87
full pipeline is functional
spyrchat Jul 9, 2025
b6feff5
Added logging
spyrchat Jul 9, 2025
4187574
Added docstrings for clarity
spyrchat Jul 9, 2025
fa53084
Added Docstrings
spyrchat Jul 9, 2025
346f0d6
added config.yml
spyrchat Jul 9, 2025
268c15b
System Works with config.yml
spyrchat Jul 9, 2025
4e2d2c7
Feat Agent Works as intended
spyrchat Jul 9, 2025
4b71446
Add smoke tests, vector store uploader, and document validator
spyrchat Aug 20, 2025
2bd4a0c
feat: Add minimal SOSum ingestion test and standalone processor
spyrchat Aug 20, 2025
a3fd333
feat: Enhance data handling and validation in ingestion pipeline
spyrchat Aug 21, 2025
a23242b
Add Quick Start Guide for MLOps Pipeline and implement core components
spyrchat Aug 21, 2025
811b2c6
feat: Implement Stack Overflow adapter analysis and testing tools
spyrchat Aug 21, 2025
663dbbd
feat: Add answer metadata tests and enhance answer retrieval output
spyrchat Aug 21, 2025
00586f0
Add experimental and hybrid retrieval configurations, enhance testing…
spyrchat Aug 21, 2025
3add6e6
Add unit tests for retrieval pipeline and related components
spyrchat Aug 21, 2025
28d11ed
feat: Update dependencies in requirements.txt and add new packages
spyrchat Aug 30, 2025
db65791
feat: Enhance embedding strategy configuration and improve smoke test…
spyrchat Aug 30, 2025
c353fe2
Refactor retrieval pipeline to modern architecture
spyrchat Aug 30, 2025
439708a
Refactor configuration loading and retriever initialization
spyrchat Aug 30, 2025
32a3daf
feat: Consolidate configuration system and enhance benchmark function…
spyrchat Aug 30, 2025
8483973
feat: Enhance benchmark evaluation by implementing NaN handling for m…
spyrchat Aug 30, 2025
9acb29c
feat: Improve document ID handling and external ID preservation in Qd…
spyrchat Aug 30, 2025
3aeceee
Refactor benchmark scripts and retrievers for improved functionality …
spyrchat Aug 30, 2025
d18cdc4
Add dataset configurations for Natural Questions and SOSum Stack Over…
spyrchat Aug 30, 2025
22500db
Remove obsolete test files and add a new local end-to-end test setup …
spyrchat Aug 30, 2025
10b6620
chore: Update Python version to 3.13 in pipeline tests
spyrchat Aug 30, 2025
056f007
chore: Update testing dependencies and Python version in CI workflows
spyrchat Aug 30, 2025
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
200 changes: 200 additions & 0 deletions .github/workflows/pipeline-tests.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,200 @@
name: Pipeline Tests

on:
push:
branches: [ main, development ]
pull_request:
branches: [ main, development ]

jobs:
test-minimal:
name: Minimal Pipeline Tests
runs-on: ubuntu-latest

steps:
- name: Checkout code
uses: actions/checkout@v4

- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: '3.13'

- name: Install dependencies
run: |
pip install -r requirements.txt
pip install -r tests/requirements-minimal.txt

- name: Run minimal pipeline tests (no external services)
run: |
python -m pytest tests/pipeline/test_minimal_pipeline.py tests/pipeline/test_components.py -v --tb=short

- name: Run configuration validation
run: |
python -c "
import yaml
import sys

# Test that all YAML configs are valid
configs = ['config.yml', 'pipelines/configs/retrieval/ci_google_gemini.yml']
for config in configs:
try:
with open(config) as f:
yaml.safe_load(f)
print(f'✅ {config} is valid')
except Exception as e:
print(f'❌ {config} failed: {e}')
sys.exit(1)
"

test-integration:
name: Integration Tests with Qdrant
runs-on: ubuntu-latest

services:
qdrant:
image: qdrant/qdrant:latest
ports:
- 6333:6333
options: >-
--health-cmd "curl -f http://localhost:6333/collections || exit 1"
--health-interval 10s
--health-timeout 5s
--health-retries 10

steps:
- name: Checkout code
uses: actions/checkout@v4

- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: '3.13'

- name: Install dependencies
run: |
pip install -r requirements.txt
pip install -r tests/requirements-test.txt

- name: Wait for Qdrant to be ready
run: |
timeout 60 bash -c 'until curl -f http://localhost:6333/collections; do sleep 2; done'
echo "Qdrant is ready!"

- name: Test Qdrant connectivity
run: |
python -m pytest tests/pipeline/test_qdrant_connectivity.py -v --tb=short

- name: Run basic integration tests
run: |
python -m pytest tests/pipeline/ -v --tb=short -m "not requires_api"

test-end-to-end:
name: End-to-End Pipeline Tests
runs-on: ubuntu-latest
if: github.event_name == 'push' || github.event.pull_request.head.repo.full_name == github.repository

services:
qdrant:
image: qdrant/qdrant:latest
ports:
- 6333:6333
options: >-
--health-cmd "curl -f http://localhost:6333/collections || exit 1"
--health-interval 10s
--health-timeout 5s
--health-retries 10

steps:
- name: Checkout code
uses: actions/checkout@v4

- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: '3.13'

- name: Install dependencies
run: |
pip install -r requirements.txt
pip install -r tests/requirements-test.txt

- name: Wait for Qdrant to be ready
run: |
timeout 60 bash -c 'until curl -f http://localhost:6333/collections; do sleep 2; done'
echo "Qdrant is ready!"

- name: Test Qdrant health
run: |
curl -f http://localhost:6333/collections
echo "Qdrant collections endpoint working"

- name: Run end-to-end pipeline tests
env:
GOOGLE_API_KEY: ${{ secrets.GOOGLE_API_KEY }}
QDRANT_HOST: localhost
QDRANT_PORT: 6333
run: |
if [ -z "$GOOGLE_API_KEY" ]; then
echo "⚠️ GOOGLE_API_KEY not set - skipping end-to-end tests"
exit 0
fi

echo "🔑 API key available - running full end-to-end tests"
python -m pytest tests/pipeline/test_end_to_end.py -v --tb=short -m "requires_api"

- name: Run comprehensive test suite
env:
GOOGLE_API_KEY: ${{ secrets.GOOGLE_API_KEY }}
QDRANT_HOST: localhost
QDRANT_PORT: 6333
run: |
if [ -n "$GOOGLE_API_KEY" ]; then
echo "🚀 Running comprehensive test suite with API"
python tests/pipeline/run_tests.py
else
echo "⚠️ Running minimal test suite without API"
python -m pytest tests/pipeline/test_minimal_pipeline.py tests/pipeline/test_components.py -v
fi

test-security:
name: Security and Config Validation
runs-on: ubuntu-latest

steps:
- name: Checkout code
uses: actions/checkout@v4

- name: Check for hardcoded secrets
run: |
# Check that no API keys are hardcoded
if grep -r "sk-" . --exclude-dir=.git --exclude="*.md" --exclude="*.yml"; then
echo "❌ Found potential hardcoded API keys"
exit 1
fi

if grep -r "google_api_key.*=" . --exclude-dir=.git --exclude="*.md" --exclude="*.yml" | grep -v "getenv\|environ"; then
echo "❌ Found potential hardcoded Google API keys"
exit 1
fi

echo "✅ No hardcoded secrets found"

- name: Validate configuration structure
run: |
python -c "
import yaml

# Validate CI config structure
with open('pipelines/configs/retrieval/ci_google_gemini.yml') as f:
config = yaml.safe_load(f)

# Check required fields
assert 'retrieval_pipeline' in config
assert 'retriever' in config['retrieval_pipeline']
assert 'embedding' in config['retrieval_pipeline']['retriever']
assert 'google' == config['retrieval_pipeline']['retriever']['embedding']['dense']['provider']
assert 'GOOGLE_API_KEY' == config['retrieval_pipeline']['retriever']['embedding']['dense']['api_key_env']

print('✅ Configuration structure is valid')
"
10 changes: 9 additions & 1 deletion .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -19,4 +19,12 @@ climate-fever
*.log
__pycache__
sandbox/*
/__pycache__
/__pycache__
synthetic_dataset\text_dataset_template.json
extraction_output/
.idea/misc.xml
.idea/modules.xml
.idea/Thesis.iml
.idea/vcs.xml
.idea/inspectionProfiles/profiles_settings.xml
*.json
166 changes: 166 additions & 0 deletions BENCHMARK_OPTIMIZATION_GUIDE.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,166 @@
# RAG Benchmark Optimization System - Usage Guide

## 🎯 Summary

We have successfully created a flexible benchmark optimization system that fixes the critical `external_id` retrieval issue and enables easy parameter optimization experiments.

### ✅ Key Achievements

1. **Fixed External ID Retrieval**: Modified dense retriever to use direct Qdrant API, preserving `external_id` in document metadata
2. **Excellent Benchmark Results**:
- Precision@5: 75.5%
- Recall@5: 69.3%
- MRR: 92.0%
3. **Flexible Configuration System**: Created modular benchmark scenarios for easy optimization
4. **Ground Truth Integration**: Proper evaluation using real StackOverflow question-answer pairs

## 🚀 Quick Start - Running Benchmarks

### Option 1: Interactive CLI (Easiest)
```bash
cd /home/spiros/Desktop/Thesis/Thesis
python run_benchmark_optimization.py
```

Then choose:
- `1` - Quick test (10 queries)
- `2` - Single scenario
- `3` - Run all scenarios
- `4` - Compare previous results

### Option 2: Command Line
```bash
# Run single scenario
python benchmark_optimizer.py --scenario benchmark_scenarios/quick_test.yml

# Run all scenarios
python benchmark_optimizer.py --scenarios-dir benchmark_scenarios

# Compare existing results only
python benchmark_optimizer.py --compare-only
```

## 📊 Available Optimization Scenarios

Located in `benchmark_scenarios/`:

1. **quick_test.yml** - Fast 10-query test for rapid iteration
2. **dense_baseline.yml** - Dense retrieval with top_k=10, threshold=0.1
3. **dense_high_recall.yml** - Dense with top_k=20, threshold=0.05 (more results)
4. **dense_high_precision.yml** - Dense with threshold=0.3 (stricter filtering)
5. **sparse_bm25.yml** - Sparse BM25 retrieval
6. **hybrid_retrieval.yml** - Combined dense + sparse retrieval

## 🔧 Creating Custom Scenarios

Create new `.yml` files in `benchmark_scenarios/` with this structure:

```yaml
# Description of the experiment
description: "Your experiment description"

# Dataset configuration
dataset:
path: "/home/spiros/Desktop/Thesis/datasets/sosum/data"
use_ground_truth: true

# Retrieval configuration
retrieval:
type: "dense" # dense, sparse, or hybrid
top_k: 10
score_threshold: 0.1

# Embedding configuration (override main config)
embedding:
dense:
provider: google
model: models/embedding-001
dimensions: 768
api_key_env: GOOGLE_API_KEY
batch_size: 32
vector_name: dense
strategy: dense

# Evaluation configuration
evaluation:
k_values: [1, 5, 10]
metrics:
retrieval: ["precision@k", "recall@k", "mrr", "ndcg@k"]

# Experiment parameters
max_queries: 50
experiment_name: "your_experiment_name"
```

## 📈 Optimization Parameters You Can Tune

### Retrieval Parameters
- `top_k`: Number of documents to retrieve (5, 10, 15, 20)
- `score_threshold`: Minimum similarity score (0.0, 0.1, 0.2, 0.3)
- `type`: Retrieval strategy (dense, sparse, hybrid)

### Embedding Parameters
- `model`: Different embedding models
- `batch_size`: Processing batch size (16, 32, 64)
- `dimensions`: Embedding dimensions (384, 768, 1024)

### Evaluation Parameters
- `max_queries`: Dataset size (10, 25, 50, 100, 500)
- `k_values`: Evaluation depths ([1,5,10], [1,5,10,20])

## 🏆 Results Analysis

The system automatically:
- Tracks all experiment results
- Compares scenarios across metrics
- Identifies best performers for each metric
- Saves results to `benchmark_optimization_results.yml`

### Key Metrics
- **Precision@K**: How many retrieved docs are relevant
- **Recall@K**: How many relevant docs were retrieved
- **MRR**: Mean Reciprocal Rank (position of first relevant result)
- **NDCG@K**: Normalized Discounted Cumulative Gain

## 🔍 Example Optimization Workflow

1. **Start with quick test**:
```bash
python benchmark_optimizer.py --scenario benchmark_scenarios/quick_test.yml
```

2. **Run baseline experiments**:
```bash
python benchmark_optimizer.py --scenarios-dir benchmark_scenarios
```

3. **Create custom scenarios** based on baseline results

4. **Compare all results**:
```bash
python benchmark_optimizer.py --compare-only
```

## 📊 Current Best Configuration

Based on our tests, the current best performing setup:
- **Retrieval**: Dense with Google Gemini embeddings
- **Top K**: 10 documents
- **Score Threshold**: 0.1 (from config, but score filter at 0.3)
- **Reranking**: Cross-encoder reranking with ms-marco-MiniLM-L-6-v2
- **Results**: 75.5% Precision@5, 69.3% Recall@5, 92% MRR

## 🚨 Important Notes

1. **Ground Truth**: System uses real StackOverflow question-answer pairs for evaluation
2. **External ID Fix**: Our custom dense retriever preserves document IDs correctly
3. **Scalability**: Adjust `max_queries` based on time constraints
4. **Consistency**: All scenarios use the same evaluation methodology for fair comparison

## 🎯 Next Steps for Optimization

1. **Hyperparameter Tuning**: Create scenarios with different top_k and threshold values
2. **Embedding Models**: Test different embedding providers/models
3. **Hybrid Strategies**: Optimize dense+sparse combination weights
4. **Reranking**: Experiment with different reranker models
5. **Dataset Size**: Scale up to full 506 questions for final evaluation
Loading
Loading