Closed
67 commits
d0d0668
Add SparseEmbedder class and update get_embedder function to support …
spyrchat Apr 26, 2025
7c6bf9c
Refactor QdrantVectorDB to support dense and sparse vectors; update i…
spyrchat Apr 26, 2025
f283739
Enhance EmbeddingPipeline to support optional sparse embeddings; refa…
spyrchat Apr 26, 2025
3b84dc8
Enhance QdrantVectorDB to support dense and sparse embeddings; update…
spyrchat Apr 27, 2025
6cbb9e4
Refactor QdrantVectorDB and embedding factory to enhance collection i…
spyrchat Apr 28, 2025
c02e61a
Refactor init_collection method in QdrantVectorDB to remove sparse_ve…
spyrchat Apr 28, 2025
f1de843
Refactor QdrantVectorDB to inherit from BaseVectorDB; implement metho…
spyrchat Apr 28, 2025
14cbf21
Refactor BaseVectorDB to specify return types for methods and enhance…
spyrchat Apr 28, 2025
862134b
Merge branch 'development' of https://github.com/spyrchat/Thesis into…
spyrchat Apr 28, 2025
7f75ff9
Merge pull request #1 from spyrchat/hybrid-retriever
spyrchat Apr 28, 2025
b5d9c2b
Remove BaseEmbedder inheritance from HuggingFaceEmbedder for improved…
spyrchat Apr 28, 2025
4b51eda
Remove BaseEmbedder inheritance from TitanEmbedder for improved clarity.
spyrchat Apr 28, 2025
9cca60c
Merge pull request #2 from spyrchat/hybrid-retriever
spyrchat Apr 28, 2025
df599be
Add PostgresController and connection test script for PostgreSQL inte…
spyrchat May 20, 2025
9045ddf
Implement image and table asset insertion methods in PostgresControll…
spyrchat May 21, 2025
31a2a7e
Add table extraction and SQL uploading functionality; refactor import…
spyrchat May 29, 2025
bdb9037
Add PDF processing, table extraction, and text chunking functionality…
spyrchat May 29, 2025
bcc92da
Enhance embedding pipeline with dynamic embedding strategy; add PDF p…
spyrchat May 30, 2025
d400d85
Refactor import statements to use relative paths; update sandbox dire…
spyrchat May 30, 2025
40d43c9
Enhance Qdrant document insertion with error handling and logging; up…
spyrchat Jun 5, 2025
d6c07b5
Add table extraction functionality with logging; implement PDF proces…
spyrchat Jun 5, 2025
f75f74f
Implement modular RAG pipeline with query interpretation, SQL plannin…
spyrchat Jul 8, 2025
53d213b
Add Dockerfile, docker-compose.yml, and main application logic; imple…
spyrchat Jul 8, 2025
985440e
Refactor QdrantVectorDB: remove unused import and add spacing; update…
spyrchat Jul 9, 2025
20a99bd
Updated requirements.txt
spyrchat Jul 9, 2025
8c08f87
full pipeline is functional
spyrchat Jul 9, 2025
b6feff5
Added logging
spyrchat Jul 9, 2025
4187574
Added docstrings for clarity
spyrchat Jul 9, 2025
fa53084
Added Docstrings
spyrchat Jul 9, 2025
346f0d6
added config.yml
spyrchat Jul 9, 2025
268c15b
System Works with config.yml
spyrchat Jul 9, 2025
4e2d2c7
Feat Agent Works as intended
spyrchat Jul 9, 2025
4b71446
Add smoke tests, vector store uploader, and document validator
spyrchat Aug 20, 2025
2bd4a0c
feat: Add minimal SOSum ingestion test and standalone processor
spyrchat Aug 20, 2025
a3fd333
feat: Enhance data handling and validation in ingestion pipeline
spyrchat Aug 21, 2025
a23242b
Add Quick Start Guide for MLOps Pipeline and implement core components
spyrchat Aug 21, 2025
811b2c6
feat: Implement Stack Overflow adapter analysis and testing tools
spyrchat Aug 21, 2025
663dbbd
feat: Add answer metadata tests and enhance answer retrieval output
spyrchat Aug 21, 2025
00586f0
Add experimental and hybrid retrieval configurations, enhance testing…
spyrchat Aug 21, 2025
3add6e6
Add unit tests for retrieval pipeline and related components
spyrchat Aug 21, 2025
28d11ed
feat: Update dependencies in requirements.txt and add new packages
spyrchat Aug 30, 2025
db65791
feat: Enhance embedding strategy configuration and improve smoke test…
spyrchat Aug 30, 2025
c353fe2
Refactor retrieval pipeline to modern architecture
spyrchat Aug 30, 2025
439708a
Refactor configuration loading and retriever initialization
spyrchat Aug 30, 2025
32a3daf
feat: Consolidate configuration system and enhance benchmark function…
spyrchat Aug 30, 2025
8483973
feat: Enhance benchmark evaluation by implementing NaN handling for m…
spyrchat Aug 30, 2025
9acb29c
feat: Improve document ID handling and external ID preservation in Qd…
spyrchat Aug 30, 2025
3aeceee
Refactor benchmark scripts and retrievers for improved functionality …
spyrchat Aug 30, 2025
d18cdc4
Add dataset configurations for Natural Questions and SOSum Stack Over…
spyrchat Aug 30, 2025
22500db
Remove obsolete test files and add a new local end-to-end test setup …
spyrchat Aug 30, 2025
10b6620
chore: Update Python version to 3.13 in pipeline tests
spyrchat Aug 30, 2025
056f007
chore: Update testing dependencies and Python version in CI workflows
spyrchat Aug 30, 2025
7323f4b
refactor: Simplify dependency management by removing requirements-tes…
spyrchat Aug 31, 2025
a81237f
Remove outdated documentation and SQL components; reorganize configur…
spyrchat Aug 31, 2025
b39c51d
chore: Update requirements-minimal.txt to include missing dependencie…
spyrchat Aug 31, 2025
e1beb8b
chore: Add missing dependencies for boto3, botocore, and langchain-qd…
spyrchat Aug 31, 2025
149ab30
refactor: Enhance Qdrant connectivity tests and remove outdated requi…
spyrchat Aug 31, 2025
63d29af
chore: Remove outdated GitHub Actions CI configuration and local test…
spyrchat Aug 31, 2025
18903d8
chore: Remove outdated example scripts and sample data files for retr…
spyrchat Aug 31, 2025
3393ec5
Fix Google dependencies conflict in requirements.txt
spyrchat Aug 31, 2025
9415e07
fix: Remove unnecessary blank line in insert_documents method
spyrchat Sep 7, 2025
2748761
fix: Improve .env loading and add default values for Qdrant configura…
spyrchat Sep 7, 2025
8d6d16f
fix: Update Qdrant service configuration for improved health checks a…
spyrchat Sep 7, 2025
8163734
fix: Improve Qdrant health check commands and update logging messages…
spyrchat Sep 7, 2025
44baa40
fix: Update Qdrant health check commands to use the correct endpoint …
spyrchat Sep 7, 2025
c128036
fix: Update Qdrant health check commands for improved readiness verif…
spyrchat Sep 7, 2025
0068770
fix: Enhance Qdrant readiness check with retry logic and timeout hand…
spyrchat Sep 7, 2025
211 changes: 211 additions & 0 deletions .github/workflows/pipeline-tests.yml
@@ -0,0 +1,211 @@
name: Pipeline Tests

on:
  push:
    branches: [ main, development ]
  pull_request:
    branches: [ main, development ]

jobs:
  test-minimal:
    name: Minimal Pipeline Tests
    runs-on: ubuntu-latest

    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.13'

      - name: Install dependencies
        run: |
          pip install -r tests/requirements-minimal.txt

      - name: Run minimal pipeline tests (no external services)
        run: |
          python -m pytest tests/pipeline/test_minimal_pipeline.py tests/pipeline/test_components.py -v --tb=short

      - name: Run configuration validation
        run: |
          python -c "
          import yaml
          import sys

          # Test that all YAML configs are valid
          configs = ['config.yml', 'pipelines/configs/retrieval/ci_google_gemini.yml']
          for config in configs:
              try:
                  with open(config) as f:
                      yaml.safe_load(f)
                  print(f'{config} is valid')
              except Exception as e:
                  print(f'{config} failed: {e}')
                  sys.exit(1)
          "

  test-integration:
    name: Integration Tests with Qdrant
    runs-on: ubuntu-latest

    services:
      qdrant:
        image: qdrant/qdrant:latest
        ports:
          - 6333:6333
          - 6334:6334
        env:
          QDRANT__SERVICE__HTTP_PORT: 6333
          QDRANT__SERVICE__GRPC_PORT: 6334

    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.13'

      - name: Install dependencies
        run: |
          pip install -r tests/requirements-minimal.txt

      - name: Wait for Qdrant to be ready
        run: |
          for i in {1..30}; do
            if curl -fs http://localhost:6333/healthz | grep -q "ok"; then
              echo "Qdrant is ready!"
              exit 0
            fi
            echo "Waiting for Qdrant ($i/30)..."
            sleep 2
          done
          echo "Qdrant did not become ready in time"
          exit 1

      - name: Test Qdrant connectivity
        run: |
          python -m pytest tests/pipeline/test_qdrant_connectivity.py -v --tb=short

      - name: Run basic integration tests
        run: |
          python -m pytest tests/pipeline/ -v --tb=short -m "not requires_api"

  test-end-to-end:
    name: End-to-End Pipeline Tests
    runs-on: ubuntu-latest
    if: github.event_name == 'push' || github.event.pull_request.head.repo.full_name == github.repository

    services:
      qdrant:
        image: qdrant/qdrant:latest
        ports:
          - 6333:6333
          - 6334:6334
        env:
          QDRANT__SERVICE__HTTP_PORT: 6333
          QDRANT__SERVICE__GRPC_PORT: 6334

    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.13'

      - name: Install dependencies
        run: |
          pip install -r tests/requirements-minimal.txt

      - name: Wait for Qdrant to be ready
        run: |
          for i in {1..30}; do
            if curl -fs http://localhost:6333/healthz | grep -q "ok"; then
              echo "Qdrant is ready!"
              exit 0
            fi
            echo "Waiting for Qdrant ($i/30)..."
            sleep 2
          done
          echo "Qdrant did not become ready in time"
          exit 1

      - name: Test Qdrant health
        run: |
          curl -fs http://localhost:6333/healthz
          curl -fs http://localhost:6333/collections
          echo "Qdrant is working properly"

      - name: Run end-to-end pipeline tests
        env:
          GOOGLE_API_KEY: ${{ secrets.GOOGLE_API_KEY }}
          QDRANT_HOST: localhost
          QDRANT_PORT: 6333
        run: |
          if [ -z "$GOOGLE_API_KEY" ]; then
            echo "GOOGLE_API_KEY not set - skipping end-to-end tests"
            exit 0
          fi

          echo "API key available - running full end-to-end tests"
          python -m pytest tests/pipeline/test_end_to_end.py -v --tb=short -m "requires_api"

      - name: Run comprehensive test suite
        env:
          GOOGLE_API_KEY: ${{ secrets.GOOGLE_API_KEY }}
          QDRANT_HOST: localhost
          QDRANT_PORT: 6333
        run: |
          if [ -n "$GOOGLE_API_KEY" ]; then
            echo "Running comprehensive test suite with API"
            python tests/pipeline/run_tests.py
          else
            echo "Running minimal test suite without API"
            python -m pytest tests/pipeline/test_minimal_pipeline.py tests/pipeline/test_components.py -v
          fi

  test-security:
    name: Security and Config Validation
    runs-on: ubuntu-latest

    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Check for hardcoded secrets
        run: |
          # Check that no API keys are hardcoded
          if grep -r "sk-" . --exclude-dir=.git --exclude="*.md" --exclude="*.yml"; then
            echo "Found potential hardcoded API keys"
            exit 1
          fi

          if grep -r "google_api_key.*=" . --exclude-dir=.git --exclude="*.md" --exclude="*.yml" | grep -v "getenv\|environ"; then
            echo "Found potential hardcoded Google API keys"
            exit 1
          fi

          echo "No hardcoded secrets found"

      - name: Validate configuration structure
        run: |
          python -c "
          import yaml

          # Validate CI config structure
          with open('pipelines/configs/retrieval/ci_google_gemini.yml') as f:
              config = yaml.safe_load(f)

          # Check required fields
          assert 'retrieval_pipeline' in config
          assert 'retriever' in config['retrieval_pipeline']
          assert 'embedding' in config['retrieval_pipeline']['retriever']
          assert 'google' == config['retrieval_pipeline']['retriever']['embedding']['dense']['provider']
          assert 'GOOGLE_API_KEY' == config['retrieval_pipeline']['retriever']['embedding']['dense']['api_key_env']

          print('Configuration structure is valid')
          "
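Review note on the "Validate configuration structure" step above: the bare `assert` chain stops at the first failure and reports no message, so a CI log only shows an `AssertionError`. A minimal sketch of one alternative follows, assuming the same required keys as the assertions in the workflow. The names `REQUIRED_VALUES`, `lookup`, and `validate_ci_config` are hypothetical and not part of this PR; the function takes an already-parsed mapping so it can be exercised without the YAML file.

```python
from collections.abc import Mapping

# Hypothetical table of required settings; the dotted paths mirror the
# assertions in the workflow's inline validation step.
REQUIRED_VALUES = {
    "retrieval_pipeline.retriever.embedding.dense.provider": "google",
    "retrieval_pipeline.retriever.embedding.dense.api_key_env": "GOOGLE_API_KEY",
}


def lookup(config, dotted_path):
    """Walk a nested mapping along a dotted path, raising KeyError if a key is absent."""
    node = config
    for key in dotted_path.split("."):
        if not isinstance(node, Mapping) or key not in node:
            raise KeyError(f"missing key '{key}' while resolving '{dotted_path}'")
        node = node[key]
    return node


def validate_ci_config(config):
    """Return a list of human-readable problems; an empty list means the config passed."""
    errors = []
    for path, expected in REQUIRED_VALUES.items():
        try:
            actual = lookup(config, path)
        except KeyError as exc:
            errors.append(exc.args[0])
            continue
        if actual != expected:
            errors.append(f"{path}: expected {expected!r}, got {actual!r}")
    return errors
```

In a workflow step this could be fed the result of `yaml.safe_load(...)` and exit non-zero when the returned list is non-empty, so that every problem is printed in a single run.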
12 changes: 11 additions & 1 deletion .gitignore
@@ -19,4 +19,14 @@ climate-fever
*.log
__pycache__
sandbox/*
/__pycache__
/__pycache__
synthetic_dataset/text_dataset_template.json
extraction_output/
.idea/misc.xml
.idea/modules.xml
.idea/Thesis.iml
.idea/vcs.xml
.idea/inspectionProfiles/profiles_settings.xml
*.json
*.csv

18 changes: 18 additions & 0 deletions Dockerfile
@@ -0,0 +1,18 @@
# Use a slim Python base image
FROM python:3.11-slim

# Set working directory
WORKDIR /app

# Install system packages if needed
RUN apt-get update && apt-get install -y \
    build-essential \
    libpq-dev \
    && rm -rf /var/lib/apt/lists/*

# Install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the full source code
COPY . .