
Conversation

@spyrchat (Owner)

This pull request introduces a production-ready, modular Retrieval-Augmented Generation (RAG) system with LangGraph agent integration. It adds a robust, configurable agent pipeline, modernizes the documentation, and provides full CI/CD test coverage, including security checks. The changes focus on making the system extensible, maintainable, and ready for deployment.

Agent Pipeline Implementation:

  • Added a new LangGraph-based agent pipeline in agent/graph.py, featuring modular nodes for query interpretation, retrieval, generation, and memory updating, all configurable via YAML.
  • Implemented core agent nodes (a wiring sketch follows this list):
    • query_interpreter for intent and routing
    • retriever (referenced in graph.py)
    • generator for answer synthesis
    • memory_updater for chat history management
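
To make the wiring concrete, here is a minimal sketch of a four-node LangGraph pipeline of this shape. The state fields and node bodies are illustrative assumptions, not the exact contents of agent/graph.py:

```python
from typing import TypedDict
from langgraph.graph import StateGraph, END

class AgentState(TypedDict):
    query: str
    intent: str
    documents: list
    answer: str
    history: list

def query_interpreter(state: AgentState) -> dict:
    # Placeholder intent classification used for routing
    return {"intent": "retrieval" if "?" in state["query"] else "chat"}

def retriever(state: AgentState) -> dict:
    return {"documents": []}  # the real node queries the vector store

def generator(state: AgentState) -> dict:
    return {"answer": f"Answer to: {state['query']}"}

def memory_updater(state: AgentState) -> dict:
    return {"history": state["history"] + [(state["query"], state["answer"])]}

graph = StateGraph(AgentState)
graph.add_node("query_interpreter", query_interpreter)
graph.add_node("retriever", retriever)
graph.add_node("generator", generator)
graph.add_node("memory_updater", memory_updater)
graph.set_entry_point("query_interpreter")
graph.add_edge("query_interpreter", "retriever")
graph.add_edge("retriever", "generator")
graph.add_edge("generator", "memory_updater")
graph.add_edge("memory_updater", END)
app = graph.compile()
```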

Documentation and Developer Experience:

  • Overhauled README.md with architecture diagrams, feature highlights, quick start instructions, extensibility guides, and migration steps from legacy code.

Testing and CI/CD:

  • Added a comprehensive GitHub Actions workflow (.github/workflows/pipeline-tests.yml) for minimal, integration, end-to-end, and security/config validation tests, ensuring reliability and code quality.

Deployment and Packaging:

  • Introduced a new Dockerfile using a slim Python image, installing dependencies and copying the full source for streamlined containerization and deployment.


spyrchat added 30 commits April 26, 2025 17:00
…it; create hybrid-retriever.py file; update requirements.txt for new dependencies
…nit_collection method and enhance as_langchain_vectorstore for hybrid retrieval; improve BaseRetriever documentation and remove hybrid-retriever.py file.
…ctor run method to return dense and sparse vectors; update test script to integrate new pipeline functionality and add metadata handling for documents.
… logging for collection creation and document insertion; refactor insert_documents method for improved clarity. Update EmbeddingPipeline to use new splitter method. Modify SparseEmbedder to default to CUDA. Add hybrid_retriever.py file.
…nitialization and embedding retrieval; improve logging and document preparation in test_embedding_pipeline.
…ds to retrieve client and collection name. Update EmbeddingPipeline and SparseEmbedder to use Embeddings instead of BaseEmbedder. Add QdrantHybridRetriever for hybrid retrieval functionality and update test scripts accordingly.
Hybrid retriever is functional and ready to deploy to the development branch
…er; add text processing pipeline for PDF documents
…; implement metadata handling and enrich documents for upload
…g, and memory updating; add retriever routing logic and logging
… requirements.txt with dependency version upgrades
spyrchat added 24 commits July 10, 2025 02:24
- Implemented a comprehensive smoke testing framework in `smoke_tests.py` to verify system quality post-ingestion.
- Created `uploader.py` for handling idempotent vector uploads with versioning to Qdrant, supporting dense and sparse vectors (a minimal upsert sketch follows this commit block).
- Developed `validator.py` for document validation and cleaning, ensuring content quality and metadata integrity.
- Added a setup script `setup_sosum.sh` for quick setup of the SOSum dataset, including verification of required files and ingestion instructions.
- Implemented a minimal ingestion test for the SOSum adapter in `examples/test_sosum_minimal.py` to validate basic functionality without heavy ML dependencies.
- Created a standalone SOSum processor script in `scripts/standalone_sosum_processor.py` for processing datasets without full pipeline dependencies.
- Updated `ingest.py` to improve output formatting for collection status.
- Added new configuration files for different embedding strategies in `pipelines/configs/`.
- Updated requirements in `requirements.txt` to include new dependencies and versions.
- Added sample output JSON file for processed SOSum data.
- Added 'split' parameter to StackOverflowAdapter for better data segmentation.
- Introduced 'dense_embedding' and 'sparse_embedding' fields in ChunkMeta for improved embedding metadata.
- Updated EmbeddingPipeline to directly assign embeddings to respective fields.
- Made allowed characters in DocumentValidator more permissive for HTML/code content.
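
A minimal sketch of the idempotent hybrid upload described in the `uploader.py` commit above. The collection name, payload fields, and vector values are assumptions; the key idea is that deterministic point IDs make re-ingestion overwrite rather than duplicate:

```python
import uuid
from qdrant_client import QdrantClient, models

def point_id(chunk_key: str) -> str:
    # Deterministic UUID: re-uploading the same chunk hits the same point
    return str(uuid.uuid5(uuid.NAMESPACE_URL, chunk_key))

client = QdrantClient(url="http://localhost:6333")
client.upsert(
    collection_name="sosum",  # hypothetical collection name
    points=[
        models.PointStruct(
            id=point_id("so-12345-chunk-0"),
            vector={
                "dense": [0.12, -0.03, 0.54, 0.91],  # must match collection dims
                "sparse": models.SparseVector(indices=[7, 91], values=[0.8, 0.3]),
            },
            payload={"source": "sosum", "version": 1},
        )
    ],
)
```
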
- Introduced a comprehensive Quick Start Guide for implementing an MLOps pipeline for RAG systems, covering project initialization, dataset adapters, configuration, processing components, and testing.
- Implemented a CSV dataset adapter for reading and converting CSV data into documents (a rough sketch follows this block).
- Created a configuration schema for managing dataset, chunking, embedding, and vector store settings.
- Developed core processing components including a document chunker and an embedding pipeline.
- Added a simple CLI interface for ingestion with logging and configuration handling.
- Implemented a sparse embedding mechanism and integrated it into the embedding pipeline.
- Added inspection script for analyzing vector structures in Qdrant.
- Created smoke tests for validating ingestion processes and vector store uploads.
- Added test script for verifying sparse embedding serialization.
- Updated existing configurations for stackoverflow datasets to support hybrid and dense embedding strategies.
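
A rough illustration of the CSV adapter idea from this block; the `Document` shape and the adapter interface are assumptions, not the repository's exact classes:

```python
import csv
from dataclasses import dataclass, field

@dataclass
class Document:
    id: str
    content: str
    metadata: dict = field(default_factory=dict)

class CSVAdapter:
    """Reads rows from a CSV file and converts them into documents."""

    def __init__(self, path: str, text_column: str = "text"):
        self.path = path
        self.text_column = text_column

    def read(self):
        with open(self.path, newline="", encoding="utf-8") as f:
            for i, row in enumerate(csv.DictReader(f)):
                yield Document(
                    id=f"{self.path}:{i}",
                    content=row[self.text_column],
                    metadata={k: v for k, v in row.items() if k != self.text_column},
                )
```
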
… framework

- Created `experimental.yml` for testing new components in the retrieval pipeline.
- Added `hybrid_multistage.yml` for hybrid retrieval with multi-stage reranking.
- Implemented tests for the new answer-focused adapter in `test_new_adapter.py`.
- Developed advanced reranking tests in `test_advanced_rerankers.py`.
- Introduced answer retrieval tests in `test_answer_retrieval.py`.
- Demonstrated retrieval pipeline extensibility in `test_extensibility.py`.
- Showcased modular pipeline features in `test_modular_pipeline.py`.
- Added a comprehensive test runner in `run_all_tests.py`.
- Updated agent retrieval tests to support configurable pipelines in `test_agent_retrieval.py`.
- Implemented unit tests for the RetrievalPipeline, RetrievalResult, and associated components (Retriever, Reranker, Filter) in `test_retrieval_pipeline.py`.
- Created mock classes for testing purposes to simulate retrieval, reranking, and filtering behaviors.
- Added tests for basic functionality, component addition/removal, and pipeline execution with various configurations.
- Introduced tests for the RetrievalPipelineFactory to validate pipeline creation with dense and hybrid configurations.
- Added minimal and example tests for the SOSum adapter to ensure basic functionality without heavy dependencies.
- Implemented smoke tests for the ingestion process and overall system quality checks.
- Updated the test runner to include new tests and organized the test structure for better clarity.
- Updated langchain-core to version 0.3.75
- Added new dependencies: cachetools, distro, filetype, google-ai-generativelanguage, google-api-core, google-auth, googleapis-common-protos, grpcio-status, jiter, langchain-google-genai, langchain-openai, langgraph, langgraph-checkpoint, langgraph-prebuilt, langgraph-sdk, openai, ormsgpack, proto-plus, psycopg2-binary, pyasn1, pyasn1_modules, rsa, tiktoken, xxhash
- Updated existing dependencies to their latest versions

test: Enhance tests for rerankers and retrieval pipeline

- Refactored test cases in test_rerankers.py for better readability and maintainability
- Added new tests for the RetrievalPipeline and its components in test_retrieval_pipeline.py
- Improved mock implementations for better isolation in tests

feat: Add debug scripts for StackOverflow adapter

- Introduced debug_row_order.py to check row order and types from StackOverflow data
- Added debug_stackoverflow_adapter.py to investigate issues with document reading in the StackOverflow adapter

test: Implement tests for ingestion pipeline and adapter functionality

- Created test_full_ingestion.py to validate the full ingestion pipeline with StackOverflow data
- Added test_adapter_fix.py to verify the StackOverflow adapter produces documents correctly

chore: Update test runner to include new tests

- Modified run_all_tests.py to include new test files for retrieval and ingestion
- Removed legacy retriever wrapping and introduced modern retriever classes.
- Updated `RetrievalPipelineFactory` to create dense, hybrid, sparse, and semantic pipelines using new retriever implementations.
- Created `ModernBaseRetriever` as a base class for all retrievers, providing common functionality and configuration handling (sketched after this block).
- Implemented `QdrantDenseRetriever`, `QdrantHybridRetriever`, `QdrantSparseRetriever`, and `SemanticRetriever` with improved initialization and search methods.
- Removed deprecated `router.py` and integrated routing logic into the new retriever classes.
- Enhanced logging and error handling across retrievers for better debugging and monitoring.
- Updated imports and module structure to reflect the new architecture.
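
A hedged sketch of what the `ModernBaseRetriever` hierarchy might look like; method names beyond `retrieve` and the config keys are assumptions:

```python
import logging
from abc import ABC, abstractmethod

class ModernBaseRetriever(ABC):
    """Shared configuration handling and logging for the retriever family."""

    def __init__(self, config: dict):
        self.config = config
        self.top_k = config.get("top_k", 10)
        self.logger = logging.getLogger(self.__class__.__name__)

    @abstractmethod
    def retrieve(self, query: str) -> list:
        """Return scored documents for the query."""

class QdrantDenseRetriever(ModernBaseRetriever):
    def retrieve(self, query: str) -> list:
        self.logger.debug("dense search: %r (top_k=%d)", query, self.top_k)
        # Real implementation: embed the query, search Qdrant, map to results
        return []
```
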
- Enhanced the `load_config` function to include detailed error handling and logging.
- Introduced `get_retriever_config`, `get_benchmark_config`, and `get_pipeline_config` functions for better configuration management.
- Added `load_config_with_overrides` to support configuration overrides (sketched after this block).
- Updated `QdrantVectorDB` initialization to accept configuration parameters.
- Created example scripts for unified configuration usage and retriever configuration examples.
- Removed outdated retrieval pipeline configurations to streamline the codebase.
- Improved logging throughout the retriever classes for better traceability.
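
A minimal sketch of configuration loading with overrides in the spirit of `load_config_with_overrides`; the dotted-key override convention and the error handling shown here are assumptions:

```python
import logging
import yaml

logger = logging.getLogger(__name__)

def load_config_with_overrides(path: str, overrides: dict | None = None) -> dict:
    try:
        with open(path, "r", encoding="utf-8") as f:
            config = yaml.safe_load(f) or {}
    except (OSError, yaml.YAMLError) as exc:
        logger.error("Failed to load config %s: %s", path, exc)
        raise
    # Apply overrides given as dotted keys, e.g. {"qdrant.collection": "test"}
    for dotted_key, value in (overrides or {}).items():
        node = config
        *parents, leaf = dotted_key.split(".")
        for key in parents:
            node = node.setdefault(key, {})
        node[leaf] = value
    return config
```
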
…ality

- Refactored benchmark runner to utilize unified configuration approach.
- Updated imports to reflect new module structure for benchmarks and metrics.
- Introduced new benchmark scripts for simple and full dataset evaluations.
- Enhanced retrieval pipeline initialization to support unified config.
- Created comprehensive dataset adapter for full StackOverflow dataset evaluation.
- Added real data benchmark runner for testing with actual StackOverflow queries.
- Updated configuration file to include new retrieval strategies and parameters.
- Documented changes in CONFIG_CONSOLIDATION_COMPLETE.md and UNIFIED_CONFIG.md.
- Added examples demonstrating the use of the new unified configuration system.
…issing ground truth and improving document ID extraction
…and configuration

- Removed the full dataset benchmark script to streamline the benchmarking process.
- Updated the real benchmark script to ensure proper imports and functionality.
- Enhanced the configuration file to include new fusion methods and adjustable weights for hybrid retrieval.
- Refactored dense and sparse retrievers to improve embedding initialization and search processes.
- Implemented a new hybrid retriever that combines dense and sparse results using configurable fusion methods (one such method is sketched after this block).
- Deleted the synthetic dataset text processing script to clean up unused code.
- Added a comprehensive test suite for all retrievers in the full benchmark pipeline to ensure reliability and performance.
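
One plausible fusion method behind "configurable fusion methods" is reciprocal rank fusion; this sketch is illustrative and the function name is hypothetical:

```python
def rrf_fuse(dense_hits, sparse_hits, k: int = 60, top_k: int = 10):
    """Fuse two ranked lists of (doc_id, score) via reciprocal rank fusion."""
    fused: dict = {}
    for hits in (dense_hits, sparse_hits):
        for rank, (doc_id, _score) in enumerate(hits):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)[:top_k]

# Example: "b" ranks high in both lists, so it wins after fusion
print(rrf_fuse([("a", 0.9), ("b", 0.7)], [("b", 12.0), ("c", 8.5)]))
```
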
…flow

- Created `natural_questions.yml` for Google Natural Questions dataset with hybrid embedding strategy, chunking, validation, and evaluation settings.
- Created `stackoverflow.yml` for SOSum dataset with hybrid embedding strategy, chunking, validation, and evaluation settings.
- Added `stackoverflow_hybrid.yml` for hybrid dense and sparse embeddings configuration.
- Introduced dataset template `dataset_template.yml` for easy dataset configuration.
- Added retrieval configuration templates: `retrieval_template.yml` for agent retrieval setup.
- Implemented legacy configurations for various models: `stackoverflow_bge_large.yml`, `stackoverflow_e5_large.yml`, and `stackoverflow_minilm.yml`.
- Created high-performance retrieval configurations: `fast_hybrid.yml`, `modern_dense.yml`, and `modern_hybrid.yml`.
- Removed outdated retriever configurations: `dense_retriever.yml`, `hybrid_retriever.yml`, `semantic_retriever.yml`, and `sparse_retriever.yml`.
- Updated tests for agent retrieval and streamlined agent functionality, ensuring compatibility with new configurations.
…script

- Deleted the following test files:
  - test_full_ingestion.py
  - test_modular_pipeline.py
  - run_all_tests.py
  - test_adapter_fix.py
  - test_agent_retrieval.py
  - test_retriever_direct.py
  - test_streamlined_agent.py

- Added a new test file: test_local_setup.py
  - This script checks prerequisites and runs progressive tests for the pipeline.
…t.txt and updating pipeline tests to use requirements-minimal.txt
…ation structure for improved clarity and maintainability.

- Deleted SYSTEM_EXTENSION_GUIDE.md, UNIFIED_CONFIG.md, agent_retrieval_upgrade_summary.md, config_reorganization_summary.md, integration_testing_setup.md, and sql_removal_summary.md.
- Simplified agent graph by removing SQL-related nodes and dependencies.
- Consolidated configuration files into a unified structure, enhancing usability and reducing clutter.
@spyrchat requested a review from Copilot August 31, 2025 09:20

Copilot AI left a comment

Pull Request Overview

This pull request introduces a comprehensive, production-ready Retrieval-Augmented Generation (RAG) system with extensive testing infrastructure. The changes transform the existing codebase into a modern, modular pipeline with LangGraph agent integration, comprehensive CI/CD testing, and deployment readiness.

  • Introduces modular RAG pipeline with configurable agents, retrievers, and embeddings
  • Adds comprehensive three-tier testing system (minimal, integration, end-to-end)
  • Implements modern retrievers with hybrid search capabilities and pipeline integration
  • Provides standalone processing scripts and deployment containerization

Reviewed Changes

Copilot reviewed 127 out of 142 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| tests/test_local_setup.py | Local end-to-end test orchestration with progressive validation levels |
| tests/requirements-minimal.txt | Minimal testing dependencies for CI/CD environments |
| tests/requirements-README.md | Comprehensive documentation for testing dependency management |
| tests/pipeline/test_runner.py | Test runner combining configuration, pipeline, and database validation |
| tests/pipeline/test_qdrant_connectivity.py | Qdrant database connectivity tests with optional execution |
| tests/pipeline/test_qdrant.py | Simplified Qdrant operations testing for CI environments |
| tests/pipeline/test_minimal_pipeline.py | Core pipeline tests using Google Gemini embeddings only |
| tests/pipeline/test_minimal.py | Minimal pipeline validation without local model dependencies |
| tests/pipeline/test_end_to_end.py | Complete pipeline testing with real data and API integration |
| tests/pipeline/test_config.py | Configuration validation tests for YAML structure and completeness |
| tests/pipeline/test_components.py | Component integration tests without external service requirements |
| tests/pipeline/run_tests.py | Comprehensive test suite runner with auto-detection and reporting |
| tests/pipeline/__init__.py | Package initialization for pipeline tests |
| tests/pipeline/README_NEW.md | Detailed documentation for the three-tier testing system |
| tests/pipeline/.github-actions-example.yml | GitHub Actions CI configuration example for automated testing |
| tests/__init__.py | Package initialization for tests directory |
| scripts/standalone_sosum_processor.py | Standalone SOSum dataset processor without pipeline dependencies |
| scripts/setup_sosum.sh | SOSum dataset setup script with ingestion examples |
| retrievers/sparse_retriever.py | Modern sparse retriever with Qdrant integration and pipeline compliance |
| retrievers/semantic_retriever.py | Advanced semantic retriever with intelligent strategy selection |
| retrievers/hybrid_retriever.py | Hybrid retriever combining dense and sparse search with fusion methods |
| retrievers/dense_retriever.py | Modern dense retriever with direct Qdrant API usage |
| retrievers/base_retriever.py | Modern base retriever implementing pipeline interface |
| retrievers/base.py | Removed legacy base retriever implementation |
| retrievers/__init__.py | Updated exports for modern retriever implementations |
| pytest.ini | Pytest configuration with markers and test discovery settings |


Comment on lines +35 to +38
```python
if response.status_code == 404:
    collections_url = f"{qdrant_config['url']}/collections"
    response = requests.get(collections_url, timeout=5)
    assert response.status_code == 200
```

Copilot AI Aug 31, 2025

[nitpick] The fallback logic when health endpoint returns 404 is duplicating the collections endpoint test. Consider extracting this into a helper method to reduce code duplication and improve maintainability.
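
A helper along the lines the reviewer suggests might look like this (the name and signature are hypothetical):

```python
import requests

def assert_collections_reachable(base_url: str, timeout: float = 5.0) -> None:
    """Shared check for the health-endpoint fallback and the collections test."""
    response = requests.get(f"{base_url}/collections", timeout=timeout)
    assert response.status_code == 200
```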

config["qdrant"]["collection"] = collection_name

# Add force recreate to avoid conflicts
config["retrieval_pipeline"]["retriever"]["qdrant"]["force_recreate"] = False

Copilot AI Aug 31, 2025

[nitpick] The hardcoded force_recreate = False setting could cause issues if the test collection already exists with different configuration. Consider using True for test collections to ensure clean state, or add logic to handle collection conflicts.

Suggested change:
```diff
-config["retrieval_pipeline"]["retriever"]["qdrant"]["force_recreate"] = False
+config["retrieval_pipeline"]["retriever"]["qdrant"]["force_recreate"] = True
```

Comment on lines +62 to +65
```python
    'provider': 'sparse',
    'model': 'Qdrant/bm25',
    'vector_name': 'sparse'
}
```

Copilot AI Aug 31, 2025

[nitpick] The default sparse embedding configuration is hardcoded. Consider moving these defaults to a configuration file or constants to improve maintainability and allow easier customization.

Comment on lines +73 to +77
```python
    'provider': 'google',
    'model': 'models/embedding-001',
    'dimensions': 768,
    'api_key_env': 'GOOGLE_API_KEY'
}
```

Copilot AI Aug 31, 2025

[nitpick] The default dense embedding configuration is hardcoded with Google-specific settings. This creates tight coupling and should be extracted to configuration constants or use a configuration factory pattern.
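
This and the preceding sparse-config comment point the same way: hoist the hardcoded defaults into module-level constants (names below are hypothetical) so they live in one place and can be overridden:

```python
DEFAULT_SPARSE_EMBEDDING = {
    'provider': 'sparse',
    'model': 'Qdrant/bm25',
    'vector_name': 'sparse',
}

DEFAULT_DENSE_EMBEDDING = {
    'provider': 'google',
    'model': 'models/embedding-001',
    'dimensions': 768,
    'api_key_env': 'GOOGLE_API_KEY',
}
```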

Comment on lines +24 to +27
"""Convert a string to a deterministic UUID."""
# Create a deterministic UUID from string using SHA256
hash_bytes = hashlib.sha256(text.encode('utf-8')).digest()[:16]
return str(uuid.UUID(bytes=hash_bytes))

Copilot AI Aug 31, 2025

Taking only the first 16 bytes of SHA256 reduces the hash space and could increase collision probability. Consider using the full hash or a UUID5 which is designed for deterministic UUIDs from names.

Suggested change:
```diff
-"""Convert a string to a deterministic UUID."""
-# Create a deterministic UUID from string using SHA256
-hash_bytes = hashlib.sha256(text.encode('utf-8')).digest()[:16]
-return str(uuid.UUID(bytes=hash_bytes))
+"""Convert a string to a deterministic UUID using UUID5."""
+# Use UUID5 for deterministic UUID from string and namespace
+return str(uuid.uuid5(uuid.NAMESPACE_URL, text))
```

Comment on lines +149 to +150
```python
if max_char_ratio > 0.5:  # More than 50% of content is single character
    errors.append(f"Excessive character repetition: {max_char_ratio:.2%}")
```

Copilot AI Aug 31, 2025

[nitpick] The 50% threshold for character repetition is hardcoded. Consider making this configurable through the validator config to allow different content types (e.g., code, structured data) that may legitimately have high character repetition.
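
One minimal way to honor this, assuming the validator holds a config dict (the key name is hypothetical):

```python
max_ratio = self.config.get("max_char_ratio", 0.5)
if max_char_ratio > max_ratio:
    errors.append(f"Excessive character repetition: {max_char_ratio:.2%}")
```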

Comment on lines +182 to +189
```python
if hasattr(vector_config, 'dense'):
    # Named vectors (hybrid setup)
    vector_size = vector_config.dense.size
    vector_name = "dense"
else:
    # Single vector config
    vector_size = vector_config.size
    vector_name = None
```

Copilot AI Aug 31, 2025

[nitpick] The vector configuration detection logic assumes specific attribute names ('dense') and structure. This is fragile and could break with Qdrant API changes. Consider using a more robust approach with try-catch blocks or checking attribute existence more defensively.

Suggested change:
```diff
-if hasattr(vector_config, 'dense'):
-    # Named vectors (hybrid setup)
-    vector_size = vector_config.dense.size
-    vector_name = "dense"
-else:
-    # Single vector config
-    vector_size = vector_config.size
-    vector_name = None
+vector_size = None
+vector_name = None
+# Defensive checks for vector config structure
+if hasattr(vector_config, 'dense') and hasattr(vector_config.dense, 'size'):
+    # Named vectors (hybrid setup)
+    vector_size = vector_config.dense.size
+    vector_name = "dense"
+elif hasattr(vector_config, 'size'):
+    # Single vector config
+    vector_size = vector_config.size
+    vector_name = None
+elif isinstance(vector_config, dict):
+    # Sometimes vector_config may be a dict (API change)
+    if "dense" in vector_config and "size" in vector_config["dense"]:
+        vector_size = vector_config["dense"]["size"]
+        vector_name = "dense"
+    elif "size" in vector_config:
+        vector_size = vector_config["size"]
+        vector_name = None
+if vector_size is None:
+    raise ValueError("Could not determine vector size from collection config. "
+                     "Unexpected vector config structure: {}".format(vector_config))
```

@spyrchat closed this Aug 31, 2025