@spyrchat spyrchat commented Sep 8, 2025

This pull request introduces a production-ready, modular Retrieval-Augmented Generation (RAG) system built around a configurable pipeline architecture, LangGraph agent integration, and CI/CD support. The main changes are a new agent graph implementation, modular node components, a comprehensive project README, and supporting infrastructure for Docker and CI workflows.

Agent and Pipeline Architecture:

  • Introduced a modular agent graph in agent/graph.py using LangGraph, integrating configurable retrieval, query interpretation, generation, and memory update nodes. The agent dynamically routes questions through retrieval or direct answer paths based on intent.
  • Added modular agent nodes:
    • query_interpreter.py: Interprets user intent and determines pipeline routing using an LLM and structured prompt.
    • generator.py: Generates answers from context or fallback using an LLM, with error handling and logging.
    • memory_updater.py: Updates chat history, maintaining a capped window of conversation turns.
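The capped-window behavior described for `memory_updater.py` can be sketched as follows. This is an illustrative stand-in, not the actual node: the `MAX_TURNS` constant and the state-dict shape are assumptions.

```python
# Sketch of a capped chat-history updater; the real node in
# agent/nodes/memory_updater.py may differ. MAX_TURNS is an assumed constant.
MAX_TURNS = 10  # keep at most the last 10 (question, answer) turns

def update_memory(state: dict) -> dict:
    """Append the latest turn, then trim history to the cap."""
    history = list(state.get("chat_history", []))
    history.append((state["question"], state["answer"]))
    # Keep only the most recent MAX_TURNS entries.
    state["chat_history"] = history[-MAX_TURNS:]
    return state
```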

Documentation and Usability:

  • Completely rewrote README.md to provide a clear overview of features, architecture, configuration, extension, and usage instructions for the new RAG system. Includes configuration examples, project structure, and migration guidance.

Infrastructure and DevOps:

  • Added a new Dockerfile using a slim Python base image for production deployment, installing dependencies and copying the full source code.
  • Introduced a comprehensive GitHub Actions workflow (.github/workflows/pipeline-tests.yml) for CI, including minimal and integration tests, configuration validation, and security checks.
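A slim production Dockerfile along these lines would match the description; the entrypoint and paths here are illustrative assumptions, not the actual file contents.

```dockerfile
# Illustrative sketch; the actual Dockerfile in this PR may differ.
FROM python:3.11-slim

WORKDIR /app

# Install dependencies first so this layer is cached across code changes.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the full source code.
COPY . .

# Assumed entrypoint for the agent service.
CMD ["python", "-m", "agent"]
```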

spyrchat added 30 commits May 20, 2025 16:54
…er; add text processing pipeline for PDF documents
…; implement metadata handling and enrich documents for upload
…g, and memory updating; add retriever routing logic and logging
… requirements.txt with dependency version upgrades
- Implemented a comprehensive smoke testing framework in `smoke_tests.py` to verify system quality post-ingestion.
- Created `uploader.py` for handling idempotent vector uploads with versioning to Qdrant, supporting dense and sparse vectors.
- Developed `validator.py` for document validation and cleaning, ensuring content quality and metadata integrity.
- Added a setup script `setup_sosum.sh` for quick setup of the SOSum dataset, including verification of required files and ingestion instructions.
- Implemented a minimal ingestion test for the SOSum adapter in `examples/test_sosum_minimal.py` to validate basic functionality without heavy ML dependencies.
- Created a standalone SOSum processor script in `scripts/standalone_sosum_processor.py` for processing datasets without full pipeline dependencies.
- Updated `ingest.py` to improve output formatting for collection status.
- Added new configuration files for different embedding strategies in `pipelines/configs/`.
- Updated requirements in `requirements.txt` to include new dependencies and versions.
- Added sample output JSON file for processed SOSum data.
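Idempotent uploads of the kind described for `uploader.py` are typically achieved with deterministic point IDs, so re-ingesting the same document version overwrites rather than duplicates points. A rough sketch under that assumption (the real uploader's ID scheme may differ):

```python
import hashlib
import uuid

def stable_point_id(doc_id: str, chunk_index: int, version: str) -> str:
    """Derive a deterministic UUID from document identity and version.

    Re-running ingestion with the same inputs yields the same ID, so an
    upsert into the vector store replaces the existing point instead of
    adding a duplicate. Illustrates the idempotency goal, not the actual code.
    """
    key = f"{doc_id}:{chunk_index}:{version}".encode("utf-8")
    # A SHA-256 digest truncated to 32 hex chars is a valid UUID string.
    return str(uuid.UUID(hashlib.sha256(key).hexdigest()[:32]))
```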
- Added 'split' parameter to StackOverflowAdapter for better data segmentation.
- Introduced 'dense_embedding' and 'sparse_embedding' fields in ChunkMeta for improved embedding metadata.
- Updated EmbeddingPipeline to directly assign embeddings to respective fields.
- Made allowed characters in DocumentValidator more permissive for HTML/code content.
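The `ChunkMeta` change can be pictured as a record carrying both embedding fields. The field names `split`, `dense_embedding`, and `sparse_embedding` come from the bullets above; the surrounding fields and types are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ChunkMeta:
    """Illustrative chunk metadata; only the named fields are from the PR."""
    chunk_id: str
    source_doc: str
    split: str = "train"                      # the new 'split' parameter
    dense_embedding: Optional[list] = None    # dense vector of floats
    sparse_embedding: Optional[dict] = None   # token-index -> weight mapping
```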
- Introduced a comprehensive Quick Start Guide for implementing an MLOps pipeline for RAG systems, covering project initialization, dataset adapters, configuration, processing components, and testing.
- Implemented a CSV dataset adapter for reading and converting CSV data into documents.
- Created a configuration schema for managing dataset, chunking, embedding, and vector store settings.
- Developed core processing components including a document chunker and an embedding pipeline.
- Added a simple CLI interface for ingestion with logging and configuration handling.
- Implemented a sparse embedding mechanism and integrated it into the embedding pipeline.
- Added inspection script for analyzing vector structures in Qdrant.
- Created smoke tests for validating ingestion processes and vector store uploads.
- Added test script for verifying sparse embedding serialization.
- Updated existing configurations for stackoverflow datasets to support hybrid and dense embedding strategies.
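The sparse embedding mechanism mentioned above can be as simple as a term-weight mapping over a vocabulary. This toy stand-in shows the output shape only; the project's implementation likely uses a learned sparse encoder or BM25-style weighting.

```python
from collections import Counter

def sparse_embed(text: str, vocab: dict) -> dict:
    """Map text to {token_index: weight} using raw term frequency.

    Real sparse encoders produce richer weights; what matters for hybrid
    retrieval is the sparse index->value dict this returns.
    """
    counts = Counter(text.lower().split())
    return {vocab[t]: float(c) for t, c in counts.items() if t in vocab}
```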
… framework

- Created `experimental.yml` for testing new components in the retrieval pipeline.
- Added `hybrid_multistage.yml` for hybrid retrieval with multi-stage reranking.
- Implemented tests for the new answer-focused adapter in `test_new_adapter.py`.
- Developed advanced reranking tests in `test_advanced_rerankers.py`.
- Introduced answer retrieval tests in `test_answer_retrieval.py`.
- Demonstrated retrieval pipeline extensibility in `test_extensibility.py`.
- Showcased modular pipeline features in `test_modular_pipeline.py`.
- Added a comprehensive test runner in `run_all_tests.py`.
- Updated agent retrieval tests to support configurable pipelines in `test_agent_retrieval.py`.
- Implemented unit tests for the RetrievalPipeline, RetrievalResult, and associated components (Retriever, Reranker, Filter) in `test_retrieval_pipeline.py`.
- Created mock classes for testing purposes to simulate retrieval, reranking, and filtering behaviors.
- Added tests for basic functionality, component addition/removal, and pipeline execution with various configurations.
- Introduced tests for the RetrievalPipelineFactory to validate pipeline creation with dense and hybrid configurations.
- Added minimal and example tests for the SOSum adapter to ensure basic functionality without heavy dependencies.
- Implemented smoke tests for the ingestion process and overall system quality checks.
- Updated the test runner to include new tests and organized the test structure for better clarity.
- Updated langchain-core to version 0.3.75
- Added new dependencies: cachetools, distro, filetype, google-ai-generativelanguage, google-api-core, google-auth, googleapis-common-protos, grpcio-status, jiter, langchain-google-genai, langchain-openai, langgraph, langgraph-checkpoint, langgraph-prebuilt, langgraph-sdk, openai, ormsgpack, proto-plus, psycopg2-binary, pyasn1, pyasn1_modules, rsa, tiktoken, xxhash
- Updated existing dependencies to their latest versions

test: Enhance tests for rerankers and retrieval pipeline

- Refactored test cases in test_rerankers.py for better readability and maintainability
- Added new tests for the RetrievalPipeline and its components in test_retrieval_pipeline.py
- Improved mock implementations for better isolation in tests

feat: Add debug scripts for StackOverflow adapter

- Introduced debug_row_order.py to check row order and types from StackOverflow data
- Added debug_stackoverflow_adapter.py to investigate issues with document reading in the StackOverflow adapter

test: Implement tests for ingestion pipeline and adapter functionality

- Created test_full_ingestion.py to validate the full ingestion pipeline with StackOverflow data
- Added test_adapter_fix.py to verify the StackOverflow adapter produces documents correctly

chore: Update test runner to include new tests

- Modified run_all_tests.py to include new test files for retrieval and ingestion
- Removed legacy retriever wrapping and introduced modern retriever classes.
- Updated `RetrievalPipelineFactory` to create dense, hybrid, sparse, and semantic pipelines using new retriever implementations.
- Created `ModernBaseRetriever` as a base class for all retrievers, providing common functionality and configuration handling.
- Implemented `QdrantDenseRetriever`, `QdrantHybridRetriever`, `QdrantSparseRetriever`, and `SemanticRetriever` with improved initialization and search methods.
- Removed deprecated `router.py` and integrated routing logic into the new retriever classes.
- Enhanced logging and error handling across retrievers for better debugging and monitoring.
- Updated imports and module structure to reflect the new architecture.
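The retriever hierarchy described above might be organized along these lines. These are class skeletons only: the class names match the bullets, but the method signatures, config keys, and factory registry are assumptions.

```python
from abc import ABC, abstractmethod

class ModernBaseRetriever(ABC):
    """Shared configuration handling for all retrievers (sketch)."""

    def __init__(self, config: dict):
        self.config = config
        self.top_k = config.get("top_k", 5)

    @abstractmethod
    def search(self, query: str) -> list:
        """Return ranked results for the query."""

class QdrantDenseRetriever(ModernBaseRetriever):
    def search(self, query: str) -> list:
        # A real implementation would embed the query and call Qdrant;
        # only the class structure is illustrated here.
        return []

def create_retriever(strategy: str, config: dict) -> ModernBaseRetriever:
    """Factory routing, standing in for the deleted router.py (illustrative)."""
    registry = {"dense": QdrantDenseRetriever}
    return registry[strategy](config)
```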
…script

- Deleted the following test files:
  - test_full_ingestion.py
  - test_modular_pipeline.py
  - run_all_tests.py
  - test_adapter_fix.py
  - test_agent_retrieval.py
  - test_retriever_direct.py
  - test_streamlined_agent.py

- Added a new test file: test_local_setup.py
  - This script checks prerequisites and runs progressive tests for the pipeline.
…t.txt and updating pipeline tests to use requirements-minimal.txt
…ation structure for improved clarity and maintainability.

- Deleted SYSTEM_EXTENSION_GUIDE.md, UNIFIED_CONFIG.md, agent_retrieval_upgrade_summary.md, config_reorganization_summary.md, integration_testing_setup.md, and sql_removal_summary.md.
- Simplified agent graph by removing SQL-related nodes and dependencies.
- Consolidated configuration files into a unified structure, enhancing usability and reducing clutter.

spyrchat commented Sep 8, 2025

This pull request introduces a production-ready, modular Retrieval-Augmented Generation (RAG) system with a configurable LangGraph agent pipeline. It adds a comprehensive project README, a Dockerfile for containerization, a CI workflow for automated testing and validation, and core agent pipeline code with modular node implementations. The main themes are: agent pipeline implementation, infrastructure and CI setup, and documentation.

Agent pipeline implementation:

  • Adds the core agent graph in agent/graph.py, wiring together modular nodes (query interpreter, retriever, generator, memory updater) using LangGraph, with configuration-driven LLM and retrieval pipeline selection.
  • Implements make_query_interpreter, a node that decides whether to retrieve documents or answer directly, using an LLM and a structured prompt, with robust error handling and logging (agent/nodes/query_interpreter.py).
  • Implements make_generator, a node that generates answers from context and question using an LLM, with logging and error fallback (agent/nodes/generator.py).
  • Adds memory_updater, a node to maintain a rolling chat history in the agent state (agent/nodes/memory_updater.py).
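The retrieve-or-answer decision made by `make_query_interpreter` can be sketched with a stand-in for the LLM call. The intent labels, state keys, and the `classify` callable below are assumptions; only the routing-with-fallback idea comes from the PR description.

```python
def interpret_query(state: dict, classify) -> dict:
    """Decide the next node based on classified intent.

    `classify` stands in for the structured-prompt LLM call; it returns
    "retrieve" or "direct". On error, fall back to retrieval, the safer
    default, mirroring the robust error handling described.
    """
    try:
        intent = classify(state["question"])
    except Exception:
        intent = "retrieve"
    state["route"] = "retriever" if intent == "retrieve" else "generator"
    return state
```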

Infrastructure and CI setup:

  • Adds a Dockerfile using a slim Python 3.11 image, installing dependencies and copying the source code for containerized deployments.
  • Introduces a GitHub Actions workflow (.github/workflows/pipeline-tests.yml) to run minimal and integration tests (with Qdrant), validate configuration files, and check for hardcoded secrets on push and PRs.
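A workflow with the described shape might look roughly like this; the job names, steps, and test commands are illustrative, not the actual contents of `.github/workflows/pipeline-tests.yml`.

```yaml
# Illustrative sketch of the CI workflow; not the actual file.
name: Pipeline Tests
on: [push, pull_request]

jobs:
  minimal-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements-minimal.txt
      - run: pytest -m "not integration"

  integration-tests:
    runs-on: ubuntu-latest
    services:
      qdrant:
        image: qdrant/qdrant
        ports: ["6333:6333"]
    steps:
      - uses: actions/checkout@v4
      - run: pip install -r requirements.txt
      - run: pytest -m integration
```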

Documentation:

  • Replaces the README.md with a detailed overview covering system features, architecture, configuration, extension, testing, and migration from legacy code, including usage and project structure.

@spyrchat spyrchat merged commit c605171 into development Sep 8, 2025
6 checks passed