@spyrchat spyrchat commented Sep 8, 2025

This pull request introduces a production-ready, modular Retrieval-Augmented Generation (RAG) system built around a configurable pipeline architecture, LangGraph agent integration, and CI/CD support. The main changes are a new agent graph implementation, modular node components, a comprehensive project README, and supporting infrastructure for Docker and CI workflows.

Agent and Pipeline Architecture:

  • Introduced a modular agent graph in agent/graph.py using LangGraph, integrating configurable retrieval, query interpretation, generation, and memory update nodes. The agent dynamically routes questions through retrieval or direct answer paths based on intent.
  • Added modular agent nodes:
    • query_interpreter.py: Interprets user intent and determines pipeline routing using an LLM and structured prompt.
    • generator.py: Generates answers from context or fallback using an LLM, with error handling and logging.
    • memory_updater.py: Updates chat history, maintaining a capped window of conversation turns.
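The capped-window behavior described for `memory_updater.py` can be sketched as follows. This is an illustrative stand-in, not the actual node: the `MAX_TURNS` constant and the state-dict shape are assumptions.

```python
# Sketch of a capped chat-history updater; the real node in
# agent/nodes/memory_updater.py may differ. MAX_TURNS is an assumed constant.
MAX_TURNS = 10  # keep at most the last 10 (question, answer) turns

def update_memory(state: dict) -> dict:
    """Append the latest turn, then trim history to the cap."""
    history = list(state.get("chat_history", []))
    history.append((state["question"], state["answer"]))
    # Keep only the most recent MAX_TURNS entries.
    state["chat_history"] = history[-MAX_TURNS:]
    return state
```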

Documentation and Usability:

  • Completely rewrote README.md to provide a clear overview of features, architecture, configuration, extension, and usage instructions for the new RAG system. Includes configuration examples, project structure, and migration guidance.

Infrastructure and DevOps:

  • Added a new Dockerfile using a slim Python base image for production deployment, installing dependencies and copying the full source code.
  • Introduced a comprehensive GitHub Actions workflow (.github/workflows/pipeline-tests.yml) for CI, including minimal and integration tests, configuration validation, and security checks.
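A slim production Dockerfile along these lines would match the description; the entrypoint and paths here are illustrative assumptions, not the actual file contents.

```dockerfile
# Illustrative sketch; the actual Dockerfile in this PR may differ.
FROM python:3.11-slim

WORKDIR /app

# Install dependencies first so this layer is cached across code changes.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the full source code.
COPY . .

# Assumed entrypoint for the agent service.
CMD ["python", "-m", "agent"]
```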

spyrchat added 30 commits May 20, 2025 16:54
…er; add text processing pipeline for PDF documents
…; implement metadata handling and enrich documents for upload
…g, and memory updating; add retriever routing logic and logging
… requirements.txt with dependency version upgrades
- Implemented a comprehensive smoke testing framework in `smoke_tests.py` to verify system quality post-ingestion.
- Created `uploader.py` for handling idempotent vector uploads with versioning to Qdrant, supporting dense and sparse vectors.
- Developed `validator.py` for document validation and cleaning, ensuring content quality and metadata integrity.
- Added a setup script `setup_sosum.sh` for quick setup of the SOSum dataset, including verification of required files and ingestion instructions.
- Implemented a minimal ingestion test for the SOSum adapter in `examples/test_sosum_minimal.py` to validate basic functionality without heavy ML dependencies.
- Created a standalone SOSum processor script in `scripts/standalone_sosum_processor.py` for processing datasets without full pipeline dependencies.
- Updated `ingest.py` to improve output formatting for collection status.
- Added new configuration files for different embedding strategies in `pipelines/configs/`.
- Updated requirements in `requirements.txt` to include new dependencies and versions.
- Added sample output JSON file for processed SOSum data.
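Idempotent uploads of the kind described for `uploader.py` are typically achieved with deterministic point IDs, so re-ingesting the same document version overwrites rather than duplicates points. A rough sketch under that assumption (the real uploader's ID scheme may differ):

```python
import hashlib
import uuid

def stable_point_id(doc_id: str, chunk_index: int, version: str) -> str:
    """Derive a deterministic UUID from document identity and version.

    Re-running ingestion with the same inputs yields the same ID, so an
    upsert into the vector store replaces the existing point instead of
    adding a duplicate. Illustrates the idempotency goal, not the actual code.
    """
    key = f"{doc_id}:{chunk_index}:{version}".encode("utf-8")
    # A SHA-256 digest truncated to 32 hex chars is a valid UUID string.
    return str(uuid.UUID(hashlib.sha256(key).hexdigest()[:32]))
```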
- Added 'split' parameter to StackOverflowAdapter for better data segmentation.
- Introduced 'dense_embedding' and 'sparse_embedding' fields in ChunkMeta for improved embedding metadata.
- Updated EmbeddingPipeline to directly assign embeddings to respective fields.
- Made allowed characters in DocumentValidator more permissive for HTML/code content.
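The `ChunkMeta` change can be pictured as a record carrying both embedding fields. The field names `split`, `dense_embedding`, and `sparse_embedding` come from the bullets above; the surrounding fields and types are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ChunkMeta:
    """Illustrative chunk metadata; only the named fields are from the PR."""
    chunk_id: str
    source_doc: str
    split: str = "train"                      # the new 'split' parameter
    dense_embedding: Optional[list] = None    # dense vector of floats
    sparse_embedding: Optional[dict] = None   # token-index -> weight mapping
```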
- Introduced a comprehensive Quick Start Guide for implementing an MLOps pipeline for RAG systems, covering project initialization, dataset adapters, configuration, processing components, and testing.
- Implemented a CSV dataset adapter for reading and converting CSV data into documents.
- Created a configuration schema for managing dataset, chunking, embedding, and vector store settings.
- Developed core processing components including a document chunker and an embedding pipeline.
- Added a simple CLI interface for ingestion with logging and configuration handling.
- Implemented a sparse embedding mechanism and integrated it into the embedding pipeline.
- Added inspection script for analyzing vector structures in Qdrant.
- Created smoke tests for validating ingestion processes and vector store uploads.
- Added test script for verifying sparse embedding serialization.
- Updated existing configurations for stackoverflow datasets to support hybrid and dense embedding strategies.
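The sparse embedding mechanism mentioned above can be as simple as a term-weight mapping over a vocabulary. This toy stand-in shows the output shape only; the project's implementation likely uses a learned sparse encoder or BM25-style weighting.

```python
from collections import Counter

def sparse_embed(text: str, vocab: dict) -> dict:
    """Map text to {token_index: weight} using raw term frequency.

    Real sparse encoders produce richer weights; what matters for hybrid
    retrieval is the sparse index->value dict this returns.
    """
    counts = Counter(text.lower().split())
    return {vocab[t]: float(c) for t, c in counts.items() if t in vocab}
```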
… framework

- Created `experimental.yml` for testing new components in the retrieval pipeline.
- Added `hybrid_multistage.yml` for hybrid retrieval with multi-stage reranking.
- Implemented tests for the new answer-focused adapter in `test_new_adapter.py`.
- Developed advanced reranking tests in `test_advanced_rerankers.py`.
- Introduced answer retrieval tests in `test_answer_retrieval.py`.
- Demonstrated retrieval pipeline extensibility in `test_extensibility.py`.
- Showcased modular pipeline features in `test_modular_pipeline.py`.
- Added a comprehensive test runner in `run_all_tests.py`.
- Updated agent retrieval tests to support configurable pipelines in `test_agent_retrieval.py`.
- Implemented unit tests for the RetrievalPipeline, RetrievalResult, and associated components (Retriever, Reranker, Filter) in `test_retrieval_pipeline.py`.
- Created mock classes for testing purposes to simulate retrieval, reranking, and filtering behaviors.
- Added tests for basic functionality, component addition/removal, and pipeline execution with various configurations.
- Introduced tests for the RetrievalPipelineFactory to validate pipeline creation with dense and hybrid configurations.
- Added minimal and example tests for the SOSum adapter to ensure basic functionality without heavy dependencies.
- Implemented smoke tests for the ingestion process and overall system quality checks.
- Updated the test runner to include new tests and organized the test structure for better clarity.
- Updated langchain-core to version 0.3.75
- Added new dependencies: cachetools, distro, filetype, google-ai-generativelanguage, google-api-core, google-auth, googleapis-common-protos, grpcio-status, jiter, langchain-google-genai, langchain-openai, langgraph, langgraph-checkpoint, langgraph-prebuilt, langgraph-sdk, openai, ormsgpack, proto-plus, psycopg2-binary, pyasn1, pyasn1_modules, rsa, tiktoken, xxhash
- Updated existing dependencies to their latest versions

test: Enhance tests for rerankers and retrieval pipeline

- Refactored test cases in test_rerankers.py for better readability and maintainability
- Added new tests for the RetrievalPipeline and its components in test_retrieval_pipeline.py
- Improved mock implementations for better isolation in tests

feat: Add debug scripts for StackOverflow adapter

- Introduced debug_row_order.py to check row order and types from StackOverflow data
- Added debug_stackoverflow_adapter.py to investigate issues with document reading in the StackOverflow adapter

test: Implement tests for ingestion pipeline and adapter functionality

- Created test_full_ingestion.py to validate the full ingestion pipeline with StackOverflow data
- Added test_adapter_fix.py to verify the StackOverflow adapter produces documents correctly

chore: Update test runner to include new tests

- Modified run_all_tests.py to include new test files for retrieval and ingestion
- Removed legacy retriever wrapping and introduced modern retriever classes.
- Updated `RetrievalPipelineFactory` to create dense, hybrid, sparse, and semantic pipelines using new retriever implementations.
- Created `ModernBaseRetriever` as a base class for all retrievers, providing common functionality and configuration handling.
- Implemented `QdrantDenseRetriever`, `QdrantHybridRetriever`, `QdrantSparseRetriever`, and `SemanticRetriever` with improved initialization and search methods.
- Removed deprecated `router.py` and integrated routing logic into the new retriever classes.
- Enhanced logging and error handling across retrievers for better debugging and monitoring.
- Updated imports and module structure to reflect the new architecture.
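The retriever hierarchy described above might be organized along these lines. These are class skeletons only: the class names match the bullets, but the method signatures, config keys, and factory registry are assumptions.

```python
from abc import ABC, abstractmethod

class ModernBaseRetriever(ABC):
    """Shared configuration handling for all retrievers (sketch)."""

    def __init__(self, config: dict):
        self.config = config
        self.top_k = config.get("top_k", 5)

    @abstractmethod
    def search(self, query: str) -> list:
        """Return ranked results for the query."""

class QdrantDenseRetriever(ModernBaseRetriever):
    def search(self, query: str) -> list:
        # A real implementation would embed the query and call Qdrant;
        # only the class structure is illustrated here.
        return []

def create_retriever(strategy: str, config: dict) -> ModernBaseRetriever:
    """Factory routing, standing in for the deleted router.py (illustrative)."""
    registry = {"dense": QdrantDenseRetriever}
    return registry[strategy](config)
```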
…script

- Deleted the following test files:
  - test_full_ingestion.py
  - test_modular_pipeline.py
  - run_all_tests.py
  - test_adapter_fix.py
  - test_agent_retrieval.py
  - test_retriever_direct.py
  - test_streamlined_agent.py

- Added a new test file: test_local_setup.py
  - This script checks prerequisites and runs progressive tests for the pipeline.
…t.txt and updating pipeline tests to use requirements-minimal.txt
…ation structure for improved clarity and maintainability.

- Deleted SYSTEM_EXTENSION_GUIDE.md, UNIFIED_CONFIG.md, agent_retrieval_upgrade_summary.md, config_reorganization_summary.md, integration_testing_setup.md, and sql_removal_summary.md.
- Simplified agent graph by removing SQL-related nodes and dependencies.
- Consolidated configuration files into a unified structure, enhancing usability and reducing clutter.

spyrchat commented Sep 8, 2025

This pull request introduces a production-ready, modular Retrieval-Augmented Generation (RAG) system with a configurable LangGraph agent pipeline. It adds a comprehensive project README, a Dockerfile for containerization, a CI workflow for automated testing and validation, and core agent pipeline code with modular node implementations. The main themes are: agent pipeline implementation, infrastructure and CI setup, and documentation.

Agent pipeline implementation:

  • Adds the core agent graph in agent/graph.py, wiring together modular nodes (query interpreter, retriever, generator, memory updater) using LangGraph, with configuration-driven LLM and retrieval pipeline selection.
  • Implements make_query_interpreter, a node that decides whether to retrieve documents or answer directly, using an LLM and a structured prompt, with robust error handling and logging (agent/nodes/query_interpreter.py).
  • Implements make_generator, a node that generates answers from context and question using an LLM, with logging and error fallback (agent/nodes/generator.py).
  • Adds memory_updater, a node to maintain a rolling chat history in the agent state (agent/nodes/memory_updater.py).
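The retrieve-or-answer decision made by `make_query_interpreter` can be sketched with a stand-in for the LLM call. The intent labels, state keys, and the `classify` callable below are assumptions; only the routing-with-fallback idea comes from the PR description.

```python
def interpret_query(state: dict, classify) -> dict:
    """Decide the next node based on classified intent.

    `classify` stands in for the structured-prompt LLM call; it returns
    "retrieve" or "direct". On error, fall back to retrieval, the safer
    default, mirroring the robust error handling described.
    """
    try:
        intent = classify(state["question"])
    except Exception:
        intent = "retrieve"
    state["route"] = "retriever" if intent == "retrieve" else "generator"
    return state
```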

Infrastructure and CI setup:

  • Adds a Dockerfile using a slim Python 3.11 image, installing dependencies and copying the source code for containerized deployments.
  • Introduces a GitHub Actions workflow (.github/workflows/pipeline-tests.yml) to run minimal and integration tests (with Qdrant), validate configuration files, and check for hardcoded secrets on push and PRs.
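A workflow with the described shape might look roughly like this; the job names, steps, and test commands are illustrative, not the actual contents of `.github/workflows/pipeline-tests.yml`.

```yaml
# Illustrative sketch of the CI workflow; not the actual file.
name: Pipeline Tests
on: [push, pull_request]

jobs:
  minimal-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements-minimal.txt
      - run: pytest -m "not integration"

  integration-tests:
    runs-on: ubuntu-latest
    services:
      qdrant:
        image: qdrant/qdrant
        ports: ["6333:6333"]
    steps:
      - uses: actions/checkout@v4
      - run: pip install -r requirements.txt
      - run: pytest -m integration
```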

Documentation:

  • Replaces the README.md with a detailed overview covering system features, architecture, configuration, extension, testing, and migration from legacy code, including usage and project structure.

@spyrchat spyrchat merged commit c605171 into development Sep 8, 2025
6 checks passed