@spyrchat
Owner

This pull request introduces significant improvements to the agent pipeline, focusing on modularity, benchmarking, and flexibility. The main changes include the removal of the old agent graph, the addition of two new, more advanced agent graph variants (a refined RAG agent and a self-correcting Self-RAG agent), the introduction of a robust benchmarking logger node, and enhancements to the generator node for configurable prompting. It also updates documentation and environment variable management for better usability.

Agent Pipeline Refactoring and Enhancements:

  • Removed the legacy agent graph implementation in agent/graph.py in favor of more modular and extensible graph variants.
  • Added agent/graph_refined.py, which implements a multi-stage, linear RAG agent pipeline with explicit query analysis, retrieval, generation, memory updating, and benchmark logging. This graph supports configurable LLM providers and prompt styles.
  • Introduced agent/graph_self_rag.py, a Self-RAG agent graph with an iterative refinement loop, enabling answer verification and correction to reduce hallucinations.
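
The iterative refinement loop of the Self-RAG variant can be pictured roughly as follows. This is a minimal plain-Python sketch, not the code in agent/graph_self_rag.py: the function name `self_rag_loop`, the `retrieve`/`generate`/`verify` callables, and the `max_iterations` parameter are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple


def self_rag_loop(
    question: str,
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str, List[str]], str],
    verify: Callable[[str, List[str]], Tuple[bool, str]],
    max_iterations: int = 3,  # hypothetical cap on refinement rounds
) -> Tuple[str, int]:
    """Generate an answer, verify it against the retrieved documents,
    and retry with the verifier's feedback until it passes or the cap is hit."""
    query = question
    answer = ""
    for iteration in range(1, max_iterations + 1):
        docs = retrieve(query)
        answer = generate(question, docs)
        supported, feedback = verify(answer, docs)
        if supported:
            return answer, iteration
        # Fold the verifier's feedback into the next retrieval query.
        query = f"{question} {feedback}"
    # Cap reached: return the last attempt even though it did not verify.
    return answer, max_iterations


# Toy stand-ins so the sketch runs without an LLM or a vector store.
def demo_retrieve(query: str) -> List[str]:
    return [query]


def demo_generate(question: str, docs: List[str]) -> str:
    return "good" if "hint" in docs[0] else "bad"


def demo_verify(answer: str, docs: List[str]) -> Tuple[bool, str]:
    return answer == "good", "hint"


result = self_rag_loop("q", demo_retrieve, demo_generate, demo_verify)
```

With these toy callables the first attempt fails verification and the second succeeds, so the loop stops after two rounds.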

Benchmarking and Logging:

  • Added agent/nodes/benchmark_logger.py, providing a node and utility for logging agent pipeline executions, saving outputs for benchmarking, and summarizing results. This is integrated into both new agent graphs.

Generator Node Improvements:

  • Enhanced the generator node in agent/nodes/generator.py to support multiple prompt styles ("strict", "conversational", "citations") selected via configuration, improving answer quality and flexibility. Also improved logging and prompt structure for technical accuracy. [1] [2]
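
Selecting a prompt style from configuration might look like the sketch below. The three style names come from the description above; the `prompt_style` config key, the template wording, and the `build_prompt` helper are hypothetical and do not reflect the actual contents of agent/nodes/generator.py.

```python
# Hypothetical templates keyed by the style names listed in the PR description.
PROMPT_STYLES = {
    "strict": (
        "Answer using ONLY the context below. If the context is insufficient, say so.\n"
        "Context:\n{context}\n\nQuestion: {question}"
    ),
    "conversational": (
        "You are a helpful assistant. Use the context to answer naturally.\n"
        "Context:\n{context}\n\nQuestion: {question}"
    ),
    "citations": (
        "Answer from the context. Cite the supporting passage for each claim.\n"
        "Context:\n{context}\n\nQuestion: {question}"
    ),
}


def build_prompt(config: dict, question: str, context: str) -> str:
    """Pick a template by config['prompt_style'] (assumed key), defaulting to 'strict'."""
    style = config.get("prompt_style", "strict")
    if style not in PROMPT_STYLES:
        raise ValueError(
            f"Unknown prompt style {style!r}; expected one of {sorted(PROMPT_STYLES)}"
        )
    return PROMPT_STYLES[style].format(question=question, context=context)


prompt = build_prompt({"prompt_style": "citations"}, "What is RRF?", "RRF fuses ranked lists.")
```

Failing loudly on an unknown style keeps a typo in the config from silently falling back to a different prompt.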

Documentation and Configuration:

  • Added a new CLI_REFERENCE.md documenting all available CLI commands and flags for scripts, ensuring accurate and up-to-date developer reference.
  • Updated .env_example to focus on API keys relevant to LLM providers, removing unused or legacy environment variables.

Summary of Most Important Changes:

Agent Pipeline Refactoring:

  • Removed the legacy agent graph (agent/graph.py) and replaced it with two new, modular graph variants: agent/graph_refined.py (standard RAG pipeline) and agent/graph_self_rag.py (self-correcting, hallucination-resistant pipeline). [1] [2] [3]

Benchmarking and Logging:

  • Introduced a benchmark logger node (agent/nodes/benchmark_logger.py) for saving and summarizing pipeline execution data, integrated into both agent graphs.

Generator Node Improvements:

  • Refactored the generator node to support configurable prompt styles and improved prompt instructions for technical accuracy and flexibility. [1] [2]

Documentation:

  • Added a comprehensive CLI reference (CLI_REFERENCE.md) for all supported scripts and flags.

Configuration:

  • Updated .env_example to include only relevant API keys, removing obsolete variables.

- Remove all configuration merging logic for complete reproducibility
- Implement AutoRAG multi-arm bandit hyperparameter optimization
- Create self-contained benchmark scenarios (no dependencies)
- Add comprehensive configuration validation
- Update all benchmark runners to use isolated configs
- Create activation function and gradient descent visualizations
- Add ISSEL color scheme to all plots
- Implement configuration audit system for experiment tracking
- Fix import issues in benchmarks package
- Add dense-only retrieval configuration examples
…its and add chunking strategies for code-aware processing
…figuration handling

- Removed hyperparameter_spaces.yml file as it is no longer needed.
- Simplified the create_from_unified_config method in RetrievalPipelineFactory to enhance retrieval type detection and configuration merging.
- Deleted unused HuggingFaceEmbedder class from embeddings.py.
- Updated get_embedder function in factory.py to improve error handling and support for HuggingFace embeddings.
- Modified ModernBaseRetriever to enforce required configuration parameters without defaults.
- Enhanced QdrantDenseRetriever and QdrantHybridRetriever to require explicit embedding configurations and improved initialization logic.
- Refactored result fusion methods in QdrantHybridRetriever to utilize alpha weighting for better control over dense and sparse result combinations.
- Updated QdrantSparseRetriever to ensure strict configuration requirements and improved initialization.
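
For readers unfamiliar with alpha-weighted fusion, the idea can be sketched as below. This is an illustration under assumptions, not the QdrantHybridRetriever code: it assumes alpha weights the dense score and (1 - alpha) the sparse score, and that scores are min-max normalized per retriever before mixing. The function names are hypothetical.

```python
from typing import Dict, List, Tuple


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1] so dense and sparse scores are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores equal: treat every document as equally relevant.
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def alpha_fusion(
    dense: Dict[str, float],
    sparse: Dict[str, float],
    alpha: float,
    top_k: int,
) -> List[Tuple[str, float]]:
    """Fuse two score maps as alpha * dense + (1 - alpha) * sparse (assumed convention)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    d, s = min_max_normalize(dense), min_max_normalize(sparse)
    fused = {
        doc_id: alpha * d.get(doc_id, 0.0) + (1.0 - alpha) * s.get(doc_id, 0.0)
        for doc_id in set(d) | set(s)
    }
    # Sort by fused score (descending), breaking ties by document id.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


dense_scores = {"a": 0.9, "b": 0.1}
sparse_scores = {"b": 12.0, "c": 3.0}
dense_only = alpha_fusion(dense_scores, sparse_scores, alpha=1.0, top_k=1)
sparse_only = alpha_fusion(dense_scores, sparse_scores, alpha=0.0, top_k=1)
```

Under this convention, alpha = 1.0 reduces to dense-only ranking and alpha = 0.0 to sparse-only ranking.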
…os; add per-query results export functionality
…tion; add chunk ID mapping utility functions
…ved retrieval performance; adjust top_k values and collection names. Fixed a bug where the benchmark passed only the question title, without the question body.
…lpha parameter for fusion methods, refactor confidence interval calculations, and improve CSV export functionality.
…der support; adjust alpha parameter, refactor fusion methods, and improve dataset configurations.
…M25Embedder and SpladeEmbedder, enhancing model initialization and embedding methods.
…ment abstract methods in BenchmarkAdapter, update SpladeEmbedder to use FastEmbed, and modify dataset configuration for improved model integration.
…sults. The problem was in the YAML files, which pointed to the wrong collection for sparse search.
… provider name to "sparse-splade" for clarity. Refactor experiment name in Experiment1Runner and enhance error handling in retrievers.
… metrics; enhance confidence interval calculation for single value cases.
…eatures

- Deleted obsolete benchmark scenario files: bm25_baseline_full.yml, dense_retrieval_full.yml, hybrid_retrieval_full.yml, and quick_test.yml.
- Created new benchmark scenario files for Experiment 1: bm25_baseline.yml, hybrid_bm25_bge_m3.yml, hybrid_splade_bge_m3.yml, and splade_baseline.yml.
- Updated benchmarks/experiment1.py to streamline experiment execution and enhance result processing.
- Introduced BenchmarkReportGenerator for improved report generation and scenario summaries.
- Added BenchmarkResultsExporter to handle result exports in various formats.
- Implemented BenchmarkStatisticalAnalyzer for comprehensive statistical analysis of benchmark results, including pairwise testing and Bonferroni correction.
- Enhanced retrieval time calculation in benchmarks/benchmarks_runner.py.
- Improved code organization and readability across multiple files.
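
The Bonferroni step mentioned for BenchmarkStatisticalAnalyzer can be illustrated with the short sketch below. It is not that class's implementation: the helper name is hypothetical, the method labels "A", "B", "C" are stand-ins, and the p-values are placeholders chosen for the example, not results from any experiment.

```python
from itertools import combinations
from typing import Dict, Tuple

Pair = Tuple[str, str]


def bonferroni_correct(
    p_values: Dict[Pair, float], alpha: float = 0.05
) -> Dict[Pair, Tuple[float, bool]]:
    """Multiply each raw p-value by the number of comparisons (capped at 1.0)
    and flag it significant if the adjusted value stays below alpha."""
    m = len(p_values)
    corrected = {}
    for pair, p in p_values.items():
        adjusted = min(1.0, p * m)
        corrected[pair] = (adjusted, adjusted < alpha)
    return corrected


# Every unordered pair of methods gets one test, so 3 methods give 3 comparisons.
methods = ["A", "B", "C"]
pairs = list(combinations(methods, 2))

# Placeholder raw p-values, for illustration only.
raw = dict(zip(pairs, [0.01, 0.04, 0.5]))
corrected = bonferroni_correct(raw)
```

In this toy case only the first comparison survives the correction: 0.04 would pass an uncorrected 0.05 threshold but not after being multiplied by three.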
…oading method; add usage guide for dataset adapters.
…rid search for hyperparameter tuning with F1@5 objective.
…rs and natural questions; clean up unused Stack Overflow dataset configurations; refactor contracts for improved clarity and maintainability.
- Updated hybrid Splade configurations to use top_k=10 and adjusted alpha values across multiple YAML files.
- Introduced a new configuration file for Hybrid Splade optimal settings.
- Modified the experiment runner to include the new optimal configuration and updated print statements for clarity.
- Added a new script for optimizing the alpha parameter with a fixed k value, supporting both single metric and composite objective modes.
- Enhanced results exporting functionality to dynamically include metrics in summary comparisons.
- Adjusted normalization processes in retriever classes for consistency and performance.
- Add AdapterLoader for dynamic adapter instantiation from config
- Update ingestion pipeline to load adapters from YAML config
- Update benchmark system to support config-based adapter loading
- Make adapter parameters optional in CLI (can read from config)
- Add adapter field to all benchmark YAML configs (12 files)
- Update StackOverflowBenchmarkAdapter signature for compatibility
- Fix relative imports in benchmark modules
- Add comprehensive documentation:
  - DYNAMIC_ADAPTERS.md - Complete guide for ingestion adapters
  - DYNAMIC_ADAPTERS_QUICKREF.md - Quick reference
  - DYNAMIC_ADAPTER_BENCHMARKS.md - Benchmark adapter guide
  - BENCHMARK_DYNAMIC_ADAPTER_FIX.md - Fix documentation
- Add custom adapter example
- Update contracts.py with missing RetrievalMetrics and EvaluationRun classes
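
Config-driven adapter loading of this kind is commonly built on `importlib`. The sketch below shows the general pattern only; the `class` and `params` config keys and the `load_adapter` function are assumptions for illustration and may not match the real AdapterLoader or the YAML schema used in this repository. A standard-library class stands in for an adapter so the example runs on its own.

```python
import importlib
from typing import Any, Dict


def load_adapter(adapter_config: Dict[str, Any]) -> Any:
    """Instantiate the class named by a dotted path, passing 'params' as keyword arguments.

    Assumed config shape: {"class": "package.module.ClassName", "params": {...}}
    """
    dotted_path = adapter_config["class"]
    module_path, _, class_name = dotted_path.rpartition(".")
    if not module_path:
        raise ValueError(f"Expected 'module.ClassName', got {dotted_path!r}")
    module = importlib.import_module(module_path)
    adapter_cls = getattr(module, class_name)
    return adapter_cls(**adapter_config.get("params", {}))


# A stdlib class stands in for a real adapter in this demo.
example_config = {
    "class": "fractions.Fraction",
    "params": {"numerator": 1, "denominator": 2},
}
adapter = load_adapter(example_config)
```

Because the class is resolved at runtime from a string, adding a new adapter only requires a new config entry pointing at an importable class.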

Benefits:
- No code changes needed to add new adapters (just YAML config)
- Unified dynamic loading for both ingestion and benchmarks
- Better scalability and maintainability
- Improved configuration as code approach
- Updated `hybrid_bge_splade_fixed_k10.yml` to narrow the alpha search range from "0.0:1.0:0.1" to "0.9:1.0:0.02".
- Deleted `custom_adapter_example.py` as it is no longer needed.
- Added new analysis notebook `experiment1_analysis.ipynb` for experiment 1.
- Introduced multiple output files for experiment 1 plots, including PDFs and PNGs for various metrics.
- Created `key_findings.txt` summarizing key results from experiment 1.
- Added LaTeX table `table1_summary_results.tex` summarizing performance metrics of different methods.
- Updated PDF and PNG files for overall performance, precision at k, recall at k, F1 scores, precision-recall tradeoff, NDCG progression, latency analysis, and statistical significance.
- All figures have been regenerated to reflect the latest experimental results.
- Created a new markdown file for stratification justification, detailing the rationale behind the stratification strategy used in experiments.
- Added various analysis result images to the 2D grid results directory, including:
  - Fold analysis
  - Heatmap of composite scores
  - Heatmaps for individual metrics
  - Sensitivity analysis
  - 3D surface plot
  - Test performance
  - Comparison of top configurations
… main.py to ensure environment variables are loaded correctly
- Created README.md for scripts directory detailing available scripts, usage, and configuration options.
- Added README.md for tests directory outlining test structure, running tests, and dependencies.
- Included descriptions of individual test files and their functions for better understanding of the testing framework.
Updated minimum RAM requirement and removed community support section.
…peline

- Updated AgentState schema in `schema.py` to enhance clarity and organization of attributes, including query analysis, routing decisions, and generation modes.
- Modified `config.yml` to switch agent mode to "refined" and updated LLM provider to "ollama" with new model specifications.
- Introduced `llm_factory.py` to streamline LLM instance creation based on configuration, supporting multiple providers.
- Adjusted `main.py` to load configurations dynamically and select the appropriate agent graph based on the mode.
- Removed outdated retrieval configuration guide and added a new README for retrieval configurations, detailing available options and usage.
- Created new retrieval configuration files for dense retrieval with and without reranking, while removing obsolete examples.
- Enhanced documentation for retrieval configurations, including performance comparisons and troubleshooting tips.
- Simplified main.py by removing the configuration loading logic and directly importing the refined graph.
- Updated fast_dense_bge_m3.yml to streamline the retrieval pipeline configuration, removing unnecessary parameters and focusing on essential settings for speed.
- Modified base_retriever.py to allow optional 'top_k' parameter with a default value, enhancing flexibility.
- Changed dense_retriever.py and sparse_retriever.py to retrieve 'text' instead of 'page_content' from payloads, ensuring consistency in document creation.
…dd ground truth generation script for SOSUM dataset
… multi-provider support. Also fixed a bug where the sparse retriever would use the old Qdrant API.
- Implemented demo script (demo_graph_viz.py) for quick visualization of agent graphs in ASCII, Mermaid, and PNG formats.
- Created visualization utility (visualize_graph.py) to handle different modes (standard, self-rag, both) and output formats.
- Added error handling and output directory management for generated visualizations.
- Removed empty scripts (demo_self_rag.py, visualize_cv_splits.py) as they were not utilized.
- Replaced StratifiedKFold with a simple train/test split using train_test_split for hyperparameter optimization.
- Updated class and method documentation to reflect the new splitting strategy.
- Simplified the create_cv_splits method to create_train_test_split with parameters for test size and minimum samples per stratum.
- Enhanced output statistics to include detailed information about train and test distributions.
- Removed unnecessary complexity related to multiple folds, focusing on a single stratified split.
- Removed old Qdrant configuration from .env_example and added API keys for OpenAI, Google, and Voyage.
- Updated model name in llm_as_judge_eval.py to "gpt-5" and changed input/output paths for evaluation results.
- Added new Jupyter notebook for LLM Judge analysis.
- Included various plot images for analysis results in output directory.
- Created a new analysis report in Markdown format summarizing LLM Judge results.
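
The single stratified train/test split that replaced StratifiedKFold can be illustrated as follows. The commit uses scikit-learn's `train_test_split`; the sketch below is a dependency-free stand-in written for illustration, and the function name, parameter names, and the rule for undersized strata are assumptions rather than the behavior of `create_train_test_split`.

```python
import random
from collections import defaultdict
from typing import Hashable, List, Sequence, Tuple


def stratified_train_test_split(
    items: Sequence,
    strata: Sequence[Hashable],
    test_size: float = 0.2,
    min_samples_per_stratum: int = 2,  # assumed meaning: smaller strata are not split
    seed: int = 42,
) -> Tuple[List, List]:
    """Split items once, preserving each stratum's share in train and test."""
    groups = defaultdict(list)
    for item, stratum in zip(items, strata):
        groups[stratum].append(item)

    rng = random.Random(seed)
    train, test = [], []
    for stratum in sorted(groups):
        members = groups[stratum]
        rng.shuffle(members)
        if len(members) < min_samples_per_stratum:
            # Too few samples to split: keep the whole stratum in train.
            train.extend(members)
            continue
        # At least one test sample, but always leave one for training.
        n_test = min(len(members) - 1, max(1, round(len(members) * test_size)))
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


queries = list(range(20))
labels = ["easy"] * 10 + ["hard"] * 10
train_set, test_set = stratified_train_test_split(queries, labels)
```

Each stratum of ten contributes two queries to the test set, so the 50/50 mix of the full set is preserved in both partitions.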
…deprecated scripts

- Updated various experiment plots in the output directory, including overall performance, precision at k, recall at k, F1 scores, precision-recall tradeoff, NDCG progression, latency analysis, statistical significance, and comprehensive dashboard.
- Modified the retrieval configuration in `fast_dense_bge_m3.yml` to implement a hybrid retrieval approach using BGE-M3 and SPLADE with Reciprocal Rank Fusion.
- Added new visualizations for LLM judge analysis, including boxplots, category distributions, and score distributions.
- Removed outdated scripts for graph visualization and self-RAG demo, streamlining the codebase.
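
Reciprocal Rank Fusion, as mentioned for the hybrid BGE-M3 and SPLADE configuration, combines ranked lists using only rank positions. The sketch below shows the standard formula, score(d) = sum over lists of 1 / (k + rank). It is not taken from this repository, and the constant k = 60 is the commonly used default, assumed here because the commit message does not state a value.

```python
from typing import Dict, List, Sequence, Tuple


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[str]],
    k: int = 60,  # assumed smoothing constant; not stated in the PR
    top_k: int = 10,
) -> List[Tuple[str, float]]:
    """Fuse ranked lists of document ids: each list contributes 1 / (k + rank)."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score (descending), breaking ties by document id.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


dense_ranking = ["d1", "d2", "d3"]   # e.g. ids ranked by a dense retriever
sparse_ranking = ["d2", "d4", "d1"]  # e.g. ids ranked by a sparse retriever
fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
fused_ids = [doc_id for doc_id, _ in fused]
```

Because only ranks are used, no score normalization between the dense and sparse retrievers is needed, unlike score-level alpha weighting.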
Removed instructions for running the retrieval demo.
Removed troubleshooting section and integration examples from README.
Removed references to experiment3.py and benchmark run files.
@spyrchat spyrchat closed this Oct 14, 2025