diff --git a/README.md b/README.md
index babb84d..9cc6171 100644
--- a/README.md
+++ b/README.md
@@ -1,24 +1,25 @@
-# Advanced RAG Retrieval System with LangGraph Agent
+# Advanced RAG System with LangGraph Agent & Benchmarking

-A production-ready, modular RAG (Retrieval-Augmented Generation) system with configurable pipelines and LangGraph agent integration.
+A production-ready, configurable RAG (Retrieval-Augmented Generation) system featuring LangGraph agent workflows, modular retrieval pipelines, and comprehensive benchmarking capabilities.

 ## Key Features

-- **YAML-Configurable Pipelines**: Switch retrieval strategies without code changes
-- **LangGraph Agent Integration**: Seamless agent workflows with rich metadata
-- **Modular Components**: Easily extensible rerankers, filters, and retrievers
-- **Multiple Retrieval Methods**: Dense, sparse, and hybrid retrieval
-- **Production Ready**: Robust error handling, logging, and monitoring
-- **A/B Testing Support**: Compare configurations easily
-- **Rich Metadata**: Access scores, methods, and quality metrics
+- **🤖 LangGraph Agent**: Intelligent agent workflows with configurable retrieval
+- **⚙️ YAML-Configurable Pipelines**: Switch retrieval strategies without code changes
+- **🔄 Hybrid Retrieval**: Dense, sparse, and hybrid retrieval methods with RRF fusion
+- **🎯 Advanced Reranking**: CrossEncoder, BGE, and multi-stage reranking
+- **📊 Comprehensive Benchmarking**: Built-in evaluation framework with multiple metrics
+- **🗄️ Vector Database**: Qdrant integration with optimized indexing
+- **🔧 Modular Architecture**: Easily extensible components and filters
+- **📈 Performance Monitoring**: Rich metadata, logging, and health checks

 ## Architecture Overview

 ```
-┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
-│    LangGraph    │────│  Configurable    │────│    Retrieval    │
-│      Agent      │    │ Retriever Agent  │    │    Pipeline     │
-└─────────────────┘    └──────────────────┘    └─────────────────┘
+┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐    ┌─────────────────┐
+│    LangGraph    │────│  Configurable    │────│    Retrieval    │────│  Benchmarking   │
+│      Agent      │    │ Retriever Agent  │    │    Pipeline     │    │    Framework    │
+└─────────────────┘    └──────────────────┘    └─────────────────┘    └─────────────────┘
          │
 ┌────────────────────────────────┼────────────────────────────────┐
 │                                │                                │
@@ -28,173 +29,348 @@ A production-ready, modular RAG (Retrieval-Augmented Generation) system with con
 │• Dense    │  │• CrossEncoder  │  │• Score    │
 │• Sparse   │  │• BGE Reranker  │  │• Content  │
 │• Hybrid   │  │• Multi-stage   │  │• Custom   │
+│• RRF      │  │• Adaptive      │  │• Metadata │
 └───────────┘  └────────────────┘  └───────────┘
 ```

 ## Quick Start

-### 1. Install Dependencies
+### 1. Environment Setup

 ```bash
+# Install dependencies
 pip install -r requirements.txt
+
+# Configure environment variables
+cp .env.example .env
+# Edit .env with your API keys:
+# GOOGLE_API_KEY=your_google_api_key
+# OPENAI_API_KEY=your_openai_api_key
+# QDRANT_HOST=localhost
+# QDRANT_PORT=6333
 ```

-### 2. 
Configure Environment +### 2. Start Vector Database ```bash -# Copy example config -cp config.yml.example config.yml +# Using Docker Compose (recommended) +docker-compose up -d qdrant -# Set up your API keys and database connections in config.yml +# Or run Qdrant directly +docker run -p 6333:6333 qdrant/qdrant ``` -### 3. Start Using the System +### 3. Interactive Chat with Agent -```python -# main.py - Chat with your agent -from agent.graph import graph +```bash +# Start the interactive chat agent +python main.py +``` -state = {"question": "How to handle Python exceptions?"} -result = graph.invoke(state) -print(result["answer"]) +Example conversation: +``` +You: How to handle exceptions in Python? +Agent: Python provides several mechanisms for exception handling... ``` -### 4. Switch Retrieval Configurations +### 4. Configuration Management ```bash -# List available configurations +# List available retrieval configurations python bin/switch_agent_config.py --list -# Switch to advanced reranked pipeline -python bin/switch_agent_config.py advanced_reranked +# Switch to different retrieval strategy +python bin/switch_agent_config.py modern_hybrid -# Test the configuration -python test_agent_retriever_node.py +# Test the new configuration +python -c " +from agent.graph import graph +result = graph.invoke({'question': 'test query'}) +print(result['answer']) +" ``` ## Available Configurations -| Configuration | Description | Components | Use Case | -|---------------|-------------|------------|----------| -| `basic_dense` | Simple dense retrieval | Dense retriever only | Development, testing | -| `advanced_reranked` | Production quality | Dense + CrossEncoder + filters | Production RAG | -| `hybrid_multistage` | Best performance | Hybrid + multi-stage reranking | High-quality results | -| `experimental` | Latest features | BGE reranker + custom filters | Experimentation | +| Configuration | Description | Retrieval Method | Performance | Use Case | 
+|---------------|-------------|------------------|-------------|----------| +| `ci_google_gemini` | CI/CD optimized | Dense only | Fast | Testing, CI | +| `fast_hybrid` | Speed optimized | Hybrid + RRF | Very Fast | Production chat | +| `modern_dense` | Dense semantic | Dense + Reranking | Medium | Semantic search | +| `modern_hybrid` | Best quality | Hybrid + CrossEncoder | Slower | Research, Q&A | + +### Benchmark Scenarios -## ๐Ÿ”ง **Configuration Example** +| Scenario | Focus | Components | Metrics | +|----------|-------|------------|---------| +| `dense_baseline` | Simple dense retrieval | Google embeddings | Precision@K, Recall@K | +| `hybrid_retrieval` | Dense + sparse fusion | RRF fusion | MRR, NDCG | +| `hybrid_reranking` | Full reranking pipeline | CrossEncoder + filters | F1, MAP | +| `sparse_bm25` | Traditional IR | BM25 only | Baseline metrics | +## Configuration Examples + +### Fast Hybrid Configuration ```yaml -# pipelines/configs/retrieval/advanced_reranked.yml +# pipelines/configs/retrieval/fast_hybrid.yml +description: "Fast hybrid retrieval optimized for agent response speed" + retrieval_pipeline: retriever: - type: dense + type: hybrid top_k: 10 + score_threshold: 0.05 + fusion_method: rrf + + fusion: + method: rrf + rrf_k: 50 + dense_weight: 0.8 + sparse_weight: 0.2 + + embedding: + strategy: hybrid + dense: + provider: google + model: models/embedding-001 + dimensions: 768 + api_key_env: GOOGLE_API_KEY +``` + +### Modern Dense with Reranking +```yaml +# pipelines/configs/retrieval/modern_dense.yml +description: "Dense semantic retrieval with Google embeddings and neural reranking" + +retrieval_pipeline: + retriever: + type: dense + top_k: 15 + score_threshold: 0.0 stages: - type: reranker config: model_type: cross_encoder - model_name: "ms-marco-MiniLM-L-6-v2" + model_name: "cross-encoder/ms-marco-MiniLM-L-6-v2" + top_k: 10 - type: filter config: type: score - min_score: 0.5 - - - type: answer_enhancer - config: - boost_factor: 2.0 + 
min_score: 0.3
```

## Project Structure

```
Thesis/
-├── agent/                      # LangGraph agent implementation
-│   ├── graph.py                # Main agent graph
-│   ├── schema.py               # Agent state schemas
-│   └── nodes/                  # Agent nodes (retriever, generator, etc.)
+├── main.py                     # Interactive chat application
+├── config.yml                  # Main configuration file
+├── docker-compose.yml          # Docker services (Qdrant, PostgreSQL)
+├── requirements.txt            # Python dependencies
+├── .env                        # Environment variables
+│
+├── agent/                      # LangGraph Agent System
+│   ├── graph.py                # Main agent workflow graph
+│   ├── schema.py               # Agent state definitions
+│   └── nodes/                  # Agent workflow nodes
+│       ├── retriever.py        # Configurable retriever node
+│       ├── generator.py        # Response generation node
+│       ├── query_interpreter.py # Query analysis node
+│       └── memory_updater.py   # Conversation memory node
+│
+├── components/                 # Modular Retrieval Components
+│   ├── retrieval_pipeline.py   # Core pipeline framework
+│   ├── rerankers.py            # CrossEncoder, BGE rerankers
+│   ├── advanced_rerankers.py   # Multi-stage reranking
+│   └── filters.py              # Score, content, metadata filters
 │
-├── components/                 # Modular retrieval components
-│   ├── retrieval_pipeline.py   # Main pipeline orchestrator
-│   ├── rerankers.py            # Reranking implementations
-│   ├── filters.py              # Filtering implementations
-│   └── advanced_rerankers.py   # Advanced reranking strategies
+├── pipelines/                  # Data Pipelines & Configurations
+│   ├── configs/
+│   │   └── retrieval/          # YAML retrieval configurations
+│   │       ├── fast_hybrid.yml
+│   │       ├── modern_dense.yml
+│   │       ├── modern_hybrid.yml
+│   │       └── ci_google_gemini.yml
+│   ├── adapters/               # Dataset adapters (BEIR, custom)
+│   └── ingest/                 # Data ingestion pipeline
 │
-├── pipelines/                  # Data processing and configuration
-│   ├── configs/retrieval/      # Retrieval pipeline configurations
-│   ├── adapters/               # Dataset adapters (BEIR, etc.)
-│   └── ingest/                 # Data ingestion pipeline
+├── benchmarks/                 # Evaluation Framework
+│   ├── benchmarks_runner.py    # Main benchmark orchestrator
+│   ├── benchmarks_metrics.py   # Precision, Recall, NDCG, MRR
+│   ├── benchmarks_adapters.py  # Dataset adapters for evaluation
+│   └── run_real_benchmark.py   # Real data benchmarking
 │
-├── bin/                        # Command-line utilities
-│   ├── switch_agent_config.py  # Configuration management
-│   ├── agent_retriever.py      # Configurable retriever agent
-│   └── retrieval_pipeline.py   # Direct pipeline usage
+├── benchmark_scenarios/        # Predefined Benchmark Configurations
+│   ├── dense_baseline.yml      # Simple dense retrieval
+│   ├── hybrid_retrieval.yml    # Hybrid dense+sparse
+│   ├── hybrid_reranking.yml    # Full reranking pipeline
+│   └── sparse_bm25.yml         # BM25 baseline
 │
-├── docs/                       # Documentation
-│   ├── SYSTEM_EXTENSION_GUIDE.md # Complete extension guide
-│   ├── AGENT_INTEGRATION.md    # Agent integration details
-│   ├── CODE_CLEANUP_SUMMARY.md # Code cleanup documentation
-│   └── EXTENSIBILITY.md        # Quick extensibility overview
+├── bin/                        # Command-line Utilities
+│   ├── switch_agent_config.py  # Configuration management
+│   ├── agent_retriever.py      # Standalone retriever CLI
+│   ├── qdrant_inspector.py     # Database inspection tool
+│   └── ingest.py               # Data ingestion utility
 │
-├── tests/                      # Test suite
-│   ├── retrieval/              # Retrieval pipeline tests
-│   └── agent/                  # Agent integration tests
+├── database/                   # Database Controllers
+│   ├── qdrant_controller.py    # Vector database operations
+│   └── postgres_controller.py  # Relational database operations
 │
-├── deprecated/                 # Legacy code (organized)
-│   ├── old_processors/         # Superseded by new pipeline
-│   ├── old_debug_scripts/      # Legacy debugging tools
-│   └── old_playground/         # Legacy test scripts
+├── embedding/                  # Embedding & Text Processing
+│   ├── factory.py              # Embedding provider factory
+│   ├── bedrock_embeddings.py   # AWS Bedrock embeddings
+│   ├── sparse_embedder.py      # BM25 sparse embeddings
+│   └── processor.py            # Text processing utilities
 │
-├── database/                   # Database controllers
-├── embedding/                  # Embedding utilities
-├── retrievers/                 # Base retrievers
-├── examples/                   # Usage examples
-└── config/                     # Configuration utilities
+├── tests/                      # Comprehensive Test Suite
+│   ├── pipeline/               # Pipeline component tests
+│   │   ├── test_minimal_pipeline.py
+│   │   ├── test_qdrant_connectivity.py
+│   │   └── test_end_to_end.py
+│   └── requirements-minimal.txt
+│
+├── docs/                       # Documentation
+│   ├── PROJECT_STRUCTURE.md    # Detailed project structure
+│   ├── QUICK_START_GUIDE.md    # Getting started guide
+│   └── SOSUM_INGESTION.md      # Dataset ingestion guide
+│
+└── logs/                       # Application Logs
+    ├── agent.log               # Agent workflow logs
+    └── query_interpreter.log   # Query processing logs
+```
+
+## Benchmarking & Evaluation
+
+### Running Benchmarks
+
+```bash
+# Run comprehensive benchmark with real StackOverflow data
+python benchmarks/run_real_benchmark.py
+
+# Run specific benchmark scenario
+python benchmarks/run_benchmark_optimization.py --scenario hybrid_reranking
+
+# Quick performance test
+python benchmarks/run_benchmark_optimization.py --scenario quick_test
+```
+
+### Available Metrics
+
+- **Precision@K**: Fraction of relevant documents in top-K results
+- **Recall@K**: Fraction of relevant documents retrieved in top-K
+- **MRR (Mean Reciprocal Rank)**: Average reciprocal rank of first relevant result
+- **NDCG@K**: Normalized Discounted Cumulative Gain
+- **F1 Score**: 
Harmonic mean of precision and recall +- **MAP (Mean Average Precision)**: Mean of precision scores at each relevant document + +### Custom Benchmark Configuration + +```yaml +# benchmark_scenarios/custom_scenario.yml +scenario_name: "custom_hybrid" +description: "Custom hybrid retrieval evaluation" + +benchmark: + retrieval: + strategy: hybrid + top_k: 20 + score_threshold: 0.0 + evaluation: + k_values: [1, 5, 10, 20] + metrics: ["precision", "recall", "mrr", "ndcg"] + +retrieval_pipeline: + retriever: + type: hybrid + fusion_method: rrf + # ... configuration details ``` ## Testing +### Run Test Suite + ```bash -# Test agent integration -python test_agent_retriever_node.py +# Run minimal pipeline tests (CI-friendly) +python -m pytest tests/pipeline/test_minimal_pipeline.py -v + +# Run all pipeline tests +python -m pytest tests/pipeline/ -v -# Run all tests -python tests/run_all_tests.py +# Run tests with coverage +python -m pytest tests/pipeline/ --cov=components --cov=agent -# Test specific components -python -m pytest tests/retrieval/ -v +# Run specific test categories +python -m pytest tests/pipeline/ -m "not requires_api" # No API required +python -m pytest tests/pipeline/ -m "requires_api" # Requires API keys +``` + +### Test Qdrant Connectivity + +```bash +# Test vector database connection +python -m pytest tests/pipeline/test_qdrant_connectivity.py -v + +# Inspect Qdrant collections +python bin/qdrant_inspector.py --list-collections +python bin/qdrant_inspector.py --collection-info sosum_stackoverflow_hybrid_v1 +``` + +### Integration Testing + +```bash +# Test agent with different configurations +python -c " +from agent.graph import graph +configs = ['fast_hybrid', 'modern_dense', 'modern_hybrid'] +for config in configs: + print(f'Testing {config}...') + # Switch config and test +" ``` ## Documentation -- **[System Extension Guide](docs/SYSTEM_EXTENSION_GUIDE.md)** - Complete guide to extending the system -- **[Agent 
Integration](docs/AGENT_INTEGRATION.md)** - How the agent uses configurable pipelines -- **[Code Cleanup Summary](docs/CODE_CLEANUP_SUMMARY.md)** - Professional code standards and cleanup details -- **[Extensibility Overview](docs/EXTENSIBILITY.md)** - Quick overview of extension capabilities -- **[Architecture](docs/MLOPS_PIPELINE_ARCHITECTURE.md)** - System architecture details +- **[Project Structure](docs/PROJECT_STRUCTURE.md)** - Detailed project organization +- **[Quick Start Guide](docs/QUICK_START_GUIDE.md)** - Getting started tutorial +- **[SOSUM Ingestion](docs/SOSUM_INGESTION.md)** - Dataset ingestion guide +- **[MLOps Architecture](docs/MLOPS_PIPELINE_ARCHITECTURE.md)** - System architecture details ## Extending the System ### Add a Custom Reranker ```python -# components/my_reranker.py -from .rerankers import BaseReranker - -class MyCustomReranker(BaseReranker): - def rerank(self, query: str, documents: List[Document]) -> List[Document]: +# components/my_custom_reranker.py +from components.retrieval_pipeline import Reranker, RetrievalResult +from typing import List + +class MyCustomReranker(Reranker): + @property + def component_name(self) -> str: + return "my_custom_reranker" + + def process(self, query: str, results: List[RetrievalResult], **kwargs) -> List[RetrievalResult]: # Your custom reranking logic - for doc in documents: - doc.metadata["score"] = self.calculate_score(query, doc.page_content) + for result in results: + result.score = self.calculate_custom_score(query, result.document.page_content) + result.metadata["reranked_by"] = self.component_name - return sorted(documents, key=lambda x: x.metadata["score"], reverse=True) + return sorted(results, key=lambda x: x.score, reverse=True) + + def calculate_custom_score(self, query: str, content: str) -> float: + # Implement your scoring logic + return 0.5 # Placeholder ``` ### Create a New Configuration ```yaml # pipelines/configs/retrieval/my_config.yml +description: "My custom retrieval 
configuration" + retrieval_pipeline: retriever: type: hybrid @@ -205,62 +381,212 @@ retrieval_pipeline: config: model_type: my_custom custom_param: "value" + + - type: filter + config: + type: score + min_score: 0.4 ``` ### Switch and Test ```bash +# Switch to your configuration python bin/switch_agent_config.py my_config -python test_agent_retriever_node.py + +# Test the new configuration +python -c " +from agent.graph import graph +result = graph.invoke({'question': 'test query', 'chat_history': []}) +print(f'Answer: {result[\"answer\"]}') +print(f'Retrieved docs: {len(result.get(\"retrieved_documents\", []))}') +" ``` -## Production Usage +## Production Features + +### Performance Optimization +- **Lazy Initialization**: Components load only when needed +- **Connection Pooling**: Efficient database connection management +- **Batch Processing**: Optimized embedding and reranking batches +- **Caching**: LRU caching for repeated queries and embeddings +- **Async Operations**: Non-blocking I/O for better throughput + +### Monitoring & Observability +- **Structured Logging**: JSON logs with correlation IDs +- **Performance Metrics**: Response times, cache hit rates, error rates +- **Health Checks**: Database connectivity, model availability +- **Rich Metadata**: Retrieval paths, scores, and method tracking + +### Configuration Management +- **Environment-based Configs**: Different configs per environment +- **Hot Reloading**: Switch configurations without restart +- **Validation**: Schema validation for all configurations +- **Rollback**: Easy rollback to previous configurations + +### Error Handling +- **Graceful Degradation**: Fallback to simpler methods on failures +- **Circuit Breakers**: Prevent cascade failures +- **Retry Logic**: Exponential backoff for transient failures +- **Comprehensive Logging**: Detailed error context and stack traces -The system is designed for production use with: +## Use Cases & Applications -- **Robust Error Handling**: Graceful 
degradation when components fail
-- **Comprehensive Logging**: Monitor retrieval performance and quality
-- **Configuration Management**: Easy deployment of different strategies
-- **Performance Optimization**: Efficient batching and caching support
-- **Monitoring Ready**: Built-in metrics and health checks
+### Document Q&A Systems
+- **Knowledge Base Search**: Corporate wikis, documentation, FAQs
+- **Research Assistance**: Academic papers, technical documentation
+- **Customer Support**: Automated response generation with context
-## Use Cases
+### Code & Technical Search
+- **Semantic Code Search**: Find code snippets by functionality
+- **API Documentation**: Contextual API usage examples
+- **Stack Overflow Integration**: Programming Q&A with real data
+
+### Domain-Specific Applications
+- **Legal Research**: Case law, regulations, legal precedents
+- **Medical Literature**: Research papers, clinical guidelines
+- **Financial Analysis**: Reports, earnings calls, market research
+
+### Multi-Modal Retrieval
+- **Table Extraction**: Structured data from documents
+- **Image-Text Retrieval**: Combined visual and textual search
+- **Temporal Queries**: Time-aware information retrieval
-- **Document Q&A Systems**: High-quality retrieval for knowledge bases
-- **Research Assistants**: Multi-modal retrieval for academic content
-- **Customer Support**: Context-aware response generation
-- **Code Search**: Semantic search over codebases
-- **Legal Research**: Precise retrieval from legal documents

## Contributing

-1. Fork the repository
-2. Create a feature branch
-3. Add your extension following the patterns in `docs/SYSTEM_EXTENSION_GUIDE.md`
-4. Add tests for your components
-5. Submit a pull request
+1. **Fork and Clone the Repository**
+   ```bash
+   # Fork on GitHub first, then clone your fork
+   git clone https://github.com/your-org/thesis-rag-system
+   cd thesis-rag-system
+   ```
+
+2. 
**Set Up Development Environment**
+   ```bash
+   python -m venv venv
+   source venv/bin/activate  # Linux/Mac
+   pip install -r requirements.txt
+   pip install -r tests/requirements-minimal.txt
+   ```
+
+3. **Create Feature Branch**
+   ```bash
+   git checkout -b feature/my-new-component
+   ```
+
+4. **Follow Extension Patterns**
+   - Add new components to `components/`
+   - Create configuration files in `pipelines/configs/retrieval/`
+   - Add tests to `tests/pipeline/`
+   - Update documentation
+
+5. **Run Tests**
+   ```bash
+   python -m pytest tests/pipeline/ -v
+   python -m pytest tests/pipeline/test_minimal_pipeline.py
+   ```
+
+6. **Submit Pull Request**
+   - Ensure all tests pass
+   - Include benchmark results if applicable
+   - Update documentation for new features
+
+### Development Guidelines
+
+- **Code Style**: Follow PEP 8, use type hints
+- **Testing**: Write tests for new components
+- **Documentation**: Update README and docstrings
+- **Configuration**: Provide example YAML configs
+- **Backwards Compatibility**: Maintain API compatibility
+
+## Docker Deployment
+
+### Using Docker Compose
+
+```bash
+# Start all services
+docker-compose up -d
+
+# Rebuild and restart the app service after code changes
+docker-compose up -d --build app
+
+# View logs for the app service
+docker-compose logs -f app
-## Performance
+
+# Stop services
+docker-compose down
+```
+
+### Custom Docker Build
-The system supports various performance optimization strategies:
+```bash
+# Build the image
+docker build -t my-rag-system . 
+
+# Run with environment variables
+docker run -d \
+  -e GOOGLE_API_KEY=your_key \
+  -e QDRANT_HOST=qdrant \
+  -p 8000:8000 \
+  my-rag-system
+```
-- **Caching**: LRU caching for repeated queries
-- **Batching**: Efficient batch processing for rerankers
-- **Adaptive Top-K**: Dynamic result count based on query complexity
-- **Multi-threading**: Parallel processing for pipeline stages
+## Troubleshooting
-## Migration from Legacy
+### Common Issues
-If you have existing code using the deprecated `processors/` system:
+**Qdrant Connection Failed**
+```bash
+# Check Qdrant status
+docker ps | grep qdrant
+curl http://localhost:6333/healthz
-1. Check `deprecated/old_processors/` for reference
-2. Use the new pipeline configurations in `pipelines/configs/retrieval/`
-3. Follow the migration patterns in `docs/AGENT_INTEGRATION.md`
+
+# Restart Qdrant
+docker-compose restart qdrant
+```
+
+**API Key Issues**
+```bash
+# Check environment variables
+echo $GOOGLE_API_KEY
+echo $OPENAI_API_KEY
+
+# Test API connectivity
+python -c "
+from embedding.factory import EmbeddingFactory
+factory = EmbeddingFactory()
+embedder = factory.create_embedder('google')
+print('Google API working!')
+"
+```
+
+**Configuration Not Found**
+```bash
+# List available configurations
+python bin/switch_agent_config.py --list
+
+# Validate configuration
+python -c "
+import yaml
+with open('pipelines/configs/retrieval/modern_hybrid.yml') as f:
+    config = yaml.safe_load(f)
+    print('Configuration valid!')
+"
+```

## License

-This project is licensed under the MIT License - see the LICENSE file for details.
+This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details. 
+ +## Acknowledgments + +- **LangGraph**: Agent workflow orchestration +- **Qdrant**: High-performance vector database +- **Sentence Transformers**: Embedding models and rerankers +- **Google AI**: Embedding API services +- **BEIR**: Benchmark datasets for information retrieval --- -**Ready to build amazing RAG systems?** Start with the [System Extension Guide](docs/SYSTEM_EXTENSION_GUIDE.md)! +**Ready to build production RAG systems?** Start with our [Quick Start Guide](docs/QUICK_START_GUIDE.md)! diff --git a/docker-compose.yml b/docker-compose.yml index ab4c6dd..bad2a02 100644 --- a/docker-compose.yml +++ b/docker-compose.yml @@ -8,17 +8,6 @@ services: volumes: - qdrant_data:/qdrant/storage - postgres: - image: postgres:14 - environment: - POSTGRES_USER: admin - POSTGRES_PASSWORD: admin - POSTGRES_DB: tableDB - ports: - - "5432:5432" - volumes: - - postgres_data:/var/lib/postgresql/data - app: build: context: . @@ -26,16 +15,10 @@ services: environment: - QDRANT_HOST=qdrant - QDRANT_PORT=6333 - - POSTGRES_HOST=postgres - - POSTGRES_PORT=5432 - - POSTGRES_USER=admin - - POSTGRES_PASSWORD=admin - - POSTGRES_DB=tableDB volumes: - .:/app depends_on: - qdrant - - postgres ports: - "8000:8000" # optional, if you add an API later working_dir: /app @@ -43,4 +26,4 @@ services: volumes: qdrant_data: - postgres_data: + diff --git a/docs/PROJECT_STRUCTURE.md b/docs/PROJECT_STRUCTURE.md index ce14846..87333a6 100644 --- a/docs/PROJECT_STRUCTURE.md +++ b/docs/PROJECT_STRUCTURE.md @@ -20,7 +20,10 @@ agent/ โ”œโ”€โ”€ graph.py # LangGraph agent workflow โ”œโ”€โ”€ schema.py # Agent state schema โ””โ”€โ”€ nodes/ - โ””โ”€โ”€ retriever.py # Configurable retriever node + โ”œโ”€โ”€ retriever.py # Configurable retriever node + โ”œโ”€โ”€ generator.py # Response generation node + โ”œโ”€โ”€ query_interpreter.py # Query analysis node + โ””โ”€โ”€ memory_updater.py # Conversation memory node ``` ### Components (Modular Pipeline System) @@ -54,7 +57,7 @@ embedding/ โ”œโ”€โ”€ __init__.py 
โ”œโ”€โ”€ factory.py # Embedding factory โ”œโ”€โ”€ bedrock_embeddings.py # AWS Bedrock embeddings -โ”œโ”€โ”€ hf_embedder.py # HuggingFace embeddings +โ”œโ”€โ”€ embeddings.py # Core embedding utilities โ”œโ”€โ”€ processor.py # Embedding processing โ”œโ”€โ”€ recursive_splitter.py # Document splitting โ”œโ”€โ”€ sparse_embedder.py # Sparse embeddings @@ -65,28 +68,81 @@ embedding/ ### Pipeline Configurations ``` pipelines/ +โ”œโ”€โ”€ README.md # Pipeline documentation +โ”œโ”€โ”€ __init__.py +โ”œโ”€โ”€ contracts.py # Core pipeline contracts โ”œโ”€โ”€ configs/ โ”‚ โ””โ”€โ”€ retrieval/ # YAML retrieval configurations -โ”‚ โ”œโ”€โ”€ stackoverflow_minilm.yml -โ”‚ โ”œโ”€โ”€ hybrid_basic.yml -โ”‚ โ””โ”€โ”€ advanced_ensemble.yml +โ”‚ โ”œโ”€โ”€ ci_google_gemini.yml +โ”‚ โ”œโ”€โ”€ fast_hybrid.yml +โ”‚ โ”œโ”€โ”€ modern_dense.yml +โ”‚ โ””โ”€โ”€ modern_hybrid.yml โ”œโ”€โ”€ adapters/ # Data adapters +โ”œโ”€โ”€ eval/ # Evaluation components โ””โ”€โ”€ ingest/ # Ingestion pipelines ``` ### CLI Tools ``` bin/ +โ”œโ”€โ”€ __init__.py โ”œโ”€โ”€ agent_retriever.py # CLI agent retriever -โ”œโ”€โ”€ switch_agent_config.py # Configuration switching utility -โ””โ”€โ”€ qdrant_inspector.py # Qdrant inspection tool +โ”œโ”€โ”€ ingest.py # Data ingestion utility +โ”œโ”€โ”€ qdrant_inspector.py # Qdrant inspection tool +โ”œโ”€โ”€ retrieval_pipeline.py # Direct pipeline usage +โ””โ”€โ”€ switch_agent_config.py # Configuration switching utility ``` ### Examples ``` -examples/ -โ”œโ”€โ”€ simple_qa_agent.py # Simple Q&A agent example -โ””โ”€โ”€ (other examples...) 
+# Note: Examples directory not present in current structure +# Usage examples are provided in documentation and test files +``` + +### Benchmarking System +``` +benchmarks/ +โ”œโ”€โ”€ __init__.py +โ”œโ”€โ”€ benchmark_contracts.py # Benchmark interfaces +โ”œโ”€โ”€ benchmark_optimizer.py # Configuration optimization +โ”œโ”€โ”€ benchmarks_adapters.py # Dataset adapters for evaluation +โ”œโ”€โ”€ benchmarks_metrics.py # Evaluation metrics (Precision, Recall, NDCG) +โ”œโ”€โ”€ benchmarks_runner.py # Main benchmark orchestrator +โ”œโ”€โ”€ run_benchmark_optimization.py # Optimization scripts +โ””โ”€โ”€ run_real_benchmark.py # Real data benchmarking +``` + +### Benchmark Scenarios +``` +benchmark_scenarios/ +โ”œโ”€โ”€ dense_baseline.yml # Simple dense retrieval +โ”œโ”€โ”€ dense_high_precision.yml # High precision dense config +โ”œโ”€โ”€ dense_high_recall.yml # High recall dense config +โ”œโ”€โ”€ hybrid_advanced.yml # Advanced hybrid configuration +โ”œโ”€โ”€ hybrid_reranking.yml # Full reranking pipeline +โ”œโ”€โ”€ hybrid_retrieval.yml # Basic hybrid retrieval +โ”œโ”€โ”€ hybrid_weighted.yml # Weighted hybrid approach +โ”œโ”€โ”€ quick_test.yml # Quick performance test +โ””โ”€โ”€ sparse_bm25.yml # BM25 baseline +``` + +### Additional Components +``` +datasets/ # Dataset storage +โ”œโ”€โ”€ sosum/ # SOSum Stack Overflow dataset + +extraction_output/ # Table extraction results +โ”œโ”€โ”€ *.csv # Extracted tables from documents + +logs/ # Application logs +โ”œโ”€โ”€ agent.log # Agent workflow logs +โ”œโ”€โ”€ query_interpreter.log # Query processing logs +โ””โ”€โ”€ (other log files...) 
+ +playground/ # Development and testing scripts +processors/ # Legacy processing components +retrievers/ # Base retriever implementations +scripts/ # Utility scripts ``` ## ๐Ÿงช Test Organization @@ -96,9 +152,20 @@ All tests are now organized under the `tests/` directory with clear categorizati ### Test Structure ``` tests/ -โ”œโ”€โ”€ run_all_tests.py # Main test runner -โ”œโ”€โ”€ test_agent_retrieval.py # Agent integration tests -โ”œโ”€โ”€ agent/ # Agent-specific tests +โ”œโ”€โ”€ __init__.py +โ”œโ”€โ”€ requirements-minimal.txt # Minimal test dependencies +โ””โ”€โ”€ pipeline/ # Pipeline component tests + โ”œโ”€โ”€ __init__.py + โ”œโ”€โ”€ run_tests.py # Test runner + โ”œโ”€โ”€ test_components.py # Component integration tests + โ”œโ”€โ”€ test_config.py # Configuration validation tests + โ”œโ”€โ”€ test_end_to_end.py # End-to-end pipeline tests + โ”œโ”€โ”€ test_minimal.py # Minimal functionality tests + โ”œโ”€โ”€ test_minimal_pipeline.py # CI-friendly minimal tests + โ”œโ”€โ”€ test_qdrant.py # Qdrant database tests + โ”œโ”€โ”€ test_qdrant_connectivity.py # Database connectivity tests + โ””โ”€โ”€ test_runner.py # Test execution utilities +``` โ”‚ โ””โ”€โ”€ test_retriever_node.py โ”œโ”€โ”€ components/ # Component unit tests โ”‚ โ”œโ”€โ”€ test_retrieval_pipeline.py diff --git a/docs/QUICK_START_GUIDE.md b/docs/QUICK_START_GUIDE.md index 5f0c4c1..e4afc9f 100644 --- a/docs/QUICK_START_GUIDE.md +++ b/docs/QUICK_START_GUIDE.md @@ -1,6 +1,14 @@ -# Quick Start Guide: Implementing MLOps Pipeline for RAG +# Quick Start Guide: Understanding the MLOps Pipeline for RAG -This guide provides a step-by-step walkthrough for implementing the MLOps pipeline architecture in your own projects. +**โš ๏ธ Important Note**: This guide provides a simplified tutorial for understanding the MLOps concepts. The actual project has a much more sophisticated implementation with advanced features like hybrid embeddings, multiple chunking strategies, agent workflows, and comprehensive benchmarking. 
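
The "comprehensive benchmarking" mentioned above rests on standard IR measures; the project computes them in `benchmarks/benchmarks_metrics.py`. As a minimal, self-contained sketch — an illustration of the math, not the project's actual code — Precision@K, Recall@K, and MRR can be computed like this:

```python
from typing import List, Set

def precision_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved doc ids that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant doc ids that appear in the top-k."""
    return sum(1 for d in ranked[:k] if d in relevant) / len(relevant)

def mrr(rankings: List[List[str]], relevants: List[Set[str]]) -> float:
    """Mean reciprocal rank of the first relevant hit per query."""
    total = 0.0
    for ranked, relevant in zip(rankings, relevants):
        for i, d in enumerate(ranked, start=1):
            if d in relevant:
                total += 1.0 / i
                break  # only the first relevant hit counts
    return total / len(rankings)

ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(precision_at_k(ranked, relevant, 2))  # 0.5
print(recall_at_k(ranked, relevant, 4))     # 1.0
print(mrr([ranked], [relevant]))            # 0.5 (first relevant doc at rank 2)
```

The real framework additionally handles graded relevance (NDCG) and aggregates over the `k_values` configured per benchmark scenario.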
+ +**To use the actual project:** +- See `README.md` for setup instructions +- Use the CLI: `python bin/ingest.py --help` +- Check `docs/SOSUM_INGESTION.md` for real dataset examples +- Review `docs/MLOPS_PIPELINE_ARCHITECTURE.md` for detailed architecture + +This guide provides a step-by-step walkthrough for implementing a simplified version of the MLOps pipeline architecture. ## Prerequisites @@ -371,6 +379,13 @@ class DocumentChunker: return chunked_docs ``` +**Note**: The actual implementation (`pipelines/ingest/chunker.py`) has multiple advanced chunking strategies: +- `RecursiveChunkingStrategy`: Basic recursive character splitting +- `SemanticChunkingStrategy`: Sentence-boundary aware chunking +- `CodeAwareChunkingStrategy`: Preserves code blocks and functions +- `TableAwareChunkingStrategy`: Preserves table structure +- `ChunkingStrategyFactory`: Factory for strategy selection + ### Simple Embedder (`pipelines/ingest/embedder.py`) ```python """Embedding generation.""" @@ -451,160 +466,99 @@ class EmbeddingPipeline: ) ``` -## 6. Create Simple CLI Interface (30 minutes) +**Note**: The actual implementation (`pipelines/ingest/embedder.py`) is more sophisticated with: +- Support for dense, sparse, and hybrid embedding strategies +- Caching and error handling +- Batch processing with progress bars +- Integration with multiple embedding providers (HuggingFace, Google, AWS Bedrock) -### CLI Script (`bin/ingest.py`) -```python -#!/usr/bin/env python3 -"""Simple ingestion CLI.""" -import argparse -import yaml -import logging -from pathlib import Path -import sys -import importlib +## 6. Use the Actual CLI Interface (15 minutes) -# Add project root to path -sys.path.insert(0, str(Path(__file__).parent.parent)) +The actual project has a sophisticated CLI with subcommands. 
Here's how to use it: -from pipelines.ingest.chunker import DocumentChunker -from pipelines.ingest.embedder import EmbeddingPipeline +### CLI Usage Examples +```bash +# View available commands +python bin/ingest.py --help -# Configure logging -logging.basicConfig( - level=logging.INFO, - format='%(asctime)s - %(name)s - %(levelname)s - %(message)s' -) -logger = logging.getLogger(__name__) +# Ingest a dataset (dry run) +python bin/ingest.py ingest natural_questions /path/to/data --config config.yml --dry-run --max-docs 100 -def load_adapter(adapter_name: str, data_path: str): - """Dynamically load dataset adapter.""" - module_name = f"pipelines.adapters.{adapter_name}" - module = importlib.import_module(module_name) - - # Find adapter class (assumes pattern: XxxDatasetAdapter) - adapter_class = None - for attr_name in dir(module): - attr = getattr(module, attr_name) - if (isinstance(attr, type) and - hasattr(attr, 'source_name') and - attr_name.endswith('Adapter')): - adapter_class = attr - break - - if not adapter_class: - raise ValueError(f"No adapter class found in {module_name}") - - return adapter_class(data_path) - -def main(): - parser = argparse.ArgumentParser(description="Simple RAG Ingestion Pipeline") - parser.add_argument("config", help="Configuration file path") - parser.add_argument("--dry-run", action="store_true", help="Run without uploading") - parser.add_argument("--max-docs", type=int, help="Limit number of documents") - parser.add_argument("--verbose", "-v", action="store_true", help="Verbose logging") - - args = parser.parse_args() - - if args.verbose: - logging.getLogger().setLevel(logging.DEBUG) - - # Load configuration - with open(args.config, 'r') as f: - config = yaml.safe_load(f) - - logger.info(f"Starting ingestion with config: {args.config}") - - try: - # Load dataset adapter - adapter = load_adapter( - config["dataset"]["adapter"], - config["dataset"]["path"] - ) - logger.info(f"Loaded adapter: {adapter.source_name}") - - # Read data - 
rows = list(adapter.read_rows()) - if args.max_docs: - rows = rows[:args.max_docs] - logger.info(f"Read {len(rows)} rows") - - # Convert to documents - documents = adapter.to_documents(rows, split="all") - logger.info(f"Created {len(documents)} documents") - - # Chunk documents - chunker = DocumentChunker(config["chunking"]) - chunked_docs = chunker.chunk_documents(documents) - logger.info(f"Created {len(chunked_docs)} chunks") - - # Generate embeddings - embedder = EmbeddingPipeline(config["embedding"]) - chunk_metas = embedder.process_documents(chunked_docs) - logger.info(f"Generated embeddings for {len(chunk_metas)} chunks") - - if args.dry_run: - logger.info("DRY RUN - Would upload to vector store") - logger.info(f"Sample chunk: {chunk_metas[0].chunk_id}") - else: - # TODO: Implement vector store upload - logger.info("Vector store upload not implemented yet") - - logger.info("Ingestion completed successfully!") - - except Exception as e: - logger.error(f"Ingestion failed: {e}") - return 1 - - return 0 +# Ingest Stack Overflow dataset +python bin/ingest.py ingest stackoverflow /path/to/sosum --config config.yml -if __name__ == "__main__": - sys.exit(main()) -``` +# Run in canary mode for testing +python bin/ingest.py ingest energy_papers /path/to/papers --canary --max-docs 50 + +# Check collection status +python bin/ingest.py status --config config.yml -## 7. 
Test Your Implementation (15 minutes) +# Evaluate retrieval performance +python bin/ingest.py evaluate natural_questions /path/to/data --output-dir results/ -### Create Test Data (`test_data.csv`) -```csv -id,title,content,category -1,"Introduction to Python","Python is a programming language that lets you work quickly and integrate systems more effectively.","programming" -2,"Machine Learning Basics","Machine learning is a method of data analysis that automates analytical model building.","ai" -3,"Data Science Overview","Data science is an interdisciplinary field that uses scientific methods to extract knowledge from data.","data" +# Batch ingestion +python bin/ingest.py batch-ingest batch_config.json ``` -### Test Configuration (`test_config.yml`) -```yaml -dataset: - name: "test_dataset" - version: "1.0.0" - adapter: "csv_dataset" - path: "test_data.csv" +### Batch Configuration Example (`batch_config.json`) +```json +{ + "datasets": [ + {"type": "natural_questions", "path": "/path/to/nq", "version": "1.0.0"}, + {"type": "stackoverflow", "path": "/path/to/sosum", "version": "1.0.0"} + ] +} +``` -chunking: - strategy: "recursive_character" - chunk_size: 500 - chunk_overlap: 50 +### Available Adapter Types +- `natural_questions`: Natural Questions dataset +- `stackoverflow`: Stack Overflow (SOSum format) dataset +- `energy_papers`: Energy research papers dataset -embedding: - strategy: "dense" - provider: "hf" - model: "sentence-transformers/all-MiniLM-L6-v2" - batch_size: 2 +## 7. 
Test with Actual Implementation (15 minutes)

-vector_store:
-  provider: "qdrant"
-  collection_name: "test_v1"
+### Use Real Configuration Files
+The actual project has several pre-configured YAML files you can use:
+
+```bash
+# List available configurations
+ls pipelines/configs/retrieval/
+
+# Available configs:
+# - modern_dense.yml: Dense embeddings with neural reranking
+# - modern_hybrid.yml: Hybrid dense+sparse with reranking
+# - fast_hybrid.yml: Fast hybrid retrieval
+# - ci_google_gemini.yml: CI configuration with Google embeddings
 ```

-### Run Test
+### Test with Stack Overflow Dataset
 ```bash
-# Create test data
-echo 'id,title,content,category
-1,"Introduction to Python","Python is a programming language that lets you work quickly and integrate systems more effectively.","programming"
-2,"Machine Learning Basics","Machine learning is a method of data analysis that automates analytical model building.","ai"' > test_data.csv
+# Download the SOSum dataset (run these commands from the project root)
+mkdir -p datasets/sosum/data
+# Download the CSV files into datasets/sosum/data/ from
+# https://github.com/BonanKou/SOSum-A-Dataset-of-Extractive-Summaries-of-Stack-Overflow-Posts-and-labeling-tools
+
+# Test the adapter (dry run)
+python bin/ingest.py ingest stackoverflow datasets/sosum/data --config config.yml --dry-run --max-docs 10 --verbose
+
+# Check what was ingested
+python bin/ingest.py status --config config.yml
+```
+
+### Actual Configuration Structure
+The real `config.yml` looks like this:
+```yaml
+# Main configuration file
+agent:
+  retrieval_pipeline_config: "pipelines/configs/retrieval/modern_dense.yml"
+
+database:
+  qdrant:
+    host: "localhost"
+    port: 6333
+    collection_name: "sosum_stackoverflow_hybrid_v1"

-# Test the pipeline
-python bin/ingest.py test_config.yml --dry-run --max-docs 2 --verbose
+# The retrieval configs contain detailed chunking and embedding settings
 ```

## 8. 
Next Steps and Extensions

diff --git a/docs/SOSUM_INGESTION.md b/docs/SOSUM_INGESTION.md
index b0a60f9..2b8351d 100644
--- a/docs/SOSUM_INGESTION.md
+++ b/docs/SOSUM_INGESTION.md
@@ -41,12 +41,17 @@ SOSum comes in two CSV files:

### 1. Download the Dataset

```bash
-# Clone the SOSum repository
-git clone https://github.com/BonanKou/SOSum-A-Dataset-of-Extractive-Summaries-of-Stack-Overflow-Posts-and-labeling-tools.git sosum
+# Clone the SOSum repository into the datasets directory
+cd datasets/
+git clone https://github.com/BonanKou/SOSum-A-Dataset-of-Extractive-Summaries-of-Stack-Overflow-Posts-and-labeling-tools.git sosum_source

-# The CSV files are in sosum/data/ directory
-ls sosum/data/
+# The CSV files are in the sosum_source/data/ directory
+ls sosum_source/data/
 # Should show: question.csv answer.csv
+
+# Or download the CSV files directly
+mkdir -p sosum/
+# Place question.csv and answer.csv in the sosum/ directory
```

### 2. Test the Adapter