diff --git a/.github/INTEGRATION_TEST_SETUP.md b/.github/INTEGRATION_TEST_SETUP.md
new file mode 100644
index 0000000000..4a04d75457
--- /dev/null
+++ b/.github/INTEGRATION_TEST_SETUP.md
@@ -0,0 +1,234 @@
+# GitHub Copilot Setup Steps for LightRAG Integration Testing
+
+This document describes the steps needed to set up and run the LightRAG integration tests locally or in CI/CD.
+
+## Prerequisites
+
+- Python 3.10 or higher
+- Docker and Docker Compose
+- Git
+
+## Local Setup Steps
+
+### 1. Clone the Repository
+
+```bash
+git clone https://github.com/netbrah/LightRAG.git
+cd LightRAG
+```
+
+### 2. Set Up Python Virtual Environment
+
+```bash
+python -m venv .venv
+source .venv/bin/activate # On Windows: .venv\Scripts\activate
+```
+
+### 3. Install Python Dependencies
+
+```bash
+pip install --upgrade pip
+pip install -e ".[api,offline-storage]"
+pip install pytest pytest-asyncio httpx
+```
+
+### 4. Start Docker Services
+
+The integration tests require three services:
+- **Redis**: For KV and document status storage
+- **Neo4j**: For graph storage
+- **Milvus**: For vector storage
+
+```bash
+cd tests
+docker-compose -f docker-compose.integration.yml up -d
+```
+
+### 5. Wait for Services to Be Ready
+
+```bash
+# Wait for Redis
+until docker exec lightrag-test-redis redis-cli ping | grep -q PONG; do sleep 2; done
+
+# Wait for Neo4j (may take up to 2 minutes)
+until docker exec lightrag-test-neo4j cypher-shell -u neo4j -p testpassword123 "RETURN 1" 2>/dev/null | grep -q "1"; do sleep 5; done
+
+# Wait for Milvus (may take up to 3 minutes)
+until curl -s http://localhost:9091/healthz | grep -q "OK"; do sleep 5; done
+```
+
+### 6. Start Mock OpenAI Server
+
+The mock server simulates OpenAI API responses for testing without requiring actual API keys.
+
+```bash
+cd tests
+python mock_openai_server.py --host 127.0.0.1 --port 8000 &
+MOCK_PID=$!
+
+# Wait for it to be ready
+until curl -s http://127.0.0.1:8000/health | grep -q "healthy"; do sleep 1; done
+```
+
+### 7. Prepare Test Environment
+
+```bash
+cd tests
+cp .env.integration .env
+mkdir -p test_inputs test_rag_storage
+```
+
+### 8. Start LightRAG Server
+
+```bash
+cd tests
+lightrag-server &
+LIGHTRAG_PID=$!
+
+# Wait for it to be ready
+until curl -s http://localhost:9621/health | grep -q "status"; do sleep 2; done
+```
+
+### 9. Run Integration Tests
+
+```bash
+cd tests
+python integration_test.py
+```
+
+### 10. Cleanup
+
+```bash
+# Stop servers
+kill $LIGHTRAG_PID
+kill $MOCK_PID
+
+# Stop Docker services
+docker-compose -f docker-compose.integration.yml down -v
+
+# Remove test artifacts
+rm -rf test_inputs test_rag_storage .env
+```
+
+## Service Configuration Details
+
+### Redis Configuration
+- **Port**: 6379
+- **Container**: lightrag-test-redis
+- **Purpose**: KV storage and document status tracking
+
+### Neo4j Configuration
+- **HTTP Port**: 7474
+- **Bolt Port**: 7687
+- **Container**: lightrag-test-neo4j
+- **Credentials**: neo4j/testpassword123
+- **Purpose**: Graph knowledge base storage
+
+### Milvus Configuration
+- **API Port**: 19530
+- **Health Port**: 9091
+- **Container**: lightrag-test-milvus
+- **Database**: default (as set by `MILVUS_DB_NAME` in `tests/.env.integration`)
+- **Purpose**: Vector embeddings storage
+
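+Once the containers report healthy, a short Python check can confirm that all three backends are reachable from the host (a minimal sketch; it assumes the `redis`, `neo4j`, and `pymilvus` client packages are available in your environment, and it reuses the test credentials above):
+
+```python
+import redis
+from neo4j import GraphDatabase
+from pymilvus import connections
+
+# Redis: PING should return True
+r = redis.Redis(host="localhost", port=6379)
+print("Redis:", r.ping())
+
+# Neo4j: verify_connectivity raises if the Bolt endpoint is unreachable
+driver = GraphDatabase.driver("neo4j://localhost:7687", auth=("neo4j", "testpassword123"))
+driver.verify_connectivity()
+print("Neo4j: ok")
+driver.close()
+
+# Milvus: connect to the standalone instance exposed on port 19530
+connections.connect(alias="default", uri="http://localhost:19530")
+print("Milvus: connected")
+```
+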
+### Mock OpenAI Server Configuration
+- **Port**: 8000
+- **Endpoints**:
+ - `/v1/chat/completions` - Mock LLM responses
+ - `/v1/embeddings` - Mock embedding generation
+ - `/health` - Health check
+
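+The endpoints accept standard OpenAI-style request bodies, so they can be exercised directly before wiring up LightRAG (a rough sketch using `httpx`, which is installed in step 3 of the local setup):
+
+```python
+import httpx
+
+base = "http://127.0.0.1:8000"
+
+with httpx.Client(timeout=30.0) as client:
+    # Health check
+    print(client.get(f"{base}/health").json())  # {'status': 'healthy'}
+
+    # Chat completion: the mock returns a canned answer keyed off the prompt text
+    chat = client.post(
+        f"{base}/v1/chat/completions",
+        json={"model": "gpt-5", "messages": [{"role": "user", "content": "Summarize the code"}]},
+    )
+    print(chat.json()["choices"][0]["message"]["content"])
+
+    # Embeddings: deterministic vectors derived from a hash of the input text
+    emb = client.post(
+        f"{base}/v1/embeddings",
+        json={"model": "text-embedding-3-large", "input": ["int main() {}"], "dimensions": 3072},
+    )
+    print(len(emb.json()["data"][0]["embedding"]))  # 3072
+```
+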
+### LightRAG Server Configuration
+- **Port**: 9621
+- **Configuration**: tests/.env.integration
+- **Storage Backends**:
+ - KV: RedisKVStorage
+ - Doc Status: RedisDocStatusStorage
+ - Vector: MilvusVectorDBStorage
+ - Graph: Neo4JStorage
+
+## CI/CD Integration
+
+The integration tests are automatically run on every commit via GitHub Actions. See `.github/workflows/integration-test.yml` for the workflow configuration.
+
+### Workflow Triggers
+- Push to any branch
+- Pull requests to any branch
+- Manual workflow dispatch
+
+### Workflow Steps
+1. Checkout code
+2. Set up Python environment
+3. Install dependencies
+4. Start Docker services (Redis, Neo4j, Milvus)
+5. Wait for all services to be healthy
+6. Start Mock OpenAI server
+7. Configure test environment
+8. Start LightRAG server
+9. Run integration tests
+10. Collect logs on failure
+11. Cleanup all resources
+
+## Test Coverage
+
+The integration tests validate:
+
+1. **Health Check**: Server availability and basic functionality
+2. **Document Indexing**:
+ - File upload (C++ source files)
+ - Text insertion
+ - Multiple file formats
+3. **Query Operations**:
+ - Naive mode
+ - Local mode
+ - Global mode
+ - Hybrid mode
+4. **Structured Data Retrieval**:
+ - Entity extraction
+ - Relationship mapping
+ - Chunk retrieval
+5. **Graph Operations**:
+ - Graph data retrieval
+ - Node and edge counting
+
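+As an illustration, the structured-data checks call the `/query/data` endpoint; the same request can be issued manually against a running test server (a sketch mirroring the request shape used in `tests/integration_test.py`):
+
+```python
+import httpx
+
+with httpx.Client(timeout=120.0) as client:
+    resp = client.post(
+        "http://localhost:9621/query/data",
+        json={"query": "What classes are defined in the code?", "mode": "hybrid", "top_k": 10},
+    )
+    resp.raise_for_status()
+    payload = resp.json()
+
+    # The integration test expects both "data" and "metadata" keys in the response
+    data = payload["data"]
+    print("entities:", len(data.get("entities", [])))
+    print("relationships:", len(data.get("relationships", [])))
+    print("chunks:", len(data.get("chunks", [])))
+```
+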
+## Sample Test Repository
+
+The tests use a sample C++ repository located at `tests/sample_cpp_repo/`:
+- **Files**: calculator.h, calculator.cpp, utils.h, utils.cpp, main.cpp
+- **Purpose**: Demonstrates code indexing and querying capabilities
+- **Content**: Simple calculator implementation with documentation
+
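+For a manual spot check, the sample files can be pushed to a running test server the same way the test runner does (a sketch; `.cpp` sources go through `/documents/upload`, while header files are inserted via `/documents/text`, matching the fallback logic in `tests/integration_test.py`):
+
+```python
+from pathlib import Path
+import httpx
+
+base = "http://localhost:9621"
+repo = Path("tests/sample_cpp_repo")
+
+with httpx.Client(timeout=120.0) as client:
+    # Upload a .cpp file as a multipart form, as the integration test does
+    with open(repo / "calculator.cpp", "rb") as f:
+        r = client.post(f"{base}/documents/upload", files={"file": ("calculator.cpp", f, "text/plain")})
+    print("upload:", r.status_code)
+
+    # Header files are sent as plain text instead
+    text = (repo / "calculator.h").read_text(encoding="utf-8")
+    r = client.post(f"{base}/documents/text", json={"text": text, "file_source": "calculator.h"})
+    print("text insert:", r.status_code)
+```
+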
+## Troubleshooting
+
+### Services Not Starting
+- Check Docker is running: `docker ps`
+- Check port availability: `lsof -i :6379,7687,19530,8000,9621`
+- Review Docker logs: `docker-compose -f tests/docker-compose.integration.yml logs`
+
+### Mock Server Issues
+- Verify port 8000 is available
+- Check mock server logs
+- Test health endpoint: `curl http://127.0.0.1:8000/health`
+
+### LightRAG Server Issues
+- Check environment file: `tests/.env`
+- Review server logs: `cat tests/lightrag.log*`
+- Verify storage connections
+
+### Test Failures
+- Ensure all services are healthy before running tests
+- Check network connectivity between services
+- Review test output for specific error messages
+
+## Environment Variables
+
+Key environment variables used in integration tests:
+
+- `LIGHTRAG_API_URL`: LightRAG server URL (default: http://localhost:9621)
+- `LLM_BINDING_HOST`: Mock OpenAI server URL (default: http://127.0.0.1:8000)
+- `EMBEDDING_BINDING_HOST`: Mock embedding server URL (default: http://127.0.0.1:8000)
+- `REDIS_URI`: Redis connection string
+- `NEO4J_URI`: Neo4j connection string
+- `MILVUS_URI`: Milvus connection string
+
+All configurations are defined in `tests/.env.integration`.
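+
+Of these, only `LIGHTRAG_API_URL` is read directly by the test runner; the remaining variables are consumed by the LightRAG server from the copied `.env` file. The runner resolves the URL with a plain lookup and falls back to the default:
+
+```python
+import os
+
+# Same resolution as tests/integration_test.py
+base_url = os.getenv("LIGHTRAG_API_URL", "http://localhost:9621")
+print(base_url)
+```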
diff --git a/.github/workflows/integration-test.yml b/.github/workflows/integration-test.yml
new file mode 100644
index 0000000000..4314443c5d
--- /dev/null
+++ b/.github/workflows/integration-test.yml
@@ -0,0 +1,164 @@
+name: Integration Tests
+
+on:
+ push:
+ pull_request:
+ workflow_dispatch:
+
+jobs:
+ integration-test:
+ name: Full Integration Test
+ runs-on: ubuntu-latest
+ timeout-minutes: 30
+
+ steps:
+ - name: Checkout repository
+ uses: actions/checkout@v4
+
+ - name: Set up Python 3.11
+ uses: actions/setup-python@v5
+ with:
+ python-version: '3.11'
+
+ - name: Cache pip packages
+ uses: actions/cache@v4
+ with:
+ path: ~/.cache/pip
+ key: ${{ runner.os }}-pip-integration-${{ hashFiles('**/pyproject.toml') }}
+ restore-keys: |
+ ${{ runner.os }}-pip-integration-
+ ${{ runner.os }}-pip-
+
+ - name: Install Python dependencies
+ run: |
+ python -m pip install --upgrade pip
+ pip install -e .[api,offline-storage]
+ pip install pytest pytest-asyncio httpx
+
+ - name: Create minimal frontend stub for testing
+ run: |
+ mkdir -p lightrag/api/webui
+          echo '<html><head><title>LightRAG Test</title></head><body>Integration Test Mode</body></html>' > lightrag/api/webui/index.html
+ echo "Created minimal frontend stub for integration testing"
+
+ - name: Start Docker services (Redis, Neo4j, Milvus)
+ run: |
+ cd tests
+ docker compose -f docker-compose.integration.yml up -d
+ echo "Waiting for services to be ready..."
+
+ - name: Wait for Redis
+ run: |
+ echo "Waiting for Redis to be ready..."
+ timeout 60 bash -c 'until docker exec lightrag-test-redis redis-cli ping | grep -q PONG; do sleep 2; done'
+ echo "✅ Redis is ready"
+
+ - name: Wait for Neo4j
+ run: |
+ echo "Waiting for Neo4j to be ready..."
+ timeout 120 bash -c 'until docker exec lightrag-test-neo4j cypher-shell -u neo4j -p testpassword123 "RETURN 1" 2>/dev/null | grep -q "1"; do sleep 5; done'
+ echo "✅ Neo4j is ready"
+
+ - name: Wait for Milvus
+ run: |
+ echo "Waiting for Milvus to be ready..."
+ timeout 180 bash -c 'until curl -s http://localhost:9091/healthz | grep -q "OK"; do sleep 5; done'
+ echo "✅ Milvus is ready"
+
+ - name: Verify services are running
+ run: |
+ docker ps
+ echo "Testing service connectivity..."
+ docker exec lightrag-test-redis redis-cli ping
+ docker exec lightrag-test-neo4j cypher-shell -u neo4j -p testpassword123 "RETURN 1"
+ curl -s http://localhost:9091/healthz
+
+ - name: Start Mock OpenAI Server
+ run: |
+ echo "Starting Mock OpenAI Server..."
+ cd tests
+ python mock_openai_server.py --host 127.0.0.1 --port 8000 &
+ MOCK_PID=$!
+ echo "MOCK_SERVER_PID=${MOCK_PID}" >> $GITHUB_ENV
+
+ # Wait for mock server to be ready
+ echo "Waiting for mock server to be ready..."
+ timeout 30 bash -c 'until curl -s http://127.0.0.1:8000/health | grep -q "healthy"; do sleep 1; done'
+ echo "✅ Mock OpenAI Server is ready (PID: ${MOCK_PID})"
+
+ - name: Prepare test environment
+ run: |
+ cd tests
+ cp .env.integration .env
+ mkdir -p test_inputs test_rag_storage
+ echo "Environment prepared for testing"
+
+ - name: Start LightRAG Server
+ run: |
+ cd tests
+ echo "Starting LightRAG Server..."
+ lightrag-server &
+ LIGHTRAG_PID=$!
+ echo "LIGHTRAG_SERVER_PID=${LIGHTRAG_PID}" >> $GITHUB_ENV
+
+ # Wait for LightRAG server to be ready
+ echo "Waiting for LightRAG server to be ready..."
+ timeout 60 bash -c 'until curl -s http://localhost:9621/health | grep -q "status"; do sleep 2; done'
+ echo "✅ LightRAG Server is ready (PID: ${LIGHTRAG_PID})"
+
+ - name: Run Integration Tests
+ run: |
+ cd tests
+ python integration_test.py
+ env:
+ LIGHTRAG_API_URL: http://localhost:9621
+
+ - name: Collect logs on failure
+ if: failure()
+ run: |
+ echo "=== LightRAG Server Logs ==="
+ cat tests/lightrag.log* 2>/dev/null || echo "No LightRAG logs found"
+
+ echo "=== Docker Service Logs ==="
+ docker compose -f tests/docker-compose.integration.yml logs
+
+ - name: Stop LightRAG Server
+ if: always()
+ run: |
+ if [ ! -z "$LIGHTRAG_SERVER_PID" ]; then
+ echo "Stopping LightRAG Server (PID: $LIGHTRAG_SERVER_PID)..."
+ kill $LIGHTRAG_SERVER_PID 2>/dev/null || true
+ sleep 2
+ fi
+
+ - name: Stop Mock OpenAI Server
+ if: always()
+ run: |
+ if [ ! -z "$MOCK_SERVER_PID" ]; then
+ echo "Stopping Mock OpenAI Server (PID: $MOCK_SERVER_PID)..."
+ kill $MOCK_SERVER_PID 2>/dev/null || true
+ fi
+
+ - name: Stop Docker services
+ if: always()
+ run: |
+ cd tests
+ docker compose -f docker-compose.integration.yml down -v
+ echo "Docker services stopped and volumes removed"
+
+ - name: Cleanup test artifacts
+ if: always()
+ run: |
+ cd tests
+ rm -rf test_inputs test_rag_storage .env
+ echo "Test artifacts cleaned up"
+
+ - name: Upload test artifacts
+ if: always()
+ uses: actions/upload-artifact@v4
+ with:
+ name: integration-test-artifacts
+ path: |
+ tests/lightrag.log*
+ tests/test_rag_storage/
+ retention-days: 7
diff --git a/lightrag/api/lightrag_server.py b/lightrag/api/lightrag_server.py
index b29e39b2eb..845395655c 100644
--- a/lightrag/api/lightrag_server.py
+++ b/lightrag/api/lightrag_server.py
@@ -991,6 +991,24 @@ async def server_rerank_func(
name=args.simulated_model_name, tag=args.simulated_model_tag
)
+ # Check if we should use an offline-compatible tokenizer (for integration testing)
+ custom_tokenizer = None
+ if os.getenv("LIGHTRAG_OFFLINE_TOKENIZER", "false").lower() == "true":
+ logger.info("Using offline-compatible simple tokenizer for integration testing")
+ try:
+ # Import simple tokenizer for offline use
+ import sys
+
+ tests_dir = Path(__file__).parent.parent.parent / "tests"
+ if tests_dir.exists():
+ sys.path.insert(0, str(tests_dir))
+ from simple_tokenizer import create_simple_tokenizer
+
+ custom_tokenizer = create_simple_tokenizer()
+ logger.info("Successfully loaded offline tokenizer")
+ except Exception as e:
+ logger.warning(f"Failed to load offline tokenizer, using default: {e}")
+
# Initialize RAG with unified configuration
try:
rag = LightRAG(
@@ -1026,6 +1044,7 @@ async def server_rerank_func(
"entity_types": args.entity_types,
},
ollama_server_infos=ollama_server_infos,
+ tokenizer=custom_tokenizer, # Pass custom tokenizer if available
)
except Exception as e:
logger.error(f"Failed to initialize LightRAG: {e}")
diff --git a/tests/.env.integration b/tests/.env.integration
new file mode 100644
index 0000000000..6e78377254
--- /dev/null
+++ b/tests/.env.integration
@@ -0,0 +1,120 @@
+# Integration Test Environment Configuration
+# This file is used for integration testing with mock OpenAI server
+
+###########################
+### Server Configuration
+###########################
+HOST=0.0.0.0
+PORT=9621
+WEBUI_TITLE='Integration Test KB'
+WEBUI_DESCRIPTION="Integration Test for LightRAG"
+WORKERS=1
+
+### Directory Configuration
+INPUT_DIR=./test_inputs
+WORKING_DIR=./test_rag_storage
+
+### Use offline tokenizer (no internet required)
+LIGHTRAG_OFFLINE_TOKENIZER=true
+
+### Logging level
+LOG_LEVEL=INFO
+VERBOSE=False
+
+#####################################
+### Authentication (Disabled for tests)
+#####################################
+# No authentication required for testing
+
+######################################################################################
+### Query Configuration
+######################################################################################
+ENABLE_LLM_CACHE=true
+TOP_K=20
+CHUNK_TOP_K=10
+MAX_ENTITY_TOKENS=4000
+MAX_RELATION_TOKENS=4000
+MAX_TOTAL_TOKENS=16000
+
+########################################
+### Document processing configuration
+########################################
+ENABLE_LLM_CACHE_FOR_EXTRACT=true
+SUMMARY_LANGUAGE=English
+
+### Entity types for code analysis
+ENTITY_TYPES='["Class","Function","Variable","Module","Namespace","Struct","Enum","Method"]'
+
+### Chunk size for document splitting
+CHUNK_SIZE=800
+CHUNK_OVERLAP_SIZE=100
+
+###############################
+### Concurrency Configuration
+###############################
+MAX_ASYNC=2
+MAX_PARALLEL_INSERT=1
+EMBEDDING_FUNC_MAX_ASYNC=4
+EMBEDDING_BATCH_NUM=5
+
+###########################################################################
+### LLM Configuration (Mock OpenAI Server)
+###########################################################################
+LLM_BINDING=openai
+LLM_MODEL=gpt-5
+LLM_BINDING_HOST=http://127.0.0.1:8000
+LLM_BINDING_API_KEY=mock-api-key-for-testing
+LLM_TIMEOUT=60
+
+### OpenAI Specific Parameters (for mock server)
+OPENAI_LLM_REASONING_EFFORT=medium
+OPENAI_LLM_MAX_COMPLETION_TOKENS=8000
+OPENAI_LLM_TEMPERATURE=0.7
+
+#######################################################################################
+### Embedding Configuration (Mock OpenAI Server)
+#######################################################################################
+EMBEDDING_BINDING=openai
+EMBEDDING_MODEL=text-embedding-3-large
+EMBEDDING_DIM=3072
+EMBEDDING_BINDING_HOST=http://127.0.0.1:8000
+EMBEDDING_BINDING_API_KEY=mock-api-key-for-testing
+EMBEDDING_TIMEOUT=30
+EMBEDDING_SEND_DIM=false
+
+####################################################################
+### WORKSPACE
+####################################################################
+WORKSPACE=integration_test
+
+############################
+### Data storage selection
+############################
+### Redis Storage
+LIGHTRAG_KV_STORAGE=RedisKVStorage
+LIGHTRAG_DOC_STATUS_STORAGE=RedisDocStatusStorage
+
+### Milvus Vector Storage
+LIGHTRAG_VECTOR_STORAGE=MilvusVectorDBStorage
+
+### Neo4j Graph Storage
+LIGHTRAG_GRAPH_STORAGE=Neo4JStorage
+
+### Redis Configuration
+REDIS_URI=redis://localhost:6379
+REDIS_SOCKET_TIMEOUT=30
+REDIS_CONNECT_TIMEOUT=10
+REDIS_MAX_CONNECTIONS=50
+REDIS_RETRY_ATTEMPTS=3
+
+### Neo4j Configuration
+NEO4J_URI=neo4j://localhost:7687
+NEO4J_USERNAME=neo4j
+NEO4J_PASSWORD=testpassword123
+NEO4J_DATABASE=neo4j
+NEO4J_MAX_CONNECTION_POOL_SIZE=50
+NEO4J_CONNECTION_TIMEOUT=30
+
+### Milvus Configuration
+MILVUS_URI=http://localhost:19530
+MILVUS_DB_NAME=default
diff --git a/tests/docker-compose.integration.yml b/tests/docker-compose.integration.yml
new file mode 100644
index 0000000000..2435399918
--- /dev/null
+++ b/tests/docker-compose.integration.yml
@@ -0,0 +1,102 @@
+version: '3.8'
+
+services:
+ # Redis for KV and Doc Status storage
+ redis:
+ image: redis:7-alpine
+ container_name: lightrag-test-redis
+ ports:
+ - "6379:6379"
+ command: redis-server --appendonly yes
+ healthcheck:
+ test: ["CMD", "redis-cli", "ping"]
+ interval: 5s
+ timeout: 3s
+ retries: 5
+
+ # Neo4j for Graph storage
+ neo4j:
+ image: neo4j:5.17.0
+ container_name: lightrag-test-neo4j
+ ports:
+ - "7474:7474" # HTTP
+ - "7687:7687" # Bolt
+ environment:
+ - NEO4J_AUTH=neo4j/testpassword123
+ - NEO4J_PLUGINS=["apoc"]
+ - NEO4J_dbms_security_procedures_unrestricted=apoc.*
+ - NEO4J_dbms_memory_heap_initial__size=512m
+ - NEO4J_dbms_memory_heap_max__size=1G
+ healthcheck:
+ test: ["CMD-SHELL", "cypher-shell -u neo4j -p testpassword123 'RETURN 1'"]
+ interval: 10s
+ timeout: 10s
+ retries: 10
+ start_period: 40s
+
+ # Milvus etcd
+ etcd:
+ container_name: lightrag-test-milvus-etcd
+ image: quay.io/coreos/etcd:v3.5.5
+ environment:
+ - ETCD_AUTO_COMPACTION_MODE=revision
+ - ETCD_AUTO_COMPACTION_RETENTION=1000
+ - ETCD_QUOTA_BACKEND_BYTES=4294967296
+ - ETCD_SNAPSHOT_COUNT=50000
+ volumes:
+ - etcd-data:/etcd
+ command: etcd -advertise-client-urls=http://127.0.0.1:2379 -listen-client-urls http://0.0.0.0:2379 --data-dir /etcd
+ healthcheck:
+ test: ["CMD", "etcdctl", "endpoint", "health"]
+ interval: 30s
+ timeout: 20s
+ retries: 3
+
+ # Milvus MinIO
+ minio:
+ container_name: lightrag-test-milvus-minio
+ image: minio/minio:RELEASE.2023-03-20T20-16-18Z
+ environment:
+ MINIO_ROOT_USER: minioadmin
+ MINIO_ROOT_PASSWORD: minioadmin
+ ports:
+ - "9001:9001"
+ - "9000:9000"
+ volumes:
+ - minio-data:/minio_data
+ command: minio server /minio_data --console-address ":9001"
+ healthcheck:
+ test: ["CMD", "curl", "-f", "http://localhost:9000/minio/health/live"]
+ interval: 30s
+ timeout: 20s
+ retries: 3
+
+ # Milvus Standalone
+ milvus:
+ container_name: lightrag-test-milvus
+ image: milvusdb/milvus:v2.4.0
+ command: ["milvus", "run", "standalone"]
+ security_opt:
+ - seccomp:unconfined
+ environment:
+ ETCD_ENDPOINTS: etcd:2379
+ MINIO_ADDRESS: minio:9000
+ volumes:
+ - milvus-data:/var/lib/milvus
+ healthcheck:
+ test: ["CMD", "curl", "-f", "http://localhost:9091/healthz"]
+ interval: 30s
+ start_period: 90s
+ timeout: 20s
+ retries: 3
+ ports:
+ - "19530:19530"
+ - "9091:9091"
+ depends_on:
+ - etcd
+ - minio
+
+volumes:
+ etcd-data:
+ minio-data:
+ milvus-data:
diff --git a/tests/integration_test.py b/tests/integration_test.py
new file mode 100644
index 0000000000..1cf7062dbd
--- /dev/null
+++ b/tests/integration_test.py
@@ -0,0 +1,366 @@
+#!/usr/bin/env python3
+"""
+Integration test script for LightRAG with production setup.
+
+This script tests:
+- Document indexing with C++ code repository
+- Query operations (naive, local, global, hybrid)
+- API endpoints (insert, query, graph retrieval)
+- Integration with Redis, Neo4j, and Milvus storage backends
+"""
+
+import asyncio
+import json
+import os
+import sys
+import logging
+from pathlib import Path
+import httpx
+
+# Configure logging
+logging.basicConfig(
+ level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
+)
+logger = logging.getLogger(__name__)
+
+
+class IntegrationTestRunner:
+ """Integration test runner for LightRAG."""
+
+ def __init__(self, base_url: str = "http://localhost:9621"):
+ self.base_url = base_url
+ self.client = httpx.AsyncClient(timeout=120.0)
+ self.test_results = []
+
+ async def __aenter__(self):
+ return self
+
+ async def __aexit__(self, exc_type, exc_val, exc_tb):
+ await self.client.aclose()
+
+ def log_result(self, test_name: str, passed: bool, message: str = ""):
+ """Log test result."""
+ status = "✅ PASS" if passed else "❌ FAIL"
+ logger.info(f"{status} - {test_name}: {message}")
+ self.test_results.append(
+ {"test": test_name, "passed": passed, "message": message}
+ )
+
+ async def wait_for_server(self, max_retries: int = 30, retry_delay: int = 2):
+ """Wait for LightRAG server to be ready."""
+ logger.info("Waiting for LightRAG server to be ready...")
+
+ for i in range(max_retries):
+ try:
+ response = await self.client.get(f"{self.base_url}/health")
+ if response.status_code == 200:
+ logger.info("✅ LightRAG server is ready!")
+ return True
+ except Exception as e:
+ logger.debug(f"Attempt {i+1}/{max_retries}: Server not ready yet - {e}")
+
+ await asyncio.sleep(retry_delay)
+
+ logger.error("❌ Server failed to become ready in time")
+ return False
+
+ async def test_health_endpoint(self):
+ """Test health check endpoint."""
+ test_name = "Health Check"
+ try:
+ response = await self.client.get(f"{self.base_url}/health")
+ passed = response.status_code == 200
+ self.log_result(test_name, passed, f"Status: {response.status_code}")
+ return passed
+ except Exception as e:
+ self.log_result(test_name, False, f"Error: {e}")
+ return False
+
+ async def test_insert_text(self, text: str, description: str = ""):
+ """Test document insertion via API."""
+ test_name = f"Insert Document{' - ' + description if description else ''}"
+ try:
+ response = await self.client.post(
+ f"{self.base_url}/documents/text",
+ json={"text": text, "description": description},
+ )
+ passed = response.status_code == 200
+ self.log_result(test_name, passed, f"Status: {response.status_code}")
+ return passed
+ except Exception as e:
+ self.log_result(test_name, False, f"Error: {e}")
+ return False
+
+ async def test_insert_file(self, file_path: Path, retry_count: int = 2):
+ """Test file insertion via API with retry logic and fallback to text endpoint."""
+ test_name = f"Insert File - {file_path.name}"
+
+ # Check if this is a header file that should use text endpoint
+ use_text_endpoint = file_path.suffix in [".h", ".hpp", ".hh"]
+
+ for attempt in range(retry_count + 1):
+ try:
+ if use_text_endpoint:
+ # Use text insertion endpoint for header files
+ with open(file_path, "r", encoding="utf-8") as f:
+ content = f.read()
+
+ response = await self.client.post(
+ f"{self.base_url}/documents/text",
+ json={"text": content, "file_source": file_path.name},
+ )
+ else:
+ # Use file upload endpoint for other files
+ with open(file_path, "rb") as f:
+ files = {"file": (file_path.name, f, "text/plain")}
+ response = await self.client.post(
+ f"{self.base_url}/documents/upload", files=files
+ )
+
+ if response.status_code == 200:
+ self.log_result(test_name, True, f"Status: {response.status_code}")
+ return True
+ elif response.status_code == 400:
+ # Check if it's unsupported file type error
+ try:
+ error_detail = response.json()
+ error_msg = error_detail.get("detail", "")
+ if (
+ "Unsupported file type" in error_msg
+ and not use_text_endpoint
+ ):
+ # Fallback to text endpoint
+ logger.info(
+ f"File type not supported for upload, trying text endpoint for {file_path.name}"
+ )
+ use_text_endpoint = True
+ continue
+ except (json.JSONDecodeError, ValueError, KeyError):
+ pass
+
+ self.log_result(test_name, False, f"Status: {response.status_code}")
+ return False
+ elif response.status_code == 500:
+ # Try to get error details
+ try:
+ error_detail = response.json()
+ error_msg = error_detail.get("detail", "Unknown error")
+ except (json.JSONDecodeError, ValueError, KeyError):
+ error_msg = (
+ response.text[:200] if response.text else "No error details"
+ )
+
+ if attempt < retry_count:
+ logger.warning(
+ f"Attempt {attempt + 1} failed for {file_path.name}: {error_msg}. Retrying..."
+ )
+ await asyncio.sleep(2) # Wait before retry
+ continue
+ else:
+ self.log_result(
+ test_name,
+ False,
+ f"Status: {response.status_code}, Error: {error_msg}",
+ )
+ return False
+ else:
+ self.log_result(test_name, False, f"Status: {response.status_code}")
+ return False
+
+ except Exception as e:
+ if attempt < retry_count:
+ logger.warning(
+ f"Attempt {attempt + 1} exception for {file_path.name}: {e}. Retrying..."
+ )
+ await asyncio.sleep(2)
+ continue
+ else:
+ self.log_result(test_name, False, f"Error: {e}")
+ return False
+
+ return False
+
+ async def test_query(self, query: str, mode: str = "hybrid"):
+ """Test query endpoint."""
+ test_name = f"Query ({mode} mode)"
+ try:
+ response = await self.client.post(
+ f"{self.base_url}/query",
+ json={"query": query, "mode": mode, "stream": False},
+ )
+ passed = response.status_code == 200
+
+ if passed:
+ result = response.json()
+ response_text = result.get("response", "")
+ logger.info(f"Query response preview: {response_text[:200]}...")
+
+ self.log_result(test_name, passed, f"Status: {response.status_code}")
+ return passed
+ except Exception as e:
+ self.log_result(test_name, False, f"Error: {e}")
+ return False
+
+ async def test_query_with_data(self, query: str, mode: str = "hybrid"):
+ """Test query/data endpoint that returns structured data."""
+ test_name = f"Query Data ({mode} mode)"
+ try:
+ response = await self.client.post(
+ f"{self.base_url}/query/data",
+ json={"query": query, "mode": mode, "top_k": 10},
+ )
+ passed = response.status_code == 200
+
+ if passed:
+ result = response.json()
+ # Validate response structure
+ has_data = "data" in result
+ has_metadata = "metadata" in result
+ if not (has_data and has_metadata):
+ passed = False
+ self.log_result(
+ test_name, passed, "Missing required fields in response"
+ )
+ else:
+ data = result.get("data", {})
+ entities_count = len(data.get("entities", []))
+ relations_count = len(data.get("relationships", []))
+ chunks_count = len(data.get("chunks", []))
+ logger.info(
+ f"Retrieved: {entities_count} entities, {relations_count} relations, {chunks_count} chunks"
+ )
+ self.log_result(
+ test_name, passed, f"Status: {response.status_code}"
+ )
+ else:
+ self.log_result(test_name, passed, f"Status: {response.status_code}")
+
+ return passed
+ except Exception as e:
+ self.log_result(test_name, False, f"Error: {e}")
+ return False
+
+ async def test_graph_data(self):
+ """Test graph data retrieval endpoint."""
+ test_name = "Graph Data Retrieval"
+ try:
+ response = await self.client.get(f"{self.base_url}/graph/label/list")
+ passed = response.status_code == 200
+
+ if passed:
+ result = response.json()
+ # Result is a list of labels
+ if isinstance(result, list):
+ logger.info(f"Graph contains {len(result)} unique labels")
+ else:
+ logger.info(f"Graph data: {result}")
+
+ self.log_result(test_name, passed, f"Status: {response.status_code}")
+ return passed
+ except Exception as e:
+ self.log_result(test_name, False, f"Error: {e}")
+ return False
+
+ async def run_all_tests(self, cpp_repo_path: Path):
+ """Run all integration tests."""
+ logger.info("=" * 80)
+ logger.info("Starting LightRAG Integration Tests")
+ logger.info("=" * 80)
+
+ # Wait for server to be ready
+ if not await self.wait_for_server():
+ logger.error("Server not ready. Aborting tests.")
+ return False
+
+ # Test 1: Health check
+ await self.test_health_endpoint()
+
+ # Test 2: Index C++ files
+ logger.info("\n--- Testing Document Indexing ---")
+ cpp_files = list(cpp_repo_path.glob("**/*.cpp")) + list(
+ cpp_repo_path.glob("**/*.h")
+ )
+ for cpp_file in cpp_files:
+ if cpp_file.is_file():
+ await self.test_insert_file(cpp_file)
+ await asyncio.sleep(
+ 0.5
+ ) # Small delay between uploads to avoid overwhelming server
+
+ # Also insert the README
+ readme_file = cpp_repo_path / "README.md"
+ if readme_file.exists():
+ await self.test_insert_file(readme_file)
+
+ # Wait a bit for indexing to complete
+ logger.info("Waiting for indexing to complete...")
+ await asyncio.sleep(5)
+
+ # Test 3: Query operations
+ logger.info("\n--- Testing Query Operations ---")
+ test_queries = [
+ ("What is the Calculator class?", "hybrid"),
+ ("Describe the main function", "local"),
+ ("What mathematical operations are supported?", "global"),
+ ("How does the power function work?", "naive"),
+ ]
+
+ for query, mode in test_queries:
+ await self.test_query(query, mode)
+ await asyncio.sleep(1) # Brief delay between queries
+
+ # Test 4: Query with structured data
+ logger.info("\n--- Testing Query Data Endpoint ---")
+ await self.test_query_with_data(
+ "What classes are defined in the code?", "hybrid"
+ )
+ await self.test_query_with_data("List all functions", "local")
+
+ # Test 5: Graph data retrieval
+ logger.info("\n--- Testing Graph Retrieval ---")
+ await self.test_graph_data()
+
+ # Print summary
+ logger.info("\n" + "=" * 80)
+ logger.info("Test Summary")
+ logger.info("=" * 80)
+
+ total = len(self.test_results)
+ passed = sum(1 for r in self.test_results if r["passed"])
+ failed = total - passed
+
+ logger.info(f"Total Tests: {total}")
+ logger.info(f"Passed: {passed} ✅")
+ logger.info(f"Failed: {failed} ❌")
+
+ if failed > 0:
+ logger.info("\nFailed Tests:")
+ for result in self.test_results:
+ if not result["passed"]:
+ logger.info(f" - {result['test']}: {result['message']}")
+
+ return failed == 0
+
+
+async def main():
+ """Main test execution."""
+ # Get test repository path
+ script_dir = Path(__file__).parent
+ cpp_repo_path = script_dir / "sample_cpp_repo"
+
+ if not cpp_repo_path.exists():
+ logger.error(f"Sample C++ repository not found at {cpp_repo_path}")
+ return 1
+
+ # Get server URL from environment or use default
+ base_url = os.getenv("LIGHTRAG_API_URL", "http://localhost:9621")
+
+ # Run tests
+ async with IntegrationTestRunner(base_url) as runner:
+ success = await runner.run_all_tests(cpp_repo_path)
+ return 0 if success else 1
+
+
+if __name__ == "__main__":
+ exit_code = asyncio.run(main())
+ sys.exit(exit_code)
diff --git a/tests/mock_openai_server.py b/tests/mock_openai_server.py
new file mode 100644
index 0000000000..88def5d8dd
--- /dev/null
+++ b/tests/mock_openai_server.py
@@ -0,0 +1,222 @@
+#!/usr/bin/env python3
+"""
+Mock OpenAI-compatible API server for integration testing.
+
+This server mocks OpenAI's API endpoints for:
+- Chat completions (LLM)
+- Embeddings
+
+Used for integration tests to avoid requiring actual API keys.
+"""
+
+import asyncio
+import json
+import logging
+from datetime import datetime
+from typing import List, Dict
+import numpy as np
+
+from fastapi import FastAPI, Request, HTTPException
+from fastapi.responses import JSONResponse, StreamingResponse
+import uvicorn
+
+# Configure logging
+logging.basicConfig(level=logging.INFO)
+logger = logging.getLogger(__name__)
+
+app = FastAPI(title="Mock OpenAI API")
+
+
+def generate_mock_embedding(text: str, dimensions: int = 3072) -> List[float]:
+ """Generate deterministic mock embedding based on text content."""
+ # Use hash of text to generate deterministic embeddings
+ hash_value = hash(text)
+ np.random.seed(abs(hash_value) % (2**32))
+ embedding = np.random.randn(dimensions).astype(float)
+ # Normalize to unit vector
+ norm = np.linalg.norm(embedding)
+ if norm > 0:
+ embedding = embedding / norm
+ return embedding.tolist()
+
+
+def generate_mock_chat_response(messages: List[Dict], model: str = "gpt-5") -> str:
+ """Generate mock chat completion response based on the query."""
+ # Extract the user's query
+ user_query = ""
+ for msg in messages:
+ if msg.get("role") == "user":
+ user_query = msg.get("content", "")
+ break
+
+ # Generate contextual responses based on keywords
+ if "entity" in user_query.lower() or "extract" in user_query.lower():
+ # Entity extraction response
+ response = json.dumps(
+ {
+ "entities": [
+ {"entity_name": "SampleClass", "entity_type": "Class"},
+ {"entity_name": "main", "entity_type": "Function"},
+ {"entity_name": "std::cout", "entity_type": "Component"},
+ ],
+ "relationships": [
+ {
+ "src_id": "main",
+ "tgt_id": "SampleClass",
+ "description": "main function creates and uses SampleClass",
+ "keywords": "instantiation,usage",
+ }
+ ],
+ }
+ )
+ elif "summary" in user_query.lower() or "summarize" in user_query.lower():
+ response = "This is a sample C++ program that demonstrates basic class usage and console output."
+ elif "theme" in user_query.lower():
+ response = "The main themes in this code are object-oriented programming, console I/O, and basic C++ syntax."
+ elif "describe" in user_query.lower():
+ response = "The code defines a simple C++ class with basic functionality and a main function that instantiates and uses the class."
+ else:
+ # Generic response
+ response = f"Mock response for query: {user_query[:100]}"
+
+ return response
+
+
+@app.post("/v1/chat/completions")
+@app.post("/chat/completions")
+async def chat_completions(request: Request):
+ """Mock chat completions endpoint."""
+ try:
+ data = await request.json()
+ logger.info(f"Received chat completion request: model={data.get('model')}")
+
+ messages = data.get("messages", [])
+ model = data.get("model", "gpt-5")
+ stream = data.get("stream", False)
+
+ response_text = generate_mock_chat_response(messages, model)
+
+ if stream:
+ # Streaming response
+ async def generate_stream():
+ # Split response into chunks
+ words = response_text.split()
+ for i, word in enumerate(words):
+ chunk = {
+ "id": f"chatcmpl-mock-{i}",
+ "object": "chat.completion.chunk",
+ "created": int(datetime.now().timestamp()),
+ "model": model,
+ "choices": [
+ {
+ "index": 0,
+ "delta": {"content": word + " "}
+ if i > 0
+ else {"role": "assistant", "content": word + " "},
+ "finish_reason": None,
+ }
+ ],
+ }
+ yield f"data: {json.dumps(chunk)}\n\n"
+ await asyncio.sleep(0.01)
+
+ # Final chunk
+ final_chunk = {
+ "id": "chatcmpl-mock-final",
+ "object": "chat.completion.chunk",
+ "created": int(datetime.now().timestamp()),
+ "model": model,
+ "choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}],
+ }
+ yield f"data: {json.dumps(final_chunk)}\n\n"
+ yield "data: [DONE]\n\n"
+
+ return StreamingResponse(generate_stream(), media_type="text/event-stream")
+ else:
+ # Non-streaming response
+ response = {
+ "id": "chatcmpl-mock",
+ "object": "chat.completion",
+ "created": int(datetime.now().timestamp()),
+ "model": model,
+ "choices": [
+ {
+ "index": 0,
+ "message": {"role": "assistant", "content": response_text},
+ "finish_reason": "stop",
+ }
+ ],
+ "usage": {
+ "prompt_tokens": 50,
+ "completion_tokens": 100,
+ "total_tokens": 150,
+ },
+ }
+ return JSONResponse(content=response)
+
+ except Exception as e:
+ logger.error(f"Error in chat completions: {e}")
+ raise HTTPException(status_code=500, detail=str(e))
+
+
+@app.post("/v1/embeddings")
+@app.post("/embeddings")
+async def embeddings(request: Request):
+ """Mock embeddings endpoint."""
+ try:
+ data = await request.json()
+ logger.info(f"Received embeddings request: model={data.get('model')}")
+
+ input_texts = data.get("input", [])
+ if isinstance(input_texts, str):
+ input_texts = [input_texts]
+
+ model = data.get("model", "text-embedding-3-large")
+ dimensions = data.get("dimensions", 3072)
+
+ # Generate embeddings for each text
+ embeddings_data = []
+ for i, text in enumerate(input_texts):
+ embedding = generate_mock_embedding(text, dimensions)
+ embeddings_data.append(
+ {"object": "embedding", "embedding": embedding, "index": i}
+ )
+
+ response = {
+ "object": "list",
+ "data": embeddings_data,
+ "model": model,
+ "usage": {
+ "prompt_tokens": len(input_texts) * 10,
+ "total_tokens": len(input_texts) * 10,
+ },
+ }
+
+ return JSONResponse(content=response)
+
+ except Exception as e:
+ logger.error(f"Error in embeddings: {e}")
+ raise HTTPException(status_code=500, detail=str(e))
+
+
+@app.get("/health")
+async def health():
+ """Health check endpoint."""
+ return {"status": "healthy"}
+
+
+def main():
+ """Run the mock OpenAI server."""
+ import argparse
+
+ parser = argparse.ArgumentParser(description="Mock OpenAI API Server")
+ parser.add_argument("--host", default="127.0.0.1", help="Host to bind to")
+ parser.add_argument("--port", type=int, default=8000, help="Port to bind to")
+ args = parser.parse_args()
+
+ logger.info(f"Starting Mock OpenAI API server on {args.host}:{args.port}")
+ uvicorn.run(app, host=args.host, port=args.port, log_level="info")
+
+
+if __name__ == "__main__":
+ main()
diff --git a/tests/sample_cpp_repo/README.md b/tests/sample_cpp_repo/README.md
new file mode 100644
index 0000000000..4271e6dece
--- /dev/null
+++ b/tests/sample_cpp_repo/README.md
@@ -0,0 +1,18 @@
+# Sample C++ Project
+
+This is a simple C++ project used for integration testing of LightRAG.
+
+## Files
+
+- `main.cpp` - Main application entry point
+- `calculator.h` - Calculator class header
+- `calculator.cpp` - Calculator class implementation
+- `utils.h` - Utility functions header
+- `utils.cpp` - Utility functions implementation
+
+## Building
+
+```bash
+g++ -o sample_app main.cpp calculator.cpp utils.cpp
+./sample_app
+```
diff --git a/tests/sample_cpp_repo/calculator.cpp b/tests/sample_cpp_repo/calculator.cpp
new file mode 100644
index 0000000000..8a736ad35d
--- /dev/null
+++ b/tests/sample_cpp_repo/calculator.cpp
@@ -0,0 +1,75 @@
+#include "calculator.h"
+#include <iostream>
+#include <cmath>
+#include <stdexcept>
+
+Calculator::Calculator() : operationCount(0), lastResult(0.0) {
+ std::cout << "Calculator initialized" << std::endl;
+}
+
+Calculator::~Calculator() {
+ std::cout << "Calculator destroyed" << std::endl;
+}
+
+double Calculator::add(double a, double b) {
+ operationCount++;
+ lastResult = a + b;
+ return lastResult;
+}
+
+double Calculator::subtract(double a, double b) {
+ operationCount++;
+ lastResult = a - b;
+ return lastResult;
+}
+
+double Calculator::multiply(double a, double b) {
+ operationCount++;
+ lastResult = a * b;
+ return lastResult;
+}
+
+double Calculator::divide(double a, double b) {
+ if (b == 0) {
+ throw std::runtime_error("Division by zero error");
+ }
+ operationCount++;
+ lastResult = a / b;
+ return lastResult;
+}
+
+double Calculator::power(double base, int exponent) {
+ operationCount++;
+ lastResult = std::pow(base, exponent);
+ return lastResult;
+}
+
+double Calculator::squareRoot(double number) {
+ if (number < 0) {
+ throw std::runtime_error("Cannot calculate square root of negative number");
+ }
+ operationCount++;
+ lastResult = std::sqrt(number);
+ return lastResult;
+}
+
+double Calculator::getLastResult() const {
+ return lastResult;
+}
+
+int Calculator::getOperationCount() const {
+ return operationCount;
+}
+
+void Calculator::reset() {
+ operationCount = 0;
+ lastResult = 0.0;
+ std::cout << "Calculator reset" << std::endl;
+}
+
+void Calculator::displayStatistics() const {
+    std::cout << "\n=== Calculator Statistics ===" << std::endl;
+ std::cout << "Operations performed: " << operationCount << std::endl;
+ std::cout << "Last result: " << lastResult << std::endl;
+    std::cout << "===========================\n" << std::endl;
+}
diff --git a/tests/sample_cpp_repo/calculator.h b/tests/sample_cpp_repo/calculator.h
new file mode 100644
index 0000000000..487aca1beb
--- /dev/null
+++ b/tests/sample_cpp_repo/calculator.h
@@ -0,0 +1,94 @@
+#ifndef CALCULATOR_H
+#define CALCULATOR_H
+
+/**
+ * Calculator class for performing mathematical operations
+ * Provides basic arithmetic and advanced mathematical functions
+ */
+class Calculator {
+private:
+ int operationCount; // Track number of operations performed
+ double lastResult; // Store the result of the last operation
+
+public:
+ /**
+ * Constructor - initializes the calculator
+ */
+ Calculator();
+
+ /**
+ * Destructor - cleans up resources
+ */
+ ~Calculator();
+
+ /**
+ * Add two numbers
+ * @param a First number
+ * @param b Second number
+ * @return Sum of a and b
+ */
+ double add(double a, double b);
+
+ /**
+ * Subtract two numbers
+ * @param a First number
+ * @param b Second number
+ * @return Difference of a and b
+ */
+ double subtract(double a, double b);
+
+ /**
+ * Multiply two numbers
+ * @param a First number
+ * @param b Second number
+ * @return Product of a and b
+ */
+ double multiply(double a, double b);
+
+ /**
+ * Divide two numbers
+ * @param a Dividend
+ * @param b Divisor
+ * @return Quotient of a divided by b
+ */
+ double divide(double a, double b);
+
+ /**
+ * Calculate power of a number
+ * @param base Base number
+ * @param exponent Exponent
+ * @return base raised to the power of exponent
+ */
+ double power(double base, int exponent);
+
+ /**
+ * Calculate square root of a number
+ * @param number Input number
+ * @return Square root of the number
+ */
+ double squareRoot(double number);
+
+ /**
+ * Get the last computed result
+ * @return Last result value
+ */
+ double getLastResult() const;
+
+ /**
+ * Get the number of operations performed
+ * @return Operation count
+ */
+ int getOperationCount() const;
+
+ /**
+ * Reset the calculator state
+ */
+ void reset();
+
+ /**
+ * Display calculator statistics
+ */
+ void displayStatistics() const;
+};
+
+#endif // CALCULATOR_H
diff --git a/tests/sample_cpp_repo/main.cpp b/tests/sample_cpp_repo/main.cpp
new file mode 100644
index 0000000000..bd9fdb4e2a
--- /dev/null
+++ b/tests/sample_cpp_repo/main.cpp
@@ -0,0 +1,33 @@
+#include <iostream>
+#include "calculator.h"
+#include "utils.h"
+
+/**
+ * Main application entry point
+ * Demonstrates the usage of Calculator class and utility functions
+ */
+int main() {
+ // Print welcome message
+ printWelcomeMessage();
+
+ // Create calculator instance
+ Calculator calc;
+
+ // Perform basic arithmetic operations
+ std::cout << "Addition: 5 + 3 = " << calc.add(5, 3) << std::endl;
+ std::cout << "Subtraction: 5 - 3 = " << calc.subtract(5, 3) << std::endl;
+ std::cout << "Multiplication: 5 * 3 = " << calc.multiply(5, 3) << std::endl;
+ std::cout << "Division: 6 / 2 = " << calc.divide(6, 2) << std::endl;
+
+ // Test advanced operations
+ std::cout << "Power: 2^8 = " << calc.power(2, 8) << std::endl;
+ std::cout << "Square root: sqrt(16) = " << calc.squareRoot(16) << std::endl;
+
+ // Display statistics
+ calc.displayStatistics();
+
+ // Print goodbye message
+ printGoodbyeMessage();
+
+ return 0;
+}
diff --git a/tests/sample_cpp_repo/utils.cpp b/tests/sample_cpp_repo/utils.cpp
new file mode 100644
index 0000000000..dae322b3fd
--- /dev/null
+++ b/tests/sample_cpp_repo/utils.cpp
@@ -0,0 +1,46 @@
+#include "utils.h"
+#include <iostream>
+#include <sstream>
+#include <iomanip>
+#include <string>
+
+void printWelcomeMessage() {
+    std::cout << "\n=====================================" << std::endl;
+ std::cout << " Welcome to Calculator Demo!" << std::endl;
+    std::cout << "=====================================\n" << std::endl;
+}
+
+void printGoodbyeMessage() {
+    std::cout << "\n=====================================" << std::endl;
+ std::cout << " Thank you for using Calculator!" << std::endl;
+    std::cout << "=====================================\n" << std::endl;
+}
+
+std::string formatNumber(double number, int precision) {
+ std::ostringstream stream;
+ stream << std::fixed << std::setprecision(precision) << number;
+ return stream.str();
+}
+
+bool isPrime(int number) {
+ if (number <= 1) return false;
+ if (number <= 3) return true;
+ if (number % 2 == 0 || number % 3 == 0) return false;
+
+ for (int i = 5; i * i <= number; i += 6) {
+ if (number % i == 0 || number % (i + 2) == 0)
+ return false;
+ }
+ return true;
+}
+
+long long factorial(int n) {
+ if (n < 0) return -1; // Error case
+ if (n == 0 || n == 1) return 1;
+
+ long long result = 1;
+ for (int i = 2; i <= n; i++) {
+ result *= i;
+ }
+ return result;
+}
diff --git a/tests/sample_cpp_repo/utils.h b/tests/sample_cpp_repo/utils.h
new file mode 100644
index 0000000000..bde7324311
--- /dev/null
+++ b/tests/sample_cpp_repo/utils.h
@@ -0,0 +1,38 @@
+#ifndef UTILS_H
+#define UTILS_H
+
+#include <string>
+
+/**
+ * Print a welcome message to the console
+ */
+void printWelcomeMessage();
+
+/**
+ * Print a goodbye message to the console
+ */
+void printGoodbyeMessage();
+
+/**
+ * Format a number with specified precision
+ * @param number Number to format
+ * @param precision Number of decimal places
+ * @return Formatted string representation
+ */
+std::string formatNumber(double number, int precision);
+
+/**
+ * Check if a number is prime
+ * @param number Number to check
+ * @return true if prime, false otherwise
+ */
+bool isPrime(int number);
+
+/**
+ * Calculate factorial of a number
+ * @param n Input number
+ * @return Factorial of n
+ */
+long long factorial(int n);
+
+#endif // UTILS_H
diff --git a/tests/simple_tokenizer.py b/tests/simple_tokenizer.py
new file mode 100644
index 0000000000..b243e69738
--- /dev/null
+++ b/tests/simple_tokenizer.py
@@ -0,0 +1,224 @@
+"""
+Simple tokenizer implementation for offline integration testing.
+
+This tokenizer doesn't require internet access and provides a basic
+word-based tokenization suitable for testing purposes.
+"""
+
+from typing import List
+import re
+
+
+class SimpleTokenizerImpl:
+ """
+ A simple word-based tokenizer that works offline.
+
+ This tokenizer:
+ - Splits text into words and punctuation
+ - Doesn't require downloading any external files
+ - Provides deterministic token IDs based on a vocabulary
+ """
+
+ def __init__(self):
+ # Build a simple vocabulary for common tokens
+ # This is a simplified approach - real tokenizers have much larger vocabularies
+ self.vocab = self._build_vocab()
+ self.inverse_vocab = {v: k for k, v in self.vocab.items()}
+ self.unk_token_id = len(self.vocab)
+
+ def _build_vocab(self) -> dict:
+ """Build a basic vocabulary of common tokens."""
+ vocab = {}
+ current_id = 0
+
+ # Add common words and symbols
+ common_tokens = [
+ # Whitespace and punctuation
+ " ",
+ "\n",
+ "\t",
+ ".",
+ ",",
+ "!",
+ "?",
+ ";",
+ ":",
+ "(",
+ ")",
+ "[",
+ "]",
+ "{",
+ "}",
+ '"',
+ "'",
+ "-",
+ "_",
+ "/",
+ "\\",
+ "@",
+ "#",
+ "$",
+ "%",
+ "&",
+ "*",
+ "+",
+ "=",
+ # Common programming keywords (for C++ code)
+ "class",
+ "struct",
+ "public",
+ "private",
+ "protected",
+ "void",
+ "int",
+ "double",
+ "float",
+ "char",
+ "bool",
+ "if",
+ "else",
+ "for",
+ "while",
+ "return",
+ "include",
+ "namespace",
+ "using",
+ "const",
+ "static",
+ "virtual",
+ "new",
+ "delete",
+ "this",
+ "nullptr",
+ "true",
+ "false",
+ # Common English words
+ "the",
+ "a",
+ "an",
+ "and",
+ "or",
+ "but",
+ "in",
+ "on",
+ "at",
+ "to",
+ "from",
+ "with",
+ "by",
+ "for",
+ "of",
+ "is",
+ "are",
+ "was",
+ "were",
+ "be",
+ "been",
+ "being",
+ "have",
+ "has",
+ "had",
+ "do",
+ "does",
+ "did",
+ "will",
+ "would",
+ "should",
+ "could",
+ "can",
+ "may",
+ "might",
+ "must",
+ "not",
+ "no",
+ "yes",
+ "this",
+ "that",
+ "these",
+ "those",
+ "what",
+ "which",
+ "who",
+ "when",
+ "where",
+ "why",
+ "how",
+ ]
+
+ for token in common_tokens:
+ vocab[token.lower()] = current_id
+ current_id += 1
+
+ return vocab
+
+ def _tokenize(self, text: str) -> List[str]:
+ """Split text into tokens (words and punctuation)."""
+ # Simple pattern to split on whitespace and keep punctuation separate
+ pattern = r"\w+|[^\w\s]"
+ tokens = re.findall(pattern, text)
+ return tokens
+
+ def encode(self, content: str) -> List[int]:
+ """
+ Encode a string into a list of token IDs.
+
+ Args:
+ content: The string to encode.
+
+ Returns:
+ A list of integer token IDs.
+ """
+ if not content:
+ return []
+
+ tokens = self._tokenize(content)
+ token_ids = []
+
+ for token in tokens:
+ token_lower = token.lower()
+ if token_lower in self.vocab:
+ token_ids.append(self.vocab[token_lower])
+ else:
+ # For unknown tokens, use a hash-based ID to be deterministic
+ # Offset by vocab size to avoid collisions
+ hash_id = abs(hash(token)) % 10000 + len(self.vocab)
+ token_ids.append(hash_id)
+
+ return token_ids
+
+ def decode(self, tokens: List[int]) -> str:
+ """
+ Decode a list of token IDs into a string.
+
+ Args:
+ tokens: The list of token IDs to decode.
+
+ Returns:
+ The decoded string.
+ """
+ if not tokens:
+ return ""
+
+ words = []
+ for token_id in tokens:
+ if token_id in self.inverse_vocab:
+ words.append(self.inverse_vocab[token_id])
+ else:
+ # For unknown IDs, use a placeholder
+                words.append(f"<unk_{token_id}>")
+
+ # Simple reconstruction - join words with spaces
+ # This is a simplification; real tokenizers preserve exact spacing
+ return " ".join(words)
+
+
+def create_simple_tokenizer():
+ """
+ Create a simple tokenizer for offline use.
+
+ Returns:
+ A Tokenizer instance using SimpleTokenizerImpl.
+ """
+ from lightrag.utils import Tokenizer
+
+ return Tokenizer("simple-tokenizer", SimpleTokenizerImpl())
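+
+
+# Example usage of the wrapped implementation (a rough sketch; decode joins
+# tokens with single spaces, so round-trips are only approximate):
+#
+#   impl = SimpleTokenizerImpl()
+#   ids = impl.encode("class Calculator { };")
+#   print(impl.decode(ids))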
diff --git a/tests/start_server_offline.py b/tests/start_server_offline.py
new file mode 100755
index 0000000000..6c115a2d6b
--- /dev/null
+++ b/tests/start_server_offline.py
@@ -0,0 +1,32 @@
+#!/usr/bin/env python3
+"""
+Start LightRAG server for integration testing with offline-compatible tokenizer.
+
+This script initializes the LightRAG server with a simple tokenizer that doesn't
+require internet access, making it suitable for integration testing in restricted
+network environments.
+"""
+
+import os
+import sys
+from pathlib import Path
+
+# Add parent directory to path to import from tests
+sys.path.insert(0, str(Path(__file__).parent))
+
+
+def start_server():
+ """Start LightRAG server with offline-compatible configuration."""
+ # Import here after setting up the path
+ from lightrag.api.lightrag_server import main
+
+ # Override the tokenizer in global args before server starts
+ # This will be used when creating the LightRAG instance
+ os.environ["LIGHTRAG_OFFLINE_TOKENIZER"] = "true"
+
+ # Start the server
+ main()
+
+
+if __name__ == "__main__":
+ start_server()