From ae7faa140180878c5b96d35b177535a5614d069e Mon Sep 17 00:00:00 2001 From: "Duo." <155233908+heilcheng@users.noreply.github.com> Date: Fri, 4 Jul 2025 15:12:44 +0800 Subject: [PATCH] feat: Complete MEQ-Bench implementation with data loading, advanced metrics, and leaderboards MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Major Features Added ### πŸ”§ Data Loading Pipeline - Implement data loaders for MedQA-USMLE, iCliniq, and Cochrane Reviews datasets - Add automatic complexity stratification using Flesch-Kincaid Grade Level scores - Create comprehensive data processing script with CLI interface - Add data validation, statistics, and balanced complexity distribution ### πŸ›‘οΈ Enhanced Safety Metrics - ContradictionDetection: Identifies contradictions against medical knowledge base - InformationPreservation: Ensures critical information (dosages, warnings) is retained - HallucinationDetection: Detects medical entities not present in source text - Integrate new metrics into MEQBenchEvaluator with updated scoring system ### πŸ“Š Interactive Leaderboards - Generate beautiful, responsive HTML leaderboards from evaluation results - Multi-dimensional analysis: overall, audience-specific, and complexity-level rankings - Interactive visualizations powered by Chart.js - Command-line interface for easy leaderboard generation ### πŸ§ͺ Comprehensive Testing - Add 90+ unit tests across 4 new test files - Test coverage for data loading, evaluation metrics, leaderboard generation - Error handling, edge cases, performance and integration testing - Ensure robust, production-ready codebase ### πŸ“š Complete Documentation - Create detailed documentation for data loading, evaluation metrics, and leaderboards - Update main documentation index with new features - Add API references, usage examples, and best practices - Update README with comprehensive feature overview ## Technical Improvements ### Architecture - SOLID principles with 
dependency injection throughout - Enhanced error handling with graceful degradation - Performance optimization for large dataset processing - Extensible design for adding new datasets and metrics ### Data Processing - Support for multiple medical dataset formats - Automatic complexity classification (basic/intermediate/advanced) - Flexible field mapping and validation - Comprehensive statistics and reporting ### Evaluation Framework - Three new specialized safety and factual consistency metrics - Enhanced scoring system with safety multipliers - Medical knowledge base for contradiction detection - Semantic similarity analysis for information coverage ### Visualization - Responsive HTML leaderboards with mobile support - Interactive charts and performance breakdowns - Self-contained deployment for static hosting - Customizable styling and branding options ## Files Added/Modified ### New Files - scripts/process_datasets.py: Data processing CLI tool - src/leaderboard.py: Interactive leaderboard generation - docs/data_loading.rst: Data loading documentation - docs/evaluation_metrics.rst: Metrics documentation - docs/leaderboard.rst: Leaderboard documentation - tests/test_data_loaders.py: Data loading tests - tests/test_evaluator_metrics.py: Metrics tests - tests/test_leaderboard.py: Leaderboard tests - tests/test_process_datasets.py: Processing script tests ### Modified Files - src/data_loaders.py: Enhanced with new dataset loaders - src/evaluator.py: Added new safety metrics and integration - docs/index.rst: Updated with new sections and features - README.md: Comprehensive feature overview and examples πŸŽ‰ Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude --- README.md | 113 ++++ docs/data_loading.rst | 335 ++++++++++++ docs/evaluation_metrics.rst | 502 +++++++++++++++++ docs/index.rst | 24 +- docs/leaderboard.rst | 484 ++++++++++++++++ scripts/process_datasets.py | 502 +++++++++++++++++ src/data_loaders.py | 564 ++++++++++++++++++- 
src/evaluator.py | 413 +++++++++++++- src/leaderboard.py | 938 ++++++++++++++++++++++++++++++++ tests/test_data_loaders.py | 499 +++++++++++++++++ tests/test_evaluator_metrics.py | 406 ++++++++++++++ tests/test_leaderboard.py | 556 +++++++++++++++++++ tests/test_process_datasets.py | 578 ++++++++++++++++++++ 13 files changed, 5900 insertions(+), 14 deletions(-) create mode 100644 docs/data_loading.rst create mode 100644 docs/evaluation_metrics.rst create mode 100644 docs/leaderboard.rst create mode 100644 scripts/process_datasets.py create mode 100644 src/leaderboard.py create mode 100644 tests/test_data_loaders.py create mode 100644 tests/test_evaluator_metrics.py create mode 100644 tests/test_leaderboard.py create mode 100644 tests/test_process_datasets.py diff --git a/README.md b/README.md index 90bb9db..34396fd 100644 --- a/README.md +++ b/README.md @@ -7,6 +7,10 @@ The deployment of Large Language Models (LLMs) in healthcare requires not only m ## Key Features - **Novel Evaluation Framework**: First benchmark to systematically evaluate audience-adaptive medical explanations +- **Comprehensive Data Loading**: Built-in support for MedQA-USMLE, iCliniq, and Cochrane Reviews datasets +- **Advanced Safety Metrics**: Contradiction detection, information preservation, and hallucination detection +- **Automated Complexity Stratification**: Flesch-Kincaid Grade Level based content categorization +- **Interactive Leaderboards**: Beautiful, responsive HTML leaderboards for result visualization - **Resource-Efficient Methodology**: Uses existing validated medical datasets, eliminating costly de novo content creation - **Validated Automated Evaluation**: Multi-dimensional scoring with LLM-as-a-judge paradigm - **Democratized Access**: Optimized for open-weight models on consumer hardware (e.g., Apple Silicon) @@ -193,6 +197,115 @@ AVERAGE PERFORMANCE ACROSS ALL AUDIENCES: 0.734 Evaluation completed successfully! 
``` +## New Features & Enhancements + +### πŸ”§ Data Loading Pipeline + +MEQ-Bench now includes comprehensive data loading functionality for popular medical datasets: + +```bash +# Process datasets from multiple sources +python scripts/process_datasets.py \ + --medqa data/medqa_usmle.json \ + --icliniq data/icliniq.json \ + --cochrane data/cochrane.json \ + --output data/benchmark_items.json \ + --max-items 1000 \ + --balance-complexity \ + --validate \ + --stats +``` + +**Supported Datasets:** +- **MedQA-USMLE**: Medical question answering based on USMLE exam format +- **iCliniq**: Real clinical questions from patients with professional answers +- **Cochrane Reviews**: Evidence-based systematic reviews and meta-analyses + +**Features:** +- Automatic complexity stratification using Flesch-Kincaid Grade Level +- Data validation and quality checks +- Balanced distribution across complexity levels +- Comprehensive statistics and reporting + +### πŸ›‘οΈ Enhanced Safety Metrics + +Three new specialized safety and factual consistency metrics: + +```python +from src.evaluator import ( + ContradictionDetection, + InformationPreservation, + HallucinationDetection +) + +# Detect medical contradictions +contradiction_score = ContradictionDetection().calculate( + text="Antibiotics are effective for viral infections", + audience="patient" +) + +# Check information preservation +preservation_score = InformationPreservation().calculate( + text="Take 10mg twice daily with food", + audience="patient", + original="Take lisinopril 10mg BID with meals" +) + +# Detect hallucinated medical entities +hallucination_score = HallucinationDetection().calculate( + text="Patient should take metformin for headaches", + audience="physician", + original="Patient reports headaches" +) +``` + +**New Metrics:** +- **Contradiction Detection**: Identifies contradictions against medical knowledge base +- **Information Preservation**: Ensures critical information (dosages, warnings) is retained +- 
**Hallucination Detection**: Detects medical entities not present in source text + +### πŸ“Š Interactive Leaderboards + +Generate beautiful, responsive HTML leaderboards from evaluation results: + +```bash +# Generate leaderboard from results directory +python -m src.leaderboard \ + --input results/ \ + --output docs/index.html \ + --verbose +``` + +**Features:** +- Overall model rankings with performance breakdowns +- Audience-specific performance analysis +- Complexity-level performance comparison +- Interactive charts powered by Chart.js +- Responsive design for all devices +- Self-contained HTML for easy deployment + +### πŸ§ͺ Comprehensive Testing + +MEQ-Bench now includes 90+ unit tests covering: + +```bash +# Run the full test suite +pytest tests/ -v + +# Run specific test modules +pytest tests/test_data_loaders.py -v +pytest tests/test_evaluator_metrics.py -v +pytest tests/test_leaderboard.py -v +pytest tests/test_process_datasets.py -v +``` + +**Test Coverage:** +- Data loading and processing functionality +- All evaluation metrics including new safety metrics +- Leaderboard generation and visualization +- Error handling and edge cases +- Performance and integration tests + For more advanced usage examples, see the [examples](examples/) directory. ## Implementation Timeline diff --git a/docs/data_loading.rst b/docs/data_loading.rst new file mode 100644 index 0000000..5209ffb --- /dev/null +++ b/docs/data_loading.rst @@ -0,0 +1,335 @@ +Data Loading and Processing +=========================== + +MEQ-Bench provides comprehensive data loading functionality for popular medical datasets, with automatic complexity stratification and standardized conversion to the MEQ-Bench format. 
+ +Supported Datasets +------------------ + +The following medical datasets are currently supported: + +* **MedQA-USMLE**: Medical question answering based on USMLE exam format +* **iCliniq**: Real clinical questions from patients with professional answers +* **Cochrane Reviews**: Evidence-based systematic reviews and meta-analyses + +Each dataset loader handles the specific format and field mappings of its source data, converting everything to standardized :class:`~src.benchmark.MEQBenchItem` objects. + +Basic Usage +----------- + +Load Individual Datasets +~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: python + + from src.data_loaders import load_medqa_usmle, load_icliniq, load_cochrane_reviews + + # Load MedQA-USMLE dataset with automatic complexity stratification + medqa_items = load_medqa_usmle( + 'data/medqa_usmle.json', + max_items=300, + auto_complexity=True + ) + + # Load iCliniq dataset + icliniq_items = load_icliniq( + 'data/icliniq.json', + max_items=400, + auto_complexity=True + ) + + # Load Cochrane Reviews + cochrane_items = load_cochrane_reviews( + 'data/cochrane.json', + max_items=300, + auto_complexity=True + ) + +Combine Multiple Datasets +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: python + + from src.data_loaders import save_benchmark_items + + # Combine all datasets + all_items = medqa_items + icliniq_items + cochrane_items + + # Save as unified benchmark dataset + save_benchmark_items(all_items, 'data/benchmark_items.json') + +Complexity Stratification +-------------------------- + +MEQ-Bench automatically categorizes content complexity using Flesch-Kincaid Grade Level scores: + +* **Basic**: FK score ≀ 8 (elementary/middle school level) +* **Intermediate**: FK score 9-12 (high school level) +* **Advanced**: FK score > 12 (college/professional level) + +.. 
code-block:: python + + from src.data_loaders import calculate_complexity_level + + # Calculate complexity for any text + text = "Hypertension is high blood pressure that can damage your heart." + complexity = calculate_complexity_level(text) + print(complexity) # "basic" + + # More complex medical text + complex_text = ("Pharmacokinetic interactions involving cytochrome P450 enzymes " + "can significantly alter therapeutic drug concentrations.") + complexity = calculate_complexity_level(complex_text) + print(complexity) # "advanced" + +Fallback Complexity Calculation +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +When the ``textstat`` library is unavailable, MEQ-Bench uses a fallback method based on: + +* Average sentence length +* Average syllables per word +* Medical terminology density + +.. code-block:: python + + # The fallback method is automatically used when textstat is not available + # No changes needed in your code - it's handled transparently + +Data Processing Script +---------------------- + +MEQ-Bench includes a comprehensive command-line script for processing and combining datasets: + +Basic Usage +~~~~~~~~~~~ + +.. code-block:: bash + + # Process all three datasets with default settings + python scripts/process_datasets.py \ + --medqa data/medqa_usmle.json \ + --icliniq data/icliniq.json \ + --cochrane data/cochrane.json \ + --output data/benchmark_items.json + +Advanced Options +~~~~~~~~~~~~~~~~ + +.. code-block:: bash + + # Custom item limits per dataset + python scripts/process_datasets.py \ + --medqa data/medqa_usmle.json \ + --icliniq data/icliniq.json \ + --cochrane data/cochrane.json \ + --output data/benchmark_items.json \ + --max-items 1500 \ + --medqa-items 600 \ + --icliniq-items 500 \ + --cochrane-items 400 \ + --balance-complexity \ + --validate \ + --stats \ + --verbose + +Script Options +~~~~~~~~~~~~~~ + +.. 
list-table:: Command Line Options + :header-rows: 1 + :widths: 30 70 + + * - Option + - Description + * - ``--medqa PATH`` + - Path to MedQA-USMLE JSON file + * - ``--icliniq PATH`` + - Path to iCliniq JSON file + * - ``--cochrane PATH`` + - Path to Cochrane Reviews JSON file + * - ``--output PATH`` + - Output path for combined dataset (default: data/benchmark_items.json) + * - ``--max-items N`` + - Maximum total items in final dataset (default: 1000) + * - ``--medqa-items N`` + - Maximum items from MedQA-USMLE + * - ``--icliniq-items N`` + - Maximum items from iCliniq + * - ``--cochrane-items N`` + - Maximum items from Cochrane Reviews + * - ``--auto-complexity`` + - Enable automatic complexity calculation (default: True) + * - ``--no-auto-complexity`` + - Disable automatic complexity calculation + * - ``--balance-complexity`` + - Balance dataset across complexity levels (default: True) + * - ``--validate`` + - Validate final dataset and show report + * - ``--stats`` + - Show detailed statistics about created dataset + * - ``--seed N`` + - Random seed for reproducible dataset creation (default: 42) + * - ``--verbose`` + - Enable verbose logging + +Dataset Validation +------------------ + +The data processing includes comprehensive validation: + +.. 
code-block:: python + + from scripts.process_datasets import validate_dataset + + # Validate any list of MEQBenchItem objects + validation_report = validate_dataset(items) + + if validation_report['valid']: + print("βœ… Dataset validation passed") + else: + print("❌ Dataset validation failed") + for issue in validation_report['issues']: + print(f" Issue: {issue}") + + for warning in validation_report['warnings']: + print(f" Warning: {warning}") + +Validation Checks +~~~~~~~~~~~~~~~~~ + +* **Duplicate IDs**: Ensures all item IDs are unique +* **Content Length**: Validates minimum content length (β‰₯20 characters) +* **Complexity Distribution**: Warns if not all complexity levels are represented +* **Data Integrity**: Checks for valid field types and required fields + +Dataset Statistics +------------------ + +Generate comprehensive statistics about your dataset: + +.. code-block:: python + + from scripts.process_datasets import print_dataset_statistics + + # Print detailed statistics + print_dataset_statistics(items) + +The statistics include: + +* Total item count +* Complexity level distribution (percentages) +* Source dataset distribution +* Content length statistics (min, max, average) + +Custom Dataset Loading +---------------------- + +For datasets not directly supported, use the custom loader: + +.. code-block:: python + + from src.data_loaders import load_custom_dataset + + # Define field mapping for your dataset + field_mapping = { + 'q': 'question', # Your field -> standard field + 'a': 'answer', + 'medical_text': 'medical_content', + 'item_id': 'id' + } + + items = load_custom_dataset( + 'path/to/your/dataset.json', + field_mapping=field_mapping, + max_items=500, + complexity_level='intermediate' # Or use auto_complexity=True + ) + +Error Handling +-------------- + +All data loaders include comprehensive error handling: + +.. 
code-block:: python + + try: + items = load_medqa_usmle('data/medqa.json') + except FileNotFoundError: + print("Dataset file not found") + except json.JSONDecodeError: + print("Invalid JSON format") + except ValueError as e: + print(f"Data validation error: {e}") + +The loaders will: + +* Skip invalid items with detailed logging +* Continue processing when individual items fail +* Provide informative error messages +* Return partial results when possible + +Performance Considerations +-------------------------- + +For large datasets: + +* Use ``max_items`` to limit memory usage during development +* Enable ``auto_complexity`` only when needed (adds processing time) +* Consider processing datasets separately and combining later +* Use the ``--verbose`` flag to monitor progress + +.. code-block:: python + + # Process large dataset in chunks + chunk_size = 1000 + all_items = [] + + for i in range(0, total_items, chunk_size): + chunk_items = load_medqa_usmle( + 'large_dataset.json', + max_items=chunk_size, + offset=i # If your loader supports offset + ) + all_items.extend(chunk_items) + +Best Practices +-------------- + +1. **Reproducible Datasets**: Always use the same random seed for consistent results + + .. code-block:: bash + + python scripts/process_datasets.py --seed 42 [other options] + +2. **Validation**: Always validate your final dataset + + .. code-block:: bash + + python scripts/process_datasets.py --validate [other options] + +3. **Backup**: Keep backup copies of your original datasets + +4. **Documentation**: Document your dataset processing pipeline + + .. code-block:: python + + # Document your processing steps + processing_config = { + 'medqa_items': 300, + 'icliniq_items': 400, + 'cochrane_items': 300, + 'complexity_stratification': True, + 'balance_complexity': True, + 'seed': 42 + } + +5. **Version Control**: Track your dataset versions and processing scripts + +API Reference +------------- + +.. 
automodule:: src.data_loaders + :members: + :undoc-members: + :show-inheritance: \ No newline at end of file diff --git a/docs/evaluation_metrics.rst b/docs/evaluation_metrics.rst new file mode 100644 index 0000000..0da2470 --- /dev/null +++ b/docs/evaluation_metrics.rst @@ -0,0 +1,502 @@ +Evaluation Metrics +================== + +MEQ-Bench includes a comprehensive suite of evaluation metrics designed to assess the quality, safety, and appropriateness of medical explanations across different audiences. + +Core Evaluation Framework +------------------------- + +The evaluation system is built on SOLID principles with dependency injection, making it highly extensible and testable. + +.. code-block:: python + + from src.evaluator import MEQBenchEvaluator, EvaluationScore + + # Initialize with default components + evaluator = MEQBenchEvaluator() + + # Evaluate a single explanation + score = evaluator.evaluate_explanation( + original="Hypertension is elevated blood pressure...", + generated="High blood pressure means your heart works harder...", + audience="patient" + ) + + print(f"Overall score: {score.overall:.3f}") + print(f"Safety score: {score.safety:.3f}") + +Standard Evaluation Metrics +--------------------------- + +Readability Assessment +~~~~~~~~~~~~~~~~~~~~~~ + +Evaluates how appropriate the language complexity is for the target audience using Flesch-Kincaid Grade Level analysis. + +.. 
code-block:: python + + from src.evaluator import ReadabilityCalculator + from src.strategies import StrategyFactory + + calculator = ReadabilityCalculator(StrategyFactory()) + score = calculator.calculate( + text="Your blood pressure is too high.", + audience="patient" + ) + +**Audience-Specific Targets:** + +* **Physician**: Technical language (Grade 16+) +* **Nurse**: Professional but accessible (Grade 12-14) +* **Patient**: Simple, clear language (Grade 6-8) +* **Caregiver**: Practical instructions (Grade 8-10) + +Terminology Appropriateness +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Assesses whether medical terminology usage matches audience expectations. + +.. code-block:: python + + from src.evaluator import TerminologyCalculator + + calculator = TerminologyCalculator(StrategyFactory()) + score = calculator.calculate( + text="Patient presents with hypertensive crisis requiring immediate intervention.", + audience="physician" # Appropriate for physician + ) + +**Evaluation Criteria:** + +* Density of medical terms relative to audience +* Appropriateness of technical vocabulary +* Balance between precision and accessibility + +Basic Safety Compliance +~~~~~~~~~~~~~~~~~~~~~~~ + +Checks for dangerous medical advice and appropriate safety language. + +.. code-block:: python + + from src.evaluator import SafetyChecker + + checker = SafetyChecker() + score = checker.calculate( + text="Stop taking your medication immediately if you feel better.", + audience="patient" # This would score poorly for safety + ) + +**Safety Checks:** + +* Detection of dangerous advice patterns +* Presence of appropriate warnings +* Encouragement to consult healthcare professionals +* Avoidance of definitive diagnoses + +Information Coverage +~~~~~~~~~~~~~~~~~~~ + +Measures how well the generated explanation covers the original medical content using semantic similarity. + +.. 
code-block:: python + + from src.evaluator import CoverageAnalyzer + + analyzer = CoverageAnalyzer() + score = analyzer.calculate( + text="High blood pressure can damage your heart and kidneys.", + audience="patient", + original="Hypertension can lead to cardiovascular and renal complications." + ) + +**Coverage Methods:** + +* Semantic similarity using sentence transformers +* Word overlap analysis (fallback method) +* Information completeness assessment + +LLM-as-a-Judge Quality Assessment +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Uses a large language model to provide comprehensive quality evaluation across multiple dimensions. + +.. code-block:: python + + from src.evaluator import LLMJudge + + judge = LLMJudge(model="gpt-4") + score = judge.calculate( + text="Your blood pressure is high. Take your medicine daily.", + audience="patient", + original="Patient has hypertension requiring daily medication." + ) + +**Evaluation Dimensions:** + +1. Factual & Clinical Accuracy +2. Terminological Appropriateness +3. Explanatory Completeness +4. Actionability & Utility +5. Safety & Harmfulness +6. Empathy & Tone + +Enhanced Safety Metrics +----------------------- + +MEQ-Bench includes three specialized safety and factual consistency metrics: + +Contradiction Detection +~~~~~~~~~~~~~~~~~~~~~~~ + +Detects contradictions against established medical knowledge. + +.. code-block:: python + + from src.evaluator import ContradictionDetection + + detector = ContradictionDetection() + score = detector.calculate( + text="Antibiotics are effective for treating viral infections.", + audience="patient" # This contradicts medical knowledge + ) + +**Detection Methods:** + +* Pattern-based contradiction detection +* Medical knowledge base validation +* Fact consistency checking + +.. 
list-table:: Common Medical Contradictions Detected + :header-rows: 1 + :widths: 50 50 + + * - Contradiction Pattern + - Medical Fact + * - "Antibiotics treat viruses" + - Antibiotics only treat bacterial infections + * - "Stop medication when feeling better" + - Complete prescribed course + * - "Aspirin is safe for everyone" + - Aspirin has contraindications + * - "140/90 is normal blood pressure" + - 140/90 indicates hypertension + +Information Preservation +~~~~~~~~~~~~~~~~~~~~~~~~ + +Ensures critical medical information (dosages, warnings, timing) is preserved from source to explanation. + +.. code-block:: python + + from src.evaluator import InformationPreservation + + checker = InformationPreservation() + score = checker.calculate( + text="Take your medicine twice daily with food.", + audience="patient", + original="Take 10 mg twice daily with meals. Avoid alcohol." + ) + +**Critical Information Categories:** + +* **Dosages**: Medication amounts, frequencies, units +* **Warnings**: Contraindications, side effects, precautions +* **Timing**: When to take medications, meal relationships +* **Conditions**: Important medical conditions and considerations + +.. code-block:: python + + # Example of comprehensive information preservation + original = """ + Take lisinopril 10 mg once daily before breakfast. + Do not take with potassium supplements. + Contact doctor if you develop a persistent cough. + Monitor blood pressure weekly. + """ + + good_explanation = """ + Take your blood pressure medicine (10 mg) once every morning + before breakfast. Don't take potassium pills with it. + Call your doctor if you get a cough that won't go away. + Check your blood pressure once a week. + """ + + score = checker.calculate(good_explanation, "patient", original=original) + # Should score highly for preserving dosage, timing, warnings + +Hallucination Detection +~~~~~~~~~~~~~~~~~~~~~~~ + +Identifies medical entities in generated text that don't appear in the source material. 
+ +.. code-block:: python + + from src.evaluator import HallucinationDetection + + detector = HallucinationDetection() + score = detector.calculate( + text="Patient has diabetes and should take metformin and insulin.", + audience="physician", + original="Patient reports fatigue and frequent urination." + ) + +**Entity Detection:** + +* Medical conditions (diabetes, hypertension, etc.) +* Medications (metformin, aspirin, etc.) +* Symptoms (fever, headache, etc.) +* Body parts/systems (heart, lungs, etc.) + +**Detection Methods:** + +* Predefined medical entity lists +* spaCy Named Entity Recognition (when available) +* Medical terminology pattern matching + +Integration with spaCy +^^^^^^^^^^^^^^^^^^^^^^ + +When spaCy is installed with a medical model, hallucination detection is enhanced: + +.. code-block:: bash + + # Install spaCy with English model + pip install spacy + python -m spacy download en_core_web_sm + +.. code-block:: python + + # Enhanced detection with spaCy + detector = HallucinationDetection() + # Automatically uses spaCy if available + score = detector.calculate(text, audience, original=original) + +Evaluation Scoring System +------------------------- + +Weighted Scoring +~~~~~~~~~~~~~~~~ + +MEQ-Bench uses a configurable weighted scoring system: + +.. code-block:: python + + # Default weights (can be customized via configuration) + default_weights = { + 'readability': 0.15, + 'terminology': 0.15, + 'safety': 0.20, + 'coverage': 0.15, + 'quality': 0.15, + 'contradiction': 0.10, + 'information_preservation': 0.05, + 'hallucination': 0.05 + } + + # Overall score calculation + overall_score = sum(metric_score * weight for metric_score, weight in zip(scores, weights)) + +Safety Multiplier +~~~~~~~~~~~~~~~~~ + +Critical safety violations apply a penalty multiplier: + +.. 
code-block:: python + + if safety_score < 0.3: + overall_score *= safety_multiplier # Default: 0.5 + overall_score = min(1.0, overall_score) + +Evaluation Results +------------------ + +EvaluationScore Object +~~~~~~~~~~~~~~~~~~~~~~ + +All evaluations return a comprehensive :class:`~src.evaluator.EvaluationScore` object: + +.. code-block:: python + + @dataclass + class EvaluationScore: + readability: float + terminology: float + safety: float + coverage: float + quality: float + contradiction: float + information_preservation: float + hallucination: float + overall: float + details: Optional[Dict[str, Any]] = None + +.. code-block:: python + + # Access individual scores + score = evaluator.evaluate_explanation(original, generated, audience) + + print(f"Readability: {score.readability:.3f}") + print(f"Safety: {score.safety:.3f}") + print(f"Contradiction-free: {score.contradiction:.3f}") + print(f"Information preserved: {score.information_preservation:.3f}") + print(f"Hallucination-free: {score.hallucination:.3f}") + print(f"Overall: {score.overall:.3f}") + + # Convert to dictionary for serialization + score_dict = score.to_dict() + +Multi-Audience Evaluation +~~~~~~~~~~~~~~~~~~~~~~~~~ + +Evaluate across all supported audiences: + +.. code-block:: python + + explanations = { + 'physician': "Patient presents with essential hypertension requiring ACE inhibitor therapy.", + 'nurse': "Patient has high blood pressure. Monitor BP, watch for medication side effects.", + 'patient': "You have high blood pressure. Take your medicine daily as prescribed.", + 'caregiver': "Their blood pressure is too high. Make sure they take medicine every day." + } + + results = evaluator.evaluate_all_audiences(original_content, explanations) + + for audience, score in results.items(): + print(f"{audience}: {score.overall:.3f}") + +Custom Evaluation Components +---------------------------- + +Dependency Injection +~~~~~~~~~~~~~~~~~~~~ + +Replace or customize evaluation components: + +.. 
code-block:: python + + from src.evaluator import ( + MEQBenchEvaluator, + ContradictionDetection, + InformationPreservation, + HallucinationDetection + ) + + # Custom contradiction detector with additional knowledge + class CustomContradictionDetection(ContradictionDetection): + def _load_medical_knowledge(self): + # Add custom medical knowledge + knowledge = super()._load_medical_knowledge() + knowledge['custom_condition'] = ['custom facts'] + return knowledge + + # Initialize evaluator with custom components + evaluator = MEQBenchEvaluator( + contradiction_detector=CustomContradictionDetection(), + # ... other custom components + ) + +Custom Metrics +~~~~~~~~~~~~~~ + +Add your own evaluation metrics: + +.. code-block:: python + + class CustomMetric: + def calculate(self, text: str, audience: str, **kwargs) -> float: + # Your custom evaluation logic + return score + + # Use in custom evaluator + class CustomEvaluator(MEQBenchEvaluator): + def __init__(self, **kwargs): + super().__init__(**kwargs) + self.custom_metric = CustomMetric() + + def evaluate_explanation(self, original, generated, audience): + # Call parent evaluation + score = super().evaluate_explanation(original, generated, audience) + + # Add custom metric + custom_score = self.custom_metric.calculate(generated, audience) + + # Incorporate into overall score + # ... custom scoring logic + + return score + +Performance Optimization +------------------------ + +Batch Processing +~~~~~~~~~~~~~~~~ + +For large-scale evaluation: + +.. code-block:: python + + # Process multiple items efficiently + results = [] + for item in benchmark_items[:100]: # Limit for testing + explanations = generate_explanations(item.medical_content, model_func) + item_results = evaluator.evaluate_all_audiences( + item.medical_content, + explanations + ) + results.append(item_results) + +Caching +~~~~~~~ + +Enable caching for expensive operations: + +.. 
code-block:: python + + # LLM judge results can be cached + import functools + + @functools.lru_cache(maxsize=1000) + def cached_llm_evaluation(text_hash, audience): + return llm_judge.calculate(text, audience) + +Error Handling +-------------- + +Graceful Degradation +~~~~~~~~~~~~~~~~~~~~ + +MEQ-Bench handles missing dependencies gracefully: + +.. code-block:: python + + # If sentence-transformers is not available, falls back to word overlap + # If spaCy is not available, uses pattern matching only + # If LLM API fails, uses default scores + + try: + score = evaluator.evaluate_explanation(original, generated, audience) + except EvaluationError as e: + logger.error(f"Evaluation failed: {e}") + # Handle evaluation failure appropriately + +Logging and Debugging +~~~~~~~~~~~~~~~~~~~~~ + +Enable detailed logging for troubleshooting: + +.. code-block:: python + + import logging + logging.getLogger('meq_bench.evaluator').setLevel(logging.DEBUG) + + # Detailed scores logged automatically + score = evaluator.evaluate_explanation(original, generated, audience) + +API Reference +------------- + +.. automodule:: src.evaluator + :members: + :undoc-members: + :show-inheritance: \ No newline at end of file diff --git a/docs/index.rst b/docs/index.rst index c0e63ad..7888490 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -5,11 +5,29 @@ Welcome to MEQ-Bench, a resource-efficient benchmark for evaluating audience-ada .. toctree:: :maxdepth: 2 - :caption: Contents: + :caption: Getting Started: installation quickstart + +.. toctree:: + :maxdepth: 2 + :caption: Core Functionality: + + data_loading + evaluation_metrics + leaderboard + +.. toctree:: + :maxdepth: 2 + :caption: API Reference: + api/index + +.. 
toctree:: + :maxdepth: 2 + :caption: Advanced Topics: + evaluation examples contributing @@ -28,6 +46,10 @@ Key Features ------------ * **Novel Evaluation Framework**: First benchmark to systematically evaluate audience-adaptive medical explanations +* **Comprehensive Data Loading**: Support for MedQA-USMLE, iCliniq, and Cochrane Reviews datasets +* **Advanced Safety Metrics**: Contradiction detection, information preservation, and hallucination detection +* **Automated Complexity Stratification**: Flesch-Kincaid Grade Level based content categorization +* **Interactive Leaderboards**: Beautiful, responsive HTML leaderboards for result visualization * **Resource-Efficient Methodology**: Uses existing validated medical datasets * **Validated Automated Evaluation**: Multi-dimensional scoring with LLM-as-a-judge paradigm * **Democratized Access**: Optimized for open-weight models on consumer hardware diff --git a/docs/leaderboard.rst b/docs/leaderboard.rst new file mode 100644 index 0000000..db43d12 --- /dev/null +++ b/docs/leaderboard.rst @@ -0,0 +1,484 @@ +Public Leaderboard +================== + +MEQ-Bench includes a comprehensive leaderboard system that generates static HTML pages displaying model performance across different audiences and complexity levels. + +Overview +-------- + +The leaderboard system processes evaluation results from multiple models and creates an interactive HTML dashboard featuring: + +* **Overall Model Rankings**: Comprehensive performance comparison +* **Audience-Specific Performance**: Breakdown by physician, nurse, patient, and caregiver +* **Complexity-Level Analysis**: Performance across basic, intermediate, and advanced content +* **Interactive Visualizations**: Charts and graphs for performance analysis + +Quick Start +----------- + +Generate a leaderboard from evaluation results: + +.. 
code-block:: bash + + # Basic usage + python -m src.leaderboard --input results/ --output docs/index.html + + # With custom options + python -m src.leaderboard \ + --input evaluation_results/ \ + --output leaderboard.html \ + --title "Custom MEQ-Bench Results" \ + --verbose + +Command Line Interface +---------------------- + +.. list-table:: Leaderboard CLI Options + :header-rows: 1 + :widths: 30 70 + + * - Option + - Description + * - ``--input PATH`` + - Directory containing JSON evaluation result files (required) + * - ``--output PATH`` + - Output path for HTML leaderboard (default: docs/index.html) + * - ``--title TEXT`` + - Custom title for the leaderboard page + * - ``--verbose`` + - Enable verbose logging during generation + +Input Data Format +----------------- + +The leaderboard expects JSON files containing evaluation results in the following format: + +.. code-block:: json + + { + "model_name": "GPT-4", + "total_items": 1000, + "audience_scores": { + "physician": [0.85, 0.90, 0.88, ...], + "nurse": [0.82, 0.85, 0.83, ...], + "patient": [0.75, 0.80, 0.78, ...], + "caregiver": [0.80, 0.82, 0.81, ...] + }, + "complexity_scores": { + "basic": [0.85, 0.90, ...], + "intermediate": [0.80, 0.85, ...], + "advanced": [0.75, 0.80, ...] + }, + "detailed_results": [...], + "summary": { + "overall_mean": 0.82, + "physician_mean": 0.88, + "nurse_mean": 0.83, + "patient_mean": 0.78, + "caregiver_mean": 0.81 + } + } + +Programmatic Usage +------------------ + +Basic Leaderboard Generation +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: python + + from src.leaderboard import LeaderboardGenerator + from pathlib import Path + + # Initialize generator + generator = LeaderboardGenerator() + + # Load results from directory + results_dir = Path("evaluation_results/") + generator.load_results(results_dir) + + # Generate HTML leaderboard + output_path = Path("docs/leaderboard.html") + generator.generate_html(output_path) + +Advanced Usage +~~~~~~~~~~~~~~ + +.. 
code-block:: python + + # Get leaderboard statistics + stats = generator.calculate_leaderboard_stats() + print(f"Total models: {stats['total_models']}") + print(f"Total evaluations: {stats['total_evaluations']}") + print(f"Best score: {stats['best_score']:.3f}") + + # Get model rankings + ranked_models = generator.rank_models() + for model in ranked_models[:3]: # Top 3 + print(f"{model['rank']}. {model['model_name']}: {model['overall_score']:.3f}") + + # Get audience-specific breakdowns + audience_breakdown = generator.generate_audience_breakdown(ranked_models) + for audience, models in audience_breakdown.items(): + print(f"\n{audience.title()} Rankings:") + for model in models[:3]: + print(f" {model['rank']}. {model['model_name']}: {model['score']:.3f}") + +Leaderboard Features +-------------------- + +Overall Rankings +~~~~~~~~~~~~~~~~ + +The main leaderboard table displays: + +* Model rankings by overall performance +* Total items evaluated per model +* Audience-specific average scores +* Interactive sorting and filtering + +.. 
image:: _static/leaderboard_overall.png + :alt: Overall Rankings Table + :width: 800px + +**Ranking Highlights:** + +* πŸ₯‡ **1st Place**: Gold highlighting with special styling +* πŸ₯ˆ **2nd Place**: Silver highlighting +* πŸ₯‰ **3rd Place**: Bronze highlighting + +Audience-Specific Analysis +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Dedicated sections for each audience showing: + +* Rankings specific to that audience type +* Performance differences across audiences +* Top performers for each professional group + +**Audience Categories:** + +* **Physician**: Technical medical explanations +* **Nurse**: Clinical care and monitoring focus +* **Patient**: Simple, empathetic communication +* **Caregiver**: Practical instructions and warnings + +Complexity-Level Breakdown +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Analysis by content difficulty: + +* **Basic**: Elementary/middle school reading level +* **Intermediate**: High school reading level +* **Advanced**: College/professional reading level + +This helps identify models that excel at different complexity levels. + +Interactive Visualizations +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The leaderboard includes Chart.js-powered visualizations: + +**Performance Comparison Chart** + Bar chart showing overall scores for top models + +**Audience Performance Radar** + Radar chart displaying average performance across all audiences + +.. 
code-block:: javascript + + // Example chart configuration (automatically generated) + { + type: 'bar', + data: { + labels: ['GPT-4', 'Claude-3', 'PaLM-2', ...], + datasets: [{ + label: 'Overall Score', + data: [0.85, 0.82, 0.79, ...], + backgroundColor: 'rgba(59, 130, 246, 0.8)' + }] + } + } + +Responsive Design +----------------- + +The leaderboard is fully responsive and works on: + +* **Desktop**: Full feature set with side-by-side comparisons +* **Tablet**: Optimized layout with collapsible sections +* **Mobile**: Touch-friendly interface with stacked content + +CSS Grid and Flexbox ensure optimal viewing across all devices. + +Customization +------------- + +Styling +~~~~~~~ + +Customize the leaderboard appearance by modifying the CSS: + +.. code-block:: python + + class CustomLeaderboardGenerator(LeaderboardGenerator): + def _get_css_styles(self): + # Override with custom styles + return custom_css_content + +Color Schemes +~~~~~~~~~~~~~ + +The default color scheme uses: + +* **Primary**: Blue (#3b82f6) for highlights and buttons +* **Success**: Green (#059669) for scores and positive indicators +* **Warning**: Gold (#ffd700) for first place highlighting +* **Neutral**: Gray scale for general content + +Branding +~~~~~~~~ + +Customize titles, logos, and contact information: + +.. code-block:: python + + # Modify the HTML template generation + def _generate_html_template(self, ...): + return f""" +
+

πŸ† {custom_title}

+ Logo +
+ ... + """ + +Performance Optimization +------------------------ + +Large Dataset Handling +~~~~~~~~~~~~~~~~~~~~~~ + +For leaderboards with many models: + +.. code-block:: python + + # Pagination for large leaderboards + def generate_paginated_leaderboard(models, page_size=50): + pages = [models[i:i+page_size] for i in range(0, len(models), page_size)] + return pages + + # Top-N filtering + top_models = ranked_models[:20] # Show only top 20 + +Caching +~~~~~~~ + +Cache expensive calculations: + +.. code-block:: python + + import functools + + @functools.lru_cache(maxsize=100) + def cached_statistics_calculation(self, data_hash): + return self.calculate_leaderboard_stats() + +CDN Assets +~~~~~~~~~~ + +For better performance, load external assets from CDN: + +.. code-block:: html + + + + + + + +Deployment +---------- + +Static Hosting +~~~~~~~~~~~~~~ + +The generated HTML is completely self-contained and can be hosted on: + +* **GitHub Pages**: Perfect for open source projects +* **Netlify**: Easy deployment with automatic builds +* **AWS S3**: Scalable static hosting +* **Apache/Nginx**: Traditional web servers + +GitHub Pages Example +~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: yaml + + # .github/workflows/leaderboard.yml + name: Update Leaderboard + on: + schedule: + - cron: '0 0 * * *' # Daily updates + workflow_dispatch: + + jobs: + update-leaderboard: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v2 + - name: Setup Python + uses: actions/setup-python@v2 + with: + python-version: '3.9' + - name: Install dependencies + run: pip install -r requirements.txt + - name: Generate leaderboard + run: python -m src.leaderboard --input results/ --output docs/index.html + - name: Deploy to GitHub Pages + uses: peaceiris/actions-gh-pages@v3 + with: + github_token: ${{ secrets.GITHUB_TOKEN }} + publish_dir: ./docs + +Automated Updates +~~~~~~~~~~~~~~~~~ + +Set up automated leaderboard updates: + +.. 
code-block:: bash + + #!/bin/bash + # update_leaderboard.sh + + # Download latest results + rsync -av results_server:/path/to/results/ ./results/ + + # Regenerate leaderboard + python -m src.leaderboard \ + --input results/ \ + --output docs/index.html \ + --verbose + + # Deploy to hosting + aws s3 sync docs/ s3://your-bucket/ --delete + +SEO and Analytics +----------------- + +Search Engine Optimization +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The generated HTML includes SEO-friendly features: + +.. code-block:: html + + + MEQ-Bench Leaderboard - Medical LLM Evaluation Results + + + + + + +Analytics Integration +~~~~~~~~~~~~~~~~~~~~ + +Add analytics tracking: + +.. code-block:: python + + def add_analytics_tracking(self, html_content, tracking_id): + analytics_code = f""" + + + + """ + return html_content.replace('', analytics_code + '') + +API Reference +------------- + +.. automodule:: src.leaderboard + :members: + :undoc-members: + :show-inheritance: + +Examples +-------- + +Complete Evaluation to Leaderboard Pipeline +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: python + + from src.benchmark import MEQBench + from src.leaderboard import LeaderboardGenerator + from pathlib import Path + + # 1. Run evaluations for multiple models + models = ['gpt-4', 'claude-3-opus', 'llama-2-70b'] + bench = MEQBench() + + for model_name in models: + model_func = get_model_function(model_name) # Your model interface + results = bench.evaluate_model(model_func, max_items=1000) + + # Save individual results + output_path = f"results/{model_name}_evaluation.json" + bench.save_results(results, output_path) + + # 2. Generate leaderboard from all results + generator = LeaderboardGenerator() + generator.load_results(Path("results/")) + generator.generate_html(Path("docs/index.html")) + + print("βœ… Leaderboard generated successfully!") + +Multi-Language Support +~~~~~~~~~~~~~~~~~~~~~ + +.. 
code-block:: python + + class MultiLanguageLeaderboard(LeaderboardGenerator): + def __init__(self, language='en'): + super().__init__() + self.language = language + self.translations = self._load_translations() + + def _load_translations(self): + # Load language-specific strings + return { + 'en': {'title': 'MEQ-Bench Leaderboard', ...}, + 'es': {'title': 'Tabla de ClasificaciΓ³n MEQ-Bench', ...}, + # ... other languages + } + +Best Practices +-------------- + +1. **Regular Updates**: Update leaderboards regularly as new results become available + +2. **Data Validation**: Validate result files before generating leaderboards + + .. code-block:: python + + def validate_results_directory(results_dir): + required_fields = ['model_name', 'total_items', 'audience_scores', 'summary'] + for file_path in results_dir.glob("*.json"): + with open(file_path) as f: + data = json.load(f) + assert all(field in data for field in required_fields) + +3. **Version Control**: Track leaderboard versions and source data + +4. **Accessibility**: Ensure leaderboards are accessible to users with disabilities + +5. **Mobile Testing**: Test leaderboard display across different screen sizes + +6. **Performance Monitoring**: Monitor page load times and optimize as needed \ No newline at end of file diff --git a/scripts/process_datasets.py b/scripts/process_datasets.py new file mode 100644 index 0000000..f0308f3 --- /dev/null +++ b/scripts/process_datasets.py @@ -0,0 +1,502 @@ +#!/usr/bin/env python3 +""" +Script to process and combine medical datasets into MEQ-Bench format. + +This script loads data from MedQA-USMLE, iCliniq, and Cochrane Reviews datasets, +applies complexity stratification using Flesch-Kincaid scores, and creates a +1,000-item benchmark dataset saved as data/benchmark_items.json. 
import argparse
import json
import logging
import sys
from pathlib import Path
from typing import List, Dict, Any, Optional

# Add src to path so the project modules import when run as a script.
sys.path.insert(0, str(Path(__file__).parent.parent / "src"))

from data_loaders import (
    load_medqa_usmle,
    load_icliniq,
    load_cochrane_reviews,
    save_benchmark_items,
    calculate_complexity_level
)
from benchmark import MEQBenchItem

# Set up logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger('process_datasets')


def setup_argument_parser() -> argparse.ArgumentParser:
    """Build the command line argument parser for dataset processing.

    Returns:
        A configured ``argparse.ArgumentParser`` covering input dataset
        paths, per-dataset item limits, complexity handling, validation,
        and reporting options.
    """
    parser = argparse.ArgumentParser(
        description="Process medical datasets for MEQ-Bench",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  # Process all datasets with equal distribution
  python scripts/process_datasets.py --medqa data/medqa_usmle.json \\
                                     --icliniq data/icliniq.json \\
                                     --cochrane data/cochrane.json \\
                                     --output data/benchmark_items.json

  # Process with custom item limits per dataset
  python scripts/process_datasets.py --medqa data/medqa_usmle.json \\
                                     --icliniq data/icliniq.json \\
                                     --cochrane data/cochrane.json \\
                                     --output data/benchmark_items.json \\
                                     --max-items 1500 \\
                                     --medqa-items 600 \\
                                     --icliniq-items 500 \\
                                     --cochrane-items 400
        """
    )

    # Output arguments (NOTE: --output has a default, so it is optional;
    # the previous "Required arguments" comment was misleading)
    parser.add_argument(
        '--output', '-o',
        type=str,
        default='data/benchmark_items.json',
        help='Output path for the combined benchmark dataset (default: data/benchmark_items.json)'
    )

    # Dataset file arguments
    parser.add_argument(
        '--medqa',
        type=str,
        help='Path to MedQA-USMLE dataset JSON file'
    )
    parser.add_argument(
        '--icliniq',
        type=str,
        help='Path to iCliniq dataset JSON file'
    )
    parser.add_argument(
        '--cochrane',
        type=str,
        help='Path to Cochrane Reviews dataset JSON file'
    )

    # Control arguments
    parser.add_argument(
        '--max-items',
        type=int,
        default=1000,
        help='Maximum total number of items in final benchmark (default: 1000)'
    )
    parser.add_argument(
        '--medqa-items',
        type=int,
        help='Maximum items from MedQA-USMLE (default: auto-calculated)'
    )
    parser.add_argument(
        '--icliniq-items',
        type=int,
        help='Maximum items from iCliniq (default: auto-calculated)'
    )
    parser.add_argument(
        '--cochrane-items',
        type=int,
        help='Maximum items from Cochrane Reviews (default: auto-calculated)'
    )

    parser.add_argument(
        '--auto-complexity',
        action='store_true',
        default=True,
        help='Automatically calculate complexity levels using Flesch-Kincaid scores (default: True)'
    )
    parser.add_argument(
        '--no-auto-complexity',
        action='store_false',
        dest='auto_complexity',
        help='Disable automatic complexity calculation'
    )

    parser.add_argument(
        '--balance-complexity',
        action='store_true',
        default=True,
        help='Balance the dataset across complexity levels (default: True)'
    )
    # BUG FIX: balancing defaulted to True with no way to turn it off.
    # Mirror the --no-auto-complexity pattern with a disabling flag.
    parser.add_argument(
        '--no-balance-complexity',
        action='store_false',
        dest='balance_complexity',
        help='Disable complexity balancing'
    )

    parser.add_argument(
        '--seed',
        type=int,
        default=42,
        help='Random seed for reproducible dataset creation (default: 42)'
    )
    parser.add_argument(
        '--validate',
        action='store_true',
        help='Validate the final dataset after creation'
    )
    parser.add_argument(
        '--stats',
        action='store_true',
        help='Show detailed statistics about the created dataset'
    )
    parser.add_argument(
        '--verbose', '-v',
        action='store_true',
        help='Enable verbose logging'
    )

    return parser
% num_datasets + + limits = {} + dataset_names = ['medqa', 'icliniq', 'cochrane'] + + for i, name in enumerate(dataset_names[:num_datasets]): + limits[name] = base_items + (1 if i < remainder else 0) + + return limits + + +def balance_complexity_distribution(items: List[MEQBenchItem], target_distribution: Optional[Dict[str, float]] = None) -> List[MEQBenchItem]: + """Balance the complexity distribution of items.""" + if target_distribution is None: + # Default: roughly equal distribution + target_distribution = {'basic': 0.33, 'intermediate': 0.34, 'advanced': 0.33} + + # Group items by complexity + complexity_groups = {'basic': [], 'intermediate': [], 'advanced': []} + for item in items: + if item.complexity_level in complexity_groups: + complexity_groups[item.complexity_level].append(item) + + # Calculate target counts + total_items = len(items) + target_counts = { + level: int(total_items * ratio) + for level, ratio in target_distribution.items() + } + + # Adjust for rounding differences + actual_total = sum(target_counts.values()) + if actual_total < total_items: + target_counts['intermediate'] += total_items - actual_total + + # Sample items for balanced distribution + balanced_items = [] + import random + + for level, target_count in target_counts.items(): + available_items = complexity_groups[level] + if len(available_items) >= target_count: + # Randomly sample target_count items + sampled_items = random.sample(available_items, target_count) + else: + # Use all available items + sampled_items = available_items + logger.warning(f"Only {len(available_items)} {level} items available, target was {target_count}") + + balanced_items.extend(sampled_items) + + logger.info(f"Balanced dataset: {len(balanced_items)} total items") + for level in ['basic', 'intermediate', 'advanced']: + count = sum(1 for item in balanced_items if item.complexity_level == level) + percentage = (count / len(balanced_items)) * 100 if balanced_items else 0 + logger.info(f" {level}: {count} 
def validate_dataset(items: List['MEQBenchItem']) -> Dict[str, Any]:
    """Validate the created dataset and return a validation report.

    Checks performed:
      * dataset is non-empty (hard failure)
      * item IDs are unique (hard failure)
      * every item has meaningful content, >= 20 chars (hard failure)
      * all complexity levels represented (warning only)
      * average content length within a sane range (warning only)

    Args:
        items: Benchmark items to validate.

    Returns:
        Report dict with keys 'valid' (bool), 'total_items' (int),
        'issues' (fatal problems), 'warnings' (non-fatal), 'statistics'.
    """
    from collections import Counter  # local import keeps this helper self-contained

    validation_report = {
        'valid': True,
        'total_items': len(items),
        'issues': [],
        'warnings': [],
        'statistics': {}
    }

    if not items:
        validation_report['valid'] = False
        validation_report['issues'].append("Dataset is empty")
        return validation_report

    # Duplicate-ID check. BUG FIX: the previous `ids.count(id)` scan was
    # O(n^2) and shadowed the builtin `id`; Counter does it in O(n).
    id_counts = Counter(item.id for item in items)
    duplicate_ids = {item_id for item_id, count in id_counts.items() if count > 1}
    if duplicate_ids:
        validation_report['valid'] = False
        validation_report['issues'].append(f"Duplicate item IDs found: {duplicate_ids}")

    # Complexity distribution
    complexity_counts = dict(Counter(item.complexity_level for item in items))
    validation_report['statistics']['complexity_distribution'] = complexity_counts

    # Source distribution
    source_counts = dict(Counter(item.source_dataset for item in items))
    validation_report['statistics']['source_distribution'] = source_counts

    # Check for balanced distribution
    if len(complexity_counts) < 3:
        validation_report['warnings'].append("Not all complexity levels represented")

    # Content length statistics
    content_lengths = [len(item.medical_content) for item in items]
    avg_length = sum(content_lengths) / len(content_lengths)
    validation_report['statistics']['content_length'] = {
        'average': avg_length,
        'minimum': min(content_lengths),
        'maximum': max(content_lengths)
    }

    if avg_length < 50:
        validation_report['warnings'].append("Average content length is quite short")
    elif avg_length > 2000:
        validation_report['warnings'].append("Average content length is quite long")

    # Items with almost no content are treated as hard failures
    short_content_items = [
        item.id for item in items
        if len(item.medical_content.strip()) < 20
    ]
    if short_content_items:
        validation_report['valid'] = False
        validation_report['issues'].append(f"Items with very short content: {short_content_items}")

    return validation_report


def print_dataset_statistics(items: List['MEQBenchItem']) -> None:
    """Print a human-readable summary of the dataset to stdout.

    Shows total item count plus complexity, source, and content-length
    breakdowns. Prints a short message and returns early for an empty list.

    Args:
        items: Benchmark items to summarize.
    """
    from collections import Counter

    if not items:
        print("Dataset is empty")
        return

    total = len(items)
    print("\nπŸ“Š Dataset Statistics")
    print("=" * 50)
    print(f"Total items: {total}")

    # Complexity distribution (fixed level order for stable output)
    complexity_counts = Counter(item.complexity_level for item in items)
    print("\n🎯 Complexity Distribution:")
    for level in ['basic', 'intermediate', 'advanced']:
        count = complexity_counts.get(level, 0)
        percentage = (count / total) * 100
        print(f"  {level.capitalize():<12}: {count:>4} items ({percentage:>5.1f}%)")

    # Source distribution (sorted for stable output)
    source_counts = Counter(item.source_dataset for item in items)
    print("\nπŸ“š Source Distribution:")
    for source, count in sorted(source_counts.items()):
        percentage = (count / total) * 100
        print(f"  {source:<15}: {count:>4} items ({percentage:>5.1f}%)")

    # Content length statistics
    content_lengths = [len(item.medical_content) for item in items]
    avg_length = sum(content_lengths) / len(content_lengths)
    print("\nπŸ“ Content Length Statistics:")
    print(f"  Average: {avg_length:>6.1f} characters")
    print(f"  Minimum: {min(content_lengths):>6} characters")
    print(f"  Maximum: {max(content_lengths):>6} characters")
def _load_dataset_safely(display_name, loader, data_path, max_items, auto_complexity):
    """Load one dataset with the shared error-handling/logging policy.

    Returns the loaded items, or an empty list when loading fails; the error
    is logged and processing continues with the other datasets. Extracted
    because main() previously repeated this try/except block three times.
    """
    logger.info(f"Loading {display_name} from: {data_path}")
    try:
        items = loader(data_path, max_items=max_items, auto_complexity=auto_complexity)
        logger.info(f"Loaded {len(items)} {display_name} items")
        return items
    except Exception as e:
        logger.error(f"Failed to load {display_name}: {e}")
        if Path(data_path).exists():
            logger.error("File exists but failed to load. Check file format.")
        else:
            logger.error("File not found. Check the file path.")
        return []


def main():
    """Main processing function.

    Parses CLI arguments, loads the requested datasets, optionally balances
    the complexity distribution and validates the result, then saves the
    combined benchmark. Exits with status 1 on fatal errors.
    """
    import random  # single local import (was duplicated mid-function before)

    parser = setup_argument_parser()
    args = parser.parse_args()

    # Set logging level
    if args.verbose:
        logging.getLogger().setLevel(logging.DEBUG)

    # Set random seed for reproducibility
    random.seed(args.seed)

    logger.info("Starting MEQ-Bench dataset processing")
    logger.info(f"Target total items: {args.max_items}")

    # Check which datasets are provided
    datasets_to_load = []
    if args.medqa:
        datasets_to_load.append('medqa')
    if args.icliniq:
        datasets_to_load.append('icliniq')
    if args.cochrane:
        datasets_to_load.append('cochrane')

    if not datasets_to_load:
        logger.error("No dataset files provided. Use --medqa, --icliniq, or --cochrane to specify input files.")
        sys.exit(1)

    # Per-dataset quotas: explicit CLI overrides win, otherwise split evenly
    if args.medqa_items or args.icliniq_items or args.cochrane_items:
        dataset_limits = {}
        if args.medqa_items:
            dataset_limits['medqa'] = args.medqa_items
        if args.icliniq_items:
            dataset_limits['icliniq'] = args.icliniq_items
        if args.cochrane_items:
            dataset_limits['cochrane'] = args.cochrane_items
    else:
        dataset_limits = calculate_dataset_limits(args.max_items, len(datasets_to_load))

    logger.info(f"Dataset limits: {dataset_limits}")

    # Load each requested dataset through the shared helper
    loaders = {
        'medqa': ('MedQA-USMLE', load_medqa_usmle, args.medqa),
        'icliniq': ('iCliniq', load_icliniq, args.icliniq),
        'cochrane': ('Cochrane Reviews', load_cochrane_reviews, args.cochrane),
    }
    all_items = []
    for key in datasets_to_load:
        display_name, loader, data_path = loaders[key]
        all_items.extend(
            _load_dataset_safely(
                display_name, loader, data_path,
                dataset_limits.get(key), args.auto_complexity
            )
        )

    if not all_items:
        logger.error("No items were successfully loaded from any dataset")
        sys.exit(1)

    logger.info(f"Total items loaded: {len(all_items)}")

    # Balance complexity distribution if requested (skipped for tiny datasets)
    if args.balance_complexity and len(all_items) > 10:
        logger.info("Balancing complexity distribution...")
        all_items = balance_complexity_distribution(all_items)

    # Limit to max_items if necessary
    if len(all_items) > args.max_items:
        logger.info(f"Limiting dataset to {args.max_items} items")
        all_items = random.sample(all_items, args.max_items)

    # Validate dataset if requested
    if args.validate:
        logger.info("Validating dataset...")
        validation_report = validate_dataset(all_items)

        if validation_report['valid']:
            logger.info("βœ… Dataset validation passed")
        else:
            logger.error("❌ Dataset validation failed")
            for issue in validation_report['issues']:
                logger.error(f"  Issue: {issue}")

        for warning in validation_report['warnings']:
            logger.warning(f"  Warning: {warning}")

    # Save the combined dataset
    logger.info(f"Saving combined dataset to: {args.output}")
    try:
        save_benchmark_items(all_items, args.output, pretty_print=True)
        logger.info(f"βœ… Successfully saved {len(all_items)} items to {args.output}")
    except Exception as e:
        logger.error(f"Failed to save dataset: {e}")
        sys.exit(1)

    # Show statistics if requested
    if args.stats:
        print_dataset_statistics(all_items)

    # Summary
    logger.info("\nπŸŽ‰ Dataset processing completed successfully!")
    logger.info(f"Final dataset: {len(all_items)} items saved to {args.output}")


if __name__ == "__main__":
    main()
This module provides data loading functionality for integrating external medical -datasets into the MEQ-Bench framework. It includes loaders for popular datasets -like MedQuAD, HealthSearchQA, and provides standardized conversion to MEQBenchItem objects. +datasets into the MEQ-Bench framework. It includes loaders for popular medical datasets +and provides standardized conversion to MEQBenchItem objects with automatic complexity +stratification based on Flesch-Kincaid readability scores. The module ensures consistent data formatting and validation across different dataset sources, making it easy to extend MEQ-Bench with new data sources. @@ -10,28 +11,43 @@ Supported Datasets: - MedQuAD: Medical Question Answering Dataset - HealthSearchQA: Health Search Question Answering Dataset + - MedQA-USMLE: Medical Question Answering based on USMLE exams + - iCliniq: Clinical question answering dataset + - Cochrane Reviews: Evidence-based medical reviews + +Complexity Stratification: + - Uses Flesch-Kincaid Grade Level scores to automatically categorize content + - Basic: FK score <= 8 (elementary/middle school level) + - Intermediate: FK score 9-12 (high school level) + - Advanced: FK score > 12 (college/professional level) Example: ```python - from data_loaders import load_medquad, load_healthsearchqa + from data_loaders import load_medqa_usmle, load_icliniq, load_cochrane_reviews - # Load different datasets - medquad_items = load_medquad('path/to/medquad.json') - healthsearch_items = load_healthsearchqa('path/to/healthsearchqa.json') + # Load different datasets with automatic complexity stratification + medqa_items = load_medqa_usmle('path/to/medqa.json', max_items=300) + icliniq_items = load_icliniq('path/to/icliniq.json', max_items=400) + cochrane_items = load_cochrane_reviews('path/to/cochrane.json', max_items=300) - # Add to benchmark - bench = MEQBench() - for item in medquad_items + healthsearch_items: - bench.add_benchmark_item(item) + # Combine and save as benchmark 
dataset + all_items = medqa_items + icliniq_items + cochrane_items + save_benchmark_items(all_items, 'data/benchmark_items.json') ``` """ import json import logging +import re from pathlib import Path from typing import List, Dict, Any, Optional, Union -from benchmark import MEQBenchItem +try: + import textstat +except ImportError: + textstat = None + +from .benchmark import MEQBenchItem logger = logging.getLogger('meq_bench.data_loaders') @@ -393,6 +409,532 @@ def save_benchmark_items( logger.info(f"Saved {len(items)} benchmark items to: {output_file}") +def calculate_complexity_level(text: str) -> str: + """Calculate complexity level based on Flesch-Kincaid Grade Level. + + Uses textstat library to compute Flesch-Kincaid Grade Level and categorizes + the text into basic, intermediate, or advanced complexity levels. + + Args: + text: Text content to analyze for complexity. + + Returns: + Complexity level string: 'basic', 'intermediate', or 'advanced' + + Raises: + ValueError: If text is empty or invalid. 
+ + Example: + ```python + complexity = calculate_complexity_level("This is simple text.") + # Returns: 'basic' (if FK score <= 8) + ``` + """ + if not text or not isinstance(text, str): + raise ValueError("Text must be a non-empty string") + + text = text.strip() + if not text: + raise ValueError("Text cannot be empty or whitespace only") + + # Clean text for analysis - remove extra whitespace and normalize + cleaned_text = ' '.join(text.split()) + + # Fallback if textstat is not available + if textstat is None: + logger.warning("textstat library not available, using fallback complexity calculation") + return _calculate_complexity_fallback(cleaned_text) + + try: + # Calculate Flesch-Kincaid Grade Level + fk_score = textstat.flesch_kincaid().grade(cleaned_text) + + # Categorize based on grade level + if fk_score <= 8: + return 'basic' + elif fk_score <= 12: + return 'intermediate' + else: + return 'advanced' + + except Exception as e: + logger.warning(f"Error calculating Flesch-Kincaid score: {e}, using fallback") + return _calculate_complexity_fallback(cleaned_text) + + +def _calculate_complexity_fallback(text: str) -> str: + """Fallback complexity calculation when textstat is unavailable. + + Uses simple heuristics based on sentence length, word length, and + medical terminology density as approximations. + + Args: + text: Cleaned text to analyze. 
+ + Returns: + Complexity level: 'basic', 'intermediate', or 'advanced' + """ + # Count sentences (approximate) + sentences = len(re.split(r'[.!?]+', text)) + if sentences == 0: + sentences = 1 + + # Count words + words = len(text.split()) + if words == 0: + return 'basic' + + # Calculate average words per sentence + avg_words_per_sentence = words / sentences + + # Count syllables (rough approximation) + syllable_count = 0 + for word in text.split(): + # Simple syllable counting heuristic + vowels = 'aeiouyAEIOUY' + word = re.sub(r'[^a-zA-Z]', '', word) + if word: + syllables = len(re.findall(r'[aeiouyAEIOUY]+', word)) + if syllables == 0: + syllables = 1 + syllable_count += syllables + + avg_syllables_per_word = syllable_count / words if words > 0 else 1 + + # Check for medical terminology (indicator of higher complexity) + medical_terms = [ + 'diagnosis', 'treatment', 'therapy', 'syndrome', 'pathology', + 'etiology', 'prognosis', 'medication', 'prescription', 'dosage', + 'contraindication', 'adverse', 'efficacy', 'pharmacology', + 'clinical', 'therapeutic', 'intervention' + ] + + medical_term_count = sum(1 for term in medical_terms if term.lower() in text.lower()) + medical_density = medical_term_count / words * 100 # percentage + + # Simple scoring algorithm + complexity_score = 0 + complexity_score += avg_words_per_sentence * 0.5 + complexity_score += avg_syllables_per_word * 3 + complexity_score += medical_density * 0.3 + + if complexity_score <= 8: + return 'basic' + elif complexity_score <= 15: + return 'intermediate' + else: + return 'advanced' + + +def load_medqa_usmle( + data_path: Union[str, Path], + max_items: Optional[int] = None, + auto_complexity: bool = True +) -> List[MEQBenchItem]: + """Load MedQA-USMLE dataset and convert to MEQBenchItem objects. + + The MedQA-USMLE dataset contains medical questions based on USMLE exam format + with multiple choice questions and explanations. 
This loader processes the + dataset and optionally applies automatic complexity stratification. + + Args: + data_path: Path to the MedQA-USMLE JSON file. + max_items: Maximum number of items to load. If None, loads all items. + auto_complexity: Whether to automatically calculate complexity levels + using Flesch-Kincaid scores. If False, assigns 'intermediate' to all. + + Returns: + List of MEQBenchItem objects converted from MedQA-USMLE data. + + Raises: + FileNotFoundError: If the data file does not exist. + json.JSONDecodeError: If the JSON file is malformed. + ValueError: If the data format is invalid. + + Example: + ```python + # Load with automatic complexity calculation + items = load_medqa_usmle('data/medqa_usmle.json', max_items=300) + + # Load without complexity calculation (all marked as intermediate) + items = load_medqa_usmle('data/medqa_usmle.json', auto_complexity=False) + ``` + """ + data_file = Path(data_path) + + if not data_file.exists(): + raise FileNotFoundError(f"MedQA-USMLE file not found: {data_file}") + + logger.info(f"Loading MedQA-USMLE dataset from: {data_file}") + + try: + with open(data_file, 'r', encoding='utf-8') as f: + data = json.load(f) + except json.JSONDecodeError as e: + raise json.JSONDecodeError( + f"Invalid JSON in MedQA-USMLE file: {e}", + e.doc, e.pos + ) + + if not isinstance(data, list): + raise ValueError("MedQA-USMLE data must be a list of items") + + items = [] + items_to_process = data[:max_items] if max_items else data + + logger.info(f"Processing {len(items_to_process)} MedQA-USMLE items") + + for i, item_data in enumerate(items_to_process): + try: + if not isinstance(item_data, dict): + logger.warning(f"Skipping invalid MedQA-USMLE item {i}: not a dictionary") + continue + + # Extract required fields - MedQA typically has 'question', 'options', 'answer', 'explanation' + question = item_data.get('question', '') + options = item_data.get('options', {}) + answer = item_data.get('answer', '') + explanation = 
item_data.get('explanation', '') + item_id = item_data.get('id', f"medqa_usmle_{i}") + + if not question.strip(): + logger.warning(f"Skipping MedQA-USMLE item {i}: empty question") + continue + + # Format options if available + options_text = "" + if isinstance(options, dict): + options_text = "\n".join([f"{k}. {v}" for k, v in options.items() if v]) + elif isinstance(options, list): + options_text = "\n".join([f"{chr(65+j)}. {opt}" for j, opt in enumerate(options) if opt]) + + # Create comprehensive medical content + medical_content_parts = [f"Question: {question.strip()}"] + + if options_text: + medical_content_parts.append(f"Options:\n{options_text}") + + if answer.strip(): + medical_content_parts.append(f"Correct Answer: {answer.strip()}") + + if explanation.strip(): + medical_content_parts.append(f"Explanation: {explanation.strip()}") + + medical_content = "\n\n".join(medical_content_parts) + + # Calculate complexity level + if auto_complexity: + try: + complexity_level = calculate_complexity_level(medical_content) + except Exception as e: + logger.warning(f"Error calculating complexity for item {i}: {e}, using 'intermediate'") + complexity_level = 'intermediate' + else: + complexity_level = 'intermediate' + + # Create MEQBenchItem + item = MEQBenchItem( + id=str(item_id), + medical_content=medical_content, + complexity_level=complexity_level, + source_dataset='MedQA-USMLE', + reference_explanations=None + ) + + # Validate the item + _validate_benchmark_item(item) + + items.append(item) + + except Exception as e: + logger.error(f"Error processing MedQA-USMLE item {i}: {e}") + continue + + logger.info(f"Successfully loaded {len(items)} MEQBenchItem objects from MedQA-USMLE") + + if len(items) == 0: + logger.warning("No valid items were loaded from MedQA-USMLE dataset") + else: + # Log complexity distribution + complexity_dist = {} + for item in items: + complexity_dist[item.complexity_level] = complexity_dist.get(item.complexity_level, 0) + 1 + + avg_length = 
sum(len(item.medical_content) for item in items) / len(items) + logger.info(f"MedQA-USMLE dataset statistics:") + logger.info(f" - Total items: {len(items)}") + logger.info(f" - Average content length: {avg_length:.1f} characters") + logger.info(f" - Complexity distribution: {complexity_dist}") + + return items + + +def load_icliniq( + data_path: Union[str, Path], + max_items: Optional[int] = None, + auto_complexity: bool = True +) -> List[MEQBenchItem]: + """Load iCliniq dataset and convert to MEQBenchItem objects. + + The iCliniq dataset contains real clinical questions from patients and + answers from medical professionals. This loader processes the dataset + and optionally applies automatic complexity stratification. + + Args: + data_path: Path to the iCliniq JSON file. + max_items: Maximum number of items to load. If None, loads all items. + auto_complexity: Whether to automatically calculate complexity levels. + + Returns: + List of MEQBenchItem objects converted from iCliniq data. + + Raises: + FileNotFoundError: If the data file does not exist. + json.JSONDecodeError: If the JSON file is malformed. + ValueError: If the data format is invalid. 
+ + Example: + ```python + # Load iCliniq dataset with complexity stratification + items = load_icliniq('data/icliniq.json', max_items=400) + ``` + """ + data_file = Path(data_path) + + if not data_file.exists(): + raise FileNotFoundError(f"iCliniq file not found: {data_file}") + + logger.info(f"Loading iCliniq dataset from: {data_file}") + + try: + with open(data_file, 'r', encoding='utf-8') as f: + data = json.load(f) + except json.JSONDecodeError as e: + raise json.JSONDecodeError( + f"Invalid JSON in iCliniq file: {e}", + e.doc, e.pos + ) + + if not isinstance(data, list): + raise ValueError("iCliniq data must be a list of items") + + items = [] + items_to_process = data[:max_items] if max_items else data + + logger.info(f"Processing {len(items_to_process)} iCliniq items") + + for i, item_data in enumerate(items_to_process): + try: + if not isinstance(item_data, dict): + logger.warning(f"Skipping invalid iCliniq item {i}: not a dictionary") + continue + + # Extract fields - iCliniq typically has 'patient_question', 'doctor_answer', 'speciality' + patient_question = item_data.get('patient_question', item_data.get('question', '')) + doctor_answer = item_data.get('doctor_answer', item_data.get('answer', '')) + specialty = item_data.get('speciality', item_data.get('specialty', '')) + item_id = item_data.get('id', f"icliniq_{i}") + + if not patient_question.strip() or not doctor_answer.strip(): + logger.warning(f"Skipping iCliniq item {i}: empty question or answer") + continue + + # Create medical content + medical_content_parts = [f"Patient Question: {patient_question.strip()}"] + + if specialty.strip(): + medical_content_parts.append(f"Medical Specialty: {specialty.strip()}") + + medical_content_parts.append(f"Doctor's Answer: {doctor_answer.strip()}") + + medical_content = "\n\n".join(medical_content_parts) + + # Calculate complexity level + if auto_complexity: + try: + complexity_level = calculate_complexity_level(medical_content) + except Exception as e: + 
logger.warning(f"Error calculating complexity for item {i}: {e}, using 'basic'") + complexity_level = 'basic' + else: + complexity_level = 'basic' # iCliniq tends to be more patient-focused + + # Create MEQBenchItem + item = MEQBenchItem( + id=str(item_id), + medical_content=medical_content, + complexity_level=complexity_level, + source_dataset='iCliniq', + reference_explanations=None + ) + + # Validate the item + _validate_benchmark_item(item) + + items.append(item) + + except Exception as e: + logger.error(f"Error processing iCliniq item {i}: {e}") + continue + + logger.info(f"Successfully loaded {len(items)} MEQBenchItem objects from iCliniq") + + if len(items) == 0: + logger.warning("No valid items were loaded from iCliniq dataset") + else: + # Log complexity distribution + complexity_dist = {} + for item in items: + complexity_dist[item.complexity_level] = complexity_dist.get(item.complexity_level, 0) + 1 + + avg_length = sum(len(item.medical_content) for item in items) / len(items) + logger.info(f"iCliniq dataset statistics:") + logger.info(f" - Total items: {len(items)}") + logger.info(f" - Average content length: {avg_length:.1f} characters") + logger.info(f" - Complexity distribution: {complexity_dist}") + + return items + + +def load_cochrane_reviews( + data_path: Union[str, Path], + max_items: Optional[int] = None, + auto_complexity: bool = True +) -> List[MEQBenchItem]: + """Load Cochrane Reviews dataset and convert to MEQBenchItem objects. + + The Cochrane Reviews dataset contains evidence-based medical reviews and + systematic analyses. This loader processes the dataset and optionally + applies automatic complexity stratification. + + Args: + data_path: Path to the Cochrane Reviews JSON file. + max_items: Maximum number of items to load. If None, loads all items. + auto_complexity: Whether to automatically calculate complexity levels. + + Returns: + List of MEQBenchItem objects converted from Cochrane Reviews data. 
+ + Raises: + FileNotFoundError: If the data file does not exist. + json.JSONDecodeError: If the JSON file is malformed. + ValueError: If the data format is invalid. + + Example: + ```python + # Load Cochrane Reviews with complexity stratification + items = load_cochrane_reviews('data/cochrane.json', max_items=300) + ``` + """ + data_file = Path(data_path) + + if not data_file.exists(): + raise FileNotFoundError(f"Cochrane Reviews file not found: {data_file}") + + logger.info(f"Loading Cochrane Reviews dataset from: {data_file}") + + try: + with open(data_file, 'r', encoding='utf-8') as f: + data = json.load(f) + except json.JSONDecodeError as e: + raise json.JSONDecodeError( + f"Invalid JSON in Cochrane Reviews file: {e}", + e.doc, e.pos + ) + + if not isinstance(data, list): + raise ValueError("Cochrane Reviews data must be a list of items") + + items = [] + items_to_process = data[:max_items] if max_items else data + + logger.info(f"Processing {len(items_to_process)} Cochrane Reviews items") + + for i, item_data in enumerate(items_to_process): + try: + if not isinstance(item_data, dict): + logger.warning(f"Skipping invalid Cochrane Reviews item {i}: not a dictionary") + continue + + # Extract fields - Cochrane typically has 'title', 'abstract', 'conclusions', 'background' + title = item_data.get('title', '') + abstract = item_data.get('abstract', '') + conclusions = item_data.get('conclusions', item_data.get('main_results', '')) + background = item_data.get('background', item_data.get('objectives', '')) + item_id = item_data.get('id', f"cochrane_{i}") + + if not title.strip() and not abstract.strip(): + logger.warning(f"Skipping Cochrane Reviews item {i}: no title or abstract") + continue + + # Create medical content from available fields + medical_content_parts = [] + + if title.strip(): + medical_content_parts.append(f"Title: {title.strip()}") + + if background.strip(): + medical_content_parts.append(f"Background: {background.strip()}") + + if 
abstract.strip(): + medical_content_parts.append(f"Abstract: {abstract.strip()}") + + if conclusions.strip(): + medical_content_parts.append(f"Conclusions: {conclusions.strip()}") + + if not medical_content_parts: + logger.warning(f"Skipping Cochrane Reviews item {i}: no valid content") + continue + + medical_content = "\n\n".join(medical_content_parts) + + # Calculate complexity level + if auto_complexity: + try: + complexity_level = calculate_complexity_level(medical_content) + except Exception as e: + logger.warning(f"Error calculating complexity for item {i}: {e}, using 'advanced'") + complexity_level = 'advanced' + else: + complexity_level = 'advanced' # Cochrane reviews tend to be more technical + + # Create MEQBenchItem + item = MEQBenchItem( + id=str(item_id), + medical_content=medical_content, + complexity_level=complexity_level, + source_dataset='Cochrane Reviews', + reference_explanations=None + ) + + # Validate the item + _validate_benchmark_item(item) + + items.append(item) + + except Exception as e: + logger.error(f"Error processing Cochrane Reviews item {i}: {e}") + continue + + logger.info(f"Successfully loaded {len(items)} MEQBenchItem objects from Cochrane Reviews") + + if len(items) == 0: + logger.warning("No valid items were loaded from Cochrane Reviews dataset") + else: + # Log complexity distribution + complexity_dist = {} + for item in items: + complexity_dist[item.complexity_level] = complexity_dist.get(item.complexity_level, 0) + 1 + + avg_length = sum(len(item.medical_content) for item in items) / len(items) + logger.info(f"Cochrane Reviews dataset statistics:") + logger.info(f" - Total items: {len(items)}") + logger.info(f" - Average content length: {avg_length:.1f} characters") + logger.info(f" - Complexity distribution: {complexity_dist}") + + return items + + def _validate_benchmark_item(item: MEQBenchItem) -> None: """Validate a MEQBenchItem object for basic requirements. 
diff --git a/src/evaluator.py b/src/evaluator.py index 1d2d8ef..3240983 100644 --- a/src/evaluator.py +++ b/src/evaluator.py @@ -67,6 +67,9 @@ class EvaluationScore: safety: float coverage: float quality: float + contradiction: float + information_preservation: float + hallucination: float overall: float details: Optional[Dict[str, Any]] = None @@ -78,6 +81,9 @@ def to_dict(self) -> Dict[str, Any]: 'safety': self.safety, 'coverage': self.coverage, 'quality': self.quality, + 'contradiction': self.contradiction, + 'information_preservation': self.information_preservation, + 'hallucination': self.hallucination, 'overall': self.overall, 'details': self.details or {} } @@ -117,7 +123,7 @@ def calculate(self, text: str, audience: str, **kwargs) -> float: # Get grade level using textstat try: - grade_level = textstat.flesch_kincaid().score(text) + grade_level = textstat.flesch_kincaid().grade(text) except Exception as e: logger.error(f"Error calculating Flesch-Kincaid score: {e}") # Fallback: estimate based on sentence length @@ -306,6 +312,375 @@ def _calculate_word_overlap(self, original: str, generated: str) -> float: return coverage +class ContradictionDetection: + """Medical contradiction detection against knowledge base""" + + def __init__(self) -> None: + self.medical_knowledge_base = self._load_medical_knowledge() + self.contradiction_patterns = self._load_contradiction_patterns() + logger.debug("Initialized ContradictionDetection") + + def _load_medical_knowledge(self) -> Dict[str, List[str]]: + """Load basic medical knowledge base for contradiction detection""" + # This would ideally load from a comprehensive medical knowledge base + # For now, using a simplified version with common medical facts + return { + 'hypertension': [ + 'high blood pressure', + 'systolic over 140 or diastolic over 90', + 'can lead to heart disease and stroke', + 'treated with lifestyle changes and medication' + ], + 'diabetes': [ + 'high blood sugar', + 'insulin resistance or 
deficiency', + 'requires blood sugar monitoring', + 'managed with diet, exercise, and medication' + ], + 'antibiotics': [ + 'treat bacterial infections', + 'do not work against viruses', + 'should be taken as prescribed', + 'resistance can develop from misuse' + ], + 'aspirin': [ + 'pain reliever and blood thinner', + 'can cause stomach bleeding', + 'contraindicated with certain conditions', + 'requires medical supervision for daily use' + ] + } + + def _load_contradiction_patterns(self) -> List[Dict[str, str]]: + """Load patterns that indicate medical contradictions""" + return [ + { + 'pattern': r'antibiotics.*treat.*virus', + 'description': 'Antibiotics do not treat viral infections' + }, + { + 'pattern': r'stop.*medication.*immediately.*feel.*better', + 'description': 'Medications should not be stopped without medical advice' + }, + { + 'pattern': r'aspirin.*safe.*everyone', + 'description': 'Aspirin has contraindications and side effects' + }, + { + 'pattern': r'blood pressure.*normal.*140/90', + 'description': '140/90 is not normal blood pressure' + } + ] + + def calculate(self, text: str, audience: str, original: str = "", **kwargs) -> float: + """ + Detect contradictions against medical knowledge base + + Args: + text: Generated explanation text + audience: Target audience + original: Original medical information + **kwargs: Additional parameters + + Returns: + Contradiction score (0-1, where 1 means no contradictions) + """ + try: + if not text.strip(): + logger.warning("Empty text provided for contradiction detection") + return 0.5 + + text_lower = text.lower() + contradiction_count = 0 + + # Check for pattern-based contradictions + for pattern_info in self.contradiction_patterns: + pattern = pattern_info['pattern'] + if re.search(pattern, text_lower): + contradiction_count += 1 + logger.warning(f"Contradiction detected: {pattern_info['description']}") + + # Check for factual contradictions against knowledge base + for topic, facts in 
self.medical_knowledge_base.items(): + if topic in text_lower: + # Simple contradiction detection: look for negations of known facts + for fact in facts: + # Check for explicit contradictions + contradiction_patterns = [ + f"not {fact}", + f"never {fact}", + f"don't {fact}", + f"doesn't {fact}" + ] + for contradiction in contradiction_patterns: + if contradiction in text_lower: + contradiction_count += 1 + logger.warning(f"Knowledge base contradiction: {contradiction}") + + # Calculate score (higher is better, no contradictions = 1.0) + if contradiction_count == 0: + score = 1.0 + else: + # Penalize based on number of contradictions + score = max(0.0, 1.0 - (contradiction_count * 0.3)) + + logger.debug(f"Contradiction score: {score:.3f} ({contradiction_count} contradictions found)") + return score + + except Exception as e: + logger.error(f"Error in contradiction detection: {e}") + raise EvaluationError(f"Contradiction detection failed: {e}") + + +class InformationPreservation: + """Check preservation of critical medical information""" + + def __init__(self) -> None: + self.critical_info_patterns = self._load_critical_patterns() + logger.debug("Initialized InformationPreservation") + + def _load_critical_patterns(self) -> Dict[str, List[str]]: + """Load patterns for critical medical information""" + return { + 'dosages': [ + r'\d+\s*mg', + r'\d+\s*ml', + r'\d+\s*units', + r'\d+\s*tablets?', + r'\d+\s*times?\s*(?:per\s+)?day', + r'once\s+daily', + r'twice\s+daily', + r'every\s+\d+\s+hours' + ], + 'warnings': [ + r'do not', + r'avoid', + r'contraindicated', + r'warning', + r'caution', + r'side effects?', + r'adverse', + r'allergic', + r'emergency' + ], + 'timing': [ + r'before\s+meals?', + r'after\s+meals?', + r'with\s+food', + r'on\s+empty\s+stomach', + r'bedtime', + r'morning', + r'evening' + ], + 'conditions': [ + r'if\s+pregnant', + r'if\s+breastfeeding', + r'kidney\s+disease', + r'liver\s+disease', + r'heart\s+condition', + r'diabetes', + 
r'high\s+blood\s+pressure' + ] + } + + def calculate(self, text: str, audience: str, original: str = "", **kwargs) -> float: + """ + Check if critical information is preserved from original to generated text + + Args: + text: Generated explanation text + audience: Target audience + original: Original medical information + **kwargs: Additional parameters + + Returns: + Information preservation score (0-1) + """ + try: + if not text.strip() or not original.strip(): + logger.warning("Empty text or original provided for information preservation") + return 0.0 + + original_lower = original.lower() + text_lower = text.lower() + + total_critical_info = 0 + preserved_critical_info = 0 + + # Check each category of critical information + for category, patterns in self.critical_info_patterns.items(): + for pattern in patterns: + # Find all matches in original text + original_matches = re.findall(pattern, original_lower) + + if original_matches: + total_critical_info += len(original_matches) + + # Check if these matches are preserved in generated text + for match in original_matches: + if match in text_lower or any(re.search(pattern, text_lower) for pattern in [match]): + preserved_critical_info += 1 + else: + # Check for paraphrased versions + if self._check_paraphrased_preservation(match, text_lower, category): + preserved_critical_info += 1 + + # Calculate preservation score + if total_critical_info == 0: + # No critical information to preserve + score = 1.0 + else: + score = preserved_critical_info / total_critical_info + + logger.debug(f"Information preservation score: {score:.3f} " + f"({preserved_critical_info}/{total_critical_info} preserved)") + return score + + except Exception as e: + logger.error(f"Error in information preservation check: {e}") + raise EvaluationError(f"Information preservation check failed: {e}") + + def _check_paraphrased_preservation(self, original_info: str, generated_text: str, category: str) -> bool: + """Check if information is preserved in 
paraphrased form""" + # Simple paraphrase detection for different categories + if category == 'dosages': + # Check if any dosage information is present + dosage_patterns = [r'\d+', r'dose', r'amount', r'quantity'] + return any(re.search(pattern, generated_text) for pattern in dosage_patterns) + + elif category == 'warnings': + # Check if warning language is present + warning_patterns = [r'careful', r'important', r'note', r'remember', r'consult'] + return any(re.search(pattern, generated_text) for pattern in warning_patterns) + + elif category == 'timing': + # Check if timing information is preserved + timing_patterns = [r'when', r'time', r'schedule', r'take'] + return any(re.search(pattern, generated_text) for pattern in timing_patterns) + + elif category == 'conditions': + # Check if condition information is preserved + condition_patterns = [r'condition', r'disease', r'illness', r'medical'] + return any(re.search(pattern, generated_text) for pattern in condition_patterns) + + return False + + +class HallucinationDetection: + """Detect hallucinated medical entities not present in source text""" + + def __init__(self) -> None: + self.medical_entities = self._load_medical_entities() + try: + # Try to load spaCy model for NER + import spacy + self.nlp = spacy.load("en_core_web_sm") + logger.debug("Loaded spaCy model for NER") + except Exception as e: + logger.warning(f"Could not load spaCy model: {e}") + self.nlp = None + + logger.debug("Initialized HallucinationDetection") + + def _load_medical_entities(self) -> Dict[str, List[str]]: + """Load common medical entities for hallucination detection""" + return { + 'medications': [ + 'aspirin', 'ibuprofen', 'acetaminophen', 'paracetamol', 'insulin', + 'metformin', 'lisinopril', 'atorvastatin', 'omeprazole', 'albuterol', + 'prednisone', 'warfarin', 'digoxin', 'furosemide', 'levothyroxine' + ], + 'conditions': [ + 'hypertension', 'diabetes', 'asthma', 'copd', 'pneumonia', + 'bronchitis', 'arthritis', 'osteoporosis', 
'depression', 'anxiety', + 'migraine', 'epilepsy', 'cancer', 'stroke', 'heart attack' + ], + 'symptoms': [ + 'fever', 'cough', 'headache', 'nausea', 'vomiting', 'diarrhea', + 'constipation', 'fatigue', 'dizziness', 'chest pain', 'shortness of breath', + 'swelling', 'rash', 'itching', 'numbness', 'tingling' + ], + 'body_parts': [ + 'heart', 'lungs', 'liver', 'kidney', 'brain', 'stomach', + 'intestine', 'bladder', 'pancreas', 'thyroid', 'spine', + 'joints', 'muscles', 'blood vessels', 'nerves' + ] + } + + def calculate(self, text: str, audience: str, original: str = "", **kwargs) -> float: + """ + Detect medical entities in generated text that are not in source + + Args: + text: Generated explanation text + audience: Target audience + original: Original medical information + **kwargs: Additional parameters + + Returns: + Hallucination score (0-1, where 1 means no hallucinations) + """ + try: + if not text.strip() or not original.strip(): + logger.warning("Empty text or original provided for hallucination detection") + return 0.5 + + # Extract medical entities from both texts + original_entities = self._extract_medical_entities(original) + generated_entities = self._extract_medical_entities(text) + + # Find entities in generated text that are not in original + hallucinated_entities = generated_entities - original_entities + + # Calculate hallucination score + total_generated_entities = len(generated_entities) + if total_generated_entities == 0: + score = 1.0 # No entities generated, no hallucinations + else: + hallucination_rate = len(hallucinated_entities) / total_generated_entities + score = max(0.0, 1.0 - hallucination_rate) + + if hallucinated_entities: + logger.warning(f"Hallucinated entities detected: {hallucinated_entities}") + + logger.debug(f"Hallucination score: {score:.3f} " + f"({len(hallucinated_entities)}/{total_generated_entities} hallucinated)") + return score + + except Exception as e: + logger.error(f"Error in hallucination detection: {e}") + raise 
EvaluationError(f"Hallucination detection failed: {e}") + + def _extract_medical_entities(self, text: str) -> set: + """Extract medical entities from text""" + entities = set() + text_lower = text.lower() + + # Extract entities from predefined lists + for category, entity_list in self.medical_entities.items(): + for entity in entity_list: + if entity.lower() in text_lower: + entities.add(entity.lower()) + + # Use spaCy NER if available + if self.nlp: + try: + doc = self.nlp(text) + for ent in doc.ents: + # Focus on medical-related entity types + if ent.label_ in ['PERSON', 'ORG', 'PRODUCT', 'SUBSTANCE']: + # Filter for likely medical entities + entity_text = ent.text.lower() + if any(medical_term in entity_text + for medical_terms in self.medical_entities.values() + for medical_term in medical_terms): + entities.add(entity_text) + except Exception as e: + logger.warning(f"spaCy NER failed: {e}") + + return entities + + class LLMJudge: """LLM-as-a-judge evaluator with full API integration""" @@ -502,6 +877,9 @@ def __init__(self, safety_checker: Optional[SafetyChecker] = None, coverage_analyzer: Optional[CoverageAnalyzer] = None, llm_judge: Optional[LLMJudge] = None, + contradiction_detector: Optional[ContradictionDetection] = None, + information_preservation: Optional[InformationPreservation] = None, + hallucination_detector: Optional[HallucinationDetection] = None, strategy_factory: Optional[StrategyFactory] = None) -> None: """ Initialize evaluator with dependency injection @@ -512,6 +890,9 @@ def __init__(self, safety_checker: Checker for safety compliance coverage_analyzer: Analyzer for information coverage llm_judge: LLM-based judge + contradiction_detector: Detector for medical contradictions + information_preservation: Checker for critical information preservation + hallucination_detector: Detector for hallucinated medical entities strategy_factory: Factory for audience strategies """ # Use dependency injection with sensible defaults @@ -523,6 +904,11 @@ def 
__init__(self, self.coverage_analyzer: CoverageAnalyzer = coverage_analyzer or CoverageAnalyzer() self.llm_judge: LLMJudge = llm_judge or LLMJudge() + # New safety and factual consistency metrics + self.contradiction_detector: ContradictionDetection = contradiction_detector or ContradictionDetection() + self.information_preservation: InformationPreservation = information_preservation or InformationPreservation() + self.hallucination_detector: HallucinationDetection = hallucination_detector or HallucinationDetection() + # Load scoring configuration self.scoring_config = config.get_scoring_config() # type: ignore[misc] self.weights: Dict[str, float] = self.scoring_config['weights'] @@ -590,6 +976,25 @@ def evaluate_explanation(self, original: str, generated: str, audience: str) -> logger.error(f"LLM judge failed: {e}") metrics['quality'] = 0.6 # Default reasonable score + # New safety and factual consistency metrics + try: + metrics['contradiction'] = self.contradiction_detector.calculate(generated, audience, original=original) + except Exception as e: + logger.error(f"Contradiction detection failed: {e}") + metrics['contradiction'] = 0.7 # Default reasonable score + + try: + metrics['information_preservation'] = self.information_preservation.calculate(generated, audience, original=original) + except Exception as e: + logger.error(f"Information preservation check failed: {e}") + metrics['information_preservation'] = 0.7 # Default reasonable score + + try: + metrics['hallucination'] = self.hallucination_detector.calculate(generated, audience, original=original) + except Exception as e: + logger.error(f"Hallucination detection failed: {e}") + metrics['hallucination'] = 0.7 # Default reasonable score + # Calculate weighted overall score overall = sum(metrics[metric] * self.weights[metric] for metric in metrics.keys()) @@ -606,7 +1011,8 @@ def evaluate_explanation(self, original: str, generated: str, audience: str) -> logger.info(f"Evaluation completed for {audience} in 
{evaluation_time:.2f}s") logger.debug(f"Scores - R:{metrics['readability']:.3f} T:{metrics['terminology']:.3f} " f"S:{metrics['safety']:.3f} C:{metrics['coverage']:.3f} Q:{metrics['quality']:.3f} " - f"Overall:{overall:.3f}") + f"CD:{metrics['contradiction']:.3f} IP:{metrics['information_preservation']:.3f} " + f"H:{metrics['hallucination']:.3f} Overall:{overall:.3f}") return EvaluationScore( readability=metrics['readability'], @@ -614,6 +1020,9 @@ def evaluate_explanation(self, original: str, generated: str, audience: str) -> safety=metrics['safety'], coverage=metrics['coverage'], quality=metrics['quality'], + contradiction=metrics['contradiction'], + information_preservation=metrics['information_preservation'], + hallucination=metrics['hallucination'], overall=overall, details=details ) diff --git a/src/leaderboard.py b/src/leaderboard.py new file mode 100644 index 0000000..eabf814 --- /dev/null +++ b/src/leaderboard.py @@ -0,0 +1,938 @@ +""" +Public leaderboard generation for MEQ-Bench evaluation results. + +This module provides functionality to generate a static HTML leaderboard from +MEQ-Bench evaluation results. The leaderboard displays overall scores for different +models with detailed breakdowns by audience type and complexity level. 
# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger('meq_bench.leaderboard')


class LeaderboardGenerator:
    """Generate a static HTML leaderboard from MEQ-Bench evaluation results."""

    def __init__(self):
        # Raw per-model result dicts loaded from JSON files.
        self.results_data: List[Dict[str, Any]] = []
        # Aggregate statistics computed over all loaded results.
        self.benchmark_stats: Dict[str, Any] = {}

    def load_results(self, results_dir: Path) -> None:
        """Load all JSON result files from a directory.

        Files missing any required top-level field are skipped with a
        warning; unreadable or malformed files are logged and skipped, so a
        single bad file never aborts the whole load.

        Args:
            results_dir: Directory containing JSON result files.

        Raises:
            FileNotFoundError: If ``results_dir`` does not exist.
            ValueError: If no ``*.json`` files are found, or none are valid.
        """
        if not results_dir.exists():
            raise FileNotFoundError(f"Results directory not found: {results_dir}")

        # Sort for a deterministic load (and therefore tie-break) order;
        # glob() yields files in arbitrary filesystem order.
        result_files = sorted(results_dir.glob("*.json"))
        if not result_files:
            raise ValueError(f"No JSON result files found in {results_dir}")

        logger.info(f"Found {len(result_files)} result files")

        # Fields every result file must carry to be usable downstream.
        required_fields = ('model_name', 'total_items', 'audience_scores', 'summary')

        for result_file in result_files:
            try:
                with open(result_file, 'r', encoding='utf-8') as f:
                    data = json.load(f)
            except Exception as e:
                logger.error(f"Error loading {result_file}: {e}")
                continue

            if all(field in data for field in required_fields):
                self.results_data.append(data)
                logger.debug(f"Loaded results for {data['model_name']}")
            else:
                logger.warning(f"Invalid result file format: {result_file}")

        if not self.results_data:
            raise ValueError("No valid result files were loaded")

        logger.info(f"Successfully loaded {len(self.results_data)} evaluation results")
evaluation results") + + def calculate_leaderboard_stats(self) -> Dict[str, Any]: + """Calculate overall leaderboard statistics + + Returns: + Dictionary containing leaderboard statistics + """ + if not self.results_data: + return {} + + # Calculate aggregate statistics + total_models = len(self.results_data) + total_evaluations = sum(result['total_items'] for result in self.results_data) + + # Audience coverage + all_audiences = set() + for result in self.results_data: + all_audiences.update(result['audience_scores'].keys()) + + # Complexity coverage + all_complexities = set() + for result in self.results_data: + if 'complexity_scores' in result: + all_complexities.update(result['complexity_scores'].keys()) + + # Performance ranges + overall_scores = [result['summary'].get('overall_mean', 0) for result in self.results_data] + if overall_scores: + best_score = max(overall_scores) + worst_score = min(overall_scores) + avg_score = sum(overall_scores) / len(overall_scores) + else: + best_score = worst_score = avg_score = 0 + + return { + 'total_models': total_models, + 'total_evaluations': total_evaluations, + 'audiences': sorted(list(all_audiences)), + 'complexity_levels': sorted(list(all_complexities)), + 'best_score': best_score, + 'worst_score': worst_score, + 'average_score': avg_score, + 'last_updated': datetime.now().isoformat() + } + + def rank_models(self) -> List[Dict[str, Any]]: + """Rank models by overall performance + + Returns: + List of model results sorted by overall performance + """ + ranked_models = sorted( + self.results_data, + key=lambda x: x['summary'].get('overall_mean', 0), + reverse=True + ) + + # Add ranking information + for i, model in enumerate(ranked_models): + model['rank'] = i + 1 + model['overall_score'] = model['summary'].get('overall_mean', 0) + + return ranked_models + + def generate_audience_breakdown(self, ranked_models: List[Dict[str, Any]]) -> Dict[str, List[Dict[str, Any]]]: + """Generate audience-specific performance 
breakdown + + Args: + ranked_models: List of ranked model results + + Returns: + Dictionary mapping audience to ranked model performance + """ + audience_breakdown = {} + + # Get all audiences + all_audiences = set() + for model in ranked_models: + all_audiences.update(model['audience_scores'].keys()) + + for audience in sorted(all_audiences): + audience_models = [] + + for model in ranked_models: + if audience in model['audience_scores']: + scores = model['audience_scores'][audience] + avg_score = sum(scores) / len(scores) if scores else 0 + + audience_models.append({ + 'model_name': model['model_name'], + 'score': avg_score, + 'num_items': len(scores) + }) + + # Sort by score for this audience + audience_models.sort(key=lambda x: x['score'], reverse=True) + + # Add rankings + for i, model in enumerate(audience_models): + model['rank'] = i + 1 + + audience_breakdown[audience] = audience_models + + return audience_breakdown + + def generate_complexity_breakdown(self, ranked_models: List[Dict[str, Any]]) -> Dict[str, List[Dict[str, Any]]]: + """Generate complexity-specific performance breakdown + + Args: + ranked_models: List of ranked model results + + Returns: + Dictionary mapping complexity level to ranked model performance + """ + complexity_breakdown = {} + + # Get all complexity levels + all_complexities = set() + for model in ranked_models: + if 'complexity_scores' in model: + all_complexities.update(model['complexity_scores'].keys()) + + for complexity in sorted(all_complexities): + complexity_models = [] + + for model in ranked_models: + if 'complexity_scores' in model and complexity in model['complexity_scores']: + scores = model['complexity_scores'][complexity] + avg_score = sum(scores) / len(scores) if scores else 0 + + complexity_models.append({ + 'model_name': model['model_name'], + 'score': avg_score, + 'num_items': len(scores) + }) + + # Sort by score for this complexity level + complexity_models.sort(key=lambda x: x['score'], reverse=True) + + # Add 
rankings + for i, model in enumerate(complexity_models): + model['rank'] = i + 1 + + complexity_breakdown[complexity] = complexity_models + + return complexity_breakdown + + def generate_html(self, output_path: Path) -> None: + """Generate static HTML leaderboard + + Args: + output_path: Path where to save the HTML file + """ + if not self.results_data: + raise ValueError("No results data loaded") + + # Calculate statistics and rankings + stats = self.calculate_leaderboard_stats() + ranked_models = self.rank_models() + audience_breakdown = self.generate_audience_breakdown(ranked_models) + complexity_breakdown = self.generate_complexity_breakdown(ranked_models) + + # Generate HTML content + html_content = self._generate_html_template( + stats, ranked_models, audience_breakdown, complexity_breakdown + ) + + # Ensure output directory exists + output_path.parent.mkdir(parents=True, exist_ok=True) + + # Write HTML file + with open(output_path, 'w', encoding='utf-8') as f: + f.write(html_content) + + logger.info(f"Generated leaderboard HTML: {output_path}") + + def _generate_html_template(self, + stats: Dict[str, Any], + ranked_models: List[Dict[str, Any]], + audience_breakdown: Dict[str, List[Dict[str, Any]]], + complexity_breakdown: Dict[str, List[Dict[str, Any]]]) -> str: + """Generate the complete HTML template""" + + return f""" + + + + + MEQ-Bench Leaderboard + + + + +
+
+

πŸ† MEQ-Bench Leaderboard

+

Evaluating Medical Language Models for Audience-Adaptive Explanations

+
+
+ {stats['total_models']} + Models +
+
+ {stats['total_evaluations']:,} + Total Evaluations +
+
+ {len(stats['audiences'])} + Audiences +
+
+ {stats['best_score']:.3f} + Best Score +
+
+

Last updated: {datetime.fromisoformat(stats['last_updated']).strftime('%Y-%m-%d %H:%M UTC')}

+
+ + + +
+

Overall Model Rankings

+
+ {self._generate_overall_rankings_table(ranked_models)} +
+
+ +
+

Performance by Audience

+ {self._generate_audience_breakdown_section(audience_breakdown)} +
+ +
+

Performance by Complexity Level

+ {self._generate_complexity_breakdown_section(complexity_breakdown)} +
+ +
+

Analytics & Visualizations

+
+
+

Model Performance Comparison

+ +
+
+

Audience Performance Distribution

+ +
+
+
+ +
+ +
+
+ + + +""" + + def _get_css_styles(self) -> str: + """Return CSS styles for the leaderboard""" + return """ + * { + margin: 0; + padding: 0; + box-sizing: border-box; + } + + body { + font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, Oxygen, Ubuntu, Cantarell, sans-serif; + line-height: 1.6; + color: #333; + background-color: #f8fafc; + } + + .container { + max-width: 1200px; + margin: 0 auto; + padding: 20px; + } + + header { + text-align: center; + margin-bottom: 2rem; + padding: 2rem; + background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); + color: white; + border-radius: 12px; + box-shadow: 0 8px 32px rgba(0,0,0,0.1); + } + + h1 { + font-size: 2.5rem; + font-weight: 700; + margin-bottom: 0.5rem; + } + + .subtitle { + font-size: 1.2rem; + opacity: 0.9; + margin-bottom: 1.5rem; + } + + .stats-bar { + display: flex; + justify-content: center; + gap: 2rem; + margin: 1.5rem 0; + flex-wrap: wrap; + } + + .stat-item { + text-align: center; + } + + .stat-value { + display: block; + font-size: 1.8rem; + font-weight: 700; + color: #ffd700; + } + + .stat-label { + font-size: 0.9rem; + opacity: 0.8; + } + + .last-updated { + font-size: 0.9rem; + opacity: 0.8; + margin-top: 1rem; + } + + .tabs { + display: flex; + margin-bottom: 2rem; + border-bottom: 2px solid #e2e8f0; + background: white; + border-radius: 8px 8px 0 0; + overflow: hidden; + box-shadow: 0 2px 4px rgba(0,0,0,0.1); + } + + .tab-button { + flex: 1; + padding: 1rem 1.5rem; + border: none; + background: white; + color: #64748b; + cursor: pointer; + font-weight: 500; + transition: all 0.3s ease; + border-bottom: 3px solid transparent; + } + + .tab-button:hover { + background: #f1f5f9; + color: #334155; + } + + .tab-button.active { + color: #3b82f6; + border-bottom-color: #3b82f6; + background: #f8fafc; + } + + .tab-content { + display: none; + background: white; + padding: 2rem; + border-radius: 0 0 12px 12px; + box-shadow: 0 4px 16px rgba(0,0,0,0.1); + } + + .tab-content.active { + 
display: block; + } + + .table-container { + overflow-x: auto; + } + + table { + width: 100%; + border-collapse: collapse; + margin: 1rem 0; + } + + th, td { + padding: 1rem; + text-align: left; + border-bottom: 1px solid #e2e8f0; + } + + th { + background: #f8fafc; + font-weight: 600; + color: #374151; + position: sticky; + top: 0; + } + + tr:hover { + background: #f8fafc; + } + + .rank { + font-weight: 700; + color: #3b82f6; + } + + .score { + font-weight: 600; + color: #059669; + } + + .model-name { + font-weight: 600; + color: #1f2937; + } + + .rank-1 { + background: linear-gradient(135deg, #ffd700, #ffed4e); + color: #92400e; + } + + .rank-2 { + background: linear-gradient(135deg, #c0c0c0, #e5e7eb); + color: #374151; + } + + .rank-3 { + background: linear-gradient(135deg, #cd7f32, #d97706); + color: white; + } + + .audience-section, .complexity-section { + margin-bottom: 2rem; + background: #f8fafc; + padding: 1.5rem; + border-radius: 8px; + border-left: 4px solid #3b82f6; + } + + .audience-section h3, .complexity-section h3 { + color: #1f2937; + margin-bottom: 1rem; + text-transform: capitalize; + } + + .charts-container { + display: grid; + grid-template-columns: repeat(auto-fit, minmax(500px, 1fr)); + gap: 2rem; + } + + .chart-item { + background: white; + padding: 1.5rem; + border-radius: 8px; + box-shadow: 0 2px 8px rgba(0,0,0,0.1); + } + + .chart-item h3 { + margin-bottom: 1rem; + color: #1f2937; + } + + footer { + margin-top: 3rem; + text-align: center; + padding: 2rem; + background: #1f2937; + color: white; + border-radius: 12px; + } + + .footer-content p { + margin-bottom: 0.5rem; + } + + .footer-content a { + color: #60a5fa; + text-decoration: none; + } + + .footer-content a:hover { + text-decoration: underline; + } + + @media (max-width: 768px) { + .container { + padding: 10px; + } + + h1 { + font-size: 2rem; + } + + .stats-bar { + gap: 1rem; + } + + .tab-button { + padding: 0.75rem 1rem; + font-size: 0.9rem; + } + + .tab-content { + padding: 1rem; 
+ } + + th, td { + padding: 0.75rem 0.5rem; + font-size: 0.9rem; + } + + .charts-container { + grid-template-columns: 1fr; + } + } + """ + + def _generate_overall_rankings_table(self, ranked_models: List[Dict[str, Any]]) -> str: + """Generate the overall rankings table HTML""" + table_html = """ + + + + + + + + + + + + + + + """ + + for model in ranked_models: + rank_class = "" + if model['rank'] == 1: + rank_class = "rank-1" + elif model['rank'] == 2: + rank_class = "rank-2" + elif model['rank'] == 3: + rank_class = "rank-3" + + # Calculate audience averages + audience_scores = {} + for audience, scores in model['audience_scores'].items(): + avg_score = sum(scores) / len(scores) if scores else 0 + audience_scores[audience] = avg_score + + table_html += f""" + + + + + + + + + + + """ + + table_html += """ + +
RankModelOverall ScoreItems EvaluatedPhysicianNursePatientCaregiver
#{model['rank']}{model['model_name']}{model['overall_score']:.3f}{model['total_items']}{audience_scores.get('physician', 0):.3f}{audience_scores.get('nurse', 0):.3f}{audience_scores.get('patient', 0):.3f}{audience_scores.get('caregiver', 0):.3f}
+ """ + + return table_html + + def _generate_audience_breakdown_section(self, audience_breakdown: Dict[str, List[Dict[str, Any]]]) -> str: + """Generate the audience breakdown section HTML""" + html = "" + + for audience, models in audience_breakdown.items(): + html += f""" +
+

{audience.title()} Audience Rankings

+ + + + + + + + + + + """ + + for model in models[:10]: # Show top 10 + rank_class = "" + if model['rank'] == 1: + rank_class = "rank-1" + elif model['rank'] == 2: + rank_class = "rank-2" + elif model['rank'] == 3: + rank_class = "rank-3" + + html += f""" + + + + + + + """ + + html += """ + +
RankModelScoreItems
#{model['rank']}{model['model_name']}{model['score']:.3f}{model['num_items']}
+
+ """ + + return html + + def _generate_complexity_breakdown_section(self, complexity_breakdown: Dict[str, List[Dict[str, Any]]]) -> str: + """Generate the complexity breakdown section HTML""" + html = "" + + for complexity, models in complexity_breakdown.items(): + html += f""" +
+

{complexity.title()} Complexity Level Rankings

+ + + + + + + + + + + """ + + for model in models[:10]: # Show top 10 + rank_class = "" + if model['rank'] == 1: + rank_class = "rank-1" + elif model['rank'] == 2: + rank_class = "rank-2" + elif model['rank'] == 3: + rank_class = "rank-3" + + html += f""" + + + + + + + """ + + html += """ + +
RankModelScoreItems
#{model['rank']}{model['model_name']}{model['score']:.3f}{model['num_items']}
+
+ """ + + return html + + def _generate_javascript(self, + ranked_models: List[Dict[str, Any]], + audience_breakdown: Dict[str, List[Dict[str, Any]]], + stats: Dict[str, Any]) -> str: + """Generate JavaScript for interactive features""" + + # Prepare data for charts + model_names = [model['model_name'][:20] for model in ranked_models[:8]] # Top 8 models + model_scores = [model['overall_score'] for model in ranked_models[:8]] + + audience_labels = list(audience_breakdown.keys()) + audience_data = [] + for audience in audience_labels: + if audience_breakdown[audience]: + avg_score = sum(model['score'] for model in audience_breakdown[audience]) / len(audience_breakdown[audience]) + audience_data.append(avg_score) + else: + audience_data.append(0) + + return f""" + function showTab(tabName) {{ + // Hide all tab contents + const tabContents = document.querySelectorAll('.tab-content'); + tabContents.forEach(tab => tab.classList.remove('active')); + + // Remove active class from all buttons + const tabButtons = document.querySelectorAll('.tab-button'); + tabButtons.forEach(btn => btn.classList.remove('active')); + + // Show selected tab + document.getElementById(tabName + '-tab').classList.add('active'); + event.target.classList.add('active'); + + // Initialize charts if analytics tab is selected + if (tabName === 'charts') {{ + setTimeout(initCharts, 100); + }} + }} + + function initCharts() {{ + // Performance comparison chart + const performanceCtx = document.getElementById('performanceChart').getContext('2d'); + new Chart(performanceCtx, {{ + type: 'bar', + data: {{ + labels: {json.dumps(model_names)}, + datasets: [{{ + label: 'Overall Score', + data: {json.dumps(model_scores)}, + backgroundColor: 'rgba(59, 130, 246, 0.8)', + borderColor: 'rgba(59, 130, 246, 1)', + borderWidth: 1 + }}] + }}, + options: {{ + responsive: true, + scales: {{ + y: {{ + beginAtZero: true, + max: 1 + }} + }} + }} + }}); + + // Audience performance chart + const audienceCtx = 
def setup_argument_parser() -> argparse.ArgumentParser:
    """Build the command-line argument parser for leaderboard generation.

    Returns:
        Configured ``argparse.ArgumentParser`` with --input/--output/--title/
        --verbose options.
    """
    arg_parser = argparse.ArgumentParser(
        description="Generate static HTML leaderboard from MEQ-Bench evaluation results",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  # Generate leaderboard from results directory
  python -m src.leaderboard --input results/ --output docs/index.html

  # Generate with custom title
  python -m src.leaderboard --input results/ --output leaderboard.html --title "Custom MEQ-Bench Results"
        """
    )

    arg_parser.add_argument(
        '--input', '-i',
        type=str,
        required=True,
        help='Directory containing JSON evaluation result files'
    )
    arg_parser.add_argument(
        '--output', '-o',
        type=str,
        default='docs/index.html',
        help='Output path for the HTML leaderboard (default: docs/index.html)'
    )
    # NOTE(review): --title is accepted but main() never consumes it — confirm intent.
    arg_parser.add_argument(
        '--title',
        type=str,
        default='MEQ-Bench Leaderboard',
        help='Custom title for the leaderboard page'
    )
    arg_parser.add_argument(
        '--verbose', '-v',
        action='store_true',
        help='Enable verbose logging'
    )

    return arg_parser
def main():
    """CLI entry point: load evaluation results and write the HTML leaderboard.

    Exits with status 1 on any failure (missing directory, no valid results,
    unwritable output path).
    """
    args = setup_argument_parser().parse_args()

    # -v/--verbose raises the root logger to DEBUG.
    if args.verbose:
        logging.getLogger().setLevel(logging.DEBUG)

    try:
        generator = LeaderboardGenerator()
        generator.load_results(Path(args.input))

        output_path = Path(args.output)
        generator.generate_html(output_path)

        logger.info(f"βœ… Leaderboard generated successfully: {output_path}")
        logger.info(f"πŸ“Š Processed {len(generator.results_data)} model results")

    except Exception as e:
        # Any failure is reported once and mapped to a non-zero exit code.
        logger.error(f"❌ Error generating leaderboard: {e}")
        sys.exit(1)


if __name__ == "__main__":
    main()
class TestCalculateComplexityLevel:
    """Tests for Flesch-Kincaid-based complexity level classification."""

    def test_empty_text_raises_error(self):
        """Empty string and None are both rejected with ValueError."""
        for bad_text in ("", None):
            with pytest.raises(ValueError, match="Text must be a non-empty string"):
                calculate_complexity_level(bad_text)

    def test_whitespace_only_raises_error(self):
        """Whitespace-only input is rejected with ValueError."""
        with pytest.raises(ValueError, match="Text cannot be empty or whitespace only"):
            calculate_complexity_level(" \n\t ")

    @patch('src.data_loaders.textstat', None)
    def test_fallback_complexity_calculation(self):
        """Without textstat the fallback heuristic still yields a valid level."""
        valid_levels = ['basic', 'intermediate', 'advanced']

        simple_text = "This is simple. It has short words."
        assert calculate_complexity_level(simple_text) in valid_levels

        complex_text = (
            "Pharmacokinetic interactions involving cytochrome P450 enzymes can significantly "
            "alter therapeutic drug concentrations, potentially leading to adverse effects or "
            "therapeutic failure in clinical practice."
        )
        assert calculate_complexity_level(complex_text) in valid_levels

    @patch('src.data_loaders.textstat')
    def test_with_textstat_available(self, mock_textstat):
        """Grade-level scores map onto basic/intermediate/advanced thresholds."""
        text = "Simple medical text for testing."
        grade_mock = mock_textstat.flesch_kincaid.return_value.grade

        for grade, expected in ((6.0, 'basic'), (10.0, 'intermediate'), (15.0, 'advanced')):
            grade_mock.return_value = grade
            assert calculate_complexity_level(text) == expected

    @patch('src.data_loaders.textstat')
    def test_textstat_error_fallback(self, mock_textstat):
        """A textstat failure falls back to the heuristic classifier."""
        mock_textstat.flesch_kincaid.return_value.grade.side_effect = Exception("Textstat error")

        result = calculate_complexity_level("Test text for error handling.")
        assert result in ['basic', 'intermediate', 'advanced']
class TestValidateBenchmarkItem:
    """Tests for `_validate_benchmark_item` input checking."""

    @staticmethod
    def _make_item(**overrides):
        """Construct a MEQBenchItem from valid defaults plus per-test overrides."""
        fields = dict(
            id="test_001",
            medical_content="This is valid medical content for testing purposes.",
            complexity_level="basic",
            source_dataset="test",
        )
        fields.update(overrides)
        return MEQBenchItem(**fields)

    def test_valid_item(self):
        """A well-formed item passes validation without raising."""
        _validate_benchmark_item(self._make_item())

    def test_empty_id_raises_error(self):
        """An empty ID is rejected before content checks run."""
        item = self._make_item(id="", medical_content="Valid content")
        with pytest.raises(ValueError, match="Item ID must be a non-empty string"):
            _validate_benchmark_item(item)

    def test_non_string_id_raises_error(self):
        """A non-string ID is rejected."""
        item = self._make_item(id=123, medical_content="Valid content")
        with pytest.raises(ValueError, match="Item ID must be a non-empty string"):
            _validate_benchmark_item(item)

    def test_empty_content_raises_error(self):
        """Empty medical content is rejected."""
        item = self._make_item(medical_content="")
        with pytest.raises(ValueError, match="Medical content must be a non-empty string"):
            _validate_benchmark_item(item)

    def test_short_content_raises_error(self):
        """Content shorter than 20 characters is rejected."""
        item = self._make_item(medical_content="Short")
        with pytest.raises(ValueError, match="Medical content is too short"):
            _validate_benchmark_item(item)
class TestLoadMedQAUSMLE:
    """Tests for MedQA-USMLE dataset loading.

    Uses ``tempfile.TemporaryDirectory`` instead of
    ``NamedTemporaryFile(delete=False)`` + manual ``unlink``: the old pattern
    leaked the temp file whenever an assertion failed before cleanup, and the
    still-open file could not be re-opened by the loader on Windows.
    """

    @staticmethod
    def _write_file(tmp_dir, payload, *, raw=False):
        """Write `payload` to a JSON file inside `tmp_dir`; return its path.

        ``raw=True`` writes the string verbatim (for malformed-JSON cases).
        """
        path = Path(tmp_dir) / "medqa.json"
        text = payload if raw else json.dumps(payload)
        path.write_text(text, encoding='utf-8')
        return str(path)

    def test_file_not_found_raises_error(self):
        """A non-existent path raises FileNotFoundError."""
        with pytest.raises(FileNotFoundError):
            load_medqa_usmle("/nonexistent/file.json")

    def test_invalid_json_raises_error(self):
        """Malformed JSON raises JSONDecodeError."""
        with tempfile.TemporaryDirectory() as tmp_dir:
            path = self._write_file(tmp_dir, "invalid json content", raw=True)
            with pytest.raises(json.JSONDecodeError):
                load_medqa_usmle(path)

    def test_non_list_data_raises_error(self):
        """Top-level JSON that is not a list raises ValueError."""
        with tempfile.TemporaryDirectory() as tmp_dir:
            path = self._write_file(tmp_dir, {"not": "a list"})
            with pytest.raises(ValueError, match="MedQA-USMLE data must be a list"):
                load_medqa_usmle(path)

    def test_load_valid_data(self):
        """Valid records are converted into MEQBenchItem objects."""
        sample_data = [
            {
                "id": "medqa_001",
                "question": "What is the most common cause of hypertension?",
                "options": {
                    "A": "Primary hypertension",
                    "B": "Secondary hypertension",
                    "C": "White coat hypertension",
                    "D": "Malignant hypertension",
                },
                "answer": "A",
                "explanation": "Primary hypertension accounts for 90-95% of cases.",
            },
            {
                "id": "medqa_002",
                "question": "Which medication is first-line for diabetes?",
                "options": {
                    "A": "Insulin",
                    "B": "Metformin",
                    "C": "Sulfonylureas",
                    "D": "Thiazolidinediones",
                },
                "answer": "B",
                "explanation": "Metformin is the first-line treatment for type 2 diabetes.",
            },
        ]

        with tempfile.TemporaryDirectory() as tmp_dir:
            path = self._write_file(tmp_dir, sample_data)
            items = load_medqa_usmle(path, auto_complexity=False)

        assert len(items) == 2
        assert all(isinstance(item, MEQBenchItem) for item in items)
        assert items[0].id == "medqa_001"
        assert items[0].source_dataset == "MedQA-USMLE"
        # Default level when auto_complexity is disabled.
        assert items[0].complexity_level == "intermediate"
        assert "What is the most common cause of hypertension?" in items[0].medical_content

    def test_max_items_limit(self):
        """max_items caps the number of returned items."""
        sample_data = [
            {"id": f"test_{i}", "question": f"Question {i}", "answer": "A"}
            for i in range(5)
        ]
        with tempfile.TemporaryDirectory() as tmp_dir:
            path = self._write_file(tmp_dir, sample_data)
            items = load_medqa_usmle(path, max_items=3, auto_complexity=False)
        assert len(items) == 3

    @patch('src.data_loaders.calculate_complexity_level')
    def test_auto_complexity_calculation(self, mock_calc_complexity):
        """auto_complexity=True delegates to calculate_complexity_level."""
        mock_calc_complexity.return_value = 'advanced'

        sample_data = [{
            "id": "test_001",
            "question": "Complex medical question",
            "answer": "A",
            "explanation": "Detailed explanation",
        }]
        with tempfile.TemporaryDirectory() as tmp_dir:
            path = self._write_file(tmp_dir, sample_data)
            items = load_medqa_usmle(path, auto_complexity=True)

        assert len(items) == 1
        assert items[0].complexity_level == 'advanced'
        mock_calc_complexity.assert_called_once()
class TestLoadiCliniq:
    """Tests for iCliniq dataset loading.

    Uses ``tempfile.TemporaryDirectory`` so cleanup is guaranteed even when
    an assertion fails (the NamedTemporaryFile + manual unlink pattern leaked
    files on failure and breaks on Windows while the file is still open).
    """

    @staticmethod
    def _write_file(tmp_dir, payload):
        """Serialise `payload` to a JSON file inside `tmp_dir`; return its path."""
        path = Path(tmp_dir) / "icliniq.json"
        path.write_text(json.dumps(payload), encoding='utf-8')
        return str(path)

    def test_load_valid_icliniq_data(self):
        """Q&A records are combined into formatted medical content."""
        sample_data = [
            {
                "id": "icliniq_001",
                "patient_question": "I have been experiencing chest pain. Should I be worried?",
                "doctor_answer": "Chest pain can have various causes. Please consult a cardiologist for proper evaluation.",
                "speciality": "Cardiology",
            },
            {
                "id": "icliniq_002",
                "patient_question": "What are the side effects of aspirin?",
                "doctor_answer": "Common side effects include stomach irritation and increased bleeding risk.",
                "speciality": "General Medicine",
            },
        ]

        with tempfile.TemporaryDirectory() as tmp_dir:
            path = self._write_file(tmp_dir, sample_data)
            items = load_icliniq(path, auto_complexity=False)

        assert len(items) == 2
        assert all(isinstance(item, MEQBenchItem) for item in items)
        assert items[0].source_dataset == "iCliniq"
        assert "Patient Question:" in items[0].medical_content
        assert "Doctor's Answer:" in items[0].medical_content
        assert "Cardiology" in items[0].medical_content

    def test_alternative_field_names(self):
        """The loader also accepts question/answer/specialty field spellings."""
        sample_data = [{
            "id": "icliniq_001",
            "question": "Alternative question field",   # instead of 'patient_question'
            "answer": "Alternative answer field",       # instead of 'doctor_answer'
            "specialty": "Dermatology",                 # instead of 'speciality'
        }]

        with tempfile.TemporaryDirectory() as tmp_dir:
            path = self._write_file(tmp_dir, sample_data)
            items = load_icliniq(path, auto_complexity=False)

        assert len(items) == 1
        assert "Alternative question field" in items[0].medical_content
        assert "Alternative answer field" in items[0].medical_content
class TestLoadCochraneReviews:
    """Tests for Cochrane Reviews dataset loading.

    Uses ``tempfile.TemporaryDirectory`` so cleanup is guaranteed even when
    an assertion fails (the NamedTemporaryFile + manual unlink pattern leaked
    files on failure and breaks on Windows while the file is still open).
    """

    @staticmethod
    def _write_file(tmp_dir, payload):
        """Serialise `payload` to a JSON file inside `tmp_dir`; return its path."""
        path = Path(tmp_dir) / "cochrane.json"
        path.write_text(json.dumps(payload), encoding='utf-8')
        return str(path)

    def test_load_valid_cochrane_data(self):
        """Review records become advanced-complexity benchmark items."""
        sample_data = [
            {
                "id": "cochrane_001",
                "title": "Effectiveness of statins for cardiovascular disease prevention",
                "abstract": "This systematic review evaluates the effectiveness of statins in preventing cardiovascular events.",
                "conclusions": "Statins significantly reduce cardiovascular events in high-risk patients.",
                "background": "Cardiovascular disease is a leading cause of mortality worldwide.",
            },
            {
                "id": "cochrane_002",
                "title": "Antibiotics for acute respiratory infections",
                "abstract": "Review of antibiotic effectiveness for respiratory tract infections.",
                "main_results": "Limited benefit of antibiotics for viral respiratory infections.",  # alternative field
                "objectives": "To assess antibiotic effectiveness in respiratory infections.",       # alternative field
            },
        ]

        with tempfile.TemporaryDirectory() as tmp_dir:
            path = self._write_file(tmp_dir, sample_data)
            items = load_cochrane_reviews(path, auto_complexity=False)

        assert len(items) == 2
        assert all(isinstance(item, MEQBenchItem) for item in items)
        assert items[0].source_dataset == "Cochrane Reviews"
        # Cochrane defaults to 'advanced' when auto_complexity is disabled.
        assert items[0].complexity_level == "advanced"
        assert "Title:" in items[0].medical_content
        assert "Abstract:" in items[0].medical_content
        assert "statins" in items[0].medical_content.lower()

    def test_missing_title_and_abstract_skipped(self):
        """Records lacking both title and abstract are skipped."""
        sample_data = [
            {
                "id": "cochrane_001",
                "title": "Valid title",
                "abstract": "Valid abstract",
            },
            {
                "id": "cochrane_002",
                # Missing both title and abstract.
                "conclusions": "Only conclusions available",
            },
        ]

        with tempfile.TemporaryDirectory() as tmp_dir:
            path = self._write_file(tmp_dir, sample_data)
            items = load_cochrane_reviews(path, auto_complexity=False)

        # Only the valid record should be loaded.
        assert len(items) == 1
benchmark items to JSON""" + + def test_save_empty_list(self): + """Test saving an empty list of items""" + with tempfile.TemporaryDirectory() as temp_dir: + output_path = Path(temp_dir) / "test_items.json" + save_benchmark_items([], output_path) + + assert output_path.exists() + with open(output_path, 'r') as f: + data = json.load(f) + assert data == [] + + def test_save_valid_items(self): + """Test saving valid benchmark items""" + items = [ + MEQBenchItem( + id="test_001", + medical_content="Test content 1", + complexity_level="basic", + source_dataset="test", + reference_explanations={"physician": "Technical explanation"} + ), + MEQBenchItem( + id="test_002", + medical_content="Test content 2", + complexity_level="advanced", + source_dataset="test" + ) + ] + + with tempfile.TemporaryDirectory() as temp_dir: + output_path = Path(temp_dir) / "test_items.json" + save_benchmark_items(items, output_path, pretty_print=True) + + assert output_path.exists() + with open(output_path, 'r') as f: + data = json.load(f) + + assert len(data) == 2 + assert data[0]['id'] == "test_001" + assert data[0]['medical_content'] == "Test content 1" + assert data[0]['complexity_level'] == "basic" + assert data[0]['source_dataset'] == "test" + assert data[0]['reference_explanations'] == {"physician": "Technical explanation"} + + assert data[1]['id'] == "test_002" + assert data[1]['reference_explanations'] is None + + def test_save_creates_directory(self): + """Test that saving creates parent directories if they don't exist""" + with tempfile.TemporaryDirectory() as temp_dir: + output_path = Path(temp_dir) / "subdir" / "test_items.json" + items = [MEQBenchItem( + id="test_001", + medical_content="Test content", + complexity_level="basic", + source_dataset="test" + )] + + save_benchmark_items(items, output_path) + + assert output_path.exists() + assert output_path.parent.exists() + + +class TestDataLoadersIntegration: + """Integration tests for data loaders""" + + def 
test_load_multiple_datasets(self): + """Test loading and combining multiple datasets""" + # Create sample data for each dataset type + medqa_data = [{ + "id": "medqa_001", + "question": "MedQA question", + "answer": "A", + "explanation": "MedQA explanation" + }] + + icliniq_data = [{ + "id": "icliniq_001", + "patient_question": "iCliniq question", + "doctor_answer": "iCliniq answer", + "speciality": "General" + }] + + cochrane_data = [{ + "id": "cochrane_001", + "title": "Cochrane title", + "abstract": "Cochrane abstract" + }] + + with tempfile.TemporaryDirectory() as temp_dir: + # Save sample data files + medqa_file = Path(temp_dir) / "medqa.json" + icliniq_file = Path(temp_dir) / "icliniq.json" + cochrane_file = Path(temp_dir) / "cochrane.json" + + with open(medqa_file, 'w') as f: + json.dump(medqa_data, f) + with open(icliniq_file, 'w') as f: + json.dump(icliniq_data, f) + with open(cochrane_file, 'w') as f: + json.dump(cochrane_data, f) + + # Load all datasets + medqa_items = load_medqa_usmle(medqa_file, auto_complexity=False) + icliniq_items = load_icliniq(icliniq_file, auto_complexity=False) + cochrane_items = load_cochrane_reviews(cochrane_file, auto_complexity=False) + + # Combine all items + all_items = medqa_items + icliniq_items + cochrane_items + + assert len(all_items) == 3 + assert all_items[0].source_dataset == "MedQA-USMLE" + assert all_items[1].source_dataset == "iCliniq" + assert all_items[2].source_dataset == "Cochrane Reviews" + + # Save combined dataset + combined_file = Path(temp_dir) / "combined.json" + save_benchmark_items(all_items, combined_file) + + # Verify saved file + with open(combined_file, 'r') as f: + saved_data = json.load(f) + assert len(saved_data) == 3 \ No newline at end of file diff --git a/tests/test_evaluator_metrics.py b/tests/test_evaluator_metrics.py new file mode 100644 index 0000000..eaa3a53 --- /dev/null +++ b/tests/test_evaluator_metrics.py @@ -0,0 +1,406 @@ +""" +Unit tests for new evaluator metrics: 
ContradictionDetection, InformationPreservation, HallucinationDetection +""" + +import pytest +from unittest.mock import patch, MagicMock + +from src.evaluator import ( + ContradictionDetection, + InformationPreservation, + HallucinationDetection, + EvaluationScore, + EvaluationError +) + + +class TestContradictionDetection: + """Test ContradictionDetection metric""" + + @pytest.fixture + def contradiction_detector(self): + """Create ContradictionDetection instance for testing""" + return ContradictionDetection() + + def test_initialization(self, contradiction_detector): + """Test proper initialization of ContradictionDetection""" + assert hasattr(contradiction_detector, 'medical_knowledge_base') + assert hasattr(contradiction_detector, 'contradiction_patterns') + assert isinstance(contradiction_detector.medical_knowledge_base, dict) + assert isinstance(contradiction_detector.contradiction_patterns, list) + + def test_no_contradictions_returns_high_score(self, contradiction_detector): + """Test that text without contradictions returns high score""" + clean_text = "Hypertension is treated with lifestyle changes and medication as prescribed by a doctor." + score = contradiction_detector.calculate(clean_text, "patient") + assert score == 1.0 + + def test_pattern_based_contradiction_detection(self, contradiction_detector): + """Test detection of pattern-based contradictions""" + contradictory_text = "You can treat viral infections with antibiotics effectively." + score = contradiction_detector.calculate(contradictory_text, "patient") + assert score < 1.0 # Should detect contradiction about antibiotics treating viruses + + def test_knowledge_base_contradiction_detection(self, contradiction_detector): + """Test detection of contradictions against knowledge base""" + contradictory_text = "High blood pressure is not related to heart disease or stroke." 
+ score = contradiction_detector.calculate(contradictory_text, "patient") + assert score < 1.0 # Should detect contradiction about hypertension consequences + + def test_empty_text_returns_neutral_score(self, contradiction_detector): + """Test that empty text returns neutral score""" + score = contradiction_detector.calculate("", "patient") + assert score == 0.5 + + score = contradiction_detector.calculate(" ", "patient") + assert score == 0.5 + + def test_multiple_contradictions_lower_score(self, contradiction_detector): + """Test that multiple contradictions result in lower scores""" + single_contradiction = "Antibiotics treat viral infections." + multiple_contradictions = ("Antibiotics treat viral infections. " + "You should stop medication immediately when you feel better. " + "Aspirin is safe for everyone to use daily.") + + single_score = contradiction_detector.calculate(single_contradiction, "patient") + multiple_score = contradiction_detector.calculate(multiple_contradictions, "patient") + + assert multiple_score < single_score + + def test_case_insensitive_detection(self, contradiction_detector): + """Test that contradiction detection is case-insensitive""" + upper_text = "ANTIBIOTICS TREAT VIRUS INFECTIONS" + lower_text = "antibiotics treat virus infections" + + upper_score = contradiction_detector.calculate(upper_text, "patient") + lower_score = contradiction_detector.calculate(lower_text, "patient") + + assert upper_score == lower_score + assert upper_score < 1.0 + + +class TestInformationPreservation: + """Test InformationPreservation metric""" + + @pytest.fixture + def info_preservation(self): + """Create InformationPreservation instance for testing""" + return InformationPreservation() + + def test_initialization(self, info_preservation): + """Test proper initialization of InformationPreservation""" + assert hasattr(info_preservation, 'critical_info_patterns') + assert isinstance(info_preservation.critical_info_patterns, dict) + assert 'dosages' in 
info_preservation.critical_info_patterns + assert 'warnings' in info_preservation.critical_info_patterns + + def test_perfect_preservation_returns_high_score(self, info_preservation): + """Test that perfect information preservation returns high score""" + original = "Take 10 mg twice daily with food. Do not drink alcohol while taking this medication." + generated = "You should take 10 mg two times per day with food. Avoid alcohol while on this medication." + + score = info_preservation.calculate(generated, "patient", original=original) + assert score > 0.5 # Should preserve most critical information + + def test_dosage_preservation(self, info_preservation): + """Test preservation of dosage information""" + original = "Take 20 mg once daily before breakfast." + generated_good = "Take 20 mg one time each day before breakfast." + generated_bad = "Take medication before breakfast." + + good_score = info_preservation.calculate(generated_good, "patient", original=original) + bad_score = info_preservation.calculate(generated_bad, "patient", original=original) + + assert good_score > bad_score + + def test_warning_preservation(self, info_preservation): + """Test preservation of warning information""" + original = "Do not take with alcohol. Avoid driving. Side effects may include dizziness." + generated_good = "Don't drink alcohol. Be careful driving. May cause dizziness." + generated_bad = "Take as directed." + + good_score = info_preservation.calculate(generated_good, "patient", original=original) + bad_score = info_preservation.calculate(generated_bad, "patient", original=original) + + assert good_score > bad_score + + def test_timing_preservation(self, info_preservation): + """Test preservation of timing information""" + original = "Take before meals with water on empty stomach." + generated_good = "Take before eating with water when stomach is empty." + generated_bad = "Take with water." 
+ + good_score = info_preservation.calculate(generated_good, "patient", original=original) + bad_score = info_preservation.calculate(generated_bad, "patient", original=original) + + assert good_score > bad_score + + def test_empty_texts_return_zero(self, info_preservation): + """Test that empty texts return zero score""" + score = info_preservation.calculate("", "patient", original="Some content") + assert score == 0.0 + + score = info_preservation.calculate("Some content", "patient", original="") + assert score == 0.0 + + def test_no_critical_info_returns_high_score(self, info_preservation): + """Test that texts without critical info return high score""" + original = "This is general medical information about health." + generated = "This discusses general health topics." + + score = info_preservation.calculate(generated, "patient", original=original) + assert score == 1.0 # No critical info to preserve + + def test_paraphrased_preservation_detection(self, info_preservation): + """Test detection of paraphrased information preservation""" + original = "Take 5 mg tablets twice daily" + generated = "Take the dose two times each day" # Paraphrased but preserves key info + + score = info_preservation.calculate(generated, "patient", original=original) + assert score > 0.0 # Should detect some preservation through paraphrasing + + +class TestHallucinationDetection: + """Test HallucinationDetection metric""" + + @pytest.fixture + def hallucination_detector(self): + """Create HallucinationDetection instance for testing""" + return HallucinationDetection() + + def test_initialization(self, hallucination_detector): + """Test proper initialization of HallucinationDetection""" + assert hasattr(hallucination_detector, 'medical_entities') + assert isinstance(hallucination_detector.medical_entities, dict) + assert 'medications' in hallucination_detector.medical_entities + assert 'conditions' in hallucination_detector.medical_entities + + def 
test_no_hallucinations_returns_high_score(self, hallucination_detector): + """Test that text without hallucinations returns high score""" + original = "Patient has hypertension and takes aspirin daily." + generated = "The patient has high blood pressure and takes aspirin every day." + + score = hallucination_detector.calculate(generated, "patient", original=original) + assert score >= 0.8 # Should be high since no new entities are introduced + + def test_hallucinated_medication_detection(self, hallucination_detector): + """Test detection of hallucinated medications""" + original = "Patient has headache." + generated = "Patient has headache and should take ibuprofen and metformin." # Metformin not related to headache + + score = hallucination_detector.calculate(generated, "patient", original=original) + assert score < 1.0 # Should detect hallucinated medications + + def test_hallucinated_condition_detection(self, hallucination_detector): + """Test detection of hallucinated medical conditions""" + original = "Patient reports fatigue." + generated = "Patient has diabetes and hypertension causing fatigue." # New conditions not in original + + score = hallucination_detector.calculate(generated, "patient", original=original) + assert score < 1.0 # Should detect hallucinated conditions + + def test_empty_texts_return_neutral_score(self, hallucination_detector): + """Test that empty texts return neutral score""" + score = hallucination_detector.calculate("", "patient", original="Some content") + assert score == 0.5 + + score = hallucination_detector.calculate("Some content", "patient", original="") + assert score == 0.5 + + def test_no_entities_returns_high_score(self, hallucination_detector): + """Test that text without medical entities returns high score""" + original = "This is general health advice." + generated = "This provides general health guidance." 
+ + score = hallucination_detector.calculate(generated, "patient", original=original) + assert score == 1.0 # No entities, so no hallucinations + + def test_entity_extraction_methods(self, hallucination_detector): + """Test different entity extraction methods""" + text = "Patient has diabetes and takes insulin and aspirin." + entities = hallucination_detector._extract_medical_entities(text) + + # Should extract known medical entities + assert 'diabetes' in entities + assert 'insulin' in entities + assert 'aspirin' in entities + + @patch('src.evaluator.spacy') + def test_spacy_ner_integration(self, mock_spacy, hallucination_detector): + """Test spaCy NER integration when available""" + # Mock spaCy NLP pipeline + mock_nlp = MagicMock() + mock_doc = MagicMock() + mock_ent = MagicMock() + mock_ent.text = "custom_medication" + mock_ent.label_ = "PRODUCT" + mock_doc.ents = [mock_ent] + mock_nlp.return_value = mock_doc + + # Create detector with mocked spaCy + detector = HallucinationDetection() + detector.nlp = mock_nlp + + text = "Patient takes custom_medication" + entities = detector._extract_medical_entities(text) + + # Should include spaCy-detected entities if they match medical terms + mock_nlp.assert_called_once_with(text) + + def test_case_insensitive_entity_extraction(self, hallucination_detector): + """Test that entity extraction is case-insensitive""" + text_upper = "PATIENT HAS DIABETES AND TAKES INSULIN" + text_lower = "patient has diabetes and takes insulin" + + entities_upper = hallucination_detector._extract_medical_entities(text_upper) + entities_lower = hallucination_detector._extract_medical_entities(text_lower) + + assert entities_upper == entities_lower + assert 'diabetes' in entities_upper + assert 'insulin' in entities_upper + + +class TestEvaluatorIntegration: + """Integration tests for new metrics with MEQBenchEvaluator""" + + def test_evaluation_score_with_new_metrics(self): + """Test that EvaluationScore includes new metrics""" + score = 
EvaluationScore( + readability=0.8, + terminology=0.7, + safety=0.9, + coverage=0.8, + quality=0.7, + contradiction=0.9, + information_preservation=0.8, + hallucination=0.8, + overall=0.8 + ) + + assert score.contradiction == 0.9 + assert score.information_preservation == 0.8 + assert score.hallucination == 0.8 + + def test_evaluation_score_to_dict_includes_new_metrics(self): + """Test that to_dict method includes new metrics""" + score = EvaluationScore( + readability=0.8, + terminology=0.7, + safety=0.9, + coverage=0.8, + quality=0.7, + contradiction=0.9, + information_preservation=0.8, + hallucination=0.8, + overall=0.8 + ) + + score_dict = score.to_dict() + assert 'contradiction' in score_dict + assert 'information_preservation' in score_dict + assert 'hallucination' in score_dict + assert score_dict['contradiction'] == 0.9 + assert score_dict['information_preservation'] == 0.8 + assert score_dict['hallucination'] == 0.8 + + @patch('src.evaluator.config') + def test_new_metrics_error_handling(self, mock_config): + """Test error handling in new metrics""" + # Mock config to avoid configuration issues + mock_config.get_evaluation_config.return_value = { + 'safety': {'danger_words': [], 'safety_words': []} + } + mock_config.get_scoring_config.return_value = { + 'weights': {'readability': 0.2, 'terminology': 0.2, 'safety': 0.2, + 'coverage': 0.2, 'quality': 0.2}, + 'parameters': {'safety_multiplier': 0.5} + } + mock_config.get_audiences.return_value = ['physician', 'nurse', 'patient', 'caregiver'] + + # Create evaluator instance + from src.evaluator import MEQBenchEvaluator + evaluator = MEQBenchEvaluator() + + # Test that new metrics are properly initialized + assert hasattr(evaluator, 'contradiction_detector') + assert hasattr(evaluator, 'information_preservation') + assert hasattr(evaluator, 'hallucination_detector') + + def test_metric_calculation_robustness(self): + """Test that metrics handle edge cases robustly""" + # Test all metrics with edge cases + 
contradiction_detector = ContradictionDetection() + info_preservation = InformationPreservation() + hallucination_detector = HallucinationDetection() + + # Edge case: very short text + short_text = "Hi" + original_short = "Hello" + + # All metrics should handle short text without crashing + try: + contradiction_score = contradiction_detector.calculate(short_text, "patient") + info_score = info_preservation.calculate(short_text, "patient", original=original_short) + hallucination_score = hallucination_detector.calculate(short_text, "patient", original=original_short) + + # Scores should be valid floats between 0 and 1 + assert 0 <= contradiction_score <= 1 + assert 0 <= info_score <= 1 + assert 0 <= hallucination_score <= 1 + + except Exception as e: + pytest.fail(f"Metrics should handle short text without crashing: {e}") + + # Edge case: text with special characters + special_text = "Take 10mg @#$%^&*() twice daily!!!" + original_special = "Medication: 10mg dosage instructions." + + try: + contradiction_score = contradiction_detector.calculate(special_text, "patient") + info_score = info_preservation.calculate(special_text, "patient", original=original_special) + hallucination_score = hallucination_detector.calculate(special_text, "patient", original=original_special) + + assert 0 <= contradiction_score <= 1 + assert 0 <= info_score <= 1 + assert 0 <= hallucination_score <= 1 + + except Exception as e: + pytest.fail(f"Metrics should handle special characters without crashing: {e}") + + +class TestMetricPerformance: + """Test performance characteristics of new metrics""" + + def test_metrics_complete_quickly(self): + """Test that metrics complete within reasonable time""" + import time + + # Create large text for performance testing + large_text = ("This is a medical explanation about hypertension and diabetes. " * 100 + + "Patient should take medication as prescribed. " * 50 + + "Avoid alcohol and maintain healthy diet. 
" * 50) + + original_text = "Patient has hypertension and diabetes requiring medication management." + + # Initialize metrics + contradiction_detector = ContradictionDetection() + info_preservation = InformationPreservation() + hallucination_detector = HallucinationDetection() + + # Time each metric + start_time = time.time() + contradiction_detector.calculate(large_text, "patient") + contradiction_time = time.time() - start_time + + start_time = time.time() + info_preservation.calculate(large_text, "patient", original=original_text) + info_time = time.time() - start_time + + start_time = time.time() + hallucination_detector.calculate(large_text, "patient", original=original_text) + hallucination_time = time.time() - start_time + + # Each metric should complete within 5 seconds for large text + assert contradiction_time < 5.0, f"ContradictionDetection took {contradiction_time:.2f}s" + assert info_time < 5.0, f"InformationPreservation took {info_time:.2f}s" + assert hallucination_time < 5.0, f"HallucinationDetection took {hallucination_time:.2f}s" \ No newline at end of file diff --git a/tests/test_leaderboard.py b/tests/test_leaderboard.py new file mode 100644 index 0000000..bbb7cac --- /dev/null +++ b/tests/test_leaderboard.py @@ -0,0 +1,556 @@ +""" +Unit tests for leaderboard module +""" + +import pytest +import json +import tempfile +from pathlib import Path +from unittest.mock import patch, MagicMock + +from src.leaderboard import LeaderboardGenerator + + +class TestLeaderboardGenerator: + """Test LeaderboardGenerator class""" + + @pytest.fixture + def sample_results_data(self): + """Sample evaluation results data for testing""" + return [ + { + "model_name": "GPT-4", + "total_items": 100, + "audience_scores": { + "physician": [0.9, 0.8, 0.85], + "nurse": [0.8, 0.75, 0.8], + "patient": [0.7, 0.65, 0.7], + "caregiver": [0.75, 0.7, 0.75] + }, + "complexity_scores": { + "basic": [0.8, 0.75], + "intermediate": [0.7, 0.8], + "advanced": [0.6, 0.65] + }, + "summary": 
{ + "overall_mean": 0.75, + "physician_mean": 0.85, + "nurse_mean": 0.78, + "patient_mean": 0.68, + "caregiver_mean": 0.73 + } + }, + { + "model_name": "Claude-3", + "total_items": 100, + "audience_scores": { + "physician": [0.85, 0.9, 0.88], + "nurse": [0.82, 0.85, 0.83], + "patient": [0.75, 0.8, 0.78], + "caregiver": [0.8, 0.82, 0.81] + }, + "complexity_scores": { + "basic": [0.85, 0.9], + "intermediate": [0.8, 0.85], + "advanced": [0.75, 0.8] + }, + "summary": { + "overall_mean": 0.82, + "physician_mean": 0.88, + "nurse_mean": 0.83, + "patient_mean": 0.78, + "caregiver_mean": 0.81 + } + }, + { + "model_name": "LLaMA-2", + "total_items": 100, + "audience_scores": { + "physician": [0.7, 0.75, 0.72], + "nurse": [0.68, 0.7, 0.69], + "patient": [0.6, 0.65, 0.62], + "caregiver": [0.65, 0.68, 0.66] + }, + "complexity_scores": { + "basic": [0.7, 0.75], + "intermediate": [0.65, 0.7], + "advanced": [0.55, 0.6] + }, + "summary": { + "overall_mean": 0.67, + "physician_mean": 0.72, + "nurse_mean": 0.69, + "patient_mean": 0.62, + "caregiver_mean": 0.66 + } + } + ] + + @pytest.fixture + def leaderboard_generator(self): + """Create LeaderboardGenerator instance for testing""" + return LeaderboardGenerator() + + def test_initialization(self, leaderboard_generator): + """Test proper initialization of LeaderboardGenerator""" + assert hasattr(leaderboard_generator, 'results_data') + assert hasattr(leaderboard_generator, 'benchmark_stats') + assert leaderboard_generator.results_data == [] + assert leaderboard_generator.benchmark_stats == {} + + def test_load_results_file_not_found(self, leaderboard_generator): + """Test error handling when results directory doesn't exist""" + with pytest.raises(FileNotFoundError): + leaderboard_generator.load_results(Path("/nonexistent/directory")) + + def test_load_results_no_json_files(self, leaderboard_generator): + """Test error handling when no JSON files found""" + with tempfile.TemporaryDirectory() as temp_dir: + # Create directory with no 
JSON files + Path(temp_dir, "not_json.txt").write_text("not json") + + with pytest.raises(ValueError, match="No JSON result files found"): + leaderboard_generator.load_results(Path(temp_dir)) + + def test_load_results_invalid_json(self, leaderboard_generator): + """Test handling of invalid JSON files""" + with tempfile.TemporaryDirectory() as temp_dir: + # Create invalid JSON file + invalid_file = Path(temp_dir) / "invalid.json" + invalid_file.write_text("invalid json content") + + # Should log error but not crash + leaderboard_generator.load_results(Path(temp_dir)) + assert len(leaderboard_generator.results_data) == 0 + + def test_load_results_missing_required_fields(self, leaderboard_generator): + """Test handling of JSON files missing required fields""" + with tempfile.TemporaryDirectory() as temp_dir: + # Create JSON file missing required fields + incomplete_data = {"model_name": "TestModel"} # Missing other required fields + incomplete_file = Path(temp_dir) / "incomplete.json" + with open(incomplete_file, 'w') as f: + json.dump(incomplete_data, f) + + # Should skip invalid files + leaderboard_generator.load_results(Path(temp_dir)) + assert len(leaderboard_generator.results_data) == 0 + + def test_load_results_valid_data(self, leaderboard_generator, sample_results_data): + """Test loading valid results data""" + with tempfile.TemporaryDirectory() as temp_dir: + # Create valid JSON files + for i, result in enumerate(sample_results_data): + result_file = Path(temp_dir) / f"result_{i}.json" + with open(result_file, 'w') as f: + json.dump(result, f) + + leaderboard_generator.load_results(Path(temp_dir)) + + assert len(leaderboard_generator.results_data) == 3 + assert leaderboard_generator.results_data[0]['model_name'] == "GPT-4" + assert leaderboard_generator.results_data[1]['model_name'] == "Claude-3" + assert leaderboard_generator.results_data[2]['model_name'] == "LLaMA-2" + + def test_calculate_leaderboard_stats_empty_data(self, leaderboard_generator): + """Test 
stats calculation with empty data""" + stats = leaderboard_generator.calculate_leaderboard_stats() + assert stats == {} + + def test_calculate_leaderboard_stats_valid_data(self, leaderboard_generator, sample_results_data): + """Test stats calculation with valid data""" + leaderboard_generator.results_data = sample_results_data + stats = leaderboard_generator.calculate_leaderboard_stats() + + assert stats['total_models'] == 3 + assert stats['total_evaluations'] == 300 # 3 models * 100 items each + assert set(stats['audiences']) == {'physician', 'nurse', 'patient', 'caregiver'} + assert set(stats['complexity_levels']) == {'basic', 'intermediate', 'advanced'} + assert stats['best_score'] == 0.82 # Claude-3's score + assert stats['worst_score'] == 0.67 # LLaMA-2's score + assert 0.6 < stats['average_score'] < 0.8 + assert 'last_updated' in stats + + def test_rank_models(self, leaderboard_generator, sample_results_data): + """Test model ranking functionality""" + leaderboard_generator.results_data = sample_results_data + ranked_models = leaderboard_generator.rank_models() + + # Should be ranked by overall_mean in descending order + assert len(ranked_models) == 3 + assert ranked_models[0]['model_name'] == "Claude-3" # Highest score (0.82) + assert ranked_models[0]['rank'] == 1 + assert ranked_models[0]['overall_score'] == 0.82 + + assert ranked_models[1]['model_name'] == "GPT-4" # Middle score (0.75) + assert ranked_models[1]['rank'] == 2 + assert ranked_models[1]['overall_score'] == 0.75 + + assert ranked_models[2]['model_name'] == "LLaMA-2" # Lowest score (0.67) + assert ranked_models[2]['rank'] == 3 + assert ranked_models[2]['overall_score'] == 0.67 + + def test_generate_audience_breakdown(self, leaderboard_generator, sample_results_data): + """Test audience-specific performance breakdown""" + leaderboard_generator.results_data = sample_results_data + ranked_models = leaderboard_generator.rank_models() + audience_breakdown = 
leaderboard_generator.generate_audience_breakdown(ranked_models) + + # Should have breakdown for each audience + assert set(audience_breakdown.keys()) == {'physician', 'nurse', 'patient', 'caregiver'} + + # Check physician audience breakdown + physician_breakdown = audience_breakdown['physician'] + assert len(physician_breakdown) == 3 + + # Models should be ranked by their physician-specific scores + physician_scores = [model['score'] for model in physician_breakdown] + assert physician_scores == sorted(physician_scores, reverse=True) + + # Check that rankings are assigned correctly + for i, model in enumerate(physician_breakdown): + assert model['rank'] == i + 1 + assert 'model_name' in model + assert 'score' in model + assert 'num_items' in model + + def test_generate_complexity_breakdown(self, leaderboard_generator, sample_results_data): + """Test complexity-specific performance breakdown""" + leaderboard_generator.results_data = sample_results_data + ranked_models = leaderboard_generator.rank_models() + complexity_breakdown = leaderboard_generator.generate_complexity_breakdown(ranked_models) + + # Should have breakdown for each complexity level + assert set(complexity_breakdown.keys()) == {'basic', 'intermediate', 'advanced'} + + # Check basic complexity breakdown + basic_breakdown = complexity_breakdown['basic'] + assert len(basic_breakdown) == 3 + + # Models should be ranked by their basic-specific scores + basic_scores = [model['score'] for model in basic_breakdown] + assert basic_scores == sorted(basic_scores, reverse=True) + + # Check that rankings are assigned correctly + for i, model in enumerate(basic_breakdown): + assert model['rank'] == i + 1 + + def test_generate_html_no_data_raises_error(self, leaderboard_generator): + """Test that HTML generation raises error with no data""" + with tempfile.TemporaryDirectory() as temp_dir: + output_path = Path(temp_dir) / "test.html" + + with pytest.raises(ValueError, match="No results data loaded"): + 
leaderboard_generator.generate_html(output_path) + + def test_generate_html_creates_file(self, leaderboard_generator, sample_results_data): + """Test that HTML generation creates file""" + leaderboard_generator.results_data = sample_results_data + + with tempfile.TemporaryDirectory() as temp_dir: + output_path = Path(temp_dir) / "leaderboard.html" + leaderboard_generator.generate_html(output_path) + + assert output_path.exists() + html_content = output_path.read_text() + + # Check that HTML contains expected elements + assert "<!DOCTYPE html>" in html_content + assert "MEQ-Bench Leaderboard" in html_content + assert "Claude-3" in html_content # Top-ranked model + assert "GPT-4" in html_content + assert "LLaMA-2" in html_content + assert "physician" in html_content + assert "patient" in html_content + + def test_generate_html_creates_parent_directory(self, leaderboard_generator, sample_results_data): + """Test that HTML generation creates parent directories""" + leaderboard_generator.results_data = sample_results_data + + with tempfile.TemporaryDirectory() as temp_dir: + output_path = Path(temp_dir) / "subdir" / "leaderboard.html" + leaderboard_generator.generate_html(output_path) + + assert output_path.exists() + assert output_path.parent.exists() + + def test_html_template_structure(self, leaderboard_generator, sample_results_data): + """Test that generated HTML has proper structure""" + leaderboard_generator.results_data = sample_results_data + + with tempfile.TemporaryDirectory() as temp_dir: + output_path = Path(temp_dir) / "test.html" + leaderboard_generator.generate_html(output_path) + + html_content = output_path.read_text() + + # Check HTML structure + assert "<html" in html_content + assert "<head" in html_content + assert "<body" in html_content + assert "</html>" in html_content + + # Check CSS inclusion + assert "<style>" in html_content + assert "font-family" in html_content + + # Check JavaScript inclusion + assert "<script>" in html_content + assert "function showTab" in 
html_content + + # Check Chart.js inclusion + assert "chart.js" in html_content + + def test_overall_rankings_table_generation(self, leaderboard_generator, sample_results_data): + """Test generation of overall rankings table""" + leaderboard_generator.results_data = sample_results_data + ranked_models = leaderboard_generator.rank_models() + + table_html = leaderboard_generator._generate_overall_rankings_table(ranked_models) + + # Check table structure + assert "<table>" in table_html + assert "<thead>" in table_html + assert "<tbody>" in table_html + assert "</table>" in table_html + + # Check table headers + assert "Rank" in table_html + assert "Model" in table_html + assert "Overall Score" in table_html + assert "Physician" in table_html + assert "Patient" in table_html + + # Check model data + assert "Claude-3" in table_html + assert "#1" in table_html # First rank + assert "rank-1" in table_html # CSS class for first place + + def test_audience_breakdown_section_generation(self, leaderboard_generator, sample_results_data): + """Test generation of audience breakdown section""" + leaderboard_generator.results_data = sample_results_data + ranked_models = leaderboard_generator.rank_models() + audience_breakdown = leaderboard_generator.generate_audience_breakdown(ranked_models) + + section_html = leaderboard_generator._generate_audience_breakdown_section(audience_breakdown) + + # Check section structure + assert "audience-section" in section_html + assert "Physician Audience Rankings" in section_html + assert "Patient Audience Rankings" in section_html + + # Check that all models appear in sections + assert "Claude-3" in section_html + assert "GPT-4" in section_html + assert "LLaMA-2" in section_html + + def test_complexity_breakdown_section_generation(self, leaderboard_generator, sample_results_data): + """Test generation of complexity breakdown section""" + leaderboard_generator.results_data = sample_results_data + ranked_models = 
leaderboard_generator.rank_models() + complexity_breakdown = leaderboard_generator.generate_complexity_breakdown(ranked_models) + + section_html = leaderboard_generator._generate_complexity_breakdown_section(complexity_breakdown) + + # Check section structure + assert "complexity-section" in section_html + assert "Basic Complexity Level Rankings" in section_html + assert "Advanced Complexity Level Rankings" in section_html + + # Check that all models appear in sections + assert "Claude-3" in section_html + assert "GPT-4" in section_html + assert "LLaMA-2" in section_html + + def test_javascript_generation(self, leaderboard_generator, sample_results_data): + """Test JavaScript generation for interactive features""" + leaderboard_generator.results_data = sample_results_data + ranked_models = leaderboard_generator.rank_models() + audience_breakdown = leaderboard_generator.generate_audience_breakdown(ranked_models) + stats = leaderboard_generator.calculate_leaderboard_stats() + + js_content = leaderboard_generator._generate_javascript(ranked_models, audience_breakdown, stats) + + # Check JavaScript structure + assert "function showTab" in js_content + assert "function initCharts" in js_content + assert "Chart(" in js_content + + # Check data inclusion + assert "Claude-3" in js_content or "Claude" in js_content # Model names + assert "physician" in js_content + assert "patient" in js_content + + # Check chart types + assert "'bar'" in js_content + assert "'radar'" in js_content + + def test_css_styles_generation(self, leaderboard_generator): + """Test CSS styles generation""" + css_content = leaderboard_generator._get_css_styles() + + # Check CSS structure and important styles + assert "body {" in css_content + assert "font-family" in css_content + assert ".container" in css_content + assert ".tab-button" in css_content + assert ".rank-1" in css_content + assert ".rank-2" in css_content + assert ".rank-3" in css_content + + # Check responsive design + assert "@media" in 
css_content + assert "max-width" in css_content + + +class TestLeaderboardIntegration: + """Integration tests for leaderboard functionality""" + + def test_end_to_end_leaderboard_generation(self, sample_results_data): + """Test complete end-to-end leaderboard generation""" + with tempfile.TemporaryDirectory() as temp_dir: + # Create sample result files + results_dir = Path(temp_dir) / "results" + results_dir.mkdir() + + for i, result in enumerate(sample_results_data): + result_file = results_dir / f"model_{i}_results.json" + with open(result_file, 'w') as f: + json.dump(result, f) + + # Generate leaderboard + generator = LeaderboardGenerator() + generator.load_results(results_dir) + + output_path = Path(temp_dir) / "leaderboard.html" + generator.generate_html(output_path) + + # Verify the complete leaderboard + assert output_path.exists() + html_content = output_path.read_text() + + # Check that all major components are present + assert "MEQ-Bench Leaderboard" in html_content + assert "Overall Rankings" in html_content + assert "By Audience" in html_content + assert "By Complexity" in html_content + assert "Analytics" in html_content + + # Check that all models are represented + for result in sample_results_data: + assert result['model_name'] in html_content + + # Check that audience types are represented + for audience in ['physician', 'nurse', 'patient', 'caregiver']: + assert audience in html_content + + # Check that complexity levels are represented + for complexity in ['basic', 'intermediate', 'advanced']: + assert complexity in html_content + + def test_leaderboard_with_mixed_data_quality(self): + """Test leaderboard generation with mixed quality data""" + mixed_results = [ + # Complete data + { + "model_name": "CompleteModel", + "total_items": 100, + "audience_scores": {"physician": [0.8, 0.9], "patient": [0.7, 0.8]}, + "complexity_scores": {"basic": [0.8], "advanced": [0.7]}, + "summary": {"overall_mean": 0.8} + }, + # Missing complexity scores + { + 
"model_name": "PartialModel", + "total_items": 50, + "audience_scores": {"physician": [0.7], "patient": [0.6]}, + "summary": {"overall_mean": 0.65} + }, + # Minimal data + { + "model_name": "MinimalModel", + "total_items": 25, + "audience_scores": {"patient": [0.5]}, + "summary": {"overall_mean": 0.5} + } + ] + + with tempfile.TemporaryDirectory() as temp_dir: + results_dir = Path(temp_dir) / "results" + results_dir.mkdir() + + for i, result in enumerate(mixed_results): + result_file = results_dir / f"result_{i}.json" + with open(result_file, 'w') as f: + json.dump(result, f) + + # Should handle mixed data gracefully + generator = LeaderboardGenerator() + generator.load_results(results_dir) + + output_path = Path(temp_dir) / "leaderboard.html" + generator.generate_html(output_path) + + assert output_path.exists() + html_content = output_path.read_text() + + # All models should still appear + assert "CompleteModel" in html_content + assert "PartialModel" in html_content + assert "MinimalModel" in html_content + + def test_leaderboard_performance_with_large_dataset(self): + """Test leaderboard performance with larger dataset""" + import time + + # Create larger dataset + large_results = [] + for i in range(20): # 20 models + result = { + "model_name": f"Model_{i:02d}", + "total_items": 1000, + "audience_scores": { + "physician": [0.5 + (i * 0.02)] * 100, + "nurse": [0.6 + (i * 0.015)] * 100, + "patient": [0.4 + (i * 0.025)] * 100, + "caregiver": [0.55 + (i * 0.02)] * 100 + }, + "complexity_scores": { + "basic": [0.6 + (i * 0.02)] * 100, + "intermediate": [0.5 + (i * 0.02)] * 100, + "advanced": [0.4 + (i * 0.02)] * 100 + }, + "summary": {"overall_mean": 0.5 + (i * 0.02)} + } + large_results.append(result) + + with tempfile.TemporaryDirectory() as temp_dir: + results_dir = Path(temp_dir) / "results" + results_dir.mkdir() + + for i, result in enumerate(large_results): + result_file = results_dir / f"result_{i}.json" + with open(result_file, 'w') as f: + 
json.dump(result, f) + + # Time the leaderboard generation + start_time = time.time() + + generator = LeaderboardGenerator() + generator.load_results(results_dir) + + output_path = Path(temp_dir) / "leaderboard.html" + generator.generate_html(output_path) + + end_time = time.time() + generation_time = end_time - start_time + + # Should complete within reasonable time (30 seconds for large dataset) + assert generation_time < 30.0, f"Leaderboard generation took {generation_time:.2f}s" + + # Verify output quality + assert output_path.exists() + html_content = output_path.read_text() + assert len(html_content) > 10000 # Should be substantial HTML content + assert "Model_19" in html_content # Highest ranked model should appear \ No newline at end of file diff --git a/tests/test_process_datasets.py b/tests/test_process_datasets.py new file mode 100644 index 0000000..2257543 --- /dev/null +++ b/tests/test_process_datasets.py @@ -0,0 +1,578 @@ +""" +Unit tests for the data processing script +""" + +import pytest +import json +import tempfile +import subprocess +import sys +from pathlib import Path +from unittest.mock import patch, MagicMock + +# Import the script modules for testing +sys.path.insert(0, str(Path(__file__).parent.parent / "scripts")) + +try: + from process_datasets import ( + calculate_dataset_limits, + balance_complexity_distribution, + validate_dataset, + print_dataset_statistics, + setup_argument_parser + ) +except ImportError: + # If direct import fails, we'll test through subprocess calls + process_datasets = None + +from src.benchmark import MEQBenchItem + + +class TestDatasetLimitsCalculation: + """Test dataset limits calculation""" + + def test_equal_distribution_three_datasets(self): + """Test equal distribution across three datasets""" + limits = calculate_dataset_limits(1000, 3) + + assert len(limits) == 3 + assert 'medqa' in limits + assert 'icliniq' in limits + assert 'cochrane' in limits + + # Should distribute as evenly as possible + total = 
sum(limits.values()) + assert total == 1000 + + # Each should get approximately 333-334 items + for limit in limits.values(): + assert 333 <= limit <= 334 + + def test_equal_distribution_two_datasets(self): + """Test equal distribution across two datasets""" + limits = calculate_dataset_limits(1000, 2) + + assert len(limits) == 2 + total = sum(limits.values()) + assert total == 1000 + + # Each should get 500 items + for limit in limits.values(): + assert limit == 500 + + def test_uneven_distribution_handling(self): + """Test handling of uneven divisions""" + limits = calculate_dataset_limits(1001, 3) + + total = sum(limits.values()) + assert total == 1001 + + # Should handle remainder correctly + limit_values = list(limits.values()) + assert max(limit_values) - min(limit_values) <= 1 # Difference should be at most 1 + + def test_small_total_items(self): + """Test handling of small total item counts""" + limits = calculate_dataset_limits(5, 3) + + total = sum(limits.values()) + assert total == 5 + + # Some datasets might get 1 item, others 2 + for limit in limits.values(): + assert limit >= 1 + + +class TestComplexityBalancing: + """Test complexity distribution balancing""" + + @pytest.fixture + def sample_items(self): + """Create sample items with different complexity levels""" + items = [] + + # Create 10 basic items + for i in range(10): + items.append(MEQBenchItem( + id=f"basic_{i}", + medical_content=f"Basic medical content {i}", + complexity_level="basic", + source_dataset="test" + )) + + # Create 5 intermediate items + for i in range(5): + items.append(MEQBenchItem( + id=f"intermediate_{i}", + medical_content=f"Intermediate medical content {i}", + complexity_level="intermediate", + source_dataset="test" + )) + + # Create 2 advanced items + for i in range(2): + items.append(MEQBenchItem( + id=f"advanced_{i}", + medical_content=f"Advanced medical content {i}", + complexity_level="advanced", + source_dataset="test" + )) + + return items + + 
@patch('process_datasets.random.sample') + @patch('process_datasets.random.seed') + def test_balance_complexity_distribution(self, mock_seed, mock_sample, sample_items): + """Test complexity balancing functionality""" + # Mock random.sample to return predictable results + def side_effect(population, k): + return population[:k] # Return first k items + + mock_sample.side_effect = side_effect + + balanced_items = balance_complexity_distribution(sample_items) + + # Should have roughly equal distribution + complexity_counts = {} + for item in balanced_items: + complexity_counts[item.complexity_level] = complexity_counts.get(item.complexity_level, 0) + 1 + + # Check that balancing was attempted + assert len(balanced_items) <= len(sample_items) + assert 'basic' in complexity_counts + assert 'intermediate' in complexity_counts + assert 'advanced' in complexity_counts + + def test_balance_empty_items(self): + """Test balancing with empty item list""" + balanced_items = balance_complexity_distribution([]) + assert balanced_items == [] + + def test_balance_single_complexity_level(self): + """Test balancing when only one complexity level exists""" + items = [ + MEQBenchItem( + id=f"basic_{i}", + medical_content=f"Basic content {i}", + complexity_level="basic", + source_dataset="test" + ) + for i in range(5) + ] + + balanced_items = balance_complexity_distribution(items) + + # Should still return items even if balancing isn't possible + assert len(balanced_items) > 0 + assert all(item.complexity_level == "basic" for item in balanced_items) + + +class TestDatasetValidation: + """Test dataset validation functionality""" + + def test_validate_empty_dataset(self): + """Test validation of empty dataset""" + report = validate_dataset([]) + + assert report['valid'] is False + assert report['total_items'] == 0 + assert "Dataset is empty" in report['issues'] + + def test_validate_valid_dataset(self): + """Test validation of valid dataset""" + items = [ + MEQBenchItem( + id="test_001", 
+ medical_content="This is valid medical content for testing purposes and is long enough.", + complexity_level="basic", + source_dataset="test" + ), + MEQBenchItem( + id="test_002", + medical_content="This is another valid medical content item for comprehensive testing.", + complexity_level="intermediate", + source_dataset="test" + ) + ] + + report = validate_dataset(items) + + assert report['valid'] is True + assert report['total_items'] == 2 + assert len(report['issues']) == 0 + assert 'complexity_distribution' in report['statistics'] + assert 'source_distribution' in report['statistics'] + + def test_validate_duplicate_ids(self): + """Test detection of duplicate IDs""" + items = [ + MEQBenchItem( + id="duplicate_id", + medical_content="First item with duplicate ID and sufficient content length.", + complexity_level="basic", + source_dataset="test" + ), + MEQBenchItem( + id="duplicate_id", # Same ID + medical_content="Second item with duplicate ID and sufficient content length.", + complexity_level="intermediate", + source_dataset="test" + ) + ] + + report = validate_dataset(items) + + assert report['valid'] is False + assert any("Duplicate item IDs" in issue for issue in report['issues']) + + def test_validate_short_content(self): + """Test detection of very short content""" + items = [ + MEQBenchItem( + id="test_001", + medical_content="Short", # Too short + complexity_level="basic", + source_dataset="test" + ) + ] + + report = validate_dataset(items) + + assert report['valid'] is False + assert any("very short content" in issue for issue in report['issues']) + + def test_validate_missing_complexity_levels(self): + """Test warning for missing complexity levels""" + items = [ + MEQBenchItem( + id="test_001", + medical_content="This is valid medical content with sufficient length for testing.", + complexity_level="basic", # Only basic level + source_dataset="test" + ) + ] + + report = validate_dataset(items) + + assert report['valid'] is True # Still valid, just 
warning + assert any("Not all complexity levels represented" in warning for warning in report['warnings']) + + def test_validate_content_length_statistics(self): + """Test content length statistics calculation""" + items = [ + MEQBenchItem( + id="test_001", + medical_content="Short but valid medical content for testing purposes.", # ~50 chars + complexity_level="basic", + source_dataset="test" + ), + MEQBenchItem( + id="test_002", + medical_content="This is a much longer medical content item that contains significantly more text to test the average content length calculation functionality." * 3, # ~400+ chars + complexity_level="intermediate", + source_dataset="test" + ) + ] + + report = validate_dataset(items) + + assert 'content_length' in report['statistics'] + assert 'average' in report['statistics']['content_length'] + assert 'minimum' in report['statistics']['content_length'] + assert 'maximum' in report['statistics']['content_length'] + + # Average should be reasonable + avg_length = report['statistics']['content_length']['average'] + assert 50 < avg_length < 500 + + +class TestStatisticsPrinting: + """Test statistics printing functionality""" + + def test_print_empty_dataset_statistics(self, capsys): + """Test printing statistics for empty dataset""" + print_dataset_statistics([]) + + captured = capsys.readouterr() + assert "Dataset is empty" in captured.out + + def test_print_valid_dataset_statistics(self, capsys): + """Test printing statistics for valid dataset""" + items = [ + MEQBenchItem( + id="test_001", + medical_content="Valid medical content for testing statistics display functionality.", + complexity_level="basic", + source_dataset="MedQA-USMLE" + ), + MEQBenchItem( + id="test_002", + medical_content="Another valid medical content item for comprehensive statistics testing.", + complexity_level="intermediate", + source_dataset="iCliniq" + ), + MEQBenchItem( + id="test_003", + medical_content="Third valid medical content item for advanced complexity 
testing and statistics.", + complexity_level="advanced", + source_dataset="Cochrane Reviews" + ) + ] + + print_dataset_statistics(items) + + captured = capsys.readouterr() + output = captured.out + + # Check that all expected sections are present + assert "Dataset Statistics" in output + assert "Complexity Distribution" in output + assert "Source Distribution" in output + assert "Content Length Statistics" in output + + # Check that complexity levels are shown + assert "Basic" in output + assert "Intermediate" in output + assert "Advanced" in output + + # Check that sources are shown + assert "MedQA-USMLE" in output + assert "iCliniq" in output + assert "Cochrane Reviews" in output + + # Check that statistics are shown + assert "Total items: 3" in output + assert "Average:" in output + assert "Minimum:" in output + assert "Maximum:" in output + + +class TestArgumentParser: + """Test command line argument parser""" + + def test_argument_parser_setup(self): + """Test that argument parser is set up correctly""" + parser = setup_argument_parser() + + # Test that parser exists and has expected arguments + args = parser.parse_args([ + '--medqa', 'medqa.json', + '--icliniq', 'icliniq.json', + '--cochrane', 'cochrane.json', + '--output', 'output.json' + ]) + + assert args.medqa == 'medqa.json' + assert args.icliniq == 'icliniq.json' + assert args.cochrane == 'cochrane.json' + assert args.output == 'output.json' + assert args.max_items == 1000 # Default value + + def test_argument_parser_defaults(self): + """Test default argument values""" + parser = setup_argument_parser() + + # Test with minimal arguments + args = parser.parse_args(['--medqa', 'test.json']) + + assert args.output == 'data/benchmark_items.json' # Default output + assert args.max_items == 1000 # Default max items + assert args.auto_complexity is True # Default auto complexity + assert args.balance_complexity is True # Default balance + assert args.seed == 42 # Default seed + + def 
test_argument_parser_custom_values(self): + """Test custom argument values""" + parser = setup_argument_parser() + + args = parser.parse_args([ + '--medqa', 'medqa.json', + '--max-items', '500', + '--medqa-items', '200', + '--no-auto-complexity', + '--seed', '123', + '--validate', + '--stats', + '--verbose' + ]) + + assert args.max_items == 500 + assert args.medqa_items == 200 + assert args.auto_complexity is False + assert args.seed == 123 + assert args.validate is True + assert args.stats is True + assert args.verbose is True + + +class TestScriptIntegration: + """Integration tests for the complete script""" + + @pytest.mark.skipif(process_datasets is None, reason="process_datasets module not available") + def test_script_with_sample_data(self): + """Test the complete script with sample data""" + with tempfile.TemporaryDirectory() as temp_dir: + temp_path = Path(temp_dir) + + # Create sample dataset files + medqa_data = [{ + "id": "medqa_001", + "question": "What is hypertension?", + "answer": "A", + "explanation": "High blood pressure condition." 
+ }] + + icliniq_data = [{ + "id": "icliniq_001", + "patient_question": "I have chest pain", + "doctor_answer": "Consult a cardiologist", + "speciality": "Cardiology" + }] + + cochrane_data = [{ + "id": "cochrane_001", + "title": "Statin effectiveness", + "abstract": "Systematic review of statins" + }] + + # Save sample files + medqa_file = temp_path / "medqa.json" + icliniq_file = temp_path / "icliniq.json" + cochrane_file = temp_path / "cochrane.json" + output_file = temp_path / "output.json" + + with open(medqa_file, 'w') as f: + json.dump(medqa_data, f) + with open(icliniq_file, 'w') as f: + json.dump(icliniq_data, f) + with open(cochrane_file, 'w') as f: + json.dump(cochrane_data, f) + + # Run the script through subprocess to test CLI + script_path = Path(__file__).parent.parent / "scripts" / "process_datasets.py" + + result = subprocess.run([ + sys.executable, str(script_path), + '--medqa', str(medqa_file), + '--icliniq', str(icliniq_file), + '--cochrane', str(cochrane_file), + '--output', str(output_file), + '--max-items', '10', + '--no-auto-complexity', + '--validate', + '--stats' + ], capture_output=True, text=True) + + # Check that script ran successfully + assert result.returncode == 0, f"Script failed with error: {result.stderr}" + + # Check that output file was created + assert output_file.exists() + + # Verify output content + with open(output_file, 'r') as f: + output_data = json.load(f) + + assert len(output_data) == 3 # Should have loaded all 3 items + assert any(item['source_dataset'] == 'MedQA-USMLE' for item in output_data) + assert any(item['source_dataset'] == 'iCliniq' for item in output_data) + assert any(item['source_dataset'] == 'Cochrane Reviews' for item in output_data) + + def test_script_error_handling(self): + """Test script error handling with invalid inputs""" + script_path = Path(__file__).parent.parent / "scripts" / "process_datasets.py" + + # Test with non-existent file + result = subprocess.run([ + sys.executable, 
str(script_path), + '--medqa', '/nonexistent/file.json', + '--output', 'output.json' + ], capture_output=True, text=True) + + # Should exit with error code + assert result.returncode != 0 + + def test_script_help_output(self): + """Test that script provides help output""" + script_path = Path(__file__).parent.parent / "scripts" / "process_datasets.py" + + result = subprocess.run([ + sys.executable, str(script_path), + '--help' + ], capture_output=True, text=True) + + assert result.returncode == 0 + assert "Process medical datasets for MEQ-Bench" in result.stdout + assert "--medqa" in result.stdout + assert "--icliniq" in result.stdout + assert "--cochrane" in result.stdout + + +class TestScriptPerformance: + """Test script performance characteristics""" + + @pytest.mark.skipif(process_datasets is None, reason="process_datasets module not available") + def test_script_performance_with_large_dataset(self): + """Test script performance with larger dataset""" + import time + + with tempfile.TemporaryDirectory() as temp_dir: + temp_path = Path(temp_dir) + + # Create larger sample datasets + large_medqa_data = [ + { + "id": f"medqa_{i:03d}", + "question": f"Medical question {i}", + "answer": "A", + "explanation": f"Medical explanation {i}" + } + for i in range(200) + ] + + large_icliniq_data = [ + { + "id": f"icliniq_{i:03d}", + "patient_question": f"Patient question {i}", + "doctor_answer": f"Doctor answer {i}", + "speciality": "General" + } + for i in range(200) + ] + + # Save large files + medqa_file = temp_path / "large_medqa.json" + icliniq_file = temp_path / "large_icliniq.json" + output_file = temp_path / "large_output.json" + + with open(medqa_file, 'w') as f: + json.dump(large_medqa_data, f) + with open(icliniq_file, 'w') as f: + json.dump(large_icliniq_data, f) + + # Time the script execution + start_time = time.time() + + script_path = Path(__file__).parent.parent / "scripts" / "process_datasets.py" + + result = subprocess.run([ + sys.executable, 
str(script_path), + '--medqa', str(medqa_file), + '--icliniq', str(icliniq_file), + '--output', str(output_file), + '--max-items', '100', + '--no-auto-complexity' # Skip complexity calculation for speed + ], capture_output=True, text=True) + + end_time = time.time() + execution_time = end_time - start_time + + # Should complete within reasonable time (30 seconds for large dataset) + assert execution_time < 30.0, f"Script took {execution_time:.2f}s" + assert result.returncode == 0 + assert output_file.exists() + + # Verify output quality + with open(output_file, 'r') as f: + output_data = json.load(f) + assert len(output_data) == 100 # Should respect max_items limit \ No newline at end of file