Comprehensive AI Evaluation and Benchmarking Framework for Government Content Curation
The AI-Evaluation-Framework is a comprehensive evaluation and benchmarking system designed to assess, validate, and optimize AI models used in the SynthoraAI - AI-Powered Article Content Curator project. This framework enables government officials and content curators to evaluate the quality, accuracy, bias, and performance of AI-generated content summaries and classifications.
This framework serves to:
- Evaluate AI Model Performance: Benchmark summarization, classification, and sentiment analysis models
- Ensure Content Quality: Validate accuracy and relevance of AI-generated summaries
- Detect Bias: Identify and measure potential biases in content processing
- Monitor Performance: Track model performance metrics over time
- Optimize Models: Provide insights for model improvements and fine-tuning
- Verify Compliance: Ensure AI outputs meet government content standards
Architecture:

AI-Evaluation-Framework
├── Metrics Engine, Benchmarks Suite, and Validators & Analyzers
│   └── feed their results into the Evaluation Core
├── Evaluation Core
│   └── drives the Summarization, Classification, and Sentiment Evaluators
└── SynthoraAI Integration Layer
    └── connects the evaluators to the Backend API, Crawler, and Agentic AI pipeline
Evaluation metrics:
- ROUGE Scores: Evaluate summary quality (ROUGE-1, ROUGE-2, ROUGE-L)
- BLEU Scores: Measure translation and generation quality
- BERTScore: Semantic similarity assessment
- Perplexity: Language model fluency/predictability measurement, where lower is better (see the sketch after this list)
- Custom Metrics: Domain-specific government content metrics
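Perplexity is the only metric above without a worked example later in this README, so here is a minimal, self-contained sketch using Hugging Face transformers; GPT-2 and the function name are illustrative assumptions, not the framework's built-in wrapper.

```python
# Illustrative perplexity computation (not the framework's internal implementation).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(text: str, model_name: str = "gpt2") -> float:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()
    encoded = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels makes the model return the mean token cross-entropy;
        # perplexity is the exponential of that loss.
        output = model(**encoded, labels=encoded["input_ids"])
    return torch.exp(output.loss).item()

print(perplexity("The government announced new policies on public health funding."))
```

Lower perplexity means the language model finds the text more predictable, which loosely correlates with fluency.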
Bias detection:
- Political bias detection
- Source bias analysis
- Sentiment bias measurement
- Demographic representation analysis
- Language fairness assessment
Content quality assessment:
- Factual accuracy verification
- Completeness checks
- Coherence analysis
- Readability scoring (Flesch-Kincaid, SMOG, etc.; see the sketch after this list)
- Citation and source validation
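The readability scores listed above can be cross-checked with the third-party textstat package; whether QualityMetrics uses textstat internally is an assumption, so treat this as an independent sanity check rather than the framework's own code.

```python
# Independent readability check with textstat (pip install textstat).
import textstat

summary = (
    "The government announced new policies aimed at improving public health "
    "funding and expanding access to rural clinics."
)

print("Flesch Reading Ease:", textstat.flesch_reading_ease(summary))
print("Flesch-Kincaid Grade:", textstat.flesch_kincaid_grade(summary))
print("SMOG Index:", textstat.smog_index(summary))
```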
Performance benchmarking:
- Model latency measurement (see the timing sketch after this list)
- Throughput analysis
- Resource utilization tracking
- Scalability testing
- Cost-per-inference analysis
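For the latency and throughput items, a stripped-down timing harness looks roughly like the sketch below; `summarize` is a hypothetical callable standing in for whichever model endpoint is under test, and the supported path remains the built-in `npm run benchmark` suite.

```python
# Minimal single-threaded latency/throughput harness (illustrative only).
import time
from typing import Callable, Dict, List

def benchmark(fn: Callable[[str], str], inputs: List[str], warmup: int = 5) -> Dict[str, float]:
    for text in inputs[:warmup]:
        fn(text)  # warm up lazy model loading and caches
    start = time.perf_counter()
    for text in inputs:
        fn(text)
    elapsed = time.perf_counter() - start
    return {
        "avg_latency_ms": 1000 * elapsed / len(inputs),
        "throughput_req_per_s": len(inputs) / elapsed,
    }

# Usage with a hypothetical summarize(text) callable:
# stats = benchmark(summarize, test_texts)
```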
Testing and validation:
- Unit tests for individual components (see the example test after this list)
- Integration tests for end-to-end workflows
- Regression testing for model updates
- A/B testing framework
- Continuous evaluation pipeline
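To make the unit-test item concrete, here is a small pytest sketch written against the ROUGEMetric interface shown later in this README; the texts and assertions are illustrative, not project acceptance criteria.

```python
# tests/test_rouge_metric.py (illustrative)
import pytest
from evaluation.metrics import ROUGEMetric

def test_identical_summaries_score_perfectly():
    rouge = ROUGEMetric()
    scores = rouge.evaluate(
        reference_summary="New policies were announced today.",
        generated_summary="New policies were announced today.",
    )
    assert scores["rouge-1"]["f1"] == pytest.approx(1.0)

def test_scores_stay_in_unit_interval():
    rouge = ROUGEMetric()
    scores = rouge.evaluate(
        reference_summary="The government announced new policies.",
        generated_summary="Officials unveiled a budget plan.",
    )
    for variant in ("rouge-1", "rouge-2", "rouge-l"):
        assert 0.0 <= scores[variant]["f1"] <= 1.0
```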
Reporting and visualization:
- Interactive dashboards
- Performance trend analysis
- Comparison reports
- Export to PDF/Excel
- Real-time monitoring
Table of contents:
- Quick Start
- Installation
- Configuration
- Usage
- Evaluation Metrics
- Benchmarking
- Model Integration
- API Reference
- Examples
- Contributing
- License
Quick start:

# Clone the repository
git clone https://github.com/SynthoraAI-AI-News-Content-Curator/AI-Evaluation-Framework.git
cd AI-Evaluation-Framework
# Install dependencies
npm install
pip install -r requirements.txt
# Configure environment
cp .env.example .env
# Edit .env with your configuration
# Run evaluation suite
npm run evaluate
# Or use Python
python -m evaluation.run_suite

Prerequisites:
- Node.js: v18.0.0 or higher
- Python: 3.11 or higher
- MongoDB: 5.0 or higher
- Redis: 6.0 or higher (optional, for caching)
- Docker: Latest version (optional, for containerized deployment)
Hardware requirements:
- RAM: Minimum 8GB, Recommended 16GB+
- Storage: 10GB free space for models and datasets
- CPU: Multi-core processor (4+ cores recommended)
- GPU: Optional, but recommended for faster evaluation (CUDA-compatible)
Installation:

git clone https://github.com/SynthoraAI-AI-News-Content-Curator/AI-Evaluation-Framework.git
cd AI-Evaluation-Framework

npm install

# Create virtual environment
python -m venv venv
# Activate virtual environment
# On Linux/Mac:
source venv/bin/activate
# On Windows:
venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt

# Install PyTorch (CPU version)
pip install torch torchvision torchaudio
# Or GPU version (CUDA 11.8)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# Install additional NLP libraries
pip install transformers sentence-transformers spacy
python -m spacy download en_core_web_sm

# MongoDB connection
# Make sure MongoDB is running on your system or use MongoDB Atlas
# Initialize database schema
npm run db:init

Create a .env file in the root directory:
# Database Configuration
MONGODB_URI=mongodb://localhost:27017/ai-evaluation
REDIS_URL=redis://localhost:6379
# SynthoraAI Backend Integration
SYNTHORAAI_API_URL=https://ai-content-curator-backend.vercel.app
SYNTHORAAI_API_KEY=your_api_key_here
# Google AI Configuration
GOOGLE_AI_API_KEY=your_google_ai_key
GOOGLE_AI_API_KEY1=your_backup_key_1
GOOGLE_AI_API_KEY2=your_backup_key_2
# OpenAI Configuration (Optional)
OPENAI_API_KEY=your_openai_key
# Hugging Face Configuration
HUGGINGFACE_API_TOKEN=your_hf_token
# Evaluation Settings
EVALUATION_MODE=comprehensive
BATCH_SIZE=32
MAX_WORKERS=4
CACHE_ENABLED=true
# Metrics Configuration
ENABLE_ROUGE=true
ENABLE_BLEU=true
ENABLE_BERTSCORE=true
ENABLE_BIAS_DETECTION=true
# Logging
LOG_LEVEL=info
LOG_FILE=logs/evaluation.log
# Reporting
REPORT_FORMAT=html,json,pdf
REPORT_OUTPUT_DIR=./reports

# Run full evaluation suite
npm run evaluate
# Evaluate specific model
npm run evaluate -- --model summarization
# Run bias detection only
npm run evaluate -- --check bias
# Custom evaluation
npm run evaluate -- --input data/test_articles.json --output reports/
# Benchmark performance
npm run benchmark
# Generate report
npm run report -- --format pdf

Python usage:

from evaluation import EvaluationFramework
from evaluation.metrics import ROUGEMetric, BERTScoreMetric, BiasDetector
# Initialize framework
framework = EvaluationFramework(
config_path='config/evaluation.yaml'
)
# Load test data
test_data = framework.load_test_dataset('data/test_articles.json')
# Run evaluation
results = framework.evaluate(
dataset=test_data,
metrics=['rouge', 'bertscore', 'bias'],
models=['summarization', 'classification']
)
# Generate report
framework.generate_report(
results=results,
output_format='html',
output_path='reports/evaluation_report.html'
)
# Print summary
print(results.summary())

TypeScript usage:

import { EvaluationFramework } from './src/evaluation';
import { SummarizationEvaluator, BiasAnalyzer } from './src/evaluators';
// Initialize framework
const framework = new EvaluationFramework({
configPath: 'config/evaluation.yaml'
});
// Load test data
const testData = await framework.loadTestDataset('data/test_articles.json');
// Configure evaluators
const evaluators = [
new SummarizationEvaluator(),
new BiasAnalyzer()
];
// Run evaluation
const results = await framework.evaluate({
dataset: testData,
evaluators: evaluators,
options: {
parallel: true,
batchSize: 32
}
});
// Generate report
await framework.generateReport({
results: results,
format: 'html',
outputPath: 'reports/evaluation.html'
});
console.log(results.summary());

Evaluation metric examples (Python):

from evaluation.metrics import ROUGEMetric
rouge = ROUGEMetric()
scores = rouge.evaluate(
reference_summary="The government announced new policies...",
generated_summary="New government policies were announced..."
)
# Output:
# {
# 'rouge-1': {'precision': 0.85, 'recall': 0.82, 'f1': 0.835},
# 'rouge-2': {'precision': 0.72, 'recall': 0.70, 'f1': 0.71},
# 'rouge-l': {'precision': 0.80, 'recall': 0.78, 'f1': 0.79}
# }

from evaluation.metrics import BERTScoreMetric
bertscore = BERTScoreMetric(model='microsoft/deberta-xlarge-mnli')
scores = bertscore.evaluate(
reference="Original article content...",
candidate="AI-generated summary..."
)
# Output:
# {
# 'precision': 0.89,
# 'recall': 0.87,
# 'f1': 0.88
# }

from evaluation.metrics import ClassificationMetrics
metrics = ClassificationMetrics()
results = metrics.evaluate(
y_true=['politics', 'health', 'economy'],
y_pred=['politics', 'health', 'technology']
)
# Output:
# {
# 'accuracy': 0.667,
# 'precision': 0.70,
# 'recall': 0.65,
# 'f1_score': 0.675,
# 'confusion_matrix': [[...]],
# 'per_class_metrics': {...}
# }

from evaluation.metrics import BiasDetector
bias_detector = BiasDetector()
bias_analysis = bias_detector.analyze(
text="Article content...",
categories=['political', 'source', 'demographic']
)
# Output:
# {
# 'overall_bias_score': 0.23,
# 'political_bias': {
# 'score': 0.15,
# 'direction': 'neutral',
# 'confidence': 0.92
# },
# 'source_bias': {
# 'score': 0.31,
# 'reliability': 'high'
# },
# 'demographic_bias': {
# 'score': 0.12,
# 'issues': []
# }
# }

from evaluation.metrics import QualityMetrics
quality = QualityMetrics()
assessment = quality.evaluate(
summary="AI-generated summary...",
original="Original article..."
)
# Output:
# {
# 'factual_accuracy': 0.92,
# 'completeness': 0.85,
# 'coherence': 0.88,
# 'readability': {
# 'flesch_reading_ease': 65.5,
# 'flesch_kincaid_grade': 8.2,
# 'smog_index': 9.1
# },
# 'conciseness': 0.90
# }

# Run performance benchmarks
npm run benchmark:performance
# Output:
# ┌─────────────────────┬─────────────┬─────────────┬──────────────┐
# │ Model               │ Avg Latency │ Throughput  │ Memory Usage │
# ├─────────────────────┼─────────────┼─────────────┼──────────────┤
# │ Summarization       │ 245ms       │ 163 req/s   │ 512 MB       │
# │ Classification      │ 128ms       │ 312 req/s   │ 256 MB       │
# │ Sentiment Analysis  │ 95ms        │ 421 req/s   │ 128 MB       │
# │ Bias Detection      │ 189ms       │ 211 req/s   │ 384 MB       │
# └─────────────────────┴─────────────┴─────────────┴──────────────┘

# Run quality benchmarks
npm run benchmark:quality
# Compare models
npm run benchmark:compare -- --models model_v1,model_v2

# Test with increasing load
npm run benchmark:scale -- --max-load 1000 --step 100

import { SynthoraAIIntegration } from './src/integrations';
const integration = new SynthoraAIIntegration({
apiUrl: process.env.SYNTHORAAI_API_URL,
apiKey: process.env.SYNTHORAAI_API_KEY
});
// Fetch articles for evaluation
const articles = await integration.fetchArticles({
limit: 100,
source: 'government',
dateRange: {
start: '2025-01-01',
end: '2025-01-31'
}
});
// Evaluate articles
const evaluation = await framework.evaluate({
articles: articles,
metrics: ['summarization', 'bias', 'quality']
});
// Push results back to SynthoraAI
await integration.pushEvaluationResults(evaluation);

Registering a custom model (Python):

from evaluation.models import BaseModel
from evaluation import register_model
@register_model('custom_summarizer')
class CustomSummarizationModel(BaseModel):
def __init__(self, config):
super().__init__(config)
self.model = self.load_model()
def load_model(self):
# Load your custom model
pass
def predict(self, input_text):
# Generate summary
summary = self.model.generate(input_text)
return {
'summary': summary,
'confidence': 0.95
}
def evaluate(self, test_data):
# Custom evaluation logic
pass

API reference: EvaluationFramework is the main class for running evaluations.
Methods:
- evaluate(dataset, metrics, models): Run evaluation
- load_test_dataset(path): Load test data
- generate_report(results, format, output_path): Generate evaluation report
- compare_models(model_ids): Compare multiple models
- register_metric(metric): Register custom metric
- register_model(model): Register custom model
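A compact sketch of how these methods compose; the keyword arguments mirror the signatures above and the usage examples elsewhere in this README, while the model identifiers passed to compare_models are hypothetical placeholders.

```python
from evaluation import EvaluationFramework
from evaluation.metrics import ROUGEMetric

framework = EvaluationFramework(config_path="config/evaluation.yaml")
framework.register_metric(ROUGEMetric())  # register a metric instance

dataset = framework.load_test_dataset("data/test_articles.json")
results = framework.evaluate(
    dataset=dataset,
    metrics=["rouge"],
    models=["summarization"],
)

# Hypothetical model IDs; substitute the identifiers registered in your setup.
comparison = framework.compare_models(["summarization_v1", "summarization_v2"])

framework.generate_report(
    results=results,
    output_format="json",
    output_path="reports/api_demo.json",
)
```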
Available metrics:
- ROUGEMetric: ROUGE scores for summarization
- BLEUMetric: BLEU scores
- BERTScoreMetric: Semantic similarity
- ClassificationMetrics: Classification performance
- BiasDetector: Bias detection and analysis
- QualityMetrics: Content quality assessment
- PerformanceMetrics: Speed and resource usage
Specialized evaluators:
- SummarizationEvaluator: Evaluate summarization models
- ClassificationEvaluator: Evaluate classification models
- SentimentEvaluator: Evaluate sentiment analysis
- BiasAnalyzer: Analyze content bias
- QualityAssurance: Overall quality checks

Example: evaluating recent articles end to end:
from evaluation import EvaluationFramework
from evaluation.integrations import SynthoraAIAPI
# Initialize
framework = EvaluationFramework()
api = SynthoraAIAPI(api_key='your_key')
# Fetch recent articles
articles = api.fetch_articles(limit=50)
# Evaluate
results = framework.evaluate(
dataset=articles,
metrics={
'summarization': ['rouge', 'bertscore'],
'classification': ['accuracy', 'f1'],
'bias': ['political', 'source'],
'quality': ['factual', 'coherence', 'readability']
}
)
# Generate comprehensive report
framework.generate_report(
results=results,
format='html',
output='reports/comprehensive_evaluation.html',
include_visualizations=True
)

A/B testing two model versions:

from evaluation import ABTestFramework
ab_test = ABTestFramework()
# Define models to test
model_a = 'gemini-pro-v1'
model_b = 'gemini-pro-v2'
# Run A/B test
results = ab_test.run(
model_a=model_a,
model_b=model_b,
test_dataset='data/test_set.json',
metrics=['rouge', 'quality', 'speed'],
sample_size=1000
)
# Statistical significance
print(f"Winner: {results.winner}")
print(f"Confidence: {results.confidence}%")
print(f"Improvement: {results.improvement}%")

Continuous evaluation and monitoring:

from evaluation import ContinuousEvaluator
from evaluation.monitoring import Dashboard
# Setup continuous evaluation
evaluator = ContinuousEvaluator(
check_interval='1h',
alert_thresholds={
'accuracy_drop': 0.05,
'latency_increase': 100, # ms
'error_rate': 0.01
}
)
# Start monitoring
evaluator.start()
# Launch dashboard
dashboard = Dashboard(port=8080)
dashboard.serve()

# Run all tests
npm test
# Run specific test suite
npm test -- --suite summarization
# Run with coverage
npm run test:coverage
# Python tests
pytest tests/
pytest tests/ --cov=evaluation --cov-report=html

The framework includes built-in visualization capabilities:
from evaluation.visualization import Visualizer
viz = Visualizer(results)
# Generate performance charts
viz.plot_performance_metrics(output='charts/performance.png')
# Generate bias analysis charts
viz.plot_bias_distribution(output='charts/bias.png')
# Generate comparison charts
viz.plot_model_comparison(models=['v1', 'v2'], output='charts/comparison.png')
# Generate interactive dashboard
viz.create_dashboard(output='reports/dashboard.html')

// src/integrations/backend.ts
import { BackendAPI } from '@synthoraai/backend-client';
const backend = new BackendAPI({
baseUrl: process.env.SYNTHORAAI_API_URL,
apiKey: process.env.SYNTHORAAI_API_KEY
});
// Evaluate backend summaries
const summaries = await backend.getSummaries({ limit: 100 });
const evaluation = await framework.evaluateSummaries(summaries);
// Store evaluation results
await backend.storeEvaluationResults(evaluation);

// src/integrations/crawler.ts
import { CrawlerAPI } from '@synthoraai/crawler-client';
const crawler = new CrawlerAPI({
baseUrl: process.env.CRAWLER_API_URL
});
// Evaluate crawler data quality
const crawledArticles = await crawler.getRecentArticles();
const quality = await framework.evaluateDataQuality(crawledArticles);

from evaluation.integrations import AgenticAIPipeline
pipeline = AgenticAIPipeline()
# Evaluate multi-agent system
results = pipeline.evaluate_agents(
agents=['analyzer', 'summarizer', 'classifier', 'sentiment', 'quality'],
test_cases=load_test_cases('data/agentic_test.json')
)
# Per-agent metrics
for agent, metrics in results.items():
print(f"{agent}: {metrics}")

Project structure:

AI-Evaluation-Framework/
├── src/
│   ├── evaluation/          # Core evaluation logic
│   │   ├── framework.ts     # Main framework class
│   │   ├── metrics/         # Metric implementations
│   │   ├── evaluators/      # Specialized evaluators
│   │   └── utils/           # Utility functions
│   ├── integrations/        # SynthoraAI integrations
│   ├── models/              # Model interfaces
│   ├── visualization/       # Visualization tools
│   └── api/                 # REST API endpoints
├── python/
│   ├── evaluation/          # Python evaluation modules
│   │   ├── metrics/         # Metric implementations
│   │   ├── models/          # Model wrappers
│   │   └── utils/           # Utilities
│   ├── tests/               # Python tests
│   └── scripts/             # Utility scripts
├── tests/                   # TypeScript tests
├── data/                    # Test datasets
├── models/                  # Saved models
├── reports/                 # Generated reports
├── docs/                    # Documentation
├── config/                  # Configuration files
├── .github/                 # GitHub Actions
├── docker/                  # Docker configs
├── package.json
├── requirements.txt
├── tsconfig.json
├── pytest.ini
└── README.md

Custom metrics can be added by extending BaseMetric and registering the new class with the framework:
import { BaseMetric } from './evaluation/metrics/base';
export class CustomMetric extends BaseMetric {
name = 'custom_metric';
async evaluate(reference: string, candidate: string): Promise<number> {
// Your custom evaluation logic
const score = this.computeScore(reference, candidate);
return score;
}
private computeScore(ref: string, cand: string): number {
// Implementation
return 0.85;
}
}
// Register the metric
framework.registerMetric(new CustomMetric());

The framework generates comprehensive reports in multiple formats:
- HTML: Interactive web-based reports with charts
- JSON: Machine-readable results
- PDF: Professional formatted reports
- Excel: Tabular data for analysis
- Markdown: Text-based reports
Example report generation:
from evaluation.reporting import ReportGenerator
generator = ReportGenerator()
# Generate multi-format report
generator.generate(
results=evaluation_results,
formats=['html', 'pdf', 'json'],
output_dir='reports/',
include_charts=True,
include_recommendations=True
)

Security and privacy:
- API Key Management: Secure storage using environment variables (see the sketch after this list)
- Data Privacy: No sensitive data is logged or transmitted
- Encryption: All API communications use HTTPS/TLS
- Access Control: Role-based access for different user types
- Audit Logging: Complete audit trail of all evaluations
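As a concrete illustration of the environment-variable approach to key management, the snippet below loads credentials from the .env file created during configuration; using python-dotenv here is an assumption, and any loader that keeps secrets out of source control works equally well.

```python
# Load secrets from .env instead of hard-coding them (illustrative).
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the project root

api_key = os.getenv("SYNTHORAAI_API_KEY")
if not api_key:
    raise RuntimeError("SYNTHORAAI_API_KEY is not set; check your .env file.")
```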
# Build Docker image
docker build -t ai-evaluation-framework .
# Run container
docker run -p 8080:8080 \
-e MONGODB_URI=$MONGODB_URI \
-e GOOGLE_AI_API_KEY=$GOOGLE_AI_API_KEY \
ai-evaluation-framework
# Using Docker Compose
docker-compose up -d

# Apply Kubernetes configs
kubectl apply -f k8s/
# Check status
kubectl get pods -n ai-evaluation

# Deploy to AWS Lambda
npm run deploy:lambda
# Deploy to Azure Functions
npm run deploy:azure
# Deploy to Google Cloud Functions
npm run deploy:gcp

For questions, issues, or contributions:
- GitHub Issues: Report bugs or request features
- Email: [email protected]
- Website: synthoraai.vercel.app
- Documentation: docs.synthoraai.vercel.app
We welcome contributions! Please see CONTRIBUTING.md for guidelines.
# Fork the repository
# Create your feature branch
git checkout -b feature/amazing-feature
# Commit your changes
git commit -m 'Add amazing feature'
# Push to the branch
git push origin feature/amazing-feature
# Open a Pull Request

This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments:
- SynthoraAI Team: For the amazing content curation platform
- Google: For Generative AI API
- OpenAI: For GPT models
- Hugging Face: For transformer models and datasets
- Contributors: All contributors to this project
Roadmap:
- Integration with more AI models (Claude, Llama, etc.)
- Advanced bias detection algorithms
- Real-time evaluation API
- Multi-language support
- Enhanced visualization dashboard
- Automated model optimization
- Federated learning support
- Explainable AI features
Project stats:
- Models Evaluated: 15+
- Metrics Available: 25+
- Test Datasets: 10+
- Evaluation Speed: 1000+ articles/hour
- Accuracy: 95%+ correlation with human evaluation
Made with ❤️ by the SynthoraAI Team