A sophisticated benchmarking tool that evaluates LLM performance on content transformation tasks with enhanced instruction diversity, creative content generation, proper scoring logic, and comprehensive logging.
MetaBench challenges language models with increasingly complex transformation tasks - from simple format conversions to multi-layered data restructuring with conditional logic. It generates creative content, applies sophisticated transformation instructions, and provides detailed scoring on how well the LLM follows complex, multi-step requirements.
MetaBench follows a simple 4-step process for each trial:
1. Generate Content → 2. Create Instructions → 3. Transform → 4. Score
Step 1: Generate Content
Original content: "Alice: 25, Bob: 30, Carol: 28"
Step 2: Create Transformation Instructions
Convert to JSON format with ages grouped by decade
Step 3: LLM Transforms
{
"twenties": [
{"name": "Alice", "age": 25},
{"name": "Carol", "age": 28}
],
"thirties": [
{"name": "Bob", "age": 30}
]
}

Step 4: Score the Result
- ✓ Followed JSON format: 10/10
- ✓ Grouped by decade correctly: 10/10
- ✓ Preserved all data: 10/10
- Final Score: 10/10
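Conceptually, each trial wires these four steps together. The sketch below is an illustration only; the helpers are simplified stand-ins, not MetaBench's actual API:

```python
# Illustration of the four-step trial flow; these helpers are simplified
# stand-ins for the LLM calls MetaBench actually makes.

def generate_content(content_type: str) -> str:
    # Step 1: produce creative source content
    return "Alice: 25, Bob: 30, Carol: 28"

def create_instructions(complexity: int) -> str:
    # Step 2: build transformation rules that scale with complexity
    return "Convert to JSON format with ages grouped by decade"

def transform(content: str, instructions: str) -> str:
    # Step 3: the model under test applies the instructions
    return '{"twenties": [{"name": "Alice", "age": 25}], "thirties": [{"name": "Bob", "age": 30}]}'

def score(instructions: str, output: str) -> float:
    # Step 4: a verification pass rates format, grouping, and data preservation
    return 10.0

content = generate_content("data")
instructions = create_instructions(complexity=1)
output = transform(content, instructions)
print(f"Score: {score(instructions, output)}/10")
```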
As complexity increases, instructions become multi-layered with conditional logic, cross-references, and validation requirements.
MetaBench is built with a modular architecture for maintainability and extensibility:
- `main.py` - Entry point with CLI handling and main execution flow
- `modules/benchmark.py` - Main benchmark orchestration and execution logic
- `modules/config.py` - Configuration data classes and validation
- `modules/cli.py` - Command-line argument parsing and interactive mode
- `modules/llm.py` - LLM wrapper with retry logic and error handling
- `modules/generators.py` - Content and instruction generation engines
- `modules/engines.py` - Transformation and verification engines
- `modules/parsers.py` - Custom output parsers for structured responses
- `modules/utils.py` - Utility functions (env resolution, logging setup)
This modular design allows for easy testing, modification, and extension of individual components without affecting the entire system.
- 🎯 Enhanced Instruction Generation: Creates diverse transformation instructions with varying complexity levels (1-5)
- 🎨 Creative Content Generation: Generates varied content with random enhancements (unicode symbols, emojis, special formatting)
- ⚙️ Comprehensive Transformation Engine: Applies complex transformation instructions to content
- 🔍 Advanced Verification: Scores transformation results with detailed criteria and specific scoring categories
- 📝 Dual Logging: File-based detailed logging and console emoji output for easy monitoring
- 📄 Multiple Content Types: Supports code, text, data, configuration, and documentation
- 📈 Progress Tracking: Real-time progress updates with trial completion status
- 🛡️ Robust Error Handling: Network, JSON, and unexpected error handling with graceful degradation
# Install uv if you haven't already
curl -LsSf https://astral.sh/uv/install.sh | sh
# Clone and run
git clone <your-repo>
cd metabench
uv run main.py --quick

This project uses uv for dependency management. If you don't have uv installed, choose one of the methods below:
macOS and Linux:
curl -LsSf https://astral.sh/uv/install.sh | sh

Windows:
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"

Homebrew (macOS):
brew install uv

PyPI (with pipx):
pipx install uv

PyPI (with pip):
pip install uv

WinGet (Windows):
winget install --id=astral-sh.uv -e

Scoop (Windows):
scoop install main/uv

Cargo:
cargo install --git https://github.com/astral-sh/uv uv

Once uv is installed:
# Install dependencies
uv sync
# Or run directly (uv will handle dependencies automatically)
uv run main.py

# Quick test (2 complexity levels, 1 trial each)
uv run main.py --quick
# Full benchmark with default settings
uv run main.py
# Custom complexity and trials
uv run main.py --complexity 4 --trials 5

# Local model (default)
uv run main.py --url http://localhost:1234/v1
# OpenAI API
uv run main.py --url https://api.openai.com/v1 --api-key sk-your-key --model gpt-4
# Anthropic Claude
uv run main.py --url https://api.anthropic.com --api-key your-key --model claude-3-sonnet

MetaBench supports individual temperature settings for different types of tasks, allowing fine-tuned control over model behavior:
# Use default temperatures (optimized for each task type)
uv run main.py
# Set base temperature for all tasks
uv run main.py --temperature 0.5
# Fine-tune individual temperatures
uv run main.py \
--temperature 0.3 \
--creative-temperature 0.8 \
--verification-temperature 0.1 \
--transform-temperature 0.4

Temperature Defaults:
- Base temperature: 0.3 (used as fallback)
- Creative temperature: 0.7 (higher for more diverse content generation)
- Verification temperature: 0.1 (lower for consistent, reliable scoring)
- Transform temperature: Uses base temperature (balanced approach)
Temperature Guidelines:
- Low (0.0-0.3): More focused, deterministic responses
- Medium (0.3-0.7): Balanced creativity and consistency
- High (0.7-1.0+): More creative, diverse responses
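As a rough sketch of the fallback rule described above (the parameter names mirror the BenchmarkConfig fields shown in the Python API section; the resolution logic itself is an assumption for illustration, not the project's actual code):

```python
# Sketch of per-task temperature selection: task-specific temperatures win,
# and the transform temperature falls back to the base temperature.
# Assumption for illustration, not the project's actual code.
def effective_temperature(task: str,
                          temperature: float = 0.3,
                          creative_temperature: float = 0.7,
                          verification_temperature: float = 0.1,
                          transform_temperature: float | None = None) -> float:
    if task == "creative":
        return creative_temperature
    if task == "verification":
        return verification_temperature
    if task == "transform":
        return transform_temperature if transform_temperature is not None else temperature
    return temperature

print(effective_temperature("transform"))     # 0.3 (falls back to the base temperature)
print(effective_temperature("verification"))  # 0.1
```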
Different Models for Different Tasks:
# Use specialized models with optimized temperatures
uv run main.py \
--model gpt-4 \
--creative-model gpt-4 --creative-temperature 0.8 \
--verification-model gpt-3.5-turbo --verification-temperature 0.1 \
--transform-model gpt-4 --transform-temperature 0.3

# Test only code transformations
uv run main.py --content code --topic "machine learning"
# Focus on specific content types
uv run main.py --content "code,documentation" --complexity 3
# Topic-focused benchmark
uv run main.py --topic "blockchain" --trials 2

# Guided setup with prompts
uv run main.py --interactive

🚀 Starting Transformation Benchmark
Configuration: 3 complexity levels, 2 trials each
Content types: code, text, data
Total trials: 6
════════════════════════════════════════════════════════════
📊 Complexity Level 1
Trial 1/6 (16.7%) - code
🔄 Starting trial: Complexity 1, Type code
🎯 Target format: nested JSON with metadata and timestamps
🎨 Creative variations: add emoji and modern text symbols, include technical jargon
✅ Generated 342 chars with 2 creative variations
✅ Verification complete: Quality=8.50/10, Completion=92.0%
⏱️ Trial completed in 12.3s
📊 BENCHMARK SUMMARY
════════════════════════════════════════════════════════════
Total Trials: 6
Average Quality Score: 7.83/10
Average Completion Rate: 89.2%
Error Rate: 0.0%
Total Time: 67.2s
✅ Verification complete: Quality=8.50/10, Completion=92.0%
🎯 basic_rules_score: 9.20/10
🎯 conditional_operations_score: 8.10/10
⚫ advanced_processing_score: N/A (not applicable)
🎯 verification_requirements_score: 8.70/10
🎯 data_preservation_score: 9.50/10
🎯 format_compliance_score: 8.90/10
📊 Average applicable score: 8.88/10
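The average excludes criteria marked N/A. A minimal sketch of that calculation, using the criterion names from the sample above (the dictionary representation is an assumption for illustration):

```python
# Average only the applicable criterion scores; N/A entries are skipped.
scores = {
    "basic_rules_score": 9.2,
    "conditional_operations_score": 8.1,
    "advanced_processing_score": None,  # N/A (not applicable)
    "verification_requirements_score": 8.7,
    "data_preservation_score": 9.5,
    "format_compliance_score": 8.9,
}

applicable = [v for v in scores.values() if v is not None]
print(f"Average applicable score: {sum(applicable) / len(applicable):.2f}/10")  # 8.88/10
```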
from modules import BenchmarkConfig, TransformationBenchmark
config = BenchmarkConfig(
base_url="http://localhost:1234/v1",
api_key="your-api-key",
max_complexity=4,
trials_per_complexity=3,
content_types=["code", "text", "data"],
topic="artificial intelligence",
temperature=0.3,
creative_temperature=0.7,
verification_temperature=0.1,
transform_temperature=0.3
)
benchmark = TransformationBenchmark(config)
summary = benchmark.run_benchmark()

You can reference environment variables using the env: prefix in command-line arguments:
# Set environment variables
export OPENROUTER_API_KEY="sk-or-v1-your-key-here"
export OPENROUTER_BASE_URL="https://openrouter.ai/api/v1"
export OPENAI_API_KEY="sk-your-openai-key"
# Reference them with env: prefix
uv run main.py --api-key env:OPENROUTER_API_KEY --url env:OPENROUTER_BASE_URL --model "deepseek/deepseek-r1"
uv run main.py --api-key env:OPENAI_API_KEY --url "https://api.openai.com/v1" --model "gpt-4"

You can also use a .env file (automatically loaded):
# .env file
OPENROUTER_API_KEY=sk-or-v1-your-key-here
OPENROUTER_BASE_URL=https://openrouter.ai/api/v1
OPENAI_API_KEY=sk-your-openai-key

Supported environment variable formats:
- `--api-key env:OPENROUTER_API_KEY` - Uses `$OPENROUTER_API_KEY`
- `--url env:OPENROUTER_BASE_URL` - Uses `$OPENROUTER_BASE_URL`
- `--api-key env:OPENAI_API_KEY` - Uses `$OPENAI_API_KEY`
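A minimal sketch of how an env: prefix can be resolved; this mirrors the behavior described above but is not a quote of modules/utils.py:

```python
import os

# Resolve "env:VARIABLE_NAME" arguments to the variable's value; anything
# without the prefix is returned unchanged. Illustration only; this is not
# a quote of modules/utils.py.
def resolve_env(value: str) -> str:
    if value.startswith("env:"):
        name = value[len("env:"):]
        if name not in os.environ:
            raise ValueError(f"Environment variable {name} is not set")
        return os.environ[name]
    return value

print(resolve_env("https://api.openai.com/v1"))  # passed through unchanged
print(resolve_env("env:PATH"))                   # replaced with the value of $PATH
```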
| Level | Description | Features |
|---|---|---|
| 1 | Basic formatting | Simple rules, basic transformations |
| 2 | Conditional logic | If/then operations, basic metadata |
| 3 | Advanced processing | Multiple conditions, generated IDs |
| 4 | Cross-references | Bidirectional links, validation |
| 5 | Full complexity | Security, optimization, multiple outputs |
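For example, a level 4 instruction might read (hypothetical, for illustration only): "Convert the records to JSON, assign each entry a generated ID, add a `related_to` field linking entries bidirectionally, and append a validation section listing any links that do not resolve."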
% uv run main.py --help
usage: main.py [-h] [--interactive] [--quick] [--url URL] [--creative-url CREATIVE_URL] [--verification-url VERIFICATION_URL]
[--transform-url TRANSFORM_URL] [--api-key API_KEY] [--model MODEL] [--creative-model CREATIVE_MODEL]
[--verification-model VERIFICATION_MODEL] [--transform-model TRANSFORM_MODEL] [--complexity COMPLEXITY]
[--trials TRIALS] [--temperature TEMPERATURE] [--creative-temperature CREATIVE_TEMPERATURE]
[--verification-temperature VERIFICATION_TEMPERATURE] [--transform-temperature TRANSFORM_TEMPERATURE]
[--max-retries MAX_RETRIES] [--verification-attempts VERIFICATION_ATTEMPTS]
[--verification-aggregation {best,avg,worst}] [--content CONTENT] [--topic TOPIC] [--output OUTPUT]
[--log-file LOG_FILE] [--quiet]
Enhanced Self-Evaluating Transformation Benchmark with Retry Logic
options:
-h, --help show this help message and exit
--interactive, -i Run in interactive mode
--quick, -q Quick benchmark (2 complexity levels, 1 trial each)
--url URL, --base-url URL
Default base URL for the LLM API (supports env:VARIABLE_NAME format)
--creative-url CREATIVE_URL
Base URL for creative LLM (defaults to --url if not specified)
--verification-url VERIFICATION_URL
Base URL for verification LLM (defaults to --url if not specified)
--transform-url TRANSFORM_URL
Base URL for transform LLM (defaults to --url if not specified)
--api-key API_KEY API key for the LLM service (supports env:VARIABLE_NAME format)
--model MODEL, --model-name MODEL
Default model name (leave empty for local models)
--creative-model CREATIVE_MODEL
Specific model for creative content generation
--verification-model VERIFICATION_MODEL
Specific model for verification tasks
--transform-model TRANSFORM_MODEL
Specific model for transformations
--complexity COMPLEXITY, --max-complexity COMPLEXITY
Maximum complexity level (1-5)
--trials TRIALS, --trials-per-complexity TRIALS
Number of trials per complexity level
--temperature TEMPERATURE
Base temperature for LLM
--creative-temperature CREATIVE_TEMPERATURE
Temperature for creative content generation (defaults to 0.7)
--verification-temperature VERIFICATION_TEMPERATURE
Temperature for verification tasks (defaults to 0.1)
--transform-temperature TRANSFORM_TEMPERATURE
Temperature for transformation tasks (defaults to --temperature)
--max-retries MAX_RETRIES
Maximum retries per operation (default: 3)
--verification-attempts VERIFICATION_ATTEMPTS
Number of verification attempts per trial (default: 1)
--verification-aggregation {best,avg,worst}
How to aggregate multiple verification scores (default: avg)
--content CONTENT, --content-types CONTENT
Comma-separated list of content types
--topic TOPIC Topic for content generation (makes content topic-specific)
--output OUTPUT, -o OUTPUT
Output filename for results (auto-generated if not specified)
--log-file LOG_FILE Log filename (auto-generated if not specified)
--quiet Reduce output verbosity
Examples:
# Quick test with local model
python benchmark.py --quick
# Full benchmark with specific model and temperatures
python benchmark.py --model gpt-4 --url https://api.openai.com/v1 --temperature 0.3 --creative-temperature 0.8
# Different URLs and models for different stages
python benchmark.py --url http://localhost:1234/v1 --creative-url http://creative.local:1234/v1 --verification-url http://verification.local:1234/v1
# With retry and verification configuration
python benchmark.py --max-retries 5 --verification-attempts 3 --verification-aggregation best
# Topic-focused benchmark with fine-tuned temperatures
python benchmark.py --topic "blockchain" --content code,text --creative-temperature 0.8 --verification-temperature 0.1
# Complete temperature control example
python benchmark.py --temperature 0.2 --creative-temperature 0.9 --verification-temperature 0.05 --transform-temperature 0.4
# Interactive mode
python benchmark.py --interactive

- Detailed logs: `benchmark_log_YYYYMMDD_HHMMSS.log`
- JSON results: `benchmark_results_YYYYMMDD_HHMMSS.json`
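Results are plain JSON, so they can be post-processed directly. A small sketch (the filename is illustrative; the keys follow the structure shown below):

```python
import json

# Load a results file and print the headline numbers.
with open("results/benchmark_results_20250101_120000.json") as f:  # illustrative filename
    results = json.load(f)

summary = results["summary"]
print(f"Trials: {summary['total_trials']}")
print(f"Average score: {summary['average_score']:.2f}/10")
print(f"Completion rate: {summary['average_completion_rate']:.1%}")
for level, score in summary["scores_by_complexity"].items():
    print(f"  Complexity {level}: {score:.1f}/10")
```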
{
"config": { /* sanitized configuration */ },
"results": [ /* detailed trial results */ ],
"summary": {
"total_trials": 15,
"average_score": 7.83,
"average_completion_rate": 0.892,
"scores_by_complexity": { "1": 8.2, "2": 7.8, "3": 7.4 }
}
}

Connection errors:
# Check if your LLM server is running
curl http://localhost:1234/v1/models
# Test with a simple request
uv run main.py --quick --trials 1

JSON parsing errors:
- Usually indicates the LLM is not following JSON format requirements
- Try reducing complexity or adjusting temperature
- Check logs for full LLM responses
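When diagnosing these failures, it can help to pull the first JSON object out of a chatty response before giving up. A generic sketch (illustration only, not the project's parser):

```python
import json

# Generic helper: pull the first JSON object out of an LLM response that may
# wrap it in prose. Illustration only; this is not the project's
# modules/parsers.py.
def extract_json(response: str) -> dict:
    start, end = response.find("{"), response.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("No JSON object found in response")
    return json.loads(response[start:end + 1])

print(extract_json('Sure! The result is {"status": "ok"} as requested.'))
```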
Low scores:
- Normal for higher complexity levels
- Ensure your model has sufficient context length
- Consider using more capable models for complex transformations
# Run with detailed logging
uv run main.py --quick --log-file debug.log
# Check the logs directory
ls -la logs/

- Python: 3.12+
- Dependencies: LangChain Core, OpenAI packages (handled by uv)
- LLM API: Compatible endpoint (local or remote)
- Memory: ~500MB for typical runs
- Storage: ~50MB per benchmark run (logs + results)
metabench/
├── main.py             # Main entry point and CLI interface
├── pyproject.toml      # uv project configuration
├── README.md           # This documentation
├── .gitignore          # Git ignore rules
├── uv.lock             # Dependency lock file
├── modules/            # Modular components
│   ├── __init__.py     # Module exports and package definition
│   ├── benchmark.py    # Main benchmark runner class
│   ├── cli.py          # CLI argument parsing and interactive mode
│   ├── config.py       # Configuration classes and validation
│   ├── engines.py      # Transformation and verification engines
│   ├── generators.py   # Content and instruction generators
│   ├── llm.py          # LLM wrapper functionality
│   ├── parsers.py      # Custom output parsers
│   └── utils.py        # Environment variable resolution and utilities
├── logs/               # Auto-created benchmark logs
│   └── benchmark_log_*.log
└── results/            # Auto-created benchmark results
    └── benchmark_results_*.json
- Fork the repository
- Create a feature branch
- Test with `uv run main.py --quick`
- Submit a pull request
MIT License - see LICENSE file for details.