AI Agent Testing & Evaluation Platform
A comprehensive, vendor-agnostic testing system for evaluating AI agents with sophisticated multi-modal judging, real-time integration, and business-ready reporting.
The Universal Tester-Agent (UTA) is a testing platform built around agent-to-agent testing: using AI agents to test other AI agents. This approach enables nuanced, context-aware evaluation that goes well beyond traditional rule-based testing.
- Real AI Testing: Test actual AI agents (ChatGPT, Claude, etc.) with real conversations
- LLM Judges: Sophisticated AI-powered evaluation using GPT-4o for nuanced quality assessment
- Business Reports: Professional, stakeholder-ready reports with detailed insights
- Pluggable Architecture: Extensible system with pluggable strategies, judges, and adapters
- Budget Control: Cost monitoring and budget enforcement for production-ready testing
- Deterministic Testing: Reproducible results with seeded RNG for consistent evaluation
┌─────────────────────┐     ┌─────────────────────┐     ┌─────────────────────┐
│   Test Scenarios    │     │ Testing Strategies  │     │      AI Agents      │
│     (YAML DSL)      │────▶│     (Pluggable)     │────▶│     (Real/HTTP)     │
└─────────────────────┘     └─────────────────────┘     └─────────────────────┘
           │                           │                           │
           ▼                           ▼                           ▼
┌─────────────────────┐     ┌─────────────────────┐     ┌─────────────────────┐
│     Test Runner     │     │     Multi-modal     │     │       Report        │
│   (Orchestrator)    │     │    Judge System     │     │      Generator      │
└─────────────────────┘     └─────────────────────┘     └─────────────────────┘
- Test Runner: Core execution engine that orchestrates test scenarios
- Strategy System: Pluggable testing strategies for different evaluation approaches
- Judge System: Multi-modal evaluation system (heuristic + LLM judges)
- HTTP Adapter: Integration layer for real AI agent testing
- Report Generator: Business-ready HTML report generation
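To make the flow concrete, here is a minimal sketch of how these components fit together at run time. The method names (`strategy_registry.create`, `agent.send`, `judge.evaluate`, `reporter.render`) are illustrative assumptions, not the actual UTA API:

```python
# Illustrative orchestration flow (names are hypothetical, not the real UTA API).
import yaml

def run_scenario(path, agent, strategy_registry, judges, reporter):
    """Load a scenario, drive the conversation, judge it, and report."""
    with open(path) as f:
        scenario = yaml.safe_load(f)          # scenarios are plain YAML

    strategy = strategy_registry.create(scenario["strategy"])
    transcript = []

    for _ in range(scenario["budget"]["max_turns"]):
        user_turn = strategy.get_next_turn(transcript, scenario)
        if user_turn is None:                 # strategy decides the test is done
            break
        reply = agent.send(user_turn)         # HTTP adapter talks to the real agent
        transcript.extend([user_turn, reply])

    # Multi-modal judging: heuristic assertions plus an LLM quality review
    verdicts = [judge.evaluate(scenario, transcript) for judge in judges]
    return reporter.render(scenario, transcript, verdicts)
```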
- FlowIntentStrategy: Tests natural conversation flow and intent understanding
- ToolHappyPathStrategy: Tests successful tool usage and function calling
- MemoryCarryStrategy: Tests context retention across conversation turns
- ToolErrorStrategy: Tests error handling and recovery mechanisms
- DynamicAIStrategy (new): AI-powered dynamic message generation that adapts to any agent automatically
The DynamicAI strategy is designed for scalability:
- AI-Powered: Uses an LLM to generate contextually appropriate messages
- Adaptive: Automatically adapts to any agent's capabilities and conversation style
- Cross-Platform: Works with any OpenAI-compatible API across different domains
- Zero Configuration: No manual setup required; agent capabilities are discovered automatically
- Scalable: Eliminates the need for hardcoded strategies across different applications
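To illustrate the idea, here is a rough sketch of an LLM-driven tester that generates the next user turn instead of following a hardcoded script. The OpenAI client calls are standard, but the class, prompt, and `goal` parameter are illustrative assumptions rather than the shipped DynamicAIStrategy:

```python
# Sketch only: LLM-generated tester turns (not the actual DynamicAIStrategy code).
from openai import OpenAI

class DynamicTesterSketch:
    """Asks an LLM to play the user, adapting to whatever the agent under test says."""

    def __init__(self, model="gpt-4o", goal="Get help updating account details"):
        self.client = OpenAI()   # reads OPENAI_API_KEY from the environment
        self.model = model
        self.goal = goal

    def get_next_turn(self, conversation, context):
        history = "\n".join(f"{m['role']}: {m['content']}" for m in conversation)
        prompt = (
            f"You are testing a customer-facing AI agent. Your goal: {self.goal}.\n"
            f"Conversation so far:\n{history}\n"
            "Write the next realistic user message, and nothing else."
        )
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.7,
        )
        return {"role": "user", "content": response.choices[0].message.content}
```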
Learn More: Dynamic UTA Platform Documentation
- DisturbanceStrategy: Stress testing and interruptions
- PlannerStrategy: Multi-step planning capabilities
- PersonaStrategy: Consistent persona maintenance
- PIIProbeStrategy: Privacy and data protection
- InterruptionStrategy: Conversation interruption handling
- RepeatProbeStrategy: Consistency and edge case testing
- Core Scenarios: Fundamental AI agent capabilities
- Advanced Scenarios: Complex, multi-turn interactions
- Collections Scenarios: Domain-specific business logic testing
id: "CORE_001_INTENT_SUCCESS"
title: "Basic Intent Recognition Success"
description: "Tests that the agent correctly identifies and responds to user intent"
tags: ["core", "intent", "success"]
system_prompt: |
You are a helpful AI assistant. Respond naturally to user requests.
budget:
max_turns: 3
max_latency_ms_avg: 2000
max_cost_usd_per_session: 0.10
strategy: "FlowIntent"
conversation:
- role: "user"
content: "I need help with my account"
intent: "account_help"
- role: "assistant"
expected_intent: "account_help"
hard_assertions:
- type: "contains_any"
values: ["account", "help", "assist"]
soft_metrics:
relevance: 0.8
completeness: 0.7- Python 3.8 or higher
- OpenAI API key (for LLM judge and real agent testing)
- Clone the repository

  git clone https://github.com/your-org/uta.git
  cd uta

- Install dependencies

  pip install -r requirements.txt

- Set up environment

  cp config/env.example .env
  # Edit .env with your API keys

- Run your first test

  python3 -m runner.run scenarios/core/CORE_001_INTENT_SUCCESS.yaml
Create a .env file with your configuration:
# OpenAI Configuration (for LLM judge and real agent testing)
OPENAI_API_KEY=your_openai_api_key_here
OPENAI_BASE_URL=https://api.openai.com
OPENAI_MODEL=gpt-4o
# LLM Judge Configuration
LLM_JUDGE_API_KEY=your_openai_api_key_here
LLM_JUDGE_BASE_URL=https://api.openai.com
LLM_JUDGE_MODEL=gpt-4o
LLM_JUDGE_TEMPERATURE=0.1
LLM_JUDGE_MAX_TOKENS=1000
# UTA Configuration
UTA_LOG_LEVEL=INFO
UTA_OUTPUT_DIR=out

# Run a single scenario
python3 -m runner.run scenarios/core/CORE_001_INTENT_SUCCESS.yaml
# Run multiple scenarios
python3 -m runner.run scenarios/core/*.yaml
# Run with specific output directory
python3 -m runner.run scenarios/core/CORE_001_INTENT_SUCCESS.yaml --output-dir my_test_results

# Run with custom policy
python3 -m runner.run scenarios/collections/s01_promise_to_pay.yaml --policy fixtures/policies_strict.yaml
# Run with budget enforcement
python3 -m runner.run scenarios/advanced/ADV_001_MULTI_TURN_COMPLEX.yaml --budget-enforcement
# Run with deterministic seeding
python3 -m runner.run scenarios/core/CORE_001_INTENT_SUCCESS.yaml --seed 42

# Generate comprehensive dashboard
python3 scripts/generate_dashboard.py
# Open dashboard
open dashboard/index.html

UTA generates comprehensive, business-ready reports including:
- Executive Summary: High-level results and key metrics
- Scenario Analysis: Detailed pass/fail status for each test
- Performance Metrics: Latency, cost, and efficiency analysis
- LLM Judge Evaluation: Sophisticated AI-powered quality assessment
- Budget Analysis: Cost tracking and budget compliance
- Conversation Transcripts: Full interaction logs with analysis
- Interactive Dashboard: Multi-page dashboard for client presentations
- Collapsible Sections: Detailed analysis that can be expanded as needed
- Business Metrics: Stakeholder-friendly insights and recommendations
- Technical Details: Developer-focused implementation information
uta/
├── agents/                  # AI agent adapters and strategies
│   ├── strategies/          # Testing strategies
│   ├── http_adapter.py      # Real agent integration
│   └── product_adapter_*.py # Mock agents
├── judges/                  # Evaluation systems
│   ├── schema_judge.py      # Heuristic evaluation
│   ├── llm_judge.py         # AI-powered evaluation
│   └── unified_judge.py     # Multi-modal judging
├── reporters/               # Report generation
│   ├── templates/           # HTML templates
│   └── dashboard_generator.py
├── runner/                  # Core execution engine
│   ├── run.py               # Main test runner
│   ├── budget_enforcer.py   # Cost and performance monitoring
│   └── seed_manager.py      # Deterministic testing
├── scenarios/               # Test scenarios
│   ├── core/                # Fundamental tests
│   ├── advanced/            # Complex interactions
│   └── collections/         # Domain-specific tests
├── config/                  # Configuration management
├── docs/                    # Documentation
├── examples/                # Usage examples
└── scripts/                 # Utility scripts
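The runner/ modules above suggest how budget enforcement and deterministic seeding fit in. Below is a simplified sketch under those assumptions; the `Budget` dataclass and `within_budget` helper are illustrative, not the actual budget_enforcer.py or seed_manager.py code:

```python
# Simplified sketch of budget checks and seeded determinism (illustrative only).
import random
from dataclasses import dataclass

@dataclass
class Budget:
    max_turns: int
    max_latency_ms_avg: float
    max_cost_usd_per_session: float

def within_budget(budget, turns, latencies_ms, cost_usd):
    """Return (ok, reasons) so the runner can fail fast and report why."""
    reasons = []
    if turns > budget.max_turns:
        reasons.append(f"turns {turns} > {budget.max_turns}")
    if latencies_ms and sum(latencies_ms) / len(latencies_ms) > budget.max_latency_ms_avg:
        reasons.append("average latency over limit")
    if cost_usd > budget.max_cost_usd_per_session:
        reasons.append(f"cost ${cost_usd:.2f} over ${budget.max_cost_usd_per_session:.2f}")
    return (not reasons, reasons)

# Deterministic testing: a seeded RNG makes any random choices reproducible,
# which is what a flag like --seed 42 relies on.
rng = random.Random(42)
print(rng.choice(["polite", "terse", "confused"]))  # same persona pick on every run
```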
- Create strategy class

  # agents/strategies/my_strategy.py
  from .base_strategy import BaseStrategy

  class MyStrategy(BaseStrategy):
      """My custom testing strategy."""

      def get_next_turn(self, conversation, context):
          # Implement your strategy logic
          pass

- Register strategy

  # agents/strategies/registry.py
  from .my_strategy import MyStrategy

  def _register_default_strategies(self):
      # ... existing strategies ...
      self.register("MyStrategy", MyStrategy)

- Use in scenarios

  # scenarios/my_scenario.yaml
  strategy: "MyStrategy"
  # ... rest of scenario

- Create scenario file

  # scenarios/my_category/my_scenario.yaml
  id: "MY_001_EXAMPLE"
  title: "My Test Scenario"
  description: "Tests my specific use case"
  system_prompt: |
    You are a helpful assistant.
  strategy: "FlowIntent"
  conversation:
    - role: "user"
      content: "Hello"
    - role: "assistant"
  hard_assertions:
    - type: "contains_any"
      values: ["hello", "hi", "greeting"]

- Run the scenario

  python3 -m runner.run scenarios/my_category/my_scenario.yaml
# Run all tests
python3 -m pytest tests/
# Run specific test category
python3 -m pytest tests/test_strategies.py
# Run with coverage
python3 -m pytest --cov=agents tests/

- Unit Tests: Individual component testing
- Integration Tests: End-to-end scenario testing
- Strategy Tests: Testing strategy implementations
- Judge Tests: Evaluation system testing
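As a concrete example of the unit-test style, here is a minimal pytest sketch; the `contains_any` helper is hypothetical and simply mirrors the contains_any hard assertion used in the scenarios above:

```python
# tests/test_contains_any_sketch.py - illustrative unit test, not part of the suite.
import pytest

def contains_any(text, values):
    """Hypothetical helper mirroring the 'contains_any' hard assertion."""
    lowered = text.lower()
    return any(v.lower() in lowered for v in values)

@pytest.mark.parametrize(
    "reply,expected",
    [
        ("Sure, I can help with your account.", True),
        ("Let me assist you right away.", True),
        ("Please hold.", False),
    ],
)
def test_contains_any(reply, expected):
    assert contains_any(reply, ["account", "help", "assist"]) is expected
```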
- Real-World UTA Flow: How agent-to-agent testing works in practice
- User Guide: Comprehensive usage documentation
- Strategy Reference: Detailed strategy documentation
- Business Reporting Guide: Report interpretation guide
- Real Agent Testing Guide: Testing with real AI agents
We welcome contributions! Please see our Contributing Guide for details.
- Fork the repository
- Create a feature branch
git checkout -b feature/my-awesome-feature
- Make your changes
- Add tests
- Submit a pull request
- Follow PEP 8 style guidelines
- Use type hints for all functions
- Add docstrings for all classes and methods
- Write tests for new functionality
This project is licensed under the MIT License - see the LICENSE file for details.
- OpenAI for providing the GPT models used in LLM judging
- The Python community for excellent libraries and tools
- Contributors and users who help improve UTA
- Documentation: docs/
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: [email protected]
- Basic test runner and scenario execution
- Mock agent implementations
- Heuristic judging system
- Basic reporting
- LLM-powered judging
- Real agent integration
- Budget enforcement
- Deterministic seeding
- Multi-tenant support
- Advanced analytics
- CI/CD integration
- Enterprise security features
- Plugin marketplace
- Community scenarios
- Third-party integrations
- Advanced AI models
Built with ❤️ for the AI community
Testing AI agents with AI agents - the future of AI quality assurance.