🤖 Universal Tester-Agent (UTA)

AI Agent Testing & Evaluation Platform

A vendor-agnostic testing system for evaluating AI agents, combining multi-modal judging (heuristic and LLM-based), live integration with real agents over HTTP, and business-ready reporting.

Python 3.8+ | License: MIT | Status: Production Ready

🚀 Overview

The Universal Tester-Agent (UTA) is a testing platform built around agent-to-agent testing: using AI agents to test other AI agents. This approach yields more nuanced, context-aware evaluation than traditional rule-based testing.

✨ Key Features

  • 🤖 Real AI Testing: Test actual AI agents (ChatGPT, Claude, etc.) with real conversations
  • 🧠 LLM Judges: Sophisticated AI-powered evaluation using GPT-4o for nuanced quality assessment
  • 📊 Business Reports: Professional, stakeholder-ready reports with detailed insights
  • 🔧 Pluggable Architecture: Extensible system with pluggable strategies, judges, and adapters
  • 💰 Budget Control: Cost monitoring and budget enforcement for production-ready testing
  • 🎯 Deterministic Testing: Reproducible results with seeded RNG for consistent evaluation

πŸ—οΈ Architecture

┌───────────────────┐    ┌───────────────────┐    ┌───────────────────┐
│  Test Scenarios   │    │ Testing Strategies│    │     AI Agents     │
│    (YAML DSL)     │───▶│    (Pluggable)    │───▶│    (Real/HTTP)    │
└───────────────────┘    └───────────────────┘    └───────────────────┘
          │                        │                        │
          ▼                        ▼                        ▼
┌───────────────────┐    ┌───────────────────┐    ┌───────────────────┐
│    Test Runner    │    │    Multi-modal    │    │      Report       │
│  (Orchestrator)   │    │   Judge System    │    │     Generator     │
└───────────────────┘    └───────────────────┘    └───────────────────┘

Core Components

  • Test Runner: Core execution engine that orchestrates test scenarios
  • Strategy System: Pluggable testing strategies for different evaluation approaches
  • Judge System: Multi-modal evaluation system (heuristic + LLM judges)
  • HTTP Adapter: Integration layer for real AI agent testing
  • Report Generator: Business-ready HTML report generation
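
These components fit together roughly as in the sketch below; the names (run_scenario, strategy_registry.create, adapter.send, judge.evaluate) are illustrative placeholders, not the exact API exposed by runner/run.py:

# Simplified orchestration sketch. Names are illustrative placeholders,
# not the exact API of runner/run.py.
import yaml

def run_scenario(path, strategy_registry, adapter, judge, reporter):
    """Load a YAML scenario, drive the conversation, judge it, and report on it."""
    with open(path) as f:
        scenario = yaml.safe_load(f)

    strategy = strategy_registry.create(scenario["strategy"])
    transcript = []

    for _ in range(scenario.get("budget", {}).get("max_turns", 10)):
        user_turn = strategy.get_next_turn(transcript, scenario)  # tester decides what to say
        if user_turn is None:
            break                                                 # strategy signals it is done
        reply = adapter.send(user_turn)                           # agent under test responds
        transcript.extend([user_turn, reply])

    verdict = judge.evaluate(scenario, transcript)                # heuristic + LLM judging
    reporter.render(scenario, transcript, verdict)                # business-ready HTML report
    return verdict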

🎯 Testing Strategies

Currently Implemented

  1. FlowIntentStrategy: Tests natural conversation flow and intent understanding
  2. ToolHappyPathStrategy: Tests successful tool usage and function calling
  3. MemoryCarryStrategy: Tests context retention across conversation turns
  4. ToolErrorStrategy: Tests error handling and recovery mechanisms
  5. DynamicAIStrategy: 🚀 NEW! AI-powered dynamic message generation that adapts to any agent automatically

🚀 Dynamic AI Strategy (Scalable Solution)

The DynamicAI strategy is what lets UTA scale across agents and domains (a minimal sketch of the idea follows the list below):

  • 🤖 AI-Powered: Uses an LLM to generate contextually appropriate messages
  • 🔄 Adaptive: Automatically adapts to any agent's capabilities and conversation style
  • 🌐 Cross-Platform: Works with any OpenAI-compatible API across different domains
  • ⚡ Zero Configuration: No manual setup required; automatically discovers agent capabilities
  • 📈 Scalable: Eliminates the need for hardcoded strategies across different applications
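
A minimal sketch of the idea, assuming an OpenAI-compatible client; the prompt and function name are illustrative, not the strategy's actual implementation:

# Illustrative sketch of LLM-driven test-message generation (not the DynamicAI source).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY / OPENAI_BASE_URL from the environment

def generate_next_user_message(goal, transcript):
    """Ask an LLM to play the tester and write the next user message."""
    history = "\n".join(f"{turn['role']}: {turn['content']}" for turn in transcript)
    prompt = (
        "You are a tester probing another AI agent.\n"
        f"Testing goal: {goal}\n"
        f"Conversation so far:\n{history}\n"
        "Write the single next user message that best advances the testing goal."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3,
    )
    return response.choices[0].message.content.strip()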

Learn More: Dynamic UTA Platform Documentation

Planned Strategies

  • DisturbanceStrategy: Stress testing and interruptions
  • PlannerStrategy: Multi-step planning capabilities
  • PersonaStrategy: Consistent persona maintenance
  • PIIProbeStrategy: Privacy and data protection
  • InterruptionStrategy: Conversation interruption handling
  • RepeatProbeStrategy: Consistency and edge case testing

📋 Test Scenarios

Scenario Categories

  • Core Scenarios: Fundamental AI agent capabilities
  • Advanced Scenarios: Complex, multi-turn interactions
  • Collections Scenarios: Domain-specific business logic testing

Example Scenario Structure

id: "CORE_001_INTENT_SUCCESS"
title: "Basic Intent Recognition Success"
description: "Tests that the agent correctly identifies and responds to user intent"
tags: ["core", "intent", "success"]

system_prompt: |
  You are a helpful AI assistant. Respond naturally to user requests.

budget:
  max_turns: 3
  max_latency_ms_avg: 2000
  max_cost_usd_per_session: 0.10

strategy: "FlowIntent"

conversation:
  - role: "user"
    content: "I need help with my account"
    intent: "account_help"
  
  - role: "assistant"
    expected_intent: "account_help"
    hard_assertions:
      - type: "contains_any"
        values: ["account", "help", "assist"]
    soft_metrics:
      relevance: 0.8
      completeness: 0.7
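
Hard assertions gate pass/fail deterministically, while soft metrics are scored by the judges. As an example, a contains_any assertion can be checked in a few lines of Python (an illustrative sketch, not the schema judge's exact code):

# Illustrative check for a "contains_any" hard assertion (not the actual schema_judge code).
def check_contains_any(response_text, expected_values):
    """Pass if the reply contains at least one expected phrase (case-insensitive)."""
    text = response_text.lower()
    return any(value.lower() in text for value in expected_values)

# The assertion from the scenario above:
assert check_contains_any("Sure, I can help you with your account.", ["account", "help", "assist"])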

🚀 Quick Start

Prerequisites

  • Python 3.8 or higher
  • OpenAI API key (for LLM judge and real agent testing)

Installation

  1. Clone the repository

    git clone https://github.com/your-org/uta.git
    cd uta
  2. Install dependencies

    pip install -r requirements.txt
  3. Set up environment

    cp config/env.example .env
    # Edit .env with your API keys
  4. Run your first test

    python3 -m runner.run scenarios/core/CORE_001_INTENT_SUCCESS.yaml

Environment Configuration

Create a .env file with your configuration:

# OpenAI Configuration (for LLM judge and real agent testing)
OPENAI_API_KEY=your_openai_api_key_here
OPENAI_BASE_URL=https://api.openai.com
OPENAI_MODEL=gpt-4o

# LLM Judge Configuration
LLM_JUDGE_API_KEY=your_openai_api_key_here
LLM_JUDGE_BASE_URL=https://api.openai.com
LLM_JUDGE_MODEL=gpt-4o
LLM_JUDGE_TEMPERATURE=0.1
LLM_JUDGE_MAX_TOKENS=1000

# UTA Configuration
UTA_LOG_LEVEL=INFO
UTA_OUTPUT_DIR=out
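
The variable names above are the ones the runner expects; how they get loaded is up to you. A minimal sketch using python-dotenv (the loading code itself is illustrative):

# Illustrative configuration loading. Variable names come from the .env example above.
import os
from dotenv import load_dotenv

load_dotenv()  # copies values from .env into the process environment

OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]                      # required
OPENAI_MODEL = os.getenv("OPENAI_MODEL", "gpt-4o")                 # optional, with a default
LLM_JUDGE_TEMPERATURE = float(os.getenv("LLM_JUDGE_TEMPERATURE", "0.1"))
UTA_OUTPUT_DIR = os.getenv("UTA_OUTPUT_DIR", "out")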

📊 Usage Examples

Basic Test Run

# Run a single scenario
python3 -m runner.run scenarios/core/CORE_001_INTENT_SUCCESS.yaml

# Run multiple scenarios
python3 -m runner.run scenarios/core/*.yaml

# Run with specific output directory
python3 -m runner.run scenarios/core/CORE_001_INTENT_SUCCESS.yaml --output-dir my_test_results

Advanced Configuration

# Run with custom policy
python3 -m runner.run scenarios/collections/s01_promise_to_pay.yaml --policy fixtures/policies_strict.yaml

# Run with budget enforcement
python3 -m runner.run scenarios/advanced/ADV_001_MULTI_TURN_COMPLEX.yaml --budget-enforcement

# Run with deterministic seeding
python3 -m runner.run scenarios/core/CORE_001_INTENT_SUCCESS.yaml --seed 42
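
Both flags map to simple mechanics: budget enforcement compares session totals against the scenario's budget block, and the seed makes any randomized choices repeatable. A rough sketch of each (illustrative, not the budget_enforcer or seed_manager code):

# Illustrative budget check and deterministic seeding (not the actual runner modules).
import random

def within_budget(budget, turns, avg_latency_ms, cost_usd):
    """True if the session stayed inside the scenario's budget block."""
    return (
        turns <= budget.get("max_turns", float("inf"))
        and avg_latency_ms <= budget.get("max_latency_ms_avg", float("inf"))
        and cost_usd <= budget.get("max_cost_usd_per_session", float("inf"))
    )

def seeded_rng(seed=42):
    """A dedicated RNG: runs with the same seed make the same random choices."""
    return random.Random(seed)

budget = {"max_turns": 3, "max_latency_ms_avg": 2000, "max_cost_usd_per_session": 0.10}
print(within_budget(budget, turns=3, avg_latency_ms=1500, cost_usd=0.04))  # True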

Generate Dashboard

# Generate comprehensive dashboard
python3 scripts/generate_dashboard.py

# Open dashboard
open dashboard/index.html

📈 Reports

UTA generates comprehensive, business-ready reports including:

  • Executive Summary: High-level results and key metrics
  • Scenario Analysis: Detailed pass/fail status for each test
  • Performance Metrics: Latency, cost, and efficiency analysis
  • LLM Judge Evaluation: Sophisticated AI-powered quality assessment
  • Budget Analysis: Cost tracking and budget compliance
  • Conversation Transcripts: Full interaction logs with analysis
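
Under the hood, LLM judge evaluation boils down to sending the transcript plus a rubric to the judge model and parsing structured scores back. A condensed sketch (the prompt and JSON shape are illustrative, not llm_judge.py's exact format):

# Condensed sketch of LLM-as-judge scoring. Prompt and JSON shape are illustrative.
import json
from openai import OpenAI

client = OpenAI()

def judge_transcript(transcript_text, model="gpt-4o"):
    """Ask the judge model for 0-1 scores; assumes it answers with bare JSON."""
    prompt = (
        "Score the assistant's replies in this conversation from 0 to 1 for "
        "relevance and completeness. Answer with JSON like "
        '{"relevance": 0.0, "completeness": 0.0, "rationale": "..."}.\n\n'
        + transcript_text
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.1,
    )
    return json.loads(response.choices[0].message.content)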

Report Features

  • Interactive Dashboard: Multi-page dashboard for client presentations
  • Collapsible Sections: Detailed analysis that can be expanded as needed
  • Business Metrics: Stakeholder-friendly insights and recommendations
  • Technical Details: Developer-focused implementation information

🔧 Development

Project Structure

uta/
├── agents/                 # AI agent adapters and strategies
│   ├── strategies/        # Testing strategies
│   ├── http_adapter.py    # Real agent integration
│   └── product_adapter_*.py # Mock agents
├── judges/                # Evaluation systems
│   ├── schema_judge.py    # Heuristic evaluation
│   ├── llm_judge.py       # AI-powered evaluation
│   └── unified_judge.py   # Multi-modal judging
├── reporters/             # Report generation
│   ├── templates/         # HTML templates
│   └── dashboard_generator.py
├── runner/                # Core execution engine
│   ├── run.py            # Main test runner
│   ├── budget_enforcer.py # Cost and performance monitoring
│   └── seed_manager.py   # Deterministic testing
├── scenarios/             # Test scenarios
│   ├── core/             # Fundamental tests
│   ├── advanced/         # Complex interactions
│   └── collections/      # Domain-specific tests
├── config/               # Configuration management
├── docs/                 # Documentation
├── examples/             # Usage examples
└── scripts/              # Utility scripts

Adding New Strategies

  1. Create strategy class

    # agents/strategies/my_strategy.py
    from .base_strategy import BaseStrategy
    
    class MyStrategy(BaseStrategy):
        """My custom testing strategy."""
        
        def get_next_turn(self, conversation, context):
            # Implement your strategy logic
            pass
  2. Register strategy

    # agents/strategies/registry.py
    from .my_strategy import MyStrategy
    
    def _register_default_strategies(self):
        # ... existing strategies ...
        self.register("MyStrategy", MyStrategy)
  3. Use in scenarios

    # scenarios/my_scenario.yaml
    strategy: "MyStrategy"
    # ... rest of scenario
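
For reference, a complete (if trivial) strategy built against the get_next_turn interface shown in step 1 might look like this; the scripted behaviour is just an example, not a built-in strategy:

# agents/strategies/scripted_greeting_strategy.py -- toy example, not a built-in strategy
from .base_strategy import BaseStrategy

class ScriptedGreetingStrategy(BaseStrategy):
    """Sends a fixed sequence of user messages, then ends the session."""

    SCRIPT = [
        "Hi there!",
        "What can you help me with?",
    ]

    def get_next_turn(self, conversation, context):
        # Count how many user turns we have already sent.
        sent = sum(1 for turn in conversation if turn.get("role") == "user")
        if sent >= len(self.SCRIPT):
            return None  # no more turns, so the runner can end the session
        return {"role": "user", "content": self.SCRIPT[sent]}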

Adding New Scenarios

  1. Create scenario file

    # scenarios/my_category/my_scenario.yaml
    id: "MY_001_EXAMPLE"
    title: "My Test Scenario"
    description: "Tests my specific use case"
    
    system_prompt: |
      You are a helpful assistant.
    
    strategy: "FlowIntent"
    
    conversation:
      - role: "user"
        content: "Hello"
      - role: "assistant"
        hard_assertions:
          - type: "contains_any"
            values: ["hello", "hi", "greeting"]
  2. Run the scenario

    python3 -m runner.run scenarios/my_category/my_scenario.yaml

🧪 Testing

Running Tests

# Run all tests
python3 -m pytest tests/

# Run specific test category
python3 -m pytest tests/test_strategies.py

# Run with coverage
python3 -m pytest --cov=agents tests/
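
Strategy tests can usually drive a strategy directly, with no network calls. A minimal sketch (module path and expectations follow the MyStrategy example above and may differ from the real test suite):

# tests/test_my_strategy.py -- illustrative unit test, based on the MyStrategy example above
from agents.strategies.my_strategy import MyStrategy

def test_first_turn_is_a_user_message_or_none():
    strategy = MyStrategy()
    turn = strategy.get_next_turn(conversation=[], context={})
    assert turn is None or turn.get("role") == "user"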

Test Categories

  • Unit Tests: Individual component testing
  • Integration Tests: End-to-end scenario testing
  • Strategy Tests: Testing strategy implementations
  • Judge Tests: Evaluation system testing

📚 Documentation

🤝 Contributing

We welcome contributions! Please see our Contributing Guide for details.

Development Setup

  1. Fork the repository
  2. Create a feature branch
    git checkout -b feature/my-awesome-feature
  3. Make your changes
  4. Add tests
  5. Submit a pull request

Code Style

  • Follow PEP 8 style guidelines
  • Use type hints for all functions
  • Add docstrings for all classes and methods
  • Write tests for new functionality

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

πŸ™ Acknowledgments

  • OpenAI for providing the GPT models used in LLM judging
  • The Python community for excellent libraries and tools
  • Contributors and users who help improve UTA

📞 Support

🚀 Roadmap

Phase 1: Core Platform ✅

  • Basic test runner and scenario execution
  • Mock agent implementations
  • Heuristic judging system
  • Basic reporting

Phase 2: Advanced Features ✅

  • LLM-powered judging
  • Real agent integration
  • Budget enforcement
  • Deterministic seeding

Phase 3: Enterprise Features 🚧

  • Multi-tenant support
  • Advanced analytics
  • CI/CD integration
  • Enterprise security features

Phase 4: Ecosystem 🌟

  • Plugin marketplace
  • Community scenarios
  • Third-party integrations
  • Advanced AI models

Built with ❤️ for the AI community

Testing AI agents with AI agents - the future of AI quality assurance.
