Real-time web evaluation framework for LLM browser agents.
# Install
pip install -e .
playwright install chromium
# Configure
cp .env.example .env
# Edit .env with your API_KEY
# Run evaluation
python eval.py --seed 42 --verbose

# Basic evaluation
python eval.py --seed 42
# Specific template
python eval.py --templates weather/current_weather --seed 42
# Multi-task evaluation
python eval.py --num-tasks 3 --seed 42
# Deterministic task by ID
python eval.py --task-id 100001 --seed 42
# View all templates
python eval.py --show-registry

| Option | Description | Default |
|---|---|---|
| `--seed` | Random seed | random |
| `--task-id` | Deterministic task ID | - |
| `--num-tasks` | Sub-tasks (1-4) | 1 |
| `--templates` | Template(s) to use | random |
| `--model` | LLM model | zai-org/GLM-4.7-TEE |
| `--base-url` | API URL | https://llm.chutes.ai/v1 |
| `--timeout` | Timeout (seconds) | 3600 |
| `--verbose` | Verbose output | false |
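
When running the same evaluation across several seeds, it can help to assemble the command lines programmatically. The following is a minimal sketch, assuming `eval.py` is in the current directory and accepts the options listed above; `build_command` is a hypothetical helper, not part of the framework.

```python
import sys
from typing import List, Optional


def build_command(seed: int, templates: Optional[str] = None, num_tasks: int = 1) -> List[str]:
    """Assemble an eval.py command line from the documented options (hypothetical helper)."""
    # The options table documents 1-4 sub-tasks, so reject anything else early.
    if not 1 <= num_tasks <= 4:
        raise ValueError("--num-tasks must be between 1 and 4")
    cmd = [sys.executable, "eval.py", "--seed", str(seed), "--num-tasks", str(num_tasks)]
    if templates is not None:
        cmd += ["--templates", templates]
    return cmd


# One command per seed. Nothing is executed here; the lists are only built.
commands = [build_command(seed, templates="weather/current_weather") for seed in (1, 2, 3)]
```

Each entry in `commands` can then be passed to `subprocess.run(cmd, check=True)` to run the evaluations one after another.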
location_name, time_of_day, multi_day, current_weather, astronomy, weather_comparison
stooq_price, stooq_comparison, stooq_ranking, stooq_sector_analysis, stooq_currency, stooq_volatility, stooq_range_position
coingecko_price, coingecko_volume, coingecko_comparison, coingecko_rank, coingecko_top_movers, coingecko_supply, coingecko_ath, coingecko_performance
taostats_subnet_info, taostats_comparison, taostats_analysis, taostats_ranking, taostats_price_change, taostats_threshold, taostats_multi_condition, taostats_delta, taostats_range_count, taostats_percentage
hybrid_top_performer, hybrid_ranking, hybrid_conditional_branch
| Variable | Description |
|---|---|
| `API_KEY` | LLM API key (required) |
| `COINGECKO_API_KEY` | CoinGecko Pro API key (optional) |
| `TAOSTATS_API_KEY` | Taostats API key (optional) |
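
A quick pre-flight check can catch a missing key before a long run starts. This is a sketch only: it assumes the variables have already been exported or loaded from `.env` into the process environment, and the helper names `missing_required` and `unset_optional` are hypothetical.

```python
import os
from typing import List, Mapping

# Variable names taken from the table above.
REQUIRED = ("API_KEY",)
OPTIONAL = ("COINGECKO_API_KEY", "TAOSTATS_API_KEY")


def missing_required(env: Mapping[str, str]) -> List[str]:
    """Return the required variables that are unset or empty."""
    return [name for name in REQUIRED if not env.get(name)]


def unset_optional(env: Mapping[str, str]) -> List[str]:
    """Return the optional variables that are unset or empty."""
    return [name for name in OPTIONAL if not env.get(name)]
```

Calling `missing_required(os.environ)` returns an empty list when the required key is present; a non-empty result names what still needs to be set.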
Results are saved to `eval/<timestamp>.json`:
{
"score": 1.0,
"success": true,
"extra": {
"seed": 42,
"answer_details": [...],
"conversation": [...]
}
}

License: MIT