skills/auto-arena/SKILL.md

---
name: auto-arena
description: >
Automatically evaluate and compare multiple AI models or agents without
pre-existing test data. Generates test queries from a task description,
collects responses from all target endpoints, auto-generates evaluation
rubrics, runs pairwise comparisons via a judge model, and produces
win-rate rankings with reports and charts. Supports checkpoint resume,
incremental endpoint addition, and judge model hot-swap.
Use when the user asks to compare, benchmark, or rank multiple models
or agents on a custom task, or run an arena-style evaluation.
---

# Auto Arena Skill

End-to-end automated model comparison using the OpenJudge `AutoArenaPipeline`:

1. **Generate queries** — LLM creates diverse test queries from task description
2. **Collect responses** — query all target endpoints concurrently
3. **Generate rubrics** — LLM produces evaluation criteria from task + sample queries
4. **Pairwise evaluation** — judge model compares every model pair (with position-bias swap)
5. **Analyze & rank** — compute win rates, win matrix, and rankings
6. **Report & charts** — Markdown report + win-rate bar chart + optional matrix heatmap

## Prerequisites

```bash
# Install OpenJudge
pip install py-openjudge

# Extra dependency for auto_arena (chart generation)
pip install matplotlib
```

## Gather from user before running

| Info | Required? | Notes |
|------|-----------|-------|
| Task description | Yes | What the models/agents should do (set in config YAML) |
| Target endpoints | Yes | At least 2 OpenAI-compatible endpoints to compare |
| Judge endpoint | Yes | Strong model for pairwise evaluation (e.g. `gpt-4`, `qwen-max`) |
| API keys | Yes | Env vars: `OPENAI_API_KEY`, `DASHSCOPE_API_KEY`, etc. |
| Number of queries | No | Default: `20` |
| Seed queries | No | Example queries to guide generation style |
| System prompts | No | Per-endpoint system prompts |
| Output directory | No | Default: `./evaluation_results` |
| Report language | No | `"zh"` (default) or `"en"` |

## Quick start

### CLI

```bash
# Run evaluation
python -m cookbooks.auto_arena --config config.yaml --save

# Use pre-generated queries
python -m cookbooks.auto_arena --config config.yaml \
    --queries_file queries.json --save

# Start fresh, ignore checkpoint
python -m cookbooks.auto_arena --config config.yaml --fresh --save

# Re-run only pairwise evaluation with new judge model
# (keeps queries, responses, and rubrics)
python -m cookbooks.auto_arena --config config.yaml --rerun-judge --save
```

### Python API

```python
import asyncio
from cookbooks.auto_arena.auto_arena_pipeline import AutoArenaPipeline

async def main():
    pipeline = AutoArenaPipeline.from_config("config.yaml")
    result = await pipeline.evaluate()

    print(f"Best model: {result.best_pipeline}")
    for rank, (model, win_rate) in enumerate(result.rankings, 1):
        print(f"{rank}. {model}: {win_rate:.1%}")

asyncio.run(main())
```

### Minimal Python API (no config file)

```python
import asyncio
from cookbooks.auto_arena.auto_arena_pipeline import AutoArenaPipeline
from cookbooks.auto_arena.schema import OpenAIEndpoint

async def main():
    pipeline = AutoArenaPipeline(
        task_description="Customer service chatbot for e-commerce",
        target_endpoints={
            "gpt4": OpenAIEndpoint(
                base_url="https://api.openai.com/v1",
                api_key="sk-...",
                model="gpt-4",
            ),
            "qwen": OpenAIEndpoint(
                base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
                api_key="sk-...",
                model="qwen-max",
            ),
        },
        judge_endpoint=OpenAIEndpoint(
            base_url="https://api.openai.com/v1",
            api_key="sk-...",
            model="gpt-4",
        ),
        num_queries=20,
    )
    result = await pipeline.evaluate()
    print(f"Best: {result.best_pipeline}")

asyncio.run(main())
```

## CLI options

| Flag | Default | Description |
|------|---------|-------------|
| `--config` | — | Path to YAML configuration file (required) |
| `--output_dir` | config value | Override output directory |
| `--queries_file` | — | Path to pre-generated queries JSON (skip generation) |
| `--save` | `False` | Save results to file |
| `--fresh` | `False` | Start fresh, ignore checkpoint |
| `--rerun-judge` | `False` | Re-run pairwise evaluation only (keep queries/responses/rubrics) |

## Minimal config file

```yaml
task:
  description: "Academic GPT assistant for research and writing tasks"

target_endpoints:
  model_v1:
    base_url: "https://api.openai.com/v1"
    api_key: "${OPENAI_API_KEY}"
    model: "gpt-4"
  model_v2:
    base_url: "https://api.openai.com/v1"
    api_key: "${OPENAI_API_KEY}"
    model: "gpt-3.5-turbo"

judge_endpoint:
  base_url: "https://api.openai.com/v1"
  api_key: "${OPENAI_API_KEY}"
  model: "gpt-4"
```
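
The `${ENV_VAR}` placeholders in `api_key` are resolved from the environment when the config is loaded. A minimal sketch of how such substitution can work (the `expand_env_vars` helper and its regex are illustrative, not OpenJudge's actual implementation):

```python
import os
import re

def expand_env_vars(value: str) -> str:
    """Replace ${VAR} placeholders with values from the environment."""
    return re.sub(
        r"\$\{([^}]+)\}",            # capture the name between ${ and }
        lambda m: os.environ.get(m.group(1), ""),
        value,
    )

os.environ["DEMO_KEY"] = "sk-demo"
print(expand_env_vars("${DEMO_KEY}"))  # sk-demo
```

Keeping keys in environment variables means the config file can be committed without leaking secrets.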

## Full config reference

### task

| Field | Required | Description |
|-------|----------|-------------|
| `description` | Yes | Clear description of the task models will be tested on |
| `scenario` | No | Usage scenario for additional context |

### target_endpoints.\<name\>

| Field | Default | Description |
|-------|---------|-------------|
| `base_url` | — | API base URL (required) |
| `api_key` | — | API key, supports `${ENV_VAR}` (required) |
| `model` | — | Model name (required) |
| `system_prompt` | — | System prompt for this endpoint |
| `extra_params` | — | Extra API params (e.g. `temperature`, `max_tokens`) |

### judge_endpoint

Same fields as `target_endpoints.<name>`. Use a strong model (e.g. `gpt-4`, `qwen-max`) with low temperature (~0.1) for consistent judgments.
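
For example, assuming sampling parameters go through `extra_params` as in the endpoint table above:

```yaml
judge_endpoint:
  base_url: "https://api.openai.com/v1"
  api_key: "${OPENAI_API_KEY}"
  model: "gpt-4"
  extra_params:
    temperature: 0.1
```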

### query_generation

| Field | Default | Description |
|-------|---------|-------------|
| `num_queries` | `20` | Total number of queries to generate |
| `seed_queries` | — | Example queries to guide generation |
| `categories` | — | Query categories with weights for stratified generation |
| `endpoint` | judge endpoint | Custom endpoint for query generation |
| `queries_per_call` | `10` | Queries generated per API call (1–50) |
| `num_parallel_batches` | `3` | Parallel generation batches |
| `temperature` | `0.9` | Sampling temperature (0.0–2.0) |
| `top_p` | `0.95` | Top-p sampling (0.0–1.0) |
| `max_similarity` | `0.85` | Dedup similarity threshold (0.0–1.0) |
| `enable_evolution` | `false` | Enable Evol-Instruct complexity evolution |
| `evolution_rounds` | `1` | Evolution rounds (0–3) |
| `complexity_levels` | `["constraints", "reasoning", "edge_cases"]` | Evolution strategies |
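
Putting several of these fields together, a `query_generation` section might look like the following; the category names, weights, and the exact mapping shape of `categories` are illustrative assumptions based on the table above:

```yaml
query_generation:
  num_queries: 40
  queries_per_call: 10
  num_parallel_batches: 4
  temperature: 0.9
  max_similarity: 0.85
  seed_queries:
    - "How do I request a refund for a damaged item?"
  categories:
    product_questions: 0.5
    order_issues: 0.3
    complaints: 0.2
  enable_evolution: true
  evolution_rounds: 1
```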

### evaluation

| Field | Default | Description |
|-------|---------|-------------|
| `max_concurrency` | `10` | Max concurrent API requests |
| `timeout` | `60` | Request timeout in seconds |
| `retry_times` | `3` | Retry attempts for failed requests |
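
These three settings correspond to the standard bounded-concurrency-with-retry pattern. A sketch of the idea (not the pipeline's actual request code; `call_with_limits` is a hypothetical helper):

```python
import asyncio

async def call_with_limits(fn, sem: asyncio.Semaphore,
                           timeout: float = 60, retry_times: int = 3):
    """Run an async call under a concurrency cap, with timeout and retries."""
    async with sem:  # max_concurrency: at most N calls in flight at once
        for attempt in range(retry_times):
            try:
                return await asyncio.wait_for(fn(), timeout=timeout)
            except (asyncio.TimeoutError, OSError):
                if attempt == retry_times - 1:
                    raise  # retries exhausted: surface the failure
                await asyncio.sleep(2 ** attempt)  # simple exponential backoff

async def demo():
    sem = asyncio.Semaphore(10)  # evaluation.max_concurrency

    async def fake_request():
        return "ok"

    return await call_with_limits(fake_request, sem)

print(asyncio.run(demo()))  # ok
```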

### output

| Field | Default | Description |
|-------|---------|-------------|
| `output_dir` | `./evaluation_results` | Output directory |
| `save_queries` | `true` | Save generated queries |
| `save_responses` | `true` | Save model responses |
| `save_details` | `true` | Save detailed results |

### report

| Field | Default | Description |
|-------|---------|-------------|
| `enabled` | `false` | Enable Markdown report generation |
| `language` | `"zh"` | Report language: `"zh"` or `"en"` |
| `include_examples` | `3` | Examples per section (1–10) |
| `chart.enabled` | `true` | Generate win-rate chart |
| `chart.orientation` | `"horizontal"` | `"horizontal"` or `"vertical"` |
| `chart.show_values` | `true` | Show values on bars |
| `chart.highlight_best` | `true` | Highlight best model |
| `chart.matrix_enabled` | `false` | Generate win-rate matrix heatmap |
| `chart.format` | `"png"` | Chart format: `"png"`, `"svg"`, or `"pdf"` |

## Interpreting results

**Win rate:** percentage of pairwise comparisons a model wins. Each pair is evaluated in both orders (original + swapped) to eliminate position bias.

**Rankings example:**
```
1. gpt4_baseline [################----] 80.0%
2. qwen_candidate [############--------] 60.0%
3. llama_finetuned [##########----------] 50.0%
```

**Win matrix:** `win_matrix[A][B]` = how often model A beats model B across all queries.
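
The swap-and-count mechanics can be sketched in a few lines; the judgment tuples below are made-up illustrations, not real pipeline output:

```python
from collections import defaultdict

# One record per ordering: (model_a, model_b, winner). Every pair is
# judged twice (original + swapped) so position bias cancels out.
judgments = [
    ("gpt4", "qwen", "gpt4"), ("qwen", "gpt4", "gpt4"),
    ("gpt4", "llama", "gpt4"), ("llama", "gpt4", "llama"),
    ("qwen", "llama", "qwen"), ("llama", "qwen", "qwen"),
]

wins = defaultdict(int)
matchups = defaultdict(int)
matrix = defaultdict(lambda: defaultdict(int))  # matrix[winner][loser]

for a, b, winner in judgments:
    matchups[a] += 1
    matchups[b] += 1
    wins[winner] += 1
    loser = b if winner == a else a
    matrix[winner][loser] += 1

# Win rate = wins / comparisons entered, sorted best-first.
rankings = sorted(
    ((m, wins[m] / matchups[m]) for m in matchups),
    key=lambda x: x[1],
    reverse=True,
)
for rank, (model, rate) in enumerate(rankings, 1):
    print(f"{rank}. {model}: {rate:.1%}")
```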

## Checkpoint & resume

The pipeline saves progress after each step. Interrupted runs resume automatically:

- `--fresh` — ignore checkpoint, start from scratch
- `--rerun-judge` — re-run only the pairwise evaluation step (useful when switching judge models); keeps queries, responses, and rubrics intact
- Adding new endpoints to config triggers incremental response collection; existing responses are preserved
- Removing an endpoint does not delete its stored responses; start with `--fresh` if you want a clean slate

## Output files

```
evaluation_results/
├── evaluation_results.json # Rankings, win rates, win matrix
├── evaluation_report.md # Detailed Markdown report (if enabled)
├── win_rate_chart.png # Win-rate bar chart (if enabled)
├── win_rate_matrix.png # Matrix heatmap (if matrix_enabled)
├── queries.json # Generated test queries
├── responses.json # All model responses
├── rubrics.json # Generated evaluation rubrics
├── comparison_details.json # Pairwise comparison details
└── checkpoint.json # Pipeline checkpoint
```

## API key by model

| Model prefix | Environment variable |
|-------------|---------------------|
| `gpt-*`, `o1-*`, `o3-*` | `OPENAI_API_KEY` |
| `claude-*` | `ANTHROPIC_API_KEY` |
| `qwen-*`, `dashscope/*` | `DASHSCOPE_API_KEY` |
| `deepseek-*` | `DEEPSEEK_API_KEY` |
| Custom endpoint | set `api_key` + `base_url` in config |

## Additional resources

- Full config examples: [cookbooks/auto_arena/examples/](https://github.com/agentscope-ai/OpenJudge/tree/main/cookbooks/auto_arena/examples)
- Documentation: [Auto Arena Guide](https://agentscope-ai.github.io/OpenJudge/applications/auto_arena/)

skills/bib-verify/SKILL.md

---
name: bib-verify
description: >
Verify a BibTeX file for hallucinated or fabricated references by cross-checking
every entry against CrossRef, arXiv, and DBLP. Reports each reference as
verified, suspect, or not found, with field-level mismatch details (title,
authors, year, DOI). Use when the user wants to check a .bib file for fake
citations, validate references in a paper, or audit bibliography entries for
accuracy.
---

# BibTeX Verification Skill

Check every entry in a `.bib` file against real academic databases using the
OpenJudge `PaperReviewPipeline` in BibTeX-only mode:

1. **Parse** — extract all entries from the `.bib` file
2. **Lookup** — query CrossRef, arXiv, and DBLP for each reference
3. **Match** — compare title, authors, year, and DOI
4. **Report** — flag each entry as `verified`, `suspect`, or `not_found`

## Prerequisites

```bash
pip install py-openjudge litellm
```

## Gather from user before running

| Info | Required? | Notes |
|------|-----------|-------|
| BibTeX file path | Yes | `.bib` file to verify |
| CrossRef email | No | Improves CrossRef API rate limits |

## Quick start

```bash
# Verify a standalone .bib file
python -m cookbooks.paper_review --bib_only references.bib

# With CrossRef email for better rate limits
python -m cookbooks.paper_review --bib_only references.bib --email your@email.com

# Save report to a custom path
python -m cookbooks.paper_review --bib_only references.bib \
    --email your@email.com --output bib_report.md
```

## Relevant options

| Flag | Default | Description |
|------|---------|-------------|
| `--bib_only` | — | Path to `.bib` file (required for standalone verification) |
| `--email` | — | CrossRef mailto — improves rate limits, recommended |
| `--output` | auto | Output `.md` report path |
| `--language` | `en` | Report language: `en` or `zh` |

## Interpreting results

Each reference entry is assigned one of three statuses:

| Status | Meaning |
|--------|---------|
| `verified` | Found in CrossRef / arXiv / DBLP with matching fields |
| `suspect` | A candidate record exists, but key fields (title, authors) mismatch; likely mis-cited or hallucinated |
| `not_found` | No match in any database; treat as likely fabricated and verify manually |

**Field-level details** are shown for `suspect` entries:
- `title_match` — whether the title matches a real paper
- `author_match` — whether the author list matches
- `year_match` — whether the publication year is correct
- `doi_match` — whether the DOI resolves to the right paper
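
Title matching has to tolerate case and punctuation differences. A minimal sketch of the kind of normalized fuzzy comparison involved (the 0.9 threshold and helper names are illustrative, not OpenJudge internals):

```python
import re
from difflib import SequenceMatcher

def normalize(title: str) -> str:
    """Lowercase and strip punctuation so formatting differences don't count."""
    return re.sub(r"[^a-z0-9 ]", "", title.lower()).strip()

def titles_match(a: str, b: str, threshold: float = 0.9) -> bool:
    """Fuzzy-compare two titles after normalization."""
    ratio = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
    return ratio >= threshold

print(titles_match("Attention is All you Need", "Attention Is All You Need!"))  # True
print(titles_match("Attention Is All You Need", "A Completely Different Paper"))  # False
```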

## Additional resources

- Full pipeline options: [../paper-review/reference.md](../paper-review/reference.md)
- Combined PDF review + BibTeX verification: [../paper-review/SKILL.md](../paper-review/SKILL.md)