skills/auto-arena/SKILL.md

---
name: auto-arena
description: >
Automatically evaluate and compare multiple AI models or agents without
pre-existing test data. Generates test queries from a task description,
collects responses from all target endpoints, auto-generates evaluation
rubrics, runs pairwise comparisons via a judge model, and produces
win-rate rankings with reports and charts. Supports checkpoint resume,
incremental endpoint addition, and judge model hot-swap.
Use when the user asks to compare, benchmark, or rank multiple models
or agents on a custom task, or run an arena-style evaluation.
---

# Auto Arena Skill

End-to-end automated model comparison using the OpenJudge `AutoArenaPipeline`:

1. **Generate queries** — LLM creates diverse test queries from task description
2. **Collect responses** — query all target endpoints concurrently
3. **Generate rubrics** — LLM produces evaluation criteria from task + sample queries
4. **Pairwise evaluation** — judge model compares every model pair (with position-bias swap)
5. **Analyze & rank** — compute win rates, win matrix, and rankings
6. **Report & charts** — Markdown report + win-rate bar chart + optional matrix heatmap

## Prerequisites

```bash
# Install OpenJudge
pip install py-openjudge

# Extra dependency for auto_arena (chart generation)
pip install matplotlib
```

## Gather from user before running

| Info | Required? | Notes |
|------|-----------|-------|
| Task description | Yes | What the models/agents should do (set in config YAML) |
| Target endpoints | Yes | At least 2 OpenAI-compatible endpoints to compare |
| Judge endpoint | Yes | Strong model for pairwise evaluation (e.g. `gpt-4`, `qwen-max`) |
| API keys | Yes | Env vars: `OPENAI_API_KEY`, `DASHSCOPE_API_KEY`, etc. |
| Number of queries | No | Default: `20` |
| Seed queries | No | Example queries to guide generation style |
| System prompts | No | Per-endpoint system prompts |
| Output directory | No | Default: `./evaluation_results` |
| Report language | No | `"zh"` (default) or `"en"` |

## Quick start

### CLI

```bash
# Run evaluation
python -m cookbooks.auto_arena --config config.yaml --save

# Use pre-generated queries
python -m cookbooks.auto_arena --config config.yaml \
    --queries_file queries.json --save

# Start fresh, ignore checkpoint
python -m cookbooks.auto_arena --config config.yaml --fresh --save

# Re-run only pairwise evaluation with new judge model
# (keeps queries, responses, and rubrics)
python -m cookbooks.auto_arena --config config.yaml --rerun-judge --save
```

### Python API

```python
import asyncio
from cookbooks.auto_arena.auto_arena_pipeline import AutoArenaPipeline

async def main():
    pipeline = AutoArenaPipeline.from_config("config.yaml")
    result = await pipeline.evaluate()

    print(f"Best model: {result.best_pipeline}")
    for rank, (model, win_rate) in enumerate(result.rankings, 1):
        print(f"{rank}. {model}: {win_rate:.1%}")

asyncio.run(main())
```

### Minimal Python API (no config file)

```python
import asyncio
from cookbooks.auto_arena.auto_arena_pipeline import AutoArenaPipeline
from cookbooks.auto_arena.schema import OpenAIEndpoint

async def main():
    pipeline = AutoArenaPipeline(
        task_description="Customer service chatbot for e-commerce",
        target_endpoints={
            "gpt4": OpenAIEndpoint(
                base_url="https://api.openai.com/v1",
                api_key="sk-...",
                model="gpt-4",
            ),
            "qwen": OpenAIEndpoint(
                base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
                api_key="sk-...",
                model="qwen-max",
            ),
        },
        judge_endpoint=OpenAIEndpoint(
            base_url="https://api.openai.com/v1",
            api_key="sk-...",
            model="gpt-4",
        ),
        num_queries=20,
    )
    result = await pipeline.evaluate()
    print(f"Best: {result.best_pipeline}")

asyncio.run(main())
```

## CLI options

| Flag | Default | Description |
|------|---------|-------------|
| `--config` | — | Path to YAML configuration file (required) |
| `--output_dir` | config value | Override output directory |
| `--queries_file` | — | Path to pre-generated queries JSON (skip generation) |
| `--save` | `False` | Save results to file |
| `--fresh` | `False` | Start fresh, ignore checkpoint |
| `--rerun-judge` | `False` | Re-run pairwise evaluation only (keep queries/responses/rubrics) |

## Minimal config file

```yaml
task:
  description: "Academic GPT assistant for research and writing tasks"

target_endpoints:
  model_v1:
    base_url: "https://api.openai.com/v1"
    api_key: "${OPENAI_API_KEY}"
    model: "gpt-4"
  model_v2:
    base_url: "https://api.openai.com/v1"
    api_key: "${OPENAI_API_KEY}"
    model: "gpt-3.5-turbo"

judge_endpoint:
  base_url: "https://api.openai.com/v1"
  api_key: "${OPENAI_API_KEY}"
  model: "gpt-4"
```
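
The `${ENV_VAR}` placeholders in `api_key` are resolved from the environment when the config is loaded. A minimal sketch of how such substitution can work (the `expand_env_vars` helper and its regex are illustrative, not OpenJudge's actual implementation):

```python
import os
import re

def expand_env_vars(value: str) -> str:
    """Replace ${VAR} placeholders with values from the environment."""
    return re.sub(
        r"\$\{([^}]+)\}",            # capture the name between ${ and }
        lambda m: os.environ.get(m.group(1), ""),
        value,
    )

os.environ["DEMO_KEY"] = "sk-demo"
print(expand_env_vars("${DEMO_KEY}"))  # sk-demo
```

Keeping keys in environment variables means the config file can be committed without leaking secrets.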

## Full config reference

### task

| Field | Required | Description |
|-------|----------|-------------|
| `description` | Yes | Clear description of the task models will be tested on |
| `scenario` | No | Usage scenario for additional context |

### target_endpoints.\<name\>

| Field | Default | Description |
|-------|---------|-------------|
| `base_url` | — | API base URL (required) |
| `api_key` | — | API key, supports `${ENV_VAR}` (required) |
| `model` | — | Model name (required) |
| `system_prompt` | — | System prompt for this endpoint |
| `extra_params` | — | Extra API params (e.g. `temperature`, `max_tokens`) |

### judge_endpoint

Same fields as `target_endpoints.<name>`. Use a strong model (e.g. `gpt-4`, `qwen-max`) with low temperature (~0.1) for consistent judgments.
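
For example, assuming sampling parameters go through `extra_params` as in the endpoint table above:

```yaml
judge_endpoint:
  base_url: "https://api.openai.com/v1"
  api_key: "${OPENAI_API_KEY}"
  model: "gpt-4"
  extra_params:
    temperature: 0.1
```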

### query_generation

| Field | Default | Description |
|-------|---------|-------------|
| `num_queries` | `20` | Total number of queries to generate |
| `seed_queries` | — | Example queries to guide generation |
| `categories` | — | Query categories with weights for stratified generation |
| `endpoint` | judge endpoint | Custom endpoint for query generation |
| `queries_per_call` | `10` | Queries generated per API call (1–50) |
| `num_parallel_batches` | `3` | Parallel generation batches |
| `temperature` | `0.9` | Sampling temperature (0.0–2.0) |
| `top_p` | `0.95` | Top-p sampling (0.0–1.0) |
| `max_similarity` | `0.85` | Dedup similarity threshold (0.0–1.0) |
| `enable_evolution` | `false` | Enable Evol-Instruct complexity evolution |
| `evolution_rounds` | `1` | Evolution rounds (0–3) |
| `complexity_levels` | `["constraints", "reasoning", "edge_cases"]` | Evolution strategies |
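
Putting several of these fields together, a `query_generation` section might look like the following; the category names, weights, and the exact mapping shape of `categories` are illustrative assumptions based on the table above:

```yaml
query_generation:
  num_queries: 40
  queries_per_call: 10
  num_parallel_batches: 4
  temperature: 0.9
  max_similarity: 0.85
  seed_queries:
    - "How do I request a refund for a damaged item?"
  categories:
    product_questions: 0.5
    order_issues: 0.3
    complaints: 0.2
  enable_evolution: true
  evolution_rounds: 1
```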

### evaluation

| Field | Default | Description |
|-------|---------|-------------|
| `max_concurrency` | `10` | Max concurrent API requests |
| `timeout` | `60` | Request timeout in seconds |
| `retry_times` | `3` | Retry attempts for failed requests |
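
These three settings correspond to the standard bounded-concurrency-with-retry pattern. A sketch of the idea (not the pipeline's actual request code; `call_with_limits` is a hypothetical helper):

```python
import asyncio

async def call_with_limits(fn, sem: asyncio.Semaphore,
                           timeout: float = 60, retry_times: int = 3):
    """Run an async call under a concurrency cap, with timeout and retries."""
    async with sem:  # max_concurrency: at most N calls in flight at once
        for attempt in range(retry_times):
            try:
                return await asyncio.wait_for(fn(), timeout=timeout)
            except (asyncio.TimeoutError, OSError):
                if attempt == retry_times - 1:
                    raise  # retries exhausted: surface the failure
                await asyncio.sleep(2 ** attempt)  # simple exponential backoff

async def demo():
    sem = asyncio.Semaphore(10)  # evaluation.max_concurrency

    async def fake_request():
        return "ok"

    return await call_with_limits(fake_request, sem)

print(asyncio.run(demo()))  # ok
```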

### output

| Field | Default | Description |
|-------|---------|-------------|
| `output_dir` | `./evaluation_results` | Output directory |
| `save_queries` | `true` | Save generated queries |
| `save_responses` | `true` | Save model responses |
| `save_details` | `true` | Save detailed results |

### report

| Field | Default | Description |
|-------|---------|-------------|
| `enabled` | `false` | Enable Markdown report generation |
| `language` | `"zh"` | Report language: `"zh"` or `"en"` |
| `include_examples` | `3` | Examples per section (1–10) |
| `chart.enabled` | `true` | Generate win-rate chart |
| `chart.orientation` | `"horizontal"` | `"horizontal"` or `"vertical"` |
| `chart.show_values` | `true` | Show values on bars |
| `chart.highlight_best` | `true` | Highlight best model |
| `chart.matrix_enabled` | `false` | Generate win-rate matrix heatmap |
| `chart.format` | `"png"` | Chart format: `"png"`, `"svg"`, or `"pdf"` |

## Interpreting results

**Win rate:** percentage of pairwise comparisons a model wins. Each pair is evaluated in both orders (original + swapped) to eliminate position bias.

**Rankings example:**
```
1. gpt4_baseline [################----] 80.0%
2. qwen_candidate [############--------] 60.0%
3. llama_finetuned [##########----------] 50.0%
```

**Win matrix:** `win_matrix[A][B]` = how often model A beats model B across all queries.
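
The swap-and-count mechanics can be sketched in a few lines; the judgment tuples below are made-up illustrations, not real pipeline output:

```python
from collections import defaultdict

# One record per ordering: (model_a, model_b, winner). Every pair is
# judged twice (original + swapped) so position bias cancels out.
judgments = [
    ("gpt4", "qwen", "gpt4"), ("qwen", "gpt4", "gpt4"),
    ("gpt4", "llama", "gpt4"), ("llama", "gpt4", "llama"),
    ("qwen", "llama", "qwen"), ("llama", "qwen", "qwen"),
]

wins = defaultdict(int)
matchups = defaultdict(int)
matrix = defaultdict(lambda: defaultdict(int))  # matrix[winner][loser]

for a, b, winner in judgments:
    matchups[a] += 1
    matchups[b] += 1
    wins[winner] += 1
    loser = b if winner == a else a
    matrix[winner][loser] += 1

# Win rate = wins / comparisons entered, sorted best-first.
rankings = sorted(
    ((m, wins[m] / matchups[m]) for m in matchups),
    key=lambda x: x[1],
    reverse=True,
)
for rank, (model, rate) in enumerate(rankings, 1):
    print(f"{rank}. {model}: {rate:.1%}")
```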

## Checkpoint & resume

The pipeline saves progress after each step. Interrupted runs resume automatically:

- `--fresh` — ignore checkpoint, start from scratch
- `--rerun-judge` — re-run only the pairwise evaluation step (useful when switching judge models); keeps queries, responses, and rubrics intact
- Adding new endpoints to config triggers incremental response collection; existing responses are preserved
- Removing an endpoint does not delete its stored responses; start with `--fresh` if you want a clean slate

## Output files

```
evaluation_results/
├── evaluation_results.json # Rankings, win rates, win matrix
├── evaluation_report.md # Detailed Markdown report (if enabled)
├── win_rate_chart.png # Win-rate bar chart (if enabled)
├── win_rate_matrix.png # Matrix heatmap (if matrix_enabled)
├── queries.json # Generated test queries
├── responses.json # All model responses
├── rubrics.json # Generated evaluation rubrics
├── comparison_details.json # Pairwise comparison details
└── checkpoint.json # Pipeline checkpoint
```

## API key by model

| Model prefix | Environment variable |
|-------------|---------------------|
| `gpt-*`, `o1-*`, `o3-*` | `OPENAI_API_KEY` |
| `claude-*` | `ANTHROPIC_API_KEY` |
| `qwen-*`, `dashscope/*` | `DASHSCOPE_API_KEY` |
| `deepseek-*` | `DEEPSEEK_API_KEY` |
| Custom endpoint | set `api_key` + `base_url` in config |

## Additional resources

- Full config examples: [cookbooks/auto_arena/examples/](https://github.com/agentscope-ai/OpenJudge/tree/main/cookbooks/auto_arena/examples)
- Documentation: [Auto Arena Guide](https://agentscope-ai.github.io/OpenJudge/applications/auto_arena/)

skills/bib-verify/SKILL.md

---
name: bib-verify
description: >
Verify a BibTeX file for hallucinated or fabricated references by cross-checking
every entry against CrossRef, arXiv, and DBLP. Reports each reference as
verified, suspect, or not found, with field-level mismatch details (title,
authors, year, DOI). Use when the user wants to check a .bib file for fake
citations, validate references in a paper, or audit bibliography entries for
accuracy.
---

# BibTeX Verification Skill

Check every entry in a `.bib` file against real academic databases using the
OpenJudge `PaperReviewPipeline` in BibTeX-only mode:

1. **Parse** — extract all entries from the `.bib` file
2. **Lookup** — query CrossRef, arXiv, and DBLP for each reference
3. **Match** — compare title, authors, year, and DOI
4. **Report** — flag each entry as `verified`, `suspect`, or `not_found`

## Prerequisites

```bash
pip install py-openjudge litellm
```

## Gather from user before running

| Info | Required? | Notes |
|------|-----------|-------|
| BibTeX file path | Yes | `.bib` file to verify |
| CrossRef email | No | Improves CrossRef API rate limits |

## Quick start

```bash
# Verify a standalone .bib file
python -m cookbooks.paper_review --bib_only references.bib

# With CrossRef email for better rate limits
python -m cookbooks.paper_review --bib_only references.bib --email your@email.com

# Save report to a custom path
python -m cookbooks.paper_review --bib_only references.bib \
    --email your@email.com --output bib_report.md
```

## Relevant options

| Flag | Default | Description |
|------|---------|-------------|
| `--bib_only` | — | Path to `.bib` file (required for standalone verification) |
| `--email` | — | CrossRef mailto — improves rate limits, recommended |
| `--output` | auto | Output `.md` report path |
| `--language` | `en` | Report language: `en` or `zh` |

## Interpreting results

Each reference entry is assigned one of three statuses:

| Status | Meaning |
|--------|---------|
| `verified` | Found in CrossRef / arXiv / DBLP with matching fields |
| `suspect` | A candidate record exists, but key fields (title, authors) mismatch; likely mis-cited or hallucinated |
| `not_found` | No match in any database; treat as likely fabricated and verify manually |

**Field-level details** are shown for `suspect` entries:
- `title_match` — whether the title matches a real paper
- `author_match` — whether the author list matches
- `year_match` — whether the publication year is correct
- `doi_match` — whether the DOI resolves to the right paper
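
Title matching has to tolerate case and punctuation differences. A minimal sketch of the kind of normalized fuzzy comparison involved (the 0.9 threshold and helper names are illustrative, not OpenJudge internals):

```python
import re
from difflib import SequenceMatcher

def normalize(title: str) -> str:
    """Lowercase and strip punctuation so formatting differences don't count."""
    return re.sub(r"[^a-z0-9 ]", "", title.lower()).strip()

def titles_match(a: str, b: str, threshold: float = 0.9) -> bool:
    """Fuzzy-compare two titles after normalization."""
    ratio = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
    return ratio >= threshold

print(titles_match("Attention is All you Need", "Attention Is All You Need!"))  # True
print(titles_match("Attention Is All You Need", "A Completely Different Paper"))  # False
```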

## Additional resources

- Full pipeline options: [../paper-review/reference.md](../paper-review/reference.md)
- Combined PDF review + BibTeX verification: [../paper-review/SKILL.md](../paper-review/SKILL.md)