Benchmark-driven forging — use third-party benchmarks as both curriculum and proof #96
Description
The Insight
A benchmark isn't just a test — it's a curriculum template. The categories define WHAT to train on. The test cases define HOW to evaluate. The scores prove it worked.
ToolCall-15 as First Example
Repo: https://github.com/stevibe/ToolCall-15 (cloned to FlashGordon/cambrian/ToolCall-15)
Format: 15 scenarios, 5 categories × 3 each, OpenAI-compatible API, deterministic scoring
Categories → Training Domains
| Category | Training Signal | Forge Adapter |
|---|---|---|
| Tool Selection | 1000s of "pick the right tool from N" scenarios | tool-selection LoRA |
| Parameter Precision | 1000s of exact parameter extraction tasks | param-precision LoRA |
| Multi-Step Chains | 1000s of multi-tool workflows | chain-reasoning LoRA |
| Restraint and Refusal | 1000s of "don't call a tool" scenarios | restraint LoRA |
| Error Recovery | 1000s of failed-tool recovery flows | error-handling LoRA |
The Loop
1. Benchmark defines categories and pass/fail criteria
2. Sentinel generates training data per category (using benchmark format as template)
3. Forge trains on generated data, prunes heads not needed for this capability
4. Run the SAME benchmark as evaluation
5. Publish model with benchmark scores in the card
6. Third-party benchmark = credible proof, not self-grading
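The loop above can be sketched as a small runnable skeleton. Everything here is a hypothetical placeholder for the Sentinel/Forge components described in this issue, not an existing API:

```python
from dataclasses import dataclass

@dataclass
class Category:
    name: str
    template: str  # the benchmark's own scenario format, reused as a generation template

def sentinel_generate(category: Category, n: int) -> list[dict]:
    """Stub: Sentinel would synthesize n scenarios in the benchmark's format."""
    return [{"category": category.name, "prompt": category.template, "id": i}
            for i in range(n)]

def forge_and_score(data: list[dict]) -> float:
    """Stub: Forge would train a LoRA on the data, prune heads, and the
    real benchmark would score the result. Here we just report data size."""
    return float(len(data))

def run_loop(categories: list[Category], per_category: int = 3) -> dict[str, float]:
    scores = {}
    for cat in categories:                             # 1. benchmark defines categories
        data = sentinel_generate(cat, per_category)    # 2. generate curriculum per category
        scores[cat.name] = forge_and_score(data)       # 3-4. forge, then re-run the SAME benchmark
    return scores                                      # 5-6. publish scores as third-party proof

cats = [Category("tool-selection", "pick the right tool from N"),
        Category("restraint", "don't call a tool")]
print(run_loop(cats))
```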
Generalize Beyond ToolCall-15
Any third-party benchmark becomes a forge curriculum:
| Benchmark | Domain | Source |
|---|---|---|
| ToolCall-15 | Tool calling | stevibe/ToolCall-15 |
| HumanEval | Code generation | openai/human-eval |
| GSM8K | Math reasoning | gsm8k |
| MMLU | General knowledge | hendrycks/test |
| MT-Bench | Chat quality | lmsys/mt-bench |
| IFEval | Instruction following | google/IFEval |
| RealClassEval | Python classes | Our academy (98 real classes) |
Each one: extract categories → generate training data → forge → evaluate → publish scores.
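The table above could live in code as a recipe registry that the pipeline iterates over. The schema here is an assumption, not a settled format:

```python
# Hypothetical recipe registry: each third-party benchmark becomes a forge curriculum.
# The dict schema is illustrative; only the benchmark names and sources come from
# the table above.
FORGE_RECIPES = {
    "ToolCall-15":   {"domain": "Tool calling",          "source": "stevibe/ToolCall-15"},
    "HumanEval":     {"domain": "Code generation",       "source": "openai/human-eval"},
    "GSM8K":         {"domain": "Math reasoning",        "source": "gsm8k"},
    "MMLU":          {"domain": "General knowledge",     "source": "hendrycks/test"},
    "MT-Bench":      {"domain": "Chat quality",          "source": "lmsys/mt-bench"},
    "IFEval":        {"domain": "Instruction following", "source": "google/IFEval"},
    "RealClassEval": {"domain": "Python classes",        "source": "internal academy"},
}

# The same five stages apply to every entry.
STAGES = ["extract categories", "generate training data", "forge", "evaluate", "publish scores"]
```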
The Competitive Advantage
Unsloth publishes uniformly quantized GGUF models and evaluates them with these benchmarks.
We publish forged models that are TRAINED on curricula derived from these same benchmarks.
Their Q6 (22GB) gets 15/15 on ToolCall-15.
Our forged Q4 (15GB) should ALSO get 15/15 — because the heads that survive pruning are the ones that matter for tool calling.
Same score, roughly 30% smaller (15 GB vs 22 GB). That's the headline for every benchmark.
Implementation
Phase 1: Run ToolCall-15 against baseline Qwen3.5-27B
- Serve via llama.cpp with OpenAI-compatible API
- Run ToolCall-15 dashboard
- Record baseline scores per quant level (Q2-Q8)
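For Phase 1, each scenario is a standard OpenAI-style chat-completions request with tool definitions, POSTed to the local llama.cpp server. A sketch of building one such request (the model name, endpoint, and `get_weather` tool are illustrative assumptions; the payload shape follows the OpenAI chat-completions API that llama.cpp mirrors):

```python
import json

def build_request(model: str, user_msg: str, tools: list[dict]) -> dict:
    """Build an OpenAI-compatible chat-completions payload with tools attached."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_msg}],
        "tools": tools,
        "tool_choice": "auto",  # let the model decide whether to call a tool
    }

# Illustrative tool definition in OpenAI function-calling format.
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

payload = build_request("qwen3.5-27b-q6", "What's the weather in Oslo?", [weather_tool])
# POST json.dumps(payload) to the llama.cpp server's /v1/chat/completions
# endpoint, then score the tool call deterministically per ToolCall-15.
```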
Phase 2: Forge with ToolCall-15 curriculum
- Sentinel generates 5000+ training scenarios per category
- Forge Qwen3.5-27B on this data
- Prune heads, variable quant
- Convert to GGUF
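One way Sentinel could scale a category to thousands of scenarios is by crossing prompt templates with tool inventories. A sketch for the "restraint" category (the prompt and tool lists are illustrative, not from the benchmark):

```python
import itertools

# Illustrative seed lists; Sentinel would draw these from much larger pools
# to reach 5000+ scenarios per category.
NO_TOOL_PROMPTS = ["Tell me a joke.", "What is 2 + 2?", "Rephrase this sentence politely."]
TOOL_INVENTORIES = [["get_weather"], ["search_web", "get_weather"]]

def generate_restraint_scenarios() -> list[dict]:
    """Cross no-tool-needed prompts with tool inventories, so the model
    learns to answer directly even when tools are available."""
    scenarios = []
    for prompt, tools in itertools.product(NO_TOOL_PROMPTS, TOOL_INVENTORIES):
        scenarios.append({
            "category": "restraint",
            "prompt": prompt,
            "available_tools": tools,
            "expected": {"tool_call": None},  # pass = the model does NOT call a tool
        })
    return scenarios

data = generate_restraint_scenarios()
print(len(data))  # 3 prompts x 2 inventories = 6 scenarios
```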
Phase 3: Benchmark forged model
- Run ToolCall-15 against forged GGUF
- Compare: forged Q4 vs baseline Q6
- Publish results in model card
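The Phase 3 comparison reduces to a simple check: same benchmark score at a smaller size. A sketch using the target numbers stated in this issue (targets, not measured results):

```python
# Targets from this issue, not measured results.
baseline_q6 = {"quant": "Q6", "size_gb": 22.0, "toolcall15": 15}
forged_q4   = {"quant": "Q4", "size_gb": 15.0, "toolcall15": 15}

def size_reduction_pct(baseline: dict, forged: dict) -> float:
    """Percent smaller the forged model is than the baseline."""
    return round(100 * (1 - forged["size_gb"] / baseline["size_gb"]), 1)

assert forged_q4["toolcall15"] >= baseline_q6["toolcall15"]  # same score or better
print(f"{size_reduction_pct(baseline_q6, forged_q4)}% smaller")  # prints "31.8% smaller"
```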
Phase 4: Generalize
- Same pipeline for HumanEval, GSM8K, MMLU, etc.
- Each benchmark = a forge recipe
- Adapters published per benchmark domain
- Model card shows scores across ALL benchmarks
The Academy Connection
This IS the academy, externally validated:
- RealClassEval = our internal benchmark (98 Python classes)
- ToolCall-15 = external benchmark (15 tool scenarios)
- HumanEval = external benchmark (164 code problems)
- Same forge pipeline, same sentinel curriculum generation
- Internal exams for development, external benchmarks for credibility
Dependencies
- #91 — Sentinel-synthesized training data (academy generates forge curriculum)
- #89 — Python forge pipeline: ship Qwen3.5 forged models now (running now)
- #95 — Benchmark: Tool Calling Challenge — forged vs uniform quant on Qwen3.5-27B
- #92 — Adapter registry: semantic search, auto-forge, publish ("npm for intelligence"); publish per-benchmark adapters
- Academy curriculum system in continuum