Benchmark-driven forging — use third-party benchmarks as both curriculum and proof #96

@joelteply

Description

The Insight

A benchmark isn't just a test — it's a curriculum template. The categories define WHAT to train on. The test cases define HOW to evaluate. The scores prove it worked.

ToolCall-15 as First Example

Repo: https://github.com/stevibe/ToolCall-15 (cloned to FlashGordon/cambrian/ToolCall-15)
Format: 15 scenarios, 5 categories × 3 each, OpenAI-compatible API, deterministic scoring

Categories → Training Domains

| Category | Training Signal | Forge Adapter |
| --- | --- | --- |
| Tool Selection | Thousands of "pick the right tool from N" scenarios | tool-selection LoRA |
| Parameter Precision | Thousands of exact parameter-extraction tasks | param-precision LoRA |
| Multi-Step Chains | Thousands of multi-tool workflows | chain-reasoning LoRA |
| Restraint and Refusal | Thousands of "don't call a tool" scenarios | restraint LoRA |
| Error Recovery | Thousands of failed-tool recovery flows | error-handling LoRA |
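The category-to-adapter mapping above can be kept as data so the pipeline stays table-driven. A minimal sketch; the adapter identifiers mirror the table, but `adapter_for` is a hypothetical helper, not an existing API:

```python
# Hypothetical mapping from ToolCall-15 benchmark categories to forge
# adapter names. Category strings mirror the table above; the lookup
# helper is illustrative, not part of any real codebase.
CATEGORY_ADAPTERS = {
    "Tool Selection": "tool-selection",
    "Parameter Precision": "param-precision",
    "Multi-Step Chains": "chain-reasoning",
    "Restraint and Refusal": "restraint",
    "Error Recovery": "error-handling",
}

def adapter_for(category: str) -> str:
    """Look up the LoRA adapter trained for a benchmark category."""
    return CATEGORY_ADAPTERS[category]
```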

The Loop

1. Benchmark defines categories and pass/fail criteria
2. Sentinel generates training data per category (using benchmark format as template)
3. Forge trains on generated data, prunes heads not needed for this capability
4. Run the SAME benchmark as evaluation
5. Publish model with benchmark scores in the card
6. Third-party benchmark = credible proof, not self-grading
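The loop above can be sketched end to end. In this sketch, `generate_scenarios`, `forge_train`, and `run_benchmark` are hypothetical stand-ins for Sentinel, Forge, and the ToolCall-15 harness; none of them are real APIs:

```python
# Hypothetical end-to-end sketch of the benchmark-driven forging loop.
# The three callables stand in for Sentinel, Forge, and the benchmark
# harness, so the orchestration stays decoupled from any one tool.

def forge_against_benchmark(categories, generate_scenarios, forge_train, run_benchmark):
    """Generate per-category data, train, then score with the same benchmark."""
    # Step 2: Sentinel-style data generation, one corpus per category.
    corpus = {cat: generate_scenarios(cat) for cat in categories}
    # Step 3: train (and prune) on the generated corpus.
    model = forge_train(corpus)
    # Step 4: the SAME third-party benchmark doubles as the exam.
    scores = run_benchmark(model)
    # Steps 5-6: the scores go into the model card as external proof.
    return model, scores
```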

Generalize Beyond ToolCall-15

Any third-party benchmark becomes a forge curriculum:

| Benchmark | Domain | Source |
| --- | --- | --- |
| ToolCall-15 | Tool calling | stevibe/ToolCall-15 |
| HumanEval | Code generation | openai/human-eval |
| GSM8K | Math reasoning | gsm8k |
| MMLU | General knowledge | hendrycks/test |
| MT-Bench | Chat quality | lmsys/mt-bench |
| IFEval | Instruction following | google/IFEval |
| RealClassEval | Python classes | our academy (98 real classes) |

Each one: extract categories → generate training data → forge → evaluate → publish scores.
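Treating each benchmark as a forge recipe suggests a small registry. A sketch; the `BenchmarkRecipe` record is an assumption, though the field values come straight from the table above:

```python
from dataclasses import dataclass

# Hypothetical "forge recipe" record: one per third-party benchmark.
# The dataclass shape is illustrative; the values mirror the table above.
@dataclass(frozen=True)
class BenchmarkRecipe:
    name: str    # benchmark name, e.g. "ToolCall-15"
    domain: str  # capability the benchmark certifies
    source: str  # upstream repo or dataset id

RECIPES = [
    BenchmarkRecipe("ToolCall-15", "Tool calling", "stevibe/ToolCall-15"),
    BenchmarkRecipe("HumanEval", "Code generation", "openai/human-eval"),
    BenchmarkRecipe("GSM8K", "Math reasoning", "gsm8k"),
    BenchmarkRecipe("MMLU", "General knowledge", "hendrycks/test"),
    BenchmarkRecipe("MT-Bench", "Chat quality", "lmsys/mt-bench"),
    BenchmarkRecipe("IFEval", "Instruction following", "google/IFEval"),
]
```

Each recipe then drives the same extract → generate → forge → evaluate → publish steps.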

The Competitive Advantage

Unsloth publishes uniform GGUF quants and merely tests them with these benchmarks.
We publish forged models that are TRAINED against these benchmarks.

Their Q6 (22GB) gets 15/15 on ToolCall-15.
Our forged Q4 (15GB) should ALSO get 15/15 — because the heads that survive pruning are the ones that matter for tool calling.

Same score, roughly 30% smaller. That's the headline for every benchmark.

Implementation

Phase 1: Run ToolCall-15 against baseline Qwen3.5-27B

  • Serve via llama.cpp with OpenAI-compatible API
  • Run ToolCall-15 dashboard
  • Record baseline scores per quant level (Q2-Q8)
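Phase 1's output is a per-quant score table. A minimal sketch of the bookkeeping; the Q6 = 22 GB and Q4 = 15 GB sizes echo the comparison above, while the Q2 and Q8 sizes and the helper itself are illustrative assumptions:

```python
# Hypothetical Phase-1 bookkeeping: one ToolCall-15 score (out of 15) per
# quant level of the baseline model. Q6/Q4 sizes come from the text above;
# Q2/Q8 sizes are illustrative guesses, not measurements.
QUANT_SIZES_GB = {"Q2": 10, "Q4": 15, "Q6": 22, "Q8": 29}

def smallest_passing_quant(scores: dict, target: int = 15):
    """Return the smallest quant (by file size) whose score meets the target."""
    passing = [q for q, s in scores.items() if s >= target]
    return min(passing, key=QUANT_SIZES_GB.get) if passing else None
```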

Phase 2: Forge with ToolCall-15 curriculum

  • Sentinel generates 5000+ training scenarios per category
  • Forge Qwen3.5-27B on this data
  • Prune heads, variable quant
  • Convert to GGUF
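The Sentinel generation step in Phase 2 can be sketched as a seeded scenario factory. The record layout below is an assumption modeled on OpenAI-compatible chat messages, not the real Sentinel output format:

```python
import random

# Hypothetical Sentinel-style generator: emits n training scenarios for one
# benchmark category in an OpenAI-compatible chat shape. Field names and the
# "tools_available" difficulty knob are assumptions for illustration.

def generate_category_corpus(category: str, n: int, seed: int = 0) -> list:
    rng = random.Random(seed)  # seeded so regeneration is reproducible
    corpus = []
    for i in range(n):
        corpus.append({
            "category": category,
            "id": f"{category}-{i:05d}",
            "messages": [{"role": "user",
                          "content": f"scenario {i} for {category}"}],
            # difficulty knob: how many tools the model must choose among
            "tools_available": rng.randint(2, 8),
        })
    return corpus
```

Running this per category at `n=5000` yields the 5000+ scenarios per category mentioned above.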

Phase 3: Benchmark forged model

  • Run ToolCall-15 against forged GGUF
  • Compare: forged Q4 vs baseline Q6
  • Publish results in model card
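The forged-vs-baseline comparison reduces to simple arithmetic. A sketch of the headline computation for the model card; the function name and wording are illustrative:

```python
# Hypothetical model-card headline for Phase 3: same benchmark score,
# smaller artifact. For baseline Q6 (22 GB) vs forged Q4 (15 GB) from the
# text above, the reduction is 1 - 15/22, about 32%.

def size_reduction_headline(base_gb: float, forged_gb: float,
                            base_score: int, forged_score: int) -> str:
    pct = round(100 * (1 - forged_gb / base_gb))
    if forged_score >= base_score:
        return f"Same score, {pct}% smaller"
    return f"{pct}% smaller, but score dropped {base_score - forged_score} point(s)"
```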

Phase 4: Generalize

  • Same pipeline for HumanEval, GSM8K, MMLU, etc.
  • Each benchmark = a forge recipe
  • Adapters published per benchmark domain
  • Model card shows scores across ALL benchmarks

The Academy Connection

This IS the academy, externally validated:

  • RealClassEval = our internal benchmark (98 Python classes)
  • ToolCall-15 = external benchmark (15 tool scenarios)
  • HumanEval = external benchmark (164 code problems)
  • Same forge pipeline, same sentinel curriculum generation
  • Internal exams for development, external benchmarks for credibility
