Benchmark-driven forging — use third-party benchmarks as both curriculum and proof #96

@joelteply

Description

The Insight

A benchmark isn't just a test — it's a curriculum template. The categories define WHAT to train on. The test cases define HOW to evaluate. The scores prove it worked.

ToolCall-15 as First Example

Repo: https://github.com/stevibe/ToolCall-15 (cloned to FlashGordon/cambrian/ToolCall-15)
Format: 15 scenarios, 5 categories × 3 each, OpenAI-compatible API, deterministic scoring

Categories → Training Domains

| Category | Training Signal | Forge Adapter |
| --- | --- | --- |
| Tool Selection | Thousands of "pick the right tool from N" scenarios | tool-selection LoRA |
| Parameter Precision | Thousands of exact parameter-extraction tasks | param-precision LoRA |
| Multi-Step Chains | Thousands of multi-tool workflows | chain-reasoning LoRA |
| Restraint and Refusal | Thousands of "don't call a tool" scenarios | restraint LoRA |
| Error Recovery | Thousands of failed-tool recovery flows | error-handling LoRA |
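The category-to-adapter mapping above can be kept as data so the pipeline stays table-driven. A minimal sketch; the adapter identifiers mirror the table, but `adapter_for` is a hypothetical helper, not an existing API:

```python
# Hypothetical mapping from ToolCall-15 benchmark categories to forge
# adapter names. Category strings mirror the table above; the lookup
# helper is illustrative, not part of any real codebase.
CATEGORY_ADAPTERS = {
    "Tool Selection": "tool-selection",
    "Parameter Precision": "param-precision",
    "Multi-Step Chains": "chain-reasoning",
    "Restraint and Refusal": "restraint",
    "Error Recovery": "error-handling",
}

def adapter_for(category: str) -> str:
    """Look up the LoRA adapter trained for a benchmark category."""
    return CATEGORY_ADAPTERS[category]
```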

The Loop

1. Benchmark defines categories and pass/fail criteria
2. Sentinel generates training data per category (using benchmark format as template)
3. Forge trains on generated data, prunes heads not needed for this capability
4. Run the SAME benchmark as evaluation
5. Publish model with benchmark scores in the card
6. Third-party benchmark = credible proof, not self-grading
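The loop above can be sketched end to end. In this sketch, `generate_scenarios`, `forge_train`, and `run_benchmark` are hypothetical stand-ins for Sentinel, Forge, and the ToolCall-15 harness; none of them are real APIs:

```python
# Hypothetical end-to-end sketch of the benchmark-driven forging loop.
# The three callables stand in for Sentinel, Forge, and the benchmark
# harness, so the orchestration stays decoupled from any one tool.

def forge_against_benchmark(categories, generate_scenarios, forge_train, run_benchmark):
    """Generate per-category data, train, then score with the same benchmark."""
    # Step 2: Sentinel-style data generation, one corpus per category.
    corpus = {cat: generate_scenarios(cat) for cat in categories}
    # Step 3: train (and prune) on the generated corpus.
    model = forge_train(corpus)
    # Step 4: the SAME third-party benchmark doubles as the exam.
    scores = run_benchmark(model)
    # Steps 5-6: the scores go into the model card as external proof.
    return model, scores
```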

Generalize Beyond ToolCall-15

Any third-party benchmark becomes a forge curriculum:

| Benchmark | Domain | Source |
| --- | --- | --- |
| ToolCall-15 | Tool calling | stevibe/ToolCall-15 |
| HumanEval | Code generation | openai/human-eval |
| GSM8K | Math reasoning | gsm8k |
| MMLU | General knowledge | hendrycks/test |
| MT-Bench | Chat quality | lmsys/mt-bench |
| IFEval | Instruction following | google/IFEval |
| RealClassEval | Python classes | our academy (98 real classes) |

Each one: extract categories → generate training data → forge → evaluate → publish scores.
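Treating each benchmark as a forge recipe suggests a small registry. A sketch; the `BenchmarkRecipe` record is an assumption, though the field values come straight from the table above:

```python
from dataclasses import dataclass

# Hypothetical "forge recipe" record: one per third-party benchmark.
# The dataclass shape is illustrative; the values mirror the table above.
@dataclass(frozen=True)
class BenchmarkRecipe:
    name: str    # benchmark name, e.g. "ToolCall-15"
    domain: str  # capability the benchmark certifies
    source: str  # upstream repo or dataset id

RECIPES = [
    BenchmarkRecipe("ToolCall-15", "Tool calling", "stevibe/ToolCall-15"),
    BenchmarkRecipe("HumanEval", "Code generation", "openai/human-eval"),
    BenchmarkRecipe("GSM8K", "Math reasoning", "gsm8k"),
    BenchmarkRecipe("MMLU", "General knowledge", "hendrycks/test"),
    BenchmarkRecipe("MT-Bench", "Chat quality", "lmsys/mt-bench"),
    BenchmarkRecipe("IFEval", "Instruction following", "google/IFEval"),
]
```

Each recipe then drives the same extract → generate → forge → evaluate → publish steps.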

The Competitive Advantage

Unsloth publishes uniform GGUF quants and merely tests them with these benchmarks.
We publish forged models that are TRAINED against these benchmarks.

Their Q6 (22GB) gets 15/15 on ToolCall-15.
Our forged Q4 (15GB) should ALSO get 15/15 — because the heads that survive pruning are the ones that matter for tool calling.

Same score, roughly 30% smaller. That's the headline for every benchmark.

Implementation

Phase 1: Run ToolCall-15 against baseline Qwen3.5-27B

  • Serve via llama.cpp with OpenAI-compatible API
  • Run ToolCall-15 dashboard
  • Record baseline scores per quant level (Q2-Q8)
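Phase 1's output is a per-quant score table. A minimal sketch of the bookkeeping; the Q6 = 22 GB and Q4 = 15 GB sizes echo the comparison above, while the Q2 and Q8 sizes and the helper itself are illustrative assumptions:

```python
# Hypothetical Phase-1 bookkeeping: one ToolCall-15 score (out of 15) per
# quant level of the baseline model. Q6/Q4 sizes come from the text above;
# Q2/Q8 sizes are illustrative guesses, not measurements.
QUANT_SIZES_GB = {"Q2": 10, "Q4": 15, "Q6": 22, "Q8": 29}

def smallest_passing_quant(scores: dict, target: int = 15):
    """Return the smallest quant (by file size) whose score meets the target."""
    passing = [q for q, s in scores.items() if s >= target]
    return min(passing, key=QUANT_SIZES_GB.get) if passing else None
```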

Phase 2: Forge with ToolCall-15 curriculum

  • Sentinel generates 5000+ training scenarios per category
  • Forge Qwen3.5-27B on this data
  • Prune heads, variable quant
  • Convert to GGUF
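The Sentinel generation step in Phase 2 can be sketched as a seeded scenario factory. The record layout below is an assumption modeled on OpenAI-compatible chat messages, not the real Sentinel output format:

```python
import random

# Hypothetical Sentinel-style generator: emits n training scenarios for one
# benchmark category in an OpenAI-compatible chat shape. Field names and the
# "tools_available" difficulty knob are assumptions for illustration.

def generate_category_corpus(category: str, n: int, seed: int = 0) -> list:
    rng = random.Random(seed)  # seeded so regeneration is reproducible
    corpus = []
    for i in range(n):
        corpus.append({
            "category": category,
            "id": f"{category}-{i:05d}",
            "messages": [{"role": "user",
                          "content": f"scenario {i} for {category}"}],
            # difficulty knob: how many tools the model must choose among
            "tools_available": rng.randint(2, 8),
        })
    return corpus
```

Running this per category at `n=5000` yields the 5000+ scenarios per category mentioned above.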

Phase 3: Benchmark forged model

  • Run ToolCall-15 against forged GGUF
  • Compare: forged Q4 vs baseline Q6
  • Publish results in model card
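The forged-vs-baseline comparison reduces to simple arithmetic. A sketch of the headline computation for the model card; the function name and wording are illustrative:

```python
# Hypothetical model-card headline for Phase 3: same benchmark score,
# smaller artifact. For baseline Q6 (22 GB) vs forged Q4 (15 GB) from the
# text above, the reduction is 1 - 15/22, about 32%.

def size_reduction_headline(base_gb: float, forged_gb: float,
                            base_score: int, forged_score: int) -> str:
    pct = round(100 * (1 - forged_gb / base_gb))
    if forged_score >= base_score:
        return f"Same score, {pct}% smaller"
    return f"{pct}% smaller, but score dropped {base_score - forged_score} point(s)"
```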

Phase 4: Generalize

  • Same pipeline for HumanEval, GSM8K, MMLU, etc.
  • Each benchmark = a forge recipe
  • Adapters published per benchmark domain
  • Model card shows scores across ALL benchmarks

The Academy Connection

This IS the academy, externally validated:

  • RealClassEval = our internal benchmark (98 Python classes)
  • ToolCall-15 = external benchmark (15 tool scenarios)
  • HumanEval = external benchmark (164 code problems)
  • Same forge pipeline, same sentinel curriculum generation
  • Internal exams for development, external benchmarks for credibility
