Sentinel-synthesized training data — academy generates forge curriculum #91

@joelteply

Description

The Insight

RealClassEval isn't just a benchmark. It's a template for generating domain-specific training data.

A sentinel pipeline can:

  1. Read the RealClassEval exam format (structure, difficulty, rubric)
  2. Generate equivalent exams for ANY domain (C++, Rust, distributed systems, linear algebra, etc.)
  3. Use a teacher model to generate high-quality solutions
  4. Package as training data for the forge pipeline
  5. Use held-out exams as the evaluation benchmark
  6. Publish the forged model WITH exam scores as proof

The Loop

Academy curriculum design (RealClassEval format)
  → Sentinel generates domain variants (C++ exams, Rust exams, ...)
  → Teacher model solves them (high-quality training pairs)
  → Forge trains on solutions, prunes heads, retrains
  → Forged model takes held-out exams
  → Published with exam transcript as proof of capability
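The loop above can be sketched as a driver. Everything here is illustrative: the function stubs stand in for the sentinel, teacher, and forge stages, and only the train/held-out split logic is real. The key invariant worth encoding is that held-out exams never reach the forge.

```python
import random

def split_exams(exams, holdout_frac=0.2, seed=0):
    """Partition generated exams into a training set (fed to the forge)
    and a held-out set (the published evaluation benchmark).

    Hypothetical helper, not an existing API; the seed makes the split
    reproducible so curriculum versions can be compared fairly.
    """
    rng = random.Random(seed)
    shuffled = exams[:]
    rng.shuffle(shuffled)
    n_holdout = max(1, int(len(shuffled) * holdout_frac))
    return shuffled[n_holdout:], shuffled[:n_holdout]

# Tiny demo: 10 synthetic exam IDs, 80/20 split.
train, heldout = split_exams([f"exam-{i}" for i in range(10)])
```

Because the split is deterministic given the seed, the published exam transcript can cite exactly which exams were held out.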

Why This Is Better Than Static Datasets

  • StarCoderData is code someone else wrote — may include bad patterns
  • Synthesized curriculum is designed to teach specific capabilities
  • Exam-based evaluation proves the model can DO things, not just predict tokens
  • Scalable — generate 100K exams across 50 domains automatically
  • Versioned — curriculum v2 produces better models than v1, measurably

Implementation

Phase 1: Template extraction

  • Parse RealClassEval exam format into a schema
  • Identify: problem statement structure, difficulty levels, evaluation criteria, solution format
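One way to represent the extracted template is a small dataclass. The field names below are an assumption about what the parsed RealClassEval format contains (they mirror the four elements listed above), not the actual spec.

```python
from dataclasses import dataclass

@dataclass
class ExamTemplate:
    """Schema extracted from the RealClassEval exam format (illustrative)."""
    problem_statement: str   # structure of the task description
    difficulty: str          # e.g. "intro", "intermediate", "advanced"
    rubric: dict             # evaluation criterion -> max points
    solution_format: str     # e.g. "single class", "module"

    def max_score(self) -> int:
        # Total achievable points, derived from the rubric.
        return sum(self.rubric.values())

t = ExamTemplate(
    problem_statement="Implement a thread-safe LRU cache.",
    difficulty="advanced",
    rubric={"correctness": 60, "concurrency safety": 25, "style": 15},
    solution_format="single class",
)
```

A schema like this gives the sentinel pipeline a concrete contract to fill in when generating domain variants.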

Phase 2: Domain generation

  • Sentinel pipeline: given domain + difficulty, generate exam using template
  • Teacher model (Qwen3.5-27B or Claude) generates reference solutions
  • Output: (problem, solution) pairs as HF dataset format
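A minimal sketch of the output step, assuming JSON Lines as the on-disk format (which `datasets.load_dataset("json", data_files=...)` can read directly). The teacher-model call is stubbed out here; in the real pipeline each solution would come from the teacher model.

```python
import json
import os
import tempfile

def write_pairs_jsonl(pairs, path):
    """Serialize (problem, solution) pairs as JSON Lines.

    Hypothetical helper: one record per line, the layout HF `datasets`
    loads without a custom loading script.
    """
    with open(path, "w", encoding="utf-8") as f:
        for problem, solution in pairs:
            f.write(json.dumps({"problem": problem, "solution": solution}) + "\n")

# Stand-in for a teacher-generated pair.
pairs = [("Reverse a singly linked list in Rust.", "fn reverse(...) { ... }")]
path = os.path.join(tempfile.mkdtemp(), "exams.jsonl")
write_pairs_jsonl(pairs, path)
```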

Phase 3: Integration with forge

  • DOMAIN_DATASETS["code"] can point to a synthesized dataset
  • --domain code --curriculum synthesized uses sentinel-generated data
  • --domain code --curriculum starcoderdata uses the static fallback
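One way the registry could resolve the `--curriculum` flag. Only `DOMAIN_DATASETS["code"]` appears in this issue; the dataset IDs and the `resolve_dataset` helper are hypothetical (though `bigcode/starcoderdata` is the real StarCoderData HF dataset).

```python
# Sketch of the forge's DOMAIN_DATASETS registry with per-curriculum entries.
DOMAIN_DATASETS = {
    "code": {
        "synthesized": "sentinel/realclasseval-code-v2",  # hypothetical ID
        "starcoderdata": "bigcode/starcoderdata",         # static fallback
    },
}

def resolve_dataset(domain: str, curriculum: str) -> str:
    """Map (--domain, --curriculum) CLI flags to a dataset ID."""
    try:
        return DOMAIN_DATASETS[domain][curriculum]
    except KeyError:
        raise ValueError(f"no curriculum {curriculum!r} for domain {domain!r}")
```

Keeping both entries under one domain key means the forge's training code stays identical; only the data source swaps.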

Phase 4: Exam transcript in model card

## Exam Results (RealClassEval-Code v2)

| Exam | Score | Grade |
|------|-------|-------|
| Concurrent Systems | 87/100 | A |
| Memory Management | 92/100 | A |
| API Design | 78/100 | B+ |
| Error Handling | 95/100 | A+ |

This is what makes a model card compelling. Not perplexity — exam scores.
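The exam table could be emitted into the model card automatically from raw scores. A minimal sketch, with illustrative letter-grade cutoffs chosen to reproduce the example table above:

```python
def letter_grade(score: int) -> str:
    """Map a 0-100 score to a letter grade (assumed cutoffs, not a spec)."""
    for cutoff, grade in [(93, "A+"), (85, "A"), (75, "B+"), (65, "B")]:
        if score >= cutoff:
            return grade
    return "C"

def exam_table(results: dict) -> str:
    """Render {exam name: score} as the markdown table for the model card."""
    rows = ["| Exam | Score | Grade |", "|------|-------|-------|"]
    for exam, score in results.items():
        rows.append(f"| {exam} | {score}/100 | {letter_grade(score)} |")
    return "\n".join(rows)

RESULTS = {
    "Concurrent Systems": 87,
    "Memory Management": 92,
    "API Design": 78,
    "Error Handling": 95,
}
card_section = exam_table(RESULTS)
```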
