The Insight
RealClassEval isn't just a benchmark. It's a template for generating domain-specific training data.
A sentinel pipeline can:
- Read the RealClassEval exam format (structure, difficulty, rubric)
- Generate equivalent exams for ANY domain (C++, Rust, distributed systems, linear algebra, etc.)
- Use a teacher model to generate high-quality solutions
- Package as training data for the forge pipeline
- Use held-out exams as the evaluation benchmark
- Publish the forged model WITH exam scores as proof
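The held-out step above only works if evaluation exams never leak into the training set. One way to get a stable split is to hash each exam's identifier. This is a minimal sketch, not the pipeline's actual method: `is_held_out`, `split_exams`, the `held_out_fraction` parameter, and the idea that every exam has a string ID are all assumptions made for illustration.

```python
import hashlib


def is_held_out(exam_id: str, held_out_fraction: float = 0.1) -> bool:
    """Deterministically assign an exam to the held-out set by hashing its ID."""
    digest = hashlib.sha256(exam_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a float in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < held_out_fraction


def split_exams(exam_ids, held_out_fraction=0.1):
    """Return (train_ids, held_out_ids); the same ID always lands in the same set."""
    train, held_out = [], []
    for exam_id in exam_ids:
        target = held_out if is_held_out(exam_id, held_out_fraction) else train
        target.append(exam_id)
    return train, held_out
```

Because the assignment depends only on the ID, regenerating or extending the curriculum does not move an existing exam from one set to the other.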
The Loop
Academy curriculum design (RealClassEval format)
→ Sentinel generates domain variants (C++ exams, Rust exams, ...)
→ Teacher model solves them (high-quality training pairs)
→ Forge trains on solutions, prunes heads, retrains
→ Forged model takes held-out exams
→ Published with exam transcript as proof of capability
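The data flow of the loop can be sketched as a single function that wires the stages together. Everything here is hypothetical: `run_loop`, `LoopResult`, and the three callables (`teacher_solve`, `forge_train`, `grade`) stand in for the teacher model, the forge pipeline, and the exam rubric, whose real interfaces this note does not specify.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class LoopResult:
    training_pairs: List[Tuple[str, str]]  # (problem, solution) pairs from the teacher
    transcript: Dict[str, float]           # held-out exam -> score


def run_loop(
    train_exams: List[str],
    held_out_exams: List[str],
    teacher_solve: Callable[[str], str],
    forge_train: Callable[[List[Tuple[str, str]]], Callable[[str], str]],
    grade: Callable[[str, str], float],
) -> LoopResult:
    # Teacher model solves the training exams to produce training pairs.
    pairs = [(exam, teacher_solve(exam)) for exam in train_exams]
    # Forge trains on the pairs and returns a model (a callable: exam -> answer).
    model = forge_train(pairs)
    # The forged model takes the held-out exams; the scores form the transcript.
    transcript = {exam: grade(exam, model(exam)) for exam in held_out_exams}
    return LoopResult(training_pairs=pairs, transcript=transcript)
```

Passing the stages in as callables keeps the sketch independent of any particular teacher model or training backend.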
Why This Is Better Than Static Datasets
- StarCoderData is code someone else wrote, so it may include bad patterns
- Synthesized curriculum is designed to teach specific capabilities
- Exam-based evaluation proves the model can DO things, not just predict tokens
- Scalable — generate 100K exams across 50 domains automatically
- Versioned: each curriculum revision can be judged by the held-out exam scores of the models it produces, so a v2 that improves on v1 shows up measurably
Implementation
Phase 1: Template extraction
- Parse RealClassEval exam format into a schema
- Identify: problem statement structure, difficulty levels, evaluation criteria, solution format
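As an illustration of what the extracted schema might look like, the sketch below models the four identified elements as dataclasses. The field names, the `RubricItem` type, and the `DIFFICULTY_LEVELS` values are assumptions; the real RealClassEval format may be organized differently.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical difficulty scale; the real levels come from the parsed format.
DIFFICULTY_LEVELS = ("easy", "medium", "hard")


@dataclass
class RubricItem:
    criterion: str  # what is being evaluated
    points: int     # maximum points for this criterion


@dataclass
class ExamTemplate:
    domain: str
    difficulty: str
    problem_statement: str
    solution_format: str
    rubric: List[RubricItem] = field(default_factory=list)

    def max_score(self) -> int:
        """Total points available across all rubric items."""
        return sum(item.points for item in self.rubric)

    def validate(self) -> None:
        """Raise ValueError if the template is missing required structure."""
        if self.difficulty not in DIFFICULTY_LEVELS:
            raise ValueError(f"unknown difficulty: {self.difficulty!r}")
        if not self.rubric:
            raise ValueError("an exam needs at least one rubric item")
```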
Phase 2: Domain generation
- Sentinel pipeline: given domain + difficulty, generate exam using template
- Teacher model (Qwen3.5-27B or Claude) generates reference solutions
- Output: (problem, solution) pairs in Hugging Face (HF) dataset format
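A simple way to package the pairs is one JSON object per line (JSON Lines). The sketch below assumes the teacher model is available as a plain callable; `write_pairs`, the `teacher_solve` parameter, and the `problem` / `solution` column names are illustrative choices, not a fixed interface.

```python
import json
from pathlib import Path


def write_pairs(problems, teacher_solve, out_path):
    """Write one {"problem", "solution"} JSON object per line; return the row count."""
    out_path = Path(out_path)
    count = 0
    with out_path.open("w", encoding="utf-8") as f:
        for problem in problems:
            # teacher_solve stands in for a call to the teacher model.
            record = {"problem": problem, "solution": teacher_solve(problem)}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

A file in this shape can be read by the Hugging Face `datasets` library with `load_dataset("json", data_files=...)`.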
Phase 3: Integration with forge
- `DOMAIN_DATASETS["code"]` can point to a synthesized dataset
- `--domain code --curriculum synthesized` uses sentinel-generated data
- `--domain code --curriculum starcoderdata` uses the static fallback
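The flag handling could look roughly like this. It is a sketch under assumptions: the forge pipeline's real argument parser is not shown in this note, `SYNTHESIZED_DATASETS` and `resolve_dataset` are hypothetical names, and the dataset values are placeholders rather than real identifiers.

```python
import argparse

# Static fallback per domain (placeholder value, not a verified dataset ID).
DOMAIN_DATASETS = {"code": "starcoderdata"}

# Hypothetical registry of sentinel-generated datasets per domain.
SYNTHESIZED_DATASETS = {"code": "data/synthesized/code.jsonl"}


def resolve_dataset(domain: str, curriculum: str) -> str:
    """Pick the synthesized dataset when requested, else the static one."""
    if curriculum == "synthesized":
        return SYNTHESIZED_DATASETS[domain]
    return DOMAIN_DATASETS[domain]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="forge training (sketch)")
    parser.add_argument("--domain", choices=sorted(DOMAIN_DATASETS), required=True)
    parser.add_argument(
        "--curriculum",
        choices=["synthesized", "starcoderdata"],
        default="starcoderdata",  # static fallback when no curriculum is given
    )
    return parser
```

For example, `build_parser().parse_args(["--domain", "code", "--curriculum", "synthesized"])` followed by `resolve_dataset(args.domain, args.curriculum)` selects the sentinel-generated data.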
Phase 4: Exam transcript in model card
## Exam Results (RealClassEval-Code v2)
| Exam | Score | Grade |
|------|-------|-------|
| Concurrent Systems | 87/100 | A |
| Memory Management | 92/100 | A |
| API Design | 78/100 | B+ |
| Error Handling | 95/100 | A+ |
This is what makes a model card compelling: exam scores, not perplexity.
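Generating the transcript section from graded results keeps the model card in sync with the evaluation run. The helper below is a minimal sketch; `render_transcript` is a hypothetical name, and the grades are taken as input because this note does not define a grading scale.

```python
def render_transcript(title, results):
    """Render (exam, score, grade) rows as a markdown table for a model card."""
    lines = [
        f"## Exam Results ({title})",
        "| Exam | Score | Grade |",
        "|------|-------|-------|",
    ]
    for exam, score, grade in results:
        lines.append(f"| {exam} | {score}/100 | {grade} |")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_transcript(
        "RealClassEval-Code v2",
        [("Concurrent Systems", 87, "A"), ("API Design", 78, "B+")],
    ))
```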
Dependencies