Sentinel-synthesized training data — academy generates forge curriculum

## The Insight

RealClassEval isn't just a benchmark. It's a **template for generating domain-specific training data**.

A sentinel pipeline can:
1. Read the RealClassEval exam format (structure, difficulty, rubric)
2. Generate equivalent exams for ANY domain (C++, Rust, distributed systems, linear algebra, etc.)
3. Use a teacher model to generate high-quality solutions
4. Package as training data for the forge pipeline
5. Use held-out exams as the evaluation benchmark
6. Publish the forged model WITH exam scores as proof
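
Step 5 only works if held-out exams never leak into the training pairs. A minimal sketch of one way to enforce that, assuming each generated exam carries a stable string id (the id format and the `held_out_fraction` parameter are hypothetical, not part of RealClassEval): hash the id and assign the split from the hash, so the assignment stays the same across regeneration runs.

```python
import hashlib


def is_held_out(exam_id: str, held_out_fraction: float = 0.1) -> bool:
    """Deterministically assign an exam to the held-out split by hashing its id."""
    digest = hashlib.sha256(exam_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the digest to a float in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < held_out_fraction


def split_exams(exam_ids, held_out_fraction: float = 0.1):
    """Return (train_ids, held_out_ids); the two lists never overlap."""
    train, held_out = [], []
    for exam_id in exam_ids:
        target = held_out if is_held_out(exam_id, held_out_fraction) else train
        target.append(exam_id)
    return train, held_out


# Hypothetical ids of the form "<domain>-<index>".
train_ids, held_out_ids = split_exams(f"rust-{i:04d}" for i in range(1000))
```

Because the split depends only on the id, adding new exams later does not move existing ones between splits.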

## The Loop

```
Academy curriculum design (RealClassEval format)
  → Sentinel generates domain variants (C++ exams, Rust exams, ...)
  → Teacher model solves them (high-quality training pairs)
  → Forge trains on solutions, prunes heads, retrains
  → Forged model takes held-out exams
  → Published with exam transcript as proof of capability
```
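
The first half of the loop (generate, solve, hold out) can be sketched as plain function composition. This is an illustration only: `Exam`, `TrainingPair`, `build_curriculum`, and the stub stages are hypothetical names, and the sentinel and teacher are passed in as callables because their real interfaces are not specified here. Forge training and grading are left out.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class Exam:
    domain: str
    problem: str


@dataclass(frozen=True)
class TrainingPair:
    exam: Exam
    solution: str


def build_curriculum(
    domains: Iterable[str],
    generate_exams: Callable[[str], Iterable[Exam]],  # stands in for the sentinel
    solve: Callable[[Exam], str],                     # stands in for the teacher model
    is_held_out: Callable[[Exam], bool],
) -> Tuple[List[TrainingPair], List[Exam]]:
    """Return (training pairs, held-out exams) for the given domains."""
    training_pairs: List[TrainingPair] = []
    held_out: List[Exam] = []
    for domain in domains:
        for exam in generate_exams(domain):
            if is_held_out(exam):
                # Held-out exams are kept unsolved so they can serve as the benchmark.
                held_out.append(exam)
            else:
                training_pairs.append(TrainingPair(exam, solve(exam)))
    return training_pairs, held_out


# Stub stages, for illustration only.
def fake_sentinel(domain: str) -> List[Exam]:
    return [Exam(domain, f"{domain} problem {i}") for i in range(5)]


def fake_teacher(exam: Exam) -> str:
    return f"solution to: {exam.problem}"


pairs, held_out = build_curriculum(
    ["cpp", "rust"],
    fake_sentinel,
    fake_teacher,
    is_held_out=lambda exam: exam.problem.endswith("4"),
)
```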

## Why This Is Better Than Static Datasets

- **StarCoderData** is code someone else wrote — may include bad patterns
- **Synthesized curriculum** is designed to teach specific capabilities
- **Exam-based evaluation** proves the model can DO things, not just predict tokens
- **Scalable** — generate 100K exams across 50 domains automatically
- **Versioned** curricula can be compared directly: if curriculum v2 is better than v1, the held-out exam scores should show it

## Implementation

### Phase 1: Template extraction
- Parse RealClassEval exam format into a schema
- Identify: problem statement structure, difficulty levels, evaluation criteria, solution format
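
One possible shape for the extracted schema, as a sketch. The four fields mirror the items listed above; the concrete difficulty levels, the `Criterion` points model, and all class names are assumptions, since the actual RealClassEval format has not been parsed yet.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Difficulty(Enum):
    # Hypothetical levels; the real ones come from the parsed exam format.
    EASY = "easy"
    MEDIUM = "medium"
    HARD = "hard"


@dataclass
class Criterion:
    description: str
    max_points: int


@dataclass
class ExamTemplate:
    statement_sections: List[str]  # problem statement structure
    difficulty: Difficulty         # difficulty level
    criteria: List[Criterion]      # evaluation criteria
    solution_format: str           # expected shape of a solution

    def max_score(self) -> int:
        """Total points available across all criteria."""
        return sum(c.max_points for c in self.criteria)


template = ExamTemplate(
    statement_sections=["context", "class skeleton", "requirements"],
    difficulty=Difficulty.MEDIUM,
    criteria=[Criterion("compiles", 40), Criterion("passes unit tests", 60)],
    solution_format="single class definition",
)
```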

### Phase 2: Domain generation
- Sentinel pipeline: given domain + difficulty, generate exam using template
- Teacher model (Qwen3.5-27B or Claude) generates reference solutions
- Output: (problem, solution) pairs in Hugging Face dataset format
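
A minimal sketch of the output step, assuming JSON Lines as the on-disk format and `problem` / `solution` as the column names (both are assumptions). It uses only the standard library; the helper names are hypothetical.

```python
import json
from pathlib import Path


def write_pairs_jsonl(pairs, path):
    """Write (problem, solution) pairs as one JSON object per line."""
    path = Path(path)
    with path.open("w", encoding="utf-8") as f:
        for problem, solution in pairs:
            record = {"problem": problem, "solution": solution}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path


def read_pairs_jsonl(path):
    """Read the records back, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

A file written this way can be loaded with the Hugging Face `datasets` library via `load_dataset("json", data_files=...)`.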

### Phase 3: Integration with forge
- `DOMAIN_DATASETS["code"]` can point to a synthesized dataset
- `--domain code --curriculum synthesized` uses sentinel-generated data
- `--domain code --curriculum starcoderdata` uses the static fallback
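
A sketch of how the two flags might resolve to a dataset, assuming the forge CLI is built on `argparse`. `DOMAIN_DATASETS`, `--domain`, and `--curriculum` come from this issue; the `SYNTHESIZED_DATASETS` mapping, the `resolve_dataset` helper, and both dataset ids are hypothetical placeholders.

```python
import argparse

# Static fallback per domain (the dataset id here is an assumption).
DOMAIN_DATASETS = {"code": "bigcode/starcoderdata"}

# Hypothetical mapping from domain to a sentinel-generated dataset.
SYNTHESIZED_DATASETS = {"code": "sentinel/code-exams"}


def resolve_dataset(domain: str, curriculum: str) -> str:
    """Pick the dataset for a domain based on the requested curriculum."""
    if curriculum == "synthesized":
        return SYNTHESIZED_DATASETS[domain]
    return DOMAIN_DATASETS[domain]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="forge training entry point (sketch)")
    parser.add_argument("--domain", required=True, choices=sorted(DOMAIN_DATASETS))
    parser.add_argument(
        "--curriculum",
        required=True,
        choices=["synthesized", "starcoderdata"],
    )
    return parser


args = build_parser().parse_args(["--domain", "code", "--curriculum", "synthesized"])
dataset_id = resolve_dataset(args.domain, args.curriculum)
```

Keeping the static mapping untouched means existing runs keep working if no synthesized dataset exists for a domain yet.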

### Phase 4: Exam transcript in model card
```markdown
## Exam Results (RealClassEval-Code v2)

| Exam | Score | Grade |
|------|-------|-------|
| Concurrent Systems | 87/100 | A |
| Memory Management | 92/100 | A |
| API Design | 78/100 | B+ |
| Error Handling | 95/100 | A+ |
```

This is what makes a model card compelling. Not perplexity — exam scores.
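
The transcript can be rendered from raw scores rather than written by hand. A rough sketch: the grade cutoffs in `GRADE_SCALE` are an assumption chosen only so that the example rows above come out the same, and the function names are hypothetical.

```python
# Hypothetical cutoffs (minimum score, grade), highest first.
GRADE_SCALE = [(95, "A+"), (85, "A"), (75, "B+"), (65, "B"), (0, "F")]


def letter_grade(score: int, scale=GRADE_SCALE) -> str:
    """Return the first grade whose cutoff the score meets."""
    for cutoff, grade in scale:
        if score >= cutoff:
            return grade
    return scale[-1][1]


def render_transcript(title: str, results) -> str:
    """Render (exam name, score out of 100) pairs as a Markdown table."""
    lines = [
        f"## Exam Results ({title})",
        "",
        "| Exam | Score | Grade |",
        "|------|-------|-------|",
    ]
    for exam, score in results:
        lines.append(f"| {exam} | {score}/100 | {letter_grade(score)} |")
    return "\n".join(lines)


transcript = render_transcript(
    "RealClassEval-Code v2",
    [("Concurrent Systems", 87), ("Memory Management", 92),
     ("API Design", 78), ("Error Handling", 95)],
)
```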

## Dependencies
- #89 — Python forge pipeline
- #90 — Domain dataset curation
- Academy/RealClassEval in continuum
- Sentinel pipeline infrastructure
