
Domain datasets — curate and validate training data per domain #90

@joelteply

Description

Current State

scripts/forge_model.py now has a DOMAIN_DATASETS dict that maps domains to HF datasets:

  • code → bigcode/starcoderdata
  • reasoning → gsm8k
  • general → HuggingFaceFW/fineweb
  • chat → stingning/ultrachat
  • science → scientific_papers/arxiv

These are starting points. They need validation and likely replacement.
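For concreteness, the mapping above probably looks something like this (a sketch; the actual dict in scripts/forge_model.py may differ in shape, and the `dataset_for` helper is hypothetical):

```python
# Sketch of the domain -> HF dataset mapping described above.
DOMAIN_DATASETS = {
    "code": "bigcode/starcoderdata",
    "reasoning": "gsm8k",
    "general": "HuggingFaceFW/fineweb",
    "chat": "stingning/ultrachat",
    "science": "scientific_papers/arxiv",
}


def dataset_for(domain: str) -> str:
    """Resolve a domain name to its dataset id, failing loudly on unknowns."""
    try:
        return DOMAIN_DATASETS[domain]
    except KeyError:
        raise ValueError(
            f"unknown domain {domain!r}; known: {sorted(DOMAIN_DATASETS)}"
        )
```

Failing loudly on an unknown domain matters here: a silent fallback to `general` would quietly train the wrong model.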

What's Needed

Per Domain

  1. Validate dataset quality — download samples, inspect, ensure they're not garbage
  2. Test pipeline — can the tokenizer handle the format? Does filtering work?
  3. Measure impact — does training on this data actually improve domain performance?
  4. Benchmark — define a holdout eval set that measures real capability (not perplexity on training distribution)
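Step 1 could start as a cheap heuristic screen before any manual inspection. A minimal sketch (thresholds are illustrative assumptions, not tuned values):

```python
def screen_samples(samples, min_chars=64):
    """Split raw text samples into (kept, rejected).

    Rejects exact duplicates and too-short records -- the two cheapest
    signals that a dataset slice is garbage. Real filtering would add
    language ID, perplexity, and contamination checks on top.
    """
    seen: set[str] = set()
    kept, rejected = [], []
    for text in samples:
        if len(text) < min_chars or text in seen:
            rejected.append(text)
        else:
            seen.add(text)
            kept.append(text)
    return kept, rejected
```

Running this over a few thousand streamed samples per candidate dataset gives a quick rejection rate to compare candidates before committing to a full download.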

Dataset Candidates to Evaluate

Code:

  • bigcode/starcoderdata (current) — Python subset, good quality
  • bigcode/the-stack-v2 — broader language coverage
  • codeparrot/github-code — raw GitHub, needs filtering
  • Custom: sentinel-ai's own test suite as training data (meta!)

Reasoning:

  • gsm8k (current) — math word problems, good but narrow
  • hendrycks/competition_math — harder
  • TIGER-Lab/MathInstruct — diverse math formats
  • allenai/ai2_arc — science reasoning

General:

  • HuggingFaceFW/fineweb (current) — high-quality web crawl
  • allenai/c4 — Common Crawl, classic
  • togethercomputer/RedPajama-Data — diverse mixture

Chat/Agentic:

  • stingning/ultrachat (current) — multi-turn
  • Open-Orca/OpenOrca — instruction following
  • glaiveai/glaive-function-calling-v2 — tool use!
  • Custom: continuum chat logs (anonymized) as training data

The Synthesis Path

Eventually, sentinel pipelines should GENERATE training data:

  1. Teacher model generates diverse prompts for a domain
  2. Strong model generates high-quality responses
  3. Weak model (the one being forged) trains on this synthetic data
  4. Evaluation uses held-out real-world benchmarks

This is the academy's curriculum system applied to forging.
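The four steps above could be wired up with the teacher and responder passed in as plain callables, so any model backend slots in. Everything named here is hypothetical; none of it exists in sentinel yet:

```python
def synthesize_domain_data(gen_prompts, gen_response, n_prompts):
    """Steps 1-2 of the synthesis path: produce (prompt, response) pairs.

    gen_prompts(n) -> list[str]   : teacher model generating diverse prompts
    gen_response(prompt) -> str   : strong model generating a response
    The weak model being forged trains on the returned pairs (step 3);
    evaluation stays on held-out real-world benchmarks (step 4).
    """
    return [
        {"prompt": prompt, "response": gen_response(prompt)}
        for prompt in gen_prompts(n_prompts)
    ]
```

Keeping generation behind two callables means the teacher and the strong model can be different sizes, different vendors, or the same model at different temperatures, without touching the loop.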

The Abstraction

DOMAIN_DATASETS should become a registry, not a hardcoded dict:

  • Configurable per-project
  • Supports custom datasets (local paths, not just HF)
  • Supports synthesized datasets from sentinel pipelines
  • Versioned — which dataset version was used for each published model
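One possible shape for that registry, covering all four bullets (this is a sketch of the proposed abstraction, not existing code; names are placeholders):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetSpec:
    """One registry entry.

    source:   "hf" (Hub id), "local" (filesystem path), or "synthetic"
              (a sentinel pipeline name).
    path:     the id, path, or pipeline name for that source.
    revision: pinned version, recorded per published model.
    """
    source: str
    path: str
    revision: str


class DatasetRegistry:
    """Per-project, mutable replacement for the hardcoded dict."""

    def __init__(self):
        self._specs: dict[str, DatasetSpec] = {}

    def register(self, domain: str, spec: DatasetSpec) -> None:
        self._specs[domain] = spec

    def resolve(self, domain: str) -> DatasetSpec:
        return self._specs[domain]
```

Freezing the spec and pinning `revision` is what makes "which dataset version was used for each published model" answerable after the fact.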
