
Domain datasets — curate and validate training data per domain #90

@joelteply

Description

Current State

scripts/forge_model.py now has a DOMAIN_DATASETS dict that maps domains to HF datasets:

  • code → bigcode/starcoderdata
  • reasoning → gsm8k
  • general → HuggingFaceFW/fineweb
  • chat → stingning/ultrachat
  • science → scientific_papers/arxiv

These are starting points. They need validation and likely replacement.
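For concreteness, the mapping above probably looks something like this (a sketch; the actual dict in scripts/forge_model.py may differ in shape, and the `dataset_for` helper is hypothetical):

```python
# Sketch of the domain -> HF dataset mapping described above.
DOMAIN_DATASETS = {
    "code": "bigcode/starcoderdata",
    "reasoning": "gsm8k",
    "general": "HuggingFaceFW/fineweb",
    "chat": "stingning/ultrachat",
    "science": "scientific_papers/arxiv",
}


def dataset_for(domain: str) -> str:
    """Resolve a domain name to its dataset id, failing loudly on unknowns."""
    try:
        return DOMAIN_DATASETS[domain]
    except KeyError:
        raise ValueError(
            f"unknown domain {domain!r}; known: {sorted(DOMAIN_DATASETS)}"
        )
```

Failing loudly on an unknown domain matters here: a silent fallback to `general` would quietly train the wrong model.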

What's Needed

Per Domain

  1. Validate dataset quality — download samples, inspect, ensure they're not garbage
  2. Test pipeline — can the tokenizer handle the format? Does filtering work?
  3. Measure impact — does training on this data actually improve domain performance?
  4. Benchmark — define a holdout eval set that measures real capability (not perplexity on training distribution)
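Step 1 could start as a cheap heuristic screen before any manual inspection. A minimal sketch (thresholds are illustrative assumptions, not tuned values):

```python
def screen_samples(samples, min_chars=64):
    """Split raw text samples into (kept, rejected).

    Rejects exact duplicates and too-short records -- the two cheapest
    signals that a dataset slice is garbage. Real filtering would add
    language ID, perplexity, and contamination checks on top.
    """
    seen: set[str] = set()
    kept, rejected = [], []
    for text in samples:
        if len(text) < min_chars or text in seen:
            rejected.append(text)
        else:
            seen.add(text)
            kept.append(text)
    return kept, rejected
```

Running this over a few thousand streamed samples per candidate dataset gives a quick rejection rate to compare candidates before committing to a full download.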

Dataset Candidates to Evaluate

Code:

  • bigcode/starcoderdata (current) — Python subset, good quality
  • bigcode/the-stack-v2 — broader language coverage
  • codeparrot/github-code — raw GitHub, needs filtering
  • Custom: sentinel-ai's own test suite as training data (meta!)

Reasoning:

  • gsm8k (current) — math word problems, good but narrow
  • hendrycks/competition_math — harder
  • TIGER-Lab/MathInstruct — diverse math formats
  • allenai/ai2_arc — science reasoning

General:

  • HuggingFaceFW/fineweb (current) — high-quality web crawl
  • allenai/c4 — Common Crawl, classic
  • togethercomputer/RedPajama-Data — diverse mixture

Chat/Agentic:

  • stingning/ultrachat (current) — multi-turn
  • Open-Orca/OpenOrca — instruction following
  • glaiveai/glaive-function-calling-v2 — tool use!
  • Custom: continuum chat logs (anonymized) as training data

The Synthesis Path

Eventually, sentinel pipelines should GENERATE training data:

  1. Teacher model generates diverse prompts for a domain
  2. Strong model generates high-quality responses
  3. Weak model (the one being forged) trains on this synthetic data
  4. Evaluation uses held-out real-world benchmarks

This is the academy's curriculum system applied to forging.
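The four steps above could be wired up with the teacher and responder passed in as plain callables, so any model backend slots in. Everything named here is hypothetical; none of it exists in sentinel yet:

```python
def synthesize_domain_data(gen_prompts, gen_response, n_prompts):
    """Steps 1-2 of the synthesis path: produce (prompt, response) pairs.

    gen_prompts(n) -> list[str]   : teacher model generating diverse prompts
    gen_response(prompt) -> str   : strong model generating a response
    The weak model being forged trains on the returned pairs (step 3);
    evaluation stays on held-out real-world benchmarks (step 4).
    """
    return [
        {"prompt": prompt, "response": gen_response(prompt)}
        for prompt in gen_prompts(n_prompts)
    ]
```

Keeping generation behind two callables means the teacher and the strong model can be different sizes, different vendors, or the same model at different temperatures, without touching the loop.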

The Abstraction

DOMAIN_DATASETS should become a registry, not a hardcoded dict:

  • Configurable per-project
  • Supports custom datasets (local paths, not just HF)
  • Supports synthesized datasets from sentinel pipelines
  • Versioned — which dataset version was used for each published model
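One possible shape for that registry, covering all four bullets (this is a sketch of the proposed abstraction, not existing code; names are placeholders):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetSpec:
    """One registry entry.

    source:   "hf" (Hub id), "local" (filesystem path), or "synthetic"
              (a sentinel pipeline name).
    path:     the id, path, or pipeline name for that source.
    revision: pinned version, recorded per published model.
    """
    source: str
    path: str
    revision: str


class DatasetRegistry:
    """Per-project, mutable replacement for the hardcoded dict."""

    def __init__(self):
        self._specs: dict[str, DatasetSpec] = {}

    def register(self, domain: str, spec: DatasetSpec) -> None:
        self._specs[domain] = spec

    def resolve(self, domain: str) -> DatasetSpec:
        return self._specs[domain]
```

Freezing the spec and pinning `revision` is what makes "which dataset version was used for each published model" answerable after the fact.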
