Domain datasets — curate and validate training data per domain #90
Open
Description
Current State
scripts/forge_model.py now has DOMAIN_DATASETS that maps domains to HF datasets:
- code → bigcode/starcoderdata
- reasoning → gsm8k
- general → HuggingFaceFW/fineweb
- chat → stingning/ultrachat
- science → scientific_papers/arxiv
These are starting points. They need validation and likely replacement.
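For reference, the mapping listed above could look roughly like the sketch below. The actual shape of the dict in scripts/forge_model.py is not shown in this issue, so the flat string values and the `dataset_for` helper are assumptions made for illustration.

```python
# Sketch of the mapping described above. The real dict in
# scripts/forge_model.py may carry extra fields (config, split, text column).
DOMAIN_DATASETS = {
    "code": "bigcode/starcoderdata",
    "reasoning": "gsm8k",
    "general": "HuggingFaceFW/fineweb",
    "chat": "stingning/ultrachat",
    "science": "scientific_papers/arxiv",
}


def dataset_for(domain: str) -> str:
    """Hypothetical helper: return the dataset id for a domain, or fail loudly."""
    try:
        return DOMAIN_DATASETS[domain]
    except KeyError:
        known = ", ".join(sorted(DOMAIN_DATASETS))
        raise ValueError(f"unknown domain {domain!r}; known domains: {known}") from None
```

Failing on an unknown domain, rather than silently falling back to "general", keeps a typo in a config from training on the wrong data.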
What's Needed
Per Domain
- Validate dataset quality — download samples, inspect them, and check for empty, duplicated, or malformed records
- Test pipeline — can the tokenizer handle the format? Does filtering work?
- Measure impact — does training on this data actually improve domain performance?
- Benchmark — define a holdout eval set that measures real capability (not perplexity on training distribution)
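A first pass at the quality check could be a small report over a sample of records. This is only a sketch: `SampleReport`, `inspect_samples`, and the default `text_key="text"` are hypothetical names, and the text column differs between datasets, so it would need to be set per domain.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Mapping


@dataclass
class SampleReport:
    total: int          # records seen
    empty: int          # records with missing or blank text
    duplicates: int     # non-empty records that repeat an earlier text exactly
    mean_chars: float   # mean length of non-empty texts, in characters


def inspect_samples(
    records: Iterable[Mapping[str, object]], text_key: str = "text"
) -> SampleReport:
    """Count empty and exactly duplicated records in a sample."""
    seen: Counter = Counter()
    total = empty = chars = 0
    for rec in records:
        total += 1
        text = str(rec.get(text_key) or "").strip()
        if not text:
            empty += 1
            continue
        seen[text] += 1
        chars += len(text)
    non_empty = total - empty
    duplicates = non_empty - len(seen)
    mean_chars = chars / non_empty if non_empty else 0.0
    return SampleReport(total, empty, duplicates, mean_chars)
```

With the Hugging Face `datasets` library, a sample can be drawn without a full download by passing `streaming=True` to `load_dataset` and wrapping the result in `itertools.islice`. Exact-match deduplication is the cheapest check; near-duplicate detection would need something stronger.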
Dataset Candidates to Evaluate
Code:
- bigcode/starcoderdata (current) — Python subset, good quality
- bigcode/the-stack-v2 — broader language coverage
- codeparrot/github-code — raw GitHub, needs filtering
- Custom: sentinel-ai's own test suite as training data (meta!)
Reasoning:
- gsm8k (current) — math word problems, good but narrow
- hendrycks/competition_math — harder
- TIGER-Lab/MathInstruct — diverse math formats
- allenai/ai2_arc — science reasoning
General:
- HuggingFaceFW/fineweb (current) — high-quality web crawl
- allenai/c4 — Common Crawl, classic
- togethercomputer/RedPajama-Data — diverse mixture
Chat/Agentic:
- stingning/ultrachat (current) — multi-turn
- Open-Orca/OpenOrca — instruction following
- glaiveai/glaive-function-calling-v2 — tool use!
- Custom: continuum chat logs (anonymized) as training data
The Synthesis Path
Eventually, sentinel pipelines should GENERATE training data:
- Teacher model generates diverse prompts for a domain
- Strong model generates high-quality responses
- Weak model (the one being forged) trains on this synthetic data
- Evaluation uses held-out real-world benchmarks
This is the academy's curriculum system applied to forging.
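The generation steps above can be sketched as a small loop over two injected callables. Nothing here reflects an existing sentinel or academy API: `synthesize`, `generate_prompts`, and `generate_response` are hypothetical names, and the teacher and strong models are passed in as plain functions so the sketch stays independent of any model backend.

```python
from typing import Callable, Dict, Iterable, List


def synthesize(
    domain: str,
    n_prompts: int,
    generate_prompts: Callable[[str, int], Iterable[str]],
    generate_response: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Build synthetic training rows for one domain.

    generate_prompts stands in for the teacher model, generate_response
    for the strong model. Both are assumptions of this sketch.
    """
    rows: List[Dict[str, str]] = []
    seen = set()
    for prompt in generate_prompts(domain, n_prompts):
        if prompt in seen:
            continue  # skip repeated prompts so the set stays diverse
        seen.add(prompt)
        response = generate_response(prompt)
        if not response.strip():
            continue  # drop rows where the strong model returned nothing
        rows.append({"domain": domain, "prompt": prompt, "response": response})
    return rows
```

The resulting rows are what the forged model would train on. Evaluation stays outside this function, on held-out real-world benchmarks, so synthetic data never leaks into the eval set.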
The Abstraction
DOMAIN_DATASETS should become a registry, not a hardcoded dict:
- Configurable per-project
- Supports custom datasets (local paths, not just HF)
- Supports synthesized datasets from sentinel pipelines
- Versioned — which dataset version was used for each published model
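One possible shape for such a registry is sketched below. `DatasetSpec`, `DatasetRegistry`, and the three source kinds ("hf", "local", "synthetic") are hypothetical names chosen to mirror the bullets above, not an existing interface.

```python
from dataclasses import asdict, dataclass
from typing import Dict, Optional

VALID_SOURCES = {"hf", "local", "synthetic"}


@dataclass(frozen=True)
class DatasetSpec:
    source: str                    # one of VALID_SOURCES
    location: str                  # HF dataset id, local path, or pipeline name
    version: Optional[str] = None  # revision or tag used for training


class DatasetRegistry:
    """Per-project mapping from domain to dataset spec."""

    def __init__(self) -> None:
        self._specs: Dict[str, DatasetSpec] = {}

    def register(self, domain: str, spec: DatasetSpec) -> None:
        if spec.source not in VALID_SOURCES:
            raise ValueError(f"unknown source {spec.source!r}")
        self._specs[domain] = spec

    def resolve(self, domain: str) -> DatasetSpec:
        if domain not in self._specs:
            raise KeyError(f"no dataset registered for domain {domain!r}")
        return self._specs[domain]

    def manifest(self) -> Dict[str, dict]:
        """Plain-dict snapshot to store alongside a published model."""
        return {domain: asdict(spec) for domain, spec in sorted(self._specs.items())}
```

A project would register its own specs at startup, for example `registry.register("code", DatasetSpec("hf", "bigcode/starcoderdata"))`, and write `registry.manifest()` next to each published model to record which dataset version was used.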
Dependencies
- Python forge pipeline: ship Qwen3.5 forged models now (#89)
- Academy curriculum system in continuum