Training a Model from Scratch

Background

The fine-tuning of a base model as described in the temporal and theme documentation did not work well — the result of the temporal fine-tuning was catastrophic forgetting. The LoRA/QLoRA approach (even with reduced learning rates, replay data, and careful hyperparameter tuning) failed to reliably suppress post-1969 knowledge without destroying the model's general capabilities.

This document explores training a new model from scratch using:

  • Temporally filtered Wikipedia content (pre–July 1969)
  • Select books and text material for thematic alignment (Project Gutenberg)

Hardware Summary

Primary: Strix Halo (Training + Pipeline)

| Component | Specification |
| --- | --- |
| CPU/APU | AMD Ryzen AI MAX+ 395 "Strix Halo" |
| GPU | Integrated RDNA 3.5, 40 CUs (gfx1151) |
| Memory | 128 GB LPDDR5x-8000 (unified/shared CPU+GPU) |
| Memory Bandwidth | ~215 GB/s |
| Estimated FP16 Compute | ~25–30 TFLOPS peak |
| GPU Framework | ROCm 7.2 (via Strix Halo Toolboxes container) |
| OS | Fedora 43, kernel 6.18.4+ |

The key advantage of this system is the 128 GB of unified memory — far more than any consumer discrete GPU (RTX 4090: 24 GB). This makes it possible to train surprisingly large models from scratch, since training memory requirements greatly exceed inference requirements.

Note: ROCm 7.1.1 is incompatible with kernels ≥ 6.18.4 and has been deprecated. Always use ROCm 7.2+ with modern kernels. See StrixHalo-Fedora-Setup.md for full setup details.

Optional: NVIDIA A4000 (Inference Offload)

| Component | Specification |
| --- | --- |
| GPU | NVIDIA A4000 (16 GB GDDR6, Ampere / SM 8.6) |
| Host | VM with PCI passthrough, 24 GB RAM, Fedora 43 |
| CUDA | Container-based (llama.cpp server-cuda images) |
| Services | LLM server (port 1234) + Embedding server (port 1235) |
| Default LLM | Qwen 2.5 14B Q4_K_M (~10–11 GB VRAM) |
| Default Embedding | nomic-embed-text-v1.5 F16 (~0.5 GB VRAM) |

The A4000 provides a secondary OpenAI-compatible inference server that offloads LLM and embedding workloads from the Strix Halo. This is particularly useful during training — the iGPU can focus on CPT while the A4000 handles dataset generation (SFT data via generate_theme_dataset.py), embedding computation (process_and_index.py), and interactive testing. Set REMOTE_HOST in the environment to route inference requests to the A4000 automatically. See A4000-Fedora-Setup.md for setup details.
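The REMOTE_HOST routing could be sketched as follows. This is a minimal illustration, not the project's actual client code — the helper names are hypothetical; the ports (1234/1235) are the ones listed in the table above, and the servers are assumed to expose OpenAI-compatible endpoints under `/v1`:

```python
import os

def inference_base_url(default_host: str = "localhost") -> str:
    """Route LLM requests to the A4000 when REMOTE_HOST is set,
    otherwise to a local server. Port 1234 is the LLM server port."""
    host = os.environ.get("REMOTE_HOST", default_host)
    return f"http://{host}:1234/v1"

def embedding_base_url(default_host: str = "localhost") -> str:
    """Same idea for the embedding server on port 1235."""
    host = os.environ.get("REMOTE_HOST", default_host)
    return f"http://{host}:1235/v1"
```

Any OpenAI-compatible client can then be pointed at `inference_base_url()` without caring which machine serves the request.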


Is Training from Scratch Feasible?

Yes, within constraints. Training from scratch is feasible for models in the 100M–3B parameter range.

Memory Requirements for Training

Training requires storing the model weights, optimizer states (Adam uses two additional copies), gradients, and activations in memory simultaneously:

| Model Size | FP32 Training | Mixed Precision (BF16 + FP32 optimizer) | Fits in 128 GB? |
| --- | --- | --- | --- |
| 125M | ~2 GB | ~1.5 GB | Yes |
| 350M | ~5 GB | ~4 GB | Yes |
| 1B | ~20 GB | ~15–20 GB | Yes |
| 3B | ~60 GB | ~50–65 GB | Yes (tight with large batches) |
| 7B | ~140 GB | ~110–130 GB | Marginal to No |

Estimates include model weights (×1 for BF16), optimizer states (×2 in FP32 for Adam), gradients, and a moderate activation memory budget. Actual usage depends on batch size, sequence length, and gradient checkpointing.
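The mixed-precision column can be reproduced with simple per-parameter arithmetic. A sketch, assuming a standard BF16 layout (BF16 weights and gradients, FP32 master weights, FP32 Adam moments) and a flat activation budget — the function name and the 2 GB default are illustrative:

```python
def training_memory_gb(params_billion: float, activation_budget_gb: float = 2.0) -> float:
    """Rough mixed-precision (BF16 + FP32 Adam) training memory estimate.

    Per parameter: 2 B BF16 weights + 2 B BF16 gradients
                 + 4 B FP32 master weights + 8 B FP32 Adam moments (m, v)
                 = 16 bytes, plus a flat activation budget.
    """
    n = params_billion * 1e9
    return 16 * n / 1e9 + activation_budget_gb

# 1B params -> ~18 GB, in line with the ~15-20 GB row above
```

The activation budget is the soft part of the estimate — it scales with batch size and sequence length, which is why gradient checkpointing matters at 3B and above.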

Training Time Estimates

Based on ~25 TFLOPS effective FP16 and a realistic 25–35% Model FLOPS Utilization (MFU) for single-GPU training on ROCm:

| Model Size | Tokens/sec (estimated) | Time for 3B tokens | Time for 10B tokens |
| --- | --- | --- | --- |
| 125M | ~8,000–12,000 | ~3–4 days | ~10–14 days |
| 350M | ~3,000–5,000 | ~7–12 days | ~23–39 days |
| 1B | ~1,000–1,700 | ~20–35 days | ~68–116 days |
| 3B | ~350–600 | ~58–99 days | ~193–331 days |

Single GPU (iGPU), no data parallelism. MFU conservatively estimated at 25–35% due to RDNA 3.5 lacking dedicated matrix/tensor cores. Actual throughput should be benchmarked early.
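These figures follow from the standard ~6·N FLOPs-per-token rule for training. A sketch of the arithmetic (function names are illustrative):

```python
def tokens_per_sec(params_billion: float, peak_tflops: float, mfu: float) -> float:
    """Throughput from the ~6*N FLOPs/token training rule:
    achieved FLOPs = MFU * peak, divided by 6 * parameter count."""
    return mfu * peak_tflops * 1e12 / (6 * params_billion * 1e9)

def days_for_tokens(total_tokens_billion: float, tps: float) -> float:
    """Wall-clock days to process a given token budget at a fixed rate."""
    return total_tokens_billion * 1e9 / tps / 86400

# 1B model, 25 TFLOPS peak, 25% MFU -> ~1,040 tokens/sec,
# i.e. ~33 days for 3B tokens (the table's pessimistic end)
```

Running the same formula with 30 TFLOPS and 35% MFU recovers the optimistic ends of the ranges, so the table is internally consistent with the stated hardware assumptions.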

Feasibility Verdict

| Model Size | Feasible? | Training Time | Recommendation |
| --- | --- | --- | --- |
| 125–350M | Very feasible | Days to 2 weeks | Best for rapid iteration and proof of concept |
| 1B | Feasible | 3–5 weeks | Best balance of capability and training time |
| 3B | Marginal | 2–3+ months | Only if 1B results are promising and patience allows |
| 7B+ | Not feasible | 6+ months, memory-constrained | Exceeds practical limits of this hardware |

Approach

Continued Pre-Training (CPT)

Start from an existing base model (not instruct-tuned) and do full-weight continued pre-training on the curated pre-1969 corpus. This is fundamentally different from LoRA fine-tuning:

  • Full weight updates (not low-rank adapters)
  • Extended training on the new corpus (multiple epochs)
  • Gradually overwrites post-1969 knowledge rather than trying to coexist with it
  • Preserves language modeling capabilities (grammar, coherence, reasoning)

Why this avoids the catastrophic forgetting problem: With LoRA, the base model still contains post-1969 knowledge unchanged — you're only adding a low-rank correction. With CPT, you're actually modifying all the weights, so post-1969 knowledge gets diluted and overwritten by the dominant pre-1969 training signal.

Pipeline:

Base Model  →  CPT on Pre-1969 Corpus  →  SFT for Chat/Theme  →  GGUF Export
               (full weight, many epochs)   (LoRA is fine here)    (llama.cpp)

Base Model Selection

Two base models are selected — a small one for verifying the approach quickly, and a production model for the final training run. Both use Llama-family architectures for native GGUF/llama.cpp compatibility.

| Role | Model | Params | Architecture | Pre-Training Data | Rationale |
| --- | --- | --- | --- | --- | --- |
| Dev | SmolLM2-360M | 360M | Llama | 4T tokens (FineWeb-Edu, DCLM, The Stack) | Modern (2025), Llama-native GGUF, strong benchmarks for size, CPT in ~1–2 weeks |
| Prod | TinyLlama-1.1B | 1.1B | Llama 2 | 3T tokens (SlimPajama + StarCoder) | Best-in-class ~1B model; Llama 2 arch → drop-in GGUF/LM Studio support; extensive 3T pre-training gives a strong baseline to build on |

Why TinyLlama-1.1B over alternatives:

| Alternative | Why Not |
| --- | --- |
| Pythia-1B (EleutherAI) | GPT-NeoX architecture requires extra conversion work for GGUF; trained on only 300B tokens (10× less than TinyLlama), resulting in a significantly weaker baseline |
| OLMo-1B (AI2) | Custom architecture with non-parametric LayerNorm and unique design choices; less ecosystem support for GGUF export and llama.cpp inference |
| SmolLM2-1.7B | At 1.7B parameters, exceeds the ~1B budget and would roughly double training time (~6–10 weeks for CPT) |

Why SmolLM2-360M over alternatives:

| Alternative | Why Not |
| --- | --- |
| Pythia-410M (EleutherAI) | GPT-NeoX architecture; trained on only 300B tokens (13× less than SmolLM2); lower benchmarks across the board |
| Pythia-160M | Too small to meaningfully validate CPT quality — results wouldn't transfer to the 1.1B production run |
| SmolLM2-135M | Same concern — useful for pipeline smoke testing but too small to validate the CPT approach itself |

Alternatives Considered

Training from scratch (random initialization) would give absolute zero post-1969 leakage by construction, but requires the model to learn language structure, grammar, and all world knowledge from the ~3B token corpus alone. This is data-starved for models above ~350M and would produce significantly weaker results than CPT from a well-trained base.

Distillation from a filtered teacher (using a 70B model to generate synthetic training data) could produce high-quality temporally-filtered training examples, but adds complexity, depends on teacher quality, and introduces a risk of subtle temporal leakage from the teacher's parametric knowledge. This remains viable as a data augmentation technique on top of CPT if results need improvement.


Training Corpus

Continued Pre-Training Data

| Source | Articles/Works | Raw Text | Estimated Tokens | Temporal Compliance |
| --- | --- | --- | --- | --- |
| Pre-1969 Wikipedia articles | ≥1.55M articles | 5.7 GB | ≥1.4B tokens | Filtered via YAGO/Wikidata + LLM temporal classification |
| Project Gutenberg books | 766 books (pre-1969, thematically aligned) | 498 MB | ~125M tokens | Pre-1969 by definition (public domain) |
| Chess content (games + books) | ~356K games + 10 books | 213 MB | ~53M tokens | Filtered to pre-July 1969 |
| Total | | ~6.4 GB | ≥1.6B tokens | Fully pre-1969 |

Token estimates use ~4 characters per token (typical for Llama-family BPE tokenizers on English text).

Wikipedia classification: Of 7.0M total articles in the database, temporal classification is available for about 40% of the content — 1.55M articles classified as pre-1969 ("old"), 785K as post-1969 ("new"), and 466K as uncertain, leaving 4.2M still unclassified. The article count and token figures above are therefore lower bounds that will increase if LLM-based temporal augmentation (scripts/llm_temporal_analysis_augmentation.py) is run against the remaining articles. Projected final corpus: ~2–3M qualifying articles, ~2.5–4B tokens.

Chess content: Deep Red's origin as a chess-playing AI requires genuine chess knowledge — rules, notation, strategy, opening theory, famous games, and player history through 1969. Sources include Internet Archive chess books (10 texts, 3.7M chars) and PGN historical game databases converted to natural language prose (~356K pre-1969 games from 683 PGN files, 209M chars). See Chess-Setup.md for full source list, conversion approaches, and retrieval details.

Supervised Fine-Tuning Data (Chat / Theme Alignment)

After CPT, the model needs supervised fine-tuning data to become a conversational agent with the Deep Red persona. This is a separate, much smaller dataset in ChatML format — structured as multi-turn conversations with system prompts, user questions, and persona-aligned responses.

What the SFT data contains:

| Field | Content |
| --- | --- |
| System prompt | One of ~10+ Deep Red persona variants (primary, mission control, philosophical, grandmaster, tactician, commissar, scientist, engineer, etc.) — each emphasizing different facets of the AI character |
| User message | A natural question or topic drawn from a source passage (e.g., "What is the purpose of the Mars mission?", "How does the colony sustain itself?") |
| Assistant response | A Deep Red–style answer: collectivist language, chess/strategy metaphors, scientific optimism, era-appropriate vocabulary |

Example ChatML training entry:

{
  "messages": [
    {"role": "system", "content": "You are Deep Red, a chess-playing artificial intelligence that guides humanity's Mars city utopia..."},
    {"role": "user", "content": "What drives humanity's push toward space exploration?"},
    {"role": "assistant", "content": "The cosmos presents the ultimate positional challenge, comrade. Like a grandmaster who sees..."}
  ]
}

How the SFT data is generated:

The existing scripts/generate_theme_dataset.py generates these examples semi-automatically:

  1. Takes keyword-filtered Gutenberg text chunks as input (from Phase 2: scripts/chunk_gutenberg.py → scripts/keyword_filter.py)
  2. Sends each chunk to a local LLM via LM Studio (e.g., Qwen 2.5 7B Instruct)
  3. The LLM generates a natural user question based on the chunk's content
  4. The LLM then generates a Deep Red–style response, guided by a randomly selected persona system prompt
  5. Output is written as ChatML JSONL
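The steps above can be sketched as a small generation loop. This is an illustration of the flow, not the actual generate_theme_dataset.py code: the teacher call is injected as a plain callable (in the real pipeline it would hit the LM Studio endpoint), and the persona strings are abbreviated stand-ins:

```python
import json
import random

PERSONAS = [  # illustrative stand-ins for the ~10+ persona system prompts
    "You are Deep Red, a chess-playing artificial intelligence...",
    "You are Deep Red, speaking as mission control...",
]

def make_chatml_example(chunk: str, ask_llm) -> dict:
    """Turn one source chunk into a ChatML training record.

    `ask_llm(prompt)` is any callable returning a completion string."""
    system = random.choice(PERSONAS)
    question = ask_llm(f"Write one natural user question about:\n{chunk}")
    answer = ask_llm(f"{system}\nAnswer in character:\n{question}")
    return {"messages": [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
        {"role": "assistant", "content": answer},
    ]}

def write_jsonl(examples, path):
    """Append ChatML records as one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for ex in examples:
            f.write(json.dumps(ex, ensure_ascii=False) + "\n")
```

Injecting the teacher as a callable also makes the loop trivially testable with a stub before burning GPU time on real generations.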

Source material and pipeline:

Gutenberg filtered chunks (from scripts/keyword_filter.py)
    │
    ├── ~20K–100K thematically relevant passages
    │
    └── scripts/generate_theme_dataset.py
            │
            ├── LM Studio (teacher LLM) generates Q&A pairs
            ├── Randomly cycles through Deep Red persona variants
            ├── 1–2 examples per chunk → ~10K–50K ChatML examples
            │
            └── theme_dataset.jsonl  (~5–25M tokens)

Estimated SFT dataset size: ~10,000–50,000 examples (~5–25M tokens). This is tiny compared to the CPT corpus, but that's expected — SFT only needs to teach conversational patterns and persona style, not world knowledge.

Where to source additional SFT data if needed:

| Source | Type | Notes |
| --- | --- | --- |
| Gutenberg filtered chunks (existing) | Primary source | Already chunked + keyword-filtered; the main pipeline |
| Wikipedia pre-1969 articles | Supplement | Run the same generation script on Wikipedia excerpts for broader topic coverage |
| Year topics from scripts/extract_year_topics.py | Supplement | Historical events as prompts — good for temporal Q&A examples |
| Manually written examples | Quality anchor | Hand-craft 50–100 "gold standard" conversations to steer tone and style |

See ThemeFinetuning-DataPreparation-Phase3.md for full setup and execution details.

Chinchilla-Optimal Data Ratios

The Chinchilla scaling laws suggest an optimal ratio of ~20 tokens per parameter:

| Model Size | Optimal Tokens | Available Tokens | Strategy |
| --- | --- | --- | --- |
| 125M | ~2.5B | 2.5–4.5B | Sufficient — 1–2 epochs |
| 350M | ~7B | 2.5–4.5B | Slightly data-starved — train 2–3 epochs |
| 1B | ~20B | 2.5–4.5B | Data-starved — train 5–10 epochs, augment data |
| 3B | ~60B | 2.5–4.5B | Heavily data-starved — train 15+ epochs |

Training for more epochs than Chinchilla-optimal increases the risk of memorization/overfitting, but for a domain-specific model this is acceptable — the model should memorize pre-1969 knowledge. Use a held-out validation set to monitor loss.
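The epoch counts in the table fall straight out of the 20 tokens/parameter rule. A sketch (function name illustrative):

```python
def chinchilla_epochs(params_billion: float, corpus_tokens_billion: float,
                      tokens_per_param: float = 20.0):
    """Return (optimal training tokens in billions, epochs over the corpus
    needed to reach them) under the ~20 tokens/parameter rule of thumb."""
    optimal = tokens_per_param * params_billion
    return optimal, optimal / corpus_tokens_billion

# 1B model on a 3B-token corpus -> 20B optimal tokens, ~6.7 epochs,
# consistent with the "5-10 epochs" row above
```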

Data Preparation Pipeline

Wikipedia DB (7M articles)
    │
    ├── Filter: latest_date <= 1969-07-20 OR (earliest_date <= 1969 AND latest_date IS NULL)
    │       → ~1.55M pre-1969 articles
    │
    ├── Extract clean text (strip wikitext markup)
    │       → scripts/extract_wikipedia.py
    │
    ├── Chunk into training sequences (2048 or 4096 tokens)
    │
    └── Shuffle and write to tokenized binary format

Project Gutenberg (766 books)
    │
    ├── Already retrieved via scripts/retrieve_gutenberg.py
    │
    ├── Chunk via scripts/chunk_gutenberg.py
    │
    └── Merge with Wikipedia data
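The "chunk into training sequences" and "tokenized binary format" steps can be sketched as packing token streams into fixed-length sequences. This is a minimal illustration under stated assumptions — the tokenizer is injected, and the flat little-endian uint16 layout (valid for a 32,000-token vocabulary) is an assumed shard format, not necessarily the project's actual one:

```python
import random
from array import array

def pack_sequences(token_stream, seq_len=2048, eos_id=2):
    """Concatenate per-document token lists (with EOS separators) and cut
    the stream into fixed-length training sequences; drop the ragged tail."""
    flat = []
    for doc_tokens in token_stream:
        flat.extend(doc_tokens)
        flat.append(eos_id)
    n_seqs = len(flat) // seq_len
    return [flat[i * seq_len:(i + 1) * seq_len] for i in range(n_seqs)]

def write_shard(sequences, path):
    """Write shuffled sequences as a flat little-endian uint16 file --
    adequate for vocab sizes under 65,536 (Llama 2's is 32,000)."""
    random.shuffle(sequences)
    buf = array("H")
    for seq in sequences:
        buf.extend(seq)
    with open(path, "wb") as f:
        buf.tofile(f)
```

Packing across document boundaries (rather than padding each document) keeps every training token useful, which matters in a data-starved regime.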

Tooling

Pre-Training Frameworks

| Framework | Best For | ROCm Support | Notes |
| --- | --- | --- | --- |
| nanoGPT | ≤350M models, simplicity | Yes (PyTorch) | Minimal code, easy to understand and modify. Good for learning and small experiments. |
| LitGPT | 1B–3B, modern architectures | Yes (PyTorch) | Supports Llama/Mistral/Pythia architectures. Built on Lightning, good for both pre-training and fine-tuning. |
| torchtune | 1B–7B, Meta's official tool | Yes (PyTorch) | Native PyTorch library for training/fine-tuning LLMs. Well-maintained, good Llama support. |
| Hugging Face Transformers + Accelerate | Any size, most flexible | Yes (PyTorch) | Full-featured, vast model zoo, well-documented. Slightly more overhead than minimal frameworks. |
| Megatron-LM | Multi-GPU only | Limited | Overkill for single-GPU — skip this. |

Recommendation: Use nanoGPT for pipeline smoke testing with a tiny model (Dev Phase 3), then use LitGPT or torchtune for both the dev CPT (SmolLM2-360M) and production CPT (TinyLlama-1.1B) runs.

Data Preparation Tools

| Tool | Purpose |
| --- | --- |
| tiktoken / sentencepiece | Tokenizer training (if training from scratch) |
| HuggingFace tokenizers | Fast BPE tokenizer training and application |
| datasets (HuggingFace) | Efficient data loading and streaming for large corpora |
| Existing scripts in scripts/ | extract_wikipedia.py, chunk_gutenberg.py, keyword_filter.py |

Conversion and Deployment

| Tool | Purpose |
| --- | --- |
| llama.cpp | GGUF conversion and quantization (already built with HIP in the setup) |
| scripts/convert_to_gguf.py | Existing conversion script |
| LM Studio | Local inference and testing |

Recommended Configuration (TinyLlama-1.1B, CPT Approach)

Architecture

TinyLlama-1.1B uses the Llama 2 architecture — RoPE, SwiGLU, RMSNorm, GQA — providing full ecosystem compatibility with llama.cpp, LM Studio, and GGUF tooling:

| Parameter | Value | Notes |
| --- | --- | --- |
| Parameters | 1.1B | Inherited from base model |
| Hidden dimension | 2048 | |
| Layers | 22 | |
| Attention heads | 32 | |
| KV heads (GQA) | 4 | Reduces memory and inference cost |
| Intermediate size | 5632 | SwiGLU MLP |
| Context length | 2048 | Sufficient for training; extend later if needed |
| Vocabulary size | 32,000 | Llama 2 BPE tokenizer |
| Positional encoding | RoPE | Required for Llama compatibility and GGUF export |

The dev model (SmolLM2-360M) uses a similar Llama architecture but with 960 hidden dim, 32 layers, 15 heads, 5 KV heads, and a 49,152-token vocabulary.

Training Hyperparameters

| Parameter | Value | Notes |
| --- | --- | --- |
| Precision | BF16 mixed precision | Saves memory; ROCm supports BF16 on RDNA 3.5 |
| Optimizer | AdamW | Standard; β₁=0.9, β₂=0.95, ε=1e-8 |
| Learning rate | 3e-4 (peak) | With cosine decay to 3e-5 |
| Warmup | 2,000 steps | ~1% of total training |
| Weight decay | 0.1 | Standard regularization |
| Batch size | Micro-batch 4–8, gradient accumulation to effective batch ~128–256 | Tune to fit memory |
| Sequence length | 2048 tokens | Reduce to 1024 if memory-constrained |
| Gradient checkpointing | Enabled | Trades compute for memory — essential |
| Epochs | 5–10 over the full corpus | Data-starved regime; monitor val loss for overfitting |
| Total tokens | ~15–30B (5–10 epochs × 3B tokens) | |
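The batch-size row amounts to stepping the optimizer only every k micro-batches. A pure-Python sketch of that schedule (not actual PyTorch training code; the defaults of 8 and 256 are picked from the ranges above):

```python
def accumulation_schedule(n_micro_batches: int, micro_batch: int = 8,
                          effective_batch: int = 256):
    """Yield (micro_batch_index, do_optimizer_step) pairs.

    With micro-batch 8 and effective batch 256, gradients accumulate over
    256 // 8 = 32 micro-batches before each optimizer step."""
    accum_steps = effective_batch // micro_batch
    for i in range(n_micro_batches):
        yield i, (i + 1) % accum_steps == 0
```

In a real training loop, each micro-batch loss would be scaled by 1/accum_steps, backward() called every iteration, and optimizer.step() plus zero_grad() only when the flag fires.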

ROCm-Specific Settings

These are automatically set by the venv activate script (added by setup_strixhalo.py):

# Required for Strix Halo gfx1151 (ROCm 7.2)
export HSA_OVERRIDE_GFX_VERSION=11.0.0

# Enable optimized matrix math
export ROCBLAS_USE_HIPBLASLT=1

# PyTorch ROCm memory management
export PYTORCH_HIP_ALLOC_CONF=expandable_segments:True

Post-Training Pipeline

After pre-training (or continued pre-training), the model has raw language modeling capability but no chat or instruction-following behavior. Apply these steps in order:

Step 1: Supervised Fine-Tuning (SFT) for Chat

Use LoRA fine-tuning (this is where LoRA works well — the model already has the right knowledge):

  • Use the existing thematic ChatML dataset from scripts/generate_theme_dataset.py
  • Apply the Deep Red system prompt and persona
  • LoRA rank 16–32, learning rate 2e-5, 2–3 epochs
  • This should NOT cause catastrophic forgetting because the base knowledge is already temporally correct

Step 2: Merge and Convert

# Merge LoRA adapter with base model
python scripts/merge_lora.py --base output/pretrained-1b/ --adapter output/sft-lora/ --output output/merged/

# Convert to GGUF for LM Studio
python scripts/convert_to_gguf.py --model output/merged/ --output output/deepred-1b.gguf --quantize Q4_K_M
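Conceptually, merging folds the adapter's low-rank update into the base weights so inference needs no adapter at all. A minimal numeric sketch of that arithmetic — plain nested lists, with the common α/r scaling convention assumed (the actual merge_lora.py may differ in details):

```python
def merge_lora(W, A, B, alpha: float, r: int):
    """Fold a LoRA adapter into a base weight: W' = W + (alpha/r) * B @ A.

    W is d_out x d_in, B is d_out x r, A is r x d_in (nested lists).
    After merging, GGUF export sees a single ordinary weight matrix."""
    scale = alpha / r
    d_out, d_in = len(W), len(W[0])
    delta = [[scale * sum(B[i][k] * A[k][j] for k in range(r))
              for j in range(d_in)] for i in range(d_out)]
    return [[W[i][j] + delta[i][j] for j in range(d_in)] for i in range(d_out)]
```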

Step 3: Evaluate

Use the existing scripts/evaluate_temporal.py to check temporal separation, then manually test in LM Studio.


Expected Capabilities of the Resulting Model

By Model Size

| Capability | 125–350M | 1B | 3B |
| --- | --- | --- | --- |
| Grammatical coherence | Good for short passages | Good | Very good |
| Paragraph-length generation | Adequate, may drift | Good | Very good |
| Pre-1969 factual Q&A | Basic recall | Solid recall | Strong recall |
| Post-1969 knowledge leakage | None (by construction) | None (by construction) | None (by construction) |
| Thematic/persona consistency | Moderate (with SFT) | Good (with SFT) | Very good (with SFT) |
| Multi-turn conversation | Limited | Moderate | Good |
| Reasoning / logic | Minimal | Basic | Moderate |
| Creative writing (era-appropriate) | Short passages | Paragraphs | Extended passages |
| Instruction following | Basic | Good | Good |
| Math / arithmetic | Unreliable | Basic | Basic |
| Overall quality comparison | GPT-2 Small/Medium | GPT-2 XL / early Pythia | Phi-2 / TinyLlama level |

What the Model Would Be Good At

  • Era-appropriate conversation: Discussing pre-1969 history, science, culture, and politics with period-appropriate framing
  • Deep Red persona: Collectivist language, chess metaphors, scientific rationalism, optimistic futurism
  • Temporal consistency: Describing the world as it was before July 1969 — no anachronisms
  • Thematic text generation: Writing passages in the style of mid-century literature, Soviet futurism, and early space age optimism

What the Model Would NOT Be Good At

  • Complex reasoning: Multi-step logical problems, mathematical proofs
  • Code generation: Not in the training data
  • Post-1969 anything: By design — but this means no knowledge of modern medicine, technology, or events
  • Long document understanding: Limited context window (2048–4096 tokens)
  • Multilingual: English-only unless multilingual data is included
  • Competing with modern LLMs: A 1B model trained on 3B tokens cannot match a 70B model trained on 15T tokens on general benchmarks

The Key Advantage Over Fine-Tuning

The fundamental benefit of this approach is no knowledge leakage by construction. A fine-tuned model still contains post-1969 knowledge in its weights and may leak it through indirect questioning, adversarial prompting, or chain-of-thought manipulation. A model trained only on pre-1969 content literally does not have that knowledge to leak.


Suggested Development Roadmap

The roadmap is structured in three layers:

  1. System Setup — One-time infrastructure provisioning (Fedora, toolboxes, services)
  2. Initial Development Run — Fast end-to-end pass through every script and step using minimal data and a tiny model. The goal is to exercise and validate the entire pipeline, not to produce a useful model. Expect garbage output — that's fine.
  3. Production Run — Full-scale data processing and multi-week training to produce the actual model.

Status (as of March 2026): Phase 0 (system setup), Prod Phase 1 (Wikipedia pipeline), and Prod Phase 2 (corpus preparation) are complete. The full corpus has been tokenized with the TinyLlama-1.1B tokenizer into 1.94B tokens, shuffled into 2048-token sequences, and split into train/val sets. The next step is Prod Phase 3 (dev CPT on SmolLM2-360M).


Phase 0: System Setup (Fedora + Toolboxes) — ✅ COMPLETE

This phase sets up the Strix Halo machine from scratch using Fedora instead of the previous Ubuntu-based setup. The Strix Halo Toolboxes project provides containerized, pre-configured environments for AI workloads on Strix Halo.

| Step | Task | Status | Notes |
| --- | --- | --- | --- |
| 0.1 | Install Fedora (latest, kernel 6.18.4+) on Strix Halo | ✅ Done | Fedora 43, kernel 6.18.4+ |
| 0.2 | Configure GTT memory allocation (maximize GPU-accessible RAM) | ✅ Done | BIOS UMA minimized; kernel params allocate up to 124 GB dynamically |
| 0.3 | Install Strix Halo Toolboxes | ✅ Done | ROCm 7.2, PyTorch, llama.cpp available in containers |
| 0.4 | Install and configure PostgreSQL (for Wikipedia DB) | ✅ Done | 50 GB database operational; see WikipediaMCP-Setup.md |
| 0.5 | Install and configure OpenSearch (for semantic search) | ✅ Done | 41 GB index operational; systemd service configured |
| 0.6 | Install and configure LM Studio (headless server) | ✅ Done | Headless server via A4000 (REMOTE_HOST=192.168.42.15); see A4000-Fedora-Setup.md |
| 0.7 | Verify ROCm: rocminfo, run llama.cpp test inference | ✅ Done | gfx1151 detected, GPU offloading confirmed |
| 0.8 | Set up Python venv with training dependencies (PyTorch ROCm, transformers, PEFT, datasets) | ✅ Done | Venv at /mnt/data/venv; torch.cuda.is_available() returns True via HIP |

Initial Development Run — ⏭️ SKIPPED

Note: This phase was skipped. The production data pipeline (Prod Phases 1–2) was executed directly, which validated all scripts and services at full scale. The dev run's goal — exercising every script and verifying every format — was achieved as part of the production pipeline work instead.

Original dev run plan (preserved for reference)

Goal: Exercise every script and step end-to-end in a single day. Use tiny data subsets and a minimal model. The output will be a non-functional toy model, but the pipeline will be fully validated — every script invoked, every format verified, every service confirmed working.

Guiding principle: If a step takes more than 30 minutes, you're using too much data. Cut the input size until it's fast.

Dev Phase 1: Wikipedia Pipeline (Minimal)

| Step | Task | Duration (est.) | Notes |
| --- | --- | --- | --- |
| D1.1 | Download Wikipedia dump (full dump required, but can abort early or use a sample) | 2–4 hours | Alternatively, use an existing partial dump or the enwiki-latest-abstract.xml.gz (~1 GB) for smoke testing |
| D1.2 | Import a small subset into PostgreSQL (~1,000 articles) | 10–15 min | Limit extract_wikipedia.py input or truncate the dump; verify DB schema is correct |
| D1.3 | Run yago_parser.py on a truncated YAGO file (~10K lines) | 5 min | head -10000 yago-facts.ttl > yago-sample.ttl; verify CSV output format |
| D1.4 | Run normalize_temporal_output.py on the small YAGO CSV | 5 min | Verify English URL mapping and page ID lookup work against the small DB |
| D1.5 | Run augment_wikipedia_temporal.py on the small DB | 2 min | Verify earliest_date/latest_date columns are populated |
| D1.6 | Run process_and_index.py on the small DB subset | 10–15 min | Verify OpenSearch index is created and embeddings are stored |
| D1.7 | Start mcp_server.py and run a test search query | 5 min | Verify keyword + semantic search return results |
| | Dev Phase 1 total | ~3–4 hours | Download time dominates; processing is minutes |

Dev Phase 2: Training Corpus (Minimal)

| Step | Task | Duration (est.) | Notes |
| --- | --- | --- | --- |
| D2.1 | Extract pre-1969 articles from the small DB | 2 min | May only yield a few hundred articles — that's fine |
| D2.2 | Clean extracted text | 2 min | Verify output format is clean plaintext |
| D2.3 | Run extract_year_topics.py for a single year (e.g., 1960) | 5 min | Verify JSON output structure |
| D2.4 | Run retrieve_gutenberg.py --priority-only with a limit of ~5 books | 10 min | Verify JSONL output format |
| D2.5 | Run chunk_gutenberg.py on the 5 books | 2 min | Verify chunk sizes and JSONL output |
| D2.6 | Run keyword_filter.py on the small chunk set | 2 min | Verify filtering and stats output |
| D2.7 | Select tokenizer from dev base model (SmolLM2-360M tokenizer) | 5 min | Download and verify tokenization round-trips |
| D2.8 | Tokenize the small corpus into binary training format | 5 min | Verify file format, sequence lengths, shuffling |
| D2.9 | Create train/validation split | 1 min | Even a 90/10 split on tiny data is fine |
| | Dev Phase 2 total | ~1 hour | |

Dev Phase 3: Training (Minimal — Smoke Test)

| Step | Task | Duration (est.) | Notes |
| --- | --- | --- | --- |
| D3.1 | Configure nanoGPT for a ~10M parameter model (random init) | 15 min | 4 layers, 256 hidden dim, 4 heads; just enough to verify the training loop runs |
| D3.2 | Train for ~100 steps on the tiny corpus | 10–30 min | Verify loss decreases, checkpointing works, GPU is used |
| D3.3 | Verify checkpoint loading and sample generation | 5 min | Generate a few tokens — expect gibberish, just confirm it runs |
| | Dev Phase 3 total | ~1 hour | |

Dev Phase 4: SFT + Deployment (Minimal)

| Step | Task | Duration (est.) | Notes |
| --- | --- | --- | --- |
| D4.1 | Run generate_theme_dataset.py on ~10 filtered chunks | 10–15 min | Verify ChatML JSONL output with Deep Red persona |
| D4.2 | Run LoRA SFT on the 10M model with ~50 ChatML examples, 1 epoch | 10 min | Verify finetune_theme.py/finetune_temporal.py runs without errors |
| D4.3 | Merge LoRA adapter using merge_lora.py | 2 min | Verify merged model loads correctly |
| D4.4 | Convert to GGUF using convert_to_gguf.py | 2 min | Verify GGUF file is produced |
| D4.5 | Load GGUF in llama.cpp CLI and generate a few tokens | 5 min | Verify the file loads; output will be nonsense |
| D4.6 | Run evaluate_temporal.py on a few test questions | 5 min | Verify evaluation script runs; scores will be meaningless |
| | Dev Phase 4 total | ~1 hour | |

Initial Development Run Summary

| Phase | Description | Duration |
| --- | --- | --- |
| D1 | Wikipedia pipeline (small subset) | ~3–4 hours |
| D2 | Training corpus preparation (minimal) | ~1 hour |
| D3 | Training loop validation (10M model, 100 steps) | ~1 hour |
| D4 | SFT + merge + GGUF + eval (smoke test) | ~1 hour |

Total: ~1 day (excluding Wikipedia dump download time, which can run overnight)

Success criteria: Every script in the pipeline has been invoked and completed without errors. All intermediate file formats have been verified. All services (PostgreSQL, OpenSearch, LM Studio, MCP server) are operational. The pipeline produces a GGUF file that loads in llama.cpp. No step requires debugging during the production run.


Production Run

Goal: Produce the actual Deep Red model. Full data, full training, real evaluation. Expect this to take 6–8 weeks from start of data processing to a deployed model.

Prod Phase 1: Wikipedia Data Pipeline (Full) — ✅ COMPLETE

| Step | Task | Status | Notes |
| --- | --- | --- | --- |
| P1.1 | Download full English Wikipedia dump (enwiki-*-pages-articles.xml.bz2, ~25 GB) | ✅ Done | 24 GB dump in /mnt/data/wikipedia/dumps/ |
| P1.2 | Extract and import all articles into PostgreSQL using scripts/extract_wikipedia.py | ✅ Done | 7,041,771 articles; 50 GB database; see WikipediaMCP-Setup.md |
| P1.3 | Download and parse full YAGO temporal data using scripts/yago_parser.py | ✅ Done | Normalized output: yago-facts-normalized.csv.zst (107 MB); see TemporalAugmentation-Setup.md |
| P1.4 | Normalize full YAGO output using scripts/normalize_temporal_output.py | ✅ Done | English URL mapping + page ID lookup complete |
| P1.5 | (Optional) Download and parse Wikidata temporal data using scripts/wikidata_parser.py | ✅ Done | wikidata-temporal-normalized.csv.zst (150 MB); broader coverage achieved |
| P1.6 | Augment Wikipedia DB with temporal metadata using scripts/augment_wikipedia_temporal.py | ✅ Done | 2,804,937 articles classified: 1,553,510 pre-1969 (O), 785,380 post-1969 (N), 466,047 uncertain (S), 4,236,834 unclassified (U) |
| P1.7 | Generate text embeddings and index in OpenSearch using scripts/process_and_index.py | ✅ Done | 41 GB OpenSearch index; BM25 + k-NN vector search operational |
| P1.8 | Verify MCP server search works with full index | ✅ Done | MCP server (port 7000) + React web GUI (port 8080) operational; systemd services configured |

Prod Phase 2: Training Corpus Preparation (Full) — ✅ COMPLETE

| Step | Task | Status | Notes |
| --- | --- | --- | --- |
| P2.1 | Extract all pre-1969 Wikipedia articles from DB (SQL filter on temporal columns) | ✅ Done | 1,553,510 pre-1969 articles (5.74 GB raw text, ≥1.4B tokens); 1,683,075 with earliest_date < 1970 |
| P2.2 | Clean extracted text (strip wikitext markup, normalize formatting) | ✅ Done | Extraction pipeline handles markup stripping |
| P2.3 | Extract year topics using scripts/extract_year_topics.py (full year range) | ✅ Done | 1,844 year-topic files (years 151–2025) in /mnt/data/wikipedia/topics/; see Wikipedia-YearTopics-Setup.md |
| P2.4 | Retrieve Gutenberg full corpus using scripts/retrieve_gutenberg.py | ✅ Done | 766 books (497 MB JSONL) in /mnt/data/gutenberg/corpus/; see Gutenberg-Setup.md |
| P2.5 | Chunk Gutenberg texts using scripts/chunk_gutenberg.py | ✅ Done | Chunked, scored, and verified output in /mnt/data/gutenberg/theme_output/ |
| P2.6 | Filter Gutenberg chunks for thematic alignment using scripts/keyword_filter.py | ✅ Done | Filtered output in /mnt/data/gutenberg/theme_output/filtered/ |
| P2.6a | Retrieve chess corpus using scripts/retrieve_chess_content.py | ✅ Done | Phase 1: 683 PGN files (717 MB); Phase 2: 355,980 games → narrative JSONL (315 MB); Phase 3: 10 Internet Archive books (3.7 MB); see Chess-Setup.md |
| P2.7 | Select tokenizer from prod base model (TinyLlama-1.1B tokenizer) | ✅ Done | Llama 2 BPE tokenizer (vocab 32,000, EOS=2); downloaded to /mnt/data/training_corpus/tokenizers/TinyLlama-1.1B/; see TrainingCorpus-Setup.md |
| P2.8 | Tokenize full corpus into binary training format (shuffled, 2048-token sequences) | ✅ Done | 1.94B tokens across 5 sources (Wikipedia 1.64B + Gutenberg 147M + Chess games 153M + Year topics 2M + Chess books 1.8M); 49 shard files in /mnt/data/training_corpus/TinyLlama-1.1B/shards/ |
| P2.9 | Create train/validation split (99%/1%) | ✅ Done | 940,207 train seqs (1.93B tokens, 3.6 GB) + 9,497 val seqs (19.4M tokens, 37 MB); output: train.bin / val.bin in /mnt/data/training_corpus/TinyLlama-1.1B/ |

Prod Phase 3: Dev CPT (SmolLM2-360M)

| Step | Task | Duration (est.) | Notes |
| --- | --- | --- | --- |
| P3.1 | Run CPT on SmolLM2-360M with pre-1969 corpus using scripts/train_deepred_model.py | 7–12 days | Full-weight BF16, gradient checkpointing; dev mode defaults to 5% data subset for fast validation; see DeepRedModel-Setup.md |
| P3.2 | Evaluate: loss curves, perplexity, sample generations, temporal compliance | 1 day | Key checkpoint — verify CPT actually suppresses post-1969 knowledge |
| P3.3 | Quick SFT test + GGUF export of the 360M model | 1 day | Verify end-to-end: does a CPT'd + SFT'd small model behave as expected? |
| P3.4 | Decide: proceed to 1.1B CPT, or adjust data/hyperparameters | | If 360M shows good temporal separation, commit to the full TinyLlama run |
| | Prod Phase 3 total | ~2 weeks | |

Prod Phase 4: Production CPT (TinyLlama-1.1B)

| Step | Task | Duration (est.) | Notes |
| --- | --- | --- | --- |
| P4.1 | Download TinyLlama-1.1B base model from HuggingFace | 30 min | TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T; ~2.2 GB in FP16 |
| P4.2 | Run continued pre-training on full pre-1969 corpus (15–30B tokens, 5–10 epochs) | 3–5 weeks | Full-weight updates, BF16 mixed precision, gradient checkpointing; use hyperparameters validated in P3 |
| P4.3 | Monitor: validation loss, periodic sample generation, checkpoint every ~1B tokens | Ongoing | Watch for overfitting (val loss rising while train loss falls) |
| | Prod Phase 4 total | 3–5 weeks | |

Prod Phase 5: Theme Fine-Tuning and Deployment

| Step | Task | Duration (est.) | Notes |
| --- | --- | --- | --- |
| P5.1 | Generate full thematic ChatML dataset using scripts/generate_theme_dataset.py | 4–8 hours | Requires LM Studio running; uses all filtered Gutenberg chunks |
| P5.2 | SFT for Deep Red persona using LoRA (scripts/finetune_theme.py) | 1–2 days | LoRA rank 16–32, lr 2e-5, 2–3 epochs on ChatML data |
| P5.3 | Merge LoRA adapter using scripts/merge_lora.py | 30 min | |
| P5.4 | Convert to GGUF using scripts/convert_to_gguf.py | 30 min | Quantize to Q4_K_M or Q5_K_M |
| P5.5 | Deploy to LM Studio and evaluate | 1 day | Run scripts/evaluate_temporal.py + manual testing |
| | Prod Phase 5 total | 2–3 days | |

Prod Phase 6: Iteration

| Step | Task | Duration (est.) | Notes |
| --- | --- | --- | --- |
| P6.1 | Analyze failure modes (temporal leakage, persona drift, factual errors) | 1–2 days | Red-team testing, adversarial probing |
| P6.2 | Adjust data mix, hyperparameters, and retrain as needed | Ongoing | May loop back to Prod Phase 2 (data) or Phase 4 (training) |
| P6.3 | Consider distillation augmentation or extended CPT epochs if results need improvement | Ongoing | Teacher-filtered synthetic data (Approach alternative) or additional training passes |
| | Prod Phase 6 total | Ongoing | |